Database sharding is a method of horizontal partitioning in a database or search engine. Each individual partition is referred to as a shard or database shard.(1)
While working at Auction Watch, developers got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It’s a federated model. Groups of 500K users are stored together in what are called shards.
The advantages are:
• High availability. If one box goes down the others still operate.
• Faster queries. Smaller amounts of data in each user group mean faster querying.
• More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
• You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.
Sharding isn’t perfect. It does have a few problems.
• Rebalancing data. What happens when a shard outgrows your storage and needs to be split? You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.
• Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can’t just issue a query and get back all the data.
• How do you partition your data in shards? What data do you put in which shard?
• Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. With sharding you are on your own.
• Implementing shards is not well supported. Sharding is currently mostly a roll your own approach.
Could the answer be NO-SQL ?
NoSQL is a non-relational data stores that “provide for web-scale data storage and retrieval especially in web based applications because it views the data more closely to how web apps view data – a key/value hash in the sky.” NoSQL is meant for the current growing breed of web applications that need to scale effectively. Applications can horizontally scale on clusters of commodity hardware without being subject to intricate sharding techniques.(3)
This technology is widely used within Google, with platfroms such as BigTable (4)
One of the NO-SQL platforms I have been looking at is CouchBase, http://www.couchbase.com/
More on CouchBase in my next Blog post(1) http://en.wikipedia.org/wiki/Sharding (2) http://highscalability.com/unorthodox-approach-database-design-coming-shard> (3) http://www.eecs.berkeley.edu/~culler/cs262b/summary/bigtable.html