Database Sharding: System Design Interview Concept

Introduction

Every curious mind delves into the realms of imagination and ideas. You must have had a billion-dollar idea on a lovely day as well. Assume you have now created an excellent website with your innovative tech skills.

Fortunately, your efforts appear to be appreciated, as the number of visitors to your website has soared exponentially. It has a higher number of active users, more features, and generates more data daily. Your database is now acting as a stumbling block for the rest of your application.

Any application or website that experiences considerable growth will need to scale at some point to accommodate the increased traffic. Because it’s difficult to predict how popular your website or application will become or how long it’ll stay successful, it makes sense to choose a database architecture that can grow dynamically. This is where sharding comes in.

What is Database Sharding?

Sharding is a technique for distributing a single dataset among many databases, allowing it to be stored across multiple machines.

Larger datasets are split into smaller chunks called logical shards, which are then distributed across separate database nodes called physical shards. Each physical shard can hold multiple logical shards. This increases the total storage capacity of the system.
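
The mapping from records to logical shards and from logical shards to physical shards can be sketched as follows. This is only an illustration under assumed values: the number of logical shards, the node names (db-node-a, db-node-b), and the placement table are all hypothetical.

```python
import hashlib

NUM_LOGICAL_SHARDS = 8  # assumed fixed number of logical shards

# Hypothetical placement: each physical shard (node) holds several logical shards.
PHYSICAL_SHARDS = {
    "db-node-a": [0, 1, 2, 3],
    "db-node-b": [4, 5, 6, 7],
}


def logical_shard_for(key: str) -> int:
    """Map a record key to a logical shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def physical_shard_for(logical_shard: int) -> str:
    """Look up which physical shard currently holds a logical shard."""
    for node, logical_shards in PHYSICAL_SHARDS.items():
        if logical_shard in logical_shards:
            return node
    raise KeyError(f"logical shard {logical_shard} is not placed on any node")


logical = logical_shard_for("invoice-1001")
print(logical, physical_shard_for(logical))
```

Keeping the two levels separate means a logical shard can later be moved to another node by editing the placement table, without rehashing every key.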

Because more nodes are added to share the load, sharding is a form of horizontal scaling, also known as scale-out. To spread the data over multiple nodes in horizontal scaling, we use partitioning.

A shard is a horizontal data partition that holds a portion of the complete data set and is therefore responsible for serving a portion of the overall demand. Sharding is thus a form of horizontal partitioning. In a relational database, horizontal partitioning is achieved by storing rows from the same table on several database nodes.

For example, in the image below, you can see that the Invoice table has been partitioned horizontally into two shards: Database Shard 1, which contains the first three rows, and Database Shard 2, which contains the next three rows.

A sharded database, thus, can accommodate more requests than a single system by dividing the data over numerous machines.

Database shards exemplify a shared-nothing architecture. This means that the shards are self-contained and do not share data or processing resources. However, replicating specific tables into each shard to serve as reference tables may make sense in some circumstances.

Sharding is frequently implemented at the application level, which means the application includes code that decides which shard each read and write is sent to. Systems such as SQLite, Memcached, ZooKeeper, MySQL, and PostgreSQL do not support sharding natively at the database layer.

As a result, for databases without built-in sharding functionality, we must implement the sharding logic in the application. Middleware can help keep this complexity out of the application code.

Apache ShardingSphere and Vitess are two open-source middleware options. ShardingSphere supports MySQL, PostgreSQL, SQL Server, and Oracle. Vitess supports MySQL and MariaDB.
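
As a rough sketch of what application-level sharding logic can look like, the class below picks a shard for every read and write. The ShardRouter class and the shard names are hypothetical, and plain dictionaries stand in for real database connections.

```python
import zlib


class ShardRouter:
    """Sends each read and write to the shard that owns the key."""

    def __init__(self, shard_names):
        self.names = list(shard_names)
        # In-memory dicts stand in for real database connections.
        self.shards = {name: {} for name in self.names}

    def shard_for(self, key: str) -> str:
        # crc32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(key.encode("utf-8")) % len(self.names)
        return self.names[index]

    def write(self, key: str, value) -> None:
        self.shards[self.shard_for(key)][key] = value

    def read(self, key: str):
        return self.shards[self.shard_for(key)].get(key)


router = ShardRouter(["shard-0", "shard-1", "shard-2"])
router.write("user:42", {"name": "Asha"})
print(router.shard_for("user:42"), router.read("user:42"))
```

Every part of the application that touches the database has to go through a router like this, which is the complexity that middleware aims to hide.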

On the other hand, some database management systems include built-in sharding features, allowing you to implement sharding at the database level. Examples of such databases are MongoDB, Cassandra, HDFS, Redis, YugabyteDB, and HBase.

Comparison of Database Sharding with Partitioning and Replication

Database Sharding vs Partitioning

Partitioning is a term that refers to the process of splitting data elements into multiple entities for performance, availability, or maintainability. There are two types of partitioning:

Vertical Partitioning
Horizontal Partitioning

Database Sharding vs. Partitioning (Source: DigitalOcean)

In vertical partitioning, entire columns from a table are separated and placed in new, different tables. As shown in the above image, the entire columns of the original table have been partitioned into two tables based on the CUSTOMER ID column.

As a result, the columns CUSTOMER ID, FIRST NAME, and LAST NAME are added to the table VP1, while the columns CUSTOMER ID and FAVORITE COLOR are added to the table VP2.

Similarly, in horizontal partitioning, we divide a table’s rows into multiple distinct tables, known as partitions. Each partition has the same columns and schema as the others, but each has a different set of rows. As we can see in the above image, the first two rows are placed in table HP1, and the next two rows are placed in table HP2.
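
The two partitioning styles can be sketched on a small in-memory table. The column names follow the description above; the four rows themselves are made up for illustration.

```python
# Hypothetical sample rows; only the column names come from the text above.
customers = [
    {"customer_id": 1, "first_name": "Alice", "last_name": "Anderson", "favorite_color": "Blue"},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Best", "favorite_color": "Yellow"},
    {"customer_id": 3, "first_name": "Carrie", "last_name": "Conway", "favorite_color": "Green"},
    {"customer_id": 4, "first_name": "David", "last_name": "Doe", "favorite_color": "Red"},
]


def vertical_partition(rows, key, column_groups):
    """Split columns into separate tables; every table keeps the key column."""
    return [
        [{col: row[col] for col in [key, *cols]} for row in rows]
        for cols in column_groups
    ]


def horizontal_partition(rows, rows_per_partition):
    """Split rows into separate tables that all share the same columns."""
    return [
        rows[i:i + rows_per_partition]
        for i in range(0, len(rows), rows_per_partition)
    ]


vp1, vp2 = vertical_partition(
    customers, "customer_id", [["first_name", "last_name"], ["favorite_color"]]
)
hp1, hp2 = horizontal_partition(customers, 2)

print(list(vp1[0]), list(vp2[0]))  # column names of each vertical partition
print(len(hp1), len(hp2))          # row counts of each horizontal partition
```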

Sharding is a special type of partitioning that falls under the category of horizontal partitioning. The distinction between sharding and partitioning is that sharding implies that data is distributed over multiple systems, whereas partitioning does not. Partitioning on its own may simply mean grouping data into subsets within a single database instance.

Database Sharding vs Replication

Replication is a database configuration in which multiple copies of the same dataset are hosted on different machines. The primary reason for replication is redundancy. If a single database host machine fails, recovery is quick because another machine hosting a replica of the same database can take over.

Sharding, on the other hand, as already mentioned, is a type of horizontal partitioning. The data set is divided into shards and stored in different nodes.

In the case of replication, a query that runs on a replica uses only that replica’s CPU and RAM. In sharding, a query that touches a single shard uses only that shard’s resources, whereas a query that spans several shards must gather data from each of them, combine the partial results in one place, and then return the output.

Database Sharding vs. Replication (Source: O’Reilly)

As shown in the above image, in a replicated service any of the three replicas can serve requests A, B, and C, whereas in a sharded service each request is routed to the shard that holds its data.
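
The routing difference can be sketched as below. The replica and shard names are hypothetical, and round-robin is just one possible policy for choosing a replica.

```python
import itertools
import zlib

REPLICAS = ["replica-1", "replica-2", "replica-3"]
SHARDS = ["shard-1", "shard-2", "shard-3"]

_next_replica = itertools.cycle(REPLICAS)


def route_replicated(request_key: str) -> str:
    """Every replica holds all the data, so any of them can serve the request."""
    return next(_next_replica)


def route_sharded(request_key: str) -> str:
    """Only one shard holds the data, so the key decides the destination."""
    return SHARDS[zlib.crc32(request_key.encode("utf-8")) % len(SHARDS)]


for request in ["A", "B", "C"]:
    print(request, route_replicated(request), route_sharded(request))
```

Repeating request A against the replicated service may land on a different replica each time, while the sharded service always sends it to the same shard.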

Techniques of Database Sharding

There are numerous strategies for distributing data across multiple databases. It is critical to choose the correct strategy because it ensures that the data is routed to the correct shard when queries are run or incoming data is distributed to the sharded tables or databases. Otherwise, it may result in lost data or excruciatingly slow queries.

A few common sharding architectures are:

Key-based sharding
Range-based sharding
Directory-based sharding
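
As these terms are commonly used, the three architectures differ only in how a record is mapped to a shard. The sketch below shows one minimal routing function for each; the shard names, range boundaries, and directory entries are assumptions made for illustration.

```python
import bisect
import zlib

SHARDS = ["shard-0", "shard-1", "shard-2"]


def key_based_shard(shard_key: str) -> str:
    """Key-based: hash the shard key and take it modulo the shard count."""
    return SHARDS[zlib.crc32(shard_key.encode("utf-8")) % len(SHARDS)]


# Range-based: assumed exclusive upper bounds for the first two shards;
# every value at or above the last bound goes to the final shard.
RANGE_UPPER_BOUNDS = [1000, 2000]


def range_based_shard(customer_id: int) -> str:
    """Range-based: pick the shard whose value range contains the key."""
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, customer_id)]


# Directory-based: an explicit lookup table maps each key to its shard.
DIRECTORY = {"EU": "shard-0", "US": "shard-1", "APAC": "shard-2"}


def directory_based_shard(region: str) -> str:
    """Directory-based: consult the lookup table for the key."""
    return DIRECTORY[region]


print(key_based_shard("customer-17"))
print(range_based_shard(1500))
print(directory_based_shard("APAC"))
```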

Advantages of Database Sharding

Increased Storage Capacity: We can increase our total storage capacity by adding more shards, so storage can, in principle, keep growing as the data grows.

High Reliability & Availability: With a sharded database, an outage in one database shard affects only a portion of the application or website for some users, while other shards continue to function normally. If the database is not sharded, an outage could make the entire application unavailable.

Faster Query Response: Query response times are sped up with a sharded database architecture. When you submit a query to a non-sharded database, it may have to search every row in the table you’re querying to find the desired result set. Queries can become prohibitively slow in an application with an extensive database. However, by sharding a single table into multiple smaller ones, queries have to traverse fewer rows, and their result sets are returned considerably faster.

Additional Write Bandwidth: Writing is a significant bottleneck in many applications. Because no single master database has to handle every write, a sharded architecture lets you write to different shards in parallel and increase your write throughput.

Scaling Out: Sharding a database assists in horizontal scaling, also known as scaling out. Horizontal scaling is the technique of adding more machines to an existing stack to distribute the load and enable more traffic and faster processing.
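
To make the Faster Query Response point concrete, here is a small sketch that counts how many rows a worst-case lookup examines with and without sharding. It assumes an unindexed linear scan, a made-up table of 10,000 customers, and four shards chosen by customer_id modulo 4; real databases use indexes, so the gap in practice depends on the query.

```python
def scan(rows, customer_id):
    """Linear scan; returns (matching row or None, number of rows examined)."""
    for examined, row in enumerate(rows, start=1):
        if row["customer_id"] == customer_id:
            return row, examined
    return None, len(rows)


NUM_SHARDS = 4
table = [{"customer_id": i, "name": f"customer-{i}"} for i in range(10_000)]

# Distribute the rows over the shards by customer_id modulo the shard count.
shards = [[] for _ in range(NUM_SHARDS)]
for row in table:
    shards[row["customer_id"] % NUM_SHARDS].append(row)

target = 9_999
_, unsharded_examined = scan(table, target)
_, sharded_examined = scan(shards[target % NUM_SHARDS], target)

print(f"unsharded: {unsharded_examined} rows, sharded: {sharded_examined} rows")
```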

Disadvantages of Database Sharding

No Native Support: As mentioned above, not every database engine supports sharding natively; SQLite, MySQL, and PostgreSQL, for example, do not. As a result, sharding frequently necessitates a “roll your own” approach. This means that documentation for sharding and troubleshooting tips are often hard to come by.

Rebalancing Data: In a sharded database architecture, one shard may outgrow the others and become unbalanced, a condition known as a database hotspot. In this case, much of the benefit of sharding the database is lost. To achieve a more even data distribution, the database would most likely need to be re-sharded. Rebalancing must be planned for from the beginning; otherwise, moving data from one shard to another can require a significant amount of downtime during re-sharding.

Complexity of Administration: Implementing a sharded database architecture correctly is a difficult task. There is a significant risk that the sharding process will result in lost data or corrupted tables if not done correctly. Sharding can also have a significant impact on a team’s workflows. Instead of managing and accessing data from a single point, users must manage data across multiple shard locations, which may be disruptive to some teams.

Difficulty in Returning to the Unsharded Architecture: Backups of the database made before it was sharded will not include data written after partitioning. As a result, rebuilding the original unsharded architecture would necessitate either merging the new partitioned data with the old backups or transforming the partitioned DB back into a single DB, both of which would be costly and time-consuming endeavors.
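
The cost of re-sharding mentioned under Rebalancing Data can be estimated with a short sketch. It assumes the simplest key-based scheme (hash modulo the shard count) and made-up keys; other schemes, such as consistent hashing, are designed to move far fewer keys.

```python
import zlib


def shard_index(key: str, shard_count: int) -> int:
    """Simple key-based placement: stable hash modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_index(key, old_count) != shard_index(key, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, most keys end up on a different shard after adding just one node, which is why rebalancing needs to be planned for up front.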

Why and When Should Database Sharding Be Used?

Database sharding, like any other distributed architecture, isn’t free. Setting up shards, keeping data on each shard, and effectively routing requests across those shards all have overhead and complexity.

As a result, before you start sharding, see if one of the alternatives below will work for you.

Implementing Caching:

If your data traffic is predominantly read-oriented, caching improves read performance and availability. Caching is the process of temporarily keeping data that has already been requested in memory so that you can access it much faster later.
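
A minimal sketch of this idea is shown below. The SimpleCache class and the load_from_db callback are hypothetical; a production cache would also need eviction and invalidation when the underlying data changes.

```python
class SimpleCache:
    """Keeps already-requested values in memory to avoid repeated database reads."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical slow database lookup
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: read from the database and remember the result.
        self.misses += 1
        value = self._load_from_db(key)
        self._store[key] = value
        return value


def load_from_db(key):
    """Stand-in for a real database query."""
    return {"id": key, "title": f"Article {key}"}


cache = SimpleCache(load_from_db)
cache.get(7)  # first request goes to the database
cache.get(7)  # second request is served from memory
print(cache.hits, cache.misses)
```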

Replication:

Another approach that can help improve read performance is replication. This entails replicating data from a primary database server to one or more secondary database servers. All new writes are directed to the primary server before being copied to the secondary servers, whereas reads are directed solely to the secondary servers. Distributing reads and writes in this way helps prevent slowdowns and crashes by keeping any one machine from taking on too much load.

It’s worth noting that creating read replicas requires more computing resources and hence costs more money, which may be a substantial stumbling block for some.
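
The read and write split described above can be sketched as follows. The ReplicatedDatabase class is hypothetical, dictionaries stand in for servers, and the copy to the secondaries happens synchronously here for simplicity, whereas real systems often replicate asynchronously.

```python
import itertools


class ReplicatedDatabase:
    """Writes go to the primary; reads are spread over the secondaries."""

    def __init__(self, secondary_count: int):
        self.primary = {}
        self.secondaries = [{} for _ in range(secondary_count)]
        self._next_secondary = itertools.cycle(range(secondary_count))

    def write(self, key, value) -> None:
        self.primary[key] = value
        # Simplification: copy to every secondary immediately.
        for secondary in self.secondaries:
            secondary[key] = value

    def read(self, key):
        # Round-robin over the secondaries so no single machine takes all reads.
        secondary = self.secondaries[next(self._next_secondary)]
        return secondary.get(key)


db = ReplicatedDatabase(secondary_count=2)
db.write("order:1", "shipped")
print(db.read("order:1"), db.read("order:1"))
```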

Vertical Scaling:

You can scale vertically, without the complexities of sharding, by simply upgrading your machine’s resources. Adding RAM, upgrading the CPU, or expanding the storage available to your database are all uncomplicated fixes that don’t require changing your database architecture or application design. Like read replicas, however, an upgraded server with greater resources will almost certainly cost more money. As a result, you should only resize if it is genuinely your best alternative.

When should one shard, then? Due to the added complexity, sharding is typically used only when dealing with enormous amounts of data. Some common instances in which sharding a database may be helpful include:

The volume of application data grows to the point where a single database node’s storage capacity is exceeded.
The application’s network bandwidth demand exceeds the bandwidth available to a single database node and any read replicas, resulting in poor response times or timeouts.
The number of writes or reads to the database exceeds the capacity of a single node or its read replicas, resulting in slow response times or timeouts.

Frequently Asked Questions

What is database sharding, and why is it used?

Database sharding is the process of partitioning data in a database so that it can be stored across multiple machines by dividing the data into smaller distinct chunks or shards.

Sharding is used when a dataset is too large to fit into a single database. It helps to scale the database dynamically.

How does database sharding work?

In database sharding, larger datasets are split into two or more smaller chunks, called logical shards. These logical shards are then distributed across separate database nodes, referred to as physical shards, according to one of several sharding architectures.

For example, in Key-Based Sharding, you must first choose a key, known as the shard key, to partition your data. In MongoDB, for instance, the shard key is an indexed field or indexed compound field that is present in every document in the collection.

When a sharded database application receives a request, it knows which shard to route the request to, so it has to search through less data rather than the entire database.

What databases support sharding?

Sharding is supported natively by systems such as MongoDB, Cassandra, HDFS, Redis, YugabyteDB, and HBase.

What is the difference between partitioning and sharding?

Partitioning is a term that refers to the process of splitting data elements into multiple entities for the purpose of performance, availability, or maintainability. On the other hand, Sharding is a special type of partitioning that falls under the category of horizontal partitioning. In relational database sharding, a table’s rows are divided into multiple distinct tables, known as partitions. Each partition has the same columns and schema as the others but unique rows.

Also, Sharding is the distribution or partitioning of data across multiple machines, whereas partitioning is the distribution of data on a single machine.

What are the benefits of sharding?

Sharding offers a wide range of benefits such as:
1. Increased Storage Capacity
2. High Reliability & Availability
3. Faster Query Response
4. Additional Write Bandwidth
5. Scaling Out

What is the purpose of sharding?

The purpose of sharding is to partition the larger dataset into smaller chunks and distribute it across multiple machines. It aids in increasing storage capacity and decreasing query response time, allowing an application to efficiently manage a large volume of traffic.

Key Takeaways

This blog took a deep dive into database sharding: its core idea, its advantages and disadvantages, and the situations in which one should consider sharding a database.

We hope you found this blog useful. Feel free to let us know your thoughts in the comments section.