Database Sharding: What It Is & Its Advantages and Disadvantages

Herbert Huffner Database Management System January 22, 2023

Applications and websites should be designed so that they can scale over time and keep responding properly to user requests and incoming traffic.

Scalability is essential for data-driven applications and websites, and it must be achieved in a way that preserves data security and integrity.

Predicting how popular a website or application will become, and how long it will stay popular, can be challenging. This is why some organizations choose a database architecture that allows them to scale their databases dynamically.

In this article, we examine one such architecture: Database Sharding.

What is sharding?

Sharding is a database design pattern in which data is broken up across multiple tables and databases so that queries are easier to manage. Today, major database systems support this technique in various forms.
For example, in Oracle’s database management system, Oracle Sharding distributes parts of a dataset across databases (shards) hosted on different computers or in the cloud.

An important underlying concept is partitioning, which comes in two forms: horizontal and vertical.

In horizontal partitioning, the rows of a table are divided into groups known as partitions. The resulting tables have the same schema and columns, but the data stored in them is different and unique. In other words, no two partitions contain the same rows.

Sharding is a database architecture pattern that partitions data horizontally, that is, it separates the rows of a table into several different tables. Each partition has the same layout and columns but completely different rows, and the data stored in each partition is unique and independent of the data stored in the other partitions.

It is important to distinguish horizontal partitioning from vertical partitioning. In a vertically partitioned table, whole columns are separated from each other and placed in new, separate tables.
The data held in one vertical partition is independent of the data in the others, and each partition has its own rows and columns.

Differences between horizontal and vertical partitioning

In vertical partitioning, each resulting table holds its own distinct columns. The data stored in these tables is also unique; in other words, no two tables (partitions) contain the same structure and data.

So what are the differences between horizontal and vertical partitioning? In horizontal partitioning, only the data in the tables differs, while the columns stay the same. In vertical partitioning, both the columns and the data differ from table to table.

In Figure 1, we have the Original Table; when we partition it horizontally, we get two tables named HP1 and HP2, which have the same schema or database structure but contain different data.

However, when we partition this table vertically, we see two tables named VP1 and VP2, whose schemas are different. In addition, each is responsible for storing specific data.
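
The difference can be sketched in a few lines of Python. This is only an illustration under assumed names: the sample rows, the `customer_id` split point, and the helper functions are hypothetical and do not come from Figure 1.

```python
# Hypothetical sample table: each dict is one row.
original_table = [
    {"customer_id": 1, "first_name": "Alice", "last_name": "Anderson", "city": "Austin"},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Best", "city": "Boston"},
    {"customer_id": 3, "first_name": "Carla", "last_name": "Cruz", "city": "Chicago"},
    {"customer_id": 4, "first_name": "Dan", "last_name": "Davis", "city": "Denver"},
]


def partition_horizontally(rows, split_at):
    """Split the rows into two tables with the same columns but different rows."""
    hp1 = [row for row in rows if row["customer_id"] <= split_at]
    hp2 = [row for row in rows if row["customer_id"] > split_at]
    return hp1, hp2


def partition_vertically(rows, key, first_columns):
    """Split the columns into two tables; both keep the key so rows can be rejoined."""
    vp1 = [{c: row[c] for c in row if c == key or c in first_columns} for row in rows]
    vp2 = [{c: row[c] for c in row if c == key or c not in first_columns} for row in rows]
    return vp1, vp2


hp1, hp2 = partition_horizontally(original_table, split_at=2)
vp1, vp2 = partition_vertically(original_table, "customer_id", {"first_name", "last_name"})
```

Here `hp1` and `hp2` share one schema and hold different rows, while `vp1` and `vp2` hold every row but with different columns, linked only by `customer_id`.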

In summary, sharding breaks data into smaller units, each stored in its own table. Placing all of these tables and their data side by side gives us the complete database.

Figure 1: The Original Table partitioned horizontally (HP1 and HP2) and vertically (VP1 and VP2)

What advantages does sharding architecture offer?

Vertical scaling allows IT teams to upgrade a server's processing resources, such as its CPU or main memory, so that it can keep up with user queries and the performance of databases and websites does not degrade.
Sharding, by contrast, gives developers the freedom to design the database around horizontal scaling. The load can then be spread across several servers, which increases the capacity for processing data and queries and makes better use of the individual databases.

With vertical scaling, the hardware can be upgraded so that the server keeps responding as the business grows and the number of users increases. Remember, however, that hardware upgrades are only possible up to a point, and they cost the organization a lot in the long run.

When we send a query to a database and the table holds millions of records, the response will not be fast, even if indexing is done correctly. With sharding, the table is broken into several separate tables so that each query searches among fewer records.

In this case, the data access speed increases significantly.

Sharding also improves availability. Consider a web-based application built on a standard (unsharded) database. Its most frequently used data is cached, so when the database becomes unavailable or a server problem occurs, the application can keep working from the cache for a limited period.

Beyond that, a problem with the database server disrupts the whole application, and a wide range of websites face exactly this problem. If the same failure occurs in a web-based application built on a Sharded Database, only one shard (one partition of the data) is disrupted, and the application as a whole remains stable. Only the parts of the application that depend on that shard are temporarily unable to serve users.
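
A toy model can make this failure isolation concrete. The sketch below is an assumption-laden illustration, not a real database client: the `ShardedStore` class, the `ShardUnavailable` exception, and the modulo routing on integer user IDs are all hypothetical.

```python
class ShardUnavailable(Exception):
    """Raised when the shard that owns a key is down."""


class ShardedStore:
    """In-memory stand-in for a sharded database (illustration only)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]
        self.down = set()  # indexes of shards that are currently unavailable

    def _shard_index(self, user_id):
        # Assumed routing rule: integer user IDs modulo the shard count.
        return user_id % len(self.shards)

    def put(self, user_id, record):
        index = self._shard_index(user_id)
        if index in self.down:
            raise ShardUnavailable(f"shard {index} is down")
        self.shards[index][user_id] = record

    def get(self, user_id):
        index = self._shard_index(user_id)
        if index in self.down:
            raise ShardUnavailable(f"shard {index} is down")
        return self.shards[index][user_id]


store = ShardedStore(num_shards=3)
for user_id in range(6):
    store.put(user_id, {"name": f"user{user_id}"})

store.down.add(1)  # simulate a failure of shard 1 only
```

After the simulated failure, `store.get(3)` still succeeds because user 3 lives on shard 0, while `store.get(4)` raises `ShardUnavailable` because user 4 lives on the failed shard.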

What are the disadvantages of sharding architecture?

A sharded architecture offers better stability and performance for web-based applications, but it also has disadvantages. One of the most significant is the added complexity of the database.
In other words, if the sharding process is not performed correctly, data can end up in the wrong place, write operations on the tables can fail, or the integrity of the tables can be lost.

Another problem with sharding is imbalance between partitions. To better understand this issue, suppose we have two tables: one stores the users whose names begin with the letters a to s, and the other stores the users whose names begin with t to z.
In practice, there may be more users whose names start with the letter B than with other letters. In that case, the table holding the letters a to s becomes too large, and retrieving data from such a large volume of records becomes slower.
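
A quick count shows how such a skew builds up. The names below are made up for illustration, and the routing rule (first letter up to "s" goes to the first shard, everything else to the second) is an assumed simplification.

```python
from collections import Counter

# Hypothetical user names; real name distributions are rarely uniform.
names = ["adam", "bella", "ben", "bob", "brian", "carl", "dina", "sam", "tom", "zoe"]


def shard_for_name(name):
    """Route names starting with a to s to shard 0 and all other names to shard 1."""
    return 0 if name[0].lower() <= "s" else 1


# Count how many users land on each shard.
load = Counter(shard_for_name(name) for name in names)
```

With this sample, shard 0 receives eight of the ten users and shard 1 only two, so most queries hit the same oversized table.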

Another disadvantage is that backups taken before partitioning are no longer useful once the database has been partitioned. From then on, backups must also be taken per partition, which is costly and error-prone.

How does Database Sharding architecture work?

Suppose business requirements call for partitioning our database; where do we start? We first need to get acquainted with the different methods of spreading data across partitions.
As a general rule, when data is spread over several databases or tables, you must know exactly which partition or table each query should be sent to; otherwise, you will get wrong results and, in some cases, lose data.

Key-Based Sharding: In this approach, when new data is stored in the database, a key associated with that data is taken into account. For example, if a new customer registers in an online store, customer_id is the key. This key is passed to a function, known as a hash function, that determines from its input which partition the data should be stored in. To ensure that data always lands in the correct partition, the values fed to the hash function should be unique, which is why a table's Primary Key is a natural candidate for the Shard Key.
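
As a minimal sketch of this routing step, the function below hashes a `customer_id` and maps it to a shard index. The shard count of 4 and the choice of SHA-256 are assumptions made for illustration. Python's built-in `hash()` is avoided because its result for strings can change between processes.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of shards


def shard_for_key(customer_id, num_shards=NUM_SHARDS):
    """Hash the shard key and map the result to a shard index."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


shard_index = shard_for_key(1001)
```

The same key always yields the same shard index, so both writes and later reads for a customer are sent to the same partition.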

One of the challenges of Key-Based Sharding is that problems can arise when we add servers to the cluster or remove existing ones.
When a new server joins the cluster, it needs its own hash value so that new data can be routed to it. If existing data must then be transferred to the new server, the hash values associated with that data have to be updated as well, which is costly.
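
The cost of changing the server count can be illustrated with a small experiment. The sketch assumes the simplest possible scheme, hashing the key and taking the remainder modulo the number of shards; the article does not name a specific scheme, so treat this as one hypothetical instance.

```python
import hashlib


def shard_for_key(key, num_shards):
    """Assumed scheme: SHA-256 of the key, reduced modulo the shard count."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for key in keys
        if shard_for_key(key, old_count) != shard_for_key(key, new_count)
    )
    return moved / len(keys)


# Going from 4 to 5 shards with 10,000 hypothetical keys.
fraction = moved_fraction(range(10_000), old_count=4, new_count=5)
```

Under this scheme a key stays put only if its hash gives the same remainder for both shard counts, so going from 4 to 5 shards relocates roughly four out of five keys. That is why adding a single server can trigger a large data migration.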

Range-Based Sharding: Range-based partitioning involves sharding data based on ranges of a given value. To better understand this pattern, suppose you have an online store database that holds the data for all products. This database can be partitioned into different databases or tables based on product price. One advantage of this approach is its simplicity: every partition has the same schema, and deciding where new data belongs is a small application-level step of looking at its price range and selecting the matching partition.
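
A sketch of that application-level step, using assumed price boundaries of 50 and 200 (the article gives no concrete ranges):

```python
import bisect

# Assumed boundaries: shard 0 holds prices below 50, shard 1 holds prices
# from 50 up to (but not including) 200, and shard 2 holds 200 and above.
PRICE_BOUNDARIES = [50, 200]


def shard_for_price(price):
    """Return the index of the shard whose price range contains `price`."""
    return bisect.bisect_right(PRICE_BOUNDARIES, price)
```

For example, `shard_for_price(19.99)` selects shard 0, `shard_for_price(50)` selects shard 1, and `shard_for_price(1200)` selects shard 2.
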
Directory-Based Sharding: As Figure 2 shows, this approach needs a component called the Lookup Table. It records, for each Shard Key value, which partition holds the corresponding data. In other words, this table works like the index at the back of a book, which tells you in which section or on which page each term appears.

As shown in Figure 2, the Delivery Zone column is used as the Shard Key. Each of its values is recorded in the Lookup Table together with its partition, so it is clear which partition each key is associated with.
Directory-Based Sharding is a better fit than Range-Based Sharding when the Shard Key does not fall naturally into ranges, so that a range cannot decide which partition should store the data.

One of the advantages of this approach is the great flexibility it offers. Developers can use their own algorithm to distribute data between partitions, and it is easy to add new partitions dynamically.

The main disadvantage is that the Lookup Table is the starting point of every query. If it fails for any reason, the entire application, or at least many parts of it, will fail with it.
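
The lookup step can be sketched with a plain dictionary standing in for the Lookup Table. The zone names and shard names below are hypothetical and are not taken from Figure 2.

```python
# Hypothetical Lookup Table: delivery zone (the shard key) -> shard name.
LOOKUP_TABLE = {
    "north": "shard_a",
    "south": "shard_b",
    "east": "shard_a",
    "west": "shard_c",
}


def shard_for_zone(delivery_zone, lookup=LOOKUP_TABLE):
    """Return the shard registered for a delivery zone."""
    try:
        return lookup[delivery_zone]
    except KeyError:
        raise LookupError(
            f"no shard registered for delivery zone {delivery_zone!r}"
        ) from None


def add_zone(delivery_zone, shard_name, lookup=LOOKUP_TABLE):
    """Register a new zone, or move an existing one, by editing the table."""
    lookup[delivery_zone] = shard_name


add_zone("central", "shard_d")  # a new partition is added dynamically
```

Any mapping is possible because the table, not a formula, decides where data goes. The same property explains the weakness: every call to `shard_for_zone` depends on the table being available.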

Figure 2: Directory-based sharding, with a Lookup Table keyed on the Delivery Zone column

Final word

Some programmers and database experts believe that the ever-increasing amount of information has made databases so bulky that migrating to a sharding architecture is becoming more and more necessary.
The volume of data that businesses generate daily can be too large for a single database to manage, or the read/write load on the database can be so heavy that the resources of one server cannot keep up with it. In such cases, software teams are forced to use sharding.

It should be noted that partitions can be added to and removed from the database, and that the data can be repartitioned without damage or loss when the process is carried out correctly. Today, database systems such as Oracle or Microsoft SQL Server distribute parts of a data set across databases (shards) located on different computers or in the cloud, so that access to the database is not entirely disrupted when one partition has a problem.

Now we come to the critical question: should we use sharding?

Whether or not to build a database on a partitioned architecture is a hot topic in desktop and web application development. Because of the operational complexity that sharding adds to the design, some teams use it only when they are confident that the business's database will grow and that scaling is inevitable.

For this reason, they recommend sharding only when partitioning the database is truly necessary.

In general, sharding should be used if you face the following scenarios in the software projects you plan to develop:

The amount of application data exceeds the storage capacity of a single database node.
The volume of read/write operations exceeds what a single database node can handle, resulting in long response times or slow transaction commits. In this case, sharding should be used to keep the delay in accessing information as low as possible.
The network bandwidth required by the application exceeds the bandwidth available to a single database node, resulting in long response times or outages.

FAQ

What is database sharding?

It’s a way of dividing a database into smaller, more manageable parts called shards, where each shard holds a subset of the data.

When should sharding be used?

When one database server cannot handle the volume of data, read/write load, or storage demand, and you need horizontal scaling.

What are common drawbacks of sharding?

Increased complexity in maintenance, difficulty ensuring transactions across shards, possible uneven data distribution, and higher infrastructure costs.
