What Is Database Sharding, What Are Its Advantages And Disadvantages?

Applications and websites should be designed to scale over time so that they can keep responding properly to user requests and incoming traffic.

Scalability is essential for data-driven applications and websites, and it must be achieved in a way that preserves data security and integrity.

Predicting how popular a website or application will become, and how long that popularity will last, is difficult. This is why some organizations choose a database architecture that allows them to scale their databases dynamically.

In this article, we examine one of the essential database architectures: database sharding.

What is sharding?

Sharding is a database design pattern in which data is split across multiple tables or databases so that queries are easier to manage. Today, most major database systems support this technique in some form. For example, Oracle Sharding distributes parts of a dataset across databases (shards) hosted on different computers or in the cloud.

An important concept here is partitioning, which comes in two forms: horizontal and vertical.

In horizontal partitioning, the rows of a table are divided into groups, known as partitions, and each group is stored in its own table. These tables share the same schema and columns, but the data stored in each one is different and unique; in other words, no two tables contain the same data.

Sharding is a database architecture pattern that applies partitioning horizontally: the rows of one table are separated into several different tables. Each partition has the same schema and columns but entirely different rows, so the data stored in each partition is unique and independent of the data stored in the others.

It is important to distinguish horizontal partitioning from vertical partitioning. In a vertically partitioned table, the columns are separated from each other and placed in new, separate tables. The data stored in one vertical partition is independent of the data in the others, and each partition has its own distinct columns. Figure 1 shows how a table can be partitioned horizontally and vertically.

Differences between horizontal and vertical partitioning

In vertical partitioning, each table is given its own distinct set of columns. The data stored in these tables is also unique; in other words, no two tables (partitions) share the same structure and data.

So what are the differences between horizontal and vertical partitioning? In horizontal partitioning, only the data in the tables differs. In vertical partitioning, the tables' columns differ as well as the data.

In Figure 1, we start with the Original Table. When we partition it horizontally, we get two tables named HP1 and HP2, both of which have the same schema (structure) but contain different data.

However, when we partition this table vertically, we get two tables named VP1 and VP2, whose schemas are different. In addition, each is responsible for storing specific data.

With these explanations in mind, sharding means breaking data into smaller units, each stored in its own table. Putting all of these tables and their data side by side gives back the complete database.
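The difference between the two kinds of partitioning can be sketched in a few lines of Python. This is only an illustration: the table contents, the column names, and the helper functions `partition_horizontally` and `partition_vertically` are made up for this example, and keeping the key column in both vertical partitions is an assumption made here so that the rows can be matched up again.

```python
# Hypothetical "original table": a list of rows, each row a dict of columns.
original_table = [
    {"customer_id": 1, "first_name": "Alice", "last_name": "Anderson", "city": "Austin"},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Best", "city": "Boston"},
    {"customer_id": 3, "first_name": "Carol", "last_name": "Cruz", "city": "Chicago"},
    {"customer_id": 4, "first_name": "David", "last_name": "Daniels", "city": "Denver"},
]


def partition_horizontally(rows, split_at):
    """Split the rows into two tables with the same columns but different rows."""
    hp1 = [row for row in rows if row["customer_id"] <= split_at]
    hp2 = [row for row in rows if row["customer_id"] > split_at]
    return hp1, hp2


def partition_vertically(rows, key, first_columns):
    """Split the columns into two tables; both keep the key column (an assumption)."""
    vp1 = [{c: row[c] for c in [key, *first_columns]} for row in rows]
    vp2 = [
        {c: v for c, v in row.items() if c == key or c not in first_columns}
        for row in rows
    ]
    return vp1, vp2


hp1, hp2 = partition_horizontally(original_table, split_at=2)
vp1, vp2 = partition_vertically(original_table, "customer_id", ["first_name", "last_name"])

print("HP1 columns:", sorted(hp1[0]), "rows:", len(hp1))
print("HP2 columns:", sorted(hp2[0]), "rows:", len(hp2))
print("VP1 columns:", sorted(vp1[0]), "rows:", len(vp1))
print("VP2 columns:", sorted(vp2[0]), "rows:", len(vp2))
```

HP1 and HP2 end up with identical columns and disjoint rows, while VP1 and VP2 cover every row but hold different columns.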

Figure 1

What advantages does sharding architecture offer?

Vertical scaling lets IT teams upgrade processing resources, such as a server's CPU or main memory, so that the server can keep up with user queries and the performance of databases and websites does not degrade. Sharding gives developers more freedom in database design, because it lets them design around the horizontal scaling pattern instead. The load can then be spread across several servers, which increases the capacity for processing data and queries and makes better use of the individual databases.

With vertical scaling, the hardware infrastructure is upgraded so that the server can keep responding to requests as the business grows and the number of users increases. Keep in mind, however, that hardware upgrades are only feasible up to a point, because they become very costly for the organization in the long run.

When we send a query to a standard (unsharded) database and the table in question holds millions of records, the response can be slow, even if indexing is done correctly. With a sharding architecture, the table is broken into several separate tables, so the query has to search through far fewer records. In this case, the data access speed increases significantly.

Sharding also helps an application keep working when a database becomes unavailable or a server problem occurs. Consider a web-based application built on a standard (unsharded) database. The most frequently used data is cached, so the application can keep serving that information for a certain period, but if the database server fails, the application's performance will be disrupted. A wide range of Iranian websites face this problem.

If the same problem occurs in a web-based application built on a sharded database, only one shard (one of the database's tables) is affected, and the application as a whole remains stable. Only the parts of the application that depend on that shard are temporarily unable to serve users.
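To make the idea of a failure staying inside one shard more concrete, here is a minimal in-memory sketch. The `ShardedStore` class, the `ShardUnavailable` exception, and the modulo routing rule are all hypothetical and exist only for this illustration; a real system would route to separate database servers.

```python
class ShardUnavailable(Exception):
    """Raised when the shard that owns a key cannot be reached."""


class ShardedStore:
    """Toy in-memory stand-in for a sharded database (hypothetical)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]
        self.offline = set()  # indexes of shards we pretend have failed

    def _owner(self, user_id):
        # Simple modulo routing; any deterministic rule would do here.
        index = user_id % len(self.shards)
        if index in self.offline:
            raise ShardUnavailable(f"shard {index} is offline")
        return self.shards[index]

    def put(self, user_id, profile):
        self._owner(user_id)[user_id] = profile

    def get(self, user_id):
        return self._owner(user_id)[user_id]


store = ShardedStore(shard_count=2)
for user_id in range(1, 5):
    store.put(user_id, f"profile-{user_id}")

store.offline.add(0)  # simulate an outage of shard 0, which holds the even ids

print(store.get(1))  # odd ids live on shard 1, which is still up
try:
    store.get(2)
except ShardUnavailable as error:
    print("temporarily unavailable:", error)
```

Users whose data lives on the healthy shard are still served, while requests for the failed shard get a clear error that the application can handle.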

What are the disadvantages of sharding architecture?

Sharding architecture offers better stability and performance for web-based applications, but it also has disadvantages. One of the most significant is the added complexity of the databases. If the sharding process is not carried out correctly, data can end up in the wrong place, write operations on the tables may fail, or the integrity of the tables may be lost.

Another problem with sharding architecture is imbalance between partitions. To better understand this issue, suppose we have two tables: one stores the data of users whose names begin with the letters a to n, and the other stores the data of users whose names begin with o to z. There may well be more users whose names start with the letter B than with other letters of the alphabet. In that situation, the table responsible for the letters a to n becomes too large, and retrieving data from its large volume of records becomes slower.
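A quick way to see such an imbalance is to count how many records each partition would receive. The sketch below assumes an illustrative split at the letter n and a made-up list of names; `range_shard` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def range_shard(name):
    """Return the hypothetical partition for a name, based on its first letter."""
    return "a-n" if name[0].lower() <= "n" else "o-z"


# Made-up sample in which names starting with B are over-represented.
names = ["Bahar", "Behnam", "Bita", "Babak", "Ali", "Mina", "Omid", "Sara", "Zahra"]

load = Counter(range_shard(name) for name in names)
for partition, count in sorted(load.items()):
    print(partition, count)
```

With this sample, the a-n partition receives twice as many rows as the o-z partition. Running the same kind of count over real keys before choosing the ranges is a cheap way to spot a skewed split early.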

Another disadvantage is that once a database has been partitioned, backups taken before that point are no longer of much use. From then on, backups must also be taken per partition, which is costly and error-prone.

How does sharding architecture work?

Suppose we need to partition our database based on business requirements; what should we do to get started? To understand the architecture, we first need to get acquainted with the different methods of distributing data among partitions. As a general rule, remember that when you query several databases or tables, you must know exactly which partition or table each query should be sent to; otherwise, you will get wrong results and, in some cases, lose data.

  • Key-Based Sharding: In this approach, when new data is stored in the database, a key associated with that data is taken into account. For example, if a new customer registers in an online store, customer_id is the key. This key is passed to a function, known as a hash function, which determines from its input which partition the data should be stored in. To ensure that data always ends up on the correct partition, the values fed to the hash function should be unique and should not change. A primary key is a good candidate for such a value; in this architecture it is known as the shard key.

One of the challenges of key-based sharding is that problems arise when we add servers or reduce the number of existing servers. Each server corresponds to a hash value, so when a new server joins the cluster, the mapping from keys to servers changes. Much of the existing data then has to be assigned a new hash value and moved to a different server, which is costly.
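The routing step and the cost of adding a server can both be sketched briefly. The example below is one possible scheme rather than the method of any particular database: it hashes customer_id with SHA-256 and takes the result modulo the number of shards. The function name `shard_for` and the shard counts are assumptions made for illustration.

```python
import hashlib


def shard_for(customer_id, shard_count):
    """Map a shard key to a shard index with a stable hash (modulo scheme)."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


customer_ids = range(1, 1001)

# Where each key lives with 4 shards, and where it would live with 5.
before = {cid: shard_for(cid, 4) for cid in customer_ids}
after = {cid: shard_for(cid, 5) for cid in customer_ids}

moved = sum(1 for cid in customer_ids if before[cid] != after[cid])
print(f"{moved} of {len(before)} keys change shard when going from 4 to 5 shards")
```

With plain modulo hashing, a key stays on the same shard only when its hash gives the same remainder for both shard counts, so roughly four out of five keys are expected to move when going from 4 to 5 shards. Schemes such as consistent hashing are commonly used to reduce this movement.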

  • Range-Based Sharding: Range-based partitioning shards data according to ranges of a given value. To better understand this pattern, suppose an online store has a database in which data about all of its products is stored. This database can be partitioned into different databases or tables based on product price. One advantage of this approach is that it is simple to implement, because the schema of every partition is the same. To decide at the application level which partition new data should be stored in, we only need to look at its price and select the partition that covers that price range.
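A minimal sketch of this price lookup, assuming three hypothetical price bands and made-up shard names, might look like this. It uses the standard `bisect` module to find the band a price falls into.

```python
import bisect

# Hypothetical price boundaries: [0, 50), [50, 200), and 200 and above.
PRICE_BOUNDS = [50, 200]
SHARD_NAMES = ["products_low", "products_mid", "products_high"]


def shard_for_price(price):
    """Pick the shard whose price range contains the given price."""
    if price < 0:
        raise ValueError("price must not be negative")
    return SHARD_NAMES[bisect.bisect_right(PRICE_BOUNDS, price)]


for price in (19.99, 50, 199.99, 200, 1200):
    print(price, "->", shard_for_price(price))
```

Because the boundaries live in one list, adding a new price band only means adding one boundary and one shard name. Whether the bands receive similar amounts of data still depends on how the prices are actually distributed.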

  • Directory-Based Sharding: As Figure 2 shows, this architecture needs a component called a lookup table. This component stores the shard keys so that it can be determined which partition holds which data. In other words, the table works like the index at the end of a book, which tells you on which page or in which section each term appears.

In Figure 2, the Delivery Zone column serves as the shard key. The values of this column are recorded in the lookup table, which makes it clear which partition each key belongs to. When it does not make sense to assign data to partitions by ranges of values, directory-based sharding is a better fit than the previous approach (range-based sharding).

One of the advantages of this architecture is its flexibility: developers can use their own algorithm to distribute data among the partitions, and new partitions are easy to add dynamically. Its main disadvantage is that every query starts at the lookup table, so if the lookup table fails for any reason, the entire application, or at least large parts of it, will fail too.
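The lookup table idea, including its weak point, can be sketched with a small class. Everything here is hypothetical: the `ShardDirectory` class, the zone names, and the shard names are invented for illustration, and the `available` flag merely simulates an outage of the lookup table.

```python
class ShardDirectory:
    """Hypothetical lookup table that maps a shard key to a shard name."""

    def __init__(self):
        self._entries = {}
        self.available = True  # set to False to simulate a lookup-table outage

    def assign(self, delivery_zone, shard_name):
        """Record (or change) which shard holds a delivery zone."""
        self._entries[delivery_zone] = shard_name

    def locate(self, delivery_zone):
        """Return the shard for a zone; every query has to pass through here."""
        if not self.available:
            raise RuntimeError("lookup table unavailable; the query cannot be routed")
        return self._entries[delivery_zone]


directory = ShardDirectory()
directory.assign("north", "shard_1")
directory.assign("south", "shard_2")
directory.assign("east", "shard_1")
print(directory.locate("south"))

# Adding a partition later only requires a new entry in the lookup table.
directory.assign("west", "shard_3")
print(directory.locate("west"))
```

Setting `directory.available = False` makes every `locate` call fail, which mirrors the single point of failure described above. In practice the lookup table is often replicated or cached for that reason.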

Figure 2

Last word

Some programmers and database experts believe that the ever-growing volume of information, which has made databases bulky, has made migrating to a sharding architecture far more pressing. When the volume of data a business generates every day is too large for a single database to manage, or the read/write load on the database is so heavy that one server's resources cannot keep up, software teams are forced to use sharding.

Note that partitions can be added to and removed from the database, and that the data can be repartitioned without damage or loss when this is done correctly. Today, database platforms such as Oracle or Microsoft SQL Server can distribute parts of a dataset across databases (shards) located on different computers or in the cloud, so that a problem with one partition does not cut off access to the whole database.

Now we come to the critical question: should we use sharding?

Whether or not to build a database on a partitioned architecture is one of the hot topics in desktop and web-based application development. Because of the operational complexity that this architecture adds to the design, some teams use sharding only when they are confident that a business's database will grow and that scaling it is inevitable.

For this reason, they suggest using sharding only when partitioning the database is truly necessary. In general, consider sharding if the software projects you plan to develop face any of the following scenarios:

  •  The amount of application data exceeds the storage capacity of a single database node.
  •  The volume of read/write operations exceeds what a single database node can handle, resulting in longer response times or slower transaction commits. In this case, sharding should be used so that the delay in accessing information is as low as possible.
  •  The network bandwidth required by the application exceeds the bandwidth available to a single database node, resulting in longer response times or outages.