Data Distribution in Apache Ignite

Hello! This post is a slightly abridged version of my eponymous talk at the Apache Ignite community meetup. The full video, along with questions and answers, can be viewed here, and the slides can be downloaded here. In the talk, I tried to show with examples how data is distributed in Apache Ignite.

Why do you need to distribute anything


The development of almost any system that stores and processes data follows a fairly standard path: sooner or later it hits a ceiling. Either there is so much data that it no longer physically fits on the storage device, or the load grows at such a rate that a single server can no longer handle the number of requests. Quite often both happen at once.

As a rule, one of two solutions is chosen: either sharding the existing storage or switching to a distributed database. Both solutions have a number of features in common, the most obvious being the use of more than one node to work with the data. From here on, I will call the set of nodes the topology.

The problem of distributing data among the topology nodes can be formulated as a set of requirements that the distribution must satisfy:

  1. There must be an algorithm that allows all topology nodes and client applications to come to the same conclusion about which node (or nodes) a given object (or key) resides on.
  2. Uniformity of distribution. The more evenly the data is spread across the nodes, the more evenly the load on those nodes is spread. Here I assume that the nodes have roughly equal resources.
  3. Minimal change in distribution when the topology changes. If a node is added to or removed from the topology, only the data that has to move to or from that node should be redistributed; the rest should stay where it is.

Achieving the first two requirements is fairly easy.

A familiar approach, often used when balancing load between functionally equivalent servers, is division modulo N, where N is the number of nodes in the topology and there is a one-to-one correspondence between a node's number and its identifier. All we need to do is turn the object's key into a numeric value with a hash function and take the remainder of dividing that value by N.

image

The diagram shows the distribution of 16 keys across 3 nodes. The distribution is uniform, and the algorithm for finding the node for an object is simple: as long as every topology node uses it, the same key and the same N always yield the same result.
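To make the modulo scheme concrete, here is a toy sketch in Java (the class name and key format are purely illustrative, and String.hashCode() stands in for any deterministic hash function):

    import java.util.stream.IntStream;

    // A toy demonstration of the "hash modulo N" mapping.
    public class ModuloDistribution {
        static int nodeForKey(String key, int nodeCount) {
            // Math.floorMod keeps the result non-negative even for negative hash codes.
            return Math.floorMod(key.hashCode(), nodeCount);
        }

        public static void main(String[] args) {
            // The same key often lands on a different node once N changes from 3 to 4,
            // which is exactly the "diff" shown in the next diagram.
            IntStream.range(0, 16).forEach(i ->
                System.out.printf("K%d: N=3 -> node %d, N=4 -> node %d%n",
                    i, nodeForKey("K" + i, 3), nodeForKey("K" + i, 4)));
        }
    }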

But what happens if we introduce the 4th node into the topology?

image

Our function has changed: now we take the remainder of division by 4 rather than by 3. And if the function has changed, the distribution has changed too, and quite dramatically.

Here the previous locations of the objects in the old three-node topology are shown in red, and their positions in the new four-node topology in green. It looks very much like an ordinary file diff, except that instead of files we have nodes.

It is easy to see that data moved not only to the new node; the nodes that were already in the topology also exchanged data among themselves. In other words, we get spurious traffic between nodes, and the requirement of a minimal change in distribution is not met.

There are two popular ways to solve the data distribution problem while meeting the requirements listed above:

  • Consistent hashing
  • Highest Random Weight (HRW) algorithm, also known as Rendezvous hashing.

Both of these algorithms are very simple: their descriptions on Wikipedia fit into a few sentences, although it is hard to call them obvious. For those interested, I recommend the original papers, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web and A Name-Based Mapping Scheme for Rendezvous. In my opinion, the idea of consistent hashing is conveyed most clearly in this Stanford course.

Let's look at these algorithms in more detail.

Consistent Hashing


The trick behind the consistent hashing algorithm is to map both nodes and stored objects onto the same identifier space. This makes our seemingly different entities, objects and nodes, comparable.

To obtain such a mapping, we simply apply the same hash function to object keys and to node identifiers. The hash of a node's identifier will be called its token; this will come in handy later.

We represent our identifier space as a circle, i.e. we simply assume that the minimum identifier value immediately follows the maximum one.

Now, to determine which node an object lives on, we take the hash of its key and then simply move clockwise around the circle until we hit some node's token. The direction of movement does not matter, but it must be fixed.

The imaginary clockwise movement is functionally equivalent to a binary search in a sorted array of node tokens.
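Here is a minimal consistent-hashing sketch in Java, assuming one token per node, String identifiers, and CRC32 as an arbitrary stand-in hash (none of these choices come from Ignite; they are just for illustration):

    import java.util.Map;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // A minimal consistent-hashing ring: one token per node.
    public class ConsistentHashRing {
        private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> node id

        private static long hash(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes());
            return crc.getValue();
        }

        public void addNode(String nodeId) {
            ring.put(hash(nodeId), nodeId); // the node's token is the hash of its id
        }

        // "Moving clockwise" is just a lookup of the first token >= the key's hash,
        // wrapping around to the smallest token when we pass the end of the circle.
        public String nodeForKey(String key) {
            Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
            return (e != null ? e : ring.firstEntry()).getValue();
        }
    }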

image

In the diagram, each sector of a particular color reflects the identifier space for which a particular node is responsible.

If we add a new node, then ...

image

... it will divide one of the sectors into two parts and completely take over the corresponding keys.

In this example, node 3 took over part of the keys of node 1.

As you can see, this approach gives a rather uneven distribution of objects across nodes, because it is highly dependent on the identifiers of the nodes themselves. How can this approach be improved?

You can assign more than one token to each node (usually hundreds). This can be done, for example, by introducing multiple hash functions per node (one per token) or by repeatedly applying the same hash function to the token obtained in the previous step. But we must not forget about collisions: no two nodes may end up with the same token. A possible extension of the sketch above is shown below.
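This is one possible way to give a node several tokens, reusing the ring and hash(...) from the sketch above; the token count and the "nodeId#i" derivation scheme are purely illustrative:

    // Adds tokenCount tokens for one node by repeatedly hashing derived identifiers.
    public void addNode(String nodeId, int tokenCount) {
        for (int i = 0; i < tokenCount; i++) {
            long token = hash(nodeId + "#" + i); // one derived token per iteration
            // No two nodes may share a token; a simple (if crude) way out is to
            // skip the colliding token.
            if (!ring.containsKey(token))
                ring.put(token, nodeId);
        }
    }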

image

In this example, each node has 4 tokens.

One more important point: if we want data to survive a node leaving the topology, we need to store each key on several nodes (so-called replicas, or backups). With consistent hashing, the replicas are the next N-1 nodes on the circle, where N is the replication factor. The order of the nodes should be determined by a specific token (for example, the first one), because with multiple tokens per node the ordering can differ. Note the diagram above: it has no clear repeating pattern of nodes.

As for the requirement of a minimal change in distribution when the topology changes, it is satisfied because the mutual order of the nodes on the circle does not change. That is, removing a node from the topology does not change the order relation between the remaining nodes.

Rendezvous hashing


The Rendezvous hashing algorithm seems even simpler than consistent hashing. It is based on the same principle of an invariant order relation, but instead of making nodes and objects comparable to each other, we make only the nodes comparable, and we do it per object. That is, we determine the order relation between nodes for each object independently.

Hashing helps us again. To determine the weight of node N for a given object O, we mix the object's identifier with the node's identifier and take the hash of that mix. Doing this for every node gives us a set of weights by which we sort the nodes.

The node that ends up first is the one responsible for storing the object.

Since every node in the topology uses the same input data, the result is identical everywhere, which satisfies the first requirement.
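A minimal rendezvous (HRW) sketch in Java; mixing the identifiers by string concatenation and hashing with CRC32 are arbitrary illustrative choices, not Ignite's implementation:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.zip.CRC32;

    // A minimal rendezvous (highest random weight) hashing sketch.
    public class RendezvousHashing {
        private static long weight(String key, String nodeId) {
            CRC32 crc = new CRC32();
            crc.update((key + ":" + nodeId).getBytes());
            return crc.getValue();
        }

        // Nodes ordered by descending weight for this key: the first element is the
        // primary owner, the following ones can serve as its replicas.
        public static List<String> nodesForKey(String key, List<String> nodes) {
            return nodes.stream()
                .sorted(Comparator.comparingLong((String n) -> weight(key, n)).reversed())
                .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // Every node that runs this with the same inputs gets the same ordering.
            System.out.println(nodesForKey("K3", List.of("N0", "N1", "N2")));
        }
    }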

image

Consider an example. Here we have an order relation between three nodes for four different keys. Yellow indicates the node with the highest weight, i.e. the node that will ultimately be responsible for a particular key.

Add another node to the topology.

image

I deliberately placed it on the diagonal to cover all possible cases. Node 3, shown in green, has entered the topology, so the weight ordering of the nodes has changed for each key. Red marks the nodes that moved down the list for a particular key because their weights were lower than the weight of the added node. However, the change of ownership affected only one key, K3.

Now let's treacherously remove a node from the topology.

image

Once again, the changes affected only one key, this time K1; the remaining objects were untouched. The reason, as with consistent hashing, is the invariance of the order relation between any pair of nodes. That is, the requirement of a minimal change in distribution is met, and there is no spurious traffic between nodes.

The distribution produced by rendezvous hashing looks pretty good and, unlike consistent hashing, requires no additional tricks such as tokens.

If we want to support replication, the second node in the list becomes the first replica of the object, the third node the second replica, and so on.

How rendezvous hashing is used in Apache Ignite


The so-called affinity function is responsible for data distribution in Apache Ignite (see the AffinityFunction interface). The default implementation is rendezvous hashing (see the RendezvousAffinityFunction class).

The first thing to pay attention to is that Apache Ignite does not map stored objects directly onto topology nodes. Instead, an additional concept is introduced: the partition.

A partition is both a container for objects and the unit of replication. The number of partitions for a cache (the analogue of a table in familiar databases) is set at configuration time and does not change over the cache's life cycle.

Thus we can map objects onto partitions using cheap modulo division, and use rendezvous hashing to map partitions onto nodes.
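A rough sketch of this two-step mapping, reusing the rendezvous sketch from above; the partition count and naming are illustrative and this is not Ignite's actual code:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Keys go to partitions by modulo; partitions go to nodes via rendezvous hashing.
    public class PartitionedDistribution {
        static int partitionForKey(Object key, int partitionCount) {
            return Math.floorMod(key.hashCode(), partitionCount);
        }

        // Computed once per topology version and cached; identical on every node.
        static Map<Integer, List<String>> assignPartitions(int partitionCount, List<String> nodes) {
            Map<Integer, List<String>> assignment = new HashMap<>();
            for (int p = 0; p < partitionCount; p++)
                assignment.put(p, RendezvousHashing.nodesForKey("partition-" + p, nodes));
            return assignment;
        }
    }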

image

Because the number of partitions in a cache is constant, we can compute the partition-to-node distribution once and cache the result until the topology changes.

Each node calculates this distribution independently, but given the same input data, the result is identical on every node.

A partition can have several copies, which we call backups. The main copy is called the primary partition.

For the best distribution of keys across partitions and of partitions across nodes, the following rule should hold: the number of partitions should be significantly greater than the number of nodes, and the number of keys in turn significantly greater than the number of partitions.

Caches in Ignite come in two modes: partitioned and replicated.

In a partitioned cache, the number of backups is set when the cache is created. Partitions, both primaries and backups, are distributed evenly across the nodes. Such a cache is best suited for operational data, since it gives the best write performance, which directly depends on the number of backups: the more backups there are, the more nodes must acknowledge a key write.
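For reference, a partitioned cache with one backup might be configured roughly like this (the cache name and key/value types are arbitrary examples):

    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    // A partitioned cache with one backup copy of every partition.
    CacheConfiguration<Integer, String> ordersCfg = new CacheConfiguration<>("Orders");
    ordersCfg.setCacheMode(CacheMode.PARTITIONED); // primaries and backups spread across nodes
    ordersCfg.setBackups(1);                       // one extra copy of every partition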

image

In this example the cache has one backup, so we can lose one node without losing data, because a partition's backups are never stored on the same node as its primary or its other backups.

In the replicated cache, the number of backups is always equal to the number of topology nodes minus 1. That is, each node always contains copies of all partitions.

image

Such a cache is best suited for rarely changing data (for example, reference data) and provides the greatest availability: we can lose N-1 nodes (in this case 3) without losing data. It also gives maximum read performance if we allow reads from both the primary partitions and the backups.
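A replicated cache for such reference data might be configured roughly like this (same imports as the previous snippet; the cache name is arbitrary, and readFromBackup is set explicitly for clarity):

    // A replicated cache: every node holds a copy of all partitions.
    CacheConfiguration<Integer, String> dictionaryCfg = new CacheConfiguration<>("Dictionary");
    dictionaryCfg.setCacheMode(CacheMode.REPLICATED); // backups = number of nodes minus 1
    dictionaryCfg.setReadFromBackup(true);            // serve reads from local copies too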

Data colocation in Apache Ignite


An important concept to keep in mind for best performance is colocation. Colocation means placing related objects in the same place. In our case, the objects are the entities stored in the cache, and the place is a node.

If objects are distributed across partitions by the same affinity function, objects with the same affinity key naturally end up in the same partition and therefore on the same node. In Ignite this is called affinity colocation.

By default, an object's affinity key is its primary key, but Ignite lets you use any other field of the object as the affinity key.

Colocation significantly reduces the amount of data sent between nodes to perform computations or SQL queries, which naturally reduces the time these tasks take. Let's look at an example.

Suppose our data model consists of two entities: an order (Order) and an order item (OrderItem). One order can have many items. The identifiers of orders and items are independent, but an item has a foreign key referring to its order.

Suppose we need to run a task that, for each order, performs some computation over that order's items.

By default, the affinity key is the primary key, so orders and items will be distributed across nodes according to their primary keys, which, as I noted, are independent.

image

In the diagram, orders are shown as squares and items as circles. Color indicates which order an item belongs to.

With this data distribution, our hypothetical task is sent to the node where the order in question lives, and it then has to either read the items from all the other nodes or send subtasks to those nodes and collect the results. This is unnecessary network interaction that can and should be avoided.

What if we tell Ignite that order items must be placed on the same nodes as the orders themselves, i.e. colocate the data?

As the affinity key for an item we take the foreign key OrderId; this field will be used when computing the partition the record belongs to. Within the partition we can still always find our object by its primary key.
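A sketch of what the item's key class might look like; the class and field names are illustrative, and the @AffinityKeyMapped annotation marks the field fed to the affinity function so that items land in the same partition (and on the same node) as their order:

    import org.apache.ignite.cache.affinity.AffinityKeyMapped;

    // A composite cache key for OrderItem entries.
    public class OrderItemKey {
        private final long itemId;   // the item's own identifier

        @AffinityKeyMapped
        private final long orderId;  // foreign key to Order; drives colocation

        public OrderItemKey(long itemId, long orderId) {
            this.itemId = itemId;
            this.orderId = orderId;
        }

        // equals() and hashCode() over both fields are required for a cache key.
    }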

image

Now, if both caches (Order and OrderItem) use the same affinity function with the same parameters, our data live side by side and we no longer need to go over the network for order items.

Affinity configuration in Apache Ignite


In the current implementation, an affinity function object is a cache configuration parameter.

The affinity function itself takes the following arguments when it is created:

  • The number of partitions;
  • The number of backups (in fact, this is also a cache configuration parameter);
  • A backup filter;
  • The excludeNeighbors flag.

These settings cannot be changed.

With the number of partitions and backups, everything seems to be clear. I’ll talk about the backup filter and the excludeNeighbors flag a bit later.
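Putting these parameters together, a configuration might look roughly like this (the partition and backup counts are arbitrary examples):

    import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
    import org.apache.ignite.configuration.CacheConfiguration;

    // Configuring the default affinity function explicitly.
    RendezvousAffinityFunction affinity = new RendezvousAffinityFunction();
    affinity.setPartitions(1024);       // fixed for the lifetime of the cache
    affinity.setExcludeNeighbors(true); // no backups on the same physical host as the primary

    CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("Orders");
    cfg.setBackups(1);                  // the backup count is a cache parameter
    cfg.setAffinity(affinity);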

At run time, the affinity function receives the current cluster topology as input, essentially a list of cluster nodes, and calculates the distribution of partitions across nodes along the lines of the examples I showed when discussing the rendezvous hashing algorithm.

As for the backup filter, it is a predicate that forbids the affinity function from assigning backup partitions to any node for which the predicate returns false.

As an example, suppose our physical nodes (servers) are located in different racks in a data center. Typically, each rack has its own independent power supply ...

image

... and if we lose the rack, then we lose the data.

image

In this example, we lost half of the partitions.

But if we set the correct backup filter, then the distribution will change in such a way ...

image

... that if a rack is lost, no data is lost and it all remains available.

image
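A sketch of such a rack-aware setup, assuming each node is started with a user attribute named "RACK" and that the ClusterNodeAttributeAffinityBackupFilter shipped with recent Ignite versions is available; the attribute name and value are arbitrary:

    import java.util.Collections;
    import org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter;
    import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
    import org.apache.ignite.configuration.IgniteConfiguration;

    // Each node is started with a user attribute describing its rack...
    IgniteConfiguration nodeCfg = new IgniteConfiguration();
    nodeCfg.setUserAttributes(Collections.singletonMap("RACK", "rack-1"));

    // ...and backups are placed only on nodes whose "RACK" attribute differs
    // from the nodes already chosen for the partition.
    RendezvousAffinityFunction affinity = new RendezvousAffinityFunction();
    affinity.setAffinityBackupFilter(new ClusterNodeAttributeAffinityBackupFilter("RACK"));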

The excludeNeighbors flag serves a similar purpose; in effect it is a shorthand for one specific case.

Often several Ignite nodes run on the same physical host. The situation is very similar to the rack example, only now we are guarding against data loss when a host is lost rather than a rack.

image

The rest is the same. You can implement this behavior using a backup filter. This flag is a historical legacy and may be removed in the next major release of Ignite.

That seems to cover everything about the affinity function and data distribution that a developer using Apache Ignite needs to know.

In conclusion, let's look at an example of distributing 16 partitions across a topology of 3 nodes. For simplicity and clarity, we assume the partitions have no backups.

I simply wrote a small test that printed the actual distribution:

image

As you can see, the uniformity of the distribution is not perfect, but the error drops noticeably as the number of nodes and partitions grows. The main rule to observe is that the number of partitions must be significantly greater than the number of nodes. Currently, the default number of partitions for a partitioned Ignite cache is 1024.

Now add a new node to the topology.

image

Some of the partitions moved to it. At the same time, the requirement of a minimal change in distribution held: the new node received some partitions, while the other nodes did not exchange partitions among themselves.

Now let's remove one of the nodes that was in the topology from the start:

image

All the partitions that belonged to node 0 have been redistributed among the remaining topology nodes without violating our distribution requirements.

As you can see, solutions to complex problems are often built on fairly simple, though not entirely obvious, ideas. The approaches described here are used in most distributed databases and do their job well. But they are randomized, so the uniformity of the distribution is far from ideal. Can uniformity be improved without sacrificing performance and the other distribution requirements? The question remains open.
