Two-node clusters - the devil is in the details

Hello, Habr! I present to you the translation of the article "Two Nodes - The Devil is in the Details" by Andrew Beekhof.

Many people prefer two-node clusters because they seem conceptually simpler and are also 33% cheaper than their three-node counterparts. While it is quite possible to build a good two-node cluster, in most cases such a configuration creates a number of non-obvious problems because of unaccounted-for failure scenarios.

The first step in creating any high-availability system is to find and try to eliminate single points of failure, often abbreviated as SPoF (single point of failure).

Keep in mind that no system can eliminate every possible risk of downtime, if only because the typical protection against a risk is to introduce some redundancy, which increases the complexity of the system and creates new points of failure. So we compromise from the start and focus on events involving single points of failure rather than on chains of related, and therefore increasingly unlikely, events.

Given these trade-offs, we are not just hunting for SPoFs but weighing risks against consequences, and the conclusion about what is critical and what is not will differ for every deployment.
Not everyone needs alternative power providers with independent supply lines. That said, the paranoia paid off for at least one customer whose monitoring detected a failing transformer: the customer was on the phone trying to warn the power company when the faulty transformer finally exploded.

The natural starting point is to have more than one node in the system. However, before the system can move services to the surviving node after a failure, it generally needs to make sure that the services being moved are not still active somewhere else.

There is little harm if, after a failure, a two-node cluster ends up with both nodes serving the same static website. Things change dramatically, however, if the result is both sides independently managing a shared job queue, or providing uncoordinated write access to a replicated database or shared file system.

Therefore, to prevent data corruption when a node fails, we rely on something called "fencing".

The principle of fencing


Fencing is built around a single question: could a peer node cause data corruption? If data corruption is a likely scenario, a good solution is to isolate that node from both incoming requests and persistent storage. The most common approach to fencing is to power off the failed node.

There are two categories of fencing methods, which I will call direct and indirect, though they could equally be called active and passive. Direct methods involve action by the surviving peers, such as talking to an IPMI device (Intelligent Platform Management Interface, an interface for remotely monitoring and managing the physical state of a server) or iLO (a mechanism for managing servers without physical access to them). Indirect methods rely on the failed node somehow recognizing that it is in an unhealthy state (or at least that it can no longer prove otherwise to the remaining members) and allowing its hardware watchdog to power the node off.

Quorum helps with both the direct and the indirect case.

Direct fencing


With direct fencing, we can use quorum to prevent fencing races when the network fails.

With the concept of quorum, the system has enough information (even without connectivity to its peers) for the nodes to automatically know whether they should initiate fencing and/or recovery.

Without quorum, both sides of a network split will reasonably assume the other side is dead and race to fence each other. In the worst case, both sides succeed, powering off the entire cluster. An alternative scenario is a death match: an endless loop of nodes coming up, not seeing their peers, rebooting them and initiating recovery, only to be rebooted themselves when their peers follow the same logic.
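
To make this concrete, here is a minimal sketch (my own illustration, not code from the original article) of the decision a node might make when it loses sight of its peer; the function name and the majority rule are assumptions for illustration only:

    # Minimal sketch: a survivor only initiates direct fencing when its own
    # partition still holds quorum, i.e. a strict majority of the nodes.
    def on_peer_lost(visible_nodes: int, total_nodes: int) -> str:
        have_quorum = visible_nodes > total_nodes // 2
        if have_quorum:
            return "fence the lost peer, then recover its services"
        # Without quorum we must not fence: in a network split the other side
        # may be alive and reaching the same conclusion about us.
        return "do nothing (wait, or self-fence via the watchdog)"

    # In a 3-node cluster split 2/1, only the 2-node side has quorum and fences.
    # In a 2-node cluster split 1/1, neither side has a majority.
    print(on_peer_lost(visible_nodes=2, total_nodes=3))
    print(on_peer_lost(visible_nodes=1, total_nodes=2))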

The problem with fencing is that the most commonly used fencing devices become inaccessible due to the same failure events that we want recovery to handle. Most IPMI and iLO cards are installed in the hosts they control and, by default, use the same network, so the very failure that makes the surviving nodes believe the target is offline also cuts them off from the device they need to fence it with.

Unfortunately, how IPMI and iLO devices behave in such situations is rarely considered at the time the hardware is purchased.

Indirect fencing


Quorum is also important for managing indirect fencing: done correctly, quorum allows the survivors to assume that lost nodes will have entered a safe state after a defined period of time.

In this setup, the hardware watchdog timer is reset every N seconds as long as quorum is not lost. If the timer (usually some multiple of N) expires, the device performs an ungraceful power-off (not a clean shutdown).
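
As an illustration (mine, not the article's), the core of such a watchdog loop might look roughly like this; the device path, the timing values and the have_quorum helper are assumptions, and real deployments typically delegate this job to a dedicated daemon such as sbd:

    import time

    WATCHDOG_TIMEOUT = 10   # seconds until the hardware powers the node off
    PET_INTERVAL = 2        # N: how often we reset ("pet") the watchdog

    def have_quorum() -> bool:
        raise NotImplementedError("placeholder: ask the cluster membership layer")

    def watchdog_loop() -> None:
        # Opening /dev/watchdog arms the timer; any write resets it.
        with open("/dev/watchdog", "wb", buffering=0) as wd:
            while True:
                if have_quorum():
                    wd.write(b"\0")   # node proves it is still healthy
                # If quorum is lost we simply stop petting the device; after
                # WATCHDOG_TIMEOUT seconds the hardware cuts the power on its
                # own, and the survivors can assume we are in a safe state.
                time.sleep(PET_INTERVAL)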

This approach is very effective, but without quorum the cluster does not have enough information to manage it. It is not easy to tell the difference between a network outage and the failure of the peer host. This matters because, without the ability to distinguish between the two cases, you are forced to choose a single behaviour for both.

The problem with choosing one behaviour is that there is no course of action that both maximizes availability and prevents data loss.

  • If you choose to assume that the peer node is still alive when it has in fact failed, the cluster needlessly withholds services that ought to be running to compensate for the loss of the failed peer's services.
  • If you choose to assume that the peer node is dead when it was only a network failure and the remote node is actually still running, then at best you are signing up for some future manual reconciliation of the resulting data sets.

Whatever heuristic you use, it is trivial to construct a failure that either lets both sides keep running or forces the cluster to power off the surviving nodes. Not using quorum really does deprive the cluster of one of the most powerful tools in its arsenal.

If there is no other alternative, the best approach is to sacrifice availability (here the author is referring to the CAP theorem). Highly available corrupted data helps no one, and manually reconciling divergent data sets is no fun either.

Quorum


Quorum sounds great, right?

The only drawback is that, in order to have quorum in a cluster of N members, you need connectivity between at least N/2 + 1 of your nodes, which is impossible in a two-node cluster after one node fails.

This ultimately brings us to the fundamental problem with two nodes:
quorum does not make sense in a two-node cluster, and without it, it is impossible to reliably determine a course of action that maximizes availability and prevents data loss.
Even in a system of two nodes connected by a crossover cable, it is impossible to definitively distinguish a network outage from the failure of the other node. Unplugging one end (the probability of which is, of course, proportional to the distance between the nodes) is enough to invalidate any assumption that link health equals peer health.
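
A tiny bit of arithmetic (my illustration, not from the article) shows why two nodes are special here:

    # Quorum requires a strict majority: floor(N / 2) + 1 votes.
    def quorum(n_nodes: int) -> int:
        return n_nodes // 2 + 1

    for n in (2, 3, 5):
        survivors = n - 1  # after a single node failure
        print(f"{n} nodes: quorum = {quorum(n)}, "
              f"quorate after one failure: {survivors >= quorum(n)}")

    # 2 nodes: quorum = 2, quorate after one failure: False
    # 3 nodes: quorum = 2, quorate after one failure: True
    # 5 nodes: quorum = 3, quorate after one failure: True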

Making a two-node cluster work


Sometimes the customer cannot, or does not want to, buy a third node, and we are forced to look for an alternative.

Option 1 - A redundant fencing method


A node's iLO or IPMI device is itself a point of failure because, if it fails, the survivors cannot use it to put the node into a safe state. In a cluster of three or more nodes we can mitigate this with quorum and a hardware watchdog (the indirect fencing mechanism discussed earlier). With two nodes we must instead use network power switches (power distribution units, or PDUs).

After a failure, the survivor first tries to contact the primary fencing device (the built-in iLO or IPMI). If that succeeds, recovery continues as usual. Only if the iLO/IPMI device fails is the PDU used; if that call succeeds, recovery can again continue.
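
A rough sketch of that ordering (my illustration, not code from the article; the helper functions are hypothetical, and a real cluster stack would normally express this as configured fencing levels rather than hand-written code):

    def fence_via_ipmi(node: str) -> bool:
        ...  # placeholder: ask the node's IPMI/iLO card to power it off

    def fence_via_pdu(node: str) -> bool:
        ...  # placeholder: cut the node's power at the network PDU

    def fence_node(node: str) -> bool:
        for attempt in (fence_via_ipmi, fence_via_pdu):
            try:
                if attempt(node):   # True means the device confirmed the node is off
                    return True     # only now is it safe to recover its services
            except ConnectionError:
                continue            # device unreachable: fall through to the next one
        return False                # nothing confirmed the power-off: do NOT recover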

Be sure to put the PDU on a different network from the cluster traffic; otherwise a single network failure will cut off access to both fencing devices and block service recovery.

At this point you may ask: isn't the PDU itself a single point of failure? The answer is: of course it is.

If that risk matters to you, you are not alone: connect both nodes to two PDUs and tell the cluster software to use both when powering nodes on and off. The cluster then stays active if one PDU dies, and blocking recovery would require a second failure of either the other PDU or an IPMI device.

Option 2 - Adding an Arbitrator


In some scenarios the redundant fencing method, while technically possible, is politically awkward. Many companies like to maintain some separation between administrators and application owners, and security-conscious network administrators are not always enthusiastic about handing PDU access credentials to anyone.

In this case, the recommended alternative is to introduce a neutral third party that can supplement the quorum calculation.

In the event of a failure, a node must be able to see either its peer or the arbitrator in order to recover services. The arbitrator also acts as a tie-breaker when both nodes can see the arbitrator but cannot see each other.

This option should be used together with an indirect fencing method, such as a hardware watchdog configured to power off the machine if it loses contact with both its peer node and the arbitrator. The survivor can then safely assume that its peer will be in a safe state once the hardware watchdog timer has expired.
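
A minimal sketch of the combined rule (my illustration, not from the article; the connectivity checks are hypothetical placeholders):

    def should_self_fence(can_see_peer: bool, can_see_arbitrator: bool) -> bool:
        # Stop petting the watchdog (and so let the hardware power us off) only
        # when we have lost BOTH the peer and the arbitrator: at that point we
        # cannot prove we are the side that should keep running.
        return not can_see_peer and not can_see_arbitrator

    def may_recover_peer_services(can_see_peer: bool, can_see_arbitrator: bool,
                                  watchdog_timeout_elapsed: bool) -> bool:
        # The survivor may take over only if the arbitrator is on its side and
        # the isolated peer has had time to be powered off by its own watchdog.
        return (not can_see_peer) and can_see_arbitrator and watchdog_timeout_elapsed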

The practical difference between an arbitrator and a third node is that an arbitrator needs far fewer resources and can potentially serve more than one cluster.

Option 3 - The Human Factor


The final approach is for the survivor to keep running whatever services it was already running, but not to start any new ones, until either the problem resolves itself (the network comes back, the node reboots) or a human takes responsibility for manually confirming that the other side is dead.
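
Sketched as a decision rule (my illustration, with hypothetical names), the policy amounts to "freeze until someone takes responsibility":

    def allowed_actions(peer_visible: bool, human_confirmed_peer_dead: bool) -> dict:
        if peer_visible:
            return {"keep_running": True, "start_new": True, "recover_peer": False}
        if human_confirmed_peer_dead:
            return {"keep_running": True, "start_new": True, "recover_peer": True}
        # Peer lost and nobody has taken responsibility yet: freeze.
        return {"keep_running": True, "start_new": False, "recover_peer": False}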

Bonus Option


Did I mention that you can add a third node?

Two racks


For the sake of argument, let's imagine I have convinced you of the merits of a third node; now we need to consider the physical placement of the nodes. If they are housed (and powered) in the same rack, that rack is also a SPoF, and one that cannot be solved by adding a second rack.

If that is surprising, consider what happens when the rack containing two of the nodes fails, and how the surviving node would distinguish that from a network failure.

The short answer is that it can't, and we are back to all the problems of the two-node case. Either the survivor:

  • ignores quorum and incorrectly tries to initiate recovery during network outages (whether fencing can even complete is a separate story, and depends on whether PDUs are involved and whether they share power with either of the racks), or
  • respects quorum and shuts itself down prematurely when its peer node fails

Either way, two racks are no better than one: the nodes must either get independent power supplies or be spread across three (or more, depending on how many nodes you have) racks.

Two data centers


At this point, the more risk-averse readers may start thinking about disaster recovery. What happens when an asteroid hits the single data center containing our three nodes spread across three different racks? Obviously Bad Things, but depending on your needs, adding a second data center may not be enough.

Done correctly, a second data center gives you a (reasonably) up-to-date and consistent copy of your services and their data. However, just as in the two-node and two-rack scenarios, the system does not have enough information to both maximize availability and prevent corruption (or divergence of the data sets). Even with three nodes (or racks), distributing them across only two data centers leaves the system unable to reliably make the right decision in the (now much more likely) event that the two sides cannot talk to each other.

This does not mean that a two-data-center setup is never appropriate. It is common for companies to want a human in the loop before taking the exceptional step of failing over to a backup data center. Just keep in mind that if you want to automate the failover, you will need either a third data center so that quorum makes sense (either directly or via an arbitrator), or a way to reliably power off an entire data center.
