Orchestrator and VIP as an HA solution for a MySQL cluster

At CityMobile, we use MySQL as the main storage for persistent data. We have several database clusters for different services and purposes.

The constant availability of the master is a critical indicator of the health of the entire system and of its individual parts. Automatic cluster recovery after a master failure greatly reduces incident response time and system downtime. In this article, I will look at an HA setup for a MySQL cluster based on MySQL Orchestrator and virtual IP addresses (VIPs).



VIP HA Solution


First, I’ll briefly describe what our data storage system looks like.

We use the classic replication scheme with a single master that accepts writes and many read-only replicas. A cluster can contain an intermediate master - a node that is both a replica and a master for other nodes. Clients access the replicas through HAProxy, which allows for even load distribution and easy scaling. The use of HAProxy is due to historical reasons, and we are currently migrating to ProxySQL.
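
For illustration, here is a minimal sketch of how read traffic might be spread across replicas with HAProxy; the listener port, hostnames, and file path are assumptions rather than our production configuration.

    # Hypothetical HAProxy fragment: clients read from port 3307, HAProxy
    # balances connections across the replicas (hostnames are made up).
    echo '
    listen mysql-replicas
        bind *:3307
        mode tcp
        balance roundrobin
        option tcp-check
        server replica1 replica1.db.local:3306 check
        server replica2 replica2.db.local:3306 check
    ' >> /etc/haproxy/haproxy.cfg

    systemctl reload haproxy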

Replication is semi-synchronous and based on GTID. This means that at least one replica must write a transaction to its log before the transaction is considered successful. This replication mode provides a good balance between performance and data safety in case the master fails. Most changes are transferred from the master to the replicas using row-based replication (RBR), but some nodes may use the mixed binlog format.
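
As a rough illustration, the settings below enable GTID, row-based binlogs, and semi-synchronous replication with a single acknowledging replica. This is a sketch for MySQL 5.7 with the semisync plugins already installed, not our exact configuration; the file path is also an assumption.

    # Illustrative my.cnf fragment written to a hypothetical include file.
    echo '
    [mysqld]
    gtid_mode                                 = ON
    enforce_gtid_consistency                  = ON
    binlog_format                             = ROW
    rpl_semi_sync_master_enabled              = 1
    rpl_semi_sync_slave_enabled               = 1
    rpl_semi_sync_master_wait_for_slave_count = 1
    ' >> /etc/mysql/conf.d/replication.cnf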

The orchestrator periodically refreshes the cluster topology, analyzes the collected information and, if it finds a problem, can start an automatic recovery procedure. The recovery procedure itself is left to the engineer, since it can be implemented in different ways: based on VIPs, DNS, service discovery, or custom mechanisms.
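
Orchestrator exposes this through recovery hooks: it runs user-defined commands at certain stages of recovery and substitutes placeholders such as {failedHost} and {successorHost}. The fragment below only illustrates that wiring; the script path is made up, and the config is trimmed to the relevant keys rather than being our actual setup.

    # Illustrative fragment of orchestrator.conf.json: allow master recovery
    # for all clusters and call our own promotion script after failover.
    echo '{
      "RecoverMasterClusterFilters": ["*"],
      "ApplyMySQLPromotionAfterMasterFailover": true,
      "PostMasterFailoverProcesses": [
        "/usr/local/bin/promote-master.sh {failedHost} {successorHost}"
      ]
    }' > /etc/orchestrator.conf.json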

One of the simplest ways to restore the master after a failure is to use a floating VIP address.

What you need to know about this solution before moving on:

  • A VIP is an IP address that is not tied to a specific physical network interface. When a node fails, or during scheduled maintenance, we can switch the VIP to another server with minimal downtime.
  • Releasing and assigning a virtual IP address is a cheap and fast operation.
  • Working with a VIP requires SSH access to the server, or the use of special tools such as keepalived (a minimal sketch of the VIP switch follows this list).
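
Here is a minimal sketch of what the VIP manipulations look like with iproute2 and the iputils arping; the interface name, address, and hostnames are hypothetical.

    # Release the VIP on the old master (requires SSH access to it).
    ssh old-master 'ip addr del 10.0.0.100/32 dev eth0'

    # Claim the VIP on the new master; the address is not persistent and
    # disappears after a reboot.
    ssh new-master 'ip addr add 10.0.0.100/32 dev eth0'

    # Announce the move with gratuitous ARP so that the switches update their
    # IP-to-MAC tables (-A sends unsolicited ARP replies).
    ssh new-master 'arping -c 3 -A -I eth0 10.0.0.100'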

Let's go through the possible failures of the master and see how the automatic recovery mechanism should behave.

Network connectivity to the master is lost, or a problem has occurred at the hardware level, and the server is unavailable


  1. The orchestrator updates the replication topology, and each replica reports that the master is unreachable. The orchestrator starts selecting a replica for the role of the new master, picks a candidate, and begins the recovery.
  2. We try to release the VIP from the old master, without success, since the server is unreachable.
  3. The replica is switched to the master role. The topology is rebuilt.
  4. We add a network interface with the VIP on the new master. Since the VIP could not be released on the old master, we periodically send gratuitous ARP requests. This kind of ARP request/reply updates the IP-to-MAC mapping tables on the connected switches, notifying them that the VIP has moved, and reduces the likelihood of split brain when the old master returns (a sketch of such a promotion hook follows this list).
  5. All new connections are immediately redirected to the new master. Old connections fail, and the application retries its database calls.
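
To tie these steps together, here is a sketch of a promotion hook for this scenario. The two host arguments mirror orchestrator's {failedHost} and {successorHost} placeholders, while the VIP, interface name, and timeout are assumptions; this is not our production script.

    #!/usr/bin/env bash
    # Sketch of a master-promotion hook: try to release the VIP on the failed
    # master, then claim it on the new master and announce the move.
    set -euo pipefail

    FAILED_HOST="$1"      # orchestrator passes {failedHost}
    NEW_MASTER="$2"       # orchestrator passes {successorHost}
    VIP="10.0.0.100"      # hypothetical virtual IP
    IFACE="eth0"          # hypothetical interface

    # Step 2: attempt to release the VIP on the old master. With the server
    # down this fails, which is expected, so recovery must not abort here.
    timeout 5 ssh "$FAILED_HOST" "ip addr del $VIP/32 dev $IFACE" || true

    # Steps 3-4: after the replica is promoted, claim the VIP on it and send
    # gratuitous ARP so the switches learn the new IP-to-MAC mapping.
    ssh "$NEW_MASTER" "ip addr add $VIP/32 dev $IFACE"
    ssh "$NEW_MASTER" "arping -c 3 -A -I $IFACE $VIP"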

The server is operating in normal mode; a failure occurred at the DBMS level


The algorithm is similar to the previous case: the topology is updated and the recovery process starts. Since the server is available, we successfully release the VIP on the old master, transfer it to the new one, and send several gratuitous ARP requests. The possible return of the old master should not affect the rebuilt cluster or the operation of the application.

Other problems


Failure of replicas or intermediate masters does not lead to automatic actions and requires manual intervention.

The virtual network interface is always added non-persistently, that is, after a reboot the VIP is not assigned automatically. Each database instance starts in read-only mode by default; the orchestrator automatically switches the new master to read-write and tries to set read_only on the old master. These measures are aimed at reducing the likelihood of split brain.
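
A rough sketch of those read-only defaults and of what promotion effectively boils down to; the file path and hostname are made up, super_read_only assumes MySQL 5.7+, and the exact statements orchestrator issues may differ.

    # Every instance starts read-only by default (illustrative my.cnf fragment).
    echo '
    [mysqld]
    read_only       = ON
    super_read_only = ON
    ' >> /etc/mysql/conf.d/readonly.cnf

    # Roughly what happens on the promoted node (orchestrator does this itself
    # when MySQL promotion is enabled in its configuration):
    mysql -h new-master -e "SET GLOBAL super_read_only = 0; SET GLOBAL read_only = 0;"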

Problems can arise during the recovery process, and in addition to the standard monitoring tools we also want to be notified about them through the orchestrator UI. We have extended the REST API with this feature (the PR is currently under review).

The general scheme of the HA solution is presented below.



Choosing a New Master


The orchestrator is smart enough to try to choose the most suitable replica as the new master, based on the following criteria:

  • the replica's lag behind the master;
  • the MySQL version of the master and the replica;
  • the replication type (RBR, SBR, or mixed);
  • location in the same or a different data center;
  • the presence of errant GTID transactions: transactions that were executed on the replica and are absent on the master;
  • custom promotion rules are also taken into account.

Not every replica is an ideal candidate for the master role. For example, a replica may be used for backups, or its server may have a weaker hardware configuration. The orchestrator supports manual rules that let you adjust your candidate preferences, from most preferred to ignored.
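
Such preferences can be registered per instance, for example via orchestrator-client; the hostnames below are made up.

    # Prefer a well-provisioned replica as the promotion candidate.
    orchestrator-client -c register-candidate -i db1.example.net:3306 --promotion-rule prefer

    # Never promote the replica that is dedicated to backups.
    orchestrator-client -c register-candidate -i backup1.example.net:3306 --promotion-rule must_not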

Response and Recovery Time


In the event of an incident, it is important to minimize system downtime, so let's look at the MySQL parameters that affect how the orchestrator builds and updates the cluster topology (a configuration sketch follows the list):

  • slave_net_timeout - the number of seconds a replica waits for new data or a heartbeat signal from the master before the connection is considered lost and a reconnect is attempted. The lower the value, the faster the replica can determine that the connection with the master is broken. We set this value to 5 seconds.
  • MASTER_CONNECT_RETRY - the number of seconds between reconnection attempts. In case of network problems, a low value allows quick reconnection and can prevent the cluster recovery process from being triggered. The recommended value is 1 second.
  • MASTER_RETRY_COUNT - the maximum number of reconnection attempts.
  • MASTER_HEARTBEAT_PERIOD - the interval in seconds at which the master sends a heartbeat signal. The default is half the value of slave_net_timeout.
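
Put together on a replica, the settings might look like this; the 5-second timeout and 1-second retry interval mirror the values above, while the retry count and hostname are illustrative (MySQL 5.7 syntax).

    # CHANGE MASTER TO requires the replication threads to be stopped first.
    mysql -h replica1.db.local -e "
      SET GLOBAL slave_net_timeout = 5;
      STOP SLAVE;
      CHANGE MASTER TO
        MASTER_CONNECT_RETRY = 1,
        MASTER_RETRY_COUNT = 86400,
        MASTER_HEARTBEAT_PERIOD = 2.5;
      START SLAVE;"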

Orchestrator options:

  • DelayMasterPromotionIfSQLThreadNotUpToDate - if true, the master role will not be applied to the candidate replica until its SQL thread has applied all pending transactions from the relay log. We use this option so as not to lose transactions when all candidate replicas are lagging.
  • InstancePollSeconds - how often the topology is built and refreshed (see the config fragment after this list).
  • RecoveryPollSeconds - how often the topology is analyzed. If a problem is detected, topology recovery starts. This is a constant equal to 1 second.
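
In the configuration file these options are plain JSON keys. The fragment below is only an illustration: the poll interval shown is the upstream default rather than our value, and in practice it would be merged with the rest of the configuration.

    # Illustrative fragment of orchestrator.conf.json with the discussed keys.
    echo '{
      "InstancePollSeconds": 5,
      "DelayMasterPromotionIfSQLThreadNotUpToDate": true
    }' > /etc/orchestrator.conf.json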

Each cluster node is polled by the orchestrator once every InstancePollSeconds seconds. When a problem is detected, the cluster state is forcibly re-checked, and only then is the final decision made to perform recovery. By experimenting with various database and orchestrator parameters, we were able to reduce the detection and recovery time to 30 seconds.

Test Bench


We started testing the HA scheme by building a local test bench and then rolling it out to the test and production environments. The local bench is fully automated with Docker and lets us experiment with the orchestrator and network configuration, scale the cluster from 2-3 servers to several dozen, and run drills in a safe environment.

During the drills, we choose one of the ways to emulate the problem: kill the master instantly with kill -9, stop the process gracefully and shut down the server (docker-compose stop), or simulate network problems with iptables -j REJECT or iptables -j DROP (a sketch of these commands follows the list). We expect the following results:

  • the orchestrator detects the problem with the master and updates the topology in no more than 10 seconds;
  • the recovery procedure starts automatically: the network configuration changes, the master role is transferred to a replica, and the topology is rebuilt;
  • the new master becomes writable, and live replicas are not lost during the rebuild;
  • data starts being written to the new master and replicated;
  • the total recovery time is no more than 30 seconds.
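
The failure emulation itself boils down to a few commands; the container and service names belong to a hypothetical Compose setup, and the iptables variants require NET_ADMIN inside the container.

    # 1. Kill the master instantly (the "kill -9" scenario).
    docker kill --signal=SIGKILL mysql-master

    # 2. Gracefully stop the master's container.
    docker-compose stop mysql-master

    # 3. Network problems: actively reject or silently drop MySQL traffic.
    docker exec mysql-master iptables -A INPUT -p tcp --dport 3306 -j REJECT
    docker exec mysql-master iptables -A INPUT -p tcp --dport 3306 -j DROP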

As you know, a system can behave differently in test and production environments because of differences in hardware and network configuration, between synthetic and real load, and so on. Therefore, we periodically run drills in real conditions and check how the system behaves when network connectivity is lost or its individual parts degrade. In the future, we want to build completely identical infrastructure for both environments and automate its testing.

Conclusions


Keeping the main node of the storage system operational is one of the primary tasks of the SRE and operations teams. Introducing the orchestrator and the VIP-based HA solution allowed us to achieve the following results:

  • reliable detection of problems with the topology of the database cluster;
  • automatic and quick response to incidents related to the master, which reduces system downtime.

However, the solution has its limitations and disadvantages:

  • scaling the HA scheme to several data centers will require a single L2 network between them;
  • before assigning the VIP to the new master, we have to release it on the old one. The process is sequential, which increases recovery time;
  • operations with the VIP require SSH access to the server, or some other way of invoking remote procedures. In addition, if the old master is unavailable, we cannot release the VIP on it, and when it comes back online we may end up with two servers holding the same IP address, which can lead to split brain.

To avoid split brain, you can use the STONITH method (“Shoot The Other Node In The Head”), which completely isolates or shuts down the problem node. There are other ways to achieve high availability for a cluster: a combination of VIP and DNS, service discovery and proxy services, synchronous replication, and other methods, each with its own drawbacks and advantages.

I have described our approach to building a MySQL failover cluster. It is easy to implement and provides an acceptable level of reliability in our current environment. As the system as a whole, and the infrastructure in particular, evolves, this approach will undoubtedly evolve with it.
