Our experience migrating Cassandra between Kubernetes clusters without data loss



For about six months we had been using the Rook operator to run Cassandra in Kubernetes. But when we needed to perform a seemingly trivial operation, changing parameters in the Cassandra config, it turned out that the operator does not provide enough flexibility. To make the change, we had to clone the repository, edit the sources and rebuild the operator (the config is baked into the operator itself, so knowledge of Go comes in handy). All of this takes a lot of time.

We had already reviewed the existing operators, and this time we settled on CassKop from Orange, which supports the capabilities we needed, in particular custom configs and out-of-the-box monitoring.

Task


In the real-world story described below, we decided to combine the change of operator with a pressing need to move the entire client infrastructure to a new cluster. After the main workloads had been migrated, Cassandra was the only important application left, and losing its data was, of course, unacceptable.

Requirements for its migration:

  • Maximum downtime of 2-3 minutes, so that the transfer can happen at the same moment the application itself is rolled out to the new cluster;
  • Transfer all data without loss and without extra headaches (i.e. without any additional manipulations).

How do you carry out such an operation? By analogy with RabbitMQ and MongoDB, we decided to launch a new Cassandra installation in the new Kubernetes cluster, then merge the two Cassandra installations living in different clusters and transfer the data, finishing the whole process by simply shutting down the original installation.

However, this was complicated by the fact that the pod networks of the two Kubernetes clusters overlap, so it was not so easy to set up connectivity between them. We would have had to register routes for every pod on every node, which is very time-consuming and not reliable at all. The thing is that pod IPs are only reachable from the masters, while Cassandra runs on dedicated nodes; so we would first have to configure a route to the master and, on the master, a route to the other cluster. On top of that, restarting a pod changes its IP, and that is yet another problem... Why? Read on.
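
For illustration, this is roughly what the manual routing would look like, repeated for every Cassandra pod on every node (the master address here is purely an assumption; 10.244.5.5 is one of the Cassandra pod IPs shown later):

# Hypothetical example: reach a Cassandra pod in the other cluster through that cluster's master.
ip route add 10.244.5.5/32 via 192.168.0.10   # 192.168.0.10 is an assumed master IP

Multiply this by the number of pods and nodes, remember that a pod restart changes the IP, and the approach quickly becomes unmanageable.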

In the practical part of the article that follows, three names will be used for the Cassandra installations:

  • Cassandra-new - the new installation that we will launch in the new Kubernetes cluster;
  • Cassandra-current - the old installation that applications are currently working with;
  • Cassandra-temporary - a temporary installation that we will run next to Cassandra-current and use only for the migration process itself.

What to do?


Since Cassandra-current uses local storage, simply moving its data disks to the new cluster (as could be done, for example, with vSphere disks) is impossible. To solve this problem, we will create a temporary cluster and use it as a kind of buffer for the migration.

The general sequence of actions is reduced to the following steps:

  1. Raise Cassandra-new with the new operator in the new cluster.
  2. Scale the Cassandra-new cluster to 0 (see the scaling sketch after this list).
  3. Take the new disks created for it (via PVC) and make them available to the old cluster.
  4. Raise Cassandra-temporary next to Cassandra-current, using the disks prepared for Cassandra-new.
  5. Scale the Cassandra-temporary operator to 0 (otherwise it will revert our changes) and adjust the Cassandra-temporary configuration so that Cassandra-temporary joins Cassandra-current. Cassandra will treat it as a second data center (this works because both installations share the same Cassandra cluster name).
  6. Transfer data between Cassandra-temporary and Cassandra-current data centers .
  7. Scale the Cassandra-current and Cassandra-temporary clusters to 0 and start Cassandra-new in the new cluster, not forgetting to move the disks over. In parallel, we roll the applications out to the new cluster.
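
For reference, here is a minimal sketch of what "scaling to 0" can look like when the Cassandra pods are managed by a StatefulSet; the resource names are assumptions, and the operator has to be stopped first so that it does not scale the pods back up:

# Stop the operator so that it does not undo the change (the deployment name is an assumption).
kubectl -n cassandra scale deployment/cassandra-operator --replicas=0

# Scale the Cassandra StatefulSet itself down to 0 (the name is an assumption).
kubectl -n cassandra scale statefulset/cassandra-new-dc1-rack1 --replicas=0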

As a result of such manipulations, downtime will be minimal.

In detail


There shouldn't be any problems with the first three steps: everything is done quickly and easily.

At this point, the Cassandra-current cluster will look something like this:

Datacenter: x1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.244.6.5  790.7 GiB  256          ?       13cd0c7a-4f91-40d0-ac0e-e7c4a9ad584c  rack1
UN  10.244.7.5  770.9 GiB  256          ?       8527813a-e8df-4260-b89d-ceb317ef56ef  rack1
UN  10.244.5.5  825.07 GiB  256          ?       400172bf-6f7c-4709-81c6-980cb7c6db5c  rack1

To verify that everything works as expected, let's create a keyspace in Cassandra-current. This is done before launching Cassandra-temporary:

CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2};

Next, create a table and fill it with data:

use example;
CREATE TABLE example(id int PRIMARY KEY, name text, phone varint);
INSERT INTO example(id, name, phone) VALUES(1,'Masha', 983123123);
INSERT INTO example(id, name, phone) VALUES(2,'Sergey', 912121231);
INSERT INTO example(id, name, phone) VALUES(3,'Andrey', 914151617);

Run Cassandra-temporary, remembering that we have already started Cassandra-new in the new cluster (step #1) and then turned it off (step #2).

Notes:

  1. When starting Cassandra-temporary, we must specify the same cluster name as in Cassandra-current. This can be done via the CASSANDRA_CLUSTER_NAME variable.
  2. For Cassandra-temporary to see the current cluster, we need to set the seeds. This is done via the CASSANDRA_SEEDS variable or through the config (see the sketch below).
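
As a rough sketch, both variables could be set on the Cassandra-temporary StatefulSet like this (the namespace, StatefulSet name and cluster name are assumptions; the seed IPs are the Cassandra-current pods from the nodetool status output above):

# Point Cassandra-temporary at the existing cluster via environment variables.
kubectl -n cassandra set env statefulset/cassandra-temporary-dc1-rack1 \
  CASSANDRA_CLUSTER_NAME="main-cluster" \
  CASSANDRA_SEEDS="10.244.6.5,10.244.7.5,10.244.5.5"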

Attention! Before you start moving data, make sure that the read and write consistency levels are set to LOCAL_ONE or LOCAL_QUORUM.
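
With LOCAL_ONE or LOCAL_QUORUM, requests are acknowledged by nodes of the local data center only, so clients will not start waiting for replies from the second data center during the migration. The consistency level is configured on the application side (in the driver); for manual checks it can also be set per cqlsh session, for example:

# Set the session consistency level in cqlsh before running test queries
# (the address is one of the Cassandra-current pods).
cqlsh 10.244.6.5 -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM example.example;"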

After Cassandra-temporary starts, the cluster should look like this (note the appearance of a second data center with 3 nodes):

Datacenter: x1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.244.6.5  790.7 GiB  256          ?       13cd0c7a-4f91-40d0-ac0e-e7c4a9ad584c  rack1
UN  10.244.7.5  770.9 GiB  256          ?       8527813a-e8df-4260-b89d-ceb317ef56ef  rack1
UN  10.244.5.5  825.07 GiB  256          ?       400172bf-6f7c-4709-81c6-980cb7c6db5c  rack1

Datacenter: x2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.244.16.96  267.07 KiB  256          64.4%             3619841e-64a0-417d-a497-541ec602a996  rack1
UN  10.244.18.67  248.29 KiB  256          65.8%             07a2f571-400c-4728-b6f7-c95c26fe5b11  rack1
UN  10.244.16.95  265.85 KiB  256          69.8%             2f4738a2-68d6-4f9e-bf8f-2e1cfc07f791  rack1

Now we can carry out the transfer. To do this, let's first transfer the test keyspace and make sure that everything is fine:

ALTER KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2, 'x2': 2};


After that, in each Cassandra-temporary pod, execute the command:

nodetool rebuild -ks example x1
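
With several pods this is easy to script via kubectl; the namespace and label selector below are assumptions:

# Run the rebuild from data center x1 in every Cassandra-temporary pod.
for pod in $(kubectl -n cassandra get pods -l app=cassandra-temporary -o name); do
  kubectl -n cassandra exec "$pod" -- nodetool rebuild -ks example x1
done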

Let's go into any Cassandra-temporary pod and check that the data has been transferred. You can also add one more row to Cassandra-current to verify that new data is being replicated:

SELECT * FROM example;

 id | name   | phone
----+--------+-----------
  1 |  Masha | 983123123
  2 | Sergey | 912121231
  3 | Andrey | 914151617

(3 rows)

After that, you can run the same ALTER for all the keyspaces in Cassandra-current and execute nodetool rebuild for each of them.
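
A rough sketch of how this could be scripted, assuming the same replication factor of 2 as in the example keyspace and the same hypothetical namespace and label selector as above (system keyspaces are skipped):

# For every non-system keyspace: extend replication to the new data center (x2)...
for ks in $(cqlsh 10.244.6.5 -e "DESCRIBE KEYSPACES" | tr -s ' ' '\n' | grep -v '^system' | grep -v '^$'); do
  cqlsh 10.244.6.5 -e "ALTER KEYSPACE $ks WITH replication = {'class': 'NetworkTopologyStrategy', 'x1': 2, 'x2': 2};"
  # ...and then rebuild it from x1 on each Cassandra-temporary pod.
  for pod in $(kubectl -n cassandra get pods -l app=cassandra-temporary -o name); do
    kubectl -n cassandra exec "$pod" -- nodetool rebuild -ks "$ks" x1
  done
done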

Lack of space and memory


At this stage, it is worth remembering that while a rebuild is running, temporary files are created that are comparable in size to the keyspace itself! We ran into the problem that the largest keyspace was 350 GB, while there was less free disk space than that.

It was not possible to expand the disk, because local storage is used. The following command came to the rescue (executed in each Cassandra-current pod):

nodetool clearsnapshot

This freed up space: in our case we got 500 GB of free disk space instead of the 200 GB available before.
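
If you want to see beforehand how much space the snapshots occupy, Cassandra can list them:

# Show the existing snapshots and their size before deleting them.
nodetool listsnapshots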

However, even though there was now enough space, the rebuild operation kept causing the Cassandra-temporary pods to restart with an error:

failed; error='Cannot allocate memory' (errno=12)

We solved it by creating a DaemonSet that rolls out only to the nodes running Cassandra-temporary and executes:

sysctl -w vm.max_map_count=262144
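
A minimal sketch of such a DaemonSet (the namespace, labels and nodeSelector are assumptions; a privileged container applies the sysctl on the node and then simply sleeps):

kubectl -n cassandra apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cassandra-sysctl
spec:
  selector:
    matchLabels:
      name: cassandra-sysctl
  template:
    metadata:
      labels:
        name: cassandra-sysctl
    spec:
      nodeSelector:
        role: cassandra-temporary      # hypothetical label of the nodes running Cassandra-temporary
      containers:
      - name: sysctl
        image: alpine:3.12
        command:
        - /bin/sh
        - -c
        - "sysctl -w vm.max_map_count=262144 && while true; do sleep 3600; done"
        securityContext:
          privileged: true
EOF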

Finally, all the data has been migrated!

Cluster switching


All that remained was to switch Cassandra over, which was done in 5 steps:

  1. Scale Cassandra-temporary and Cassandra-current to 0 (do not forget that the operator is still running here!).
  2. Switch the disks over (it comes down to setting the PVs for Cassandra-new).
  3. Start Cassandra-new, checking that the right disks are attached.
  4. Run ALTER on all the keyspaces to remove the old data center:

    ALTER KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'x2': 2};
  5. Remove all the nodes of the old cluster. To do this, just run the following command in one of the Cassandra-new pods:

    nodetool removenode 3619841e-64a0-417d-a497-541ec602a996

The total Cassandra downtime was about 3 minutes: essentially the time it took to stop and start the containers, since the disks had been prepared in advance.

Final touch with Prometheus


However, that was not the end of the story. Cassandra-new comes with a built-in exporter (see the documentation of the new operator), which we, of course, used. About an hour after the launch, alerts about Prometheus being unavailable started to arrive. After checking the load, we saw that memory consumption on the nodes running Prometheus had grown.

Further investigation showed that the number of collected metrics had grown 2.5 times (!). Cassandra was to blame: just over 500 thousand metrics were being collected from it.

We audited the metrics and disabled the ones we did not consider necessary via the ConfigMap (which, by the way, is where the exporter is configured). The result: 120 thousand metrics and a significantly reduced load on Prometheus, while the important metrics remain.
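
Such an audit can start with a simple count of the time series coming from the Cassandra exporter; below is one way to ask Prometheus for it (the Prometheus address and the job name are assumptions):

# Count how many time series the Cassandra scrape job currently produces.
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=count({job="cassandra"})'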

Conclusion


In the end, we managed to move Cassandra to another cluster with practically no impact on the production Cassandra installation and without disrupting the client applications. Along the way, we concluded that using the same pod network in both clusters is not a good idea (we now pay more attention to this when planning a cluster installation).

Finally: why didn't we use nodetool snapshot, which was mentioned in a previous article? The thing is that this command creates a snapshot of a keyspace in the state it was in at the moment the command is run. Besides:

  • taking a snapshot and transferring it takes much more time;
  • everything written to Cassandra in the meantime would be lost;
  • the downtime in our case would have been about an hour, instead of the 3 minutes that we successfully combined with rolling the application out to the new cluster.
