Tips & tricks for working with Ceph in busy projects



When using Ceph as network storage in projects with varying workloads, we may run into tasks that at first glance seem neither simple nor trivial. For instance:

  • migrating data from an old Ceph cluster to a new one, partially reusing the old servers in the new cluster;
  • solving the problem of disk space distribution in Ceph.

When dealing with such tasks, we face the need to remove an OSD correctly and without data loss, which is especially important when large amounts of data are involved. This is what the article is about.

The methods described below are relevant for any version of Ceph. They also take into account that Ceph may store a large amount of data: to prevent data loss and other problems, some actions are split into several smaller ones.

A preface about OSD


Since two of the three recipes below deal with OSDs (Object Storage Daemons), before diving into the practical part let's briefly cover what an OSD is in Ceph and why it is so important.

First of all, it should be said that the entire Ceph cluster consists of many OSDs. The more of them there are, the more free data volume there is in Ceph. From this it is easy to understand the main function of an OSD: it stores Ceph object data on the file systems of the cluster nodes and provides network access to it (for reading, writing, and other requests).

Replication parameters are also set at this level, by copying objects between different OSDs. And this is where you can run into various problems, whose solutions are discussed below.

Case No. 1. Safely retrieve OSD from Ceph cluster without data loss


The need to remove an OSD may arise when a server is withdrawn from the cluster, for example, to replace it with another server, which is what happened to us and prompted this article. Thus, the ultimate goal of the operation is to remove all the OSDs and mons on a given server so that it can be stopped.

For convenience, and to avoid a situation where we specify the wrong OSD while executing commands, let's set a separate variable holding the number of the OSD to be deleted. We will call it ${ID}; hereinafter, this variable stands for the number of the OSD we are working with.
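For example, if osd.1 (the one shown on hv-2 in the listing below) is the OSD to be removed, the variable can be set like this:

ID=1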

Let's look at the condition before starting work:

root@hv-1 ~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.46857 root default
-3       0.15619      host hv-1
-5       0.15619      host hv-2
 1   ssd 0.15619      osd.1     up     1.00000  1.00000
-7       0.15619      host hv-3
 2   ssd 0.15619      osd.2     up     1.00000  1.00000

To initiate the removal of an OSD, you need to smoothly bring its reweight down to zero. This gradually reduces the amount of data on the OSD by rebalancing it to other OSDs. To do this, execute the following commands:

ceph osd reweight osd.${ID} 0.98
ceph osd reweight osd.${ID} 0.88
ceph osd reweight osd.${ID} 0.78

... and so on to zero.
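If there are many steps, the same gradual reduction can be scripted. A minimal sketch (the step values and pause are arbitrary examples; you may prefer to watch ceph -s instead of a fixed sleep):

# step the reweight down gradually, pausing so rebalancing can catch up
for W in 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0; do
    ceph osd reweight osd.${ID} ${W}
    sleep 600    # give the cluster time to rebalance before the next step
done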

UPDATED: Commenters on the article pointed out the approach with norebalance + backfill. It is a valid solution, but you should first assess the situation: norebalance is used when we don't want any OSD rebalancing to create network load, while osd_max_backfills is used when the rebalance speed needs to be limited. As a result, rebalancing proceeds more slowly and creates less network load.
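A rough sketch of that alternative (the flag and option are standard Ceph ones, but the value is only an illustration; tune it to your cluster):

# cap the number of concurrent backfills per OSD (illustrative value)
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
# temporarily prevent automatic rebalancing while you prepare the change
ceph osd set norebalance
# ...and re-enable it when you are ready for the data to move
ceph osd unset norebalance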

Smooth rebalancing is necessary so as not to lose data; this is especially important if the OSD contains a large amount of data. To make sure the reweight commands were successful, you can either run ceph -s or run ceph -w in a separate terminal window to observe the changes in real time.

When the OSD is "empty", you can begin the standard procedure to remove it. To do this, put the desired OSD into the down state:

ceph osd down osd.${ID}

"Pull" the OSD out of the cluster:

ceph osd out osd.${ID}

Stop the OSD service and unmount its partition in the FS:

systemctl stop ceph-osd@${ID}
umount /var/lib/ceph/osd/ceph-${ID}

Remove the OSD from the CRUSH map:

ceph osd crush remove osd.${ID}

Delete the OSD user:

ceph auth del osd.${ID}

And finally, delete the OSD itself:

ceph osd rm osd.${ID}

Note: if you are using Ceph Luminous or later, the above steps for removing an OSD can be reduced to two commands:

ceph osd out osd.${ID}
ceph osd purge osd.${ID}

If, after performing the above steps, you run ceph osd tree, you should see that the server where the work was performed no longer has the OSDs that were removed:

root@hv-1 ~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
-1       0.46857      root default
-3       0.15619      host hv-1
-5       0.15619      host hv-2
-7       0.15619      host hv-3
 2   ssd 0.15619      osd.2    up     1.00000  1.00000

Along the way, note that the Ceph cluster will go into the HEALTH_WARN state, and we will also see a decrease in the number of OSDs and in the amount of available disk space.

Next we describe the steps required if you want to stop the server completely and, accordingly, remove it from Ceph. In this case, it is important to remember that all OSDs on the server must be removed before shutting it down.

Once there are no OSDs left on the server (hv-2 in our example), exclude it from the CRUSH map by running the following command:

ceph osd crush rm hv-2

Then delete the mon from the server hv-2 by running the command below on another server (i.e., in this case, on hv-1):

ceph-deploy mon destroy hv-2

After that, you can stop the server and proceed with the subsequent actions (its redeployment, etc.).

Case No. 2. Disk space allocation in an already created Ceph cluster


The second story begins with a preface about PGs (Placement Groups). The main role of a PG in Ceph is to aggregate Ceph objects and replicate them across OSDs. The formula for calculating the required number of PGs can be found in the corresponding section of the Ceph documentation, where this question is also analyzed with specific examples.
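For reference, the rule of thumb from the Ceph documentation (given here only as an approximation) is: total PGs ≈ (number of OSDs × 100) / replica count, rounded up to the nearest power of two. A quick illustration in shell:

# e.g. 9 OSDs with a pool size (replica count) of 3:
echo $(( 9 * 100 / 3 ))    # 300 -> round up to the nearest power of two: 512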

So: one of the common problems during Ceph operation is an unbalanced number of OSDs and PGs between the pools in Ceph. In general, a correctly chosen PG count is key to reliable cluster operation; below we will look at what can happen otherwise.

The difficulty in choosing the right number of PGs lies in two things:

  1. Too few PGs lead to uneven data distribution across the OSDs, since a PG is essentially a chunk of data.
  2. Too many PGs, on the other hand, lead to increased resource consumption (CPU and memory).

In practice, a more serious problem arises: data overflow on one of the OSDs. The reason is that, when calculating the available amount of data (you can see it as MAX AVAIL in the output of the ceph df command, for each pool separately), Ceph relies on the amount of space available on the OSDs. If there is not enough space on even one OSD, no more data can be written until the data is properly redistributed among all the OSDs.
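To see where the space actually went, it helps to compare the per-pool and per-OSD views (standard commands):

ceph df        # per-pool USED and MAX AVAIL, limited by the fullest OSD
ceph osd df    # per-OSD utilization, to spot the OSD that is nearly full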

It is worth clarifying that these problems are largely addressed at the Ceph cluster configuration stage. One of the tools you can use is Ceph PGCalc. It helps to visually calculate the required number of PGs. However, you may also resort to it in a situation where the Ceph cluster is already configured incorrectly. Note that, as part of the correction, you will most likely need to reduce the number of PGs, and this feature is not available in older versions of Ceph (it appeared only in the Nautilus release).

So, let's imagine the following picture: the cluster has the HEALTH_WARN status because one of the OSDs is running out of space. This is indicated by the error HEALTH_WARN: 1 near full osd. Below is an algorithm for getting out of this situation.

First of all, you need to redistribute the data among the remaining OSDs. We already performed a similar operation in the first case, when we "drained" the node; the only difference is that now the reweight needs to be reduced only slightly, for example, to 0.95:

ceph osd reweight osd.${ID} 0.95

This frees up disk space on the OSD and fixes the error in ceph health. However, as already mentioned, this problem mainly arises from incorrect Ceph configuration in the initial stages: it is very important to reconfigure the cluster so that the problem does not reappear in the future.

In our particular case, it came down to the following (the current values can be checked as shown right after this list):

  • too high a replication factor in one of the pools;
  • too many PGs in one pool and too few in another.
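The current values are easy to check before changing anything (a quick illustration; $pool_name stands for your pool):

ceph osd pool get $pool_name size     # current replication factor
ceph osd pool get $pool_name pg_num   # current number of PGs
ceph osd pool ls detail               # the same information for all pools at once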

We will use the already mentioned calculator. It clearly shows what needs to be entered and, in principle, there is nothing complicated. Having set the necessary parameters, we get the following recommendations:

Note: if you are configuring a Ceph cluster from scratch, another useful feature of the calculator is generating the commands that will create the pools from scratch with the parameters specified in the table.

The last column, Suggested PG Count, is the one to focus on. In our case, the second column, which shows the replication parameter, is also useful, since we decided to change the replication factor.

So, first you need to change the replication settings; this is worth doing first of all, since by reducing the replication factor we free up disk space. As the command executes, you will notice the amount of available disk space increasing:

ceph osd pool set $pool_name size $replication_size

And after it completes, change the pg_num and pgp_num parameters as follows:

ceph osd pool set $pool_name pg_num $pg_number
ceph osd pool set $pool_name pgp_num $pg_number

Important: the number of PGs must be changed sequentially, pool by pool, without changing the values in other pools until the "Degraded data redundancy" and "n-number of pgs degraded" warnings disappear.
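One way to wait for the cluster to settle before moving on to the next pool is a simple polling loop (a rough sketch; adjust the check and interval to taste):

# wait until the "degraded" warnings clear before touching the next pool
while ceph health detail | grep -qi 'degraded'; do
    sleep 60
done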

You can also verify that everything was successful from the output of the ceph health detail and ceph -s commands.

Case No. 3. Virtual Machine Migration from LVM to Ceph RBD


When a project uses virtual machines installed on rented bare-metal servers, the question of fault-tolerant storage often comes up. It is also highly desirable that this storage has enough space... Another common situation: there is a virtual machine with local storage on the server and you need to expand its disk, but there is nowhere to grow, because there is no free disk space left on the server.

The problem can be solved in different ways - for example, by migrating to another server (if there is one) or adding new disks to the server. But it is not always possible to do this, so migration from LVM to Ceph can be an excellent solution to this problem. By choosing this option, we also simplify the further process of migration between servers, since there will be no need to move local storage from one hypervisor to another. The only catch is that you will have to stop the VM for the duration of the work.

The recipe below is based on an article from this blog, whose instructions we tested in practice. By the way, a method for migration without downtime is also described there; however, in our case it simply wasn't needed, so we did not check it. If this is critical for your project, we will be glad to hear about the results in the comments.

Let's get down to the practical part. In the example, we use virsh and, accordingly, libvirt. First, make sure that the Ceph pool to which the data will be migrated is connected to libvirt:

virsh pool-dumpxml $ceph_pool

The pool description should contain the Ceph connection details, including the authentication data.
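A couple of quick checks at this point can save time later (illustrative; these are standard virsh commands):

virsh pool-list --all    # the Ceph pool should be listed and active
virsh secret-list        # libvirt should already hold a secret for the Ceph user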

The next step is to convert the LVM image to Ceph RBD. The conversion time depends primarily on the size of the image:

qemu-img convert -p -O rbd /dev/main/$vm_image_name rbd:$ceph_pool/$vm_image_name
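Once the conversion finishes, it is worth confirming that the image actually appeared in the pool (standard rbd commands; the names are the same placeholders as above):

rbd ls $ceph_pool                     # the new image should be listed
rbd info $ceph_pool/$vm_image_name    # size and features of the converted image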

After the conversion, the LVM image will remain in place, which is useful if the VM migration to RBD fails and you have to roll back the changes. To be able to roll back quickly, let's also back up the virtual machine's configuration file:

virsh dumpxml $vm_name > $vm_name.xml
cp $vm_name.xml $vm_name_backup.xml

... and edit the original ($vm_name.xml). Find the block describing the disk (it starts with the line <disk type='file' device='disk'> and ends with </disk>) and bring it to the following form:

<disk type='network' device='disk'>
<driver name='qemu'/>
<auth username='libvirt'>
  <secret type='ceph' uuid='sec-ret-uu-id'/>
 </auth>
<source protocol='rbd' name='$ceph_pool/$vm_image_name'>
  <host name='10.0.0.1' port='6789'/>
  <host name='10.0.0.2' port='6789'/>
</source>
<target dev='vda' bus='virtio'/> 
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

Let's take a look at some of the details:

  1. The source block specifies the rbd protocol and the address of the storage in Ceph RBD (the Ceph pool name and the RBD image name, which was determined at the first stage).
  2. The secret block specifies the ceph type, as well as the UUID of the secret used to connect to it. The UUID can be found with the virsh secret-list command.
  3. The host blocks specify the addresses of the Ceph monitors.

After editing the configuration file and completing the conversion of LVM to RBD, you can apply the modified configuration file and start the virtual machine:

virsh define $vm_name.xml
virsh start $vm_name

It's time to verify that the virtual machine started correctly: you can find out, for example, by connecting to it via SSH or through virsh.
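For example (one quick way to check, assuming the same $vm_name placeholder as above):

virsh domstate $vm_name    # should report "running"
virsh console $vm_name     # attach to the serial console, if one is configured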

If the virtual machine is working correctly and you did not find other problems, then you can delete the LVM image that is no longer in use:

lvremove main/$vm_image_name

Conclusion


We encountered all the described cases in practice, and we hope these instructions will help other administrators solve similar problems. If you have comments or similar stories from your experience operating Ceph, we will be glad to see them in the comments!
