Accelerating the Qemu KVM disk subsystem on Linux



From time to time I take on server setup jobs. A while ago, the owner of a small hosting company approached me with an interesting problem: he wanted to run Windows virtual machines under KVM on his servers, which already had Ubuntu 18.04 installed.

However, his testing showed that the KVM disk subsystem lagged noticeably behind the numbers he was getting under Hyper-V. He wanted to get the most out of qemu on his Ubuntu servers so that he would not have to buy expensive Windows Server licenses (the free edition of Microsoft Hyper-V Server did not work out because of its limitations).

0. Starting point


For testing we used a Samsung 970 Pro 1TB SSD. The customer checked the results in CrystalDiskMark, so all the graphs further in the article are from it.

[CrystalDiskMark results on Windows 10 LTSC: Hyper-V (2 CPU) vs. KVM (2 CPU)]
The first step was to improve random I/O performance. This type of load is typical for virtual machines, especially those that run databases.

Ubuntu (16.04 LTS and 18.04) still ships qemu version 2.11, so some of the newest qemu features are not covered in this article.

We decided to avoid tying the virtual machines to specific hardware, since that hurts their portability, so passing SSDs, physical disks or partitions directly into the virtual machines was considered undesirable.


1. Use LVM volumes rather than files to store virtual machine disks


The logic is this: the file with the virtual disk lives in a Linux file system, and NTFS lives inside that file. Each file system consumes resources during disk operations, so the shallower this nesting, the faster the I/O.

As for qcow2 files, the name stands for “Qemu Copy-On-Write”, and they effectively carry their own translation table that keeps track of which blocks are in use, which are free, and where everything is located.

The LVM layer consumes far less CPU than a file system. One of the reasons is that its blocks are much larger than a typical file system block (4KB). The larger the block (extent) on the physical LVM device, the faster I/O is and the less fragmentation there is.

But even on an SSD, random I/O is much slower than sequential I/O, so when creating the Volume Group we specify a very large extent size: 256MB.

Read ahead should be disabled on the logical volume, because it wastes I/O for no gain: nobody defragments Windows disks on SSDs anymore.

LVM is quite convenient for hosting virtual machines: volumes are easily movable between physical disks, there are snapshots, and volumes can be resized online. Moreover, virt-manager (libvirt) can create logical volumes for virtual machine disks from a volume group out of the box.

The ability to create thin volumes also looks attractive, but since a thin volume is an extra layer of abstraction, it will obviously hurt I/O performance. Besides, libvirt has no elegant way to automatically create disks for virtual machines in a thin pool.

# Create a physical volume on the SSD partition for the volume group
pvcreate /dev/nvme1n1p1

# Create the volume group "win_pool" with a 256MB extent size
vgcreate -s 256M win_pool /dev/nvme1n1p1

# Create the 100GB logical volume "vm1" for drive C
lvcreate -n vm1 -L 100G win_pool

# Disable read ahead on the logical volume
lvchange -r none /dev/win_pool/vm1
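
Since virt-manager works with storage through libvirt pools, the volume group can also be registered as a libvirt pool of type "logical". A minimal sketch (the pool name here simply mirrors the win_pool VG created above, and the 50GB volume is only an example):

# Register the existing volume group as a libvirt storage pool
virsh pool-define-as win_pool logical --source-name win_pool --target /dev/win_pool
virsh pool-start win_pool
virsh pool-autostart win_pool

# After that, new VM disks can be created from virt-manager or from the shell
virsh vol-create-as win_pool vm2 50G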

1.1. Thin volumes as disks and/or logical volume settings for snapshots


If you plan to use a thin pool and create thin volumes in it, it makes sense to set the pool chunk size to 4MB, which is much larger than the default of 64KB. This makes this layer of abstraction work faster.

The snapshot mechanism in LVM is based on almost the same code as thin volumes, so the same settings apply if you want faster snapshots.

lvcreate -c 4m -L 300G -T -Zn win_pool/thin

The -Zn option disables zeroing out chunks when they are allocated, which speeds things up.
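
For reference, a thin volume for a VM disk can then be carved out of this pool roughly as follows (the volume name and sizes are placeholders, not taken from the original setup):

# Create a 100GB thin volume "vm2" in the thin pool win_pool/thin
lvcreate -V 100G -T win_pool/thin -n vm2

# A thin snapshot of it needs no pre-allocated size and is nearly instant
lvcreate -s -n vm2_snap win_pool/vm2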

Settings for lvm.conf or a similar file (e.g. lvmlocal.conf), in the allocation section:

thin_pool_chunk_size = 4096
thin_pool_zero = n

You can determine the optimal chunk size experimentally by running a test like the one below with different --blocksize values:

fio --name=randwrite --filename=/dev/nvme0n1p9 --size=10G --ioengine=libaio --iodepth=1 --buffered=0 --direct=1 --rw=randwrite --blocksize=4m
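
If you want to compare several candidate chunk sizes in one go, a small loop over --blocksize works; the device path and the list of sizes below are only examples:

# Sweep a few block sizes; pick the smallest one that still gives
# close to the maximum throughput
for bs in 64k 256k 1m 4m; do
    fio --name=randwrite-$bs --filename=/dev/nvme0n1p9 --size=10G \
        --ioengine=libaio --iodepth=1 --buffered=0 --direct=1 \
        --rw=randwrite --blocksize=$bs
done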

You can view the current size of the chunk with the command:

lvs -ao name,chunksize

2. Increasing the number of logical processors allocated to each KVM virtual machine improves disk performance


[CrystalDiskMark results: KVM with 10, 8 and 4 CPUs]

It is clear that hardly anyone will allocate 10 processors to the virtual machine, but it was interesting to look at the extreme case.

This already depends on how many free processors you have; in my opinion it makes little sense to allocate more than 4. We got the maximum random read and write performance at 8, which is a peculiarity of CrystalDiskMark 6.0.2: its second test runs in 8 threads.

From this we can conclude that it is good to have one logical processor for every task that actively performs I/O.
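
The vCPU count itself can be changed in virt-manager or with virsh; a minimal sketch, assuming the domain is called vm_name as in the virsh examples below:

# Raise the maximum and current vCPU count in the persistent config;
# takes effect on the next start of the VM
virsh setvcpus vm_name 4 --maximum --config
virsh setvcpus vm_name 4 --config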

3. Use huge memory pages (hugepages) to avoid performance degradation due to RAM fragmentation


This package can come in handy when you need various information about hugepages at runtime:

apt install hugepages

Edit /etc/default/grub:

GRUB_CMDLINE_LINUX="default_hugepagesz=1GB hugepagesz=1G hugepages=64"

In this case, 64GB of memory was reserved as hugepages for all the virtual machines. In your case it may be more or less.

We apply these settings to GRUB so that the next time the system boots, they become active:

grub-mkconfig -o /boot/grub/grub.cfg
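
After a reboot it is worth checking that the pages were actually reserved; with the parameters above /proc/meminfo should report 64 huge pages of 1GB each:

# HugePages_Total should be 64 and Hugepagesize should be 1048576 kB
grep -i huge /proc/meminfo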

Editing the virtual machine config:

virsh edit vm_name

Add:

<memoryBacking>
  <hugepages/>
</memoryBacking>

4. Add a dedicated I/O thread to each virtual machine


You need to add the <iothreads> element and the io and iothread attributes on the disk driver line, as shown below. We edit the config with virsh, as in the previous section.

<iothreads>1</iothreads>

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='threads' iothread='1'/>
  <source dev='/dev/win/terminal'/>
  <target dev='vda' bus='virtio'/>
  <boot order='2'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>
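
Once the VM is running, you can verify from the host that the I/O thread exists and, if you like, pin it to a specific host core (the CPU number below is just an example):

# List the I/O threads of the running domain
virsh iothreadinfo vm_name

# Optionally pin I/O thread 1 to host CPU 2
virsh iothreadpin vm_name 1 2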


4.1. writeback


To speed up random writes to disk, at the cost of a higher risk of data loss, you can use cache=writeback in the config from the previous section. It should be used only if you have high confidence in the quality (and redundancy) of the power supply and if backups are in place.

5. Disk subsystem settings in Virt Manager


Disk bus: VirtIO
Storage format: raw
Cache mode: writeback
IO mode: threads
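
These GUI settings end up as attributes of the disk's <driver> element; you can check what was actually written to the domain XML with virsh:

# Show the disk definitions that virt-manager generated
virsh dumpxml vm_name | grep -A 3 "<disk"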

5.1. Configuring a disk subsystem through a configuration file


Qemu 2.11 (currently used by Ubuntu) supports two types of virtual disk devices: virtio-blk and virtio-scsi. When Disk bus: VirtIO is selected in Virt Manager, the virtio-blk device is used.

virtio-blk is faster in every case, even though in the qemu version tested it did not yet support TRIM, unlike virtio-scsi (virtio-blk has supported it since version 5.x).

In terms of disk I/O speed, virtio-scsi only makes sense in exotic cases, for example when you need to attach hundreds of disks to a single virtual machine.

6. During the installation of Windows, install the VirtIO driver


Otherwise the disk will not be visible to the OS. To install the driver, use the driver image, which we attach to the virtual machine in advance.
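
One way to do this is to attach the driver ISO as a CD-ROM before starting the installation; a sketch assuming the virtio-win image has been downloaded to /var/lib/libvirt/images (the path and target name are examples):

# Attach the virtio-win driver ISO as a read-only CD-ROM
virsh attach-disk vm_name /var/lib/libvirt/images/virtio-win.iso hdb \
    --type cdrom --mode readonly --config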

7. Results after applying all the tweaks


In fact, tweak 4.1 was not used, since I was not confident in the reliability of the client's power supply.
[CrystalDiskMark results: Hyper-V (2 CPU) vs. KVM (2 CPU) vs. KVM (4 CPU)]

Keep in mind that these results are somewhat approximate: every CrystalDiskMark run gives slightly different values.

[CrystalDiskMark results: KVM out of the box (2 CPU) vs. KVM after tweaks (2 CPU)]

We can see that the qemu (KVM) disk subsystem was sped up significantly with the same number of cores: writes became faster by 58% on average and reads by 25%.

Key elements of the success: using LVM volumes instead of qcow2 files, a dedicated I/O thread, and hugepages.

If you notice any mistakes, please send them to me in a private message. I boost karma for that.

P.S. vhost-user-blk and vhost-user-nvme


During the experiments, qemu 2.12 and version 3 were also compiled, and the vhost-user-blk disk option was tested.

In the end, it worked worse than virtio-blk.

[CrystalDiskMark results: vhost-user-blk (4 CPU) vs. virtio-blk (4 CPU)]

Using vhost-user-nvme required patching qemu; that would have complicated automatic server updates in production, so it was not tested.

P.P.S. SPDK


Intel designed this framework to achieve outstanding disk performance figures for virtual machines running on its processors.

To make SPDK perform well they resort to a lot of tricks: dedicating separate CPU cores to it, placing the SPDK cores and the virtual machine's cores on the same socket, and loading the virtual machine into a contiguous chunk of memory. If you apply the same measures to regular virtio-blk, it also gets faster.

SPDK can work in three modes: vhost-user-scsi, vhost-user-blk and vhost-user-nvme. The second mode is available only starting with qemu 2.12, which is not yet in Ubuntu. The vhost-user-nvme mode is extremely experimental: qemu has to be patched for it. So in practice only SCSI emulation works, and it is slow.

There is another serious limitation of the vhost-user-scsi mode: an SPDK disk cannot be bootable.
Make sure bootindex=2 Qemu option is given to vhost-user-scsi-pci device.
The record-setting results are achieved when they use their own driver to split an SSD into several devices and pass them through as vhost-user-nvme. That hardware-passthrough approach did not suit us.

The impression was that SPDK is only really meant to be used with its own implementation of logical volumes (which is completely different from standard LVM). It is a reinvented wheel with its own snapshots and cloning, and its commands are all different from LVM's.

The difficulty of configuring SPDK, the implications for supporting and migrating virtual machines, and the tie-in to Intel processors turned us away from using it.

Acknowledgments


Thanks to TripletConcept for the image.

Thanks to st_neon for permission to share the working materials.



