What do LVM and matryoshka dolls have in common?

Good day.
I want to share with the community my practical experience of building a storage system for KVM using md RAID + LVM.

On the agenda:

  • Building md RAID 1 from NVMe SSDs.
  • Building md RAID 6 from SATA SSDs and ordinary drives.
  • Peculiarities of TRIM / DISCARD on SSD RAID 1/6.
  • Creating a bootable md RAID 1/6 array on a common set of disks.
  • Installing the system on NVMe RAID 1 when there is no NVMe support in the BIOS.
  • Using LVM cache and LVM thin.
  • Using BTRFS snapshots and send / receive for backups.
  • Using LVM thin snapshots and thin_delta for BTRFS-style backups.

If you are interested, welcome under the cut.

Statement


The author does not bear any responsibility for the consequences of using or not using the materials / examples / code / tips / data from this article. By reading or using this material in any way, you take responsibility for all the consequences of these actions. Possible consequences include:

  • NVMe SSDs fried to a crisp.
  • Completely exhausted write endurance and failed SSD drives.
  • Complete loss of all data on all drives, including backups.
  • Faulty computer hardware.
  • Spent time, nerves and money.
  • Any other effects not listed above.

Hardware


What was already on hand:


A motherboard from around 2013 on the Z87 chipset, paired with an Intel Core i7 / Haswell:

  • CPU: 4 cores, 8 threads
  • 32 GB of DDR3 RAM
  • 1 x 16 or 2 x 8 PCIe 3.0
  • 1 x 4 + 1 x 1 PCIe 2.0
  • 6 x SATA 3 ports (6 Gbps)

An LSI SAS9211-8I SAS adapter flashed to IT / HBA mode. The RAID-capable firmware was intentionally replaced with the HBA firmware so that:

  1. The adapter could be thrown out at any moment and replaced with any other one that comes to hand.
  2. TRIM / Discard would work normally on the disks: RAID firmware does not support these commands at all, while an HBA, by and large, does not care what commands travel over the bus.

Hard drives: 8 x HGST Travelstar 7K1000, 1 TB each, in the 2.5" laptop form factor. These drives previously lived in a RAID 6 array. They will find a use in the new system as well, for storing local backups.

Additionally, the following was added:


6 x SATA SSD Samsung 860 QVO 2TB. These SSDs were chosen for their large capacity, the presence of an SLC cache and their low price; decent endurance was desirable but not decisive. Mandatory was support for discard / zero, which is confirmed by a line in dmesg:

kernel: ata1.00: Enabling discard_zeroes_data
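A quick way to double-check discard support from the Live environment is lsblk's discard report; a sketch (the device name is an example; DISC-GRAN should be non-zero and, for these drives, DISC-ZERO should be 1):

#dmesg | grep discard_zeroes_data
#lsblk --discard /dev/sda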

2 x NVMe SSD Samsung SSD 970 EVO 500GB.

For these SSDs, what matters is random read / write speed and enough endurance for your workload. Plus a heatsink for them. Mandatory. Absolutely. Otherwise you will fry them to a crisp during the first RAID synchronization.

A StarTech PEX8M2E2 adapter for 2 x NVMe SSD, installed in a PCIe 3.0 8x slot. This, again, is just an HBA, only for NVMe. It differs from cheap adapters in that it does not require PCIe bifurcation support from the motherboard, thanks to an integrated PCIe switch. It will work even in the most ancient system with PCIe, even in an x1 PCIe 1.0 slot. Naturally, at the corresponding speed. There is no RAID on it. There is no built-in BIOS on board either. So your system will not magically learn to boot from NVMe, let alone do NVMe RAID, thanks to this device.

This component was picked solely because there was only one free PCIe 3.0 8x slot in the system; if you have 2 free slots, it is easily replaced by two cheap PEX4M2E1 adapters or analogs, which can be bought anywhere for about 600 rubles.

All kinds of hardware or integrated chipset / BIOS RAID were deliberately rejected, so that the entire system, except for the SSDs / HDDs themselves, could be replaced completely while keeping all the data. Ideally, even the installed operating system should survive a move to completely new / different hardware. The main thing is that there are SATA and PCIe ports. It is like a live CD or a bootable flash drive, only very fast and a little bulky.


Well, and, of course, for experimenting with different methods of SSD caching in Linux.
Hardware RAID is boring. You turn it on. It either works or it does not. With mdadm there are always options.

Software


Previously the hardware ran Debian 8 Jessie, which is close to EOL. A RAID 6 array built from the HDDs mentioned above was paired with LVM. It ran virtual machines under kvm / libvirt.

Since the author has suitable experience in making portable bootable SATA / NVMe drives, and in order not to break the familiar apt workflow, Ubuntu 18.04 was chosen as the target system: it has already stabilized enough, but still has 3 years of support ahead of it.

That system has all the hardware drivers we need out of the box. No third-party software or drivers are required.

Preparation for installation


To install the system we need the Ubuntu Desktop image. The server system comes with a kind of vigorous installer that shows excessive independence which cannot be turned off, always pushing the UEFI system partition onto one of the disks and spoiling all the beauty. Accordingly, it installs only in UEFI mode. It offers no options.

This does not suit us.

Why?
Booting from RAID under UEFI is awkward: UEFI wants its own ESP partition, and an ESP cannot really be mirrored. You can keep the ESP on a separate USB stick, but that is just another single point of failure. You can put the ESP on an mdadm RAID 1 with metadata version 0.9 or 1.0 so that the UEFI BIOS sees it as a plain partition, but then the firmware and the OS write to it behind the array's back and sooner or later desynchronize the mirror.

On top of that, UEFI boot entries live in the board's NVRAM, so they do not travel together with the disks to a new machine.

So we are not going to reinvent UEFI here. We will use the good old Legacy/BIOS boot, via CSM on UEFI-only boards. Fortunately, it still works everywhere.

The desktop version of Ubuntu also does not know how to install properly with the Legacy bootloader, but here, as they say, at least there are options.

So, we assemble the hardware and boot the system from a bootable Ubuntu Live USB drive. We will need to download packages, so set up the network and make sure it works. If it does not, the necessary packages can be downloaded onto the flash drive in advance.

We go into the Desktop environment, run the terminal emulator, and let's go:

#sudo bash

How...?
This is the lazy, least careful way of working under sudo; there are, of course, more proper ways to elevate privileges, and nobody forces you to do the same. But there will be a lot of commands below, and typing sudo in front of every one of them gets tedious. The main thing is to understand what you are doing. So:


#apt-get install mdadm lvm2 thin-provisioning-tools btrfs-tools util-linux lsscsi nvme-cli mc

Why not ZFS ...?
It is not that ZFS is bad; it is simply a different path: you take the whole combine, or nothing.

This article is about a construction set: mdadm + lvm, where every layer is a separate, replaceable and repairable part.

ZFS is great; mdadm + lvm is just more familiar and more flexible for this particular task.

In the end, it is a matter of taste and habit. Please do not take it as criticism of ZFS.

Why then BTRFS ...?
To boot in Legacy/BIOS mode, GRUB needs a file system it understands well, so /boot lives on BTRFS on top of a plain md RAID 1. The root (/) is also BTRFS, on LVM, while everything else sits on LVM volumes.

Of the BTRFS features, essentially only snapshots are used,
plus send/receive for backups.

Besides, this host will hand its GPU and PCI-USB host controllers over to KVM virtual machines via IOMMU, so the host itself needs little beyond a place to boot from and some working space.

No BTRFS multi-device / RAID features are used here; unlike with ZFS, redundancy and caching are handled underneath, by md and LVM.

BTRFS is also used on the HDD array, as the target file system for send/receive backups.

Re-scan all devices:

#udevadm control --reload-rules && udevadm trigger

Let's look around:

#lsscsi && nvme list
[0:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sda
[1:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdb
[2:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdc
[3:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdd
[4:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sde
[5:0:0:0] disk ATA Samsung SSD 860 2B6Q /dev/sdf
[6:0:0:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdg
[6:0:1:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdh
[6:0:2:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdi
[6:0:3:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdj
[6:0:4:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdk
[6:0:5:0] disk ATA HGST HTS721010A9 A3B0 /dev/sdl
[6:0:6:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdm
[6:0:7:0] disk ATA HGST HTS721010A9 A3J0 /dev/sdn
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S466NXXXXXXX15L Samsung SSD 970 EVO 500GB 1 0,00 GB / 500,11 GB 512 B + 0 B 2B2QEXE7
/dev/nvme1n1 S5H7NXXXXXXX48N Samsung SSD 970 EVO 500GB 1 0,00 GB / 500,11 GB 512 B + 0 B 2B2QEXE7

Partitioning "drives"


NVMe SSD


We will not partition these in any way. Our BIOS does not see these drives anyway, so they go into software RAID whole. We will not even create partitions there. If you want to do it "by the canon" or "by the book", create one large partition, as on an HDD.

SATA HDD


There is nothing special to invent here. We will create one partition for everything. We create a partition because the BIOS sees these disks and may even try to boot from them. We will even install GRUB on these disks later, so that the system can actually do so when needed.

#cat >hdd.part << EOF
label: dos
label-id: 0x00000000
device: /dev/sdg
unit: sectors

/dev/sdg1 : start= 2048, size= 1953523120, type=fd, bootable
EOF
#sfdisk /dev/sdg < hdd.part
#sfdisk /dev/sdh < hdd.part
#sfdisk /dev/sdi < hdd.part
#sfdisk /dev/sdj < hdd.part
#sfdisk /dev/sdk < hdd.part
#sfdisk /dev/sdl < hdd.part
#sfdisk /dev/sdm < hdd.part
#sfdisk /dev/sdn < hdd.part
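The same partition layout can be applied in a loop instead of repeating the command by hand; a small sketch using the same drive letters as above:

#for d in /dev/sd[g-n] ; do sfdisk $d < hdd.part ; done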

SATA SSD


Here we have the most interesting.

Firstly, we have 2 TB drives. That fits within the limits of MBR, which is what we will use. If necessary, it can be replaced with GPT. GPT disks have a compatibility layer that lets MBR-only systems see the first 4 partitions, as long as they are located within the first 2 terabytes. The main thing is that the boot partition and the bios_grub partition on such disks sit at the very beginning. That even allows booting from GPT drives in Legacy / BIOS mode.

But, this is not our case.

Here we will create two partitions. The first will be 1 GB in size and is used for RAID 1 /boot.

The second will be used for RAID 6 and will take up all the remaining free space, except for a small unallocated area at the end of the drive.

What is the unallocated area?
These SATA SSDs have an SLC cache that varies in size from 6 to 78 GB. 6 GB of it is "static" and the rest is "dynamic". The dynamic part is about 72 GB.

The dynamic cache is not real SLC; it is carved out of the ordinary 4-bit MLC cells, with 4 GB of ordinary cells giving roughly 1 GB of SLC cache.

72 x 4 = 288 GB. Roughly that much space has to stay free on the drive for the dynamic SLC cache to be usable in full.

All in all, around 312 GB of cells on each 2 TB drive in the RAID can end up working for the SLC cache.

Whether to leave this area unallocated is up to you. Bare QLC with an exhausted cache is painfully slow, and the unallocated space also noticeably improves the effective endurance (TBW) of the SSD.

#cat >ssd.part << EOF
label: dos
label-id: 0x00000000
device: /dev/sda
unit: sectors

/dev/sda1 : start= 2048, size= 2097152, type=fd, bootable
/dev/sda2 : start= 2099200, size= 3300950016, type=fd
EOF
#sfdisk /dev/sda < ssd.part
#sfdisk /dev/sdb < ssd.part
#sfdisk /dev/sdc < ssd.part
#sfdisk /dev/sdd < ssd.part
#sfdisk /dev/sde < ssd.part
#sfdisk /dev/sdf < ssd.part

Creating Arrays


First we need to rename the machine. This is necessary because the host name becomes part of the array name somewhere inside mdadm and affects something somewhere. Arrays can, of course, be renamed later, but that is unnecessary extra work.

#mcedit /etc/hostname
#mcedit /etc/hosts
#hostname
vdesk0

NVMe SSD


#mdadm --create --verbose --assume-clean /dev/md0 --level=1 --raid-devices=2 /dev/nvme[0-1]n1

Why --assume-clean ...?
To skip the initial synchronization of the arrays. For RAID 1 and 6 this is permissible, as long as you understand what you are doing. Almost everything written to an SSD eats into its TBW resource, so we save that resource wherever possible: instead of "cleaning" the array with a full resync, we will "zero" the SSDs with TRIM/DISCARD.

For an SSD RAID 1, DISCARD is supported out of the box.

For an SSD RAID 6, DISCARD has to be enabled explicitly via a module parameter.

And that should only be done if every SSD used in RAID 4/5/6 arrays in this system supports discard_zeroes_data. There are drives that tell the kernel they support the feature but in fact lie, or do not zero data reliably; this is exactly why DISCARD support for RAID 6 is disabled by default.

Attention, the following command will destroy all data on NVMe drives by “initializing” the array with “zeros”.

#blkdiscard /dev/md0

If something went wrong, then try specifying a step.

#blkdiscard --step 65536 /dev/md0

SATA SSD


#mdadm --create --verbose --assume-clean /dev/md1 --level=1 --raid-devices=6 /dev/sd[a-f]1
#blkdiscard /dev/md1
#mdadm --create --verbose --assume-clean /dev/md2 --chunk=512K --level=6 --raid-devices=6 /dev/sd[a-f]2

Why so big ...?
The chunk size matters: the larger the chunk, the fewer chunks a single request spans and the less per-request overhead there is; make it too small and IOPS suffer. In this workload roughly 99% of the IO requests fit within 512K anyway.

The random-write IOPS of a RAID 6 is, at best, the IOPS of a single member drive, because every small write turns into a read-modify-write of a whole stripe; random-read IOPS, on the other hand, scale with the number of drives.
RAID 6 is poor at random writes by design, and we are not going to fight that at this level.
Random writes to the RAID 6 will instead be absorbed by the NVMe cache and smoothed out by thin provisioning.

We have not yet enabled DISCARD for RAID 6. So, we will not “initialize” this array yet. We will do it later, after installing the OS.

SATA HDD


#mdadm --create --verbose --assume-clean /dev/md3 --chunk=512K --level=6 --raid-devices=8 /dev/sd[g-n]1
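At this point it is worth checking that all four arrays are assembled the way we expect; for example:

#cat /proc/mdstat
#mdadm --detail /dev/md2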

LVM on NVMe RAID


For speed, we want to place the root FS on NVMe RAID 1, which is /dev/md0.
Nevertheless, we will still need this fast array for other things, such as swap, LVM thin metadata and LVM cache metadata, so we will create an LVM VG on this array.

#pvcreate /dev/md0
#vgcreate root /dev/md0

Create a volume for the root FS:


#lvcreate -L 128G --name root root

Create a volume for swap, the size of the RAM:

#lvcreate -L 32G --name swap root

OS installation


Now we have everything we need to install the system.

Launch the installation wizard from the Ubuntu Live environment. A normal installation. Only at the stage of selecting drives for installation, you need to specify the following:

  • /dev/md1 - mount point /boot, FS - BTRFS
  • /dev/root/root (aka /dev/mapper/root-root) - mount point / (root), FS - BTRFS
  • /dev/root/swap (aka /dev/mapper/root-swap) - use as swap partition
  • Bootloader install on /dev/sda

If you select BTRFS as the root FS, the installer will automatically create two BTRFS subvolumes named "@" for / (root) and "@home" for /home.

We start the installation ...

The installation will end with a modal dialog box reporting a bootloader installation error. Unfortunately, there is no regular way to dismiss this dialog and continue the installation. We log out of the session and log back in, ending up on a clean Ubuntu Live desktop. Open the terminal, and again:

#sudo bash

Create a chroot environment to continue the installation:

#mkdir /mnt/chroot
#mount -o defaults,space_cache,noatime,nodiratime,discard,subvol=@ /dev/mapper/root-root /mnt/chroot
#mount -o defaults,space_cache,noatime,nodiratime,discard,subvol=@home /dev/mapper/root-root /mnt/chroot/home
#mount -o defaults,space_cache,noatime,nodiratime,discard /dev/md1 /mnt/chroot/boot
#mount --bind /proc /mnt/chroot/proc
#mount --bind /sys /mnt/chroot/sys
#mount --bind /dev /mnt/chroot/dev

Set up the network and the host name in the chroot:

#cat /etc/hostname >/mnt/chroot/etc/hostname
#cat /etc/hosts >/mnt/chroot/etc/hosts
#cat /etc/resolv.conf >/mnt/chroot/etc/resolv.conf

Go into the chroot environment:

#chroot /mnt/chroot

First, install the packages:

#apt-get install --reinstall mdadm lvm2 thin-provisioning-tools btrfs-tools util-linux lsscsi nvme-cli mc debsums hdparm

Check and fix all the packages that ended up crooked because of the incomplete installation:

#CORRUPTED_PACKAGES=$(debsums -s 2>&1 | awk '{print $6}' | uniq)
#apt-get install --reinstall $CORRUPTED_PACKAGES

If something does not come together, you may need to edit /etc/apt/sources.list first.

Change the parameters of the RAID 6 module to enable TRIM / DISCARD:

#cat >/etc/modprobe.d/raid456.conf << EOF
options raid456 devices_handle_discard_safely=1
EOF
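After rebooting into the installed system, the effect of this option can be verified through sysfs; a quick check (the raid456 module must be loaded, which it is whenever the RAID 6 arrays are running):

#cat /sys/module/raid456/parameters/devices_handle_discard_safely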

Let's slightly tune our arrays:

#cat >/etc/udev/rules.d/60-md.rules << EOF
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="32768"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/sync_speed_min", ATTR{md/sync_speed_min}="48000"
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", TEST=="md/sync_speed_max", ATTR{md/sync_speed_max}="300000"
EOF
#cat >/etc/udev/rules.d/62-hdparm.rules << EOF
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/sbin/hdparm -B 254 /dev/%k"
EOF
#cat >/etc/udev/rules.d/63-blockdev.rules << EOF
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/sbin/blockdev --setra 1024 /dev/%k"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/sbin/blockdev --setra 0 /dev/%k"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="nvme[0-9]n1", RUN+="/sbin/blockdev --setra 0 /dev/%k"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="dm-*", ATTR{queue/rotational}=="0", RUN+="/sbin/blockdev --setra 0 /dev/%k"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", RUN+="/sbin/blockdev --setra 0 /dev/%k"

EOF

What was it..?
These udev rules do the following:

  • Set the RAID 6 stripe cache to a size adequate for 2020; the default in Linux, it seems, has not changed since the day it appeared and is far too small.
  • Raise the minimum speed of array checks / synchronization, otherwise they can drag on forever under load.
  • Raise the maximum speed of checks / synchronization, otherwise the SSD RAID arrays will resync far below what they are capable of. The same goes for NVMe. (Remember the heatsink? This is when you need it.)
  • Forbid the HDDs to park their heads via APM every 7 seconds, which is their factory habit, without disabling APM completely (-B 255); this way the backup HDD array can still go to sleep when idle and behaves a bit like a mini-MAID.
  • Set readahead on the HDDs to 1024 sectors (512K), one chunk of the RAID 6.
  • Disable readahead on the SATA SSDs.
  • Disable readahead on the NVMe SSDs.
  • Disable readahead on LVM volumes backed by SSDs.
  • Disable readahead on the RAID arrays themselves.


Edit /etc/fstab:

#cat >/etc/fstab << EOF
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
# file-system mount-point type options dump pass
/dev/mapper/root-root / btrfs defaults,space_cache,noatime,nodiratime,discard,subvol=@ 0 1
UUID=$(blkid -o value -s UUID /dev/md1) /boot btrfs defaults,space_cache,noatime,nodiratime,discard 0 2
/dev/mapper/root-root /home btrfs defaults,space_cache,noatime,nodiratime,discard,subvol=@home 0 2
/dev/mapper/root-swap none swap sw 0 0
EOF

Why is that..?
/boot is mounted by UUID because array names can change.

LVM volumes are referenced by their /dev/mapper/vg-lv names because those are stable and unambiguous.

We do not use UUIDs for LVM volumes because a snapshot of an LVM volume can carry the same UUID as the original.
Twice we mount /dev/mapper/root-root ..?
Yes. Exactly. / and /home are different subvolumes of the same BTRFS file system, so the same device is mounted twice with different subvol options.

You could just as well carve out a separate LVM volume instead of a BTRFS subvolume here; that is a matter of taste.

We regenerate the mdadm config:

#/usr/share/mdadm/mkconf | sed 's/#DEVICE/DEVICE/g' >/etc/mdadm/mdadm.conf

Correct the LVM settings:

#cat >>/etc/lvm/lvmlocal.conf << EOF

activation {
thin_pool_autoextend_threshold=90
thin_pool_autoextend_percent=5
}
allocation {
cache_pool_max_chunks=2097152
}
devices {
global_filter=["r|^/dev/.*_corig$|","r|^/dev/.*_cdata$|","r|^/dev/.*_cmeta$|","r|^/dev/.*gpv$|","r|^/dev/images/.*$|","r|^/dev/mapper/images.*$|","r|^/dev/backup/.*$|","r|^/dev/mapper/backup.*$|"]
issue_discards=1
}
EOF

What was it..?
Auto-extend LVM thin pools by 5% as soon as they are 90% full.

Raise the limit on the number of chunks allowed in an LVM cache; otherwise a cache of this size with 64k chunks would not fit under the default limit.

Hide from LVM (from PV scanning) the devices that the host LVM must never touch directly:

  • data devices of LVM cache (cdata)
  • origin devices hidden under an LVM cache (<lv_name>_corig; LVM should work with the cached <lv_name> instead)
  • metadata devices of LVM cache (cmeta)
  • all volumes in the VG images: these are guest disk images, and we do not want the host LVM to activate whatever the guests create on them
  • all volumes in the VG backup, for the same reason
  • anything whose name ends in "gpv" (guest physical volume)

issue_discards=1 makes LVM send DISCARD when space in a VG is freed, for example when an LV is removed or shrunk, so the SSDs get their blocks back. This mostly concerns the LVs on the SSDs and the SSD RAID 6. Be careful with it in combination with thin provisioning: removing large volumes produces large discards, so think before you act.

Update the initramfs image:

#update-initramfs -u -k all

Install and configure grub:

#apt-get install grub-pc
#apt-get purge os-prober
#dpkg-reconfigure grub-pc


What drives to choose?
All of the sd* devices. At the very least, all of the SATA SSDs.

Why nailed os-prober ..?
So that it does not poke its nose where it should not.

os-prober scans every block device it can reach in search of other operating systems. On a machine full of RAID members, LVM volumes and, later, guest VM images, that is slow at best and harmful at worst: it may try to look inside things the host should never touch.

We have exactly one OS on this machine and do not need menu entries for anything else, so os-prober simply goes away.
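If you later want to (re)install the bootloader on more disks explicitly, for example on the HDDs as promised above, it can be done per disk; a sketch (normally dpkg-reconfigure grub-pc takes care of this):

#for d in /dev/sd[a-n] ; do grub-install $d ; done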

With this, the initial installation is complete. It is time to reboot into the newly installed OS. Do not forget to remove the bootable Live CD / USB. As the boot device, select any of the SATA SSDs.

#exit
#reboot




LVM on SATA SSD


At this point we have already booted into the new OS, configured the network and apt, opened a terminal emulator, and run:

#sudo bash

Continue.

"Initialize" the array made of SATA SSDs:

#blkdiscard /dev/md2

If that does not work, try:

#blkdiscard --step 65536 /dev/md2

Create an LVM VG on the SATA SSDs:

#pvcreate /dev/md2
#vgcreate data /dev/md2


Why another vg ..?
Strictly speaking, we already have the VG named root. Why not add everything into it?

If a VG contains several PVs, then for the VG to be activated all of its PVs must be present (online). The exception is LVM RAID, which we deliberately do not use.

We really want a failure (read: loss) of any of the RAID 6 arrays not to break the boot of the operating system.

For that, at the first level of abstraction, we isolate each kind of physical "storage" into a separate VG.

Scientifically speaking, each RAID array is its own "reliability domain", and it makes no sense to create an extra common point of failure for several of them by cramming them into one VG.

Having RAID at the md level rather than "in hardware" also lets us rearrange the upper layers freely: bcache + LVM thin, bcache + BTRFS, LVM cache + LVM thin, ZFS on top, and so on can all be tried here.

To experiment at that "upper" level without touching anything else, it is enough to remove and recreate the corresponding VG; the root VG and the data it holds stay untouched.

In short, one VG per physical array keeps failures, experiments and migrations neatly contained.

LVM on SATA HDD


#pvcreate /dev/md3
#vgcreate backup /dev/md3


Again the new VG ..?
Yes, again. We really want the backups to keep working if the VG with the working data dies, and vice versa: the backup array can be removed, worn out or replaced without touching the working VG. Whatever happens to one VG, the other stays usable.

Configure LVM cache


Create an LV on NVMe RAID 1 to use it as the caching device.

#lvcreate -L 70871154688B --name cache root

Why so little ...?
Because the NVMe SSDs also have an SLC cache: roughly 4 GB of "static" cache plus about 18 GB of "dynamic" cache allocated in the 3-bit MLC area. Once that cache is exhausted, an NVMe SSD is not that much faster than our SATA SSDs sitting behind a cache. The author wants the working set of the LVM cache to fit into the SLC cache of the NVMe drives as much as possible, so for this pair of drives something in the range of 32-64 GB makes sense.

64 GB was chosen, plus a little extra for the cache metadata and its spare.

If needed, the LVM cache can be dropped and re-created with a different size at any time, online, as shown below.

Create an LV on the SATA RAID 6 to use it as the cached device.

#lvcreate -L 3298543271936B --name cache data

Why only three terabytes ..?
So that some space stays free on the SATA SSD RAID 6 for other needs. For example, for experiments: the spare room in the VG lets you try other caching schemes, such as bcache, side by side, without rebuilding anything.

Create a new VG for caching:

#pvcreate /dev/root/cache
#pvcreate /dev/data/cache
#vgcreate cache /dev/root/cache /dev/data/cache

Create an LV on the cached device. Here we immediately took up all the free space on /dev/data/cache, so that all the other partitions we need are created right away on /dev/root/cache. If something of yours ended up in the wrong place, it can be moved with pvmove.

#lvcreate -L 3298539077632B --name cachedata cache /dev/data/cache


Create and enable the cache:


#lvcreate -y -L 64G -n cache cache /dev/root/cache
#lvcreate -y -L 1G -n cachemeta cache /dev/root/cache
#lvconvert -y --type cache-pool --cachemode writeback --chunksize 64k --poolmetadata cache/cachemeta cache/cache
#lvconvert -y --type cache --cachepool cache/cache cache/cachedata

Why so chunksize ..?
It was found through practical experiments that things work noticeably better when the LVM cache chunk coincides with the LVM thin chunk; and the smaller the chunk, the better the configuration behaves under random writes.

64k is the minimum chunk size allowed for LVM thin.

Caution writeback ..!
Yes. Exactly. A writeback cache means that losing or corrupting the caching device loses data. The author accepts this risk because the cache lives on NVMe RAID 1 and because there are backups; decide for yourself whether it is acceptable to you.

Without writeback, though, much of the point is lost: the RAID 6 random write penalty would hit us directly.

Let's verify that everything worked:

#lvs -a -o lv_name,lv_size,devices --units B cache
LV LSize Devices
[cache] 68719476736B cache_cdata(0)
[cache_cdata] 68719476736B /dev/root/cache(0)
[cache_cmeta] 1073741824B /dev/root/cache(16384)
cachedata 3298539077632B cachedata_corig(0)
[cachedata_corig] 3298539077632B /dev/data/cache(0)
[lvol0_pmspare] 1073741824B /dev/root/cache(16640)


Only [cachedata_corig] should be located on /dev/data/cache. If something is wrong, use pvmove.

If necessary, the cache can be disabled with a single command:


#lvconvert -y --uncache cache/cachedata

This is done online. LVM simply flushes the cache to disk, removes it, and renames cachedata_corig back to cachedata.

LVM thin setup


Let's roughly estimate how much space we will need for LVM thin metadata:

#thin_metadata_size --block-size=64k --pool-size=6terabytes --max-thins=100000 -u bytes
thin_metadata_size - 3385794560 bytes estimated metadata area size for "--block-size=64kibibytes --pool-size=6terabytes --max-thins=100000"


Round up to 4 gigabytes: 4294967296B.

Multiply by two and add 4194304B for LVM PV metadata: 8594128896B.

Create a separate partition on NVMe RAID 1 to keep the LVM thin metadata and its backup copy on it:


#lvcreate -L 8594128896B --name images root
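The size used above can be sanity-checked with shell arithmetic: two copies of the rounded 4 GiB metadata area plus 4 MiB for PV metadata:

#echo $(( 2 * 4 * 1024 * 1024 * 1024 + 4194304 ))
8594128896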

What for..?
So that the LVM thin metadata lives on fast NVMe, separately from the data and outside the cache.

Thin pool metadata is the most critical part of the whole construction: lose it or damage it badly, and the contents of the thin pool turn into a puzzle with no picture on the box. That is why the pool gets both its metadata and a spare copy of it (the pmspare volume) on this dedicated NVMe-backed partition.

Everything that is not metadata, that is, the data itself, goes onto the thin pool on the cached RAID 6.

Create a new VG that will be responsible for thin provisioning:

#pvcreate /dev/root/images
#pvcreate /dev/cache/cachedata
#vgcreate images /dev/root/images /dev/cache/cachedata

Create the pool:

#lvcreate -L 274877906944B --poolmetadataspare y --poolmetadatasize 4294967296B --chunksize 64k -Z y -T images/thin-pool
Why -Z y ..?
Besides the obvious purpose of this mode, preventing data from one virtual machine from leaking into another virtual machine when space is reallocated, zeroing is performed in whole 64k chunks, which converts into aligned 64K writes to the cache.

Let's move the LVs to the corresponding PVs:

#pvmove -n images/thin-pool_tdata /dev/root/images /dev/cache/cachedata
#pvmove -n images/lvol0_pmspare /dev/cache/cachedata /dev/root/images
#pvmove -n images/thin-pool_tmeta /dev/cache/cachedata /dev/root/images

Check:

#lvs -a -o lv_name,lv_size,devices --units B images
LV LSize Devices
[lvol0_pmspare] 4294967296B /dev/root/images(0)
thin-pool 274877906944B thin-pool_tdata(0)
[thin-pool_tdata] 274877906944B /dev/cache/cachedata(0)
[thin-pool_tmeta] 4294967296B /dev/root/images(1024)

Create a thin volume for tests:

#lvcreate -V 64G --thin-pool thin-pool --name test images

Install packages for tests and observation:

#apt-get install sysstat fio

This is how you can observe the behavior of our storage configuration in real time:

#watch 'lvs --rows --reportformat basic --quiet -ocache_dirty_blocks,cache_settings cache/cachedata && (lvdisplay cache/cachedata | grep Cache) && (sar -p -d 2 1 | grep -E "sd|nvme|DEV|md1|md2|md3|md0" | grep -v Average | sort)'

This is how you can test our configuration:

#fio --loops=1 --size=64G --runtime=4 --filename=/dev/images/test --stonewall --ioengine=libaio --direct=1 \
--name=4kQD32read --bs=4k --iodepth=32 --rw=randread \
--name=8kQD32read --bs=8k --iodepth=32 --rw=randread \
--name=16kQD32read --bs=16k --iodepth=32 --rw=randread \
--name=32KQD32read --bs=32k --iodepth=32 --rw=randread \
--name=64KQD32read --bs=64k --iodepth=32 --rw=randread \
--name=128KQD32read --bs=128k --iodepth=32 --rw=randread \
--name=256KQD32read --bs=256k --iodepth=32 --rw=randread \
--name=512KQD32read --bs=512k --iodepth=32 --rw=randread \
--name=4Kread --bs=4k --rw=read \
--name=8Kread --bs=8k --rw=read \
--name=16Kread --bs=16k --rw=read \
--name=32Kread --bs=32k --rw=read \
--name=64Kread --bs=64k --rw=read \
--name=128Kread --bs=128k --rw=read \
--name=256Kread --bs=256k --rw=read \
--name=512Kread --bs=512k --rw=read \
--name=Seqread --bs=1m --rw=read \
--name=Longread --bs=8m --rw=read \
--name=Longwrite --bs=8m --rw=write \
--name=Seqwrite --bs=1m --rw=write \
--name=512Kwrite --bs=512k --rw=write \
--name=256Kwrite --bs=256k --rw=write \
--name=128Kwrite --bs=128k --rw=write \
--name=64Kwrite --bs=64k --rw=write \
--name=32Kwrite --bs=32k --rw=write \
--name=16Kwrite --bs=16k --rw=write \
--name=8Kwrite --bs=8k --rw=write \
--name=4Kwrite --bs=4k --rw=write \
--name=512KQD32write --bs=512k --iodepth=32 --rw=randwrite \
--name=256KQD32write --bs=256k --iodepth=32 --rw=randwrite \
--name=128KQD32write --bs=128k --iodepth=32 --rw=randwrite \
--name=64KQD32write --bs=64k --iodepth=32 --rw=randwrite \
--name=32KQD32write --bs=32k --iodepth=32 --rw=randwrite \
--name=16KQD32write --bs=16k --iodepth=32 --rw=randwrite \
--name=8KQD32write --bs=8k --iodepth=32 --rw=randwrite \
--name=4kQD32write --bs=4k --iodepth=32 --rw=randwrite \
| grep -E 'read|write|test' | grep -v ioengine

Caution! Resource!
The test consists of 36 jobs of 4 seconds each, half of them writes. At the speeds the NVMe cache can sustain, around 3 gigabytes per second, a single full run can pour on the order of 216 gigabytes of writes onto the SSDs. Run it often enough and you will eat through the drives' TBW surprisingly fast.

Reading and writing shuffle?
Yes, on purpose. Alternating groups of reads and writes shows how the cache copes when it has to serve both, and the results noticeably depend on what ran just before.

The results will vary greatly between the first run and subsequent ones, as the cache and the thin volume fill up, and also depending on whether the system managed to sync the caches filled during the previous run.

Among other things, I recommend measuring the speed on an already filled thin volume from which a snapshot has just been taken. The author had a chance to observe how random writes accelerate sharply right after the first snapshot is created, especially while the cache is not yet full. This happens because of the copy-on-write semantics of writes, the alignment of cache and thin volume chunks, and the fact that a random write to RAID 6 turns into a random read from RAID 6 followed by a write to the cache. In our configuration, random reads from RAID 6 are up to 6 times (the number of SATA SSDs in the array) faster than writes. And since CoW blocks are allocated sequentially from the thin pool, the writes largely become sequential as well.

Both of these features can be advantageously used.
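To see this effect yourself, take a thin snapshot of the test volume and repeat the same run; a sketch using the volume created above:

#lvcreate --snapshot --name test.snap images/test

Then re-run the fio command from the previous section and compare the random write numbers before and after the snapshot.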

Cache “coherent” snapshots


To reduce the risk of data loss if the cache is damaged or lost, the author suggests adopting a snapshot rotation practice that guarantees snapshot integrity in such a case.

First, because thin volume metadata resides on an uncached device, the metadata will stay consistent and possible losses will be confined to data blocks.

The following snapshot rotation cycle guarantees data integrity inside the snapshots in case the cache is lost (an illustrative sketch of one pass is given at the end of this section):

  1. For each thin volume named <name>, create a snapshot named <name>.cached.
  2. Set the migration threshold to a reasonably high value: #lvchange --quiet --cachesettings "migration_threshold=16384" cache/cachedata
  3. Wait until the number of dirty blocks in the cache drops to zero, checking it with: #lvs --rows --reportformat basic --quiet -ocache_dirty_blocks cache/cachedata | awk '{print $2}' . If zero refuses to come for a long time, the cache can temporarily be switched to writethrough mode, but keep an eye on the TBW of both the SATA and NVMe SSDs: this is exactly how their write resource gets eaten up.
  4. Once there are no dirty blocks left (on a schedule, or by hand), rename <name>.cached to <name>.committed, deleting the previous <name>.committed first.
  5. It is also worth making sure that the cache is not 100% full: a completely full writeback cache handles new writes noticeably worse.
  6. Set the migration threshold back to zero: #lvchange --quiet --cachesettings "migration_threshold=0" cache/cachedata
  7. The number of dirty blocks can be checked at any time with the same #lvs --rows --reportformat basic --quiet -ocache_dirty_blocks cache/cachedata | awk '{print $2}' command.
  8. Repeat from the beginning.

Why play with the migration threshold...?
The point is that, left to itself, an "idle" cache keeps migrating data to the slow array in the background: in practice almost every small random write (give or take 32K) ends up migrating whole chunks to the SATA SSDs.

Keeping the migration threshold below 64K between rotation passes disables these background migrations to the SATA SSDs and noticeably saves their write resource.

And where is the script for all of this..?
There is none. The author's bash skills are 100% "google"-driven development, so he is not going to inflict that code on anyone; besides, no single script would suit everyone.

Those who like the idea can easily automate the cycle above to taste, for example as a systemd service and timer, adapted to their own volumes and schedule.

Such a simple snapshot rotation scheme not only lets us always have one snapshot fully synchronized to the SATA SSDs, but also, with the help of the thin_delta utility, lets us find out which blocks changed after it was taken and thus localize damage on the main volumes, greatly simplifying recovery.
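For illustration only, here is a minimal sketch of one rotation pass for a single hypothetical thin volume images/vm1. This is not the author's script: locking, error handling and scheduling are deliberately left out.

#cat >rotate-one.sh << 'EOF'
#!/bin/bash
# Sketch: one cache-"coherent" snapshot rotation pass for images/vm1.
set -e
VOL="vm1"
# 1. Take a fresh snapshot of the volume.
lvcreate --snapshot --name "$VOL.cached" "images/$VOL"
# 2. Let the cache flush to the SATA SSD array.
lvchange --quiet --cachesettings "migration_threshold=16384" cache/cachedata
# 3. Wait until there are no dirty blocks left in the cache.
while [ "$(lvs --rows --reportformat basic --quiet -ocache_dirty_blocks cache/cachedata | awk '{print $2}')" != "0" ] ; do
    sleep 10
done
# 4. The snapshot is now fully on the SATA SSDs; promote it to .committed.
lvremove -y "images/$VOL.committed" 2>/dev/null || true
lvrename images "$VOL.cached" "$VOL.committed"
# 6. Stop background migrations again.
lvchange --quiet --cachesettings "migration_threshold=0" cache/cachedata
EOF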

TRIM / DISCARD in libvirt / KVM


Since the data store will be used by KVM running under libvirt, it would be nice to teach our VMs not only to take free space but also to give back what is no longer needed.

This is done by emulating TRIM / DISCARD support on virtual disks. To do this, change the controller type to virtio-scsi and edit the domain XML. Such DISCARDs coming from guest OSes are correctly handled by LVM, and blocks are correctly freed both in the cache and in the thin pool. In our case this mostly happens in a deferred manner, when the next snapshot is deleted.

#virsh edit vmname
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='writethrough' io='threads' discard='unmap'/>
<source dev='/dev/images/vmname'/>
<backingStore/>
<target dev='sda' bus='scsi'/>
<alias name='scsi0-0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

<controller type='scsi' index='0' model='virtio-scsi'>
<alias name='scsi0'/>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
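Inside a Linux guest, the freed space actually gets reported back either by mounting file systems with the discard option or by running fstrim periodically; as a quick manual check in the guest:

#fstrim -av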



BTRFS Backup


Use ready-made scripts with extreme caution and at your own risk. The author wrote this code himself and exclusively for himself. I am sure many experienced Linux users have piles of similar crutches of their own, so there is no need to copy someone else's.

Create a volume on the backup device:

#lvcreate -L 256G --name backup backup

Format it in BTRFS:

#mkfs.btrfs /dev/backup/backup

Create mount points and mount the root subvolumes of the file systems:

#mkdir /backup
#mkdir /backup/btrfs
#mkdir /backup/btrfs/root
#mkdir /backup/btrfs/back
#ln -s /boot /backup/btrfs
# cat >>/etc/fstab << EOF

/dev/mapper/root-root /backup/btrfs/root btrfs defaults,space_cache,noatime,nodiratime 0 2
/dev/mapper/backup-backup /backup/btrfs/back btrfs defaults,space_cache,noatime,nodiratime 0 2
EOF
#mount -a
#update-initramfs -u
#update-grub

Create directories for the backups:

#mkdir /backup/btrfs/back/remote
#mkdir /backup/btrfs/back/remote/root
#mkdir /backup/btrfs/back/remote/boot

Create a directory for the backup scripts:

#mkdir /root/btrfs-backup

Copy the script:

A lot of scary bash code. Use at your own risk. Do not write angry letters to the author...
#cat >/root/btrfs-backup/btrfs-backup.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

LOCK_FILE="/dev/shm/$SCRIPT_NAME.lock"
DATE_PREFIX='%Y-%m-%d'
DATE_FORMAT=$DATE_PREFIX'-%H-%M-%S'
DATE_REGEX='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
BASE_SUFFIX=".@base"
PEND_SUFFIX=".@pend"
SNAP_SUFFIX=".@snap"
MOUNTS="/backup/btrfs/"
BACKUPS="/backup/btrfs/back/remote/"

function terminate ()
{
echo "$1" >&2
exit 1
}

function wait_lock()
{
flock 98
}

function wait_lock_or_terminate()
{
echo "Wating for lock..."
wait_lock || terminate "Failed to get lock. Exiting..."
echo "Got lock..."
}

function suffix()
{
FORMATTED_DATE=$(date +"$DATE_FORMAT")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function filter()
{
FORMATTED_DATE=$(date --date="$1" +"$DATE_PREFIX")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function backup()
{
SOURCE_PATH="$MOUNTS$1"
TARGET_PATH="$BACKUPS$1"
SOURCE_BASE_PATH="$MOUNTS$1$BASE_SUFFIX"
TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
TARGET_BASE_DIR="$(dirname $TARGET_BASE_PATH)"
SOURCE_PEND_PATH="$MOUNTS$1$PEND_SUFFIX"
TARGET_PEND_PATH="$BACKUPS$1$PEND_SUFFIX"
if [ -d "$SOURCE_BASE_PATH" ]
then
echo "$SOURCE_BASE_PATH found"
else
echo "$SOURCE_BASE_PATH File not found creating snapshot of $SOURCE_PATH to $SOURCE_BASE_PATH"
btrfs subvolume snapshot -r $SOURCE_PATH $SOURCE_BASE_PATH
sync
if [ -d "$TARGET_BASE_PATH" ]
then
echo "$TARGET_BASE_PATH found out of sync with source... removing..."
btrfs subvolume delete -c $TARGET_BASE_PATH
sync
fi
fi
if [ -d "$TARGET_BASE_PATH" ]
then
echo "$TARGET_BASE_PATH found"
else
echo "$TARGET_BASE_PATH not found. Synching to $TARGET_BASE_DIR"
btrfs send $SOURCE_BASE_PATH | btrfs receive $TARGET_BASE_DIR
sync
fi
if [ -d "$SOURCE_PEND_PATH" ]
then
echo "$SOURCE_PEND_PATH found removing..."
btrfs subvolume delete -c $SOURCE_PEND_PATH
sync
fi
btrfs subvolume snapshot -r $SOURCE_PATH $SOURCE_PEND_PATH
sync
if [ -d "$TARGET_PEND_PATH" ]
then
echo "$TARGET_PEND_PATH found removing..."
btrfs subvolume delete -c $TARGET_PEND_PATH
sync
fi
echo "Sending $SOURCE_PEND_PATH to $TARGET_PEND_PATH"
btrfs send -p $SOURCE_BASE_PATH $SOURCE_PEND_PATH | btrfs receive $TARGET_BASE_DIR
sync
TARGET_DATE_SUFFIX=$(suffix)
btrfs subvolume snapshot -r $TARGET_PEND_PATH "$TARGET_PATH$TARGET_DATE_SUFFIX"
sync
btrfs subvolume delete -c $SOURCE_BASE_PATH
sync
btrfs subvolume delete -c $TARGET_BASE_PATH
sync
mv $SOURCE_PEND_PATH $SOURCE_BASE_PATH
mv $TARGET_PEND_PATH $TARGET_BASE_PATH
sync
}

function list()
{
LIST_TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
LIST_TARGET_BASE_DIR="$(dirname $LIST_TARGET_BASE_PATH)"
LIST_TARGET_BASE_NAME="$(basename -s .$BASE_SUFFIX $LIST_TARGET_BASE_PATH)"
find "$LIST_TARGET_BASE_DIR" -maxdepth 1 -mindepth 1 -type d -printf "%f\n" | grep "${LIST_TARGET_BASE_NAME/$BASE_SUFFIX/$SNAP_SUFFIX}.$DATE_REGEX"
}

function remove()
{
REMOVE_TARGET_BASE_PATH="$BACKUPS$1$BASE_SUFFIX"
REMOVE_TARGET_BASE_DIR="$(dirname $REMOVE_TARGET_BASE_PATH)"
btrfs subvolume delete -c $REMOVE_TARGET_BASE_DIR/$2
sync
}

function removeall()
{
DATE_OFFSET="$2"
FILTER="$(filter "$DATE_OFFSET")"
while read -r SNAPSHOT ; do
remove "$1" "$SNAPSHOT"
done < <(list "$1" | grep "$FILTER")

}

(
COMMAND="$1"
shift

case "$COMMAND" in
"--help")
echo "Help"
;;
"suffix")
suffix
;;
"filter")
filter "$1"
;;
"backup")
wait_lock_or_terminate
backup "$1"
;;
"list")
list "$1"
;;
"remove")
wait_lock_or_terminate
remove "$1" "$2"
;;
"removeall")
wait_lock_or_terminate
removeall "$1" "$2"
;;
*)
echo "None.."
;;
esac
) 98>$LOCK_FILE

EOF


What does it even do ..?
It takes incremental snapshots of the listed BTRFS volumes and copies them to another BTRFS file system using BTRFS send/receive.

The first launch can take relatively long, because at the beginning all the data will be copied. Subsequent launches will be very fast, because only the changes get copied.
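The script can also be run by hand for a single volume, using the commands from its case block; for example:

#/root/btrfs-backup/btrfs-backup.sh backup root/@
#/root/btrfs-backup/btrfs-backup.sh list root/@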

Another script to cram into cron:

Some more bash code
#cat >/root/btrfs-backup/cron-daily.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

BACKUP_SCRIPT="$SCRIPT_DIR/btrfs-backup.sh"
RETENTION="-60 day"
$BACKUP_SCRIPT backup root/@
$BACKUP_SCRIPT removeall root/@ "$RETENTION"
$BACKUP_SCRIPT backup root/@home
$BACKUP_SCRIPT removeall root/@home "$RETENTION"
$BACKUP_SCRIPT backup boot/
$BACKUP_SCRIPT removeall boot/ "$RETENTION"
EOF


What is it doing ..?
It runs the backup for the listed BTRFS volumes and removes snapshots older than 60 days. The resulting snapshots accumulate under /backup/btrfs/back/remote/.

Give the scripts execute permission:

#chmod +x /root/btrfs-backup/cron-daily.sh
#chmod +x /root/btrfs-backup/btrfs-backup.sh

Do a test run, check the log, and then add the job to cron:

#/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/btrfs-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t btrfs-backup
#cat /var/log/syslog | grep btrfs-backup
#crontab -e
0 2 * * * /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/btrfs-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t btrfs-backup

LVM thin backup


Create a thin pool on the backup device:

#lvcreate -L 274877906944B --poolmetadataspare y --poolmetadatasize 4294967296B --chunksize 64k -Z y -T backup/thin-pool

Install ddrescue, because the scripts will use this tool:

#apt-get install gddrescue

Create a directory for scripts:

#mkdir /root/lvm-thin-backup

Copy the scripts:

A lot of bash inside ...
#cat >/root/lvm-thin-backup/lvm-thin-backup.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

LOCK_FILE="/dev/shm/$SCRIPT_NAME.lock"
DATE_PREFIX='%Y-%m-%d'
DATE_FORMAT=$DATE_PREFIX'-%H-%M-%S'
DATE_REGEX='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
BASE_SUFFIX=".base"
PEND_SUFFIX=".pend"
SNAP_SUFFIX=".snap"
BACKUPS="backup"
BACKUPS_POOL="thin-pool"

export LVM_SUPPRESS_FD_WARNINGS=1

function terminate ()
{
echo "$1" >&2
exit 1
}

function wait_lock()
{
flock 98
}

function wait_lock_or_terminate()
{
echo "Wating for lock..."
wait_lock || terminate "Failed to get lock. Exiting..."
echo "Got lock..."
}

function suffix()
{
FORMATTED_DATE=$(date +"$DATE_FORMAT")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function filter()
{
FORMATTED_DATE=$(date --date="$1" +"$DATE_PREFIX")
echo "$SNAP_SUFFIX.$FORMATTED_DATE"
}

function read_thin_id {
lvs --rows --reportformat basic --quiet -othin_id "$1/$2" | awk '{print $2}'
}

function read_pool_lv {
lvs --rows --reportformat basic --quiet -opool_lv "$1/$2" | awk '{print $2}'
}

function read_lv_dm_path {
lvs --rows --reportformat basic --quiet -olv_dm_path "$1/$2" | awk '{print $2}'
}

function read_lv_active {
lvs --rows --reportformat basic --quiet -olv_active "$1/$2" | awk '{print $2}'
}

function read_lv_chunk_size {
lvs --rows --reportformat basic --quiet --units b --nosuffix -ochunk_size "$1/$2" | awk '{print $2}'
}

function read_lv_size {
lvs --rows --reportformat basic --quiet --units b --nosuffix -olv_size "$1/$2" | awk '{print $2}'
}

function activate_volume {
lvchange -ay -Ky "$1/$2"
}

function deactivate_volume {
lvchange -an "$1/$2"
}

function read_thin_metadata_snap {
dmsetup status "$1" | awk '{print $7}'
}

function thindiff()
{
DIFF_VG="$1"
DIFF_SOURCE="$2"
DIFF_TARGET="$3"
DIFF_SOURCE_POOL=$(read_pool_lv $DIFF_VG $DIFF_SOURCE)
DIFF_TARGET_POOL=$(read_pool_lv $DIFF_VG $DIFF_TARGET)

if [ "$DIFF_SOURCE_POOL" == "" ]
then
(>&2 echo "Source LV is not thin.")
exit 1
fi

if [ "$DIFF_TARGET_POOL" == "" ]
then
(>&2 echo "Target LV is not thin.")
exit 1
fi

if [ "$DIFF_SOURCE_POOL" != "$DIFF_TARGET_POOL" ]
then
(>&2 echo "Source and target LVs belong to different thin pools.")
exit 1
fi

DIFF_POOL_PATH=$(read_lv_dm_path $DIFF_VG $DIFF_SOURCE_POOL)
DIFF_SOURCE_ID=$(read_thin_id $DIFF_VG $DIFF_SOURCE)
DIFF_TARGET_ID=$(read_thin_id $DIFF_VG $DIFF_TARGET)
DIFF_POOL_PATH_TPOOL="$DIFF_POOL_PATH-tpool"
DIFF_POOL_PATH_TMETA="$DIFF_POOL_PATH"_tmeta
DIFF_POOL_METADATA_SNAP=$(read_thin_metadata_snap $DIFF_POOL_PATH_TPOOL)

if [ "$DIFF_POOL_METADATA_SNAP" != "-" ]
then
(>&2 echo "Thin pool metadata snapshot already exist. Assuming stale one. Will release metadata snapshot in 5 seconds.")
sleep 5
dmsetup message $DIFF_POOL_PATH_TPOOL 0 release_metadata_snap
fi

dmsetup message $DIFF_POOL_PATH_TPOOL 0 reserve_metadata_snap
DIFF_POOL_METADATA_SNAP=$(read_thin_metadata_snap $DIFF_POOL_PATH_TPOOL)

if [ "$DIFF_POOL_METADATA_SNAP" == "-" ]
then
(>&2 echo "Failed to create thin pool metadata snapshot.")
exit 1
fi

#We keep output in variable because metadata snapshot need to be released early.
DIFF_DATA=$(thin_delta -m$DIFF_POOL_METADATA_SNAP --snap1 $DIFF_SOURCE_ID --snap2 $DIFF_TARGET_ID $DIFF_POOL_PATH_TMETA)

dmsetup message $DIFF_POOL_PATH_TPOOL 0 release_metadata_snap

echo $"$DIFF_DATA" | grep -E 'different|left_only|right_only' | sed 's/</"/g' | sed 's/ /"/g' | awk -F'\"' '{print $6 "\t" $8 "\t" $11}' | sed 's/different/copy/g' | sed 's/left_only/copy/g' | sed 's/right_only/discard/g'

}

function thinsync()
{
SYNC_VG="$1"
SYNC_PEND="$2"
SYNC_BASE="$3"
SYNC_TARGET="$4"
SYNC_PEND_POOL=$(read_pool_lv $SYNC_VG $SYNC_PEND)
SYNC_BLOCK_SIZE=$(read_lv_chunk_size $SYNC_VG $SYNC_PEND_POOL)
SYNC_PEND_PATH=$(read_lv_dm_path $SYNC_VG $SYNC_PEND)

activate_volume $SYNC_VG $SYNC_PEND

while read -r SYNC_ACTION SYNC_OFFSET SYNC_LENGTH ; do
SYNC_OFFSET_BYTES=$((SYNC_OFFSET * SYNC_BLOCK_SIZE))
SYNC_LENGTH_BYTES=$((SYNC_LENGTH * SYNC_BLOCK_SIZE))
if [ "$SYNC_ACTION" == "copy" ]
then
ddrescue --quiet --force --input-position=$SYNC_OFFSET_BYTES --output-position=$SYNC_OFFSET_BYTES --size=$SYNC_LENGTH_BYTES "$SYNC_PEND_PATH" "$SYNC_TARGET"
fi

if [ "$SYNC_ACTION" == "discard" ]
then
blkdiscard -o $SYNC_OFFSET_BYTES -l $SYNC_LENGTH_BYTES "$SYNC_TARGET"
fi
done < <(thindiff "$SYNC_VG" "$SYNC_PEND" "$SYNC_BASE")
}

function discard_volume()
{
DISCARD_VG="$1"
DISCARD_LV="$2"
DISCARD_LV_PATH=$(read_lv_dm_path "$DISCARD_VG" "$DISCARD_LV")
if [ "$DISCARD_LV_PATH" != "" ]
then
echo "$DISCARD_LV_PATH found"
else
echo "$DISCARD_LV not found in $DISCARD_VG"
exit 1
fi
DISCARD_LV_POOL=$(read_pool_lv $DISCARD_VG $DISCARD_LV)
DISCARD_LV_SIZE=$(read_lv_size "$DISCARD_VG" "$DISCARD_LV")
lvremove -y --quiet "$DISCARD_LV_PATH" || exit 1
lvcreate --thin-pool "$DISCARD_LV_POOL" -V "$DISCARD_LV_SIZE"B --name "$DISCARD_LV" "$DISCARD_VG" || exit 1
}

function backup()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
SOURCE_PEND_LV="$SOURCE_LV$PEND_SUFFIX"
TARGET_PEND_LV="$TARGET_LV$PEND_SUFFIX"
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")
SOURCE_PEND_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_PEND_LV")
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
TARGET_PEND_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_PEND_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ]
then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "Source base not found creating snapshot of $SOURCE_VG/$SOURCE_LV to $SOURCE_VG/$SOURCE_BASE_LV"
lvcreate --quiet --snapshot --name "$SOURCE_BASE_LV" "$SOURCE_VG/$SOURCE_LV" || exit 1
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
echo "Discarding $SOURCE_BASE_LV_PATH as we need to bootstrap."
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SOURCE_BASE_CHUNK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)
discard_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
sync
if [ "$TARGET_BASE_LV_PATH" != "" ]
then
echo "$TARGET_BASE_LV_PATH found out of sync with source... removing..."
lvremove -y --quiet $TARGET_BASE_LV_PATH || exit 1
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
sync
fi
fi
SOURCE_BASE_SIZE=$(read_lv_size "$SOURCE_VG" "$SOURCE_BASE_LV")
if [ "$TARGET_BASE_LV_PATH" != "" ]
then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_VG/$TARGET_LV not found. Creating empty volume."
lvcreate --thin-pool "$BACKUPS_POOL" -V "$SOURCE_BASE_SIZE"B --name "$TARGET_BASE_LV" "$TARGET_VG" || exit 1
echo "Have to rebootstrap. Discarding source at $SOURCE_BASE_LV_PATH"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SOURCE_BASE_CHUNK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)
discard_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
TARGET_BASE_POOL=$(read_pool_lv $TARGET_VG $TARGET_BASE_LV)
TARGET_BASE_CHUNK_SIZE=$(read_lv_chunk_size $TARGET_VG $TARGET_BASE_POOL)
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
echo "Discarding target at $TARGET_BASE_LV_PATH"
discard_volume "$TARGET_VG" "$TARGET_BASE_LV"
sync
fi
if [ "$SOURCE_PEND_LV_PATH" != "" ]
then
echo "$SOURCE_PEND_LV_PATH found removing..."
lvremove -y --quiet "$SOURCE_PEND_LV_PATH" || exit 1
sync
fi
lvcreate --quiet --snapshot --name "$SOURCE_PEND_LV" "$SOURCE_VG/$SOURCE_LV" || exit 1
SOURCE_PEND_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_PEND_LV")
sync
if [ "$TARGET_PEND_LV_PATH" != "" ]
then
echo "$TARGET_PEND_LV_PATH found removing..."
lvremove -y --quiet $TARGET_PEND_LV_PATH
sync
fi
lvcreate --quiet --snapshot --name "$TARGET_PEND_LV" "$TARGET_VG/$TARGET_BASE_LV" || exit 1
TARGET_PEND_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_PEND_LV")
SOURCE_PEND_LV_SIZE=$(read_lv_size "$SOURCE_VG" "$SOURCE_PEND_LV")
lvresize -L "$SOURCE_PEND_LV_SIZE"B "$TARGET_PEND_LV_PATH"
activate_volume "$TARGET_VG" "$TARGET_PEND_LV"
echo "Synching $SOURCE_PEND_LV_PATH to $TARGET_PEND_LV_PATH"
thinsync "$SOURCE_VG" "$SOURCE_PEND_LV" "$SOURCE_BASE_LV" "$TARGET_PEND_LV_PATH" || exit 1
sync

TARGET_DATE_SUFFIX=$(suffix)
lvcreate --quiet --snapshot --name "$TARGET_LV$TARGET_DATE_SUFFIX" "$TARGET_VG/$TARGET_PEND_LV" || exit 1
sync
lvremove --quiet -y "$SOURCE_BASE_LV_PATH" || exit 1
sync
lvremove --quiet -y "$TARGET_BASE_LV_PATH" || exit 1
sync
lvrename -y "$SOURCE_VG/$SOURCE_PEND_LV" "$SOURCE_BASE_LV" || exit 1
lvrename -y "$TARGET_VG/$TARGET_PEND_LV" "$TARGET_BASE_LV" || exit 1
sync
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function verify()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ]
then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "$SOURCE_BASE_LV_PATH not found"
exit 1
fi
if [ "$TARGET_BASE_LV_PATH" != "" ]
then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_BASE_LV_PATH not found"
exit 1
fi
activate_volume "$TARGET_VG" "$TARGET_BASE_LV"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
echo Comparing "$SOURCE_BASE_LV_PATH" with "$TARGET_BASE_LV_PATH"
cmp "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH"
echo Done...
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function resync()
{
SOURCE_VG="$1"
SOURCE_LV="$2"
TARGET_VG="$BACKUPS"
TARGET_LV="$SOURCE_VG-$SOURCE_LV"
SOURCE_BASE_LV="$SOURCE_LV$BASE_SUFFIX"
TARGET_BASE_LV="$TARGET_LV$BASE_SUFFIX"
TARGET_BASE_LV_PATH=$(read_lv_dm_path "$TARGET_VG" "$TARGET_BASE_LV")
SOURCE_BASE_LV_PATH=$(read_lv_dm_path "$SOURCE_VG" "$SOURCE_BASE_LV")

if [ "$SOURCE_BASE_LV_PATH" != "" ]
then
echo "$SOURCE_BASE_LV_PATH found"
else
echo "$SOURCE_BASE_LV_PATH not found"
exit 1
fi
if [ "$TARGET_BASE_LV_PATH" != "" ]
then
echo "$TARGET_BASE_LV_PATH found"
else
echo "$TARGET_BASE_LV_PATH not found"
exit 1
fi
activate_volume "$TARGET_VG" "$TARGET_BASE_LV"
activate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
SOURCE_BASE_POOL=$(read_pool_lv $SOURCE_VG $SOURCE_BASE_LV)
SYNC_BLOCK_SIZE=$(read_lv_chunk_size $SOURCE_VG $SOURCE_BASE_POOL)

echo Syncronizing "$SOURCE_BASE_LV_PATH" to "$TARGET_BASE_LV_PATH"

CMP_OFFSET=0
while [[ "$CMP_OFFSET" != "" ]] ; do
CMP_MISMATCH=$(cmp -i "$CMP_OFFSET" "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH" | grep differ | awk '{print $5}' | sed 's/,//g' )
if [[ "$CMP_MISMATCH" != "" ]] ; then
CMP_OFFSET=$(( CMP_MISMATCH + CMP_OFFSET ))
SYNC_OFFSET_BYTES=$(( ( CMP_OFFSET / SYNC_BLOCK_SIZE ) * SYNC_BLOCK_SIZE ))
SYNC_LENGTH_BYTES=$(( SYNC_BLOCK_SIZE ))
echo "Synching $SYNC_LENGTH_BYTES bytes at $SYNC_OFFSET_BYTES from $SOURCE_BASE_LV_PATH to $TARGET_BASE_LV_PATH"
ddrescue --quiet --force --input-position=$SYNC_OFFSET_BYTES --output-position=$SYNC_OFFSET_BYTES --size=$SYNC_LENGTH_BYTES "$SOURCE_BASE_LV_PATH" "$TARGET_BASE_LV_PATH"
else
CMP_OFFSET=""
fi
done
echo Done...
deactivate_volume "$TARGET_VG" "$TARGET_BASE_LV"
deactivate_volume "$SOURCE_VG" "$SOURCE_BASE_LV"
}

function list()
{
LIST_SOURCE_VG="$1"
LIST_SOURCE_LV="$2"
LIST_TARGET_VG="$BACKUPS"
LIST_TARGET_LV="$LIST_SOURCE_VG-$LIST_SOURCE_LV"
LIST_TARGET_BASE_LV="$LIST_TARGET_LV$SNAP_SUFFIX"
lvs -olv_name | grep "$LIST_TARGET_BASE_LV.$DATE_REGEX"
}

function remove()
{
REMOVE_TARGET_VG="$BACKUPS"
REMOVE_TARGET_LV="$1"
lvremove -y "$REMOVE_TARGET_VG/$REMOVE_TARGET_LV"
sync
}

function removeall()
{
DATE_OFFSET="$3"
FILTER="$(filter "$DATE_OFFSET")"
while read -r SNAPSHOT ; do
remove "$SNAPSHOT"
done < <(list "$1" "$2" | grep "$FILTER")

}

(
COMMAND="$1"
shift

case "$COMMAND" in
"--help")
echo "Help"
;;
"suffix")
suffix
;;
"filter")
filter "$1"
;;
"backup")
wait_lock_or_terminate
backup "$1" "$2"
;;
"list")
list "$1" "$2"
;;
"thindiff")
thindiff "$1" "$2" "$3"
;;
"thinsync")
thinsync "$1" "$2" "$3" "$4"
;;
"verify")
wait_lock_or_terminate
verify "$1" "$2"
;;
"resync")
wait_lock_or_terminate
resync "$1" "$2"
;;
"remove")
wait_lock_or_terminate
remove "$1"
;;
"removeall")
wait_lock_or_terminate
removeall "$1" "$2" "$3"
;;
*)
echo "None.."
;;
esac
) 98>$LOCK_FILE

EOF


What is it doing ...?
It creates and rotates thin snapshots of the source volumes, computes the difference between two snapshots with thin_delta against a metadata snapshot of the thin pool, and transfers only the changed blocks to the backup pool using ddrescue and blkdiscard.

Another script, which we will put into cron:

Some more bash
#cat >/root/lvm-thin-backup/cron-daily.sh << 'EOF'
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

SCRIPT_FILE="$(realpath $0)"
SCRIPT_DIR="$(dirname $SCRIPT_FILE)"
SCRIPT_NAME="$(basename -s .sh $SCRIPT_FILE)"

BACKUP_SCRIPT="$SCRIPT_DIR/lvm-thin-backup.sh"
RETENTION="-60 days"

$BACKUP_SCRIPT backup images linux-dev
$BACKUP_SCRIPT backup images win8
$BACKUP_SCRIPT backup images win8-data
#etc

$BACKUP_SCRIPT removeall images linux-dev "$RETENTION"
$BACKUP_SCRIPT removeall images win8 "$RETENTION"
$BACKUP_SCRIPT removeall images win8-data "$RETENTION"
#etc

EOF


What is it doing ...?
It runs the previous script for each listed thin volume, creating and rotating its backups, and removes snapshots older than the retention period.

This script needs to be edited, listing the thin volumes for which backups are required. The names given are for illustration purposes only. If you wish, you can write a script that synchronizes all volumes; see the sketch below.
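A possible sketch of such enumeration, assuming all guest volumes live in the images VG on the thin-pool pool and that the service *.base / *.pend volumes must be skipped:

#cat >/root/lvm-thin-backup/backup-all.sh << 'EOF'
#!/bin/bash
# Sketch: back up every thin volume in the images VG except the service ones.
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
BACKUP_SCRIPT="/root/lvm-thin-backup/lvm-thin-backup.sh"
lvs --noheadings -o lv_name,pool_lv images \
    | awk '$2 == "thin-pool" {print $1}' \
    | grep -vE '\.(base|pend)$' \
    | while read -r LV ; do
        "$BACKUP_SCRIPT" backup images "$LV"
    done
EOF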

Give the scripts execute permission:

#chmod +x /root/lvm-thin-backup/cron-daily.sh
#chmod +x /root/lvm-thin-backup/lvm-thin-backup.sh

Do a test run, check the log, and then add the job to cron:


#/usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/lvm-thin-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t lvm-thin-backup
#cat /var/log/syslog | grep lvm-thin-backup
#crontab -e
0 3 * * * /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /root/lvm-thin-backup/cron-daily.sh 2>&1 | /usr/bin/logger -t lvm-thin-backup

The first launch will be long, because the thin volumes will be fully synchronized by copying all the used space. Thanks to the LVM thin metadata we know which blocks are actually in use, so only the occupied blocks of the thin volumes will be copied. Subsequent launches copy the data incrementally by tracking changes through the LVM thin metadata.


Let's see what happened:


#time /root/btrfs-backup/cron-daily.sh
real 0m2,967s
user 0m0,225s
sys 0m0,353s

#time /root/lvm-thin-backup/cron-daily.sh
real 1m2,710s
user 0m12,721s
sys 0m6,671s

#ls -al /backup/btrfs/back/remote/*
/backup/btrfs/back/remote/boot:
total 0
drwxr-xr-x 1 root root 1260 26 09:11 .
drwxr-xr-x 1 root root 16 6 09:30 ..
drwxr-xr-x 1 root root 322 26 02:00 .@base
drwxr-xr-x 1 root root 516 6 09:39 .@snap.2020-03-06-09-39-37
drwxr-xr-x 1 root root 516 6 09:39 .@snap.2020-03-06-09-39-57
...
/backup/btrfs/back/remote/root:
total 0
drwxr-xr-x 1 root root 2820 26 09:11 .
drwxr-xr-x 1 root root 16 6 09:30 ..
drwxr-xr-x 1 root root 240 26 09:11 @.@base
drwxr-xr-x 1 root root 22 26 09:11 @home.@base
drwxr-xr-x 1 root root 22 6 09:39 @home.@snap.2020-03-06-09-39-35
drwxr-xr-x 1 root root 22 6 09:39 @home.@snap.2020-03-06-09-39-57
...
drwxr-xr-x 1 root root 240 6 09:39 @.@snap.2020-03-06-09-39-26
drwxr-xr-x 1 root root 240 6 09:39 @.@snap.2020-03-06-09-39-56
...

#lvs -olv_name,lv_size images && lvs -olv_name,lv_size backup
LV LSize
linux-dev 128,00g
linux-dev.base 128,00g
thin-pool 1,38t
win8 128,00g
win8-data 2,00t
win8-data.base 2,00t
win8.base 128,00g
LV LSize
backup 256,00g
images-linux-dev.base 128,00g
images-linux-dev.snap.2020-03-08-10-09-11 128,00g
images-linux-dev.snap.2020-03-08-10-09-25 128,00g
...
images-win8-data.base 2,00t
images-win8-data.snap.2020-03-16-14-11-55 2,00t
images-win8-data.snap.2020-03-16-14-19-50 2,00t
...
images-win8.base 128,00g
images-win8.snap.2020-03-17-04-51-46 128,00g
images-win8.snap.2020-03-18-03-02-49 128,00g
...
thin-pool <2,09t
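Individual backups can be spot-checked against their source with the script's verify command, which compares the two base snapshots byte by byte; for example:

#/root/lvm-thin-backup/lvm-thin-backup.sh verify images linux-dev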

And what about nested dolls?


Most likely you have already guessed: LVM logical volumes (LV) can themselves act as physical volumes (PV) for other VGs. LVM can be nested recursively, like matryoshka dolls. This gives LVM extreme flexibility.

PS


In the next article, we will try to use several similar mobile storage / KVM systems as the basis for building a geo-distributed storage / VM cluster with redundancy across several continents, using home desktops, home Internet and P2P networks.
