ZFS Basics: Storage and Performance



This spring we have already covered some introductory topics, such as how to measure the speed of your disks and what RAID is. In the second of those articles we even promised to continue with a study of the performance of various multi-disk topologies in ZFS - the next-generation file system that is now turning up everywhere, from Apple to Ubuntu.

Well, today is the day to get acquainted with ZFS, curious readers. Just be aware that, in the conservative assessment of OpenZFS developer Matt Ahrens, "it's really complicated."

But before we get to the numbers - and they are coming, I promise, for every variant of an eight-disk ZFS configuration - we need to talk about how ZFS stores data on disk.

Zpool, vdev and device



This diagram of a full pool includes one helper vdev of each of the three classes, plus four RAIDz2 vdevs for storage.


There is usually no good reason to create a pool from mismatched vdev types and sizes - but nothing stops you from doing so if you want.

To really understand the ZFS file system, you need to take a careful look at its actual structure. First, ZFS unifies the traditionally separate layers of volume management and the file system. Second, it uses a transactional copy-on-write mechanism. These features mean the system is structurally very different from ordinary file systems and RAID arrays. The first set of basic building blocks to understand: the storage pool (zpool), the virtual device (vdev), and the real device (device).

zpool


The zpool storage pool is the topmost ZFS structure. Each pool contains one or more virtual devices, and each of those in turn contains one or more real devices. Pools are self-contained units: one physical computer may contain two or more separate pools, but each is completely independent of the others. Pools cannot share virtual devices.

ZFS redundancy exists at the virtual device level, not at the pool level. At the pool level there is absolutely no redundancy - if any storage vdev or SPECIAL vdev is lost, the entire pool is lost with it.

Modern storage pools can survive the loss of a CACHE or LOG vdev - although they may lose a small amount of dirty data if the LOG vdev is lost during a power outage or system crash.

There is a common misconception that ZFS "stripes" writes across the entire pool. This is not true. A zpool is not a funny-looking RAID0 - it is more like a funny-looking JBOD with a complex, changeable distribution mechanism.

For the most part, writes are distributed among the available virtual devices according to their free space, so in theory they will all fill up at the same time. Later versions of ZFS also take current vdev utilization into account - if one virtual device is significantly busier than another (for example, due to read load), it will be temporarily skipped for writes even though it has the largest share of free space.

The utilization awareness built into modern ZFS write-distribution methods can reduce latency and increase throughput during periods of unusually high load - but it is not carte blanche to casually mix slow HDDs and fast SSDs in one pool. Such a lopsided pool will still perform at the speed of the slowest device, that is, as if it were built entirely from such devices.

vdev


Each storage pool consists of one or more virtual devices (vdevs). Each vdev, in turn, contains one or more real devices. Most vdevs are used for plain data storage, but there are also several helper vdev classes: CACHE, LOG, and SPECIAL. Each of these vdev types can have one of five topologies: single device, RAIDz1, RAIDz2, RAIDz3, or mirror.

RAIDz1, RAIDz2, and RAIDz3 are special variants of what storage greybeards call diagonal-parity RAID. The 1, 2, and 3 refer to how many parity blocks are allocated to each data stripe. Instead of dedicating whole disks to parity, RAIDz vdevs distribute that parity semi-evenly across all disks. A RAIDz array can lose as many disks as it has parity blocks; if it loses one more, it fails and takes the storage pool down with it.

In mirror vdevs, every block is stored on every device in the vdev. Although two-wide mirrors are the most common, a mirror can contain any arbitrary number of devices - triple mirrors are often used in large installations for better read performance and fault tolerance. A mirror vdev can survive any failure as long as at least one device in the vdev keeps working.

Single-device vdevs are inherently dangerous. Such a vdev cannot survive even one failure - and if it is used as a storage or SPECIAL vdev, its failure will take the entire pool down with it. Be very, very careful here.

CACHE, LOG, and SPECIAL vdevs can be created using any of the topologies above - but remember that losing a SPECIAL vdev means losing the pool, so a redundant topology is strongly recommended.
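To make the topologies above concrete, here is a hedged sketch of how they map onto zpool create syntax; the pool name, device paths, and layout are purely hypothetical examples, not recommendations:

  # a pool built from two mirror vdevs, plus one helper vdev of each class
  sudo zpool create tank \
      mirror /dev/sda /dev/sdb \
      mirror /dev/sdc /dev/sdd \
      special mirror /dev/sde /dev/sdf \
      log mirror /dev/nvme0n1 /dev/nvme1n1 \
      cache /dev/nvme2n1

  # or: a single eight-disk RAIDz2 storage vdev
  sudo zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

Note that the SPECIAL vdev in the sketch is mirrored, in line with the warning above.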

device


This is probably the easiest ZFS term to understand - it is literally just a random-access block device. Remember that virtual devices are made of individual devices, and the pool is made of virtual devices.

Disks - magnetic or solid-state - are the most common block devices used as vdev building blocks. However, any device with a descriptor in /dev will do - so entire hardware RAID arrays can be used as individual devices.

A simple raw file is one of the most important alternative block devices a vdev can be built from. Test pools made of sparse files are a very convenient way to practice pool commands and to see how much space is available in a pool or vdev of a given topology.


You can create a test pool from sparse files in just a few seconds - but do not forget to delete the entire pool and its components later.

Suppose you want to build a server with eight drives and plan to use 10 TB disks (~9300 GiB each) - but you are not sure which topology best suits your needs. In the example above, we build a test pool from sparse files in a matter of seconds - and now we know that a RAIDz2 vdev of eight 10 TB drives provides 50 TiB of usable capacity.
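Here is a minimal sketch of that sparse-file experiment; the file paths and pool name are hypothetical, and the resulting pool should only ever be used for looking at capacities, never for real data:

  # create eight sparse ~9300 GiB backing files (they consume almost no real disk space)
  for n in 1 2 3 4 5 6 7 8; do truncate -s 9300G /tmp/fakedisk$n.img; done

  # build a throwaway RAIDz2 pool on top of them and check the usable capacity
  sudo zpool create testpool raidz2 /tmp/fakedisk1.img /tmp/fakedisk2.img /tmp/fakedisk3.img /tmp/fakedisk4.img /tmp/fakedisk5.img /tmp/fakedisk6.img /tmp/fakedisk7.img /tmp/fakedisk8.img
  zfs list testpool

  # clean up: destroy the pool and delete the backing files
  sudo zpool destroy testpool
  rm /tmp/fakedisk*.img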

Another special class of device is the SPARE (hot spare). Hot-spare devices, unlike ordinary devices, belong to the entire pool rather than to a single vdev. If a device in any vdev of the pool fails, and a spare is attached to the pool and available, the spare automatically joins the degraded vdev.

After joining the degraded vdev, the spare begins receiving copies or reconstructions of the data that should be on the missing device. In traditional RAID this is called rebuilding; in ZFS it is called "resilvering".

It is important to note that spare devices do not permanently replace failed devices. They are only a temporary stand-in to shorten the period during which the vdev runs degraded. Once the administrator replaces the failed device, redundancy is rebuilt onto that permanent replacement, and the SPARE detaches from the vdev and goes back to serving as a spare for the whole pool.
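As a hedged illustration (the pool and device names are hypothetical), attaching a hot spare and watching it take over looks roughly like this:

  # attach a hot spare that belongs to the whole pool
  sudo zpool add tank spare /dev/sdh

  # after a device failure, zpool status shows the spare resilvering into the degraded vdev
  zpool status tank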

Datasets, Blocks, and Sectors


The next set of building blocks you need to understand on our journey through ZFS concerns not so much the hardware as how the data itself is organized and stored. We skip a few layers here - such as the metaslab - so as not to pile up details while keeping the overall structure clear.

Dataset



When we first create a dataset, it shows all available pool space. Then we set the quota - and change the mount point. Magic!


A zvol is, for the most part, just a dataset stripped of its file system layer, which we replace here with a perfectly ordinary ext4 file system.

The ZFS dataset is roughly analogous to a standard mounted file system. Like a regular file system, at first glance it looks like "just another folder". But, also like conventional mounted file systems, each ZFS dataset has its own set of basic properties.

First of all, a dataset may have an assigned quota. If you run zfs set quota=100G poolname/datasetname, you will not be able to write more than 100 GiB of data to the mounted folder /poolname/datasetname.

Notice the presence - and absence - of slashes at the beginning of each line? Each dataset has its own place both in the ZFS hierarchy and in the system mount hierarchy. In the ZFS hierarchy there is no leading slash - you start with the name of the pool and then the path from one dataset to the next. For example, pool/parent/child for a dataset named child under the parent dataset parent in a pool creatively named pool.

By default, a dataset's mount point is equivalent to its name in the ZFS hierarchy, with a leading slash - the pool named pool is mounted at /pool, the dataset parent is mounted at /pool/parent, and the child dataset child is mounted at /pool/parent/child. However, a dataset's system mount point can be changed.

If we run zfs set mountpoint=/lol pool/parent/child, the dataset pool/parent/child will be mounted in the system at /lol.

In addition to datasets, we should mention volumes (zvols). A volume is roughly similar to a dataset, except that it does not actually have a file system - it is just a block device. You can, for example, create a zvol named mypool/myzvol, format it with the ext4 file system, and then mount that file system - you now have an ext4 file system, but backed by all of ZFS's safety features! This may seem silly on a single computer, but it makes much more sense as a backend when exporting an iSCSI device.
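A minimal sketch of that zvol workflow, with hypothetical pool, volume, and mount-point names:

  # create a 10 GiB zvol, format it as ext4, and mount it
  sudo zfs create -V 10G mypool/myzvol
  sudo mkfs.ext4 /dev/zvol/mypool/myzvol
  sudo mkdir -p /mnt/myzvol
  sudo mount /dev/zvol/mypool/myzvol /mnt/myzvol

  # ext4 on top; ZFS checksumming, snapshots, and replication underneath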

Blocks



A file is represented by one or more blocks. Each block is stored on a single virtual device. A block's size is usually equal to the recordsize parameter, but it can shrink to 2^ashift if it holds metadata or a small file.


We really, really are not joking about the massive performance penalty you take if you set ashift too small

In a ZFS pool, all data - including metadata - is stored in blocks. The maximum block size for each dataset is defined by the recordsize property. The record size can be changed, but doing so will not change the size or placement of any blocks that have already been written to the dataset - it takes effect only for new blocks as they are written.

Unless otherwise specified, the default recordsize is 128 KiB. This is an awkward compromise at which performance is not ideal, but not terrible either, in most cases. recordsize can be set to any power of two from 4K to 1M (with additional tunables it can be set even larger, but that is rarely a good idea).

Any given block holds data from only one file - you cannot squeeze two different files into one block. Each file consists of one or more blocks, depending on its size. If a file is smaller than the record size, it is stored in a smaller block - for example, a block holding a 2 KiB file occupies only a single 4 KiB sector on disk.

If a file is large enough to require several blocks, all of the records for that file will be of size recordsize - including the last one, most of which may turn out to be slack space.

Zvols do not have the recordsize property - instead they have the roughly equivalent property volblocksize.
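For illustration, here is a hedged sketch of tuning these properties; the dataset names and the particular values are hypothetical examples rather than recommendations:

  # small records for a random-I/O database dataset, large records for bulk media
  sudo zfs set recordsize=16K tank/db
  sudo zfs set recordsize=1M tank/media
  zfs get recordsize tank/db tank/media

  # zvols use volblocksize instead, and it is fixed at creation time
  sudo zfs create -V 100G -o volblocksize=16K tank/vmdisk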

Sectors


The last and most basic building block is the sector. It is the smallest physical unit that can be written to or read from the underlying device. For several decades most disks used 512-byte sectors. More recently, most drives are built around 4 KiB sectors, and some - especially SSDs - use 8 KiB sectors or even larger.

ZFS has a property that lets you set the sector size manually: ashift. Somewhat confusingly, ashift is actually the binary exponent that represents the sector size - for example, ashift=9 means a sector size of 2^9, or 512 bytes.

ZFS queries the operating system for details about each block device as it is added to a new vdev and, in theory, automatically sets ashift correctly based on that information. Unfortunately, many drives lie about their sector size in order to remain compatible with Windows XP (which could not cope with drives using any other sector size).

This means a ZFS administrator is strongly advised to know the actual sector size of their devices and to set ashift manually. If ashift is set too low, the number of read/write operations balloons astronomically. Writing 512-byte "sectors" into a real 4 KiB sector means writing the first "sector", then reading the 4 KiB sector, modifying it with the second 512-byte "sector", writing it back out to a new 4 KiB sector, and so on for every single write.

In the real world this penalty hits Samsung EVO SSDs, which should use ashift=13 but lie about their sector size and therefore default to ashift=9. Unless an experienced system administrator changes this setting, such an SSD ends up slower than an ordinary magnetic HDD.

By comparison, there is virtually no penalty for setting ashift too high. There is no real performance hit, and the increase in slack space is infinitesimally small (or zero with compression enabled). We therefore strongly recommend setting even drives that genuinely use 512-byte sectors to ashift=12 or even ashift=13, to be ready for the future.

The ashift property is set per vdev - not per pool, as many people mistakenly believe - and it cannot be changed once set. If you accidentally botch ashift when adding a new vdev to a pool, you have irrevocably polluted that pool with a low-performance device, and there is usually nothing left to do but destroy the pool and start over. Even removing the vdev will not save you from a botched ashift setting!
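A hedged sketch of setting ashift explicitly and checking what each vdev actually got; the pool layout and device names are hypothetical:

  # force 4 KiB sectors (2^12) at pool creation and when adding vdevs later
  sudo zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
  sudo zpool add -o ashift=12 tank mirror /dev/sdc /dev/sdd

  # inspect the ashift actually recorded for each vdev
  sudo zdb -C tank | grep ashift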



Copy on Write


Copy on Write (CoW) is the fundamental underpinning of what makes ZFS so awesome. The basic concept is simple - if you ask a traditional file system to modify a file in place, it does exactly what you asked. If you ask a copy-on-write file system to do the same, it says "okay" - but it is lying to you.

Instead, the copy-on-write file system writes out a new version of the modified block, then updates the file's metadata to unlink the old block and link the new block you just wrote.

Unlinking the old block and linking in the new one happens in a single operation, so it cannot be interrupted - if the power drops after it happens, you have the new version of the file; if the power drops before, you have the old version. Either way, the file system stays consistent.

Copy on write in ZFS happens not only at the file system level but also at the disk management level. This means ZFS is not vulnerable to the RAID write hole - a condition in which a stripe is only partially written before the system crashes, leaving the array damaged after a reboot. Here the stripe writes are atomic, the vdev is always consistent, and Bob's your uncle.

ZIL: ZFS Intent Log




There are two main categories of write operations - synchronous (sync) and asynchronous (async). For most workloads the vast majority of writes are asynchronous - the file system is allowed to aggregate them and commit them in batches, reducing fragmentation and greatly increasing throughput.

Synchronous writes are another matter entirely. When an application requests a synchronous write, it is telling the file system: "You need to commit this to non-volatile storage right now, and until you do, I can do nothing else." Synchronous writes must therefore be committed to disk immediately - and if that increases fragmentation or reduces throughput, so be it.

ZFS handles synchronous writes differently from ordinary file systems - instead of flushing them straight to regular storage, ZFS commits them to a special storage area called the ZFS Intent Log, or ZIL. The trick is that these writes also remain in memory, aggregated together with ordinary asynchronous write requests, to be flushed to storage later as perfectly normal TXGs (transaction groups).

In normal operation the ZIL is written to and never read from again. When, moments later, the writes sitting in the ZIL are committed to main storage from RAM in ordinary TXGs, they are detached from the ZIL. The only time anything is ever read from the ZIL is on pool import.

If ZFS crashes - an operating system crash or a power outage - while there is data in the ZIL, that data will be read during the next pool import (for example, when the crashed system restarts). Whatever is in the ZIL is read in, aggregated into TXGs, committed to main storage, and then detached from the ZIL as part of the import process.

One of the helper vdev classes is called LOG or SLOG, the secondary LOG device. It has one job - to give the pool a separate, and preferably much faster and very write-endurant, vdev on which to store the ZIL, instead of keeping the ZIL on the main storage vdevs. The ZIL itself behaves the same regardless of where it is stored, but if the LOG vdev has very high write performance, synchronous writes complete faster.

Adding a LOG vdev to a pool cannot improve asynchronous write performance - even if you force every write into the ZIL with zfs set sync=always, they are still committed to main storage in TXGs in the same way and at the same pace as without the log. The only direct performance improvement is in synchronous write latency (since a faster log lets sync operations complete sooner).

However, in an environment that already requires a lot of synchronous writes, a LOG vdev can indirectly speed up asynchronous writes and uncached reads as well. Offloading ZIL writes to a separate LOG vdev means less contention for IOPS on the primary storage, which improves the performance of all reads and writes to some degree.
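A hedged sketch of adding a LOG vdev; the pool, dataset, and device names are hypothetical, and mirroring the log is a choice rather than a requirement:

  # add a mirrored pair of fast, high-endurance devices as the LOG vdev
  sudo zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

  # for testing only: force every write through the ZIL
  sudo zfs set sync=always tank/mydataset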

Snapshots


The copy-on-write mechanism is also the essential foundation for ZFS's atomic snapshots and incremental asynchronous replication. The live file system has a tree of pointers marking all the records that hold current data - when you take a snapshot, you simply make a copy of that pointer tree.

When a record is overwritten in the live file system, ZFS first writes the new version of the block to unused space. Then it detaches the old version of the block from the current file system. But if any snapshot still references the old block, the block remains intact. The old block is not actually reclaimed as free space until every snapshot referencing it has been destroyed!
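A minimal sketch of working with snapshots, using hypothetical pool and dataset names:

  # take an atomic snapshot - it costs almost nothing until data diverges
  sudo zfs snapshot tank/data@before-upgrade
  zfs list -t snapshot -r tank/data

  # roll the dataset back, discarding everything written since the snapshot
  sudo zfs rollback tank/data@before-upgrade

  # the old blocks are only freed once every snapshot referencing them is destroyed
  sudo zfs destroy tank/data@before-upgrade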

Replication



Benchmark charts (2015): replicating a 158 GiB Steam library of 126,927 files, where ZFS send was roughly 750% faster than rsync, and a 40 GB Windows 7 VM image, comparing ZFS replication against rsync and rsync --inplace.

Once you understand how snapshots work, it is easy to grasp the essence of replication. Since a snapshot is just a tree of pointers to records, it follows that if we zfs send a snapshot, we send both that tree and all the records associated with it. When we pipe that zfs send into a zfs receive on the target, it writes both the actual block contents and the tree of pointers referencing those blocks into the target dataset.

Things get even more interesting on the second zfs send. Now we have two systems, each containing poolname/datasetname@1, and you take a new snapshot, poolname/datasetname@2. So the source pool holds datasetname@1 and datasetname@2, while the target pool so far holds only the first snapshot, datasetname@1.

Since we have a snapshot in common between the source and the target - datasetname@1 - we can do an incremental zfs send on top of it. When we tell the system zfs send -i poolname/datasetname@1 poolname/datasetname@2, it compares the two pointer trees. Any pointers that exist only in @2 obviously refer to new blocks - so we need the contents of those blocks.

On the remote system, processing the incremental send is just as simple. First we write out all the new records included in the send stream, then we add the pointers to those blocks. Voila - @2 now exists on the new system!
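In command form, the full-then-incremental cycle looks roughly like this hedged sketch; the pool, dataset, and host names are hypothetical:

  # initial full replication of snapshot @1 to another machine
  sudo zfs send poolname/datasetname@1 | ssh backuphost sudo zfs receive backuppool/datasetname

  # later: send only the blocks that changed between @1 and @2
  sudo zfs send -i poolname/datasetname@1 poolname/datasetname@2 | ssh backuphost sudo zfs receive backuppool/datasetname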

ZFS asynchronous incremental replication is a huge improvement over earlier, non-snapshot-based methods such as rsync. In both cases only the changed data crosses the wire - but rsync must first read all the data from disk on both sides in order to checksum and compare it. ZFS replication, by contrast, reads nothing except the pointer trees - and whatever blocks are not present in the shared snapshot.

Inline compression


The copy-on-write mechanism also simplifies the built-in compression system. In a traditional file system, in-place compression is problematic - the old version and the new version of the modified data must fit into exactly the same space.

Consider a chunk of data in the middle of a file that begins its life as a megabyte of zeroes - 0x00000000 and so on - which compresses very easily down to a single disk sector. But what happens if we replace that megabyte of zeroes with a megabyte of incompressible data such as JPEG or pseudo-random noise? Suddenly that megabyte of data needs 256 4 KiB sectors instead of one, while only one sector is reserved at that spot on the disk.

ZFS has no such problem, since modified records are always written to unused space - the original block occupies only a single 4 KiB sector, and the new record occupies 256 of them, but that is not an issue. The freshly modified chunk from the middle of the file would have been written to unused space whether its size changed or not, so for ZFS this is business as usual.

Built-in ZFS compression is disabled by default, and the system offers pluggable algorithms - currently LZ4, gzip (levels 1-9), LZJB, and ZLE.

  • LZ4 is a streaming algorithm that offers extremely fast compression and decompression and performance gains for most use cases - even on fairly slow CPUs.
  • GZIP - the venerable algorithm that every Unix user knows and loves. It can be used with compression levels 1-9, with compression ratio and CPU usage both rising as you approach level 9. It suits all-text (or otherwise extremely compressible) datasets, but often becomes a CPU bottleneck otherwise - use with caution, especially at the higher levels.
  • LZJB - the original algorithm used by ZFS. It is deprecated and should no longer be used; LZ4 outperforms it in every respect.
  • ZLE - Zero Level Encoding. It leaves normal data untouched but compresses long runs of zeroes. Useful for completely incompressible datasets (for example JPEG, MP4, or other already-compressed formats), since it ignores the incompressible data but compresses the slack space in the resulting records.

We recommend LZ4 compression for almost all use cases; the performance penalty when it meets incompressible data is very small, while the performance gain on typical data is significant. Copying a VM image of a fresh Windows installation (just the operating system, no data inside yet) went 27% faster with compression=lz4 than with compression=none in this 2015 test.
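A hedged sketch of enabling compression on a dataset (the dataset name is hypothetical); note that only blocks written after the change are compressed:

  # enable LZ4 compression and later check the achieved ratio
  sudo zfs set compression=lz4 tank/data
  zfs get compression,compressratio tank/data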

ARC - adaptive replacement cache


ZFS is the only modern file system we know of that uses its own read-cache mechanism rather than relying on the operating system's page cache to keep copies of recently read blocks in RAM.

A separate cache is not without its problems - ZFS cannot react to new memory allocation requests as quickly as the kernel can, so a new malloc() call may fail if it needs RAM currently occupied by the ARC. But there are good reasons to use a separate cache, at least for now.

All well-known modern operating systems, including MacOS, Windows, Linux, and BSD, implement their page cache with the LRU (Least Recently Used) algorithm. This is a crude algorithm that bumps a cached block "up the queue" every time it is read, and evicts blocks from the "bottom of the queue" as needed to add fresh cache misses (blocks that had to be read from disk rather than from the cache) at the top.

Usually the algorithm works fine, but on systems with large working datasets, LRU easily leads to thrashing - crowding out frequently needed blocks to make room for blocks that will never be read from the cache again.

The ARC is a far less naive algorithm, which can be thought of as a "weighted" cache. Each time a cached block is read it gets a little "heavier" and harder to evict - and even after eviction, the block is tracked for a period of time. A block that was evicted but then has to be read back into the cache also becomes "heavier".

The end result of all this is a cache with a much higher hit ratio - the ratio of cache hits (reads served from the cache) to misses (reads served from disk). This is an extremely important statistic: not only are cache hits themselves served orders of magnitude faster, cache misses are served faster too, because the more hits there are, the fewer concurrent disk requests there are and the lower the latency for the remaining misses that must be served from disk.
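On Linux with OpenZFS, the hit ratio can be watched roughly like this; this is a hedged sketch, and tool names and paths may differ on other platforms:

  # live ARC statistics, refreshed every five seconds (arcstat ships with OpenZFS)
  arcstat 5

  # or read the raw hit/miss counters directly
  grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats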

Conclusion


Having covered the basic semantics of ZFS - how copy on write works, and the relationships between storage pools, virtual devices, blocks, sectors, and files - we are ready to discuss real performance with real numbers.

In the next part we will look at the actual performance of pools built from mirror and RAIDz vdevs, compared with each other and with the traditional Linux kernel RAID topologies we examined earlier.

At first we only wanted to cover the basics - the ZFS topologies themselves - but after that we will be ready to talk about more advanced ZFS setup and tuning, including the use of helper vdev types such as L2ARC, SLOG, and Special Allocation.
