Infinidat Storage Architecture Technical Overview

InfiniBox is a modern storage system that landed on the right-hand side of the Gartner Magic Quadrant almost immediately. What makes it unique?

A Brief Background

What is InfiniBox? It is Infinidat's storage system. What is Infinidat? It is a company founded by Moshe Yanai (creator of Symmetrix and XIV) to build an ideal enterprise-class storage system.

The company positioned itself as a software developer whose software runs on proven off-the-shelf hardware. In other words, it is SDS, but it ships as a single, integrated appliance.

Introduction

In this article we will look at the InfiniBox storage system: its architecture, how it works, and how it achieves high reliability (99.99999%), performance, and capacity at a relatively low price. Since the heart of any storage system is its software, and that is especially true here, the emphasis will be on the software; there will be no glossy photos of hardware.

Why does the market need another storage system?

There are workloads that need very large capacity while reliability and performance still matter: cloud platforms, the everyday workloads of large enterprises, the Internet of Things, genomic research, security and surveillance systems for large facilities. Finding the optimal storage system for such tasks is hard, especially once price is taken into account. The InfiniBox software architecture was built with exactly these workloads in mind.

Addressing

How do you store an unlimited amount of data? By providing an unlimited address space. For this, InfiniBox uses the VUA (Virtual User Address space). All user-visible objects created on the InfiniBox (volumes, snapshots, file systems) live in this VUA, and their total size is the current size of the VUA. Addressing is thin and not tied to the disks: the VUA can be much larger than the available disk capacity and is, in practice, unlimited.



Next, this space needs to be divided into parts that are easier to manage and serve.



The entire address space is divided into 6 parts called VUs (virtual units). The address space of each object, such as a volume, is distributed evenly across these parts. I/O is handled in 64 KB sections, and determining which VU a given volume address belongs to is extremely simple and fast: divide the LBA by 64 KB and take the remainder modulo the number of VUs, an operation that costs only a handful of CPU cycles.
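
To make that arithmetic concrete, here is a minimal sketch (Python, not Infinidat code) of mapping a user address to one of the six VUs with a single modulo; the constant names are illustrative.

```python
# Minimal sketch (not Infinidat code): mapping a user address to a virtual unit.
# Assumes 64 KB sections and 6 VUs, as described above; names are illustrative.

SECTION_SIZE = 64 * 1024   # user space is handled in 64 KB sections
NUM_VUS = 6                # the address space is split across 6 virtual units

def vu_for_address(byte_offset: int) -> int:
    """Return the index of the VU that owns this section of a volume."""
    section_index = byte_offset // SECTION_SIZE
    return section_index % NUM_VUS   # one cheap modulo, no lookup tables

# Example: consecutive 64 KB sections of a volume land on VUs 0, 1, 2, 3, 4, 5, 0, ...
print([vu_for_address(i * SECTION_SIZE) for i in range(8)])
```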

Besides simplifying work with smaller address spaces, VUs are the first level of abstraction from the physical disks and the basis for fault tolerance at the controller level. There are 3 controllers (physical servers); each is the primary owner of two VUs and the standby for two others. A controller does not serve the VUs it is standby for, but it receives the metadata and write operations for them from the primary controller, so that if the primary fails it can take over those VUs immediately.



If one controller fails, the other two take over its work with its VUs.
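
The ownership scheme can be sketched as follows; the exact primary/standby layout below is an assumption made for illustration, not the real assignment table.

```python
# Illustrative sketch of the VU ownership scheme described above (not Infinidat
# code): 3 controllers, each primary for 2 VUs and standby for 2 others.

PRIMARY = {0: [0, 1], 1: [2, 3], 2: [4, 5]}   # controller -> VUs it serves
STANDBY = {0: [2, 4], 1: [0, 5], 2: [1, 3]}   # controller -> VUs it mirrors (assumed layout)

def owners_after_failure(failed: int) -> dict:
    """Reassign the failed controller's VUs to their standby controllers."""
    new_primary = {c: list(vus) for c, vus in PRIMARY.items() if c != failed}
    for vu in PRIMARY[failed]:
        backup = next(c for c, vus in STANDBY.items() if vu in vus and c != failed)
        new_primary[backup].append(vu)
    return new_primary

print(owners_after_failure(1))   # e.g. {0: [0, 1, 2], 2: [4, 5, 3]}
```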



Addressing, Snapshots, Caching



The VUA is the virtual address space visible to the user, and it is practically unlimited. The VDA (Virtual Disk Address) is the virtual internal storage space; its size is fixed in advance by the number and capacity of the hard drives (minus parity, metadata, and spare space for replacing failed drives). The mapping between VUA and VDA is organized as a trie (prefix tree). Each entry in the tree is a pointer from user space (VUA) into internal space (VDA). Prefix-tree addressing makes it possible to address extents of any size: when an element of any size (a file, a sequential data stream, an object) is written to disk, it can be addressed with a single tree entry, so the tree stays compact.

However, the most important property of the tree is its high performance for lookups and insertions. A lookup happens on every read, when the block with a given address must be found on disk. An insertion happens on every write, when new data is added to disk and its address must be recorded in the tree. This performance matters enormously at large scale, and the prefix tree delivers it with a huge margin for the future, for example if disk capacities keep growing.
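
Here is a toy prefix tree over section addresses that illustrates the key idea: one tree entry can map an aligned, power-of-two-sized extent to a VDA location, and an unmapped address is simply a thin-provisioned hole. The address width, node layout, and helper names are assumptions made for this sketch, not the actual InfiniBox structure.

```python
# Toy prefix tree (trie) over VUA section addresses: a single entry maps an aligned
# power-of-two extent to a VDA base address. Illustration only, not InfiniBox code.

ADDRESS_BITS = 48   # assumed width of the section address space

class TrieNode:
    __slots__ = ("children", "vda")
    def __init__(self):
        self.children = {}   # bit (0/1) -> TrieNode
        self.vda = None      # VDA base address if an extent is mapped at this prefix

def _bits(section: int):
    return [(section >> (ADDRESS_BITS - 1 - i)) & 1 for i in range(ADDRESS_BITS)]

def insert(root: TrieNode, vua_section: int, extent_sections: int, vda_base: int):
    """Map an aligned extent of `extent_sections` (a power of two) with one tree entry."""
    depth = ADDRESS_BITS - extent_sections.bit_length() + 1   # shorter prefix = bigger extent
    node = root
    for bit in _bits(vua_section)[:depth]:
        node = node.children.setdefault(bit, TrieNode())
    node.vda = vda_base

def lookup(root: TrieNode, vua_section: int):
    """Walk the prefix; the first mapped node on the path covers this address."""
    node, covered = root, 0
    for bit in _bits(vua_section):
        if node.vda is not None:
            break
        if bit not in node.children:
            return None          # unmapped: thin-provisioned hole
        node = node.children[bit]
        covered += 1
    if node.vda is None:
        return None
    extent_size = 1 << (ADDRESS_BITS - covered)      # sections covered by this entry
    return node.vda + (vua_section % extent_size)    # offset inside the mapped extent

root = TrieNode()
insert(root, vua_section=0, extent_sections=4, vda_base=1000)   # 4 sections, one entry
print(lookup(root, 2))   # -> 1002
print(lookup(root, 7))   # -> None (never written)
```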

What can be said about the relationship between VUA and VDA:

  1. VUA size can far exceed VDA size
  2. A VUA address has no VDA behind it until data is actually written there (thin provisioning)
  3. More than one VUA can reference one VDA (snapshots / clones)

Thus, the organization of VUA and VDA, the links between them, and the way those links are addressed make very fast snapshots and thin provisioning possible. Creating a snapshot is just a metadata update in memory, an operation that happens constantly during normal work, so in practice it takes no time at all. Classic storage systems typically pause metadata updates and/or I/O while creating a snapshot in order to guarantee transactional consistency, which leads to uneven I/O latency. The system in question works differently: nothing is paused, and the timestamp in the block metadata (the 64 + 4 KB sections) is used to determine whether an operation belongs to a snapshot. As a result, the system can hold hundreds of thousands of snapshots without slowing down, and a volume with hundreds of snapshots performs no differently from a volume without any. Since everything happens in memory through the regular code paths, dozens of snapshots per second can be taken on volume groups. This makes it possible to implement snapshot-based asynchronous replication with a lag between copies of seconds or less, again without affecting performance, which also matters.
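
The timestamp idea can be sketched as follows: a snapshot is essentially just a recorded creation time, and a block version is visible to that snapshot if it was written before that time. The data layout below is invented for the illustration and is not the real metadata format.

```python
# Hedged sketch of the timestamp-based snapshot visibility described above.

from dataclasses import dataclass

@dataclass
class BlockVersion:
    vda: int            # where this version of the block lives
    written_at: int     # timestamp carried in the block's metadata

def visible_version(versions: list[BlockVersion], snapshot_time: int) -> BlockVersion | None:
    """Pick the newest version of the block that existed when the snapshot was taken."""
    older = [v for v in versions if v.written_at <= snapshot_time]
    return max(older, key=lambda v: v.written_at, default=None)

# A volume block rewritten over time; snapshots only remember their creation time.
history = [BlockVersion(vda=10, written_at=100), BlockVersion(vda=42, written_at=250)]
print(visible_version(history, snapshot_time=200))  # the version written at t=100
print(visible_version(history, snapshot_time=300))  # the version written at t=250
```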

Let's look at how data flows through the system as a whole. Operations arrive through the ports of all three controllers (servers). The port drivers run in user space, which makes it possible to simply restart them if some combination of events on the ports hangs a driver. In classic implementations this code lives in the kernel, and the problem is solved by rebooting the entire controller.

Next, the stream is divided into sections of 64 KB + 4 KB. What is the extra 4 KB? It protects against silent data corruption and contains checksums, a timestamp, and additional information about the operation, which is used to classify it and to optimize caching and read-ahead.
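
A possible layout of such a section is sketched below. The field names, sizes, and checksum choice are assumptions made for the illustration, not the on-disk format.

```python
# Illustrative layout of a 64 KB data section plus 4 KB of metadata: a checksum
# against silent corruption, a timestamp, and a classification hint for caching.

import time
import zlib

SECTION_DATA = 64 * 1024
SECTION_META = 4 * 1024

def build_section(payload: bytes, activity_hint: int) -> bytes:
    assert len(payload) == SECTION_DATA
    checksum = zlib.crc32(payload)                    # guards against silent errors
    meta = (checksum.to_bytes(4, "big")
            + int(time.time()).to_bytes(8, "big")     # timestamp of the operation
            + activity_hint.to_bytes(4, "big"))       # hint used for caching / read-ahead
    return payload + meta.ljust(SECTION_META, b"\x00")

def verify_section(section: bytes) -> bool:
    payload, meta = section[:SECTION_DATA], section[SECTION_DATA:]
    return zlib.crc32(payload) == int.from_bytes(meta[:4], "big")

s = build_section(b"\x00" * SECTION_DATA, activity_hint=7)
print(len(s), verify_section(s))   # 69632 True
```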



Write caching is fairly simple; the same cannot be said for read caching. Read caching works well when we know what to read next. Classic systems use read-ahead algorithms for sequential access, but what about random access? Truly random access is extremely rare in real applications; it is even hard to emulate correctly, and writing a generator of genuinely random numbers is an interesting problem in its own right. If each I/O operation is considered in isolation, as classic systems do, then everything except strictly sequential access looks random and unpredictable. But if you look at the entire I/O stream over a period of time, patterns emerge that tie different operations together.

The cache knows essentially nothing about volumes, files, or any logical structure built on top of the VUA. It looks only at sections and their metadata, and bases its decisions on their behavior and attributes, which makes it possible to find dependencies between operations that really are related. For the I/O stream, activity vectors are constructed.



The system accumulates statistics and builds these activity vectors, then tries to identify the current I/O and match it to a known vector or start a new one. Once a match is found, the system prefetches along that vector: application behavior is predicted, and read-ahead is performed even for a seemingly random load.
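
A very rough sketch of the idea: summarize a recent run of I/O as a vector of address strides, match it against previously seen vectors, and prefetch along the best match. The representation, the similarity measure, and the threshold below are invented for the illustration; the real heuristics are not public.

```python
# Toy "activity vector" prefetcher: match the current stride pattern to a learned
# one and continue it. Purely illustrative, not the actual caching algorithm.

def strides(addresses: list[int]) -> list[int]:
    return [b - a for a, b in zip(addresses, addresses[1:])]

def similarity(v1: list[int], v2: list[int]) -> float:
    """Share of positions where the two stride patterns agree."""
    n = min(len(v1), len(v2))
    return sum(a == b for a, b in zip(v1[:n], v2[:n])) / n if n else 0.0

def prefetch_candidates(recent: list[int], known_vectors: list[list[int]], depth: int = 4):
    current = strides(recent)
    best = max(known_vectors, key=lambda v: similarity(current, v), default=None)
    if best is None or similarity(current, best) < 0.75:
        return []                      # nothing recognizable: do not prefetch
    out, addr = [], recent[-1]
    for step in best[len(current):len(current) + depth]:   # continue the matched pattern
        addr += step
        out.append(addr)
    return out

known = [[8, 8, 8, 8, 8, 8, 8, 8]]      # a previously learned pattern: stride of 8 sections
print(prefetch_candidates([100, 108, 116, 124], known))   # [132, 140, 148, 156]
```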

Writing to disks

For writing to disk, 14 sections are assembled into a stripe. A dedicated process selects the sections that go into each stripe.



Next, two parity sections are computed, and the stripe is ready to be written to disk. Parity is calculated with a handful of XOR-based operations, which is about twice as fast as Reed-Solomon coding. The stripe (14 + 2 sections) is then assigned to a RAID group (RG). A RAID group is simply an object that holds a number of stripes, nothing more. The stripes are stacked one above another as shown below, and each vertical column is called a member of the RAID group. VDS (Virtual Disk Space) is the disk space available for user data, and a VDA is an address within it.
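
As a rough illustration of stripe assembly, the sketch below builds the row parity of a 14-section stripe with plain XOR and shows how any single lost member can be rebuilt from the rest. The second parity in the real system is also XOR-based (a different combination of the same sections) but is omitted here; the constants and names are assumptions for the sketch.

```python
# Minimal sketch: 14 data sections plus a row parity computed by XOR. Only one of
# the two parities described above is shown, so this is an illustration, not the
# actual RAID scheme.

from functools import reduce

SECTION = 64 * 1024
DATA_MEMBERS = 14

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def build_stripe(sections: list[bytes]) -> list[bytes]:
    assert len(sections) == DATA_MEMBERS and all(len(s) == SECTION for s in sections)
    p = reduce(xor_blocks, sections)          # row parity over the 14 data members
    return sections + [p]                     # the real stripe also appends a second parity

def rebuild_missing(stripe: list[bytes], lost_index: int) -> bytes:
    """Any single member can be rebuilt by XOR-ing the surviving members."""
    survivors = [m for i, m in enumerate(stripe) if i != lost_index]
    return reduce(xor_blocks, survivors)

data = [bytes([i]) * SECTION for i in range(DATA_MEMBERS)]
stripe = build_stripe(data)
assert rebuild_missing(stripe, lost_index=3) == data[3]
```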



A column, or member, of a RAID group is written to one disk (PD, Physical Drive) in one shelf (Disk Unit). The location where a RAID group member is written is called a Drive Partition (DP). The number of DPs per disk is fixed at 264; their size depends on the disk's capacity. This design spreads the load evenly across all disks. At the same time, the placement algorithm keeps the columns of any one RAID group as far apart as possible, on different disks and shelves. As a result, when two disks fail at the same time, the number of stripes they have in common is minimal, and the system returns from protection level N to N + 1 within minutes by rebuilding the stripes that are missing two columns first (this is how seven-nines reliability is achieved).
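
The placement goal can be sketched as a greedy selection: pick the disks for a stripe's members so that they spread over different shelves and have shared as few previous RAID groups as possible. The heuristic below is invented purely for illustration and is not the actual placement algorithm.

```python
# Hedged sketch of the placement goal: spread the 16 members of a stripe across
# shelves and across rarely-paired disks, so any two disks share few stripes.

from collections import defaultdict

STRIPE_WIDTH = 16   # 14 data + 2 parity members

def pick_disks(disks, shelf_of, pair_count):
    """Greedily pick STRIPE_WIDTH disks, preferring new shelves and rarely-paired disks."""
    chosen = []
    for _ in range(STRIPE_WIDTH):
        def cost(d):
            shelf_penalty = sum(shelf_of[d] == shelf_of[c] for c in chosen)
            overlap_penalty = sum(pair_count[frozenset((d, c))] for c in chosen)
            return (shelf_penalty, overlap_penalty)
        best = min((d for d in disks if d not in chosen), key=cost)
        chosen.append(best)
    for i, a in enumerate(chosen):
        for b in chosen[i + 1:]:
            pair_count[frozenset((a, b))] += 1    # remember which disks got paired
    return chosen

disks = list(range(120))                          # e.g. two shelves of 60 disks each
shelf_of = {d: d // 60 for d in disks}
pair_count = defaultdict(int)
print(pick_disks(disks, shelf_of, pair_count))
```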



As a result, the logical design of the system as a whole looks quite simple and is presented in the diagram below.



Physical implementation

The system is built so that every component is protected according to an N + 2 or 2N scheme, including the power and data paths inside the array. Below is the power distribution diagram.

ATS (Automatic Transfer Switch) - switches between power feeds
BBU (Battery Backup Unit) - uninterruptible power supply
Node - controller



This scheme protects the controllers and the integrity of memory during compound events, for example a power supply failure combined with a temporary loss of one power feed. The UPS units are managed, which gives accurate charge information and lets the system adjust the write cache size dynamically so that a controller always has time to flush it. In other words, the system starts actively using the cache much earlier, unlike the classic approach where the write cache is enabled only once the batteries are fully charged.
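
The principle can be expressed very simply: cap the amount of dirty data at what can be flushed within the runtime the batteries currently guarantee. The rates and margins below are illustrative numbers, not Infinidat values.

```python
# Simple sketch of charge-aware write caching: the writable cache grows with the
# guaranteed UPS runtime instead of waiting for the batteries to reach 100%.

FLUSH_RATE_MB_S = 2000          # assumed sustained destage rate to the vault drives
SAFETY_FACTOR = 0.5             # keep a margin; assumption for the sketch

def writable_cache_mb(ups_runtime_seconds: float) -> float:
    """Cap dirty data at what can be flushed within the guaranteed battery runtime."""
    return ups_runtime_seconds * FLUSH_RATE_MB_S * SAFETY_FACTOR

for runtime in (30, 120, 600):                       # seconds of guaranteed runtime
    print(runtime, "s ->", writable_cache_mb(runtime), "MB of dirty cache allowed")
```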

Here is a diagram of the data channels within the system.



The controllers are interconnected via InfiniBand and connected to the drives via SAS. Every controller can access every disk in the system. Moreover, if the path between a controller and a disk fails, the controller can request the data through another controller acting as a proxy over InfiniBand. The shelves contain SAS switches that provide shared access to the disks. Each shelf holds 60 disks; a system can have two, four, or eight shelves, for a total of up to 480 disks of 3, 4, 6, 8, or 12 TB. The total capacity available to the user is more than 4.1 PB before compression. Speaking of compression: to implement it without losing performance, data in memory is kept uncompressed. As a result, the system sometimes runs even faster with compression enabled: reads fetch less data from disk, the CPUs have power to spare, and on writes the acknowledgement is sent straight from memory while compression happens asynchronously as data is destaged to the disks.
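
The access-path fallback can be sketched as follows: try the local SAS path first, and otherwise ask a peer controller to proxy the request over InfiniBand. All classes and method names below are stand-ins made up for the illustration; they are not Infinidat interfaces.

```python
# Sketch of path fallback: local SAS access first, then a peer acting as a proxy.

class DiskPathError(Exception):
    pass

class Controller:
    def __init__(self, name, broken_paths=()):
        self.name = name
        self.broken = set(broken_paths)       # disks this controller cannot reach directly

    def sas_read(self, disk_id, dp_index):
        if disk_id in self.broken:
            raise DiskPathError(f"{self.name}: no SAS path to disk {disk_id}")
        return f"data(disk={disk_id}, dp={dp_index}) via {self.name}"

def read_dp(local: Controller, peers, disk_id, dp_index):
    """Prefer the local SAS path; fall back to a peer as an InfiniBand proxy."""
    for ctrl in (local, *peers):
        try:
            return ctrl.sas_read(disk_id, dp_index)
        except DiskPathError:
            continue
    raise DiskPathError(f"disk {disk_id} unreachable from every controller")

a, b, c = Controller("node1", broken_paths={7}), Controller("node2"), Controller("node3")
print(read_dp(a, (b, c), disk_id=7, dp_index=12))   # served via node2 as a proxy
```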

Each controller contains two groups of internal drives: a system group used for flushing RAM, and an SSD group used as a read cache (up to 368 TB per system). Such a large cache makes it possible to prefetch large chunks, and since the data within a stripe is chosen to have roughly the same access frequency, these large chunks not only reduce the load on the physical disks but also have a good chance of being needed in the near future.

Summary

So, we have looked at a very interesting storage system with a modern architecture that delivers high capacity, high reliability, excellent performance, and a reasonable cost.



Sources
1 https://techfieldday.com/video/infinidat-solution-deep-dive/
2 https://support.infinidat.com/hc/en-us
