Why OceanStor Dorado V6 is the fastest and most reliable storage

Please don't rush to conclusions because of the headline! We have weighty arguments to back it up, and we have packed them as compactly as possible. We bring to your attention a post on the concept and operating principles of our new data storage system, which was released in January 2020.




In our opinion, the main competitive advantage of the Dorado V6 storage family is precisely the performance and reliability mentioned in the headline. Yes, it sounds that simple, but today we will talk about the tricky and not-so-tricky decisions that allowed us to achieve this "simple" result.

To better reveal the potential of the new generation, we will focus on the senior models of the range (8000 and 18000). Unless otherwise indicated, these are the models we mean.



A few words about the market


To better understand the place of Huawei solutions on the market, let us turn to a proven yardstick: Gartner's "magic quadrants". Two years ago, in the general-purpose disk array sector, our company confidently entered the leaders' group, second only to NetApp and Hewlett Packard Enterprise. In 2018, Huawei's position in the solid-state storage market was that of a "Challenger": something was still missing to reach the leaders.

In 2019, Gartner combined both of the above sectors into one, "Primary Storage". As a result, Huawei again found itself in the leaders' quadrant, next to vendors such as IBM, Hitachi Vantara, and Infinidat.

To complete the picture, note that Gartner collects 80% of its data on the US market, which skews the analysis in favor of companies well represented in the USA, while vendors targeting European and Asian markets find themselves at a noticeable disadvantage. Despite this, last year Huawei products took their rightful place in the upper-right quadrant and, by Gartner's verdict, "can be recommended for use."



What's New in Dorado V6


The Dorado V6 product line starts with the entry-level 3000 series. Initially equipped with two controllers, these systems can be scaled out to 16 controllers, 1,200 disks, and 192 GB of cache. Externally, they offer Fiber Channel (8/16/32 Gb/s) and Ethernet (1/10/25/40/100 Gb/s) ports.

Note that protocols without commercial success are gradually being phased out, so at launch we decided to drop support for Fiber Channel over Ethernet (FCoE) and InfiniBand (IB); they will be added in later firmware versions. Support for NVMe over Fabric (NVMe-oF) on top of Fiber Channel is available out of the box. The next firmware, due for release in June, is expected to add NVMe over Ethernet. In our opinion, this set will more than cover the needs of most Huawei customers.

There is no file access in the current firmware version; it will appear in one of the upcoming updates closer to the end of the year. Implementation is expected at the native level, by the controllers themselves via their Ethernet ports, without additional equipment.

The main difference between the Dorado V6 3000 series and the older models is that its backend supports a single protocol, SAS 3.0; accordingly, only drives with that interface can be used. From our point of view, the resulting performance is quite sufficient for a device of this class.

Dorado V6 5000 and 6000 series systems are mid-range solutions. They too come in a 2U form factor and are equipped with two controllers, differing from each other in performance, number of processors, maximum number of disks, and cache size. Architecturally and in engineering terms, however, the Dorado V6 5000 and 6000 are identical and look the same.

The hi-end class includes the Dorado V6 8000 and 18000 series. Built in a 4U chassis, they use a disaggregated architecture by default, in which controllers and drives are separated. In the minimum configuration they, too, can be equipped with only two controllers, although customers as a rule order four or more.

Dorado V6 8000 scales out to 16 controllers, Dorado V6 18000 to 32. The two use different processors with a different number of cores and different cache sizes, but, as with the mid-range models, the engineering solutions remain identical.

2U drive shelves are connected via RDMA with a bandwidth of 100 Gb/s. The backend of the older Dorado V6 models also supports SAS 3.0, mostly in case SSDs with that interface drop in price: then there would be an economic case for using them, even allowing for the lower performance. At the moment, the price difference between SAS and NVMe SSDs is so small that we are not ready to recommend such a configuration.



Inside the controller


Dorado V6 controllers are built on our own silicon: no Intel processors, no Broadcom ASICs. Thus, every single component of the motherboard, and the board itself, is completely removed from the risks of sanctions pressure by American companies. Anyone who has seen our equipment with their own eyes has surely noticed the badges with a red stripe under the logo: they mean the product contains no American components. This is Huawei's official course: the transition to components of our own production or, at least, components manufactured in countries that do not follow US policy.

Here's what you can see on the controller board itself.

  • Universal network interface (Hisilicon 1822 chip), responsible for connecting to Fiber Channel or Ethernet.
  • A BMC chip (Hisilicon 1710), responsible for controller management and monitoring, including remote management.
  • The central processor, an ARM-based Kunpeng 920 developed by Huawei. We will talk about it separately below, since it is largely thanks to this chip that Dorado V6 achieves its performance.
  • An SSD controller (Hisilicon 1812e chip) that works with both SAS and NVMe drives. Huawei manufactures its own SSDs, buying in only the NAND memory, which lets us control the whole stack from cell to controller.
  • An AI chip, the Ascend 310. It analyzes the nature of the workload and is used, among other things, to make caching smarter.



Kunpeng


The Kunpeng processor is a system on a chip (SoC): besides the computing unit it contains hardware modules that accelerate various processes, such as calculating checksums or performing erasure coding. It also implements hardware support for SAS, Ethernet, DDR4 (six to eight channels), and so on. All this allows Huawei to build storage controllers that are in no way inferior in performance to classic Intel-based solutions.
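To give a feel for the kind of work those hardware modules offload, here is a minimal Python sketch of XOR-based parity, the simplest form of erasure coding. It is purely illustrative; the real accelerators implement far more elaborate schemes in silicon.

```python
# A minimal sketch of the kind of operation a SoC can offload to hardware:
# XOR parity, the simplest form of erasure coding. Illustrative only.

def xor_parity(blocks: list[bytes]) -> bytes:
    """Compute a parity block over equally sized data blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(blocks: list[bytes | None], parity: bytes) -> bytes:
    """Rebuild the single missing block (marked as None) from the parity."""
    present = [b for b in blocks if b is not None]
    return xor_parity(present + [parity])

data = [b"ABCD", b"EFGH", b"IJKL"]
p = xor_parity(data)
assert recover([data[0], None, data[2]], p) == b"EFGH"
```

Running this per-byte loop on a general-purpose core is exactly the kind of work that eats CPU time at scale, which is why moving it into a dedicated hardware block pays off.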

In addition, its own ARM-based designs give Huawei the opportunity to create full-fledged server solutions and offer them to its customers as an alternative to x86.



The new architecture of Dorado V6 ...


The internal architecture of the older Dorado V6 storage systems is built around four main subdomains (fabrics).

The first fabric is the shared frontend: the network interfaces responsible for communicating with the SAN fabric or directly with hosts.

The second is the set of controllers, each of which can reach any frontend network card, as well as a neighboring "engine" (a box with four controllers plus shared power and cooling units), using the RDMA protocol. Today hi-end Dorado V6 models can be equipped with two such "engines" (that is, eight controllers).

The third fabric is responsible for the backend and consists of RDMA 100G network cards.

Finally, the fourth fabric "in hardware" is represented by pluggable intelligent drive shelves.

Such a symmetric structure unleashes the full potential of NVMe and ensures high performance and reliability. I/O is parallelized as much as possible across processors and cores, allowing simultaneous reads and writes in multiple threads.



... and what it gave us


The maximum performance of Dorado V6 is roughly three times that of previous-generation systems of the same class and can reach 20 million IOPS.

This is because in the previous generation of devices NVMe support extended only to the drive shelves. Now it is present end to end, from host to SSD. The backend network has also changed: SAS/PCIe gave way to RoCEv2 with a bandwidth of 100 Gb/s.

The SSD form factor has changed as well. Where a 2U shelf used to hold 25 drives, it now holds 36 palm-sized drives. In addition, the shelves have become "smarter": each now carries a fault-tolerant pair of controllers based on ARM chips similar to those installed in the central controllers.



So far these are only engaged in data reorganization, but with the release of new firmware, compression and erasure coding will be offloaded to them as well, reducing the load on the main controllers from 15% to 5%. Moving part of the work to the shelf also frees up internal network bandwidth. All of this significantly increases the system's scalability headroom.

Compression and deduplication in previous-generation storage systems worked with fixed-length blocks. A mode of working with variable-length blocks has now been added; for the time being it has to be enabled explicitly, though subsequent firmware may change that.
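Why do variable-length blocks matter? With fixed-length blocks, inserting a single byte shifts every subsequent block boundary and ruins deduplication; content-defined chunking lets boundaries resynchronize after the change. Below is a toy Python sketch of the idea. The hash, mask, and chunk-size limits are invented for illustration; real arrays use much more sophisticated (and often hardware-assisted) fingerprinting such as sliding-window Rabin hashes.

```python
# A toy sketch of variable-length deduplication via content-defined
# chunking. The cut condition and parameters are illustrative only.

import hashlib

MASK = 0x3FF   # on average one cut point per ~1024 bytes

def chunk(data: bytes, min_len: int = 256, max_len: int = 4096):
    """Split data at content-defined boundaries (toy hash, reset per chunk)."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if (i - start >= min_len and (h & MASK) == MASK) or i - start >= max_len:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup(data: bytes) -> dict[str, bytes]:
    """Store each unique chunk exactly once, keyed by its fingerprint."""
    return {hashlib.sha256(c).hexdigest(): c for c in chunk(data)}
```

Because the cut points depend on the data itself rather than on fixed offsets, most chunks after an insertion still hash to the same fingerprints and deduplicate as before.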

A brief word on fault tolerance. Dorado V3 remained operational if one of its two controllers failed. Dorado V6 keeps data available even if seven out of eight controllers fail consecutively, or all four controllers of one "engine" fail.



The economics of reliability


Recently Huawei surveyed its customers about what amount of downtime for various IT infrastructure elements they consider acceptable. For the most part, respondents tolerated a hypothetical situation in which an application does not respond for several hundred seconds. For the operating system or a host bus adapter, the critical downtime was tens of seconds (essentially, the reboot time). The network is held to a higher standard: its bandwidth should not disappear for more than 10-20 seconds. As you might guess, respondents considered storage failures the most critical of all. From the business point of view, storage downtime should not exceed... a few seconds per year!
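For reference, here is the quick arithmetic connecting availability percentages with "a few seconds per year" (a hypothetical calculation for illustration, not a Huawei specification):

```python
# Downtime budget implied by a given availability level.

SECONDS_PER_YEAR = 365 * 24 * 3600

for nines in (3, 4, 5, 6):
    availability = 1 - 10 ** -nines
    downtime = SECONDS_PER_YEAR * (1 - availability)
    print(f"{availability:.4%} -> {downtime:,.0f} s/year")

# 99.9000% -> 31,536 s/year  (~8.8 hours)
# 99.9900% ->  3,154 s/year  (~53 minutes)
# 99.9990% ->    315 s/year  (~5 minutes)
# 99.9999% ->     32 s/year  ("a few seconds" territory)
```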

In other words, if a bank's client application does not respond for 100 seconds, this most likely will not cause catastrophic consequences. But if the storage system is down for the same amount of time, a business stoppage and significant financial losses are likely.



The chart above shows the cost of an hour of downtime for the ten largest banks (Forbes data for 2017). You must agree: if your company approaches the size of the Chinese banks, justifying a multi-million-dollar storage purchase will not be that difficult. The converse is also true: if a business does not incur significant losses during downtime, it is unlikely to buy hi-end storage. In any case, it is important to understand what size of hole threatens to open in your wallet while the system administrator deals with a storage system that has stopped working.




One-second failover


In Solution A in the illustration above you may recognize our previous-generation Dorado V3. Its four controllers work in pairs, and only two controllers hold copies of the cache. Controllers within a pair can redistribute the load between themselves. At the same time, as you can see, there are no frontend or backend fabrics, so each drive shelf is connected to a specific controller pair.

The Solution B diagram shows a solution currently on the market from another vendor (can you guess which one?). Here there are already frontend and backend fabrics, and drives are connected directly to four controllers. However, the internal system algorithms have nuances that are not obvious at first glance.

On the right is our current Dorado V6 storage architecture with all its internal elements. Let us consider how these systems survive a typical situation: the failure of one controller.

In classic systems, which include Dorado V3, redistributing the load after a failure takes up to four seconds, during which I/O stops completely. In Solution B from our colleagues, despite the more modern architecture, the downtime during a failure is even longer: six seconds.

Dorado V6 resumes operation just one second after a failure. This is achieved thanks to the homogeneous internal RDMA environment, which allows a controller to access "foreign" memory. The second important factor is the frontend fabric, thanks to which the path seen by the host does not change: the port stays the same, and the load is simply steered to healthy controllers by the multipathing drivers.

The failure of a second controller in Dorado V6 is handled in the same one second by the same scheme. Dorado V3 needs about six seconds, and the other vendor's solution nine. For many DBMSs such intervals can no longer be considered acceptable: during this time the system goes into standby and stops working. This applies first of all to a DBMS consisting of many partitions.

Solution A cannot survive the failure of a third controller at all, simply because access to part of the data disks is lost. Solution B does restore operability in this situation, again taking nine seconds, as in the previous case.

What does Dorado V6 have? One second.



What can be done in a second


Almost nothing, but we don't need more. Once again: in the hi-end Dorado V6 class, the frontend fabric is decoupled from the controller fabric. This means there are no hard-wired ports belonging to a specific controller. Failover does not involve finding alternative paths or re-initializing multipathing; the system simply keeps working as it worked.
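Schematically, the idea looks like this (an illustrative Python sketch, not the actual Dorado V6 internals): the host-visible port belongs to the fabric, and the fabric's routing table is the only thing that changes on a failure.

```python
# A sketch of why the host-facing path survives a controller failure when
# the frontend fabric is decoupled from the controllers: the port stays
# put, and the fabric simply redirects requests to surviving controllers.
# Names are illustrative.

class FrontEndFabric:
    def __init__(self, controllers: list[str]):
        self.alive = set(controllers)

    def fail(self, controller: str) -> None:
        self.alive.discard(controller)

    def route(self, shard_id: int) -> str:
        """Pick a surviving controller; the host-visible port never changes."""
        candidates = sorted(self.alive)
        return candidates[shard_id % len(candidates)]

fabric = FrontEndFabric(["C0", "C1", "C2", "C3"])
fabric.fail("C1")                        # one controller dies...
assert fabric.route(5) in fabric.alive   # ...requests still flow
```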



Multiple Failure Resistance


The older Dorado V6 models survive the simultaneous failure of any two (!) controllers from any "engines" without any problems. This is possible because the solution now stores three copies of the cache, so even after a double failure there is always at least one complete copy.

The simultaneous failure of all four controllers in one of the "engines" will not be fatal either, since at any moment the three cache copies are distributed across the "engines". The system itself enforces this placement logic.
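The placement rule can be illustrated with a small Python sketch. The engine and controller names here are made up; the point is only that three cache copies are never allowed to sit in a single "engine":

```python
# A sketch of the "cache copies must span engines" rule. Illustrative only;
# the real distribution logic is internal to the array.

from itertools import combinations

ENGINES = {"E0": ["C0", "C1", "C2", "C3"], "E1": ["C4", "C5", "C6", "C7"]}

def place_copies(n_copies: int = 3) -> list[str]:
    """Choose controllers for cache copies so that they span both engines."""
    all_controllers = [c for cs in ENGINES.values() for c in cs]
    for combo in combinations(all_controllers, n_copies):
        engines_used = {e for e, cs in ENGINES.items() for c in combo if c in cs}
        if len(engines_used) == len(ENGINES):   # copies must span all engines
            return list(combo)
    raise RuntimeError("not enough healthy controllers")

copies = place_copies()
# Losing every controller of one engine still leaves at least one copy:
surviving = [c for c in copies if c not in ENGINES["E0"]]
assert surviving
```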

Finally, a very unlikely scenario: the consecutive failure of seven out of eight controllers. Here the minimum interval between individual failures needed to stay operational is 15 minutes; that is the time the storage system needs to complete the cache migration.

The last surviving controller will keep the storage running and maintain the cache for five days (the default value, easily changed in the settings). After that the cache is disabled, but the storage system continues to operate.



Non-disruptive updates


The new Dorado V6 OS allows you to update the storage system software without rebooting the controllers.

As in previous solutions, the OS is based on Linux, but many processes have been moved from the kernel to user space. Most functions, such as those responsible for deduplication and compression, are now ordinary daemons running in the background. Thanks to this, updating individual modules does not require replacing the whole operating system: to add support for a new protocol, say, you only need to stop the corresponding software module and start the new one.
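Conceptually, such a module update looks something like the following Python sketch. The registry, module names, and versions are hypothetical; it only illustrates that one daemon is replaced while the rest keep serving:

```python
# A sketch of the "restart one module, not the whole OS" idea: each
# feature runs as a separate user-space daemon, so adding or updating a
# protocol means replacing one process. All names are hypothetical.

class ModuleRegistry:
    def __init__(self):
        self.running: dict[str, str] = {}   # module name -> version

    def start(self, name: str, version: str) -> None:
        self.running[name] = version
        print(f"started {name} {version}")

    def swap(self, name: str, new_version: str) -> None:
        """Stop one daemon and start its replacement; others keep serving."""
        old = self.running.get(name)
        print(f"stopping {name} {old}")
        self.start(name, new_version)

reg = ModuleRegistry()
reg.start("dedup", "1.0")
reg.start("iscsi", "1.0")
reg.swap("iscsi", "2.0")   # the dedup daemon never notices
```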

Of course, whole-system updates have not disappeared entirely: there may still be kernel components that need updating. But such components, by our observations, account for less than 6% of the total, which lets you reboot the controllers dozens of times less often than before.



Disaster recovery and high availability (HA/DR)


Dorado V6 is ready out of the box for integration into geo-distributed solutions, metropolitan-level clusters (metro clusters), and "triple" data centers.

On the left of the illustration above is a metro cluster already familiar to many: two storage systems operate in active/active mode at a distance of up to 100 km from each other. Such an infrastructure, with one or more quorum servers, can be supported by solutions from different companies, including our cloud operating system FusionSphere. The characteristics of the inter-site channel are what matter most in such projects; all other tasks in our case are handled by the HyperMetro function, available, again, out of the box. Integration over Fiber Channel as well as over iSCSI in IP networks is possible if the need arises. Dedicated "dark" fiber is no longer required: the system can communicate over existing channels.

When building such systems, the only hardware requirement for the storage is to allocate ports for replication. It is enough to acquire a license, deploy quorum servers (physical or virtual), and provide IP connectivity to the controllers (10 Mbps of bandwidth, up to 50 ms of latency).
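The arbitration logic of such a quorum scheme can be sketched roughly as follows (an illustrative model, not the actual HyperMetro algorithm):

```python
# A sketch of quorum arbitration in a two-site metro cluster: when the
# inter-site link drops, the site that still reaches the quorum server
# keeps serving I/O while the other fences itself. Illustrative logic only.

def may_serve_io(link_to_peer_up: bool, link_to_quorum_up: bool,
                 is_preferred_site: bool) -> bool:
    if link_to_peer_up:
        return True              # normal active/active operation
    if link_to_quorum_up:
        return True              # won arbitration: keep serving
    return is_preferred_site     # no quorum: only a preselected site survives

# Split brain: site A reaches the quorum server, site B does not.
assert may_serve_io(False, True, is_preferred_site=False) is True
assert may_serve_io(False, False, is_preferred_site=False) is False
```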

This architecture is easy to extend to a three-data-center system (see the right side of the illustration): for example, two data centers operate in metro cluster mode, while the third site, located more than 100 km away, uses asynchronous replication.

Technologically, the system supports various business scenarios that come into play in the event of a large-scale outage.



Metro Cluster Survival with Multiple Failures


The illustrations above and below show the same classic metro cluster, consisting of two storage systems and a quorum server. As you can see, in six of the nine possible multiple-failure scenarios our infrastructure remains operational.

For example, in the second scenario, if the quorum server fails and synchronization between the sites is then lost, the system stays productive: the second site simply stops serving. This behavior is built into the system's algorithms.

Even after three failures, access to data can be preserved, provided the interval between them is at least 15 seconds.



The usual ace up our sleeve


Recall that Huawei produces not only storage systems but a full range of network equipment. Whichever storage vendor you choose, if a WDM network is used between the sites, in 90% of cases it will be built on our company's solutions. A logical question arises: why assemble a zoo of systems when all the hardware, guaranteed to be mutually compatible, can be obtained from a single vendor?



On the question of performance


Probably nobody needs convincing that the move to all-flash storage can significantly reduce infrastructure maintenance costs, since all routine operations run many times faster; every supplier of such equipment will tell you so. Yet many vendors start to equivocate when it comes to the performance drop caused by enabling various storage features.

In our industry it is common to hand a storage system over for trial operation for one or two days. The vendor runs a 20-minute test on an empty system and gets sky-high performance figures. In real operation the hidden pitfalls surface quickly: within a day the beautiful IOPS numbers drop by a factor of two or three, and once the array is 80% full they fall even further. Enabling RAID 5 instead of RAID 10 costs another 10-15%, and metro cluster mode halves performance yet again.

None of the above applies to Dorado V6. Our customers have the opportunity to run a performance test over a weekend, or at least overnight. That is when garbage collection shows itself, and it also becomes clear how enabling various options, such as snapshots and replication, affects the achievable IOPS.

On Dorado V6, snapshots and parity RAID have virtually no effect on performance (3-5% instead of the usual 10-15%). Garbage collection (reclaiming and zeroing out drive cells), compression, and deduplication on a system that is 80% full will always affect overall request-processing speed. But what makes Dorado V6 interesting is that whatever combination of features and protective mechanisms you enable, total performance will not fall below 80% of the figure obtained on an unloaded system.



Load balancing


The high performance of Dorado V6 is achieved through balancing at every stage, namely:

  • multipathing;
  • use of several connections from one host;
  • the presence of a frontend fabric;
  • parallelized operation of the storage controllers;
  • load balancing across all drives thanks to RAID 2.0+.

In principle, this is common practice. Few people nowadays keep all their data on a single LUN: everyone tries to have eight, or even forty or more. This is the obvious and correct approach, and we share it. But if your task requires just one LUN, which is easier to maintain, our architecture can deliver on it 80% of the performance available with multiple LUNs.



Dynamic processor load scheduling


Load distribution across processors with a single LUN works as follows: tasks at the LUN level are split into small "shards", each of which is rigidly pinned to a specific controller in the "engine". This is done so that the system does not lose performance "hopping" with the same piece of data between controllers.
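In a sketch (illustrative Python; the shard size and mapping are assumptions, not the real internals), the pinning might look like this:

```python
# A sketch of pinning LUN "shards" to controllers so that a given slice of
# the address space is always handled by the same CPU (no cache ping-pong).
# Shard size and mapping are illustrative.

SHARD_SIZE = 64 * 1024 * 1024      # 64 MiB slices of the LUN address space
CONTROLLERS = ["C0", "C1", "C2", "C3"]

def owner(lun_offset: int) -> str:
    """Every I/O that touches the same shard lands on the same controller."""
    shard_id = lun_offset // SHARD_SIZE
    return CONTROLLERS[shard_id % len(CONTROLLERS)]

assert owner(10) == owner(SHARD_SIZE - 1)    # same shard, same controller
assert owner(0) != owner(SHARD_SIZE)         # next shard, next controller
```

A deterministic mapping like this is also what lets a single LUN spread its load across all controllers instead of saturating one of them.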

Another mechanism for sustaining high performance is dynamic scheduling, in which processor cores can be reassigned between task pools. For example, if the system is currently idle at the deduplication and compression level, some of those cores can be enlisted to service I/O, or vice versa. All of this happens automatically and transparently to the user.
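A toy model of such rebalancing might look like this (the pool names and thresholds are invented for illustration):

```python
# A sketch of dynamic core scheduling between task pools: idle cores in
# the dedup/compression pool are lent to the I/O pool and returned when
# the background load picks up again. Thresholds are made up.

def rebalance(pools: dict[str, dict]) -> dict[str, int]:
    """Move one core from each underloaded pool to an overloaded one."""
    cores = {name: p["cores"] for name, p in pools.items()}
    donors = [n for n, p in pools.items() if p["load"] < 0.2 and cores[n] > 1]
    takers = [n for n, p in pools.items() if p["load"] > 0.8]
    for donor in donors:
        for taker in takers:
            cores[donor] -= 1
            cores[taker] += 1
            break
    return cores

pools = {"io":    {"cores": 8, "load": 0.95},
         "dedup": {"cores": 4, "load": 0.10}}
print(rebalance(pools))   # {'io': 9, 'dedup': 3}
```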

The current load of each Dorado V6 core is not shown in the graphical interface, but you can reach the controller OS through the command line and use the familiar Linux top command.



Support for NVMe and RoCE


As already mentioned, Dorado V6 currently supports NVMe over Fiber Channel fully, out of the box and without any licenses. NVMe over Ethernet support is expected mid-year. To use it fully, you will need RoCE v2 support on the side of the storage system itself as well as in the switches and network adapters, for example Mellanox ConnectX-4 or ConnectX-5; network cards based on our own chips can also be used. RoCE support must also be present at the operating system level.

Overall, we consider Dorado V6 an NVMe-oriented system. Fiber Channel and iSCSI remain supported, but the plan is to move to high-speed Ethernet with RDMA in the future.




A pinch of marketing


Because Dorado V6 is highly fault-tolerant, scales well, supports various migration technologies, and so on, the economic effect of acquiring it shows up once intensive operation begins. We will keep trying to make owning the system as profitable as possible, even if that is not striking at the first stage.

In particular, we have created the FLASH EVER program, aimed at extending the storage life cycle and relieving the customer as much as possible during upgrades.



This program includes a number of measures:

  • controller upgrades without stopping operation (for Dorado V6 hi-end);
  • data migration without service interruption (between Dorado systems);
  • transfer of licenses to new systems (within the Dorado family).



It remains to note that the difficult situation in the world has had little effect on the new system's commercial prospects. Although the official release of Dorado V6 took place only in January, we already see significant demand for it in China, as well as great interest from Russian and international partners in the financial sector and government structures.

Among other things, the pandemic, however long it lasts, has made the question of providing remote employees with virtual desktops especially acute. Dorado V6 could resolve many issues here as well. We are making every effort in this direction, including the nearly completed inclusion of the new system in the VMware compatibility list.

***


By the way, don't forget about our numerous webinars, held not only in the Russian-language segment but also globally. The list of webinars for April is available here.
