Why hyperconvergence? Cisco HyperFlex Overview and Tests

In IT, the main thing is three letters


The task of any IT infrastructure is to provide a reliable platform for the company's business processes. It is traditionally believed that the quality of an IT infrastructure is assessed by three main parameters: availability, security, and reliability. However, this triad is in no way connected to the business and to the company's actual income and losses.

Three main letters rule IT. If the letters "RUB" are not at the top of your IT hierarchy, you are building your IT infrastructure incorrectly. Of course, it is difficult to build IT starting directly from income and expenses alone, so there is a hierarchy of "three letters" running from the most important to the more specific. SLA, RPO, RTO, GRC - all of these are well known to industry experts and have long been used in building infrastructures. Unfortunately, these indicators are not always linked together into an end-to-end hierarchy.



Many companies today are building infrastructure for the future using yesterday's technology on yesterday's architecture. Meanwhile, the accelerating development of IT shows that modern services are fundamentally changing not only business but society itself: people of the digital age are used to the idea that a few seconds are enough to access any information. IT has gone from an arcane technology to an everyday commodity for the masses, like a burger or a coffee shop. This has added another extremely important three-letter acronym to IT: TTM (Time to Market), the time it takes to launch a productive service on the market.



SDS


On the other hand, a kraken rose from the depths of technology, overturning both traditional IT and our way of life. As the computing power of x86 processors grew, software-defined storage became its first tentacle. Classic storage arrays were very specialized pieces of hardware stuffed with custom silicon, various proprietary hardware accelerators, and specialized software. They were administered by a specially trained person who was practically worshipped in the company like a priest of a dark cult. Expanding the company's storage system was a whole project, with a lot of calculations and approvals - after all, it's expensive!

The high cost and complexity spurred the creation of software-defined storage on top of ordinary x86 hardware running a general-purpose OS - Windows, Linux, FreeBSD, or Solaris. Of the complex custom hardware only software remained, running not even in the kernel but in user space. The first software systems were, of course, quite simple and limited in functionality, often specialized niche solutions, but time passed. By now even the large storage vendors have begun to abandon specialized hardware designs - the TTM of such systems could no longer withstand the competition, and the cost of a mistake had become very high. In fact, with rare exceptions, even classic storage arrays by 2020 had become ordinary x86 servers, just with pretty plastic bezels and a bunch of disk shelves.

The second tentacle of the approaching kraken was the appearance and mass adoption of flash memory, which became the concrete pillar that broke the elephant's back.
The performance of magnetic disks had not changed for many years, and the processors of storage controllers coped easily with hundreds of disks. But alas, quantity sooner or later turns into quality: even a midrange array, to say nothing of an entry-level one, has an upper limit on the meaningful number of flash drives. Beyond a certain number (literally a dozen or so disks), performance not only stops growing but may even begin to decline because of the ever larger volume that has to be processed. After all, the processing power and throughput of the controllers do not grow with capacity. The solution, in theory, was the emergence of scale-out systems that assemble many independent shelves of disks and processor resources into a single cluster that looks from the outside like one multi-controller storage system. Only one step remained.

Hyperconvergence


The most obvious step into the future was to unify the previously separate points of data storage and data processing. In other words, why not implement distributed storage not on separate servers but directly on the virtualization hosts, thereby dropping the dedicated storage network and dedicated hardware and combining the functions. The kraken woke up.
But wait, you might say: combining functions is convergence. So where did this silly prefix "hyper" come from?

The catch is that the word "converged" had already been taken by marketing, while the combination of storage and compute functions on the same hosts is implemented by means of SDS.

The terms break down like this:

  • Converged architecture - an architecture in which data storage and data processing are combined on the same nodes, without a dedicated storage array and SAN.
  • Converged system - everything from one source: one point of support, one part number. Not to be confused with a kit self-assembled from a single vendor's components.

And so it turns out that the term for our converged architecture is already taken. Exactly the same story as with the term "hypervisor".

Hyperconverged System - A converged system with converged architecture.

The definitions are taken from the article "General Theory and Archeology of Virtualization", in whose writing I took an active part.

What does the hyperconverged approach give us with respect to the three-letter acronyms mentioned above?

  • Start with a minimum volume (and minimum cost)
  • Storage capacity grows with computing power
  • Each node of the system brings its own controller, so the "glass ceiling" problem disappears (no more situations where the disks could do more but the controller cannot keep up)
  • Storage management is simplified dramatically

It is for this last point that hyperconverged systems are so disliked by old-school storage administrators, who are used to managing queues on Fibre Channel ports. Space is now allocated in a couple of mouse clicks from the virtual infrastructure management console.

In other words, only the cloud beats hyperconverged systems in the time it takes to launch a product, but the cloud does not suit everyone, or not always.

If you are a techie administrator and have read this far - rejoice, the general words are over, and now I will share my personal view of the Cisco HyperFlex system, which fell into my tenacious paws for a round of testing.

Cisco HyperFlex


Why Cisco


Cisco is known primarily as the dominant vendor in the network equipment market, but it is also quite widely present in other segments of the data center market, offering server and hyperconverged solutions as well as automation and management systems.

Surprisingly, even in 2020 there are still people who ask: "Cisco servers? Who do they OEM them from?"
Cisco got into the server business back in 2009, betting on blade solutions, which were growing rapidly at the time. Cisco's idea was an approach of faceless compute nodes. The result was UCS (Unified Computing System), a system consisting of two specialized switches (called Fabric Interconnects) and from 1 to 20 chassis (8 half-width blades each), i.e. up to 160 servers. The chassis itself became essentially a dumb piece of metal with power supplies; all the logic and switching live in the Fabric Interconnect, and the chassis is just a way to house the servers and connect them to the system. The Fabric Interconnect is fully responsible for all of the servers' interaction with the outside world - Ethernet, FC, and management. Blades are blades, you might say - what is special here apart from moving the switching out of the chassis, unlike everyone else?

The key point is the implementation of those very faceless compute nodes. Within the Cisco UCS concept, servers have no identity other than a serial number - no MAC, no WWN, nothing else. The UCS management system running on the Fabric Interconnect is built on server profiles and templates. After connecting a batch of servers in chassis, you assign them the appropriate profile, within which all identifying addresses and identifiers are set. Of course, if you have only a dozen servers, the game would not be worth the candle. But when there are two or even three dozen of them, this is a serious advantage. It becomes easy and quick to migrate configurations or, more importantly, to replicate server configurations in the required quantity and to apply changes to a large number of servers at once, essentially managing a set of servers (for example, a virtualization farm) as a single entity. With the right approach, UCS seriously simplifies administrators' lives, increases flexibility, and significantly reduces risk, so within 2-3 years UCS blades became the best-selling blade platform in the Western Hemisphere, and today they are globally one of the two dominant platforms alongside HPE.
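For a feel of what "servers as profiles" means in practice, here is a minimal sketch using the Cisco UCS Manager Python SDK (ucsmsdk). The address, credentials, and the bare-bones service profile below are hypothetical; in real life identities (MAC/WWN/UUID) would come from pools referenced by a template rather than being typed by hand.

```python
# Sketch: create a (very minimal) UCS service profile via ucsmsdk.
from ucsmsdk.ucshandle import UcsHandle
from ucsmsdk.mometa.ls.LsServer import LsServer

# Hypothetical Fabric Interconnect address and credentials.
handle = UcsHandle("ucs-fi.example.local", "admin", "password")
handle.login()

# A service profile object under the root org; in production this would
# reference identity pools, vNIC/vHBA templates, boot policy, and so on.
sp = LsServer(parent_mo_or_dn="org-root",
              name="esx-host-01",
              descr="virtualization farm node")
handle.add_mo(sp)
handle.commit()

handle.logout()
```

The point of the sketch is only the workflow: the server itself stays anonymous, and everything that makes it "host-01" lives in an object you can clone, template, and reassign.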

It quickly became clear that the same approach - a unified fabric with integrated, policy- and template-based management - was just as much in demand for rack servers, not only blades. And in this sense, Cisco rack-mount servers connected to the Fabric Interconnect get all the same benefits that made the blades so popular.

Today I will talk about HyperFlex, Cisco's hyperconverged solution built on rack-mount servers connected to the Fabric Interconnect. What makes HyperFlex interesting and worth a look in this review:

  • HyperFlex is Cisco's own hyperconverged development, not a rebadged third-party stack;
  • the Fabric Interconnect also allows external storage to be connected alongside HyperFlex, including SAN over native FC;
  • Cisco came to the HCI market later than others, so HyperFlex has had to prove itself against already established competitors.


HyperFlex is a true hyperconverged system with dedicated controller VMs. Let me remind you that the main advantage of such an architecture is its potential portability across hypervisors. Today Cisco supports VMware ESXi and Microsoft Hyper-V, but it is quite possible that one of the KVM variants will appear as KVM's popularity grows in the corporate segment.

Let's look at how it works using ESXi as an example.

The cache disk and the capacity-tier disks are passed directly to the controller VM (hereinafter CVM) using VM_DIRECT_PATH, so the hypervisor's disk stack has no effect on their performance. In addition, extra VIB packages are installed in the hypervisor itself:

  • IO Visor: provides the mount point for the NFS datastore for the hypervisor
  • VAAI: support for the VMware vStorage APIs for Array Integration (offloading storage operations to the platform)

Virtual disk blocks are distributed evenly across all hosts in the cluster with relatively fine granularity. When a VM on a host performs a disk operation, it passes through the hypervisor's disk stack to the datastore, then to IO Visor, which forwards it to the CVM responsible for those blocks. That CVM can be located on any host in the cluster. Given IO Visor's very limited resources, there are of course no metadata tables; the choice is determined mathematically. The CVM that receives the request then processes it. For a read, it returns data either from one of the cache levels (RAM, write cache, read cache) or from the disks of its host. For a write, it writes to the local log and duplicates the operation to one (RF2) or two (RF3) other CVMs.
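As an illustration of what "mathematically determined, with no metadata tables" can look like, here is a toy hash-based placement function in Python. This is not the actual HyperFlex algorithm, just a demonstration that any host can compute the same owner CVM from the same inputs without any shared state.

```python
import hashlib

def owner_cvm(vdisk_id, offset_bytes, cvms, block_size=8 * 1024):
    """Deterministically pick the CVM responsible for a block.

    Illustration only: IO Visor keeps no metadata table, so every host
    derives the same answer from the same inputs.
    """
    block_index = offset_bytes // block_size
    key = "{}:{}".format(vdisk_id, block_index).encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    primary = digest % len(cvms)
    # With RF2/RF3 the next one or two CVMs would hold the replica copies.
    return cvms[primary]

cvms = ["cvm-1", "cvm-2", "cvm-3", "cvm-4"]
print(owner_cvm("vm42-disk0", 1_234_567, cvms))  # same result on every host
```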



Perhaps this is quite enough to understand the operating principle for the purposes of this article; otherwise I would be taking bread away from Cisco's trainers, and I would be ashamed. Not really, but it is still enough.

Question about synthetic tests



- Navigator, instruments!
- 36!
- What 36?
- And what instruments?

This is roughly what most synthetic storage tests look like today. Why is that?

Until relatively recently, most storage systems were flat with uniform access. What does this mean?

The total available disk space was built from disks with identical characteristics, for example 300 drives at 15k rpm, and performance was the same across the entire space. With the advent of tiered storage, arrays stopped being flat: performance varies within a single disk space. And it is not just different, it is also unpredictable, depending on the algorithms and capabilities of a particular storage model.

And it all would not be so interesting if hyperconverged systems with data locality had not appeared. On top of the unevenness of the disk space itself (tiering, flash caches), access to it is also uneven: it depends on whether one of the copies of the data sits on the local disks of the node or has to be fetched over the network. All this means that synthetic test numbers can be almost anything and say nothing practically meaningful, like the fuel consumption figures in a car's advertising brochure that you will never achieve in real life.
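To make the point concrete, here is a toy back-of-the-envelope model (the latency numbers are invented, not measurements): the average read latency a VM sees is just a weighted mix of local reads and reads that take an extra network hop, so the same hardware produces different "benchmark" results depending on how much of the test data happens to be local.

```python
def effective_read_latency_ms(local_fraction, local_ms=0.3, remote_ms=0.8):
    """Weighted mix of local reads and reads with an extra network hop."""
    return local_fraction * local_ms + (1 - local_fraction) * remote_ms

for frac in (1.0, 0.5, 0.25):
    print(f"{int(frac * 100)}% local -> {effective_read_latency_ms(frac):.2f} ms")
```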

Question about sizing


The flip side of synthetic test numbers is the sizing figures and specifications that come from under a presales engineer's keyboard. Presales here fall into two categories: some simply hammer your requirements into the vendor's configurator, while others will work the sizing out themselves because they understand how the system works. But with the latter you will have to go through what you actually wrote in your requirements in detail.

As the saying goes: with a vague spec, the result is anyone's guess.



From practical experience: when sizing a rather heavy hyperconverged system in a competitive bid, I personally took the load figures from the system after the pilot and compared them with what was written in the requirements. It turned out like in the old joke:
- Rabinovich, is it true that you won a million in the lottery?
- Oh, who told you that? Not a million but ten rubles, not in the lottery but at preference, and I did not win, I lost.


In other words, the classic GIGO situation: Garbage In, Garbage Out.

In practice, sizing a hyperconverged system almost always comes down to one of two approaches: either take it with a healthy margin, or run a long pilot and take measurements.

There is one more point about sizing and comparing specifications. Different systems are built differently, work with disks differently, and their controllers interact differently, so comparing the number and capacity of disks in the specifications "head-to-head" is practically pointless. You have a set of requirements within which you understand the load level, and then there is a set of commercial proposals for various systems that meet the performance and reliability requirements. What fundamental difference does it make what type of disk is used and what it costs in system 1, or that system 2 has more or fewer of them, if both successfully cope with the task?

Since performance is largely determined by controllers living on the same hosts as the virtual machines, for some types of load it can drift quite noticeably simply because different clusters have processors of different frequencies, all other things being equal.

In other words, even the most experienced presales architect-archmage will not give you a specification any more precise than your requirements, and without a pilot project, no more precise than a rough ballpark guess.



About snapshots


HyperFlex has its own native virtual machine snapshots based on Redirect-on-Write. This is a good place to stop and look at the different snapshot technologies.
Initially there were Copy-on-Write (CoW) snapshots; native VMware vSphere snapshots are the classic example. The principle is the same whether it is a vmdk on top of VMFS or NFS, or a native format such as vSAN. After a CoW snapshot is created, the original data (blocks or vmdk files) is frozen, and any attempt to write to a frozen block creates a copy, with the data written to a new block/file (a delta file in the vmdk case). As a result, as the snapshot tree grows, the number of "parasitic" disk accesses that carry no productive value grows, performance drops, and latency rises.

Then Redirect-on-Write (RoW) snapshots were invented: instead of copying data blocks, a copy of the metadata is created, and writes simply continue without extra reads, checks, or delays. With a correct RoW implementation, snapshots have almost zero effect on disk performance. A second effect of working with metadata instead of the live data itself is not only instant snapshot creation but also VM clones that take up no space at all right after creation (ignoring the system overhead of VM service files).

And the third, key point that radically distinguishes RoW from CoW snapshots on production systems is instant snapshot deletion. Why does that matter? Remember how CoW snapshots work: deleting a snapshot is not really a deletion of the delta but its commit, and the commit time depends heavily on the size of the accumulated delta and on the performance of the disk subsystem. RoW snapshots are deleted instantly simply because no matter how many terabytes of difference accumulate, deleting (committing) a RoW snapshot is just an update of the metadata table.
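A toy sketch in Python (not the HyperFlex or VMware implementation, just an illustration of the two bookkeeping schemes): the CoW volume has to touch frozen data on the first write and commit a whole delta on snapshot deletion, while the RoW volume only ever copies and updates metadata.

```python
class CowVolume:
    """Copy-on-Write: the snapshot freezes the original blocks."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)   # live data
        self.frozen = None           # snapshot contents

    def create_snapshot(self):
        self.frozen = dict(self.blocks)          # freeze the current state

    def write(self, idx, data):
        if self.frozen is not None and idx in self.frozen:
            _old = self.blocks[idx]              # extra read/copy on first write
        self.blocks[idx] = data                  # new data goes to a new block/delta

    def delete_snapshot(self):
        # "Deleting" is really committing the accumulated delta:
        # cost grows with how much changed since the snapshot.
        self.frozen = None


class RowVolume:
    """Redirect-on-Write: snapshots and deletions only touch metadata."""
    def __init__(self, blocks):
        self.versions = {(k, 0): v for k, v in blocks.items()}  # immutable block versions
        self.live_map = {k: (k, 0) for k in blocks}             # metadata: block -> version
        self.snapshots = []

    def create_snapshot(self):
        self.snapshots.append(dict(self.live_map))  # copy metadata only

    def write(self, idx, data):
        new_version = (idx, len(self.versions))     # write is redirected to a new location
        self.versions[new_version] = data
        self.live_map[idx] = new_version            # no read of the old data

    def delete_snapshot(self, i):
        self.snapshots.pop(i)                       # constant-time metadata update


vol = RowVolume({0: b"a", 1: b"b"})
vol.create_snapshot()
vol.write(0, b"a2")      # instant, no copy of old data
vol.delete_snapshot(0)   # instant, regardless of how much has changed
```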

This opens up an interesting use of RoW snapshots: pushing RPO down to tens of minutes. Taking backups every 30 minutes is practically impossible in the general case; in most setups they are done once a day, which gives an RPO of 24 hours. But we can take RoW snapshots on a schedule, bringing the RPO down to 15-30 minutes, and keep them for a day or two: no performance penalty, only capacity spent.
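A minimal sketch of such a schedule; take_snapshot() is a deliberately hypothetical placeholder standing in for whatever interface you would actually call (HX Connect, the vSphere API, and so on).

```python
import time

SNAP_INTERVAL_S = 30 * 60        # target RPO: a snapshot every 30 minutes
RETENTION_S = 2 * 24 * 3600      # keep snapshots for about two days

def take_snapshot(vm_name):
    """Placeholder for the real snapshot call."""
    return (vm_name, time.time())

def prune(snaps, now):
    """Drop snapshots older than the retention window."""
    return [s for s in snaps if now - s[1] < RETENTION_S]

snaps = []
while True:
    snaps.append(take_snapshot("oltp-db01"))
    snaps = prune(snaps, time.time())
    time.sleep(SNAP_INTERVAL_S)
```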

But there are some nuances.

For native snapshots and VMware integration to work properly, HyperFlex needs a service snapshot called Sentinel. The Sentinel snapshot is created automatically when the first snapshot of a given VM is made through HX Connect; you should not delete it and you should not revert to it - just accept that the first entry in the VM's snapshot list in the interface is the Sentinel service snapshot.



HyperFlex snapshots can be taken in crash-consistent or application-consistent mode. The latter involves flushing buffers inside the VM, requires VMware Tools, and is triggered when the "Quiesce" checkbox is ticked in the HX Connect snapshot menu.
Besides HyperFlex snapshots, nothing prevents you from using native VMware snapshots. It is worth deciding per virtual machine which snapshot type you will use and sticking to that technology, rather than mixing different snapshots on one VM.
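For the native VMware path, an application-consistent (quiesced) snapshot can be taken via pyVmomi roughly like this; the host, credentials, and VM name are placeholders.

```python
# Sketch: a quiesced (application-consistent) native VMware snapshot via pyVmomi.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "oltp-db01")

# quiesce=True asks VMware Tools to flush guest buffers before the snapshot.
task = vm.CreateSnapshot_Task(name="pre-patch",
                              description="app-consistent snapshot",
                              memory=False, quiesce=True)
Disconnect(si)
```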

As part of the testing I created snapshots and checked them with FIO. And yes, I can confirm that the snapshots really are RoW and do not affect performance. Snapshots really are created quickly (a few seconds, depending on the load profile and dataset size). The recommendation based on the results: if your load has a lot of random writes, create snapshots from the HX Connect interface, with the "Quiesce" checkbox ticked and with a Sentinel snapshot already in place.

Tests


Test platform


The following platform fell into my tenacious paws:

  • 4 x C220 M4 (2630v4, 10 cores x 2.20 GHz, 256 GB RAM, 800 GB cache SSD + 6 x 960 GB SSD)
  • vSphere 6.7
  • HX Data Platform 4.0.2

Clear patch test


What kind of testing would it be without CrystalDiskMark? That's right, it can't be done - proper guys always start with the crystal disk! Well, if we must, we must.



For CrystalDiskMark I created a VM with 2 vCPU, 4 GB RAM and Windows 7 on board. Oh, how sick I got of installing patches on it, I can tell you! The test was done in the best traditions of the best houses of London and Paris: a single virtual disk was simply added next-next-finish without a second thought, and the test was launched. And by the way, CrystalDiskMark itself is not really doing the testing; it is just an interface that drives the disk subsystem with the well-known DiskSpd utility bundled with it.



What literally struck me is that for some reason everyone skips the choice of units in the upper right corner. And voila!



Honestly, I did not expect 75 thousand IOPS and more than a gigabyte per second from a tiny next-next-finish VM!

To put it mildly, not every company in Russia has loads that exceed these indicators in total.

Further tests were carried out with VMware HCIBench and Nutanix X-Ray, both "ideologically hostile" to HyperFlex, so it was expected that no prisoners would be taken. The numbers came out extremely close, so the X-Ray results were taken as the basis simply because it has a more convenient reporting system and ready-made load templates.

For those who trust no one and want total control over the process, I remind you of my article on building your own system for generating load on a hyperconverged platform: "Performance testing of hyperconverged systems and SDS with your own hands".

Achtung! Uwaga! Pozor!


All further results and their interpretation are the author's opinion and are given as part of exploring the system. Most of the tests are pure synthetics and are only useful for understanding the limit values in extreme and degenerate cases, which you will never reach in real life.

FourCorners Microbenchmark


The four-corners microtest is designed to quickly assess the system's theoretical limits and the peak performance of the controllers. Its practical use is to check the system right after deployment for configuration and environment errors, especially network errors: if you deploy such systems regularly, you simply know what numbers to expect "when everything is fine".









Final numbers: 280k / 174k IOPS, 3.77 / 1.72 GBps (read / write)

How did our controllers behave?





From this we can note that the total resource consumption of the 4 controllers and 4 load VMs was 49 cores at 2.2 GHz. According to VMware statistics, controller CPU utilization reached 80%, i.e. performance was in fact limited by the controllers, specifically by their processors. The speed of sequential operations ran squarely into the 10G network.

Once again: peak performance on a small 4-node cluster with far-from-fastest 2.2 GHz processors is almost 300 thousand IOPS in 4U of rack space.

Arguing "ours is 10, 20 or even 40% more/less" is practically meaningless at this order of magnitude. It is like bragging "my car does 240, mine does 280" when the speed limit is 80.

280k across 4 nodes gives a peak of 70k IOPS per node, which, for example, exceeds the figures from the VMware vSAN calculator, which assumes an All-Flash node delivers no more than 46k per disk group. In our case there is just one disk group per node in VMware terms, and it effectively runs at x1.8.

Datastore block size effect


When creating a HyperFlex datastore, you can choose the data block size - 4k or 8k.

What does it affect? Let's run the same four-corners test.





While the read picture is almost identical, writes are a different story. The four-corners test uses an 8k load.

Total numbers: 280k / 280k read, 172-158k / 200-180k write (4k vs 8k datastore block). When the block size matches the workload, you get about +15% write performance. If you expect a significant amount of small-block (4k) writes in your workload, create a datastore with a 4k block specifically for that workload; otherwise use 8k.

OLTP Simulator


A much more realistic picture comes from another test. It launches two generators with a profile close to a transactional DBMS and a load level of 6000 + 400 IOPS. What is measured here is latency, which should stay at a stable, low level.









The latency for the load VMs was 1.07 / 1.08 ms. A great result all in all, but let's turn up the heat!

Database Colocation: High Intensity


How will the transactional database behave, latency-wise, if a noisy sequential neighbor suddenly appears next to it? A very noisy one.









So, the OLTP database on node 1 generates 4200 IOPS at 0.85 ms latency. What happens when a DSS system suddenly starts consuming resources with sequential operations?
Two generators on nodes 2 and 3 load the platform at 1.18 and 1.08 GBps respectively, i.e. 2.26 GBps in total. The OLTP latency of course grows and becomes less flat, but the average stays at 1.85 ms, and the database gets its 4200 IOPS without any problems.

Snapshot impact






The system takes several snapshots in sequence, one per hour, on the OLTP database. There is nothing surprising in the graph; moreover, this is really a demonstration of how classic VMware snapshots behave, since Nutanix X-Ray cannot work with native snapshots other than its own. You should not use vSphere snapshots on a regular basis, because not all yogurts are equally healthy.

HyperFlex native snapshots work much better, use them and your hair will become soft and silky!

Big data ingestion


How will HyperFlex digest a large amount of sequentially written data - say, 1 TB?





The test took 27 minutes, including cloning, tuning and starting the generators.

Throughput scalability



Now let's gradually load the entire cluster and look at the steady-state numbers, starting with random reads, then writes.











We see a stable picture with a gradual decline in per-VM performance from 78k to 55-57k IOPS, with smooth plateaus, while overall performance steadily grows from 78k to 220k IOPS.











Writes are a little less smooth, but still show stable plateaus, from 64k down to 19-21k per VM. At the same time, the load on the controllers is much lower: if during reads the total processor consumption grew from 44 to 109 GHz, during writes it went from 57 to 73 GHz.

Here you can see the simplest and most obvious feature of hyperconverged systems: a single consumer simply cannot use up all the resources of the system, and as load is added there is no significant drop in performance. The drop we do see is the result of extreme synthetic loads designed to squeeze out every last drop, which almost never happens in normal production.

Breaking OLTP


By this point it had become almost boring how predictable HyperFlex was. We urgently need to break something!





The red dot marks the moment the controller VM was shut down on one of the loaded hosts.

Since by default HyperFlex starts a rebuild immediately only when a disk is lost, while for a lost node there is a 2-hour timeout, the green dot marks the moment the rebuild was started manually.

login as: admin
 HyperFlex StorageController 4.0(2a)
admin@192.168.***.***'s password:
admin@SpringpathController0VY9B6ERXT:~$ stcli rebalance status
rebalanceStatus:
    percentComplete: 0
    rebalanceState: cluster_rebalance_not_running
rebalanceEnabled: True
admin@SpringpathController0VY9B6ERXT:~$ stcli rebalance start -f
msgstr: Successfully started rebalance
params:
msgid: Successfully started rebalance
admin@SpringpathController0VY9B6ERXT:~$ stcli rebalance status
rebalanceStatus:
    percentComplete: 16
    rebalanceState: cluster_rebalance_ongoing
rebalanceEnabled: True
admin@SpringpathController0VY9B6ERXT:~$



Operations froze for a couple of seconds and then continued, barely noticing the rebuild - at least while the cluster is in a stable state, far from being overloaded.

Why isn't the 2-hour timeout a problem for Cisco, even though competitors quote smaller numbers? Cisco strongly recommends RF3 as the baseline level of data protection for everything except machines you would not miss. Say you decide to install patches or otherwise service a host and shut it down, and there is a chance that at that very moment another host fails. With RF2 everything then grinds to a halt, while with RF3 one active copy of the data still remains. And yes, it is indeed quite possible to survive 2 hours of degraded RF2 before recovery back to RF3 begins.

Break me completely!


Breaking it is breaking it, so let's go for full load. This time I created a test with a profile more or less resembling a real workload (70% read, 20% random, 8k, 6d 128q).



Guess where the CVM was shut down and where the rebuild began?



In the rebuild scenario HyperFlex performed quite well, without a catastrophic drop in performance or a multi-fold increase in latency, even under a load pushed to the limit. The only thing I would really like: dear Cisco, please make the default timeout less than 2 hours.

Findings


To conclude, let me recall the purpose of the testing: to look at the Cisco HyperFlex system as it is today, without digging into history, to examine its performance with synthetic tests, and to draw conclusions about its applicability to real production.

Conclusion 1, on performance. Performance is very good; there is little to add here. Since I had a previous-generation system on test, I can say one thing for sure: on HyperFlex All Flash you will run out of capacity, CPU, or memory before you run out of disk performance. Except, perhaps, for the 1% of super-loaded applications, but those need a separate conversation. Native RoW snapshots work.

Conclusion 2, on availability. After detecting a failure the system restores the required number of data copies quite well (without a multi-fold performance drop). There is a slight complaint about the 2-hour default timeout before recovery starts when a host is lost, but given the strongly recommended RF3 this is closer to nitpicking. Recovery after a disk failure starts immediately.

Conclusion 3, on price and comparison with competitors. The price of the system can vary several-fold depending on the configuration for a specific project. A huge share of the project cost will be the licensed system and application software that runs on top of the infrastructure platform. Therefore, the only way to compare with competitors is to compare specific commercial offers that meet the technical requirements of your company for a specific project.

Final conclusion: the system works and, as of April 2020, is quite mature for production use - provided the vendor's recommendations are read and applied rather than smoked.
