What is Azure Stack HCI and how does it work

Hello, Habr! Today we want to talk about the Azure Stack HCI platform: what it is, what hardware it is built from, what software it includes, and how it all works. Let's get started!



This is a guest post from the team at AltaStor. AltaStor is a system integrator specializing in reliable data storage solutions. Drawing on its accumulated expertise in building failover clusters and HCI, it selects an individual solution for each client that best suits that client's tasks.

What is Azure Stack HCI?


This is a hyperconverged solution that combines several products:

  • Hardware from a Microsoft-certified OEM partner.
  • The Windows Server 2019 Datacenter operating system.
  • The Windows Admin Center software.
  • Microsoft Azure services, if needed.

This solution has been on the market for a long time, and some of our customers have been using it successfully for years. However, they do not publish performance test results for their installations. We decided to fill this gap and share our experience with Azure Stack HCI using one specific example.
 
For documentation and general information on Azure Stack HCI, click here.
 

Test bed layout

 

 

Equipment


Building the solution requires a hardware platform recommended by Microsoft. Leading server hardware manufacturers - HPE, Dell EMC, Fujitsu, Hitachi, Lenovo, and others - have developed their own configurations, tested them for compatibility, and had them certified for Azure Stack HCI.
 
A complete list of compatible equipment is available here.
 
Depending on the types of drives used, the platform components will vary.
 
We prefer to build such solutions on Fujitsu servers with the Windows Server 2019 Datacenter operating system preinstalled. After the sale, this manufacturer supports the entire hardware and software complex as a complete solution, not just its own hardware. This is important both for us as partners and for the end customer.
 
Currently, Fujitsu has five certified configurations that differ in drive types, server models, and number of nodes. The maximum number of nodes for Azure Stack HCI is 16 and the minimum is 2, although some configurations are limited to 4.
 
All compatible Fujitsu configurations can be viewed here.
 
For our installation, we chose the most efficient of the currently certified configurations: Fujitsu Primergy servers with SSD drives for data storage and ultra-fast Intel Optane memory modules connected over NVMe as the system cache. We expect to get a software-defined all-flash array with performance comparable to a classic storage system with SSD drives and an NVMe cache.
 
All-flash storage systems from industry leaders use similar combinations of media. We know what IOPS and latency figures such systems deliver in practice, and we expect comparable performance from Azure Stack HCI on the selected Fujitsu configuration.


 
The architecture of this Fujitsu solution is described in detail in a document available here.
 
We recommend that you familiarize yourself with it before installation.
 

 
The document describes the limitations of the architecture, typical connection schemes, and a lot of other information useful at the implementation stage.
 


Switches

 
Fujitsu's solution uses its own PSWITCH Ethernet switches. We noted the following advantages for ourselves:
 
  • The switches of this series deliver high performance at a low cost.
  • The switches are quite simple to configure and use a Cisco-like CLI; our engineers did not run into any difficulties during installation.
  • There are no proprietary quirks in administration, and well-written documentation is available.

Fujitsu is one of the industry leaders in switching equipment in Japan. Its switches have only recently become available on the Russian market, but they are already regularly used in projects by our architects and other Fujitsu partners. A limited number of models are currently available.
 
Learn more about Fujitsu switches on the official website.
 

Server


Inside the server, Intel Optane memory cards occupy a significant part of the space. 
 



 
Intel pays a lot of attention to performance under high thermal load. On the one hand, large heatsinks are used for the best possible cooling. On the other hand, they restrict the airflow inside the server.
 
This is one of the key points taken into account when certifying a configuration: all possible scenarios must be considered in which insufficient cooling could cause the servers to overheat the Optane modules, or vice versa.
 
When relocating a server room, our clients have more than once run into a situation where the air conditioning system had not yet been put into operation. So we decided to check how demanding this installation is on the cooling system and see how the platform behaves under load outside a cooled server room.
 
The tests were carried out at room temperature, and we did not run into any thermal throttling, performance degradation, or errors caused by overheating. We saw for ourselves that the tested servers maintain the declared performance at ambient temperatures of up to +45 degrees Celsius.
 
Note: this experiment should not be taken as a recommendation to abandon proper server rooms with high-quality ventilation. When choosing a hardware vendor, be sure to pay attention to the maximum supported ambient temperature.
 

Hardware platform assembly

 
Front view:
 

 
Rear view:
 

 
Only one switch was used in the test. For production use, we always recommend providing redundant access paths through at least two switches. According to our statistics, the most common hardware failure in clusters is an accidentally damaged cable or a poor contact in a connector.
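At the host level, a common complement to switch redundancy is to combine the two uplinks (one per switch) into a Switch Embedded Teaming (SET) virtual switch, so that traffic survives the loss of either path. A minimal sketch, assuming placeholder adapter names:

```powershell
# Sketch: create a SET team from two physical uplinks, one connected to each
# switch ("NIC1" and "NIC2" are placeholder adapter names, not our actual ones).
New-VMSwitch -Name "SETswitch" -NetAdapterName "NIC1","NIC2" `
             -EnableEmbeddedTeaming $true -AllowManagementOS $true
```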
 
A Fujitsu RX1330 was used as the management server. It was also assigned the roles of arbiter and quorum server.
 

Cluster Deployment

 
The first stage consisted of the physical installation of the hardware components, connecting the interface cables, and so on. This was followed by software setup; since the operating system was already preinstalled, we deployed Storage Spaces Direct on each server and built a cluster of two nodes plus an arbiter.
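For reference, the same steps can be scripted. Below is a minimal PowerShell sketch of a two-node Storage Spaces Direct deployment; the node names, cluster name, and IP address are placeholders rather than the values used on our test bed:

```powershell
# Validate the nodes for a Storage Spaces Direct cluster
Test-Cluster -Node "node1","node2" `
    -Include "Storage Spaces Direct","Inventory","Network","System Configuration"

# Create the cluster itself without claiming any shared storage yet
New-Cluster -Name "S2D-CLU" -Node "node1","node2" -NoStorage -StaticAddress "10.0.0.10"

# Claim the local NVMe/SSD devices of all nodes and build the storage pool
Enable-ClusterStorageSpacesDirect
```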
 
Then we used the Fujitsu Infrastructure Manager utility, a Windows Admin Center extension that integrates closely with Fujitsu server hardware and provides access to the management tools from Azure, such as:

  • Azure Site Recovery provides high availability and disaster recovery as a service (DRaaS).
  • Azure Monitor is a centralized hub for monitoring applications, networks, and infrastructure, with in-depth AI-based analytics.
  • Cloud Witness lets you use Azure as the witness for the cluster quorum.
  • Azure Backup protects data off-site, including against ransomware.
  • Azure File Sync synchronizes a Windows file server with Azure.
  • Azure Network Adapter connects on-premises resources to Azure over a point-to-site VPN connection.
  • Azure Update Management inventories and installs updates on the cluster nodes.

The extension allows you to automate a number of tasks that can also be performed directly in the Admin Center.

We assembled a Storage Pool and created volumes in it. The virtual machines on which we later ran the performance tests were placed on these volumes. Both the volumes and the virtual machines are conveniently managed from a single window.
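As a rough illustration, a volume on the Storage Spaces Direct pool can also be created from PowerShell; the friendly names and size below are placeholders, not the exact values we used:

```powershell
# Sketch: create a cluster shared volume on the S2D pool.
# Resiliency defaults depend on the node count; an explicit setting can be
# passed with -ResiliencySettingName if needed.
New-Volume -StoragePoolFriendlyName "S2D*" `
           -FriendlyName "VMVolume01" `
           -FileSystem CSVFS_ReFS `
           -Size 2TB
```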
 

 
Fujitsu Infrastructure Manager is also convenient for many tasks related to scheduled maintenance and firmware (microcode) updates. The status of all equipment is clearly displayed, and much can be automated.
 

 
There are two versions of the Fujitsu Infrastructure Manager utility - paid and free:
 
  • Free. Available for download from the manufacturer's website; it is quite sufficient for server management.
  • Paid. Required for deep integration with Microsoft Azure Stack HCI — it adds the server management plug-in for Windows Server.

For deep integration of Primergy with Microsoft Azure Stack HCI, you need the server management plug-in for Windows Server, which is available only in the paid version. That is why it is included in the FUJITSU Integrated System PRIMEFLEX for Microsoft Azure Stack HCI solution.
 
The larger your installation, the more valuable the automation the utility provides.
There are only two nodes in our test bed, so we could have done all the work manually. With four nodes or more, the software significantly reduces installation and administration effort. The utility costs less than 1% of the project but noticeably speeds up commissioning.
 
Within Windows Admin Center, Fujitsu Infrastructure Manager Orchestra is installed as an extension pack:
 

 
The same screenshot shows the composition of the server's disk subsystem: two Optane modules are used as a cache extension, and five SSD drives form the Tier-1 storage pool.
 

Important points


When building the solution, there are several nuances to keep in mind.
 
There are two ways to manage Microsoft Azure Stack HCI - through Windows Admin Center or through Fujitsu Infrastructure Manager.
 
Windows Admin Center also has its advantages: you can deploy it on almost anything, even a laptop, and command-line management is available as well. With it, an administrator can do almost everything.
 
There is also Failover Cluster Manager - an indispensable tool for any problems with the cluster.
 
When deploying the witness (quorum server), it is important to add it to Active Directory and to check that it is reachable from all nodes. The requirements for this role are minimal, and it can be placed on any basic server.
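A hedged sketch of this step in PowerShell; the management server name, cluster name, and share path are placeholders:

```powershell
# Check from each node that the witness host is reachable over SMB
Test-NetConnection -ComputerName "mgmt-srv" -Port 445

# Point the cluster quorum at a file share witness hosted on the management server
Set-ClusterQuorum -Cluster "S2D-CLU" -FileShareWitness "\\mgmt-srv\Witness"
```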

 
From the point of view of Windows Server, there are three types of disk devices: NVMe, SSD, and HDD. The logic is as follows: NVMe devices serve as the read/write cache, SSDs form the Tier-1 storage level, and HDDs form the Tier-2 storage level. You can then configure policies for moving data between tiers. NVDIMMs can also be used as a cache.
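To see how Windows Server has classified the drives and which tiers Storage Spaces Direct has built, something like the following can be run on a node (a sketch; the exact output depends on the hardware):

```powershell
# Group the physical drives by media and bus type (NVMe, SSD, HDD)
Get-PhysicalDisk | Group-Object MediaType, BusType | Select-Object Count, Name

# The fastest devices are claimed as cache automatically; the rest form the storage tiers
Get-StorageTier | Select-Object FriendlyName, MediaType, ResiliencySettingName
```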
 
The default block size for tiering is 4K, but it may vary depending on the file system used inside the virtual machine. This will subsequently affect performance.
 
We use NVMe modules as a cache, so the speed of reading and writing data will be very different - this will be clearly seen in performance tests:
 
  • Reads are served mainly from the SSD storage pool (Tier-1, the main storage).
  • Writes land in the NVMe cache first, are replicated to the second node, and are then destaged to the SSDs, so the write path is longer.

Before creating the cluster, validation and all tests in Failover Cluster Manager must be completed. The report needs to be saved: without it, you will not be able to open a service case with Microsoft support if you ever need to.
 
When new nodes are added to an existing cluster, they are automatically added to the storage pool. After about 15 minutes, the cluster automatically rebuilds and rebalances the storage pool. This may affect performance while the rebuild is running.
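For illustration, adding a node and watching the subsequent rebalance could look like this (the node and cluster names are placeholders):

```powershell
# Join a new node to the existing cluster; its eligible drives are pooled automatically
Add-ClusterNode -Cluster "S2D-CLU" -Name "node3"

# Monitor the rebuild/rebalance jobs; expect reduced performance while they run
Get-StorageJob
```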
 

Performance tests


Now let's move on to the most interesting part - load testing.
 
Testing configuration:
 
  • two Fujitsu PRIMERGY RX2540 servers assembled in a cluster;
  • each server has two Intel Optane storage class memory modules installed, used to expand the read / write cache;
  • the SSD drives in each server are combined into the storage pool;
  • erasure coding (an analogue of RAID-5) is used for data resiliency.

In fact, it is a software-defined storage system running Windows Server 2019 Azure Stack HCI.
 
We start with the first test: 12 virtual machines running across both nodes. The read/write load profile is 70:30 with a block size of 8K. This block size was chosen because most modern transactional databases and OLTP workloads use exactly such a block size and roughly this read/write ratio.
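A comparable random 70:30 profile with an 8K block can be reproduced, for example, with Microsoft's DISKSPD tool; this is purely an illustration, and the duration, thread count, queue depth, and target path below are placeholder values rather than the exact parameters of our runs:

```powershell
# 8K blocks, random I/O, 30% writes, 8 threads, 32 outstanding I/Os per thread,
# caching disabled, latency statistics collected, 50 GB test file
.\diskspd.exe -b8K -r -w30 -t8 -o32 -d300 -Sh -L -c50G D:\io.dat
```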
 

 
The steady-state cluster performance is 428K IOPS at a latency of 0.487 ms. This is a genuinely solid result, quite comparable to what you can get from specialized all-flash storage systems from many manufacturers.
 
Independent tests with a similar load profile are published on spcresults.org - the SPC-1 benchmark. The only difference from our configuration is the block size: SPC-1 uses 4K.
 
If we greatly simplify the comparison methodology, we can halve the IOPS figures published for all-flash storage systems and compare them with the numbers we obtained at the same response time. The results from our cluster of two mid-range servers are quite comparable with most storage systems.

Of course, such a comparison is not entirely fair, because in our case increasing the number of drives will affect performance and latency quite differently than in a specialized storage system. But even with all these assumptions, it is fair to say that a couple of years ago such performance figures could only be seen on a mid-range or even high-end multi-controller external storage system. Today this is achievable on a hyperconverged solution.
 
The performance picture changes significantly when deduplication is enabled and the measurements are repeated with the same 8K block size. Simply enabling deduplication on the same load profile drops performance below 300K IOPS.
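For reference, deduplication on a cluster shared volume can be enabled from PowerShell; the volume path below is a placeholder:

```powershell
# Enable deduplication on a CSV volume used for Hyper-V workloads
Enable-DedupVolume -Volume "C:\ClusterStorage\VMVolume01" -UsageType HyperV

# Check the space savings achieved so far
Get-DedupStatus | Select-Object Volume, SavingsRate, OptimizedFilesCount
```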

If we run two load profiles with an 8K block, one of them 100% reads and the other 100% writes, the best numbers we were able to get are shown below:
 

 
The read results are excellent, especially considering the latency of 12 μs. Here Optane really shines as a read cache, with proactive algorithms that prefetch data into the cache. The storage pool itself, located on the SSDs, also shows very good read numbers.
 
But the write speed is very different. Here are a few serious factors:

  • The architecture of the solution: data written to the cache of one node is copied over the network to the cache of the second node.
  • The write path itself: data first lands in the Optane cache and is only then destaged to the SSDs, which takes time. With deduplication enabled (in our tests it saved about 45% of capacity), the write path becomes even longer.
  • The capacity SSDs are read-oriented 3D-NAND drives, and writing to 3D-NAND is inherently slower than reading from it.


Findings
 
 
  • An 8K block is typical for OLTP workloads.
  • Deduplication can be enabled at any time, but it significantly reduces performance. The deduplication efficiency in our tests was 45% with a performance drop of more than 25%. 

This gives you freedom of choice - either higher storage performance or almost twice as much capacity. Also, much will depend on the load profile and the ability to compress the recorded data.

  • Due to the architecture of the solution, sequential write operations significantly increase the response time. 
  • It is not without reason that Microsoft requires building the solution only on validated configurations from OEM partners - this avoids many problems both during the initial installation and in further operation.
  • Working with Fujitsu hardware, as always, left only a positive impression: sensible documentation and many useful additions in Infrastructure Manager - this software package really does simplify system management, which is especially important as the number of nodes grows.
  • Fujitsu's PRIMEFLEX solution includes a set of scripts that speeds up deployment. They make it easy to start up and configure the solution in general, and Fujitsu PRIMERGY servers in particular.


 
For those who are not interested in configuring the solution themselves, there is the option of concluding a Technical Solution Contract with Fujitsu. In that case the vendor's technical specialists will deploy everything on a turnkey basis and provide further support.
