Scaling a high-load network with Nutanix: features, challenges, and our own case

With millions of people staying at home, Internet traffic has skyrocketed. Concerns have been raised more than once that excessive load on the network could lead to an outage at the national or even global level. So far the networks are coping, but there are other, less obvious difficulties.

What kind? Many online retailers and delivery companies cannot keep up with the surge in orders. They are losing customers, money, and reputation, partly because their IT infrastructure was not ready for a several-fold increase in data processing volumes. This could have been avoided by scaling the IT infrastructure quickly, and hyperconverged infrastructure (HCI) is one way to do that. This article is about it.

We at Platbox have been processing payments for our clients (banks and payment systems) for about seven years, at a rate of about 100 million transactions per year. This includes acquiring, payments via SMS, shares of different companies, etc.
Over time, the number of merchants whose payments we process keeps growing, and so does the load on the network.

The scaling problem has been with us since the company's first day. The load grows, we buy servers and expand the network, and the problem is solved. Then the number of transactions grows again, resources gradually run out, and the cycle repeats. By now the network is a whole zoo of equipment, with servers from vendors ranging from SuperMicro to Dell. This diversity complicates network maintenance and increases the number of routine operations.

More equipment also means higher costs. We rent racks in data centers and pay for physical space, for the electricity the servers consume, and so on. The more servers, the more we pay; it is that simple. In addition, we decided to expand our storage capacity and increase the computing power of the servers. The question was whether to keep scaling what we already had within the classic three-tier architecture or to introduce something new.

About six months ago, we decided to look for a different solution to the problems described above. There were several options, and we chose the one that suited us best: a hyperconverged infrastructure instead of the traditional one.

What is hyperconverged infrastructure (HCI)?


A short digression into the history of IT is needed here. Data was once processed on mainframes, large and powerful computers. They were then replaced by cheaper and more flexible “standard architecture” servers, and the concept of 3-tier architecture arose, which divided the data center into a separate storage subsystem (SAN), a processing subsystem (servers), and a data transmission subsystem (the network part of the data center). As IT evolved, new ways to store and process data and to solve user problems appeared. HCI is infrastructure for the “cloud era” of IT: a move away from the 3-tier architecture in favor of, for example, microservices.

From a practical point of view, in a classic converged infrastructure, the server, the storage system, the network equipment, and the virtualization tool are separate elements. Hyperconverged infrastructure integrates them, along with the other components of a familiar data center, into a single system. Sometimes HCI includes additional components, such as backup software, snapshots, data deduplication, inline compression, and network optimization.

While a converged infrastructure is primarily hardware-based and a software-defined data center can usually be adapted to any hardware, a hyperconverged infrastructure combines the two approaches. It is also meant to improve operational reliability, performance, and data security. In general, HCI should be seen as the next step in the evolution of IT infrastructure.

After weighing the pros and cons, we decided to try hyperconvergence. We contacted Nutanix and received a platform for testing, and the test was successful. The test platform consisted of six nodes in a two-unit form factor. We save electricity and rack space, and there is no need to keep buying servers endlessly.

5 benefits of hyper-converged infrastructure

  1. IT- – HCI, IT-«», .
  2. ;
  3. . . , . , . Nutanix , .
  4. The risk of services becoming unavailable when one or several components fail is reduced thanks to unification and to redundancy of data and hardware. If equipment suddenly goes down in one data center, the backup immediately starts in another.
  5. The open source code of the product makes security audits easier, and the built-in STIG (Security Technical Implementation Guide, a set of recommendations for protecting IT systems) helps keep code execution secure and makes the IT system more resistant to attackers.

How we chose a vendor


We looked at several vendors of hyperconverged systems, among them Cisco HyperFlex, SimpliVity, HPE Hyper Converged, Fujitsu PRIMERGY CX, and Nutanix. We then formulated the following selection criteria:

  • Reliability and safety of the stored data;
  • Compliance with PCI DSS 3.2.1 security requirements;
  • Performance;
  • Maintenance and technical support;
  • The flexibility to scale the infrastructure at the speed the company needs.

In the end, we settled on Nutanix, since it was this company that effectively started the HCI market in 2012. In our assessment, it offers the most stable and flexible product with the broadest capabilities, for example:

  • A wide selection of platforms (HPE, Dell, Fujitsu, Cisco);
  • Availability of a free version of Community Edition;
  • The freedom to choose a hypervisor (including a free one, AHV);
  • A small “growth quantum” (in fact, a single server) that already gives the business everything a large deployment gives: reliability, security, and new technologies. All the Nutanix functionality used today by businesses on the scale of VTB or the Societe Generale group is also available in the most entry-level solutions.

In addition, our technical team gained specialists with experience on the Nutanix platform. Thanks to them, we knew how the system would behave in critical situations, which is extremely important for us as a fintech company.

Another factor in choosing Nutanix was the availability of a migration tool, Nutanix Move. It migrates machines with minimal downtime. For example, if there are machines on both VMware and Nutanix, Move acts as a kind of bridge: it takes a machine from VMware, clones it, deploys it to Nutanix from snapshots, shuts it down in VMware, and starts it on Nutanix. All of this takes literally seconds.

The transition process to Nutanix


The main requirement for the move was not to disturb the stability of the system, so the switch to the new platform had to be made very carefully.

So it all started with testing the trial Nutanix platform mentioned above. We began a series of tests by deploying a test environment. We took a processing instance that is not in production and, so to speak, “shot” at it from the Yandex-gun. We checked the load, confirmed that the resources were sufficient for our purposes, and saw what worked well and where optimization was needed.
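One way to read the results of such a run is to compare a latency percentile against a budget. The sketch below is only an illustration: the `percentile` and `within_budget` helpers, the sample latencies, and the 300 ms budget are all assumptions made for the example, not our actual acceptance criteria or the output format of any load generator.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]

def within_budget(latencies_ms, budget_ms, p=95):
    """Return True if the p-th percentile latency stays within the budget."""
    return percentile(latencies_ms, p) <= budget_ms

# Hypothetical response times (ms) collected during one load step.
latencies_ms = [120, 135, 110, 480, 140, 125, 130, 150, 145, 115]
p95 = percentile(latencies_ms, 95)
ok = within_budget(latencies_ms, budget_ms=300)
```

With these made-up samples, a single slow response pushes the 95th percentile over the budget even though the median looks healthy, which is the kind of signal that shows where optimization is needed.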

HCI usually coexists with the classic infrastructure and does not require abandoning the familiar data center right away. Migration to HCI can be as gradual and smooth as the company needs. For example, if the company's data center already uses virtualization and a hypervisor, the move to HCI is a gradual migration of virtual machines from the old “classic” servers to the new HCI servers. This is exactly our case: we need to move virtual machines to another virtual environment. Where possible, we will use automatic migration with Nutanix Move; some services are described as infrastructure as code (IaC).

All of this can be divided into several stages:

  1. Writing the roadmap.
  2. Launching the new infrastructure.
  3. Migrating services according to the roadmap.

Implementing these stages takes about two months.
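A roadmap of this kind can be sketched as an ordering problem. The example below assumes, purely for illustration, that a service is moved only after everything it depends on has been moved; the service names and the dependency map are hypothetical and do not describe our real system. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to group services into migration waves.

```python
from graphlib import TopologicalSorter

# Hypothetical services; each key maps to the set of services it depends on.
DEPENDS_ON = {
    "database": set(),
    "processing-core": {"database"},
    "merchant-api": {"processing-core"},
    "sms-gateway": {"processing-core"},
}

def migration_waves(depends_on):
    """Group services into waves; every wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # services whose dependencies are all migrated
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = migration_waves(DEPENDS_ON)
```

Services within one wave do not depend on each other, so in this simplified model they could be moved in the same maintenance window.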

Difficulties and problems of transition

There were problems, of course. The main difficulty is that processing still has to be suspended while services are being migrated. But we worked through this when we took the Nutanix platform for testing: at that stage we drew up a plan for purchasing equipment and migrating services. Strictly following the approved plan is the key to a successful migration.

In some companies, in our experience, the difficulties of moving to a hyperconverged infrastructure are organizational or budgetary rather than technical. For example, if the data center works and has no new tasks, and a lot of expensive equipment was bought fairly recently to develop the “classic” infrastructure, it is very hard to arrive at the idea (and to argue to management or investors) that all this should be abandoned and money spent again, this time on HCI.

Positive Results of Switching to HCI

The operations team sleeps much better now. Why? It's simple: distributed storage improves storage reliability and data availability.

We optimized business processes and staffing:

  • one administrator instead of three;
  • IT-, ;
  • .
  • .

Hardware costs went down, for the reasons described above. Scaling data center costs became simpler. With the classic approach, it can be very hard to design an infrastructure that can grow tenfold without being changed in whole or in part. With HCI, you can start with a very small solution and gradually invest more in infrastructure.

It became possible to shift spending from capital to operating expenses, which is the direction IT is taking all over the world. This requires planning costs differently, looking at familiar things in a new way, and learning new things, but this is where the industry is heading today.

To show how simple it is, here is one case. During a charity marathon, the load on our network was much higher than calculated. The miscalculation arose because the load was estimated from the statistics of previous marathons; we did not take into account that more people would take part in the new one because of quarantine and self-isolation. Had the problem arisen on the old infrastructure, the consequences would have been very negative, up to services going offline. With Nutanix, we were able to exactly double the cluster, and the system “digested” the load. All of this took just 15 minutes and a few mouse clicks.
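The miscalculation is easy to picture with a small sizing sketch. All names and figures below (the `nodes_needed` helper, per-node throughput, the headroom factor, the peak values) are hypothetical and are not our or Nutanix's numbers; the point is only that a forecast built on past peaks falls short when the audience grows.

```python
import math

def nodes_needed(peak_tps, tps_per_node, headroom=0.3):
    """Return how many nodes are needed to serve peak_tps with spare headroom.

    headroom=0.3 means each node is planned to run at no more than 70% of
    its assumed sustainable throughput.
    """
    if tps_per_node <= 0 or not 0 <= headroom < 1:
        raise ValueError("tps_per_node must be positive and headroom in [0, 1)")
    usable = tps_per_node * (1 - headroom)
    return max(1, math.ceil(peak_tps / usable))

# Hypothetical figures for illustration only.
forecast_peak = 400.0  # peak taken from the statistics of previous marathons
actual_peak = 800.0    # peak when far more people take part
per_node = 200.0       # assumed sustainable throughput of one node

planned = nodes_needed(forecast_peak, per_node)
required = nodes_needed(actual_peak, per_node)
```

In this made-up scenario the real peak calls for twice as many nodes as the forecast did, and with a one-server growth step that gap can be closed by adding nodes rather than redesigning the infrastructure.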

A bit about the prospects for technology and the choice of HCI


Should everyone switch to HCI? Of course not. Hyperconverged infrastructure mainly suits companies that have large networks under high load and enough funds for the transition. A startup with a few servers does not need to change anything.

But if a company has a whole zoo of server hardware, a poorly optimized network, and high costs for hardware and its maintenance, then HCI is definitely the way to go.

As for the future of HCI, the following can be said. First, HCI is spreading in the world and in Russia as fast as IT is ready to accept and use new ideas. Many growing companies are looking for new opportunities for development, especially as IT budgets shrink. Traditional, conservative businesses are likely to come to HCI later; active, young, growing companies that use new technologies will come earlier.

Second, more and more companies will enter the HCI market, and the technology will become more widespread and affordable. All the top manufacturers of servers and storage systems already have HCI offerings, and this has happened in just the last year or two.

Third, the idea of the “cloud” will continue to develop, including in the form of a “hybrid cloud”, where part of the infrastructure sits in the company's own data center and part is rented as needs arise.

Take the same online stores and delivery services as an example. When traffic jumps several-fold, they could scale within minutes by renting capacity from a cloud operator instead of “hanging” all day and losing customers and money. Then, once demand falls and the market calms down, they could scale the infrastructure back without sacrificing security or performance and without spending money on their own equipment. Perhaps in the future the “cloud” and “your own data center” will be linked seamlessly. At least, all the technologies for this already exist.

Fourth, systems for automating IT infrastructure management will develop actively, including with the use of AI, as will virtualization of the data center's network infrastructure.
