Monitoring in the data center: how we replaced the old BMS with a new one. Part 2

In the first part, we explained why we decided to replace the old BMS in our data centers with a new one, and not just replace it but develop it from scratch to fit our requirements. In this second part, we describe how we did it.

Market analysis

Based on the requirements described in the first part, and on the decision not to upgrade the existing system, we wrote a technical specification for finding a solution on the market and sent inquiries to several large companies that specialize in building industrial SCADA systems.

Their first answers showed that the leaders of the monitoring systems market still mostly run on physical servers, although migration to the cloud has already begun in this segment. As for redundant virtual machines, no one supported that option. Moreover, we had the feeling that none of the developers visible on the market even understood the need for redundancy: "the cloud does not go down" was the most common answer. In effect, we were being offered to host data center monitoring in a cloud physically located in that same data center.

A small digression about selecting a contractor is in order here. Price matters, of course, but in any tender for a complex project, the dialogue with suppliers quickly shows which candidates are more interested in the project and more capable of delivering it.

This is especially noticeable on complex projects.

The clarifying questions about the technical specification make it possible to divide contractors into two groups: those interested simply in selling (you can feel the standard sales-manager pressure), and those interested in developing the product, who listen to and understand the customer, propose constructive amendments to the technical specification even before the final choice (despite the real risk of improving someone else's specification and still losing the tender) and, in the end, are simply ready to accept a professional challenge and build a good product.

All this drew our attention to a relatively small local developer, the Sunline group of companies, which met most of our requirements right away and was ready to cover everything we needed from the new BMS.

The risks

While the big players were still trying to understand what we wanted, and we were in leisurely correspondence with their presale specialists, the local developer arranged a meeting at our office and brought its technical team. At this meeting, the contractor once again demonstrated its desire to take part in the project and, most importantly, explained how the required system would be implemented.

Before the meeting, we saw two risks of working with a team that does not have the resources of a large national or international company:

The specialists could overestimate their capabilities and, as a result, simply fail to cope: for example, by relying on overly complicated software or designing unworkable redundancy algorithms.
After the project is implemented, the team could break up, putting product support in jeopardy.

To minimize these risks, we invited our own developers to the meeting. They questioned the potential contractor's staff thoroughly about what the system is built on, how redundancy would be implemented, and other matters in which we, as an operations service, are not competent enough.

The verdict was positive: the architecture of the existing BMS platform is modern, simple and reliable and can be extended, and the proposed redundancy and synchronization scheme is logical and workable.

That took care of the first risk. We eliminated the second by getting the contractor's confirmation that they were ready to hand over the source code and documentation for the system, and by the choice of Python, a programming language our specialists know well. This guaranteed that we could maintain the system ourselves, without difficulty or lengthy staff training, should the developer leave the market.

An additional advantage of the platform was that it runs in Docker containers: the core, the web interface and the product database all operate in this environment. This approach brings many benefits, including preconfigured settings, which make deployment much faster than the "classic" approach, and simple addition of new devices to the system. The "all in one" principle simplifies rollout as much as possible: it is enough to unpack the system, and it can be operated immediately.

Such a solution also makes it easier to copy the system, and it can be improved and upgraded in a separate environment without stopping the solution as a whole.

After both risks were minimized, the contractor submitted a commercial proposal. It covered all the BMS parameters most important to us.

Redundancy

The new BMS was to run in the cloud on a virtual machine.

No hardware and no servers, along with all the inconveniences and risks of that deployment model: the cloud solution let us get rid of them for good. It was decided that the system would run in our cloud at two data center sites, in St. Petersburg and Moscow. These are two fully functional systems operating in active-standby mode, accessible to all authorized specialists.

The two systems back each other up, providing full redundancy of both computing power and data transmission channels. Additional safeguards were also set up, including backups of data, channels, systems and entire virtual machines, plus a separate monthly backup of the database (the most valuable resource from the standpoint of management and analysis).

Note that redundancy as an option of the BMS solution was developed specifically at our request. The redundancy scheme itself looked like this:

Support

The most important point for the effective operation of a BMS solution is technical support.

Everything is simple here: support for the new system would cost us 35,000 rubles per month under an SLA with "response within 8 hours", that is, 35,000 x 12 / 80 = $5,250 per year at 80 rubles to the dollar. The first year is free.

For comparison, vendor support for the old BMS cost $18,000 a year, and the amount grew with every new device added! Moreover, the vendor did not provide a dedicated manager: all interaction went through a sales manager who was interested in us as a potential buyer and prioritized our requests accordingly.
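The comparison above can be checked with a few lines of Python. This is only a sketch of the article's arithmetic: the function and variable names are illustrative, and the rate of 80 rubles per dollar is an assumption taken from the divisor in the formula.

```python
# Figures quoted in the article.
NEW_SUPPORT_RUB_PER_MONTH = 35_000
OLD_SUPPORT_USD_PER_YEAR = 18_000

# Assumption: 80 rubles per US dollar, the divisor used in the article's formula.
RUB_PER_USD = 80


def annual_cost_usd(rub_per_month: float, rub_per_usd: float = RUB_PER_USD) -> float:
    """Convert a monthly price in rubles into a yearly price in dollars."""
    return rub_per_month * 12 / rub_per_usd


new_support_usd_per_year = annual_cost_usd(NEW_SUPPORT_RUB_PER_MONTH)
yearly_difference_usd = OLD_SUPPORT_USD_PER_YEAR - new_support_usd_per_year

print(f"new BMS support: ${new_support_usd_per_year:,.0f} per year")
print(f"old BMS support: ${OLD_SUPPORT_USD_PER_YEAR:,.0f} per year")
print(f"difference:      ${yearly_difference_usd:,.0f} per year")
```

The old figure is a lower bound, since it grew with each added device, and the first year of new support is free, so the real gap is at least this large.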

For less money, we received full product support, with an account manager who would take part in developing the product, a single point of contact, and so on. Support became much more flexible thanks to direct access to the developers for prompt adjustments to any aspect of the system, API integration, etc.

Updates

According to the commercial proposal, all updates to the new BMS are included in the cost of support, i.e. they require no additional payment. The exception is development of additional functionality beyond what is set out in the technical specification.

The old system required payment both for updating the bundled free software (such as Java) and for bug fixes. There was no way to opt out: without updates, the system as a whole "slowed down" because of outdated internal components.

And, of course, it was impossible to update the software without buying a support package.

Flexible approach

Another fundamental requirement concerned the interface. We wanted it to be accessible through a web browser from anywhere, without an engineer having to be present in the data center. In addition, we wanted an animated interface, so that the dynamics of the infrastructure would be easier for the duty engineers to see.

The new system also had to support formulas for calculating virtual sensors in engineering systems, for example for optimal distribution of electrical power among equipment racks. This requires all the usual mathematical operations to be applicable to sensor readings.

Next, we required access to the SQL database, with the ability to extract the necessary data on equipment operation: namely, all monitoring records for two thousand devices and two thousand virtual sensors, which together generate about 20 thousand variables.

We also needed a rack equipment accounting module that shows graphically where devices sit in each rack unit, calculates the total weight of the hardware, and maintains a device library with detailed information about each item.
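To make the idea of a virtual sensor concrete, here is a minimal sketch in Python of a formula evaluated over real sensor readings. It is not the platform's actual implementation: the function `evaluate_formula`, the sensor names and the readings are all hypothetical, and only the four basic arithmetic operations are supported.

```python
import ast
import operator
from typing import Mapping

# Arithmetic operations a formula is allowed to use in this sketch.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula: str, readings: Mapping[str, float]) -> float:
    """Evaluate an arithmetic formula whose variables are sensor names."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return readings[node.id]  # KeyError if the sensor is unknown
        raise ValueError(f"unsupported element in formula: {ast.dump(node)}")

    return _eval(ast.parse(formula, mode="eval"))


# Hypothetical readings from two PDU feeds of one rack, in kW.
readings = {"pdu_a_kw": 2.5, "pdu_b_kw": 1.5, "rack_limit_kw": 6.0}

# Two virtual sensors: total rack load and the remaining power headroom.
rack_load_kw = evaluate_formula("pdu_a_kw + pdu_b_kw", readings)
headroom_kw = evaluate_formula("rack_limit_kw - (pdu_a_kw + pdu_b_kw)", readings)
print(rack_load_kw, headroom_kw)
```

Walking the syntax tree instead of calling `eval` keeps formulas limited to arithmetic on known sensors, which matters when operators can type formulas into a web interface.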
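As an illustration of what direct SQL access makes possible, the sketch below aggregates raw readings into hourly averages. The article does not name the database engine or its schema, so everything here is an assumption: an in-memory SQLite database stands in for the real one, and the `readings` table, its columns and the device name are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the BMS database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id TEXT, variable TEXT, ts TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("pdu-17", "power_kw", "2020-01-01 10:05:00", 2.0),
        ("pdu-17", "power_kw", "2020-01-01 10:35:00", 2.5),
        ("pdu-17", "power_kw", "2020-01-01 11:10:00", 3.0),
        ("pdu-18", "power_kw", "2020-01-01 10:05:00", 9.0),
    ],
)


def hourly_average(connection, device_id, variable):
    """Return (hour, average value) rows for one variable of one device."""
    query = """
        SELECT strftime('%Y-%m-%d %H:00', ts), AVG(value)
        FROM readings
        WHERE device_id = ? AND variable = ?
        GROUP BY strftime('%Y-%m-%d %H:00', ts)
        ORDER BY 1
    """
    return connection.execute(query, (device_id, variable)).fetchall()


for hour, average in hourly_average(conn, "pdu-17", "power_kw"):
    print(hour, average)
```

Query parameters are passed separately from the SQL text, so device and variable names coming from a report form cannot alter the query.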
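The core of such a rack accounting module can be sketched as a small data model. The Python below is an assumption-laden illustration, not the contractor's design: the classes `DeviceModel` and `Rack`, the 42U default and the sample devices are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class DeviceModel:
    """One entry of the device library."""
    name: str
    height_u: int
    weight_kg: float


@dataclass
class Rack:
    name: str
    size_u: int = 42  # hypothetical default rack height
    # Maps the lowest unit a device occupies to its library entry.
    slots: Dict[int, DeviceModel] = field(default_factory=dict)

    def occupied_units(self) -> Set[int]:
        units: Set[int] = set()
        for bottom_u, model in self.slots.items():
            units.update(range(bottom_u, bottom_u + model.height_u))
        return units

    def place(self, bottom_u: int, model: DeviceModel) -> None:
        """Install a device, rejecting positions outside the rack or already taken."""
        top_u = bottom_u + model.height_u - 1
        if bottom_u < 1 or top_u > self.size_u:
            raise ValueError(f"{model.name} does not fit at U{bottom_u}")
        if self.occupied_units() & set(range(bottom_u, top_u + 1)):
            raise ValueError(f"units U{bottom_u}-U{top_u} are already occupied")
        self.slots[bottom_u] = model

    def total_weight_kg(self) -> float:
        return sum(model.weight_kg for model in self.slots.values())


# Hypothetical library entries and placements.
server = DeviceModel("2U server", height_u=2, weight_kg=25.0)
switch = DeviceModel("1U switch", height_u=1, weight_kg=5.0)

rack = Rack("R-01")
rack.place(1, server)
rack.place(3, switch)
print(rack.total_weight_kg(), sorted(rack.occupied_units()))
```

Keeping device properties in a shared library entry means the weight and height of a model are entered once and reused for every rack where it is installed.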

Agreeing on the technical specification and signing the contract

By the time work on the new system had to begin, correspondence with the "large" companies was still very far from any discussion of the cost of their proposals. So we compared the commercial proposal we had received with the cost of upgrading the old BMS (see the first part), and it turned out to be more attractive on price while meeting our requirements.

The choice has been made.

After the contractor was chosen, the lawyers began drafting the contract, and the technical teams on both sides polished the technical specification. As you know, a detailed and competent technical specification is the basis for the success of any work. The more specifics it contains, the fewer disappointments of the "but that's not what we wanted" kind.

I will give two examples of the level of detail of the requirements in the technical specification:

[Two excerpts from the technical specification; their text is not recoverable here. The first concerned PDUs in the BMS.]

Chart and report formats, interface layouts, the list of devices to be monitored and many other things were specified in similar detail.

It was truly creative work by three working groups: the customer's operations service, which dictated its requirements and conditions; technical specialists from both sides, whose task was to turn those conditions into technical documentation; and the contractor's programmers, who implemented the customer's requirements according to that documentation. As a result, we adapted some of our non-critical requirements to the functionality of the existing platform, and the contractor undertook to add other things for us.

Parallel operation of two systems

The time came for implementation. In practice, this meant letting the contractor deploy a BMS prototype in our virtual cloud and giving it network access to all the devices that needed monitoring.

However, the new system was not yet ready for operation. At this stage, it was important for us to keep monitoring in the old system while giving the new system access to the devices. A system cannot be built properly without seeing the devices, and the devices, in turn, could not be disconnected from monitoring by the old system.

Whether the devices could withstand simultaneous polling by two systems was not obvious without real tests. There was a possibility that double polling would cause devices to refuse requests frequently, producing many device-unavailable errors, which in turn would block the old monitoring system.

The network department set up virtual routes from the new BMS prototype deployed in the cloud to the devices, and we got the following results:

devices connected via SNMP almost never dropped their connection because of simultaneous requests,
devices connected through gateways via Modbus TCP did have problems, which were solved by a reasonable reduction in their polling frequency.

And then we watched a new system being built before our eyes: devices already familiar to us appeared in it, but in a different interface that was convenient, fast and accessible even from a phone.
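Reducing the polling frequency for one class of devices can be sketched as a per-protocol interval table. The Python below is only an illustration under stated assumptions: the interval values, the `Device` class and the device names are hypothetical, and the actual polling of SNMP or Modbus TCP devices is left out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical intervals in seconds: gateway-connected Modbus TCP devices
# are polled less often than SNMP devices to avoid overloading the gateways.
POLL_INTERVAL_S = {"snmp": 10, "modbus_tcp": 60}


@dataclass
class Device:
    name: str
    protocol: str
    last_polled_s: float = float("-inf")  # never polled yet


def due_devices(devices: List[Device], now_s: float) -> List[Device]:
    """Return the devices whose polling interval has elapsed."""
    return [
        device
        for device in devices
        if now_s - device.last_polled_s >= POLL_INTERVAL_S[device.protocol]
    ]


def poll_cycle(devices: List[Device], now_s: float) -> List[str]:
    """Mark due devices as polled and return their names (the real I/O is omitted)."""
    polled = []
    for device in due_devices(devices, now_s):
        device.last_polled_s = now_s
        polled.append(device.name)
    return polled


fleet = [Device("ups-1", "snmp"), Device("chiller-1", "modbus_tcp")]
for now_s in (0, 10, 20, 60):
    print(now_s, poll_cycle(fleet, now_s))
```

With two monitoring systems querying the same gateway, lengthening only the gateway interval lowers the combined request rate without slowing down polling of devices that handle the load well.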

We will describe the outcome in the third part of the article.
