Monitoring in the data center: how we replaced the old BMS with a new one. Part 3

We continue the story of how we replaced the BMS in our data centers (part 1, part 2). We did not simply swap one vendor's solution for another: we developed the system from scratch to fit our requirements. To conclude, we share the results of the work and some interesting solutions that may be useful to you.

New interface


Here, as they say, it is better to see it once.

Racks.

Let's analyze the differences.

  • First, it is both attractive and convenient. Note how easy it has become to track the load on the PDU modules (“banks”) and the summed load of paired modules. On the rack model in the new BMS, we see immediately that the lower paired PDU modules are overloaded (a total current above the permissible 16 A raises a “blue” notification), while the upper ones are underloaded. If one of the power inputs is disconnected, the entire load transfers to the second input, and the lower module that remains energized will trip on overload. To prevent this, the data center support service warns the customer in advance and sends a recommendation on how to redistribute the load.
  • . BMS PDU. BMS , , - « ».
  • . . . , ( ) . , . 
  • Intuitive interface. The new interface has no heaps of icons; fans spin and switches “click”. Most convenient of all is the ability to show the status of PDU lines A and B inside the racks. We tried to do something similar in the old BMS, but the number of overlapping icons per square centimeter of the map forced us to abandon the idea.
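
The paired-module check described in the first point can be sketched in a few lines. This is only an illustration of the reasoning, not the BMS implementation: the `BankPair` type, the function name, and the sample currents are hypothetical, and only the 16 A limit comes from the text above.

```python
from dataclasses import dataclass

MAX_PAIR_CURRENT_A = 16.0  # permissible total for paired modules, per the article


@dataclass
class BankPair:
    """Two paired PDU modules (line A and line B). Hypothetical model."""
    name: str
    current_a: float  # amps drawn on the line A module
    current_b: float  # amps drawn on the line B module

    @property
    def total(self) -> float:
        # If one input is lost, the surviving module carries this whole load.
        return self.current_a + self.current_b


def failover_status(pair: BankPair, limit: float = MAX_PAIR_CURRENT_A) -> str:
    """Return 'overloaded' when the summed load exceeds the limit."""
    return "overloaded" if pair.total > limit else "ok"


# Sample values are invented for illustration.
rack = [BankPair("upper", 3.0, 2.5), BankPair("lower", 9.5, 8.0)]
for pair in rack:
    print(f"{pair.name}: {pair.total:.1f} A -> {failover_status(pair)}")
# upper: 5.5 A -> ok
# lower: 17.5 A -> overloaded
```

The point of summing the pair rather than checking each module separately is that neither module in the "lower" pair is overloaded on its own, yet the pair cannot survive the loss of one input.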

Now it is a pleasure to look at:


Server room.


Fragment of main switchboard.


Ventilation control panel.

And you can decorate the new BMS for the New Year :-)


One page: built on understanding, without a ToR


For a very long time we wanted to implement another “feature” in the BMS: to bring the main parameters of the data center together on one page, so that a single glance at the screen would be enough to assess the state of the main systems. However, we did not fully understand what it should look like.

Even before development of the new BMS began, we toured dozens of data centers in the Netherlands. One of the goals was to see examples of such a page.

No data center showed us one: in some it did not exist, in some it “was being developed right now”, and in some it was a “big trade secret”. As a result, our ToR for the new BMS lacked an exact description of this page, which is very important to us.

In the end, we came up with it literally "on the go." Just at that moment, I had to consult colleagues in the data center remotely. Scrolling through BMS pages on the phone in search of scattered data was very inconvenient, and the first version of One page was, in fact, scribbled on a napkin. The developers implemented it from a photo of that sketch.

Following the example of our cautious Dutch colleagues, we will not show the final version of our main page, especially since each data center is unique and copying it makes no sense. But we will describe the two main principles behind it:

  1. , ( , ), . «» , . 
  2. ( ). , .  - – . .

In fact, absolutely all the key characteristics of the data center are now grouped and presented on a single smartphone or monitor screen for the responsible engineer and manager, and they are tied to the physical and logical topology of the data center.

Here is a photo of the very first draft; of course, this version was later rethought and refined.



Acknowledgment and summary of incidents


Let's talk about another concept that was new to us and that emerged from the project to update the monitoring system.

Acknowledgment is a rather rare term that the developer of the new BMS proposed. It means that the operator has seen the incident, confirmed it, and taken responsibility for resolving it.

The word has taken root, and now we "acknowledge" the incidents.

The algorithm built into the basic version of the new BMS did not suit us. In essence, it amounted to comments on the event log: resolved incidents did not disappear from the log, and accepted (“acknowledged”) messages were not separated from new ones.

As a result, we developed a window called "summary", in which:

  1. Only active incidents and devices in service mode are displayed (commercial "blue" notifications are excluded).
  2. NEW and ACCEPTED incidents are clearly separated.
  3. It is indicated who accepted the incident.

The duty workflow in the new BMS is as follows:

  1. New incidents are displayed and await acknowledgment. They must not stay in this section for long: the engineer on duty should take on the incident immediately.
  2. The employee accepts the incident by clicking the checkmark on the right. Since every employee works under a unique account, the system automatically shows who accepted the incident. A comment can be added if necessary.
  3. The incident moves to the "Acknowledged" section, so the other engineers on duty and the manager know that a responsible engineer is handling it.
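
The workflow above can be sketched as a small data model. This is a minimal illustration under our own assumptions, not the code of the BMS: the `Incident` fields, the function names, and the device and account names are hypothetical, and devices in service mode are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record."""
    device: str
    message: str
    active: bool = True        # False once the incident is resolved
    commercial: bool = False   # "blue" customer notification
    acknowledged_by: Optional[str] = None
    comment: str = ""


def acknowledge(incident: Incident, user: str, comment: str = "") -> None:
    """Record who took responsibility; `user` is the logged-in account."""
    if incident.acknowledged_by is not None:
        raise ValueError(f"already acknowledged by {incident.acknowledged_by}")
    incident.acknowledged_by = user
    incident.comment = comment


def summary(incidents: list[Incident]) -> dict[str, list[Incident]]:
    """Keep only active, non-commercial incidents, split into two sections."""
    shown = [i for i in incidents if i.active and not i.commercial]
    return {
        "new": [i for i in shown if i.acknowledged_by is None],
        "acknowledged": [i for i in shown if i.acknowledged_by is not None],
    }


# Invented sample data.
log = [
    Incident("UPS-1", "battery fault"),
    Incident("PDU-12", "bank load above limit", commercial=True),
    Incident("CRAC-3", "high return temperature"),
]
acknowledge(log[2], user="engineer1", comment="checking on site")
view = summary(log)
print([i.device for i in view["new"]])           # ['UPS-1']
print([i.device for i in view["acknowledged"]])  # ['CRAC-3']
```

Unlike comments attached to an event log, the summary is a filtered view: a resolved incident (`active=False`) simply drops out of both sections.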



An example of a summary window with a new and already acknowledged message.

Having combined the summary window with the One page table, we got a full-fledged BMS main screen, on which you can immediately see:

  • the state of the main data center systems;
  • the presence of new unprocessed incidents;
  • the presence of accepted incidents and who specifically is resolving them.

Access via browser and pop-up alerts on the phone


The web interface, accessible from any device anywhere in the world, is a stark contrast to the old “thick” client, which was completely closed to outside users.

The old approach brought a whole set of inconveniences, from difficulties in organizing remote work for monitoring staff to the need to install “thick” clients from distribution packages on staff workstations in the data center.

Now every page in the BMS has a unique address, which lets you share direct links not only to a page or device but also to specific graphs and reports.

Access to the system is now provided via LDAP authentication against Active Directory, which improves security.

Mobility is now a key factor in the quality of the duty engineers' work. Besides watching the monitors in the duty room, engineers make rounds and perform routine work outside it, and thanks to the BMS main screen optimized for mobile displays, they do not lose track of what is happening in the halls for a second.

Work chats further improve the quality of control. They speed up workflows by "linking" the duty engineers' correspondence to the BMS. For example, we use the Teams application, which lets us conduct internal correspondence and receive all BMS messages on the phone as pop-up push notifications, so the engineer on duty does not have to keep looking at the phone screen.


 Push notification on smartphone screen.


And this is how the notifications look in the Teams application.

Pop-up notifications are configured only for incident messages, which minimizes distraction: staff know that if a Teams push notification appears on the smartphone screen, they need to open the BMS page and accept the incident. Messages about corrective actions are tracked on the BMS page itself.
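
The filtering rule can be illustrated with a short sketch. The article does not describe how the BMS is wired to Teams, so everything here is an assumption: the message kinds, the function names, the example URL, and the idea that the chat is fed through a webhook accepting a JSON body with a `text` field. The sketch only builds the payload and does not send anything.

```python
import json

# Assumption: only these message kinds should raise a push notification.
PUSH_KINDS = {"incident"}


def should_push(kind: str) -> bool:
    """Decide whether a BMS message deserves a pop-up on the phone."""
    return kind in PUSH_KINDS


def chat_payload(device: str, text: str, page_url: str) -> str:
    """Build a minimal JSON body with a single 'text' field (hypothetical format)."""
    return json.dumps({"text": f"{device}: {text}\n{page_url}"})


# Invented sample messages; the URL is a placeholder.
messages = [
    ("incident", "UPS-1", "battery fault"),
    ("corrective_action", "UPS-1", "battery replaced"),
]
for kind, device, text in messages:
    if should_push(kind):
        print(chat_payload(device, text, "https://bms.example/devices/UPS-1"))
```

Including the page address in the message fits the workflow described above: the notification only prompts the engineer, and the acknowledgment itself happens on the BMS page.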


In the photo: the BMS interface on a smartphone.

Summing up


Although the cost of upgrading the BMS from our old vendor was comparable to developing a new system from scratch (about $100,000), the difference in functionality turned out to be enormous. We got a flexible system optimized for our business tasks and processes. We also achieved significant savings in the running costs of maintaining and updating the system.

But, of course, there were difficulties. 

  • -, , BMS, . , , , , . , . , , . 
  • -, , . BMS, . . , , .
  • -, . ( ) , , , .

The radical update of our BMS can be called the most important project of the past year, one that will seriously affect the quality of operational management of our sites in the future.

Of course, we did not throw out the old hardware server but “lightened” it: we removed thousands of “commercial” virtual sensors and PDUs and left only a few dozen of the most critical devices, such as diesel generator sets, UPSs, air conditioners, pumps, and leak and temperature sensors. In this mode it regained its former speed and can serve as a "backup of the backup." By the way, removing the PDUs from the old BMS freed up about 1000 licenses that are no longer needed. Do you happen to know what to do with them?