"Demonstrate it", or how we passed the Uptime Institute Operational Sustainability audit


The head of the operations department climbed into the hatch of the underground fuel storage to show the markings on the solenoid valve.

In early February, NORD-4, our largest Tier III data center, was re-certified by the Uptime Institute (UI) under the Operational Sustainability standard. Today we will tell you what the auditors look at and what results we came away with.

For those who are not on familiar terms with data centers, we will briefly go through the basics. The Tier Standards evaluate and certify data centers in three stages:

  • Design: the package of design documentation is checked. This is where the well-known Tier level is assigned. There are four of them, Tier I to Tier IV, and Tier IV is the highest.
  • Facility: the constructed data center itself is checked on site.

    Facility certification is only possible after Design.
    NORD-4 received its Design certificate in 2015 and its Facility certificate in 2016.
  • Operational Sustainability: the operation of the data center is checked. It is only available to a site that already holds a Tier certificate (Operational Sustainability requires Facility).

    There are three award levels: Bronze, Silver and Gold. The score of 88.95 out of 100 is Silver. Gold was 1.05 points away.



How do you check that the necessary processes are in place and working as they should? And how do you do it in two days, which is how long re-certification takes? In short, the certification rests on a painstaking comparison of what is written in the regulations, the accounts of "how everything works" and actual practice. Information about the latter comes from walk-throughs of the data center and conversations with its engineers, or "confrontations", as we affectionately call them. Here is what the auditors look at.

Team


First of all, UI auditors check whether the data center has enough staff. They take the staffing table and the duty schedule and spot-check them against shift reports and access control system (ACS) data to make sure that the right number of engineers really was on site on a given day.
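
The cross-check the auditors do by hand can be sketched in a few lines. This is a hypothetical illustration, not our tooling: the function name `understaffed_days`, the data layout, the required headcount and the sample names and dates are all assumptions made for the example.

```python
from datetime import date


def understaffed_days(required, duty_schedule, acs_entries):
    """Return days on which fewer scheduled engineers were on site than required.

    required      -- minimum number of engineers per day (assumed policy value)
    duty_schedule -- {day: set of engineers scheduled for duty}
    acs_entries   -- {day: set of people whose badge the ACS registered}
    """
    problems = {}
    for day, scheduled in duty_schedule.items():
        # Only engineers who were both scheduled and badged in count as present.
        present = scheduled & acs_entries.get(day, set())
        if len(present) < required:
            problems[day] = sorted(scheduled - present)
    return problems


# Made-up sample data.
schedule = {
    date(2019, 2, 4): {"ivanov", "petrov", "sidorov"},
    date(2019, 2, 5): {"ivanov", "petrov", "sidorov"},
}
acs = {
    date(2019, 2, 4): {"ivanov", "petrov", "sidorov", "guest"},
    date(2019, 2, 5): {"ivanov", "sidorov"},
}
print(understaffed_days(3, schedule, acs))
```

The result lists, for each problem day, the scheduled engineers who did not show up in the ACS data, which is roughly the question an auditor would ask next.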

Auditors also look closely at the number of overtime hours. Overtime sometimes happens when a large client moves in and dozens of racks have to be received at once. At such moments, engineers from other shifts come to the rescue, and they are paid extra for it.

NORD-4 7 : 6 . , 247, , . . . — . 247.



Once the numbers are sorted out, the team's qualifications are checked. Auditors spot-check engineers' personnel files to make sure that they hold the diplomas, certificates and permits (for example, electrical safety certificates) required for their positions.

They also check how we train staff. During the last audit, our system for training new duty engineers impressed the UI specialists. New engineers go through a three-month paid internship, during which we introduce them to the processes and working principles of our data center.

Engineers who are already on the job must also receive regular training, including emergency drills. Auditors always check the programs and materials for such training and quiz a sample of engineers. They will not ask anyone to actually switch over to the diesel generators, but they will ask for a step-by-step account of what to do when city power goes down. Following the audit, we will bring all training programs to a single standard so that they do not differ between teams.


We show the auditors the rest room for shift engineers.

Operation and maintenance of engineering systems 


In this large section of the audit, we show that all engineering equipment and systems receive regular maintenance on the schedule recommended by the vendors, that the warehouse holds the necessary spare parts, that contracts with service contractors are in force, and that every operation on the equipment has its own procedures and work algorithms for different cases.

MMS. When you operate dozens of UPSs, diesel generators, air conditioners and other equipment, you need somewhere to collect all the information about this inventory. Here is roughly the dossier we keep for each piece of equipment:

  • model and serial number;
  • marking;
  • technical specifications and settings;
  • place of installation;
  • dates of production, commissioning, end of warranty;
  • service contracts;
  • schedule and history of maintenance;
  • and the whole "medical history" - breakdowns, repairs.
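
As a rough illustration, such a dossier maps naturally onto a small record type. The sketch below is not the schema of our MMS: the class names, field names and the sample UPS are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class HistoryEntry:
    day: date
    kind: str  # "maintenance", "breakdown" or "repair"
    note: str = ""


@dataclass
class EquipmentRecord:
    model: str
    serial_number: str
    marking: str
    location: str
    settings: dict = field(default_factory=dict)
    produced: Optional[date] = None
    commissioned: Optional[date] = None
    warranty_ends: Optional[date] = None
    service_contracts: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def under_warranty(self, today: date) -> bool:
        """True while the warranty end date has not passed."""
        return self.warranty_ends is not None and today <= self.warranty_ends

    def breakdowns(self) -> list:
        """The 'medical history': breakdown entries only."""
        return [entry for entry in self.history if entry.kind == "breakdown"]


# Made-up example record.
ups = EquipmentRecord(
    model="Example UPS 200",
    serial_number="SN-0001",
    marking="UPS-1.1",
    location="UPS room 1",
    warranty_ends=date(2020, 12, 31),
)
ups.history.append(HistoryEntry(date(2019, 1, 15), "breakdown", "fan module failed"))
ups.history.append(HistoryEntry(date(2019, 1, 16), "repair", "fan module replaced"))
print(ups.under_warranty(date(2019, 2, 1)), len(ups.breakdowns()))
```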

Each data center operator decides for itself how and where to collect all this information; UI does not restrict the choice of tools. It can be a simple Excel spreadsheet (which is where we started) or an in-house Maintenance Management System (MMS), as we have now. Incidentally, our service desk, inventory control, online logbook and monitoring are also built in-house.


This is the "personal file" kept for each piece of equipment.

We demonstrated our practices in this area using, among other things, this infrastructure UPS (pictured), which donated one of its components to a UPS feeding the IT load. Note that, according to the standard, only infrastructure equipment that supplies air conditioning or emergency lighting, and not the IT load, may act as such a "donor".



After that, the auditors asked to see the corresponding ticket in the Service Desk:



And the UPS profile in MMS:



Spare parts. For timely maintenance and emergency repairs of engineering equipment, we keep our own stock of spare parts. There is a central warehouse for large spare parts and small cabinets with spares in the engineering rooms (so that you do not have to run far).

In the photo: checking the availability of spare parts for the diesel generators. We counted 12 filters, then checked the number against the data in MMS.



A similar exercise was done in the main warehouse, where large spare parts are stored: compressors, controllers, automation, fans, steam humidifiers and hundreds of other items. The auditors wrote down a sample of markings and looked them up in MMS.




Spare parts stock data. Red marks what is missing and needs to be purchased.
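
The count-and-compare exercise can be sketched as follows. This is a simplified, hypothetical example: the function `reconcile_stock`, the idea of per-part minimum levels and all quantities except the 12 filters are assumptions, not an export from our MMS.

```python
def reconcile_stock(counted, mms_stock, minimum):
    """Compare a physical count with MMS data and with minimum stock levels.

    Returns (mismatches, to_purchase):
      mismatches  -- {part: (counted, recorded in MMS)} where the two differ
      to_purchase -- {part: quantity missing to reach the minimum level}
    """
    mismatches = {}
    to_purchase = {}
    for part in sorted(set(counted) | set(mms_stock) | set(minimum)):
        on_shelf = counted.get(part, 0)
        in_mms = mms_stock.get(part, 0)
        if on_shelf != in_mms:
            mismatches[part] = (on_shelf, in_mms)
        shortfall = minimum.get(part, 0) - on_shelf
        if shortfall > 0:
            # These are the rows that would be highlighted in red.
            to_purchase[part] = shortfall
    return mismatches, to_purchase


counted = {"diesel filter": 12, "fan": 1}
mms_stock = {"diesel filter": 12, "fan": 2}
minimum = {"diesel filter": 10, "fan": 2, "compressor": 1}

mismatches, to_purchase = reconcile_stock(counted, mms_stock, minimum)
print(mismatches)
print(to_purchase)
```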

Preventive maintenance. In addition to scheduled maintenance and repairs, UI recommends doing preventive maintenance. It helps turn a potential failure into a planned repair. For each monitored parameter, we configure threshold values in the monitoring system. If they are exceeded, the responsible engineers receive alarms and take the necessary action. For example, we:

  • check the electrical panels with a thermal imager to catch defects in the electrical installation in time: poor contact, local overheating of a conductor or a circuit breaker;
  • monitor the vibration and current draw of the refrigeration system pumps, which lets us spot deviations early and plan part replacements in advance;
  • analyze the fuel and oil of the diesel generator sets and compressors;
  • test the glycol concentration in the cooling system.
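
The threshold logic described above can be illustrated with a minimal sketch. The parameter names, the two-level warning/critical scheme and the numeric limits are assumptions for the example and are not our real monitoring settings; the sketch also handles upper limits only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


def classify(value, threshold):
    """Map a reading to 'ok', 'warning' or 'critical' (upper limits only)."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def check_readings(readings, thresholds):
    """Return {parameter: level} for every reading that is not 'ok'."""
    alarms = {}
    for parameter, value in readings.items():
        level = classify(value, thresholds[parameter])
        if level != "ok":
            alarms[parameter] = level
    return alarms


# Made-up limits and readings.
thresholds = {
    "pump_1_vibration_mm_s": Threshold(warning=4.5, critical=7.1),
    "pump_1_current_a": Threshold(warning=35.0, critical=40.0),
}
readings = {"pump_1_vibration_mm_s": 5.2, "pump_1_current_a": 31.0}
print(check_readings(readings, thresholds))
```

A parameter such as glycol concentration, where a value that is too low is also a problem, would need a lower limit as well.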


Pump vibration chart before and after repair.

Work with contractors. Equipment maintenance and repairs are done by external contractors. On our side, dedicated specialists in diesel generator sets, air conditioners and UPSs supervise their work. They check whether the contractors have the tools and materials needed for the repair or maintenance, as well as professional certificates, electrical safety certificates and work permits. They also accept all completed work.


This is what an acceptance checklist for air conditioner maintenance looks like.


At the pass office, the auditors check whether passes were issued to the contractors' authorized representatives, whether they came for maintenance at the stated time and whether they were briefed on the rules.

Documentation. Well-established processes for servicing systems and equipment are only half the battle. Every procedure that a person performs in a data center should be documented. The purpose is simple: nothing should depend on one particular person, and in the event of an emergency any engineer should be able to pick up clear instructions and carry out all the operations needed to resolve it.

UI has its own methodology for such documentation.

For simple and repetitive actions, Standard Operating Procedures (SOPs) are written. For example, there are SOPs for switching a chiller on and off and for putting a UPS into bypass.

For maintenance or complex operations, such as replacing UPS batteries, Methods of Procedure (MOPs) are created. These may include SOPs. Each type of engineering equipment must have its own MOPs.

Finally, there are Emergency Operating Procedures (EOPs), which are the emergency instructions. A list of specific emergencies is compiled and instructions are written for each. Here is part of the list of emergencies; for each one the EOP spells out the signs of the failure, the actions to take, the people responsible and the people to notify:

  • loss of city power supply: diesel generators started / did not start;
  • UPS failure;
  • failure of the data center monitoring system;
  • overheating of a machine room;
  • leak in the refrigeration system;
  • failure of network and computing equipment;

And so on.

Writing this volume of documentation is a laborious job in itself. Keeping it up to date is even harder (and, by the way, the auditors check this too). Most importantly, the staff must know these instructions, work by them and improve them when necessary.


And yes, the instructions must be available where they may be needed, not just gathering dust in an archive.


Change marks in the maintenance regulations for the data center's engineering systems.

During the audit, they also look at the technical documentation for the systems, the as-built and working documentation, and the commissioning certificates.

Marking. During the tour of the data center, the auditors checked the marking everywhere they could reach. Where they could not reach, they used a stepladder :). They looked for it on every panel, circuit breaker and valve. They checked that the markings were unique, unambiguous and consistent with the current as-built diagrams. In the photo below: in the fuel storage pump room, we compare the marking on the solenoid valves with the as-built diagram.



Everything matched the as-built diagram, but one parameter did not match the local "decorative" axonometric diagram on the wall.
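
The two properties the auditors look for, uniqueness and agreement with the diagram, are easy to express in code. The following is only a sketch: the function `check_markings` and the valve labels are hypothetical and do not come from our documentation.

```python
from collections import Counter


def check_markings(field_labels, diagram_labels):
    """Check that on-site labels are unique and match the as-built diagram."""
    counts = Counter(field_labels)
    on_site = set(counts)
    on_diagram = set(diagram_labels)
    return {
        # The same label found on more than one element.
        "duplicates": sorted(label for label, n in counts.items() if n > 1),
        # Present on the diagram but not found on site.
        "missing_on_site": sorted(on_diagram - on_site),
        # Found on site but absent from the diagram.
        "missing_on_diagram": sorted(on_site - on_diagram),
    }


# Made-up labels.
field_labels = ["SV-1", "SV-2", "SV-2", "SV-4"]
diagram_labels = ["SV-1", "SV-2", "SV-3"]
report = check_markings(field_labels, diagram_labels)
print(report)
```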



Diagrams of the systems located in each room of the data center should also hang in that room. In an emergency, they help you quickly find what is where and make an informed decision. In the photo, for example, is the single-line diagram in the main switchboard room.



The diagrams were checked for accuracy as follows: the auditors named the marking of an element on the diagram and asked us to show it in real life.



Here the auditor photographs the trip unit settings of the circuit breakers in the main switchboard, to check them later against the values on the single-line diagram in its paper and electronic copies. On one of the breakers, QF-3, the setting did not match the paper diagram, and we earned a penalty point. Now two engineers will check that the markings on the single-line diagrams match reality.
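
A three-way comparison of this kind can be sketched as below. Only the fact that QF-3 disagreed with the paper diagram comes from the audit; the function `find_setting_mismatches` and the numeric settings are invented for illustration.

```python
def find_setting_mismatches(field, paper, electronic):
    """Return breakers whose setting differs between the three sources.

    Each argument maps a breaker marking to its setting; a breaker missing
    from a source is reported as None for that source.
    """
    mismatches = {}
    for breaker in sorted(set(field) | set(paper) | set(electronic)):
        values = (field.get(breaker), paper.get(breaker), electronic.get(breaker))
        if len(set(values)) > 1:
            mismatches[breaker] = values
    return mismatches


# Made-up settings, in amperes.
field = {"QF-1": 1600, "QF-3": 1250}
paper = {"QF-1": 1600, "QF-3": 1000}
electronic = {"QF-1": 1600, "QF-3": 1250}
print(find_setting_mismatches(field, paper, electronic))
```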



That is not everything the auditors checked about the maintenance processes. Here is what else was on the agenda:




Security and access control. The audit also checks how the security and safety systems work. For example, the auditor tried to enter one of the rooms he had no access to, and then checked whether the attempt was recorded in the ACS and whether security received a notification about it (spoiler: it was, and they did).

In our data centers, if the door to any room stays open for more than two minutes, an alert goes off at the security post. To verify this, the auditors propped one of the doors open with a fire extinguisher. We never got to hear the siren, though: the security guards saw on the video cameras that something was wrong and arrived at the "crime scene" first.
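
The two-minute rule comes down to a simple time comparison. The sketch below is a hypothetical illustration of it, not the logic of our actual ACS; the function `doors_to_alarm`, the room names and the timestamps are made up.

```python
from datetime import datetime, timedelta

DOOR_OPEN_LIMIT = timedelta(minutes=2)  # the limit mentioned in the text


def doors_to_alarm(open_since, now, limit=DOOR_OPEN_LIMIT):
    """Return the doors that have been open for longer than the limit.

    open_since -- {door name: time at which the door was opened}
    """
    return sorted(
        door for door, opened_at in open_since.items()
        if now - opened_at > limit
    )


# Made-up example state.
now = datetime(2019, 2, 5, 12, 0, 0)
open_since = {
    "UPS room": now - timedelta(minutes=3),
    "main switchboard": now - timedelta(seconds=30),
}
print(doors_to_alarm(open_since, now))
```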

Order and cleanliness. Auditors look for dust and for equipment boxes lying around, and check how often the rooms are cleaned. Here, for example, the auditors took an interest in an unidentified object in the ventilation corridor. It was a ventilation system unit waiting to be installed in its place. They still asked us to label it.



Also on the subject of order in the data center: these are the cabinets with all the tools needed for emergency work on the equipment in the main switchboard room.



Location. The data center is also evaluated on its location: whether there are military bases, airports, rivers, volcanoes or other hazards nearby. In the photo, we are showing that no nuclear power plants or oil storage facilities have sprung up around the data center since the last certification in 2017. What is being built over there is the new NORD-5 data center, which will also have to go through all the stages of Uptime Institute Tier III certification. But that is a completely different story.

