Storage Alerts and Errors: How to Deal with Them?

Not so long ago, in the city of N, an IT company specializing in customer data ran its services from its own data center, 24/7. It was the rare case of a shoemaker who actually has shoes: the IT inside this IT company was well tuned. Things got interesting when, after many years of work, the technical director left his post. He had been there from the very beginning, and control over the proper operation of the whole IT vertical rested on him. He was replaced by someone no less experienced (hereinafter “the pro”), with an even broader outlook, who literally charmed the business side with new development horizons. But, as often happens, high flyers are very reluctant to come down to the level of everyday administration.

image

Timeline of the incident:

Day One (April): one local storage system started pouring out alerts, and then the first errors appeared among them. Seeing this, the admin notified his manager, as the instructions required. Our pro brushed him off with the programmer's “golden rule”: “If it works, don't touch it!”

Day One digression: a storage system usually communicates through notifications, and among them alerts deserve special attention: they signal an alarming event or warn that one is coming. Types of alerts:
Warnings: these usually leave time to think calmly.
Errors: for example, a disk has failed but access to the data has not been interrupted; resolving these should not be put off.
Critical errors: a malfunction is guaranteed to have occurred, and a solution is required immediately.
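
The three severity levels above can be turned into a simple triage rule. The sketch below is only an illustration: the `Alert` class, the `overdue_alerts` function and the deadlines in `RESPONSE_DEADLINE` are hypothetical and not taken from any vendor's tooling, so set the values according to your own service agreements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical response deadlines per severity; tune them to your own SLA.
RESPONSE_DEADLINE = {
    "warning": timedelta(days=7),    # time to think calmly
    "error": timedelta(days=1),      # should not be postponed
    "critical": timedelta(hours=1),  # act immediately
}

# Lower number means more urgent when sorting the report.
SEVERITY_ORDER = {"critical": 0, "error": 1, "warning": 2}


@dataclass
class Alert:
    source: str
    severity: str  # "warning", "error" or "critical"
    raised_at: datetime
    resolved: bool = False


def overdue_alerts(alerts, now):
    """Return unresolved alerts whose deadline has passed, most severe first."""
    late = [
        a for a in alerts
        if not a.resolved and now - a.raised_at > RESPONSE_DEADLINE[a.severity]
    ]
    return sorted(late, key=lambda a: (SEVERITY_ORDER[a.severity], a.raised_at))


if __name__ == "__main__":
    # The year is arbitrary; only the April-to-June gap mirrors the story.
    alerts = [Alert("storage-1", "error", datetime(2000, 4, 1))]
    for a in overdue_alerts(alerts, now=datetime(2000, 6, 1)):
        print(f"{a.severity} on {a.source} is still open since {a.raised_at:%d %B}")
```

A report like this, sent to more than one person, makes it harder for an error to sit unanswered for two months just because one manager decided not to touch it.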

When the architecture is being developed (refined or changed), a table is drawn up, and later supplemented and updated, in the form of a disaster recovery plan, starting with the most extreme scenarios and ending with the mildest. An example of such a plan is given below (it is important to compile your own and keep it up to date for your specific system); the complete table with an example disaster recovery plan can be downloaded from the link

image
Day Two (June): our engineer (Agat-A), working on another project for the customer, learns about these errors and asks, “What has been done about them?” The answer: “Nothing. A case was opened in our internal system, management is aware, ...”. On the local admin's side, everything had been done by the standard process, strictly according to the instructions, two months earlier. Asked whether he needed help, the admin answered that he had done his part, but no orders had come down.

Day Two digression: introducing a disaster recovery checklist and using it sensibly helps reconstruct the overall picture of what has been done, and can also help avoid obvious mistakes and unnecessary fuss.

An example of a disaster recovery checklist for a whole system:

image

Day Three: ignoring the errors made the storage system less responsive, and “for some reason” it no longer always coped with the tasks piled onto it. The first customer complaints about speed during peak hours appeared. At this point the pro (the IT manager) was asked about it at a planning meeting. He realized that it was time to do something and went down to the “engine room”. The upshot: that same day a case was opened on the vendor's portal about ... a failed controller!

After that, the customer's engineer politely asked us to help. It should be mentioned separately that on-site partner and vendor support had been “cut” to save money when the system was purchased, so de jure we were not supposed to deal with these issues at all. But thanks to good relations with the customer, and projects carried out for them about once every year and a half, we got involved at the customer's request. We immediately ask for the logs and receive them promptly, describe the situation more clearly for the vendor request, set the priority, and so on. The logs show that one controller has died and the second is failing but correcting errors on the fly, and that the battery in the second controller has also died. We announce the diagnosis (good thing it is not a death sentence) and expedite the order of controllers from the manufacturer; as usual, they were not in stock at the Russian warehouse.


Day Four (August): a few weeks later the controllers cleared customs and reached the customer's server room (along the way we recorded the serial numbers, which would be needed to close the vendor support case when the old controllers were sent back). The trip from customs to the server room took 2 days. And then ... unhurried reality set in again. Why had we been in such a hurry? The customer declined our offer to have our specialists replace the controllers, or at least supervise the process: “we are no fools ourselves, we will figure it out” (as practice had shown under the previous technical director, this was 100% true). Under the terms of the service, it is necessary (highly desirable!) to send the replaced old controllers back to the manufacturer within two weeks. The manufacturer reminded the customer about the return more than once.

Day Four digression: people, be human. Do not be afraid to ask a question, do not hesitate to ask for help, and do not be too proud to double-check yourself. Of course, there are people who can carry the whole organizational side on their own backs, relying on experience and the ability to work 12 hours a day. But teamwork means that everyone uses their strengths, not the other way around. As specialists, work through fallback options before critical situations occur. Prepare for them in advance, and may they pass you by. And even if something does happen, you will be ready and able to get through it with minimal losses.

Day Five (October, climax): what follows is written by our engineer, in the first person.

Early in the morning, when I was about a 5-minute walk from the office, a call came from an unknown number. I answer: an alarmed voice asks us to help their pro solve a problem with their storage, because customers cannot access their service. During the conversation I try to work out which customer this is. Once I have identified them, I recall that he (the pro) had apparently eliminated the SPoF (single point of failure) in the form of the completely dead controller, but kept postponing the replacement of the second, failing one. Fine, only a technician can give more technical details, so we arrange and immediately start a call with the pro and the administrator; a brand-new administrator, by the way, who turns out to have been hired in early September.

I start asking questions, more and more of them and more and more precise, trying to localize the problem. Here are some of the answers from the new admin and the pro together: “the old dead controller was replaced almost immediately, at the end of August or the beginning of September” ... “we didn't replace the second one, we wanted to combine its replacement with some work that required shutting down the system” ... “so far everything worked” ... “the errors and criticals were gone” ... “and now the storage system has gone dark” ... “no network access” ... “all services are down” ... “some of the lights are off” ... “it doesn't blink where it usually blinked” ... “I don't understand what this means.”

A few minutes later, thanks to the answers to my questions, a picture emerged, and then the first shock hit. To my next question, whether there was a backup copy of the controller settings, I suddenly heard complete silence. A minute later the picture was complete: the pro had replaced one controller (the one that was completely dead) without powering off the storage system, physically pulling out the old one and inserting the new one in its place; I quote: “the critical error disappeared”. And that was it! After that he did nothing more with it. NOTHING!!! “The light is on, the critical error is gone.” He put off replacing the second (barely alive) controller until a storage shutdown, which was delayed for almost a month and a half (again, the same “golden rule” in action). Then I asked for a pause to think (actually to digest, because my brain simply refused to believe what it had heard).

Having recovered a little (it was probably a minute of silence), I finally grasp it: one controller died and was replaced by a blank new one; the second lived out its life (for more than three months the poor thing pulled the whole system alone, with a dead battery, correcting single errors on the fly) and also died. There is no copy of the settings, the people on site cannot quickly retrieve the settings themselves, they physically cannot give me remote access (“something” is wrong with the Internet), and man-hours are being lost.
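
A missing copy of the controller settings is easy to detect long before an incident. How the settings are exported is vendor-specific and is not covered here; the sketch below only assumes that exports are saved as files in some directory and checks that the newest one is recent enough. The function names, the `*.cfg` pattern and the 30-day limit are all hypothetical.

```python
import time
from pathlib import Path


def newest_backup_age_days(backup_dir, pattern="*.cfg", now=None):
    """Return the age in days of the newest matching file, or None if there is none."""
    now = time.time() if now is None else now
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    return (now - newest_mtime) / 86400  # seconds per day


def backup_is_fresh(backup_dir, max_age_days=30, pattern="*.cfg", now=None):
    """True only if at least one export exists and it is not older than the limit."""
    age = newest_backup_age_days(backup_dir, pattern, now)
    return age is not None and age <= max_age_days
```

Run from a scheduler, a check like this can raise its own alert when `backup_is_fresh` returns `False`. Keep the exports somewhere that does not live on the storage system they describe.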

First I worked out how to fix this, then I started asking about the network: could we quickly get a network map? (No, there was almost nothing at hand.) A couple of minutes of unanswered knocking at various gateways, services, storage and network equipment followed (I asked and said what to do, they told me what came of it; everything was done without remote access, because “for some reason there is no Internet either”). Then, from one more question and answer, it dawns on me that the DHCP servers are virtual and boot from this very storage system, there are no static addresses anywhere, and therefore EVERYTHING is unavailable. That was the second shock (just when I thought there was no further down to go; management ports left without static addresses are evil).

Okay, this time I recovered much faster, sketched a rough plan of action in my head and explained it to my “colleagues”: we need a computer or laptop with a patch cord next to the storage system, and a pair of hands nearby. We also need the instructions for setting up the controller (if they are missing or lost, I will find and send them right away) and a “piece” of the network map around the storage system (“piece” = basic network settings). Once all this is ready, we do the basic configuration of the new storage controllers, connecting to them directly from the laptop with the patch cord, following the instructions and using the network settings we found. Then we bring up DHCP and finish configuring the storage controllers in production, bringing up each system in turn and checking that it works as it should.

I find and send the instructions (corporate mail is also down, by the way, because it too depends on this storage system, so I use personal mail ...), and by this time the pro has found at least the basic network settings for the storage system (the IP addresses of both controllers, etc.). The pro finally understood what to do and said that he would manage from there. I reminded him that I was available, and let him go. Some time later this client's “24/7” service was working again.
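
The trap described here, where the tool needed to repair a component depends on that same component, can be found on paper in advance. The sketch below is a minimal illustration with a hypothetical dependency map loosely modeled on this incident; the component names and both function names are assumptions, and a real map has to come from your own network and service documentation.

```python
from collections import deque

# Hypothetical map: each component needs everything in its list in order to work.
DEPENDS_ON = {
    "vm_cluster": ["storage"],
    "dhcp": ["vm_cluster"],            # DHCP servers are virtual machines
    "corporate_mail": ["vm_cluster"],
    "customer_service": ["vm_cluster"],
    "storage_mgmt_port": ["dhcp"],     # no static address: needs a DHCP lease
    "switch_mgmt_port": ["dhcp"],
}


def impacted_by(failed, depends_on):
    """Return every component that stops working when `failed` goes down."""
    # Invert the map: for each component, who depends on it directly.
    dependents = {}
    for component, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, set()).add(component)
    # Breadth-first walk over the direct and indirect dependents.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in impacted:
                impacted.add(nxt)
                queue.append(nxt)
    return impacted


def recovery_deadlocks(depends_on, recovery_tools):
    """List components whose failure also takes down the tool used to repair them."""
    return sorted(
        component
        for component, tool in recovery_tools.items()
        if tool in impacted_by(component, depends_on)
    )


if __name__ == "__main__":
    tools = {"storage": "storage_mgmt_port"}
    print(recovery_deadlocks(DEPENDS_ON, tools))
    # Giving the management port a static address removes its DHCP dependency.
    fixed = dict(DEPENDS_ON, storage_mgmt_port=[])
    print(recovery_deadlocks(fixed, tools))
```

In this toy map the only change needed to break the deadlock is a static address on the storage management port, which is the same conclusion the incident forced on everyone the hard way.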

For me, the whole incident fit into about forty minutes. On the one hand, I was pleased that the problem could be solved promptly, online and by phone; on the other hand, I was very surprised that anyone could end up in such a state. The clients of this IT company did not appreciate the incident either, because the promised service was supposed to work 24/7 and this was the start of the working day (and, given the time zones, for some it was even the height of the working day).

image

This could be the end, but for me a case is only complete after a review of the mistakes. So my colleagues and I tried to write down what can and should be changed in our (and not only our) work to prevent this from happening again.

This case turned out to be unpaid work; we did not even get a muttered thank-you. That is understandable: we saw what the customer would like to forget quickly, with the witnesses buried in the forest. But the case added to our collection of cheat sheets and templates for the most common situations faced by administrators, engineers and the business when using and maintaining storage and related systems. To some, these cheat sheets and instructions may seem too simple or even narrow. In any case, for each system you need to enter your own data into these templates (after all, everyone has their own landscape, their own requirements for information and services, etc.), draw your own diagrams and develop your own procedures.

Finally, we give an example of a backup policy.

image
A similar cheat sheet created for your own system can greatly help both a novice and a master. Even if the master can keep everything in his head, he is not a biorobot with a 24/7 work schedule. And in any case, any tool must be used sensibly.

And, humming “To those who are going to bed, sleep well”, we end our story.
