How we evacuated Yandex's duty shift



When your work fits on one laptop and can be done independently of other people, moving to remote work is no problem: you just stay at home in the morning. But not everyone is so lucky.

The duty shift is a team of service availability specialists (SRE). It includes administrators on duty, developers and managers, as well as a shared “dashboard” of 26 LCD panels, 55 inches each. The stability of the company's services and the speed at which problems get solved depend on the work of the duty shift.

Today Dmitry Melikov (tal10n), the shift supervisor, will talk about how they managed to move the equipment to their homes and set up new work processes in a matter of days. I give him the floor.



- When you have an endless supply of time, you can comfortably move anything anywhere. But the rapid spread of the coronavirus put us in completely different conditions. Yandex employees were among the first to switch to remote work, even before the self-isolation regime was introduced. It happened like this. On Thursday, March 12, I was asked to assess whether the team's work could be moved home. On Friday the 13th, the recommendation to switch to remote work arrived. By the night of Tuesday, March 17, everything was ready: the administrators on duty were working from home, the equipment had been delivered, the missing software had been written, and the processes had been reconfigured. Now I'll tell you how we did it. But first, let me recall the tasks that the duty shift solves.

Who we are


Yandex is a large company with hundreds of services. The stability of search, the voice assistant and all the other products depends not only on the developers. A data center may lose power. A worker may accidentally damage an optical cable while replacing asphalt. Or there may be a surge in user activity that requires an urgent reallocation of capacity. Moreover, we all live in one large, complex infrastructure, and a release of one product may accidentally degrade another.

The 26 panels in our open space display one and a half thousand alerts and more than a hundred charts and panels for our services. In effect, this is one huge diagnostic panel. An experienced administrator on duty, looking at it, quickly understands the state of important nodes and can pick a direction for investigating a technical problem. This does not mean that a person has to watch all the instruments constantly: the automation itself draws attention by sending a notification to the duty administrator's dedicated interface. But without a visual panel, solving the problem may take longer.

When problems arise, the administrator on duty first assesses their priority, then isolates the problem or minimizes its impact on users.

There are several standard ways to isolate a problem. One of them is service degradation, when the administrator on duty disables some of the features that users notice least. This temporarily reduces the load and buys time to figure out what happened. If the problem is in a data center, the administrator on duty contacts the operations team, works out what went wrong, tracks how long the fix will take and, if necessary, brings in specialized teams.
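The degradation approach described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature names, the load figures, the `visibility` score and the `plan_degradation` helper are hypothetical and do not come from Yandex's tooling. The idea is simply to switch off the least noticeable features first until the load fits the available capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    load_rps: int     # load this feature adds, in requests per second (hypothetical)
    visibility: int   # 1 = users barely notice it, 10 = core functionality


def plan_degradation(features, capacity_rps):
    """Pick features to disable, least visible first, until load fits capacity.

    Returns (names_to_disable, remaining_load_rps).
    """
    load = sum(f.load_rps for f in features)
    disabled = []
    for feature in sorted(features, key=lambda f: f.visibility):
        if load <= capacity_rps:
            break
        disabled.append(feature.name)
        load -= feature.load_rps
    return disabled, load


# Hypothetical service with four features and a total load of 12000 rps.
FEATURES = [
    Feature("search_results", 7200, 10),
    Feature("suggestions", 1800, 4),
    Feature("related_queries", 1200, 2),
    Feature("personalized_banners", 1800, 1),
]

if __name__ == "__main__":
    print(plan_degradation(FEATURES, capacity_rps=9000))
```

In this sketch the core feature is disabled last, which mirrors the goal of reducing load while keeping the impact on users as small as possible.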

When the administrator on duty cannot isolate a problem caused by a release, they report it to the service team, and the developers look for errors in the new code. If they cannot figure it out, the administrator brings in developers from other products or service availability engineers.

I could talk for a long time about how everything works here, but I think I have already conveyed the essence. The duty shift coordinates the work of all services and keeps global problems under control. It is important for the administrator on duty to have the diagnostic panel in front of their eyes. That is why, when switching to remote work, you can't simply hand everyone a laptop: the charts and alerts do not fit on its screen. What to do?

Idea


In the office, all ten administrators take turns working at one dashboard, which includes 26 monitors, two computers, four NVIDIA Quadro NVS 810 video cards, two uninterruptible power supplies and several independent network connections. But we needed to give everyone the ability to work from home. Assembling such a wall in an apartment just won't work (my wife would be especially thrilled), so we decided to create a portable version that can be delivered and assembled at home.

We started experimenting with the configuration. We needed to fit all the instruments on fewer displays, so the main requirement for the monitor was high pixel density. Of the 4K monitors available to us, we chose the Lenovo P27u-10 for testing.
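To make the pixel-density requirement concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions the text does not state: that the 27-inch 4K monitor has a 3840×2160 panel and that the 55-inch wall panels are Full HD (1920×1080). If the wall panels have a different resolution, the comparison changes accordingly.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixel density: diagonal length in pixels divided by diagonal in inches."""
    return math.hypot(width_px, height_px) / diagonal_in


def total_pixels(width_px, height_px, count):
    """Total pixel count across `count` identical displays."""
    return width_px * height_px * count


# Assumed geometry (not stated in the article).
HOME_MONITOR = (3840, 2160, 27)   # 27-inch 4K monitor
WALL_PANEL = (1920, 1080, 55)     # 55-inch panel, assumed Full HD

if __name__ == "__main__":
    print("home monitor PPI:", round(pixels_per_inch(*HOME_MONITOR)))
    print("wall panel PPI:  ", round(pixels_per_inch(*WALL_PANEL)))
    print("4 home monitors: ", total_pixels(3840, 2160, 4), "pixels")
    print("26 wall panels:  ", total_pixels(1920, 1080, 26), "pixels")
```

Under these assumptions a single 4K monitor packs roughly four times the density of a wall panel, which is why a handful of such monitors can stand in for a much larger wall when viewed from desk distance.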

For the laptop, we took a 16-inch MacBook Pro. It has a fairly powerful graphics subsystem, which is needed to render to several 4K displays, and four universal Type-C connectors. You may ask: why not a desktop? Replacing a laptop with an identical one from the warehouse is much easier and faster than assembling and configuring an identical desktop. And it weighs less.

Now we had to understand how many monitors we could actually connect to the laptop. The limiting factor here is not the number of connectors, so we could only find out by testing the complete system.



Testing


We placed all the charts and alerts quite comfortably on four monitors and even connected them to the laptop, but then we ran into a problem. Rendering to four 4K monitors loaded the GPU so heavily that the laptop kept discharging even while plugged in. Fortunately, the problem was solved with the Lenovo ThinkPad Thunderbolt 3 Dock Gen 2 docking station. We managed to connect a monitor, power, and even a favorite mouse and keyboard to the docking station.

But another problem surfaced immediately: the GPU worked so hard that the laptop overheated, which meant the battery overheated too, went into protective mode and stopped taking charge. In general, this is a very useful mode that protects against dangerous situations. In some cases, the problem was solved with a high-tech device: a ballpoint pen placed under the laptop to improve ventilation. But this did not help everyone, so we also turned up the speed of the built-in fan.

There was one more unpleasant quirk. Every chart and alert has to sit in a strictly defined place. Imagine that you are landing a plane, and suddenly the airspeed indicators, altimeters, variometers, attitude indicators, compasses and position indicators start resizing and jumping to different places. So we decided to build an application to handle this. We wrote it in one evening with Electron.js, using its ready-made API for creating and managing windows. We added configuration handling and periodic refreshing of the windows, as well as support for a limited number of monitors. A little later, we added support for different setups.
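The core of such a window-placement tool is a mapping from a configuration to fixed window rectangles. The sketch below is not the team's application: it is written in Python rather than Electron.js purely to show the layout arithmetic, and the `Monitor`, `Slot` and `compute_bounds` names, the grid-based configuration and the window names are all assumptions. In an Electron app, rectangles like these would typically be applied to each window on a timer so that anything that moved snaps back into place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    x: int        # top-left corner in the virtual desktop, in pixels
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Slot:
    name: str     # hypothetical window identifier, e.g. a chart name
    monitor: int  # index of the monitor the window belongs to
    row: int      # grid cell on that monitor
    col: int


def compute_bounds(monitors, slots, rows, cols):
    """Map each configured window to a fixed (x, y, width, height) rectangle.

    Slots that refer to a monitor that is not connected are skipped, so the
    same configuration still works on a setup with fewer monitors.
    """
    bounds = {}
    for slot in slots:
        if not 0 <= slot.monitor < len(monitors):
            continue
        m = monitors[slot.monitor]
        cell_w = m.width // cols
        cell_h = m.height // rows
        bounds[slot.name] = (
            m.x + slot.col * cell_w,
            m.y + slot.row * cell_h,
            cell_w,
            cell_h,
        )
    return bounds


if __name__ == "__main__":
    monitors = [Monitor(0, 0, 3840, 2160), Monitor(3840, 0, 3840, 2160)]
    slots = [Slot("alerts", 0, 0, 0), Slot("search-latency", 1, 1, 1)]
    for name, rect in compute_bounds(monitors, slots, rows=2, cols=2).items():
        print(name, rect)
```

Because the rectangles are derived only from the configuration and the monitor geometry, re-running the computation always yields the same positions, which is the property the dashboard needs.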

Assembly and delivery


By Monday, the helpdesk wizards had obtained 40 monitors, ten laptops and as many docking stations for us. I don't know how they did it, but thank you very much.



All that remained was to deliver everything to the apartments of the administrators on duty. That meant ten addresses in different parts of Moscow: south, east and center, plus Balashikha, which is 45 kilometers from the office (by the way, an intern from Serpukhov was added later as well). We had to distribute all of this among people somehow and work out the logistics.

I entered all the addresses into our Maps, which also lets you optimize a route between several points (I used the free beta version of the tool for couriers). We split our team into four independent crews of two people, and each crew got its own route. My car was the roomiest, so I took the equipment for four employees at once.



The entire delivery took a record three hours. We left the office at ten in the evening on Monday. At one in the morning I was already at home. That same night we went on duty with new equipment.

The result


Instead of one large diagnostic console, we assembled ten relatively portable ones, one in the apartment of each administrator on duty. Of course, some small things still had to be sorted out. For example, we used to have a single physical phone for the person on duty to receive notifications. In the new conditions this no longer worked, so we came up with “virtual phones” for those on duty (in fact, channels in the messenger). There were other changes too. But the main thing is that in record time we moved not just the people, reducing the risk of infection, but all of our work to home, without harming our processes or the stability of the products. We have been working this way for a month.

Below you will find photos of the real workplaces of our administrators on duty.









