The story of the Dodo bird of the genus Phoenix. The Great Fall of Dodo IS

Every year on April 21, we remember the Great Fall of Dodo IS in 2018. The past is a cruel but fair teacher. It is worth remembering it, repeating its lessons, passing the accumulated knowledge on to new generations, and being grateful for what we have become. Under the cut, we want to tell you the story of how it happened and share our conclusions. You would not wish a situation like this on your worst enemy.




The History of the Great Fall


Day 1. The accident that cost us millions of rubles


On Saturday, April 21, 2018, Dodo IS went down. Went down very badly. For several hours, our customers could not place an order either through the site or through the mobile application. The call queue at the call center grew so long that the answering machine started saying: "We will call you back in 4 hours."



That day we experienced one of the most serious outages in the history of Dodo IS. The worst part was that the day before, we had launched our first federal TV advertising campaign with a budget of 100 million rubles. It was a huge event for Dodo Pizza, and the IT team had prepared for it as well. We had automated and simplified deployment: with a single button in TeamCity we could now deploy the monolith to 12 countries. But we had not done everything we could, and so we screwed up.

The advertising campaign was amazing. We received 100-150 orders per minute. That was the good news. The bad news: Dodo IS could not withstand such a load and died. We had reached the limit of vertical scaling and could no longer process orders. The system was unstable for about three hours, periodically recovering only to fall over again. Each minute of downtime cost us tens of thousands of rubles, not counting the lost trust of angry customers. The development team let everyone down: customers, partners, the people in the pizzerias and the call center.



We had no choice but to roll up our sleeves and sit down to fix the situation. Starting on Sunday, April 22, we worked in crunch mode; we had no right to make a mistake.

What follows is a summary of how we got ourselves into this situation and how we got out of it. Friends, do not repeat our mistakes.

The Two Fails That Set Off the Domino Effect


It’s worth starting with how it all began and where we screwed up.



On Saturday, 04/21/2018, at about 5:00 p.m., we noticed that the number of locks in the database had started to grow - a harbinger of problems. We had a runbook ready for this, because we understood where these locks were coming from.

Everything went wrong after the runbook failed twice in a row. For a couple of minutes the database would return to normal, and then start choking on locks again. Alas, the lock wait timeout on the master database was set to 600 seconds, which is why the locks kept piling up. This is the first important fail in this story: a simple setting could have saved everything.
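For context, this is a database configuration issue. A minimal sketch of the relevant MySQL (InnoDB) settings, using the values we eventually settled on (see the retro notes at the end); exact values will depend on your workload:

    # my.cnf, [mysqld] section
    innodb_lock_wait_timeout = 5     # wait at most 5 seconds for a row lock instead of 600
    innodb_rollback_on_timeout = ON  # on timeout, roll back the whole transaction,
                                     # not just the last statement (default is OFF)

With a short timeout and full rollback, a stuck transaction fails fast and releases its locks instead of letting waiters pile up for minutes.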

Then came attempts to bring the system back to life in many different ways, none of which succeeded - until we realized that the mobile application and the new site used different order-taking schemes. By switching them off one at a time, we were able to see where the holes in the old order-taking scheme were.

The order-taking system had been written long ago. By that time it was already being reworked, and the new scheme had been rolled out on the new site, dodopizza.ru, but not in the mobile application. Initially, the reasons for creating the new ordering scheme were purely about business rules; performance was not even on the agenda. This is the second important fail: we did not know the limits of our own system.

Day 2. Recovering from the Accident


The response of the team said a lot. Our CTO wrote a post in Slack, and everyone showed up the next day - on April 22, work began at 8:30 in the morning. No one had to be persuaded or asked to come in on their day off. Everyone understood what needed to be done and helped however they could: with their hands, with their heads, in testing, query optimization, infrastructure. Some even came with their whole families! People from neighboring teams not related to IT came to the office with food, and the call center brought in extra staff just in case. All the teams united around a single goal - to get back up!



The new order-taking scheme was the main goal on Sunday, April 22. We understood that the peak of orders would be comparable to Saturday's. We had the hardest possible deadline: at 5 p.m. a new flood of orders would hit.

That day we acted according to the "make sure it doesn't fall again" plan, which we had worked out late in the evening of the 21st, once we had brought the system back up and understood what had happened. The plan was roughly divided into 2 parts:

  1. Roll out the new order-taking scheme in the mobile application.
  2. Optimize the order creation process.

By doing both, we could be reasonably confident that Dodo IS would not fall again.

Defining the scope of work and getting to work


Rolling out the new order-taking scheme in the mobile application was the highest priority. We did not have exact numbers for the whole scheme, but based on individual parts of it - the number and quality of database queries - our expert judgment was that it would improve performance. A team of 15 people worked on the task throughout the day.

In fact, we had started introducing the new ordering scheme before the fall on 04/21, but had not finished the job. There were still open bugs, and the task was hanging in a semi-active state.

The team split the regression into parts: regression on the two mobile platforms, plus pizzeria management. We spent a lot of time manually preparing test pizzerias, but the clear separation helped parallelize the manual regression.

As soon as a change was made, it was immediately deployed to the pre-production environment and tested right away. The team was constantly in touch - everyone simply sat in one large room with Hangouts on. The guys from Nizhny Novgorod and Syktyvkar were also always connected. If there was a blocker, it was resolved immediately.

Usually we roll out new functionality gradually: 1 pizzeria, 5 pizzerias, 10, 20 and so on up to the whole network. This time we had to act more decisively. There was no time - the new peak would start at 5 p.m., and we simply could not afford to miss it.

At about 15:00 the update was rolled out to half the network (about 200 pizzerias). At 15:30 we made sure everything was working fine and enabled it for the entire network. Feature toggles, quick deployments, regression broken into parts and a fixed API made all of this possible.
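To give a sense of the idea behind a toggle-driven gradual rollout, here is a minimal sketch in Python; the flag store and the names are illustrative, not our actual code:

    # Minimal sketch of a feature toggle with a gradual rollout. The flag store
    # here is an in-memory set of pizzeria ids; in production it would live in
    # a database or a configuration service.
    ROLLOUT_PIZZERIAS: set[int] = set()

    def enable_for(pizzeria_ids):
        """Widen the rollout step by step: 1 pizzeria, 5, 10, 20, half the network, all."""
        ROLLOUT_PIZZERIAS.update(pizzeria_ids)

    def is_enabled(pizzeria_id: int) -> bool:
        """Checked at order-taking time to choose between the new and the legacy scheme."""
        return pizzeria_id in ROLLOUT_PIZZERIAS

    def take_order(order: dict, pizzeria_id: int) -> str:
        if is_enabled(pizzeria_id):
            return f"order {order['id']} accepted via NEW scheme"
        return f"order {order['id']} accepted via LEGACY scheme"

    # Rollout as described above: roughly half the network first, then everyone.
    enable_for(range(1, 201))
    print(take_order({"id": 1}, pizzeria_id=42))

The important property is that the legacy path stays available as a fallback, so widening or narrowing the rollout is a data change, not a deployment.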

The rest of the team worked on different ways to optimize order creation. The new scheme was not entirely new - it still used the legacy parts. Saving addresses, applying promo codes, generating order numbers: these parts were and remained shared. Optimizing them came down to rewriting the SQL queries themselves, getting rid of them in the code, or optimizing how they were called. Some things were moved to asynchronous mode; some things, as it turned out, were being called several times instead of once.
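As an illustration of that last point (this is the general pattern, not our actual fix): when the same lookup runs several times while one order is being created, simple memoization is often enough to collapse the duplicate calls. A small Python sketch:

    from functools import lru_cache

    DB_CALLS = 0  # counts how often we actually hit the "database"

    def load_promo_code_from_db(code: str) -> dict:
        global DB_CALLS
        DB_CALLS += 1
        return {"code": code, "discount": 0.1}      # imagine a real SELECT here

    @lru_cache(maxsize=1024)
    def get_promo_code(code: str) -> dict:
        # Memoized: repeated lookups of the same code while an order is being
        # created reuse the first result instead of querying again.
        return load_promo_code_from_db(code)

    get_promo_code("SPRING"); get_promo_code("SPRING"); get_promo_code("SPRING")
    print(DB_CALLS)   # 1 instead of 3

In a real service the cache would need to be scoped to the request or invalidated properly, but the idea is the same: one query instead of several.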

The infrastructure team was busy moving some components onto separate instances, simply so that their load would not interfere with each other. Our main problem component was the legacy facade, which talked to the legacy database. Most of the work, including splitting instances, was devoted to it.

Organizing the process


In the morning we held the only sync of the day, split into teams and went off to work.

At first we kept the entire log of changes and tasks directly in Slack, because there were not that many tasks. But their number kept growing, so we quickly moved to Trello. The integration we set up between Slack and Trello notified us of any status change on a task.

In addition, it was important for us to see the entire log of changes to production. The electronic version lived in Trello; the backup version lived on the infrastructure whiteboard in the form of cards. If something went wrong, we needed to be able to quickly figure out which changes to roll back. Full regression was done only for the new order-taking scheme; the rest of the changes were tested less strictly.

Tasks flew to production at bullet speed. In total, we updated the system 15 times that day. Test stands were deployed, one per team. Development, a quick check, deployment to production.

On top of the overall process, our CTO Sasha Andronov constantly dropped in on the teams with the question "How can I help?". This helped redistribute effort and avoid losing time because someone had not thought to ask for help. Semi-manual development management, a minimum of distractions, and work at the limit.

We needed to end that day with the feeling that we had done everything we could. And even more. And we did it! At 15:30 the new order-taking scheme was enabled for the mobile application across the whole network. Hackathon mode, nearly 20 deployments to production in a day!

The evening of April 22 was calm and clear. No crashes, not even a hint that the system might be in trouble.

At around 10 p.m. we gathered again and outlined a plan for the week: order limiting, performance tests, asynchronous order creation and much more. It had been a long day, and there were long weeks ahead.

Rebirth


The week of April 23 was hellish. By the end of it, we told ourselves that we had given 200% and done everything we could.

We had to save Dodo IS, and we decided to use a medical analogy. This was, in fact, our first real case of using a metaphor (in the original XP sense) that genuinely helped us understand what was going on:

  • Resuscitation - when you need to save a patient who is dying.
  • Treatment - when there are symptoms, but the patient is still alive.




Resuscitation


The first stage of resuscitation is a set of standard runbooks for restoring the system when certain indicators fail. If one thing goes down, we do this; if another goes down, we do that, and so on. In the event of a crash, we quickly find the right runbook; they all live on GitHub and are organized by problem.

The second stage of resuscitation is order limiting. We borrowed this practice from our own pizzerias. When a pizzeria gets flooded with orders and realizes it cannot cook them quickly enough, it goes on "stop" for 5 minutes, just to clear the order queue. We built a similar scheme into Dodo IS: if things got really bad, we would turn on a global limit and tell customers, in effect, "guys, give us 5 minutes and we will take your order." We developed this measure just in case, and in the end we never had to use it. Which was a relief.
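A minimal sketch of the idea in Python, assuming a single global switch; the names and the mechanism here are illustrative, not the actual implementation:

    import time

    # A global "stop taking orders for N seconds" switch.
    STOP_UNTIL = 0.0          # unix timestamp until which new orders are deferred

    def stop_intake(seconds: float = 300) -> None:
        """Turn on the global limit, e.g. when the system starts to choke."""
        global STOP_UNTIL
        STOP_UNTIL = time.time() + seconds

    def try_take_order(order: dict) -> dict:
        if time.time() < STOP_UNTIL:
            # Ask the customer to retry instead of overloading the system.
            return {"accepted": False, "retry_after_seconds": int(STOP_UNTIL - time.time())}
        return {"accepted": True, "order_id": order["id"]}

    # Usage: operators flip the switch during an incident.
    stop_intake(300)                      # "5 minutes and we will take your order"
    print(try_take_order({"id": 123}))    # -> not accepted, retry later

The point of the limit is to shed load gracefully for a few minutes rather than let the whole order pipeline collapse.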

Treatment


To start treatment you first need a diagnosis, so we focused on performance tests. Part of the team went off to capture a real load profile from production using GoReplay; part of the team focused on synthetic tests on Stage.

Synthetic tests did not reflect the real load profile, but they gave us some ground for improvements and exposed weaknesses in the system. For example, shortly before all this we had been migrating from Oracle's MySQL connector to a new one. That version of the connector had a bug in session pooling, which caused the servers to max out on CPU and stop serving requests. We reproduced the problem in the Stage tests, fixed it and calmly rolled the fix out to production. There were a dozen cases like this.
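For reference, capturing and replaying a real load profile with GoReplay usually looks something like this; the port and the Stage hostname below are placeholders, not our actual setup:

    # On a production web node: record incoming HTTP traffic to a file.
    gor --input-raw :80 --output-file requests.gor

    # Later, on a test machine: replay the recorded profile against Stage.
    gor --input-file requests.gor --output-http "https://stage.example.com"

Replaying recorded traffic gives a much more realistic profile than hand-written synthetic scenarios, which is exactly why the two approaches were run in parallel.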

As problems were diagnosed and their causes identified, they were fixed point by point. We also came to understand that our ideal target was asynchronous order intake, and we started working on introducing it in the mobile application.
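The general shape of asynchronous intake, sketched in Python with an in-memory queue standing in for whatever broker a production system would use; this is an illustration of the idea, not our implementation:

    import asyncio

    # Accept the order immediately and do the heavy work (legacy saves,
    # promo codes, notifications) in a background worker.
    async def accept_order(queue: asyncio.Queue, order: dict) -> dict:
        await queue.put(order)                      # cheap and fast; hard to overload
        return {"status": "accepted", "order_id": order["id"]}

    async def process_orders(queue: asyncio.Queue) -> None:
        while True:
            order = await queue.get()
            await asyncio.sleep(0.1)                # stand-in for the slow legacy calls
            queue.task_done()

    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        worker = asyncio.create_task(process_orders(queue))
        print(await accept_order(queue, {"id": 1}))
        await queue.join()                          # wait for background processing
        worker.cancel()

    asyncio.run(main())

The request path only enqueues the order, so a spike in orders fills a queue instead of exhausting database connections and CPU on the web tier.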

Hell weeks: organizing the process


A team of 40 people worked on a single big goal: stabilizing the system. All teams worked together. Don't know what to do? Help another team. Focusing on specific goals helped us not to spread ourselves thin and not to waste time on things we did not need.



Three times a day we synchronized with a common stand-up, as in classic Scrum. For 40 people. Only twice in three weeks (out of almost 90 syncs) did we fail to fit into 30 minutes. The longest sync lasted 57 minutes; usually they took 20-30 minutes.

The goal of the syncs: to understand where help was needed and when particular tasks would reach production. People grouped into project teams; if they needed help from infrastructure, someone came over right away, and all questions were resolved quickly - less discussion, more action.

In the evenings, to support the team, our R&D lab cooked food for the developers. Crazy pizza recipes, chicken wings, potatoes! It was unbelievably cool, and that kind of support motivated everyone enormously.



Working in this non-stop mode was damn hard. On Wednesday, April 25, at about 5 p.m., Oleg Blokhin, one of our developers, who had been grinding away non-stop since Saturday, came up to the CTO. There was inhuman fatigue in his eyes: "I'm going home, I can't take it anymore." He got a good night's sleep and came back fresh the next day. That describes the state of many of the guys.

The next Saturday, April 28 (a working Saturday for everyone in Russia), was calmer. We did not change anything; we watched the system, and the team got a little rest from the pace. Everything went quietly. The load was not huge, but it was there. We got through it without problems, and that gave us confidence that we were on the right track.

The second and third weeks after the fall went at a calmer pace - the hellish work until late in the evening was gone, but the overall "martial law" process remained.

The Next Day X: A Test of Strength


The next day X was May 9th! Some of us sat at home monitoring the state of the system. Those who went out for a walk took laptops along, to be fully armed if something went wrong. Closer to the evening peak, our CTO Sasha Andronov went to one of the pizzerias so that, in case of problems, he could see everything with his own eyes.

That day we received 91,500 orders in Russia (at the time, the second-best result in Dodo's history). There was not even the slightest hint of a problem. May 9 confirmed that we were on the right track: focus on stability, performance, quality. A restructuring of our processes still lay ahead, but that is a completely different story.

The Great Fall Retro and 6 Practices


Critical situations give rise to good practices that can and should be carried over into quieter times: focusing, inter-team help, quick deployment to production without waiting for a full regression. We started with a retrospective and then built a process framework around it.



The first two days were spent discussing practices. We did not set ourselves the goal of squeezing the retrospective into 2 hours. After a situation like that, we were ready to take the time to work through our ideas and our new process in detail. Everyone participated - everyone who had been involved in the recovery work in one way or another.



As a result, 6 important practices emerged. Among them: a single Top N list of the most important tasks across the whole system, agreed with the Product Owners, with N bounded by Lead Time; mandatory Pull Requests for every change; and performance tests with a baseline that is checked against every PR.



The retrospective also included a long Q&A with the team (Site Reliability Engineering, Product Owner of Dodo IS) and notes from the Dodo IS Architect on why Dodo IS fell on 21.04.2018 and what was done about it. Among the concrete measures that came out of it: performance and load testing (including work with PerformanceLab in 2018), asynchronous order creation (SaveOrder on async-await), limiting at the nginx level, and stricter database settings: innodb_lock_wait_timeout = 5 and innodb_rollback_on_timeout = ON.