How we helped schools switch to distance learning and coped with the load

Hello, Habr! My name is Alexey Vakhov, and I am the technical director of Uchi.ru. In mid-March, when schools began switching to distance learning, we provided teachers and students with several services for online classes. By our calculations, we had enough safety margin to withstand at most 1.5-2 times the usual load. By mid-April our traffic had grown 8 times. We had to do a lot to stay afloat. Perhaps our experience will help someone survive this crisis or the next one.

We did not expect such a surprise and were not ready for it; hardly any company in Russia, or anywhere else in the world, was. In March, activity on Uchi.ru is usually lower than in autumn because of spring and the upcoming summer vacations, and by then we are already preparing for September: building features, refactoring, making large-scale architectural changes and doing other pleasant routine work. This time, however, everything was different.

The peak number of simultaneous unique users on the site reached 240 thousand, while the previous maximum for the current school year was 30 thousand. The load grew every day, and we worked around the clock to stabilize the site.

When a load like this hits a site, as a rule everything starts to buckle: applications, services, load balancers, databases, web servers, network channels. All the bottlenecks of the infrastructure are exposed. In such conditions it is hard to diagnose problems, because symptomatically everything is glitching at once. It is easy to fix things when traffic grows smoothly and one thing breaks at a time. When the load comes in a flurry, one of the biggest problems is understanding the causes of failures.

The strategy in such conditions is to eliminate whatever hurts the site the most, then find the next most painful point, while also looking for potential problems and fixing them, and so on. Here are some of the most notable things we did to stabilize the platform.

Relied on ourselves


In a crisis like this you face the traffic as a single team. Whether you find solutions and cope with the crisis, or not, depends on your people.

There is no one in the industry who can walk in, look at your complex system, immediately do something and make everything fine. Of course, there are plenty of specialists in the world who could cope with the task given enough time. But when a fundamental solution is needed right now, you can rely only on your own team, which knows the system and its specifics. The result, and the responsibility to the business, lies with the team. External expertise is best brought in for specific, targeted tasks.

Coordinating the crisis team in a dedicated Slack channel helped us quickly get our bearings and organize the work: all issues were resolved here and now. We divided areas of responsibility between people so that they did not overlap and nobody did duplicate work. On the hardest days I had to be in touch literally around the clock.

Expanded the cloud


You cannot insure against every crisis, but it is important to stay flexible. The cloud stack gave us that flexibility and a chance to stay afloat even with such a dramatic increase in load.

At first we simply added resources to match the growing load, but at some point we ran into the quotas of our cloud provider's region. Problems also arose at the provider's level: our virtual servers were affected by neighbors whose traffic was growing too, which caused failures in our applications. This was to be expected: we depend on the provider and its infrastructure, which was itself under heavy load. We freed up resources from non-priority virtual machines for the main site and agreed with the provider on dedicated capacity.

Upgraded monitoring tools


During the crisis, alerting effectively stopped fulfilling its function: the whole team was already watching all systems around the clock, and incident management boiled down to constant work on all fronts. We also had too little data to fully diagnose the problems we ran into. For example, to monitor virtual machines we use the standard Node Exporter for Prometheus. It is good for seeing the big picture, but for a closer look at a single virtual machine we started using NetData.
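
When you just need an ad-hoc look at one machine, even a throwaway script against Node Exporter's /metrics endpoint (port 9100 by default) can help; NetData gives the same per-host detail with a real-time dashboard and per-second granularity out of the box. A minimal sketch, with a hypothetical VM address:

```python
# Quick ad-hoc check of a single VM via Node Exporter's /metrics endpoint.
# Illustrative only; metric names below are standard node_exporter metrics.
import requests

HOST = "10.0.0.5"  # hypothetical VM address


def scrape(host, metrics=("node_load1", "node_memory_MemAvailable_bytes")):
    text = requests.get(f"http://{host}:9100/metrics", timeout=5).text
    for line in text.splitlines():
        if line.startswith(metrics):
            print(line)


scrape(HOST)
```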

Optimized cache storage


Problems also arose with key-value stores. In one of the applications Redis could not cope: a single instance can only use one CPU core. So we switched to KeyDB, a Redis fork that can run in multiple threads.
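
A convenient property of this swap is that KeyDB speaks the Redis protocol, so the application code does not need to change; the difference is on the server side, where KeyDB is started with several worker threads (its server-threads setting). A rough sketch with the redis-py client, with a hypothetical endpoint:

```python
# The application side is unchanged: KeyDB is protocol-compatible with Redis,
# so the usual redis-py client keeps working against it.
import redis

# Hypothetical endpoint; only the server binary and its thread count differ.
cache = redis.Redis(host="keydb.internal", port=6379)

cache.set("lesson:42:progress", "75", ex=3600)  # same commands, same client
print(cache.get("lesson:42:progress"))
```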

To increase throughput in another application, we spun up 10 independent Redis instances. They are proxied by our service mesh, which also shards keys across them. Thanks to consistent hashing, even if one or two Redis instances go down, it does not cause problems. On top of that, they require practically no administration.
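
In our setup the sharding is done by the service mesh, but the idea behind consistent hashing is easy to show in a few lines of Python (a toy sketch, not our production proxy): keys are mapped onto a hash ring of server points, and when an instance disappears, only the keys that lived on it move to a neighbour.

```python
# Toy consistent-hash ring over several independent Redis endpoints.
# Illustration only; in our case key sharding is done by the service mesh.
import bisect
import hashlib


class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]


# Ten independent Redis instances (hypothetical addresses).
nodes = [f"redis-{n}.internal:6379" for n in range(10)]
print(Ring(nodes).node_for("user:123:session"))

# If one instance dies, only its keys are remapped; everything else stays put.
print(Ring(nodes[:-1]).node_for("user:123:session"))
```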

Expanded the network


As everyone knows, 640 KB is enough for everybody. We had always used private /24 subnets, but under quarantine we had to urgently expand them to /22. The network now accommodates four times as many servers; we hope that this time it will definitely be enough.
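
The arithmetic behind "four times as many" is easy to check with Python's standard ipaddress module (the address range below is just an example):

```python
# A /24 holds 256 addresses, a /22 holds 1024 - four times the room.
from ipaddress import ip_network

old = ip_network("10.0.0.0/24")
new = ip_network("10.0.0.0/22")

print(old.num_addresses)                      # 256
print(new.num_addresses)                      # 1024
print(new.num_addresses // old.num_addresses)  # 4
```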

Moved PgBouncer to a separate server


We use PostgreSQL everywhere as our relational database: in some places small virtual instances, in others installations of several large dedicated servers for the master and replicas. The obvious bottleneck of this architecture is the master, which ideally handles only writes and does not scale horizontally. As traffic grew, we started hitting its CPU limit.

To manage connections we use PgBouncer, which was installed on the master and on each replica. On a single port it can use no more than one CPU core, so we ran several bouncers on each server. At some point it became clear that PgBouncer itself was taking a noticeable share of CPU away from the database, and at peak load we saw load average grow rapidly and system performance drop.

We moved the bouncers to a separate server, which helped us save 20-25% of CPU on each database server.
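
For the applications the move is transparent: they keep talking to PgBouncer, only its address now points at a dedicated host instead of the database server itself. A minimal sketch with psycopg2; the host and database names are hypothetical, and 6432 is simply PgBouncer's conventional listen port:

```python
# Applications connect to PgBouncer rather than to PostgreSQL directly;
# after the move only the bouncer's host changes, while pooling settings
# (pool_mode, default_pool_size, ...) stay in pgbouncer.ini on the new host.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer.internal",  # dedicated bouncer host, not the DB server
    port=6432,                  # PgBouncer's usual listen_port
    dbname="app_db",            # hypothetical database name
    user="app",
    password="secret",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
```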

Faced with surprises


You cannot trust any single tool, especially in a crisis. Redundancy of tools, on the contrary, helps, because it gives a more objective picture. Familiar tools start to fail for all sorts of reasons. For example, to estimate the number of people on the site we usually use the real-time Google Analytics report, which is a sensitive and accurate metric. But it glitches sometimes, and this time we had to look at internal metrics such as the number of pageview events and the number of requests per second.

For centralized logging we use ELK, with a log delivery pipeline built on Apache Kafka and Elastic Filebeat. Under high load the pipeline stopped keeping up, and logs began to get lost or lag behind. We sped up log transfer and indexing by tuning the Elasticsearch indexes, increasing the number of partitions in Kafka and Filebeat, and adjusting compression at every stage. Because the log collection pipeline is separate from production, problems with the increased volume of logs had no effect on the operation of the site.
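
Adding partitions to a log topic is a one-off administrative operation. Here is a sketch of how it could look with the kafka-python admin client; the topic name, broker address and target partition count are made up, and the same thing can be done with the stock kafka-topics.sh --alter tool:

```python
# Raise the partition count of the log topic so consumers can index in parallel.
# Hypothetical topic name and broker address, for illustration only.
from kafka.admin import KafkaAdminClient, NewPartitions

admin = KafkaAdminClient(bootstrap_servers="kafka.internal:9092")
admin.create_partitions({"app-logs": NewPartitions(total_count=24)})
admin.close()
```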

Accepted the rules of the game


It is impossible to prepare for every crisis in advance, but you can try to build a flexible system from the start. For startups, and for companies that are gradually ceasing to be startups, it is not always rational to prepare for abnormal traffic growth in quiet times: the team's resources are limited, and if you spend them preparing for something that may never happen, nothing will be left for the main product. It is much more important to react correctly in the moment and not be afraid of bold decisions. As a rule, the way out of such a crisis is a step up to a qualitatively new level.

That is the kind of fun spring we have had this year. Just when it seems that everything possible has been done, it sometimes turns out that this is only the beginning.
