How we survived a 10x load increase while working remotely, and what conclusions we drew

Hello, Habr! For the last couple of months we have been living in a very interesting situation, and I would like to share our story of scaling the infrastructure. During this time, SberMarket grew 4x in orders and launched the service in 17 new cities. The explosive growth in demand for grocery delivery required us to scale our infrastructure. Read the most interesting and useful findings under the cat.



My name is Dima Bobylev, I am the Technical Director of SberMarket. Since this is the first post on our blog, I will say a few words about myself and about the company. Last fall I took part in a contest for young leaders of the Runet. For the contest I wrote a short essay about how we at SberMarket see our internal culture and our approach to developing the service. And although I did not win the competition, I formulated for myself the basic principles for developing an IT ecosystem.

When managing a team, it is important to understand and maintain a balance between what the business needs and what each individual developer needs. SberMarket is now growing 13x year over year, and this affects the product, requiring a constant increase in the volume and pace of development. Despite this, we give developers enough time for preliminary analysis and writing high-quality code. This approach helps not only in building a working product, but also in scaling and developing it further. As a result of this growth, SberMarket has already become a leader among grocery delivery services: we deliver about 18 thousand orders every day, while at the beginning of February there were about 3,500.


Once a customer asked a SberMarket courier to deliver groceries contactlessly: straight to the balcony

But let's get to the specifics. Over the past few months we have been actively scaling our company's infrastructure. The need was driven by both external and internal factors. Along with the expansion of the customer base, the number of connected stores grew from 90 at the beginning of the year to more than 200 by mid-May. Of course, we had prepared: we reserved the core infrastructure and counted on being able to scale all the virtual machines hosted in the Yandex cloud both vertically and horizontally. However, practice showed that "everything that can go wrong does go wrong." So today I want to share the most interesting situations that happened during these weeks. I hope our experience will be useful to you.

Slaves on full alert


Even before the pandemic began, we faced an increase in the number of requests to our backend servers. The trend of ordering groceries with home delivery was gaining momentum, and with the introduction of the first self-isolation measures due to COVID-19, the load grew dramatically before our eyes day after day. We needed to quickly unload the master servers of the main database and move some of the read queries to replica servers (slaves).

We had been preparing for this step in advance, and 2 slave servers had already been launched for such a maneuver. They mostly handled batch jobs generating information feeds for data exchange with partners. These processes created extra load and, quite rightly, had been moved "outside the brackets" a couple of months earlier.

Since the slaves were replicated, we stuck to the concept that applications may only work with them in read-only mode. Our Disaster Recovery Plan assumed that in case of a disaster we could simply put a slave in place of the master and switch all write and read requests to it. However, we also wanted to use the replicas for the needs of the analytics department, so the servers were not moved to a fully read-only status: each host had its own set of users, and some of them had write permissions for saving intermediate calculation results.

Up to a certain level of load, the master was enough for both writes and reads when handling HTTP requests. In mid-March, just as SberMarket decided to switch completely to remote work, our RPS began to grow severalfold. More and more of our customers went into self-isolation or switched to working from home, and this showed in the load figures.

The master's performance was no longer enough, so we began moving some of the heaviest read queries to a replica. To transparently route writes to the master and reads to a slave, we used the Ruby gem Octopus. We created a special user with the _readonly postfix and no write permissions. But due to a configuration error on one of the hosts, some of the write requests went to the slave server on behalf of a user who did have those rights.
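Roughly, such a split looks like this with Octopus. This is only a sketch: the Order model, the :slave1 replica name and the shards.yml layout are illustrative, not our actual configuration.

```ruby
# config/shards.yml is assumed to define a replica named :slave1
# connecting as the *_readonly user, with `replicated: true` enabled.

class Order < ApplicationRecord
  replicated_model # reads from this model go to replicas, writes stay on the master
end

# Heavy ad-hoc reads can also be pinned to a specific replica explicitly:
Octopus.using(:slave1) do
  Order.where(created_at: 1.day.ago..Time.current).count
end
```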

The problem did not show up immediately, because the increased load also increased the slave lag. The data inconsistency surfaced in the morning, when after the nightly imports the slaves did not "catch up" with the master. We attributed this to the high load on the service itself and to the imports related to launching new stores. But serving data that was hours behind was unacceptable, so we switched the processes to the second, analytical slave, because it had more resources and was not loaded with read queries (which is how we explained to ourselves its lack of replication lag).
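A lag like this is cheap to watch for continuously. Below is a minimal monitoring sketch (the host names and the dedicated monitoring user are assumptions for illustration) that polls Seconds_Behind_Master on each replica:

```ruby
require "mysql2"

# Hypothetical replica hosts; a real setup would take these from config.
REPLICAS = ["slave1.example.internal", "slave2.example.internal"]

REPLICAS.each do |host|
  client = Mysql2::Client.new(
    host: host,
    username: "monitor",                       # read-only monitoring user, assumed
    password: ENV.fetch("MONITOR_PASSWORD")
  )
  status = client.query("SHOW SLAVE STATUS").first
  lag = status && status["Seconds_Behind_Master"]
  # A nil lag means replication has stopped entirely, which is worse than a delay.
  puts format("%-28s lag: %s", host, lag.nil? ? "replication stopped" : "#{lag}s")
  client.close
end
```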

While we were figuring out the reasons for the main slave's "drift", the analytical one had already failed for the same reason. So despite having two additional servers to which we planned to shift the load in case the master crashed, an unfortunate error meant that at the critical moment there were none.

But since we had not only database dumps (restoring from them would have taken about 5 hours at that point) but also a snapshot of the master server, we managed to start a replica within 2 hours. True, after that we had to roll the replication log forward for a long time (because that process runs in single-threaded mode, but that is a whole other story).

Conclusion: if a replica is supposed to be read-only, make it truly readonly. Check the permissions of every database user on every host instead of assuming that "nobody writes there anyway"; a single misconfigured account can quietly break replication at the worst possible moment.
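A simple way to enforce that conclusion is to probe each host with the application's replica credentials and confirm that writes are actually rejected. A sketch (the host name and user are illustrative):

```ruby
require "mysql2"

client = Mysql2::Client.new(
  host: "slave1.example.internal",   # hypothetical replica host
  username: "app_readonly",          # the *_readonly application user
  password: ENV.fetch("READONLY_PASSWORD")
)

begin
  # A SELECT-only user has no CREATE TEMPORARY TABLES privilege,
  # so this probe should fail on a correctly configured replica.
  client.query("CREATE TEMPORARY TABLE __write_probe (id INT)")
  warn "WARNING: this user can write to the replica!"
rescue Mysql2::Error => e
  puts "OK, writes are rejected: #{e.message}"
end
```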

Optimizing "on the fly"


Although we constantly update the catalog on the site, the queries we sent to the slave servers tolerated a slight lag behind the master. But the time it took us to discover and fix the problem of the slaves "suddenly falling behind" exceeded the "psychological barrier" (during that time prices could have been updated, and customers would have seen stale data), and we had to switch all requests back to the main database server. As a result, the site worked slowly... but at least it worked. And while the slaves were recovering, we had no choice but to optimize.

While the slave servers were recovering, the minutes dragged on, the master remained overloaded, and we threw all our efforts into optimizing the active queries according to the Pareto rule: we picked the TOP queries producing most of the load and started tuning them, directly "on the fly."
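If MySQL's performance_schema is enabled, that shortlist of top queries can be pulled straight from the statement digests. A rough sketch of such a ranking by total execution time:

```ruby
# Rank query digests by total time spent; sum_timer_wait is in picoseconds.
top_queries = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT digest_text                        AS query_digest,
         count_star                         AS calls,
         ROUND(sum_timer_wait / 1e12, 1)    AS total_seconds
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY sum_timer_wait DESC
  LIMIT 10
SQL

top_queries.each do |row|
  puts "#{row['total_seconds']}s over #{row['calls']} calls: #{row['query_digest']&.slice(0, 120)}"
end
```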

An interesting effect: MySQL loaded to the limit responds noticeably even to small improvements. Optimizing a couple of queries that accounted for only 5% of the total load already produced a tangible drop in CPU usage. As a result, we were able to secure an acceptable reserve of resources for the master to work with the database and gain the time needed to restore the replicas.

Conclusion: even a small optimization lets you "survive" an overload for several hours. That was just enough for us while the replica servers were being restored. By the way, we will cover the technical side of query optimization in one of the following posts, so subscribe to our blog if that might be useful to you.

Organize performance monitoring of partner services


We process customer orders, so our services constantly interact with third-party APIs: gateways for sending SMS, payment platforms, routing systems, a geocoder, the Federal Tax Service, and many other systems. And when the load began to grow rapidly, we started hitting the API limits of our partner services, something we had not even thought about before.

Unexpectedly exceeding the quotas of partner services can lead to downtime of your own. Many APIs block clients that exceed the limits, and in some cases a flood of requests can overload the partner's production.
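One cheap safeguard is to throttle outgoing calls on our side before the partner does it for us. A minimal sketch (the endpoint and the limit value are made up for illustration):

```ruby
require "net/http"
require "json"

# Keeps our own request budget per partner so we slow down *before*
# the partner starts returning 429s or blocks the client entirely.
class ThrottledClient
  def initialize(base_url, rps_limit:)
    @base_url = base_url
    @min_interval = 1.0 / rps_limit
    @last_call_at = 0.0
  end

  def get(path)
    wait = @min_interval - (monotonic_now - @last_call_at)
    sleep(wait) if wait.positive?
    @last_call_at = monotonic_now

    response = Net::HTTP.get_response(URI.join(@base_url, path))
    raise "partner throttled us: #{response.code}" if response.code == "429"
    JSON.parse(response.body)
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# geocoder = ThrottledClient.new("https://geocoder.example.com", rps_limit: 10)
# geocoder.get("/v1/geocode?address=Moscow")
```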

For example, when the number of deliveries grew, the accompanying services could not cope with distributing them and calculating routes. The result: orders were being placed, but the service that builds the routes was not working. I must say our logisticians did the nearly impossible under these conditions, and the team's well-coordinated work helped compensate for the temporary service failures. But such a volume of orders cannot be processed by hand for long, and after a while we would have hit an unacceptable gap between orders and their fulfillment.

A number of organizational measures and the coordinated work of the team helped us buy time while we agreed on new conditions and waited for some partners to upgrade their services. There are also APIs that "delight" you with high resilience and outrageous tariffs under heavy traffic. For example, at first we used one well-known mapping API to determine the delivery address. At the end of the month we received a tidy bill of almost 2 million rubles. After that, we decided to replace it quickly. I will not do any advertising, but I will say that our expenses dropped significantly.

Conclusion: monitor the performance, limits, and tariffs of partner services. What looks like "their problem" very quickly becomes yours, so track your consumption of external APIs, agree on conditions in advance, and keep a fallback option ready.

When "just add more hardware" doesn't help


We are used to bottlenecks appearing in the main database or on application servers, but when scaling, trouble can come from where you least expect it. For full-text search on the site we use the Apache Solr engine. As the load grew, we noticed response times increasing and CPU load on the server reaching 100%. What could be simpler: just give the container with Solr more resources.

Instead of the expected performance gain, the server simply "died". It instantly hit 100% CPU and responded even more slowly. Initially, we had 2 cores and 2 GB of RAM. We decided to do what usually helps and gave the server 8 cores and 32 GB. Everything got much worse (exactly how and why, we will tell in a separate post).

Over several days we dug into the intricacies of this issue and achieved optimal performance with the same 8 cores and 32 GB. This configuration lets us keep increasing the load today, which is very important, because growth comes not only from customers but also from the number of connected stores: over 2 months their number has doubled.

Conclusion: standard methods like "just add more hardware" do not always work. When scaling any service, you need to understand well how it uses resources and test its behavior under the new conditions in advance.

Stateless - the key to easy horizontal scaling


In general, our team follows a well-known approach: services should be stateless and independent of the runtime environment. This allowed us to survive the load growth by simple horizontal scaling. But we had one exception: the handler for long background tasks. It handled sending emails and SMS, processing events, generating feeds, importing prices and stock levels, and processing images. It so happened that it depended on local file storage and ran as a single instance.

When the number of tasks in the handler's queue grew (which naturally happened as the number of orders grew), the performance of the host running the handler and the file storage became a limiting factor. As a result, assortment and price updates stopped, while user notifications and many other critical functions got stuck in the queue. The Ops team quickly migrated the file storage to S3-like network storage, which allowed us to spin up several powerful machines and scale the background task handler.
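In practice the switch boils down to replacing local file reads and writes with object-storage calls, so that any handler instance sees the same files. A sketch with the aws-sdk-s3 gem (the endpoint, bucket, and key names are illustrative):

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(
  endpoint: "https://storage.example-cloud.net", # S3-compatible endpoint, assumed
  region: "ru-central1",
  access_key_id: ENV.fetch("S3_ACCESS_KEY"),
  secret_access_key: ENV.fetch("S3_SECRET_KEY")
)

xml = "<offers>...</offers>" # e.g. a generated partner feed

# Instead of File.write("/var/app/feeds/partner.xml", xml) on a single host…
s3.put_object(bucket: "feeds", key: "partner.xml", body: xml)

# …any other handler instance can now read the same feed:
same_feed = s3.get_object(bucket: "feeds", key: "partner.xml").body.read
```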

Conclusion: the stateless rule must be followed for all components without exception, even when it seems that "we definitely won't hit a bottleneck here". It is better to spend a little time organizing all systems correctly than to rewrite code in a hurry and repair a service that is collapsing under overload.

7 principles for intense growth


Despite having spare capacity, during this growth we stepped on quite a few rakes. In that time the number of orders grew more than 4 times. We now deliver more than 17,000 orders a day in 62 cities and plan to expand our geography even further: in the first half of 2020 the service is expected to launch across all of Russia. To cope with the growing load, taking into account the bumps we have already collected, we formulated 7 basic principles for working under constant growth:

  1. Fix quickly, but write everything down. Every incident and mistake goes into Jira so that it can be analyzed later and not repeated. Under constant growth, problems are inevitable; the goal is not to avoid them entirely but to learn from each one.
  2. Do not count on "it will hold". Reserve capacity in advance and check how every component behaves not with a small margin, but under a multiple of the current load.
  3. Keep replicas truly read-only. Audit user permissions on every host; a single account with write access can quietly break replication at the worst possible moment.
  4. Make services stateless. Do not depend on the local file system or on a single instance: store files in S3-like storage and follow https://12factor.net. Then scaling comes down to simply adding machines.
  5. Optimize the heaviest queries first. A handful of top queries usually produce most of the load, and even a small improvement buys hours of headroom during an overload.
  6. Monitor partner services. Gateways for SMS, payments, and routing fail and throttle too, and their downtime instantly becomes your downtime.
  7. Watch API limits and tariffs. Know your quotas, agree on conditions before the traffic grows, and keep a fallback provider ready.


Not without losses, but we survived this stage, and today we try to stick to all the principles we have found, and each machine can easily increase its performance x4 to cope with any surprises.

In the following posts, we will share our experience investigating the performance degradation in Apache Solr, talk about query optimization, and explain how interaction with the Federal Tax Service helps the company save money. Subscribe to our blog so you don't miss anything, and tell us in the comments whether you ran into similar troubles as your traffic grew.
