How we ensured the growth of Citymobil


My name is Ivan, and I'm the head of server-side development at Citymobil. Today I'll talk about what this server-side development actually is, what problems we've run into, and how we plan to develop.

Start of growth


Few people know that Citymobil has been around for quite a while: 13 years. It started out as a small company operating only in Moscow. There were very few developers, and they knew exactly how the system worked because they had built it themselves. The taxi market was only beginning to take shape back then, and the loads were tiny. We didn't even have fault tolerance and scaling on our list of tasks.

In 2018, Mail.Ru Group invested in Citymobil, and we started to grow rapidly. There was no time to rewrite the platform, or even to do any significant refactoring: we had to build out its functionality to catch up with our main competitor, and hire people quickly. When I joined the company, only 20 developers worked on the backend, and almost all of them were still on probation. So we decided to pick the low-hanging fruit: simple changes that yield a huge result.

At that point we had a PHP monolith and three services written in Go, plus a single MySQL master database that the monolith queried (the services used Redis and Elasticsearch as their storage). As the load grew, heavy queries gradually started to slow the database down.

What could we do about it?

First, we took the obvious step: we put a slave into production. But if a lot of heavy queries hit it, would it hold up? It was also obvious that with the flood of queries for analytical reports, the slave would start to lag, and a large slave lag could hurt the performance of all of Citymobil. So we set up another slave just for the analysts; its lag never causes problems in production. But even that didn't seem enough. We wrote a replicator that copies tables from MySQL to ClickHouse, and today analytics lives on its own, on a stack that is far better suited to OLAP.
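On the application side, such read routing can be as simple as keeping separate connection pools and sending heavy report queries to the analytics replica. A minimal Go sketch, assuming hypothetical DSNs and an orders table that are not our real configuration:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// Separate pools: the master for writes, a production slave for fast
// user-facing reads, and an analytics slave whose lag does not hurt prod.
var master, prodSlave, analytics *sql.DB

func mustOpen(dsn string) *sql.DB {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	return db
}

func main() {
	// Hypothetical DSNs, for illustration only.
	master = mustOpen("app:secret@tcp(mysql-master:3306)/citymobil")
	prodSlave = mustOpen("app:secret@tcp(mysql-slave:3306)/citymobil")
	analytics = mustOpen("report:secret@tcp(mysql-analytics:3306)/citymobil")

	// User-facing reads go to the production slave...
	var activeOrders int
	if err := prodSlave.QueryRow(
		"SELECT COUNT(*) FROM orders WHERE status = 'active'").Scan(&activeOrders); err != nil {
		log.Fatal(err)
	}

	// ...while heavy report queries go to the analytics replica,
	// which is allowed to lag without affecting production.
	rows, err := analytics.Query("SELECT city, COUNT(*) FROM orders GROUP BY city")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
}
```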

The system ran in this form for a while, until features appeared that were more demanding on the hardware. The number of requests grew every week: hardly a week went by without a new record. On top of that, we had planted a time bomb under the system. Previously we had a single point of failure, the MySQL master; with the slave added, there were two: the master and the slave. The failure of either machine would bring the whole system down.

To protect against this, we started using a local proxy to health-check the slaves. This allowed us to use many slaves without changing any code. We introduced regular automatic checks of each slave's state and its general metrics:

  • load average (LA);
  • slave lag;
  • port availability;
  • number of locks, and so on.

If a threshold is exceeded, the system takes the slave out of rotation. But no more than half of the slaves can be removed at once, so that the extra load on the remaining ones doesn't cause a downtime of its own. We used HAProxy as the proxy and put a migration to ProxySQL straight into the backlog. The choice may look odd, but our admins already had solid experience with HAProxy, while the problem was acute and needed a quick solution. In this way we built a fault-tolerant layer of slaves that scaled easily enough. For all its simplicity, it has never let us down.
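The checks themselves live in HAProxy, but the logic is simple enough to sketch in a few lines of Go. The thresholds, addresses, and the replicationLag helper below are made up for illustration; the one essential rule is that no more than half of the slaves are ever taken out of rotation:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

type slave struct {
	addr    string
	healthy bool
}

// replicationLag is a stand-in for a real check, e.g. reading
// Seconds_Behind_Master from SHOW SLAVE STATUS or a heartbeat table.
func replicationLag(addr string) time.Duration {
	return 0 // placeholder
}

// portAlive checks that the MySQL port accepts connections.
func portAlive(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func checkSlaves(slaves []*slave, maxLag time.Duration) {
	removed := 0
	for _, s := range slaves {
		bad := !portAlive(s.addr) || replicationLag(s.addr) > maxLag
		// Never take more than half of the slaves out of rotation,
		// otherwise the survivors get overloaded and cause a downtime.
		if bad && removed < len(slaves)/2 {
			s.healthy = false
			removed++
		} else {
			s.healthy = true
		}
	}
}

func main() {
	slaves := []*slave{{addr: "10.0.0.11:3306"}, {addr: "10.0.0.12:3306"}, {addr: "10.0.0.13:3306"}}
	checkSlaves(slaves, 5*time.Second)
	for _, s := range slaves {
		fmt.Println(s.addr, "healthy:", s.healthy)
	}
}
```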

Further growth


As the business grew, we found another bottleneck in the system. When external conditions changed, for example when heavy rain started in a large region, the number of taxi orders shot up. In such situations drivers couldn't react quickly enough, and a shortage of cars arose. While orders were being dispatched, they generated a looping read load on the MySQL slaves.

We found a good solution: Tarantool. Rewriting the system for it would have been hard, so we solved the problem differently: we used the mysql-tarantool-replication tool to replicate some of the tables from MySQL to Tarantool. All the read queries that spike during a car shortage were redirected to Tarantool, and since then thunderstorms and hurricanes no longer worry us. We solved the point-of-failure problem even more simply: we set up several replicas right away and access them through HAProxy with health checks. Each Tarantool instance is replicated by its own replicator. As a nice bonus, this also solved the lagging-slave problem in this part of the code: replication from MySQL to Tarantool is much faster than from MySQL to MySQL.
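For illustration, a read from one of those Tarantool replicas might look roughly like this with the go-tarantool client. The address, credentials, and the drivers space are assumptions; in production the replica is chosen via HAProxy health checks, not hard-coded:

```go
package main

import (
	"log"
	"time"

	"github.com/tarantool/go-tarantool" // v1 client API
)

func main() {
	// Hypothetical address of a Tarantool replica (in practice we go
	// through HAProxy with health checks instead of a fixed address).
	conn, err := tarantool.Connect("tarantool-replica:3301", tarantool.Opts{
		User:    "reader",
		Pass:    "secret",
		Timeout: 500 * time.Millisecond,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Read a record by id from a hypothetical "drivers" space that is
	// filled by mysql-tarantool-replication.
	resp, err := conn.Select("drivers", "primary", 0, 1, tarantool.IterEq, []interface{}{uint(42)})
	if err != nil {
		log.Fatal(err)
	}
	log.Println(resp.Data)
}
```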

However, the master database was still a single point of failure and did not scale for write operations. Here is how we set about solving that.

First, by that time we had already started actively building new services (for example antifraud, which my colleagues have already written about), and those services needed scalable storage from day one. For Redis we use only Redis Cluster, and for Tarantool, Vshard. Where we use MySQL, new logic goes to Vitess. These databases are sharded out of the box, so writes are almost never a problem, and if one does appear, it is easy to fix by adding servers. For now we use Vitess only for non-critical services while we study its pitfalls, but in the future it will back all our MySQL databases.
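Connecting to a sharded store from Go looks almost the same as connecting to a single node. Here is a sketch with the go-redis cluster client (the addresses and keys are hypothetical); Vitess is similar in spirit, since vtgate speaks the MySQL protocol and the regular database/sql driver keeps working:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// The client discovers the shard layout itself; we only list seed nodes.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-1:6379", "redis-2:6379", "redis-3:6379"},
	})
	defer rdb.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	// Keys are routed to the right shard automatically.
	if err := rdb.Set(ctx, "driver:42:status", "free", time.Minute).Err(); err != nil {
		log.Fatal(err)
	}
	status, err := rdb.Get(ctx, "driver:42:status").Result()
	if err != nil {
		log.Fatal(err)
	}
	log.Println(status)
}
```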

Second, since introducing Vitess for the already existing logic would have been long and hard, we took a simpler, if less universal, path: we started spreading the master database across different servers, table by table. We were very lucky: it turned out that most of the write load is generated by tables that are not critical for the core functionality, so when we move such tables out, we don't create additional points of business failure. Our main enemy was the tight coupling of tables in the code through JOINs (some queries joined 50-60 tables). We cut them mercilessly.
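A simplified sketch of what cutting a JOIN looks like once the tables live on different servers. The table names, columns, and the two connection handles are illustrative, not our real schema:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// ordersDB and driversDB point at different MySQL servers after the
// split, so a single JOIN is no longer possible.
func driverNameForOrder(ordersDB, driversDB *sql.DB, orderID int64) (string, error) {
	// Before the split this was one query:
	//   SELECT d.name FROM orders o JOIN drivers d ON d.id = o.driver_id WHERE o.id = ?
	var driverID int64
	if err := ordersDB.QueryRow(
		"SELECT driver_id FROM orders WHERE id = ?", orderID).Scan(&driverID); err != nil {
		return "", err
	}

	var name string
	if err := driversDB.QueryRow(
		"SELECT name FROM drivers WHERE id = ?", driverID).Scan(&name); err != nil {
		return "", err
	}
	return name, nil
}

func main() {
	ordersDB, err := sql.Open("mysql", "app:secret@tcp(mysql-orders:3306)/citymobil")
	if err != nil {
		log.Fatal(err)
	}
	driversDB, err := sql.Open("mysql", "app:secret@tcp(mysql-drivers:3306)/citymobil")
	if err != nil {
		log.Fatal(err)
	}
	name, err := driverNameForOrder(ordersDB, driversDB, 123)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(name)
}
```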

Now it's time to recall two very important patterns for designing high-load systems:

  • Graceful degradation. If some non-critical part of the system is unavailable, we degrade only the feature that depends on it: the user may lose some secondary functionality, but they must still be able to order a taxi.
  • Circuit breaker. If a resource keeps failing, there is no point in bombarding it with more requests. What happens otherwise? Requests hang waiting for the broken resource (which, by itself, graceful degradation could have handled), PHP-FPM workers pile up behind it, the worker pool runs out, and the whole site goes down because of a single failed dependency. A circuit breaker cuts off calls to the failing resource for a while and then periodically lets requests through again to check whether it has recovered (see the sketch right after this list).
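A minimal circuit breaker sketch in Go, just to make the idea concrete. The threshold, cooldown, and structure are arbitrary assumptions rather than our production implementation:

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// breaker cuts off calls to a failing resource: after too many
// consecutive failures it "opens" and fails fast, and after a cooldown
// it lets calls through again to see whether the resource recovered.
type breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
}

var errOpen = errors.New("circuit open: resource temporarily cut off")

func (b *breaker) Call(f func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errOpen // fail fast instead of hanging a worker
	}
	b.mu.Unlock()

	err := f()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	b.failures = 0 // the resource answered, close the circuit
	return nil
}

func main() {
	b := &breaker{maxFailures: 5, cooldown: 10 * time.Second}
	_ = b.Call(func() error {
		// Here would be the actual call to the flaky resource.
		return nil
	})
}
```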

So we had started to scale, after a fashion, but points of failure still remained.

Then we decided to move to semi-synchronous replication (and successfully rolled it out). What makes it special? With ordinary asynchronous replication, if a cleaner in the data center pours a bucket of water over the master server, the latest transactions won't have had time to replicate to the slaves and will be lost. And we have to be sure that in such a case we won't run into serious trouble after one of the slaves becomes the new master. So we decided not to lose transactions at all, and for that we use semi-synchronous replication. The slaves can still lag, but even if the master database server is destroyed, information about every transaction will be stored on at least one slave.

That was the first step towards success. The second was the orchestrator utility, combined with constant monitoring of every MySQL instance in the system. If the master database fails, the automation promotes the most up-to-date slave (which, thanks to semi-synchronous replication, contains all transactions) and switches the entire write load to it. So now we can survive the story with the cleaner and the bucket of water.

What's next?

When I joined Citymobil, we had three services and a monolith. Today there are more than 20 services. The main thing holding back our growth is that we still have a single master database. We are heroically fighting it, splitting it into separate databases.

How do we develop further?


Expand the set of microservices. This will solve many of the problems we face today. For example, there is no longer a single person on the team who knows how the entire system works. And since, due to the rapid growth, we don't always have up-to-date documentation (and it is very hard to keep it current), it is difficult for newcomers to get up to speed. Once the system consists of many services, writing documentation for each of them will be incomparably easier, and the amount of code you have to study at any one time shrinks dramatically.

Move to Go. I really love this language, but I always believed that rewriting working code from one language to another is impractical. Recently, though, experience has shown that PHP libraries, even the standard and most popular ones, are not of the highest quality; we end up patching many of them. For example, the SRE team patched the standard library for working with RabbitMQ: it turned out that something as basic as a timeout simply didn't work. And the deeper the SRE team and I dig into these problems, the clearer it becomes that in PHP few people think about timeouts, few people care about testing libraries, few people think about locks. Why does this become a problem for us? Because Go solutions are much easier to maintain.
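In Go, by contrast, a deadline is part of the normal calling convention rather than something you bolt on. A typical outbound call looks roughly like this (the URL and the 300 ms budget are arbitrary examples):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	// Every external call gets an explicit deadline; if the dependency
	// hangs, we fail fast instead of tying up a worker.
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://driver-service.local/health", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err) // includes "context deadline exceeded"
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```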

What else do I like about Go? Oddly enough, it is very easy to write. Besides, Go makes it easy to build all kinds of platform solutions. The language ships with a very powerful set of standard tools: if our backend suddenly starts to slow down, you can open a specific URL and see all the statistics, including a memory allocation profile, and figure out where the process is stalling. In PHP, pinpointing performance problems is harder.
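That "specific URL" is the standard net/http/pprof tooling: import the package for its side effects and the profiling endpoints appear under /debug/pprof. A minimal sketch:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiler on a separate internal port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service would run here ...
	// Heap, goroutine, CPU and block profiles are now available at
	// http://localhost:6060/debug/pprof/ (e.g. /debug/pprof/heap).
	select {}
}
```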

Go also has very good linters, programs that automatically find the most common mistakes for you. Most of these mistakes are described in the article "50 Shades of Go," and the linters catch them perfectly.

Keep sharding the databases. We will move all services to Vitess.

Migrate the PHP monolith to Redis Cluster. In our services, Redis Cluster has proved itself excellently. Unfortunately, adopting it in PHP is harder: the monolith uses commands that Redis Cluster does not support (which is actually a good thing, since such commands bring more problems than benefit).

Investigate the problems with RabbitMQ. There is an opinion that RabbitMQ is not the most reliable software. We will look into this, find the problems, and fix them. Perhaps we will consider switching to Kafka or Tarantool.
