Operating a Large Distributed System: What I Learned



Reading various channels and newsletters, I often come across articles about the specific pains and problems that appear when a company grows and reliability and scalability come to the fore. This article is different. There is no detailed analysis of specific architectural decisions or a step-by-step guide to changing engineering culture. Rather, it is a high-level view of the challenges that arise when operating distributed systems, and a starting point to help you navigate the flow of terms, abbreviations, and technologies.

What follows is a translation of an article written by an engineer at Uber.

* * *

For the past few years, I have been building and operating a large distributed payment system at Uber. During this time, I learned a lot about distributed architecture concepts and found out from my own experience how difficult it is to build and operate highly loaded, highly available systems. Building such a system is interesting work. I like planning how the system will handle 10-100x traffic growth and ensuring data durability regardless of hardware failures. However, operating a large distributed system gave me experience I did not expect.

The larger the system, the more likely you are to encounter a manifestation of Murphy's law: "Everything that can go wrong will go wrong." The probability is especially high with frequent releases, when many developers roll out code, when several data centers are used, and with a huge audience of users around the world. Over the past few years I have come across a variety of system failures, many of which surprised me: from predictable problems like hardware crashes or innocent-looking bugs, to severed cables connecting data centers, to numerous cascading failures happening at the same time. I went through dozens of outages in which parts of the system did not work correctly, which significantly affected the business.

This article summarizes the techniques I have found useful while operating a large system at Uber. My experience is not unique: others work with systems of the same size and go through the same adventures. I have talked with engineers from Google, Facebook and Netflix who faced similar situations and arrived at similar solutions. Many of the ideas and processes described here can be applied to systems of a similar scale, regardless of whether they run in company-owned data centers (as is most often the case at Uber) or in the cloud (where Uber sometimes scales out to). However, these tips may be overkill for systems that are smaller or less critical to the business.

We will cover the following topics:

  • Monitoring
  • On-call duty, anomaly detection and alerting.
  • Failures and incident management.
  • Post Mortems, incident analysis and a culture of continuous improvement.
  • Failover, resource planning and blackbox testing.
  • SLO, SLA and reporting on them.
  • SRE as an independent team.
  • Reliability as a permanent investment.
  • Useful materials.

Monitoring


To understand whether a system is healthy, we need to answer the question: "Is it working correctly?" To do that, it is vital to collect data about the critical parts of the system. And when you have distributed systems consisting of various services running on numerous servers in different data centers, it can be hard to decide which key things really need to be tracked.

Infrastructure status monitoring. If one or more machines or virtual machines are overloaded, some parts of the distributed system may degrade in performance. The health metrics of the machines a service runs on, such as CPU and memory consumption, are the basic parameters worth monitoring. There are platforms that track these metrics out of the box and automatically scale instances. Uber has a great core infrastructure team that provides monitoring and alerting by default. Regardless of how it is implemented, you must know when instances, infrastructure or individual services have problems.

Service status monitoring: traffic, errors, latency. You often need to answer the question "Is the backend working fine?". Monitoring metrics such as the amount of incoming traffic, the proportion of errors and the response latency gives valuable information about the health of a service. I prefer to have dashboards for all of these metrics. When creating a new service, using correct HTTP responses and monitoring the corresponding codes can tell you a lot about the state of the system. If you make sure that 4xx codes are returned for client errors and 5xx codes for server errors, such monitoring is easy to build and interpret.
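To make this concrete, here is a minimal sketch (in Python, not Uber's actual tooling) of how a batch of HTTP status codes pulled from access logs over a fixed window could be turned into the traffic and error-rate numbers described above; the function name and the one-minute window are assumptions for illustration.

```python
from collections import Counter

def service_health(status_codes):
    """Summarize traffic and error share from a batch of HTTP status codes.

    `status_codes` is any iterable of integer response codes collected
    over a fixed window (e.g. one minute of load-balancer logs).
    """
    counts = Counter(code // 100 for code in status_codes)  # group by class: 2xx, 4xx, 5xx
    total = sum(counts.values())
    client_errors = counts.get(4, 0)   # 4xx: the caller did something wrong
    server_errors = counts.get(5, 0)   # 5xx: our backend misbehaved
    return {
        "requests": total,
        "client_error_rate": client_errors / total if total else 0.0,
        "server_error_rate": server_errors / total if total else 0.0,
    }

# Example: one minute of responses; the 5xx share is the number that should page someone.
print(service_health([200, 200, 201, 404, 200, 500, 200, 503]))
```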

There is something else to be said about monitoring response latency. The goal of production services is for most end users to enjoy using them. Measuring average latency is not the best approach, because the average can hide a small number of requests with very high latency. It is much better to measure p95, p99 or p999: the latency for the 95th, 99th or 99.9th percentile of requests. These numbers help answer questions like "How fast will the request be for 99% of visitors?" (p99) or "How slow will the request be for one visitor out of 1000?" (p999). If you are interested in the details, you can read this article.


A graph of p95 and p99 latency. Note that the average latency for this endpoint is under 1 second, while 1% of requests take 2 seconds or more: this is invisible when measuring only the average.
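As a rough illustration of why percentiles beat averages, here is a small self-contained sketch that computes p95/p99/p999 with the nearest-rank method on synthetic latencies; in practice these numbers come from your metrics backend rather than hand-rolled code, and the sample distribution below is made up.

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the latency that p% of requests fall under."""
    ordered = sorted(latencies_ms)
    # Index of the smallest value that covers p percent of the samples.
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

# 1000 synthetic request latencies: most are fast, a small tail is very slow.
samples = [80] * 940 + [900] * 50 + [2500] * 10

print("mean ", sum(samples) / len(samples))   # ~145 ms, hides the slow tail
print("p95  ", percentile(samples, 95))       # 900 ms
print("p99  ", percentile(samples, 99))       # 900 ms
print("p999 ", percentile(samples, 99.9))     # 2500 ms
```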

The topic of monitoring and observability goes much deeper. I recommend reading the Google SRE book and the Four Golden Signals section on monitoring distributed systems. If you can collect only four metrics for a user-facing system, focus on traffic, errors, latency and saturation. A shorter read is the Distributed Systems Observability e-book, which covers useful tools such as logs, as well as best practices for using metrics and tracing.

Business metrics monitoring. Service monitoring tells us about the overall health of services. But from monitoring data alone we cannot tell whether the services are doing their job, whether everything is being processed correctly from a business point of view. In the case of the payment system, the key question is: "Can people take trips using a specific payment method?". One of the most important steps in monitoring is identifying and tracking the business events that a given service creates and processes.

My team built business metrics monitoring after an outage that we could not detect in any other way. Everything looked as if the services were functioning normally, but in fact the key functionality did not work. This kind of monitoring was very unusual for our company and domain, so we had to put in a lot of effort to tailor it to our needs by building our own system.
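The sketch below illustrates the general idea of such a business metric monitor, assuming a hypothetical counter of completed payments per interval and a simple "drop below half of the recent baseline" rule; it is not the system my team built, just the shape of the check.

```python
from collections import deque

class BusinessMetricMonitor:
    """Flag when a business event (e.g. 'payment completed') nearly stops,
    even though service-level metrics still look healthy."""

    def __init__(self, window=12, drop_ratio=0.5):
        self.history = deque(maxlen=window)  # counts from previous intervals
        self.drop_ratio = drop_ratio         # alert if below 50% of the baseline

    def record(self, events_in_interval):
        alert = False
        if self.history:
            baseline = sum(self.history) / len(self.history)
            alert = events_in_interval < baseline * self.drop_ratio
        self.history.append(events_in_interval)
        return alert

# Completed trips paid with one payment method, counted every 5 minutes.
monitor = BusinessMetricMonitor()
for count in [420, 435, 410, 428, 90]:   # the last interval collapses
    if monitor.record(count):
        print("ALERT: business metric dropped to", count)
```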

On-Call Duty, Anomaly Detection, and Alerting


Monitoring is a great tool for checking the current state of the system. But it is only a step on the way to automatically detecting failures and alerting the people who should do something about them.

On-call duty is a broad topic. Increment magazine did an excellent job covering many of its aspects in the On-Call issue. In my opinion, being on call is a logical continuation of the "you build it, you own it" approach. Services are owned by the teams that built them, and those teams also own alerting and incident response. My team owned alerting and incident response for the payment service we created. When an alert arrives, the on-call engineer must respond and find out what is happening. But how do we get from monitoring to alerts?

Detecting anomalies from monitoring data is a difficult task, and this is where machine learning can really shine. There are many third-party anomaly detection services. Fortunately for my team, our company has its own machine learning group working on Uber's problems. The New York team wrote a useful article on how anomaly detection works at Uber. From my team's point of view, we simply feed our monitoring data into this pipeline and receive alerts with varying degrees of confidence. Then we decide whether an engineer should be notified.

When should an alert be sent? An interesting question. If there are too few alerts, we may miss an important outage. If there are too many, people will not sleep at night. Tracking and categorizing alerts, as well as measuring the signal-to-noise ratio, plays a big role in tuning an alerting system. A good step towards a sustainable on-call rotation is to analyze alerts, categorize them as "actionable" or "non-actionable", and then reduce the number of alerts that require no action.


An example of the on-call dashboard built by the Uber Developer Experience team in Vilnius.

The Uber developer tooling team in Vilnius built a small tool that we use to annotate alerts and visualize on-call shifts. Our team generates a weekly report on the last on-call shift, analyzes pain points and keeps improving the on-call process.
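As a sketch of the alert categorization and signal-to-noise measurement described above, the snippet below tallies a week of hand-labelled alerts; the alert names and labels are made up for illustration, and in practice the labelling happens in the shift-review tooling rather than in code.

```python
from collections import Counter

# A week of on-call alerts, each labelled during the shift review as
# actionable (someone had to intervene) or noise (nothing needed doing).
alerts = [
    {"name": "payments_5xx_rate_high", "actionable": True},
    {"name": "disk_usage_80_percent",  "actionable": False},
    {"name": "p99_latency_breach",     "actionable": True},
    {"name": "disk_usage_80_percent",  "actionable": False},
    {"name": "flaky_healthcheck",      "actionable": False},
]

actionable = sum(a["actionable"] for a in alerts)
noise = len(alerts) - actionable
print(f"signal-to-noise: {actionable}:{noise}")

# The noisiest non-actionable alerts are the first candidates for tuning or removal.
noisy = Counter(a["name"] for a in alerts if not a["actionable"])
for name, count in noisy.most_common():
    print(name, "fired", count, "times without needing action")
```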

Failures and incident management


Imagine: you are the on-call engineer this week. In the middle of the night you are woken up by a page. You check whether there is a production outage. Damn, it looks like part of the system is down. Now what? Monitoring and alerting have just done their job.

Outages may not be a big problem for small systems, where the on-call engineer can understand what happened and why. In such systems you can usually identify the cause quickly and fix it. But in complex systems built from many (micro)services, where many engineers ship code to production, pinpointing the cause of a failure is hard enough on its own. Following a few standard procedures helps enormously.

The first line of defense is runbooks: standard response procedures that describe simple troubleshooting steps. If the company has such runbooks and actively maintains them, a shallow understanding of the system on the part of the on-call engineer is unlikely to be a problem. Runbooks need to be kept up to date, revised and extended with new ways of handling problems.

Communicating outages to other teams becomes very important when several teams are involved in operating a service. In the environment I work in, services are rolled out by thousands of engineers as needed, at a rate of hundreds of releases per hour. A seemingly innocuous rollout of one service can affect another. In such a situation, standardized incident reporting and communication channels play an important role. I have had many cases where an alert looked like nothing I had seen before, and I realized it looked just as strange to other teams. Communicating in a shared outage chat, we identify the service responsible for the failure and quickly mitigate the consequences. Together we manage to do this much faster than any of us could individually.

Mitigate now, investigate tomorrow. In the midst of an outage, I was often overwhelmed by an adrenaline-fueled desire to fix the underlying mistake. Often the cause of the problem was a rollout of bad code with an obvious bug. Earlier I would have dropped everything, started fixing the bug, shipped the fix and rolled back the broken code. But fixing the root cause in the middle of an outage is a terrible idea. A quick fix gains little and can lose a lot. Since the fix has to be made quickly, it ends up being tested in production. And that is a path to a new bug, or a new outage, on top of the existing one. I have seen this cause a whole series of failures. Just focus on mitigation and do not be tempted to fix the code or look for the root cause. The investigation can wait until tomorrow.

Post Mortems, incident analysis and a culture of continuous improvement


The post-incident review is an important indicator of how a team handles the aftermath of an outage. Do people care? Do they do a cursory investigation, or do they put an impressive amount of effort into follow-up, pausing product work to make the fixes?

A well-written post-mortem is one of the key building blocks of resilient systems. It is blameless and does not look for culprits; it is a thoughtful study and analysis of the incident. Over time, our post-mortem templates evolved: sections appeared with a summary, an impact assessment, a timeline of events, root cause analysis, lessons learned and a detailed list of follow-up items.


The post-mortem template I used at Uber.

In a good post-mortem, the cause of the failure is investigated thoroughly and measures are proposed to prevent, detect or quickly mitigate similar failures. And when I say "thoroughly", I mean that the authors do not stop at the fact that the cause was a rollout of buggy code that the reviewer did not notice. They should apply the Five Whys technique to arrive at a more useful conclusion. For instance:

  • Why did the problem happen? -> A bug was rolled out as part of a code change.
  • Why did nobody catch the bug? -> The code reviewer did not notice that the change could lead to such a problem.
  • Why do we rely only on a reviewer to catch this kind of bug? -> Because there is no automated test for this use case.
  • Why is there no automated test for this use case? -> Because it is hard to test without test accounts.
  • Why don't we have test accounts? -> Because the system does not support them yet.
  • Conclusion: the problem points to a systemic lack of test accounts. It is worth adding test account support to the system and then covering similar scenarios with automated tests.

Incident analysis is an important companion to post-mortems. While some teams thoroughly work through their outages, others can benefit from additional data and make preventive improvements. It is also important that teams hold themselves accountable and are empowered to make the system-level improvements they propose.

In organizations that take reliability seriously, the most serious incidents are analyzed and followed up on by experienced engineers. Company-level engineering management also needs to make sure the fixes actually get done, especially when they are time consuming or block other work. A reliable system cannot be built overnight: it takes constant iteration. Iteration that comes from a company culture of continuous improvement based on lessons learned from incidents.

Failover, resource planning and blackbox testing


There are several regular activities that require significant investment but are critical to keeping a large distributed system running. I first came across these practices at Uber; at previous companies I did not need them because of their smaller scale and less mature infrastructure.

Data center failover drills. I considered them pointless until I encountered one myself. Initially I believed that designing a robust distributed system was exactly about being resilient to a data center going down. Why test it regularly if everything should work in theory? The answer has to do with scaling, and with verifying that services can efficiently handle a sudden increase in traffic in the remaining data center.

The most common failure scenario I have come across is that a service does not have enough capacity in the remaining data center to handle global traffic in the event of a failover. Suppose service A runs in one data center and service B in another. Let resource utilization be 60%: tens or hundreds of virtual machines run in each data center, and alerts fire at a 70% threshold. Now all traffic fails over from data center A to data center B. The second data center cannot cope with such an increase in load without deploying new machines. Deploying them can take a long time, so requests begin to queue up and get dropped. The backlog starts to affect other services, causing a cascading failure of systems that had nothing to do with the original failover.


A possible situation in which a data center failover leads to problems.
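A back-of-the-envelope check like the one below captures the arithmetic of this scenario; the numbers and the single-capacity model are illustrative assumptions, and real capacity planning is per-service and far more detailed.

```python
def can_absorb_failover(load_a, load_b, capacity_per_dc, max_utilization=0.7):
    """Return whether one data center can take the other's traffic.

    load_a, load_b  -- current traffic handled by each data center (requests/sec)
    capacity_per_dc -- what one data center can serve with its current instances
    max_utilization -- the alert threshold we do not want to cross after failover
    """
    combined = load_a + load_b
    return combined <= capacity_per_dc * max_utilization

# Each DC runs at 60% of a 10,000 req/s capacity, i.e. 6,000 req/s each.
print(can_absorb_failover(6000, 6000, 10000))        # False: 12,000 > 7,000
print(can_absorb_failover(6000, 6000, 10000, 1.0))   # False even at 100% utilization
print(can_absorb_failover(6000, 6000, 18000))        # True: enough headroom provisioned
```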

Another popular failure scenario involves problems at the routing level, network bandwidth problems or back pressure. A data center failover is a drill that any reliable distributed system must be able to perform without any user impact. I emphasize "must": this drill is one of the most useful exercises for checking the reliability of distributed web systems.

Scheduled service downtime exercises are a great way to test the resilience of the whole system. They are also a great tool for discovering hidden dependencies or inappropriate/unexpected uses of a particular system. Such exercises are relatively easy to run for services that clients interact with and that have few known dependencies. However, for critical systems that tolerate very little downtime or that many other systems depend on, they are much harder to carry out. But what happens if such a system does become unavailable one day? It is better to answer this question in a controlled experiment, with all teams warned and ready.

Blackbox testing is a way of verifying the correctness of the system in conditions as close as possible to how an end user interacts with it. It is similar to end-to-end testing, except that for most products proper blackbox testing requires a separate investment. Good candidates for such tests are the key user flows and scenarios involving user interaction: to be able to test the system at any time, make sure these scenarios can be triggered at any time.

Using Uber as an example, an obvious blackbox test is checking the driver-passenger flow at the city level: can a passenger in a specific city request a ride, get matched with a driver and take a trip? Once this scenario is automated, the test can be run regularly, emulating different cities. A reliable blackbox testing system makes it easy to verify that the system, or parts of it, work correctly. It also helps a lot with failover drills: the fastest way to get feedback on a failover is to run the blackbox tests.


An example of blackbox test runs during a data center failover and manual rollback.
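The sketch below shows the general shape of such a blackbox check in Python: drive a key user flow through public endpoints on a schedule and report failures. The URL, paths, payload fields and the "blackbox-test-account" rider are hypothetical, not Uber's API.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.example.com"   # hypothetical public endpoint

def request_json(path, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def blackbox_check_city(city):
    """End-to-end check: a test rider in `city` should be matched with a driver."""
    start = time.time()
    trip = request_json("/v1/trips", {"city": city, "rider": "blackbox-test-account"})
    matched = trip.get("status") == "driver_matched"
    return matched, time.time() - start

# Run the same scenario against several cities; feed failures into alerting.
for city in ["vilnius", "amsterdam", "san-francisco"]:
    try:
        ok, seconds = blackbox_check_city(city)
        print(city, "OK" if ok else "FAILED", f"{seconds:.1f}s")
    except Exception as exc:          # network errors also count as failures
        print(city, "FAILED:", exc)
```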

Resource planning plays a particularly important role for large distributed systems. By large, I mean systems whose compute and storage costs run into tens or hundreds of thousands of dollars per month. At this scale it may be cheaper to have a fixed number of deployments than to use self-scaling cloud solutions. At a minimum, fixed deployments should handle "business as usual" traffic, with automatic scaling reserved for peak loads. But what is the minimum number of instances needed next month? In the next three months? Next year?

It is not hard to predict future traffic patterns for mature systems with plenty of historical data. This matters for budgeting, choosing vendors and locking in discounts from cloud providers. If your service generates large bills and you have not thought about resource planning, you are missing an easy way to reduce and control costs.
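As a hedged illustration of this kind of forecasting, the sketch below projects a fixed fleet size from an assumed monthly growth rate, per-instance capacity and headroom; all three numbers are made-up inputs you would replace with your own statistics.

```python
import math

def instances_needed(current_peak_rps, monthly_growth, per_instance_rps,
                     months_ahead, headroom=0.3):
    """Project the fixed fleet size needed `months_ahead` from now."""
    projected_peak = current_peak_rps * (1 + monthly_growth) ** months_ahead
    usable_per_instance = per_instance_rps * (1 - headroom)  # keep 30% spare capacity
    return math.ceil(projected_peak / usable_per_instance)

# Peak of 20,000 req/s today, growing ~8% a month, 500 req/s per instance.
for months in (1, 3, 12):
    print(months, "months:", instances_needed(20000, 0.08, 500, months), "instances")
```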

SLO, SLA and reporting on them


SLO stands for Service Level Objective: a target for a system's service level. It is good practice to set service-level SLOs for capacity, response time, accuracy and availability. These SLOs can then be used as alert thresholds. An example:

SLO metric    | Subcategory                   | Value for the service
Performance   | Minimum throughput            | 500 requests per second
Performance   | Expected maximum throughput   | 2,500 requests per second
Latency       | Expected median response time | 50-90 ms
Latency       | Expected p99 response time    | 500-800 ms
Accuracy      | Maximum error rate            | 0.5%
Availability  | Guaranteed uptime             | 99.9%
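One way to turn a table like this into alerting is to evaluate measured values against each objective every reporting interval; the sketch below does exactly that with hypothetical measurements and a simplified "min/max threshold" model of the table.

```python
# The SLOs from the table above, expressed as thresholds that monitoring
# can evaluate every interval. The measured values here are made up.
slos = {
    "min_throughput_rps": ("min", 500),
    "p99_latency_ms":     ("max", 800),
    "error_rate_percent": ("max", 0.5),
}

measured = {
    "min_throughput_rps": 620,
    "p99_latency_ms":     910,    # breaches the 800 ms objective
    "error_rate_percent": 0.2,
}

for metric, (kind, target) in slos.items():
    value = measured[metric]
    breached = value < target if kind == "min" else value > target
    status = "BREACH" if breached else "ok"
    print(f"{metric}: {value} (target {kind} {target}) -> {status}")
```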

Business-level SLOs, or functional SLOs, are an abstraction over services. They cover user-facing or business metrics. For example, a business-level SLO might be: 99.99% of receipts are expected to be emailed within 1 minute of completing a trip. This SLO can map onto service-level SLOs (for example, the latency of the payment system and of the receipt-mailing system), or it can be measured separately.

SLA stands for Service Level Agreement. It is a broader agreement between a service provider and its consumers. Typically, multiple SLOs make up an SLA. For example, 99.99% availability of the payment system may be an SLA, broken down into specific SLOs for each underlying system.

Once SLOs are defined, you need to measure them and report on them. Automated monitoring and reporting on SLAs and SLOs is often a complex project that neither engineers nor the business want to prioritize. Engineers are not interested because they already have various levels of monitoring to detect failures in real time. The business would rather prioritize delivering features than invest in a complex project with no immediate payoff.
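As a small example of the kind of report such a project produces, the sketch below computes monthly availability against a 99.9% SLO and the remaining error budget; the downtime figure is invented.

```python
def availability_report(total_minutes, downtime_minutes, slo=99.9):
    """Monthly availability versus an SLO, plus the remaining error budget."""
    availability = 100.0 * (total_minutes - downtime_minutes) / total_minutes
    budget_minutes = total_minutes * (100.0 - slo) / 100.0   # allowed downtime
    return availability, budget_minutes - downtime_minutes

# A 30-day month with 25 minutes of downtime against a 99.9% SLO.
availability, budget_left = availability_report(30 * 24 * 60, 25)
print(f"availability: {availability:.3f}%  error budget left: {budget_left:.0f} min")
```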

And this brings us to another topic: sooner or later, organizations that operate large distributed systems need to dedicate people to the reliability of those systems. Let's talk about the SRE team: Site Reliability Engineering.

SRE as an independent team


The term Site Reliability Engineering was coined at Google around 2003, and today Google has more than 1,500 SRE engineers. As operating the production environment becomes an increasingly complex task requiring more and more automation, it soon becomes a full-time job. It happens gradually: companies realize that engineers are spending nearly the whole day automating production operations; the more critical the system and the more failures occur, the sooner SRE becomes a dedicated role.

Fast-growing technology companies often set up an SRE team early on, with the team defining its own roadmap. At Uber, such a team was created in 2015; its goal was to manage the complexity of the system. In other companies, creating an SRE team may coincide with creating a separate infrastructure team. When a company grows to the point where ensuring service reliability requires the attention of a significant number of engineers, it is time to create a dedicated team.

An SRE team makes operating large distributed systems much easier for all engineers. The SRE team probably owns the standard monitoring and alerting tools. They probably buy or build on-call tooling and are happy to share their expertise. They can facilitate incident reviews and build systems that make it easier to detect failures, mitigate their impact and prevent them in the future. The SRE team certainly facilitates failover drills, often drives blackbox testing and capacity planning, and oversees the selection, customization or creation of standard tools for defining and measuring SLOs and reporting on them.

Since every company has its own problems for which it hires SREs, such teams are structured differently in different companies. Even the names vary: it may be called operations, platform engineering or infrastructure. Google has published two must-read books on service reliability; they are freely available and are an excellent resource for a deeper dive into SRE.

Reliability as a permanent investment


Building the first version of any product is only the beginning. After that come new iterations with new features. If the product is successful and brings profit, the work continues.

Distributed systems have the same life cycle, except that they need investment not only in new features but also in keeping up with scale. As the load on the system grows, you have to store more data, more engineers work on the system, and you constantly have to keep it running properly. Many of those building distributed systems for the first time think of them as something like a machine: build it once, then a bit of maintenance every few months is enough. It is hard to come up with a comparison further from reality.

I like to compare operating a distributed system to running a large organization, such as a hospital. For a hospital to work well, constant checks are needed (monitoring, alerting, blackbox testing). New staff and equipment have to be brought in all the time: for a hospital these are nurses, doctors and medical devices; for a distributed system, new engineers and services. As the number of people and services grows, the old ways of working become ineffective: a small rural clinic operates differently from a large hospital in a metropolis. Finding more efficient ways of operating becomes a full-time job, and measurement and alerting become ever more important. Just as a large hospital needs supporting functions such as accounting, HR and security, operating a large distributed system relies on supporting teams such as infrastructure and SRE.

For a team to be able to operate a reliable distributed system, the organization must invest continuously in its operation, as well as in the platforms the system is built on.

Useful materials


Although this article turned out to be long, it only scratches the surface. To learn more about operating distributed systems, I recommend these resources:

Books

  • Site Reliability Engineering (the Google SRE book, mentioned above).
  • Distributed Systems Observability (e-book, mentioned above).

Sites

  • Increment magazine: the On-Call issue.

See the discussion of this article on Hacker News.
