Houston, we have a problem. Designing systems for failure

In 1970, American engineers launched the Apollo 13 spacecraft to the Moon. On board were three fuel cells, and there seemed to be nothing to worry about: everything was reliably and redundantly duplicated. But no one imagined that the explosion of an oxygen tank would disable two of the three. A disaster! Still, the astronauts returned home, a feature film with Tom Hanks was made about the mission, and astronaut Jack Swigert's phrase "Houston, we have a problem!" went down in history.



The story of Apollo 13 is one more proof of the well-known fact that you cannot prepare for every possible failure. This is a natural property of the world: hardware breaks, code crashes, and people make mistakes. It is impossible to eliminate this completely.

For large distributed systems, such behavior is normal; it is a consequence of economies of scale and simple statistics. That is why Design for Failure is the basic design principle behind AWS cloud services. Systems are built from the start to restore normal operation as quickly as possible and to minimize the damage from both known and yet unknown failures. At HighLoad++, Vasily Pantyukhin used real problems with production services as examples to show the design patterns for distributed systems that AWS developers use.

Vasily Pantyukhin (Hen) is an Amazon Web Services architect for Europe, the Middle East and Africa. He started out as a Unix administrator, worked at Sun Microsystems for 6 years and taught technical courses, then spent 11 years at EMC evangelizing a data-centric view of the world. He has designed and implemented projects with international teams from Cape Town to Oslo. Now he helps large and small companies work in public clouds.


In 1949, an airplane accident was being investigated at a California air force base. One of the engineers involved was Edward Murphy. He described the work of the local technicians this way: "If there are two ways to do something and one of them leads to disaster, someone will choose exactly that way."

Later, thanks to Arthur Bloch, the statement went down in history as one of Murphy's laws; in Russian it is known as "the law of meanness". Its essence is that breakdowns and human errors cannot be avoided, and we have to live with that somehow. That is why, when designing, we assume from the start that individual components will fail.

Design for Failure


In designing for failure, we try to improve three characteristics:

  • availability (those famous "nines");
  • reliability: the property of a system to provide the required level of service;
  • fault tolerance: the property of a system to prevent problems from occurring and to recover from them quickly.

Reliability deals with "known unknowns": we protect ourselves against known problems, but we do not know when they will occur.

Fault tolerance adds "unknown unknowns": surprise problems that we know nothing about. Many of these problems in the cloud are related to economies of scale: the system grows to a size at which new, surprising, unexpected effects appear.

Failure is usually not a binary phenomenon. Its main property is the "blast radius", the extent to which the service degrades. Our task is to reduce the blast radius of our systems.

If we accept that problems cannot be avoided, then we must prepare for them proactively. This means designing services so that when trouble comes (and it certainly will), we control the problem, and not the other way around.
When we merely react to a problem, it controls us.

Data plane and Control plane


You probably have electronics at home that is controlled by a remote, such as a TV. The TV screen is part of the Data plane: the thing that directly performs the function. The remote is the user interface, the Control plane: it is used to manage and configure the service. In the cloud, we try to separate the Data plane and the Control plane for testing and development.

Users most often do not see the complexity of the Control plane. But mistakes in its design and implementation are the most common causes of mass failures. That is why my advice focuses on the Control plane, sometimes explicitly and sometimes not.

The story of one outage


In July 2012, a severe storm passed through Northern Virginia. Data centers have protection, diesel generators and so on, but it so happened that one data center in one of the Availability Zones (AZs) of the Northern Virginia region lost power. Power was restored quickly, but restoring the services dragged on for hours.



I will explain the reasons using one of the basic services as an example: CLB (Classic Load Balancer). It works simply: when you launch a new balancer, separate instances are created in each availability zone, and their IP addresses are added to DNS.



When one of the instances fails, a message about this is sent to a special database.



In response, recovery procedures start: deleting records from DNS, launching a new instance, and adding the new IP to DNS.

Note: this is how the system worked in the past; now everything is fundamentally different.
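Roughly, that old flow can be sketched as follows; every function here is a made-up stub, and the real Control plane was of course far more involved:

    # Purely illustrative sketch of the old recovery flow described above.
    # All helpers are invented stubs.
    def dns_remove_record(balancer_id: str, ip: str) -> None:
        print(f"DNS: remove {ip} from {balancer_id}")

    def dns_add_record(balancer_id: str, ip: str) -> None:
        print(f"DNS: add {ip} to {balancer_id}")

    def launch_replacement_instance(availability_zone: str) -> str:
        print(f"EC2: launch replacement in {availability_zone}")
        return "10.0.1.42"                      # pretend IP of the new instance

    def handle_failure_message(msg: dict) -> None:
        # 1) drop the dead instance from DNS, 2) launch a replacement, 3) publish its IP
        dns_remove_record(msg["balancer_id"], msg["instance_ip"])
        new_ip = launch_replacement_instance(msg["availability_zone"])
        dns_add_record(msg["balancer_id"], new_ip)

    handle_failure_message({"balancer_id": "clb-1",
                            "instance_ip": "10.0.1.7",
                            "availability_zone": "us-east-1a"})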

Everything is simple, there is nothing to break. But during a mass failure, when thousands of instances fail at the same time, a huge backlog of messages to process appears in the database.



But it got worse. The Control plane is a distributed system. Due to a bug, we received duplicates, and thousands of records in the database swelled into hundreds of thousands. Working with that became very difficult.

When one of the instances fails, all traffic is almost instantly switched to the surviving machines and their load doubles (in the example, for simplicity, there are only two availability zones).



There are not enough resources, so the surviving instance automatically starts to scale. Scaling takes a relatively long time. Everything happens at peak, all at once, with a huge number of instances: the free resources of the availability zone run out, and a "fight" for resources begins.

In Northern Virginia, the automation failed to cope with the mass failure, and engineers restored the services manually (with scripts). Such troubles are rare. During the post-mortem, questions arose about the causes of the failure; it was decided that the situation must never repeat and that the whole service had to be changed.

The eight patterns that I will cover are answers to some of those questions.

Note. This is our experience of service design, not universal wisdom for indiscriminate use. The patterns are applied in specific situations.

Isolation and regulation


There are many approaches to minimizing the impact of failures. One of them is to answer the question: "How do we make sure that users who are not affected by a problem notice nothing, either during the failure or during recovery?"

Our huge backlog receives not only failure messages but also others, for example about scaling, or that someone is launching a new balancer. Such messages need to be isolated from each other by functional grouping: recovery-after-failure messages in one group, launches of new balancers in another.

Suppose ten users have noticed a problem: one of the nodes of their balancers has gone down. The services somehow keep working on the remaining resources, but the problem is felt.



So we have ten frustrated users. An eleventh appears: he knows nothing about the problem and simply wants a new balancer. If his request is placed at the back of the queue, he most likely will not wait for it: by the time the other procedures finish, his request will have timed out. Instead of ten frustrated users, we will have eleven.

To prevent this, we prioritize some kinds of requests, putting them at the front of the queue, for example requests for new resources. During a mass failure, the relatively small number of such requests will not affect the recovery time of other customers' resources, but it will keep the number of users drawn into the problem from growing.
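A minimal sketch of such prioritization, using an in-process priority queue; the queue layout and request shapes are invented for illustration:

    # Requests for brand-new resources jump ahead of recovery work.
    import heapq

    NEW_RESOURCE, RECOVERY = 0, 1          # lower number = higher priority

    queue: list[tuple[int, int, str]] = [] # (priority, sequence, request)
    seq = 0

    def enqueue(request: str, priority: int) -> None:
        global seq
        heapq.heappush(queue, (priority, seq, request))  # seq keeps FIFO order within a priority
        seq += 1

    # A mass failure floods the queue with recovery work...
    for i in range(5):
        enqueue(f"recover balancer node {i}", RECOVERY)
    # ...but the eleventh user's request for a new balancer goes to the front.
    enqueue("create new balancer for user 11", NEW_RESOURCE)

    print(heapq.heappop(queue)[2])   # "create new balancer for user 11"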

Constant work


The response to problem messages is to launch recovery procedures, in particular work with DNS. Mass failures mean huge peak loads on the Control plane. The second pattern helps the Control plane stay stable and predictable in such a situation.



We use an approach called constant work.



For example, DNS can be made a little smarter: it constantly checks the balancer instances, alive or not. Each time, the result is a bitmap: an instance responds - 1, dead - 0.

DNS checks the instances every few seconds, regardless of whether the system is recovering from a mass failure or operating normally. It does the same work all the time: no peaks, everything is predictable and stable.
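A toy sketch of that constant-work health check: every cycle we probe every instance and publish the full bitmap, whether or not anything changed. Hosts and the probe itself are made up for illustration:

    import socket
    import time

    INSTANCES = ["10.0.1.10", "10.0.2.10"]   # hypothetical balancer instances, one per AZ
    CHECK_INTERVAL_SECONDS = 5

    def is_alive(host: str, port: int = 80, timeout: float = 1.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:
        bitmap = [1 if is_alive(host) else 0 for host in INSTANCES]
        # The same amount of work every cycle: publish the whole bitmap, e.g. [1, 0].
        print(bitmap)
        time.sleep(CHECK_INTERVAL_SECONDS)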

Another simplified example: we want to change the configuration on a large fleet. In our terminology, a fleet is a group of virtual machines that together do some work.



We place the configuration in an S3 bucket and, every 10 seconds (for example), push this entire configuration to our fleet of virtual machines. Two important points.

  • We do this regularly and never break the rule. If you choose an interval of 10 seconds, push only at that interval, regardless of the situation.
  • We always send the entire configuration, regardless of whether it has changed or not. The Data plane (the virtual machines) decides what to do with it. We do not push deltas: a delta can become very large during mass failures or changes, which potentially contributes to instability and unpredictability. A sketch of such a loop follows below.
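A minimal sketch of one way to run this constant-work loop, assuming (as in the cost example below) that each virtual machine itself fetches the full configuration object from S3 on a fixed cadence. Bucket and key names are made up, boto3 is assumed for S3 access, and error handling is omitted:

    import time
    import boto3

    BUCKET = "my-fleet-config"          # hypothetical bucket
    KEY = "config/full.json"            # always the full config, never a delta
    INTERVAL_SECONDS = 10

    s3 = boto3.client("s3")

    def apply(config_bytes: bytes) -> None:
        # The Data plane decides what to do with the config (often: nothing has changed).
        ...

    while True:
        started = time.monotonic()
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        apply(obj["Body"].read())
        # Keep the cadence fixed whether things are calm or we are recovering from a mass failure.
        time.sleep(max(0.0, INTERVAL_SECONDS - (time.monotonic() - started)))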

When we do constant work, we pay more for it. For example, 100 virtual machines requesting a configuration every second costs about $1,200 per year. This is substantially less than the salary of the programmer we would have to entrust with developing a Control plane built on the classical approach: react to failures and distribute only configuration changes.
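For reference, a back-of-the-envelope version of that estimate, assuming a GET price of roughly $0.0004 per 1,000 requests (pricing varies by region, and data transfer is not counted):

    # 100 VMs each fetching the full configuration from S3 once per second, all year round.
    vms = 100
    requests_per_second = 1
    seconds_per_year = 365 * 24 * 3600          # 31,536,000
    total_requests = vms * requests_per_second * seconds_per_year
    cost = total_requests / 1000 * 0.0004       # assumed GET price per 1,000 requests
    print(f"{total_requests:,} requests/year, ~${cost:,.0f}/year")  # ~3.15e9 requests, ~$1,260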

If the configuration changes only once every few seconds, as in the example, that is slow. But in many cases a configuration change or a service launch takes minutes, so a few seconds make no difference.

Seconds do matter for services whose configuration must change instantly, for example when VPC settings change. There, "constant work" is not applicable. This is just a pattern, not a rule: if it does not work in your case, do not use it.

Preliminary scaling


In our example, when a balancer instance fails, the surviving instance receives double the load almost immediately and starts to scale. During a mass failure this eats up a huge amount of resources. The third pattern helps control this process: scale in advance.

With two availability zones, we scale while utilization is still below 50%.



If everything is done in advance, then in the event of a failure the surviving balancer instances are ready to accept the doubled traffic.

Previously, we scaled only at high utilization, say 80%; now we scale at 45%. The system is idle most of the time and becomes more expensive. But we are ready to put up with that and actively use the pattern, because it is insurance. You have to pay for insurance, but in a serious outage the payoff covers all the expenses. If you decide to use the pattern, calculate the risks and the price in advance.
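A rough way to derive such a threshold (my reading of the pattern, not an official AWS formula): if one of N zones is lost, the remaining N - 1 zones must absorb its traffic, so steady-state utilization per zone should stay below (N - 1)/N of whatever utilization you are willing to tolerate right after the failure. The 0.9 ceiling below is an assumption chosen to reproduce the 45% figure from the talk:

    def prescale_threshold(zones: int, max_util_after_failure: float = 0.9) -> float:
        # Utilization above which we should already be scaling out.
        return (zones - 1) / zones * max_util_after_failure

    print(prescale_threshold(2))  # 0.45 -> with two AZs, scale out above ~45%
    print(prescale_threshold(3))  # 0.60 -> three zones allow higher steady-state utilization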

Cellular architecture


There are two ways to build and scale services: as a monolith or as a cell-based (honeycomb) structure.



A monolith develops and grows as a single large container. We add resources, the system swells, we run into various limits, linear characteristics become non-linear and saturate, and the "blast radius" of the system is the whole system.

If a monolith is poorly tested, the likelihood of surprises, "unknown unknowns", grows. But a large monolith cannot be fully tested. Sometimes you would have to build a separate availability zone just for that, for example for a popular service built as a monolith within an availability zone (which is many data centers). Quite apart from the difficulty of somehow generating a huge test load similar to the real one, this is impossible from a financial point of view.

Therefore, in most cases we use a cellular architecture: a configuration in which the system is built from cells of a fixed size. We scale it by adding cells.

Cellular architecture is popular in the AWS cloud. It helps isolate failures and reduce the blast radius to one or a few cells. We can fully test medium-sized cells, and this seriously reduces the risks.

A similar approach is used in shipbuilding: the hull of a ship is divided by bulkheads into compartments. In the event of a hole, one or a few compartments flood, but the ship does not sink. Yes, this did not help the Titanic, but we rarely encounter iceberg-sized problems.

I will illustrate the cell-based approach using the Simple Shapes Service as an example. This is not an AWS service; I made it up. It is a set of simple APIs for working with simple geometric shapes. You can create an instance of a geometric shape, request the type of a shape by its id, or count all instances of a given type. For example, put(triangle) creates a "triangle" object with some id, and getShape(id) returns the type: "triangle", "circle" or "rhombus".
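To make the example concrete, here is a tiny in-memory sketch of such an API; everything about it (names, storage, id generation) is made up for illustration:

    import itertools

    _ids = itertools.count(1)
    _shapes: dict[int, str] = {}          # id -> shape type

    def put(shape_type: str) -> int:
        """Create a shape instance ("triangle", "circle", "rhombus") and return its id."""
        shape_id = next(_ids)
        _shapes[shape_id] = shape_type
        return shape_id

    def get_shape(shape_id: int) -> str:
        """Return the type of the shape with the given id."""
        return _shapes[shape_id]

    def count(shape_type: str) -> int:
        """Count all instances of the given shape type."""
        return sum(1 for t in _shapes.values() if t == shape_type)

    tid = put("triangle")
    print(get_shape(tid), count("triangle"))   # triangle 1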



To make the service a cloud service, it must serve many users at the same time. Let's make it multi-tenant.



Next, we need to come up with a way to partition, that is, to distribute the shapes across cells. There are several options for choosing the partition key. The simplest is by geometric shape: all rhombuses in the first cell, circles in the second, triangles in the third.

This method has pros and cons.

  • If there are noticeably fewer circles than other shapes, the corresponding cell remains underutilized (uneven distribution).
  • Some API requests are easy to implement. For example, by counting all the objects in the second cell, we get the number of circles in the system.
  • Other queries are harder. For example, to find a shape by id, you have to go through all the cells.

The second way is to partition by object id ranges: the first thousand objects in the first cell, the second thousand in the second, and so on. The distribution is more even, but there are other difficulties. For example, to count all the triangles you need scatter/gather: the request is sent to every cell, each cell counts its own triangles, then the answers are collected, summed up and returned.
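A toy sketch of that scatter/gather counting over id-range cells; the cells and their contents are invented:

    from concurrent.futures import ThreadPoolExecutor

    # Each cell owns a contiguous id range and stores id -> shape type.
    cells = [
        {1: "triangle", 2: "circle", 3: "triangle"},       # ids 1..1000
        {1001: "rhombus", 1002: "triangle"},                # ids 1001..2000
        {2001: "circle"},                                   # ids 2001..3000
    ]

    def count_in_cell(cell: dict[int, str], shape_type: str) -> int:
        # The "gather" work performed inside a single cell.
        return sum(1 for t in cell.values() if t == shape_type)

    def count_everywhere(shape_type: str) -> int:
        # "Scatter" the request to every cell, then sum the partial results.
        with ThreadPoolExecutor() as pool:
            partial = pool.map(lambda c: count_in_cell(c, shape_type), cells)
        return sum(partial)

    print(count_everywhere("triangle"))   # 3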

The third way is to partition by tenant (by user). Here we run into a classic problem. The cloud usually has a lot of "small" users who try something out and barely load the service. And there are mastodon users: few in number, but consuming a huge amount of resources. Such users never fit into any single cell, and you have to invent tricky ways to spread them across many cells.



There is no ideal way; every service is individual. The good news is that a bit of worldly wisdom applies here: it is easier to split firewood along the grain than across it. In many services these "grains" are more or less obvious, and then you can experiment and find the optimal partition key.

The cells are interconnected, if only weakly. Therefore there must be a connecting layer, often called the routing or mapping layer. It is needed to understand which cells to send specific requests to. This layer should be as simple as possible: try not to put business logic into it.



The question of cell size arises: too small is bad, too large is also bad. There is no universal advice; decide according to the situation.

In the AWS cloud we use logical and physical cells of different sizes. There are regional services with a large cell size and zonal services with smaller cells.



Note. I talked about microcells at Saint Highload ++ Online in early April this year. There I discussed in detail an example of the specific use of this pattern in our Amazon EBS core service.

Multi-tenant


When a user launches a new balancer, they get instances in each availability zone. Whether the resources are used or not, they are allocated and belong exclusively to that tenant of the cloud.

For AWS, this approach is inefficient because the average utilization of the service's resources is very low, and that affects cost. For cloud users it is also inflexible: it cannot adapt to rapidly changing conditions, for example provide resources for an unexpected load spike in the shortest possible time.



CLB was the first balancer in the Amazon cloud. Today's services, such as NLB (Network Load Balancer), use a multi-tenant approach. The foundation, the "engine" of such network services, is HyperPlane: an internal, huge fleet of virtual machines (nodes) that is invisible to end users.



Advantages of the multi-tenant approach, the fifth pattern:

  • Fault tolerance is fundamentally higher. A huge number of nodes in HyperPlane are already running and waiting for load. The nodes know each other's state: when some resources fail, the load is instantly redistributed among the rest. Users do not even notice mass failures.
  • Protection from peak loads. Tenants live their own lives and their loads usually do not correlate with each other, so the total average load on HyperPlane is quite smooth.
  • Utilization of such services is fundamentally better. Therefore, while providing better performance, they are also cheaper.

Sounds great! But the multi-tenant approach has drawbacks. The figure shows a HyperPlane fleet with three tenants (rhombuses, circles and triangles) distributed across all the nodes.



This raises the classic noisy neighbor problem: the destructive behavior of a tenant that generates extreme or malformed traffic will potentially affect all users.



The "blast radius" in such a system is all tenants. The likelihood of a destructive "noisy neighbor" in a real AWS availability zone is not high, but we must always be prepared for the worst. We defend ourselves with the cell-based approach: we select groups of nodes as cells. In this context we call them shards. Cells, shards, partitions: here they are one and the same.



In this example the rhombus, as the "noisy neighbor", will affect only one other tenant, the triangle. But for the triangle it will be very painful. To smooth out the effect, we apply the sixth pattern: shuffle sharding.

Shuffle sharding


We distribute tenants across nodes randomly. For example, the rhombus lands on nodes 1, 3 and 6, and the triangle on nodes 2, 6 and 8. We have 8 nodes and a shard size of 3.



Simple combinatorics applies here: with a probability of 54%, there is exactly one intersection between the two tenants.



"Noisy neighbor" will affect only one tenant, and not the whole load, but only 30 percent.

Consider a configuration closer to reality: 100 nodes and a shard size of 5. With a probability of 77%, there will be no intersections at all.
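The quoted percentages are easy to check with straightforward combinatorics; a small sketch with the two configurations from the talk:

    # Two tenants each get a random shard of k nodes out of n.
    from math import comb

    def overlap_probability(n: int, k: int, overlap: int) -> float:
        """P(two random k-node shards out of n nodes share exactly `overlap` nodes)."""
        return comb(k, overlap) * comb(n - k, k - overlap) / comb(n, k)

    print(round(overlap_probability(8, 3, 1), 2))    # 0.54 -> exactly one shared node
    print(round(overlap_probability(100, 5, 0), 2))  # 0.77 -> no shared nodes at all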



Shuffle sharding can significantly reduce the "blast radius". This approach is used in many AWS services.

“A small fleet drives a large fleet, not the other way around”


When recovering from a mass failure, we update the configuration of many components. A typical question here is: should the changed configuration be pushed or pulled? Who initiates the changes: the source that holds the configuration changes, or its consumers? But these are the wrong questions. The right question is: which fleet is bigger?

Consider a simple situation: a large fleet of front-end virtual machines and a certain number of backends.



We use the cell-based approach: groups of front-end instances work with specific backends. To do this, we define the routing, that is, the mapping of backends to the frontends working with them.



Static routing is not suitable, and hash algorithms do not work well in mass failures, when most of the routes need to change quickly. Therefore it is better to use dynamic routing. Next to the large fleets of front-end and back-end instances we put a small service that deals only with routing. At any given moment it knows and assigns the mapping of backends to frontends.



Suppose we had a big crash and many front-end instances went down. They start recovering massively and almost simultaneously request the mapping configuration from the routing service.



The small routing service is bombarded with a huge number of requests. It cannot cope with the load: in the best case it degrades, in the worst it dies.

Therefore, it is correct not to have the instances request configuration changes from the small service, but to build the system so that the "baby" itself initiates configuration changes toward the large fleet of instances.



We use the constant work pattern: the small routing service sends the configuration to all front-end fleet instances once every few seconds. It can never overload the large fleet. The seventh pattern helps improve stability and resiliency.
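A minimal sketch of this direction of control, assuming the small routing service pushes the full mapping to every front-end instance on a fixed cadence; hosts, the endpoint and the payload are made up:

    import json
    import time
    import urllib.request

    FRONTENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]   # hypothetical front-end fleet
    PUSH_INTERVAL_SECONDS = 5

    def current_mapping() -> dict:
        # In a real service this would come from the routing service's own state.
        return {"frontend-group-a": ["backend-1", "backend-2"],
                "frontend-group-b": ["backend-3"]}

    def push_forever() -> None:
        while True:
            payload = json.dumps(current_mapping()).encode()
            for host in FRONTENDS:
                req = urllib.request.Request(
                    f"http://{host}:8080/config",      # hypothetical config endpoint
                    data=payload,
                    headers={"Content-Type": "application/json"},
                    method="PUT",
                )
                try:
                    urllib.request.urlopen(req, timeout=2)
                except OSError:
                    pass   # a dead frontend simply misses this round; the next push comes in 5 s
            time.sleep(PUSH_INTERVAL_SECONDS)  # same cadence whether things are calm or on fire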

The first seven patterns strengthen the system. The last one works differently.

Dropping load


Below is a classic graph of latency versus load. On the right side of the graph is the "knee": at extremely high loads, even a small increase leads to a significant increase in latency.


In normal mode, we never take our services to the right side of the graph. An easy way to control this is to add resources in time. But we are preparing for any kind of trouble: for example, we can end up on the right side of the chart while recovering from a mass failure.

Let's put the client timeout on the chart. A client can be anyone, for example another component inside our service. For simplicity, we draw the 50th-percentile latency curve.



Here we face a situation called brownout. You may be familiar with the term blackout, when electricity is cut off in a city. A brownout is when something works, but so badly and so slowly that it might as well not work at all.

Let's look at the brown zone on the chart. The service received a request from a client, processed it and returned the result. However, in half the cases the client has already timed out and no one is waiting for the result. In the other half the result comes back before the timeout, but in a slow system it still takes too long.

We face a double problem: we are already overloaded, sitting on the right side of the graph, and at the same time we are still "heating the air", doing a lot of useless work. How do we avoid this?

Find the "knee" - the inflection point on the chart . We measure or theoretically estimate.

Drop the traffic that would push us to the right of the inflection point.



We simply ignore part of the requests: we do not even try to process them, but immediately return an error to the client. Even when overloaded, we can afford this operation, because it is "cheap". These requests are not fulfilled, so the overall availability of the service drops, although the rejected requests will be processed sooner or later after one or several retries by the clients.
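One simple way to implement this (a sketch, not the AWS implementation) is to cap the number of requests in flight at the measured position of the knee and reject everything beyond it immediately:

    import threading

    KNEE_IN_FLIGHT = 200          # measured or estimated position of the knee
    _in_flight = 0
    _lock = threading.Lock()

    class Overloaded(Exception):
        """Raised to tell the client to retry later (e.g. mapped to HTTP 503)."""

    def handle(request):
        global _in_flight
        with _lock:
            if _in_flight >= KNEE_IN_FLIGHT:
                raise Overloaded()          # cheap rejection, no useless work
            _in_flight += 1
        try:
            return do_real_work(request)    # served with predictable, low latency
        finally:
            with _lock:
                _in_flight -= 1

    def do_real_work(request):
        return f"processed {request}"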



At the same time, another part of the requests is processed with a guaranteed low latency. As a result, we do not do useless work, and what we do, we do well.

A brief summary of the patterns for designing systems for failure


Isolation and regulation. Sometimes it makes sense to prioritize certain types of requests. For example, a relatively small volume of requests to create new resources can be put at the front of the queue. It is important that this does not hurt other users: during a mass outage, users waiting for their resources to recover will not feel a significant difference.

Constant work. Reduce or completely eliminate switching between service modes. A single mode that works stably and constantly, in an emergency and in normal operation alike, fundamentally improves the stability and predictability of the Control plane.

Preliminary scaling. Scale out in advance, at lower utilization values. You will have to pay a bit more for this, but it is insurance that pays off during serious system failures.

Cellular architecture. Many loosely coupled cells are preferable to a monolith. The cell-based approach reduces the "blast radius" and the likelihood of surprise errors.

Multi-tenancy. The multi-tenant approach significantly improves the utilization of a service, reduces its cost and shrinks the "blast radius".

Shuffle sharding. This approach applies to multi-tenant services and gives additional control over the "blast radius".

“A small fleet drives a large fleet, not the other way around.” We try to build services so that small services initiate configuration changes toward large fleets. We often use this together with the constant work pattern.

Dropping the load. In emergencies we try to do only useful work and do it well. To do this, we drop the part of the load that we cannot handle anyway.


