CPU limits and aggressive throttling in Kubernetes

Translator's note: This cautionary tale from Omio, the European travel aggregator, takes readers from basic theory to the fascinating practical intricacies of Kubernetes configuration. Knowing about cases like this not only broadens your horizons, it also helps you avoid some non-trivial problems.



Have you ever seen an application get "stuck", stop responding to health checks, and been unable to figure out why? One possible explanation is the CPU resource quota limit, and that is what this article is about.

TL;DR:
We strongly recommend dropping CPU limits in Kubernetes (or disabling CFS quotas in kubelet) if you are running a Linux kernel version affected by the CFS quota bug. The kernel contains a serious, well-known bug that leads to excessive throttling and delays.

At Omio, the entire infrastructure is managed by Kubernetes. All of our stateful and stateless workloads run exclusively on Kubernetes (we use Google Kubernetes Engine). Over the last six months we started observing random slowdowns: applications would freeze, stop responding to health checks, lose network connectivity, and so on. This behavior puzzled us for a long time, and we finally decided to get to the bottom of it.

Summary of the article:

  • A few words about containers and Kubernetes;
  • How CPU requests and limits are implemented;
  • How CPU limit works in multi-core environments;
  • How to track CPU throttling;
  • The solution and its nuances.

A few words about containers and Kubernetes


Kubernetes has become the de facto standard in the infrastructure world. Its main job is container orchestration.

Containers


In the past, we had to create artifacts like Java JARs/WARs, Python eggs, or executables and then launch them on servers. However, to make them work, additional steps were needed: installing the runtime (Java/Python), placing the required files in the right locations, ensuring compatibility with a specific version of the operating system, and so on. In other words, configuration management required close attention (and was often a source of friction between developers and system administrators).

Containers changed everything. Now the artifact is the container image. It can be thought of as a kind of extended executable file that contains not only the program but also a full runtime (Java/Python/...), along with the necessary files and packages, pre-installed and ready to run. Containers can be deployed and run on different servers without any additional steps.

In addition, containers run in their own sandboxed environment. They have their own virtual network adapter, their own file system with restricted access, their own process hierarchy, their own CPU and memory restrictions, and so on. All of this is made possible by a special Linux kernel subsystem: namespaces.

Kubernetes


As stated earlier, Kubernetes is a container orchestrator. It works like this: you give it a pool of machines and then say, "Hey Kubernetes, launch ten instances of my container with 2 CPUs and 3 GB of memory each, and keep them running!" Kubernetes takes care of the rest. It will find free capacity, launch the containers, restart them when necessary, roll out updates when versions change, and so on. In essence, Kubernetes lets you abstract away the hardware and turns the whole fleet of machines into a single environment for deploying and running applications.


Kubernetes from a layman's point of view

What are requests and limits in Kubernetes


OK, we have covered containers and Kubernetes. We also know that several containers can share the same machine.

You can draw an analogy with a communal apartment. A spacious flat (the machine/node) is taken and rented out to several tenants (containers). Kubernetes acts as the realtor. The question is: how do you keep the tenants from conflicting with each other? What if one of them, say, decides to occupy the bathroom for half a day?

This is where requests and limits come into play. A CPU request is used only for scheduling. It is something like the container's "wish list", and it is used to pick the most suitable node. A CPU limit, on the other hand, can be compared to the lease agreement: once a node has been picked for the container, it will not be allowed to go beyond the established limits. And this is where the problem arises...
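To make this concrete, here is a minimal sketch (the pod name and image are placeholders, not something from our setup) of how a request and a limit look in a pod manifest: the request asks the scheduler for one core, the limit caps the container at two.

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: cpu-demo          # placeholder name
    spec:
      containers:
      - name: app
        image: nginx          # placeholder image
        resources:
          requests:
            cpu: "1"          # used only for scheduling (picking a node)
          limits:
            cpu: "2"          # enforced at runtime via the CFS quota described below
    EOF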

How requests and limits are implemented in Kubernetes


Kubernetes uses the kernel's throttling mechanism (skipping CPU cycles) to implement CPU limits. If an application exceeds its limit, throttling kicks in, i.e. it receives fewer CPU cycles. Memory requests and limits work differently, so their effects are easier to spot: it is enough to check the pod's last restart status for "OOMKilled". CPU throttling is not that simple, since K8s only exposes usage metrics, not cgroup metrics.
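A quick, hedged illustration of that difference ("my-pod" is a placeholder): an out-of-memory kill shows up directly in the container's last terminated state, whereas nothing equivalent exists for CPU throttling.

    # Memory-limit violations are easy to spot: the last terminated state says "OOMKilled"
    kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'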

CPU Request



How the CPU request is implemented

For simplicity, let's walk through the process using a machine with a 4-core CPU as an example.

K8s uses the cgroups mechanism to control resource allocation (memory and CPU). Cgroups follow a hierarchical model: a child group inherits the limits of its parent group. The allocation details are stored in a virtual file system (/sys/fs/cgroup); for the CPU this is /sys/fs/cgroup/cpu,cpuacct/*.

K8s uses the cpu.shares file to allocate CPU resources. In our case, the root control group receives 4096 CPU shares, 100% of the available processor power (1 core = 1024; this is a fixed value). The root group distributes resources proportionally according to the shares assigned to its children in cpu.shares, and those children, in turn, do the same for their descendants, and so on. On a typical Kubernetes node, the root control group has three children: system.slice, user.slice, and kubepods. The first two subgroups distribute resources between critical system workloads and user programs outside K8s. The last one, kubepods, is created by Kubernetes to distribute resources among pods.
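As a sketch, you can look at these shares directly on a node (cgroup v1 layout, as described here; with the systemd cgroup driver the last directory is called kubepods.slice instead of kubepods):

    cat /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.shares
    cat /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.shares
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/cpu.shares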

The diagram above shows that the first and second subgroups each received 1024 shares, while the kubepods subgroup was allocated 4096. How is this possible, when only 4096 shares are available to the root group and the sum of its children's shares clearly exceeds that number (6144)? The point is that the value only makes sense relatively: the Linux scheduler (CFS) uses it to allocate CPU resources proportionally. In our case, the first two groups each receive 680 real shares (16.6% of 4096), and kubepods receives the remaining 2736. When idle, the first two groups will not use their allocated resources.

Fortunately, the scheduler has a mechanism that avoids wasting unused CPU resources. It transfers "idle" capacity to a global pool, from which it is handed out to groups that need extra processor power (the transfer happens in batches to avoid rounding losses). The same approach applies to all descendants of those descendants.

This mechanism ensures a fair distribution of CPU power and prevents any process from "stealing" resources from others.

CPU Limit


Although limit and request configuration in K8s looks similar, their implementation is fundamentally different: this is the most misleading and least documented part.

K8s uses the CFS quota mechanism to implement limits. Its settings are specified in the cfs_period_us and cfs_quota_us files in the cgroup directory (the same place where cpu.shares lives).

Unlike cpu.shares, the quota is based on a period of time rather than on available processor power. cfs_period_us sets the duration of that period (the "era"); it is always 100,000 µs (100 ms). K8s can change this value, but the option is currently only available in alpha. The scheduler uses the era to reset spent quotas. The second file, cfs_quota_us, sets the available time (quota) within each era; note that it is also specified in microseconds. The quota may exceed the duration of the era; in other words, it can be more than 100 ms.
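For illustration, the actual files can be inspected on the node. The paths below are a sketch for cgroup v1; the QoS class directory, pod UID, and container ID are placeholders that depend on your setup:

    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>/cpu.cfs_period_us
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>/cpu.cfs_quota_us
    # with a "cpu: 2" limit you would expect quota/period = 200000/100000, i.e. 2 full cores per era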

Let's look at two scenarios on 16-core machines (the most common machine type we have at Omio):


Scenario 1: 2 threads and a limit of 200 ms. No throttling.


Scenario 2: 10 threads and a limit of 200 ms. Throttling starts after 20 ms; access to CPU resources resumes only after another 80 ms.

Suppose you set the CPU limit to 2 cores; Kubernetes translates this value into 200 ms. This means the container can use at most 200 ms of CPU time per period without being throttled.

And this is where the fun begins. As mentioned above, the available quota is 200 ms. If ten threads are running in parallel on a 12-core machine (see the illustration for scenario 2) while all other pods are idle, the quota is exhausted in just 20 ms (since 10 × 20 ms = 200 ms), and all threads of this pod are throttled for the next 80 ms. The scheduler bug mentioned above makes things worse: because of it, excessive throttling kicks in and the container cannot even use up its existing quota.
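The arithmetic behind scenario 2, as a tiny sketch (the numbers are the ones from the text above):

    quota_us=200000                          # limit of 2 cores -> 200 ms of CPU time per 100 ms period
    threads=10                               # threads burning CPU in parallel
    echo $(( quota_us / threads ))           # 20000 µs: the quota is gone after 20 ms of wall-clock time
    echo $(( 100000 - quota_us / threads ))  # 80000 µs: the pod sits throttled for the remaining 80 ms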

How to evaluate throttling in pods?


Just exec into the pod and run cat /sys/fs/cgroup/cpu/cpu.stat; a short example of working with its counters follows the list below.

  • nr_periods - the total number of scheduler periods;
  • nr_throttled - the number of throttled periods out of nr_periods;
  • throttled_time - cumulative throttled time in nanoseconds.
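A small sketch of turning those counters into a percentage ("my-pod" is a placeholder; it assumes the cgroup v1 path above and that awk is available inside the image):

    # raw counters
    kubectl exec my-pod -- cat /sys/fs/cgroup/cpu/cpu.stat
    # rough share of throttled periods
    kubectl exec my-pod -- awk '/nr_periods/{p=$2} /nr_throttled/{t=$2} END{if (p) print 100*t/p "% of periods throttled"}' /sys/fs/cgroup/cpu/cpu.stat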



What is really going on?


As a result, we were seeing heavy throttling in all applications, sometimes one and a half times more than calculated!

This leads to all sorts of errors: readiness probe failures, container hangs, dropped network connections, timeouts inside service calls. Ultimately, it translates into higher latency and more errors.

Solution and consequences


It is all quite simple here. We abandoned CPU limits and started updating the OS kernel in our clusters to the latest version, in which the bug is fixed. The number of errors (HTTP 5xx) in our services immediately dropped significantly:

HTTP Errors 5xx



HTTP 5xx errors of one critical service

P95 response time



Critical Service Request Delay, 95th percentile

Operating costs



Number of hours spent
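For reference, here is a hedged sketch of what the two options from the TL;DR look like in practice. The deployment name is a placeholder, and the kubelet flag is the standard --cpu-cfs-quota switch (cpuCFSQuota in the kubelet config file), which should only be changed by whoever operates the nodes:

    # Option 1: keep the CPU request for scheduling, but remove the limit from the manifest
    kubectl patch deployment my-service --type json \
      -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits/cpu"}]'

    # Option 2: disable CFS quota enforcement on the node (requires a kubelet restart)
    kubelet --cpu-cfs-quota=false   # along with your usual kubelet flags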

What's the catch?


As stated at the beginning of the article:

You can draw an analogy with a communal apartment... Kubernetes acts as the realtor. But how do you keep the tenants from conflicting with each other? What if one of them, say, decides to occupy the bathroom for half a day?

That's the catch. A single careless container can consume all the available CPU resources on a machine. If you have a sensible application stack (for example, the JVM, Go, or Node VM are properly configured), this is not a problem: you can run that way for a long time. But if applications are poorly optimized or not optimized at all (FROM java:latest), things can get out of hand. At Omio we have automated base Dockerfiles with sane default settings for our main language stacks, so this was not an issue for us.
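As a purely illustrative example of such a "sane default" (the flag is a standard HotSpot option; the value here is made up): without a CPU limit, the JVM would otherwise size its GC and JIT thread pools for every core of the node, so capping the processor count it sees is one reasonable baseline.

    # Hypothetical base-image default: tell the JVM to assume 4 CPUs instead of the whole 16-core node
    java -XX:ActiveProcessorCount=4 -jar app.jar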

We recommend monitoring USE metrics (utilization, saturation, and errors), API latency, and error rates, and making sure the results match expectations.
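One hedged way to do that: if Prometheus scrapes the kubelet/cAdvisor metrics, the CFS counters described above are exposed as container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total, and the throttled share per pod can be queried roughly like this (the Prometheus address is a placeholder):

    # fraction of CFS periods in which each pod was throttled over the last 5 minutes
    curl -s 'http://prometheus.example:9090/api/v1/query' \
      --data-urlencode 'query=sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m])) / sum by (pod) (rate(container_cpu_cfs_periods_total[5m]))'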

References


That is our story. The following materials were a great help in understanding what was going on:


Kubernetes issue reports:


Have you run into similar problems in your own practice, or do you have experience with throttling in containerized production environments? Share your story in the comments!
