Kubernetes load balancing and scaling long-lived connections


This article will help you understand how load balancing works in Kubernetes, what happens when you scale long-lived connections, and why you should consider client-side balancing if you use HTTP/2, gRPC, RSockets, AMQP, or other long-lived protocols. 

A bit about how traffic is distributed in Kubernetes 


Kubernetes provides two convenient abstractions for rolling out applications: Services and Deployments.

A Deployment describes how your application should run and how many copies of it should be running at any given time. Each copy is deployed as a Pod and is assigned its own IP address.

Services, in turn, are similar to a load balancer: they are designed to distribute traffic across multiple pods.
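
For reference, a minimal sketch of such a pair might look like this (the names, image, and ports are illustrative, not taken from any real setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3                      # three copies of the application
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: app
          image: example/backend:1.0   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend                   # traffic is distributed across pods with this label
  ports:
    - port: 80
      targetPort: 8080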

Let's see how it looks.

  1. In the diagram below, you see three instances of the same application and a load balancer:

  2. The load balancer is called a Service and is assigned an IP address. Any incoming request is redirected to one of the pods:

  3. The Deployment determines the number of application instances. You will almost never have to deploy pods directly:

  4. Each pod is assigned its own IP address:



It is useful to consider services as a set of IP addresses. Each time you access the service, one of the IP addresses is selected from the list and used as the destination address.

It works as follows.

  1. A curl request is made to the service IP 10.96.45.152:

  2. The service selects one of the three pod addresses as the destination:

  3. Traffic is redirected to a specific pod:



If your application consists of a frontend and a backend, then you will have both a service and a deployment for each.

When the frontend makes a request to the backend, it does not need to know exactly how many pods serve the backend: there can be one, ten, or a hundred.

The frontend also knows nothing about the addresses of the pods serving the backend.

When the frontend makes a request to the backend, it uses the IP address of the backend service, which does not change.

Here is how it looks.

  1. Pod 1 needs the backend component. Instead of choosing a specific backend pod, it makes a request to the service:

  2. The service selects one of the backend pods as the destination address:

  3. Traffic goes from pod 1 to pod 5, selected by the service:

  4. Pod 1 does not know exactly how many pods like pod 5 are hidden behind the service:



But how exactly does the service distribute requests? Is round-robin balancing used? Let's figure it out. 

Balancing in Kubernetes Services


Kubernetes services do not actually exist: there is no process listening on the IP address and port assigned to a service.

You can verify this by going to any node in the cluster and running the netstat -ntlp command.
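
For example, on any node:

# list all listening TCP sockets and the processes that own them;
# the service's IP address and port will not appear here
netstat -ntlp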

You won't even find the IP address allocated to the service listed there.

The service's IP address lives in the control plane, in the controller, and is recorded in the database (etcd). The same address is used by another component: kube-proxy.
Kube-proxy receives the list of IP addresses for all services and generates a set of iptables rules on each node of the cluster.

These rules say: "If we see the IP address of the service, we need to modify the destination address of the request and send it to one of the pods."

The IP address of the service is used only as an entry point and is not served by any process listening on this IP address and port.

Let's see how this works.

  1. Consider a cluster of three nodes. There are pods on each node:

  2. The pods colored beige are part of the service. Since the service does not exist as a process, it is shown grayed out:

  3. The first pod makes a request to the service, and the request should end up at one of the associated pods:

  4. But the service does not exist; there is no such process. How does it work?

  5. Before the request leaves the node, it goes through the iptables rules:

  6. The iptables rules know that there is no service, and replace its IP address with one of the IP addresses of the pods associated with this service:

  7. The request gets a valid IP address as the destination address and is processed normally:

  8. Depending on the network topology, the request eventually reaches the pod:



Can iptables balance the load?


No. iptables is used for packet filtering and was not designed for load balancing.

However, it is possible to write a set of rules that work like a pseudo-balancer.

And that’s exactly what Kubernetes does.

If you have three pods, kube-proxy will write the following rules:

  1. Choose the first one with a probability of 33%, otherwise go to the next rule.
  2. Choose the second one with a probability of 50%, otherwise go to the next rule.
  3. Choose the third pod.

This scheme results in each pod being selected with a probability of 33%.



And there is no guarantee that pod 2 will be selected right after pod 1.

Note: iptables uses the statistic module with random mode. The balancing algorithm is therefore based on random selection.
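
For illustration, the generated rules look roughly like this (heavily simplified; the real rules live in KUBE-SVC-*/KUBE-SEP-* chains of the nat table, and the chain names below are made up):

-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333 -j KUBE-SEP-POD1
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000 -j KUBE-SEP-POD2
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD3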

Now that you understand how services work, let's look at some more interesting scenarios.

Long-lived connections in Kubernetes do not scale by default


Each HTTP request from the frontend to the backend is served by a separate TCP connection, which is opened and closed.

If the frontend sends 100 requests per second to the backend, then 100 different TCP connections open and close.

You can reduce the processing time of the request and reduce the load if you open one TCP connection and use it for all subsequent HTTP requests.

The HTTP protocol has a feature called HTTP keep-alive, or connection reuse, where a single TCP connection is used to send and receive many HTTP requests and responses:



This feature is not enabled by default: both the server and the client must be configured accordingly.

The setup itself is simple and accessible for most programming languages ​​and environments.

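For instance, a minimal sketch in Node.js might look like this (the host name backend is a placeholder for your service):

var http = require('http');

// Reuse TCP connections across requests instead of opening a new one each time
var keepAliveAgent = new http.Agent({ keepAlive: true });

http.get({ host: 'backend', port: 80, path: '/', agent: keepAliveAgent }, function (res) {
  res.resume(); // drain the response so the socket can be returned to the pool
});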


What happens if we use keep-alive in Kubernetes?
Let's assume that both the frontend and backend support keep-alive.

We have one copy of the frontend and three copies of the backend. The frontend makes the first request and opens a TCP connection to the backend. The request reaches the service, and one of the backend pods is selected as the destination address. That pod sends a response, and the frontend receives it.

Unlike the usual situation, where the TCP connection is closed after the response is received, it is now kept open for subsequent HTTP requests.

What happens if the frontend sends more requests to the backend?

They will be sent over the already open TCP connection, so all of them will end up at the same backend pod that received the first request.

Shouldn't iptables redistribute traffic?

Not in this case.

When a TCP connection is created, it goes through the iptables rules, which select the specific backend pod the traffic will go to.

Since all subsequent requests travel over the already open TCP connection, the iptables rules are not evaluated again: destination NAT is applied only to the first packet of a connection, and after that the kernel's connection tracking takes over.

Let's see how it looks.

  1. The first pod sends a request to the service:

  2. You already know what will happen next. The service does not exist, but there are iptables rules that will handle the request:

  3. One of the backend pods will be selected as the destination address:

  4. The request reaches the pod. At this point, a persistent TCP connection between the two pods is established:

  5. Any subsequent request from the first pod will go over the already established connection:



As a result, you get lower latency and higher throughput, but you lose the ability to scale the backend.

Even if you have two backend pods, with a persistent connection traffic will always go to one of them.

Can this be fixed?

Since Kubernetes does not know how to balance persistent connections, this task is your responsibility.

A Service is a set of IP addresses and ports called endpoints.

Your application can get the list of endpoints from the service and decide how to distribute requests across them. You can open a persistent connection to each pod and balance requests across these connections using round-robin.

Or apply more sophisticated balancing algorithms.

The client-side code responsible for balancing should follow this logic (a minimal sketch follows the list):

  1. Get the list of endpoints from the service.
  2. For each endpoint, open a persistent connection.
  3. When you need to make a request, use one of the open connections.
  4. Regularly refresh the list of endpoints and, if it changes, open new persistent connections or close old ones.
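
A minimal Node.js sketch of this logic, assuming the backend is exposed through a headless Service (covered later in the article) so that a DNS lookup returns the individual pod IPs; the service name and port are illustrative:

var dns = require('dns').promises;
var http = require('http');

// One keep-alive agent: it keeps a pool of persistent connections per host:port.
// Step 2 happens lazily: a connection is opened the first time a pod is used and then kept alive.
var agent = new http.Agent({ keepAlive: true });

var endpoints = [];   // current list of pod IP addresses
var next = 0;         // round-robin counter

// Steps 1 and 4: fetch the endpoint list and refresh it periodically
async function refreshEndpoints() {
  endpoints = await dns.resolve4('backend.default.svc.cluster.local');
}

// Step 3: pick the next endpoint in round-robin order and reuse its persistent connection
function get(path, callback) {
  var host = endpoints[next++ % endpoints.length];
  http.get({ host: host, port: 8080, path: path, agent: agent }, callback);
}

refreshEndpoints();
setInterval(refreshEndpoints, 10 * 1000);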

Here is how it will look.

  1. Instead of sending the first request to the service, you can balance requests on the client side:

  2. You need to write code that asks which pods are part of the service:

  3. As soon as you receive the list, save it on the client side and use it to connect to the pods:

  4. You yourself are responsible for the load balancing algorithm:



Now the question is: does this issue only apply to HTTP keep-alive?

Client-side load balancing


HTTP is not the only protocol that can use persistent TCP connections.

If your application uses a database, a new TCP connection is not opened every time you need to run a query or fetch a document. 

Instead, a persistent TCP connection to the database is opened once and reused.

If your database is deployed in Kubernetes and access is provided as a service, then you will encounter the same problems as described in the previous section.

One database replica will be loaded more than the others. Kube-proxy and Kubernetes will not help balance the connections; you have to take care of balancing queries to your database yourself.

Depending on which library you use to connect to the database, you may have various options for solving this problem.

The following is an example of accessing a MySQL database cluster from Node.js:

var mysql = require('mysql');
var poolCluster = mysql.createPoolCluster();

// Assumed: an array of connection configs ({ host, port, user, ... }), one per pod
var endpoints = /* retrieve endpoints from the Service */

// Register each replica in the pool cluster under its own name
for (var [index, endpoint] of endpoints.entries()) {
  poolCluster.add(`mysql-replica-${index}`, endpoint);
}

// Make queries to the clustered MySQL database

There are tons of other protocols that use persistent TCP connections:

  • WebSockets and secure WebSockets
  • HTTP/2
  • gRPC
  • RSockets
  • AMQP

You should already be familiar with most of these protocols.

But if these protocols are so popular, why is there no standardized balancing solution? Why is a change in client logic required? Is there a native Kubernetes solution?

Kube-proxy and iptables are designed to cover the most common Kubernetes deployment scenarios. They exist for convenience.

If you use a web service that exposes a REST API, you are in luck: in this case persistent TCP connections are not used, and you can use any Kubernetes service.

But as soon as you start using persistent TCP connections, you will have to figure out how to distribute the load evenly across the backends. Kubernetes does not provide a ready-made solution for this case.

However, of course, there are options that may help.

Balancing long-lived connections in Kubernetes


Kubernetes has four types of services:

  1. ClusterIP
  2. NodePort
  3. LoadBalancer
  4. Headless

The first three types are based on a virtual IP address, which kube-proxy uses to build iptables rules. But the fundamental building block of all services is the headless service.

No IP address is associated with a headless service; it only provides a mechanism for obtaining the list of IP addresses and ports of the associated pods (the endpoints).

All services are based on the headless service.

A ClusterIP service is a headless service with some additions: 

  1. The control plane assigns it an IP address.
  2. Kube-proxy generates the necessary iptables rules.

Thus, you can ignore kube-proxy and directly use the list of endpoints received from the headless service to balance the load in your application.
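
For reference, a headless service is an ordinary Service with clusterIP set to None (names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: backend-headless
spec:
  clusterIP: None        # headless: no virtual IP; DNS returns the pod IPs directly
  selector:
    app: backend
  ports:
    - port: 8080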

But how to add similar logic to all applications deployed in a cluster?

If your application is already deployed, then such a task may seem impossible. However, there is an alternative.

Service Mesh will help you


You probably already noticed that the client-side load balancing strategy is quite standard.

When the application starts, it:

  1. Gets a list of IP addresses from the service.
  2. Opens and maintains a connection pool.
  3. Periodically updates the pool, adding or removing endpoints.

As soon as the application wants to make a request, it:

  1. Selects an available connection using some kind of logic (e.g. round-robin).
  2. Fulfills the request.

These steps work for WebSockets, gRPC, and AMQP.

You can separate this logic into a separate library and use it in your applications.

However, service meshes such as Istio or Linkerd can be used instead.

A service mesh supplements your application with a process that:

  1. Automatically discovers the IP addresses of services.
  2. Checks connections for protocols such as WebSockets and gRPC.
  3. Balances requests using the right protocol.

A service mesh helps manage traffic inside the cluster, but it is quite resource-intensive. Alternatives include third-party libraries such as Netflix Ribbon or programmable proxies such as Envoy.

What happens if you ignore balancing issues?


You can skip load balancing entirely and still not notice any problems. Let's look at a few scenarios.

If you have more clients than servers, this is not such a big problem.

Suppose there are five clients that connect to two servers. Even if there is no balancing, both servers will be used:



The connections may be distributed unevenly (perhaps four clients will connect to the same server), but there is a good chance that both servers will be used.
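
For example, if each of the five clients picks one of the two servers uniformly at random, the probability that all of them land on the same server is only 2 × (1/2)^5, or about 6%.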

What is more problematic is the opposite scenario.

If you have fewer clients than servers, your resources may be underused and a potential bottleneck will appear.

Suppose there are two clients and five servers. At best, there will be two persistent connections to two of the five servers.

The other servers will sit idle:



If these two servers cannot handle the clients' requests, horizontal scaling will not help.

Conclusion


Kubernetes services are designed to work in most standard web application scenarios.

However, as soon as you start working with application protocols that use persistent TCP connections, such as databases, gRPC, or WebSockets, services are no longer enough. Kubernetes does not provide built-in mechanisms for balancing persistent TCP connections.

This means you must build client-side balancing into your applications.

Translation prepared by the Kubernetes aaS team at Mail.ru.

What else to read on the topic:

  1. Three levels of autoscaling in Kubernetes and how to use them effectively
  2. Kubernetes in the spirit of piracy with an implementation template.
