Designing Kubernetes Clusters: How Many Should There Be?

Translator's note: this material from the learnk8s educational project answers a common question that arises when designing Kubernetes-based infrastructure. We hope that the detailed pros and cons of each option will help you make the best choice for your project.



TL;DR: the same set of workloads can be run on a few large clusters (each with many workloads) or on many small ones (with only a few workloads in each).

The table below summarizes the pros and cons of each approach:



When using Kubernetes as a platform for operating applications, several fundamental questions about cluster configuration often arise:

  • How many clusters should you use?
  • How big should they be?
  • What should each cluster contain?

In this article, I will try to answer all these questions by analyzing the pros and cons of each approach.

Framing the problem


As a software developer, you likely develop and operate many applications in parallel.

In addition, multiple instances of these applications likely run in different environments - for example, dev , test and prod .

The result is a whole matrix of applications and environments:


Applications and environments

In the above example, 3 applications and 3 environments yield 9 possible combinations.

Each application instance is a self-contained deployment unit that can be operated independently of others.

Note that an application instance may consist of many components, such as a frontend, backend, database, etc. In the case of a microservice application, the instance includes all of its microservices.

As a result, Kubernetes users have several questions:

  • Should I place all application instances in one cluster?
  • Should I create a separate cluster for each application instance?
  • Or should I use some combination of these approaches?

All of these options are viable: Kubernetes is a flexible system that does not restrict the user to a single model.

Here are some of the possible ways:

  • one large common cluster;
  • many small highly specialized clusters;
  • one cluster for each application;
  • one cluster for each environment.

As shown below, the first two approaches are at the opposite ends of the options scale:


From several large clusters (on the left) to many small ones (on the right)

In general, one cluster is considered "larger" than another if it has more nodes and pods. For example, a cluster with 10 nodes and 100 pods is larger than a cluster with 1 node and 10 pods.

Well, let's get started!

1. One large common cluster


The first option is to place all the workloads in one cluster:


One large cluster

As part of this approach, the cluster is used as a universal infrastructure platform - you just deploy everything you need in an existing Kubernetes cluster.

Kubernetes namespaces allow you to logically separate parts of the cluster from one another, so a dedicated namespace can be used for each application instance.
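As a minimal sketch of this layout, each instance of an application gets its own namespace in the shared cluster (the names below are illustrative, not prescribed by Kubernetes):

```yaml
# One namespace per application instance; names are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: app-1-dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: app-1-test
---
apiVersion: v1
kind: Namespace
metadata:
  name: app-1-prod
```

Applying this manifest with `kubectl apply -f namespaces.yaml` gives each instance its own logical partition within the one shared cluster.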

Let's look at the pros and cons of this approach.

+ Efficient use of resources


In the case of a single cluster, only one copy of all the resources needed to start and manage the Kubernetes cluster is required.

For example, this applies to master nodes. A Kubernetes cluster usually has 3 master nodes, so a single cluster needs only those three (for comparison, 10 clusters would need 30 master nodes).

The same applies to other services that operate cluster-wide, such as load balancers, Ingress controllers, and authentication, logging, and monitoring systems.

In a single cluster, all these services immediately serve all workloads (there is no need to run copies of them, as with multiple clusters).

+ Low cost


As a consequence of the above, fewer clusters usually means lower cost, because there is no spending on redundant resources.

This is especially true for master nodes, which can cost significant money regardless of where they run (on-premises or in the cloud).

Some managed Kubernetes services, such as Google Kubernetes Engine (GKE) or Azure Kubernetes Service (AKS) , provide the control plane for free. In that case, the cost issue is less acute.

There are also managed services that charge a fixed fee for each Kubernetes cluster (for example, Amazon Elastic Kubernetes Service, EKS ).

+ Effective administration


Managing a single cluster is easier than managing several.

Administration may include the following tasks:

  • upgrading the Kubernetes version;
  • configuring the CI/CD pipeline;
  • installing a CNI plugin;
  • setting up user authentication;
  • installing an admission controller;

and many more.

In the case of one cluster, all this will have to be done only once.

With multiple clusters, these operations have to be repeated many times, which will likely require automation and tooling to keep the process systematic and uniform.

And now a few words about the cons.

- Single point of failure


If the single shared cluster fails, all workloads stop working at once!

There are plenty of ways things can go wrong:

  • a Kubernetes upgrade causes unexpected side effects;
  • a cluster-wide component (for example, the CNI plugin) does not work as expected;
  • one of the cluster components is misconfigured;
  • a failure occurs in the underlying infrastructure.

One such incident can cause serious damage to all workloads located in a common cluster.

- Lack of hard isolation


Working in a shared cluster means that applications share hardware, network capabilities, and the operating system on the cluster nodes.

In a sense, two containers with two different applications running on the same host are like two processes running on the same machine under the same OS kernel.

Linux containers provide some form of isolation, but it is far weaker than that provided by, say, virtual machines. In essence, a process in a container is just another process on the host operating system.

This can be a security issue: such an organization theoretically allows unrelated applications to interact with each other (intentionally or accidentally).

In addition, all workloads in a Kubernetes cluster share certain cluster-wide services, such as DNS, which lets applications discover the Services of other applications in the cluster.
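To illustrate, any Service is discoverable cluster-wide via a predictable DNS name (the service and namespace names here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend        # illustrative name
  namespace: team-a    # illustrative namespace
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 8080
# Any pod in the cluster can resolve this Service as:
#   backend.team-a.svc.cluster.local
```

This is convenient for service discovery, but it also means namespaces alone do not hide one application's Services from another.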

How much each of these points matters depends on your applications' security requirements.

Kubernetes provides various tools to mitigate security issues, such as PodSecurityPolicies and NetworkPolicies . However, configuring them properly takes some experience, and even then they cannot close absolutely every security hole.
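As a sketch of what such a configuration looks like, the hypothetical NetworkPolicy below allows pods in a namespace to receive traffic only from other pods in the same namespace (the names are assumptions for illustration, and enforcement requires a CNI plugin that supports NetworkPolicies):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only  # illustrative name
  namespace: app-1-prod            # illustrative namespace
spec:
  podSelector: {}          # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only pods from this same namespace may connect
```

Without such a policy, any pod in the cluster can reach any other pod by default.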

It is important to always remember that Kubernetes was originally designed for sharing, not for isolation and security.

- Lack of hard multi-tenancy


Given the abundance of shared resources in a Kubernetes cluster, there are many ways different applications can step on each other's toes.

For example, an application can monopolize some shared resource (like a processor or memory) and deprive other applications running on the same node of access to it.

Kubernetes provides various mechanisms to control such behavior, such as resource requests and limits (see also "CPU Limits and aggressive throttling in Kubernetes" - translator's note), ResourceQuotas, and LimitRanges. However, as with security, configuring them is nontrivial, and they cannot prevent absolutely every unforeseen side effect.
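A minimal sketch of two of these mechanisms (all names and figures below are illustrative): a namespace-level ResourceQuota caps the total resources one application instance may consume, while per-container requests and limits bound each workload:

```yaml
# Cap the total resources of one application instance (namespace).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-1-quota      # illustrative
  namespace: app-1-dev   # illustrative
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Per-container requests and limits on a workload in that namespace.
apiVersion: v1
kind: Pod
metadata:
  name: backend          # illustrative
  namespace: app-1-dev
spec:
  containers:
    - name: backend
      image: example.com/backend:1.0   # illustrative image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
```

Note that once a ResourceQuota covers CPU or memory, every pod in that namespace must declare the corresponding requests or limits, or its creation will be rejected.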

- A large number of users


With a single cluster, many people need access to it. And the more of them there are, the higher the risk that someone will "break something".

Inside the cluster, you can control who can do what using role-based access control (RBAC) (see "Users and RBAC Authorization in Kubernetes" - translator's note). However, it will not stop users from "breaking" something within their own area of responsibility.
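For example, a namespace-scoped RBAC setup might look like the sketch below: a Role granting read-only access to pods, bound to a single user (the role, namespace, and user names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader        # illustrative
  namespace: app-1-dev    # illustrative
rules:
  - apiGroups: [""]       # "" refers to the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-alice   # illustrative
  namespace: app-1-dev
subjects:
  - kind: User
    name: alice           # illustrative user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Within the namespaces a user does have write access to, though, RBAC cannot prevent them from deploying something broken.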

- Clusters cannot grow indefinitely


The cluster that is used for all workloads will probably be quite large (in terms of the number of nodes and pods).

But here another problem arises: clusters in Kubernetes cannot grow indefinitely.

There is a theoretical limit on cluster size. For Kubernetes, it is approximately 5,000 nodes, 150,000 pods, and 300,000 containers .

However, in practice, problems can start much earlier - for example, at just 500 nodes .

The point is that large clusters put a heavy load on the Kubernetes control plane. In other words, keeping the cluster operational and its resources efficiently used requires careful tuning.

This issue is explored in a related article in the original blog entitled “ Architecting Kubernetes clusters - choosing a worker node size ”.

Now let's look at the opposite approach: many small clusters.

2. Many small, specialized clusters


With this approach, you use a separate cluster for each deployed element:


Many small clusters

For the purposes of this article, a deployed element means an application instance - for example, the dev version of a particular application.

This strategy uses Kubernetes as a specialized runtime for individual application instances.

Let's look at the pros and cons of this approach.

+ Limited "blast radius"


If a cluster fails, the damage is limited to the workloads deployed in that cluster. All other workloads remain unaffected.

+ Isolation


Workloads hosted on individual clusters do not share resources such as processor, memory, operating system, network, or other services.

The result is strong isolation between unrelated applications, which can benefit their security.

+ Small number of users


Given that each cluster contains only a limited set of workloads, the number of users with access to it is reduced.

The fewer people have access to the cluster, the lower the risk that something will “break”.

Let's look at the cons.

- Inefficient use of resources


As mentioned earlier, each Kubernetes cluster requires a certain set of control resources: master nodes, control layer components, and solutions for monitoring and logging.

With a large number of small clusters, a larger share of resources goes to management overhead.

- High cost


Inefficient use of resources automatically entails high costs.

For example, running 30 master nodes instead of three for the same total computing power will inevitably affect costs.

- Administration difficulties


Managing multiple Kubernetes clusters is much more difficult than managing one.

For example, you have to configure authentication and authorization for each cluster separately. Kubernetes version upgrades also have to be performed once per cluster.

Most likely, you will have to automate these tasks to handle them efficiently.

Now consider less extreme scenarios.

3. One cluster per application


As part of this approach, you create a separate cluster for all instances of a specific application:


Cluster per application

This approach can be seen as a generalization of the " separate cluster per team " principle, since a team of engineers usually develops one or more applications.

Let's look at the pros and cons of this approach.

+ Cluster can be customized for the application


If the application has special needs, they can be implemented in a cluster without affecting other clusters.

Such needs might include GPU worker nodes, specific CNI plugins, a service mesh, or other services.

Each cluster can be tailored to the application running in it so that it contains only what is needed.

- Different environments in one cluster


The disadvantage of this approach is that application instances from different environments coexist in the same cluster.

For example, the prod version of an application runs in the same cluster as the dev version. This also means developers work in the same cluster where the production version of the application runs.

If the actions of developers, or bugs in the dev version, cause a failure in the cluster, the prod version can suffer as well - a major drawback of this approach.

And finally, the last scenario on our list.

4. One cluster for each environment


This scenario provides for the allocation of a separate cluster for each environment:


One cluster for the environment

For example, you can have dev , test and prod clusters, each running all the application instances intended for that environment.

Here are the pros and cons of this approach.

+ Isolation of prod environment


In this approach, all environments are isolated from each other. However, in practice, this is especially important for the prod environment.

Production versions of the application are now independent of what is happening in other clusters and environments.

Thus, if a problem suddenly arises in the dev cluster, the prod versions of the applications will continue to work as if nothing had happened.

+ Cluster can be adjusted to the environment


Each cluster can be tailored to its environment. For example, you can:

  • install development and debugging tools in the dev cluster;
  • install test frameworks and tools in the test cluster;
  • use more powerful hardware and network links in the prod cluster.

This improves the efficiency of both the development and operation of applications.

+ Restrict access to the production cluster


Direct work with the prod cluster is rarely needed, so you can significantly limit the circle of people who have access to it.

You can go even further and remove human access to this cluster entirely, performing all deployments through an automated CI/CD tool. Such an approach minimizes the risk of human error exactly where it matters most.

And now a few words about the cons.

- Lack of isolation between applications


The main drawback of the approach is the lack of hardware and resource isolation between applications.

Unrelated applications share cluster resources: the OS kernel, CPU, memory, and various other services.

As already mentioned, this can be potentially dangerous.

- Inability to localize application dependencies


If the application has special requirements, then they must be satisfied in all clusters.

For example, if an application needs a GPU, then every cluster must contain at least one GPU worker node (even if only that one application uses it).
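For reference, a pod requests a GPU through an extended resource, and the scheduler places it only on a node that advertises that resource (the sketch below assumes the NVIDIA device plugin is installed on the node; the pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job   # illustrative
spec:
  containers:
    - name: trainer
      image: example.com/trainer:1.0   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # requires the NVIDIA device plugin on the node
```

With the one-cluster-per-environment layout, the dev, test, and prod clusters would each need such a node for this single pod to be schedulable everywhere.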

As a result, we run the risk of higher costs and inefficient use of resources.

Conclusion


If you have a specific set of applications, you can place them in several large clusters or in many small ones.

The article discusses the pros and cons of various approaches, ranging from one global cluster to several small and highly specialized ones:

  • one large common cluster;
  • many small highly specialized clusters;
  • one cluster for each application;
  • one cluster for each environment.

So, which approach to choose?

As usual, the answer depends on the use case: weigh the pros and cons of the different approaches and choose the option that fits best.

However, the choice is not limited to the above examples - you can use any combination of them!

For example, you can set up a pair of clusters per team: a cluster for development (with the dev and test environments) and a cluster for production (with the production environment).

Using the information in this article, you can weigh the pros and cons for your specific scenario. Good luck!
