Our experience developing a CSI driver for Kubernetes on Yandex.Cloud



We are pleased to announce that Flant is adding to its contribution to open source tooling for Kubernetes by releasing an alpha version of a CSI (Container Storage Interface) driver for Yandex.Cloud.

But before moving on to the implementation details, we will answer the question of why this is needed at all when Yandex already offers a Managed Service for Kubernetes.

Introduction


Why do this?


Inside our company, from the very beginning of running Kubernetes in production (i.e. for several years now), we have been developing our own tool, deckhouse, which, by the way, we also plan to release as an open source project in the near future. With its help we configure and manage all of our clusters in a uniform way, and at the moment there are more than 100 of them, running on a wide variety of hardware configurations and in all available cloud services.

Clusters that use deckhouse have all the components needed for operation: load balancers, monitoring with convenient charts, metrics and alerts, user authentication through external providers for access to all dashboards, and so on. There is no point in deploying such a "fully loaded" cluster on top of a managed solution, since it is often either impossible or would force us to disable half of the components.

NB: This is our experience, and it is quite specific. We are by no means claiming that everyone should deploy Kubernetes clusters on their own instead of using ready-made solutions. Incidentally, we have no real experience operating Yandex's managed Kubernetes and will not give any assessment of that service in this article.

What is it, and who is it for?


We have already talked about the modern approach to storage in Kubernetes: how CSI works and how the community came to this approach.

Today, many large cloud providers have developed drivers for using their cloud disks as Persistent Volumes in Kubernetes. If a provider does not have such a driver but exposes all the necessary functions through its API, nothing prevents you from implementing the driver yourself. That is exactly what happened with Yandex.Cloud.

As a basis for development, we took the CSI driver for the DigitalOcean cloud and a couple of ideas from the driver for GCP, since interaction with the APIs of these clouds (Google and Yandex) has a number of similarities. In particular, both the GCP API and the Yandex API return an Operation object for tracking the status of long-running operations (for example, creating a new disk). The Yandex.Cloud Go SDK is used to interact with the Yandex.Cloud API.
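To give a feel for what this looks like, here is a minimal sketch of creating a disk via the Yandex.Cloud Go SDK and waiting for the resulting Operation to complete. This is not code from the driver: the token, folder ID, zone, and disk parameters are placeholders, and error handling is reduced to log.Fatal.

package main

import (
  "context"
  "log"

  "github.com/yandex-cloud/go-genproto/yandex/cloud/compute/v1"
  ycsdk "github.com/yandex-cloud/go-sdk"
)

func main() {
  ctx := context.Background()

  // Build the SDK client. Here we authenticate with an OAuth token for
  // simplicity; the driver itself uses a service account (see below).
  sdk, err := ycsdk.Build(ctx, ycsdk.Config{
    Credentials: ycsdk.OAuthToken("<token>"), // placeholder
  })
  if err != nil {
    log.Fatal(err)
  }

  // Request a new disk. The call returns an Operation describing the
  // long-running creation process.
  op, err := sdk.WrapOperation(sdk.Compute().Disk().Create(ctx, &compute.CreateDiskRequest{
    FolderId: "<folder-id>", // placeholder
    Name:     "csi-example-disk",
    ZoneId:   "ru-central1-a",
    Size:     10 << 30, // 10 GiB
  }))
  if err != nil {
    log.Fatal(err)
  }

  // Wait for the Operation to finish, i.e. for the disk to actually appear.
  if err := op.Wait(ctx); err != nil {
    log.Fatal(err)
  }
  log.Println("disk created")
}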

The result of this work is published on GitHub and may be useful to those who, for whatever reason, run their own Kubernetes installation on Yandex.Cloud virtual machines (rather than a ready-made managed cluster) and would like to use (order) disks via CSI.

Implementation


Key features


Currently, the driver supports the following functions:

  • Provisioning disks in all zones of the cluster according to the topology of the cluster's nodes (a brief sketch of how this is reported to Kubernetes follows the list);
  • Deleting previously provisioned disks;
  • Offline disk resize (Yandex.Cloud does not support expanding disks that are attached to a virtual machine). See below for how we modified the driver to make resizing as painless as possible.
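For the topology-aware provisioning mentioned in the first item, the general idea is to report the zone of the newly created disk back to Kubernetes in the CreateVolume response, so that pods are scheduled only onto nodes that can reach the disk. A simplified sketch using the CSI Go bindings (the helper name and the topology key are illustrative, not the driver's actual code):

package driver

import (
  "github.com/container-storage-interface/spec/lib/go/csi"
)

// buildCreateVolumeResponse is a hypothetical helper showing how a CSI
// controller can report the zone a disk was created in.
func buildCreateVolumeResponse(diskID string, sizeBytes int64, zone string) *csi.CreateVolumeResponse {
  return &csi.CreateVolumeResponse{
    Volume: &csi.Volume{
      VolumeId:      diskID,
      CapacityBytes: sizeBytes,
      AccessibleTopology: []*csi.Topology{
        {
          // The topology key is driver-specific; this one is just an example.
          Segments: map[string]string{"failure-domain.beta.kubernetes.io/zone": zone},
        },
      },
    },
  }
}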

In the future, we plan to add support for creating and deleting disk snapshots.

The main difficulty and how we overcame it


The Yandex.Cloud API's inability to expand disks on the fly is a limitation that complicates the resize operation for a PV (Persistent Volume): in this case, the application pod that uses the disk has to be stopped, which can cause application downtime.

According to the CSI specification, if the CSI controller reports that it can only resize disks "offline" (VolumeExpansion.OFFLINE), the disk expansion process must go as follows:

If the plugin has only VolumeExpansion.OFFLINE expansion capability and volume is currently published or available on a node then ControllerExpandVolume MUST be called ONLY after either:

  • The plugin has controller PUBLISH_UNPUBLISH_VOLUME capability and ControllerUnpublishVolume has been invoked successfully.

OR ELSE

  • The plugin does NOT have controller PUBLISH_UNPUBLISH_VOLUME capability, the plugin has node STAGE_UNSTAGE_VOLUME capability, and NodeUnstageVolume has been completed successfully.

OR ELSE

  • The plugin does NOT have controller PUBLISH_UNPUBLISH_VOLUME capability, nor node STAGE_UNSTAGE_VOLUME capability, and NodeUnpublishVolume has completed successfully.

In essence, this means that the disk must be detached from the virtual machine before it can be expanded.

Unfortunately, however, the sidecar-based implementation of the CSI specification does not meet these requirements:

  • The csi-attacher sidecar container, which should be responsible for providing the required gap between mounts, simply does not implement this behavior for offline resize. We initiated a discussion about this here.
  • What is a sidecar container in this context? The CSI plugin itself does not interact with the Kubernetes API; it only responds to gRPC calls sent to it by sidecar containers, which are developed by the Kubernetes community.

In our case (the CSI plugin), the disk expansion operation proceeds as follows (a simplified Go sketch follows the list):

  1. We receive a ControllerExpandVolume gRPC call;
  2. We try to expand the disk via the API, but get an error saying the operation cannot be performed because the disk is attached to a virtual machine;
  3. We save the disk ID in a map that holds the disks waiting to be expanded. For brevity, we will call this map volumeResizeRequired;
  4. We manually delete the pod that uses the disk, and Kubernetes recreates it. To keep the disk from being attached (ControllerPublishVolume) before the expansion completes, on every attach attempt we check whether the disk is still in volumeResizeRequired and, if so, return an error;
  5. The CSI driver retries the resize operation. If it succeeds, the disk is removed from volumeResizeRequired;
  6. Since the disk ID is no longer in volumeResizeRequired, ControllerPublishVolume succeeds, the disk is attached, and the pod starts.
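Below is a simplified sketch of this logic in Go, just to show where volumeResizeRequired fits in. It is not the driver's actual code: the expandDisk helper, error codes, and locking details are illustrative.

package driver

import (
  "context"
  "sync"

  "github.com/container-storage-interface/spec/lib/go/csi"
  "google.golang.org/grpc/codes"
  "google.golang.org/grpc/status"
)

// controllerService is a stripped-down stand-in for the driver's controller.
type controllerService struct {
  mu                   sync.Mutex
  volumeResizeRequired map[string]struct{} // disks that still need to be expanded
}

// expandDisk is a hypothetical stand-in for the cloud API call that resizes
// a disk; the real driver talks to the Yandex.Cloud API here.
func expandDisk(ctx context.Context, diskID string, newSize int64) error {
  return nil
}

func (s *controllerService) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error) {
  newSize := req.GetCapacityRange().GetRequiredBytes()

  if err := expandDisk(ctx, req.GetVolumeId(), newSize); err != nil {
    // The disk is still attached: remember that it needs resizing and fail for now (steps 2-3).
    s.mu.Lock()
    s.volumeResizeRequired[req.GetVolumeId()] = struct{}{}
    s.mu.Unlock()
    return nil, status.Error(codes.Internal, "disk is attached, resize postponed")
  }

  // The expansion succeeded, so the disk no longer needs resizing (step 5).
  s.mu.Lock()
  delete(s.volumeResizeRequired, req.GetVolumeId())
  s.mu.Unlock()

  return &csi.ControllerExpandVolumeResponse{CapacityBytes: newSize}, nil
}

func (s *controllerService) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error) {
  // Refuse to attach a disk that is still waiting to be expanded (step 4),
  // so the resize can complete before the pod gets the disk back.
  s.mu.Lock()
  _, pending := s.volumeResizeRequired[req.GetVolumeId()]
  s.mu.Unlock()
  if pending {
    return nil, status.Error(codes.FailedPrecondition, "volume resize is still in progress")
  }

  // ... attach the disk to the node via the cloud API (step 6) ...
  return &csi.ControllerPublishVolumeResponse{}, nil
}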

Everything looks simple enough, but as always there are pitfalls. Disk expansion involves the external-resizer sidecar, which, in case of an error during the operation, uses a queue whose retry delay grows exponentially, up to 1000 seconds:

// From k8s.io/client-go/util/workqueue (default_rate_limiters.go):
func DefaultControllerRateLimiter() RateLimiter {
  return NewMaxOfRateLimiter(
    NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
    // 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
    &BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
  )
}

From time to time, this stretched the disk expansion operation out to 15+ minutes, leaving the corresponding pod unavailable for all that time.

The only option that let us reduce the potential downtime easily and painlessly was to use our own build of external-resizer with the maximum retry delay capped at 5 seconds:

workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Second)
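For context, this is roughly how such a rate limiter is plugged into a client-go work queue. This is a sketch of the mechanism rather than external-resizer's actual wiring; the queue name is arbitrary.

package resizer

import (
  "time"

  "golang.org/x/time/rate"
  "k8s.io/client-go/util/workqueue"
)

// newResizeQueue builds a rate-limited work queue whose exponential backoff
// is capped at 5 seconds instead of client-go's default 1000 seconds.
func newResizeQueue() workqueue.RateLimitingInterface {
  limiter := workqueue.NewMaxOfRateLimiter(
    workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Second),
    &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
  )
  return workqueue.NewNamedRateLimitingQueue(limiter, "resize-volumes")
}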

We did not consider it necessary to urgently start a discussion and patch external-resizer, because offline disk resize is a relic that will soon disappear from all cloud providers.

How do I start using it?


The driver is supported in Kubernetes version 1.15 and higher. For the driver to work, the following requirements must be met:

  • The --allow-privileged flag is set to true for the API server and kubelet;
  • --feature-gates=VolumeSnapshotDataSource=true,KubeletPluginsWatcher=true,CSINodeInfo=true,CSIDriverRegistry=true is enabled for the API server and kubelet;
  • Mount propagation must be enabled in the cluster. When using Docker, the daemon must be configured to allow shared mounts.

All the steps needed for the installation itself are described in the README. Installing boils down to creating objects in Kubernetes from manifests.

For the driver to work, you will need the following:

  • Specify the Yandex.Cloud folder ID (folder-id) in the manifest (see the documentation);
  • A service account is used to interact with the Yandex.Cloud API in the CSI driver, and you must pass the service account's authorized key in the Secret manifest. The documentation describes how to create a service account and obtain the keys (a sketch of how such a key can be used follows the list).
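As an illustration, here is roughly how an authorized key (for example, mounted from that Secret as a JSON file) can be turned into SDK credentials. The file path is a placeholder, and the helper is a sketch rather than the driver's actual code.

package driver

import (
  "context"
  "log"

  ycsdk "github.com/yandex-cloud/go-sdk"
  "github.com/yandex-cloud/go-sdk/iamkey"
)

// buildSDK constructs a Yandex.Cloud SDK client from a service account's
// authorized key mounted into the container.
func buildSDK(ctx context.Context) *ycsdk.SDK {
  // Read the authorized key, e.g. from a file projected out of the Secret.
  key, err := iamkey.ReadFromJSONFile("/etc/csi-secret/key.json") // placeholder path
  if err != nil {
    log.Fatal(err)
  }

  creds, err := ycsdk.ServiceAccountKey(key)
  if err != nil {
    log.Fatal(err)
  }

  sdk, err := ycsdk.Build(ctx, ycsdk.Config{Credentials: creds})
  if err != nil {
    log.Fatal(err)
  }
  return sdk
}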

In short, give it a try, and we will be glad to hear your feedback and receive new issues if you run into any problems!

Further support


In closing, we would like to note that we implemented this CSI driver not out of a great desire to have fun writing Go applications, but because of a pressing need inside the company. Maintaining our own implementation does not seem worthwhile, so if Yandex shows interest and decides to keep maintaining the driver, we will gladly hand the repository over to them.

In addition, Yandex probably has its own CSI driver implementation in its managed Kubernetes clusters, which could be released as open source. We would also see this outcome as favorable: the community would get a proven driver from the service provider itself rather than from a third-party company.
