From life with Kubernetes: How we moved DBMSs (and more) out of review environments into a static one



Note: this article does not claim to describe a best practice. It describes the experience of a specific implementation of an infrastructure task using Kubernetes and Helm, which may be useful when solving related problems.

Using review environments in CI/CD can be very useful, both for developers and for system engineers. Let's first get on the same page about what they are:

  1. Review environments can be created from separate branches in Git repositories defined by developers (the so-called feature branches).
  2. They can have separate DBMS instances, queue processors, caching services, etc.; in general, everything needed to fully reproduce the production environment.
  3. They allow parallel development, significantly accelerating the release of new features in the application. At the same time, dozens of such environments may be required every day, which is why the speed of their creation is critical.

At the intersection of the second and third points, difficulties often arise: since the infrastructure can vary a lot, deploying its components may take a long time. That time includes, for example, restoring the database from a pre-prepared backup*. The article is about the fascinating path we once took to solve this problem.

* By the way, we have already written about large database dumps in this context in our article on speeding up database bootstrapping.

The problem and the way to solve it


In one of our projects, we were given the task of "creating a single entry point for developers and QA engineers." Technically, this wording meant the following:

  1. To simplify the work of QA engineers and some other employees, move all the databases (and the corresponding vhosts) used in review environments into a separate, static environment. Given the circumstances of the project, this way of interacting with them was optimal.
  2. Reduce the time it takes to create a review environment. This means the whole process of creating one from scratch, including database cloning, migrations, etc.

From the implementation point of view, the main problem is ensuring idempotency when creating and deleting review environments. To achieve this, we changed the mechanism for creating review environments by first migrating the PostgreSQL, MongoDB, and RabbitMQ services to a static environment. "Static" here means a "permanent" environment that is not created at a user's request (as review environments are).
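
To make "idempotency" concrete: creating and deleting a review environment must tolerate repeated runs. A schematic psql sketch of the idea (the database name is illustrative; the actual scripts below achieve the same effect with explicit checks and || true):

-- (re)creating a review environment: safe to run repeatedly
DROP DATABASE IF EXISTS review_db;
CREATE DATABASE review_db;

-- deleting a review environment: safe even if the database is already gone
DROP DATABASE IF EXISTS review_db;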

Important! The static-environment approach is far from ideal; its specific shortcomings are discussed at the end of the article. However, we share this experience in detail, since it may be more or less applicable to other tasks, and it can also serve as an argument when discussing infrastructure design.

So, the sequence of actions in the implementation:

  • When a review environment is created, the following should happen once: databases are created in the two DBMSs (MongoDB and PostgreSQL), the databases are restored from a backup/template, and a vhost is created in RabbitMQ. This requires a convenient way to fetch current dumps. (If you already had review environments before, most likely you already have a ready-made solution for this.)
  • When a review environment is decommissioned, the databases and the virtual host in RabbitMQ must be deleted.

In our case, the infrastructure runs on Kubernetes (with Helm). Therefore, Helm hooks turned out to be an excellent fit for the tasks above. They can be executed both before all other components of a Helm release are created and after they are removed. So:

  • for the initialization task, we will use a pre-install hook to run it before all the resources of the release are created;
  • for the deletion task, a post-delete hook.
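
In Helm, such hooks are just annotations on the resource. Schematically (the full manifests follow below):

metadata:
  annotations:
    "helm.sh/hook": "pre-install"    # or "post-delete" for the cleanup resources
    "helm.sh/hook-weight": "5"       # controls ordering among hooks of the same type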

Let's move on to implementation details.

Practical implementation


In the original version, this project used only one Job consisting of three containers. That is not entirely convenient: the result is a large manifest that is, frankly, hard to read. Therefore, we split it into three small Jobs.

The following is a listing for PostgreSQL, and the other two (MongoDB and RabbitMQ) are identical in manifest structure:

{{- if .Values.global.review }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: db-create-postgres-database
  annotations:
    "helm.sh/hook": "pre-install"
    "helm.sh/hook-weight": "5"
spec:
  template:
    metadata:
      name: init-db-postgres
    spec:
      volumes:
      - name: postgres-scripts
        configMap:
          defaultMode: 0755
          name: postgresql-configmap
      containers:
      - name: init-postgres-database
        image: private-registry/postgres 
        command: ["/docker-entrypoint-initdb.d/01-review-load-dump.sh"]
        volumeMounts:
        - name: postgres-scripts
          mountPath: /docker-entrypoint-initdb.d/01-review-load-dump.sh
          subPath: review-load-dump.sh
        env:
{{- include "postgres_env" . | indent 8 }}
      restartPolicy: Never
{{- end }}

Comments on the contents of the manifest:

  1. The Job is created only for review environments. The review flag is passed from CI/CD as a Helm value (see the if .Values.global.review condition).
  2. The script the Job runs comes from a ConfigMap. The ConfigMap must be created before the Job itself, which is what the hook-weight values ensure.
  3. cURL is used to download the current dump, which is then restored into the PostgreSQL database.
  4. A PostgreSQL image is used because it contains both the client utilities and the shell needed to run the script.

PostgreSQL


The most interesting part is in the shell script (review-load-dump.sh) mentioned in the listing above. What are the general options for restoring a database in PostgreSQL?

  1. "Standard" recovery from backup;
  2. Recovery using templates .

In our case, the difference between the two approaches is primarily in how quickly a database can be created for the new environment. With the first, we download the database dump and restore it with pg_restore; for us, this turned out to be slower than the second method, so that is the one we chose.
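
For reference, the "recovery using templates" from the second item boils down to a single SQL statement (the database names here are illustrative):

-- physically clone a new database from a prepared template
CREATE DATABASE review_db TEMPLATE stage_template;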

With the second option (recovery from templates), the database can be cloned at the physical level, without sending data to it remotely from a container in another environment, which reduces recovery time. However, there is a limitation: you cannot clone a database that still has active connections. Since we use stage as the static environment (and not a dedicated review environment), we need to create a second database and turn it into a template, refreshing it daily (for example, in the morning). A small CronJob was prepared for this:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: update-postgres-template
spec:
  schedule: "50 4 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          imagePullSecrets:
          - name: registrysecret
          volumes:
          - name: postgres-scripts
            configMap:
              defaultMode: 0755
              name: postgresql-configmap-update-cron
          containers:
          - name: cron
            image: private-registry/postgres
            command: ["/docker-entrypoint-initdb.d/update-postgres-template.sh"]
            volumeMounts:
            - name: postgres-scripts
              mountPath: /docker-entrypoint-initdb.d/update-postgres-template.sh
              subPath: update-postgres-template.sh
            env:
{{- include "postgres_env" . | indent 12 }}

Publishing the full ConfigMap manifest with the script probably does not add much value (let us know in the comments if that is not the case). Instead, here is the most important part, the bash script:

#!/bin/bash -x

CREDENTIALS="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}/postgres"

psql -d "${CREDENTIALS}" -w -c "REVOKE CONNECT ON DATABASE ${POSTGRES_DB_TEMPLATE} FROM public"
psql -d "${CREDENTIALS}" -w -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '${POSTGRES_DB_TEMPLATE}'"

curl --fail -vsL ${HOST_FORDEV}/latest_${POSTGRES_DB_STAGE}.psql -o /tmp/${POSTGRES_DB}.psql

psql -d "${CREDENTIALS}" -w -c "ALTER DATABASE ${POSTGRES_DB_TEMPLATE} WITH is_template false allow_connections true;"
psql -d "${CREDENTIALS}" -w -c "DROP DATABASE ${POSTGRES_DB_TEMPLATE};" || true
psql -d "${CREDENTIALS}" -w -c "CREATE DATABASE ${POSTGRES_DB_TEMPLATE};" || true
pg_restore -U ${POSTGRES_USER} -h ${POSTGRES_HOST} -w -j 4 -d ${POSTGRES_DB_TEMPLATE} /tmp/${POSTGRES_DB}.psql

psql -d "${CREDENTIALS}" -w -c "ALTER DATABASE ${POSTGRES_DB_TEMPLATE} WITH is_template true allow_connections false;"

rm -v /tmp/${POSTGRES_DB}.psql

Several databases can be restored from one template at the same time without any conflicts. The main thing is that connections to it must be prohibited and the database itself must be marked as a template. This is done in the penultimate step.
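
If you want to double-check that the template ended up in the required state, a quick query against pg_database shows it (the database name is illustrative):

SELECT datname, datistemplate, datallowconn
FROM pg_database
WHERE datname = 'stage_template';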

The manifest containing the shell script for restoring the database turned out like this:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-configmap
  annotations:
    "helm.sh/hook": "pre-install"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
data:
  review-load-dump.sh: |
    #!/bin/bash -x

    CREDENTIALS="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}/postgres"

    if [ "$( psql -d "${CREDENTIALS}" -tAc "SELECT CASE WHEN EXISTS (SELECT * FROM pg_stat_activity WHERE datname = '${POSTGRES_DB}' LIMIT 1) THEN 1 ELSE 0 END;" )" = '1' ]
      then
          echo "Open connections has been found in ${POSTGRES_DB} database, will drop them"
          psql -d "${CREDENTIALS}" -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '${POSTGRES_DB}' -- AND pid <> pg_backend_pid();"
      else
          echo "No open connections has been found ${POSTGRES_DB} database, skipping this stage"
    fi

    psql -d "${CREDENTIALS}" -c "DROP DATABASE ${POSTGRES_DB}"

    if [ "$( psql -d "${CREDENTIALS}" -tAc "SELECT 1 FROM pg_database WHERE datname='${POSTGRES_DB}'" )" = '1' ]
      then
          echo "Database ${POSTGRES_DB} still exists, delete review job failed"
          exit 1
      else
          echo "Database ${POSTGRES_DB} does not exist, skipping"
    fi


    psql -d "${CREDENTIALS}" -w -c "CREATE DATABASE ${POSTGRES_DB} TEMPLATE \"loot-stage-copy\""

As you can see, hook-delete-policy is used here. Details on how these policies work are written here. In the manifest above we use before-hook-creation,hook-succeeded, which fulfills the following requirements: delete the previous object before a new hook is created, and delete the hook only if it completed successfully.

We will delete the database using the script from this ConfigMap:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-configmap-on-delete
  annotations:
    "helm.sh/hook": "post-delete, pre-delete"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  review-delete-db.sh: |
    #!/bin/bash -e

    CREDENTIALS="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}/postgres"

    psql -d "${CREDENTIALS}" -w postgres -c "DROP DATABASE ${POSTGRES_DB}"

Although we moved it into a separate ConfigMap, the command could also be placed directly in the Job's command: after all, it fits in a one-liner without complicating the look of the manifest itself.
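
For illustration, a sketch of what such an inlined command might look like in the Job spec (a hedged example; the environment variables are assumed to come from the same postgres_env include):

containers:
- name: delete-postgres-database
  image: private-registry/postgres
  command: ["/bin/bash", "-c"]
  args:
  - psql -d "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}/postgres" -w -c "DROP DATABASE ${POSTGRES_DB}"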

If the PostgreSQL template option does not suit you or does not fit for some reason, you can fall back to the "standard" recovery path mentioned above, using a backup. The algorithm is trivial:

  1. Every night, a database backup is made so that it can be downloaded from the local network of the cluster.
  2. At the time of creation of the review environment, the database is loaded and restored from the dump.
  3. Once the dump is restored, all the other actions are performed.

In this case, the recovery script looks approximately like this:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-configmap
  annotations:
    "helm.sh/hook": "pre-install"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
data:
  review-load-dump.sh: |
    #!/bin/bash -x

    CREDENTIALS="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}/postgres"
    psql -d "${CREDENTIALS}" -w -c "DROP DATABASE ${POSTGRES_DB}" || true
    psql -d "${CREDENTIALS}" -w -c "CREATE DATABASE ${POSTGRES_DB}"

    curl --fail -vsL ${HOST_FORDEV}/latest_${POSTGRES_DB_STAGE}.psql -o /tmp/${POSTGRES_DB}.psql

    psql -d "${CREDENTIALS}" -w -c "CREATE EXTENSION ip4r;"
    pg_restore -U ${POSTGRES_USER} -h ${POSTGRES_HOST} -w -j 4 -d ${POSTGRES_DB} /tmp/${POSTGRES_DB}.psql
    rm -v /tmp/${POSTGRES_DB}.psql

The procedure corresponds to what has already been described above; the only addition is removing the downloaded dump file after all the work is done.

Note: in both the recovery script and the deletion script, the database is dropped every time. This is done to avoid possible conflicts when a review environment is re-created: you must be sure that the database has really been deleted. This problem could also potentially be solved by adding the --clean flag to pg_restore, but be careful: this flag only cleans up the objects that are present in the dump itself, so in our case this option does not work.

As a result, we got a working mechanism that still needs further improvements (up to replacing the Bash scripts with more elegant code). We will leave those outside the scope of this article (although comments on the topic are, of course, welcome).

MongoDB


The next component is MongoDB. The main difficulty with it is that the option of copying the database (as in PostgreSQL) exists for this DBMS only nominally, because:

  1. It is deprecated.
  2. In our testing, we did not find a big difference in database recovery time compared to a plain mongorestore. Keep in mind, though, that this testing was done within a single project; in your case, the results may be completely different.

It turns out that with a large database a serious problem may arise: we save time on restoring the database in PgSQL, but restoring the dump in Mongo still takes a very long time. At the time of writing, and within the existing infrastructure, we saw three approaches (which, by the way, can be combined):

  1. Recovery can take a long time if, for example, your DBMS sits on a network file system (in non-production cases). Then you can simply move the DBMS from stage to a separate node and use local storage. Since this is not production, the speed of creating a review environment is more important to us.
  2. Each restore Job can be moved into its own pod, which allows migrations and other processes that depend on the DBMS to be executed in advance. This way we save time by completing them ahead of time.
  3. Sometimes you can reduce the size of the dump by deleting old / irrelevant data - up to the point that it is enough to leave only the database structure. Of course, this is not for those cases when a full dump is required (say, for QA testing tasks).
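
As an illustration of the third point, a dump can be trimmed at creation time, for example by excluding heavy or irrelevant collections (the collection names here are hypothetical):

mongodump --host ${MONGODB_HOST} --db ${MONGODB_NAME_STAGE} \
  --excludeCollection=audit_logs --excludeCollection=event_history \
  --gzip --archive=latest_${MONGODB_NAME_STAGE}.gz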

If you do not need to quickly create review environments, then all the described difficulties can be ignored.

Since we cannot clone the database the way we do in PgSQL, we will go with the first option described earlier, i.e. standard recovery from a backup. The algorithm is the same as with PgSQL, which is easy to see from the manifests:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: mongodb-scripts-on-delete
  annotations:
    "helm.sh/hook": "post-delete, pre-delete"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  review-delete-db.sh: |
    #!/bin/bash -x

    mongo ${MONGODB_NAME} --eval "db.dropDatabase()" --host ${MONGODB_REPLICASET}/${MONGODB_HOST}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: mongodb-scripts
  annotations:
    "helm.sh/hook": "pre-install"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
data:
  review-load-dump.sh: |
    #!/bin/bash -x

    curl --fail -vsL ${HOST_FORDEV}/latest_${MONGODB_NAME_STAGE}.gz -o /tmp/${MONGODB_NAME}.gz

    mongo ${MONGODB_NAME} --eval "db.dropDatabase()" --host ${MONGODB_REPLICASET}/${MONGODB_HOST}
    mongorestore --gzip --nsFrom "${MONGODB_NAME_STAGE}.*" --nsTo "${MONGODB_NAME}.*" --archive=/tmp/${MONGODB_NAME}.gz --host ${MONGODB_REPLICASET}/${MONGODB_HOST}

There is an important detail here. In our case, MongoDB runs in a cluster, and you need to be sure that the connection always goes to the Primary node. If you specify, for example, the first host in the quorum, then after some time it may switch from Primary to Secondary, which will prevent the database from being created. Therefore, you need to connect not to a single host but to the ReplicaSet, listing all of its hosts. For this reason alone, it makes sense to run MongoDB as a StatefulSet, so that the host names are always the same (not to mention that MongoDB is a stateful application by nature). With this option you are guaranteed to connect to the Primary node.
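
For example, with a three-member StatefulSet the variables used in the manifests above could look roughly like this (the names are illustrative):

MONGODB_REPLICASET=rs0
MONGODB_HOST=mongodb-0.mongodb-headless:27017,mongodb-1.mongodb-headless:27017,mongodb-2.mongodb-headless:27017
# the client then connects with: --host ${MONGODB_REPLICASET}/${MONGODB_HOST}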

For MongoDB, we also delete the database before creating the review - this is done for the same reasons as in PostgreSQL.

One last nuance: since the database for review lives in the same environment as stage, the cloned database needs a separate name. If the dump is not a plain BSON file (ours is an archive), renaming it via --db results in the following error:

the --db and --collection args should only be used when restoring from a BSON file. Other uses are deprecated and will not exist in the future; use --nsInclude instead

That is why --nsFrom and --nsTo are used in the example above.

We did not encounter any other problems with restoring. Finally, I will only add that documentation for MongoDB's copyDatabase is available here, in case you want to try that option.

RabbitMQ


The last application on our list of requirements was RabbitMQ. It is simple: you need to create a new vhost on behalf of the user the application will connect with, and then delete it afterwards.

The manifests for creating and deleting the vhost:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-configmap
  annotations:
    "helm.sh/hook": "pre-install"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
data:
  rabbitmq-setup-vhost.sh: |
    #!/bin/bash -x

    /usr/local/bin/rabbitmqadmin -H ${RABBITMQ_HOST} -u ${RABBITMQ_USER} -p ${RABBITMQ_PASSWORD} declare vhost name=${RABBITMQ_VHOST}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-configmap-on-delete
  annotations:
    "helm.sh/hook": "post-delete, pre-delete"
    "helm.sh/hook-weight": "1"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  rabbitmq-delete-vhost.sh: |
    #!/bin/bash -x

    /usr/local/bin/rabbitmqadmin -H ${RABBITMQ_HOST} -u ${RABBITMQ_USER} -p ${RABBITMQ_PASSWORD} delete vhost name=${RABBITMQ_VHOST}
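
Depending on the setup, one more step may be needed: granting the application user permissions on the freshly created vhost. rabbitmqadmin can do that too; a hedged sketch (RABBITMQ_APP_USER is a hypothetical variable):

/usr/local/bin/rabbitmqadmin -H ${RABBITMQ_HOST} -u ${RABBITMQ_USER} -p ${RABBITMQ_PASSWORD} declare permission vhost=${RABBITMQ_VHOST} user=${RABBITMQ_APP_USER} configure=".*" write=".*" read=".*"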

We have not (so far?) run into any major difficulties with RabbitMQ. In general, the same approach can be applied to any other service that has no critical dependence on data.

Disadvantages


Why doesn't this solution claim to be a "best practice"?

  1. The stage environment becomes a single point of failure.
  2. If an application in the stage environment runs in only one replica, we become even more dependent on the node this application runs on. Accordingly, as the number of review environments grows, the load on that node grows proportionally, with no way to balance it.

Given the capabilities of this particular project's infrastructure, we could not fully solve these two problems; however, clustering (adding new nodes) and vertical scaling can minimize the potential damage.

Conclusion


As an application evolves and the number of developers grows, sooner or later the load on review environments increases and new requirements are added to them. It is important for developers to deliver the next changes to production as quickly as possible, and to make that possible we need dynamic review environments that make development "parallel". As a result, the load on the infrastructure keeps growing, and the time to create such environments keeps increasing.

This article is based on real and rather specific experience. Only in exceptional cases do we isolate services in static environments, and this was exactly such a case. This forced measure allowed us to speed up the development and debugging of the application, thanks to the ability to quickly create review environments from scratch.

When we started on this task, it seemed very simple, but as we worked on it we found many nuances. It is these nuances that are collected in this article: even if they are not universal, they can serve as an example or inspiration for your own decisions on speeding up review environments.
