Project deployment methodology used by Slack

Rolling out a new release to production requires a careful balance between deployment speed and the reliability of the result. Slack values fast iteration, short feedback loops, and responsiveness to user requests. On top of that, the company has hundreds of engineers who strive to be as productive as possible.



The authors of the article translated here say that a company that wants to uphold these values while continuing to grow must constantly improve its deployment system. It needs to invest in the transparency and reliability of its processes so that they keep up with the scale of the project. Below we talk about the workflows that have taken shape at Slack, and about some of the decisions that led the company to the deployment system it uses today.

How project deployment processes work today


Every PR (pull request) at Slack must go through code review and pass all tests. Only once those conditions are met can an engineer merge their code into the project's master branch. However, such code is deployed only during North American business hours, so that our staff are at work and fully prepared to deal with any unexpected problems.
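For illustration, a deploy scheduler might gate on a business-hours window like this. This is a minimal sketch; the exact hours and the time zone are assumptions, since the article only says "North American business hours":

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

# Assumed window; the article only specifies "North American business hours".
DEPLOY_TZ = ZoneInfo("America/New_York")
DEPLOY_HOURS = range(9, 17)  # 09:00-16:59 local time (assumption)
DEPLOY_DAYS = range(0, 5)    # Monday-Friday (assumption)

def in_deploy_window(now: Optional[datetime] = None) -> bool:
    """Return True if a scheduled deploy may start right now."""
    now = now or datetime.now(DEPLOY_TZ)
    return now.weekday() in DEPLOY_DAYS and now.hour in DEPLOY_HOURS
```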

Every day we complete about 12 scheduled deploys. During each deploy, the engineer designated as the deploy commander is responsible for rolling the new build out to production. This is a multi-step process that brings the build into service gradually. Thanks to this approach, we can catch errors before they affect all of our users. If there are too many errors, the deploy can be rolled back; if a particular problem is discovered after the release, a hotfix can easily be shipped for it.


The interface of the Checkpoint system used by Slack to deploy projects.

The process of deploying a new release to production can be broken down into four steps.

▍1. Creating a release branch


Each release starts with a new release branch cut at a particular point in our Git history. This lets us tag the release and gives us a place to land hotfixes for bugs found while the release is being prepared for production.
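As an illustration, cutting a release branch might look something like the sketch below. The date-based naming scheme and the use of subprocess are assumptions for illustration, not Slack's actual tooling:

```python
import subprocess
from datetime import datetime, timezone

def cut_release_branch(repo_dir: str) -> str:
    """Create a release branch at the current tip of master.

    The date-based branch name here is a hypothetical convention.
    """
    branch = datetime.now(timezone.utc).strftime("release-%Y-%m-%d-%H%M")
    # Branch from the latest reviewed-and-tested commit on master.
    subprocess.run(["git", "checkout", "master"], cwd=repo_dir, check=True)
    subprocess.run(["git", "pull", "--ff-only"], cwd=repo_dir, check=True)
    subprocess.run(["git", "checkout", "-b", branch], cwd=repo_dir, check=True)
    # Hotfixes for bugs found while preparing the release can later be
    # cherry-picked onto this branch and tagged.
    return branch
```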

▍2. Staging deployment


The next step is to deploy the build to staging servers and run an automated check of the project's overall health (a smoke test). The staging environment is a production-grade environment that receives no external traffic. In it we also do additional manual testing, which gives us extra confidence that the modified project works correctly; automated tests alone are not enough for that.
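A smoke test of this kind can be as simple as hitting a few critical endpoints of the staging build and checking for healthy responses. Here is a minimal sketch; the staging hostname and endpoint paths are invented for illustration:

```python
import sys
import requests

# Hypothetical staging host and endpoints; real smoke tests would
# exercise the product's actual critical paths.
STAGING = "https://staging.example.com"
ENDPOINTS = ["/health", "/api/test/auth", "/api/test/messages"]

def smoke_test() -> bool:
    """Return True if every critical endpoint answers with HTTP 200."""
    for path in ENDPOINTS:
        resp = requests.get(STAGING + path, timeout=10)
        if resp.status_code != 200:
            print(f"FAIL {path}: HTTP {resp.status_code}")
            return False
        print(f"OK   {path}")
    return True

if __name__ == "__main__":
    sys.exit(0 if smoke_test() else 1)
```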

▍3. Deploying to the dogfood and canary environments


Deployment to production begins with the dogfood environment, a set of hosts that serve our internal Slack workspaces. Since we are very active Slack users ourselves, this approach has helped catch many bugs at an early stage of the deploy. Once we are sure the core functionality is not broken, the build is deployed to the canary environment, a system that receives about 2% of production traffic.

▍4. Gradual rollout to production


If the new release's monitoring metrics are stable, and we have received no complaints after deploying to the canary environment, we continue gradually moving production servers to the new release. The rollout is divided into stages: 10%, 25%, 50%, 75%, and 100%. As a result, we can shift production traffic to the new release slowly, leaving ourselves time to investigate if any anomalies come to light.
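The staged rollout can be pictured as a loop that widens the deploy one stage at a time, pausing after each step to let the metrics accumulate. The sketch below is an illustration of the idea, not Slack's actual tooling; the helper functions and the soak time are placeholders:

```python
import time

STAGES = [10, 25, 50, 75, 100]  # percent of production servers
SOAK_SECONDS = 600              # assumed observation window per stage

def deploy_to_percent(build_id: str, percent: int) -> None:
    # Placeholder: point `percent` of production hosts at `build_id`.
    print(f"deploying {build_id} to {percent}% of production")

def metrics_healthy(build_id: str) -> bool:
    # Placeholder: query monitoring for error-rate anomalies.
    return True

def roll_back(build_id: str) -> None:
    # Placeholder: repoint all hosts at the previous working build.
    print(f"rolling back {build_id}")

def gradual_rollout(build_id: str) -> bool:
    """Widen the deploy one stage at a time, watching the metrics."""
    for percent in STAGES:
        deploy_to_percent(build_id, percent)
        time.sleep(SOAK_SECONDS)  # give anomalies time to surface
        if not metrics_healthy(build_id):
            roll_back(build_id)
            return False
    return True
```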

▍What if something goes wrong during a deploy?


Modifying code is always a risk, but we cope with it thanks to well-trained deploy commanders who manage the process of bringing a new release into production, watch the monitoring dashboards, and coordinate the engineers whose code is being shipped.

If something really does go wrong, we try to detect the problem as early as possible. We investigate it, find the PR that is causing the errors, roll it back, analyze it carefully, and create a new build. Sometimes, though, a problem goes unnoticed until the project reaches production. In that situation the most important thing is to restore service, so we roll back to the previous working build immediately, before we even begin investigating the problem.

Deployment Building Blocks


Let's look at the technologies underlying our deployment system.

▍Fast deployments


In retrospect, the workflow described above may seem completely obvious, but our deployment system did not arrive at it right away.

When the company was much smaller, our entire application could run on ten Amazon EC2 instances, and deploying meant using rsync to quickly synchronize all the servers. New code was separated from production by a single step, the staging environment. Builds were created and tested there and then went straight to production. The system was very easy to understand, and it let any engineer deploy their code at any time.

But as our customer base grew, so did the scale of the infrastructure needed to run the product. Soon our push-based deployment model, which sent new code out to the servers, could no longer keep up with the system's constant growth: every server we added increased the time a deploy took to finish. Even strategies based on running rsync in parallel have their limits.

In the end we solved the problem by switching to a fully parallel, pull-based deployment system organized quite differently from the old one. Instead of pushing code to the servers with a synchronization script, each server now downloads the new build on its own, learning that it needs to do so by watching a Consul key for changes. Because the servers download the code in parallel, we can keep deploys fast even as the system keeps growing.


1. Production servers watch a Consul key. 2. The key changes, telling the servers to start downloading the new code. 3. The servers download tarballs with the application code
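The pull-based scheme rests on Consul's blocking queries: a server long-polls the key and reacts when the returned index changes. Below is a minimal sketch against Consul's HTTP KV API; the key name and the download step are assumptions, not Slack's actual implementation:

```python
import base64
import requests

CONSUL = "http://localhost:8500"
KEY = "deploy/current-build"  # hypothetical key name

def download_build(build_url: str) -> None:
    # Placeholder: fetch and unpack the tarball named by the key.
    print(f"downloading new build from {build_url}")

def watch_deploy_key() -> None:
    """Long-poll the Consul key; when it changes, fetch the new build."""
    index = 0
    while True:
        # Blocking query: Consul holds the request open (up to `wait`)
        # until the key's ModifyIndex exceeds `index`.
        resp = requests.get(
            f"{CONSUL}/v1/kv/{KEY}",
            params={"index": index, "wait": "5m"},
            timeout=330,
        )
        resp.raise_for_status()
        new_index = int(resp.headers["X-Consul-Index"])
        if new_index != index:
            index = new_index
            value = base64.b64decode(resp.json()[0]["Value"]).decode()
            download_build(value)

if __name__ == "__main__":
    watch_deploy_key()
```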

▍Atomic deployments


Another solution that helped us arrive at our multi-tier deployment system was atomic deploys.

Before we used atomic deploys, every deploy could produce a flood of error messages. The problem was that copying new files to the production servers was not atomic, which created a short window during which code that called new functions was available before the functions themselves were. When such code ran, it returned internal errors, which surfaced as failed API requests and broken web pages.

The team that tackled this problem solved it by introducing the concept of “hot” and “cold” directories. Code in the hot directory serves production traffic, while code in a cold directory is only being prepared for use while the system runs. During a deploy, the new code is copied into the unused cold directory, and then, once the server has no active processes, the directories are switched instantly.


1. The application code is unpacked into a “cold” directory. 2. The system is switched to the “cold” directory, which becomes “hot” (an atomic operation)
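One common way to make a hot/cold switch atomic on a POSIX system is to serve code through a symlink and replace it with rename(2), which is atomic. The sketch below shows that idea; the directory layout is illustrative and not necessarily how Slack implements the switch:

```python
import os

def atomic_switch(cold_dir: str, hot_link: str) -> None:
    """Atomically repoint the 'hot' symlink at the 'cold' directory.

    os.rename() over an existing symlink is a single atomic step on
    POSIX filesystems, so readers see either the old target or the
    new one, never a half-copied tree.
    """
    tmp_link = hot_link + ".tmp"
    # Build the new symlink off to the side...
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(cold_dir, tmp_link)
    # ...then swap it into place with one atomic rename.
    os.rename(tmp_link, hot_link)

# Usage: unpack the build into /app/builds/1234, then call
# atomic_switch("/app/builds/1234", "/app/current").
```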

Bottom line: a shift in emphasis toward reliability


By 2018 the project had grown to a scale at which very fast deploys began to harm the stability of the product. We had a very advanced deployment system into which we had invested a lot of time and effort; what we needed was to restructure and improve how deploys were organized. We had become a fairly large company whose product was used all over the world for uninterrupted communication and for solving important problems, so reliability became the focus of our attention.

We needed to make deploying new Slack releases safer, and that need drove the improvements to our deployment system discussed above. Under the hood we still rely on fast, atomic deploys; what changed is how exactly a deploy is carried out. The new system is designed to roll new code out gradually, tier by tier, across different environments, and we now use more advanced tooling and monitoring than before. This lets us catch and fix errors long before they have a chance to reach end users.

But we are not going to stop there. We keep improving the system with more advanced tooling and automation.

Dear readers! What does the process of deploying new releases look like where you work?

