SSL Certificate Management: From Chaos on Hundreds of Servers to a Centralized Solution

What lies behind the words “Europe's largest online school”? On the one hand, it is 1 thousand lessons per hour, 10 thousand teachers, and 100 thousand students. For me, an infrastructure engineer, it is also 200+ servers, hundreds of services (micro and not so micro), and domain names from the 2nd to the 6th level. SSL is needed everywhere and, accordingly, so is a certificate.

For the most part, we use Let's Encrypt certificates. Their advantages are that they are free and that issuance is fully automated. On the other hand, they have a peculiarity: a short validity period of only three months. Accordingly, they have to be renewed frequently. We tried to automate this one way or another, but there was still a lot of manual work, and something was constantly breaking. A year ago we came up with a simple and reliable way to renew this pile of certificates, and we have since forgotten about the problem.

From one certificate on one server to hundreds in several data centers

Once upon a time there was only one server. On it lived certbot, run from cron. Then that one server could no longer cope with the load, so a second server appeared. And then more and more. Each had its own certificates with its own unique set of names, and renewal had to be configured on every one of them. In some places, when scaling out, existing certificates were copied over but renewal was forgotten.

To obtain a Let's Encrypt certificate, you must prove that you control the domain name specified in the certificate. This is usually done with an HTTP request that the certificate authority makes back to your server.

Here are a couple of standard difficulties we encountered as we grew:

: . .
HTTP. , . . - LDAP. - . .

In some places we had been using self-signed certificates for quite some time, and this seemed like a good solution wherever authentication is not needed, for example for internal testing. To stop the browser from constantly reporting a “suspicious site”, you just add our root certificate to the list of trusted ones, and that's that. But later difficulties arose here too.

The trouble is that BrowserStack, which our testers use, does not allow adding a certificate to the trusted list, at least for iPad, Mac, and iPhone. So testers had to put up with constantly popping-up warnings about dangerous sites.

Search for a solution

Of course, the first thing to do is set up monitoring, so that we learn about expiring certificates not after they have expired but a little earlier. Fine. Monitoring is in place, and now we know that certificates will expire soon here and there. And now what do we do?
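
As an illustration of what such a check might look like, here is a minimal Python sketch. It is not our actual monitoring: the function names `fetch_not_after` and `days_left`, the port, and the timeout are assumptions made for this example. It reads the `notAfter` field of the certificate a host serves and converts it into the number of days remaining.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate served by host:port.

    The default context verifies the chain, so a self-signed or already
    expired certificate raises an error here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: Optional[float] = None) -> int:
    """Whole days until the certificate expires (negative once expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)
```

A monitoring job could call `days_left(fetch_not_after(host))` for each host and alert when the result drops below some threshold chosen by the team.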

Big Ear is an old bot that won't ruin a certificate.

Why not use wildcard certificates? Let's! Let's Encrypt already issues them. True, you have to set up domain ownership validation through DNS. Our DNS lives in AWS Route53, so the AWS credentials would have to be distributed to every server. And whenever new servers appear, all of this would have to be copied there too.

Well, 3rd-level names are covered by a wildcard. But what about names of the 4th level and higher? We have many teams developing various services. These days it is customary to separate the frontend from the backend. If the frontend gets a 3rd-level name like service.skyeng.ru, the backend tends to be given the name api.service.skyeng.ru. Hmm, maybe we should forbid them from doing this in the future? Great idea! But what about the dozens of existing names? Could we herd them all into a single domain name with an iron hand, replacing all these names of different levels with URLs like skyeng.ru/service? Technically this is an option, but how long would it take? And how do we justify such an effort to the business? We have 30+ development teams, and persuading everyone would take at least six months. We would also be creating a single point of failure. Like it or not, this is a controversial decision.
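
The reason a wildcard does not help with deeper names is that `*` stands for exactly one leftmost label. A minimal Python sketch of that matching rule follows; the function name `covered_by` is made up for this illustration and is not taken from any library.

```python
def covered_by(pattern: str, name: str) -> bool:
    """True if a certificate name (exact or wildcard) covers the host name."""
    pattern, name = pattern.lower(), name.lower()
    if not pattern.startswith("*."):
        return pattern == name
    # "*" replaces exactly one leftmost label, never several and never zero.
    first_label, _, rest = name.partition(".")
    return bool(first_label) and rest == pattern[2:]


print(covered_by("*.skyeng.ru", "service.skyeng.ru"))      # frontend name
print(covered_by("*.skyeng.ru", "api.service.skyeng.ru"))  # backend name
```

With this rule the frontend name is covered and the backend name is not, so every 4th-level name needs either its own wildcard (`*.service.skyeng.ru`) or an explicit entry in a certificate.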

What other ideas are there? Maybe we could issue one certificate that includes absolutely everything and install it on all servers? This could solve our problems, but Let's Encrypt allows only 100 names per certificate, and we already have more names than that.
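
To make the arithmetic concrete, here is a small Python sketch that splits a name list into groups that fit the 100-name limit. The function name `split_names` and the generated service names are hypothetical and serve only as an illustration.

```python
from typing import List


def split_names(names: List[str], limit: int = 100) -> List[List[str]]:
    """Split a name list into consecutive groups of at most `limit` names."""
    return [names[i:i + limit] for i in range(0, len(names), limit)]


# Hypothetical example: 250 names no longer fit into one certificate.
names = [f"svc{i}.skyeng.ru" for i in range(250)]
print([len(group) for group in split_names(names)])
```

As soon as the list needs more than one group, the idea of a single certificate installed everywhere no longer holds.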

What about the testers? We had not come up with anything for them, and they complained constantly. As the saying goes, everything is nonsense except the bees, and the bees are nonsense too, but there are a lot of them. Each developer or tester is given a test server; we call these “testings”. Testings are not bees, but there are already over a hundred of them, and every project is deployed on each one. Every single one. So if production needs N certificates, each testing needs the same number. So far they are self-signed. It would be great to replace them with real ones...

Two playbooks and one source of truth

The swan, the crayfish, and the pike will never get the cart moving. We need a single control center for our servers. In our case, this is Ansible. Certbot on every server is evil. Let all certificates be stored in one place. If someone somewhere needs a certificate, they come to this place and take the latest version off the shelf. And we make sure that the certificates in this store are always up to date.

The AWS credentials are also kept in only one place. Accordingly, questions such as how to set up the AWS CLI on a new server or who has access to Route53 simply disappear.

All required certificates are described in one file in Ansible in YAML format:

    certificates:
      - common_name: skyeng.ru
        alt_names:
          - "*.skyeng.ru"
      - common_name: olympiad.skyeng.ru
        alt_names:
          - "*.olympiad.skyeng.ru"
          - api.content.olympiad.skyeng.ru
          - games.skyeng.ru
      - common_name: skyeng.tech
        alt_names:
          - "*.skyeng.tech"

      .  .  .

One playbook runs periodically, goes through this list, and does the hard work, which is essentially the same thing certbot does:

creates an account with Let's Encrypt Certificate Authority
generates a private key
generates a (not yet signed) certificate - the so-called certificate signing request
sends a signing request
receives a DNS challenge
puts received records in DNS
sends a signing request again
and, having finally received the signed certificate, puts it in the store.
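
To show what “puts received records in DNS” involves, here is a minimal Python sketch of how the DNS-01 record is derived under the ACME protocol (RFC 8555). It is an illustration rather than code from our playbook; the function names and the token and thumbprint values are placeholders, and in a real run both values come from the ACME exchange.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as ACME uses it."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return (record name, TXT value) for a DNS-01 challenge."""
    # A wildcard name is validated at its base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    # Key authorization: challenge token, a dot, the account key thumbprint.
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{base}", b64url(digest)


# Placeholder token and thumbprint, for illustration only.
name, value = dns01_record("*.skyeng.ru", "example-token", "example-thumbprint")
print(name, value)
```

The resulting TXT record is what has to be created in the DNS zone (Route53 in our case) before the CA is asked to validate the challenge.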

The playbook runs once a day. If it fails to renew some certificates for any reason, be it network problems or errors on the Let's Encrypt side, that is not a problem: they will be renewed on the next run.

Now, when SSL is needed on some service host, you can go to this store and take a few files from it. This is the simple operation that the second playbook performs. Which certificates a host needs is described in that host's parameters, in inventories/host_vars/server.yml:

    certificates:
      - common_name: skyeng.ru
        handler: reload nginx
      - common_name: crm.skyeng.ru

      .  .  .

If the files have changed, Ansible triggers a handler, typically reloading Nginx (in our case this is the default action). In the same way, you can obtain certificates from other CAs that use the ACME protocol.
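
The “copy, and act only on change” logic can be sketched in a few lines of Python. This is an illustration of the idea, not a replacement for the Ansible playbook. The directory layout (one directory per common name in the store) and the function names are assumptions made for the example; the `common_name` and `handler` keys and the default of `reload nginx` mirror the host parameters shown above.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List, Set


def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def install_if_changed(src: Path, dst: Path) -> bool:
    """Copy src over dst only when the content differs; report a change."""
    if dst.exists() and file_digest(src) == file_digest(dst):
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True


def deploy(store: Path, target: Path, certificates: List[Dict[str, str]],
           default_handler: str = "reload nginx") -> Set[str]:
    """Install each certificate's files; return the handlers to run once."""
    handlers: Set[str] = set()
    for cert in certificates:
        name = cert["common_name"]
        changed = False
        for src in sorted((store / name).glob("*")):
            if src.is_file():
                changed |= install_if_changed(src, target / name / src.name)
        if changed:
            # Each handler is collected once, however many files changed.
            handlers.add(cert.get("handler", default_handler))
    return handlers
```

Running `deploy` twice in a row returns an empty set the second time, which is the property that makes a daily run safe: nothing is reloaded unless a certificate was actually renewed.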

Summary

We used to have many different configurations. Something was constantly breaking. I often had to log in to servers and figure out what had fallen off this time.
Now we have two playbooks, and everything is described in one place. Everything runs like clockwork. Life has become more boring.

Testing

So, what about the testers and their testings? Each developer or tester is given a personal test server, a testing. There are currently about 200 of them. They have names of the form test-y123.skyeng.link, where 123 is the testing number. Creating and removing testings is automated, and one of the steps is installing an SSL certificate. The SSL certificate is generated in advance, with names built from a template:

    ssl_cert_pattern:
      - "*"
      - "*.auth"
      - "*.bill"

      .  .  .

About 30 names in total. So the certificate is issued with the names

    test-y123.skyeng.link
    *.test-y123.skyeng.link
    *.auth.test-y123.skyeng.link
    *.bill.test-y123.skyeng.link

etc.
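
The expansion from template to names can be sketched in Python as follows. The function name `expand_names` is hypothetical, and adding the bare host name automatically is an assumption inferred from the name list above.

```python
from typing import List


def expand_names(base: str, patterns: List[str]) -> List[str]:
    """Build the certificate name list for one testing from the template."""
    names = [base]  # assumption: the bare host name is always included
    for pattern in patterns:
        names.append(f"{pattern}.{base}")
    return names


for name in expand_names("test-y123.skyeng.link", ["*", "*.auth", "*.bill"]):
    print(name)
```

Because the template is shared, the name list for any testing number is produced the same way, which is what allows the certificates to be generated ahead of time.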

When a developer or tester leaves, their testing is deleted. The certificate stays, ready for reuse. All of it is stored you-know-where and distributed to hosts you-know-how.

PS

Repository with code .

On the same topic, it might also be interesting to read how Stack Overflow switched to HTTPS:

Hundreds of domains of different levels
Websockets
Lots of HTTP APIs (proxy issues)
Do everything and not drop performance

If you have any questions, write in the comments, I will be happy to answer.
