Problems with DNS in Kubernetes. Public post-mortem

Translator's note: this is a translation of a public post-mortem from the Preply engineering blog. It describes a conntrack problem in a Kubernetes cluster that led to partial downtime of some production services.

This article may be useful to those who want to learn a little more about post-mortems or to prevent potential DNS problems in the future.


It's not DNS.
There's no way it's DNS.
It was DNS.


A little about post-mortems and our processes at Preply


A post-mortem describes a failure or an incident in production. It includes a timeline of events, a description of the impact on users, the root cause, the actions taken, and the lessons learned.

Seeking SRE

At weekly pizza meetings with the technical team we share all kinds of information. One of the most important parts of these meetings is the post-mortem, which is most often accompanied by a presentation with slides and a deeper analysis of the incident. Even though we do not “clap” after post-mortems, we try to develop a blameless culture. We believe that writing and presenting post-mortems can help us (and others) prevent similar incidents in the future, which is why we share them.

People involved in an incident should feel that they can talk about it in detail without fear of punishment or retaliation. No blame! Writing a post-mortem is not a punishment but a learning opportunity for the whole company.

Keep CALMS & DevOps: S is for Sharing

Problems with DNS in Kubernetes. Post-mortem


Date: 02/28/2020

Authors: Amet U., Andrey S., Igor K., Aleksey P.

Status: Completed

Brief: Partial unavailability of DNS (26 min) for some services in the Kubernetes cluster

Impact: 15,000 events were lost for services A, B and C

Root cause: kube-proxy failed to correctly delete the old entry from the conntrack table, so some services still tried to connect to pods that no longer existed
E0228 20:13:53.795782       1 proxier.go:610] Failed to delete kube-system/kube-dns:dns endpoint connections, error: error deleting conntrack entries for UDP peer {100.64.0.10, 100.110.33.231}, error: conntrack command returned: ...
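For context, and not as part of the original post-mortem: entries like the one kube-proxy failed to remove can be inspected directly on a node via /proc/net/nf_conntrack. A minimal Python sketch, assuming root access on the node and the nf_conntrack module loaded:

# Hypothetical helper, not from the incident: list UDP conntrack entries that
# point at the cluster DNS service IP (the kube-dns ClusterIP from the log above).
DNS_SERVICE_IP = "100.64.0.10"

with open("/proc/net/nf_conntrack") as conntrack:
    for line in conntrack:
        fields = line.split()
        # Entry lines look roughly like:
        # ipv4 2 udp 17 27 src=10.2.1.5 dst=100.64.0.10 sport=53531 dport=53 src=<pod IP> dst=...
        if "udp" in fields and f"dst={DNS_SERVICE_IP}" in fields:
            # If the reply-side address belongs to a pod that no longer exists,
            # DNS packets for this flow keep going to a dead endpoint.
            print(line.rstrip())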

Trigger: Due to low load inside the Kubernetes cluster, CoreDNS-autoscaler reduced the number of pods in the deployment from three to two
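For reference (our illustration, not material from the original post-mortem): CoreDNS is usually autoscaled by cluster-proportional-autoscaler, whose “linear” mode derives the replica count from the cluster size. A rough sketch of that rule; the parameter values below are invented for illustration:

import math

def coredns_replicas(cores: int, nodes: int,
                     cores_per_replica: int = 256,
                     nodes_per_replica: int = 16,
                     min_replicas: int = 2) -> int:
    # Replicas grow with cluster size and never drop below the configured minimum.
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    return max(by_cores, by_nodes, min_replicas)

# When nodes are removed under low load, the computed value can drop,
# e.g. from three CoreDNS pods to two, as happened here.
print(coredns_replicas(cores=160, nodes=40))  # -> 3
print(coredns_replicas(cores=96, nodes=24))   # -> 2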

Resolution: Another application deploy initiated the creation of new nodes; CoreDNS-autoscaler added more pods to serve the cluster, which caused the conntrack table to be overwritten

Detection: Prometheus monitoring detected a large number of 5xx errors for services A, B, and C and triggered a call to the on-call engineers


5xx errors in Kibana

Actions


Action | Type | Responsible | Task
Disable autoscaler for CoreDNS | prevent | Amet U. | DEVOPS-695
Install a caching DNS server | mitigate | Max V. | DEVOPS-665
Configure conntrack monitoring | prevent | Amet U. | DEVOPS-674
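The “Configure conntrack monitoring” action above boils down to watching how full the conntrack table is. A minimal sketch of such a check, reading the standard /proc counters on a node; in practice these values are usually exported by node_exporter and alerted on in Prometheus, so the script is only an illustration:

from pathlib import Path

def read_int(path: str) -> int:
    return int(Path(path).read_text().strip())

# Both files are provided by the kernel when the nf_conntrack module is loaded.
count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
usage = count / limit
print(f"conntrack table: {count}/{limit} entries ({usage:.1%} used)")
if usage > 0.9:
    print("WARNING: conntrack table is almost full, new connections may be dropped")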

Lessons learned


What went well:

  • Monitoring worked well. The reaction was quick and organized.


What went wrong:

  • We still do not know the real root cause; it looks like a specific bug in conntrack
  • The actions above fix the consequences, not the root cause (the bug itself)
  • We knew that sooner or later we could have problems with DNS, but had not prioritized those tasks

Where we were lucky:

  • The next deploy triggered CoreDNS-autoscaler again, and the conntrack table was overwritten

Timeline of events (EET)


22:13: CoreDNS-autoscaler reduced the number of CoreDNS pods from three to two
22:18: On-call engineers started receiving alerts from the monitoring system
22:21: On-call engineers started investigating the cause
22:39: Another application deploy was initiated, which led to the creation of new nodes
22:40: CoreDNS-autoscaler added pods, the conntrack table was overwritten, and the 5xx errors stopped

  • Time to detection: 4 min.
  • Time to action: 21 min.
  • Time to fix: 1 min.

Additional Information



To reduce CPU usage, the Linux kernel uses a mechanism called conntrack (connection tracking). In short, it maintains a list of NAT entries stored in a special table. When the next packet arrives from the same pod to the same destination as before, the destination IP address is not recalculated but is taken from the conntrack table.

How conntrack works
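To make the mechanism more concrete, here is a toy model of a conntrack-style table (our illustration, not kernel code): the DNAT decision for a flow is made once and cached, and every later packet of the same flow reuses the cached destination. All IP addresses are example values:

import random

SERVICE_IP = "100.64.0.10"                          # kube-dns ClusterIP
coredns_pods = ["100.110.33.231", "100.110.34.12"]  # example CoreDNS pod IPs

conntrack_table = {}  # flow 5-tuple -> chosen backend pod IP

def deliver(src_ip, src_port, dst_ip, dst_port, proto="udp"):
    flow = (proto, src_ip, src_port, dst_ip, dst_port)
    if flow not in conntrack_table:
        conntrack_table[flow] = random.choice(coredns_pods)  # "DNAT" happens once
    return conntrack_table[flow]                             # cached for later packets

first = deliver("10.2.1.5", 53531, SERVICE_IP, 53)
second = deliver("10.2.1.5", 53531, SERVICE_IP, 53)
assert first == second  # the same flow keeps hitting the same pod

# If that pod is removed but the entry is not deleted from the table, packets of
# this flow keep going to an IP that no longer answers, which is what happened here.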

Summary


This was an example of one of our post-mortems with some useful links. In this article we have deliberately shared information that may be useful to other companies. That is why we are not afraid to make mistakes and why we make some of our post-mortems public. Here are a few more interesting public post-mortems:

