👎🏻 👰🏼 ⛹🏽 Problems with DNS in Kubernetes. Public post-mortem 📤 🕡 👨‍👩‍👧‍👧

Note perev .: this is a translation of the public post- mortem from the Preply engineering blog . It describes a conntrack problem in the Kubernetes cluster that led to some downtime of some production services.

This article may be useful for those who want to learn a little more about post-mortem or to prevent some potential problems with DNS in the future.

This is not DNS.
It may not be that it is DNS.
It was DNS.

A little bit about post-mortem and processes in Preply

The post-mortem describes a malfunction or an event in the sales office. Post-mortem includes a chronology of events, a description of the impact on the user, the root cause, actions and lessons learned.

Seeking SRE

At weekly pizza meetings, in the circle of the technical team, we share various information. One of the most important parts of such meetings is post-mortem, which is most often accompanied by a presentation with slides and a deeper analysis of the incident. Despite the fact that we do not “clap” after post-mortem, we try to develop a culture “without blame ” ( blameless cluture ). We believe that writing and presenting post-mortem can help us (and not only) in preventing similar incidents in the future, which is why we share them.

Persons involved in the incident should feel that they can talk in detail about it without fear of punishment or retaliation. No censure! Writing a post-mortem is not a punishment, but an opportunity to learn for the whole company.

Keep CALMS & DevOps: S is for Sharing

Problems with DNS in Kubernetes. Post mortem

Date: 02/28/2020

Authors: Amet U., Andrey S., Igor K., Aleksey P.

Status: Completed

Brief: Partial unavailability of DNS (26 min) for some services in the Kubernetes cluster

Impact: 15,000 events were lost for services A, B and C

Root cause: Kube-proxy could not correctly delete the old entry from the conntrack table, so some services still tried to connect to non-existent submissions

E0228 20:13:53.795782       1 proxier.go:610] Failed to delete kube-system/kube-dns:dns endpoint connections, error: error deleting conntrack entries for UDP peer {100.64.0.10, 100.110.33.231}, error: conntrack command returned: ...

Trigger: Due to the low load inside Kubernetes-cluster, CoreDNS-autoscaler reduced the number of pods in deploymenta from three to two

decision: Another deploy applications initiated the creation of new nodes, CoreDNS-autoscaler added more decks for the cluster service, which provoked overwrite conntrack table

Detection : Prometheus monitoring detected a large number of 5xx errors for services A, B and C and initiated a call to the on-duty engineers

5xx errors in Kibana

Actions

Act	A type	Responsible	Task
Disable autoscaler for CoreDNS	prevented	Amet U.	DEVOPS-695
Install Caching DNS Server	decrease	Max V.	DEVOPS-665
Configure conntrack monitoring	prevented	Amet U.	DEVOPS-674

Lessons learned

What went well:

Monitoring worked clearly. The reaction was quick and organized.

, conntrack
, ()
, DNS,

CoreDNS-autoscaler, conntrack

(EET)


22:13	CoreDNS-autoscaler
22:18
22:21
22:39
22:40	5xx ,

Time to detection: 4 min.
Time to complete the action: 21 min.
Time to fix: 1 min

Additional Information

CoreDNS Logs:

I0228 20:13:53.507780       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"kube-system", Name:"coredns", UID:"2493eb55-3dc0-11ea-b3a2-02bb48f8c230", APIVersion:"apps/v1", ResourceVersion:"132690686", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set coredns-6cbb6646c9 to 2

Links to Kibana (cut out), Grafana (cut out)
Where Linux conntrack is no longer your friend
kube-proxy Subtleties: Debugging an Intermittent Connection Reset
Racy conntrack and DNS lookup timeouts

To minimize processor utilization, the Linux kernel uses such a thing as conntrack. In short, this is a utility that contains a list of NAT entries that are stored in a special table. When the next packet comes from the same pod to the same pod as before, the final IP address will not be recalculated, but will be taken from the conntrack table.

How conntrack works

Summary

This was an example of one of our post-mortem with some useful links. Specifically in this article we share information that may be useful to other companies. That is why we are not afraid to make mistakes and that is why we make one of our post-mortem public. Here are some more interesting public post-mortem themes:

GitLab: Postmortem of database outage of January 31
Dropbox: Outage post-mortem
Spotify: Spotify's Love / Hate Relationship with DNS
Many others from this histogram and Kubernetes Failure Stories repository
Also an example of a public post-mortem with SRE Book

Problems with DNS in Kubernetes. Public post-mortem