This article focuses on the DNS issues in Kubernetes that our team encountered. As it turned out, sometimes the problem lies much deeper than it seems initially.
Introduction
There always comes a moment when circumstances interfere with work that has long been debugged and force you to change something. In our case, a small team had to migrate all of its applications to Kubernetes. There were many reasons for that, some objective and some not so much, but that is not what this story is about.
Since nobody had actively used Kubernetes before, the cluster was recreated several times, and we never found the time to evaluate how the migrated applications actually behaved. So, after the fourth migration, when all the major bumps had already been collected, all the containers were built and all the deployments were written, it was finally possible to review the work done and move on to other tasks.
11 a.m., the beginning of the working day. The monitoring system reports that some of the messages from one of the applications are missing.
Diagnostics
The application had been moved to the cluster only recently. It consisted of a simple worker that, once every few minutes, polled a database, checked it for changes and, if there were any, sent a message to the bus. Before starting a check and after finishing it, the application writes a message to the log. No parallelism, no multitasking, a single pod with a single container in it.
Upon closer inspection it became clear that the logs do reach the container's console but no longer make it to Elastic.
A few words about the infrastructure involved: an Elasticsearch cluster with Kibana on top, and the applications running in Kubernetes. Logs are shipped to Elastic via Serilog.Sinks.Elasticsearch, and the requests go through Nginx (for historical reasons Elasticsearch sits behind it). Nginx itself had been working without complaints and was not an obvious suspect.
Since the application writes its logs with Serilog, the obvious first step was to enable Serilog's SelfLog to see what the logging pipeline itself was complaining about. The following was added to the Serilog configuration:
// Route Serilog's own diagnostic messages into the regular log pipeline.
Serilog.Debugging.SelfLog.Enable(msg =>
{
    Serilog.Log.Logger.Error($"Serilog self log: {msg}");
});
The reason did not take long to surface, and this time it was visible in Kibana:
> Caught exception while preforming bulk operation to Elasticsearch: Elasticsearch.Net.ElasticsearchClientException: Maximum timeout reached while retrying request. Call: Status code unknown from: POST /_bulk
So either the requests were not reaching Elasticsearch at all, or Elasticsearch was not managing to answer within the timeout. The sink sends its batches through an ordinary HttpClient and keeps retrying until the maximum timeout is exhausted. Elasticsearch itself, meanwhile, was alive and answered direct requests quickly, so it did not look like the culprit.
A quick check showed that this was not an isolated case either. Another application that makes API calls from inside Kubernetes showed the same symptoms: roughly 5-10 out of every 100-150 requests took far longer than usual. The problem clearly affected outgoing HTTP requests in general, not one specific application.
The working hypothesis: something inside the cluster is slowing down outgoing HTTP requests.
To check it, a simple bash loop curling google.com once a second was run inside the cluster:
while true;
do curl -w "%{time_total}\n" -o /dev/null -s "https://google.com/";
sleep 1;
done
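If no suitable pod with curl is at hand, a throwaway pod works too. A minimal sketch, assuming the curlimages/curl image is acceptable in your cluster; the pod name curl-test is arbitrary:
# Launch a disposable pod that runs the same loop, then follow its output.
kubectl run curl-test --image=curlimages/curl --restart=Never --command -- \
  sh -c 'while true; do curl -s -o /dev/null -w "%{time_total}\n" https://google.com/; sleep 1; done'
kubectl logs -f curl-test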
The result: some of the requests indeed take more than 5 seconds.

For a more detailed breakdown of the timings, the following test pod was described:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ubuntu-test-scripts
data:
  loop.sh: |-
    apt-get update;
    apt-get install -y curl;
    while true;
    do echo $(date)'\n';
    curl -w "@/etc/script/curl.txt" -o /dev/null -s "https://google.com/";
    sleep 1;
    done
  curl.txt: |-
    lookup: %{time_namelookup}\n
    connect: %{time_connect}\n
    appconnect: %{time_appconnect}\n
    pretransfer: %{time_pretransfer}\n
    redirect: %{time_redirect}\n
    starttransfer: %{time_starttransfer}\n
    total: %{time_total}\n
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-test
  labels:
    app: ubuntu-test
spec:
  containers:
  - name: test
    image: ubuntu
    args: [/bin/sh, /etc/script/loop.sh]
    volumeMounts:
    - name: config
      mountPath: /etc/script
      readOnly: true
  volumes:
  - name: config
    configMap:
      defaultMode: 0755
      name: ubuntu-test-scripts
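To reproduce the measurement, apply the manifest and follow the pod's logs; the file name ubuntu-test.yaml below is an assumption:
# Create the ConfigMap and the test pod, then stream the timings.
kubectl apply -f ubuntu-test.yaml
kubectl logs -f ubuntu-test
# Clean up when done.
kubectl delete -f ubuntu-test.yaml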
The pod endlessly curls google.com; thanks to the -w format file, curl prints the duration of every stage of the request, and the response body itself goes to /dev/null.
Looking ahead: the culprit turned out to be DNS (who would have guessed). The cluster runs Kubernetes 1.14, and nothing in its DNS setup had been customized.
The detailed output for the slow requests made the picture clear.
All of the extra time is spent on the DNS lookup stage, and in 99% of cases the delay is exactly 5 seconds. Such a consistent, suspiciously round value looks very much like a timeout.
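Indeed, 5 seconds is the default query timeout of the glibc resolver (options timeout:5), which hints that a DNS packet is being lost and the resolver retries only after waiting the timeout out. The resolver configuration Kubernetes generated for the pod can be inspected directly; the values in the comments are illustrative, your ClusterIP and search domains will differ:
# Look at the resolver configuration generated for the test pod.
kubectl exec ubuntu-test -- cat /etc/resolv.conf
# Expect something along these lines:
#   nameserver 10.96.0.10   (the cluster DNS service VIP)
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5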
Time to google: the search query practically writes itself, "5 seconds dns resolve kubernetes".
Almost all of the results lead to the same place: a long-lived issue in the weave repository.
In short, the situation is as follows:
- The problem is not new (discussions of it go back to 2017), it is not specific to weave and reproduces with other network plugins as well.
- It shows up only during DNS lookups; besides 5 seconds, delays of 10-15 seconds are also possible.
- It reproduces in clusters where the DNS server is reachable through a VIP, i.e. where DNS traffic passes through DNAT (the service behind that VIP can be inspected as shown below).
- There is no complete fix yet, only workarounds.
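For reference, the VIP in question is simply the ClusterIP of the cluster DNS service. A quick way to look at it, assuming the common default service name kube-dns in the kube-system namespace (the name may differ in your distribution):
# The ClusterIP shown here is the VIP every pod's resolv.conf points at.
kubectl -n kube-system get svc kube-dns
# And the actual DNS pods that DNAT spreads this VIP across:
kubectl -n kube-system get endpoints kube-dns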
The root cause is a set of race conditions in the Linux connection tracking (conntrack) subsystem around SNAT and DNAT. Tobias Klausmann described a similar race back in 2009, and a fix for one of these races appeared only in Linux kernel 5.0, which still does not cover all the cases.
Nevertheless, in the Kubernetes DNS setup the conditions for this race condition are reproduced constantly.
For containers whose resolver is glibc (images based on Ubuntu, Debian and so on), the sequence looks like this:
- glibc sends the A and AAAA DNS queries in parallel through the same UDP socket. Since UDP is a connectionless protocol, the connect(2) call does not send any packets, so no conntrack entry is created at that point.
- The DNS server in Kubernetes is reachable through a virtual IP (VIP), so the packets to it are DNAT'ed by iptables to a concrete DNS pod.
- During DNAT, both UDP packets pass through the following netfilter handlers:
a. nf_conntrack_in: a conntrack hash object is created for each packet, but neither entry is inserted into the table yet.
b. nf_nat_ipv4_fn: performs the address translation and updates the conntrack object.
c. nf_conntrack_confirm: confirms the entry and inserts it into the conntrack table. Because both UDP packets were sent at practically the same moment from the same socket, they end up with identical tuples and only one of them can be confirmed; the other packet is dropped. The dropped DNS query never gets an answer, so the resolver waits out its timeout and retries. Every such drop increments the insert_failed conntrack counter (see the check below).
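Whether your nodes are affected can be checked from the conntrack statistics. A sketch that assumes the conntrack userspace tool is installed on the node (the package is usually called conntrack); a steadily growing insert_failed counter under DNS load is the telltale sign:
# Run on a cluster node, not inside a pod.
sudo conntrack -S | grep -o 'insert_failed=[0-9]*'
# The same per-CPU counters are also available without the tool:
sudo cat /proc/net/stat/nf_conntrack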
Until the problem is fixed for good, the following workarounds are suggested:
- run a caching DNS server on every node and use it as the pods' nameserver (the NodeLocal DNS cache approach);
- for Weave there is a workaround based on tc(8) that slightly delays the AAAA DNS packet so that the two queries stop racing;
- add the single-request-reopen option to resolv.conf.
We went with the last option.
Starting with Kubernetes 1.9, the pod spec supports a dnsConfig section that lets you add extra options to the generated resolv.conf.
When this option is set for a pod, glibc sends the A and AAAA queries from different sockets, and the conditions for the race condition no longer arise:
spec:
  dnsConfig:
    options:
      - name: single-request-reopen
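Once the pod has been recreated with this section, it is worth checking that the option actually reached the container's resolv.conf (the pod name below is a placeholder):
# The generated file should now end with something like
# "options ndots:5 single-request-reopen".
kubectl exec <pod-name> -- cat /etc/resolv.conf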
It is worth noting that this workaround only makes sense for glibc-based images: images based on, for example, alpine use musl, which simply ignores this option in resolv.conf.
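A rough way to tell which libc a running container uses, assuming it has a shell and ldd inside (the pod name is again a placeholder):
# glibc prints a line like "ldd (GNU libc) 2.27";
# musl answers with "musl libc (...)" instead.
kubectl exec <pod-name> -- sh -c 'ldd --version 2>&1 | head -n 1'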
P.S. In our case, to automate the process, we wrote the simplest possible mutating webhook, which adds this configuration section to every new pod in the cluster. Unfortunately, I can't share its code.
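Since that code cannot be shown, here is a rough manual alternative for a single workload: merging the same dnsConfig block into an existing deployment's pod template (the deployment name my-app is hypothetical):
# Patching the pod template triggers a rolling restart of the deployment's pods.
kubectl patch deployment my-app --type merge -p \
  '{"spec":{"template":{"spec":{"dnsConfig":{"options":[{"name":"single-request-reopen"}]}}}}}'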