Slow DNS resolution in Kubernetes

This article covers the DNS issues our team ran into in Kubernetes. As it turned out, the problem sometimes lies much deeper than it first seems.


Introduction


Sooner or later, circumstances intervene in a well-established workflow and force a change. In our case, a small team had to migrate all of its applications to Kubernetes. There were many reasons, some more objective than others, but that is not what this story is about.


Since nobody on the team had used Kubernetes in earnest before, the cluster was rebuilt several times, which left no time to assess how well the migrated applications actually worked. After the fourth migration, with the main lessons learned the hard way, all the containers built, and all the deployments written, we could finally review the work done and move on to other tasks.


11 a.m., the start of the working day. The monitoring system is missing some of the messages from one of the applications.


Diagnostics


The application had recently been moved to the cluster and consisted of a simple worker that queried the database every few minutes, checked it for changes and, if there were any, sent a message to the bus. Before each check starts and after it completes, the application writes a message to the log. No parallelism, no multitasking, a single pod with a single container in it.


On closer inspection, it turned out that the logs show up in the console but are missing from Elastic.


Background: there is an Elasticsearch cluster, Kubernetes, and Kibana. Logs reach Elastic through Serilog.Sinks.Elasticsearch, and the requests pass through Nginx in front of Elasticsearch.


To see what was happening inside Serilog, its SelfLog was enabled and routed back into Serilog itself:


Serilog.Debugging.SelfLog.Enable(msg =>
{
  Serilog.Log.Logger.Error($"Serilog self log: {msg}");
});

After that, the following error appeared in Kibana.


> Caught exception while preforming bulk operation to Elasticsearch: Elasticsearch.Net.ElasticsearchClientException: Maximum timeout reached while retrying request. Call: Status code unknown from: POST /_bulk

Elasticsearch , Elasticsearch . , HttpClient . , , .


, , . . , API Kubernetes, 5-10 100-150 . HTTP . .


โ€“ , , HTTP .


To check this, a simple bash script was written that requests google.com in a loop:


while true;
do curl -w "%{time_total}\n" -o /dev/null -s "https://google.com/";
sleep 1;
done

The result: delays of 5 seconds.


, :


apiVersion: v1
kind: ConfigMap
metadata:
  name: ubuntu-test-scripts
data:
  loop.sh: |-
    apt-get update;
    apt-get install -y curl;
    while true;
    do echo $(date)'\n';
    curl -w "@/etc/script/curl.txt" -o /dev/null -s "https://google.com/";
    sleep 1;
    done
  curl.txt: |-
    lookup:        %{time_namelookup}\n
    connect:       %{time_connect}\n
    appconnect:    %{time_appconnect}\n
    pretransfer:   %{time_pretransfer}\n
    redirect:      %{time_redirect}\n
    starttransfer: %{time_starttransfer}\n
    total:         %{time_total}\n
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-test
  labels:
      app: ubuntu-test
spec:
  containers:
  - name: test
    image: ubuntu
    args: [/bin/sh, /etc/script/loop.sh]
    volumeMounts:
    - name: config
      mountPath: /etc/script
      readOnly: true
  volumes:
  - name: config
    configMap:
      defaultMode: 0755
      name: ubuntu-test-scripts

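The idea is the same: request google.com with curl in a loop. The response body is discarded to /dev/null, and the timing of each phase is printed using a curl format file.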


, , . โ€“ DNS ( โ€“ ). Kubernetes 1.14, , , .


The output made it clear that the problem is in the DNS lookup, and in 99% of cases the delay is 5 seconds.
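To put a number on how often lookups stall, the `lookup:` lines printed through the curl format file can be tallied. The sketch below is an illustration, not part of the original test. The function name `summarize_lookups` and the 1-second default threshold are assumptions.

```shell
# Summarize "lookup: <seconds>" lines read from stdin.
# $1: threshold in seconds above which a lookup counts as slow
#     (assumed default: 1.0).
summarize_lookups() {
  threshold="${1:-1.0}"
  awk -v t="$threshold" '
    $1 == "lookup:" {
      n++                               # every lookup sample
      if ($2 + 0 >= t) slow++           # samples at or above the threshold
      if ($2 + 0 > max) max = $2 + 0    # slowest sample seen so far
    }
    END { printf "total=%d slow=%d max=%.3f\n", n, slow, max }
  '
}
```

Assuming the test pod is still named ubuntu-test, the function could be fed with `kubectl logs ubuntu-test | summarize_lookups 1.0`.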



.


, โ€” 5 seconds dns resolve kubernetes.


Among the results was an issue in Weave.


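In short:


  1. The problem is not new (the issue dates back to 2017).
  2. The delay comes from the DNS lookup. In some cases it reaches 10-15 seconds.
  3. It shows up when the DNS server is reached through a VIP, which goes through DNAT.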
  4. .

โ€“ race conditions conntraking Linux SNAT DNAT. Tobias Klausmann 2009 , Linux 5.0 , .


So in Kubernetes, DNS lookups are exactly where this race condition shows up.


With glibc (used in Ubuntu, Debian, etc.), it happens as follows:


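  1. glibc sends the two lookups (A and AAAA) in parallel over the same UDP socket. Since UDP is connectionless, connect(2) sends no packet, so no conntrack entry is created at that point.
  2. The DNS server in Kubernetes is reached through a VIP, which is translated by DNAT rules in iptables.
  3. During DNAT, the netfilter hooks run in this order:
    a. nf_conntrack_in: creates the conntrack hash object and adds it to the list of unconfirmed entries.
    b. nf_nat_ipv4_fn: performs the translation and updates the conntrack tuple.
    c. nf_conntrack_confirm: confirms the entry and adds it to the table.
  4. Two parallel UDP requests race to confirm the entry. On top of that, they may be sent to different DNS servers. One of the packets is dropped, which shows up as a growing insert_failed counter, and the lookup hangs until the client times out.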
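The insert_failed counter mentioned in step 4 can be watched directly on a node. This is a minimal sketch, assuming the conntrack-tools package is installed and that `conntrack -S` prints one line of `key=value` counters per CPU. The helper name `sum_insert_failed` is made up for illustration.

```shell
# Sum the insert_failed counters across all CPUs.
# Expects `conntrack -S` style output (whitespace-separated key=value
# tokens, one line per CPU) on stdin.
sum_insert_failed() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        # Only tokens of the exact form insert_failed=<number> are counted.
        if (split($i, kv, "=") == 2 && kv[1] == "insert_failed") {
          total += kv[2]
        }
      }
    }
    END { print total + 0 }
  '
}
```

Running `conntrack -S | sum_insert_failed` on a node twice, a minute or so apart, while the curl loop is active should show whether the counter grows together with the 5-second lookups.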


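There are several workarounds:


  1. Run a DNS server on each node and use it as the nameserver (LocalDNS cache).
  2. For Weave, use tc(8) to add a small delay to AAAA DNS requests.
  3. Add single-request-reopen to resolv.conf.

We went with the last option.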


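Starting with Kubernetes 1.9, a pod spec has a dnsConfig section that controls the contents of the generated resolv.conf.


With the following in the pod spec, glibc stops sending the A and AAAA requests over the same socket, which avoids the race condition.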


spec:
  dnsConfig:
    options:
    - name: single-request-reopen

One caveat: if the image does not use glibc (for example, alpine with musl), the resolv.conf option will not help.
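Whether the option actually reached a container can be checked by looking at its /etc/resolv.conf. A small sketch follows. The helper name `has_resolv_option` is hypothetical, and it only checks that the option is present, not that the libc in the image honors it.

```shell
# Succeed (exit 0) if the given resolv.conf lists the named option
# on an "options" line, fail (exit 1) otherwise.
# $1: path to a resolv.conf file, $2: option name to look for.
has_resolv_option() {
  file="$1"
  option="$2"
  awk -v opt="$option" '
    $1 == "options" {
      for (i = 2; i <= NF; i++) if ($i == opt) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

Inside a pod this could be used as `has_resolv_option /etc/resolv.conf single-request-reopen && echo "option is set"`.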


P.S. In our case, to automate this, we wrote a very simple mutating webhook that adds this configuration section to every new pod in the cluster. Unfortunately, I can't share the code.

