Crash stories with Patroni, or How to drop a PostgreSQL cluster

PostgreSQL does not provide High Availability out of the box. To achieve HA, you need to add something, configure it - in other words, make an effort. There are several tools that can help increase the availability of PostgreSQL, and one of them is Patroni.

At first glance, after deploying Patroni in a test environment, you can see what a great tool it is and how easily it handles our attempts to break the cluster. In practice, however, in a production environment not everything is always so beautiful and elegant. Data Egret started using Patroni at the end of 2018 and has gained some experience: how to diagnose it, how to configure it, and when not to rely on autofailover at all.

At HighLoad++, Alexey Lesovsky described in detail, using examples and log analysis, the typical problems that arise when working with Patroni, and best practices for overcoming them.


What you will not find in this article: Patroni installation instructions and configuration examples; problems outside of Patroni and PostgreSQL; stories based on other people's experience - only the problems that Data Egret worked through themselves.

About the speaker: Alexey Lesovsky (lesovsky) started as a Linux system administrator, worked in web development, and is a PostgreSQL database administrator. Since 2014 he has been working at Data Egret. Data Egret provides consulting in the field of PostgreSQL, helps many, many companies use PostgreSQL correctly and, of course, has gained extensive experience in operating the database.
The report this article is based on is called "Patroni Failure Stories or How to crash your PostgreSQL cluster"; here is a link to the presentation.

Before you start


Let me remind you what Patroni is, what it is intended for and what it can do.

Patroni is a template for building HA out of the box. That is how the documentation puts it, and in my view this is a very correct clarification: Patroni is not a silver bullet that you install and it solves all your problems. You have to make an effort for it to start working and bringing benefit.

Patroni is an agent service. It is installed on each server with a database and acts as a kind of init system for PostgreSQL: it starts, stops, and restarts PostgreSQL, and changes the configuration and topology of the cluster.

Patroni stores the "cluster state" in a DCS. To store the cluster state - its current view - you need storage. Patroni keeps the state in an external system, a distributed configuration store. This can be one of: etcd, Consul, ZooKeeper, or Kubernetes.
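
For orientation, here is a minimal sketch of what a Patroni configuration with Consul as the DCS might look like. The names, addresses, and credentials are hypothetical placeholders, not taken from the clusters described later; treat it as an illustration of where the DCS is configured, not a production-ready setup.

# Minimal illustrative patroni.yml (sketch, not a drop-in configuration).
scope: pgdb                # cluster name, must be the same on every node
name: pgdb-1               # unique name of this node

consul:
  host: 127.0.0.1:8500     # Patroni talks to the DCS, here via the local Consul agent

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/11/main
  authentication:
    superuser:
      username: postgres
      password: change-me
    replication:
      username: replicator
      password: change-me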

Autofailover is enabled in Patroni by default. You get autofailover out of the box, immediately after installing Patroni.

Plus there are many other things, such as managing configuration, creating new replicas, making backups, and so on. But that remains outside the scope of this talk.
The main objective of Patroni is to provide reliable autofailover.
Patroni does not have to monitor hardware, send notifications, or do complicated things to prevent potential accidents. It is only required that the cluster stay operational and that the application can keep working with the database regardless of any changes in the cluster topology.

But when we start using Patroni together with PostgreSQL, our system becomes a bit more complicated: now, in addition to the database itself (the master or a replica) failing, Patroni itself, the distributed cluster-state store, or the network may break. Let us go through the cases in order of how difficult it is to understand their causes.

Problem 1. DBMS and DCS on the same cluster


Consider the simplest case: we took a database cluster and deployed the DCS on the same cluster. This common mistake is not specific to PostgreSQL or Patroni deployments; it is a mistake in architecture design in general - combining many different components in one place.

So, there was a failover. We begin to figure out what happened. Here we are interested in when the failover occurred, that is, the point in time when the cluster state changed.

When did the failover happen?


A failover does not always happen instantly; it can take quite a while. Therefore, it has a start time and an end time, and all events are divided into "before", "during" and "after" the failover.

First of all, we find out when the failover happened - that is where we start looking for the reason.

pgdb-2 patroni: INFO: promoted self to leader by acquiring session lock
pgdb-2 patroni: WARNING: Loop time exceeded, rescheduling immediately.
pgdb-2 patroni: INFO: Lock owner: pgdb-2; I am pgdb-2
pgdb-2 patroni: INFO: updated leader lock during promote
pgdb-2 patroni: server promoting
pgdb-2 patroni: INFO: cleared rewind state after becoming the leader
pgdb-2 patroni: INFO: Lock owner: pgdb-2; I am pgdb-2

Above are standard Patroni log lines, where it reports that the server role has changed and the master role has moved from another node to this one.

Why did the failover happen?


Next, you need to understand what events caused the master role to move from one node to another, and what happened to the old master.

pgdb-2 patroni: patroni.utils.RetryFailedError: 'Exceeded retry deadline'
pgdb-2 patroni: ERROR: Error communicating with DCS
pgdb-2 patroni: INFO: demoted self because DCS is not accessible and i was a leader
pgdb-2 patroni: WARNING: Loop time exceeded, rescheduling immediately.

In this case, everything is simple: Error communicating with DCS means an error interacting with the configuration store. The master realized it could not work with the DCS, announced that it could no longer be the master, and stepped down - demoted self says exactly that.

What preceded the failover?


If you look at the events that preceded the failover, you can see the reasons that prevented the master from continuing:

pgdb-2 patroni: ERROR: touch_member
                ... python trace 
pgdb-2 patroni: socket.timeout: timed out 
pgdb-2 patroni: During handling of the above exception, another exception occurred:
                ... python trace 
pgdb-2 patroni:   raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value) 
pgdb-2 patroni: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335) 
pgdb-2 patroni: During handling of the above exception, another exception occurred:
                ... python trace 
pgdb-2 patroni:     raise MaxRetryError(_pool, url, error or ResponseError(cause)) 
pgdb-2 patroni: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/service/pgdb/members/pgdb-2?acquire=19598b72-c0d5-f066-5d24-09c1a9ad61ee&dc=maindb (Caused by ReadTimeoutError("HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335)",)) 
pgdb-2 patroni: INFO: no action.  i am the leader with the lock 
pgdb-2 patroni: WARNING: Loop time exceeded, rescheduling immediately.

The Patroni logs show many different timeout errors. The Patroni agent could not work with the DCS (in this case Consul, port=8500). Patroni and the database were running on one host, and the Consul servers were running on that same host. By creating load on the server, we also created problems for the Consul servers, so Patroni and Consul could not communicate normally.

Everything is back as it was


After a while, when the load subsided, Patroni was able to communicate with the agents again and everything resumed:

pgdb-2 patroni: INFO: promoted self to leader by acquiring session lock
pgdb-2 patroni: WARNING: Loop time exceeded, rescheduling immediately.
pgdb-2 patroni: INFO: Lock owner: pgdb-2; I am pgdb-2
pgdb-2 patroni: INFO: updated leader lock during promote
pgdb-2 patroni: server promoting
pgdb-2 patroni: INFO: cleared rewind state after becoming the leader
pgdb-2 patroni: INFO: Lock owner: pgdb-2; I am pgdb-2

The same pgdb-2 server became the master again. There was a small "flap": within a relatively short time it resigned the master role and then took it back.

This can be regarded either as a false positive, or as Patroni doing everything right.

Solution


We decided that the problem was the Consul servers sitting on the same hardware as the databases. Any load on the CPU and disks (heavy IO requests, temporary files, autovacuum, migrations, backups, etc.) also affects the interaction with the Consul cluster. We decided that Consul should not live on the same equipment as the database, and allocated a separate cluster for it.

Alternatively, you can tune the ttl, loop_wait, and retry_timeout parameters, increasing them to survive short-term peak loads, as in the sketch below. But if the load lasts longer than these timeouts, this method will not help.
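
A sketch of what such tuning might look like. The values are illustrative, not a recommendation, and Patroni expects roughly loop_wait + 2 * retry_timeout <= ttl:

# Dynamic configuration (stored in the DCS): can be set in bootstrap.dcs before
# the first start, or changed later with `patronictl edit-config`.
bootstrap:
  dcs:
    ttl: 60            # default 30; how long the leader key lives without being refreshed
    loop_wait: 10      # default 10; pause between iterations of the Patroni HA loop
    retry_timeout: 25  # default 10; how long DCS/PostgreSQL operations are retried before giving up

Larger values make the cluster more tolerant of short DCS hiccups, at the cost of a slower reaction to a real master failure.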

Problem 2. Outages


The second problem is similar to the first in that we again have problems interacting with the DCS system:

maindb-1 patroni: ERROR: get_cluster 
maindb-1 patroni: Traceback (most recent call last):
                ... python trace 
maindb-1 patroni: RetryFailedError: 'Exceeded retry deadline' 
maindb-1 patroni: ERROR: Error communicating with DCS 
maindb-1 patroni: INFO: closed patroni connection to the postgresql cluster 
maindb-1 patroni: INFO: postmaster pid=214121 
... 
... 
maindb-1 patroni: INFO: demoted self because DCS is not accessible and i was a leader 
maindb-1 patroni: WARNING: Loop time exceeded, rescheduling immediately.

Patroni again says that it cannot interact with DCS, so the current master ceases to be a master and enters replica mode. The old master becomes a replica:

maindb-1 patroni: INFO: Lock owner: maindb-2; I am maindb-1 
maindb-1 patroni: INFO: does not have lock 
maindb-1 patroni: INFO: running pg_rewind from maindb-2 
maindb-1 patroni: INFO: running pg_rewind from user=postgres host=192.168.11.18 port=5432 ... 
maindb-1 patroni: servers diverged at WAL location 1FA/A38FF4F8 on timeline 6 
maindb-1 patroni: rewinding from last common checkpoint at 1FA/A38FF450 on timeline 6 
maindb-1 patroni: INFO: Lock owner: maindb-2; I am maindb-1 
maindb-1 patroni: INFO: running pg_rewind from maindb-2 in progress 
maindb-1 patroni: Done!

Here Patroni does what it should: it launches pg_rewind to rewind the transaction log, then connects to the new master and catches up with it.

What preceded the failover?


Let's find what preceded the failover - the errors that caused it. Patroni logs are convenient in this regard: it writes the same messages at a fixed interval, so you can quickly scroll through them and see where the log pattern changed. That is the point at which the problems began.

In a normal situation, Patroni logs look something like this:

maindb-1 patroni: INFO: Lock owner: maindb-1; I am maindb-1
maindb-1 patroni: INFO: no action. i am the leader with the lock
maindb-1 patroni: INFO: Lock owner: maindb-1; I am maindb-1
maindb-1 patroni: INFO: no action. i am the leader with the lock

The lock owner is checked; if it changes, events occur that Patroni must respond to. In this case, everything is in order.

Having scrolled to the place where the errors begin, we see that an autofailover occurred. Since the errors were related to interaction with the DCS, we also look at the Consul logs to see what happened there. Roughly matching the time of the failover against the Consul logs, we see that the neighbors in the Consul cluster started to doubt the existence of other members:

maindb-2 consul[11581]: serf: EventMemberFailed: maindb-1 192.168.11.19
maindb-2 consul[11581]: [INFO] serf: EventMemberFailed: maindb-1 192.168.11.19
maindb-2 consul[11581]: memberlist: Suspect maindb-1 has failed, no acks received
maindb-2 consul[11581]: [INFO] memberlist: Suspect maindb-1 has failed, no acks received

The logs of the other Consul agents show the same thing: a network collapse in which all members of the Consul cluster doubt each other. This was the trigger for the failover.

If we look at even earlier entries, we see that the Consul agent on the PostgreSQL node was having communication difficulties (deadline reached, RPC failed):

maindb-1 consul: memberlist: Suspect lbs-4 has failed, no acks received
maindb-1 consul: [ERR] yamux: keepalive failed: i/o deadline reached
maindb-1 consul: yamux: keepalive failed: i/o deadline reached
maindb-1 consul: [ERR] consul: "KVS.List" RPC failed to server 192.168.11.115:8300: rpc error making call: EOF
maindb-1 consul: [ERR] http: Request GET /v1/kv/service/sam_prod/?recurse=1, error: rpc error making call: EOF from=192.168.11.19
maindb-1 consul: [ERR] consul: "KVS.List" RPC failed to server 192.168.11.115:8300: rpc error making call: EOF
maindb-1 consul: [ERR] agent: Coordinate update error: rpc error making call: EOF
maindb-1 consul: [ERR] http: Request GET /v1/kv/service/sam_prod/?recurse=1, error: rpc error making call: EOF from=192.168.11.19


Solution


The simplest answer is to fix the network. That is easy to advise, but circumstances differ and it is not always possible: the system may live in a data center where we cannot influence the equipment. So we need other options.

There are solutions that do not require touching the network (a configuration sketch follows the list):

  • Consul checks: [].
    Configure an empty list of checks, so that the cluster state kept in Consul does not depend on the agent's health checks. A briefly overloaded Consul agent then no longer causes false failures; the downside is that a node that has genuinely died is detected more slowly.
  • raft_multiplier.
    A setting of the Consul servers; the default of 5 is intended for staging environments. It determines how quickly the Consul servers exchange messages and detect failures, so it is worth tuning for production.
  • renice -n -10 consul.
    Raise the priority of the consul process. If the Consul agent lives on the same host as the database, the scheduler will then still give it CPU time under load.
  • Do you need Consul at all?
    Consul agents run on every node, and Patroni works through the local agent: if something is wrong with that agent, Patroni assumes something is wrong in general. With etcd there are no agents - etcd is simply a cluster, and Patroni talks to the etcd servers directly. The interaction scheme is different, and that is worth weighing when deciding whether you need Consul.
  • ttl, loop_wait, retry_timeout.
    Increase them, as in the first problem, to ride out short-lived issues; the downside is that Patroni will then react to real failures more slowly.
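
For the first item, a sketch of the Consul section of patroni.yml is shown below. I am assuming here that the checks option of Patroni's Consul backend controls which Consul health checks the leader session depends on; verify this against the Patroni documentation for your version. The raft_multiplier knob, by contrast, is set on the Consul servers themselves (in their performance section), and renice is applied to the consul process at the operating-system level.

# Static node configuration (patroni.yml), Consul section - a sketch, not a drop-in config.
consul:
  host: 127.0.0.1:8500     # local Consul agent
  register_service: true   # register the cluster as a service in Consul
  checks: []               # empty check list: the leader session does not depend on agent checks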


Problem 3. Node loss


If you are using Patroni, then you are familiar with the patronictl list command, which shows the current state of the cluster:

$ patronictl list

+-----------------+-------------------------+--------------+--------+---------+-----------+
|     Cluster     |          Member         |     Host     |  Role  |  State  | Lag in MB |
+-----------------+-------------------------+--------------+--------+---------+-----------+
| patroni_cluster | pg01.dev.zirconus.local | 10.202.1.101 |        | running |    0.0    |
| patroni_cluster | pg03.dev.zirconus.local | 10.202.1.103 | Leader | running |    0.0    |
+-----------------+-------------------------+--------------+--------+---------+-----------+


At first glance this output may seem normal: there is a master, there are replicas, and there is no replication lag. But the picture is only normal until you remember that this cluster should have 3 nodes, not 2.

Why did the failover happen?


We know that autofailover occurred and that one replica disappeared afterwards. We need to find out why it disappeared and bring it back.

We study the logs again to understand why the autofailover occurred:

pg02 patroni[1425]: ERROR: failed to update leader lock
pg02 patroni[1425]: INFO: demoted self because failed to update leader lock in DCS
pg02 patroni[1425]: WARNING: Loop time exceeded, rescheduling immediately.
pg02 patroni[1425]: INFO: Lock owner: None; I am pg02.dev.zirconus.local
pg02 patroni[1425]: INFO: not healthy enough for leader race
pg02 patroni[1425]: INFO: starting after demotion in progress

The master pg02 was unable to update the leader key: ERROR: failed to update leader lock.

Then the second replica, pg03, became the master; everything is in order here:

pg03 patroni[9648]: INFO: promoted self to leader by acquiring session lock
pg03 patroni[9648]: server promoting
pg03 patroni[9648]: INFO: cleared rewind state after becoming the leader

What about the old master?


pg02, having decided to become a replica, began by rewinding the transaction log. Here we need to look at the logs of the replica that is missing from the cluster. Opening the Patroni logs, we see that while connecting to the cluster a problem arose at the pg_rewind stage.

pg02 patroni[1425]: INFO: running pg_rewind from pg03
pg02 patroni[1425]: INFO: running pg_rewind from user=postgres host=10.202.1.103 port=5432 ...
pg02 patroni[1425]: servers diverged at WAL location 33F/68E6AD10 on timeline 28
pg02 patroni[1425]: could not open file "/data/pgdb/11/pg_wal/0000001C0000033F00000059": No such file or directory
pg02 patroni[1425]: could not find previous WAL record at 33F/59FFE368
pg02 patroni[1425]: Failure, exiting
pg02 patroni[1425]: ERROR: Failed to rewind from healty master: pg03
pg02 patroni[1425]: WARNING: Postgresql is not running.

To join the cluster, the node must request the transaction log from the master and catch up with it. In this case the transaction log is missing (No such file or directory), so the replica cannot start and PostgreSQL stops with an error. We need to understand why the transaction log was gone.

Let's see what the new master has in WAL:

LOG: checkpoint complete:
wrote 62928 buffers (16.0%); 0 WAL file(s) added, 0 removed, 23 recycled;
write=12.872 s, sync=0.020 s, total=13.001 s;
sync files=16, longest=0.009 s, average=0.001 s;
distance=520220 kB, estimate=520220 kB

It turns out that while pg_rewind was running, a checkpoint occurred on the new master, and some of the old transaction logs were recycled (renamed). When the old master tried to connect to the new master and request those logs, they had already been recycled and simply did not exist.

According to the timestamp, the time between these events is literally 150 ms.

2019-08-26 00:06:11,369 LOG: checkpoint complete
2019-08-26 00:06:11,517 INFO: running pg_rewind

But that was enough for the replica to be unable to connect and start working.

Solution


Initially we used replication slots. That solution seemed good to us, although at the first stage of operation we had turned the slots off: we were afraid that if the slots accumulated many WAL segments, the master could go down. After suffering for a while without slots, we realized that we needed them and brought them back.

The problem is that when the master is demoted to a replica, it deletes its slots and the WAL segments along with them. To mitigate this, we increased the wal_keep_segments parameter. By default it is 8 segments; we raised wal_keep_segments to 1000. That gives roughly 16 GB of WAL, so when switching over we always have 16 GB of transaction logs on all nodes.

The same applies to long maintenance tasks. Say we need to update one of the replicas (software, OS, anything else) and want to shut it down. When the replica is shut down, its slot is deleted as well. With a small wal_keep_segments value, after a long absence the replica will request transaction logs that no longer exist on the master, and it will not be able to connect. That is why we keep a large reserve of WAL (see the sketch below).
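
A sketch of the corresponding dynamic configuration, using the values mentioned above; adjust them to your own WAL volume:

# Dynamic configuration in the DCS, e.g. edited with `patronictl edit-config`.
postgresql:
  use_slots: true             # Patroni creates and maintains replication slots for members
  parameters:
    wal_keep_segments: 1000   # ~16 GB with the default 16 MB segment size (PostgreSQL <= 12)
    # on PostgreSQL 13+ the equivalent setting is wal_keep_size (e.g. 16GB)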

Problem 4. Data Loss


There was a failover on a production database. We checked the cluster: everything was in order, all replicas were in place, there was no replication lag, and there were no errors in the logs either. But the product team reported that some data was missing from the database: it was present in one of the sources, but not in the database. We needed to understand what had happened to it.

We immediately realized that pg_rewind had chewed this data up, but we needed to figure out why.

When did the failover happen?


We can always find in the logs when the failover occurred, who became the new master, who was the master before, and when it became a replica.

pgdb-1 patroni[17836]: INFO: promoted self to leader by acquiring session lock
pgdb-1 patroni[17836]: server promoting
pgdb-1 patroni[17836]: INFO: cleared rewind state after becoming the leader
pgdb-1 patroni[17836]: INFO: Lock owner: pgdb-1; I am pgdb-1

From the logs we can determine how much of the transaction log was lost.

What happened was this: the old master rebooted; after the reboot Patroni was started automatically, which in turn started PostgreSQL. To rejoin the Patroni cluster as a replica, pg_rewind was run, which erased part of the transaction logs, fetched the new ones, and connected.

Patroni worked exactly as intended. The cluster recovered: there were 3 nodes before the failover and 3 nodes after it. But part of the data was lost, and we needed to understand which part.

Find in the old master's logs the moment when pg_rewind ran:

pgdb-2 patroni[4149]: INFO: running pg_rewind from pgdb-1
pgdb-2 patroni[4149]: Lock owner: pgdb-1; I am pgdb-2
pgdb-2 patroni[4149]: Deregister service pgdb/pgdb-2
pgdb-2 patroni[4149]: running pg_rewind from pgdb-1 in progress
pgdb-2 patroni[4149]: running pg_rewind from user=replica host=10.10.1.31 port=5432 ...
pgdb-2 patroni[4149]: servers diverged at WAL location 0/5AD1BFD8 on timeline 66
pgdb-2 patroni[4149]: rewinding from last common checkpoint at 0/59DD1070 on timeline 66
pgdb-2 patroni[4149]: Done!

You need to find the position in the transaction log at which the old master stopped.

pgdb-2 patroni[4149]: INFO: Local timeline=66 lsn=0/5BDA8EB8
pgdb-2 patroni[4149]: INFO: master_timeline=67
...
pgdb-2 patroni[4149]: servers diverged at WAL location 0/5AD1BFD8 on timeline 66
postgres=# select pg_wal_lsn_diff('0/5BDA8EB8','0/5AD1BFD8');
pg_wal_lsn_diff
----------------
17354464

In this case it is the position 0/5BDA8EB8. The second position, 0/5AD1BFD8, is needed to find how far the old master diverged from the new one. Using the pg_wal_lsn_diff function we compare these two positions and get about 17 MB.

Whether losing 17 MB of data is a lot, everyone decides for themselves. For some it is insignificant; for others it is unacceptably much. Everyone determines this individually according to the needs of the business.

Solution


First of all, you need to decide whether automatic start of Patroni after a system reboot is always needed. Most often you should first go to the old master, see how far it has diverged from the current state, and perhaps inspect the transaction log segments. You need to understand whether this data can be lost, or whether the old master should be started in standalone mode to pull the data out. Only after that do you decide what to do with the data and reconnect the node as a replica to the cluster.

In addition, there is the maximum_lag_on_failover parameter; by default its value is 1 MB. It works like this: if a replica's replication lag exceeds 1 MB, that replica does not take part in the election. When a failover occurs, Patroni looks at which replicas are lagging, and those that have fallen behind by a large amount of transaction log cannot become the master. This is a good safety feature that keeps you from losing too much data.

But there is a problem: the replication lag is updated at a certain interval (the default ttl is 30 s). It is quite possible that the lag value recorded for the replicas in the DCS is one thing while the actual lag is completely different, or there is no lag at all. It is not a real-time value, it does not always reflect the real picture, and it is not worth tying complex logic to it.

The risk of "losses" always remains:

  • In the worst case: maximum_lag_on_failover + ttl.
  • In the average case: maximum_lag_on_failover + (loop_wait / 2).

When you plan to implement Patroni and evaluate how much data you can afford to lose, take these formulas into account to get a rough estimate of the possible losses.

The good news is that the "losses" may include WAL generated by background processes. That data can easily be ignored and lost; it is not a problem.
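
To tie the formulas to concrete settings, here is a sketch with the default values and a rough estimate in the comments; the numbers are illustrative, and the real figure depends on how quickly the master generates WAL:

# Dynamic configuration in the DCS, defaults shown.
ttl: 30
loop_wait: 10
maximum_lag_on_failover: 1048576   # bytes, i.e. 1 MB
# Worst case  : ~1 MB of already reported lag plus the WAL written during one ttl (~30 s).
# Average case: ~1 MB plus the WAL written during loop_wait / 2 (~5 s).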

This is what the logs look like when maximum_lag_on_failover is set, a failover has occurred, and a new master needs to be chosen:

pgdb-1 patroni[6202]: INFO: Lock owner: None; I am pgdb-1
pgdb-1 patroni[6202]: INFO: not healthy enough for leader race
pgdb-1 patroni[6202]: INFO: changing primary_conninfo and restarting in progress
...
pgdb-1 patroni[6202]: INFO: following a different leader because i am not the healthiest node
pgdb-1 patroni[6202]: INFO: following a different leader because i am not the healthiest node

The replica simply sees that it is not healthy enough for leader race and declines to participate in the election. It just waits for a new master to be chosen so it can connect to it. This is an additional measure against data loss.

Problem 5. Disks


The product team wrote that the application was having problems working with PostgreSQL. At the same time it was impossible to log in to the master - it was not reachable over SSH - yet autofailover did not happen either. The host was then forcibly rebooted, which finally triggered the failover, although a manual failover could have been done instead.

After the reboot, we go to see what happened to the master.



We already knew about the disk problems; monitoring told us where to dig.

In the PostgreSQL logs, we see the following:

[COMMIT] LOG: duration: 1138.868 ms statement: COMMIT
...
[] WARNING: autovacuum worker started without a worker entry
...
[SELECT] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp11247.983", size 532996096
...

All the telltale signs of disk problems are here: commits that take over a second, autovacuum running unusually long and behaving strangely, and temporary files on disk.

We looked at dmesg, the kernel message log, and saw a problem with one of the disks:

md/raid10:md2: sde3: rescheduling sector 351273392
blk_update_request: I/O error, dev sde, sector 64404728
md/raid10:md2: sde3: rescheduling sector 63161592
blk_update_request: I/O error, dev sde, sector 64404760
...
md2 : active raid10 sda3[0] sdc3[2] sdd3[3] sdb3[1] sdh3[7] sdf3[5] sdg3[6]
      15623340032 blocks super 1.2 512K chunks 2 near-copies [8/7] [UUUUUUU_]
      bitmap: 58/59 pages [232KB], 131072KB chunk

The server's disk subsystem was a software RAID of 8 disks, but one of them was missing: the line sda3[0] sdc3[2] sdd3[3] sdb3[1] sdh3[7] sdf3[5] sdg3[6] lacks sde3[4]. Roughly speaking, one disk had dropped out, which caused the disk problems, and the application in turn had problems working with the PostgreSQL cluster.

Solution


In this case Patroni cannot help, because it is not Patroni's job to monitor the health of the server and its disks. Throughout the problems Patroni kept interacting with the DCS cluster and saw no difficulties. Situations like this require external monitoring.

Problem 6. Cluster simulator


This is one of the strangest problems. I studied it for a very long time, re-read a lot of logs, and called it the "Cluster simulator".

The problem was that the old master could not become a normal replica. Patroni started it and showed the node as present in the cluster as a replica, but it was not a normal replica - you will now see why.

It all started, as in the previous case, with disk problems:

14:48:55.601 [COMMIT] LOG: duration: 1478.118 ms statement: COMMIT
14:48:56.287 [COMMIT] LOG: duration: 1290.203 ms statement: COMMIT
14:48:56.287 [COMMIT] LOG: duration: 1778.465 ms statement: COMMIT
14:48:56.287 [COMMIT] LOG: duration: 1778.449 ms statement: COMMIT

There were broken connections:

14:48:58.078 [idle in transaction] LOG: could not send data to client: Broken pipe
14:48:58.078 [idle] LOG: could not send data to client: Broken pipe
14:48:58.078 [idle] FATAL: connection to client lost
14:48:58.107 [idle in transaction] FATAL: connection to client lost

There were long waits for responses and locks of varying severity:

14:49:26.929 [UPDATE waiting] LOG: process 4298 acquired ExclusiveLock on tuple (2,10) of relation 52082 of database 50587 after 52487.463 ms
14:49:26.929 [UPDATE waiting] STATEMENT: UPDATE sessions SET lastaccess='1565005714' WHERE sessionid=...
14:49:27.929 [UPDATE waiting] LOG: process 4298 still waiting for ShareLock on transaction 364118337 after 1000.088 ms
14:49:27.929 [UPDATE waiting] DETAIL: Process holding the lock: 4294. Wait queue: 4298.

In general, there are obvious problems with the disks, including temporary files again.

But the most mysterious thing for me was the immediate shutdown request that arrived:

14:49:34.102 MSK 5335 @ from [] LOG: received immediate shutdown request
14:49:34.689 [authentication] WARNING: terminating connection because of crash of another server process
14:49:34.689 [authentication] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory

PostgreSQL has three shutdown modes:

  • Graceful (smart), when we wait for all clients to disconnect on their own.
  • Fast, when we tell clients to disconnect because we are about to shut down.
  • Immediate, which does not ask clients to disconnect at all, but simply shuts down and sends an RST to all clients (a TCP signal that the connection has been interrupted).

PostgreSQL background processes do not send immediate shutdown requests to each other; they only react to them. This is an emergency shutdown, and who sent it is unclear. If it had been kill -9, I would have seen it in the logs, but there was nothing there.

Digging further, I saw that Patroni had not written to its log for quite some time - there were simply no messages for 54 seconds. During this time one of the replicas performed a promote and autofailover occurred:

pgsql03 patroni: 14:48:25,000 INFO: Lock owner: pgsql03; I am pgsql03 
pgsql03 patroni: 14:48:25,013 INFO: no action.  i am the leader with the lock 
pgsql03 patroni: 14:48:37,159 INFO: Lock owner: pgsql03; I am pgsql03 
pgsql03 patroni: 14:49:31,155 WARNING: Exception happened during processing of request from 10.1.0.12 
pgsql03 patroni: 14:49:31,297 WARNING: Exception happened during processing of request from 10.1.0.11 
pgsql03 patroni: 14:49:31,298 WARNING: Exception happened during processing of request from 10.1.0.11

Here again Patroni did its job: the old master was unavailable, so the election of a new master began.

pgsql01 patroni: 14:48:57,136 INFO: promoted self to leader by acquiring session lock
pgsql01 patroni: server promoting
pgsql01 patroni: 14:48:57,214 INFO: cleared rewind state after becoming the leader
pgsql01 patroni: 14:49:05,013 INFO: Lock owner: pgsql01; I am pgsql01
pgsql01 patroni: 14:49:05,023 INFO: updated leader lock during promote

pgsql01 became the new leader, but the second replica had problems. It honestly tried to reconfigure itself:

pgsql02 patroni: 14:48:57,124 INFO: Could not take out TTL lock 
pgsql02 patroni: 14:48:57,137 INFO: following new leader after trying and failing to obtain lock 
pgsql02 patroni: 14:49:05,014 INFO: Lock owner: pgsql01; I am pgsql02 
pgsql02 patroni: 14:49:05,025 INFO: changing primary_conninfo and restarting in progress 
pgsql02 patroni: 14:49:15,011 INFO: Lock owner: pgsql01; I am pgsql02
pgsql02 patroni: 14:49:15,014 INFO: changing primary_conninfo and restarting in progress 
pgsql02 patroni: 14:49:25,011 INFO: Lock owner: pgsql01; I am pgsql02 
pgsql02 patroni: 14:49:25,014 INFO: changing primary_conninfo and restarting in progress 
pgsql02 patroni: 14:49:35,011 INFO: Lock owner: pgsql01; I am pgsql02 
pgsql02 patroni: 14:49:35,014 INFO: changing primary_conninfo and restarting in progress

It tried to change recovery.conf, restart PostgreSQL, and connect to the new master. Every 10 seconds there are messages saying it is trying but cannot.

Meanwhile, that same immediate shutdown signal arrived at the old master. The old master began an emergency restart, and recovery stopped as well. The replica could not connect to the master because it was in shutdown mode.

14:49:34.293 [idle] LOG:  received replication command: IDENTIFY_SYSTEM 
WARNING:  terminating connection because of crash of another server process 
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 
14:49:35.232 FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly             
        This probably means the server terminated abnormally 
        before or while processing the request. 
14:49:35.232 LOG:  record with incorrect prev-link 142D46/315602C at 14CF/CF38C160 
14:49:35.305 FATAL: could not connect to the primary server: FATAL: the database system is shutting down 
14:49:40.241 FATAL: could not connect to the primary server: FATAL: the database system is shutting down

At some point the replica came up, but replication did not start.

14:50:14.024 [] LOG:  record with incorrect prev-link 142D46/315602C at 14CF/CF38C160 
14:50:14.028 [] LOG:  fetching timeline history file for timeline 72 from primary server 
14:50:14.104 [] FATAL:  could not start WAL streaming: ERROR:  requested starting point 14CF/CF000000 on timeline 71 is not in this server's history        
DETAIL:  This server's history forked from timeline 71 at 14CF/CEC32E40. 
14:50:14.104 [] LOG:  new timeline 72 forked off current database system timeline 71 before current recovery point 14CF/CF38C160

I have only one hypothesis: recovery.conf contained the address of the old master, so even after the new master had appeared, the second replica kept trying to connect to the old one. When Patroni started on the second replica, the node came up but could not connect via replication. A replication lag built up that looked something like this:

+-----------------+----------+--------------+--------+---------+-----------+
|     Cluster     |  Member  |     Host     |  Role  |  State  | Lag in MB |
+-----------------+----------+--------------+--------+---------+-----------+
| patroni_cluster |  pgsql01 | 10.2.200.151 | Leader | running |       0.0 |
| patroni_cluster |  pgsql02 | 10.2.200.152 |        | running |    9153.0 |
| patroni_cluster |  pgsql03 | 10.2.200.153 |        | running |       0.0 |
+-----------------+----------+--------------+--------+---------+-----------+

That is, all three nodes were in place, but the second one was lagging. Replication could not start because the transaction logs differed: the transaction logs offered by the master specified in recovery.conf simply did not match the current node. PostgreSQL reported an error every 5 seconds:

14:50:44.143 FATAL:  could not start WAL streaming: ERROR:  requested starting point 14CF/CF000000 on timeline 71 is not in this server's history        
         DETAIL:  This server's history forked from timeline 71 at 14CF/CEC32E40. 
14:50:44.143 LOG:  new timeline 72 forked off current database system timeline 71 before current recovery point 14CF/CF38C160

Here I made a mistake: I did not test my hypothesis that we were connecting to the wrong master. I simply restarted Patroni on the replica. Honestly, I had already written the replica off and thought I would have to rebuild it, but decided to try restarting it anyway.

15:14:13.511 LOG: consistent recovery state reached at 14CF/A3F657B0
15:14:13.511 LOG: database system is ready to accept read only connections

Recovery started, the database even opened and was ready to accept connections, and replication started:

15:14:17.072 LOG: record with incorrect prev-link 142D46/315602C at 14CF/CF38C160
15:14:17.077 LOG: started streaming WAL from primary at 14CF/CF000000 on timeline 72
15:14:17.536 LOG: invalid record length at 14CF/CF38C160: wanted 24, got 1

But a minute later it fell off with the error terminating walreceiver process due to administrator command: the replica decided the transaction logs did not suit it.

15:15:27.823 FATAL: terminating walreceiver process due to administrator command
15:15:27.895 LOG: invalid record length at 14CF/CF38C160: wanted 24, got 1
15:15:27.895 LOG: invalid record length at 14CF/CF38C160: wanted 24, got 1

Again, I did not restart PostgreSQL itself - I restarted Patroni in the hope that it would magically bring the database up. Replication started again, but the database opened at the same place:

15:17:33.553 LOG: consistent recovery state reached at 14CF/A3F657B0
15:17:33.554 LOG: database system is ready to accept read only connections

The transaction log positions were different from the previous start attempt - this time the WAL position was earlier:

15:17:37.299 LOG: invalid contrecord length 5913 at 14CF/CEFFF7B0
15:17:37.304 LOG: started streaming WAL from primary at 14CF/CE000000 on timeline 72

The replication stopped again, and the error message was different and again not very informative:

15:18:12.208 FATAL: terminating walreceiver process due to administrator command
15:18:12.240 LOG: record with incorrect prev-link 60995000/589DF000 at 14CF/CEFFF7B0
15:18:12.240 LOG: record with incorrect prev-link 60995000/589DF000 at 14CF/CEFFF7B0

As an experiment I restarted it again; the database opened at the same place:

15:21:25.135 LOG: consistent recovery state reached at 14CF/A3F657B0
15:21:25.135 LOG: database system is ready to accept read only connections

Then an idea came to me: what if, while restarting PostgreSQL, I issued a checkpoint on the current master to move the point in the transaction log slightly forward, so that recovery would start from a different moment.

I started Patroni, issued a couple of checkpoints on the master, and got a couple of restartpoints on the replica when it opened:

15:22:43.727 LOG: invalid record length at 14D1/BCF3610: wanted 24, got 0
15:22:43.731 LOG: started streaming WAL from primary at 14D1/B000000 on timeline 72

It worked: replication started from a different place and no longer broke. But for me this remains one of the most mysterious problems, and I am still racking my brains over it - especially that weird immediate shutdown request.

From this we can conclude: Patroni can work exactly as planned and without errors, but that is not an absolute guarantee that everything is actually in order. After a failover you should always double-check that everything is fine with the cluster: the right number of replicas, no replication lag.

Summary


Based on these and many other similar problems, I formulated general recommendations that I advise you to keep in mind when operating Patroni.

  • When you use Patroni, you must have monitoring. You always need to know when an autofailover occurred, because if you do not know that a failover has happened, you do not control the cluster.
  •  After each failover, always check the cluster. You must make sure that: the expected number of replicas is present; there is no replication lag; there are no errors in the logs related to streaming replication, Patroni, or the DCS.

Patroni is a very good tool, but whatever you say, it is not a silver bullet. Automation can complete successfully while the objects it manages are in a half-working state. Either way, you need a good understanding of how PostgreSQL and replication work, how Patroni manages PostgreSQL, and how the nodes interact. This is necessary so that you can repair the problems that arise with your own hands.

Diagnostic steps


It so happens that we work with many different clients, most of whom do not have an ELK stack, and we have to sort out the logs by opening 2 tabs and 6 consoles: in one tab, Patroni for each node; in the other, the Consul or PostgreSQL logs. Diagnosing all of this is difficult.

I have developed the following approach. I always look at when the failover occurred; for me it is a kind of watershed. I look at what happened before, during, and after the failover. A failover has two timestamps, start and end. In the logs I look for what preceded the failover, that is, for the causes. This gives a picture of what happened and of what can be done in the future so that a failover does not occur under the same circumstances.

For this, I look:

  • First of all, the Patroni logs.
  • Then the PostgreSQL logs and the DCS logs, to correlate them with what was happening in Patroni.
  • And the system logs - they sometimes reveal the underlying cause (disks, network, and so on).


There are many other autofailover products: Stolon, repmgr, pg_auto_failover, PAF. I have tried all four, and in my opinion Patroni is the best on the market today.

Do I recommend Patroni? Definitely yes, because I like Patroni and I think I have learned how to cook it.

If you are interested in what other problems exist with Patroni besides those described in this article, you can always visit https://github.com/zalando/patroni/issues. There are many different stories there. Even though half of them come from careless users asking questions without bothering to do a simple search, interesting problems are discussed there too, and when necessary the discussions result in bug-fix tickets.

Thanks to Zalando for developing this project, as well as to the two people who started working on this product: Alexander Kukushkin and Alexey Klyukin. Thanks a lot to all Patroni contributors.

After this talk we recorded a short interview in which Alexey tried to condense his report into a single piece of advice and explained why it is worth participating in and speaking at conferences. Take it on board and come test the approach at Saint HighLoad++.


