Monitoring a PostgreSQL database: who is to blame, and what to do

I have already described how we "catch" PostgreSQL problems by mass log monitoring across hundreds of servers at once. But besides the logs, this DBMS provides many tools for analyzing its own state, and it would be a shame not to use them.

True, if you simply look at them from the console, you will very quickly lose your bearings without gaining anything, because the amount of data available exceeds all reasonable limits.


Therefore, to keep the situation under control, we developed an add-on for Zabbix that collects metrics, builds screens, and applies uniform monitoring rules to all servers and the databases on them.

Today's article is about the conclusions that can be drawn by watching the various metrics of PostgreSQL databases over time, and where a problem may be hiding.

Connection status


The very first thing that every investigation of "what happened to the database / why was it slow" begins with is monitoring the summary state of pg_stat_activity:



On the left graph we see all the connections that are waiting for something; on the right, those that are doing something. Depending on the PG version, a connection's status is determined from pg_stat_activity.state/wait_event and/or the text of the query itself.

What to look for:

  • Too little idle - at some point your application may not have enough already-open connections to the database, and when it tries to open yet another one, it will have to wait while a new backend process is initialized to serve it.
  • Too much idle - "spare" connections also consume server resources, and with enough of them you can run into max_connections.
  • idle in transaction - connections that opened a transaction and then forgot about it; usually they are left behind by the application, which is one more reason to put pgbouncer in front of the database.

    In any case, it makes sense to limit the lifetime of such states with idle_in_transaction_session_timeout.
  • wait - connections "hanging" in expectation of something, most often locks taken by someone else. Each of these is a query that is making no progress.

    If a lock cannot simply be waited out, the "offending" backend can be terminated with pg_terminate_backend(pid).
  • Growth of active (toward its max value) means the server lacks some resource and queries are "piling up". Sometimes the cause is external (say, a "stuck" dependent service), sometimes it is the queries themselves, sometimes...

    Either way, it is a signal that requests are being served more slowly than they arrive.
  • maintenance - connections running one of the service operations matched by:

    query ~* E'^(\\s*(--[^\\n]*\\n|/\\*.*\\*/|\\n))*(autovacuum|VACUUM|ANALYZE|REINDEX|CLUSTER|CREATE|ALTER|TRUNCATE|DROP)'

    In most cases this will simply be the number of autovacuum/autoanalyze workers running at the same time, whose only harm is spending server resources on "extraneous" work. If that is critical for you, tune autovacuum_max_workers and autovacuum_naptime, but turning autovacuum off completely is something you should not do.

    But if wait and maintenance begin to grow at the same time, there is a chance that a DBA or a developer has decided to roll something out, for example blocking half of the application's functionality.
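To illustrate the idle in transaction point above, here is a minimal sketch that is not part of our add-on: it lists sessions stuck in a transaction longer than a threshold. The 5-minute interval is an arbitrary example value.

```sql
-- Sessions that have been "idle in transaction" longer than 5 minutes
-- (the threshold is an arbitrary example value).
SELECT
  pid
, datname
, usename
, now() - state_change AS stuck_for
, left(query, 60) AS last_query
FROM
  pg_stat_activity
WHERE
  state IN ('idle in transaction', 'idle in transaction (aborted)') AND
  state_change < now() - interval '5 minutes'
ORDER BY
  stuck_for DESC;

-- Any offender found this way can then be terminated manually:
-- SELECT pg_terminate_backend(pid);
```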

Since it is important for us not only to collect many metrics but to do it as efficiently as possible, we gather a number of them synchronously within a single query:

Connection and lock status
WITH event_types(wait_event_type) AS(
  VALUES
    ('lwlock')
  , ('lock')
  , ('bufferpin')
  , ('client')
  , ('extension')
  , ('ipc')
  , ('timeout')
  , ('io')
)
, events(wait_event) AS(
  VALUES
    ('walwritelock')
  , ('wal_insert')
  , ('buffer_content')
  , ('buffer_io')
  , ('lock_manager')
  , ('relation')
  , ('extend')
  , ('page')
  , ('tuple')
  , ('transactionid')
  , ('virtualxid')
  , ('speculative token')
  , ('object')
  , ('userlock')
  , ('advisory')
  , ('clientread')
  , ('datafileextend')
  , ('datafileread')
  , ('datafilewrite')
  , ('slruread')
  , ('slruwrite')
)
, states(state) AS(
  VALUES
    ('running')
  , ('maintenance')
  , ('waiting')
  , ('transaction')
  , ('idle')
)
, stats AS(
  SELECT
    pid
  , datname
  , state
  , lower(wait_event_type) wait_event_type
  , lower(wait_event) wait_event
  , query
  FROM
    pg_stat_activity
  WHERE
    pid <> pg_backend_pid()
)
, dbs AS(
  SELECT
    datname
  FROM
    pg_database db
  WHERE
    NOT db.datistemplate
)
  SELECT
    date_part('epoch', now())::integer ts
  , coalesce(s.qty, 0) val
  , dbs.datname dbname
  , states.state
  , true total
  FROM
    dbs
  CROSS JOIN
    states
  NATURAL LEFT JOIN
    (
      SELECT
        datname
      , CASE
          WHEN query ~* E'^(\\s*(--[^\\n]*\\n|/\\*.*\\*/|\\n))*(autovacuum|VACUUM|ANALYZE|REINDEX|CLUSTER|CREATE|ALTER|TRUNCATE|DROP)' THEN
            'maintenance'
          WHEN wait_event IS NOT NULL AND
            wait_event <> 'clientread' AND
            state = 'active' THEN
            'waiting'
          WHEN state = 'active' THEN
            'running'
          WHEN state = 'idle' THEN
            'idle'
          WHEN state IN ('idle in transaction', 'idle in transaction (aborted)') THEN
            'transaction'
          WHEN state = 'fastpath function call' THEN
            'fastpath'
          ELSE
            'disabled'
        END state
      , count(*) qty
      FROM
        stats
      GROUP BY
        1, 2
    ) s
UNION
  SELECT
    date_part('epoch', now())::integer ts
  , coalesce(t.qty, 0) val
  , dbs.datname dbname
  , event_types.wait_event_type
  , false total
  FROM
    dbs
  CROSS JOIN
    event_types
  NATURAL LEFT JOIN
    (
      SELECT
        datname
      , wait_event_type
      , count(*) qty
      FROM
        stats
      WHERE
        wait_event_type IS NOT NULL
      GROUP BY
        1, 2
    ) t
UNION
  SELECT
    date_part('epoch', now())::integer ts
  , coalesce(e.qty, 0) val
  , dbs.datname dbname
  , events.wait_event
  , false total
  FROM
    dbs
  CROSS JOIN
    events
  NATURAL LEFT JOIN
    (
      SELECT
        datname
      , wait_event
      , count(*) qty
      FROM
        stats
      WHERE
        wait_event IS NOT NULL
      GROUP BY
        1, 2
    ) e;

Locks


Since we touched on lock monitoring in the previous section, it is worth noting that PostgreSQL is happy to take locks left and right:



We are most interested in two types:

  • Exclusive - typically taken when locking a specific record.
  • AccessExclusive - taken when performing maintenance operations on a table.

But do not forget that the total number of locks is not unlimited:
Both advisory and regular locks are stored in a shared memory area whose size is determined by the configuration parameters max_locks_per_transaction and max_connections. This memory must be sufficient, because otherwise the server will not be able to grant any more locks. Thus the number of advisory locks a server can hand out is typically limited to tens or hundreds of thousands, depending on the server configuration.
Typically, this situation arises when your application "leaks" and resources are not released: database connections, transaction contexts, or advisory locks. So pay attention to the overall dynamics.
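As a companion to the graphs above, a minimal sketch (not from our add-on) that shows which sessions are currently waiting for a lock and which backends are blocking them; note the assumption that pg_blocking_pids() is available, i.e. PostgreSQL 9.6 or newer:

```sql
-- Who is waiting on a lock, and which backends block them
-- (pg_blocking_pids() requires PostgreSQL 9.6+).
SELECT
  a.pid
, a.datname
, l.locktype
, l.mode
, pg_blocking_pids(a.pid) AS blocked_by
, left(a.query, 60) AS query
FROM
  pg_locks l
JOIN
  pg_stat_activity a USING(pid)
WHERE
  NOT l.granted;
```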

Transactions per second (TPS)


To get information about changes at the level of a specific database, you can use the system view pg_stat_database. But if there are many databases on the server, it is convenient to collect this for all of them at once, while connected to postgres.

TPS & tuples
SELECT
  extract(epoch from now())::integer ts
, datname dbname
, pg_stat_get_db_tuples_returned(oid) tup_returned
, pg_stat_get_db_tuples_fetched(oid) tup_fetched
, pg_stat_get_db_tuples_inserted(oid) tup_inserted
, pg_stat_get_db_tuples_updated(oid) tup_updated
, pg_stat_get_db_tuples_deleted(oid) tup_deleted
, pg_stat_get_db_xact_commit(oid) xact_commit
, pg_stat_get_db_xact_rollback(oid) xact_rollback
FROM
  pg_database
WHERE
  NOT datistemplate;

Let me emphasize separately: do not neglect capturing the max values of your metrics!



This graph clearly shows a sudden peak in the number of committed (commit) transactions. This does not map one-to-one onto server load, and transactions vary in complexity, but a 4-fold growth clearly shows that the server needs a certain performance reserve to survive such a peak without problems.

And a transaction rollback (rollback) is a reason to check whether your application executes ROLLBACK deliberately, or whether the server does it automatically as the result of an error.
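For the rollback check just mentioned, a hedged sketch that computes the share of rolled-back transactions per database. Keep in mind these counters are cumulative since the last statistics reset, so this is a lifetime average, not a momentary rate:

```sql
-- Share of transactions ended by ROLLBACK, per database.
SELECT
  datname
, xact_commit
, xact_rollback
, round(100.0 * xact_rollback / nullif(xact_commit + xact_rollback, 0), 2) AS rollback_pct
FROM
  pg_stat_database
WHERE
  datname IS NOT NULL
ORDER BY
  rollback_pct DESC NULLS LAST;
```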

Number of operations on records


First, pay attention to the records we read from indexes/tables:



  • tuples.returned - how many records were "raised" from the heap and indexes in total.
  • tuples.fetched - how many records survived filtering (Rows Removed by Filter) and were actually "useful".
  • tuples.ratio - the ratio of the two; the closer it stays to 1, the better. Any deviation means the server read many records only to throw most of them away.
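The ratio metric above can be computed straight from the same cumulative pg_stat_database counters; a minimal sketch (values accumulate since the last stats reset, so a monitoring system should graph deltas between samples):

```sql
-- tuples.ratio per database: the closer to 1, the less wasted reading.
SELECT
  s.datname
, s.tup_returned
, s.tup_fetched
, round(s.tup_fetched::numeric / nullif(s.tup_returned, 0), 4) AS ratio
FROM
  pg_stat_database s
JOIN
  pg_database d ON d.datname = s.datname
WHERE
  NOT d.datistemplate;
```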

If you observe a sharp spike on the tuples.ratio graph, you can be confident that somewhere nearby in the log you will find an inefficient query of the kind described in the article about recipes for treating them.

However, even if the ratio is ideally equal to 1 but the peak fell on returned/fetched, do not expect anything good either. Usually it means there is some kind of trouble in the plan, like:

Hash Join
  - Hash
    - Seq Scan on BIG_TABLE
  - Index Scan ...

Merge Join
  - Index Scan on BIG_INDEX
  - Index Scan ...

Since we have started checking what is being read, let's see how it is being read: that is, how many records are read via indexes and how many via Seq Scan:



Clearly, any unplanned growth of these indicators should raise suspicion. For example, if for some reason you need to read an entire 10M-record table every night, then the appearance of such a peak during the day is a reason to investigate.
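To localize such a peak down to a table, a hedged per-table sketch of the same check, run while connected to the suspect database (pg_stat_user_tables covers only that database's ordinary tables):

```sql
-- Tables with the heaviest sequential-scan traffic in the current database.
SELECT
  relname
, seq_scan
, seq_tup_read
, idx_scan
, idx_tup_fetch
FROM
  pg_stat_user_tables
ORDER BY
  seq_tup_read DESC
LIMIT 10;
```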

As should any anomalous mass of inserts/updates/deletes:



Using data cache


To understand how mass reading of records actually worsens the server's life, let's look at the server's work with data pages and the block.read/hit ratio. In an ideal world the server should "read" from disk (shared read on a plan node) absolutely nothing; everything should already be in memory (shared hit), since accessing the disk is always slow.
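The block.read/hit ratio just described can also be pulled from pg_stat_database; a minimal sketch (again cumulative counters, so deltas between samples are what a monitoring system should graph):

```sql
-- Share of page requests served from shared_buffers rather than disk.
SELECT
  datname
, blks_read
, blks_hit
, round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM
  pg_stat_database
WHERE
  datname IS NOT NULL;
```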

In reality this is not quite so, and that is a reason for careful analysis of the queries running around the time of a peak:



Longest Request / Transaction


Because of MVCC, long-running queries and transactions on busy systems are a performance disaster. Details and pictures about this can be read here, and here is how one can still survive under such conditions.



The fields pg_stat_activity.query_start/xact_start help us catch such villains.
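A minimal sketch using those fields: the longest-running transactions currently open, with the age of their current query alongside (the ordering and limit are arbitrary choices):

```sql
-- Currently longest-running transactions and their active queries.
SELECT
  pid
, datname
, now() - xact_start  AS xact_age
, now() - query_start AS query_age
, state
, left(query, 60) AS query
FROM
  pg_stat_activity
WHERE
  xact_start IS NOT NULL AND
  pid <> pg_backend_pid()
ORDER BY
  xact_start
LIMIT 10;
```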

As our experience shows, a visual representation of these metrics is often already enough to get a rough idea of where to "dig" next:

  • look for resource leaks in the application
  • optimize unfortunate queries
  • put in more powerful hardware
  • ... or make sure the load is properly spread out over time
