Monitoring errors and events in the PostgreSQL log (grok_exporter)

Good afternoon, colleagues and Habr readers! Today I would like to share a short note on how to organize operational monitoring of errors and events appearing in the PostgreSQL log using Prometheus and the grok_exporter metrics exporter.

I must say right away that this is, of course, just one particular use case for this exporter. So why is it needed, and who might find it interesting?

Introduction


Who needs this?


This tool is for you if Prometheus is already present in your infrastructure and building a separate stack for collecting and analyzing logs, such as ELK or Graylog, would be expensive and impractical (if there is no Prometheus yet, that's not a problem: it is quick and easy to install).

The tool will be useful not only for DBAs but also for application administrators, and in general for everyone who works with application logs in one way or another. It lets you quickly learn about abnormal behavior and gives you a starting point for further analysis of the situation.

What is all this for?


Monitoring logs gives you a more complete picture and lets you respond more quickly to emerging problems, be it authorization attempts (including failed ones), various errors, or specific events.

There is a good list here, grouped by event category (see the Log Event Categories section); it can be used as a basis for Prometheus rules.

grok_exporter


The grok_exporter metrics exporter reads unstructured data from a source, converts it into structured data according to a set of rules, and exposes it as metrics in a format that Prometheus understands.

The exporter is modeled after the grok filter plugin from ELK, plugins-filters-grok, which makes it possible to use existing sets of grok patterns as they are.

Installation


Installation is straightforward and simple: follow the link, download the version you prefer (and yes, they are all pre-releases, with the latest versions still at the release candidate stage), and unpack the resulting archive.

As a result, we get the following set:

grok_exporter-1.0.0.RC3.linux-amd64
β”œβ”€β”€ patterns
β”‚   β”œβ”€β”€ ruby
β”‚   β”œβ”€β”€ redis
β”‚   β”œβ”€β”€ rails
β”‚   β”œβ”€β”€ postgresql
β”‚   β”œβ”€β”€ nagios
β”‚   β”œβ”€β”€ mongodb
β”‚   β”œβ”€β”€ mcollective-patterns
β”‚   β”œβ”€β”€ mcollective
β”‚   β”œβ”€β”€ linux-syslog
β”‚   β”œβ”€β”€ junos
β”‚   β”œβ”€β”€ java
β”‚   β”œβ”€β”€ haproxy
β”‚   β”œβ”€β”€ grok-patterns
β”‚   β”œβ”€β”€ firewalls
β”‚   β”œβ”€β”€ exim
β”‚   β”œβ”€β”€ bro
β”‚   β”œβ”€β”€ bacula
β”‚   └── aws
β”œβ”€β”€ grok_exporter
└── example
    β”œβ”€β”€ exim-rejected-RCPT-examples.log
    β”œβ”€β”€ config.yml
    └── config_logstash_http_input_ipv6.yml

where:

  • grok_exporter - the exporter executable
  • patterns - a set of grok pattern files
  • example - a test data set and example configuration files

To start the exporter, simply run the executable. The config.yml configuration file is looked up in the directory from which the application is launched, or its location can be specified with the -config option.



Setup and preparation for work


PostgreSQL


First, let's prepare PostgreSQL itself. We need to:


  1. Make the log files readable by the user that runs grok_exporter. After changing the settings, reload the configuration, for example with pg_reload_conf().

    • alter system set log_directory to '/var/log/postgresql'; 
    • alter system set log_file_mode to '0644';
  2. Set log_line_prefix so that each line carries enough context and is convenient to parse. For example:

     alter system set log_line_prefix to '%t datname:%d,client:%h,app:%a,usename:%u,session:%c ';
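The temporary-file metric shown later relies on PostgreSQL actually writing "temporary file" lines to the log. A minimal sketch of the additional setting this assumes (not part of the original setup; adjust the threshold to your needs):

```sql
-- Assumed additional setting (sketch): log every temporary file created.
-- A value of 0 logs all temporary files regardless of size.
alter system set log_temp_files to 0;
select pg_reload_conf();  -- applies without a server restart
```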

grok_exporter


First of all, we edit the patterns/postgresql template to match the format of our log file and add a few auxiliary constructions. Let me stress once again that the template syntax fully matches the one used by plugins-filters-grok, so for any syntax questions you can and should consult its documentation. So, let's bring our PostgreSQL template to the following form (the patterns/grok-patterns file already contains a large set of basic patterns):

# Timestamp format used in the PostgreSQL log
PG_TIMESTAMP %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND} [A-Z]{3,4}

# Error levels we want to monitor
PG_ERROR_PATTERN (ERROR|WARNING|FATAL|PANIC)

# log_line_prefix of our PostgreSQL log
PG_PREFIX %{PG_TIMESTAMP} datname:(%{DATA:datname})?,client:(%{DATA:client_host})?,app:(%{DATA:app_name})?,usename:(%{USER:usename})?,session:(%{DATA:session})?

# Extracts the log_level field from the matched error level
PG_ERROR_LEVEL %{PG_ERROR_PATTERN:log_level}:

# Successful authorization event
PG_EVENT_AUTH LOG:  connection authorized:

# SIGHUP received, configuration files reloaded
PG_EVENT_RELOAD LOG:  received SIGHUP, reloading configuration files

# Temporary file created
PG_EVENT_TMP_FILE LOG:  temporary file: path "%{DATA:file_path}", size %{NUMBER:size}

The construction ( ... )? marks the enclosed value as optional.
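A grok pattern ultimately compiles to a regular expression with named capture groups, and those captures are what become metric labels. The sketch below uses a hand-written approximation of PG_PREFIX (not the exact regex grok produces) to show the fields extracted from a sample log line:

```python
import re

# Hand-expanded approximation of the PG_PREFIX grok pattern.
# Named groups correspond to the labels used in the metrics later on.
PG_PREFIX_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [A-Z]{3,4}) "
    r"datname:(?P<datname>[^,]*),"
    r"client:(?P<client_host>[^,]*),"
    r"app:(?P<app_name>[^,]*),"
    r"usename:(?P<usename>[^,]*),"
    r"session:(?P<session>\S*)"
)

line = ('2020-04-20 23:34:53 AEST datname:test,client:127.0.0.1,'
        'app:[unknown],usename:test_user,session:5e9da4fd.733 '
        'FATAL:  password authentication failed for user "test_user"')

m = PG_PREFIX_RE.match(line)
print(m.group("datname"), m.group("usename"), m.group("client_host"))
# test test_user 127.0.0.1
```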


The process of creating a configuration file is described there in sufficient detail, with examples. Below, only what will actually be used is covered.


A brief description of the configuration file (from the documentation)
The configuration is a YAML file with the following sections:

  1. global - general settings:
    • config_version - configuration syntax version. For grok_exporter ≥ 1.0.0.RC3 it is 3;
    • retention_check_interval - how often the exporter checks metrics for expired values. The default is usually sufficient.

  2. input - describes the data source.

    • type - source type. Possible values: file, stdin, webhook. Here we use file;
    • path|paths - path(s) to the log file(s); glob masks are supported;
    • readall - whether to read the file from the beginning.
      true - on every restart the whole file is re-read, so Prometheus will receive old values again.
      false - only lines appended after grok_exporter starts are processed;
    • fail_on_missing_logfile - behavior when the log file is missing. With true, the exporter refuses to start if the file does not exist;
  3. imports - imports grok patterns and metric definitions from external files or directories. Appeared in syntax version 3; previously only grok patterns could be imported.

    • type - grok_patterns (patterns) or metrics (metric definitions);
    • file|dir - path to the file or directory to import from.

  4. grok_patterns - a list of grok patterns defined inline.
  5. metrics - metric definitions. Supported types are counter, gauge, histogram, or summary.
  6. server - how the metrics endpoint is exposed:
      protocol: https
      host: localhost
      port: 9144
      path: /metrics
      cert: /path/to/cert
      key: /path/to/key
    



In syntax version 3 of the configuration file it became possible to import metrics from separate files, which allows grouping metrics by purpose.

Next, create the base config.yml configuration file. In it we: set the path to the PostgreSQL log files by mask; import patterns from the patterns directory and metrics from the metrics.d directory; and specify the protocol and port on which metrics will be served.

config.yml
global:
  config_version: 3
  retention_check_interval: 10s
input:
  type: file
  path: /var/log/postgresql/postgresql-*.log
  readall: false
imports:
- type: grok_patterns
  dir: ./patterns
- type: metrics
  dir: ./metrics.d
server:
  protocol: http
  port: 9144


Then, in the metrics.d directory, create a file postgresql.yml describing the metrics themselves.

Note that metric values with a given label combination are removed from the output once the retention period expires (the per-metric retention parameter; by default it is 2 hours 30 minutes). How often this cleanup runs is controlled by retention_check_interval.

When describing metrics, the following is worth keeping in mind:

  • a metric is updated for every log line that matches its match pattern;
  • of the available types, GAUGE and COUNTER are the most useful here; unlike a counter, a GAUGE can take its value directly from data in the log line;
  • If cumulative is set to true, the values of the metrics with the same set of labels will be summed, which is the expected behavior. An unexpected situation may arise when, at the second scrape, you get the sum of the new values plus the previous value.
  • If cumulative is set to false and there are several matches with the same set of labels, only the last value will be exposed;

That came out wordy and confusing in places, but I will try to unpack it in the examples below.

So, we have a configuration file that exposes four metrics: three counters and one gauge. The first three count the number of log lines matching the match pattern; the fourth reports the sum of values extracted from the log.

postgresql.yml
# Errors in the log
- type:   counter
  name:   pg_grok_error_count
  help:   Count of line error
  match:  '%{PG_PREFIX} %{PG_ERROR_LEVEL}  %{GREEDYDATA:message}'
  labels:
    log_level:   '{{.log_level}}'
    usename:     '{{.usename}}'
    datname:     '{{.datname}}'
    app_name:    '{{.app_name}}'
    client_host: '{{.client_host}}'
    message:     '{{.message}}'

# Successful authorizations
- type:   counter
  name:   pg_grok_auth_success_count
  help:   Count of connections authorized
  match:  '%{PG_PREFIX} %{PG_EVENT_AUTH}'
  labels:
    usename:     '{{.usename}}'
    datname:     '{{.datname}}'
    app_name:    '{{.app_name}}'
    client_host: '{{.client_host}}'

# Configuration reloads
- type:   counter
  name:   pg_grok_reload_conf_count
  help:   Count of call pg_reload_conf
  match:  '%{PG_PREFIX} %{PG_EVENT_RELOAD}'
  labels:
    session:     '{{.session}}'

# Size of temporary files
- type:       gauge
  name:       pg_grok_tpm_file_size_bytes
  help:       Size of tmp file created by time
  match:      '%{PG_PREFIX} %{PG_EVENT_TMP_FILE}'
  value:      '{{.size}}'
  cumulative: true
  retention:  5s
  labels:
    static:     'temp_file'


Practice


pg_grok_error_count


The metric counts ERROR, WARNING, FATAL, and PANIC events according to the pattern. It is useful for monitoring and alerting in emergencies. For example, you can configure alerts for failed authorization attempts, for exceeding a threshold of errors per unit of time, or for the database entering recovery after a crash, and much more (Log Event Categories).

For example, let's set up an alert on failed authorization attempts. A failed login attempt looks like this in the PostgreSQL log file:

2020-04-20 23:34:53 AEST datname:test,client:127.0.0.1,app:[unknown],usename:test_user,session:5e9da4fd.733 FATAL:  password authentication failed for user "test_user"
2020-04-20 23:34:53 AEST datname:test,client:127.0.0.1,app:[unknown],usename:test_user,session:5e9da4fd.733 DETAIL:  Password does not match for user "test_user".
        Connection matched pg_hba.conf line 86: "host    all             all             127.0.0.1/32            md5"

In the output of grok_exporter, failed authentication attempts can be identified by the message label containing the substring password authentication failed for ... :

pg_grok_error_count{app_name="[unknown]",client_host="127.0.0.1",datname="postgres",log_level="FATAL",message="password authentication failed for user \"postgres\"",usename="postgres"} 15
pg_grok_error_count{app_name="[unknown]",client_host="127.0.0.1",datname="test",log_level="FATAL",message="password authentication failed for user \"test_user\"",usename="test_user"} 5

The data obtained help formulate a rule for Prometheus. The sensitivity can be tuned to your own realities.

groups:
- name: AuthAlert
  rules:
  - alert: AuthFail
    expr: sum(rate(pg_grok_error_count{message=~"password authentication failed for user.*"}[1m])) by (instance) > 0.1
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: Too many failed authorizations for {{ $labels.instance }}

pg_grok_reload_conf_count


Similarly, you can track executions of the pg_reload_conf command. Note the list of labels, namely that only the session label is used. This is because the event is generated by the server process, not by a user session.

In the PostgreSQL log file, this event looks like this:

2020-04-21 01:20:26 AEST datname:,client:,app:,usename:,session:5e9d9371.564 LOG:  received SIGHUP, reloading configuration files

That is, the labels used in the previous example are empty here. Instead, we identify the event by the session ID, which does not change until the PostgreSQL instance is stopped or restarted.

The situation is similar for other server-level events, for example, stopping an instance:

2020-04-21 01:32:52 AEST datname:,client:,app:,usename:,session:5e9d9371.564 LOG:  received fast shutdown request
2020-04-21 01:32:53 AEST datname:,client:,app:,usename:,session:5e9d9371.564 LOG:  database system is shut down

In the output of grok_exporter, we get:

# HELP pg_grok_reload_conf_count Count of call pg_reload_conf
# TYPE pg_grok_reload_conf_count counter
pg_grok_reload_conf_count{session="5e9d9371.564"} 5

There is no point in showing the alerting rule here; it would be similar to the one discussed above.

pg_grok_tpm_file_size_bytes


I'll say right away that this example is of the "spherical cow in a vacuum" variety; it is given mainly to show how the behavior of metrics can be changed.

The behavior of gauge metrics can be influenced by the retention and cumulative parameters. Unlike a regular counter, which simply counts the lines matching the match pattern, a gauge can operate on values taken from the log itself. We can either accumulate the value, increasing it by each matched amount, or treat it as an arbitrarily changing quantity. In the first case we are interested in the increment; in the second, in the value itself.

Let's look at a few examples:

  1. Sum the values and reset them between scrapes (cumulative: true with a short retention). For each scrape to see only the values accumulated since the previous one, the condition retention_check_interval + retention <= scrape_interval of Prometheus must hold.

    Suppose grok_exporter reads the following lines from the PostgreSQL log:

    2020-04-21 02:51:15 AEST datname:,client:,app:,usename:,session:5e9dd2f3.1278 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4728.0", size 46931968
    2020-04-21 02:51:15 AEST datname:,client:,app:,usename:,session:5e9dd2f3.1279 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4729.0", size 46276608
    2020-04-21 02:51:15 AEST datname:postgres,client:[local],app:psql,usename:postgres,session:5e9dc0a8.112c LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4396.14", size 47194112
    

    Then the exporter output contains the sum of the three sizes:

    pg_grok_tpm_file_size_bytes{static="temp_file"} 1.40402688e+08
    

    After the retention period expires, the value is removed from the output.

    If a new matching line then appears, the output will contain only the new value:

    pg_grok_tpm_file_size_bytes{static="temp_file"} 1.1911168e+07
    

    Result: within the scrape interval the values are summed, and after retention expires they are dropped.
  2. Sum the values without expiration: no retention set, cumulative: true.
    As before, suppose the following lines appear in the PostgreSQL log:


    2020-04-21 03:03:40 AEST datname:,client:,app:,usename:,session:5e9dd5dc.12c6 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4806.0", size 46260224
    2020-04-21 03:03:40 AEST datname:,client:,app:,usename:,session:5e9dd5dc.12c5 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4805.0", size 46833664
    2020-04-21 03:03:40 AEST datname:postgres,client:[local],app:psql,usename:postgres,session:5e9dc0a8.112c LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4396.15", size 47316992
    

    The exporter output again contains the sum:

    pg_grok_tpm_file_size_bytes{static="temp_file"} 1.40402688e+08
    

    Since no retention is set, the value is never removed. When more lines appear in the log:

    2020-04-21 03:10:40 AEST datname:,client:,app:,usename:,session:5e9dd76e.1325 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4901.0", size 46776320
    2020-04-21 03:10:40 AEST datname:,client:,app:,usename:,session:5e9dd76e.1324 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4900.0", size 45768704
    2020-04-21 03:10:40 AEST datname:postgres,client:[local],app:psql,usename:postgres,session:5e9dc0a8.112c LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4396.18", size 47841280
    

    the output grows by their sum:

    pg_grok_tpm_file_size_bytes{static="temp_file"} 2.80772608e+08
    

    Thus, the value keeps growing with each matched line, giving counter-like behavior with increments taken from the log.
  3. With cumulative set to false, only the last matched value is exposed.

    For the following log lines:

    2020-04-21 03:41:04 AEST datname:,client:,app:,usename:,session:5e9ddea4.1393 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp5011.0", size 11763712
    2020-04-21 03:41:04 AEST datname:,client:,app:,usename:,session:5e9ddea4.1392 LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp5010.0", size 11501568
    2020-04-21 03:41:04 AEST datname:postgres,client:[local],app:psql,usename:postgres,session:5e9dc0a8.112c LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4396.19", size 11911168
    

    the output contains only the value from the last line:

    # HELP pg_grok_tpm_file_size_bytes Size of tmp file created by time
    # TYPE pg_grok_tpm_file_size_bytes gauge
    pg_grok_tpm_file_size_bytes{static="temp_file"} 1.1911168e+07
    

    That is, each new matched value simply replaces the previous one. This mode is convenient when the current value matters, rather than an accumulated total.
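The three modes above can be summarized with a toy simulation (an illustrative model, not grok_exporter code; the class and its fields are invented for this sketch):

```python
class Gauge:
    """Toy model of the gauge semantics described above: 'cumulative'
    sums matched values; 'retention' drops the stored value on expiry."""

    def __init__(self, cumulative, retention=None):
        self.cumulative = cumulative
        self.retention = retention  # seconds; None means keep forever
        self.value = None
        self.updated = None

    def observe(self, v, now):
        # Drop the stored value if the retention period has expired.
        if (self.retention is not None and self.updated is not None
                and now - self.updated > self.retention):
            self.value = None
        if self.cumulative and self.value is not None:
            self.value += v  # accumulate (examples 1 and 2)
        else:
            self.value = v   # replace (example 3, or first observation)
        self.updated = now

# Example 2: a cumulative gauge sums the three temp-file sizes
g = Gauge(cumulative=True)
for size in (46931968, 46276608, 47194112):
    g.observe(size, now=0)
print(g.value)   # 140402688, i.e. 1.40402688e+08 as in the output above

# Example 1: with retention, a value observed after expiry replaces
# the stale sum instead of adding to it
g.retention = 5
g.observe(11911168, now=60)
print(g.value)   # 11911168

# Example 3: a non-cumulative gauge keeps only the last value
g2 = Gauge(cumulative=False)
for size in (11763712, 11501568, 11911168):
    g2.observe(size, now=0)
print(g2.value)  # 11911168
```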


Undoubtedly, grok_exporter is a useful tool for application monitoring. It lets you look into places that are otherwise hard for monitoring systems to reach and enrich your monitoring with additional (control) metrics, while remaining simple, functional, and lightweight.

It will also be useful for developers, since it gives at least indirect access to application logs through the monitoring system, where metrics from different sources can be correlated.

Another plus is that the project is actively developed and keeps gaining new functionality.

In turn, I hope this short note proves useful, including for the project itself. More stars, more fun work!

Thanks to all!

Source: https://habr.com/ru/post/undefined/

