PostgreSQL 13: parallel VACUUM

The other day Amit Kapila committed a patch by Masahiko Sawada that allows vacuuming in parallel. The table itself is still vacuumed by a single (leader) process, but to vacuum the indexes it can now launch background workers, one per index. In manual mode this speeds up vacuuming of large tables with multiple indexes; autovacuum does not use this feature yet.

As you know, vacuuming a table consists of several phases.

First, the table is scanned and pointers to unneeded ("dead") row versions are collected in memory. The memory is limited by the maintenance_work_mem parameter, so all the pointers may not fit in at once.

Then all indexes are scanned one after another (just as before), and the pointers to the dead row versions that were found are removed from them.

Then the part of the table that has already been scanned is scanned once more, and the dead row versions themselves are removed from it.

If it was not possible to process all the dead row versions in one pass, the memory is freed and the whole process repeats from the place where it stopped last time.
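
As a side note, the progress of these phases can be watched from another session via the pg_stat_progress_vacuum view; for example, something like:

demo=# SELECT relid::regclass, phase, heap_blks_scanned, heap_blks_total,
index_vacuum_count, num_dead_tuples
FROM pg_stat_progress_vacuum;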

This whole scheme remains unchanged, but indexes can now be vacuumed in parallel. To do this, the leader process launches several worker processes, which divide the existing indexes among themselves and process them. Each index is processed by exactly one worker. Once all the indexes are vacuumed, the workers exit and the leader moves on to the next phase.

As an example, let's take the ticket_flights table of the demo flights database. It has one index, but we can create a couple more:

demo=# CREATE INDEX ON ticket_flights (fare_conditions);
demo=# CREATE INDEX ON ticket_flights (amount);

Only indexes whose size exceeds the min_parallel_index_scan_size parameter take part in parallel processing. Our indexes qualify:

demo=# SHOW min_parallel_index_scan_size;
 min_parallel_index_scan_size 
------------------------------
 512kB
(1 row)
demo=# SELECT pg_size_pretty(pg_relation_size(indexname::regclass))
FROM pg_indexes
WHERE tablename = 'ticket_flights';
 pg_size_pretty 
----------------
 325 MB
 187 MB
 180 MB
(3 rows)
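
Incidentally, if the indexes were smaller than the threshold, the parameter could be lowered for the current session (shown purely as an illustration, our indexes do not need it):

demo=# SET min_parallel_index_scan_size = '128kB';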

Let's update half the rows to give vacuum some work to do:

demo=# UPDATE ticket_flights SET amount = amount + 1 WHERE random() > 0.5;
UPDATE 4194262

Let's go:

demo=# VACUUM VERBOSE ticket_flights;
INFO:  vacuuming "bookings.ticket_flights"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  scanned index "ticket_flights_fare_conditions_idx" to remove 4194262 row versions by parallel vacuum worker
DETAIL:  CPU: user: 1.84 s, system: 0.41 s, elapsed: 11.82 s
INFO:  scanned index "ticket_flights_amount_idx" to remove 4194262 row versions by parallel vacuum worker
DETAIL:  CPU: user: 2.31 s, system: 0.44 s, elapsed: 12.95 s
INFO:  scanned index "ticket_flights_pkey" to remove 4194262 row versions
...
INFO:  "ticket_flights": found 4194262 removable, 8391852 nonremovable row versions in 104885 out of 104885 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 630
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 9.91 s, system: 4.40 s, elapsed: 121.40 s.
VACUUM

You can see that the leader process launched two workers and took one of the indexes (the primary key) for itself.

The number of worker processes can be specified explicitly (in any case, of course, it will not exceed max_parallel_maintenance_workers, which in turn does not exceed max_worker_processes). In particular, an explicit value can be used to turn parallelism off:

demo=# VACUUM (PARALLEL 0, VERBOSE) ticket_flights;
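
Conversely, a larger degree can be requested; for instance (the value 4 here is purely illustrative, and the number of workers will also never exceed the number of indexes eligible for parallel processing):

demo=# SET max_parallel_maintenance_workers = 4;
demo=# VACUUM (PARALLEL 4, VERBOSE) ticket_flights;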

I hope this topic will be developed further. The table itself could be scanned in parallel by several processes (this was in Sawada's original patch, but did not make it into the final version), and each individual index could also be vacuumed in parallel. And autovacuum needs to be taught to use parallelism, too.
