Saving a pretty penny on large data volumes in PostgreSQL

Continuing the topic of writing large data streams, raised in the previous article on partitioning, this one looks at ways to reduce the "physical" size of what is stored in PostgreSQL, and at their impact on server performance.

We will be talking about TOAST settings and data alignment. "On average", these techniques won't save a huge amount of resources, but they require no changes to the application code at all.


In our case, however, they turned out to be very productive, since the storage of almost any monitoring system is by nature mostly append-only in terms of recorded data. And if you are wondering how to teach the database to write half as much to disk instead of 200 MB/s, read on.

Little Secrets of Big Data


Given the profile of our service, it regularly receives text packets from the logs.

And since the VLSI complex whose databases we monitor is a multi-component product with complex data structures, queries written for maximum performance turn into "multi-volume tomes" with intricate algorithmic logic. So each individual query instance, or the resulting execution plan, arriving in our log turns out to be quite large "on average".

Let's look at the structure of one of the tables into which we write the "raw" data, that is, the original text of the log entry:

CREATE TABLE rawdata_orig(
  pack -- PK
    uuid NOT NULL
, recno -- PK
    smallint NOT NULL
, dt -- record date
    date
, data -- the most important part: the raw log text
    text
, PRIMARY KEY(pack, recno)
);

A fairly typical table (already partitioned, of course, so this is a section template), where the most important field is the text. And sometimes it is quite voluminous.

Recall that the "physical" size of a single record in PG cannot exceed one page of data, while its "logical" size is a completely different matter. To write a bulky value (varchar/text/bytea) into a field, TOAST technology is used:
PostgreSQL uses a fixed page size (usually 8 KB) and does not allow tuples to span multiple pages. Therefore, it is impossible to store very large field values directly. To overcome this limitation, large field values are compressed and/or split into multiple physical rows. This happens transparently to the user, with little impact on most of the server code. The technique is known as TOAST...

In essence, for each table with "potentially large" fields, a paired table is automatically created in which every "large" record is "sliced" into 2 KB segments:

TOAST(
  chunk_id
    integer
, chunk_seq
    integer
, chunk_data
    bytea
, PRIMARY KEY(chunk_id, chunk_seq)
);
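
By the way, you can see which TOAST table is paired with yours and how much it already occupies; a minimal sketch against the example table above:

-- the TOAST relation backing rawdata_orig and its current size
SELECT reltoastrelid::regclass AS toast_table
     , pg_size_pretty(pg_relation_size(reltoastrelid::regclass)) AS toast_size
FROM pg_class
WHERE oid = 'rawdata_orig'::regclass;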

That is, if we have to write a row with a "large" value in data, the actual write hits not only the main table and its PK, but also the TOAST table and its PK.

Reducing the TOAST effect


But most of our records are not that big after all and should fit into 8 KB. How can we save on that?

Here the STORAGE attribute of the table column comes to our aid:
  • EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression is attempted first, then out-of-line storage if the row is still too big.
  • MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort, when there is no other way to make the row small enough to fit on a page.)
In fact, this is exactly what we need for the text: compress it as much as possible, and only if it doesn't fit at all, move it out to TOAST. This can be done directly "on the fly", with a single command:

ALTER TABLE rawdata_orig ALTER COLUMN data SET STORAGE MAIN;
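
Whether the setting took effect can be verified in pg_attribute; a quick sketch:

SELECT attname
     , attstorage -- 'p' = PLAIN, 'e' = EXTERNAL, 'm' = MAIN, 'x' = EXTENDED
FROM pg_attribute
WHERE attrelid = 'rawdata_orig'::regclass
  AND attnum > 0
  AND NOT attisdropped;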

How to evaluate the effect


Since the data flow changes from day to day, we cannot compare absolute numbers, but in relative terms, the smaller the share we write into TOAST, the better. There is a danger, though: the larger the "physical" volume of each individual record, the "wider" the index becomes, because it has to cover more data pages.
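
The per-section breakdown below can be obtained with a query along these lines (a sketch; PK is approximated here by the total index size of the section):

SELECT pg_size_pretty(pg_relation_size(c.oid::regclass))           AS heap
     , pg_size_pretty(pg_relation_size(c.reltoastrelid::regclass)) AS toast
     , pg_size_pretty(pg_indexes_size(c.oid::regclass))            AS pk
FROM pg_class c
WHERE c.oid = 'rawdata_orig'::regclass;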

Section before the changes:
heap  = 37GB (39%)
TOAST = 54GB (57%)
PK    =  4GB ( 4%)

Section after the changes:
heap  = 37GB (67%)
TOAST = 16GB (29%)
PK    =  2GB ( 4%)

As a result, we started writing to TOAST half as often, which relieved not only the disk but also the CPU:



Note that we also started "reading" from disk less, not just "writing": when inserting a record into a table, part of the tree of each of its indexes also has to be read to determine the record's future position in it.

Who lives well on PostgreSQL 11


After upgrading to PG11, we decided to continue "tuning" TOAST and noticed that starting with this version the toast_tuple_target parameter became available for tweaking:
The TOAST processing code is triggered only when a row value to be stored in the table is wider than TOAST_TUPLE_THRESHOLD bytes (usually 2 KB). The TOAST code will compress and/or move field values out of the table until the row value is shorter than TOAST_TUPLE_TARGET bytes (adjustable, also 2 KB by default) or the size can no longer be reduced.
Since our data is usually either "very short" or "very long" right away, we decided to limit ourselves to the lowest possible value:

ALTER TABLE rawplan_orig SET (toast_tuple_target = 128);
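
The storage parameters actually set on the relation can be checked afterwards, for instance:

-- reloptions should now list toast_tuple_target=128
SELECT reloptions
FROM pg_class
WHERE oid = 'rawplan_orig'::regclass;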

Let's see how the new settings affected the disk load after the migration:


Not bad! The average disk queue shrank by roughly 1.5 times, and disk "utilization" by about 20 percent! But maybe this affected the CPU somehow?


At least it certainly didn't get worse. Although it's hard to judge when even volumes like these still can't push the average CPU load above 5%.

When the terms change places, the sum... changes!


As they say, a kopeck saves the ruble, and with our storage volumes of about 10 TB/month even a small optimization can yield a decent profit. So we turned our attention to the physical structure of our data: exactly how the "fields" are laid out inside each table's record.

Because, due to data alignment, this directly affects the resulting volume:
Many architectures align data on machine word boundaries. For example, on a 32-bit x86 system, integers (integer type, 4 bytes) are aligned on a 4-byte word boundary, as are double-precision floating-point numbers (double precision type, 8 bytes). On a 64-bit system, double values are aligned on an 8-byte word boundary. This is another reason for incompatibility.

Because of alignment, the size of a table row depends on the order of its fields. Usually this effect is not very noticeable, but in some cases it can lead to a significant increase in size. For example, if you interleave fields of type char(1) and integer, as a rule 3 bytes between them are wasted for nothing.

Let's start with synthetic examples:

SELECT pg_column_size(ROW(
  '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
, '2019-01-01'::date
));
-- 48 bytes

SELECT pg_column_size(ROW(
  '2019-01-01'::date
, '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
));
-- 46 bytes

Where did the extra couple of bytes come from in the first case? It's simple: the 2-byte smallint is padded to a 4-byte boundary before the next field, whereas when it is the last field there is nothing after it, so no padding is needed.

In theory everything is clear, and you can rearrange fields however you like. Let's check it on real data, using as an example one of our tables whose daily section takes up 10-15 GB.

Source structure:

CREATE TABLE public.plan_20190220
(
--  from table plan:  pack uuid NOT NULL,
--  from table plan:  recno smallint NOT NULL,
--  from table plan:  host uuid,
--  from table plan:  ts timestamp with time zone,
--  from table plan:  exectime numeric(32,3),
--  from table plan:  duration numeric(32,3),
--  from table plan:  bufint bigint,
--  from table plan:  bufmem bigint,
--  from table plan:  bufdsk bigint,
--  from table plan:  apn uuid,
--  from table plan:  ptr uuid,
--  from table plan:  dt date,
  CONSTRAINT plan_20190220_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190220_dt_check CHECK (dt = '2019-02-20'::date)
)
INHERITS (public.plan)

The section after changing the column order has exactly the same fields, only in a different order:

CREATE TABLE public.plan_20190221
(
--  from table plan:  dt date NOT NULL,
--  from table plan:  ts timestamp with time zone,
--  from table plan:  pack uuid NOT NULL,
--  from table plan:  recno smallint NOT NULL,
--  from table plan:  host uuid,
--  from table plan:  apn uuid,
--  from table plan:  ptr uuid,
--  from table plan:  bufint bigint,
--  from table plan:  bufmem bigint,
--  from table plan:  bufdsk bigint,
--  from table plan:  exectime numeric(32,3),
--  from table plan:  duration numeric(32,3),
  CONSTRAINT plan_20190221_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190221_dt_check CHECK (dt = '2019-02-21'::date)
)
INHERITS (public.plan)

The total volume of a section is determined by the number of "facts" and depends only on external processes, so let's divide the heap size (pg_relation_size) by the number of records in it, obtaining the average size of an actually stored record:
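
A minimal sketch of such a calculation for one daily section (the section name is the one from the example above):

-- average "physical" size of a stored record, in bytes
SELECT pg_relation_size('public.plan_20190220')::numeric
     / (SELECT count(*) FROM public.plan_20190220) AS avg_row_bytes;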


Minus 6% of the volume. Excellent!

But everything, of course, is not so rosy, since we cannot change the order of fields inside the indexes, and therefore "overall" (pg_total_relation_size)...


... we still saved 1.5% here without changing a single line of code. Yes, yes!
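
For reference, the "overall" figure being compared is simply the full size of a section including its indexes and TOAST, roughly:

-- heap + indexes + TOAST of the daily section
SELECT pg_size_pretty(pg_total_relation_size('public.plan_20190221'));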



Note that the field arrangement above is not necessarily the most optimal one. Some blocks of fields you just don't want to "tear apart" for aesthetic reasons, for example the (pack, recno) pair, which is the PK of this table.

In general, finding the "minimal" field arrangement is a fairly simple "brute-force" task. So you may well get even better results on your data than we did. Try it!
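
To see what there is to rearrange, the length and alignment requirements of every column can be pulled from the system catalogs; a sketch against the parent table from the example:

-- typlen = -1 means a variable-length type; typalign c/s/i/d = 1/2/4/8-byte alignment
SELECT a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'public.plan'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;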
