Massive PostgreSQL query optimization. Kirill Borovikov (Tensor)
This talk presents approaches that let you monitor the performance of SQL queries when there are millions of them per day and hundreds of monitored PostgreSQL servers: which technical solutions allow us to process such a volume of information efficiently, and how this makes life easier for an ordinary developer. It will be of interest to anyone who wants to dig into specific problems, different techniques for optimizing SQL queries, and solving typical DBA tasks in PostgreSQL. You can also read a series of articles on this topic.
My name is Kirill Borovikov, and I represent the company "Tensor". Specifically, I specialize in working with databases in our company. Today I will tell you how we approach query optimization when you need not to "tune" the performance of a single query, but to solve the problem en masse - when there are millions of queries and you have to find approaches to this big problem.

For our million customers, "Tensor" is VLSI, our application: a corporate social network, video communication solutions, tools for internal and external document management, accounting systems for bookkeeping and warehousing... A "mega-combine" for integrated business management made up of more than 100 different internal projects.

To keep them all working and developing normally, we have 10 development centers across the country with more than 1000 developers.

We have been working with PostgreSQL since 2008 and have accumulated a large volume of data that we process - client data, statistical and analytical data, data from external information systems - more than 400TB. There are about 250 servers "in production" alone, and in total we monitor about 1000 database servers.
SQL is a declarative language. You describe not "how" something should work, but "what" you want to get. The DBMS knows better how to do a JOIN - how to join your tables, which conditions to apply, what will go through an index and what will not...

Some DBMSs accept hints: "No, join these two tables in this particular order", but PostgreSQL does not. This is the conscious position of the core developers: "We would rather improve the query optimizer than let developers use hints."

But even though PostgreSQL does not let you control it "from the outside", it lets you see perfectly well what happens "inside" when you execute your query and where its problems are.
So, what classic problems does a developer usually [come to a DBA] with? "We ran a query here, and everything is slow, everything hangs, something is going on... some kind of trouble!" The reasons are almost always the same:
- an inefficient query itself: "I just wrote SQL with 10 JOINs..." - the developer expects the optimizer to sort out a query with ten tables in the FROM, but it cannot work miracles;
- stale statistics: the planner's statistics have "drifted" - you have loaded 10 times more data, but PostgreSQL still plans as if nothing had changed;
- a "bottleneck" in resources: there is simply not enough CPU, memory or disk I/O on the server;
- locks from competing queries (INSERT, UPDATE, DELETE) touching the same data.
... And for everything else, we need a plan! We need to see what is happening inside the server.

A query execution plan in PostgreSQL is a tree of the query execution algorithm in textual representation - the algorithm that, as a result of the planner's analysis, was deemed the most effective. Each tree node is an operation: fetching data from a table or index, building a bitmap, joining two tables, a union, intersection or exclusion of result sets. Executing a query is a walk through the nodes of this tree.

The easiest way to get a query plan is to run EXPLAIN - that gives the expected plan without executing anything. To get it with all the real attributes, that is, to actually execute the query, use EXPLAIN (ANALYZE, BUFFERS) SELECT ...
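A minimal, self-contained example of the second form; the catalog table pg_class exists in every database, so this can be run anywhere:

```sql
-- execute the query for real and show the plan with timings and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM pg_class
 WHERE relname = 'pg_class';
```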
The bad point: when you execute it, everything happens "here and now", so it is only suitable for local debugging. Suppose you take a highly loaded server under a heavy stream of data changes and see: "Oh! Here a query is running slowly." But that was half an hour or an hour ago - while you were fishing the query out of the logs and carrying it back to the server, the whole dataset and the statistics changed. You run it to debug - and it runs fast! And you cannot understand why, why it was slow.
To understand what was going on at the exact moment the query was executed on the server, smart people wrote the auto_explain module. It is present in almost all common PostgreSQL distributions, and you can simply activate it in the config file. If it sees that a query is running longer than the threshold you set, it takes a "snapshot" of that query's plan and writes them to the log together.
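A minimal sketch of turning it on, here per session for experimenting; in production the module is normally preloaded via shared_preload_libraries in postgresql.conf, and the 100 ms threshold is just an example:

```sql
-- load the module into the current session (in production: shared_preload_libraries)
LOAD 'auto_explain';
-- log the plan of any query running longer than 100 ms
SET auto_explain.log_min_duration = '100ms';
-- include actual timings and buffer usage in the logged plan
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;
```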
Everything seems fine now: we go to the log and see there... a wall of text. But we cannot say anything about it, except that it is an excellent plan because it took 11 ms to complete. Everything seems to be fine - yet nothing is clear about what really happened. Apart from the total time we do not see much, because reading such a "sheet" of plain text is nobody's favorite pastime.

But even leaving the inconvenience aside, there are bigger problems:

- the time shown at each node is the total over its whole subtree. That is, to find out how long, say, this Index Scan itself took, you have to subtract everything nested beneath it "in your head" - and with CTEs it is "a whole separate story";
- the node time shown is the time of a single execution of the node: if the node was executed many times in a loop, the loops count is shown separately, and you have to multiply it out yourself to understand the node's total contribution - again, "in your head".
Under these circumstances, figuring out "who is the weakest link?" is almost unrealistic. That is why even the developers themselves write in the "manual" that "Understanding a plan is an art that requires experience to learn...".

But we have 1000 developers, and you cannot transplant this experience into every one of their heads. I know it, you know it, he knows it - and the guy over there does not. Maybe he will learn it, or maybe not, but he needs to work now - and where would he get that experience?

Plan visualization
Therefore, we realized that in order to deal with these problems, we need a good visualization of the plan. [article]
First we went "around the market" - let's look on the Internet at what exists at all. It turned out that there are very few relatively "live", more or less developed solutions - literally one: explain.depesz.com by Hubert Lubaczewski. You "feed" the textual representation of the plan into an input field, and it shows you a table with the parsed data:
- the node's own working time
- total time over the entire subtree
- the number of rows actually retrieved versus the number statistically expected
- the node body itself
The service also keeps an archive of plans and lets you share links. You throw your plan in there and say: "Hey, Vasya, here's a link for you - something is wrong in there."
But there are problems. First, a huge amount of copy-paste: you take a piece of the log, paste it in there, again and again. Second, there is no analysis of the amount of data read - the very buffers that EXPLAIN (ANALYZE, BUFFERS) outputs are not shown here at all. The service simply does not know how to parse, understand and work with them. Yet when you read a lot of data and realize that it may be "spread" badly across disk and the in-memory cache, this information is very important. The third negative point is the very weak development of the project: the commits are tiny, one every six months is good going, and the code is in Perl.
But all this is "lyrics" - one could somehow live with it - except for one thing that turned us away from this service: the errors in analyzing Common Table Expressions (CTE) and various dynamic nodes like InitPlan/SubPlan. If you believe this picture, the total execution time of each individual node is greater than the total execution time of the entire query. The explanation is simple: the generation time of the CTE was not subtracted from the CTE Scan node. Therefore, we no longer know the correct answer to how long the CTE scan itself actually took.
Then we realized it was time to write our own - hooray! Every developer says: "Now we will write our own, it will just be super!" We took a typical web-service stack: a core on Node.js + Express, pulled in Bootstrap and, for beautiful diagrams, D3.js. And our expectations were justified - we got the first prototype in 2 weeks:
- our own plan parser - we can now parse essentially any plan that PostgreSQL generates;
- correct analysis of dynamic nodes: CTE Scan, InitPlan, SubPlan;
- analysis of buffer distribution: which data pages were read from shared memory, which from the local cache, which from disk;
- visibility - so that the "weakest link" is not something you "dig out" of the log, but something you see immediately in the picture.
We got something like this - right away with syntax highlighting. But usually our developers no longer work with the full representation of the plan, but with a shorter one. After all, we have already parsed out all the numbers and shown them to the left and to the right, and in the middle we keep only the first line of each node: which node it is - a CTE Scan, the generation of a CTE, or a Seq Scan over some table. This abbreviated representation is what we call the plan template.
What else would be convenient? To see what share of the total time falls on which node - so we just "stuck" a pie chart on the side. We point at a node and see: it turns out Seq Scan took less than a quarter of the whole time, and the remaining 3/4 was taken by CTE Scan. Horror! This is a small side note about the "rate of fire" of CTE Scan if you use them actively in your queries: they are not very fast, losing even to an ordinary table scan. [article] [article] But usually such diagrams get more interesting and more complicated when we point at a segment and see, for example, that more than half of all the time was "eaten" by some Seq Scan - and there was a Filter inside it that dropped a pile of records. You can throw this picture straight at the developer and say: "Vasya, everything is bad here! Figure it out, look - something is wrong!"
Naturally, there was a "rake" to step on. The first thing we "stepped on" is the rounding problem. The time of each individual node in the plan is shown with a precision of 1 µs, but it is the time of one loop. When the number of loops of a node exceeds, say, 1000, PostgreSQL has already divided the total by the loop count and rounded it, so when we multiply back, the rounding error is multiplied by the loop count as well, and we get a total time of "somewhere between 0.95 ms and 1.05 ms". While the totals are measured in microseconds this is nothing, but when they are already [milli]seconds, you have to take this into account when "untangling" the resources across the plan nodes - who consumed how much.
The second point, more complex, is the distribution of resources (those same buffers) among dynamic nodes. It cost us the first 2 weeks of the prototype plus another 4 weeks. Reproducing the problem is quite simple: we make a CTE and supposedly read something inside it. In fact, PostgreSQL is smart and will not actually read anything at that point. Then we take the first record from it, and the hundred-and-first from the same CTE.
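A hedged reconstruction of that kind of query: a CTE over pg_class scanned twice, once for the 1st row and once for the 101st (the exact query from the talk may differ; on PostgreSQL 12+ the CTE stays materialized because it is referenced twice):

```sql
EXPLAIN (ANALYZE, BUFFERS)
WITH cte AS (
  SELECT * FROM pg_class          -- "supposedly read something" into the CTE
)
SELECT
  (SELECT relname FROM cte LIMIT 1)             AS row_1,   -- 1st record
  (SELECT relname FROM cte OFFSET 100 LIMIT 1)  AS row_101; -- 101st record
```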
We look at the plan and it seems strange: 3 buffers (data pages) were "consumed" in the Seq Scan, 1 more in a CTE Scan, and 2 more in the second CTE Scan. That is, if we simply sum everything up we get 6, yet we read only 3 pages from the table! A CTE Scan does not read anything from anywhere - it works directly with process memory. So something is clearly wrong here. In fact, all 3 pages requested from the Seq Scan are the same ones: first the 1st CTE Scan asked for 1 of them, and then the 2nd one asked for and got the other 2. That is, 3 pages of data were read in total, not 6.
And this picture led us to understand that plan execution is no longer a tree, but rather an acyclic graph. And we made a diagram like this, so that we understand "what came from where" at all. That is, here we created a CTE from pg_class and asked for it twice, and almost all of the time went up the branch where we asked for it the second time. Clearly, reading the 101st record is much more expensive than reading just the 1st one from the table.
We exhaled for a while and said: "Now, Neo, you know kung fu! Now our experience is right on your screen. Now you can use it." [article]

Log Consolidation
Our 1000 developers breathed a sigh of relief. But we understood that we have hundreds of "battle" servers, and all this "copy-paste" on the developers' side is not convenient at all. We realized we had to collect the logs ourselves.
In general, there is a standard module that can collect statistics (it also has to be activated in the config) - the pg_stat_statements module. But it did not suit us.

First, it assigns different QueryId values to the same queries executed against different schemas within the same database. That is, if you first run SET search_path = '01'; SELECT * FROM user LIMIT 1; and then run the same query after SET search_path = '02';, there will be different entries in this module's statistics, and I cannot collect overall statistics for this query profile while ignoring the schemas (a sketch below shows how this looks).

The second point that prevented us from using it is the lack of plans. That is, there is no plan, only the query text itself. We see what was slow, but we do not understand why - and here we are back to the problem of a rapidly changing dataset.

And the last point is the lack of "facts": you cannot address a specific instance of query execution - there is none, only aggregated statistics. It is possible to work with that, it is just very difficult.
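A sketch of how the schema-level duplication shows up in pg_stat_statements (the column is total_exec_time on PostgreSQL 13+, total_time on older versions):

```sql
-- the same statement text appears under two different queryid values,
-- one per search_path / schema
SELECT queryid, calls, total_exec_time, query
  FROM pg_stat_statements
 WHERE query LIKE '%FROM user LIMIT 1%'
 ORDER BY queryid;
```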
Therefore, we decided to fight "copy-paste" and started writing a collector.

The collector connects over SSH, establishes a secure connection to the database server using a certificate, and "clings" to the log file with tail -F. So in this session we get a complete "mirror" of the entire log file the server generates. The load on the server itself is minimal, because we do not parse anything there, we simply mirror the traffic.

Since we had already started writing the interface in Node.js, we continued writing the collector in it. And this technology paid off, because it is very convenient to use JavaScript for working with weakly formatted text data, which is exactly what a log is. And the Node.js infrastructure itself, as a backend platform, makes it easy and convenient to work with network connections and with data streams in general.

So we "pull" two connections: the first to "listen" to the log itself and take it in, and the second to periodically ask the database questions. The log may say that the table with oid 123 got blocked, but that tells the developer nothing, and it would be nice to ask the database "What is OID = 123, after all?" So we periodically ask the database about whatever we do not yet know on our side.
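The kind of lookup that second connection might run when the log mentions only an OID (123 is the number from the example above):

```sql
-- resolve a relation OID from the log into a human-readable name
SELECT relname,
       relnamespace::regnamespace AS schema_name
  FROM pg_class
 WHERE oid = 123;
```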
"You just did not take into account that there are elephant-sized bees out there!.." We started developing this system when we wanted to monitor 10 servers - the most critical ones in our understanding, where there were problems that were hard to deal with. But during the first quarter we got a hundred servers under monitoring, because the system "took off": everyone wanted it, everyone found it convenient.

All of this has to be stored somewhere, and the data stream is large and active. Naturally, we monitor what we know how to deal with, so that is also what we use: PostgreSQL serves as the data warehouse too. And there is still no faster way to "pour" data into it than the COPY operator.

But just "pouring" data in is not really a technology by itself. Because if you have about 50k queries per second on a hundred servers, this generates 100-150GB of logs per day. So we had to carefully "saw up" the base.

First, we partitioned by day, because, by and large, nobody is interested in correlations between days. What does yesterday matter if tonight you rolled out a new version of the application - and the statistics are already different.

Second, we learned (were forced) to write very, very quickly using COPY. That is, not just COPY because it is faster than INSERT, but even faster than that.
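A minimal sketch of those two points together - a log table range-partitioned by day (declarative partitioning, PostgreSQL 10+) and loaded via COPY. The table, columns and dates here are illustrative, not the project's real schema:

```sql
-- daily partitioning: nobody correlates data across days,
-- so each day lives in its own partition
CREATE TABLE query_log (
  logged_at timestamptz      NOT NULL,
  host_id   integer          NOT NULL,
  plan_md5  char(32)         NOT NULL,
  duration  double precision
) PARTITION BY RANGE (logged_at);

CREATE TABLE query_log_2019_11_01
  PARTITION OF query_log
  FOR VALUES FROM ('2019-11-01') TO ('2019-11-02');

-- bulk loading goes through COPY rather than row-by-row INSERTs
-- (rows are then streamed by the client; in psql, terminate with \.)
COPY query_log (logged_at, host_id, plan_md5, duration) FROM STDIN;
```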
The third point: we had to abandon triggers and, accordingly, Foreign Keys. That is, we have no referential integrity at all. Because if you have a table with a couple of FKs on it, and you declare in the database structure that "this log record references, for example, this group of records", then on every insert PostgreSQL has nothing left to do but to honestly execute SELECT 1 FROM master_fk1_table WHERE ... with the identifier you are trying to insert - just to check that the record is there and that your insert does not "break" this Foreign Key. Instead of one write to the target table and its indexes, we get an extra read from every table it references. And we do not need that at all: our task is to write as much as possible, as fast as possible, with the least load. So FK - down!

The next point is aggregation and hashing. Initially they were implemented inside the database: it is convenient, as soon as a record arrives, to do "plus one" in some counter table right in a trigger. It is nice and convenient, but bad for the same reason - you insert one record, yet you are forced to read and write something else in another table. And, on top of that, you do it every single time.

Now imagine a table in which you simply count the number of queries that passed through a particular host: +1, +1, +1, ..., +1. You do not need that, in principle - all of this can be summed up in memory on the collector and sent to the database in one go as +10.

Yes, in case of some problems your logical integrity may "fall apart", but that is an almost unrealistic case - because you have a normal server with a battery in the controller, a transaction log, a journaling file system... In general, it is not worth it: the productivity you lose to triggers / FK is not worth the costs you incur.

The same goes for hashing. A query flies in, you compute some identifier from it in the database, write it to the database and then hand it out to everybody. All is well until, at the moment of writing, a second client arrives wanting to write the same thing - and you get a lock, and that is already bad. So if you can move the generation of some IDs out to the client (relative to the database), it is better to do so. For us, MD5 of the text - of the query, the plan, the template... - was ideally suited. We compute it on the collector side and "pour" the ready-made ID into the database. The length of MD5 and the daily partitioning let us not worry about possible collisions.
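A hedged sketch of the "+1 versus +10" idea: the collector pre-sums the counter in memory and applies it in a single statement, with the template's MD5 already computed on the client. The table, columns and the upsert form (which assumes a unique index on the key) are illustrative, not necessarily what the project does:

```sql
-- one pre-aggregated write instead of ten trigger-driven "+1" updates;
-- the MD5 arrives as a ready value computed on the collector side
INSERT INTO host_template_stats (dt, host_id, template_md5, calls)
VALUES ('2019-11-01', 42, 'd6fa4f7e8c3b2a1905e4c7b8a9d0e1f2', 10)
ON CONFLICT (dt, host_id, template_md5) DO UPDATE
   SET calls = host_template_stats.calls + EXCLUDED.calls;
```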
But in order to write all of this quickly, we had to modify the writing procedure itself.

How do you usually write data? You have some dataset, you split it into several tables, and then COPY - first into the first one, then into the second, then the third... It is inconvenient, because we are essentially writing one data stream in three sequential steps. Unpleasant. Can it be done faster? It can!

To do that, it is enough to decompose these flows in parallel with each other. We have errors, queries, templates, locks, ... flying in as separate streams - and we write them all in parallel. For that, we simply keep a COPY channel permanently open on each individual target table.

That is, the collector always has a stream into which it can write the data it needs. But for the database to actually see this data, and for nobody to hang in locks waiting for it, the COPY has to be interrupted at a certain frequency. For us, a period of about 100 ms turned out to be the most effective: we close it and immediately open it again on the same table. And when one stream is not enough at some peaks, we pool connections up to a certain limit.

In addition, we found out that for such a load profile, any aggregation where records are collected into batches is evil. The classic evil is INSERT ... VALUES with 1000 and more records: at that moment you get a write peak on the media, and everyone else trying to write something to disk has to wait.

To get rid of such anomalies, simply do not aggregate anything and do not buffer at all. And if buffering to disk does occur (fortunately, the Stream API in Node.js lets you find out), put that connection aside. When the event arrives that it is free again, write to it from the accumulated queue; while it is busy, take the next free one from the pool and write to that.

Before we introduced this approach to writing we had roughly 4K write ops; this way we reduced the load by a factor of 4. Since then it has grown another 6 times thanks to new monitored databases - up to 100MB/s. And now we keep logs for the last 3 months, about 10-15TB in total, hoping that in just three months any developer will be able to solve any problem.

We understand the problems
But simply collecting all this data is good, useful and appropriate, yet not enough - you need to understand it. Because it is millions of different plans per day.
But millions are unmanageable; first we have to make it "smaller". And, first of all, we have to decide how we will organize this "smaller" view. We identified three key points for ourselves:
- who sent this query - that is, from which application it "flew in": the web interface, the backend, the payment system or something else;
- where it happened - on which particular server. Because if you have several servers behind one application, and one of them suddenly "goes dull" (the "disk rotted", "memory leaked", some other trouble), you need to address that specific server;
- how the problem manifested itself, in one way or another.
To understand "who" sent us the query, we use a standard tool - setting a session variable: SET application_name = '{bl-host}:{bl-method}'; - we send the host name of the business-logic node the query is made from, and the name of the method or application that initiated it (a configuration sketch follows the list below).

After we have passed along the "owner" of the query, it has to be output to the log - for that we configure the variable log_line_prefix = ' %m [%p:%v] [%d] %r %a'. Anyone interested can look up in the manual what all of this means. It turns out that in the log we see:
- the time
- process and transaction identifiers
- the database name
- the IP address the query came from
- and the application name (the host and method)
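A minimal configuration sketch of the two settings above; the application_name value is illustrative, the prefix string is the one from the text, and changing log_line_prefix requires superuser rights and a configuration reload:

```sql
-- on the application side, per connection/session
SET application_name = 'bl-host-01:SaleDocument.Create';  -- '{bl-host}:{bl-method}'

-- on the server side (postgresql.conf or ALTER SYSTEM), then reload
ALTER SYSTEM SET log_line_prefix = ' %m [%p:%v] [%d] %r %a';
SELECT pg_reload_conf();
```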
Then we realized that looking at the correlation of one query across different servers is not very interesting. It does not happen often that one application misbehaves equally here and there - and even if it does, just look at any one of those servers. So the "one server - one day" slice turned out to be enough for any analysis.

The first analytical slice is that very "template" - the abbreviated representation of the plan, cleared of all numeric indicators. The second slice is the application or method, and the third is the specific plan node that caused us problems.

When we moved from specific instances to templates, we got two advantages at once:
- first, a multiple reduction in the number of objects to analyze: instead of millions of individual executions we work with a far smaller set of templates;
- second, the indicators of all the "facts" belonging to one template can be summed up. So even a query that is "fast" on its own but runs an enormous number of times shows up in the totals no worse than a rare slow one - and it is the totals that tell you what to optimize first.
The remaining slices are based on the indicators we extract from the plan: how many times such a template occurred, the total and average time, how much data was read from disk and how much from memory...

For example, you come to the analytics page for a host and see: something is reading far too much from disk. The disk on the server cannot keep up - so who is reading from it? And you can sort by any column and decide what you will deal with right now - the load on the CPU or the disk, or the total number of queries... Sorted, looked at the "top", fixed it - rolled out a new version of the application. [video lecture]

And right away you can see the different applications that come with the same template, from a query like SELECT * FROM users WHERE login = 'Vasya'. Frontend, backend, processing... And you wonder why processing should be reading users if it does not interact with them.

The opposite way is to see immediately, from the application's side, what it is doing. For example, the frontend does this, this, this - and this one thing once an hour (the timeline helps here). And immediately the question arises: it does not seem to be the frontend's business to do something once an hour...
After some time, we realized that we were missing aggregated statistics in terms of plan nodes. We isolated from the plans only those nodes that do something with the data of the tables themselves (read or write it, by index or not). In fact, relative to the previous picture only one aspect is added: how many records this node brought us and how many it dropped (Rows Removed by Filter). You do not have a suitable index on the table, you run a query against it, it flies past the index and falls into a Seq Scan... all records but one get filtered out. And what do you need 100M filtered records per day for - wouldn't it be better to roll out an index?
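A hedged illustration of that situation, reusing the query shape from above; the table, the column and the suggested index are illustrative:

```sql
-- the problematic shape: no suitable index, so the predicate becomes a Filter
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE login = 'Vasya';
-- the plan shows a Seq Scan with "Rows Removed by Filter"
-- counting almost the whole table

-- the usual fix: "roll out the index" so the filter becomes an index condition
CREATE INDEX CONCURRENTLY users_login_idx ON users (login);
```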
Having gone through all the plans node by node, we realized that plans contain typical structures that are very likely to look suspicious. And it would be nice to tell the developer: "Friend, here you first read by index, then sort, and then cut it off" - as a rule, down to a single record.

Everyone who has written queries with this pattern has probably run into it: "Give me Vasya's last order and its date." And if you do not have an index by date, or the index you have does not include the date, you will step on exactly this "rake".

But we know it is a "rake" - so why not tell the developer right away what to do? Accordingly, when opening a plan now, our developer immediately sees a nice picture with hints, where he is told directly: "You have problems here and here, and they are solved this way and that."

As a result, the amount of experience needed to solve problems has dropped significantly compared to where we started. That is the kind of tool we now have.
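As an aside, a hedged sketch of the "last order for Vasya" rake mentioned above and its usual fix; the schema is hypothetical:

```sql
-- the typical "read by index, sort, then keep one row" shape
SELECT *
  FROM orders
 WHERE user_id = 123
 ORDER BY order_date DESC
 LIMIT 1;

-- without an index covering the date, all of the user's orders are read and
-- sorted just to keep a single row; a composite index turns this into one
-- descent down the index
CREATE INDEX orders_user_date_idx ON orders (user_id, order_date DESC);
```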
Source: https://habr.com/ru/post/undefined/