The Big Appetites of Little Buffers in Node.js

I have already written about our service for monitoring PostgreSQL queries, for which we implemented an online server log collector. Its main task is to receive log streams from a large number of hosts simultaneously, quickly split them into lines, group the lines into packets according to certain rules, process them, and write the result to PostgreSQL storage.



In our case we are talking about several hundred servers and millions of queries and plans that generate more than 100GB of logs per day. So it was not at all surprising to find that the lion's share of resources was spent on precisely these two operations: splitting into lines and writing to the database.

We plunged into the bowels of the profiler and found a few peculiarities of working with Buffer in Node.js, knowledge of which can save a lot of your time and server resources.

CPU load




Most of the processor time was spent processing the incoming log stream, which is understandable. What was not clear was the resource cost of the primitive "slicing" of the incoming stream of binary blocks into lines by \r\n:



An attentive developer will immediately notice the not-so-efficient byte-by-byte loop over the incoming buffer. And since a line can be "torn" between neighboring blocks, there is also the "tail attachment" functionality for the remainder left over from the previously processed block.
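In rough form, that approach looked something like this minimal sketch (function and variable names here are illustrative, not the actual implementation):

```js
// A minimal sketch of the byte-by-byte splitter; names are illustrative.
let tail = Buffer.alloc(0); // leftover of an unfinished line from the previous block

function onChunk(chunk, onLine) {
  const data = Buffer.concat([tail, chunk]); // "tail attachment"
  let start = 0;
  for (let i = 0; i < data.length - 1; i++) { // byte-by-byte scan
    if (data[i] === 0x0D && data[i + 1] === 0x0A) { // \r\n
      onLine(data.subarray(start, i));
      start = i + 2;
      i++;
    }
  }
  tail = data.subarray(start); // carry the unfinished line into the next call
}
```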

Trying readline


A quick review of the available solutions led us to the standard readline module, which has exactly the line-splitting functionality we needed:
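In its simplest form the usage looks something like this (the input stream and the processLine handler are stand-ins for illustration):

```js
// Splitting an incoming stream into lines with the standard readline module;
// 'input' stands for whatever Readable stream carries the log data.
const readline = require('readline');

const rl = readline.createInterface({ input, crlfDelay: Infinity });
rl.on('line', (line) => {
  // 'line' arrives here already decoded as a UTF-8 string
  processLine(line);
});
```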



After it was implemented, "slicing" moved from the top of the profiler further down the list:



But, as it turned out, readline coerces the string to UTF-8 internally, which does not work if the log entry (a query, a plan, an error text) has a different source encoding.

Indeed, several databases can be active on a single PostgreSQL server at the same time, and each of them writes to the common server log in its own encoding. As a result, owners of databases in win1251 (it is sometimes convenient for saving disk space when "honest" multibyte UNICODE is not needed) got to see their plans with "Russian" table and index names mangled roughly like this:



Tweaking our homegrown solution


What a nuisance... We still have to do the splitting ourselves, but with optimizations such as Buffer.indexOf() instead of a byte-by-byte scan:
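A sketch of that variant (again illustrative, not the original code):

```js
// Line splitting via Buffer.indexOf instead of a byte-by-byte scan.
const CRLF = Buffer.from('\r\n');
let tail = Buffer.alloc(0);

function onChunk(chunk, onLine) {
  let data = Buffer.concat([tail, chunk]); // glue the leftover tail to the new block
  let pos;
  while ((pos = data.indexOf(CRLF)) !== -1) {
    onLine(data.subarray(0, pos)); // hand the line out as a raw Buffer, no re-encoding
    data = data.subarray(pos + CRLF.length);
  }
  tail = data; // whatever is left becomes the new tail
}
```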



Everything seemed fine: the load in the test environment did not grow, the win1251 names were fixed, we rolled it out to production... Ta-dam! CPU usage periodically broke through the ceiling at 100%:



How could that be?.. It turns out the culprit is the Buffer.concat with which we "glue on the tail" left over from the previous block:



But we only concatenate when a line crosses a block boundary, and there should not be many of those, right?.. Well, almost. Except that sometimes "lines" arrive spanning several hundred 16KB segments:



Thanks to the fellow developers who took care to generate those. It happens "rarely, but surely", so we could not catch it in advance in the test environment.

Clearly, re-gluing a buffer of several megabytes several hundred times, one small piece at a time, is a straight road into the abyss of memory reallocations and the CPU consumption we observed. So let's simply not glue anything until the line has actually ended. We will just collect the "pieces" in an array until it is time to hand the whole line "out":
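A sketch of that idea, splitting on '\n' and trimming a trailing '\r' so that a terminator torn between blocks is not an issue (names are illustrative):

```js
// Accumulate pieces of an unfinished line in an array and concatenate only once,
// when the line terminator has actually been seen.
const NL = 0x0A; // '\n'
let pending = []; // pieces of the line that has not ended yet

function onChunk(chunk, onLine) {
  let data = chunk;
  let pos;
  while ((pos = data.indexOf(NL)) !== -1) {
    pending.push(data.subarray(0, pos));
    // the only concat happens once per completed line
    let line = pending.length === 1 ? pending[0] : Buffer.concat(pending);
    if (line.length && line[line.length - 1] === 0x0D) line = line.subarray(0, -1); // drop '\r'
    onLine(line);
    pending = [];
    data = data.subarray(pos + 1);
  }
  if (data.length) pending.push(data); // just remember the piece, no gluing yet
}
```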



With that, the load returned to the readline level.

Memory consumption


Anyone who has written in languages with dynamic memory allocation knows that one of the most unpleasant "performance killers" is the background activity of the Garbage Collector (GC), which scans the objects created in memory and deletes those that are no longer needed. This trouble caught up with us too: at some point we noticed that GC activity was somehow excessive and out of place.



The traditional tuning knobs did not really help... "If all else fails, take a memory dump!" And folk wisdom did not disappoint: we saw a cloud of 8360-byte Buffers with a total size of 520MB...



They were all being generated inside CopyBinaryStream, and the picture started to become clear...

COPY ... FROM STDIN WITH BINARY


To reduce the amount of traffic sent to the database, we use the binary COPY format. For each record we have to send a buffer into the stream, assembled from "pieces": the number of fields in the record (2 bytes), and then the binary representation of each column's value (a 4-byte length, or -1 for NULL, plus the data itself).

Since such a table row almost always adds up to a variable total length, allocating a fixed-size buffer up front is not an option, and reallocating whenever we run out of space would easily eat up the performance. So here, too, it makes sense to "glue from pieces" with Buffer.concat().
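As a rough sketch, assembling one such record from pieces might look like this (value serialization is simplified; real code has to encode each PostgreSQL type properly):

```js
// One tuple of COPY ... WITH BINARY: an int16 field count, then for each field
// an int32 length (-1 for NULL) followed by the raw value bytes.
function encodeTuple(fields) {
  const pieces = [];

  const count = Buffer.allocUnsafe(2);
  count.writeInt16BE(fields.length, 0);
  pieces.push(count);

  for (const value of fields) {
    if (value === null) {
      const nul = Buffer.allocUnsafe(4);
      nul.writeInt32BE(-1, 0); // NULL is encoded as length -1 with no data
      pieces.push(nul);
    } else {
      const data = Buffer.from(value); // assume the value is already serialized
      const len = Buffer.allocUnsafe(4);
      len.writeInt32BE(data.length, 0);
      pieces.push(len, data);
    }
  }

  return Buffer.concat(pieces); // the very gluing this section is about
}
```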

Memoization


Well, since many of our pieces repeat over and over (for example, the field count for records of the same table), let's just memoize them and reuse ready-made buffers generated once on the first call. Given the COPY format, there are not many options: the typical pieces are 1, 2 or 4 bytes long:
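A sketch of that lazy memoization (helper names are ours):

```js
// Memoize small pieces on demand: generate on first use, reuse afterwards.
const memo2 = new Map(); // 2-byte pieces, e.g. field counts
const memo4 = new Map(); // 4-byte pieces, e.g. field lengths

function int16(value) {
  let buf = memo2.get(value);
  if (buf === undefined) {
    buf = Buffer.allocUnsafe(2); // allocated only when first requested...
    buf.writeInt16BE(value, 0);
    memo2.set(value, buf);
  }
  return buf;
}

function int32(value) {
  let buf = memo4.get(value);
  if (buf === undefined) {
    buf = Buffer.allocUnsafe(4); // ...and kept alive forever once created
    buf.writeInt32BE(value, 0);
    memo4.set(value, buf);
  }
  return buf;
}
```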



And... bam, we stepped right on a rake!



That is, yes: every time a Buffer is created, an 8KB chunk of memory is allocated by default, so that small buffers created one after another can be laid out "next to each other" in already allocated memory. But our allocations happened "on demand", and they turned out not to be "next to each other" at all, which is why each of our 1-, 2- and 4-byte buffers physically occupied 8KB plus a header. There they are, our 520MB!
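The pooling itself is easy to observe with default settings:

```js
// Small buffers from Buffer.allocUnsafe() are sliced out of a shared pre-allocated
// pool (8KB by default), as long as they are created one right after another:
const a = Buffer.allocUnsafe(4);
const b = Buffer.allocUnsafe(4);

console.log(Buffer.poolSize);        // 8192
console.log(a.buffer === b.buffer);  // true: both are views into the same slab
console.log(a.buffer.byteLength);    // 8192
```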

Smart memoization


Hmm... Why should we wait at all until a particular 1- or 2-byte buffer is needed? The 4-byte ones are a separate story, but the 1- and 2-byte variants number only 256 + 65536 in total. So let's just generate the whole set at once! At the same time we can drop the existence check on every call, which also makes things faster, since initialization now happens only once at process start.
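A sketch of that eager initialization (structure and names are ours, not the original code):

```js
// Pre-generate every 1- and 2-byte piece, plus the "hottest" 4-byte values,
// once at process start; lookups then need no existence check at all.
const buf1 = new Array(256);
const buf2 = new Array(65536);
const buf4 = new Map();

for (let i = 0; i < 256; i++) {
  buf1[i] = Buffer.from([i]);
}
for (let i = 0; i < 65536; i++) {
  const b = Buffer.allocUnsafe(2);
  b.writeUInt16BE(i, 0);
  buf2[i] = b;
}
for (let i = -1; i < 65536; i++) { // the lower 2 bytes' worth of values plus -1 (NULL)
  const b = Buffer.allocUnsafe(4);
  b.writeInt32BE(i, 0);
  buf4.set(i, b);
}

// hot-path lookups are now just buf1[n], buf2[n], buf4.get(n)
```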



That is, in addition to the 1- and 2-byte buffers, we immediately initialize the most frequently used values for the 4-byte ones (everything that fits in the lower 2 bytes, plus -1). And it helped: only 10MB instead of 520MB!

