Notorious bugs and how to avoid them, using ClickHouse as an example

If you write code, get ready for problems. They are inevitable, and they come from every direction: from your own code and the compiler, from the operating system and the hardware, and users sometimes throw in "surprises" of their own. If you have scaled a cluster to cosmic proportions, expect "cosmic" bugs - especially when the data comes from Internet traffic.


Alexey Milovidov (o6CuFl2Q) talks about the most ridiculous, discouraging and hopeless problems from his experience developing and supporting ClickHouse. Let's see how they had to be debugged, and what measures developers should take from the very beginning so that there are fewer problems.

Notorious bugs


If you wrote some code, get ready for problems right away.

Bugs in the code. They are inevitable. But suppose you wrote perfect code and compiled it - bugs will show up in the compiler, and the code will not work correctly. We fixed the compiler, everything compiled - run it. But (unexpectedly) everything works incorrectly, because there are bugs in the OS kernel too.

If there are no bugs in the OS, they will inevitably be in the hardware. Even if you wrote perfect code that runs perfectly on perfect hardware, you will still hit problems, for example, configuration errors. It would seem you did everything right, but someone made a mistake in a configuration file, and nothing works again.

When all the bugs have been fixed, the users will finish you off, because they constantly use your code "incorrectly". But the problem is definitely not in the users, it is in the code: you wrote something that is difficult to use.

Let's look at these bugs with a few examples.

Configuration bugs


Deletion of data. The first case from practice. Fortunately, not mine and not Yandex's, do not worry.

First, some background. A map-reduce cluster architecture (such as Hadoop) consists of several data nodes that store the data, and one or more master servers that know where all the data lives across the servers.

Data nodes know the address of the master and connect to it. The master keeps track of where which data should be located and gives commands to the data nodes: "download data X, you must have data Y, delete data Z". What could go wrong?

When a new configuration file was pushed to all the data nodes, they mistakenly connected to the master of another cluster instead of their own. That master looked at the data the data nodes reported, decided that the data was wrong and should be deleted. The problem was noticed when half of the data had already been erased.


The most epic bugs are those that lead to inadvertent data deletion.
Avoiding this is very simple.

Do not delete data. For example, set it aside in a separate directory or delete it with a delay: first move it so that it is not visible to the user, and if within a few days he notices that something has disappeared, move it back.

Do not delete unexpected data if the cause is unknown. Programmatically refuse to start deleting unknown data: data that is unexpected, has strange names, or is simply too large. The administrator will notice that the server does not start, will read the message it writes, and will figure it out.

If the program performs destructive actions, isolate testing and production at the network level (iptables). For example, deleting files or sending e-mail is a destructive action, because it "eats up" someone's attention. Put a threshold on such actions: a hundred e-mails may be sent, but for a thousand there is a safety flag that must be set before something terrible happens.
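As a rough illustration of such a safety threshold (a minimal sketch, not ClickHouse's actual code; the names, the threshold and the force flag are made up):

#include <filesystem>
#include <stdexcept>
#include <string>
#include <vector>

// Refuse to clean up "unknown" data automatically if there is suspiciously much
// of it; require an explicit force flag set by a human instead. Data that is
// cleaned up is moved aside, never deleted outright.
void cleanupUnknownParts(const std::vector<std::string> & unknown_parts,
                         const std::string & data_path, bool force_flag_is_set)
{
    constexpr size_t max_auto_cleanup = 10;

    if (unknown_parts.size() > max_auto_cleanup && !force_flag_is_set)
        throw std::runtime_error(
            "Too many unexpected parts (" + std::to_string(unknown_parts.size())
            + "), refusing to start. Create the 'force_cleanup' flag file if you are sure.");

    for (const auto & part : unknown_parts)
        // Never delete right away: move to a separate directory instead,
        // so the data can be brought back if someone misses it.
        std::filesystem::rename(data_path + "/" + part,
                                data_path + "/detached/" + part);
}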

Configs. The second example is from my own practice.

One good company somehow ended up with a strange ClickHouse cluster. The strangeness was that the replicas did not synchronize. When a server was restarted, it would not start and printed a message that all the data was wrong: "There is a lot of unexpected data, I will not start. You must set the force_restore_data flag and figure it out."

Nobody in the company could figure it out - they just set the flag. Meanwhile, half of the data disappeared somewhere, resulting in charts with gaps. The developers turned to me; I thought something interesting was going on and decided to investigate. When morning came a few hours later and the birds started singing outside the window, I realized that I still didn't understand anything.

A ClickHouse server uses the ZooKeeper service for coordination. ClickHouse stores the data, and ZooKeeper determines which servers the data should live on: it stores metadata about which data should be on which replica. ZooKeeper is a cluster itself - it replicates using a very good distributed consensus algorithm, with strict consistency.

As a rule, ZooKeeper is 3 machines, sometimes 5. All of them are listed in the ClickHouse configuration at once; a connection is established to a random machine, ClickHouse interacts with it, and that server replicates all the requests.
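On the ClickHouse side this looks roughly like the following fragment of config.xml, the <zookeeper> section listing all the nodes (the hostnames here are made up):

<zookeeper>
    <node>
        <host>zk1.example.net</host>
        <port>2181</port>
    </node>
    <node>
        <host>zk2.example.net</host>
        <port>2181</port>
    </node>
    <node>
        <host>zk3.example.net</host>
        <port>2181</port>
    </node>
</zookeeper>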

What happened? The company had three ZooKeeper servers. But they did not work as a cluster of three nodes - they worked as three independent nodes, three clusters of one node each. One ClickHouse server connects to one of them and writes data there. The replicas want to download this data, but it is nowhere to be found. On restart, the server connects to a different ZooKeeper: it sees that the data it worked with before is now unexpected and must be set aside somewhere. It does not delete it, it moves it to a separate directory - in ClickHouse data is not deleted that easily.
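The mistake was on the ZooKeeper side: for three nodes to form one ensemble rather than three standalone instances, each node's zoo.cfg has to list all the members. A sketch (hostnames are made up):

# zoo.cfg on every node; without these lines a node runs in standalone mode
server.1=zk1.example.net:2888:3888
server.2=zk2.example.net:2888:3888
server.3=zk3.example.net:2888:3888
# each node also needs a myid file containing its own number (1, 2 or 3)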

I decide to fix the ZooKeeper configuration, rename all the set-aside data back, and issue ATTACH requests for the parts from the detached/unexpected_* directories.
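For a single part the recovery looks roughly like this (table and part names are made up; the unexpected_ prefix has to be removed first, otherwise ATTACH will not pick the part up):

# in the table's data directory on the replica: drop the prefix
mv detached/unexpected_all_1_1_0 detached/all_1_1_0

-- then, from the ClickHouse client: attach the part back
ALTER TABLE metrics.hits ATTACH PART 'all_1_1_0';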

As a result, all the data was restored, the replicas synchronized, there were no losses, and the graphs were continuous. The company was satisfied and thankful, as if it had already forgotten how badly everything had worked before.

Those were simple configuration bugs. There will be more bugs in the code.

Bugs in the code


We write code in C++. This means that we already have problems.
The next example is a real production bug from the Yandex.Metrica cluster (2015), a consequence of C++ code. The bug was that sometimes, instead of a response to a query, the user received an error message:

  • "Checksum doesn't match, corrupted data" - the checksum does not match, the data is corrupted - scary!
  • "LRUCache became inconsistent. There must be a bug in it" - the cache has become inconsistent, most likely there is a bug in it.

The code we wrote is telling us itself that it contains a bug.

" Checksum doesn't match, corrupted data ." Check sums of compressed data blocks are checked before being decompressed. Usually this error appears when data is broken on the file system. For various reasons, some files turn out to be garbage when the server is restarted.

But here was a different case: I read the file manually, the checksum matches, there is no error. Once it has appeared, the error is stably reproduced on a repeated query. After a server restart the error disappears for a while, and then stably appears again.

Perhaps the problem is in RAM? A typical situation is when bits flip in it. I look in dmesg (kern.log), but there are no machine check exceptions - they are usually logged when something is wrong with the RAM. If the server had bad RAM, not only my program would misbehave: all the others would produce random errors too. However, there are no other manifestations of the error.

"LRUCache became inconsistent. There must be a bug in it. " This is a clear mistake in the code, and we are writing in C ++ - perhaps memory access? But tests under AddressSanitizer, ThreadSanitizer, MemorySanitizer, UndefinedBehaviorSanitizer in CI show nothing.

Perhaps some cases are not covered by tests? I build the server with AddressSanitizer and run it on production - it catches nothing. For some time the error goes away after resetting the mark cache (the cache of index marks).

One of the rules of programming says: if it is not clear what the bug is, look closely at the code and hope to find something there. I did so, found a bug, fixed it - it did not help. I look at another place in the code - there is also a bug. I fixed it, and again it did not help. I fixed a few more; the code got better, but the error still did not disappear!

The cause. I try to find a pattern by server, by time, by the nature of the load - nothing helps. Then I realized that the problem shows up only on one of the clusters, and never on the others. The error is not reproduced very often, but after a restart it always appears on that one cluster, while everything is clean on the others.

It turned out that the reason was that the "problem" cluster used a new feature - cache dictionaries. They use the hand-written memory allocator ArenaWithFreeLists. Not only do we write in C++, we also build our own custom allocators - we doom ourselves to problems twice over.

ArenaWithFreeLists is a region of memory in which allocations are carved out sequentially in power-of-two sizes: 16, 32, 64 bytes, and so on. Freed blocks of each size form a singly linked free list.

Let's look at the code.

class ArenaWithFreeLists
{
    /// Heads of the singly linked free lists, one per size class (Block is the free-list node).
    Block * free_lists[16] {};

    /// Index of the highest set bit of (size - 1), used to pick the size class.
    static auto sizeToPreviousPowerOfTwo(size_t size)
    {
        return _bit_scan_reverse(size - 1);
    }

    char * alloc(size_t size)
    {
        const auto list_idx = findFreeListIndex(size);
        free_lists[list_idx]->...
    }
};

It uses the function _bit_scan_reverse, with an underscore at the beginning.
There is an unwritten rule: "If a function has one underscore at the beginning, read its documentation once; if two, read it twice."
We obey and read the documentation: "int _bit_scan_reverse (int a). Set dst to the index of the highest set bit in 32-bit integer a. If no bits are set in a then dst is undefined." It seems we have found the problem.

In C++, this situation is considered impossible by the compiler. The compiler may use undefined behavior - this "impossibility" - as an assumption when optimizing the code.

The compiler does nothing wrong here - it honestly generates the assembly instruction bsr %edi, %eax. But if the operand is zero, the bsr instruction has undefined behavior, not at the C++ level but at the CPU level: if the source register is zero, the destination register is left unchanged. Whatever garbage was at the input will remain at the output.

The result depends on where the compiler places this instruction. Sometimes the function with this instruction is inlined, sometimes not. In the latter case the code looks something like this:

bsrl %edi, %eax
retq

Then I looked at an example of similar code in my binary using objdump.



Looking at the results, I see that sometimes the source register and the destination register are the same. If there was a zero, the result will also be zero - everything is fine. But sometimes the registers are different, and then the result will be garbage.

How does this bug manifest itself?

  • We use garbage as an index into the free_lists array. Instead of staying inside the array, we go off to some distant address and corrupt memory there.
  • As luck would have it, almost all the nearby addresses are filled with data from the cache - so we corrupt the cache. The cache contains file offsets.
  • We read files at the wrong offset. At the wrong offset we read a "checksum". But it is not really a checksum, it is something else, and it will not match the data that follows.
  • We get the error "Checksum doesn't match, corrupted data".

Fortunately, it was not the data on disk that got corrupted, but only the cache in RAM. We were told about the error right away, because we checksum the data. The error was fixed on December 27, 2015, and we went off to celebrate.
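The fix itself boils down to never calling the intrinsic with a zero argument. A minimal sketch of a guarded variant (not the exact ClickHouse patch; it uses the GCC/Clang builtin __builtin_clz and assumes the size fits in 32 bits):

#include <cstddef>
#include <cstdint>

// Index of the highest set bit of (size - 1), with an explicit guard:
// both _bit_scan_reverse(0) and __builtin_clz(0) are undefined, so the
// zero case is handled separately and maps to the smallest size class.
static size_t sizeToPreviousPowerOfTwo(size_t size)
{
    if (size <= 1)
        return 0;
    return 31 - __builtin_clz(static_cast<uint32_t>(size - 1));
}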

As you can see, wrong code can at least be fixed. But how do you fix bugs in hardware?

Bugs in hardware


These are not even bugs, but physical laws - inevitable effects. According to the laws of physics, hardware inevitably fails.

Non-atomic writes to RAID. For example, we created a RAID1 array of two hard drives. That means one server is already a distributed system: data is written to one hard drive and to the other. But what if the data is written to one disk and the power goes out while it is being written to the second? The data on the RAID1 array will not be consistent, and we will not be able to tell which copy is correct: one read will return one version of a byte, another read the other.

You can deal with this by keeping a journal. In ZFS, for example, this problem is solved, but more on that later.

Bit rot on HDDs and SSDs. Bits on hard drives and SSDs can go bad just like that. Modern SSDs, especially ones with multi-level cells, are designed with the expectation that cells will constantly degrade. Error correction codes help, but sometimes the cells degrade so badly that even they do not save you, and you get undetected errors.

Bit flips in RAM (but what about ECC?). Bits get corrupted in server RAM too. It also has error correction codes. When errors occur, they are usually visible in the Linux kernel log, in dmesg. When there are many errors, we will see something like "N million memory errors have been corrected". But individual flipped bits will go unnoticed, and something will surely misbehave.

Bit flips at the CPU and network level. There are errors at the CPU level, in CPU caches and, of course, when transmitting data over the network.

How do hardware errors usually show up? A ticket "A malformed znode prevents ClickHouse from starting" arrives on GitHub - the data in a ZooKeeper node is corrupted.

In ZooKeeper we usually write some metadata in plain text. Something is wrong with it: the word "replica" is spelled very strangely.



It rarely happens that a single bit changes because of a bug in the code. Of course, we could write such code: take a Bloom filter, change bits at certain addresses, calculate the addresses incorrectly, change the wrong bit, and have it land on some data. That's it: now ClickHouse has not "replica" but "repliba", and all the data under it is wrong. But usually a single changed bit is a symptom of a hardware problem.

Perhaps you know about bitsquatting. Artyom Dinaburg ran an experiment: there are domains on the Internet that receive a lot of traffic, although users do not go to them on their own. For example, the domain FB-CDN.com is a Facebook CDN.

Artyom registered a similar domain (and many others), but with one bit changed: for example, FA-CDN.com instead of FB-CDN.com. The domain was not published anywhere, yet traffic came to it. Sometimes the FB-CDN host name was written in the HTTP headers, but the request went to the other host because of RAM errors on users' devices. RAM with error correction does not always help either; sometimes it even gets in the way and leads to vulnerabilities (read about Rowhammer, ECCploit, RAMBleed).
Conclusion: always checksum the data yourself.
When writing to the file system, checksum without fail. When transmitting over the network, checksum as well - do not assume that there are already checksums there.
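As an illustration of what "checksum it yourself" means in code, here is a minimal sketch that frames a block as [16-byte checksum][payload] on write and verifies it on read. It uses CityHash128 from the cityhash library as an example 128-bit hash; the framing and the names are illustrative, not ClickHouse's exact on-disk format.

#include <city.h>      // CityHash128; ClickHouse vendors its own copy of cityhash
#include <cstring>
#include <stdexcept>
#include <string>

// Prepend a 128-bit checksum of the payload before writing it out.
std::string frameBlock(const std::string & payload)
{
    const uint128 hash = CityHash128(payload.data(), payload.size());
    std::string framed(16, '\0');
    std::memcpy(&framed[0], &hash.first, 8);
    std::memcpy(&framed[8], &hash.second, 8);
    return framed + payload;
}

// Verify the checksum right after reading, before doing anything else with the data.
std::string unframeBlock(const std::string & framed)
{
    if (framed.size() < 16)
        throw std::runtime_error("Block is too small to contain a checksum");

    uint128 expected;
    std::memcpy(&expected.first, &framed[0], 8);
    std::memcpy(&expected.second, &framed[8], 8);

    const std::string payload = framed.substr(16);
    if (CityHash128(payload.data(), payload.size()) != expected)
        throw std::runtime_error("Checksum doesn't match, corrupted data");
    return payload;
}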

More bugs!..


The Metrica production cluster. Users sometimes get an exception in response to a query: "Checksum doesn't match: corrupted data" - the checksum does not match, the data is corrupted.



The error message shows detailed data: what checksum was expected, what checksum is actually in the data, the size of the block being checksummed, and the exception context.

The exception appeared when we received a packet over the network from some server - it looks familiar. Perhaps it is again memory corruption, a race condition, or something else.

This exception first appeared in 2015. The bug was fixed and it no longer occurred. In February 2019 it suddenly appeared again. At the time I was at a conference, and my colleagues dealt with the problem. The error reproduced several times a day across 1000 ClickHouse servers: statistics could not be collected first on one server, then on another. At the same time there were no new releases during this period. Solving the problem did not work out, but after a few days the error disappeared by itself.

The error was forgotten, and on May 15, 2019 it came back. We continued to deal with it. The first thing I did was look at all the available logs and graphs. I studied them all day, did not understand anything and did not find any patterns. If a problem cannot be reproduced, the only option is to collect all the cases and look for patterns and dependencies. Perhaps the Linux kernel was not working correctly with the processor, saving or restoring some registers incorrectly.

Hypotheses and patterns


7 out of 9 servers with an E5-2683 v4 had the error. But among the error-prone servers, only about half have an E5-2683 v4 - an empty hypothesis.

The errors usually do not repeat on the same servers - except on the mtauxyz cluster, where the data really is corrupted (bad data on disk). That is a different case, so we reject this hypothesis too.

The error does not depend on the Linux kernel - I checked on different servers and found nothing. There is nothing interesting in kern.log, no machine check exception messages. There is nothing interesting in the network graphs either: retransmits, CPU, IO, Network. All the network adapters are the same on the servers where the errors do and do not occur.

There are no patterns. What to do? Keep looking for patterns. Second attempt.

I look at server uptime: it is high, the servers run stably, there are no segfaults or anything like that. I am always glad to see that a program crashed with a segfault - at least it crashed. Worse is when an error quietly corrupts something and nobody notices.

The errors are grouped by day and occur over a couple of days. On some days more of them appear, on some fewer, then more again - it is not possible to pin down the exact time the errors start.

For some of the errors, the packets and the expected checksum coincide. Most errors have only two variants of the packet. I was lucky that we had added the actual checksum value to the error message - that helped to gather statistics.

There are no patterns by the servers we read the data from. The size of the compressed block that we checksum is less than a kilobyte. I looked at the packet sizes in hex. That did not help either - nothing stands out in the binary representation of the packet sizes and checksums.

The error was still not fixed, so I went looking for patterns again. Third attempt.

For some reason the error appears only on one of the clusters: on the third replicas in the Vladimir DC (we like to name data centers after cities). In February 2019 an error also appeared in the Vladimir DC, but on a different version of ClickHouse. That is another argument against the hypothesis that we wrote bad code: we had already rewritten it three times between February and May. The error is probably not in the code.

All the errors occur when reading packets over the network - while receiving packet from. The packet on which the error occurs depends on the structure of the query. For queries with a different structure, the error is on different checksums. But in queries where the error is on the same checksum, the constants differ.

All the queries with the error, except one, are GLOBAL JOINs. But for comparison there is one unusually simple query, and the compressed block size for it is only 75 bytes:

SELECT max(ReceiveTimestamp) FROM tracking_events_all 
WHERE APIKey = 1111 AND (OperatingSystem IN ('android', 'ios'))

We reject the hypothesis that GLOBAL JOIN has anything to do with it.

The most interesting thing is that the affected servers are grouped into ranges by name:
mtxxxlog01-{39..44 57..58 64 68..71 73..74 76}-3.

I was tired and desperate and began to look for completely delusional patterns. It is a good thing I never got as far as debugging the code with numerology. But there were still leads.

  • The groups of problem servers are the same as in February.
  • The problem servers are located in certain parts of the data center. The Vladimir DC has so-called lines - its different sections: VLA-02, VLA-03, VLA-04. The errors clearly cluster: in some lines everything is fine (VLA-02), in others there are problems (VLA-03, VLA-04).

Trial-and-error debugging


All that remained was to debug by trial and error: form a hypothesis "what happens if I try this?" and collect data. For example, in the query_log table I found a simple query with the error for which the size of compressed block is very small (107 bytes).



I took the query, copied it, and executed it manually using clickhouse-local.

strace -f -e trace=network -s 1000 -x \
clickhouse-local --query "
    SELECT uniqIf(DeviceIDHash, SessionType = 0)
    FROM remote('127.0.0.{2,3}', mobile.generic_events)
    WHERE StartDate = '2019-02-07' AND APIKey IN (616988,711663,507671,835591,262098,159700,635121,509222)
        AND EventType = 1 WITH TOTALS" --config config.xml

With the help of strace I get a dump of the blocks sent over the network - exactly the same packets that are transferred when this query runs - and I can study them. You could use tcpdump for this, but it is inconvenient: it is hard to isolate a specific query from production traffic.

You could also strace the ClickHouse server itself. But that server runs in production, and I would get a pile of incomprehensible data. So I launched a separate program that executes exactly one query, and ran strace on that program to capture what was transmitted over the network.

The query runs without errors - the error does not reproduce. If it had reproduced, the problem would have been solved. So I copied the packets into a text file and began to parse the protocol by hand.



The checksum was the same as expected. This is exactly the packet on which errors sometimes occurred - at other times, in other queries. But so far there were no errors.

I wrote a simple program that takes the packet and checks the checksum after flipping one bit at a time. The program performed a bit flip at every possible position and recomputed the checksum.
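In essence the program is just a double loop over byte and bit positions (a minimal sketch; CityHash128 stands in for the real block checksum, and the names are made up - reported_checksum is the value copied out of the error message):

#include <city.h>
#include <cstdio>
#include <string>

// Brute-force single-bit corruption: flip every bit of the captured block,
// recompute the checksum and compare it with the "actual" checksum from the
// error message. A match means the corruption was a single bit flip.
void findSingleBitFlip(std::string block, uint128 reported_checksum)
{
    for (size_t byte = 0; byte < block.size(); ++byte)
        for (int bit = 0; bit < 8; ++bit)
        {
            block[byte] ^= (1 << bit);   // flip one bit
            if (CityHash128(block.data(), block.size()) == reported_checksum)
                std::printf("match: byte %zu, bit %d\n", byte, bit);
            block[byte] ^= (1 << bit);   // flip it back
        }
}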



I ran the program and found that if you flip the value of one particular bit, you get exactly the broken checksum from the complaint.

Hardware problem


If an error occurs in software (for example, memory corruption), a single bit flip is unlikely. So a new hypothesis appeared: the problem is in the hardware.

One could close the laptop lid and say: "The problem is not on our side, it is in the hardware; we don't deal with that." But no, let's try to understand where exactly the problem is: in the RAM, on the hard drive, in the processor, in the network card, or in the RAM of the network card in the network equipment.

How to localize a hardware problem?

  • The problem arose and disappeared on certain dates.
  • Affected servers are grouped by their names: mtxxxlog01-{39..44 57..58 64 68..71 73..74 76}-3.
  • The groups of problem servers are the same as in February.
  • The problem servers are only in certain lines of the data center.

We had a question for the network engineers: could the data be getting corrupted on the network switches? It turned out that the network engineers had replaced the switches with different ones exactly on those dates. After our question they swapped back the previous ones, and the problem disappeared.

The problem is resolved, but questions still remain (no longer for engineers).

Why didn't the ECC (error-correcting) memory on the network switches help? Because multiple bit flips can compensate each other - the result is an undetected error.

Why don't TCP checksums help? They are weak. If only one bit in the data changes, the TCP checksum will always detect the change. If two bits change, the change may go undetected - the flips cancel each other out.

In our packet only one bit changed, yet the error was not visible. That is because 2 bits changed in the TCP segment: the checksum was computed over the whole segment and it matched. But one TCP segment carries more than one packet of our application, and for each of those we compute our own checksum. In that application packet only one bit changed.
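To see why a 16-bit ones'-complement sum can be fooled like this, here is a small self-contained sketch (the checksum is computed RFC 1071 style; the "segment" contents are made up). Two flips of the same bit position in different 16-bit words leave the sum unchanged:

#include <cstdint>
#include <cstdio>
#include <vector>

// RFC 1071 Internet checksum: ones'-complement sum of 16-bit words.
uint16_t internetChecksum(const std::vector<uint16_t> & words)
{
    uint32_t sum = 0;
    for (uint16_t w : words)
    {
        sum += w;
        sum = (sum & 0xFFFF) + (sum >> 16);   // fold the carry back in
    }
    return static_cast<uint16_t>(~sum);
}

int main()
{
    // A made-up "TCP segment" carrying two application packets.
    std::vector<uint16_t> segment = {0x1234, 0x00FF, 0xABCD, 0x0F00};
    const uint16_t before = internetChecksum(segment);

    // Two compensating flips of bit 3: 0 -> 1 in one word, 1 -> 0 in another.
    segment[0] ^= 0x0008;   // 0x1234 -> 0x123C
    segment[2] ^= 0x0008;   // 0xABCD -> 0xABC5

    const uint16_t after = internetChecksum(segment);
    std::printf("before=%04x after=%04x same=%d\n", before, after, before == after);
}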

Why don't Ethernet checksums help - aren't they stronger than TCP's? The Ethernet checksum protects the data from corruption while it travels over one segment of the wire (I may be wrong with the terminology, I am not a network engineer). Network equipment forwards these frames and can corrupt data while forwarding. The checksums are simply recalculated: we checked - on the wire the packets did not change. But if the data is corrupted inside the switch itself, the switch recalculates the checksum (which now matches the corrupted data) and forwards the frame onward.
Nothing will save you - checksum the data yourself. Do not expect someone else to do it for you.
For data blocks we compute a 128-bit checksum (overkill, just in case). We correctly report the error to the user. The data was corrupted while being transmitted over the network, but we did not write it anywhere - all our stored data is in order, no need to worry.

The data stored in ClickHouse stays consistent, because we use checksums in ClickHouse. We love checksums so much that we compute three of them at once:

  • A checksum of each compressed data block, when writing it to a file or to the network.
  • A total checksum of the compressed data, for cross-checking.
  • A total checksum of the uncompressed data, for cross-checking.

There are bugs in data compression algorithms - that is a known class of problems. Therefore, when data is replicated, we also compare the total checksum of the compressed data and the total checksum of the uncompressed data.
Do not be afraid to compute checksums: they do not slow things down.
Of course, it depends on which ones you compute and how; there are nuances, but definitely compute checksums. For example, if you checksum the compressed data, there is less of it, and the checksum will not slow anything down.

Improved error message


How do we explain to a user who receives an error message like this that it is a hardware problem?



If the checksum does not match, then before throwing the exception I try flipping every single bit - just in case. If the checksum matches after a change and exactly one bit was changed, the problem is most likely in the hardware.

If we can detect this error, and it goes away when one bit is flipped back, why not just fix it? We could, but if we silently fix such errors all the time, the user will never know that the hardware has a problem.

When we found out that the switches had problems, people from other departments began to report: "And we had one bit written incorrectly to Mongo! And something got corrupted in PostgreSQL!" That is good, but it would be better to report such problems earlier.

When we shipped a release with the new diagnostics, the first user for whom it triggered wrote to us a week later: "Here is the message - what is the problem?" Unfortunately, he had not read it. But I read it and suggested, with 99% probability, that if the error appears on just one server, the problem is in the hardware. I leave the remaining percent for the case where I wrote the code incorrectly - that happens. In the end the user replaced the SSD, and the problem disappeared.

"Delirium" in the data


This interesting and unexpected problem made me worry. We have Yandex.Metrica data; one of the columns of the database stores a simple JSON - user parameters coming from the JavaScript code of the counter.

I run some query, and the ClickHouse server crashes with a segfault. From the stack trace I understood what the problem was: a fresh commit from our external contributors from another country. The commit was fixed, and the segfault disappeared.

I run the same query - a SELECT in ClickHouse to fetch that JSON - and again something is off: everything works slowly. I get the JSON, and it is 10 MB. I print it and look more closely: {"jserrs": cannot find property of object undefind... and then a megabyte of binary data spills out.



My first thought was that this is again memory corruption or a race condition. A lot of binary data like this is bad - it can contain anything. If so, I am about to find passwords and private keys in there. But I did not find anything, so I rejected that hypothesis right away. Maybe it is a bug in my code on the ClickHouse server? Or perhaps in the program that writes the data (it is also written in C++) - what if it accidentally dumps its own memory into ClickHouse? In this hell I began to look closely at the bytes and realized that it was not that simple.

Following the clues


The same garbage was recorded on two clusters, independently of each other. The data is junk, but it is valid UTF-8. This UTF-8 contains some strange URLs, font names, and a lot of the Cyrillic letter "я" in a row.

What is special about the lowercase Cyrillic "я"? No, it is not about Yandex. The point is that in the Windows-1251 encoding it is character number 255 (0xFF). And nobody on our Linux servers uses Windows-1251.

It turns out this was a dump of the browser's memory: the JavaScript code of the Metrica counter collects JavaScript errors. As it turned out, the answer is simple - it all came from the user.

From here, too, conclusions can be drawn.

Bugs from all over the Internet


Yandex.Metrica collects traffic from a billion devices on the Internet: browsers on PCs, mobile phones, tablets. Garbage will arrive inevitably: user devices have bugs, unreliable RAM is everywhere, and there is terrible hardware that overheats.

The database stores more than 30 trillion rows (page views). If you analyze the data in this table, you can find anything in it.

Therefore, the right approach is simply to filter this garbage before writing it to the database. There is no need to write garbage into the database - it does not like that.
