String optimization in ClickHouse. Yandex Report

The ClickHouse analytical database engine processes a lot of different strings, and this consumes resources. To speed up the system, new optimizations are constantly being added. ClickHouse developer Nikolay Kochetov talks about the string data types, including the new LowCardinality type, and explains how to make working with strings faster.


- First, let's see how you can store strings.



We have string data types. String works well by default and should be used almost always. It has a small overhead of 9 bytes per string. If we want the string size to be fixed and known in advance, it is better to use FixedString. There you can set the number of bytes you need, which is convenient for data such as IP addresses or hashes.
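For illustration, here is a minimal sketch of a table using both types (the table and column names are made up for this example):

    CREATE TABLE example_strings
    (
        UserAgent String,          -- variable-length string, the usual default
        IPv4Hex   FixedString(8),  -- fixed-size value, e.g. a hex-encoded IPv4 address
        Digest    FixedString(16)  -- e.g. a binary MD5 hash
    )
    ENGINE = MergeTree
    ORDER BY tuple();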



Of course, sometimes queries are slow. Suppose you run a query on a table. ClickHouse reads a fairly large amount of data, say at a speed of 100 GB/s, yet processes few rows per second. We have two tables that store almost the same data. ClickHouse reads data from the second table at a higher speed in bytes, but it processes three times fewer rows per second.



If we look at the size of the compressed data, it is almost equal. In fact, the same data is written in both tables - the first billion numbers - only in the first table they are stored as UInt64, and in the second as String. Because of this, the second query spends more time reading the data from disk and decompressing it.
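A rough sketch of how such a pair of tables could be built (table names and the use of the numbers table function are illustrative):

    CREATE TABLE numbers_uint (x UInt64) ENGINE = MergeTree ORDER BY tuple();
    CREATE TABLE numbers_str  (x String) ENGINE = MergeTree ORDER BY tuple();

    INSERT INTO numbers_uint SELECT number           FROM numbers(1000000000);
    INSERT INTO numbers_str  SELECT toString(number) FROM numbers(1000000000);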



Here is another example. Suppose there is a predetermined set of strings, limited to, say, 1,000 or 10,000 values, and it almost never changes. For this case, the Enum data type suits us; ClickHouse has two of them - Enum8 and Enum16. Because the values are stored as numbers in an Enum, queries are processed quickly.
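A hedged example of declaring an Enum column (the values here are purely illustrative):

    CREATE TABLE visits_enum
    (
        Browser Enum8('Chrome' = 1, 'Firefox' = 2, 'Safari' = 3, 'Other' = 4)
    )
    ENGINE = MergeTree
    ORDER BY tuple();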

ClickHouse has accelerations for GROUP BY, IN, DISTINCT and optimizations for some functions, for example for comparison with a constant string. Of course, the Enum numbers are not converted to strings; on the contrary, the constant string is converted to an Enum value, and after that everything is compared quickly as numbers.

But there are also disadvantages. Even if we know the exact set of strings, sometimes it needs to be extended. When a new string arrives, we have to do an ALTER.
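For instance, here is a sketch of extending the hypothetical visits_enum table from the example above with one more value:

    ALTER TABLE visits_enum
        MODIFY COLUMN Browser Enum8('Chrome' = 1, 'Firefox' = 2, 'Safari' = 3, 'Other' = 4, 'Edge' = 5);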



ALTER for Enum is implemented fairly optimally in ClickHouse. We do not rewrite the data on disk, but the ALTER can still be slow because the Enum structure is stored in the schema of the table itself. Therefore we have to wait, for example, for read queries on the table to finish.

The question is whether we can do better. Maybe. You could store the Enum structure not in the table schema but in ZooKeeper. However, synchronization problems may occur: for example, one replica has received the new values and another has not, and if a replica has an old Enum, something will break. (In ClickHouse, we have almost finished non-blocking ALTER queries. When they are fully done, we will no longer have to wait for read queries.)



To avoid dealing with ALTER for Enum, you can use external ClickHouse dictionaries. Let me remind you that this is a key-value data structure inside ClickHouse that lets you pull data from external sources, for example from MySQL tables.

In the ClickHouse dictionary we store the many different strings, and in the table we store their identifiers as numbers. If we need the string, we call the dictGet function and work with it. After that, we no longer need ALTER: to add a new value to the set, we simply insert it into the same MySQL table.
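A sketch of what such a query might look like; the dictionary name ('phone_models'), the attribute name ('name'), and the PhoneModelID column are hypothetical:

    SELECT
        dictGet('phone_models', 'name', toUInt64(PhoneModelID)) AS PhoneModel,
        count()
    FROM visits
    GROUP BY PhoneModel;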

But there are other problems. First, the syntax is awkward: every time we want the string, we have to call dictGet. Second, some optimizations are missing; comparison with a constant string is not fast for dictionaries either.

There can also be problems with updates. Suppose we requested a string from a cache dictionary, but it is not in the cache yet. Then we have to wait until the data is loaded from the external source.



A common drawback of both methods is that we store all the keys in one place and have to synchronize them. So why not store the dictionaries locally? No synchronization - no problem. You can store a dictionary locally inside each data part on disk: that is, we did an INSERT and wrote out a dictionary along with it. If we work with data in memory, we can attach a dictionary either to a data block, or to a piece of a column, or to some cache, to speed up calculations.

Dictionary encoding of strings


So we came to creating a new data type in ClickHouse - LowCardinality. It is a format for storing data: how it is written to disk and read back, how it is represented in memory, and how it is processed.



There are two columns on the slide. On the right, strings are stored in the usual String type; you can see that these are some kind of mobile phone models. On the left is exactly the same column, only in the LowCardinality type. It consists of a dictionary of the distinct strings (the strings from the column on the right) and a list of positions (row numbers pointing into the dictionary).

From these two structures you can restore the original column. There is also a reverse index - a hash table that finds the position in the dictionary for a given string. It is needed to speed up some queries, for example when we compare against a string, search for a string in our column, or merge columns together.

LowCardinality is a parametric data type. Its parameter can be a number, something stored as a number, a string, or a Nullable of these.
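Declaring such a column simply wraps the base type (column names here are illustrative):

    CREATE TABLE hits
    (
        PhoneModel LowCardinality(String),
        EventDate  Date
    )
    ENGINE = MergeTree
    ORDER BY EventDate;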



The peculiarity of LowCardinality is that it can be preserved by some functions. An example query is shown on the slide. In the first line, I created a column of type LowCardinality(String) and named it S. Then I asked for its type - ClickHouse answered that it is LowCardinality(String). All right.

The third line is almost the same, only we called the length function on it. In ClickHouse, the length function returns the UInt64 data type. But here we got LowCardinality(UInt64). What is the point?
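A sketch approximating the queries from the slide (the exact slide text may differ); toLowCardinality converts a value to the LowCardinality type:

    SELECT toTypeName(toLowCardinality('iPhone 7')) AS s;
    -- LowCardinality(String)

    SELECT toTypeName(length(toLowCardinality('iPhone 7')));
    -- LowCardinality(UInt64)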



The mobile phone names were stored in the dictionary; we applied the length function. Now we have a similar dictionary consisting only of numbers - the lengths of the strings. The column with positions has not changed. As a result, we processed less data and saved query time.

There can be other optimizations, such as adding a simple cache: once a function value has been computed, we can remember it and not compute it again.

GROUP BY can also be optimized, because our column with the dictionary is already partially aggregated: we can quickly compute hash function values and roughly find the bucket where the next row goes. We can also specialize some aggregate functions, for example uniq, because it is enough to pass only the dictionary to it and leave the positions untouched - this works faster. The first two optimizations have already been added to ClickHouse.



But what if we create a column of our data type and insert many different, unexpected strings into it? Will we run out of memory? No, there are two special settings in ClickHouse for this. The first is low_cardinality_max_dictionary_size - the maximum size of a dictionary that can be written to disk. Insertion works like this: as the stream of strings comes in, we build one shared dictionary from them. If the dictionary grows beyond the setting value, we write the current dictionary to disk and store the remaining strings separately on the side, next to the indexes. As a result, we never rebuild an oversized dictionary and avoid memory problems.

The second setting is called low_cardinality_use_single_dictionary_for_part. Imagine that in the previous scheme our dictionary filled up during an insert and we wrote it to disk. The question arises: why not start forming another dictionary in exactly the same way?

When that one overflows, we write it to disk too and begin forming a third one. This setting controls exactly that: it is off by default, so multiple dictionaries per part are allowed, and enabling it disables this behavior.
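A hedged illustration of setting these two options at the session level (the values are arbitrary examples, not recommendations):

    SET low_cardinality_max_dictionary_size = 8192;           -- max dictionary size written to disk per part
    SET low_cardinality_use_single_dictionary_for_part = 0;   -- 0 allows several dictionaries per part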

In fact, many dictionaries can be useful if we intended to insert a certain set of strings but accidentally inserted some “garbage” along with it. Say we inserted the bad strings first and the good ones after. Then the dictionary will be split into many small dictionaries: some will contain garbage, but the last ones will contain the good strings. And if we read, say, only the last granule, everything will still work quickly.



Before talking about the advantages of LowCardinality, I will say right away that we are unlikely to get smaller data on disk (although it can happen), because ClickHouse already compresses the data. The default option is LZ4; you can also compress with ZSTD. But both algorithms already do dictionary-based compression, so our additional dictionary inside ClickHouse does not help much.

To back this up with numbers, I took some data from Metrica and saved it in different data types - String, LowCardinality(String), and Enum. The result was three columns with one billion rows each. The first column, CodePage, has only 62 distinct values, and you can see that LowCardinality(String) compressed it better. String is a little worse, but that is most likely because the strings are short, and their stored lengths take up a lot of space and compress poorly.

If you take PhoneModel, there are more than 48 thousand distinct values, and there is almost no difference between String and LowCardinality(String). On the URL column, we also saved only about 2 GB - I do not think you should count on that.

Estimating the speed of queries




Now let's estimate the speed of queries. To do this, I used a dataset describing taxi rides in New York; it is available on GitHub. It contains a little over a billion trips. It includes the location, the start and end times of a trip, the payment method, the number of passengers, and even the type of taxi - green, yellow, and Uber.



I made the first query quite simple - I asked where taxis are most often ordered. To do this, take the pickup location, do GROUP BY on it, and compute the count function. ClickHouse returns the answer shown on the slide.
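A sketch of what the query might look like, assuming the pickup location column is called pickup_ntaname (column and table names in the actual dataset may differ):

    SELECT pickup_ntaname, count() AS trips
    FROM trips
    GROUP BY pickup_ntaname
    ORDER BY trips DESC
    LIMIT 10;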



To measure the query processing speed, I created three tables with the same data but used three different data types for the pickup location - String, LowCardinality(String), and Enum. LowCardinality and Enum are about five times faster than String: Enum because it works with numbers, LowCardinality because the GROUP BY optimization is implemented for it.



Let's complicate the query and ask where the most popular park in New York is. Again, we will measure this by where taxis are most often ordered, but this time we filter only those locations that contain the word “park”. So we add a LIKE condition.
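A sketch of the second query, under the same naming assumption as above:

    SELECT pickup_ntaname, count() AS trips
    FROM trips
    WHERE pickup_ntaname LIKE '%park%'
    GROUP BY pickup_ntaname
    ORDER BY trips DESC
    LIMIT 10;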



Looking at the timings, we see that Enum suddenly became slow - it works even slower than the plain String type. This is because the like function is not optimized for Enum at all: we have to convert our values from Enum back to ordinary strings, so we do more work. like is not specially optimized for LowCardinality(String) by default either, but there it runs over the dictionary, so the query is faster than with String.

There is a more fundamental problem with Enum: if we want to optimize it, we have to do so in every place in the code. Suppose we write a new function - we have to specifically come up with an Enum optimization for it. With LowCardinality, everything is optimized by default.



Let's look at the last, more artificial query. We will simply compute a hash function of our location. A hash function is rather slow to compute, it takes a long time, so everything runs about three times slower.
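A sketch of such a query; sipHash64 is used here merely as an example of a relatively heavy function, and the column name is the same assumption as above:

    SELECT sum(sipHash64(pickup_ntaname))
    FROM trips;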



LowCardinality is still faster, even though there is no filtering. This is because our functions run only over the dictionary: the hash function has a single argument, so it processes less data and can also return a LowCardinality result.



Our overall plan is for LowCardinality never to be slower than String in any case, while keeping the speedups. And maybe someday we will replace String with LowCardinality: you will update ClickHouse, and everything will simply work a little faster for you.
