When the Bloom filter does not fit



I knew about the Bloom filter, a probabilistic data structure named after Burton Bloom, from my university days, but I never had the opportunity to use it. Last month such an opportunity finally appeared, and the structure absolutely fascinated me. However, I soon found some flaws in it. This article is the story of my brief love affair with the Bloom filter.

While researching IP spoofing, I needed to check the source IP addresses of incoming packets against the geographical location of our data centers. For example, packets from Italy should not arrive at a Brazilian data center. The problem may sound simple, but in the ever-changing landscape of the Internet it is far from easy. Suffice it to say that I eventually accumulated many large text files with content roughly like this:



This means that a request from the resolved IP address 192.0.2.1 was recorded in Cloudflare data center number 107. This data came from many sources, including our active and passive probes, the logs of some domains that we own (for example, cloudflare.com), open sources (for example, BGP tables), and so on. The same line usually appears in several files.

In the end I had a gigantic dataset of this kind; at one point I counted a billion lines across all the collected sources. I usually write bash scripts to preprocess input data, but at this scale that approach stopped working. For example, removing duplicates from this tiny file of 600 MiB and 40 million lines takes... an eternity:



Suffice it to say that deduplicating lines with ordinary commands like sort in various configurations (see --parallel, --buffer-size and --unique) was not the fastest option for such a large dataset.

Bloom filters



Illustration by David Epstein, in the public domain

Then it dawned on me: there is no need to sort the lines! I just need to remove duplicates, so some kind of 'set' data structure should be much faster. Furthermore, I roughly know the size of the input (the number of unique lines), and losing a few data points is not critical, so a probabilistic data structure would do just fine.

This is perfect for Bloom filters!

While you go and read the Wikipedia page on Bloom filters, here is how I look at this data structure.

How would you implement a 'set'? Given an ideal hash function and infinite memory, we could simply create an infinite bitmap and, for each element, set bit number hash(item). That would be the ideal 'set' data structure. Right? Trivial. Unfortunately, hash functions collide and infinite memory does not exist, so in reality we have to compromise. But we can calculate the collision probability and keep it under control. For example, say we have a good hash function and 128 GiB of memory. We can calculate that the collision probability for each new element is 1 in 1,099,511,627,776. As you add more elements, the probability grows as the bitmap fills up.
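A minimal sketch of that naive "bitmap as a set" idea (not the article's code): the caller supplies a 64-bit hash of the element, and the bitmap size is assumed to be a power of two (128 GiB of memory would give 2^40 bits).

    /* Naive bitmap set: one bit per possible hash value (modulo the bitmap size). */
    #include <stdint.h>

    struct bitmap_set {
        uint8_t  *bits;    /* nbits / 8 bytes, zero-initialized (e.g. from calloc) */
        uint64_t  nbits;   /* must be a power of two */
    };

    /* Returns 1 if the element was (probably) seen before, 0 if it is new. */
    int bitmap_set_add(struct bitmap_set *s, uint64_t hash)
    {
        uint64_t bit  = hash & (s->nbits - 1);     /* which bit of the bitmap */
        uint8_t  mask = 1u << (bit & 7);
        int seen = (s->bits[bit >> 3] & mask) != 0;
        s->bits[bit >> 3] |= mask;                 /* remember the element */
        return seen;
    }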

We can also apply more than one hash function to each element and get a denser bitmap. This is exactly what the Bloom filter does. Mathematically, it is described by four variables:

  • n - the number of inserted elements (cardinality)
  • m - the memory used by the bitmap
  • k - the number of hash functions computed for each input element
  • p - the probability of a false positive match

Given the cardinality n and the desired false positive probability p, the Bloom filter math returns the required memory m and the required number of hash functions k.
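For reference, these are the textbook sizing formulas, shown here as a small sketch rather than the tool's actual code (mmuniq-bloom itself fixes k = 8 and rounds m up to a power of two, as described below, so its parameters differ from this optimum):

    /* m = -n * ln(p) / (ln 2)^2  -- bits of memory
     * k = (m / n) * ln 2         -- number of hash functions */
    #include <math.h>

    /* Required bitmap size in bits for n elements and false positive rate p. */
    double bloom_bits(double n, double p)
    {
        return ceil(-n * log(p) / (log(2) * log(2)));
    }

    /* Optimal number of hash functions for n elements in m bits. */
    double bloom_hashes(double n, double m)
    {
        return round(m / n * log(2));
    }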

Check out Thomas Hurst's excellent visualization of how the parameters affect each other.

mmuniq-bloom


Guided by intuition, I added a probabilistic tool, mmuniq-bloom, to my arsenal: it reads lines from STDIN and writes only the unique ones to STDOUT. It should be much faster than the sort + uniq combination!

Here it is:


For simplicity and speed, I fixed a few parameters up front. First, unless stated otherwise, mmuniq-bloom uses eight hash functions, k = 8. This seems close to optimal for our data size, and the hash function can quickly produce eight decent hashes. Second, the bitmap memory m is rounded up to a power of two to avoid the expensive modulo operation %, which compiles down to the slow div instruction. When the array size is a power of two, we can use a bitwise AND instead (for fun, read about how compilers optimize some divisions by multiplying by a magic constant).
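The power-of-two trick in isolation, as a small sketch (the constant here is an example, not the tool's configuration): when m is a power of two, hash % m is equivalent to hash & (m - 1), which avoids the slow division.

    #include <stdint.h>
    #include <assert.h>

    uint64_t slot_mod(uint64_t hash, uint64_t m)  { return hash % m; }        /* uses div */
    uint64_t slot_mask(uint64_t hash, uint64_t m) { return hash & (m - 1); }  /* single AND */

    int main(void)
    {
        uint64_t m = 1ULL << 30;   /* 2^30 bits = a 128 MiB bitmap */
        assert(slot_mod(0x123456789abcdefULL, m) == slot_mask(0x123456789abcdefULL, m));
        return 0;
    }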

Now we can run it on the same data file that we used before:



Oh, this is so much better! 12 seconds instead of two minutes. The program uses an optimized data structure, a relatively limited amount of memory, optimized line parsing and good output buffering... and yet 12 seconds still feels like an eternity compared to wc -l:



What is going on? I understand that counting lines with wc is easier than finding unique lines, but is a 26x difference really justified? Where does mmuniq-bloom spend all its CPU time?

It must be the hash computation. The wc utility does not burn CPU doing all this strange math for each of the 40 million lines. I use a fairly non-trivial hash function, siphash24; surely it is what burns the CPU, right? Let's check by running only the hash function, without performing any Bloom filter operations:
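A rough sketch of that "hash only" experiment (not the article's actual benchmark). FNV-1a stands in here for siphash24, since the exact siphash24 API the tool uses is not shown; the point is simply to hash every line and touch no Bloom filter at all.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t fnv1a64(const char *s, size_t len)
    {
        uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 0x100000001b3ULL;                 /* FNV-1a prime */
        }
        return h;
    }

    int main(void)
    {
        char line[4096];
        uint64_t sink = 0;                         /* keep the compiler from dropping the loop */
        while (fgets(line, sizeof(line), stdin))
            sink ^= fnv1a64(line, strlen(line));   /* hash every line, no filter operations */
        fprintf(stderr, "%016llx\n", (unsigned long long)sink);
        return 0;
    }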



This is strange. Computing the hash function takes only about two seconds, yet the whole program ran for 12 seconds in the previous test. Does the Bloom filter alone take 10 seconds? How is that possible? It is such a simple data structure...

Secret weapon: the profiler


It's time to use the right tool for the job: let's run a profiler and see what the CPU is actually doing. First, let's run strace to verify that there are no unexpected system calls:



Everything looks good. The ten mmap calls at 4 ms (3971 μs) each are intriguing, but that's fine. We pre-populate the memory with MAP_POPULATE to avoid page faults later.

What's the next step? Of course, it's perf!



Then let's see the result:



So we really do burn 87.2% of the cycles in our main code. Let's see where exactly. Running perf annotate process_line --source immediately shows something unexpected.



You can see that 26.90% of the CPU time is burned in a single mov, but that's not all! The compiler correctly inlines the function and unrolls the loop, and it turns out that most of the cycles go to that mov, i.e. to the line uint64_t v = *p!



Clearly perf must be mistaken; how can such a simple line take so much time? But repeating the test with any other profiler shows the same problem. For example, I like to use google-perftools with kcachegrind because of its colorful diagrams:



The result of the visualization is as follows:



Let me summarize what we have discovered so far.

The standard wc utility processes the 600 MiB file in 0.45 s of CPU time. Our optimized mmuniq-bloom tool takes 12 seconds. The CPU is burned on a single mov instruction that dereferences memory...


Image by Jose Nicdao, CC BY 2.0

Oh! How could I forget? Random memory access is really slow! Very, very, very slow!

According to the "numbers every programmer should know", a single access to main memory takes about 100 ns. Let's count: 40 million lines, 8 hashes each. Since our Bloom filter is 128 MiB, on our older hardware it does not fit into the L3 cache! The hashes are evenly distributed over a wide range of memory, so each of them causes a cache miss. Put it all together, and it turns out...
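A quick back-of-the-envelope estimate, assuming roughly 100 ns per cache miss:

    40,000,000 lines x 8 hashes x 100 ns = 32,000,000,000 ns = 32 seconds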



It turns out that 32 seconds should be burned on memory accesses alone. The real program finishes in only 12 seconds because the Bloom filter still benefits from some caching. This is easy to see with perf stat -d:



Yes, we expected a minimum of 320 million cache misses (LLC-load-misses), but only 280 million happened, and this still does not explain why the program ran in just 12 seconds. It does not really matter, though. What matters is that the number of cache misses is a real problem, and we can only solve it by reducing the number of memory accesses. Let's try configuring the Bloom filter with just one hash function:



Ouch! That really hurts! To get a collision probability of 1 in 10,000 lines, the Bloom filter with one hash function needed 64 gigabytes of memory. This is terrible!
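A rough sanity check of that figure, assuming that with a single hash function the false positive rate is roughly n/m and that the tool rounds m up to a power of two as described earlier:

    m ~ n / p = 40,000,000 x 10,000 = 4 x 10^11 bits ~ 47 GiB, rounded up to 2^39 bits = 64 GiB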

On top of that, the speed does not seem to have improved much. It took the operating system 22 seconds to prepare the memory for us, and we still spent 11 seconds in user space. I suspect that any advantage from touching memory less often is now cancelled out by a lower chance of hitting the cache, due to the drastically increased memory size. Remember, 128 MiB was enough for the previous Bloom filter!

Giving up on Bloom filters


This is getting ridiculous. To keep the false positive probability low, you must either use many hash functions in the Bloom filter (for example, eight), paying with many memory accesses, or keep one hash function and pay with a huge amount of memory.

We are not really constrained by memory; we want to minimize the number of accesses to it. All we need is a data structure that costs at most one cache miss per element and uses less than 64 gigabytes of RAM...

Of course, we could implement sophisticated data structures such as a cuckoo filter, but surely there is a simpler option. What about a good old linear probing hash table?


Illustration by Vadims Podans

Meet mmuniq-hash


Here is the new version of mmuniq-bloom using a hash table:


Instead of the Bloom filter's bits, we now store 64-bit hashes produced by the siphash24 function. This gives us much stronger protection against hash collisions, far better than one in 10,000 lines.
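A minimal sketch of the idea (not mmuniq-hash's actual code): an open-addressing hash set that stores the 64-bit hashes directly, with 0 marking an empty slot and a power-of-two table size somewhat larger than the input.

    #include <stdint.h>

    /* Returns 1 if the hash was already in the table, 0 if it was inserted now. */
    int hashset_add(uint64_t *table, uint64_t nslots, uint64_t h)
    {
        if (h == 0)
            h = 1;                              /* reserve 0 as the "empty" marker */
        uint64_t i = h & (nslots - 1);
        for (;;) {
            if (table[i] == h)
                return 1;                       /* duplicate line */
            if (table[i] == 0) {
                table[i] = h;                   /* claim the empty slot */
                return 0;
            }
            i = (i + 1) & (nslots - 1);         /* linear probing: try the next slot */
        }
    }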

Let's do the math. The chance of a hash collision when adding one new item to a hash set that already holds, say, 40 million entries is 40,000,000 / 2^64, about 1 in 461 billion, a reassuringly low probability. But we are not adding a single element to a pre-filled set! Instead, we add 40 million lines to an initially empty set, and according to the birthday paradox this greatly increases the likelihood of a collision. A reasonable approximation is ~n^2/(2m); in our case that is ~(40M)^2/(2 * 2^64), which works out to about one chance in 23,000. In other words, with a good hash function we expect a collision in roughly one out of every 23,000 random sets of 40 million elements. This is a non-zero probability, but it is still better than the Bloom filter's, and it is perfectly tolerable for our use case.
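Plugging in the numbers as a quick check of that estimate:

    (4 x 10^7)^2 / (2 x 2^64) = 1.6 x 10^15 / 3.7 x 10^19 ~ 4.3 x 10^-5, i.e. about 1 in 23,000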

The hash table code runs faster, has better memory access patterns and a lower false positive probability than the Bloom filter.



Don't be alarmed by the "hash conflicts" line; it just shows how full the hash table is. We use linear probing, so when we hit an occupied slot we simply move on to the next one. In our case we have to skip 0.7 slots on average before finding an empty spot in the table, which is fine. Since we scan the slots in linear order, the memory prefetches nicely.

From the earlier experiment we know that the hash function itself takes about two seconds, so we can conclude that 40 million random memory accesses take about four seconds.

Lessons learned


Modern CPUs are really good at sequential memory access when the access pattern is predictable (see cache prefetching). Random memory access, on the other hand, is very expensive.

Advanced data structures are very interesting, but beware: modern computers demand cache-optimized algorithms. When working with data sets that do not fit in the L3 cache, prefer to optimize for the number of memory accesses rather than for the amount of memory used.

It's fair to say that Bloom filters perform great as long as they fit into the L3 cache; once they don't, they are terrible. This is not news: Bloom filters optimize for memory usage, not for the number of memory accesses. See, for example, the scientific paper on cuckoo filters.

Another thing is the endless debate about hash functions. Honestly, in most cases it does not matter. The cost of computing even a relatively complex hash function like siphash24 is small compared to the cost of random memory access. In our case, simplifying the hash function would bring only a small benefit; the CPU time is simply spent elsewhere, waiting for memory!

A colleague of mine likes to say: "You can assume modern CPUs are infinitely fast. They run at infinite speed, until they hit the memory wall."

Finally, don't repeat my mistake. Always profile first with perf stat -d and look at the IPC (instructions per cycle) counter. If it is below one, the program is usually stuck waiting for memory. Values above two are good; they mean the workload is mostly CPU-bound. Unfortunately, in my workloads IPC is still low...

Improved mmuniq


With help from colleagues, I wrote an improved version of the mmuniq tool based on a hash table. Here is the code:


It can resize the hash table dynamically, so it supports input of arbitrary cardinality. It also processes the data in batches, making effective use of the CPU prefetch hint, which speeds the program up by 35-40%. Be careful: sprinkling prefetch through the code rarely helps; I had to specifically reorder the algorithm to take advantage of it. With all the improvements, the run time dropped to 2.1 seconds:
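A hedged sketch of the batching-plus-prefetch idea (not the actual mmuniq code): issue prefetches for a whole batch of hashes first, then probe the table, so the cache misses overlap instead of being paid one after another. __builtin_prefetch is a GCC/Clang builtin; hashset_add is the linear-probing insert sketched earlier.

    #include <stdint.h>
    #include <stddef.h>

    int hashset_add(uint64_t *table, uint64_t nslots, uint64_t h);  /* as sketched earlier */

    void process_batch(uint64_t *table, uint64_t nslots,
                       const uint64_t *hashes, size_t n)
    {
        /* Pass 1: hint to the CPU which cache lines we are about to touch. */
        for (size_t i = 0; i < n; i++)
            __builtin_prefetch(&table[hashes[i] & (nslots - 1)]);

        /* Pass 2: by now the lines are (hopefully) in cache, so probing is cheap. */
        for (size_t i = 0; i < n; i++)
            hashset_add(table, nslots, hashes[i]);
    }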



The end


Writing a basic tool that tries to beat the 'sort / uniq' combination revealed some hidden aspects of modern computing. With a bit of sweat, we sped the program up from more than two minutes to two seconds. Along the way we learned about the latency of random memory access and the power of cache-friendly data structures. Fancy data structures attract attention, but in practice it is often more effective to simply reduce the number of random memory accesses.
