Simple hash table for GPU


I have posted a new project on GitHub called A Simple GPU Hash Table.

This is a simple hash table for the GPU, capable of processing hundreds of millions of inserts per second. On my laptop with an NVIDIA GTX 1060, the code inserts 64 million randomly generated key-value pairs in about 210 ms and removes 32 million pairs in about 64 ms.

That works out to roughly 300 million inserts per second and 500 million removals per second on the laptop.

The table is written in CUDA, although the same technique could be applied in HLSL or GLSL. The implementation has several constraints that keep performance high on the GPU:

  • Only 32-bit keys and 32-bit values are supported.
  • The hash table has a fixed capacity.
  • That capacity must be a power of two.

An empty sentinel value must be reserved for both keys and values (in this code it is 0xffffffff).
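
A minimal sketch of the constants that the code snippets below rely on; kEmpty and kHashTableCapacity are the names used later in the article, while the specific capacity here is only an example:

const uint32_t kEmpty = 0xffffffff;                     // sentinel marking an empty key or value
const uint32_t kHashTableCapacity = 128 * 1024 * 1024;  // fixed capacity; must be a power of two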

A lock-free hash table


The hash table uses open addressing with linear probing, so it is just an array of key-value pairs stored in memory, with excellent cache behavior. That is not the case with chaining, which involves chasing pointers through a linked list. The hash table is a simple array of KeyValue elements:

struct KeyValue
{
    uint32_t key;
    uint32_t value;
};

The table capacity is a power of two rather than a prime number, because a power-of-two capacity lets the slot index be wrapped with a single fast AND instruction, while the modulo operator is much slower. This matters for linear probing, because as the probe walks the table the slot index has to be wrapped around at every slot, so the cost of a modulo would be paid at every slot. A comparison of the two is sketched below.
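
For illustration only, here is the difference in a tiny sketch (the AND form is the one used throughout the code in this article):

// Two ways to advance a linear probe; kHashTableCapacity is a power of two.
__device__ uint32_t next_slot_and(uint32_t slot)    { return (slot + 1) & (kHashTableCapacity - 1); }  // single fast AND
__device__ uint32_t next_slot_modulo(uint32_t slot) { return (slot + 1) % kHashTableCapacity; }        // much slower modulo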

The table stores only the key and value for each item, not the hash of the key. Since the table stores only 32-bit keys, the hash can be recomputed very quickly. The code uses the Murmur3 hash, which performs only a few shifts, XORs, and multiplications.
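
As a sketch of what such a hash might look like, here is the Murmur3 32-bit finalizer, with the result masked down to a valid slot index so it can be used directly as a starting slot in the code below (the masking-inside-the-hash convention is my assumption):

// Murmur3 fmix32 finalizer, masked to the table capacity.
__device__ uint32_t hash(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    k *= 0xc2b2ae35;
    k ^= k >> 16;
    return k & (kHashTableCapacity - 1);
}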

The hash table uses a lock-free technique that does not depend on memory write ordering. Even if some writes are reordered relative to others, the hash table still maintains a correct state. We will discuss this below. The technique works well on GPUs, where thousands of threads run concurrently.

Keys and values in the hash table are initialized to empty.

The code could be modified to handle 64-bit keys and/or values. Keys require atomic read, write, and compare-and-swap operations; values require atomic reads and writes. Fortunately, in CUDA, reads and writes of 32- and 64-bit values are atomic as long as they are naturally aligned, and modern GPUs support 64-bit atomic compare-and-swap. Of course, moving to 64 bits costs some performance.
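
For concreteness, here is a hedged sketch of the 64-bit pieces the paragraph above calls for; every name here (KeyValue64, kEmpty64, claim_slot64) is hypothetical and not from the article's code:

struct KeyValue64
{
    unsigned long long key;    // CUDA's 64-bit atomicCAS operates on unsigned long long
    unsigned long long value;
};

const unsigned long long kEmpty64 = 0xffffffffffffffffull;  // assumed 64-bit empty sentinel

__device__ void claim_slot64(KeyValue64* hashtable, unsigned long long slot,
                             unsigned long long key, unsigned long long value)
{
    // 64-bit atomic compare-and-swap on the key.
    unsigned long long prev = atomicCAS(&hashtable[slot].key, kEmpty64, key);
    if (prev == kEmpty64 || prev == key)
    {
        hashtable[slot].value = value;  // plain store; naturally aligned 64-bit stores are atomic
    }
}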

Hash table states


Each key-value pair in the hash table is in one of four states:

  • Key and value are both empty. This is the state the hash table is initialized to.
  • The key has been written, but the value has not been yet. If another thread reads the slot at that moment, it gets back an empty value. That is fine: the same thing would have happened if that thread had simply run a little earlier, and this is a concurrent data structure.
  • Both the key and the value have been written.
  • The value is visible to other threads, but the key is not yet. This can happen because CUDA has a weakly ordered memory model. That is fine too: as far as other threads are concerned, the key is still empty, even if the value no longer is.

An important point is that once a key has been written into a slot, it never moves, even when the key is deleted; more on this below.

The hash table code works even on weakly ordered memory models, where the order of reads and writes is not guaranteed. As we walk through insertion, lookup, and deletion, keep in mind that every key-value pair is in one of the four states described above.

Insert into a hash table


The CUDA function that inserts a single key-value pair into the hash table looks like this:

__device__ void gpu_hashtable_insert(KeyValue* hashtable, uint32_t key, uint32_t value)
{
    uint32_t slot = hash(key);

    while (true)
    {
        // Atomically claim the slot if its key is empty; atomicCAS returns the slot's previous key.
        uint32_t prev = atomicCAS(&hashtable[slot].key, kEmpty, key);
        if (prev == kEmpty || prev == key)
        {
            hashtable[slot].value = value;
            break;
        }
        // The slot holds a different key; linearly probe the next slot, wrapping around the table.
        slot = (slot + 1) & (kHashTableCapacity-1);
    }
}

To insert a key, the code iterates over the hash table array starting at the hash of the key being inserted. At each slot it performs an atomic compare-and-swap that compares the key in that slot against the empty sentinel and, if they are equal, replaces it with the inserted key; either way, atomicCAS returns the slot's original key. If that original key was empty or equal to the inserted key, the code has found a suitable slot and writes the inserted value into it.

If a single kernel call to gpu_hashtable_insert() processes several elements with the same key, any of their values may end up in that key's slot. This is considered correct: one of the key-value writes during the call will win, but since all of this happens in parallel across many threads, we cannot predict which write to memory will be the last one.

Hash Table Search


The lookup code:

__device__ uint32_t gpu_hashtable_lookup(KeyValue* hashtable, uint32_t key)
{
    uint32_t slot = hash(key);

    while (true)
    {
        if (hashtable[slot].key == key)
        {
            return hashtable[slot].value;
        }
        if (hashtable[slot].key == kEmpty)
        {
            return kEmpty;
        }
        slot = (slot + 1) & (kHashTableCapacity - 1);
    }
}

To find the value stored for a given key, we iterate over the array starting at the key's hash. At each slot we check whether its key is the one we are looking for, and if so, we return the value. We also check whether the slot's key is empty, and if so, we stop the search.

If we cannot find the key, then the code returns an empty value.

Lookups can run concurrently with insertions and deletions. From any given thread's point of view, each pair in the table is in one of the four states described above.

Hash table deletion


The deletion code:

__device__ void gpu_hashtable_delete(KeyValue* hashtable, uint32_t key, uint32_t value)
{
    uint32_t slot = hash(key);

    while (true)
    {
        if (hashtable[slot].key == key)
        {
            hashtable[slot].value = kEmpty;
            return;
        }
        if (hashtable[slot].key == kEmpty)
        {
            return;
        }
        slot = (slot + 1) & (kHashTableCapacity - 1);
    }
}

Deleting a key is handled in an unusual way: the key is left in the table, and only its value (not the key itself) is marked empty. The code is very similar to lookup(), except that when it finds a matching key it sets that slot's value to empty.

As mentioned above, once a key has been written to a slot, it never moves. Even when an item is deleted, the key stays in place; only its value becomes empty. This means we do not need an atomic operation when writing the slot's value: whether the current value is empty or not, it will end up empty either way.

Resize a hash table


You can resize the hash table by creating a larger table and inserting all non-empty elements from the old table into it. I did not implement this functionality because I wanted to keep the sample code simple. Moreover, in CUDA programs memory allocation is usually done in host code, not inside a CUDA kernel.
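
A hedged sketch of how such a resize kernel could look, following the description above; the kernel name is hypothetical, and hash_raw() is a hypothetical helper returning the unmasked 32-bit hash (the article's hash is masked by the old capacity):

__global__ void gpu_hashtable_reinsert_kernel(KeyValue* newtable, uint32_t newcapacity,
                                              const KeyValue* oldtable, uint32_t oldcapacity)
{
    unsigned int threadid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadid >= oldcapacity)
        return;

    uint32_t key = oldtable[threadid].key;
    uint32_t value = oldtable[threadid].value;
    if (key == kEmpty || value == kEmpty)
        return;  // skip empty slots and deleted entries

    // Same linear-probing insert as before, but against the new capacity.
    // hash_raw(): hypothetical helper returning the unmasked 32-bit hash.
    uint32_t slot = hash_raw(key) & (newcapacity - 1);
    while (true)
    {
        uint32_t prev = atomicCAS(&newtable[slot].key, kEmpty, key);
        if (prev == kEmpty || prev == key)
        {
            newtable[slot].value = value;
            return;
        }
        slot = (slot + 1) & (newcapacity - 1);
    }
}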

The article A Lock-Free Wait-Free Hash Table describes how to resize such a lock-free data structure.

Concurrency


In the code snippets above, gpu_hashtable_insert(), _lookup(), and _delete() each process one key-value pair at a time. Below, gpu_hashtable_insert(), _lookup(), and _delete() are applied to whole arrays of pairs in parallel, each pair in its own GPU thread of execution:

// CPU code to invoke the CUDA kernel on the GPU
uint32_t threadblocksize = 1024;
uint32_t gridsize = (numkvs + threadblocksize - 1) / threadblocksize;
gpu_hashtable_insert_kernel<<<gridsize, threadblocksize>>>(hashtable, kvs, numkvs);

// GPU code to process numkvs key/values in parallel
__global__ void gpu_hashtable_insert_kernel(KeyValue* hashtable, const KeyValue* kvs, unsigned int numkvs)
{
    unsigned int threadid = blockIdx.x*blockDim.x + threadIdx.x;
    if (threadid < numkvs)
    {
        gpu_hashtable_insert(hashtable, kvs[threadid].key, kvs[threadid].value);
    }
}
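
The lookup and delete kernels follow the same pattern. As a sketch, here is what the lookup kernel could look like, assuming the convention that each looked-up value is written back into the kvs array (the kernel name and that convention are mine):

// GPU code to look up numkvs keys in parallel, writing each result into kvs[i].value
__global__ void gpu_hashtable_lookup_kernel(KeyValue* hashtable, KeyValue* kvs, unsigned int numkvs)
{
    unsigned int threadid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadid < numkvs)
    {
        kvs[threadid].value = gpu_hashtable_lookup(hashtable, kvs[threadid].key);
    }
}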

The lock-free hash table supports concurrent insertions, lookups, and deletions. Because key-value pairs are always in one of the four states and keys never move, the table guarantees correctness even when different kinds of operations are mixed.

However, if a batch of insertions and deletions is processed in parallel and the input array contains duplicate keys, we cannot predict which pairs will "win", that is, be written to the hash table last. Suppose the insert code is called with the input array of pairs A/0 B/1 A/2 C/3 A/4. When it completes, the pairs B/1 and C/3 are guaranteed to be in the table, but key A may hold the value from any one of A/0, A/2, or A/4. Depending on the application this may or may not be a problem: you may know in advance that the input contains no duplicate keys, or you may not care which value is written last.

If this is a problem, then the duplicate pairs need to be split across separate CUDA kernel launches. In CUDA, a kernel launch always completes before the next kernel launch on the same stream (kernels launched on different streams can run in parallel). In the example above, if one kernel is launched with A/0 B/1 A/2 C/3 and a second with A/4, then key A is guaranteed to end up with the value 4.
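
As a sketch (hypothetical host code, matching the style of the launch snippet above; kvs_a holds A/0 B/1 A/2 C/3 and kvs_b holds A/4), two launches on the same default stream execute one after the other:

gpu_hashtable_insert_kernel<<<gridsize_a, threadblocksize>>>(hashtable, kvs_a, numkvs_a);  // A/0 B/1 A/2 C/3
gpu_hashtable_insert_kernel<<<gridsize_b, threadblocksize>>>(hashtable, kvs_b, numkvs_b);  // A/4: runs after the first launch completes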

Now let's consider whether lookup() and delete() should use a plain or a volatile pointer to the array of key-value pairs in the hash table. The CUDA documentation states:

The compiler is free to optimize reads and writes to global or shared memory ... These optimizations can be disabled using the volatile keyword: ... any reference to this variable compiles to an actual memory read or write instruction.

Correctness does not require volatile. If a thread uses a cached value from an earlier read, it will be working with slightly stale information, but that information still comes from a correct state of the hash table at some point during the kernel call. If you need the very latest information, you can use a volatile pointer, but performance drops slightly: in my tests, deleting 32 million items went from about 500 million deletions per second down to 450 million deletions per second.
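
If the latest-information behavior is what you want, the change is small; here is a sketch of a lookup through a volatile pointer (the function name is mine, not the article's):

// Every access through the volatile pointer compiles to a real read from global memory.
__device__ uint32_t gpu_hashtable_lookup_latest(volatile KeyValue* hashtable, uint32_t key)
{
    uint32_t slot = hash(key);
    while (true)
    {
        uint32_t k = hashtable[slot].key;
        if (k == key)
            return hashtable[slot].value;
        if (k == kEmpty)
            return kEmpty;
        slot = (slot + 1) & (kHashTableCapacity - 1);
    }
}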

Performance


In a test that inserts 64 million elements and then deletes 32 million of them, std::unordered_map is really no competition for the GPU hash table:


std::unordered_map took 70,691 ms to insert and delete the elements and then free the unordered_map (freeing a map holding millions of elements takes a long time, because unordered_map performs many memory allocations internally). To be fair, std::unordered_map operates under very different constraints: it runs in a single CPU thread, supports keys and values of any size, works well at high load factors, and shows stable performance after many deletions.

The GPU hash table, including all of the interop with the GPU, took 984 ms. This includes the time spent allocating and freeing the table (a one-time 1 GB allocation, which takes a noticeable amount of time in CUDA), inserting and deleting the elements, and iterating over them. It also includes all of the copies to and from GPU memory.

The hash table itself took 271 ms. This covers the time the GPU spends inserting and deleting items, and excludes the time spent copying to and from GPU memory and iterating over the resulting table. This is the relevant measurement if the GPU table lives for a long time, or if the hash table stays entirely in GPU memory (for example, a hash table used by other GPU code rather than by the CPU).

The GPU hash table achieves this performance thanks to the GPU's huge memory bandwidth and massive parallelism.

Disadvantages


This hash table design has a few drawbacks to keep in mind:

  • Clustering from linear probing places keys far from their ideal slots.
  • Deleted keys are not removed by the delete function, so over time they clutter up the table.

As a result, the hash table's performance can gradually degrade, especially if it lives for a long time and goes through many insertions and deletions. One way to mitigate this is to rehash into a new table with a fairly low load factor, filtering out the deleted keys during the rehash.

To illustrate these problems, I used the code above to create a table with capacity for 128 million elements and inserted elements in batches of 4 million until 124 million slots were filled (a load factor of about 0.96). Here is the table of results; each row is one CUDA kernel call inserting 4 million new elements into the same hash table:

Load factor   Time to insert 4,194,304 elements
0.00          11.608448 ms (361.314798 million keys/sec)
0.03          11.751424 ms (356.918799 million keys/sec)
0.06          11.942592 ms (351.205515 million keys/sec)
0.09          12.081120 ms (347.178429 million keys/sec)
0.12          12.242560 ms (342.600233 million keys/sec)
0.16          12.396448 ms (338.347235 million keys/sec)
0.19          12.533024 ms (334.660176 million keys/sec)
0.22          12.703328 ms (330.173626 million keys/sec)
0.25          12.884512 ms (325.530693 million keys/sec)
0.28          13.033472 ms (321.810182 million keys/sec)
0.31          13.239296 ms (316.807174 million keys/sec)
0.34          13.392448 ms (313.184256 million keys/sec)
0.37          13.624000 ms (307.861434 million keys/sec)
0.41          13.875520 ms (302.280855 million keys/sec)
0.44          14.126528 ms (296.909756 million keys/sec)
0.47          14.399328 ms (291.284699 million keys/sec)
0.50          14.690304 ms (285.515123 million keys/sec)
0.53          15.039136 ms (278.892623 million keys/sec)
0.56          15.478656 ms (270.973402 million keys/sec)
0.59          15.985664 ms (262.379092 million keys/sec)
0.62          16.668673 ms (251.627968 million keys/sec)
0.66          17.587200 ms (238.486174 million keys/sec)
0.69          18.690048 ms (224.413765 million keys/sec)
0.72          20.278816 ms (206.831789 million keys/sec)
0.75          22.545408 ms (186.038058 million keys/sec)
0.78          26.053312 ms (160.989275 million keys/sec)
0.81          31.895008 ms (131.503463 million keys/sec)
0.84          42.103294 ms (99.619378 million keys/sec)
0.87          61.849056 ms (67.815164 million keys/sec)
0.90          105.695999 ms (39.682713 million keys/sec)
0.94          240.204636 ms (17.461378 million keys/sec)

As the load factor grows, throughput drops. In most cases that is undesirable. If an application inserts elements into a table and then discards it (for example, when counting the words in a book), this is not a problem. But if the application uses a long-lived hash table (for example, a graphics editor storing the non-empty parts of images, with the user frequently inserting and deleting data), this behavior can be troublesome.

I measured the probe length after 64 million inserts (a load factor of 0.5). The average probe length was 0.4774, so most keys sat either in their best possible slot or one slot away from it. The maximum probe length was 60.

Then I measured the probe length in a table with 124 million inserts (a load factor of 0.97). The average probe length was already 10.1757, and the maximum was 6474 (!!). Linear probing performance drops dramatically at high load factors.

It's best to keep this hash table at a low load factor. But then we are buying performance with memory. Fortunately, for 32-bit keys and values this can be justifiable. In the example above, if the table with capacity for 128 million elements is kept at a load factor of 0.25, we can store at most 32 million elements in it, and the remaining 96 million slots are wasted: 8 bytes per pair, 768 MB of wasted memory.

Note that this is GPU memory being wasted, which is a more precious resource than system memory. Although most modern desktop GPUs that support CUDA have at least 4 GB of memory (at the time of writing, the NVIDIA RTX 2080 Ti has 11 GB), wasting that much of it is still not the wisest choice.

Later I will write more about building GPU hash tables that do not suffer from long probe lengths, as well as ways to reuse deleted slots.

Measuring probe length


To determine a key's probe length, we can subtract the key's hash (its ideal index in the table) from its actual index in the table:

// get_key_index() -> index of key in hash table
uint32_t probelength = (get_key_index(key) - hash(key)) & (hashtablecapacity-1);

Thanks to the magic of two's-complement arithmetic and the fact that the hash table's capacity is a power of two, this approach works even when the key's index has wrapped around to the beginning of the table. Take a key that hashed to 1 but was inserted at slot 3: for a table with capacity 4 we get (3 - 1) & 3, which equals 2. A key that hashed to 3 but wrapped around to slot 1 gives (1 - 3) & 3, which is also 2.

Conclusion


If you have questions or comments, write to me on Twitter or open a new issue in the repository.

This code is inspired by some great articles:


In the future I will continue writing about GPU hash table implementations and analyzing their performance. I have plans for chaining, Robin Hood hashing, and cuckoo hashing with atomic operations in GPU-friendly data structures.
