PHP: array_key_exists searches 500 times faster than in_array

Back in 2014 there was already an article about searching arrays, but hardly anyone took notice.

Many versions of PHP have been released since then and nothing has changed, which suggests the feedback loop is broken and few people know about this. Python behaves the same way, and Python 3.* is even worse than 2.7.

Sometimes you need to find a string in an array of strings, a very common operation in many algorithms. If the array is small, the lookups are few, and none of it sits in a loop, then in_array is fine and does not affect overall speed. But on big data, say searching an array of a billion strings a billion times, it becomes critical: an hour instead of a week.

A simple test shows:

in_array: ideone.com/Yb1mDa, about 6600 ms (6 to 9 seconds depending on load).

array_key_exists finds the same thing, but 250 times faster (php5.6 / py3.*) to 400+ times faster (php7.3 / py2.7): ideone.com/gwSmFc (with the loop count increased 100 times), 12 ms. That is 6600 / 12 = 550 times, give or take 10% due to load and cache.
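A minimal sketch of such a test (the array size, needle, and loop counts here are illustrative, not the exact ideone code):

```php
<?php
// Build a haystack of 100k strings (sizes are illustrative).
$haystack = [];
for ($i = 0; $i < 100000; $i++) {
    $haystack[] = 'str' . $i;
}

// Linear search: in_array scans the values one by one.
$t = microtime(true);
for ($i = 0; $i < 100; $i++) {
    in_array('str99999', $haystack);          // worst case: last element
}
printf("in_array:         %.1f ms\n", (microtime(true) - $t) * 1000);

// Hash lookup: flip values into keys once, then probe by key.
$map = array_flip($haystack);
$t = microtime(true);
for ($i = 0; $i < 100; $i++) {
    array_key_exists('str99999', $map);       // hash probe, no scan
}
printf("array_key_exists: %.1f ms\n", (microtime(true) - $t) * 1000);
```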

Why does this happen? Let's look at it in detail:

1) The classic way to search strings (say, in pure assembler or C) is to sort the array of strings (quicksort, or even bubble sort), then binary-search it.

The number of steps in a binary search grows as log(n) with the size of the array, far fewer than a linear scan.

You could sort the array of strings once, in advance, cache it, and then run a billion searches against it. But that does not help here:

by default the work is redone from scratch on every call. (It was written that in_array was improved in 7.2 through hashing, but not by much.)
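For reference, a minimal sketch of the sort-once-then-binary-search approach in plain PHP (the haystack and helper name are illustrative):

```php
<?php
// Sort once in advance and reuse the sorted array for many lookups.
$haystack = ['pear', 'apple', 'plum', 'fig', 'cherry'];
sort($haystack);   // do this once and cache the result

// Classic binary search: log(n) steps instead of n.
function binary_search(array $sorted, string $needle): bool {
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $cmp = strcmp($needle, $sorted[$mid]);
        if ($cmp === 0) return true;             // found
        if ($cmp < 0) $hi = $mid - 1;            // look in left half
        else          $lo = $mid + 1;            // look in right half
    }
    return false;
}

var_dump(binary_search($haystack, 'plum'));      // bool(true)
```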

2) Looking up an index/key (as a string) in an associative array / dictionary goes through a string hash plus collision handling (hash "misses"). The hash gives a numeric index into the array, and the fetch is (address of element zero) + offset * (size of a pointer to the array of strings with that hash), i.e. about 2 steps, plus a linear pass over the collisions; on average that is fewer steps than a binary search.
The key's hash is computed automatically, once, when the dictionary element $m[key] = val is created, and is cached.
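As a rough conceptual model (not the engine's actual C implementation; crc32 and the bucket count here are stand-ins for the real string hash and table size), the lookup works like this:

```php
<?php
// Toy model of a hash table: bucket index = hash(key) % size,
// where each bucket holds the keys that collided on that index.
$size = 8;
$buckets = array_fill(0, $size, []);

function bucket_of(string $key, int $size): int {
    // crc32 stands in for the engine's real string hash.
    return (crc32($key) & 0x7fffffff) % $size;
}

foreach (['apple', 'plum', 'fig'] as $key) {
    $buckets[bucket_of($key, $size)][] = $key;   // insert: hash once, append
}

// Lookup: one hash, one index fetch, then a short scan of collisions.
function lookup(array $buckets, string $key, int $size): bool {
    foreach ($buckets[bucket_of($key, $size)] as $candidate) {
        if ($candidate === $key) return true;
    }
    return false;
}

var_dump(lookup($buckets, 'plum', $size));       // bool(true)
```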

The hash size and the hashing algorithm are baked into the PHP engine and cannot be changed, although the source code is open: if it is your own server, you can download it, patch it, and recompile.

You can stop reading here: change in_array to array_combine + array_key_exists, and that's it.
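The conversion is a one-liner; a sketch of the before and after (array_flip gives the same effect as array_combine when you only need the keys):

```php
<?php
$haystack = ['apple', 'plum', 'fig'];

// Before: linear scan on every call.
$found = in_array('plum', $haystack);

// After: build the key map once...
$map = array_combine($haystack, $haystack);   // or: array_flip($haystack)
// ...then every lookup is a hash probe.
$found = array_key_exists('plum', $map);
```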

The number of steps in a hash lookup depends on the number of collisions, that is, the number of strings sharing the same hash. Those have to be scanned linearly, or likewise sorted and binary-searched.

To reduce collisions you can allocate more memory, which is easy now, unlike 50 years ago when 1 KB of magnetic core memory cost as much as an airplane. That is when all the basic algorithms were invented (sort / zip / gif / jpg / etc.): they need little memory, but they are weak. Modern ones are much better, but they need a lot of memory, 1-16 MB. Yes, there are servers with 256 MB where each thread gets its own slice and 16 MB is already a lot, but the average user's device has at least 1 GB, and there 16 MB is a drop in the bucket.

You can squeeze out even more by replacing the array_key_exists function call with the isset($m[key]) construct: it does not flush the instruction queue and cache, does not use the stack, and is roughly 20% faster.
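A sketch of the isset variant. One known difference to keep in mind: isset() returns false for a key whose value is null, while array_key_exists() returns true, so store a non-null value:

```php
<?php
$map = array_flip(['apple', 'plum', 'fig']);  // values become int keys, none null

// isset is a language construct, not a function call: no call overhead.
if (isset($map['plum'])) {
    // found
}
```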

You can also speed things up by building a table over the first 2 characters of each string (256 * 256 = 64K entries, a few hundred KB of pointers): first jump by offset (index = code of the 1st character + code of the 2nd * 256) to a pointer to the hash array for the rest of the string, then search the much smaller array of string "tails", where collisions are far rarer.

It needs even more memory and a more complex algorithm, but the search is 30+ times faster. This is not implemented in PHP, though; you can write your own .so/.dll library and call it, or ask the developers to add it in 7.5.
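The idea can still be sketched in userland with nested arrays (purely illustrative; the 30+ times gain described above would need a native implementation):

```php
<?php
// Build: group strings by their first two characters, keep the "tails".
$index = [];
foreach (['apple', 'apricot', 'plum', 'pear'] as $s) {
    $prefix = substr($s, 0, 2);          // stands in for: code1 + code2 * 256
    $index[$prefix][substr($s, 2)] = true;
}

// Search: one prefix jump, then a probe in a much smaller bucket.
function prefix_lookup(array $index, string $s): bool {
    $prefix = substr($s, 0, 2);
    return isset($index[$prefix][substr($s, 2)]);
}

var_dump(prefix_lookup($index, 'plum'));  // bool(true)
```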

You could search through MySQL, but you would need to batch the queries, and it would still be slower.

PS: I stumbled on this method by accident, through trial, intuition, and experience, while speeding up one big slow website. There are many subtleties and tricks like this: I got a data export down from 40 seconds to 0.8 seconds, sped up list output with sorting and filters, and fixed many other places where the standard techniques and frameworks are simply too slow, convenient though they are, and however much they speed up development.
