Writing substring search better than in the textbooks



The life of an engineer is full of surprises, especially when you have to deal with performance. For example, what happens if you try to run this piece of Java code? It looks pretty innocent:

// String.repeat is available since JDK 11:
final var needle = "A".repeat(500000) + "B";
final var haystack = "A".repeat(1000000) + "B";
System.out.println(haystack.indexOf(needle));

We wait, wait, wait... On my 2015 laptop with OpenJDK 13, finding this needle in this haystack takes about a minute. Our good old JVM has gone through decades of performance tuning, it has efficient intrinsics for String.indexOf, and so on. What could have gone wrong?
This is the first in a series of articles published courtesy of their author, Linas Medžiūnas, and originally posted on the WiX Engineering blog.


Take a closer look at the input: the data is deliberately crafted to hit the quadratic worst case (O(nm), where n is the length of haystack and m is the length of needle) of the naive substring search algorithm. We run through all the characters in haystack, and whenever one matches the first character of needle, we start running along needle in the inner loop, and so on until the first mismatched character.
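To make the quadratic behavior concrete, here is a minimal sketch of that naive algorithm (an illustration only, not the JDK's actual implementation):

  // Naive search, the way String.indexOf conceptually works: O(nm) in the worst
  // case, because the inner loop can run almost the full length of the needle
  // for nearly every starting position in the haystack.
  def naiveIndexOf(haystack: String, needle: String): Int = {
    var i = 0
    while (i <= haystack.length - needle.length) {
      var j = 0
      while (j < needle.length && haystack.charAt(i + j) == needle.charAt(j)) j += 1
      if (j == needle.length) return i // full match starting at position i
      i += 1
    }
    -1
  }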

You may argue that this example is useless, because such input data was crafted on purpose and you will never encounter it in practice. Think twice. What if you are working on a web service whose users can upload arbitrary strings, and somewhere in the backend there is code that runs indexOf on those strings? Then just a few malicious requests like the one above will bring your service to its knees. It is worth knowing, at the very least, what the worst-case input looks like.

Fortunately, there are substring search algorithms with linear complexity (O(n+m)). They have no problems with the data from the example above. For instance, the following Scala code does the same thing, but runs in milliseconds on the same computer, on the same JVM, and using exactly the same java.lang.String under the hood:

val needle = "A" * 500000 + "B"
val haystack = "A" * 1000000 + "B"
println(haystack.indexOfSlice(needle))

The secret to the huge difference is inside the indexOfSlice method, which is part of the Scala standard library. It implements the clever linear Knuth-Morris-Pratt algorithm. And no, I am not saying that language X is better than language Y. Unfortunately, things are much more complicated than that! For example, indexOfSlice in Scala is a generic method that works not only with strings but with any sequential collection, and can compare not only characters but elements of any type. It should be much slower than Java's String.indexOf in the average case (we will talk about this later). So we have an efficient algorithm with much better worst-case performance, but one that is slower on average because of a much larger constant factor. Dilemmas like this are a typical problem in performance tuning. There is no magic pill that solves everything; you need to analyze the problem carefully and write the right micro-benchmarks.



Are you still with me? Good! You see, this was just the introduction. I wanted to motivate you to care about both the theoretical complexity and the practical performance of algorithms. In the rest of this article, we will look at implementations of several substring search algorithms and their benchmarks.

We will explore three substring search algorithms. All of them run in linear time and require preprocessing that is linear in the length of needle. The preprocessing of a given needle has to be done only once, and the result can then be reused across multiple searches. This is reasonable, because in many cases we need to search for the same needle again and again. And even when we do not, the precomputation is not a particularly expensive operation.

All the algorithms below visit each character of haystack exactly once, sequentially (no random access by index), so they all work fine in streaming mode. This article grew out of real work on a production proxy server based on the Netty framework, and that influenced some of the API design decisions. Also, since we needed to search over byte buffers, the code works with Byte rather than Char.
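To give a feel for the shape of such a streaming API, here is a rough sketch of the processor interface assumed in the rest of this article (the trait name is mine, not the actual library's; the return convention matches Netty's ByteProcessor, as explained later for the concrete process methods):

  // One byte of the haystack at a time; return true to keep consuming bytes,
  // false to stop (for the search processors below, "stop" means "match found").
  trait SearchProcessor {
    def process(value: Byte): Boolean
  }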



Knuth-Morris-Pratt (the KMP algorithm)


This is a well-known substring search algorithm dating back to the 1970s. It is well described in the literature, so I will not cover it in detail here. KMP is based on a finite state machine: during the precomputation phase, an array of fallback indices is built from needle. During the search, the machine accepts the characters of haystack one by one and updates its internal state accordingly (and the state is just an index into that table).
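Still, a compact sketch may help picture it. Below is a minimal KMP processor in the spirit described above (my own illustration, not the linked implementation; note that for clarity it returns true on a match, whereas the real streaming API inverts that convention):

  final class KmpProcessor(needle: Array[Byte]) {
    // Precompute the table of fallback indices (the classic "failure function").
    private val fallback: Array[Int] = {
      val table = new Array[Int](needle.length)
      var k = 0
      for (i <- 1 until needle.length) {
        while (k > 0 && needle(i) != needle(k)) k = table(k - 1)
        if (needle(i) == needle(k)) k += 1
        table(i) = k
      }
      table
    }

    private var state = 0 // how many bytes of needle are currently matched

    // Feed one haystack byte; returns true exactly when the full needle has just been matched.
    def process(value: Byte): Boolean = {
      while (state > 0 && value != needle(state)) state = fallback(state - 1)
      if (value == needle(state)) state += 1
      if (state == needle.length) {
        state = fallback(state - 1) // allow overlapping matches to continue
        true
      } else false
    }
  }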

Here is an implementation in Scala.

Bitwise substring search (Shifting Bit Mask)


Initially, I had to invent a name for this algorithm myself: I had never seen anything like it in the literature. I settled on "Shifting Bit Mask". It later turned out that this algorithm and its variations have been known since 1964 under various names such as "Bitap", "Shift-or", "Shift-and", and "Baeza-Yates-Gonnet". Thanks to the readers who tracked it down for me; this article was written long before that news.

This algorithm is based on a very simple idea and performs very well, since it involves almost no branching and relies on a few primitive bitwise operations. Because of this, it has a limit on the length of the needle we can search for: it cannot be longer than 64 bytes. That number is simply the number of bits in a Long on the JVM. This limitation is generous enough for a great many real tasks.

Since I originally developed this algorithm on my own, I will describe it in more detail. First, we precompute the search context for the given needle:

  import java.lang.Byte.toUnsignedInt // needed to index the mask table by unsigned byte value (0..255)

  def computeBitMasks(needle: Array[Byte]): Array[Long] = {
    require(needle.length <= 64, "Maximum supported search pattern length is 64.")
    val bitMasks = Array.ofDim[Long](256)
    var bit = 1L
    for (c <- needle) {
      bitMasks(toUnsignedInt(c)) |= bit
      bit <<= 1
    }
    bitMasks
  }

We precompute a bitMask (a 64-bit Long) for every possible byte value (256 bitMasks in total). For a given byte value X, its bitMask contains 1-bits in every position where X occurs in needle (a worked example for "abracadabra" follows the next snippet). In addition, we need to precompute successBitMask, which will tell us that we have found a complete match. It is simply a Long value with a single bit set at position needle.length - 1:

  def computeSuccessBitMask(needle: Array[Byte]): Long = {
    1L << (needle.length - 1)
  }
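Since the original illustration of the masks is not reproduced here, here is a small worked example for the needle "abracadabra" (the values follow directly from the two functions above):

  val needleBytes = "abracadabra".getBytes
  val masks = computeBitMasks(needleBytes)
  // 'a' occurs at positions 0, 3, 5, 7, 10 -> mask 10010101001 in binary (0x4A9)
  // 'b' occurs at positions 1, 8           -> mask 00100000010 in binary (0x102)
  // 'r' occurs at positions 2, 9           -> mask 01000000100 in binary (0x204)
  // 'c' occurs at position 4               -> mask 00000010000 in binary (0x010)
  // 'd' occurs at position 6               -> mask 00001000000 in binary (0x040)
  val success = computeSuccessBitMask(needleBytes)
  // a single bit at position needle.length - 1 = 10 -> 10000000000 in binary (0x400)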

And finally, we need to do the actual search. The only mutable state we keep is currentMask (a Long). For each byte of haystack we shift currentMask left by 1 bit, set its least significant bit to 1, and do a bitwise and between the result and the bitMask computed for the byte value currently being processed (this and clears every bit of currentMask in the positions that do not match the current byte).

Thus, after processing each byte, only the bits in matching positions survive, and with every byte processed, all surviving bits shift one position to the left. If a bit survives for as many iterations as the length of needle, we have found a match! And we can detect this with successBitMask:

  // currentMask (a mutable Long), bitMasks and successBitMask live in the enclosing search context
  def process(value: Byte): Boolean = {
    currentMask = ((currentMask << 1) | 1) & bitMasks(toUnsignedInt(value))
    // false means "stop, a match was found": the bit at position needle.length - 1 has just been set
    (currentMask & successBitMask) == 0
  }

Note: the method above returns false when a match is found, which looks counterintuitive. Think of it as true meaning "keep searching" and false meaning "stop": as I mentioned earlier, the API was made compatible with Netty (its ByteProcessor uses the same convention). If you are wondering how to drive the search, here is an example.
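In case the linked example is not at hand, here is a hedged sketch of a wrapper that drives the search over a whole byte array (it reuses computeBitMasks, computeSuccessBitMask and the toUnsignedInt import from above; the real code is meant to be driven by Netty's byte-by-byte iteration instead):

  def indexOf(haystack: Array[Byte], needle: Array[Byte]): Int = {
    val bitMasks = computeBitMasks(needle)
    val successBitMask = computeSuccessBitMask(needle)
    var currentMask = 0L
    var i = 0
    while (i < haystack.length) {
      currentMask = ((currentMask << 1) | 1) & bitMasks(toUnsignedInt(haystack(i)))
      if ((currentMask & successBitMask) != 0)
        return i - needle.length + 1 // the match ends at position i
      i += 1
    }
    -1 // not found
  }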

As a result, all the logic boils down to just a few simple processor instructions. Unfortunately, there remains a completely useless bounds check on the bitMasks array access, which no JDK manages to eliminate (I looked at the assembly generated by several different JDKs).

Here is the full implementation in Scala.

Aho-Corasick


This is another popular algorithm, known since 1975. Its distinguishing (and sometimes very useful) feature is the ability to search for several needles at once, while still visiting every character of haystack exactly once (I think that is just great!). The idea behind it is an extension of KMP: a finite state machine built on a prefix tree (a trie constructed from the set of needles) with fallback links (compare with the one-dimensional array in KMP). Following these links, the internal state of the automaton moves between the nodes of the trie after each processed character, and some nodes signal a successful match for a particular needle. The precomputation phase is rather involved, but the search phase is surprisingly simple.
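To make the structure a little more tangible, here is a compact, simplified sketch of the idea (my own illustration, not the linked implementation): a trie is built over several needles, fallback links are filled in with a breadth-first pass, and the search then consumes one byte at a time:

  import scala.collection.mutable

  final class AhoCorasickSketch(needles: Seq[Array[Byte]]) {
    private final class Node {
      val children = mutable.Map.empty[Byte, Node]
      var fail: Node = null
      var matches: List[Int] = Nil // indices of the needles that end at this node
    }

    private val root = new Node

    // 1. Build the trie over all needles.
    for ((needle, idx) <- needles.zipWithIndex) {
      var node = root
      for (b <- needle) node = node.children.getOrElseUpdate(b, new Node)
      node.matches ::= idx
    }

    // 2. Fill in the fallback links with a BFS (KMP-style fallbacks generalized to a trie).
    root.fail = root
    private val queue = mutable.Queue.empty[Node]
    for ((_, child) <- root.children) { child.fail = root; queue += child }
    while (queue.nonEmpty) {
      val node = queue.dequeue()
      for ((b, child) <- node.children) {
        var f = node.fail
        while (f != root && !f.children.contains(b)) f = f.fail
        child.fail = f.children.get(b).filter(_ ne child).getOrElse(root)
        child.matches ++= child.fail.matches // inherit the outputs reachable via the fallback
        queue += child
      }
    }

    private var state: Node = root

    // 3. The search phase: feed one byte, get back the needles that end at this position.
    def process(value: Byte): List[Int] = {
      while (state != root && !state.children.contains(value)) state = state.fail
      state = state.children.getOrElse(value, root)
      state.matches
    }
  }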

Here is a link to a working implementation in Scala.



This is by no means a complete list of substring search algorithms. We also tried the Rabin-Karp algorithm and the Boyer-Moore algorithm. Of these two, Boyer-Moore showed comparable performance, but both are incompatible with streaming (they use random access to haystack by index), so I dropped them from this investigation.



Benchmarks


We will benchmark the three algorithms described above, and in addition look at the results for String.indexOf (Java) and indexOfSlice (Scala). To be honest, this is not an entirely fair comparison, because String.indexOf works on strings while all the other methods work on byte arrays, but that does not seem to invalidate the results. Moreover, I also included results for Bytes.indexOf from Guava (v28.1). This method works on byte arrays. And it was written at Google, and everything written there is super fast, right?

Writing benchmarks is always hard, because the input can vary in so many ways: not only in the lengths of needle and haystack, but also in their content (which can strongly affect some of the algorithms). In practice, it is always worth benchmarking on input that resembles the data from your real tasks as closely as possible (this is what we did in our project).

To keep this article short, I used only two types of input. One of them is meant to reflect a realistic case: a haystack of roughly 1.5 KB (containing human-readable text) and a 9-byte needle that does not occur in the haystack (this forces the algorithm to perform a full scan).

The other type of input is designed to trigger the worst-case behavior of a quadratic algorithm. It is much shorter than the data at the very beginning of this article: otherwise we would have to wait a whole minute, remember? The haystack has the form "AA...AAB" (the same length as the first input type), and the needle is a 64-byte array of the same shape (64 bytes specifically, so that the bitwise search algorithm can still handle it); the only match is at the very end of the haystack.
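For concreteness, the two inputs could be constructed roughly like this (sizes taken from the description above; the exact data used in the linked benchmark may differ):

  // A realistic case: ~1.5 KB of readable text, and a 9-byte needle that is never found.
  val regularHaystack: Array[Byte] = ("The quick brown fox jumps over the lazy dog. " * 34).getBytes
  val regularNeedle: Array[Byte] = "NOT-THERE".getBytes

  // The quadratic worst case: "AA...AAB" of the same length, and a 64-byte needle
  // of the same shape that matches only at the very end of the haystack.
  val worstCaseHaystack: Array[Byte] = ("A" * (regularHaystack.length - 1) + "B").getBytes
  val worstCaseNeedle: Array[Byte] = ("A" * 63 + "B").getBytes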

The benchmark, written with the JMH framework, can be found here. If you have other ideas about what to measure and how, you can clone the repository, change things, and share your observations in the comments.

At the suggestion of Vladimir Sitnikov, I added benchmark results for java.util.regex.Pattern, which uses the Boyer-Moore algorithm under the hood.


(Translator's note: by the way, Vladimir Sitnikov is a member of several program committees at the JUG Ru Group and gives interesting talks himself. For example, a video of his JPoint 2019 talk "Java slows down: CodeCache edition" is available at the link.)

Benchmark results


The results are given in microseconds per operation; lower is better:

# JMH version: 1.21
# VM version: JDK 13.0.1, OpenJDK 64-Bit Server VM, 13.0.1+9

Benchmark          (searchInput)  Mode  Cnt   Score   Error  Units
javaIndexOf              REGULAR  avgt    5   0.622 ± 0.002  us/op
shiftingBitMask          REGULAR  avgt    5   1.982 ± 0.017  us/op
regexPattern             REGULAR  avgt    5   2.184 ± 0.006  us/op
kmp                      REGULAR  avgt    5   2.635 ± 0.016  us/op
scalaIndexOfSlice        REGULAR  avgt    5   3.202 ± 0.009  us/op
guavaIndexOf             REGULAR  avgt    5   3.696 ± 0.095  us/op
ahoCorasic               REGULAR  avgt    5   7.063 ± 0.040  us/op
shiftingBitMask       WORST_CASE  avgt    5   1.986 ± 0.010  us/op
kmp                   WORST_CASE  avgt    5   5.120 ± 0.006  us/op
ahoCorasic            WORST_CASE  avgt    5   6.892 ± 0.025  us/op
scalaIndexOfSlice     WORST_CASE  avgt    5   8.765 ± 0.007  us/op
regexPattern          WORST_CASE  avgt    5  11.566 ± 0.086  us/op
javaIndexOf           WORST_CASE  avgt    5  23.029 ± 0.124  us/op
guavaIndexOf          WORST_CASE  avgt    5  52.927 ± 0.275  us/op



Here everything is as expected:

  • For regular input, javaIndexOf dominates: it uses high-performance intrinsics, so its constant factor is small;
  • For the worst case the picture flips: the quadratic complexity (O(nm)) hits javaIndexOf hard, while the linear shiftingBitMask (with its tiny constant factor) is barely affected and comes out on top;
  • guavaIndexOf uses a naive algorithm much like javaIndexOf, but without the benefit of the intrinsic; even on regular input it is almost 2x slower than shiftingBitMask, and in the worst case it is the slowest of all;
  • scalaIndexOfSlice, like knuthMorrisPratt, is linear in the worst case, but it is noticeably slower than our kmp implementation because it is generic and works over arbitrary collections and element types;
  • performance is not the strongest feature of ahoCorasic (or at least of this implementation; I admit I did not really try to micro-optimize it, since I added it only for its distinguishing feature, the ability to search for several needles at once, and that feels like a topic for a separate article);
  • the input data (and the length of needle) had practically no effect on the performance of shiftingBitMask and ahoCorasic.

Findings


Benchmarks can behave differently in different settings. Even though the results above look quite conclusive, you should always measure yourself, on data that reflects your real tasks.

Based on the data presented, I drew the following conclusions:

  • if you work with String values and are not worried about worst-case input, String.indexOf is hard to beat (and java.util.regex.Pattern gives reasonable protection against the worst case);
  • if you search over bytes and your needle is at most 64 bytes long, the Shifting Bit Mask algorithm is the best choice;
  • if the needle can be longer than 64 bytes, fall back to Knuth-Morris-Pratt;
  • in Scala, if you are not chasing average-case microseconds, the built-in indexOfSlice is a reasonable default that already protects you from the quadratic worst case;
  • if you need to search for several substrings at the same time, look at Aho-Corasick.

That's all! If you enjoy reading about algorithms, performance, and the like (and also about Scala, the JVM, and Java in general), follow the author of this article, Linas Medziunas (Medium, Twitter).

The GitHub repository with all the code from this article is here.



Translations of articles are published with the support of the JUG Ru Group and the JPoint conference.

