The Horror of Set.removeAll

We are used to the fact that the standard collections in the JDK are well made and behave intuitively. But is that really so? Yesterday Roman Elizarov posted on Twitter the news about an interesting new gotcha.

Hold on tight: in certain cases, Set.removeAll(list) can run in O(N²). How is that possible?



Roman discovered the problem while debugging a piece of code that, for some astonishing reason, worked too slowly. He ran it under the profiler built into IntelliJ IDEA and instantly saw on the flame graph that all the time was being spent inside the method AbstractSet.removeAll on a call to list.contains (link to the exact place in the code).

To understand what is happening, consider a slightly different example. Some time ago, Jon Skeet wrote a post "There's a hole in my abstraction, dear Liza, dear Liza", in which he examines the following interesting case.

Let's say we have a HashSet from which we are going to delete something. Suppose we remove the elements of another collection B from a collection A. It often happens that many elements of B are simply not present in A. We will look at the special case where A and B do not intersect at all, meaning nothing actually needs to be removed.

We will write a simple program in which the sizes of our collections are set from the command line. To guarantee that the sets do not intersect, one of them will be filled only with non-negative numbers, and the other only with negative numbers. Then we remove from the first collection all the elements of the second and measure the elapsed time using System.currentTimeMillis(). This is not the best way in the world to measure a time interval, but it lets us copy-paste the code straight from Habr into IDEA and saves us from setting up a new Maven project for JMH.

import java.util.*;
public class Test {
    public static void main(String[] args) {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);

        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();

        for (int i = 0; i < sourceSize; i++) {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++) {
            removals.add(-i);
        }

        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}

Very simple code that you could write in your sleep. Now let's run it with different parameters. First, try a set of 100 elements from which we need to remove 100:

$ java Test 100 100
Time taken: 1ms

So far, everything is fast enough, but what happens if you increase the numbers?

$ java Test 1000000 300000
Time taken: 38ms

$ java Test 300000 300000
Time taken: 178131ms

How do you like that: now we wait three minutes. Intuitively, it seems that processing the smaller source collection (300,000 items in the second case) should take less time than processing the larger one (a million items in the first case). Here it is exactly the opposite. Life did not prepare us for this, did it?

Now for the secret behind the trick.

In fact, it is described in plain text in the corresponding JavaDoc:
The implementation determines which is smaller: the set or the collection. It does this by calling size() on each of them. If the set has fewer elements, then it iterates over the set and at each step checks whether the current element is contained in the collection. If it is, the element is removed from the set via the iterator's remove method. If the collection turns out to have fewer elements, then the collection is traversed instead, and every matching element is removed from the set using the set's remove method.
In practice, this means that when source.removeAll(removals) is called:

  • If the collection removals is smaller than source, then remove is called on the HashSet, and that is pretty fast;
  • If the collection removals is larger than or equal in size to source, then removals.contains is called for every element of the set, which works very slowly for an ArrayList.

In JDK 8, roughly this piece of code is responsible for this behavior:

public boolean removeAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;

    if (size() > c.size()) {
        for (Iterator<?> i = c.iterator(); i.hasNext(); )
            modified |= remove(i.next());
    } else {
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}
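
The else branch is the one that hurts: for an ArrayList, contains is a linear scan over the elements, so every check costs up to removals.size() comparisons. Here is a simplified sketch of what such a contains does (an illustration of the idea, not the actual JDK source):

import java.util.List;
import java.util.Objects;

class LinearContains {
    // Roughly what ArrayList.contains does: scan the elements one by one.
    static <T> boolean containsLinear(List<T> list, T element) {
        for (T current : list) {               // up to list.size() comparisons per call
            if (Objects.equals(current, element)) {
                return true;
            }
        }
        return false;
    }
}

So in the 300,000 / 300,000 case the else branch performs on the order of 300,000 × 300,000 ≈ 9 × 10¹⁰ element comparisons, which explains the three minutes we measured, while in the 1,000,000 / 300,000 case the first branch runs and performs only 300,000 cheap HashSet removes.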

It seems that the JDK chose a rather poor way of handling this task, but since it is already described in the JavaDoc, there is no turning back.

Or is there?

People rushed to Roman Elizarov's tweet, and Zheka Kozlov found the corresponding bug, targeted at JDK 15, with Stuart Marks as the assignee.

And after that, Stuart Marks himself joined the thread and confirmed that he was indeed working on this problem:


So there is light at the end of the tunnel. If you upgrade to the latest versions of Java, of course.

Conclusions


The conclusions from this story are as follows: the problem exists and it is worth knowing about. It is already being fixed, but only for the happy users of fresh Java versions, so you should not count on the fix too much.
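
If you are stuck on an older JDK, there is a simple workaround you can apply at the call site (a sketch of the idea, assuming you control the calling code; this is not code from the JDK or from the original post): either give removeAll an argument with a fast contains by copying it into a HashSet, or bypass removeAll entirely and remove the elements one by one.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class RemoveAllWorkaround {
    // Option 1: hand removeAll an argument with O(1) contains.
    static <T> void removeAllViaHashSet(Set<T> source, Collection<T> removals) {
        source.removeAll(new HashSet<>(removals));
    }

    // Option 2: skip removeAll and call remove for each element;
    // HashSet.remove is O(1) on average, so this is O(removals.size()).
    static <T> void removeAllByIteration(Set<T> source, Collection<T> removals) {
        for (T element : removals) {
            source.remove(element);
        }
    }
}

Either variant avoids the quadratic branch, so the 300,000 / 300,000 case should again complete in milliseconds; the first variant pays for this with an extra HashSet copy of removals.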

In general, when you try to store a lot of data in collections in Java, you may experience various unintuitive problems that you need to be prepared for.



So, kids, today we learned something new!

To continue the banquet, I recommend checking out some talks by Roman Elizarov, who helped bring this bug to light. Of course, these talks have nothing to do with the bug itself, but they are packed with all sorts of hardcore material:


JPoint, 2020.
