About GDI leaks and the importance of luck


In May 2019, I was asked to take a look at a potentially serious Chrome bug. At first I misdiagnosed it as unimportant, which cost us two weeks. Later, when I returned to the investigation, it had become the number one cause of browser-process crashes in the Chrome beta channel. Oops.

On June 6, the same day that I realized I had misinterpreted the crash data, the bug was marked ReleaseBlock-Stable. This meant that we could not release a new version of Chrome to most users until we figured out what was going on.

The crash occurred because we were running out of GDI (Graphics Device Interface) objects, but we did not know which type of GDI object was involved, the diagnostic data gave no clues about where the problem was, and we could not reproduce it.

Many people on our team worked hard on this bug on June 6 and 7, testing their theories, but made no progress. On June 8 I went to check my email, and Chrome immediately crashed. It was the same crash.

How ironic. While I was looking through changes and examining crash reports, trying to figure out what could cause the Chrome browser process to leak GDI objects, the number of GDI objects in my own browser had been climbing relentlessly, and by the morning of June 8 it had passed the magic number of 10,000. At that point an attempt to allocate a GDI object failed and we intentionally crashed the browser. It was an incredible stroke of luck.

If you can reproduce a bug, then inevitably you can fix it. I just had to figure out how I had triggered this bug, and then we could eliminate it.

First, a short history of the issue



In most places in the Chromium code, when we try to allocate a GDI object, we first check whether the allocation succeeded. If it did not, we record some information on the stack and intentionally crash, as can be seen in this source code. The crash is deliberate: if we cannot allocate GDI objects, then we cannot render to the screen, and it is better to report the problem (if crash reports are enabled) and restart the process than to display a blank UI. By default a process can create up to 10,000 GDI objects, and typically only a few hundred are used. So if we have hit this limit, something has gone badly wrong.

When we get a crash report saying that a GDI object allocation failed, we have a call stack and all sorts of other useful information. Great! The problem is that such crash dumps do not necessarily point to the bug, because the code that leaks GDI objects and the code that reports the failure may not be the same code.

That is, roughly speaking, we have two types of code:

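// Checks the allocation, reports a failure, and frees the object.
void GoodCode() {
  auto x = AllocateGDIObject();
  if (!x)
    CollectGDIUsageAndDie();
  UseGDIObject(x);
  FreeGDIObject(x);
}

// Never checks the allocation and never frees the object: this is the leak.
void BadCode() {
  auto x = AllocateGDIObject();
  UseGDIObject(x);
}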

The good code notices that the allocation failed and reports it, while the bad code ignores failures and leaks objects, thereby setting up the good code to take the blame.
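To make the blame shift concrete, here is a minimal, portable sketch that replaces the real GDI handle table with a hypothetical FakeGdiPool capped at a chosen limit. None of the helper names here (FakeGdiPool, WhoGetsBlamed) come from Chromium or Windows; they are assumptions made purely for illustration, and the good code returns false at the point where the real code would crash on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the per-process GDI handle table.
class FakeGdiPool {
 public:
  explicit FakeGdiPool(std::size_t limit) : limit_(limit) {}

  // Returns a nonzero handle, or 0 once the quota is exhausted.
  int Allocate() {
    if (live_ >= limit_) return 0;
    ++live_;
    return next_handle_++;
  }

  void Free(int handle) {
    (void)handle;  // Only the live count matters in this sketch.
    assert(live_ > 0);
    --live_;
  }

  std::size_t live() const { return live_; }

 private:
  std::size_t limit_;
  std::size_t live_ = 0;
  int next_handle_ = 1;
};

// Checks the allocation and frees the object. Returns false where the
// real code would collect diagnostics and crash deliberately.
bool GoodCode(FakeGdiPool& pool) {
  const int x = pool.Allocate();
  if (x == 0) return false;
  pool.Free(x);
  return true;
}

// Ignores the result and never frees it: every successful call leaks one object.
void BadCode(FakeGdiPool& pool) {
  (void)pool.Allocate();
}

// Runs the leaky function `leaks` times, then reports which function
// ends up signalling the failure.
std::string WhoGetsBlamed(std::size_t limit, std::size_t leaks) {
  FakeGdiPool pool(limit);
  for (std::size_t i = 0; i < leaks; ++i) BadCode(pool);
  return GoodCode(pool) ? "nobody" : "GoodCode";
}
```

Calling WhoGetsBlamed(10000, 10000) reports "GoodCode": the function that behaves correctly is the one that shows up in the crash report, while the leaking function never appears in any stack.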

Chromium contains several million lines of code. We did not know which function had the bug, and did not even know which type of GDI object was leaking. One of my colleagues added code that walked the Process Environment Block before crashing to get the number of GDI objects of each type, but for all the enumerated types (device contexts, regions, bitmaps, palettes, brushes, pens and unknown) the count never exceeded one hundred. Strange.

It turned out that the objects we allocate directly are in this table, but objects created by the kernel on our behalf are not; they live somewhere in the Windows object manager. This meant that GDIView was just as blind to this problem as we were (besides, GDIView is only useful when reproducing the crash locally). We were leaking cursors, and cursors are USER32 objects with GDI objects attached to them; those GDI objects are allocated by the kernel, so we could not see what was happening.

Misinterpretation


Our CollectGDIUsageAndDie function has a wonderfully vivid name, as I think you will agree. Very expressive.

The problem is that it does too much. CollectGDIUsageAndDie checked for about a dozen different types of GDI allocation failure, and because of inlining they all ended up with the same crash signature: they all crashed in the main function and were lumped together. So one of my colleagues wisely made a change that split the different checks into separate, non-inlined functions. Thanks to this, we could now tell at a glance which check had failed.

Alas, this meant that when we started getting crash reports from CrashIfExcessiveHandles, I confidently said: "This is not a new cause of crashes, it is just a change in signature."

But I was wrong. It was both a new cause of crashes and a signature change. Oops. Sloppy analysis, Dawson. No cookie for you.

Back to our story


At this point I knew that something I had done on June 7 had consumed almost 10,000 GDI objects in a single day. If I could work out what it was, I would solve the puzzle.


Windows Task Manager has an optional GDI objects column that you can use to find leaks. On June 7 I was working from home, connected to my work machine, and I had this column turned on on the work machine because I was running tests and trying to reproduce the crash. Meanwhile, it was the browser on my home machine that was leaking GDI objects.

The main thing I used the browser at home for was connecting to my work machine through the Chrome Remote Desktop (CRD) application. So I turned on the GDI objects column on the home machine and started experimenting. I soon had results.

In fact, the bug's timeline shows that it took only 35 minutes to get from "I hit the crash" (14:00) to "it is somehow related to CRD" and then to "it is about cursors". Have I mentioned how much easier it is to investigate bugs when you can reproduce them locally?

It turned out that every time a CRD application (or any Chrome application?) changed cursors, six GDI objects leaked. If you move the mouse over the right part of the screen while using Chrome Remote Desktop, you can leak hundreds of GDI objects per minute and thousands per hour.
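The arithmetic behind those rates is easy to check. The sketch below computes how many cursor changes it takes to reach the per-process limit; the function name and the baseline figure used in the note that follows are assumptions for illustration, while the limit of 10,000 and the six objects per cursor change come from the text.

```cpp
#include <cassert>
#include <cstdint>

// Number of cursor changes needed to reach the GDI object limit, given
// the objects already in use and the objects leaked per cursor change.
// Precondition: leaked_per_change > 0.
std::uint64_t CursorChangesUntilLimit(std::uint64_t limit,
                                      std::uint64_t baseline,
                                      std::uint64_t leaked_per_change) {
  assert(leaked_per_change > 0);
  if (baseline >= limit) return 0;  // Already at or past the limit.
  const std::uint64_t headroom = limit - baseline;
  // Round up: a partial step still pushes the count to the limit.
  return (headroom + leaked_per_change - 1) / leaked_per_change;
}
```

With an assumed baseline of 400 objects, CursorChangesUntilLimit(10000, 400, 6) gives 1,600 cursor changes, which is well within reach of a day spent moving the mouse around a remote desktop session.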

After a month without any progress, the problem suddenly went from intractable to a simple fix. I quickly wrote a draft fix, and then one of my colleagues (I did not do the work on this bug) created a real fix. It was uploaded on June 10 at 11:16 and landed at 13:00. After a few merges, the bug was gone.

Is that all?


We fixed the bug, which is great, but it is much more important to make sure that such bugs never happen again. Obviously the right thing is to use C++ objects (RAII) for resource management, but in this case the bug was in the WebCursor class.
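For readers unfamiliar with the idea, here is a minimal sketch of RAII applied to a handle that is replaced repeatedly, as a cursor would be. It is not the WebCursor code or the actual fix: the fake_api namespace, ScopedHandle, and both helper functions are hypothetical and stand in for any create/destroy pair, so the example stays portable.

```cpp
#include <cassert>
#include <utility>

// Hypothetical create/destroy API with a live-object counter.
namespace fake_api {
using Handle = int;
constexpr Handle kInvalid = 0;
inline int& LiveCount() { static int count = 0; return count; }
inline Handle Create() { static Handle next = 1; ++LiveCount(); return next++; }
inline void Destroy(Handle h) { if (h != kInvalid) --LiveCount(); }
}  // namespace fake_api

// Move-only owner: the held handle is destroyed on reset and on scope exit.
class ScopedHandle {
 public:
  ScopedHandle() = default;
  explicit ScopedHandle(fake_api::Handle h) : handle_(h) {}
  ScopedHandle(const ScopedHandle&) = delete;
  ScopedHandle& operator=(const ScopedHandle&) = delete;
  ScopedHandle(ScopedHandle&& other) noexcept
      : handle_(std::exchange(other.handle_, fake_api::kInvalid)) {}
  ScopedHandle& operator=(ScopedHandle&& other) noexcept {
    if (this != &other) reset(std::exchange(other.handle_, fake_api::kInvalid));
    return *this;
  }
  ~ScopedHandle() { reset(); }

  // Destroys the current handle (if any) before taking ownership of `h`.
  void reset(fake_api::Handle h = fake_api::kInvalid) {
    if (handle_ != fake_api::kInvalid) fake_api::Destroy(handle_);
    handle_ = h;
  }
  fake_api::Handle get() const { return handle_; }

 private:
  fake_api::Handle handle_ = fake_api::kInvalid;
};

// Raw handle: each assignment overwrites the previous handle without
// destroying it. Returns how many objects are still live afterwards.
int LeakedAfterRawCursorChanges(int changes) {
  const int before = fake_api::LiveCount();
  fake_api::Handle cursor = fake_api::kInvalid;
  for (int i = 0; i < changes; ++i) cursor = fake_api::Create();
  fake_api::Destroy(cursor);  // Only the last handle is ever released.
  return fake_api::LiveCount() - before;
}

// Scoped handle: reset() releases the old handle on every change.
int LeakedAfterScopedCursorChanges(int changes) {
  const int before = fake_api::LiveCount();
  {
    ScopedHandle cursor;
    for (int i = 0; i < changes; ++i) cursor.reset(fake_api::Create());
  }
  return fake_api::LiveCount() - before;
}
```

In this sketch 1,000 raw cursor changes leave 999 objects behind, while the scoped version leaves none. The wrapper only helps when every handle actually passes through it, which is why a bug inside the owning class itself can still leak.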

When it comes to memory leaks, there is a robust set of tools. Microsoft has heap snapshots, and Chromium has heap profiling in user builds and a leak sanitizer on test machines. But GDI object leaks seem to have been neglected. The Process Environment Block contains incomplete information, some GDI objects can be enumerated only in kernel mode, and there is no single point of allocation and deallocation for these objects that would make tracing easier. This was not the first GDI object leak I have had to deal with, and it will not be the last, because there is no reliable way to track them. Here are my recommendations for future Windows releases:

  • Make it trivial to get the counts of all types of GDI objects, without having to read the PEB in undocumented ways (and without ignoring cursors)
  • Create a supported way to intercept and trace all GDI object creation and destruction for reliable tracking, including objects that are created indirectly
  • Document all of this

That's it. Such tracking should not even be hard to implement, because GDI objects are necessarily limited in a way that memory is not. It would be great if using these weird but unavoidable GDI objects were safer. Pretty please.

You can read the discussion on Reddit here. The Twitter thread starts here.
