“A typical mistake is to mindlessly benchmark everything in a row”: an interview with Andrey Akinshin about benchmarking



Last year, Andrey Akinshin (Dreamwalker) published the book “Pro .NET Benchmarking”: a detailed work on benchmarking, useful both for .NET developers and for IT specialists in other areas.

A couple of months before its release we held the DotNext 2019 Piter conference, where we interviewed Andrey live on the broadcast about the book and about benchmarking in general. It might seem that the interview has since become outdated: in it the book is discussed in the future tense, and half a year has passed. But over these six months humanity has not changed the way the 99th percentile is computed, so for anyone who may need benchmarking, Andrey's answers still hold plenty of relevant and interesting material.

At the upcoming DotNext he will speak on the topic “Let's talk about performance analysis” - that is, not about writing benchmarks, but about analyzing the values they collect. Right now Andrey is going through hundreds of papers on mathematical statistics in order to describe the methods best suited for performance analysis in real life. The book also devotes attention to such analysis, and in the interview Andrey explained its importance. So, in anticipation of the new talk, we have opened the interview video to everyone, and we have prepared a text transcript specifically for Habr: now it can be not only watched, but also read.


The main thing about the book


- Once again, we welcome the viewers of the DotNext broadcast. This time Andrey Akinshin is with us.

- Hello everyone!

- The main news about you right now is the announced book, which is due out by September...

- If everything goes well, it will be out in late June.

Here you have to understand how deadlines work. There are hard ones that cannot be missed under any circumstances. On Amazon the book currently has a release date of around August 23rd. If that date slips, all sorts of penalties kick in and Amazon will be unhappy. But if the book comes out earlier - well, great.

So I really hope that if no problems come up anywhere, it will be readable as early as June. In any case, the end of August is the deadline. You work in IT too, so you understand how these things go.

- Most of the audience has probably already heard about the book. But for those who haven't, let's start with a story about it.

- The book is called "Pro .NET Benchmarking". It is part of the Apress series - the same one in which Konrad Kokosa's book "Pro .NET Memory Management" was recently released. Sasha Goldstein's "Pro .NET Performance" was also published there - you have probably heard of it, it gets raffled off at DotNext from time to time. My book comes out in the same series. It is about how to do benchmarking from start to finish.

I tried to cover a wide variety of aspects, starting with statistics - there is a separate chapter about it. And it is not the statistics we were taught at university: there is not a single example about "balls placed into boxes". The focus is on what actually comes in handy during benchmarking: metrics, standard deviation, standard error, various confidence intervals and how to interpret them. That is, the question is the following: if, say, BenchmarkDotNet gave you a million different numbers, what do you do with them? The book gives practical recommendations on how to interpret this data and draw conclusions.
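To make that concrete, here is a minimal sketch (mine, not an excerpt from the book) of the basic statistics such a chapter deals with, computed over a handful of hypothetical measurements:

```csharp
using System;
using System.Linq;

class BasicStats
{
    static void Main()
    {
        // Hypothetical benchmark measurements in milliseconds (made-up numbers)
        double[] ms = { 12.1, 12.3, 12.2, 12.4, 48.9, 12.2, 12.3 };

        double mean = ms.Average();
        // Sample standard deviation and standard error
        double stdDev = Math.Sqrt(ms.Sum(x => (x - mean) * (x - mean)) / (ms.Length - 1));
        double stdErr = stdDev / Math.Sqrt(ms.Length);

        // 99th percentile via the simple nearest-rank definition
        double[] sorted = ms.OrderBy(x => x).ToArray();
        double p99 = sorted[(int)Math.Ceiling(0.99 * sorted.Length) - 1];

        Console.WriteLine($"Mean={mean:F1}  StdDev={stdDev:F1}  StdErr={stdErr:F1}  P99={p99:F1}");
        // The single outlier (48.9) drags the mean up - exactly why the
        // "Average" column alone can mislead.
    }
}
```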

There is a chapter, for example, about CPU-bound benchmarks and about memory-bound benchmarks. There are many case studies with examples of how you can write a benchmark of 3-4 lines and still shoot yourself in the foot due to some microarchitectural effect of modern Intel processors.
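A classic example of this genre of pitfall (a standard demonstration, not necessarily one of the book's case studies): the benchmark body is identical, but whether the input array is sorted decides how well the CPU's branch predictor does, and therefore how fast the "same" code runs.

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class BranchPrediction
{
    private int[] sorted, shuffled;

    [GlobalSetup]
    public void Setup()
    {
        var rnd = new Random(42);
        shuffled = Enumerable.Range(0, 100_000).Select(_ => rnd.Next(256)).ToArray();
        sorted = shuffled.OrderBy(x => x).ToArray();
    }

    [Benchmark(Baseline = true)]
    public int Sorted() => Sum(sorted);

    [Benchmark]
    public int Shuffled() => Sum(shuffled);

    private static int Sum(int[] data)
    {
        int sum = 0;
        foreach (int x in data)
            if (x >= 128) sum += x;   // the data-dependent branch
        return sum;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<BranchPrediction>();
}
```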

And there is a chapter about performance analysis and performance testing. Benchmarking is fine as a one-off experiment, but many people want to put benchmarks on CI, run them continuously on some server (ideally the same one), and collect data in order, for example, to catch performance degradations. So there is a chapter about how to work with such data and how to write different kinds of performance tests (and there are many different kinds). For example, what the difference is between cold-start and warm-start tests, how to process graphs, how to process whole arrays of data.
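As a rough illustration of the cold/warm distinction (a sketch under assumptions, not the book's methodology): the very first invocation pays for JIT compilation and cold caches, and whether you keep or discard that iteration defines which kind of test you are writing.

```csharp
using System;
using System.Diagnostics;

class ColdVsWarm
{
    static void Main()
    {
        for (int i = 0; i < 5; i++)
        {
            var sw = Stopwatch.StartNew();
            Work();
            sw.Stop();
            // Iteration 0 is the "cold" one: a cold-start test keeps it,
            // a warm-start test discards it as warm-up.
            Console.WriteLine($"Iteration {i}: {sw.Elapsed.TotalMilliseconds:F3} ms");
        }
    }

    static double Work()
    {
        double acc = 0;
        for (int i = 1; i <= 1_000_000; i++) acc += Math.Sqrt(i);
        return acc;
    }
}
```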

At one of the past DotNext conferences I gave a talk about performance analysis, where I spoke about different methods of searching for performance anomalies. Degradation is not the only problem that can arise. For example, there are multimodal distributions, when a benchmark runs for either one second or ten. In a large product (especially a multi-threaded one) such cases are bound to exist, and they usually hide a problem. Even if we are talking not about performance tests on dedicated machines but about ordinary tests that jitter for who knows what reason and show high variance - if you collect and analyze all this data, you can find a lot of tests with such problems.

In general, benchmarking offers a very wide range of very interesting problems, and I have carefully sorted them onto their shelves. But I tried to make it as practical as possible, so that it is not just theory, but knowledge you can take and apply to your product in production.

Benchmarking subtleties


- I recall a phrase I read somewhere: for any benchmark result on the Internet, you can find a wrong interpretation of that result. How much do you agree with it?

- Absolutely agree. There is a lot of talk about valid and invalid benchmarks, but if you look from a bird's-eye view: if you somehow measured something, collected at least some performance metrics and printed them to a file or console, then in a certain sense this is a valid benchmark - it measures something and outputs some numbers. The main question is how you interpret those numbers and what you do next.

One of the first mistakes people make when diving into benchmarking is setting out to benchmark everything without asking themselves "why". Well, we benchmarked it, we measured it - and then what? It is very important to determine the purpose for which we are benchmarking.

For example, if we want to build a stable workload with which we can evaluate the performance of certain scenarios and detect performance degradations - that is one case. Another case: we have two libraries that do the same thing, we care about performance, and we want to choose the faster one - how do we compare them?
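For the second case, here is a minimal sketch of how such a comparison is usually set up in BenchmarkDotNet: mark one implementation as the baseline, and the summary gains a Ratio column. The two BCL approaches below stand in for the "two libraries"; any pair of same-purpose implementations works the same way.

```csharp
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class JoinComparison
{
    private readonly string[] words =
        Enumerable.Range(0, 1000).Select(i => "word" + i).ToArray();

    [Benchmark(Baseline = true)]
    public string WithStringJoin() => string.Join(",", words);

    [Benchmark]
    public string WithStringBuilder()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < words.Length; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(words[i]);
        }
        return sb.ToString();
    }
}

public class Program
{
    // The summary will contain a Ratio column relative to the baseline
    public static void Main() => BenchmarkRunner.Run<JoinComparison>();
}
```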

From the point of view of interpretation, any interpretation that leads to the right business decision can be considered good. It is not necessarily correct, but if it got you to the goal, that is fine.

There is even a special kind of exercise in my book about this. Say we have two algorithms, and the exercise goes like this: first write a benchmark that shows the first algorithm is 10 times faster than the second, and then a benchmark that shows the opposite. You can play with the input data, with the environment, swap Mono for .NET Core, or run on Linux instead of Windows - there are a million knobs to turn. And the conclusion is this: if you set out to show that one program runs faster than another, most likely there is a way to do it.
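A sketch of how such an exercise might look (my illustration, not the book's actual exercise): the same two algorithms, and the choice of the [Params] input size alone decides which one "wins". On tiny inputs a linear scan is usually competitive with or faster than binary search; on large inputs it loses by orders of magnitude.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FlippableBenchmark
{
    [Params(4, 1_000_000)]   // pick the input size that "proves" your point
    public int N;

    private int[] data;
    private int key;

    [GlobalSetup]
    public void Setup()
    {
        data = new int[N];
        for (int i = 0; i < N; i++) data[i] = i * 2;  // sorted data
        key = data[N - 1];                            // worst case for the scan
    }

    [Benchmark]
    public int Linear()
    {
        for (int i = 0; i < data.Length; i++)
            if (data[i] == key) return i;
        return -1;
    }

    [Benchmark]
    public int Binary() => Array.BinarySearch(data, key);

    public static void Main() => BenchmarkRunner.Run<FlippableBenchmark>();
}
```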

So, returning to your question, it is very difficult to draw the line between valid and invalid benchmarks and to define "invalid" (that is, what must hold for us to recognize a benchmark as bad). The same goes for the line between "correct" and "incorrect" interpretations: you may not fully understand what is going on inside the benchmark, you may not be able to explain all the internal processes (which is not great - it would be better to do so, but you can skip this part if you are very busy), and yet still understand what the overall picture looks like. And if you managed to do it right (again, the question is what "right" means) and arrived at the right business decision, then you did well.

- If you just sit down and read your book thoughtfully, will you start making the "right" decisions? Or are there many things outside the scope of the book that also matter?

- Benchmarking is a subject that, in my opinion, can only be mastered in practice. Yes, in the book I give a lot of methodology and recommendations, and I describe pitfalls. Benchmarking has plenty of problems that, if you don't know about them, you would never guess at in real life. But knowing about them gives absolutely no guarantee that your benchmarks will be correct. It is just a minimal toolkit that helps you orient yourself in the field.

You can write decent benchmarks and performance tests only if you work in this area systematically. The neural network in your head gets trained to read performance reports: you look at the distributions obtained during measurements, at the summary tables, for example from BenchmarkDotNet (and not only at the "Mean" column, but also at the standard deviation), at standard errors, at additional characteristics - the minimum, the maximum, the quantiles, the 99th percentile.
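For reference, a sketch of widening a BenchmarkDotNet summary beyond the Mean column; the column API shown is per recent versions of the library, so treat it as an assumption and check the docs for your version.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;

public class PercentilesConfig : ManualConfig
{
    public PercentilesConfig()
    {
        // Add min/max and percentile columns next to the default Mean/StdDev
        AddColumn(StatisticColumn.Min,
                  StatisticColumn.Max,
                  StatisticColumn.P95,
                  StatisticColumn.P100);
    }
}

[Config(typeof(PercentilesConfig))]
public class MyBenchmarks
{
    [Benchmark]
    public double Foo() => Math.Sqrt(42);
}
```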

When you have looked at all of this a great many times, a certain minimal baggage accumulates that lets you carry out performance investigations much faster and see things that people without experience (even if they have read my book and a million blog posts) will not see, simply because they lack the experience. They will miss problems, or will not be able to interpret the data instantly and correctly.

- At this DotNext, in an interview with Dmitry Nesteruk (mezastel), we discussed how IT books usually become outdated quickly, but if you write about design patterns, things don't change every year. What about benchmarking: might this book also stay current for a very long time, or would you have written it differently two years ago?

- It is hard to give a one-word answer. There is a certain foundation, the fundamentals, that does not become obsolete. Take the same statistics: the way the 99th percentile was computed two years ago is the way it is computed now, and I suspect nothing will change in another two years.

By the way, let me note here: I believe benchmarking should be a separate discipline. For some reason, historically, nobody has paid due attention to systematic measurements. What is there to it? You start a timer, stop the timer, look at how much time has passed. Yet the book, by preliminary estimates, came out to more than 600 pages, and everyone asks me: "What could possibly fill 600 pages?"

And I believe it should be a discipline, a separate area of computer science. It is a language-agnostic direction, where the general techniques remain valid and do not change: this is what humanity as a whole has arrived at. It applies to any runtime, language, or ecosystem. But that is only one part of the answer.

The other part is tied to the specifics of the runtime, to the specifics of .NET. Right now (and there is a lot about this in the book) we have the .NET Framework, .NET Core, and Mono. Performance measurements can differ across runtimes, or even across two adjacent versions of the same runtime. If you take .NET Core 2.2 and the upcoming .NET Core 3.0, some workloads differ like night and day. The optimizations are so good that the simplest scenarios get 10 or 50 times faster.

Of course, if you move to the new version of Core, the whole program won't suddenly run 50 times faster, but individual small pieces - exactly the kind that most often end up in synthetic benchmarks - may well speed up that much.

And what does change, changes mostly across these versions, as new optimizations appear. For example, tiered jitting will appear in .NET Core 3.0. That is, the runtime can first quickly generate a simple (and not very efficient) native implementation for a method, and then, when it notices that you call this method many, many times, spend a bit more time in the background and regenerate more performant code. It is roughly what Java's HotSpot has had for many years; in the .NET world it will ship enabled by default in the release in the second half of this year (editor's note: the interview was recorded in 2019).
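For context, a hedged sketch of how tiered compilation can be pinned down in a benchmark. COMPlus_TieredCompilation is the documented .NET Core switch; the job API shown is per recent BenchmarkDotNet versions (an assumption - verify against your version's docs).

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class NoTieringConfig : ManualConfig
{
    public NoTieringConfig()
    {
        // Disable tiered JIT entirely: every method gets the optimizing JIT
        // on first compilation, at the cost of slower startup.
        AddJob(Job.Default.WithEnvironmentVariable("COMPlus_TieredCompilation", "0"));
    }
}
```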

And for BenchmarkDotNet this is a challenge: to handle such cases properly. In the Java world, Alexey Shipilev learned to handle it in his JMH long ago, but we still have to. On this subject I am also talking with the folks who work on the runtime. That is, I will need special knobs, APIs from them, in order to deal with all of it correctly.

These things change. Soon all the runtimes will be unified and there will be one .NET 5. Although I suspect it will be renamed to something else, that this is an interim name. Maybe it won't be 5 but 6, because there already was a .NET Core 5.0.

- Well, as we know from Windows, skipping a version number is not a problem for Microsoft.

- Yes. Back in the DNX days there were target frameworks with a fifth .NET Core; "5.0" has already been used in lots of places, and there are plenty of old posts. So I don't know - they are now going to make the fifth version right after the third, but I would have skipped not only four but five as well and gone straight to six. And considering that they now want, as I understand it, the odd versions to be stable LTS releases and the even ones less stable, it could have gone straight to seven.

Well, that is their headache. What matters is that you have to keep an eye on how the runtimes evolve, and it is precisely this .NET-specific part that becomes obsolete - not instantly, but gradually.

I am already thinking about a second edition of the book with all of this updated. Intel processors do not stand still either: they evolve, new optimizations appear that also have to be handled in clever ways. Skylake brought a lot of unpleasant surprises; in BenchmarkDotNet a lot of work went into getting around its tricky optimizations and obtaining stable results.

Interaction with BenchmarkDotNet and Rider


- Clearly, working on the BenchmarkDotNet library has given you a lot of experience, so it is logical that you were the one to write a book about benchmarking. Which raises the question: is the book tied to BenchmarkDotNet in some way, or is it tool-agnostic?

- I tried to make it tool-agnostic. There is one small section about BenchmarkDotNet, and I also use it in my case studies: when I need to show some small microarchitectural effect, I say "here we will write a benchmark using BenchmarkDotNet". Simply so as not to drag a million lines of scaffolding into every benchmark in the book, and not to spell out the warm-up logic separately each time. We already have a ready-made solution that does the whole benchmarking routine for us, so we can stop discussing methodology (we covered it at the beginning) and talk about effects at the CPU level.

Those are the two use cases, and in the rest I tried to abstract away from BenchmarkDotNet as much as possible, so that the book would be useful not only to .NET developers but also, for example, to Java developers. All the common mechanics port easily to any other platform; .NET and BenchmarkDotNet are just the tool used to illustrate the concepts.

- And was there influence in the other direction? Did you, while working on the book, realize things that then had to be done in BenchmarkDotNet?

- Yes, I wrote all sorts of small features specifically so they would be in the book. For example, the cool detection of multimodal distributions that I already mentioned.

Properly speaking, when you analyze benchmark results you should always look at the distribution: open the picture and study what happened there. But in practice nobody does. Because if I run, say, 50 benchmarks against some code base, and I change that code base 10 times a day, rerunning the full set each time, then of course I am not going to look at 50 plots every time - I am too lazy. And, by and large, it makes no sense: this is not a job for a human, it is a job for tooling.

BenchmarkDotNet has a cool algorithm that automatically detects that a distribution is multimodal and warns the user: "Dude! Look at the chart! Things are bad here! An average value showed up in this column - don't look at it! It doesn't correspond to anything, look at the chart!"

And this is printed only in the cases where it really matters, so as not to distract a person with plots in vain. The approach there is based on the so-called m-values of Brendan Gregg, a leading performance engineer at Netflix.

But his approach was not enough for me, because it uses specially constructed histograms of the distribution. That is, a histogram is fed to the input, the m-value is computed from it, and from that it is magically determined whether the distribution is multimodal or not. But how to build the histograms, Brendan Gregg did not write! I had to invent a bicycle of my own, which worked surprisingly well. This algorithm is described in the book.
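For the curious, here is a sketch of the m-value idea itself (my reconstruction from Brendan Gregg's public description, not the exact BenchmarkDotNet code): on a zero-padded histogram, sum the absolute differences between neighbouring buckets and normalize by the tallest bucket. A clean unimodal histogram yields a value around 2, a clean bimodal one around 4, and the tool warns once the value crosses a threshold. The hard part Andrey mentions - building the histogram itself - is not shown here.

```csharp
using System;
using System.Linq;

static class MValue
{
    public static double Calculate(int[] buckets)
    {
        // Zero-pad so the climb from/to zero at the edges is counted
        int[] h = new int[buckets.Length + 2];
        buckets.CopyTo(h, 1);

        int totalVariation = 0;
        for (int i = 1; i < h.Length; i++)
            totalVariation += Math.Abs(h[i] - h[i - 1]);

        return (double)totalVariation / h.Max();
    }

    static void Main()
    {
        Console.WriteLine(Calculate(new[] { 1, 5, 9, 5, 1 }));        // 2: unimodal
        Console.WriteLine(Calculate(new[] { 1, 9, 1, 0, 1, 9, 1 }));  // 4: bimodal
    }
}
```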

There were quite a few stories like that. Writing the book itself took me two and a half years. In total I had been collecting content for about five years, and two and a half years passed from the moment I signed the contract with the publisher. Over those two and a half years, thanks to the book, the library got pumped up in many respects; a lot of things appeared in it.

- It is hard to imagine, but besides the book and BenchmarkDotNet, your life also includes work on Rider - and there, presumably, you benchmark too. Can you talk about that? You had photos on Twitter of a MacBook in the freezer and next to a heater, checking how that affects performance - was that for work, for the book, or both at once?

- Rather, all of it together. In Rider we use BenchmarkDotNet for individual performance investigations. That is, when we need to figure out how best to write code in some performance-critical piece, or to study how the behavior of a piece of code differs under Mono on Linux versus the .NET Framework on Windows. We take BenchmarkDotNet, design an experiment, collect the results, draw conclusions, and make business decisions about how to write the code so that it runs fast everywhere. And then that benchmark is thrown away.

That is, we don't systematically keep BenchmarkDotNet benchmarks running on CI. But we do have many other areas of performance work. For example, an internal tool that collects numbers from all the tests and looks for various performance anomalies in them - the same multimodal distributions, tests with a large standard deviation - and gathers it all into a single dashboard.

Another approach, one we have been working toward for a very long time but have not finished, is reliable performance tests. That is, we want an approach in which it is impossible to merge a performance degradation into the master branch.

Classic benchmarks are not well suited here, because they are very resource-intensive. You have to do many iterations to get decent statistics and work with them. And when you have hundreds or thousands of performance tests, and you run each test the prescribed 30 times, for every branch of every person - no hardware is enough.

So, on the one hand, we want to do as few iterations as possible (ideally one, but from a single iteration it is very hard to tell whether there is a degradation). The worst thing that can happen is a false positive: you did nothing wrong, but the system reports a performance degradation and won't let you merge your branch into master. If that happens, people will throw stones at me and nobody will use the system.

So, roughly: if after one iteration there is a suspicion of a perf degradation but no 100% certainty, it is worth doing a second iteration. After the second, you may decide that everything is fine and something just randomly spiked. Or you may say that now we are sure of the degradation and block the merge. Or you may say: "No, two iterations are still not enough, we need a third." And so on.

And on a small number of iterations (one, two, three), the standard tests don't work at all. My favorite Mann-Whitney test starts working decently only when you have at least five iterations. But we only get to the fifth when things are already thoroughly bad. Accordingly, we have to develop a set of heuristics that never gives a false positive, yet detects degradations, when they exist, with the minimum possible number of iterations. This turns out to be quite a difficult blend of engineering and mathematical formulas. We are not done yet, but we are heading there.
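A sketch of that sequential scheme in code (the general idea only, not JetBrains' actual system): run one iteration at a time and after each one decide "pass", "fail", or "need more data". The decision function here is a crude placeholder; in reality it is the hard part - a set of heuristics combined with rank-based tests like Mann-Whitney once enough samples exist.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

enum Verdict { Pass, Fail, NeedMoreData }

static class SequentialPerfTest
{
    const int MaxIterations = 5;

    // Returns true if the branch may be merged (no proven degradation).
    public static bool Run(Func<double> measureOnce, IReadOnlyList<double> baseline)
    {
        var current = new List<double>();
        for (int i = 0; i < MaxIterations; i++)
        {
            current.Add(measureOnce());
            switch (Decide(current, baseline))
            {
                case Verdict.Pass: return true;   // confident: stop early, save hardware
                case Verdict.Fail: return false;  // confident: block the merge
            }
            // NeedMoreData: run one more iteration
        }
        return true; // never block without confidence: false positives are the worst outcome
    }

    // Crude placeholder heuristic, for illustration only.
    static Verdict Decide(IReadOnlyList<double> current, IReadOnlyList<double> baseline)
    {
        double worstBaseline = baseline.Max();
        double bestCurrent = current.Min();

        if (bestCurrent <= worstBaseline)
            return Verdict.Pass;                   // at least one clean run
        if (current.Count >= 3 && bestCurrent > 1.5 * worstBaseline)
            return Verdict.Fail;                   // consistently much slower
        return Verdict.NeedMoreData;
    }
}
```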

And the MacBook in the refrigerator - that is also all for work. One of the mini-projects I am spending quite a lot of time on now is studying thermal throttling models. The situation is this: when a CPU-bound benchmark loads the hardware heavily, the CPU temperature rises, and once it reaches a certain threshold, the Intel processor or the operating system says: "Uh-oh! We're overheating!" - and lowers the frequency for some period of time. And then you get, say, 2-3 iterations in which a performance degradation is supposedly visible. And we go: "Oh no! Everything is bad! We won't let this branch through." When in fact our performance agent simply overheated.

There are different ways to fight this. We have our own server room with our own rigs, and we try to provide enough cooling there so that thermal throttling does not occur. But even that does not always work. We can't just freeze the agents outright - that wouldn't do them much good - but we still have to fight it somehow.

Another option is, for example, turning off turbo boost so that the processor never goes above its base frequency. First, this reduces the likelihood of overheating - the processor doesn't run as hot. And second, we get a more stable frequency (with turbo boost it often jitters quite a lot, while with turbo boost off it sits evenly at the base frequency, and you get a much more stable result).

And thermal throttling models vary a lot: first, much depends on the processor and the configuration of all the hardware; second, on the operating system. Take the Mac, for example: we run a lot of tests on Macs, because there are a lot of Mac users and they don't want Rider to be slow. And the thermal throttling model there is very aggressive.

On the new Intel processors that were recently announced there are even fancier tricks. If your temperature is below a certain threshold, say 50 degrees, the frequency can jump even higher than the usual turbo boost maximum. That is, they do a kind of dynamic overclocking "a little bit" at low temperatures. The effect is the same. Our agents are still on older processors and have not been upgraded yet, but geeks who like to buy all the latest hardware may run into this.

The future


- I have to interrupt you, because time is running out. But for those who are intrigued: are you going to write a blog post based on this material?

- Yes, for now I am collecting material; it is all very interesting there, a very complex thermal throttling model. There is, for example, Power Throttling on Windows, which saves battery power, and much more. For now I am gathering data, and then I will put it all together either into a blog post, or perhaps even a scientific article, or it will make it into the second edition of the book.


Source: https://habr.com/ru/post/undefined/
