Increasing the sensitivity of A/B tests with CUPED. A report at Yandex

CUPED (Controlled-experiment Using Pre-Experiment Data) is an A/B testing technique that has come into production use relatively recently. It increases the sensitivity of metrics by using data collected earlier. The greater the sensitivity, the subtler the changes an experiment can detect and account for. Microsoft was the first company to introduce CUPED; now many international firms use the technique. In his talk, Valery Babushkin explained what CUPED is and what results it can achieve, and before that he examined stratification, a method that also improves sensitivity.


- My name is Valery Babushkin, I'm the director of modeling and data analysis at X5 Retail Group and an adviser at Yandex.Market. In my free time I teach at the Higher School of Economics and often fly to Kazakhstan to teach at the National Bank of Kazakhstan.

In addition, I used to enjoy competitive machine learning. On the Kaggle platform I once earned the title of Competitions Grandmaster and reached 23rd place in the world ranking out of 120 thousand participants. Kaggle is set up very simply: if you don't compete, you fall in the ranking. So I try not to go there anymore, to avoid seeing those numbers.



My presentation has two parts: stratification and control variates. Most likely you know what A/B tests are and why they are needed. But we won't skip this formula anyway.



There is a variety of approaches to A/B testing. In essence, there are two main approaches in statistics: one is called frequentist, the other Bayesian. Some books, such as Efron's, distinguish a third, Fisherian approach, but we will talk about neither it nor the Bayesian approach. Let's talk about the frequentist approach.

In the frequentist approach there is one simple formula. Strictly speaking there are two, but one covers the discrete case and the other the continuous one, so we will treat them as a single formula.

This formula tells us how many observations are needed. If we could afford to collect an infinite amount of data, we would obtain the true value of each distribution and then simply compare their point estimates. Whether we could even compare point estimates over an infinite amount of data is a question in itself, but nonetheless: we would obtain the true distributions, compare them, and say which is better.

Unfortunately, we cannot do this; we always have a limit on the amount of data we can collect. It is imposed either by the number of our users, or by the time we have to collect the data, or simply by the fact that people want the result from us as quickly as possible.

Here we see a very simple formula for n, where n is the number of observations needed in each group. In the numerator is z², where z is the z-score for the confidence level, that degree of reliability with which we want to report our result.

It seems obvious that z is fixed once and cannot be changed later. Of course, we could say that we report the result with zero reliability, and then we need zero observations. That would be very convenient, but we usually don't.

Next in the numerator, in the discrete formula, is p̂(1 − p̂), which equals the variance of the binomial distribution. In the continuous case it is the same thing, σ², that is, the variance. And it seems logical that the greater the variance, the more observations we need.

The denominator contains m², the margin of error, that is, the minimum difference we want to detect, and here the situation is the opposite: the smaller the difference we want to detect, the more observations we need. In other words, it is something like an error.

If we need an error of 0.01, we need 100 times more observations than for an error of 0.1. The errors differ tenfold, but because the dependence is quadratic, it turns out that 100 times more observations are needed.
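The sample-size formula from the slide is easy to sketch in code. This is a minimal illustration, not the speaker's implementation; the function name is mine, and the binomial variance 0.25 assumes p = 0.5:

```python
from statistics import NormalDist

def sample_size(variance, margin, confidence=0.95):
    """Observations needed per group: n = z^2 * variance / margin^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    return z ** 2 * variance / margin ** 2

# Binomial case with p = 0.5, so variance = p*(1-p) = 0.25.
n_coarse = sample_size(0.25, margin=0.1)   # ~96 observations
n_fine = sample_size(0.25, margin=0.01)    # ~9604 observations
print(round(n_fine / n_coarse))  # 100: a tenfold smaller margin needs 100x the data
```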

I once gave a talk on linearization. Today I will tell how we can reduce the variance, and back then I told how we can increase m. Increasing m seems the better strategy, because doubling m reduces the amount of data needed fourfold. To increase m means to increase the error we are willing to make.

Whereas if we halve the variance, we need only half as many observations. So a fourfold change in the denominator is a 16-fold gain, while a fourfold reduction in the numerator gains only a factor of four.

However, each approach has its pros and cons; I can cover them in more detail later. Now let's turn to reducing variance.

Stratification. Incidentally, at the end of each section I will show experimental results: what we obtained on real data in a real environment.



So, let's talk about stratification. What do we know? We know that reducing the variance reduces the number of observations needed. Suppose the metric we are analyzing can be broken down by some grouping, into regions. A very good question that has already been asked: how do we break it up? By country? Maybe by browser? Maybe by operating system? Perhaps users who log in from Mac, Windows, and Linux are three different types of users.

If we find such a value or feature by which we can divide users into groups, we do the following: divide into K groups, where K is the number of unique values, equal to the number of groups we have. For operating systems it is three, for countries the number of countries, and so on.

Next, the probability of falling into each group equals the number of observations in that group divided by the total number of observations. That is, we can estimate approximate weights in advance: given the total number of users and how many come from Mac, Windows, and Linux, we can immediately compute the weights and the probability that a new user will come from a given operating system.

Then the stratified mean of our metric is given by a very simple formula: the value of the metric within a stratum multiplied by the weight of the stratum, summed over all strata. The formula is fairly obvious; I don't think it needs special analysis.
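The stratified mean described above can be sketched in a few lines. The data here is synthetic and the stratum names (the three operating systems from the talk) are just an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical metric values per stratum (e.g. Mac / Windows / Linux users).
strata = {
    "mac": rng.normal(10, 2, size=200),
    "windows": rng.normal(12, 3, size=500),
    "linux": rng.normal(9, 1, size=300),
}

n_total = sum(len(v) for v in strata.values())
# Stratified mean: sum over strata of (stratum weight w_k) * (stratum mean),
# where w_k is the probability of falling into stratum k.
y_strat = sum(len(v) / n_total * v.mean() for v in strata.values())

# As the talk notes, it coincides with the plain overall mean.
y_all = np.concatenate(list(strata.values())).mean()
print(np.isclose(y_strat, y_all))  # True
```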



It gets a little more complicated from here. We will spend a couple of minutes parsing these formulas, but if you suddenly don't understand something, don't worry: I once spent three hours parsing them.

What do we see here? The mean of the stratified metric is no different from the mean under random sampling. This is not difficult to prove: it is just a weighted sum that in the end equals the overall weighted mean across the groups.

But the variance is a little more interesting. We know a very simple formula: the variance of the sum of two quantities is the sum of their variances, each with some coefficient, plus the covariance if they are not independent, also with a coefficient.

Actually, if you pay attention, those coefficients are presented right here: they are the probabilities of falling into a stratum. Accordingly, the variance of the stratified metric is the variance within each stratum with certain weights, where the weight is the probability of falling into that stratum.

So far everything seems pretty reasonable. In the end, the variance of the stratified metric equals this formula. It doesn't matter if you don't understand why just yet; the main thing is to remember it.



Now let's talk about the mean and variance under random sampling. SRS stands for simple random sampling.

As you might guess, the mean under random sampling is just the mean; no need to dig deep here. But the variance of random sampling, by the classical formula, is very clear: it is σ² times one over n. If we recall the standard error formula, that is σ divided by the root of n. This is the variance of the mean.

But I want to break it down into its components.



So, if we break it down into its components, following a simple series of calculations (you will have to take my word for it now; we will not go through all these lines, though they are not very complicated), we will see that it consists of two terms.



Remember this one. This is the variance in the case of stratification, believe me.



If we look at what the variance of random sampling is made of, it consists of two terms: the first, which equals the stratified variance, and the second.

What is the point? Put briefly, the variance of random sampling can be represented as the sum of the variance within the stratified groups and the variance between the stratified groups. There are n groups; a is the variance within a group, b is the variance between the groups. If someone remembers ANOVA, it is roughly the same: there is variance within groups and variance between groups. Which is logical.

It turns out that the variance of random sampling can at best be equal to the stratified variance, and otherwise is greater. Why? Because if this term equals zero (and it cannot be less than zero, since there is a square and a probability cannot be negative), then the whole expression is clearly greater than or equal to the remaining part, which equals what you saw in stratification. It turns out that we win: we reduce the variance, at least by this term.
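The decomposition the speaker describes is the law of total variance, and it is easy to verify numerically. This is a synthetic check, not data from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic strata with different means (the between-strata component).
groups = [rng.normal(mu, 1.0, size=1000) for mu in (0.0, 2.0, 5.0)]
pooled = np.concatenate(groups)
weights = [len(g) / len(pooled) for g in groups]

# Law of total variance: overall variance = within-strata + between-strata.
within = sum(w * g.var() for w, g in zip(weights, groups))
between = sum(w * (g.mean() - pooled.mean()) ** 2 for w, g in zip(weights, groups))
print(np.isclose(pooled.var(), within + between))  # True
```

The `between` term is the one stratification removes: stratified sampling leaves only the `within` part, so its variance can never exceed that of simple random sampling.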



This is the same as what I just said, so let's skip it. But you will probably be interested to work through what I talked about. By the way, at the bottom of each slide is the name of the article the formula is taken from. Three articles went into this presentation; you can read them later *.

So we read an article and discussed it, but that is not very interesting. It is interesting to see how it works in real life. That is the next slide.



I took the data and started looking at how it works in real life. In real life, my variance fell by as much as one percent.

There is a suspicion that the gain is so small simply because we have a lot of data and the variance between the strata is not very large to begin with: the strata are already smoothed out and quite representative. But it seems that if there is not enough data, or the sample is somehow distorted or not entirely random (which, by the way, happens very often), the gain may be larger.

And this method is very simple to implement. Notice, nothing complicated: you sample from each stratum an amount proportional to the probability of falling into that stratum over the whole sample. Everything is pretty reasonable.
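That sampling step is a one-liner with pandas. The user table below is made up for illustration; the "os" column plays the role of the stratification key from the talk:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical user table; "os" is the stratification key.
users = pd.DataFrame({
    "os": rng.choice(["mac", "windows", "linux"], size=10_000, p=[0.2, 0.5, 0.3]),
    "metric": rng.normal(10, 3, size=10_000),
})

# Take the same fraction from each stratum, so each stratum is represented
# in proportion to its weight in the full data.
sample = users.groupby("os").sample(frac=0.1, random_state=0)

print(sample["os"].value_counts(normalize=True).round(2))
```

The stratum shares in the sample match the full data almost exactly, which is precisely what removes the between-strata component of the variance.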

Let's move on to the second part: CUPED. I don't know exactly how to pronounce it; essentially these are control variates, and we use pre-experiment data.



The idea is also very simple. We take a random variable X independent of Y in the sense that the experiment has no effect on X.

How do we achieve this? The easiest way is to take a variable X obtained before the start of the experiment. Then we can be sure the experiment did not affect it.

Next, we can introduce a new metric, the one we want to compute, as the difference between Y and θX. The formula: the new metric, call it Ycuped, is our target metric minus θ times X.

This is what we have already talked about: a simple formula for the variance of the difference of two quantities. It is the variance of the first quantity (its coefficient is one, so 1² we drop), plus θ², the squared coefficient of the second quantity, times the variance of X; and since this is a subtraction, minus 2θ times the covariance between Y and X.

If these were independent quantities, what would the covariance equal? Zero. The covariance between independent quantities is zero. So if we take an independent variable, things will certainly not get any better for us.



So we need to take some dependent quantity, and we have one more hyperparameter; call it θ. When is the variance minimized? When θ equals the covariance between Y and X divided by the variance of X.



I will not examine in detail why this is so, but if you look at this simple equation, you can derive it yourself.
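The derivation is one line of calculus: differentiate the variance expression with respect to θ and set it to zero. A symbolic check (the symbol names are mine):

```python
import sympy as sp

theta, var_y, var_x, cov_xy = sp.symbols("theta var_y var_x cov_xy", positive=True)

# Var(Y - theta*X) = Var(Y) + theta^2 * Var(X) - 2*theta*Cov(Y, X)
var_cuped = var_y + theta**2 * var_x - 2 * theta * cov_xy

# Minimize over theta: the derivative is 2*theta*Var(X) - 2*Cov(Y, X) = 0.
optimal = sp.solve(sp.diff(var_cuped, theta), theta)[0]
print(optimal)  # cov_xy/var_x
```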



If we do this, we get a very convenient, simple transformation, and the resulting variance can be written as the variance of Y multiplied by one minus the squared correlation between metric Y and metric X. That seems nice.
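The whole transformation fits in a few lines. This sketch uses synthetic data; note that it subtracts θ(X − mean(X)) rather than θX, the centered form from the CUPED paper, which is what keeps the mean unchanged as the talk claims later:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(10, 3, size=n)           # metric in the pre-experiment period
y = 0.8 * x + rng.normal(2, 1, size=n)  # metric during the experiment, correlated with X

theta = np.cov(x, y)[0, 1] / x.var(ddof=1)  # optimal theta = Cov(Y, X) / Var(X)
y_cuped = y - theta * (x - x.mean())        # centering X preserves the mean

rho = np.corrcoef(x, y)[0, 1]
print(np.isclose(y_cuped.mean(), y.mean()))                            # True
print(np.isclose(y_cuped.var(), y.var() * (1 - rho ** 2), rtol=1e-2))  # True
```

With the correlation around 0.9 here, the variance drops by roughly a factor of seven while the mean is untouched.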

Why does this work? We assume that the variance of our metric Y is due to two factors, or two causes: some covariate X and everything else. We can do that, right? And we say: what is due to X we remove, leaving only what is due to all the other causes.



The graph on the next slide will make clear why this works. Any thoughts on why? Besides the formula I wrote, there were other formulas before it, and it turns out they don't explain it. And since we have not yet seen the final results, those don't explain it either.

What interests us first of all when we run A/B tests? The difference in means. In the vast majority of cases we do not look at quantiles. Although, by the way, Uber is very fond of looking at quantiles, and sometimes it is very important to look at them: the means can stay unchanged while the quantiles change dramatically, and the users whose 99th-percentile value grew will churn. For Uber that metric is the waiting time. Just a note in passing.

But we are usually interested in the difference in means, and we want to use methods that do not change that difference. Because with linearization we move into a new feature space. Yes, everything is cool, we can compute an A/B test 64 times faster, and yes, the result is proportional, but we cannot say how large the difference in means really is.

To compute the difference in means and draw conclusions from it, you need a θ that is uniform across all groups. The groups are A1, A2, B, C, and so on: the test cells, or variants, of your A/B test.

How do we choose the metric X? The logical choice for X is the same metric Y, but over the period preceding the experiment. For example, if your metric is the user's average session duration, you can compute it for some period before the experiment and during the experiment, subtract one from the other, and look only at the deviation between them. That is probably what interests you most.

An interesting question arises here: over what period should we take the metric X? One day, a week, two weeks? There is no theoretical answer, but practice shows that two weeks is roughly the optimum. In principle, you can take your experimental data and plot how much the variance drops and how quickly the test converges depending on how long a period you take X over.



Why does it work? Look, this is a very simple graph, a very simple picture. It shows the values of X and Y: the values of our metric for a user in the period before the experiment and during it.

What are we doing? We select θ. We can select it, for instance, by the method of least squares. That is, we draw a certain middle line that gives the minimum sum of squared residuals, where a residual is the difference between the actual value and the value on the line.

Thus we are, in a sense, trying to average while still keeping the mean value of the metric; the mean does not change. It seems even I did not fully understand what I just said, and it must be even harder for you, because at least I have seen this before. Let's try again. We have the X axis and the Y axis. On the X axis we mark the values before the experiment, and on the Y axis the corresponding values during the experiment. That is, we get a point in XY coordinates and can mark it on the chart.

If nothing had changed, these points would coincide and lie on the bisector, because X would equal Y. But in reality that will not happen, agreed? In some cases the value of the metric Y will be greater, in some cases smaller.

It is exactly this difference we want to capture, because everything else is not so interesting for us. For example, if there is no difference (we ran an experiment and X equals Y), then the experiment most likely had no effect. If we ran the experiment and see Y consistently above X everywhere, that is a reason to think we may have influenced something. If Y is consistently below X, that is also not great: most likely we had a negative effect.

It turns out we are trying to draw a line that describes the relationship between X and Y while minimizing this difference. That is exactly what linear regression does: you have, essentially, one independent variable and one dependent variable, and you want to describe the relationship as accurately as possible.

This is our line, this is our new CUPED metric, and this is exactly why the mean CUPED value does not change: the mean of Ycuped will not differ from the mean of Y. Why? Just because. I should have explained this right away. :) By the way, the original article says as much: note that there is a very interesting connection between finding θ and regression. This is it.

I repeat, we are interested in how the experiment itself affected user behavior, how much it changed relative to the baseline. Suppose there are two users: one always had a ten-minute session, the other a hundred-minute one. Some change happened, and now the first user spends 12 minutes while the second still spends 100. The difference in one case is two, in the other zero. But simply comparing the numbers 12 and 100 with each other is probably not very reasonable. We want something else. Call it "normalization"; that is not quite the right word, but nonetheless.

Now let's move on to the experiment.



What do we see? This is a screenshot from a Jupyter notebook, which I really dislike (I prefer PyCharm), but still, I did it. It already shows the variance of the CUPED metric and the variance of the standard metric. See how much they differ? Ycuped's is much smaller, and the means do not differ at all.

More precisely, they barely differ: somewhere around the 15th decimal place there is probably a difference, but we will assume that is rounding error.

What do we see here? The variance fell by 45%. This is data from the online service. What we observed at X5 is the variance dropping fourfold. At X5 we have certain in-store behavior: it can be averaged by day of week, by hour, or by hour and day of week. See, we can pick covariates that are more and more correlated. It seems that, say, the number of people who came on a Monday should correlate with the number of people who came the next Monday. Going a little deeper, Monday at six in the evening should correlate even more strongly with another Monday at six in the evening, and Sunday at three in the afternoon with another Sunday at three in the afternoon.

The maximum drop in variance I have seen in real life is 19-fold. What is the advantage? It is also very simple to do; you must admit, you hardly need to think at all. Find the covariate, find θ. θ, by the way, is found by an extremely simple formula; everything has already been worked out.



Subtract, and you get the transformed metric. Its mean has not changed, which is very good. And you can explain to the business in plain language why this happens: you say that we are interested not only in how users behave on average, but in how their behavior has changed relative to that average. And that's it.

In some cases there may be difficulty choosing the right covariate, but often this is not a problem: it is almost always possible (very rarely not) to take the metric's value over the preceding period. It works. A 19-fold reduction in variance means the amount of data required for the A/B test also drops 19-fold. You get your result faster, and the sensitivity of the test increases.

If you already have a number of past A/B tests, you can run CUPED on them retrospectively and estimate the Type I and Type II error rates. You can estimate Type I errors by running an A/A test; run it with CUPED in the same way, and you will likewise be able to evaluate how much your sensitivity has increased.
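An A/A check like the one described can be sketched as a simulation. Everything below is synthetic and the function name is mine; the point is that with no real effect, the share of "significant" results should stay near α for both the raw and the CUPED metric:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def aa_false_positive_rate(n_users=2_000, n_runs=500, alpha=0.05):
    """A/A simulation: count how often a t-test reports a difference
    between two identical groups, for the raw and the CUPED metric."""
    raw_hits = cuped_hits = 0
    for _ in range(n_runs):
        x = rng.normal(10, 3, size=n_users)           # pre-experiment metric
        y = 0.8 * x + rng.normal(2, 1, size=n_users)  # experiment metric, no effect
        theta = np.cov(x, y)[0, 1] / x.var(ddof=1)
        y_cuped = y - theta * (x - x.mean())
        idx = rng.permutation(n_users)                # random split into A and A
        a, b = idx[: n_users // 2], idx[n_users // 2 :]
        raw_hits += stats.ttest_ind(y[a], y[b]).pvalue < alpha
        cuped_hits += stats.ttest_ind(y_cuped[a], y_cuped[b]).pvalue < alpha
    return raw_hits / n_runs, cuped_hits / n_runs

raw_rate, cuped_rate = aa_false_positive_rate()
print(raw_rate, cuped_rate)  # both should be close to 0.05
```

Running the same harness on splits with a small injected effect would then show the sensitivity gain: CUPED detects the effect in a larger share of runs.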


* :
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix
How Booking.com increases the power of online experiments with CUPED
