Interleaving testing: 100 times faster than an A/B test

A/B testing is one of the main tools of product management: no one has yet come up with a cheaper or more reliable way to evaluate the impact of one specific change on a product's business metrics while isolating it from all other factors.

In this article I want to talk about an alternative method of testing changes in a product: interleaving testing. To show its advantages and disadvantages, I will compare it with the traditional A/B test throughout. Not because interleaving is some new, more advanced method that is faster and more accurate and should replace A/B tests: it is an additional tool for a product manager, with a different field of application, answering a different question. The comparison simply makes it easy to show where the differences and strengths of each test lie.

Summary:

  • Why interleaving is faster than an A/B test
  • When interleaving testing can be applied
  • How the results of an A/B test and an interleaving test differ
  • How to combine the strengths of interleaving and A/B testing

Why interleaving testing is much faster than A/B testing


In numerous attempts to convey the basic idea of interleaving to colleagues and fellow product managers, I have found that the following example illustrates it best. Take a moment to get into the context; I promise that by the end of the example you will agree it is very clear.

Suppose we need to determine which soda we should offer in our bar to sell as many drinks as possible: Coca-Cola or Pepsi. If we approach this from the point of view of A/B testing, we would have to open two absolutely identical bars, one serving only Coca-Cola and the other only Pepsi, and direct visitors to one of the two bars at random.


Then we compare in which of the bars visitors ordered more drinks, and conclude which drink brings in more revenue.

I think you can already see the problem: many visitors to the bar that does not carry their favorite drink will still order whatever is available, because they came to drink something. Only a very few will be so principled in their preferences that they will drink nothing at all or drink much less. These indifferent visitors reduce the sensitivity of our test to drink preferences, because their behavior gives us no signal at all.

How does interleaving solve the same problem? If we are physically able to offer users both of the compared options at the same time and see which one they prefer, we can identify their true preferences much faster.


Applying interleaving to our bar metaphor, we put two taps on the counter and simply watch which of the drinks visitors order more often. I think you can intuitively feel that this test will reach a significant result much faster, because every order is a “vote” for one option or the other, while in the A/B test the only signal is the difference in the number of orders.

An article on the Netflix Tech Blog provides evidence that interleaving determines user preferences 100 times faster than an A/B test. Unfortunately, I cannot publish my own data on interleaving, but in my experience this estimate held up: with almost any reasonable amount of traffic, interleaving produces a significant result in less than 24 hours. Running the test for less than a day is still a bad idea, though, because the sample has to be representative (morning, afternoon, and evening visitors can behave differently; weekly cycles we will leave aside).
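To give a feel for the arithmetic behind this speed, here is a minimal sketch of how one day of interleaving results could be checked for significance with a two-sided sign test: every user who clicked more items from one algorithm than from the other counts as a single “vote”, and under the null hypothesis of no preference those votes behave like a fair coin. The tallies below are made up for illustration.

    from scipy.stats import binomtest

    # Hypothetical one-day tallies: each user who clicked more A-items
    # than B-items is one vote for A, and vice versa; ties are discarded.
    votes_for_a = 640
    votes_for_b = 560

    n = votes_for_a + votes_for_b
    result = binomtest(votes_for_a, n, p=0.5)  # H0: users have no preference
    print(f"p-value: {result.pvalue:.3f}")     # about 0.02, significant at the 5% level

In an A/B test the same preference would have to show up as a difference in average orders between two independent groups, where most users contribute no signal at all, which is exactly why it needs far more data.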

When interleaving testing can be applied


Interleaving was originally invented for testing rankings: you have a set of objects (products in an online store, or links to web pages for a search engine) and you need to sort them so that the ones that best match the user's query end up on top.

If you have two ranking algorithms and want to compare them, then instead of showing the user either ranking A or ranking B, you can show a page that looks like this:

A1 B1 A2 B2 A3 B3 and so on, where A2 is the second item produced by ranking algorithm A, and B3 is the third item in ranking B.

Illustration of interleaving from the Netflix Tech Blog article
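Below is a minimal sketch of how such a merged page could be assembled, assuming each algorithm returns an ordered list of item ids. The function name and data are hypothetical, and real implementations (for example, the team-draft interleaving that Netflix describes) also randomize which ranking contributes first, to cancel out position bias; this toy version omits that.

    from itertools import zip_longest

    def interleave(ranking_a, ranking_b, limit=10):
        """Merge two rankings as A1 B1 A2 B2 ..., skipping duplicates.

        Each merged item keeps a source label so that later clicks can
        be attributed to the algorithm that contributed the item first.
        """
        merged, seen = [], set()
        for a, b in zip_longest(ranking_a, ranking_b):
            for item, source in ((a, "A"), (b, "B")):
                if item is not None and item not in seen:
                    seen.add(item)
                    merged.append((item, source))
        return merged[:limit]

    # Two algorithms ranking the same catalog differently:
    print(interleave(["p1", "p2", "p3"], ["p2", "p4", "p1"]))
    # [('p1', 'A'), ('p2', 'B'), ('p4', 'B'), ('p3', 'A')]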


We direct all available traffic to this interleaved ranking and count the results: which of the two algorithms collected more clicks, or drove more target actions further down the conversion funnel.
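As a sketch of that counting step, suppose we logged, for every user, the source labels ("A" or "B") of the interleaved items they clicked; the log format and names here are hypothetical. The per-user votes could then be aggregated like this:

    from collections import Counter

    def preference_votes(sessions):
        """Turn click logs into per-user votes for algorithm A or B.

        `sessions` maps a user id to the list of source labels of the
        interleaved items that user clicked.
        """
        votes = Counter()
        for user, clicked_sources in sessions.items():
            counts = Counter(clicked_sources)
            if counts["A"] > counts["B"]:
                votes["A"] += 1
            elif counts["B"] > counts["A"]:
                votes["B"] += 1
            # equal numbers of clicks: a tie, no vote is cast
        return votes

    sessions = {"u1": ["A", "A", "B"], "u2": ["B"], "u3": ["A", "B"]}
    print(preference_votes(sessions))  # Counter({'A': 1, 'B': 1})

These votes are exactly what feeds the sign test sketched earlier.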

In fact, products contain a great many elements that are really the result of a ranking. Here are some examples:

  • The list of products or the catalog of sections on the main page of the site
  • List of products within a section or in response to a search query
  • List of articles on a news site
  • “Similar ads”
  • “Customers also buy with this product” blocks
  • Articles in the Help section
  • Any listing of items: friends on a social network, posts in a feed, music on a profile page, movies in an online cinema
  • And so on

And all of these elements can and should be tested with interleaving. It allows you to test not one alternative recommendation algorithm per week, but seven hypotheses per week.

How the results of an A/B test and an interleaving test differ


When we run an A/B test, we can measure the impact of a change in the user experience on any metric we are interested in that can be computed per user: from conversion to sales to the number of support calls.

An interleaving test only lets us compare events that can be directly attributed to a click on one of the interleaved options. And this comparison does not answer the question “what happens if we replace A with B in our product”, because we do not know how users will behave when they see ranking B alone: we took our measurement on a blend that is not a standalone version of the user experience.

That is why interleaving is best used as a preliminary stage for selecting the most promising of many hypotheses; for the winner it then makes sense to run a longer A/B test to check how the change affects the target metric.

Very often it turns out that the improved algorithm has no effect on the business metric. But at least you are sure the user experience has become better, and you now know which block is most likely pointless to optimize in an attempt to move your target metric.

Strengths and weaknesses of interleaving


Let's summarize the pros and cons of interleaving testing.

Minuses


  • It can only be applied where the compared options can be mixed and shown to the user simultaneously: in practice, rankings and lists. A change that alters the user experience as a whole cannot be interleaved.
  • It does not tell you how a change affects business metrics, so the winning option still has to be validated with a follow-up A/B test.
  • It is technically more complex to implement than a simple split of traffic between two variants.

Pluses


  • It applies to any list or ranking in the product (products, articles, recommendations, and so on).
  • It is fast (according to Netflix, about 100 times faster than an A/B test).
  • It is highly sensitive: every user sees both compared options at once, so every click is a direct “vote” for one of them rather than a small contribution to a difference between groups.


Links:

  1. The Netflix Tech Blog article claiming that interleaving identifies the better algorithm roughly 100 times faster than an A/B test: "Innovating Faster on Personalization Algorithms at Netflix Using Interleaving"
  2. A more rigorous article describing the statistical methods for interpreting interleaving test results: Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (February 2012)
