A Complete Guide to A/B Testing

There is a ton of information about A/B testing on the Internet, yet many people still do it incorrectly. Mistakes are easy to make, so these studies require serious preparation. This article covers the main aspects of A/B testing you need to consider for effective web page analysis.

What is A/B testing?


A/B testing (split testing) divides traffic 50/50 between two versions of a page. In essence, this method is a new name for an old technique known as the “controlled experiment.”

Drug trials, for example, are split tests. In fact, most research experiments could be called A/B tests: each includes a hypothesis, an object of study, a variation of it, and a result expressed as statistical data.

That's all there is to it. Here is an example of a simple A/B test, with traffic split 50/50 between the original page and its variation:
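In practice, the 50/50 split is usually implemented by deterministically bucketing each visitor, so that a returning visitor always sees the same version. A minimal sketch in Python (the function name and visitor IDs are illustrative, not from any particular tool):

```python
import hashlib

def assign_variant(visitor_id: str, variants=("A", "B")) -> str:
    """Deterministically bucket a visitor into one of the variants.

    Hashing the visitor ID (rather than flipping a coin on every request)
    guarantees the same person always sees the same version of the page.
    """
    digest = hashlib.md5(visitor_id.encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Over many visitors the split converges to roughly 50/50:
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
```

Because the hash is uniform, each variant receives about half the traffic without any shared state between servers.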



In conversion optimization, the main difference is the variability of Internet traffic. External variables are easy to control in a lab; on the Internet you can reduce their impact, but building a fully controlled test is much harder.
In addition, testing new drugs demands a high degree of accuracy: people's lives are at stake. Technically, that means tests can run longer, because researchers must do everything possible to avoid Type I errors (false positives).

A/B tests on websites, however, are run to achieve business goals. They serve the analysis of risk and reward, exploration and exploitation, science and business. The results are therefore viewed from a different angle, and decisions are made differently than by researchers in laboratories.

Of course, you can create more than two page variations. A test with several variations is called A/B/n testing. Given enough traffic, you can test as many variations as you like. Here is an example of an A/B/C/D test with the traffic allocated to each variation:



A/B/n testing is great for running multiple variations against one hypothesis. However, it requires more traffic, because the traffic must be divided among more pages.

Despite its popularity, A/B testing is just one type of online experiment. You can also run multivariate tests or use the multi-armed bandit method.

A/B testing, multivariate testing, and multi-armed bandits: what's the difference?
A/B/n testing is a controlled experiment that compares the conversion rates of the original page against one or more variations.

Multivariate tests run several versions of a page at once to determine which attributes matter most. As with A/B/n testing, the original is compared with variations; however, each variation combines different design elements. For instance:



Each element has a specific use case and affects page performance. You can get the most out of your site as follows:

  • Run A/B tests to determine the best page layouts.
  • Run multivariate tests to refine those layouts and ensure all page elements interact well with each other.


You will need to drive a huge number of users to the tested page before even considering multivariate testing. If there is enough traffic, though, both types of research can be used to optimize the site.
Most agencies prefer A/B testing, because their clients usually test significant changes (which potentially affect the page more) and because such tests are easier to run.

The multi-armed bandit method is essentially an A/B/n test that updates in real time based on the performance of each variation.

A multi-armed bandit algorithm starts by sending traffic to two (or more) pages: the original and its variation(s). The traffic allocation then updates depending on which variation performs best, and eventually the algorithm converges on the best option:
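One common allocation rule is epsilon-greedy: mostly send traffic to the best-performing arm, but keep exploring the others a fraction of the time. The sketch below is illustrative only; commercial tools typically use more refined algorithms such as Thompson sampling, and the numbers here are made up:

```python
import random

def epsilon_greedy(stats, epsilon=0.1):
    """Pick the next variant to show.

    `stats` maps variant name -> (conversions, views). With probability
    `epsilon` we explore a random arm; otherwise we exploit the arm with
    the best observed conversion rate so far.
    """
    if random.random() < epsilon:
        return random.choice(list(stats))
    return max(stats, key=lambda v: stats[v][0] / max(stats[v][1], 1))

# Hypothetical running totals: the variation converts at 4.5% vs 3.0%.
stats = {"original": (30, 1000), "variation": (45, 1000)}
```

After each pageview you would update the chosen arm's counts, so the allocation drifts toward the winner while the test is still running.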



One advantage of the multi-armed bandit method is that it mitigates the conversion losses you incur while exposing visitors to potentially worse variations. This chart from Google explains it well:



Both the multi-armed bandit method and A/B/n tests have strengths. The former is ideal for:

  • Headlines and short-term campaigns;
  • Automatic scaling;
  • Targeting;
  • Simultaneous optimization and attribution.


No matter what type of testing you use, the important thing is to keep testing: the more often you test, the faster your conversion rate will grow.



How to improve the results of A/B testing


Do not pay attention to articles like "99 things you can A/B test." They are a waste of time and traffic. Only a structured process will help you increase revenue.

About 74% of optimizers with a structured approach to conversion report improved sales. The rest end up in what web analyst Craig Sullivan calls the “trough of disappointment” (unless their results are inflated by false positives, which we will discuss later).

For maximum effectiveness, the testing process should look like this:
  • Research;
  • Prioritization;
  • Experimentation;
  • Analysis, learning, repetition.


Research


To optimize your site, you need to understand what your users are doing and why.
However, before thinking about testing, strengthen your user acquisition strategy and build on it. You need to:

  1. Define your business goals.
  2. Define your website goals.
  3. Identify your key performance indicators.
  4. Define your target metrics.




Once you understand what you want to achieve, you can start collecting the necessary data. For this, we recommend the ResearchXL framework.
Here is a short list of the processes CXL uses:

  1. Heuristic analysis;
  2. Technical analysis;
  3. Web analytics data analysis;
  4. Mouse-tracking analysis;
  5. Qualitative surveys;
  6. User testing.


Heuristic analysis is one of the best starting points for A/B testing. Even with many years of experience, it is hard to know which page elements will boost performance. You can, however, identify areas of opportunity. UX specialist Craig Sullivan puts it this way:

“In my experience, these patterns simplify the work, but they are not universal truths. They guide and inform me, but they give no guarantees.”


Do not rely on patterns alone. It also helps to have a framework. When conducting heuristic analysis, evaluate each page against the following criteria:

  • Relevance;
  • Clarity;
  • Value;
  • Friction;
  • Distraction.


Technical analysis is often overlooked. However, bugs (if present) kill conversions. Your site may seem to work fine in terms of user experience and functionality. But does it work equally well in every browser and on every device? Probably not.

Technical analysis is highly effective and not very labor-intensive. You should:

  • Conduct cross-browser and cross-platform testing.
  • Analyze the speed of the site.


Next comes the analysis of data from your web analytics tools. First of all, make sure everything works: you would be surprised how many analytics setups are configured incorrectly.

Mouse-tracking analysis includes heatmaps, scroll maps, form analytics, and user session replays. Don't get carried away with colorful click-map visualizations; make sure the analysis yields the information you need to reach your goals.
Qualitative research reveals the causes of problems. Many assume it is simpler than quantitative research; in fact, qualitative research must be just as rigorous to provide equally useful information.

To do this, it is necessary to carry out:

  • Surveys on the site;
  • Customer surveys;
  • Interviews with clients and focus groups.


Finally, there is user testing. The idea is simple: watch real people use and interact with your website while they comment on their actions. Pay attention to what they say and where they struggle.

After a thorough conversion study, you will have a lot of data. The next step is to prioritize testing.

How to prioritize hypotheses in A/B testing


There are many frameworks for prioritizing A/B tests, and you can also build your own. Craig Sullivan prioritizes as follows:

After completing the six research steps described above, you will have uncovered problems, both serious and minor. Sort each finding into one of five categories:

  1. Test. Anything that needs to be tested goes here.
  2. Instrument. Fixing, adding, or improving the handling of tags and events in analytics.
  3. Hypothesize. Pages, widgets, or flows that clearly underperform but have no obvious fix.
  4. Just do it. Obvious fixes that simply need to be done.
  5. Investigate. Findings that require digging deeper before you can act.


Rate each problem from 1 to 5 stars (1 = minor, 5 = critical). Two criteria matter most:

  1. Ease of implementation (time/complexity/risk). Sometimes the data suggests building a feature that would take months to develop. Don't start there.
  2. Opportunity. Subjectively rate how big a lift the change could produce.


Put all of this data into a spreadsheet and you will have a prioritized split-testing roadmap.

We created our own prioritization model to make the whole process as objective as possible. It requires entering the data into a spreadsheet. The model is called PXL and looks like this:



Download a copy of this spreadsheet template here. Just click File > Make a copy to get everything you need.


Instead of asking you to predict the effectiveness of a change, the framework asks a series of questions about it:

  • Is the change significant? A major update is noticed by more people and therefore has a greater impact on the page.
  • Is the change noticeable within 5 seconds? Show a group of people the page, then its variation(s). Can they spot the difference within 5 seconds? If not, the change is unlikely to have a major impact.
  • Does the change add or remove anything? Major changes, such as reducing distractions or adding key information, usually affect the page most.
  • Does the test run on a high-traffic page? Improving a page with a lot of traffic yields a bigger return.


Many potential test variables require data before they can be prioritized. Weekly discussions around the following four questions will help you prioritize tests based on data rather than opinions:

  • Does it address a problem discovered through user testing?
  • Does it address a problem discovered through qualitative feedback (surveys, polls, interviews)?
  • Is the hypothesis supported by mouse tracking, heatmaps, or eye tracking?
  • Does it address a problem discovered through digital analytics?


PXL Assessment


We use a binary scale: you choose one of two ratings. For most variables (unless otherwise indicated), you mark either 0 or 1.
However, we also want to weight variables by importance. To do that, we ask specifically which elements of the page are changing.
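The mechanics of the model are simple enough to sketch in code: each idea gets binary answers, and the sum is its priority. The question names below are illustrative placeholders, not the official PXL list, and the ideas are made up:

```python
# Illustrative PXL-style questions (placeholders, not the official list).
PXL_QUESTIONS = [
    "noticeable_in_5s", "adds_or_removes_element", "high_traffic_page",
    "supported_by_user_testing", "supported_by_analytics", "easy_to_implement",
]

def pxl_score(answers: dict) -> int:
    """Sum of binary (0/1) answers; a higher score means higher priority."""
    return sum(answers.get(q, 0) for q in PXL_QUESTIONS)

# Hypothetical test ideas with their 0/1 answers (omitted answers count as 0):
ideas = {
    "bigger CTA on pricing page": {
        "noticeable_in_5s": 1, "high_traffic_page": 1,
        "supported_by_analytics": 1, "easy_to_implement": 1,
    },
    "new footer links": {"easy_to_implement": 1},
}
ranked = sorted(ideas, key=lambda k: pxl_score(ideas[k]), reverse=True)
```

Sorting by the score turns a pile of ideas into a defensible testing queue.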

Customizability


We built this model on the belief that you can and should tailor its variables to your business goals.

For example, if you work with a branding or user experience team and hypotheses must follow brand guidelines, add that as a variable.
You may work at a startup whose sales engine is driven by SEO, with funding that depends on that stream of customers. Add a category like “doesn't hurt SEO” for changes to headlines or copy.

Every organization works differently. Customizing the template helps account for these nuances and build the optimal program for your site.

Whatever framework you use, make it transparent to every member of the team, as well as to the company's stakeholders.

How long should A/B tests run?


First rule: do not stop a test just because it reaches statistical significance. This is probably the most common mistake novice optimizers make.

If you stop tests too early, you will find that most changes do not increase revenue (which is the ultimate goal).
Consider these statistics from 1,000 A/A tests (tests run on two identical pages):

  • 771 of the 1,000 experiments reached 90% significance.
  • 531 of the 1,000 experiments reached 95% significance.


Stopping tests prematurely increases the risk of false positives.
Determine your sample size in advance, and run the test for several weeks, covering at least two business cycles in a row.
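You can see the peeking problem for yourself with a small simulation (the parameters below are arbitrary). Each A/A test compares two identical pages; we record whether it looks "significant" at the planned end, and whether it ever looked significant at any of 20 interim peeks:

```python
import random
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference of two observed conversion rates
    (pooled two-proportion z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def aa_test(visitors=2000, rate=0.05, checks=20):
    """One A/A test: both arms share the same true conversion rate.
    Returns (significant_at_end, significant_at_any_peek)."""
    conv_a = conv_b = 0
    peeked_significant = False
    step = visitors // checks
    for i in range(1, visitors + 1):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if i % step == 0 and p_value(conv_a, i, conv_b, i) < 0.05:
            peeked_significant = True
    return p_value(conv_a, visitors, conv_b, visitors) < 0.05, peeked_significant

random.seed(42)
results = [aa_test() for _ in range(200)]
false_positives_at_end = sum(end for end, _ in results)
false_positives_with_peeking = sum(peek for _, peek in results)
```

Judged only at the planned end, roughly 5% of these identical-page tests look significant; judged at every peek, far more do. That is exactly why "stop when it hits significance" is dangerous.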

How do you determine the sample size? There are many great tools. Here is how to calculate it with Evan Miller's tool:



In this example, we specified a baseline conversion rate of 3% and a minimum detectable effect of 10%. The tool states that 51,486 people must visit each variation before we can look at statistical significance levels.
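The calculation behind such calculators can be approximated with the standard two-proportion formula. This is a sketch of that approximation; exact calculators like Evan Miller's use slightly different corrections and may differ by a few visitors:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(baseline, mde_rel, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-sided test.

    `baseline` is the current conversion rate, `mde_rel` the relative
    minimum detectable effect (0.10 = a 10% relative lift).
    Normal-approximation formula; exact tools may differ slightly.
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_power = NormalDist().inv_cdf(power)           # power threshold
    numerator = (z_alpha * sqrt(2 * p1 * (1 - p1))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n = sample_size(0.03, 0.10)  # ~51,500 visitors per variation
```

Note how quickly the requirement drops as the detectable effect grows: doubling the MDE roughly quarters the sample.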

Besides the significance level, there is statistical power. Power guards against Type II errors (false negatives): in other words, it increases the likelihood that you will detect a genuinely better page.

Remember that 80% power is the standard for A/B testing tools. To reach that level, you need either a large sample size, a very large effect, or a longer test.

There are no magic numbers


Many articles cite magic numbers (such as “100 conversions” or “1,000 visitors”) as the right time to stop a test. However, mathematics has nothing to do with magic, and reality is more complicated than such simplified heuristics. Here is what Andrew Anderson of Malwarebytes says:

“Your goal is not a certain number of conversions. You should strive to collect enough data to test a hypothesis on representative samples and representative behavior.

One hundred conversions is enough only in the rarest cases, with an incredibly large difference in behavior, and only if other requirements are met, such as behavior over time, consistency, and a normal distribution. Even then, the risk of a Type I error remains very high.”


So you need a representative sample. How do you get one? Run the test across two business cycles, which helps reduce the influence of external factors such as:

  • Day of the week. Daily traffic can vary greatly depending on the day of the week.
  • Traffic sources. Unless you intend to personalize the experience for a particular source.
  • Your newsletter and blog post schedule.
  • Returning visitors. People may visit your site, think about a purchase, and come back 10 days later to make it.
  • External events. Mid-month payday, for example, can affect purchasing.


Be careful with small samples. The Internet is full of case studies built on mathematical errors.

Once everything is set up, do not look at the test results (and do not let your boss peek either) until the test is finished. Otherwise you may draw premature conclusions after “spotting a trend.”

Regression to the mean


You will often see results vary wildly during the first few days of a test. They then converge toward the mean as the test runs on for several weeks. Here is an example from an e-commerce site:



  • First couple of days: blue (variation 3) wins by a wide margin, earning $16 per visitor versus the original page's $12.50. Many would (mistakenly) end the test at this point.
  • After 7 days: the blue variation still wins, and the relative difference is quite large.
  • After 14 days: the orange variation (4) takes the lead!
  • After 21 days: orange still wins!
  • End of the test: there is no difference between the variations.


Had you stopped the test before the fourth week, you would have drawn the wrong conclusion.

A related problem is the novelty effect. The newness of a change (a big blue button, for example) draws extra attention to the variation. Over time this effect fades as the change stops being novel.

Can I run multiple A/B tests at the same time?


You want to speed up your testing program and run more tests. But can you run more than one A/B test at a time? Will it increase your growth potential or distort the data?

Some experts argue that running multiple tests at once is wrong; others say it is fine. In most cases, several simultaneous tests will not cause problems.

Unless you are testing something truly critical (something that affects your business model and the future of the company), the benefits of testing volume will probably outweigh the noise in your data and the occasional false positive.
If there is a high risk of interaction between tests, reduce the number of simultaneous tests and/or let them run longer to improve accuracy.

How to set up A/B tests


After compiling a prioritized list of test ideas, you need to formulate a hypothesis and run the experiment. The hypothesis states why you believe the problem occurs. In addition, a good hypothesis is:

  • Testable. It is measurable, so it can be checked.
  • Aimed at a conversion problem. Split testing exists to solve conversion problems.
  • A source of market insight. With a clearly articulated hypothesis, your split test results will always yield valuable customer information.




Craig Sullivan offers the following template to simplify hypothesis writing:

  1. Because we saw (data/feedback),
  2. we expect that (change) will cause (impact).
  3. We'll measure this using (data metric).


There is also an advanced version of this template:

  1. Because we saw (qualitative and quantitative data),
  2. we expect that (change) for (population) will cause (impact[s]).
  3. We expect to see (data metric change) over a period of (X business cycles).
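Teams that log many hypotheses sometimes keep the template as a reusable string so every entry has the same shape. A small sketch; the field values are invented for illustration:

```python
# Paraphrase of the advanced hypothesis template as a format string.
HYPOTHESIS_TEMPLATE = (
    "Because we saw {evidence}, "
    "we expect that {change} for {population} will cause {impact}. "
    "We will measure this using {metric} over {cycles} business cycles."
)

# Hypothetical example entry:
hypothesis = HYPOTHESIS_TEMPLATE.format(
    evidence="high checkout drop-off in analytics and exit-survey complaints",
    change="adding trust badges to the payment step",
    population="mobile visitors",
    impact="fewer abandoned checkouts",
    metric="checkout completion rate",
    cycles=2,
)
```

Writing hypotheses this way makes it obvious when a field (the population, say, or the metric) is missing before the test starts.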


Technical issues


Now comes the most entertaining part of testing: you finally get to choose a tool.

Many people start with this question, but it is far from the most important one. Strategy and statistics matter much more.

However, there are some tool characteristics you should be aware of. They fall into two main categories: server-side and client-side tools.

Server-side tools render variations at the server level: they send a randomized version of the page to the visitor without any changes happening in the visitor's browser. Client-side tools send the same page to everyone, and JavaScript running in the browser then turns it into the original or the variant.

Client-side testing tools include Optimizely, VWO, and Adobe Target. Conductrics supports both approaches, and SiteSpect uses a proxy.
What does all this mean for you? If you want to save time, your team is small, or you lack development resources, client-side tools will get you started faster. Server-side tools require development resources, but they are generally more robust.

Although setup differs slightly from tool to tool, the overall process is usually simple enough that anyone can handle it by following the instructions.

In addition, you need to set up goals, so that your testing tool tracks when each page variant turns visitors into customers.



When setting up A/B tests, the following skills come in handy: HTML, CSS, and JavaScript/jQuery, plus copywriting and the ability to design new page variations. Some tools offer a visual editor, but it limits your flexibility and control.

How to analyze the results of A/B tests?


So, you finally did the research, set up the test correctly, and ran it. Now for the analysis. It is not as simple as glancing at the graph in your testing tool.



One thing you should always do: analyze your test results in Google Analytics. This both expands your analysis capabilities and gives you more confidence in your data and decisions.

Your testing tool may record data incorrectly, and unless you have another source of information, you can never be sure whether to trust it. Create multiple sources of data.

What happens if there is no difference between the variations? Don't rush to conclusions. First, recognize two things:

  1. Your hypothesis might have been right, but the implementation was wrong. Suppose your qualitative research points to concerns about security: how many different ways could you improve the perception of security? Countless. Use iterative testing to try an implementation, learn, and compare several iterations.
  2. Even with no tangible difference overall, the variation may beat the original page in certain respects.


If, say, performance improves for returning and mobile visitors but not for new visitors and desktop users, those segments can cancel each other out, creating the impression of “no difference.” Analyze your test across key segments to explore this possibility.

Data segmentation for A/B tests


Segmentation is the key to capitalizing on A/B test results. Even if B loses to A in the overall results, it can beat the original page in specific segments (organic traffic, Facebook referrals, mobile traffic, etc.).
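Mechanically, segmented analysis is just grouping your per-visitor records by segment and variant before computing rates. A stdlib-only sketch with made-up records:

```python
from collections import defaultdict

# Hypothetical per-visitor test records: (segment, variant, converted 0/1).
records = [
    ("mobile", "A", 0), ("mobile", "B", 1), ("mobile", "B", 1),
    ("desktop", "A", 1), ("desktop", "B", 0), ("mobile", "A", 1),
]

def segment_rates(records):
    """Conversion rate per (segment, variant).

    B may win overall yet lose inside one segment, so always look
    at both the aggregate and the per-segment numbers.
    """
    agg = defaultdict(lambda: [0, 0])  # (segment, variant) -> [conversions, visitors]
    for segment, variant, converted in records:
        agg[(segment, variant)][0] += converted
        agg[(segment, variant)][1] += 1
    return {key: conv / n for key, (conv, n) in agg.items()}

rates = segment_rates(records)
```

With real traffic you would also check that each segment's sample is large enough before trusting its rate, as discussed below.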



There are a huge number of segments you can analyze, including:

  • Browser type;
  • Traffic source type;
  • Mobile versus desktop (and device type);
  • Logged-in versus logged-out visitors;
  • PPC/SEM campaigns;
  • Geographic region (city, state/province, country);
  • New versus returning visitors;
  • New versus repeat customers;
  • Power users versus casual visitors;
  • Men versus women;
  • Age range;
  • New versus existing leads;
  • Plan types or loyalty program tiers;
  • Current, prospective, and former subscribers;
  • Roles (if, for example, your site has buyer and seller roles).


At the very least (provided that you have an adequate sample size), pay attention to these segments:

  • Desktop versus mobile performance;
  • New customers versus returning ones;
  • Lost traffic.


Make sure each segment has a sufficient sample size. Calculate it in advance, and be wary if a segment has fewer than 250-350 conversions per variation.
If a variation performs well for a particular segment, you can move on to personalizing the experience for those users.

How to archive completed A/B tests


A/B testing is, above all, a way to gather knowledge. Statistically sound tests, run by the book, help achieve the larger goals of growth and optimization.

Smart companies archive test results and continually refine their testing approach. A structured approach to optimization yields greater growth and is less likely to get stuck at local maxima.



The hardest part is that there is no single best way to structure knowledge management. Some companies use sophisticated in-house tools; some use third-party tools; and some get by with Excel and Trello.
Here are three tools designed specifically for conversion optimization:

  • Iridion;
  • Effective Experiments;
  • Growth Hackers' Projects.




Statistics obtained through A/B tests


Knowledge of statistics helps when analyzing A/B test results. We touched on some concepts above, but there is more.

There are three concepts you should know before digging into the details of A/B test statistics:

  1. Mean. We do not measure the true conversion rate, only that of a sample. The mean stands in as a representative of the whole.
  2. Variance. A measure of how widely the values of a random variable scatter around their expectation. It affects test results and how we use them.
  3. Sampling. We cannot measure the true conversion rate, so we choose a representative sample.


What is a p-value?


Many people use the term “statistical significance” incorrectly. By itself, it is not a signal to stop a test. So what is it, and why does it matter?
Let's start with p-values, which few people understand. Even scientists sometimes get them wrong!

The p-value is the probability of seeing a result at least as extreme as the observed one if the null hypothesis is true; it characterizes the risk of a Type I error when the null hypothesis is rejected. It does not tell you the probability that B is better than A. That is a common misconception.



To summarize: statistical significance (a statistically significant result) is achieved when the p-value falls below the significance level, which is usually set at 0.05.
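For two conversion rates, the p-value is typically computed with a two-proportion z-test. A sketch of that calculation; the conversion numbers are invented for illustration:

```python
from statistics import NormalDist

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """P-value for H0: 'both pages share the same true conversion rate',
    using the standard pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

# 300 conversions from 10,000 visitors (A) vs 360 from 10,000 visitors (B):
p = two_sided_p_value(300, 10_000, 360, 10_000)
significant = p < 0.05  # below the usual 0.05 significance level
```

A p-value below 0.05 here means only that a difference this large would be rare under the null hypothesis, not that B is 95% likely to be better.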

One-tailed and two-tailed A/B tests


A one-tailed test detects a change in one direction only, while a two-tailed test detects change in either direction (both positive and negative).

Don't worry if your testing software supports only one of these types of A/B test. If necessary, a one-tailed test is easily converted into a two-tailed one and vice versa (however, this must be done before the test starts). The only difference is the significance threshold.

If your software runs one-tailed tests, simply halve the p-value threshold. For a two-tailed test that is reliable at the 95% level, set the confidence level to 97.5%. If you want 99% reliability, select 99.5%.
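The halving rule is easy to verify numerically: raising a one-tailed confidence level to 97.5% reproduces the familiar two-tailed 95% z-threshold. A quick check using the standard library:

```python
from statistics import NormalDist

alpha = 0.05
# One-tailed at 95%: all of alpha sits in a single tail.
z_one_tailed = NormalDist().inv_cdf(1 - alpha)      # ~1.645
# Two-tailed at 95%: alpha is split, i.e. 97.5% one-tailed.
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.960
```

The two-tailed criterion is stricter (1.96 vs 1.645) precisely because the same error budget is shared between both directions.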


A conversion rate is never just X%. It should be read roughly as X% (± Y%). That second number is the confidence interval's margin of error, and it is essential to understanding split test results.



Confidence intervals are used in A/B testing to account for sampling error. In this sense, we are managing the risk involved in rolling out a new version of the page.

So if your tool reports something like “We are 95% sure that the conversion rate is X% ± Y%,” treat ±Y% as the margin of error.

The reliability of the results depends heavily on that margin of error. If the two conversion ranges overlap, keep testing until you get a result closer to the truth.
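The "X% ± Y%" form comes from the normal-approximation interval for a proportion. A sketch of that calculation, with invented conversion numbers, including the overlap check described above:

```python
from math import sqrt
from statistics import NormalDist

def conversion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate.
    Returns (rate, margin_of_error) -- i.e. the 'X% ± Y%' form."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate, margin

rate_a, moe_a = conversion_ci(300, 10_000)  # ~3.00% ± 0.33%
rate_b, moe_b = conversion_ci(360, 10_000)  # ~3.60% ± 0.37%

# If the upper end of A's range reaches into B's range, keep testing.
intervals_overlap = rate_a + moe_a > rate_b - moe_b
```

Note that the margin shrinks with the square root of the sample size: quadrupling traffic only halves ±Y%.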

Threats to external validity


Split tests are complicated by the fact that the data is not stationary.



A time series is stationary only if its statistical properties (mean, variance, autocorrelation, etc.) are constant over time. For many reasons, website data is not stationary, so we cannot make the same assumptions as we would for stationary data. Here are a few factors that can shift the data:

  • Seasonality;
  • Day of the week;
  • Holidays;
  • Positive or negative press mentions;
  • Other marketing campaigns;
  • PPC/SEM;
  • SEO;
  • Word of mouth.


These are just some of the factors to consider when analyzing the results of A/B tests.

Bayesian versus frequentist statistics


Many popular tools offer both Bayesian and frequentist approaches to A/B testing. What is the difference?

Put simply, Bayesian statistics assigns a probability to the hypothesis itself, while frequentist statistics tests the hypothesis without assigning it a probability.
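A minimal illustration of the Bayesian read-out, assuming uniform Beta(1, 1) priors and invented conversion numbers: instead of a p-value, you get a direct probability that B's true rate beats A's, estimated here by posterior sampling.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000):
    """Estimate P(true rate of B > true rate of A) under Beta(1, 1) priors.

    Each arm's posterior is Beta(1 + conversions, 1 + non-conversions);
    we sample both posteriors and count how often B comes out ahead.
    """
    wins = 0
    for _ in range(draws):
        sample_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws

random.seed(1)
# 300/10,000 conversions on A vs 360/10,000 on B:
p_b_better = prob_b_beats_a(300, 10_000, 360, 10_000)
```

The answer ("B is probably better, with this much certainty") is often easier to act on than a p-value, which is one reason many tools expose it.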

Each approach has its advantages. But if you are just getting started with A/B testing, choosing a methodology should be the least of your worries.

Conclusion


A/B testing is an invaluable source of information for anyone making decisions in an online environment. With a little knowledge and plenty of effort, you can avoid many of the risks that trip up novice optimizers.

Dig into the topic and you will be ahead of 90% of the people doing web analytics. Experience and constant practice will let you master this research method. So start testing!
