Improve your business with intelligent learning: set up impactful AB tests (part 1 — frequentist tests)

9 min read · Feb 23, 2024

Hello, I’m a statistician. I’ve been working with data for over 10 years, as a data analyst, data scientist, lead data scientist and now head of data. I love sharing what I’ve learned in a job that is also my passion.

From what I’ve seen, A/B tests are well known to the general public, but I’ve rarely seen them implemented correctly. Here are a few principles to help you make better decisions, while taking into account the risks and costs of implementing this type of test. I have implemented this kind of A/B test myself, with very interesting results!

In this first part, I detail how to carry out a frequentist test by the book in 5 simple steps:

Choose a good metric
Set the test parameters
Set the size of the two groups
Run your test
Make a decision and drive your business thanks to data

In subsequent articles, I will detail strategies for better estimating risks and costs, and limiting the opportunity costs (regret) inherent in any A/B test.

Use case: a website selling sunglasses

Let’s start with a concrete use case: you’re the proud founder of an online sunglasses business. What’s your brand? Glasses for you: fashionable glasses for everyone.

Illustration by author using “Party glasses” by Muhammad_Usman (Flaticon.com)

To keep your promise, you need to convince visitors to your site that the glasses you’re selling are on trend. This means highlighting the right colors and models.

Use case 1: which color to display

In the first case, we want to know which color should be highlighted on the site for a given model.

Illustration by author using “Sun Glasses” by Freepik and “Party glasses” by Muhammad_Usman (Flaticon.com)

Use case 2: which glasses to promote

For the second use case, we need to decide which eyewear model we want to promote: wayfarer or aviator.

Illustration by author using “Glasses”, “Sun Glasses” and “Heart Glasses” by Freepik and “Party glasses” by Muhammad_Usman (Flaticon.com)

In what follows, we’ll look at pairwise tests (1 color vs. 1 other, 1 model vs. 1 other), but it’s possible to generalize to more variants with minor adaptations.

Systematizing and accelerating testing: a vector for growth

Before jumping in, let’s take a look at the benefits of testing, to clarify what we need to focus on next.

Inform your decisions

The tests must enable us to make informed decisions based on reliable rationales. This makes it possible to:

stop what’s not working
reinforce convictions about what works
facilitate communication on decisions that need to mobilize a large number of people in the company (stakeholders, marketing teams, creative teams, etc.).

Speed up change

Beyond better decisions, establishing a culture of testing will promote a more daring culture thanks to better risk management. It will help minimize errors and ultimately strengthen innovation by creating a knowledge base on which to capitalize.

1-Choose a good metric

Measuring: What? How?

Clarify the final objective. For a test to be relevant, it must answer a precise question. It is therefore important to define the final objective in order to design the test properly, and to ask yourself the following two questions:

what action / operation are we planning to carry out?
what is the value to be derived from this action / operation?

In the case of Glasses for you, here are the actions and the value:

use case 1: we want to modify the content of the application, hoping to encourage conversion
use case 2: we want to highlight a product through promotions, hoping to increase sales (we will see later that this is not sufficient).

These two questions will be crucial in determining costs, estimating risk and establishing the best decisions.

An AB test must determine the profitability of an operation, not its impact on sales.

Determine the cost of your operation / action

The cost of the operation. The main reason for determining the action to be taken (a marketing operation, for example) is to establish its cost, and to set a minimum level of profit at which the action can be considered profitable. When Glasses for you runs a 5% promotion on one of its models, this represents a loss of earnings for the company on each sale. The test must therefore determine whether the increase in sales volume over-compensates for the cost of the promotion.

On the other hand, testing is costly. It requires time, resources and expertise, and if one group gets worse results than the other, there’s an opportunity cost (a regret we’ll discuss in the section on multi-armed bandits).

The single metric used for testing must reflect the profitability of the operation. The metric closest to the final objective should be used (yes, because the final objective is often not immediately measurable), and costs should be taken into account. For example, a Glasses for you promotion might use a metric such as net income, whereas a raw sales figure might not take sufficient account of costs. There must be only one metric, so that the decision is unambiguous.
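To illustrate why net income and sales can point in opposite directions, here is a minimal sketch. The function name, the prices, the unit cost and the order counts are all hypothetical figures chosen for the example, not real Glasses for you data.

```python
def net_income_per_visitor(n_visitors, n_orders, price, unit_cost, discount=0.0):
    """Average net income per visitor for one group (all inputs are illustrative)."""
    revenue_per_order = price * (1.0 - discount)      # what the customer actually pays
    margin_per_order = revenue_per_order - unit_cost  # what is left after the unit cost
    return n_orders * margin_per_order / n_visitors


# Hypothetical figures: 10,000 visitors per group, glasses priced at 100, unit cost of 60.
control = net_income_per_visitor(10_000, 300, price=100.0, unit_cost=60.0)
promo = net_income_per_visitor(10_000, 330, price=100.0, unit_cost=60.0, discount=0.05)

print(f"control: {control:.3f} per visitor")
print(f"promo:   {promo:.3f} per visitor")
```

With these made-up numbers, the promotion sells 10% more pairs but earns less per visitor, which is exactly what a sales-only metric would hide.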

Make sure you have the data to perform the test

Before designing a test, it is also useful to determine the availability of data. For example, for the first Glasses for you use case, there’s no point in making a hyper-detailed plan if you don’t have the means to associate a purchase with a version of the website (“red glasses” version or “blue glasses” version).

2-Set the test parameters

The principle of the test

Random selection. The basis of the test is a random sample of individuals to constitute group A and group B. I insist: the assignment must be random, because without it none of the calculations that follow are valid. It’s paradoxical, but introducing a stochastic element is what makes it possible to estimate value distributions, thanks to the central limit theorem.
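As a minimal sketch of random assignment, assuming visitors are identified by an ID and that each one is assigned independently with a fixed probability (the function name, the seed and the visitor IDs are hypothetical):

```python
import numpy as np


def assign_groups(visitor_ids, share_a=0.5, seed=42):
    """Randomly assign each visitor to group "A" or "B" with independent draws."""
    rng = np.random.default_rng(seed)
    draws = rng.random(len(visitor_ids))  # uniform numbers in [0, 1)
    groups = np.where(draws < share_a, "A", "B").tolist()
    return dict(zip(visitor_ids, groups))


assignment = assign_groups([f"visitor_{i}" for i in range(10_000)])
n_a = sum(1 for group in assignment.values() if group == "A")
print(f"group A: {n_a}, group B: {len(assignment) - n_a}")
```

In production the assignment would also need to be stored, so that each purchase can later be linked to the version of the site the visitor saw.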

Comparing the two groups. When performing the test, we compare the results of the two groups. There are two opposing hypotheses that we will validate or invalidate.

The H0 hypothesis is the case where there is no difference between the groups.
The H1 alternative hypothesis is the case where there is a difference. For example, Group A (which has received a promotion) has a higher net income per individual (r) than Group B.

To obtain the test result, we set a threshold. If the difference between the groups is greater than this threshold, we conclude that the difference is significant (H1); otherwise we cannot reject the absence of a difference (H0). In other words, if the net income per individual for group A exceeds that of group B by more than the threshold, I conclude that the promotion has an effect.

But as we have a particular sample, it’s possible that the result obtained in the test does not reflect the result we’ll get in the long term. This is because we may have come across very particular samples.

The larger the group size, the more accurate the results
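A quick simulation makes this visible. The sketch below assumes a normally distributed metric with made-up mean and standard deviation, and measures how much the group average fluctuates from one random group to the next for several group sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 1.2, 8.0  # hypothetical net income per visitor and its spread


def spread_of_sample_mean(n, n_repeats=2_000):
    """Standard deviation of the group average over many simulated groups of size n."""
    samples = rng.normal(true_mean, true_std, size=(n_repeats, n))
    return samples.mean(axis=1).std()


spreads = {n: spread_of_sample_mean(n) for n in (100, 400, 1_600)}
for n, spread in spreads.items():
    theory = true_std / np.sqrt(n)  # standard error predicted by the theory
    print(f"n={n:>5}: observed spread {spread:.3f}, theory {theory:.3f}")
```

Multiplying the group size by 4 only halves the fluctuation, which is why detecting small effects quickly becomes expensive.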

So now you have to choose the minimum detectable effect you’re looking for and the risks you’re prepared to accept, and adjust your test size accordingly.

Which minimum detectable effect is desirable for your use case

For frequentist tests in particular, you’ll need to have an idea of the minimum effect desired for your action or operation. After all, we’re not going to set up a giant test to see whether the promotional system could bring in $1. Not only does this dollar not compensate for the hidden costs we couldn’t take into account, but chasing it also slows down the implementation of other, more profitable operations.

Choosing the level of risks

The first type of risk (type I error). The first risk is that of rejecting the null hypothesis when in fact there is no fundamental difference between the two groups (there is just a difference between the imperfect samples I’ve drawn). The risk value corresponds to the blue area in the diagram below. It covers all the possible cases in which the difference between the two groups is greater than our threshold, even though we are under the null hypothesis (the distribution of the difference is centered on 0).

The second type of risk (type II error). This is the opposite: there is a fundamental difference between the two groups, but the difference between our two samples is small (no luck, we’ve drawn unrepresentative samples). The risk value corresponds to the orange area in the diagram below (to the left of the threshold).

By adjusting the size of our groups, we can refine the precision of our test and reduce risks.

Illustration by author
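The two areas can be computed directly from the Gaussian distribution. This is a minimal sketch, assuming equal group sizes, a common standard deviation in both groups, and hypothetical values for the threshold and the true effect.

```python
from scipy.stats import norm


def risks(threshold, effect, std, n_per_group):
    """Type I and type II risks for a given decision threshold on the difference of means."""
    se = std * (2.0 / n_per_group) ** 0.5              # standard error of the difference
    alpha = norm.sf(threshold, loc=0.0, scale=se)      # area above the threshold under H0
    beta = norm.cdf(threshold, loc=effect, scale=se)   # area below the threshold under H1
    return alpha, beta


results = []
for n in (1_000, 4_000, 16_000):
    alpha, beta = risks(threshold=0.25, effect=0.5, std=8.0, n_per_group=n)
    results.append((n, alpha, beta))
    print(f"n={n:>6}: type I risk {alpha:.4f}, type II risk {beta:.4f}")
```

For a fixed threshold, both risks shrink as the groups grow, because the two distributions become narrower and overlap less.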

3-Set the size of the two groups

Fortunately, statistical theory gives us an equation for determining group size. The proof is beyond the scope of this article, but I’ve checked it: it works!

To determine the Z values associated with risk, we use the Gaussian distribution. This is a demonstrable consequence of the central limit theorem.
The variance of the metric value within each group can be estimated using historical values, for example.
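For reference, the usual textbook form of this equation is sketched below. It assumes a one-sided test, a common variance σ² in both groups, a minimum detectable effect δ, type I and type II risks α and β, and a ratio r = n_B / n_A between group sizes. These symbols are my notation for this sketch.

```latex
n_A = \left(1 + \frac{1}{r}\right)
      \frac{\sigma^2 \left(z_{1-\alpha} + z_{1-\beta}\right)^2}{\delta^2},
\qquad
n_B = r \, n_A
```

With equal groups (r = 1), this reduces to n = 2σ²(z_{1-α} + z_{1-β})² / δ² per group.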

Here is an implementation with the library scipy.
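A minimal sketch of such a calculation, assuming a one-sided test on the difference of two means with a common standard deviation (the function name, the parameter names and the example figures are illustrative):

```python
import math

from scipy.stats import norm


def group_sizes(min_effect, std, alpha=0.05, power=0.80, ratio=1.0):
    """Sizes of groups A and B for a one-sided test on the difference of two means.

    ratio = n_B / n_A; std is the (assumed common) standard deviation of the metric.
    """
    z_alpha = norm.ppf(1.0 - alpha)  # Gaussian quantile for the type I risk
    z_beta = norm.ppf(power)         # Gaussian quantile for the type II risk (power = 1 - beta)
    n_a = (1.0 + 1.0 / ratio) * (std * (z_alpha + z_beta) / min_effect) ** 2
    n_a = math.ceil(n_a)
    return n_a, math.ceil(ratio * n_a)


# Hypothetical inputs: detect a lift of 0.5 per visitor when the metric has a spread of 8.
balanced = group_sizes(min_effect=0.5, std=8.0)
unbalanced = group_sizes(min_effect=0.5, std=8.0, ratio=0.5)
print("balanced groups:  ", balanced)
print("unbalanced groups:", unbalanced)
```

The standard deviation would typically come from historical data, and the minimum effect from the profitability threshold discussed earlier.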

It is possible to reduce the size of one of the two groups, but in this case you’ll have to enlarge the other by more than you saved to keep the same level of precision. You just have to set the ratio r between the group sizes.

4-Run your test

Now that everything is set, you have to collect your data and run the test. We compute the difference between the groups for the chosen metric, then normalize this result. It can be shown that under H0 this value follows a Gaussian distribution. We therefore consider that if our result exceeds the 95th percentile of such a distribution, we can reject the null hypothesis. We will then conclude that there is indeed an effect.

I won’t go into the details, but following the test you can go a little further in estimating the effective risk:

either reject H0 and use the p-value to estimate your type I risk
or fail to reject H0 and recalculate your type II risk to refine it.

Concretely, this test can be implemented in Python as follows:
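The sketch below is one possible implementation, assuming large groups (so the Gaussian approximation holds) and a one-sided alternative "group A is better than group B". The function name is illustrative and the data is simulated, not taken from a real test.

```python
import numpy as np
from scipy.stats import norm


def one_sided_z_test(metric_a, metric_b, alpha=0.05):
    """Test H1: mean(A) > mean(B) using a normal approximation."""
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    diff = a.mean() - b.mean()
    # Standard error of the difference, estimated from the two samples
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = diff / se              # normalized difference, Gaussian under H0
    p_value = norm.sf(z)       # probability of a value at least this large under H0
    return z, p_value, bool(p_value < alpha)


# Simulated net income per visitor (hypothetical means and spread)
rng = np.random.default_rng(1)
group_a = rng.normal(1.7, 8.0, size=4_000)  # received the promotion
group_b = rng.normal(1.2, 8.0, size=4_000)  # control

z, p_value, reject_h0 = one_sided_z_test(group_a, group_b)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Comparing the p-value to alpha = 0.05 is equivalent to checking whether z exceeds the 95th percentile of the standard Gaussian distribution.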

5-Make a decision and drive your business thanks to your data

Now that you’ve got all the facts, you can make a decision about whether or not to promote “top-gun” eyewear.

But if you want to make projections for your business, I strongly advise you to use confidence intervals rather than a single value. For example, you may observe 10% growth in net revenue in your test, whereas the confidence interval tells you that the real growth is very likely to be between 3% and 12%, which is very different.
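A confidence interval for the difference between the two groups can be sketched as follows, again assuming large groups and a Gaussian approximation. The function name and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def diff_confidence_interval(metric_a, metric_b, confidence=0.95):
    """Two-sided confidence interval for mean(A) - mean(B) (normal approximation)."""
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    half_width = norm.ppf(0.5 + confidence / 2.0) * se
    return diff - half_width, diff + half_width


# Simulated net income per visitor (hypothetical means and spread)
rng = np.random.default_rng(3)
group_a = rng.normal(1.7, 8.0, size=4_000)
group_b = rng.normal(1.2, 8.0, size=4_000)

low, high = diff_confidence_interval(group_a, group_b)
print(f"observed lift: {group_a.mean() - group_b.mean():.3f}")
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
```

Planning with the lower bound of the interval rather than the observed lift gives a more conservative projection.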

Conclusion

Now you’ve designed a test corresponding to your needs, while controlling the associated risks. You’ll be able to make a decision with a clear rationale. I haven’t gone through all the proofs, but the references below give more detail.

In the next article, we’ll look at how to refine risk calculation and reduce testing time with Bayesian testing. Then we’ll write about multi-armed bandits and reinforcement learning.

Take Away

References

Tests paramétriques de comparaison de 2 moyennes, José LABARERE, Université Joseph Fourier de Grenoble
https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f