Making Sense of A/B Testing: Understand Better with Hard Questions

Uncover the counterintuitive aspects of A/B testing through challenging questions, improve your understanding, and steer clear of mistakes


This article highlights common statistical errors in the context of experiments. It’s set up as five questions with answers that many find counterintuitive. It’s tailored for those who are already familiar with A/B tests but are aiming to expand their understanding. This can help you prevent common errors in your daily work or ace a job interview.

Question 1: You’ve conducted an A/B test (α = 0.05, β = 0.2), which yields a statistically significant result. In this scenario, what is the likelihood that it’s a true positive?

Imagine you only ever tested hypotheses that actually work. Then 100% of statistically significant A/B tests would be true positives. Now imagine none of your hypotheses work: 100% of statistically significant A/B tests would be false positives.

These two extremes are meant to demonstrate that it’s impossible to answer this question without an extra step — an assumption about the distribution of hypotheses.

Let’s try one more time and assume that 10% of the hypotheses we test are effective. Then, observing a statistically significant result from an A/B test implies there’s a 64% chance that it’s a true positive: by Bayes’ theorem, (1 − 0.2) × 0.1 / ((1 − 0.2) × 0.1 + 0.05 × (1 − 0.1)) = 0.64.
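For concreteness, here is a minimal Python sketch of that calculation. The function name and the 10% prior are illustrative choices, not anything dictated by the test itself.

```python
# Probability that a statistically significant result is a true positive,
# given alpha, power (1 - beta), and an assumed share of effective hypotheses.
def true_positive_probability(alpha: float, power: float, prior: float) -> float:
    true_positives = power * prior          # effective hypotheses that reach significance
    false_positives = alpha * (1 - prior)   # ineffective hypotheses that reach significance anyway
    return true_positives / (true_positives + false_positives)

print(true_positive_probability(alpha=0.05, power=0.8, prior=0.1))  # ≈ 0.64
```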


Question 2: Suppose the null hypothesis is true. Under this circumstance, would a higher or lower p-value be more likely?

Many think it’s the former. This seems intuitive: when there’s no effect, the result is more likely to be far from statistical significance, hence a higher p-value.

However, the answer is neither. When the null hypothesis is true, p-values are distributed uniformly.

The confusion arises because people often visualise these concepts in terms of z-scores, sample means, or differences in sample means, all of which are normally distributed. That makes the uniformity of p-values harder to grasp.

Let’s illustrate this with a simulation. Assume that both the treatment and control groups are drawn from the same normal distribution (μ = 0, σ = 1), meaning the null hypothesis is true. We’ll then compare their means, calculate p-values, and repeat this process many times. For simplicity, let’s only keep the cases where the mean of the treatment group is larger, and then focus on two p-value intervals: 0.8 to 0.9 and 0.1 to 0.2.
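A minimal sketch of such a simulation is shown below; the per-group sample size, the number of replications, and the use of a two-sample t-test are my own illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sims = 100, 20_000          # per-group sample size and number of simulated A/B tests

p_values = []
for _ in range(n_sims):
    control = rng.normal(0, 1, n)
    treatment = rng.normal(0, 1, n)          # same distribution, so the null is true
    if treatment.mean() > control.mean():    # keep only cases where treatment looks 'better'
        p_values.append(stats.ttest_ind(treatment, control).pvalue)

p_values = np.array(p_values)
# Under the null, equal-width p-value intervals hold roughly equal shares of outcomes.
print(((p_values >= 0.8) & (p_values < 0.9)).mean())   # ≈ 0.1
print(((p_values >= 0.1) & (p_values < 0.2)).mean())   # ≈ 0.1
```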

When we map these p-value intervals onto the distribution of simulated differences, the picture becomes clearer. The density near zero is higher, but the corresponding interval is narrower there. Conversely, as we move away from zero, the density shrinks but the intervals get wider. This is by construction: p-values are computed so that intervals of equal length cover the same area under the curve.


Question 3: Due to some technical or business constraints, you’ve run an A/B test with a smaller-than-usual sample size. The result is barely significant. However, the effect size is large, larger than what you typically see in similar A/B tests. Should the larger effect size bolster your confidence in the result?

Not really. For an effect to be classified as significant (at α = 0.05), it must be roughly two standard errors away from zero in either direction. As the sample size shrinks, the standard error grows. This implies that statistically significant effects observed in smaller samples tend to be larger.

The simulation below demonstrates this: these are the absolute effect sizes of significant A/B tests when both groups (N = 1,000) are sampled from the same normal distribution (μ = 0, σ = 1).
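Here is a hedged sketch of that simulation, again assuming a two-sample t-test and an arbitrary number of replications.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, alpha = 1_000, 10_000, 0.05

significant_effects = []
for _ in range(n_sims):
    control = rng.normal(0, 1, n)
    treatment = rng.normal(0, 1, n)          # the null is true: any 'significant' effect is noise
    _, p_value = stats.ttest_ind(treatment, control)
    if p_value < alpha:
        significant_effects.append(abs(treatment.mean() - control.mean()))

# With sigma = 1 and n = 1,000 per group, the standard error of the difference is
# sqrt(2 / 1_000) ≈ 0.045, so every significant effect is at least ~2 * 0.045 ≈ 0.09 in size.
print(round(min(significant_effects), 3), round(float(np.mean(significant_effects)), 3))
```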


Question 4: Let’s build on the understanding gained from the previous question. Is it possible to detect a true effect that is smaller than 2 standard errors?

Yes, although the semantics here get muddy. The true effect size could be much smaller than 2 standard errors, yet you would still expect a certain fraction of A/B tests to come out statistically significant.

However, under these conditions your detected effect size is always exaggerated. Imagine that the true effect is 0.4, but you’ve detected an effect of 0.5 with a p-value of 0.05. Would you consider this a true positive? What if the true effect size is only 0.1, yet you again detect an effect of 0.5? Is it still a true positive if the true effect is a mere 0.01?

Let’s visualise this scenario. Control groups (N=100) are sampled from a normal distribution (μ = 0, σ = 2), while treatment groups (N=100) are sampled from the same distribution but with μ varying from 0.1 to 1. Regardless of the true effect size, a successful A/B test generates an estimated effect size of at least 0.5. When the true effects are smaller than this, the resulting estimate is clearly inflated.
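A rough sketch of this simulation follows; the grid of true effect sizes and the number of replications are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims, alpha, sigma = 100, 5_000, 0.05, 2

for true_effect in (0.1, 0.3, 0.5, 1.0):
    estimates = []
    for _ in range(n_sims):
        control = rng.normal(0, sigma, n)
        treatment = rng.normal(true_effect, sigma, n)
        _, p_value = stats.ttest_ind(treatment, control)
        if p_value < alpha and treatment.mean() > control.mean():
            estimates.append(treatment.mean() - control.mean())
    # For small true effects, only inflated estimates ever reach significance.
    print(f"true effect {true_effect}: mean significant estimate "
          f"{np.mean(estimates):.2f}, minimum {min(estimates):.2f}")
```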


This is why some statisticians avoid dividing outcomes into binary categories like ‘true positives’ or ‘false positives’. Instead, they treat them in a more continuous manner [1].

Question 5: You’ve conducted an A/B test that produces a significant result, with a p-value of 0.04. However, your boss remains unconvinced and asks for a second test. This subsequent test doesn’t yield a significant result, presenting a p-value of 0.25. Does this mean that the original effect wasn’t real, and the initial result was a false positive?

There’s always a risk in interpreting p-values as a binary, lexicographic decision rule. Let’s remind ourselves what a p-value actually is: a measure of surprise. It’s random, it’s continuous, and it’s only one piece of evidence.

Imagine the first experiment (p = 0.04) was run on 1,000 users and the second one (p = 0.25) on 10,000 users. Apart from the obvious difference in sample size, the second A/B test, as we discussed in Questions 3 and 4, probably produced a much smaller estimated effect size, one that might no longer be practically significant.

Let’s reverse the scenario: the first test (p = 0.04) was run on 10,000 users and the second (p = 0.25) on 1,000. Here we are much more confident that the effect ‘exists’.

Now, imagine both A/B tests were identical in design. In this situation, you’ve observed two fairly similar, somewhat surprising results, neither of which is particularly consistent with the null hypothesis. The fact that they fall on opposite sides of 0.05 is not terribly important. What matters is that observing two fairly small p-values in a row when the null is true is unlikely.

One question we might consider is whether this difference is statistically significant itself. Categorising p-values in a binary way skews our intuition, making us believe there’s a vast, even ontological, difference between p-values on different sides of the cutoff. However, the p-value is a continuous measure, and two A/B tests with different p-values can present very similar evidence against the null [2].

Another way to look at this is to combine the evidence. Assuming the null hypothesis is true for both tests, Fisher’s method gives a combined p-value of roughly 0.05. There are other methods for combining p-values, but the general logic is the same: a sharp null is rarely a realistic hypothesis, so enough ‘surprising’ outcomes, even if none of them is statistically significant individually, can be sufficient to reject it.

Fusing two p-values by using Fisher’s method. Image by Chen-Pan Liao, from Wikipedia
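For reference, this is how the combination can be reproduced in Python, both by hand and with SciPy’s combine_pvalues helper, using the two p-values from the example above.

```python
import numpy as np
from scipy import stats

p1, p2 = 0.04, 0.25

# Fisher's method: X = -2 * sum(ln(p_i)) follows a chi-squared distribution
# with 2k degrees of freedom under the null, where k is the number of p-values.
chi2_stat = -2 * (np.log(p1) + np.log(p2))
print(round(stats.chi2.sf(chi2_stat, df=4), 3))          # ≈ 0.056

# The same combination using SciPy's built-in helper.
print(stats.combine_pvalues([p1, p2], method="fisher"))
```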

Conclusion

The null hypothesis significance testing framework, which we commonly use to analyze A/B tests, isn’t especially intuitive. Without regular mental practice, we often revert to an ‘intuitive’ understanding, which can be misleading. We may also develop routines to ease this cognitive burden. Unfortunately, these routines often become somewhat ritualistic, with adherence to formal procedures overshadowing the actual objective of inference.

References

[1] McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1), 235–245.
[2] Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328–331.
