Introduction
- For simplicity, in this section, let's assume we are only comparing control with a single treatment/variant. If we have more variants, the underlying tests will change.
- We will look at three different tests based on the type of question we would like to answer from a corresponding experiment.
- These tests are the two-sample t test, the chi-squared test, and the Poisson means test.
- There are always alternative tests that can be used depending on the assumptions.
- For instance, instead of the Poisson means test listed above, one could use the non-parametric Mann–Whitney U test.
Sample Size Considerations
- Recall the terminology
- Minimum detectable effect: the smallest effect of the new treatment/variation over the baseline/control that we want the test to detect.
- Sample size per variation: the number of users that need to be exposed to the treatment and control to be able to conclusively reject or not reject the null hypothesis.
- There are a few key choices that determine the test specs. These are:
- Statistical power of the test ($1-\beta$): which is the probability with which the test will be able to detect the minimum detectable effect if it exists.
- Significance level ($\alpha$): which is the probability with which the test will make a mistake and detect the minimum detectable effect when it does not exist. For instance, if the control and the variation/treatment are in fact identical, there is still an $\alpha$ (e.g., 5%) chance of "detecting" that the variation is better than the control!
- Typically, specifying an acceptable level of power, the minimum detectable effect size, and the significance level determines the number of samples needed (see the code sketch at the end of this section).
- For instance, if we want to detect what we believe is a small improvement over control, we may have to test longer (i.e., test on more users) to be able to reject the null hypothesis.
- Demo: Evan’s A/B tools
- Demo: Optimizely
- Caveat: We are only discussing fixed horizon tests and the sample size requirements thereof. Optimizely’s calculator uses a more sophisticated test suite that allows for early stopping. So the estimated sample sizes may differ.
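- Below is a minimal sketch of how these choices translate into a sample size per variation, using statsmodels' power utilities. The baseline conversion rate (10%), the minimum detectable effect (an absolute lift to 12%), $\alpha = 0.05$, and power $= 0.8$ are illustrative assumptions, not values from these notes.

```python
# Sketch: sample size per variation for a conversion-rate experiment.
# Baseline rate, lift, alpha, and power below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10       # control conversion rate
treatment_rate = 0.12      # baseline + minimum detectable effect
effect_size = proportion_effectsize(treatment_rate, baseline_rate)  # Cohen's h

n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # 1 - beta
    ratio=1.0,             # equal allocation to control and treatment
    alternative="two-sided",
)
print(f"Samples needed per variation: {n_per_variation:.0f}")
```

- Shrinking the minimum detectable effect or raising the power makes the required sample size grow quickly, which is the trade-off discussed above.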
Two-sample t Test
- The two-sample t test allows one to answer the following question:
Is there a difference in the average value across two groups?
- We can pair the null hypothesis (the two means are equal) with different alternatives:
- Two-sided: the two means differ
- One-sided: the mean of the first group is higher than the second
- One-sided: the mean of the second group is higher than the first
- In this test, a t-statistic is computed from the difference between the two group means, scaled by its standard error; a code sketch follows this list.
- Example application:
- The session lengths of users on an app (e.g., Netflix)
- Demo
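- A minimal sketch of the two-sample t test on session lengths, with simulated data standing in for real user sessions (the scale parameters are made up for illustration):

```python
# Sketch: two-sample t test on session lengths (minutes), control vs. treatment.
# The data are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.exponential(scale=30.0, size=1_000)    # ~30-minute average sessions
treatment = rng.exponential(scale=32.0, size=1_000)  # ~32-minute average sessions

# Welch's t test (does not assume equal variances); two-sided by default.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

- For a one-sided hypothesis, `scipy.stats.ttest_ind` accepts `alternative="greater"` or `alternative="less"`.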
Chi-squared Test
- This test allows for answering the following question:
Is there a difference in the rates of success across two groups?
- The null hypothesis here is that the rates of success are equal.
- The chi-squared statistic is computed from the observed counts of successes and failures in the two groups, compared against the counts expected if the success rates were equal; a code sketch follows this list.
- Example application:
- Successful conversion events (e.g., whether a user downloaded an app or registered for a free trial).
- Demo
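- A minimal sketch of the chi-squared test on conversion counts; the conversion and total counts below are made up for illustration:

```python
# Sketch: chi-squared test of equal conversion rates across two groups.
# The counts (conversions out of total users per group) are illustrative.
import numpy as np
from scipy import stats

control_conversions, control_total = 180, 2_000
treatment_conversions, treatment_total = 220, 2_000

# 2x2 contingency table: rows = groups, columns = (converted, not converted)
table = np.array([
    [control_conversions, control_total - control_conversions],
    [treatment_conversions, treatment_total - treatment_conversions],
])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```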
Poisson Means Test
- This test lets us answer the following:
Is the rate of arrival of events (of a certain type) different across two time periods or across two groups?
- The null hypothesis here is that the two rates are equal; a code sketch follows this list.
- Example application:
- Counts of successful content interactions (e.g., news article clicks on a search page or a marketing page)
- Demo
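- One way to carry this out is the exact conditional test: under the null of equal rates, the count in one group given the combined total is binomial, so a binomial test gives the p-value. A minimal sketch, with counts and exposures made up for illustration:

```python
# Sketch: comparing two Poisson rates with the exact conditional test.
# Under H0 (equal rates), the count in the treatment group given the combined
# total is Binomial(n1 + n2, t_treatment / (t_control + t_treatment)).
# The counts and exposures below are illustrative.
from scipy.stats import binomtest

count_control, exposure_control = 400, 10_000      # clicks, user-days
count_treatment, exposure_treatment = 460, 10_000

total = count_control + count_treatment
p_null = exposure_treatment / (exposure_control + exposure_treatment)

result = binomtest(count_treatment, n=total, p=p_null, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")
```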
How to Remember?
- It may seem that there are a lot of tests, each with its own assumptions about the data. That is true. But one way to manage the complexity is to know a few popular tests well.
- Another way is to think of tests from a linear-model perspective. This is explained well at https://lindeloev.github.io/tests-as-linear/. The gist is that many hypothesis tests are essentially tests on the coefficients of a corresponding linear model. With this viewpoint, it is also easy to understand the assumptions being made; the sketch below shows the equivalence for the two-sample t test.
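- A minimal sketch of that equivalence: regressing the outcome on a 0/1 group indicator and testing the group coefficient gives the same p-value as the equal-variance two-sample t test (the data here are simulated for illustration):

```python
# Sketch: the two-sample t test viewed as a linear model.
# The p-value of the 'group' coefficient matches the t test's p-value.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y": np.concatenate([rng.normal(10.0, 2, 500), rng.normal(10.3, 2, 500)]),
    "group": [0] * 500 + [1] * 500,
})

lm = smf.ols("y ~ group", data=df).fit()
t_stat, p_value = stats.ttest_ind(df.y[df.group == 1], df.y[df.group == 0])

print(f"linear-model p = {lm.pvalues['group']:.6f}")
print(f"t-test p       = {p_value:.6f}")  # same value as the coefficient's p
```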
Gotchas with Testing
- There are many ways to make wrong inferences with A/B testing.
- The simplest is misunderstanding what the p-value means. The p-value is the probability, under the null hypothesis, of observing a result at least as extreme as the one actually observed.
- Most classical tests (at least the ones seen in introductory stats textbooks) are for a fixed-horizon setting where the sample size is pre-determined. That is very different from running the experiment until a statistically significant result appears: the latter will produce "significant" results far more often than the nominal significance level suggests.
- Stopping only when you have a significant outcome inflates the rate of false positives (by a huge margin). This error in conducting a clean experiment is called peeking; the simulation sketch at the end of this section illustrates how large the inflation can be.
- There are adaptive testing techniques that do not require committing to a specific sample size in advance. These go by the names sequential experimental design and Bayesian experimental design.
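- A small simulation sketch of the peeking problem: both groups are drawn from the same distribution (so the null is true), but we check the p-value after every batch and stop at the first "significant" result. The batch size, number of batches, and number of simulations are arbitrary choices for illustration.

```python
# Sketch: how peeking inflates the false positive rate under the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_batches, batch_size, alpha = 2_000, 20, 100, 0.05

false_positives = 0
for _ in range(n_sims):
    control, treatment = [], []
    for _ in range(n_batches):
        control.extend(rng.normal(0, 1, batch_size))
        treatment.extend(rng.normal(0, 1, batch_size))  # same distribution
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:          # peek and stop at the first significant result
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.2%}")
# Well above the nominal 5% of a fixed-horizon test.
```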