Introduction

  • For simplicity, in this section, let's assume we are only comparing a control with a single treatment/variant. If we have more variants, the underlying tests change.
  • We will look at three different tests based on the type of question we would like to answer from a corresponding experiment.
  • These tests are: the two-sample t test, the chi-squared test, and the Poisson means test.
  • There are always alternative tests that can be used depending on the assumptions.

Sample Size Considerations

  • Recall the terminology
    • Minimum detectable effect: the smallest improvement of the new treatment/variation over the baseline/control that we want the test to be able to detect.
    • Sample size per variation: the number of users that need to be exposed to the treatment and control to be able to conclusively reject or not reject the null hypothesis.
  • There are a few key choices that determine the test specs. These are:
    • Statistical power of the test ($1-\beta$): the probability with which the test will detect the minimum detectable effect if it exists.
    • Significance level ($\alpha$): the probability with which the test will mistakenly detect the minimum detectable effect when it does not exist. For instance, if the control and the variation/treatment are identical, there is still a probability $\alpha$ of (incorrectly) detecting that the variation is better than the control!
  • Typically, specifying an acceptable level of power, the minimum detectable effect size, and the significance level determines the number of samples needed (see the sketch after this list).
  • For instance, if we want to detect what we believe is a small improvement over control, we may have to test longer (i.e., test on more users) to be able to reject the null hypothesis.
  • Demo: Evan’s A/B tools
  • Demo: Optimizely
    • Caveat: We are only discussing fixed horizon tests and the sample size requirements thereof. Optimizely’s calculator uses a more sophisticated test suite that allows for early stopping. So the estimated sample sizes may differ.
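
As a rough illustration of how these choices translate into a sample size, here is a minimal Python sketch using statsmodels. The baseline conversion rate (10%), the target rate (12%), the 5% significance level, and the 80% power are made-up values chosen only to show the calculation, not recommendations.

```python
# Sketch: estimating the sample size per variation for a fixed-horizon test
# on conversion rates. All numbers below are assumptions for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # assumed control conversion rate
target = 0.12     # baseline + minimum detectable effect (assumed)
alpha = 0.05      # significance level
power = 0.80      # 1 - beta

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target, baseline)

n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power,
    ratio=1.0, alternative="two-sided",
)
print(f"Approximate sample size per variation: {n_per_variation:.0f}")
```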

Two-sample t Test

  • The two-sample t test allows one to answer the following question:

Is there a difference in the average value across two groups?

  • We can have different null hypotheses:
    • The two means are equal (two-sided test)
    • The mean of the first group is at least as large as that of the second (one-sided test)
    • The mean of the second group is at least as large as that of the first (one-sided test)
  • In this test, a t-statistic is computed based on the difference between the means of the two groups (a sketch follows this list).
  • Example application:
    • The session lengths of users on an app (e.g., Netflix)
  • Demo
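
A minimal sketch of the two-sample t test in Python, using simulated session lengths; the means, spread, and sample sizes are arbitrary choices for illustration. Welch's variant is used so that equal variances need not be assumed.

```python
# Sketch: comparing average session length (in minutes) between control and
# treatment with a two-sample t test. The data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=30.0, scale=8.0, size=1000)    # assumed session lengths
treatment = rng.normal(loc=31.0, scale=8.0, size=1000)  # assumed session lengths

# Welch's t test (equal_var=False) avoids assuming equal variances.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```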

Chi-squared Test

  • This test allows for answering the following question:

Is there a difference in the rates of success across two groups?

  • The null hypothesis here is that the rates of success are equal.
  • The chi-squared statistic is computed from the observed success and failure counts of the two groups (a sketch follows this list).
  • Example application:
    • Successful conversion events (e.g., for everyone who downloaded an app or registered for free trials).
  • Demo
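
A minimal sketch of the chi-squared test on a 2x2 contingency table; the conversion counts below are made up for illustration.

```python
# Sketch: testing whether conversion rates differ between control and treatment
# with a chi-squared test on a 2x2 contingency table. Counts are assumptions.
from scipy.stats import chi2_contingency

#                   [converted, not converted]
control_counts   = [180, 1820]   # assumed: 180 conversions out of 2000 users
treatment_counts = [220, 1780]   # assumed: 220 conversions out of 2000 users

table = [control_counts, treatment_counts]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```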

Poisson Means Test

  • This test lets us answer the following:

Is the rate of arrival of events (of a certain type) different across two time periods or across two groups?

  • The null hypothesis here is that the two rates are equal (a sketch follows this list).
  • Example application:
    • Counts of successful content interactions (e.g., news article clicks on a search page or a marketing page)
  • Demo
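
One way to carry out a Poisson means test is the conditional (exact) route: under the null of equal rates, conditioning on the total count, the count in the first group is binomially distributed. The sketch below uses this approach with made-up counts and observation windows, and assumes a SciPy version that provides scipy.stats.binomtest.

```python
# Sketch: an exact Poisson means test via the conditional binomial argument.
# Under the null of equal rates, conditional on the total count, the count in
# group 1 is Binomial(c1 + c2, t1 / (t1 + t2)). Counts and exposures below are
# made-up numbers for illustration.
from scipy.stats import binomtest

c1, t1 = 530, 7.0   # assumed: clicks and observation window (days) for group 1
c2, t2 = 610, 7.0   # assumed: clicks and observation window (days) for group 2

result = binomtest(c1, n=c1 + c2, p=t1 / (t1 + t2), alternative="two-sided")
print(f"p = {result.pvalue:.4f}")
```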

How to Remember?

  • It may seem that there are a lot of tests, each with its own assumptions about the data. That is true. But a way to approach the testing complexity is to know a few popular tests.
  • Another way is to think of tests from a linear-model perspective. This is explained well at https://lindeloev.github.io/tests-as-linear/. The gist is that many hypothesis tests are essentially tests on the coefficients of a corresponding linear model. With this viewpoint, it is also easier to understand the assumptions being made (a small sketch follows).
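
As a small illustration of the linear-model view, the sketch below (with simulated data) shows that the equal-variance two-sample t test gives the same t statistic and p-value as testing the group-indicator coefficient in an ordinary least squares regression.

```python
# Sketch: the "tests as linear models" view. A two-sample t test (with equal
# variances) matches the test on the group-indicator coefficient in an OLS
# regression y ~ 1 + group. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(30.0, 8.0, size=500)
treatment = rng.normal(32.0, 8.0, size=500)

y = np.concatenate([control, treatment])
group = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = control, 1 = treatment

ols = sm.OLS(y, sm.add_constant(group)).fit()
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)

print(f"OLS group coefficient: t = {ols.tvalues[1]:.3f}, p = {ols.pvalues[1]:.4f}")
print(f"Two-sample t test:     t = {t_stat:.3f}, p = {p_value:.4f}")
```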

Gotchas with Testing

  • There are many ways to make wrong inferences with A/B testing.
  • The simplest is misunderstanding what the p-value means. The p-value is the probability, under the distribution governed by the null hypothesis, of observing a result at least as extreme as the one actually observed.
  • Most classical tests (at least the ones seen in introductory stats textbooks) are for a fixed-horizon setting where the sample size is pre-determined. This is very different from running the experiment until a statistically significant result shows up; the latter event will happen at a much higher frequency.
    • Stopping only when you have a significant outcome inflates the rate of false positives (by a huge margin). This error in conducting a clean experiment is called peeking (see the simulation sketch after this list).
  • There are adaptive testing techniques that allow one to avoid committing to a specific sample size in advance. These go by the names of sequential experimental design and Bayesian experimental design.
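
A small simulation sketch of the peeking problem: both groups are drawn from the same distribution (the null is true), yet stopping at the first interim look with p < 0.05 yields a false positive rate far above the nominal 5%. The number of experiments, the looks, and the sample sizes are arbitrary choices for illustration.

```python
# Sketch: peeking inflates false positives. Both groups come from the same
# distribution, so any "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments = 2000
looks = range(100, 2001, 100)   # peek every 100 users per group

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, size=2000)
    b = rng.normal(0.0, 1.0, size=2000)   # same distribution: H0 is true
    for n in looks:
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:        # stop as soon as the result looks significant
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.3f}")
# Compare with the nominal 5% of a single fixed-horizon test.
```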