Introduction
- For simplicity, in this section, let's assume we are only comparing control with a single treatment/variant. If we have more variants, the underlying tests will change.
- We will look at three different tests based on the type of question we would like to answer from a corresponding experiment.
- These tests are the two-sample t test, the chi-squared test, and the Poisson means test.
- There are always alternative tests that can be used depending on the assumptions.
- For instance, instead of the Poisson means test listed above, one could use the non-parametric Mann–Whitney U test.
Sample Size Considerations
- Recall the terminology
- Minimum detectable effect: the smallest effect of the new treatment/variation over the baseline/control that we want the test to detect.
- Sample size per variation: the number of users that need to be exposed to the treatment and control to be able to conclusively reject or not reject the null hypothesis.
- There are a few key choices that determine the test specs. These are:
- Statistical power of the test ($1-\beta$): which is the probability with which the test will be able to detect the minimum detectable effect if it exists.
- Significance level ($\alpha$): which is the probability with which the test will make a mistake and detect the minimum detectable effect when it does not exist. For instance, if the control and the variation/treatment are in fact identical, there is still an $\alpha$ (e.g., 5%) chance of "detecting" that the variation is better than the control!
- Typically, specifying an acceptable level of power, the minimum detectable effect size, and the significance level determines the number of samples needed (see the code sketch at the end of this section).
- For instance, if we want to detect what we believe is a small improvement over control, we may have to test longer (i.e., test on more users) to be able to reject the null hypothesis.
- Demo: Evan’s A/B tools
- Demo: Optimizely
- Caveat: We are only discussing fixed horizon tests and the sample size requirements thereof. Optimizely’s calculator uses a more sophisticated test suite that allows for early stopping. So the estimated sample sizes may differ.
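- Below is a minimal sketch of how these choices translate into a sample size per variation, using statsmodels' power utilities. The baseline conversion rate (10%), the minimum detectable effect (an absolute lift to 12%), $\alpha = 0.05$, and power $= 0.8$ are illustrative assumptions, not values from these notes.

```python
# Sketch: sample size per variation for a conversion-rate experiment.
# Baseline rate, lift, alpha, and power below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10       # control conversion rate
treatment_rate = 0.12      # baseline + minimum detectable effect
effect_size = proportion_effectsize(treatment_rate, baseline_rate)  # Cohen's h

n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # 1 - beta
    ratio=1.0,             # equal allocation to control and treatment
    alternative="two-sided",
)
print(f"Samples needed per variation: {n_per_variation:.0f}")
```

- Shrinking the minimum detectable effect or raising the power makes the required sample size grow quickly, which is the trade-off discussed above.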
Two-sample t Test
- The two-sample t test allows one to answer the following question:
Is there a difference in the average value across two groups?
- We can pair the null hypothesis (the two means are equal) with different alternatives:
- Two-sided: the two means differ
- One-sided: the mean of the first group is higher than the second
- One-sided: the mean of the second group is higher than the first
- In this test, a t-statistic is computed from the difference between the two group means, scaled by its standard error; a code sketch follows this list.
- Example application:
- The session lengths of users on an app (e.g., Netflix)
- Demo
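- A minimal sketch of the two-sample t test on session lengths, with simulated data standing in for real user sessions (the scale parameters are made up for illustration):

```python
# Sketch: two-sample t test on session lengths (minutes), control vs. treatment.
# The data are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.exponential(scale=30.0, size=1_000)    # ~30-minute average sessions
treatment = rng.exponential(scale=32.0, size=1_000)  # ~32-minute average sessions

# Welch's t test (does not assume equal variances); two-sided by default.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

- For a one-sided hypothesis, `scipy.stats.ttest_ind` accepts `alternative="greater"` or `alternative="less"`.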
Chi-squared Test
- This test allows for answering the following question:
Is there a difference in the rates of success across two groups?
- The null hypothesis here is that the rates of success are equal.
- The chi-squared statistic is computed from the observed counts of successes and failures in the two groups, compared against the counts expected if the success rates were equal; a code sketch follows this list.
- Example application:
- Successful conversion events (e.g., whether a user downloaded an app or registered for a free trial).
- Demo
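- A minimal sketch of the chi-squared test on conversion counts; the conversion and total counts below are made up for illustration:

```python
# Sketch: chi-squared test of equal conversion rates across two groups.
# The counts (conversions out of total users per group) are illustrative.
import numpy as np
from scipy import stats

control_conversions, control_total = 180, 2_000
treatment_conversions, treatment_total = 220, 2_000

# 2x2 contingency table: rows = groups, columns = (converted, not converted)
table = np.array([
    [control_conversions, control_total - control_conversions],
    [treatment_conversions, treatment_total - treatment_conversions],
])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```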
Poisson Means Test
- This test lets us answer the following:
Is the rate of arrival of events (of a certain type) different across two time periods or across two groups?
- The null hypothesis here is that the two rates are equal; a code sketch follows this list.
- Example application:
- Counts of successful content interactions (e.g., news article clicks on a search page or a marketing page)
- Demo
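- One way to carry this out is the exact conditional test: under the null of equal rates, the count in one group given the combined total is binomial, so a binomial test gives the p-value. A minimal sketch, with counts and exposures made up for illustration:

```python
# Sketch: comparing two Poisson rates with the exact conditional test.
# Under H0 (equal rates), the count in the treatment group given the combined
# total is Binomial(n1 + n2, t_treatment / (t_control + t_treatment)).
# The counts and exposures below are illustrative.
from scipy.stats import binomtest

count_control, exposure_control = 400, 10_000      # clicks, user-days
count_treatment, exposure_treatment = 460, 10_000

total = count_control + count_treatment
p_null = exposure_treatment / (exposure_control + exposure_treatment)

result = binomtest(count_treatment, n=total, p=p_null, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")
```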
How to Remember?
- It may seem that there are a lot of tests, each with its own assumptions about the data. That is true. But one way to manage the complexity is to know a few popular tests well.
- Another way is to think of tests from a linear-model perspective. This is explained well at https://lindeloev.github.io/tests-as-linear/. The gist is that many hypothesis tests are essentially tests on the coefficients of a corresponding linear model. With this viewpoint, it is also easy to understand the assumptions being made; the sketch below shows the equivalence for the two-sample t test.
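- A minimal sketch of that equivalence: regressing the outcome on a 0/1 group indicator and testing the group coefficient gives the same p-value as the equal-variance two-sample t test (the data here are simulated for illustration):

```python
# Sketch: the two-sample t test viewed as a linear model.
# The p-value of the 'group' coefficient matches the t test's p-value.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y": np.concatenate([rng.normal(10.0, 2, 500), rng.normal(10.3, 2, 500)]),
    "group": [0] * 500 + [1] * 500,
})

lm = smf.ols("y ~ group", data=df).fit()
t_stat, p_value = stats.ttest_ind(df.y[df.group == 1], df.y[df.group == 0])

print(f"linear-model p = {lm.pvalues['group']:.6f}")
print(f"t-test p       = {p_value:.6f}")  # same value as the coefficient's p
```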
Gotchas with Testing
- There are many ways to make wrong inferences with A/B testing.
- The simplest is misunderstanding what the p-value means. The p-value is the probability, under the null hypothesis, of observing a result at least as extreme as the one actually observed.
- Most classical tests (at least the ones seen in introductory stats textbooks) are for a fixed-horizon setting where the sample size is pre-determined. That is very different from running the experiment until a statistically significant result appears: the latter will produce "significant" results far more often than the nominal significance level suggests.
- Stopping only when you have a significant outcome inflates the rate of false positives (by a huge margin). This error in conducting a clean experiment is called peeking; the simulation sketch at the end of this section illustrates how large the inflation can be.
- There are adaptive testing techniques that do not require committing to a specific sample size in advance. These go by the names sequential experimental design and Bayesian experimental design.
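- A small simulation sketch of the peeking problem: both groups are drawn from the same distribution (so the null is true), but we check the p-value after every batch and stop at the first "significant" result. The batch size, number of batches, and number of simulations are arbitrary choices for illustration.

```python
# Sketch: how peeking inflates the false positive rate under the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_batches, batch_size, alpha = 2_000, 20, 100, 0.05

false_positives = 0
for _ in range(n_sims):
    control, treatment = [], []
    for _ in range(n_batches):
        control.extend(rng.normal(0, 1, batch_size))
        treatment.extend(rng.normal(0, 1, batch_size))  # same distribution
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:          # peek and stop at the first significant result
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.2%}")
# Well above the nominal 5% of a fixed-horizon test.
```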