## Goal

Compare two versions (A and B) of a variable in a controlled environment to find out which one performs better.

## Some Definitions

sample: the group of customers who participated in the test; the number of customers is the sample size

## Steps

1. Set up the objective: choose the key metric that measures your testing target (user satisfaction, conversion rate, etc.)

2. Make the hypotheses

• Null hypothesis $H_{0}$
The null hypothesis states that the observed sample differences result purely from chance. Ex: "there is no difference in customer satisfaction between A and B"
• Alternative Hypothesis $H_{a}$
The testing target. Ex: "customer satisfaction under B is different from customer satisfaction under A" (matching the two-tailed test in step 5).
3. Create a control group and a test group: randomly select from the population and divide it into two groups:

• Control group: uses the old setting, A
• Test (variant) group: uses the new setting, B
4. Conduct the test and collect data. Two biases can arise at this stage:

• Sampling bias: random sampling is important in hypothesis testing because it eliminates sampling bias; without it, the results of your A/B test represent the sample itself rather than the entire population
• Under-coverage bias: arises when we sample too few observations
5. Calculate statistical significance
If we observe a difference such as "A's average satisfaction score is 3.9 and B's is 4.2", don't celebrate too early 😂  We should always run a two-sample t-test to get statistical proof for the experiment.

1. Significance level ($\alpha$, set before the test): the probability of rejecting the null hypothesis when it is true. Generally, we use a significance level of 0.05.
2. P-value ($p$, calculated): the probability of observing a difference at least as large as the one measured purely by random chance. The p-value is evidence against the null hypothesis: the smaller it is, the stronger the case for rejecting $H_{0}$. For a significance level of 0.05, if $p < 0.05$ we can reject the null hypothesis (i.e., reject the hypothesis that B is no different from A)
3. Confidence interval: the observed range in which a given percentage of test outcomes falls. We choose the desired confidence level before the test; generally, we take a 95% confidence interval ($\alpha = 0.05$)
4. One-tailed or two-tailed hypothesis?
Two-tailed, because we want to test whether the two distributions (A's and B's) are different
6. Read → Explain → Action
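The significance check in step 5 can be sketched in code. This is a minimal illustration with synthetic, made-up satisfaction scores: it computes Welch's two-sample t statistic by hand and, because the samples are large, approximates the two-tailed p-value with the normal CDF. A real analysis would typically just call `scipy.stats.ttest_ind`.

```python
# Sketch of the two-sample (Welch's) t-test from step 5, standard library only.
# The group data are synthetic, for illustration; with large samples the
# t statistic is approximately normal, so we use the normal CDF for the p-value.
import math
import random
import statistics

random.seed(42)
# Hypothetical satisfaction scores (1-5 scale) for each group
group_a = [random.gauss(3.9, 0.8) for _ in range(500)]  # control, old setting A
group_b = [random.gauss(4.2, 0.8) for _ in range(500)]  # variant, new setting B

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_b - mean_a) / standard_error

def two_tailed_p(t):
    """Two-tailed p-value via the normal approximation (fine for large n)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

alpha = 0.05  # significance level, fixed before the test
t = welch_t(group_a, group_b)
p = two_tailed_p(t)
print(f"t = {t:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

With a true mean lift of 0.3 and 500 customers per group, the test comfortably rejects $H_{0}$; the point of the sketch is that the decision comes from comparing $p$ to $\alpha$, not from eyeballing the two averages.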

## Common Mistakes

• A/B testing works best for incremental changes (UX tweaks, new features). It doesn't work well for major changes such as new branding or a completely new UX.
• Invalid hypothesis: the whole experiment depends on one thing, i.e., the hypothesis: what should be changed, why it should be changed, and what the expected outcome is. If you start with the wrong hypothesis, the probability of the test succeeding decreases
• Testing too many elements together: industry experts caution against running too many tests at the same time, because it makes it difficult to pinpoint which element influenced the success or failure. Prioritization of tests is therefore indispensable for successful A/B testing
• Ignoring statistical significance: it doesn't matter what you feel about the test. Whether it looks like a success or a failure, let it run its entire course so that it reaches statistical significance
• Not considering external factors: tests should be run over comparable periods to produce meaningful results. For example, it is unfair to compare website traffic on the highest-traffic days with the lowest-traffic days, because of external factors such as sales or holidays
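One way to guard against the under-coverage bias mentioned in step 4 is to estimate the required sample size before the test starts. This is a minimal sketch using the standard two-sample formula $n = 2\left((z_{1-\alpha/2} + z_{1-\beta})\,\sigma/\delta\right)^2$ per group; the $\sigma$ (score spread) and $\delta$ (smallest lift worth detecting) values below are assumptions for illustration.

```python
# Sketch: minimum per-group sample size for a two-sample test,
# n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2.
# sigma and delta here are illustrative assumptions, not real data.
import math
from statistics import NormalDist

def min_sample_size(delta, sigma, alpha=0.05, power=0.8):
    """Per-group sample size to detect a mean difference `delta`
    with two-tailed significance `alpha` and the given power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # ~0.84 for 80% power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# e.g. detecting a 0.3-point lift in a satisfaction score with spread sigma = 0.8
print(min_sample_size(delta=0.3, sigma=0.8))
```

Note the trade-off the formula makes explicit: halving the detectable lift $\delta$ quadruples the customers needed per group, which is why tiny expected effects need very large tests.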
