Switching from the Frequentist approach to the Bayesian approach for A/B testing analysis

When analyzing A/B testing results, the Bayesian approach can outperform the traditional Frequentist approach in many ways. It also differs considerably from the Frequentist approach, from implementation through to interpretation.

If you are interested in switching from the traditional Frequentist approach to the Bayesian approach but are not sure what to expect, this blog gives you an overview of the main differences between the two approaches, from both a business perspective and a technical perspective.

1. Overview of the theoretical differences

Though this blog does NOT focus on the basics of A/B testing, I’d like to provide a general idea of the analysis process with the Bayesian approach:

Now let’s highlight some theoretical differences between the Bayesian and Frequentist approaches.

When to analyze the test data

For Frequentist, you need to finish the experiment by collecting enough samples before you start analyzing the data, which limits the test to an “offline experiment”. For Bayesian, the analysis can start while data is still being collected. As each new batch of data comes in, the Bayesian analysis provides updated results until we reach a conclusion for the hypothesis test. This makes an “online experiment” possible with Bayesian, which saves time and resources. For more discussion, please see section 3 below.

Data distribution of the KPI in tests:

An understanding of data distributions is needed for both approaches. For Frequentist, different tests are needed depending on the distribution of the KPI you want to test [Ref. 1]:

For Bayesian, different conjugate families are needed for KPIs with different distributions. For example, a KPI like CTR (click-through rate) would use the Beta-Binomial conjugate pair for the update. We set the prior (parameters a and b) of the Beta distribution, use the collected data to update it with Bayes’ rule, and obtain the posterior with updated parameters a and b. We then sample from the posterior distribution and make inferences about the test results. With a conjugate pair the posterior can be sampled directly, while MCMC is needed for models without a closed-form posterior.
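As a rough illustration of the Beta-Binomial update described above, here is a minimal Python sketch. The function names, the uniform Beta(1, 1) prior, and the click/view counts are all assumptions made up for this example, not values from a real test. Because the Beta posterior has a closed form, the sketch samples from it directly with NumPy instead of running MCMC.

```python
import numpy as np

def update_beta(a, b, clicks, views):
    """Conjugate Beta-Binomial update: return the posterior (a, b)."""
    return a + clicks, b + (views - clicks)

def prob_a_beats_b(post_a, post_b, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(CTR_A > CTR_B) from two Beta posteriors."""
    rng = np.random.default_rng(seed)
    draws_a = rng.beta(post_a[0], post_a[1], size=n_draws)
    draws_b = rng.beta(post_b[0], post_b[1], size=n_draws)
    return float(np.mean(draws_a > draws_b))

# Hypothetical prior and counts, for illustration only.
prior = (1, 1)  # uniform Beta(1, 1) prior
post_a = update_beta(*prior, clicks=120, views=1000)
post_b = update_beta(*prior, clicks=100, views=1000)

p_a_better = prob_a_beats_b(post_a, post_b)
print("posterior A:", post_a, "posterior B:", post_b)
print("P(A better than B) is roughly", round(p_a_better, 3))
```

The posterior parameters are simply the prior parameters plus the observed clicks and non-clicks, which is what makes the conjugate update cheap enough to repeat on every batch.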

Sample size

The Frequentist approach requires the sample size to be calculated before the experiment. In addition, some tests, such as the t-test, rely on assumptions that hold better when the sample sizes of the test groups are balanced. By comparison, the Bayesian approach does not require a pre-defined sample size and tolerates some imbalance in sample size among test groups.
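To make the Frequentist sample size step concrete, the sketch below uses one common normal-approximation formula for comparing two proportions. The baseline CTR of 10%, the target CTR of 12%, the significance level, and the power are hypothetical inputs chosen for illustration. Other formulas and tools exist and may give slightly different numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)

# Hypothetical inputs: detect a lift from 10% to 12% CTR.
n_per_group = sample_size_per_group(0.10, 0.12)
print("samples needed per group:", n_per_group)
```

A smaller expected lift makes the denominator shrink quadratically, which is why small effects need very large Frequentist samples.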

Explanation of test result

For Frequentist, conclusions are stated like “We reject/fail to reject the null hypothesis that variants A and B perform the same”. Such a conclusion is based on the historical data observed over the whole test period. For Bayesian, the result can be expressed as a probability, such as “variant A is better than B with a 98% probability”.

2. Advantages of Bayesian from business perspective

A/B testing is a tool to help make business decisions. Usually, the line of business (LOB) or the business partners want to understand and learn from the analysis without knowing too many statistical details such as p-values or confidence intervals (CI). In a business setting, explaining A/B testing results with the Bayesian approach has some natural advantages over Frequentist:

1) Results are easier to understand as probabilities. As a probabilistic approach, the Bayesian method provides the probability that one variant is better than the others. This probability quantifies the confidence in your business conclusions, so it is easier to understand for people with little statistics background than the p-values used by the Frequentist method.

2) Faster iterations are possible with an online experiment. Bayesian analysis supports early stopping in an online experiment, which enables faster test iterations. Depending on the updating frequency you choose, the results are updated after every batch of data comes in, and you don’t need to wait until the probability reaches 99% to stop the test.

Consider a scenario where, after running the test for 3 days, Bayesian gives a probability of 80% that variant A is better than variant B. This probability may continue to increase to 99% if you keep collecting data for 2 more days, but you don’t have to wait another two days if you are already happy with the result that A is better with an 80% probability. The test can be stopped here, and you save time and resources by quickly switching to another test.

3) Bayesian results can be used to make predictions. This is very different from the Frequentist method, which gives summary statistics of the samples collected during the experiment period. Since these descriptive statistics are based on historical observations, you cannot draw conclusions about future, unseen data. By contrast, the Bayesian approach learns the distribution of the parameters from the data and gives the posterior predictive distribution for unobserved, future values conditional on the observed data. Therefore, you can make statements like “the CTR for variant A is expected to be 12%”, which is a nice thing to learn from the A/B test in addition to the hypothesis test.
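The posterior predictive idea can be sketched as follows for a CTR with a Beta posterior. The posterior parameters and the number of future views are hypothetical. The sketch first draws a plausible CTR from the posterior, then simulates future clicks given that CTR, so the spread of the result reflects both parameter uncertainty and sampling noise.

```python
import numpy as np

def predictive_ctr(a, b, future_views, n_draws=50_000, seed=0):
    """Simulate the observed CTR of a future batch from a Beta(a, b) posterior."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(a, b, size=n_draws)        # plausible true CTRs
    clicks = rng.binomial(future_views, theta)  # simulated future clicks per draw
    return clicks / future_views

# Hypothetical posterior for variant A, for illustration only.
a_post, b_post = 121, 881
simulated = predictive_ctr(a_post, b_post, future_views=1000)

expected_ctr = a_post / (a_post + b_post)  # posterior mean of the CTR
low, high = np.percentile(simulated, [2.5, 97.5])
print("expected CTR:", round(expected_ctr, 4))
print("95% predictive interval:", round(low, 4), "to", round(high, 4))
```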

Of course, Bayesian is not a perfect approach. Frequentist has an advantage in estimating the length of an offline experiment. With a designed sample size, it is possible to estimate how long the experiment will take based on the number of samples you collect as it runs. With Bayesian, it is difficult to predict how long it will take to conclude that variant A is better with a 99% probability. Having more samples come in every day definitely helps to reach more confident conclusions (as the probability increases), but we do not know in advance how long a specific experiment will take. However, in my experience, Bayesian usually gives a robust result with fewer samples than the Frequentist method needs.

3. What to expect from technical perspective

With all the advantages stated above, I hope you are already interested in applying Bayesian analysis to your A/B testing. So the next question is: how difficult is it to implement the Bayesian approach? The answer is that it needs a few more considerations:

3.1 Offline experiment or online experiment

First, you need to decide whether you prefer an offline experiment or an online experiment:

Offline experiment: you start the analysis after you have finished data collection, so the Bayesian framework only needs to update once with all the collected data. Since Bayesian does not have a designed sample size, you will not know whether you need more samples until you have finished the analysis. It is also possible that you spend unnecessary time collecting data when Bayesian could have given you a conclusion with less data.
Online experiment: you can start analyzing data while the experiment is running. The Bayesian framework runs and updates the results multiple times during the experiment period, and you control how often it runs and how big each batch is when you feed in the data. An online experiment supports fast iterations since you can stop the experiment once the stopping rules are met.

I personally would suggest the online experiment to make full use of the advantages of the Bayesian method - making faster decisions than the Frequentist method.

If you decide to use the online experiment, you need to pick how often you want to run the analysis, since Bayesian takes data in batches to update the results. This choice is flexible: you could choose to get results updated every day, every two days, every 8 hours, etc. It depends on your business needs - if you want to iterate faster, update at a high frequency so you can stop early once you achieve the results you want. It may also depend on your data pipeline, such as how often your database is updated and pulls in new data.
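An online experiment with batch updates and a stopping rule could look roughly like the sketch below. The 95% threshold, the uniform prior, and the daily batch counts are assumptions for illustration. Your own stopping rule and batch frequency should come from your business needs.

```python
import numpy as np

def run_online_test(batches, threshold=0.95, prior=(1, 1), n_draws=50_000, seed=0):
    """Update Beta posteriors batch by batch and stop once the rule is met.

    batches: iterable of (views_a, clicks_a, views_b, clicks_b), one per update.
    Returns (number of batches used, last estimate of P(A > B)).
    """
    rng = np.random.default_rng(seed)
    a_a, b_a = prior
    a_b, b_b = prior
    used, p_a_better = 0, 0.5
    for views_a, clicks_a, views_b, clicks_b in batches:
        # Conjugate update with the new batch only.
        a_a, b_a = a_a + clicks_a, b_a + views_a - clicks_a
        a_b, b_b = a_b + clicks_b, b_b + views_b - clicks_b
        draws_a = rng.beta(a_a, b_a, size=n_draws)
        draws_b = rng.beta(a_b, b_b, size=n_draws)
        p_a_better = float(np.mean(draws_a > draws_b))
        used += 1
        # Hypothetical stopping rule: either variant is clearly ahead.
        if p_a_better >= threshold or p_a_better <= 1 - threshold:
            break
    return used, p_a_better

# Hypothetical daily batches: (views_a, clicks_a, views_b, clicks_b).
daily_batches = [(1000, 120, 1000, 100)] * 5
batches_used, final_prob = run_online_test(daily_batches)
print("stopped after", batches_used, "batches with P(A > B) about", round(final_prob, 3))
```

Because the posterior from one batch becomes the prior for the next, each update only needs the new batch, not the full history.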

3.2 Granularity level of input data

You may need extra pre-processing for your data because the granularity level is different from the Frequentist approach. The input data for Frequentist is aggregated to the user/ID level, so it analyzes all the data collected during the test period for each user/ID. However, the granularity level of the Bayesian method in an online experiment depends on how often you perform the Bayesian updates. After you choose the update frequency, you need to aggregate the data accordingly as a pre-processing step. For example, if you are testing CTR (click-through rate) and the updating frequency is 24 hours, you will need to calculate the total number of “seen” events and the total number of “click” events for every day and build an input dataframe that looks like below:
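The original table is not reproduced here, so the following pandas sketch shows one hypothetical layout for the daily aggregation. The event log, its column names (`timestamp`, `variant`, `event`), and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical event-level log: one row per "seen" or "click" event.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 09:05", "2023-01-01 15:30",
        "2023-01-02 10:00", "2023-01-02 10:01", "2023-01-02 18:45",
    ]),
    "variant": ["A", "A", "B", "A", "B", "B"],
    "event": ["seen", "click", "seen", "seen", "seen", "click"],
})

# Aggregate to one row per day and variant, matching a 24-hour update frequency.
daily = (
    events
    .assign(date=events["timestamp"].dt.floor("D"))
    .groupby(["date", "variant", "event"])
    .size()
    .unstack("event", fill_value=0)
    .reset_index()
)
print(daily)
```

Each row of `daily` is then one batch for one variant in the Bayesian update. For a different update frequency, the flooring step would use that interval instead of one day.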

3.3 Multiple comparison

When you have multiple variants to test at the same time, you may run into the multiple comparison problem. One solution is to use hierarchical Bayesian methods. The motivation is similar to that for using the Bonferroni adjustment in the Frequentist approach.

3.4 Choice of prior

If you have a decent sample size, such as a few hundred or more, I would say a non-informative prior is good enough, since the prior will not affect your results much.

If your data size is small, a non-informative prior is still a good choice to avoid the bias introduced by an improper prior, unless you are very confident in a customized prior learned from historical data.
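The effect of the prior at different sample sizes can be checked with a small calculation. The sketch below compares the posterior mean of a CTR under a flat Beta(1, 1) prior and under a hypothetical informative Beta(10, 90) prior, for a large and a small data set. All counts are made up for illustration.

```python
def posterior_mean(prior_a, prior_b, clicks, views):
    """Posterior mean of a CTR under a Beta prior with Binomial data."""
    return (prior_a + clicks) / (prior_a + prior_b + views)

flat_prior = (1, 1)           # non-informative
informative_prior = (10, 90)  # hypothetical belief that CTR is around 10%

# Hypothetical data sets: (clicks, views).
large_data = (60, 500)
small_data = (2, 20)

gap_large = abs(posterior_mean(*flat_prior, *large_data)
                - posterior_mean(*informative_prior, *large_data))
gap_small = abs(posterior_mean(*flat_prior, *small_data)
                - posterior_mean(*informative_prior, *small_data))

print("gap between priors with 500 views:", round(gap_large, 4))
print("gap between priors with 20 views:", round(gap_small, 4))
```

With a few hundred observations the two priors lead to nearly the same estimate, while with 20 observations the choice of prior moves the estimate noticeably.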

So when should you use your own prior? When you are in one of the following situations: 1) you have already run similar tests and have a lot of historical data, 2) you have an expert’s opinion that the collected data can barely represent, or 3) you have limited time to run the test and you have expert knowledge of, and confidence in, what the outcome will be.

4. Summary

In summary, there are some considerations and challenges in switching from Frequentist to Bayesian. But I still think it is worth the change if you are looking for results that are more interpretable to business partners, or if you want to test fast with more iterations when time is limited. I hope this blog helps you understand what to expect and what the main differences are between analyzing A/B testing results with the Frequentist method and with the Bayesian method. Thank you for reading!


References

1) Wikipedia, “A/B testing”, https://en.wikipedia.org/wiki/A/B_testings