Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing

Published on Mar 26, 2020
Ron Kohavi (H-index: 48)
Diane Tang
Ya Xu (H-index: 8)
  • References (189)
  • Citations (3)
Aug 4, 2019 in KDD (Knowledge Discovery and Data Mining)
Aleksander Fabijan (Microsoft, H-index: 6), Jayant Gupchup (Microsoft, H-index: 7), ..., Pavel Dmitriev (H-index: 8) (view all 7 authors)
Accurately learning what delivers value to customers is difficult. Online Controlled Experiments (OCEs), aka A/B tests, are becoming a standard operating procedure in software companies to address this challenge, as they can detect small causal changes in user behavior due to product modifications (e.g. new features). However, like any data analysis method, OCEs are sensitive to trustworthiness and data quality issues which, if left unaddressed or unnoticed, may result in wrong decisions. On...
3 Citations
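The abstract above concerns trustworthiness checks for OCE platforms. One widely used check of this kind is the Sample Ratio Mismatch (SRM) test: compare the observed treatment/control split against the configured split with a chi-square goodness-of-fit test. The sketch below is a minimal illustration of that idea, not code from the paper; the function name and counts are hypothetical.

```python
import math

def srm_check(control_n: int, treatment_n: int, expected_ratio: float = 0.5):
    """Sample Ratio Mismatch check: chi-square goodness-of-fit test of the
    observed assignment counts against the configured split.
    (Illustrative sketch; real platforms apply this per experiment.)"""
    total = control_n + treatment_n
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (control_n - exp_c) ** 2 / exp_c + (treatment_n - exp_t) ** 2 / exp_t
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# A 50.6/49.4 split on a million users looks small but is a red flag:
chi2, p = srm_check(506_000, 494_000)
print(f"chi2={chi2:.1f}, p={p:.2e}")  # tiny p-value: investigate before trusting results
```

A near-zero p-value here means the assignment pipeline is losing or misrouting users, so any metric movements in the experiment should be distrusted until the cause is found.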
Michelle N. Meyer (Geisinger Health System, H-index: 12), Patrick R. Heck (Geisinger Health System, H-index: 5), ..., Christopher F. Chabris (Geisinger Health System, H-index: 41) (view all 7 authors)
Randomized experiments have enormous potential to improve human welfare in many domains, including healthcare, education, finance, and public policy. However, such “A/B tests” are often criticized on ethical grounds even as similar, untested interventions are implemented without objection. We find robust evidence across 16 studies of 5,873 participants from three diverse populations spanning nine domains—from healthcare to autonomous vehicle design to poverty reduction—that people frequently rat...
5 Citations
May 1, 2019 in ICSE (International Conference on Software Engineering)
Tong Xia (Microsoft, H-index: 1), Sumit Bhardwaj (Microsoft, H-index: 1), ..., Aleksander Fabijan (Microsoft, H-index: 6) (view all 4 authors)
Software companies are increasingly adopting novel approaches to ensure their products perform correctly, improve user experience, and increase revenue. Two approaches that have significantly impacted product development are controlled experiments (concurrent experiments with different variations of the same product) and phased rollouts (deployments to smaller audiences, or rings, before deploying broadly). Although powerful in isolation, product teams experience most benefits wh...
1 Citation
May 13, 2019 in WWW (The Web Conference)
Dominic Coey (Facebook, H-index: 5), Tom Cunningham (Facebook, H-index: 3)
We present a method for implementing shrinkage of treatment effect estimators, and hence improving their precision, via experiment splitting. Experiment splitting reduces shrinkage to a standard prediction problem. The method makes minimal distributional assumptions, and allows for the degree of shrinkage in one metric to depend on other metrics. Using a dataset of 226 Facebook News Feed A/B tests, we show that a lasso estimator based on repeated experiment splitting has a 44% lower mean squared...
2 Citations
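The experiment-splitting idea in the abstract above can be sketched in a few lines. This is a simplified toy version under stated assumptions (simulated effects, a single shrinkage factor learned by regressing one half's estimates on the other's), not the paper's lasso estimator; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 experiments, true effects drawn from a prior,
# each measured with sampling noise on two independent halves of users.
n_exp, tau, sigma = 500, 0.5, 1.0
true = rng.normal(0.0, tau, n_exp)             # true treatment effects
half_a = true + rng.normal(0.0, sigma, n_exp)  # estimate from user half A
half_b = true + rng.normal(0.0, sigma, n_exp)  # estimate from user half B

# Regress half-B estimates on half-A estimates (no intercept): the slope
# approximates the optimal shrinkage factor tau^2 / (tau^2 + sigma^2),
# learned from the data without knowing tau or sigma.
slope = (half_a @ half_b) / (half_a @ half_a)

# Apply the learned shrinkage to the raw estimates.
shrunk = slope * half_a

mse_raw = np.mean((half_a - true) ** 2)
mse_shrunk = np.mean((shrunk - true) ** 2)
print(f"shrinkage factor ~ {slope:.2f}")
print(f"MSE raw: {mse_raw:.3f}, MSE shrunk: {mse_shrunk:.3f}")
```

Because each half sees independent noise but the same true effect, the cross-half regression turns shrinkage into an ordinary prediction problem, which is the core of the method described above.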
Somit Gupta (Microsoft, H-index: 4), Ron Kohavi (Microsoft, H-index: 48), ..., Igor Yashkov (Yandex, H-index: 2) (view all 34 authors)
Online controlled experiments (OCEs), also known as A/B tests, have become ubiquitous in evaluating the impact of changes made to software products and services. While the concept of online controlled experiments is simple, there are many practical challenges in running OCEs at scale. To understand the top such challenges and encourage further academic and industrial exploration, representatives with experience in large-scale experimentation from thirteen different ...
2 Citations
Brett R. Gordon (Northwestern University, H-index: 11), Florian Zettelmeyer (Northwestern University, H-index: 21), ..., Dan Chapsky (Facebook, H-index: 2) (view all 4 authors)
Observational methods often fail to accurately recover the treatment effects generated from randomized advertising experiments on Facebook.
13 Citations
Min Liu (H-index: 1), Xiaohui Sun (H-index: 1), ..., Ya Xu (H-index: 8) (view all 4 authors)
Online experimentation (or A/B testing) has been widely adopted in industry as the gold standard for measuring product impact. Despite the wide adoption, little of the literature discusses A/B testing with quantile metrics. Quantile metrics, such as 90th-percentile page load time, are crucial to A/B testing because many key performance metrics, including site speed and service latency, are defined as quantiles. However, at LinkedIn's data size, quantile-metric A/B testing is extremely challenging because there...
1 Citation
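As the abstract above notes, quantile metrics do not fit the usual mean-based t-test machinery. A common baseline approach (a simple sketch, not the paper's method, which addresses scale beyond what bootstrapping handles) is to bootstrap the difference in 90th percentiles; the latency distributions below are simulated.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical page-load-time samples (ms) for control and treatment;
# lognormal is a common shape for latency data.
control = rng.lognormal(mean=6.0, sigma=0.5, size=20_000)
treatment = rng.lognormal(mean=5.9, sigma=0.5, size=20_000)

def p90_diff(a, b):
    """Difference in 90th-percentile latency: treatment minus control."""
    return np.percentile(b, 90) - np.percentile(a, 90)

# Bootstrap the p90 difference to get a confidence interval; quantiles
# are not means, so the standard two-sample t-test does not apply directly.
boot = np.array([
    p90_diff(rng.choice(control, control.size, replace=True),
             rng.choice(treatment, treatment.size, replace=True))
    for _ in range(500)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"observed p90 delta: {p90_diff(control, treatment):.1f} ms")
print(f"95% bootstrap CI: [{lo:.1f}, {hi:.1f}] ms")
```

A CI entirely below zero indicates a genuine tail-latency improvement; the challenge the paper addresses is doing this at a data size where naive resampling like the above becomes infeasible.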
A network effect is said to take place when a new feature not only impacts the people who receive it, but also other users of the platform, like their connections or the people who follow them. This very common phenomenon violates the fundamental assumption underpinning nearly all enterprise experimentation systems, the stable unit treatment value assumption (SUTVA). When this assumption is broken, a typical experimentation platform, which relies on Bernoulli randomization for assignment and two...
1 Citation
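The abstract above describes the pipeline that SUTVA underpins: Bernoulli randomization for assignment followed by a two-sample comparison. The sketch below illustrates that standard pipeline (hash-based assignment plus a Welch t statistic) under simulated data; the salt, metric, and lift values are hypothetical, and the whole analysis is only valid when users do not influence one another's outcomes.

```python
import hashlib
import numpy as np

def assign(user_id: str, salt: str = "exp42") -> str:
    """Bernoulli assignment via hashing: deterministic per user, ~50/50 split.
    (Salt name is illustrative; platforms use a distinct salt per experiment.)"""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return "treatment" if h[0] % 2 else "control"

# Hypothetical per-user metric values; under SUTVA each user's outcome
# depends only on their own assignment, so a two-sample test is valid.
rng = np.random.default_rng(7)
users = [f"user{i}" for i in range(10_000)]
groups = np.array([assign(u) for u in users])
outcome = rng.normal(10.0, 2.0, len(users))
outcome[groups == "treatment"] += 0.2   # simulated true lift

t_mask = groups == "treatment"
a, b = outcome[t_mask], outcome[~t_mask]

# Welch's two-sample t statistic, as a typical scorecard computes it.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_stat = (a.mean() - b.mean()) / se
print(f"treatment share: {t_mask.mean():.3f}, "
      f"delta: {a.mean() - b.mean():.3f}, t: {t_stat:.2f}")
```

When network effects spill treatment impact into the control group, the delta this pipeline reports is biased, which is exactly why the designs surveyed in these papers move beyond per-user Bernoulli randomization.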
Randomized experiments, or A/B tests are used to estimate the causal impact of a feature on the behavior of users by creating two parallel universes in which members are simultaneously assigned to treatment and control. However, in social network settings, members interact, such that the impact of a feature is not always contained within the treatment group. Researchers have developed a number of experimental designs to estimate network effects in social settings. Alternatively, naturally occurr...
Cited By (3)
Ron Kohavi, Diane Tang, Ya Xu (view all 3 authors)
Ron Kohavi, Diane Tang, ..., John P. A. Ioannidis (H-index: 151) (view all 5 authors)
BACKGROUND: Many technology companies, including Airbnb, Amazon, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Originally derived from the same statistical roots, randomized controlled trials (RCTs) in medicine are now criticized for being expensive and difficult, while ...
1 Citation
Ron Kohavi (Microsoft)
For the digital parts of businesses in Society 5.0, such as web sites and mobile applications, manual testing is impractical and slow. Instead, implementations of ideas can now be evaluated with scientific rigor using online controlled experiments (A/B tests), which provide trustworthy, reliable assessments of the impact of those implementations on key metrics of interest. This chapter shows how online controlled experiments can be run at large scale.