
Return-Aware Experimentation

5 min read · Jul 30, 2025

By Simon Ejdemyr and Winston Chou


Edit: We are proud to announce that our paper on Evaluating Decision Rules Across Many Weak Experiments has received the Best Paper Award (Applied Data Science) at KDD 2025!

At Netflix, we have thought extensively about how to better design and decide A/B tests at scale. We are excited to share two new papers on this topic:

  1. The first, which we presented at the ACM Conference on Economics and Computation (EC ‘25) at Stanford University in July, asks how experimentation programs should be designed to increase long-run returns to business metrics.
  2. The second, which we are presenting at next week’s ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ‘25) in Toronto, asks how organizations should choose “decision rules” that determine which treatment arm (if any) to launch from an experiment. We provide a data-driven methodology for evaluating and choosing among candidate decision rules.

These papers contribute to an emerging paradigm we call return-aware experimentation. As the name suggests, this paradigm views experiments less as tools for testing scientific theories and more as tools for helping organizations make good decisions that lift business metrics over time.

Focusing on the returns to experimentation introduces new considerations for experimental design and analysis. For example, while scientists are rightly driven to prevent false discoveries from entering the scientific record, businesses may be more tolerant of false discoveries as long as more true discoveries are unearthed. This isn’t to say that businesses shouldn’t care about learning or validating hypotheses. However, optimizing the returns to experimentation also requires thinking carefully about how to allocate resources across experiments; how to choose a winning treatment to launch from each experiment; whether to run fewer, larger experiments or more, smaller experiments; and the cumulative impact of experimentation programs (sets of related experiments) on business KPIs.

This perspective builds on a growing body of research — conducted both at Netflix and by the broader experimentation community — as well as our day-to-day experience supporting experimentation at scale. Here, the challenge isn’t just to detect treatment effects, but to make thousands of noisy, resource-constrained experiments add up to meaningful business value. Our papers explore what happens when experimentation is reframed as a decision and allocation problem, and decision-makers optimize accordingly.

Optimizing Returns from Experimentation Programs

By Timothy Sudijono, Simon Ejdemyr, Apoorva Lal, and Martin Tingley

As an organization runs an increasing number of concurrent experiments — often in the thousands — it faces at least two new demands. First, experiments begin to compete for resources. For example, even though technology firms may have hundreds of millions of users to allocate to experiments, that user base is a finite resource, especially given the large number of ideas to test and the low signal-to-noise ratio of most product changes. Second, not every experimenter is an experimentation expert. As testing becomes more decentralized, the system itself must absorb more of the complexity.

These demands make it essential to allocate constrained resources deliberately, not just at the level of individual tests, but at the level of experimentation programs. In other words, we need policies that scale: structures that guide what gets tested and how. Experimentation platforms then encode these policies in return-aware design templates that even non-specialists can execute with confidence.

Our paper builds on prior work, particularly A/B Testing with Fat Tails, to offer both theoretical and practical advances in this direction. On the theoretical side, we:

  • Develop a model of experimentation as a resource allocation problem;
  • Derive a dynamic programming solution that identifies the best policy for maximizing returns (a toy sketch of this allocation problem follows this list);
  • Extend the model to portfolio-level optimization, where multiple experimentation programs compete for shared resources.
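
To make the allocation framing concrete, here is a toy sketch in Python (our own illustration, not the model or algorithm from the paper). A fixed user budget is split across candidate experiments, each experiment’s expected return is its value if launched times the power of detecting its effect at the allocated sample size, and a small dynamic program searches over allocations. The power approximation, effect sizes, and values are all made up.

```python
# Toy sketch only: a fixed user budget is split across candidate experiments.
# Each experiment's expected return is (value if the idea works) x (power of
# detecting its effect at the allocated sample size). A small dynamic program
# searches over allocations; all numbers below are illustrative assumptions.
import math

def power(n_per_arm, effect, sd=1.0, z_crit=1.96):
    """Approximate power of a two-sample z-test with n_per_arm users per arm."""
    if n_per_arm <= 0:
        return 0.0
    se = sd * math.sqrt(2.0 / n_per_arm)
    return 0.5 * (1.0 + math.erf((effect / se - z_crit) / math.sqrt(2.0)))

def allocate(budget_units, experiments, users_per_unit=10_000):
    """Return (best expected return, per-experiment allocation in budget units)."""
    # best[b]: optimal (return, allocation) using at most b budget units
    best = [(0.0, [0] * len(experiments)) for _ in range(budget_units + 1)]
    for i, exp in enumerate(experiments):
        new_best = list(best)
        for b in range(1, budget_units + 1):
            for spend in range(1, b + 1):
                ret = exp["value"] * power(spend * users_per_unit, exp["effect"])
                candidate = best[b - spend][0] + ret
                if candidate > new_best[b][0]:
                    allocation = list(best[b - spend][1])
                    allocation[i] = spend
                    new_best[b] = (candidate, allocation)
        best = new_best
    return best[budget_units]

experiments = [
    {"effect": 0.05, "value": 1.0},  # strong idea
    {"effect": 0.02, "value": 1.0},  # weak idea
    {"effect": 0.02, "value": 3.0},  # weak idea, but valuable if it works
]
ret, allocation = allocate(budget_units=20, experiments=experiments)
print(f"expected return {ret:.3f}, allocation (x10k users/arm) {allocation}")
```

Even in this toy version, the optimizer trades off spending more users on a promising idea against funding additional ideas, which is the flavor of tradeoff the framework formalizes.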

Our framework has numerous practical implications:

  • It clarifies the tradeoff between testing more ideas with lower precision vs. testing fewer ideas with greater precision, informing design choices like duration and sample size.
  • It shows that, when implementation costs are negligible, firms should run many more low-powered tests and relax p-value thresholds; the back-of-the-envelope calculation after this list illustrates why. (As implementation costs increase, the framework still clarifies the optimal design parameters.)
  • It provides a principled way to prioritize among competing experiments when capacity is limited.
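
As a back-of-the-envelope illustration of the second point (again our own toy calculation, not the paper’s analysis), fix a total user budget, posit a prior in which most ideas are null and a few have small positive or negative effects, and compare the expected cumulative return of a few large, strictly thresholded tests against many small, leniently thresholded ones. The prior, budget, and thresholds below are invented for illustration.

```python
# Toy calculation only: compare "few large, strictly thresholded tests" with
# "many small, leniently thresholded tests" under a fixed total user budget.
# Effects are drawn from a made-up prior where most ideas do nothing; each test
# launches its treatment if a one-sided z-test clears the threshold. The
# expected return of a design is (number of tests) x E[effect x P(launch)].
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_return(total_users, users_per_test, z_threshold, prior, sd=1.0):
    """prior: list of (probability, true effect) pairs."""
    num_tests = total_users // users_per_test
    se = sd * math.sqrt(2.0 / (users_per_test / 2))  # two equal arms per test
    per_test = sum(p * eff * norm_cdf(eff / se - z_threshold) for p, eff in prior)
    return num_tests * per_test

# Hypothetical prior: 80% null ideas, 15% small wins, 5% small losses.
prior = [(0.80, 0.0), (0.15, 0.02), (0.05, -0.02)]
budget = 10_000_000

for users, z in [(1_000_000, 2.58), (200_000, 1.96), (50_000, 1.28)]:
    print(f"{budget // users:>4} tests of {users:>9,} users at z>{z}: "
          f"expected return {expected_return(budget, users, z, prior):.2f}")
```

Under a prior like this one, and with negligible implementation costs, the many-small-tests designs come out ahead, consistent with the guidance above.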

Ultimately, this work is about helping experimentation programs make the most of their limited resources. At Netflix, we’re optimistic that these advances will help us build a better product for our current and future members.

Evaluating Decision Rules Across Many Weak Experiments

By Winston Chou, Colin Gray, Nathan Kallus, Aurélien Bibaut, and Simon Ejdemyr

Like the first paper, our second paper asks how organizations should decide experiments in order to increase returns. The distinct contribution of this paper is to show how organizations can answer this question empirically by evaluating candidate decision rules across a set of past experiments.

Consider a naive approach:

  1. Gather a set of past experiments and a set of candidate decision rules.
  2. Apply each rule to each experiment, yielding a set of hypothetical “winners” per rule.
  3. Estimate the cumulative returns to each rule by adding up the treatment effects of each winner.

For example, a candidate decision rule might be to select the arm with the largest positive and statistically significant effect on some goal metric (choosing the control arm if no treatment arm has a statistically significant effect). Applying this rule to a set of past experiments yields a set of winners, and adding up the effects of these arms on the goal metric yields a data-driven estimate of the cumulative returns of the rule. Intuitively, this method asks, “What would my returns have been if I had decided all experiments using this rule?”
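
Here is a minimal Python sketch of this naive evaluation, assuming each experiment is summarized by per-arm estimated effects and standard errors versus control; the data layout and threshold are illustrative, not our production tooling.

```python
# Minimal sketch of the naive evaluation. Each experiment is summarized by
# per-arm estimated effects and standard errors vs. control; the data layout
# and the 1.96 threshold are illustrative.

def pick_winner(arms, z_threshold=1.96):
    """Rule: launch the arm with the largest positive, statistically
    significant estimated effect on the goal metric; otherwise keep control."""
    significant = [a for a in arms
                   if a["effect"] > 0 and a["effect"] / a["se"] > z_threshold]
    return max(significant, key=lambda a: a["effect"]) if significant else None

def naive_cumulative_return(experiments):
    """Sum the estimated effects of the winners chosen by the rule. Note that
    the same data both selects the winner and scores it."""
    total = 0.0
    for arms in experiments:
        winner = pick_winner(arms)
        total += winner["effect"] if winner else 0.0  # control contributes 0
    return total

# Two hypothetical experiments, each with two treatment arms:
experiments = [
    [{"effect": 0.8, "se": 0.3}, {"effect": 0.2, "se": 0.3}],
    [{"effect": -0.1, "se": 0.3}, {"effect": 0.4, "se": 0.3}],
]
print(naive_cumulative_return(experiments))  # 0.8: only experiment 1 launches
```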

Our paper points out a flaw in this approach: it uses the same data to choose the winning arm and to measure its returns. This can lead to the dreaded “winner’s curse”: winning arms often look good partly by chance, so adding up their estimated effects overstates the rule’s true cumulative return. The resulting bias can be severe when each treatment has only a small effect, as is common in digital experimentation.

We propose a simple antidote to the winner’s curse: split the data used to select the winner from the data used to evaluate its returns. For example, rather than selecting the winner using 100% of the units in an experiment, one can use 90% of the units to select the winner and evaluate the returns to that winner on the remaining 10% of units, repeating this process ten times so that all units are used for evaluation. This method incurs a small amount of bias, since the decision rule is only applied to a subset of the data, but this bias is easily mitigated in practice.
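
Here is a minimal Python sketch of this split-then-evaluate idea for a single experiment, assuming unit-level records of arm assignments and metric values. A simple rule that picks the arm with the largest positive estimated effect stands in for the significance-based rule above; all names and helpers are illustrative.

```python
# Minimal sketch of split-then-evaluate on one experiment, assuming unit-level
# records like {"arm": "treatment_a", "metric": 1.3}. A simple "largest
# positive estimated effect" rule stands in for a significance-based rule.
import random
from statistics import mean

def arm_effects(units):
    """Estimate each treatment arm's mean effect vs. control."""
    by_arm = {}
    for u in units:
        by_arm.setdefault(u["arm"], []).append(u["metric"])
    control = mean(by_arm.get("control", [0.0]))
    return {arm: mean(vals) - control
            for arm, vals in by_arm.items() if arm != "control"}

def select_winner(units):
    """Illustrative rule: the arm with the largest positive estimated effect."""
    effects = arm_effects(units)
    best = max(effects, key=effects.get, default=None)
    return best if best is not None and effects[best] > 0 else None

def cross_fit_return(units, k=10, seed=0):
    """Average, over k folds, the held-out effect of the winner selected on
    the other k-1 folds, so selection and evaluation never share data."""
    units = list(units)
    random.Random(seed).shuffle(units)
    folds = [units[i::k] for i in range(k)]
    estimates = []
    for i in range(k):
        selection_data = [u for j, fold in enumerate(folds) if j != i for u in fold]
        winner = select_winner(selection_data)       # choose on ~90% of units
        held_out_effects = arm_effects(folds[i])     # score on the other ~10%
        estimates.append(held_out_effects.get(winner, 0.0) if winner else 0.0)
    return mean(estimates)
```

Summing this cross-fitted estimate across a set of past experiments yields the rule’s estimated cumulative return without reusing the selection data for evaluation.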

In the paper, we also describe the real-world application of our methodology to select a new proxy metric and decision rule, which are used to set OKRs and to decide all new personalization A/B tests at Netflix. Our case study shows how return-aware ideas are directly influencing real-world practice in experimentation at scale.

Conclusion

We are excited to contribute these papers to the literature on return-aware experimentation, and even more excited for these ideas to influence experimentation at Netflix and beyond. If you are similarly energized by pushing the frontiers of experimentation and causal inference, consider attending our panel on Return-Aware Experimentation at NABE TEC2025 in Seattle later this year and applying to our open roles!

