Improving Pull Request Confidence for the Netflix TV App


  • 50 different engineers who contribute to the codebase regularly
  • 250 changes merged to the codebase each month on average; this year our busiest month had 367 merged changes (see chart below for a month-by-month breakdown over the first half of this year)
  • The developer iterates on their change on a development branch
  • Once the change is ready for peer review, they create a pull request
  • This kicks off the CI process, and test results start coming in as individual tests finish running
  • When the entire test suite has finished running, the developer is responsible for analyzing the results and determining whether the change is safe to merge or not
  • If the developer judges the change to be safe, they merge the pull request to the main branch

Providing Test Confidence Data

How do we do this?

How did it work?

  • If a test fails in CI, even after the appropriate number of retries, we look up the results for different runs of that test on the main branch over the last 3 hours.
  • Based on how often the test failed on the main branch, we assign a “score”. If the score reaches a certain (tunable) threshold, we indicate that the failure is “unlikely” to be due to the developer’s change.
  • For example, a functional test that is failing consistently on the main branch would have a score close to 100. In that case, the score would be safely above the set threshold, and the user would be shown that this particular failure is unlikely to be due to their change.
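
The lookup-and-score steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual implementation: the post does not give the scoring formula, so the score here is assumed to be the percentage of recent main-branch runs that failed, and the names `MainBranchRun`, `confidence_score`, `is_unlikely_due_to_change`, and the default threshold of 80 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class MainBranchRun:
    """One run of a test on the main branch (hypothetical record shape)."""
    test_name: str
    finished_at: datetime
    passed: bool


def confidence_score(
    runs: Iterable[MainBranchRun],
    test_name: str,
    now: datetime,
    window: timedelta = timedelta(hours=3),
) -> Optional[float]:
    """Return the share (0-100) of recent main-branch runs that failed.

    Returns None when no run of the test falls inside the time window.
    Assumption: the score is a plain failure percentage.
    """
    recent = [
        r for r in runs
        if r.test_name == test_name and now - window <= r.finished_at <= now
    ]
    if not recent:
        return None
    failures = sum(1 for r in recent if not r.passed)
    return 100.0 * failures / len(recent)


def is_unlikely_due_to_change(score: Optional[float], threshold: float = 80.0) -> bool:
    """True when the score reaches the (tunable, here made-up) threshold."""
    return score is not None and score >= threshold
```

A test that failed in every main-branch run within the window scores 100 and is flagged as unlikely to be caused by the pull request; a test with no recent runs yields `None`, so no indication can be given.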

Improving the Initial Version

  1. First, there might be no test data available for the main branch in the 3-hour time window. As mentioned earlier, the tests that were run via the Jenkins job were given lower priority on the devices, so not every test was guaranteed to have run on the main branch every 2 hours.
  2. The other and more important issue was that even if fresh test data were available within the last 2–3 hours, the data were still often too stale to yield a reasonable confidence score. Major causes of test flakiness include dependencies going down or devices behaving unexpectedly, and that can happen at any point in time. Therefore, test results from a few hours ago are of only limited value.
  3. Finally, developers wanted to move away from the idea of a confidence “score” and instead preferred being shown some sort of “report”. We realized that providing a score, and an indication of “likely” vs. “unlikely” based on that score, was hiding from our users some of the information that we had available.

How did we address these gaps?

Scenario 1: The test attempt on the destination branch failed with the same error as the pull request.
Scenario 2: The test attempt on the destination branch failed, but with a different error than the pull request.
Scenario 3: The test attempt on the destination branch passed, while the test failed on the pull request.
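
The three scenarios amount to a small classification over a pair of test attempts, which could be sketched as follows. The names `Attempt`, `Scenario`, and `classify` are hypothetical, and comparing errors by exact string equality is an assumption made for illustration; the post does not say how error output is matched.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scenario(Enum):
    SAME_ERROR = 1             # Scenario 1: destination failed with the same error
    DIFFERENT_ERROR = 2        # Scenario 2: destination failed with a different error
    PASSED_ON_DESTINATION = 3  # Scenario 3: destination passed


@dataclass
class Attempt:
    """Outcome of one test attempt (hypothetical record shape)."""
    passed: bool
    error: Optional[str] = None


def classify(pr_attempt: Attempt, destination_attempt: Attempt) -> Scenario:
    """Map a failed pull request attempt and its destination-branch attempt to a scenario."""
    if pr_attempt.passed:
        raise ValueError("classification only applies to tests that failed on the pull request")
    if destination_attempt.passed:
        return Scenario.PASSED_ON_DESTINATION
    # Assumption: two errors count as "the same" when their messages are identical.
    if destination_attempt.error == pr_attempt.error:
        return Scenario.SAME_ERROR
    return Scenario.DIFFERENT_ERROR
```

For instance, `classify(Attempt(False, "timeout"), Attempt(False, "timeout"))` returns `Scenario.SAME_ERROR`, which suggests the failure is not specific to the pull request.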

Benefits of the Improved Version

What’s Next?

  • We can match the number of test runs on the main branch with the number of test runs on the pull request. Currently, we generally end up having 3 test runs for the pull request, while only having a single attempt on the main branch. We can increase the number of attempts on the main branch to ensure the error output is consistent across multiple retries. Obviously, there is a trade-off here as this will require additional device resources.
  • We can combine additional historical test run data with test run data from concurrent pull requests to give users even more information for ascertaining whether a test failure is due to their change or not.
  • We can leverage the vast amount of new confidence data that we are gathering as a result of this project to readily identify particularly unstable tests and closely analyze the root cause of their instability. This will allow us to tackle the problem of test flakiness head-on and address the larger issue of test stability directly.
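
As one illustration of the last idea, the gathered data could be aggregated into a per-test failure rate on the main branch and ranked. This is only a sketch under stated assumptions: the input shape (pairs of test name and pass/fail result), the function name `rank_unstable_tests`, and the `min_runs` cutoff are hypothetical and not taken from the project.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_unstable_tests(
    records: Iterable[Tuple[str, bool]],
    min_runs: int = 5,
) -> List[Tuple[str, float]]:
    """Rank tests by how often they failed on the main branch.

    records: (test_name, passed_on_main_branch) pairs (hypothetical shape).
    Tests with fewer than min_runs records are skipped to avoid noisy rates.
    Returns (test_name, failure_rate) pairs, highest failure rate first.
    """
    totals: Counter = Counter()
    failures: Counter = Counter()
    for name, passed in records:
        totals[name] += 1
        if not passed:
            failures[name] += 1
    rates = [
        (name, failures[name] / total)
        for name, total in totals.items()
        if total >= min_runs
    ]
    # Sort by failure rate descending, then by name for a stable order.
    return sorted(rates, key=lambda item: (-item[1], item[0]))
```

Tests at the top of such a ranking fail on the main branch regardless of any pull request, which makes them natural candidates for root-cause analysis.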

Learn more about how Netflix designs, builds, and operates our systems and engineering organizations

Netflix Technology Blog
