Improving Pull Request Confidence for the Netflix TV App


The Netflix TV app runs on millions of smart TVs, streaming media players, gaming consoles, and set-top boxes worldwide. As the team that focuses on developer productivity for the org, our role is to enable the engineers who develop, innovate on, and test this app to be more productive.

  • 50 different engineers who contribute to the codebase regularly
  • 250 changes on average merged to the codebase each month; this year our busiest month had 367 merged changes (see chart below for a month-by-month breakdown over the first half of this year)

  • The developer iterates on their change on a development branch
  • Once the change is ready for peer review, they create a pull request
  • This kicks off the CI process, and test results come in as individual tests finish running
  • When the entire test suite has finished running, the developer is responsible for analyzing the results and determining whether the change is safe to merge
  • If the developer judges the change to be safe, they merge the pull request to the main branch

Providing Test Confidence Data

Although we acknowledge that it may not be feasible to say with 100% confidence that a test failure is not due to a developer’s change, we can still provide an indication: any signal that guides the developer towards an informed decision is valuable. Our goal, therefore, is to indicate whether a failing functional test is “likely” or “unlikely” to be due to the developer’s commit, and to surface this information to the developer as part of the CI process.

How do we do this?

First, we need some way of determining how an existing test performs with and without the developer’s code change. Of the two, we already know how the test performs with the code change: we just ran it as part of CI! To learn how the test performs without the code change, we need to look at how the same test runs on the destination branch.

How did it work?

Roughly, this is how this version of providing test confidence data to developers worked:

  • If a test fails in CI, even after the appropriate number of retries, we look up the results for different runs of that test on the main branch over the last 3 hours.
  • Based on how often the test failed on the main branch, we assign a “score”. If the score reaches a certain (tunable) threshold, we indicate that the failure is “unlikely” to be due to the developer’s change.
  • For example, a functional test that is failing consistently on the main branch would have a score close to 100. In that case, the test would be safely above the set threshold, and the user would be shown that this particular failure is unlikely to be due to their change.
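
The steps above can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the score formula (percentage of failing main-branch runs in the window), the 70-point threshold, and the names MainBranchRun, confidence_score, and verdict are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class MainBranchRun:
    """One recorded run of a test on the main branch (hypothetical record shape)."""
    finished_at: datetime
    passed: bool


def confidence_score(
    runs: List[MainBranchRun],
    now: datetime,
    window: timedelta = timedelta(hours=3),
) -> Optional[float]:
    """Return the percentage of main-branch runs in the window that failed.

    Returns None when no run falls inside the window, i.e. there is no data
    to base a score on. The formula itself is an assumption for this sketch.
    """
    recent = [r for r in runs if now - window <= r.finished_at <= now]
    if not recent:
        return None
    failures = sum(1 for r in recent if not r.passed)
    return 100.0 * failures / len(recent)


def verdict(score: Optional[float], threshold: float = 70.0) -> str:
    """Map a score to the label shown to the developer.

    The threshold is tunable; 70.0 is an arbitrary placeholder value.
    """
    if score is None:
        return "no data"
    return "unlikely" if score >= threshold else "likely"
```

A test that failed in every main-branch run inside the window scores 100 and is labeled “unlikely”; a test with no runs in the window yields no score at all.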

Improving the Initial Version

Staying true to the Netflix culture and values, we received and gathered feedback to see how the tool was working in practice.

  1. First, there might be no test data available for the main branch in the 3-hour time window. As mentioned earlier, the tests that were run via the Jenkins job were given lower priority on the devices. So, not every test may have run every 2 hours on the main branch.
  2. The other and more important issue was that even if fresh test data were available within the last 2–3 hours, the data were often still too stale to compute a reasonable confidence score. Major causes of test flakiness include dependencies going down or devices behaving unexpectedly, and that can happen at any point in time. Therefore, test results from a few hours ago are of limited value.
  3. Finally, developers wanted to move away from the idea of a confidence “score” and instead preferred being shown some sort of “report”. We realized that providing a score, and an indication of “likely” vs. “unlikely” based on that score, was hiding from our users some of the information that we had available.

How did we address these gaps?

As noted earlier, for failing tests, we make at least 4 attempts to see if the test failure is consistent. Instead of using all of these attempts to run the test against the pull request, why not use one of the retry attempts to run the test against the destination branch? In other words, if a test fails 3 times, why not use the 4th attempt to run directly against the destination branch and store the results? This would provide directly applicable data that can be used to assess confidence far more accurately than in the initial version.
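
As a rough sketch of this retry allocation, assume a hypothetical run_test callable that takes a branch name and returns None on a pass or an error message on a failure. The function name, the result dictionary, and the branch arguments are illustrative and not taken from the actual CI system.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical runner: given a branch name, returns None if the test passed
# or an error message if it failed.
Runner = Callable[[str], Optional[str]]


def run_with_destination_check(
    run_test: Runner,
    pr_branch: str,
    destination_branch: str,
    pr_attempts: int = 3,
) -> Dict[str, object]:
    """Try the test up to pr_attempts times on the pull request branch.

    If every attempt fails, spend one more attempt on the destination
    branch and record its outcome alongside the pull request errors.
    """
    pr_errors: List[str] = []
    for _ in range(pr_attempts):
        error = run_test(pr_branch)
        if error is None:
            # The test passed on the pull request; no destination run needed.
            return {
                "passed": True,
                "pr_errors": pr_errors,
                "destination_ran": False,
                "destination_error": None,
            }
        pr_errors.append(error)

    # All pull request attempts failed: use the final attempt on the
    # destination branch to get a fresh point of comparison.
    destination_error = run_test(destination_branch)
    return {
        "passed": False,
        "pr_errors": pr_errors,
        "destination_ran": True,
        "destination_error": destination_error,
    }
```

Because the destination-branch attempt runs right after the pull request attempts, its result reflects the current state of devices and dependencies rather than a run from hours earlier.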

Scenario 1: The test attempt on the destination branch failed with the same error as the pull request.
Scenario 2: The test attempt on the destination branch failed, but with a different error than the pull request.
Scenario 3: The test attempt on the destination branch passed, while the test failed on the pull request.
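
The three scenarios can be told apart by comparing the two outcomes. The sketch below assumes that errors are available as plain strings and that "same error" means the strings match after trimming whitespace; the real comparison may well be more forgiving. The Scenario enum and classify_failure are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Scenario(Enum):
    SAME_ERROR = 1        # destination branch failed with the same error
    DIFFERENT_ERROR = 2   # destination branch failed with a different error
    DESTINATION_PASSED = 3  # destination branch passed, pull request failed


def classify_failure(pr_error: str, destination_error: Optional[str]) -> Scenario:
    """Classify a failing pull request test against its destination-branch run.

    destination_error is None when the destination-branch attempt passed.
    Error equality is a simple trimmed string comparison (an assumption).
    """
    if destination_error is None:
        return Scenario.DESTINATION_PASSED
    if pr_error.strip() == destination_error.strip():
        return Scenario.SAME_ERROR
    return Scenario.DIFFERENT_ERROR
```

Returning the scenario rather than a single score keeps the underlying facts visible, which fits the report-style output developers asked for.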

Benefits of the Improved Version

With the release of the new version, we saw marked improvement in the availability of confidence data. Let’s look at a quick example that illustrates how the updated version provides value where the previous version did not.

What’s Next?

Here are a couple of ways we are thinking about improving confidence in test results for the future:

  • We can match the number of test runs on the main branch with the number of test runs on the pull request. Currently, we generally end up having 3 test runs for the pull request, while only having a single attempt on the main branch. We can increase the number of attempts on the main branch to ensure the error output is consistent across multiple retries. Obviously, there is a trade-off here as this will require additional device resources.
  • We can use additional historical test run data, along with test run data from concurrent pull requests, to give users even more information for judging whether a test failure is due to their change.
  • We can leverage the vast amount of new confidence data that we are gathering as a result of this project to readily identify particularly unstable tests and closely analyze the root cause of their instability. This will allow us to tackle the problem of test flakiness head-on and address the larger issue of test stability directly.
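
As one possible illustration of the last idea, the stored destination-branch attempts could be aggregated per test to surface the least stable ones. This is a sketch under assumptions: the record shape (test name, whether the destination-branch attempt failed), the min_runs cutoff, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_unstable_tests(
    records: Iterable[Tuple[str, bool]],
    min_runs: int = 5,
) -> List[Tuple[str, float]]:
    """Rank tests by how often their destination-branch attempt failed.

    Each record is (test_name, destination_failed). Tests with fewer than
    min_runs recorded attempts are skipped to avoid ranking on thin data.
    Results are sorted by failure rate, highest first, then by name.
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for name, destination_failed in records:
        totals[name] += 1
        if destination_failed:
            failures[name] += 1

    rates = [
        (name, failures[name] / totals[name])
        for name in totals
        if totals[name] >= min_runs
    ]
    return sorted(rates, key=lambda item: (-item[1], item[0]))
```

A test that fails often on the destination branch, where no developer change is involved, is a natural candidate for root-cause analysis.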



Netflix Technology Blog
