Scaling Image Validation Across Multiple Platforms


Our team — the Test, Tools, and Reliability (TTR) team — is responsible for validating the quality of the Netflix SDK that we make available to our partners (Samsung, LG, Sony, Vizio, Roku, PlayStation, Xbox, and more). For context, the core of our SDK includes the scripting engine, the layout system, the graphics backend, the text shaping/rendering engine, the network stack, the animation stack, and the effects framework. It is therefore essential that our various partner devices produce a consistent experience, so our customers get the Netflix experience they've come to love.

In this post, we will:

  • Describe the solution we implemented
  • Share the best practices we established during and after our implementation

Automated Image Comparison

The simplest approach to image comparison is to leverage an image comparison tool, such as ImageMagick, to perform an exact pixel-to-pixel match.
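In practice we drive a tool like ImageMagick for this, but the idea reduces to the following minimal sketch, where plain Python lists of RGB tuples stand in for decoded pixel buffers (the helper name is ours, purely illustrative):

```python
def pixels_equal(image_a, image_b):
    """Exact pixel-to-pixel match: every pixel must be identical."""
    return image_a == image_b

# A 4-pixel "golden" image and a capture with one pixel off by one.
golden = [(255, 255, 255)] * 4
capture = [(255, 255, 255)] * 3 + [(254, 255, 255)]

print(pixels_equal(golden, golden))   # True
print(pixels_equal(golden, capture))  # False
```

Exact matching is attractive because it is unambiguous, but as the cases below show, it is also brittle.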

Case 1: Corrupted Glyph Cache

Case 2: Rendering Dust

Case 3: Rendering Deltas Across Different Platforms

Different platforms use different rendering APIs (for example, Direct3D and OpenGL) and can therefore exhibit rounding differences that affect anti-aliasing, scaling, blending, or effects.

Defining Image Comparison Thresholds

Our solution was to investigate a few image comparison metrics and home in on the ideal metric and threshold that would let us scale our automated image comparison tests. While experimenting, we focused on two image comparison metrics: Absolute Error (AE) and Root Mean Square Error (RMSE).

Absolute Error (AE)

Absolute Error (AE) gives you the total number of differing pixels. With AE, you can also specify a fuzz value, which is a distance within a colorspace: colors within this fuzz value are still deemed acceptable, which can absorb some of the minor image deltas.
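A minimal sketch of AE with fuzz, again over pixel lists rather than real image files (the Euclidean-distance-in-normalized-RGB interpretation of fuzz is our simplification of what ImageMagick does):

```python
import math

def absolute_error(image_a, image_b, fuzz=0.0):
    """Count pixels whose colors differ by more than `fuzz`,
    measured as Euclidean distance in normalized RGB space."""
    count = 0
    for a, b in zip(image_a, image_b):
        distance = math.sqrt(sum(((ca - cb) / 255.0) ** 2
                                 for ca, cb in zip(a, b)))
        if distance > fuzz:
            count += 1
    return count

golden  = [(255, 255, 255), (0, 0, 0), (0, 255, 0)]
capture = [(250, 255, 255), (0, 0, 0), (255, 255, 0)]  # one slight, one large delta

print(absolute_error(golden, capture))             # 2: both changed pixels count
print(absolute_error(golden, capture, fuzz=0.05))  # 1: the slight delta is forgiven
```

Note that fuzz forgives small per-pixel color shifts, but a pixel that differs wildly still counts the same as one that barely exceeds the fuzz: AE has no notion of how wrong a pixel is once it crosses the line.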

Root Mean Square Error (RMSE)

This brings us to the next metric, Root Mean Square Error (or RMSE). The mathematical formula is:

RMSE = √( Σᵢ (Aᵢ − Bᵢ)² / N )

where Aᵢ and Bᵢ are the normalized values of the i-th channel samples of the two images, and N is the total number of samples. Because the per-sample differences are squared before averaging, large deviations contribute disproportionately: a badly wrong region moves the score more than many barely perceptible shifts.

Threshold Selection

RMSE was the clear winner, as it allowed us to programmatically distinguish a subtle but important difference from a subtle but unimportant one. The next step was to decide what RMSE threshold value was going to work best, knowing full well that large values could mask real issues. With that in mind, we opted to keep our default threshold as conservatively low as possible. We default to RMSE with a 0.1% threshold when comparing against other device platforms, and fall back to an exact pixel-to-pixel match (AE with threshold 0) when comparing against the same base platform used to capture the “golden” reference images.
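That policy can be summarized as a tiny decision function. This is a hypothetical sketch of our own (the platform names are placeholders, not real identifiers from our harness):

```python
def comparison_policy(capture_platform, golden_platform):
    """Pick a (metric, threshold) pair: exact match when the capture
    comes from the same platform as the golden, RMSE otherwise."""
    if capture_platform == golden_platform:
        return ("AE", 0.0)     # exact pixel-to-pixel match
    return ("RMSE", 0.001)     # 0.1% RMSE threshold

print(comparison_policy("reference", "reference"))   # ('AE', 0.0)
print(comparison_policy("partner-tv", "reference"))  # ('RMSE', 0.001)
```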

Best Practices Guide

Reduce Variability

We first review the test to see if there are improvements that can be made to facilitate image reuse without compromising the original test intent.

Keep assets simple

Another best practice is to keep your assets simple. If your test validates that we can properly stretch, crop, or tile an image, use a synthetic image with simple geometric shapes rather than a photograph of the world, which is susceptible to anti-aliased edges.

Prefer High Contrast Elements

For RMSE, we try to use high contrast elements to avoid diluting errors. For example, if we failed to render a line of text, it would stand out more prominently as black text on a white background (14.96% RMSE) than as black text on a green background (5.08% RMSE).
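The effect is easy to demonstrate: black text differs from white in all three color channels, but from pure green in only one, so the same rendering failure produces a smaller score on green. A sketch with flat pixel lists (the pixel counts are made up for illustration, so the percentages will not match the article's figures, which came from real anti-aliased renders):

```python
import math

def rmse(image_a, image_b):
    """Normalized RMSE over all channel samples."""
    diffs = [((ca - cb) / 255.0) ** 2
             for a, b in zip(image_a, image_b) for ca, cb in zip(a, b)]
    return math.sqrt(sum(diffs) / len(diffs))

BLACK, WHITE, GREEN = (0, 0, 0), (255, 255, 255), (0, 255, 0)

def scene(background, text_pixels, total=1000):
    """A flat image: `text_pixels` black pixels on a solid background."""
    return [BLACK] * text_pixels + [background] * (total - text_pixels)

# Compare "text rendered" against "text missing" on each background.
missing_on_white = rmse(scene(WHITE, 30), scene(WHITE, 0))
missing_on_green = rmse(scene(GREEN, 30), scene(GREEN, 0))

print(missing_on_white > missing_on_green)  # True: white amplifies the failure
```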

Consider Image Size

Since RMSE normalizes the error across the entire image, the dimensions matter and should be taken into account when selecting a threshold value. For example, if a 50x50px square rendered as green instead of white, it would be weighted more heavily on a 100x100px image (35.36% RMSE) than on a larger 200x200px image (17.68% RMSE).
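These percentages can be reproduced if we assume the RMSE is computed over normalized RGBA samples (the inclusion of the alpha channel is our assumption; it is what makes the quoted numbers come out exactly):

```python
import math

WHITE = (255, 255, 255, 255)
GREEN = (0, 255, 0, 255)

def rmse(image_a, image_b):
    """Normalized RMSE over all RGBA channel samples."""
    diffs = [((ca - cb) / 255.0) ** 2
             for a, b in zip(image_a, image_b) for ca, cb in zip(a, b)]
    return math.sqrt(sum(diffs) / len(diffs))

def canvas_with_square(side, square_side, square_color):
    """A flattened `side` x `side` white canvas containing a
    `square_side` x `square_side` square of `square_color`."""
    square = square_side * square_side
    return [square_color] * square + [WHITE] * (side * side - square)

# The same 50x50 green-instead-of-white square on two canvas sizes.
small = rmse(canvas_with_square(100, 50, GREEN), canvas_with_square(100, 50, WHITE))
large = rmse(canvas_with_square(200, 50, GREEN), canvas_with_square(200, 50, WHITE))

print(f"{small:.2%}")  # 35.36%
print(f"{large:.2%}")  # 17.68%
```

The identical defect costs half the score on the larger canvas, which is why a single global threshold only makes sense once your captures share a resolution.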

Create Platform Goldens as Needed

In some cases, even after applying the above tips, your scenario may still slightly exceed the default threshold. In that case, the test author has the discretion to override the threshold at the test or test-step level.


All of these best practices helped us identify an ideal RMSE value to apply in order to scale our image validation tests across our platforms without compromising quality or increasing maintenance overhead. In our case, we chose a 0.1% RMSE threshold because the majority of our images are 720p. If we were rendering at a higher resolution or using lower contrast elements, we would need to lower that RMSE threshold even further to uphold the same level of quality confidence.
