Modernizing the Netflix TV UI Deployment Process
by Ashley Yung
At Netflix, we are always looking for ways to improve our member experience. We discovered that while we were developing the TV UI at great velocity, the real bottleneck was the canary deployment process for the UI. Let’s take a look at our existing ESN-based canary approach.
Our existing ESN-based canary deployment process prevented us from increasing the frequency of delivering joy to our members.
First of all, what is an ESN?
An ESN, or Netflix Electronic Serial Number, is a globally-unique identifier for each device. Even if you have two of the same TVs at home: same brand, model and year, and they both have the Netflix TV app installed, they do not have the same ESN.
How do ESN-based canaries work?
Our hash algorithm uniformly (and deterministically) maps a given ESN to one of 100 device buckets. If your ESN hashes into buckets 0–24, your device always receives the canary build under the ESN-based canary approach. Similarly, buckets 25–49 receive the “baseline” build, which is used for comparison against the canary build. The remaining buckets continue to receive the baseline, or main production, build.
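To make the bucketing concrete, here is a minimal sketch. The actual hash function Netflix uses is not public, so MD5 and the sample ESN strings are assumptions purely for illustration:

```python
import hashlib

NUM_BUCKETS = 100


def esn_to_bucket(esn: str) -> int:
    """Deterministically map an ESN to one of 100 device buckets.

    MD5 is an assumption here; any hash that spreads ESNs uniformly
    would behave the same way for this illustration.
    """
    digest = hashlib.md5(esn.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def build_for(esn: str) -> str:
    """Return which build a device receives during a canary window."""
    bucket = esn_to_bucket(esn)
    if bucket <= 24:
        return "canary"
    if bucket <= 49:
        return "baseline"
    return "production"
```

Note that because the mapping is a pure function of the ESN, the same device lands in the same bucket on every deployment, which is exactly the problem described below.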
Problems with ESN-based canaries
Since each Netflix device has a unique ESN, if a customer turns on their TV when our canary deployment is in progress, they might get served the canary build on one device and the baseline or regular build on another — even if they have the exact same TVs at home and on the same Netflix account! This causes a discrepancy in our users’ Netflix TV experience across different devices.
The discrepancy above only happens when the startup of the app coincides with our two-hour canary deployment window, which runs twice a week. A much bigger issue, as illustrated in the diagram above, is that the same TV device (users don’t buy new TVs often!) with the same ESN, given the same hash algorithm, always gets assigned to the same device bucket. Thus, the same set of devices is always selected to receive the canary build.
Those users are always the first to be exposed to potential issues that we only identify at canary time, which translates into a poor Netflix experience for those members.
Another implication of having the same devices get the canary build each time is that we were guaranteed a skewed representation of devices between the control and the canary. This skew in our device sampling persisted across canary allocations and distorted the outcomes of the statistical analysis; we were unable to catch device-specific issues with an ESN-based canary approach, which significantly lowered our confidence in the ACA (Automatic Canary Analysis) reports we had set up for our canaries. In fact, the push captains (the operators of the deployment) ended up resorting to manual monitoring and eyeballing the graphs instead.
Beyond the ESN-based approach to canaries, we had other grievances with our deployment process in general. Our deployment workflow consisted of a complex Spinnaker pipeline that relied on dozens of Jenkins jobs, with configurations defined outside of source control, which also prevented us from testing those configurations outside of production. A graphical representation of the workflow looks something like this:
The end-to-end deployment workflow interacted with many different services through Jenkins. Our Spinnaker pipeline would start a Jenkins job, which would issue a command to the downstream application we wanted to interact with (e.g. Jira) and then return the result of the command to Spinnaker. There was a lot of friction in passing results between the upstream and downstream services. This became especially apparent when a downstream dependency was intermittently unavailable, since the extra layer of indirection made retries difficult.
The complexity of the workflow also meant that the learning curve for the process was steep, especially for developers outside the tooling team. We had to rely on two push captains who knew the ins and outs of the pipeline to perform the deployments. It became a running joke that the two of them had to coordinate their vacation plans :)
We wanted something more resilient, something defined in code, and something that any engineer can pick up easily. So we decided to re-engineer the delivery process from the ground up.
We had several goals in mind when we started looking at an overhaul of the process. The must-haves:
- Reduce friction between Jenkins, Spinnaker, Jira, and our backend metadata service
- Full automation and config-as-code
- A modern canary approach, e.g. A/B-test-based canary. (For further reading, What is an A/B test? and Safe Delivery of Client applications at Netflix)
Murphy (named after the RoboCop) was the framework that helped bridge our needs.
What is Murphy?
How do we leverage Murphy?
Each project defines a config that lists which plugins are available to its namespace (sometimes also referred to as a config group). At Murphy server runtime, an action server is created from this config to handle action requests for its namespace. As mentioned in the previous section, each Murphy plugin automates a unit of work, and each unit of work is exposed as an action, which is simply a Murphy client command. Actions run inside isolated Titus containers, which are submitted to the TVUI Action Server. Our deployment pipeline leverages Spinnaker to chain these actions together, and Spinnaker can be configured to automatically retry Titus jobs to minimize the impact of any infrastructure hiccups.
Each plugin takes in a request object, and returns a response object. Using the BootstrappingHandler as an example:
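A minimal sketch of what such a handler might look like is below. The `BootstrapRequest` and `BootstrapResponse` shapes are hypothetical stand-ins; the real Murphy plugin interfaces are internal to Netflix:

```python
import json
import os
import tempfile
from dataclasses import dataclass


# Hypothetical request/response shapes for illustration only.
@dataclass
class BootstrapRequest:
    build_id: str
    environment: str


@dataclass
class BootstrapResponse:
    metadata_path: str  # JSON file consumed by all subsequent stages


class BootstrappingHandler:
    """Fetches build metadata and writes it to a JSON file."""

    def handle(self, request: BootstrapRequest) -> BootstrapResponse:
        metadata = self._fetch_metadata(request.build_id, request.environment)
        path = os.path.join(
            tempfile.gettempdir(), f"{request.build_id}-metadata.json"
        )
        with open(path, "w") as f:
            json.dump(metadata, f)
        return BootstrapResponse(metadata_path=path)

    def _fetch_metadata(self, build_id: str, environment: str) -> dict:
        # Stand-in for the call to the backend build metadata service.
        return {"buildId": build_id, "environment": environment}
```

Because every plugin follows the same request-in, response-out shape, Spinnaker can treat them uniformly when chaining stages together.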
Here’s a brief description of some of the main plugins that we built or leveraged for our migration:
Bootstrapping: Fetches all the necessary build metadata for the deployment from our backend build metadata service and outputs a JSON-formatted file which is used as the input for all subsequent stages.
Create AB Test: A common Murphy plugin shared across all client teams. Creates an AB test on ABlaze and returns a JSON-formatted file with the created AB test ID and other metadata. Creates or updates the deployment-related fast properties.
Start AB Test: A common Murphy plugin shared across all client teams. Starts allocating users to the AB test.
Run ACA: A common Murphy plugin shared across all client teams. Kicks off the Automatic Canary Analysis process and generates ACA reports for the push captain to review.
Regional Rollout: Rolls out the canary build in a single AWS region.
Full Rollout: Rolls out the canary build in all AWS regions.
Cleanup Regional Rollout: Rolls up all the regionally-scoped Fast Properties into one globally-scoped Fast Property.
Abort Canary: Aborts the canary deployment and cleans up any AB test and Fast Properties that were set up as part of the deployment process.
Deployment Workflow Improvements
Our deployment Slack channel has become the single source of truth for tracking our deployment process since the adoption of Murphy. Notifications are posted by our custom Slack bot (you guessed it, it’s called MurphyBot). MurphyBot posts a message to our deployment Slack channel when the canary deployment begins; the message contains a link to the Spinnaker deployment pipeline, as well as a link to roll back to the previous build. Throughout the deployment process, it keeps updating the same Slack thread with links to the ACA reports and the deployment status.
What about A/B-test-based canaries?
A/B-test-based canaries have unlocked our ability to perform an “apples-to-apples” comparison of the baseline and canary builds. Users allocated to test cell 1 receive the baseline build, while users allocated to test cell 2 receive the canary build. Leveraging the power of our ABlaze platform, we are now confident that the populations of cell 1 and cell 2 are close to identical in terms of their device representation.
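The key difference from ESN hashing can be sketched as allocation keyed on the account rather than the device, salted per test. The hash and salting scheme below are assumptions for illustration, not ABlaze’s actual allocation logic:

```python
import hashlib


def allocate_cell(account_id: str, test_id: str, num_cells: int = 2) -> int:
    """Hypothetical account-level allocation sketch.

    Keying on the account (not the device ESN) means every device on
    the same account lands in the same cell, and salting with the test
    ID means each new canary draws a fresh population instead of the
    same fixed set of devices every time.
    """
    key = f"{test_id}:{account_id}".encode("utf-8")
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return digest % num_cells + 1  # cells are numbered 1 and 2
```

Because the salt changes per test, the unlucky population that receives the canary rotates across deployments instead of being the same bucket of devices forever.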
A/B-test-based canaries have been working really well for us. Since their adoption in Feb 2021, the improved ACA has already saved us from rolling out a couple of problematic builds to our members that most likely would’ve slipped through the cracks had we still relied on manually reviewing the ACA reports. Below are a couple of examples (Note: all metric values on the Y-axis in the screenshots below were removed intentionally. The blue line represents the baseline build served in cell 1, while the red line represents the canary build served in cell 2):
Second, in July 2021, we discovered a 10% increase in background app memory in the ACA and were able to make the right decision to halt the rollout of the problematic build.
The new, simplified workflow has made the deployment process much more resilient to failures. We now also have a reliable way of making the “go or no go” decision with our new A/B-test-based canaries. Together, the re-engineered canary deployment process has greatly boosted confidence in our production rollouts.
In engineering, we always strive to make a good process even better. In the near future, we plan to explore the idea of device cohorts in our ACA reports. There will inevitably be new devices that Netflix wants to support, as well as older devices with such a low volume of traffic that they become hard to monitor statistically. We believe that grouping and monitoring devices with similar configurations and operating systems will provide better statistical power than monitoring individual devices alone (there are only so many of those “signature” devices that we can keep track of!). An example device cohort, grouped by operating system, would be “Android TV devices”; another would be “low-memory devices”, where we would monitor devices with memory constraints.