Are all Netflix Application Crashes User Impacting?
Akshay Garg, Rishika Idnani, Michael James, Ashwin Kumar Valliammal
Introduction
Consider a scenario where you are actively watching your favorite content on the Netflix application on your TV. Suddenly, in the middle of this content streaming, you see a black screen, and the control jumps back from content playback to the TV’s home screen. What just happened? This unexpected transition is the user-facing experience associated with a Netflix client application crash. An application crash is an unexpected termination of the application that can happen due to different reasons such as an application bug, the TV running out of system memory, or a TV bug while handling the application’s requested functionality. An application crash that occurs during its active usage by the user is classified as a high-user-impacting crash.
But are there any low-user-impacting application crashes? Well, sort of. Consider a situation where a user wants to switch away from the Netflix application and they press the home button on the TV remote. In this situation, under normal circumstances, the application moves to a background state that is not visible to a user. However, if the Netflix application were to crash during or after this transition to the TV home screen UI, the impact of this crash on the user experience would not be as significant as the one that occurs in the middle of active Netflix content streaming by the user. In the context of this article, let’s call such an application crash scenario a low-user-impacting crash.
This article’s key purpose is to capture the nuances associated with accurately classifying an application’s crash severity, which is vital for determining the priority of addressing these problems.
Netflix Application States
For us to understand the Netflix application crash severity identification rules, the first piece of context we need is the classification of Netflix application states on a streaming device like a TV, a Set-Top Box, or a Streaming Stick. At any point in time, the Netflix client application running on such devices can be in one of these states:
1. In Foreground, and Actively Used: The user is actively streaming Netflix content or browsing Netflix UI.
2. In Foreground, Actively Used but Out of Focus: The user activates the voice assistant on the device, shifting the focus to the voice assistant UI, which overlaps with the Netflix UI. The Netflix application is no longer in focus, but it is still visible.
3. In Transition to Foreground: The user has pressed the Netflix button on the TV remote or clicked on the Netflix icon, and the Netflix application is launching.
4. In Background, or Not Running: The user is using another application on the streaming device.
5. In Transition to Background: The user has taken action to exit the Netflix application, such as pressing the Home button or the entry button of another application.
While it is easy to classify application crashes occurring in the first three states (1, 2, and 3) as high-user-impacting and those occurring in the fourth state (4) as low-user-impacting, State 5, the transition to background, presents a more complex situation. The next section introduces the foundational concept of Netflix Application Resources, which is crucial for exploring the severity classification of crashes that occur during State 5.
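The state-based part of this classification can be sketched in a few lines of Python. This is only an illustration: the `AppState` enum, its member names, and `severity_from_state` are hypothetical and not taken from the Netflix client code.

```python
from enum import Enum
from typing import Optional


class AppState(Enum):
    # Hypothetical names for the five application states listed above.
    FOREGROUND_ACTIVE = 1
    FOREGROUND_OUT_OF_FOCUS = 2
    TRANSITION_TO_FOREGROUND = 3
    BACKGROUND_OR_NOT_RUNNING = 4
    TRANSITION_TO_BACKGROUND = 5


def severity_from_state(state: AppState) -> Optional[str]:
    """Return "high" or "low" when the state alone decides the severity."""
    if state in (
        AppState.FOREGROUND_ACTIVE,
        AppState.FOREGROUND_OUT_OF_FOCUS,
        AppState.TRANSITION_TO_FOREGROUND,
    ):
        return "high"
    if state is AppState.BACKGROUND_OR_NOT_RUNNING:
        return "low"
    # State 5 cannot be decided from the state alone; it needs the
    # resource-based rules described in the following sections.
    return None
```

Returning `None` for State 5 makes explicit that the state on its own is not enough to classify that case.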
Netflix Application Resources
Similar to any other streaming application, Netflix requires a set of basic functionalities (also known as ‘Platform Resources’) from a streaming device. These resources can be broadly classified into the following logical categories:
- Playback: Audio/Video decoders, Content Storage Memory, Playback Controls, etc.
- Graphics: Graphical element rendering and associated memory
- Visibility: Visibility of the Netflix application on the screen
- Focus: Is the Netflix application in focus, i.e., do key presses control the Netflix app?
- Digital Rights Management (DRM): Content decryption functionality
- Text To Speech: Text-to-speech audio for improved accessibility
- UI Audio: Audio for button clicks
- CPU: Application Code Processing cycles
- Network: Network access to download Netflix Content, UI and Other Assets
In the foreground state, the Netflix application “acquires” all of these resources, except for “Text To Speech,” which is acquired only when this feature is enabled on the streaming device. In the background state, however, it needs only “CPU” and “Network” for Netflix UI updates and keep-alive messaging.
Additionally, during different state transitions, the Netflix application may “release” or acquire various platform resources. For example:
1. When a user switches the TTS (text-to-speech) setting in the device UI, the device requests the Netflix application to acquire the “Text To Speech” resource and enable the TTS functionality.
2. When the user activates the device’s voice assistant on top of the Netflix application, the device requests that the Netflix application release the “Focus” resource, i.e., move away from consuming all key input events.
3. When a user action starts the transition of the Netflix application to the background, the device requests that the Netflix application release all resources except “CPU” and “Network,” i.e., drop all content and graphics memory, free up DRM resources, release graphical context, release AV decoders, and more. This transition is what we referred to as State 5 in the previous section.
To accurately identify whether Netflix application crashes impact users, we must differentiate transition examples 1 and 2 from the transition in example 3. In examples 1 and 2, users expect a stable Netflix application, so crashes in these scenarios would significantly affect the user experience. Conversely, in example 3, where the application is transitioning to the background, a crash would not significantly impact the user.
To make this distinction, we can classify the platform resources listed above as either “Background Transition Triggering Resources” or “Non-Background Transition Triggering Resources”:
- Background Transition Triggering Resources: The release of any of these resources only happens when the application is transitioning to background mode. These are: Playback, Graphics, Visibility, DRM, CPU, Network.
- Non-Background Transition Triggering Resources: The release of any of these resources can also happen when the application is not transitioning to the background. These are Focus, UI-Audio, and Text-to-Speech.
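As a minimal sketch, the two resource groups can be written down as sets and used to decide whether a given release request marks the start of a background transition. The `Resource` enum and the function name are illustrative assumptions, not actual Netflix client identifiers; the group membership follows the two lists above.

```python
from enum import Enum


class Resource(Enum):
    # Hypothetical identifiers for the platform resources listed earlier.
    PLAYBACK = "Playback"
    GRAPHICS = "Graphics"
    VISIBILITY = "Visibility"
    FOCUS = "Focus"
    DRM = "DRM"
    TEXT_TO_SPEECH = "Text To Speech"
    UI_AUDIO = "UI Audio"
    CPU = "CPU"
    NETWORK = "Network"


# Released only while the application is transitioning to the background.
BACKGROUND_TRIGGERING = frozenset({
    Resource.PLAYBACK, Resource.GRAPHICS, Resource.VISIBILITY,
    Resource.DRM, Resource.CPU, Resource.NETWORK,
})

# May also be released while the application stays in the foreground.
NON_BACKGROUND_TRIGGERING = frozenset({
    Resource.FOCUS, Resource.UI_AUDIO, Resource.TEXT_TO_SPEECH,
})


def starts_background_transition(released: Resource) -> bool:
    """True if a release request for this resource implies a background transition."""
    return released in BACKGROUND_TRIGGERING
```

For example, a release request for `Resource.FOCUS` (voice assistant overlay) does not count as a background transition, while one for `Resource.VISIBILITY` does.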
Crash Categorization Rules
Now that we have a basic understanding of the Netflix Application States and how the Streaming Device Platform Resources map to them, we can list the base set of rules for classifying a Netflix application crash as high-user-impacting or low-user-impacting:
- Rule # 1 — High-User-Impacting Crash: If a Netflix application crash occurs when all of the Background Transition Triggering Resources are in the “acquired” state by the Netflix application and the device platform has not requested the “release” of any of these resources, then such a crash is classified as high-user-impacting.
- Rule # 2 — Low-User-Impacting Crash: If the device platform issues a request to “release” any of the Background Transition Triggering Resources to the Netflix application, then irrespective of whether that request is completed or not by the Netflix application, a crash that occurs after this initial device platform generated request would be classified as a low-user-impacting crash. In other words, if the Netflix application has started its transition from foreground to background, then any crashes after the start of this transition would be considered low-user-impacting crashes.
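The two base rules can be sketched as a small function. This is a simplified illustration under stated assumptions: resources are represented as plain strings, the function and parameter names are hypothetical, and the `None` result for cases the two rules do not cover is a choice made for this sketch rather than documented behavior.

```python
from typing import Optional, Set

# Resource names as listed in the article; the string representation is an assumption.
BACKGROUND_TRIGGERING = frozenset(
    {"Playback", "Graphics", "Visibility", "DRM", "CPU", "Network"}
)


def classify_crash(acquired: Set[str], release_requested: Set[str]) -> Optional[str]:
    """Apply Rule 1 and Rule 2 to the resource state observed at crash time.

    acquired: resources the application currently holds.
    release_requested: resources the platform has asked the application to
        release, whether or not the release has completed.
    """
    # Rule 2: any release request for a background-triggering resource means
    # the foreground-to-background transition has already started.
    if release_requested & BACKGROUND_TRIGGERING:
        return "low"
    # Rule 1: everything is still acquired and no release was requested.
    if BACKGROUND_TRIGGERING <= acquired:
        return "high"
    # Not covered by the two base rules.
    return None
```

Note that a release request for “Focus” alone (the voice assistant case) leaves the result at `"high"`, because “Focus” is not a Background Transition Triggering Resource.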
Please note that it is still vital to address the low-user-impacting crashes, as they affect the application’s next launch performance. However, their severity is not at the same level as the crashes that occur during the application’s active usage.
There is one caveat to the above two rules. These rules assume that the underlying device platform behaves correctly and does not take away any of these resources from the Netflix application until the Netflix application has signaled the “release” completion of the platform-requested resource.
However, this does not always happen in practice. For example, when the user presses the Home button on a TV remote, in the interest of expediting this user request, the device platform might show the Home screen first, even before requesting the “release” of any background-triggering resources. If a Netflix application crash were to happen before these release requests arrive, we might still end up incorrectly classifying this crash as high-user-impacting.
To accommodate this situation, after a crash, we query the device platform to determine whether, according to the platform logic, the Netflix application was user-facing at the time of the last crash. Since the device platform knows that the user wanted to move away from the Netflix application, it will report the crash type as a background crash, indicating that it was not user-impacting.
Now one might ask, why not just let the device platform categorize and indicate if the crash is high-user-impacting or low-user-impacting? That is a reasonable question. But in reality, not all device platforms are at the same level of sophistication to provide this consistent categorization. Only some device platforms have the capability to manage sophisticated application states to provide this level of nuanced information. Hence, the Netflix internal rule set for crash categorization is necessary for a “best effort” generic approach that is applicable across our diverse device base.
Recent Crash Categorization Improvements at Netflix
Although the sections above describe our current crash categorization logic, we did not follow this rule set until very recently. Prior to this shift, we categorized only those application crashes that occurred after the release of the “visibility” resource as low-user-impacting. That approach created false positives, marking low-user-impacting crashes as high-user-impacting, because the application’s background transition starts with the device platform’s request for the first Background Transition Triggering Resource, which might or might not be “visibility”.
To implement this transition, we added two new marker logs within our client application: the “Background Transition Start” and the “Background Transition End.” Our client application sends the first log event upon receiving the device platform request for the release of any background-triggering resource. The second log event is sent when all the requested background-triggering resources have been released.
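A minimal sketch of how these two marker logs might be emitted is shown below. The class, method names, and string-based resource representation are hypothetical. The sketch also assumes that the platform issues all of its release requests before the last pending release completes; if requests and completions interleave differently, this version would emit more than one start/end pair.

```python
from typing import Callable, Set

# Resource names as listed in the article; the string representation is an assumption.
BACKGROUND_TRIGGERING = frozenset(
    {"Playback", "Graphics", "Visibility", "DRM", "CPU", "Network"}
)


class BackgroundTransitionTracker:
    """Emits the two marker logs around a background transition (illustrative only)."""

    def __init__(self, log: Callable[[str], None]) -> None:
        self._log = log                # hypothetical logging callback
        self._pending: Set[str] = set()  # requested but not yet released

    def on_release_requested(self, resource: str) -> None:
        if resource not in BACKGROUND_TRIGGERING:
            return  # e.g. "Focus": not part of a background transition
        if not self._pending:
            # First background-triggering release request: transition starts.
            self._log("Background Transition Start")
        self._pending.add(resource)

    def on_release_completed(self, resource: str) -> None:
        if resource not in self._pending:
            return
        self._pending.discard(resource)
        if not self._pending:
            # Every requested background-triggering resource has been released.
            self._log("Background Transition End")
```

A crash log that arrives between the two markers, or after the first marker with no second one, can then be sequenced against them in backend analysis.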
Given our ability to correctly sequence the above two logs along with crash event logs from the Netflix client application within our backend data analysis framework, we are now able to correctly reclassify a significant number of “false positive high-user-impacting crashes” as low-user-impacting. Below is a sample sequence with such classification:
The following sections focus on the data analysis aspects and the final outcomes from this workstream.
Data Analysis
From a data perspective, we are now computing an additional intermediate flag which is then stored in our crash data tables. This flag is crucial for the final rule set, which determines whether a crash is high-user-impacting or not. This intermediate flag is called Crash After Background Transition Start (CABTS).
Crash after Background Transition Start (CABTS)
The CABTS flag is a boolean indicator that specifies whether an application crash occurred after the application had begun its background transition. In a crash session, if the crash happened after the start of the most recent background transition, then this flag will be set to true; otherwise, it will be false. A true value for this flag suggests that the crash is not significantly impacting the user experience since the background transition action would have masked the crash.
While this concept seems straightforward, it becomes more complex when a resume event occurs within the crash session.
The above scenario involves a crash session in which several events, such as background transition start, resume, and crash, occur in a specific sequence.
In the first situation, even though the crash happens after the latest background transition starts, the presence of a resume event before the crash means that the CABTS flag is set to false. This is because the resume event would bring the app to the foreground, making the crash visible to the user.
In the second situation, since there is no resume event between the last background transition start and the crash, the CABTS flag is set to true. This indicates that the crash occurred while the app was transitioning to the background, making it not high-user-impacting.
In order to normalize these diverse sequences and create an actionable data set, we removed all events that occurred before the latest resume event in a crash session. The events prior to the last resume event are irrelevant to the determination of the CABTS flag and do not contribute to the decision-making process. This creates simple situations as shown below:
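The normalization and flag computation described above can be sketched as follows. The event names and the function are hypothetical, and the sketch assumes the events of a crash session are already in chronological order; the real log schema and pipeline are not described in this article.

```python
from typing import List

# Hypothetical event names for one crash session.
BACKGROUND_TRANSITION_START = "background_transition_start"
RESUME = "resume"
CRASH = "crash"


def compute_cabts(events: List[str]) -> bool:
    """Compute Crash After Background Transition Start for one crash session.

    events: chronologically ordered event names for the session.
    """
    if CRASH not in events:
        return False
    # Only events that happened before the crash are relevant.
    before_crash = events[: events.index(CRASH)]
    # Drop everything up to and including the latest resume event, since a
    # resume brings the application back to the foreground.
    if RESUME in before_crash:
        last_resume = len(before_crash) - 1 - before_crash[::-1].index(RESUME)
        before_crash = before_crash[last_resume + 1:]
    # The flag is true only if a background transition started after that point.
    return BACKGROUND_TRANSITION_START in before_crash
```

With this sketch, the sequence start, resume, crash yields `False`, while start, resume, start, crash yields `True`, matching the two situations discussed above.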
Computing the Final High-User-Impacting Crash flag
Below is the final set of rules, in execution order, that we use to identify whether an app crash was High-User-Impacting:
1. If the device platform indicates that the last crash was triggered by a user action to move out of the application or due to a system update, then High-User-Impacting Crash = False.
2. If the device platform does not provide crash severity AND all the Background Triggering Resources are in the Acquired State, i.e., the application is in the foreground, and a crash happens during that state, then High-User-Impacting Crash = True.
3. If all Background Triggering Resources are in Acquired State, i.e., the application is in the foreground, and a crash happens during that state, then High-User-Impacting Crash = True.
4. If Crash After Background Transition Start, i.e. CABTS = true, then High-User-Impacting Crash = False.
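Evaluated in order, these rules can be sketched as a single function. The parameter names and the `platform_reason` values are hypothetical, and returning `None` when no rule matches is an assumption of this sketch, since the article does not state a default.

```python
from typing import Optional


def is_high_user_impacting(
    platform_reason: Optional[str],
    all_background_resources_acquired: bool,
    cabts: bool,
) -> Optional[bool]:
    """Apply the final rule set in execution order.

    platform_reason: hypothetical crash reason reported by the device
        platform, or None if the platform provides no crash severity.
    all_background_resources_acquired: True if every Background Triggering
        Resource was in the Acquired State at crash time.
    cabts: the Crash After Background Transition Start flag.
    """
    # The platform reports a user-initiated exit or a system update.
    if platform_reason in ("user_exit", "system_update"):
        return False
    # No platform severity, and the application was in the foreground.
    if platform_reason is None and all_background_resources_acquired:
        return True
    # The application was in the foreground.
    if all_background_resources_acquired:
        return True
    # The crash happened after a background transition had started.
    if cabts:
        return False
    # No rule matched; the article does not specify this case.
    return None
```

Because the rules run in order, a platform-reported user exit takes precedence over the resource state, which covers the case where the platform shows the Home screen before requesting any resource release.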
In-Field Data Improvements Observed
Now, let’s move on to the exciting part. Until now, we’ve discussed framework changes to enhance the accuracy of detecting high-user-impacting crashes, but what do the in-field results show?
As illustrated in the graph below, after implementing the updated crash detection logic, the Netflix application’s crash rate for high-user-impacting crashes (as detected by us) has decreased by approximately 25% on key device platforms with large in-field populations. This decrease reflects the filtering out of false positive high-user-impacting crashes that occurred during the application’s transition from foreground to background.
Additionally, we are observing other benefits following this rollout, including more accurate in-field AB test results, reduced monitoring alerts, and better correlation with Customer Service data.
Conclusion
In this article, we have outlined an approach to enhance the detection accuracy of high-user-impacting crashes in the Netflix application. While our focus has been on app crashes, this classification method is versatile and can be applied to any user-facing issue monitored and reported by our client application. For instance, it can be used to precisely analyze the user impact of playback errors, playback quality degradation, application navigation issues, and more. Although this internal analysis has not directly improved the user experience, it provides our development teams with accurate insights that will be used to measure the real-world impact of future features and adjustments, ultimately helping us to create a positive user experience.