Everyone hates waiting, right? Waiting can be especially maddening when using an application on a slow network. Not all of the 100M+ Netflix users around the world live in areas with access to reliable, high-speed internet. The Android application team here at Netflix has been working on several projects to improve performance for users who access our service on networks of varying quality. One such project is backing our in-memory Falcor model with a persistent, disk-based cache. This significantly improves the experience for users on high-latency or low-bandwidth networks, since much of the data is already on the device and immediately available, with no waiting for data from the server.

This diagram illustrates the Falcor system with the new persistent cache.

Cache Behavior

How should a Falcor cache work? A web browser cache provides good examples to emulate: the ability to specify an expiration for each entry (which Falcor supports via the $expires property) and an LRU policy, so that data that hasn’t been viewed in a long time is removed to make room for new data. However, some data shouldn’t participate in LRU since it is central to referencing other data. In our case, this is the grid of references to movies and shows that makes up the home and genre views.

Rather than come up with a complex cache replacement policy that supports prioritization, we instead created two caches: a smaller one that contains the lists and relies solely on expiration for removing entries, and a larger LRU-based cache that contains all the leaf nodes related to a video.
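The split can be sketched as follows. This is a minimal in-memory illustration, not the production code: the class and method names are hypothetical, expirations are assumed to be absolute timestamps in milliseconds, and an access-ordered LinkedHashMap stands in for the disk-backed DiskLruCache.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-cache layout; names and sizes are illustrative only. */
final class TwoTierCache {

    /** A cached payload plus the absolute time (epoch millis) at which it expires. */
    static final class CachedNode {
        final String payload;
        final long expiresAtMillis;

        CachedNode(String payload, long expiresAtMillis) {
            this.payload = payload;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Small cache for the lists of references: entries leave only by expiring.
    private final Map<String, CachedNode> lists = new HashMap<>();

    // Larger cache for leaf nodes: access-ordered, so the eldest entry is the
    // least recently used one and is dropped once the cap is exceeded.
    private final Map<String, CachedNode> leaves;

    TwoTierCache(final int maxLeaves) {
        this.leaves = new LinkedHashMap<String, CachedNode>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedNode> eldest) {
                return size() > maxLeaves;
            }
        };
    }

    void putList(String path, String payload, long expiresAtMillis) {
        lists.put(path, new CachedNode(payload, expiresAtMillis));
    }

    void putLeaf(String path, String payload, long expiresAtMillis) {
        leaves.put(path, new CachedNode(payload, expiresAtMillis));
    }

    /** Returns the payload, or null if the path is missing or has expired. */
    String get(String path, long nowMillis) {
        Map<String, CachedNode> owner = lists.containsKey(path) ? lists : leaves;
        // For the leaf cache, this lookup also marks the entry as recently used.
        CachedNode node = owner.get(path);
        if (node == null) {
            return null;
        }
        if (node.expiresAtMillis <= nowMillis) {
            owner.remove(path); // expired entries are dropped on read
            return null;
        }
        return node.payload;
    }
}
```

Keeping the lists in their own map means a burst of leaf-node reads can never push the home or genre grid out of the cache.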

For the LRU algorithm, we use a modified version of Jake Wharton’s DiskLruCache rather than implement it from scratch.

What DB should we use?

Our DB needs are fairly simple: the table is limited to 100,000 rows by the LRU and the data is not relational, so we don’t need joins. DB read performance, however, is a key consideration. We chose Realm DB because it is at the front of the pack for read performance, and because we already use it for our downloads feature, so there is no cost of importing additional code. A quick analysis verified that it meets the query performance requirements.

The key for each row in the table is the Falcor path to an individual leaf node. This solution works well, since Falcor already expands a resource request into the individual paths for each leaf node (e.g., [“videos”, 123, [“summary”, “detail”]] becomes [“videos”, 123, “summary”] and [“videos”, 123, “detail”]).
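To make the expansion concrete, here is a small sketch of the idea. It is not Falcor's implementation: PathExpander and expand are hypothetical names, and the sketch handles only plain keys and arrays of keys, not the other forms a Falcor path set can take, such as ranges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that expands a path set into one path per leaf node. */
final class PathExpander {

    /**
     * Each segment is either a single key (String or Integer) or a List of keys.
     * The result is the cross product of all segments, in order.
     */
    static List<List<Object>> expand(List<?> pathSet) {
        List<List<Object>> paths = new ArrayList<>();
        paths.add(new ArrayList<>()); // start with one empty path

        for (Object segment : pathSet) {
            // Treat a single key as a one-element list of keys.
            List<?> keys = (segment instanceof List)
                    ? (List<?>) segment
                    : Collections.singletonList(segment);

            List<List<Object>> next = new ArrayList<>();
            for (List<Object> prefix : paths) {
                for (Object key : keys) {
                    List<Object> path = new ArrayList<>(prefix);
                    path.add(key);
                    next.add(path);
                }
            }
            paths = next;
        }
        return paths;
    }
}
```

Called with the path set from the example above, expand returns two paths, one ending in "summary" and one ending in "detail", and each would become the key of its own row.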

The other columns in the table are:

  • payload — The JSON string received for the given path. By storing the payload in JSON format, we don’t have to worry about creating columns for every possible property within each node of the various Falcor paths.
  • expiry and sentinel — These are metadata from the Falcor response that are common to all responses and needed for queries (such as finding expired nodes).

Our RealmModel, therefore, looks something like this:
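A minimal sketch of such a model, based only on the columns described above. The class name, field names, and field types are assumptions: expiry is assumed to be an absolute time in milliseconds and sentinel is assumed to hold the Falcor value type. The Realm-specific pieces appear only as comments so the class compiles without the Realm library.

```java
/**
 * Hypothetical cache row. With Realm, this class would extend RealmObject
 * (or implement RealmModel) and `path` would be annotated with @PrimaryKey.
 */
class FalcorLeafNode {
    private String path;     // row key: the Falcor path to one leaf node
    private String payload;  // JSON string received for that path
    private long expiry;     // assumed: absolute expiration time, epoch millis
    private String sentinel; // assumed: Falcor value type, e.g. "atom"

    // Realm models need a public no-argument constructor.
    public FalcorLeafNode() {}

    public FalcorLeafNode(String path, String payload, long expiry, String sentinel) {
        this.path = path;
        this.payload = payload;
        this.expiry = expiry;
        this.sentinel = sentinel;
    }

    public String getPath() { return path; }
    public String getPayload() { return payload; }
    public long getExpiry() { return expiry; }
    public String getSentinel() { return sentinel; }

    /** True once the node's expiration time has been reached. */
    public boolean isExpired(long nowMillis) {
        return expiry <= nowMillis;
    }
}
```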

Performance

One general issue that kept surfacing as the team bounced ideas back and forth was performance. What sort of tricks might we need to implement to ensure that data storage and retrieval are acceptably fast? We receive the payload from the server as JSON strings, which are not especially efficient to decode in Java, so storing JSON might not be fast enough. One suggestion was to use a format that is deserialized more efficiently, such as FlatBuffers. In general this is interesting since FlatBuffers decodes much faster in Java than JSON does, and just about every DB supports some form of byte array in which to store it.

However, it turned out that JSON parsing was not a significant performance bottleneck. While not incorporated as a requirement for this feature, we’re still keeping this idea on the back burner since FlatBuffers has better decode performance characteristics (on both CPU and memory) than JSON.

Volatile Data

The final issue was dealing with leaf nodes that contain volatile data. We define non-volatile data as data that never changes. Examples are: the title of the movie or show, its length in minutes, etc. Volatile data, on the other hand, can change for a variety of reasons. Examples of volatile data are: a user’s thumb ratings, or the bookmark indicating the position in the video at which to continue playback. We can easily handle the case when the change is due to the user interacting with our Android application; we know what changed and can update the cache. The problem is when data changes for other reasons, for example, when a new season of a show is released.

The volatile data can be grouped into three types:

  1. Data that has a known time at which it will change, such as whether a video is within its license window
  2. Changes caused by the user while on another device (e.g., TV or web)
  3. Data that may change at an unknown time. The percentage match rating is an example of this; it may change over time based on other shows that a user watches and rates.

The first type is easiest to handle. These nodes simply include an expiration that removes the node from the cache at the appropriate time, requiring a network request to retrieve an updated node.

For the second type — changes caused by the user — we leverage push notifications to keep the data updated. Since the user caused the change, we can reasonably assume the user is interested in seeing the change reflected in the application.

The last type is like Schrödinger’s cat. The server is the box, since it is the source of truth. We won’t know the true value until we open the box again (that is, fetch the latest data from the server). This is also the fallback for cases when a push notification is not received.

The solution for this last type is to provisionally display the cached leaf node and check the time when the node was received from the server. If it has aged enough, we fire off a request to update it; otherwise, we assume it is still correct.
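That decision can be sketched in a few lines. The names here are hypothetical, and the age threshold is a parameter because the text does not say what counts as "aged enough".

```java
/** Hypothetical refresh policy for leaf nodes that may change at an unknown time. */
final class RefreshPolicy {

    enum Action {
        USE_CACHED,             // show the cached node and do nothing else
        USE_CACHED_AND_REFRESH  // show the cached node and request a fresh copy
    }

    /**
     * The cached node is displayed in both cases; the only question is whether
     * it is old enough to justify a background request to the server.
     */
    static Action decide(long receivedAtMillis, long nowMillis, long maxAgeMillis) {
        long ageMillis = nowMillis - receivedAtMillis;
        return ageMillis >= maxAgeMillis
                ? Action.USE_CACHED_AND_REFRESH
                : Action.USE_CACHED;
    }
}
```

Because the cached value is shown either way, the user never waits on the network for this kind of data; the refresh only affects what is shown next time.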

Epilogue

Adding a persistence layer to the Falcor cache adds some complexity (as would any feature), but, just like a browser cache, it improves the user experience.

As with virtually every feature we implement at Netflix, this persistent cache is being A/B tested to ensure it actually benefits users. The early test results are very promising, especially for users on slow networks.

The other exciting thing about this feature is that it opens up many future tests, allowing us to push data into the cache, such as the videos that are highest ranked for each user.

Author: Ed Ballot
