Netflix’s Media Landscape Evolution: From 3–2–1 to Cloud Storage Optimization

Netflix Technology Blog
11 min read · Apr 2, 2024


by Esha Palta, Vinay Kawade, Ankur Khetrapal, Meenakshi Jindal, Peijie Hu, Dongdong Wu, Avinash Dathathri

Introduction

Netflix produces many titles every year, and each production generates a wide range of content assets, including image sequences, videos, and text. These assets are securely stored in the Amazon S3 Storage Service. A significant portion of this data is accessed only temporarily, during production, and most assets are intended to reside in the active or ‘hot’ storage tier only until the respective title launches. This blog explores how harnessing user access patterns helped us optimize storage efficiency and cost-effectiveness. Along the way, we present a cost analysis of lifecycle policies, examining the cost-effectiveness of various archival and purge strategies tailored to different AWS storage tiers.

Overview of Netflix’s Content Production and Storage Practices

In the dynamic landscape of content production, traditional studios have long adhered to the tried-and-true 3-2-1 rule: maintain three copies of original camera footage and audio, stored on at least two different types of media, with one backup kept offsite.

Let’s zoom into the active departments in a production lifecycle to understand the scale of media storage and backup. LTO tape backups are a constant presence during the production and post-production lifecycle.

Fig 1: Production workflow

Once media is extracted from set cameras and sound recorders, it’s transformed into disk files and manipulated using various department-specific tools, including Editorial, Sound and Music, Visual Effects (VFX), and Picture Finishing. The media undergoes comprehensive backup routines at every stage of this process. This method, based on physical backups and archives, is designed to mitigate the risks of accidental deletion, vendor-specific errors, and natural disasters. Data loss could lead to considerable costs during both the planning and shooting phases, which is why these backup processes play such a crucial role.

The advent of modern cloud storage systems marks a paradigm shift in storage practices. Platforms like AWS S3 offer heightened resilience and availability, boasting eleven nines of data durability. This evolution has empowered media companies like Netflix to redefine their storage methodologies, leveraging tools such as data archival and purge policies. Media captured from cameras and sound systems on set is uploaded directly into Netflix’s cloud storage from a proximate data center facility, negating the need for physical backups. After this upload, stages like Dailies, Editorial, Visual Effects, and Picture Finishing can download and modify the media, then transmit the final versions back into cloud storage. This structure facilitates tracking, access, control, and scalability aligned with content production.

Moreover, integrating this with data lifecycle policies can result in storage cost savings and diminish the energy demands of data administration, thereby reducing the carbon footprint. Implementing these practices can help an enterprise reduce its energy consumption and, consequently, its environmental impact while enhancing operational efficiency and cost-effectiveness.

Organizations incorporating these strategies contribute to sustainability and increase their agility and readiness to manage data growth more proficiently.

Leveraging User Access Patterns

At the core of Netflix’s media content orchestration lies the Centralized Asset Management Platform (AMP), a robust system dedicated to persisting and discovering all media assets crafted throughout the production and post-production phases. This centralized hub helps streamline the management and accessibility of our extensive media content library. It also empowers us to extract valuable insights and generate detailed reports on the usage and access patterns of our media assets. This bird’s-eye view enhances our understanding of how our content is utilized, laying the groundwork for informed decision-making.

To validate our hypothesis, we examined the rate of asset access by both users and applications at different intervals relative to the launch date. The findings were illuminating, revealing a significant drop in asset usage after the associated title launched. The insights from these access patterns follow:

Fig 2: Asset access pattern before and after title launch

This trend was found to be fairly consistent across multiple asset types. Given the infrequent post-launch access to assets, a compelling opportunity arises to optimize storage costs. For instance, assets could be archived six months after their initial launch. This policy is designed to evolve intelligently and become more accurate as we accumulate additional data over time.
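As an illustration, the bucketing behind a chart like Fig 2 can be sketched in a few lines of pandas. The file and column names below are illustrative only, not our actual telemetry schema:

```python
# Illustrative sketch: bucket asset-access events by days since title launch
# to surface the drop-off shown in Fig 2. File and column names are assumed.
import pandas as pd

# Each row: one access event, joined against the title's launch date.
events = pd.read_csv("asset_access_events.csv",
                     parse_dates=["accessed_at", "launch_date"])

events["days_since_launch"] = (events["accessed_at"] - events["launch_date"]).dt.days
bins = [-180, -90, -30, 0, 30, 90, 180, 365]
labels = ["-180..-90", "-90..-30", "-30..0", "0..30", "30..90", "90..180", "180..365"]
events["window"] = pd.cut(events["days_since_launch"], bins=bins, labels=labels)

# Access rate per window, normalized by distinct assets so that
# pre- and post-launch periods are comparable.
rate = (events.groupby("window", observed=True)
              .agg(accesses=("asset_id", "size"), assets=("asset_id", "nunique")))
rate["accesses_per_asset"] = rate["accesses"] / rate["assets"]
print(rate)
```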

Given that assets are accessed very infrequently after launch, there is a significant opportunity to move archive-ready assets to AWS Glacier storage, which offers 3–5 hour retrieval times for 60% lower monthly storage costs.
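As an illustration of the mechanics, a single S3 object can be re-tiered with an in-place copy that rewrites its storage class. The boto3 snippet below is a minimal sketch with placeholder bucket and key names, not the actual implementation inside our storage layer:

```python
# Minimal sketch: re-tier one S3 object to Glacier Flexible Retrieval via an
# in-place copy. Bucket and key are placeholders. Note: copy_object supports
# objects up to 5 GB; larger objects need a multipart copy.
import boto3

s3 = boto3.client("s3")
bucket, key = "studio-media-assets", "title-123/vfx/shot-042.exr"

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="GLACIER",        # Glacier Flexible Retrieval
    MetadataDirective="COPY",      # keep the existing object metadata
)
```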

Opportunity Size

Our hypothesis is that there’s huge potential for cost savings if we archive assets to cheaper storage. Our finance and data partners helped us develop different models and estimates based on our current data footprint, growth rate, and future business use cases.

Fig 3: Media Storage cost analysis

Status Quo: if we stay the course, the estimated cost burden will increase dramatically.

Short-term: archiving underutilized assets to lower-cost S3 Glacier Flexible Retrieval storage will yield more than 50% savings.

Long-term: more savings can be achieved by applying more granular lifecycle policies to more types of assets across the different stages of the asset and title lifecycle, including the production, post-production, and post-launch phases.

Data Lifecycle Strategy

How do we decide the storage tier for a given piece of data? For the first phase of data lifecycle management, we chose a simple strategy that relies on usage patterns. This usage-based approach is uncomplicated yet yields significant value: assets migrate to lower-cost archive storage a specific period after the show launches on the service, while temporary data is purged after the launch date, optimizing storage resources and minimizing costs.
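To make the idea concrete, a usage-based lifecycle policy can be expressed declaratively as a rule tying an action to an offset from a lifecycle milestone. The sketch below is purely illustrative; the field names are invented for this example, not our Policy Manager’s schema:

```python
# Illustrative policy shape: an action fires a fixed number of days after a
# lifecycle milestone. All names here are invented for this sketch.
from dataclasses import dataclass

@dataclass(frozen=True)
class LifecyclePolicy:
    asset_type: str        # e.g. "final_render", "temp_proxy"
    action: str            # "archive" or "purge"
    trigger_event: str     # milestone the offset counts from
    days_after_event: int  # grace period before the action fires

POLICIES = [
    LifecyclePolicy("final_render", "archive", "title_launch", 180),
    LifecyclePolicy("temp_proxy",   "purge",   "title_launch", 30),
]

def due_actions(asset_type: str, days_since_launch: int) -> list[str]:
    """Return the lifecycle actions that have come due for an asset."""
    return [p.action for p in POLICIES
            if p.asset_type == asset_type
            and p.trigger_event == "title_launch"
            and days_since_launch >= p.days_after_event]

print(due_actions("temp_proxy", 45))   # ['purge']
```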

Evaluating Storage Lifecycle Management Options

In our content-centric domain, Amazon S3 is the bedrock for hosting our extensive media content. Yet we’ve gone beyond the basics, constructing a robust and highly scalable storage infrastructure layer atop S3. This framework is designed to ensure reliable, secure, and efficient storage, organization, tracking, and access control for our vast media files, meeting the demands of our globally distributed studio.

The asset archiving process involves strategically moving data that is not regularly accessed but remains necessary for future reference, from energy-intensive primary storage environments to more energy-efficient, low-cost archives. These archival policies trim energy consumption and help prolong the lifespan of high-performance storage devices. Conversely, purge (i.e., deletion) policies focus on systematically deleting outdated or redundant data, reducing storage space requirements and consequently lowering the energy consumption of data centers.

While looking for archival solutions, we evaluated AWS S3 Intelligent-Tiering. While S3 Intelligent-Tiering is an excellent off-the-shelf solution, it lacks the custom fine-tuning we wanted for our data. From the access-pattern statistics, we derived a much richer and more detailed picture of how the various players in the post-production phase access our data. With this knowledge we could do much better: aggressively archive petabytes of data into cheaper storage classes rather than wait out the default 30 days of monitoring that S3 Intelligent-Tiering requires before moving objects into more affordable storage classes. We could also move data into the deep archival tier sooner and save more. The same knowledge has helped us purge data more aggressively and save even more.
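For contrast, here is what the off-the-shelf route looks like: a native S3 lifecycle rule, set via boto3, that transitions objects on fixed age thresholds per prefix, regardless of how the objects are actually accessed. The bucket name and prefix are placeholders:

```python
# Native S3 lifecycle rule: transition objects purely on age, per prefix.
# Bucket name and prefix are placeholders for this sketch.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="studio-media-assets",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-after-180-days",
            "Filter": {"Prefix": "finished-titles/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 180, "StorageClass": "GLACIER"},        # Flexible Retrieval
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```

Rules like this apply one clock to every object under a prefix; our access-pattern data lets us be more aggressive per asset type instead.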

On top of S3, our Storage Infrastructure Layer can move data between different storage classes based on two inputs:

  1. Policies dictated by end-user client applications.
  2. Policies we have derived from our Access Pattern Statistics.

Storage Lifecycle Service Architecture

Let’s take a quick look at the architecture of storage lifecycle management at Netflix. In the context of Netflix’s content creation, an asset represents a collection of media files produced during production. At a high level, the storage lifecycle architecture consists of the following components.

Fig 4: Storage Lifecycle Manager Architecture

  • Asset Management Platform stores and manages media asset metadata for the production and post-production phases at Netflix. The number of files per asset ranges anywhere from one to a million or more.
  • Policy Manager manages the lifecycle policies of storage objects, including assets and temporary media files. This service supports various policy-based automated and ad-hoc operations to manage file storage efficiently.
  • Content Drive provides a centralized, secure, highly scalable solution to effectively track, store, organize, manage, and control access and transfers for these vast assets using conventional file system interfaces. It is the ultimate source of truth for the state of media files. Media assets range from temporary to high priority, and their importance shifts over time and across versions. Content Drive enriches our understanding through data, offering insights into asset access patterns, production-specific file counts, file sizes, and user interactions. This wealth of information enables the definition and attachment of lifecycle policies to production assets, optimizing storage footprint and cost.
  • The Storage Lifecycle Manager handles workflows and orchestrates any storage lifecycle tasks. The primary workflows include:
  • Archival of files managed by Content Drive to any AWS archive storage tier.
  • Restoration of any archived file; the restored file is available in the S3 Standard tier for a specified time period before being re-archived (see the sketch after this list).
  • Purge of any file managed by the Content Drive. This is a hard delete.
  • S3 Object Manager is an abstraction layer built on top of S3 to optimize media workflows.
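The restoration workflow above maps onto S3’s restore semantics for archived objects: the object is staged back into a readable tier for a limited number of days, after which it reverts to its archive class. A minimal boto3 sketch, with placeholder names, not our S3 Object Manager’s actual code:

```python
# Minimal sketch: stage an archived object back into a readable tier for a
# limited window. Bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.restore_object(
    Bucket="studio-media-assets",
    Key="title-123/vfx/shot-042.exr",
    RestoreRequest={
        "Days": 7,                                    # temporary availability window
        "GlacierJobParameters": {"Tier": "Standard"}  # ~3-5 hour retrieval
    },
)
# Poll head_object until the Restore header reports ongoing-request="false".
```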

In this post, we cover a few components at a high level and leave the detailed architecture and flows for a future post in this series.

Design Principles

  1. Scalability
  • Designed to handle billions of files, making all storage objects eligible for archival or purge.
  • Can manage thundering-herd requests, addressing scenarios such as mass archival after a show concludes.

  2. Resiliency

  • Ensures data integrity of storage metadata and employs anti-entropy mechanisms.
  • At-least-once guarantees for activity reporting, bolstering reliability.

  3. Durability

  • Guarantees on Archive/Restore/Purge tasks, including retry mechanisms in case of downstream failures (sketched at the end of this section).
  • Evolves to handle purge requirements of tens of millions of files per day.

  4. Security

  • Authorizes only designated applications to trigger lifecycle tasks, maintaining a secure environment.
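The durability principle above leans on retries. Here is a minimal sketch of the kind of capped exponential backoff with jitter such a retry mechanism implies; the limits and the task callable are illustrative, not our production values:

```python
# Illustrative sketch: retry a downstream lifecycle task with capped
# exponential backoff and jitter before surfacing it as failed.
import random
import time

def run_with_retries(task, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Execute `task()`, retrying transient failures with capped backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:                  # in practice: catch only transient errors
            if attempt == max_attempts:
                raise                      # surface as a failed task
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids herds
```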

At a high level, the asset lifecycle workflow looks like the following:

  • Cloud applications upload assets to storage backends.
  • Product managers and application owners use Policy Manager APIs to define data lifecycle policies for asset types (for example, purge temporary assets 180 days after the title launch date, or archive final assets 30 days after production wraps).
  • The Policy Manager provides the capability to create and manage policies for assets. Its policy engine evaluates policy definitions, decides which assets are eligible for a lifecycle operation, and queues tasks for policy execution workers. The workers then schedule (create) archive/restore/purge tasks with the Content Drive.
  • On policy execution, the Content Drive schedules the execution of lifecycle operations with the Storage Lifecycle Manager.
  • All lifecycle tasks are persisted in the Storage Lifecycle Manager.
  • Each Storage Lifecycle Manager instance evaluates the tasks eligible for execution and sends execute-lifecycle events to the Content Drive. The Content Drive updates its metadata state to handle conflicting operation requests on media files, collects the list of eligible objects per task, and sends them to the Storage Lifecycle Manager.
  • To achieve the required throughput, async operations are uniformly distributed across all nodes in the cluster, with only one node working on a particular task. Sharing the workload gives us parallelism and avoids database transaction conflicts. Kafka is used to send async operations to the node that owns a task, Kafka’s leader election is used to establish ownership evenly across the cluster, and Kafka provides at-least-once messaging and durability guarantees (see the consumer sketch after this list).
  • The Storage Lifecycle Manager uses the S3 Object Manager to move or delete objects between storage tiers concurrently and waits for async completion or failure events.
  • The Storage Lifecycle Manager monitors task completion and generates a task-completion event. Failed tasks are retried several times before the task status is marked failed.
  • The S3 Object Manager handles task-completion events and translates them into metadata state updates for media files and folders, followed by a lifecycle-change completion event for media assets. Studio applications subscribe to the completion events generated by the Storage Lifecycle Manager and send business updates to end users.
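The single-owner task execution described above can be sketched with a Kafka consumer group: keying each task by asset ID pins it to one partition, and the group’s assignment gives each partition exactly one active consumer, so at-least-once delivery falls out of committing offsets only after the work succeeds. The topic, group ID, and task format below are invented for this sketch, not our production setup:

```python
# Illustrative sketch: single-owner lifecycle task processing with a Kafka
# consumer group. Topic, group id, and task format are assumptions.
import json
from confluent_kafka import Consumer

def execute_lifecycle_task(task: dict) -> None:
    """Placeholder for the actual archive/restore/purge call."""
    print("executing", task)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",        # placeholder broker address
    "group.id": "lifecycle-workers",
    "enable.auto.commit": False,              # commit only after success
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["lifecycle-tasks"])       # tasks keyed by asset id

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    task = json.loads(msg.value())            # e.g. {"op": "archive", "asset_id": "..."}
    execute_lifecycle_task(task)
    consumer.commit(msg)                      # at-least-once: commit after the work
```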

Storage Archival Stats

We have started the automation effort for storage lifecycle management with the Policy Manager in production. Here are a few initial examples of what the storage lifecycle dashboards look like:

Fig 5: No. of Files archived per day in millions

Fig 6: Files Archived in a backfill job

Netflix’s Storage Lifecycle Evolution

With existing usage patterns and the complexities of content storage and preservation, Netflix aims to pave the way for a more efficient, cost-effective, and sustainable approach to managing its extensive library of media assets. In the near term, we plan to enhance the Policy Manager to automate data lifecycle policies for all media storage assets and temporary files. This strategic move enhances operational efficiency and aligns with our commitment to staying at the forefront of technological advancements in content management. Longer term, we want to continue automating and scaling the data lifecycle management solution while enhancing the Policy Manager to not only handle conflicts and priorities but also cover data lifecycle management of hybrid storage at Netflix.

Conclusion

Systematic data lifecycle management has become imperative for providing highly efficient, cost-effective, and scalable storage solutions. It should be factored into the design or architecture of any new workflow intended for cloud or hybrid storage.

At Netflix, a typical live-action production can generate anywhere from 20k to 80k assets in the post-production phase alone, amounting to hundreds of terabytes of data. By archiving these assets into cheaper storage tiers based on asset lifecycle policies, we can achieve roughly 70% cost savings.
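As a back-of-the-envelope check, using public AWS list prices as assumptions (they vary by region and change over time) and an assumed archived fraction, the savings land in the same ballpark:

```python
# Rough sanity check with assumed numbers; not our actual cost model.
HOT = 0.023      # $/GB-month, S3 Standard (public list price, assumed)
COLD = 0.0036    # $/GB-month, S3 Glacier Flexible Retrieval (assumed)

production_tb = 300                      # "hundreds of terabytes" per title
gb = production_tb * 1024
archived_fraction = 0.8                  # assume most assets go cold post-launch

before = gb * HOT
after = gb * (1 - archived_fraction) * HOT + gb * archived_fraction * COLD
print(f"monthly: ${before:,.0f} -> ${after:,.0f} "
      f"({1 - after / before:.0%} savings)")   # roughly 70%
```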

Given the success of the initial phase, we are enhancing the capabilities of our systems, including both the storage layer and the policy management layer, to expand asset lifecycle policies on three fronts:

  • Extend coverage to more asset types generated from different phases of a title’s lifecycle.
  • Define the complete lifecycle and purge assets that may or may not have been archived.
  • Extend the Policy Manager to support data tiering and lifecycle management for hybrid storage environments such as on-premises storage, NFS, FSx, EBS, and so on.

Acknowledgments

Special thanks to our stunning colleagues Vinod Viswanathan, Sera Leggett, Obi-Ike Nwoke, Yolanda Cheung, Olof Johansson, Shailesh Birari, Patrick Prothro, Gregory Almond, John Zinni, Chantel Yang, Vikram Singh, Emily Shaw, Abi Kandasamy, Zile Liao, Jessica Gutierrez, and Shunfei Chen.

Terminology

LTO: Stands for Linear Tape-Open, a technology primarily used for backup, archiving, and transferring data.

VFX: An abbreviation for Visual Effects, which involves modifying videos or images for live-action media.

OCF: An abbreviation for Original Camera Footage, the raw and unedited content initially captured by a film camera.
