Navigating the Netflix Data Deluge: The Imperative of Effective Data Management
by Vinay Kawade, Obi-Ike Nwoke, Vlad Sydorenko, Priyesh Narayanan, Shannon Heh, and Shunfei Chen
Introduction
In today's digital age, data is being generated at an unprecedented rate. Take Netflix, for instance. Netflix Studios all over the world create hundreds of petabytes of assets every year. The content ranges from text and image sequences to large IMF¹ files for source encoding, and occasionally, generated proxies and mezzanines. This immense flow of data from the studios highlights the critical need for effective data management strategies to harness actionable insights. It's noteworthy that a significant portion of all the content ingested remains unused.
In this article, we, the Media Infrastructure Platform team, outline the development of a Garbage Collector, our solution for effectively managing production data.
The Magnitude of Data Generation
Hundreds of Netflix Studios worldwide generate about 2 petabytes of data every week. This is a complex mix of data coming from a variety of sources, including text, images, image sequences, large sources like IMFs¹, and generated proxies and mezzanines. The sheer volume is staggering, making it increasingly difficult to sift through, organize, and analyze this information effectively. It also significantly raises our storage costs; we have historically seen a 50% increase in our storage bills year over year. At the same time, internal research shows that at least 40% of this data never gets used.
The Need for Effective Data Management
As data continues to proliferate, particularly the data ingested by the studios, which typically follows a lifecycle of upload -> read -> reupload -> no-reads, the task of managing it effectively takes on increasing importance. Here at Netflix, we, the Media Infrastructure & Storage Platform team, engineered a solution for data lifecycle management: a scalable and asynchronous Garbage Collector (GC) that monitors and cleans up file objects based on user actions or pre-configured lifecycle policies. The GC is a component of our team's Baggins service, an internal abstraction layer over S3 that caters to media-specific use cases.
Architecture
At a high level, we have designed a Data Lifecycle Manager that passively monitors our data and identifies what can be safely deleted or moved to colder storage. We take a "mark and sweep" approach to garbage collection.
To start with, we keep all our bytes in AWS's Simple Storage Service (S3). For each file in S3, we maintain metadata in Baggins. This metadata resides in a Cassandra database and includes fields such as Media Hash List checksums (SHA-1, MD5, and xxHash) computed on full files, and encryption keys for client-side encrypted objects. Hundreds of internal Netflix applications interact with these files in S3, often generating proxies, derivatives, clips, and so on. These applications include workflows like Promo Media Generation, Marketing Communications, Content Intelligence, Asset Management Platform, etc. They delete unnecessary files as and when needed, or they can pre-configure a TTL on objects to purge them automatically at a set interval after object creation: 7, 15, 30, 60, or 180 days. Our delete API marks the object as soft deleted in our database and cuts off any further access to the object in S3. This soft deletion keeps our delete API at an ultra-low latency of under 15 milliseconds at the 99th percentile.
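To make the soft-delete path concrete, here is a minimal sketch, assuming a hypothetical Cassandra schema; the keyspace, table, and column names (baggins, media_objects, delete_marker, deleted_at) are illustrative and not the actual Baggins schema.

```python
# Minimal sketch of a soft-delete API. Only a metadata marker is flipped;
# the bytes in S3 are untouched until the Garbage Collector runs later.
# Keyspace, table, and column names are illustrative assumptions.
from datetime import datetime, timezone

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])
session = cluster.connect("baggins")  # hypothetical keyspace

SOFT_DELETE = session.prepare(
    "UPDATE media_objects "
    "SET delete_marker = true, deleted_at = ? "
    "WHERE bucket = ? AND object_key = ?"
)

def soft_delete(bucket: str, object_key: str) -> None:
    """Mark the object as deleted and stop serving it; no S3 calls are made here."""
    session.execute(SOFT_DELETE, (datetime.now(timezone.utc), bucket, object_key))
```

In this sketch, the bucket-level TTLs (7, 15, 30, 60, or 180 days) would live in a separate configuration table and are honored later by the daily cleanup workflow rather than at delete time.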
This brings us to the hard-deletion problem: at some point in the not-too-distant future, we have to make sure all traces of these deleted files are removed from our system. This includes cleaning up the bytes from S3, deleting any indexing done on them in Elasticsearch, and finally deleting all metadata from the Cassandra database. To start with, we had the following requirements:
- Keep the delete API (soft-deletion) at low latency.
- Passive cleanup around the clock.
- Throttle cleanup if it starts affecting the online database.
- Easily scalable to absorb cleanup spikes.
- Convenient visualization of cleanup on a daily basis.
- Build a generic framework to run other data-optimization tasks, such as archival.
Under the hood
The following is a high-level architecture of the system:
From the client's perspective, they simply call the delete API, which takes ~15 milliseconds. However, a lot happens under the hood afterward to properly clean up the object and all its metadata from every place it lives. While the diagram might initially appear a bit tricky, tracing the arrows simplifies understanding and brings clarity to the overall picture. Note that the cycle above runs every day, in a passive and automated fashion, and cleans up all data that becomes eligible for deletion on that day plus any backlog accumulated so far.
- Import object keys: Every S3 object has a matching record in our Cassandra database, and the entries in our Cassandra metadata tables serve as our primary reference point. We mark object keys for soft deletion by adding a delete marker to the corresponding row. The same database also contains a table for bucket-level configuration, which holds information such as the bucket-level pre-configured TTL. However, Cassandra does not allow filtering rows that have been TTL'd out or marked as soft deleted without providing the partition key. Also, the metadata is spread across multiple tables and requires normalization. This is where Casspactor and Apache Iceberg come into the picture.
- Casspactor: This is an internal Netflix tool we use to export our Cassandra tables to Iceberg. Casspactor works off backup copies of Cassandra and does not add any latency to the online database. Iceberg is an open-source table format for large analytic datasets, created at Netflix, that gives us SQL-like access to structured and unstructured data. Rather than query Cassandra for TTL'd or soft-deleted rows, we can now query the copied rows in Iceberg.
- We use Netflix's workflow orchestration framework, Maestro, to set up daily workflows that query Iceberg for the keys that become eligible for cleanup on that day (a query sketch follows this list). This flow takes into account the bucket-level TTL configuration as well as object-level delete markers.
- The Maestro workflows collate the results into another temporary Iceberg table that serves as a source for subsequent tasks.
- Once we have all the rows that need to be deleted on a given day, we export them to a Kafka queue using Data Mesh, our homegrown solution for streaming data movement and processing. We created a Data Mesh pipeline consisting of an Iceberg connector and a Kafka sink.
- We run a set of Garbage Collector (GC) workers that listen on this Kafka topic and perform the deletion for each row (a worker sketch follows this list). These workers are horizontally scalable and autoscale based on spikes or backlog in the Kafka pipeline.
- Each GC worker handles the complete cleanup of an object. It first deletes all of the object's bytes in S3; this can be a single file or multiple files if the asset exceeds the 5-terabyte maximum object size imposed by AWS S3.
- Next, it cleans up all references to the soft-deleted file from our local Elasticsearch index.
- Finally, the worker completes the loop by removing the entry from our online Cassandra database. Note that the autoscaling of the GC workers also takes input from the Cassandra database about its current load and latencies. All these parameters feed into controlled throttling of how fast we delete data.
- The last step powers our dashboards, built with Apache Superset, which give us a daily breakdown of how many files we are deleting and their total size. This equips us with substantial data for cross-referencing our figures with AWS invoices and predicting future expenses.
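To illustrate the daily eligibility filter, below is a hypothetical Spark SQL query of the kind a Maestro workflow could run against the Iceberg copies of our metadata. The table and column names (baggins.media_objects, baggins.bucket_config, delete_marker, ttl_days, created_date) are assumptions for illustration, not our actual schema.

```python
# Hypothetical daily-eligibility query against the Iceberg copies of the metadata.
# Table and column names are illustrative, not the actual schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gc-eligibility").getOrCreate()

eligible = spark.sql("""
    SELECT o.bucket, o.object_key
    FROM baggins.media_objects o
    LEFT JOIN baggins.bucket_config c ON o.bucket = c.bucket
    WHERE o.delete_marker = true                           -- explicitly soft deleted
       OR (c.ttl_days IS NOT NULL                          -- or expired under the bucket TTL
           AND o.created_date <= date_sub(current_date(), c.ttl_days))
""")

# Collate the results into a temporary Iceberg table that downstream
# tasks (the Data Mesh pipeline feeding Kafka) read from.
eligible.writeTo("baggins.gc_candidates_today").createOrReplace()
```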
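And here is a rough sketch of a GC worker, assuming each Kafka message carries a JSON payload with the bucket, the object key, and any part keys for assets split across multiple S3 objects. The topic, index, table, and field names are assumptions, and a production worker would add batching, retries, and the load-aware throttling described above.

```python
# Sketch of a GC worker: consume soft-deleted keys from Kafka, then hard-delete
# the bytes from S3, the index entry from Elasticsearch, and the Cassandra row.
# Topic, index, table, and field names are illustrative assumptions.
import json

import boto3
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch, NotFoundError
from kafka import KafkaConsumer

s3 = boto3.client("s3")
es = Elasticsearch("http://elasticsearch:9200")
cassandra = Cluster(["cassandra-host"]).connect("baggins")

consumer = KafkaConsumer(
    "gc-candidates",                      # hypothetical topic fed by Data Mesh
    bootstrap_servers="kafka:9092",
    group_id="gc-workers",                # workers scale out as a consumer group
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    record = message.value
    bucket, key = record["bucket"], record["object_key"]

    # 1. Delete all bytes in S3; an asset larger than the 5 TB S3 object limit
    #    is stored as multiple objects, so delete every part.
    for part_key in record.get("part_keys", [key]):
        s3.delete_object(Bucket=bucket, Key=part_key)

    # 2. Remove the reference from the Elasticsearch index.
    try:
        es.delete(index="media-objects", id=f"{bucket}/{key}")
    except NotFoundError:
        pass  # already cleaned up

    # 3. Finally, remove the metadata row from the online Cassandra database.
    cassandra.execute(
        "DELETE FROM media_objects WHERE bucket = %s AND object_key = %s",
        (bucket, key),
    )
```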
Storage Statistics
The following is a snapshot of what our dashboard looks like:
Essential Insights
- Prioritize establishing data cleanup strategies from the outset. The process of data cleanup should be an integral part of the initial design, rather than a subsequent consideration. Over time, data cleanup can escalate into an overwhelming task if not properly managed.
- Consistently monitor and record expenses; nothing is cost-free. Each byte of data comes with a financial implication. Therefore, a comprehensive plan for all stored data is imperative. This could involve deleting the data after a specified duration, transferring it to more cost-effective storage tiers when not in use, or at the very least, maintaining a justification for its indefinite retention for future reference and decision-making.
- The design should be versatile, so that it can be used for a wide array of tasks and environments, from managing active data to archiving inactive data, and everything in between.
Conclusion
As we continue to navigate this era of exponential data growth, the need for effective data management cannot be overstated. It is not merely about handling the quantity of data but also about understanding its quality and relevance. Organizations that invest in comprehensive data management strategies will be the ones to lead in this new data-driven landscape, leveraging their most valuable asset.
Terminology
IMF (Interoperable Master Format): This is a standardized format used for the digital transfer and storage of audio and video master files. For more information, visit this overview
MHL (Media Hash List): This is a standard utilized for preserving checksums on media files during transit and while stored at rest. For more information, visit https://mediahashlist.org
PB (Petabytes): A unit of digital information storage, equivalent to one thousand Terabytes or one million Gigabytes.
TTL (Time-to-Live): This term refers to the duration that a piece of data is stored before it is discarded.
Acknowledgments
Special thanks to our colleagues Emily Shaw, Ankur Khetrapal, Esha Palta, Victor Yelevich, Meenakshi Jindal, Peijie Hu, Dongdong Wu, Chantel Yang, Gregory Almond, Abi Kandasamy, and other stunning colleagues at the Maestro team, Casspactor team, and Iceberg team for their contributions.