Data Gateway — A Platform for Growing and Protecting the Data Tier

Netflix Technology Blog
13 min read · Apr 23, 2024

Shahar Zimmerman, Vidhya Arvind, Joey Lynch, Vinay Chella

The Netflix Online Datastore team has built a platform we call the Data Gateway that enables our datastore engineers to deliver powerful data abstractions which protect Netflix application developers from complex distributed databases and incompatible API changes. This post, the first in a series, covers the platform itself; follow-up posts will show how we use it to raise the level of abstraction that application developers use every day to create, access, and maintain their online data.

Motivation

At Netflix, we have adopted and contributed to a large number of open-source (OSS) technologies and databases in the data tier, including Apache Cassandra, EVCache (memcached), OpenSearch, and more. Historically, the Online Data Platform operated these datastores and exposed their OSS APIs via client libraries. For example, we operated Apache Cassandra clusters and provided developers client libraries based on the Thrift or Cassandra Query Language (CQL) protocols. This strategy gave the Data Platform leverage because building on OSS meant fewer engineers could operate more types of databases for a larger number of users. However, while this enabled rapid expansion, coupling applications to a multitude of APIs Netflix did not control carried significant maintenance costs in the long run.

Most databases have a large API surface area and few guardrails, which results in a number of usage anti-patterns that require advanced knowledge to avoid and may take years to surface. For example, developers must avoid writing too much data to a single row or field, and the limits differ for every datastore. As Netflix's engineering organization grew and the number of use cases proliferated, engineers spent more and more effort mitigating database misuse and re-designing applications. This also increased the risk of product outages, since most critical applications rely on database services and database migrations are intrinsically hazardous.

Furthermore, certain use cases require architectures that combine different databases to achieve a desired API with scalability, availability, and consistent performance. Over time, we saw Netflix developers implementing the same patterns over and over — such as adding caches to key-value lookups — in other words, “re-inventing the wheel”.

Finally, we had to integrate Netflix's standard service-discovery, remote-procedure-call resilience techniques, authentication, and authorization systems with every OSS database in order for Netflix applications to use them. Integrating every individual database and per-database protocol with these systems is challenging because each implementation is different and must be maintained by different experts (e.g. Memcached experts, Cassandra experts, etc.).

Introducing Data Gateway

Netflix’s Data Gateway is a platform built to solve these issues, making it easy for Netflix to build and manage stable online data access layers (DAL). It simplifies and secures data access by providing tailored APIs using standard IPC protocols like gRPC and HTTP, abstracting the complexities of the backing distributed databases and protecting against their usage anti-patterns, while simultaneously enhancing security, reliability, and scalability.

Component Overview

A Data Gateway sits between applications and databases, enabling Netflix to provide user-friendly, secure, and reliable data persistence services at scale. The platform aims to be:

User-friendly: Host data access layers that provide familiar gRPC or HTTP APIs tailored to common usage patterns at Netflix, for example Key-Value or Time-Series.

Secure: Delegate mTLS, connection management, authentication, and authorization to a performant service mesh as a common solution.

Reliable: Reduce API surface area of OSS datastores to only their safe-at-scale subsets, preventing anti-patterns and providing an indirection layer for resilience techniques including circuit breaking, back-pressure, and load shedding.

A Data Gateway instance

As you can see, a Data Gateway data plane instance is composed of:

  • EC2 Instance: standard cloud compute Linux Virtual Machine tuned for high performance and low-latency by the Netflix performance team.
  • Data Gateway Agent: sidecar process that orchestrates purpose-built container images and manages service registration when healthy (i.e. Discovery).
  • Container Runtime: A standard OCI container runtime which runs, monitors, restarts, and connects proxy and DAL containers.
  • Envoy Proxy: An industry-standard service mesh proxy container that serves as the reverse proxy.
  • Data Abstraction Layer (DAL): Application code deployed as containers that host purpose-built HTTP or gRPC data access services such as Key-Value.
  • Declarative Configuration: concise declarative configuration providing the goal cluster and data plane instance state.

Application clients connect to these gateways via standard Netflix Discovery services or AWS load balancers (e.g. ALB/NLB). Envoy terminates TLS, authorizes every connection, and then forwards requests to the appropriate DAL containers which communicate with database specific protocols to complete each query.
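From the application side this looks like any other Netflix gRPC service. A minimal, hypothetical Java client bootstrap is sketched below (the target address is illustrative; real clients resolve the gateway shard via Netflix Discovery and present a Metatron client certificate):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class GatewayClientExample {
    public static void main(String[] args) {
        // Illustrative target: in practice the address comes from Netflix Discovery
        // or an AWS load balancer in front of the Data Gateway shard.
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("my-kv-shard.example.internal:8980")
                .useTransportSecurity() // TLS is terminated by the Envoy proxy on the gateway
                .build();

        // A generated DAL stub (for example a Key-Value gRPC stub) would be created
        // from this channel; the gateway authorizes the connection and routes the
        // request to the appropriate DAL container.
        channel.shutdown();
    }
}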

Configuration and Declarative Delivery

Declarative configuration drives deployment on-instance via the Data Gateway Agent, and also across the entire fleet. We separate declarative configuration into two categories: runtime and deployment.

Runtime Configuration

Configuration of a single instance goal state is referred to as “runtime” configuration. This configuration includes the desired composition of data abstraction containers, their environment, and networking with the proxy to form a data plane instance. Below is a sample runtime configuration:

# Configure the proxy to accept protocols
proxy_config:
  public_listeners:
    secure_grpc: {mode: grpc, tls_creds: metatron, authz: gandalf, path: 8980}

# Configure the DAL containers, implementing protocols
container_dals:
  cql:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
  thrift:
    container_listeners: {secure_grpc: 8980}
    image: "dgw-kv"
    env:
      STORAGE_ENGINE: "thrift"

# Configure advanced wiring of protocols
wiring:
  thrift: {mode: shadow, target: cql}

This specifies two Key-Value DAL containers named cql and thrift, both created from the deployment-specific version of the dgw-kv image, and a proxy listening externally on host port 8980 for mutual TLS (mTLS via metatron) connections. This listener is named secure_grpc; its connections are authenticated with mTLS, authorized using Netflix's Gandalf authorization system, and each request is forwarded to the DAL processes listening inside the containers on port 8980 for secure_grpc. Finally, the wiring section specifies that calls to thrift should shadow to the cql container. This is visualized in the diagram below:

Deploy Configuration (Desires)

While runtime configuration is scoped to a single instance, we also have to configure the desired deployment of those instances. Deploy desires declaratively describe the properties of the deployment of Data Gateways. Below is a sample deploy configuration:

deploy_desires:
  # What are the access pattern and capacity
  capacity:
    model_name: org.netflix.key-value
    query_pattern:
      access_pattern: latency
      estimated_read_per_second: {low: 2000, mid: 20000, high: 200000}
      estimated_write_per_second: {low: 2000, mid: 20000, high: 200000}
    data_shape:
      estimated_state_size_gib: {low: 20, mid: 200, high: 2000}
      reserved_instance_app_mem_gib: 20
  # How critical is this deployment to Netflix
  service_tier: 0
  # What version set of software should be deployed
  version_set:
    artifacts:
      dals/dgw-kv: {kind: branch, value: main}
      # Runtime configuration is a container as well!
      configs/main: {kind: branch, sha: ${DGW_CONFIG_VERSION}}
  # Where should we deploy to, including multiple clusters
  locations:
    - account: prod
      regions: [us-east-2, us-east-1, eu-west-1, us-west-2]
    - account: prod
      regions: [us-east-1]
      stack: leader
  # Who owns (is responsible for) this deployment
  owners:
    - {type: google-group, value: our-cool-team@netflix.com}
    - {type: pager, value: our-cool-pagerduty-service}
  # Who consumes (uses) this deployment, and what role?
  consumers:
    - {type: account-app, value: prod-api, group: read-write}
    - {type: account-app, value: studio_prod-ui, group: read-only}

This configuration specifies desires at a high-level: capacity required and workload context, service criticality, software composition including versions of images and runtime configuration, where to deploy including regions and accounts, and access control. Service tier is a concise piece of context, provided as a numeric value between 0 and 3+ which indicates criticality and influences fleet management, capacity planning, and alerting.

We use deploy desires to provision hardware and software for each shard, for example by using the desired capacity in high-level terms of RPS and data size as input to our automated capacity planner, which compiles that desire down to a price-optimal choice of EC2 instances along with desired ASG scaling policies. We also use deploy desires to inform continuous deployment of the fleet while enabling safer phased rollouts (i.e. less critical tiers deployed first), artifact pinning, and other key features.
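As a toy illustration of that compilation step (the method names, per-instance limits, and numbers below are made up; the real planner also weighs cost, hardware profiles, and headroom), the required capacity can be thought of as the larger of a throughput-driven and a storage-driven instance count:

// Illustrative only: a toy "capacity planner" that turns high-level desires
// (RPS and state size) into an instance count.
public final class ToyCapacityPlanner {
    public static int instancesNeeded(long readsPerSecond,
                                      long writesPerSecond,
                                      long stateSizeGib,
                                      long rpsPerInstance,
                                      long gibPerInstance) {
        long byThroughput = ceilDiv(readsPerSecond + writesPerSecond, rpsPerInstance);
        long byStorage = ceilDiv(stateSizeGib, gibPerInstance);
        return (int) Math.max(byThroughput, byStorage);
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        // Using the "mid" estimates from the sample deploy desires above and
        // made-up per-instance limits of 10,000 RPS and 100 GiB.
        System.out.println(instancesNeeded(20_000, 20_000, 200, 10_000, 100)); // prints 4
    }
}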

We call a group of clusters a “shard” because they provide fault-isolation boundaries for stateful services. Sharded deployments, or single-tenant architecture, are preferred at Netflix for online data services because they minimize the blast radius of misbehaving applications and protect the broader Netflix product from noisy neighbors. In 2024, the Data Gateway platform declaratively manages a fleet of thousands of shards for dozens of different data abstractions.

Data Gateway Agent orchestrates purpose-built components

At the core of each data gateway is an agent we drop on Netflix EC2 VMs which bootstraps from a concise configuration, manages the requisite containers, and wires up proxies to ultimately expose compositions of data abstractions to users.
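As a rough sketch of that control loop (hypothetical interfaces; the real agent drives containerd and Envoy configuration), the agent repeatedly compares the declared containers against what is actually running and converges the two:

import java.util.Map;
import java.util.Set;

// Hypothetical abstraction over the OCI container runtime.
interface ContainerRuntime {
    Set<String> runningContainers();
    void start(String name, String image);
    void stop(String name);
    boolean isHealthy(String name);
}

final class ReconcileLoop {
    private final ContainerRuntime runtime;
    private final Map<String, String> desired; // container name -> image, from runtime config

    ReconcileLoop(ContainerRuntime runtime, Map<String, String> desired) {
        this.runtime = runtime;
        this.desired = desired;
    }

    void reconcileOnce() {
        Set<String> running = runtime.runningContainers();
        // Start anything declared but not running; restart anything unhealthy.
        for (var entry : desired.entrySet()) {
            if (!running.contains(entry.getKey())) {
                runtime.start(entry.getKey(), entry.getValue());
            } else if (!runtime.isHealthy(entry.getKey())) {
                runtime.stop(entry.getKey());
                runtime.start(entry.getKey(), entry.getValue());
            }
        }
        // Stop anything running that is no longer declared.
        for (String name : running) {
            if (!desired.containsKey(name)) {
                runtime.stop(name);
            }
        }
    }
}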

If you are familiar with docker-compose, the data gateway is philosophically similar except with the integration of a best-in-class mesh proxy and a continuously running agent that constantly drives [pdf] the instance towards the goal configuration and state. We integrate multiple components to provide the gateway:

  • Reliable System Components: EC2 VMs, containerd, data gateway agent, efficiently compressed software images
  • Inter-Process Communication: Pluggable registration into service registries, mTLS, authentication, authorization, connection management, and both external and internal instance networking.
  • Monitoring: Full system health checks, automatically remediate dead or failing containers
  • Configuration and Software: Version sets of software and configuration, with environment-based configuration

You might ask, “why not Kubernetes?” It is true that Kubernetes pods plus Istio is a more general compute platform, but it is also a complex solution to our relatively simple problem. At Netflix, the Compute Platform team has well-paved single-tenant EC2 instance deployments, and performance isolation and tooling are excellent in this modality. If we diverged from this paved path, our team would be responsible for operating both Kubernetes and Istio. We had no desire to adopt and maintain such a complex multi-tenant scheduler and container solution to solve our relatively simple problem of composing co-located components on a host.

Put simply, Kubernetes doesn’t solve many of our actual problems (such as starting and stopping containers independently of the pod), is more complex, and brings a number of dependencies into our infrastructure that we weren’t comfortable with in the backbone data tier. The Data Gateway platform is designed to have exactly three external dependencies: a Linux VM (EC2), a robust scheduler (ASG), and a blob storage system (S3). This reduction in surface area was extremely attractive for a backbone infrastructure component that will deploy foundational data access layers for all of Netflix, all maintained by a small team.

Case Study: Key-Value Service

At Netflix, we deploy our Key-Value Service (KV) as a DAL on the Data Gateway platform. Key-Value is built on Data Gateway as a HashMap[String, SortedMap[Bytes, Bytes]] map-of-maps data model and query API with per namespace consistency and durability controls, abstracting away datastore intricacies. Key-Value is used by hundreds of teams at Netflix to provide online data persistence for mission-critical, full-active, global applications.
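As a rough illustration of that map-of-maps data model (an in-memory analogue, not the service's actual API), the shape in Java is roughly:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Each record id maps to a sorted map of item key -> item value bytes,
// mirroring HashMap[String, SortedMap[Bytes, Bytes]] described above.
public final class KeyValueModelSketch {
    private final HashMap<String, SortedMap<ByteBuffer, ByteBuffer>> namespace = new HashMap<>();

    public void putItem(String id, ByteBuffer itemKey, ByteBuffer itemValue) {
        namespace.computeIfAbsent(id, k -> new TreeMap<>()).put(itemKey, itemValue);
    }

    public SortedMap<ByteBuffer, ByteBuffer> getItems(String id) {
        return namespace.getOrDefault(id, new TreeMap<>());
    }
}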

Key-Value Service Data Gateway

The Key-Value DAL runs a Java Spring Boot application, exposing gRPC and HTTP interfaces for the Key-Value APIs. This application composes various storage engines and implements features on top such as hedging, look-aside caching, transparent large-data chunking, adaptive pagination, circuit breaking via resource limiters, and more.
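Several of these features are generic wrappers around storage calls. As one example, here is a minimal sketch of request hedging (assumed logic, not the KV implementation): fire the primary read, and if it has not completed within a hedge delay, fire a backup and return whichever finishes first.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class HedgedRead {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    // Start the primary read immediately; schedule a backup read after hedgeDelayMillis.
    // Whichever future completes first wins; the other result is simply ignored.
    public static <T> CompletableFuture<T> read(Supplier<CompletableFuture<T>> query,
                                                long hedgeDelayMillis) {
        CompletableFuture<T> winner = new CompletableFuture<>();
        query.get().whenComplete((value, err) -> complete(winner, value, err));
        SCHEDULER.schedule(() -> {
            if (!winner.isDone()) {
                query.get().whenComplete((value, err) -> complete(winner, value, err));
            }
        }, hedgeDelayMillis, TimeUnit.MILLISECONDS);
        return winner;
    }

    private static <T> void complete(CompletableFuture<T> f, T value, Throwable err) {
        if (err != null) {
            f.completeExceptionally(err);
        } else {
            f.complete(value);
        }
    }
}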

The Key-Value DAL image is built using JIB. Netflix’s standard application framework is Spring Boot but the Data Gateway platform is compatible with any OCI-compliant image regardless of application programming language or guest OS. DAL images are securely uploaded to an S3 artifact store during CI (continuous integration) and are checksummed to detect supply-chain tampering.

Key-Value implements environment-specific configuration using Runtime Configuration. For example:

proxy_config:
  public_listeners:
    secure_grpc: {authz: gandalf, mode: grpc, path: "8980", tls_creds: metatron}
    secure_http: {authz: gandalf, mode: http, path: "8443", tls_creds: metatron}

container_dals:
  kv:
    # Pluggable startup commands
    container_cmd: /apps/dgw-kv/start.sh
    container_listeners: {http: "8080", secure_grpc: "8980", secure_http: "8443"}
    # Configure heap and other properties
    env:
      MEMORY: 8000m
      spring.app.property: property_value
    # Define "healthy" for startup checks
    healthcheck:
      test:
        - CMD-SHELL
        - /usr/bin/curl -f -s --connect-timeout 0.500 --max-time 2 http://envoy:8080/admin/health
    image: "dgw-kv"

# Configure Netflix discovery targets
registrations:
  - address: shard.dgwkvgrpc,shard.dgwkv
    mode: nflx-discovery

The agent runs a container named kv configured by the container_dals.kv object, including the image name, environment variables, container health-check command, and container ports to expose.

The agent will configure an Envoy host listener per entry in public_listeners, binding on all addresses (0.0.0.0 ipv4 or :: ipv6). These listeners are forwarded to the container listener by name, for example secure_grpc specifies routing from host port ::8980 to DAL container port 8980. The agent ensures there are no host port conflicts. The agent finally determines which Data Gateway shard to register in service discovery using the registrations configuration.
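A minimal sketch of those wiring rules (hypothetical types; the real agent generates Envoy configuration) might validate that every public listener binds a unique host port and routes by name to at least one DAL container:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class ListenerWiringCheck {
    // publicListeners: listener name -> host port
    // containerDals:   container name -> (listener name -> container port)
    public static void validate(Map<String, Integer> publicListeners,
                                Map<String, Map<String, Integer>> containerDals) {
        Set<Integer> hostPorts = new HashSet<>();
        for (var listener : publicListeners.entrySet()) {
            // Each public listener must claim a distinct host port.
            if (!hostPorts.add(listener.getValue())) {
                throw new IllegalStateException("host port conflict: " + listener.getValue());
            }
            // Route by name: at least one DAL container must expose this listener.
            boolean routed = containerDals.values().stream()
                    .anyMatch(ports -> ports.containsKey(listener.getKey()));
            if (!routed) {
                throw new IllegalStateException("no DAL container exposes " + listener.getKey());
            }
        }
    }
}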

Runtime Configuration at the container level is agnostic to application specifics such as which ports the service listens on or which endpoint signals health; whatever the application needs is expressed in configuration rather than hardcoded in the platform. This makes it compatible with the multitude of Data Gateway applications and enables cheaper rollouts by decoupling rapidly changing configuration from the application codebase.

Case Study: Secure RDS

Secure RDS implements a simple pass-through architecture using the Data Gateway platform to secure L4 connections to PostgreSQL and MySQL. This architecture terminates mTLS connections at the Envoy process, and then proxies the underlying L4 stream to a backend AWS RDS cluster. This secures access from the client via Netflix’s standard mTLS authentication and authorization systems, and relies on underlying AWS server TLS on the backend.

The client installs a forward proxy sidecar process which discovers the Data Gateway and listens on the client’s host at localhost:5432 (PostgreSQL). When the client connects to the forward proxy using a standard RDBMS client such as JDBC, the forward proxy initiates an mTLS connection using the client application’s metatron TLS certificate to Data Gateway port 5432. On the Data Gateway server, the connection is authorized against the client’s identity. If allowed, the client app connects via an L4 mTLS tunnel from its outgoing proxy, through the Data Gateway which strips mTLS, and on to RDS, which terminates the connection with standard server-side TLS.
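From the application's perspective nothing about the database protocol changes; it connects to the local forward proxy with a plain PostgreSQL driver. A minimal JDBC example, assuming the sidecar listens on localhost:5432 and the PostgreSQL JDBC driver is on the classpath (the database name and user are illustrative, since the effective identity comes from the application's mTLS certificate):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class SecureRdsClientExample {
    public static void main(String[] args) throws Exception {
        // Plain PostgreSQL to the local forward-proxy sidecar; the sidecar tunnels
        // the stream over mTLS to the Data Gateway, which proxies it to RDS.
        String url = "jdbc:postgresql://localhost:5432/mydb?user=app"; // illustrative
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}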

Secure RDS Data Gateway

This architecture allows us to seamlessly secure any database protocol with Netflix’s authentication and authorization paved path, and we’ve used it for AWS RDS, OpenSearch, CockroachDB, and Neptune. In addition, we plan to use this technique to secure other off-the-shelf databases without patching them. It also makes username/password authentication redundant as long as database clusters are single-tenant, because authentication is handled by Netflix’s Metatron mTLS and authorization by Netflix’s Gandalf system. We can also onboard existing username/password-authenticated databases onto this platform by securely encrypting credentials via Netflix’s secrets system, using the shard’s data access control policy.

Secure RDS runtime configuration specifies no container DALs, instead it configures just the reverse proxy to route to the RDS instance under network_dals.rds.listeners.secure_postgres and a network DAL target:

proxy_config:
  public_listeners:
    secure_postgres: {mode: tcp, path: "5432", tls_creds: metatron, authz: gandalf}

# RDS Gateways run no DAL containers
container_dals: {}

network_dals:
  rds:
    listeners:
      secure_postgres: postgresql://rds-db.ih34rtn3tflix.us-east-1.rds.amazonaws.com:5432
    mode: logical_dns

Case Study: Seamless Data Migrations

Engineers need to migrate data between datastores for various reasons which we have presented previously. Modern datastores are designed for particular usage patterns and data models, so when the usage patterns change so does the datastore technology. Database migrations are often essential due to factors such as security vulnerabilities, deprecated APIs, outdated software, or the need to enhance performance/functionality. Whether it’s shifting data to mitigate noisy neighbor issues or provide new features, these migrations play a crucial role in maintaining system integrity and efficiency when in-place upgrades are not possible. In order to make these processes seamless for developers, the Data Gateway platform provides a traffic shadowing layer to both replicate data and performance test query load between different storage engines. With Data Gateway we can manage the entire migration lifecycle:

Data Gateway supports traffic shadowing by deploying two DAL containers in a data plane instance, one being the “primary” connecting to the existing datastore and another “secondary” connecting to a new datastore. We configure the reverse proxy to route real traffic to the primary and “shadow” (in other words, duplicate) traffic to the secondary. After backfilling the data from primary to secondary, we then promote the secondary to receive primary traffic thus completing the data migration. Below is a sample runtime configuration with thrift as the primary DAL and cql as the secondary DAL:

proxy_config:
  public_listeners:
    secure_grpc: { mode: grpc, path: 8980 }

container_dals:
  cql:
    container_listeners:
      secure_grpc: 8980
  thrift:
    container_listeners:
      secure_grpc: 8980

wiring:
  thrift: { mode: shadow, target: cql }
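Conceptually, the shadow wiring behaves like the sketch below (an illustration of the routing semantics, not Envoy's implementation): the client's response always comes from the primary DAL, while a duplicate of each request is sent to the secondary and its response is discarded.

import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public final class ShadowRouter<Req, Resp> {
    private final Function<Req, CompletableFuture<Resp>> primary;   // e.g. the thrift DAL
    private final Function<Req, CompletableFuture<Resp>> secondary; // e.g. the cql shadow target

    public ShadowRouter(Function<Req, CompletableFuture<Resp>> primary,
                        Function<Req, CompletableFuture<Resp>> secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    public CompletableFuture<Resp> handle(Req request) {
        // Fire-and-forget to the secondary; its result and errors are ignored.
        secondary.apply(request).exceptionally(err -> null);
        // The client only ever sees the primary's response.
        return primary.apply(request);
    }
}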

We have used this platform-provided data migration capability to migrate hundreds of deprecated Apache Cassandra 2 databases to the new major version 3. This update could not be safely performed in-place as Cassandra 3 had backward-incompatible changes to the Thrift storage engine, and so we had to migrate hundreds of applications and Cassandra clusters. We accomplished this migration by first migrating applications from direct Cassandra access to Key-Value Service on Data Gateway proxying to their existing thrift data, then migrating user data from Cassandra 2 to Cassandra 3 via shadow traffic and backfill.

Centralizing data migrations using the same infrastructure components was a significant leverage point as it allowed us, as experts, to automate the process, saving thousands of engineering hours and mitigating the risk of data corruption.

Conclusions and Future Work

The Data Gateway is a testament to Netflix’s commitment to technological innovation and operational excellence in our online data tier. It not only addresses immediate operational challenges but also paves the way for future advancements in operational datastores to meet Netflix’s growing business demands from our ever-growing SVOD business and new lines of business such as Ads, Gaming, and Live streaming.

In follow-up posts, we plan to share more details of how we use this platform to rapidly develop, deploy, and maintain high-level data abstractions for our developers such as:

  • Unified authentication and authorization in front of arbitrary L4/L7 databases
  • gRPC Key-Value Service that abstracts evolving Key-Value storage engines (Cassandra, EVCache, other custom stores built at Netflix) from our developers across thousands of diverse use cases.
  • gRPC Time-Series Service that composes multiple storage engines to implement high-scale ingestion, retention policies, and search and retrieval.
  • gRPC Entity Service that provides a flexible CRUD+QE (Query and Eventing) interface blending CockroachDB, Key-Value, Kafka, and Elasticsearch.

Acknowledgments

Special thanks to our stunning colleagues George Campbell, Chandrasekhar Thumuluru, Raj Ummadisetty, Mengqing Wang, Susheel Aroskar, Rajiv Shringi, Joe Lee
