Netflix
Walk through a complete Netflix video streaming design: from encoding one upload into 1,200 adaptive variants to serving 100M concurrent viewers through ISP-embedded CDN appliances with sub-2-second start times.
What is Netflix?
Netflix is a video streaming platform with 220 million subscribers across 190 countries. The interesting engineering challenge is not storing petabytes of video; S3 handles that. It is encoding each uploaded title into 1,200 adaptive format variants, distributing those variants to ISP-embedded appliances so viewers never touch the origin, and selecting the right variant every 4 seconds to keep a 4G phone and a 4K TV both buffer-free on the same film.
This is a rich interview question because it spans four genuinely hard problems: a parallel video encoding pipeline, a proprietary CDN tier called Open Connect, an adaptive bitrate protocol running on the client, and a recommendation system that personalizes a catalog of 15,000 titles for 220 million distinct users.
Functional Requirements
Core Requirements
- Users can browse and search the video catalog.
- Users can stream a video with adaptive quality that does not buffer.
- Users receive personalized content recommendations.
- Users can resume watching from where they left off on any device.
Below the Line (out of scope)
- Content upload and studio ingestion
- Offline download for mobile
- Multi-profile management and parental controls
- Live broadcasting and event streaming
The hardest part in scope: Serving 100 million concurrent viewers without buffering. Every viewer fetches a video segment every 4 seconds. The origin (S3) cannot absorb that read volume directly. The solution is layered: ISP-embedded appliances serve the vast majority of traffic within each city, a tiered CDN catches the rest, and the origin only handles cache misses. Getting that hierarchy right is the core of this design.
Content upload is below the line because it is a write-once, offline pipeline that does not interact with the playback path. To add it, I would build a studio-facing ingest API that accepts high-resolution source files, deposits them in S3, and emits an EncodeVideoEvent to Kafka. The encoding pipeline covered in the deep dives would pick it up from there.
Offline downloads are below the line because they require DRM key management (Netflix uses Widevine, PlayReady, and FairPlay per platform). The download itself is a CDN prefetch operation, but the license server integration adds substantial complexity that changes none of the streaming design. To add it: integrate a Download License Service that issues time-bounded offline playback tokens per device and content item. The CDN segment prefetch path already exists; the missing piece is a client-side encrypted storage mechanism and the License Service issuing short-lived download keys scoped to that device.
Multi-profile management is below the line because it is a user account feature orthogonal to the streaming path. Each profile is a child entity on the User record with its own recommendation scores, watch history, and content restrictions. To add it: extend the User entity with a Profile child and scope all WatchSession, recommendation cache keys, and entitlement checks to profile_id instead of user_id. No streaming path components change.
Live broadcasting is below the line because it uses a fundamentally different delivery protocol (RTMP or SRT ingest, LL-HLS or WebRTC egress) with sub-second latency constraints that override the VOD caching model entirely. The system here is VOD-optimized. To add it: build a separate live ingest pipeline (RTMP or SRT ingest, segmented into LL-HLS chunks at under 2 seconds) and route through a dedicated low-latency CDN path. Live segments expire immediately and must not be cached aggressively; the standard OCA pre-population model does not apply.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime. Netflix is entertainment, and outages are front-page news. Availability over consistency: a stale recommendation is acceptable; a playback failure is not.
- Latency: Video starts within 2 seconds of clicking play (time to first byte for the first segment). Catalog browse and search under 200ms p99.
- Scale: 220M subscribers, roughly 100M DAU. Netflix sustains roughly 15% of global downstream internet bandwidth at peak hours. Peak concurrent streams are roughly 30 to 50M at any given moment.
- Buffering rate: Under 0.1% of playback time. Viewers tolerate minor quality drops; they do not tolerate buffering spinners.
- Storage: Each title generates roughly 1TB of encoded variants (multiple resolutions, codecs, audio tracks). At 15,000 titles, total catalog storage is roughly 15 petabytes.
Below the Line
- Sub-50ms start time for high-speed fixed broadband connections
- Per-language subtitle and audio track management at encoding time
- Device-specific DRM enforcement and key rotation
Sub-50ms start time is below the line because achieving it requires speculative segment prefetch before the viewer clicks play, which demands an ML pipeline predicting next-play candidates from browse behavior and pre-warming OCA edges proactively. To add it: instrument browse dwell-time events, train a lightweight next-play predictor, and issue CDN prefetch directives for the predicted first 3 segments.
Per-language subtitle and audio track encoding is below the line because it runs on a parallel track to video encoding and adds no complexity to the delivery path described here. To add it: route each language audio and subtitle file through the same Kafka job queue with a dedicated worker pool; the Job Coordinator assembles all tracks into the final HLS manifest on completion.
Device-specific DRM enforcement is below the line because it requires integrating three separate license servers (Widevine for Android and Chrome, PlayReady for Windows, FairPlay for Apple). To add it: route the manifest endpoint through a License Service that issues short-lived playback tokens per session and device type. The manifest endpoint already returns a drm_license_url in the API design; the License Service backs that URL.
Read/write ratio: This is among the most read-skewed systems in existence. Each new title write generates 1,200+ encoding jobs that produce the variants. After encoding, reads are effectively permanent: every segment request from every viewer is a read, and content rarely changes after publication. The read-to-write ratio for serving is thousands-to-one. This justifies aggressive caching at every tier of the delivery chain.
I target 2-second start time as the key latency number. The first video segment is typically 2 to 4 seconds of content. The manifest file tells the client player where to find each segment. Fetching the manifest plus the first segment from an ISP-embedded appliance takes under 500ms on most fixed broadband connections, leaving margin for player initialization and DRM license fetch.
Core Entities
- Video: A piece of content with metadata (title, genre, cast, description, rating) and a reference to its set of encoded variants in S3.
- EncodingVariant: A specific rendition of a video at a given resolution, bitrate, and codec (H.264 1080p at 8 Mbps, H.265 4K at 16 Mbps, etc.). Each variant maps to a directory of segment files and a manifest in S3.
- User: Account with a subscription tier (Standard, 4K), regional content entitlements, and a reference to their recommendation scores.
- WatchSession: A viewer's current playback state: last position in milliseconds, device ID, and selected quality profile.
Schema design, partition keys, and indexes are deferred to the deep dives. The four entities above are sufficient to drive the API design and High-Level Design.
API Design
Netflix uses REST for catalog and session operations and a manifest-driven protocol (HLS/DASH) for video delivery. The manifest is the critical contract between the server and the client player.
Browse the catalog:
GET /catalog?page_token={cursor}&limit=50
Response: { items: [...], next_cursor }
Search titles:
GET /search?q={query}&page_token={cursor}&limit=20
Response: { results: [...], next_cursor }
Get video manifest URL:
GET /videos/{video_id}/manifest
Response: { manifest_url, drm_license_url }
Update watch position:
PUT /watch-sessions/{session_id}/position
Body: { position_ms, device_id }
Response: 204 No Content
Fetch recommendations:
GET /recommendations?limit=20
Response: { videos: [...] }
Fetch resume position:
GET /videos/{video_id}/resume-position
Response: { position_ms, device_id, updated_at }
The manifest endpoint returns a URL pointing directly to the manifest file on the CDN, not the manifest content itself. The client player fetches the manifest from the CDN edge, which is geographically close and already has the file cached. Returning the manifest inline through the API server would route large payloads through application servers that have no reason to see them.
Cursor-based pagination on /catalog and /search is required rather than offset-based. With 15,000 titles and frequent metadata updates, offset-based pages drift as titles are added mid-browse. Cursors are stable across inserts.
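A minimal sketch of the cursor mechanics, assuming an opaque base64 token that wraps the last item's stable sort key; the token format and the in-memory title list are illustrative stand-ins, not Netflix's actual scheme:

```python
import base64
import json

def encode_cursor(last_id: int) -> str:
    """Opaque cursor: base64 of the last item's stable sort key."""
    return base64.urlsafe_b64encode(json.dumps({"last_id": last_id}).encode()).decode()

def decode_cursor(token):
    return json.loads(base64.urlsafe_b64decode(token))["last_id"] if token else 0

def browse_catalog(titles: list, page_token, limit: int = 50) -> dict:
    """Keyset pagination: filter on id > cursor, so pages stay stable
    even when new titles are inserted mid-browse."""
    last_id = decode_cursor(page_token)
    page = [t for t in titles if t["id"] > last_id][:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if len(page) == limit else None
    return {"items": page, "next_cursor": next_cursor}
```

Because the cursor names the last id seen rather than a row offset, a title inserted ahead of the viewer's position cannot shift the next page.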
High-Level Design
1. Users can browse and search the video catalog
The catalog read path: the client fetches paginated metadata from the Catalog Service, which reads from a PostgreSQL read replica with a Redis cache in front for hot content.
The catalog is a read-dominated, low-write workload. A film's metadata rarely changes after publication. Cache hit rates are high because a small set of popular titles drives the majority of browse impressions.
I'd draw this box first in any Netflix interview because it is the simplest path in the system and establishes the gateway-service-cache-DB pattern that every subsequent requirement builds on.
Components:
- Client: Web or mobile app sending paginated GET requests to the API Gateway.
- API Gateway: Terminates TLS, validates the session token, and routes requests to downstream services.
- Catalog Service: Stateless service that reads video metadata and assembles the paginated response.
- Metadata Cache (Redis): LRU cache keyed by video_id. TTL of 1 hour. Cache hit rate exceeds 95% for catalog browse because users repeatedly see the same popular titles.
- Metadata DB (PostgreSQL): Source of truth for all video metadata. A read replica handles all catalog read traffic. The primary handles metadata writes only.
Request walkthrough:
- Client sends GET /catalog?page_token={cursor}&limit=50.
- API Gateway validates the Bearer token and routes to the Catalog Service.
- Catalog Service checks Redis for cached metadata entries. Returns cached items directly for hits.
- On cache miss, Catalog Service queries the PostgreSQL read replica and populates Redis before returning.
- Catalog Service returns the paginated list with a next_cursor.
This diagram covers catalog reads only. The streaming path in the next requirement uses a separate CDN delivery flow that bypasses the Catalog Service entirely.
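The cache-aside read in steps 3 and 4 can be sketched as follows, with plain dicts standing in for Redis and the PostgreSQL read replica; the names and TTL handling are illustrative:

```python
import time

CACHE_TTL_S = 3600  # the 1-hour TTL from the design

metadata_db = {  # stand-in for the PostgreSQL read replica
    "v1": {"title": "Example Film", "genre": "Drama"},
}
metadata_cache = {}  # stand-in for Redis: video_id -> (expiry_ts, row)

def get_metadata(video_id: str):
    """Cache-aside read: serve from Redis on hit, fall back to the
    read replica on miss and populate the cache before returning."""
    entry = metadata_cache.get(video_id)
    if entry and entry[0] > time.time():
        return entry[1]                      # cache hit
    row = metadata_db.get(video_id)          # cache miss: read replica
    if row is not None:
        metadata_cache[video_id] = (time.time() + CACHE_TTL_S, row)
    return row
```

Because popular titles dominate browse traffic, the second and every later request for the same title never reaches the replica until the TTL expires.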
2. Users can stream a video with adaptive quality
The streaming path: the client fetches a manifest URL from the Manifest Service, then fetches video segments directly from the CDN edge closest to it. The origin (S3) only serves requests that miss every CDN tier.
Netflix does not stream video through application servers. The API server's only role is to hand the client a manifest URL and a DRM license URL. All bytes of video travel through the CDN. I'd state this separation explicitly in the interview because it is the single most important insight: the API layer and the video delivery layer are completely decoupled.
Components:
- Manifest Service: Returns the CDN URL of the manifest file for a given video_id. Validates subscription tier to determine which quality variants are accessible.
- CDN (Open Connect Appliances): ISP-embedded file servers that serve the vast majority of traffic from inside the viewer's ISP network. Fallback CDN (CloudFront) handles remaining traffic.
- S3 (Origin): Source of truth for all encoded video segments and manifest files. The CDN pulls from S3 on cache miss. Viewers never access S3 directly.
Request walkthrough:
- Client sends GET /videos/{video_id}/manifest.
- Manifest Service validates the session and subscription tier. Returns { manifest_url, drm_license_url }.
- Client player fetches the manifest from the CDN edge. The manifest lists all available variant playlists at multiple bitrates.
- Client ABR algorithm selects the starting variant (default: medium quality for fast start).
- Client fetches 4-second video segments from the CDN. Every 4 seconds the ABR algorithm re-evaluates throughput and switches variants if needed.
- On segment cache miss: the CDN edge pulls the segment from S3 and caches it for subsequent viewers on the same node.
After the first viewer in a city watches a new episode, every subsequent viewer on that CDN node is served from local cache. This is why appliances embedded inside ISP data centers can serve the majority of Netflix traffic without touching the origin.
3. Users receive personalized recommendations
The recommendation path: a batch pipeline pre-computes ranked video IDs offline and stores them in Redis. The Recommendation Service is a cache read, not a live model inference.
Recommendations are not computed at request time. Running neural network inference on every browse request at 100M DAU is not viable at acceptable latency. The system pre-computes scores in batch and caches the results. This is an intentional eventual consistency tradeoff: recommendations are slightly stale, but they are always fast.
I'd call out the stale-but-fast tradeoff immediately when drawing this box. Interviewers love to push on "what if the user just finished a show and recommendations haven't updated?" The answer is: a 1-hour stale recommendation is invisible to the user, but a 500ms recommendation fetch is a visible page load delay.
Components:
- Kafka (Event Stream): Every user action (play, pause, complete, search click) is published as a stream event. This is the training data pipeline, not part of the request path.
- Spark Pipeline: Runs periodically (nightly for model training, hourly for embedding refresh). Reads event data and writes ranked video IDs per user to Redis.
- Recommendation Store (Redis): Keyed recs:{user_id}. Pre-computed list of ranked video IDs with a 1-hour TTL.
- Recommendation Service: Reads pre-computed IDs from Redis, fetches metadata from the Catalog Service (which hits the Redis metadata cache), and returns the ranked list.
Request walkthrough:
- Client sends GET /recommendations.
- Recommendation Service reads recs:{user_id} from Redis.
- Recommendation Service fetches metadata for those video IDs from the Catalog Service.
- Returns ranked video list with metadata.
- In the background, user events (this impression, this play start) are published to Kafka for the next batch cycle.
The Spark pipeline and Kafka are entirely off the critical path for the browse request. A failed batch job means recommendations are stale, not unavailable. Recommendation freshness is eventually consistent by design.
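A sketch of the request-time read, with a dict standing in for Redis. The fallback to a non-personalized popular list is my own assumption, worth stating explicitly in an interview: the design only promises stale-not-unavailable, and some degraded response is needed when a key is missing:

```python
recs_store = {"recs:user42": ["v9", "v3", "v7"]}  # stand-in for Redis, written by the Spark job
global_popular = ["v1", "v2", "v3"]               # assumed non-personalized fallback list

def get_recommendations(user_id: str, limit: int = 20) -> list:
    """Request-time path is a cache read, never model inference.
    If the pre-computed list is missing (expired TTL, failed batch job),
    degrade to a non-personalized list rather than fail the page."""
    return recs_store.get(f"recs:{user_id}", global_popular)[:limit]
```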
4. Users can resume watching from where they left off
The resume path: the client sends a position heartbeat every 30 seconds during playback. On app open or play click, the client fetches the last position from the Watch Session Service.
Watch position must survive device restarts, cross-device switches, and 30-day gaps between sessions. It is not a high-throughput write, but it must be durable and fast to read.
Components:
- Watch Session Service: Accepts position heartbeats. Writes to Redis synchronously (for fast reads) and to Cassandra asynchronously via Kafka (for durability and cross-device sync).
- Redis (Watch Position): Keyed pos:{user_id}:{video_id}. Latest position in milliseconds. Sub-millisecond reads on every play click.
- Kafka: Buffers position writes to protect Cassandra from peak write load.
- Cassandra (Watch History): Durable append log of all {user_id, video_id, position_ms, device_id, timestamp} rows partitioned by user_id. Source of truth for cross-device sync.
Request walkthrough (position update):
- Player sends PUT /watch-sessions/{session_id}/position every 30 seconds.
- Watch Session Service writes pos:{user_id}:{video_id} to Redis synchronously (acknowledges to client).
- Watch Session Service publishes the update to Kafka.
- Watch Writer consumer batches Kafka events and writes rows to Cassandra.
Request walkthrough (resume fetch):
- User clicks play on a title they started previously.
- Client sends GET /videos/{video_id}/resume-position.
- Watch Session Service reads pos:{user_id}:{video_id} from Redis. Returns position in milliseconds.
- Client player seeks to that position and begins fetching segments from the CDN.
I use Redis as the synchronous write target rather than Cassandra because the player polls for resume position on every play click. Redis reads are sub-millisecond. Cassandra reads for a single row take 5 to 20ms. Even in the worst case where all 100M daily actives streamed simultaneously, 30-second heartbeats would produce roughly 3.3M writes per second; that upper bound is distributed across a Redis cluster and well within its capacity. Kafka buffers the Cassandra writes to protect the database from that peak rate.
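A sketch of the dual write path, with a dict standing in for Redis and a queue.Queue standing in for the Kafka topic; the Cassandra consumer that drains the queue in batches is omitted:

```python
import queue

position_cache = {}            # stand-in for Redis: sub-ms reads on play click
kafka_topic = queue.Queue()    # stand-in for the Kafka buffer in front of Cassandra

def update_position(user_id: str, video_id: str, position_ms: int, device_id: str) -> None:
    """Synchronous write to the cache (fast resume reads), then an
    asynchronous durable write buffered through Kafka toward Cassandra."""
    position_cache[f"pos:{user_id}:{video_id}"] = position_ms  # ack to client after this
    kafka_topic.put({"user_id": user_id, "video_id": video_id,
                     "position_ms": position_ms, "device_id": device_id})

def resume_position(user_id: str, video_id: str) -> int:
    """Read path on play click: a single cache lookup, defaulting to 0
    (start from the beginning) when no prior session exists."""
    return position_cache.get(f"pos:{user_id}:{video_id}", 0)
```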
Potential Deep Dives
1. How do we encode one uploaded film into 1,200 adaptive variants?
Netflix receives master source files from studios: 4K raw video, lossless audio, multiple language tracks, subtitle files. Each title must become over a thousand variants covering every resolution (240p to 4K), codec (H.264, H.265, VP9, AV1), audio track, and subtitle language. A 2-hour film in all variants takes roughly 1TB of storage and can take 72 hours to encode on a single machine.
I'd anchor this deep dive around one number: 1,200 variants per title. That number makes single-machine encoding obviously impossible and sets up the parallelization discussion.
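The fan-out itself can be sketched as enumerating independent (variant, scene chunk) tasks for a worker fleet. The resolution/codec grid and rung count below are toy values, far smaller than the real 1,200-variant ladder, which also multiplies across audio tracks and languages:

```python
from itertools import product

RESOLUTIONS = ["240p", "480p", "720p", "1080p", "4k"]
CODECS = ["h264", "h265", "vp9", "av1"]
BITRATE_RUNGS_PER_PAIR = 3  # illustrative rung count per resolution/codec pair

def encoding_tasks(video_id: str, scene_chunks: int) -> list:
    """Two levels of parallelism: every (resolution, codec, rung) variant
    is split further into scene chunks, so each task is independent and
    wall-clock time is bounded by the slowest single chunk."""
    tasks = []
    for res, codec in product(RESOLUTIONS, CODECS):
        for rung in range(BITRATE_RUNGS_PER_PAIR):
            for chunk in range(scene_chunks):
                tasks.append({"video_id": video_id, "resolution": res,
                              "codec": codec, "rung": rung, "chunk": chunk})
    return tasks
```

Each dict is one job on the Kafka queue; a coordinator tracks completion per variant and assembles the HLS manifest once every chunk of every variant lands in S3.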
2. How do we serve 100 million concurrent viewers without buffering?
100 million concurrent viewers each fetching a 4-second segment every 4 seconds generates roughly 25 million segment requests per second. No origin server fleet survives that read rate directly. The solution is CDN caching, but the choice of CDN architecture is the real engineering decision.
I'd open this deep dive with the 25M req/s number because it immediately disqualifies the first two options before you even describe them. The interviewer sees the reasoning, not just the answer.
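The napkin math behind that number is worth writing out once:

```python
# Back-of-envelope: every concurrent viewer fetches one segment
# per segment duration, so the request rate is viewers / duration.
concurrent_viewers = 100_000_000
segment_duration_s = 4

requests_per_second = concurrent_viewers // segment_duration_s
print(requests_per_second)  # 25,000,000 segment requests per second
```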
3. How does the client pick the right quality tier without buffering?
The client cannot predict future bandwidth. A viewer drives through a tunnel, switches from Wi-Fi to 4G, or shares a connection with three other devices simultaneously. The player needs to switch quality variants smoothly without stopping playback and without oscillating quality every few seconds.
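One plausible selection rule can be sketched as follows, assuming an example bitrate ladder, the 80% throughput safety factor mentioned in the cheat sheet, and buffer-based hysteresis; the ladder values and the 10-second buffer threshold are illustrative assumptions, not Netflix's actual policy:

```python
VARIANTS_KBPS = [235, 750, 1750, 3000, 4300, 5800, 8000, 16000]  # example ladder
SAFETY_FACTOR = 0.8  # only commit to 80% of measured throughput

def select_variant(throughput_kbps: float, current_kbps: int, buffer_s: float) -> int:
    """Pick the highest variant that fits within 80% of measured
    throughput, but only downswitch when the buffer is actually
    draining, so momentary dips don't cause quality oscillation."""
    budget = throughput_kbps * SAFETY_FACTOR
    candidates = [v for v in VARIANTS_KBPS if v <= budget]
    target = candidates[-1] if candidates else VARIANTS_KBPS[0]
    if target < current_kbps and buffer_s > 10:  # healthy buffer: hold quality
        return current_kbps
    return target
```

The safety factor absorbs measurement noise on the upswitch side; the buffer check absorbs it on the downswitch side. Together they keep the player from flapping between adjacent rungs every 4 seconds.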
4. How do we build a recommendation system for 220 million users?
The recommendation system must produce a ranked list of 20 to 50 video IDs per user, personalized to their taste, and serve it in under 200ms. The pre-computation approach in the High-Level Design is correct. The question is what model drives it. The model must be accurate, trainable at scale, and deployable without real-time inference in the critical path.
I'd frame this deep dive as "the latency budget is 200ms, so whatever model we pick, inference cannot happen at request time." That constraint eliminates most approaches before the discussion starts.
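The serving side can be sketched as a nearest-neighbor lookup over precomputed item-tower embeddings. Brute-force dot products stand in here for the production ANN index (Faiss, per the cheat sheet), and the embedding count and dimension are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Item-tower output: one embedding per title, refreshed offline in batch.
item_embeddings = rng.normal(size=(1000, 32)).astype(np.float32)

def top_k(user_embedding: np.ndarray, k: int = 20) -> list:
    """Request-time retrieval is a similarity search over precomputed
    item embeddings, never a forward pass through the model. Production
    would use an ANN index; exact brute force is fine at toy scale."""
    scores = item_embeddings @ user_embedding
    return np.argsort(-scores)[:k].tolist()
```

The user-tower embedding is itself precomputed in the hourly batch and cached, so the only request-time work is this dot-product ranking plus the metadata fetch.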
Final Architecture
Open Connect Appliances are the single biggest architectural differentiator in this system. Everything else (ABR selection, chunked parallel encoding, two-tower recommendations, watch position write-through to Redis) is in service of making that CDN edge hit rate as high as possible and keeping every viewer away from the origin. The separation between the catalog and streaming paths is the second key insight: metadata reads and video delivery are independently scalable, sized for completely different traffic profiles.
Interview Cheat Sheet
- Separate the catalog read path (low QPS, cacheable metadata) from the video streaming path (100M+ concurrent segment fetches, CDN-only) early in the discussion.
- Netflix does not stream video through application servers. The API server returns a manifest URL. The client fetches all video bytes directly from the CDN from that point forward.
- Open Connect Appliances embedded inside ISPs serve roughly 95% of Netflix traffic. No transit peering, under 10ms latency to viewer, zero egress cost to Netflix.
- New release titles are pre-populated on OCAs nightly before viewers arrive. CDN cache misses during release surges are a configuration problem, not a runtime problem.
- One title produces 1,200+ encoded variants. Scene-based parallel encoding (1,200 variants x 20 scene chunks = 24,000 parallel tasks) cuts wall clock time from 72 hours to under 30 minutes.
- ABR selects the highest VMAF-scored variant within 80% of measured throughput. VMAF is perceptual quality, not raw bitrate. A simpler title at lower bitrate can score higher than a complex title at higher bitrate.
- Recommendations are pre-computed in batch (hourly embedding refresh, nightly model training). At request time, the Recommendation Service is a Redis read plus a Faiss ANN index search, not a model inference.
- Two-tower neural networks solve cold start for new titles: the item tower embeds genre and cast metadata before any viewing history exists for that title.
- Watch position writes to Redis synchronously (sub-ms reads for instant resume across devices) and to Cassandra asynchronously via Kafka (durable history).
- The read-to-write ratio for video serving is thousands-to-one. Every tier applies aggressive caching: metadata cache, recommendation cache, watch position cache, CDN segment cache.
- State the 2-second start time target explicitly. It constrains CDN placement (under 500ms to first segment), ABR startup policy (start at medium quality, not maximum), and manifest fetch latency.
- For the encoding deep dive: emphasize two levels of parallelism (across variants and across scene chunks) and per-title VMAF-optimized bitrate ladders as the mechanism for quality consistency without storage bloat.
- Netflix's tiered OCA architecture (ISP-level nodes pulling from regional nodes pulling from S3) trades deployment complexity for the ability to serve from inside every major ISP network globally.