Design YouTube
Walk through a complete YouTube design, from a bare upload service to a globally distributed video platform handling 500 hours of uploads per minute and 1B hours of daily playback.
What is YouTube?
YouTube is a platform where users upload, store, and stream video globally. The engineering challenge is not storage; it is converting 500 hours of raw video per minute into multiple adaptive bitrate formats and delivering each stream at the right resolution to viewers in 200+ countries within minutes of upload. It tests CDN architecture, async pipeline design, and distributed storage in a single question, which is why it is one of the highest-signal problems in the system design interview circuit. I treat this question as a pipeline design problem first and a storage problem second: getting the transcoding architecture right unlocks every other decision.
Functional Requirements
Core Requirements
- Users can upload a video file.
- After upload, the video becomes available to watch (transcoded into multiple resolutions).
- Users can stream a video at a resolution appropriate for their device and connection.
- Users can search for videos by title and description.
Below the Line (out of scope)
- Comments, likes, and subscriptions
- Recommendations and personalized home feed
- Live streaming
- Monetization and ads
The hardest part in scope: Video transcoding. A raw uploaded file must be converted into 6+ resolution variants (360p, 480p, 720p, 1080p, 4K, HDR) before the upload is considered complete. At 500 uploads per minute, the transcoding pipeline is the highest-throughput compute subsystem in the architecture.
Comments, likes, and subscriptions are below the line because they do not affect the upload or streaming paths. To add them, I would store a video_comments table keyed by (video_id, comment_id) and a video_reactions table keyed by (video_id, user_id). Like counts would be cached in Redis and reconciled to a database asynchronously.
Recommendations are below the line because they form a completely separate offline ML pipeline. To add them, I would emit watch events to a Kafka topic and train a collaborative filtering model offline, serving recommendations via a low-latency feature store.
Live streaming is below the line because it replaces the upload-then-transcode model with a real-time ingest and segment delivery model (HLS or DASH live). The architecture diverges significantly from the stored video path.
Monetization is below the line because ad serving is a separate system with its own auction, targeting, and reporting infrastructure that does not touch the core upload or playback path.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime for video playback. Availability over consistency: a viewer watching a video should never see a playback error due to backend failures.
- Latency: Video playback begins within 2 seconds of pressing play. Upload acknowledgment completes within 500ms (the actual processing continues asynchronously).
- Throughput: 500 hours of video uploaded per minute. 1B hours of video watched daily (roughly 41.7M concurrent streams at any moment, calculated as 1B hours × 3,600 s/hr ÷ 86,400 s/day).
- Durability: Uploaded video must never be lost. Stored across at least 3 geographic regions.
- Search latency: Search results return within 500ms p99.
Below the Line
- Sub-100ms time-to-first-byte via edge PoPs in every major city
- Real-time view count consistency
Read/write ratio: Video streaming traffic dwarfs upload traffic by a factor of roughly 1,400:1. For every 500 hours of video uploaded per minute, roughly 694,000 hours of video are consumed per minute (1B hours per day ÷ 1,440 minutes). This asymmetry shapes every decision in this article: the entire write path (upload, transcode, storage) can be slow and asynchronous because the read path (streaming) must be fast and globally distributed.
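The asymmetry above is worth sanity-checking with quick arithmetic. A minimal back-of-envelope sketch of the numbers stated in the requirements:

```python
# Back-of-envelope capacity math for the stated workload.
UPLOAD_HOURS_PER_MIN = 500            # 500 hours uploaded per minute
WATCH_HOURS_PER_DAY = 1_000_000_000   # 1B hours watched daily
MINUTES_PER_DAY = 24 * 60             # 1,440

# Hours of video consumed per minute.
watch_hours_per_min = WATCH_HOURS_PER_DAY / MINUTES_PER_DAY    # ~694,444

# Read/write ratio that shapes the whole design.
read_write_ratio = watch_hours_per_min / UPLOAD_HOURS_PER_MIN  # ~1,389, i.e. ~1,400:1

# Concurrent streams: total watch-seconds per day, spread over the day.
concurrent_streams = WATCH_HOURS_PER_DAY * 3_600 / 86_400      # ~41.7M

print(round(watch_hours_per_min), round(read_write_ratio), round(concurrent_streams))
```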
The 2-second playback start target rules out serving video files directly from a central origin server. Network round-trip time alone from Asia to a US datacenter is 150-200ms, and streaming a 1080p file at 8 Mbps from a single origin saturates bandwidth quickly. CDN edge delivery is mandatory, not optional.
I'd call out the 2-second playback target early in the interview, because it immediately rules out a single-origin setup and locks in CDN as a non-negotiable component before the design starts.
Core Entities
- Video: The uploaded content. Carries a video_id, uploader_id, title, description, status (processing, ready, failed), and created_at. The status field tracks where the video is in the transcoding pipeline.
- VideoVariant: A single transcoded output for a specific resolution and codec. Links back to video_id and stores the CDN URL for the variant file. A single video produces 6-8 variants.
- User: An account. Carries a user_id, display name, and channel metadata.
- SearchIndex (derived): An inverted index over video titles and descriptions. Not a stored table; populated asynchronously from Video records and served by a dedicated search service.
The full schema, indexes, and partition keys are deferred to the data model deep dive. The four entities above are sufficient to drive the API design and High-Level Design.
I treat SearchIndex as a derived entity rather than a first-class stored table; if the interviewer has not asked about search, I can introduce it only when functional requirement 4 comes up.
API Design
Upload a video:
POST /videos/upload
Body: video metadata only (title, description); the file bytes are uploaded separately via the returned pre-signed S3 URL
Response: { video_id, upload_url }
Get video metadata and playback manifest:
GET /videos/{video_id}
Response: { video_id, title, description, status, manifest_url }
Stream a video (adaptive bitrate manifest):
GET /videos/{video_id}/manifest.m3u8
Response: HLS manifest listing all resolution variants
Search for videos:
GET /search?q={query}&cursor?
Response: { videos: [...], next_cursor }
Pre-signed upload URL: Rather than accepting binary file data through the API server, the Upload Service generates a pre-signed S3 URL and returns it to the client. The client uploads directly to S3 bypassing the application tier entirely. This keeps large binary payloads off the API servers, removes an entire network hop, and lets S3 handle multipart resumable uploads natively. The API server only deals with metadata.
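To make the pre-signed URL idea concrete, here is a deliberately simplified sketch of server-side URL signing. This is NOT the real AWS SigV4 algorithm (in production you would call boto3's generate_presigned_url); the hostname and secret are illustrative assumptions, and the point is only that the signature binds the bucket, key, and expiry so the client can upload without holding credentials.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Hypothetical server-side secret; never sent to the client.
SECRET_KEY = b"server-side-secret"

def make_upload_url(bucket: str, video_id: str, expires_in: int = 3600) -> str:
    """Return a time-limited, HMAC-signed upload URL (simplified, not SigV4)."""
    expiry = int(time.time()) + expires_in
    payload = f"{bucket}/{video_id}:{expiry}".encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expiry, "signature": sig})
    return f"https://{bucket}.example-storage.com/raw/{video_id}?{query}"
```

The storage service recomputes the same HMAC on the incoming PUT and rejects mismatched or expired signatures, so possession of the URL is the only credential the client needs.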
HLS vs raw file URL: The manifest_url points to an HLS (.m3u8) or DASH manifest, not a direct video file URL. The manifest lists all available resolution variants and segment URLs. The video player selects segments adaptively based on available bandwidth. This is how YouTube, Netflix, and every major streaming platform delivers video today.
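A master HLS manifest is just text: one #EXT-X-STREAM-INF line per variant, each followed by the URL of that variant's own playlist. A minimal generation sketch, where the bitrates and the CDN hostname are illustrative assumptions:

```python
# Variant ladder: (name, bandwidth in bits/sec, resolution).
# Bitrates here are assumed, not measured encoder output.
VARIANTS = [
    ("360p", 1_000_000, "640x360"),
    ("480p", 2_500_000, "854x480"),
    ("720p", 5_000_000, "1280x720"),
    ("1080p", 8_000_000, "1920x1080"),
]

def master_manifest(video_id: str) -> str:
    """Build the master .m3u8 listing every resolution variant."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for name, bandwidth, resolution in VARIANTS:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"https://cdn.example.com/videos/{video_id}/{name}/index.m3u8")
    return "\n".join(lines) + "\n"
```

The player downloads this once, then picks a variant playlist and fetches its segments; switching resolution mid-stream is just switching which variant playlist it reads segments from.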
Cursor-based pagination applies to search results. Offset pagination breaks when new videos are indexed between pages. A cursor encoding the last-seen video_id ensures stable pagination.
My recommendation for the upload flow is to return a video_id immediately with status=processing and have the client poll for status=ready. Blocking the upload API response on transcoding completion would mean the client waits 5-15 minutes for a 201.
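The client-side half of that recommendation is a poll loop with capped exponential backoff. A sketch under the assumption that fetch_status wraps GET /videos/{video_id} and returns the status field (it is injected here so the logic is testable without HTTP):

```python
import time

def wait_until_ready(video_id, fetch_status, base_delay=2.0, max_delay=60.0,
                     timeout=1800.0, sleep=time.sleep):
    """Poll until status=ready, backing off 2s -> 4s -> ... capped at 60s."""
    deadline = time.monotonic() + timeout
    delay = base_delay
    while time.monotonic() < deadline:
        status = fetch_status(video_id)  # e.g. GET /videos/{video_id} -> status
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError(f"transcoding failed for {video_id}")
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return False  # timed out; transcoding is taking longer than expected
```

A push alternative (WebSocket or server-sent events) avoids polling entirely, but polling with backoff is simpler and perfectly adequate for a 5-15 minute processing window.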
High-Level Design
1. Users can upload a video file
The write path: client requests an upload URL, uploads directly to object storage, the server records the video metadata and begins transcoding.
Components:
- Client: Web or mobile app sending the initial upload request.
- Upload Service: Validates the request, generates a pre-signed upload URL, and creates a Video record with status = processing.
- Object Storage (S3): Stores the raw uploaded file. Durable, replicated, designed for large binary objects.
- Video DB: Stores video metadata and tracks processing status.
Request walkthrough:
- Client sends POST /videos/upload with the video title and optional description.
- Upload Service creates a Video record in the Video DB with status = processing.
- Upload Service generates a pre-signed S3 URL (valid for 1 hour) and returns it with the video_id.
- Client uploads the raw video file directly to S3 using the pre-signed URL.
- S3 triggers a storage event when the upload completes.
flowchart LR
C(["👤 Client\nWeb / mobile app"])
US["⚙️ Upload Service\nCreate Video record · status=processing\nGenerate pre-signed S3 URL"]
S3[("🗄️ S3 Raw Storage\nRaw uploaded video\nDurable · replicated · 3+ regions")]
VDB[("🗄️ Video DB\nvideo_id · title · uploader_id\nstatus=processing")]
C -->|"POST /videos/upload · title"| US
US -->|"INSERT video row · status=processing"| VDB
C -->|"PUT raw video (direct upload)"| S3
The client uploads the raw file directly to S3, bypassing the Upload Service entirely. The API tier only handles metadata. Transcoding is deferred to the next requirement.
2. After upload, the video becomes available to watch
Transcoding pipeline: when the raw upload lands in S3, an async worker picks it up, converts it into multiple resolution variants, stores them back in S3, and marks the video ready.
Components:
- Transcoding Queue (SQS/Kafka): S3 triggers an event on upload completion. The transcoding queue holds pending jobs, decoupling upload from processing.
- Transcoding Workers: Stateless, horizontally scalable workers. Each worker pulls a job, invokes the transcoding binary (FFmpeg) for each resolution, uploads the output variants to S3, and updates the Video DB.
- Video DB (updated): The status field transitions from processing to ready when all variants are complete.
- CDN: The variant files in S3 are served via CDN edge nodes. The CDN URL for each variant is written to the VideoVariant table.
Request walkthrough (transcoding path):
- S3 publishes an UploadComplete event to the transcoding queue when the raw file lands.
- A Transcoding Worker picks up the job.
- Worker downloads the raw file from S3 and invokes FFmpeg to produce variants: 360p, 480p, 720p, 1080p (and 4K if source quality permits).
- Worker uploads each variant file back to S3 under a path like videos/{video_id}/720p.mp4.
- Worker writes a VideoVariant row for each output (resolution, CDN URL, file size).
- Worker updates the Video record: status = ready.
flowchart LR
S3R[("🗄️ S3 Raw Storage\nRaw uploaded video\nTriggers UploadComplete event")]
TQ["📨 Transcoding Queue\nUploadComplete jobs\nDecouples upload from processing\nAt-least-once delivery"]
TW["⚙️ Transcoding Workers\nFFmpeg: 360p · 480p · 720p · 1080p\nHorizontally scalable · stateless"]
S3V[("🗄️ S3 Variant Storage\nvideos/{id}/720p.mp4 etc.\nServed via CDN")]
VDB[("🗄️ Video DB\nstatus: processing → ready\nVideoVariant rows per resolution")]
S3R -->|"UploadComplete event"| TQ
TQ -->|"Dequeue job"| TW
TW -->|"Upload variant files"| S3V
TW -->|"INSERT VideoVariant rows · UPDATE status=ready"| VDB
Workers are stateless. Scaling transcoding capacity is a matter of adding more worker instances. Each handles one video at a time; parallelism comes from running many workers, not from making individual workers faster.
FFmpeg transcodes one resolution per invocation in this design. A worker processing a 1080p source into 6 output resolutions runs 6 FFmpeg processes sequentially or spawns 6 parallel sub-processes. The per-variant files are small enough (a 10-minute 720p variant is roughly 150 MB) that S3 upload adds only a few extra seconds per variant.
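A worker mostly assembles command lines. The sketch below builds one FFmpeg invocation per (source, resolution) pair; the scale filter and codec flags are standard FFmpeg options, but the specific encoder settings (CRF 23, fast preset, audio bitrates) are illustrative assumptions rather than a tuned production config:

```python
# Target ladder: variant name -> (scale filter arg, audio bitrate). Assumed values.
RESOLUTIONS = {
    "360p": ("640:360", "96k"),
    "480p": ("854:480", "128k"),
    "720p": ("1280:720", "128k"),
    "1080p": ("1920:1080", "192k"),
}

def build_ffmpeg_cmd(src: str, out: str, variant: str) -> list:
    """Build the FFmpeg argv for a single resolution variant."""
    scale, audio_bitrate = RESOLUTIONS[variant]
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={scale}",            # resize to the target resolution
        "-c:v", "libx264", "-preset", "fast", "-crf", "23",
        "-c:a", "aac", "-b:a", audio_bitrate,
        out,
    ]

# A worker would execute one of these per variant, e.g. subprocess.run(cmd, check=True).
```

Because each command is independent, the "sequential vs parallel sub-processes" choice is just whether the worker loops over the variants or fans them out to a process pool.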
3. Users can stream a video at the right resolution
The read path: client requests the video manifest, the player fetches segments from the CDN. The goal is to start playback within 2 seconds regardless of viewer location.
Components:
- Video Service: Serves GET /videos/{video_id} and GET /videos/{video_id}/manifest.m3u8. Reads from the Video DB and assembles the HLS manifest dynamically, or serves a pre-generated one from a CDN-backed cache.
- CDN (Content Delivery Network): Stores variant files and manifests at global edge nodes. The first request for a segment warms the edge cache; subsequent requests never touch the origin.
- Video DB (unchanged): Provides variant URLs for manifest construction.
Request walkthrough:
- Client sends GET /videos/{video_id} and receives metadata including the manifest_url.
- Client (video player) fetches the HLS manifest from the CDN edge node nearest to the viewer.
- Player reads the manifest, selects an initial resolution variant based on estimated bandwidth.
- Player fetches 2-second video segments from the CDN edge node sequentially.
- As bandwidth fluctuates, the player switches resolution variants up or down between segments.
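The switching logic in the last step is simple in principle: pick the highest variant whose bitrate fits within a safety fraction of the measured bandwidth. A sketch with assumed variant bitrates (real players also smooth bandwidth estimates and consider buffer occupancy):

```python
# Assumed variant bitrates in bits/sec; real values come from the manifest's
# BANDWIDTH attributes.
VARIANT_BITRATES = {
    "360p": 1_000_000,
    "480p": 2_500_000,
    "720p": 5_000_000,
    "1080p": 8_000_000,
}

def select_variant(measured_bps: float, safety: float = 0.8) -> str:
    """Highest variant whose bitrate fits within 80% of measured bandwidth."""
    budget = measured_bps * safety
    fitting = [(bps, name) for name, bps in VARIANT_BITRATES.items() if bps <= budget]
    if not fitting:
        return "360p"  # always fall back to the lowest rung rather than stall
    return max(fitting)[1]
```

The 0.8 safety factor leaves headroom so a momentary bandwidth dip drains the buffer instead of stalling playback; the player re-evaluates this choice between segments.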
flowchart LR
C(["👤 Client\nWeb / mobile app\nvideo player"])
VS["⚙️ Video Service\nServe metadata · assemble manifest\nRead from Video DB"]
CDN["🌐 CDN Edge Nodes\nManifest + segment cache\nPops in 200+ cities · < 20ms RTT\nCache-hit ratio > 95% for popular videos"]
S3V[("🗄️ S3 Variant Storage\nOrigin for CDN\nFetched on cache miss only")]
VDB[("🗄️ Video DB\nvideo_id · variant URLs · status=ready")]
C -->|"GET /videos/{video_id}"| VS
VS -->|"SELECT variants WHERE video_id=?"| VDB
C -->|"GET manifest.m3u8"| CDN
CDN -.->|"Cache miss: fetch from origin"| S3V
C -->|"GET segment chunks"| CDN
For a popular video, 99%+ of segment requests are served from CDN edge cache. The S3 origin only receives the first request for each segment from each PoP. This is the architecture that makes 41.7M concurrent streams viable without a datacenter sized to handle them all directly.
I always draw the CDN as the primary read path for video before showing any app-tier component; for streaming, the edge network does the real work and the origin is just a backing store.
A newly published video has no CDN cache warmth. The first few hundred viewers all miss the CDN edge and hit S3 origin. For large content launches, proactively push the variant files to CDN PoPs (CDN prefetch / cache warming) before making the video publicly available.
4. Users can search for videos by title and description
Search is a separate read path. It needs full-text matching over millions of video records, which no relational database handles well at scale.
Components:
- Search Service: Accepts GET /search?q=... and queries the search index.
- Elasticsearch Cluster: Inverted index over title and description fields. Supports full-text search with ranking (BM25), fuzzy matching, and autocomplete.
- Indexing Worker: Asynchronously consumes VideoReady events from Kafka (published when transcoding completes) and indexes new video metadata into Elasticsearch. This decouples the search index update from the video write path.
Request walkthrough:
- When transcoding completes, the Transcoding Worker publishes a VideoReady event to Kafka (carries video_id, title, description).
- Indexing Worker consumes the event and calls elasticsearch.index(video_id, title, description).
- Client sends GET /search?q=system+design+interview.
- Search Service queries Elasticsearch for matching documents, returns video metadata sorted by relevance score.
- Client renders the result list with pagination cursor.
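The walkthrough above can be grounded with the query body the Search Service would send. The multi_match query, AUTO fuzziness, and field boosting (^2 on title) are standard Elasticsearch DSL; the page size and the pipe-delimited cursor encoding are assumptions of this sketch:

```python
def build_search_query(q: str, cursor=None, size: int = 20) -> dict:
    """Build an Elasticsearch query body for full-text video search."""
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": q,
                "fields": ["title^2", "description"],  # title matches weighted 2x
                "fuzziness": "AUTO",                   # tolerate small typos
            }
        },
        # Tiebreak on video_id so pagination is stable across identical scores.
        "sort": [{"_score": "desc"}, {"video_id": "asc"}],
    }
    if cursor:
        # search_after-style cursor: resume from the last (score, video_id) seen.
        body["search_after"] = cursor.split("|")
    return body
```

Using search_after instead of from/size offsets is what makes the cursor stable when new videos are indexed between pages.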
flowchart LR
TW["⚙️ Transcoding Worker\nPublishes VideoReady event\nafter status=ready"]
MQ["📨 Kafka\nVideoReady events\nAt-least-once · durable"]
IW["⚙️ Indexing Worker\nConsumes VideoReady\nIndexes title + description"]
ES["🔍 Elasticsearch Cluster\nInverted index on title · description\nFull-text · BM25 ranking · fuzzy match"]
C(["👤 Client\nWeb / mobile app"])
SS["⚙️ Search Service\nQuery Elasticsearch\nPaginate with cursor"]
TW -->|"Publish VideoReady"| MQ
MQ -->|"Consume event"| IW
IW -->|"Index video metadata"| ES
C -->|"GET /search?q=..."| SS
SS -->|"Full-text query"| ES
Search is decoupled from the upload and streaming paths. An Elasticsearch cluster outage does not affect playback or upload. New videos appear in search results within seconds of the VideoReady event being processed, not when the upload completes.
Potential Deep Dives
I always open the transcoding deep dive by anchoring on chunk-level parallelism; it is the decision that changes the latency curve from linear (proportional to video length) to constant (one chunk duration), and it immediately differentiates a strong answer from a generic one.
1. How do we scale the transcoding pipeline?
Three constraints drive this problem:
- 500 hours of video uploaded per minute, each needing 6+ resolution variants.
- Large raw files (a 1-hour 4K recording is 50-100 GB) must be downloaded, processed, and re-uploaded without shared state between workers.
- Transcoding is CPU-intensive and slow: a 10-minute 1080p file takes 2-5 minutes of FFmpeg time on a 4-core machine.
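The chunk-level parallelism mentioned in the deep-dive opener is easiest to see as a fan-out count. Splitting the source into fixed 30-second chunks and treating every (chunk, resolution) pair as an independent job makes wall-clock latency roughly one chunk's transcode time, regardless of video length:

```python
import math

CHUNK_SECONDS = 30
RESOLUTIONS = 4  # 360p, 480p, 720p, 1080p in this sketch

def fanout_jobs(video_seconds: int) -> int:
    """Number of independent (chunk, resolution) transcode jobs for a video."""
    chunks = math.ceil(video_seconds / CHUNK_SECONDS)
    return chunks * RESOLUTIONS

# A 10-minute video fans out to 20 chunks x 4 resolutions = 80 parallel jobs;
# with enough workers, all 80 finish in roughly the time to transcode one
# 30-second chunk, plus a final assembly step that stitches the segments.
print(fanout_jobs(600))
```

The trade-off is a split step up front and an assembly step at the end, plus per-chunk completion tracking so the pipeline knows when all jobs for a video are done.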
2. How do we store and deliver video at global scale?
Context: 41.7M concurrent streams at any moment, viewer population distributed across 200+ countries. The origin storage is centralized in a few US/EU regions. Every stream request hitting the origin directly would require over 330 Tbps of egress bandwidth from origin datacenters, with 150-300ms round-trip times for Asia-Pacific viewers.
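The 330 Tbps figure follows directly from the concurrency estimate. A quick check, assuming 8 Mbps as a typical 1080p streaming bitrate:

```python
# Egress bandwidth if every stream hit the origin directly.
CONCURRENT_STREAMS = 41_700_000  # from the throughput requirement
BITRATE_MBPS = 8                 # assumed typical 1080p streaming bitrate

total_tbps = CONCURRENT_STREAMS * BITRATE_MBPS / 1_000_000  # Mbps -> Tbps
print(total_tbps)  # ~333.6 Tbps, far beyond any single region's egress capacity
```

A CDN converts that into per-PoP fan-out: the origin pays bandwidth roughly proportional to distinct segments × PoPs, while the edge absorbs the per-viewer multiplication.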
3. How do we store video metadata and support fast lookups?
Context: The Video DB needs to support three distinct access patterns: (1) lookup by video_id when serving watch pages, (2) lookup by uploader_id for channel pages listing all videos in reverse-chronological order, and (3) aggregated queries like view counts. At 500M registered users and billions of videos, these patterns have very different scaling requirements.
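The standard Cassandra answer to patterns (1) and (2) is to denormalize: write the same row into two tables, each partitioned for one access pattern. A toy sketch using dicts as stand-ins for the two tables, with the clustering order assumed to be created_at descending:

```python
videos_by_id = {}        # stand-in for a table with partition key video_id
videos_by_uploader = {}  # partition key uploader_id, clustered by created_at DESC

def write_video(video: dict) -> None:
    """Dual-write the video row so both lookups are single-partition reads."""
    videos_by_id[video["video_id"]] = video
    channel = videos_by_uploader.setdefault(video["uploader_id"], [])
    channel.append(video)
    # Keep newest-first, mimicking a created_at DESC clustering order;
    # Cassandra maintains this ordering natively within the partition.
    channel.sort(key=lambda v: v["created_at"], reverse=True)
```

The watch page reads videos_by_id[video_id]; the channel page reads the head of videos_by_uploader[uploader_id]. Neither query ever touches more than one partition, which is what keeps both lookups fast at any scale.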
4. How do we handle the search index at scale?
Context: Users expect search to return results in under 500ms with relevant ranking. The video catalog grows by 500 hours per minute. Titles and descriptions need full-text matching with fuzzy tolerance, not exact prefix matching.
Final Architecture
flowchart LR
subgraph Clients["👤 Clients"]
C(["👤 User\nWeb / Mobile app"])
end
subgraph Gateway["🔀 Gateway Layer"]
AG["🔀 API Gateway\nAuth · rate limiting · routing"]
end
subgraph AppTier["⚙️ App Services"]
US["⚙️ Upload Service\nPre-signed S3 URL · INSERT video row"]
VS["⚙️ Video Service\nWatch page · channel page · INCR view count"]
SS["⚙️ Search Service\nBM25 query via Elasticsearch"]
end
subgraph AsyncTier["📨 Async Pipeline"]
MQ["📨 Kafka\nUploadComplete · VideoDeleted"]
PIPELINE["⚙️ Async Workers\nSplit · FFmpeg transcode · assemble manifest\n30s chunks · N×R parallel jobs · index to ES"]
end
subgraph CacheTier["⚡ Cache + CDN"]
CDN["🌐 CDN Edge Nodes\nHLS segments + manifests\n200+ PoPs · < 20ms · > 95% cache hit"]
RC["⚡ Redis\nView counts: view_count:{id} · INCR < 1ms\nFlushes to Cassandra every 30s"]
end
subgraph StorageTier["🗄️ Storage Tier"]
S3[("🗄️ S3 Storage\nRaw uploads · HLS segment files\nSource of truth · CDN origin")]
CASS[("🗄️ Cassandra\nvideos_by_id · videos_by_uploader\n3-replica · partition reads")]
ES["🔍 Elasticsearch\nInverted index · BM25 · fuzzy\nIsolated from metadata DB"]
end
C -->|"POST /videos/upload"| AG
C -->|"GET /videos/{id}"| AG
C -->|"GET /search?q=..."| AG
AG -->|"Write"| US
AG -->|"Read metadata"| VS
AG -->|"Search"| SS
US -->|"INSERT video row"| CASS
C -->|"PUT raw video file"| S3
S3 -->|"UploadComplete event"| MQ
MQ -->|"Consume UploadComplete"| PIPELINE
PIPELINE -->|"UPDATE status=ready"| CASS
PIPELINE -->|"Upsert video metadata"| ES
VS -->|"SELECT videos_by_id"| CASS
VS -->|"INCR view_count:{id}"| RC
RC -.->|"Flush delta every 30s"| CASS
SS -->|"BM25 multi_match query"| ES
C -->|"Stream HLS manifest + segments"| CDN
CDN -.->|"Cache miss"| S3
The upload and streaming paths are fully decoupled. Upload lands in S3 and triggers an async transcoding pipeline that scales horizontally by adding Transcoding Workers. Streaming serves from CDN edge nodes that are warmed by the first viewer and paid for by subsequent cache hits, keeping S3 and origin bandwidth proportional to distinct segment requests, not total playback hours.
Interview Cheat Sheet
- State the asymmetry first: 500 hours of video are uploaded every minute, but 1B hours are watched every day, a read/write ratio of roughly 1,400:1 that determines every architectural tradeoff.
- Return video_id and status=processing immediately on upload, then have the client poll for status=ready while transcoding runs asynchronously in the background.
- Pre-signed S3 URLs let the client upload video bytes directly to S3, bypassing the API tier entirely so large binary payloads never touch application servers.
- Chunk-based transcoding collapses a full video's processing time to one chunk's processing time: split into 30-second segments, transcode each (chunk, resolution) pair in parallel, and a 10-minute video at 4 resolutions finishes in ~30 seconds of wall-clock time.
- CDN delivery is not optional at this scale: 41.7M concurrent 1080p streams at 8 Mbps totals over 330 Tbps of egress, which only a globally distributed CDN with 200+ PoPs can absorb.
- Proactively warm CDN edges for high-demand videos at publish time using uploader subscriber count and category signals to predict traffic, pushing segments to all PoPs before the first viewer arrives.
- Store video metadata in two Cassandra tables (videos_by_id for watch pages, videos_by_uploader for channel pages) to get single-partition reads for both access patterns at any scale without scatter-gather.
- View counts live in Redis (atomic INCR, sub-millisecond), not the metadata table; a background flusher batches deltas to Cassandra every 30 seconds, giving live freshness without write amplification on the storage tier.
- Elasticsearch handles search with BM25 ranking, fuzzy tolerance, and a completion suggester for autocomplete, indexed asynchronously from a Kafka consumer so the search cluster is never on the upload critical path.
- Publishing a VideoReady event to Kafka on transcoding completion fans out to the Indexing Worker (upserts Elasticsearch), the CDN warming service, and downstream notification delivery, all decoupled from the write path.
- Every step in the pipeline is idempotent: S3 overwrites, Cassandra IF NOT EXISTS inserts, Redis SETNX chunk-completion signals, and Kafka at-least-once consumers all tolerate replayed events safely.
- Handle deletion by publishing a VideoDeleted event that asynchronously removes segments from S3, purges CDN caches, removes both Cassandra rows, and calls es.delete(video_id), with only the status update happening synchronously.
- Meeting the 2-second playback start target requires CDN proximity: serve segments from a PoP in the viewer's city, not from a central US origin.