YouTube
Walk through a complete YouTube design, from a bare upload service to a globally distributed video platform handling 500 hours of uploads per minute and 1B hours of daily playback.
What is YouTube?
YouTube is a platform where users upload, store, and stream video globally. The engineering challenge is not storage; it is converting 500 hours of raw video per minute into multiple adaptive bitrate formats and delivering each stream at the right resolution to viewers in 200+ countries within minutes of upload. It tests CDN architecture, async pipeline design, and distributed storage in a single question, which is why it is one of the highest-signal problems in the system design interview circuit. I treat this question as a pipeline design problem first and a storage problem second: getting the transcoding architecture right unlocks every other decision.
Functional Requirements
Core Requirements
- Users can upload a video file.
- After upload, the video becomes available to watch (transcoded into multiple resolutions).
- Users can stream a video at a resolution appropriate for their device and connection.
- Users can search for videos by title and description.
Below the Line (out of scope)
- Comments, likes, and subscriptions
- Recommendations and personalized home feed
- Live streaming
- Monetization and ads
The hardest part in scope: Video transcoding. A raw uploaded file must be converted into 6+ output variants (360p, 480p, 720p, 1080p, 4K, HDR) before the video is marked ready to watch. At 500 hours of uploaded video per minute, the transcoding pipeline is the highest-throughput compute subsystem in the architecture.
Comments, likes, and subscriptions are below the line because they do not affect the upload or streaming paths. To add them, I would store a video_comments table keyed by (video_id, comment_id) and a video_reactions table keyed by (video_id, user_id). Like counts would be cached in Redis and reconciled to a database asynchronously.
Recommendations are below the line because they form a completely separate offline ML pipeline. To add them, I would emit watch events to a Kafka topic and train a collaborative filtering model offline, serving recommendations via a low-latency feature store.
Live streaming is below the line because it replaces the upload-then-transcode model with a real-time ingest and segment delivery model (HLS or DASH live). The architecture diverges significantly from the stored video path.
Monetization is below the line because ad serving is a separate system with its own auction, targeting, and reporting infrastructure that does not touch the core upload or playback path.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime for video playback. Availability over consistency: a viewer watching a video should never see a playback error due to backend failures.
- Latency: Video playback begins within 2 seconds of pressing play. Upload acknowledgment completes within 500ms (the actual processing continues asynchronously).
- Throughput: 500 hours of video uploaded per minute. 1B hours of video watched daily (roughly 41.7M concurrent streams on average, calculated as 1B hours × 3,600 s/hr ÷ 86,400 s/day).
- Durability: Uploaded video must never be lost. Stored across at least 3 geographic regions.
- Search latency: Search results return within 500ms p99.
Below the Line
- Sub-100ms time-to-first-byte via edge PoPs in every major city
- Real-time view count consistency
Read/write ratio: Video streaming traffic dwarfs upload traffic by a factor of roughly 1,400:1. For every 500 hours of video uploaded per minute, roughly 694,000 hours of video are consumed per minute (1B hours per day ÷ 1,440 minutes). This asymmetry shapes every decision in this article: the entire write path (upload, transcode, storage) can be slow and asynchronous because the read path (streaming) must be fast and globally distributed.
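The traffic figures above come from a few lines of arithmetic. A quick check, using only the 500 hours per minute and 1B hours per day inputs (the variable names are mine):

```python
# Back-of-envelope check of the traffic numbers quoted above.
UPLOAD_HOURS_PER_MINUTE = 500
WATCH_HOURS_PER_DAY = 1_000_000_000

SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440

# Average concurrent streams: seconds watched per day / seconds in a day.
concurrent_streams = WATCH_HOURS_PER_DAY * SECONDS_PER_HOUR / SECONDS_PER_DAY

# Hours watched per minute, and the resulting read/write ratio.
watch_hours_per_minute = WATCH_HOURS_PER_DAY / MINUTES_PER_DAY
read_write_ratio = watch_hours_per_minute / UPLOAD_HOURS_PER_MINUTE

print(f"{concurrent_streams / 1e6:.1f}M concurrent streams")
print(f"{watch_hours_per_minute:,.0f} hours watched per minute")
print(f"read/write ratio ~ {read_write_ratio:,.0f}:1")
```

The ratio comes out just under 1,400:1, which is where the rounded figure comes from. Note that the concurrency number is a daily average; peak concurrency would be higher.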
The 2-second playback start target rules out serving video files directly from a central origin server. Network round-trip time alone from Asia to a US datacenter is 150-200ms, and streaming a 1080p file at 8 Mbps from a single origin saturates bandwidth quickly. CDN edge delivery is mandatory, not optional.
I'd call out the 2-second playback target early in the interview, because it immediately rules out a single-origin setup and locks in CDN as a non-negotiable component before the design starts.
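To make the bandwidth argument concrete, here is a rough sketch of how far a single origin link stretches. The 100 Gbps link size is my assumption for illustration, as is the worst case of every viewer pulling the 8 Mbps 1080p stream:

```python
STREAM_BITRATE_MBPS = 8              # 1080p bitrate quoted above
ORIGIN_LINK_GBPS = 100               # assumption: one 100 Gbps origin uplink
AVG_CONCURRENT_STREAMS = 41_700_000  # rounded average from the throughput NFR

# How many 1080p streams fit on one link (1 Gbps = 1,000 Mbps).
streams_per_link = ORIGIN_LINK_GBPS * 1_000 // STREAM_BITRATE_MBPS

# Ceiling division: links needed if the origin served every viewer directly.
links_needed = -(-AVG_CONCURRENT_STREAMS // streams_per_link)

print(f"{streams_per_link:,} streams per link, {links_needed:,} links needed")
```

Under these assumptions a single link carries about 12,500 streams, so an origin-only design would need thousands of such links before accounting for latency at all.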
Core Entities
- Video: The uploaded content. Carries a video_id, uploader_id, title, description, status (processing, ready, failed), and created_at. The status field tracks where the video is in the transcoding pipeline.
- VideoVariant: A single transcoded output for a specific resolution and codec. Links back to video_id and stores the CDN URL for the variant file. A single video produces 6-8 variants.
- User: An account. Carries a user_id, display name, and channel metadata.
- SearchIndex (derived): An inverted index over video titles and descriptions. Not a stored table; populated asynchronously from Video records and served by a dedicated search service.
The full schema, indexes, and partition keys are deferred to the data model deep dive. The four entities above are sufficient to drive the API design and High-Level Design.
I treat SearchIndex as a derived entity rather than a first-class stored table; if the interviewer has not asked about search, I can introduce it only when functional requirement 4 comes up.
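The three stored entities can be sketched as plain data classes. This is only a whiteboard-level shape, not the deferred schema: the field types, the codec and file_size_bytes fields, and the enum name are my assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class VideoStatus(Enum):
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class Video:
    video_id: str
    uploader_id: str
    title: str
    description: str = ""
    status: VideoStatus = VideoStatus.PROCESSING   # every upload starts here
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class VideoVariant:
    video_id: str          # links back to Video
    resolution: str        # e.g. "720p"
    codec: str             # e.g. "h264" (illustrative value)
    cdn_url: str
    file_size_bytes: int


@dataclass
class User:
    user_id: str
    display_name: str
    channel_metadata: dict = field(default_factory=dict)
```

SearchIndex is deliberately absent: it is derived from Video records rather than stored alongside them.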
API Design
Upload a video:
POST /videos/upload
Body: { title, description } (metadata only; the file itself goes to the returned upload_url)
Response: { video_id, upload_url }
Get video metadata and playback manifest:
GET /videos/{video_id}
Response: { video_id, title, description, status, manifest_url }
Stream a video (adaptive bitrate manifest):
GET /videos/{video_id}/manifest.m3u8
Response: HLS manifest listing all resolution variants
Search for videos:
GET /search?q={query}&cursor?
Response: { videos: [...], next_cursor }
Pre-signed upload URL: Rather than accepting binary file data through the API server, the Upload Service generates a pre-signed S3 URL and returns it to the client. The client uploads directly to S3, bypassing the application tier entirely. This keeps large binary payloads off the API servers, removes an entire network hop, and lets S3 handle multipart resumable uploads natively. The API server only deals with metadata.
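The idea behind a pre-signed URL is that the URL itself carries an expiry and a signature, so the storage layer can authorize the request without calling back to the API tier. The sketch below illustrates that idea with a plain HMAC. It is not S3's actual signing algorithm (Signature Version 4), and the key, host name, and function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical key shared with the storage layer


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign_upload(path: str, ttl_seconds: int = 3600,
                   now: Optional[float] = None) -> str:
    """Return a URL that authorizes PUT requests to `path` until it expires."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    query = urlencode(
        {"expires": expires, "signature": _signature("PUT", path, expires)}
    )
    return f"https://storage.example.com{path}?{query}"


def verify_upload(method: str, url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: signature matches and the URL has not expired."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(method, parsed.path, expires)
    return hmac.compare_digest(signature, expected) and now < expires
```

Because the method, path, and expiry are all part of the signed message, a client cannot reuse the URL for a different object, a different HTTP verb, or after the deadline.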
HLS vs raw file URL: The manifest_url points to an HLS (.m3u8) or DASH manifest, not a direct video file URL. The manifest lists all available resolution variants and segment URLs. The video player selects segments adaptively based on available bandwidth. This is how YouTube, Netflix, and every major streaming platform delivers video today.
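For a sense of what the player actually receives, here is a minimal sketch that builds an HLS master playlist. The #EXTM3U and #EXT-X-STREAM-INF tags are standard HLS; the bitrates, the Rendition type, and the per-variant playlist paths are my assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rendition:
    name: str            # e.g. "720p"
    width: int
    height: int
    bandwidth_bps: int   # peak bitrate advertised to the player (illustrative)


def build_master_playlist(renditions: list) -> str:
    """Build a master playlist that points at one media playlist per variant."""
    lines = ["#EXTM3U"]
    for r in sorted(renditions, key=lambda r: r.bandwidth_bps):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},"
            f"RESOLUTION={r.width}x{r.height}"
        )
        # URI is relative to the manifest location.
        lines.append(f"{r.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


ladder = [
    Rendition("1080p", 1920, 1080, 5_000_000),
    Rendition("360p", 640, 360, 800_000),
    Rendition("720p", 1280, 720, 2_800_000),
]
print(build_master_playlist(ladder))
```

The player reads the BANDWIDTH attributes, measures its own throughput, and switches between the listed variants segment by segment.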
Cursor-based pagination applies to search results. Offset pagination breaks when new videos are indexed between pages. A cursor encoding the last-seen video_id ensures stable pagination.
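A minimal sketch of that cursor, assuming results are ordered by video_id (a real search service ranks by relevance, so the cursor would also need to carry the sort key). The function names and the in-memory list are hypothetical stand-ins for the search backend.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_video_id: str) -> str:
    raw = json.dumps({"last_video_id": last_video_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> str:
    return json.loads(base64.urlsafe_b64decode(cursor))["last_video_id"]


def search_page(sorted_ids: list, cursor: Optional[str], limit: int = 2):
    """Return (page, next_cursor) from results ordered by video_id."""
    after = decode_cursor(cursor) if cursor else None
    # Resume strictly after the last item the client saw.
    remaining = [v for v in sorted_ids if after is None or v > after]
    page = remaining[:limit]
    next_cursor = encode_cursor(page[-1]) if len(remaining) > limit else None
    return page, next_cursor


ids = ["v1", "v2", "v3", "v4", "v5"]
page1, cursor1 = search_page(ids, None)
ids.insert(0, "v0")                      # a new video is indexed between pages
page2, cursor2 = search_page(ids, cursor1)
print(page1, page2)
```

With offset pagination the insert would shift every item by one and page 2 would repeat v2. The cursor resumes after the last-seen id, so the pages stay stable.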
My recommendation for the upload flow is to return a video_id immediately with status=processing and have the client poll for status=ready. Blocking the upload API response on transcoding completion would mean the client waits 5-15 minutes for a 201.
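On the client side, the polling loop might look like the sketch below. The backoff values and timeout are my assumptions, and fetch_status stands in for a call to GET /videos/{video_id} that returns the status field.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    timeout_s: float = 1800.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status leaves "processing" or the time budget is spent."""
    waited = 0.0
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status != "processing":
            return status                      # "ready" or "failed"
        if waited + delay > timeout_s:
            raise TimeoutError("video still processing")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)    # exponential backoff, capped
```

Capping the backoff keeps the polling load on the Video Service low for long transcodes while still noticing short ones quickly. The sleep parameter is injectable so the loop can be exercised without real waiting.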
High-Level Design
1. Users can upload a video file
The write path: client requests an upload URL, uploads directly to object storage, the server records the video metadata and begins transcoding.
Components:
- Client: Web or mobile app sending the initial upload request.
- Upload Service: Validates the request, generates a pre-signed upload URL, and creates a Video record with status = processing.
- Object Storage (S3): Stores the raw uploaded file. Durable, replicated, designed for large binary objects.
- Video DB: Stores video metadata and tracks processing status.
Request walkthrough:
- Client sends POST /videos/upload with the video title and optional description.
- Upload Service creates a Video record in the Video DB with status = processing.
- Upload Service generates a pre-signed S3 URL (valid for 1 hour) and returns it with the video_id.
- Client uploads the raw video file directly to S3 using the pre-signed URL.
- S3 triggers a storage event when the upload completes.
The client uploads the raw file directly to S3, bypassing the Upload Service entirely. The API tier only handles metadata. Transcoding is deferred to the next requirement.
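The Upload Service side of this walkthrough is small enough to sketch in full. The in-memory dict stands in for the Video DB, and presign_put is a hypothetical placeholder for whatever signing call the object store provides.

```python
import uuid

video_db: dict = {}   # in-memory stand-in for the Video DB


def presign_put(key: str, ttl_seconds: int) -> str:
    # Hypothetical helper: a real service would ask the object store to sign this.
    return f"https://uploads.example.com/{key}?ttl={ttl_seconds}&signature=PLACEHOLDER"


def create_upload(uploader_id: str, title: str, description: str = "") -> dict:
    """Handle POST /videos/upload: record metadata, return where to send the file."""
    if not title.strip():
        raise ValueError("title is required")
    video_id = uuid.uuid4().hex
    video_db[video_id] = {
        "uploader_id": uploader_id,
        "title": title,
        "description": description,
        "status": "processing",     # stays here until transcoding finishes
    }
    upload_url = presign_put(f"raw/{video_id}", ttl_seconds=3600)  # valid 1 hour
    return {"video_id": video_id, "upload_url": upload_url}
```

Nothing in this handler touches video bytes, which is why it can acknowledge within the 500ms target regardless of file size.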
2. After upload, the video becomes available to watch
Transcoding pipeline: when the raw upload lands in S3, an async worker picks it up, converts it into multiple resolution variants, stores them back in S3, and marks the video ready.
Components:
- Transcoding Queue (SQS/Kafka): S3 triggers an event on upload completion. The transcoding queue holds pending jobs, decoupling upload from processing.
- Transcoding Workers: Stateless, horizontally scalable workers. Each worker pulls a job, invokes the transcoding binary (FFmpeg) for each resolution, uploads the output variants to S3, and updates the Video DB.
- Video DB (updated): The status field transitions from processing to ready when all variants are complete.
- CDN: The variant files in S3 are served via CDN edge nodes. The CDN URL for each variant is written to the VideoVariant table.
Request walkthrough (transcoding path):
- S3 publishes an UploadComplete event to the transcoding queue when the raw file lands.
- A Transcoding Worker picks up the job.
- Worker downloads the raw file from S3 and invokes FFmpeg to produce variants: 360p, 480p, 720p, 1080p (and 4K if source quality permits).
- Worker uploads each variant file back to S3 under a path like videos/{video_id}/720p.mp4.
- Worker writes a VideoVariant row for each output (resolution, CDN URL, file size).
- Worker updates the Video record: status = ready.
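The worker's handling of one queue message can be sketched as follows. The dict and list stand in for the Video DB and VideoVariant table, transcode is an injected stand-in for the download, FFmpeg, and upload steps, and the failed transition on error is my assumption based on the status values listed under Core Entities.

```python
from typing import Callable

LADDER = ["360p", "480p", "720p", "1080p"]   # resolutions named in the walkthrough


def handle_upload_complete(
    video_id: str,
    video_db: dict,
    variant_table: list,
    transcode: Callable[[str, str], int],
) -> None:
    """Process one queue message: transcode every rung, then flip the status."""
    try:
        rows = []
        for resolution in LADDER:
            key = f"videos/{video_id}/{resolution}.mp4"
            size = transcode(video_id, resolution)   # returns output size in bytes
            rows.append({
                "video_id": video_id,
                "resolution": resolution,
                "cdn_url": f"https://cdn.example.com/{key}",
                "file_size": size,
            })
        variant_table.extend(rows)                   # publish variants together
        video_db[video_id]["status"] = "ready"       # only after every rung succeeds
    except Exception:
        video_db[video_id]["status"] = "failed"
        raise   # let the queue redeliver or dead-letter the message
```

Writing the variant rows only after the whole ladder succeeds means a viewer never sees a video marked ready with a partial set of resolutions.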
Workers are stateless. Scaling transcoding capacity is a matter of adding more worker instances. Each handles one video at a time; parallelism comes from running many workers, not from making individual workers faster.
This design runs one FFmpeg invocation per output resolution. A worker processing a 1080p source into 6 output resolutions runs 6 FFmpeg processes sequentially or spawns 6 parallel sub-processes. The per-variant files are small enough (a 10-minute 720p variant is roughly 150MB) that the S3 upload adds only a few extra seconds per variant.
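A sketch of how a worker might build those per-resolution FFmpeg commands. The flags used (-i, -vf scale, -c:v, -b:v, -c:a, -b:a, -y) are standard FFmpeg options; the bitrate ladder, codec choices, and function names are illustrative assumptions rather than a recommended encoding profile.

```python
LADDER = {   # target height -> video bitrate; values are illustrative
    360: "800k",
    480: "1400k",
    720: "2800k",
    1080: "5000k",
    2160: "16000k",
}


def rungs_for_source(source_height: int) -> list:
    """Skip rungs taller than the source so nothing is upscaled."""
    return [h for h in LADDER if h <= source_height]


def ffmpeg_command(src: str, dst: str, height: int) -> list:
    """Build the argument list for one output resolution."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",        # keep aspect ratio, force even width
        "-c:v", "libx264", "-b:v", LADDER[height],
        "-c:a", "aac", "-b:a", "128k",
        dst,
    ]


for h in rungs_for_source(1080):
    print(" ".join(ffmpeg_command("raw.mp4", f"{h}p.mp4", h)))
```

Each list can be passed to subprocess.run(cmd, check=True), either one after another or from a small process pool when the worker has spare cores.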
3. Users can stream a video at the right resolution
The read path: client requests the video manifest, the player fetches segments from the CDN. The goal is to start playback within 2 seconds regardless of viewer location.
Components:
- Video Service: Serves GET /videos/{video_id} and the manifest endpoint (GET /videos/{video_id}/manifest.m3u8). Reads from the Video DB and assembles the HLS manifest dynamically, or serves a pre-generated one from a CDN-backed cache.
- CDN (Content Delivery Network): Stores variant files and manifests at global edge nodes. The first request for a segment warms the edge cache; subsequent requests are served from the edge without touching the origin until the cached copy expires or is evicted.
- Video DB (unchanged): Provides variant URLs for manifest construction.
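The cache-warming behavior of a CDN edge can be illustrated with a toy pull-through cache. This is a deliberately simplified sketch: it has no TTL, eviction, or request coalescing, and the class and parameter names are hypothetical.

```python
class EdgeCache:
    """Toy pull-through cache for a single CDN edge node."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_hits = 0

    def get(self, segment_url: str) -> bytes:
        if segment_url not in self._store:        # cold: go to origin once
            self.origin_hits += 1
            self._store[segment_url] = self._fetch(segment_url)
        return self._store[segment_url]           # warm: served at the edge


edge = EdgeCache(lambda url: b"segment-bytes")
for _ in range(1000):
    edge.get("videos/abc/720p/seg_0001.ts")
print(edge.origin_hits)
```

A thousand viewers of the same segment at one edge cost the origin a single fetch, which is what makes the read-heavy traffic profile manageable.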
Request walkthrough: