YouTube

What is YouTube?

YouTube is a platform where users upload, store, and stream video globally. The engineering challenge is not storage; it is converting 500 hours of raw video per minute into multiple adaptive bitrate formats and delivering each stream at the right resolution to viewers in 200+ countries within minutes of upload. It tests CDN architecture, async pipeline design, and distributed storage in a single question, which is why it is one of the highest-signal problems in the system design interview circuit. I treat this question as a pipeline design problem first and a storage problem second: getting the transcoding architecture right unlocks every other decision.

Functional Requirements

Core Requirements

Users can upload a video file.
After upload, the video becomes available to watch (transcoded into multiple resolutions).
Users can stream a video at a resolution appropriate for their device and connection.
Users can search for videos by title and description.

Below the Line (out of scope)

Comments, likes, and subscriptions
Recommendations and personalized home feed
Live streaming
Monetization and ads

The hardest part in scope: Video transcoding. A raw uploaded file must be converted into 6+ output variants (for example 360p, 480p, 720p, 1080p, 4K, and an HDR rendition) before the video can be marked ready to watch. At 500 hours of video uploaded per minute, the transcoding pipeline is the highest-throughput compute subsystem in the architecture.

Comments, likes, and subscriptions are below the line because they do not affect the upload or streaming paths. To add them, I would store a video_comments table keyed by (video_id, comment_id) and a video_reactions table keyed by (video_id, user_id). Like counts would be cached in Redis and reconciled to a database asynchronously.

Recommendations are below the line because they form a completely separate offline ML pipeline. To add them, I would emit watch events to a Kafka topic and train a collaborative filtering model offline, serving recommendations via a low-latency feature store.

Live streaming is below the line because it replaces the upload-then-transcode model with a real-time ingest and segment delivery model (HLS or DASH live). The architecture diverges significantly from the stored video path.

Monetization is below the line because ad serving is a separate system with its own auction, targeting, and reporting infrastructure that does not touch the core upload or playback path.

Non-Functional Requirements

Core Requirements

Availability: 99.99% uptime for video playback. Availability over consistency: a viewer watching a video should never see a playback error due to backend failures.
Latency: Video playback begins within 2 seconds of pressing play. Upload acknowledgment completes within 500ms (the actual processing continues asynchronously).
Throughput: 500 hours of video uploaded per minute. 1B hours of video watched daily (roughly 41.7M concurrent streams at any moment, calculated as 1B hours × 3,600 s/hr ÷ 86,400 s/day).
Durability: Uploaded video must never be lost. Stored across at least 3 geographic regions.
Search latency: Search results return within 500ms p99.

Below the Line

Sub-100ms time-to-first-byte via edge PoPs in every major city
Real-time view count consistency

Read/write ratio: Video streaming traffic dwarfs upload traffic by a factor of roughly 1,400:1. For every 500 hours of video uploaded per minute, roughly 694,000 hours of video are consumed per minute (1B hours per day ÷ 1,440 minutes). This asymmetry shapes every decision in this article: the write path (upload, transcode, storage) can afford to be slow and asynchronous, while the read path (streaming) must be fast and globally distributed.
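The figures above are easy to re-derive during an interview. This is a small arithmetic check using only the numbers quoted in the requirements; the variable names are mine.

```python
# Back-of-envelope check of the throughput figures quoted above.
UPLOAD_HOURS_PER_MINUTE = 500
WATCH_HOURS_PER_DAY = 1_000_000_000
MINUTES_PER_DAY = 24 * 60

watch_hours_per_minute = WATCH_HOURS_PER_DAY / MINUTES_PER_DAY
read_write_ratio = watch_hours_per_minute / UPLOAD_HOURS_PER_MINUTE

# Every hour watched occupies one stream for one hour, so the average
# number of concurrent streams is watch-hours per day divided by 24.
avg_concurrent_streams = WATCH_HOURS_PER_DAY / 24

print(f"{watch_hours_per_minute:,.0f} watch-hours per minute")
print(f"read/write ratio ~ {read_write_ratio:,.0f}:1")
print(f"{avg_concurrent_streams / 1e6:.1f}M average concurrent streams")
```

The concurrency number is an average over the day; real traffic peaks above it, so capacity planning would apply a peak factor on top.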

The 2-second playback start target rules out serving video files directly from a central origin server. Network round-trip time alone from Asia to a US datacenter is 150-200ms, and at roughly 8 Mbps per 1080p stream a single origin's egress capacity is exhausted long before it reaches tens of millions of concurrent viewers. CDN edge delivery is mandatory, not optional.
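To put rough numbers on the single-origin argument, here is a sketch under two loud assumptions: every concurrent stream runs at the 8 Mbps 1080p rate (an upper bound, since many viewers watch at lower bitrates), and the intercontinental round trip sits at the 200ms end of the range.

```python
# Assumption: all streams at 8 Mbps, which overstates real traffic.
AVG_CONCURRENT_STREAMS = 1_000_000_000 / 24   # from 1B watch-hours per day
BITRATE_MBPS = 8                               # 1080p figure used above

aggregate_tbps = AVG_CONCURRENT_STREAMS * BITRATE_MBPS / 1_000_000

# Assumption: worst-case 200ms round trip against the 2-second start budget.
RTT_MS = 200
PLAYBACK_BUDGET_MS = 2000
round_trips_in_budget = PLAYBACK_BUDGET_MS // RTT_MS

print(f"~{aggregate_tbps:.0f} Tbps aggregate egress if served from one origin")
print(f"{round_trips_in_budget} round trips fit in the playback budget")
```

Playback startup usually needs several round trips (connection setup, the manifest, the first segment), so a budget of about ten long-haul round trips leaves little room for anything else.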

I'd call out the 2-second playback target early in the interview, because it immediately rules out a single-origin setup and locks in CDN as a non-negotiable component before the design starts.

Core Entities

Video: The uploaded content. Carries a video_id, uploader_id, title, description, status (processing, ready, failed), and created_at. The status field tracks where the video is in the transcoding pipeline.
VideoVariant: A single transcoded output for a specific resolution and codec. Links back to video_id and stores the CDN URL for the variant file. A single video produces 6-8 variants.
User: An account. Carries a user_id, display name, and channel metadata.
SearchIndex (derived): An inverted index over video titles and descriptions. Not a stored table; populated asynchronously from Video records and served by a dedicated search service.

The full schema, indexes, and partition keys are deferred to the data model deep dive. The four entities above are sufficient to drive the API design and High-Level Design.
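As a placeholder until the data model deep dive, the three stored entities could be sketched as plain dataclasses. The field names follow the descriptions above; the types, defaults, and the `VideoStatus` enum are my assumptions, not a schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class VideoStatus(str, Enum):
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class User:
    user_id: str
    display_name: str


@dataclass
class Video:
    video_id: str
    uploader_id: str
    title: str
    description: str = ""
    # New videos start in "processing" until transcoding finishes.
    status: VideoStatus = VideoStatus.PROCESSING
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class VideoVariant:
    video_id: str      # links back to Video
    resolution: str    # e.g. "720p"
    codec: str         # e.g. "h264" (assumed value)
    cdn_url: str
```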

I treat SearchIndex as a derived entity rather than a first-class stored table; if the interviewer has not asked about search, I hold it back until functional requirement 4 comes up.
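To make "inverted index" concrete, here is a toy in-memory version: each token maps to the set of video IDs whose title or description contains it, and a query returns the videos matching every token. A production search service would add ranking, stemming, and sharding; the class and function names are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)   # token -> {video_id, ...}

    def add(self, video_id, title, description=""):
        for token in tokenize(f"{title} {description}"):
            self._postings[token].add(video_id)

    def search(self, query):
        """Return IDs of videos containing every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = InvertedIndex()
index.add("v1", "Cat plays piano", "A funny cat video")
index.add("v2", "Piano tutorial", "Learn scales")
print(index.search("piano cat"))
```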

API Design

Upload a video:

POST /videos/upload
Body: { title, description (optional) }
Response: { video_id, upload_url }

Get video metadata and playback manifest:

GET /videos/{video_id}
Response: { video_id, title, description, status, manifest_url }

Stream a video (adaptive bitrate manifest):

GET /videos/{video_id}/manifest.m3u8
Response: HLS manifest listing all resolution variants

Search for videos:

GET /search?q={query}&cursor={cursor} (cursor is optional)
Response: { videos: [...], next_cursor }

Pre-signed upload URL: Rather than accepting binary file data through the API server, the Upload Service generates a pre-signed S3 URL and returns it to the client. The client uploads directly to S3, bypassing the application tier entirely. This keeps large binary payloads off the API servers, removes an entire network hop, and lets S3 handle multipart resumable uploads natively. The API server only deals with metadata.
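The idea behind a pre-signed URL is that the server signs the object key and an expiry time with a secret, and the storage layer later verifies that signature without calling back to the API. The sketch below is a simplified stand-in built on an HMAC, not S3's actual signing scheme; the host name, secret, and function names are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key shared with storage


def _signature(key, expires):
    payload = f"PUT\n{key}\n{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_upload_url(key, ttl_seconds=3600, now=None):
    """Return a URL that authorizes a PUT to `key` until it expires."""
    issued = now if now is not None else time.time()
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(key, expires)})
    return f"https://uploads.example.com/{key}?{query}"


def verify_upload_url(url, now=None):
    """Storage-side check: signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    key = parsed.path.lstrip("/")
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    return hmac.compare_digest(sig, _signature(key, expires))


url = sign_upload_url("raw/v1/source.mp4", ttl_seconds=3600, now=1000)
print(verify_upload_url(url, now=2000))   # within the validity window
print(verify_upload_url(url, now=5000))   # after expiry
```

Because the key is part of the signed payload, a client cannot reuse the URL to write to a different object.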

HLS vs raw file URL: The manifest_url points to an HLS (.m3u8) or DASH manifest, not a direct video file URL. The manifest lists all available resolution variants and segment URLs. The video player selects segments adaptively based on available bandwidth. Adaptive bitrate streaming over HLS or DASH is how YouTube, Netflix, and most major streaming platforms deliver video today.
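A sketch of both halves of that paragraph: rendering an HLS master playlist from a list of variants, and a very simple player-side rule for choosing one. The bitrates, URIs, and the 0.8 safety factor are illustrative assumptions; real players use more elaborate heuristics based on buffer level and throughput history.

```python
def build_master_playlist(variants):
    """Render an HLS master playlist; BANDWIDTH is in bits per second."""
    lines = ["#EXTM3U"]
    for v in sorted(variants, key=lambda v: v["bandwidth"]):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v['bandwidth']},"
            f"RESOLUTION={v['width']}x{v['height']}"
        )
        lines.append(v["uri"])
    return "\n".join(lines) + "\n"


def pick_variant(variants, measured_bps, safety=0.8):
    """Highest-bandwidth variant that fits a fraction of measured throughput."""
    budget = measured_bps * safety
    ordered = sorted(variants, key=lambda v: v["bandwidth"])
    fitting = [v for v in ordered if v["bandwidth"] <= budget]
    # Fall back to the lowest rendition when nothing fits the budget.
    return fitting[-1] if fitting else ordered[0]


# Assumed bitrate ladder for illustration only.
VARIANTS = [
    {"bandwidth": 800_000, "width": 640, "height": 360, "uri": "360p/index.m3u8"},
    {"bandwidth": 2_800_000, "width": 1280, "height": 720, "uri": "720p/index.m3u8"},
    {"bandwidth": 5_000_000, "width": 1920, "height": 1080, "uri": "1080p/index.m3u8"},
]

print(build_master_playlist(VARIANTS))
print(pick_variant(VARIANTS, measured_bps=4_000_000)["uri"])
```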

Cursor-based pagination applies to search results. Offset pagination breaks when new videos are indexed between pages: results shift, so items get skipped or repeated. A cursor encoding the last-seen position (the sort key plus video_id as a tie-breaker) keeps pagination stable.
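A minimal sketch of cursor pagination over an in-memory result list. It assumes results are ordered by a relevance score with video_id as a tie-breaker, so the cursor carries both values; the score field is my assumption, since the text only mentions video_id. The cursor is opaque base64 so clients cannot depend on its contents.

```python
import base64
import json


def encode_cursor(score, video_id):
    raw = json.dumps({"s": score, "id": video_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["s"], data["id"]


def search_page(results, limit, cursor=None):
    """results: list of (score, video_id). Order: score desc, video_id asc."""
    ordered = sorted(results, key=lambda r: (-r[0], r[1]))
    if cursor is not None:
        last_score, last_id = decode_cursor(cursor)
        # Keep only rows strictly after the last row the client has seen.
        ordered = [r for r in ordered if (-r[0], r[1]) > (-last_score, last_id)]
    page = ordered[:limit]
    next_cursor = encode_cursor(*page[-1]) if len(ordered) > limit else None
    return page, next_cursor


results = [(0.9, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
page1, cursor1 = search_page(results, limit=2)

# A new, higher-ranked video is indexed between the two page requests.
results.append((0.95, "z"))
page2, cursor2 = search_page(results, limit=2, cursor=cursor1)
print(page1, page2)
```

With offset pagination, the newly indexed video would have pushed "b" onto page two and the client would have seen it twice; here page two continues exactly where page one ended.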

My recommendation for the upload flow is to return a video_id immediately with status=processing and have the client poll for status=ready. Blocking the upload API response on transcoding completion would mean the client waits 5-15 minutes for a 201.
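On the client side, the polling loop might look like the sketch below. `fetch_status` stands in for a call to GET /videos/{video_id} that returns the status field; the interval and timeout values are assumptions.

```python
import time


def wait_until_ready(fetch_status, video_id, interval_s=5.0,
                     timeout_s=1800.0, sleep=time.sleep):
    """Poll until the video reaches a terminal status or the timeout passes."""
    waited = 0.0
    while True:
        status = fetch_status(video_id)
        if status in ("ready", "failed"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"video {video_id} still {status}")
        sleep(interval_s)
        waited += interval_s


# Usage with a fake status source standing in for the HTTP call.
statuses = iter(["processing", "processing", "ready"])
final = wait_until_ready(lambda vid: next(statuses), "v1",
                         sleep=lambda s: None)
print(final)
```

A production client would likely back off between polls or switch to a push notification, but a fixed interval is enough to show the contract.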

High-Level Design

1. Users can upload a video file

The write path: the client requests an upload URL, the server records the video metadata, the client uploads directly to object storage, and the completed upload kicks off transcoding.

Components:

Client: Web or mobile app sending the initial upload request.
Upload Service: Validates the request, generates a pre-signed upload URL, and creates a Video record with status = processing.
Object Storage (S3): Stores the raw uploaded file. Durable, replicated, designed for large binary objects.
Video DB: Stores video metadata and tracks processing status.

Request walkthrough:

Client sends POST /videos/upload with the video title and optional description.
Upload Service creates a Video record in the Video DB with status = processing.
Upload Service generates a pre-signed S3 URL (valid for 1 hour) and returns it with the video_id.
Client uploads the raw video file directly to S3 using the pre-signed URL.
S3 triggers a storage event when the upload completes.

The client uploads the raw file directly to S3, bypassing the Upload Service entirely. The API tier only handles metadata. Transcoding is deferred to the next requirement.
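The Upload Service side of this walkthrough can be sketched in a few lines. The dict stands in for the Video DB, and `presign` is a placeholder for whatever generates the real pre-signed URL; both, along with the key layout, are assumptions for illustration.

```python
import uuid

VIDEO_DB = {}  # in-memory stand-in for the Video DB


def placeholder_presign(key):
    # Hypothetical: a real service would ask object storage for a signed URL.
    return f"https://uploads.example.com/{key}?signature=placeholder"


def handle_upload_request(title, description="", presign=placeholder_presign):
    """Handle POST /videos/upload: metadata only, no file bytes."""
    if not title.strip():
        raise ValueError("title is required")
    video_id = uuid.uuid4().hex
    VIDEO_DB[video_id] = {
        "title": title,
        "description": description,
        "status": "processing",
    }
    raw_key = f"raw/{video_id}/source"   # assumed key layout
    return {"video_id": video_id, "upload_url": presign(raw_key)}


response = handle_upload_request("My first video")
print(response["video_id"], VIDEO_DB[response["video_id"]]["status"])
```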

2. After upload, the video becomes available to watch

Transcoding pipeline: when the raw upload lands in S3, an async worker picks it up, converts it into multiple resolution variants, stores them back in S3, and marks the video ready.

Components:

Transcoding Queue (SQS/Kafka): S3 triggers an event on upload completion. The transcoding queue holds pending jobs, decoupling upload from processing.
Transcoding Workers: Stateless, horizontally scalable workers. Each worker pulls a job, invokes the transcoding binary (FFmpeg) for each resolution, uploads the output variants to S3, and updates the Video DB.
Video DB (updated): The status field transitions from processing to ready when all variants are complete.
CDN: The variant files in S3 are served via CDN edge nodes. The CDN URL for each variant is written to the VideoVariant table.

Request walkthrough (transcoding path):

S3 publishes an UploadComplete event to the transcoding queue when the raw file lands.
A Transcoding Worker picks up the job.
Worker downloads the raw file from S3 and invokes FFmpeg to produce variants: 360p, 480p, 720p, 1080p (and 4K if source quality permits).
Worker uploads each variant file back to S3 under a path like videos/{video_id}/720p.mp4.
Worker writes a VideoVariant row for each output (resolution, CDN URL, file size).
Worker updates the Video record: status = ready.
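The steps above can be sketched as a worker loop. Everything here is a stand-in: `queue.Queue` for SQS/Kafka, a dict for the Video DB, a list for the VideoVariant table, and a placeholder `transcode` function where the FFmpeg call and S3 upload would go.

```python
import queue

RESOLUTIONS = ["360p", "480p", "720p", "1080p"]


def transcode(video_id, resolution):
    """Placeholder for the FFmpeg + upload step; returns the output key."""
    return f"videos/{video_id}/{resolution}.mp4"


def process_job(job, video_db, variant_table,
                cdn_base="https://cdn.example.com"):
    video_id = job["video_id"]
    try:
        for resolution in RESOLUTIONS:
            out_key = transcode(video_id, resolution)
            variant_table.append({
                "video_id": video_id,
                "resolution": resolution,
                "cdn_url": f"{cdn_base}/{out_key}",
            })
        # Only mark ready once every variant has been written.
        video_db[video_id]["status"] = "ready"
    except Exception:
        video_db[video_id]["status"] = "failed"


def run_worker(jobs, video_db, variant_table):
    """Drain the queue; a real worker would block and run indefinitely."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        process_job(job, video_db, variant_table)
        jobs.task_done()


jobs = queue.Queue()
jobs.put({"video_id": "v1"})
video_db = {"v1": {"status": "processing"}}
variants = []
run_worker(jobs, video_db, variants)
print(video_db["v1"]["status"], len(variants))
```

This sketch leaves partial VideoVariant rows behind when a job fails midway; a fuller version would clean them up or make retries idempotent.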

Workers are stateless. Scaling transcoding capacity is a matter of adding more worker instances. Each handles one video at a time; parallelism comes from running many workers, not from making individual workers faster.

A simple worker runs one FFmpeg invocation per output resolution (FFmpeg can also write several outputs from a single invocation, which avoids decoding the source repeatedly). A worker processing a 1080p source into 6 output resolutions therefore runs 6 FFmpeg processes sequentially or spawns 6 parallel sub-processes. The per-variant files are small enough (a 10-minute 720p variant is roughly 150MB) that S3 upload adds only a few extra seconds per variant.
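For the one-command-per-resolution model, a worker might build its FFmpeg argument lists like this. The flags used are standard FFmpeg options; the bitrate ladder and the codec choices (libx264 video, AAC audio) are assumptions for illustration, not a recommended encoding profile.

```python
# Assumed ladder: resolution -> (output height, target video bitrate).
LADDER = {
    "360p": (360, "800k"),
    "480p": (480, "1400k"),
    "720p": (720, "2800k"),
    "1080p": (1080, "5000k"),
}


def ffmpeg_command(src, dst, resolution):
    """Build the argv for transcoding `src` to one output resolution."""
    height, bitrate = LADDER[resolution]
    return [
        "ffmpeg", "-y",                  # overwrite the output if it exists
        "-i", src,
        "-vf", f"scale=-2:{height}",     # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", bitrate,
        "-c:a", "aac",
        dst,
    ]


commands = [
    ffmpeg_command("source.mp4", f"{name}.mp4", name) for name in LADDER
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` sequentially, or the commands can be started in parallel when the worker has spare cores.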

3. Users can stream a video at the right resolution

The read path: client requests the video manifest, the player fetches segments from the CDN. The goal is to start playback within 2 seconds regardless of viewer location.

Components:

Video Service: Serves GET /videos/{video_id} and GET /videos/{video_id}/manifest.m3u8. Reads from the Video DB and assembles the HLS manifest dynamically, or serves a pre-generated one from a CDN-backed cache.
CDN (Content Delivery Network): Stores variant files and manifests at global edge nodes. The first request for a segment warms the edge cache; subsequent requests to that edge are served from cache and skip the origin until the object expires or is evicted.
Video DB (unchanged): Provides variant URLs for manifest construction.
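The edge-cache behaviour described for the CDN can be illustrated with a toy pull-through cache: the first request for a path goes to the origin, and later requests are served locally until a TTL passes. The class name, TTL value, and injected clock are assumptions; real CDNs also evict under memory pressure and honour cache-control headers.

```python
import time


class EdgeCache:
    def __init__(self, fetch_from_origin, ttl_s=3600.0, clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl = ttl_s
        self._clock = clock
        self._store = {}        # path -> (expires_at, body)
        self.origin_hits = 0

    def get(self, path):
        now = self._clock()
        entry = self._store.get(path)
        if entry and entry[0] > now:
            return entry[1]                 # cache hit: origin untouched
        self.origin_hits += 1               # miss or expired: go to origin
        body = self._fetch(path)
        self._store[path] = (now + self._ttl, body)
        return body


fake_time = [0.0]
edge = EdgeCache(lambda path: f"bytes:{path}", ttl_s=10,
                 clock=lambda: fake_time[0])
edge.get("videos/v1/720p/seg0.ts")
edge.get("videos/v1/720p/seg0.ts")
print(edge.origin_hits)
```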

Request walkthrough:
