How Instagram Reels recommendation works
Understand the system design behind short-video recommendation feeds like Instagram Reels and TikTok's For You Page: the candidate retrieval, ranking pipeline, real-time signals, and feedback loops that decide which video plays next.
The Problem Statement
Interviewer: "You are scrolling through Instagram Reels. You just finished watching a funny cat video. How does Instagram decide what to show you next? Walk me through the full recommendation pipeline, from billions of videos in the system to the one video that plays next on your screen."
This question tests three things: whether you understand multi-stage recommendation pipelines (candidate retrieval then ranking), whether you can reason about machine learning systems at scale without hand-waving, and whether you understand the feedback loops and biases that emerge in recommendation systems.
I find this question separates candidates who understand systems from those who only understand algorithms. The ML model is maybe 20% of the answer. The other 80% is the infrastructure: how you retrieve candidates in under 50ms, how you incorporate real-time signals, how you prevent the system from collapsing into a popularity feedback loop, and how you handle cold start for new creators.
Clarifying the Scenario
You: "Great question. Before I dive in, let me make sure I scope this correctly."
You: "When you say 'decide what to show next,' are we talking about the full recommendation pipeline from candidate generation through ranking, or just the ranking model itself?"
Interviewer: "The full pipeline. I want to understand how you go from billions of videos to the one that plays."
You: "Got it. Should I focus on the Reels tab specifically (pure algorithmic feed, no following-based curation), or the main Instagram feed which mixes Reels with photos and stories?"
Interviewer: "The Reels tab. Pure algorithmic recommendation."
You: "One more: should I discuss the client-side prefetching and video delivery, or keep the focus on the recommendation logic?"
Interviewer: "Focus on the recommendation logic. Mention prefetching briefly if it is relevant."
You: "Perfect. I will structure my answer in three parts: how we retrieve a few hundred candidates from billions of videos, how the ranking model scores and orders those candidates, and how feedback loops and diversity constraints prevent the system from getting stuck."
My Approach
I break the answer into five parts, which expand on the three-part outline above:
- The two-stage pipeline: Why you cannot score all videos with a single model, and how candidate retrieval narrows the field before ranking.
- Candidate retrieval: The four main sources (embedding similarity, social graph, trending, re-engagement) and how Approximate Nearest Neighbor search makes this fast.
- The ranking model: Multi-task learning that predicts completion rate, likes, shares, and comments simultaneously, then combines them into a single score.
- Real-time signals: How the system incorporates what you just watched 10 seconds ago into the next recommendation.
- Feedback loops and diversity: The filter bubble problem, popularity bias, exploration vs exploitation, and content safety filtering.
The core insight is that recommendation is not one model. It is a pipeline of progressively more expensive computations, each narrowing the field further. The cheapest filter runs over the entire corpus; the most expensive runs over a few hundred candidates.
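To make the funnel concrete, here is a minimal sketch of progressively more expensive stages. Everything in it is an assumption for illustration: the `run_funnel` helper, the `cheap` and `rich` score fields, and the toy corpus size are hypothetical stand-ins, not how Instagram implements the pipeline. The point is only to count how many times each stage's scorer runs.

```python
from typing import Callable, List, Sequence, Tuple

Video = dict  # hypothetical record, e.g. {"id": 1, "cheap": 0.3, "rich": 0.9}
Stage = Tuple[str, Callable[[Video], float], int]  # (name, scorer, how many to keep)


def run_funnel(corpus: Sequence[Video], stages: Sequence[Stage]):
    """Apply each stage in order, keeping the top `keep` items by that stage's score.

    Returns the survivors and the number of scorer calls each stage made.
    """
    survivors: List[Video] = list(corpus)
    calls = {}
    for name, scorer, keep in stages:
        calls[name] = len(survivors)  # one scorer call per surviving item
        survivors = sorted(survivors, key=scorer, reverse=True)[:keep]
    return survivors, calls


# Toy corpus: 10,000 videos with made-up scores.
corpus = [
    {"id": i, "cheap": (i * 37 % 1000) / 1000, "rich": (i * 91 % 997) / 997}
    for i in range(10_000)
]

stages = [
    ("retrieval", lambda v: v["cheap"], 500),  # cheap score over the whole corpus
    ("ranking", lambda v: v["rich"], 20),      # stand-in for the expensive model
]

feed, calls = run_funnel(corpus, stages)
print(calls)      # {'retrieval': 10000, 'ranking': 500}
print(len(feed))  # 20
```

The expensive scorer runs 500 times instead of 10,000. At real corpus sizes that gap is what makes the latency budget achievable.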
The Architecture
The key architectural pattern of the full system is the funnel: billions of videos are progressively filtered to a handful of results using increasingly expensive methods.
Walk through the flow:
- The user opens the Reels tab. The client sends a request with the user ID, device context (screen size, network quality), and a list of recently seen video IDs.
- Candidate retrieval runs in parallel across multiple sources. Embedding ANN finds ~200 videos similar to the user's interest vector. Social graph pulls ~50 videos from followed creators and friends' recent likes. Trending surfaces ~50 videos with high engagement velocity in the user's region.
- The candidate merger deduplicates and caps at ~500 candidates. This is the input to the expensive ranking stage.
- The ranking model scores each candidate using features from the online feature store: user engagement history, video metadata, creator reputation, and real-time signals (for example, how many views the video got in the last hour). The model predicts multiple engagement probabilities simultaneously and combines them into a single score.
- Post-ranking filters enforce diversity (no genre repetition), content safety (policy violations, age-gating), and deduplication against already-seen videos.
- The final 20-30 videos are sent to the client. The player starts downloading the first video immediately and prefetches the next 2-3 in the background.
- As the user watches, skips, likes, or shares, those events stream back through Kafka to update the online feature store, informing the next recommendation request.
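The merge, score, and diversity steps in the flow above can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the weights in `WEIGHTS`, the weighted-sum combination, and the reading of "no genre repetition" as "no two consecutive videos of the same genre" are all hypothetical choices, since the article does not specify them.

```python
from typing import Dict, List, Set

# Hypothetical weights; the article does not say how predictions are combined.
WEIGHTS = {"p_complete": 1.0, "p_like": 0.5, "p_share": 2.0, "p_comment": 1.5}


def merge_candidates(sources: Dict[str, List[int]], seen: Set[int], cap: int = 500) -> List[int]:
    """Concatenate sources in order, dropping duplicates and already-seen IDs."""
    merged: List[int] = []
    taken = set(seen)
    for ids in sources.values():
        for vid in ids:
            if vid not in taken:
                taken.add(vid)
                merged.append(vid)
    return merged[:cap]


def combined_score(preds: Dict[str, float]) -> float:
    """Weighted sum of the per-task engagement probabilities."""
    return sum(w * preds.get(task, 0.0) for task, w in WEIGHTS.items())


def diversify(ranked: List[dict], max_run: int = 1) -> List[dict]:
    """Greedy re-order: avoid more than `max_run` consecutive videos of one genre."""
    remaining, out = list(ranked), []
    while remaining:
        pick = 0
        if len(out) >= max_run and all(v["genre"] == out[-1]["genre"] for v in out[-max_run:]):
            blocked = out[-1]["genre"]
            # Take the best-ranked video of another genre; fall back to the top one.
            pick = next((i for i, v in enumerate(remaining) if v["genre"] != blocked), 0)
        out.append(remaining.pop(pick))
    return out


sources = {"embedding_ann": [101, 102, 103, 104], "social_graph": [103, 201], "trending": [301, 101]}
candidates = merge_candidates(sources, seen={102})

# Made-up model output: genre plus predicted engagement probabilities per video.
model_output = {
    101: ("cats", {"p_complete": 0.90, "p_like": 0.30, "p_share": 0.05, "p_comment": 0.02}),
    103: ("cats", {"p_complete": 0.80, "p_like": 0.20, "p_share": 0.04, "p_comment": 0.01}),
    104: ("cooking", {"p_complete": 0.60, "p_like": 0.10, "p_share": 0.02, "p_comment": 0.01}),
    201: ("cats", {"p_complete": 0.70, "p_like": 0.25, "p_share": 0.03, "p_comment": 0.02}),
    301: ("travel", {"p_complete": 0.50, "p_like": 0.05, "p_share": 0.01, "p_comment": 0.00}),
}
ranked = sorted(
    ({"id": vid, "genre": model_output[vid][0], "score": combined_score(model_output[vid][1])}
     for vid in candidates),
    key=lambda v: v["score"],
    reverse=True,
)
feed = diversify(ranked)
print(candidates)               # [101, 103, 104, 201, 301]
print([v["id"] for v in feed])  # [101, 104, 103, 301, 201]
```

Pure score order would play three cat videos in a row. The greedy pass interleaves the other genres while otherwise respecting the ranking. Note that simple concatenation favors earlier sources when the cap bites; a production merger would likely apply per-source quotas.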
The total latency budget is around 200ms. Retrieval gets ~50ms, ranking gets ~100ms, and post-ranking filters get ~30ms. The remaining ~20ms is network overhead. Exceeding this budget means the video player stalls between videos, which directly hurts user retention.
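A budget like this is easy to encode as a check. The sketch below uses the per-stage numbers from the paragraph above; the `over_budget` helper and the measured latencies are made up for illustration.

```python
# Per-stage budget from the text, in milliseconds.
BUDGET_MS = {"retrieval": 50, "ranking": 100, "post_ranking": 30, "network": 20}
TOTAL_BUDGET_MS = 200


def over_budget(measured_ms: dict) -> dict:
    """Return each stage that exceeded its share, with the overshoot in ms."""
    return {
        stage: ms - BUDGET_MS[stage]
        for stage, ms in measured_ms.items()
        if ms > BUDGET_MS[stage]
    }


# The stage budgets account for the whole 200ms.
assert sum(BUDGET_MS.values()) == TOTAL_BUDGET_MS

# Hypothetical measurements for one request.
measured = {"retrieval": 42, "ranking": 118, "post_ranking": 25, "network": 20}
print(over_budget(measured))   # {'ranking': 18}
print(sum(measured.values()))  # 205
```

Here retrieval finished early but ranking overshot by 18ms, so the request as a whole misses the 200ms target.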
Candidate Retrieval at Scale
The fundamental challenge of retrieval is narrowing 10M+ eligible videos to ~500 candidates in under 50ms. You cannot run the full ranking model over 10M videos per request. Instead, you use cheap approximate methods that sacrifice some precision for massive speed.
The embedding-based retrieval deserves a closer look. Each user is represented as a dense vector (typically 256 dimensions) that encodes their interest profile. Each video also has an embedding of the same dimensionality. The user embedding is computed from their watch history, likes, and explicit preferences. The video embedding is computed from the video's visual content (frames analyzed by a CNN), audio content, caption text, creator features, and engagement patterns.
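One simple way to picture a user embedding is as a recency-weighted average of the embeddings of videos the user engaged with. This is an assumption for illustration only: production systems typically learn the user vector with a trained model, and the `user_embedding` function, the half-life decay, and the 4-dimensional toy vectors below are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One history item: (video_embedding, engagement_weight, age_in_seconds)
HistoryItem = Tuple[List[float], float, float]


def user_embedding(history: Sequence[HistoryItem], half_life_s: float = 3600.0) -> List[float]:
    """Recency-decayed, engagement-weighted mean of video embeddings, L2-normalized."""
    if not history:
        raise ValueError("history must contain at least one item")
    dim = len(history[0][0])
    acc = [0.0] * dim
    for emb, weight, age_s in history:
        decay = 0.5 ** (age_s / half_life_s)  # weight halves every half_life_s seconds
        for i, x in enumerate(emb):
            acc[i] += weight * decay * x
    norm = math.sqrt(sum(x * x for x in acc))
    return [x / norm for x in acc] if norm > 0 else acc


# Toy 4-dimensional embeddings (the article mentions 256 dimensions as typical).
history = [
    ([1.0, 0.0, 0.0, 0.0], 1.0, 10.0),    # watched 10 seconds ago
    ([0.0, 1.0, 0.0, 0.0], 1.0, 7200.0),  # watched two hours ago
]
u = user_embedding(history)
print(u)
```

The video watched 10 seconds ago dominates the result, which is also the intuition behind real-time signals: recent behavior shifts the vector that retrieval queries with.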
At query time, the system finds the K nearest video embeddings to the user's current embedding using Approximate Nearest Neighbor (ANN) search. The index structure (typically HNSW, Hierarchical Navigable Small World) supports sub-10ms queries over tens of millions of vectors by trading exact accuracy for speed. A well-tuned index typically recovers about 95% of the true nearest neighbors, which is good enough because the ranking model re-scores every candidate anyway.
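To pin down what that recall figure means, the sketch below shows the exact brute-force search that ANN approximates, plus the recall@K measure used to compare the two. It is not an HNSW implementation. The function names and the toy 2-dimensional index are assumptions for illustration, and the `approx_ids` list stands in for what an ANN index might return.

```python
import heapq
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def exact_top_k(query: List[float], index: Dict[int, List[float]], k: int) -> List[int]:
    """Brute force: score every vector. O(N) per query, too slow at production scale."""
    return heapq.nlargest(k, index, key=lambda vid: cosine(query, index[vid]))


def recall_at_k(approx_ids: List[int], exact_ids: List[int]) -> float:
    """Fraction of the true top-K that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy index: video ID -> embedding.
index = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
query = [1.0, 0.0]

exact = exact_top_k(query, index, k=2)
approx_ids = [1, 3]  # pretend result from an ANN index that missed video 2
print(exact)                           # [1, 2]
print(recall_at_k(approx_ids, exact))  # 0.5
```

A recall of 0.95 at K=200 would mean roughly 190 of the 200 true nearest neighbors come back. The handful of misses are usually borderline candidates that the ranking stage might not have surfaced anyway.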