System design for Reddit: pre-computed hot feeds, Redis-buffered vote counting at 100M DAU scale, and N+1-free comment trees via recursive CTE.
What is Reddit?
Reddit is a community-driven platform where users submit posts (links or text) to topic communities called subreddits, and members vote to surface the best content. The engineering challenge is not storing posts. It is serving a dynamically ranked feed to millions of simultaneous users, counting votes under extreme concurrency without write contention, and rendering deeply nested comment trees without hammering the database on every page load.
This is a strong interview question because it tests three independent hard problems: feed pre-computation and ranking, high-concurrency write buffering, and tree data structures at scale. Each problem has a wrong-but-plausible solution that falls apart under load.
I love this question in interviews because candidates who jump straight to "use Redis for everything" miss the actual hard parts. The interesting engineering is in the boundaries: how the Vote Flush Worker bridges Redis and PostgreSQL, how the recursive CTE avoids N+1 queries, and what breaks first when a post goes viral.
Functional Requirements
Core Requirements
- Users can submit posts (link or text) to a subreddit.
- Users can upvote or downvote posts and comments.
- Users can view a subreddit feed sorted by hot, new, or top.
- Users can post comments on a post and reply to other comments.
Below the Line
- User authentication and account management.
- Push and email notifications.
- Full-text search across posts and comments.
- Video hosting and media transcoding.
- User profile pages and karma tracking.
The hardest part in scope: Serving a dynamically ranked "hot" feed to 100M daily users across 100K subreddits without recomputing scores on every request. This is where the system gets genuinely interesting, and it is the problem most candidates skip over by jumping straight to "just use Redis."
Notifications are below the line because they require a separate event bus, push delivery infrastructure (APNs, FCM), and a preference store that sits orthogonal to the read and write path we are designing. To add them, we would publish a Kafka event on each post or vote and have a notification service consume and fan out based on subscription rules.
Full-text search is below the line because it requires a separate search index (Elasticsearch), a background ingestion pipeline, and query infrastructure that operates beside, not inside, the core path.
Video hosting is below the line because it demands a transcoding pipeline, adaptive bitrate streaming, and CDN delivery contracts that constitute a separate product surface.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime. Availability over consistency for feed and vote count reads.
- Feed latency: Subreddit feed and front page load under 200ms p99.
- Vote latency: Vote write acknowledged under 50ms p99. Vote count reflects updates within 5 seconds.
- Scale: 100M DAU, 500M total posts, 5B total comments. Roughly 5 new posts and 60 votes per second at steady state.
- Comment latency: Comment tree for a post with 1,000 comments loads under 300ms p99.
Feed latency under 200ms rules out sorting millions of posts per request at the database layer. A full ORDER BY hot_score across 500M rows would take multiple seconds, not milliseconds. Pre-computation is non-negotiable.
I would state this constraint in the first two minutes of the interview. The moment you say "pre-computation is non-negotiable," the interviewer knows you understand why a Redis sorted set exists in the design rather than a fancy SQL query.
Vote counts are eventually consistent within 5 seconds. The vote count a user sees may lag by a handful of votes during burst traffic. This constraint is what unlocks write buffering and avoids row-level lock contention on hot posts.
Below the Line
- Sub-second real-time vote count updates pushed to the browser.
- Globally distributed multi-region writes.
Sub-second push updates require WebSocket infrastructure (or SSE), a pub/sub broadcast layer, and per-client connection state. All of this sits outside the read and write path here. The 5-second eventual consistency NFR is sufficient and keeps the design tractable.
Multi-region writes require a distributed transaction protocol or CRDT design for vote counting and feed updates. Both substantially complicate the Vote Flush Worker mechanism. Treat multi-region as a follow-on after the single-region design is solid.
Read/write ratio: For every post submitted, expect roughly 2,000 feed impressions and 5,000 individual post views. Votes come in at roughly 12x the post creation rate (60 versus 5 per second). The overall read-to-write ratio is around 200:1. I would call this out in the first five minutes of the interview, because every caching decision in the design traces back to this ratio.
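As a quick sanity check, the per-post multipliers above can be turned into steady-state request rates. A rough sketch; the constants are the assumptions stated in this section, nothing more:

```python
# Back-of-envelope traffic from the stated assumptions.
POSTS_PER_SEC = 5             # steady-state submissions (from the NFRs)
IMPRESSIONS_PER_POST = 2_000  # feed impressions per post
VIEWS_PER_POST = 5_000        # individual post-page views
VOTES_PER_SEC = 60            # steady-state votes (from the NFRs)

feed_qps = POSTS_PER_SEC * IMPRESSIONS_PER_POST  # reads hitting the feed path
view_qps = POSTS_PER_SEC * VIEWS_PER_POST        # reads hitting the post path
write_qps = POSTS_PER_SEC + VOTES_PER_SEC        # posts + votes

print(feed_qps, view_qps, write_qps)  # 10000 25000 65
```

Steady-state feed traffic lands around 10,000 reads per second, which is why the later deep dives budget on the order of 11,500 feed requests per second at peak.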
Core Entities
- Post: A submission to a subreddit with a title, content (link URL or text body), vote score, comment count, and a creation timestamp.
- Comment: A reply to a post or another comment. Stores a parent_comment_id reference for the tree structure.
- Vote: A single upvote or downvote cast by one user on one post or comment. Enforces one vote per (user, target) pair.
- Subreddit: A named community with a subscriber count and display metadata.
- User: An account with a username and a karma score derived from post and comment votes received.
Schema details and index choices come up in the deep dives. The above is the minimal inventory needed to reason about the API and the data flow.
API Design
FR 1 - Submit a post:
# FR 1: Submit a new post to a subreddit
POST /r/{subreddit}/posts
Body: { title, content_type: "link" | "text", content }
Response: { post_id, permalink }
POST creates a new resource with a server-assigned ID. Return the full permalink (for example, /r/programming/comments/abc123/title-slug) rather than just the post_id so the client does not need to reconstruct it.
FR 2 - Vote on a post or comment:
# FR 2: Cast or change a vote
POST /posts/{post_id}/vote
Body: { direction: "up" | "down" | "none" }
Response: { new_score }
POST /comments/{comment_id}/vote
Body: { direction: "up" | "down" | "none" }
Response: { new_score }
direction: "none" removes an existing vote without replacing it. The endpoint is idempotent: calling it twice with the same direction produces the same result. Returning new_score lets the client update the displayed count immediately, without a separate GET.
FR 3 - View a subreddit feed:
# FR 3: Paginate the ranked feed for a subreddit
GET /r/{subreddit}/posts?sort=hot|new|top&after={cursor}
Response: { posts: [...], next_cursor: "..." }
Cursor-based over offset-based pagination: a cursor encodes the last post's score and ID so the server resumes exactly. Offset pagination on a dynamically re-ranked feed causes skipped or duplicated posts as scores shift between page loads.
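A minimal sketch of what such a cursor could look like. The field names and the base64-of-JSON encoding are illustrative assumptions, not a production format:

```python
import base64
import json

# Hypothetical cursor helpers: the cursor encodes the last item's
# (hot_score, post_id) so the next page resumes exactly where the
# previous one ended, even if scores shift between requests.
def encode_cursor(score: float, post_id: str) -> str:
    raw = json.dumps({"s": score, "id": post_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple:
    data = json.loads(base64.urlsafe_b64decode(cursor))
    return data["s"], data["id"]

cur = encode_cursor(1234.5, "abc123")
print(decode_cursor(cur))  # (1234.5, 'abc123')
```

Because the cursor pins a (score, ID) position rather than a row offset, a post that moves up or down the ranking between page loads cannot cause skips or duplicates.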
FR 4 - Post and view comments:
# FR 4a: Post a comment
POST /posts/{post_id}/comments
Body: { body, parent_comment_id? }
Response: { comment_id, created_at }
# FR 4b: Fetch the comment tree
GET /posts/{post_id}/comments?sort=top|new
Response: { comments: [...] }
parent_comment_id is null for top-level comments and the parent's ID for replies. The response returns a flat list ordered by tree position. The client reconstructs the nested display from parent_comment_id fields.
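Client-side reconstruction is a single pass over the flat list. A minimal sketch, assuming each comment arrives as a dict carrying comment_id and parent_comment_id as described above:

```python
# Rebuild the nested display tree from the flat comment list.
def build_tree(flat_comments):
    # Index every comment by ID, giving each an empty replies list.
    by_id = {c["comment_id"]: {**c, "replies": []} for c in flat_comments}
    roots = []
    for node in by_id.values():
        parent_id = node["parent_comment_id"]
        if parent_id is None:
            roots.append(node)                      # top-level comment
        else:
            by_id[parent_id]["replies"].append(node)  # nested reply
    return roots

flat = [
    {"comment_id": "c1", "parent_comment_id": None, "body": "top-level"},
    {"comment_id": "c2", "parent_comment_id": "c1", "body": "reply"},
]
tree = build_tree(flat)
print(tree[0]["replies"][0]["body"])  # reply
```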
High-Level Design
1. Users can submit a post
The write path: validate the submission, store the post row, and add the post to the subreddit's feed index.
For now I will treat post ID generation as a black box. The counter-based approach from the Pastebin design applies here: an atomic Redis counter encodes to a short base62 string, guaranteeing uniqueness without retry logic.
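For illustration, the base62 encoding of an atomic counter value might look like the following; the alphabet ordering is an arbitrary choice, and the counter itself would come from a Redis INCR:

```python
import string

# digits + lowercase + uppercase = 62 symbols
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62(n: int) -> str:
    """Encode a non-negative counter value as a short base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(base62(125))  # '21'
```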
Components:
- Client: Web browser or mobile app sending HTTP requests.
- App Server: Validates content size, checks the subreddit exists, generates a post_id, inserts the post row, and seeds the feed index.
- PostgreSQL: Stores posts with title, content, author_id, subreddit_id, score (starting at 0), and created_at.
Request walkthrough:
- Client sends POST /r/{subreddit}/posts with the title and content.
- App Server validates content size and confirms the subreddit exists.
- App Server generates a post_id and inserts the post row into PostgreSQL.
- App Server returns the permalink to the client.
This diagram is intentionally minimal. Feed ranking, caching, and vote counting all come in the next requirements.
2. Users can view the subreddit feed
The feed read path carries the majority of all traffic. Sorting millions of posts on the database per request is not viable.
The naive approach: every GET /r/{subreddit}/posts?sort=hot runs ORDER BY hot_score DESC against the PostgreSQL posts table. At a few hundred posts per subreddit this is fast. At 100K subreddits each holding millions of posts, a full-table sort on every request exhausts CPU and I/O long before you reach 100M DAU.
The fix is to pre-compute a per-subreddit feed and store it as a Redis sorted set keyed feed:{subreddit}:{sort}. The score is the hot rank. Feed reads become O(log N + K) range queries against the sorted set instead of a full sort on disk.
I find it helpful to draw this transition on a whiteboard explicitly: cross out the naive ORDER BY arrow and draw it pointing to Redis instead. Interviewers want to see the reasoning, not just the answer.
The hot score formula is worth understanding: hot_score = sign(score) * log10(max(|score|, 1)) + created_at_epoch / 45000. Every order of magnitude increase in votes adds one rank point. Every 45,000 seconds (12.5 hours) of age adds one rank point, so a new post with a handful of votes competes fairly with old content that has many.
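The formula translates directly into a few lines of code. A small sketch of it:

```python
from math import log10

# Hot score: one rank point per 10x votes, one per 45,000s of age.
def hot_score(score: int, created_at_epoch: float) -> float:
    sign = (score > 0) - (score < 0)
    return sign * log10(max(abs(score), 1)) + created_at_epoch / 45_000

# A post with 1,000 net upvotes and a post 12.5 hours newer with only
# 100 net upvotes rank essentially identically:
print(round(hot_score(1000, 0), 6))      # 3.0
print(round(hot_score(100, 45_000), 6))  # 3.0
```

The sign term keeps heavily downvoted posts sinking instead of rising, and the max(..., 1) guard avoids log10(0) for zero-score posts.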
Components added:
- Redis Feed Cache: A sorted set per subreddit per sort type, keyed feed:{subreddit}:hot. Score is the computed hot_score. Range queries return ranked post IDs in under 1ms.
- App Server (updated): On post creation, computes the initial hot score and does a ZADD to the subreddit's sorted set with the new post_id.
Request walkthrough (feed read):
- Client sends GET /r/programming/posts?sort=hot&after={cursor}.
- App Server queries ZRANGEBYSCORE feed:programming:hot with the cursor score as the upper bound.
- Redis returns the top post IDs in ranked order.
- App Server fetches the full post objects from PostgreSQL by ID.
- App Server returns the paginated post list to the client.
The Redis sorted set absorbs all feed ranking work. PostgreSQL is only queried for the post objects after ranking order is already established by Redis.
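The two-phase read (rank in Redis, hydrate from PostgreSQL) can be sketched with in-memory stand-ins. Nothing below is a real Redis client; the dicts just mirror the access pattern:

```python
# In-memory stand-ins: a sorted set (post_id -> hot_score) and a
# posts table (post_id -> row). Names are illustrative.
feed = {"p1": 103.2, "p2": 101.7, "p3": 99.4}
posts_table = {
    "p1": {"post_id": "p1", "title": "First"},
    "p2": {"post_id": "p2", "title": "Second"},
    "p3": {"post_id": "p3", "title": "Third"},
}

def read_feed(max_score: float, limit: int):
    # Phase 1: descending range query below the cursor score
    # (the ZRANGEBYSCORE step against the sorted set).
    ranked = sorted(
        (pid for pid, s in feed.items() if s < max_score),
        key=lambda pid: -feed[pid],
    )[:limit]
    # Phase 2: hydrate full post objects by ID (the PostgreSQL step).
    return [posts_table[pid] for pid in ranked]

page = read_feed(float("inf"), 2)
print([p["post_id"] for p in page])  # ['p1', 'p2']
```

The last post's score on each page becomes the max_score cursor for the next page, which is exactly what the pagination cursor from the API section encodes.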
3. Users can vote on posts and comments
Votes are the highest-frequency write. Sixty votes per second at steady state is trivial until a single viral post attracts 1,000 votes per second.
The naive approach: a direct UPDATE posts SET score = score + 1 WHERE post_id = ?. At 60 writes/second this works fine. When a post goes viral, thousands of concurrent callers attempt to UPDATE the same row, and PostgreSQL row-level locking serializes those writes into a queue. Latency climbs from milliseconds to seconds.
The fix is write buffering. Each vote increments a Redis counter (INCRBY post_vote_delta:{post_id}) atomically instead of touching PostgreSQL directly. A Vote Flush Worker reads buffered deltas every 2 seconds and issues a single batched UPDATE per post to PostgreSQL. It also recomputes the hot score and updates the feed sorted set. Vote counts stay eventually consistent within 5 seconds, satisfying the NFR.
This is the architectural hinge of the entire design. I would spend extra time on the whiteboard drawing the Vote Flush Worker loop, because every follow-up question ("what if Redis crashes?", "what about vote deduplication?") connects back to this mechanism.
Components added:
- Redis Vote Buffer: Integer counters keyed post_vote_delta:{post_id}. Each vote is an atomic INCRBY. No row-level lock, no contention.
- Vote Flush Worker: Polls every 2 seconds, reads each delta with GETDEL (atomic read-and-clear), issues batched UPDATEs to PostgreSQL, and recomputes hot scores with ZADD.
Request walkthrough (vote):
- Client sends POST /posts/{post_id}/vote with { direction: "up" }.
- App Server checks and records the user's vote in a user_votes table to enforce the one-vote-per-user rule.
- App Server does INCRBY post_vote_delta:{post_id} 1 in Redis.
- App Server returns the approximate new score to the client immediately.
- Vote Flush Worker reads the delta every 2 seconds and applies a batched UPDATE to PostgreSQL, then does a ZADD to refresh the feed rank.
The Vote Flush Worker decouples the vote write rate from the database update rate. Ten thousand concurrent votes on one post become a single batched UPDATE every 2 seconds.
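The buffering pattern reduces to a few lines when sketched with in-memory stand-ins (a dict for the Redis counters, a dict for the posts table). The real worker would use GETDEL and a batched SQL UPDATE, but the shape is the same:

```python
from collections import defaultdict

vote_delta = defaultdict(int)   # stands in for the Redis counters
db_score = {"p1": 10, "p2": 3}  # stands in for the posts table

def cast_vote(post_id: str, direction: str) -> None:
    # The INCRBY step: one cheap counter bump per vote, no row lock.
    vote_delta[post_id] += 1 if direction == "up" else -1

def flush() -> None:
    # The GETDEL step: snapshot the deltas, then clear them.
    drained = dict(vote_delta)
    vote_delta.clear()
    # One batched UPDATE per post, regardless of how many votes arrived.
    for post_id, delta in drained.items():
        db_score[post_id] += delta

for _ in range(1000):
    cast_vote("p1", "up")  # 1,000 votes in one flush window...
flush()                    # ...become a single update
print(db_score["p1"])      # 1010
```

In the real system the snapshot-and-clear must be atomic per key (GETDEL gives you that); the Python version above is only illustrating the data flow.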
4. Users can post and view comments
Comments form a tree. Fetching the tree naively generates one database query per comment.
The naive approach: recursively fetch the children of each comment node in the application layer. For a post with 1,000 comments, this produces up to 1,000 separate SELECT queries (the N+1 problem). Each round-trip adds latency; at 50ms per database round-trip, 1,000 queries cost 50 seconds.
The fix is a single PostgreSQL recursive CTE that fetches the entire comment tree in one round-trip. For popular posts, cache the result in Redis for 30 seconds to absorb repeated reads without paying the CTE cost every time.
I have seen candidates overcomplicate this with materialized paths or nested sets. The recursive CTE is the right starting point because it works with a plain adjacency list, which is the simplest schema. Only reach for materialized paths if the CTE becomes a bottleneck in deep dives.
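Here is a runnable illustration of the recursive CTE, using SQLite's in-memory engine since it also supports WITH RECURSIVE; the production query would target PostgreSQL, and the schema follows the adjacency list used in this design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    post_id TEXT,
    parent_comment_id TEXT,
    body TEXT
);
INSERT INTO comments VALUES
    ('c1', 'p1', NULL, 'top-level'),
    ('c2', 'p1', 'c1', 'reply'),
    ('c3', 'p1', 'c2', 'nested reply');
""")

rows = conn.execute("""
WITH RECURSIVE tree AS (
    -- Anchor: top-level comments of the post
    SELECT comment_id, parent_comment_id, body, 1 AS depth
    FROM comments
    WHERE post_id = ? AND parent_comment_id IS NULL
    UNION ALL
    -- Recursive step: each comment's direct replies
    SELECT c.comment_id, c.parent_comment_id, c.body, t.depth + 1
    FROM comments c
    JOIN tree t ON c.parent_comment_id = t.comment_id
    WHERE t.depth < 8  -- depth cap from the design
)
SELECT comment_id, depth FROM tree ORDER BY depth
""", ("p1",)).fetchall()

print(rows)  # [('c1', 1), ('c2', 2), ('c3', 3)]
```

One round-trip returns every comment with its depth; the flat result is what the API hands to the client for tree reconstruction.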
Components added:
- Redis Comment Cache: Per-post comment tree cached as JSON, keyed comments:{post_id}. TTL is 30 seconds for popular posts and 5 minutes for quiet ones. Invalidated on new comment insert.
- PostgreSQL comments table: Stores the adjacency list: comment_id, post_id, parent_comment_id, author_id, body, score, created_at. Indexed on (post_id, parent_comment_id).
Request walkthrough (fetch comments):
- Client sends GET /posts/{post_id}/comments?sort=top.
- App Server checks Redis for comments:{post_id}.
- Cache hit: return the cached tree immediately.
- Cache miss: App Server runs the recursive CTE against PostgreSQL with a depth cap of 8.
- App Server stores the result in Redis with a TTL and returns the tree to the client.
The recursive CTE collapses the N+1 problem into one database round-trip. Redis caching absorbs repeat loads on viral posts within the TTL window.
Potential Deep Dives
Reddit is a medium-difficulty question, so we cover three deep dives: feed pre-computation and the hot score, vote write buffering and deduplication, and comment tree serving at scale.
1. How do we rank and serve the hot feed at scale?
The feed must serve ranked results under 200ms for 100M DAU across 100K subreddits. The constraints are: the hot score must decay over time (new posts should compete with old ones), the ranking must update within seconds of a vote, and reads must scale to 11,500 requests per second without saturating the database.
2. How do we count votes accurately under high concurrency?
At 60 votes/second average and over 1,000 votes/second on a viral post, the vote mechanism must handle concurrent writes without data loss, enforce one vote per user reliably, and avoid creating hot-row contention on popular posts.
3. How do we serve nested comment trees efficiently?
A popular Reddit post can have thousands of comments arranged in a multi-level tree. The fetch mechanism must return the full tree with correct sort ordering in one request without generating hundreds of queries per page load.
Final Architecture
The architecture separates three distinct data flows: vote writes buffer through Redis to eliminate row contention on PostgreSQL; feed reads draw from Redis sorted sets kept current by the Vote Flush Worker running every 2 seconds; and comment trees are fetched via a single recursive CTE, cached per post, and invalidated on new comment insertion. The read replica absorbs all cache-miss database reads, keeping the primary reserved exclusively for writes.
Interview Cheat Sheet
- Lock down 4 core features and explicitly name what is out of scope (notifications, search, video). Interviewers reward visible scoping over a complete feature list.
- State the 200:1 read/write ratio in the first five minutes. Every caching decision in the design traces back to this number.
- The hot score formula is sign(score) * log10(max(|score|, 1)) + epoch_seconds / 45000. Each 10x vote increase adds one rank point. Each 12.5 hours of age adds one rank point naturally. New posts with a few votes compete fairly with older posts that have many.
- Store the hot feed in a Redis sorted set per subreddit. ZRANGEBYSCORE is O(log N + K), not O(N log N). At 11,500 feed requests/second, this is the difference between a 200ms target and a 2-second timeout.
- For the personalized front page, issue parallel ZRANGEBYSCORE calls across all subscribed subreddits, pull 20 posts from each, and merge-sort in memory. With 30 subscriptions you merge 600 posts in microseconds.
- Buffer votes in Redis with INCRBY per post_id. A viral post receiving 1,000 votes/second generates one Redis operation per vote and one batched PostgreSQL UPDATE every 2 seconds, not 1,000 competing row locks.
- Enforce one-vote-per-user with an ON CONFLICT DO UPDATE upsert in the user_votes table. This eliminates the race condition between SELECT and INSERT that a naive check-then-insert approach allows.
- Fetch comment trees with a single recursive CTE (PostgreSQL WITH RECURSIVE). Cap depth at 8 to bound traversal cost. Cache the result per post in Redis with a short TTL.
- Invalidate the comment cache on new comment insert using DEL. Do not invalidate on votes; use the TTL to absorb vote-driven sort changes.
- The Vote Flush Worker (GETDEL + batched UPDATE + ZADD every 2 seconds) is the architectural hinge. Be prepared to describe its failure modes: delta loss on Redis crash, and catching up after a worker restart using the user_votes table as the source of truth.
- Add a daily cleanup job to prune posts older than 7 days from each subreddit's Redis sorted set via ZREMRANGEBYSCORE. Without it, memory grows unbounded as subreddits accumulate posts.
- Shard the posts and comments tables by post_id (hash-based) once you exceed roughly 100M rows per table. Shard user_votes by user_id to keep the UPSERT operation local.
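The vote upsert from the cheat sheet, demonstrated on SQLite, which shares PostgreSQL's ON CONFLICT ... DO UPDATE syntax. Table and column names follow the design; the direction encoding (+1/-1) is an illustrative choice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE user_votes (
    user_id TEXT,
    post_id TEXT,
    direction INTEGER,  -- +1 up, -1 down
    PRIMARY KEY (user_id, post_id)
)
""")

# One statement handles both "new vote" and "changed vote" atomically;
# there is no window for a check-then-insert race.
upsert = """
INSERT INTO user_votes (user_id, post_id, direction)
VALUES (?, ?, ?)
ON CONFLICT (user_id, post_id)
DO UPDATE SET direction = excluded.direction
"""
conn.execute(upsert, ("u1", "p1", 1))   # first vote: up
conn.execute(upsert, ("u1", "p1", -1))  # changed vote: down

row = conn.execute(
    "SELECT direction FROM user_votes WHERE user_id='u1' AND post_id='p1'"
).fetchone()
print(row)  # (-1,)
```

The primary key on (user_id, post_id) is what makes the conflict target work, and it is the same constraint that enforces the one-vote-per-target rule.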
Key questions to address in this interview:
- How do you serve a ranked hot feed for subreddits with millions of posts in under 200ms? (Redis sorted sets, hot score formula, Vote Flush Worker keeping scores current)
- How do you handle 1,000 concurrent votes per second on a single viral post without row-lock contention? (Redis write buffer, Vote Flush Worker, batched UPDATE)
- How do you enforce one vote per user under concurrent access without a race condition? (atomic UPSERT with ON CONFLICT DO UPDATE)
- How do you serve deeply nested comment trees without N+1 queries? (recursive CTE with depth cap, Redis comment cache with TTL and write-time invalidation)
- What breaks first if a post goes viral unexpectedly? (Feed sorted set updates lag if the Vote Flush Worker falls behind; comment cache miss rate spikes on rapid invalidation during a comment burst)