Likes Counter
Design a scalable likes counter for celebrity posts receiving millions of writes per second: handling hot-key write amplification, aggregating counts with acceptable staleness, and preventing double-likes.
What is a likes counter for high-profile posts?
A likes counter tracks how many users have liked a post and enforces that no single user can like the same post twice. The deceptively simple task of incrementing a number becomes a serious distributed systems problem the moment a celebrity posts and 100,000 users like it within 60 seconds.
I like this interview question because it forces you past the "just use Redis" reflex. A single Redis key saturates at exactly the wrong moment. This problem tests hot-key mitigation, write aggregation, probabilistic duplicate rejection, and write-behind persistence in one focused question.
Functional Requirements
Core Requirements
- Users can like or unlike a post.
- Any user viewing a post can see the current like count.
- A user can only like a post once (no double-likes).
Below the Line (out of scope)
- Like notifications delivered to the post author
- Displaying who liked a post (partial like list for display)
- Reactions beyond a binary like (emoji reactions)
- Real-time push of count updates to all active viewers
The hardest part in scope: Writing 100,000 likes per second to the same logical counter for a celebrity post without saturating a single database row or cache key. The uniqueness constraint adds a second layer of difficulty: we must prevent double-likes at a fraction of the per-like write cost.
Like notifications are below the line because they introduce a fan-out problem that is orthogonal to the counter design. To support them, I would publish a like_event to Kafka and have a notification service consume it asynchronously, separate from the write path.
Displaying who liked a post is below the line because it requires paginated queries across potentially millions of records. To add it, I would store (post_id, user_id, liked_at) in a dedicated table with an index on (post_id, liked_at DESC) for cursor-based pagination.
Real-time push of count updates is below the line because it adds WebSocket fan-out complexity without changing the counter storage design. To add it, I would use Server-Sent Events, publishing count deltas from the read service to connected clients on a configurable threshold (for example, every 1,000 new likes).
Non-Functional Requirements
Core Requirements
- Write throughput: Celebrity posts receive up to 100,000 like writes per second for the first 2-5 minutes after posting. Average posts receive far less (roughly 1 like per second at peak).
- Read throughput: With 500M DAU and an average of 20 post views per day, the system processes approximately 115,000 count reads per second.
- Write latency: A like write acknowledges in under 200ms p99.
- Read latency: A like count read returns in under 50ms p99.
- Availability: 99.99% uptime. Availability over consistency: a stale count that is 1-5 seconds behind is fully acceptable.
- Uniqueness: Each user can like a post at most once. Duplicate writes must be rejected.
Below the Line
- Sub-5ms read latency via CDN edge caching (requires additional infrastructure)
- Global multi-region write consistency (out of scope for this design)
Read/write ratio: During the peak of a celebrity post, the ratio briefly inverts to roughly 1:1 (equal writes and reads). For average posts and steady-state traffic, reads outpace writes by roughly 10:1. Both extremes need addressing: the peak write case determines the counter design, and the read-heavy steady state determines the caching strategy.
The 100,000 writes-per-second target rules out any design that routes all writes to a single storage node. A typical Redis node handles 100,000-200,000 ops/second total.
With no write spreading, a single hot key saturates the CPU of one Redis primary almost immediately. The 1-5 second stale-count tolerance is the most consequential NFR: it unlocks batching, in-memory aggregation, and async database flush, all of which are required to absorb the write burst.
Core Entities
- Like: The canonical record that a specific user liked a specific post. Contains `post_id`, `user_id`, and `liked_at`. Enforces uniqueness at the database layer via a `UNIQUE(post_id, user_id)` constraint.
- Post: The content item being liked. Treated as an external entity. The counter design does not own the post record.
- LikeCount: The aggregated count for a post. In the naive design it is a value derived from `COUNT(*)`. In the evolved design it is a materialized value held in Redis shards and periodically flushed to the database.
I'd keep the entity list this short in an interview. The full schema, indexes, and flush semantics are deferred to the deep dives. These three entities are sufficient to drive the API and high-level design.
API Design
Two functional requirements drive the API shape.
FR 1 and FR 3 - Like or unlike a post:
POST /posts/{post_id}/likes
Authorization: Bearer <token>
Response: 200 OK | 409 Conflict (already liked)
DELETE /posts/{post_id}/likes
Authorization: Bearer <token>
Response: 200 OK | 404 Not Found (not liked)
POST creates a like record. DELETE removes it. Both derive user_id from the auth token rather than the request body, preventing any user from spoofing another user's like action.
The 409 on POST and 404 on DELETE are the signals the client uses to update button state without a separate round trip.
FR 2 - Read the current like count:
GET /posts/{post_id}/likes/count
Response: { "count": 14823901, "is_approximate": true }
The is_approximate field is the honest signal to clients that the count may be 1-5 seconds stale. Surfacing this explicitly prevents product teams from building features that depend on exact real-time counts, which this system deliberately does not guarantee at peak load.
High-Level Design
1. Naive approach: direct database writes
The simplest design routes every like directly to a relational database. The database enforces uniqueness via a UNIQUE(post_id, user_id) constraint and computes the count via COUNT(*) on each read.
Components:
- Client: Issues `POST /posts/{id}/likes` to like a post and `GET /posts/{id}/likes/count` to read the count.
- API Service: Validates the auth token, extracts the user ID, and issues SQL writes and reads.
- Database: Stores the canonical like records. `SELECT COUNT(*)` returns the current count.
Request walkthrough (write):
- Client sends `POST /posts/42/likes` with an auth token.
- API service validates the token and extracts `user_id`.
- Service issues `INSERT INTO likes (post_id, user_id, liked_at) VALUES (42, 891, now())`.
- The `UNIQUE(post_id, user_id)` constraint rejects duplicates with a constraint violation, which the service maps to a 409.
- Client receives `200 OK` or `409 Conflict`.
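The naive write path can be sketched end to end with SQLite standing in for the relational database (the table layout matches the entities above; the helper names are illustrative):

```python
import sqlite3

# In-memory SQLite stands in for the relational database in the naive design.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE likes (
        post_id  INTEGER NOT NULL,
        user_id  INTEGER NOT NULL,
        liked_at TEXT    NOT NULL DEFAULT (datetime('now')),
        UNIQUE (post_id, user_id)   -- rejects double-likes at the storage layer
    )
""")

def like(post_id: int, user_id: int) -> int:
    """Insert a like; map a uniqueness violation to HTTP 409."""
    try:
        conn.execute(
            "INSERT INTO likes (post_id, user_id) VALUES (?, ?)",
            (post_id, user_id),
        )
        return 200
    except sqlite3.IntegrityError:
        return 409  # UNIQUE(post_id, user_id) fired: already liked

def count(post_id: int) -> int:
    """Naive count: a COUNT(*) scan on every read."""
    return conn.execute(
        "SELECT COUNT(*) FROM likes WHERE post_id = ?", (post_id,)
    ).fetchone()[0]

print(like(42, 891))  # 200: first like succeeds
print(like(42, 891))  # 409: duplicate rejected by the constraint
print(count(42))      # 1
```

This is all the correctness the system ever needs; everything that follows exists only to buy throughput.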
This is the complete naive system. It is correct for low-traffic posts. I'd sketch this on the whiteboard first because it shows the interviewer you can identify the simplest correct solution before optimizing. It breaks under celebrity traffic.
The problem at celebrity scale
This design breaks immediately when a celebrity posts. At 100,000 writes per second, every write lands in the same table and contends for the same hot pages of the (post_id, user_id) index.
SELECT COUNT(*) becomes an expensive index scan across potentially millions of rows. A single database node saturates well below 100,000 writes/second.
The core failure is a hot-key problem: all 100,000 writes per second are directed at the same logical key (post_id = <celebrity_post>). Any single-node storage system, whether a database row or a single Redis key, will saturate under this load pattern.
A single Redis key (INCR post:42:count) does not solve the hot-key problem. Redis executes commands on a single thread per node, so at 100,000 INCR ops/second one key consumes the full command-processing capacity of one node. The ceiling is higher than Postgres's, but it is still a ceiling, not a fix.
2. Evolved approach: sharded Redis counters with async flush
The fix is to spread writes across K independent Redis keys (shards) for the same post. A write picks a random shard and INCRs it. A read fetches all K shards with a single MGET and sums them.
This trades slightly more expensive reads (K values fetched instead of one) for dramatically more scalable writes (100,000 writes/second distributed across K shards).
Components:
- Write Service: Receives `POST /likes`, checks for duplicates using a bloom filter, picks a random shard `rand(0, K-1)`, and INCRs `post:{id}:shard_{N}`.
- Read Service: Receives `GET /count`, fetches all K shards in a single Redis MGET, returns the sum.
- Redis Cluster: Holds K shards per post. Each shard is an independent key distributed across Redis nodes by keyslot. Writes never contend.
- Kafka: Captures every like event as a durable log. The DB consumer reads from Kafka and flushes records to the database.
- DB Consumer: Consumes `like_events` from Kafka and batches writes to the database every few seconds.
- Database: Stores canonical `(post_id, user_id)` like records for durability and deduplication audit.
Request walkthrough (write):
- Client sends `POST /posts/42/likes`.
- Write Service checks the bloom filter for a fast duplicate check (see Deep Dive 2).
- Write Service picks `shard = rand(0, K-1)` and issues `INCR post:42:shard_{N}` to Redis.
- Write Service publishes `{ post_id: 42, user_id: 891 }` to the Kafka `like_events` topic.
- Client receives `200 OK` in under 200ms.
- DB Consumer (async) batches events from Kafka and flushes to the database.
This design handles 100,000 writes/second via sharding and returns counts in under 50ms via Redis MGET, which is sub-millisecond at this scale. I often see candidates jump straight to this design without showing the naive version first. Resist that urge. The evolution from naive to sharded is the story the interviewer wants to hear.
K (the shard count) is a tuning knob. At K=10, each shard receives 10,000 writes/second for a celebrity post. That is well within a comfortable operating range for one Redis key. On read, MGET 10 keys is a single round trip completing in under 1ms. Start with K=10 and increase only if monitoring shows shard saturation.
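The shard-and-sum pattern is small enough to sketch in full. A minimal in-memory version, with a dict standing in for the Redis cluster and key names following the `post:{id}:shard_{N}` convention from the walkthrough:

```python
import random
from collections import defaultdict

K = 10                    # shard count per post (the tuning knob discussed above)
store = defaultdict(int)  # stands in for the Redis cluster

def incr_like(post_id: int) -> None:
    """Write path: INCR one randomly chosen shard key."""
    shard = random.randrange(K)
    store[f"post:{post_id}:shard_{shard}"] += 1

def read_count(post_id: int) -> int:
    """Read path: fetch all K shard keys (one MGET in real Redis) and sum."""
    keys = [f"post:{post_id}:shard_{n}" for n in range(K)]
    return sum(store[k] for k in keys)

# 100,000 likes spread across 10 shards: each shard absorbs ~10,000 INCRs.
for _ in range(100_000):
    incr_like(42)
print(read_count(42))  # 100000
```

Note that the sum is exact regardless of how unevenly the random spread lands; sharding changes where increments go, never how many there are.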
Potential Deep Dives
1. How do we solve the hot-key problem for celebrity posts?
The NFR is explicit: 100,000 like writes per second to a single post. Any design that routes all writes to one storage key will fail. The question is how to spread them.
2. How do we prevent a user from liking the same post twice?
The UNIQUE(post_id, user_id) constraint in the database is the correct safety net. The challenge is that we cannot afford a database write on every like attempt at 100,000 writes/second. We need a fast rejection path that handles the vast majority of duplicate attempts cheaply, before they touch the database.
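To make the fast rejection path concrete, here is a toy Bloom filter with exactly the semantics the design relies on: no false negatives, and rare false positives that fall through to a DB confirmation. In production this would be Redis Bloom (`BF.ADD`/`BF.EXISTS`); the bit-array size, hash count, and helper names here are illustrative.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely never added (no false negatives)."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter()

def try_like(post_id: int, user_id: int) -> str:
    key = f"{post_id}:{user_id}"
    if seen.might_contain(key):
        return "check_db"  # probably a duplicate; the rare-FP fallback path
    seen.add(key)
    return "accept"        # definitely a first-time like: no DB read needed

print(try_like(42, 891))  # accept
print(try_like(42, 891))  # check_db
```

The asymmetry is the whole point: the common case (a genuine first like) never touches the database, and only "probably seen" answers pay for a confirmation query.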
3. How do we durably persist like counts without the database becoming a bottleneck?
Redis stores the live counter and handles all writes at peak throughput. But Redis is not durable by default. We need a strategy to flush aggregated counts to a durable store without reintroducing the DB bottleneck we just eliminated.
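The flush strategy can be sketched as a batching consumer: buffer events (a plain list stands in for a Kafka partition here), write each batch in one DB transaction with idempotent inserts, and advance the "offset" only after the transaction commits. This assumes SQLite 3.24+ for the `ON CONFLICT DO NOTHING` upsert syntax.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE likes (post_id INTEGER, user_id INTEGER, UNIQUE (post_id, user_id))"
)

# Stands in for the Kafka like_events topic; note the duplicate (42, 2).
log = [(42, u) for u in range(5)] + [(42, 2)]
committed_offset = 0

def flush_batch(batch_size: int = 100) -> None:
    """Write-behind: one DB transaction per batch, then commit the offset."""
    global committed_offset
    batch = log[committed_offset : committed_offset + batch_size]
    if not batch:
        return
    with db:  # transaction: commits on success, rolls back on error
        db.executemany(
            # ON CONFLICT DO NOTHING makes replayed or duplicate events harmless.
            "INSERT INTO likes (post_id, user_id) VALUES (?, ?) ON CONFLICT DO NOTHING",
            batch,
        )
    committed_offset += len(batch)  # only advanced after the txn succeeded

flush_batch()
print(db.execute("SELECT COUNT(*) FROM likes").fetchone()[0])  # 5 (duplicate ignored)
```

If the consumer crashes between the transaction and the offset commit, the batch replays on restart and the idempotent insert absorbs it, which is why at-least-once delivery is sufficient here.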
Final Architecture
The critical architectural insight is the complete decoupling of three independent throughput ceilings.
Write Service throughput is bounded by Redis shard capacity (millions of ops/second across K nodes). Read Service throughput is bounded by Redis read capacity and CDN hit rate (effectively unlimited at a 1-5 second TTL). DB Consumer throughput is bounded only by Kafka's ability to buffer events, which means the database absorbs write spikes smoothly regardless of the user-facing write rate.
Interview Cheat Sheet
- The hot-key problem appears when millions of writes target the same Redis key, saturating one node. Redis executes commands on a single thread per node, so 100,000 INCR ops/second on one key is near a node's saturation ceiling.
- Sharded counters fix the hot key: write to `post:{id}:shard_{rand(0,K)}`, read by MGET-ing all K shards and summing. K=10 divides peak write load by 10 with no read latency penalty.
- Stale counts of 1-5 seconds are acceptable for this use case and should be stated out loud in the interview. This tolerance unlocks every other design choice: batched writes, in-memory aggregation, async DB flush.
- A single Redis INCR key does not solve the hot-key problem; it raises the ceiling above Postgres's but does not remove it.
- Probabilistic bloom filter for duplicate rejection: Redis Bloom `BF.EXISTS` completes in under 1ms and handles 99%+ of duplicate checks without touching the database.
- Bloom filters produce no false negatives: if `BF.EXISTS` returns 0, the user has genuinely not liked this post. False positives (filter says "probably seen" when the user has not) are rare (~1%) and require a fallback DB confirmation. Use the Redis Bloom module (`BF.RESERVE`) for memory-efficient probabilistic filtering.
- Write-behind via Kafka: accept the like into Redis and Kafka immediately, flush to the DB asynchronously in batches. This decouples user-facing write latency from DB insertion throughput entirely.
- Effectively-once DB writes: commit Kafka offsets only after the DB transaction succeeds (at-least-once delivery), and let `ON CONFLICT DO NOTHING` make each INSERT idempotent so replayed events are harmless. On consumer restart, events replay from the last committed offset.
- The DB `UNIQUE(post_id, user_id)` constraint is the final safety net. It is never the primary dedup mechanism in the hot write path.
- Like count reads go through a CDN edge cache with a 1-5 second TTL. At 115,000 count reads per second globally, the CDN absorbs the vast majority of requests before they reach the Read Service.
- On Redis recovery after a crash, re-seed shard counters from the DB total for recently active posts. Kafka retains events for 7 days, so the consumer can replay any gap in the DB record without data loss.
- Choose K=10 shards as the default. Increase K if monitoring shows any shard exceeding 50,000 INCR ops/second. Redis MGET of 10 keys remains sub-millisecond at any realistic traffic level.
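The Redis-recovery re-seed in the cheat sheet can be sketched as: read the durable total from the DB, write it into one shard, and zero the rest. The shard-0 choice and key names are illustrative; new writes immediately re-spread across all K shards.

```python
K = 10
store = {}  # stands in for Redis after a cold restart (all shard keys lost)

def reseed(post_id: int, db_total: int) -> None:
    """Re-seed shard counters from the durable DB total after Redis data loss."""
    store[f"post:{post_id}:shard_0"] = db_total  # whole recovered total in one shard
    for n in range(1, K):
        store[f"post:{post_id}:shard_{n}"] = 0

def read_count(post_id: int) -> int:
    """Same MGET-and-sum read path as before recovery."""
    return sum(store.get(f"post:{post_id}:shard_{n}", 0) for n in range(K))

reseed(42, 14_823_901)
print(read_count(42))  # 14823901
```

Because reads always sum all K shards, an uneven distribution after re-seeding is invisible to clients; only per-shard write load matters, and that rebalances as new likes arrive.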