Kafka Clone
Design the internals of a durable, high-throughput message streaming platform: from a single-broker write path to a multi-partition, multi-datacenter system capable of Facebook-scale event ingestion.
What is a distributed message streaming system?
Kafka is a distributed commit log: producers append messages to named topics, consumers read them at their own pace using persistent offsets. The key engineering challenge is not the pub/sub concept itself but achieving millions of sequential disk writes per second on commodity hardware while tracking offset state for thousands of independent consumer groups. This question tests the intersection of I/O mechanics, distributed consensus, and horizontal partition scaling, which is why it appears in staff-level interviews at companies running high-ingestion data platforms. I'd lead with the sequential I/O insight in any interview because every other design decision in Kafka falls out of that one constraint.
Functional Requirements
Core Requirements
- Producers publish messages to named topics.
- Consumers subscribe and read messages from a topic with persistent offsets.
- Messages are durable and survive broker restarts.
- Support at-least-once delivery.
Scope Exclusions
- Stream processing (Flink, Spark Streaming) beyond basic consumer logic.
- Schema registry or message transformation.
Non-Functional Requirements
Core Requirements
- Write throughput: The system handles at least 10 million messages per second across all topics.
- Latency: End-to-end producer-to-consumer latency is under 10ms p99 at normal load.
- Durability: Zero data loss on a single broker restart; messages survive with replication factor of 2 or more.
- Availability: 99.99% uptime. The cluster continues serving reads and writes during a single broker failure.
- Scale: Facebook-level ingestion, roughly 1 trillion messages per day distributed across thousands of topics.
Below the Line
- Exactly-once delivery semantics (builds on the at-least-once foundation we will design)
- Cross-datacenter replication (MirrorMaker / active-active topology)
- Built-in schema evolution (Confluent Schema Registry)
- Tiered storage to object stores (Kafka 3.x feature)
The hardest engineering problem in scope: Sustaining 10 million writes per second with sub-10ms latency and full durability is the fundamental tension in this design. Random writes to disk top out at roughly 100 MB/s; sequential appends saturate at 600+ MB/s. The entire write path is engineered around eliminating random I/O, and every other design decision in this article is downstream of that one insight.
Exactly-once semantics is below the line because it requires the idempotent producer protocol and transactional APIs, which form a separate abstraction layer on top of at-least-once. To add it, I would set enable.idempotence=true on the producer (this assigns a producer ID and a sequence number to each message batch) and wrap the produce and offset commit in a Kafka transaction.
Cross-datacenter replication is below the line because it requires understanding active-passive versus active-active topologies and lag monitoring. To add it, I would deploy MirrorMaker 2 as a replication pipeline between clusters and use a consistent hashing scheme to route producers to their nearest datacenter.
Core Entities
- Topic: A named, logical stream of records. Topics are append-only and hold records indefinitely subject to configurable retention.
- Partition: An ordered, immutable append-only log. The unit of parallelism and distribution across brokers.
- Segment: A physical file on disk within a partition. Kafka writes to the active segment and rolls to a new one at a configurable size or time boundary.
- Record (Message): The fundamental unit: a key, a value, a timestamp, and an offset. The key drives partition assignment; the offset is the record's position within its partition.
- ConsumerGroup: A named set of consumers sharing a single logical read position per topic. Each partition is assigned to exactly one consumer within the group at any time.
- Offset: The monotonically increasing position of a record within a partition. Consumers commit their current offset to checkpoint read progress and resume after restarts.
Schema and serialization format are deferred to the deep dives. These six entities are sufficient to drive the API design and High-Level Design.
API Design
Kafka exposes an SDK-style client API rather than REST. I'll show the naive single-producer, single-consumer shape first, then the partitioned and keyed evolution that handles production workloads.
Produce messages (naive shape, no partitioning):
// Simple produce: one broker, one partition, fire and forget
producer.produce(topic="events", value="payload")
This is the starting point. No partition key, no acknowledgment mode, no batching. It works for development but has no durability or throughput guarantees.
Produce messages (evolved shape, keyed with acks):
// Keyed produce: consistent partition assignment by key hash
producer.produce(
    topic="events",
    key="user_id:12345",  // murmurhash(key) % num_partitions = target partition
    value="payload",
    acks="all"            // wait for all ISR replicas to acknowledge before returning
)
The key determines which partition receives the message. All messages with the same key land on the same partition, preserving ordering per key. acks=all waits for the leader and all in-sync replicas to acknowledge before returning to the caller.
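The key-to-partition mapping described above can be sketched in a few lines. Kafka's Java client hashes keys with murmur2; the sketch below substitutes Python's standard-library zlib.crc32 purely for illustration, since the property that matters is that the mapping is deterministic.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition. Kafka's client uses murmur2;
    zlib.crc32 stands in here for illustration only."""
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, preserving per-key order.
p1 = choose_partition(b"user_id:12345", 64)
p2 = choose_partition(b"user_id:12345", 64)
assert p1 == p2
assert 0 <= p1 < 64
```

Note the corollary baked into this function: changing num_partitions changes the mapping, which is why adding partitions later breaks per-key ordering.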
Subscribe and poll (consumer side):
// Register consumer in a named group; broker assigns partitions
consumer.subscribe(topics=["events"], group_id="analytics-workers")
// Poll loop: fetch records from last committed offset
while running:
    records = consumer.poll(timeout_ms=500)
    for record in records:
        process(record)
    consumer.commit(offsets=current_offsets)  // checkpoint after successful processing
subscribe() registers the consumer in a consumer group. The broker group coordinator assigns partitions to each consumer in the group. commit() advances the checkpoint so a restart resumes from the correct position rather than replaying from the beginning.
Admin API (topic management):
// Partition count controls max consumer parallelism; set it right at creation
admin.create_topic(
    name="events",
    num_partitions=64,
    replication_factor=3
)
Partition count is set at topic creation and directly controls maximum parallel consumer throughput. Use replication_factor=3 for all production topics; this tolerates one broker failure without data loss.
High-Level Design
1. Producers publish messages to named topics
The write path at its simplest: a producer connects to a broker, the broker appends the message to a partition file on disk, and returns an acknowledgment.
Components:
- Producer Client: The application sending records. Handles serialization, batching, and retry.
- Broker: The single server that owns the partition. Appends each incoming record to the active segment file on disk.
- Partition Log: An append-only file. Records are never modified; they are only appended and eventually deleted by the retention policy.
Request walkthrough:
- Producer calls produce(topic="events", value="payload").
- The broker receives the produce request and deserializes the record.
- The broker appends the serialized record to the active segment file for partition 0.
- The broker assigns the next integer offset and writes it to the segment index.
- The broker acknowledges success back to the producer.
This is the write path only: one producer, one broker, one partition file. There is no fault tolerance and no parallelism here; both are added in later requirements and deep dives. I start every Kafka walkthrough with this single-partition sketch because it forces the interviewer to see the simplest possible system before any complexity is introduced.
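The single-partition write path above reduces to a toy append-only log. This is an illustrative in-memory sketch (a Python list stands in for the segment file), not broker code:

```python
class PartitionLog:
    """Toy single-partition append-only log: each record receives the
    next integer offset; records are never modified in place."""

    def __init__(self):
        self._records = []  # stands in for the active segment file

    def append(self, value) -> int:
        offset = len(self._records)  # next offset = current log length
        self._records.append(value)
        return offset                # the "ack" returned to the producer

    def read(self, offset: int):
        return self._records[offset]

log = PartitionLog()
assert log.append("payload-0") == 0
assert log.append("payload-1") == 1
assert log.read(1) == "payload-1"
```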
2. Consumers subscribe and read with persistent offsets
Consumers do not move messages off the broker. They track their own position (offset) in each partition and fetch records at their own pace, which means the same data can be read by multiple independent consumer groups without any coordination between them.
Components:
- Consumer Client: Calls poll() in a loop. Fetches records from a specific partition starting at the last committed offset.
- Offset Store: The broker records the last committed offset per consumer group and partition. On restart, the consumer resumes from this stored checkpoint.
- Group Coordinator: A designated broker partition that tracks which consumers are alive and which partitions each consumer owns.
Request walkthrough:
- Consumer calls subscribe(topics=["events"], group_id="my-group").
- The group coordinator assigns partition 0 to this consumer.
- Consumer calls poll(). The broker returns records starting at the consumer's last committed offset.
- Consumer processes records, then calls commit(offsets) to checkpoint progress.
- On restart, the consumer fetches its last committed offset from the broker and resumes from that position.
Adding offset tracking separates Kafka from a simple file tail. Multiple consumer groups can read the same topic completely independently, consumers can replay from any offset, and restarts are lossless. I'd call out the offset model explicitly in an interview because it is the single feature that makes Kafka fundamentally different from RabbitMQ.
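The offset model can be sketched with an in-memory map keyed by (group, partition) standing in for the real offset store. The point the sketch makes: each group owns its own cursor, so groups never interfere with one another.

```python
class OffsetStore:
    """Toy per-(group, partition) committed-offset store."""

    def __init__(self):
        self._committed = {}  # (group_id, partition) -> next offset to read

    def commit(self, group_id: str, partition: int, offset: int) -> None:
        self._committed[(group_id, partition)] = offset

    def fetch(self, group_id: str, partition: int) -> int:
        # A group with no committed offset starts from the beginning.
        return self._committed.get((group_id, partition), 0)

store = OffsetStore()
store.commit("analytics", 0, 42)
assert store.fetch("analytics", 0) == 42  # resumes where it left off
assert store.fetch("billing", 0) == 0     # independent group, own cursor
```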
3. Messages survive broker restarts (durability)
A file on a single broker is not durable. To survive restarts, records must be replicated: the leader broker writes each record, and follower brokers replicate it before the producer receives an acknowledgment.
Components:
- Leader Partition: The designated broker that handles all produce requests for a partition.
- Follower Replicas: Brokers that maintain a copy of the leader's partition log by continuously fetching new records.
- In-Sync Replica Set (ISR): The set of replicas that are fully caught up with the leader (within replica.lag.time.max.ms). The acks=all setting means the leader only acks once every ISR member has persisted the record.
- ZooKeeper / KRaft Controller: Manages cluster metadata and runs leader election when a broker goes down.
Request walkthrough:
- Producer sends a record with acks=all.
- Leader broker appends the record to its own log.
- Follower brokers fetch the new record and append it to their local copies.
- Once all ISR members confirm they have persisted the record, the leader sends the ACK to the producer.
- If the leader crashes, the KRaft controller elects a new leader from the ISR. Every ISR member already has the record, so no data is lost.
The ISR protocol makes acks=all durable without sacrificing throughput. If a follower falls behind, it is removed from the ISR, and the leader stops waiting for it, maintaining write availability while the slow follower catches up in the background. I have seen production clusters where a single slow disk on one broker cascaded into full-cluster stalls because the team had not set replica.lag.time.max.ms low enough to eject the lagging replica from the ISR quickly.
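The acks=all decision the leader makes can be modeled as a pure function over follower positions. This is a toy model of the rule, not the broker's replication code:

```python
def leader_can_ack(record_offset: int, replica_offsets: dict,
                   isr: set, acks: str) -> bool:
    """Toy acks=all check: the leader acknowledges a record only once
    every current ISR member has persisted up to that offset."""
    if acks != "all":
        return True  # acks=1: the leader's own append is enough
    return all(replica_offsets[r] >= record_offset for r in isr)

# Replica "b" lags at offset 7. With "b" in the ISR, offset 10 cannot be
# acknowledged; ejecting "b" from the ISR restores write availability.
offsets = {"a": 10, "b": 7}
assert not leader_can_ack(10, offsets, isr={"a", "b"}, acks="all")
assert leader_can_ack(10, offsets, isr={"a"}, acks="all")
```

This is exactly the trade the ISR protocol makes: ejection shrinks the durability quorum temporarily, but keeps the write path unblocked by a slow follower.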
4. At-least-once delivery guarantee
At-least-once means every message is delivered, but may be delivered more than once when network failures cause retries. The producer must retry on timeout, and the consumer must process records idempotently or deduplicate by application key.
Components:
- Producer Retry Logic: On timeout or network error, the producer retries the produce request with the same record. If the original ACK was lost in transit, the broker receives and appends a duplicate.
- Idempotent Consumer: Processes each record and commits the offset only after successful processing. A mid-batch crash causes the consumer to re-read and re-process those records on restart.
Request walkthrough:
- Producer sends a record with acks=all and retries=Integer.MAX_VALUE.
- The broker appends and replicates the record successfully, but the ACK is lost on the network.
- Producer times out and retries the same record.
- The broker appends the duplicate with a new offset. Two copies now exist in the log.
- Consumer reads both copies and must deduplicate using an application-level key (such as event_id).
This covers all four functional requirements. The deep dives below address the non-functional requirements: write throughput at 10M messages/second, horizontal scalability to Facebook-level volumes, fault tolerance under broker failures, and offset state management at scale. I always pause here in an interview and ask the interviewer which deep dive they want to explore first, rather than charging ahead. It signals awareness that time is finite.
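The consumer-side deduplication from the walkthrough above can be sketched as follows. The event_id field comes from the walkthrough; the in-memory set stands in for a durable dedup store, which a real service would need to survive its own restarts.

```python
processed = set()  # in production: a durable store, not process memory

def handle(record: dict) -> str:
    """At-least-once delivery implies duplicates; deduplicate by an
    application-level key (event_id here, per the walkthrough)."""
    event_id = record["event_id"]
    if event_id in processed:
        return "skipped"      # duplicate created by a producer retry
    processed.add(event_id)
    return "processed"

assert handle({"event_id": "e1"}) == "processed"
assert handle({"event_id": "e1"}) == "skipped"  # redelivered copy
```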
Potential Deep Dives
1. How does Kafka achieve high write throughput with sequential disk I/O?
The target is 10 million writes per second with sub-10ms end-to-end latency. Random-access storage on commodity hardware tops out at roughly 5,000 to 10,000 random writes per second; sequential appends on the same hardware saturate at 600+ MB/s. I always explain this gap to interviewers early because it is the root cause behind every architectural decision in Kafka's write path.
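A quick back-of-envelope check using the figures quoted above (assumed round numbers, not measurements), with ~200-byte records on a single commodity disk:

```python
# Assumed inputs, taken from the figures quoted in the text.
record_bytes = 200
sequential_mb_s = 600    # sequential append throughput per disk
random_iops = 10_000     # upper end of random-write IOPS per disk

seq_records_per_s = sequential_mb_s * 1_000_000 // record_bytes
rand_records_per_s = random_iops  # one record per random write

assert seq_records_per_s == 3_000_000
# Sequential appends beat random writes ~300x per disk, so roughly 4
# brokers' worth of disks absorb 10M msgs/s; random I/O would need ~1,000.
assert seq_records_per_s // rand_records_per_s == 300
```

The arithmetic, not the exact constants, is the point: the gap is orders of magnitude, which is why the write path is built entirely around sequential appends.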
2. How do partitions enable horizontal scalability and parallel consumption?
A single partition is a sequential log with one writer and one active consumer slot per consumer group. To handle Facebook-level throughput, we need many partitions distributed across many brokers with consumers reading them in parallel.
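Partition-to-consumer assignment can be sketched with a toy round-robin assignor. The real group coordinator supports several pluggable strategies; this sketch only illustrates the invariant that each partition is owned by exactly one consumer per group, and that consumers beyond the partition count sit idle.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Toy round-robin assignment: each partition goes to exactly one
    consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign_partitions(list(range(6)), ["c0", "c1", "c2", "c3"])
assert a["c0"] == [0, 4]
assert a["c3"] == [3]

b = assign_partitions([0, 1], ["c0", "c1", "c2"])
assert b["c2"] == []  # more consumers than partitions: c2 sits idle
```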
3. How does Kafka achieve fault tolerance and high availability?
Fault tolerance means surviving broker failures without data loss or extended downtime. The mechanism is distributed leader-follower replication with a controlled failover protocol. The subtlety is that "no data loss" requires careful definition: the ISR contract ensures the new leader always has every acknowledged message.
4. How does Kafka maintain consumer offset state at scale?
Every consumer group tracks one committed offset per partition. At Facebook scale with thousands of consumer groups each subscribed to hundreds of topics with 64 partitions each, offset storage becomes a high-frequency coordination write problem.
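The compaction that keeps this offset state bounded reduces to "latest value per key wins." A toy model of compaction over an ordered list of (key, value) commits, where the key is a hypothetical (group, topic, partition) tuple:

```python
def compact(log: list) -> dict:
    """Toy log compaction: keep only the newest value per key,
    discarding superseded offset commits."""
    latest = {}
    for key, value in log:  # log is ordered oldest -> newest
        latest[key] = value
    return latest

commits = [
    (("analytics", "events", 0), 100),
    (("billing", "events", 0), 7),
    (("analytics", "events", 0), 250),  # supersedes the earlier 100
]
assert compact(commits)[("analytics", "events", 0)] == 250
assert len(compact(commits)) == 2  # one surviving entry per key
```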
5. Kafka vs. traditional message queues: when to use each?
Traditional queuing systems (RabbitMQ, Amazon SQS) and Kafka solve different problems. The choice depends on whether you need replay, fan-out to independent consumers, or high write throughput versus routing flexibility and simple queue semantics.
Final Architecture
The single most important architectural insight in Kafka: every component is designed around the OS page cache. Writes land in memory first (no per-record fsync), ISR replicas fetch from the same page cache on the leader, and consumers read via zero-copy sendfile() without touching user space. ISR replication on top of this delivers durability without sacrificing the throughput that memory-speed I/O provides.
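The zero-copy consumer read path can be demonstrated directly with the sendfile syscall (exposed in Python as os.sendfile). This is a toy sketch with a socketpair standing in for a consumer connection, not broker code; the bytes move from the page cache holding the segment file to the socket without entering user space.

```python
import os
import socket
import tempfile

def serve_fetch(log_fd: int, offset: int, length: int, out_sock) -> int:
    """Toy zero-copy fetch: sendfile() streams log bytes from the OS
    page cache straight to the consumer socket, kernel-side only."""
    return os.sendfile(out_sock.fileno(), log_fd, offset, length)

# Demo: a temp file stands in for a segment; a socketpair for a consumer.
with tempfile.TemporaryFile() as segment:
    segment.write(b"record-0record-1")
    segment.flush()
    a, b = socket.socketpair()
    sent = serve_fetch(segment.fileno(), 8, 8, a)  # fetch "record-1"
    a.close()
    assert sent == 8
    assert b.recv(64) == b"record-1"
    b.close()
```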
Interview Cheat Sheet
- Kafka vs. RabbitMQ: Use Kafka when consumers need to replay messages, multiple independent services read the same event stream, or you need millions of writes per second with durable retention. Use RabbitMQ when you need complex per-message routing, direct queue semantics where messages are removed after consumption, or simpler operational overhead at lower scale.
- The write throughput secret: Sequential appends to the OS page cache, not direct disk writes. The OS flushes to disk asynchronously; ISR replication provides durability, not fsync. Sequential I/O hits 600+ MB/s on commodity hardware; random I/O tops out at around 100 MB/s.
- Zero-copy is the consumer throughput lever: sendfile() transfers bytes from the broker's OS page cache to the consumer socket entirely in the kernel, bypassing user-space memory. This roughly doubles consumer throughput on a saturated NIC and is why Kafka consumers can match producer write rates.
- Partition count equals max consumer parallelism: Each partition can be consumed by exactly one consumer per group at a time. A topic with 64 partitions supports at most 64 parallel consumers in a single group. Start with 3x your expected peak consumer count, since increasing partition count later breaks per-key ordering.
- ISR guarantees durability without sacrificing write availability: acks=all plus min.insync.replicas=2 means at least two brokers must persist every acknowledged message. If one ISR member falls behind, the leader removes it from the ISR and continues writing with the remaining members.
- Offset as a cursor, not a lock: Consumer groups track their read position as a committed offset stored in __consumer_offsets, not as a server-side cursor or lock. Multiple consumer groups read the same topic completely independently with no coordination. This is the core capability that separates Kafka from traditional message queues.
- __consumer_offsets is itself a Kafka topic: Kafka bootstraps offset state onto its own replication and compaction infrastructure. The topic has 50 partitions by default, uses log compaction (latest offset per key only), and scales with the broker cluster. No external coordination service is needed.
- ISR election eliminates data loss: Leader election in Kafka always picks from the ISR, the set of replicas provably caught up with the leader. Since every ISR member has every acknowledged message, any ISR member can become the new leader without losing a single acknowledged write.
- KRaft drops ZooKeeper and cuts failover to under 1 second: KRaft (default in Kafka 3.3+) replaces ZooKeeper with an internal Raft consensus group of controller brokers. This removes the external dependency and cuts leader election time from roughly 30 seconds to under 1 second.
- Rebalancing is the hidden source of consumer latency spikes: Every consumer join or leave triggers a rebalance that, under the default eager protocol, pauses all consumption in the group. Use CooperativeStickyAssignor to minimize the pause by only reassigning partitions that actually need to move, not all partitions in the group.
- At-least-once is the practical default; idempotency is the fix: Enable retries=Integer.MAX_VALUE and enable.idempotence=true on the producer. The idempotent producer assigns a sequence number to each batch; the broker deduplicates retried batches at the broker level. Downstream consumers still need application-level deduplication by event ID for exactly-once application semantics.
- Facebook-scale sizing: At 1 trillion messages per day, a single Kafka cluster needs thousands of partitions. Each broker is sized for NIC bandwidth first (10-25 Gbps per node) because the OS page cache absorbs most reads and the bottleneck shifts from disk to network at scale.