Kafka Clone
Design the internals of a durable, high-throughput message streaming platform: from a single-broker write path to a multi-partition, multi-datacenter system capable of Facebook-scale event ingestion.
What is a distributed message streaming system?
Kafka is a distributed commit log: producers append messages to named topics, consumers read them at their own pace using persistent offsets. The key engineering challenge is not the pub/sub concept itself but achieving millions of sequential disk writes per second on commodity hardware while tracking offset state for thousands of independent consumer groups. This question tests the intersection of I/O mechanics, distributed consensus, and horizontal partition scaling, which is why it appears in staff-level interviews at companies running high-ingestion data platforms. I'd lead with the sequential I/O insight in any interview because every other design decision in Kafka falls out of that one constraint.
Functional Requirements
Core Requirements
- Producers publish messages to named topics.
- Consumers subscribe and read messages from a topic with persistent offsets.
- Messages are durable and survive broker restarts.
- Support at-least-once delivery.
Scope Exclusions
- Stream processing (Flink, Spark Streaming) beyond basic consumer logic.
- Schema registry or message transformation.
Non-Functional Requirements
Core Requirements
- Write throughput: The system handles at least 10 million messages per second across all topics.
- Latency: End-to-end producer-to-consumer latency is under 10ms p99 at normal load.
- Durability: Zero data loss on a single broker restart; messages survive with replication factor of 2 or more.
- Availability: 99.99% uptime. The cluster continues serving reads and writes during a single broker failure.
- Scale: Facebook-level ingestion, roughly 1 trillion messages per day distributed across thousands of topics.
Below the Line
- Exactly-once delivery semantics (builds on the at-least-once foundation we will design)
- Cross-datacenter replication (MirrorMaker / active-active topology)
- Built-in schema evolution (Confluent Schema Registry)
- Tiered storage to object stores (Kafka 3.x feature)
The hardest engineering problem in scope: Sustaining 10 million writes per second with sub-10ms latency and full durability is the fundamental tension in this design. Random writes to disk top out at roughly 100 MB/s; sequential appends saturate at 600+ MB/s. The entire write path is engineered around eliminating random I/O, and every other design decision in this article is downstream of that one insight.
Exactly-once semantics is below the line because it requires the idempotent producer protocol and transactional APIs, which form a separate abstraction layer on top of at-least-once. To add it, I would enable enable.idempotence=true on the producer (assigns a producer ID and sequence number to each message batch) and wrap the produce and offset commit in a Kafka transaction.
Cross-datacenter replication is below the line because it requires understanding active-passive versus active-active topologies and lag monitoring. To add it, I would deploy MirrorMaker 2 as a replication pipeline between clusters and use a consistent hashing scheme to route producers to their nearest datacenter.
Core Entities
- Topic: A named, logical stream of records. Topics are append-only and hold records indefinitely subject to configurable retention.
- Partition: An ordered, immutable append-only log. The unit of parallelism and distribution across brokers.
- Segment: A physical file on disk within a partition. Kafka writes to the active segment and rolls to a new one at a configurable size or time boundary.
- Record (Message): The fundamental unit: a key, a value, a timestamp, and an offset. The key drives partition assignment; the offset is the record's position within its partition.
- ConsumerGroup: A named set of consumers sharing a single logical read position per topic. Each partition is assigned to exactly one consumer within the group at any time.
- Offset: The monotonically increasing position of a record within a partition. Consumers commit their current offset to checkpoint read progress and resume after restarts.
Schema and serialization format are deferred to the deep dives. These six entities are sufficient to drive the API design and High-Level Design.
API Design
Kafka exposes an SDK-style client API rather than REST. I'll show the naive single-producer, single-consumer shape first, then the partitioned and keyed evolution that handles production workloads.
Produce messages (naive shape, no partitioning):
// Simple produce: one broker, one partition, fire and forget
producer.produce(topic="events", value="payload")
This is the starting point. No partition key, no acknowledgment mode, no batching. It works for development but has no durability or throughput guarantees.
Produce messages (evolved shape, keyed with acks):
// Keyed produce: consistent partition assignment by key hash
producer.produce(
topic="events",
key="user_id:12345", // murmurhash(key) % num_partitions = target partition
value="payload",
acks="all" // wait for all ISR replicas to acknowledge before returning
)
The key determines which partition receives the message. All messages with the same key land on the same partition, preserving ordering per key. acks=all waits for the leader and all in-sync replicas to acknowledge before returning to the caller.
Subscribe and poll (consumer side):
// Register consumer in a named group; broker assigns partitions
consumer.subscribe(topics=["events"], group_id="analytics-workers")
// Poll loop: fetch records from last committed offset
while running:
records = consumer.poll(timeout_ms=500)
for record in records:
process(record)
consumer.commit(offsets=current_offsets) // checkpoint after successful processing
subscribe() registers the consumer in a consumer group. The broker group coordinator assigns partitions to each consumer in the group. commit() advances the checkpoint so a restart resumes from the correct position rather than replaying from the beginning.
Admin API (topic management):
// Partition count controls max consumer parallelism; set it right at creation
admin.create_topic(
name="events",
num_partitions=64,
replication_factor=3
)
Partition count is set at topic creation and directly controls maximum parallel consumer throughput. Use replication_factor=3 for all production topics; this tolerates one broker failure without data loss.
High-Level Design
1. Producers publish messages to named topics
The write path at its simplest: a producer connects to a broker, the broker appends the message to a partition file on disk, and returns an acknowledgment.
Components:
- Producer Client: The application sending records. Handles serialization, batching, and retry.
- Broker: The single server that owns the partition. Appends each incoming record to the active segment file on disk.
- Partition Log: An append-only file. Records are never modified; they are only appended and eventually deleted by the retention policy.
Request walkthrough:
- Producer calls
produce(topic="events", value="payload"). - The broker receives the produce request and deserializes the record.
- The broker appends the serialized record to the active segment file for partition 0.
- The broker assigns the next integer offset and writes it to the segment index.
- The broker acknowledges success back to the producer.
This is the write path only: one producer, one broker, one partition file. There is no fault tolerance and no parallelism here; both are added in later requirements and deep dives. I start every Kafka walkthrough with this single-partition sketch because it forces the interviewer to see the simplest possible system before any complexity is introduced.
2. Consumers subscribe and read with persistent offsets
Consumers do not move messages off the broker. They track their own position (offset) in each partition and fetch records at their own pace, which means the same data can be read by multiple independent consumer groups without any coordination between them.
Components:
- Consumer Client: Calls
poll()in a loop. Fetches records from a specific partition starting at the last committed offset. - Offset Store: The broker records the last committed offset per consumer group and partition. On restart, the consumer resumes from this stored checkpoint.
- Group Coordinator: A designated broker partition that tracks which consumers are alive and which partitions each consumer owns.
Request walkthrough:
- Consumer calls
subscribe(topics=["events"], group_id="my-group"). - The group coordinator assigns partition 0 to this consumer.
- Consumer calls
poll(). The broker returns records starting at the consumer's last committed offset. - Consumer processes records, then calls
commit(offsets)to checkpoint progress. - On restart, the consumer fetches its last committed offset from the broker and resumes from that position.
Adding offset tracking separates Kafka from a simple file tail. Multiple consumer groups can read the same topic completely independently, consumers can replay from any offset, and restarts are lossless. I'd call out the offset model explicitly in an interview because it is the single feature that makes Kafka fundamentally different from RabbitMQ.
3. Messages survive broker restarts (durability)
A file on a single broker is not durable. To survive restarts, records must be replicated: the leader broker writes each record, and follower brokers replicate it before the producer receives an acknowledgment.
Components:
- Leader Partition: The designated broker that handles all produce requests for a partition.
- Follower Replicas: Brokers that maintain a copy of the leader's partition log by continuously fetching new records.
- In-Sync Replica Set (ISR): The set of replicas that are fully caught up with the leader (within
replica.lag.time.max.ms). Theacks=allsetting means the leader only acks once every ISR member has persisted the record. - ZooKeeper / KRaft Controller: Manages cluster metadata and runs leader election when a broker goes down.
Request walkthrough:
- Producer sends a record with
acks=all. - Leader broker appends the record to its own log.
- Follower brokers fetch the new record and append it to their local copies.
- Once all ISR members confirm they have persisted the record, the leader sends the ACK to the producer.
- If the leader crashes, the KRaft controller elects a new leader from the ISR. Every ISR member already has the record, so no data is lost.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.