Kafka internals: log segments, ISR, and consumer groups
How Kafka stores data in log segments, how the ISR (in-sync replicas) set ensures durability, how consumer group rebalancing works under the hood, and how compacted topics enable stateful consumers.
The problem
Your team ships a microservices architecture. Service A produces order events. Services B, C, and D all need those events. You use a traditional message queue with competing consumers, so each message is delivered to exactly one consumer. But B, C, and D all need every event. You create three queues and fan out the message to all three.
Now multiply that by 40 services. The fan-out configuration becomes a maintenance disaster. Worse, Service D falls behind for two hours during a deploy, and by the time it recovers, the queue has already purged those messages. D has lost data permanently because the queue deletes messages after delivery.
Traditional message queues treat consumption as destructive: once a consumer reads a message, it is gone (or at least invisible). They conflate "delivery" with "deletion." If any consumer needs to re-read historical data, replay events, or process at a different pace, the model breaks. This is the problem Kafka's log-based architecture solves.
What it is
Apache Kafka is a distributed commit log: an append-only, ordered sequence of records that persists data to disk and allows multiple consumers to read independently at their own pace without destroying the data.
Think of it as a shared notebook in a library. Anyone can write new entries at the end, but nobody tears out pages. Every reader has their own bookmark. Some readers are on page 50, others on page 5,000. They all read the same notebook without interfering with each other.
A topic is a logical name. Each topic is split into N partitions, and each partition is a separate append-only log, stored as its own set of segment files on the broker that owns it.
How it works
The core write and read paths in Kafka flow through the partition leader, segment files, and consumer offsets.
The write path works as follows:
- The producer serializes the record and hashes the key (or round-robins if no key) to select a partition.
- The producer sends the record to the partition leader broker.
- The leader appends the record to the active log segment file on disk.
- Follower replicas pull the record from the leader and append it to their own segment files.
- Once all in-sync replicas (ISR) acknowledge, the leader considers the record committed.
- The leader responds to the producer with the assigned offset.
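The partitioning step above can be sketched in a few lines. This is an illustrative stand-in, not the real producer: Kafka's default partitioner hashes the serialized key with murmur2, while this sketch uses `zlib.crc32` purely to show the shape of the logic (deterministic hash for keyed records, round-robin for keyless ones).

```python
import itertools
import zlib

_round_robin = itertools.count()  # stand-in for the producer's sticky/round-robin state

def choose_partition(key, num_partitions):
    """Pick a partition the way a Kafka producer does (simplified).

    A keyed record hashes deterministically, so the same key always
    lands on the same partition (preserving per-key ordering). A
    keyless record rotates across partitions. The real producer uses
    murmur2; crc32 here is an illustrative substitute.
    """
    if key is None:
        return next(_round_robin) % num_partitions
    return zlib.crc32(key) % num_partitions
```

Because the hash is deterministic, `choose_partition(b"order-42", 6)` returns the same partition on every call, which is what gives Kafka per-key ordering.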
The read path is pull-based. Consumers issue fetch requests to the leader, specifying the offset to read from. The leader reads from the segment file (often served from the OS page cache) and returns a batch of records. The consumer advances its offset.
Kafka's read path exploits zero-copy I/O (sendfile() on Linux). Instead of copying data from the page cache into the Kafka JVM heap and then into the network socket buffer, sendfile() transfers data directly from the OS page cache to the network interface. This eliminates two memory copies per fetch and is a major reason Kafka sustains gigabytes/second of consumer throughput.
Interview tip: Kafka is pull, not push
When asked "how do consumers get data from Kafka," state that consumers pull data by polling the broker. This is a deliberate design choice: pull-based consumption lets each consumer control its own read rate and avoids overwhelming slow consumers. Traditional push-based queues force the consumer to keep up with the broker's pace.
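The pull model above can be shown with an in-memory stand-in for a partition. The `fetch` and `poll_loop` names are assumptions for illustration; the point is that the consumer, not the broker, decides when to fetch and how far its offset advances.

```python
def fetch(log, offset, max_records=100):
    """Broker side of a fetch request: return a batch starting at offset."""
    return log[offset:offset + max_records]

def poll_loop(log):
    """Consumer side: pull batches at its own pace, advancing its own offset.

    A slow consumer simply polls less often or with smaller batches;
    the broker never pushes data at it.
    """
    offset = 0
    consumed = []
    while True:
        batch = fetch(log, offset, max_records=2)
        if not batch:
            break  # caught up with the log end
        consumed.extend(batch)
        offset += len(batch)  # the consumer advances its position
    return consumed
```

Note that nothing in the log is mutated: another consumer with its own offset could run the same loop over the same `log` concurrently.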
Log segments and indexes
Each partition is stored as a series of log segment files on disk. A segment is a pair of files:
- `.log` file: the actual record data (sequential binary, append-only).
- `.index` file: a sparse index mapping offset to byte position in the `.log` file.
When a consumer seeks to offset 1500, Kafka:
- Finds the segment whose base offset is at most 1500 (segment starting at offset 1000).
- Searches the sparse `.index` file for the largest entry no greater than offset 1500, getting a byte position.
- Scans forward from that byte position in the `.log` file until it reaches offset 1500.
This lookup is sub-millisecond for well-sized segments. The sparse index keeps only every Nth entry (configurable via index.interval.bytes, default 4 KB), so index files stay small.
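The seek described above reduces to a binary search over the sparse index followed by a short forward scan. A minimal sketch (the index entries here are made up for illustration):

```python
import bisect

# Sparse index for a segment with base offset 1000:
# (offset, byte_position) pairs, one entry roughly every index.interval.bytes.
sparse_index = [(1000, 0), (1100, 4096), (1300, 8192), (1450, 12288)]

def locate(target_offset):
    """Return the byte position to start scanning from for target_offset.

    Binary-searches for the largest index entry whose offset is <= the
    target; the reader then scans forward in the .log file from there.
    """
    offsets = [off for off, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[i][1]
```

A lookup for offset 1500 lands on the (1450, 12288) entry, so at most one index interval (4 KB by default) of log data is scanned linearly.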
Segment lifecycle:
- New records always append to the active segment.
- When the active segment hits the size limit (`log.segment.bytes`, default 1 GB) or the age limit (`log.roll.ms`), Kafka seals it and creates a new active segment.
- Sealed segments are eligible for deletion based on `retention.ms` (default 7 days) or `retention.bytes`.
- Kafka deletes entire segments at once, never individual records. A segment is only deleted when ALL records in it are past the retention threshold.
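The roll decision is a simple either/or check on size and age. A sketch, with defaults matching the values above (`should_roll` is an illustrative name, not a Kafka API):

```python
import time

def should_roll(segment_bytes, segment_created_at,
                max_bytes=1 << 30,              # log.segment.bytes: 1 GB
                max_age_ms=7 * 24 * 3600 * 1000):  # log.roll.ms: 7 days
    """Decide whether the active segment should be sealed.

    Kafka rolls when either limit is hit, whichever comes first; a
    low-traffic partition still rolls on age, a busy one rolls on size.
    """
    age_ms = (time.time() - segment_created_at) * 1000
    return segment_bytes >= max_bytes or age_ms >= max_age_ms
```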
| Parameter | Default | Effect |
|---|---|---|
| `log.segment.bytes` | 1 GB | Segment file size before rolling |
| `log.roll.ms` | 7 days | Max age of active segment before rolling |
| `retention.ms` | 7 days | How long to keep sealed segments |
| `retention.bytes` | -1 (unlimited) | Max total size per partition |
| `index.interval.bytes` | 4096 | Bytes between sparse index entries |
Small segments → too many file handles
Setting log.segment.bytes too low (e.g., 10 MB) creates thousands of tiny segment files per partition. Each open segment consumes a file descriptor. A broker with 1,000 partitions and small segments can exhaust the OS file descriptor limit, causing broker crashes. Keep segment size at 256 MB or above in production.
ISR replication protocol
Partitions are replicated across brokers for fault tolerance. Each partition has one leader and N-1 followers. Producers write only to the leader. Followers continuously pull (fetch) from the leader and append to their own logs.
In-Sync Replicas (ISR): The set of replicas that are "caught up" with the leader, meaning they have fetched up to the leader's log end within the last `replica.lag.time.max.ms` (default 30 seconds).
A record is considered committed when all replicas in the current ISR have acknowledged it. The high watermark (HW) is the offset of the last committed record. Consumers can only read up to the high watermark. This prevents consumers from reading data that might be lost if the leader fails before replication completes.
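The high watermark falls out of a simple rule: it is the minimum log-end offset across the current ISR, because that is the highest offset guaranteed to exist on every in-sync replica. A sketch (broker names are illustrative):

```python
def high_watermark(log_end_offsets, isr):
    """Compute the high watermark as the minimum log-end offset (LEO)
    among in-sync replicas. Everything below it exists on every ISR
    member, so it survives a leader failure and is safe to expose
    to consumers."""
    return min(log_end_offsets[replica] for replica in isr)

leo = {"broker-1": 120, "broker-2": 118, "broker-3": 95}
# broker-3 lags badly and has been dropped from the ISR, so it does
# not hold the high watermark down:
hw = high_watermark(leo, isr={"broker-1", "broker-2"})  # 118
```

This also shows why a shrinking ISR is dangerous: with fewer members, the high watermark advances on fewer acknowledgements, weakening the durability of "committed".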
ISR dynamics:
- If a follower falls behind by more than `replica.lag.time.max.ms`, the leader removes it from the ISR. The ISR shrinks, and fewer replicas are needed for commitment.
- When the lagging follower catches back up, the leader adds it back to the ISR.
- `min.insync.replicas` (default 1) sets the minimum ISR size for accepting writes with `acks=all`. If the ISR falls below this threshold, the broker rejects writes with a `NotEnoughReplicasException` rather than risking data loss.
I think of min.insync.replicas as the "how much data loss can I tolerate" knob. Set it to 1 and you get availability but risk data loss. Set it to the replication factor and you get maximum durability but reduced availability.
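The write-rejection behavior can be sketched as a leader-side gate on each produce request (the exception class here is a Python stand-in for Kafka's `NotEnoughReplicasException`):

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException."""

def accept_write(isr_size, acks, min_insync_replicas=2):
    """Leader-side check for a produce request.

    With acks=all, the write is rejected outright if the ISR is
    smaller than min.insync.replicas: the broker prefers to be
    unavailable rather than accept a write it cannot replicate
    to enough brokers.
    """
    if acks == "all" and isr_size < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR has {isr_size} member(s), need {min_insync_replicas}")
    return True
```

Note that `acks=1` writes still succeed when the ISR has shrunk to just the leader, which is exactly where the availability-for-durability trade shows up.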
| Configuration | Behavior | Tradeoff |
|---|---|---|
| `acks=0` | Producer does not wait for any ack | Maximum throughput, no durability guarantee |
| `acks=1` | Producer waits for leader ack only | Leader crash before replication = data loss |
| `acks=all` + `min.insync.replicas=2` | Waits for all ISR (min 2) | Strong durability, slightly higher latency |
| `acks=all` + `min.insync.replicas=RF` | Full replication factor required | Maximum durability, lowest availability |
Leader election: When the current leader fails, the controller (or KRaft quorum in newer Kafka) elects a new leader exclusively from the ISR. This guarantees the new leader has all committed records. The election is fast (typically under a second) because the ISR is already tracked in metadata.
If all ISR members are down simultaneously, Kafka has two options controlled by unclean.leader.election.enable:
- `false` (recommended): The partition stays offline until an ISR member recovers. No data loss, but unavailable.
- `true`: A lagging out-of-sync replica can become leader. The partition becomes available immediately, but committed records that only existed on the old ISR members are permanently lost.
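The election rule is small enough to state as code. A sketch (function and broker names are illustrative; the real controller also considers replica liveness and preferred leaders):

```python
def elect_leader(isr, unclean_election=False, all_replicas=()):
    """Pick a new leader when the current one fails.

    Clean election considers only ISR members, so the new leader is
    guaranteed to hold every committed record. Unclean election falls
    back to any live replica, trading potential loss of committed
    records for immediate availability.
    """
    if isr:
        return sorted(isr)[0]           # any ISR member is safe
    if unclean_election and all_replicas:
        return sorted(all_replicas)[0]  # may be missing committed records
    return None                          # partition stays offline
```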
Consumer group rebalancing
Kafka distributes partitions among consumers within a consumer group. If the group has 3 consumers and a topic has 6 partitions, each consumer gets 2 partitions. When a consumer joins, leaves, or crashes, partitions must be redistributed. This is a rebalance.
The rebalance problem: During a rebalance, all consumers in the group stop processing. For a consumer group with 50 consumers and 200 partitions, a rebalance can take 30-60 seconds. During this window, no messages are processed. In a high-throughput system, this causes a backlog that takes minutes to drain.
Eager vs. cooperative rebalancing:
- Eager (default before Kafka 2.4): All consumers revoke all partitions, then reassign. Full stop-the-world.
- Cooperative (incremental, Kafka 2.4+): Only the partitions that need to move are revoked. Consumers that keep their partitions continue processing uninterrupted. Two-phase: first revoke migrating partitions, then assign them to new owners.
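The difference between the two protocols is which partitions get revoked. A sketch of the cooperative case, which computes only the partitions whose owner actually changes (assignment maps are illustrative):

```python
def rebalance_moves(old_assignment, new_assignment):
    """Return the partitions that must migrate between consumers.

    Under cooperative rebalancing only these partitions pause; every
    consumer whose assignment is unchanged keeps processing. Under the
    eager protocol, every partition in old_assignment would be revoked.
    """
    return {
        partition
        for partition, new_owner in new_assignment.items()
        if old_assignment.get(partition) != new_owner
    }

old = {"p0": "c1", "p1": "c1", "p2": "c2"}
new = {"p0": "c1", "p1": "c3", "p2": "c2"}  # consumer c3 joined, takes p1
# Cooperative: only p1 pauses. Eager: p0, p1, and p2 all pause.
```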
I always recommend cooperative rebalancing in production. The eager protocol is a relic that causes unnecessary downtime during routine scaling events.
Offset management: Each consumer tracks its position in each partition as a committed offset, stored in Kafka's internal __consumer_offsets topic (itself a compacted log).
| Parameter | Default | Effect |
|---|---|---|
| `auto.offset.reset` | latest | Where to start when no committed offset exists (earliest or latest) |
| `enable.auto.commit` | true | Auto-commit offsets periodically |
| `auto.commit.interval.ms` | 5000 | Interval between auto-commits |
| `session.timeout.ms` | 45000 | Time before broker considers consumer dead |
| `max.poll.interval.ms` | 300000 | Max time between poll() calls before rebalance |
Auto-commit can cause data loss or duplicates
With auto-commit enabled, offsets are committed on a timer regardless of whether processing succeeded. If the consumer crashes after auto-commit but before processing, you lose messages. If it crashes after processing but before auto-commit, you get duplicate processing on restart. For exactly-once semantics, disable auto-commit and commit manually after confirming your downstream write succeeded.
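The safe ordering is easy to show in a sketch: process first, commit second. A crash between the two steps causes reprocessing (duplicates), never loss, which is the opposite failure mode of auto-commit. Function names here are illustrative, not a Kafka client API:

```python
def consume_at_least_once(records, process, commit):
    """Manual offset management: commit only after processing succeeds.

    If the consumer crashes after process() but before commit(), the
    record is processed again on restart (at-least-once). With
    auto-commit the timer can fire before processing, losing the record.
    """
    for offset, record in enumerate(records):
        process(record)     # downstream write happens first
        commit(offset + 1)  # then advance the committed offset

committed = []
consume_at_least_once(["a", "b"], process=lambda r: None,
                      commit=committed.append)
# committed == [1, 2]
```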
Log compaction
Standard retention deletes old segments by time or size. Log compaction is a different retention mode: it keeps only the latest value for each key, discarding older values.
The log cleaner thread runs in the background. It reads the "dirty" (uncompacted) portion of the log, builds an in-memory map of key to latest offset, and rewrites the segment files keeping only the latest entry per key.
Key behaviors:
- A tombstone (record with null value) marks a key for deletion. The tombstone is retained for `delete.retention.ms` (default 24 hours) to allow downstream consumers to see the delete, then removed.
- Compaction never reorders records. Offsets are preserved, but gaps appear where old records were removed.
- The active (newest) segment is never compacted. Only sealed segments are candidates.
Use case: Compacted topics are ideal for changelog streams. A Debezium CDC connector captures database changes as a compacted topic. Kafka Streams state stores use compacted topics as their backing store. Any new consumer can rebuild full current state by reading the compacted topic from offset 0 without replaying years of history.
The cleaner's per-segment pass looks roughly like this in Python (an in-memory sketch; the real cleaner rewrites segment files on disk and then swaps them in atomically):

```python
def compact_segment(records, latest_offsets):
    """Keep only the newest record per key in a sealed segment.

    `records` is the segment's records in offset order; `latest_offsets`
    maps each key to the offset of its newest record, built by the
    cleaner's scan of the dirty portion of the log. Offsets are
    preserved; gaps appear where old values were dropped.
    """
    kept = []
    for record in records:
        if latest_offsets[record["key"]] == record["offset"]:
            kept.append(record)  # keep: this is the latest for this key
        # else: discard -- a newer record exists for this key
    return kept
```
Production usage
Kafka is one of the most widely deployed distributed systems in the industry. Nearly every large-scale tech company uses it as the backbone for event streaming, CDC pipelines, and inter-service communication.
| System | Usage | Notable behavior |
|---|---|---|
| LinkedIn | Event sourcing backbone for 7+ trillion messages/day | Kafka was created at LinkedIn. Uses custom extensions for cross-datacenter replication (MirrorMaker). |
| Uber | Real-time trip event streaming and dispatch | Runs 1000+ Kafka brokers. Built uReplicator for multi-region active-active replication. |
| Confluent Cloud | Managed Kafka as a service | Uses tiered storage to offload old segments to object storage (S3), cutting broker disk costs by 50%+. |
| Netflix | Event pipeline for 700+ billion events/day | Uses Kafka with custom consumer libraries that handle offset management and dead letter queues. |
| Datadog | Metrics ingestion pipeline | Processes 40+ trillion data points/day through Kafka. Key design: partition by metric name for locality in downstream aggregation. |
Limitations and when NOT to use it
Kafka is powerful but not universally appropriate. I have seen teams adopt Kafka when a simple HTTP call or a PostgreSQL LISTEN/NOTIFY would have been sufficient.
- Not a database. Kafka lacks indexing, secondary lookups, and ad-hoc queries. If you need to query by anything other than partition + offset, Kafka is the wrong tool. Use it as a transport layer, not a storage layer.
- Partition count is hard to change. Adding partitions redistributes keys, breaking ordering guarantees for existing keys. You cannot reduce partition count without recreating the topic. Plan partition counts carefully upfront.
- Consumer rebalancing causes processing pauses. Even with cooperative rebalancing, adding or removing consumers triggers a brief processing gap. For latency-critical consumers, this is disruptive.
- No per-message routing or filtering at the broker. Unlike RabbitMQ, Kafka has no topic exchanges or header-based routing. Consumers receive all records in their assigned partitions and must filter client-side.
- Operational complexity is high. Running a Kafka cluster requires tuning JVM heap, page cache sizing, disk throughput, replication configs, and ZooKeeper (or KRaft). Managed services (Confluent Cloud, Amazon MSK) reduce but do not eliminate this burden.
- Head-of-line blocking within a partition. If one record in a partition is slow to process, it blocks all subsequent records in that partition for that consumer. There is no per-message parallelism within a single partition.
Interview cheat sheet
- When asked about Kafka's storage model, state that Kafka stores data as an append-only commit log on disk. Records are not deleted after consumption. Multiple consumer groups read independently from the same log at different offsets.
- When asked "how does Kafka achieve high throughput," say sequential disk I/O (append-only writes), OS page cache for reads, zero-copy transfer via `sendfile()`, and batching at both the producer and consumer level.
- When asked about durability guarantees, explain the ISR mechanism: `acks=all` with `min.insync.replicas=2` ensures a record is written to at least 2 brokers before the producer gets an ack. If the ISR shrinks below `min.insync.replicas`, the broker rejects writes.
- When asked about ordering, state that Kafka guarantees order only within a partition, not across partitions. To maintain order for a given entity, use that entity's ID as the partition key.
- When asked about exactly-once semantics, mention Kafka's idempotent producer (sequence numbers per partition) combined with transactional writes. The consumer side requires "read-process-write" with transactional offset commits.
- When asked about consumer group rebalancing, explain eager vs. cooperative rebalancing. Mention that eager causes full stop-the-world and cooperative (incremental) minimizes disruption by only migrating affected partitions.
- When asked about retention, distinguish time/size-based deletion (whole segments deleted) from log compaction (latest value per key kept). Compacted topics are perfect for CDC and state store changelogs.
- When asked to compare Kafka vs. traditional queues, emphasize the log model: non-destructive reads, consumer-controlled offset, replay capability, and partition-based parallelism. Traditional queues delete on dequeue and push messages to consumers.
Quick recap
- Kafka stores partitions as append-only log segments on disk: sequential `.log` files with sparse `.index` files for offset-to-byte-position lookups.
- The ISR (in-sync replicas) set tracks which replicas are caught up with the leader. Records are committed only when all ISR members acknowledge. `min.insync.replicas` controls the minimum ISR size for accepting writes.
- Consumer groups distribute partitions among consumers. Rebalancing (triggered by consumer joins, leaves, or crashes) can pause processing. Cooperative rebalancing minimizes disruption by migrating only affected partitions.
- Log compaction retains only the latest value per key, enabling stateful consumers and CDC workflows to rebuild state without replaying full history.
- Kafka achieves high throughput through sequential disk I/O, OS page cache reads, zero-copy transfer, and batching. It is a log, not a queue: consumption is non-destructive and offset-controlled.
- Use `acks=all` with `min.insync.replicas=2` for durable topics. Disable `unclean.leader.election.enable` for any topic where data loss is unacceptable. Disable auto-commit for exactly-once processing.
Related concepts
- Message queues - Kafka is often compared to traditional message queues, but its log-based model differs fundamentally in how it handles consumption and retention.
- Event-driven architecture - Kafka is the most common transport layer for event-driven systems, providing the durable event log that decouples producers from consumers.
- Write-ahead log - Kafka's partition is essentially a WAL. The same append-only, sequential write pattern that makes WALs fast makes Kafka fast.
- Zero-copy I/O - Kafka uses `sendfile()` to transfer data from disk to network socket without copying through userspace, a key reason for its high throughput.