Kafka vs. RabbitMQ
When to choose Kafka vs. RabbitMQ: log-based vs. traditional broker, message replay, ordering guarantees, fan-out patterns, retention, and which architecture fits each messaging pattern.
TL;DR
| Dimension | Choose Kafka | Choose RabbitMQ |
|---|---|---|
| Throughput | Need 100K+ msgs/sec per broker with batched writes | 20K-50K msgs/sec is sufficient, individual message latency matters more |
| Message replay | Must replay historical events (bug fixes, new consumers, auditing) | Messages are fire-and-forget once processed |
| Ordering | Need strict per-entity ordering (events for user X always in order) | Ordering is not critical, or single-consumer-per-queue is acceptable |
| Routing | Topic-based fan-out with independent consumer group offsets | Complex per-message routing rules (headers, patterns, priority, DLX) |
| Consumer model | Pull-based, consumers control pace, backpressure is natural | Push-based, broker manages delivery, prefetch tuning |
| Retention | Need days/weeks/indefinite retention for event sourcing or audit | Messages are transient, delete on ACK |
Default answer: Use Kafka for event streaming and high-throughput log pipelines. Use RabbitMQ for task queues, request-reply patterns, and complex routing. They solve different problems.
The Framing
Your order service publishes an "order created" event. Three downstream services need it: inventory, billing, and analytics. On Monday, the analytics team deploys a bug that silently drops half the events. They discover it Wednesday.
With RabbitMQ, those events are gone. The moment each message was acknowledged, the broker deleted it. The analytics team has no way to reprocess Monday's and Tuesday's orders without re-publishing from the source.
With Kafka, the analytics team resets their consumer group offset to Monday at midnight and replays every event. The fix is deployed, reprocessing completes in an hour, and no other consumer is affected. This is the moment most teams realize the two tools solve fundamentally different problems.
RabbitMQ is a message broker. It routes individual messages from producers to consumers, and its job is done when the consumer acknowledges. Kafka is a distributed commit log. It persists an ordered, immutable stream of records, and any number of consumers can read from any position independently.
This distinction cascades into everything: ordering guarantees, throughput characteristics, consumer models, operational patterns, and failure recovery. I've seen teams waste months trying to make RabbitMQ behave like Kafka (or vice versa) because they picked the tool before understanding which problem they had.
How Each Works
Kafka: Distributed Commit Log
Kafka organizes data into topics, and each topic is split into partitions. Each partition is an append-only, ordered log of records stored on disk. Producers write to the end of a partition, and consumers read from any position by tracking an offset (a sequential number).
Consumer groups are the parallelism unit. Within a group, each partition is assigned to exactly one consumer. If you have 12 partitions and 4 consumers in a group, each consumer reads 3 partitions. Adding a fifth consumer rebalances the assignment. A separate consumer group reads the same data independently at its own pace.
# Producer: append to a topic with a partition key
producer.send(
    topic="orders",
    key="user_456",  # Same key = same partition = ordered
    value=serialize(order_event),
    headers={"event_type": "order.created"}
)

# Consumer: poll for records, commit offset after processing
while True:
    records = consumer.poll(timeout_ms=100)
    for record in records:
        process(record)
    consumer.commit()  # Mark offset as processed
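The 12-partition, 4-consumer example above can be sketched in plain Python. This is a simplified illustration, not Kafka's actual code: `partition_for` uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2), and `assign_round_robin` is a hypothetical helper that ignores stickiness and rack awareness.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (assumption): the same key always maps to the same partition.
    return zlib.crc32(key) % num_partitions


def assign_round_robin(num_partitions: int, consumers: list) -> dict:
    # Each partition goes to exactly one consumer in the group.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


four = assign_round_robin(12, ["c1", "c2", "c3", "c4"])
five = assign_round_robin(12, ["c1", "c2", "c3", "c4", "c5"])  # after a rebalance
print(four)
print(five)
```

With four consumers every member owns 3 partitions. With five, the split becomes uneven (3, 3, 2, 2, 2), which is one reason partition counts are often chosen as a multiple of the expected consumer count.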
Retention is time-based or size-based (default 7 days). Records stay on disk whether or not any consumer has read them. Log compaction keeps only the latest value per key, useful for CDC and materialized views.
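To make log compaction concrete, here is a minimal sketch under simplifying assumptions: the log is a list of hypothetical `(offset, key, value)` tuples, and the whole log is compacted at once. Real Kafka compacts only closed segments in the background and also handles tombstones (null values), which this sketch ignores.

```python
def compact(log):
    """Keep only the newest record per key, preserving offset order."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values())


log = [
    (0, "user_1", "email=a@example.com"),
    (1, "user_2", "email=b@example.com"),
    (2, "user_1", "email=c@example.com"),
]
compacted = compact(log)
print(compacted)
```

Offsets are preserved rather than renumbered, so the record for `user_1` keeps offset 2 and the compacted log has a gap at offset 0.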
Kafka uses in-sync replicas (ISR) for durability. With a replication factor of N, each partition has one leader and N-1 followers. Writes go to the leader, and followers replicate by fetching from it. A write is considered committed once every member of the ISR has replicated it, and a producer using acks=all is acknowledged only at that point. If a follower falls too far behind, it is removed from the ISR until it catches up.
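The commit rule can be expressed as a small calculation: the committed position (the high watermark) is the smallest log end offset among ISR members. The sketch below is an illustration with made-up replica names and offsets, not broker code.

```python
def high_watermark(log_end_offsets: dict, isr: set) -> int:
    # Records below this offset are replicated on every in-sync replica.
    return min(log_end_offsets[replica] for replica in isr)


log_end_offsets = {"leader": 120, "follower_a": 118, "follower_b": 90}

hw_full = high_watermark(log_end_offsets, {"leader", "follower_a", "follower_b"})
# follower_b has fallen behind and is dropped from the ISR
hw_shrunk = high_watermark(log_end_offsets, {"leader", "follower_a"})
print(hw_full, hw_shrunk)
```

Shrinking the ISR lets the committed position advance from 90 to 118, which is why a lagging follower is removed rather than allowed to stall producers.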
KRaft (Kafka Raft) replaces ZooKeeper for metadata management. It has been production-ready since Kafka 3.3, and Kafka 4.0 removed ZooKeeper mode entirely. The controller quorum handles broker registration, partition assignment, and leader election using Raft consensus. My recommendation for new clusters: always use KRaft.
Exactly-once semantics require three components working together. Idempotent producers (enable.idempotence=true) guarantee that retried sends produce exactly one record per message (the broker deduplicates using producer ID and sequence number). Transactional writes wrap read-process-write operations in an atomic transaction: consumer reads from input topic, processor transforms, producer writes to output topic. If any step fails, the entire transaction aborts. The isolation.level=read_committed setting on downstream consumers ensures they only see records from committed transactions.
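The deduplication step is easier to see in code. This is a deliberately simplified model, assuming one sequence number per record and a hypothetical `PartitionLog` class; the real broker tracks sequence numbers per batch and per producer epoch.

```python
class PartitionLog:
    """Toy partition that rejects retried sends it has already appended."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last appended sequence number

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # duplicate from a retry: acknowledge, do not append
        if seq != last + 1:
            raise ValueError("out-of-order sequence number")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = PartitionLog()
first = log.append(producer_id=7, seq=0, value="order.created")
retry = log.append(producer_id=7, seq=0, value="order.created")  # network retry
second = log.append(producer_id=7, seq=1, value="order.paid")
print(log.records)
```

The retry is acknowledged without being written a second time, so the log ends up with exactly two records.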
Consumer group rebalancing redistributes partitions when consumers join or leave. The cooperative sticky assignor (CooperativeStickyAssignor, which you may need to select explicitly via partition.assignment.strategy depending on your client version) minimizes partition movement during rebalancing. Processing pauses briefly on affected partitions during a rebalance. At scale, this pause causes consumer lag spikes. Setting session.timeout.ms=45000 and heartbeat.interval.ms=15000 (one third of the session timeout) balances fast failure detection against unnecessary rebalances from brief network hiccups.
Tiered storage (KIP-405, early access in Kafka 3.6 and production-ready in 3.9, also available in Confluent Platform) offloads old log segments to object storage such as S3 while keeping recent segments on local disk. This dramatically reduces broker disk costs for topics with long retention (30+ days). Brokers store the "hot" data locally for low-latency reads, and transparently fetch "cold" data from object storage when consumers read historical offsets. For teams that want infinite retention without proportionally scaling disk, tiered storage is a game-changer.
Kafka's zero-copy optimization (sendfile system call) transfers data directly from the page cache to the network socket without copying through user space. This is why Kafka can sustain multi-GB/s read throughput per broker with minimal CPU usage. Consumers reading recent data (still in page cache) get near-memory-speed performance. Consumers reading data older than the page cache pay disk I/O cost. Zero-copy does not apply when TLS is enabled, because the broker has to encrypt the data in user space first.
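The same system call is reachable from Python, which makes for a quick illustration. This sketch uses `socket.sendfile`, which relies on `os.sendfile` where the platform supports it and otherwise falls back to ordinary sends. The temporary file stands in for a log segment, and the socket pair stands in for a broker-to-consumer connection; both are assumptions made for the demo.

```python
import os
import socket
import tempfile


def send_segment(path: str, sock: socket.socket) -> int:
    # The kernel moves file pages to the socket; no read() into a Python buffer.
    with open(path, "rb") as segment:
        return sock.sendfile(segment)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)  # pretend this is a log segment
    segment_path = tmp.name

sender, receiver = socket.socketpair()
sent = send_segment(segment_path, sender)
sender.close()

received = b""
while chunk := receiver.recv(65536):
    received += chunk
receiver.close()
os.unlink(segment_path)
print(sent, len(received))
```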
# Key Kafka producer configuration
enable.idempotence: true # Exactly-once per partition
acks: all # Wait for all ISR replicas
retries: 2147483647 # Infinite retries (idempotent)
max.in.flight.requests.per.connection: 5 # Safe with idempotence
compression.type: lz4 # Batch compression
linger.ms: 5 # Batch for 5ms before sending
batch.size: 16384 # 16 KB batches
RabbitMQ: AMQP Broker with Exchange Routing
RabbitMQ implements AMQP 0-9-1. Producers publish messages to exchanges, not queues directly. Exchanges route messages to queues based on bindings and routing keys. The exchange type determines the routing algorithm.
# Producer: publish to an exchange with a routing key
channel.basic_publish(
    exchange="order_events",
    routing_key="order.created",
    body=serialize(order_event),
    properties=pika.BasicProperties(
        delivery_mode=2,  # Persistent to disk
        content_type="application/json"
    )
)

# Consumer: subscribe to a queue, ACK after processing
def callback(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(
    queue="billing_orders",
    on_message_callback=callback
)
channel.start_consuming()  # Block and dispatch deliveries to callback
Four exchange types handle different routing patterns:
- Direct: exact match on routing key (e.g., order.created matches only order.created)
- Topic: wildcard pattern matching (order.* matches order.created, #.created matches anything ending in .created)
- Fanout: broadcast to all bound queues regardless of routing key
- Headers: match on message header attributes instead of routing key
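Topic exchange matching is the least obvious of the four, so here is a minimal sketch of the rules: keys are split into dot-separated words, `*` matches exactly one word, and `#` matches zero or more words. The function name `topic_matches` is my own, and this is an illustration of the semantics rather than RabbitMQ's implementation (which uses a trie).

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    def match(p, k):
        if not p:
            return not k  # pattern exhausted: match only if key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the routing key
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("order.*", "order.created"))      # True
print(topic_matches("order.*", "order.created.eu"))   # False: '*' is one word
print(topic_matches("#.created", "eu.order.created")) # True
```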
RabbitMQ pushes messages to consumers (the broker initiates delivery). Prefetch count controls how many unacknowledged messages a consumer can hold, providing built-in backpressure. Once a consumer ACKs a message, it is deleted from the queue.
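Dead letter exchanges (DLX) capture messages that are rejected without requeue, expire (TTL), or are dropped because the queue exceeded its length limit. Priority queues deliver higher-priority messages first. The maximum priority is set per queue (1-255, though the RabbitMQ documentation recommends keeping it small, around 1-5). Quorum queues (introduced in 3.8) provide Raft-based replication for high availability, replacing the older mirrored queue approach, which RabbitMQ 4.0 removed.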
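These features are switched on through optional queue arguments at declaration time. The sketch below only builds the argument dictionaries, so it runs without a broker. The `x-` argument names are RabbitMQ's; the helper functions and the `billing_dlx` exchange name are hypothetical.

```python
def quorum_queue_arguments(dead_letter_exchange, max_length=None, message_ttl_ms=None):
    """Arguments for a replicated work queue that dead-letters failed messages."""
    args = {
        "x-queue-type": "quorum",
        "x-dead-letter-exchange": dead_letter_exchange,
    }
    if max_length is not None:
        args["x-max-length"] = max_length  # overflow is dead-lettered
    if message_ttl_ms is not None:
        args["x-message-ttl"] = message_ttl_ms  # expired messages are dead-lettered
    return args


def priority_queue_arguments(max_priority):
    """Arguments for a classic priority queue."""
    if not 1 <= max_priority <= 255:
        raise ValueError("x-max-priority must be between 1 and 255")
    return {"x-max-priority": max_priority}


billing_args = quorum_queue_arguments("billing_dlx", max_length=10000, message_ttl_ms=60000)
print(billing_args)
# With pika, these would be passed as:
# channel.queue_declare(queue="billing_orders", durable=True, arguments=billing_args)
```

I keep the priority arguments in a separate helper because `x-max-priority` is a classic queue feature; check the documentation for your RabbitMQ version before combining it with quorum queues.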
Message flow in RabbitMQ works like this: the producer sends a message via an AMQP channel to the exchange. The exchange evaluates bindings and routes the message to zero or more queues. Each queue stores the message in memory or on disk depending on persistence settings. The broker then pushes the message to a subscribed consumer based on the prefetch count.
RabbitMQ's connection model uses multiplexed channels over a single TCP connection. One connection can have hundreds of channels, each handling independent message streams. This is efficient for applications that publish and consume from many queues: you open one TCP connection and multiplex the traffic over lightweight channels. Kafka uses one TCP connection per broker, with the client library managing which partitions map to which broker connections internally.
Prefetch tuning is critical for throughput. basic.qos(prefetch_count=1) means the broker sends one message at a time, waiting for ACK before sending the next. This guarantees fair distribution but limits throughput. basic.qos(prefetch_count=100) sends up to 100 unacknowledged messages, keeping the consumer busy but risking message concentration if processing is slow. My default: start at 20-50 for background jobs, 1-5 for long-running tasks, and tune based on consumer processing time.
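The message concentration risk is easiest to see with a late-joining consumer. The following is a toy simulation, assuming a backlog of 100 messages, no ACKs during the window, and a simplified round-robin dispatcher; `dispatch` and `late_joiner` are hypothetical names, not broker internals.

```python
from collections import deque


def dispatch(ready: deque, unacked: dict, prefetch: int) -> None:
    """Push ready messages to consumers that still have prefetch capacity."""
    while ready:
        eligible = [c for c in unacked if len(unacked[c]) < prefetch]
        if not eligible:
            break  # everyone is at their prefetch limit
        for consumer in eligible:
            if not ready:
                break
            unacked[consumer].append(ready.popleft())


def late_joiner(prefetch: int, backlog: int = 100):
    ready = deque(range(backlog))
    unacked = {"A": []}
    dispatch(ready, unacked, prefetch)  # only consumer A is connected
    unacked["B"] = []                   # consumer B scales up later
    dispatch(ready, unacked, prefetch)
    return len(unacked["A"]), len(unacked["B"]), len(ready)


print(late_joiner(prefetch=100))  # (100, 0, 0): A holds everything, B is idle
print(late_joiner(prefetch=10))   # (10, 10, 80): backlog stays shareable
```

A high prefetch hands the entire backlog to the first consumer, so adding consumers does nothing until A works through its buffer.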
Quorum queues use Raft consensus to replicate messages across a configurable number of nodes (default 3). Writes require majority acknowledgment (2 of 3 nodes). This provides better data safety than mirrored queues (which could lose messages during network partitions) at the cost of slightly higher write latency. For new deployments, always use quorum queues over classic mirrored queues.
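The majority arithmetic behind "2 of 3 nodes" generalizes as follows. This is a small worked calculation, with helper names of my own choosing.

```python
def quorum_size(replicas: int) -> int:
    # A strict majority of the replica set must acknowledge each write.
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    # Nodes that can be lost while a majority is still reachable.
    return replicas - quorum_size(replicas)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from 3 to 4 replicas raises the quorum to 3 without tolerating any additional failures, which is why odd replica counts are the usual choice.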
The RabbitMQ management plugin provides an HTTP API and web UI for monitoring queue depths, consumer counts, message rates, and connection states. Key operational metrics to monitor: queue depth (messages ready for delivery), unacked count (messages delivered but not yet acknowledged), publish rate vs. consume rate (if publish > consume, the queue grows), and memory/disk alarms (RabbitMQ blocks publishers when memory exceeds the watermark threshold, default 40% of system RAM).
RabbitMQ Streams (introduced in 3.9) add Kafka-like append-only log semantics to RabbitMQ. Streams support offset-based consumption, time-based offset seeking, and message replay. This narrows the gap between the two tools, but a single stream has no partitions or consumer groups (super streams, added in 3.11, partition a stream across several underlying streams), and the ecosystem is less mature than Kafka's. I view RabbitMQ Streams as useful for "I mostly need RabbitMQ but sometimes need replay for one queue" rather than a replacement for Kafka's streaming architecture.
RabbitMQ's plugin ecosystem extends its capabilities. The Shovel plugin copies messages between brokers (useful for cross-datacenter replication). The Federation plugin loosely connects brokers across geographic regions with eventual consistency. The consistent hash exchange distributes messages across queues using consistent hashing, providing load-balanced consumption without application-side routing logic.
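For readers unfamiliar with consistent hashing, here is a minimal hash ring sketch. It illustrates the general technique, not the plugin's implementation: the `HashRing` class, the SHA-256 hash, and the fixed 100 points per queue are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, queues, points_per_queue=100):
        # Each queue owns many points on the ring to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{queue}#{i}"), queue)
            for queue in queues
            for i in range(points_per_queue)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def route(self, routing_key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self._points, self._hash(routing_key)) % len(self._ring)
        return self._ring[idx][1]


ring3 = HashRing(["q1", "q2", "q3"])
ring4 = HashRing(["q1", "q2", "q3", "q4"])
keys = [f"user_{i}" for i in range(1000)]
moved = [k for k in keys if ring3.route(k) != ring4.route(k)]
print(len(moved), "of", len(keys), "keys moved after adding q4")
```

Adding a fourth queue moves only a fraction of the keys, and every key that moves lands on the new queue. Keys that stay put keep their per-key ordering on the same queue.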
Publisher confirms are RabbitMQ's equivalent of Kafka's acks. With confirms enabled, the broker sends an acknowledgment to the producer after the message is written to disk (and replicated, for quorum queues). Without confirms, a crash between the network send and disk write loses the message. Always enable publisher confirms for persistent messages in production. The latency cost is 1-5ms per confirmed message.
RabbitMQ's lazy queues store messages to disk immediately instead of keeping them in memory first. For queues that build up large backlogs (consumers are slow or offline), lazy queues prevent memory exhaustion. The trade-off is higher per-message latency (disk write on every enqueue). Default (non-lazy) queues are faster for high-throughput, low-backlog scenarios where messages are consumed almost immediately. Since RabbitMQ 3.12, classic queues always behave much like lazy queues and the lazy mode setting is ignored, so this choice only applies to older releases.
Key Configuration Differences