Kafka vs. RabbitMQ
When to choose Kafka vs. RabbitMQ: log-based vs. traditional broker, message replay, ordering guarantees, fan-out patterns, retention, and which architecture fits each messaging pattern.
TL;DR
| Dimension | Choose Kafka | Choose RabbitMQ |
|---|---|---|
| Throughput | Need 100K+ msgs/sec per broker with batched writes | 20K-50K msgs/sec is sufficient, individual message latency matters more |
| Message replay | Must replay historical events (bug fixes, new consumers, auditing) | Messages are fire-and-forget once processed |
| Ordering | Need strict per-entity ordering (events for user X always in order) | Ordering is not critical, or single-consumer-per-queue is acceptable |
| Routing | Topic-based fan-out with independent consumer group offsets | Complex per-message routing rules (headers, patterns, priority, DLX) |
| Consumer model | Pull-based, consumers control pace, backpressure is natural | Push-based, broker manages delivery, prefetch tuning |
| Retention | Need days/weeks/indefinite retention for event sourcing or audit | Messages are transient, delete on ACK |
Default answer: Use Kafka for event streaming and high-throughput log pipelines. Use RabbitMQ for task queues, request-reply patterns, and complex routing. They solve different problems.
The Framing
Your order service publishes an "order created" event. Three downstream services need it: inventory, billing, and analytics. On Monday, the analytics team deploys a bug that silently drops half the events. They discover it Wednesday.
With RabbitMQ, those events are gone. The moment each message was acknowledged, the broker deleted it. The analytics team has no way to reprocess Tuesday's orders without re-publishing from the source.
With Kafka, the analytics team resets their consumer group offset to Monday at midnight and replays every event. The fix is deployed, reprocessing completes in an hour, and no other consumer is affected. This is the moment most teams realize the two tools solve fundamentally different problems.
RabbitMQ is a message broker. It routes individual messages from producers to consumers, and its job is done when the consumer acknowledges. Kafka is a distributed commit log. It persists an ordered, immutable stream of records, and any number of consumers can read from any position independently.
This distinction cascades into everything: ordering guarantees, throughput characteristics, consumer models, operational patterns, and failure recovery. I've seen teams waste months trying to make RabbitMQ behave like Kafka (or vice versa) because they picked the tool before understanding which problem they had.
How Each Works
Kafka: Distributed Commit Log
Kafka organizes data into topics, and each topic is split into partitions. Each partition is an append-only, ordered log of records stored on disk. Producers write to the end of a partition, and consumers read from any position by tracking an offset (a sequential number).
Consumer groups are the parallelism unit. Within a group, each partition is assigned to exactly one consumer. If you have 12 partitions and 4 consumers in a group, each consumer reads 3 partitions. Adding a fifth consumer rebalances the assignment. A separate consumer group reads the same data independently at its own pace.
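The assignment arithmetic can be sketched in a few lines of plain Python. This is a toy round-robin assignor, not the client's actual code (real clients ship configurable range, round-robin, and cooperative sticky assignors), but it shows the invariant: each partition belongs to exactly one group member.

```python
# Toy partition assignor: each partition goes to exactly one consumer
# in the group, spread round-robin across the (sorted) members.

def assign_partitions(num_partitions, members):
    members = sorted(members)
    assignment = {m: [] for m in members}
    for p in range(num_partitions):
        assignment[members[p % len(members)]].append(p)
    return assignment

# 12 partitions, 4 consumers -> 3 partitions each
four = assign_partitions(12, ["c1", "c2", "c3", "c4"])
assert all(len(ps) == 3 for ps in four.values())

# Adding a fifth consumer triggers a rebalance: sizes become 3,3,2,2,2
five = assign_partitions(12, ["c1", "c2", "c3", "c4", "c5"])
assert sorted(len(ps) for ps in five.values()) == [2, 2, 2, 3, 3]
```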
```python
# Producer: append to a topic with a partition key
producer.send(
    topic="orders",
    key="user_456",  # Same key = same partition = ordered
    value=serialize(order_event),
    headers={"event_type": "order.created"}
)

# Consumer: poll for records, commit offset after processing
while True:
    records = consumer.poll(timeout_ms=100)
    for record in records:
        process(record)
    consumer.commit()  # Mark offset as processed
```
Retention is time-based or size-based (default 7 days). Records stay on disk whether or not any consumer has read them. Log compaction keeps only the latest value per key, useful for CDC and materialized views.
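Compaction's "latest value per key" rule can be modeled in a few lines. This is a sketch of the semantics only, not the broker's segment-based implementation; `None` stands in for a tombstone record.

```python
# Sketch of log compaction semantics: keep only the latest record per
# key; a None value is a tombstone that deletes the key entirely.

def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        if value is None:
            latest.pop(key, None)   # tombstone: drop the key
        else:
            latest[key] = (offset, value)
    # surviving records, still in offset order
    return sorted(latest.values())

log = [
    ("user_1", "a"),   # offset 0, superseded later
    ("user_2", "b"),   # offset 1, deleted by tombstone
    ("user_1", "c"),   # offset 2, survives
    ("user_2", None),  # offset 3, tombstone
]
assert compact(log) == [(2, "c")]
```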
Kafka uses in-sync replicas (ISR) for durability. Each partition has a leader and N-1 followers. Writes go to the leader, followers fetch and replicate them, and with acks=all a write is considered committed once every ISR member has acknowledged it. If a follower falls behind, it is removed from the ISR until it catches up.
KRaft (Kafka Raft) replaces ZooKeeper for metadata management in Kafka 3.3+. The controller quorum handles broker registration, partition assignment, and leader election using Raft consensus. My recommendation for new clusters: always use KRaft. ZooKeeper is on its way out.
Exactly-once semantics require three components working together. Idempotent producers (enable.idempotence=true) guarantee that retried sends produce exactly one record per message (the broker deduplicates using producer ID and sequence number). Transactional writes wrap read-process-write operations in an atomic transaction: consumer reads from input topic, processor transforms, producer writes to output topic. If any step fails, the entire transaction aborts. The isolation.level=read_committed setting on downstream consumers ensures they only see records from committed transactions.
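The broker-side deduplication step can be sketched as a toy model of the (producer ID, sequence number) check. This is an illustration of the rule, not the broker's implementation, which works on record batches and also fences zombie producers via epochs.

```python
# Toy model of broker-side deduplication for idempotent producers: the
# broker tracks, per producer ID, the next sequence number it expects
# for this partition; a retried batch with an old sequence is dropped.

class PartitionLog:
    def __init__(self):
        self.records = []
        self.next_seq = {}  # producer_id -> next expected sequence

    def append(self, producer_id, seq, value):
        expected = self.next_seq.get(producer_id, 0)
        if seq < expected:
            return "duplicate"      # already written: the retry is dropped
        if seq > expected:
            return "out_of_order"   # gap: reject so nothing is skipped
        self.records.append(value)
        self.next_seq[producer_id] = seq + 1
        return "ok"

log = PartitionLog()
assert log.append("p1", 0, "a") == "ok"
assert log.append("p1", 0, "a") == "duplicate"   # network retry, not re-appended
assert log.append("p1", 1, "b") == "ok"
assert log.records == ["a", "b"]
```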
Consumer group rebalancing redistributes partitions when consumers join or leave. The cooperative sticky assignor (default in newer clients) minimizes partition movement during rebalancing. Processing pauses briefly on affected partitions during a rebalance. At scale, this pause causes consumer lag spikes. Setting session.timeout.ms=45000 and heartbeat.interval.ms=15000 balances between fast failure detection and avoiding unnecessary rebalances from brief network hiccups.
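"Minimizes partition movement" is the whole point of the sticky assignor, and a toy model makes it concrete. The helper below is hypothetical, not the client's algorithm: when a fourth consumer joins, only the partitions it takes over change hands, while eager rebalancing would revoke and reassign everything.

```python
# Toy sticky rebalance: when a member joins, steal partitions from the
# most-loaded members until the newcomer reaches its fair share; every
# other assignment stays put (eager rebalancing revokes everything).

def sticky_add_member(assignment, new_member):
    assignment = {m: list(ps) for m, ps in assignment.items()}
    assignment[new_member] = []
    target = sum(len(ps) for ps in assignment.values()) // len(assignment)
    moved = []
    while len(assignment[new_member]) < target:
        donor = max(assignment, key=lambda m: len(assignment[m]))
        moved.append(assignment[donor].pop())
        assignment[new_member].append(moved[-1])
    return assignment, moved

old = {"c1": [0, 1, 2, 3], "c2": [4, 5, 6, 7], "c3": [8, 9, 10, 11]}
new, moved = sticky_add_member(old, "c4")
assert len(moved) == 3          # only 3 of 12 partitions change hands
assert new["c1"] == [0, 1, 2]   # everything else stays where it was
```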
Tiered storage (KIP-405, available in Kafka 3.6+ and Confluent Platform) offloads old log segments to object storage (S3) while keeping recent segments on local disk. This dramatically reduces broker disk costs for topics with long retention (30+ days). Brokers store the "hot" data locally for low-latency reads, and transparently fetch "cold" data from S3 when consumers read historical offsets. For teams that want infinite retention without proportionally scaling disk, tiered storage is a game-changer.
Kafka's zero-copy optimization (sendfile system call) transfers data directly from the page cache to the network socket without copying through user space. This is why Kafka can sustain multi-GB/s read throughput per broker with minimal CPU usage. Consumers reading recent data (still in page cache) get near-memory-speed performance. Consumers reading data older than the page cache pay disk I/O cost.
```yaml
# Key Kafka producer configuration
enable.idempotence: true                    # Exactly-once per partition
acks: all                                   # Wait for all ISR replicas
retries: 2147483647                         # Infinite retries (idempotent)
max.in.flight.requests.per.connection: 5    # Safe with idempotence
compression.type: lz4                       # Batch compression
linger.ms: 5                                # Batch for 5ms before sending
batch.size: 16384                           # 16 KB batches
```
RabbitMQ: AMQP Broker with Exchange Routing
RabbitMQ implements AMQP 0-9-1. Producers publish messages to exchanges, not queues directly. Exchanges route messages to queues based on bindings and routing keys. The exchange type determines the routing algorithm.
```python
# Producer: publish to an exchange with a routing key
channel.basic_publish(
    exchange="order_events",
    routing_key="order.created",
    body=serialize(order_event),
    properties=pika.BasicProperties(
        delivery_mode=2,  # Persistent to disk
        content_type="application/json"
    )
)

# Consumer: subscribe to a queue, ACK after processing
def callback(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(
    queue="billing_orders",
    on_message_callback=callback
)
```
Four exchange types handle different routing patterns:
- Direct: exact match on routing key (e.g., `order.created` matches only `order.created`)
- Topic: wildcard pattern matching (`order.*` matches `order.created`; `#.created` matches anything ending in `.created`)
- Fanout: broadcast to all bound queues regardless of routing key
- Headers: match on message header attributes instead of routing key
RabbitMQ pushes messages to consumers (the broker initiates delivery). Prefetch count controls how many unacknowledged messages a consumer can hold, providing built-in backpressure. Once a consumer ACKs a message, it is deleted from the queue.
Dead letter exchanges (DLX) capture messages that are rejected, expired, or exceed queue length. Priority queues reorder messages by priority level (1-255). Quorum queues (introduced in 3.8) provide Raft-based replication for high availability, replacing the older mirrored queue approach.
Message flow in RabbitMQ works like this: the producer sends a message via an AMQP channel to the exchange. The exchange evaluates bindings and routes the message to zero or more queues. Each queue stores the message in memory or on disk depending on persistence settings. The broker then pushes the message to a subscribed consumer based on the prefetch count.
RabbitMQ's connection model uses multiplexed channels over a single TCP connection. One connection can have hundreds of channels, each handling independent message streams. This is efficient for applications that publish and consume from many queues: you open one TCP connection and multiplex the traffic over lightweight channels. Kafka uses one TCP connection per broker, with the client library managing which partitions map to which broker connections internally.
Prefetch tuning is critical for throughput. basic.qos(prefetch_count=1) means the broker sends one message at a time, waiting for ACK before sending the next. This guarantees fair distribution but limits throughput. basic.qos(prefetch_count=100) sends up to 100 unacknowledged messages, keeping the consumer busy but risking message concentration if processing is slow. My default: start at 20-50 for background jobs, 1-5 for long-running tasks, and tune based on consumer processing time.
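A toy dispatch model illustrates the trade-off. This is not RabbitMQ's actual scheduler, and ACKs are omitted so the buffers only fill, but it is enough to show the concentration effect of a large prefetch.

```python
# Toy dispatch model for prefetch: the broker hands each message to the
# first worker with spare prefetch credit. With prefetch=1 work spreads
# fairly; with a large prefetch one worker can buffer the whole burst.

from collections import deque

def dispatch(messages, workers, prefetch):
    buffers = {w: deque() for w in workers}
    for msg in messages:
        target = next(w for w in workers if len(buffers[w]) < prefetch)
        buffers[target].append(msg)
    return {w: len(q) for w, q in buffers.items()}

# prefetch=1: strict fair spread, one unacked message per worker
assert dispatch(["m1", "m2", "m3"], ["w1", "w2", "w3"], prefetch=1) == \
    {"w1": 1, "w2": 1, "w3": 1}

# large prefetch: the first worker buffers the entire burst
assert dispatch(["m1", "m2", "m3"], ["w1", "w2", "w3"], prefetch=100) == \
    {"w1": 3, "w2": 0, "w3": 0}
```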
Quorum queues use Raft consensus to replicate messages across a configurable number of nodes (default 3). Writes require majority acknowledgment (2 of 3 nodes). This provides better data safety than mirrored queues (which could lose messages during network partitions) at the cost of slightly higher write latency. For new deployments, always use quorum queues over classic mirrored queues.
RabbitMQ management plugin provides an HTTP API and web UI for monitoring queue depths, consumer counts, message rates, and connection states. Key operational metrics to monitor: queue depth (messages ready for delivery), unacked count (messages delivered but not yet acknowledged), publish rate vs consume rate (if publish > consume, the queue grows), and memory/disk alarms (RabbitMQ blocks publishers when memory exceeds the watermark threshold, default 40% of system RAM).
RabbitMQ Streams (introduced in 3.9) add Kafka-like append-only log semantics to RabbitMQ. Streams support offset-based consumption, time-based offset seeking, and message replay. This narrows the gap between the two tools, but streams lack Kafka's partition model, consumer groups, and ecosystem maturity. I view RabbitMQ Streams as useful for "I mostly need RabbitMQ but sometimes need replay for one queue" rather than a replacement for Kafka's streaming architecture.
RabbitMQ's plugin ecosystem extends its capabilities. The Shovel plugin copies messages between brokers (useful for cross-datacenter replication). The Federation plugin loosely connects brokers across geographic regions with eventual consistency. The consistent hash exchange distributes messages across queues using consistent hashing, providing load-balanced consumption without application-side routing logic.
Publisher confirms are RabbitMQ's equivalent of Kafka's acks. With confirms enabled, the broker sends an acknowledgment to the producer after the message is written to disk (and replicated, for quorum queues). Without confirms, a crash between the network send and disk write loses the message. Always enable publisher confirms for persistent messages in production. The latency cost is 1-5ms per confirmed message.
RabbitMQ's lazy queues store messages to disk immediately instead of keeping them in memory first. For queues that build up large backlogs (consumers are slow or offline), lazy queues prevent memory exhaustion. The trade-off is higher per-message latency (disk write on every enqueue). Default (non-lazy) queues are faster for high-throughput, low-backlog scenarios where messages are consumed almost immediately.
Key Configuration Differences
```yaml
# Kafka: broker-level + topic-level settings
num.partitions: 12                  # Default partitions per topic
default.replication.factor: 3       # Copies across brokers
log.retention.hours: 168            # 7 days default
log.segment.bytes: 1073741824       # 1 GB segment files
auto.create.topics.enable: false    # Explicit topic creation
min.insync.replicas: 2              # Quorum for acks=all
```

```yaml
# RabbitMQ: queue-level settings
x-queue-type: quorum                # Raft-replicated
x-max-length: 1000000               # Max messages in queue
x-message-ttl: 86400000             # 24h message expiry
x-dead-letter-exchange: dlx         # Failed message routing
x-delivery-limit: 5                 # Redelivery attempts
```
The configuration philosophy differs fundamentally. Kafka configurations are cluster-wide and topic-wide (you tune brokers and topics, not individual messages). RabbitMQ configurations are queue-level and message-level (each queue and each message can have different TTL, priority, persistence settings). This reflects the architectural difference: Kafka treats the stream as a unit, RabbitMQ treats each message as a unit.
Delivery Guarantees Comparison
| Guarantee | Kafka | RabbitMQ |
|---|---|---|
| At-most-once | acks=0 (fire and forget) | No confirms, auto-ACK |
| At-least-once | acks=all + consumer manual commit | Publisher confirms + manual ACK |
| Exactly-once | Idempotent producer + transactions + read_committed consumers | Not natively supported (requires application-level deduplication) |
Kafka's exactly-once semantics (EOS) is the strongest delivery guarantee available in mainstream messaging systems. It requires the full chain: idempotent producer (dedup at broker), transactional producer (atomic writes across partitions), and isolation.level=read_committed on consumers (only see committed transaction records). The performance cost of EOS is ~5-10% throughput reduction compared to at-least-once.
RabbitMQ achieves effective exactly-once at the application level by combining publisher confirms, consumer ACK, and idempotent message processing (using a deduplication table or idempotency key). This is more work but gives you full control over the deduplication logic.
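The deduplication half of that pattern can be sketched directly. In real code, `processed` would be a database table with a unique constraint on the message ID, and the insert would happen in the same transaction as the side effect; the in-memory set here is a stand-in.

```python
# Sketch of application-level exactly-once on top of at-least-once
# delivery: check the idempotency key before performing the side effect.

processed = set()
charges = []

def handle(message_id, body):
    if message_id in processed:
        return "skipped"        # redelivery after a lost ACK
    charges.append(body)        # the actual side effect (e.g. charge a card)
    processed.add(message_id)   # record the key atomically with the work
    return "processed"

assert handle("msg-1", "charge $10") == "processed"
assert handle("msg-1", "charge $10") == "skipped"   # duplicate delivery
assert charges == ["charge $10"]
```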
For most applications, at-least-once delivery with idempotent consumers is the pragmatic choice for both tools. Design consumers to safely process the same message twice (using database upserts, idempotency keys, or conditional writes), and the delivery guarantee question becomes less critical than it appears in theory.
Operational Complexity Comparison
| Dimension | Kafka | RabbitMQ |
|---|---|---|
| Minimum production cluster | 3 brokers + KRaft (or ZooKeeper) | 3 nodes (quorum queues) |
| Disk requirements | High (retention stores all data) | Low (messages deleted on ACK) |
| Memory requirements | Page cache dependent (more RAM = faster reads) | Queue depth dependent (large backlogs need more RAM) |
| Key metrics to monitor | Consumer lag, under-replicated partitions, ISR shrink rate | Queue depth, unacked count, memory alarms |
| Scaling | Add brokers + rebalance partitions | Add nodes + configure quorum membership |
| Managed options | Confluent Cloud, Amazon MSK, Azure Event Hubs | CloudAMQP, Amazon MQ, Azure Service Bus |
The fan-out model highlights the architectural difference. In Kafka, adding a new consumer is purely consumer-side: create a new consumer group that reads from the existing topic. No producer changes, no broker configuration. In RabbitMQ, adding a new consumer requires creating a queue and binding it to the exchange.
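The consumer-side-only fan-out follows directly from the log model, and can be sketched with a list and a dict of read positions (toy names, not any client API):

```python
# Toy model of log-based fan-out: one shared append-only topic, and each
# consumer group is nothing more than its own read position into it.

topic = ["order_1", "order_2", "order_3"]
offsets = {"billing": 0, "inventory": 0}

def poll(group):
    if offsets[group] < len(topic):
        record = topic[offsets[group]]
        offsets[group] += 1
        return record

# Adding "analytics" later touches no producer and no existing group:
offsets["analytics"] = 0
assert poll("billing") == "order_1"
assert poll("analytics") == "order_1"   # same data, independent position
assert poll("analytics") == "order_2"
assert offsets["billing"] == 1          # billing's pace is unaffected
```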
Head-to-Head Comparison
| Dimension | Kafka | RabbitMQ | Verdict |
|---|---|---|---|
| Throughput | 100K-2M msgs/sec per broker (batched, sequential disk I/O) | 20K-50K msgs/sec per node (per-message routing overhead) | Kafka, 5-20x higher |
| Latency | 5-50ms typical (batching adds latency for throughput) | Sub-ms to 5ms (immediate push, no batching) | RabbitMQ for individual messages |
| Message replay | Any consumer can re-read from any offset at any time | Once ACK'd, message is deleted forever | Kafka, decisively |
| Ordering | Strict per-partition (same key = same partition = ordered) | Per-queue with single consumer only; no ordering with competing consumers | Kafka, more granular control |
| Routing | Topic-based only; consumers read entire topics | Exchange-based: direct, topic, fanout, headers with flexible binding rules | RabbitMQ, far more expressive |
| Consumer model | Pull-based: consumers poll at their own pace | Push-based: broker delivers with prefetch-based flow control | RabbitMQ simpler for task queues |
| Backpressure | Natural: slow consumer falls behind but never overloads | Prefetch count + credit-based flow control | Both good, different mechanisms |
| Persistence | Always persisted to disk, configurable retention (hours to forever) | Optional per-message persistence, deleted on ACK | Kafka, built-in retention |
| Priority | No native priority; separate topics for priority lanes | Priority queues (1-255 levels) | RabbitMQ |
| Operational complexity | Higher: partitions, ISR, replication factor, KRaft/ZooKeeper | Lower: simpler topology, but quorum queues add complexity | RabbitMQ simpler to start |
The fundamental tension: Kafka optimizes for throughput, ordering, and replay at the cost of routing flexibility and operational simplicity. RabbitMQ optimizes for message-level routing, priority, and ease of setup at the cost of throughput ceiling and replay capability.
When Kafka Wins
Kafka is the right choice when your primary concerns are throughput, ordering, and the ability to replay events.
Event streaming and event sourcing. If downstream services need to rebuild state from a stream of events, Kafka's retention and replay make this possible. RabbitMQ deletes messages on ACK, making replay impossible without re-publishing from the source. For systems where "what happened" matters as much as "what is the current state," Kafka is the only option.
High-throughput ingestion. Kafka's sequential disk I/O and batched writes handle 100K-2M messages per second per broker. A three-broker Kafka cluster handles what would require 10-20 RabbitMQ nodes. For log aggregation, clickstream, IoT telemetry, or metrics pipelines, the throughput gap is decisive.
Exactly-once semantics. Kafka's idempotent producers (enable.idempotence=true) combined with transactional writes (read-process-write atomically) provide exactly-once processing guarantees. RabbitMQ provides at-most-once or at-least-once, not exactly-once.
Multiple independent consumers. In Kafka, adding a new consumer group to read from an existing topic requires zero changes to producers or brokers. Each group tracks its own offset independently. In RabbitMQ, adding a new service requires creating a queue and binding it to the exchange.
Per-entity ordering at scale. Kafka's partition key guarantees that all events for user_456 go to the same partition and are processed in order. You can have 100 partitions (100-way parallelism) while maintaining per-user ordering. RabbitMQ only guarantees ordering within a single queue consumed by a single consumer, which limits parallelism.
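The key-to-partition mapping is just a stable hash of the key. A minimal sketch: Kafka's default partitioner actually uses murmur2, so `crc32` here is only a stand-in for "any deterministic hash".

```python
# Sketch of "same key = same partition = ordered": the producer hashes
# the partition key, so every event for a given user lands in the same
# partition and is consumed in publish order.

import zlib

NUM_PARTITIONS = 100

def partition_for(key):
    # deterministic hash: the same key always maps to the same partition
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

p = partition_for("user_456")
assert all(partition_for("user_456") == p for _ in range(1000))

# different users spread across partitions, giving up to 100-way parallelism
assert len({partition_for(f"user_{i}") for i in range(1000)}) > 1
```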
Audit trails and compliance. Financial systems, healthcare, and regulated industries need immutable event logs. Kafka's retention (set to infinite with compaction) serves as a durable audit trail without additional infrastructure.
Change Data Capture (CDC) backbone. Debezium reads database WAL changes and publishes them to Kafka topics. Downstream consumers (search indexes, cache invalidation, analytics) react to data changes without polling the source database. The CDC-to-Kafka pipeline is the standard pattern for keeping derived data stores in sync.
Stream processing with Kafka Streams or Flink. Kafka is both the transport layer and the storage layer for stream processing. Kafka Streams (a Java library, not a separate cluster) reads from input topics, processes records, and writes to output topics. Apache Flink reads from Kafka for more complex stateful processing (windowed aggregations, event-time processing, exactly-once across multiple sinks). The key advantage of Kafka Streams: it's a library, not a cluster. You deploy it as part of your application with no additional infrastructure.
Kafka Connect for no-code integrations. Kafka Connect provides source connectors (ingest data from databases, files, APIs into Kafka) and sink connectors (write data from Kafka to databases, S3, Elasticsearch, data warehouses). For common integrations (PostgreSQL CDC, S3 archival, Elasticsearch indexing), Kafka Connect eliminates custom consumer code entirely. A JSON configuration file defines the connector, and Kafka Connect runs it as a distributed, fault-tolerant task.
Monitoring and operational maturity. Kafka exposes detailed JMX metrics: consumer lag per partition, broker throughput, ISR shrink/expand rates, under-replicated partitions. Tools like Burrow, LinkedIn's Kafka monitor, track consumer group health. At scale, consumer lag is the single most important metric: it tells you how far behind each consumer group is from the latest data.
Log compaction for materialized views. Kafka's log compaction retains only the latest value per key, indefinitely. This turns a Kafka topic into a slowly-updating key-value store. CDC topics with compaction enabled provide a "current state" view: consumers reading from the beginning get the latest version of every row without processing the full history. This is the foundation of Kafka's use as a state store, not just a message transport.
```text
# Critical Kafka monitoring metrics
- kafka.server:BrokerTopicMetrics:MessagesInPerSec              # Ingest rate
- kafka.server:ReplicaManager:UnderReplicatedPartitions         # Data safety
- kafka.consumer:FetchManager:records-lag-max                   # Consumer health
- kafka.server:BrokerTopicMetrics:BytesOutPerSec                # Read throughput
- kafka.controller:ControllerStats:UncleanLeaderElectionsPerSec # Split brain risk
```
When RabbitMQ Wins
RabbitMQ excels at traditional message brokering, task distribution, and scenarios where per-message routing flexibility matters more than throughput or replay.
Task queues with competing consumers. Job processing (email sending, image resizing, PDF generation) where you need to distribute work across a pool of workers. RabbitMQ's push model with prefetch count is purpose-built for this. Workers receive tasks automatically, and if a worker crashes, the message is redelivered to another worker.
Complex routing rules. A payment event needs to go to the fraud service (always), the rewards service (only for amounts over $100), and the audit service (only for international transactions). RabbitMQ's topic exchange with binding patterns like payment.international.# handles this at the broker level. In Kafka, you'd need topic-per-route or consumer-side filtering.
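The binding semantics are small enough to reproduce: `*` matches exactly one dot-separated word, `#` matches zero or more words. This matcher is a sketch of the rules only; RabbitMQ's real implementation evaluates bindings with a trie.

```python
# Sketch of AMQP topic-exchange matching: '*' = exactly one word,
# '#' = zero or more words, words separated by dots.

def binding_matches(pattern, routing_key):
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":                      # zero or more words
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))

assert binding_matches("order.*", "order.created")
assert not binding_matches("order.*", "order.created.eu")   # '*' is one word
assert binding_matches("payment.international.#", "payment.international.wire.eu")
assert binding_matches("payment.international.#", "payment.international")  # '#' may be empty
```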
Request-reply patterns. RabbitMQ supports reply-to queues and correlation IDs natively. The RPC-over-message-queue pattern works naturally. Kafka can do request-reply, but it's awkward and requires response topics with correlation matching.
Priority scheduling. If high-priority messages need to jump the queue (payment retries before marketing emails), RabbitMQ priority queues handle this natively. Kafka has no priority concept. You'd create separate priority topics and consume from the high-priority topic first, but that's application-level logic you're building yourself.
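That application-level logic amounts to "drain the high-priority topic first." A toy sketch, with lists standing in for per-topic polls:

```python
# Toy priority lanes for Kafka: two topics, and the consumer takes from
# the high-priority one before touching the low-priority one. In Kafka
# this ordering logic lives entirely in your application.

high = ["payment_retry_1"]
low = ["newsletter_1", "newsletter_2"]

def next_message():
    if high:
        return high.pop(0)
    if low:
        return low.pop(0)
    return None

drained = []
while (msg := next_message()) is not None:
    drained.append(msg)

assert drained == ["payment_retry_1", "newsletter_1", "newsletter_2"]
```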
Low-latency individual messages. If you need sub-millisecond broker-to-consumer delivery for individual messages (not batch throughput), RabbitMQ's push model delivers faster than Kafka's poll-based consumption. Kafka's batching optimizes for throughput at the cost of per-message latency.
Simpler operations for small teams. A single RabbitMQ node handles most moderate workloads. No ZooKeeper/KRaft, no partition management, no ISR monitoring. For a team of 3-5 engineers running 10K messages/sec, RabbitMQ's operational simplicity is a real advantage. I've seen small teams adopt Kafka and spend more time maintaining the cluster than building features.
Retry and delay queues. RabbitMQ's dead letter exchange (DLX) with TTL-based retry queues creates sophisticated retry patterns declaratively. A failed message goes to a retry queue with a 30-second TTL, then back to the main queue for reprocessing. After 5 failures, it goes to a dead letter queue for human review. Building this in Kafka requires custom retry topic chains and manual offset management.
```python
# RabbitMQ retry pattern: DLX with TTL
channel.queue_declare(
    queue='main_queue',
    arguments={
        'x-dead-letter-exchange': 'retry_exchange',
        'x-dead-letter-routing-key': 'retry'
    }
)
channel.queue_declare(
    queue='retry_queue',
    arguments={
        'x-message-ttl': 30000,  # 30 seconds
        'x-dead-letter-exchange': 'main_exchange',
        'x-dead-letter-routing-key': 'main'
    }
)
```
Message acknowledgment flexibility. RabbitMQ supports individual ACK, bulk ACK (acknowledge all messages up to a delivery tag), and NACK with requeue. This fine-grained control over message lifecycle is useful when processing requires validation steps before committing.
The Nuance
Here's the honest answer: Kafka and RabbitMQ are not competitors. They solve different problems, and the "which one should I use" framing is usually the wrong question.
Kafka is an event streaming platform. It answers: "What happened, in what order, and let me replay it." RabbitMQ is a message broker. It answers: "Take this message and deliver it to the right consumer." Many production systems use both.
The common hybrid pattern: Kafka handles event streaming (order events, click events, state changes) while RabbitMQ handles command distribution (send this email, resize this image, process this payment). Events flow through Kafka because multiple services need to independently consume them and replay is valuable. Commands flow through RabbitMQ because they need routing, priority, and reliable task distribution.
Schema evolution is critical for long-lived Kafka topics. Without schema management, producers can change message structure and break consumers. Confluent Schema Registry (or AWS Glue Schema Registry) enforces Avro, Protobuf, or JSON Schema compatibility rules. Backward compatibility means new consumers can read old messages. Forward compatibility means old consumers can read new messages. Full compatibility ensures both directions. For production Kafka deployments, Schema Registry is not optional.
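The backward-compatibility rule reduces to "every field the new reader adds must carry a default." A sketch of the Avro-style check, with a deliberately simplified schema representation (field name mapped to its default, `NO_DEFAULT` meaning required):

```python
# Toy schema model: field name -> default value (NO_DEFAULT = required).
NO_DEFAULT = object()

def backward_compatible(old, new):
    # A new reader can decode old records iff every field it added
    # carries a default to fill in when the old writer omitted it.
    return all(new[f] is not NO_DEFAULT for f in set(new) - set(old))

v1 = {"order_id": NO_DEFAULT, "amount": NO_DEFAULT}
v2 = {**v1, "currency": "USD"}        # added with a default: OK
v3 = {**v1, "customer": NO_DEFAULT}   # added as required: breaks old data

assert backward_compatible(v1, v2)
assert not backward_compatible(v1, v3)
```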
I've seen teams try to replace RabbitMQ entirely with Kafka for task queues, and the result is always worse. Kafka's pull model, lack of per-message routing, and lack of priority make it a poor task queue. Similarly, teams that use RabbitMQ for event streaming inevitably hit the replay wall and wish they'd used Kafka.
The red flag in interviews: if a candidate says "I'd use Kafka" or "I'd use RabbitMQ" without asking what the messaging pattern is, they're pattern-matching on tool names rather than understanding use cases.
Another common mistake: running Kafka for a workload that processes 500 messages per minute. At that volume, Kafka's operational overhead (broker management, partition tuning, consumer group coordination) far exceeds its benefits. RabbitMQ or even a simple SQS queue handles low-volume messaging with dramatically less operational cost. Kafka's advantages emerge at 10K+ messages per second, not 10 messages per second.
The managed service question matters too. Confluent Cloud, Amazon MSK, and Azure Event Hubs provide managed Kafka without the operational burden of running brokers. CloudAMQP provides managed RabbitMQ. For teams without dedicated infrastructure engineers, managed services change the "operational complexity" dimension of this trade-off significantly. The engineering cost shifts from cluster management to monthly bills.
The performance numbers tell the story. A single Kafka broker on a c5.4xlarge instance (16 vCPUs, 32 GB RAM) with 3x gp3 EBS volumes achieves 200K-400K messages/sec sustained write throughput with replication factor 3 and acks=all. A single RabbitMQ node on the same hardware handles 20K-30K persistent message publishes per second with quorum queues. Both benchmarks use 1 KB message payloads.
The latency profile is the inverse. RabbitMQ's publish-to-consume latency (with publisher confirms and consumer ACK) is typically 0.5-2ms for persistent messages. Kafka's produce-to-consume latency is 5-50ms because of batching (linger.ms), network round trips for acks=all, and consumer poll intervals. If you need single-digit millisecond message delivery and can tolerate lower throughput, RabbitMQ wins. If you need 100K+ msgs/sec and can tolerate 20ms latency, Kafka wins.
```text
# Throughput comparison (1 KB messages, 3-node cluster)
Kafka:
  Producer (acks=all, replication=3):      200K-400K msgs/sec
  Consumer (single group, 12 partitions):  500K+ msgs/sec
  End-to-end latency (p99):                10-50ms

RabbitMQ:
  Producer (persistent, quorum queue):     20K-30K msgs/sec
  Consumer (prefetch=50, single queue):    25K-40K msgs/sec
  End-to-end latency (p99):                1-5ms
```
Interview tip: name the pattern before the tool
When messaging comes up in a system design interview, say: "This is an event streaming pattern, so Kafka" or "This is a task distribution pattern, so RabbitMQ." That one sentence demonstrates more understanding than a five-minute feature comparison. If both patterns exist, propose using both tools and explain which traffic goes where.
Real-World Examples
LinkedIn: Kafka was built at LinkedIn to handle activity stream data and operational metrics. Their deployment processes over 7 trillion messages per day across hundreds of Kafka clusters with over 100K partitions. Every user action (profile view, connection request, content interaction) is published as an event. Multiple consumer groups independently consume these events for the news feed, ad targeting, search indexing, and analytics. Replay is critical: when the ML team updates a recommendation model, they replay weeks of engagement events to retrain. At this scale, Kafka's sequential disk I/O and partition-level parallelism are the only architecture that works economically.
Shopify: Uses RabbitMQ for background job processing across their e-commerce platform. When a merchant updates a product, RabbitMQ distributes tasks to workers: update the storefront cache, regenerate SEO metadata, notify subscribed customers, and sync with third-party marketplaces. The routing key pattern (product.updated, product.created, product.deleted) determines which workers receive each message. At peak (Black Friday), their RabbitMQ cluster processes 100K+ jobs per minute using competing consumers with prefetch tuning.
Uber: Runs both Kafka and custom tooling (originally Cherami, later migrated to Kafka for most use cases). Kafka handles trip events, pricing updates, and driver location streams at millions of events per second across 3,000+ microservices. Their initial use of RabbitMQ-style task queues for ride matching was replaced with Kafka partitioned by geographic region, maintaining per-region ordering. The key lesson from Uber's migration: they kept seeing the replay problem (new services needed historical events) and eventually moved nearly everything to Kafka, using Cadence/Temporal for workflow orchestration instead of RabbitMQ-style task queues.
Confluent (Kafka Cloud): Reports that their managed Kafka clusters across thousands of customers handle over 10 petabytes of data per day. The largest single cluster handles 30 GB/sec sustained write throughput. These numbers contextualize why Kafka's architecture (append-only log, sequential I/O, zero-copy transfer) exists: it's designed for throughput that would overwhelm any per-message routing broker.
Netflix: Uses Kafka as the backbone of their data pipeline. Every user interaction (play, pause, seek, rate) is published to Kafka topics. Data engineering teams consume these events for A/B test analysis, content recommendation training, and operational dashboards. Their Kafka clusters handle over 1 trillion messages per day. They built Zuul (API gateway) with Kafka-backed request logging, enabling replay-based debugging: when a user reports an issue, engineers can replay the exact sequence of API calls from Kafka.
GitHub: Uses RabbitMQ for webhook delivery. When a push event occurs, GitHub publishes to RabbitMQ, which routes the webhook payload to delivery workers. Workers make HTTP POST requests to the configured webhook URLs with retry logic (exponential backoff, DLQ for persistently failing endpoints). The task queue pattern fits perfectly: each webhook delivery is an independent job that needs reliable delivery and retry mechanics. At GitHub's scale (billions of webhook deliveries per month), RabbitMQ's per-message routing and dead-letter handling are essential.
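The retry mechanics described in the webhook example — exponential backoff, then a dead-letter queue for endpoints that keep failing — can be sketched as follows. This is an illustrative sketch, not GitHub's actual implementation; `deliver` is a hypothetical stand-in for the HTTP POST to the webhook URL, and the sleep is injected so the logic is testable without waiting.

```python
# Sketch of task-queue retry mechanics: exponential backoff, then
# dead-lettering. All names here are illustrative, not a real API.

def deliver_with_retries(payload, deliver, max_attempts=4, base_delay=1.0,
                         dead_letters=None, sleep=lambda seconds: None):
    """Attempt delivery up to max_attempts times, backing off 1s, 2s, 4s, ...
    On exhaustion, park the payload in a dead-letter list for inspection."""
    for attempt in range(max_attempts):
        if deliver(payload):
            return True
        sleep(base_delay * (2 ** attempt))  # exponential backoff between tries
    if dead_letters is not None:
        dead_letters.append(payload)        # DLQ: persistently failing endpoint
    return False

dlq = []
flaky = iter([False, False, True])          # endpoint fails twice, then recovers
assert deliver_with_retries({"event": "push"}, lambda p: next(flaky),
                            dead_letters=dlq) is True
assert dlq == []                            # succeeded within budget, no DLQ

assert deliver_with_retries({"event": "push"}, lambda p: False,
                            dead_letters=dlq) is False
assert dlq == [{"event": "push"}]           # exhausted retries, dead-lettered
```

In RabbitMQ itself this pattern is typically built from a per-message TTL plus a dead-letter exchange (DLX) rather than in-process sleeps, but the control flow is the same.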
How This Shows Up in Interviews
This trade-off appears every time a system design involves asynchronous communication between services. The interviewer wants to see whether you understand messaging patterns, not just tool names.
What they're testing: Can you articulate when log-based streaming (Kafka) is appropriate versus when a traditional broker (RabbitMQ) is appropriate? Do you understand replay, ordering, and routing trade-offs?
Depth expected at senior level:
- Know Kafka's partition model: how partition keys determine ordering, how consumer groups parallelize
- Understand ISR and replication factor for durability guarantees
- Know RabbitMQ's exchange types and when each is useful
- Explain why Kafka is not a good task queue and RabbitMQ is not a good event log
- Discuss exactly-once semantics (Kafka's idempotent producers + transactions)
- Mention the hybrid pattern (Kafka for events, RabbitMQ for commands) as a mature answer
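Kafka's key-to-partition mapping, which underpins the per-entity ordering point above, can be sketched in a few lines. This is a simplification: real Kafka clients hash keys with murmur2, and `hashlib.md5` here is just an illustrative stand-in with the same "same key, same partition" property.

```python
# Simplified model of Kafka's default partitioner: hash the key,
# mod by the partition count. Real clients use murmur2, not md5.
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition, as a producer would."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one entity land on one partition, so a consumer
# in the group reads them in produce order.
events = [("order_789", "created"), ("order_42", "created"),
          ("order_789", "paid"), ("order_789", "shipped")]

by_partition: dict[int, list[tuple[str, str]]] = {}
for key, event in events:
    by_partition.setdefault(partition_for(key), []).append((key, event))

p = partition_for("order_789")
assert [e for k, e in by_partition[p] if k == "order_789"] == [
    "created", "paid", "shipped"]  # strict per-entity ordering
```

The same mapping explains the consumer-group parallelism ceiling: at most one consumer per partition within a group, so 12 partitions support at most 12-way parallelism.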
| Interviewer asks | Strong answer |
|---|---|
| "Why Kafka here and not RabbitMQ?" | "We need event replay for bug recovery and new consumer onboarding, per-entity ordering via partition keys, and throughput above 100K events/sec. RabbitMQ deletes on ACK, so replay is impossible." |
| "This looks like a task queue. Why not Kafka?" | "Task queues need priority, per-message routing, and push-based delivery. Kafka's pull model and lack of priority make it a poor fit. RabbitMQ with competing consumers and prefetch tuning is purpose-built for this." |
| "How do you handle ordering in Kafka?" | "Messages with the same partition key go to the same partition, which guarantees ordering for that key. We'd partition by entity ID, so all events for order_789 are strictly ordered. Cross-partition ordering requires a single partition, which limits throughput." |
| "What happens if a Kafka consumer is slow?" | "It falls behind. Consumer lag (the gap between the latest offset and the consumer's committed offset) grows. We monitor lag and scale out consumers within the group. The key advantage: the slow consumer doesn't affect other consumer groups or the producer." |
| "Can you replay messages in RabbitMQ?" | "Not natively. Once ACK'd, messages are deleted. You'd need to re-publish from the source system. Some teams use the shovel plugin to copy messages to a separate queue before consumption, but that's a workaround, not a core capability." |
| "How do you choose the number of partitions?" | "Start with the number of consumers you expect in the largest consumer group. If you need 12-way parallelism, create 12 partitions. You can always add partitions later, but you can never reduce them without recreating the topic. Over-partitioning wastes broker resources; under-partitioning limits parallelism." |
Gotcha: Don't Call Kafka a Message Queue
Kafka is a distributed commit log, not a message queue. Calling it a queue suggests messages are deleted after consumption, which is the opposite of how Kafka works. In an interview, this distinction signals whether you actually understand the tool or are just name-dropping.
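The distinction can be made concrete with two toy data structures — an illustrative sketch, not how either broker is implemented: a queue's read is destructive, while a log retains records and tracks a separate offset per consumer group, which is what makes replay possible.

```python
# Toy models of the two semantics. Destructive read vs. retained log
# with per-group offsets. Purely illustrative.
from collections import deque

class ToyQueue:                    # RabbitMQ-style: delete on consume
    def __init__(self): self._q = deque()
    def publish(self, msg): self._q.append(msg)
    def consume(self): return self._q.popleft()

class ToyLog:                      # Kafka-style: append-only, offset per group
    def __init__(self):
        self._records, self._offsets = [], {}
    def publish(self, msg): self._records.append(msg)
    def consume(self, group):
        off = self._offsets.get(group, 0)
        msg = self._records[off]
        self._offsets[group] = off + 1
        return msg
    def seek(self, group, offset):  # replay: rewind the group's offset
        self._offsets[group] = offset

q, log = ToyQueue(), ToyLog()
for m in ("a", "b"):
    q.publish(m); log.publish(m)

q.consume()                              # "a" is gone from the queue forever
assert log.consume("analytics") == "a"
assert log.consume("billing") == "a"     # independent groups, same record
log.seek("analytics", 0)
assert log.consume("analytics") == "a"   # replay after an offset reset
```

This is exactly the scenario from the framing: the analytics group rewinds its own offset and reprocesses without touching billing's position.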
Quick Recap
- Kafka is a distributed commit log that retains ordered records on disk. Consumers read at their own pace from any offset, enabling replay, event sourcing, and independent consumer groups.
- RabbitMQ is an AMQP message broker that routes individual messages via exchanges (direct, topic, fanout, headers) and deletes them after consumer acknowledgment. It excels at task distribution, complex routing, and priority scheduling.
- Kafka handles 100K-2M msgs/sec per broker through batched sequential I/O. RabbitMQ handles 20K-50K msgs/sec per node with lower per-message latency and push-based delivery.
- Ordering in Kafka is per-partition (use partition keys for per-entity ordering). Ordering in RabbitMQ is per-queue with a single consumer only.
- The strong default: use Kafka for event streaming (notifications, CDC, metrics, audit logs) and RabbitMQ for task queues (jobs, commands, request-reply). Many mature systems run both.
- In interviews, name the messaging pattern before the tool. "This is event streaming, so Kafka" or "This is task distribution, so RabbitMQ" demonstrates deeper understanding than a feature comparison.
What About Amazon SQS and SNS?
SQS (Simple Queue Service) and SNS (Simple Notification Service) are AWS-managed alternatives worth mentioning. SQS provides a managed message queue with no infrastructure to operate: you create a queue, send messages, and poll for them. SQS FIFO queues add ordering within a message group and exactly-once processing via deduplication. SNS provides pub/sub fan-out (one message delivered to many subscribers).
SQS replaces RabbitMQ for simple task queues when you don't need exchange-based routing, priority queues, or DLX retry patterns. The operational simplicity is unbeatable: zero servers, zero configuration, pay per message ($0.40 per million requests). For AWS-native architectures processing under 50K messages/second, SQS is often the right choice over both Kafka and RabbitMQ.
SQS does NOT replace Kafka. SQS has no replay (messages are deleted after processing), no ordering across message groups, and no consumer group model. For event streaming, CDC, and multi-consumer independence, Kafka remains necessary even in AWS-native architectures. Amazon MSK (Managed Streaming for Kafka) provides Kafka as a managed service.
Related Trade-offs
- Message queues for queue fundamentals, delivery guarantees, and dead-letter patterns
- RPC vs. messaging for when synchronous calls beat async messaging
- Batch vs. stream processing for the data processing trade-off that often pairs with Kafka
- Event-driven architecture for choreography, orchestration, and event sourcing patterns
- Sync vs. async communication for the foundational decision that drives messaging tool selection