Message Queue
Design the internals of a durable message queue like RabbitMQ or Amazon SQS: from a single-broker FIFO queue to a horizontally partitioned, replicated system with at-least-once delivery guarantees.
What is a distributed message queue?
A distributed message queue decouples producers from consumers: a producer writes a message to a named queue, and one or more consumers read it independently and asynchronously. The engineering challenge isn't the queue itself; it's the combination of visibility timeout mechanics (what happens when a consumer crashes mid-processing) and durable storage that survives broker failures without sacrificing throughput.
This question tests WAL-based persistence, partition design for horizontal scale, leader election under failure, and at-least-once delivery semantics.
Functional Requirements
Core Requirements
- Producers publish messages to named queues.
- Consumers receive and acknowledge messages.
- Unacknowledged messages are redelivered after a visibility timeout.
- Messages are durable: they survive broker restarts.
Scope Exclusions
- Fan-out pub-sub semantics (closer to Kafka/SNS).
- Message transformation or schema validation.
Non-Functional Requirements
Core Requirements
- Throughput: 1M messages/second aggregate across all queues.
- Latency: Message visible to consumer within 100ms of publish.
- Durability: No message loss on a single broker failure.
- Availability: 99.99% uptime (about 52 minutes of downtime per year).
- At-least-once delivery: Every published message is consumed at least once; a message may be delivered more than once, but never zero times.
Below the Line
- Exactly-once delivery
- Strict global message ordering across partitions
- Cross-region active-active replication
The hardest engineering problem in scope: implementing atomic visibility-timeout tracking while sustaining 1M messages/second. The broker must efficiently identify expired in-flight leases across thousands of concurrent consumers without scanning every in-flight message on every tick.
I'd front-load this constraint in any interview because it kills the naive approach immediately. Most candidates design a clean visibility tracker, then realize at scale it scans 30M in-flight messages per tick.
Exactly-once delivery is deferred because it requires distributed transactions or consumer-side idempotency, which belongs in the application layer rather than the queue. To add it: assign each message a deduplication ID, maintain a TTLed seen-IDs set in Redis, and reject reprocessed IDs on consumer acknowledgement.
Strict global ordering across partitions is deferred because it forces a single partition, collapsing horizontal throughput. To add it: use FIFO queues with message group IDs (covered in Deep Dive 3), which provide per-key ordering while allowing cross-key parallelism.
Cross-region active-active replication is deferred because it adds cross-datacenter round-trips to the synchronous write path. To add it: use async replication to a secondary region, accepting a recovery point objective of a few seconds in exchange for local write performance.
Core Entities
- Queue: Named channel with configuration including visibility timeout, retention period, max message size, and an optional dead-letter queue reference.
- Message: Payload unit with a unique ID, body (opaque bytes), user-defined attributes, delivery count, and a visibility deadline timestamp.
- Consumer: Client that polls a queue and holds an opaque receipt_handle token while processing; identified by consumer_group_id.
- DeadLetterQueue (DLQ): A standard queue that receives any message exceeding its source queue's max_delivery_count.
Full schema is deferred to the deep dives. The critical relationship: a Message belongs to exactly one Queue, and a DLQ is simply another Queue with a source queue pointer.
API Design
This system exposes an SDK-style API. The calling application links against a client library that manages connection pooling, retries, and serialization.
FR 1 - Producer publishes a message:
send_message(
queue_url: string,
body: bytes,
delay_seconds?: int = 0,
message_group_id?: string
) -> { message_id: string }
delay_seconds defers visibility to consumers by the specified number of seconds. Use it for debounce patterns or scheduled jobs without a separate scheduler service.
FR 2 - Consumer receives messages:
receive_messages(
queue_url: string,
max_messages: int, // 1-10
visibility_timeout: int // seconds, up to 43200
) -> [{ message_id, body, attributes, receipt_handle, delivery_count }]
Batch up to 10 messages per call to amortize network round-trips. The receipt_handle is an opaque token the consumer presents to delete or extend the message's visibility window. Return delivery_count so consumers can detect redeliveries and apply idempotency logic. I always expose delivery_count in the response because it's the one field that lets consumers distinguish a first attempt from a retry without maintaining external state.
FR 3 - Consumer acknowledges a message:
delete_message(
queue_url: string,
receipt_handle: string
) -> void
Deletion is the acknowledgement. A message is only removed from the queue when the consumer explicitly calls delete_message with a valid receipt handle.
FR 4 - Consumer extends visibility for long-running jobs:
change_visibility(
queue_url: string,
receipt_handle: string,
new_timeout: int // seconds
) -> void
Use change_visibility as a heartbeat: call it periodically when processing takes longer than the initial visibility timeout to prevent redelivery. This is the correct pattern for long-running jobs rather than setting an artificially high initial timeout.
Traditional queue vs log-based stream: Use a queue (SQS, RabbitMQ) when you need automatic message deletion after acknowledgement, per-message visibility control, and a dead-letter queue for poison messages. Use a log-based stream (Kafka) when you need long-term message retention, consumer-controlled offset replay, or fan-out to many independent consumer groups reading the same stream from the beginning. The delete-on-ack model we are designing here means consumers cannot replay past messages; if replay is a requirement, the answer is Kafka, not this system.
High-Level Design
1. Producer publishes to a named queue
Solving: The basic write path. A producer sends a message and the broker stores it for future retrieval.
Components:
- Producer Client: Application code calling send_message. Connects to the broker over TCP.
- Broker: Single process. Accepts messages, assigns a unique ID, and appends to a named in-memory queue.
- In-Memory Queue: A FIFO data structure per named queue. Fast but not yet durable.
Request walkthrough:
- Producer calls send_message("orders", body).
- Broker receives the message and assigns a UUID.
- Broker appends the message to the target queue's in-memory FIFO structure.
- Broker returns { message_id } to the producer.
This diagram shows the write path only. Consumer delivery and visibility timeout mechanics come next. Notice we start deliberately simple: one broker, in-memory only, no durability. This is the baseline that everything else evolves from.
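The entire step-1 baseline fits in a few lines. This is a deliberately minimal sketch of the single-broker write path, not a production design; class and method names are illustrative:

```python
import uuid
from collections import defaultdict, deque

class InMemoryBroker:
    """Step-1 baseline: one process, one FIFO deque per named queue.
    No durability yet -- a restart loses everything."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue name -> FIFO of messages

    def send_message(self, queue_url, body):
        message_id = str(uuid.uuid4())  # broker, not producer, assigns the ID
        self.queues[queue_url].append({"message_id": message_id, "body": body})
        return {"message_id": message_id}
```

Everything that follows (visibility tracking, the WAL, partitioning) is layered onto this skeleton without changing the producer-facing contract.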
2. Consumer receives and acknowledges messages with visibility timeout
Solving: The consumer pull path and the core visibility timeout mechanic. When a consumer receives a message, that message must become invisible to other consumers until either acknowledged or the timeout expires.
Components:
- Consumer Client: Application code calling receive_messages. Long-polls the broker.
- Visibility Tracker: An in-memory map of receipt_handle -> (message_id, expiry_timestamp).
- Receipt Handle: An opaque token encoding the consumer's lease on the message.
Request walkthrough:
- Consumer calls receive_messages("orders", max_messages=5, visibility_timeout=30).
- Broker locks the next 5 visible messages.
- Broker generates a unique receipt_handle per message.
- Broker stores receipt_handle -> (message_id, now + 30s) in the visibility tracker.
- Broker returns the messages and their receipt handles to the consumer.
- Consumer processes the message and calls delete_message(receipt_handle).
- Broker removes the message from the queue and the visibility tracker.
The receipt_handle is not just the message_id. It encodes the consumer's lease on the specific delivery attempt. If a consumer receives the same message twice after redelivery, the old receipt handle is invalid. This prevents a slow consumer from accidentally deleting a message that a different consumer has already claimed and acknowledged.
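One way to make the handle a per-attempt lease is to sign the (message, attempt, expiry) tuple so a stale handle from an earlier delivery fails validation. A minimal sketch, assuming an HMAC-signed encoding; the secret, field names, and helpers are illustrative, not the system's actual wire format:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"broker-secret"  # placeholder; a real broker uses a managed key

def issue_receipt_handle(message_id, delivery_count, expiry_ts):
    """Encode the lease and sign it: payload.signature, both base64."""
    payload = base64.urlsafe_b64encode(json.dumps(
        {"m": message_id, "d": delivery_count, "e": expiry_ts}).encode())
    sig = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def validate_receipt_handle(handle, current_delivery_count):
    """Return the message_id for a valid, current lease, else None."""
    payload, _, sig = handle.encode().partition(b".")
    expected = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or malformed handle
    lease = json.loads(base64.urlsafe_b64decode(payload))
    if lease["d"] != current_delivery_count:
        return None  # stale handle from an earlier delivery attempt
    return lease["m"]
```

Encoding the delivery attempt into the handle means the broker can reject a stale delete without any extra lookup: the handle itself carries enough state to detect that the lease was superseded.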
3. Unacknowledged messages are redelivered
Solving: The failure path. If a consumer crashes or stalls, the visibility timeout expires and the broker must redeliver the message.
Components:
- Redelivery Scanner: A background process that scans the visibility tracker for expired leases.
- DLQ Router: Checks if a message has exceeded max_delivery_count before requeue. If so, routes to the dead-letter queue.
Request walkthrough:
- Consumer A receives a message with a 30-second visibility timeout.
- Consumer A crashes before calling delete_message.
- At now >= expiry_timestamp, the Redelivery Scanner detects the expired lease.
- Scanner checks: is delivery_count >= max_delivery_count? If yes, routes to the DLQ. If no, increments delivery_count and restores the message to visible.
- Consumer B picks up the redelivered message on the next poll.
The Redelivery Scanner is a polling loop. Run it against a sorted data structure ordered by expiry timestamp rather than a flat iteration over all in-flight messages. A linear scan of all in-flight messages is O(n) per tick and collapses at high throughput. This is the motivation for the Redis sorted set optimization in Deep Dive 2.
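The expiry-ordered structure can be sketched with a min-heap keyed on expiry timestamp: each tick pops only the leases that have actually expired. A minimal single-broker sketch with illustrative names; a production tracker would also handle handle invalidation and persistence:

```python
import heapq

class VisibilityTracker:
    """Expiry-ordered tracker: the scanner pops only expired leases
    instead of iterating every in-flight message per tick."""

    def __init__(self, max_delivery_count=5):
        self.heap = []    # (expiry_ts, receipt_handle), min-heap by expiry
        self.leases = {}  # receipt_handle -> in-flight message dict
        self.max_delivery_count = max_delivery_count

    def track(self, receipt_handle, message, expiry_ts):
        self.leases[receipt_handle] = message
        heapq.heappush(self.heap, (expiry_ts, receipt_handle))

    def ack(self, receipt_handle):
        # Lazy deletion: drop the lease; the stale heap entry is skipped later.
        self.leases.pop(receipt_handle, None)

    def scan(self, now):
        """Return (requeue, dead_lettered) for all leases expired at `now`."""
        requeue, dlq = [], []
        while self.heap and self.heap[0][0] <= now:
            _, handle = heapq.heappop(self.heap)
            message = self.leases.pop(handle, None)
            if message is None:
                continue  # already acknowledged; stale heap entry
            if message["delivery_count"] >= self.max_delivery_count:
                dlq.append(message)  # poison message: route to the DLQ
            else:
                message["delivery_count"] += 1
                requeue.append(message)  # restore to visible
        return requeue, dlq
```

Each scan costs O(k log n) for k expired leases rather than O(n) over all in-flight messages, which is the property the Redis sorted set gives you in the distributed version.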
4. Messages survive broker restarts
Solving: Durability. An in-memory queue loses all data on process restart. Durable storage requires persisting messages before acknowledging the producer.
Components:
- Write-Ahead Log (WAL): An append-only file on the broker's local disk. Every message is written to the WAL before being ACKed to the producer.
- Sync policy: fsync per message (safest, slowest), fsync per batch (balanced), or OS-buffered writes (fastest, least durable). Start with fsync per batch.
Request walkthrough:
- Producer calls send_message("orders", body).
- Broker appends the message to the WAL and calls fsync.
- Broker appends the message to the in-memory queue.
- Broker returns { message_id } to the producer.
- On crash and restart, the broker replays the WAL to rebuild in-memory state.
The WAL covers single-broker durability. It does not cover broker hardware failure or network partition. That requires replication, which is addressed in Deep Dive 4. In my experience, interviewers are satisfied when you acknowledge the gap here and explicitly defer replication to the deep dive rather than trying to solve everything in the initial design.
Potential Deep Dives
1. How do we achieve 1M messages/second throughput?
A single broker with synchronous fsync per message tops out around 10,000 to 20,000 writes per second on commodity hardware. The NFR demands 1M per second. That 50-100x gap is the kind of number I'd write on the whiteboard immediately because it makes the case for batching and partitioning before you even draw a diagram.
2. How do we implement visibility timeout and redelivery efficiently?
I always probe this in interviews because the naive implementation looks correct but collapses at scale. With 1M messages per second and a 30-second visibility timeout, the system can have up to 30M in-flight messages at any moment.
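The indexing idea behind the Redis sorted set can be shown in isolation: keep leases ordered by expiry score so the per-tick query touches only expired entries. A stdlib sketch emulating the ZADD / ZRANGEBYSCORE pattern with a sorted list; class and method names are illustrative:

```python
import bisect

class ExpiryIndex:
    """Leases sorted by expiry score. Finding expired entries costs
    O(log n + k) for k expired leases, never a scan of all 30M in flight.
    (In Redis: ZADD on receive, ZRANGEBYSCORE 0 <now> on each tick.)"""

    def __init__(self):
        self.entries = []  # sorted list of (expiry_ts, receipt_handle)

    def add(self, expiry_ts, receipt_handle):
        bisect.insort(self.entries, (expiry_ts, receipt_handle))

    def pop_expired(self, now):
        """Slice off and return every handle whose lease expired by `now`."""
        i = 0
        while i < len(self.entries) and self.entries[i][0] <= now:
            i += 1
        expired, self.entries = self.entries[:i], self.entries[i:]
        return [handle for _, handle in expired]
```

The key property: the cost of a tick is proportional to how many leases actually expired, which at steady state is small, not to the 30M messages currently in flight.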
3. How do we partition the queue without breaking ordering guarantees?
I always push on the ordering requirement in interviews: most teams say they need strict FIFO but actually need FIFO-per-key, which opens up horizontal partitioning. The distinction changes the architecture completely.
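FIFO-per-key dispatch can be sketched as: one FIFO per group ID, with at most one message per group in flight at a time. A minimal in-process sketch with illustrative names, ignoring partitioning and persistence:

```python
from collections import defaultdict, deque

class GroupedDispatcher:
    """Per-key ordering: same-group messages go out in send order, one
    in flight at a time; different groups are dispatched in parallel."""

    def __init__(self):
        self.pending = defaultdict(deque)  # group_id -> FIFO of messages
        self.in_flight = set()             # group_ids with an unacked message

    def publish(self, group_id, message):
        self.pending[group_id].append(message)

    def poll(self):
        """Hand out at most one message per group with nothing in flight."""
        batch = []
        for group_id, q in self.pending.items():
            if q and group_id not in self.in_flight:
                batch.append((group_id, q.popleft()))
                self.in_flight.add(group_id)
        return batch

    def ack(self, group_id):
        self.in_flight.discard(group_id)  # unblocks the group's next message
```

Blocking a group until its in-flight message is acknowledged is exactly what preserves per-key order under redelivery: a retried message cannot be overtaken by its successor.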
4. How do we detect and recover from broker failures?
Final Architecture
The single most important insight in this architecture: durability and throughput are not opposites. Group-commit WAL batching decouples the per-message write cost from the fsync cost, letting a small broker cluster sustain 1M messages per second with full crash recovery and at-least-once delivery guarantees. If you take one thing from this article to an interview, make it this: batching turns an I/O-bound system into a CPU-bound one, and CPU scales horizontally.
Interview Cheat Sheet
- A distributed message queue decouples producers and consumers by storing messages durably until a consumer explicitly acknowledges them, providing at-least-once delivery semantics.
- The visibility timeout is the core mechanic: when a consumer receives a message, it becomes invisible to other consumers for a configurable period. If the consumer does not call delete_message before timeout expiry, the broker redelivers the message.
- Use a Redis sorted set (ZRANGEBYSCORE) to track visibility timeouts. The naive linear scan collapses at 30M in-flight messages; ZRANGEBYSCORE is O(log n + k) where k is the number of expired leases, not the total in-flight count.
- A single broker with per-message fsync tops out at 10,000 to 20,000 writes per second. Group-commit batching (flush every 10ms or 1,000 messages) reduces fsync calls by 100x and scales a single broker to 100K messages per second.
- Reach 1M messages per second by partitioning queues across 10+ brokers, each running group-commit WAL writes. The partition router hashes the message's partition key modulo the number of partitions.
- FIFO queues with message group IDs give per-key ordering without sacrificing horizontal throughput. All messages with the same group ID are delivered in send order to one consumer at a time; messages across different group IDs run in parallel.
- Use async replication (ISR) over synchronous replication. Synchronous replication blocks every producer on follower network latency, capping throughput at roughly 1,000 writes per second per partition (1ms round-trip). Async replication ACKs the producer immediately and replicates in the background.
- When a broker fails, the Raft controller elects a new leader from the ISR set. Consumers rebalance their partition assignments to the new leader. Rebalance takes 1 to 5 seconds with no messages dispatched to the affected group during that window.
- The acks setting trades durability for latency. acks=leader accepts a small data loss window on leader failure. acks=all waits for all ISR members to confirm, eliminating data loss at the cost of write latency equal to the slowest ISR follower.
- Choose a traditional queue (SQS, RabbitMQ) over a log-based stream (Kafka) when you need automatic message deletion after acknowledgement, per-message visibility control, and simple dead-letter handling. Choose Kafka when you need long-term message retention, consumer-controlled offset replay, or fan-out to many independent consumer groups reading the same stream from the beginning.
- The receipt_handle is an opaque lease token, not the raw message_id. An expired handle is invalid, preventing a slow consumer from deleting a message that another consumer has already claimed.
- A dead-letter queue receives messages that exceed max_delivery_count. Monitor DLQ depth as an operational metric: a rising DLQ depth signals a consumer bug or a serialization change that broke message deserialization.
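The partition router in the cheat sheet reduces to a stable hash of the partition key modulo the partition count. A minimal sketch; MD5 is used here only as a stable, non-cryptographic hash so the same key maps to the same partition across processes (unlike Python's built-in hash):

```python
import hashlib

def route_to_partition(partition_key, num_partitions):
    """Map a partition key (e.g. a message_group_id) to a broker
    partition. Same key -> same partition, preserving per-key order."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note the usual caveat: hashing modulo N reshuffles keys when the partition count changes, so growing a FIFO queue's partition count requires draining or consistent hashing.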