Message Queue

What is a distributed message queue?

A distributed message queue decouples producers from consumers: a producer writes a message to a named queue, and one or more consumers read it independently and asynchronously. The engineering challenge isn't the queue itself; it's the combination of visibility timeout mechanics (what happens when a consumer crashes mid-processing) and durable storage that survives broker failures without sacrificing throughput.

This question tests WAL-based persistence, partition design for horizontal scale, leader election under failure, and at-least-once delivery semantics.

Functional Requirements

Core Requirements

Producers publish messages to named queues.
Consumers receive and acknowledge messages.
Unacknowledged messages are redelivered after a visibility timeout.
Messages are durable: they survive broker restarts.

Scope Exclusions

Fan-out pub-sub semantics (closer to Kafka/SNS).
Message transformation or schema validation.

Non-Functional Requirements

Core Requirements

Throughput: 1M messages/second aggregate across all queues.
Latency: Message visible to consumer within 100ms of publish.
Durability: No message loss on a single broker failure.
Availability: 99.99% uptime (under 52 minutes of downtime per year).
At-least-once delivery: Every published message is consumed at least once; zero deliveries never occurs.

Below the Line

Exactly-once delivery
Strict global message ordering across partitions
Cross-region active-active replication

The hardest engineering problem in scope: Implementing atomic visibility-timeout tracking while sustaining 1M messages/second is the central challenge. The broker must efficiently identify expired in-flight leases across thousands of concurrent consumers without scanning every in-flight message on every tick.

I'd front-load this constraint in any interview because it kills the naive approach immediately. Most candidates design a clean visibility tracker, then realize at scale it scans 30M in-flight messages per tick.

Exactly-once delivery is deferred because it requires distributed transactions or consumer-side idempotency, which belongs in the application layer rather than the queue. To add it: assign each message a deduplication ID, maintain a TTLed seen-IDs set in Redis, and reject reprocessed IDs on consumer acknowledgement.

Strict global ordering across partitions is deferred because it forces a single partition, collapsing horizontal throughput. To add it: use FIFO queues with message group IDs (covered in Deep Dive 3), which provide per-key ordering while allowing cross-key parallelism.

Cross-region active-active replication is deferred because it adds cross-datacenter round-trips to the synchronous write path. To add it: use async replication to a secondary region, accepting a recovery point objective of a few seconds in exchange for local write performance.

Core Entities

Queue: Named channel with configuration including visibility timeout, retention period, max message size, and an optional dead-letter queue reference.
Message: Payload unit with a unique ID, body (opaque bytes), user-defined attributes, delivery count, and a visibility deadline timestamp.
Consumer: Client that polls a queue and holds an opaque receipt_handle token while processing; identified by consumer_group_id.
DeadLetterQueue (DLQ): A standard queue that receives any message exceeding its source queue's max_delivery_count.

Full schema is deferred to the deep dives. The critical relationship: a Message belongs to exactly one Queue, and a DLQ is simply another Queue with a source queue pointer.

API Design

This system exposes an SDK-style API. The calling application links against a client library that manages connection pooling, retries, and serialization.

FR 1 - Producer publishes a message:

send_message(
  queue_url: string,
  body: bytes,
  delay_seconds?: int = 0,
  message_group_id?: string
) -> { message_id: string }

delay_seconds defers visibility to consumers by the specified number of seconds. Use it for debounce patterns or scheduled jobs without a separate scheduler service.

FR 2 - Consumer receives messages:

receive_messages(
  queue_url: string,
  max_messages: int,         // 1-10
  visibility_timeout: int    // seconds, up to 43200
) -> [{ message_id, body, attributes, receipt_handle, delivery_count }]

Batch up to 10 messages per call to amortize network round-trips. The receipt_handle is an opaque token the consumer presents to delete or extend the message's visibility window. Return delivery_count so consumers can detect redeliveries and apply idempotency logic. I always expose delivery_count in the response because it's the one field that lets consumers distinguish a first attempt from a retry without maintaining external state.

FR 3 - Consumer acknowledges a message:

delete_message(
  queue_url: string,
  receipt_handle: string
) -> void

Deletion is the acknowledgement. A message is only removed from the queue when the consumer explicitly calls delete_message with a valid receipt handle.

FR 4 - Consumer extends visibility for long-running jobs:

change_visibility(
  queue_url: string,
  receipt_handle: string,
  new_timeout: int           // seconds
) -> void

Use change_visibility as a heartbeat: call it periodically when processing takes longer than the initial visibility timeout to prevent redelivery. This is the correct pattern for long-running jobs rather than setting an artificially high initial timeout.

Traditional queue vs log-based stream: Use a queue (SQS, RabbitMQ) when you need automatic message deletion after acknowledgement, per-message visibility control, and a dead-letter queue for poison messages. Use a log-based stream (Kafka) when you need long-term message retention, consumer-controlled offset replay, or fan-out to many independent consumer groups reading the same stream from the beginning. The delete-on-ack model we are designing here means consumers cannot replay past messages; if replay is a requirement, the answer is Kafka, not this system.

High-Level Design

1. Producer publishes to a named queue

Solving: The basic write path. A producer sends a message and the broker stores it for future retrieval.

Components:

Producer Client: Application code calling send_message. Connects to the broker over TCP.
Broker: Single process. Accepts messages, assigns a unique ID, and appends to a named in-memory queue.
In-Memory Queue: A FIFO data structure per named queue. Fast but not yet durable.

Request walkthrough:

Producer calls send_message("orders", body).
Broker receives the message and assigns a UUID.
Broker appends the message to the target queue's in-memory FIFO structure.
Broker returns { message_id } to the producer.

This diagram shows the write path only. Consumer delivery and visibility timeout mechanics come next. Notice we start deliberately simple: one broker, in-memory only, no durability. This is the baseline that everything else evolves from.

2. Consumer receives and acknowledges messages with visibility timeout

Solving: The consumer pull path and the core visibility timeout mechanic. When a consumer receives a message, that message must become invisible to other consumers until either acknowledged or the timeout expires.

Components:

Consumer Client: Application code calling receive_messages. Long-polls the broker.
Visibility Tracker: An in-memory map of receipt_handle -> (message_id, expiry_timestamp).
Receipt Handle: An opaque token encoding the consumer's lease on the message.

Request walkthrough:

Consumer calls receive_messages("orders", max_messages=5, visibility_timeout=30).
Broker locks the next 5 available (non-invisible) messages.
Broker generates a unique receipt_handle per message.
Broker stores receipt_handle -> (message_id, now + 30s) in the visibility tracker.
Broker returns the messages and their receipt handles to the consumer.
Consumer processes the message and calls delete_message(receipt_handle).
Broker removes the message from the queue and the visibility tracker.

The receipt_handle is not just the message_id. It encodes the consumer's lease on the specific delivery attempt. If a consumer receives the same message twice after redelivery, the old receipt handle is invalid. This prevents a slow consumer from accidentally deleting a message that a different consumer has already claimed and acknowledged.

3. Unacknowledged messages are redelivered

Solving: The failure path. If a consumer crashes or stalls, the visibility timeout expires and the broker must redeliver the message.

Components:

Redelivery Scanner: A background process that scans the visibility tracker for expired leases.
DLQ Router: Checks if a message has exceeded max_delivery_count before requeue. If so, routes to the dead-letter queue.

Request walkthrough:

Consumer A receives a message with a 30-second visibility timeout.
Consumer A crashes before calling delete_message.
At now >= expiry_timestamp, the Redelivery Scanner detects the expired lease.
Scanner checks: is delivery_count >= max_delivery_count? If yes, routes to the DLQ. If no, increments delivery_count and restores the message to visible.
Consumer B picks up the redelivered message on the next poll.

What is a distributed message queue?

This question tests WAL-based persistence, partition design for horizontal scale, leader election under failure, and at-least-once delivery semantics.

Functional Requirements

Core Requirements

Producers publish messages to named queues.
Consumers receive and acknowledge messages.
Unacknowledged messages are redelivered after a visibility timeout.
Messages are durable: they survive broker restarts.

Scope Exclusions

Fan-out pub-sub semantics (closer to Kafka/SNS).
Message transformation or schema validation.

Non-Functional Requirements

Core Requirements

Throughput: 1M messages/second aggregate across all queues.
Latency: Message visible to consumer within 100ms of publish.
Durability: No message loss on a single broker failure.
Availability: 99.99% uptime (under 52 minutes of downtime per year).
At-least-once delivery: Every published message is consumed at least once; zero deliveries never occurs.

Below the Line

Exactly-once delivery
Strict global message ordering across partitions
Cross-region active-active replication

The hardest engineering problem in scope: Implementing atomic visibility-timeout tracking while sustaining 1M messages/second is the central challenge. The broker must efficiently identify expired in-flight leases across thousands of concurrent consumers without scanning every in-flight message on every tick.

Core Entities

Queue: Named channel with configuration including visibility timeout, retention period, max message size, and an optional dead-letter queue reference.
Message: Payload unit with a unique ID, body (opaque bytes), user-defined attributes, delivery count, and a visibility deadline timestamp.
Consumer: Client that polls a queue and holds an opaque receipt_handle token while processing; identified by consumer_group_id.
DeadLetterQueue (DLQ): A standard queue that receives any message exceeding its source queue's max_delivery_count.

Full schema is deferred to the deep dives. The critical relationship: a Message belongs to exactly one Queue, and a DLQ is simply another Queue with a source queue pointer.

API Design

This system exposes an SDK-style API. The calling application links against a client library that manages connection pooling, retries, and serialization.

FR 1 - Producer publishes a message:

send_message(
  queue_url: string,
  body: bytes,
  delay_seconds?: int = 0,
  message_group_id?: string
) -> { message_id: string }

delay_seconds defers visibility to consumers by the specified number of seconds. Use it for debounce patterns or scheduled jobs without a separate scheduler service.

FR 2 - Consumer receives messages:

receive_messages(
  queue_url: string,
  max_messages: int,         // 1-10
  visibility_timeout: int    // seconds, up to 43200
) -> [{ message_id, body, attributes, receipt_handle, delivery_count }]

FR 3 - Consumer acknowledges a message:

delete_message(
  queue_url: string,
  receipt_handle: string
) -> void

Deletion is the acknowledgement. A message is only removed from the queue when the consumer explicitly calls delete_message with a valid receipt handle.

FR 4 - Consumer extends visibility for long-running jobs:

change_visibility(
  queue_url: string,
  receipt_handle: string,
  new_timeout: int           // seconds
) -> void

High-Level Design

1. Producer publishes to a named queue

Solving: The basic write path. A producer sends a message and the broker stores it for future retrieval.

Components:

Producer Client: Application code calling send_message. Connects to the broker over TCP.
Broker: Single process. Accepts messages, assigns a unique ID, and appends to a named in-memory queue.
In-Memory Queue: A FIFO data structure per named queue. Fast but not yet durable.

Request walkthrough:

Producer calls send_message("orders", body).
Broker receives the message and assigns a UUID.
Broker appends the message to the target queue's in-memory FIFO structure.
Broker returns { message_id } to the producer.

2. Consumer receives and acknowledges messages with visibility timeout

Components:

Consumer Client: Application code calling receive_messages. Long-polls the broker.
Visibility Tracker: An in-memory map of receipt_handle -> (message_id, expiry_timestamp).
Receipt Handle: An opaque token encoding the consumer's lease on the message.

Request walkthrough:

Consumer calls receive_messages("orders", max_messages=5, visibility_timeout=30).
Broker locks the next 5 available (non-invisible) messages.
Broker generates a unique receipt_handle per message.
Broker stores receipt_handle -> (message_id, now + 30s) in the visibility tracker.
Broker returns the messages and their receipt handles to the consumer.
Consumer processes the message and calls delete_message(receipt_handle).
Broker removes the message from the queue and the visibility tracker.

3. Unacknowledged messages are redelivered

Solving: The failure path. If a consumer crashes or stalls, the visibility timeout expires and the broker must redeliver the message.

Components:

Redelivery Scanner: A background process that scans the visibility tracker for expired leases.
DLQ Router: Checks if a message has exceeded max_delivery_count before requeue. If so, routes to the dead-letter queue.

Request walkthrough:

Consumer A receives a message with a 30-second visibility timeout.
Consumer A crashes before calling delete_message.
At now >= expiry_timestamp, the Redelivery Scanner detects the expired lease.
Scanner checks: is delivery_count >= max_delivery_count? If yes, routes to the DLQ. If no, increments delivery_count and restores the message to visible.
Consumer B picks up the redelivered message on the next poll.

Message Queue

What is a distributed message queue?

Functional Requirements

Core Requirements

Scope Exclusions

Non-Functional Requirements

Core Requirements

Below the Line

Core Entities

API Design

High-Level Design

1. Producer publishes to a named queue

2. Consumer receives and acknowledges messages with visibility timeout

3. Unacknowledged messages are redelivered

Continue Reading with Premium

Comments

Message Queue

What is a distributed message queue?

Functional Requirements

Core Requirements

Scope Exclusions

Non-Functional Requirements

Core Requirements

Below the Line

Core Entities

API Design

High-Level Design

1. Producer publishes to a named queue

2. Consumer receives and acknowledges messages with visibility timeout

3. Unacknowledged messages are redelivered

Continue Reading with Premium

Comments