Dead-letter queue
How dead-letter queues isolate poison-pill messages, prevent consumer stalls, enable inspection and replay, and how to set up DLQ depth alerting before backlog silently grows.
TL;DR
- A dead-letter queue (DLQ) is a secondary queue where messages land after exhausting all retry attempts, isolating failures so the main queue keeps flowing.
- Without a DLQ, a single malformed "poison pill" message blocks your consumer forever, stalling every message behind it.
- DLQ depth is a critical operational alert: any message in the DLQ means something failed that needs human attention.
- After fixing the root cause, you replay DLQ messages back to the main queue, but replay order may differ from original order.
- Every major broker supports DLQs natively (SQS redrive policy, RabbitMQ dead-letter exchange) or by convention (Kafka dead-letter topics).
The Problem
It's Monday morning. Your order processing pipeline has been running fine for months. Then a partner API changes their response format without warning. One order event now contains a nested field your consumer doesn't expect. The consumer throws a deserialization exception, the message goes back to the queue, the consumer picks it up again, same exception, back to the queue.
Meanwhile, 3,000 perfectly valid orders are stacking up behind this one broken message. Your queue depth alarm fires. Customers are emailing support asking why their orders aren't confirmed. The consumer is alive, healthy, burning CPU, and accomplishing nothing. Your monitoring shows the consumer is running and the queue is draining (messages are being dequeued and re-enqueued), so everything looks "normal" until someone checks the actual processing success rate.
This is the poison pill problem. One message that will never succeed blocks everything behind it. The consumer's retry logic, designed to handle transient failures, becomes an infinite loop for permanent failures. The longer it runs, the deeper the backlog grows.
The forces in tension: you need reliable message processing (every message must be handled), but you also need forward progress (the queue can't stall). Without a mechanism to separate "messages that will succeed eventually" from "messages that will never succeed," you're forced to choose between dropping messages and blocking the queue. Neither is acceptable for business-critical pipelines.
The fix isn't smarter retry logic. The fix is a parking lot for messages that have proven they can't be processed right now.
One-Line Definition
A dead-letter queue isolates unprocessable messages from the main queue after N failed attempts, so one poison pill cannot stall the entire pipeline.
Analogy
Think about a post office sorting facility. Letters flow through automated sorting machines at high speed. Occasionally a letter has an unreadable address, or it's too thick for the machine, or the zip code doesn't match any known route. The machine can't process it, but it also can't just stop the entire belt.
So there's a bin at the end of the line. Letters that fail sorting three times get routed into that bin. A human picks them up later, inspects them, fixes what they can, and re-feeds them into the machine. The rest of the mail keeps moving the entire time.
That bin is your DLQ. The human inspector is your ops team or your automated replay tool. The key insight: the bin doesn't fix the problem. It isolates the problem so the rest of the system isn't held hostage by it.
Solution Walkthrough
The DLQ sits beside the main queue. The broker (or the consumer) tracks how many times each message has been attempted. When attempts exceed the configured maximum, the message is moved to the DLQ instead of being redelivered to the consumer. The consumer never blocks, the main queue keeps draining, and the failed message is safely parked for later investigation.
This is a simple but powerful architectural pattern. You don't need custom infrastructure. Every major message broker supports this flow natively or with a thin consumer-side wrapper.
Here's the step-by-step flow for a single message:
1. Producer publishes a message to the main queue.
2. Consumer picks it up, attempts processing.
3. Processing fails (exception, timeout, dependency down).
4. Broker redelivers the message. The attempt counter increments.
5. Steps 2-4 repeat up to maxRetries (typically 3-5).
6. On the final failure, the broker moves the message to the configured DLQ.
7. The main queue advances. Next message is delivered to the consumer.
8. Alerting fires on DLQ depth > 0. An engineer investigates.
9. After fixing the root cause, the engineer replays DLQ messages back to the main queue.
The critical insight: the consumer never blocks on the poison pill. After maxRetries attempts, the message is out of the way. The 3,000 valid messages behind it flow through normally.
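To make the flow concrete, here is a minimal in-memory sketch of it. It is not a broker client: the queue is a plain array, and `drainWithDlq`, `parseOrder`, and the sample orders are hypothetical names invented for this illustration. It assumes redelivery puts the failed message back at the head of the queue, which is the worst case for a poison pill.

```typescript
// In-memory sketch of the retry-then-park flow. Not a real broker client.
interface QueuedMessage {
  id: string;
  body: string;
  attempts: number; // delivery attempts so far
}

interface DrainResult {
  processed: string[]; // ids handled successfully, in completion order
  deadLettered: QueuedMessage[]; // messages parked after maxRetries attempts
}

function drainWithDlq(
  messages: { id: string; body: string }[],
  handler: (body: string) => void,
  maxRetries: number,
): DrainResult {
  const mainQueue: QueuedMessage[] = messages.map((m) => ({ ...m, attempts: 0 }));
  const processed: string[] = [];
  const deadLettered: QueuedMessage[] = [];

  while (mainQueue.length > 0) {
    const msg = mainQueue.shift()!;
    msg.attempts += 1;
    try {
      handler(msg.body);
      processed.push(msg.id);
    } catch {
      if (msg.attempts >= maxRetries) {
        deadLettered.push(msg); // park it; the main queue moves on
      } else {
        mainQueue.unshift(msg); // redeliver at the head of the queue
      }
    }
  }
  return { processed, deadLettered };
}

// Hypothetical handler: rejects any payload that is not valid JSON.
const parseOrder = (body: string): void => {
  JSON.parse(body);
};

const result = drainWithDlq(
  [
    { id: "order-1", body: '{"sku":"A"}' },
    { id: "order-2", body: "{not json" }, // the poison pill
    { id: "order-3", body: '{"sku":"B"}' },
  ],
  parseOrder,
  3,
);

// Any DLQ depth above zero should raise an alert.
const shouldAlert = result.deadLettered.length > 0;
console.log(result.processed, result.deadLettered.map((m) => m.id), shouldAlert);
```

With `maxRetries` set to 3, `order-2` is attempted three times and then parked, while `order-1` and `order-3` are both processed. Without the `attempts >= maxRetries` branch, the loop would never get past `order-2`.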
Implementation Sketch
Here's a consumer-side DLQ implementation for a system where the broker doesn't handle DLQ routing natively (like Kafka):
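// Consumer-side DLQ routing (Kafka pattern).
// Assumes `consumer`, `producer`, `handleOrder`, and the `Message` type come from
// your Kafka client and application code, and that header values are strings.
const maxRetries = 3;

// Attempt counter written by the retry branch below; 0 on first delivery.
function getRetryCount(headers: Record<string, string>): number {
  return Number(headers["x-retry-count"] ?? 0);
}

async function processMessage(msg: Message): Promise<void> {
  const attempts = getRetryCount(msg.headers) + 1;
  try {
    await handleOrder(msg.value);
  } catch (error) {
    // `error` is `unknown` in strict TypeScript, so narrow it before reading `.message`.
    const reason = error instanceof Error ? error.message : String(error);
    if (attempts >= maxRetries) {
      // Move to dead-letter topic with failure metadata
      await producer.send({
        topic: "orders.DLT",
        messages: [{
          key: msg.key,
          value: msg.value,
          headers: {
            ...msg.headers,
            "x-original-topic": "orders",
            "x-failure-reason": reason,
            "x-failed-at": new Date().toISOString(),
            "x-attempt-count": String(attempts),
          },
        }],
      });
    } else {
      // Retry: publish back with incremented count
      await producer.send({
        topic: "orders",
        messages: [{
          key: msg.key,
          value: msg.value,
          headers: { ...msg.headers, "x-retry-count": String(attempts) },
        }],
      });
    }
  }
  // Commit outside the try block so a failed commit is not mistaken for a
  // processing failure. Reached on success, retry, and dead-lettering alike.
  await consumer.commit(msg.offset);
}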
Key details: the consumer commits the offset on the main topic regardless of whether the message succeeded or went to the DLT. This is what unblocks the main queue. The failure metadata (x-failure-reason, x-failed-at, x-attempt-count) makes DLQ messages inspectable without digging through logs.
Message metadata preservation
Every DLQ message should carry enough context to answer three questions without opening a log aggregator: (1) What failed? (2) Why did it fail? (3) Where did it come from?
Essential headers for DLQ messages:
| Header | Purpose | Example |
|---|---|---|
| x-original-topic | Source queue/topic for replay routing | orders |
| x-failure-reason | Exception message from last attempt | NullPointerException at line 42 |
| x-attempt-count | Total delivery attempts | 3 |
| x-first-failure-at | When the first attempt failed | 2026-04-04T09:15:00Z |
| x-dlq-entry-at | When the message entered the DLQ | 2026-04-04T09:15:08Z |
| x-consumer-version | Consumer build/version that failed | order-svc-v2.3.1 |
| x-correlation-id | Trace ID for distributed tracing | abc-123-def-456 |
The correlation ID is especially valuable. When you find a DLQ message, you can trace the entire request chain through your observability stack (Jaeger, Datadog, etc.) to see exactly what happened upstream.
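As a rough sketch of how these headers might be assembled and then used for replay routing, consider the helper below. `buildDlqHeaders`, `replayTarget`, `FailureContext`, and the 512-character cap on the failure reason are all assumptions made for this example, not part of any broker SDK. The sample error text is invented.

```typescript
// Sketch: build the DLQ headers from the table, then read them back for replay.
type MessageHeaders = Record<string, string>;

interface FailureContext {
  originalTopic: string;
  error: unknown;
  attemptCount: number;
  firstFailureAt: Date;
  consumerVersion: string;
  now?: Date; // injectable clock, mainly for tests
}

function buildDlqHeaders(original: MessageHeaders, ctx: FailureContext): MessageHeaders {
  const reason = ctx.error instanceof Error ? ctx.error.message : String(ctx.error);
  return {
    ...original, // keeps x-correlation-id and anything else the producer set
    "x-original-topic": ctx.originalTopic,
    "x-failure-reason": reason.slice(0, 512), // arbitrary cap; header size limits vary by broker
    "x-attempt-count": String(ctx.attemptCount),
    "x-first-failure-at": ctx.firstFailureAt.toISOString(),
    "x-dlq-entry-at": (ctx.now ?? new Date()).toISOString(),
    "x-consumer-version": ctx.consumerVersion,
  };
}

// Replay routing: send the message back to wherever it came from.
function replayTarget(headers: MessageHeaders): string {
  const topic = headers["x-original-topic"];
  if (!topic) throw new Error("DLQ message is missing x-original-topic");
  return topic;
}

const dlqHeaders = buildDlqHeaders(
  { "x-correlation-id": "abc-123-def-456" },
  {
    originalTopic: "orders",
    error: new Error("Unexpected field: shipping.address.geo"), // invented example
    attemptCount: 3,
    firstFailureAt: new Date("2026-04-04T09:15:00Z"),
    consumerVersion: "order-svc-v2.3.1",
    now: new Date("2026-04-04T09:15:08Z"),
  },
);
console.log(replayTarget(dlqHeaders), dlqHeaders["x-dlq-entry-at"]);
```

Spreading the original headers first means the producer's correlation ID survives the trip into the DLQ, and failing loudly in `replayTarget` keeps a replay tool from guessing a destination when the routing header is missing.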
DLQ Configuration by Broker
Most message brokers support DLQ natively. The configuration differs, but the concept is identical.
AWS SQS (Redrive Policy)
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    "maxReceiveCount": 3
  }
}
SQS tracks the receive count automatically. Once a message has been received 3 times without being deleted (each time, the visibility timeout expired before the consumer deleted it), SQS moves it to orders-dlq instead of delivering it again. No consumer-side logic is needed, which makes this the simplest DLQ setup of the brokers covered here.
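One practical wrinkle when setting this through the API rather than the console: SQS queue attribute values are strings, so the redrive policy is passed as serialized JSON. The sketch below only builds that attribute map. `redriveAttributes` is a hypothetical helper, not an AWS SDK function, and the ARN is a placeholder.

```typescript
// Sketch: assemble the attribute map for a SetQueueAttributes / CreateQueue call.
// The helper name and validation are illustrative, not part of the AWS SDK.
interface RedrivePolicy {
  deadLetterTargetArn: string;
  maxReceiveCount: number;
}

function redriveAttributes(policy: RedrivePolicy): Record<string, string> {
  const count = policy.maxReceiveCount;
  if (!Number.isInteger(count) || count < 1 || count > 1000) {
    throw new RangeError("maxReceiveCount must be an integer between 1 and 1000");
  }
  // Attribute values are strings, so the policy object is serialized.
  return { RedrivePolicy: JSON.stringify(policy) };
}

const attributes = redriveAttributes({
  deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:orders-dlq", // placeholder ARN
  maxReceiveCount: 3,
});
console.log(attributes["RedrivePolicy"]);
```

The resulting map is what you would hand to your SQS client as the queue's attributes. Validating the count up front surfaces a bad value before the API call rather than after it.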
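SQS also provides a native DLQ redrive ("Start DLQ redrive" in the console, StartMessageMoveTask in the API) that moves messages from the DLQ back to the source queue or to another queue you choose. Redrive moves whatever is in the DLQ: it does not filter or modify messages, so replaying only a subset requires your own consumer that reads the DLQ and republishes selectively. The message body and message attributes are carried over.
One detail that catches people: the DLQ in SQS is just a regular SQS queue with a different name. It has its own retention policy, its own visibility timeout, and its own consumers. If you read a message from the DLQ and don't delete it, the DLQ's own maxReceiveCount applies (if configured with its own redrive policy). Set the DLQ's retention to 14 days, the SQS maximum, so failed messages don't expire before anyone looks at them. For standard queues, expiry is based on the original enqueue time, so the DLQ's retention should be longer than the source queue's.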
Kafka (Dead Letter Topic)
Kafka has no native DLQ. You implement it in the consumer (as shown in the implementation sketch above). A common convention is naming the DLT {original-topic}.DLT. Frameworks such as Spring Kafka have built-in DLT support that handles retry counting and DLT routing automatically, and Kafka Connect offers a comparable dead letter queue option for sink connectors:
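import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Spring Kafka: automatic DLT routing
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, String> template) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template);
    // Retries twice with 1s backoff, then publishes to the dead-letter topic
    // ({topic}.DLT here; the default suffix differs across Spring Kafka versions, so check yours).
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2));
}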
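Because Kafka DLTs are regular topics, they have their own retention and compaction settings. Set the DLT retention to at least 14 days (longer than the main topic, which might compact or expire sooner). DLT throughput is much lower than the main topic's, so fewer partitions would be enough in principle. However, Spring Kafka's default recoverer publishes each record to the same partition number it came from, so with that default the DLT needs at least as many partitions as the main topic. To use fewer, supply a custom destination resolver.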
RabbitMQ (Dead Letter Exchange)
Set x-dead-letter-exchange on the original queue. Messages that are rejected without requeue or that expire are routed to the configured exchange with all original headers preserved, plus an x-death header containing failure metadata.
{
  "x-dead-letter-exchange": "orders.dlx",
  "x-dead-letter-routing-key": "orders.failed",
  "x-message-ttl": 30000
}
RabbitMQ also supports dead-lettering on message TTL expiry and queue length overflow, not just consumer rejection. This means you can use DLQs for overflow protection: if the main queue exceeds its maximum length, the default overflow behaviour dead-letters the oldest messages from the head of the queue rather than dropping them. RabbitMQ maintains the x-death header, an array that records each dead-lettering event the message has been through, including the reason, queue name, and timestamp.
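One way to put x-death to work is to read it when deciding whether a message that keeps cycling through a retry queue has failed often enough to be parked for good. The sketch below models only a subset of the fields in an x-death entry. `deathCount`, `shouldPark`, and the sample history are hypothetical, and the exact header contents can vary by RabbitMQ version and queue type, so treat it as a starting point.

```typescript
// Sketch: summarize a message's x-death history.
// Models a subset of the x-death entry fields; sample data is made up.
interface XDeathEntry {
  queue: string;
  reason: "rejected" | "expired" | "maxlen" | "delivery_limit";
  count: number; // times dead-lettered from this queue for this reason
  exchange: string;
  "routing-keys": string[];
}

// Total dead-lettering events for one queue and one reason.
function deathCount(
  entries: XDeathEntry[],
  queue: string,
  reason: XDeathEntry["reason"],
): number {
  return entries
    .filter((e) => e.queue === queue && e.reason === reason)
    .reduce((sum, e) => sum + e.count, 0);
}

// Park the message once consumers have rejected it maxRejections times.
function shouldPark(entries: XDeathEntry[], queue: string, maxRejections: number): boolean {
  return deathCount(entries, queue, "rejected") >= maxRejections;
}

const deathHistory: XDeathEntry[] = [
  {
    queue: "orders",
    reason: "rejected",
    count: 3,
    exchange: "orders.exchange",
    "routing-keys": ["orders.created"],
  },
  {
    queue: "orders.retry",
    reason: "expired",
    count: 2,
    exchange: "orders.retry.exchange",
    "routing-keys": ["orders.created"],
  },
];
console.log(deathCount(deathHistory, "orders", "rejected"), shouldPark(deathHistory, "orders", 3));
```

Counting only `rejected` entries for the main queue keeps TTL expiries in a retry queue from being mistaken for processing failures.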
What Lands in the DLQ
Messages fail for different reasons. Each class of failure needs a different response. The mistake I see most often is treating all DLQ messages the same and replaying them blindly.