Competing consumers
How to scale message processing by running multiple consumers against the same queue. Covers partition assignment, consumer group rebalancing, at-least-once delivery, and idempotency requirements.
TL;DR
- Competing consumers scales message processing by running multiple workers against the same queue, where each message is processed by exactly one worker.
- In SQS, scaling is trivial: add more workers and the queue distributes messages automatically; visibility timeouts guarantee that each in-flight message is hidden from other workers.
- In Kafka, maximum parallelism equals the partition count. You can't have more active consumers than partitions within a consumer group.
- All competing consumer systems deliver at-least-once, so workers must be idempotent: processing the same message twice produces the same result as processing it once.
- Consumer group rebalancing (Kafka) and visibility timeout tuning (SQS) are the two operational knobs that determine reliability and throughput.
The Problem
Your notification service sends emails when orders ship. During normal hours, one worker process handles the load fine. Then Black Friday hits. Order volume jumps 10x. Your single worker's queue depth climbs from 0 to 50,000 messages in minutes. Email delivery latency goes from seconds to hours.
You can't make the worker faster (it's limited by the SMTP server's throughput per connection). You need more workers. But if two workers pull the same message, the customer gets two shipping emails. If a worker crashes mid-processing, the message must not be lost.
The queue acts as a load balancer for messages. Each message goes to exactly one worker. Workers don't know about each other. A crashed worker's unacknowledged messages are re-delivered to healthy workers.
This is the competing consumers pattern: multiple workers "compete" for messages from a shared queue, and the queue guarantees each message is processed by exactly one consumer at a time.
The pattern solves three problems simultaneously:
- Throughput: N workers process N times more messages than a single worker.
- Resilience: a crashed worker's messages are re-delivered to surviving workers automatically.
- Elasticity: you can add or remove workers based on load without reconfiguring the queue.
But it introduces two new challenges that don't exist with a single consumer: duplicate delivery (a message may be processed more than once) and ordering gaps (messages may arrive out of order across workers). Everything in this article is about solving those two challenges at scale.
One-Line Definition
Competing consumers scales message processing horizontally by running multiple workers against a shared queue, where the queue coordinates message assignment so each message is processed by exactly one worker at a time.
Analogy
Think of a bank with multiple teller windows. Customers enter a single line (the queue). When a teller finishes serving one customer, the next customer in line walks up. No customer gets served by two tellers simultaneously. If a teller goes on break, customers keep flowing to the remaining windows. If the line gets long, the bank opens more windows.
The queue is the single line. The tellers are the workers. Opening more windows is adding more consumer instances. The bank doesn't need to redesign the line to add windows; scaling is just a capacity decision.
One important detail in this analogy: the bank doesn't let two tellers serve the same customer. The queue enforces exclusive delivery. And if a teller collapses mid-transaction, the customer returns to the front of the line and the next available teller takes over. That's the retry mechanism.
The analogy breaks at ordering: bank customers don't care if customer #5 finishes before customer #3. But in message processing, ordering sometimes matters. That's where partitions and message keys come in.
Solution Walkthrough
SQS Model: Simple Competing Consumers
With SQS, competing consumers are trivially simple. Multiple processes call ReceiveMessage. SQS handles distribution automatically.
When a worker receives a message, SQS hides it from other workers for the visibility timeout (default 30 seconds). The worker processes the message and calls DeleteMessage. If the worker crashes before deleting, the visibility timeout expires and another worker picks up the message.
Scaling: just add more workers. SQS handles load distribution automatically. No partition assignment concept, no rebalancing, no maximum parallelism limit beyond SQS's own throughput ceiling (nearly unlimited for standard queues).
Ordering: not guaranteed in standard queues. Messages can arrive out of order. FIFO queues preserve order within a MessageGroupId but are limited to 300 messages per second per queue (3,000 with batching).
Visibility timeout tuning is the most important SQS configuration:
- Too short: slow processing causes the timeout to expire, and another worker picks up the message before the first one finishes. Result: duplicate processing.
- Too long: a crashed worker's messages stay hidden for the full timeout before being retried. Result: delayed processing on failure.
The rule of thumb: set visibility timeout to 2-3x your expected processing time. If processing takes 10 seconds, set the timeout to 30 seconds.
The SQS message lifecycle moves between two states: Available (visible to any worker) and InFlight (received but not yet deleted). A message bouncing between Available and InFlight is either a poison pill or a sign that your visibility timeout is too short. Monitor the ApproximateNumberOfMessagesNotVisible metric: if it stays high while ApproximateNumberOfMessagesDeleted stays low, messages are being received but not successfully processed.
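The Available/InFlight lifecycle can be modeled with a tiny in-memory queue using a logical clock (a sketch for illustration only; `VisibilityQueue` and its methods are hypothetical, not an AWS API):

```typescript
interface Message { id: string; body: string; receiveCount: number; }

// Minimal in-memory model of SQS-style visibility timeouts.
class VisibilityQueue {
  private available: Message[] = [];
  private inFlight = new Map<string, { msg: Message; hiddenUntil: number }>();
  private clock = 0; // logical seconds

  constructor(private visibilityTimeout: number) {}

  send(id: string, body: string): void {
    this.available.push({ id, body, receiveCount: 0 });
  }

  // A worker "competes" for the next visible message; receiving hides it.
  receive(): Message | undefined {
    const msg = this.available.shift();
    if (!msg) return undefined;
    msg.receiveCount++;
    this.inFlight.set(msg.id, { msg, hiddenUntil: this.clock + this.visibilityTimeout });
    return msg;
  }

  // Successful processing deletes the message permanently.
  delete(id: string): void {
    this.inFlight.delete(id);
  }

  // Advance time; expired in-flight messages become visible again.
  // This is exactly what happens when a worker crashes without deleting.
  tick(seconds: number): void {
    this.clock += seconds;
    for (const [id, entry] of this.inFlight) {
      if (entry.hiddenUntil <= this.clock) {
        this.inFlight.delete(id);
        this.available.push(entry.msg);
      }
    }
  }
}
```

Walking through a crash: worker A receives a message and dies; no other worker can see it until `tick(30)` expires the timeout, after which worker B receives the same message with `receiveCount` incremented, which is how re-delivery and poison-pill counting work.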
Kafka Model: Consumer Groups and Partitions
Kafka fundamentally changes the competing consumers model. Instead of a shared queue, messages live in ordered partitions within a topic.
The key constraint: each partition is assigned to exactly one consumer within a group at any time.
- 6 partitions, 3 consumers: each consumer handles 2 partitions
- 6 partitions, 6 consumers: each consumer handles 1 partition (maximum parallelism)
- 6 partitions, 8 consumers: 6 consumers are active, 2 sit idle wasting resources
Maximum parallelism equals the partition count. If you want 20 parallel workers, you need at least 20 partitions. You cannot add a 21st consumer and have it do useful work. Choose your partition count carefully at topic creation time; increasing it later changes message routing and can break ordering guarantees.
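The partition-to-consumer arithmetic can be made concrete with a simplified assignor (a sketch; Kafka's real range and sticky assignors are more elaborate, but the idle-consumer effect is the same):

```typescript
// Round-robin assignment of partitions to consumers, a simplified stand-in
// for what Kafka's group coordinator does during a rebalance.
function assignPartitions(
  partitionCount: number,
  consumerIds: string[],
): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const id of consumerIds) assignment.set(id, []);
  for (let p = 0; p < partitionCount; p++) {
    // Each partition goes to exactly one consumer in the group.
    const owner = consumerIds[p % consumerIds.length];
    assignment.get(owner)!.push(p);
  }
  return assignment;
}
```

With 6 partitions and 3 consumers, each consumer gets 2 partitions; with 8 consumers, two of them end up with an empty assignment and sit idle.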
Kafka preserves order within a partition. If ordering matters for a key (all events for user X processed in order), route events for the same key to the same partition using the message key. All messages with key user-123 go to the same partition, which is assigned to exactly one consumer.
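Key-based routing is nothing more than a deterministic hash of the key modulo the partition count. A sketch using FNV-1a as a stand-in for Kafka's actual murmur2 partitioner (the property that matters is the same: identical keys always land on the same partition):

```typescript
// Deterministic key-to-partition routing (FNV-1a hash for illustration;
// Kafka's default partitioner uses murmur2).
function partitionFor(key: string, partitionCount: number): number {
  let hash = 2166136261;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return Math.abs(hash) % partitionCount;
}
```

This also shows why adding partitions later breaks ordering: `partitionFor("user-123", 12)` and `partitionFor("user-123", 24)` generally differ, so new events for user-123 start landing on a different partition than their history.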
Multiple Consumer Groups
One of Kafka's most powerful features is that multiple consumer groups can independently consume the same topic. Each group maintains its own offset position and receives every message.
The email-sender group has 2 consumers processing notifications. The analytics group has 1 consumer building dashboards. The fraud-check group has 2 consumers running fraud models. All three groups read every order event independently, at their own pace. If the analytics consumer falls behind, it doesn't affect the email or fraud workers.
This is impossible with SQS's standard queue model. Once a consumer deletes a message from SQS, it's gone. To achieve the same multi-consumer pattern with SQS, you'd need SNS fan-out to multiple queues, one per consumer type. That works, but it's more operational overhead than Kafka's native multi-group model.
Message Ordering Considerations
The ordering question comes up in every interview discussion about competing consumers:
- SQS Standard Queue: no ordering guarantee at all. Messages can arrive in any order. If you send messages A, B, C, a consumer might receive B, A, C.
- SQS FIFO Queue: preserves order within a MessageGroupId. All messages with the same group ID are processed one at a time, in order. Different group IDs can be processed in parallel. Throughput cap: 300 messages/second per queue without batching, 3,000 with batching; high throughput mode raises the limit much further.
- Kafka: preserves order within a partition. Route messages with the same key to the same partition. Within that partition, messages are consumed in the exact order they were produced.
If you need global ordering (all messages across all partitions/groups in order), you're limited to a single consumer. This defeats the purpose of competing consumers. The good news: global ordering is almost never actually required. Per-entity ordering (all events for user X in order) is sufficient for the vast majority of use cases.
Rebalancing
When consumers join or leave a Kafka consumer group, the group coordinator triggers a rebalance: partition assignments are redistributed among the remaining consumers.
Eager rebalancing (the default before Kafka 2.4) is a "stop the world" event: all consumers stop processing, all partitions are revoked, then reassigned. This pause can cause significant consumer lag.
Cooperative incremental rebalancing (Kafka 2.4+) only revokes the partitions that are actually moving. Consumers continue processing their unchanged partitions. This dramatically reduces rebalance impact in production.
Mitigations for rebalancing overhead:
- Static membership (group.instance.id): Kafka recognizes a rejoining consumer as the same instance and skips rebalance during rolling deploys.
- Session timeout tuning: shorter timeout detects dead consumers faster (less lag on failure) but causes more rebalances on transient network blips.
- Heartbeat interval: set to 1/3 of session timeout. Default session timeout is 45 seconds, heartbeat interval is 3 seconds.
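The mitigations above map to a handful of consumer settings. An illustrative configuration using the Java client's property names (check your client library for its equivalents; KafkaJS, for example, uses camelCase options):

```typescript
// Illustrative rebalance-friendly consumer settings. Property names follow
// Kafka's Java client; values apply the rules of thumb from this section.
const consumerConfig = {
  "group.id": "email-sender",
  // Static membership: the coordinator treats a restarted pod as the same
  // member and skips a full rebalance during rolling deploys.
  "group.instance.id": "email-sender-pod-1",
  // Long enough to tolerate transient network blips.
  "session.timeout.ms": 45_000,
  // Rule of thumb: 1/3 of the session timeout.
  "heartbeat.interval.ms": 15_000,
  // Cooperative incremental rebalancing instead of stop-the-world.
  "partition.assignment.strategy":
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
};
```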
Implementation Sketch
A TypeScript consumer with idempotent processing and dead letter queue routing:
async function processMessage(msg: QueueMessage) {
  // Idempotency check: skip if already processed
  const alreadyProcessed = await redis.get(`processed:${msg.id}`);
  if (alreadyProcessed) {
    await queue.ack(msg);
    return;
  }
  try {
    await sendShippingEmail(msg.body.orderId, msg.body.email);
    // Mark as processed with 7-day TTL
    await redis.set(`processed:${msg.id}`, "1", "EX", 604800);
    await queue.ack(msg);
  } catch (error) {
    if (msg.receiveCount >= 3) {
      // Poison pill: route to dead letter queue
      await deadLetterQueue.send(msg);
      await queue.ack(msg);
    } else {
      // Let visibility timeout expire for retry
      await queue.nack(msg);
    }
  }
}
This sketch handles the three core concerns: idempotency (Redis dedup key), retry (nack for re-delivery), and poison pill routing (DLQ after 3 attempts).
A Kafka consumer with manual offset commit:
async function consumePartition(consumer: KafkaConsumer) {
  for await (const batch of consumer.eachBatch()) {
    for (const message of batch.messages) {
      const key = message.key.toString();
      const dedup = await redis.get(`processed:${key}:${message.offset}`);
      if (dedup) continue;
      await processOrder(JSON.parse(message.value));
      await redis.set(
        `processed:${key}:${message.offset}`, "1", "EX", 86400
      );
    }
    // Commit after entire batch succeeds
    await consumer.commitOffsets(batch.lastOffset);
  }
}
Notice the commit happens after the entire batch. If the consumer crashes mid-batch, it will reprocess from the last committed offset. The idempotency check prevents duplicate side effects.
Idempotency Strategies
Since competing consumers deliver at-least-once, every handler must tolerate duplicate delivery. Here are the three most common idempotency approaches:
1. Deduplication by message ID. Before processing, check if the message ID has already been handled (using Redis, a database table, or an in-memory set with TTL). This works for any message type but requires storage for processed IDs.
2. Conditional database writes. Use version columns or conditional updates: UPDATE orders SET status = 'shipped' WHERE id = ? AND status = 'processing'. If the update affects 0 rows, the message was already processed. No separate dedup storage needed, but only works for database-backed operations.
3. Naturally idempotent operations. Some operations are inherently idempotent: setting a value (not incrementing), upserting a record, or writing a file with a deterministic name. Design your operations to be idempotent by nature when possible.
The dedup key's TTL matters. If you set a 1-hour TTL and Kafka reprocesses a message from 2 hours ago (after a consumer group reset), the dedup key has expired and you'll process it again. Set the TTL based on your worst-case replay window, not average processing time.
SQS vs Kafka: When to Use Which
| Dimension | SQS | Kafka |
|---|---|---|
| Scaling model | Add workers freely | Max parallelism = partition count |
| Ordering | FIFO queues (limited throughput) | Per-partition ordering |
| Message retention | 14 days max, deleted on ack | Configurable, retained after consumption |
| Replay | Not possible after deletion | Replay by resetting consumer offset |
| Operational overhead | Fully managed, zero config | Cluster management (or use managed Kafka) |
| Throughput ceiling | ~3,000/sec per FIFO queue; near-unlimited for standard | Millions/sec with enough partitions |
| Multi-consumer | One consumer per message | Multiple consumer groups read same topic |
| Best for | Task queues, job processing | Event streaming, event sourcing, log aggregation |
When It Shines
Competing consumers are the right choice when:
- Throughput exceeds single-worker capacity. One worker can't keep up with the message rate during peak traffic. More workers = proportionally more throughput (up to the parallelism ceiling).
- Processing is independent per message. Each message can be handled without coordinating with other messages. No cross-message transactions.
- Workload is spiky. Black Friday, flash sales, batch imports. Auto-scaling consumer count based on queue depth lets you handle bursts without over-provisioning.
- You need resilience to worker failure. A crashed worker's messages are automatically re-delivered to healthy workers. No manual intervention needed.
- Processing time varies per message. Some messages take 100ms, others take 10 seconds. Competing consumers naturally load-balance: fast workers pull more messages, slow workers pull fewer.
- You want decoupled, independently deployable workers. Each consumer is a stateless process. You can deploy, scale, or restart workers without affecting the queue or other consumers.
When NOT to Use
- Strict global ordering is required. If every message must be processed in exactly the order it was produced across all keys, competing consumers break that guarantee. Use a single consumer instead.
- Processing requires cross-message state. If handling message N depends on the result of message N-1, competing consumers add complexity. Consider the saga pattern with an orchestrator instead.
- Message volume is consistently low. If you process 10 messages per minute, running 5 competing consumers wastes resources. One consumer suffices.
Auto-Scaling Consumer Fleets
The most powerful benefit of competing consumers is dynamic scaling. With SQS, you can auto-scale directly off queue depth metrics:
SQS scaling signal: ApproximateNumberOfMessages / number of consumers = messages per worker. When this ratio exceeds your processing capacity, scale out.
Kafka scaling signal: consumer lag (offset difference between latest produced and last committed). When lag grows faster than consumers drain it, you need more consumers (up to the partition count). Beyond that, you need more partitions, which requires topic reconfiguration.
I once had a team that auto-scaled Kafka consumers to 30 instances but only had 12 partitions. 18 consumers sat idle while we burned compute. The monitoring dashboard showed everything "green" because CPU was low. It was low because 60% of the fleet was doing nothing.
Practical auto-scaling rules:
- SQS: scale on ApproximateNumberOfMessages / DesiredConsumerCount. Target: each consumer should have no more than a few hundred messages queued. AWS provides a custom metric recipe for this.
- Kafka: scale on records-lag-max (max lag across all partitions for a consumer group). But cap the max consumer count at the partition count. Auto-scaling beyond the partition count is pure waste.
- Scale-down cooldown: set a cooldown period (5-10 minutes) before scaling down. Queue depth can be spiky; you don't want to remove consumers only to add them back 2 minutes later.
- Minimum fleet size: always keep at least 2 consumers running for redundancy. If one dies, the other continues processing while autoscaling brings up a replacement.
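These rules combine into one scaling decision. A sketch (the function name and parameters are illustrative, not from any autoscaler API):

```typescript
// Lag-driven target, capped at the partition count, floored at a
// redundancy minimum of 2 consumers.
function desiredConsumerCount(
  queueDepth: number,              // SQS ApproximateNumberOfMessages or Kafka lag
  targetMessagesPerConsumer: number,
  partitionCount: number,          // effectively Infinity for SQS standard queues
  minFleet = 2,
): number {
  const needed = Math.ceil(queueDepth / targetMessagesPerConsumer);
  return Math.min(partitionCount, Math.max(minFleet, needed));
}
```

For the earlier anecdote: a backlog demanding 30 consumers on a 12-partition topic caps at 12; the other 18 instances would do nothing.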
Failure Modes and Pitfalls
1. Poison Pill Messages
A message that repeatedly fails processing (malformed payload, missing reference data, bug in handler logic) will be retried indefinitely unless you set a retry limit. It blocks its partition (in Kafka) or is re-delivered endlessly (in SQS).
Fix: configure a dead letter queue (DLQ). After N failed attempts, move the message to the DLQ for manual investigation. SQS has native DLQ support via maxReceiveCount. For Kafka, implement DLQ routing in your consumer code: catch exceptions, check the retry count from a custom header, and produce to a <topic>-dlq topic after N failures.
A single poison pill in Kafka blocks its entire partition. All subsequent messages in that partition wait behind the failing message. In SQS, the impact is smaller since the message is retried independently. Monitor DLQ depth and alert when it exceeds zero.
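Since Kafka has no native DLQ, consumers hand-roll the routing. A sketch using a retry-count header (the types, header name, and `produce` callback are hypothetical stand-ins for your client library):

```typescript
interface KafkaRecord {
  topic: string;
  value: string;
  headers: { [key: string]: string };
}

// On processing failure: re-produce with an incremented retry count, or
// divert to the "<topic>-dlq" topic once maxAttempts is reached.
function routeOnFailure(
  rec: KafkaRecord,
  maxAttempts: number,
  produce: (rec: KafkaRecord) => void,
): void {
  const attempts = Number(rec.headers["x-retry-count"] ?? "0") + 1;
  const headers = { ...rec.headers, "x-retry-count": String(attempts) };
  if (attempts >= maxAttempts) {
    // Poison pill: park it for manual investigation, unblocking the partition.
    produce({ ...rec, topic: `${rec.topic}-dlq`, headers });
  } else {
    // Retry: re-enqueue on the original topic.
    produce({ ...rec, headers });
  }
}
```

Re-producing a failed message (rather than blocking on it) is what frees the partition for the messages queued behind it, at the cost of giving up strict ordering for retried messages.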
2. Consumer Lag Avalanche
During a rebalance or outage, messages accumulate in the queue. When consumers come back, they process a backlog. If processing triggers downstream calls (database writes, API calls), the burst can overwhelm downstream services.
Fix: implement backpressure. Limit the batch size per consumer. Use rate limiting on downstream calls. Monitor consumer lag as a leading indicator before it becomes critical.
I've seen this pattern repeatedly: a 30-minute Kafka outage causes 500,000 messages to queue up. When consumers reconnect, they blast through the backlog at full speed, hammering the database with 10x normal write volume. The database's connection pool saturates, queries start timing out, and the consumer starts failing too. Now you have a cascading failure. Rate-limit the consumer's processing speed, especially after an outage recovery.
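Backpressure can be as simple as a token bucket in front of downstream calls (a sketch with a logical clock for testability; in production you'd typically reach for an existing rate-limiting library):

```typescript
// Token bucket: permits at most `capacity` burst calls, refilling at
// `refillPerSecond`. Workers check tryAcquire() before each downstream call
// and pause when it returns false, smoothing out backlog-drain bursts.
class TokenBucket {
  private tokens: number;

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Advance logical time to refill tokens (capped at capacity).
  refill(seconds: number): void {
    this.tokens = Math.min(this.capacity, this.tokens + seconds * this.refillPerSecond);
  }

  // Returns true if the caller may make one downstream call now.
  tryAcquire(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Sized to the downstream service's safe write rate, this keeps a post-outage backlog drain from exceeding normal peak load no matter how many messages are queued.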
3. Visibility Timeout Misconfiguration (SQS)
Visibility timeout too short causes duplicate delivery. Too long causes delayed retry on failure. I once spent two days debugging "duplicate emails" before realizing our visibility timeout was 30 seconds but the email API call took 45 seconds under load.
Fix: set visibility timeout to 2-3x p99 processing time. Use ChangeMessageVisibility to extend the timeout for long-running messages.
Here's the pattern for dynamic visibility extension. Your worker sets a heartbeat timer that extends the visibility timeout every N seconds while processing continues:
async function processWithHeartbeat(msg: SQSMessage) {
  const heartbeat = setInterval(async () => {
    await sqs.changeMessageVisibility({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: msg.ReceiptHandle,
      VisibilityTimeout: 30, // reset the clock: hidden for another 30s from now
    });
  }, 15_000); // heartbeat every 15 seconds
  try {
    await processExpensiveTask(msg);
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle });
  } finally {
    clearInterval(heartbeat);
  }
}
This eliminates the need to guess a "safe" visibility timeout. The timeout renews as long as the worker is alive and processing.
4. Rebalance Storm (Kafka)
Frequent rebalances (consumers flapping, session timeout too short, long GC pauses) cause a cascading failure: rebalance starts, processing pauses, consumer lag grows, consumers fall behind, session timeouts fire, more rebalances trigger.
Fix: use cooperative incremental rebalancing, set session.timeout.ms to 45+ seconds, use static group membership for rolling deploys, and monitor rebalance frequency as a key health metric.
A teammate once set session.timeout.ms to 10 seconds to "detect failures faster." During a network hiccup that lasted 12 seconds, every consumer in the group was declared dead. The coordinator triggered a full rebalance of 200+ partitions. Processing stopped for 3 minutes while the group stabilized. The "faster failure detection" created a much worse failure.
5. Ordering Violations Under Rebalance
During a Kafka rebalance, a partition moves from Consumer A to Consumer B. If Consumer A hadn't committed its latest offset, Consumer B reprocesses those messages. If the processing order matters, the gap between "A processed but didn't commit" and "B reprocesses" can cause ordering violations.
Fix: commit offsets synchronously before partitions are released, by implementing ConsumerRebalanceListener.onPartitionsRevoked (in the Java client; other clients expose equivalent rebalance callbacks). Accept that at-least-once delivery means some messages will be processed twice.
6. Head-of-Line Blocking in Kafka
When processing messages synchronously within a partition, a single slow message blocks all subsequent messages in that partition. If one out of every 1,000 messages takes 30 seconds (instead of the normal 50ms), that partition's throughput drops dramatically.
Fix: process messages asynchronously within the consumer. Use an internal thread pool to parallelize processing across messages from the same partition. You lose strict ordering within the partition, but gain resilience against slow outliers. If ordering matters, only parallelize across different keys.
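The per-key approach starts by grouping a batch so each key keeps its arrival order while different keys can be dispatched concurrently (a sketch with illustrative types; the concurrent dispatch itself would sit on top of this grouping):

```typescript
interface KeyedMessage {
  key: string;
  offset: number;
}

// Group a partition's batch by message key. Each key's list preserves
// arrival order, so keys can be processed in parallel with each other
// while messages within a key stay strictly sequential.
function groupByKey(batch: KeyedMessage[]): Map<string, KeyedMessage[]> {
  const groups = new Map<string, KeyedMessage[]>();
  for (const msg of batch) {
    const list = groups.get(msg.key) ?? [];
    list.push(msg);
    groups.set(msg.key, list);
  }
  return groups;
}
```

A slow message for user-123 now delays only user-123's remaining messages; every other key in the partition keeps flowing.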
The "at-least-once" guarantee means exactly-once processing is your application's responsibility. Every competing consumer system, whether SQS, Kafka, RabbitMQ, or anything else, can deliver a message more than once. Build idempotency into your message handlers from day one, not as an afterthought.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Linear horizontal scaling | Max parallelism bounded by partition count (Kafka) |
| Automatic failover on worker death | At-least-once delivery requires idempotent workers |
| Simple to implement | Ordering guarantees limited to per-partition/per-group |
| Handles bursty workloads naturally | Rebalancing causes processing pauses (Kafka) |
| Workers are stateless and independently deployable | Poison pills can block partition processing |
| No coordination protocol between workers | Visibility timeout tuning is error-prone (SQS) |
| Multiple consumer groups can read same topic (Kafka) | Adding partitions later changes key routing |
The fundamental tension: competing consumers trade ordering and exactly-once guarantees for horizontal scalability. You get throughput, but you must design for duplicate delivery and per-partition (not global) ordering.
If your interview answer is "use competing consumers to scale," follow immediately with: "and workers must be idempotent because at-least-once delivery means duplicates are possible." This shows you understand the trade-off, not just the pattern.
Real-World Usage
Uber processes billions of Kafka messages daily across ride matching, pricing, and ETA calculation. Their consumer groups scale to hundreds of instances per topic, with partition counts in the thousands for high-throughput topics.
Shopify uses competing consumers for webhook delivery. When a merchant's store generates events (order created, product updated), competing consumer workers deliver webhooks to the merchant's configured endpoints. During flash sales, consumer count auto-scales based on queue depth.
DoorDash uses Kafka consumer groups for order assignment. When a new order comes in, competing consumers process assignment logic (finding nearby Dashers, calculating ETAs). They run 100+ consumers per topic during dinner rush, scaling down to 10-20 during off-peak hours.
Stripe processes webhook delivery using competing consumers. Each webhook event is a message; consumer workers deliver it to the merchant's endpoint with retries and exponential backoff. At Stripe's scale, this means millions of webhook deliveries per hour, with consumer fleets that independently scale per event type.
LinkedIn uses Kafka consumer groups extensively across their infrastructure. Their messaging platform processes billions of events per day through competing consumer groups, with separate groups for feed updates, notification delivery, analytics, and search indexing, all reading from the same topics independently.
When choosing partition counts for Kafka, plan for your peak load, not average load. A topic with 12 partitions can never exceed 12 parallel consumers, even during Black Friday. Over-partitioning has low cost; under-partitioning forces you to add partitions later, which reshuffles key-to-partition routing, or to recreate the topic if per-key ordering must be preserved.
Monitoring Essentials
The key metrics to track for a competing consumer system:
| Metric | SQS | Kafka | Alert When |
|---|---|---|---|
| Queue depth / consumer lag | ApproximateNumberOfMessages | records-lag-max | Growing faster than drain rate |
| Processing rate | Messages deleted per second | Records consumed per second | Drops below expected baseline |
| Error rate | DLQ message count | Consumer error logs | Exceeds 0.1% of total |
| Processing latency | Time between send and delete | Commit offset latency | p99 exceeds SLA |
| Rebalance frequency | N/A | rebalance-total metric | More than 1 per hour outside deploys |
| Idle consumers | N/A | Consumers with 0 assigned partitions | Any idle consumer exists |
Set up a dashboard with these six metrics before deploying competing consumers to production. Consumer lag is the single most important metric: if lag is growing, either add consumers (up to partition count) or investigate why processing is slow. The second most critical alert is DLQ depth: if poison pills are accumulating, you have a data quality or code bug that needs immediate attention.
How This Shows Up in Interviews
Junior: "How would you scale processing for this queue?" Show that you know to add more workers and that the queue handles exclusive delivery.
Mid-level: "What happens if a worker crashes mid-processing?" Discuss visibility timeouts (SQS), offset commits (Kafka), at-least-once delivery, and idempotency.
Senior: "How do you handle ordering guarantees while maintaining parallel processing?" Explain partition-based ordering, the parallelism ceiling, rebalancing strategies, and the trade-off between throughput and ordering.
In system design interviews, competing consumers often come up indirectly. When an interviewer asks "how would you handle millions of notification emails?" or "how do you process uploaded images at scale?", the answer involves competing consumers even if you don't name the pattern explicitly. Frame it as: "multiple workers pulling from a shared queue, where the queue handles exclusive delivery and retry on failure."
| Question | What They Want to Hear |
|---|---|
| "How do you scale queue consumers?" | Add workers; SQS auto-distributes, Kafka limited by partition count |
| "What's the max parallelism in Kafka?" | Equal to the number of partitions in the topic |
| "How do you prevent duplicate processing?" | Idempotency keys, deduplication at the application layer |
| "What happens during a Kafka rebalance?" | Partitions redistributed, processing pauses (eager) or partially pauses (cooperative) |
| "How do you handle poison pill messages?" | Dead letter queue after N retries, alert on DLQ depth |
Common Interview Mistakes
- Saying "just add more consumers" without discussing the partition ceiling in Kafka.
- Assuming messages are delivered exactly once. They are delivered at-least-once; your application must handle duplicates.
- Forgetting about ordering. Mentioning competing consumers without discussing what happens to message ordering shows incomplete understanding.
- Not mentioning the rebalance impact. In Kafka, adding or removing consumers triggers a rebalance that temporarily pauses processing.
Test Your Understanding
Quick Recap
- Competing consumers scale message processing by running multiple workers against the same queue. Each message is processed by exactly one worker; the broker handles exclusive delivery.
- With SQS, scaling is trivial: add workers and SQS distributes load automatically. With Kafka, maximum parallelism equals the number of partitions.
- Kafka consumer groups rebalance when consumers join or leave, causing a processing pause. Use static membership and cooperative incremental rebalancing to minimize disruption during deploys.
- Competing consumers provide at-least-once delivery. A message may be redelivered after a worker crash or rebalance. Workers must be idempotent.
- To preserve ordering within a logical unit (user, order), partition by the key in Kafka or use MessageGroupId in SQS FIFO. Global ordering requires a single consumer and eliminates the scaling benefit.
- Poison pill messages can block a partition. Route failed messages to a dead letter queue after N retries.
- Visibility timeout (SQS) and session timeout (Kafka) are the critical tuning parameters. Set visibility timeout to 2-3x p99 processing time. Set session timeout to tolerate transient network blips.
- Plan Kafka partition counts for peak load at topic creation. Adding partitions later is possible but changes key routing and can break ordering guarantees.
- Use multiple Kafka consumer groups to independently consume the same topic for different purposes (notifications, analytics, fraud detection) without interference.
Related Patterns
- Message Queues for the underlying infrastructure that competing consumers run on.
- Dead Letter Queue for handling poison pill messages that fail repeatedly.
- Saga Pattern for orchestrating multi-step workflows where competing consumers handle individual steps.
- Leader Election for scenarios where only one consumer should process a partition (Kafka's internal coordination mechanism).
- Circuit Breaker for protecting downstream services when consumers generate burst traffic after recovering from lag.
- Change Data Capture for generating the event streams that competing consumers often process downstream.