📐HowToHLD

Outbox pattern

Learn how the Outbox pattern eliminates the dual-write problem in distributed systems, guaranteeing every database write produces its corresponding event even when brokers and services crash mid-flight.

47 min read · 2026-03-26 · hard · Tags: outbox-pattern, distributed-systems, messaging, cdc, microservices

TL;DR

  • The Outbox pattern solves the dual-write problem: you cannot atomically write to a database and publish to a message broker simultaneously. One can succeed while the other fails, leaving your system in an inconsistent state.
  • The fix is to write the event into an outbox table inside the same database transaction as your business data. A separate relay process then reads the outbox and publishes to the broker. Atomicity comes from the database; delivery is handled by the relay.
  • This gives you at-least-once delivery, not exactly-once. Consumers must implement idempotency to handle duplicate events.
  • Two relay strategies exist: polling (simple, works up to ~5K events/second) and CDC via Debezium (sub-second latency, higher operational burden, no theoretical throughput ceiling).
  • This pattern is not optional complexity. It is the structural difference between an event-driven architecture that works in production and one that silently loses events under load.

The Problem

Your order service needs to do two things when a customer places an order: write the order to PostgreSQL, and publish an order.created event to Kafka so the inventory service can reserve stock.

Two lines of code: INSERT, then publish. The problem is those two operations are not atomic. PostgreSQL knows nothing about Kafka. Kafka knows nothing about PostgreSQL.

If your application crashes between the INSERT and the publish, the order exists in the database but the inventory service never sees the event. Stock is never reserved. The customer gets an order confirmation, but nothing ships. Your on-call engineer finds out at 3 a.m. when the customer calls.

sequenceDiagram
    participant C as 👤 Customer
    participant S as ⚙️ Order Service
    participant DB as 🗄️ PostgreSQL
    participant K as 📨 Kafka

    Note over S,K: The Dual-Write Problem: two operations, zero atomicity

    C->>S: POST /orders
    S->>DB: INSERT INTO orders (COMMIT)
    Note over DB: ✅ Order persisted
    Note over S: 💥 Process crash / network timeout / OOM kill
    S--xK: publish order.created
    Note over K: ❌ Event never arrives

    Note over DB,K: Inconsistent state: order in DB, no event in Kafka
    Note over DB,K: Inventory never reserved. Order ships nothing.

This is not just a crash scenario. The same failure appears when Kafka is temporarily unavailable, when a network partition separates your app from the broker, or when the publish call throws an exception after the DB commit has already completed.

The instinctive fix is to add retry logic around the publish. Retry logic does not help if the application has already crashed before the retry runs. It also risks double-publishing if the original publish actually succeeded but the acknowledgment was lost.

The fundamental issue: you cannot make two writes to two unrelated systems atomic without a distributed transaction, and distributed transactions are something you do not want in a high-throughput service.


One-Line Definition

The Outbox pattern guarantees event delivery by writing events into a dedicated table in the same database transaction as your business data, then using a relay process to publish those events to the message broker asynchronously.


Analogy

Think of a restaurant kitchen where the chef (your service) must both prepare a dish (database write) and call out to a waiter (message broker) to deliver it. The chef can complete the dish and then faint before calling anyone. Dish plated, waiter never notified. The customer gets no food even though the kitchen has a completed meal.

The outbox is the kitchen ticket printer. Every time the chef finishes a dish, they print a ticket and clip it to the pass. A dedicated ticket runner (the relay) comes by every 30 seconds, picks up all undelivered tickets, and routes them to the right waiter. Even if the ticket runner is temporarily absent, the tickets stay clipped. The next time the runner comes, they deliver everything. The chef never worries about delivery.


Solution Walkthrough

The Outbox pattern has three moving parts: the transactional write, the relay, and the consumer. Here is how they work together.

The Outbox Table pattern showing the order service writing to both orders and outbox tables in one atomic transaction, with a relay polling the outbox and publishing to Kafka, and the consumer processing with idempotency.
The outbox table lives in the same database as your business data. Writing to it is part of the same ACID transaction, so the event exists if and only if the business write succeeded.

Step 1: Atomic transactional write

Your service writes business data and the corresponding outbox row in a single database transaction. Both succeed together, or both fail together. If the transaction commits, you are guaranteed the outbox row exists. If the transaction rolls back, no outbox row exists. The two states are always consistent.
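The guarantee in Step 1 can be made concrete with a toy model. This sketch uses a hypothetical in-memory `Txn` class as a stand-in for a real database driver (the `orders`/`outbox` table names follow the article); it only demonstrates the atomicity rule, not production persistence:

```typescript
// Hypothetical in-memory stand-in for a transactional database: staged writes
// become visible only on commit, so the order row and the outbox row are atomic.
type Row = Record<string, unknown>;

class Txn {
  private staged: { table: string; row: Row }[] = [];
  constructor(private db: Map<string, Row[]>) {}

  insert(table: string, row: Row): void {
    this.staged.push({ table, row }); // buffered: invisible until commit
  }

  commit(): void {
    for (const { table, row } of this.staged) {
      if (!this.db.has(table)) this.db.set(table, []);
      this.db.get(table)!.push(row);
    }
    this.staged = [];
  }

  rollback(): void {
    this.staged = []; // neither the order nor the event survives
  }
}

const db = new Map<string, Row[]>([["orders", []], ["outbox", []]]);

// Happy path: business row and outbox row commit together.
const t1 = new Txn(db);
t1.insert("orders", { id: "order-123", total: 42 });
t1.insert("outbox", { eventType: "order.created", aggregateId: "order-123", status: "PENDING" });
t1.commit();

// Failure path: rollback discards both writes, leaving no orphaned event.
const t2 = new Txn(db);
t2.insert("orders", { id: "order-456", total: 7 });
t2.insert("outbox", { eventType: "order.created", aggregateId: "order-456", status: "PENDING" });
t2.rollback();
```

After the two transactions, the database holds exactly one order and one PENDING outbox row; the rolled-back pair left no trace in either table.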

Step 2: Relay reads and publishes

A background process polls the outbox table for PENDING rows (and for PROCESSING rows older than a timeout, which indicates a crashed relay) and publishes them to Kafka. After receiving a Kafka acknowledgment, it marks the row as SENT. This relay runs independently of the service that wrote the data. If the relay crashes mid-publish, the row stays in PROCESSING and is recovered automatically when the next poll cycle finds it past the stale timeout.

Step 3: Consumer processes with idempotency

The consumer receives the event. Before executing business logic, it attempts to insert the event ID into a processed_events table. If the insert succeeds (first time), process normally. If it fails on a unique constraint (already processed), skip silently. Because the relay delivers at-least-once, duplicates are expected. Idempotency is how you make them harmless.

sequenceDiagram
    participant S as ⚙️ Order Service
    participant DB as 🗄️ PostgreSQL
    participant R as ⚙️ Message Relay
    participant K as 📨 Kafka
    participant I as 📦 Inventory Service

    Note over S,DB: Step 1: One atomic transaction

    S->>DB: BEGIN TRANSACTION
    S->>DB: INSERT INTO orders VALUES (...)
    S->>DB: INSERT INTO outbox (event_type, payload, status='PENDING')
    S->>DB: COMMIT
    Note over DB: Both rows committed. Event cannot be lost.

    Note over R,DB: Step 2: Relay polls for PENDING rows

    R->>DB: SELECT * FROM outbox WHERE status='PENDING'<br/>FOR UPDATE SKIP LOCKED LIMIT 100
    DB-->>R: [{ id, event_type, payload, aggregate_id }]
    R->>DB: UPDATE outbox SET status='PROCESSING' WHERE id IN (...)
    R->>K: Produce event to topic 'orders' (key=aggregate_id)
    K-->>R: ack (offset confirmed)
    R->>DB: UPDATE outbox SET status='SENT' WHERE id IN (...)

    Note over K,I: Step 3: Consumer with idempotency guard

    K->>I: Deliver event (at-least-once)
    I->>DB: INSERT INTO processed_events (event_id) ON CONFLICT DO NOTHING
    I->>DB: UPDATE inventory SET reserved=reserved+qty WHERE product_id=...
    I->>DB: COMMIT

The FOR UPDATE SKIP LOCKED in Step 2 is not a cosmetic detail. Without it, running two relay instances means both claim the same batch of rows, and every event gets published twice. SKIP LOCKED makes each relay instance claim a disjoint subset of rows automatically.
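The disjoint-claim behavior can be sketched with a toy in-memory model. This is illustrative only: in PostgreSQL the row locks come from the database itself, not from application code; here a `lockedBy` field plays the role of the row lock:

```typescript
// Toy model of FOR UPDATE SKIP LOCKED: a claim scans PENDING rows, skips any row
// already locked by another claimer, and locks the rows it takes.
interface OutboxRow { id: number; status: "PENDING" | "PROCESSING" | "SENT"; lockedBy?: string }

const outbox: OutboxRow[] = Array.from({ length: 6 }, (_, i) => ({ id: i + 1, status: "PENDING" }));

function claimBatch(relayId: string, limit: number): OutboxRow[] {
  const claimed: OutboxRow[] = [];
  for (const row of outbox) {
    if (claimed.length >= limit) break;
    if (row.status !== "PENDING") continue; // WHERE status = 'PENDING'
    if (row.lockedBy) continue;             // SKIP LOCKED: don't wait, move on
    row.lockedBy = relayId;                 // FOR UPDATE: take the row lock
    row.status = "PROCESSING";
    claimed.push(row);
  }
  return claimed;
}

// Two relay instances polling at overlapping moments claim disjoint row sets.
const batchA = claimBatch("relay-a", 3);
const batchB = claimBatch("relay-b", 3);
```

Relay A claims rows 1–3 and relay B claims rows 4–6; no row is ever published twice because no row appears in both batches.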

For your interview: describe all three steps, name FOR UPDATE SKIP LOCKED as the concurrency lock, and state that consumers get at-least-once delivery and must be idempotent. That combination signals you've actually run this, not just read about it.


Implementation Sketch

Here is a production-grade relay covering batching, the three-state lifecycle, and dead-letter handling:

// outbox-relay.ts — background service, one per service instance or as a cron job.
// Assumes a pg-style `pool`, a Kafka producer `kafka`, a `topicForEventType` mapping,
// a `sleep(ms)` helper, and an `OutboxRow` type matching the SELECT below.
class OutboxRelay {
  private readonly BATCH_SIZE = 100;
  private readonly MAX_ATTEMPTS = 5;
  private readonly POLL_INTERVAL_MS = 2000;

  async start(): Promise<void> {
    while (true) {
      await this.processBatch();
      await sleep(this.POLL_INTERVAL_MS);
    }
  }

  private async processBatch(): Promise<void> {
    // The SELECT and the UPDATE to PROCESSING MUST be in the same transaction.
    // If they are separate db.query calls, each auto-commits on its own connection,
    // releasing the FOR UPDATE lock between the two calls. A concurrent relay instance
    // can then claim the same rows in that gap — exactly the race we are trying to prevent.
    const client = await pool.connect();
    let rows: OutboxRow[] = [];
    try {
      await client.query('BEGIN');

      const result = await client.query<OutboxRow>(
        `SELECT id, aggregate_id, event_type, payload, attempts
         FROM outbox
         WHERE (
           (status = 'PENDING' AND attempts < $1)
           OR
           -- Recover PROCESSING rows left by a crashed relay instance
           (status = 'PROCESSING' AND updated_at < NOW() - INTERVAL '10 minutes')
         )
         ORDER BY created_at
         LIMIT $2
         FOR UPDATE SKIP LOCKED`,
        [this.MAX_ATTEMPTS, this.BATCH_SIZE]
      );
      rows = result.rows;

      if (rows.length === 0) {
        await client.query('COMMIT');
        return;
      }

      // Mark PROCESSING while still holding the FOR UPDATE row locks.
      // updated_at = NOW() is critical: the stale-recovery condition uses updated_at
      // to detect crashed relays. Without this, a row that waited >10 min in PENDING
      // would have an old updated_at and be immediately vulnerable to concurrent re-claiming.
      await client.query(
        `UPDATE outbox SET status = 'PROCESSING', updated_at = NOW() WHERE id = ANY($1)`,
        [rows.map(r => r.id)]
      );

      await client.query('COMMIT');
      // Locks released. Rows are now PROCESSING — concurrent relays skip them.
    } catch (err) {
      await client.query('ROLLBACK').catch(() => {});
      throw err;
    } finally {
      client.release();
    }

    const sentIds: string[] = [];
    const failedIds: string[] = [];

    for (const row of rows) {
      try {
        await kafka.produce({
          topic: topicForEventType(row.event_type),
          key:   row.aggregate_id, // ensures ordering: same aggregate → same partition
          value: row.payload,
          headers: { 'x-outbox-event-id': row.id }, // consumer uses for idempotency key
        });
        sentIds.push(row.id);
      } catch {
        failedIds.push(row.id);
      }
    }

    // Batch updates — never do per-row DB round-trips in the relay hot path.
    // These run OUTSIDE the above transaction so PROCESSING already prevents re-claiming.
    if (sentIds.length > 0) {
      await pool.query(
        `UPDATE outbox SET status = 'SENT', sent_at = NOW(), updated_at = NOW() WHERE id = ANY($1)`,
        [sentIds]
      );
    }

    if (failedIds.length > 0) {
      await pool.query(
        `UPDATE outbox
         SET status     = CASE WHEN attempts + 1 >= $2 THEN 'DEAD_LETTER' ELSE 'PENDING' END,
             attempts   = attempts + 1,
             updated_at = NOW()
         WHERE id = ANY($1)`,
        [failedIds, this.MAX_ATTEMPTS]
      );
    }
  }
}

The three-status progression (PENDING → PROCESSING → SENT) is the detail most tutorial implementations omit. The intermediate PROCESSING state prevents a second relay instance from claiming a row while the first is mid-publish. Without it, a slow Kafka produce call gives the second relay a window to claim the same row and publish the event twice.

My rule: implement PROCESSING status from day one, even if you only ever run one relay instance. You will run two for HA eventually.

How the Relay Works: Polling vs. CDC

Choosing how to read the outbox table is one of the first architectural decisions teams get wrong. Not because the wrong choice breaks things immediately, but because migrating from polling to CDC under production load is painful.

Polling relay loop showing the relay querying the outbox table with FOR UPDATE SKIP LOCKED, publishing to Kafka in batches, then marking rows SENT in a single batch UPDATE.
Polling relay: explicit, debuggable with plain SQL, and the correct default for teams below 5K events/second.

A polling relay is what the implementation sketch above shows. A dedicated process runs a tight loop, querying the outbox table for new rows every N seconds. Simple to implement, simple to debug (check the outbox table with SQL), and suitable for the vast majority of production workloads.

The tradeoff: polling adds read load to your primary database. At very high event volumes (50K+ events/second), you either poll so frequently you create meaningful DB contention, or you poll infrequently and add latency that makes events feel stale.

CDC (Change Data Capture) is the high-throughput alternative. Instead of polling the outbox table, a CDC daemon tails PostgreSQL's Write-Ahead Log (WAL). Every committed INSERT into the outbox table appears in the WAL as a change record, which Debezium captures and routes to Kafka directly. No SELECT query is issued against the outbox table.

CDC architecture showing an application writing to PostgreSQL, Debezium reading the WAL via a logical replication slot, and change events flowing into Kafka topics.
Debezium reads the WAL stream rather than querying the table. This eliminates polling overhead and delivers events within 300ms of commit at any volume.

The tradeoff: CDC requires operating Debezium (a distributed Kafka Connect cluster), managing PostgreSQL logical replication slots, and handling schema migrations carefully. There is real operational complexity here that teams consistently underestimate.

My recommendation: start with polling relay. Migrate to CDC when your event volume or latency requirements make polling untenable, and when you have a team member who has operated Debezium before.

| Dimension | Polling Relay | CDC (Debezium) |
| --- | --- | --- |
| Simplicity | High — pure SQL, no extra infra | Low — Debezium cluster, replication slot setup |
| Event latency | 1–10 seconds | 100–500 ms |
| DB read load | Moderate (periodic SELECT + UPDATE) | Near zero (reads WAL, not the table) |
| Scale ceiling | ~5K events/second | 100K+ events/second |
| Ops burden | Very low | High (schema migrations, slot monitoring) |
| Use when | Default for all new services | Volume exceeds 5K/s or latency SLA is under 1s |

Message Lifecycle

Every outbox row has a well-defined status that tells you exactly where it is in the delivery pipeline.

flowchart TD
    PENDING["⏳ PENDING\nRow inserted by service\nAwaiting relay pickup"]
    PROCESSING["🔄 PROCESSING\nRelay claimed this batch\nPublishing in progress"]
    SENT["✅ SENT\nKafka ack received\nEvent delivered"]
    STALE["⏱️ STALE PROCESSING\nRelay crashed mid-publish\nupdated_at > 10 min ago"]
    DEAD_LETTER["☠️ DEAD_LETTER\nMax attempts exhausted\nRequires manual review"]

    PENDING  -->|"relay picks up (SKIP LOCKED)"| PROCESSING
    PROCESSING -->|"Kafka ack received"| SENT
    PROCESSING -->|"Kafka error, attempts+1 < MAX"| PENDING
    PROCESSING -->|"Kafka error, attempts+1 >= MAX"| DEAD_LETTER
    PROCESSING -->|"relay crash: sits > 10 min"| STALE
    STALE     -->|"next poll timeout recovery"| PROCESSING
    DEAD_LETTER -->|"manual fix + reset to PENDING"| PENDING

Dead-letter rows need human attention: inspect the payload, fix the underlying issue (topic ACL, schema mismatch, consumer bug), reset status to PENDING, and re-run. Add a Slack alert when the DEAD_LETTER count rises above zero in production. A dead-letter row is not a background queue item; it is a delivered notification that a downstream service did not receive.

Without MAX_ATTEMPTS and DEAD_LETTER, one malformed event cycles the relay forever. Every engineer I know who has skipped this in a PoC has regretted it within the first week of production traffic.
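The lifecycle above reduces to a small pure transition function. A sketch, with status and event names following the flowchart and the attempts semantics matching the relay code (`attempts` is the count after the current failure is recorded, i.e. the relay's `attempts + 1`):

```typescript
type Status = "PENDING" | "PROCESSING" | "SENT" | "DEAD_LETTER";
type RelayEvent = "claim" | "ack" | "publish_error" | "stale_timeout" | "manual_reset";

// Pure transition function mirroring the lifecycle flowchart above.
function nextStatus(status: Status, event: RelayEvent, attempts: number, maxAttempts = 5): Status {
  if (status === "PENDING" && event === "claim") return "PROCESSING";
  if (status === "PROCESSING" && event === "ack") return "SENT";
  if (status === "PROCESSING" && event === "publish_error")
    return attempts >= maxAttempts ? "DEAD_LETTER" : "PENDING"; // retry or quarantine
  if (status === "PROCESSING" && event === "stale_timeout") return "PROCESSING"; // re-claimed by recovery poll
  if (status === "DEAD_LETTER" && event === "manual_reset") return "PENDING";
  return status; // SENT is terminal; unknown combinations are no-ops
}
```

Encoding the lifecycle this way makes the poison-pill guarantee testable: a publish error at the attempt limit lands in DEAD_LETTER, everything below it returns to PENDING.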

At-Least-Once Delivery and Idempotency

Here is the detail most engineers miss when they first encounter this pattern: the outbox gives you at-least-once delivery, not exactly-once.

At-least-once delivery scenario: relay crashes after Kafka publish but before marking the row SENT, causing it to re-publish on restart. The consumer receives the event twice. The idempotency guard prevents double-processing.
The relay-crash-after-publish scenario is not an edge case. It is the expected failure mode. Design consumers for it from day one.

The scenario above happens whenever any reliability boundary is crossed: relay restarts, Kafka producer timeout, network hiccup after ack. Design for it from day one.

Here is the idempotency guard that prevents double-processing:

// Consumer-side idempotency guard — prevents duplicate Kafka events from causing
// duplicate business logic execution
async function processOrderCreated(event: KafkaEvent): Promise<void> {
  const eventId = event.headers['x-outbox-event-id'] as string;

  await db.transaction(async (trx) => {
    // Attempt to record this event ID atomically with the business logic
    const inserted = await trx.query(
      `INSERT INTO processed_events (event_id, processed_at)
       VALUES ($1, NOW())
       ON CONFLICT (event_id) DO NOTHING
       RETURNING event_id`,
      [eventId]
    );

    // rowCount = 0 means the event_id was already in the table — skip
    if (inserted.rowCount === 0) {
      return; // duplicate delivery: silent skip
    }

    // First time we've seen this event — safe to apply business logic
    const payload = JSON.parse(event.value) as OrderCreatedPayload;
    await trx.query(
      `UPDATE inventory SET reserved = reserved + $1 WHERE product_id = $2`,
      [payload.quantity, payload.productId]
    );
    // Both writes commit together. If consumer crashes before COMMIT,
    // both roll back. On retry, the idempotency insert succeeds and processing continues.
  });
}

The INSERT into processed_events and the business logic UPDATE happen in the same transaction. If the consumer crashes between receiving the event and committing, both roll back. On re-delivery, the idempotency check correctly detects the first successful processing attempt.
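To see the guard in action, here is the duplicate-delivery scenario rerun in memory. A `Set` stands in for the `processed_events` table's unique constraint, and a `Map` stands in for the inventory table; the shapes are simplified from the consumer code above:

```typescript
// In-memory model: the Set plays the role of processed_events' unique constraint.
const processedEvents = new Set<string>();
const inventory = new Map<string, number>([["sku-1", 0]]);

function handleOrderCreated(eventId: string, productId: string, qty: number): boolean {
  // INSERT ... ON CONFLICT DO NOTHING → "false" when the event ID already exists.
  if (processedEvents.has(eventId)) return false; // duplicate delivery: silent skip
  processedEvents.add(eventId);
  // Business logic runs only on first delivery.
  inventory.set(productId, (inventory.get(productId) ?? 0) + qty);
  return true;
}

// At-least-once delivery: the same event arrives twice.
const first = handleOrderCreated("evt-42", "sku-1", 3);
const second = handleOrderCreated("evt-42", "sku-1", 3);
```

The first delivery reserves stock; the second is detected as a duplicate and skipped, so the reservation is applied exactly once despite at-least-once delivery.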


When It Shines

Decision flowchart for when to use the Outbox pattern, showing branches based on event loss tolerance, database ownership, and event volume leading to: fire-and-forget, Saga pattern, Outbox with CDC, or Outbox with polling relay.
Start at the top: if missing an event causes a business failure and you own the database, you need the Outbox. CDC vs polling is a scale decision made later.

Ok, so here is the honest answer on when you actually need this pattern.

Use it when:

  • Your service writes to a database and must emit events to downstream services with zero tolerance for event loss.
  • You are building microservices that communicate through events and need loose coupling with guaranteed delivery.
  • You are implementing the Saga pattern, where each step emits an event to trigger the next. Without a reliable outbox, failed event deliveries leave sagas permanently stalled.
  • You need a queryable audit trail of all events produced (the outbox table is your event log by default).
  • Your team has already experienced data inconsistencies from direct dual-writes and needs a structural fix, not a code review policy.

Avoid it when:

  • Event loss is genuinely tolerable. For analytics tracking, non-critical notifications, or best-effort logging, the extra table and relay complexity are not justified.
  • You control all consumers and a synchronous API call is viable. Event-driven architecture adds latency and eventual consistency; do not introduce them without need.
  • Your database is not transactional. Without ACID transactions, the outbox's atomicity guarantee does not exist. Some NoSQL stores need the inbox pattern on the consumer side instead.
  • You are in early-stage prototyping. Measure your actual write patterns first, then add the outbox to proven bottlenecks rather than speculative ones.

If you are building a system where a missing event causes a business failure (wrong inventory, unbilled payment, unsent notification), the outbox is not optional.


Failure Modes and Pitfalls

The failure modes below are not hypothetical. They are the production incidents I've seen teams cause by skipping the details.

1. The Phantom Sequence Problem

This is subtle, consistently omitted from outbox tutorials, and will hit you the moment you have concurrent high-throughput writes.

In PostgreSQL, rows become visible to SELECT queries at commit time, not at INSERT time. Consider two concurrent transactions:

  • Transaction A inserts outbox row with auto-increment id = 1001, then does a slow computation before committing.
  • Transaction B inserts outbox row with auto-increment id = 1002, commits immediately.

Your relay polls WHERE status = 'PENDING' ORDER BY id ASC. It sees row 1002 (committed) but not row 1001 (still open). It publishes 1002 and records last_processed_id = 1002.

On the next poll cycle, row 1001 finally commits. But if your relay uses a high-water mark approach (WHERE id > $last_processed_id), row 1001 is permanently skipped. The event is never published.

The fix: never use an ID-based high-water mark. Use WHERE status = 'PENDING' with FOR UPDATE SKIP LOCKED and a fixed batch size. Uncommitted rows are invisible due to PostgreSQL's MVCC — that visibility rule is separate from SKIP LOCKED. SKIP LOCKED solves a different problem: preventing two relay instances from claiming the same already-committed PENDING rows simultaneously. Once Transaction A commits, its row becomes visible as PENDING and gets picked up on the very next poll. No rows are skipped.

-- WRONG: ID-based high-water mark permanently skips out-of-order commits
SELECT * FROM outbox WHERE id > $last_seen_id ORDER BY id LIMIT 100;

-- CORRECT: SKIP LOCKED handles visibility naturally
SELECT * FROM outbox
WHERE status = 'PENDING'
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
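The failure can be reproduced with a toy commit schedule: row 1001 is inserted first but commits second. A sketch comparing the two polling strategies (in a real relay only one strategy would run; both are shown here side by side for contrast):

```typescript
interface Row { id: number; committed: boolean; status: "PENDING" | "SENT" }

// id 1001 was inserted first but its transaction is still open; 1002 committed already.
const rows: Row[] = [
  { id: 1001, committed: false, status: "PENDING" },
  { id: 1002, committed: true, status: "PENDING" },
];

// WRONG: high-water mark remembers the largest id seen.
let lastSeenId = 0;
function pollByHighWaterMark(): number[] {
  const visible = rows.filter(r => r.committed && r.id > lastSeenId);
  lastSeenId = Math.max(lastSeenId, ...visible.map(r => r.id));
  return visible.map(r => r.id);
}

// CORRECT: poll by status; every committed PENDING row is eventually picked up.
function pollByStatus(): number[] {
  const batch = rows.filter(r => r.committed && r.status === "PENDING");
  for (const r of batch) r.status = "SENT";
  return batch.map(r => r.id);
}

const hwmPoll1 = pollByHighWaterMark(); // sees only 1002; mark advances to 1002
rows[0].committed = true;               // transaction A finally commits
const hwmPoll2 = pollByHighWaterMark(); // 1001 < 1002: skipped forever
const statusPoll = pollByStatus();      // status-based poll still recovers 1001
```

The high-water-mark poller never sees row 1001 again after its mark advances past it, while the status-based poller picks it up on the first cycle after the late commit.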

2. Missing Aggregate Versioning

Your outbox emits three events for the same order: order.created, order.updated, order.shipped. Due to relay retry and Kafka partition assignment, the consumer might receive them out of order.

Without an aggregate_version field, the consumer cannot detect this. It processes them in arrival order, producing incorrect state.

-- Add this to your outbox table
ALTER TABLE outbox ADD COLUMN aggregate_version INT NOT NULL DEFAULT 0;

-- Increment per write within the same transaction
-- aggregate_id='order-123', versions: 1, 2, 3 across the three events

The consumer validates: "Is this event's version = my current expected version + 1?" If not, it rejects and relies on the broker to retry in order. This is optional for simple workloads but required for event-driven state machines where event order has business meaning.
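A minimal consumer-side version gate, sketched under the assumption that each event carries its aggregate_version and the consumer tracks the last applied version per aggregate:

```typescript
// Last applied version per aggregate; a real consumer would persist this
// alongside the business state in the same transaction.
const appliedVersions = new Map<string, number>();

function shouldApply(aggregateId: string, eventVersion: number): boolean {
  const current = appliedVersions.get(aggregateId) ?? 0;
  // Accept only the immediate successor; reject gaps and duplicates and
  // rely on broker redelivery to supply the missing event first.
  if (eventVersion !== current + 1) return false;
  appliedVersions.set(aggregateId, eventVersion);
  return true;
}

// order.created (v1) arrives, then order.shipped (v3) arrives before order.updated (v2).
const r1 = shouldApply("order-123", 1);      // applied
const r3 = shouldApply("order-123", 3);      // rejected: v2 not yet applied
const r2 = shouldApply("order-123", 2);      // applied
const r3Retry = shouldApply("order-123", 3); // applied on redelivery
```

The out-of-order v3 is rejected on first arrival and applied cleanly once v2 has landed, which is exactly the behavior the prose above describes.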

3. The Poison Pill

A single malformed outbox row that always fails to process permanently blocks relay forward progress if you do not have a max-attempts limit.

Without MAX_ATTEMPTS: the row cycles between PENDING and PROCESSING endlessly (each relay attempt claims it, fails, resets it to PENDING), burning relay cycles, filling logs, and blocking other rows (depending on your relay's ordering logic).

With MAX_ATTEMPTS and DEAD_LETTER: the row is quarantined after N failures. The relay skips it. All other rows process normally. A human investigates asynchronously.

This is why MAX_ATTEMPTS is not a nice-to-have. It is what keeps one bad event from poisoning your entire event delivery pipeline.

4. The N+1 Relay Anti-Pattern

Early implementations often process outbox rows one at a time: SELECT one row, UPDATE it to PROCESSING, publish, UPDATE it to SENT, then SELECT the next. At 100 events/second that means 300 DB round-trips per second from your relay alone.

Always batch. The implementation sketch above uses a handful of queries per poll cycle regardless of batch size: one SELECT to claim, one UPDATE to mark PROCESSING, one UPDATE to mark SENT, and one UPDATE for any failures. Batch efficiency is the difference between a relay that costs negligible DB load and one that saturates your connection pool.


Trade-offs

| Pros | Cons |
| --- | --- |
| Guaranteed at-least-once event delivery, no silent event loss | Extra outbox table that grows indefinitely without TTL cleanup |
| Uses existing RDBMS infrastructure, no new systems for basic polling | Relay process is another component to deploy, monitor, and scale |
| Atomicity inherited from your existing DB transaction (no new coordination logic) | At-least-once delivery means every consumer must implement idempotency |
| Outbox rows are queryable: full debug, replay, and audit capability via SQL | Polling relay adds sustained read load to your primary database |
| Works with any message broker (Kafka, RabbitMQ, SQS, NATS) | CDC adds Debezium operational burden and replication slot management risk |
| Enables event-driven architectures without distributed transactions or 2PC | Latency from write to consumer delivery is 1–10 seconds (polling) or 100–500 ms (CDC) |

The fundamental tension here is reliability vs. operational complexity. Dropping the outbox simplifies the code but introduces silent event loss. The outbox eliminates silent loss at the cost of an extra table, extra process, and consumer idempotency. For any system where a missing event is a business failure, that is the right trade.


Real-World Usage

Shopify: outbox at peak BFCM scale

Shopify processes over 10,000 orders per minute at BFCM peaks across tens of thousands of merchants. They use per-merchant outbox tables rather than a shared table, preventing any one high-volume merchant from creating write contention for others. The lesson: the outbox table is a write-hot table that must be partitioned by your natural multi-tenancy boundary, not shared globally.

Klarna: CDC for payment event streaming

Klarna adopted Debezium-based CDC for their payment outbox because polling every 100ms competed with write latency on their payments primary. The result was sub-300ms payment event delivery with zero polling overhead on the DB. The hard operational lesson: they lost 24 hours of WAL history during a schema migration after leaving a replication slot unmonitored; replication slot lag is now a primary SLO.

Uber (Go Dispatch service): outbox combined with Saga

Uber's dispatch matching uses Outbox alongside the Saga pattern to coordinate driver-trip assignment, guaranteeing no state transition event is lost during individual service failures. Their addition to the standard implementation: a correlation_id field on every outbox row, enabling distributed tracing across multi-hop event chains. Without that field, debugging a failure five hops downstream from the original event is nearly impossible in practice.

Each of these examples demonstrates the same underlying truth: the value of the Outbox pattern is not in the happy path (any direct publish works for that). It is in the failure and recovery path, which matters most at exactly the moment your system is under the most stress.


How This Shows Up in Interviews

Here is the honest picture of what separates candidates on this topic. Most engineers at the mid-level know what an outbox is. What separates senior and staff candidates is the depth on delivery semantics, relay implementation, and the non-obvious failure modes.

When to bring it up proactively

In any microservices design where a service writes to a database and emits an event, the moment you draw that service, say: "I'll use the Outbox pattern here to avoid the dual-write problem: I write the event into the outbox table inside the same DB transaction as my business data, and a relay publishes it asynchronously to Kafka. This gives at-least-once delivery, so my consumer will implement idempotency." That single paragraph earns senior points before the first follow-up question.

Do not propose 2PC when asked about atomicity

A common trap: the interviewer asks "how do you atomically write to the DB and Kafka?" and the candidate answers "two-phase commit." This is technically correct but is architecturally the wrong answer for microservices. 2PC requires a transaction coordinator, holds locks across network calls, and fails catastrophically if the coordinator crashes mid-transaction. The Outbox pattern achieves the same correctness guarantee without any distributed locks. Always choose the Outbox over 2PC in a microservices design interview.

Depth expected at senior/staff level:

  • Explain the dual-write problem as a fundamental ACID boundary violation, not a retry problem.
  • Name both relay strategies (polling and CDC), make a recommendation based on event volume, and explain the Debezium replication slot hazard.
  • State that at-least-once delivery is the guarantee and describe the consumer-side idempotency mechanism concretely.
  • Explain FOR UPDATE SKIP LOCKED and why it is required for concurrent relay instances.
  • Describe the DEAD_LETTER state and MAX_ATTEMPTS — why they are required, not optional.
  • Distinguish Kafka EOS from the dual-write problem: EOS covers the Kafka-internal producer layer; idempotency covers the application-to-broker and broker-to-consumer boundaries.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
| --- | --- |
| "How do you guarantee exactly-once delivery?" | "Not at the infrastructure level. The Outbox gives at-least-once. The consumer implements idempotency: insert the event ID into a processed_events table in the same transaction as the business write. Duplicate delivery hits a unique constraint and is skipped. From the business logic perspective, the combined system behaves as exactly-once." |
| "What happens if the relay crashes after publishing but before marking SENT?" | "The row stays in PROCESSING. It is not re-selected by a poll that only looks at PENDING rows, so the relay must also recover stale PROCESSING rows: rows where updated_at is older than the relay's crash timeout (e.g. 10 minutes). On recovery, the relay re-claims the row and re-attempts the publish. If the publish succeeds it becomes SENT; if it fails it returns to PENDING (or DEAD_LETTER if attempts are exhausted). The consumer receives the event twice and deduplicates via the event ID." |
| "How do you handle a malformed event that always fails processing?" | "MAX_ATTEMPTS and DEAD_LETTER. After N retries, the row is quarantined as DEAD_LETTER. The relay skips it and continues processing other rows. A human investigates offline, fixes the root cause, and resets to PENDING. Without MAX_ATTEMPTS, one bad event blocks the relay indefinitely." |
| "How does this compare to Kafka's exactly-once semantics?" | "Kafka EOS prevents duplicates from producer retries within the Kafka producer–broker boundary. It does not solve the gap between a PostgreSQL commit and the Kafka produce call; the relay can still crash between those two steps. EOS and consumer idempotency solve complementary problems; neither replaces the other." |
| "How do you ensure event ordering?" | "Use aggregate_id as the Kafka partition key. All events for order-123 land in the same partition and are delivered in write order. Across different aggregates, Kafka offers no cross-partition ordering. If a consumer needs strict global ordering, it needs event sourcing, not just an outbox." |

Know these cold. The delivery semantics question, the crash-recovery scenario, and the Kafka EOS framing come up in 80% of staff-level interviews that touch this pattern.



Quick Recap

  1. The Outbox pattern solves the dual-write problem by co-locating event writes with business data writes in a single database transaction, making event existence atomic with data existence.
  2. A relay process reads PENDING rows from the outbox table and publishes them to the message broker; the relay runs independently of the service that wrote the data.
  3. The pattern guarantees at-least-once delivery: every consumer must implement idempotency using a processed_events deduplication guard to prevent duplicate business logic execution.
  4. Use FOR UPDATE SKIP LOCKED when running multiple relay instances to prevent duplicate publishing; the PROCESSING status catches the gap between claiming a row and successfully publishing it.
  5. DEAD_LETTER status and MAX_ATTEMPTS are required, not optional; without them, one malformed event blocks the relay indefinitely.
  6. For CDC via Debezium, monitor WAL replication slot lag as a primary SLO; an unattended stuck replication slot will fill your PostgreSQL disk to capacity.
  7. In an interview, state the dual-write problem explicitly, name at-least-once delivery as the guarantee, describe both relay strategies, mention idempotency at the consumer, and draw a firm line between the Outbox pattern and Kafka EOS.

Variants

Inbox pattern is the consumer-side complement. The consumer writes received events into an inbox table before processing. Both the inbox INSERT and the business logic UPDATE happen in one transaction. This gives the consumer effectively-exactly-once processing semantics regardless of broker delivery guarantees. It doubles the table infrastructure but is the correct choice for payment processors and financial transaction consumers where idempotency bugs are catastrophic.

Event Sourcing takes the outbox concept to its logical conclusion: instead of the event table being a side buffer, the events themselves become the source of truth. Business state is rebuilt from events on demand. The Outbox pattern is a stepping stone toward event sourcing, but event sourcing requires a fundamentally different application architecture and is a much larger design decision.

Transactional Outbox frameworks (Spring's ChainedKafkaTransactionManager, Axon Framework, Debezium's Outbox EventRouter) provide the outbox mechanics as library abstractions. Worth knowing for the interview, but understanding the underlying mechanisms is what actually earns points.


Related Patterns

  • Saga Pattern — the Outbox pattern is the delivery mechanism that makes Saga steps reliable. Each saga step emits the next event through its outbox. Without the outbox, saga event loss causes permanently stalled transactions with no compensation path.
  • Event Sourcing — the architectural evolution of the outbox concept. Event sourcing treats the event log as the primary store. Understanding event sourcing shows where the outbox pattern's design decisions lead when taken to their logical conclusion.
  • CQRS — command/query separation often pairs with an event-driven write path. The outbox is the reliable bridge between the write model and the event stream that updates read models.
  • Message Queues — foundational mechanics of Kafka, RabbitMQ, and SQS that the relay publishes to. Understand partition semantics and at-least-once delivery before designing around the relay.
  • Circuit Breaker — add a circuit breaker on the relay's broker publish path. When Kafka is down, the circuit opens and the relay stops retrying immediately, preventing the outbox table from accumulating thousands of PROCESSING rows held open by failed in-flight attempts.
