Saga pattern

TL;DR

A saga is a sequence of local database transactions across multiple services. Each step publishes an event (or sends a command) that triggers the next. If any step fails, compensating transactions undo the already-committed steps in reverse order.
The core trade-off is eventual consistency vs atomicity: a saga does not give you ACID across service boundaries. It gives you eventual consistency through compensation, and intermediate states are visible to users and to other transactions while the saga is in flight.
Two implementation styles: choreography (services react to events on a shared bus, no coordinator) and orchestration (a central saga coordinator sends commands and tracks state). Orchestration is almost always the right choice once your saga has more than three steps.
The hardest part is not the happy path. It is making compensating transactions idempotent and reliable, because your network will drop the compensation message and the saga will retry.
Pair sagas with the Outbox Pattern to guarantee event delivery. Without it, your saga loses events silently when a service crashes between writing to its DB and publishing to the broker.

It is Friday evening. Your e-commerce platform is processing 50,000 orders per hour. Each order flows through four services: Order Service (creates the record), Inventory Service (reserves stock), Payment Service (charges the card), and Notification Service (emails the receipt).

These four services each have their own database. ACID transactions do not cross service boundaries. The databases do not share a transaction log.

When your Payment Service gets a 429 from Stripe at step three, the order already exists and the inventory is already reserved. You now have a ghost order and phantom reserved stock, with no automatic rollback in sight.

The naive fix is Two-Phase Commit (2PC). A transaction coordinator asks all participants to prepare (phase 1), then issues a global commit or abort (phase 2). It gives you something close to atomicity, but the coordinator is a single point of failure.

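If it crashes between phase 1 and phase 2, every participant that has already prepared keeps holding its row locks until the coordinator recovers or a timeout forces a decision. In practice, those held locks block every other transaction that touches the same rows, and the stall spreads across the system.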
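
To make the blocking behavior concrete, here is a toy model of the two phases. It is a sketch under simplifying assumptions: participants are in-memory objects, every vote is YES, and `Participant` and `twoPhaseCommit` are hypothetical names invented for this illustration.

```typescript
type Vote = "YES" | "NO";

class Participant {
  locked = false;
  committed = false;

  // Phase 1: lock the affected rows and vote.
  prepare(): Vote {
    this.locked = true;
    return "YES";
  }

  // Phase 2: apply the decision and release the locks.
  commit(): void {
    this.committed = true;
    this.locked = false;
  }

  abort(): void {
    this.locked = false;
  }
}

type TwoPhaseResult = "COMMITTED" | "ABORTED" | "COORDINATOR_CRASHED";

function twoPhaseCommit(participants: Participant[], crashAfterPrepare: boolean): TwoPhaseResult {
  const votes = participants.map((p) => p.prepare());
  // Simulated crash: no commit or abort ever reaches the participants.
  if (crashAfterPrepare) return "COORDINATOR_CRASHED";
  if (votes.every((v) => v === "YES")) {
    participants.forEach((p) => p.commit());
    return "COMMITTED";
  }
  participants.forEach((p) => p.abort());
  return "ABORTED";
}

const healthyRun = [new Participant(), new Participant(), new Participant()];
const healthyResult = twoPhaseCommit(healthyRun, false);

const crashedRun = [new Participant(), new Participant(), new Participant()];
const crashedResult = twoPhaseCommit(crashedRun, true);
// After the crash every participant in crashedRun is still locked and uncommitted.
```

The second run is the failure described above: all three participants voted, none received a decision, and each keeps its locks.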

Transaction coordinator box on the left, crashed with a red indicator. Three service boxes (Inventory, Payment, Shipping) in the middle, each labeled BLOCKED and waiting for COMMIT. Three database boxes on the right, each showing a locked row with a 60-second timeout. — 2PC coordinator failure leaves every participant holding row locks indefinitely. One coordinator crash freezes the entire system until locks time out. This is why 2PC is avoided in microservice architectures.

The fundamental problem is that distributed systems need a consistency model that tolerates partial failure without requiring a global lock. The saga pattern is the answer. Everything else in this article explains exactly how.

Every system I have seen skip this design conversation eventually retrofitted saga-like compensation logic after a production incident revealed the gap.

One-Line Definition

A saga sequences local transactions across services, publishing an event or message after each step to trigger the next, and running compensating transactions in reverse order when any step fails.

Analogy

Think about booking a holiday through a travel agent. The agent books your flight, reserves a hotel, and arranges a rental car (three separate transactions with three separate companies). If the car rental falls through, the agent does not magically undo the first two.

The agent calls the hotel to release the reservation, then calls the airline to cancel the flight. Each cancellation is an explicit compensating action.

The agent does not hold all three companies frozen while deciding. They act, observe outcomes, and compensate when something goes wrong.

That is exactly what a saga does. The agent is the orchestrator, each booking company is a service, and each cancellation call is a compensating transaction.

Solution Walkthrough

A saga breaks a multi-step business operation into individual local transactions, each scoped to a single service and its database. Each transaction either succeeds and publishes a success event, or fails and the saga triggers compensations.

Four green boxes showing forward steps: T1 Create Order, T2 Reserve Inventory, T3 Charge Payment, T4 Send Notification (marked FAILED in red). Below them, three yellow compensation boxes: C3 Refund Payment, C2 Release Inventory, C1 Cancel Order, with arrows running right-to-left. — When T4 fails, the saga runs compensations in reverse order: C3 refunds the charge, C2 releases the reserved stock, C1 cancels the order. Each compensation is a business-level undo, not a database rollback.

The key insight: compensating transactions are not database rollbacks. They are new, explicit business operations. CANCEL_ORDER sets the order status to CANCELLED, records who cancelled it and when, and is a new write, not a SQL undo.
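
A minimal sketch of what "a new write, not a SQL undo" can look like. The `OrderRecord` shape, its `history` field, and the `cancelOrder` signature are assumptions made for illustration, not a prescribed schema.

```typescript
type OrderStatus = "PENDING" | "CONFIRMED" | "CANCELLED";

interface HistoryEntry {
  status: OrderStatus;
  by: string;
  at: string;
  reason?: string;
}

interface OrderRecord {
  orderId: string;
  status: OrderStatus;
  // Append-only history: a compensation adds an entry, it never erases one.
  history: HistoryEntry[];
}

// C1: cancel the order as a new, forward-moving write.
function cancelOrder(order: OrderRecord, by: string, reason: string, now: Date): OrderRecord {
  const entry: HistoryEntry = { status: "CANCELLED", by, at: now.toISOString(), reason };
  return { ...order, status: "CANCELLED", history: [...order.history, entry] };
}

const created: OrderRecord = {
  orderId: "9871",
  status: "PENDING",
  history: [{ status: "PENDING", by: "order-service", at: "2024-01-05T18:00:00.000Z" }],
};

const cancelled = cancelOrder(
  created,
  "saga-orchestrator",
  "payment failed",
  new Date("2024-01-05T18:00:03.000Z"),
);
```

The cancelled record still carries the original PENDING entry. The order existed, and the data says so; the compensation only adds the fact that it was later cancelled, by whom, and why.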

For your interview: the moment you introduce multiple services with separate databases, assume you need a saga for any workflow that writes to more than one of them.

Two Implementation Styles

There is no single correct implementation of a saga. The saga pattern is a logical idea, and you implement it one of two ways.

Choreography

In a choreography-based saga, there is no central coordinator. Each service listens to the message broker for the events it cares about, processes them, and publishes the next event. The workflow emerges from the connected chain of reactions.

Kafka event bus in the center. Order Service on the top-left publishes order.created. Inventory Service on the top-right listens to order.created and publishes inventory.reserved. Payment Service on the bottom-right listens to inventory.reserved. Notification Service on the bottom-left listens to payment.charged. — No central brain. Each service reacts to the event it cares about and publishes the next. The workflow is implicit in the event chain, which makes it powerful and surprisingly difficult to debug when something goes wrong.

sequenceDiagram
    participant C as 👤 Client
    participant OS as ⚙️ Order Service
    participant EB as 📨 Event Bus (Kafka)
    participant IS as ⚙️ Inventory Service
    participant PS as ⚙️ Payment Service
    participant NS as ⚙️ Notification Service

    C->>OS: POST /orders
    OS->>OS: Create Order (PENDING)
    OS->>EB: publish order.created
    OS-->>C: HTTP 202 Accepted

    Note over EB,IS: IS subscribed to order.created
    EB-->>IS: deliver order.created
    IS->>IS: Reserve 2 units
    IS->>EB: publish inventory.reserved

    Note over EB,PS: PS subscribed to inventory.reserved
    EB-->>PS: deliver inventory.reserved
    PS->>PS: Charge $149.99
    PS->>EB: publish payment.charged

    Note over EB,NS: NS subscribed to payment.charged
    EB-->>NS: deliver payment.charged
    NS->>NS: Send confirmation email
    NS->>EB: publish notification.sent

    EB-->>OS: deliver notification.sent
    OS->>OS: Update Order to CONFIRMED

Choreography is attractive because there is no central service to maintain, and teams can add new participants by subscribing to the right event without touching other services. But the workflow is invisible. If you want to know what step a saga is on, you have to reconstruct it from the event log.
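
The chain of reactions can be sketched with an in-memory bus. This is an illustration only: `InMemoryBus` is a hypothetical stand-in for the broker, delivery is synchronous, and each "service" is reduced to a single handler.

```typescript
type SagaEvent = { type: string; orderId: string };
type Handler = (event: SagaEvent) => void;

// Hypothetical stand-in for the message broker.
class InMemoryBus {
  private handlers = new Map<string, Handler[]>();
  readonly log: string[] = [];

  subscribe(type: string, handler: Handler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  publish(event: SagaEvent): void {
    this.log.push(event.type);
    for (const handler of this.handlers.get(event.type) ?? []) handler(event);
  }
}

const bus = new InMemoryBus();
const orderStatus = new Map<string, string>();

// Each service knows only the event it consumes and the event it emits.
bus.subscribe("order.created", (e) => bus.publish({ type: "inventory.reserved", orderId: e.orderId }));
bus.subscribe("inventory.reserved", (e) => bus.publish({ type: "payment.charged", orderId: e.orderId }));
bus.subscribe("payment.charged", (e) => bus.publish({ type: "notification.sent", orderId: e.orderId }));
bus.subscribe("notification.sent", (e) => orderStatus.set(e.orderId, "CONFIRMED"));

// Order Service starts the saga.
orderStatus.set("9871", "PENDING");
bus.publish({ type: "order.created", orderId: "9871" });
```

No line of this code states the workflow as a whole. The only place the full sequence appears is `bus.log`, after the fact, which is the visibility problem described above in miniature.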

Orchestration

In an orchestration-based saga, a central saga orchestrator owns the state machine. It sends commands to each service, waits for their replies, and decides what to do next. The workflow is explicit and visible.

The orchestrator holds the saga state machine. Every step is a command it issues; every reply advances the state. Because the state is persisted after each transition, a restarted orchestrator resumes where it left off.
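
One way to make that state machine explicit is a transition table. The sketch below assumes three event kinds (`STEP_COMPLETED`, `STEP_FAILED`, `COMPENSATIONS_DONE`) that are names chosen for this example; the states are the ones used in the diagram that follows.

```typescript
type SagaState =
  | "STARTED" | "ORDER_CREATED" | "INVENTORY_RESERVED" | "PAYMENT_CHARGED"
  | "COMPLETED" | "COMPENSATING" | "COMPENSATED";

type SagaEventKind = "STEP_COMPLETED" | "STEP_FAILED" | "COMPENSATIONS_DONE";

// Allowed transitions, keyed by current state and the event just observed.
const transitions: Record<SagaState, Partial<Record<SagaEventKind, SagaState>>> = {
  STARTED:            { STEP_COMPLETED: "ORDER_CREATED",      STEP_FAILED: "COMPENSATING" },
  ORDER_CREATED:      { STEP_COMPLETED: "INVENTORY_RESERVED", STEP_FAILED: "COMPENSATING" },
  INVENTORY_RESERVED: { STEP_COMPLETED: "PAYMENT_CHARGED",    STEP_FAILED: "COMPENSATING" },
  PAYMENT_CHARGED:    { STEP_COMPLETED: "COMPLETED",          STEP_FAILED: "COMPENSATING" },
  COMPENSATING:       { COMPENSATIONS_DONE: "COMPENSATED" },
  COMPLETED:          {},
  COMPENSATED:        {},
};

function advance(state: SagaState, event: SagaEventKind): SagaState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`illegal transition: ${event} in state ${state}`);
  }
  return next;
}

// Three successful replies, then the notification step fails and compensation finishes.
const replies: SagaEventKind[] = [
  "STEP_COMPLETED", "STEP_COMPLETED", "STEP_COMPLETED", "STEP_FAILED", "COMPENSATIONS_DONE",
];
const finalState = replies.reduce<SagaState>((state, event) => advance(state, event), "STARTED");
```

Terminal states have no outgoing transitions, so a late or duplicated reply for a finished saga is rejected instead of silently moving it somewhere else.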

sequenceDiagram
    participant C as 👤 Client
    participant O as 🧠 Saga Orchestrator
    participant OS as ⚙️ Order Service
    participant IS as ⚙️ Inventory Service
    participant PS as ⚙️ Payment Service
    participant NS as ⚙️ Notification Service

    C->>O: POST /sagas/order (orderId=9871)
    activate O
    Note over O: State: STARTED

    O->>OS: cmd: createOrder(orderId=9871)
    OS-->>O: reply: COMPLETED
    Note over O: State: ORDER_CREATED

    O->>IS: cmd: reserveInventory(orderId, qty=2)
    IS-->>O: reply: COMPLETED
    Note over O: State: INVENTORY_RESERVED

    O->>PS: cmd: chargePayment(orderId, $149.99)
    PS-->>O: reply: COMPLETED
    Note over O: State: PAYMENT_CHARGED

    O->>NS: cmd: sendNotification(orderId)
    NS-->>O: reply: FAILED (smtp timeout)
    Note over O: State: COMPENSATING

    O->>PS: cmd: refundPayment(orderId)
    PS-->>O: reply: COMPENSATED
    O->>IS: cmd: releaseInventory(orderId)
    IS-->>O: reply: COMPENSATED
    O->>OS: cmd: cancelOrder(orderId)
    OS-->>O: reply: COMPENSATED
    deactivate O
    Note over O: State: COMPENSATED
    O-->>C: HTTP 200 (saga failed, order cancelled)

I recommend orchestration by default. The exception is when you have a very small team, a very short saga (two or three steps), and services that are all owned by that one team. The moment you have cross-team ownership or five or more steps, orchestration pays for itself on the first debugging session. Choreography looks elegant in architecture diagrams, but it is a debugging nightmare at 3 a.m.

Implementation Sketch

Here is a typed sketch of an orchestration-based saga in TypeScript. This is deliberately simplified to show the state machine mechanics.

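// Orchestration-based saga: state machine skeleton.
// SagaContext and the declared collaborators below are assumed shapes, inferred
// from how this sketch uses them; substitute your own clients and repository.
type SagaState =
  | "STARTED" | "ORDER_CREATED" | "INVENTORY_RESERVED" | "PAYMENT_CHARGED"
  | "COMPLETED" | "COMPENSATING" | "COMPENSATED";

interface SagaContext { orderId: string; qty: number; amount: number; state: SagaState }

declare const orderService: { create(orderId: string): Promise<void>; cancel(orderId: string): Promise<void> };
declare const inventoryService: { reserve(orderId: string, qty: number): Promise<void>; release(orderId: string): Promise<void> };
declare const paymentService: { charge(orderId: string, amount: number): Promise<void>; refund(orderId: string): Promise<void> };
declare const notificationService: { send(orderId: string): Promise<void> };
declare const sagaRepository: { recordStep(orderId: string, step: string, status: "STARTED" | "COMPLETED"): Promise<void> };

class OrderSagaOrchestrator {
  async execute(ctx: SagaContext): Promise<void> {
    try {
      await this.step("createOrder", ctx,
        () => orderService.create(ctx.orderId));
      ctx.state = "ORDER_CREATED";

      await this.step("reserveInventory", ctx,
        () => inventoryService.reserve(ctx.orderId, ctx.qty));
      ctx.state = "INVENTORY_RESERVED";

      await this.step("chargePayment", ctx,
        () => paymentService.charge(ctx.orderId, ctx.amount));
      ctx.state = "PAYMENT_CHARGED";

      await this.step("sendNotification", ctx,
        () => notificationService.send(ctx.orderId));
      ctx.state = "COMPLETED";
    } catch {
      // ctx.state still names the last step that committed, e.g. "INVENTORY_RESERVED".
      const lastCommittedState = ctx.state;
      ctx.state = "COMPENSATING";
      await this.compensate(ctx, lastCommittedState);
    }
  }

  // Undo committed steps in reverse order, starting from the last one that committed.
  private async compensate(ctx: SagaContext, lastCommittedState: SagaState): Promise<void> {
    const rank: Partial<Record<SagaState, number>> = {
      ORDER_CREATED: 1, INVENTORY_RESERVED: 2, PAYMENT_CHARGED: 3,
    };
    const at = rank[lastCommittedState] ?? 0; // 0 means nothing committed yet
    if (at >= 3) await paymentService.refund(ctx.orderId);    // C3
    if (at >= 2) await inventoryService.release(ctx.orderId); // C2
    if (at >= 1) await orderService.cancel(ctx.orderId);      // C1
    ctx.state = "COMPENSATED";
  }

  // Record the step before and after it runs so a restart can tell how far the saga got.
  private async step(name: string, ctx: SagaContext, fn: () => Promise<void>): Promise<void> {
    await sagaRepository.recordStep(ctx.orderId, name, "STARTED");
    await fn();
    await sagaRepository.recordStep(ctx.orderId, name, "COMPLETED");
  }
}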

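Notice the sagaRepository.recordStep calls wrapping every step. They are not optional. If the orchestrator crashes mid-saga and restarts, it reads the step log and resumes after the last COMPLETED step. A step logged as STARTED but never COMPLETED may or may not have executed, so the resumed orchestrator re-issues it, which is safe only if the step is idempotent. Without the log, every restart re-executes steps from the beginning, causing duplicate charges, double-reservations, and a very bad day.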
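
The resume decision can be sketched as a pure function over the step log. The `StepLogEntry` shape and the `nextStepToRun` helper are assumptions for illustration; the step names match the skeleton above.

```typescript
type StepStatus = "STARTED" | "COMPLETED";

interface StepLogEntry {
  step: string;
  status: StepStatus;
}

const STEP_ORDER = ["createOrder", "reserveInventory", "chargePayment", "sendNotification"] as const;
type StepName = (typeof STEP_ORDER)[number];

// Returns the first step with no COMPLETED entry, or null when every step is done.
function nextStepToRun(log: StepLogEntry[]): StepName | null {
  const completed = new Set<string>(
    log.filter((entry) => entry.status === "COMPLETED").map((entry) => entry.step),
  );
  for (const step of STEP_ORDER) {
    if (!completed.has(step)) return step;
  }
  return null;
}

// The orchestrator crashed after chargePayment was STARTED but before COMPLETED was recorded.
const logAfterCrash: StepLogEntry[] = [
  { step: "createOrder", status: "STARTED" },
  { step: "createOrder", status: "COMPLETED" },
  { step: "reserveInventory", status: "STARTED" },
  { step: "reserveInventory", status: "COMPLETED" },
  { step: "chargePayment", status: "STARTED" },
];

const resumeAt = nextStepToRun(logAfterCrash);
```

Here the orchestrator resumes at chargePayment. It cannot know whether the charge went through before the crash, which is why the step itself has to tolerate being called twice.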

When It Shines

Ok, but here is the thing most people miss in interviews: the saga pattern is not a general-purpose transaction mechanism. It is specifically designed for one scenario. Use it when:

You have two or more microservices, each with their own database, that must participate in the same business operation
The workflow can tolerate intermediate visible states (e.g., "order pending" before inventory is confirmed)
Steps are sequential with clear dependencies (each step depends on the previous one succeeding)
Each step has a well-defined compensating transaction that is reliable and idempotent
Your team can accept eventual consistency as the end state

Do not use it:

When all your data lives in a single database (use regular ACID transactions)
When you need strict atomicity with no visible intermediate state (consider rethinking the design; this is rarely a hard requirement)
When compensating transactions cannot be reasoned about because the downstream effects are irreversible (example: you cannot un-send 10 million push notifications)
For read-heavy workflows (sagas are a write coordination pattern)

The rule of thumb I apply: if you have microservices with separate databases and a multi-step business flow, you almost certainly need a saga. The mistake I see most often is reaching for a saga inside a monolith as future-proofing; do not.

Failure Modes and Pitfalls

The happy path is not where sagas are hard. These are the places where real systems break. I have debugged each of these in production; the semantic lock problem is the one most engineers do not see coming.

Top row shows four service state boxes: Orders DB (COMMITTED), Inventory DB (COMMITTED), Payments DB (COMMITTED), Notification Svc (FAILED). An orange warning box explains the inconsistent state. Below, three compensation boxes with arrows running right-to-left. A purple note at the bottom explains the idempotency requirement. — After three successful steps and one failure, the system is in a partially committed state. All three compensations must run to restore consistency, and each one must be idempotent in case the network drops mid-compensation and the saga retries.

1. Non-idempotent compensating transactions

Your saga runs refundPayment. The network drops before the ACK returns. The saga retries and runs refundPayment again. Without idempotency, the customer gets refunded twice.

Every compensating transaction must check whether it already ran before executing. Use an idempotency_key or store the compensation result against the saga_id and step_name. Stripe's API does this natively. Your internal services need to do the same.
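
A minimal sketch of the check-before-execute idea, assuming an in-memory map keyed by saga id and step name. `IdempotentRefunds` is a hypothetical class; in a real service the lookup and the write would need to be atomic, for example through a unique key in the same database transaction as the refund record.

```typescript
interface RefundResult {
  refundId: string;
  orderId: string;
}

class IdempotentRefunds {
  // Keyed by `${sagaId}:${stepName}`.
  private results = new Map<string, RefundResult>();
  refundsIssued = 0;

  refundPayment(sagaId: string, orderId: string): RefundResult {
    const key = `${sagaId}:refundPayment`;
    const previous = this.results.get(key);
    // A retry finds the stored result and returns it without refunding again.
    if (previous !== undefined) return previous;

    this.refundsIssued += 1; // stands in for the real money movement
    const result: RefundResult = { refundId: `refund-${this.refundsIssued}`, orderId };
    this.results.set(key, result);
    return result;
  }
}

const refunds = new IdempotentRefunds();
const firstAttempt = refunds.refundPayment("saga-9871", "9871");
// The ACK for the first attempt was lost, so the saga calls again.
const retryAttempt = refunds.refundPayment("saga-9871", "9871");
```

Both calls return the same refund, and only one refund is issued.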

2. The outbox gap

Your service writes to its database and then publishes an event to Kafka. Between the write and the publish, the process crashes. The database has the record. The event was never sent. The saga stalls silently.

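The fix is the Outbox Pattern: write both the business data and the outgoing event in the same local database transaction. A background process reads the outbox and delivers to the broker. This makes event delivery reliable by reducing it to a single-database ACID write. Delivery is at-least-once, because the relay can crash after publishing but before marking the row as sent, so consumers must still handle duplicates.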
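
The mechanics can be sketched in memory. Everything here is an assumption made for illustration: `OrderDatabase` stands in for one service's database, `createOrderWithEvent` stands in for a single local transaction, and `relayOutbox` stands in for the background process.

```typescript
interface OutboxRow {
  id: number;
  type: string;
  payload: string;
  published: boolean;
}

class OrderDatabase {
  orders = new Map<string, string>();
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // Stand-in for ONE local transaction: in a real database both writes
  // commit together or roll back together.
  createOrderWithEvent(orderId: string): void {
    this.orders.set(orderId, "PENDING");
    this.outbox.push({ id: this.nextId++, type: "order.created", payload: orderId, published: false });
  }
}

// Relay: publish pending rows in order, marking each one only after the broker accepts it.
function relayOutbox(db: OrderDatabase, publish: (row: OutboxRow) => boolean): number {
  let delivered = 0;
  for (const row of db.outbox) {
    if (row.published) continue;
    if (!publish(row)) break; // broker unavailable: stop and retry on the next poll
    row.published = true;
    delivered += 1;
  }
  return delivered;
}

const db = new OrderDatabase();
db.createOrderWithEvent("9871");

const brokerLog: string[] = [];
// First poll: the broker is down, so the row stays pending.
const deliveredWhileDown = relayOutbox(db, () => false);
// Second poll: the broker is back, and the same row is delivered.
const deliveredAfterRecovery = relayOutbox(db, (row) => {
  brokerLog.push(row.type);
  return true;
});
```

The event survives the broker outage because it lives in the database next to the order. Nothing is lost; it is only late.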

The silent saga stall is the hardest bug to find

A saga that stalls mid-flight looks exactly like a slow saga from the outside. No error. No exception. Just a saga that never reaches COMPLETED. Always build a timeout monitor that alerts when any saga has been in a non-terminal state for more than N minutes. Without this, you will not find stuck sagas until a customer calls.
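
A timeout monitor can be as small as one query over the saga table. The sketch below assumes each saga row carries its current state and an `updatedAt` timestamp; `SagaRow` and `findStalledSagas` are hypothetical names.

```typescript
type State =
  | "STARTED" | "ORDER_CREATED" | "INVENTORY_RESERVED" | "PAYMENT_CHARGED"
  | "COMPENSATING" | "COMPLETED" | "COMPENSATED";

interface SagaRow {
  sagaId: string;
  state: State;
  updatedAt: Date;
}

const TERMINAL_STATES: ReadonlySet<State> = new Set<State>(["COMPLETED", "COMPENSATED"]);

// Sagas that are not finished and have not moved for longer than maxAgeMinutes.
function findStalledSagas(rows: SagaRow[], now: Date, maxAgeMinutes: number): SagaRow[] {
  const cutoff = now.getTime() - maxAgeMinutes * 60 * 1000;
  return rows.filter((row) => !TERMINAL_STATES.has(row.state) && row.updatedAt.getTime() < cutoff);
}

const now = new Date("2024-01-05T18:30:00Z");
const rows: SagaRow[] = [
  { sagaId: "a", state: "COMPLETED", updatedAt: new Date("2024-01-05T18:00:00Z") },
  { sagaId: "b", state: "INVENTORY_RESERVED", updatedAt: new Date("2024-01-05T18:05:00Z") },
  { sagaId: "c", state: "PAYMENT_CHARGED", updatedAt: new Date("2024-01-05T18:29:00Z") },
];

// With a 10-minute threshold only saga "b" is reported: "a" is terminal, "c" moved a minute ago.
const stalled = findStalledSagas(rows, now, 10);
```

Run this on a schedule and alert on a non-empty result. Note that COMPENSATING counts as non-terminal, so a saga stuck halfway through its rollback is reported too.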

3. Compensations that fail

Your saga runs releaseInventory as a compensation. The Inventory Service is down. Now what?

You need a compensation retry loop with exponential backoff. Compensations are not optional. They must eventually complete. Some teams implement a dead-letter compensation queue where permanently failed compensations are routed for manual review. This is not a failure mode you can ignore without accumulating ghost records.
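
A sketch of that retry loop, kept synchronous so the control flow is easy to follow. The assumptions: `compensate` returns `true` on success, `sleep` is injected (real code would await a timer), and the base delay, cap, and attempt limit are placeholder values. Production implementations commonly add random jitter to the delay as well.

```typescript
// Delay before the next try after failed attempt number `attempt` (1-based):
// base * 2^(attempt - 1), capped at capMs.
function backoffDelayMs(attempt: number, baseMs: number = 200, capMs: number = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

interface CompensationOutcome {
  status: "COMPENSATED" | "DEAD_LETTERED";
  attempts: number;
}

function runCompensation(
  compensate: () => boolean,
  sleep: (ms: number) => void,
  maxAttempts: number,
): CompensationOutcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (compensate()) return { status: "COMPENSATED", attempts: attempt };
    if (attempt < maxAttempts) sleep(backoffDelayMs(attempt));
  }
  // Retries exhausted: hand the compensation to a dead-letter queue for manual review.
  return { status: "DEAD_LETTERED", attempts: maxAttempts };
}

// The Inventory Service is down for the first two calls, then recovers.
let calls = 0;
const waits: number[] = [];
const outcome = runCompensation(
  () => {
    calls += 1;
    return calls >= 3;
  },
  (ms) => {
    waits.push(ms);
  },
  5,
);
```

In this run the compensation succeeds on the third attempt after waiting 200 ms and then 400 ms. If the service never recovers, the function returns `DEAD_LETTERED` rather than dropping the compensation.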

4. Pivot transactions and the point of no return

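Not every step is reversible. If your notification service sends an email midway through the saga, and then the payment fails, you cannot unsend the email. A step like this acts as the saga's pivot transaction, the point of no return: once it commits, the saga can no longer be rolled back cleanly and has to be driven forward to completion.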

Good saga design places irreversible steps last (or avoids them until the saga is effectively committed). Notification sends, external API calls, and webhook deliveries are all pivot candidates.
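
The "irreversible steps last" rule is easy to check mechanically when a saga is defined. This is a sketch under the assumption that each step declares whether it is reversible; `StepDef` and `stepsAfterPointOfNoReturn` are hypothetical names.

```typescript
interface StepDef {
  name: string;
  reversible: boolean;
}

// Returns every step scheduled after the first irreversible one. If any of
// those steps fails, the saga would have to undo something that cannot be undone.
function stepsAfterPointOfNoReturn(steps: StepDef[]): string[] {
  const firstIrreversible = steps.findIndex((step) => !step.reversible);
  if (firstIrreversible === -1) return [];
  return steps.slice(firstIrreversible + 1).map((step) => step.name);
}

// Risky ordering: the email goes out before the payment is attempted.
const emailMidway: StepDef[] = [
  { name: "createOrder", reversible: true },
  { name: "reserveInventory", reversible: true },
  { name: "sendNotification", reversible: false },
  { name: "chargePayment", reversible: true },
];

// Safe ordering: the email is the final step.
const emailLast: StepDef[] = [
  { name: "createOrder", reversible: true },
  { name: "reserveInventory", reversible: true },
  { name: "chargePayment", reversible: true },
  { name: "sendNotification", reversible: false },
];

const riskySteps = stepsAfterPointOfNoReturn(emailMidway); // contains "chargePayment"
const safeSteps = stepsAfterPointOfNoReturn(emailLast);    // empty
```

A non-empty result is a design smell worth failing a build over: it names exactly the steps whose failure would leave an irreversible side effect behind.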

5. Semantic locks and dirty reads

Between T3 (payment charged) and C3 (payment refunded), the customer's bank statement shows a charge. That is a visible intermediate state. During this window, a concurrent saga for the same customer may read that charge and make incorrect decisions. This is a dirty read: sagas have no isolation between them, and no database lock protects you here.

The mitigation is a semantic lock: design your data model to include saga status in every affected record. An inventory record reserved by an in-flight saga should be flagged as PENDING_SAGA. Reads should treat PENDING_SAGA records as provisionally allocated, not finalized.
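
Here is a small sketch of how reads can respect that flag, for the reservations of a single SKU. The `Reservation` shape and the three helper functions are assumptions for illustration.

```typescript
type ReservationStatus = "PENDING_SAGA" | "CONFIRMED";

interface Reservation {
  sagaId: string;
  qty: number;
  status: ReservationStatus;
}

// Stock that may be offered to a new order: pending reservations count as taken.
function availableToPromise(onHand: number, reservations: Reservation[]): number {
  const held = reservations.reduce((sum, r) => sum + r.qty, 0);
  return Math.max(0, onHand - held);
}

// Stock that is definitely sold: only finalized reservations count.
function finalizedUnits(reservations: Reservation[]): number {
  return reservations
    .filter((r) => r.status === "CONFIRMED")
    .reduce((sum, r) => sum + r.qty, 0);
}

// Compensation (C2) removes the saga's reservation.
function releaseReservation(reservations: Reservation[], sagaId: string): Reservation[] {
  return reservations.filter((r) => r.sagaId !== sagaId);
}

const onHand = 10;
const reservations: Reservation[] = [
  { sagaId: "saga-1", qty: 2, status: "PENDING_SAGA" },
  { sagaId: "saga-2", qty: 3, status: "CONFIRMED" },
];

const whileInFlight = availableToPromise(onHand, reservations); // 10 - (2 + 3) = 5
const sold = finalizedUnits(reservations);                      // 3
const afterCompensation = availableToPromise(onHand, releaseReservation(reservations, "saga-1")); // 10 - 3 = 7
```

The two reads answer different questions. Allocation decisions count the pending reservation so stock is not oversold; reporting ignores it so in-flight sagas never show up as sales.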

Trade-offs

Aspect	Saga	2PC	Local Transaction
Consistency model	Eventual	Near-atomic	ACID
Coordinator SPOF	No	Yes (coordinator)	No
Intermediate visible state	Yes	No	No
Performance	High (async)	Low (blocking)	High
Implementation complexity	High	Medium	Low
Compensations required	Yes	No (abort handles it)	No
Scales to microservices	Yes	Poor (coordinator bottleneck)	N/A (single DB)
Failure isolation	Strong	Weak (cascade from coordinator)	N/A

The fundamental tension here is consistency vs availability: 2PC gives you stronger consistency but kills availability when the coordinator is slow or down. Sagas keep every service available through failures but require you to design the inconsistency window explicitly.

Real-World Usage

Uber processes roughly 25 million trips per day globally. Matching a rider with a driver, authorizing the payment hold, activating GPS tracking, and notifying the driver are separate service operations across separate databases. Early versions lost trip state mid-saga during service restarts; their current implementation uses orchestration with a saga log persisted in Cassandra for durable recovery across regions.

Stripe processes over 500 million API calls per day, making idempotent retry behavior a first-class API primitive. Their idempotency key system exists precisely because payment sagas require compensations to be safe to retry. When you call POST /charges with the same idempotency key twice, Stripe returns the cached result instead of charging twice.

Amazon processes millions of orders per day, each flowing through at least six internal services (selection availability, pricing, stock reservation, payment authorization, fulfillment, notifications). At Amazon's scale, the time between order creation and full commitment spans seconds to minutes. Customers see the intermediate states ("Payment pending", "Preparing shipment") in their order console; these are deliberate saga state exposures, not bugs.

How This Shows Up in Interviews

So when does this come up in a design interview? Every time you sketch a microservices architecture and someone says "but what happens if the payment fails after you reserved inventory?"

That question is the saga trigger. The moment you are designing a workflow that spans two or more services with separate databases, you need to address distributed consistency, and "we'll use a distributed transaction" is the wrong answer at a staff-level interview.

Depth expected at senior and staff level:

Distinguish choreography vs orchestration and state when/why you would choose each
Know that compensating transactions must be idempotent, and explain what that means concretely
Name the Outbox Pattern as the mechanism for guaranteed event delivery
Identify the pivot transaction problem and how saga design avoids it
Understand that sagas give eventual consistency, not atomicity, and be able to explain what the intermediate states look like
Know the failure modes: stalled sagas, failed compensations, semantic locks

Interview Q&A:

Interviewer asks	Strong answer
"How do you handle failures in your distributed order flow?"	"I would use a saga. Each service runs a local transaction and publishes an event. On failure, we trigger compensating transactions in reverse. I would use orchestration to keep the state machine visible and debuggable."
"What is the difference between a saga and a 2PC?"	"2PC requires a coordinator that holds all participants in a locked state until commit or abort. The coordinator is a SPOF and causes blocking failures. Sagas use compensations instead of locks, keeping each service available and independent."
"What happens if a compensation fails?"	"The compensation needs a retry loop with exponential backoff. Permanently failed compensations go to a dead-letter queue for manual review. You cannot ignore a failed compensation; you will accumulate inconsistent state."
"How do you prevent double-charging if the payment service is called twice?"	"Every step and compensation must be idempotent. Store a (saga_id, step_name) key and check it before executing. If the key already exists, return the cached result. Stripe does this with idempotency keys at the API level."
"Can you walk me through the saga state machine?"	Enumerate the states: STARTED, ORDER_CREATED, INVENTORY_RESERVED, PAYMENT_CHARGED, COMPLETED on the happy path; COMPENSATING, COMPENSATED on failure. Draw the transitions. Mention that state is persisted to enable restart recovery.

The strongest move on any saga question is to name the failure modes before the interviewer asks. Surface idempotent compensations, stalled saga detection, and the outbox gap as known challenges, then explain concisely how you would address each.

Interview tip: name the failure mode before they ask

The strongest interview move is to proactively say: "The hardest part of saga design is not the happy path. It is idempotent compensations, stalled saga detection, and the outbox gap. Here is how I would address each." This signals staff-level thinking; you have operated this, not just read about it.

When to Use Which Style

Decision flowchart with three diamond nodes: 5 or more steps or cross-team ownership, need explicit rollback coordination, need centralized observability. YES answers route to Orchestration boxes. NO path through all three routes to Choreography. — Default to Orchestration. Choreography only makes sense for short, team-local sagas where you explicitly want the decoupling. The debugging cost of choreography at scale almost always exceeds the deployment simplicity benefit.

Decision flowchart. If no cross-service span, use local transactions. If XA support everywhere, consider 2PC but beware SPOF. If eventual consistency is acceptable, use Saga. If strong consistency required across microservices, rethink the design. — The 90% answer: if you have microservices with separate databases, you need the Saga pattern. 2PC is a legacy pattern, and local transactions are always best when the data fits in one database.

Quick Recap

A saga is a sequence of local transactions across multiple services, where each step publishes an event that triggers the next, and compensating transactions undo committed steps when a failure occurs.
Compensating transactions are explicit business operations (not database rollbacks) and must be idempotent because the network will drop ACKs and cause retries.
Orchestration (central coordinator with explicit state machine) is almost always preferable to choreography for sagas with more than three steps or cross-team service ownership.
The Outbox Pattern is a required companion to the saga pattern: write the event to a local DB table in the same transaction as the business data, then deliver via a background relay to guarantee no event is silently lost.
The hardest failure modes are stalled sagas (detect with timeout monitors), non-idempotent compensations (fix with idempotency keys), and pivot transactions (place irreversible steps last).
Sagas give eventual consistency, not atomicity. Customers see intermediate states. Design your UI and business rules around this reality rather than pretending it does not exist.
If your saga has conditional branching, parallel steps, or long-running timeouts, you have outgrown the basic saga pattern. Consider a workflow engine like Temporal or AWS Step Functions.

Related Patterns

Outbox Pattern: The Outbox Pattern is not optional with sagas at production scale. It closes the gap between writing to your database and publishing an event, ensuring events are never silently lost when a service restarts mid-saga.
Event Sourcing: Event sourcing stores every state change as an immutable event, which gives you a natural audit log of every saga step. At high saga volumes, event sourcing can replace the saga step log entirely.
Circuit Breaker: Wrap the external service calls inside each saga step in a circuit breaker. Without it, a single failing downstream service leaves every in-flight saga that depends on it blocked until its call times out.
Message Queues: Sagas are only as reliable as their message delivery. Understanding Kafka's durability guarantees, consumer group semantics, and dead-letter queue behavior is prerequisite knowledge for operating sagas in production.
Microservices: The saga pattern exists because of microservices. If you have not internalized why microservices have separate databases, re-read the microservices article first. The saga pattern will not make sense without that foundation.