Saga pattern
Learn how the saga pattern maintains data consistency across microservices without distributed locks, and why compensating transactions are the key to surviving partial failure.
TL;DR
- A saga is a sequence of local database transactions across multiple services. Each step publishes an event (or sends a command) that triggers the next. If any step fails, compensating transactions undo the already-committed steps in reverse order.
- The core trade-off is eventual consistency vs atomicity: a saga does not give you ACID across service boundaries. It gives you eventual consistency through compensation, with no isolation between steps, so other requests can see intermediate states.
- Two implementation styles: choreography (services react to events on a shared bus, no coordinator) and orchestration (a central saga coordinator sends commands and tracks state). Orchestration is almost always the right choice once your saga has more than three steps.
- The hardest part is not the happy path. It is making compensating transactions idempotent and reliable, because your network will drop the compensation message and the saga will retry.
- Pair sagas with the Outbox Pattern to guarantee event delivery. Without it, your saga loses events silently when a service crashes between writing to its DB and publishing to the broker.
The Problem
It is Friday evening. Your e-commerce platform is processing 50,000 orders per hour. Each order flows through four services: Order Service (creates the record), Inventory Service (reserves stock), Payment Service (charges the card), and Notification Service (emails the receipt).
These four services each have their own database. ACID transactions do not cross service boundaries. The databases do not share a transaction log.
When your Payment Service gets a 429 from Stripe at step three, the order already exists and the inventory is already reserved. You now have a ghost order and phantom reserved stock, with no automatic rollback in sight.
The naive fix is Two-Phase Commit (2PC). A transaction coordinator asks all participants to prepare (phase 1), then issues a global commit or abort (phase 2). It gives you something close to atomicity, but the coordinator is a single point of failure.
If it crashes between phase 1 and phase 2, every participant that voted yes holds its row locks until the coordinator recovers. In practice, a slow recovery can freeze every workflow that touches those rows.
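The blocking behavior is easier to see in a toy model. The following is a minimal in-memory sketch, not a real 2PC implementation; `TwoPhaseParticipant` and its methods are hypothetical names chosen for illustration.

```typescript
// Toy model of 2PC blocking. All names here are illustrative assumptions.
class TwoPhaseParticipant {
  private lockedRows = new Set<string>();

  // Phase 1: vote yes and keep the row locked until the coordinator decides.
  prepare(rowId: string): boolean {
    if (this.lockedRows.has(rowId)) return false; // another transaction holds it
    this.lockedRows.add(rowId);
    return true;
  }

  // Phase 2: only the coordinator's commit/abort decision releases the lock.
  finish(rowId: string): void {
    this.lockedRows.delete(rowId);
  }

  isLocked(rowId: string): boolean {
    return this.lockedRows.has(rowId);
  }
}

const inventoryDb = new TwoPhaseParticipant();
const paymentDb = new TwoPhaseParticipant();

// Phase 1 succeeds on both participants...
const votes = [inventoryDb.prepare("sku-42"), paymentDb.prepare("card-7")];

// ...and then the coordinator crashes before phase 2, so finish() is never called.
// Any new transaction touching the same rows is refused until it recovers.
const secondOrderCanPrepare = inventoryDb.prepare("sku-42");
```

Nothing in the participant can release the lock on its own, which is exactly the property a saga avoids.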
The fundamental problem is that distributed systems need a consistency model that tolerates partial failure without requiring a global lock. The saga pattern is the answer. Everything else in this article explains exactly how.
Every system I have seen skip this design conversation eventually retrofitted saga-like compensation logic after a production incident revealed the gap.
One-Line Definition
A saga sequences local transactions across services, publishing an event or message after each step to trigger the next, and running compensating transactions in reverse order when any step fails.
Analogy
Think about booking a holiday through a travel agent. The agent books your flight, reserves a hotel, and arranges a rental car (three separate transactions with three separate companies). If the car rental falls through, the agent does not magically undo the first two.
The agent calls the hotel to release the reservation, then calls the airline to cancel the flight. Each cancellation is an explicit compensating action.
The agent does not hold all three companies frozen while deciding. They act, observe outcomes, and compensate when something goes wrong.
That is exactly what a saga does. The agent is the orchestrator, each booking company is a service, and each cancellation call is a compensating transaction.
Solution Walkthrough
A saga breaks a multi-step business operation into individual local transactions, each scoped to a single service and its database. Each transaction either succeeds and publishes a success event, or fails and the saga triggers compensations.
The key insight: compensating transactions are not database rollbacks. They are new, explicit business operations. CANCEL_ORDER, for example, sets the order status to CANCELLED and records who cancelled it and when. It is a new write, not a SQL undo.
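To make the "new write, not an undo" point concrete, here is a small sketch. The `OrderRecord` shape and the `cancelOrder` function are assumptions for illustration, not a prescribed schema.

```typescript
// Sketch: a compensation is a forward-moving business write, not a rollback.
// OrderRecord and cancelOrder are illustrative names.
interface OrderRecord {
  orderId: string;
  orderStatus: "PENDING" | "CONFIRMED" | "CANCELLED";
  cancelledBy?: string;
  cancelledAt?: string; // ISO-8601 timestamp
  version: number;      // bumps on every write, so history is never erased
}

function cancelOrder(order: OrderRecord, actor: string, now: Date): OrderRecord {
  // The row is not deleted or restored to a "before" image; it moves forward.
  return {
    ...order,
    orderStatus: "CANCELLED",
    cancelledBy: actor,
    cancelledAt: now.toISOString(),
    version: order.version + 1,
  };
}

const created: OrderRecord = { orderId: "9871", orderStatus: "PENDING", version: 1 };
const cancelled = cancelOrder(created, "order-saga", new Date("2024-01-05T18:00:00Z"));
```

The cancelled record still carries the order's identity and a higher version, so an auditor can see that the order existed and why it ended.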
For your interview: the moment you introduce multiple services with separate databases, assume you need a saga for any workflow that spans more than one of them.
Two Implementation Styles
There is no single correct implementation of a saga. The saga pattern is a logical idea, and you implement it in one of two ways.
Choreography
In a choreography-based saga, there is no central coordinator. Each service listens to the message broker for the events it cares about, processes them, and publishes the next event. The workflow emerges from the connected chain of reactions.
sequenceDiagram
participant C as 👤 Client
participant OS as ⚙️ Order Service
participant EB as 📨 Event Bus (Kafka)
participant IS as ⚙️ Inventory Service
participant PS as ⚙️ Payment Service
participant NS as ⚙️ Notification Service
C->>OS: POST /orders
OS->>OS: Create Order (PENDING)
OS->>EB: publish order.created
OS-->>C: HTTP 202 Accepted
Note over EB,IS: IS subscribed to order.created
EB-->>IS: deliver order.created
IS->>IS: Reserve 2 units
IS->>EB: publish inventory.reserved
Note over EB,PS: PS subscribed to inventory.reserved
EB-->>PS: deliver inventory.reserved
PS->>PS: Charge $149.99
PS->>EB: publish payment.charged
Note over EB,NS: NS subscribed to payment.charged
EB-->>NS: deliver payment.charged
NS->>NS: Send confirmation email
NS->>EB: publish notification.sent
EB-->>OS: deliver notification.sent
OS->>OS: Update Order to CONFIRMED
Choreography is attractive because there is no central service to maintain, and teams can add new participants by subscribing to the right event without touching other services. But the workflow is invisible. If you want to know what step a saga is on, you have to reconstruct it from the event log.
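The chain of reactions in the diagram can be sketched with a synchronous in-memory bus standing in for Kafka. `InMemoryBus` is a hypothetical helper, and each subscription is a one-line stand-in for a whole service; the topic names are the ones from the diagram.

```typescript
// Sketch of choreography: no coordinator, only subscriptions.
// InMemoryBus is an illustrative stand-in for a real broker.
type TopicHandler = (orderId: string) => void;

class InMemoryBus {
  private handlers: Record<string, TopicHandler[]> = {};
  readonly published: string[] = []; // the only record of saga progress

  subscribe(topic: string, handler: TopicHandler): void {
    const existing = this.handlers[topic] ?? [];
    existing.push(handler);
    this.handlers[topic] = existing;
  }

  publish(topic: string, orderId: string): void {
    this.published.push(topic);
    for (const handler of this.handlers[topic] ?? []) handler(orderId);
  }
}

const bus = new InMemoryBus();
const orderStatusById: Record<string, string> = {};

bus.subscribe("order.created", (id) => bus.publish("inventory.reserved", id));   // Inventory Service
bus.subscribe("inventory.reserved", (id) => bus.publish("payment.charged", id)); // Payment Service
bus.subscribe("payment.charged", (id) => bus.publish("notification.sent", id));  // Notification Service
bus.subscribe("notification.sent", (id) => { orderStatusById[id] = "CONFIRMED"; }); // Order Service

orderStatusById["9871"] = "PENDING";
bus.publish("order.created", "9871");
```

Note that no single piece of code describes the workflow. The only way to see where order 9871 is in the flow is to read `bus.published`, which is the invisibility problem described above.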
Orchestration
In an orchestration-based saga, a central saga orchestrator owns the state machine. It sends commands to each service, waits for their replies, and decides what to do next. The workflow is explicit and visible.
sequenceDiagram
participant C as 👤 Client
participant O as 🧠 Saga Orchestrator
participant OS as ⚙️ Order Service
participant IS as ⚙️ Inventory Service
participant PS as ⚙️ Payment Service
participant NS as ⚙️ Notification Service
C->>O: POST /sagas/order (orderId=9871)
activate O
Note over O: State: STARTED
O->>OS: cmd: createOrder(orderId=9871)
OS-->>O: reply: COMPLETED
Note over O: State: ORDER_CREATED
O->>IS: cmd: reserveInventory(orderId, qty=2)
IS-->>O: reply: COMPLETED
Note over O: State: INVENTORY_RESERVED
O->>PS: cmd: chargePayment(orderId, $149.99)
PS-->>O: reply: COMPLETED
Note over O: State: PAYMENT_CHARGED
O->>NS: cmd: sendNotification(orderId)
NS-->>O: reply: FAILED (smtp timeout)
Note over O: State: COMPENSATING
O->>PS: cmd: refundPayment(orderId)
PS-->>O: reply: COMPENSATED
O->>IS: cmd: releaseInventory(orderId)
IS-->>O: reply: COMPENSATED
O->>OS: cmd: cancelOrder(orderId)
OS-->>O: reply: COMPENSATED
deactivate O
Note over O: State: COMPENSATED
O-->>C: HTTP 200 (saga failed, order cancelled)
I always recommend orchestration by default unless you are working with a very small team, a very short saga (two or three steps), and the services are owned by the same team. The moment you have cross-team ownership or more than three steps, orchestration pays for itself on the first debugging session. Choreography looks elegant in architecture diagrams but it is a debugging nightmare at 3 a.m.
Implementation Sketch
Here is a typed sketch of an orchestration-based saga in TypeScript. This is deliberately simplified to show the state machine mechanics.
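// Orchestration-based saga: state machine skeleton.
// SagaContext and the declared service clients are assumed shapes, inferred from how
// this sketch uses them; the real definitions live elsewhere.
interface SagaContext { orderId: string; qty: number; amount: number; state: string }
declare const orderService: { create(id: string): Promise<void>; cancel(id: string): Promise<void> };
declare const inventoryService: { reserve(id: string, qty: number): Promise<void>; release(id: string): Promise<void> };
declare const paymentService: { charge(id: string, amount: number): Promise<void>; refund(id: string): Promise<void> };
declare const notificationService: { send(id: string): Promise<void> };
declare const sagaRepository: { recordStep(id: string, step: string, status: string): Promise<void> };

class OrderSagaOrchestrator {
  // Runs the forward steps in order; any thrown error switches to compensation.
  async execute(ctx: SagaContext): Promise<void> {
    try {
      await this.step("createOrder", ctx,
        () => orderService.create(ctx.orderId));
      ctx.state = "ORDER_CREATED";
      await this.step("reserveInventory", ctx,
        () => inventoryService.reserve(ctx.orderId, ctx.qty));
      ctx.state = "INVENTORY_RESERVED";
      await this.step("chargePayment", ctx,
        () => paymentService.charge(ctx.orderId, ctx.amount));
      ctx.state = "PAYMENT_CHARGED";
      await this.step("sendNotification", ctx,
        () => notificationService.send(ctx.orderId));
      ctx.state = "COMPLETED";
    } catch {
      // ctx.state still names the last step that committed, e.g. "INVENTORY_RESERVED".
      const lastCommittedState = ctx.state;
      ctx.state = "COMPENSATING";
      await this.compensate(ctx, lastCommittedState);
    }
  }

  // Undoes committed steps in reverse order, starting from the last one that committed.
  private async compensate(ctx: SagaContext, lastCommittedState: string): Promise<void> {
    const rank: Record<string, number> = {
      ORDER_CREATED: 1, INVENTORY_RESERVED: 2, PAYMENT_CHARGED: 3,
    };
    const at = rank[lastCommittedState] ?? 0; // 0 means nothing committed yet
    if (at >= 3) await paymentService.refund(ctx.orderId);     // C3
    if (at >= 2) await inventoryService.release(ctx.orderId);  // C2
    if (at >= 1) await orderService.cancel(ctx.orderId);       // C1
    ctx.state = "COMPENSATED";
  }

  // Brackets each step with durable log entries so a restart can tell what finished.
  private async step(name: string, ctx: SagaContext, fn: () => Promise<void>): Promise<void> {
    await sagaRepository.recordStep(ctx.orderId, name, "STARTED");
    await fn();
    await sagaRepository.recordStep(ctx.orderId, name, "COMPLETED");
  }
}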
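Notice the sagaRepository.recordStep calls wrapping every step. This is not optional. If the orchestrator crashes mid-saga and restarts, a recovery routine (not shown in this sketch) reads the step log and resumes after the last COMPLETED step. Without the log, every restart re-executes steps from the beginning, causing duplicate charges, double-reservations, and a very bad day. A step that was STARTED but never COMPLETED has an unknown outcome, so re-running it is only safe if the step itself is idempotent.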
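The recovery side of the step log is not part of the sketch above. One way it could look, assuming the log is a list of (step, status) entries, is the following. `StepLogEntry` and `nextStepToRun` are hypothetical names.

```typescript
// Sketch of restart recovery from the step log. The entry shape is an assumption.
type StepStatus = "STARTED" | "COMPLETED";
interface StepLogEntry { stepName: string; stepStatus: StepStatus }

const SAGA_STEPS = ["createOrder", "reserveInventory", "chargePayment", "sendNotification"];

// Returns the first step without a COMPLETED record, or null if the saga finished.
function nextStepToRun(log: StepLogEntry[]): string | null {
  const completed = log
    .filter((entry) => entry.stepStatus === "COMPLETED")
    .map((entry) => entry.stepName);
  for (const stepName of SAGA_STEPS) {
    if (completed.indexOf(stepName) === -1) return stepName;
  }
  return null;
}

// The orchestrator crashed after chargePayment was STARTED but before it was COMPLETED.
const logAfterCrash: StepLogEntry[] = [
  { stepName: "createOrder", stepStatus: "STARTED" },
  { stepName: "createOrder", stepStatus: "COMPLETED" },
  { stepName: "reserveInventory", stepStatus: "STARTED" },
  { stepName: "reserveInventory", stepStatus: "COMPLETED" },
  { stepName: "chargePayment", stepStatus: "STARTED" },
];
const resumeAt = nextStepToRun(logAfterCrash);
```

Here `resumeAt` is `"chargePayment"`: the charge may or may not have gone through before the crash, which is why the payment call needs an idempotency key before it is safe to repeat.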
When It Shines
Ok, but here is the thing most people miss in interviews: the saga pattern is not a general-purpose transaction mechanism. It is specifically designed for one scenario. Use it when:
- You have two or more microservices, each with their own database, that must participate in the same business operation
- The workflow can tolerate intermediate visible states (e.g., "order pending" before inventory is confirmed)
- Steps are sequential with clear dependencies (each step depends on the previous one succeeding)
- Each step has a well-defined compensating transaction that is reliable and idempotent
- Your team can accept eventual consistency as the end state
Do not use it:
- When all your data lives in a single database (use regular ACID transactions)
- When you need strict atomicity with no visible intermediate state (consider rethinking the design; this is rarely a hard requirement)
- When compensating transactions cannot be reasoned about because the downstream effects are irreversible (example: you cannot un-send 10 million push notifications)
- For read-heavy workflows (sagas are a write coordination pattern)
The rule of thumb I apply: if you have microservices with separate databases and a multi-step business flow, you almost certainly need a saga. The mistake I see most often is reaching for a saga inside a monolith as future-proofing; do not.
Failure Modes and Pitfalls
The happy path is not where sagas are hard. These are the places where real systems break. I have debugged each of these in production; the dirty-read problem in item 5 is the one most engineers do not see coming.
1. Non-idempotent compensating transactions
Your saga runs refundPayment. The network drops before the ACK returns. The saga retries and runs refundPayment again. Without idempotency, the customer gets refunded twice.
Every compensating transaction must check whether it already ran before executing. Use an idempotency_key or store the compensation result against the saga_id and step_name. Stripe's API does this natively. Your internal services need to do the same.
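A minimal sketch of that check, keyed by saga ID and step name, might look like this. The in-memory object stands in for a durable table, and every name here is an assumption for illustration.

```typescript
// Sketch of an idempotent refund keyed by (sagaId, stepName).
// completedCompensations stands in for a durable table; names are illustrative.
interface RefundResult { refundId: string; amountCents: number }

const completedCompensations: Record<string, RefundResult> = {};
let refundsIssued = 0;

function refundPayment(sagaId: string, amountCents: number): RefundResult {
  const idempotencyKey = `${sagaId}:refundPayment`;
  const previous = completedCompensations[idempotencyKey];
  if (previous !== undefined) return previous; // retry: return the stored result

  refundsIssued += 1; // the real side effect happens at most once per key
  const result: RefundResult = { refundId: `re_${refundsIssued}`, amountCents };
  completedCompensations[idempotencyKey] = result;
  return result;
}

const firstAttempt = refundPayment("saga-9871", 14999);
const retryAfterLostAck = refundPayment("saga-9871", 14999);
```

Both calls return the same refund and the side effect runs once. In a real service the key lookup and the side effect would need to share a transaction or a unique constraint, otherwise two concurrent retries could both pass the check.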
2. The outbox gap
Your service writes to its database and then publishes an event to Kafka. Between the write and the publish, the process crashes. The database has the record. The event was never sent. The saga stalls silently.
The fix is the Outbox Pattern: write both the business data and the outgoing event in the same local database transaction. A background process reads the outbox and delivers to the broker. This makes event delivery reliable by reducing it to a single-database ACID write.
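The mechanics can be sketched with an in-memory stand-in for a single database. `FakeDb`, `inTransaction`, `placeOrder`, and `relayOutbox` are hypothetical names; a real implementation would use your database's transaction API and a polling or change-data-capture relay.

```typescript
// Sketch of the outbox pattern. FakeDb mimics one local database with two tables.
interface OutboxRow { topic: string; payload: string; sent: boolean }
interface FakeDb { orders: string[]; outbox: OutboxRow[] }

// Applies all writes or none, mimicking a single local ACID transaction.
function inTransaction(db: FakeDb, work: (draft: FakeDb) => void): void {
  const draft: FakeDb = {
    orders: db.orders.slice(),
    outbox: db.outbox.map((row) => ({ ...row })),
  };
  work(draft); // if this throws, db is left untouched
  db.orders = draft.orders;
  db.outbox = draft.outbox;
}

function placeOrder(db: FakeDb, orderId: string): void {
  inTransaction(db, (draft) => {
    draft.orders.push(orderId);
    draft.outbox.push({ topic: "order.created", payload: orderId, sent: false });
  });
}

// Background relay: publish unsent rows, then mark them sent.
// A crash between publish and mark causes a re-send, so delivery is at-least-once.
function relayOutbox(db: FakeDb, publish: (topic: string, payload: string) => void): void {
  for (const row of db.outbox) {
    if (row.sent) continue;
    publish(row.topic, row.payload);
    row.sent = true;
  }
}

const db: FakeDb = { orders: [], outbox: [] };
placeOrder(db, "9871");
const delivered: string[] = [];
relayOutbox(db, (topic, payload) => { delivered.push(`${topic}:${payload}`); });
```

The order row and the event row commit together or not at all, so there is no window in which the order exists without its event. Because the relay can re-send after a crash, consumers still need to handle duplicates.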
The silent saga stall is the hardest bug to find
A saga that stalls mid-flight looks exactly like a slow saga from the outside. No error. No exception. Just a saga that never reaches COMPLETED. Always build a timeout monitor that alerts when any saga has been in a non-terminal state for more than N minutes. Without this, you will not find stuck sagas until a customer calls.
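A timeout monitor can be as simple as a periodic query over saga records. The following sketch assumes each record carries a state and a last-updated timestamp; the record shape and the ten-minute threshold are illustrative choices, not recommendations.

```typescript
// Sketch of a stalled-saga check. SagaRecord is an assumed shape.
interface SagaRecord { sagaId: string; sagaState: string; updatedAtMs: number }

const TERMINAL_STATES = ["COMPLETED", "COMPENSATED"];

// Returns IDs of sagas that are non-terminal and have not progressed within maxAgeMs.
function findStalledSagas(sagas: SagaRecord[], nowMs: number, maxAgeMs: number): string[] {
  return sagas
    .filter((s) => TERMINAL_STATES.indexOf(s.sagaState) === -1)
    .filter((s) => nowMs - s.updatedAtMs > maxAgeMs)
    .map((s) => s.sagaId);
}

const nowMs = 1_000_000;
const tenMinutesMs = 10 * 60 * 1000;
const stalled = findStalledSagas(
  [
    { sagaId: "a", sagaState: "COMPLETED", updatedAtMs: 0 },             // old but terminal
    { sagaId: "b", sagaState: "PAYMENT_CHARGED", updatedAtMs: 100_000 }, // idle for 15 minutes
    { sagaId: "c", sagaState: "ORDER_CREATED", updatedAtMs: 900_000 },   // still fresh
  ],
  nowMs,
  tenMinutesMs,
);
```

Only saga `b` is reported. Run a check like this on a schedule and alert on a non-empty result.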
3. Compensations that fail
Your saga runs releaseInventory as a compensation. The Inventory Service is down. Now what?
You need a compensation retry loop with exponential backoff. Compensations are not optional. They must eventually complete. Some teams implement a dead-letter compensation queue where permanently failed compensations are routed for manual review. This is not a failure mode you can ignore without accumulating ghost records.
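The retry policy itself can be expressed as a small pure function that a compensation worker consults after each failure. This is a sketch under assumed constants (500 ms base delay, 60 s cap, 8 attempts); `decideRetry` is a hypothetical name.

```typescript
// Sketch of a retry decision for failed compensations: exponential backoff with a cap,
// then hand-off to a dead-letter queue. The constants are illustrative assumptions.
type RetryDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "deadLetter" };

function decideRetry(
  failedAttempts: number, // how many times the compensation has failed so far (>= 1)
  baseDelayMs = 500,
  maxDelayMs = 60_000,
  maxAttempts = 8,
): RetryDecision {
  if (failedAttempts >= maxAttempts) return { kind: "deadLetter" };
  const delayMs = Math.min(baseDelayMs * 2 ** (failedAttempts - 1), maxDelayMs);
  return { kind: "retry", delayMs };
}

const afterFirstFailure = decideRetry(1);  // retry after the base delay
const afterFourthFailure = decideRetry(4); // delay has doubled three times
const afterEighthFailure = decideRetry(8); // give up and route to manual review
```

Production schedulers commonly add random jitter to each delay so that many failed sagas do not retry in lockstep against a recovering service.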
4. Pivot transactions and the point of no return
Not every step is reversible. If your notification service sends an email midway through the saga, and then the payment fails, you cannot unsend the email. This step is called a pivot transaction, the point of no return.
Good saga design places irreversible steps last (or avoids them until the saga is effectively committed). Notification sends, external API calls, and webhook deliveries are all pivot candidates.
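This ordering rule can be checked mechanically when a saga is defined. The sketch below assumes each step declares whether it has a compensation; `SagaStepSpec` and `misplacedIrreversibleSteps` are hypothetical names.

```typescript
// Sketch: flag irreversible steps that run before a step which may still need undoing.
interface SagaStepSpec {
  stepName: string;
  hasCompensation: boolean; // false means the step cannot be undone
}

// Returns irreversible steps that are followed by at least one compensatable step.
function misplacedIrreversibleSteps(steps: SagaStepSpec[]): string[] {
  const misplaced: string[] = [];
  steps.forEach((step, index) => {
    const laterCompensatable = steps.slice(index + 1).some((s) => s.hasCompensation);
    if (!step.hasCompensation && laterCompensatable) misplaced.push(step.stepName);
  });
  return misplaced;
}

const emailMidway: SagaStepSpec[] = [
  { stepName: "createOrder", hasCompensation: true },
  { stepName: "sendEmail", hasCompensation: false },
  { stepName: "chargePayment", hasCompensation: true },
];
const emailLast: SagaStepSpec[] = [
  { stepName: "createOrder", hasCompensation: true },
  { stepName: "chargePayment", hasCompensation: true },
  { stepName: "sendEmail", hasCompensation: false },
];

const problems = misplacedIrreversibleSteps(emailMidway); // flags sendEmail
const noProblems = misplacedIrreversibleSteps(emailLast); // nothing to flag
```

A check like this fits naturally in a unit test for the saga definition, so a misplaced email step is caught before it reaches production.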
5. Semantic locks and dirty reads
Between T3 (the third step, payment charged) and C3 (its compensation, payment refunded), the customer's bank statement shows a charge. That is a visible intermediate state. During this window, a concurrent saga for the same customer may read that charge and make incorrect decisions. This is a dirty read: sagas provide no isolation, and no database lock protects you here.
The mitigation is a semantic lock, an application-level flag rather than a database lock: design your data model to include saga status in every affected record. An inventory record reserved by an in-flight saga should be flagged as PENDING_SAGA. Reads should treat PENDING_SAGA records as provisionally allocated, not finalized.
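Here is a small sketch of how a PENDING_SAGA flag changes reads on inventory. The `Reservation` shape and both function names are assumptions for illustration.

```typescript
// Sketch of a PENDING_SAGA flag on inventory reservations.
type ReservationStatus = "PENDING_SAGA" | "CONFIRMED" | "RELEASED";
interface Reservation { sku: string; qty: number; reservationStatus: ReservationStatus }

// Stock other sagas may safely promise: pending reservations count as taken.
function availableToPromise(onHand: number, reservations: Reservation[], sku: string): number {
  const held = reservations
    .filter((r) => r.sku === sku && r.reservationStatus !== "RELEASED")
    .reduce((sum, r) => sum + r.qty, 0);
  return onHand - held;
}

// Stock that is definitely sold: only CONFIRMED reservations count.
function finalizedQty(reservations: Reservation[], sku: string): number {
  return reservations
    .filter((r) => r.sku === sku && r.reservationStatus === "CONFIRMED")
    .reduce((sum, r) => sum + r.qty, 0);
}

const reservations: Reservation[] = [
  { sku: "sku-42", qty: 2, reservationStatus: "PENDING_SAGA" }, // in-flight saga
  { sku: "sku-42", qty: 3, reservationStatus: "CONFIRMED" },
  { sku: "sku-42", qty: 4, reservationStatus: "RELEASED" },     // compensated earlier
];
const canPromise = availableToPromise(10, reservations, "sku-42");
const sold = finalizedQty(reservations, "sku-42");
```

With 10 units on hand, `canPromise` is 5 and `sold` is 3. The two pending units are withheld from new orders but are not yet reported as sold, which is the "provisionally allocated" reading.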
Trade-offs
| Aspect | Saga | 2PC | Local Transaction |
|---|---|---|---|
| Consistency model | Eventual | Near-atomic | ACID |
| Coordinator SPOF | No | Yes (coordinator) | No |
| Intermediate visible state | Yes | No | No |
| Performance | High (async) | Low (blocking) | High |
| Implementation complexity | High | Medium | Low |
| Compensations required | Yes | No (abort handles it) | No |
| Scales to microservices | Yes | Poor (coordinator bottleneck) | N/A (single DB) |
| Failure isolation | Strong | Weak (cascade from coordinator) | N/A |
The fundamental tension here is consistency vs availability: 2PC gives you stronger consistency but kills availability when the coordinator is slow or down. Sagas keep every service available through failures but require you to design the inconsistency window explicitly.
Real-World Usage
Uber processes roughly 25 million trips per day globally. Matching a rider with a driver, authorizing the payment hold, activating GPS tracking, and notifying the driver are separate service operations across separate databases. Early versions lost trip state mid-saga during service restarts; their current implementation uses orchestration with a saga log persisted in Cassandra for durable recovery across regions.
Stripe processes over 500 million API calls per day, making idempotent retry behavior a first-class API primitive. Their idempotency key system exists precisely because payment sagas require compensations to be safe to retry. When you call POST /v1/charges twice with the same Idempotency-Key header, Stripe returns the saved result of the first request instead of charging twice.
Amazon processes millions of orders per day, each flowing through at least six internal services (selection availability, pricing, stock reservation, payment authorization, fulfillment, notifications). At Amazon's scale, the time between order creation and full commitment spans seconds to minutes. Customers see the intermediate states ("Payment pending", "Preparing shipment") in their order console; these are deliberate saga state exposures, not bugs.
How This Shows Up in Interviews
So when does this come up in a design interview? Every time you sketch a microservices architecture and someone says "but what happens if the payment fails after you reserved inventory?"
That question is the saga trigger. The moment you are designing a workflow that spans two or more services with separate databases, you need to address distributed consistency, and "we'll use a distributed transaction" is the wrong answer at a staff-level interview.
Depth expected at senior and staff level:
- Distinguish choreography vs orchestration and state when/why you would choose each
- Know that compensating transactions must be idempotent, and explain what that means concretely
- Name the Outbox Pattern as the mechanism for guaranteed event delivery
- Identify the pivot transaction problem and how saga design avoids it
- Understand that sagas give eventual consistency, not atomicity, and be able to explain what the intermediate states look like
- Know the failure modes: stalled sagas, failed compensations, dirty reads between concurrent sagas (and semantic locks as the mitigation)
Interview Q&A:
| Interviewer asks | Strong answer |
|---|---|
| "How do you handle failures in your distributed order flow?" | "I would use a saga. Each service runs a local transaction and publishes an event. On failure, we trigger compensating transactions in reverse. I would use orchestration to keep the state machine visible and debuggable." |
| "What is the difference between a saga and a 2PC?" | "2PC requires a coordinator that holds all participants in a locked state until commit or abort. The coordinator is a SPOF and causes blocking failures. Sagas use compensations instead of locks, keeping each service available and independent." |
| "What happens if a compensation fails?" | "The compensation needs a retry loop with exponential backoff. Permanently failed compensations go to a dead-letter queue for manual review. You cannot ignore a failed compensation; you will accumulate inconsistent state." |
| "How do you prevent double-charging if the payment service is called twice?" | "Every step and compensation must be idempotent. Store a (saga_id, step_name) key and check it before executing. If the key already exists, return the cached result. Stripe does this with idempotency keys at the API level." |
| "Can you walk me through the saga state machine?" | Enumerate the states: STARTED, ORDER_CREATED, INVENTORY_RESERVED, PAYMENT_CHARGED, COMPLETED on the happy path; COMPENSATING, COMPENSATED on failure. Draw the transitions. Mention that state is persisted to enable restart recovery. |
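If you are asked to draw the transitions, it helps to have the table in your head. The following sketch encodes the states named above as a transition table; the `ALLOWED` map and `transition` function are illustrative, and the allowed edges are my reading of the order flow in this article.

```typescript
// Sketch of the saga state machine as a transition table.
type SagaState =
  | "STARTED" | "ORDER_CREATED" | "INVENTORY_RESERVED" | "PAYMENT_CHARGED"
  | "COMPLETED" | "COMPENSATING" | "COMPENSATED";

const ALLOWED: Record<SagaState, SagaState[]> = {
  STARTED: ["ORDER_CREATED", "COMPENSATING"],
  ORDER_CREATED: ["INVENTORY_RESERVED", "COMPENSATING"],
  INVENTORY_RESERVED: ["PAYMENT_CHARGED", "COMPENSATING"],
  PAYMENT_CHARGED: ["COMPLETED", "COMPENSATING"],
  COMPENSATING: ["COMPENSATED"],
  COMPLETED: [],   // terminal
  COMPENSATED: [], // terminal
};

// Validates a transition and returns the new state, or throws on an illegal move.
function transition(from: SagaState, to: SagaState): SagaState {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new Error(`Illegal saga transition: ${from} -> ${to}`);
  }
  return to; // a real orchestrator would persist the new state before acting on it
}

const afterFailure = transition("PAYMENT_CHARGED", "COMPENSATING");
const afterUndo = transition(afterFailure, "COMPENSATED");
```

Every non-terminal forward state has exactly two exits: the next step or COMPENSATING. That is the whole diagram.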
The strongest move on any saga question is to name the failure modes before the interviewer asks. Surface idempotent compensations, stalled saga detection, and the outbox gap as known challenges, then explain concisely how you would address each.
Interview tip: name the failure mode before they ask
The strongest interview move is to proactively say: "The hardest part of saga design is not the happy path. It is idempotent compensations, stalled saga detection, and the outbox gap. Here is how I would address each." This signals staff-level thinking; you have operated this, not just read about it.
Quick Recap
- A saga is a sequence of local transactions across multiple services, where each step publishes an event that triggers the next, and compensating transactions undo committed steps when a failure occurs.
- Compensating transactions are explicit business operations (not database rollbacks) and must be idempotent because the network will drop ACKs and cause retries.
- Orchestration (central coordinator with explicit state machine) is almost always preferable to choreography for sagas with more than three steps or cross-team service ownership.
- The Outbox Pattern is a required companion to the saga pattern: write the event to a local DB table in the same transaction as the business data, then deliver via a background relay to guarantee no event is silently lost.
- The hardest failure modes are stalled sagas (detect with timeout monitors), non-idempotent compensations (fix with idempotency keys), and pivot transactions (place irreversible steps last).
- Sagas give eventual consistency, not atomicity. Customers see intermediate states. Design your UI and business rules around this reality rather than pretending it does not exist.
- If your saga has conditional branching, parallel steps, or long-running timeouts, you have outgrown the basic saga pattern. Consider a workflow engine like Temporal or AWS Step Functions.
Related Patterns
- Outbox Pattern: The Outbox Pattern is not optional with sagas at production scale. It closes the gap between writing to your database and publishing an event, ensuring events are never silently lost when a service restarts mid-saga.
- Event Sourcing: Event sourcing stores every state change as an immutable event, which gives you a natural audit log of every saga step. At high saga volumes, event sourcing can replace the saga step log entirely.
- Circuit Breaker: Wrap the external service calls inside each saga step in a circuit breaker. Without it, a single failing downstream service leaves every in-flight saga waiting on calls that can only time out.
- Message Queues: Sagas are only as reliable as their message delivery. Understanding Kafka's durability guarantees, consumer group semantics, and dead-letter queue behavior is prerequisite knowledge for operating sagas in production.
- Microservices: The saga pattern exists because of microservices. If you have not internalized why microservices have separate databases, re-read the microservices article first. The saga pattern will not make sense without that foundation.