Choreography vs orchestration
When to use event-driven choreography versus a central orchestrator for multi-service workflows, covering tradeoffs in observability, coupling, error handling, and operational complexity.
TL;DR
- Choreography: each service reacts to events from other services. No central coordinator. The workflow emerges from individual service reactions.
- Orchestration: a central service explicitly calls each step, tracks state, and drives compensation on failure.
- Choreography offers lower coupling and no single point of failure, but makes debugging distributed workflows significantly harder.
- Orchestration centralizes workflow logic (easy to reason about, debug, and audit), but creates a single point of failure and tighter service coupling.
- Most production systems use a hybrid: choreography between team boundaries, orchestration within a single team's bounded context.
The Problem
You're building an order fulfillment system. When a customer places an order, five things need to happen in sequence: validate the order, reserve inventory, charge payment, create a shipment, and send a confirmation email. Five different microservices own these five steps. Each team deploys on its own schedule.
The question isn't whether to coordinate. You must coordinate. The question is where the coordination logic lives and which team owns it.
If you put it in the Order Service, that service becomes a god object that knows about every other service, calls them in order, and handles every failure combination. Change anything downstream and the Order Service needs an update. If you put it everywhere (each service triggers the next), no single service knows the full workflow, and debugging a failure at step 4 means tracing events across four services' logs at 3 a.m.
Both approaches feel wrong because both have real costs. That tension is the core of this pattern.
Neither approach is universally better. The right choice depends on how many teams own the services, how complex the failure handling is, and whether you need to audit "what happened to order X" from a single place.
The mistake I see most often in interviews: candidates pick one approach and defend it absolutely. The strong answer acknowledges that most systems use both, and explains the criteria for choosing between them.
One-Line Definition
Choreography distributes coordination across services via event reactions, while orchestration centralizes coordination in a dedicated workflow service that explicitly drives each step.
In simpler terms: choreography is "react to what happened" and orchestration is "tell each service what to do next."
Analogy
Think of two ways to coordinate a dinner party.
Choreography is like a potluck. You tell each friend "bring something that pairs well with what others bring." Alice hears Bob is bringing steak, so she brings red wine. Carol hears about the wine and brings cheese. Everyone reacts to what others are doing. The dinner works if everyone pays attention, but nobody knows the full menu until the food arrives.
Orchestration is like hiring a caterer. The caterer plans the menu, assigns each dish, coordinates timing, and handles substitutions when an ingredient is unavailable. You can see the full plan in one document. If the caterer gets sick, the dinner falls apart.
Both produce dinner. The potluck scales to large groups without a bottleneck (no caterer needed). The caterer produces a more predictable result, handles substitutions gracefully, and can give you a definitive answer to "what's for dessert?" The potluck can only answer that question after all the food arrives.
This maps directly to systems. In an interview, use this analogy to anchor the discussion before drawing the technical diagrams. It immediately communicates the core tradeoff: distributed autonomy vs centralized coordination.
Solution Walkthrough
Let's walk through the same order fulfillment workflow implemented with each approach. Seeing the same scenario twice makes the tradeoffs concrete and easy to compare.
Choreography: Order Fulfillment via Events
Each service subscribes to events it cares about, does its work, and emits an event for the next step. No service knows about the others. They only know the event schema they subscribe to and the event schema they emit.
The workflow is implicit. It exists only as the sum of all subscriptions across all services. You can't look at any single service and see the full order flow. To understand the workflow, you need to trace event types across every service's subscription code.
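To make the "implicit workflow" concrete, here is a minimal sketch using a toy, synchronous, in-memory event bus. The `EventBus` class is a stand-in for a real broker, and the event names `PaymentCharged` and `ShipmentCreated` are assumptions chosen for illustration. No subscriber mentions the full sequence; the order of events only becomes visible in the log after the run.

```python
from collections import defaultdict


class EventBus:
    """Toy synchronous in-memory bus; a stand-in for a real message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every emitted event type, in emission order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()

# Each "service" names only the event it consumes and the event it emits.
# PaymentCharged and ShipmentCreated are hypothetical event names.
bus.subscribe("OrderPlaced", lambda e: bus.emit("InventoryReserved", e))
bus.subscribe("InventoryReserved", lambda e: bus.emit("PaymentCharged", e))
bus.subscribe("PaymentCharged", lambda e: bus.emit("ShipmentCreated", e))

bus.emit("OrderPlaced", {"order_id": 42})
print(bus.log)
# ['OrderPlaced', 'InventoryReserved', 'PaymentCharged', 'ShipmentCreated']
```

Reordering or removing any one subscription changes the workflow, yet no single line of code declares what the workflow is.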
Compensation in choreography works through reverse events: a failure event such as PaymentFailed signals the earlier services to undo their work.
Each service is responsible for listening to its own compensation event and undoing its action. If Payment fails, Inventory must release the reservation. The Order Service must mark the order as cancelled. Nobody orchestrates this; each service independently reacts.
The challenge: what if the Inventory Service misses the PaymentFailed event? The reservation stays locked forever. You need dead letter queues, retry policies, and monitoring on every compensation path. And if you have 5 steps with 5 possible failure points, you need 10+ compensation handlers, each in a different service, each independently implemented and tested.
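One way to keep a failed compensation from disappearing silently is to retry the handler and park the event in a dead letter queue once the retries run out. The sketch below is only an illustration: `deliver_with_retry`, the in-memory `dead_letters` list, and the two handlers are hypothetical, and a real broker would add backoff between attempts and durable storage for parked events.

```python
def deliver_with_retry(handler, event, dead_letters, max_attempts=3):
    """Call handler(event) up to max_attempts times; park the event on failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    # Retries exhausted: keep the event so an operator or job can replay it.
    dead_letters.append({"event": event, "error": str(last_error)})
    return False


released = []
attempts = {"count": 0}


def release_reservation(event):
    # Hypothetical compensation handler that fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("inventory database unavailable")
    released.append(event["order_id"])


def always_fails(event):
    raise ConnectionError("inventory database unavailable")


dead_letters = []
ok = deliver_with_retry(
    release_reservation, {"type": "PaymentFailed", "order_id": 42}, dead_letters
)
parked = deliver_with_retry(
    always_fails, {"type": "PaymentFailed", "order_id": 43}, dead_letters
)
print(ok, released)               # True [42]
print(parked, len(dead_letters))  # False 1
```

Order 43's reservation is still locked, but the parked event makes that visible and replayable instead of lost.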
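Choreography compensation grows quadratically
With N steps, you need up to N-1 compensation handlers spread across N different services, and each handler must react to every downstream failure. A failure at step k has to undo the k-1 steps before it, so the number of failure-to-compensation paths grows as N(N-1)/2: 10 paths for 5 steps. Each path must be independently tested, monitored, and debugged. At 3 steps (3 paths), this is manageable. At 8+ steps (28 or more paths), it becomes a maintenance nightmare.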
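The arithmetic behind the "10+" figure can be checked with a short sketch. It assumes a strictly linear workflow in which any step can fail and a failure at step k must undo every one of the k-1 steps before it; `compensation_paths` is a hypothetical helper name.

```python
def compensation_paths(num_steps):
    """Count (failed step, earlier step to undo) pairs in a linear workflow."""
    return sum(failed_step - 1 for failed_step in range(1, num_steps + 1))


for n in (3, 5, 8):
    print(n, compensation_paths(n))
# 3 3
# 5 10
# 8 28
```

The count equals N(N-1)/2, so adding a step costs more each time: the workflow grows linearly while the compensation paths to test grow quadratically.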
Orchestration: Same Flow with a Central Coordinator
The orchestrator holds the state machine above. It knows every step, every transition, and every compensation path. A single database query shows you the current state of any order: "Order 42 is in ChargePayment state, waiting for payment confirmation since 14:32 UTC."
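As a rough sketch of that "single database query", the snippet below stores one row per order in an in-memory SQLite table. The table name `workflow_state`, its columns, and the helpers `record_transition` and `current_state` are assumptions for illustration, not a schema from any particular workflow engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_state ("
    " order_id INTEGER PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " updated_at TEXT NOT NULL)"
)


def record_transition(order_id, state, updated_at):
    # One row per order, overwritten on every transition.
    conn.execute(
        "INSERT OR REPLACE INTO workflow_state (order_id, state, updated_at)"
        " VALUES (?, ?, ?)",
        (order_id, state, updated_at),
    )


def current_state(order_id):
    # Returns a (state, updated_at) tuple, or None for an unknown order.
    return conn.execute(
        "SELECT state, updated_at FROM workflow_state WHERE order_id = ?",
        (order_id,),
    ).fetchone()


record_transition(42, "ReserveInventory", "14:31 UTC")
record_transition(42, "ChargePayment", "14:32 UTC")
print(current_state(42))  # ('ChargePayment', '14:32 UTC')
```

Keeping an append-only history table alongside this one would also answer "what happened to order 42" and not just "where is it now".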
The orchestrator calls each service directly and handles all compensation logic in one place:
# Assumes inventory_service, payment_service, shipping_service, fail, and
# success are defined elsewhere.
class OrderOrchestrator:
    def process_order(self, order_id):
        # Step 1: Reserve inventory
        reservation = inventory_service.reserve(order_id)
        if not reservation.ok:
            return fail("inventory unavailable")

        # Step 2: Charge payment
        payment = payment_service.charge(order_id)
        if not payment.ok:
            inventory_service.release(reservation.id)
            return fail("payment failed")

        # Step 3: Create shipment
        shipment = shipping_service.create(order_id)
        if not shipment.ok:
            # Undo completed steps in reverse order
            payment_service.refund(payment.id)
            inventory_service.release(reservation.id)
            return fail("shipping failed")

        return success(shipment)
Compensation is explicit and centralized. When step 3 fails, the orchestrator calls refund() then release() in order. No events, no subscriptions, no wondering if some service missed a compensation event.
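The nested if-blocks grow with every step. One common way to keep the same behavior is to record each completed step's compensation and replay that list in reverse on failure. The following is a minimal sketch of that idea, not the article's implementation: `run_saga` is a hypothetical helper, and it signals failure with exceptions where the code above checks `.ok` flags.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, run the compensations of the already completed
    steps in reverse order and report which step failed.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for _, undo in reversed(completed):
                undo()
            return {"ok": False, "failed_step": name, "error": str(exc)}
        completed.append((name, compensation))
    return {"ok": True, "failed_step": None, "error": None}


calls = []


def fail_shipping():
    raise RuntimeError("no carrier available")


result = run_saga([
    ("reserve", lambda: calls.append("reserve"), lambda: calls.append("release")),
    ("charge", lambda: calls.append("charge"), lambda: calls.append("refund")),
    ("ship", fail_shipping, lambda: calls.append("cancel_shipment")),
])
print(calls)                  # ['reserve', 'charge', 'refund', 'release']
print(result["failed_step"])  # ship
```

The failed step's own compensation never runs, since that step did not complete. A compensation that itself fails is not handled here and would need retries or manual intervention.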
The key advantage you should highlight in interviews: the orchestrator is a single source of truth for workflow state. You can query it to answer "what's the status of order 42?" without correlating logs from five services.
For your interview: the orchestrator state machine diagram above is extremely powerful. Draw it on the whiteboard and the interviewer immediately sees that you understand compensation, partial failures, and workflow state tracking.
Implementation Sketch
Choreography: Event Handler Pattern
Each service follows the same pattern: subscribe to an event, do work, emit the next event. The code is simple per service, but the overall workflow is split across many files in different repositories.
// Inventory Service: choreography participant
eventBus.subscribe("OrderPlaced", async (event) => {
  const reserved = await reserveStock(event.orderId, event.items);
  if (reserved) {
    await eventBus.emit("InventoryReserved", {
      orderId: event.orderId,
      reservationId: reserved.id,
    });
  } else {
    await eventBus.emit("InventoryUnavailable", {
      orderId: event.orderId,
      reason: "insufficient_stock",
    });
  }
});

// Compensation handler: separate subscriber
eventBus.subscribe("PaymentFailed", async (event) => {
  await releaseReservation(event.orderId);
  await eventBus.emit("InventoryReleased", { orderId: event.orderId });
});
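Notice that the Inventory Service has no idea how payment or shipping work. It subscribes to only two events, OrderPlaced (trigger) and PaymentFailed (compensate), and emits three of its own: InventoryReserved, InventoryUnavailable, and InventoryReleased. This is loose coupling in action.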
Orchestration: Workflow Engine Pattern
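The orchestrator uses a state machine with explicit steps and compensation. This sketch tracks state in memory only; a production orchestrator would persist every transition so a crashed workflow can resume: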
// Order Orchestrator: centralized workflow
// inventoryService, paymentService, and shippingService are assumed to be
// client objects defined elsewhere.
type WorkflowState =
  | "PENDING"
  | "RESERVING_INVENTORY"
  | "CHARGING_PAYMENT"
  | "CREATING_SHIPMENT"
  | "COMPLETE"
  | "FAILED";

class OrderWorkflow {
  // Held in memory here; a real orchestrator would persist each transition.
  private state: WorkflowState = "PENDING";

  async execute(orderId: string) {
    try {
      this.state = "RESERVING_INVENTORY";
      const reservation = await inventoryService.reserve(orderId);

      this.state = "CHARGING_PAYMENT";
      const payment = await paymentService.charge(orderId);

      this.state = "CREATING_SHIPMENT";
      const shipment = await shippingService.create(orderId);

      this.state = "COMPLETE";
      return { success: true, shipment };
    } catch (error) {
      // this.state names the step that threw, so every earlier step completed.
      await this.compensate(orderId);
      this.state = "FAILED";
      return { success: false, error };
    }
  }

  private async compensate(orderId: string) {
    // Undo completed steps in reverse order
    if (this.state === "CREATING_SHIPMENT") {
      await paymentService.refund(orderId);
      await inventoryService.release(orderId);
    } else if (this.state === "CHARGING_PAYMENT") {
      await inventoryService.release(orderId);
    }
  }
}