Choreography vs orchestration
When to use event-driven choreography versus a central orchestrator for multi-service workflows, covering tradeoffs in observability, coupling, error handling, and operational complexity.
TL;DR
- Choreography: each service reacts to events from other services. No central coordinator. The workflow emerges from individual service reactions.
- Orchestration: a central service explicitly calls each step, tracks state, and drives compensation on failure.
- Choreography offers lower coupling and no single point of failure, but makes debugging distributed workflows significantly harder.
- Orchestration centralizes workflow logic (easy to reason about, debug, and audit), but creates a single point of failure and tighter service coupling.
- Most production systems use a hybrid: choreography between team boundaries, orchestration within a single team's bounded context.
The Problem
You're building an order fulfillment system. When a customer places an order, five things need to happen in sequence: validate the order, reserve inventory, charge payment, create a shipment, and send a confirmation email. Five different microservices own these five steps. Each team deploys on its own schedule.
The question isn't whether to coordinate. You must coordinate. The question is where the coordination logic lives and which team owns it.
If you put it in the Order Service, that service becomes a god object that knows about every other service, calls them in order, and handles every failure combination. Change anything downstream and the Order Service needs an update. If you put it everywhere (each service triggers the next), no single service knows the full workflow, and debugging a failure at step 4 means tracing events across four services' logs at 3 a.m.
Both approaches feel wrong because both have real costs. That tension is the core of this pattern.
Neither approach is universally better. The right choice depends on how many teams own the services, how complex the failure handling is, and whether you need to audit "what happened to order X" from a single place.
The mistake I see most often in interviews: candidates pick one approach and defend it absolutely. The strong answer acknowledges that most systems use both, and explains the criteria for choosing between them.
One-Line Definition
Choreography distributes coordination across services via event reactions, while orchestration centralizes coordination in a dedicated workflow service that explicitly drives each step.
In simpler terms: choreography is "react to what happened" and orchestration is "tell each service what to do next."
Analogy
Think of two ways to coordinate a dinner party.
Choreography is like a potluck. You tell each friend "bring something that pairs well with what others bring." Alice hears Bob is bringing steak, so she brings red wine. Carol hears about the wine and brings cheese. Everyone reacts to what others are doing. The dinner works if everyone pays attention, but nobody knows the full menu until the food arrives.
Orchestration is like hiring a caterer. The caterer plans the menu, assigns each dish, coordinates timing, and handles substitutions when an ingredient is unavailable. You can see the full plan in one document. If the caterer gets sick, the dinner falls apart.
Both produce dinner. The potluck scales to large groups without a bottleneck (no caterer needed). The caterer produces a more predictable result, handles substitutions gracefully, and can give you a definitive answer to "what's for dessert?" The potluck can only answer that question after all the food arrives.
This maps directly to systems. In an interview, use this analogy to anchor the discussion before drawing the technical diagrams. It immediately communicates the core tradeoff: distributed autonomy vs centralized coordination.
Solution Walkthrough
Let's walk through the same order fulfillment workflow implemented with each approach. Seeing the same scenario twice makes the tradeoffs concrete and easy to compare.
Choreography: Order Fulfillment via Events
Each service subscribes to events it cares about, does its work, and emits an event for the next step. No service knows about the others. They only know the event schema they subscribe to and the event schema they emit.
The workflow is implicit. It exists only as the sum of all subscriptions across all services. You can't look at any single service and see the full order flow. To understand the workflow, you need to trace event types across every service's subscription code.
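That tracing exercise can be made concrete with a toy sketch. The subscription table and helper below are purely illustrative — no real event bus exposes a global view like this, which is precisely the problem the text describes:

```python
# Hypothetical subscription table: service -> (event it consumes, event it emits).
# In a real choreographed system this knowledge is scattered across every
# service's codebase; nobody owns a table like this.
SUBSCRIPTIONS = {
    "inventory": ("OrderPlaced", "InventoryReserved"),
    "payment": ("InventoryReserved", "PaymentCollected"),
    "shipping": ("PaymentCollected", "ShipmentCreated"),
    "notification": ("ShipmentCreated", "ConfirmationSent"),
}

def trace_workflow(start_event: str) -> list[str]:
    """Recover the implicit workflow by following the event chain."""
    consumers = {consumed: emitted for consumed, emitted in SUBSCRIPTIONS.values()}
    chain, event = [start_event], start_event
    while event in consumers:
        event = consumers[event]
        chain.append(event)
    return chain

print(trace_workflow("OrderPlaced"))
# ['OrderPlaced', 'InventoryReserved', 'PaymentCollected',
#  'ShipmentCreated', 'ConfirmationSent']
```

The workflow only becomes visible once you assemble this table by hand from four repositories — exactly the archaeology you end up doing during an incident.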
Compensation in choreography works through reverse events:
Each service is responsible for listening to its own compensation event and undoing its action. If Payment fails, Inventory must release the reservation. The Order Service must mark the order as cancelled. Nobody orchestrates this; each service independently reacts.
The challenge: what if the Inventory Service misses the PaymentFailed event? The reservation stays locked forever. You need dead letter queues, retry policies, and monitoring on every compensation path. And if you have 5 steps with 5 possible failure points, you need 10+ compensation handlers, each in a different service, each independently implemented and tested.
Choreography compensation scales poorly
With N steps, you need O(N) compensation handlers spread across N different services — and because a failure at step k must unwind every prior step, there are O(N²) failure-to-compensation paths to test. Each handler must be independently tested, monitored, and debugged. At 3 steps, this is manageable. At 8+ steps, it becomes a maintenance nightmare.
Orchestration: Same Flow with a Central Coordinator
The orchestrator holds the state machine above. It knows every step, every transition, and every compensation path. A single database query shows you the current state of any order: "Order 42 is in ChargePayment state, waiting for payment confirmation since 14:32 UTC."
The orchestrator calls each service directly and handles all compensation logic in one place:
```python
class OrderOrchestrator:
    def process_order(self, order_id):
        # Step 1: Reserve inventory
        reservation = inventory_service.reserve(order_id)
        if not reservation.ok:
            return fail("inventory unavailable")

        # Step 2: Charge payment
        payment = payment_service.charge(order_id)
        if not payment.ok:
            inventory_service.release(reservation.id)
            return fail("payment failed")

        # Step 3: Create shipment
        shipment = shipping_service.create(order_id)
        if not shipment.ok:
            payment_service.refund(payment.id)
            inventory_service.release(reservation.id)
            return fail("shipping failed")

        return success(shipment)
```
Compensation is explicit and centralized. When step 3 fails, the orchestrator calls refund() then release() in order. No events, no subscriptions, no wondering if some service missed a compensation event.
The key advantage you should highlight in interviews: the orchestrator is a single source of truth for workflow state. You can query it to answer "what's the status of order 42?" without correlating logs from five services.
For your interview: the orchestrator state machine diagram above is extremely powerful. Draw it on the whiteboard and the interviewer immediately sees that you understand compensation, partial failures, and workflow state tracking.
Implementation Sketch
Choreography: Event Handler Pattern
Each service follows the same pattern: subscribe to an event, do work, emit the next event. The code is simple per service, but the overall workflow is split across many files in different repositories.
```javascript
// Inventory Service: choreography participant
eventBus.subscribe("OrderPlaced", async (event) => {
  const reserved = await reserveStock(event.orderId, event.items);
  if (reserved) {
    await eventBus.emit("InventoryReserved", {
      orderId: event.orderId,
      reservationId: reserved.id,
    });
  } else {
    await eventBus.emit("InventoryUnavailable", {
      orderId: event.orderId,
      reason: "insufficient_stock",
    });
  }
});

// Compensation handler: separate subscriber
eventBus.subscribe("PaymentFailed", async (event) => {
  await releaseReservation(event.orderId);
  await eventBus.emit("InventoryReleased", { orderId: event.orderId });
});
```
Notice that the Inventory Service has no idea about payment or shipping. It only knows two events: OrderPlaced (trigger) and PaymentFailed (compensate). This is loose coupling in action.
Orchestration: Workflow Engine Pattern
The orchestrator uses a state machine with explicit steps, compensation, and state persistence:
```typescript
// Order Orchestrator: centralized workflow
class OrderWorkflow {
  private state: WorkflowState = "PENDING";

  async execute(orderId: string) {
    try {
      this.state = "RESERVING_INVENTORY";
      const reservation = await inventoryService.reserve(orderId);

      this.state = "CHARGING_PAYMENT";
      const payment = await paymentService.charge(orderId);

      this.state = "CREATING_SHIPMENT";
      const shipment = await shippingService.create(orderId);

      this.state = "COMPLETE";
      return { success: true, shipment };
    } catch (error) {
      await this.compensate(orderId);
      this.state = "FAILED";
      return { success: false, error };
    }
  }

  private async compensate(orderId: string) {
    // Reverse completed steps in reverse order
    if (this.state === "CREATING_SHIPMENT") {
      await paymentService.refund(orderId);
      await inventoryService.release(orderId);
    } else if (this.state === "CHARGING_PAYMENT") {
      await inventoryService.release(orderId);
    }
  }
}
```
The orchestrator persists its state to a database after each step. If it crashes, it resumes from the last persisted state on restart. This is exactly how workflow engines like Temporal and Netflix Conductor work internally. The durable state is what makes orchestration reliable at scale.
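The checkpoint-and-resume behavior can be sketched in a few lines. This is a toy model — an in-memory dict stands in for the database, and the "service call" is a stub — not how Temporal or Conductor are actually implemented:

```python
# Toy durable orchestrator: checkpoint after every completed step, resume from
# the checkpoint on restart. checkpoint_db stands in for a real database table.
checkpoint_db: dict[str, int] = {}  # order_id -> index of next step to run

STEPS = ["reserve_inventory", "charge_payment", "create_shipment"]

def run_workflow(order_id, execute_step):
    """Run the remaining steps, persisting a checkpoint after each one."""
    start = checkpoint_db.get(order_id, 0)
    for i in range(start, len(STEPS)):
        execute_step(STEPS[i])           # call out to the real service here
        checkpoint_db[order_id] = i + 1  # durable: step i is done
    return "COMPLETE"

done = []
def flaky(step):
    """Stub service call that crashes once during charge_payment."""
    done.append(step)
    if step == "charge_payment" and flaky.crash:
        raise RuntimeError("process crashed")

flaky.crash = True
try:
    run_workflow("order-42", flaky)      # first attempt dies mid-payment
except RuntimeError:
    pass

flaky.crash = False
run_workflow("order-42", flaky)          # restart: resumes at charge_payment
print(done)
# ['reserve_inventory', 'charge_payment', 'charge_payment', 'create_shipment']
```

Note that reserve_inventory runs exactly once despite the crash, but the failed charge_payment step runs twice across the two attempts — which is why workflow engines require steps (activities) to be idempotent or deduplicated.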
The Hybrid Approach
Most mature systems use both patterns. Choreography between bounded contexts (across team boundaries) and orchestration within a bounded context (one team's services). This isn't a compromise; it's the principled approach based on Conway's law: coordinate tightly within teams, communicate loosely between teams.
The order team orchestrates internally because they need tight control over the fulfillment steps and compensation logic. Once the order is complete (or cancelled), they emit a single event. Other teams subscribe to that event without any coordination with the order team. The analytics team doesn't need to know about inventory reservations; they just need "order completed."
This hybrid gives you explicitness where it matters (critical business logic) and loose coupling where it doesn't (cross-team notifications and analytics).
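The boundary between the two styles can be sketched directly. Everything here is illustrative (a dict-based bus, stubbed services): the point is that the orchestrated steps are explicit calls, while the boundary is a single event with consumers unknown to the order team:

```python
# Hybrid sketch: explicit orchestration inside the order domain, a single
# choreographed event at the domain boundary. All names are illustrative.
subscribers = {}

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event_type, payload):
    for handler in subscribers.get(event_type, []):
        handler(payload)

def reserve_inventory(order_id): ...   # stubs for the in-domain services
def charge_payment(order_id): ...
def create_shipment(order_id): ...

def fulfill_order(order_id):
    # Inside the bounded context: orchestrated, ordered, explicit.
    reserve_inventory(order_id)
    charge_payment(order_id)
    create_shipment(order_id)
    # At the boundary: one event; consumers are unknown to this code.
    publish("OrderCompleted", {"order_id": order_id})

# Another team subscribes without coordinating with the order team:
seen = []
subscribe("OrderCompleted", lambda evt: seen.append(evt["order_id"]))
fulfill_order("order-42")
print(seen)  # ['order-42']
```

The analytics subscriber can be added, changed, or removed without touching fulfill_order — the loose coupling lives only at the boundary, not inside the workflow.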
Testing Differences
Testing strategy differs dramatically between the two approaches, and this is something interviewers often probe on.
Testing choreography requires end-to-end integration tests that exercise the full event chain. You spin up all services, trigger the initiating event, and verify that every downstream service reacted correctly. These tests are slow, flaky (any service can cause failure), and hard to maintain. Contract testing helps: each service publishes its event schema, and consumers test against those schemas independently.
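A consumer-side contract check can be surprisingly lightweight. The schema format below is invented for illustration — production systems use Avro or Protobuf with a registry — but the idea is the same: the consumer asserts that the fields it relies on exist with the right types:

```python
# Minimal consumer-driven contract check, stdlib only. The schema shape here
# is illustrative; real systems use Avro/Protobuf plus a schema registry.
INVENTORY_RESERVED_SCHEMA = {"orderId": str, "reservationId": str}

def conforms(event: dict, schema: dict) -> bool:
    """Every field the consumer relies on must be present with the right type."""
    return all(isinstance(event.get(field), typ) for field, typ in schema.items())

# Producer's sample event satisfies the consumer's expectations:
sample = {"orderId": "order-42", "reservationId": "res-1", "extra": 99}
assert conforms(sample, INVENTORY_RESERVED_SCHEMA)       # extra fields are fine
assert not conforms({"orderId": "order-42"}, INVENTORY_RESERVED_SCHEMA)  # missing field
```

Each consuming team runs checks like this against the producer's published sample events in CI, so schema breaks surface before deploy instead of in production.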
Testing orchestration is more straightforward. You unit-test the orchestrator with mocked service clients. The orchestrator's logic (step ordering, compensation, timeouts) is tested in isolation. Individual services are tested independently against their API contracts. The integration surface is smaller: just the orchestrator-to-service interactions.
```javascript
// Orchestrator unit test example
test("compensates on payment failure", async () => {
  inventoryService.reserve.mockResolvedValue({ ok: true, id: "res-1" });
  paymentService.charge.mockResolvedValue({ ok: false });

  const result = await workflow.execute("order-42");

  expect(result.success).toBe(false);
  expect(inventoryService.release).toHaveBeenCalledWith("res-1");
  expect(paymentService.refund).not.toHaveBeenCalled(); // never charged
});
```
This test verifies compensation logic without spinning up any real services.
When It Shines
Here's a decision framework for choosing the coordination style:
Choose choreography when:
- Services are owned by different teams who deploy independently
- Workflows are additive (new steps just subscribe to existing events)
- Eventual consistency is acceptable for the business domain
- You have strong distributed tracing infrastructure (Jaeger, Tempo, Datadog APM)
- The workflow is mostly linear with minimal branching or conditional logic
- The number of steps is small (3-5 services)
Choose orchestration when:
- The workflow has complex conditional logic, branching, or loops
- You need tight SLA guarantees and fast failure detection
- Business auditors need to query "what happened with order X?"
- Compensation logic is complex (multiple steps to undo in specific order)
- One team owns all the services in the workflow
- You need workflow versioning (running old and new versions simultaneously)
Choose hybrid when:
- You have multiple teams with distinct bounded contexts (most real systems at scale)
- Some workflows are critical (orchestrate those) while others are best-effort (choreograph those)
- You want team autonomy at the inter-domain boundaries but strict control within each domain
Failure Modes and Pitfalls
1. The Ghost Workflow (Choreography)
A service emits an event but the downstream consumer crashes before processing it. The event sits in the dead letter queue (if you have one) or gets lost entirely (if you don't). Meanwhile, the workflow appears to have stopped. Nobody knows where it stalled because no single service tracks the full flow. You discover it when a customer complains about a stuck order three days later.
This is choreography's most insidious failure mode: silent partial completion. The order was charged but never shipped. The inventory was reserved but never released. No service raises an alarm because each service only sees its own slice.
Fix: implement workflow timeout monitors that detect stalled sagas by checking for orders that haven't progressed past a state within an expected timeframe. Also implement dead letter queue monitors with alerts for any event that couldn't be processed.
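A stalled-saga detector is essentially a periodic query with per-state deadlines. A minimal sketch, with illustrative table layout and SLA values:

```python
# Toy stalled-workflow detector: flag orders that have sat in one state
# longer than that state's SLA. States and SLAs are illustrative.
from datetime import datetime, timedelta, timezone

STATE_SLA = {
    "RESERVING_INVENTORY": timedelta(minutes=5),
    "CHARGING_PAYMENT": timedelta(minutes=2),
    "CREATING_SHIPMENT": timedelta(minutes=30),
}

def find_stalled(orders, now):
    """orders: iterable of (order_id, state, entered_state_at) rows."""
    return [
        order_id
        for order_id, state, entered_at in orders
        if state in STATE_SLA and now - entered_at > STATE_SLA[state]
    ]

now = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)
orders = [
    ("order-1", "CHARGING_PAYMENT", now - timedelta(minutes=28)),  # stalled
    ("order-2", "CREATING_SHIPMENT", now - timedelta(minutes=3)),  # within SLA
]
print(find_stalled(orders, now))  # ['order-1']
```

Run something like this on a schedule and page on any non-empty result; in a choreographed system this monitor is often the only component that sees the whole workflow.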
2. The God Orchestrator (Orchestration)
The orchestrator grows to include every business rule, validation, and edge case. It becomes a 5,000-line monolith that every team needs to modify for every feature. Deploys become risky because a bug in the notification step can break the payment step. This is the microservice version of the monolith problem, just centralized in one service instead of one codebase.
Fix: keep the orchestrator thin. It should only manage workflow sequencing and compensation. Business logic belongs in the individual services. The orchestrator calls paymentService.charge(), not paymentService.validateCard() followed by paymentService.authorizeAmount() followed by paymentService.capturePayment(). Those details are the payment service's internal concern.
3. Event Storms (Choreography)
A bug in one service causes it to emit events in a tight loop. Every downstream service reacts, emitting more events, creating a runaway cascade. The event bus fills up. All services start lagging. Consumer lag grows from seconds to hours. I've seen this take down an entire platform in minutes because nobody had rate limiters on event emission.
Fix: rate limiters on event emission, circuit breakers on event consumption, and dead letter queues with alerts for event processing failures. Also consider event TTLs: events older than X minutes are automatically routed to a dead letter queue rather than being processed stale.
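An emission rate limiter is typically a token bucket in front of publish(). A deterministic sketch (the caller supplies timestamps so the example is testable; a real limiter would read the clock itself):

```python
class EmitLimiter:
    """Token bucket guarding event emission. The caller passes monotonic
    timestamps to keep this sketch deterministic; a real implementation
    would use time.monotonic() internally."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller drops the event or routes it to a dead letter queue

limiter = EmitLimiter(rate_per_sec=2, burst=3)
print([limiter.allow(0.0) for _ in range(4)])
# [True, True, True, False] -- burst exhausted, fourth emit is blocked
assert limiter.allow(1.0)  # one second later, tokens have refilled
```

A buggy service looping on emit exhausts its burst within milliseconds and gets throttled, instead of flooding every downstream consumer.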
4. Orchestrator SPOF (Orchestration)
The orchestrator crashes. Every in-flight workflow stops. No new workflows can start. If the orchestrator is stateless (no persistent state), all in-flight workflows are lost and must be manually investigated and restarted.
The blast radius is proportional to how many workflows the orchestrator manages. If it manages all workflows for the entire company, an outage is catastrophic.
Fix: persist workflow state to a database after each step. On restart, resume from the last checkpointed state. This is exactly what Temporal and Conductor provide out of the box. For HA, run multiple orchestrator instances behind a leader election mechanism. Only the leader processes workflows; followers take over instantly if the leader dies.
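The leader-election piece can be approximated with a lease in shared storage. This dict-based sketch only illustrates the protocol — production systems use ZooKeeper/etcd, or a database row updated with a compare-and-swap:

```python
# Toy lease-based leader election: one row in shared storage records the
# current leader and lease expiry. The dict stands in for that row.
LEASE_SECONDS = 10
lease = {"holder": None, "expires_at": 0.0}

def try_acquire(instance_id: str, now: float) -> bool:
    # In a real store this check-and-update must be atomic,
    # e.g. UPDATE leases SET holder=..., expires_at=... WHERE expires_at <= now.
    if lease["holder"] == instance_id or lease["expires_at"] <= now:
        lease["holder"] = instance_id
        lease["expires_at"] = now + LEASE_SECONDS
        return True
    return False

assert try_acquire("orchestrator-a", now=0.0)       # a becomes leader
assert not try_acquire("orchestrator-b", now=5.0)   # a's lease still valid
assert try_acquire("orchestrator-b", now=15.0)      # lease expired; b takes over
```

Each orchestrator instance calls try_acquire on a short interval; the holder renews its lease and processes workflows, while followers poll until the lease lapses.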
Event Schema Governance (Choreography-specific)
As choreography systems grow, event schemas become the primary coupling mechanism between teams. Without governance, teams independently evolve their event schemas and break consumers they don't even know about. The irony: you adopted choreography to reduce coupling, yet event schemas become a tighter coupling surface than direct API calls — a breaking schema change fails silently at runtime instead of loudly at compile or deploy time.
The production answer: a schema registry (Confluent Schema Registry for Avro/Protobuf, or a Git-based contract repository). All event schemas are registered with backward-compatibility rules enforced at publish time. Breaking changes are rejected automatically. Teams use consumer-driven contract testing to verify their schemas against consumer expectations before deploying.
My advice: if you're going to do choreography at scale, invest in schema governance before you hit 10 event types. After 10 types with 20+ consumers, retroactively adding governance is painful.
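To make the compatibility rule concrete, here is a simplified sketch of the check a registry enforces at publish time. The schema shape is invented for illustration (Avro and Protobuf have richer, formally specified rules), but the principle — additive changes pass, removals and retypes are rejected — is the same:

```python
# Simplified backward-compatibility rule: a new schema version may add fields,
# but must not remove or retype fields that existing consumers read.
def backward_compatible(old: dict, new: dict) -> bool:
    return all(field in new and new[field] == typ for field, typ in old.items())

v1 = {"orderId": str, "amountCents": int}
v2 = {"orderId": str, "amountCents": int, "currency": str}  # additive: OK
v3 = {"orderId": str, "amountCents": str}                   # retyped: breaking

assert backward_compatible(v1, v2)
assert not backward_compatible(v1, v3)
```

A registry runs this check when a producer publishes a new version and rejects the publish outright, so the breaking change never reaches the bus.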
5. Distributed Tracing Gaps (Choreography)
Each service logs its own actions, but correlating them into a single workflow trace requires propagating a correlation ID through every event. If even one service fails to propagate it, the trace breaks and you can't follow the full workflow. In choreographed systems with 10+ services, these gaps appear surprisingly often and make incident response painfully slow.
Fix: enforce correlation ID propagation as a mandatory event field. Reject events without it at the event bus level. Use OpenTelemetry auto-instrumentation to reduce the chance of human error in propagation.
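Enforcement at the bus boundary can be a few lines of middleware. The bus API below is invented for illustration; the point is that a producer missing the correlation ID fails fast at publish time instead of silently breaking traces downstream:

```python
# Sketch: a bus wrapper that rejects any event lacking a correlation_id.
class MissingCorrelationId(Exception):
    pass

class StrictBus:
    def __init__(self):
        self.published = []

    def publish(self, event_type: str, payload: dict) -> None:
        if not payload.get("correlation_id"):
            raise MissingCorrelationId(event_type)
        self.published.append((event_type, payload))

bus = StrictBus()
bus.publish("OrderPlaced", {"order_id": "42", "correlation_id": "abc-123"})

rejected = False
try:
    bus.publish("OrderPlaced", {"order_id": "43"})  # no correlation_id
except MissingCorrelationId:
    rejected = True
assert rejected and len(bus.published) == 1
```

The same check belongs on the consumer side too, so an event that somehow slips through still lands in the dead letter queue rather than producing an untraceable log line.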
Trade-offs
| Dimension | Choreography | Orchestration |
|---|---|---|
| Coupling | Services coupled to event schema only | Services coupled to orchestrator interface |
| Observability | Hard: trace assembled from logs across services | Easy: orchestrator holds full state |
| Adding a new step | Subscribe a new service to the right event | Modify the orchestrator code |
| Compensation | Distributed: each service handles its own | Centralized: orchestrator drives all |
| Single point of failure | No central coordinator | Orchestrator is SPOF (mitigate with HA) |
| Testing | Integration tests across services (hard) | Test orchestrator in isolation (easier) |
| Team autonomy | High: teams deploy independently | Low: orchestrator changes require coordination |
| Workflow complexity ceiling | Struggles with conditionals and branching | Handles arbitrarily complex state machines |
| Debugging | "Which service failed?" requires distributed tracing | Inspect orchestrator state directly |
| Scaling bottleneck | None (fully distributed) | Orchestrator throughput limits workflow rate |
The fundamental tension is autonomy vs visibility. Choreography maximizes team independence at the cost of workflow observability. Orchestration maximizes operational control at the cost of centralized coordination. Most mature organizations find a balance point rather than going all-in on either extreme.
Real-World Usage
Netflix Conductor is Netflix's open-source orchestration engine. They use it to coordinate content encoding workflows (a new title upload triggers 200+ encoding tasks across different services). Conductor runs millions of workflows per day with full state tracking, retry policies, and sub-workflow composition. They chose orchestration because content encoding has complex branching (different codecs, resolutions, DRM packaging) that would be impossible to manage through choreography. Conductor stores workflow state in Cassandra and uses Redis for task queuing, achieving sub-second task dispatch latency even at peak load.
Uber Cadence (now evolved into Temporal) was built for ride-matching and payment workflows. A single ride involves driver matching, fare calculation, payment authorization, payment capture, and tip processing. Uber chose orchestration because payment workflows need strict ordering, compensation guarantees, and audit trails. Cadence handles millions of concurrent workflows with durable execution (the workflow survives process restarts). The key innovation: workflows are written as normal code with function calls, not as state machines or YAML configurations.
Shopify uses choreography for their event-driven commerce platform. When a merchant updates a product, the event propagates to inventory, search, storefront rendering, analytics, and tax calculation services. Each team owns their subscriber independently. Shopify's event bus handles billions of events per day across 30+ consuming teams. They chose choreography because team autonomy is their top priority: 2,000+ engineers need to deploy independently without coordinating changes through a central orchestrator.
Interview shortcut: name the tool
Saying "I'd use Temporal for the orchestrated saga" is much stronger than "I'd build a custom orchestrator." Temporal and Conductor are battle-tested; building your own workflow engine is a multi-quarter project. Name the tool, explain why, and move on.
How This Shows Up in Interviews
This topic comes up in every multi-service system design. Whenever you have 3+ services that need to coordinate, the interviewer expects you to address the coordination strategy explicitly.
When to bring it up: "For this multi-step workflow, I'd use an orchestrator pattern, specifically a saga orchestrator, because we need compensation guarantees and auditability. Between the order domain and the notification domain, I'd use choreography since those teams deploy independently."
The strongest candidates proactively draw the orchestrator state machine on the whiteboard. It immediately shows the interviewer that you understand failure modes, compensation paths, and workflow state.
Depth expected:
- At senior level: know the difference, draw both patterns, explain when to use each, mention compensation
- At staff level: discuss the hybrid approach, explain compensation in detail, name production tools (Temporal, Conductor), discuss testing strategies and schema evolution
- At principal level: discuss organizational implications (Conway's law), workflow versioning, and cost/benefit of building vs buying orchestration infrastructure
| Interviewer asks | Strong answer |
|---|---|
| "How do you coordinate these five services?" | "Saga orchestrator within the order domain. It tracks state, handles compensation, and is auditable. Other domains subscribe to the completion event via choreography." |
| "What if the orchestrator goes down?" | "Workflow state is persisted to the database after each step. On restart, it resumes from the last checkpoint. For HA, run multiple orchestrator replicas with leader election." |
| "Why not just choreography everywhere?" | "Choreography struggles with complex compensation. If step 4 of 5 fails, you need to undo steps 1-3 in reverse order. With choreography, each service independently decides how to compensate. With orchestration, the compensations are explicit and ordered." |
| "How do you test a choreography-based workflow?" | "Contract tests: each service's event schema is version-controlled. Integration tests: spin up all services with an in-memory event bus and verify the full flow end-to-end. Chaos tests: kill a service mid-workflow and verify compensation fires." |
| "What tooling would you use?" | "Temporal for orchestration. It gives you durable execution, retry policies, and workflow versioning out of the box. For choreography, Kafka with consumer groups and a schema registry." |
Common mistake in interviews: candidates draw only the happy path. The interviewer wants to see what happens when step 3 fails. Draw the compensation arrows. Show the state machine transitions for failure cases. This is what separates senior from mid-level answers.
Another common mistake: choosing choreography or orchestration based on a vague preference rather than concrete criteria. Always tie your choice to specific requirements: "We need audit trails, so orchestration" or "Four teams deploy independently, so choreography at the boundaries."
Quick Recap
- Choreography distributes coordination across services via events; orchestration centralizes it in a dedicated workflow service.
- Choreography excels at team autonomy and extensibility but struggles with complex compensation, observability, and time-bound SLAs.
- Orchestration provides explicit workflow state tracking and centralized compensation but introduces a single point of failure and deployment coupling.
- The orchestrator state machine is a powerful interview artifact: it shows every step, every failure transition, and every compensation path in one diagram.
- Most production systems use a hybrid: orchestration within a bounded context, choreography between contexts. This follows Conway's law.
- Choreography's "loose coupling" is partially illusory: coupling moves from code to event schema contracts. Invest in schema governance early.
- When time-bound SLAs or complex multi-step compensation are required, orchestration wins decisively.
- Production tools: Temporal and Netflix Conductor for orchestration; Kafka with schema registries for choreography.
- In interviews, always show the compensation path (not just the happy path) and tie your choice to specific requirements.
Related Patterns
- Saga pattern: Sagas are the implementation mechanism for both choreography and orchestration in distributed transactions. This article covers when to use each coordination style.
- Event-driven architecture: Choreography is one pattern within the broader event-driven architecture. Understand event types, schemas, and delivery guarantees.
- Message queues: Both patterns rely on messaging infrastructure. Choreography uses pub/sub topics; orchestration may use request-reply over queues.
- Dead letter queue: Essential for choreography: when a consumer fails to process an event, the DLQ captures it for manual investigation and replay.
- Circuit breaker: Both orchestration (breaker on service calls) and choreography (breaker on event processing) benefit from circuit breakers to prevent cascade failures.
The Observability Problem With Choreography
The most underestimated cost of choreography is debugging:
```
Support ticket: "Order 12345 is stuck — inventory was reserved 3 hours ago
but payment never happened."

Engineer: opens 4 different service log dashboards, correlates on order_id,
and finds that PaymentService crashed after receiving InventoryReserved but
before emitting PaymentCollected. The event is still in Kafka and will be
retried at 2 a.m.
```
With an orchestrator, this is one query: SELECT * FROM workflow_runs WHERE order_id = 12345. With choreography, you need distributed tracing or a dedicated event store to replay what happened.
Choreography without distributed tracing is an operational liability at scale.
Bottom Line
- Choreography routes work through events — each service reacts to events from others, no central coordinator. Orchestration uses a central service that explicitly calls each step and manages failures.
- Choreography minimizes service coupling (only to event schemas) but makes workflows hard to observe and debug. Orchestration centralizes workflow logic and state but creates a single point of failure.
- Compensation in choreography is distributed — each service listens for its own rollback event. In orchestration, the coordinator explicitly calls each compensation step, which is easier to reason about.
- Use choreography across team/domain boundaries where services are independently owned. Use orchestration within a bounded context where one team owns all services.
- Do not adopt choreography without distributed tracing infrastructure. Debugging a stuck workflow across 5 services without trace correlation is a significant operational burden.