Synchronous event chain anti-pattern
Learn how event-triggered synchronous calls stack latency and amplify failures across a service chain, and how async event handling breaks the coupling.
TL;DR
- A synchronous event chain occurs when an event triggers a service call, which triggers another event, which triggers another call, all synchronously in the same request thread.
- Each hop in the chain adds its latency to the total response time. Each service in the chain is a potential failure point that propagates back to the original caller.
- This pattern often hides in event frameworks: "process order event → call inventory service → call email service → call analytics service." What looks like an event pipeline is a synchronous call chain in disguise.
- Break the chain with async: publish events to a queue, let consumers handle them independently without blocking the original caller.
- Latency in a synchronous chain is additive (sum of all hops). Latency in an async system is determined by the critical path alone (the slowest operation the user must wait for).
The Problem
It's 11:02 a.m. on launch day. Your checkout team deployed a "simple" order event handler last week. When a user places an order, the Order service creates the record, then fires an order.created event. Sounds async, right? Except every listener processes synchronously in the same request thread.
Here's the actual call chain: Order Service → Inventory Service → Warehouse Service → Metrics Service → Analytics Service, every hop running synchronously in the user's request thread.
The user waited 8.1 seconds for an order confirmation. Their request was blocked while Warehouse called Metrics, which called Analytics. Analytics had a slow database query that day. The user's "Place Order" button spun, they clicked again, and now you have a duplicate order.
I've seen this exact failure in two different companies. The worst part: the Order service team didn't even know their latency depended on the Analytics database. The chain was invisible until someone looked at the distributed trace.
The latency budget was eaten by non-critical downstream calls (analytics, metrics) that had no business being in the synchronous critical path of order creation. Remove Analytics from the chain and the response drops to 190ms.
Here's the breakdown of where the time went:
| Service | Latency | Critical for user? | Should be sync? |
|---|---|---|---|
| Order Service | 50ms | Yes (creates order) | Yes |
| Inventory Service | 80ms | Yes (confirms stock) | Yes |
| Warehouse Service | 30ms | No (internal logistics) | No |
| Metrics Service | 20ms | No (observability) | No |
| Analytics Service | 3,200ms | No (dashboard) | No |
| Total (sync chain) | 8,100ms | | |
| Total (critical only) | 130ms | | |
The difference between 8.1 seconds and 130ms is the cost of the synchronous event chain.
Why It Happens
Teams build synchronous event chains because each decision makes sense in isolation.
"Events are async by nature." Not necessarily. Many event frameworks (Spring `ApplicationEvent`, Node.js `EventEmitter`, .NET MediatR) dispatch events synchronously by default. The word "event" tricks you into thinking it's non-blocking, but the handler runs in the calling thread, blocking the original request until it returns.
"We just need one more listener." The chain grows incrementally. First it's Order + Inventory (reasonable, 2 services). Then someone adds email notification. Then analytics. Then fraud scoring. No single addition looks dangerous, but by month six, your checkout latency depends on five downstream services.
"All our services are fast." They are, until one isn't. Latency in a synchronous chain is additive: if five services each take 50ms on a good day, that's 250ms. When one service hits a slow query or a connection pool timeout, the whole chain stalls. I've seen a 3-second Analytics hiccup turn into an 8-second checkout timeout.
"We need the result before we can respond." Sometimes true (inventory check), but teams apply this reasoning to every operation. Ask: "Does the user need this result to see their confirmation page?" If the answer is no, it doesn't belong in the synchronous path.
Real-world examples
This anti-pattern appears wherever "event-driven" meets "in-process dispatch":
- Spring `@EventListener` without `@Async`. The default is synchronous: your event handler runs in the HTTP request thread. Adding `@TransactionalEventListener` doesn't make it async either; it only defers execution to transaction commit but still blocks the response.
- Node.js `EventEmitter` with `await`. The built-in `EventEmitter` invokes listeners synchronously in the calling stack, and custom event buses that `await` listener promises (`await eventBus.emit('order.created')`) block the caller until every listener finishes. The event looks async (it's an "event"), but the caller is blocked.
- gRPC unary calls chained together. Service A calls Service B, which calls Service C, which calls Service D. Each call is a unary RPC. The latency is the sum of all four calls plus network hops.
- Saga orchestrators that wait for all steps. An orchestrator that sends commands and awaits every response before sending the next command is a synchronous chain using saga terminology. True sagas emit events and handle responses asynchronously.
The growth pattern
Synchronous chains rarely start long. They grow incrementally:
Month 1: Order → Inventory (2 services, 130ms)
Month 3: Order → Inventory → Email (3 services, 180ms)
Month 5: Order → Inventory → Email → Analytics (4 services, 230ms)
Month 8: Order → Inventory → Email → Analytics → Warehouse → Metrics (6 services, 400ms+)
Each addition passes code review because the team only sees "+1 service, +50ms." Nobody tracks the cumulative chain depth. By month 8, a degraded service turns 400ms into 8 seconds. Set a lint rule or architecture review gate: any synchronous chain deeper than 3 hops requires explicit approval.
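That review gate can be automated if your traces are exportable. Here's a minimal sketch of the depth check, assuming a hypothetical flat span format (`id`, `parentId`, `sync`) rather than any particular tracer's schema:

```javascript
// Compute the deepest synchronous chain in one trace.
// Span shape is hypothetical: { id, parentId, sync }.
function maxSyncChainDepth(spans) {
  const children = new Map();
  for (const s of spans) {
    if (!children.has(s.parentId)) children.set(s.parentId, []);
    children.get(s.parentId).push(s);
  }
  const depth = (span) =>
    1 + Math.max(0, ...(children.get(span.id) || [])
      .filter((c) => c.sync) // only synchronous hops extend the chain
      .map(depth));
  const roots = spans.filter((s) => s.parentId == null);
  return Math.max(0, ...roots.map(depth));
}

// Order → Inventory → Warehouse → Metrics → Analytics, all sync
const trace = [
  { id: 'order', parentId: null, sync: true },
  { id: 'inventory', parentId: 'order', sync: true },
  { id: 'warehouse', parentId: 'inventory', sync: true },
  { id: 'metrics', parentId: 'warehouse', sync: true },
  { id: 'analytics', parentId: 'metrics', sync: true },
];
if (maxSyncChainDepth(trace) > 3) {
  console.log('FAIL: synchronous chain deeper than 3 hops needs approval');
}
```

Run against a sample of production traces in CI, this turns "nobody tracks cumulative chain depth" into a failing build.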
How to Detect It
| Symptom | What It Means | How to Check |
|---|---|---|
| Linear span waterfall in traces | Each service waits for the next one to finish before returning | Open Jaeger/Zipkin, look for staircase-shaped traces |
| Adding a notification feature increases checkout latency | New listener is synchronous in the critical path | Measure p99 before/after deploying the new listener |
| Non-critical service outage degrades critical user action | Sync chain couples critical and non-critical paths | Kill the analytics service in staging, watch checkout latency |
| Removing a downstream service makes the system faster | That service was doing unnecessary work in the request thread | Profile with and without the service call |
| Latency increases linearly with chain depth | Each hop adds its processing time | Plot latency vs. number of downstream calls per request |
Code smells
Look for these patterns in your codebase:
```javascript
// SMELL: synchronous event dispatch in request handler
app.post('/orders', async (req, res) => {
  const order = await createOrder(req.body);
  await eventBus.emit('order.created', order); // blocks until ALL listeners finish
  res.json(order); // user waits for everything above
});
```

```javascript
// SMELL: event listener that makes a blocking HTTP call
eventBus.on('order.created', async (order) => {
  await fetch('http://analytics-service/events', { // sync call inside "event" handler
    method: 'POST',
    body: JSON.stringify(order)
  });
});
```
The `await eventBus.emit(...)` is the red flag. If your event bus blocks the caller until all listeners return, your "events" are synchronous calls in disguise.
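One in-process mitigation, sketched below with a hypothetical `FireAndForgetBus` (not a library API): schedule each listener on a later event-loop tick and never await it, so the request handler returns immediately. This only removes the blocking; it adds no durability, so a real message queue remains the right fix for cross-service work.

```javascript
// Hypothetical fire-and-forget bus: emit() returns immediately,
// listeners run on a later event-loop tick, and errors are isolated.
class FireAndForgetBus {
  constructor() { this.listeners = new Map(); }
  on(event, fn) {
    const fns = this.listeners.get(event) || [];
    fns.push(fn);
    this.listeners.set(event, fns);
  }
  emit(event, payload) {
    for (const fn of this.listeners.get(event) || []) {
      setImmediate(() => {
        // a failing listener logs instead of propagating to the caller
        Promise.resolve(fn(payload)).catch((err) =>
          console.error(`listener for ${event} failed:`, err.message));
      });
    }
    // nothing returned, nothing to await: the caller cannot block on listeners
  }
}
```

Because `emit` schedules work instead of running it, a slow listener can no longer stretch the HTTP response, but an in-process crash still loses the event.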
Latency math tells the story
A quick formula to diagnose this: total request latency should roughly equal the latency of the slowest critical-path operation, not the sum of all operations. If your checkout takes 2 seconds but your slowest critical operation (payment) takes 400ms, the remaining 1.6 seconds is being spent on synchronous non-critical calls.
Expected latency ≈ max(critical operations) + overhead
Actual latency ≈ sum(all operations in chain)
If actual >> expected, you have a synchronous chain.
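The formula is mechanical enough to script. A minimal sketch using the numbers from the launch-day table (the 2× "suspicious" threshold and the 50ms overhead are arbitrary starting points, not rules):

```javascript
// Diagnose a suspected synchronous chain from per-operation latencies (ms).
function diagnose(operations, overheadMs = 50) {
  const critical = operations.filter((op) => op.critical).map((op) => op.ms);
  const expected = Math.max(...critical) + overheadMs; // slowest critical op + overhead
  const actual = operations.reduce((sum, op) => sum + op.ms, 0); // sync chain: additive
  return { expected, actual, suspicious: actual > 2 * expected };
}

// Figures from the breakdown table above
const ops = [
  { name: 'order', ms: 50, critical: true },
  { name: 'inventory', ms: 80, critical: true },
  { name: 'warehouse', ms: 30, critical: false },
  { name: 'metrics', ms: 20, critical: false },
  { name: 'analytics', ms: 3200, critical: false },
];
console.log(diagnose(ops)); // actual dwarfs expected: synchronous chain
```

When `actual` is a small multiple of `expected`, you may just have chatty critical operations; when it's 20x, a non-critical call is sitting in the request thread.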
Check your APM tool (Datadog, New Relic, Jaeger). Filter for traces where the span count is > 4 and the parent span duration is > 1 second. Sort by span count descending. The traces with the most spans and highest parent duration are your synchronous chains.
The Fix
Fix 1: Separate the critical path from the non-critical path
Identify which operations the user must wait for and push everything else to async event consumers.
```javascript
// BAD: synchronous event chain in the request handler
app.post('/orders', async (req, res) => {
  const order = await orderService.create(req.body);
  await inventoryService.reserve(order.items); // critical, keep sync
  await emailService.sendConfirmation(order); // non-critical, but blocks response
  await analyticsService.trackOrder(order); // non-critical, blocks response
  await warehouseService.schedulePickup(order); // non-critical, blocks response
  res.json({ orderId: order.id, status: 'CONFIRMED' });
});
```

```javascript
// GOOD: publish event, let consumers handle non-critical work
app.post('/orders', async (req, res) => {
  const order = await orderService.create(req.body);
  await inventoryService.reserve(order.items); // critical, keep sync
  await messageQueue.publish('order.created', order); // enqueue only, no waiting on consumers
  res.json({ orderId: order.id, status: 'CONFIRMED' });
});
```
The user gets a response in under 200ms. Email, analytics, and warehouse scheduling all happen separately after the order is committed. Failures in those services don't affect the user's experience.
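The consumer side can be sketched with an in-memory stand-in for the queue (a real system would use SQS, RabbitMQ, or Kafka; `publish`/`subscribe` and the service stubs here are illustrative, not a library API). The point is that each consumer processes `order.created` independently and retries without touching the user's request:

```javascript
// Hypothetical in-memory stand-in for a message queue with per-consumer retry.
class InMemoryQueue {
  constructor() { this.consumers = new Map(); }
  subscribe(event, handler) {
    const handlers = this.consumers.get(event) || [];
    handlers.push(handler);
    this.consumers.set(event, handlers);
  }
  publish(event, payload) {
    for (const handler of this.consumers.get(event) || []) {
      deliver(handler, payload); // not awaited: publisher returns immediately
    }
  }
}

async function deliver(handler, payload, attempt = 1) {
  try {
    await handler(payload);
  } catch (err) {
    if (attempt < 3) {
      // retry with backoff; the user's request has already returned
      setTimeout(() => deliver(handler, payload, attempt + 1), 100 * attempt);
    } else {
      console.error('dead-lettering event after 3 attempts:', err.message);
    }
  }
}

// Stub services for illustration; non-critical work lives in consumers,
// never in the request handler.
const emailService = { sendConfirmation: (o) => console.log('email for', o.id) };
const analyticsService = { trackOrder: (o) => console.log('tracked', o.id) };

const queue = new InMemoryQueue();
queue.subscribe('order.created', (order) => emailService.sendConfirmation(order));
queue.subscribe('order.created', (order) => analyticsService.trackOrder(order));
```

A slow or failing analytics consumer now delays only its own retries, not checkout.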
Fix 2: Classify operations as sync or async
Not everything can be async. Some operations must be synchronous:
| Operation | Sync or Async? | Reason |
|---|---|---|
| Inventory reservation | Sync | User needs confirm/reject before payment |
| Payment charge | Sync | User must know if payment succeeded |
| Order ID creation | Sync | User needs a reference number |
| Welcome email | Async | User can receive it seconds later |
| Analytics event recording | Async | Non-critical, can be delayed |
| Warehouse reservation | Async | Can use saga for compensation if it fails |
| Fraud score calculation | Depends | If you want to block orders: sync. If just monitoring: async |
Fix 3: Two-phase response for long-running operations
For unavoidably long-running operations, return a pending status immediately and deliver the final result asynchronously:
```typescript
// Phase 1: accept immediately, return a tracking ID
app.post('/orders', async (req, res) => {
  const order = await orderService.create(req.body);
  await inventoryService.reserve(order.items); // critical, keep sync
  await messageQueue.publish('order.process', order); // async processing
  res.status(202).json({ orderId: order.id, status: 'PENDING' });
});

// Phase 2: notify when processing completes
async function processOrder(order: Order) {
  await paymentService.charge(order);
  await warehouseService.schedule(order);
  await webhookService.deliver(order.callbackUrl, {
    orderId: order.id,
    status: 'CONFIRMED',
  });
}
```
The user sees "order received" in 200ms. Payment, warehouse scheduling, and confirmation happen in the background. If the user needs real-time updates, combine this with SSE or polling on the order status.
This pattern is especially useful for operations that depend on third-party APIs (payment gateways, shipping providers) where latency is unpredictable.
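The polling variant needs only a status store behind an endpoint like `GET /orders/:id/status` (the function names and states below are illustrative; SSE would push the same transitions instead of being polled):

```javascript
// Order status store backing a polling endpoint.
const statuses = new Map();

function acceptOrder(orderId) {
  statuses.set(orderId, 'PENDING'); // phase 1: respond 202 with this status
}

function completeOrder(orderId, ok) {
  // phase 2: the background processor records the final state
  statuses.set(orderId, ok ? 'CONFIRMED' : 'FAILED');
}

function getStatus(orderId) {
  return statuses.get(orderId) || 'UNKNOWN';
}

acceptOrder('ord-1');
console.log(getStatus('ord-1')); // PENDING
completeOrder('ord-1', true);
console.log(getStatus('ord-1')); // CONFIRMED
```

In production this map would live in a database keyed by order ID, since the request handler and the background processor are different processes.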
Which fix to use
- Fix 1 when the non-critical work can simply happen later (email, analytics, metrics): publish an event and return.
- Fix 2 when you're unsure what belongs in the critical path: classify every operation first, then apply Fix 1 to the async ones.
- Fix 3 when a critical operation is unavoidably slow (payment gateways, shipping providers): return 202 PENDING and deliver the final status via webhook or polling.
Severity and Blast Radius
Synchronous event chains are high severity in user-facing flows. The blast radius grows with chain depth: a 2-service chain has 2 failure points, but a 5-service chain has 5 failure points with compounding latency.
| Chain depth | p99 latency (good day) | p99 latency (one service degrades) | Failure probability |
|---|---|---|---|
| 2 services | 100ms | 350ms | 2 × single-service failure rate |
| 3 services | 150ms | 3.2s | 3 × single-service failure rate |
| 5 services | 250ms | 8s+ | 5 × single-service failure rate |
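The failure column is the small-p approximation of independent failure points: a chain of n services, each failing with probability p, fails with probability 1 − (1 − p)^n ≈ n·p. A quick check (p = 0.001, i.e. 99.9% per-service availability, chosen for illustration):

```javascript
// Probability that a synchronous chain of n services fails,
// assuming independent per-service failure probability p.
function chainFailureProbability(n, p) {
  return 1 - Math.pow(1 - p, n);
}

const p = 0.001; // illustrative: 99.9% availability per service
for (const n of [2, 3, 5]) {
  console.log(n, chainFailureProbability(n, p).toFixed(6)); // close to n * p
}
```

The approximation holds until p or n grows; either way, every service you add to the chain multiplies the user-visible failure rate.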
Worst case: a non-critical service (analytics, logging) goes down and takes your checkout flow with it. Recovery requires identifying which service is slow, then either removing it from the chain or converting it to async. If the coupling is deep, this can take hours to days.
The insidious part is that the system works perfectly in testing. The chain only fails under production load when one downstream service degrades. I've seen teams pass load testing because all services were healthy during the test window, then fail on launch day when real traffic created contention.
When It's Actually OK
- Two-service chains where both are critical. Order → Inventory is a legitimate synchronous call. The user needs to know if stock is available before paying.
- Internal microservice calls with strict SLAs. If both services are owned by the same team, co-located, and have p99 < 10ms, the sync overhead is negligible.
- Request validation chains. Auth → Rate Limit → Request Handler is synchronous by nature. Each step must pass before the next one runs.
- Reads with no side effects. A sync chain that only reads data (no writes, no external calls) is lower risk because retries are safe.
The pattern becomes dangerous when non-critical operations sneak into the synchronous path, or when the chain grows beyond 2-3 services.
A good rule of thumb: if your synchronous chain has more than 3 services, audit it. For each service after the third, ask "what happens if I move this to async?" If the user experience doesn't change, move it.
How This Shows Up in Interviews
Interviewers test this when you design any event-driven system (order processing, notification pipelines, payment flows). The test is whether you distinguish between operations that must be synchronous and operations that can be async.
A strong answer includes:
- Explicit classification of which operations are in the critical path
- A message queue or event bus for non-critical operations
- Acknowledgment that "event-driven" doesn't automatically mean "async"
When designing an event-driven system, describe which operations you'd keep in the synchronous critical path and which you'd push to async queues. "The user waits only for inventory reservation and payment authorisation. Everything after that (email, analytics, warehouse fulfillment) consumes an order.created event asynchronously. Failures in those consumers don't affect the user or require a retry that the user sees."
Quick Recap
- Synchronous event chains look like event-driven systems but are actually synchronous call chains where each event immediately blocks on a downstream call.
- Latency stacks linearly and failures propagate backward. A slow analytics service can delay a user's checkout response.
- Separate the critical path (what the user waits for) from the non-critical path (what happens after confirmation).
- Non-critical operations should publish to a message queue and be consumed independently.
- For long-running critical operations, use the two-phase response pattern: return PENDING immediately, deliver final status via webhook or polling.
- The word "event" doesn't make something async. If the caller awaits the listener, it's a synchronous call in disguise.
- Test by killing non-critical downstream services in staging. If your critical user flow breaks, you have a synchronous chain.