Event-driven architecture
How event-driven systems decouple producers from consumers using events as the primary communication mechanism, covering event types, broker topology, ordering guarantees, and the tradeoffs vs. synchronous calls.
TL;DR
- Event-driven architecture (EDA) uses events — immutable records of things that happened — as the primary communication mechanism between services.
- Producers emit events without knowing which consumers exist. Consumers subscribe to event streams and react independently.
- Key properties: decoupling (producers don't know consumers), temporal independence (consumers can be offline), fan-out without coordination (one event reaches N consumers).
- Tradeoffs: harder to trace end-to-end flows, eventual consistency by default, debugging requires distributed tracing or event log inspection.
- Apache Kafka dominates production event streaming; SNS/SQS is common on AWS for simpler use cases.
The Problem It Solves
Your checkout service processes an order. It needs to notify inventory, send a confirmation email, update analytics, and trigger fulfillment. In a synchronous architecture, the checkout service calls each downstream service directly: inventory.reserve(), email.send(), analytics.track(), fulfillment.create(). If the email service is slow (3 seconds), the entire checkout hangs for 3 seconds. If fulfillment is down, the checkout fails entirely, even though the payment already went through.
Worse, every new team that needs order data requires a code change in the checkout service. Product wants a recommendations feed? Add a call to recommendations.update(). Marketing wants abandon-cart tracking? Another call. The checkout service becomes a God service that knows about every downstream consumer, and a failure in any of them can cascade back and break the purchase flow.
This is the problem that event-driven architecture solves: tight coupling between the producer of information and every consumer that needs it.
What Is It?
Event-driven architecture (EDA) is a design approach where services communicate by producing and consuming events, which are immutable records of things that happened. The producer publishes an event ("order was placed") without knowing or caring who consumes it. Consumers subscribe to event streams and react independently.
Think of it like a newspaper versus phone calls. In a synchronous world, the checkout service has to call each interested party individually (like making 5 phone calls, waiting for each one to answer). In an event-driven world, it publishes a single "Order Placed" event (like publishing a headline), and whoever is subscribed reads it independently. Adding a new subscriber doesn't require the publisher to change anything.
The checkout service's job is done as soon as the event is published. If the email service is slow, it processes the event at its own pace. If fulfillment is down, the event waits in the broker until fulfillment recovers. Adding a new consumer (recommendations) requires zero changes to the checkout service.
For your interview: say "I'd use event-driven architecture here to decouple the producer from consumers, so a failure in one downstream doesn't cascade back to the caller" and you've already shown good instincts.
How It Works
Let's trace a single event from production to consumption through Kafka, which is the most common event broker in production systems.
Steps in detail:
- Produce: The checkout service publishes an OrderPlaced event to the order-events Kafka topic. The event includes the order ID as the partition key.
- Partition: Kafka hashes the order ID and appends the event to the corresponding partition. All events for the same order land on the same partition, preserving order per entity.
- Consume: Three consumer groups (inventory, notifications, analytics) each get a copy of the event. Within each group, one consumer instance processes the event and commits its offset.
- Retry on failure: If the email consumer crashes mid-processing, Kafka redelivers the event to another instance in the same group. The event is not lost.
Here's what the producer code looks like:
# Producer: publish event after checkout completes
# (`producer` is assumed to be a configured Kafka client, e.g.
#  kafka-python's KafkaProducer with string serializers for key and value)
import json
event = {
"eventId": "evt_01J8XK...",
"eventType": "order.placed",
"eventVersion": "2.0",
"timestamp": "2024-01-15T10:30:42.123Z",
"source": "checkout-service",
"correlationId": "req_abc123",
"data": {
"orderId": "ord_xyz789",
"userId": "usr_123",
"totalCents": 4995,
"currency": "USD"
}
}
producer.send(
topic="order-events",
key=event["data"]["orderId"], # partition key
value=json.dumps(event)
)
Critical fields: eventId (for deduplication), eventType (for routing), correlationId (for distributed tracing), eventVersion (for schema evolution).
Ordering guarantees
Kafka guarantees order within a partition, not across partitions. If your order-events topic has 12 partitions, events for order 1001 all land on the same partition (because the partition key is orderId), so they're consumed in order. Events for different orders may be processed out of order, but that's fine because they're independent.
Topic: order-events (12 partitions)
Partition 3: [order:1001-placed, order:1001-paid, order:1001-shipped] ← in order
Partition 7: [order:1002-placed, order:1002-paid] ← in order
No guarantee between partition 3 and partition 7 (but no need for one)
The rule of thumb: use the entity ID as the partition key, and you get per-entity ordering for free.
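A minimal sketch of the property above. Kafka actually uses a murmur2 hash, so the exact partition numbers here are illustrative, but any deterministic hash shows why keying on the entity ID yields per-entity ordering:

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Stable hash of the partition key (md5 stands in for Kafka's murmur2)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

events = [
    ("ord_1001", "order.placed"),
    ("ord_1002", "order.placed"),
    ("ord_1001", "order.paid"),
    ("ord_1001", "order.shipped"),
]

# Every event keyed on ord_1001 maps to the same partition,
# so they are appended (and consumed) in publish order.
partitions = {partition_for(key) for key, _ in events if key == "ord_1001"}
print(len(partitions))  # 1
```

Because the hash is deterministic, all three ord_1001 events land on one partition; ord_1002 may land elsewhere, which is fine because the orders are independent.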
Idempotent consumers
Most brokers offer at-least-once delivery. If a consumer crashes after processing an event but before committing its offset, Kafka redelivers the event. Consumers must handle duplicates:
def handle_order_placed(event):
# Idempotent check: already processed?
if db.exists("processed_events", event["eventId"]):
return # Skip duplicate
with db.transaction():
create_fulfillment_record(event["data"]["orderId"])
db.insert("processed_events", event["eventId"])
# Offset commit happens after transaction
I often see teams skip idempotency checks, then discover duplicate fulfillment records or double-charged customers during load tests. At-least-once is the default, and your consumers must handle it.
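A runnable toy version of the pattern above, with an in-memory set standing in for the processed_events table, shows that redelivering the same event is harmless:

```python
processed_events = set()   # stands in for the processed_events table
fulfillments = []          # stands in for fulfillment records

def handle_order_placed(event: dict) -> None:
    # Idempotent check: skip events we've already handled
    if event["eventId"] in processed_events:
        return
    fulfillments.append(event["data"]["orderId"])
    processed_events.add(event["eventId"])

event = {"eventId": "evt_1", "data": {"orderId": "ord_xyz789"}}
handle_order_placed(event)
handle_order_placed(event)  # at-least-once redelivery: a no-op
print(len(fulfillments))    # 1
```

In a real consumer the dedup insert and the business write must share one database transaction, as the pseudocode above shows; otherwise a crash between the two reintroduces the duplicate.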
Key Components
| Component | Role |
|---|---|
| Producer | Service that publishes events. Knows the event schema and the target topic, but has no knowledge of consumers. |
| Event Broker | Infrastructure that receives, stores, and delivers events. Examples: Kafka, RabbitMQ, AWS SNS/SQS, Pulsar. |
| Topic / Queue | Named channel for events. Topics support fan-out (multiple consumers). Queues deliver to one consumer per message. |
| Consumer Group | A set of consumer instances that share the work of consuming a topic. Each partition is assigned to one consumer in the group. |
| Partition | A unit of parallelism within a topic. Events with the same key go to the same partition. More partitions = more throughput. |
| Dead Letter Queue (DLQ) | Holds events that failed processing after N retries. Prevents poison messages from blocking the entire consumer. |
| Schema Registry | Enforces event schema compatibility (Avro, Protobuf). Prevents producers from publishing breaking changes. |
| Offset / Cursor | Tracks a consumer's position in the event log. Enables replay from any point; paired with idempotent consumers or transactions, supports effectively-once processing. |
Types / Variations
Events vs. messages vs. commands
Three related but distinct concepts that are frequently confused:
| Concept | Definition | Direction | Example |
|---|---|---|---|
| Command | Request for an action | Producer → specific consumer | ProcessPayment{orderId: 123} |
| Event | Record of something that happened | Broadcast to anyone interested | OrderPlaced{orderId: 123} |
| Message | Generic envelope for either | Varies | Depends on context |
Commands are targeted and imply an expectation of handling ("do this"). Events are facts about the past ("this happened"). The distinction matters because events are naturally broadcastable while commands are inherently point-to-point.
Broker topologies
Point-to-point (queue): One producer, one consumer per message. Each message is processed exactly once by one consumer instance. Use case: work queues where each task should be processed once (payment processing, email delivery).
Pub/Sub (topic): One producer, many consumers. Each consumer group gets a copy of every event. Use case: integration fan-out where one business event triggers independent reactions in multiple services.
Streaming (log): Kafka-style: events are appended to an immutable, ordered log. Consumers can replay from any offset, not just the latest. This gives you pub/sub semantics plus the ability to reprocess historical events (e.g., rebuild a search index from scratch).
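A toy append-only log makes the replay property concrete: a consumer can rebuild derived state from offset 0 at any time. The `log` list stands in for a Kafka partition; this is an illustration, not a Kafka client:

```python
# An append-only log: each entry is (offset, event)
log = [
    (0, {"type": "order.placed", "orderId": "1001"}),
    (1, {"type": "order.paid",   "orderId": "1001"}),
    (2, {"type": "order.placed", "orderId": "1002"}),
]

def rebuild_index(from_offset: int = 0) -> dict:
    """Replay the log to rebuild a derived view (order -> latest status)."""
    index = {}
    for offset, event in log:
        if offset < from_offset:
            continue
        index[event["orderId"]] = event["type"]
    return index

print(rebuild_index())  # {'1001': 'order.paid', '1002': 'order.placed'}
```

This is exactly how a search index or materialized view gets rebuilt from scratch: reset the consumer's offset to the beginning and let it reprocess the history.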
Event patterns
Event notification: A thin event that says "something happened" with minimal data. Consumers query the source for full details.
OrderPlaced { orderId: "123" }
# Consumer calls: GET /orders/123 to get full order
Simple, but creates runtime coupling: the consumer needs the producer to be available at query time.
Event-carried state transfer: A fat event that includes all the data consumers need. No callbacks required.
OrderPlaced { orderId: "123", userId: "456", items: [...], total: 49.95 }
# Consumer has everything. No callback to order service.
More data on the wire, but fully decoupled. My recommendation: default to event-carried state transfer for most use cases. The bandwidth cost is negligible compared to the operational simplicity.
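As a sketch (names illustrative), a consumer of fat events can maintain its own read model with no calls back to the order service:

```python
local_orders = {}  # consumer's own copy, built purely from events

def on_order_placed(event: dict) -> None:
    # The event carries all needed state; no GET /orders/{id} callback
    local_orders[event["orderId"]] = {
        "userId": event["userId"],
        "total": event["total"],
    }

on_order_placed({"orderId": "123", "userId": "456",
                 "items": ["sku_1"], "total": 49.95})
print(local_orders["123"]["total"])  # 49.95
```

With event notification, the same consumer would instead need the producer to be up and answering queries at processing time.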
Event sourcing: Store the full sequence of events as the source of truth, not just the current state. The current state is derived by replaying events. Covered in depth in the Event Sourcing article.
Choreography vs. orchestration
Two approaches to coordinating multi-step business processes across services:
Choreography: No central coordinator. Each service listens to events and reacts by doing its work and publishing the next event. Simple for 2-3 steps but becomes hard to follow with 5+ services (the "event spaghetti" problem). Debugging a failed flow means tracing events across multiple services.
Orchestration: A central saga orchestrator tells each service what to do and tracks the overall progress. Easier to understand and debug complex flows, but the orchestrator is a single point of logic that becomes complex. Compensation logic (rollbacks) is explicit.
The rule of thumb: use choreography for simple, independent reactions (notifications, analytics). Use orchestration for complex, multi-step business processes that need clear error handling and compensation (order fulfillment, payment workflows).
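A minimal orchestration sketch, using hypothetical in-process service stubs: the orchestrator runs steps in order and, on failure, runs the compensations for the steps that already succeeded, in reverse:

```python
def reserve_inventory(order): order["reserved"] = True
def release_inventory(order): order["reserved"] = False
def charge_payment(order): raise RuntimeError("card declined")
def refund_payment(order): pass

# Each saga step pairs an action with its compensation
SAGA_STEPS = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
]

def run_saga(order: dict) -> bool:
    completed = []
    for action, compensate in SAGA_STEPS:
        try:
            action(order)
            completed.append(compensate)
        except Exception:
            # Roll back completed steps in reverse order
            for comp in reversed(completed):
                comp(order)
            return False
    return True

order = {"orderId": "1001"}
print(run_saga(order))    # False (payment failed)
print(order["reserved"])  # False (inventory released by compensation)
```

In choreography there is no equivalent central loop: each service would react to the previous service's event, and the compensation path would be spread across all of them.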
Schema evolution
Events are immutable once published. Consumers may run old code against new events, so schema changes must be backward-compatible:
Backward compatible (safe): add optional fields
v1: { orderId, userId }
v2: { orderId, userId, couponCode? } ← v1 consumers ignore couponCode
Breaking (dangerous): rename/remove required fields, change types
Requires: dual-publish during migration, coordinated consumer updates
Schema registries (Confluent Schema Registry, AWS Glue) enforce compatibility rules at the producer boundary, rejecting breaking changes before they reach consumers.
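The backward-compatibility rule in code: a consumer written against v1 keeps working on v2 events because it reads the new field with a default (field names taken from the v1/v2 example above):

```python
def handle(event: dict) -> str:
    order_id = event["orderId"]  # required in every version
    user_id = event["userId"]    # required in every version
    # Optional v2 field: read with a default so v1 events still parse
    coupon = event.get("couponCode", None)
    return f"{order_id}:{user_id}:{coupon}"

v1_event = {"orderId": "123", "userId": "456"}
v2_event = {"orderId": "123", "userId": "456", "couponCode": "SAVE10"}

print(handle(v1_event))  # 123:456:None
print(handle(v2_event))  # 123:456:SAVE10
```

Renaming or removing orderId, by contrast, would make the first line raise a KeyError on old consumers, which is exactly what a schema registry's compatibility check rejects before publication.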
Events are not API calls
The biggest misconception in event-driven design is treating events like asynchronous API calls. Events describe what happened ("OrderPlaced"), not what should happen ("ProcessPayment"). If your event names are verbs/commands, you're building a distributed RPC system with extra steps, not an event-driven architecture. The litmus test: can you add a new consumer without changing the producer? If not, you're doing commands disguised as events.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Loose coupling (producers don't know consumers) | Eventual consistency by default (not immediate) |
| Temporal independence (consumers can be offline, events wait) | Harder to trace end-to-end flows across services |
| Natural fan-out to N consumers without producer changes | Debugging requires event log inspection and distributed tracing |
| Absorbs traffic spikes (broker buffers bursts) | Ordering requires careful partition key design |
| Enables independent deployment and scaling | At-least-once delivery means every consumer must handle duplicates |
| Immutable event log enables replay and rebuilding derived state | Schema evolution is a discipline (breaking changes are very costly) |
The fundamental tension is decoupling vs. observability. The more you decouple services, the harder it becomes to understand what's happening across the system. A synchronous call chain is easy to trace (one request, one call stack). An event flowing through 5 independent consumers requires distributed tracing, correlation IDs, and event log inspection. Every event-driven system needs an investment in observability proportional to its decoupling.
When to Use It / When to Avoid It
Use event-driven architecture when:
- Multiple services need to react to the same business event (fan-out)
- The producer doesn't need an immediate response from consumers
- You need temporal decoupling (consumers can process asynchronously)
- You need to absorb traffic spikes without overloading downstream services
- You want to add new consumers without modifying existing producers
- You're building data pipelines, analytics, or notification systems
Avoid event-driven architecture when:
- The caller needs a synchronous response (checkout confirmation, auth token)
- You have 2 services with a simple request/response pattern (just use HTTP)
- Your team doesn't have the operational maturity for distributed tracing and event debugging
- Eventual consistency is genuinely unacceptable for the use case (real-time balance checks)
- The added complexity isn't justified by the decoupling benefit
Here's the honest answer: if you have one producer and one consumer with a simple request/response pattern, a synchronous HTTP call is simpler and you should use it. Event-driven architecture pays off when you have fan-out (multiple consumers), temporal decoupling needs, or high-volume data flows. Don't introduce Kafka for two services that just need an HTTP call.
Real-World Examples
LinkedIn runs one of the world's largest event-driven platforms. Kafka was originally built at LinkedIn and now processes over 7 trillion events per day. Every user action (profile view, connection request, message sent) becomes an event that feeds dozens of downstream systems: news feed ranking, notification delivery, analytics, ad targeting, and abuse detection. Kafka's append-only log design enables LinkedIn to replay events when rebuilding search indexes or training new ML models.
Uber built its microservices architecture on event-driven communication. Ride events (requested, matched, started, completed) flow through Kafka to independent services handling pricing, ETA estimation, driver assignment, payment, and receipts. At peak, Uber processes millions of events per second across 4,000+ microservices. The decoupling allows teams to deploy independently, and the event log enables reconstruction of any ride's full history.
Netflix uses event-driven architecture for its content delivery pipeline. When a new title is ingested, events trigger transcoding (120+ video profiles), quality analysis, metadata tagging, and CDN placement. Each step is an independent consumer that can be scaled, retried, and deployed separately. Netflix processes hundreds of billions of events daily through its internal event bus.
How This Shows Up in Interviews
When to bring it up: Mention event-driven architecture whenever a design has fan-out requirements (one event triggers multiple independent reactions), needs temporal decoupling, or involves high-volume data flows. "I'd use an event broker here so the checkout service doesn't need to know about every downstream consumer" is a strong opening.
Depth expected at senior/staff level:
- Know the difference between events, commands, and messages, and when each is appropriate
- Explain choreography vs. orchestration and their failure modes
- Understand partition-level ordering and why total ordering doesn't scale
- Discuss idempotent consumer patterns (deduplication table, natural idempotency)
- Be ready to talk about dead letter queues and how you handle poison messages
- Know when synchronous calls are still the right choice (not everything should be async)
Interview power move: name the event pattern
When you say "I'd emit an OrderPlaced event here," follow it with the specific pattern: "using event-carried state transfer so the downstream services don't need to call back." This shows you know there are different event patterns, not just "throw it on a queue." Interviewers remember candidates who name things precisely.
| Interviewer asks | Strong answer |
|---|---|
| "How do services communicate?" | "For fan-out scenarios, I'd use event-driven communication via Kafka. The producer publishes an event (e.g., OrderPlaced), and consumer groups process it independently. For request/response paths, I'd keep synchronous HTTP/gRPC." |
| "What about ordering?" | "Kafka guarantees order within a partition. I'd use the entity ID (e.g., orderId) as the partition key, so all events for the same entity are processed in order. Cross-entity ordering isn't needed because they're independent." |
| "What if a consumer fails?" | "At-least-once delivery with idempotent consumers. If a consumer crashes before committing its offset, Kafka redelivers. The consumer checks a deduplication table (keyed on eventId) before processing. Failed events go to a dead letter queue after N retries." |
| "Choreography or orchestration?" | "Choreography for simple fan-out (notifications, analytics). Orchestration (saga pattern) for multi-step business flows that need clear compensation logic, like order fulfillment where you might need to rollback inventory if payment fails." |
| "How do you handle schema changes?" | "Backward-compatible changes only: add optional fields, never remove or rename required ones. A schema registry enforces compatibility at the producer boundary, so breaking changes are rejected before they reach any consumer." |
Test Your Understanding
Quick Recap
- Event-driven architecture decouples producers from consumers: the producer publishes an event and moves on, consumers react independently.
- Events are immutable records of things that happened, distinct from commands (requests for action) and messages (generic envelopes).
- Kafka guarantees ordering within a partition. Use the entity ID as the partition key to get per-entity ordering automatically.
- Most brokers deliver at-least-once, so every consumer must be idempotent: processing the same event twice produces the same result as processing it once.
- Use choreography for simple fan-out (notifications, analytics) and orchestration for multi-step business processes that need compensation logic.
- Schema evolution must be backward-compatible: add optional fields, never remove or rename required ones. A schema registry enforces this at the producer boundary.
- Event-driven is not a replacement for synchronous calls. Login, checkout confirmation, and other request/response flows should stay synchronous. Use events for fan-out, decoupling, and async processing.
Related Concepts
- Message Queues: Message queues are the underlying infrastructure that event-driven architecture builds on. EDA is the architecture pattern; queues and brokers are the plumbing.
- CQRS: Command Query Responsibility Segregation pairs naturally with EDA. Events update the write model, and separate read models are built from the event stream.
- Event Sourcing: Event sourcing takes EDA further by making the event log the source of truth. Current state is derived by replaying events, not by querying a mutable database.
- Sync vs. Async: The broader trade-off discussion between synchronous and asynchronous communication patterns, of which EDA is the most structured async approach.
- Saga Pattern: Sagas coordinate multi-step business processes in event-driven systems, handling compensation when a step fails.