Event-driven architecture
How event-driven systems decouple producers from consumers using events as the primary communication mechanism, covering event types, broker topology, ordering guarantees, and the tradeoffs vs. synchronous calls.
TL;DR
- Event-driven architecture (EDA) uses events — immutable records of things that happened — as the primary communication mechanism between services.
- Producers emit events without knowing which consumers exist. Consumers subscribe to event streams and react independently.
- Key properties: decoupling (producers don't know consumers), temporal independence (consumers can be offline), fan-out without coordination (one event reaches N consumers).
- Tradeoffs: harder to trace end-to-end flows, eventual consistency by default, debugging requires distributed tracing or event log inspection.
- Apache Kafka dominates production event streaming; SNS/SQS is common on AWS for simpler use cases.
The Problem It Solves
Your checkout service processes an order. It needs to notify inventory, send a confirmation email, update analytics, and trigger fulfillment. In a synchronous architecture, the checkout service calls each downstream service directly: inventory.reserve(), email.send(), analytics.track(), fulfillment.create(). If the email service is slow (3 seconds), the entire checkout hangs for 3 seconds. If fulfillment is down, the checkout fails entirely, even though the payment already went through.
Worse, every new team that needs order data requires a code change in the checkout service. Product wants a recommendations feed? Add a call to recommendations.update(). Marketing wants abandon-cart tracking? Another call. The checkout service becomes a God service that knows about every downstream consumer, and a failure in any of them can cascade back and break the purchase flow.
This is the problem that event-driven architecture solves: tight coupling between the producer of information and every consumer that needs it.
What Is It?
Event-driven architecture (EDA) is a design approach where services communicate by producing and consuming events, which are immutable records of things that happened. The producer publishes an event ("order was placed") without knowing or caring who consumes it. Consumers subscribe to event streams and react independently.
Think of it like a newspaper versus phone calls. In a synchronous world, the checkout service has to call each interested party individually (like making 5 phone calls, waiting for each one to answer). In an event-driven world, it publishes a single "Order Placed" event (like publishing a headline), and whoever is subscribed reads it independently. Adding a new subscriber doesn't require the publisher to change anything.
The checkout service's job is done as soon as the event is published. If the email service is slow, it processes the event at its own pace. If fulfillment is down, the event waits in the broker until fulfillment recovers. Adding a new consumer (recommendations) requires zero changes to the checkout service.
For your interview: say "I'd use event-driven architecture here to decouple the producer from consumers, so a failure in one downstream doesn't cascade back to the caller" and you've already shown good instincts.
How It Works
Let's trace a single event from production to consumption through Kafka, which is the most common event broker in production systems.
Steps in detail:
- Produce: The checkout service publishes an OrderPlaced event to the order-events Kafka topic. The event includes the order ID as the partition key.
- Partition: Kafka hashes the order ID and appends the event to the corresponding partition. All events for the same order land on the same partition, preserving order per entity.
- Consume: Three consumer groups (inventory, notifications, analytics) each get a copy of the event. Within each group, one consumer instance processes the event and commits its offset.
- Retry on failure: If the email consumer crashes mid-processing, Kafka redelivers the event to another instance in the same group. The event is not lost.
Here's what the producer code looks like:
# Producer: publish event after checkout completes
# (`producer` is assumed to be a configured Kafka client, e.g.
#  kafka-python's KafkaProducer with string serializers for key and value)
import json
event = {
"eventId": "evt_01J8XK...",
"eventType": "order.placed",
"eventVersion": "2.0",
"timestamp": "2024-01-15T10:30:42.123Z",
"source": "checkout-service",
"correlationId": "req_abc123",
"data": {
"orderId": "ord_xyz789",
"userId": "usr_123",
"totalCents": 4995,
"currency": "USD"
}
}
producer.send(
topic="order-events",
key=event["data"]["orderId"], # partition key
value=json.dumps(event)
)
Critical fields: eventId (for deduplication), eventType (for routing), correlationId (for distributed tracing), eventVersion (for schema evolution).
Ordering guarantees
Kafka guarantees order within a partition, not across partitions. If your order-events topic has 12 partitions, events for order 1001 all land on the same partition (because the partition key is orderId), so they're consumed in order. Events for different orders may be processed out of order, but that's fine because they're independent.
Topic: order-events (12 partitions)
Partition 3: [order:1001-placed, order:1001-paid, order:1001-shipped] ← in order
Partition 7: [order:1002-placed, order:1002-paid] ← in order
No guarantee between partition 3 and partition 7 (but no need for one)
The rule of thumb: use the entity ID as the partition key, and you get per-entity ordering for free.
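A minimal sketch of the property above. Kafka actually uses a murmur2 hash, so the exact partition numbers here are illustrative, but any deterministic hash shows why keying on the entity ID yields per-entity ordering:

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Stable hash of the partition key (md5 stands in for Kafka's murmur2)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

events = [
    ("ord_1001", "order.placed"),
    ("ord_1002", "order.placed"),
    ("ord_1001", "order.paid"),
    ("ord_1001", "order.shipped"),
]

# Every event keyed on ord_1001 maps to the same partition,
# so they are appended (and consumed) in publish order.
partitions = {partition_for(key) for key, _ in events if key == "ord_1001"}
print(len(partitions))  # 1
```

Because the hash is deterministic, all three ord_1001 events land on one partition; ord_1002 may land elsewhere, which is fine because the orders are independent.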
Idempotent consumers
Most brokers offer at-least-once delivery. If a consumer crashes after processing an event but before committing its offset, Kafka redelivers the event. Consumers must handle duplicates:
def handle_order_placed(event):
# Idempotent check: already processed?
if db.exists("processed_events", event["eventId"]):
return # Skip duplicate
with db.transaction():
create_fulfillment_record(event["data"]["orderId"])
db.insert("processed_events", event["eventId"])
# Offset commit happens after transaction
I often see teams skip idempotency checks, then discover duplicate fulfillment records or double-charged customers during load tests. At-least-once is the default, and your consumers must handle it.
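A runnable toy version of the pattern above, with an in-memory set standing in for the processed_events table, shows that redelivering the same event is harmless:

```python
processed_events = set()   # stands in for the processed_events table
fulfillments = []          # stands in for fulfillment records

def handle_order_placed(event: dict) -> None:
    # Idempotent check: skip events we've already handled
    if event["eventId"] in processed_events:
        return
    fulfillments.append(event["data"]["orderId"])
    processed_events.add(event["eventId"])

event = {"eventId": "evt_1", "data": {"orderId": "ord_xyz789"}}
handle_order_placed(event)
handle_order_placed(event)  # at-least-once redelivery: a no-op
print(len(fulfillments))    # 1
```

In a real consumer the dedup insert and the business write must share one database transaction, as the pseudocode above shows; otherwise a crash between the two reintroduces the duplicate.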
Key Components
| Component | Role |
|---|---|
| Producer | Service that publishes events. Knows the event schema and the target topic, but has no knowledge of consumers. |
| Event Broker | Infrastructure that receives, stores, and delivers events. Examples: Kafka, RabbitMQ, AWS SNS/SQS, Pulsar. |
| Topic / Queue | Named channel for events. Topics support fan-out (multiple consumers). Queues deliver to one consumer per message. |
| Consumer Group | A set of consumer instances that share the work of consuming a topic. Each partition is assigned to one consumer in the group. |
| Partition | A unit of parallelism within a topic. Events with the same key go to the same partition. More partitions = more throughput. |
| Dead Letter Queue (DLQ) | Holds events that failed processing after N retries. Prevents poison messages from blocking the entire consumer. |
| Schema Registry | Enforces event schema compatibility (Avro, Protobuf). Prevents producers from publishing breaking changes. |
| Offset / Cursor | Tracks a consumer's position in the event log. Enables replay from any point; paired with idempotent consumers or transactions, supports effectively-once processing. |
Types / Variations
Events vs. messages vs. commands
Three related but distinct concepts that are frequently confused:
| Concept | Definition | Direction | Example |
|---|---|---|---|
| Command | Request for an action | Producer → specific consumer | ProcessPayment{orderId: 123} |
| Event | Record of something that happened | Broadcast to anyone interested | OrderPlaced{orderId: 123} |
| Message | Generic envelope for either | Varies | Depends on context |
Commands are targeted and imply an expectation of handling ("do this"). Events are facts about the past ("this happened"). The distinction matters because events are naturally broadcastable while commands are inherently point-to-point.
Broker topologies
Point-to-point (queue): One producer, one consumer per message. Each message is processed exactly once by one consumer instance. Use case: work queues where each task should be processed once (payment processing, email delivery).
Pub/Sub (topic): One producer, many consumers. Each consumer group gets a copy of every event. Use case: integration fan-out where one business event triggers independent reactions in multiple services.
Streaming (log): Kafka-style: events are appended to an immutable, ordered log. Consumers can replay from any offset, not just the latest. This gives you pub/sub semantics plus the ability to reprocess historical events (e.g., rebuild a search index from scratch).
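A toy append-only log makes the replay property concrete: a consumer can rebuild derived state from offset 0 at any time. The `log` list stands in for a Kafka partition; this is an illustration, not a Kafka client:

```python
# An append-only log: each entry is (offset, event)
log = [
    (0, {"type": "order.placed", "orderId": "1001"}),
    (1, {"type": "order.paid",   "orderId": "1001"}),
    (2, {"type": "order.placed", "orderId": "1002"}),
]

def rebuild_index(from_offset: int = 0) -> dict:
    """Replay the log to rebuild a derived view (order -> latest status)."""
    index = {}
    for offset, event in log:
        if offset < from_offset:
            continue
        index[event["orderId"]] = event["type"]
    return index

print(rebuild_index())  # {'1001': 'order.paid', '1002': 'order.placed'}
```

This is exactly how a search index or materialized view gets rebuilt from scratch: reset the consumer's offset to the beginning and let it reprocess the history.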
Event patterns
Event notification: A thin event that says "something happened" with minimal data. Consumers query the source for full details.
OrderPlaced { orderId: "123" }
# Consumer calls: GET /orders/123 to get full order
Simple, but creates runtime coupling: the consumer needs the producer to be available at query time.
Event-carried state transfer: A fat event that includes all the data consumers need. No callbacks required.
OrderPlaced { orderId: "123", userId: "456", items: [...], total: 49.95 }
# Consumer has everything. No callback to order service.
More data on the wire, but fully decoupled. My recommendation: default to event-carried state transfer for most use cases. The bandwidth cost is negligible compared to the operational simplicity.
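As a sketch (names illustrative), a consumer of fat events can maintain its own read model with no calls back to the order service:

```python
local_orders = {}  # consumer's own copy, built purely from events

def on_order_placed(event: dict) -> None:
    # The event carries all needed state; no GET /orders/{id} callback
    local_orders[event["orderId"]] = {
        "userId": event["userId"],
        "total": event["total"],
    }

on_order_placed({"orderId": "123", "userId": "456",
                 "items": ["sku_1"], "total": 49.95})
print(local_orders["123"]["total"])  # 49.95
```

With event notification, the same consumer would instead need the producer to be up and answering queries at processing time.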
Event sourcing: Store the full sequence of events as the source of truth, not just the current state. The current state is derived by replaying events. Covered in depth in the Event Sourcing article.
Choreography vs. orchestration
Two approaches to coordinating multi-step business processes across services:
Choreography: No central coordinator. Each service listens to events and reacts by doing its work and publishing the next event. Simple for 2-3 steps but becomes hard to follow with 5+ services (the "event spaghetti" problem). Debugging a failed flow means tracing events across multiple services.
Orchestration: A central saga orchestrator tells each service what to do and tracks the overall progress. Easier to understand and debug complex flows, but the orchestrator is a single point of logic that becomes complex. Compensation logic (rollbacks) is explicit.
The rule of thumb: use choreography for simple, independent reactions (notifications, analytics). Use orchestration for complex, multi-step business processes that need clear error handling and compensation (order fulfillment, payment workflows).
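A minimal orchestration sketch, using hypothetical in-process service stubs: the orchestrator runs steps in order and, on failure, runs the compensations for the steps that already succeeded, in reverse:

```python
def reserve_inventory(order): order["reserved"] = True
def release_inventory(order): order["reserved"] = False
def charge_payment(order): raise RuntimeError("card declined")
def refund_payment(order): pass

# Each saga step pairs an action with its compensation
SAGA_STEPS = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
]

def run_saga(order: dict) -> bool:
    completed = []
    for action, compensate in SAGA_STEPS:
        try:
            action(order)
            completed.append(compensate)
        except Exception:
            # Roll back completed steps in reverse order
            for comp in reversed(completed):
                comp(order)
            return False
    return True

order = {"orderId": "1001"}
print(run_saga(order))    # False (payment failed)
print(order["reserved"])  # False (inventory released by compensation)
```

In choreography there is no equivalent central loop: each service would react to the previous service's event, and the compensation path would be spread across all of them.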
Schema evolution
Events are immutable once published. Consumers may run old code against new events, so schema changes must be backward-compatible:
Backward compatible (safe): add optional fields
v1: { orderId, userId }
v2: { orderId, userId, couponCode? } ← v1 consumers ignore couponCode
Breaking (dangerous): rename/remove required fields, change types
Requires: dual-publish during migration, coordinated consumer updates
Schema registries (Confluent Schema Registry, AWS Glue) enforce compatibility rules at the producer boundary, rejecting breaking changes before they reach consumers.
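The backward-compatibility rule in code: a consumer written against v1 keeps working on v2 events because it reads the new field with a default (field names taken from the v1/v2 example above):

```python
def handle(event: dict) -> str:
    order_id = event["orderId"]  # required in every version
    user_id = event["userId"]    # required in every version
    # Optional v2 field: read with a default so v1 events still parse
    coupon = event.get("couponCode", None)
    return f"{order_id}:{user_id}:{coupon}"

v1_event = {"orderId": "123", "userId": "456"}
v2_event = {"orderId": "123", "userId": "456", "couponCode": "SAVE10"}

print(handle(v1_event))  # 123:456:None
print(handle(v2_event))  # 123:456:SAVE10
```

Renaming or removing orderId, by contrast, would make the first line raise a KeyError on old consumers, which is exactly what a schema registry's compatibility check rejects before publication.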
Events are not API calls
The biggest misconception in event-driven design is treating events like asynchronous API calls. Events describe what happened ("OrderPlaced"), not what should happen ("ProcessPayment"). If your event names are verbs/commands, you're building a distributed RPC system with extra steps, not an event-driven architecture. The litmus test: can you add a new consumer without changing the producer? If not, you're doing commands disguised as events.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Loose coupling (producers don't know consumers) | Eventual consistency by default (not immediate) |
| Temporal independence (consumers can be offline, events wait) | Harder to trace end-to-end flows across services |
| Natural fan-out to N consumers without producer changes | Debugging requires event log inspection and distributed tracing |
| Absorbs traffic spikes (broker buffers bursts) | Ordering requires careful partition key design |
| Enables independent deployment and scaling | At-least-once delivery means every consumer must handle duplicates |
| Immutable event log enables replay and rebuilding derived state | Schema evolution is a discipline (breaking changes are very costly) |
The fundamental tension is decoupling vs. observability. The more you decouple services, the harder it becomes to understand what's happening across the system. A synchronous call chain is easy to trace (one request, one call stack). An event flowing through 5 independent consumers requires distributed tracing, correlation IDs, and event log inspection. Every event-driven system needs an investment in observability proportional to its decoupling.
When to Use It / When to Avoid It
Use event-driven architecture when:
- Multiple services need to react to the same business event (fan-out)
- The producer doesn't need an immediate response from consumers
- You need temporal decoupling (consumers can process asynchronously)
- You need to absorb traffic spikes without overloading downstream services
- You want to add new consumers without modifying existing producers
- You're building data pipelines, analytics, or notification systems
Avoid event-driven architecture when:
- The caller needs a synchronous response (checkout confirmation, auth token)
- You have 2 services with a simple request/response pattern (just use HTTP)
- Your team doesn't have the operational maturity for distributed tracing and event debugging
- Eventual consistency is genuinely unacceptable for the use case (real-time balance checks)
- The added complexity isn't justified by the decoupling benefit
Here's the honest answer: if you have one producer and one consumer with a simple request/response pattern, a synchronous HTTP call is simpler and you should use it. Event-driven architecture pays off when you have fan-out (multiple consumers), temporal decoupling needs, or high-volume data flows. Don't introduce Kafka for two services that just need an HTTP call.
Real-World Examples
LinkedIn runs one of the world's largest event-driven platforms. Kafka was originally built at LinkedIn and now processes over 7 trillion events per day. Every user action (profile view, connection request, message sent) becomes an event that feeds dozens of downstream systems: news feed ranking, notification delivery, analytics, ad targeting, and abuse detection. Kafka's append-only log design enables LinkedIn to replay events when rebuilding search indexes or training new ML models.
Uber built its microservices architecture on event-driven communication. Ride events (requested, matched, started, completed) flow through Kafka to independent services handling pricing, ETA estimation, driver assignment, payment, and receipts. At peak, Uber processes millions of events per second across 4,000+ microservices. The decoupling allows teams to deploy independently, and the event log enables reconstruction of any ride's full history.
Netflix uses event-driven architecture for its content delivery pipeline. When a new title is ingested, events trigger transcoding (120+ video profiles), quality analysis, metadata tagging, and CDN placement. Each step is an independent consumer that can be scaled, retried, and deployed separately. Netflix processes hundreds of billions of events daily through its internal event bus.
How This Shows Up in Interviews
When to bring it up: Mention event-driven architecture whenever a design has fan-out requirements (one event triggers multiple independent reactions), needs temporal decoupling, or involves high-volume data flows. "I'd use an event broker here so the checkout service doesn't need to know about every downstream consumer" is a strong opening.
Depth expected at senior/staff level:
- Know the difference between events, commands, and messages, and when each is appropriate
- Explain choreography vs. orchestration and their failure modes
- Understand partition-level ordering and why total ordering doesn't scale
- Discuss idempotent consumer patterns (deduplication table, natural idempotency)
- Be ready to talk about dead letter queues and how you handle poison messages
- Know when synchronous calls are still the right choice (not everything should be async)
Interview power move: name the event pattern
When you say "I'd emit an OrderPlaced event here," follow it with the specific pattern: "using event-carried state transfer so the downstream services don't need to call back." This shows you know there are different event patterns, not just "throw it on a queue." Interviewers remember candidates who name things precisely.
| Interviewer asks | Strong answer |
|---|---|
| "How do services communicate?" | "For fan-out scenarios, I'd use event-driven communication via Kafka. The producer publishes an event (e.g., OrderPlaced), and consumer groups process it independently. For request/response paths, I'd keep synchronous HTTP/gRPC." |
| "What about ordering?" | "Kafka guarantees order within a partition. I'd use the entity ID (e.g., orderId) as the partition key, so all events for the same entity are processed in order. Cross-entity ordering isn't needed because they're independent." |
| "What if a consumer fails?" | "At-least-once delivery with idempotent consumers. If a consumer crashes before committing its offset, Kafka redelivers. The consumer checks a deduplication table (keyed on eventId) before processing. Failed events go to a dead letter queue after N retries." |
| "Choreography or orchestration?" | "Choreography for simple fan-out (notifications, analytics). Orchestration (saga pattern) for multi-step business flows that need clear compensation logic, like order fulfillment where you might need to rollback inventory if payment fails." |
| "How do you handle schema changes?" | "Backward-compatible changes only: add optional fields, never remove or rename required ones. A schema registry enforces compatibility at the producer boundary, so breaking changes are rejected before they reach any consumer." |
Test Your Understanding
Quick Recap
- Event-driven architecture decouples producers from consumers: the producer publishes an event and moves on, consumers react independently.
- Events are immutable records of things that happened, distinct from commands (requests for action) and messages (generic envelopes).
- Kafka guarantees ordering within a partition. Use the entity ID as the partition key to get per-entity ordering automatically.
- Most brokers deliver at-least-once, so every consumer must be idempotent: processing the same event twice produces the same result as processing it once.
- Use choreography for simple fan-out (notifications, analytics) and orchestration for multi-step business processes that need compensation logic.
- Schema evolution must be backward-compatible: add optional fields, never remove or rename required ones. A schema registry enforces this at the producer boundary.
- Event-driven is not a replacement for synchronous calls. Login, checkout confirmation, and other request/response flows should stay synchronous. Use events for fan-out, decoupling, and async processing.
Related Concepts
- Message Queues: Message queues are the underlying infrastructure that event-driven architecture builds on. EDA is the architecture pattern; queues and brokers are the plumbing.
- CQRS: Command Query Responsibility Segregation pairs naturally with EDA. Events update the write model, and separate read models are built from the event stream.
- Event Sourcing: Event sourcing takes EDA further by making the event log the source of truth. Current state is derived by replaying events, not by querying a mutable database.
- Sync vs. Async: The broader trade-off discussion between synchronous and asynchronous communication patterns, of which EDA is the most structured async approach.
- Saga Pattern: Sagas coordinate multi-step business processes in event-driven systems, handling compensation when a step fails.