๐Ÿ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Patterns

Event Sourcing

Learn how event sourcing stores state as an immutable event log, enabling audit trails, time travel queries, and replayable projections at any scale.

49 min read · 2026-03-26 · hard · event-sourcing · cqrs · distributed-systems · patterns · hld

TL;DR

  • Event sourcing stores the full history of state changes as an immutable, append-only log of domain events. Current state is derived by replaying that log, not by reading a single mutable row.
  • The core trade-off is query complexity vs. auditability: every read of current state requires replaying events (or querying a cached projection), but any past state is reconstructible, any projection is rebuildable, and bugs in read models are fixable without data migration.
  • Snapshot strategy, aggregate boundaries, and event schema versioning are the three design decisions that determine whether event sourcing is maintainable or nightmarish in production.
  • Event sourcing alone does not give you scalable reads. Pair it with CQRS: write-side aggregates append to an event store; read-side projectors consume events and build optimized query models.
  • Only reach for event sourcing when audit trails, temporal queries, or multi-consumer event fan-out are actual requirements, not future hedges. The complexity is real.

The Problem

A bank processes 10,000 account transactions per day. At month-end, a compliance audit asks: "Show me every state change to Account #78901 in chronological order, who triggered each change, and what the system's understanding of the risk profile was at the moment of each change."

Your database has one row for Account #78901. It reads: balance: $4,230, status: ACTIVE, last_modified: 2026-03-26. Every UPDATE that ran against that row overwrote what came before. The history is gone. The compliance report is impossible without shipping audit-logging infrastructure you should have built from the start.

The same problem surfaces in less obvious places. Customer support asks: "A user says their order was cancelled but we still charged them; can you reconstruct the timeline?" Every UPDATE orders SET status = 'CANCELLED' destroyed the prior state. You are left reading application logs and hoping a developer left traces.

Traditional CRUD is optimized for current state. It answers "what is true now?" at low cost, by design. Every time you UPDATE or DELETE, you deliberately discard history.

At low scale, that is fine. At high scale, with compliance requirements, complex domain logic, or distributed consumers, that discarded history becomes a debt you pay in production incidents.

The silent audit gap most teams discover too late

Teams add explicit audit logging after the first compliance request they cannot fulfill. The problem: audit logging bolted on after the fact covers only the fields someone thought to instrument, and cannot retroactively reconstruct state from before it was added. Event sourcing makes the audit trail the primary data structure, not a secondary concern.

Side-by-side comparison of traditional CRUD (three successive database row states showing values overwritten and history lost) versus event sourcing (three immutable append-only events showing full history preserved, with a replay step computing current balance).
CRUD answers 'what is true now' by design. Every UPDATE destroys what came before. Event sourcing preserves every state transition, making any past state reconstructible on demand.

One-Line Definition

Event sourcing eliminates the audit gap and enables time-travel queries by persisting the state of a domain entity as an immutable, ordered sequence of domain events, where current state is derived by replaying that sequence in order (or from a snapshot checkpoint).


Analogy

Think of a bank's general ledger versus a simple savings account passbook.

A passbook shows one balance: $4,230. Simple to read, instant to query. But if you find an error, you cannot trace it back without an external record.

A general ledger records every single transaction: +$1,000 on March 1, -$200 on March 5, +$3,430 on March 20. The current balance is always the sum of all entries. The ledger is slightly more verbose to query, but the balance you compute is mathematically auditable and verifiable at any point in time.
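In code terms, the ledger makes both the current balance and any historical balance a pure fold over the entries. A minimal sketch using the figures above:

```typescript
// The ledger: an append-only list of signed entries (the figures above).
const ledger: number[] = [1000, -200, 3430];

// The passbook balance is a pure fold over the ledger entries.
const balance = ledger.reduce((sum, entry) => sum + entry, 0); // 4230

// "Balance as of entry N" is the same fold over a prefix: time travel.
const balanceAfter = (n: number): number =>
  ledger.slice(0, n).reduce((sum, entry) => sum + entry, 0);
```

`balanceAfter(2)` reconstructs the balance as of March 5, which is exactly the mechanics event replay relies on.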

Event sourcing is the general ledger for your software system. Your database stops being the passbook (current state only) and becomes the ledger (the full, immutable record of how you got here). My recommendation: use this analogy in your first 30 seconds when explaining event sourcing to an interviewer. It grounds the pattern in a business domain everyone understands immediately.


Solution Walkthrough

Here is what happens when a user places an order in an event-sourced system:

  1. Command arrives: PlaceOrder(orderId: "o-789", customerId: "c-123", items: [...])
  2. Command Handler loads the aggregate: fetches all events for orderId: o-789 from the Event Store and replays them to reconstruct the current Order state.
  3. Aggregate validates the command: checks business invariants: is this customer active? are the items available? does this order already exist?
  4. Aggregate emits events: if validation passes, the aggregate produces OrderPlaced(orderId: "o-789", totalAmount: 59.99, placedAt: "2026-03-26T09:00:00Z"). It does not write to a mutable database table.
  5. Events are persisted to the Event Store: appended to the stream for orderId: o-789. This write is atomic. The event is now the source of truth.
  6. Event bus notifies consumers: the Event Store (or an outbox relay) publishes the event to subscribers. The order projector updates the read model. The inventory service reserves stock. The email service queues a confirmation.
  7. Read models are eventually consistent: projectors consume events asynchronously and update query-optimized views. A GET /orders/o-789 query hits the read model, not the event stream.
flowchart TD
  subgraph WriteModels["Write Side"]
    Client(["Client\nSend Command"])
    CH["Command Handler\nLoad, validate, apply"]
    Agg["Order Aggregate\nBusiness invariants\nOptimistic lock version"]
    ES[("Event Store\nAppend-only\nStream per aggregate")]
  end

  subgraph Fanout["Event Fan-out"]
    EB["Event Bus\nKafka / EventStoreDB\nSubscription groups"]
  end

  subgraph ReadModels["Read Side"]
    P1["Order Projector\nBuilds OrderView table"]
    P2["Analytics Projector\nBuilds RevenueStats"]
    P3["Search Projector\nBuilds Elasticsearch index"]
    RM1[("Orders Read DB\nPostgres JOIN-friendly")]
    RM2[("Analytics DB\nTimescale / BigQuery")]
    RM3[("Search Index\nElasticsearch")]
  end

  Client -->|"PlaceOrder command"| CH
  CH -->|"load stream"| ES
  ES -->|"event history"| Agg
  Agg -->|"OrderPlaced event + version check"| ES
  ES -->|"publish new event"| EB
  EB -->|"event stream"| P1 & P2 & P3
  P1 -->|"upsert view"| RM1
  P2 -->|"increment stats"| RM2
  P3 -->|"index document"| RM3
The right-hand side (projectors and read models) is what CQRS adds. An event-sourced system without CQRS forces every query to replay the event stream, which scales poorly. In practice, nearly every production event sourcing implementation pairs with CQRS for the read side.

Horizontal event timeline with six labeled events (AccountOpened, Deposited, LimitChanged, Withdrawn, Deposited, Withdrawn) at different timestamps. Three vertical arrows point down from different positions on the timeline to state boxes showing the reconstructed balance at each point in time.
Event replay gives you time travel for free: any historical state is computable by replaying events up to that timestamp. This is structurally impossible with CRUD without purpose-built audit logging.

Key Components

| Component | Role |
| --- | --- |
| Command | An instruction to do something. Named in the imperative. May be rejected if invariants are violated. |
| Domain Event | An immutable fact that something happened. Named in the past tense. Always persisted. Never rejected after acceptance. |
| Aggregate | The consistency boundary. Loads its event stream, validates commands against current state, and emits new events. The smart part of the write side. |
| Event Store | The append-only database of events organized into streams (one per aggregate instance). Core operations: AppendToStream(streamId, events, expectedVersion) and ReadStream(streamId). |
| Projection / Read Model | A derived view built by consuming events. Rebuilt by replaying the event store. Can be any shape: SQL table, Redis hash, Elasticsearch document. |
| Snapshot | A serialized checkpoint of an aggregate's state at a given event sequence number. Eliminates replaying from event #1 on every aggregate load. |
| Event Bus | The pub/sub mechanism that distributes new events to subscribed projectors and other services. Can be in-process or distributed. |
| Upcaster | A transformation function that converts old event versions to the current version on read. Enables schema evolution without migrating stored event data. |
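To make the Event Store row concrete, here is a minimal in-memory sketch of the two core operations from the table: an append with an expectedVersion guard, and a stream read. This is an illustrative toy, not a production store; the class and error names are my own.

```typescript
// Minimal event envelope: any named event payload.
type StoredEvent = { type: string; [key: string]: unknown };

class ConcurrencyError extends Error {}

class InMemoryEventStore {
  private streams = new Map<string, StoredEvent[]>();

  // Append with optimistic concurrency: expectedVersion is the stream
  // length the writer observed when it loaded the aggregate.
  appendToStream(streamId: string, events: StoredEvent[], expectedVersion: number): void {
    const stream = this.streams.get(streamId) ?? [];
    if (stream.length !== expectedVersion) {
      throw new ConcurrencyError(
        `stream ${streamId} is at version ${stream.length}, expected ${expectedVersion}`
      );
    }
    this.streams.set(streamId, [...stream, ...events]);
  }

  // Full-stream read: the only query shape an event store must answer.
  readStream(streamId: string): StoredEvent[] {
    return this.streams.get(streamId) ?? [];
  }
}
```

Two writers that both loaded the stream at version 0 will race: the first append succeeds and moves the stream to version 1, the second fails the expectedVersion check.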

Implementation Sketch

// --- sketch ---
type OrderPlaced = { type: 'OrderPlaced'; orderId: string; customerId: string; items: OrderLine[]; totalAmount: number; placedAt: string; version: 1 };
type OrderCancelled = { type: 'OrderCancelled'; orderId: string; reason: string; cancelledAt: string; version: 1 };
type OrderEvent = OrderPlaced | OrderCancelled;

type OrderState = { orderId: string | null; status: 'PENDING' | 'PLACED' | 'CANCELLED'; totalAmount: number };
type PlaceOrderCommand = { orderId: string; customerId: string; items: OrderLine[]; totalAmount: number };

const initialState: OrderState = { orderId: null, status: 'PENDING', totalAmount: 0 };

// Apply a single event to state: pure function, zero side effects
function applyEvent(state: OrderState, event: OrderEvent): OrderState {
  switch (event.type) {
    case 'OrderPlaced':
      return { orderId: event.orderId, status: 'PLACED', totalAmount: event.totalAmount };
    case 'OrderCancelled':
      return { ...state, status: 'CANCELLED' };
    default:
      // Forward-compatible: ignore unknown future event types
      return state;
  }
}

// Reconstruct current state by replaying all events
function rehydrate(events: OrderEvent[], from: OrderState = initialState): OrderState {
  return events.reduce(applyEvent, from);
}
// Command handler: load aggregate, validate, append event
async function placeOrder(command: PlaceOrderCommand): Promise<void> {
  const streamId = `order-${command.orderId}`;

  const events = await eventStore.readStream(streamId);
  const currentState = rehydrate(events);

  if (currentState.status !== 'PENDING') {
    throw new DomainError(`Order ${command.orderId} already exists: ${currentState.status}`);
  }

  const newEvent: OrderPlaced = {
    type: 'OrderPlaced',
    orderId: command.orderId,
    customerId: command.customerId,
    items: command.items,
    totalAmount: command.totalAmount,
    placedAt: new Date().toISOString(),
    version: 1,
  };

  // Optimistic lock: fails if another writer appended since we loaded
  await eventStore.appendToStream(streamId, [newEvent], {
    expectedVersion: events.length,
  });
}

The expectedVersion check on appendToStream is the concurrency control mechanism. If two command handlers load the same aggregate simultaneously and both try to append at version 5, the second write fails because the stream is already at version 6. This is optimistic locking, equivalent to a SQL row version column, applied at the event stream level.
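The retry behavior this implies can be sketched as a small wrapper: on a concurrency failure, reload, revalidate, and reattempt a bounded number of times. The `ConcurrencyError` class and `attempt` callback here are assumptions for illustration, not a specific library's API.

```typescript
class ConcurrencyError extends Error {}

// Retry wrapper: `attempt` should load the stream, validate the command
// against the fresh state, and append. A ConcurrencyError means another
// writer won the race, so the whole load-validate-append cycle reruns.
async function withOptimisticRetry<T>(
  attempt: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (err) {
      if (!(err instanceof ConcurrencyError) || tries >= maxRetries) throw err;
      // Optionally add backoff/jitter here before reloading the stream.
    }
  }
}
```

Note the full cycle reruns, not just the append: the command must be revalidated against the state the winning writer produced.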


CQRS and Event Sourcing Together

These two patterns are independent but almost always paired. Treating them as the same thing is the most common interview mistake.

CQRS without event sourcing: Commands update a normal Postgres table. A domain event is published via the Outbox pattern. Projectors consume it. Simpler to operate. The projector does not have access to the full historical context; only the current row state at the time of the update is available.

Event sourcing without CQRS: Every read loads and replays the aggregate's event stream. No separate read model. Works for small aggregates and low read frequency. Falls apart at scale. Replaying 1,000 events per query at 500 req/s is not a production architecture.

Event sourcing with CQRS: The standard combination. The write side is an event store; the read side is projectors building optimized views for each query shape.

Each projection can be SQL, Redis, or Elasticsearch, shaped independently for its query pattern. When a projector has a bug, fix the code and replay the event stream to rebuild a clean view. No data migration needed.
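The rebuild loop itself is simple: fold the full event history into a fresh view from event one. A sketch, reusing the article's order events, with an in-memory Map standing in for the shadow table:

```typescript
type OrderEvent =
  | { type: 'OrderPlaced'; orderId: string; totalAmount: number }
  | { type: 'OrderCancelled'; orderId: string };

type OrderView = { orderId: string; status: 'PLACED' | 'CANCELLED'; total: number };

// Rebuild a projection from scratch by replaying the full event history.
// Because the fold is deterministic, replaying fixed projector code over
// the same events always yields a clean view.
function rebuildOrderView(history: OrderEvent[]): Map<string, OrderView> {
  const view = new Map<string, OrderView>();
  for (const event of history) {
    if (event.type === 'OrderPlaced') {
      view.set(event.orderId, {
        orderId: event.orderId,
        status: 'PLACED',
        total: event.totalAmount,
      });
    } else if (event.type === 'OrderCancelled') {
      const row = view.get(event.orderId);
      if (row) row.status = 'CANCELLED';
    }
  }
  return view;
}
```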

flowchart LR
  subgraph WriteSide["Write Side (event-sourced)"]
    WClient(["Write Client"])
    WCH["Command Handler"]
    WES[("EventStore\nOne stream per aggregate\nOptimistic concurrency")]
  end

  subgraph Bridge["Event Bridge"]
    Bus["Kafka / Subscription\nAt-least-once delivery\nIdempotent consumers required"]
  end

  subgraph ReadSide["Read Side (CQRS projections)"]
    RP1["Order Projector"]
    RP2["Analytics Projector"]
    RDB1[("OrdersView DB\nSQL / full JOIN support")]
    RDB2[("Analytics Store\nClickHouse / Timescale")]
    RClient(["Read Client\nGET /orders"])
  end

  WClient -->|"command"| WCH
  WCH -->|"append event"| WES
  WES -->|"new event tail"| Bus
  Bus -->|"fanout"| RP1 & RP2
  RP1 -->|"upsert"| RDB1
  RP2 -->|"batch insert"| RDB2
  RDB1 -->|"query result"| RClient
The operational superpower here is projection replayability. If your Order Projector computed totals incorrectly for three months, fix the bug, spin up a shadow projector replaying from event one, validate it looks correct, and swap the table pointer. No data migration. Bugs in read models become deployment problems, not data corruption problems.

For your interview: say "I am adding a CQRS read side with separate projectors for each query shape, and if a projector has a bug, I replay the event stream to rebuild it cleanly." One sentence, then move on.


Event Versioning and Schema Evolution

Event versioning is what separates teams that ship event sourcing successfully from those who regret it. Events are persisted forever. Your event schema will change. This is not a risk to manage; it is a certainty to design for.

Three strategies:

Strategy 1: Upcasting (recommended for most changes)

On read, an upcaster transforms old event versions to the current version before passing them to applyEvent. Stored data is never touched.

type OrderPlacedV1 = { type: 'OrderPlaced'; version: 1; orderId: string; totalAmount: number; };
type OrderPlacedV2 = { type: 'OrderPlaced'; version: 2; orderId: string; totalAmount: number; currency: string; };

function upcastOrderPlaced(raw: OrderPlacedV1 | OrderPlacedV2): OrderPlacedV2 {
  if (!('currency' in raw) || raw.version < 2) {
    // Historical events default to USD โ€” safe backward compat
    return { ...(raw as OrderPlacedV1), currency: 'USD', version: 2 };
  }
  return raw;
}

// Applied in the event store reader pipeline โ€” never modifies stored data
const events = (await eventStore.readStream(streamId))
  .map(e => e.type === 'OrderPlaced' ? upcastOrderPlaced(e as OrderPlacedV1 | OrderPlacedV2) : e);

Strategy 2: Weak schema (additive changes only)

Add new fields as optional. New consumers read the field if present; old consumers ignore it. Works for purely additive changes with no upcaster code.
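A sketch of the weak-schema approach: the new field is optional so historical events still parse, and readers fall back to a default (USD here reuses the assumption from the upcasting example):

```typescript
// Additive change: `currency` is optional so old stored events still parse.
type OrderPlaced = {
  type: 'OrderPlaced';
  orderId: string;
  totalAmount: number;
  currency?: string; // added in a later release; absent on historical events
};

// Consumers read the field if present and apply a documented default otherwise.
function currencyOf(event: OrderPlaced): string {
  return event.currency ?? 'USD';
}
```

The trade-off versus upcasting: the default logic lives in every consumer instead of one upcaster, so it only scales to a small number of additive fields.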

Strategy 3: Event migration (high risk, rarely needed)

A one-time script replays the event store and writes a new version of each event. Use only for breaking changes like renaming event types after exhausting upcasting. The migration creates a window where two versions exist, and one mis-handled edge case corrupts data you cannot recover.

My rule: never run event migration unless you have exhausted upcasting first.


Snapshot Strategy

Two-column comparison: without snapshots, all 500 events are replayed on every aggregate load. With snapshots every 100 events, only the events after the latest snapshot are replayed, drastically reducing load time at high event counts.
Snapshots cap the replay cost at a fixed window. Without them, aggregate load time grows linearly with event count. A high-volume aggregate accumulates millions of events over years.

Snapshots are serialized checkpoints of an aggregate's state at a given sequence number. The loading algorithm becomes: fetch the latest snapshot, then replay only events after that checkpoint.

async function loadAggregate(streamId: string): Promise<OrderState> {
  const snapshot = await snapshotStore.getLatest(streamId);
  const fromVersion = snapshot ? snapshot.version + 1 : 0;
  const fromState   = snapshot ? snapshot.state    : initialState;

  // Only replay events after the snapshot checkpoint
  const recentEvents = await eventStore.readStreamFrom(streamId, fromVersion);
  return rehydrate(recentEvents, fromState);
}

// Snapshotting policy: take a snapshot every 50 events
async function appendWithSnapshot(
  streamId: string,
  event: OrderEvent,
  currentState: OrderState,
  newStreamLength: number
): Promise<void> {
  await eventStore.appendToStream(streamId, [event]);
  if (newStreamLength % 50 === 0) {
    await snapshotStore.save({ streamId, version: newStreamLength, state: currentState });
  }
}

The decision of when to snapshot is not purely about event count; it is about replay cost: event_count × avg_apply_time_ms. The formula: snapshot_interval = target_load_ms / avg_apply_time_ms. For a 25ms target at 0.17ms per event: 25 / 0.17 ≈ 147 events, which you round up to a convenient interval of 150.
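The sizing arithmetic as a helper, using the article's example figures (25ms target, 0.17ms per event):

```typescript
// snapshot_interval = target_load_ms / avg_apply_time_ms
// Keeps worst-case aggregate load (snapshot read + replay) near the target.
function snapshotInterval(targetLoadMs: number, avgApplyTimeMs: number): number {
  return targetLoadMs / avgApplyTimeMs;
}

const interval = snapshotInterval(25, 0.17); // ~147; round to a convenient 150
```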

A 3-by-3 decision matrix. X-axis: event stream depth (1-50, 51-200, 200+ events). Y-axis: aggregate reads per second (less than 1, 1-10, more than 10). Cells are color-coded from green (no snapshots needed) to red (snapshots required plus caching).
Snapshot necessity depends on both event depth and read frequency. A deep stream loaded rarely can wait. A deep stream loaded 10+ times per second needs snapshots and an aggregate cache layer.

Aggregate Boundary Design

Three-column comparison: 'Too Large' aggregate (single OrderAggregate containing order lines, shipping, payment, fraud score, and returns), 'Right-Sized' (separate Order, Shipment, Payment, and FraudReview aggregates), and 'The Rule' column explaining one business invariant per aggregate.
Aggregate size is the most consequential design decision in event sourcing. Too large creates stream contention under load. Too small breaks transaction boundaries. The boundary should follow business consistency requirements, not data ownership.

The aggregate boundary determines the transaction scope, the concurrency unit, and the stream size. It is the decision that is hardest to change after launch.

The rule: if two pieces of state do not need to be updated in the same transaction to preserve a business invariant, they belong in separate aggregates. A status change and a totalAmount calculation must be consistent together in an OrderAggregate. A shipping tracking update and a payment capture are independent. They belong in separate aggregates.

I often see teams hit contention and slowness in an event-sourced system where the root cause is an aggregate that swallowed too many concepts. The tell: commands on entirely different aspects of an entity all lock the same stream and serialize against each other.


When It Shines

When does event sourcing actually pay for itself? The honest answer: when the requirements explicitly include one or more of these characteristics. Not when you are hedging against future requirements.

Use event sourcing when:

  • Compliance or regulatory audit trails are a hard requirement (finance, healthcare, legal).
  • You need temporal queries: "What was the state of X at time T?" with guaranteed correctness, not best-effort logging.
  • Your domain model has complex state transitions where replaying history catches edge cases that simple update logic misses.
  • You have three or more downstream consumers of state changes needing the full event context, not just "something changed."
  • Your team needs projection replayability: the ability to retroactively fix read model bugs without data migration.
  • You are building event-driven microservices where downstream services need to bootstrap from the full event history.

Skip event sourcing when:

  • Your domain is simple CRUD: user profiles, content management, basic inventory. Event sourcing adds complexity with no benefit when audit trails are not required.
  • Your team is new to eventual consistency. The gap between event write time and projection catch-up causes bugs until the team has internalized the model.
  • Write latency is your primary concern. Loading an event stream adds read overhead to every command operation.
  • Your schema is simple and stable. Upcasting strategies are only necessary if schemas evolve.

The rule: if you cannot name a specific query or audit requirement that event sourcing uniquely enables, use plain CRUD.

Decision flowchart asking: Do you need a full audit trail or event replay? Do you need multiple independent read models? Is your team comfortable with eventual consistency? Each yes/no path leads to one of four outcomes: Event Sourcing plus CQRS, CQRS without ES, CQRS plus Outbox, or Standard CRUD.
Use this decision tree before committing to event sourcing. The complexity is real. Each no answer is a signal that a simpler pattern exists for your requirements.

Failure Modes and Pitfalls

1. Aggregate boundaries are too large

The most common early mistake. Teams design an OrderAggregate that contains order lines, shipping details, payment info, fraud score, and return history. Every command on any part of the order locks the entire stream.

At 500 commands/sec on the same aggregate, constant retry contention emerges. The fix: split into Order (placement, cancellation), Shipment (tracking, delivery), and Payment (capture, refund), each with its own stream.

I often see teams blame slow event store performance when the real culprit is an oversized aggregate causing a retry storm at the optimistic lock boundary.

2. Projectors without idempotency

An event bus delivers at-least-once. Your projector will receive the same event twice, guaranteed, eventually. A non-idempotent projector double-counts, duplicates records, or corrupts state silently.

Every projector must record the last processed event sequence number and skip events already applied. The simplest pattern: an ON CONFLICT DO UPDATE ... WHERE last_event_seq < EXCLUDED.last_event_seq guard on every upsert.

// `sequenceNumber` comes from the event store's envelope metadata, not the domain payload
async function applyOrderPlaced(event: OrderPlaced & { sequenceNumber: number }): Promise<void> {
  await db.query(
    `INSERT INTO orders_view (order_id, status, total, last_event_seq)
     VALUES ($1, 'PLACED', $2, $3)
     ON CONFLICT (order_id) DO UPDATE
       SET status = EXCLUDED.status,
           total  = EXCLUDED.total,
           last_event_seq = EXCLUDED.last_event_seq
     WHERE orders_view.last_event_seq < EXCLUDED.last_event_seq`,
    [event.orderId, event.totalAmount, event.sequenceNumber]
  );
}

Projectors without idempotency are the single most common bug I see in event-sourced systems. Fix it before you ship, not after the first double-charge incident.

3. Querying the event store for reads

An event store is optimized for sequential stream appends and full-stream reads. Running SELECT * FROM events WHERE customer_id = $1 is an O(N) full scan. At 100M events, that scan takes minutes.

Build projections for every query shape you need. The event store answers "give me all events for aggregate X" and nothing else. If you are running ad-hoc queries against the event store, you have missed the point of the read-side architecture.

4. Unbounded event schema changes without upcasters

If your code throws an exception when it encounters an unknown event field from a newer version, your system breaks the moment you deploy an update and old consumers are still running. Never remove or rename fields in an event schema without deploying an upcaster simultaneously.

Safe pattern: add new fields as optional, deploy consumers that handle both versions, then optionally migrate. Never run migration before consumer compatibility is deployed.

5. Missing retention policy on the event store

Events accumulate forever by default. A financial system running five years with 10M accounts at 5 events per day generates 91 billion events. At 200 bytes per event, that is 18TB in Postgres.

Without archival to cold storage, query planning degrades as the events table statistics become unreliable at that scale.

Use stream archival: events older than your retention window move to cold storage (S3, Glacier). Keep recent events in your primary store. Snapshot all aggregates before archiving their early events. Verify you can still rehydrate aggregates from archives before you need to in production.


Trade-offs

| Benefit | Cost |
| --- | --- |
| Complete audit trail with zero additional code (structurally impossible to lose history) | Every read of current state requires event replay or a projection; immediate consistency is never free |
| Retroactive projection replayability (bugs in read models are deployment problems, not data migrations) | Operational complexity: event store, projectors, snapshot management, and schema versioning all need active ownership |
| Time travel queries: reconstruct state at any past moment with no extra infrastructure | Eventual consistency between write side and read side is permanent, not a temporary phase |
| Decoupled consumers: new downstream services bootstrap from full event history | Aggregate boundary design is a one-way door; splitting aggregates later requires migrating event streams |
| Easier testing: given a sequence of events, test the aggregate in memory with zero infrastructure | Teams unfamiliar with DDD-style aggregate design write anemic aggregates, losing most of the benefit |
| Natural fit for distributed microservices where multiple teams consume the same domain events | Debugging requires reasoning about three different clocks: event time, projection time, and query time |

The fundamental tension here is historical completeness vs. query simplicity. CRUD wins on "how do I get current state fast?" Event sourcing wins on "how do I understand everything that happened and recover from mistakes?" Pick based on whether your domain's value lives in the present state or in the history of how you got there.


Real-World Examples

Regulatory systems at major banks

Every large bank and insurance company that has moved to microservices in the past decade has implemented event sourcing for core domain entities: account transactions, policy changes, claim events. The driver is always regulatory: GDPR (right to erasure via crypto-shredding), PCI-DSS (transaction audit trail immutability), and SOX (financial record integrity).

JP Morgan's internal ledger platform replays hundreds of billions of events for compliance reconciliation. The non-obvious insight: at that scale, they use changelog compaction (equivalent to Kafka log compaction) to reduce the active event log to only deltas since the last regulatory checkpoint, not full replay from day one.

GitHub: the pull request timeline

GitHub hosts over 420 million repositories and processes tens of millions of PR events per day. GitHub's pull request timeline behaves as though it were event-sourced: every comment, review, CI status, label change, and approval is preserved as an immutable record, and the timeline is a direct projection of that sequence.

The non-obvious lesson: teams that store only current state and try to reverse-engineer a timeline from application logs spend weeks building what this architecture delivers structurally. The PR timeline is not a separate feature built on top of storage. It falls out of the storage model for free.

LinkedIn: 1 trillion events per day

LinkedIn processes over one trillion events per day through their internal EventBus system. Their user activity feed (connections, likes, posts, profile views) is a set of projections built from an event stream. The key insight from their published engineering posts: by separating event production (write side) from feed rendering (read side), they iterated on feed-ranking algorithms (which are just different projectors) without touching core activity recording logic. New ranking model means a new projector. Old projector runs in parallel. A/B test results. Swap when confident. Zero risk of corrupting the canonical event log. This is projection replayability at one-trillion-events-per-day scale.


How This Shows Up in Interviews

When to bring it up proactively: Mention event sourcing as soon as any of these signals appear: the interviewer specifies a compliance audit trail, asks about "what did the state look like at time T," describes three or more independent downstream consumers that each need the full change history, or mentions that projection bugs have previously required painful data migrations. If none of those apply, do not introduce it: plain CRUD with an Outbox pattern is almost always the right default, and proposing event sourcing for a simple CRUD domain signals poor judgment.

Here is what separates the majority of candidates from the top 5% on event sourcing questions: most people can name the pattern and say "immutable events." Very few can explain aggregate boundaries, concurrency control, and projection replayability in the same breath.

My recommendation: as soon as you draw an event store, immediately call out the write-side/read-side split (CQRS), name optimistic locking by version, and describe what happens when a projector has a bug. That three-part answer shows you have thought about the operational reality, not just the happy path.

The phrase that signals production experience

Say: "With event sourcing, the event store holds the source of truth, and projections are disposable derived views. A projection bug means redeploying the projector and replaying the stream: it is a deployment problem, not a data corruption problem." Follow that with: "I would size snapshot intervals so aggregate load stays under 25ms. At 0.17ms per event, that is a snapshot every 150 events."

Depth expected at senior/staff level:

  • Explain the command/event boundary: commands can be rejected; domain events are immutable facts stored once a command is accepted.
  • Name optimistic locking via expectedVersion on appendToStream and describe the retry behavior when concurrent writers conflict.
  • Cover event schema evolution: upcasting on read vs. event migration. Know when upcasting fails (type rename, removed required field) and what to do.
  • Explain the snapshot strategy and the formula for snapshot interval sizing.
  • Describe idempotent projectors and why at-least-once delivery makes them mandatory.
  • Distinguish CQRS from event sourcing: you can have CQRS without event sourcing (update DB plus outbox plus projections) and event sourcing without CQRS (small-scale, low-read systems).
  • For multi-aggregate coordination: why you need a saga or process manager, because aggregates cannot share a transaction.

The misconception that costs the most interview points

Many candidates describe event sourcing as "just like Kafka" or "event streaming." They are different things. Kafka is a durable distributed message bus optimized for throughput. An event store is an append-only database with stream-per-aggregate semantics and optimistic concurrency control. You can use Kafka as an event store with caveats, but the two solve different problems. Conflating them signals surface-level knowledge.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
| --- | --- |
| "How do you handle concurrent writes to the same aggregate?" | "Optimistic concurrency: appendToStream takes an expectedVersion. If two command handlers both loaded version 10 and try to append at 10, the second write fails with a concurrency exception. The caller retries: reloads the stream, revalidates the command against the new state, and appends again. Identical semantics to a SQL optimistic lock, applied at the event stream level." |
| "How do you handle GDPR right to erasure?" | "You do not delete events; they are immutable. Standard approach: crypto-shredding. Encrypt personal-data fields with a per-user key. Delete the key and events containing that data become permanently illegible while the event structure remains intact. This preserves the event log for aggregate replays while making personal data irrecoverable." |
| "What is the performance impact of replaying events on every read?" | "This is exactly why you pair event sourcing with CQRS. The write side loads aggregates by replaying their stream, bounded by snapshot frequency. Reads never touch the event store; they query projections in Postgres, Redis, or Elasticsearch. A properly designed system has the vast majority of reads hitting optimized read models with zero event replay at query time." |
| "How do you debug an incorrect projection?" | "Replay the event stream in a background process, compute the projection state at each event boundary, and find where the result diverges. Fix the projector code, drop the projection table, replay from event one into a shadow table, validate, and swap. This is operationally much safer than debugging a CRUD system where the state was already overwritten." |
| "Why would you NOT use event sourcing?" | "When the domain has no meaningful history. A user preferences record (set once, occasionally updated, nobody cares what it was six months ago) is a pure CRUD concern. Event sourcing adds an event store, projectors, snapshot management, and schema versioning to a problem that a Postgres row solves in five minutes. Match the tool to the requirement." |

Quick Recap

  1. Event sourcing persists state as an immutable, append-only sequence of domain events. Current state is derived by replaying that sequence, not by reading a mutable row.
  2. Commands are instructions that can be rejected; domain events are immutable facts stored once a command is accepted. Mixing these semantics is the most common architectural error in new implementations.
  3. The aggregate boundary is the most consequential design decision: it defines the transaction scope, the contention unit, and the replay performance ceiling. Design boundaries around business invariants, not around data ownership.
  4. Snapshot every N events to keep aggregate load time bounded. The formula: snapshot_interval = target_load_ms / avg_apply_time_ms. Revisit snapshot intervals when aggregate throughput changes.
  5. Projections are disposable derived views, not permanent data. Bugs in projectors are fixed by redeploying and replaying the event stream. No data migration.
  6. Idempotent projectors with explicit checkpoints are mandatory, not optional. At-least-once event delivery means every projector will eventually receive the same event twice.
  7. Event sourcing earns its complexity when audit trails, temporal queries, or multi-consumer fan-out are real requirements. Use plain CRUD when they are not.
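The snapshot-interval formula from point 4 is a one-liner; here it is applied to the illustrative numbers used earlier in the interview section (25 ms target load time, 0.17 ms per event applied).

```python
def snapshot_interval(target_load_ms, avg_apply_time_ms):
    """Events between snapshots so aggregate load stays under the target."""
    return int(target_load_ms / avg_apply_time_ms)

interval = snapshot_interval(25, 0.17)  # 147; round to a clean 150 in practice
```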

Variants

Event Sourcing without CQRS: Queries load and replay the aggregate's event stream directly. Works for small aggregates under 50 events and infrequent reads. Becomes unworkable as aggregates grow past a few hundred events at any meaningful read frequency.
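This variant in miniature: every read loads the stream and folds it into current state. A minimal sketch with an assumed account domain; fine at tens of events, painful at hundreds.

```python
def replay(events):
    """Fold an event stream into current aggregate state on every read."""
    state = {"balance": 0, "status": "OPEN"}
    for event in events:
        if event["type"] == "Deposited":
            state["balance"] += event["amount"]
        elif event["type"] == "Withdrawn":
            state["balance"] -= event["amount"]
        elif event["type"] == "Closed":
            state["status"] = "CLOSED"
    return state
```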

Kafka as Event Log: Some teams use Kafka as their event store, using one partition per aggregate type with a separate offset index (Elasticsearch, DynamoDB) for per-aggregate stream reads. Pragmatic for teams already invested in Kafka infrastructure. The trade-off: no native optimistic concurrency enforcement; you must build it separately via a version table in Postgres.

Event Sourcing with the Outbox Pattern: When events need to trigger actions in external systems, the Outbox pattern bridges the gap: the event store write and the outbox entry write happen in the same local transaction, guaranteeing at-least-once delivery to external systems without 2-phase commit.
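The key mechanic is the single local transaction covering both writes. A sketch with sqlite3 standing in for the event store's database (table names and the relay that drains `outbox` are assumptions); both rows commit or neither does, so no 2-phase commit is needed.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (stream_id TEXT, version INTEGER, payload TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, "
             "published INTEGER DEFAULT 0)")

def append_with_outbox(stream_id, version, event):
    """Append the event and enqueue it for external delivery atomically."""
    payload = json.dumps(event)
    with conn:  # one transaction: both inserts commit together or roll back
        conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                     (stream_id, version, payload))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
```

A separate relay process would poll `outbox` for `published = 0` rows, push them to the broker, and mark them published, giving at-least-once delivery.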

Bi-temporal Event Sourcing: Beyond storing "when this event was recorded," also store "when this event was valid in the business domain." A retrospective salary adjustment effective next quarter has a different business effective date than its system recording date. Bi-temporal event sourcing captures both dimensions. Used in insurance, banking, and payroll systems where business effective dates regularly diverge from system recording dates.
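The two time dimensions can be made concrete with the salary example above. A sketch with assumed names (`SalaryAdjusted`, `salary_as_of`); `recorded_at` is system time, `effective_at` is business time, and queries fold only events whose effective date has arrived.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SalaryAdjusted:
    employee_id: str
    new_salary: int
    recorded_at: date   # when the system stored the event
    effective_at: date  # when the change applies in the business domain

def salary_as_of(events, business_date):
    """Latest salary effective on business_date; events sorted by effective_at."""
    applicable = [e for e in events if e.effective_at <= business_date]
    return applicable[-1].new_salary if applicable else None
```

An adjustment recorded on 2026-03-26 but effective 2026-07-01 is invisible to April queries and visible to July ones, which is the behavior insurance and payroll systems need.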


Related Patterns

  • CQRS โ€” Event sourcing's natural partner. CQRS provides optimized read-side projections that make event sourcing practical at scale. Almost always used together in production systems.
  • Outbox Pattern โ€” The safe bridge between an event store write and publishing to an external message broker. Prevents the dual-write problem when events must reach both the event store and Kafka.
  • Saga Pattern โ€” The coordination mechanism for commands spanning multiple aggregates. Sagas are required any time multi-aggregate consistency is needed, since aggregates cannot share a transaction.
  • Message Queues โ€” The infrastructure that event projectors consume. Understanding durable consumers, consumer groups, and delivery guarantees is prerequisite knowledge for operating event-sourced systems at scale.
  • Databases โ€” Understanding write amplification, index maintenance cost, and MVCC semantics of your event store's underlying database determines your operational ceiling for event throughput.
