Change data capture
How CDC extracts database change events using log tailing, triggers, or polling. Covers Debezium with PostgreSQL WAL, the dual-write problem CDC solves, use cases for CDC in event-driven architectures, and the tradeoffs of each approach.
TL;DR
- Change Data Capture (CDC) reads the database's internal transaction log to emit events for every insert, update, and delete, eliminating the need for application code to publish events separately.
- The preferred approach is log tailing (reading the Write-Ahead Log), which captures all changes with zero application code changes and minimal database overhead.
- CDC solves the dual-write problem: writing to a database and publishing an event are two separate operations that can partially fail, leaving systems permanently out of sync.
- Debezium (the most popular CDC tool) connects to PostgreSQL/MySQL via logical replication and streams row-level changes to Kafka topics in near-real-time.
- Think of CDC as event-driven architecture's zero-code option: your existing database becomes an event source without changing application code.
The Problem
Your e-commerce platform stores orders in PostgreSQL. Three downstream systems need to know when an order changes: the search index (Elasticsearch), the analytics pipeline (data warehouse), and the notification service (sends shipping emails).
The obvious approach: after every database write, publish an event to Kafka from your application code. This works fine on a whiteboard. In production, it breaks in subtle ways.
Here's the subtle break: the database write succeeds but the Kafka publish fails. PostgreSQL says the order shipped, but Elasticsearch, the analytics pipeline, and the notification service have no idea. The customer never gets their shipping email. The dashboard undercounts shipped orders. Support gets tickets about missing status updates.
You could wrap both in a distributed transaction, but 2PC is slow, fragile, and most message brokers don't support XA transactions. You could retry the Kafka publish, but if the app crashes between the DB commit and the retry, the event is permanently lost. You could add a background job that scans for "unsynced" rows, but now you're building a bespoke CDC system with all its edge cases.
This is the dual-write problem: two writes to two different systems can't be made atomic without distributed transactions. Either one can fail while the other succeeds. The failure window might be small, but across millions of requests per day, "small" means "happens every week."
CDC takes a completely different approach. Instead of the application emitting events, a separate process reads the database's own transaction log and publishes events from that. The database write is the only write. The event is derived from it, not duplicated alongside it.
With dual-write, the application makes two independent writes that can partially fail. With CDC, the application makes one write, and the event is derived automatically from the database's own log. There's no second write to fail.
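A toy sketch makes the contrast concrete. This is plain TypeScript with in-memory arrays standing in for the database and the broker (all names here are illustrative, not a real API): with dual-write the event stream can silently diverge from the database, while a log-derived stream cannot.

```typescript
// Toy in-memory model: the "database" and the "event stream" are arrays.
type Row = { id: number; status: string };

// Dual-write: two independent writes. The second can fail after the first commits.
function dualWrite(db: Row[], events: Row[], row: Row, publishFails: boolean): void {
  db.push(row); // write 1: the DB commit succeeds
  if (!publishFails) {
    events.push(row); // write 2: the broker publish, which can fail independently
  }
}

// CDC: events are derived from the DB's own log, so they can never diverge from it.
function deriveEvents(dbLog: Row[]): Row[] {
  return [...dbLog]; // one event per committed change, by construction
}

const db: Row[] = [];
const events: Row[] = [];
dualWrite(db, events, { id: 42, status: "shipped" }, true); // publish fails
// db now holds the shipped order, but events is empty: silent divergence.
// deriveEvents(db) always mirrors db exactly, regardless of broker failures.
```

The point of the sketch: `deriveEvents` cannot disagree with `db` because it has no independent failure mode, which is exactly the property CDC buys you.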
One-Line Definition
Change Data Capture detects and streams row-level database changes by reading the database's internal transaction log, turning your database into a reliable event source without modifying application code.
Analogy
Think of a bank teller and the bank's ledger. Every transaction the teller processes gets recorded in a central ledger. An auditor doesn't ask the teller to report every transaction separately (that's the dual-write approach). Instead, the auditor reads the ledger directly.
The auditor never misses a transaction because the ledger is the source of truth. The teller doesn't need to do any extra work. If the bank hires a new compliance officer, that person also reads the same ledger, no changes needed to how the teller operates.
CDC works the same way. The database's Write-Ahead Log (WAL) is the ledger. Debezium is the auditor. Adding a new downstream consumer is like hiring another compliance officer: just point them at the same log. No one needs to change how they do their job.
The key insight from this analogy: the ledger (WAL) exists regardless of whether anyone reads it. The database writes to the WAL for its own durability purposes. CDC just taps into what's already there.
Solution Walkthrough
CDC has three main implementation approaches. Log tailing is the production standard; triggers and polling are alternatives for constrained environments.
Approach 1: Log Tailing (the standard)
Every major relational database writes changes to a sequential log before applying them to data files. PostgreSQL calls this the Write-Ahead Log (WAL). MySQL calls it the binlog. This log exists for crash recovery, but CDC repurposes it as an event stream.
The flow works like this:
- Application writes to the database normally. No code changes needed.
- PostgreSQL writes the change to the WAL (it does this anyway for durability).
- Debezium connects via a logical replication slot and reads decoded WAL entries.
- Each WAL entry becomes a structured JSON event published to a Kafka topic.
- Downstream consumers read from Kafka at their own pace.
Each event contains the before-image, after-image, operation type (c/u/d), source table, timestamp, and the Log Sequence Number (LSN) for ordering.
```json
{
  "op": "u",
  "before": { "id": 42, "status": "processing", "updated_at": "..." },
  "after": { "id": 42, "status": "shipped", "updated_at": "..." },
  "source": {
    "table": "orders",
    "lsn": 234881024,
    "ts_ms": 1712000000000
  }
}
```
Why this works: the WAL is written atomically with the transaction. If the transaction commits, the WAL entry exists. If it rolls back, the WAL entry is discarded. There's no window where the database has a change but the event doesn't exist. The dual-write problem vanishes.
PostgreSQL configuration for logical replication requires two changes:
```sql
-- Enable logical replication (requires restart)
ALTER SYSTEM SET wal_level = logical;

-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_orders', 'pgoutput');
```
The pgoutput plugin is PostgreSQL's built-in logical decoding output. Debezium connects to this slot, and PostgreSQL streams decoded WAL entries in real-time. The replication slot also acts as a bookmark: PostgreSQL retains WAL segments until Debezium confirms it has read them, so no events are lost even if Debezium restarts.
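The bookmark behavior reduces to simple logic: the slot's confirmed position determines exactly which WAL entries the database must retain, and where the consumer resumes after a crash. A simplified TypeScript model of that invariant (not the real replication protocol, just its bookkeeping):

```typescript
// Simplified model of a replication slot as a confirmed-LSN bookmark.
type WalEntry = { lsn: number; change: string };

// The database must retain every entry the consumer has not yet confirmed.
function retainedWal(wal: WalEntry[], confirmedLsn: number): WalEntry[] {
  return wal.filter(e => e.lsn > confirmedLsn);
}

// After a restart, the consumer resumes from the confirmed LSN: nothing is lost.
function resumeFrom(wal: WalEntry[], confirmedLsn: number): WalEntry[] {
  return retainedWal(wal, confirmedLsn);
}

const wal: WalEntry[] = [
  { lsn: 100, change: "insert order 41" },
  { lsn: 200, change: "update order 41" },
  { lsn: 300, change: "insert order 42" },
];

// Consumer confirmed through LSN 200, then crashed: only entries past 200 remain pending.
const pending = resumeFrom(wal, 200);
```

The same invariant explains the slot-bloat failure mode later in this article: if the confirmed LSN never advances, `retainedWal` grows without bound.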
Approach 2: Database Triggers
For databases that don't support logical replication (or environments where Debezium can't be deployed), triggers capture changes inside the same transaction.
```sql
CREATE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO change_events (table_name, op, payload, created_at)
  VALUES (
    TG_TABLE_NAME,
    TG_OP,
    -- NEW is NULL for DELETEs in row-level triggers, so capture OLD instead
    CASE WHEN TG_OP = 'DELETE' THEN row_to_json(OLD) ELSE row_to_json(NEW) END,
    now()
  );
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cdc_orders
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_change();
```
A polling process reads change_events and publishes to Kafka. The trigger fires within the same transaction as the business write, so you get atomicity. The trade-off is performance: every write includes an extra INSERT, and polling adds 1-5 seconds of latency.
I've seen this approach work well in environments where the infrastructure team won't approve Debezium but the team still needs change events. It's not as clean as WAL-based CDC, but it's entirely self-contained within the database.
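The polling half of this approach reduces to checkpointed draining of change_events. A TypeScript sketch of that core logic (the actual SQL read and Kafka publish are omitted; `drainChanges` and the row shape are illustrative names):

```typescript
// Rows as read from the change_events table populated by the trigger.
type ChangeRow = { id: number; tableName: string; op: string; payload: unknown };

// Return rows newer than the checkpoint, in insertion order, plus the new checkpoint.
// Persisting the checkpoint only after a successful publish gives at-least-once delivery.
function drainChanges(
  rows: ChangeRow[],
  lastPublishedId: number
): { toPublish: ChangeRow[]; checkpoint: number } {
  const toPublish = rows
    .filter(r => r.id > lastPublishedId)
    .sort((a, b) => a.id - b.id);
  const checkpoint =
    toPublish.length > 0 ? toPublish[toPublish.length - 1].id : lastPublishedId;
  return { toPublish, checkpoint };
}

const rows: ChangeRow[] = [
  { id: 1, tableName: "orders", op: "INSERT", payload: {} },
  { id: 2, tableName: "orders", op: "UPDATE", payload: {} },
  { id: 3, tableName: "orders", op: "DELETE", payload: {} },
];
const batch = drainChanges(rows, 1); // already published through id 1
```

Because the checkpoint only advances after publishing succeeds, a crash between publish and checkpoint re-delivers the batch, which is why consumers must be idempotent here too.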
Approach 3: Application-Level Polling
Query the source table for rows modified since the last checkpoint:
```sql
SELECT * FROM orders WHERE updated_at > :last_poll_time;
```
Simplest approach, but it misses DELETEs (the row is gone), can miss rapid updates between poll intervals, and adds load to the source database with repeated full-range scans. You also need a monotonically increasing column (updated_at or a sequence) on every table you want to capture. I'd avoid this for anything production-critical, but it's a reasonable starting point for a proof of concept.
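Both weaknesses are easy to demonstrate with a toy simulation (illustrative TypeScript, not production code): polling sees only the row's state at poll time, so intermediate updates and deletes between polls are invisible.

```typescript
// Simulates why timestamp polling loses information: it observes only the
// current row state at poll time, not the history of changes.
type OrderRow = { id: number; status: string; updatedAt: number };

function poll(table: OrderRow[], lastPollTime: number): OrderRow[] {
  return table.filter(r => r.updatedAt > lastPollTime);
}

const table: OrderRow[] = [{ id: 42, status: "processing", updatedAt: 10 }];

// Between polls the row is updated twice...
table[0] = { id: 42, status: "packed", updatedAt: 20 };  // intermediate state
table[0] = { id: 42, status: "shipped", updatedAt: 30 }; // final state
// ...and a deleted row simply isn't in the table anymore, so polling can't see it.

const changes = poll(table, 10);
// One row comes back, in its final state only: the "packed" transition
// and any DELETEs in the interval are lost.
```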
Comparison
| Dimension | Log Tailing (WAL) | Triggers | Polling |
|---|---|---|---|
| Latency | Sub-second | 1-5 seconds | Seconds to minutes |
| DB overhead | Minimal (reads existing log) | Moderate (extra INSERT per write) | High (repeated queries) |
| Captures DELETEs | Yes | Yes | No (row is gone) |
| Schema changes needed | None | Trigger per table | updated_at column required |
| Tooling required | Debezium + Kafka Connect | None (built-in SQL) | None (simple queries) |
| Ordering guarantees | LSN-ordered (strong) | Transaction-ordered (strong) | Timestamp-ordered (weak) |
For most production systems, log tailing with Debezium is the clear winner.
CDC vs the Outbox Pattern
Both CDC and the outbox pattern solve the dual-write problem, but they take different paths. Understanding the distinction matters for interviews.
| Dimension | CDC (Log Tailing) | Outbox Pattern |
|---|---|---|
| Event semantics | Database-level ("row 42 changed") | Application-intent ("OrderShipped") |
| Code changes required | None (reads existing WAL) | Must write to outbox table in every transaction |
| Schema coupling | DB schema = event schema | Event schema is independent of DB schema |
| Captures direct SQL changes | Yes (migrations, scripts, admin queries) | No (only changes through application code) |
| Event filtering | All changes emitted (need consumer-side filtering) | Publish only the events you want |
| Infrastructure required | Debezium + Kafka Connect + Kafka | Polling job or CDC on the outbox table + Kafka |
| Best for | Operational sync (search, cache, analytics) | Inter-service business events |
The rule of thumb: use CDC for "keep this other data store in sync" and the outbox pattern for "tell other services what happened." Many production systems use both.
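The outbox pattern's core guarantee is transactional atomicity: the business row and the intent event commit together or not at all. A toy TypeScript model of that guarantee (the staged-write transaction here is a simulation, not a real database API):

```typescript
// Toy model of the outbox pattern: the business row and the intent event
// share one transaction, so they commit atomically or not at all.
type Order = { id: number; status: string };
type OutboxRow = { eventType: string; payload: unknown };

function shipOrderTx(
  orders: Order[],
  outbox: OutboxRow[],
  orderId: number,
  failMidway: boolean
): boolean {
  // Stage both writes; commit applies them together, abort applies neither.
  const stagedOrder: Order = { id: orderId, status: "shipped" };
  const stagedEvent: OutboxRow = { eventType: "OrderShipped", payload: { orderId } };
  if (failMidway) return false; // transaction aborts: neither write is visible
  orders.push(stagedOrder);
  outbox.push(stagedEvent);
  return true;
}

const orders: Order[] = [];
const outbox: OutboxRow[] = [];
shipOrderTx(orders, outbox, 42, true);  // aborted: neither write lands
shipOrderTx(orders, outbox, 42, false); // committed: both writes land together
```

Note the event is "OrderShipped" (application intent), not "row 42 changed" (database fact): that semantic difference is the whole reason to pay the outbox's code-change cost.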
Debezium Connector Lifecycle
Here's how Debezium manages the full lifecycle of change capture:
On first start, Debezium performs a consistent snapshot of existing data, then switches to streaming mode. It tracks its position using the LSN, so restarts resume exactly where they left off with no data loss.
Scaling CDC: Multi-Source Topologies
In larger architectures, CDC rarely involves just one database. You might have separate databases for orders, users, inventory, and payments. Each gets its own Debezium connector, feeding into its own set of Kafka topics.
Each connector is independent. If the orders connector goes down, user and inventory CDC continue uninterrupted. The search service joins data from multiple topics to build a denormalized search index. The analytics pipeline consumes everything for the data warehouse.
This topology is common at companies with 5+ microservices and multiple databases. The Kafka topics become the shared event backbone that all teams can tap into without coordinating deployments or schema changes between services.
Implementation Sketch
A minimal Debezium + Kafka Connect configuration for PostgreSQL CDC:
```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.dbname": "ecommerce",
    "database.user": "debezium_user",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "table.include.list": "public.orders,public.order_items",
    "topic.prefix": "cdc",
    "snapshot.mode": "initial",
    "heartbeat.interval.ms": "10000",
    "tombstones.on.delete": "true"
  }
}
```
On the consumer side, a TypeScript service processes change events:
```typescript
// Minimal event shape (simplified from Debezium's full envelope)
interface CDCEvent {
  op: "c" | "u" | "d";
  before: { id: string; [key: string]: unknown } | null;
  after: { id: string; [key: string]: unknown } | null;
}

async function handleCDCEvent(event: CDCEvent) {
  switch (event.op) {
    case "c": // create
    case "u": // update
      await elasticsearch.index({
        index: "orders",
        id: event.after!.id,
        body: event.after,
      });
      break;
    case "d": // delete
      await elasticsearch.delete({
        index: "orders",
        id: event.before!.id,
      });
      break;
  }
}
```
The consumer is idempotent: indexing the same document twice produces the same result. This matters because CDC delivers at-least-once, and replaying from an earlier LSN offset re-delivers events.
The key design principle: treat every CDC consumer as if it will see duplicates. Idempotent writes (PUT semantics, not POST) make this safe.
When It Shines
- Keeping search indexes in sync: Elasticsearch or Typesense indexes that mirror a relational source of truth. CDC guarantees the index sees every change without dual-writing. I've seen teams reduce search staleness from 15 minutes (batch ETL) to under 2 seconds with Debezium.
- Cache invalidation: Redis entries invalidated when underlying data changes. CDC avoids TTL-based guessing and gives you precise, event-driven invalidation.
- Cross-service data replication: Populating a read-optimized store from a normalized source without coupling services. Each team consumes the same Kafka topic independently.
- Analytics pipelines: Streaming changes to a data warehouse in near-real-time rather than nightly batch ETL. Particularly valuable for operational dashboards that need fresh data.
- Audit logs: Capturing every change with before/after images for compliance. The WAL already records this; CDC just makes it accessible externally.
- Event-driven architecture bootstrap: Teams that want event-driven patterns but can't refactor every service to publish events. CDC lets you adopt incrementally, one table at a time.
For your interview: if you're designing a system with a relational source of truth and downstream read stores, CDC should be your default answer for keeping them in sync.
When NOT to Use CDC
- You need application-intent events: If downstream consumers need to know "an order was shipped" (with carrier info, tracking number, estimated delivery), not "the status column on row 42 changed." Use the outbox pattern instead.
- Your database doesn't support logical replication: Some managed database offerings restrict WAL access. Check before committing to CDC.
- You have very few tables and high control: If you own all the code and the service is small, publishing events explicitly from application code may be simpler than running Debezium infrastructure.
- You need exactly-once processing guarantees: CDC provides at-least-once delivery. If you can't make consumers idempotent, CDC will cause issues with duplicate processing.
Failure Modes and Pitfalls
1. Replication Slot Bloat
If Debezium goes down or falls behind, PostgreSQL retains WAL segments that haven't been consumed. The WAL can grow unbounded, filling the disk and crashing the database. Monitor pg_replication_slots for slot lag and set max_slot_wal_keep_size to cap retention.
I've seen this take down a production database at 3 a.m. because nobody was monitoring slot lag. The fix took 5 minutes (drop the slot), but the recovery (re-snapshot of 200GB) took hours.
Replication slot monitoring is non-negotiable
Always set max_slot_wal_keep_size and alert on pg_stat_replication_slots lag. Without these, a single Debezium outage can cascade into a database disk exhaustion incident.
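The alerting logic itself is simple arithmetic: LSNs are byte positions in the WAL, so slot lag is the gap between the current WAL position and the slot's confirmed flush position. A sketch of a threshold check (the 80% threshold and byte figures are illustrative, not recommendations):

```typescript
// Alerting sketch: page when retained-but-unconsumed WAL approaches the cap.
function slotLagBytes(currentWalLsn: number, confirmedFlushLsn: number): number {
  return currentWalLsn - confirmedFlushLsn; // LSNs are byte offsets into the WAL
}

function shouldAlert(lagBytes: number, maxSlotWalKeepBytes: number): boolean {
  return lagBytes >= 0.8 * maxSlotWalKeepBytes; // alert well before the cap
}

// Debezium is 9 GB behind with a 10 GB cap: past the 80% threshold, page someone.
const lag = slotLagBytes(10_000_000_000, 1_000_000_000);
const alerting = shouldAlert(lag, 10_000_000_000);
```

Alerting before the cap matters because once max_slot_wal_keep_size is exceeded, PostgreSQL invalidates the slot and the only recovery is a full re-snapshot.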
2. Schema Evolution Surprises
When you add a column, rename a field, or change a type, Debezium's event schema changes too. If downstream consumers use rigid schemas (Avro with a schema registry), an incompatible change breaks the entire pipeline.
This is CDC's Achilles heel. Every database migration is now an event contract change. Always coordinate schema changes with the Debezium connector, enforce compatibility rules in the schema registry, and use backward-compatible evolution patterns. Column additions with defaults are safe. Renames and type changes are breaking.
3. Large Transaction Event Storms
A bulk UPDATE touching millions of rows generates millions of CDC events in a burst, overwhelming both Debezium's throughput and downstream consumers. Break bulk operations into batches of 10K-50K rows with short pauses between them. You can also use Debezium's column.exclude.list to drop non-essential columns from emitted events, and tune max.batch.size and poll.interval.ms to throttle event emission.
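The batching side is a one-liner worth getting right: split the target ids into bounded chunks so each batch commits (and emits its CDC events) separately. A generic sketch:

```typescript
// Split a bulk operation's target ids into bounded batches so each batch
// commits separately and emits a bounded burst of CDC events.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 120K rows in batches of 50K: three batches, with a pause between commits.
const ids = Array.from({ length: 120_000 }, (_, i) => i + 1);
const batches = chunk(ids, 50_000);
```

Each batch would then drive one `UPDATE ... WHERE id = ANY(...)` transaction, with a sleep between them to let Debezium and consumers drain.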
4. Snapshot-Streaming Overlap
During the initial snapshot, Debezium reads existing data with a SELECT. Changes happening during the snapshot are also captured via the WAL. The merge can produce apparent duplicates. Consumers must handle this idempotently. For large tables (100M+ rows), the initial snapshot can take hours; plan for the overlap window accordingly and monitor consumer lag during this period.
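One robust way to handle the overlap is an LSN guard: per row, ignore any event at or below the last LSN already applied. A sketch of that check (the state map stands in for whatever store tracks per-row positions):

```typescript
// Guarded apply: skip any event at or below the last LSN applied for that row,
// which makes snapshot/stream duplicates and replays harmless.
type RowChange = { rowId: string; lsn: number; status: string };

function applyIfNewer(
  state: Map<string, { lsn: number; status: string }>,
  e: RowChange
): boolean {
  const current = state.get(e.rowId);
  if (current && current.lsn >= e.lsn) return false; // stale or duplicate: skip
  state.set(e.rowId, { lsn: e.lsn, status: e.status });
  return true;
}

const state = new Map<string, { lsn: number; status: string }>();
applyIfNewer(state, { rowId: "42", lsn: 100, status: "shipped" }); // from snapshot
const appliedDup =
  applyIfNewer(state, { rowId: "42", lsn: 100, status: "shipped" }); // stream duplicate
```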
5. Confusing CDC Events with Application Intent
CDC events reflect what the database did, not what the application intended. If your app updates a row 5 times in quick succession, you get 5 events, not 1. Downstream consumers that trigger expensive side effects (sending emails, making API calls) on every event need to debounce or filter. My recommendation: always add an idempotency layer on the consumer side.
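Debouncing a burst amounts to coalescing events per key and keeping only the latest before triggering side effects. A sketch of that reduction (function and field names are illustrative):

```typescript
// Coalesce a burst of events per row, keeping only the latest, before
// triggering expensive side effects like emails or API calls.
type RowEvent = { rowId: string; status: string; lsn: number };

function coalesceLatest(events: RowEvent[]): RowEvent[] {
  const latest = new Map<string, RowEvent>();
  for (const e of events) {
    const seen = latest.get(e.rowId);
    if (!seen || e.lsn > seen.lsn) latest.set(e.rowId, e); // LSN picks the winner
  }
  return [...latest.values()];
}

// Five rapid updates to the same row collapse to one side-effect trigger.
const burst: RowEvent[] = [1, 2, 3, 4, 5].map(n => ({
  rowId: "42",
  status: `state-${n}`,
  lsn: n,
}));
const effective = coalesceLatest(burst);
```

In practice you'd buffer events for a short window (or per Kafka poll batch) and run side effects only on the coalesced output.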
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Zero application code changes | Operational complexity (Debezium, Kafka Connect, Kafka) |
| Captures all changes (including direct SQL) | Schema coupling between DB schema and event schema |
| Sub-second latency | Replication slot bloat risk |
| Reliable LSN-based ordering | Not application-intent events (5 UPDATEs = 5 events) |
| Battle-tested tooling (Debezium) | Initial snapshot can be slow for large tables |
| Enables event-driven without refactoring | Requires logical replication support from the DB |
The fundamental tension is operational simplicity vs schema coupling. CDC frees developers from publishing events, but it ties your event schema to your database schema. Every ALTER TABLE is now a potential breaking change for all downstream consumers.
My recommendation for most teams: start with CDC for search and cache sync (where database-level events are perfectly fine), and use the outbox pattern for inter-service business events (where you need stable, domain-driven contracts). This gives you the best of both worlds without overcommitting to either approach.
Real-World Usage
LinkedIn built CDC into their core data infrastructure with Databus, a precursor to modern open-source CDC tools. Databus captured changes from Oracle and MySQL databases, streaming them to search indexes, graph stores, and caching layers. At peak, it handled millions of change events per second across thousands of tables. LinkedIn later contributed to Apache Kafka's development partly because Databus proved the value of log-based change propagation at scale.
Airbnb uses Debezium as part of their SpinalTap CDC platform. They stream changes from MySQL databases into Kafka to power search, analytics, and derived data stores. SpinalTap processes tens of millions of events per day and reduced search index staleness from minutes (batch ETL) to under 10 seconds. Their key insight: CDC let them migrate from a monolithic MySQL database to microservices incrementally, with each new service consuming change events rather than sharing the database directly.
Netflix uses CDC to keep denormalized data stores in sync. When title metadata changes in the primary content database, CDC streams those changes to Elasticsearch (powering search), the recommendation engine, and content delivery metadata stores. This replaced dozens of team-specific polling jobs with a shared event backbone. Netflix processes billions of events per day through their CDC pipelines, making it one of the largest deployments of log-based change capture in production.
Interview shortcut: the CDC elevator pitch
"We use Debezium to tail the PostgreSQL WAL, publishing row-level change events to Kafka. Downstream consumers (search, cache, analytics) read those events at their own pace. No dual-write risk, no application code changes, sub-second latency." That's the full pitch. If the interviewer wants more, discuss replication slot management and schema evolution.
How This Shows Up in Interviews
CDC appears in system design interviews whenever you need to keep multiple data stores in sync. That's your cue: any time the design involves "update the database AND also update the search index / cache / analytics pipeline."
When to bring it up: Say "I'd use CDC here, specifically Debezium tailing the PostgreSQL WAL, to stream changes to Kafka. Downstream consumers update the search index and cache from those events. No dual-write risk."
Depth expected:
- At senior level: know that CDC exists, name Debezium, explain why it beats dual-write
- At staff level: explain WAL mechanics, discuss replication slot risks, compare CDC vs outbox pattern, discuss schema evolution strategies
- At principal level: discuss CDC topology for multi-region setups, WAL volume capacity planning, and how CDC fits into a broader data mesh architecture
Common follow-up questions and how to handle them:
The interviewer will typically probe on failure modes. Be ready to discuss what happens when Debezium goes down (WAL accumulation), when schemas change (Avro compatibility), and when you need ordering guarantees (partition by entity key).
| Interviewer asks | Strong answer |
|---|---|
| "How do you keep the search index in sync?" | "CDC via Debezium. It tails the PostgreSQL WAL and publishes change events to Kafka. An ES consumer reads those events and updates the index. No dual-write risk." |
| "What if Debezium falls behind?" | "The replication slot retains WAL segments. We set max_slot_wal_keep_size to prevent disk exhaustion. If Debezium is down too long, we re-snapshot." |
| "Why not publish events from app code?" | "That's dual-write. If the DB commit succeeds but Kafka publish fails, systems diverge silently. CDC makes the database the single source of truth for events." |
| "What about schema changes?" | "We use Avro with a schema registry and enforce backward-compatible changes. Column additions are fine; renames or type changes require coordination." |
| "CDC vs the outbox pattern?" | "Outbox gives application-intent events (OrderShipped). CDC gives data-level events (row changed). Outbox requires code changes but cleaner semantics. CDC needs no code changes but mirrors the DB schema." |
Quick Recap
- CDC reads the database's internal transaction log to emit change events, eliminating the dual-write problem entirely.
- Log tailing (WAL-based CDC via Debezium) is the preferred approach: sub-second latency, zero code changes, minimal overhead.
- Debezium connects via a logical replication slot, performs an initial snapshot, then streams WAL changes to Kafka.
- The main operational risk is replication slot bloat: if the consumer falls behind, WAL segments accumulate and can exhaust disk space. Always set max_slot_wal_keep_size.
- CDC events mirror the database schema, not application intent. For domain-driven events, consider the outbox pattern instead.
- Downstream consumers must be idempotent because CDC delivers at-least-once and replays may occur after restarts or rebalances.
- The fundamental tension is operational simplicity (no code changes) vs schema coupling (DB schema = event schema).
- In interviews, CDC is the answer for "how do you keep the search index / cache / analytics in sync without dual-writing."
Related Patterns
- Outbox pattern: Gives you application-intent events by writing to an outbox table in the same transaction. Use when you need clean domain event contracts rather than raw database changes.
- Event sourcing: Where CDC captures changes to a mutable database, event sourcing stores events as the primary data model. CDC is a bridge; event sourcing is the destination.
- Message queues: CDC events flow through brokers like Kafka. Understanding consumer groups and offset management is essential for operating CDC pipelines.
- CQRS: CDC naturally enables CQRS by streaming changes from the write model to separate read-optimized stores.
- Competing consumers: CDC topics are consumed using the competing consumers pattern for horizontal scaling of downstream processing.
Common Use Cases
- Search index synchronization: Elasticsearch doesn't support transactions aligned with your primary database. CDC streams row changes to an indexing worker that updates Elasticsearch. No dual-write, no missed updates.
- Cache invalidation: When a database row changes, fire a CDC event that deletes or updates the corresponding cache key.
- Audit log: Every row change is automatically captured with before/after state, user, and timestamp.
- Materialized views in a different system: Build a denormalized read model in Redis or Cassandra from changes in a normalized PostgreSQL schema.
- Data replication to analytics: Stream production database changes to a data warehouse (Snowflake, BigQuery) for real-time analytics without polluting the production database with analytical queries.
Quick Recap
- CDC extracts database changes as events without requiring application code to explicitly publish them. This eliminates the dual-write problem where a DB write succeeds but event publishing fails.
- Log tailing reads the database's internal write-ahead log (WAL). Debezium connector for PostgreSQL is the standard implementation — zero application code changes, low DB overhead, captures all changes including direct SQL.
- Trigger-based CDC writes changes to an outbox table in the same transaction, then has a separate process publish from that table. It works with any database that has triggers but adds per-row write overhead.
- Dual-write (write to DB then publish event from application code) is not CDC — it's the anti-pattern CDC solves. Use the outbox pattern instead if you can't use log tailing.
- Key CDC use cases: real-time search index sync, cache invalidation, audit logs, materializing denormalized read models, and streaming changes to analytics warehouses without performance impact on the primary database.