Change data capture
How CDC extracts database change events using log tailing, triggers, or polling. Covers Debezium with PostgreSQL WAL, the dual-write problem CDC solves, use cases for CDC in event-driven architectures, and the tradeoffs of each approach.
TL;DR
- Change Data Capture (CDC) reads the database's internal transaction log to emit events for every insert, update, and delete, eliminating the need for application code to publish events separately.
- The preferred approach is log tailing (reading the Write-Ahead Log), which captures all changes with zero application code changes and minimal database overhead.
- CDC solves the dual-write problem: writing to a database and publishing an event are two separate operations that can partially fail, leaving systems permanently out of sync.
- Debezium (the most popular CDC tool) connects to PostgreSQL/MySQL via logical replication and streams row-level changes to Kafka topics in near-real-time.
- Think of CDC as event-driven architecture's zero-code option: your existing database becomes an event source without changing application code.
The Problem
Your e-commerce platform stores orders in PostgreSQL. Three downstream systems need to know when an order changes: the search index (Elasticsearch), the analytics pipeline (data warehouse), and the notification service (sends shipping emails).
The obvious approach: after every database write, publish an event to Kafka from your application code. This works fine on a whiteboard. In production, it breaks in subtle ways.
Here's the subtle break: the database write succeeds but the Kafka publish fails. PostgreSQL says the order shipped, but Elasticsearch, the analytics pipeline, and the notification service have no idea. The customer never gets their shipping email. The dashboard undercounts shipped orders. Support gets tickets about missing status updates.
You could wrap both in a distributed transaction, but 2PC is slow, fragile, and most message brokers don't support XA transactions. You could retry the Kafka publish, but if the app crashes between the DB commit and the retry, the event is permanently lost. You could add a background job that scans for "unsynced" rows, but now you're building a bespoke CDC system with all its edge cases.
This is the dual-write problem: two writes to two different systems can't be made atomic without distributed transactions. Either one can fail while the other succeeds. The failure window might be small, but across millions of requests per day, "small" means "happens every week."
CDC takes a completely different approach. Instead of the application emitting events, a separate process reads the database's own transaction log and publishes events from that. The database write is the only write. The event is derived from it, not duplicated alongside it.
With dual-write, the application makes two independent writes that can partially fail. With CDC, the application makes one write, and the event is derived automatically from the database's own log. There's no second write to fail.
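A toy sketch makes the contrast concrete. This is plain TypeScript with in-memory arrays standing in for the database and the broker (all names here are illustrative, not a real API): with dual-write the event stream can silently diverge from the database, while a log-derived stream cannot.

```typescript
// Toy in-memory model: the "database" and the "event stream" are arrays.
type Row = { id: number; status: string };

// Dual-write: two independent writes. The second can fail after the first commits.
function dualWrite(db: Row[], events: Row[], row: Row, publishFails: boolean): void {
  db.push(row); // write 1: the DB commit succeeds
  if (!publishFails) {
    events.push(row); // write 2: the broker publish, which can fail independently
  }
}

// CDC: events are derived from the DB's own log, so they can never diverge from it.
function deriveEvents(dbLog: Row[]): Row[] {
  return [...dbLog]; // one event per committed change, by construction
}

const db: Row[] = [];
const events: Row[] = [];
dualWrite(db, events, { id: 42, status: "shipped" }, true); // publish fails
// db now holds the shipped order, but events is empty: silent divergence.
// deriveEvents(db) always mirrors db exactly, regardless of broker failures.
```

The point of the sketch: `deriveEvents` cannot disagree with `db` because it has no independent failure mode, which is exactly the property CDC buys you.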
One-Line Definition
Change Data Capture detects and streams row-level database changes by reading the database's internal transaction log, turning your database into a reliable event source without modifying application code.
Analogy
Think of a bank teller and the bank's ledger. Every transaction the teller processes gets recorded in a central ledger. An auditor doesn't ask the teller to report every transaction separately (that's the dual-write approach). Instead, the auditor reads the ledger directly.
The auditor never misses a transaction because the ledger is the source of truth. The teller doesn't need to do any extra work. If the bank hires a new compliance officer, that person also reads the same ledger, no changes needed to how the teller operates.
CDC works the same way. The database's Write-Ahead Log (WAL) is the ledger. Debezium is the auditor. Adding a new downstream consumer is like hiring another compliance officer: just point them at the same log. No one needs to change how they do their job.
The key insight from this analogy: the ledger (WAL) exists regardless of whether anyone reads it. The database writes to the WAL for its own durability purposes. CDC just taps into what's already there.
Solution Walkthrough
CDC has three main implementation approaches. Log tailing is the production standard; triggers and polling are alternatives for constrained environments.
Approach 1: Log Tailing (the standard)
Every major relational database writes changes to a sequential log before applying them to data files. PostgreSQL calls this the Write-Ahead Log (WAL). MySQL calls it the binlog. This log exists for crash recovery, but CDC repurposes it as an event stream.
The flow works like this:
- Application writes to the database normally. No code changes needed.
- PostgreSQL writes the change to the WAL (it does this anyway for durability).
- Debezium connects via a logical replication slot and reads decoded WAL entries.
- Each WAL entry becomes a structured JSON event published to a Kafka topic.
- Downstream consumers read from Kafka at their own pace.
Each event contains the before-image, after-image, operation type (c/u/d), source table, timestamp, and the Log Sequence Number (LSN) for ordering.
```json
{
  "op": "u",
  "before": { "id": 42, "status": "processing", "updated_at": "..." },
  "after": { "id": 42, "status": "shipped", "updated_at": "..." },
  "source": {
    "table": "orders",
    "lsn": 234881024,
    "ts_ms": 1712000000000
  }
}
```
Why this works: the WAL is written atomically with the transaction. If the transaction commits, the WAL entry exists. If it rolls back, the WAL entry is discarded. There's no window where the database has a change but the event doesn't exist. The dual-write problem vanishes.
PostgreSQL configuration for logical replication requires two changes:
```sql
-- Enable logical replication (requires restart)
ALTER SYSTEM SET wal_level = logical;

-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_orders', 'pgoutput');
```
The pgoutput plugin is PostgreSQL's built-in logical decoding output. Debezium connects to this slot, and PostgreSQL streams decoded WAL entries in real-time. The replication slot also acts as a bookmark: PostgreSQL retains WAL segments until Debezium confirms it has read them, so no events are lost even if Debezium restarts.
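The bookmark behavior reduces to simple logic: the slot's confirmed position determines exactly which WAL entries the database must retain, and where the consumer resumes after a crash. A simplified TypeScript model of that invariant (not the real replication protocol, just its bookkeeping):

```typescript
// Simplified model of a replication slot as a confirmed-LSN bookmark.
type WalEntry = { lsn: number; change: string };

// The database must retain every entry the consumer has not yet confirmed.
function retainedWal(wal: WalEntry[], confirmedLsn: number): WalEntry[] {
  return wal.filter(e => e.lsn > confirmedLsn);
}

// After a restart, the consumer resumes from the confirmed LSN: nothing is lost.
function resumeFrom(wal: WalEntry[], confirmedLsn: number): WalEntry[] {
  return retainedWal(wal, confirmedLsn);
}

const wal: WalEntry[] = [
  { lsn: 100, change: "insert order 41" },
  { lsn: 200, change: "update order 41" },
  { lsn: 300, change: "insert order 42" },
];

// Consumer confirmed through LSN 200, then crashed: only entries past 200 remain pending.
const pending = resumeFrom(wal, 200);
```

The same invariant explains the slot-bloat failure mode later in this article: if the confirmed LSN never advances, `retainedWal` grows without bound.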
Approach 2: Database Triggers
For databases that don't support logical replication (or environments where Debezium can't be deployed), triggers capture changes inside the same transaction.
```sql
CREATE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO change_events (table_name, op, payload, created_at)
  VALUES (
    TG_TABLE_NAME,
    TG_OP,
    -- NEW is NULL for DELETEs in row-level triggers, so capture OLD instead
    CASE WHEN TG_OP = 'DELETE' THEN row_to_json(OLD) ELSE row_to_json(NEW) END,
    now()
  );
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cdc_orders
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_change();
```
A polling process reads change_events and publishes to Kafka. The trigger fires within the same transaction as the business write, so you get atomicity. The trade-off is performance: every write includes an extra INSERT, and polling adds 1-5 seconds of latency.
I've seen this approach work well in environments where the infrastructure team won't approve Debezium but the team still needs change events. It's not as clean as WAL-based CDC, but it's entirely self-contained within the database.
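The polling half of this approach reduces to checkpointed draining of change_events. A TypeScript sketch of that core logic (the actual SQL read and Kafka publish are omitted; `drainChanges` and the row shape are illustrative names):

```typescript
// Rows as read from the change_events table populated by the trigger.
type ChangeRow = { id: number; tableName: string; op: string; payload: unknown };

// Return rows newer than the checkpoint, in insertion order, plus the new checkpoint.
// Persisting the checkpoint only after a successful publish gives at-least-once delivery.
function drainChanges(
  rows: ChangeRow[],
  lastPublishedId: number
): { toPublish: ChangeRow[]; checkpoint: number } {
  const toPublish = rows
    .filter(r => r.id > lastPublishedId)
    .sort((a, b) => a.id - b.id);
  const checkpoint =
    toPublish.length > 0 ? toPublish[toPublish.length - 1].id : lastPublishedId;
  return { toPublish, checkpoint };
}

const rows: ChangeRow[] = [
  { id: 1, tableName: "orders", op: "INSERT", payload: {} },
  { id: 2, tableName: "orders", op: "UPDATE", payload: {} },
  { id: 3, tableName: "orders", op: "DELETE", payload: {} },
];
const batch = drainChanges(rows, 1); // already published through id 1
```

Because the checkpoint only advances after publishing succeeds, a crash between publish and checkpoint re-delivers the batch, which is why consumers must be idempotent here too.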
Approach 3: Application-Level Polling
Query the source table for rows modified since the last checkpoint:
```sql
SELECT * FROM orders WHERE updated_at > :last_poll_time;
```
Simplest approach, but it misses DELETEs (the row is gone), can miss rapid updates between poll intervals, and adds load to the source database with repeated full-range scans. You also need a monotonically increasing column (updated_at or a sequence) on every table you want to capture. I'd avoid this for anything production-critical, but it's a reasonable starting point for a proof of concept.
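Both weaknesses are easy to demonstrate with a toy simulation (illustrative TypeScript, not production code): polling sees only the row's state at poll time, so intermediate updates and deletes between polls are invisible.

```typescript
// Simulates why timestamp polling loses information: it observes only the
// current row state at poll time, not the history of changes.
type OrderRow = { id: number; status: string; updatedAt: number };

function poll(table: OrderRow[], lastPollTime: number): OrderRow[] {
  return table.filter(r => r.updatedAt > lastPollTime);
}

const table: OrderRow[] = [{ id: 42, status: "processing", updatedAt: 10 }];

// Between polls the row is updated twice...
table[0] = { id: 42, status: "packed", updatedAt: 20 };  // intermediate state
table[0] = { id: 42, status: "shipped", updatedAt: 30 }; // final state
// ...and a deleted row simply isn't in the table anymore, so polling can't see it.

const changes = poll(table, 10);
// One row comes back, in its final state only: the "packed" transition
// and any DELETEs in the interval are lost.
```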
Comparison
| Dimension | Log Tailing (WAL) | Triggers | Polling |
|---|---|---|---|
| Latency | Sub-second | 1-5 seconds | Seconds to minutes |
| DB overhead | Minimal (reads existing log) | Moderate (extra INSERT per write) | High (repeated queries) |
| Captures DELETEs | Yes | Yes | No (row is gone) |
| Schema changes needed | None | Trigger per table | updated_at column required |
| Tooling required | Debezium + Kafka Connect | None (built-in SQL) | None (simple queries) |
| Ordering guarantees | LSN-ordered (strong) | Transaction-ordered (strong) | Timestamp-ordered (weak) |
For most production systems, log tailing with Debezium is the clear winner.
CDC vs the Outbox Pattern
Both CDC and the outbox pattern solve the dual-write problem, but they take different paths. Understanding the distinction matters for interviews.
| Dimension | CDC (Log Tailing) | Outbox Pattern |
|---|---|---|
| Event semantics | Database-level ("row 42 changed") | Application-intent ("OrderShipped") |
| Code changes required | None (reads existing WAL) | Must write to outbox table in every transaction |
| Schema coupling | DB schema = event schema | Event schema is independent of DB schema |
| Captures direct SQL changes | Yes (migrations, scripts, admin queries) | No (only changes through application code) |
| Event filtering | All changes emitted (need consumer-side filtering) | Publish only the events you want |
| Infrastructure required | Debezium + Kafka Connect + Kafka | Polling job or CDC on the outbox table + Kafka |
| Best for | Operational sync (search, cache, analytics) | Inter-service business events |
The rule of thumb: use CDC for "keep this other data store in sync" and the outbox pattern for "tell other services what happened." Many production systems use both.
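The outbox pattern's core guarantee is transactional atomicity: the business row and the intent event commit together or not at all. A toy TypeScript model of that guarantee (the staged-write transaction here is a simulation, not a real database API):

```typescript
// Toy model of the outbox pattern: the business row and the intent event
// share one transaction, so they commit atomically or not at all.
type Order = { id: number; status: string };
type OutboxRow = { eventType: string; payload: unknown };

function shipOrderTx(
  orders: Order[],
  outbox: OutboxRow[],
  orderId: number,
  failMidway: boolean
): boolean {
  // Stage both writes; commit applies them together, abort applies neither.
  const stagedOrder: Order = { id: orderId, status: "shipped" };
  const stagedEvent: OutboxRow = { eventType: "OrderShipped", payload: { orderId } };
  if (failMidway) return false; // transaction aborts: neither write is visible
  orders.push(stagedOrder);
  outbox.push(stagedEvent);
  return true;
}

const orders: Order[] = [];
const outbox: OutboxRow[] = [];
shipOrderTx(orders, outbox, 42, true);  // aborted: neither write lands
shipOrderTx(orders, outbox, 42, false); // committed: both writes land together
```

Note the event is "OrderShipped" (application intent), not "row 42 changed" (database fact): that semantic difference is the whole reason to pay the outbox's code-change cost.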
Debezium Connector Lifecycle
Here's how Debezium manages the full lifecycle of change capture:
On first start, Debezium performs a consistent snapshot of existing data, then switches to streaming mode. It tracks its position using the LSN, so restarts resume exactly where they left off with no data loss.
Scaling CDC: Multi-Source Topologies
In larger architectures, CDC rarely involves just one database. You might have separate databases for orders, users, inventory, and payments. Each gets its own Debezium connector, feeding into its own set of Kafka topics.
Each connector is independent. If the orders connector goes down, user and inventory CDC continue uninterrupted. The search service joins data from multiple topics to build a denormalized search index. The analytics pipeline consumes everything for the data warehouse.
This topology is common at companies with 5+ microservices and multiple databases. The Kafka topics become the shared event backbone that all teams can tap into without coordinating deployments or schema changes between services.
Implementation Sketch
A minimal Debezium + Kafka Connect configuration for PostgreSQL CDC:
```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.dbname": "ecommerce",
    "database.user": "debezium_user",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "table.include.list": "public.orders,public.order_items",
    "topic.prefix": "cdc",
    "snapshot.mode": "initial",
    "heartbeat.interval.ms": "10000",
    "tombstones.on.delete": "true"
  }
}
```
On the consumer side, a TypeScript service processes change events:
```typescript
// Minimal event shape (simplified from Debezium's full envelope)
interface CDCEvent {
  op: "c" | "u" | "d";
  before: { id: string; [key: string]: unknown } | null;
  after: { id: string; [key: string]: unknown } | null;
}

async function handleCDCEvent(event: CDCEvent) {
  switch (event.op) {
    case "c": // create
    case "u": // update
      await elasticsearch.index({
        index: "orders",
        id: event.after!.id,
        body: event.after,
      });
      break;
    case "d": // delete
      await elasticsearch.delete({
        index: "orders",
        id: event.before!.id,
      });
      break;
  }
}
```
The consumer is idempotent: indexing the same document twice produces the same result. This matters because CDC delivers at-least-once, and replaying from an earlier LSN offset re-delivers events.
The key design principle: treat every CDC consumer as if it will see duplicates. Idempotent writes (PUT semantics, not POST) make this safe.
When It Shines
- Keeping search indexes in sync: Elasticsearch or Typesense indexes that mirror a relational source of truth. CDC guarantees the index sees every change without dual-writing. I've seen teams reduce search staleness from 15 minutes (batch ETL) to under 2 seconds with Debezium.
- Cache invalidation: Redis entries invalidated when underlying data changes. CDC avoids TTL-based guessing and gives you precise, event-driven invalidation.
- Cross-service data replication: Populating a read-optimized store from a normalized source without coupling services. Each team consumes the same Kafka topic independently.
- Analytics pipelines: Streaming changes to a data warehouse in near-real-time rather than nightly batch ETL. Particularly valuable for operational dashboards that need fresh data.
- Audit logs: Capturing every change with before/after images for compliance. The WAL already records this; CDC just makes it accessible externally.
- Event-driven architecture bootstrap: Teams that want event-driven patterns but can't refactor every service to publish events. CDC lets you adopt incrementally, one table at a time.
For your interview: if you're designing a system with a relational source of truth and downstream read stores, CDC should be your default answer for keeping them in sync.
When NOT to Use CDC
- You need application-intent events: If downstream consumers need to know "an order was shipped" (with carrier info, tracking number, estimated delivery), not "the status column on row 42 changed." Use the outbox pattern instead.
- Your database doesn't support logical replication: Some managed database offerings restrict WAL access. Check before committing to CDC.
- You have very few tables and high control: If you own all the code and the service is small, publishing events explicitly from application code may be simpler than running Debezium infrastructure.
- You need exactly-once processing guarantees: CDC provides at-least-once delivery. If you can't make consumers idempotent, CDC will cause issues with duplicate processing.
Failure Modes and Pitfalls
1. Replication Slot Bloat
If Debezium goes down or falls behind, PostgreSQL retains WAL segments that haven't been consumed. The WAL can grow unbounded, filling the disk and crashing the database. Monitor pg_replication_slots for slot lag and set max_slot_wal_keep_size to cap retention.
I've seen this take down a production database at 3 a.m. because nobody was monitoring slot lag. The fix took 5 minutes (drop the slot), but the recovery (re-snapshot of 200GB) took hours.
Replication slot monitoring is non-negotiable
Always set max_slot_wal_keep_size and alert on pg_stat_replication_slots lag. Without these, a single Debezium outage can cascade into a database disk exhaustion incident.
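The alerting logic itself is simple arithmetic: LSNs are byte positions in the WAL, so slot lag is the gap between the current WAL position and the slot's confirmed flush position. A sketch of a threshold check (the 80% threshold and byte figures are illustrative, not recommendations):

```typescript
// Alerting sketch: page when retained-but-unconsumed WAL approaches the cap.
function slotLagBytes(currentWalLsn: number, confirmedFlushLsn: number): number {
  return currentWalLsn - confirmedFlushLsn; // LSNs are byte offsets into the WAL
}

function shouldAlert(lagBytes: number, maxSlotWalKeepBytes: number): boolean {
  return lagBytes >= 0.8 * maxSlotWalKeepBytes; // alert well before the cap
}

// Debezium is 9 GB behind with a 10 GB cap: past the 80% threshold, page someone.
const lag = slotLagBytes(10_000_000_000, 1_000_000_000);
const alerting = shouldAlert(lag, 10_000_000_000);
```

Alerting before the cap matters because once max_slot_wal_keep_size is exceeded, PostgreSQL invalidates the slot and the only recovery is a full re-snapshot.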
2. Schema Evolution Surprises
When you add a column, rename a field, or change a type, Debezium's event schema changes too. If downstream consumers use rigid schemas (Avro with a schema registry), an incompatible change breaks the entire pipeline.
This is CDC's Achilles heel. Every database migration is now an event contract change. Always coordinate schema changes with the Debezium connector, enforce compatibility rules in the schema registry, and use backward-compatible evolution patterns. Column additions with defaults are safe. Renames and type changes are breaking.
3. Large Transaction Event Storms
A bulk UPDATE touching millions of rows generates millions of CDC events in a burst, overwhelming both Debezium's throughput and downstream consumers. Break bulk operations into batches of 10K-50K rows with short pauses between them. You can also use Debezium's column.exclude.list to drop non-essential columns from emitted events, and tune max.batch.size and poll.interval.ms to throttle event emission.
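The batching side is a one-liner worth getting right: split the target ids into bounded chunks so each batch commits (and emits its CDC events) separately. A generic sketch:

```typescript
// Split a bulk operation's target ids into bounded batches so each batch
// commits separately and emits a bounded burst of CDC events.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 120K rows in batches of 50K: three batches, with a pause between commits.
const ids = Array.from({ length: 120_000 }, (_, i) => i + 1);
const batches = chunk(ids, 50_000);
```

Each batch would then drive one `UPDATE ... WHERE id = ANY(...)` transaction, with a sleep between them to let Debezium and consumers drain.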
4. Snapshot-Streaming Overlap
During the initial snapshot, Debezium reads existing data with a SELECT. Changes happening during the snapshot are also captured via the WAL. The merge can produce apparent duplicates. Consumers must handle this idempotently. For large tables (100M+ rows), the initial snapshot can take hours; plan for the overlap window accordingly and monitor consumer lag during this period.
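One robust way to handle the overlap is an LSN guard: per row, ignore any event at or below the last LSN already applied. A sketch of that check (the state map stands in for whatever store tracks per-row positions):

```typescript
// Guarded apply: skip any event at or below the last LSN applied for that row,
// which makes snapshot/stream duplicates and replays harmless.
type RowChange = { rowId: string; lsn: number; status: string };

function applyIfNewer(
  state: Map<string, { lsn: number; status: string }>,
  e: RowChange
): boolean {
  const current = state.get(e.rowId);
  if (current && current.lsn >= e.lsn) return false; // stale or duplicate: skip
  state.set(e.rowId, { lsn: e.lsn, status: e.status });
  return true;
}

const state = new Map<string, { lsn: number; status: string }>();
applyIfNewer(state, { rowId: "42", lsn: 100, status: "shipped" }); // from snapshot
const appliedDup =
  applyIfNewer(state, { rowId: "42", lsn: 100, status: "shipped" }); // stream duplicate
```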
5. Confusing CDC Events with Application Intent
CDC events reflect what the database did, not what the application intended. If your app updates a row 5 times in quick succession, you get 5 events, not 1. Downstream consumers that trigger expensive side effects (sending emails, making API calls) on every event need to debounce or filter. My recommendation: always add an idempotency layer on the consumer side.
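Debouncing a burst amounts to coalescing events per key and keeping only the latest before triggering side effects. A sketch of that reduction (function and field names are illustrative):

```typescript
// Coalesce a burst of events per row, keeping only the latest, before
// triggering expensive side effects like emails or API calls.
type RowEvent = { rowId: string; status: string; lsn: number };

function coalesceLatest(events: RowEvent[]): RowEvent[] {
  const latest = new Map<string, RowEvent>();
  for (const e of events) {
    const seen = latest.get(e.rowId);
    if (!seen || e.lsn > seen.lsn) latest.set(e.rowId, e); // LSN picks the winner
  }
  return [...latest.values()];
}

// Five rapid updates to the same row collapse to one side-effect trigger.
const burst: RowEvent[] = [1, 2, 3, 4, 5].map(n => ({
  rowId: "42",
  status: `state-${n}`,
  lsn: n,
}));
const effective = coalesceLatest(burst);
```

In practice you'd buffer events for a short window (or per Kafka poll batch) and run side effects only on the coalesced output.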
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Zero application code changes | Operational complexity (Debezium, Kafka Connect, Kafka) |
| Captures all changes (including direct SQL) | Schema coupling between DB schema and event schema |
| Sub-second latency | Replication slot bloat risk |
| Reliable LSN-based ordering | Not application-intent events (5 UPDATEs = 5 events) |
| Battle-tested tooling (Debezium) | Initial snapshot can be slow for large tables |
| Enables event-driven without refactoring | Requires logical replication support from the DB |
The fundamental tension is operational simplicity vs schema coupling. CDC frees developers from publishing events, but it ties your event schema to your database schema. Every ALTER TABLE is now a potential breaking change for all downstream consumers.
My recommendation for most teams: start with CDC for search and cache sync (where database-level events are perfectly fine), and use the outbox pattern for inter-service business events (where you need stable, domain-driven contracts). This gives you the best of both worlds without overcommitting to either approach.
Real-World Usage
LinkedIn built CDC into their core data infrastructure with Databus, a precursor to modern open-source CDC tools. Databus captured changes from Oracle and MySQL databases, streaming them to search indexes, graph stores, and caching layers. At peak, it handled millions of change events per second across thousands of tables. LinkedIn later contributed to Apache Kafka's development partly because Databus proved the value of log-based change propagation at scale.
Airbnb uses Debezium as part of their SpinalTap CDC platform. They stream changes from MySQL databases into Kafka to power search, analytics, and derived data stores. SpinalTap processes tens of millions of events per day and reduced search index staleness from minutes (batch ETL) to under 10 seconds. Their key insight: CDC let them migrate from a monolithic MySQL database to microservices incrementally, with each new service consuming change events rather than sharing the database directly.
Netflix uses CDC to keep denormalized data stores in sync. When title metadata changes in the primary content database, CDC streams those changes to Elasticsearch (powering search), the recommendation engine, and content delivery metadata stores. This replaced dozens of team-specific polling jobs with a shared event backbone. Netflix processes billions of events per day through their CDC pipelines, making it one of the largest deployments of log-based change capture in production.
Interview shortcut: the CDC elevator pitch
"We use Debezium to tail the PostgreSQL WAL, publishing row-level change events to Kafka. Downstream consumers (search, cache, analytics) read those events at their own pace. No dual-write risk, no application code changes, sub-second latency." That's the full pitch. If the interviewer wants more, discuss replication slot management and schema evolution.
How This Shows Up in Interviews
CDC appears in system design interviews whenever you need to keep multiple data stores in sync. That's your cue: any time the design involves "update the database AND also update the search index / cache / analytics pipeline."
When to bring it up: Say "I'd use CDC here, specifically Debezium tailing the PostgreSQL WAL, to stream changes to Kafka. Downstream consumers update the search index and cache from those events. No dual-write risk."
Depth expected:
- At senior level: know that CDC exists, name Debezium, explain why it beats dual-write
- At staff level: explain WAL mechanics, discuss replication slot risks, compare CDC vs outbox pattern, discuss schema evolution strategies
- At principal level: discuss CDC topology for multi-region setups, WAL volume capacity planning, and how CDC fits into a broader data mesh architecture
Common follow-up questions and how to handle them:
The interviewer will typically probe on failure modes. Be ready to discuss what happens when Debezium goes down (WAL accumulation), when schemas change (Avro compatibility), and when you need ordering guarantees (partition by entity key).
| Interviewer asks | Strong answer |
|---|---|
| "How do you keep the search index in sync?" | "CDC via Debezium. It tails the PostgreSQL WAL and publishes change events to Kafka. An ES consumer reads those events and updates the index. No dual-write risk." |
| "What if Debezium falls behind?" | "The replication slot retains WAL segments. We set max_slot_wal_keep_size to prevent disk exhaustion. If Debezium is down too long, we re-snapshot." |
| "Why not publish events from app code?" | "That's dual-write. If the DB commit succeeds but Kafka publish fails, systems diverge silently. CDC makes the database the single source of truth for events." |
| "What about schema changes?" | "We use Avro with a schema registry and enforce backward-compatible changes. Column additions are fine; renames or type changes require coordination." |
| "CDC vs the outbox pattern?" | "Outbox gives application-intent events (OrderShipped). CDC gives data-level events (row changed). Outbox requires code changes but cleaner semantics. CDC needs no code changes but mirrors the DB schema." |
Quick Recap
- CDC reads the database's internal transaction log to emit change events, eliminating the dual-write problem entirely.
- Log tailing (WAL-based CDC via Debezium) is the preferred approach: sub-second latency, zero code changes, minimal overhead.
- Debezium connects via a logical replication slot, performs an initial snapshot, then streams WAL changes to Kafka.
- The main operational risk is replication slot bloat: if the consumer falls behind, WAL segments accumulate and can exhaust disk space. Always set max_slot_wal_keep_size.
- CDC events mirror the database schema, not application intent. For domain-driven events, consider the outbox pattern instead.
- Downstream consumers must be idempotent because CDC delivers at-least-once and replays may occur after restarts or rebalances.
- The fundamental tension is operational simplicity (no code changes) vs schema coupling (DB schema = event schema).
- In interviews, CDC is the answer for "how do you keep the search index / cache / analytics in sync without dual-writing."
Related Patterns
- Outbox pattern: Gives you application-intent events by writing to an outbox table in the same transaction. Use when you need clean domain event contracts rather than raw database changes.
- Event sourcing: Where CDC captures changes to a mutable database, event sourcing stores events as the primary data model. CDC is a bridge; event sourcing is the destination.
- Message queues: CDC events flow through brokers like Kafka. Understanding consumer groups and offset management is essential for operating CDC pipelines.
- CQRS: CDC naturally enables CQRS by streaming changes from the write model to separate read-optimized stores.
- Competing consumers: CDC topics are consumed using the competing consumers pattern for horizontal scaling of downstream processing.
Common Use Cases
- Search index synchronization: Elasticsearch doesn't support transactions aligned with your primary database. CDC streams row changes to an indexing worker that updates Elasticsearch. No dual-write, no missed updates.
- Cache invalidation: When a database row changes, fire a CDC event that deletes or updates the corresponding cache key.
- Audit log: Every row change is automatically captured with before/after state, user, and timestamp.
- Materialized views in a different system: Build a denormalized read model in Redis or Cassandra from changes in a normalized PostgreSQL schema.
- Data replication to analytics: Stream production database changes to a data warehouse (Snowflake, BigQuery) for real-time analytics without polluting the production database with analytical queries.
Quick Recap
- CDC extracts database changes as events without requiring application code to explicitly publish them. This eliminates the dual-write problem where a DB write succeeds but event publishing fails.
- Log tailing reads the database's internal write-ahead log (WAL). Debezium connector for PostgreSQL is the standard implementation — zero application code changes, low DB overhead, captures all changes including direct SQL.
- Trigger-based CDC writes changes to an outbox table in the same transaction, then has a separate process publish from that table. It works with any database that has triggers but adds per-row write overhead.
- Dual-write (write to DB then publish event from application code) is not CDC — it's the anti-pattern CDC solves. Use the outbox pattern instead if you can't use log tailing.
- Key CDC use cases: real-time search index sync, cache invalidation, audit logs, materializing denormalized read models, and streaming changes to analytics warehouses without performance impact on the primary database.