Write-ahead log (WAL)
Understand how the write-ahead log guarantees database durability, what fsync actually costs, and why WAL enables both crash recovery and replication in one mechanism.
The problem
At 2:14 a.m., your database server loses power mid-transaction. The app had just committed a payment. The row was updated in memory. The pages were queued to be written to disk. None of those pages made it.
When the server restarts, what happened to that payment? Without a mechanism for recovery, the answer is: nobody knows. The in-memory state is gone and the on-disk state is a partially applied mess.
This is the durability problem. Every production database solves it with the same fundamental mechanism: the write-ahead log.
What the WAL is
The write-ahead log is an append-only file that records every change before that change is applied to the actual data pages. "Write-ahead" means exactly that: the log entry must be written to disk before the data page is modified.
Analogy: Before moving furniture in a house, you write down in a notebook what you're about to do: "move sofa from living room to bedroom." If you collapse mid-move and someone finds you later, they can read the notebook and either finish the move or put things back. The notebook is the WAL.
The commit acknowledgement goes to the client only after the log record hits durable storage. The actual data page write happens later, asynchronously, during a checkpoint.
What "durability" actually costs: fsync
The WAL only delivers durability if the log record survives a power failure. That requires fsync(), the system call that tells the OS to flush its write cache and confirm that data is on the physical storage medium.
Without fsync, the OS write cache holds data in RAM and flushes at its own pace. A power loss before the flush = data loss, even though the write system call "succeeded."
Without fsync:
write("WAL record") β OS write cache (DRAM) β ... β disk (eventually)
Power loss here β loses data
With fsync:
write("WAL record") β OS write cache β fsync() waits β disk confirmed
Power loss after fsync: data is safe
The cost: fsync() on a typical HDD takes 5-10ms. On an NVMe SSD: 50-200 microseconds. This is why databases with fsync = on have a throughput ceiling measured in thousands of transactions per second (each needing its own fsync), and fsync = off databases can do hundreds of thousands per second but lose all guarantees on crash.
fsync = off is data loss waiting to happen
Some tutorials suggest disabling fsync for performance in development. Never do this in production. The GitLab incident of 2017 involved data loss partly attributed to fsync misconfiguration. The PostgreSQL documentation explicitly warns: "Setting fsync off can result in unrecoverable data corruption in the event of a crash."
Group commit: batching fsyncs
A single fsync per transaction limits throughput. Modern databases use group commit: batch multiple concurrent transactions into one fsync.
// Group commit (simplified)
while true:
collect all transactions waiting for commit (say, 50-200 in a busy system)
write all their log records to the WAL buffer
single fsync() for the entire batch
reply COMMIT to all 50-200 clients simultaneously
If 200 clients are committing concurrently and you can batch them into one fsync, you get 200 commits per fsync() instead of 1. On PostgreSQL, this happens automatically. MySQL InnoDB has a sync_binlog and innodb_flush_log_at_trx_commit setting that controls the tradeoff.
At high concurrency, group commit transforms the fsync bottleneck from "200 separate 200Β΅s operations" into "one 200Β΅s operation serving 200 clients." Throughput scales roughly linearly with concurrent writers up to the point where WAL write bandwidth becomes the limit.
Crash recovery
spawnSync d2 ENOENT
On restart after a crash, the database applies the WAL to bring data pages up to date. The process has three phases:
- Analysis. Read the WAL from the last checkpoint forward to determine which transactions committed and which were in-flight at crash time.
- Redo. Reapply all committed changes in order, even if the data page already has them (idempotent). This covers the gap between the last checkpoint and the crash.
- Undo. Roll back any transactions that had not committed at crash time. The WAL includes enough information to reverse partial changes.
// Crash recovery (ARIES algorithm, simplified)
checkpoint_lsn = read_from_control_file()
transactions = {}
// Phase 1: Analysis
for each record in WAL starting from checkpoint_lsn:
if record.type == BEGIN: transactions[txn_id] = ACTIVE
if record.type == COMMIT: transactions[txn_id] = COMMITTED
if record.type == ROLLBACK: transactions[txn_id] = ABORTED
// Phase 2: Redo (replay all changes, committed or not)
for each record in WAL starting from checkpoint_lsn:
apply_change_to_data_page(record)
// Phase 3: Undo (reverse uncommitted transactions)
for txn_id, state in transactions:
if state == ACTIVE: // was in-flight at crash
reverse_changes(txn_id)
Recovery time is bounded by the distance between the last checkpoint and the crash point: only the WAL records after the last checkpoint need to be replayed. PostgreSQL's checkpoint_timeout (default: 5 minutes) limits how far back recovery must go. A 5-minute timeout means crash recovery takes at most the amount of work done in 5 minutes of WAL, not the entire WAL history.
WAL and replication
PostgreSQL streaming replication works by shipping WAL records from the primary to standbys. The standbys apply the same records in order. Because WAL is the single source of truth for all changes, a standby that has applied all WAL records up to LSN X has exactly the same data state as the primary at LSN X.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.