Write-ahead log (WAL)
Understand how the write-ahead log guarantees database durability, what fsync actually costs, and why WAL enables both crash recovery and replication in one mechanism.
The problem
At 2:14 a.m., your database server loses power mid-transaction. The app had just committed a payment. The row was updated in memory. The pages were queued to be written to disk. None of those pages made it.
When the server restarts, what happened to that payment? Without a mechanism for recovery, the answer is: nobody knows. The in-memory state is gone and the on-disk state is a partially applied mess.
This is the durability problem. Every production database solves it with the same fundamental mechanism: the write-ahead log.
What the WAL is
The write-ahead log is an append-only file that records every change before that change is applied to the actual data pages. "Write-ahead" means exactly that: the log entry must be written to disk before the data page is modified.
Analogy: Before moving furniture in a house, you write down in a notebook what you're about to do: "move sofa from living room to bedroom." If you collapse mid-move and someone finds you later, they can read the notebook and either finish the move or put things back. The notebook is the WAL.
sequenceDiagram
participant C as Client
participant DB as Database Process
participant WAL as WAL File (disk)
participant HP as Data Pages (disk)
C->>DB: BEGIN + UPDATE balance=950 WHERE id=1 + COMMIT
DB->>WAL: Write log record: txn=X change=UPDATE accounts id=1 balance=950
WAL-->>DB: fsync confirmed (log on disk)
DB-->>C: COMMIT acknowledged
Note over DB,HP: Pages updated in memory (buffer pool)
DB->>HP: Checkpoint flushes dirty pages to disk (async)
The commit acknowledgement goes to the client only after the log record hits durable storage. The actual data page write happens later, asynchronously, during a checkpoint.
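The contract above can be sketched in a few lines of Python. This is an illustrative toy, not a real engine (`TinyWAL` and its record format are made up): the commit path appends the record, fsyncs, and only then acknowledges; flushing data pages is deferred.

```python
import json
import os
import tempfile

# Toy write-ahead contract (illustrative only, not a real engine).
# The log record is appended and fsync'd BEFORE the commit is
# acknowledged; flushing data pages can happen later, asynchronously.

class TinyWAL:
    def __init__(self, path):
        self.f = open(path, "ab")

    def commit(self, txn):
        record = json.dumps(txn) + "\n"
        self.f.write(record.encode())
        self.f.flush()                # Python buffer -> OS page cache
        os.fsync(self.f.fileno())     # OS page cache -> durable storage
        return "COMMIT acknowledged"  # safe to reply only after fsync

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(wal_path)
ack = wal.commit({"txn": "X", "table": "accounts", "id": 1, "balance": 950})
print(ack)
```

The key ordering is inside `commit`: nothing is returned to the caller until `os.fsync` has confirmed the record is on stable storage.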
What "durability" actually costs: fsync
The WAL only delivers durability if the log record survives a power failure. That requires fsync(), the system call that tells the OS to flush its write cache and confirm that data is on the physical storage medium.
Without fsync, the OS write cache holds data in RAM and flushes at its own pace. A power loss before the flush = data loss, even though the write system call "succeeded."
Without fsync:
write("WAL record") → OS write cache (DRAM) → ... → disk (eventually)
Power loss here → loses data
With fsync:
write("WAL record") → OS write cache → fsync() waits → disk confirmed
Power loss after fsync: data is safe
The cost: fsync() on a typical HDD takes 5-10ms. On an NVMe SSD: 50-200 microseconds. This is why databases with fsync = on have a throughput ceiling measured in thousands of transactions per second (each needing its own fsync), and fsync = off databases can do hundreds of thousands per second but lose all guarantees on crash.
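You can measure this cost directly. A minimal probe using only the standard library (results depend entirely on your storage device and OS; treat the numbers as illustrative):

```python
import os
import statistics
import tempfile
import time

# Rough fsync latency probe. Results depend entirely on the storage
# device and OS cache behavior; treat the output as illustrative.
path = os.path.join(tempfile.mkdtemp(), "probe.log")
f = open(path, "ab")

samples = []
for _ in range(50):
    f.write(b"x" * 128)           # a small "log record"
    f.flush()                     # hand it to the OS
    start = time.perf_counter()
    os.fsync(f.fileno())          # wait for durable storage
    samples.append(time.perf_counter() - start)

print(f"median fsync: {statistics.median(samples) * 1e6:.0f} us")
```

Run it on an HDD, an SSD, and a tmpfs and the spread makes the throughput-ceiling argument above concrete.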
fsync = off is data loss waiting to happen
Some tutorials suggest disabling fsync for performance in development. Never do this in production. The PostgreSQL documentation warns explicitly: "Setting fsync off can result in unrecoverable data corruption in the event of a crash."
Group commit: batching fsyncs
A single fsync per transaction limits throughput. Modern databases use group commit: batch multiple concurrent transactions into one fsync.
// Group commit (simplified)
while true:
    collect all transactions waiting for commit (say, 50-200 in a busy system)
    write all their log records to the WAL buffer
    single fsync() for the entire batch
    reply COMMIT to all waiting clients simultaneously
If 200 clients are committing concurrently and you can batch them into one fsync, you get 200 commits per fsync() instead of one. PostgreSQL does this automatically. MySQL InnoDB exposes the sync_binlog and innodb_flush_log_at_trx_commit settings to control the tradeoff.
At high concurrency, group commit transforms the fsync bottleneck from "200 separate 200ยตs operations" into "one 200ยตs operation serving 200 clients." Throughput scales roughly linearly with concurrent writers up to the point where WAL write bandwidth becomes the limit.
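The mechanism can be demonstrated with a toy implementation (all names here are illustrative, not any database's real internals): committers enqueue records and block; a single flusher thread writes the whole batch and issues one fsync covering every transaction in it.

```python
import os
import tempfile
import threading
import time

# Toy group commit: many committers, one fsync per batch.

class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.pending = []  # (record, event) pairs awaiting durability
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def commit(self, record):
        done = threading.Event()
        with self.lock:
            self.pending.append((record, done))
        done.wait()  # returns once the batch containing us is fsync'd

    def _flush_loop(self):
        while True:
            with self.lock:
                batch, self.pending = self.pending, []
            if not batch:
                time.sleep(0.001)
                continue
            for record, _ in batch:
                self.f.write(record + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())  # ONE fsync for the whole batch
            for _, done in batch:
                done.set()

log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "wal.log"))
workers = [threading.Thread(target=log.commit, args=(b"txn-%d" % i,))
           for i in range(20)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print("20 commits made durable")
```

Twenty committers block concurrently, but the flusher typically satisfies them with far fewer than twenty fsync calls.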
Crash recovery
On restart after a crash, the database applies the WAL to bring data pages up to date. The process has three phases:
- Analysis. Read the WAL from the last checkpoint forward to determine which transactions committed and which were in-flight at crash time.
- Redo. Reapply all committed changes in order, even if the data page already has them (idempotent). This covers the gap between the last checkpoint and the crash.
- Undo. Roll back any transactions that had not committed at crash time. The WAL includes enough information to reverse partial changes.
// Crash recovery (ARIES algorithm, simplified)
checkpoint_lsn = read_from_control_file()
transactions = {}

// Phase 1: Analysis
for each record in WAL starting from checkpoint_lsn:
    if record.type == BEGIN:    transactions[record.txn_id] = ACTIVE
    if record.type == COMMIT:   transactions[record.txn_id] = COMMITTED
    if record.type == ROLLBACK: transactions[record.txn_id] = ABORTED

// Phase 2: Redo (replay all changes, committed or not)
for each record in WAL starting from checkpoint_lsn:
    apply_change_to_data_page(record)  // idempotent

// Phase 3: Undo (reverse uncommitted transactions)
for txn_id, state in transactions:
    if state == ACTIVE:  // was in-flight at crash
        reverse_changes(txn_id)
Recovery time is bounded by the distance between the last checkpoint and the crash point: only the WAL records after the last checkpoint need to be replayed. PostgreSQL's checkpoint_timeout (default: 5 minutes) limits how far back recovery must go. A 5-minute timeout means crash recovery takes at most the amount of work done in 5 minutes of WAL, not the entire WAL history.
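The three phases can be exercised end to end with a toy in-memory WAL. The record shapes below are made up for illustration; the point is that redo replays everything and undo then reverses the in-flight transaction using the logged old values.

```python
# Toy ARIES-style recovery over an in-memory WAL (illustrative only).

wal = [
    {"type": "BEGIN",  "txn": 1},
    {"type": "UPDATE", "txn": 1, "key": "a", "old": 100, "new": 50},
    {"type": "COMMIT", "txn": 1},
    {"type": "BEGIN",  "txn": 2},
    {"type": "UPDATE", "txn": 2, "key": "b", "old": 7, "new": 9},
    # crash: txn 2 never committed
]

pages = {"a": 100, "b": 7}  # on-disk state as of the last checkpoint

# Phase 1: Analysis -- classify every transaction seen in the WAL
state = {}
for r in wal:
    if r["type"] == "BEGIN":
        state[r["txn"]] = "ACTIVE"
    if r["type"] == "COMMIT":
        state[r["txn"]] = "COMMITTED"

# Phase 2: Redo -- replay all changes, committed or not
for r in wal:
    if r["type"] == "UPDATE":
        pages[r["key"]] = r["new"]

# Phase 3: Undo -- reverse in-flight transactions using logged old values
for r in reversed(wal):
    if r["type"] == "UPDATE" and state[r["txn"]] == "ACTIVE":
        pages[r["key"]] = r["old"]

print(pages)  # {'a': 50, 'b': 7}
```

Transaction 1's committed update survives; transaction 2's partial update is rolled back, leaving `b` at its pre-crash value.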
WAL and replication
PostgreSQL streaming replication works by shipping WAL records from the primary to standbys. The standbys apply the same records in order. Because WAL is the single source of truth for all changes, a standby that has applied all WAL records up to LSN X has exactly the same data state as the primary at LSN X.
This is not a coincidence. Crash recovery and replication are the same operation: apply WAL records to bring a data store to a consistent state.
flowchart TD
subgraph Primary["Primary DB"]
W["Write transactions\ncreate WAL records"]
WAL_P["WAL segment file\nLSN 0 → current"]
Pages_P[("Data pages\nupdated at checkpoint")]
end
subgraph Standby1["Standby 1 (sync)"]
WAL_S1["WAL receiver\napplies records in order"]
Pages_S1[("Data pages\nmirror of primary")]
end
subgraph Standby2["Standby 2 (async)"]
WAL_S2["WAL receiver\nmay lag primary"]
Pages_S2[("Data pages\neventually consistent")]
end
W -->|"Log record\nbefore data page change"| WAL_P
WAL_P -->|"Stream WAL records\nsync: waits for ack"| WAL_S1
WAL_P -.->|"Stream WAL records\nasync: fire-and-forget"| WAL_S2
WAL_S1 -->|"Apply"| Pages_S1
WAL_S2 -->|"Apply"| Pages_S2
Synchronous replication: The primary waits for at least one standby to confirm it received the WAL record before acknowledging the commit to the client. Zero data loss on failover but adds network latency to every commit.
Asynchronous replication: The primary commits immediately without waiting for standbys. Lower commit latency but standbys may lag by seconds. If the primary fails before a standby catches up, those seconds of changes are lost.
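A back-of-envelope comparison makes the tradeoff concrete. Both latency figures below are illustrative assumptions, not measurements:

```python
# Rough commit-latency arithmetic for the two replication modes.
# Both numbers are illustrative assumptions, not measurements.
local_fsync_ms = 0.2        # NVMe-class fsync on the primary
cross_region_rtt_ms = 80.0  # round trip to a remote standby

async_commit_ms = local_fsync_ms                       # ack after local fsync
sync_commit_ms = local_fsync_ms + cross_region_rtt_ms  # ack after standby ack

print(f"async: {async_commit_ms} ms, sync: {sync_commit_ms} ms")
```

With these assumptions, synchronous replication makes every commit roughly 400 times slower than the local fsync alone, which is why cross-region synchronous standbys are used sparingly.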
Checkpoints
WAL records accumulate indefinitely. The mechanism for bounding WAL size and capping recovery time is the checkpoint.
During a checkpoint:
- All dirty data pages in the buffer pool are flushed to disk.
- A checkpoint record is written to the WAL with the current LSN.
After a checkpoint, WAL records before the checkpoint LSN are no longer needed for recovery (the data pages are already up to date). They can be archived or deleted.
WAL timeline:
... [LSN 100] [LSN 101] ... [CHECKPOINT LSN 500] ... [LSN 850] ← crash here
Recovery: only needs to replay LSN 500 → 850.
Everything before the checkpoint is already in data pages.
Checkpoints are expensive: flushing all dirty pages at once spikes disk I/O. PostgreSQL spreads checkpoint writes over checkpoint_completion_target (default 0.9) of the time between checkpoints to smooth the I/O.
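A toy checkpoint over an in-memory log shows the truncation mechanic (all names are illustrative): flush the dirty pages, then drop every WAL entry before the checkpoint LSN.

```python
# Toy checkpoint: flush dirty pages, then truncate the WAL prefix.
# All names and numbers are illustrative.

wal = [(lsn, f"change-{lsn}") for lsn in range(100, 860, 50)]
dirty_pages = {"a": 1, "b": 2}
data_pages_on_disk = {}

def checkpoint(wal, current_lsn):
    data_pages_on_disk.update(dirty_pages)  # flush all dirty pages
    dirty_pages.clear()
    # everything before current_lsn is now redundant for recovery
    return [(lsn, rec) for lsn, rec in wal if lsn >= current_lsn]

wal = checkpoint(wal, current_lsn=500)
print(wal[0][0])  # 500 -- recovery now starts here, not at LSN 100
```

After the checkpoint, a crash replays only the surviving suffix of the log, which is exactly why checkpoint frequency bounds recovery time.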
Production usage
| System | WAL mechanism | Notable behavior |
|---|---|---|
| PostgreSQL | pg_wal/ directory; fsync = on; wal_level = replica for replication | checkpoint_completion_target = 0.9 smooths I/O; synchronous_commit = local gives primary durability without blocking on standby latency |
| MySQL InnoDB | ib_logfile0 / ib_logfile1 (redo log); binary log for replication | innodb_flush_log_at_trx_commit = 1 is the durable default; value 2 flushes once per second and can improve write throughput 30-50% at the cost of up to 1 second of data loss on crash |
| SQLite | WAL mode (PRAGMA journal_mode=WAL); separate -wal file | WAL mode allows concurrent readers alongside one writer; the -wal file is checkpointed back to the main database file periodically |
| CockroachDB | Pebble storage engine writes a WAL ahead of its memtable | Raft consensus ensures all replicas agree on the log record order before the commit is acknowledged to the client |
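Of the systems in the table, SQLite is the easiest to poke at directly. A quick sketch using Python's built-in sqlite3 module shows the companion -wal file appearing once WAL mode is enabled:

```python
import os
import sqlite3
import tempfile

# Enable SQLite's WAL mode and observe the companion -wal file.
db_path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(db_path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # "wal"

conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 950)")
conn.commit()

# Uncheckpointed changes live in the separate -wal file
wal_exists = os.path.exists(db_path + "-wal")

# A checkpoint moves WAL contents back into the main database file
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
print(mode, wal_exists)
```

The `wal_checkpoint(TRUNCATE)` at the end is the same checkpoint idea as above: once the -wal contents are merged into the main file, the log can be reset.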
When WAL becomes a bottleneck
- Synchronous replication blocks on cross-region latency. With `synchronous_commit = on` and a standby in another region (80 ms away), every write adds 80 ms of round-trip latency. Use `synchronous_commit = local` to commit immediately on the primary with async replication to remote standbys.
- WAL on network-mounted storage is unsafe. `fsync` on NFS can silently succeed on the client while data is not yet on the server's storage. The PostgreSQL documentation explicitly warns against placing WAL on NFS. Keep WAL on local NVMe.
- Checkpoint storms compete with user queries. Frequent checkpoints (low `checkpoint_timeout`) flush all dirty pages rapidly, spiking disk I/O. On write-heavy workloads this competes with ongoing queries. Use `checkpoint_completion_target = 0.9` and monitor `pg_stat_bgwriter` to verify checkpoint frequency is not excessive.
- WAL accumulation during replica lag. If a streaming standby falls far behind, the primary cannot delete old WAL segments because the standby might still need them. WAL fills the disk. Use `wal_keep_size` and replication slots with caution; monitor replica lag in production.
Choosing WAL vs alternatives
flowchart TD
Start{"Do you need crash\ndurability for writes?"}
Start -->|"No (e.g., caches)"| NoWAL["In-memory store\nno durability needed\nRedis with persistence off"]
Start -->|"Yes"| DataMutation{"What type of\ndata mutation?"}
DataMutation -->|"In-place updates\n(B-tree pages)"| WALBTree["WAL + B-tree\nPostgreSQL, InnoDB\nupdate in place,\nlog before write"]
DataMutation -->|"Append-only writes\n(time-series, logs)"| LSMOption{"High write\nthroughput?"}
LSMOption -->|"Yes"| LSM["LSM tree + WAL\nCassandra, LevelDB, RocksDB\nmemtable + WAL,\nperiodic compaction"]
LSMOption -->|"No"| WALBTree
DataMutation -->|"Embedded / mobile"| WALSQLite["SQLite WAL mode\nConcurrent readers + writer\nno server process needed"]
WALBTree --> Replicated{"Need streaming\nreplication?"}
Replicated -->|"Yes"| Stream["WAL shipping\nPostgreSQL streaming\nor logical replication"]
Replicated -->|"No"| LocalWAL["Local WAL only\nno archive needed"]
Interview cheat sheet
- The WAL is an append-only file. Every change is logged before the data page is touched. "Write-ahead" is the contract.
- `fsync()` is the system call that makes durability real. Without it, a power failure loses data even if the write succeeded.
- Group commit batches multiple transactions into a single fsync, which is why high-concurrency workloads saturate commit throughput less than you'd expect.
- Crash recovery replays WAL records from the last checkpoint. Checkpoint frequency controls recovery time.
- Streaming replication ships WAL records from primary to standbys. Synchronous replication waits for standby ack, giving zero data loss at the cost of added latency.
- `fsync = off` in PostgreSQL makes it faster in testing but produces unrecoverable data corruption in crashes. Never use it in production.
- Synchronous replication with a remote standby adds one network round-trip to every committed write. When latency matters, use `synchronous_commit = local` to persist on the primary immediately and replicate asynchronously.
- WAL segments must be retained until all streaming standbys have consumed them. Misconfigured replication slots can silently fill the primary's disk with uncollected WAL.
Quick recap
- The write-ahead log records every change before the data page is touched, giving the database a complete history of all modifications.
- `fsync()` is the only mechanism that makes WAL records survive a power failure. Without it, durability is an illusion.
- Group commit batches concurrent transactions into a single fsync, which is why high-concurrency systems can sustain high write throughput despite fsync overhead.
- Crash recovery replays WAL records from the last checkpoint. Checkpoint frequency directly controls maximum recovery time.
- Streaming replication ships WAL records from primary to standbys. Synchronous replication achieves zero data loss but blocks the primary if standbys are unreachable.
- Checkpoints are expensive I/O bursts; `checkpoint_completion_target` spreads them out to smooth disk utilization.
Related concepts
- Replication → PostgreSQL streaming replication is WAL shipping. Understanding WAL makes the replication lag, RPO, and failover tradeoffs concrete instead of abstract.
- LSM Trees → LSM trees also use an append-only log (the memtable flush to SSTable). The WAL-then-checkpoint pattern in B-tree databases and the memtable-then-SSTable pattern in LSM engines solve the same durability problem differently.
- Databases → WAL is the mechanism that makes ACID durability (the "D") real. Understanding it gives you the vocabulary to reason about crash safety, point-in-time recovery, and replication lag in production.