MVCC (Multi-Version Concurrency Control)
Learn how MVCC lets readers and writers proceed without blocking each other by keeping multiple row versions, what vacuum does in PostgreSQL, and why MVCC does not eliminate all lock contention.
The problem with simple locking
The naive approach to concurrency: lock a row for reading, lock it for writing. One transaction at a time.
This prevents consistency problems but creates a new one. In a read-heavy system (most web applications), long-running reads block writes and writes block reads. A reporting query that takes 5 seconds to scan 10 million rows holds read locks for 5 seconds. Any write to those rows must wait. Your order processing system pauses because the analytics dashboard is running.
The goal: reads and writes should not interfere with each other. Readers should not wait for writers. Writers should not wait for readers.
MVCC is how every major relational database solves this.
What MVCC is
Multi-Version Concurrency Control keeps multiple versions of each row simultaneously. When a transaction reads a row, it sees the version that was current at the transaction's start, not whatever the current live state is. Writers create new versions rather than modifying rows in place.
Analogy: Think of a Google Doc with revision history. When you open the doc to read it, you see the version from when you opened it. While you're reading, someone else edits the doc, creating a new version. You still see the same version you started with. Neither of you is blocked. When you're done reading, the editor's new version is visible to anyone who opens the doc fresh.
How MVCC works in PostgreSQL
sequenceDiagram
participant T1 as Transaction T1
participant DB as Postgres DB
participant T2 as Transaction T2
participant T3 as Transaction T3
T1->>DB: BEGIN · snapshot at txid=100
T1->>DB: SELECT balance WHERE id=1
DB-->>T1: xmin=50 · balance=500
T2->>DB: BEGIN · UPDATE balance=600
Note over DB: old row: xmin=50, xmax=200<br/>new row: xmin=200, balance=600
T1->>DB: SELECT balance WHERE id=1 (repeat)
DB-->>T1: still xmin=50 · balance=500
Note over T1,DB: T1 snapshot frozen at txid=100<br/>xmin=200 row invisible to T1
T2->>DB: COMMIT
T1->>DB: SELECT balance WHERE id=1 (repeat)
DB-->>T1: still balance=500
T1->>DB: COMMIT
T3->>DB: BEGIN · snapshot after T2 commit
T3->>DB: SELECT balance WHERE id=1
DB-->>T3: xmin=200 ยท balance=600
Note over T3,DB: T3 sees T2 version now visible
Every row in PostgreSQL carries two hidden system columns:
- `xmin`: the transaction ID that created this row version.
- `xmax`: the transaction ID that deleted or updated this row version (0 if the row is still live).
-- You can see these explicitly:
SELECT xmin, xmax, id, name FROM users WHERE id = 1;
-- xmin=1042 xmax=0 id=1 name='Alice'
When a transaction updates a row:
- PostgreSQL does NOT modify the existing row in place.
- It sets the old row's `xmax` to the current transaction ID (logically "deleted").
- It inserts a new row version with `xmin` set to the current transaction ID.
-- Transaction txn_id=5000 runs: UPDATE users SET name='Alice B.' WHERE id=1;
-- Before:
-- xmin=1042 xmax=0 id=1 name='Alice' ← original row, still visible
-- After:
-- xmin=1042 xmax=5000 id=1 name='Alice' ← old version, deleted by txn 5000
-- xmin=5000 xmax=0 id=1 name='Alice B.' ← new version, created by txn 5000
Three concurrent transactions, one row, two versions in the heap. T1 and T3 see different states of the same row because their snapshots were taken at different times. Neither had to wait. No locks were taken for the reads.
How reads see consistent data: Every transaction has a "snapshot" taken at transaction start. A row is visible if:
- `xmin` is ≤ the current transaction's snapshot horizon (the row existed when we started), and
- `xmax` is either 0 (the row is still live) or greater than the snapshot horizon (the row was deleted after we started).
This means a concurrent reader always sees the row as it was at the moment their transaction began, regardless of how many updates happen while they are reading.
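The visibility rule can be sketched as a small function. This is an illustrative simplification: the real PostgreSQL check also consults the commit log and the snapshot's list of in-progress transactions, which are omitted here.

```python
# Simplified sketch of PostgreSQL's tuple visibility rule.
# Assumes every transaction ID at or below the snapshot horizon is
# committed; the real check also consults commit status.

def is_visible(xmin: int, xmax: int, snapshot_horizon: int) -> bool:
    """Return True if this row version is visible to the snapshot."""
    if xmin > snapshot_horizon:
        return False               # created after our snapshot: invisible
    if xmax == 0:
        return True                # still live: visible
    return xmax > snapshot_horizon  # deleted after our snapshot: still visible

# T1's snapshot at txid=100, from the sequence diagram above:
print(is_visible(xmin=50, xmax=200, snapshot_horizon=100))  # True: old version
print(is_visible(xmin=200, xmax=0, snapshot_horizon=100))   # False: T2's new version
```

Both checks must pass: the version existed before the snapshot and was not yet deleted as of the snapshot.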
Isolation levels and what they change
MVCC alone does not define the isolation level. The snapshot rules determine it:
| Isolation level | Snapshot taken at | What it prevents | What it allows |
|---|---|---|---|
| Read Committed | Start of each statement | Dirty reads | Non-repeatable reads, phantom reads |
| Repeatable Read | Start of transaction | Dirty + non-repeatable reads | Write skew; phantoms per the SQL standard (PostgreSQL's snapshot implementation actually prevents them) |
| Serializable | Start of transaction + SSI | All anomalies, including write skew | Nothing (at the cost of serialization aborts that must be retried) |
Read Committed (PostgreSQL default): each statement sees a fresh snapshot. If you run the same query twice in one transaction, you might get different results if another transaction committed between your two queries.
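The non-repeatable read looks like this in a transcript. A minimal sketch assuming an accounts table where row 1 starts at $100:

```sql
-- Transaction A: READ COMMITTED (the default)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees $100
-- Transaction B (concurrent): commits UPDATE accounts SET balance=200 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1;  -- now sees $200 (fresh snapshot per statement)
COMMIT;
```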
Repeatable Read: the snapshot is taken at the first statement of the transaction. Repeated reads of the same rows return the same data, even if other transactions commit in between.
-- Transaction A: REPEATABLE READ
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees $100
-- Transaction B (concurrent): commits UPDATE accounts SET balance=200 WHERE id = 1;
SELECT balance FROM accounts WHERE id = 1; -- still sees $100 (same snapshot!)
COMMIT;
The VACUUM problem
MVCC creates old row versions every time a row is updated or deleted. These old versions are needed by transactions that still have them in their snapshots. Once no active transaction needs the old version, it becomes "dead" and takes up space without serving any purpose.
Without cleanup, tables grow indefinitely. Index entries for dead rows remain. Every query scans dead rows it must then filter. Eventually, a table with heavy updates has more dead rows than live ones.
VACUUM is PostgreSQL's background process that reclaims dead row space. It:
- Scans the table for row versions whose `xmax` is set, whose deleting transaction is known committed, and that no active snapshot can still see.
- Marks the space those versions occupied as free for reuse.
- Removes the corresponding dead entries from the table's indexes. (Regular VACUUM works in place and keeps the freed space for reuse; VACUUM FULL rewrites the whole table and returns the space to the operating system.)
Table state after 1 million updates without VACUUM:
Live rows: 1,000,000
Dead rows: 1,000,000 ← same number, wasted space
Table bloat: 2x expected size
After VACUUM:
Live rows: 1,000,000
Dead rows: 0
Table size: Expected size (approximately)
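You can watch dead-tuple accumulation yourself through the pg_stat_user_tables statistics view, and trigger a manual cleanup:

```sql
-- How many dead row versions has each table accumulated?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Reclaim dead rows in one table (does not block readers or writers):
VACUUM (VERBOSE) users;
```

A high n_dead_tup with a stale last_autovacuum is the classic sign that autovacuum needs tuning for that table.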
Autovacuum must not be disabled
PostgreSQL runs autovacuum automatically. Disabling it (or setting vacuum thresholds too high) causes table bloat, increases scan times, and eventually leads to transaction ID wraparound, a catastrophic condition in which PostgreSQL refuses all writes to prevent data corruption. Never disable autovacuum in production.
Transaction ID wraparound
PostgreSQL uses 32-bit transaction IDs (xmin, xmax). With ~4 billion possible values, and systems performing thousands of transactions per second, IDs eventually wrap around.
When a transaction ID wraps, old rows with xmin values in the "future" relative to the wrapped counter would appear as not-yet-created. PostgreSQL prevents this by periodically freezing old rows: replacing their xmin with a special "frozen" value that is always visible to any transaction.
VACUUM handles freezing. If autovacuum falls behind (because it's disabled, under-resourced, or tables are growing faster than it can clean), PostgreSQL first logs warnings and then, as the wraparound point nears, stops assigning new transaction IDs entirely; recovering requires the administrator to run an aggressive VACUUM, historically in single-user mode.
In practice: if autovacuum is running correctly, you never see this. It becomes a problem when autovacuum is disabled "for performance" or when write volume exceeds the vacuum worker's capacity to keep up.
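The standard way to monitor wraparound risk is to check each database's oldest unfrozen transaction ID. Autovacuum begins forced freezing once this age exceeds autovacuum_freeze_max_age (default 200 million):

```sql
-- Age of the oldest unfrozen transaction ID in each database.
-- Values approaching ~2 billion mean forced shutdown is imminent;
-- a healthy autovacuum keeps this well below 200 million.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```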
Long-running transactions delay VACUUM and cause table bloat
VACUUM cannot reclaim a dead row version if any open transaction has a snapshot older than that version's deletion. A single analytics query running for 8 hours prevents VACUUM from reclaiming any dead row created during those 8 hours. On a system processing 10,000 updates per second, that is 288 million dead rows accumulating on disk. The fix: set idle_in_transaction_session_timeout to a reasonable limit (for example, 30 minutes), monitor pg_stat_activity ordered by xact_start, and terminate long-running transactions that are no longer doing useful work.
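The monitoring and the timeout mentioned above look like this in practice:

```sql
-- Find transactions that have been open for more than 30 minutes:
SELECT pid, state, xact_start, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '30 minutes'
ORDER BY xact_start;

-- Automatically kill sessions that sit idle inside an open transaction:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30min';
SELECT pg_reload_conf();
```

The timeout only terminates sessions that are idle inside a transaction; an actively running query must be cancelled separately (pg_cancel_backend or pg_terminate_backend).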
What MVCC does not solve
Write-write conflicts. Two transactions updating the same row must still coordinate. MVCC does not eliminate the need for row-level locks when two writers target the same row. The second writer waits for the first to commit or roll back.
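A sketch of that write-write contention, two sessions targeting the same row:

```sql
-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;  -- takes the row lock

-- Session 2 (concurrent):
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- blocks here...

-- Session 1:
COMMIT;  -- ...Session 2's UPDATE now proceeds against the committed version
```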
Serializable anomalies. At Repeatable Read isolation, write skew is possible. Two transactions each read state, make independent decisions based on what they see, then write. If both reads produced the same consistent snapshot but both writes together produce an invalid state, you have write skew. Only the Serializable isolation level (SSI in PostgreSQL) prevents this.
-- Write skew example: each doctor checks if enough doctors are on call
-- Doctor A: SELECT count(*) FROM on_call WHERE shift='night'; -- sees 2
-- Doctor B: SELECT count(*) FROM on_call WHERE shift='night'; -- sees 2
-- Doctor A: DELETE FROM on_call WHERE doctor_id=1 AND shift='night'; -- now 1
-- Doctor B: DELETE FROM on_call WHERE doctor_id=2 AND shift='night'; -- now 0
-- Invariant violated: 0 doctors on call, but each transaction saw 2 when it started
MVCC does not prevent write skew at Repeatable Read
Write skew happens when two transactions each see a consistent snapshot, make individually valid decisions, and their combined writes violate a shared invariant. Real examples: overselling the last inventory item, double-booking a time slot, two on-call doctors simultaneously going off shift. Use SELECT ... FOR UPDATE to lock the rows you intend to modify, or use SERIALIZABLE isolation to let PostgreSQL detect and abort the conflicting transaction. Read Committed and Repeatable Read do not prevent write skew.
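One way to close the write-skew window in the on-call example is explicit locking. A sketch assuming the same on_call table; note that PostgreSQL rejects FOR UPDATE combined with aggregates, so you lock the rows first and count in the application:

```sql
BEGIN;
-- Lock every night-shift row; a concurrent transaction running the same
-- check blocks here until we commit:
SELECT doctor_id FROM on_call WHERE shift = 'night' FOR UPDATE;
-- Application counts the rows returned and proceeds only if >= 2:
DELETE FROM on_call WHERE doctor_id = 1 AND shift = 'night';
COMMIT;
```

FOR UPDATE locks only rows that already exist, so this protects delete-based invariants like the one above; invariants threatened by concurrent inserts still need SERIALIZABLE.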
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| PostgreSQL | Full MVCC for all reads and writes | Dead tuples accumulate on every UPDATE and DELETE. VACUUM reclaims space. Autovacuum runs automatically but must be tuned for high-write tables. |
| MySQL InnoDB | MVCC with undo log for read consistency | Old versions stored in the undo log, not in the main table file. Purge thread cleans up after transactions close. |
| CockroachDB | Distributed MVCC with HLC timestamps | Each key stores multiple timestamped versions. Garbage collection runs at configurable intervals to remove versions older than the GC TTL (default 25 hours). |
| Oracle | Multi-version read consistency via undo segments | "Read consistency" is Oracle's branding for the same mechanism. Queries always see a consistent snapshot as of the query start time. |
Choosing the right isolation level
flowchart TD
Q1{"Do reads and writes need\nto proceed without blocking?"}
Q1 -->|"Yes (most web apps)"| MVCC["MVCC-based isolation\n(PostgreSQL, MySQL InnoDB)"]
Q1 -->|"No, strict serialization OK"| Lock["SERIALIZABLE with shared locks\n(rare, high contention)"]
MVCC --> Q2{"Same-row reads must stay\nconsistent within a transaction?"}
Q2 -->|"No, freshest data\non each statement"| RC["READ COMMITTED\n(PostgreSQL default)\nFast, allows non-repeatable reads"]
Q2 -->|"Yes, snapshot-stable reads"| Q3{"Check-then-act patterns\non shared rows?"}
Q3 -->|"No"| RR["REPEATABLE READ\nSnapshot at txn start\nWrite skew still possible"]
Q3 -->|"Yes, must prevent\nwrite skew"| Q4{"Tolerate transaction\naborts and retries?"}
Q4 -->|"Yes"| Ser["SERIALIZABLE (SSI)\nAll anomalies prevented\nRequires retry on abort"]
Q4 -->|"No"| ForUpdate["REPEATABLE READ\n+ SELECT FOR UPDATE\nManual invariant protection"]
The practical default: READ COMMITTED for most OLTP. Add SELECT FOR UPDATE when you need check-then-act atomicity on specific rows. Use SERIALIZABLE only when the invariant spans multiple rows in a way that explicit locking cannot cleanly address.
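Because SERIALIZABLE aborts conflicting transactions, the application needs a retry loop. A generic sketch: with a real driver such as psycopg you would catch its serialization error (SQLSTATE 40001) instead of the placeholder exception class defined here, and txn_fn would open and commit its own transaction.

```python
# Generic retry-on-serialization-failure helper. SerializationFailure is a
# placeholder for the driver's SQLSTATE-40001 error class.

class SerializationFailure(Exception):
    """Stands in for the driver's serialization-failure error."""

def run_serializable(txn_fn, max_attempts: int = 5):
    """Run txn_fn, retrying if the database aborts it with a serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after too many conflicts

# Demo: a transaction that conflicts twice, then succeeds.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_serializable(flaky_txn))  # committed on the third attempt
```

Production versions usually add exponential backoff with jitter between attempts so that repeatedly conflicting transactions do not re-collide immediately.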
Interview cheat sheet
- MVCC stores multiple row versions simultaneously. Readers see the version current at their transaction start. Writers create new versions rather than modifying in place.
- The core benefit: reads never block writes; writes never block reads. This is the correct answer to "how does PostgreSQL serve high concurrency without lock contention?"
- Dead rows accumulate from updates/deletes and must be reclaimed by VACUUM. Table bloat = query slowdown.
- Autovacuum is not optional. Disabling it leads to bloat and eventually transaction ID wraparound, which stops all writes.
- MVCC does not prevent write-write conflicts. Two transactions writing the same row still contend for a row-level lock.
- Write skew is possible at Repeatable Read. Use Serializable isolation to prevent it, or use explicit locking (`SELECT FOR UPDATE`).
Quick recap
- MVCC stores multiple row versions per row. Readers see the snapshot from their transaction start, writers create new versions rather than modifying in place.
- The core property: readers never block writers; writers never block readers. This is the mechanism behind PostgreSQL's high read/write concurrency.
- Dead row versions accumulate and must be reclaimed by VACUUM. Without autovacuum, tables bloat and eventually transaction ID wraparound stops all writes.
- Long-running transactions hold old snapshots, preventing VACUUM from reclaiming dead rows created during that transaction's lifetime.
- MVCC does not prevent write-write conflicts on the same row, or write skew at Repeatable Read isolation.
- Serializable isolation (SSI) prevents all anomalies including write skew, at the cost of transaction aborts that require application retry logic.
Related concepts
- Database Locking โ MVCC reduces lock contention but does not eliminate it. Row-level locks, table locks, and advisory locks interact with MVCC in ways that are essential to understand for write-heavy workloads.
- B-tree Indexes โ Dead row versions from MVCC also create dead index entries. VACUUM maintains both the heap (table data) and the associated indexes. Understanding this explains index bloat.
- Databases โ MVCC is one of the fundamental mechanisms that distinguishes production-grade databases from simple key-value stores. It enables the "D" in ACID without sacrificing concurrent read throughput.