๐Ÿ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Engineering Internals

MVCC (Multi-Version Concurrency Control)

Learn how MVCC lets readers and writers proceed without blocking each other by keeping multiple row versions, what vacuum does in PostgreSQL, and why MVCC does not eliminate all lock contention.

19 min read · 2026-03-28 · hard · tags: mvcc, databases, postgresql, transactions, concurrency

The problem with simple locking

The naive approach to concurrency: lock a row for reading, lock it for writing. One transaction at a time.

This prevents consistency problems but creates a new one. In a read-heavy system (most web applications), long-running reads block writes and writes block reads. A reporting query that takes 5 seconds to scan 10 million rows holds read locks for 5 seconds. Any write to those rows must wait. Your order processing system pauses because the analytics dashboard is running.

The goal: reads and writes should not interfere with each other. Readers should not wait for writers. Writers should not wait for readers.

MVCC is how every major relational database solves this.


What MVCC is

Multi-Version Concurrency Control keeps multiple versions of each row simultaneously. When a transaction reads a row, it sees the version that was current at the transaction's start, not whatever the current live state is. Writers create new versions rather than modifying rows in place.

Analogy: Think of a Google Doc with revision history. When you open the doc to read it, you see the version from when you opened it. While you're reading, someone else edits the doc, creating a new version. You still see the same version you started with. Neither of you is blocked. When you're done reading, the editor's new version is visible to anyone who opens the doc fresh.


How MVCC works in PostgreSQL

sequenceDiagram
    participant T1 as Transaction T1
    participant DB as Postgres DB
    participant T2 as Transaction T2
    participant T3 as Transaction T3

    T1->>DB: BEGIN · snapshot at txid=100
    T1->>DB: SELECT balance WHERE id=1
    DB-->>T1: xmin=50 · balance=500

    T2->>DB: BEGIN · UPDATE balance=600
    Note over DB: old row: xmin=50, xmax=200<br/>new row: xmin=200, balance=600

    T1->>DB: SELECT balance WHERE id=1 (repeat)
    DB-->>T1: still xmin=50 · balance=500
    Note over T1,DB: T1 snapshot frozen at txid=100<br/>xmin=200 row invisible to T1

    T2->>DB: COMMIT

    T1->>DB: SELECT balance WHERE id=1 (repeat)
    DB-->>T1: still balance=500
    T1->>DB: COMMIT

    T3->>DB: BEGIN · snapshot after T2 commit
    T3->>DB: SELECT balance WHERE id=1
    DB-->>T3: xmin=200 · balance=600
    Note over T3,DB: T3 sees T2 version now visible

Every row in PostgreSQL carries two hidden system columns:

  • xmin: the transaction ID that created this row version.
  • xmax: the transaction ID that deleted or updated this row (0 if the row is still live).
-- You can see these explicitly:
SELECT xmin, xmax, id, name FROM users WHERE id = 1;
-- xmin=1042   xmax=0   id=1   name='Alice'

When a transaction updates a row:

  1. PostgreSQL does NOT modify the existing row in place.
  2. It marks the old row's xmax with the current transaction ID (logically "deleted").
  3. It inserts a new row version with xmin set to the current transaction ID.
-- Transaction txn_id=5000 runs: UPDATE users SET name='Alice B.' WHERE id=1;

-- Before:
-- xmin=1042  xmax=0     id=1  name='Alice'     โ† original row, still visible

-- After:
-- xmin=1042  xmax=5000  id=1  name='Alice'     โ† old version, deleted by txn 5000
-- xmin=5000  xmax=0     id=1  name='Alice B.'  โ† new version, created by txn 5000
Who sees which version:

  • Row v1 (xmin=1042, xmax=5000, name=Alice, dead): visible to T1, whose snapshot was taken before T2's update, because xmin=1042 is within its snapshot and the deleting txid 5000 is not.
  • Row v2 (xmin=5000, xmax=0, name=Alice B., live): visible to T3 (txid=6000), which started after T2 committed.
  • T2 (txid=5000, the updater) is the transaction that stamped xmax=5000 on v1 and inserted v2.

Three concurrent transactions, one row, two versions in the heap. T1 and T3 see different states of the same row because their snapshots were taken at different times. Neither had to wait. No locks were taken for the reads.
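The update mechanics above can be sketched as a toy model. This is illustrative Python, not PostgreSQL internals; all class and method names here are invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class RowVersion:
    xmin: int   # transaction that created this version
    xmax: int   # transaction that superseded it (0 = still live)
    data: dict

class Heap:
    """Append-only heap of row versions, in the spirit of PostgreSQL's table heap."""
    def __init__(self):
        self.versions = []

    def insert(self, txid, data):
        self.versions.append(RowVersion(xmin=txid, xmax=0, data=data))

    def update(self, txid, row_id, new_data):
        # An MVCC update never modifies the live version in place:
        # it stamps the old version's xmax, then appends a new version.
        for v in self.versions:
            if v.data["id"] == row_id and v.xmax == 0:
                v.xmax = txid
                break
        self.versions.append(RowVersion(xmin=txid, xmax=0, data=new_data))

heap = Heap()
heap.insert(1042, {"id": 1, "name": "Alice"})
heap.update(5000, 1, {"id": 1, "name": "Alice B."})
# heap.versions now holds both the dead 'Alice' version and the live 'Alice B.'
```

Note what the update did NOT do: it never touched the bytes of the original version, which is exactly why concurrent readers holding older snapshots are unaffected.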

How reads see consistent data: every transaction has a "snapshot" taken at transaction start. A row version is visible if:

  • xmin belongs to a transaction that committed before the snapshot was taken (the version existed when we started), and
  • xmax is either 0 (the version is still live) or belongs to a transaction that had not committed when the snapshot was taken (deleted after we started).

This means a concurrent reader always sees the row as it was at the moment their transaction began, regardless of how many updates happen while they are reading.
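The two visibility conditions can be expressed as a small predicate. This is a simplified sketch: real PostgreSQL also tracks in-progress transactions, subtransactions, and hint bits, none of which are modeled here.

```python
def visible(version_xmin, version_xmax, snapshot_txid, committed):
    """Is this row version visible to a snapshot taken at snapshot_txid?
    'committed' is the set of txids known committed (toy stand-in for clog)."""
    # Condition 1: the creator must have committed within our snapshot horizon
    if version_xmin > snapshot_txid or version_xmin not in committed:
        return False
    # Condition 2: the version must not have been deleted within our horizon
    if version_xmax != 0 and version_xmax <= snapshot_txid and version_xmax in committed:
        return False
    return True

committed = {50, 200}
# T1's snapshot at txid=100 (matches the sequence diagram above):
assert visible(50, 200, 100, committed)       # old version still visible to T1
assert not visible(200, 0, 100, committed)    # new version created after T1's snapshot
# T3's snapshot at txid=300, after T2 committed:
assert not visible(50, 200, 300, committed)   # old version now dead for T3
assert visible(200, 0, 300, committed)        # T3 sees the new version
```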


Isolation levels and what they change

MVCC alone does not define the isolation level. The snapshot rules determine it:

  • Read Committed: snapshot taken at the start of each statement. Prevents dirty reads; allows non-repeatable reads and phantom reads.
  • Repeatable Read: snapshot taken at the start of the transaction. Prevents dirty and non-repeatable reads, and in PostgreSQL also phantoms; allows write skew (serialization anomalies).
  • Serializable: snapshot at transaction start plus SSI conflict detection. Prevents all anomalies; conflicting transactions abort and must be retried.

Read Committed (PostgreSQL default): each statement sees a fresh snapshot. If you run the same query twice in one transaction, you might get different results if another transaction committed between your two queries.

Repeatable Read: the snapshot is taken at the first statement of the transaction. Repeated reads of the same rows return the same data, even if other transactions commit in between.

-- Transaction A: REPEATABLE READ
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees $100

-- Transaction B (concurrent): commits UPDATE accounts SET balance=200 WHERE id = 1;

SELECT balance FROM accounts WHERE id = 1;  -- still sees $100 (same snapshot!)
COMMIT;
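The difference between the two levels comes down to when the snapshot is taken. A toy model (illustrative only; a single account's committed history as (commit_txid, value) pairs):

```python
# Committed history of one account's balance
history = [(10, 100)]   # balance=100, committed by txid 10

def read(snapshot_txid):
    """Return the newest value committed at or before the snapshot."""
    seen = [(t, v) for t, v in history if t <= snapshot_txid]
    return max(seen)[1] if seen else None

# REPEATABLE READ: the snapshot is fixed once, at transaction start
rr_snapshot = 15
first_read = read(rr_snapshot)        # 100

# A concurrent transaction commits balance=200 as txid 20
history.append((20, 200))

# Re-reading under the SAME snapshot (Repeatable Read) still yields 100;
# taking a FRESH snapshot per statement (Read Committed) yields 200.
second_read_rr = read(rr_snapshot)    # 100
second_read_rc = read(25)             # 200
```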

The VACUUM problem

MVCC creates old row versions every time a row is updated or deleted. These old versions are needed by transactions that still have them in their snapshots. Once no active transaction needs the old version, it becomes "dead" and takes up space without serving any purpose.

Without cleanup, tables grow indefinitely. Index entries for dead rows remain. Every query scans dead rows it must then filter. Eventually, a table with heavy updates has more dead rows than live ones.

VACUUM is PostgreSQL's cleanup mechanism for reclaiming dead row space, run manually or (usually) by the autovacuum background workers. It:

  1. Scans the table for row versions whose xmax belongs to a committed transaction and that no active snapshot can still see.
  2. Marks the space those versions occupied as free for reuse.
  3. Removes the matching dead index entries in place. (VACUUM FULL goes further and rewrites the whole table and its indexes, at the cost of an exclusive lock.)
Table state after 1 million updates without VACUUM:
  Live rows:     1,000,000
  Dead rows:     1,000,000  โ† same number, wasted space
  Table bloat:   2x expected size

After VACUUM:
  Live rows:     1,000,000
  Dead rows:     0
  Table size:    Expected size (approximately)
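The "no active snapshot needs it" condition from step 1 can be sketched as follows. This is a toy model: oldest_active_snapshot stands in for PostgreSQL's oldest-xmin horizon, and real VACUUM handles many more cases.

```python
def reclaimable(version_xmax, committed, oldest_active_snapshot):
    """A dead version is reclaimable once its deleter has committed and
    even the oldest active snapshot started after that deletion."""
    return (version_xmax != 0
            and version_xmax in committed
            and version_xmax < oldest_active_snapshot)

committed = {5000}
# Old 'Alice' version, deleted by committed txid 5000:
assert not reclaimable(5000, committed, oldest_active_snapshot=4000)  # old reader still open
assert reclaimable(5000, committed, oldest_active_snapshot=6000)      # safe to reclaim
assert not reclaimable(0, committed, oldest_active_snapshot=6000)     # live row, never reclaimed
```

The first assertion is exactly the long-running-transaction problem discussed below: one old snapshot pins every dead version newer than it.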

Autovacuum must not be disabled

PostgreSQL runs autovacuum automatically. Disabling it (or setting vacuum thresholds too high) causes table bloat and growing scan times, and eventually risks transaction ID wraparound, at which point PostgreSQL refuses all writes to prevent data corruption. Never disable autovacuum in production.


Transaction ID wraparound

PostgreSQL uses 32-bit transaction IDs (xmin, xmax). That gives ~4 billion possible values, but because IDs are compared circularly, only about 2 billion can be "in the past" at any moment. With systems performing thousands of transactions per second, the counter eventually wraps around.

When a transaction ID wraps, old rows with xmin values in the "future" relative to the wrapped counter would appear as not-yet-created. PostgreSQL prevents this by periodically freezing old rows: replacing their xmin with a special "frozen" value that is always visible to any transaction.
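The circular comparison behind wraparound can be sketched like this. It is in the spirit of PostgreSQL's TransactionIdPrecedes, not its actual implementation (which also special-cases frozen and bootstrap IDs):

```python
def precedes(a, b, bits=32):
    """Circular txid comparison: a is 'older than' b if the forward
    distance from a to b is less than half the id space."""
    half = 1 << (bits - 1)
    return 0 < (b - a) % (1 << bits) < half

assert precedes(100, 200)           # ordinary case
assert precedes(2**32 - 5, 10)      # wrapped: a pre-wrap id still precedes a post-wrap one
assert not precedes(10, 2**32 - 5)
```

The half-space rule is why only ~2 billion IDs can be "in the past" at once, and why rows must be frozen before the counter laps them.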

VACUUM handles freezing. If autovacuum falls behind (because it is disabled, under-resourced, or tables are growing faster than it can clean), PostgreSQL first emits warnings and then stops accepting new write transactions; the administrator must run VACUUM (historically in single-user mode) to freeze old rows before writes can resume.

In practice: if autovacuum is running correctly, you never see this. It becomes a problem when autovacuum is disabled "for performance" or when write volume exceeds the vacuum worker's capacity to keep up.

Long-running transactions delay VACUUM and cause table bloat

VACUUM cannot reclaim a dead row version if any open transaction has a snapshot older than that version's deletion. A single analytics query running for 8 hours prevents VACUUM from reclaiming any dead row created during those 8 hours. On a system processing 10,000 updates per second, that is 288 million dead rows accumulating on disk. The fix: set idle_in_transaction_session_timeout to a reasonable limit (for example, 30 minutes), monitor pg_stat_activity (ordered by xact_start) for old transactions, and terminate long-running transactions that are no longer doing useful work.
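The arithmetic behind that figure:

```python
updates_per_second = 10_000
seconds_held = 8 * 3600                      # 8-hour analytics query
dead_rows = updates_per_second * seconds_held  # each update leaves one dead version pinned
assert dead_rows == 288_000_000              # the 288 million dead rows cited above
```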


What MVCC does not solve

Write-write conflicts. Two transactions updating the same row must still coordinate. MVCC does not eliminate the need for row-level locks when two writers target the same row. The second writer waits for the first to commit or roll back.

Serializable anomalies. At Repeatable Read isolation, write skew is possible. Two transactions each read state, make independent decisions based on what they see, then write. If both reads produced the same consistent snapshot but both writes together produce an invalid state, you have write skew. Only the Serializable isolation level (SSI in PostgreSQL) prevents this.

-- Write skew example: each doctor checks if enough doctors are on call
-- Doctor A: SELECT count(*) FROM on_call WHERE shift='night'; -- sees 2
-- Doctor B: SELECT count(*) FROM on_call WHERE shift='night'; -- sees 2
-- Doctor A: DELETE FROM on_call WHERE doctor_id=1 AND shift='night'; -- now 1
-- Doctor B: DELETE FROM on_call WHERE doctor_id=2 AND shift='night'; -- now 0
-- Invariant violated: 0 doctors on call, but each transaction saw 2 when it started
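The same anomaly as a toy simulation (illustrative Python, not SQL semantics): both decisions are made against one shared snapshot, each is individually valid, and the combined writes break the invariant.

```python
# Both transactions read the SAME snapshot: doctors 1 and 2 are on call.
snapshot = {"on_call": {1, 2}}
live_state = {"on_call": {1, 2}}

def can_go_off_call(snap):
    # "Safe to leave only if at least one other doctor would remain on call."
    return len(snap["on_call"]) >= 2

# Doctor A and Doctor B each decide against the same frozen snapshot
decision_a = can_go_off_call(snapshot)   # True: sees 2 on call
decision_b = can_go_off_call(snapshot)   # True: sees 2 on call

# Both writes apply to the live state
if decision_a:
    live_state["on_call"].discard(1)
if decision_b:
    live_state["on_call"].discard(2)

# Invariant (at least one doctor on call) is violated, even though
# each transaction's read-then-write was internally consistent.
assert len(live_state["on_call"]) == 0
```

SELECT ... FOR UPDATE would serialize the two check-then-act sequences; SERIALIZABLE would let the database detect the conflict and abort one transaction.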

MVCC does not prevent write skew at Repeatable Read

Write skew happens when two transactions each see a consistent snapshot, make individually valid decisions, and their combined writes violate a shared invariant. Real examples: overselling the last inventory item, double-booking a time slot, two on-call doctors simultaneously going off shift. Use SELECT ... FOR UPDATE to lock the rows you intend to modify, or use SERIALIZABLE isolation to let PostgreSQL detect and abort the conflicting transaction. Read Committed and Repeatable Read do not prevent write skew.


Production usage

  • PostgreSQL: full MVCC for all reads and writes. Dead tuples accumulate on every UPDATE and DELETE; VACUUM reclaims the space. Autovacuum runs automatically but must be tuned for high-write tables.
  • MySQL InnoDB: MVCC with an undo log for read consistency. Old versions are stored in the undo log, not in the main table file; a purge thread cleans up after transactions close.
  • CockroachDB: distributed MVCC with HLC timestamps. Each key stores multiple timestamped versions; garbage collection runs at configurable intervals to remove versions older than the GC TTL (default 25 hours).
  • Oracle: multi-version read consistency via undo segments. "Read consistency" is Oracle's branding for the same mechanism; queries always see a consistent snapshot as of the query start time.

Choosing the right isolation level

flowchart TD
  Q1{"Do reads and writes need\nto proceed without blocking?"}
  Q1 -->|"Yes (most web apps)"| MVCC["MVCC-based isolation\n(PostgreSQL, MySQL InnoDB)"]
  Q1 -->|"No, strict serialization OK"| Lock["SERIALIZABLE with shared locks\n(rare, high contention)"]
  MVCC --> Q2{"Same-row reads must stay\nconsistent within a transaction?"}
  Q2 -->|"No, freshest data\non each statement"| RC["READ COMMITTED\n(PostgreSQL default)\nFast, allows non-repeatable reads"]
  Q2 -->|"Yes, snapshot-stable reads"| Q3{"Check-then-act patterns\non shared rows?"}
  Q3 -->|"No"| RR["REPEATABLE READ\nSnapshot at txn start\nWrite skew still possible"]
  Q3 -->|"Yes, must prevent\nwrite skew"| Q4{"Tolerate transaction\naborts and retries?"}
  Q4 -->|"Yes"| Ser["SERIALIZABLE (SSI)\nAll anomalies prevented\nRequires retry on abort"]
  Q4 -->|"No"| ForUpdate["REPEATABLE READ\n+ SELECT FOR UPDATE\nManual invariant protection"]

The practical default: READ COMMITTED for most OLTP. Add SELECT FOR UPDATE when you need check-then-act atomicity on specific rows. Use SERIALIZABLE only when the invariant spans multiple rows in a way that explicit locking cannot cleanly address.


Interview cheat sheet

  • MVCC stores multiple row versions simultaneously. Readers see the version current at their transaction start. Writers create new versions rather than modifying in place.
  • The core benefit: reads never block writes; writes never block reads. This is the correct answer to "how does PostgreSQL serve high concurrency without lock contention?"
  • Dead rows accumulate from updates/deletes and must be reclaimed by VACUUM. Table bloat = query slowdown.
  • Autovacuum is not optional. Disabling it leads to bloat and eventually transaction ID wraparound, which stops all writes.
  • MVCC does not prevent write-write conflicts. Two transactions writing the same row still contend for a row-level lock.
  • Write skew is possible at Repeatable Read. Use Serializable isolation to prevent it, or use explicit locking (SELECT FOR UPDATE).


Quick recap

  1. MVCC stores multiple row versions per row. Readers see the snapshot from their transaction start, writers create new versions rather than modifying in place.
  2. The core property: readers never block writers; writers never block readers. This is the mechanism behind PostgreSQL's high read/write concurrency.
  3. Dead row versions accumulate and must be reclaimed by VACUUM. Without autovacuum, tables bloat and eventually transaction ID wraparound stops all writes.
  4. Long-running transactions hold old snapshots, preventing VACUUM from reclaiming dead rows created during that transaction's lifetime.
  5. MVCC does not prevent write-write conflicts on the same row, or write skew at Repeatable Read isolation.
  6. Serializable isolation (SSI) prevents all anomalies including write skew, at the cost of transaction aborts that require application retry logic.

Related concepts

  • Database Locking: MVCC reduces lock contention but does not eliminate it. Row-level locks, table locks, and advisory locks interact with MVCC in ways that are essential to understand for write-heavy workloads.
  • B-tree Indexes: dead row versions from MVCC also create dead index entries. VACUUM maintains both the heap (table data) and the associated indexes. Understanding this explains index bloat.
  • Databases: MVCC is one of the fundamental mechanisms that distinguishes production-grade databases from simple key-value stores. It enables the "I" (isolation) in ACID without sacrificing concurrent read throughput.
