How PostgreSQL executes a query
How PostgreSQL parses SQL, plans execution with the cost-based optimizer, chooses between sequential and index scans, and manages MVCC for concurrent reads and writes.
The Interview Question
Interviewer: "Your dashboard query was fast in staging with 10K rows but takes 12 seconds in production with 50 million rows. You ran EXPLAIN ANALYZE and saw a sequential scan instead of an index scan. Walk me through how PostgreSQL decides which execution plan to use, and why it sometimes picks the 'wrong' one."
This question tests whether you understand what happens between typing SQL and getting rows back. The interviewer wants to hear about the query planner, cost estimation, statistics, and MVCC visibility rules. Surface-level answers like "it uses indexes" will not cut it.
What to Clarify Before Answering
You: "Before I dive in, let me scope this properly..."
- "Should I walk through the full pipeline from SQL text to result rows, or focus specifically on the planner?"
- "Are we talking about a single query on a standalone Postgres instance, or should I cover parallel query execution and replication?"
- "Should I include how MVCC affects which rows the query can see, or keep it to the execution path only?"
- "Is the interviewer interested in the write path (WAL, buffer pool) or the read path only?"
Why this matters: PostgreSQL query execution touches parsing, planning, execution, memory management, concurrency control, and durability. A strong candidate scopes the answer before diving in, showing they understand the breadth of the system.
The 30-Second Answer
PostgreSQL processes every query through a five-stage pipeline: parse, analyze, rewrite, plan, execute. The parser converts SQL text into a parse tree. The analyzer resolves table and column names against the catalog. The rewriter applies view definitions and rules. The cost-based optimizer then generates candidate execution plans, estimates their cost using table statistics (row counts, value distributions, correlation), and picks the cheapest one. Finally, the executor walks the plan tree, fetching rows through sequential scans, index scans, hash joins, or merge joins as the plan dictates. Concurrent access is managed through MVCC, where every tuple carries transaction visibility information (xmin/xmax), and each transaction sees a consistent snapshot of the database.
The Architecture Overview
PostgreSQL follows a process-per-connection model. When a client connects, the postmaster (main process) forks a new backend process dedicated to that connection. This is different from MySQL's thread-per-connection model. The process model provides stronger isolation (a crash in one backend does not affect others) at the cost of higher memory per connection (~5-10MB per backend process). Connection poolers like PgBouncer are essential for applications that need thousands of connections.
When a query arrives, it enters a dedicated backend process (one per connection). The parser tokenizes the SQL and builds a parse tree. The analyzer resolves all names against the system catalog, checking that tables and columns exist. The rewriter expands views and applies rewrite rules. The planner is where the real work happens: it generates multiple candidate plans, estimates costs using statistics from pg_statistic, and picks the cheapest one.
The executor then walks the chosen plan tree node by node. Each node (SeqScan, IndexScan, HashJoin, Sort) is a pull-based iterator: the parent calls ExecProcNode() on its child, which returns one tuple at a time. Data pages are fetched through shared_buffers (a shared memory cache), and writes go through the WAL buffer for durability.
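The pull-based model is easy to see in miniature. The sketch below models plan nodes as Python generators: each node yields one tuple per request from its parent, so a Limit node pulls only as many rows as it needs from the scan below it. The node names are illustrative, not PostgreSQL's actual C API.

```python
# Volcano-style pull execution: each plan node is a generator that yields
# one tuple per "call" from its parent node.

def seq_scan(table):
    """Leaf node: produce table rows one at a time."""
    for row in table:
        yield row

def filter_node(child, predicate):
    """Pull from the child, pass through only matching rows."""
    for row in child:
        if predicate(row):
            yield row

def limit_node(child, n):
    """Stop pulling from the child after n rows."""
    for _, row in zip(range(n), child):
        yield row

table = [{"id": i} for i in range(1_000_000)]
plan = limit_node(filter_node(seq_scan(table), lambda r: r["id"] % 2 == 0), 3)
result = list(plan)
print(result)  # only a handful of rows are ever pulled from the scan
```

Because the scan is lazy, satisfying the LIMIT touches only the first few tuples of the million-row table, which is exactly why `LIMIT` queries can return instantly even over huge relations.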
I will now walk through each stage in detail.
Parse, Analyze, Rewrite: From SQL Text to Query Tree
The first three stages transform raw SQL text into an internal representation the planner can work with.
Parsing
The parser uses a Bison grammar (over 15,000 lines of grammar rules) to convert SQL text into a parse tree. At this stage, Postgres has no idea whether the tables or columns actually exist. It only checks syntax.
-- Input SQL
SELECT u.name, COUNT(o.id)
FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
HAVING COUNT(o.id) > 5;
-- Parser output (simplified parse tree):
-- SelectStmt
-- targetList: [ColumnRef(u.name), FuncCall(COUNT, ColumnRef(o.id))]
-- fromClause: [JoinExpr(users AS u, orders AS o, ON u.id = o.user_id)]
-- whereClause: [OpExpr(>, ColumnRef(u.created_at), Const('2024-01-01'))]
-- groupClause: [ColumnRef(u.name)]
-- havingClause: [OpExpr(>, FuncCall(COUNT, ...), Const(5))]
Analysis
The analyzer resolves every name in the parse tree against the system catalog (pg_class, pg_attribute, pg_type). It verifies that users exists, that users.name is a valid column, that COUNT is a valid aggregate, and that the types are compatible. The output is a query tree with fully resolved OIDs (internal object identifiers).
Rewriting
The rewriter applies transformation rules. The most common use case: if users is actually a view, the rewriter replaces the table reference with the view's underlying query. It also applies row-level security policies and certain system-defined rules.
Why this matters in production
If your query suddenly changes plans after a seemingly unrelated schema change, check whether a view definition changed or a new rule was added. The rewriter transforms the query before the planner ever sees it, so EXPLAIN shows the rewritten version, not your original SQL.
The Cost-Based Optimizer: How Postgres Chooses a Plan
This is the most important part of the entire pipeline. The planner generates candidate execution plans and estimates their cost using a model based on I/O and CPU costs.
How Cost Estimation Works
Every plan node has two cost numbers: startup cost (cost before the first row is returned) and total cost (cost to return all rows). These are measured in abstract cost units that roughly correspond to sequential page reads.
The key tuning knobs:
| Parameter | Default | Meaning |
|---|---|---|
| seq_page_cost | 1.0 | Cost of reading one page sequentially |
| random_page_cost | 4.0 | Cost of reading one random page |
| cpu_tuple_cost | 0.01 | Cost of processing one row |
| cpu_index_tuple_cost | 0.005 | Cost of processing one index entry |
| cpu_operator_cost | 0.0025 | Cost of evaluating one operator |
| effective_cache_size | 4GB | Planner's estimate of total cache available |
The ratio of random_page_cost to seq_page_cost (4:1 by default) is why Postgres often prefers sequential scans on small-to-medium tables. A sequential scan reads pages in order (leveraging OS readahead), while an index scan jumps to random pages.
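You can reproduce the crossover behavior with back-of-the-envelope arithmetic. The sketch below uses the real cost-constant names but a deliberately crude model (it ignores correlation, caching, and the planner's actual index costing formulas): a seq scan pays for every page plus every tuple, while an index scan is modeled as roughly one random page fetch per matching row.

```python
# Simplified plan costing with the default cost constants. This mirrors
# the *shape* of the planner's arithmetic, not its exact formulas.

SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005

def seq_scan_cost(pages, rows):
    # Read every page sequentially, process every tuple.
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows):
    # Crude model: one random heap page fetch per matching row,
    # plus index-entry and tuple processing CPU.
    return matching_rows * (RANDOM_PAGE_COST + CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST)

PAGES, ROWS = 100_000, 10_000_000  # hypothetical large table

for selectivity in (0.001, 0.05, 0.5):
    matching = int(ROWS * selectivity)
    seq, idx = seq_scan_cost(PAGES, ROWS), index_scan_cost(matching)
    winner = "index scan" if idx < seq else "seq scan"
    print(f"{selectivity:>5.1%}: seq={seq:,.0f} idx={idx:,.0f} -> {winner}")
```

Even this toy model lands near the real-world behavior: at 0.1% selectivity the index wins easily, while at 5% the sequential scan is already cheaper, because the 4:1 random-to-sequential cost ratio penalizes scattered heap fetches.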
Sequential Scan vs. Index Scan Decision
The planner uses table statistics to estimate how many rows a WHERE clause will filter. These statistics come from ANALYZE (run automatically by autovacuum or manually):
- n_distinct: Number of distinct values in a column
- most_common_vals / most_common_freqs: The top N values and their frequencies
- histogram_bounds: Equal-frequency histogram dividing the remaining values
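Here is how those statistics turn a WHERE clause into a row estimate. The statistic shapes follow what pg_stats exposes, but the numbers and the exact interpolation are simplified illustrations, not the planner's real code.

```python
# Selectivity estimation from pg_stats-style statistics (illustrative data).

N_ROWS = 1_000_000
most_common_vals  = ["active", "inactive"]
most_common_freqs = [0.70, 0.25]           # these two values cover 95% of rows
n_distinct = 12                             # total distinct values in the column
histogram_bounds = [0, 10, 50, 200, 1000]   # equal-frequency buckets (numeric column)

def eq_selectivity(value):
    """Equality: use the MCV list, else spread leftover mass over the rest."""
    if value in most_common_vals:
        return most_common_freqs[most_common_vals.index(value)]
    leftover = 1.0 - sum(most_common_freqs)
    return leftover / (n_distinct - len(most_common_vals))

def lt_selectivity(c):
    """Range (v < c): each histogram bucket holds an equal fraction of rows;
    interpolate linearly inside the bucket containing c."""
    buckets = len(histogram_bounds) - 1
    for i in range(buckets):
        lo, hi = histogram_bounds[i], histogram_bounds[i + 1]
        if c <= hi:
            return (i + (c - lo) / (hi - lo)) / buckets
    return 1.0

print(int(N_ROWS * eq_selectivity("active")))  # common value: straight from MCVs
print(int(N_ROWS * eq_selectivity("banned")))  # rare value: leftover / remaining distinct
print(int(N_ROWS * lt_selectivity(30)))        # range: 1.5 of 4 equal buckets
```

This is why stale statistics cause bad plans: if the MCV list or histogram no longer reflects the data, every row estimate downstream (and therefore every cost comparison) is wrong.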
Join Strategies
When a query joins two or more tables, the planner chooses between three join algorithms:
| Join Type | Best When | Complexity | Memory Use |
|---|---|---|---|
| Nested Loop | Inner table is small or has an index | O(N * M) or O(N * log M) with index | Minimal |
| Hash Join | No useful index, both tables fit in work_mem | O(N + M) | O(smaller table) |
| Merge Join | Both inputs are already sorted (index or explicit sort) | O(N log N + M log M) | O(sort buffers) |
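A hash join is simple enough to sketch end to end: build a hash table on one input, then probe it with each row of the other. This toy version is in-memory only (a real hash join partitions into batches once the build side exceeds work_mem), and the column names are invented for the example.

```python
# Hash join in miniature: O(M) build phase + O(N) probe phase.

def hash_join(probe_rows, build_rows, probe_key, build_key):
    build = {}
    for row in build_rows:                        # build phase (smaller input)
        build.setdefault(build_key(row), []).append(row)
    for row in probe_rows:                        # probe phase (larger input)
        for match in build.get(probe_key(row), []):
            yield {**row, **match}

users  = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 2, "total": 5},
          {"user_id": 1, "total": 10}]

joined = list(hash_join(orders, users,
                        probe_key=lambda r: r["user_id"],
                        build_key=lambda r: r["id"]))
print(joined)
```

Note the asymmetry: the build side must fit in memory (work_mem in Postgres), which is why the planner puts the smaller estimated input on the build side.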
The planner also considers join order. For joins of fewer than 12 tables, it evaluates all possible orderings using dynamic programming. At 12 tables or more, it switches to the Genetic Query Optimizer (GEQO), which uses a genetic algorithm to search a subset of orderings. This threshold is controlled by geqo_threshold (default 12).
Common trap: work_mem too low
The default work_mem is 4MB. When a hash join's build side exceeds work_mem, Postgres spills to disk in batches, which can be 10-100x slower. If your queries do large joins or sorts, increase work_mem per-session. But be careful: each sort or hash operation gets its own work_mem allocation, so a complex query with 10 hash joins could use 10x work_mem total.
MVCC: How Concurrent Reads and Writes Coexist
PostgreSQL uses Multi-Version Concurrency Control so that readers and writers operate without blocking each other. Every tuple (row version) on disk carries visibility information.
Tuple Header Visibility Fields
Each heap tuple has a header with these critical fields:
| Field | Size | Purpose |
|---|---|---|
| xmin | 4 bytes | Transaction ID that inserted this tuple |
| xmax | 4 bytes | Transaction ID that deleted/updated this tuple (0 if live) |
| t_ctid | 6 bytes | Pointer to the next version of this tuple (for updates) |
| t_infomask | 2 bytes | Hint bits for committed/aborted status |
When a transaction runs a SELECT, it does not see the "latest" version of every row. It sees the version that was visible at its snapshot.
Snapshot Isolation
A snapshot records:
- xmin: The lowest still-running transaction ID at snapshot time
- xmax: The next transaction ID that will be assigned
- xip_list: Array of transaction IDs that were in-progress at snapshot time
A tuple is visible to a snapshot if:
- xmin is committed, AND xmin < the snapshot's xmax, AND xmin is not in xip_list (the inserting transaction committed before the snapshot), AND
- xmax is zero (not deleted), OR xmax is aborted or not yet committed, OR xmax is in xip_list or >= the snapshot's xmax (the deleting transaction is not visible to the snapshot)
This visibility check happens for every tuple during a scan. For a sequential scan of 10 million rows, Postgres evaluates these rules 10 million times. The t_infomask hint bits optimize this by caching the committed/aborted status so Postgres does not need to look up the pg_xact (CLOG) file for every tuple.
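The visibility rules can be written out as a small function. This is a simplified model: `committed` stands in for the pg_xact lookup, subtransactions and hint bits are ignored, and all the transaction IDs are invented for the example.

```python
# Simplified snapshot visibility check for one tuple version.

def tuple_visible(xmin, xmax, snap_xmax, xip, committed):
    # Inserter must be committed and visible to the snapshot:
    # not started after the snapshot, not in-progress at snapshot time.
    if xmin >= snap_xmax or xmin in xip or not committed(xmin):
        return False
    # Never deleted: tuple is live.
    if xmax == 0:
        return True
    # Deleter aborted/still running, was in-progress at snapshot time,
    # or started after the snapshot: the deletion is invisible,
    # so the tuple still is visible.
    if not committed(xmax) or xmax in xip or xmax >= snap_xmax:
        return True
    return False

committed = {100, 105, 110}.__contains__    # stand-in for the pg_xact (CLOG) lookup
snap = dict(snap_xmax=112, xip={108})       # snapshot: next XID 112, XID 108 in-progress

print(tuple_visible(100, 0,   **snap, committed=committed))  # live, old insert
print(tuple_visible(100, 110, **snap, committed=committed))  # deleted by committed xact
print(tuple_visible(100, 108, **snap, committed=committed))  # deleter was in-progress
print(tuple_visible(108, 0,   **snap, committed=committed))  # inserter was in-progress
```

Running this logic per tuple is cheap once the hint bits are set, but the CLOG lookup on the slow path is exactly what the t_infomask caching avoids.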
Why readers never block writers
Unlike traditional locking systems where a SELECT can block an UPDATE (or vice versa), MVCC allows both to proceed simultaneously. The SELECT reads the old tuple version, and the UPDATE creates a new version. Neither waits for the other. The only scenario where transactions block each other is two concurrent UPDATEs to the same row, where the second UPDATE must wait for the first to commit or abort.
The VACUUM Process
When tuples are updated or deleted, the old versions remain on disk (they might still be visible to other transactions). VACUUM reclaims space from dead tuples.
| VACUUM Type | When It Runs | What It Does | Impact |
|---|---|---|---|
| Regular VACUUM | Autovacuum trigger | Marks dead tuple space as reusable | Does not shrink file on disk |
| VACUUM FULL | Manual only | Rewrites entire table, reclaims disk space | Exclusive lock, very slow |
| Autovacuum | When dead tuples > threshold | Runs regular VACUUM automatically | Minimal impact if tuned |
| Anti-wraparound | When xid age > 200M | Prevents transaction ID wraparound | Can freeze system if neglected |
The transaction ID wraparound problem
PostgreSQL uses 32-bit transaction IDs: about 4 billion values in total, of which only ~2 billion can be "in the past" at any moment. VACUUM freezes old tuples to reclaim transaction IDs. If autovacuum falls behind, Postgres logs increasingly urgent warnings and, as the wraparound point nears, stops assigning new transaction IDs, halting all writes to prevent data corruption. This is the single most critical operational failure mode in PostgreSQL.
Index Internals: How B-Trees Speed Up Lookups
PostgreSQL's default index type is the B-tree (balanced tree). Understanding how it works explains why some queries use indexes and others do not.
B-Tree Structure
A B-tree index is a sorted, balanced tree where:
- Leaf pages contain index entries (key value + pointer to heap tuple)
- Internal pages contain separator keys that guide the search
- Root page is the starting point for every lookup
- All leaf pages are linked in a doubly-linked list for range scans
For a table with 10 million rows and a B-tree on an integer column, the tree is typically 3-4 levels deep. Each lookup requires 3-4 page reads (one per level) plus one heap page read to fetch the actual tuple.
// Simplified B-tree lookup for WHERE id = 42
function btree_search(root, key=42):
page = root // Level 0: root page
while page.is_internal:
// Binary search within page for correct child pointer
child_ptr = binary_search(page.entries, key)
page = read_page(child_ptr) // Level 1, 2: internal pages
// Now at leaf level
entry = binary_search(page.entries, key)
if entry.key == key:
return read_heap_tuple(entry.tid) // Fetch actual row from heap
return NULL
Index Types Beyond B-Tree
| Index Type | Best For | Structure | Example Use Case |
|---|---|---|---|
| B-tree | Equality and range queries | Balanced tree | WHERE id = 42, WHERE date > '2024-01-01' |
| Hash | Equality only | Hash buckets | WHERE email = 'foo@bar.com' (rarely better than B-tree) |
| GIN | Multi-valued columns | Inverted index | Full-text search, JSONB containment, array overlap |
| GiST | Spatial and range types | Generalized search tree | PostGIS geometry, range types, nearest-neighbor |
| BRIN | Large naturally-ordered tables | Block range summaries | Time-series data on append-only tables |
When BRIN beats B-tree
For append-only time-series tables with billions of rows, a BRIN index on the timestamp column can be 1,000x smaller than a B-tree (a few KB versus several GB). BRIN stores the min/max value per block range (128 pages by default). The tradeoff is that BRIN is useless if the data is not physically ordered by the indexed column. I reach for BRIN when the table is append-only and queries always filter by the ordering column.
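A BRIN index is small enough to model in a few lines: store one (min, max) summary per block range and skip every range whose summary cannot overlap the predicate. The layout below is illustrative (rows per page, integer "timestamps"), not the on-disk format.

```python
# BRIN in miniature: (min, max) per block range, used to prune ranges.

PAGES_PER_RANGE = 128  # PostgreSQL's default pages_per_range

def build_brin(ordered_values, rows_per_page=100):
    rows_per_range = PAGES_PER_RANGE * rows_per_page
    summaries = []
    for start in range(0, len(ordered_values), rows_per_range):
        chunk = ordered_values[start:start + rows_per_range]
        summaries.append((min(chunk), max(chunk)))  # one tiny entry per range
    return summaries

def ranges_to_scan(summaries, lo, hi):
    """Block ranges whose [min, max] summary overlaps the query interval."""
    return [i for i, (rmin, rmax) in enumerate(summaries)
            if rmax >= lo and rmin <= hi]

timestamps = list(range(10_000_000))   # append-only, naturally ordered column
index = build_brin(timestamps)         # hundreds of entries for 10M rows
hits = ranges_to_scan(index, 3_000_000, 3_100_000)
print(len(index), len(hits))           # tiny index, few ranges to actually read
```

The pruning only works because the data is physically ordered: shuffle the column and every (min, max) summary spans nearly the whole value domain, so no range can ever be skipped.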
WAL: How Writes Become Durable
The Write-Ahead Log is PostgreSQL's durability mechanism. The rule is simple: before any data page change is written to disk, the WAL record describing that change must be written and fsynced first.
The Write Path
The sequence for a write operation:
- The backend generates a WAL record describing the change (essentially a diff of the page)
- The WAL record is appended to the WAL buffer in shared memory
- The heap page in shared_buffers is modified (marked dirty)
- At COMMIT, the WAL buffer is fsynced to the WAL segment file on disk
- The dirty data page is written to disk later by the background writer or checkpointer
This separation is critical. WAL writes are sequential (fast), while data page writes are random (slow). By writing WAL first, Postgres can acknowledge a commit quickly and batch data page writes.
WAL Record Format
Each WAL record contains:
- Resource manager ID: Which subsystem generated the record (heap, btree, transaction, etc.)
- Transaction ID: Which transaction this change belongs to
- LSN (Log Sequence Number): A monotonically increasing pointer into the WAL stream
- Page diff data: The actual bytes that changed on the data page
The LSN is critical. Every data page header stores the LSN of the most recent WAL record that modified it. During recovery, Postgres compares the page LSN with the WAL record LSN. If the page LSN is already >= the WAL record LSN, that change was already flushed to disk and the record is skipped. This makes recovery idempotent.
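The LSN comparison is worth seeing concretely. This sketch models pages and WAL records as dicts (a deliberate simplification of the real structures) and shows why replaying the same WAL twice is harmless.

```python
# Idempotent WAL replay: apply a record only if the page's LSN shows
# the change has not yet reached the data file.

def replay(pages, wal):
    """pages: {page_no: {"lsn": int, "data": str}}; wal: records in LSN order."""
    applied = 0
    for rec in wal:
        page = pages[rec["page"]]
        if page["lsn"] >= rec["lsn"]:
            continue                   # change already on disk: skip the record
        page["data"] = rec["new_data"] # redo the change
        page["lsn"] = rec["lsn"]       # stamp the page with this record's LSN
        applied += 1
    return applied

# Crash scenario: page 1 was flushed through LSN 500, page 2 only through 100.
pages = {1: {"lsn": 500, "data": "v2"}, 2: {"lsn": 100, "data": "v1"}}
wal = [{"lsn": 400, "page": 1, "new_data": "v2"},   # already on disk: skipped
       {"lsn": 600, "page": 2, "new_data": "v2"},   # lost in crash: replayed
       {"lsn": 700, "page": 1, "new_data": "v3"}]   # lost in crash: replayed

print(replay(pages, wal), pages)
```

A second call to replay with the same WAL applies nothing, which is the idempotence property: recovery can be interrupted and restarted safely.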
Checkpoints and Recovery
A checkpoint flushes all dirty pages from shared_buffers to disk and writes a checkpoint record to WAL. After a crash, recovery starts from the last checkpoint and replays WAL records forward.
| Parameter | Default | Purpose |
|---|---|---|
| checkpoint_timeout | 5 min | Maximum time between checkpoints |
| max_wal_size | 1 GB | WAL size that triggers a checkpoint |
| wal_level | replica | How much detail to write (minimal, replica, logical) |
| synchronous_commit | on | Whether to wait for WAL fsync before returning |
Buffer Pool and Shared Memory
PostgreSQL caches data pages in a shared memory region called shared_buffers. Every read and write goes through this cache.
How the Buffer Pool Works
The buffer pool is an array of 8KB buffer slots. Each slot holds one data page (heap page, index page, or other). When a backend needs a page:
- Check the buffer tag hash table for the page
- If found (cache hit), pin the buffer and return it
- If not found (cache miss), find a free or evictable buffer slot
- Read the page from disk into the slot
- Insert into the hash table and return
The buffer tag is a combination of (relation file node, fork number, block number). The hash table uses partition-based locking (128 partitions), so multiple backends can look up different pages concurrently without blocking each other.
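The lookup path can be sketched as a tag-keyed cache in front of fixed-size slots. This toy version leaves out pin counts, partition locks, and clock-sweep (it evicts an arbitrary slot instead), and the relation names and page contents are invented.

```python
# Buffer pool lookup sketch: buffer tag (relation, fork, block) -> slot.

class BufferPool:
    def __init__(self, num_slots, disk):
        self.disk = disk          # stand-in for the data files on disk
        self.num_slots = num_slots
        self.slots = {}           # tag -> cached page contents
        self.hits = self.misses = 0

    def read_page(self, tag):
        if tag in self.slots:     # cache hit: no I/O needed
            self.hits += 1
            return self.slots[tag]
        self.misses += 1          # cache miss: fetch from "disk"
        if len(self.slots) >= self.num_slots:
            # Naive eviction of an arbitrary slot (the real buffer
            # manager uses clock-sweep, described below).
            self.slots.pop(next(iter(self.slots)))
        page = self.disk[tag]
        self.slots[tag] = page
        return page

disk = {("users", "main", blk): f"page-{blk}" for blk in range(10)}
pool = BufferPool(num_slots=4, disk=disk)
for blk in [0, 1, 0, 2, 0, 3]:   # block 0 is hot and gets re-read
    pool.read_page(("users", "main", blk))
print(pool.hits, pool.misses)
```

The tag is the whole point: because it identifies a page by (relation, fork, block) rather than by file offset, every backend sharing the pool resolves the same page to the same slot.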
Double Buffering: shared_buffers + OS Page Cache
PostgreSQL relies on a two-tier caching strategy. shared_buffers is the first tier, managed by Postgres with pin counts and usage tracking. The OS page cache is the second tier, caching any file data Postgres reads from disk.
When Postgres reads a page not in shared_buffers, the read goes to the OS. If the OS has the page in its page cache, the read completes from RAM (just a memory copy, ~microseconds). If not, it goes to disk (~milliseconds). This means the effective cache size is shared_buffers + OS page cache, which is why effective_cache_size should be set to ~75% of total RAM.
Ring Buffers for Large Scans
When a sequential scan reads a table larger than 1/4 of shared_buffers, Postgres uses a ring buffer (256KB, just 32 pages) instead of the main buffer pool. This prevents a single large table scan from evicting all cached pages that other queries depend on. Without this mechanism, a single SELECT * FROM huge_table would flush the entire buffer pool, causing every other concurrent query to re-read its data from disk.
The same ring buffer strategy applies to bulk writes (COPY, CREATE TABLE AS) and VACUUM operations. This is one of the most elegant design decisions in the buffer manager.
Clock-Sweep Eviction
When the buffer pool is full, Postgres uses a clock-sweep algorithm to find a page to evict. Each buffer has a usage counter (0-5). The clock hand sweeps through buffers:
- If usage_count > 0, decrement it and move on
- If usage_count == 0, evict this buffer (write it to disk first if dirty)
This approximates LRU without the overhead of maintaining a linked list. Frequently accessed pages accumulate higher usage counts and survive longer.
// Simplified clock-sweep eviction
function find_victim_buffer():
while true:
buf = buffers[clock_hand]
clock_hand = (clock_hand + 1) % num_buffers
if buf.usage_count == 0:
if buf.is_dirty:
write_page_to_disk(buf)
return buf
else:
buf.usage_count -= 1
Why shared_buffers should not be too large
There is a common misconception that you should set shared_buffers to 50%+ of RAM. Postgres relies heavily on the OS page cache as a second layer of caching. Setting shared_buffers to 25% of RAM (the typical recommendation) leaves room for the OS cache, which handles sequential scan readahead and double-buffers pages that were evicted from shared_buffers. On a 64GB machine, I typically set shared_buffers to 16GB.
Parallel Query Execution
Since version 9.6, PostgreSQL can parallelize certain operations. The planner creates a Gather or Gather Merge node that spawns background workers. Each worker processes a portion of the data and sends results back to the leader process.
Operations that support parallelism:
- Sequential scans (each worker scans a subset of blocks)
- Index scans and bitmap index scans
- Hash joins and nested loop joins (inner side)
- Aggregates (partial aggregation in workers, final aggregation in leader)
| Parameter | Default | Purpose |
|---|---|---|
| max_parallel_workers_per_gather | 2 | Max workers per query |
| min_parallel_table_scan_size | 8MB | Minimum table size for parallel scan |
| parallel_tuple_cost | 0.1 | Planner cost of transferring a tuple between processes |
| max_parallel_workers | 8 | Global max parallel workers |
Parallel query helps most on large analytical queries that scan millions of rows. It does not help (and adds overhead) for OLTP point lookups that return a single row. The planner accounts for the overhead of inter-process communication when deciding whether to parallelize.
What Happens When Things Break
| Failure | What Happens | How to Detect | How to Fix |
|---|---|---|---|
| Crash mid-transaction | Uncommitted changes are lost, WAL replays from last checkpoint on restart | Server exits, pg_ctl status shows stopped | Restart Postgres, recovery is automatic |
| Stale statistics | Planner picks wrong plans (seq scan instead of index scan) | EXPLAIN ANALYZE shows estimated vs actual rows differ by 10x+ | Run ANALYZE on affected tables |
| Transaction ID wraparound | Postgres refuses new writes, eventually shuts down | WARNING: database must be vacuumed within X transactions in logs | Emergency: run VACUUM FREEZE on affected tables |
| Bloated tables from long transactions | Sequential scans take 10x longer, disk usage spikes | pg_stat_user_tables.n_dead_tup grows continuously | Kill idle transactions, run VACUUM, consider VACUUM FULL during maintenance window |
| WAL disk full | Postgres halts all writes | PANIC: could not write to file pg_wal/... | Free disk space, archive old WAL segments, increase disk |
| shared_buffers thrashing | High I/O, slow queries across the board | pg_stat_bgwriter.buffers_backend high (backends writing their own pages) | Increase shared_buffers, reduce concurrent connections |
The key insight about crash recovery
Postgres never needs a "repair" or "recovery mode" dialog like some databases. WAL-based recovery is fully automatic. The database reads the last checkpoint location from pg_control, replays all WAL records after that point, and is ready for connections. This typically takes seconds to minutes, even for large databases, because it only replays the WAL generated since the last checkpoint (not the entire database).
Performance Characteristics
| Operation | Typical Latency | Throughput | Key Factor |
|---|---|---|---|
| Point lookup via B-tree index | 0.1-1ms | 50,000-100,000 QPS | Tree depth (3-4 levels) + 1 heap fetch |
| Sequential scan (cold cache) | 100MB/s+ from disk | Limited by disk I/O | OS readahead, page size |
| Sequential scan (warm cache) | 1-5GB/s from shared_buffers | Limited by CPU | Tuple deforming, visibility checks |
| Single-row INSERT | 0.5-2ms (with fsync) | 5,000-20,000 TPS | WAL fsync latency |
| Bulk COPY load | 100K-500K rows/sec | Limited by WAL write throughput | Disable indexes during load |
| Hash join (in-memory) | O(N + M) | Limited by work_mem | Fits in work_mem vs spill to disk |
| VACUUM on 1M dead tuples | 1-10 seconds | ~100K-500K tuples/sec | I/O for reading/writing pages |
| Checkpoint (1GB dirty data) | 5-30 seconds | Spread across checkpoint_timeout | Disk write throughput |
The 5% rule of thumb for index scans
As a rough heuristic, Postgres prefers an index scan when the query returns less than ~5% of the table rows, and a sequential scan when it returns more. This threshold shifts based on random_page_cost, index correlation, and whether the data is cached. On SSDs with low random_page_cost, the crossover point can be as high as 20-30%.
How This Compares to Alternatives
| Feature | PostgreSQL | MySQL (InnoDB) | SQLite | CockroachDB |
|---|---|---|---|---|
| Query planner | Cost-based, very sophisticated | Cost-based, simpler | Simpler cost-based | Distributed cost-based |
| MVCC implementation | Heap tuple versioning (xmin/xmax) | Undo log (rollback segments) | WAL-mode with snapshots | Timestamp-based MVCC |
| Index structure | Heap + separate B-tree | Clustered index (data in PK B-tree) | B-tree | LSM-based (RocksDB) |
| Dead tuple cleanup | VACUUM (explicit process) | Purge thread (automatic) | Page freelist reuse (manual VACUUM to compact) | Garbage collection |
| Parallel query | Yes (since v9.6) | Limited (since 8.0) | No | Yes (distributed) |
| Extension ecosystem | Rich (PostGIS, pg_vector, timescaledb) | Limited | Minimal | PostgreSQL-compatible |
| Connection model | Process per connection | Thread per connection | In-process | Process per node |
I reach for PostgreSQL when I need a general-purpose relational database with strong consistency, a rich extension ecosystem, and a sophisticated query planner. I switch to MySQL when the team already has deep MySQL expertise and the workload is simple OLTP. I consider CockroachDB when I need horizontal scaling with serializable isolation, though the latency overhead of distributed consensus is real (2-10ms per write vs sub-millisecond in standalone Postgres).
SQLite is the right choice for embedded applications, mobile apps, or single-user tools where the simplicity of an in-process database outweighs the limitations of single-writer concurrency. I have used SQLite as a local cache in front of a remote Postgres database to great effect.
The InnoDB clustered index difference
The biggest architectural difference between PostgreSQL and MySQL/InnoDB is how they store data. PostgreSQL uses a heap (unordered) with separate B-tree indexes. InnoDB uses a clustered index where the primary key B-tree IS the table data. This means InnoDB primary key lookups are faster (no separate heap fetch), but secondary index lookups are slower (they must first look up the primary key, then traverse the clustered index). It also means InnoDB UPDATEs that change the primary key are expensive because they move the entire row in the B-tree.
Interview Cheat Sheet
- When asked about query execution: "Postgres has a five-stage pipeline: parse, analyze, rewrite, plan, execute. The planner is cost-based and uses statistics from pg_statistic to estimate row counts and choose between scan types."
- When asked about index vs. seq scan: "The planner compares the estimated cost of sequential I/O (reading all pages) versus random I/O (index lookups). The crossover is typically around 5-15% of table rows, but it depends on random_page_cost and whether data is cached."
- When asked about MVCC: "Every tuple has xmin and xmax fields. Each transaction gets a snapshot that determines which tuple versions it can see. Readers never block writers and vice versa."
- When asked about VACUUM: "Updates and deletes create dead tuples. VACUUM marks that space as reusable. Long-running transactions prevent vacuum from cleaning up, which causes table bloat."
- When asked about transaction ID wraparound: "Postgres uses 32-bit transaction IDs. VACUUM freezes old tuples to reclaim IDs. If vacuum falls behind, Postgres stops assigning new transaction IDs before wraparound, halting writes to prevent corruption."
- When asked about WAL: "Write-Ahead Logging ensures durability. WAL records are written and fsynced before the transaction is acknowledged. Dirty data pages are flushed later by the checkpointer."
- When asked about the buffer pool: "shared_buffers is a page cache using clock-sweep eviction. Set it to ~25% of RAM and let the OS cache handle the rest. Backends that write their own pages (buffers_backend) indicate the pool is too small."
- When asked about join strategies: "Nested loop for small inner tables with indexes, hash join for larger unindexed joins that fit in work_mem, merge join when both sides are already sorted."
- When asked about slow queries: "First: EXPLAIN ANALYZE to see actual vs estimated rows. Second: check if statistics are stale (run ANALYZE). Third: check if work_mem is causing disk spills. Fourth: check for sequential scans on large tables where an index would help."
Test Your Understanding
Quick Recap
- PostgreSQL processes every query through a five-stage pipeline: parse, analyze, rewrite, plan, execute.
- The cost-based optimizer estimates plan costs using table statistics (row counts, histograms, most common values) and cost parameters (sequential vs random I/O costs).
- The planner chooses between sequential scan, index scan, bitmap scan, and index-only scan based on selectivity, correlation, and available indexes.
- MVCC uses tuple-level xmin/xmax fields so that each transaction sees a consistent snapshot without blocking other transactions.
- VACUUM is essential for reclaiming dead tuple space and preventing transaction ID wraparound, which is the most critical operational failure in PostgreSQL.
- The Write-Ahead Log ensures durability by writing WAL records before data pages, enabling automatic crash recovery from the last checkpoint.
- The buffer pool (shared_buffers) uses clock-sweep eviction and should be set to ~25% of RAM, with the OS page cache handling the rest.
- For multi-table joins, the planner uses nested loops, hash joins, or merge joins, and switches to a genetic optimizer for queries joining more than 12 tables.
Related Concepts
- How DynamoDB works: Contrast with a fully managed NoSQL database that takes a completely different approach to query execution (no joins, no planner, partition-key routing only)
- How RocksDB compaction works: The LSM-tree storage engine used by CockroachDB and many NewSQL databases, a fundamentally different storage approach than PostgreSQL's heap + B-tree
- How garbage collection works: Similar to VACUUM in principle (reclaiming unused memory/space), but in a runtime environment rather than a database
- How ClickHouse works: A columnar database optimized for analytical queries, showing the tradeoffs between OLTP (Postgres) and OLAP (ClickHouse) query execution