PostgreSQL query planner: statistics, cost model, and EXPLAIN
How PostgreSQL's query planner chooses execution plans using column statistics and a cost model. What EXPLAIN ANALYZE output means, how to identify bad plans, and how to fix them without rewriting queries.
The problem
Your e-commerce platform runs a query that joins orders, users, and products to generate a daily sales report. In development with 10,000 rows, it takes 12ms. In production with 50 million orders, the same query takes 47 seconds.
You add an index on orders.user_id. The query still takes 47 seconds. You check the execution plan with EXPLAIN ANALYZE and discover the planner is ignoring your index entirely. It chose a sequential scan across the full orders table, then a nested loop join against users, scanning all 2 million user rows for every batch of orders.
The index exists. The query is correct. The problem is not your SQL. The problem is the planner's decision: it estimated the join would return 12 rows (based on stale statistics from last month's data) when the real result set is 380,000 rows. With that wrong estimate, nested loop looked cheaper than hash join. The plan it chose is catastrophically wrong.
This is the problem the query planner solves, and the problem you face when the planner gets it wrong. PostgreSQL's query planner is a cost-based optimizer that evaluates multiple execution strategies and picks the cheapest. When its cost estimates are accurate, it makes excellent choices. When the estimates are wrong (stale statistics, skewed data, correlated columns), the plan can be orders of magnitude slower than optimal.
What it is
The PostgreSQL query planner is a cost-based optimizer that translates a declarative SQL query into a physical execution plan: a tree of operators (scans, joins, sorts, aggregates) that the executor runs to produce results. For each query, the planner generates candidate plans, estimates their cost using stored statistics about the data, and picks the cheapest one.
Analogy: Think of a GPS navigation system. You type in a destination (the SQL query). The GPS considers multiple routes: highway (fast but longer distance), side streets (shorter but more stops), toll road (expensive but avoids traffic). It picks the route with the lowest estimated travel time based on current traffic data. If the traffic data is stale (it thinks the highway is clear when it is actually jammed), it sends you the wrong way. The PostgreSQL planner works the same way: good statistics produce good plans, stale statistics produce catastrophic plans.
Optimizer vs. planner
PostgreSQL documentation uses "planner" and "optimizer" interchangeably. Other databases (MySQL, Oracle, SQL Server) call the same component the "query optimizer." The function is identical: translate SQL into the cheapest physical execution plan.
How it works
The planner transforms a SQL query into an execution plan through four stages: parsing, rewriting, planning, and execution. The planning stage is where all the interesting decisions happen.
The planner's core loop works by dynamic programming. For each table in the query, it generates all possible access paths (sequential scan, each available index scan, bitmap scan). For each pair of tables, it considers all join strategies. It builds the plan bottom-up, caching the cheapest sub-plan at each step.
function plan_query(tables, joins, filters):
    // Step 1: generate scan paths for each table
    for each table in tables:
        paths[table] = []
        paths[table].add(seq_scan(table))
        for each index on table:
            if index matches any filter on table:
                paths[table].add(index_scan(table, index))
        if multiple index conditions exist:
            paths[table].add(bitmap_scan(table, indexes))
        paths[table] = cheapest(paths[table])

    // Step 2: generate join paths for each subset of tables,
    // building larger joins from cached smaller ones
    for join_size = 2 to len(tables):
        for each subset S of tables with size join_size:
            for each way to split S into (left, right):
                for join_method in [nested_loop, hash_join, merge_join]:
                    cost = estimate_cost(paths[left], paths[right], join_method)
                    paths[S] = keep_cheapest(paths[S], cost)

    // Step 3: add sort, aggregate, limit on top
    plan = paths[all_tables]
    if ORDER BY: plan = add_sort(plan)
    if GROUP BY: plan = add_aggregate(plan)
    return plan
For queries involving more than geqo_threshold tables (default: 12), PostgreSQL switches from exhaustive dynamic programming to a Genetic Query Optimizer (GEQO) that uses randomized search. This is because the number of possible join orderings grows factorially: 12 tables produce 479 million possible orderings.
The cost model: sequential scan vs index scan
Every plan node gets a cost estimate. The cost formula combines I/O cost (pages read from disk) and CPU cost (rows processed). The cost units are arbitrary but consistent, anchored to seq_page_cost = 1.0.
Key cost parameters and their defaults:
| Parameter | Default | What it controls |
|---|---|---|
| seq_page_cost | 1.0 | Cost of reading one page sequentially (baseline) |
| random_page_cost | 4.0 | Cost of reading one page randomly (index lookups) |
| cpu_tuple_cost | 0.01 | CPU cost of processing one row |
| cpu_index_tuple_cost | 0.005 | CPU cost of processing one index entry |
| cpu_operator_cost | 0.0025 | CPU cost of one operator evaluation (comparison, function) |
| effective_cache_size | 4GB | Planner's estimate of available OS + PostgreSQL cache |
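These are ordinary server settings, so you can check the values in effect on a given instance. A quick inspection query (the setting names are standard; the cost factors are unitless, while effective_cache_size is reported in 8kB pages):

```sql
-- Show the planner cost parameters currently in effect
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost',
               'effective_cache_size');
```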
A concrete example: table orders has 1 million rows across 50,000 pages. An index on user_id covers 2,700 index pages.
Sequential scan cost (scanning all rows, filtering to user_id = 123):
I/O: 50,000 pages × 1.0 (seq_page_cost) = 50,000
CPU: 1,000,000 rows × 0.01 (cpu_tuple_cost) = 10,000
Total: 60,000
Index scan cost (looking up user_id = 123, estimated 50 matching rows):
Index I/O: 3 levels × 4.0 (random_page_cost) = 12
Heap I/O: 50 rows × 4.0 (random_page_cost) = 200
CPU: 50 rows × 0.005 (cpu_index_tuple_cost) = 0.25
Total: 212.25
The index scan is roughly 280x cheaper, so the planner picks it. But if random_page_cost were misconfigured to 40.0, the index scan cost would jump to about 2,120 (both the index and heap page reads scale with it), and the planner would start preferring sequential scans at much smaller result sets than it should.
On SSD storage, set random_page_cost to 1.1 (not the default 4.0). SSDs have nearly equal sequential and random read latency. The default of 4.0 was calibrated for spinning disks and causes the planner to under-use indexes on SSD-backed systems. I have seen this single misconfiguration cause 10x slowdowns on production queries.
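A sketch of how to apply the change (ALTER SYSTEM writes to postgresql.auto.conf and needs a config reload to take effect):

```sql
-- Try the new value in one session first and compare EXPLAIN output
SET random_page_cost = 1.1;

-- Then persist it cluster-wide
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```

If only some tablespaces are SSD-backed, the setting can also be applied per tablespace with `ALTER TABLESPACE ... SET (random_page_cost = 1.1)`.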
Join algorithms
When a query joins two or more tables, the planner picks a join algorithm for each pair. PostgreSQL supports three algorithms, each optimal for different data shapes.
Nested loop join iterates over every row in the outer table and, for each, scans the inner table for matches. With an index on the inner table's join key, each inner lookup is O(log N). Without an index, it is O(M) per outer row, which is catastrophic for large tables. The planner picks nested loop when it estimates the outer result set is small (tens or hundreds of rows).
Hash join builds an in-memory hash table from the smaller relation, then probes it with each row from the larger relation. The cost is O(N + M) after the hash build phase. Hash joins require enough work_mem to hold the hash table. If the inner relation exceeds work_mem, PostgreSQL spills to disk in batches, significantly increasing I/O.
Merge join sorts both inputs on the join key, then walks them in a single pass. If either input is already sorted (from an index scan or a prior sort step), the sort phase is free. Merge join handles range joins better than hash join and is most efficient when both sides are large and pre-sorted.
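One way to confirm a suspected join-choice problem is to disable an algorithm for a single session and compare plans. The enable_* settings are real planner GUCs; the query below reuses the orders/users tables from the running example (the selected columns are illustrative):

```sql
-- Disable nested loops for this session only, then compare plan and
-- runtime against the original. Never change enable_* settings globally.
SET enable_nestloop = off;

EXPLAIN (ANALYZE)
SELECT o.user_id, u.email
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending';

RESET enable_nestloop;
```

If the hash join plan is dramatically faster, the underlying problem is usually a row estimate, not the join code itself.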
Interview tip: join selection
When an interviewer asks why a query is slow, check the join algorithm first. If you see a nested loop on a large outer set with no index on the inner, that is almost always the problem. The fix is either adding an index (so nested loop becomes cheap) or ensuring the statistics are accurate (so the planner picks hash join instead).
Statistics and pg_stats
The planner's cost estimates are only as good as its statistics. PostgreSQL collects per-column statistics via the ANALYZE command and stores them in pg_statistic (viewable through pg_stats).
| Statistic | What it stores | How the planner uses it |
|---|---|---|
| n_distinct | Number of distinct values. Positive = exact count, negative = fraction of rows (-0.5 means 50% distinct) | Estimates selectivity of equality filters. WHERE status = 'active' returns ~1/n_distinct of rows |
| most_common_vals / most_common_freqs | Top N most frequent values and their frequencies | For skewed columns, uses exact frequency instead of assuming uniform distribution |
| histogram_bounds | Array of values dividing the non-MCV range into equal-frequency buckets | Estimates range filter selectivity: WHERE price > 100 |
| correlation | How closely physical row order matches logical order (-1 to 1) | High correlation means index scans read sequential pages (cheap). Low correlation means random I/O |
| null_frac | Fraction of rows that are NULL | Adjusts row estimates for IS NULL / IS NOT NULL filters |
The default statistics target is 100 (100 most common values, 100 histogram buckets). For columns with high cardinality or heavy skew, increase it:
-- Increase statistics granularity for a skewed column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
-- Refresh statistics
ANALYZE orders;
-- View the statistics
SELECT attname, n_distinct, most_common_vals, most_common_freqs, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
Statistics staleness is the most common cause of bad plans. After bulk loads, large deletes, or data migrations, statistics may not reflect the current distribution. Autovacuum refreshes statistics automatically, but it triggers based on row-change thresholds. A table with 100 million rows needs 10 million changes (default autovacuum_analyze_scale_factor = 0.1) before auto-analyze runs. In high-write workloads, this lag causes the planner to make decisions based on old data.
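To check for staleness and tighten the trigger, the statements below use the standard pg_stat_user_tables view (the n_mod_since_analyze column exists since PostgreSQL 9.4) and a per-table storage parameter; the table name comes from the running example:

```sql
-- When were statistics last refreshed, and how many rows changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Re-analyze after ~1% of rows change instead of the default 10%
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
```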
Correlated columns break the planner's independence assumption. If city = 'Seattle' and state = 'WA' always appear together, the planner estimates the combined selectivity as P(city) × P(state), which is far too low. PostgreSQL 10+ has extended statistics (CREATE STATISTICS) for this, but you must create them explicitly.
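A minimal sketch for the city/state example, assuming a hypothetical addresses table with those two columns:

```sql
-- Tell the planner that city and state are functionally dependent
CREATE STATISTICS city_state_dep (dependencies) ON city, state FROM addresses;
ANALYZE addresses;  -- extended statistics are only populated by ANALYZE

-- Inspect what was collected (pg_stats_ext view, PostgreSQL 12+)
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE tablename = 'addresses';
```

After this, a filter on both columns uses the measured dependency instead of multiplying the two selectivities.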
Reading EXPLAIN ANALYZE
EXPLAIN ANALYZE runs the query and reports both estimated and actual costs for every plan node. This is your primary tool for diagnosing planner problems.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, u.email
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending'
AND o.created_at > '2026-01-01';
Hash Join  (cost=4521.00..12987.45 rows=3200 width=96) (actual time=45.2..187.3 rows=128450 loops=1)
  Hash Cond: (o.user_id = u.id)
  Buffers: shared hit=8921 read=2341
  ->  Bitmap Heap Scan on orders o  (cost=112.45..8234.12 rows=3200 width=48) (actual time=2.1..132.4 rows=128450 loops=1)
        Recheck Cond: (status = 'pending')
        Filter: (created_at > '2026-01-01')
        Rows Removed by Filter: 4521
        Buffers: shared hit=6234 read=2341
        ->  Bitmap Index Scan on orders_status_idx  (cost=0.00..111.65 rows=5400 width=0) (actual time=1.8..1.8 rows=132971 loops=1)
  ->  Hash  (cost=3012.00..3012.00 rows=200000 width=52) (actual time=42.8..42.8 rows=200000 loops=1)
        Buckets: 262144  Batches: 1  Memory Usage: 15234kB
        Buffers: shared hit=2687
        ->  Seq Scan on users u  (cost=0.00..3012.00 rows=200000 width=52) (actual time=0.01..18.2 rows=200000 loops=1)
              Buffers: shared hit=2687
The critical diagnostic signal: estimated rows vs. actual rows. In the bitmap heap scan above, the planner estimated 3,200 rows but got 128,450. That is a 40x underestimate. This means the planner under-allocated resources and may have chosen the wrong join algorithm entirely.
| Field | What it means |
|---|---|
| cost=X..Y | Estimated startup cost .. total cost (arbitrary units, anchored to seq_page_cost = 1.0) |
| rows=N | Estimated output rows from statistics |
| actual time=X..Y | Real wall-clock time in ms (startup .. total) |
| actual rows=M | Real output row count |
| loops=N | How many times this node executed (inner side of nested loop) |
| Buffers: shared hit=X read=Y | Pages served from cache vs. read from disk |
| Rows Removed by Filter | Rows that passed the index condition but failed the WHERE |
When you see a large mismatch between rows and actual rows, that is the root cause. Run ANALYZE on the affected tables. If the mismatch persists, increase the statistics target or create extended statistics for correlated columns.
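In production you rarely get to run EXPLAIN ANALYZE by hand at the moment a query misbehaves. The auto_explain contrib module logs the plan of any query that exceeds a duration threshold; a sketch of a session-level setup (it is more commonly preloaded via shared_preload_libraries):

```sql
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- log plans of queries slower than this
SET auto_explain.log_analyze = on;            -- include actual rows/times (adds overhead)
SET auto_explain.log_buffers = on;            -- include buffer usage
```

With log_analyze on, the server logs contain the same estimated-vs-actual row counts you would get interactively.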
Production usage
| System | Query planner type | Notable behavior |
|---|---|---|
| PostgreSQL | Cost-based optimizer with dynamic programming. GEQO for 12+ table joins | Uses column statistics (MCV, histograms, correlation). Extended statistics for correlated columns since v10 |
| MySQL (InnoDB) | Cost-based optimizer with greedy heuristics | Simpler statistics model. No histogram support before MySQL 8.0. Often requires optimizer hints for complex joins |
| CockroachDB | Distributed cost-based optimizer inspired by PostgreSQL | Extends the model with network transfer cost. Statistics collection is automatic across distributed nodes |
| Oracle | Cost-based optimizer with adaptive plans | Can switch strategy mid-execution if initial estimates are wrong. Most mature commercial optimizer |
| SQL Server | Cost-based optimizer with adaptive query processing | Memory grant feedback adjusts allocations across executions. Interleaved execution re-estimates mid-plan |
| SQLite | Simple cost-based planner | No histograms. Uses sqlite_stat1 for basic row counts. Adequate for single-node, small-to-medium workloads |
Limitations and when NOT to use it
- The planner assumes column values are independent. Correlated filters (city = 'Seattle' AND state = 'WA') produce wildly wrong selectivity estimates without explicitly created extended statistics.
- Statistics are point-in-time snapshots. Between ANALYZE runs, the planner works with stale data. High-write workloads with skewed distributions suffer most from this lag.
- The planner cannot see runtime parameter values in prepared statements. After 5 executions, PostgreSQL may switch to a generic plan that ignores parameter values. Use plan_cache_mode = force_custom_plan if this is a problem.
- For queries joining more than 12 tables, the genetic optimizer (GEQO) may not find the optimal join order. Break such queries into CTEs or materialized subqueries.
- The cost model does not account for CPU cache effects, NUMA topology, or concurrent query load. Two queries competing for the buffer pool are planned in isolation.
- User-defined functions are treated as black boxes with a flat cost. A function that makes a network call or runs a sub-query will be wildly undercosted.
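Two of these limitations have direct knobs. A sketch, assuming PostgreSQL 12+ for plan_cache_mode:

```sql
-- Avoid the generic-plan trap for prepared statements (PostgreSQL 12+)
SET plan_cache_mode = force_custom_plan;

-- Use exhaustive join planning for up to 13 tables instead of GEQO.
-- Planning time grows quickly with table count; raise with care.
SET geqo_threshold = 14;
```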
Interview cheat sheet
- When asked about slow database queries, start with EXPLAIN ANALYZE. The gap between estimated rows and actual rows is the single most diagnostic signal. If the estimate is off by 10x+, the plan is almost certainly suboptimal.
- When discussing index selection, mention that the planner compares sequential scan cost against index scan cost using random_page_cost. On SSDs, set it to 1.1 to make index scans competitive.
- When a join is slow, identify the algorithm. Nested loop with a large outer set and no inner index is the most common performance killer. Hash join is O(N+M) but needs enough work_mem.
- When asked about PostgreSQL tuning, mention pg_stats and the statistics target. The default is 100 histogram buckets; skewed columns need 500+.
- When discussing prepared statements, note the generic plan trap. After 5 executions, PostgreSQL may switch to a generic plan that ignores parameter values.
- When asked about multi-table joins, mention GEQO. Beyond 12 tables, the planner uses a genetic algorithm. Breaking into CTEs can force join order.
- When discussing correlated columns, mention extended statistics (CREATE STATISTICS). Without them, the planner multiplies selectivities independently, underestimating by orders of magnitude.
- When asked about query plan caching, note that PostgreSQL re-plans simple queries every execution (no persistent plan cache). Prepared statements cache plans but risk generic-plan degradation.
Quick recap
- PostgreSQL's query planner is a cost-based optimizer that generates candidate execution plans and picks the cheapest one based on column statistics (distinct counts, histograms, most common values) stored in pg_statistic.
- The cost model estimates I/O and CPU costs per plan node, anchored to seq_page_cost = 1.0. Set random_page_cost = 1.1 on SSDs, as the default 4.0 was calibrated for spinning disks.
- Three join algorithms (nested loop, hash join, merge join) are selected based on estimated row counts and available indexes. Nested loop with a large outer set and no inner index is the most common performance problem.
- Statistics staleness is the primary cause of bad plans. When estimated rows in EXPLAIN ANALYZE differ from actual rows by 10x+, run ANALYZE and increase the statistics target for skewed columns.
- For queries joining more than 12 tables, PostgreSQL switches to a genetic optimizer (GEQO) that may miss the optimal join order. Break complex queries into CTEs or materialized subqueries.
- Extended statistics (CREATE STATISTICS) fix the planner's independence assumption for correlated columns, which otherwise causes order-of-magnitude selectivity underestimates.
Related concepts
- Databases - How PostgreSQL's storage engine and buffer pool interact with the query planner's I/O cost model.
- Indexing strategies - B-tree, hash, GIN, and GiST indexes that the planner evaluates as alternatives to sequential scans.
- B-tree indexes - The internal data structure behind most PostgreSQL indexes, and how the planner estimates B-tree traversal costs.