B-tree indexes
Learn how B-tree indexes store and retrieve rows, why column order in composite indexes matters, and what causes index fragmentation at scale.
The problem
You have a users table with 100 million rows. Someone runs SELECT * FROM users WHERE email = 'alice@example.com'. Without an index, the database reads every single row. On spinning disk, that's minutes. On SSD, still seconds. Your app times out, your users see errors, and you're paged at 3 a.m.
The fix everyone knows is "add an index." The part most engineers skip is understanding what that index actually is, because that knowledge is what lets you design indexes that perform under load instead of ones that silently hurt you.
What a B-tree is
flowchart TD
subgraph Index["B-tree Index (in memory/buffer pool)"]
Root["Root\n[30 | 70]"]
I1["Interior\n[10 | 20]"]
I2["Interior\n[40 | 55]"]
I3["Interior\n[75 | 85]"]
L1["Leaf: 10"]
L2["Leaf: 20"]
L3["Leaf: 30"]
L4["Leaf: 40"]
L5["Leaf: 70"]
L6["Leaf: 85"]
Root -->|"< 30"| I1
Root -->|"30-69"| I2
Root -->|">= 70"| I3
I1 --> L1
I1 --> L2
I2 --> L3
I2 --> L4
I3 --> L5
I3 --> L6
end
subgraph Disk["Disk pages"]
D1[("Page 3")]
D2[("Page 7")]
D3[("Page 12")]
D4[("Page 18")]
D5[("Page 22")]
D6[("Page 31")]
end
L1 -->|"row pointer"| D1
L2 -->|"row pointer"| D2
L3 -->|"row pointer"| D3
L4 -->|"row pointer"| D4
L5 -->|"row pointer"| D5
L6 -->|"row pointer"| D6
A B-tree (the "B" is usually read as "balanced") is a self-balancing tree where every leaf node is at the same depth. Most database indexes (PostgreSQL, MySQL InnoDB, SQLite) are actually B+ trees, a variant where all data pointers live in the leaf layer and internal nodes store only keys for navigation.
Analogy: Think of a library catalog organized by first letter, then second letter, drilling down until you hit the actual shelf location. You never need to check every book. You navigate from the front of the catalog to the right section, then the right subsection, then pick up the exact card.
[M]
/ \
[D, H] [R, W]
/ | \ | \
[A-C][E-G][I-L] [N-Q][S-V][X-Z]
| | | | | |
rows rows rows rows rows rows  ← leaf pages with row pointers
The depth of a B-tree on a table with 100 million rows is about 4-5 levels. Every lookup is 4-5 page reads, regardless of which row you're looking for.
How a lookup works
SELECT * FROM users WHERE email = 'alice@example.com';
With an index on (email):
- Start at the root. The root node holds a small number of key ranges (typically a few hundred entries per page, since B-tree nodes are sized to match a disk page, usually 8KB or 16KB).
- Navigate down. At each internal node, compare the target email against the stored boundary keys to pick the correct child pointer. Follow the pointer. Repeat until you reach a leaf node.
- Hit the leaf. Leaf nodes store the indexed column value and a pointer to the heap row (the actual row's physical location on disk).
- Fetch the row. Follow the heap pointer to retrieve the full row data.
Total cost: 4-5 index page reads plus 1 heap page read, so 5-6 random I/O operations instead of 100,000,000 row reads.
For a range scan (WHERE email BETWEEN 'a@...' AND 'b@...'), the database walks the leaf level horizontally. Leaf nodes are doubly linked, so range scans are sequential reads after the initial traversal.
Depth is logarithmic, not linear
A B-tree on 100 million rows is about 4-5 levels deep. A B-tree on 100 billion rows is about 6-7 levels deep. Depth grows as log(N), so adding 1,000x more rows only adds 2-3 more levels. This is why index lookups have roughly constant cost regardless of table size, and why "add an index" solves most read performance problems permanently rather than temporarily.
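The logarithmic depth claim is a one-line calculation. The fanout here (keys per internal page) is an assumption, not a measured value; realistic fanouts range from tens to hundreds depending on key size and page size, which is why the prose gives ranges like 4-5 levels:

```python
import math

def btree_depth(n_rows, fanout):
    """Levels from root to leaf: smallest d with fanout**d >= n_rows."""
    return max(1, math.ceil(math.log(n_rows, fanout)))

# Wide pages (small keys), fanout ~300:
print(btree_depth(100_000_000, 300))       # 4 levels for 100M rows
print(btree_depth(100_000_000_000, 300))   # 5 levels for 100B rows

# Narrow pages (large keys), fanout ~50:
print(btree_depth(100_000_000, 50))        # 5 levels for 100M rows
print(btree_depth(100_000_000_000, 50))    # 7 levels for 100B rows
```

A 1,000x increase in rows adds only 1-2 levels at either fanout.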
Composite indexes and column order
The most important practical rule: column order in a composite index is not arbitrary.
A composite index (last_name, first_name) can satisfy:
- WHERE last_name = 'Smith'
- WHERE last_name = 'Smith' AND first_name = 'Alice'
- ORDER BY last_name, first_name
It cannot satisfy:
- WHERE first_name = 'Alice' (no leading column match)
- ORDER BY first_name, last_name (wrong order)
The reason: B-trees sort entries lexicographically by column order. The tree is physically sorted by last_name first. Skipping the leading column means the database has no starting point in the tree and must scan the entire leaf level.
-- Index: (last_name, first_name, created_at)
-- This uses the index efficiently:
SELECT * FROM users
WHERE last_name = 'Smith'
AND first_name = 'Alice'
ORDER BY created_at DESC;
-- This does not use the index at all:
SELECT * FROM users
WHERE first_name = 'Alice'; -- missing leading column last_name
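You can watch the leading-column rule play out in SQLite's planner. A minimal sketch, assuming a hypothetical users table with an extra non-indexed column so the planner cannot fall back to scanning the index itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (last_name TEXT, first_name TEXT, created_at TEXT, bio TEXT)"
)
conn.execute("CREATE INDEX idx_name ON users (last_name, first_name, created_at)")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Leading column present: the planner enters the tree via last_name.
p1 = plan("SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'Alice'")
print(p1)  # SEARCH using idx_name

# Leading column missing: no entry point, full table scan.
p2 = plan("SELECT * FROM users WHERE first_name = 'Alice'")
print(p2)  # SCAN
```

Same index, same table; only the presence of the leading column decides whether the index is usable.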
The rule of thumb: put the column with the highest selectivity (most distinct values) first, unless the leading column is always present in your queries. Write the queries first, then design the index.
Index-only scans and covering indexes
A standard index lookup does two things: reads the index, then reads the heap row. The second step is a random I/O per row. At high volume, those heap reads dominate query cost.
A covering index includes all columns the query needs so the heap read never happens.
-- Query: fetch user IDs and emails for a given account_id
SELECT user_id, email FROM users WHERE account_id = 42;
-- Covering index: includes all referenced columns
CREATE INDEX idx_users_covering ON users (account_id, user_id, email);
-- PostgreSQL can now answer this query from the index alone.
-- The "Index Only Scan" in EXPLAIN ANALYZE confirms zero heap reads.
PostgreSQL calls this an "Index Only Scan." MySQL InnoDB calls it a "covering index." The effect is the same: the database never touches the actual table data. For read-heavy queries, this is often the single biggest query optimization available.
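SQLite makes the covering effect visible in its plan output. This sketch reuses the query shape from above; the table definition is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, account_id INTEGER, email TEXT, bio TEXT)"
)
conn.execute(
    "CREATE INDEX idx_users_covering ON users (account_id, user_id, email)"
)

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id, email FROM users WHERE account_id = 42"
).fetchall()
print(rows[0][-1])  # SEARCH using COVERING INDEX idx_users_covering
```

"COVERING INDEX" in the plan is SQLite's equivalent of PostgreSQL's Index Only Scan: every column the query touches lives in the index, so the table is never read.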
Page splits and fragmentation
B-trees stay balanced by splitting nodes when they fill up. A leaf node filled to capacity cannot accept another entry without splitting into two half-full nodes.
Unordered inserts cause frequent splits. Random UUIDs as primary keys are the infamous example. Because each new UUID lands at a random position in the leaf layer, almost every insert hits a different page. Once that page is full, it splits. The net effect:
- Leaf pages average ~50% fill factor (each split leaves two half-full pages)
- 2x more pages to read for range scans
- Write amplification: a single INSERT triggers additional I/O to write the new page and update the parent node's pointer
Sequential IDs (auto-increment or Snowflake IDs) avoid this entirely. New IDs always sort to the end of the leaf layer. Inserts append to the rightmost page. No splits until that page fills, and then growth expands the tree rightward. Fill factor stays near 100%.
Random UUID inserts:
[page A: 50% full] [page B: 50% full] [page C: 50% full] ...
Sequential ID inserts:
[page A: 100% full] [page B: 100% full] [page C: 70% full] ← current end
This is why distributed-id-generation.mdx recommends Snowflake over UUID v4 for write-heavy tables. The index behavior, not just storage size, drives that recommendation.
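The fill-factor difference can be reproduced with a toy model of the leaf layer. This is a deliberate simplification (tiny pages, no interior nodes), but it implements the two behaviors the text describes: mid-point splits for random keys and the rightmost-page optimization real engines apply to ascending keys:

```python
import random

PAGE_CAP = 4  # entries per leaf page; real pages hold hundreds

def bulk_insert(keys):
    """Insert keys into a sorted list of leaf pages, splitting on overflow."""
    pages = [[]]
    for k in keys:
        # find the last page whose smallest key is <= k
        i = 0
        for j, p in enumerate(pages):
            if p and p[0] <= k:
                i = j
        pages[i].append(k)
        pages[i].sort()
        if len(pages[i]) > PAGE_CAP:
            if i == len(pages) - 1 and pages[i][-1] == k:
                # ascending insert into the rightmost page: open a fresh page,
                # leaving the old one completely full
                pages.append([pages[i].pop()])
            else:
                # normal mid-point split: two half-full pages
                mid = len(pages[i]) // 2
                pages.insert(i + 1, pages[i][mid:])
                pages[i] = pages[i][:mid]
    return pages

def fill_factor(pages):
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAP)

keys = list(range(100))
seq_fill = fill_factor(bulk_insert(keys))      # sequential IDs

random.seed(42)
shuffled = keys[:]
random.shuffle(shuffled)                        # UUID-like random order
rand_fill = fill_factor(bulk_insert(shuffled))

print(f"sequential fill: {seq_fill:.0%}, random fill: {rand_fill:.0%}")
```

Sequential inserts pack every page; shuffled inserts leave pages part-empty after splits, so the same data occupies more pages.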
Partial indexes
A partial index covers only a subset of rows, defined by a WHERE clause. This makes the index smaller, faster, and cheaper to maintain.
-- Full index: indexes all 100M users
CREATE INDEX idx_users_email ON users (email);
-- Partial index: indexes only unverified users (say, 5M rows)
CREATE INDEX idx_unverified_users ON users (email)
WHERE email_verified = false;
-- The partial index is 20x smaller. This query hits only that index:
SELECT * FROM users WHERE email = ? AND email_verified = false;
Use partial indexes when a query always filters on a column with a very skewed distribution: active vs. inactive records, pending vs. processed jobs, unread vs. read notifications.
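SQLite also supports partial indexes, so the pattern above can be verified directly. The planner only picks the partial index when the query's WHERE clause implies the index predicate (here a boolean stored as 0/1, an assumption of this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, email_verified INTEGER)")
conn.execute(
    "CREATE INDEX idx_unverified_users ON users (email) WHERE email_verified = 0"
)

rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM users WHERE email = 'alice@example.com' AND email_verified = 0"
).fetchall()
print(rows[0][-1])  # SEARCH using idx_unverified_users
```

Drop the email_verified = 0 predicate from the query and the partial index becomes unusable, because the planner cannot prove the predicate holds.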
Expression indexes
An expression index is built on a computed expression, not just a raw column value.
-- Problem: case-insensitive search requires full scan
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Solution: index the expression
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now the query above hits the index.
Expression indexes are the fix for any query that applies a function to a column in the WHERE clause. The database stores the expression's result in the index, not the raw column value.
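The same fix works in SQLite, which matches the query's expression against the indexed expression structurally. A minimal sketch of the LOWER(email) case from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("CREATE INDEX idx_users_email_lower ON users (lower(email))")

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE lower(email) = 'alice@example.com'"
).fetchall()
print(rows[0][-1])  # SEARCH using idx_users_email_lower
```

A plain index on (email) would not help here: the planner sees lower(email) in the WHERE clause, not email, and only the expression index stores that computed value.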
When indexes hurt
Adding more indexes sounds harmless. It is not.
Every index must be maintained on every INSERT, UPDATE, and DELETE. A table with 10 indexes does 10x more index maintenance work on every write. For write-heavy tables (event logs, audit tables, high-volume order tables), excess indexes can make writes the bottleneck.
The concrete harm:
- Each extra index costs 1-3 additional page writes per INSERT
- UPDATE on an indexed column rewrites the index entry (delete old, insert new)
- Index maintenance holds locks, which blocks concurrent writers
Run SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0 in PostgreSQL to find unused indexes. Drop them. An unused index is pure write overhead.
Write-heavy tables need index audits, not more indexes
I've worked with teams whose orders table had 15 indexes added over 3 years ("just in case"). At 50K inserts/second, those 15 indexes consumed more I/O than the actual query workload. Run the unused index query monthly on any table receiving more than 1K writes/second.
Choosing the right index type
flowchart TD
Q1{"Query filter\ncardinality?"}
Q1 -->|"High (email, UUID,\nmost values unique)"| Q2
Q1 -->|"Low (status, boolean,\n2-5 distinct values)"| Partial["Partial index on\nthe minority value\nWHERE status = 'pending'"]
Q2{"Query returns\nonly indexed columns?"}
Q2 -->|"Yes"| Cover["Covering index\nINCLUDE all returned columns\nzero heap reads"]
Q2 -->|"No, fetches full row"| Q3
Q3{"Multi-column\nWHERE clause?"}
Q3 -->|"Yes"| Comp["Composite index\nhighest-selectivity column first\nor leading = always-present column"]
Q3 -->|"Single column"| Standard["Standard B-tree index\n4-5 page reads, then heap fetch"]
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| PostgreSQL | Default index type; CREATE INDEX builds a B-tree unless another access method is named | Uses B+ trees (all values in leaves, linked leaf list for range scans). VACUUM reclaims dead tuple space in indexes. |
| MySQL InnoDB | Clustered index on primary key, secondary indexes store PK value as pointer | Clustered B+ tree means PK lookups read data directly from leaf. Secondary index lookups need two traversals: secondary index to PK value, then clustered index by PK. |
| SQLite | B-tree for table storage and indexes | Tables are themselves B-trees keyed by rowid. Covering indexes eliminate the second lookup step. |
| Oracle/SQL Server | B-tree indexes with index-organized tables | Index-organized tables (IOT) store all row data in the leaf nodes, similar to InnoDB's clustered index. |
Interview cheat sheet
- Index lookups cost 4-5 page reads regardless of table size, because B-tree depth grows logarithmically. A 100M row table and a 100B row table have almost the same lookup depth.
- Column order in a composite index is the single most important design decision. The leading column must match your query's WHERE clause, or the index is useless.
- Sequential primary keys (auto-increment, Snowflake) avoid B-tree fragmentation entirely. Random UUIDs cause ~2x page overhead from splits.
- A covering index eliminates heap reads. For read-heavy SELECT queries, it's often the biggest optimization with no schema change required.
- Partial indexes on skewed distributions (active/inactive, pending/complete) give you a small, fast index for the common case.
- More indexes make reads faster but writes slower. On write-heavy tables, unused indexes are a measurable bottleneck.
Quick recap
- B-tree indexes reduce a full table scan (millions of reads) to a 4-5 level traversal, regardless of table size.
- Leaf nodes store index entries and heap pointers. Range scans walk the doubly-linked leaf layer horizontally after the initial traversal.
- Column order in composite indexes is not arbitrary: the leading column must match your query's WHERE clause or the database cannot navigate the tree.
- Sequential primary keys (auto-increment, Snowflake) prevent fragmentation by appending to the rightmost leaf page. Random UUIDs scatter inserts and leave leaf pages half-full.
- Covering indexes eliminate heap reads entirely. For read-heavy queries, this is often the largest single optimization available.
- Every index adds write overhead. On write-heavy tables, excessive indexes are a measurable bottleneck, not a free optimization.
Related concepts
- Distributed ID Generation: the primary key type you choose directly determines whether your B-tree indexes fragment or stay dense. Snowflake IDs are the recommendation specifically because of index performance, not just storage size.
- Databases: B-trees underpin relational database storage engines. Understanding the storage layer helps you reason about query plans and schema design.
- MVCC: PostgreSQL's multi-version concurrency control creates row versions that must eventually be cleaned up (VACUUM). Understanding B-tree structure explains why dead tuples cause index bloat and why VACUUM is not optional.