B-tree indexes
Learn how B-tree indexes store and retrieve rows, why column order in composite indexes matters, and what causes index fragmentation at scale.
The problem
You have a users table with 100 million rows. Someone runs SELECT * FROM users WHERE email = 'alice@example.com'. Without an index, the database reads every single row. On spinning disk, that's minutes. On SSD, still seconds. Your app times out, your users see errors, and you're paged at 3 a.m.
The fix everyone knows is "add an index." The part most engineers skip is understanding what that index actually is, because that knowledge is what lets you design indexes that perform under load instead of ones that silently hurt you.
What a B-tree is
A B-tree (Balanced tree) is a self-balancing tree where every leaf node is at the same depth. Most database indexes (PostgreSQL, MySQL InnoDB, SQLite) are actually B+ trees, a variant where all data pointers live in the leaf layer and internal nodes store only keys for navigation.
Analogy: Think of a library catalog organized by first letter, then second letter, drilling down until you hit the actual shelf location. You never need to check every book. You navigate from the front of the catalog to the right section, then the right subsection, then pick up the exact card.
[M]
/ \
[D, H] [R, W]
/ | \ | \
[A-C][E-G][I-L] [N-Q][S-V][X-Z]
| | | | | |
rows rows rows rows rows rows β leaf pages with row pointers
The depth of a B-tree on a table with 100 million rows is about 4-5 levels. Every lookup is 4-5 page reads, regardless of which row you're looking for.
How a lookup works
SELECT * FROM users WHERE email = 'alice@example.com';
With an index on (email):
- Start at the root. The root node holds a small number of key ranges (typically a few hundred entries per page, since B-tree nodes are sized to match a disk page, usually 8KB or 16KB).
- Navigate down. At each internal node, compare the target email against the stored boundary keys to pick the correct child pointer. Follow the pointer. Repeat until you reach a leaf node.
- Hit the leaf. Leaf nodes store the indexed column value and a pointer to the heap row (the actual row's physical location on disk).
- Fetch the row. Follow the heap pointer to retrieve the full row data.
Total cost: 4-5 index page reads + 1 heap page read. Total: 5-6 random I/O operations instead of 100,000,000.
For a range scan (WHERE email BETWEEN 'a@...' AND 'b@...'), the database walks the leaf level horizontally. Leaf nodes are doubly linked, so range scans are sequential reads after the initial traversal.
Lookup walk: step by step
spawnSync d2 ENOENT
Depth is logarithmic, not linear
A B-tree on 100 million rows is about 4-5 levels deep. A B-tree on 100 billion rows is about 6-7 levels deep. Depth grows as log(N), so adding 1,000x more rows only adds 2-3 more levels. This is why index lookups have roughly constant cost regardless of table size, and why "add an index" solves most read performance problems permanently rather than temporarily.
Composite indexes and column order
The most important practical rule: column order in a composite index is not arbitrary.
A composite index (last_name, first_name) can satisfy:
WHERE last_name = 'Smith'WHERE last_name = 'Smith' AND first_name = 'Alice'ORDER BY last_name, first_name
It cannot satisfy:
WHERE first_name = 'Alice'(no leading column match)ORDER BY first_name, last_name(wrong order)
The reason: B-trees sort entries lexicographically by column order. The tree is physically sorted by last_name first. Skipping the leading column means the database has no starting point in the tree and must scan the entire leaf level.
-- Index: (last_name, first_name, created_at)
-- This uses the index efficiently:
SELECT * FROM users
WHERE last_name = 'Smith'
AND first_name = 'Alice'
ORDER BY created_at DESC;
-- This does not use the index at all:
SELECT * FROM users
WHERE first_name = 'Alice'; -- missing leading column last_name
The rule of thumb: put the column with the highest selectivity (most distinct values) first, unless the leading column is always present in your queries. Write the queries first, then design the index.
Index-only scans and covering indexes
A standard index lookup does two things: reads the index, then reads the heap row. The second step is a random I/O per row. At high volume, those heap reads dominate query cost.
A covering index includes all columns the query needs so the heap read never happens.
-- Query: fetch user IDs and emails for a given account_id
SELECT user_id, email FROM users WHERE account_id = 42;
-- Covering index: includes all referenced columns
CREATE INDEX idx_users_covering ON users (account_id, user_id, email);
-- PostgreSQL can now answer this query from the index alone.
-- The "Index Only Scan" in EXPLAIN ANALYZE confirms zero heap reads.
PostgreSQL calls this an "Index Only Scan." MySQL InnoDB calls it a "covering index." The effect is the same: the database never touches the actual table data. For read-heavy queries, this is often the single biggest query optimization available.
Page splits and fragmentation
B-trees stay balanced by splitting nodes when they fill up. A leaf node filled to capacity cannot accept another entry without splitting into two half-full nodes.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.