Indexing strategies
Learn how database indexes turn slow full-table scans into sub-millisecond lookups, when to use B-tree vs hash vs composite indexes, and when too many indexes hurt more than they help.
TL;DR
- An index is a separate data structure (usually a B-tree) that lets the database find matching rows in O(log N) instead of scanning every row O(N). Without indexes, every query is a full table scan.
- Composite index column order matters enormously: the leftmost prefix rule means an index on (a, b, c) helps queries filtering on a, a+b, or a+b+c, but not b alone or c alone. Put equality columns first, range columns last.
- Covering indexes include all columns a query needs, so the engine reads only the index and never touches the table. This is the single highest-impact read optimization for hot query paths.
- Partial indexes index only a subset of rows (e.g., WHERE status = 'pending'). They are tiny, targeted, and dramatically underused.
- Indexes are not free: every write must update every index on the table. A table with 10 indexes pays 10x the write maintenance cost. Over-indexing kills write throughput.
The Problem It Solves
Your SaaS application has been in production for two years. The events table has grown to 200 million rows. The dashboard page loads a list of recent events for the current user:
SELECT event_id, event_type, created_at
FROM events
WHERE user_id = 'usr_8374'
ORDER BY created_at DESC
LIMIT 50;
This query used to return in 3ms. Last week it started taking 4 seconds. Your monitoring shows the database CPU pegged at 95% during business hours, and this query is the top consumer.
You run EXPLAIN ANALYZE and see the problem:
Seq Scan on events (cost=0.00..4521890.00 rows=847 width=64)
Filter: (user_id = 'usr_8374')
Rows Removed by Filter: 199999153
Planning Time: 0.1 ms
Execution Time: 4200 ms
A sequential scan. The database is reading all 200 million rows, checking the user_id predicate on each one, and throwing away 199,999,153 rows to return just 847 matches. Four seconds of CPU and I/O to find 847 rows in 200 million.
I often see engineers reach for caching or read replicas at this point, but both miss the real issue. This query is slow because the database is doing 200 million comparisons to find 847 matches. The fix isn't more hardware; it's telling the database where to look.
CREATE INDEX idx_events_user_created ON events(user_id, created_at DESC);
After the index, the same query:
Index Scan using idx_events_user_created on events (cost=0.56..42.15 rows=847 width=64)
Index Cond: (user_id = 'usr_8374')
Planning Time: 0.1 ms
Execution Time: 0.8 ms
From 4,200ms to 0.8ms. A 5,000x improvement from a single CREATE INDEX statement. That is the power of indexing, and understanding which index to create is one of the most practical skills in system design.
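You can reproduce this before/after behavior end to end with a self-contained sketch. This uses SQLite through Python's sqlite3 module purely for portability (an assumption on my part; Postgres's EXPLAIN output looks different, but the full-scan vs index-search distinction is identical), and all table data is synthetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id   INTEGER PRIMARY KEY,
        user_id    TEXT,
        event_type TEXT,
        created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO events (user_id, event_type, created_at) VALUES (?, ?, ?)",
    [(f"usr_{i % 100}", "click", f"2026-01-{i % 28 + 1:02d}") for i in range(10_000)],
)

QUERY = """
    SELECT event_id, event_type, created_at
    FROM events
    WHERE user_id = 'usr_42'
    ORDER BY created_at DESC
    LIMIT 50
"""

def plan(sql):
    # EXPLAIN QUERY PLAN rows describe each step in their 4th column
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(QUERY)  # full table scan, plus a temp B-tree for the sort
conn.execute("CREATE INDEX idx_events_user_created ON events(user_id, created_at DESC)")
after = plan(QUERY)   # index search; the sort comes free from index order
print(before)
print(after)
```

On a real Postgres instance you would validate the same transition with EXPLAIN ANALYZE, as shown above.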
What Is It?
A database index is a separate data structure (most commonly a B-tree) that the query engine consults before touching the main table. It maps column values to the physical locations of matching rows, letting the engine jump directly to the data it needs instead of scanning every row.
Analogy: Think of the index at the back of a textbook. Without it, finding every mention of "B-tree" requires reading all 500 pages. With it, you flip to the index, find "B-tree: pages 42, 87, 203" and go directly there. The index is smaller than the book, sorted alphabetically, and maps keywords to page numbers.
A database index works the same way. It is sorted by the indexed column's value and stores pointers to the corresponding table rows. The tradeoff: the index takes up disk space and must be updated on every write, but it turns O(N) scans into O(log N) lookups.
The cost model is straightforward: index reads are O(log N) + K, where K is the number of matching rows. Sequential scans are O(N). For large tables and selective queries (small K relative to N), indexes win by orders of magnitude. For small tables (under 1,000 rows) or queries returning most of the table (more than 20-30% of rows), a sequential scan is actually faster because it avoids the overhead of tree traversal and random I/O.
How It Works
Here's what happens step by step when the query engine uses a B-tree index:
- Parse the query. The planner looks at the WHERE clause and checks which indexes exist on the referenced columns.
- Walk the B-tree root to leaf. The B-tree is a balanced tree (typically 3-4 levels deep for millions of rows). Starting at the root, the engine follows pointers down the tree, comparing the search key at each level. Each level narrows the search space by the node's fanout, often hundreds of keys per page, which is why the tree stays so shallow.
- Scan the leaf nodes. The leaf level contains the indexed values in sorted order, each with a pointer to the corresponding table row (the ctid in Postgres; in MySQL's InnoDB, secondary index leaves store the primary key instead). For range queries, the engine scans consecutive leaf nodes.
- Fetch table rows. For each matching leaf entry, the engine follows the pointer to the table heap to read the full row. This is a random I/O operation (the rows may be scattered on disk).
- Return results. If the index covers all required columns (a covering index), step 4 is skipped entirely, making the query even faster.
For a table with 10 million rows, a B-tree is typically 3-4 levels deep. Finding one row requires 3-4 page reads (each ~8 KB). Compare that to a sequential scan: 10 million rows at ~100 bytes each is about 1 GB of data to read from disk.
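Those numbers are easy to sanity-check. Here is the back-of-envelope arithmetic, where the fanout of 400 keys per 8 KB page and the 100-byte row size are illustrative assumptions, not fixed constants:

```python
import math

# Assumptions: ~8 KB pages holding roughly 400 keys each (fanout varies
# with key size), 10 million rows, ~100 bytes per row.
fanout = 400
rows = 10_000_000

levels = math.ceil(math.log(rows, fanout))  # tree levels to descend
seq_scan_gb = rows * 100 / 1e9              # data a full scan must read

print(levels)       # a handful of page reads per lookup
print(seq_scan_gb)  # about 1 GB for the sequential scan
```

With a wider fanout (smaller keys) the tree gets even shallower, which is why B-tree depth grows so slowly with table size.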
B-tree indexes (the default in Postgres and MySQL) support these operations efficiently: =, <, >, <=, >=, BETWEEN, IN, LIKE 'prefix%' (prefix matches only), and ORDER BY on the indexed column.
Operations they do not support: LIKE '%suffix' (needs full-text or trigram index), != and NOT IN (anti-patterns for index use), and functions applied to indexed columns. WHERE lower(email) = ? won't use an index on email because the engine can't match the function output to the sorted index values. The fix is a functional index: CREATE INDEX idx_email_lower ON users(lower(email)).
Functions on indexed columns break index usage
If your WHERE clause applies a function to the column (WHERE YEAR(created_at) = 2026), the optimizer cannot use a standard index on that column. Either create a functional index or rewrite the query to use range predicates: WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'. This is one of the most common performance mistakes I see in production SQL.
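You can watch this failure mode and its fix directly. The sketch below uses SQLite, whose indexes on expressions are the analogue of Postgres functional indexes (table and index names are my own illustrations):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"User{i}@Example.com",) for i in range(1_000)])
conn.execute("CREATE INDEX idx_users_email ON users(email)")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT id FROM users WHERE lower(email) = 'user7@example.com'"

before = plan(q)  # the plain index on email can't match lower(email): scan
conn.execute("CREATE INDEX idx_email_lower ON users(lower(email))")
after = plan(q)   # the expression index matches lower(email) exactly
print(before)
print(after)
```

Note that the expression in the query must match the indexed expression exactly for the planner to use it.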
Key Components
| Component | Role |
|---|---|
| B-tree index | Default index type. Balanced tree supporting equality, range, and sort operations. 3-4 levels deep for millions of rows. |
| Hash index | O(1) lookups for equality only. Cannot support range queries or sorting. Available in Postgres (USING hash) and used in memory-optimized engines (MySQL MEMORY tables). |
| Covering index | An index that includes all columns needed by a query, enabling index-only scans (no table heap access). |
| Partial index | An index on a subset of rows matching a WHERE condition. Tiny and targeted. |
| Composite index | An index on multiple columns. Column order matters (leftmost prefix rule). |
| GIN index | Generalized Inverted Index. Used for full-text search, JSONB containment, and array operations in Postgres. |
| GiST index | Generalized Search Tree. Used for geometric data, range types, and nearest-neighbor queries in Postgres. |
| Bitmap index | Combines multiple indexes by AND/OR-ing bitmaps of matching rows. Postgres builds these at query time (bitmap heap scans) for multi-column filter queries rather than storing them on disk. |
Types / Variations
Composite Index Column Order
A composite index on (a, b, c) is useful for queries on:
- a alone
- a and b
- a, b, and c
But not directly useful for:
- b alone
- c alone
- b and c together
This is the leftmost prefix rule. The engine walks the B-tree starting from the first column. Without a filter on the first column, there is no entry point into the sorted order, so the engine must scan the whole index or fall back to a table scan.
The golden rule: equality columns first, range columns last. The engine can narrow down equality filters precisely, then scan the remaining range efficiently.
-- Query: WHERE status = 'active' AND created_at > '2026-01-01'
-- Good index: (status, created_at)
-- status = 'active' narrows to exact leaf nodes, then scan created_at range
-- Bad index: (created_at, status)
-- range on created_at scans wide, can't efficiently use status filter
For your interview: if someone asks how you'd optimize a slow query, the first thing to check is whether the composite index column order matches the query's filter pattern. This single optimization fixes the majority of slow query problems.
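The leftmost prefix rule is observable in any B-tree engine. A minimal SQLite sketch (column names a, b, c match the example above; the data is synthetic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i % 10, i % 100, i) for i in range(5_000)])
conn.execute("CREATE INDEX idx_abc ON t(a, b, c)")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

with_prefix = plan("SELECT * FROM t WHERE a = 3 AND b = 42")  # SEARCH via the index
no_prefix   = plan("SELECT * FROM t WHERE b = 42")            # no leading column: scan
print(with_prefix)
print(no_prefix)
```

The first query descends the tree on a, then b; the second has no way to use the sorted order and degrades to a scan.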
Covering Index (Index-Only Scan)
A covering index includes all columns needed by a query, both the filter columns (WHERE) and the select columns (SELECT). The query engine reads only the index, never touching the table's heap pages. This is called an index-only scan.
-- Query: SELECT user_id, created_at FROM orders WHERE status = 'pending'
-- Regular index on (status):
-- Engine finds matching rows via index, then fetches user_id and created_at
-- from the table heap (random I/O for each row)
-- Covering index on (status, user_id, created_at):
-- Index contains everything. No table heap access at all.
-- For 10,000 matching rows, this can save 10,000 random I/O operations.
Covering indexes are larger (more columns stored in the index) but can eliminate table heap reads entirely on hot query paths. In Postgres, you can also use INCLUDE columns that are stored in the index but not part of the B-tree key:
-- Postgres INCLUDE syntax: key on (status), include user_id and created_at
CREATE INDEX idx_orders_status_covering ON orders(status)
INCLUDE (user_id, created_at);
-- The INCLUDE columns are in the leaf nodes but don't affect sort order
My recommendation: for your top 3 most frequent read queries, check if a covering index can eliminate table heap access. This is the single highest-ROI index optimization.
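You can verify whether a query actually gets an index-only scan rather than assuming it. SQLite labels the plan "USING COVERING INDEX" when no heap access is needed (Postgres says "Index Only Scan"); the schema below is an illustrative stand-in for the orders example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    status TEXT, user_id TEXT, created_at TEXT, total REAL
)""")
conn.executemany(
    "INSERT INTO orders (status, user_id, created_at, total) VALUES (?, ?, ?, ?)",
    [("pending" if i % 100 == 0 else "shipped", f"usr_{i}", "2026-01-01", 9.99)
     for i in range(2_000)],
)

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT user_id, created_at FROM orders WHERE status = 'pending'"

conn.execute("CREATE INDEX idx_status ON orders(status)")
narrow = plan(q)    # index search, then a heap fetch for every matching row
conn.execute("DROP INDEX idx_status")
conn.execute("CREATE INDEX idx_covering ON orders(status, user_id, created_at)")
covering = plan(q)  # all selected columns live in the index: no heap access
print(narrow)
print(covering)
```

If the covering marker is missing, a column referenced by the query is not in the index.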
Partial Index
A partial index indexes only rows matching a condition. It is dramatically smaller than a full-table index when the filtered set is a small fraction of rows.
-- Only 1% of orders are pending. Index just those.
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- This index is ~100x smaller than indexing all orders
-- Query: SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at
-- Uses the tiny partial index instead of scanning all orders
Partial indexes are underused. They are ideal for:
- Status-filtered queues (WHERE status = 'pending')
- Soft-delete patterns (WHERE deleted_at IS NULL)
- Feature flags (WHERE is_beta = true)
- Any column where one value is queried 100x more than others
The engine only uses a partial index when the query's WHERE clause matches or implies the index's condition. If your query says WHERE status = 'pending' AND created_at > ?, the partial index on WHERE status = 'pending' applies. If your query says WHERE status = 'shipped', it does not.
GIN and GiST Indexes (Postgres)
GIN (Generalized Inverted Index): Designed for values that contain multiple elements, like arrays, JSONB documents, or full-text search vectors. A GIN index on a tags array column lets you efficiently query WHERE tags @> ARRAY['urgent']. For full-text search, GIN on a tsvector column supports @@ text search operators.
GiST (Generalized Search Tree): Designed for geometric and range data. PostGIS uses GiST for spatial queries (ST_DWithin, ST_Contains). Also used for IP range containment (inet type) and nearest-neighbor searches (ORDER BY point <-> target_point).
In interviews, you rarely need to go deep on GIN/GiST. The important thing to say is: "for full-text search I'd use a GIN index on a tsvector column, and for geospatial queries I'd use a GiST index." That shows awareness without over-explaining.
Comparison
| Index type | Best for | Supports range queries | Write overhead | Space overhead |
|---|---|---|---|---|
| B-tree | General purpose (equality, range, sort) | Yes | Medium | Medium |
| Hash | Equality-only lookups | No | Low | Low |
| Covering B-tree | Read-heavy hot paths | Yes | High (more columns) | High |
| Partial | Filtering rare status values | Yes (on subset) | Very low | Very low |
| GIN | Full-text search, JSONB, arrays | No (containment) | High (inverted) | High |
| GiST | Geospatial, ranges, nearest-neighbor | Yes (spatial) | Medium | Medium |
| Bitmap | Multi-column OR queries (internal use) | N/A | N/A | N/A |
Trade-offs
The fundamental tension: every index speeds up reads but slows down writes. There is no free lunch. You are trading write throughput and storage for read latency.
| Advantage | Disadvantage |
|---|---|
| Turns O(n) full table scans into O(log n) lookups | Every INSERT/UPDATE/DELETE must update all affected indexes |
| Covering indexes eliminate table heap reads entirely | More indexes = more storage, more memory pressure |
| Partial indexes are tiny and surgical | Engine may ignore index if selectivity is poor |
| Composite indexes serve multiple query patterns | Wrong column order makes the index useless for some queries |
| Index-only scans avoid random I/O | Index bloat on high-churn tables (Postgres needs VACUUM) |
| GIN/GiST unlock specialized queries (full-text, geo) | Specialized indexes have higher write cost than B-tree |
Write Amplification
Every write to the table must also update every index on that table. A table with 10 indexes has 10x the index write cost.
For high-write tables (event logs, click tracking, time-series data), minimize index count. I have seen production systems drop from 15ms insert latency to 2ms by removing 4 unused indexes.
Index Bloat
Postgres B-tree indexes do not shrink when rows are deleted. The space in leaf pages is marked reusable, but the tree height stays the same. High-churn tables (frequent insert + delete cycles) accumulate dead space.
Monitor bloat and act on it:
-- Postgres: find non-unique indexes with zero scans (candidates for removal)
SELECT s.schemaname, s.relname AS table_name, s.indexrelname AS index_name, s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0 AND NOT i.indisunique;
An unused index costs you on every write but helps on zero reads. Audit index usage quarterly and drop anything with zero scans that is not a uniqueness constraint.
Selectivity Matters
An index on a boolean column with 50/50 distribution provides almost no benefit. The engine is better off scanning the table. Low-cardinality columns (status with 3 values) are bad standalone index candidates.
The threshold varies by database, but as a rule of thumb: if the query returns more than 15-20% of the table, the planner usually chooses a sequential scan over the index. The random I/O cost of hopping between index and heap is worse than scanning everything sequentially.
When to Use It / When to Avoid
Use indexing when:
- Read-heavy workloads with known query patterns
- Queries filter on specific columns (WHERE, JOIN ON)
- Latency on hot read paths is critical (p50 under 5ms)
- Table is large enough that sequential scan is painful (more than 100k rows)
- You need to enforce uniqueness constraints
Avoid over-indexing when:
- The table is write-heavy with few reads (event logs, audit trails)
- The table is small enough that sequential scan is fast (under 10k rows)
- You are indexing low-cardinality columns without combining with high-cardinality ones
- You already have more than 5-6 indexes on a single table
- Nobody queries on that column (check pg_stat_user_indexes)
Start with zero indexes (besides primary key). Add them one at a time, driven by actual slow query logs. This is the opposite of what most developers do: they create 8 indexes upfront and wonder why writes are slow.
Real-World Examples
Uber: Geospatial Indexing at Scale
Uber processes millions of ride-matching queries per second. Each query asks: "find all available drivers within 2km of this GPS coordinate." Without a spatial index, this requires computing distances to every driver. Uber uses geospatial indexes (R-tree / GiST-equivalent) on driver location data to answer these queries in under 10ms. They also use H3 hexagonal grid cells as a hash-based alternative to spatial indexing, where the hex cell ID becomes a simple equality lookup.
Shopify: Composite Index Optimization
Shopify handles over 10,000 requests per second on their storefront. Their orders table is one of the most queried. They found that changing a composite index from (shop_id, created_at) to (shop_id, status, created_at) eliminated 85% of the table heap reads for their most common dashboard query. The status equality filter (equality-before-range rule) let the engine narrow down to a precise range before scanning dates.
Stack Overflow: Covering Indexes
Stack Overflow famously runs on just two SQL Server instances for all read traffic. A major reason this works: aggressive use of covering indexes on hot query paths. The question listing page (their most hit endpoint) uses a covering index that includes the title, score, view count, and creation date. The query never touches the full question row. This lets them serve 6,000 requests per second per server.
How This Shows Up in Interviews
When to Bring It Up
Mention indexing anytime a design discussion involves:
- A database with more than trivial read volume
- A query that filters, sorts, or joins on specific columns
- A requirement for sub-10ms read latency
- Trade-offs between read and write performance
Depth to Demonstrate
- Know the difference between B-tree and hash indexes
- Know how composite index column order affects query performance
- Know what a covering index is and when to use one
- Know that indexes hurt writes and how to quantify the trade-off
- Mention EXPLAIN ANALYZE as the tool for validating index effectiveness
- Know partial indexes exist for filtering rare values
In a system design interview, say: "I would add a composite index on (user_id, created_at) to serve the user's order history query. Since this is our hottest read path, I would make it a covering index to eliminate heap reads. For the admin dashboard that filters by status, a partial index on pending orders keeps the index small." This shows practical depth, not just textbook knowledge.
Common Interview Questions
| Question | What they are testing |
|---|---|
| "Our orders dashboard is slow. How do you debug it?" | EXPLAIN ANALYZE, check for Seq Scan, add composite index with correct column order |
| "We have 500M rows and need full-text search." | GIN index on tsvector column, or dedicated search engine (Elasticsearch) |
| "Writes are getting slower as the table grows." | Too many indexes, check for unused ones, consider partial indexes |
| "How would you index a multi-tenant SaaS database?" | Tenant ID as the first column in every composite index |
| "What is the downside of adding more indexes?" | Write amplification, storage, index bloat, maintenance cost |
Test Your Understanding
Each scenario simulates a real design decision. Think through your answer before reading on.
1. The Slow Dashboard Query
Your analytics dashboard shows the top 50 orders by revenue for a given merchant in the last 30 days. The query takes 12 seconds on a table with 200M rows. EXPLAIN ANALYZE shows a Seq Scan. What index would you create, and in what column order?
2. Write Latency Spikes
Your order service processes 5,000 inserts per second. After a new feature launch, insert latency jumped from 3ms to 18ms. You discover the orders table now has 12 indexes. Three of those indexes have zero scans in pg_stat_user_indexes. What do you do?
3. The Full-Text Search Problem
Your e-commerce site needs product search. Users type "wireless bluetooth headphones noise cancelling" and expect relevant results in under 100ms. The products table has 5M rows with a name and description column. A LIKE '%bluetooth%' query takes 8 seconds. What do you recommend?
4. Multi-Tenant Index Strategy
You are building a SaaS platform where each tenant has their own orders, customers, and products. All tenants share the same database tables. A large tenant with 10M orders complains their dashboard is slow. Small tenants (1k orders) are fine. How do you structure your indexes?
5. Partial Index vs Full Index
Your notification service has a notifications table with 500M rows. 99.5% of notifications are status = 'sent'. Only 0.5% are status = 'pending' (2.5M rows). The background worker polls for pending notifications every second. Should you use a full index or partial index on status?
6. Index-Only Scan Validation
A colleague creates a covering index (user_id, product_id, quantity, created_at) to optimize a query. They run EXPLAIN ANALYZE and see "Index Scan" instead of "Index Only Scan." What went wrong?
7. Choosing Between Database Indexing and External Search
Your application needs to search across user profiles by name, bio, location, and skills (stored as a JSONB array). The table has 20M rows. Should you use database indexes or an external search engine like Elasticsearch?
Quick Recap
- B-tree indexes support equality, range, and prefix operations in O(log n) time. They do not help for suffix-match LIKE, functions applied to indexed columns, or queries returning more than 15-20% of rows.
- Composite index column order follows the leftmost prefix rule. Put equality columns first, range predicates last. The order of columns in the index determines which queries it can serve.
- Covering indexes include the SELECT columns alongside the filter columns. The engine reads only the index, never the table heap. This is the highest-impact optimization for read-heavy hot paths.
- Partial indexes index only rows matching a condition. Use them for queue patterns, soft deletes, and rare-value filtering. They are tiny and have minimal write overhead.
- GIN indexes serve full-text search and JSONB containment queries. GiST indexes serve geospatial and range queries. Know when to mention each.
- Every index slows writes. A table with 10 indexes pays 10x the maintenance cost per write. Audit with pg_stat_user_indexes and drop unused indexes.
- Start with no indexes (beyond the primary key). Add them one at a time based on actual slow query logs and EXPLAIN ANALYZE output.
Related Concepts
- Databases: the storage engines that implement these index structures
- Sharding: when a single database with perfect indexes still cannot handle the load, you partition data across servers
- Caching: caching sits in front of the database and reduces the number of queries that hit indexes at all
- Data Partitioning: table-level partitioning works alongside indexing; each partition has its own local indexes