Inverted index
How search engines use inverted indexes to find documents by term in O(1) per lookup. The data structure behind Elasticsearch, Lucene, and PostgreSQL full-text search, covering posting lists, TF-IDF scoring, and segment merging.
The problem
Your e-commerce platform has 50 million product listings. A user types "wireless noise cancelling headphones" into the search bar. You need to return the most relevant products in under 100ms.
B-tree indexes handle exact matches and ranges well. WHERE price > 100 is a range scan. WHERE brand = 'Sony' is a point lookup. But WHERE description CONTAINS 'wireless noise cancelling headphones' is a completely different problem. The search terms could appear anywhere in the text, in any order, and the user expects ranked results (not just matching documents).
```sql
WHERE price > 100                          -- B-tree: range scan, fast
WHERE user_id = 42                         -- B-tree: point lookup, fast
WHERE title = 'exact string'               -- B-tree: exact match, fast
WHERE title LIKE '%search%'                -- B-tree: CANNOT use index, full table scan
WHERE body CONTAINS 'distributed systems'  -- B-tree: impossible
```
A naive approach scans every document, checking if the search terms appear in the text. At 50 million documents, this is 50 million string comparisons per query. Even at 1 microsecond per check, that is 50 seconds. Users expect 100ms.
This is the problem the inverted index solves: given a set of search terms, find all matching documents in O(1) per term, regardless of corpus size, then rank them by relevance.
What it is
An inverted index is a data structure that maps each unique term (word) to the list of documents containing that term. It "inverts" the natural relationship: instead of each document pointing to its words, each word points back to its documents.
Analogy: Think of the index at the back of a textbook. If you want to find every page that mentions "distributed systems," you do not read the entire book. You look up "distributed systems" in the alphabetical index, which lists page numbers 42, 87, and 234. The inverted index does the same thing for a corpus of millions of documents, each with thousands of words.
The core insight is that building the index is expensive (you must read every document once), but querying it is cheap: a single hash lookup per search term, regardless of whether the corpus has 1,000 or 1 billion documents.
How it works
The inverted index has two phases: indexing (build time) and querying (search time). At build time, every document is tokenized into terms, and each term is mapped to its list of document IDs. At query time, terms are looked up and the document lists are intersected or unioned.
The build process in pseudocode:
```
function build_inverted_index(documents):
    index = HashMap<String, List<Posting>>()
    for each document in documents:
        terms = tokenize(document.text)   // split, lowercase, stem
        for each term in terms:
            posting = {
                doc_id:    document.id,
                term_freq: count(term, document.text),
                positions: find_positions(term, document.text)
            }
            index[term].append(posting)
    // Sort each posting list by doc_id for efficient intersection
    for each term in index:
        sort(index[term], by: doc_id)
    return index

function search(index, query_terms):
    results = index[query_terms[0]]       // first term's posting list
    for each remaining term in query_terms:
        results = intersect(results, index[term])
    return rank(results)                  // TF-IDF or BM25 scoring
```
Each term lookup is O(1) via hash map. Intersection of two sorted posting lists is O(a + b) using a merge-style walk, and skip pointers can make it faster still by jumping past runs of non-matching IDs. For rare terms, the posting lists are short, making intersection nearly instant even over billions of documents.
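The pseudocode above translates into a small runnable Python sketch. This is a toy in-memory index: real engines compress posting lists, stream them from disk, and use a far richer analyzer than lowercase-and-split.

```python
from collections import defaultdict

def tokenize(text):
    # Simplified analyzer: lowercase and split on whitespace.
    # A real analyzer also strips punctuation, removes stopwords, and stems.
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to a posting list of (doc_id, {term_freq, positions})."""
    index = defaultdict(dict)  # term -> {doc_id: posting}
    for doc_id, text in documents.items():
        for pos, term in enumerate(tokenize(text)):
            posting = index[term].setdefault(doc_id, {"term_freq": 0, "positions": []})
            posting["term_freq"] += 1
            posting["positions"].append(pos)
    # Sorted posting lists (by doc_id) enable merge-style intersection.
    return {term: sorted(postings.items()) for term, postings in index.items()}

def search(index, query):
    """Return doc IDs containing every query term (unranked boolean AND)."""
    terms = tokenize(query)
    if not terms or any(t not in index for t in terms):
        return set()
    result = {doc_id for doc_id, _ in index[terms[0]]}
    for term in terms[1:]:
        result &= {doc_id for doc_id, _ in index[term]}
    return result

docs = {
    1: "distributed systems are distributed",
    2: "relational databases",
    3: "distributed databases scale",
}
index = build_inverted_index(docs)
print(search(index, "distributed databases"))  # {3}
```

A production search function would intersect the sorted posting lists directly rather than materializing sets, and would feed the surviving documents to a scoring function.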
Posting lists and skip pointers
The list of document IDs per term is called a posting list. In a production system, posting lists store far more than just document IDs:
```
Term: "distributed"
Posting list:
[
  {doc_id: 1, term_freq: 2, positions: [0, 7],      field: "body"},
  {doc_id: 2, term_freq: 1, positions: [0],         field: "title"},
  {doc_id: 5, term_freq: 3, positions: [3, 14, 22], field: "body"},
  ...
]
```
- Term frequency: how many times the term appears in the document (used for TF-IDF scoring)
- Positions: word offsets for phrase queries ("distributed systems" as an exact phrase, not just both words anywhere)
- Field: which document field the term appeared in (title matches are weighted higher than body matches)
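The positions entry is what makes phrase queries possible. A minimal sketch (the `phrase_match` helper and its input shape are illustrative, not a real library API): each term's postings are modeled as a dict from doc ID to sorted word positions, and a phrase hit requires consecutive positions in the same document.

```python
def phrase_match(postings_a, postings_b):
    """Find doc IDs where term B appears immediately after term A.

    Each argument is a dict: doc_id -> sorted list of word positions.
    """
    matches = set()
    for doc_id in postings_a.keys() & postings_b.keys():
        positions_b = set(postings_b[doc_id])
        # A phrase hit requires position p for term A and p + 1 for term B.
        if any(p + 1 in positions_b for p in postings_a[doc_id]):
            matches.add(doc_id)
    return matches

# "distributed systems" as an exact phrase:
distributed = {1: [0, 7], 5: [3, 14, 22]}
systems     = {1: [1, 9], 5: [2, 30]}
print(phrase_match(distributed, systems))  # {1}: positions 0 and 1 are consecutive
```

Longer phrases generalize the same check: term k of the phrase must appear at position p + k.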
Posting lists are stored sorted by document ID. This sorted order enables efficient intersection: to find documents matching both "distributed" AND "databases," walk both sorted lists simultaneously, advancing whichever pointer has the smaller doc ID.
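The merge-style walk over two sorted posting lists can be sketched in a few lines of Python:

```python
def intersect(list_a, list_b):
    """Intersect two posting lists sorted by doc_id in O(len_a + len_b)."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the pointer holding the smaller doc ID
        else:
            j += 1
    return result

print(intersect([1, 2, 5, 9, 14], [2, 5, 8, 14, 20]))  # [2, 5, 14]
```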
Skip pointers
For long posting lists (millions of entries), even linear intersection is slow. Skip pointers add express lanes through the list. Every K entries, a skip pointer records the doc ID and byte offset K positions ahead. When intersecting, if the current doc IDs are far apart, you can skip forward instead of walking one entry at a time.
The optimal skip interval is approximately sqrt(posting_list_length): for a posting list of 1 million entries, skip every ~1,000 entries. When the two lists share few documents, this reduces intersection work from O(N) toward O(sqrt(N)) in practice.
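A simplified sketch of skip-pointer intersection, assuming posting lists are plain sorted lists of doc IDs. Real implementations embed multi-level skip data inside the compressed on-disk posting format; here the skips are a separate side list of (index, doc_id) pairs.

```python
import math

def build_skips(postings, interval=None):
    """Record (index, doc_id) skip pointers every `interval` entries."""
    if interval is None:
        interval = max(1, int(math.sqrt(len(postings))))  # ~sqrt(N) spacing
    return [(i, postings[i]) for i in range(0, len(postings), interval)]

def advance(postings, skips, i, target):
    """Move i forward to the first entry >= target, taking skips when safe."""
    for skip_index, skip_doc in skips:
        if skip_index > i and skip_doc <= target:
            i = skip_index          # express lane: jump a whole block
    while i < len(postings) and postings[i] < target:
        i += 1                      # finish with a short linear walk
    return i

def intersect_with_skips(a, b):
    """Intersect two sorted doc-ID lists using skip pointers."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return result
```

The `advance` step is where the savings come from: when the current doc IDs are far apart, whole blocks of the longer list are skipped without being touched.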
Interview tip: posting list intersection
When discussing search at scale, mention skip pointers. They explain how Elasticsearch intersects posting lists with hundreds of millions of entries in milliseconds. Without skip pointers, intersecting two posting lists of 100M entries each requires on the order of 100M comparisons. With skip pointers at sqrt(N) intervals, the walk jumps past irrelevant ranges; when the lists share few documents, the work drops toward sqrt(N), roughly 10K comparisons.
TF-IDF and BM25 scoring
Returning matching documents is not enough. You need to rank them by relevance. TF-IDF (Term Frequency-Inverse Document Frequency) is the classical scoring formula:
```
TF-IDF(term, doc) = TF(term, doc) × IDF(term)

TF(term, doc) = frequency of term in doc
  → a doc mentioning "database" 10 times is more relevant than one mentioning it once

IDF(term) = log(N / df(term))
  where N = total documents, df(term) = documents containing term
  → rare terms ("NoSQL") score higher than common terms ("the")
```
For a multi-term query, sum the TF-IDF scores across all query terms:
```
score(doc, query) = Σ TF-IDF(term, doc) for each term in query
```
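The formulas above as a short Python sketch, using raw term counts and the unsmoothed IDF from the text (production engines use smoothed variants to avoid division by zero and log-of-one edge cases):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Score one term against one document. corpus = list of token lists."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf

def score(query_tokens, doc_tokens, corpus):
    """Multi-term query score: sum of per-term TF-IDF."""
    return sum(tf_idf(t, doc_tokens, corpus) for t in query_tokens)

corpus = [
    ["nosql", "database", "database"],
    ["the", "relational", "database"],
    ["the", "cache"],
]
# "nosql" appears in 1 of 3 docs (high IDF); "the" in 2 of 3 (low IDF).
print(score(["nosql", "database"], corpus[0], corpus))
```

Note how a term appearing in every document gets IDF = log(1) = 0 and contributes nothing, which is exactly why stopwords are often dropped.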
TF-IDF has a known weakness: it does not normalize for document length. A 10,000-word article that mentions "database" 10 times should not score higher than a 100-word summary that mentions it 3 times. BM25 (Best Match 25) fixes this with length normalization:
```
BM25(term, doc) = IDF(term) × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × (doc_len / avg_doc_len)))

where:
  k1 = 1.2   (term frequency saturation parameter)
  b  = 0.75  (length normalization parameter)
```
BM25 saturates the effect of term frequency (mentioning a word 100 times is not 10x better than 10 times) and penalizes long documents. Elasticsearch and Lucene switched from TF-IDF to BM25 as the default scoring function in Lucene 6.0 (2016). Modern neural re-rankers can further improve results, but BM25 remains the baseline.
Tokenization and analyzers
The quality of search results depends entirely on how text is broken into terms before indexing. This process, called analysis, has three stages: character filtering, tokenization, and token filtering.
Tokenization splits text into individual tokens. The standard tokenizer splits on whitespace and punctuation. Language-specific tokenizers handle edge cases: CJK languages (Chinese, Japanese, Korean) need character-based or dictionary-based tokenization because words are not separated by spaces.
Stopword removal drops extremely common words ("the," "is," "and") that appear in nearly every document and have near-zero IDF. Removing them shrinks the index and speeds up queries.
Stemming reduces words to their root form: "running," "runs," and "ran" all become "run." The Porter stemmer is the most common for English. Stemming increases recall (more matches) at the cost of precision (sometimes wrong matches, like "universe" and "university" both stemming to "univers").
Lemmatization is a more sophisticated alternative that uses vocabulary and grammar rules: "better" becomes "good," "mice" becomes "mouse." It is more accurate but slower than stemming.
Analyzer mismatches are a common source of "search returns nothing" bugs. If the index analyzer stems words but the query analyzer does not (or vice versa), the terms will not match. Always use the same analyzer chain for index-time and query-time analysis, unless you intentionally want asymmetric behavior (e.g., index with synonyms, query without).
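A toy analyzer chain showing the pipeline shape: tokenize, remove stopwords, stem. The suffix-stripping "stemmer" here is a crude stand-in for a real Porter stemmer, and the stopword list is deliberately tiny.

```python
import re

STOPWORDS = {"the", "is", "and", "a", "of"}

def stem(token):
    # Naive suffix stripping -- illustration only, not a real Porter stemmer.
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stem(t) for t in tokens]                     # stemming

# The same chain must run at index time and at query time, otherwise
# "running" in the index never matches "runs" in the query.
print(analyze("The runner is running!"))  # ['runner', 'run']
```

The output also shows the precision cost the text mentions: the crude stemmer maps "running" to "run" but leaves "runner" intact, so those two would not match each other.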
Lucene's segment architecture
Lucene (underlying Elasticsearch, Solr, and many others) builds the inverted index as immutable segments. This design solves the problem of updating a sorted, compressed data structure.
New documents go into an in-memory buffer. When the buffer is full (or on a manual flush), Lucene writes it as a new immutable segment on disk. Each segment is a self-contained inverted index with its own term dictionary, posting lists, and stored fields.
Why immutability? Modifying a sorted, compressed posting list in place is expensive. You would need to shift entries, rewrite compression blocks, and update skip pointers. Immutable segments avoid this entirely: writes go to new segments, and background merging combines small segments into large ones.
Deletes are handled with a separate deletion bitset per segment. When a document is deleted, its bit is set to 1 in the bitset. Searches skip deleted documents. The next merge physically removes them by not including those documents in the merged segment.
Segment merging is the key maintenance operation. Too many small segments slow down queries (each query must search every segment). Lucene uses a tiered merge policy: segments are grouped by size tier, and when enough segments of similar size exist, they are merged into one larger segment. This is the same immutability + compaction pattern as LSM trees, optimized for search workloads.
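The merge-plus-deletion-bitset idea can be sketched as follows. A segment is modeled here as a plain dict from term to a sorted list of doc IDs, and the per-segment deletion bitset as a set of deleted IDs; real Lucene segments also carry stored fields, doc values, and remap doc IDs during the merge.

```python
def merge_segments(seg_a, seg_b, deleted_a, deleted_b):
    """Merge two segments' inverted indexes, dropping deleted documents.

    seg_a / seg_b: {term: sorted list of doc_ids}
    deleted_a / deleted_b: sets of doc IDs (stand-in for deletion bitsets)
    """
    merged = {}
    for term in seg_a.keys() | seg_b.keys():
        live_a = [d for d in seg_a.get(term, []) if d not in deleted_a]
        live_b = [d for d in seg_b.get(term, []) if d not in deleted_b]
        merged[term] = sorted(live_a + live_b)
    return merged

seg_a = {"distributed": [1, 5], "systems": [1]}
seg_b = {"distributed": [8], "databases": [8, 9]}
# Doc 5 was deleted in segment A; the merge physically drops it.
print(merge_segments(seg_a, seg_b, deleted_a={5}, deleted_b=set()))
```

This is why deletes are cheap at write time (flip a bit) but only reclaim space at merge time.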
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Elasticsearch / OpenSearch | Full-text search, log analytics, observability | Distributed inverted index across shards. BM25 default scoring. Near-real-time search (1-second refresh interval) |
| Apache Lucene | Core library underlying Elasticsearch, Solr, and many others | Segment-based architecture with tiered merge policy. Supports stored fields, doc values, and term vectors |
| PostgreSQL (GIN index) | Full-text search via tsvector / tsquery | GIN (Generalized Inverted Index) for text search, array containment, and JSONB queries. Simpler than Lucene but integrated into the relational engine |
| Apache Solr | Enterprise search, e-commerce product search | Same Lucene core as Elasticsearch. Adds faceted search, spatial search, and rich document parsing |
| Sphinx / Manticore | Real-time full-text search for MySQL/PostgreSQL | Optimized for high-throughput indexing with low latency. Uses a RAM-based segment for real-time inserts |
Limitations and when NOT to use it
- Inverted indexes are optimized for text search on tokenized terms. They are not a replacement for B-tree indexes on structured columns. `WHERE user_id = 42` should use a B-tree, not an inverted index.
- Index build time scales linearly with corpus size. Re-indexing 1 billion documents after an analyzer change takes hours or days. Plan analyzer changes carefully.
- Posting list intersection degrades for very common terms. A query for "the system" where "the" appears in 99% of documents requires intersecting a massive posting list. Stopword removal mitigates this, but some queries inherently involve common terms.
- Inverted indexes assume a fixed vocabulary at analysis time. New synonym rules, stemmer updates, or analyzer changes require a full re-index to take effect on existing documents.
- Memory usage grows with term cardinality. High-cardinality fields (UUIDs, URLs, log messages with timestamps) produce enormous term dictionaries that consume significant heap memory in Lucene.
- Relevance scoring (TF-IDF, BM25) works well for keyword-style queries but struggles with semantic queries ("how to fix a leaky faucet" vs. "plumbing repair"). Vector search or hybrid approaches are needed for semantic similarity.
Interview cheat sheet
- When asked about full-text search, immediately describe the inverted index: term-to-document mapping, O(1) lookup per term, and posting list intersection for multi-term queries.
- When discussing search relevance, explain BM25 as the default (not TF-IDF). BM25 adds term frequency saturation and document length normalization over raw TF-IDF.
- When asked about search at scale, mention Lucene's segment architecture. Immutable segments enable concurrent reads during writes and make compression efficient. Background merging controls segment proliferation.
- When discussing Elasticsearch internals, note that each shard is a Lucene index. Documents are routed to shards by `_id` hash. Search queries fan out to all shards and results are merged.
- When asked about phrase queries ("exact phrase match"), explain that posting lists store positions. A phrase match requires that terms appear at consecutive positions in the same document.
- When discussing search performance, mention skip pointers for long posting lists. They reduce intersection complexity from O(N) to approximately O(sqrt(N)).
- When asked about analyzers, distinguish between index-time and query-time analysis. Mismatched analyzers cause silent search failures where matching documents are not returned.
- When discussing the tradeoff between search and storage, note that inverted indexes trade write performance and storage space for read performance. Each indexed field adds to the index size and slows down indexing throughput.
Quick recap
- An inverted index maps each term to the sorted list of documents containing it. A search for N terms becomes N hash lookups and a posting list intersection, O(1) per term regardless of corpus size.
- Posting lists store document IDs, term frequencies, and positions. Positions enable phrase queries. Term frequencies drive relevance scoring via TF-IDF or BM25.
- BM25 (the production standard) improves on TF-IDF by saturating term frequency impact and normalizing for document length, preventing long documents from dominating results.
- Tokenization, stopword removal, and stemming (the "analyzer chain") determine what terms enter the index. Analyzer mismatches between index-time and query-time cause silent search failures.
- Lucene writes new documents to immutable segments and merges them in the background. Too many small segments degrade search latency. Force merging consolidates segments during off-peak hours.
- Skip pointers on long posting lists reduce intersection complexity from O(N) to approximately O(sqrt(N)), enabling millisecond-level search over billions of documents.
Related concepts
- B-tree indexes - The data structure for exact-match and range queries on structured columns, complementary to inverted indexes for full-text search.
- Databases - PostgreSQL's GIN index provides inverted-index-style full-text search within the relational engine.
- Bloom filters - Probabilistic membership testing that Lucene uses to skip segments that definitely do not contain a query term.
- LSM trees - Lucene's segment architecture follows the same immutable-write plus background-compaction pattern as LSM tree storage engines.