Inverted index
How search engines use inverted indexes to find documents by term in O(1). The data structure behind Elasticsearch, Lucene, and PostgreSQL full-text search, covering posting lists, TF-IDF scoring, and segment merging.
The problem
Your e-commerce platform has 50 million product listings. A user types "wireless noise cancelling headphones" into the search bar. You need to return the most relevant products in under 100ms.
B-tree indexes handle exact matches and ranges well. WHERE price > 100 is a range scan. WHERE brand = 'Sony' is a point lookup. But WHERE description CONTAINS 'wireless noise cancelling headphones' is a completely different problem. The search terms could appear anywhere in the text, in any order, and the user expects ranked results (not just matching documents).
WHERE price > 100 -- B-tree: range scan, fast
WHERE user_id = 42 -- B-tree: point lookup, fast
WHERE title = 'exact string' -- B-tree: exact match, fast
WHERE title LIKE '%search%' -- B-tree: CANNOT use index, full table scan
WHERE body CONTAINS 'distributed systems' -- B-tree: impossible
A naive approach scans every document, checking if the search terms appear in the text. At 50 million documents, this is 50 million string comparisons per query. Even at 1 microsecond per check, that is 50 seconds. Users expect 100ms.
This is the problem the inverted index solves: given a set of search terms, find all matching documents in O(1) per term, regardless of corpus size, then rank them by relevance.
What it is
An inverted index is a data structure that maps each unique term (word) to the list of documents containing that term. It "inverts" the natural relationship: instead of document pointing to its words, each word points back to its documents.
Analogy: Think of the index at the back of a textbook. If you want to find every page that mentions "distributed systems," you do not read the entire book. You look up "distributed systems" in the alphabetical index, which lists page numbers 42, 87, and 234. The inverted index does the same thing for a corpus of millions of documents, each with thousands of words.
The core insight is that building the index is expensive (you must read every document once), but querying it is cheap: a single hash lookup per search term, regardless of whether the corpus has 1,000 or 1 billion documents.
How it works
The inverted index has two phases: indexing (build time) and querying (search time). At build time, every document is tokenized into terms, and each term is mapped to its list of document IDs. At query time, terms are looked up and the document lists are intersected or unioned.
The build process in pseudocode:
function build_inverted_index(documents):
index = HashMap<String, List<Posting>>()
for each document in documents:
terms = tokenize(document.text) // split, lowercase, stem
for each term in terms:
posting = {
doc_id: document.id,
term_freq: count(term, document.text),
positions: find_positions(term, document.text)
}
index[term].append(posting)
// Sort each posting list by doc_id for efficient intersection
for each term in index:
sort(index[term], by: doc_id)
return index
function search(index, query_terms):
results = index[query_terms[0]] // first term's posting list
for each remaining term in query_terms:
results = intersect(results, index[term])
return rank(results) // TF-IDF or BM25 scoring
Each term lookup is O(1) via hash map. Intersection of two sorted posting lists is O(min(a, b)) using a merge-style walk. For rare terms, the posting lists are short, making intersection nearly instant even over billions of documents.
Posting lists and skip pointers
The list of document IDs per term is called a posting list. In a production system, posting lists store far more than just document IDs:
Term: "distributed"
Posting list:
[
{doc_id: 1, term_freq: 2, positions: [0, 7], field: "body"},
{doc_id: 2, term_freq: 1, positions: [0], field: "title"},
{doc_id: 5, term_freq: 3, positions: [3, 14, 22], field: "body"},
...
]
- Term frequency: how many times the term appears in the document (used for TF-IDF scoring)
- Positions: word offsets for phrase queries ("distributed systems" as an exact phrase, not just both words anywhere)
- Field: which document field the term appeared in (title matches are weighted higher than body matches)
Posting lists are stored sorted by document ID. This sorted order enables efficient intersection: to find documents matching both "distributed" AND "databases," walk both sorted lists simultaneously, advancing whichever pointer has the smaller doc ID.
Skip pointers
For long posting lists (millions of entries), even linear intersection is slow. Skip pointers add express lanes through the list. Every K entries, a skip pointer records the doc ID and byte offset K positions ahead. When intersecting, if the current doc IDs are far apart, you can skip forward instead of walking one entry at a time.
The optimal skip interval is approximately sqrt(posting_list_length). For a posting list of 1 million entries, skip every ~1,000 entries. This reduces intersection from O(N) to O(sqrt(N)) in practice.
Interview tip: posting list intersection
When discussing search at scale, mention skip pointers. They explain how Elasticsearch intersects posting lists with hundreds of millions of entries in milliseconds. Without skip pointers, intersecting two posting lists of 100M entries each requires 100M comparisons. With skip pointers at sqrt(N) intervals, you skip past irrelevant ranges and reduce work to ~10K comparisons.
TF-IDF and BM25 scoring
Returning matching documents is not enough. You need to rank them by relevance. TF-IDF (Term Frequency-Inverse Document Frequency) is the classical scoring formula:
TF-IDF(term, document) = TF(term, doc) Γ IDF(term)
TF(term, doc) = frequency of term in document
β a doc mentioning "database" 10 times is more relevant than one mentioning it once
IDF(term) = log(N / df(term))
where N = total documents, df = documents containing term
β rare terms ("NoSQL") score higher than common terms ("the")
For a multi-term query, sum the TF-IDF scores across all query terms:
score(doc, query) = Ξ£ TF-IDF(term, doc) for each term in query
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.