Inverted index
How search engines use inverted indexes to find documents by term in O(1) per lookup. The data structure behind Elasticsearch, Lucene, and PostgreSQL full-text search, covering posting lists, TF-IDF scoring, and segment merging.
The problem
Your e-commerce platform has 50 million product listings. A user types "wireless noise cancelling headphones" into the search bar. You need to return the most relevant products in under 100ms.
B-tree indexes handle exact matches and ranges well. WHERE price > 100 is a range scan. WHERE brand = 'Sony' is a point lookup. But WHERE description CONTAINS 'wireless noise cancelling headphones' is a completely different problem. The search terms could appear anywhere in the text, in any order, and the user expects ranked results (not just matching documents).
```sql
WHERE price > 100                          -- B-tree: range scan, fast
WHERE user_id = 42                         -- B-tree: point lookup, fast
WHERE title = 'exact string'               -- B-tree: exact match, fast
WHERE title LIKE '%search%'                -- B-tree: CANNOT use index, full table scan
WHERE body CONTAINS 'distributed systems'  -- B-tree: impossible
```
A naive approach scans every document, checking if the search terms appear in the text. At 50 million documents, this is 50 million string comparisons per query. Even at 1 microsecond per check, that is 50 seconds. Users expect 100ms.
This is the problem the inverted index solves: given a set of search terms, find all matching documents in O(1) per term, regardless of corpus size, then rank them by relevance.
What it is
An inverted index is a data structure that maps each unique term (word) to the list of documents containing that term. It "inverts" the natural relationship: instead of each document pointing to its words, each word points back to its documents.
Analogy: Think of the index at the back of a textbook. If you want to find every page that mentions "distributed systems," you do not read the entire book. You look up "distributed systems" in the alphabetical index, which lists page numbers 42, 87, and 234. The inverted index does the same thing for a corpus of millions of documents, each with thousands of words.
The core insight is that building the index is expensive (you must read every document once), but querying it is cheap: a single hash lookup per search term, regardless of whether the corpus has 1,000 or 1 billion documents.
How it works
The inverted index has two phases: indexing (build time) and querying (search time). At build time, every document is tokenized into terms, and each term is mapped to its list of document IDs. At query time, terms are looked up and the document lists are intersected or unioned.
The build process in pseudocode:
```
function build_inverted_index(documents):
    index = HashMap<String, List<Posting>>()
    for each document in documents:
        terms = tokenize(document.text)   // split, lowercase, stem
        for each term in terms:
            posting = {
                doc_id:    document.id,
                term_freq: count(term, document.text),
                positions: find_positions(term, document.text)
            }
            index[term].append(posting)
    // Sort each posting list by doc_id for efficient intersection
    for each term in index:
        sort(index[term], by: doc_id)
    return index

function search(index, query_terms):
    results = index[query_terms[0]]       // first term's posting list
    for each remaining term in query_terms:
        results = intersect(results, index[term])
    return rank(results)                  // TF-IDF or BM25 scoring
```
Each term lookup is O(1) via hash map. Intersection of two sorted posting lists is O(a + b) using a merge-style walk, and skip pointers can make it faster still by jumping past runs of non-matching IDs. For rare terms, the posting lists are short, making intersection nearly instant even over billions of documents.
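The pseudocode above translates into a small runnable Python sketch. This is a toy in-memory index: real engines compress posting lists, stream them from disk, and use a far richer analyzer than lowercase-and-split.

```python
from collections import defaultdict

def tokenize(text):
    # Simplified analyzer: lowercase and split on whitespace.
    # A real analyzer also strips punctuation, removes stopwords, and stems.
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to a posting list of (doc_id, {term_freq, positions})."""
    index = defaultdict(dict)  # term -> {doc_id: posting}
    for doc_id, text in documents.items():
        for pos, term in enumerate(tokenize(text)):
            posting = index[term].setdefault(doc_id, {"term_freq": 0, "positions": []})
            posting["term_freq"] += 1
            posting["positions"].append(pos)
    # Sorted posting lists (by doc_id) enable merge-style intersection.
    return {term: sorted(postings.items()) for term, postings in index.items()}

def search(index, query):
    """Return doc IDs containing every query term (unranked boolean AND)."""
    terms = tokenize(query)
    if not terms or any(t not in index for t in terms):
        return set()
    result = {doc_id for doc_id, _ in index[terms[0]]}
    for term in terms[1:]:
        result &= {doc_id for doc_id, _ in index[term]}
    return result

docs = {
    1: "distributed systems are distributed",
    2: "relational databases",
    3: "distributed databases scale",
}
index = build_inverted_index(docs)
print(search(index, "distributed databases"))  # {3}
```

A production search function would intersect the sorted posting lists directly rather than materializing sets, and would feed the surviving documents to a scoring function.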
Posting lists and skip pointers
The list of document IDs per term is called a posting list. In a production system, posting lists store far more than just document IDs:
```
Term: "distributed"
Posting list:
[
  {doc_id: 1, term_freq: 2, positions: [0, 7],      field: "body"},
  {doc_id: 2, term_freq: 1, positions: [0],         field: "title"},
  {doc_id: 5, term_freq: 3, positions: [3, 14, 22], field: "body"},
  ...
]
```
- Term frequency: how many times the term appears in the document (used for TF-IDF scoring)
- Positions: word offsets for phrase queries ("distributed systems" as an exact phrase, not just both words anywhere)
- Field: which document field the term appeared in (title matches are weighted higher than body matches)
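The positions entry is what makes phrase queries possible. A minimal sketch (the `phrase_match` helper and its input shape are illustrative, not a real library API): each term's postings are modeled as a dict from doc ID to sorted word positions, and a phrase hit requires consecutive positions in the same document.

```python
def phrase_match(postings_a, postings_b):
    """Find doc IDs where term B appears immediately after term A.

    Each argument is a dict: doc_id -> sorted list of word positions.
    """
    matches = set()
    for doc_id in postings_a.keys() & postings_b.keys():
        positions_b = set(postings_b[doc_id])
        # A phrase hit requires position p for term A and p + 1 for term B.
        if any(p + 1 in positions_b for p in postings_a[doc_id]):
            matches.add(doc_id)
    return matches

# "distributed systems" as an exact phrase:
distributed = {1: [0, 7], 5: [3, 14, 22]}
systems     = {1: [1, 9], 5: [2, 30]}
print(phrase_match(distributed, systems))  # {1}: positions 0 and 1 are consecutive
```

Longer phrases generalize the same check: term k of the phrase must appear at position p + k.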
Posting lists are stored sorted by document ID. This sorted order enables efficient intersection: to find documents matching both "distributed" AND "databases," walk both sorted lists simultaneously, advancing whichever pointer has the smaller doc ID.
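The merge-style walk over two sorted posting lists can be sketched in a few lines of Python:

```python
def intersect(list_a, list_b):
    """Intersect two posting lists sorted by doc_id in O(len_a + len_b)."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the pointer holding the smaller doc ID
        else:
            j += 1
    return result

print(intersect([1, 2, 5, 9, 14], [2, 5, 8, 14, 20]))  # [2, 5, 14]
```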
Skip pointers
For long posting lists (millions of entries), even linear intersection is slow. Skip pointers add express lanes through the list. Every K entries, a skip pointer records the doc ID and byte offset K positions ahead. When intersecting, if the current doc IDs are far apart, you can skip forward instead of walking one entry at a time.
The optimal skip interval is approximately sqrt(posting_list_length): for a posting list of 1 million entries, skip every ~1,000 entries. When the two lists share few documents, this reduces intersection work from O(N) toward O(sqrt(N)) in practice.
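A simplified sketch of skip-pointer intersection, assuming posting lists are plain sorted lists of doc IDs. Real implementations embed multi-level skip data inside the compressed on-disk posting format; here the skips are a separate side list of (index, doc_id) pairs.

```python
import math

def build_skips(postings, interval=None):
    """Record (index, doc_id) skip pointers every `interval` entries."""
    if interval is None:
        interval = max(1, int(math.sqrt(len(postings))))  # ~sqrt(N) spacing
    return [(i, postings[i]) for i in range(0, len(postings), interval)]

def advance(postings, skips, i, target):
    """Move i forward to the first entry >= target, taking skips when safe."""
    for skip_index, skip_doc in skips:
        if skip_index > i and skip_doc <= target:
            i = skip_index          # express lane: jump a whole block
    while i < len(postings) and postings[i] < target:
        i += 1                      # finish with a short linear walk
    return i

def intersect_with_skips(a, b):
    """Intersect two sorted doc-ID lists using skip pointers."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return result
```

The `advance` step is where the savings come from: when the current doc IDs are far apart, whole blocks of the longer list are skipped without being touched.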
Interview tip: posting list intersection
When discussing search at scale, mention skip pointers. They explain how Elasticsearch intersects posting lists with hundreds of millions of entries in milliseconds. Without skip pointers, intersecting two posting lists of 100M entries each requires on the order of 100M comparisons. With skip pointers at sqrt(N) intervals, the walk jumps past irrelevant ranges; when the lists share few documents, the work drops toward sqrt(N), roughly 10K comparisons.
TF-IDF and BM25 scoring
Returning matching documents is not enough. You need to rank them by relevance. TF-IDF (Term Frequency-Inverse Document Frequency) is the classical scoring formula:
```
TF-IDF(term, doc) = TF(term, doc) × IDF(term)

TF(term, doc) = frequency of term in doc
  → a doc mentioning "database" 10 times is more relevant than one mentioning it once

IDF(term) = log(N / df(term))
  where N = total documents, df(term) = documents containing term
  → rare terms ("NoSQL") score higher than common terms ("the")
```
For a multi-term query, sum the TF-IDF scores across all query terms:
```
score(doc, query) = Σ TF-IDF(term, doc) for each term in query
```
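The formulas above as a short Python sketch, using raw term counts and the unsmoothed IDF from the text (production engines use smoothed variants to avoid division by zero and log-of-one edge cases):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Score one term against one document. corpus = list of token lists."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf

def score(query_tokens, doc_tokens, corpus):
    """Multi-term query score: sum of per-term TF-IDF."""
    return sum(tf_idf(t, doc_tokens, corpus) for t in query_tokens)

corpus = [
    ["nosql", "database", "database"],
    ["the", "relational", "database"],
    ["the", "cache"],
]
# "nosql" appears in 1 of 3 docs (high IDF); "the" in 2 of 3 (low IDF).
print(score(["nosql", "database"], corpus[0], corpus))
```

Note how a term appearing in every document gets IDF = log(1) = 0 and contributes nothing, which is exactly why stopwords are often dropped.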
TF-IDF has a known weakness: it does not normalize for document length. A 10,000-word article that mentions "database" 10 times should not score higher than a 100-word summary that mentions it 3 times. BM25 (Best Match 25) fixes this with length normalization:
```
BM25(term, doc) = IDF(term) × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × (doc_len / avg_doc_len)))

where:
  k1 = 1.2   (term frequency saturation parameter)
  b  = 0.75  (length normalization parameter)
```
BM25 saturates the effect of term frequency (mentioning a word 100 times is not 10x better than 10 times) and penalizes long documents. Elasticsearch and Lucene switched from TF-IDF to BM25 as the default scoring function in Lucene 6.0 (2016). Modern neural re-rankers can further improve results, but BM25 remains the baseline.
Tokenization and analyzers
The quality of search results depends entirely on how text is broken into terms before indexing. This process, called analysis, has three stages: character filtering, tokenization, and token filtering.
Tokenization splits text into individual tokens. The standard tokenizer splits on whitespace and punctuation. Language-specific tokenizers handle edge cases: CJK languages (Chinese, Japanese, Korean) need character-based or dictionary-based tokenization because words are not separated by spaces.
Stopword removal drops extremely common words ("the," "is," "and") that appear in nearly every document and have near-zero IDF. Removing them shrinks the index and speeds up queries.
Stemming reduces words to their root form: "running," "runs," and "ran" all become "run." The Porter stemmer is the most common for English. Stemming increases recall (more matches) at the cost of precision (sometimes wrong matches, like "universe" and "university" both stemming to "univers").
Lemmatization is a more sophisticated alternative that uses vocabulary and grammar rules: "better" becomes "good," "mice" becomes "mouse." It is more accurate but slower than stemming.
Analyzer mismatches are a common source of "search returns nothing" bugs. If the index analyzer stems words but the query analyzer does not (or vice versa), the terms will not match. Always use the same analyzer chain for index-time and query-time analysis, unless you intentionally want asymmetric behavior (e.g., index with synonyms, query without).
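A toy analyzer chain showing the pipeline shape: tokenize, remove stopwords, stem. The suffix-stripping "stemmer" here is a crude stand-in for a real Porter stemmer, and the stopword list is deliberately tiny.

```python
import re

STOPWORDS = {"the", "is", "and", "a", "of"}

def stem(token):
    # Naive suffix stripping -- illustration only, not a real Porter stemmer.
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stem(t) for t in tokens]                     # stemming

# The same chain must run at index time and at query time, otherwise
# "running" in the index never matches "runs" in the query.
print(analyze("The runner is running!"))  # ['runner', 'run']
```

The output also shows the precision cost the text mentions: the crude stemmer maps "running" to "run" but leaves "runner" intact, so those two would not match each other.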
Lucene's segment architecture
Lucene (underlying Elasticsearch, Solr, and many others) builds the inverted index as immutable segments. This design solves the problem of updating a sorted, compressed data structure.
New documents go into an in-memory buffer. When the buffer is full (or on a manual flush), Lucene writes it as a new immutable segment on disk. Each segment is a self-contained inverted index with its own term dictionary, posting lists, and stored fields.
Why immutability? Modifying a sorted, compressed posting list in place is expensive. You would need to shift entries, rewrite compression blocks, and update skip pointers. Immutable segments avoid this entirely: writes go to new segments, and background merging combines small segments into large ones.
Deletes are handled with a separate deletion bitset per segment. When a document is deleted, its bit is set to 1 in the bitset. Searches skip deleted documents. The next merge physically removes them by not including those documents in the merged segment.
Segment merging is the key maintenance operation. Too many small segments slow down queries (each query must search every segment). Lucene uses a tiered merge policy: segments are grouped by size tier, and when enough segments of similar size exist, they are merged into one larger segment. This is the same immutability + compaction pattern as LSM trees, optimized for search workloads.
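The merge-plus-deletion-bitset idea can be sketched as follows. A segment is modeled here as a plain dict from term to a sorted list of doc IDs, and the per-segment deletion bitset as a set of deleted IDs; real Lucene segments also carry stored fields, doc values, and remap doc IDs during the merge.

```python
def merge_segments(seg_a, seg_b, deleted_a, deleted_b):
    """Merge two segments' inverted indexes, dropping deleted documents.

    seg_a / seg_b: {term: sorted list of doc_ids}
    deleted_a / deleted_b: sets of doc IDs (stand-in for deletion bitsets)
    """
    merged = {}
    for term in seg_a.keys() | seg_b.keys():
        live_a = [d for d in seg_a.get(term, []) if d not in deleted_a]
        live_b = [d for d in seg_b.get(term, []) if d not in deleted_b]
        merged[term] = sorted(live_a + live_b)
    return merged

seg_a = {"distributed": [1, 5], "systems": [1]}
seg_b = {"distributed": [8], "databases": [8, 9]}
# Doc 5 was deleted in segment A; the merge physically drops it.
print(merge_segments(seg_a, seg_b, deleted_a={5}, deleted_b=set()))
```

This is why deletes are cheap at write time (flip a bit) but only reclaim space at merge time.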
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Elasticsearch / OpenSearch | Full-text search, log analytics, observability | Distributed inverted index across shards. BM25 default scoring. Near-real-time search (1-second refresh interval) |
| Apache Lucene | Core library underlying Elasticsearch, Solr, and many others | Segment-based architecture with tiered merge policy. Supports stored fields, doc values, and term vectors |
| PostgreSQL (GIN index) | Full-text search via tsvector / tsquery | GIN (Generalized Inverted Index) for text search, array containment, and JSONB queries. Simpler than Lucene but integrated into the relational engine |
| Apache Solr | Enterprise search, e-commerce product search | Same Lucene core as Elasticsearch. Adds faceted search, spatial search, and rich document parsing |
| Sphinx / Manticore | Real-time full-text search for MySQL/PostgreSQL | Optimized for high-throughput indexing with low latency. Uses a RAM-based segment for real-time inserts |
Limitations and when NOT to use it
- Inverted indexes are optimized for text search on tokenized terms. They are not a replacement for B-tree indexes on structured columns. `WHERE user_id = 42` should use a B-tree, not an inverted index.
- Index build time scales linearly with corpus size. Re-indexing 1 billion documents after an analyzer change takes hours or days. Plan analyzer changes carefully.
- Posting list intersection degrades for very common terms. A query for "the system" where "the" appears in 99% of documents requires intersecting a massive posting list. Stopword removal mitigates this, but some queries inherently involve common terms.
- Inverted indexes assume a fixed vocabulary at analysis time. New synonym rules, stemmer updates, or analyzer changes require a full re-index to take effect on existing documents.
- Memory usage grows with term cardinality. High-cardinality fields (UUIDs, URLs, log messages with timestamps) produce enormous term dictionaries that consume significant heap memory in Lucene.
- Relevance scoring (TF-IDF, BM25) works well for keyword-style queries but struggles with semantic queries ("how to fix a leaky faucet" vs. "plumbing repair"). Vector search or hybrid approaches are needed for semantic similarity.
Interview cheat sheet
- When asked about full-text search, immediately describe the inverted index: term-to-document mapping, O(1) lookup per term, and posting list intersection for multi-term queries.
- When discussing search relevance, explain BM25 as the default (not TF-IDF). BM25 adds term frequency saturation and document length normalization over raw TF-IDF.
- When asked about search at scale, mention Lucene's segment architecture. Immutable segments enable concurrent reads during writes and make compression efficient. Background merging controls segment proliferation.
- When discussing Elasticsearch internals, note that each shard is a Lucene index. Documents are routed to shards by `_id` hash. Search queries fan out to all shards and results are merged.
- When asked about phrase queries ("exact phrase match"), explain that posting lists store positions. A phrase match requires that terms appear at consecutive positions in the same document.
- When discussing search performance, mention skip pointers for long posting lists. They reduce intersection complexity from O(N) to approximately O(sqrt(N)).
- When asked about analyzers, distinguish between index-time and query-time analysis. Mismatched analyzers cause silent search failures where matching documents are not returned.
- When discussing the tradeoff between search and storage, note that inverted indexes trade write performance and storage space for read performance. Each indexed field adds to the index size and slows down indexing throughput.
Quick recap
- An inverted index maps each term to the sorted list of documents containing it. A search for N terms becomes N hash lookups and a posting list intersection, O(1) per term regardless of corpus size.
- Posting lists store document IDs, term frequencies, and positions. Positions enable phrase queries. Term frequencies drive relevance scoring via TF-IDF or BM25.
- BM25 (the production standard) improves on TF-IDF by saturating term frequency impact and normalizing for document length, preventing long documents from dominating results.
- Tokenization, stopword removal, and stemming (the "analyzer chain") determine what terms enter the index. Analyzer mismatches between index-time and query-time cause silent search failures.
- Lucene writes new documents to immutable segments and merges them in the background. Too many small segments degrade search latency. Force merging consolidates segments during off-peak hours.
- Skip pointers on long posting lists reduce intersection complexity from O(N) to approximately O(sqrt(N)), enabling millisecond-level search over billions of documents.
Related concepts
- B-tree indexes - The data structure for exact-match and range queries on structured columns, complementary to inverted indexes for full-text search.
- Databases - PostgreSQL's GIN index provides inverted-index-style full-text search within the relational engine.
- Bloom filters - Probabilistic membership testing that Lucene uses to skip segments that definitely do not contain a query term.
- LSM trees - Lucene's segment architecture follows the same immutable-write plus background-compaction pattern as LSM tree storage engines.