How Google Search indexing works

The Problem Statement

Interviewer: "Walk me through how Google Search works end-to-end. A user types 'best pizza near me' and gets results in 200ms. How does Google crawl billions of pages, build an index, rank results, and serve queries at that speed?"

This question tests four things: whether you understand the crawling pipeline (URL frontier, politeness, dedup), the inverted index data structure and how queries execute against it, PageRank and modern learning-to-rank models, and how a scatter-gather architecture serves queries from sharded indexes in under 200ms.

I love this question for senior+ interviews because there is no single "hard part." Every phase, crawling, indexing, ranking, serving, is a distributed systems challenge on its own. The interviewer wants to see whether you can hold the full pipeline in your head and go deep on any phase when asked.

The scale sets this apart from typical system design questions. Google's index covers over 100 billion web pages. The crawl pipeline processes billions of URLs per day. The serving infrastructure handles over 8.5 billion queries per day (roughly 100,000 queries per second). And the user expects results in under 200ms. Every design decision is driven by these numbers.

I have seen candidates approach this question by drawing a single "search engine" box and hand-waving about databases. That immediately signals shallow understanding. The interviewer wants you to decompose the system into its distinct phases and explain the data structures, algorithms, and distributed systems patterns that make each phase possible at web scale.

Clarifying the Scenario

You: "Great question. Before I walk through the full pipeline, let me clarify scope."

You: "Should I cover the full end-to-end flow from crawling to serving, or do you want me to focus on a specific phase like indexing or ranking?"

Interviewer: "Walk through the full pipeline, but I will ask you to go deep on specific parts."

You: "Got it. And should I cover Google-specific infrastructure like Bigtable and MapReduce, or keep it at the conceptual system design level?"

Interviewer: "Conceptual level is fine. I want to understand the architecture, not the implementation details of specific Google technologies."

You: "Perfect. I will structure my answer in four phases: the crawling pipeline that discovers and fetches web pages, the indexing pipeline that builds the inverted index, the ranking system that scores results using PageRank and machine learning, and the serving architecture that executes queries across sharded indexes in under 200ms."

My Approach

I break this into five parts:

The crawling pipeline: How a fleet of crawlers discovers and fetches billions of pages using a priority-based URL frontier, politeness constraints, and content deduplication.
Building the inverted index: How raw HTML is parsed into tokens, and how those tokens are organized into a term-to-document mapping with position lists, stored across thousands of shards.
PageRank and link graph analysis: How the web's hyperlink structure is used to compute page authority scores through iterative graph computation.
Query serving with scatter-gather: How a user query is broadcast to thousands of index shards in parallel, results are merged, ranked, and returned in under 200ms.
Real-time ranking with learning-to-rank: How BM25 text matching, PageRank authority, and neural models like BERT combine to produce the final ranked result list.

The mental model I use is a pipeline with three speeds: the slow path (crawling, days to weeks), the medium path (indexing and ranking precomputation, hours), and the fast path (query serving, milliseconds). Each path has fundamentally different system design constraints.

Here are the numbers that shape every design decision:

Metric	Scale
Indexed web pages	100+ billion
Daily queries	~8.5 billion (~100K QPS)
Crawl rate	Billions of URLs per day
Index size (in memory)	Petabytes across the cluster
Index shards	10,000+ per datacenter
Target query latency	< 200ms end-to-end
New/changed pages per day	Hundreds of millions
Link graph edges	100+ billion
Unique daily queries (never seen before)	~15% of total

Google does not publish its exact architecture, but the core techniques are well-documented in research papers (the original PageRank paper, the Google File System paper, the MapReduce paper, the Bigtable paper, and more recent BERT and learning-to-rank papers). The architecture described here is based on these publications and standard information retrieval principles.

The Architecture

Here is the end-to-end pipeline from crawling to serving. The key insight is that crawling and indexing happen offline (the "write path"), while query serving happens in real-time (the "read path"). These two paths are decoupled, which is what makes sub-200ms serving possible even though the index covers 100 billion pages.

Let me walk through each phase. The crawl pipeline runs continuously, discovering new and updated pages. The indexing pipeline runs in batches (and increasingly in near-real-time for fresh content). The ranking precomputation updates periodically. The serving path runs on every query in real-time.

The pipeline is organized so that the serving path does as little work as possible. All the heavy computation (crawling, parsing, index building, PageRank, feature extraction) happens offline. At query time, the system does four things: tokenize the query, look up posting lists, score candidates, and generate snippets. That is it.

I think of this as the "precompute everything" philosophy. Every signal that can be computed ahead of time (PageRank, spam scores, document quality, freshness timestamps) is materialized in the feature store before the first query arrives. The query-time code path is just joining precomputed signals with the real-time BM25 score and running them through the ranking model. This is why adding a new ranking signal to Google Search is mostly an offline pipeline problem, not a serving latency problem.

The Inverted Index and Query Execution

The inverted index is the single most important data structure in search. It maps every term in the corpus to the list of documents that contain it (called a "posting list"), along with metadata like term frequency and position within the document.

Here is what the inverted index looks like conceptually:

Term	Posting List (doc_id, term_freq, positions)
pizza	(doc_42, 3, [5,18,97]), (doc_108, 1, [22]), (doc_999, 5, [1,3,15,88,102])
best	(doc_42, 1, [4]), (doc_108, 2, [1,21]), (doc_555, 1, [7])
near	(doc_42, 1, [6]), (doc_999, 1, [16])

When a user searches "best pizza near me," the query processor:

Tokenizes the query into terms: ["best", "pizza", "near"] (stop words like "me" may be dropped or handled specially)
Looks up the posting list for each term
Intersects the posting lists to find documents containing all terms
Scores each document using BM25 (a text relevance formula) combined with precomputed signals like PageRank
Returns the top-K results

The intersection step is where the magic happens for speed. Posting lists are sorted by document ID. Intersecting two sorted lists is O(n+m) with a merge-join algorithm. For a 3-term query, you intersect iteratively. Skip pointers (every 128th entry contains a pointer to the next block) allow jumping over large sections of the posting list when the current document ID is far behind.

Here is a concrete example. If the "best" posting list has 100,000 documents and the "pizza" posting list has 500,000 documents, a naive intersection touches every entry in both lists. With skip pointers, the algorithm can jump past 127 entries at a time when the current document ID in one list is far behind the other. On average, this reduces the work by 10-50x for terms with very different frequencies.

The Problem Statement

Interviewer: "Walk me through how Google Search works end-to-end. A user types 'best pizza near me' and gets results in 200ms. How does Google crawl billions of pages, build an index, rank results, and serve queries at that speed?"

Clarifying the Scenario

You: "Great question. Before I walk through the full pipeline, let me clarify scope."

You: "Should I cover the full end-to-end flow from crawling to serving, or do you want me to focus on a specific phase like indexing or ranking?"

Interviewer: "Walk through the full pipeline, but I will ask you to go deep on specific parts."

You: "Got it. And should I cover Google-specific infrastructure like Bigtable and MapReduce, or keep it at the conceptual system design level?"

Interviewer: "Conceptual level is fine. I want to understand the architecture, not the implementation details of specific Google technologies."

My Approach

I break this into five parts:

The crawling pipeline: How a fleet of crawlers discovers and fetches billions of pages using a priority-based URL frontier, politeness constraints, and content deduplication.
Building the inverted index: How raw HTML is parsed into tokens, and how those tokens are organized into a term-to-document mapping with position lists, stored across thousands of shards.
PageRank and link graph analysis: How the web's hyperlink structure is used to compute page authority scores through iterative graph computation.
Query serving with scatter-gather: How a user query is broadcast to thousands of index shards in parallel, results are merged, ranked, and returned in under 200ms.
Real-time ranking with learning-to-rank: How BM25 text matching, PageRank authority, and neural models like BERT combine to produce the final ranked result list.

Here are the numbers that shape every design decision:

Metric	Scale
Indexed web pages	100+ billion
Daily queries	~8.5 billion (~100K QPS)
Crawl rate	Billions of URLs per day
Index size (in memory)	Petabytes across the cluster
Index shards	10,000+ per datacenter
Target query latency	< 200ms end-to-end
New/changed pages per day	Hundreds of millions
Link graph edges	100+ billion
Unique daily queries (never seen before)	~15% of total

The Architecture

The Inverted Index and Query Execution

Here is what the inverted index looks like conceptually:

Term	Posting List (doc_id, term_freq, positions)
pizza	(doc_42, 3, [5,18,97]), (doc_108, 1, [22]), (doc_999, 5, [1,3,15,88,102])
best	(doc_42, 1, [4]), (doc_108, 2, [1,21]), (doc_555, 1, [7])
near	(doc_42, 1, [6]), (doc_999, 1, [16])

When a user searches "best pizza near me," the query processor:

Tokenizes the query into terms: ["best", "pizza", "near"] (stop words like "me" may be dropped or handled specially)
Looks up the posting list for each term
Intersects the posting lists to find documents containing all terms
Scores each document using BM25 (a text relevance formula) combined with precomputed signals like PageRank
Returns the top-K results

How Google Search indexing works

The Problem Statement

Clarifying the Scenario

My Approach

The Architecture

The Inverted Index and Query Execution

Continue Reading with Premium

Comments

How Google Search indexing works

The Problem Statement

Clarifying the Scenario

My Approach

The Architecture

The Inverted Index and Query Execution

Continue Reading with Premium

Comments