How Google Search indexing works
Understand the system design behind web-scale search: how Google crawls billions of pages, builds an inverted index, ranks results with PageRank and learning-to-rank models, and serves queries in under 200ms globally.
The Problem Statement
Interviewer: "Walk me through how Google Search works end-to-end. A user types 'best pizza near me' and gets results in 200ms. How does Google crawl billions of pages, build an index, rank results, and serve queries at that speed?"
This question tests four things: whether you understand the crawling pipeline (URL frontier, politeness, dedup), the inverted index data structure and how queries execute against it, PageRank and modern learning-to-rank models, and how a scatter-gather architecture serves queries from sharded indexes in under 200ms.
I love this question for senior+ interviews because there is no single "hard part." Every phase, crawling, indexing, ranking, serving, is a distributed systems challenge on its own. The interviewer wants to see whether you can hold the full pipeline in your head and go deep on any phase when asked.
The scale sets this apart from typical system design questions. Google's index covers over 100 billion web pages. The crawl pipeline processes billions of URLs per day. The serving infrastructure handles over 8.5 billion queries per day (roughly 100,000 queries per second). And the user expects results in under 200ms. Every design decision is driven by these numbers.
I have seen candidates approach this question by drawing a single "search engine" box and hand-waving about databases. That immediately signals shallow understanding. The interviewer wants you to decompose the system into its distinct phases and explain the data structures, algorithms, and distributed systems patterns that make each phase possible at web scale.
Clarifying the Scenario
You: "Great question. Before I walk through the full pipeline, let me clarify scope."
You: "Should I cover the full end-to-end flow from crawling to serving, or do you want me to focus on a specific phase like indexing or ranking?"
Interviewer: "Walk through the full pipeline, but I will ask you to go deep on specific parts."
You: "Got it. And should I cover Google-specific infrastructure like Bigtable and MapReduce, or keep it at the conceptual system design level?"
Interviewer: "Conceptual level is fine. I want to understand the architecture, not the implementation details of specific Google technologies."
You: "Perfect. I will structure my answer in four phases: the crawling pipeline that discovers and fetches web pages, the indexing pipeline that builds the inverted index, the ranking system that scores results using PageRank and machine learning, and the serving architecture that executes queries across sharded indexes in under 200ms."
My Approach
I break this into five parts:
- The crawling pipeline: How a fleet of crawlers discovers and fetches billions of pages using a priority-based URL frontier, politeness constraints, and content deduplication.
- Building the inverted index: How raw HTML is parsed into tokens, and how those tokens are organized into a term-to-document mapping with position lists, stored across thousands of shards.
- PageRank and link graph analysis: How the web's hyperlink structure is used to compute page authority scores through iterative graph computation.
- Query serving with scatter-gather: How a user query is broadcast to thousands of index shards in parallel, results are merged, ranked, and returned in under 200ms.
- Real-time ranking with learning-to-rank: How BM25 text matching, PageRank authority, and neural models like BERT combine to produce the final ranked result list.
The mental model I use is a pipeline with three speeds: the slow path (crawling, days to weeks), the medium path (indexing and ranking precomputation, hours), and the fast path (query serving, milliseconds). Each path has fundamentally different system design constraints.
Here are the numbers that shape every design decision:
| Metric | Scale |
|---|---|
| Indexed web pages | 100+ billion |
| Daily queries | ~8.5 billion (~100K QPS) |
| Crawl rate | Billions of URLs per day |
| Index size (in memory) | Petabytes across the cluster |
| Index shards | 10,000+ per datacenter |
| Target query latency | < 200ms end-to-end |
| New/changed pages per day | Hundreds of millions |
| Link graph edges | 100+ billion |
| Unique daily queries (never seen before) | ~15% of total |
Google does not publish its exact architecture, but the core techniques are well-documented in research papers (the original PageRank paper, the Google File System paper, the MapReduce paper, the Bigtable paper, and more recent BERT and learning-to-rank papers). The architecture described here is based on these publications and standard information retrieval principles.
The Architecture
Here is the end-to-end pipeline from crawling to serving. The key insight is that crawling and indexing happen offline (the "write path"), while query serving happens in real-time (the "read path"). These two paths are decoupled, which is what makes sub-200ms serving possible even though the index covers 100 billion pages.
Let me walk through each phase. The crawl pipeline runs continuously, discovering new and updated pages. The indexing pipeline runs in batches (and increasingly in near-real-time for fresh content). The ranking precomputation updates periodically. The serving path runs on every query in real-time.
The pipeline is organized so that the serving path does as little work as possible. All the heavy computation (crawling, parsing, index building, PageRank, feature extraction) happens offline. At query time, the system does four things: tokenize the query, look up posting lists, score candidates, and generate snippets. That is it.
I think of this as the "precompute everything" philosophy. Every signal that can be computed ahead of time (PageRank, spam scores, document quality, freshness timestamps) is materialized in the feature store before the first query arrives. The query-time code path is just joining precomputed signals with the real-time BM25 score and running them through the ranking model. This is why adding a new ranking signal to Google Search is mostly an offline pipeline problem, not a serving latency problem.
The Inverted Index and Query Execution
The inverted index is the single most important data structure in search. It maps every term in the corpus to the list of documents that contain it (called a "posting list"), along with metadata like term frequency and position within the document.
Here is what the inverted index looks like conceptually:
| Term | Posting List (doc_id, term_freq, positions) |
|---|---|
| pizza | (doc_42, 3, [5,18,97]), (doc_108, 1, [22]), (doc_999, 5, [1,3,15,88,102]) |
| best | (doc_42, 1, [4]), (doc_108, 2, [1,21]), (doc_555, 1, [7]) |
| near | (doc_42, 1, [6]), (doc_999, 1, [16]) |
When a user searches "best pizza near me," the query processor:
- Tokenizes the query into terms: ["best", "pizza", "near"] (stop words like "me" may be dropped or handled specially)
- Looks up the posting list for each term
- Intersects the posting lists to find documents containing all terms
- Scores each document using BM25 (a text relevance formula) combined with precomputed signals like PageRank
- Returns the top-K results
The intersection step is where the magic happens for speed. Posting lists are sorted by document ID. Intersecting two sorted lists is O(n+m) with a merge-join algorithm. For a 3-term query, you intersect iteratively. Skip pointers (every 128th entry contains a pointer to the next block) allow jumping over large sections of the posting list when the current document ID is far behind.
Here is a concrete example. If the "best" posting list has 100,000 documents and the "pizza" posting list has 500,000 documents, a naive intersection touches every entry in both lists. With skip pointers, the algorithm can jump past 127 entries at a time when the current document ID in one list is far behind the other. On average, this reduces the work by 10-50x for terms with very different frequencies.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.