Design a Simple Search Engine
OOP design for an in-memory search engine covering document indexing, inverted index construction, TF-IDF ranking, boolean queries (AND/OR/NOT), and result pagination.
The Problem
Your engineering team maintains a knowledge base of 50,000 internal documents. Engineers search for solutions dozens of times a day, but the current implementation runs a naive String.contains() scan across every document for every query. At 50,000 documents, each search takes 3-5 seconds, and boolean queries like "circuit breaker AND retry NOT timeout" are not supported at all.
An inverted index solves the core performance problem. Instead of scanning every document, you build a map from each term to the list of documents containing that term. A search for "retry" becomes a single map lookup returning document IDs in microseconds. Boolean queries become set operations: AND is intersection, OR is union, NOT is difference.
Design the core classes for an in-memory search engine that supports document indexing, inverted index construction, pluggable ranking strategies, boolean query evaluation, and paginated result retrieval.
Requirements
Clarifying Questions
Before jumping into class design, ask questions to turn the vague prompt into a concrete specification. Cover four areas: core behavior, ranking, query language, and boundaries.
You: "What does a document look like? Does it have a title, body, and metadata, or just raw text?"
Interviewer: "Each document has an ID, a title, and a body. No additional metadata for now."
Simple model. A Document record with three fields keeps things clean.
You: "How should we rank search results? By relevance, recency, or just presence?"
Interviewer: "Relevance by default using TF-IDF. But we want the flexibility to swap in other algorithms like BM25 or simple frequency count."
That signals Strategy pattern for ranking. The default implementation computes TF-IDF scores, but the interface stays open for alternatives.
You: "Do we need to support boolean queries like AND, OR, NOT?"
Interviewer: "Yes. Users should be able to combine terms with AND, OR, and NOT. For example, 'cache AND invalidation NOT stale'."
Boolean operators mean we need a query tree. Composite pattern fits: each query node is either a leaf (single term) or a composite (AND/OR/NOT combining child nodes).
You: "Should tokenization be configurable? Some teams might want stop-word removal or stemming."
Interviewer: "Yes. Default tokenization is lowercase plus whitespace splitting. But make it pluggable so we can add stop-word removal or stemming later."
Another Strategy. Tokenization is a pipeline step that varies independently from ranking.
You: "How should we handle pagination? Do callers specify page size and page number?"
Interviewer: "Yes. Return a page of results with the total count so callers can build pagination controls."
A SearchResult object wrapping a page of ranked results with metadata.
You: "Should the system handle concurrent reads and writes? Can indexing happen while searches are running?"
Interviewer: "Yes. Documents get indexed continuously while searches execute. Thread safety is required."
ReadWriteLock on the index. Reads (searches) can run concurrently, but writes (indexing) acquire an exclusive lock.
You: "Do we need phrase queries like 'circuit breaker' as an exact phrase?"
Interviewer: "Not in the core scope. But design the posting list so positional data is available for future phrase support."
Good boundary. We store term positions in each posting so phrase queries become a natural extension.
You: "Is there a limit on document size or total document count?"
Interviewer: "No hard limit. Assume in-memory is sufficient. Optimize for fast lookup, not storage efficiency."
Final Requirements
Functional Requirements:
- Index documents by tokenizing their title and body into an inverted index.
- Search for documents matching a single term, returning ranked results.
- Support boolean queries (AND, OR, NOT) combining multiple terms.
- Rank results using a pluggable strategy (TF-IDF by default).
- Tokenize text using a pluggable strategy (lowercase + whitespace split by default).
- Return paginated results with total count.
- Store term positions in postings for future phrase query support.
Non-Functional Requirements:
- Thread-safe for concurrent indexing and searching.
- Sub-millisecond single-term lookups via inverted index.
- Extensible for new ranking algorithms and tokenizers without modifying existing code.
Out of Scope:
- Persistence or database integration
- Phrase queries (positional data stored but not queried)
- Fuzzy matching or typo correction
- UI rendering or API endpoints
- Distributed indexing or sharding
Example Inputs and Outputs
Scenario 1: Single-term search with ranking
- Input: Index three documents. Doc 1: "retry mechanisms in distributed systems". Doc 2: "retry and circuit breaker patterns". Doc 3: "caching strategies for performance". Search for "retry".
- Expected:
search("retry")returns Doc 1 and Doc 2, ranked by TF-IDF score. Doc 3 is excluded because it does not contain "retry". - Why: Validates inverted index lookup and ranking.
Scenario 2: Boolean AND query
- Input: Same three documents. Search for "retry AND patterns".
- Expected: Only Doc 2 is returned (contains both "retry" and "patterns"). Doc 1 has "retry" but not "patterns". Doc 3 has neither.
- Why: Validates posting list intersection for AND queries.
Scenario 3: Boolean NOT query
- Input: Same three documents. Search for "retry NOT circuit".
- Expected: Only Doc 1 is returned. Doc 2 has "retry" but also contains "circuit", so NOT excludes it.
- Why: Validates posting list difference for NOT queries.
Try It Yourself
Try it yourself
Before reading the solution, spend 15-20 minutes sketching your own class diagram. Focus on how you would structure the inverted index (what data does each posting store?) and how boolean queries compose into a tree. Think about which parts vary independently and deserve their own strategy interface. Compare your approach with the walkthrough below.
Step 1: Identify Core Entities
Start by asking: what are the main "things" in this problem? Look for nouns in the requirements. You will find documents, an index, tokens, queries, rankings, and results. Each one maps to a class with a single clear responsibility.
A common mistake is lumping the index, tokenizer, and ranker into one giant SearchEngine class. Good design means each class has one job. The engine orchestrates, the index stores postings, the tokenizer splits text, and the ranker scores results.
| Entity | Responsibility | Key attributes |
|---|---|---|
| Document | Immutable data holder for indexed content. | id, title, body |
| Posting | Records where a term appears within a specific document. | docId, frequency, positions |
| InvertedIndex | Maps terms to posting lists. The core data structure. | index (Map of term to List of Posting) |
| Tokenizer | Splits raw text into normalized tokens. Pluggable. | (strategy interface) |
| RankingStrategy | Scores and sorts search results. Pluggable. | (strategy interface) |
| Query | Represents a search query node. Composite tree for boolean ops. | (composite interface) |
| SearchResult | Wraps a page of ranked results with metadata. | results, totalCount, page, pageSize |
| SearchEngine | Orchestrator. Coordinates indexing and searching. | index, tokenizer, ranker |
Notice that Posting is separate from InvertedIndex. A posting captures per-document term metadata (frequency, positions), while the index manages the mapping from terms to posting lists. Merging them would violate SRP because the index handles lookups while postings hold statistics.
Step 2: Define Relationships and Class Design
SearchEngine
The orchestrator. It coordinates indexing and searching but delegates all real work to collaborators.
Deriving state from requirements:
| Requirement | What SearchEngine must track |
|---|---|
| "Index documents into an inverted index" | The inverted index |
| "Pluggable tokenization" | The active tokenizer |
| "Pluggable ranking" | The active ranking strategy |
| "Thread-safe concurrent access" | A ReadWriteLock |
Deriving methods from needs:
| Need from requirements | Method |
|---|---|
| "Index documents" | indexDocument(document) |
| "Search with pagination" | search(query, page, pageSize) |
InvertedIndex
The core data structure. Maps each term to a list of postings, and stores documents by ID for retrieval.
Deriving state from requirements:
| Requirement | What InvertedIndex must track |
|---|---|
| "Map terms to posting lists" | Map from term to List of Posting |
| "Return documents by ID" | Map from docId to Document |
Deriving methods from needs:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.