Web Crawler
Design a distributed web crawler that discovers and indexes billions of web pages, covering URL frontier management, politeness policies, deduplication at petabyte scale, and freshness scheduling.
What is a web crawler?
A web crawler fetches pages from the internet and stores their content for downstream indexing. Downloading HTML is the easy part. The real challenge is doing it at a billion-page scale while staying polite to target servers, deduplicating URLs across petabytes of already-visited content, and keeping pages fresh without re-crawling the entire index on every cycle. This question tests distributed queues, Bloom filters, scheduling algorithms, and rate limiting all at once, which is exactly why interviewers reach for it.
I'd design this as a periodic pull-based system rather than real-time. A real-time crawler would require a push architecture where publishers notify you via WebSub (PubSubHubbub) or sitemap ping APIs, and you process page updates as events arrive. That changes the URL Frontier from a scheduled priority queue into a streaming ingestion pipeline (a Kafka topic with event-time processing). For a general-purpose crawler covering billions of arbitrary pages, the push model is impractical: most pages don't implement WebSub. The periodic model with adaptive re-crawl intervals is the right default, and real-time supplements (sitemap pings, RSS polling) layer on top without replacing the core architecture.
Functional Requirements
Core Requirements
- Crawl all pages reachable from a seed list of URLs.
- Store the raw HTML and extracted links for downstream indexing.
- Discover new URLs continuously and re-crawl stale pages on a schedule.
Below the Line (out of scope)
- Full-text search indexing and ranking (downstream system)
- JavaScript rendering (Puppeteer/headless browser tier)
- Login-gated content crawling
- Media file (image/PDF) extraction
Full-text indexing is a downstream concern that consumes the raw HTML this crawler produces. If in scope, I'd add an Index Service that reads from the HTML store, tokenizes content, computes TF-IDF or BM25 scores, and writes to an inverted index store. The crawler and the indexer share exactly one interface: the raw HTML storage layer, so both can evolve independently.
JavaScript rendering requires a headless browser tier (Puppeteer, Playwright) that is 10 to 50 times more expensive per page than a plain HTTP fetch. If in scope, I'd route detected JavaScript-heavy pages to a separate rendering queue with much lower throughput (50 to 100 pages per second instead of 1,000) and pass the rendered HTML back into the main pipeline.
Login-gated content requires per-site credential management, session handling, and CAPTCHA handling. It is sufficiently different from general-purpose crawling that I'd build it as a separate, targeted crawl service rather than generalizing the main crawler.
Media extraction (images, PDFs) is excluded because the storage and processing requirements are fundamentally different: blob storage, OCR pipelines, image recognition. The crawler still encounters these URLs during link extraction, but it discards the binary payload and stores only the URL reference.
The hardest part in scope: URL deduplication at petabyte scale is the single most challenging problem here. At 1 billion pages, a memory-resident hash set requires hundreds of gigabytes of RAM. A database lookup per URL becomes a write-path bottleneck at 1,000 pages per second. Getting deduplication right determines everything else about the crawler's performance.
Non-Functional Requirements
Core Requirements
- Scale: 1 billion pages crawled total; 100 million new or updated pages per day.
- Throughput: 1,000 pages per second sustained (86.4 million pages per day at full capacity).
- Politeness: Maximum 1 request per domain per second; 1 per 10 seconds for sensitive or slow-responding domains.
- Deduplication: No URL crawled twice within a single crawl cycle.
- Freshness: High-change pages (news sites, live feeds) re-crawled within 24 hours; low-change pages within 30 days.
- Storage: Raw HTML stored durably; estimated 1 TB per 1 million pages (including retained historical snapshots) = 1 PB total at full scale.
- Availability: The crawler runs continuously; transient failures must not lose queued URLs.
Below the Line
- Sub-millisecond deduplication latency (seconds-scale latency is acceptable for the URL queue)
- Exactly-once crawl guarantees (at-least-once with idempotency is sufficient)
Read/write ratio: This system is almost entirely writes. For every URL fetched and stored, there is 1 deduplication check, 1 HTML write to object storage, and several URL queue operations. The only significant read workload is the downstream indexer reading from HTML storage. Design every component for write throughput, not read latency. This is one of the few large-scale systems where you can deprioritize read optimization almost entirely.
Core Entities
- CrawlJob: A top-level crawl task with a seed URL list, status, and configuration (depth limit, domain scope, re-crawl enabled flag).
- URLFrontier: A prioritized queue entry representing a URL to be crawled, with a scheduled crawl time and a numeric priority score.
- CrawledPage: The stored result of one crawl: raw HTML, extracted outbound links, HTTP status code, crawl timestamp, and content hash.
- DomainPolicy: The cached per-domain rules: parsed robots.txt directives, crawl delay setting, last-crawl timestamp, and domain authority score.
- URLFingerprint: A compact record (canonical URL string and its hash) used for Bloom filter deduplication lookups.
Full schema and indexing strategy are deferred to the deep dives. These five entities are enough to drive the API and High-Level Design.
API Design
A web crawler is primarily internal, but operator APIs are needed for seed submission and observability.
FR 1: Submit seed URLs to start a crawl:
POST /v1/crawl/seeds
Body: {
urls: ["https://example.com", "https://news.ycombinator.com"],
config: { max_depth: 5, scope: "domain", recrawl_enabled: true }
}
Response: { job_id: "cj_abc123", queued_count: 2, status: "queued" }
POST because this creates a new CrawlJob. The config block lets callers scope the crawl to a single domain or the full reachable web, and enables or disables freshness-based re-crawling per job. The job_id is returned for status polling; callers do not wait synchronously for the crawl to complete.
FR 2: Check the crawl status of a specific URL:
GET /v1/crawl/status/{encoded_url}
Response: {
url: "https://example.com",
status: "crawled",
last_crawled_at: "2026-03-29T12:00:00Z",
next_scheduled_at: "2026-03-30T06:00:00Z",
http_status: 200,
content_hash: "sha256:abc..."
}
The URL is URL-encoded in the path. next_scheduled_at lets downstream integrations know when refreshed content will be available. content_hash reveals whether a re-crawl produced any actual change, which matters for incremental indexers.
FR 3: List active crawl jobs:
GET /v1/crawl/jobs?status=running&cursor=eyJ0c...&limit=20
Response: {
jobs: [
{ job_id: "cj_abc", seed_count: 2, pages_crawled: 15420, status: "running", started_at: "..." }
],
next_cursor: "eyJ0c..."
}
Use cursor-based pagination because the job list is a time-ordered stream. Offset pagination skips jobs when new crawls start mid-page. Filter by status (queued, running, completed, failed) to limit result set sizes in production.
High-Level Design
1. Basic crawl loop: frontier, fetcher, and storage
The core pipeline: dequeue a URL, fetch the page, parse links, store HTML, and enqueue discovered URLs.
This satisfies FR 1 and FR 2 end-to-end on a small seed set. It has no politeness enforcement, no deduplication, and no priority control. Establishing correctness first gives us a clear baseline before adding complexity.
Components:
- Seed Submitter: Operator API that pushes seed URLs into the URL Frontier after job creation.
- URL Frontier Queue: A persistent FIFO queue (a Kafka topic or Redis list) holding URLs pending crawl; it is upgraded to a priority queue in a later iteration.
- Fetcher Service: Dequeues URLs, sends HTTP GET to target servers, receives raw HTML, and extracts outbound links.
- HTML Store: Object storage (S3) for raw HTML, keyed by URL hash for content-addressed lookup.
- Crawl DB: PostgreSQL tracking crawl status per URL (queued, crawled, failed, skipped).
Request walkthrough:
- Seed Submitter calls POST /v1/crawl/seeds and the API writes seed URLs into the URL Frontier Queue.
- Fetcher Service dequeues the next URL and sends HTTP GET to the target server.
- Fetcher writes raw HTML to HTML Store, keyed by the SHA-256 hash of the URL.
- Fetcher extracts all <a href> links from the HTML and checks Crawl DB; for each link not yet queued, it enqueues the link into the URL Frontier.
- Fetcher updates Crawl DB: marks the URL as crawled, recording http_status and crawl_timestamp.
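To make the walkthrough concrete, here is a minimal single-process sketch of that loop in Python. The injectable `fetch` callable, the dict stand-ins for the HTML Store and Crawl DB, and the regex link extractor are illustrative simplifications; a real fetcher would use an HTTP client and a proper HTML parser.

```python
import hashlib
import re
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Minimal crawl loop sketch: dequeue, fetch, store, extract links,
    enqueue.  fetch(url) is assumed to return the page's raw HTML as a
    string.  No politeness or deduplication yet -- those layers come next."""
    frontier = deque(seeds)   # URL Frontier (plain FIFO for now)
    html_store = {}           # stand-in for S3, keyed by URL hash
    crawl_db = {}             # stand-in for Crawl DB status tracking
    while frontier and len(crawl_db) < max_pages:
        url = frontier.popleft()
        if url in crawl_db:   # naive dedup; replaced by a Bloom filter later
            continue
        html = fetch(url)
        key = hashlib.sha256(url.encode()).hexdigest()
        html_store[key] = html
        crawl_db[url] = "crawled"
        # extract <a href="..."> links and enqueue the unseen ones
        for link in re.findall(r'<a\s+href="([^"]+)"', html):
            if link not in crawl_db:
                frontier.append(link)
    return crawl_db, html_store
```

Feeding it a two-page dictionary via `fetch=pages.get` is enough to watch a seed URL fan out to its discovered links.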
This covers the basic pipeline. I like to start interviews with this exact diagram because it proves you understand the core loop before adding any optimization layers. The fetcher hammers the same domain without throttling, and re-enqueues already-visited URLs, so the frontier grows unboundedly. Both problems are addressed next.
2. Politeness enforcement
A fetcher that fires requests without throttling will hit the same domain dozens of times per second, triggering rate-limit responses or outright IP bans.
Politeness has two components: respecting robots.txt (which URLs the site disallows) and enforcing per-domain crawl delays (how fast we hit a given server). I'd treat both as pre-flight checks before every HTTP request. In my experience, interviewers love probing this area because it separates candidates who have actually built scrapers from those reading a design doc for the first time.
Components added:
- Robots.txt Cache: A Redis hash storing parsed robots.txt rules per domain with a 24-hour TTL. Checked before every request; on cache miss, the fetcher issues a real robots.txt fetch and caches the result.
- Domain Rate Limiter: A Redis sorted set tracking the last-crawl timestamp per domain. The fetcher checks this before sending a request and re-queues the URL if the domain is on cooldown.
Request walkthrough:
- Fetcher dequeues a URL (e.g., https://example.com/page-42).
- Fetcher checks Robots.txt Cache for example.com. On cache miss, it fetches and caches https://example.com/robots.txt.
- If the URL path is disallowed, Fetcher drops the URL and marks it "skipped" in Crawl DB.
- Fetcher queries Domain Rate Limiter for example.com: time since last request. If under the minimum delay (1 second default, 10 seconds for slow domains), it re-queues the URL with a delayed scheduled time.
- Fetcher acquires the domain slot, sends HTTP GET, and proceeds.
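The cooldown check in that walkthrough can be sketched as follows. This is an in-memory stand-in for the Redis sorted set described above; the 1-second and 10-second delays come from the politeness requirement, and the injectable clock exists only to make the sketch testable.

```python
import time

class DomainRateLimiter:
    """Per-domain cooldown check (sketch).  allow(domain) returns True and
    records the crawl time if the cooldown has elapsed; otherwise the
    caller should re-queue the URL with a delayed scheduled time."""

    def __init__(self, default_delay=1.0, slow_delay=10.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.slow_delay = slow_delay   # for sensitive or slow-responding domains
        self.slow_domains = set()
        self.last_crawl = {}           # domain -> timestamp of last request
        self.clock = clock             # injectable for testing

    def allow(self, domain):
        delay = self.slow_delay if domain in self.slow_domains else self.default_delay
        now = self.clock()
        last = self.last_crawl.get(domain)
        if last is not None and now - last < delay:
            return False               # on cooldown: re-queue the URL
        self.last_crawl[domain] = now  # acquire the domain slot
        return True
```

In production the `last_crawl` map lives in Redis (or, per the final design, in the memory of the single worker that owns the domain), but the check-then-record logic is identical.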
Politeness violations are now prevented. The remaining gap is that the fetcher still enqueues already-crawled URLs, so the frontier grows without bound.
3. URL deduplication
Without deduplication, the frontier re-enqueues URLs we've already crawled, creating an ever-growing backlog of redundant work.
The goal is to reject a URL before it enters the frontier if it has already been queued or crawled in this cycle. I'll treat the exact mechanism as a black box here and detail the options in the deep dives.
Components added:
- Bloom Filter: A probabilistic data structure that answers "have I seen this URL?" with zero false negatives and a small, configurable false-positive rate. At 1 billion URLs with a 0.1% false-positive rate, a Bloom filter requires approximately 1.8 GB of memory.
Request walkthrough:
- HTML Parser extracts an outbound link from a crawled page.
- Fetcher queries the Bloom Filter: "have we seen this URL?"
- If the Bloom Filter returns "probably seen," the URL is discarded.
- If it returns "definitely not seen," the URL is added to the Bloom Filter and enqueued into the URL Frontier.
- On a false positive (0.1% rate), we skip a URL we haven't actually crawled. This is acceptable for a best-effort general-purpose crawler.
The crawler now avoids redundant work and the frontier stays bounded. For your interview: mention the Bloom filter's 1.8 GB memory footprint versus 100+ GB for a full hash set, and you've already shown you understand the space-accuracy tradeoff.
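The 1.8 GB figure falls out of the standard Bloom filter sizing formulas, which are worth being able to derive on a whiteboard:

```python
import math

def bloom_sizing(n, p):
    """Standard Bloom filter sizing: m bits and k hash functions for
    n expected elements at target false-positive rate p.
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # total bits
    k = round((m / n) * math.log(2))                      # hash function count
    return m, k

m, k = bloom_sizing(1_000_000_000, 0.001)
# m / 8 is roughly 1.8e9 bytes (about 1.8 GB); k works out to 10
```

At 1 billion URLs and a 0.1% false-positive rate this confirms the ~1.8 GB footprint quoted above, against 100+ GB for storing the URLs themselves.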
The remaining gap is that all URLs have equal priority and crawled pages are never revisited.
4. Freshness scheduling and prioritization
A one-time crawl goes stale within hours. Pages change, new content appears, and the index diverges from reality without re-crawl logic.
We need two things: a priority score that puts high-value pages first in the frontier, and a re-crawl scheduler that re-inserts pages based on how frequently their content changes. I'd call out the freshness scheduling explicitly in the interview because it is the part most candidates skip entirely, defaulting to "just re-crawl everything weekly."
Components added:
- Priority Scorer: Assigns a numeric priority to each URL based on domain authority, estimated change frequency, and time since last crawl.
- Recrawl Scheduler: A background job that queries Crawl DB for pages due for re-crawl (next_crawl_at <= NOW()) and re-inserts them into the URL Frontier with their computed priority score.
Request walkthrough:
- After fetching a page, Fetcher computes a SHA-256 content hash and compares it against the stored hash in Crawl DB.
- If the content changed, Fetcher increments the change_frequency score for that URL. If unchanged, it decrements the score.
- Fetcher sets next_crawl_at using the formula: interval = clamp(BASE_INTERVAL / change_frequency, 24h, 30d).
- Recrawl Scheduler runs every few minutes, queries Crawl DB for due pages, and re-inserts them into the frontier with their priority score.
- Because the frontier is now a Redis sorted set, URLs are dequeued in priority order rather than FIFO.
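The interval formula from the walkthrough can be sketched directly. The 240-hour BASE_INTERVAL below is an assumed tuning constant, not part of the design; only the clamp bounds (24 hours to 30 days) come from the freshness requirement.

```python
def next_crawl_interval(change_frequency, base_interval_h=240.0,
                        min_h=24.0, max_h=720.0):
    """Adaptive re-crawl interval in hours:
    clamp(BASE_INTERVAL / change_frequency, 24h, 30d).
    change_frequency is the score the fetcher bumps up when the content
    hash changes and down when it does not.  BASE_INTERVAL=240h is an
    assumed default."""
    if change_frequency <= 0:
        return max_h   # page never observed changing: slowest cadence
    return min(max(base_interval_h / change_frequency, min_h), max_h)
```

A frequently changing page (high score) converges to the 24-hour floor; a static page drifts out to the 30-day ceiling.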
The system now handles all three functional requirements: crawling from seeds, storing raw HTML, and re-crawling stale pages on a freshness-based schedule.
Potential Deep Dives
1. URL deduplication at scale
At 1 billion URLs and 1,000 crawls per second, the deduplication layer processes thousands of URL lookups per second against a dataset growing by millions of entries per day. The mechanism you choose determines memory cost, latency, and operational complexity.
2. URL frontier and prioritization
The URL Frontier is the scheduler at the heart of the crawler. At 1 billion queued URLs it must dequeue in priority order, prevent domain starvation, and survive process restarts without losing queued work.
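A minimal in-memory sketch of the two-tier idea, where a heap stands in for the Redis sorted set of domains and per-domain deques stand in for the front-tier lists (names and the cooldown default are illustrative):

```python
import heapq
import time
from collections import deque

class TwoTierFrontier:
    """Back tier: domains ranked by priority.  Front tier: a FIFO of URLs
    per domain.  dequeue() picks the highest-priority domain whose crawl
    cooldown has elapsed, which prevents one hot domain from starving
    politeness."""

    def __init__(self, cooldown=1.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock         # injectable for testing
        self.heap = []             # back tier: (-priority, domain)
        self.queues = {}           # front tier: domain -> deque of URLs
        self.last_crawl = {}

    def enqueue(self, domain, url, priority):
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.heap, (-priority, domain))
        self.queues[domain].append(url)

    def dequeue(self):
        now, skipped, url = self.clock(), [], None
        while self.heap:
            entry = heapq.heappop(self.heap)
            _, domain = entry
            last = self.last_crawl.get(domain)
            if last is not None and now - last < self.cooldown:
                skipped.append(entry)      # on cooldown, try next domain
                continue
            url = self.queues[domain].popleft()
            self.last_crawl[domain] = now
            if self.queues[domain]:
                skipped.append(entry)      # domain still has pending URLs
            else:
                del self.queues[domain]
            break
        for entry in skipped:              # restore skipped domains
            heapq.heappush(self.heap, entry)
        return url                         # None if everything is cooling down
```

The production version replaces the heap with a Redis sorted set and the deques with Redis lists so the frontier survives process restarts.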
3. Politeness enforcement at scale
With hundreds of fetcher workers running in parallel, each targeting thousands of domains, politeness enforcement must be globally consistent. A per-worker in-memory rate limiter allows different workers to independently violate the same domain's limit.
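One way to get globally consistent politeness without cross-worker coordination is to assign each domain to exactly one worker via consistent hashing, as the final design does. A sketch, with the virtual-node count as an assumed tuning knob:

```python
import bisect
import hashlib

class DomainPartitioner:
    """Consistent-hash ring mapping each domain to exactly one worker, so
    per-domain rate limits can live in that worker's local memory.
    Virtual nodes smooth out load across workers."""

    def __init__(self, workers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        # 64-bit slice of SHA-256 as the ring position
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def owner(self, domain):
        # first vnode clockwise from the domain's hash (wrapping around)
        i = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[i][1]
```

Because ownership is exclusive, a worker can enforce its domains' crawl delays with a plain in-memory timestamp map, and adding or removing a worker only remaps the domains that touched that worker.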
Final Architecture
The key architectural insight is that domain partitioning across workers eliminates cross-process coordination for rate limiting, while the Bloom filter in shared Redis blocks duplicate enqueues without the memory cost of a full URL hash store. I'd draw this diagram last in an interview and walk through one URL's full lifecycle from seed submission to re-crawl, touching every component once. Every tier scales independently: add more crawlers to increase throughput, shard the Bloom filter for more unique URLs, and add Recrawl Scheduler instances to handle larger re-crawl volumes without touching the hot crawl path.
Interview Cheat Sheet
- Real-time vs periodic: Crawlers are periodic with freshness scoring, not real-time. Re-crawl intervals range from 24 hours (high-change pages) to 30 days (static pages) based on measured content change frequency. Real-time supplements (sitemap push notifications, RSS feeds) exist but complement the crawl loop rather than replace it.
- URL deduplication: Use a Bloom filter via RedisBloom. At 1 billion URLs with 0.1% FPR, it requires ~1.8 GB of RAM, versus 100+ GB for a Redis key store. Bloom filters have zero false negatives, so "definitely not seen" is always correct.
- False positive handling: At 0.1% FPR with 1 billion URLs, roughly 1 million legitimate URLs are skipped per cycle. For critical seed URLs and re-crawls, bypass the Bloom filter and do a direct Crawl DB lookup.
- URL frontier: Use a two-tier design: a back-tier Redis sorted set ranking domains by priority score, and a front-tier per-domain Redis list holding actual URLs. The Domain Selector picks the highest-priority domain whose crawl cooldown has elapsed.
- Prioritization formula: Priority = 50% domain authority + 30% staleness (time since last crawl) + 20% recent success rate. High-authority pages with stale content are always dequeued first.
- Politeness (robots.txt): Cache parsed robots.txt per domain in Redis with a 24-hour TTL. Check before every fetch. Disallowed URLs are dropped and marked "skipped" in Crawl DB, not retried.
- Politeness (rate limiting): Partition domains across workers using consistent hashing. Each worker exclusively owns its assigned domains and enforces per-domain minimum delays in local memory. No cross-worker coordination needed; adding workers doesn't worsen politeness.
- Re-crawl scheduling: After each crawl, compute next_crawl_at = now() + clamp(BASE_INTERVAL / change_frequency, 24h, 30d). High-change pages (content hash changed) get shorter intervals; low-change pages get longer intervals. The Recrawl Scheduler queries WHERE next_crawl_at <= NOW() and re-inserts due URLs into the frontier, bypassing the Bloom filter.
- Throughput math: 1,000 pages per second x 86,400 seconds per day = 86.4 million pages per day. At 100 million new or updated pages per day, you need slightly more than one worker tier at full speed. At 1 billion total pages, a full static cycle takes roughly 12 days, with high-change pages cycling daily.
- Storage estimate: Average HTML page is roughly 50 KB. 1 billion pages x 50 KB = 50 TB raw. With gzip compression (~5x ratio), approximately 10 TB for compressed HTML. The 1 PB estimate accounts for multiple crawl cycles and historical snapshots retained for the indexer.
- DNS prefetching: DNS resolution adds 20 to 120 ms per new domain. Fetcher workers resolve DNS for the next URL while processing the current one. Cache DNS results per domain for up to 5 minutes to avoid hammering resolvers.
- Content deduplication vs URL deduplication: URL deduplication (Bloom filter) prevents re-crawling the same URL. Content deduplication (comparing content hashes) detects mirror sites. Store a content hash per CrawledPage and flag pages whose hash matches an already-indexed page so the downstream indexer doesn't re-index duplicates.
- Horizontal scaling: Add Crawler Workers to increase throughput (throughput scales linearly). Shard Crawl DB by URL hash for higher write throughput. Redis Cluster for the Bloom filter and domain queues. Recrawl Scheduler scales by partitioning the next_crawl_at time range across multiple instances.
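As a closing concrete artifact, the cheat-sheet prioritization formula (50/30/20 weights) as code, assuming each input has already been normalized to [0, 1]:

```python
def priority_score(domain_authority, staleness, success_rate):
    """Cheat-sheet formula: 50% domain authority + 30% staleness
    (time since last crawl, normalized) + 20% recent success rate.
    Inputs are assumed normalized to [0, 1]; output is in [0, 1]."""
    return 0.5 * domain_authority + 0.3 * staleness + 0.2 * success_rate
```

This is the score the Priority Scorer writes into the frontier's sorted set, so a high-authority domain with stale content always sorts ahead of a fresh low-authority one.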