Metrics Collector
Design a pull-based metrics collection pipeline that monitors thousands of servers in real time, aggregates time-series data efficiently, and triggers alerts without losing data during spikes.
What is a server metrics collection system?
A server metrics collection system scrapes CPU, memory, disk, and network statistics from every machine in a fleet, stores those measurements as time-series data, and lets engineers query and alert on them in near real time. The interesting engineering challenge is not the scraping; it is handling the write volume at scale (thousands of servers emitting metrics every 10 seconds means hundreds of thousands of data points per minute), choosing a storage engine that evaluates range queries over months of data in under a second, and deciding whether each server should push metrics out or wait to be pulled.
I'd open with the write volume math in an interview because it changes every downstream choice. This question tests write-heavy system design, time-series database internals, alerting pipeline architecture, and the push vs pull debate that comes up in every observability interview.
Functional Requirements
Core Requirements
- Collect CPU, memory, disk, and network metrics from thousands of servers every 10 to 60 seconds.
- Store time-series data with enough resolution to detect and alert on short-duration spikes.
- Support dashboard queries returning aggregate metrics over arbitrary time ranges (last 5 minutes to last 6 months).
- Trigger alerts when a metric crosses a configurable threshold for a sustained duration.
Below the Line (out of scope)
- Log collection and log-based alerting (covered in Design a Distributed Logging System)
- Distributed tracing and service-level APM
- ML-based anomaly detection on metric streams
- Multi-tenant metric isolation with per-customer encryption
The hardest part in scope: Storing time-series data efficiently. Naive row-per-sample storage in a relational database quickly reaches tens of billions of rows. The correct answer is a purpose-built time-series database (TSDB) that compresses samples using delta-of-delta encoding and Gorilla float compression, enabling a 10-40x storage reduction while keeping range queries under 100ms.
Log collection is below the line because it requires separate indexing infrastructure (inverted index for full-text search) that is architecturally distinct from the append-only numeric time-series store described here. To add it, build a Log Ingestion Service that writes to Elasticsearch or a ClickHouse full-text index in parallel with the metrics pipeline.
Distributed tracing is below the line because it requires a causal graph store keyed by trace ID, not a time-series store. To add it, run an OpenTelemetry Collector that fans spans to a Jaeger backend (Cassandra or Elasticsearch) alongside the metrics pipeline.
ML-based anomaly detection is below the line because it is a consumer of the metrics pipeline, not a change to the pipeline itself. To add it, subscribe a streaming job (Flink) to the metrics ingestion topic and emit anomaly events to an alert bus when a model scores a metric as anomalous.
Non-Functional Requirements
Core Requirements
- Write throughput: 10,000 servers, each emitting 100 metrics every 10 seconds, produces 100,000 metric samples per second sustained. Peak (all servers flushing simultaneously) can reach 300,000 samples/second.
- Query latency: Dashboard range queries must return in under 1 second at p95 for time ranges up to 7 days. Longer ranges (up to 6 months) must complete in under 5 seconds.
- Retention: 15 days of raw data (10-second resolution). 1 year of downsampled data (1-minute resolution). 3 years of heavily downsampled data (1-hour resolution).
- Alert latency: Threshold alerts must fire within 30 seconds of the metric crossing the threshold.
- Durability: 99.9% of metric samples must be stored. Brief gaps during a collector restart are acceptable; sustained loss is not.
- Availability: 99.9% uptime for the query and alert path. The collection path can have momentary gaps during deployments.
Below the Line
- Sub-100ms alert latency (requires streaming evaluation rather than batch rule evaluation)
- Per-metric access control (who can query which server's metrics)
- Cardinality explosion protection at ingestion time
Read/write ratio: This system is heavily write-skewed. 100,000 samples/second sustained means roughly 8.6 billion samples per day on the write side. Dashboard queries are sporadic bursts from on-call engineers; alert evaluations happen every 10 seconds per rule but touch only aggregated data. The write-to-interactive-read ratio is roughly 1,000:1. Every major design decision in this article is driven by that number: make writes cheap, make reads pre-computed wherever possible.
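The envelope math is worth writing down explicitly, since every downstream choice flows from it. A few lines confirm the numbers used throughout this article (constants taken directly from the requirements above):

```python
# Back-of-envelope write volume, using the constants from the requirements.
SERVERS = 10_000
METRICS_PER_SERVER = 100
SCRAPE_INTERVAL_S = 10

samples_per_second = SERVERS * METRICS_PER_SERVER // SCRAPE_INTERVAL_S
samples_per_day = samples_per_second * 86_400

print(samples_per_second)  # 100000 sustained
print(samples_per_day)     # 8640000000, i.e. ~8.6 billion samples/day
```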
I treat the 1-second query SLA on 7-day ranges as the forcing constraint for storage format. Any database that requires a full table scan to answer that query is architecturally wrong for this system.
Core Entities
- MetricSample: A single measurement: metric name, host ID, tags (region, service, environment), timestamp, and float64 value. The atomic unit of storage.
- MetricSeries: A unique combination of metric name + tag set. All samples for one series share the same labels. The unit of indexing in a TSDB.
- AlertRule: A threshold condition on a metric query: expression, evaluation interval, duration (how long the condition must hold before firing), and notification channel.
- AlertEvent: A fired or resolved alert instance tied to a rule, with the triggering value, timestamp, and affected host or service.
- Dashboard: A saved collection of metric queries rendered as time-series charts or gauges. Backed by the query service.
Schema details and TSDB internals are deferred to the deep dives. These five entities drive the API and High-Level Design.
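As a concrete sketch of the first two entities (a minimal Python shape, not a storage schema; host_id is folded into the series key here for simplicity, though in a real TSDB the host is usually just another label):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricSample:
    """One measurement: the atomic unit of storage."""
    name: str        # e.g. "cpu.usage"
    host_id: str
    tags: tuple      # sorted (key, value) pairs, kept hashable
    timestamp: int   # unix seconds
    value: float

def series_key(sample: MetricSample) -> tuple:
    """A MetricSeries is the unique (metric name, tag set) combination;
    all samples sharing this key belong to one series."""
    return (sample.name, sample.host_id, sample.tags)
```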
API Design
Start with one endpoint per functional requirement, then evolve where the naive shape breaks.
FR 1 - Ingest metric samples:
The naive push shape:
POST /metrics
Body: { metric_name, host_id, tags, timestamp, value }
Response: HTTP 204
This breaks at 100,000 samples/second: one HTTP request per sample is absurd overhead. The evolved shape pushes batches from each server's local agent:
POST /metrics/batch
Body: {
host_id: "server-1234",
timestamp: 1743417600,
samples: [
{ name: "cpu.usage", tags: { core: "0" }, value: 72.4 },
{ name: "mem.used_bytes", tags: {}, value: 12884901888 }
]
}
Response: HTTP 204
Each batch covers one scrape interval (10 to 60 seconds of samples from one host). The agent keeps unsent batches in local memory for up to 5 minutes to handle transient collector downtime. I've seen production outages where a 2-minute collector restart caused permanent metric gaps because agents had no local buffer. Five minutes of local retention is the minimum I'd accept.
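The agent's local buffering behavior can be sketched as follows. This is an illustrative model, not a real agent: send_fn stands in for the HTTP client, and the 5-minute retention window comes from the text above.

```python
import collections
import time

class AgentBuffer:
    """Agent-side batching with a bounded local buffer for collector downtime."""

    MAX_BUFFER_S = 300  # keep unsent batches for up to 5 minutes

    def __init__(self, send_fn):
        self.send_fn = send_fn          # callable(batch) -> bool (True = acked)
        self.pending = collections.deque()

    def flush(self, batch, now=None):
        """Queue the new batch, expire stale ones, then try to drain in order."""
        now = time.time() if now is None else now
        self.pending.append((now, batch))
        # Drop batches older than the retention window (oldest first).
        while self.pending and now - self.pending[0][0] > self.MAX_BUFFER_S:
            self.pending.popleft()
        # Drain front-to-back so per-host ordering is preserved.
        while self.pending:
            if not self.send_fn(self.pending[0][1]):
                break  # collector unreachable; retry on the next interval
            self.pending.popleft()
        return len(self.pending)  # batches still buffered locally
```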
FR 2 - Query metrics:
GET /query_range
Query params:
expr=cpu.usage{host="server-1234",env="prod"}
start=2026-03-29T00:00:00Z
end=2026-03-29T06:00:00Z
step=60s
Response: {
data: [ { timestamp, value } ],
resolution: "1m",
downsampled: false
}
The step parameter determines the query resolution. The query service automatically routes to raw data or the downsampled tier based on the requested time range.
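The routing rule the query service applies can be sketched like this: a simplified planner under the retention tiers from the non-functional requirements, with the 7-day span cutoff mirroring the query SLA. The tier names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Retention tiers from the requirements: 10s raw for 15 days,
# 1-minute rollups for 1 year, 1-hour rollups for 3 years.
def choose_tier(start: datetime, end: datetime, now: datetime) -> str:
    """Route a range query to the cheapest tier that can still answer it."""
    age = now - start
    span = end - start
    if age > timedelta(days=365):
        return "downsampled_1h"   # only the 3-year tier reaches back this far
    if age > timedelta(days=15) or span > timedelta(days=7):
        return "downsampled_1m"   # raw data expired, or span too wide for the 1s SLA
    return "raw_10s"
```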
FR 3 - Manage alert rules:
POST /alerts/rules
Body: { name, expr, threshold, duration_seconds, channel_id }
Response: { rule_id }
GET /alerts/events?rule_id=&start=&end=
Response: { events: [ { rule_id, fired_at, resolved_at, value, host } ] }
Alert rules reference the same query expression syntax as dashboard queries. The Alert Evaluator runs each rule expression on a schedule and compares the result against the threshold.
High-Level Design
1. Collecting metrics from servers
The naive approach: Each server emits one HTTP request per metric per interval directly to a central ingest server that writes samples to a relational database.
This fails immediately. 100 metrics per server times 10,000 servers at a 10-second interval means 100,000 HTTP requests per second. A relational database cannot sustain 100,000 row inserts per second without expensive partitioning, and the per-request overhead alone saturates a central ingest server.
The key insight is to batch at two levels: the agent batches all metrics from one host into one request per scrape interval, and the ingest service writes to Kafka (not directly to the TSDB) so the write path is decoupled from storage throughput.
Components:
- Metrics Agent: A lightweight process running on every server (think Prometheus Node Exporter or Telegraf). Scrapes local OS and process metrics every 10 seconds. Batches all samples into one POST /metrics/batch call per interval.
- Ingest Service: Stateless HTTP servers. Validates the batch, normalizes tags, and publishes to Kafka. Acknowledges the agent as soon as the batch lands in Kafka.
- Kafka: Write-ahead buffer. Partitioned by host_id to preserve per-host ordering. 7-day retention for replay.
- TSDB Writer: Kafka consumer that writes samples to the time-series database in bulk.
This is the write path. The read path and query layer come next.
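One detail worth making explicit is how per-host ordering is preserved: keying Kafka records by host_id means every batch from one host lands on the same partition. A sketch of the idea (the partition count of 64 is an assumption; Kafka's default partitioner does the equivalent when the record key is host_id):

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count, sized for peak throughput

def partition_for(host_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of host_id -> partition, so all samples from one host
    land on the same partition and per-host ordering is preserved."""
    digest = hashlib.md5(host_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```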
2. Querying stored metrics
With 8.6 billion samples per day and 15 days of raw retention, the TSDB holds roughly 130 billion raw samples. A naive range query over 7 days touches roughly 60,000 samples per series (8,640 per day at 10-second resolution), and a single dashboard panel typically fans out across many series at once. The TSDB must answer this without a full scan.
Components added:
- Query Service: Accepts PromQL-style range queries. Resolves the series by label set (uses the TSDB's inverted index), fetches the time range from the correct retention tier, and returns downsampled results when the query spans more than 7 days.
- Query Cache (Redis): Caches query results for identical time ranges. TTL of 60 seconds for recent data, 10 minutes for historical. Cache key includes the expression, start, end, and step.
The three-tier retention model keeps storage costs linear with time. Raw data is expensive per sample but small in total duration. Downsampled tiers compress 6 raw samples into 1 (1-minute tier) or 360 raw samples into 1 (1-hour tier), reducing long-range storage by up to 360x.
I often see candidates propose a single retention tier with "we'll just add more disk." Run the math in the interview: 100,000 samples/second at 2 bytes/sample compressed is 17 GB/day. At 15 days that's 250 GB, manageable. At 3 years that's 18 TB of high-resolution data nobody queries at that granularity. Downsampling is not optional.
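The same math, executable (rounding differs slightly from the prose, which rounds down):

```python
# Storage math from the paragraph above.
samples_per_second = 100_000
bytes_per_sample = 2  # average after Gorilla + delta-of-delta compression

gb_per_day = samples_per_second * bytes_per_sample * 86_400 / 1e9
print(round(gb_per_day, 1))                    # 17.3 GB/day
print(round(gb_per_day * 15))                  # 259 GB for 15 days of raw
print(round(gb_per_day * 3 * 365 / 1000, 1))   # 18.9 TB for 3 years of raw
```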
3. Triggering alerts
The alert path must evaluate threshold conditions within 30 seconds of a metric crossing the threshold. This requires a separated Alert Evaluator that runs on an independent schedule from the query service.
Components added:
- Alert Evaluator: Runs every 10 seconds. For each active alert rule, executes the rule expression against the TSDB (last 2 scrape intervals), compares against the threshold, and checks if the condition has held for the required duration.
- Notification Service: Receives fired and resolved alert events from the Alert Evaluator and dispatches to PagerDuty, Slack, or email. Deduplicates to suppress flapping alerts (alert fires and resolves repeatedly).
- Alert State Store (Redis): Tracks the current state of each rule per host: pending (threshold crossed, duration not yet elapsed), firing (firing notification sent), or resolved.
Alert evaluation reads from the TSDB raw tier, not from the query cache. Caching alert evaluations would introduce up to 60 seconds of additional staleness on top of the 10-second scrape interval, making it impossible to meet the 30-second alert latency SLA.
Separating the Alert Evaluator from the dashboard Query Service matters for availability: a dashboard query spike (engineers investigating an incident) cannot starve the alert evaluation loop and suppress alerts during the incident. I'd call this out explicitly in an interview because it is the kind of non-obvious coupling that causes real outages: the dashboard investigation of an incident suppresses the alerts for the next incident.
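The pending/firing/resolved lifecycle from the Alert State Store can be sketched as a small state machine. This is illustrative only; the real evaluator runs this logic per rule per host against TSDB query results.

```python
import enum

class State(enum.Enum):
    RESOLVED = "resolved"
    PENDING = "pending"   # threshold crossed, duration not yet elapsed
    FIRING = "firing"

class RuleState:
    """Per-(rule, host) alert state. `duration_s` is how long the
    condition must hold before the alert fires."""

    def __init__(self, threshold: float, duration_s: int):
        self.threshold = threshold
        self.duration_s = duration_s
        self.state = State.RESOLVED
        self.crossed_at = None  # when the threshold was first crossed

    def evaluate(self, value: float, now: float) -> State:
        if value <= self.threshold:
            self.state, self.crossed_at = State.RESOLVED, None
        elif self.crossed_at is None:
            self.state, self.crossed_at = State.PENDING, now
        elif now - self.crossed_at >= self.duration_s:
            self.state = State.FIRING
        return self.state
```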
4. Downsampling for long-term retention
Raw data at 10-second resolution for more than 15 days is prohibitively expensive. A single metric series produces 8,640 samples per day. At 1,000,000 unique series (10,000 hosts times 100 metrics per host), that is 8.64 billion samples per day. Keeping 1 year of raw data would mean roughly 3.15 trillion samples, on the order of 6 TB even at 2 bytes per compressed sample.
Components added:
- Downsampler: A background job that runs hourly. Reads the last hour of raw data from the TSDB, computes min, max, avg, and p99 aggregates per 1-minute bucket and per 1-hour bucket, and writes those to the downsampled tiers.
The 1-minute tier reduces storage by 6x vs raw. The 1-hour tier reduces by 360x. A 3-year retention window at 1-hour resolution for 1,000,000 series is roughly 26 billion aggregate rows total, which a properly sized TSDB handles comfortably.
One subtle point: the downsampler must be idempotent. If it crashes halfway through processing an hour window and restarts, it re-processes the same window. Writing duplicate aggregate rows to the downsampled tier must produce the same result, not double-count. Use upsert semantics keyed on (series_id, bucket_timestamp).
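A minimal sketch of an idempotent downsampling pass, keyed exactly as the upsert key above. The p99 aggregate is omitted here because it needs the raw values per bucket rather than running aggregates; everything else is illustrative structure, not a real job.

```python
def downsample(raw_samples, bucket_s=60):
    """Compute per-bucket aggregates for one window of raw data.
    raw_samples: iterable of (series_id, timestamp, value).
    Returns {(series_id, bucket_ts): {"min", "max", "avg"}} -- keyed on
    (series_id, bucket_timestamp), so re-running the same window produces
    identical rows and an upsert overwrites rather than double-counts."""
    buckets = {}
    for series_id, ts, value in raw_samples:
        key = (series_id, ts - ts % bucket_s)  # floor to the bucket boundary
        agg = buckets.setdefault(key, {"min": value, "max": value, "sum": 0.0, "n": 0})
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
        agg["sum"] += value
        agg["n"] += 1
    return {k: {"min": a["min"], "max": a["max"], "avg": a["sum"] / a["n"]}
            for k, a in buckets.items()}
```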
Potential Deep Dives
1. Push vs pull: should agents push metrics or should a collector pull them?
This is the most debated design question in observability architecture. Prometheus uses pull; Graphite, InfluxDB, and most cloud services use push. Both have legitimate use cases and real failure modes.
2. How do you handle data ingestion at scale when thousands of agents are emitting metrics simultaneously?
The thundering herd problem: a fleet restart causes all agents to resume at the same time, producing a write spike that can be 3-5x the sustained rate.
3. What database do you use for time-series data?
This is the database selection question that decides the entire system's storage cost and query performance.
Final Architecture
The hybrid pull-within-cluster, push-to-central model solves both the network access problem (scrapers only need access to servers in their own cluster) and the single-fleet-wide-scraper bottleneck (each cluster has its own local Prometheus). The TSDB's Gorilla float compression and delta-of-delta timestamp encoding are what make 100,000 samples/second tractable: at 2 bytes/sample average, the sustained write rate is 200 KB/second of compressed data, versus roughly 1.6 MB/second for naive storage (an 8-byte timestamp plus an 8-byte float64 value per sample).
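A toy version of delta-of-delta timestamp encoding shows why regular scrape intervals compress so well: after the first two timestamps, each subsequent one is encoded as the change in the delta, which is almost always zero for a 10-second scrape loop. This is a sketch of the idea, not the bit-packed Gorilla format.

```python
def delta_of_delta(timestamps):
    """Encode a timestamp list as [first, first delta, delta-of-deltas...].
    Perfectly regular intervals encode as a run of zeros, which the real
    format then packs into single bits."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # 0 when the interval is perfectly regular
        prev_delta = delta
    return out
```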
Interview Cheat Sheet
- State the write volume early: 10,000 servers emitting 100 metrics every 10 seconds is 100,000 samples/second sustained. Every architecture choice flows from this throughput requirement.
- The push vs pull question comes up immediately. The production answer is: pull within each cluster (Prometheus scrapes /metrics endpoints), push across clusters (remote write to a central aggregator like Thanos or Mimir). Never pick one exclusively.
- Agents must buffer locally for 5 minutes in case the collector is temporarily unreachable. Without local buffering, a 2-minute collector restart creates a 2-minute gap in every metric series.
- Use a purpose-built TSDB (Prometheus TSDB, InfluxDB, VictoriaMetrics) not a row-store. Gorilla float compression and delta-of-delta timestamp encoding reduce storage by 10-40x versus raw float64.
- At 2 bytes/sample average compressed, 100,000 samples/second is only 200 KB/second of writes to the TSDB. The challenge is not raw throughput; it is the head chunk memory footprint at high cardinality (100,000 series = 50 MB in memory, manageable; 10 million series = 5 GB, a problem).
- Three retention tiers are standard: 15-day raw (10s resolution), 1-year downsampled (1-min min/max/avg/p99), 3-year archive (1-hour min/max/avg/p99). The downsampled tiers reduce storage by 6x and 360x vs raw respectively.
- For the alert path, evaluate rules against the local Prometheus instance (which has fresh data within 10 seconds) rather than the central TSDB (which has 5-10 second remote write lag plus storage lag). This is the only way to reliably meet a 30-second alert SLA.
- Alert deduplication (flap suppression) belongs in the Notification Service, not the Alert Evaluator. The evaluator updates state; the notification service decides whether to actually page someone based on a 5-minute dedup window.
- The thundering herd problem (fleet restart flushes all agents simultaneously) is solved with two mechanisms: jitter on the agent scrape interval (random 0-10 second offset) and a rate-limited TSDB writer that caps ingest at 1.5x the sustained rate.
- Kafka between the ingest service and the TSDB writer is the right buffer for absorbing fleet restart spikes. 7 days of retention gives enough replay window for TSDB outages.
- Cardinality explosion (labels with millions of unique values like request IDs) will kill the TSDB inverted index. Enforce a per-label cardinality cap at the ingest layer, well before data reaches the TSDB.
- Downsampling prevents disk exhaustion: a background Downsampler runs hourly, computes min/max/avg/p99 per 1-minute and 1-hour bucket from raw data, writes those to separate retention tiers, and lets the TSDB's TTL expire raw data after 15 days. Three tiers (raw/1-min/1-hour) give 3 years of history at 360x less storage than keeping everything raw.
- Reliable, low-latency alerting requires evaluating rules against locally fresh data (local Prometheus, sub-10-second staleness), using a 5-minute dedup window in the Notification Service to suppress flapping, and routing alert evaluation through a separate read path that cannot be starved by dashboard query spikes.