Metrics Collector
Design a pull-based metrics collection pipeline that monitors thousands of servers in real time, aggregates time-series data efficiently, and triggers alerts without losing data during spikes.
What is a server metrics collection system?
A server metrics collection system scrapes CPU, memory, disk, and network statistics from every machine in a fleet, stores those measurements as time-series data, and lets engineers query and alert on them in near real time. The interesting engineering challenge is not the scraping; it is handling the write volume at scale (thousands of servers emitting metrics every 10 seconds means hundreds of thousands of data points per minute), choosing a storage engine that evaluates range queries over months of data in under a second, and deciding whether each server should push metrics out or wait to be pulled.
I'd open with the write volume math in an interview because it changes every downstream choice. This question tests write-heavy system design, time-series database internals, alerting pipeline architecture, and the push vs pull debate that comes up in every observability interview.
Functional Requirements
Core Requirements
- Collect CPU, memory, disk, and network metrics from thousands of servers every 10 to 60 seconds.
- Store time-series data with enough resolution to detect and alert on short-duration spikes.
- Support dashboard queries returning aggregate metrics over arbitrary time ranges (last 5 minutes to last 6 months).
- Trigger alerts when a metric crosses a configurable threshold for a sustained duration.
Below the Line (out of scope)
- Log collection and log-based alerting (covered in Design a Distributed Logging System)
- Distributed tracing and service-level APM
- ML-based anomaly detection on metric streams
- Multi-tenant metric isolation with per-customer encryption
The hardest part in scope: Storing time-series data efficiently. Naive row-per-sample storage in a relational database quickly reaches tens of billions of rows. The correct answer is a purpose-built time-series database (TSDB) that compresses samples using delta-of-delta encoding and gorilla float compression, enabling 10-40x storage reduction while keeping range queries under 100ms.
Log collection is below the line because it requires separate indexing infrastructure (inverted index for full-text search) that is architecturally distinct from the append-only numeric time-series store described here. To add it, build a Log Ingestion Service that writes to Elasticsearch or a ClickHouse full-text index in parallel with the metrics pipeline.
Distributed tracing is below the line because it requires a causal graph store keyed by trace ID, not a time-series store. To add it, run an OpenTelemetry Collector that fans spans to a Jaeger backend (Cassandra or Elasticsearch) alongside the metrics pipeline.
ML-based anomaly detection is below the line because it is a consumer of the metrics pipeline, not a change to the pipeline itself. To add it, subscribe a streaming job (Flink) to the metrics ingestion topic and emit anomaly events to an alert bus when a model scores a metric as anomalous.
Non-Functional Requirements
Core Requirements
- Write throughput: 10,000 servers, each emitting 100 metrics every 10 seconds, produces 100,000 metric samples per second sustained. Peak (all servers flushing simultaneously) can reach 300,000 samples/second.
- Query latency: Dashboard range queries must return in under 1 second at p95 for time ranges up to 7 days. Longer ranges (up to 6 months) must complete in under 5 seconds.
- Retention: 15 days of raw data (10-second resolution). 1 year of downsampled data (1-minute resolution). 3 years of heavily downsampled data (1-hour resolution).
- Alert latency: Threshold alerts must fire within 30 seconds of the metric crossing the threshold.
- Durability: 99.9% of metric samples must be stored. Brief gaps during a collector restart are acceptable; sustained loss is not.
- Availability: 99.9% uptime for the query and alert path. The collection path can have momentary gaps during deployments.
Below the Line
- Sub-100ms alert latency (requires streaming evaluation rather than batch rule evaluation)
- Per-metric access control (who can query which server's metrics)
- Cardinality explosion protection at ingestion time
Read/write ratio: This system is heavily write-skewed. 100,000 samples/second sustained means roughly 8.6 billion samples per day on the write side. Dashboard queries are sporadic bursts from on-call engineers; alert evaluations happen every 10 seconds per rule but touch only aggregated data. The write-to-interactive-read ratio is roughly 1,000:1. Every major design decision in this article is driven by that number: make writes cheap, make reads pre-computed wherever possible.
I treat the 1-second query SLA on 7-day ranges as the forcing constraint for storage format. Any database that requires a full table scan to answer that query is architecturally wrong for this system.
Core Entities
- MetricSample: A single measurement: metric name, host ID, tags (region, service, environment), timestamp, and float64 value. The atomic unit of storage.
- MetricSeries: A unique combination of metric name + tag set. All samples for one series share the same labels. The unit of indexing in a TSDB.
- AlertRule: A threshold condition on a metric query: expression, evaluation interval, duration (how long the condition must hold before firing), and notification channel.
- AlertEvent: A fired or resolved alert instance tied to a rule, with the triggering value, timestamp, and affected host or service.
- Dashboard: A saved collection of metric queries rendered as time-series charts or gauges. Backed by the query service.
Schema details and TSDB internals are deferred to the deep dives. These five entities drive the API and High-Level Design.
API Design
Start with one endpoint per functional requirement, then evolve where the naive shape breaks.
FR 1 - Ingest metric samples:
The naive push shape:
POST /metrics
Body: { metric_name, host_id, tags, timestamp, value }
Response: HTTP 204
This breaks at 100,000 samples/second: one HTTP request per sample is absurd overhead. The evolved shape pushes batches from each server's local agent:
POST /metrics/batch
Body: {
host_id: "server-1234",
timestamp: 1743417600,
samples: [
{ name: "cpu.usage", tags: { core: "0" }, value: 72.4 },
{ name: "mem.used_bytes", tags: {}, value: 12884901888 }
]
}
Response: HTTP 204
Each batch covers one scrape interval (10 to 60 seconds of samples from one host). The agent keeps unsent batches in local memory for up to 5 minutes to handle transient collector downtime. I've seen production outages where a 2-minute collector restart caused permanent metric gaps because agents had no local buffer. Five minutes of local retention is the minimum I'd accept.
FR 2 - Query metrics:
GET /query_range
Query params:
expr=cpu.usage{host="server-1234",env="prod"}
start=2026-03-29T00:00:00Z
end=2026-03-29T06:00:00Z
step=60s
Response: {
data: [ { timestamp, value } ],
resolution: "1m",
downsampled: false
}
The step parameter determines the query resolution. The query service automatically routes to raw data or the downsampled tier based on the requested time range.
FR 3 - Manage alert rules:
POST /alerts/rules
Body: { name, expr, threshold, duration_seconds, channel_id }
Response: { rule_id }
GET /alerts/events?rule_id=&start=&end=
Response: { events: [ { rule_id, fired_at, resolved_at, value, host } ] }
Alert rules reference the same query expression syntax as dashboard queries. The Alert Evaluator runs each rule expression on a schedule and compares the result against the threshold.
High-Level Design
1. Collecting metrics from servers
The naive approach: Each server emits one HTTP request per metric per interval directly to a central ingest server that writes samples to a relational database.
This fails immediately. 100 metrics per server times 10,000 servers at a 10-second interval means 100,000 HTTP requests per second. A relational database cannot sustain 100,000 row inserts per second without expensive partitioning, and the per-request overhead alone saturates a central ingest server.
The key insight is to batch at two levels: the agent batches all metrics from one host into one request per scrape interval, and the ingest service writes to Kafka (not directly to the TSDB) so the write path is decoupled from storage throughput.
Components:
- Metrics Agent: A lightweight process running on every server (think Prometheus Node Exporter or Telegraf). Scrapes local OS and process metrics every 10 seconds. Batches all samples into one
POST /metrics/batchcall per interval. - Ingest Service: Stateless HTTP servers. Validate the batch, normalize tags, and publish to Kafka. Acknowledges the agent as soon as the batch lands in Kafka.
- Kafka: Write-ahead buffer. Partitioned by
host_idto preserve per-host ordering. 7-day retention for replay. - TSDB Writer: Kafka consumer that writes samples to the time-series database in bulk.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.