Health Monitor
Walk through a complete cluster health monitoring design, from a single polling loop to a distributed system that tracks 10,000 nodes, fires sub-90-second alerts, and auto-remediates failures without waking an on-call engineer.
What is a cluster health monitoring system?
A cluster health monitoring system periodically checks whether every service and machine in your infrastructure is alive and healthy, then surfaces that status to engineers and automated systems. The apparent simplicity hides three hard problems: collecting data from thousands of nodes without the collector itself becoming a bottleneck, deciding when "unhealthy" means "alert someone" versus "this is a transient blip," and getting the system to self-heal before anyone is paged.
It tests time-series storage, distributed polling at scale, alert pipeline design, and the tension between false-positive alert fatigue and missed outages.
Functional Requirements
Core Requirements
- Periodically check the health of every service and machine in the cluster.
- Surface the current status of each component via an API and a dashboard.
- Fire an alert when a health check fails for a configurable number of consecutive intervals.
- Support automated remediation actions (restart a process, remove a node from load balancer rotation).
Below the Line (out of scope)
- Log aggregation (covered in Design a Distributed Logging System).
- Full metrics pipeline: CPU graphs over time, percentile reporting, SLO burn rates (covered in Design a Metrics Collection System).
- Multi-tenant monitoring of external customer clusters.
The hardest part in scope: reliable alert delivery with zero false negatives. An alert that silently fails to fire is worse than no monitoring at all, because it creates false confidence. We will dedicate a full deep dive to the alert pipeline: deduplication, flap detection, and escalation policy.
Log aggregation is below the line because it is a separate write-heavy pipeline with completely different storage semantics. To add it, I would stream log lines to a Kafka topic and feed them into an Elasticsearch cluster, sitting beside the health check path, not inside it.
The full metrics pipeline is below the line because this article focuses on binary health checks (up/down) and threshold-based alerting rather than time-series analytics and SLO math. To add it, I would wire the same health check agents into a Prometheus scrape endpoint and let a separate metrics server handle aggregation and percentiles.
Multi-tenant monitoring is below the line because it introduces auth isolation, billing, and noisy-neighbor concerns that are separate from the core design.
Non-Functional Requirements
Core Requirements
- Scale: 10,000 nodes in the cluster. Each node checked every 30 seconds. Peak ingest rate: approximately 333 check results per second.
- Alert latency: An alert fires within 90 seconds of a sustained failure being detected (3 consecutive failed checks at a 30-second interval).
- Availability: 99.9% uptime for the monitoring system itself. The monitoring system must survive the failures it is designed to detect.
- Data retention: Raw check results retained for 30 days. Hourly status aggregates retained for 1 year.
- Read latency: Dashboard current-status queries return in under 200ms. Historical status queries return in under 2 seconds.
Below the Line
- Sub-second alert latency (acceptable to trade for simpler architecture)
- Custom integrations beyond PagerDuty/Slack/webhook
Write-to-read ratio: This system writes 333 check results per second and receives far more reads from dashboards, alert queries, and history lookups. The dominant design constraint here is not read throughput but write correctness. Missing one alert when a node is genuinely down is far more damaging than a brief dashboard staleness.
The 99.9% availability target for the monitoring system is intentionally lower than the 99.99% target you would set for a user-facing service. A monitoring system that goes down briefly is bad; a user-facing outage caused by over-engineering the monitoring system to achieve five-nines is worse. I default to simpler replication strategies here and accept the tradeoff.
Core Entities
- Node: A single machine or service instance being monitored. Carries a node ID, hostname, cluster region, service type, and registration timestamp.
- HealthCheck: One check result for one node at one point in time. Carries the node ID, check timestamp, status (healthy/unhealthy/unknown), HTTP response code, latency in milliseconds, and an optional error message.
- Alert: A record of a sustained failure and its notification state. Carries the node ID, onset time, resolved time, severity, and the recipient list that was notified.
- RemediationAction: A record of an automated action taken in response to an alert. Carries the alert ID, action type (restart/drain), execution time, and outcome.
The full schema is deferred to the data model deep dive. These entities are sufficient to drive the API and High-Level Design.
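The four entities above can be sketched as plain data classes. This is only an illustration of the shapes, not the final schema; field names follow the descriptions above:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class Status(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"

@dataclass
class Node:
    node_id: str
    hostname: str
    region: str
    service_type: str
    registered_at: datetime

@dataclass
class HealthCheck:
    node_id: str
    checked_at: datetime
    status: Status
    http_code: Optional[int]      # None when the check timed out
    latency_ms: float
    error: Optional[str] = None   # optional error message
```

Alert and RemediationAction follow the same pattern and are elaborated in the data model deep dive.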
API Design
FR 1 and FR 2 -- check health and surface status:
GET /nodes/{node_id}/status
Response: { node_id, status, last_check_at, consecutive_failures, latency_ms }
A GET because this is a pure read. The consecutive_failures field does work in the response: the dashboard can show "3 consecutive failures" alongside "UNHEALTHY" instead of the raw binary, which is more useful to an on-call engineer.
GET /nodes?region=us-east-1&status=unhealthy
Response: { nodes: [...], next_cursor: "..." }
Cursor-based pagination because the caller is a dashboard rendering potentially thousands of nodes. Filtering by status and region up front means the dashboard loads the degraded nodes first, which is what operators care about in an incident.
FR 3 -- list and acknowledge alerts:
GET /alerts?status=firing&severity=critical
Response: { alerts: [...], next_cursor: "..." }
PATCH /alerts/{alert_id}
Body: { status: "acknowledged", acknowledged_by: "user@example.com" }
Response: { alert_id, status, acknowledged_by, acknowledged_at }
PATCH (not PUT) because we are updating a subset of the alert fields. The acknowledgement endpoint exists to close the loop: once an on-call engineer has seen the alert, the system stops re-notifying.
FR 4 -- trigger remediation:
POST /nodes/{node_id}/remediate
Body: { action: "restart" | "drain", initiated_by: "system" | "user@example.com" }
Response: { action_id, status: "queued", estimated_completion_ms: 5000 }
POST because remediation is an action, not a resource creation or update. The initiated_by field distinguishes automated remediation from human-triggered remediation in the audit log, which matters for post-incident review.
I always make acknowledgement an explicit API action rather than a UI-only toggle; the acknowledged_by field in the response creates an audit trail that post-incident reviews depend on.
Authentication is out of scope, but in production every write endpoint (PATCH /alerts, POST /nodes/{id}/remediate) would require an auth token. Remediation actions especially need scoped roles: a service account can restart a process; only a senior on-call engineer can drain a node from the load balancer.
High-Level Design
1. Periodically check the health of every node
The simplest health check system is a single polling loop: one service walks the list of all registered nodes and sends an HTTP GET to each node's /healthz endpoint every 30 seconds.
Check taxonomy (what we actually measure):
- Liveness: HTTP 200 from /healthz within the 5-second timeout. The binary up/down signal that drives alerts.
- Response latency: The round-trip time for the /healthz request, stored as latency_ms per check. A node can return 200 but be degraded if latency spikes.
- Saturation metrics (optional, agent-reported): CPU%, memory%, disk%. Stored alongside the check result in the Status Store but scoped out of the core alert path per the functional requirements. Useful for capacity planning dashboards.
Components:
- Node Registry: A database table listing every node, its address, and the check interval. The Health Checker queries this to know who to poll.
- Health Checker: The polling service. For each node, it opens an HTTP connection, sends GET /healthz, waits for a 200 response, and writes the result to the Status Store.
- Status Store: A time-series-friendly database that holds every check result.
Request walkthrough:
- Health Checker queries Node Registry: "give me all nodes due for a check."
- For each node, Health Checker sends GET /healthz with a 5-second timeout.
- Node responds with HTTP 200 (healthy) or connection timeout (unhealthy).
- Health Checker writes { node_id, timestamp, status, latency_ms } to Status Store.
- Status Store acknowledges the write.
This covers the basic write path. The read path (surfacing status to operators) and the alert logic come in the next two requirements.
The issue at 10,000 nodes: A single Health Checker cannot sustain 10,000 concurrent HTTP connections. At a 30-second interval, the checker must initiate approximately 333 new connections every second. A single process with a naive sequential loop would take 10,000 x (average latency 20ms) = 200 seconds per cycle, meaning it is already 170 seconds behind before the second cycle starts.
The fix is to parallelize the checker using a thread pool or async I/O. Even with async I/O, a single process has limits on open file descriptors (typically 65,536 on Linux) and cannot survive its own failure; a distributed checker architecture, covered in the deep dives, is the right answer.
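The parallelized checker can be sketched with asyncio. This is a minimal illustration, not the production design: the network probe is simulated with a sleep (a real checker would issue GET /healthz via an HTTP client such as aiohttp or httpx), and the concurrency cap and names are assumptions:

```python
import asyncio
import time

CHECK_TIMEOUT_S = 5.0
MAX_IN_FLIGHT = 500   # cap concurrent probes to stay under fd limits (assumed value)

async def probe(node_id: str) -> dict:
    """Stand-in for GET /healthz: simulates a ~20ms round trip."""
    start = time.monotonic()
    await asyncio.sleep(0.02)
    latency_ms = (time.monotonic() - start) * 1000
    return {"node_id": node_id, "status": "healthy", "latency_ms": latency_ms}

async def run_cycle(node_ids):
    """Probe every node concurrently, bounded by a semaphore,
    marking any probe that exceeds the timeout as unhealthy."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def guarded(nid):
        async with sem:
            try:
                return await asyncio.wait_for(probe(nid), CHECK_TIMEOUT_S)
            except asyncio.TimeoutError:
                return {"node_id": nid, "status": "unhealthy", "latency_ms": None}

    return await asyncio.gather(*(guarded(n) for n in node_ids))

results = asyncio.run(run_cycle([f"node-{i}" for i in range(1000)]))
```

Even this sketch makes the scaling point concrete: 1,000 simulated 20ms probes complete in well under a second rather than the 20 seconds a sequential loop would take, but a single process still carries the fd-limit and single-point-of-failure problems the deep dive addresses.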
2. Surface the current status of each component
The Node Registry + Status Store from requirement 1 is already enough to answer "is node X healthy right now?" -- but querying the time-series store on every dashboard load would be slow and expensive. The evolved design adds a Status Cache that holds the latest status for every node in memory.
New components:
- Status Cache (Redis): An in-memory hash table keyed by node_id. Value is the latest { status, last_check_at, consecutive_failures }. The Health Checker updates this on every write, not just on status changes.
- Query API: A thin read service that serves /nodes and /nodes/{id}/status from the Status Cache. Falls back to Status Store on a cache miss.
Request walkthrough (dashboard query):
- Dashboard calls GET /nodes?region=us-east-1&status=unhealthy.
- Query API scans the Redis key space for nodes matching the filter.
- Redis returns matching node statuses directly from memory (under 5ms).
- Query API returns the paginated response; dashboard renders the list.
The Query API never writes. All writes flow through the Health Checker, which updates the cache synchronously and the Status Store asynchronously. This means the dashboard always has a view that is at most one check interval (30 seconds) stale.
3. Fire an alert when a health check fails for consecutive intervals
The naive approach is: "if the last check result is unhealthy, send a PagerDuty alert." This fails the moment you test it. A transient network blip, a node restarting after a deploy, or a slow /healthz endpoint that times out once all fire false positive alerts and destroy on-call trust in the system.
The correct design uses an Alert Manager that consumes check results from the Status Cache and applies a configurable consecutive-failure threshold before firing.
New components:
- Alert Manager: A stateful service that reads the consecutive_failures counter from the Status Cache. When the counter reaches the configured threshold (default: 3), it creates an Alert record and pushes the notification.
- Alert Store: A relational database table holding Alert records (onset time, severity, resolved time, acknowledged status). Relational because alert queries are structured ("show me all critical alerts for us-east-1 in the last 24 hours"), not time-series analytics.
- Notification Dispatcher: Sends the alert payload to PagerDuty, Slack webhook, or email. Decoupled from the Alert Manager so notification failures do not block alert creation.
Request walkthrough (alert fires):
- Health Checker writes check result; consecutive_failures increments to 3 in the Status Cache.
- Alert Manager is watching the Status Cache stream (via Redis keyspace notifications or a separate polling loop).
- Alert Manager reads: consecutive_failures == 3, threshold is 3. Condition met.
- Alert Manager inserts an Alert record into the Alert Store.
- Alert Manager pushes notification payload to Notification Dispatcher queue.
- Notification Dispatcher sends PagerDuty event. Records delivery status.
The Notification Dispatcher is fire-and-forget but durable: notifications are written to a queue before dispatch, so a temporary PagerDuty outage does not lose an alert. The Alert Manager does both alert creation and deduplication -- it checks whether an open Alert record already exists for this node before creating a new one.
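The threshold-plus-dedup logic in the Alert Manager fits in a few lines. A sketch with an in-memory dict standing in for the "open alert exists?" query against the Alert Store (names are illustrative):

```python
FAILURE_THRESHOLD = 3
open_alerts = {}  # node_id -> open alert record; stands in for an Alert Store query

def evaluate(node_id: str, consecutive_failures: int, now: float):
    """Fire at the threshold, dedupe while an alert is already open,
    and auto-resolve when the node recovers."""
    if consecutive_failures >= FAILURE_THRESHOLD:
        if node_id in open_alerts:
            return None  # deduplicated: this outage already alerted
        alert = {"node_id": node_id, "onset": now, "severity": "critical"}
        open_alerts[node_id] = alert
        return alert  # would be enqueued for the Notification Dispatcher
    if consecutive_failures == 0 and node_id in open_alerts:
        open_alerts.pop(node_id)["resolved"] = now  # recovery closes the alert
    return None
```

Note that the dedup check and the auto-resolve path are two sides of the same invariant: at most one open alert per node, keyed on node_id.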
4. Support automated remediation actions
Automated remediation is the highest-leverage but highest-risk capability in this system. Restarting the wrong process during a cascading failure can make things worse. The design must be conservative by default.
New components:
- Remediation Engine: Triggered by the Alert Manager when an alert reaches a configurable severity. Looks up the Runbook associated with the alert's service type and executes the first safe action (restart before drain).
- Runbook Store: A configuration store mapping service types to ordered remediation actions. Version-controlled; changes require approval.
- Action Queue: A durable queue (backed by Postgres or a dedicated job table) that serialises remediation actions to prevent concurrent conflicting actions on the same node.
Request walkthrough (auto-restart):
- Alert Manager fires a severity:warning alert for node X (service type: api-server).
- Alert Manager checks remediation policy: "api-server: retry=3, then restart."
- Alert Manager enqueues a restart action for node X in the Action Queue.
- Remediation Engine dequeues the action, checks that the node still shows consecutive_failures >= 3 (guard against stale queue entries), and sends a restart command to the node's management API.
- Remediation Engine writes a RemediationAction record: action type, timestamp, outcome.
- On the next health check cycle, Health Checker sends GET /healthz; node responds 200 OK; consecutive_failures resets to 0; alert auto-resolves.
I always recommend adding a "circuit breaker" on the Remediation Engine: if it has restarted the same node more than 3 times in 10 minutes without recovery, it stops automatic restarts and escalates to a human. Blind auto-restarts during a dependency outage will just restart an otherwise healthy service back into the same broken state it was in before.
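The circuit breaker is a sliding-window counter per node. A minimal sketch (in a real system the restart log would live in durable storage, not process memory):

```python
from collections import defaultdict, deque

WINDOW_S = 600     # 10-minute window
MAX_RESTARTS = 3

restart_log = defaultdict(deque)  # node_id -> timestamps of recent restarts

def allow_restart(node_id: str, now: float) -> bool:
    """Circuit breaker: refuse a 4th restart within the window; the
    caller would then escalate to a human instead of remediating."""
    log = restart_log[node_id]
    while log and now - log[0] > WINDOW_S:
        log.popleft()  # drop restarts that have aged out of the window
    if len(log) >= MAX_RESTARTS:
        return False   # breaker open: stop auto-remediating
    log.append(now)
    return True
```

Because old entries age out, the breaker closes again on its own once the node has been quiet for a window, so a later genuine failure can still be auto-remediated.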
Potential Deep Dives
1. How do we collect health data from 10,000 nodes?
There are two fundamentally different models: the monitoring system pulls data from each node on a schedule (pull model), or each node pushes its own heartbeat to the monitoring system (push model). The choice has major implications for resilience, network topology, and what kinds of failures you can actually detect.
We need to handle:
- 10,000 nodes checked every 30 seconds (333 checks/second sustained).
- A single checker process failing silently without dropping nodes from rotation.
- Cross-region deployments where the monitoring system might have high latency to some nodes.
2. How do we store health check results?
Health monitoring generates two classes of queries that pull in opposite directions. "Is node X healthy right now?" (point-in-time, low latency, always answered from cache) versus "Show me when node X was unhealthy over the past 7 days" (range scan, can tolerate seconds latency, needs efficient time-range access). The storage choice needs to serve both.
Constraints:
- 333 check results per second sustained write rate.
- Records retained for 30 days at raw granularity, 1 year at hourly aggregates.
- Raw data volume: 333 writes/sec x 86,400 sec/day x 30 days = approximately 863 million rows.
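The volume arithmetic above is worth checking, and extending to bytes. A quick back-of-envelope, assuming roughly 100 bytes per raw row (node ID, timestamp, status, latency, optional error text); the per-row size is my assumption, not from the requirements:

```python
writes_per_sec = 333
seconds_per_day = 86_400
retention_days = 30

raw_rows = writes_per_sec * seconds_per_day * retention_days
# Matches the ~863 million figure quoted in the constraints:
assert raw_rows == 863_136_000

# ~100 bytes/row is an assumed average, not a measured value:
raw_gb = raw_rows * 100 / 1e9
print(f"{raw_rows:,} rows ≈ {raw_gb:.0f} GB before compression")
```

Under that assumption the raw window is on the order of 85-90 GB before compression, which is comfortably within a single TimescaleDB node and explains why the storage deep dive is about query patterns and downsampling rather than raw capacity.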
3. How do we build a reliable alert pipeline?
The entire value of a monitoring system collapses the moment engineers stop trusting its alerts. Alert fatigue from false positives is the most common reason monitoring systems get ignored -- and an ignored monitoring system is worse than no monitoring system. The alert pipeline must get three things right: filtering transient failures (flap detection), suppressing duplicate notifications (deduplication), and escalating on silence (escalation policy).
Constraints:
- Alert must fire within 90 seconds of a sustained 3-consecutive-failure condition.
- No duplicate alerts for the same ongoing outage.
- If a notified engineer does not acknowledge within 15 minutes, escalate to their backup.
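The escalation constraint can be expressed as a pure function over elapsed time since the first notification. A sketch with an illustrative three-level on-call chain (the addresses and chain length are assumptions; a real system would schedule a durable timer job and cancel it on acknowledgement):

```python
from typing import Optional

ESCALATE_AFTER_S = 15 * 60  # escalate every 15 unacknowledged minutes
oncall_chain = ["primary@example.com", "backup@example.com", "manager@example.com"]

def escalation_target(notified_at: float, acked: bool, now: float) -> Optional[str]:
    """Who should be paged now? Walks one level up the chain for each
    elapsed 15-minute window without acknowledgement, capped at the top."""
    if acked:
        return None  # acknowledgement cancels the escalation timer
    level = min(int((now - notified_at) // ESCALATE_AFTER_S), len(oncall_chain) - 1)
    return oncall_chain[level]
```

Keeping the target a pure function of (notified_at, acked, now) makes the escalation policy trivially testable, which matters for a path that only executes during real incidents.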
Final Architecture
The core architectural insight is the separation of the current-status read path from the historical read path. Redis serves every dashboard query at under 5ms, while TimescaleDB handles historical range scans with automated downsampling. The Alert Manager is an explicit state machine, not a polling loop, which makes deduplication and flap detection trivial to reason about and test.
Interview Cheat Sheet
- State the scale early: 10,000 nodes, 30-second intervals, 333 checks per second. This number is small enough that a modest infrastructure handles it.
- A single Health Checker process is a SPOF. Use a partitioned pool with a Coordinator that detects checker failures via Redis TTLs and rebalances via consistent hashing.
- Pull is more reliable than push for detecting network partitions. A push heartbeat stops arriving whether the node is dead or the network is severed. Active polling from multiple regional checkers lets you correlate: if only one region's checker cannot reach a node, it is a network partition; if no checker can reach it, the node is down.
- Separate the current-status read path (Redis, under 5ms) from the historical read path (TimescaleDB, seconds tolerance). Never query the time-series DB on a dashboard page load.
- TimescaleDB continuous aggregates handle the 1-year retention requirement automatically. Raw data drops after 30 days; hourly aggregates persist at a fraction of the storage cost.
- Alert state machine: HEALTHY, PENDING, FIRING, RESOLVED. A node in FIRING state does not re-notify. This is the entire deduplication mechanism.
- Fire an alert only after N consecutive failures (default 3). A single failure at a 30-second interval is not an incident; it is a deployment.
- Flap detection: if a node transitions more than 3 times in 5 minutes, downgrade the alert to warning and note the flap. Do not page on a node that cannot make up its mind.
- Group alerts: if more than 20 nodes enter FIRING within 60 seconds, fire one blast-radius alert. Individual node alerts during a mass failure are noise.
- Escalation policy: if no acknowledgement within 15 minutes, escalate to the backup on-call. Recurse up the chain. Store the escalation job ID in Redis so it can be cancelled on acknowledgement.
- Add a dead man's switch: the Alert Manager sends a synthetic heartbeat to PagerDuty Heartbeat Monitoring (or a standalone Dead Man's Snitch service) every 5 minutes. If it stops, the heartbeat monitor fires an alert for the monitoring system itself.
- Automated remediation needs a circuit breaker: if a node has been restarted 3 times in 10 minutes without recovery, stop auto-remediating and escalate to human. Blind restarts during a dependency outage make things worse.
- Alerts and node health data belong in different stores. Alerts are relational (Postgres). Raw check results are time-series (TimescaleDB). Current status is in-memory (Redis). Match the store to the query pattern.