Health Monitor
Walk through a complete cluster health monitoring design, from a single polling loop to a distributed system that tracks 10,000 nodes, fires sub-90-second alerts, and auto-remediates failures without waking an on-call engineer.
What is a cluster health monitoring system?
A cluster health monitoring system periodically checks whether every service and machine in your infrastructure is alive and healthy, then surfaces that status to engineers and automated systems. The apparent simplicity hides three hard problems: collecting data from thousands of nodes without the collector itself becoming a bottleneck, deciding when "unhealthy" means "alert someone" versus "this is a transient blip," and getting the system to self-heal before anyone is paged.
It tests time-series storage, distributed polling at scale, alert pipeline design, and the tension between false-positive alert fatigue and missed outages.
Functional Requirements
Core Requirements
- Periodically check the health of every service and machine in the cluster.
- Surface the current status of each component via an API and a dashboard.
- Fire an alert when a health check fails for a configurable number of consecutive intervals.
- Support automated remediation actions (restart a process, remove a node from load balancer rotation).
Below the Line (out of scope)
- Log aggregation (covered in Design a Distributed Logging System).
- Full metrics pipeline: CPU graphs over time, percentile reporting, SLO burn rates (covered in Design a Metrics Collection System).
- Multi-tenant monitoring of external customer clusters.
The hardest part in scope: Reliable alert delivery with zero false negatives. A check that never fires is worse than no monitoring at all. We will dedicate a full deep dive to the alert pipeline: deduplication, flap detection, and escalation policy.
Log aggregation is below the line because it is a separate write-heavy pipeline with completely different storage semantics. To add it, I would stream log lines to a Kafka topic and feed them into an Elasticsearch cluster, sitting beside the health check path, not inside it.
The full metrics pipeline is below the line because this article focuses on binary health checks (up/down) and threshold-based alerting rather than time-series analytics and SLO math. To add it, I would wire the same health check agents into a Prometheus scrape endpoint and let a separate metrics server handle aggregation and percentiles.
Multi-tenant monitoring is below the line because it introduces auth isolation, billing, and noisy-neighbor concerns that are separate from the core design.
Non-Functional Requirements
Core Requirements
- Scale: 10,000 nodes in the cluster. Each node checked every 30 seconds. Peak ingest rate: approximately 333 check results per second.
- Alert latency: An alert fires within 90 seconds of a sustained failure being detected (3 consecutive failed checks at a 30-second interval).
- Availability: 99.9% uptime for the monitoring system itself. The monitoring system must survive the failures it is designed to detect.
- Data retention: Raw check results retained for 30 days. Hourly status aggregates retained for 1 year.
- Read latency: Dashboard current-status queries return in under 200ms. Historical status queries return in under 2 seconds.
Below the Line
- Sub-second alert latency (acceptable to trade for simpler architecture)
- Custom integrations beyond PagerDuty/Slack/webhook
Write-to-read ratio: This system writes 333 check results per second and receives far more reads from dashboards, alert queries, and history lookups. The dominant design constraint here is not read throughput but write correctness. Missing one alert when a node is genuinely down is far more damaging than a brief dashboard staleness.
The 99.9% availability target for the monitoring system is intentionally lower than the 99.99% target you would set for a user-facing service. A monitoring system that goes down briefly is bad; a user-facing outage caused by over-engineering the monitoring system to achieve five-nines is worse. I default to simpler replication strategies here and accept the tradeoff.
Core Entities
- Node: A single machine or service instance being monitored. Carries a node ID, hostname, cluster region, service type, and registration timestamp.
- HealthCheck: One check result for one node at one point in time. Carries the node ID, check timestamp, status (healthy/unhealthy/unknown), HTTP response code, latency in milliseconds, and an optional error message.
- Alert: A record of a sustained failure and its notification state. Carries the node ID, onset time, resolved time, severity, and the recipient list that was notified.
- RemediationAction: A record of an automated action taken in response to an alert. Carries the alert ID, action type (restart/drain), execution time, and outcome.
The full schema is deferred to the data model deep dive. These entities are sufficient to drive the API and High-Level Design.
API Design
FR 1 and FR 2 -- check health and surface status:
GET /nodes/{node_id}/status
Response: { node_id, status, last_check_at, consecutive_failures, latency_ms }
A GET because this is a pure read. The consecutive_failures field does work in the response: the dashboard can show "3 consecutive failures" alongside "UNHEALTHY" instead of the raw binary, which is more useful to an on-call engineer.
GET /nodes?region=us-east-1&status=unhealthy
Response: { nodes: [...], next_cursor: "..." }
Cursor-based pagination because the caller is a dashboard rendering potentially thousands of nodes. Filtering by status and region up front means the dashboard loads the degraded nodes first, which is what operators care about in an incident.
FR 3 -- list and acknowledge alerts:
GET /alerts?status=firing&severity=critical
Response: { alerts: [...], next_cursor: "..." }
PATCH /alerts/{alert_id}
Body: { status: "acknowledged", acknowledged_by: "user@example.com" }
Response: { alert_id, status, acknowledged_by, acknowledged_at }
PATCH (not PUT) because we are updating a subset of the alert fields. The acknowledgement endpoint exists to close the loop: once an on-call engineer has seen the alert, the system stops re-notifying.
FR 4 -- trigger remediation:
POST /nodes/{node_id}/remediate
Body: { action: "restart" | "drain", initiated_by: "system" | "user@example.com" }
Response: { action_id, status: "queued", estimated_completion_ms: 5000 }
POST because remediation is an action, not a resource creation or update. The initiated_by field distinguishes automated remediation from human-triggered remediation in the audit log, which matters for post-incident review.
I always make acknowledgement an explicit API action rather than a UI-only toggle; the acknowledged_by field in the response creates an audit trail that post-incident reviews depend on.
Authentication is out of scope, but in production every write endpoint (PATCH /alerts, POST /nodes/{id}/remediate) would require an auth token. Remediation actions especially need scoped roles: a service account can restart a process; only a senior on-call engineer can drain a node from the load balancer.
High-Level Design
1. Periodically check the health of every node
The simplest health check system is a single polling loop: one service walks the list of all registered nodes and sends an HTTP GET to each node's /healthz endpoint every 30 seconds.
Check taxonomy (what we actually measure):
- Liveness: HTTP 200 from
/healthzwithin the 5-second timeout. The binary up/down signal that drives alerts. - Response latency: The round-trip time for the
/healthzrequest, stored aslatency_msper check. A node can return 200 but be degraded if latency spikes. - Saturation metrics (optional, agent-reported): CPU%, memory%, disk%. Stored alongside the check result in the Status Store but scoped out of the core alert path per FR. Useful for capacity planning dashboards.
Components:
- Node Registry: A database table listing every node, its address, and the check interval. The Checker queries this to know who to poll.
- Health Checker: The polling service. For each node, it opens an HTTP connection, sends
GET /healthz, waits for a 200 response, and writes the result to the Status Store. - Status Store: A time-series-friendly database that holds every check result.
Request walkthrough:
- Health Checker queries Node Registry: "give me all nodes due for a check."
- For each node, Health Checker sends
GET /healthzwith a 5-second timeout. - Node responds with HTTP 200 (healthy) or connection timeout (unhealthy).
- Health Checker writes
{ node_id, timestamp, status, latency_ms }to Status Store. - Status Store acknowledges the write.
This covers the basic write path. The read path (surfacing status to operators) and the alert logic come in the next two requirements.
The issue at 10,000 nodes: A single Health Checker cannot sustain 10,000 concurrent HTTP connections. At a 30-second interval, the checker must initiate approximately 333 new connections every second. A single process with a naive sequential loop would take 10,000 x (average latency 20ms) = 200 seconds per cycle, meaning it is already 170 seconds behind before the second cycle starts.
The fix is to parallelize the checker using a thread pool or async I/O. Even with async I/O, a single process has limits on open file descriptors (typically 65,536 on Linux) and cannot survive its own failure; a distributed checker architecture, covered in the deep dives, is the right answer.
2. Surface the current status of each component
The Node Registry + Status Store from requirement 1 is already enough to answer "is node X healthy right now?" -- but querying the time-series store on every dashboard load would be slow and expensive. The evolved design adds a Status Cache that holds the latest status for every node in memory.
New components:
- Status Cache (Redis): An in-memory hash table keyed by
node_id. Value is the latest{ status, last_check_at, consecutive_failures }. The Health Checker updates this on every write, not just on status changes. - Query API: A thin read service that serves
/nodesand/nodes/{id}/statusfrom the Status Cache. Falls back to Status Store on a cache miss.
Request walkthrough (dashboard query):
- Dashboard calls
GET /nodes?region=us-east-1&status=unhealthy. - Query API scans the Redis key space for nodes matching the filter.
- Redis returns matching node statuses directly from memory (under 5ms).
- Query API returns the paginated response; dashboard renders the list.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.