Observability: metrics, logs, and traces
The three pillars of observability in distributed systems: what each signal type is for, how they complement each other, and how to instrument services for full system visibility.
TL;DR
- Observability is the ability to understand a system's internal state from its external outputs alone.
- The three pillars are metrics (aggregated numeric measurements), logs (discrete event records), and traces (request flows across services).
- Metrics answer "is something wrong?", logs answer "what happened?", traces answer "where is the slow part?".
- A well-instrumented system lets you move from alert to root cause without requiring a code change to add visibility.
- The goal is not to predict every possible failure; it's to have enough signal that any failure is diagnosable from what already exists.
The Problem It Solves
It's 2 AM. Your checkout service starts returning errors. The on-call engineer opens the dashboard: CPU is fine, memory is fine, disk is fine. Latency graphs show p50 is normal, but p99 spiked 10x. Is it one endpoint or all of them? Is it one region or both?
The engineer SSHs into a production box and starts grepping through unstructured log files. Forty minutes later, they find a clue: one downstream payment provider is timing out. But which requests hit that provider? How many users are affected? Is the problem getting worse or stabilizing?
This is what debugging looks like without observability. You have infrastructure metrics that say everything is fine, unstructured logs that require SSH access to even read, and zero ability to trace a single user request across service boundaries. The system is running, but you can't see inside it.
Monitoring tells you that something broke. Observability tells you why it broke, for which requests, and where in the call chain the failure originated. The gap between those two capabilities is the gap between a 90-minute outage and a 5-minute diagnosis.
What Is It?
Observability is the ability to understand a system's internal state purely from the data it emits: metrics, logs, and traces. You don't need to deploy new code, add a debug flag, or SSH into anything. If the system is observable, the answers are already in the data.
Think of it like a car dashboard versus a car with a transparent body. Monitoring is the dashboard: speed, fuel, engine temperature. You see predefined gauges. Observability is the transparent body: you can look at any component, follow any pipe, trace any wire. You don't have to know in advance which gauge you'll need because you can ask any question after the fact.
The three pillars of observability are:
- Metrics: aggregated numeric measurements over time. Cheap to store, fast to query, great for alerting. They answer "is something wrong?"
- Logs: discrete event records with rich context. Expensive at volume, but they carry the detail. They answer "what happened?"
- Traces: request-scoped timelines across services. They answer "where is the slow part?"
The magic is in the correlation. Each pillar alone is useful. Together, connected by a shared request ID, they let you go from "something is wrong" to "here is exactly what happened to this specific request" in minutes instead of hours.
For your interview: define observability as "understanding internal state from external outputs" and immediately name the three pillars. That's the complete answer.
Observability is not monitoring with a fancier name
Monitoring checks known failure modes against predefined thresholds. Observability lets you investigate unknown failure modes after they happen. If you only have dashboards for things you predicted, you have monitoring. If you can ask arbitrary questions about any request without deploying new code, you have observability.
How It Works
Let's trace the investigation workflow when an alert fires. This is the core value proposition of observability: each pillar narrows the search space until you reach the root cause.
Step 1: Metrics surface the problem
An alert fires: checkout_service p99 latency > 2s. The engineer opens the metrics dashboard and sees latency elevated on the /checkout endpoint specifically, not sitewide. Error rate is unchanged, so requests are succeeding but slowly.
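To make the "p50 is normal, p99 spiked" pattern concrete, here is a minimal sketch that computes nearest-rank percentiles over a batch of latency samples and checks them against the 2s alert condition. The sample values and the names `percentile` and `P99_THRESHOLD_SECONDS` are assumptions for illustration; a real metrics backend would usually estimate percentiles from histogram buckets rather than raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in seconds: 97 fast requests and 3 slow ones.
latencies = [0.12] * 97 + [2.4, 2.6, 3.1]

P99_THRESHOLD_SECONDS = 2.0  # mirrors the alert condition in the text

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
alert_firing = p99 > P99_THRESHOLD_SECONDS

print(f"p50={p50}s p99={p99}s alert_firing={alert_firing}")
```

Only 3% of requests are slow here, so the median does not move at all while the tail crosses the threshold. That is why alerting on averages or p50 alone can miss this kind of incident.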
Step 2: Traces locate the slow component
The engineer opens the tracing UI and samples a few slow requests from the last 10 minutes. The trace waterfall shows:
[checkout-service: 2100ms]
├── [auth-service: 12ms]
├── [inventory-service: 8ms]
├── [payment-service: 2050ms] ← slow leg
│   ├── [card-validation: 2ms]
│   └── [stripe-api-call: 2040ms] ← root cause
└── [notification-service: async]
Within 30 seconds, the slow component is identified: stripe-api-call within payment-service. No guessing, no SSH, no cross-referencing timestamps.
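What the engineer does by eye on the waterfall can be sketched as a walk down the span tree, following the slowest child at each level. The `Span` class and `slowest_path` function below are hypothetical names for illustration, not the data model of any particular tracing system, and the async notification span is left out because it has no duration on the critical path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)


def slowest_path(span):
    """Follow the slowest child at each level; return span names from root to leaf."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda child: child.duration_ms)
        path.append(span.name)
    return path


# The trace from the waterfall above, with durations in milliseconds.
trace = Span("checkout-service", 2100, [
    Span("auth-service", 12),
    Span("inventory-service", 8),
    Span("payment-service", 2050, [
        Span("card-validation", 2),
        Span("stripe-api-call", 2040),
    ]),
])

print(" -> ".join(slowest_path(trace)))
```

This greedy walk assumes child spans run sequentially. With parallel fan-out, the slowest child is still a reasonable first place to look, but the critical path needs start and end timestamps to compute properly.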
Step 3: Logs provide context
The engineer filters logs by the trace_id from the slow trace and finds:
{
"timestamp": "2024-01-15T02:14:42.123Z",
"level": "warn",
"service": "payment-service",
"trace_id": "4bf92f3577b34da6",
"message": "stripe API timeout, retrying",
"attempt": 3,
"endpoint": "https://api.stripe.com/v1/charges",
"timeout_ms": 500
}
Root cause: Stripe's API is experiencing elevated latency, causing retries that compound into 2s+ checkout times. The fix is to increase the circuit breaker sensitivity on the Stripe client, not to scale up the checkout service.
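Filtering by trace_id only works if every log line is structured and carries the ID. Here is a minimal sketch using Python's standard `logging` module, under the assumption that the service writes one JSON object per line. `JsonFormatter`, `filter_by_trace`, the `fields` key, and the second trace ID are hypothetical names chosen for this example. Note that Python names the level "warning" rather than "warn".

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} is merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def filter_by_trace(lines, trace_id):
    """Return the parsed log entries that belong to one trace."""
    entries = (json.loads(line) for line in lines)
    return [entry for entry in entries if entry.get("trace_id") == trace_id]


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning(
    "stripe API timeout, retrying",
    extra={"fields": {"trace_id": "4bf92f3577b34da6", "attempt": 3, "timeout_ms": 500}},
)
logger.info(
    "charge succeeded",
    extra={"fields": {"trace_id": "00f067aa0ba902b7"}},  # hypothetical second request
)

matches = filter_by_trace(stream.getvalue().splitlines(), "4bf92f3577b34da6")
print(matches)
```

In a real service the trace ID would come from the incoming request context rather than being passed by hand on every call, but the principle is the same: the ID has to be a queryable field, not text buried in the message.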
I often see teams that only have metrics. They can tell something is slow, but they spend 30+ minutes manually correlating timestamps across service logs to find the slow component. Adding traces cuts that investigation time to under 5 minutes.
Metric types in detail
Understanding metric types matters because each one answers a different question:
| Type | What it measures | Example | Query pattern |
|---|---|---|---|
| Counter | Cumulative total (only goes up) | http_requests_total | Rate of change: rate(http_requests_total[5m]) |
| Gauge | Current point-in-time value | db_connections_active | Direct value or max/min over window |
| Histogram | Distribution of values in buckets | request_duration_seconds | Percentiles: p50, p95, p99 |
| Summary | Client-calculated percentiles | request_duration_summary | Pre-computed quantiles (less flexible) |
My recommendation: use counters for anything you want to rate (requests, errors, bytes). Use gauges for pool sizes and queue depths. Use histograms for latency and size distributions. Skip summaries unless you specifically need quantiles computed on the client, because pre-computed quantiles cannot be meaningfully aggregated across instances.
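To show how the first three types differ in practice, here is a toy sketch of a counter, a gauge, and a bucketed histogram. The `Counter`, `Gauge`, and `Histogram` classes and the `rate` helper are illustrative stand-ins, not the API of any real client library. The quantile estimate uses linear interpolation within a bucket, a common way to approximate percentiles from bucketed data.

```python
import bisect


class Counter:
    """Cumulative total that only goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current point-in-time value; can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets defined by their upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end for values above the highest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1

    def quantile(self, q):
        """Estimate a quantile by linear interpolation inside the matching bucket."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        target = q * total
        cumulative, lower = 0, 0.0
        for upper, count in zip(self.upper_bounds, self.counts):
            if count and cumulative + count >= target:
                return lower + (upper - lower) * (target - cumulative) / count
            cumulative += count
            lower = upper
        return self.upper_bounds[-1]  # target falls in the overflow bucket


def rate(earlier_total, later_total, window_seconds):
    """Per-second rate of change between two counter readings."""
    return (later_total - earlier_total) / window_seconds


requests_total = Counter()
before = requests_total.value
for _ in range(300):
    requests_total.inc()
requests_per_second = rate(before, requests_total.value, 60)  # assumed 60s window

db_connections_active = Gauge()
db_connections_active.set(17)

request_duration_seconds = Histogram([0.1, 0.5, 1.0, 2.5])
for latency in [0.05] * 50 + [0.3] * 40 + [2.0] * 10:
    request_duration_seconds.observe(latency)
estimated_p99 = request_duration_seconds.quantile(0.99)

print(requests_per_second, db_connections_active.value, round(estimated_p99, 2))
```

The slowest real observation here is 2.0s, but the estimate lands higher because the histogram only knows the value fell somewhere between 1.0s and 2.5s. Percentile accuracy depends on how well the bucket boundaries match the latency range you care about.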
The observability pipeline
Data doesn't magically appear in dashboards. Every signal flows through a pipeline: emission, collection, processing, storage, and query.