Observability: metrics, logs, and traces
The three pillars of observability in distributed systems: what each signal type is for, how they complement each other, and how to instrument services for full system visibility.
TL;DR
- Observability is the ability to understand a system's internal state from its external outputs alone.
- The three pillars are metrics (aggregated numeric measurements), logs (discrete event records), and traces (request flows across services).
- Metrics answer "is something wrong?", logs answer "what happened?", traces answer "where is the slow part?".
- A well-instrumented system lets you move from alert → root cause without requiring a code change to add visibility.
- The goal is not to predict every possible failure; it's to have enough signal that any failure is diagnosable from what already exists.
The Problem It Solves
It's 2 AM. Your checkout service starts returning errors. The on-call engineer opens the dashboard: CPU is fine, memory is fine, disk is fine. Latency graphs show p50 is normal, but p99 spiked 10x. Is it one endpoint or all of them? Is it one region or both?
The engineer SSHs into a production box and starts grepping through unstructured log files. Forty minutes later, they find a clue: one downstream payment provider is timing out. But which requests hit that provider? How many users are affected? Is the problem getting worse or stabilizing?
This is what debugging looks like without observability. You have infrastructure metrics that say everything is fine, unstructured logs that require SSH access to even read, and zero ability to trace a single user request across service boundaries. The system is running, but you can't see inside it.
Monitoring tells you that something broke. Observability tells you why it broke, for which requests, and where in the call chain the failure originated. The gap between those two capabilities is the gap between a 90-minute outage and a 5-minute diagnosis.
What Is It?
Observability is the ability to understand a system's internal state purely from the data it emits: metrics, logs, and traces. You don't need to deploy new code, add a debug flag, or SSH into anything. If the system is observable, the answers are already in the data.
Think of it like a car dashboard versus a car with a transparent body. Monitoring is the dashboard: speed, fuel, engine temperature. You see predefined gauges. Observability is the transparent body: you can look at any component, follow any pipe, trace any wire. You don't have to know in advance which gauge you'll need because you can ask any question after the fact.
The three pillars of observability are:
- Metrics: aggregated numeric measurements over time. Cheap to store, fast to query, great for alerting. They answer "is something wrong?"
- Logs: discrete event records with rich context. Expensive at volume, but they carry the detail. They answer "what happened?"
- Traces: request-scoped timelines across services. They answer "where is the slow part?"
The magic is in the correlation. Each pillar alone is useful. Together, connected by a shared request ID, they let you go from "something is wrong" to "here is exactly what happened to this specific request" in minutes instead of hours.
For your interview: define observability as "understanding internal state from external outputs" and immediately name the three pillars. That's the complete answer.
Observability is not monitoring with a fancier name
Monitoring checks known failure modes against predefined thresholds. Observability lets you investigate unknown failure modes after they happen. If you only have dashboards for things you predicted, you have monitoring. If you can ask arbitrary questions about any request without deploying new code, you have observability.
How It Works
Let's trace the investigation workflow when an alert fires. This is the core value proposition of observability: each pillar narrows the search space until you reach the root cause.
Step 1: Metrics surface the problem
An alert fires: checkout_service p99 latency > 2s. The engineer opens the metrics dashboard and sees latency elevated on the /checkout endpoint specifically, not sitewide. Error rate is unchanged, so requests are succeeding but slowly.
Step 2: Traces locate the slow component
The engineer opens the tracing UI and samples a few slow requests from the last 10 minutes. The trace waterfall shows:
```
[checkout-service: 2100ms]
├── [auth-service: 12ms]
├── [inventory-service: 8ms]
├── [payment-service: 2050ms]        ← slow leg
│   ├── [card-validation: 2ms]
│   └── [stripe-api-call: 2040ms]    ← root cause
└── [notification-service: async]
```
Within 30 seconds, the slow component is identified: stripe-api-call within payment-service. No guessing, no SSH, no cross-referencing timestamps.
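Mechanically, that diagnosis is a walk down a tree of timed spans to the slowest leaf. A minimal sketch (the span representation is our own, not a real tracing API; durations mirror the waterfall above):

```python
# Minimal span tree: each span has a name, a duration in ms, and
# optional children. Following the slowest child at each level down
# to a leaf reproduces the "slow leg" diagnosis from a trace waterfall.

def slow_leg(span):
    """Return the path from the root span to the slowest leaf span."""
    path = [span["name"]]
    while span.get("children"):
        span = max(span["children"], key=lambda s: s["ms"])
        path.append(span["name"])
    return path

trace = {
    "name": "checkout-service", "ms": 2100, "children": [
        {"name": "auth-service", "ms": 12},
        {"name": "inventory-service", "ms": 8},
        {"name": "payment-service", "ms": 2050, "children": [
            {"name": "card-validation", "ms": 2},
            {"name": "stripe-api-call", "ms": 2040},
        ]},
    ],
}

print(slow_leg(trace))
# -> ['checkout-service', 'payment-service', 'stripe-api-call']
```

Real tracing UIs do exactly this visually: the widest bar at each level is the one worth expanding.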
Step 3: Logs provide context
The engineer filters logs by the trace_id from the slow trace and finds:
```json
{
  "timestamp": "2024-01-15T02:14:42.123Z",
  "level": "warn",
  "service": "payment-service",
  "trace_id": "4bf92f3577b34da6",
  "message": "stripe API timeout, retrying",
  "attempt": 3,
  "endpoint": "https://api.stripe.com/v1/charges",
  "timeout_ms": 500
}
```
Root cause: Stripe's API is experiencing elevated latency, causing retries that compound into 2s+ checkout times. The fix is to make the circuit breaker on the Stripe client trip sooner, not to scale up the checkout service.
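Emitting logs in that shape takes little more than a JSON formatter. A stdlib-only sketch (the service name and trace_id here are hardcoded placeholders; a real service would pull them from the active span context):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON line stamped with a trace_id,
    so logs can later be filtered by trace."""
    def __init__(self, service, trace_id):
        super().__init__()
        self.service = service
        self.trace_id = trace_id  # placeholder: would come from span context

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "service": self.service,
            "trace_id": self.trace_id,
            "message": record.getMessage(),
        })

logger = logging.getLogger("payment")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter("payment-service", "4bf92f3577b34da6"))
logger.addHandler(handler)
logger.warning("stripe API timeout, retrying")
```

The point is structural: because every line is machine-parseable and carries the trace ID, "show me all logs for this request" becomes a single filter query instead of an SSH session.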
I often see teams that only have metrics. They can tell something is slow, but they spend 30+ minutes manually correlating timestamps across service logs to find the slow component. Adding traces cuts that investigation time to under 5 minutes.
Metric types in detail
Understanding metric types matters because each one answers a different question:
| Type | What it measures | Example | Query pattern |
|---|---|---|---|
| Counter | Cumulative total (only goes up) | http_requests_total | Rate of change: rate(http_requests_total[5m]) |
| Gauge | Current point-in-time value | db_connections_active | Direct value or max/min over window |
| Histogram | Distribution of values in buckets | request_duration_seconds | Percentiles: p50, p95, p99 |
| Summary | Client-calculated percentiles | request_duration_summary | Pre-computed quantiles (less flexible) |
My recommendation: use counters for anything you want to rate (requests, errors, bytes). Use gauges for pool sizes and queue depths. Use histograms for latency and size distributions. Skip summaries unless you have a specific need for client-side aggregation.
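The table's semantics are easy to make concrete. A toy sketch of the three types worth implementing yourself once (not a real metrics client; note that the histogram stores per-bucket counts, whereas Prometheus exposes buckets cumulatively):

```python
import bisect

class Counter:
    """Cumulative total; only ever increases. Queried as a rate of change."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Point-in-time value; can go up or down (pool sizes, queue depths)."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into buckets by upper bound; the last slot
    catches everything above the largest bound (+Inf)."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)          # upper bounds, in seconds
        self.counts = [0] * (len(buckets) + 1)  # final slot = +Inf
    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1

h = Histogram(buckets=[0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.3, 2.4):
    h.observe(latency)
print(h.counts)  # -> [1, 2, 0, 1]: one fast, two mid, none slow, one outlier
```

Percentiles fall out of bucket counts at query time, which is why histograms are the right type for latency: the client only increments counters, and the backend does the math.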
The observability pipeline
Data doesn't magically appear in dashboards. Every signal flows through a pipeline: emission, collection, processing, storage, and query.
OpenTelemetry (OTel) is the vendor-neutral standard for this pipeline. Services emit data via the OTel SDK. Agents collect locally, the collector routes centrally, and backends store each signal type optimally. The beauty of this architecture is that changing your storage backend (say, from Jaeger to Tempo) requires zero application code changes.
Key Components
| Component | Role |
|---|---|
| OTel SDK | Library embedded in each service. Emits metrics, logs, and traces via OTLP protocol. |
| OTel Agent | Sidecar or daemonset that collects signals from local services, buffers, and forwards. |
| OTel Collector | Central pipeline. Receives from agents, applies sampling/enrichment/routing, exports to backends. |
| Metrics backend (Prometheus, Mimir, M3) | Time-series database optimized for numeric aggregations. Handles billions of data points. |
| Log backend (Loki, Elasticsearch) | Indexed log storage. Supports full-text search and structured field queries. |
| Trace backend (Jaeger, Tempo, Zipkin) | Stores trace trees. Optimized for trace-ID lookup and waterfall rendering. |
| Visualization (Grafana, Datadog) | Unified query layer with dashboards, alerts, and cross-pillar correlation. |
| Alertmanager | Evaluates alert rules against metrics, deduplicates, groups, and routes notifications. |
Types / Variations
Methodologies for using metrics
Two frameworks dominate how teams structure their metric collection. I recommend knowing both because they apply to different scopes.
| Method | Stands for | Scope | Best for |
|---|---|---|---|
| USE | Utilization, Saturation, Errors | Infrastructure resources (CPU, memory, disk, network) | Diagnosing resource bottlenecks |
| RED | Rate, Errors, Duration | Request-driven services (APIs, microservices) | Diagnosing service health |
| Four Golden Signals | Latency, Traffic, Errors, Saturation | Any service | Google SRE's universal baseline |
USE method (Brendan Gregg): For every resource, measure utilization (how busy), saturation (how overloaded, e.g. queue depth), and errors (hardware/software faults). This is your go-to for infrastructure layer diagnosis. If CPU utilization is 90% with high saturation, you need more capacity. If utilization is low but errors are high, you have a different problem.
RED method (Tom Wilkie): For every service, measure request rate, error rate, and request duration. This is your go-to for application layer diagnosis. It directly maps to user experience: are requests flowing, are they failing, are they slow?
The rule of thumb: USE for resources below your code, RED for your code. Most teams need both.
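RED instrumentation can be as small as a wrapper around each request handler. A hedged sketch under our own names (`instrumented` and the `red` store are illustrative, not a real framework API):

```python
import time
from collections import defaultdict

# Toy RED store: request count, error count, and durations per endpoint.
red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations": []})

def instrumented(endpoint, handler):
    """Wrap a handler so every call records Rate, Errors, and Duration."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        except Exception:
            red[endpoint]["errors"] += 1
            raise
        finally:  # runs on success and failure alike
            red[endpoint]["requests"] += 1
            red[endpoint]["durations"].append(time.monotonic() - start)
    return wrapper

checkout = instrumented("/checkout", lambda: "ok")
checkout()
try:
    instrumented("/pay", lambda: 1 / 0)()
except ZeroDivisionError:
    pass
print(red["/checkout"]["requests"], red["/pay"]["errors"])  # -> 1 1
```

In practice a middleware does this once for all routes, and the three numbers per endpoint are exactly what a RED dashboard plots.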
Alerting strategies
| Strategy | How it works | When to use |
|---|---|---|
| Threshold alert | Fire when metric crosses static value | Simple cases: disk > 90%, error rate > 5% |
| Burn rate alert | Fire when SLO budget consumption rate is too fast | SLO-based alerting (preferred for latency/availability) |
| Anomaly detection | Fire when metric deviates from predicted baseline | Seasonal patterns, traffic-dependent thresholds |
| Dead man's switch | Fire when a metric stops being emitted | Detecting silent failures, crashed exporters |
Burn rate alerting is the modern standard. Instead of "alert when p99 > 500ms," you define "alert when we're burning through our monthly error budget fast enough to exhaust it in 6 hours." This naturally accounts for brief spikes that don't matter and catches sustained degradation that does.
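The arithmetic behind a burn rate alert is simple enough to fit in a few lines. A sketch (numbers are illustrative):

```python
def burn_rate(observed_error_rate, slo_target):
    """Error-budget consumption rate: 1.0 means exactly on budget,
    higher means the budget will run out before the window ends."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget

# A 99.9% SLO tolerates 0.1% errors. If 1% of requests are failing,
# the budget burns 10x too fast: a month's budget lasts about 3 days.
print(round(burn_rate(0.01, 0.999), 6))  # -> 10.0

# "Would exhaust the monthly budget in 6 hours" corresponds to a
# sustained burn rate of (30 days * 24 h) / 6 h = 120.
ALERT_THRESHOLD = 30 * 24 / 6  # 120.0
```

This is why burn rate alerts ignore brief spikes: a 30-second blip barely moves the consumption rate over the evaluation window, while a sustained 1% error rate trips it quickly.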
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Diagnose unknown failures without code changes | Significant infrastructure cost (storage, compute, network) |
| Reduce mean time to recovery (MTTR) from hours to minutes | Cardinality explosion can make metrics backends unusable |
| Cross-service correlation via shared trace IDs | Requires organizational discipline for consistent instrumentation |
| SLO-driven alerting reduces alert fatigue | Tail-based sampling adds complexity and buffering latency |
| Vendor-neutral via OpenTelemetry | Learning curve for PromQL, LogQL, and trace analysis |
| Enables data-driven capacity planning | Data retention costs scale with traffic volume |
The fundamental tension: visibility vs. cost. More data means better debugging, but storage and query costs grow linearly (or worse) with traffic. Every team eventually faces the question of how much observability they can afford, and the answer is always "less than you want but more than you think."
The cardinality trap will bite you
The single most expensive mistake in observability is putting unbounded values (user IDs, request IDs, session tokens) as metric labels. Each unique label combination creates a new time series. A service with 10M users and a user_id label generates 10M time series per metric. Your Prometheus instance will OOM. Use high-cardinality data in logs and trace attributes, never in metric labels.
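The series arithmetic behind that warning is worth making explicit: series count is the product of label cardinalities. A small sketch (the label sets are illustrative):

```python
from math import prod

def series_count(label_values):
    """Each metric produces one time series per unique combination
    of label values, so series count multiplies across labels."""
    return prod(len(values) for values in label_values.values())

# Bounded labels: 40 endpoints x 8 status codes x 4 regions.
bounded = {"endpoint": range(40), "status": range(8), "region": range(4)}
print(series_count(bounded))  # -> 1280 series: fine

# Add an unbounded label and the same metric explodes.
unbounded = dict(bounded, user_id=range(10_000_000))
print(series_count(unbounded))  # -> 12,800,000,000 series: OOM
```

The fix is never "a bigger Prometheus"; it's moving the unbounded dimension into trace attributes or log fields, where per-record storage doesn't multiply.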
When to Use It / When to Avoid It
Use observability when:
- You run more than one service (even two services make cross-service debugging painful without traces)
- You have SLOs and need to measure compliance programmatically
- Your team is on-call and needs to diagnose issues at 2 AM without deploying debug code
- You're scaling beyond what a single engineer can hold in their head
- Your deployment frequency is high enough that "what changed?" is a frequent investigation question
Avoid over-investing when:
- You're running a single monolith with under 100 RPS (application-level logging and APM may be sufficient)
- Your budget doesn't support the infrastructure cost (start with metrics + structured logs, add traces later)
- You don't have the team discipline to maintain consistent instrumentation (observability without consistency is noise)
The honest answer: you always need some observability. The question is how much. Start with metrics and structured logging. Add distributed tracing when you have 3+ services. Add a full OTel pipeline when debugging time becomes a meaningful cost.
Real-World Examples
Netflix processes over 1 billion metrics per minute across its microservice fleet. Their observability platform, Atlas, is a custom time-series database built specifically to handle this cardinality at sub-second query latency. The key lesson: at Netflix's scale, off-the-shelf solutions like Prometheus can't keep up, and the bottleneck is always metric storage, not collection. They also pioneered "chaos engineering" which depends entirely on observability to measure the blast radius of injected failures.
Uber built two foundational observability tools that became open-source projects. M3 handles their metrics pipeline (billions of data points per day with 30-day retention), and Jaeger handles distributed tracing (processing millions of traces daily across thousands of microservices). The key lesson from Uber: they found that adding Jaeger reduced their median debugging time from ~45 minutes of manual log correlation to under 5 minutes of trace inspection. That ROI justified the infrastructure investment within the first quarter.
Datadog, as both a vendor and a practitioner, runs one of the largest observability pipelines in the world. They ingest trillions of data points per day across metrics, logs, and traces for their customers. Their architecture uses a tiered storage model: hot storage for recent data (SSDs), warm storage for 30-day retention (HDDs), and cold archival (object storage). The lesson: retention tiering is not optional at scale. Storing everything at hot-tier cost is financially unsustainable.
How This Shows Up in Interviews
When to bring it up
Mention observability whenever you're designing a system with multiple services. After laying out your architecture, say: "I'd instrument each service with OpenTelemetry to emit metrics, structured logs, and traces, with a shared trace ID for cross-service correlation." This shows operational maturity that most candidates skip entirely.
Also bring it up when the interviewer asks "how would you debug this?" or "how do you know when something goes wrong?" Those are direct invitations to discuss observability.
Depth expected at senior/staff level
- Name the three pillars and explain what each one is uniquely good at
- Explain the USE method (for infrastructure) and RED method (for services)
- Describe the alert-to-root-cause workflow: metrics → traces → logs
- Know the difference between monitoring (predefined checks) and observability (arbitrary questions)
- Mention cardinality as the primary scaling constraint for metrics
- Describe burn rate alerting vs. threshold alerting and why burn rate is preferred
- Reference OpenTelemetry as the instrumentation standard
Interview shortcut: the 30-second observability pitch
"Every service exports RED metrics (rate, errors, duration) via OpenTelemetry. Traces correlate requests across service boundaries using W3C trace context headers. Structured logs carry the trace ID for deep-dive context. Alerts use SLO burn rate, not static thresholds. When an alert fires, I go metrics → traces → logs to reach root cause in under 5 minutes." That answer covers everything most interviewers want to hear.
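The W3C trace context header mentioned in the pitch has a fixed, easy-to-remember shape: `version-traceid-spanid-flags`. A sketch of building and parsing one (the hardcoded IDs below are the example values from the W3C spec):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: 00-<32 hex>-<16 hex>-<flags>."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"             # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": flags == "01"}

hdr = make_traceparent(trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
                       span_id="00f067aa0ba902b7")
print(hdr)  # -> 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Every service that forwards this header (generating a fresh span ID for its own work, keeping the trace ID) is what makes the cross-service waterfall possible; OTel SDKs do the propagation automatically.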
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "How do you handle high-cardinality metrics?" | "Never use unbounded values (user IDs, request IDs) as metric labels. Those belong in trace attributes and log fields. Metric labels must be bounded: status code, endpoint, region. If you need per-user analysis, query traces, not metrics." |
| "How would you alert on this system?" | "SLO-based burn rate alerts, not static thresholds. Define a monthly error budget (e.g., 99.9% = 43 minutes of downtime). Alert when the burn rate would exhaust the budget in 6 hours. This ignores brief spikes and catches real degradation." |
| "What's the difference between monitoring and observability?" | "Monitoring checks known failure modes against predefined thresholds. Observability lets you investigate unknown failures after they happen, without deploying new code. Monitoring answers 'is the thing I predicted broken?' Observability answers 'what broke and why?'" |
| "How do you keep observability costs under control?" | "Three levers: sampling (especially tail-based for traces), retention tiering (hot/warm/cold storage), and strict cardinality budgets per service. Most teams find that 1-5% trace sampling with 100% error/slow trace capture is sufficient." |
| "How do traces connect to logs?" | "Every log line includes the trace_id field from the active span context. When investigating a trace, I filter logs by that trace ID to see every log event across every service for that specific request." |
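The sampling policy from the cost-control answer is only a few lines of decision logic once the full trace is in hand. A sketch of a tail-based, error-and-latency-biased sampler (the thresholds and base rate are illustrative):

```python
import random

def keep_trace(has_error, duration_ms, base_rate=0.05, slow_ms=1000):
    """Tail-based sampling decision, made after the trace completes:
    keep 100% of errored and slow traces, sample the healthy rest."""
    if has_error or duration_ms >= slow_ms:
        return True
    return random.random() < base_rate

# Errors and slow traces are always kept; fast successes mostly dropped.
print(keep_trace(has_error=True, duration_ms=40))     # -> True
print(keep_trace(has_error=False, duration_ms=2500))  # -> True
```

The catch this illustrates is the buffering cost: "after the trace completes" means the collector must hold every span of an in-flight trace in memory until the decision can be made, which is the complexity the trade-offs table refers to.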
Test Your Understanding
Quick Recap
- Observability is the ability to understand internal system state from external outputs (metrics, logs, traces) without deploying new code.
- Metrics are cheap and aggregated, ideal for alerting. Logs are detailed event records, ideal for context. Traces show request paths across services, ideal for latency diagnosis.
- The investigation workflow is metrics (what's broken?) → traces (where in the call chain?) → logs (what's the context?), connected by a shared trace ID.
- USE method for infrastructure resources, RED method for service health, four golden signals as the universal baseline.
- Cardinality is the primary constraint in metrics systems. Unbounded label values (user IDs) will crash your metrics backend.
- Burn rate alerting (SLO budget consumption rate) is the modern standard, replacing static threshold alerts.
- OpenTelemetry is the vendor-neutral instrumentation standard. Mention it by name in interviews to signal operational maturity.
Related Concepts
- Distributed Tracing ā A deep dive into the traces pillar: spans, context propagation, sampling strategies, and backend architectures. If you want to understand the mechanics behind the trace waterfall, start here.
- Microservices ā The architecture pattern that makes observability essential. More services means more boundaries, and more boundaries means more places for requests to fail or slow down invisibly.
- Service Mesh ā Sidecar proxies (Envoy, Linkerd) can auto-generate RED metrics and inject trace context without application code changes. A service mesh is the fastest path to baseline observability across a large fleet.
- Circuit Breaker ā The pattern that prevents cascading failures. Circuit breaker state (open/closed/half-open) is one of the most important metrics to expose in your observability pipeline.
- Rate Limiting ā Rate limiter metrics (rejection rate, current vs. max capacity) are key observability signals for protecting backend services from traffic spikes.