Distributed tracing
How distributed tracing tracks a single request across multiple services: spans, trace context propagation, sampling strategies, and how to use traces to diagnose production latency problems.
TL;DR
- Distributed tracing tracks a single request through every service it touches, building a causally connected timeline called a trace.
- A trace is composed of spans — one span per operation (service call, database query, external API call).
- Traces are connected by trace context propagation: every outbound call carries a `trace-id` and `span-id` header that the receiving service uses to link its span to the parent.
- Sampling is required — capturing 100% of traces at high traffic volume is expensive. Head-based or tail-based sampling reduces volume while preserving signal.
- OpenTelemetry is the vendor-neutral standard for instrumentation; Jaeger, Zipkin, and Tempo are popular open-source backends.
The Problem It Solves
You add a new feature to checkout. Latency goes from 300ms to 900ms. The on-call engineer checks per-service metrics: UserService averages 50ms, InventoryService 60ms, PaymentService 800ms. Great, PaymentService is the culprit.
But PaymentService calls four downstream dependencies: Stripe API, a rate-limiter, a fraud-check service, and a database. Which one got slow? The metrics for each of those also look normal in isolation. The slowness only manifests when those calls happen in a specific sequence under certain traffic conditions.
The engineer starts grepping logs across services, trying to correlate timestamps. Forty minutes later, they discover FraudCheck calls an ML inference service that recently deployed a model 3x larger than the previous one. The fix takes 2 minutes. Finding the root cause took 40.
This is the fundamental limitation of metrics-only debugging. Metrics aggregate across requests, so they tell you that something is slow. They can't tell you which specific call chain within a request is slow. Logs can contain the detail, but cross-service correlation by timestamp is fragile, slow, and error-prone.
Distributed tracing exists to answer one question instantly: for this specific request, which service call took the longest, and why?
What Is It?
Distributed tracing tracks a single request as it flows through every service it touches, building a causally connected timeline called a trace. Each operation within the request (an HTTP call, a database query, a cache lookup) is recorded as a span. Spans are linked by parent-child relationships, forming a tree that shows exactly how time was spent.
Think of it like a package tracking system. When you ship a package, every facility it passes through scans the barcode: picked up at warehouse, arrived at sorting center, loaded on truck, delivered. Each scan is a span. The barcode is the trace ID. If the package is late, you look at the timeline and see exactly which facility introduced the delay. You don't have to call each facility and ask.
With this trace, the 40-minute investigation becomes a 30-second glance. The root cause (ML inference taking 460ms inside FraudCheck) is immediately visible in the span tree.
For your interview: define distributed tracing as "request-scoped timelines across service boundaries, built from spans linked by a shared trace ID." That's precise and complete.
Tracing is not logging with timestamps
Logs record events. Traces record causal relationships between events. A log line says "PaymentService took 800ms." A trace says "PaymentService took 800ms, of which 480ms was spent in FraudCheck, which spent 460ms in MLInference." The causal chain, the parent-child relationships, and the timing overlap are what make traces uniquely powerful. Timestamps alone cannot reconstruct this.
How It Works
Step 1: The first service creates the trace
When a request enters the system (e.g., API gateway receives POST /checkout), the OTel SDK generates a new trace ID (128-bit random hex) and creates the root span. This span records the start time, operation name, and service identity.
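A rough stdlib-only sketch of what the SDK does at the entry point (real SDKs also make the sampling decision here and attach resource metadata; the function names are illustrative):

```python
import secrets
import time

def new_trace_context():
    # 128-bit trace ID and 64-bit span ID, hex-encoded per W3C Trace Context
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def start_root_span(operation_name, service_name):
    # The root span has no parent; every descendant span shares its trace_id
    ctx = new_trace_context()
    return {
        **ctx,
        "parent_span_id": None,
        "operation_name": operation_name,
        "service": service_name,
        "start_time": time.time(),
    }

root = start_root_span("POST /checkout", "api-gateway")
```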
Step 2: Context propagates through headers
When the first service calls a downstream service, it injects trace context into the outgoing HTTP request headers. The traceparent header follows the W3C Trace Context standard:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  └─ trace-id (16 bytes, hex)      └─ parent        └─ trace flags
             └─ version                             span-id          (01 = sampled)
```
Every downstream service extracts the trace ID and parent span ID from these headers, creates its own span with the same trace ID, and passes the context forward. This is how spans from 10 different services end up in the same trace tree.
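Injection and extraction can be sketched with plain dicts standing in for HTTP headers (in practice the OpenTelemetry SDK's propagators do this for you):

```python
def inject_traceparent(headers, trace_id, span_id, sampled=True):
    # Called by the client before an outbound request: encode the current
    # span's identity in the W3C traceparent format
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract_traceparent(headers):
    # Called by the server on an inbound request: recover the trace ID and
    # the caller's span ID, which becomes this service's parent span ID
    version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": flags == "01",
    }
```

The receiving service then creates its own span with the extracted `trace_id` and `parent_span_id`, and injects its own span ID into any further downstream calls.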
Step 3: Spans record timing and metadata
Each span captures:
```python
# What a span contains
span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "a1b2c3d4e5f6a7b8",
    "operation_name": "PaymentService.charge",
    "start_time": "2024-01-15T10:30:42.123Z",
    "end_time": "2024-01-15T10:30:42.903Z",
    "duration_ms": 780,
    "status": "OK",
    "attributes": {
        "http.method": "POST",
        "http.url": "/charge",
        "http.status_code": 200,
        "order.id": "ord_xyz789",
        "payment.amount_cents": 4999,
    },
    "events": [
        {"name": "retry_attempt", "timestamp": "...", "attributes": {"attempt": 2}}
    ],
}
```
Attributes are the richness layer. Auto-instrumentation adds HTTP metadata automatically. Manual instrumentation adds business context (order ID, payment amount) that makes traces actually useful for debugging business logic, not just infrastructure.
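A toy illustration of manual instrumentation: a context manager that times an operation, records attributes and events, and appends the finished span to a list. This is a sketch of the idea, not the OpenTelemetry API (which exposes it as `tracer.start_as_current_span` plus `span.set_attribute`):

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_span(name, collected, trace_id, parent_span_id=None, **attributes):
    # Build the span up front so the body can append events to it
    s = {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "operation_name": name,
        "attributes": dict(attributes),
        "events": [],
        "start_time": time.time(),
    }
    try:
        yield s
        s["status"] = "OK"
    except Exception:
        s["status"] = "ERROR"   # errors are recorded, then re-raised
        raise
    finally:
        s["end_time"] = time.time()
        s["duration_ms"] = (s["end_time"] - s["start_time"]) * 1000
        collected.append(s)     # a real SDK exports asynchronously instead

spans = []
with traced_span("PaymentService.charge", spans, "abc123",
                 **{"order.id": "ord_xyz789"}) as s:
    s["events"].append({"name": "retry_attempt", "attributes": {"attempt": 2}})
```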
Step 4: Spans ship to a backend
Each service sends its completed spans to a tracing backend (via OTel Collector). The backend stores them indexed by trace ID. When you query for trace abc123, the backend assembles all spans with that trace ID into a tree and renders the waterfall view.
```
Checkout                  |████████████████████ 900ms|
├─ Auth.verify            |██ 50ms|
├─ Inventory.check          |██ 60ms|
└─ Payment.charge             |█████████████████ 780ms|
   ├─ RateLimiter.check       |█|
   ├─ FraudCheck.evaluate      |██████████ 480ms|
   │  └─ MLInference.score      |██████████ 460ms|
   └─ Stripe.charge            |██████ 280ms| (parallel)
```
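The backend's assembly step is conceptually simple: group spans by trace ID, then link each span to its parent via `parent_span_id`. A minimal sketch, assuming the span field names from the example above:

```python
def build_trace_tree(spans):
    # Index spans by ID, then attach each node to its parent;
    # spans with no (or an unknown) parent become roots
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_span_id"])
        if parent:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots
```

A well-formed trace yields a single root; a broken propagation chain shows up here as multiple disconnected roots.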
Sequential spans compound latency. Parallel spans share it. I often see engineers identify a 2x speedup just by looking at a trace and realizing two sequential calls could be parallelized. That insight is invisible without tracing.
Key Components
| Component | Role |
|---|---|
| Trace | A tree of spans representing one end-to-end request. Identified by a unique trace ID. |
| Span | A single unit of work (service call, DB query, cache read). Has start/end time, status, attributes. |
| Trace context | The traceparent + tracestate HTTP headers that propagate trace/span IDs across service boundaries. |
| OTel SDK | Library in each service that creates spans, injects/extracts context, and exports span data. |
| OTel Collector | Central pipeline that receives spans, applies sampling and enrichment, and exports to backends. |
| Tracing backend (Jaeger, Tempo, Zipkin) | Stores spans indexed by trace ID. Renders waterfall views. Supports trace search. |
| Span attributes | Key-value metadata on spans: HTTP method, status code, user ID, business context fields. |
| Span events | Timestamped log-like records within a span: retry attempts, cache hits, error details. |
Types / Variations
Sampling strategies
| Strategy | Decision point | Pros | Cons | Best for |
|---|---|---|---|---|
| Always-on (100%) | Every request traced | Complete visibility | Extremely expensive at scale | Low-traffic services, staging environments |
| Head-based | At request entry | Simple, low overhead | Misses rare errors, slow traces | High-traffic with uniform patterns |
| Tail-based | After trace completes | Captures all errors and anomalies | Requires buffering, more infrastructure | Production services with rare but critical failures |
| Rule-based | Per-request rules | Targeted: always trace /checkout, 1% of /health | Complex configuration | Mixed-criticality endpoints |
| Adaptive | Dynamic rate adjustment | Adjusts to traffic volume | Requires rate estimation | Bursty traffic patterns |
My recommendation: start with head-based at 1-5% for general coverage. Add tail-based sampling when you need to guarantee capture of all error and high-latency traces. The infrastructure cost of tail-based is real (buffering all spans requires memory), but the debugging value is worth it for any service on the critical user path.
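A head-based decision with per-route rules can be sketched in a few lines (route names and rates here are hypothetical; the decision is made once at entry and propagated downstream in the traceparent sampled flag):

```python
import random

ALWAYS_TRACE = {"/checkout", "/payment"}   # hypothetical critical endpoints

def head_sample(route, base_rate=0.05):
    # Rule-based head sampling: critical routes are always traced,
    # everything else at a base probability
    if route in ALWAYS_TRACE:
        return True
    return random.random() < base_rate
```

Note what this cannot do: the decision happens before the request executes, so a rare error on `/health` sampled at 5% is usually lost. That gap is exactly what tail-based sampling closes.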
Tracing backends
| Backend | Architecture | Storage | Strengths |
|---|---|---|---|
| Jaeger (Uber) | Microservice-based collectors + storage | Cassandra, Elasticsearch, or Kafka | Mature, battle-tested at Uber scale |
| Zipkin (Twitter) | Single binary or distributed | In-memory, MySQL, Cassandra, ES | Simple setup, good for getting started |
| Grafana Tempo | Object-storage-native | S3/GCS/Azure Blob (columnar) | Lowest cost at scale, no index to maintain |
| AWS X-Ray | Managed service | AWS-managed | Zero ops, native AWS integration |
| Datadog APM | SaaS | Datadog-managed | Full-stack correlation, ML-powered insights |
The industry is moving toward Tempo-style backends that store traces in object storage (S3) without maintaining a traditional index. This dramatically reduces operational cost. The trade-off is that searches by attributes (find all traces where user_id=123) require scanning, which is slower than indexed backends but acceptable for most debugging workflows where you start from a trace ID.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Pinpoints exact slow component in seconds | Instrumentation overhead (1-3% latency per service for context injection) |
| Visualizes causal relationships between services | Sampling means you never have 100% coverage in production |
| Makes parallelization opportunities visible | Requires consistent propagation across all services (one un-instrumented service breaks the chain) |
| Enables SLO measurement per-dependency | Storage costs grow with span count and retention |
| Vendor-neutral via OpenTelemetry and W3C headers | Tail-based sampling adds buffering complexity |
| Auto-instrumentation covers most framework spans | Manual instrumentation needed for business context |
The fundamental tension: coverage vs. cost. More spans and higher sampling rates give better debugging power, but storage and processing costs scale linearly with span volume. At 100K RPS with 5 spans per request, 100% sampling produces 43 billion spans per day. Even at 1 KB per span, that's 43 TB/day. Sampling is not optional at scale.
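The back-of-envelope math from the paragraph above, as a reusable calculation:

```python
def daily_span_volume(rps, spans_per_request, sample_rate=1.0):
    # Spans generated per day at a given sampling rate (86,400 seconds/day)
    return round(rps * spans_per_request * sample_rate * 86_400)

full = daily_span_volume(100_000, 5)            # 43.2 billion spans/day at 100%
storage_tb = full * 1_000 / 1e12                # ~43 TB/day at ~1 KB per span
sampled = daily_span_volume(100_000, 5, 0.01)   # ~432 million spans/day at 1%
```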
The broken chain problem
If even one service in the request path doesn't propagate trace context headers, the trace breaks into two disconnected fragments. In interviews, mention this as the primary operational challenge: "The hardest part of tracing isn't the backend, it's ensuring 100% of services propagate context. One un-instrumented service breaks the chain." Service meshes (Envoy, Linkerd) can inject context at the proxy layer, bypassing the application entirely.
When to Use It / When to Avoid It
Use distributed tracing when:
- You have 3+ services on the critical request path (even 3 services make ad-hoc log correlation painful)
- Latency debugging is a recurring on-call burden (traces turn 40-minute investigations into 30-second glances)
- You need to set per-dependency SLOs (e.g., "calls to PaymentService must complete in < 200ms")
- You're planning to parallelize sequential calls and need data on which calls are actually sequential
- Compliance requires audit trails showing exactly which services processed a request
Avoid or defer when:
- You have a single monolith with no service-to-service calls (profiling is more useful than tracing for single-process latency)
- Your traffic is under 100 RPS and you can affordably log every request with full context
- You haven't yet invested in structured logging (get structured logs with trace IDs first, then add tracing)
The honest answer: if you're running microservices in production, you need tracing. The question is how soon. My recommendation is to add auto-instrumentation the moment you hit 3 services. The cost is minimal (one OTel agent JAR or Python wrapper) and the debugging value is immediate.
Real-World Examples
Uber and Jaeger. Uber built Jaeger in 2015 to debug latency across their rapidly growing microservice fleet (now 4,000+ services). Before Jaeger, debugging a single slow ride-booking request required correlating logs across dozens of services manually. After Jaeger, engineers could pull up a trace and see the full 20+ service call chain with timing. Jaeger processes millions of traces per day at Uber. The key lesson: Uber found that the biggest ROI wasn't in rare catastrophic failures; it was in the everyday latency investigations that every on-call engineer deals with. Reducing median debug time from 45 minutes to 5 minutes, multiplied across hundreds of engineers, saved thousands of engineering hours per quarter.
Google and Dapper. Google's 2010 Dapper paper is the origin of modern distributed tracing. Dapper was designed to trace requests across Google's massive internal services (Search, Ads, Gmail, all built on shared infrastructure). The paper introduced the concepts of traces, spans, and annotations that every tracing system since has adopted. Key insight from the paper: Dapper used always-on tracing with low overhead (< 0.01% latency impact) achieved through aggressive sampling and asynchronous span export. The paper showed that even 0.1% sampling was sufficient to capture representative traces for debugging, as long as 100% of error traces were always captured.
Grafana Tempo. Tempo took a different approach to trace storage. Instead of indexing spans in Elasticsearch or Cassandra (expensive at scale), Tempo stores traces as columnar data in object storage (S3/GCS). Lookups by trace ID are fast because traces are stored contiguously. The trade-off: you can't search by arbitrary attributes without a separate index (Tempo uses Grafana Loki logs as the search index, querying log lines for trace IDs). This architecture reduced Grafana Labs' trace storage costs by over 10x compared to Jaeger-on-Elasticsearch. The lesson: at very high scale, the storage backend architecture matters more than the tracing protocol.
How This Shows Up in Interviews
When to bring it up
Mention distributed tracing whenever your system design has 3+ services communicating synchronously. After drawing the architecture, say: "I'd add OpenTelemetry auto-instrumentation to each service, with W3C trace context propagation, so any latency regression is immediately diagnosable from the trace waterfall."
Also bring it up when the interviewer says "how would you debug a latency issue across these services?" That's a direct invitation.
Depth expected at senior/staff level
- Explain traces, spans, and parent-child relationships
- Describe W3C Trace Context header format and how propagation works
- Compare head-based vs. tail-based sampling with trade-offs
- Distinguish auto-instrumentation from manual instrumentation
- Name OpenTelemetry as the standard and at least one backend (Jaeger, Tempo)
- Explain critical path analysis: sequential vs. parallel spans
- Mention the broken chain problem (un-instrumented services)
Interview shortcut: the trace pitch
"Each service runs the OpenTelemetry SDK which auto-creates spans for HTTP and database calls. Trace context propagates via W3C traceparent headers. I'd use tail-based sampling to guarantee we capture all error and slow traces. The waterfall view lets us pinpoint the slow span in seconds instead of correlating logs for 30 minutes." That answer covers the mechanisms interviewers want to hear.
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "How does trace context propagate?" | "The calling service injects a traceparent header with the trace ID and current span ID. The receiving service extracts it, creates a child span with the same trace ID, and passes it forward. W3C Trace Context is the standard. For async (message queues), the trace context goes into message headers instead of HTTP headers." |
| "What if one service doesn't propagate headers?" | "The trace breaks into disconnected fragments. That service becomes a black box, and the parent span shows the total duration of the call but you can't see what happened inside. The fix is auto-instrumentation (OTel agents) or, if the service can't be modified, a service mesh sidecar that injects context at the proxy layer." |
| "How do you handle sampling?" | "Head-based sampling decides at entry (simple but misses rare errors). Tail-based sampling buffers all spans and decides after completion (captures all errors/slow traces, but needs more memory). I'd use tail-based for any service on the critical path, with rules to keep 100% of errors and requests above the p99 latency threshold." |
| "How do traces relate to logs and metrics?" | "Metrics tell you something is wrong. Traces tell you where in the call chain. Logs tell you the specific context (error message, request body). They're connected by trace_id: I can go from a metric spike → sample traces from that time window → filter logs by trace_id to get full context." |
| "What's the overhead of tracing?" | "Context injection/extraction adds 1-3ms per service call. Span export is asynchronous (batched, non-blocking). The main cost is storage, not runtime. At 100K RPS with 5 spans/request, 1% sampling produces ~430M spans/day. At 1 KB/span, that's ~430 GB/day of storage." |
Test Your Understanding
Quick Recap
- Distributed tracing builds a causally connected timeline (trace) of a single request across all services it touches, composed of spans linked by parent-child relationships.
- Trace context propagates via W3C `traceparent` HTTP headers. Each service extracts the trace ID and parent span ID, creates a child span, and forwards the context to its own downstream calls.
- Head-based sampling decides at the start (simple, misses rare errors). Tail-based sampling decides after the trace completes (captures all errors and slow traces, but requires buffering).
- Auto-instrumentation (OTel agents) covers HTTP, database, and framework spans with zero code changes. Manual instrumentation adds business context attributes that make traces useful for debugging logic, not just infrastructure.
- In a trace waterfall, sequential spans compound latency and parallel spans share it. The highest-leverage optimization is often converting sequential calls to parallel.
- One un-instrumented service breaks the trace chain. Service meshes can inject trace context at the proxy layer, bypassing the application.
- OpenTelemetry is the vendor-neutral standard. Mention it by name, along with W3C Trace Context, to signal operational maturity in interviews.
Related Concepts
- Observability — The umbrella discipline that includes tracing alongside metrics and logs. Distributed tracing is one of the three pillars, and understanding how all three correlate is essential for effective debugging.
- Microservices — The architecture pattern that creates the need for distributed tracing. More services means more network boundaries, more latency sources, and more places for context to get lost.
- Service Mesh — Sidecar proxies can auto-inject trace context at the network layer, solving the "broken chain" problem for services that can't be modified. A service mesh is the fastest path to 100% trace coverage.
- Circuit Breaker — Trace spans naturally surface circuit breaker behavior: a span to a failing dependency shows retries, timeouts, and eventual circuit open. Tracing makes circuit breaker tuning data-driven instead of guesswork.
- Message Queues — Async communication via queues requires special trace context propagation through message headers. Understanding how tracing works across async boundaries is a common interview follow-up.