Distributed tracing
How distributed tracing tracks a single request across multiple services: spans, trace context propagation, sampling strategies, and how to use traces to diagnose production latency problems.
TL;DR
- Distributed tracing tracks a single request through every service it touches, building a causally connected timeline called a trace.
- A trace is composed of spans — one span per operation (service call, database query, external API call).
- Traces are connected by trace context propagation: every outbound call carries a `trace-id` and `span-id` header that the receiving service uses to link its span to the parent.
- Sampling is required — capturing 100% of traces at high traffic volume is expensive. Head-based or tail-based sampling reduces volume while preserving signal.
- OpenTelemetry is the vendor-neutral standard for instrumentation; Jaeger, Zipkin, and Tempo are popular open-source backends.
The Problem It Solves
You add a new feature to checkout. Latency goes from 300ms to 900ms. The on-call engineer checks per-service metrics: UserService averages 50ms, InventoryService 60ms, PaymentService 800ms. Great, PaymentService is the culprit.
But PaymentService calls four downstream dependencies: Stripe API, a rate-limiter, a fraud-check service, and a database. Which one got slow? The metrics for each of those also look normal in isolation. The slowness only manifests when those calls happen in a specific sequence under certain traffic conditions.
The engineer starts grepping logs across services, trying to correlate timestamps. Forty minutes later, they discover FraudCheck calls an ML inference service that recently deployed a model 3x larger than the previous one. The fix takes 2 minutes. Finding the root cause took 40.
This is the fundamental limitation of metrics-only debugging. Metrics aggregate across requests, so they tell you that something is slow. They can't tell you which specific call chain within a request is slow. Logs can contain the detail, but cross-service correlation by timestamp is fragile, slow, and error-prone.
Distributed tracing exists to answer one question instantly: for this specific request, which service call took the longest, and why?
What Is It?
Distributed tracing tracks a single request as it flows through every service it touches, building a causally connected timeline called a trace. Each operation within the request (an HTTP call, a database query, a cache lookup) is recorded as a span. Spans are linked by parent-child relationships, forming a tree that shows exactly how time was spent.
Think of it like a package tracking system. When you ship a package, every facility it passes through scans the barcode: picked up at warehouse, arrived at sorting center, loaded on truck, delivered. Each scan is a span. The barcode is the trace ID. If the package is late, you look at the timeline and see exactly which facility introduced the delay. You don't have to call each facility and ask.
With this trace, the 40-minute investigation becomes a 30-second glance. The root cause (ML inference taking 460ms inside FraudCheck) is immediately visible in the span tree.
For your interview: define distributed tracing as "request-scoped timelines across service boundaries, built from spans linked by a shared trace ID." That's precise and complete.
Tracing is not logging with timestamps
Logs record events. Traces record causal relationships between events. A log line says "PaymentService took 800ms." A trace says "PaymentService took 800ms, of which 480ms was spent in FraudCheck, which spent 460ms in MLInference." The causal chain, the parent-child relationships, and the timing overlap are what make traces uniquely powerful. Timestamps alone cannot reconstruct this.
How It Works
Step 1: The first service creates the trace
When a request enters the system (e.g., API gateway receives POST /checkout), the OTel SDK generates a new trace ID (128-bit random hex) and creates the root span. This span records the start time, operation name, and service identity.
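A rough stdlib-only sketch of what the SDK does at the entry point (real SDKs also make the sampling decision here and attach resource metadata; the function names are illustrative):

```python
import secrets
import time

def new_trace_context():
    # 128-bit trace ID and 64-bit span ID, hex-encoded per W3C Trace Context
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def start_root_span(operation_name, service_name):
    # The root span has no parent; every descendant span shares its trace_id
    ctx = new_trace_context()
    return {
        **ctx,
        "parent_span_id": None,
        "operation_name": operation_name,
        "service": service_name,
        "start_time": time.time(),
    }

root = start_root_span("POST /checkout", "api-gateway")
```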
Step 2: Context propagates through headers
When the first service calls a downstream service, it injects trace context into the outgoing HTTP request headers. The traceparent header follows the W3C Trace Context standard:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  └─ trace-id (16 bytes, hex)      └─ parent        └─ trace flags
             └─ version                             span-id          (01 = sampled)
```
Every downstream service extracts the trace ID and parent span ID from these headers, creates its own span with the same trace ID, and passes the context forward. This is how spans from 10 different services end up in the same trace tree.
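Injection and extraction can be sketched with plain dicts standing in for HTTP headers (in practice the OpenTelemetry SDK's propagators do this for you):

```python
def inject_traceparent(headers, trace_id, span_id, sampled=True):
    # Called by the client before an outbound request: encode the current
    # span's identity in the W3C traceparent format
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract_traceparent(headers):
    # Called by the server on an inbound request: recover the trace ID and
    # the caller's span ID, which becomes this service's parent span ID
    version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": flags == "01",
    }
```

The receiving service then creates its own span with the extracted `trace_id` and `parent_span_id`, and injects its own span ID into any further downstream calls.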
Step 3: Spans record timing and metadata
Each span captures:
```python
# What a span contains
span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "a1b2c3d4e5f6a7b8",
    "operation_name": "PaymentService.charge",
    "start_time": "2024-01-15T10:30:42.123Z",
    "end_time": "2024-01-15T10:30:42.903Z",
    "duration_ms": 780,
    "status": "OK",
    "attributes": {
        "http.method": "POST",
        "http.url": "/charge",
        "http.status_code": 200,
        "order.id": "ord_xyz789",
        "payment.amount_cents": 4999,
    },
    "events": [
        {"name": "retry_attempt", "timestamp": "...", "attributes": {"attempt": 2}}
    ],
}
```
Attributes are the richness layer. Auto-instrumentation adds HTTP metadata automatically. Manual instrumentation adds business context (order ID, payment amount) that makes traces actually useful for debugging business logic, not just infrastructure.
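A toy illustration of manual instrumentation: a context manager that times an operation, records attributes and events, and appends the finished span to a list. This is a sketch of the idea, not the OpenTelemetry API (which exposes it as `tracer.start_as_current_span` plus `span.set_attribute`):

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_span(name, collected, trace_id, parent_span_id=None, **attributes):
    # Build the span up front so the body can append events to it
    s = {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "operation_name": name,
        "attributes": dict(attributes),
        "events": [],
        "start_time": time.time(),
    }
    try:
        yield s
        s["status"] = "OK"
    except Exception:
        s["status"] = "ERROR"   # errors are recorded, then re-raised
        raise
    finally:
        s["end_time"] = time.time()
        s["duration_ms"] = (s["end_time"] - s["start_time"]) * 1000
        collected.append(s)     # a real SDK exports asynchronously instead

spans = []
with traced_span("PaymentService.charge", spans, "abc123",
                 **{"order.id": "ord_xyz789"}) as s:
    s["events"].append({"name": "retry_attempt", "attributes": {"attempt": 2}})
```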
Step 4: Spans ship to a backend
Each service sends its completed spans to a tracing backend (via OTel Collector). The backend stores them indexed by trace ID. When you query for trace abc123, the backend assembles all spans with that trace ID into a tree and renders the waterfall view.
```
Checkout                  |████████████████████ 900ms|
├─ Auth.verify            |██ 50ms|
├─ Inventory.check          |██ 60ms|
└─ Payment.charge             |█████████████████ 780ms|
   ├─ RateLimiter.check       |█|
   ├─ FraudCheck.evaluate      |██████████ 480ms|
   │  └─ MLInference.score      |██████████ 460ms|
   └─ Stripe.charge            |██████ 280ms| (parallel)
```
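The backend's assembly step is conceptually simple: group spans by trace ID, then link each span to its parent via `parent_span_id`. A minimal sketch, assuming the span field names from the example above:

```python
def build_trace_tree(spans):
    # Index spans by ID, then attach each node to its parent;
    # spans with no (or an unknown) parent become roots
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_span_id"])
        if parent:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots
```

A well-formed trace yields a single root; a broken propagation chain shows up here as multiple disconnected roots.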
Sequential spans compound latency. Parallel spans share it. I often see engineers identify a 2x speedup just by looking at a trace and realizing two sequential calls could be parallelized. That insight is invisible without tracing.
Key Components
| Component | Role |
|---|---|
| Trace | A tree of spans representing one end-to-end request. Identified by a unique trace ID. |
| Span | A single unit of work (service call, DB query, cache read). Has start/end time, status, attributes. |
| Trace context | The traceparent + tracestate HTTP headers that propagate trace/span IDs across service boundaries. |
| OTel SDK | Library in each service that creates spans, injects/extracts context, and exports span data. |
| OTel Collector | Central pipeline that receives spans, applies sampling and enrichment, and exports to backends. |
| Tracing backend (Jaeger, Tempo, Zipkin) | Stores spans indexed by trace ID. Renders waterfall views. Supports trace search. |
| Span attributes | Key-value metadata on spans: HTTP method, status code, user ID, business context fields. |
| Span events | Timestamped log-like records within a span: retry attempts, cache hits, error details. |
Types / Variations
Sampling strategies
| Strategy | Decision point | Pros | Cons | Best for |
|---|---|---|---|---|
| Always-on (100%) | Every request traced | Complete visibility | Extremely expensive at scale | Low-traffic services, staging environments |
| Head-based | At request entry | Simple, low overhead | Misses rare errors, slow traces | High-traffic with uniform patterns |
| Tail-based | After trace completes | Captures all errors and anomalies | Requires buffering, more infrastructure | Production services with rare but critical failures |
| Rule-based | Per-request rules | Targeted: always trace /checkout, 1% of /health | Complex configuration | Mixed-criticality endpoints |
| Adaptive | Dynamic rate adjustment | Adjusts to traffic volume | Requires rate estimation | Bursty traffic patterns |
My recommendation: start with head-based at 1-5% for general coverage. Add tail-based sampling when you need to guarantee capture of all error and high-latency traces. The infrastructure cost of tail-based is real (buffering all spans requires memory), but the debugging value is worth it for any service on the critical user path.
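A head-based decision with per-route rules can be sketched in a few lines (route names and rates here are hypothetical; the decision is made once at entry and propagated downstream in the traceparent sampled flag):

```python
import random

ALWAYS_TRACE = {"/checkout", "/payment"}   # hypothetical critical endpoints

def head_sample(route, base_rate=0.05):
    # Rule-based head sampling: critical routes are always traced,
    # everything else at a base probability
    if route in ALWAYS_TRACE:
        return True
    return random.random() < base_rate
```

Note what this cannot do: the decision happens before the request executes, so a rare error on `/health` sampled at 5% is usually lost. That gap is exactly what tail-based sampling closes.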
Tracing backends
| Backend | Architecture | Storage | Strengths |
|---|---|---|---|
| Jaeger (Uber) | Microservice-based collectors + storage | Cassandra, Elasticsearch, or Kafka | Mature, battle-tested at Uber scale |
| Zipkin (Twitter) | Single binary or distributed | In-memory, MySQL, Cassandra, ES | Simple setup, good for getting started |
| Grafana Tempo | Object-storage-native | S3/GCS/Azure Blob (columnar) | Lowest cost at scale, no index to maintain |
| AWS X-Ray | Managed service | AWS-managed | Zero ops, native AWS integration |
| Datadog APM | SaaS | Datadog-managed | Full-stack correlation, ML-powered insights |
The industry is moving toward Tempo-style backends that store traces in object storage (S3) without maintaining a traditional index. This dramatically reduces operational cost. The trade-off is that searches by attributes (find all traces where user_id=123) require scanning, which is slower than indexed backends but acceptable for most debugging workflows where you start from a trace ID.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Pinpoints exact slow component in seconds | Instrumentation overhead (1-3% latency per service for context injection) |
| Visualizes causal relationships between services | Sampling means you never have 100% coverage in production |
| Makes parallelization opportunities visible | Requires consistent propagation across all services (one un-instrumented service breaks the chain) |
| Enables SLO measurement per-dependency | Storage costs grow with span count and retention |
| Vendor-neutral via OpenTelemetry and W3C headers | Tail-based sampling adds buffering complexity |
| Auto-instrumentation covers most framework spans | Manual instrumentation needed for business context |
The fundamental tension: coverage vs. cost. More spans and higher sampling rates give better debugging power, but storage and processing costs scale linearly with span volume. At 100K RPS with 5 spans per request, 100% sampling produces 43 billion spans per day. Even at 1 KB per span, that's 43 TB/day. Sampling is not optional at scale.
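The back-of-envelope math from the paragraph above, as a reusable calculation:

```python
def daily_span_volume(rps, spans_per_request, sample_rate=1.0):
    # Spans generated per day at a given sampling rate (86,400 seconds/day)
    return round(rps * spans_per_request * sample_rate * 86_400)

full = daily_span_volume(100_000, 5)            # 43.2 billion spans/day at 100%
storage_tb = full * 1_000 / 1e12                # ~43 TB/day at ~1 KB per span
sampled = daily_span_volume(100_000, 5, 0.01)   # ~432 million spans/day at 1%
```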
The broken chain problem
If even one service in the request path doesn't propagate trace context headers, the trace breaks into two disconnected fragments. In interviews, mention this as the primary operational challenge: "The hardest part of tracing isn't the backend, it's ensuring 100% of services propagate context. One un-instrumented service breaks the chain." Service meshes (Envoy, Linkerd) can inject context at the proxy layer, bypassing the application entirely.
When to Use It / When to Avoid It
Use distributed tracing when:
- You have 3+ services on the critical request path (even 3 services make ad-hoc log correlation painful)
- Latency debugging is a recurring on-call burden (traces turn 40-minute investigations into 30-second glances)
- You need to set per-dependency SLOs (e.g., "calls to PaymentService must complete in < 200ms")
- You're planning to parallelize sequential calls and need data on which calls are actually sequential
- Compliance requires audit trails showing exactly which services processed a request
Avoid or defer when:
- You have a single monolith with no service-to-service calls (profiling is more useful than tracing for single-process latency)
- Your traffic is under 100 RPS and you can affordably log every request with full context
- You haven't yet invested in structured logging (get structured logs with trace IDs first, then add tracing)
The honest answer: if you're running microservices in production, you need tracing. The question is how soon. My recommendation is to add auto-instrumentation the moment you hit 3 services. The cost is minimal (one OTel agent JAR or Python wrapper) and the debugging value is immediate.
Real-World Examples
Uber and Jaeger. Uber built Jaeger in 2015 to debug latency across their rapidly growing microservice fleet (now 4,000+ services). Before Jaeger, debugging a single slow ride-booking request required correlating logs across dozens of services manually. After Jaeger, engineers could pull up a trace and see the full 20+ service call chain with timing. Jaeger processes millions of traces per day at Uber. The key lesson: Uber found that the biggest ROI wasn't in rare catastrophic failures; it was in the everyday latency investigations that every on-call engineer deals with. Reducing median debug time from 45 minutes to 5 minutes, multiplied across hundreds of engineers, saved thousands of engineering hours per quarter.
Google and Dapper. Google's 2010 Dapper paper is the origin of modern distributed tracing. Dapper was designed to trace requests across Google's massive internal services (Search, Ads, Gmail, all built on shared infrastructure). The paper introduced the concepts of traces, spans, and annotations that every tracing system since has adopted. Key insight from the paper: Dapper used always-on tracing with low overhead (< 0.01% latency impact) achieved through aggressive sampling and asynchronous span export. The paper showed that even 0.1% sampling was sufficient to capture representative traces for debugging, as long as 100% of error traces were always captured.
Grafana Tempo. Tempo took a different approach to trace storage. Instead of indexing spans in Elasticsearch or Cassandra (expensive at scale), Tempo stores traces as columnar data in object storage (S3/GCS). Lookups by trace ID are fast because traces are stored contiguously. The trade-off: you can't search by arbitrary attributes without a separate index (Tempo uses Grafana Loki logs as the search index, querying log lines for trace IDs). This architecture reduced Grafana Labs' trace storage costs by over 10x compared to Jaeger-on-Elasticsearch. The lesson: at very high scale, the storage backend architecture matters more than the tracing protocol.
How This Shows Up in Interviews
When to bring it up
Mention distributed tracing whenever your system design has 3+ services communicating synchronously. After drawing the architecture, say: "I'd add OpenTelemetry auto-instrumentation to each service, with W3C trace context propagation, so any latency regression is immediately diagnosable from the trace waterfall."
Also bring it up when the interviewer says "how would you debug a latency issue across these services?" That's a direct invitation.
Depth expected at senior/staff level
- Explain traces, spans, and parent-child relationships
- Describe W3C Trace Context header format and how propagation works
- Compare head-based vs. tail-based sampling with trade-offs
- Distinguish auto-instrumentation from manual instrumentation
- Name OpenTelemetry as the standard and at least one backend (Jaeger, Tempo)
- Explain critical path analysis: sequential vs. parallel spans
- Mention the broken chain problem (un-instrumented services)
Interview shortcut: the trace pitch
"Each service runs the OpenTelemetry SDK which auto-creates spans for HTTP and database calls. Trace context propagates via W3C traceparent headers. I'd use tail-based sampling to guarantee we capture all error and slow traces. The waterfall view lets us pinpoint the slow span in seconds instead of correlating logs for 30 minutes." That answer covers the mechanisms interviewers want to hear.
Common follow-up questions
| Interviewer asks | Strong answer |
|---|---|
| "How does trace context propagate?" | "The calling service injects a traceparent header with the trace ID and current span ID. The receiving service extracts it, creates a child span with the same trace ID, and passes it forward. W3C Trace Context is the standard. For async (message queues), the trace context goes into message headers instead of HTTP headers." |
| "What if one service doesn't propagate headers?" | "The trace breaks into disconnected fragments. That service becomes a black box, and the parent span shows the total duration of the call but you can't see what happened inside. The fix is auto-instrumentation (OTel agents) or, if the service can't be modified, a service mesh sidecar that injects context at the proxy layer." |
| "How do you handle sampling?" | "Head-based sampling decides at entry (simple but misses rare errors). Tail-based sampling buffers all spans and decides after completion (captures all errors/slow traces, but needs more memory). I'd use tail-based for any service on the critical path, with rules to keep 100% of errors and requests above the p99 latency threshold." |
| "How do traces relate to logs and metrics?" | "Metrics tell you something is wrong. Traces tell you where in the call chain. Logs tell you the specific context (error message, request body). They're connected by trace_id: I can go from a metric spike → sample traces from that time window → filter logs by trace_id to get full context." |
| "What's the overhead of tracing?" | "Context injection/extraction adds 1-3ms per service call. Span export is asynchronous (batched, non-blocking). The main cost is storage, not runtime. At 100K RPS with 5 spans/request, 1% sampling produces ~430M spans/day. At 1 KB/span, that's ~430 GB/day of storage." |
Test Your Understanding
Quick Recap
- Distributed tracing builds a causally connected timeline (trace) of a single request across all services it touches, composed of spans linked by parent-child relationships.
- Trace context propagates via W3C `traceparent` HTTP headers. Each service extracts the trace ID and parent span ID, creates a child span, and forwards the context to its own downstream calls.
- Head-based sampling decides at the start (simple, misses rare errors). Tail-based sampling decides after the trace completes (captures all errors and slow traces, but requires buffering).
- Auto-instrumentation (OTel agents) covers HTTP, database, and framework spans with zero code changes. Manual instrumentation adds business context attributes that make traces useful for debugging logic, not just infrastructure.
- In a trace waterfall, sequential spans compound latency and parallel spans share it. The highest-leverage optimization is often converting sequential calls to parallel.
- One un-instrumented service breaks the trace chain. Service meshes can inject trace context at the proxy layer, bypassing the application.
- OpenTelemetry is the vendor-neutral standard. Mention it by name, along with W3C Trace Context, to signal operational maturity in interviews.
Related Concepts
- Observability — The umbrella discipline that includes tracing alongside metrics and logs. Distributed tracing is one of the three pillars, and understanding how all three correlate is essential for effective debugging.
- Microservices — The architecture pattern that creates the need for distributed tracing. More services means more network boundaries, more latency sources, and more places for context to get lost.
- Service Mesh — Sidecar proxies can auto-inject trace context at the network layer, solving the "broken chain" problem for services that can't be modified. A service mesh is the fastest path to 100% trace coverage.
- Circuit Breaker — Trace spans naturally surface circuit breaker behavior: a span to a failing dependency shows retries, timeouts, and eventual circuit open. Tracing makes circuit breaker tuning data-driven instead of guesswork.
- Message Queues — Async communication via queues requires special trace context propagation through message headers. Understanding how tracing works across async boundaries is a common interview follow-up.