Tracing Pipeline
Design an end-to-end request tracing system like Jaeger or Zipkin that correlates logs, spans, and errors across microservices, giving on-call engineers full visibility into every cross-service call in production.
What is a distributed tracing system?
A distributed tracing system tracks a single request as it flows through multiple microservices, recording timing and metadata at each hop. Visit an e-commerce site and your checkout request might touch an API gateway, a cart service, an inventory service, a payment service, and a notifications service before completing.
Without tracing, a 2-second slowdown is a mystery. With tracing, you see the inventory service added 1.8 of those seconds.
The apparent problem is observability. The hard engineering problem is doing it without adding measurable latency to production requests while ingesting millions of spans per second, and assembling them into a coherent call graph on demand. I'd lead with that latency constraint in an interview because it immediately separates this from a simple logging problem.
This question tests your knowledge of context propagation, async buffering patterns, write-heavy storage design, and the tradeoffs between head-based and tail-based sampling.
Functional Requirements
Core Requirements
- Every request gets a globally unique trace ID that propagates through all downstream services.
- Each service emits a span (start time, duration, operation name, metadata) linked to the parent trace.
- Engineers can search by trace ID, service name, or error status and see the full call graph.
- The system surfaces tail latency and error hot spots across the service graph.
Below the Line (out of scope)
- Log aggregation and log-to-trace correlation
- Continuous profiling (CPU flamegraphs, heap allocation tracking)
- Real-time alerting on trace data
The hardest part in scope: Collecting spans without impacting production latency. Every instrumented service is a potential victim of a slow or unavailable collector. The SDK design and the collector pipeline together must make span emission invisible to request throughput.
Log aggregation is below the line because it has a different ingestion pipeline and storage model. To add correlation, I would embed the trace ID into every log line and store it as a searchable field in the log aggregation system. The trace ID becomes a join key between logs and traces.
Continuous profiling is below the line because it requires a separate sampling profiler in each service process and a flame graph rendering pipeline. It does not share the span collection pipeline.
Real-time alerting is below the line because it requires a stream processing layer on top of the span pipeline. To add it, I would consume from the same Kafka topic, compute error rate and p99 latency per service in a sliding window, and trigger alerts on threshold breaches.
Non-Functional Requirements
Core Requirements
- Low overhead: Span emission adds less than 1ms to production request latency. The instrumentation must be invisible to throughput.
- Ingestion scale: Handle 1M spans per second from 500 or more services.
- Query latency: Retrieving all spans for a single trace ID completes in under 2 seconds.
- Retention: Spans retained for 30 days; search indexes retained for 7 days.
- Availability: 99.9% for the collector pipeline. A small fraction of dropped spans under extreme load is acceptable; blocking production services is not.
Below the Line
- Sub-second trace assembly
- Multi-region span collection with guaranteed cross-region delivery
- Guaranteed exactly-once delivery (at-least-once is sufficient)
Read/write ratio: This system is extremely write-heavy. A deployment handling 50K requests/second where each request touches 10 services generates 500K spans/second. Engineers query traces during incidents, not continuously. Expect a 1000:1 write-to-read ratio. Every architectural decision in the storage and ingestion tiers traces back to this write dominance.
I'd call out this ratio early in an interview because it kills the most obvious storage choices. If someone suggests PostgreSQL, you point at the write rate and the conversation shifts immediately.
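The arithmetic behind that ratio can be checked in a few lines. This is a back-of-envelope sketch using only the figures quoted above; the assumption that each service hop emits exactly one span is a simplification.

```python
# Back-of-envelope span volume from the figures quoted above.
requests_per_second = 50_000
services_per_request = 10   # assumes one span per service hop

spans_per_second = requests_per_second * services_per_request
spans_per_day = spans_per_second * 86_400

# At the stated 1000:1 write-to-read ratio, reads are a rounding error.
write_to_read_ratio = 1_000
trace_reads_per_second = spans_per_second / write_to_read_ratio

print(f"{spans_per_second:,} spans/s written")
print(f"{spans_per_day:,} spans/day")
print(f"{trace_reads_per_second:,.0f} trace reads/s")
```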
The less-than-1ms overhead constraint rules out any synchronous network call on the hot path. Span data must be buffered in-process and flushed asynchronously on a background thread. The collector pipeline must be fully decoupled from production request latency.
The 2-second trace assembly target shapes the storage schema: fast trace lookup requires partitioning span storage by trace ID, not by time. A time-partitioned schema would scan every time bucket to find spans for a single trace, which is too slow at our query target.
Core Entities
- Trace: The end-to-end record of one request journey. Identified by a 128-bit trace ID. Contains a root span and zero or more child spans. Has a computed status (ok, error, or partial) based on its constituent spans.
- Span: One unit of work within a trace. Fields: trace_id, span_id, parent_span_id (null for the root span), service_name, operation_name, start_time_unix_ns, duration_ms, status (ok or error), and a tags map for arbitrary key-value metadata such as http.status_code, user_id, or an exception message.
- SpanContext: The lightweight propagation carrier passed between services at every RPC boundary. Contains trace_id, span_id, and a sampling_flag. This is what travels inside the W3C traceparent header, not the full Span record.
- Service: A named instrumented service in the dependency graph. Used for building the service topology map and for search facets (filter all traces involving payment-service).
The full schema, column types, and Cassandra partition key design are deferred to the storage deep dive. These four entities are sufficient to drive the API and High-Level Design.
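As a rough illustration of how Span and SpanContext relate, here is a minimal Python sketch. The field names follow the entity list above; the class layout, the hex-string ID encoding, and the context() helper are assumptions made for the example, not a schema the design prescribes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """Lightweight carrier that crosses every RPC boundary."""
    trace_id: str   # 128-bit ID, assumed here to be 32 hex characters
    span_id: str    # 64-bit ID, assumed here to be 16 hex characters
    sampled: bool   # the sampling_flag


@dataclass
class Span:
    """Full record emitted to the collector; never sent between services."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]   # None for the root span
    service_name: str
    operation_name: str
    start_time_unix_ns: int
    duration_ms: float
    status: str = "ok"              # "ok" or "error"
    tags: dict[str, str] = field(default_factory=dict)

    def context(self, sampled: bool = True) -> SpanContext:
        # Hypothetical helper: the subset of a span that gets propagated.
        return SpanContext(self.trace_id, self.span_id, sampled)


root = Span("a" * 32, "b" * 16, None, "api-gateway", "POST /checkout", 0, 12.5)
ctx = root.context()
child = Span(ctx.trace_id, "c" * 16, ctx.span_id, "cart-service", "GET /cart", 1_000, 3.0)
```

The child span links to its caller only through the two IDs carried in the context, which is why the carrier can stay so small.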
API Design
There are two distinct API surfaces: the ingest API used by SDK instrumentation libraries running inside each service, and the query API used by the UI and on-call engineers.
Ingest API (SDK to Collector):
POST /v1/spans
Body: [
{
trace_id, span_id, parent_span_id?,
service_name, operation_name,
start_time_unix_ns, duration_ms,
status, tags: { key: value }
},
...
]
Response: 202 Accepted
The endpoint accepts batches, not individual spans. The SDK buffers spans in-process and flushes in batches of 100-500 spans every 5 seconds. Batch ingest reduces network overhead by roughly 100x compared to one-span-per-request.
The 202 response is fire-and-forget: the SDK does not wait for storage confirmation.
Query API:
GET /api/traces/{trace_id}
Response: { trace_id, root_span, spans: [...], status, duration_ms }
GET /api/traces?service=&operation=&error=&min_duration_ms=&start=&end=&limit=50&cursor=
Response: { traces: [...summary...], next_cursor: "..." }
GET /api/services
Response: { services: ["api-gateway", "cart-service", "payment-service", ...] }
GET /api/services/{service}/operations
Response: { operations: ["POST /checkout", "GET /cart", ...] }
The search endpoint uses cursor-based pagination because time-range queries over trace metadata can span millions of records. Offset-based pagination is unstable when new spans arrive during a query window.
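To see why a cursor stays stable where an offset does not, consider this in-memory sketch. The cursor encoding (base64 of the last row's start time and trace ID) and the function names are assumptions for illustration; a real query service would push the same boundary condition down into the index query.

```python
import base64
import json


def encode_cursor(start_time_unix_ns, trace_id):
    raw = json.dumps([start_time_unix_ns, trace_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    start_time_unix_ns, trace_id = json.loads(base64.urlsafe_b64decode(cursor))
    return start_time_unix_ns, trace_id


def search_page(summaries, limit=50, cursor=None):
    """Return one page of trace summaries, newest first."""
    def sort_key(s):
        return (s["start_time_unix_ns"], s["trace_id"])

    ordered = sorted(summaries, key=sort_key, reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keep only rows strictly older than the last row already returned.
        ordered = [s for s in ordered if sort_key(s) < boundary]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(*sort_key(page[-1])) if has_more else None
    return {"traces": page, "next_cursor": next_cursor}


data = [{"trace_id": f"t{i}", "start_time_unix_ns": i} for i in range(5)]
page1 = search_page(data, limit=2)
# A new trace arrives between page fetches; the next page is unaffected.
data.append({"trace_id": "t9", "start_time_unix_ns": 9})
page2 = search_page(data, limit=2, cursor=page1["next_cursor"])
```

With offset=2, the newly arrived trace would have shifted every row down by one and page 2 would have repeated a row from page 1.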
GET /api/traces/{trace_id} is the most latency-sensitive query. On-call engineers arrive at a trace ID from a log line or an alert. This lookup must return the full call graph in under 2 seconds.
Ingest vs. query separation: The ingest and query APIs run on separate services with independent scaling. The ingest path handles 1M span writes/second; the query path handles a few hundred reads per second. They share no code path and no server pool.
High-Level Design
1. Trace ID propagation across service boundaries
The fundamental problem: when Service A calls Service B, Service B needs to know the trace ID so it can link its span to the same trace. Without an explicit mechanism, every service creates an isolated span with no parent relationship.
Naive approach: carry only a custom X-Request-ID header with the trace ID. Service B can group its span with Service A's, but it does not know which of Service A's spans was the parent. I start with this naive version intentionally because most engineers have seen X-Request-ID in production and it anchors the conversation.
Components:
- SDK (client agent): A library embedded in each service process. Creates spans, records start and end times, and reads or writes the trace header on all outbound calls.
- Service A / Service B: Production services, unchanged except for the SDK running inside them.
What breaks: A plain request ID tells you two spans belong to the same trace, but not their parent-child relationship. If Service A calls three downstream services in parallel, you cannot tell which call was the direct parent of a given error span. The hierarchical call tree is lost.
Service B knows the trace ID but cannot record a parent_span_id. You can group spans by trace, but you cannot reconstruct which service called which.
Evolved approach: W3C traceparent header carrying both trace ID and parent span ID.
The key insight is that the propagation carrier must encode the parent span ID, not just the trace ID. The W3C traceparent standard defines a single compact header: traceparent: 00-{trace_id}-{parent_span_id}-{flags}. The caller's span ID becomes the parent_span_id for the callee.
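A minimal sketch of inject and extract for this header follows. The function names are hypothetical and the validation is deliberately small (lowercase hex, fixed lengths, no all-zero IDs); a production SDK would also handle the companion tracestate header and future version values.

```python
import re
import secrets

# version - trace_id (32 hex) - parent span_id (16 hex) - flags
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    _version, trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


def start_span(incoming_headers):
    """Continue the caller's trace, or start a new one at the root."""
    parent = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parent is None:
        trace_id, parent_span_id, sampled = secrets.token_hex(16), None, True
    else:
        trace_id, parent_span_id, sampled = parent
    span_id = secrets.token_hex(8)
    # The callee will see this service's span_id as its parent_span_id.
    outbound = {"traceparent": format_traceparent(trace_id, span_id, sampled)}
    return trace_id, span_id, parent_span_id, outbound


trace_id, span_a, parent_a, headers_to_b = start_span({})            # Service A (root)
_, span_b, parent_b, headers_to_c = start_span(headers_to_b)         # Service B
```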
Components:
- SDK (updated): Injects traceparent on all outbound calls (HTTP headers, gRPC metadata, message headers). Extracts it on all inbound requests.
- Services A, B, C: Each extracts trace_id and parent_span_id from the header, creates a child span with parent_span_id set to the caller's span ID, then injects its own span ID on any further outbound calls.
Request walkthrough:
- User hits Service A. No traceparent header present: this is the trace root.
- Service A generates a 128-bit trace_id and a 64-bit span_id_A.
- Service A calls Service B with header traceparent: 00-{trace_id}-{span_id_A}-01.
- Service B extracts trace_id and sets parent_span_id = span_id_A. Generates its own span_id_B.
- Service B calls Service C with header traceparent: 00-{trace_id}-{span_id_B}-01.
- Each service emits its completed span to the collector asynchronously.
Service A is the root: it generates the trace ID and span ID, then embeds both in the outbound traceparent header. Service B extracts the trace ID and the caller's span ID, creates a child span with parent_span_id = span_id_A, and forwards its own span ID to Service C. The full call tree can be reconstructed from span records alone.
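Reconstruction from span records can be sketched as a single pass that groups spans by parent_span_id. The dictionary shape and the build_call_tree name are assumptions for the example; spans are deliberately supplied out of order, as they would arrive from independent services.

```python
from collections import defaultdict


def build_call_tree(spans):
    """Render an indented call tree from unordered span records."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_span_id"] is None:
            root = span
        else:
            children[span["parent_span_id"]].append(span)

    def render(span, depth=0):
        lines = [f"{'  ' * depth}{span['service_name']} ({span['duration_ms']} ms)"]
        # Siblings are shown in the order they started.
        for child in sorted(children[span["span_id"]], key=lambda s: s["start_time_unix_ns"]):
            lines.extend(render(child, depth + 1))
        return lines

    return render(root) if root is not None else []


spans = [
    {"span_id": "c", "parent_span_id": "b", "service_name": "service-c",
     "start_time_unix_ns": 30, "duration_ms": 30},
    {"span_id": "a", "parent_span_id": None, "service_name": "service-a",
     "start_time_unix_ns": 10, "duration_ms": 120},
    {"span_id": "b", "parent_span_id": "a", "service_name": "service-b",
     "start_time_unix_ns": 20, "duration_ms": 80},
]
tree = build_call_tree(spans)
```

With only a shared request ID and no parent_span_id, the children map could not be built and every span would sit at the same depth.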
For gRPC, the context travels in metadata with the same field names. For Kafka and SQS, it travels in message headers or attributes. The SDK handles all transport types; application code does not change.
2. Services emit spans without impacting production latency
With context propagation solved, the next problem is getting spans out of each service to a central store without adding latency to production requests.
Naive approach: Each span is sent synchronously via HTTP POST to a central collector immediately after the operation completes.
Components:
- SDK: Records span, calls POST /v1/spans synchronously before returning from the operation.
- Collector: Receives spans, writes to storage.
What breaks: At 50K requests/second with 10 spans per request, the SDKs collectively make 500K HTTP calls/second to the collector, one per span. Each call adds 1-5ms of network round-trip to the production request thread. The collector becomes a bottleneck and a source of cascading latency. If the collector is slow or unavailable, every production service degrades.
Evolved approach: async in-process SDK buffer with background flush.
The fix is to decouple span recording from span transmission. The SDK appends spans to an in-process ring buffer (non-blocking, O(1) enqueue). A background thread flushes the buffer to the collector in batches every 5 seconds or when the batch reaches 500 spans.
The production request thread never waits for a network call. I consider this the single most important design decision in the entire system. Get this wrong and you have built an observability tool that degrades the thing it observes.
Components:
- SDK in-process buffer: Ring buffer with a configurable capacity (default 50K span slots). If the buffer fills before the background thread can flush, the oldest entries are dropped. A dropped span is far better than a blocked production thread.
- Collector (stateless): Validates and normalizes span batches. Publishes to Kafka. Does not write to storage directly; no state, scales horizontally by adding instances.
- Kafka (spans topic): Decouples the burst ingest rate from the sustained storage write rate. If Cassandra has a slow moment, Kafka holds the backlog. Retains spans for 1 hour as a replay buffer.
- Storage Worker: Consumes from Kafka. Batch-writes spans to Cassandra at 10K rows per batch.
Request walkthrough:
- Service A completes an operation. SDK calls span.end() in the production thread.
- SDK enqueues the span into the ring buffer. Non-blocking, under 1 microsecond.
- Background SDK thread wakes every 5 seconds (or when batch reaches 500 spans).
- Background thread sends POST /v1/spans with the batch to the collector. Production thread is unblocked.
- Collector validates, re-serializes, and publishes the batch to Kafka.
- Storage Worker consumes from Kafka and batch-inserts into Cassandra.
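The SDK side of this walkthrough can be sketched as follows. This is a simplified illustration, not a production SDK: the SpanBuffer class and its method names are hypothetical, a bounded collections.deque stands in for a lock-free ring buffer (it discards the oldest entry when full), and send_batch is a placeholder for the HTTP call to POST /v1/spans.

```python
import threading
from collections import deque


class SpanBuffer:
    def __init__(self, send_batch, capacity=50_000, max_batch=500, flush_interval_s=5.0):
        self._buf = deque(maxlen=capacity)  # full deque drops the oldest span on append
        self._send_batch = send_batch       # placeholder for the collector call
        self._max_batch = max_batch
        self._interval = flush_interval_s
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def record(self, span):
        """Called on the request thread: O(1), no I/O, never blocks."""
        self._buf.append(span)
        if len(self._buf) >= self._max_batch:
            self._wake.set()  # flush early once a full batch is waiting

    def flush_once(self):
        """Send up to one batch; returns the number of spans sent."""
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._buf.popleft())
            except IndexError:
                break
        if batch:
            try:
                self._send_batch(batch)
            except Exception:
                pass  # a failing collector costs spans, never request latency
        return len(batch)

    def _run(self):
        while not self._stop.is_set():
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
            while self.flush_once() == self._max_batch:
                pass  # keep draining while full batches are available

    def close(self):
        self._stop.set()
        self._wake.set()
        if self._thread.is_alive():
            self._thread.join()
        while self.flush_once():
            pass  # final drain on shutdown


sent = []
buffer = SpanBuffer(sent.append, capacity=3, max_batch=2)
for i in range(5):
    buffer.record({"span_id": i})  # capacity 3: spans 0 and 1 are dropped
buffer.close()
```

Swallowing the send exception is the important choice here: when the collector is down, the failure mode is missing traces rather than slower checkouts.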