Tracing Pipeline
Design an end-to-end request tracing system like Jaeger or Zipkin that correlates logs, spans, and errors across microservices, giving on-call engineers full visibility into every cross-service call in production.
What is a distributed tracing system?
A distributed tracing system tracks a single request as it flows through multiple microservices, recording timing and metadata at each hop. Visit an e-commerce site and your checkout request might touch an API gateway, a cart service, an inventory service, a payment service, and a notifications service before completing.
Without tracing, a 2-second slowdown is a mystery. With tracing, you see the inventory service added 1.8 of those seconds.
The apparent problem is observability. The hard engineering problem is doing it without adding measurable latency to production requests while ingesting millions of spans per second, and assembling them into a coherent call graph on demand. I'd lead with that latency constraint in an interview because it immediately separates this from a simple logging problem.
This question tests your knowledge of context propagation, async buffering patterns, write-heavy storage design, and the tradeoffs between head-based and tail-based sampling.
Functional Requirements
Core Requirements
- Every request gets a globally unique trace ID that propagates through all downstream services.
- Each service emits a span (start time, duration, operation name, metadata) linked to the parent trace.
- Engineers can search by trace ID, service name, or error status and see the full call graph.
- The system surfaces tail latency and error hot spots across the service graph.
Below the Line (out of scope)
- Log aggregation and log-to-trace correlation
- Continuous profiling (CPU flamegraphs, heap allocation tracking)
- Real-time alerting on trace data
The hardest part in scope: Collecting spans without impacting production latency. Every instrumented service is a potential victim of a slow or unavailable collector. The SDK design and the collector pipeline together must make span emission invisible to request throughput.
Log aggregation is below the line because it has a different ingestion pipeline and storage model. To add correlation, I would embed the trace ID into every log line and store it as a searchable field in the log aggregation system. The trace ID becomes a join key between logs and traces.
Continuous profiling is below the line because it requires a separate sampling profiler in each service process and a flame graph rendering pipeline. It does not share the span collection pipeline.
Real-time alerting is below the line because it requires a stream processing layer on top of the span pipeline. To add it, I would consume from the same Kafka topic, compute error rate and p99 latency per service in a sliding window, and trigger alerts on threshold breaches.
Non-Functional Requirements
Core Requirements
- Low overhead: Span emission adds less than 1ms to production request latency. The instrumentation must be invisible to throughput.
- Ingestion scale: Handle 1M spans per second from 500 or more services.
- Query latency: Retrieving all spans for a single trace ID completes in under 2 seconds.
- Retention: Spans retained for 30 days; search indexes retained for 7 days.
- Availability: 99.9% for the collector pipeline. A small fraction of dropped spans under extreme load is acceptable; blocking production services is not.
Below the Line
- Sub-second trace assembly
- Multi-region span collection with guaranteed cross-region delivery
- Guaranteed exactly-once delivery (at-least-once is sufficient)
Read/write ratio: This system is extremely write-heavy. A deployment handling 50K requests/second where each request touches 10 services generates 500K spans/second. Engineers query traces during incidents, not continuously. Expect a 1000:1 write-to-read ratio. Every architectural decision in the storage and ingestion tiers traces back to this write dominance.
I'd call out this ratio early in an interview because it kills the most obvious storage choices. If someone suggests PostgreSQL, you point at the write rate and the conversation shifts immediately.
The less-than-1ms overhead constraint rules out any synchronous network call on the hot path. Span data must be buffered in-process and flushed asynchronously on a background thread. The collector pipeline must be fully decoupled from production request latency.
The 2-second trace assembly target shapes the storage schema: fast trace lookup requires partitioning span storage by trace ID, not by time. A time-partitioned schema would scan every time bucket to find spans for a single trace, which is too slow at our query target.
Core Entities
- Trace: The end-to-end record of one request journey. Identified by a 128-bit trace ID. Contains a root span and zero or more child spans. Has a computed status (ok, error, or partial) based on its constituent spans.
- Span: One unit of work within a trace. Fields: `trace_id`, `span_id`, `parent_span_id` (null for the root span), `service_name`, `operation_name`, `start_time_unix_ns`, `duration_ms`, `status` (ok or error), and a `tags` map for arbitrary key-value metadata such as `http.status_code`, `user_id`, or an exception message.
- SpanContext: The lightweight propagation carrier passed between services at every RPC boundary. Contains `trace_id`, `span_id`, and a `sampling_flag`. This is what travels inside the W3C `traceparent` header, not the full Span record.
- Service: A named instrumented service in the dependency graph. Used for building the service topology map and for search facets (filter all traces involving `payment-service`).
The full schema, column types, and Cassandra partition key design are deferred to the storage deep dive. These four entities are sufficient to drive the API and High-Level Design.
API Design
There are two distinct API surfaces: the ingest API used by SDK instrumentation libraries running inside each service, and the query API used by the UI and on-call engineers.
Ingest API (SDK to Collector):
POST /v1/spans
Body: [
{
trace_id, span_id, parent_span_id?,
service_name, operation_name,
start_time_unix_ns, duration_ms,
status, tags: { key: value }
},
...
]
Response: 202 Accepted
The endpoint accepts batches, not individual spans. The SDK buffers spans in-process and flushes in batches of 100-500 spans every 5 seconds. Batch ingest reduces network overhead by roughly 100x compared to one-span-per-request.
The 202 response is fire-and-forget: the SDK does not wait for storage confirmation.
Query API:
GET /api/traces/{trace_id}
Response: { trace_id, root_span, spans: [...], status, duration_ms }
GET /api/traces?service=&operation=&error=&min_duration_ms=&start=&end=&limit=50&cursor=
Response: { traces: [...summary...], next_cursor: "..." }
GET /api/services
Response: { services: ["api-gateway", "cart-service", "payment-service", ...] }
GET /api/services/{service}/operations
Response: { operations: ["POST /checkout", "GET /cart", ...] }
The search endpoint uses cursor-based pagination because time-range queries over trace metadata can span millions of records. Offset-based pagination is unstable when new spans arrive during a query window.
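As a minimal sketch of what an opaque cursor might contain (the encoding and field names here are illustrative assumptions, not a specified wire format): the cursor packs the sort keys of the last row returned, so the next page resumes strictly after that position regardless of newly arriving spans.

```python
import base64
import json

def encode_cursor(last_start_time_ns: int, last_trace_id: str) -> str:
    """Pack the sort keys of the last returned row into an opaque token."""
    raw = json.dumps([last_start_time_ns, last_trace_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple[int, str]:
    """Recover the (start_time, trace_id) position the next page resumes after."""
    ts, tid = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return ts, tid

# The server queries rows sorted by (start_time, trace_id) strictly after the
# cursor position, so rows arriving mid-pagination never shift earlier pages.
cursor = encode_cursor(1700000000000000000, "4bf92f3577b34da6a3ce929d0e0e4736")
```

Because the cursor is positional rather than an offset, a page boundary stays stable even while the index is ingesting new trace summaries.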
GET /api/traces/{trace_id} is the most latency-sensitive query. On-call engineers arrive at a trace ID from a log line or an alert. This lookup must return the full call graph in under 2 seconds.
Ingest vs. query separation: The ingest and query APIs run on separate services with independent scaling. The ingest path handles 1M writes/second; the query path handles a few hundred reads per second. They share no code path and no server pool.
High-Level Design
1. Trace ID propagation across service boundaries
The fundamental problem: when Service A calls Service B, Service B needs to know the trace ID so it can link its span to the same trace. Without an explicit mechanism, every service creates an isolated span with no parent relationship.
Naive approach: carry only a custom X-Request-ID header with the trace ID. Service B can group its span with Service A's, but it does not know which of Service A's spans was the parent. I start with this naive version intentionally because most engineers have seen X-Request-ID in production and it anchors the conversation.
Components:
- SDK (client agent): A library embedded in each service process. Creates spans, records start and end times, and reads or writes the trace header on all outbound calls.
- Service A / Service B: Production services, unchanged except for the SDK running inside them.
What breaks: A plain request ID tells you two spans belong to the same trace, but not their parent-child relationship. If Service A calls three downstream services in parallel, you cannot tell which call was the direct parent of a given error span. The hierarchical call tree is lost.
Service B knows the trace ID but cannot record a parent_span_id. You can group spans by trace, but you cannot reconstruct which service called which.
Evolved approach: W3C traceparent header carrying both trace ID and parent span ID.
The key insight is that the propagation carrier must encode the parent span ID, not just the trace ID. The W3C traceparent standard defines a single compact header: traceparent: 00-{trace_id}-{parent_span_id}-{flags}. The caller's span ID becomes the parent_span_id for the callee.
Components:
- SDK (updated): Injects `traceparent` on all outbound calls (HTTP headers, gRPC metadata, message headers). Extracts it on all inbound requests.
- Services A, B, C: Each extracts `trace_id` and `parent_span_id` from the header, creates a child span with `parent_span_id` set to the caller's span ID, then injects its own span ID on any further outbound calls.
Request walkthrough:
- User hits Service A. No `traceparent` header present: this is the trace root.
- Service A generates a 128-bit `trace_id` and a 64-bit `span_id_A`.
- Service A calls Service B with header `traceparent: 00-{trace_id}-{span_id_A}-01`.
- Service B extracts `trace_id` and sets `parent_span_id = span_id_A`. Generates its own `span_id_B`.
- Service B calls Service C with header `traceparent: 00-{trace_id}-{span_id_B}-01`.
- Each service emits its completed span to the collector asynchronously.
Service A is the root: it generates the trace ID and span ID, then embeds both in the outbound traceparent header. Service B extracts the trace ID and the caller's span ID, creates a child span with parent_span_id = span_id_A, and forwards its own span ID to Service C. The full call tree can be reconstructed from span records alone.
For gRPC, the context travels in metadata with the same field names. For Kafka and SQS, it travels in message headers or attributes. The SDK handles all transport types; application code does not change.
2. Services emit spans without impacting production latency
With context propagation solved, the next problem is getting spans out of each service to a central store without adding latency to production requests.
Naive approach: Each span is sent synchronously via HTTP POST to a central collector immediately after the operation completes.
Components:
- SDK: Records span, calls `POST /v1/spans` synchronously before returning from the operation.
- Collector: Receives spans, writes to storage.
What breaks: At 50K requests/second, the SDK makes 500K HTTP calls/second to the collector. Each call adds 1-5ms of network round-trip to the production request thread. The collector becomes a bottleneck and a source of cascading latency. If the collector is slow or unavailable, every production service degrades.
Evolved approach: async in-process SDK buffer with background flush.
The fix is to decouple span recording from span transmission. The SDK appends spans to an in-process ring buffer (non-blocking, O(1) enqueue). A background thread flushes the buffer to the collector in batches every 5 seconds or when the batch reaches 500 spans.
The production request thread never waits for a network call. I consider this the single most important design decision in the entire system. Get this wrong and you have built an observability tool that degrades the thing it observes.
Components:
- SDK in-process buffer: Ring buffer with a configurable capacity (default 50K span slots). If the buffer fills before the background thread can flush, the oldest entries are dropped. A dropped span is far better than a blocked production thread.
- Collector (stateless): Validates and normalizes span batches. Publishes to Kafka. Does not write to storage directly; no state, scales horizontally by adding instances.
- Kafka (spans topic): Decouples the burst ingest rate from the sustained storage write rate. If Cassandra has a slow moment, Kafka holds the backlog. Retains spans for 1 hour as a replay buffer.
- Storage Worker: Consumes from Kafka. Batch-writes spans to Cassandra at 10K rows per batch.
Request walkthrough:
- Service A completes an operation. SDK calls `span.end()` in the production thread.
- SDK enqueues the span into the ring buffer. Non-blocking, under 1 microsecond.
- Background SDK thread wakes every 5 seconds (or when the batch reaches 500 spans).
- Background thread sends `POST /v1/spans` with the batch to the collector. The production thread is never blocked.
- Collector validates, re-serializes, and publishes the batch to Kafka.
- Storage Worker consumes from Kafka and batch-inserts into Cassandra.
The collector is stateless and scales horizontally by adding instances behind a load balancer. Kafka is the buffer that makes everything asynchronous: ingestion speed is independent of storage write speed.
The 50K-slot ring buffer has a maximum capacity. Under sustained high-volume traffic where the background flush thread cannot keep up, the SDK drops the oldest spans from the buffer. This is a deliberate tradeoff: spans are best-effort. Blocking the production thread to preserve every span would add unpredictable latency to user requests.
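A sketch of that buffer, with a bounded deque standing in for a true lock-free ring buffer (class and parameter names are illustrative; the drop-oldest and background-flush behavior matches the design above):

```python
import collections
import threading
import time

class SpanBuffer:
    """In-process span buffer: cheap enqueue on the hot path, with a daemon
    thread flushing batches off the request thread. When the buffer is full,
    the deque's maxlen evicts the oldest spans rather than blocking."""

    def __init__(self, send_batch, capacity=50_000, batch_size=500, interval_s=5.0):
        self._buf = collections.deque(maxlen=capacity)  # full buffer drops oldest
        self._send = send_batch          # e.g. POST /v1/spans (injected for testing)
        self._batch_size = batch_size
        self._interval = interval_s
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def enqueue(self, span: dict) -> None:
        """Called from the production thread via span.end(). No I/O, no waiting."""
        with self._lock:
            self._buf.append(span)

    def _drain(self) -> list:
        with self._lock:
            n = min(len(self._buf), self._batch_size)
            return [self._buf.popleft() for _ in range(n)]

    def _flush_loop(self) -> None:
        while True:
            time.sleep(self._interval)
            batch = self._drain()
            if batch:
                try:
                    self._send(batch)  # network call happens here, off the hot path
                except Exception:
                    pass  # best-effort: a dropped batch beats a blocked service
```

Swallowing send failures is deliberate: retry queues inside the SDK would reintroduce unbounded memory growth, which is exactly what the bounded buffer exists to prevent.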
3. Engineers can search and retrieve the full call graph
With spans stored in Cassandra, engineers need two query capabilities: fetching a specific trace by ID (the most common on-call action), and searching by service name, error status, or duration range (for discovery when you do not have a trace ID yet).
Naive approach: query Cassandra directly on every search request with SELECT * FROM spans WHERE service_name = 'payment-service' AND status = 'error'.
What breaks: Cassandra does not support efficient secondary indexes on non-partition-key columns. A WHERE service_name = ? query requires a full cluster scan across all nodes. At 1B+ rows, this returns in 30+ seconds under load. You cannot satisfy latency-sensitive on-call workflows from Cassandra alone.
Components:
- Query Service: Reads from Cassandra by `trace_id`, assembles the span tree, and returns the full trace response.
- Indexer Worker: A second Kafka consumer group consuming the same spans topic. Writes span summaries (service name, operation, duration, error flag, trace ID) to Elasticsearch. Does not write the full span payload, which keeps the Elasticsearch cluster 5-8x smaller.
- Elasticsearch: Serves the search API with inverted-index speed over service name, operation name, error status, and duration range.
- Query API: Exposes `GET /api/traces/{trace_id}` and `GET /api/traces?...` to the UI.
Request walkthrough (trace by ID):
- Engineer pastes a trace ID from a log line into the Jaeger UI.
- UI calls `GET /api/traces/{trace_id}`.
- Query Service runs `SELECT * FROM spans WHERE trace_id = ?` against Cassandra. Single-partition read, returns in under 10ms.
- Query Service assembles spans into a tree using `parent_span_id` links.
- Query Service returns the full tree with timing, status, and tags for every span.
Request walkthrough (search by service + error):
- Engineer opens the error dashboard, filters by `service=payment-service&error=true`.
- Query Service executes a search against Elasticsearch using the inverted index.
- Elasticsearch returns matching trace IDs and summary data in under 500ms.
- UI displays the list. Engineer clicks a trace ID to fetch the full tree from Cassandra.
The indexer and storage workers belong to separate Kafka consumer groups. Both consume the same topic independently at their own pace. Cassandra gets the full payload; Elasticsearch gets only the 200-byte summary needed for search.
I'd highlight this dual-store pattern to an interviewer because it shows you understand that one storage engine rarely satisfies two fundamentally different access patterns. Cassandra does point lookups; Elasticsearch does full-text search. Trying to make one do both is where tracing systems break down.
The Elasticsearch cluster runs 5-8x smaller than Cassandra for the same span volume.
4. The system surfaces tail latency and error hot spots
With per-trace search working, the last requirement is system-level visibility: which services have the highest error rates right now, and where does tail latency originate across the dependency graph.
Naive approach: compute p99 latency per service on demand by querying all recent spans from Cassandra on every dashboard request.
What breaks: aggregating 30 seconds of spans from 500 services on every page load means reading millions of rows from Cassandra and computing percentiles in the Query Service memory. At load, this takes 10-30 seconds and hammers the storage tier. Service overview dashboards become unusable during incidents, which is exactly when engineers need them most.
Components added:
- Aggregation Worker: A third Kafka consumer group. Computes per-service, per-operation error rates and p99 latency in a sliding 5-minute window using in-memory counters. Writes results to Redis every 10 seconds.
- Redis (metrics): Stores pre-aggregated metrics keyed by service name and operation. Updated by the Aggregation Worker; read by the Query Service for the service overview dashboard.
- Query Service (updated): `GET /api/services/{service}/operations` now reads error rate and latency percentiles from Redis in under 1ms rather than aggregating from Cassandra on demand.
Request walkthrough:
- Aggregation Worker consumes spans from Kafka continuously.
- For each span, it increments in-memory counters: request count, error count, and a histogram bucket for `duration_ms` per `(service_name, operation_name)` pair.
- Every 10 seconds, the Aggregation Worker flushes counters to Redis: `HSET service:payment-service:checkout error_rate 0.023 p99_ms 1450`.
- Engineer opens the service overview dashboard. UI calls `GET /api/services`.
- Query Service reads all service metrics from Redis in a single multi-key GET. Returns in under 1ms.
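A sketch of the in-memory window state, assuming fixed histogram bucket bounds (real aggregators typically use exponential buckets, and the p99 here is only accurate to bucket resolution):

```python
import bisect
from collections import defaultdict

# Illustrative latency bucket upper bounds in milliseconds.
BOUNDS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]

class WindowAggregator:
    """Per (service, operation) counters for one flush window: request count,
    error count, and a latency histogram from which p99 is estimated."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"count": 0, "errors": 0,
                                          "hist": [0] * (len(BOUNDS) + 1)})

    def observe(self, service: str, op: str, duration_ms: float, is_error: bool):
        s = self.stats[(service, op)]
        s["count"] += 1
        s["errors"] += int(is_error)
        s["hist"][bisect.bisect_left(BOUNDS, duration_ms)] += 1

    def flush(self) -> dict:
        """Produce the fields written to Redis (one HSET per service:operation)."""
        out = {}
        for key, s in self.stats.items():
            rank = 0.99 * s["count"]
            seen, p99 = 0, BOUNDS[-1]
            for i, n in enumerate(s["hist"]):
                seen += n
                if seen >= rank:
                    p99 = BOUNDS[min(i, len(BOUNDS) - 1)]  # bucket upper bound
                    break
            out[key] = {"error_rate": s["errors"] / s["count"], "p99_ms": p99}
        self.stats.clear()
        return out
```

Counting into histogram buckets instead of storing raw durations keeps memory constant per `(service, operation)` pair no matter how many spans arrive in the window.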
The three Kafka consumer groups are the system's key architectural pattern. Adding a new capability (alerting, anomaly detection, capacity planning signals) means adding a fourth consumer group without touching the ingestion pipeline at all. I love this extensibility angle in interviews because it shows you are thinking beyond the immediate requirements.
Potential Deep Dives
1. How do we propagate context across service boundaries?
The span context must cross every service boundary: HTTP calls, gRPC calls, async message queue publishes, and internal thread handoffs. Three constraints apply: the carrier must be compact (it rides on every request), it must encode both the trace ID and the parent span ID, and it must be standardized so SDKs across different languages and frameworks interoperate.
2. How do we collect spans without overwhelming storage?
A production system with 500 services generating 1M spans/second cannot store 100% of spans at 30-day retention without prohibitive cost. At 1KB per span average, 100% sampling generates 86TB of new data per day. At 30-day retention, that is 2.6PB of storage.
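A quick back-of-the-envelope check of those numbers (decimal units, 1 KB average span):

```python
spans_per_sec = 1_000_000
bytes_per_span = 1_000                           # ~1 KB average span
per_day = spans_per_sec * bytes_per_span * 86_400
per_month = per_day * 30

print(per_day / 1e12)    # ~86.4 TB of new span data per day
print(per_month / 1e15)  # ~2.59 PB at 30-day retention
```

This is the arithmetic that motivates sampling: even halving span size only buys a constant factor, while tail-based sampling cuts volume by 20x or more.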
3. How do we store spans for fast trace assembly?
The query GET /api/traces/{trace_id} must return all spans for a trace assembled as a call tree in under 2 seconds. At 1M spans/second with 30-day retention, the storage layer handles enormous write throughput and must support one dominant access pattern: "give me all rows where trace_id = X."
4. How do we handle partial traces and surface error context to on-call engineers?
Two scenarios break the assumption that every trace is complete: a service crashes mid-request and its in-process buffer is lost, or a collector instance restarts while buffering spans for open traces. On-call engineers investigating an incident need the full context: which service, which span, which user request, what exception was thrown.
Final Architecture
The three Kafka consumer groups are the system's scaling foundation. Storage, search indexing, and aggregation are independently horizontal and independently deployable. The sampling decision happens post-store: every error trace is guaranteed to be retained because the has_error flag is set at ingestion time, before any sampling logic runs.
Interview Cheat Sheet
- Start by establishing the write/read ratio: 1000:1 write-to-query. Every storage and ingestion decision traces back to write dominance.
- The SDK async ring buffer is non-negotiable: never block the production thread for a network call. Spans enqueue in under 1 microsecond and flush in the background.
- W3C `traceparent` is the right propagation standard: single header, 55 bytes, encodes both trace ID and parent span ID, natively forwarded by browsers and CDNs.
- Parent span ID in the propagation carrier is what enables the call tree. A trace ID alone gives you grouping but not hierarchy.
- Kafka between the collector and storage decouples burst ingestion rate from sustained storage write rate. The collector stays fast even when Cassandra is under compaction pressure.
- Cassandra with `trace_id` as partition key is the primary storage choice: single-partition reads for trace assembly, linear write scaling by adding nodes, automatic 30-day TTL.
- Dual-write to Cassandra (full payload) and Elasticsearch (summary only) gives fast trace lookup and flexible search. Elasticsearch indexes only 200 bytes per span vs 1KB, keeping the cluster 5-8x smaller.
- Three Kafka consumer groups let you extend capabilities (alerting, anomaly detection) by adding a consumer group without touching the ingestion pipeline.
- Tail-based sampling keeps 100% of error traces and 100% of latency outliers while discarding 95%+ of healthy traces. Head-based 1% sampling has a 99% chance of dropping any given error.
- The sampling decision is post-store, not pre-ingest. Write everything to Cassandra first, set the error flag in Redis at ingest, then delete non-sampled traces asynchronously in a background worker that runs every 10 seconds.
- Partial traces are surfaced with a "partial" status and orphaned spans highlighted, not dropped. In a typical service crash, 3-5% of the trace's spans may be missing because the in-process buffer is lost before flushing.
- Error context lives in span tags on the error span: `user_id`, `request_id`, `http.status_code`, `exception.message`, and stack trace. Each error span captures 5-7 key tags; the storage layer preserves them verbatim for the on-call engineer.