LLM observability
Learn how to instrument LLM applications with traces, logs, and metrics to debug failures, detect prompt drift, and link production issues back to specific prompts and model versions.
TL;DR
- Traditional monitoring (Datadog, Prometheus) tells you the API call succeeded; it cannot tell you whether the answer was correct, safe, or useful
- Every LLM call should capture: full prompt, full output, model name + version, time to first token (TTFT), total latency, token counts, cost, and a trace ID linking multi-step flows
- Trace trees are the core primitive: they link retrieval, summarization, and generation steps so you can pinpoint which step caused a bad answer
- LangSmith is the most widely deployed platform; Langfuse is the open-source, self-hostable alternative for GDPR/HIPAA compliance
- Alert on quality metrics (evaluation score trends, refusal rate, cost per session), not just error rates; a 200 OK tells you nothing about answer quality
The problem it solves
Your LLM-powered support bot returns a 200 OK on every request. Latency looks normal. Error rate is 0%. Datadog dashboards are green across the board. And users are submitting complaints because 12% of answers are confidently wrong.
This is the fundamental gap. Traditional observability measures whether the system is up and whether the API call completed. LLM observability measures whether the output was actually good. When your system produces natural language, "the request succeeded" and "the response was correct" are completely different questions.
The traditional stack answers: "Is it up?" The LLM stack answers: "Is it good?" Both are necessary, but the second one is what catches the failures your users actually experience.
I've watched teams run LLM applications for weeks with perfect uptime metrics while quality silently degraded after a model version update. Without LLM-specific telemetry, they had no way to detect it until the ticket queue filled up.
What is it?
LLM observability is an instrumentation layer for AI applications that captures prompts, responses, quality signals, costs, and latency at the trace level, then gives you the tooling to query, slice, and alert on that data.
Think of it like a flight data recorder for every AI interaction. An airplane's black box doesn't just log "flight completed successfully." It records altitude, speed, control inputs, engine parameters, and communications, second by second. When something goes wrong, investigators reconstruct exactly what happened. LLM observability does the same thing for model calls: it records everything so you can reconstruct why a specific response was wrong and which step in the pipeline caused it.
The reason traditional Application Performance Monitoring (APM) tools don't cover this is that LLM outputs are non-deterministic. You can't write a unit test for "the answer was helpful." You need quality scorers, prompt versioning, and trace-level analysis that APM tools were never designed for.
If you're coming from a backend engineering background and you know Datadog, Prometheus, or Grafana, think of LLM observability as the quality-aware extension of what you already do. It doesn't replace your existing monitoring stack. It adds a layer on top that captures the signals unique to generative AI: the full text in and out, semantic quality scores, and cost attribution.
How it works
The three pillars for LLMs
Traditional observability has three pillars: logs, metrics, and traces. LLM observability adapts each one.
Traces become the most important pillar. A single user query in a RAG agent triggers 3-5 LLM calls (query rewriting, retrieval, summarization, generation, quality check). Traces link all of these into a tree so you can see the full execution path. Without trace linking, debugging a bad response means grepping through thousands of unrelated log entries.
Metrics expand beyond latency and error rate. You now track token costs per call, per session, and per feature. You track evaluation scores (faithfulness, relevance, toxicity). You track TTFT separately from total latency because TTFT is the user-perceived responsiveness while total latency determines throughput.
Logs become much larger. Instead of logging a request ID and status code, you log the full rendered prompt (system message + injected context + user message) and the full model response. These are the raw materials for debugging and for running retrospective evaluations on production data.
| Pillar | Traditional | LLM-Adapted |
|---|---|---|
| Traces | Request to response per service | Trace tree across retrieval, summarization, generation, scoring |
| Metrics | Latency, error rate, throughput | Token cost, TTFT, evaluation score, refusal rate, response length |
| Logs | Request ID, status, error message | Full prompt, full response, model version, temperature, token counts |
Trace-based observability
Trace trees are the core debugging primitive for LLM applications. Here's how they work.
Every LLM call is a span. A span records: the input (full prompt), the output (full response), model name and version, token counts, latency, and any metadata (temperature, top_p, tool definitions). Spans are nested under a parent trace ID that represents the user's original request.
When a user reports a bad answer, you look up the trace ID. The trace shows you: what documents retrieval returned (were they relevant?), what the rendered prompt looked like (was context injected correctly?), which model version was used (did a rollout change behavior?), and what quality scores the automated scorer assigned. This turns "the answer was wrong" into "retrieval returned an irrelevant document from the 2019 policy, which the model faithfully summarized."
If you're not logging every prompt/response pair with trace IDs, you're flying blind. You'll know something is wrong, but you won't know where in the pipeline it went wrong.
Log the full prompt, not a summary
The most common observability mistake is logging only the user message. When a production failure occurs, you need the exact inputs the model received: system prompt, injected documents, conversation history. A summary or template reference doesn't help. Full prompts are larger to store, but they're the only thing that lets you reproduce and fix failures.
Prompt versioning and drift
Prompt versioning tracks which prompt template version produced which outputs. When you change a system prompt, you tag it with a version identifier (v1.3, v2.0). Every trace records the prompt version alongside the model version. This lets you compare quality metrics between prompt versions the same way you'd compare A/B test variants.
Prompt drift is the silent killer. User behavior changes over time. A customer support bot built for billing questions will eventually receive questions about new products, regulatory changes, or topics that didn't exist at launch. If your prompts and retrieval aren't updated, quality degrades on these new cases without any code change triggering it.
Detection approach: embed all production inputs weekly using your text embedding model, cluster them (k-means or HDBSCAN), and compare clusters across weeks. A new cluster appearing means users are asking about something new. An existing cluster shifting means the vocabulary or framing of common questions has changed. Alert when a new cluster exceeds 1% of traffic or when an existing cluster's quality score drops below your threshold.
For your interview: prompt drift is the answer to "how would you detect quality degradation that isn't caused by a code change?"
Quality metrics
Traditional alerting fires when error rates spike. LLM alerting fires when quality drops.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.