What Is LLM Observability? Tracing, Monitoring, and Debugging AI
Observability Versus Logging
Logging and observability are often confused, but they answer different questions. A log records that an event happened: "LLM call completed in 1.4 seconds." Observability lets you answer questions you did not anticipate when you wrote the code: which prompt templates produced the most negative feedback this week, what the p95 latency of the retrieval step is across all traffic, how token cost per conversation changed after yesterday's deploy. The difference is structure and aggregation. Observability data is structured into traces and spans with consistent attributes, so you can slice and aggregate it after the fact rather than grepping unstructured text.
This distinction matters because LLM failures are almost never explained by the final output alone. An answer is wrong, but why? Because retrieval returned the wrong document, because a tool call failed silently and the model improvised, because the prompt template dropped a variable, because the model version changed. The cause lives in the intermediate steps, and only an observability layer that captures the full request tree exposes it.
Traces and Spans
The core unit of observability is the trace, which represents one complete user request. A trace is a tree of spans, where each span is a single operation with a start time, duration, and attributes. A typical RAG request produces a trace like this: a root span for the request, a child span for embedding the query, a child span for the vector search that records how many results came back and their scores, a span for assembling the prompt that records the final token count, a span for the LLM call that records the model, input tokens, output tokens, and latency, and a final span for post-processing. If the request used an agent, the tree is deeper, with a span for each tool call and its arguments and result.
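The span tree described above can be sketched as a small data structure. This is a minimal illustration, not any particular tracing library's API: the `Span` class, the span names, the `route` attribute, and all timings and scores are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One operation in a trace: a name, timing, attributes, and child spans."""
    name: str
    start_ms: float
    duration_ms: float
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, "Span"]]:
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def build_rag_trace() -> Span:
    """Build a hypothetical trace for a single RAG request."""
    return Span("request", 0.0, 1780.0, {"route": "/ask"}, children=[
        Span("embed_query", 0.0, 40.0),
        Span("vector_search", 40.0, 120.0,
             {"result_count": 3, "scores": [0.82, 0.79, 0.41]}),
        Span("assemble_prompt", 160.0, 5.0, {"prompt_tokens": 1320}),
        Span("llm_call", 165.0, 1600.0,
             {"model": "example-model", "input_tokens": 1320, "output_tokens": 210}),
        Span("post_process", 1765.0, 15.0),
    ])


def render(trace: Span) -> list[str]:
    """Return one indented line per span, showing the tree shape and durations."""
    return [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"
            for depth, span in trace.walk()]


for line in render(build_rag_trace()):
    print(line)
```

An agent request would use the same structure with deeper nesting: each tool call becomes a child span carrying its arguments and result as attributes.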
Capturing the full tree is what makes debugging possible. When you open a trace for a bad answer, you see the actual retrieved chunks, the exact prompt the model received, the raw model output before post-processing, and the timing and cost of each step. Many production LLM bugs become obvious the moment you can see the trace and are very hard to diagnose without it. A practical guide to tracing and debugging LLM calls covers how to instrument this.
A trace is the full tree of operations behind one request: prompts, retrieved context, model calls with token counts, and tool calls. The failure is usually in an intermediate span, not the final output, which is why capturing the whole tree is the foundation of LLM debugging.
The Metrics Observability Produces
Aggregating traces produces the metrics that run your dashboards and alerts. Operational metrics come almost for free from the spans: request latency at p50 and p95, token usage and cost per request and per conversation, error and timeout rates, and throughput. Quality metrics require online evaluators, described in the next section, that score outputs and attach the scores to traces. Usage metrics describe how the system is being used: which features and prompts get the most traffic, which users or sessions are most active, and how usage shifts over time. The value of having all of these in one place is correlation. When latency spikes, you can see whether it tracks a specific model, a specific prompt, or a traffic surge, because the dimensions are all attached to the same traces.
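As a rough sketch of how operational metrics fall out of trace data, the example below groups flattened trace records by one dimension and computes latency percentiles, error rate, and token totals. The record fields (`model`, `latency_ms`, `error`, `input_tokens`, `output_tokens`) are assumptions for illustration, and the percentile uses the simple nearest-rank method; a real backend may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_by(traces, dimension):
    """Group trace records by one attribute and compute operational metrics per group."""
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[dimension]].append(trace)

    summary = {}
    for key, group in groups.items():
        latencies = [t["latency_ms"] for t in group]
        summary[key] = {
            "requests": len(group),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "error_rate": sum(t["error"] for t in group) / len(group),
            "total_tokens": sum(t["input_tokens"] + t["output_tokens"] for t in group),
        }
    return summary


# Hypothetical trace records, one per request.
TRACES = [
    {"model": "model-a", "latency_ms": 100, "error": False, "input_tokens": 500, "output_tokens": 50},
    {"model": "model-a", "latency_ms": 200, "error": False, "input_tokens": 500, "output_tokens": 50},
    {"model": "model-a", "latency_ms": 300, "error": True, "input_tokens": 500, "output_tokens": 50},
    {"model": "model-a", "latency_ms": 400, "error": False, "input_tokens": 500, "output_tokens": 50},
    {"model": "model-b", "latency_ms": 1000, "error": False, "input_tokens": 800, "output_tokens": 200},
    {"model": "model-b", "latency_ms": 3000, "error": False, "input_tokens": 800, "output_tokens": 200},
]

for model, stats in summarize_by(TRACES, "model").items():
    print(model, stats)
```

Because every record carries the same dimensions, swapping `"model"` for another attribute such as a prompt template name gives a different slice of the same data, which is the correlation benefit in practice.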
Online Evaluation Lives in Observability
You cannot run reference-based evaluation on live traffic because you do not have gold answers for real user questions. What you can do is run cheap reference-free evaluators continuously on a sample of production traces: a faithfulness check that flags RAG answers unsupported by their retrieved context, a toxicity classifier on outputs, a JSON-validity assertion on structured responses, and a relevance score. These online evaluators turn raw traces into continuous quality signals that feed the same dashboards and alerts as latency and cost. This is the bridge between evaluation and observability: evaluation defines how to score quality, observability provides the live data to score and the place to surface the results.
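A minimal sketch of reference-free evaluators running over a sample of traces might look like the following. Everything here is illustrative: the trace fields (`id`, `output`, `context`) are assumed, and `lexical_support` is only a crude word-overlap stand-in for a real faithfulness check, which would typically use a model-based judge.

```python
import hashlib
import json
import re


def in_sample(trace_id, rate):
    """Deterministically pick roughly `rate` of traces by hashing the trace id."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


def json_validity(output):
    """Return 1.0 if the output parses as JSON, otherwise 0.0."""
    try:
        json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0
    return 1.0


def lexical_support(answer, context):
    """Fraction of answer words found in the context (a crude faithfulness proxy)."""
    words = re.findall(r"[a-z0-9]+", answer.lower())
    if not words:
        return 0.0
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return sum(word in context_words for word in words) / len(words)


def evaluate_sample(traces, rate):
    """Score the sampled traces and attach the scores to each trace record."""
    scored = []
    for trace in traces:
        if not in_sample(trace["id"], rate):
            continue
        trace["scores"] = {
            "json_valid": json_validity(trace["output"]),
            "lexical_support": lexical_support(trace["output"], trace["context"]),
        }
        scored.append(trace)
    return scored
```

Attaching the scores to the trace record itself, rather than storing them separately, is what lets quality signals share dashboards and alert rules with latency and cost.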
OpenTelemetry and GenAI Conventions
LLM observability has been converging on OpenTelemetry, the open standard for traces and metrics that already underpins conventional application monitoring. The OpenTelemetry project has defined semantic conventions for generative AI spans, standardizing attribute names for the model, token counts, and operation type. The practical benefit is portability: instrumentation you add against the standard works across any backend that supports it, so you are not locked into one vendor. Most dedicated LLM observability platforms either build on OpenTelemetry or interoperate with it, and several conventional observability vendors have added GenAI-specific span support on top of their existing tracing.
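To make the standardized attributes concrete, the sketch below builds an attribute dictionary for an LLM call span. The attribute names are the ones used by the OpenTelemetry GenAI semantic conventions as understood at the time of writing; those conventions are still evolving, so check the names against the current specification before relying on them. The helper function and the model name are hypothetical.

```python
def genai_span_attributes(model, input_tokens, output_tokens, operation="chat"):
    """Build span attributes for one LLM call using GenAI convention names.

    The keys are assumed to match the OpenTelemetry GenAI semantic
    conventions; verify them against the published specification.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attributes = genai_span_attributes("example-model", input_tokens=1320, output_tokens=210)
for key, value in attributes.items():
    print(f"{key} = {value}")
```

With the OpenTelemetry SDK, each key and value would be applied to an active span with `span.set_attribute(key, value)`. Keeping the names in one helper means a change in the conventions touches one place in the codebase.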
Practical Concerns: Privacy, Cost, and Sampling
A production observability layer has to handle realities that a toy logging setup ignores. Prompts and outputs routinely contain personal or confidential data, so you must decide what to capture in full, what to redact or hash, and what to drop, in line with your privacy and compliance requirements. Capturing a customer's full message verbatim into a third-party observability service can itself be a compliance violation, so this decision belongs in the design rather than as an afterthought. The safest default is to capture structure and metadata fully while redacting or tokenizing the free-text content that carries personal data, preserving the ability to debug shape and flow without storing the sensitive payload.
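One way to approximate the "structure and metadata in full, free text redacted" default is sketched below. The field names (`prompt`, `output`) and the key handling are assumptions for illustration. A keyed HMAC is used instead of a bare hash because short or predictable text can otherwise be recovered by guessing; whether this satisfies a given compliance regime is a question for your own review.

```python
import hashlib
import hmac

# Hypothetical names of span attributes that carry user-supplied free text.
FREE_TEXT_FIELDS = frozenset({"prompt", "output"})


def redact_span(attributes, key, free_text_fields=FREE_TEXT_FIELDS):
    """Return a copy of the attributes with free-text values replaced by a digest.

    Metadata (model, token counts, timings) is kept as-is. Free text is replaced
    by its length and a keyed digest, so identical payloads can still be
    correlated without storing the content itself.
    """
    redacted = {}
    for name, value in attributes.items():
        if name in free_text_fields and isinstance(value, str):
            digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
            redacted[name] = {"redacted": True, "length": len(value), "hmac_sha256": digest}
        else:
            redacted[name] = value
    return redacted


span = {
    "model": "example-model",
    "input_tokens": 42,
    "prompt": "My email is jane@example.com, please update my address.",
}
# In a real deployment the key would come from a secret store, not source code.
print(redact_span(span, key=b"demo-only-key"))
```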
Volume is the other practical constraint. A high-traffic system generates more traces than it is economical to store in full, so sampling becomes necessary. The naive approach of keeping a fixed random fraction throws away the failures you most want to see, so the better pattern is tail-based sampling: decide whether to keep a trace after seeing its outcome, always retaining errors and low-scoring or flagged requests while sampling the routine successes. This keeps full visibility into the interesting cases without paying to store every healthy request. Getting sampling right is what lets observability scale to production traffic without either blowing the budget or losing the traces that matter.
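The tail-based decision can be sketched as a single function that runs once a trace is complete. The field names (`error`, `flagged`, `quality_score`), the score threshold, and the baseline rate are all assumptions for illustration; production systems usually make this decision in a collector or pipeline stage rather than in application code.

```python
import random


def keep_trace(trace, baseline_rate=0.05, score_threshold=0.5, rng=random):
    """Decide whether to store a finished trace.

    Errors, flagged requests, and low-scoring requests are always kept.
    Routine successes are kept with probability `baseline_rate`.
    """
    if trace.get("error") or trace.get("flagged"):
        return True
    score = trace.get("quality_score")
    if score is not None and score < score_threshold:
        return True
    return rng.random() < baseline_rate


# Hypothetical completed traces.
finished = [
    {"id": "t1", "error": True},
    {"id": "t2", "quality_score": 0.2},
    {"id": "t3", "flagged": True, "quality_score": 0.9},
    {"id": "t4", "quality_score": 0.95},
]
kept = [t["id"] for t in finished if keep_trace(t, rng=random.Random(0))]
print(kept)
```

The key difference from head-based random sampling is that the decision uses the outcome: the first three traces are retained regardless of the random draw, and only the healthy one is subject to the baseline rate.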
Why Observability Is Non-Negotiable in Production
An LLM application without observability is a black box that might be silently degrading while reporting healthy status. Model providers update models, retrieval indexes grow and shift, and user behavior changes, and any of these can quietly reduce quality. Without traces, the first signal a team gets is a customer complaint, and even then they cannot easily diagnose the cause. With observability, the same change shows up as a metric shift on a dashboard, the alert fires, and the trace shows exactly what broke. This is the same reason web teams run application performance monitoring: you cannot operate what you cannot see. For LLM systems, where behavior is probabilistic and inputs are open-ended, the need is greater, not smaller.
Observability also compounds with evaluation and with memory systems. The traces you capture become the raw material for building a better evaluation dataset, because real failures are the best test cases. And for any system with persistent memory, the trace shows which memories were retrieved and with what confidence, making retrieval quality visible alongside everything else. A memory layer like Adaptive Recall that reports retrieval and confidence signals slots directly into an observability dashboard as one more measurable component rather than an opaque dependency.
Observability captures structured traces of every request, aggregates them into operational and quality metrics, and hosts the online evaluators that score live traffic. It is the difference between catching a regression on a dashboard and learning about it from a customer.