How to Trace and Debug LLM Calls
Tracing is the foundation of LLM observability because LLM failures are compositional. A wrong answer is rarely a pure model failure; far more often the retriever returned the wrong chunk, a tool call failed and the model improvised, the prompt template dropped a variable, or a model version changed underneath you. None of these causes is visible in the final output. All of them are obvious in a trace that captures the full request tree.
Create a root span when a request arrives and close it when the response is sent. Inside it, open a child span for each meaningful operation: embedding the query, the vector or memory retrieval, prompt assembly, each LLM call, each tool or function call, and post-processing. Spans nest to reflect causality, so an agent's tool call that itself triggers a model call produces a nested subtree. The resulting tree is a faithful record of what the request actually did, in order, with timing.
A span is only as useful as its attributes. On model-call spans, record the model name and version, input and output token counts, latency, temperature, and the actual prompt and completion. On retrieval spans, record the query, the number of results, and each result's identifier and score. On tool spans, record the tool name, the arguments, and the result or error. The two attributes that catch the most bugs are the exact prompt the model received (which reveals template errors) and the retrieved chunks (which reveal retrieval failures). Capturing model version is what lets you later correlate a quality drop with a provider update.
```python
# `tracer`, `client`, `model`, and `prompt` are assumed to be defined by the application.
with tracer.start_span("llm.generate") as span:
    # Request-side attributes are known before the call.
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute("gen_ai.prompt", prompt)
    response = client.generate(prompt)
    # Token counts and the completion exist only after the call returns;
    # `in_tokens` and `out_tokens` are assumed to be read from the provider's response.
    span.set_attribute("gen_ai.usage.input_tokens", in_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", out_tokens)
    span.set_attribute("gen_ai.completion", response.text)
```

Rather than inventing attribute names, adopt the OpenTelemetry semantic conventions for generative AI, which standardize names for the model, token usage, and operation type. Using the standard makes your traces portable: any backend that understands OpenTelemetry can ingest them, so you are not locked into one observability vendor, and you can switch or run multiple backends without re-instrumenting. Most LLM observability platforms either build on OpenTelemetry or accept its data.
Real applications span multiple services: a gateway, a retrieval service, the model provider, tool backends. For the trace to represent the whole request as one connected tree, the trace context (the trace and span identifiers) must propagate across every hop, through HTTP headers or message metadata. Without propagation, you get disconnected fragments instead of one trace, and the most valuable debugging view, the complete request, is lost. This is standard distributed-tracing practice and the same mechanisms apply.
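As an illustration of propagation over HTTP, the sketch below writes and reads a W3C Trace Context `traceparent` header by hand. The text above does not name a header format, so the choice of `traceparent` is an assumption, and the `inject` and `extract` helpers are hypothetical names. In practice a tracing SDK's propagator would normally do this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# traceparent layout: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers, or None if absent or invalid.

    Real HTTP header lookups should be case-insensitive; a plain dict is used here for brevity.
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are defined as invalid by the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 1)


# Gateway side: start a trace and attach its context to the outgoing call.
trace_id = secrets.token_hex(16)        # 32 hex characters
gateway_span_id = secrets.token_hex(8)  # 16 hex characters
outgoing: Dict[str, str] = {}
inject(outgoing, trace_id, gateway_span_id)

# Retrieval-service side: reuse the trace id and record the caller as the parent.
context = extract(outgoing)
if context is not None:
    incoming_trace_id, parent_span_id, sampled = context
    retrieval_span_id = secrets.token_hex(8)
    print(incoming_trace_id == trace_id, parent_span_id == gateway_span_id)
```

The receiving service keeps the trace identifier and creates a new span identifier whose parent is the caller's span, which is what joins the two services' spans into one tree.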
Prompts and outputs often contain personal or confidential data, so decide what to capture in full, what to redact or hash, and what to drop, in line with your privacy and compliance requirements. For high-traffic systems, trace sampling controls cost and storage: capture all errors and a representative sample of successful requests, so you keep full visibility into failures without storing every successful trace. Tail-based sampling, which decides whether to keep a trace after seeing its outcome, is well suited to LLM systems because it lets you always keep the interesting traces.
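The two ideas in this paragraph can be sketched in a few lines. The example below is illustrative only: the email pattern is a deliberately simple stand-in for real PII detection, the trace dictionary shape (`trace_id`, `error`) is hypothetical, and the 5% default rate is an arbitrary placeholder rather than a recommendation.

```python
import hashlib
import re
from typing import Any, Dict

# Simplistic email pattern, for illustration; real redaction needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace each email address with a short stable hash.

    Hashing instead of dropping keeps repeated occurrences correlatable
    across traces without storing the address itself.
    """
    def _mask(match: "re.Match[str]") -> str:
        digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()[:12]
        return f"<email:{digest}>"

    return EMAIL.sub(_mask, text)


def keep_trace(trace: Dict[str, Any], success_sample_rate: float = 0.05) -> bool:
    """Tail-based decision, made after the trace has finished and its outcome is known."""
    if trace["error"]:
        return True  # always keep failures
    # Map the trace id deterministically into [0, 1) so every service
    # that sees this trace makes the same keep/drop decision.
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32
    return bucket < success_sample_rate


print(redact_emails("Reply to alice@example.com about the refund."))
print(keep_trace({"trace_id": "abc123", "error": True}))
```

Redaction would typically run before the prompt and completion are written to span attributes, while the sampling decision runs once the root span has closed.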
With tracing in place, debugging a bad answer becomes a procedure rather than a guessing game. Open the trace, walk the span tree, and check each step. Did retrieval return the right chunks (look at the retrieval span's results)? Did the prompt contain what it should (look at the prompt attribute)? Does the final answer match what the model produced, or did post-processing alter it? Did any tool call fail? The failing step is almost always obvious once the trace is in front of you, which is the entire payoff of the instrumentation.
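The same walk can be partly automated. The sketch below assumes a trace exported as nested dictionaries with `name`, `attributes`, and `children` keys; that shape, the span names, and the `final_answer` attribute are all hypothetical. The placeholder check is a rough heuristic for an unfilled template variable, not a complete validator.

```python
import re
from typing import Any, Dict, Iterator, List

Span = Dict[str, Any]  # hypothetical shape: {"name", "attributes", "children"}
UNFILLED = re.compile(r"\{\w+\}")  # e.g. a leftover "{context}" in the prompt


def walk(span: Span) -> Iterator[Span]:
    """Yield spans depth-first, in the order they were recorded."""
    yield span
    for child in span.get("children", []):
        yield from walk(child)


def diagnose(root: Span, expected_chunk_id: str) -> List[str]:
    """Run one check per step and collect anything worth inspecting."""
    findings: List[str] = []
    last_completion = None
    for span in walk(root):
        name = span["name"]
        attrs = span.get("attributes", {})
        if name == "retrieval":
            ids = [r["id"] for r in attrs.get("results", [])]
            if expected_chunk_id not in ids:
                findings.append(f"retrieval: expected {expected_chunk_id}, got {ids}")
        elif name == "llm.generate":
            if UNFILLED.search(attrs.get("prompt", "")):
                findings.append("prompt: unfilled template variable")
            last_completion = attrs.get("completion")
        elif name.startswith("tool.") and attrs.get("error"):
            findings.append(f"{name}: {attrs['error']}")
    # A difference here is not necessarily a bug, only a place to look.
    final = root.get("attributes", {}).get("final_answer")
    if last_completion is not None and final != last_completion:
        findings.append("post-processing: final answer differs from model completion")
    return findings


trace = {
    "name": "request",
    "attributes": {"final_answer": "Refunds take 5 days."},
    "children": [
        {"name": "retrieval",
         "attributes": {"results": [{"id": "doc-7", "score": 0.61}]}},
        {"name": "llm.generate",
         "attributes": {
             "prompt": "Context: shipping policy\nQuestion: How long do refunds take?",
             "completion": "Refunds take 5 days."}},
    ],
}

findings = diagnose(trace, expected_chunk_id="doc-3")
for finding in findings:
    print(finding)
```

In this made-up trace the prompt and post-processing checks pass, and the only finding points at retrieval, which returned a chunk other than the one the question needed.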
Model every request as a span tree, capture the exact prompt and the retrieved chunks above all, use OpenTelemetry GenAI conventions for portability, and propagate context across services. The failing step in a bad answer is nearly always visible in the trace.
Sessions, Users, and Linking Traces Together
A single trace captures one request, but real understanding often requires linking traces across a conversation or a user. A support conversation spans many requests, and a problem in the fifth turn may originate in what the system stored or failed to store in the second. Attach a session identifier and, where privacy allows, a user identifier to every trace, so you can reconstruct the full conversation and see how state carried across turns. This is what lets you debug failures that are not visible in any single request, such as an assistant that contradicts something it said earlier because the relevant context was never carried forward.
Session-level linking also unlocks the most useful production analyses. You can measure how quality varies over the course of a long conversation, whether cost accumulates faster than expected as context grows, and which users or session types produce the most failures. For agents and multi-turn assistants, the session view is often more informative than the request view, because the interesting behavior (drift, repetition, accumulating confusion) only emerges across turns. Designing the trace schema to support session and user grouping from the start costs little and pays off the first time you need to understand a failure that spans more than one request.
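Once every trace carries a session identifier, these analyses reduce to grouping and sorting. The following sketch assumes traces have been exported as flat records; the field names (`session_id`, `started_at`, `input_tokens`, `error`) and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Trace = Dict[str, Any]


def group_by_session(traces: List[Trace]) -> Dict[str, List[Trace]]:
    """Group traces by session and order each session's turns by start time."""
    sessions: Dict[str, List[Trace]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda t: t["started_at"])
    return dict(sessions)


def failure_rate_by_session(sessions: Dict[str, List[Trace]]) -> Dict[str, float]:
    """Fraction of turns in each session that ended in an error."""
    return {
        session_id: sum(1 for t in turns if t["error"]) / len(turns)
        for session_id, turns in sessions.items()
    }


traces = [
    {"trace_id": "t3", "session_id": "s1", "started_at": 30, "input_tokens": 900, "error": True},
    {"trace_id": "t1", "session_id": "s1", "started_at": 10, "input_tokens": 200, "error": False},
    {"trace_id": "t2", "session_id": "s1", "started_at": 20, "input_tokens": 450, "error": False},
    {"trace_id": "t4", "session_id": "s2", "started_at": 15, "input_tokens": 180, "error": False},
]

sessions = group_by_session(traces)
turn_order = [t["trace_id"] for t in sessions["s1"]]
# Input tokens per turn show how fast context, and therefore cost, is growing.
token_growth = [t["input_tokens"] for t in sessions["s1"]]
failure_rates = failure_rate_by_session(sessions)

print(turn_order, token_growth, failure_rates)
```

Reading a session's turns in order alongside per-turn input tokens makes both context growth and the turn where a failure first appears easy to spot.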
Tracing the Memory and Retrieval Steps
For systems with persistent memory, the retrieval span is where memory behavior becomes visible. Capture which memories were retrieved, their confidence scores, and their recency, so that when an answer is wrong you can see whether a stale or low-confidence memory drove it. This makes the memory layer debuggable in the same trace as everything else, rather than an opaque step. Adaptive Recall returns the retrieved memories with their confidence and source, which slots directly into the retrieval span and connects to the broader practice of debugging memory issues. A trace that shows the memory step alongside the model call is what lets you attribute a failure to memory rather than reasoning.
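To make this concrete, the sketch below turns a list of retrieved memories into flat attributes suitable for a retrieval span and flags the ones most likely to have misled the answer. The memory record shape, the `memory.*` attribute names, and the thresholds are hypothetical; they are not taken from any standard or from a specific memory product's API.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def memory_span_attributes(
    memories: List[Dict[str, Any]],
    now: datetime,
    min_confidence: float = 0.5,  # illustrative threshold
    max_age_days: int = 90,       # illustrative threshold
) -> Dict[str, Any]:
    """Flatten retrieved memories into primitive lists and flag suspect ones."""
    ids = [m["id"] for m in memories]
    confidences = [float(m["confidence"]) for m in memories]
    ages_days = [(now - m["updated_at"]).days for m in memories]
    # A memory is suspect if it is low-confidence or stale.
    suspect = [
        m["id"]
        for m, age in zip(memories, ages_days)
        if m["confidence"] < min_confidence or age > max_age_days
    ]
    return {
        "memory.count": len(memories),
        "memory.ids": ids,
        "memory.confidences": confidences,
        "memory.ages_days": ages_days,
        "memory.suspect_ids": suspect,
    }


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
memories = [
    {"id": "m1", "confidence": 0.92, "updated_at": datetime(2024, 12, 20, tzinfo=timezone.utc)},
    {"id": "m2", "confidence": 0.31, "updated_at": datetime(2024, 12, 28, tzinfo=timezone.utc)},
    {"id": "m3", "confidence": 0.88, "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

attrs = memory_span_attributes(memories, now)
for key, value in attrs.items():
    print(key, value)
```

Each value is a primitive or a list of primitives, so the dictionary can be set on a span attribute by attribute. When an answer is wrong, the suspect list is the first thing to check in the retrieval span.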