How to Monitor LLMs in Production
Offline evaluation gates changes before they ship, but it cannot catch the problems that only appear in production: model-provider updates, data drift, and the long tail of real user inputs no test set anticipated. Production monitoring is the safety net for everything offline evaluation misses, and it is the only way to know that a system validated last week is still good today.
Everything starts with the trace. Capture a structured record of every request: the system prompt; the retrieved context with scores; each model call with its model version, input and output token counts, and latency; each tool call with its arguments and results; and the final output. This is the tracing foundation, and it is what makes every downstream metric and every debugging session possible. Use an OpenTelemetry-based approach so the instrumentation is portable across backends.
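As a rough illustration of what such a record might hold, here is a minimal sketch using only the Python standard library. It is not OpenTelemetry code; in an OpenTelemetry-based setup these fields would typically be carried as span attributes. All class and field names, and the `trace_id` field, are assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json
import uuid


@dataclass
class RetrievedChunk:
    text: str
    score: float  # retrieval score for this chunk


@dataclass
class ModelCall:
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any  # must be JSON-serializable for to_json() to work


@dataclass
class Trace:
    system_prompt: str
    retrieved_context: list[RetrievedChunk] = field(default_factory=list)
    model_calls: list[ModelCall] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    # Hypothetical identifier so scores and feedback can be joined to the trace later.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize the whole trace, including nested calls, to one JSON string."""
        return json.dumps(asdict(self))


trace = Trace(system_prompt="You are a support assistant.")
trace.retrieved_context.append(RetrievedChunk(text="Refunds take 5 days.", score=0.82))
trace.model_calls.append(ModelCall("example-model-v1", 412, 57, 930.0))
trace.tool_calls.append(ToolCall("lookup_order", {"order_id": "A1"}, {"status": "shipped"}))
trace.final_output = "Your order has shipped."
print(trace.to_json())
```

Keeping one record per request, with the nested calls inside it, is what lets a later metric or alert link back to the exact inputs that produced it.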
Pick metrics that map to real impact, not vanity numbers. Operational: p50 and p95 latency, cost per request and per conversation, token usage, error and timeout rates. Quality: faithfulness and answer relevance from online evaluators, and task completion for agents. Safety: toxicity, PII leakage, and policy-violation rates. Usage: request volume by feature and prompt. The metrics guide covers how each is computed. Segment every metric by query type and time window, because a problem in one slice is invisible in the aggregate.
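To make the segmentation point concrete, the following sketch computes p50 and p95 latency per query type with the standard library. The record keys `query_type` and `latency_ms` are assumed names, and a real pipeline would likely also bucket by time window.

```python
from collections import defaultdict
from statistics import quantiles


def latency_by_segment(records):
    """Return p50/p95 latency per query type.

    `records` is an iterable of dicts with the hypothetical keys
    "query_type" and "latency_ms".
    """
    by_segment = defaultdict(list)
    for record in records:
        by_segment[record["query_type"]].append(record["latency_ms"])

    summary = {}
    for segment, values in by_segment.items():
        if len(values) < 2:
            continue  # too few points for a percentile estimate
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        summary[segment] = {"p50": cuts[49], "p95": cuts[94], "count": len(values)}
    return summary


records = [
    {"query_type": "search", "latency_ms": 100},
    {"query_type": "search", "latency_ms": 200},
    {"query_type": "search", "latency_ms": 300},
    {"query_type": "summarize", "latency_ms": 2500},
    {"query_type": "summarize", "latency_ms": 3500},
]
print(latency_by_segment(records))
```

In this toy data the "summarize" slice is an order of magnitude slower than "search", a gap that a single blended percentile would blur.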
You have no gold answers for live traffic, so quality monitoring uses reference-free evaluators: a faithfulness check on RAG answers, a toxicity classifier, a JSON-validity assertion, a relevance score. Run these on a sample of traffic rather than every request to control cost, attach the scores to the traces, and aggregate them into the quality metrics on your dashboard. The same evaluators that gate changes offline run here as online evaluators, which keeps the definition of quality consistent across development and production.
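A minimal sketch of that flow, assuming traces are plain dicts with a `final_output` key (a hypothetical shape): one deterministic evaluator, a helper that attaches scores to a trace, and an aggregate over only the traces that were scored. Judge-based evaluators such as faithfulness would plug into the same `evaluators` mapping but are not shown.

```python
import json
from statistics import mean


def json_validity(output: str) -> float:
    """Deterministic assertion: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def attach_scores(trace: dict, evaluators: dict) -> dict:
    """Run each evaluator on the trace's final output and store the scores."""
    trace.setdefault("scores", {})
    for name, evaluator in evaluators.items():
        trace["scores"][name] = evaluator(trace["final_output"])
    return trace


def aggregate(traces: list, metric: str):
    """Mean of one metric over the traces that were actually scored."""
    values = [t["scores"][metric] for t in traces if metric in t.get("scores", {})]
    return mean(values) if values else None


evaluators = {"json_validity": json_validity}
traces = [
    {"final_output": '{"ok": true}'},
    {"final_output": "not json"},
    {"final_output": "{}"},  # left unscored to mimic sampling
]
for t in traces[:2]:
    attach_scores(t, evaluators)
print(aggregate(traces, "json_validity"))
```

Because the aggregate skips unscored traces, the same function works whether the evaluators ran on all traffic or only on a sample.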
User feedback is the most direct signal of real quality. Capture explicit feedback (thumbs up and down, ratings) and implicit signals (did the user retry, abandon, escalate, or accept the suggestion), and attach them to the corresponding traces. Explicit feedback is sparse but clean; implicit feedback is abundant but noisy, so weight them accordingly. Feedback both feeds the quality dashboard and identifies the traces most worth adding to your evaluation dataset.
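One way to "weight them accordingly" is a weighted average in which explicit signals count for more than implicit ones. The sketch below assumes the signal names and the two weights shown; both are illustrative values that would need tuning against your own data.

```python
# Assumed weights: explicit feedback is clean, implicit feedback is noisy.
EXPLICIT_WEIGHT = 1.0
IMPLICIT_WEIGHT = 0.3

# Hypothetical signal names mapped to (value, weight).
SIGNALS = {
    "thumbs_up": (1.0, EXPLICIT_WEIGHT),
    "thumbs_down": (-1.0, EXPLICIT_WEIGHT),
    "accepted": (1.0, IMPLICIT_WEIGHT),
    "retried": (-1.0, IMPLICIT_WEIGHT),
    "abandoned": (-1.0, IMPLICIT_WEIGHT),
    "escalated": (-1.0, IMPLICIT_WEIGHT),
}


def feedback_score(signals):
    """Weighted average of feedback signals in [-1, 1], or None if there are none."""
    total = 0.0
    weight_sum = 0.0
    for name in signals:
        value, weight = SIGNALS[name]
        total += value * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# One explicit thumbs-down outweighs one implicit acceptance.
print(feedback_score(["thumbs_down", "accepted"]))
```

Traces with a strongly negative score are natural candidates for the evaluation dataset, since they represent cases the user has already judged as failures.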
An alert that does not lead to an action is noise that trains the team to ignore alerts. Set thresholds tied to clear responses: alert if faithfulness drops more than ten percent against the rolling baseline (investigate the model and prompt), if p95 latency exceeds the user-experience budget (investigate retrieval and model latency), if error rate spikes (check the provider and infrastructure), or if any safety metric records a violation, since the tolerance there is zero (page immediately). Anomaly detection layered on top of static thresholds catches the gradual drifts that fixed thresholds miss. The next action belongs in the alert itself.
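The threshold rules above can be sketched as a single check that returns alerts with their next action attached. This is an illustration under assumptions: the metric keys are hypothetical, the ten percent drop is read as relative to the baseline, and an error "spike" is defined here as exceeding a multiple of the baseline rate (`error_spike_factor`, an invented parameter).

```python
from dataclasses import dataclass


@dataclass
class Alert:
    metric: str
    severity: str      # "warn" or "page"
    next_action: str   # the response travels with the alert


def check_alerts(current, baseline, latency_budget_ms, error_spike_factor=3.0):
    """Compare current metrics to a rolling baseline and fixed budgets."""
    alerts = []
    # Relative drop of more than ten percent against the baseline.
    if current["faithfulness"] < 0.90 * baseline["faithfulness"]:
        alerts.append(Alert("faithfulness", "warn", "Investigate the model and prompt."))
    if current["p95_latency_ms"] > latency_budget_ms:
        alerts.append(Alert("p95_latency", "warn", "Investigate retrieval and model latency."))
    # Assumed definition of a spike: a multiple of the baseline error rate.
    if current["error_rate"] > error_spike_factor * baseline["error_rate"]:
        alerts.append(Alert("error_rate", "warn", "Check the provider and infrastructure."))
    # Zero tolerance: any safety violation pages immediately.
    if current["safety_violations"] > 0:
        alerts.append(Alert("safety", "page", "Page the on-call immediately."))
    return alerts


current = {"faithfulness": 0.70, "p95_latency_ms": 1500, "error_rate": 0.01, "safety_violations": 1}
baseline = {"faithfulness": 0.85, "error_rate": 0.01}
for alert in check_alerts(current, baseline, latency_budget_ms=2000):
    print(alert.severity, alert.metric, "->", alert.next_action)
```

Only the static thresholds are shown; anomaly detection for gradual drift would sit alongside this check rather than replace it.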
Surface everything in one view organized for fast answers. An overview section shows current status with traffic-light indicators for each metric category. A trends section shows the last thirty days so you can see drift. A drill-down section links from any metric to the specific traces behind it, so an investigation goes from "faithfulness dropped" to "here are the failing traces" in one click. The dashboard should answer "is the system healthy right now, and if not, where" within thirty seconds.
Production monitoring is the safety net for everything offline evaluation cannot catch. Trace every request, score a sample with online evaluators, capture feedback, and alert on thresholds tied to clear actions. The goal is to find regressions on a dashboard before a user complaint reveals them.
Controlling the Cost of Online Evaluation
Online evaluators are themselves LLM calls, so running them on every production request can rival the cost of serving the requests in the first place. The discipline is to sample intelligently rather than score everything. Run the cheap deterministic assertions (JSON validity, format checks, banned-phrase detection) on every request, because they cost almost nothing. Run the expensive judge-based evaluators on a sample large enough to give statistically meaningful aggregate metrics, and bias that sample toward the requests most worth scoring: anything that received negative user feedback, anything flagged by a cheap assertion, and a representative random slice of the rest. This keeps the quality signal strong while keeping the evaluation bill a small fraction of serving cost.
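A sampling policy along these lines might look like the sketch below. The trace keys `negative_feedback` and `assertion_failed` are assumed names, and the function returns the reason a trace was selected so the random slice can be told apart from the targeted ones.

```python
import random


def should_run_judge(trace, random_rate, rng=random):
    """Decide whether to run the expensive judge-based evaluators on a trace.

    Returns (selected, reason). Targeted traces are always scored; the rest
    are scored with probability `random_rate`.
    """
    if trace.get("negative_feedback"):
        return True, "negative_feedback"
    if trace.get("assertion_failed"):
        return True, "assertion_flag"
    if rng.random() < random_rate:
        return True, "random"
    return False, None


print(should_run_judge({"negative_feedback": True}, random_rate=0.05))
print(should_run_judge({"assertion_failed": True}, random_rate=0.05))
```

Recording the reason matters because the targeted traces are deliberately skewed toward failures. Aggregate quality metrics are best computed from the "random" slice alone, or reweighted, so the bias in the sample does not drag the dashboard number down.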
The sampling rate is a dial you can adjust to the stakes. A high-risk system, or one in the days after a significant change, warrants a higher sampling rate to catch problems quickly. A stable, low-risk system can sample lightly to control cost. The key is to make the rate a deliberate decision tied to risk rather than an accident of whatever was easy to implement, and to log what fraction of traffic is actually being scored so the metrics are interpreted correctly. A faithfulness number computed on one percent of traffic means something different from one computed on half, and the dashboard should make that sampling rate visible alongside the metric.
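To show why the sampling rate belongs next to the metric, this sketch reports a pass rate together with the fraction of traffic scored and an approximate 95 percent margin of error. It assumes the metric is a pass/fail judgment on a random sample and uses the normal approximation for a proportion, which is rough for small samples or rates near 0 or 1.

```python
import math


def summarize_metric(passes, scored, total_requests):
    """Pass rate with its sampling context, or None if nothing was scored."""
    if scored == 0 or total_requests == 0:
        return None
    rate = passes / scored
    standard_error = math.sqrt(rate * (1 - rate) / scored)
    return {
        "value": rate,
        "margin_95": 1.96 * standard_error,  # normal approximation
        "scored": scored,
        "sampling_rate": scored / total_requests,
    }


# Same pass rate, very different certainty.
print(summarize_metric(passes=90, scored=100, total_requests=10_000))
print(summarize_metric(passes=4_500, scored=5_000, total_requests=10_000))
```

Both calls report a value of 0.9, but the first rests on one percent of traffic and carries a much wider margin than the second, which is the distinction the dashboard should surface.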
Monitoring the Memory Layer
If your system has persistent memory, the memory layer is a component that can degrade like any other and belongs on the dashboard. Monitor its retrieval quality (are relevant memories being surfaced), its confidence calibration (does high confidence predict correctness), and its growth and consolidation behavior over time. Adaptive Recall's status tool reports these signals directly, so memory health sits alongside latency, cost, and faithfulness rather than hiding as an unmonitored dependency in the middle of the pipeline. A memory system that silently starts returning stale results is exactly the kind of slow regression that monitoring exists to catch.
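The calibration question (does high confidence predict correctness) can be checked generically by bucketing retrievals by confidence and comparing each bucket's mean confidence with its observed accuracy. The sketch below does not use any Adaptive Recall API; it assumes you already have `(confidence, was_correct)` pairs, with correctness labels coming from review or feedback.

```python
def calibration_by_bucket(records, n_buckets=5):
    """Compare mean confidence with observed accuracy per confidence bucket.

    `records` is an iterable of (confidence, was_correct) pairs, with
    confidence in [0, 1]. Empty buckets are omitted from the result.
    """
    buckets = [[] for _ in range(n_buckets)]
    for confidence, was_correct in records:
        index = min(int(confidence * n_buckets), n_buckets - 1)  # 1.0 goes in the top bucket
        buckets[index].append((confidence, was_correct))

    rows = []
    for index, bucket in enumerate(buckets):
        if not bucket:
            continue
        rows.append({
            "bucket": index,
            "mean_confidence": sum(c for c, _ in bucket) / len(bucket),
            "accuracy": sum(1 for _, ok in bucket if ok) / len(bucket),
            "count": len(bucket),
        })
    return rows


records = [(0.90, True), (0.95, True), (0.10, False), (0.15, True)]
for row in calibration_by_bucket(records):
    print(row)
```

In a well-calibrated memory layer the two columns track each other. A top bucket whose accuracy falls over time is one early sign of the stale-results regression described above.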