How to Evaluate AI Agents

Evaluating an AI agent means scoring two things: the outcome, whether the task was completed correctly, and the trajectory, the sequence of tool calls and decisions the agent made to get there. A correct answer reached through a wasteful, lucky, or unsafe path is still a production problem, so trajectory evaluation matters as much as outcome evaluation. The method is to capture the full tool-call trace for each task, score task completion against a success criterion, and separately score tool-selection accuracy, argument validity, error recovery, and efficiency.

Agent evaluation is harder than evaluating a single LLM response because an agent produces a process, not just an output. The same task can be completed in two steps or twenty, with the right tools or the wrong ones, recovering gracefully from an error or papering over it. Two agents that both produce a correct final answer can differ enormously in cost, latency, and reliability, and those differences only show up when you evaluate the trajectory.

Step 1: Capture the full trajectory.
You cannot evaluate what you did not record. Instrument the agent so every run produces a structured trace: the initial task, each step's reasoning, each tool call with its name and arguments, each tool result, and the final output. This is the same trace your observability layer captures, and it is the raw material for every trajectory metric. Without the full trace, you can only score the outcome, which means you are blind to half the failure modes.
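
As a minimal sketch of what such a trace could look like, the dataclasses below record the fields listed above. The class names, field names, and the example tools (`get_order`, `convert_currency`) are illustrative assumptions, not the schema of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None
    error: Optional[str] = None  # set when the tool failed


@dataclass
class Step:
    reasoning: str
    tool_call: Optional[ToolCall] = None  # None for a pure reasoning step
    tokens: int = 0


@dataclass
class Trace:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_output: Optional[str] = None

    def tool_names(self) -> list[str]:
        """Ordered tool names, the input to most trajectory metrics."""
        return [s.tool_call.name for s in self.steps if s.tool_call is not None]

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)


# Hypothetical two-step run, recorded as it happens.
trace = Trace(task="Find the order total and convert it to EUR")
trace.steps.append(Step(
    "Look up the order first.",
    ToolCall("get_order", {"order_id": "A17"}, result={"total_usd": 40.0}),
    tokens=120,
))
trace.steps.append(Step(
    "Convert the USD total.",
    ToolCall("convert_currency", {"amount": 40.0, "to": "EUR"}, result=36.8),
    tokens=95,
))
trace.final_output = "The order total is 36.80 EUR."
```

Keeping the error field on the tool call itself, rather than in free-text logs, is what later makes error-recovery scoring a simple lookup.
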
Step 2: Score the outcome.
Define what success means for the task and check it. For a task with a verifiable result (the file was created, the record was updated, the math is correct), outcome scoring is a deterministic assertion against the end state, which is the most reliable signal you can get. For open-ended tasks (research, drafting), outcome quality is scored by an LLM judge or human review against a rubric. Always prefer a checkable end-state assertion when the task allows it, because it removes judge ambiguity entirely.
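
For a verifiable task, the end-state assertion can be a few lines of ordinary code. The sketch below assumes a hypothetical task in which the agent must write a JSON report to a working directory; the function name and the required keys are made up for illustration.

```python
import json
from pathlib import Path


def report_was_written(workdir: Path, filename: str, required_keys: set[str]) -> bool:
    """Deterministic outcome check on the end state.

    Passes only if the file exists, parses as a JSON object,
    and contains every required key.
    """
    path = workdir / filename
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= set(data)
```

The check inspects what the agent left behind, not what the agent said it did, so a confident final message cannot mask a missing or malformed file.
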
Step 3: Score the trajectory.
This is where agent evaluation earns its keep. Measure tool-selection accuracy: did the agent call the right tools for the task, and did it avoid irrelevant ones? Measure argument validity: were the arguments well-formed and correct, or did the agent pass malformed inputs that happened to be tolerated? Measure error recovery: when a tool failed, did the agent retry sensibly, switch approaches, or ignore the failure and improvise a fabricated result? Measure efficiency: how many steps and how many tokens did the task take relative to the minimum needed? Each of these can be scored with assertions where the expected tool sequence is known, or with an LLM judge evaluating the trajectory against a rubric where it is open-ended.
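
Where the expected tools are known, the four trajectory measures can be computed with plain assertions. The sketch below is one possible formulation, not a standard one: it assumes each tool call is a dict with `name`, `args`, and `error` keys, treats a schema as a mapping from argument name to Python type, and counts a failed call as "recovered" if the agent made any further call afterwards. All of these conventions are assumptions made for the example.

```python
from typing import Any

Call = dict[str, Any]  # assumed shape: {"name": str, "args": dict, "error": str | None}


def tool_selection(calls: list[Call], expected: set[str]) -> dict[str, float]:
    """Precision: share of used tools that were expected. Recall: share of expected tools used."""
    used = {c["name"] for c in calls}
    hit = used & expected
    precision = len(hit) / len(used) if used else 0.0
    recall = len(hit) / len(expected) if expected else 1.0
    return {"precision": precision, "recall": recall}


def argument_validity(calls: list[Call], schemas: dict[str, dict[str, type]]) -> float:
    """Share of calls whose arguments match the schema's names and types exactly."""
    def ok(call: Call) -> bool:
        schema = schemas.get(call["name"])
        if schema is None:
            return False
        args = call["args"]
        return set(args) == set(schema) and all(
            isinstance(args[k], t) for k, t in schema.items()
        )
    return sum(ok(c) for c in calls) / len(calls) if calls else 1.0


def error_recovery(calls: list[Call]) -> float:
    """Share of failed calls that were followed by at least one further call."""
    failures = [i for i, c in enumerate(calls) if c.get("error")]
    if not failures:
        return 1.0
    return sum(1 for i in failures if i + 1 < len(calls)) / len(failures)


def efficiency(calls: list[Call], min_calls: int) -> float:
    """Minimum needed calls divided by actual calls, capped at 1.0."""
    return min(1.0, min_calls / len(calls)) if calls else 0.0


schemas = {"search": {"query": str}, "weather": {"city": str}, "calculator": {"expression": str}}
calls = [
    {"name": "search", "args": {"query": "q3 revenue"}, "error": "timeout"},
    {"name": "search", "args": {"query": "q3 revenue"}, "error": None},
    {"name": "weather", "args": {"city": "Oslo"}, "error": None},       # irrelevant tool
    {"name": "calculator", "args": {"expression": 4}, "error": None},   # wrong argument type
]
scores = {
    "selection": tool_selection(calls, expected={"search", "calculator"}),
    "argument_validity": argument_validity(calls, schemas),
    "error_recovery": error_recovery(calls),
    "efficiency": efficiency(calls, min_calls=2),
}
```

In this example the agent reaches every expected tool but also calls an irrelevant one, passes one mistyped argument, retries after the timeout, and uses twice the minimum number of calls. An outcome-only score would show none of that.
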
Step 4: Build task scenarios with success criteria.
Assemble a set of representative tasks, each with a clear, checkable definition of success and, where possible, a reference trajectory or an allowed set of tool sequences. Cover the easy common cases, the harder multi-step cases, and the failure cases where a tool returns an error or no result, because error handling is where agents most often break in production. Drawing scenarios from real production traces ensures the set reflects how the agent is actually used rather than how you imagine it is used.
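
One way to keep scenarios checkable is to store the success criterion and the allowed tool sequences next to the task itself. The `Scenario` class, its field names, and the refund tools below are hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Scenario:
    name: str
    task: str
    check_success: Callable[[dict[str, Any]], bool]  # assertion on the end state
    allowed_sequences: list[list[str]] = field(default_factory=list)
    # Tool name -> error message the test harness should inject for that tool.
    injected_tool_errors: dict[str, str] = field(default_factory=dict)

    def trajectory_allowed(self, tool_names: list[str]) -> bool:
        """True if no reference sequences are defined, or the run matches one exactly."""
        if not self.allowed_sequences:
            return True
        return tool_names in self.allowed_sequences


scenarios = [
    Scenario(
        name="refund_happy_path",
        task="Refund order A17",
        check_success=lambda state: state.get("A17") == "refunded",
        allowed_sequences=[["get_order", "issue_refund"]],
    ),
    Scenario(
        name="refund_lookup_fails",
        task="Refund order A17",
        # When the lookup fails, success means the order was NOT refunded blindly.
        check_success=lambda state: state.get("A17") != "refunded",
        injected_tool_errors={"get_order": "503 service unavailable"},
    ),
]
```

The second scenario covers the failure case: the correct outcome is for the agent to stop or escalate, so its success criterion is the absence of a refund.
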
Step 5: Diagnose failures by trajectory pattern.
The trajectory scores localize failures the way layered RAG evaluation does. An agent that picks the wrong tool needs better tool descriptions or fewer, clearer tools. An agent that passes bad arguments needs better schemas or examples. An agent that loops without converging needs a step limit and a better stopping condition. An agent that ignores tool errors and fabricates results needs explicit error-handling instructions and validation of tool outputs (see the related topics on handling tool errors and why tool calls fail). Each failure pattern maps to a specific fix.
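
The pattern-to-fix mapping above can be sketched as a small rule-based classifier. The detection rules are deliberately crude assumptions: a call is a dict with `name`, `args`, and `error`; the string `"invalid_arguments"` is a made-up error code; a loop is any identical call repeated more than `max_repeats` times; and a run that ends directly on a failed call is flagged as a possibly ignored error.

```python
import json
from collections import Counter
from typing import Any

# Fixes as described in the text, keyed by failure pattern.
FIXES = {
    "wrong_tool": "better tool descriptions or fewer, clearer tools",
    "bad_arguments": "better schemas or examples",
    "looping": "a step limit and a better stopping condition",
    "ignored_error": "explicit error-handling instructions and validation of tool outputs",
}


def diagnose(calls: list[dict[str, Any]], expected_tools: set[str],
             max_repeats: int = 3) -> list[str]:
    findings = []
    if any(c["name"] not in expected_tools for c in calls):
        findings.append("wrong_tool")
    if any(c.get("error") == "invalid_arguments" for c in calls):
        findings.append("bad_arguments")
    # Identical name + arguments repeated too often suggests a non-converging loop.
    repeats = Counter((c["name"], json.dumps(c["args"], sort_keys=True)) for c in calls)
    if repeats and max(repeats.values()) > max_repeats:
        findings.append("looping")
    # The last call failed and nothing followed it, so any final answer
    # cannot be supported by that call's result.
    if calls and calls[-1].get("error"):
        findings.append("ignored_error")
    return findings


calls = [{"name": "search", "args": {"q": "x"}, "error": None}] * 3
calls = calls + [{"name": "search", "args": {"q": "x"}, "error": "timeout"}]
findings = diagnose(calls, expected_tools={"search"})
suggested = [FIXES[f] for f in findings]
```

Rules like these only flag candidates for review. An open-ended trajectory may still need an LLM judge or a human to confirm the diagnosis.
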
Step 6: Run in production with observability.
In production, the same trajectory traces feed online evaluation. Score a sample of live agent runs for outcome success and trajectory quality, alert when completion rate or efficiency degrades, and add the failing trajectories to your scenario set. Long-running agents that persist state across steps and sessions add a memory dimension: the agent's behavior depends on what it remembered, so evaluating the memory layer's retrieval quality is part of evaluating the agent.
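
A minimal sketch of the sampling and alerting loop follows. The sampling rate, baseline, and tolerance are placeholder values, and the function names are hypothetical; a real deployment would read runs from its observability store and send alerts through its own channels.

```python
import random
from typing import Optional


def sample_runs(run_ids: list[str], rate: float, seed: Optional[int] = None) -> list[str]:
    """Keep each run independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in run_ids if rng.random() < rate]


def completion_rate(outcomes: list[bool]) -> float:
    """Share of scored runs whose outcome check passed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def should_alert(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Alert when the metric falls more than `tolerance` below the baseline."""
    return current < baseline - tolerance


# Hypothetical window of scored live runs: run id -> outcome passed?
scored = {"r1": True, "r2": True, "r3": False, "r4": True}
rate_now = completion_rate(list(scored.values()))
if should_alert(rate_now, baseline=0.90):
    # Failing runs go back into the offline scenario set.
    new_scenarios = [run_id for run_id, passed in scored.items() if not passed]
else:
    new_scenarios = []
```

The same `should_alert` check applies to an efficiency metric such as mean steps per task, with the comparison reversed, since a rising step count signals degradation.
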
Key Takeaway

Score the outcome and the trajectory. Outcome alone hides agents that succeed by luck or at twenty times the necessary cost. Trajectory scoring (tool selection, argument validity, error recovery, and efficiency) is what localizes where a multi-step agent actually goes wrong.

Outcome-Only Evaluation and Its Traps

It is tempting to evaluate agents on outcome alone, because outcome is what users care about and it is often the cheapest thing to check. But outcome-only evaluation hides failure modes that surface as production incidents. An agent that reaches the right answer by trying every tool until one works will look perfect on outcome while costing many times the necessary tokens and latency, and that cost only becomes visible under load. An agent that succeeds on the test scenarios by exploiting a quirk of the environment, rather than by sound reasoning, will fail the moment the environment changes. An agent that ignores a tool error and fabricates a plausible result can produce a correct-looking outcome that is actually unsupported, which is worse than an honest failure because it is harder to detect.

Trajectory evaluation catches all three. Measuring efficiency exposes the brute-force agent before its cost shows up in the bill. Measuring tool selection and argument validity exposes the agent that succeeds by luck rather than reasoning. Validating that tool outputs were actually used, rather than ignored and overwritten by the model, exposes the agent that fabricates. None of these failures is visible in the final answer, which is exactly why agent evaluation has to look at the path and not just the destination. Trajectory evaluation costs more, since it requires capturing and scoring the full trace, but for any agent that takes real actions in production it is the difference between an evaluation that reflects reliability and one that flatters it.
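
One crude way to check that tool outputs were actually used is to look for figures in the final answer that appear in no tool result. The sketch below is a heuristic under strong assumptions: it compares numeric strings only, so `4200` and `4200.0` would not match, and it says nothing about non-numeric claims. The function name is hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(final_answer: str, tool_results: list[str]) -> set[str]:
    """Numbers stated in the answer that never appeared in any tool result."""
    seen: set[str] = set()
    for result in tool_results:
        seen.update(NUMBER.findall(result))
    return set(NUMBER.findall(final_answer)) - seen


# Grounded: the figure comes from the tool result.
grounded = unsupported_numbers("The total is 4200 USD.", ['{"total": 4200}'])

# Suspect: the tool failed, yet the answer still states a figure.
suspect = unsupported_numbers("The total is 5100 USD.", ['{"error": "timeout"}'])
```

A non-empty result does not prove fabrication, since the agent may have computed the figure legitimately, but it marks the run for a closer look at the trace.
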

Why Memory Changes Agent Evaluation

An agent with persistent memory carries information across steps and across sessions, which means its decisions depend on what it stored and retrieved earlier. A wrong action can trace back to a stale or incorrect memory rather than a reasoning error in the moment. Evaluating such an agent therefore includes evaluating its memory: did it retrieve the relevant prior context, did it avoid acting on contradicted information, did its confidence in remembered facts match reality? Adaptive Recall surfaces retrieval and confidence signals that let you attribute an agent's behavior to specific memories, so a bad trajectory can be traced to the memory that caused it rather than left as an unexplained reasoning failure. This connects directly to the broader topic of debugging agent memory issues.
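
The first two memory questions can be scored generically, independent of any particular memory product. The sketch below assumes each scenario labels which memory IDs are relevant and which are known to be contradicted. The function and its inputs are hypothetical and do not represent the Adaptive Recall API.

```python
def memory_retrieval_scores(retrieved: list[str], relevant: set[str],
                            contradicted: set[str]) -> dict[str, float]:
    """Recall of relevant memories, and the share of retrieved memories that are contradicted."""
    got = set(retrieved)
    recall = len(got & relevant) / len(relevant) if relevant else 1.0
    stale_rate = len(got & contradicted) / len(got) if got else 0.0
    return {"recall": recall, "stale_rate": stale_rate}


memory_scores = memory_retrieval_scores(
    retrieved=["m1", "m2", "m9"],
    relevant={"m1", "m2", "m3"},
    contradicted={"m9"},
)
```

Here the agent missed one relevant memory and retrieved one contradicted memory, either of which could explain a wrong action later in the trajectory.
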