What Is LLM Evaluation? Methods, Metrics, and Why It Matters

LLM evaluation is the process of measuring whether a language model application produces outputs that are correct, relevant, grounded in the right information, and safe, using a defined set of inputs and a defined notion of quality. It exists because LLM outputs are non-deterministic and sensitive to small changes, so the only reliable way to know whether a prompt change, a model upgrade, or a new retrieval index made the system better or worse is to score it deliberately rather than spot-checking a few examples by hand.

Why Evaluation Is Different for LLMs

In conventional software, a function returns a deterministic value and you assert that it equals an expected value. The test passes or it fails. LLM applications break this model because the same input can produce different outputs across runs, because there is often no single correct output for an open-ended question, and because quality is a matter of degree rather than a binary. An answer can be mostly correct but miss a detail, correct but irrelevant to what was asked, or fluent and confident while being factually wrong. None of these failure modes are captured by an equality assertion.

This is why LLM evaluation is closer to measurement than to testing. You are estimating where the output distribution falls on several quality dimensions, not checking a single value. A good evaluation reports a set of scores, segmented by input type, that together describe how the system behaves, and it produces those scores repeatably so that a change in the score reflects a change in the system rather than noise. The repeatability comes from a fixed evaluation dataset and fixed scoring logic, which is why building a stable dataset is the foundation of any serious evaluation effort.

The Four Approaches to Evaluation

There are four established approaches, and production systems generally combine all four because each catches failures the others miss.

Reference-based evaluation compares the output against a known correct answer. For classification and extraction tasks this is straightforward: exact match, precision, recall, and F1 all apply. For generation, the comparison uses semantic similarity or token-overlap metrics such as BLEU and ROUGE, though token-overlap scores correlate weakly with human judgment for open-ended text. Reference-based evaluation is the most objective approach when a ground truth exists, but most interesting LLM tasks do not have a single gold answer.
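As a minimal sketch of the reference-based metrics named above, the following computes exact match and set-based precision, recall, and F1 for an extraction task. The function names, the normalization rule (lowercase and strip), and the sample entities are illustrative assumptions, not part of any particular library.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trimming whitespace and lowercasing (an assumed normalization)."""
    return prediction.strip().lower() == reference.strip().lower()


def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Set-based precision, recall, and F1 for extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical extraction result compared against a hand-labeled gold set.
predicted_entities = {"acme corp", "2024-01-15", "jane doe"}
gold_entities = {"acme corp", "2024-01-15", "john smith", "invoice 42"}

precision, recall, f1 = precision_recall_f1(predicted_entities, gold_entities)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints "precision=0.667 recall=0.500 f1=0.571"
```

Two of the three predicted items are correct and two of the four gold items are found, which is why precision and recall diverge here.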

Reference-free evaluation scores an output without a gold answer, by judging intrinsic properties. Does the answer follow from the provided context? Is it relevant to the question? Is it coherent? This is almost always done with a model in the loop, which is why LLM-as-a-judge has become the dominant method for open-ended tasks. It is the only approach that scales to cases with no reference, which describes most production generation.
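A reference-free judge can be sketched as a rubric prompt plus a parser for the score. The example below is a sketch under stated assumptions: the prompt wording, the 1 to 5 scale, the "SCORE:" output convention, and the `call_model` parameter are all hypothetical, and `call_model` stands in for whatever client your model provider supplies.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical rubric prompt; real rubrics are usually longer and task-specific.
JUDGE_TEMPLATE = (
    "You are grading whether an answer is supported by the provided context.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Explain your reasoning briefly, then end with a line of the form\n"
    "SCORE: <integer from 1 to 5>, where 5 means fully supported.\n"
)

SCORE_PATTERN = re.compile(r"SCORE:\s*([1-5])\b")


def judge_faithfulness(
    question: str,
    context: str,
    answer: str,
    call_model: Callable[[str], str],
) -> Tuple[Optional[int], str]:
    """Ask a judge model for a faithfulness score and return (score, raw reasoning)."""
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    reply = call_model(prompt)
    matches = SCORE_PATTERN.findall(reply)
    # Use the last match so a score mentioned mid-reasoning does not win.
    score = int(matches[-1]) if matches else None
    return score, reply


# A stub judge so the sketch runs without a network call.
def stub_judge(prompt: str) -> str:
    return "The answer restates a fact found in the context.\nSCORE: 5"


score, reasoning = judge_faithfulness(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France.",
    answer="The capital of France is Paris.",
    call_model=stub_judge,
)
```

Returning the raw reply alongside the score keeps the judge's reasoning available for later inspection, and a `None` score makes unparseable replies visible instead of silently counting them as a low grade.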

Human evaluation puts trained reviewers in the loop to rate outputs. It is the gold standard for nuance, domain correctness, and genuine helpfulness, and it is the source of truth that keeps automated evaluators calibrated. Its weakness is cost and speed: humans are slow and expensive, so human evaluation is reserved for samples and for disputed cases rather than every output.

Assertion-based evaluation checks concrete, deterministic properties: is the output valid JSON, does it contain a required citation, does it avoid a banned phrase, is it under a length limit, does it match a regex. These checks are cheap and unambiguous, and they catch a surprising fraction of real production bugs, especially in structured-output and tool-using systems. They are the LLM analog of unit tests.
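The checks listed above can be sketched with the standard library alone. This is an illustrative example, not a prescribed rule set: the expected `answer` field, the bracketed citation format, the banned phrase, and the length limit are all assumptions chosen for the demonstration.

```python
import json
import re

CITATION_PATTERN = re.compile(r"\[\d+\]")      # hypothetical citation format, e.g. [1]
BANNED_PHRASES = ("as an ai language model",)  # hypothetical policy list
MAX_ANSWER_CHARS = 500                         # hypothetical length budget


def run_assertions(raw_output: str) -> dict:
    """Return a pass/fail flag for each deterministic check."""
    results = {
        "valid_json": False,
        "has_citation": False,
        "no_banned_phrase": False,
        "within_length": False,
    }
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked on unparseable output
    results["valid_json"] = True

    # The remaining checks assume an object with a string "answer" field.
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return results
    answer = payload["answer"]

    results["has_citation"] = CITATION_PATTERN.search(answer) is not None
    lowered = answer.lower()
    results["no_banned_phrase"] = not any(p in lowered for p in BANNED_PHRASES)
    results["within_length"] = len(answer) <= MAX_ANSWER_CHARS
    return results


good = run_assertions('{"answer": "Paris is the capital of France [1]."}')
bad = run_assertions("Sure! Here is your answer: Paris.")
```

Reporting each check separately, rather than a single boolean, shows which rule failed without rerunning anything.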

Key Takeaway

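Reference-based evaluation works when you have a ground truth, reference-free and LLM-as-a-judge work for open-ended generation, human evaluation provides the calibration truth, and assertion-based checks catch format and policy bugs cheaply. Real systems layer all four as a pyramid: cheap assertion checks on every output at the base, automated reference-based and judge scoring in the middle, and costly human review of a small sample at the top.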

The Metrics That Matter

The metrics you track depend on the task, but a few families recur. Quality metrics include correctness against ground truth, faithfulness or groundedness (is the answer supported by the context), answer relevance (does it address the question), and completeness. For any system that retrieves information, retrieval metrics matter just as much: context precision, context recall, and mean reciprocal rank. Operational metrics (latency, cost, and token usage) belong in the same report because a quality gain that triples cost is rarely shippable. Safety metrics cover toxicity, PII leakage, and policy violations. A dedicated guide to LLM evaluation metrics covers how each is computed and when to use it.
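The retrieval metrics can be sketched in a few lines when each query has a labeled set of relevant document IDs. The definitions below are the simple unweighted forms, which is an assumption: some evaluation frameworks define context precision with rank weighting, so scores from this sketch may not match theirs. The document IDs are made up.

```python
def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in queries) / len(queries)


def context_precision(retrieved: list, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant (unweighted form)."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def context_recall(retrieved: list, relevant: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical labeled queries: (ranked retrieved IDs, set of relevant IDs).
labeled_queries = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9"], {"d5"}),              # first relevant hit at rank 1
    (["d4"], {"d8"}),                    # no relevant hit
]

mrr = mean_reciprocal_rank(labeled_queries)
print(f"MRR={mrr:.2f}")  # prints "MRR=0.50"
```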

The discipline that separates useful evaluation from vanity dashboards is tying every metric to a decision. Before you add a metric, state what you would do if it moved. If accuracy on a query category drops, you investigate the retriever and prompt for that category. If cost per conversation rises, you consider model routing or caching. If a metric does not change any action, it is not worth computing. This decision frame also tells you the threshold: the metric needs to be precise enough to distinguish a real regression from normal run-to-run variance, which usually means running multiple samples and reporting a confidence interval rather than a single number.
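One way to report a confidence interval rather than a single number is a percentile bootstrap over repeated runs. The sketch below assumes each list holds the aggregate score from one full run of the evaluation suite; the sample values, the 95% level, and the overlap test are illustrative choices, and checking for overlapping intervals is a deliberately crude comparison rather than a formal significance test.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval itself is repeatable
    n = len(scores)
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical accuracy from five runs of the same suite, before and after a change.
baseline_runs = [0.82, 0.79, 0.84, 0.80, 0.81]
candidate_runs = [0.80, 0.78, 0.83, 0.79, 0.82]

baseline_ci = bootstrap_ci(baseline_runs)
candidate_ci = bootstrap_ci(candidate_runs)
if intervals_overlap(baseline_ci, candidate_ci):
    print("Difference is within run-to-run variance under this check.")
else:
    print("Intervals are disjoint: treat the change as real.")
```

With only five runs per side the intervals are wide, which is the point: a small move in the mean is not evidence of a regression until the intervals separate.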

Offline and Online Evaluation

Evaluation happens in two settings. Offline evaluation runs against a fixed, curated dataset, typically in continuous integration, and gates changes before they ship. Because the dataset is stable, a score change is meaningful, but the dataset is always incomplete relative to real user behavior. Online evaluation runs against live production traffic, where you do not have gold answers, so it relies on reference-free evaluators and assertions running on a sample of requests. Online evaluation catches the long tail of real inputs that no test set anticipates. The strongest pattern connects the two: failures surfaced online get added to the offline dataset, so a problem that slipped through once becomes a permanent regression test.
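The connection between the two settings can be sketched as two small functions: one that decides which production requests to score, and one that promotes online failures into the offline dataset. Everything here is a hypothetical illustration, including the trace ID format, the 10% sampling rate, and the shape of the result records.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def promote_failures(offline_inputs: list, online_results: list) -> list:
    """Return the offline dataset extended with inputs that failed online."""
    seen = set(offline_inputs)
    updated = list(offline_inputs)
    for result in online_results:
        if not result["passed"] and result["input"] not in seen:
            seen.add(result["input"])
            updated.append(result["input"])
    return updated


# Hypothetical online evaluator results for a handful of sampled requests.
online_results = [
    {"input": "How do I reset my password?", "passed": True},
    {"input": "Cancel order #A-17 and refund to gift card", "passed": False},
    {"input": "Cancel order #A-17 and refund to gift card", "passed": False},
]
offline_inputs = ["How do I reset my password?"]

offline_inputs = promote_failures(offline_inputs, online_results)
sampled = [t for t in (f"trace-{i}" for i in range(1000)) if should_sample(t, 0.10)]
```

Hashing the trace ID, instead of drawing a random number, means the same request is always either in or out of the sample, which keeps reruns and debugging consistent.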

Common Mistakes That Undermine Evaluation

Several mistakes recur often enough to be worth naming. The first is reducing quality to a single number, which hides the distinct failure modes that need distinct fixes; a system that reports only accuracy cannot tell you whether it is hallucinating or answering the wrong question. The second is an unrepresentative dataset, where the test cases are easy or synthetic and the scores look healthy while real users hit edge cases the dataset never covered. The third is ignoring variance: running each input once and treating a noisy two-point move as a real change, which produces both false alarms and missed regressions. The fourth is evaluating only at launch and never again, leaving the system exposed to the silent model-provider updates and data drift that degrade quality with no code change at all.

The fifth and most insidious mistake is an uncalibrated judge. Teams adopt LLM-as-a-judge for its speed and scale, then never check whether the judge actually agrees with human raters on their task. A judge that systematically prefers verbose answers, or that misreads the rubric, will report confident scores that point the team in the wrong direction. The fix is cheap: label a small sample by hand and measure the judge's agreement with the humans before trusting it. Avoiding these five mistakes does more for evaluation quality than any sophisticated metric, because they are failures of process rather than of math, and process is what makes evaluation trustworthy.
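Measuring the judge's agreement with human raters can be as simple as computing raw agreement and Cohen's kappa on a hand-labeled sample. The sketch below implements kappa directly for categorical labels; the pass/fail labels and the ten-item sample are made-up data, and a real calibration set would normally be larger.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical hand labels versus judge labels on the same ten outputs.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "fail"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement={raw_agreement:.2f} kappa={kappa:.2f}")
# prints "raw agreement=0.80 kappa=0.58"
```

Kappa is lower than raw agreement because some matches would occur by chance alone, so it gives a more conservative read on whether the judge tracks the humans.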

Evaluation in the Development Loop

Evaluation is most valuable when it runs automatically on every change, the same way unit tests do. A practical loop looks like this: a developer changes a prompt or swaps a model, the evaluation suite runs against the offline dataset, and the result reports per-dimension scores against the previous baseline. If quality holds or improves and cost and latency stay within budget, the change ships. If a dimension regresses, the developer sees exactly which inputs failed and why, because the evaluation stores the inputs, outputs, and judge reasoning. This turns prompt engineering from guesswork into measurement, and it is the single highest-leverage practice for teams shipping LLM features.
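The ship-or-block decision in this loop can be sketched as a comparison of per-dimension scores against the stored baseline plus a budget check on operational metrics. The dimension names, the 0.02 tolerance, and the budget values below are illustrative assumptions; in practice the tolerance should reflect the measured run-to-run variance of each metric.

```python
def gate_change(baseline_quality, candidate_quality, candidate_ops, ops_budget, tolerance=0.02):
    """Return a list of human-readable failures; an empty list means the change can ship."""
    failures = []
    # Quality dimensions: higher is better, small drops within tolerance are allowed.
    for dimension, base_score in baseline_quality.items():
        new_score = candidate_quality.get(dimension)
        if new_score is None:
            failures.append(f"{dimension}: missing from candidate run")
        elif new_score < base_score - tolerance:
            failures.append(f"{dimension}: {base_score:.2f} -> {new_score:.2f}")
    # Operational metrics: lower is better, each must stay within its budget.
    for metric, limit in ops_budget.items():
        value = candidate_ops.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from candidate run")
        elif value > limit:
            failures.append(f"{metric}: {value} exceeds budget {limit}")
    return failures


# Hypothetical scores from the previous baseline and from the changed prompt.
baseline = {"faithfulness": 0.91, "relevance": 0.88}
candidate = {"faithfulness": 0.92, "relevance": 0.81}
ops = {"p95_latency_s": 2.1, "cost_per_request_usd": 0.004}
budget = {"p95_latency_s": 3.0, "cost_per_request_usd": 0.005}

failures = gate_change(baseline, candidate, ops, budget)
for failure in failures:
    print("BLOCKED:", failure)
# prints "BLOCKED: relevance: 0.88 -> 0.81"
```

In continuous integration, a non-empty failure list would typically translate into a non-zero exit code so the pipeline stops the change.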

The same evaluators that gate changes offline can run on production traffic as online evaluators, attaching scores to the observability traces. This is where evaluation and observability meet: the trace captures what happened, the online evaluator scores it, and the aggregate feeds the dashboards and alerts that catch regressions after release. A team with both has closed the loop from development through production.