
LLM Evaluation Metrics: A Practical Guide

LLM evaluation metrics fall into four families: quality metrics like correctness, faithfulness, and answer relevance; retrieval metrics like context precision, context recall, and mean reciprocal rank; operational metrics like latency, cost, and token usage; and safety metrics like toxicity and PII leakage. No single number captures quality, so mature evaluation tracks several metrics at once and reports them separately, segmented by query type, so that a regression in one dimension is not hidden by an average.

Why One Metric Is Never Enough

The instinct to reduce quality to a single accuracy number is the most common mistake in LLM evaluation. An answer can be accurate but irrelevant to the question, relevant but unsupported by the source material, faithful but incomplete, or correct but so slow and expensive that it is impractical to ship. Each of these is a distinct failure mode, and each needs its own metric. A system that reports a single 0.85 tells you almost nothing about which of these failure modes are present. A system that reports correctness 0.91, faithfulness 0.72, and answer relevance 0.88 tells you immediately that the system is making claims its context does not support. If the retrieval metrics look healthy, that points at the prompt and the generation model rather than the retriever.

Quality Metrics

Correctness or accuracy measures whether the output matches a known correct answer. For classification and extraction it is computed directly with exact match or F1. For generation, it is judged either by semantic similarity to a reference or by an LLM judge against a rubric. It is the headline metric when ground truth exists, but it is undefined for genuinely open-ended generation.
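As an illustration of the exact-match and F1 scoring mentioned above, here is a minimal sketch for short extraction-style answers. The function names and the whitespace-and-lowercase normalizer are assumptions made for this example; real pipelines usually normalize more aggressively (punctuation, articles, number formats).

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return 1.0 if pred == ref else 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the capital is Paris", "Paris"))
```

Exact match is strict and penalizes any extra wording, while token F1 gives partial credit, which is why the two are often reported together for extraction tasks.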

Faithfulness or groundedness measures whether the output is supported by the provided context, and it is the central metric for any RAG or memory system. A faithful answer makes only claims that follow from the retrieved material. It is typically computed by decomposing the answer into individual claims and checking each against the context, often with an LLM judge. Low faithfulness despite good retrieval is the signature of hallucination, which makes faithfulness the metric most directly tied to grounding.
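The claim-by-claim procedure described above can be sketched as follows. This assumes the answer has already been decomposed into claims and that the support check is supplied as a function. Both `faithfulness` and `naive_is_supported` are hypothetical names for this example, and the substring check is only a stand-in for the LLM judge a real system would call.

```python
from typing import Callable


def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the context."""
    if not claims:
        # Assumption for this sketch: an answer with no claims is vacuously faithful.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_is_supported(claim: str, context: str) -> bool:
    """Placeholder judge: case-insensitive substring match, not a real entailment check."""
    return claim.lower() in context.lower()


context = "Refunds take 5 days to process."
claims = ["refunds take 5 days", "refunds are free"]
print(faithfulness(claims, context, naive_is_supported))
```

Passing the judge in as a parameter keeps the scoring logic testable without a model call, and lets the same function run against a stricter judge later.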

Answer relevance measures whether the output actually addresses the question that was asked, independent of whether it is correct. A response can be factually true and well-grounded while answering a different question than the user asked. Relevance is usually scored by a judge comparing the answer against the original query.

Completeness measures whether the answer covers everything the question required. It matters most for multi-part questions and for tasks where omitting a relevant detail is as bad as stating a wrong one.

Key Takeaway

Faithfulness and answer relevance are the two quality metrics most teams under-measure. Correctness alone misses both: an answer can score well on a reference match while inventing unsupported claims or addressing the wrong question.

Retrieval Metrics

Any system that retrieves context, whether RAG or a memory layer, must measure the retrieval step separately from generation, because a bad answer often traces to bad retrieval.

Context precision is the fraction of retrieved chunks that were actually relevant. Low precision means the retriever is returning noise that crowds the context window and can distract the model.

Context recall is the fraction of all relevant chunks that were retrieved. Low recall means the retriever is missing information that the answer needs, and no amount of generation quality can compensate, because the model never sees the relevant material.
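A minimal sketch of both definitions, assuming chunks are identified by string IDs and that a labeled set of relevant IDs exists for the query. The function names and the empty-input conventions are assumptions for this example.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are in the labeled relevant set."""
    if not retrieved:
        return 0.0  # assumption: nothing retrieved scores zero
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of labeled relevant chunks that appear among the retrieved ones."""
    if not relevant:
        return 1.0  # assumption: nothing to find counts as full recall
    found = len(set(retrieved) & relevant)
    return found / len(relevant)


retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```

Here the retriever returns two noise chunks (lower precision) and misses chunk "e" entirely (lower recall), which are the two separate problems the definitions above distinguish.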

Mean reciprocal rank (MRR) measures how high the first relevant result appears in the ranked list, rewarding retrievers that put the right answer near the top. Related ranking metrics include normalized discounted cumulative gain (NDCG), which accounts for the position of all relevant results, not just the first.

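MRR = (1/N) * sum over queries i = 1..N of (1 / rank_i)

Here N is the number of queries and rank_i is the position of the first relevant result for query i. A query with no relevant result in its list contributes 0 to the sum.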
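The two ranking metrics can be sketched as below. The function names are chosen for this example, relevance is assumed to be binary, and each ranked list is assumed to contain no duplicate IDs. NDCG is shown in its common log2-discount form.

```python
import math


def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """Average over queries of 1 / (rank of the first relevant result); 0 if none is found."""
    if not rankings:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(rankings)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg


rankings = [["a", "b"], ["x", "y", "z"], ["q"]]
relevant = [{"a"}, {"z"}, {"missing"}]
print(mean_reciprocal_rank(rankings, relevant))  # (1 + 1/3 + 0) / 3
print(ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3))
```

MRR only looks at the first hit per query, so it suits lookups with one right answer. NDCG rewards placing every relevant result high, which matters when the generator needs several chunks.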

These metrics require a labeled set of which documents are relevant to which queries, which is the most labor-intensive part of building a retrieval evaluation. The investment pays off because it lets you tune the retriever, the embedding model, and the chunk size against a number rather than by feel. The vector search evaluation guide covers measuring recall in detail.

Operational Metrics

Quality metrics are meaningless without the operational context. Latency at p50 and p95 determines whether the system is usable; the p95 matters more than the median because the worst experiences drive user frustration. Cost per request and per conversation determines whether the system is economically viable at scale. Token usage for input and output is the lever behind cost and often behind latency too. Error and timeout rates capture reliability. These belong in the same report as quality metrics because nearly every quality improvement has an operational cost, and you can only judge the tradeoff when the numbers sit together. A change that adds two points of accuracy but doubles cost per conversation is usually the wrong change, and a metrics report that separates quality from cost hides that.
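As a rough sketch of how these numbers might be computed from request logs, the example below uses a nearest-rank percentile and a token-based cost estimate. The function names are hypothetical, and the per-million-token prices are placeholder values for illustration, not any provider's actual rates.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    k = math.ceil(pct * len(ordered) / 100)
    return ordered[max(k, 1) - 1]


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request given token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


latencies_ms = [120, 130, 150, 160, 180, 200, 220, 250, 400, 1900]
print(percentile(latencies_ms, 50))  # typical request
print(percentile(latencies_ms, 95))  # tail request

# Placeholder prices: 1.0 per million input tokens, 2.0 per million output tokens.
print(request_cost(1000, 500, 1.0, 2.0))
```

In this sample the mean latency is 371 ms, which describes neither the typical request (180 ms) nor the tail (1900 ms), so reporting the percentiles directly is more informative.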

Safety Metrics

Safety metrics measure the ways a system can do harm regardless of correctness. Toxicity and harmful-content rates flag offensive or dangerous outputs. PII leakage flags outputs that expose personal data. Jailbreak success rate measures how often adversarial inputs bypass safety instructions. Policy-violation rate measures compliance with domain rules, such as a financial bot giving prohibited advice. These are typically computed with classifiers or judge models running as assertions, and many run as online evaluators on live traffic because the cost of a single violation can be high.
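A safety assertion and the rate derived from it might look like the sketch below. The email regex is a deliberately crude stand-in for a real PII detector, and `leaks_email` and `violation_rate` are hypothetical names for this example. Production systems typically rely on dedicated classifiers or judge models, as noted above.

```python
import re
from typing import Callable

# Rough email pattern for illustration only; it covers one PII type and will miss edge cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def leaks_email(output: str) -> bool:
    """Assertion-style check: True if the output contains something that looks like an email."""
    return EMAIL_PATTERN.search(output) is not None


def violation_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs flagged by the given check."""
    if not outputs:
        return 0.0
    return sum(1 for text in outputs if check(text)) / len(outputs)


outputs = [
    "Your ticket is open.",
    "You can reach her at jane.doe@example.com for details.",
    "Done.",
]
print(violation_rate(outputs, leaks_email))
```

Because `violation_rate` accepts any boolean check, the same aggregation can be reused for toxicity, jailbreak, or policy checks backed by other detectors.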

Choosing Metrics by Task Type

The right metric set depends on what the system does, and matching metrics to task type prevents both gaps and wasted effort. For a classification or extraction task with a defined ground truth, accuracy, precision, recall, and F1 are the core, and an LLM judge adds little. For open-ended generation with no single right answer, reference-free judge-based metrics like helpfulness and coherence carry the load, supported by assertions for format and policy. For a RAG or memory system, the retrieval metrics (context precision and recall) and faithfulness are non-negotiable, because they capture the two ways such a system fails. For an agent, task completion and trajectory metrics dominate, since the process matters as much as the output. For any user-facing system, operational metrics and a safety set apply on top of whatever quality metrics the task demands.

A useful exercise is to write down, for your specific system, the two or three ways it most plausibly fails in production, then choose the metric that would catch each. A support bot most plausibly fails by inventing policy details, so faithfulness is the headline metric. A search system most plausibly fails by missing the relevant document, so context recall leads. This failure-first approach produces a focused metric set tied to real risk, rather than a sprawling dashboard of numbers that no one acts on. Start narrow with the metrics that map to your top failure modes, and add others only when a real decision needs them.

The Discipline of Segmentation

The most important practice with any metric is segmentation. An aggregate hides local failure, and local failure is what generates complaints. A correctness of 0.85 can mean uniform 0.85 across all query types, or it can mean 0.95 on common queries and 0.55 on an important but less frequent category. Only the segmented view distinguishes the two, and only the segmented view tells you where to focus. Always break metrics down by query type, user segment, and time window. The time-window segmentation is what turns evaluation metrics into regression detection: comparing this week against last week against a baseline reveals drift that a single snapshot cannot.
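The segmentation and baseline comparison described above can be sketched in a few lines. The record fields (`query_type`, `correctness`), the function names, and the 0.05 tolerance are all assumptions for this example.

```python
from collections import defaultdict


def segment_means(records: list[dict], key: str, metric: str) -> dict[str, float]:
    """Mean of a metric per segment, where the segment is the value of records[i][key]."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record[metric])
    return {segment: sum(vals) / len(vals) for segment, vals in buckets.items()}


def regressions(
    current: dict[str, float], baseline: dict[str, float], tolerance: float
) -> list[str]:
    """Segments whose current score fell below the baseline by more than the tolerance."""
    return sorted(
        segment
        for segment, base in baseline.items()
        if segment in current and base - current[segment] > tolerance
    )


records = [
    {"query_type": "common", "correctness": 1.0},
    {"query_type": "common", "correctness": 1.0},
    {"query_type": "common", "correctness": 1.0},
    {"query_type": "common", "correctness": 1.0},
    {"query_type": "rare", "correctness": 1.0},
    {"query_type": "rare", "correctness": 0.0},
]

overall = sum(r["correctness"] for r in records) / len(records)
by_type = segment_means(records, "query_type", "correctness")
print(overall)  # aggregate looks acceptable
print(by_type)  # the rare segment is where the failures are

# Hypothetical baseline from an earlier time window.
baseline = {"common": 0.98, "rare": 0.70}
print(regressions(by_type, baseline, tolerance=0.05))
```

The aggregate of roughly 0.83 hides that the rare segment sits at 0.5, and the baseline comparison flags only that segment, which is the kind of localized drift a single snapshot does not reveal.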