How to Evaluate a RAG System

Evaluating a RAG system means measuring the retrieval layer and the generation layer separately, because a bad answer can come from either, and the fix is different for each. Retrieval is scored with context precision, context recall, and mean reciprocal rank against a labeled set of relevant documents. Generation is scored with faithfulness and answer relevance against the retrieved context. The combination of scores tells you which layer failed: low retrieval scores point at the embeddings, chunking, and retriever, while good retrieval with low faithfulness points at the prompt and the generation model.

The single most important idea in RAG evaluation is that a RAG answer is the product of two stages, and an end-to-end accuracy number cannot tell you which stage broke. A question gets a wrong answer. Was it because the retriever never surfaced the document with the answer, or because the model had the right document and hallucinated anyway? These demand opposite fixes, and only layered evaluation distinguishes them.

Step 1: Separate retrieval from generation.
Instrument your pipeline so the evaluation can inspect the retrieved chunks independently of the final answer. For each evaluation question you want two artifacts: the ranked list of chunks the retriever returned, and the final generated answer. With both captured, you can score retrieval quality without reference to the answer, and score generation quality given the retrieved context. This separation is the foundation; everything else builds on it.
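A minimal sketch of that instrumentation follows. It assumes the retriever and generator can be called as plain functions; the names `RetrievedChunk`, `RagTrace`, and `run_with_trace` are hypothetical and not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievedChunk:
    chunk_id: str   # stable identifier that relevance labels can refer to
    text: str
    score: float    # retriever similarity score, higher is better


@dataclass
class RagTrace:
    question: str
    chunks: List[RetrievedChunk]  # ranked list, best first
    answer: str


def run_with_trace(
    question: str,
    retrieve: Callable[[str], List[RetrievedChunk]],
    generate: Callable[[str, List[RetrievedChunk]], str],
) -> RagTrace:
    """Run one question through the pipeline and keep both artifacts."""
    chunks = retrieve(question)          # artifact 1: what the retriever returned
    answer = generate(question, chunks)  # artifact 2: what the model produced
    return RagTrace(question=question, chunks=chunks, answer=answer)
```

Storing one `RagTrace` per evaluation question lets the retrieval metrics read only `chunks` while the generation metrics read `answer` together with the same `chunks`.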
Step 2: Build a labeled question set.
Collect 100 to 300 questions that represent real usage, drawn from production logs where possible because synthetic questions miss the messiness of real phrasing. For each question, label which documents or chunks in your corpus are relevant, and write or approve a reference answer. The relevance labels are the expensive part and the part that makes retrieval measurable. A practical shortcut for bootstrapping is to use an LLM to propose relevance labels and reference answers, then have a human review and correct them, which is far faster than labeling from scratch. The evaluation dataset guide covers this in depth.
Step 3: Score retrieval.
Against the relevance labels, compute context precision (what fraction of retrieved chunks were relevant), context recall (what fraction of the relevant chunks were retrieved), and mean reciprocal rank (how high the first relevant chunk appears). Recall is usually the metric to watch first, because if the relevant chunk is not retrieved at all, the generator cannot recover. Low recall sends you to the embedding model, the chunk size, and the retrieval strategy, where a hybrid search combining semantic and keyword matching often helps. Low precision means noise in the context, which you address with reranking or a higher relevance threshold.
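The three retrieval metrics can be sketched as below, using the definitions given in this step. Chunk IDs are assumed to be strings, and the function names are illustrative. Some evaluation libraries use a rank-weighted variant of context precision, so scores may not match theirs exactly.

```python
from typing import List, Sequence, Set, Tuple


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of labeled-relevant chunks that were retrieved."""
    if not relevant:
        return 0.0  # assumption: an unlabeled question contributes no recall
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["c3", "c1", "c9", "c4"]   # ranked retriever output
relevant = {"c1", "c4", "c7"}          # human relevance labels
print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
print(reciprocal_rank(retrieved, relevant))    # first relevant chunk is at rank 2
```

In this example recall is below 1.0 because `c7` was never retrieved, which is exactly the kind of miss that no downstream prompt change can repair.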
Step 4: Score generation.
Given the retrieved context, score the answer for faithfulness (does every claim follow from the context) and answer relevance (does it address the question). Both are best measured with a validated LLM judge. For faithfulness, the judge decomposes the answer into individual claims and checks each one against the context. You can also measure answer correctness against the reference answer where one exists. The key insight is to score generation conditioned on what was actually retrieved, not on the ideal context, because that isolates whether the generation step did its job with the material it was given.
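The claim-level faithfulness score can be sketched as follows. The judge is passed in as a callable so that a real LLM judge can be plugged in. The `naive_is_supported` function is only a substring-matching placeholder to make the example runnable, and claim extraction is assumed to have already happened.

```python
from typing import Callable, List, Optional


def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of answer claims that the judge says the context supports.

    Returns None when there are no claims, since the score is undefined.
    """
    if not claims:
        return None
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_is_supported(claim: str, context: str) -> bool:
    """Placeholder judge: case-insensitive substring match.

    A real setup would call a validated LLM judge here instead.
    """
    return claim.lower() in context.lower()


context = "Paris is the capital of France."
claims = [
    "Paris is the capital of France",   # stated in the context
    "Paris has 10 million residents",   # not in the context
]
print(faithfulness(claims, context, naive_is_supported))
```

Note that `context` here is the text that was actually retrieved for the question, not the ideal context from the labeled set.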
Step 5: Diagnose and fix the right layer.
Now the layered scores pay off. High retrieval scores and high faithfulness mean the system is working. Low retrieval recall means fix the retriever before touching anything else, because no prompt change helps when the answer was never retrieved. Good retrieval but low faithfulness means the model is hallucinating beyond its context, which you fix with a tighter prompt, a stronger model, or explicit grounding instructions, the same techniques covered in reducing hallucinations in RAG. Good retrieval and good faithfulness but low answer relevance usually means a prompt that does not focus the model on the actual question.
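The decision order described in this step can be written as a small function. The 0.8 threshold is an arbitrary placeholder rather than a recommended value, and the returned labels are hypothetical. Pick cutoffs from your own baseline.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    answer_relevance: float,
    threshold: float = 0.8,  # assumed cutoff; tune against your own baseline
) -> str:
    """Return which layer to fix first, checking retrieval before generation."""
    if context_recall < threshold:
        # The answer was never retrieved, so no prompt change can help.
        return "fix retriever: embeddings, chunking, retrieval strategy"
    if faithfulness < threshold:
        # Right context, unsupported claims: a generation problem.
        return "fix generation: tighter prompt, stronger model, grounding"
    if answer_relevance < threshold:
        # Grounded but off-topic: the prompt does not focus on the question.
        return "fix prompt focus"
    return "ok"


print(diagnose(context_recall=0.55, faithfulness=0.90, answer_relevance=0.90))
print(diagnose(context_recall=0.92, faithfulness=0.60, answer_relevance=0.90))
```

Retrieval is checked first on purpose: faithfulness and relevance scores are hard to interpret when the needed chunk was missing from the context.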
Step 6: Run continuously and feed back failures.
Wire the evaluation into continuous integration so every change to the retriever, embeddings, chunking, prompt, or model runs against the labeled set and reports per-layer scores against the baseline. In production, run online faithfulness checks on sampled traffic and add the failures to the labeled set, so real-world failure modes become permanent regression tests. Over time the dataset grows to cover the long tail that no initial curation anticipated.
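One way to sketch the baseline comparison for a CI job is shown below. The metric names, the tolerance of 0.02, and the function `find_regressions` are assumptions for illustration. How scores are stored and loaded is left out.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # assumed allowance for run-to-run noise
) -> List[str]:
    """List metrics that are missing or dropped more than `tolerance` below baseline."""
    regressions = []
    for metric in sorted(baseline):
        score = current.get(metric)
        if score is None or score < baseline[metric] - tolerance:
            regressions.append(metric)
    return regressions


baseline = {"context_recall": 0.90, "faithfulness": 0.85, "answer_relevance": 0.88}
current = {"context_recall": 0.80, "faithfulness": 0.86, "answer_relevance": 0.88}
print(find_regressions(baseline, current))  # only context_recall dropped
```

A CI script would typically fail the build when the returned list is non-empty, and because the list names the metric, the failure message already points at the layer that regressed.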
Key Takeaway

Always evaluate retrieval and generation separately. Low context recall is a retriever problem no prompt can fix; good retrieval with low faithfulness is a generation problem no embedding change can fix. The layered diagnosis is the entire value of RAG evaluation.

Generating an Evaluation Set for RAG

The labeling burden is the main obstacle to RAG evaluation, and a practical accelerator is synthetic question generation grounded in your own corpus. For each document or chunk, prompt a strong model to generate questions that the chunk answers. This gives you question-answer pairs where the relevant chunk is known by construction, so no manual relevance labeling is required. Generate across the corpus to get broad coverage, then have a human review a sample to confirm the questions are realistic and the labeled-relevant chunks are actually sufficient to answer them. This produces a usable retrieval evaluation set in hours rather than days, and it covers documents that real traffic has not yet exercised.
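The generation loop can be sketched as follows. The question generator is passed in as a callable, since the model and prompt are your choice. `EvalItem`, `build_synthetic_set`, and `sample_for_review` are hypothetical names, and the stand-in generator below only exists to make the example runnable.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalItem:
    question: str
    relevant_chunk_ids: List[str]
    source: str  # "synthetic" or "production"


def build_synthetic_set(
    chunks: Dict[str, str],
    generate_questions: Callable[[str], List[str]],
) -> List[EvalItem]:
    """Create eval items whose relevant chunk is known by construction."""
    items = []
    for chunk_id, text in chunks.items():
        for question in generate_questions(text):
            # The chunk the question was generated from is its relevance label.
            items.append(EvalItem(question, [chunk_id], "synthetic"))
    return items


def sample_for_review(items: List[EvalItem], k: int, seed: int = 0) -> List[EvalItem]:
    """Pick a reproducible random sample for a human to check."""
    return random.Random(seed).sample(items, min(k, len(items)))


def fake_generator(text: str) -> List[str]:
    """Stand-in for an LLM call that writes questions the chunk answers."""
    return [f"What does this passage say about: {text[:20]}?"]


corpus = {"c1": "Refunds are issued within 14 days.", "c2": "Support is open 9 to 5."}
items = build_synthetic_set(corpus, fake_generator)
print(len(items), items[0].relevant_chunk_ids)
```

The `source` field is there so that synthetic and production questions can be scored separately once both exist in the same set.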

Synthetic generation has a known limitation: the questions tend to be cleaner and more direct than real user queries, which are messy, ambiguous, and often span multiple documents. So treat the synthetic set as a starting point that you supplement with real questions from production logs as soon as you have them. The strongest RAG evaluation set combines both: synthetic pairs for broad corpus coverage and known relevance labels, plus real production questions for the messy long tail that reveals how the retriever behaves on actual usage. The two sources cover each other's blind spots, and together they give a realistic picture of retrieval quality.

Evaluating Memory-Backed Retrieval

A persistent memory system is a RAG system with two extra wrinkles: the corpus changes over time as memories are stored and consolidated, and each memory carries a confidence score. The same precision, recall, and faithfulness metrics apply directly, but you also evaluate whether confidence predicts correctness, because downstream components rely on it. Adaptive Recall's status tool reports retrieval quality and confidence distribution, which lets you treat the memory layer as a measurable retrieval component and score it with exactly the methodology above rather than trusting it blindly.
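Whether confidence predicts correctness can be checked with a simple binned comparison, sketched below. It assumes you already have (confidence, was_correct) pairs for retrieved memories. How those pairs are obtained from a given memory system is not shown, and the function name is hypothetical.

```python
from typing import List, Tuple


def accuracy_by_confidence(
    results: List[Tuple[float, bool]],
    n_bins: int = 5,
) -> List[Tuple[float, float, int, float]]:
    """Group results into equal-width confidence bins over [0, 1].

    Returns (bin_low, bin_high, count, accuracy) for each non-empty bin.
    """
    counts = [0] * n_bins
    correct = [0] * n_bins
    for confidence, was_correct in results:
        # Clamp so that a confidence of exactly 1.0 lands in the top bin.
        index = min(max(int(confidence * n_bins), 0), n_bins - 1)
        counts[index] += 1
        correct[index] += int(was_correct)
    return [
        (i / n_bins, (i + 1) / n_bins, counts[i], correct[i] / counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


results = [(0.9, True), (0.8, True), (0.7, False), (0.2, False), (0.3, True), (0.1, False)]
for low, high, count, accuracy in accuracy_by_confidence(results, n_bins=2):
    print(f"confidence {low:.1f}-{high:.1f}: n={count}, accuracy={accuracy:.2f}")
```

If accuracy rises with the confidence bin, the score carries signal that downstream components can use. If it is flat or inverted, the confidence values should not be trusted for filtering or ranking.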