How to Diagnose Why Your RAG Returns Bad Results
Why Systematic Diagnosis Matters
The most common mistake teams make with underperforming RAG is changing things randomly. They switch embedding models, increase chunk sizes, add more context, or swap LLMs without knowing which component is actually causing the failures. Each change costs engineering time and may fix one problem while introducing another. A 30-minute diagnosis pass on your failure cases will tell you exactly where to invest your effort.
RAG has four stages where failure can occur: retrieval (finding candidates), ranking (ordering candidates), context assembly (building the prompt), and generation (producing the answer). Each stage has different failure symptoms and different solutions. A retrieval failure requires different fixes than a generation failure, and applying generation fixes to retrieval problems wastes effort.
Step-by-Step Diagnosis
Collect 50 to 100 queries where your RAG system returned wrong, incomplete, or irrelevant answers. For each query, record the query text, the expected correct answer (look it up manually in your source documents), and the actual answer the system returned. If you do not have 50 failure cases, run your system against a test set of representative queries and manually evaluate the results. The size of this dataset matters because you need enough examples to see patterns. Individual failures can be misleading, but 50 failures with 30 sharing the same root cause give you a clear signal.
# Simple failure logging structure
failure_log = {
    "query": "What is our SLA for the payments API?",
    "expected": "99.95% uptime with 200ms p99 latency, per section 4.2 of the service agreement",
    "actual": "Our APIs are designed for high availability and low latency",
    "retrieved_chunks": [...],  # Log these for diagnosis
    "chunk_scores": [...],
    "timestamp": "2026-05-12T10:30:00Z"
}

For each failed query, inspect the top-k retrieved chunks (typically top-20 or top-50, not just the top-5 that get passed to the LLM). Search your chunk store manually for the chunk that contains the correct answer. If the correct chunk is not in the top-k at all, you have a retrieval failure. If it is in the top-k but not in the top-5, you have a ranking failure. Track the counts: how many of your 50 failures are retrieval failures versus ranking failures versus something else.
from your_vector_db import search

def diagnose_retrieval(query, expected_answer_chunk_id, k=50):
    results = search(query, top_k=k)
    retrieved_ids = [r.id for r in results]
    if expected_answer_chunk_id not in retrieved_ids:
        return "RETRIEVAL_FAILURE", None
    rank = retrieved_ids.index(expected_answer_chunk_id) + 1
    if rank > 5:  # Assuming top-5 go to the LLM
        return "RANKING_FAILURE", rank
    return "CONTEXT_OR_GENERATION", rank

For failures where the correct chunk was retrieved but ranked below the cutoff (ranking failures), look at what ranked higher. Are the higher-ranked chunks topically similar but from the wrong document? Are they outdated versions of the same information? Are they from a different section of the correct document? Each pattern points to a different fix. Topically similar distractors suggest you need a reranker. Outdated content suggests you need timestamp-based decay or metadata filtering. Same-document fragmentation suggests your chunking strategy splits related information apart.
Inspect the actual prompt sent to the LLM for cases where the correct chunk was in the top-5 but the answer was still wrong. Common problems include: the chunk was truncated because the total context exceeded the window size, the chunk was included but buried among irrelevant chunks (the "lost in the middle" effect), or the chunk formatting stripped important structure (tables rendered as garbled text, code blocks lost their formatting). Log the full prompt for failed queries so you can inspect it during diagnosis.
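If you log the ordered list of chunks that actually made it into the prompt, a small check can flag the two most common context problems automatically. This is a rough sketch under that assumed logging format; the "lost in the middle" check is only a position heuristic.

def diagnose_context(prompt_chunks, correct_chunk_id, correct_chunk_text):
    """prompt_chunks: ordered list of (chunk_id, text) pairs as assembled into the prompt
    (hypothetical logging format)."""
    ids = [cid for cid, _ in prompt_chunks]
    if correct_chunk_id not in ids:
        return "TRUNCATED_OUT"            # selected by retrieval but cut to fit the window
    pos = ids.index(correct_chunk_id)
    _, included_text = prompt_chunks[pos]
    if len(included_text) < len(correct_chunk_text):
        return "PARTIALLY_TRUNCATED"      # only part of the chunk survived assembly
    if len(prompt_chunks) >= 5 and 0 < pos < len(prompt_chunks) - 1:
        return "BURIED_IN_MIDDLE"         # neither first nor last in a long context
    return "CONTEXT_OK"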
For cases where the correct context reached the LLM in a usable format but the answer was still wrong, the problem is in how the LLM uses the context. Common causes include: the generation prompt does not instruct the LLM to cite specific passages, the LLM over-generalizes rather than extracting the specific answer, or the LLM hallucinates additional details not in the context. These are fixed by improving the generation prompt, adding citation requirements, or switching to a model that follows grounding instructions more faithfully.
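As one illustration of tightening the generation prompt, the template below adds explicit grounding and citation instructions. The wording is a starting point to adapt, not a canonical prompt.

GROUNDED_ANSWER_PROMPT = """Answer the question using ONLY the context below.
Rules:
- Cite the specific passage that supports each claim, e.g. [chunk 3].
- If the context does not contain the answer, say "Not found in the provided documents."
- Do not add details that are not stated in the context.

Context:
{context}

Question: {question}
"""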
Group your failures by root cause. Fix the category that accounts for the most failures first. If 35 of your 50 failures are retrieval failures, improving the embedding model or adding hybrid search will have more impact than any generation-side improvement. If 20 are ranking failures, adding a cross-encoder reranker is the highest-leverage fix. This prioritization prevents the common mistake of optimizing the wrong component.
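Tallying the labels produced by the diagnosis steps above makes this prioritization mechanical. A minimal sketch:

from collections import Counter

def prioritize(failure_labels):
    """failure_labels: one label per failed query, e.g. 'RETRIEVAL_FAILURE'.
    Prints categories from most to least common so you fix the biggest bucket first."""
    counts = Counter(failure_labels)
    for label, n in counts.most_common():
        print(f"{label}: {n} of {len(failure_labels)} failures")

# Example: 35 retrieval, 10 ranking, 5 generation failures -> work on retrieval first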
Fixes by Failure Type
Retrieval failures: Switch to a better embedding model (Voyage or Cohere embed-v3 outperform OpenAI's ada-002 on retrieval benchmarks), add hybrid search with BM25 for keyword matching, improve your chunking strategy to keep answers intact within single chunks, and add query expansion to rephrase the query in multiple ways.
Ranking failures: Add a cross-encoder reranker (ms-marco-MiniLM-L-6-v2 is a good starting point for English; see the sketch after this list), implement metadata-based filtering to exclude irrelevant content before similarity scoring, and add recency weighting so that current content outranks outdated content.
Context failures: Increase the context window or reduce the number of chunks passed to the LLM, reorder chunks so the most relevant appear first and last (avoiding the middle), and improve chunk formatting to preserve structure.
Generation failures: Improve the generation prompt with explicit grounding instructions, require the LLM to cite specific passages, and add a verification step that checks whether the answer is actually supported by the retrieved context.
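For the reranking fix named above, the sentence-transformers library exposes the cross-encoder as a drop-in scorer. A minimal sketch, assuming your first-stage retrieval already returns candidate chunk texts:

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and passage together, so it scores relevance
# more precisely than the bi-encoder similarity used for first-stage retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    """candidates: list of chunk texts from first-stage retrieval (e.g. the top-50)."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

Reranking the top-50 candidates before passing the top-5 to the LLM directly targets the ranking failures counted during diagnosis.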
Adaptive Recall addresses all four failure types architecturally. Cognitive scoring replaces pure similarity ranking with a multi-factor score that includes recency, frequency, entity connections, and confidence. The knowledge graph provides an alternative retrieval path that finds structurally connected information even when vocabulary does not overlap. Memory consolidation keeps the knowledge base current by merging, updating, and removing stale information. Evidence-gated learning prevents the system from storing or retrieving information that has not been validated.
Stop guessing why your retrieval fails. Adaptive Recall's cognitive scoring and knowledge graph address the root causes of RAG accuracy problems.
Try It Free