
How to Reduce Hallucinations in RAG Pipelines

RAG pipelines hallucinate for two reasons: the retrieval step returns irrelevant or insufficient context, or the generation step ignores or goes beyond the retrieved context. Fixing hallucinations in RAG therefore requires addressing both sides: improving retrieval so the model gets the right information, and constraining generation so the model stays grounded in what it was given.

Before You Start

If your RAG pipeline is hallucinating frequently, diagnose where the problem originates before adding mitigation layers. Retrieve the context for a set of hallucinated responses and check whether the correct information was present in the retrieved chunks. If the right information was retrieved but the model ignored it, the problem is in generation. If the right information was not retrieved, the problem is in retrieval. Most RAG hallucination problems are retrieval problems disguised as generation problems, so start your investigation there.
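A minimal sketch of this diagnosis, assuming you have logged each hallucinated response together with its query, the ground-truth fact the answer should have contained, and the chunks that were retrieved for it. The token-overlap check is a crude standard-library stand-in for the embedding similarity you would use in practice:

def classify_failure(correct_fact, retrieved_chunks, overlap_threshold=0.7):
    # Fraction of the fact's words that appear in the best retrieved chunk.
    fact_tokens = set(correct_fact.lower().split())
    def overlap(chunk):
        return len(fact_tokens & set(chunk.lower().split())) / max(len(fact_tokens), 1)
    best = max((overlap(chunk) for chunk in retrieved_chunks), default=0.0)
    # Fact was retrieved but the model ignored it -> generation problem.
    # Fact never reached the model -> retrieval problem.
    return "generation_problem" if best >= overlap_threshold else "retrieval_problem"

Run this over your sample of hallucinated responses and count the two verdicts; the majority tells you which half of the pipeline to fix first.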

Step-by-Step Hallucination Reduction

Step 1: Fix retrieval quality first.
The single most impactful change for RAG hallucination rates is improving retrieval so the model consistently receives relevant, accurate context. Three changes make the biggest difference. First, re-evaluate your chunking strategy. Chunks that are too small lose context. Chunks that are too large dilute the relevant information with noise. For most applications, 200 to 500 tokens per chunk with 50-token overlap produces the best retrieval quality. Second, switch from pure vector search to hybrid search that combines semantic similarity with keyword matching (BM25). Hybrid search catches queries where the relevant document shares exact terms with the query but not necessarily the same semantic framing. Third, add a reranking step that uses a cross-encoder to rescore the top 20 to 50 vector search results and return the most relevant 3 to 5. Reranking consistently improves retrieval precision by 10% to 25%.
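The sketch below puts the three retrieval changes together, assuming the rank_bm25 and sentence-transformers packages are installed; the model names, the 0.5 fusion weight, and the two-document corpus are illustrative placeholders, not fixed choices:

import numpy as np
from collections import namedtuple
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

documents = [
    "PostgreSQL 16 added logical replication from standby servers.",
    "PostgreSQL 15 introduced the SQL-standard MERGE command.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

Scored = namedtuple("Scored", ["text", "score"])

def hybrid_search(query, top_k=20, alpha=0.5):
    # Lexical side: BM25 scores, min-max normalized to [0, 1].
    lex = np.asarray(bm25.get_scores(query.lower().split()), dtype=float)
    if lex.max() > lex.min():
        lex = (lex - lex.min()) / (lex.max() - lex.min())
    # Semantic side: cosine similarity against precomputed chunk embeddings.
    q_emb = embedder.encode(query, convert_to_tensor=True)
    sem = util.cos_sim(q_emb, doc_embeddings)[0].cpu().numpy()
    fused = alpha * sem + (1 - alpha) * lex
    return [documents[i] for i in np.argsort(fused)[::-1][:top_k]]

def cross_encoder_rerank(query, candidates, top_k=5):
    # The cross-encoder scores each (query, chunk) pair jointly. Its score
    # scale depends on the model, so calibrate downstream thresholds to it.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [Scored(text, float(score)) for text, score in ranked[:top_k]]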
Step 2: Add relevance filtering to discard bad retrievals.
Not every query has relevant content in your knowledge base, and passing irrelevant chunks to the model is worse than passing no chunks at all. Irrelevant context confuses the model, causing it to either incorporate the irrelevant information into its response (creating a grounded-but-wrong answer) or ignore the context entirely and fall back to parametric generation (the same as not having RAG at all). Set a minimum similarity threshold for retrieved chunks. If no chunk exceeds the threshold, return a response indicating that the system does not have relevant information rather than generating a potentially hallucinated answer.
def filtered_retrieval(query, min_similarity=0.72):
    # Over-retrieve, rerank, then keep only chunks that clear the threshold.
    # The 0.72 value is not universal: calibrate it to your reranker's score scale.
    results = hybrid_search(query, top_k=20)
    reranked = cross_encoder_rerank(query, results, top_k=5)
    relevant = [r for r in reranked if r.score >= min_similarity]
    if not relevant:
        # Nothing relevant enough: signal the caller to return an
        # "I don't have that information" response instead of generating from noise.
        return None, "no_relevant_context"
    return relevant, "context_found"
Step 3: Constrain the generation prompt to stay within context.
The generation prompt is where you tell the model how to use the retrieved context. Weak prompts like "use the following information to help answer" give the model permission to supplement with its own knowledge, which is where hallucinations enter. Strong prompts explicitly constrain the model: answer only from the provided context, state when the context is insufficient, do not infer or extrapolate beyond what the documents say. The prompt should also handle the case where relevant context was found but does not fully answer the question, instructing the model to answer what it can from the context and clearly indicate what remains unanswered.
CONSTRAINED_PROMPT = """Answer the user's question using ONLY the context provided below. Follow these rules:
1. Base every factual claim on specific passages in the context
2. If the context does not contain enough information to answer the question fully, say what you can answer from the context and state what information is missing
3. Never add facts, statistics, names, or dates that do not appear in the context
4. If the context contains no relevant information at all, respond: "I don't have information about that in my available sources."
5. When referencing context, indicate which document or passage supports each claim

Context:
{retrieved_chunks}

User question: {query}"""
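One way to fill the template, reusing the scored-chunk shape returned by filtered_retrieval above; numbering each passage gives the model something concrete to point at under rule 5:

def build_prompt(query, relevant_chunks):
    # Label each passage so the model can cite it as [Passage N].
    context = "\n\n".join(
        f"[Passage {i + 1}] {chunk.text}" for i, chunk in enumerate(relevant_chunks)
    )
    return CONSTRAINED_PROMPT.format(retrieved_chunks=context, query=query)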
Step 4: Require citations that link claims to specific context passages.
Requiring the model to cite its sources creates an implicit constraint against hallucination. When the model must point to a specific passage that supports each claim, fabricating a claim becomes harder because it also requires fabricating or misattributing a citation. Add citation instructions to your prompt: "For each factual statement, include a reference in brackets indicating which context passage supports it." Then verify in post-processing that the cited passages actually exist in the retrieved context and actually support the claims attributed to them.
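A post-processing sketch for the existence half of that check, assuming the [Passage N] labeling from the build_prompt sketch above; whether a cited passage actually supports its claim is the entailment question Step 5 handles:

import re

def dangling_citations(response, relevant_chunks):
    # References the model cited that do not map to any retrieved passage.
    cited = {int(n) for n in re.findall(r"\[Passage (\d+)\]", response)}
    valid = set(range(1, len(relevant_chunks) + 1))
    return sorted(cited - valid)  # non-empty means fabricated or misnumbered citations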
Step 5: Add post-generation verification.
Even with good retrieval and constrained prompts, models occasionally go beyond their context. A post-generation verification step catches these cases. The simplest approach extracts factual claims from the generated response and checks each one against the retrieved context using semantic similarity or NLI entailment classification. Claims that are supported pass through. Claims that contradict the context are flagged or removed. Claims that are not addressed by the context at all (extrinsic additions) are flagged as unverified. This verification step adds 1 to 3 seconds of latency but catches the hallucinations that prompt engineering alone misses.
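A sketch of that check with an off-the-shelf NLI cross-encoder. The claim extraction here is naive sentence splitting (in practice, use an LLM or a dedicated claim extractor), the model name is one public option, and the label order follows that model's card, so confirm it for whichever model you deploy:

import numpy as np
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]  # per the model card

def verify_response(response, context, entail_threshold=0.8):
    claims = [s.strip() for s in response.split(".") if s.strip()]
    results = []
    for claim in claims:
        logits = nli.predict([(context, claim)])[0]   # (premise, hypothesis)
        probs = np.exp(logits) / np.exp(logits).sum()
        top = LABELS[int(np.argmax(probs))]
        if top == "entailment" and probs[1] >= entail_threshold:
            status = "supported"          # passes through unchanged
        elif top == "contradiction":
            status = "contradicted"       # flag or remove
        else:
            status = "unverified"         # extrinsic addition
        results.append((claim, status))
    return results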

Common RAG Hallucination Patterns and Fixes

Several hallucination patterns recur in RAG systems, each with a specific fix. The "confident extrapolation" pattern occurs when the model has partial information and fills in gaps with plausible-sounding fabrication. For example, if the context mentions that a feature was released in Q3 but does not specify the month, the model might claim "released in August" rather than saying "released in Q3." Fix this by explicitly instructing the model to preserve the precision level of the source: if the source says Q3, the response should say Q3, not invent a specific month.

The "entity confusion" pattern occurs when the model conflates two similar entities mentioned in different retrieved chunks. If one chunk discusses PostgreSQL 15 and another discusses PostgreSQL 16, the model might attribute PostgreSQL 16 features to version 15 or vice versa. Fix this by adding entity-aware chunking that keeps version-specific information in distinct chunks, and by adding an instruction to the prompt that says "when the context discusses multiple versions, entities, or time periods, keep them distinct in your response."

The "stale context" pattern occurs when the knowledge base contains outdated information that was accurate when written but no longer reflects reality. The model faithfully reproduces the outdated information, creating a hallucination that is grounded in sources but still wrong. Fix this by adding timestamps to your indexed documents and weighting more recent documents higher during retrieval. For frequently-changing information, add a staleness check that flags context older than a configurable threshold.

The "insufficient context" pattern occurs when the retrieved chunks are tangentially related to the query but do not actually contain the answer. The model, rather than admitting it cannot answer, synthesizes a response from the tangential context that sounds reasonable but addresses a slightly different question. Fix this with the relevance filtering from Step 2, which prevents low-quality retrievals from reaching the model, and with the constrained prompt from Step 3, which explicitly instructs the model to acknowledge when the context does not answer the question.

Measuring Improvement

Track your RAG hallucination rate before and after each change by sampling 50 to 100 responses per week and manually classifying their claims as grounded, extrapolated, or fabricated. In order of typical impact, the changes are: hybrid search with reranking (reduces hallucinations by 15% to 25%), relevance filtering (another 10% to 15%), constrained prompting (10% to 20%), and post-generation verification (catches 60% to 70% of the hallucinations that remain). Applied together, the four changes typically bring RAG hallucination rates from the 15% to 25% range down to 3% to 7%.
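A small tally helper for that weekly sample, assuming one recorded label per claim; here both extrapolated and fabricated claims count toward the rate, which you can adjust to match your own definition:

from collections import Counter

def weekly_rates(labels):
    # labels: e.g. ["grounded", "grounded", "extrapolated", "fabricated", ...]
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {
        "grounded": counts["grounded"] / total,
        "hallucination_rate": (counts["extrapolated"] + counts["fabricated"]) / total,
    }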

Go beyond naive RAG. Adaptive Recall combines cognitive scoring, knowledge graph grounding, and confidence-weighted retrieval to build RAG pipelines that stay grounded in facts.
