
Do Hallucinations Get Worse with Longer Context?

Yes. Research consistently shows that hallucination rates increase as context length grows, even for models that technically support very long context windows. The "lost in the middle" effect causes models to miss or misuse facts buried in long contexts, particularly in the middle portion. A fact that a model uses correctly in a 2,000-token context might be ignored or distorted in a 100,000-token context. This is why targeted retrieval of compact, highly relevant context produces more accurate responses than filling the context window to capacity.

The Lost-in-the-Middle Effect

The seminal "Lost in the Middle" research demonstrated that language models attend unevenly to information based on its position in the context window. Facts placed at the beginning and end of the context are used reliably. Facts placed in the middle of a long context are frequently ignored or incorrectly recalled. This positional bias means that a model with a 128K token context window does not effectively use all 128K tokens equally. The practical usable context, the portion where facts are reliably attended to, is significantly smaller than the technical maximum.

For hallucination, this means that long contexts create more opportunities for the model to miss relevant information and fall back to parametric generation. If the answer to a user's question is buried at position 60,000 in a 100,000-token context, the model may not attend to it and instead generate an answer from its training data patterns. The irony is that the correct information was provided but not used, producing a hallucination that would not have occurred with better context management.

The effect is not uniform across all fact types. Simple, distinctive facts (a specific product name, an unusual number) are more likely to be recalled from any position because they stand out from surrounding text. Common patterns and less distinctive facts (a standard configuration value, a date among many dates) are more likely to be missed in the middle zone because they blend into the surrounding content. This means the hallucination penalty from long context is worst for exactly the type of facts that need to be most precise.

How Context Length Affects Different Hallucination Types

Intrinsic hallucination rates (contradicting information in the context) increase roughly linearly with context length. At 2,000 tokens, models contradict the provided context on about 2% of factual claims. At 32,000 tokens, this rises to 5% to 8%. At 100,000 tokens, it can reach 10% to 15%, depending on where the relevant information is positioned. These numbers mean that doubling the context does not double the hallucination rate, but it does increase it measurably and consistently.

Extrinsic hallucination rates (adding information not in the context) also increase with context length, but for a different reason. Longer contexts give the model more surface area for synthesis, which creates more opportunities to combine facts from different parts of the context in ways that produce false composite claims. A 2,000-token context about a single topic provides limited synthesis opportunities. A 100,000-token context containing information about dozens of related topics provides many more ways for the model to combine facts that do not belong together.

Temporal confusion becomes significantly worse with long contexts that contain information from multiple time periods. A short context might contain only current documentation. A long context might contain current documentation, last year's changelog, and historical design documents, all mixed together. The model must determine which information is current and which is historical, and longer contexts with more temporal variety increase the chance of the model citing outdated information as current.

Why Targeted Retrieval Beats Full Context

The practical implication is that less context is often better than more context, as long as the right context is selected. Retrieving the 5 to 10 most relevant passages and placing them in a 2,000-token context block produces more accurate responses than dumping 50 documents into a 100,000-token context. The compact context focuses the model's attention on the most relevant information, avoids the lost-in-the-middle problem, and leaves the model less room to cherry-pick unhelpful passages or fall back to parametric knowledge.
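To make the idea concrete, here is a minimal sketch of targeted retrieval: rank candidate passages against the query and keep only the top few within a small token budget. The embed callable, the top_k and max_tokens defaults, and the word-count token estimate are stand-ins for whatever embedding model, retrieval settings, and tokenizer your stack actually uses.

```python
import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Plain cosine similarity; replace with your vector store's native search.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_compact_context(query: str,
                          passages: List[str],
                          embed: Callable[[str], Sequence[float]],
                          top_k: int = 8,
                          max_tokens: int = 2000) -> str:
    """Keep only the top_k most relevant passages, within a small token budget."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)

    parts, budget = [], max_tokens
    for passage in ranked[:top_k]:
        cost = len(passage.split())  # crude token estimate; swap in a real tokenizer
        if cost > budget:
            break
        parts.append(passage)
        budget -= cost
    return "\n\n".join(parts)
```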

Persistent memory systems like Adaptive Recall are designed around this principle. Cognitive scoring retrieves the most relevant, most confident, most recent memories rather than returning everything in the store. The result is a compact, high-quality grounding context that the model attends to fully, producing more accurate responses than an approach that provides maximum context and hopes the model finds the relevant parts.

The quality of context selection matters as much as the quantity. Retrieving 5 highly relevant passages produces better accuracy than retrieving 50 moderately relevant passages, even though the 50 passages contain more total information. The relevant passages ground the model on exactly the facts it needs. The moderately relevant passages dilute the signal with tangential information that the model might incorporate into a confused response. Cognitive scoring, which ranks retrieval by relevance, recency, confidence, and entity connections, produces higher-quality selections because it considers multiple dimensions of relevance rather than semantic similarity alone.
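Adaptive Recall's internal scoring is not reproduced here, but the idea of ranking on multiple dimensions can be sketched as a weighted blend of signals. The weights, field names, and half-life below are illustrative assumptions, not the product's actual values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set

@dataclass
class Memory:
    text: str
    similarity: float    # semantic similarity to the query, 0..1
    confidence: float    # how reliable the stored fact is, 0..1
    last_seen: datetime  # UTC timestamp of the most recent confirmation
    entities: Set[str] = field(default_factory=set)

def cognitive_score(memory: Memory,
                    query_entities: Set[str],
                    now: datetime,
                    half_life_days: float = 30.0) -> float:
    """Blend several signals instead of ranking on similarity alone.
    Weights and half-life are illustrative, not Adaptive Recall's real values."""
    age_days = (now - memory.last_seen).total_seconds() / 86_400
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay with age
    overlap = len(memory.entities & query_entities) / max(len(query_entities), 1)
    return (0.5 * memory.similarity
            + 0.2 * recency
            + 0.2 * memory.confidence
            + 0.1 * overlap)

def rank_memories(memories: List[Memory],
                  query_entities: Set[str],
                  k: int = 8) -> List[Memory]:
    now = datetime.now(timezone.utc)
    return sorted(memories,
                  key=lambda m: cognitive_score(m, query_entities, now),
                  reverse=True)[:k]
```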

Strategies for Long Context Applications

When your application genuinely needs long contexts (document analysis, codebase questions, multi-document synthesis), several strategies reduce the hallucination penalty.

Position critical facts strategically. Place the most important information at the beginning and end of the context, where attention is strongest. If you need the model to use a specific date, number, or configuration value accurately, put it in the first 500 tokens of the context block. If you are providing multiple documents, put the most critical document first and the supplementary documents after it.
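As a rough illustration, a context assembler might sort documents by criticality and restate the single must-use value at both ends of the prompt, where attention is strongest. The Doc structure and its priority field are hypothetical; adapt them to however your pipeline represents documents.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Doc:
    title: str
    text: str
    priority: int  # lower = more critical

def order_for_attention(docs: List[Doc], critical_value: str) -> str:
    """Most critical document first, with the must-use value stated in the
    opening tokens and repeated at the end, where attention is strongest."""
    ordered = sorted(docs, key=lambda d: d.priority)
    opening = f"CRITICAL VALUE (use exactly as written): {critical_value}"
    body = "\n\n".join(f"{d.title}\n{d.text}" for d in ordered)
    return f"{opening}\n\n{body}\n\n{opening}"
```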

Use structured formatting with clear section headers so the model can navigate long contexts more effectively. Unformatted text walls are harder for the model to parse than well-organized content with explicit labels. Mark each section with its source, topic, and date. Use clear separators between documents from different sources. Add a "key facts" summary at the top that lists the specific values the model is most likely to need, even if those values also appear in the full documents below.
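A simple formatter along these lines might look like the sketch below; the Section fields, the separator style, and the "KEY FACTS" label are assumptions, not a required format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    source: str
    topic: str
    date: str  # e.g. "2024-11-03"
    text: str

def format_context(sections: List[Section], key_facts: List[str]) -> str:
    """Label each section with source, topic, and date, separate sections
    clearly, and lead with a key-facts summary the model is likely to need."""
    summary = "KEY FACTS:\n" + "\n".join(f"- {fact}" for fact in key_facts)
    blocks = [
        f"[source: {s.source} | topic: {s.topic} | date: {s.date}]\n{s.text}"
        for s in sections
    ]
    separator = "\n\n" + "=" * 40 + "\n\n"
    return summary + separator + separator.join(blocks)
```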

For applications where the full document must be in context (legal document review, code analysis), consider a two-pass approach: first, identify the relevant sections of the document using a lightweight retrieval step, then present only those sections plus surrounding context to the generation model. This gives the model the benefit of focused context while still covering the full document. The first pass can use a cheaper, faster model or even keyword search to narrow down the relevant portions, and the second pass uses the full-capability model on the focused subset.
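One possible shape for the first pass is a plain keyword filter that keeps any section mentioning a query term, expanded to include neighboring sections so the generation model still sees contiguous context. The helper below is a sketch under that assumption, not a prescribed implementation.

```python
from typing import List

def first_pass_filter(document_sections: List[str],
                      query_terms: List[str],
                      neighbors: int = 1) -> List[str]:
    """Cheap first pass: keep sections that mention a query term, plus their
    neighbors, so the second-pass model sees focused but contiguous context."""
    hits = {
        i for i, section in enumerate(document_sections)
        if any(term.lower() in section.lower() for term in query_terms)
    }
    keep = set()
    for i in hits:
        # Expand each hit to include the sections immediately around it.
        keep.update(range(max(0, i - neighbors),
                          min(len(document_sections), i + neighbors + 1)))
    return [document_sections[i] for i in sorted(keep)]
```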

Add post-generation verification proportional to context length. Short-context responses need minimal verification because the hallucination rate is low. Long-context responses should always be verified because the hallucination rate is higher. Scale your verification investment with your context length: if you are using 100K tokens of context, budget for the additional detection overhead that the longer context demands.
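One way to operationalize this is a small budget function that grants more verification checks as the context grows; the defaults below are illustrative rather than measured thresholds.

```python
def verification_budget(context_tokens: int,
                        base_checks: int = 2,
                        checks_per_10k_tokens: int = 1,
                        max_checks: int = 15) -> int:
    """Grant more post-generation fact checks as the context grows.
    The defaults are illustrative, not measured thresholds."""
    extra = (context_tokens // 10_000) * checks_per_10k_tokens
    return min(base_checks + extra, max_checks)

# A 2K-token context gets minimal verification; a 100K-token context gets
# a much larger budget.
print(verification_budget(2_000))    # 2
print(verification_budget(100_000))  # 12
```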

Get focused, high-quality context instead of everything at once. Adaptive Recall's cognitive scoring retrieves the most relevant, most reliable memories for every query.

Get Started Free