How to Combine RAG with Long Context Windows
Why It Is Not RAG Versus Long Context
The debate framing RAG against long context windows presents a false choice. Long context windows solve the "not enough space" problem: small knowledge bases can be loaded entirely so the LLM can attend to everything. RAG solves the "too much noise" problem: large knowledge bases get filtered so the LLM only processes relevant content. Real applications need both because knowledge bases are rarely uniform in size and query types are rarely uniform in complexity.
A customer support application might have 50 core policy documents (easily fit in context) and 200,000 historical ticket records (require RAG). A coding assistant might have a small project README (context) and a large codebase (RAG). The right architecture routes each query to the appropriate strategy rather than forcing everything through one path.
Step-by-Step Implementation
Measure your knowledge base in tokens and compare it against current model context windows. Claude's 200k context window holds roughly 150,000 words. GPT-4's 128k window holds roughly 96,000 words. Models with 1M+ token windows hold roughly 750,000 words. If your entire knowledge base fits in the smallest context window you need to support, full-context approaches are viable. If it exceeds the largest available window, RAG is necessary for at least part of the corpus.
import tiktoken

def estimate_tokens(text, encoding="cl100k_base"):
    enc = tiktoken.get_encoding(encoding)
    return len(enc.encode(text))

def assess_knowledge_base(documents):
    total_tokens = sum(estimate_tokens(doc.text) for doc in documents)
    return {
        "total_tokens": total_tokens,
        "fits_200k": total_tokens < 180_000,  # leave room for prompt and response
        "fits_128k": total_tokens < 110_000,
        "fits_1m": total_tokens < 900_000,
        "estimated_cost_per_query_full": total_tokens * 0.000003,  # $3/M input tokens
        "estimated_cost_per_query_rag": 5_000 * 0.000003,  # ~5k retrieved tokens
    }

The cost calculation matters. If your knowledge base is 500,000 tokens and you process 1,000 queries per day using full context at $3 per million input tokens, that is $1,500 per day just for input tokens. The same queries with RAG retrieving 5,000 tokens each cost $15 per day. At low query volumes the difference is negligible. At scale it is the dominant cost.
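As a sanity check, the arithmetic is simple enough to script (the $3-per-million-token figure is the example rate above; substitute your model's actual input price):

def daily_input_cost(tokens_per_query, queries_per_day, usd_per_token=3 / 1_000_000):
    # Input tokens only; output tokens are priced separately
    return tokens_per_query * queries_per_day * usd_per_token

print(daily_input_cost(500_000, 1_000))  # 1500.0 -> full context
print(daily_input_cost(5_000, 1_000))    # 15.0   -> focused RAG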
Create tiers based on knowledge base size and query type:
Tier 1 (full context): small, high-value collections where completeness matters more than cost, such as product documentation, policy documents, and configuration references.
Tier 2 (RAG plus expanded context): medium collections where you retrieve broadly and provide more context than typical RAG, such as codebase documentation and technical specifications.
Tier 3 (focused RAG): large collections where cost matters, such as historical logs, ticket archives, and large document stores.
class TieredRetriever:
    def __init__(self, full_context_docs, rag_index):
        self.full_context = "\n\n".join(
            doc.text for doc in full_context_docs)
        self.rag_index = rag_index

    def retrieve(self, query, query_type="auto"):
        if query_type in ("policy", "config"):
            # Tier 1: full context
            return self.full_context, "full_context"
        if query_type == "technical":
            # Tier 2: broad RAG with expanded context
            chunks = self.rag_index.search(query, top_k=20)
            context = "\n\n".join(c.text for c in chunks)
            return context, "expanded_rag"
        # Tier 3 (default, including unclassified "auto" queries): focused RAG
        chunks = self.rag_index.search(query, top_k=5)
        context = "\n\n".join(c.text for c in chunks)
        return context, "focused_rag"

The most powerful combination uses RAG as a pre-filtering step before long context reasoning. Instead of retrieving 5 chunks and hoping they contain the answer, retrieve 20 to 50 chunks (broad recall) and pass all of them to the LLM in a long context window. The LLM can then reason across the entire retrieved set, finding connections that narrow retrieval would miss, while the pre-filtering step keeps costs manageable by excluding the 99% of the knowledge base that is definitely irrelevant.
from anthropic import Anthropic

client = Anthropic()

def rag_plus_long_context(query, index, k=30):
    # Broad retrieval: high recall, accept some noise
    chunks = index.search(query, top_k=k)

    # Build a long context from all retrieved chunks
    context = "\n\n---\n\n".join(
        f"[Source {i+1}] {c.text}" for i, c in enumerate(chunks))

    # Let the LLM reason over the full retrieved set
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""I have retrieved {len(chunks)} potentially
relevant documents for this question. Read all of them carefully
and synthesize a complete answer. Cite sources by number.

Question: {query}

Retrieved documents:
{context}"""
        }]
    )
    return response.content[0].text

This pattern is particularly effective for queries that require synthesizing information across multiple documents. Single-pass RAG with top-5 retrieval might miss three of the five relevant documents. Broad retrieval with top-30 is far more likely to catch all of them, and the long context window gives the LLM enough room to find and combine information across the full set.
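If you want to check that claim against your own data rather than take it on faith, measure recall@k on a handful of queries with hand-labeled relevant documents. A minimal sketch (the .id attribute on chunks and the labeled relevant_ids set are assumptions about your setup):

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of labeled relevant documents that appear in the top k results
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

# Retrieve once with a generous k, then evaluate both cutoffs
retrieved_ids = [c.id for c in index.search(query, top_k=30)]
print(recall_at_k(retrieved_ids, relevant_ids, k=5))
print(recall_at_k(retrieved_ids, relevant_ids, k=30))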
If your Tier 1 full-context documents do not change frequently, use prompt caching to avoid reprocessing the same context on every query. Anthropic's prompt caching stores the processed context and serves it from cache for subsequent requests that share the same prefix. This reduces both cost and latency: the first request processes the full context, but subsequent requests within the cache TTL (currently 5 minutes) pay only the cache read cost.
from anthropic import Anthropic

client = Anthropic()

# The static context goes in a system message with cache control
def query_with_caching(query, static_context):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        system=[{
            "type": "text",
            "text": f"Reference documentation:\n\n{static_context}",
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text

Track three metrics for each tier: accuracy (percentage of queries answered correctly), cost per query (total API cost including retrieval and generation), and latency (time from query to response). Adjust tier boundaries based on real data. If Tier 3 focused RAG has low accuracy on a category of queries, move that category to Tier 2 expanded RAG. If Tier 1 full context is processing queries that Tier 3 handles just as well, move those queries down to reduce cost.
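One way to gather those numbers is to log a record per query and aggregate by tier. A minimal sketch, assuming you can label correctness after the fact (from spot checks or user feedback):

from collections import defaultdict

tier_stats = defaultdict(lambda: {"queries": 0, "correct": 0,
                                  "cost_usd": 0.0, "latency_s": 0.0})

def log_query(tier, correct, cost_usd, latency_s):
    # One record per query; `correct` comes from spot checks or user feedback
    stats = tier_stats[tier]
    stats["queries"] += 1
    stats["correct"] += int(correct)
    stats["cost_usd"] += cost_usd
    stats["latency_s"] += latency_s

def tier_report():
    # Per-tier accuracy, average cost, and average latency
    return {
        tier: {
            "accuracy": s["correct"] / s["queries"],
            "avg_cost_usd": s["cost_usd"] / s["queries"],
            "avg_latency_s": s["latency_s"] / s["queries"],
        }
        for tier, s in tier_stats.items()
    }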
When to Use Each Approach
Full context only: Knowledge base under 50,000 tokens, low query volume (under 100 per day), queries require comprehensive understanding of the full corpus. Example: answering questions about a single product's documentation.
RAG only: Knowledge base over 1 million tokens, high query volume, most queries target specific information. Example: searching historical support tickets.
RAG plus long context: Knowledge base of any size, queries require synthesis across multiple documents, accuracy matters more than cost. Example: technical analysis requiring information from specifications, code, and architecture documents.
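These heuristics translate directly into a small routing function. A sketch using the thresholds above (treat the cutoffs as starting points to tune against real traffic, not hard limits):

def choose_strategy(kb_tokens, queries_per_day, needs_synthesis):
    # Synthesis queries justify RAG plus long context at any corpus size
    if needs_synthesis:
        return "rag_plus_long_context"
    # Small corpus and low volume: load everything into context
    if kb_tokens < 50_000 and queries_per_day < 100:
        return "full_context"
    # Large corpus with targeted queries: classic RAG
    return "rag_only"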
Adaptive Recall operates as a memory retrieval system that naturally combines these approaches. The recall tool retrieves relevant memories using cognitive scoring and graph traversal (the RAG component), and returns them in a structured format that fits naturally into the LLM's context window (the long context component). The cognitive scoring ensures that the most relevant, recent, and well-corroborated memories rank highest, so the context the LLM receives is both comprehensive and focused.
Get smart retrieval that scales. Adaptive Recall retrieves the right memories with cognitive scoring, so your context stays focused and your costs stay low.