How to Upgrade from Naive to Production RAG
What Makes Naive RAG Naive
Naive RAG has five characteristics that cause production accuracy problems. First, single retrieval path: only vector similarity, no keyword matching. Second, no reranking: the initial similarity scores determine the final ranking. Third, no metadata awareness: all chunks are treated equally regardless of source, date, or type. Fourth, fixed-size chunking: documents are split at arbitrary token boundaries. Fifth, static index: once indexed, content is never re-evaluated for freshness or accuracy. Each of these is a fixable problem, and fixing them in the right order maximizes impact per engineering hour.
Step-by-Step Upgrade
Hybrid search combines vector similarity (good for semantic meaning) with BM25 keyword matching (good for exact terms, names, identifiers). Many production queries include specific terms that need exact matching: product names, error codes, API endpoints, configuration keys. Vector search is weak at exact matching because embeddings represent semantic meaning, not exact text. BM25 finds these reliably. Combine the two using reciprocal rank fusion, which merges ranked lists from different retrieval methods into a single ranking.
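To see how the fusion behaves, a quick illustrative calculation (numbers invented for this example): with the conventional k = 60, a chunk ranked 1st by BM25 and 4th by vector search scores 1/61 + 1/64 ≈ 0.0320, while a chunk ranked 2nd by both scores 1/62 + 1/62 ≈ 0.0323. Consistent agreement across both retrievers slightly outranks a single first-place finish, which is exactly the behavior you want from a fusion step.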
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    scores = {}
    for ranked_list in ranked_lists:
        for rank, item in enumerate(ranked_list):
            if item.id not in scores:
                scores[item.id] = {"item": item, "score": 0}
            scores[item.id]["score"] += 1.0 / (k + rank + 1)
    merged = sorted(scores.values(),
                    key=lambda x: x["score"], reverse=True)
    return [s["item"] for s in merged]

def hybrid_search(query, vector_index, bm25_index, top_k=20):
    vector_results = vector_index.search(query, top_k=top_k)
    bm25_results = bm25_index.search(query, top_k=top_k)
    return reciprocal_rank_fusion([vector_results, bm25_results])

Hybrid search typically improves recall by 10 to 15% compared to vector-only search. The improvement is largest on queries containing specific terms, identifiers, or proper nouns that embed poorly but match exactly with BM25.
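As a sanity check, here is a small self-contained usage sketch. The Doc class, document IDs, and ranked lists are invented for illustration; any objects with an id attribute work with the fusion function above.

from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    text: str

# Hypothetical ranked lists as each retriever might return them:
# BM25 surfaces the exact error code, vector search surfaces
# semantically related troubleshooting content.
bm25_ranked = [Doc("kb-17", "ERR_CONN_RESET reference"),
               Doc("kb-03", "Network timeout guide")]
vector_ranked = [Doc("kb-03", "Network timeout guide"),
                 Doc("kb-42", "Connection pooling overview"),
                 Doc("kb-17", "ERR_CONN_RESET reference")]

merged = reciprocal_rank_fusion([bm25_ranked, vector_ranked])
print([d.id for d in merged])
# ['kb-03', 'kb-17', 'kb-42']: kb-03 ranks near the top of both
# lists, so fusion places it first.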
Initial retrieval (whether vector, BM25, or hybrid) optimizes for recall: finding candidates that might be relevant. Reranking optimizes for precision: scoring each candidate by how well it actually answers the specific question. A cross-encoder reranker processes the query and each candidate as a pair, attending to both simultaneously. This captures fine-grained relevance that independent embeddings miss. Add the reranker after hybrid search, scoring the top-20 to top-50 candidates and keeping the top-5 for the LLM context.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs)
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [item for item, score in scored[:top_k]]

Cross-encoder reranking typically improves precision by 15 to 25% on top of hybrid search. The latency cost is moderate: reranking 50 candidates takes 50 to 200 milliseconds on a CPU, or 10 to 30 milliseconds on a GPU.
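Wired together, the two stages form the standard recall-then-precision pipeline. A minimal sketch, reusing the hybrid_search and rerank functions above (the index arguments are the same hypothetical wrappers as before):

def retrieve(query, vector_index, bm25_index):
    # Stage 1: recall-oriented candidate generation (wide net)
    candidates = hybrid_search(query, vector_index, bm25_index, top_k=20)
    # Stage 2: precision-oriented cross-encoder scoring; keep
    # only the top 5 for the LLM context
    return rerank(query, candidates, top_k=5)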
Add metadata to every chunk during indexing: document source, creation date, last modified date, document type, section hierarchy, and any domain-specific tags. At query time, apply metadata filters before similarity scoring to exclude irrelevant content. This reduces the search space so similarity scoring operates on a more focused set of candidates.
def filtered_search(query, index, filters=None):
    # Apply metadata filters to constrain search space
    filter_params = {}
    if filters:
        if filters.get("min_date"):
            filter_params["date"] = {"$gte": filters["min_date"]}
        if filters.get("doc_type"):
            filter_params["type"] = {"$in": filters["doc_type"]}
        if filters.get("source"):
            filter_params["source"] = filters["source"]
    return index.search(query, top_k=50, filter=filter_params)

Metadata filtering is particularly important for preventing staleness failures. A filter that excludes content older than your freshness threshold ensures the LLM never reasons over outdated information. It also helps with multi-tenant applications where access permissions must be enforced at the retrieval layer.
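For instance, a multi-tenant deployment might scope every request to the caller's content plus a freshness window. A sketch using the filtered_search function above, assuming the same index object; the tenant source value and the 180-day cutoff are invented for illustration, and the exact $gte/$in filter syntax depends on your vector store:

from datetime import datetime, timedelta, timezone

# Hypothetical tenant-scoped query: "source" enforces access control,
# "min_date" enforces a 180-day freshness window.
cutoff = datetime.now(timezone.utc) - timedelta(days=180)
results = filtered_search(
    "how do I rotate API keys?",
    index,
    filters={"source": "tenant-acme-docs", "min_date": cutoff},
)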
Replace fixed-size chunking with semantic chunking that respects document structure. Split at paragraph boundaries, section headers, and logical topic transitions rather than at arbitrary token counts. Keep related information together: a definition and its example should be in the same chunk, a step-by-step process should not be split across chunks. Add parent-child relationships so that when a specific chunk is retrieved, the broader section can also be included for context.
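The parent-child lookup can be a thin wrapper at query time. A minimal sketch, assuming each indexed chunk carries a hypothetical parent_id in its metadata and a chunk_store keyed by ID (neither is from a specific library); the paragraph-boundary splitter itself follows.

def expand_to_parents(hits, chunk_store):
    """Swap retrieved child chunks for their broader parent sections."""
    seen_parents = set()
    expanded = []
    for hit in hits:
        parent_id = hit.metadata.get("parent_id")
        if parent_id is None:
            expanded.append(hit)  # No parent recorded: keep the chunk
        elif parent_id not in seen_parents:
            seen_parents.add(parent_id)
            expanded.append(chunk_store.get(parent_id))  # Parent section
        # Otherwise the parent is already included; drop the duplicate child.
    return expanded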
import re

def semantic_chunk(text, max_tokens=500):
    # Split on double newlines (paragraph boundaries)
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current_chunk = ""
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # Rough token estimate
        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = para
            current_tokens = para_tokens
        else:
            current_chunk += "\n\n" + para
            current_tokens += para_tokens
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

Track when each chunk was last confirmed as current. Apply a time-decay factor to similarity scores so that recently verified content ranks higher than older content with the same similarity. Set up periodic re-indexing to detect and update stale content. Flag chunks from deprecated or superseded documents so they are excluded from retrieval.
import math
from datetime import datetime, timezone

def apply_freshness_decay(score, last_verified, half_life_days=90):
    """Decay retrieval score based on time since last verification."""
    now = datetime.now(timezone.utc)
    age_days = (now - last_verified).days
    decay = math.pow(0.5, age_days / half_life_days)
    return score * decay

Capture signals about retrieval quality: did the user accept the answer, ask a follow-up (suggesting the answer was incomplete), rephrase the question (suggesting the answer was wrong), or explicitly rate it? Use these signals to adjust chunk scores over time. Chunks that consistently contribute to accepted answers get a retrieval boost. Chunks that consistently appear in rejected answers get demoted. This is the simplest form of learning from production traffic.
def update_chunk_scores(feedback_log, chunk_index):
    for entry in feedback_log:
        for chunk_id in entry["retrieved_chunk_ids"]:
            current = chunk_index.get_metadata(chunk_id)
            if entry["feedback"] == "positive":
                current["quality_score"] = min(
                    1.0, current.get("quality_score", 0.5) + 0.05)
            elif entry["feedback"] == "negative":
                current["quality_score"] = max(
                    0.0, current.get("quality_score", 0.5) - 0.1)
            chunk_index.update_metadata(chunk_id, current)

The Upgrade Order Matters
The steps above are ordered by impact-per-effort. Hybrid search is typically the highest-leverage improvement because it fixes an entire category of failures (keyword-based queries) with minimal complexity. Reranking is second because it improves precision across all query types. Metadata filtering and chunking improvements come next because they require more engineering work but address specific failure patterns. Freshness management and feedback loops are last because they require ongoing operational attention.
Alternatively, you can skip the incremental upgrade path and use a system that includes all of these capabilities by design. Adaptive Recall provides hybrid retrieval (cognitive scoring that combines multiple relevance signals), reranking (base-level activation and confidence weighting), entity-aware retrieval (knowledge graph traversal), freshness management (memory lifecycle with decay), and learning (reinforcement from usage patterns). Each of these runs automatically as part of the memory recall operation.
Skip the upgrade path. Adaptive Recall is production-grade retrieval from the first query, with cognitive scoring, graph traversal, and memory lifecycle built in.
Get Started Free