How to Evaluate Vector Search Quality with Recall
Why Evaluation Matters
Without evaluation, you are guessing whether your vector search works. A system can feel like it works when you test a few queries manually, but systematic evaluation often reveals that 20 to 30% of queries return incomplete or irrelevant results. These failures are invisible until a user complains or an LLM generates a wrong answer because it received poor context. Regular evaluation catches degradation early and gives you a baseline to measure improvements against.
Evaluation is especially important when you change any component of the search pipeline: the embedding model, chunk size, index parameters, or fusion weights. Each change affects retrieval quality in ways that are not obvious from looking at a few example queries. A formal evaluation with dozens or hundreds of queries reveals whether the change improved, degraded, or had no effect on overall retrieval quality.
Step-by-Step Evaluation Process
An evaluation dataset is a set of queries paired with their known relevant documents. Start with 50 to 100 queries that represent your actual query distribution. For each query, identify the documents (or chunks) that contain the correct answer. This is manual work, but it is the foundation of reliable evaluation. Bias toward queries that your users actually ask (from query logs if available) rather than queries you think they should ask.
import json

# Evaluation dataset format
eval_dataset = [
    {
        "query": "how to configure database connection pooling",
        "relevant_doc_ids": ["doc_142", "doc_143", "doc_891"],
        "category": "configuration"
    },
    {
        "query": "what happens when the authentication token expires",
        "relevant_doc_ids": ["doc_055", "doc_056"],
        "category": "troubleshooting"
    },
    {
        "query": "ERR_CONNECTION_REFUSED on port 5432",
        "relevant_doc_ids": ["doc_201"],
        "category": "error_code"
    }
    # ... 50-100 queries total
]

# Save for reuse
with open("eval_dataset.json", "w") as f:
    json.dump(eval_dataset, f, indent=2)

Choose metrics based on what matters for your application. Recall@k measures the fraction of relevant documents found in the top k results. Precision@k measures the fraction of returned results that are actually relevant. NDCG@k (Normalized Discounted Cumulative Gain) measures both relevance and rank position, penalizing relevant documents that appear lower in the results. MRR (Mean Reciprocal Rank) measures how high the first relevant result appears.
import numpy as np
from typing import List, Set

def recall_at_k(retrieved_ids: List[str],
                relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant docs found in top k results."""
    top_k = set(retrieved_ids[:k])
    if not relevant_ids:
        return 0.0
    return len(top_k & relevant_ids) / len(relevant_ids)

def precision_at_k(retrieved_ids: List[str],
                   relevant_ids: Set[str], k: int) -> float:
    """Fraction of top k results that are relevant."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / k

def mrr(retrieved_ids: List[str],
        relevant_ids: Set[str]) -> float:
    """Reciprocal rank of first relevant result."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids: List[str],
              relevant_ids: Set[str], k: int) -> float:
    """Normalized Discounted Cumulative Gain at k."""
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        rel = 1.0 if doc_id in relevant_ids else 0.0
        dcg += rel / np.log2(i + 2)
    ideal_count = min(len(relevant_ids), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_count))
    return dcg / idcg if idcg > 0 else 0.0

Execute each query against your search system, collect the ranked results, and compute metrics. Report per-query metrics and aggregated averages. The aggregated averages tell you overall quality, while per-query results help you identify specific failure cases.
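Before wiring these metrics into a full run, it helps to verify the definitions by hand. A quick inline sanity check on a toy ranking (the single-letter doc IDs are hypothetical; the formulas mirror the definitions above):

```python
import math

# Toy ranking: three relevant docs, only one appears in the top 3 retrieved.
retrieved = ["a", "b", "c", "d"]
relevant = {"a", "d", "x"}
k = 3

top_k = set(retrieved[:k])
recall = len(top_k & relevant) / len(relevant)   # 1 of 3 relevant found -> 0.333
precision = len(top_k & relevant) / k            # 1 of 3 results relevant -> 0.333

# Reciprocal rank: first relevant doc ("a") sits at rank 1 -> 1.0
rr = next((1.0 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)

# Binary-relevance NDCG: DCG over the top k, normalized by the ideal ordering
dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg = dcg / idcg
```

Note how MRR is high (1.0) while recall is low (0.333): the system found one relevant document quickly but missed the other two, which is exactly the kind of disagreement between metrics that per-query inspection surfaces.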
def run_evaluation(eval_dataset: list, search_fn, k: int = 10):
    results = []
    for item in eval_dataset:
        query = item["query"]
        relevant = set(item["relevant_doc_ids"])
        retrieved = search_fn(query, top_k=k)
        retrieved_ids = [r["id"] for r in retrieved]
        metrics = {
            "query": query,
            "category": item.get("category", "unknown"),
            "recall@k": recall_at_k(retrieved_ids, relevant, k),
            "precision@k": precision_at_k(retrieved_ids, relevant, k),
            "mrr": mrr(retrieved_ids, relevant),
            "ndcg@k": ndcg_at_k(retrieved_ids, relevant, k),
            "relevant_count": len(relevant),
            "found_count": len(set(retrieved_ids[:k]) & relevant)
        }
        results.append(metrics)

    # Aggregate
    avg_recall = np.mean([r["recall@k"] for r in results])
    avg_precision = np.mean([r["precision@k"] for r in results])
    avg_mrr = np.mean([r["mrr"] for r in results])
    avg_ndcg = np.mean([r["ndcg@k"] for r in results])
    print(f"Mean Recall@{k}: {avg_recall:.3f}")
    print(f"Mean Precision@{k}: {avg_precision:.3f}")
    print(f"Mean MRR: {avg_mrr:.3f}")
    print(f"Mean NDCG@{k}: {avg_ndcg:.3f}")
    return results

Sort results by recall@k and examine the queries with the lowest scores. Categorize the failures: is the relevant document missing from the corpus? Is it chunked poorly so the embedding does not capture the relevant content? Is the query using exact terms that the embedding model does not handle (error codes, version numbers)? Is the query conceptually different from the document's vocabulary (the user says "broken" but the document says "failure")? Each category points to a different fix.
from collections import Counter

def analyze_failures(results: list, threshold: float = 0.5):
    failures = [r for r in results if r["recall@k"] < threshold]
    failures.sort(key=lambda x: x["recall@k"])
    print(f"\n{len(failures)} queries below {threshold} recall:")
    for f in failures[:20]:
        print(f"  [{f['category']}] recall={f['recall@k']:.2f}: {f['query']}")

    # Group by category
    cats = Counter(f["category"] for f in failures)
    print("\nFailure distribution by category:")
    for cat, count in cats.most_common():
        print(f"  {cat}: {count} failures")

After making changes (new embedding model, different chunk size, added hybrid search), re-run the evaluation and compare. Save results with timestamps so you can track trends. Set up automated evaluation as part of your CI pipeline if you change the search system frequently.
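One way to track trends is to append each run's aggregate metrics, with a timestamp, to a history file and diff consecutive runs. A minimal sketch, assuming the per-query result dicts use the `recall@k` and `mrr` keys from the evaluation harness above (the file name, record schema, and function names are illustrative):

```python
import json
import time
from pathlib import Path

def save_run(results: list, k: int, path: str = "eval_history.jsonl"):
    """Append this run's aggregate metrics to a timestamped history file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "k": k,
        "mean_recall": sum(r["recall@k"] for r in results) / len(results),
        "mean_mrr": sum(r["mrr"] for r in results) / len(results),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

def compare_last_two(path: str = "eval_history.jsonl"):
    """Report the recall change between the two most recent runs."""
    lines = Path(path).read_text().splitlines()
    if len(lines) < 2:
        return None  # not enough history to compare
    prev, curr = json.loads(lines[-2]), json.loads(lines[-1])
    delta = curr["mean_recall"] - prev["mean_recall"]
    print(f"Recall@{curr['k']}: {prev['mean_recall']:.3f} -> "
          f"{curr['mean_recall']:.3f} ({delta:+.3f})")
    return delta
```

JSONL works well here because each run appends one line, so the history stays diff-friendly and trivially parseable without loading the whole file into a structured store.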
Evaluation Benchmarks to Aim For
For a well-tuned vector search system on domain-specific content, target Recall@10 above 0.80. Adding hybrid search should push this above 0.88. Adding reranking typically adds 2 to 5 percentage points. If your Recall@10 is below 0.70, the most likely cause is poor chunking or a mismatched embedding model, not a database or index issue.
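In a CI pipeline, targets like these can be enforced as a hard gate rather than a manual check. A minimal sketch, assuming the per-query result dicts produced by the harness above (the function name and default thresholds are illustrative, not a standard API):

```python
def check_quality_gate(results: list, min_recall: float = 0.80,
                       max_failures: int = 5, threshold: float = 0.5):
    """Return False if aggregate recall or per-query failures regress past limits."""
    mean_recall = sum(r["recall@k"] for r in results) / len(results)
    n_failures = sum(1 for r in results if r["recall@k"] < threshold)
    ok = mean_recall >= min_recall and n_failures <= max_failures
    print(f"mean recall {mean_recall:.3f}, {n_failures} low-recall queries "
          f"-> {'PASS' if ok else 'FAIL'}")
    return ok
```

Gating on both the mean and the count of low-recall queries catches two distinct regressions: an overall quality drop, and a change that helps most queries while silently breaking a handful.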
Adaptive Recall continuously evaluates retrieval quality through its cognitive scoring system, automatically adjusting ranking based on which memories prove useful over time.