
How to Evaluate Vector Search Quality with Recall

Evaluating vector search means measuring whether the system returns the right documents for a given query. Recall at k, the fraction of relevant documents that appear in the top k results, is the most important metric because it directly measures whether the information needed to answer a question is being retrieved. This guide shows how to build an evaluation dataset, compute retrieval metrics, and use the results to improve your search pipeline.

Why Evaluation Matters

Without evaluation, you are guessing whether your vector search works. A system can feel like it works when you test a few queries manually, but systematic evaluation often reveals that 20 to 30% of queries return incomplete or irrelevant results. These failures are invisible until a user complains or an LLM generates a wrong answer because it received poor context. Regular evaluation catches degradation early and gives you a baseline to measure improvements against.

Evaluation is especially important when you change any component of the search pipeline: the embedding model, chunk size, index parameters, or fusion weights. Each change affects retrieval quality in ways that are not obvious from looking at a few example queries. A formal evaluation with dozens or hundreds of queries reveals whether the change improved, degraded, or had no effect on overall retrieval quality.

Step-by-Step Evaluation Process

Step 1: Build an evaluation dataset.
An evaluation dataset is a set of queries paired with their known relevant documents. Start with 50 to 100 queries that represent your actual query distribution. For each query, identify the documents (or chunks) that contain the correct answer. This is manual work, but it is the foundation of reliable evaluation. Bias toward queries that your users actually ask (from query logs if available) rather than queries you think they should ask.
import json

# Evaluation dataset format
eval_dataset = [
    {
        "query": "how to configure database connection pooling",
        "relevant_doc_ids": ["doc_142", "doc_143", "doc_891"],
        "category": "configuration"
    },
    {
        "query": "what happens when the authentication token expires",
        "relevant_doc_ids": ["doc_055", "doc_056"],
        "category": "troubleshooting"
    },
    {
        "query": "ERR_CONNECTION_REFUSED on port 5432",
        "relevant_doc_ids": ["doc_201"],
        "category": "error_code"
    }
    # ... 50-100 queries total
]

# Save for reuse
with open("eval_dataset.json", "w") as f:
    json.dump(eval_dataset, f, indent=2)
LLM-assisted dataset creation: If manual annotation is too slow, use an LLM to generate candidate query-document pairs from your corpus. Show the LLM a document and ask it to generate 3 to 5 questions that the document answers. Then manually review the generated pairs for accuracy. This is 3 to 5 times faster than creating pairs entirely from scratch.
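A minimal sketch of that workflow, assuming a generic call_llm(prompt) helper that wraps whichever LLM client you use (the helper, the prompt wording, and generate_candidate_queries are illustrative, not a specific API):

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client of choice; returns the model's text response."""
    raise NotImplementedError

def generate_candidate_queries(doc_id: str, doc_text: str, n: int = 4) -> list:
    """Ask the LLM for questions the document answers, paired with the doc ID."""
    prompt = (
        f"Here is a document:\n\n{doc_text}\n\n"
        f"Write {n} questions that this document directly answers, one per line."
    )
    response = call_llm(prompt)
    questions = [line.strip() for line in response.splitlines() if line.strip()]
    # Every generated pair still needs manual review before it enters the eval dataset.
    return [{"query": q, "relevant_doc_ids": [doc_id]} for q in questions[:n]]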
Step 2: Define your metrics.
Choose metrics based on what matters for your application. Recall@k measures the fraction of relevant documents found in the top k results. Precision@k measures the fraction of returned results that are actually relevant. NDCG@k (Normalized Discounted Cumulative Gain) measures both relevance and rank position, penalizing relevant documents that appear lower in the results. MRR (Mean Reciprocal Rank) measures how high the first relevant result appears.
import numpy as np
from typing import List, Set

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant docs found in top k results."""
    top_k = set(retrieved_ids[:k])
    if not relevant_ids:
        return 0.0
    return len(top_k & relevant_ids) / len(relevant_ids)

def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of top k results that are relevant."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / k

def mrr(retrieved_ids: List[str], relevant_ids: Set[str]) -> float:
    """Reciprocal rank of the first relevant result."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Normalized Discounted Cumulative Gain at k (binary relevance)."""
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        rel = 1.0 if doc_id in relevant_ids else 0.0
        dcg += rel / np.log2(i + 2)
    ideal_count = min(len(relevant_ids), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_count))
    return dcg / idcg if idcg > 0 else 0.0
Step 3: Run the evaluation.
Execute each query against your search system, collect the ranked results, and compute metrics. Report per-query metrics and aggregated averages. The aggregated averages tell you overall quality, while per-query results help you identify specific failure cases.
def run_evaluation(eval_dataset: list, search_fn, k: int = 10):
    results = []
    for item in eval_dataset:
        query = item["query"]
        relevant = set(item["relevant_doc_ids"])
        retrieved = search_fn(query, top_k=k)
        retrieved_ids = [r["id"] for r in retrieved]
        metrics = {
            "query": query,
            "category": item.get("category", "unknown"),
            "recall@k": recall_at_k(retrieved_ids, relevant, k),
            "precision@k": precision_at_k(retrieved_ids, relevant, k),
            "mrr": mrr(retrieved_ids, relevant),
            "ndcg@k": ndcg_at_k(retrieved_ids, relevant, k),
            "relevant_count": len(relevant),
            "found_count": len(set(retrieved_ids[:k]) & relevant)
        }
        results.append(metrics)

    # Aggregate
    avg_recall = np.mean([r["recall@k"] for r in results])
    avg_precision = np.mean([r["precision@k"] for r in results])
    avg_mrr = np.mean([r["mrr"] for r in results])
    avg_ndcg = np.mean([r["ndcg@k"] for r in results])
    print(f"Mean Recall@{k}: {avg_recall:.3f}")
    print(f"Mean Precision@{k}: {avg_precision:.3f}")
    print(f"Mean MRR: {avg_mrr:.3f}")
    print(f"Mean NDCG@{k}: {avg_ndcg:.3f}")
    return results
Step 4: Analyze failure patterns.
Sort results by recall@k and examine the queries with the lowest scores. Categorize the failures: is the relevant document missing from the corpus? Is it chunked poorly so the embedding does not capture the relevant content? Is the query using exact terms that the embedding model does not handle (error codes, version numbers)? Is the query conceptually different from the document's vocabulary (the user says "broken" but the document says "failure")? Each category points to a different fix.
from collections import Counter

def analyze_failures(results: list, threshold: float = 0.5):
    failures = [r for r in results if r["recall@k"] < threshold]
    failures.sort(key=lambda x: x["recall@k"])
    print(f"\n{len(failures)} queries below {threshold} recall:")
    for f in failures[:20]:
        print(f"  [{f['category']}] recall={f['recall@k']:.2f}: {f['query']}")

    # Group by category
    cats = Counter(f["category"] for f in failures)
    print("\nFailure distribution by category:")
    for cat, count in cats.most_common():
        print(f"  {cat}: {count} failures")
Step 5: Iterate and track over time.
After making changes (new embedding model, different chunk size, added hybrid search), re-run the evaluation and compare. Save results with timestamps so you can track trends. Set up automated evaluation as part of your CI pipeline if you change the search system frequently.
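One way to track runs over time, as a minimal sketch (the file name, history format, and save_run helper are illustrative choices, not a required layout):

import json
import datetime
import numpy as np

def save_run(results: list, history_path: str = "eval_history.jsonl") -> dict:
    """Append this run's aggregate metrics, with a timestamp, to a JSONL history file."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "mean_recall@k": float(np.mean([r["recall@k"] for r in results])),
        "mean_mrr": float(np.mean([r["mrr"] for r in results])),
        "num_queries": len(results),
    }
    with open(history_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

Appending one line per run keeps the history easy to diff and plot, and makes it trivial to compare any run against the previous baseline.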

Evaluation Benchmarks to Aim For

For a well-tuned vector search system on domain-specific content, target Recall@10 above 0.80. Adding hybrid search should push this above 0.88. Adding reranking typically adds 2 to 5 percentage points. If your Recall@10 is below 0.70, the most likely cause is poor chunking or a mismatched embedding model, not a database or index issue.
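If you wire the evaluation into CI, a simple regression gate might look like the following sketch (the 0.80 floor mirrors the target above; the check_recall_floor name and exit-code convention are illustrative choices):

import sys
import numpy as np

def check_recall_floor(results: list, floor: float = 0.80) -> None:
    """Fail the CI job if mean Recall@k drops below the agreed floor."""
    mean_recall = float(np.mean([r["recall@k"] for r in results]))
    if mean_recall < floor:
        print(f"FAIL: mean Recall@k {mean_recall:.3f} is below the floor of {floor:.2f}")
        sys.exit(1)
    print(f"OK: mean Recall@k {mean_recall:.3f} meets the floor of {floor:.2f}")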

Adaptive Recall continuously evaluates retrieval quality through its cognitive scoring system, automatically adjusting ranking based on which memories prove useful over time.
