How to Benchmark Your Memory System Performance

Benchmarking a memory system means measuring two things that are equally important but often tested separately: how fast the system returns results (latency and throughput) and how good those results are (retrieval relevance and ranking quality). A memory system that returns irrelevant results in 10ms is just as broken as one that returns perfect results in 10 seconds. This guide shows you how to build benchmarks that measure both dimensions together.

Before You Start

You need a memory system with at least basic store and retrieve functionality, and you need a clear definition of what "good results" means for your application. If you cannot articulate what a correct retrieval result looks like for a given query, you cannot measure retrieval quality. You also need a test environment that mirrors your production configuration, including the same storage backend, the same embedding model, and the same retrieval pipeline. A benchmark run against a configuration that differs from production produces misleading results.

Step-by-Step Benchmarking

Step 1: Build a representative dataset.
Your benchmark dataset must match your production data in three dimensions: size (same order of magnitude as your production memory count), diversity (same distribution of memory types, topics, and entities), and age distribution (memories spanning the same time range, with realistic access patterns). If you have production data, the easiest approach is to sample and anonymize it. If you are pre-production, generate synthetic memories that model your expected data distribution and statistics. Include at least 10x the number of memories you expect to retrieve for any single query, so the system has enough noise to make retrieval meaningfully challenging. A benchmark with 100 memories where every query matches 10 is not testing retrieval quality; it is testing serialization speed. A benchmark with 10,000 memories where the target results are scattered among irrelevant content tests the system's ability to find needles in haystacks, which is what production retrieval actually requires.
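As a minimal sketch of the synthetic approach, the following Python generates memories whose topic mix and age distribution follow assumed production statistics. The Memory record, TOPIC_WEIGHTS, ENTITIES, and the 180-day age range are illustrative placeholders, not part of any particular system:

    import random
    import time
    from dataclasses import dataclass

    @dataclass
    class Memory:
        text: str
        topic: str
        entity: str
        created_at: float  # unix timestamp

    # Assumed production statistics: topic frequencies and a 180-day age range.
    TOPIC_WEIGHTS = {"billing": 0.40, "onboarding": 0.35, "support": 0.25}
    ENTITIES = ["user_123", "org_acme", "project_alpha"]
    AGE_RANGE_SECONDS = 180 * 24 * 3600

    def synthetic_dataset(n: int) -> list[Memory]:
        topics = random.choices(list(TOPIC_WEIGHTS), weights=TOPIC_WEIGHTS.values(), k=n)
        now = time.time()
        return [
            Memory(
                text=f"synthetic memory {i} about {topic}",
                topic=topic,
                entity=random.choice(ENTITIES),
                created_at=now - random.uniform(0, AGE_RANGE_SECONDS),
            )
            for i, topic in enumerate(topics)
        ]

    # At least 10x the expected per-query result count, so retrieval has real noise.
    dataset = synthetic_dataset(10_000)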
Step 2: Define your metrics.
Choose metrics that map to your application's requirements. For latency, measure p50 (median), p95 (tail), and p99 (extreme tail) retrieval times. The p50 tells you typical performance. The p95 tells you what 1 in 20 users experiences. The p99 tells you whether you have latency outliers that will cause timeouts or a degraded experience. For retrieval quality, measure precision at k (what fraction of the top-k results are relevant), recall at k (what fraction of all relevant memories appear in the top-k results), and normalized discounted cumulative gain (nDCG, which measures whether the most relevant results appear at the top of the list rather than scattered through it). For throughput, measure queries per second at your target latency budget, and writes per second at your target ingestion rate. For each metric, define an acceptable threshold based on your application requirements. For example: p95 retrieval latency under 300ms, precision at 5 above 0.7, recall at 10 above 0.8, minimum 50 queries per second sustained.
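One way to make these thresholds enforceable is to encode them as data that the benchmark checks after every run. This sketch uses the example thresholds from this step; the metric names and bounds are assumptions to adapt to your own requirements:

    # Acceptance thresholds, encoded as data. "max" metrics must stay under the
    # bound; "min" metrics must stay above it.
    THRESHOLDS = {
        "latency_p95_ms":  ("max", 300.0),
        "precision_at_5":  ("min", 0.7),
        "recall_at_10":    ("min", 0.8),
        "queries_per_sec": ("min", 50.0),
    }

    def check_thresholds(results: dict[str, float]) -> list[str]:
        """Return human-readable descriptions of every threshold violation."""
        failures = []
        for metric, (kind, bound) in THRESHOLDS.items():
            if metric not in results:
                continue
            value = results[metric]
            if (kind == "max" and value > bound) or (kind == "min" and value < bound):
                failures.append(f"{metric}={value:.3f} violates {kind} bound {bound}")
        return failures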
Step 3: Build a query test suite.
Create a set of test queries with known expected results. For each query, specify: the query text, the set of memories that should appear in the results (the "ground truth" relevant set), and the ideal ranking order (which results are most relevant). Aim for at least 50 test queries covering your main retrieval patterns: semantic similarity queries ("find memories about topic X"), entity queries ("find memories connected to entity Y"), temporal queries ("find recent memories about Z"), and composite queries that combine multiple patterns. For each query, manually identify the relevant memories in your test dataset. This is the most labor-intensive part of benchmarking, but without ground truth labels, you cannot measure retrieval quality. If manual labeling is impractical for large datasets, use a stratified sample: label ground truth for 50 to 100 queries that represent your main patterns, and use these for quality benchmarks while using unlabeled queries for latency and throughput benchmarks.
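A ground-truth entry needs only three fields: the query, the relevant set, and the ideal ranking. Here is one possible shape in Python; the memory IDs and query texts are invented for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class TestQuery:
        text: str
        query_type: str         # "semantic", "entity", "temporal", or "composite"
        relevant_ids: set[str]  # ground-truth relevant memory IDs
        ideal_order: list[str] = field(default_factory=list)  # most relevant first

    TEST_SUITE = [
        TestQuery(
            text="find memories about billing disputes",
            query_type="semantic",
            relevant_ids={"m17", "m203", "m991"},
            ideal_order=["m203", "m17", "m991"],
        ),
        TestQuery(
            text="find memories connected to org_acme",
            query_type="entity",
            relevant_ids={"m42", "m77"},
            ideal_order=["m42", "m77"],
        ),
        # ... at least 50 queries covering all four retrieval patterns
    ]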
Step 4: Measure retrieval latency.
Run your query test suite against the loaded benchmark dataset and record the latency for each query. Important measurement practices: warm up the system before measuring (run 100 throwaway queries to populate caches and warm up connection pools), measure end-to-end latency from query submission to result receipt (not just the database query time), run each query multiple times and take the median to suppress variance from network jitter, separate latency measurements by query type (semantic queries may have different latency characteristics than entity queries), and measure under load by running concurrent queries at your expected production concurrency. Report latency as percentiles, not averages. An average latency of 100ms that hides a p99 of 5 seconds is misleading and will cause production problems. If your p95 or p99 exceeds your latency budget, investigate which queries are slow and why. Common causes include: large result sets that require expensive post-processing, graph traversals that follow too many edges, missing indexes for metadata filters, and cold cache misses.
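A minimal harness that follows these practices might look like the sketch below, assuming a retrieve(query_text) callable that runs your full end-to-end retrieval path:

    import statistics
    import time
    from itertools import cycle, islice

    def measure_latencies(retrieve, query_texts, warmup=100, repeats=5):
        """Return one median end-to-end latency (ms) per query."""
        for text in islice(cycle(query_texts), warmup):
            retrieve(text)  # throwaway runs: warm caches and connection pools
        latencies_ms = []
        for text in query_texts:
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                retrieve(text)  # end-to-end: query submission to result receipt
                samples.append((time.perf_counter() - start) * 1000)
            latencies_ms.append(statistics.median(samples))  # suppress jitter
        return latencies_ms

    def percentile(values, p):
        """Nearest-rank percentile; report p50/p95/p99, never the average."""
        ordered = sorted(values)
        return ordered[min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))]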
Step 5: Measure retrieval quality.
For each test query with ground truth labels, compare the returned results against the expected results and calculate your quality metrics. Precision at 5 measures whether the top 5 results are relevant: if 4 out of 5 are relevant, precision is 0.8. Recall at 10 measures whether all relevant memories appear in the top 10 results: if the ground truth set has 8 relevant memories and 6 appear in the top 10, recall is 0.75. nDCG measures ranking quality: if the most relevant result appears at position 1, the score is higher than if it appears at position 5. Run quality benchmarks separately for each query type and report the results by type. You may find that semantic queries have high quality while entity queries have low quality, which tells you exactly which retrieval strategy needs improvement. Also track how quality changes when cognitive scoring is enabled versus disabled. Cognitive scoring should improve nDCG (better ranking) without significantly affecting precision and recall (same relevant results, just better ordered). If enabling cognitive scoring reduces recall, something in the scoring pipeline is filtering out valid results.
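All three metrics are straightforward to compute once you have ground-truth labels. This sketch uses binary relevance for nDCG (a result is either in the relevant set or not); graded relevance works the same way with per-result gain values:

    import math

    def precision_at_k(results, relevant, k):
        """Fraction of the top-k results that are relevant."""
        return sum(1 for r in results[:k] if r in relevant) / k

    def recall_at_k(results, relevant, k):
        """Fraction of all relevant memories that appear in the top k."""
        return sum(1 for r in results[:k] if r in relevant) / len(relevant)

    def ndcg_at_k(results, relevant, k):
        """Binary-relevance nDCG: relevant results near the top score higher."""
        dcg = sum(1 / math.log2(i + 2) for i, r in enumerate(results[:k]) if r in relevant)
        ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
        return dcg / ideal if ideal else 0.0

    # The example from the text: 4 of the top 5 relevant gives precision 0.8.
    assert precision_at_k(["m1", "m2", "m3", "m4", "m99"], {"m1", "m2", "m3", "m4"}, 5) == 0.8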
Step 6: Run scale tests.
Repeat your latency and quality benchmarks at increasing data sizes to identify the point where performance degrades. Load your benchmark dataset at 1x, 2x, 5x, and 10x the current size, and measure how each metric changes. Plot the results to identify trends. Ideal behavior is sub-linear latency growth (latency increases slower than data size) and stable quality metrics regardless of data size. Warning signs include: linear or super-linear latency growth (the system is scanning more data as it grows, rather than using efficient indexes), quality degradation at larger sizes (the retrieval pipeline cannot distinguish relevant from irrelevant results when there is more noise), and throughput collapse at a specific size threshold (indicating a resource bottleneck that saturates at that scale). If you identify a scale threshold, determine whether the fix is an index optimization (cheap), an architecture change (moderate), or a fundamental redesign (expensive), and plan accordingly.
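A scale sweep can reuse the helpers sketched in Steps 1 and 4: reload the dataset at each multiple, rerun the latency benchmark, and record the percentiles. Here load_dataset and retrieve are placeholders for your system's ingestion and query calls:

    SCALE_FACTORS = [1, 2, 5, 10]

    def scale_sweep(base_size, load_dataset, retrieve, query_texts):
        """Rerun the latency benchmark at growing dataset sizes."""
        trend = []
        for factor in SCALE_FACTORS:
            load_dataset(synthetic_dataset(base_size * factor))  # fresh load per size
            lat = measure_latencies(retrieve, query_texts)
            trend.append({
                "size": base_size * factor,
                "p50_ms": percentile(lat, 50),
                "p95_ms": percentile(lat, 95),
                "p99_ms": percentile(lat, 99),
            })
        return trend  # plot size vs. p95: sub-linear growth is the goal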

Automating Continuous Benchmarks

Benchmarks should not be one-time events. Build an automated benchmark pipeline that runs regularly (weekly or after significant changes) and tracks metrics over time. A quality regression caught by an automated benchmark is far cheaper to fix than one discovered by users in production. Your continuous benchmark pipeline should: load a consistent test dataset, run the full query test suite, calculate all metrics, compare against previous results and thresholds, and alert if any metric degrades beyond a defined tolerance.
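A skeleton for such a pipeline, reusing check_thresholds from Step 2, might look like the sketch below; run_benchmark, load_previous, save, and alert are placeholders for your own scheduler, metrics store, and alerting hooks:

    TOLERANCE = 0.05  # alert when a metric degrades by more than 5%

    def continuous_benchmark(run_benchmark, load_previous, save, alert):
        current = run_benchmark()  # dict of metric name -> value
        for failure in check_thresholds(current):
            alert(f"threshold violation: {failure}")
        previous = load_previous()  # metrics from the last stored run, or None
        if previous:
            for metric, old in previous.items():
                new = current.get(metric, old)
                higher_is_worse = metric.endswith("_ms")  # latency degrades upward
                if higher_is_worse and new > old * (1 + TOLERANCE):
                    alert(f"regression: {metric} rose {old:.3f} -> {new:.3f}")
                elif not higher_is_worse and new < old * (1 - TOLERANCE):
                    alert(f"regression: {metric} fell {old:.3f} -> {new:.3f}")
        save(current)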

Adaptive Recall includes built-in performance monitoring with retrieval latency tracking, quality signals from user feedback, and automatic alerts when metrics degrade.
