How to Test That Your AI Actually Remembers
Before You Start
You need a memory system to test (a custom implementation, a framework integration, or a managed service like Adaptive Recall). You also need the ability to create test users, store test memories, and query the memory store programmatically. These tests should run in a test environment with a separate memory store from production to avoid contaminating real user data.
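The examples below assume a `memory_service` object exposing `store`, `search`, and `delete`. If you want to dry-run the test logic before wiring up a real backend, a minimal in-memory stand-in might look like the sketch below. The class name and keyword-overlap ranking are illustrative assumptions; a real implementation would rank by embedding similarity.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class InMemoryMemoryService:
    """Toy stand-in for a real memory backend, for exercising test logic only."""
    _store: dict = field(default_factory=dict)       # user_id -> list of memories
    _ids: itertools.count = field(default_factory=itertools.count)

    def store(self, content: str, user_id: str) -> dict:
        memory = {"id": next(self._ids), "content": content}
        self._store.setdefault(user_id, []).append(memory)
        return memory

    def search(self, query: str, user_id: str, limit: int = 5) -> list:
        # Rank by keyword overlap; a real service would use vector similarity.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(m["content"].lower().split())), m)
            for m in self._store.get(user_id, [])
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:limit] if score > 0]

    def delete(self, memory_id, user_id: str) -> None:
        self._store[user_id] = [
            m for m in self._store.get(user_id, []) if m["id"] != memory_id
        ]
```

Because the stub scopes storage by `user_id` and supports deletion, every test in this guide can run against it unchanged before you point them at a real store.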
Step-by-Step Test Plan
Start with the simplest possible test: store a memory, then retrieve it by querying with similar text. If this fails, nothing else will work. This test verifies that the embedding pipeline, storage backend, and search functionality are all connected correctly.
```python
def test_basic_store_and_retrieve():
    user_id = "test_user_001"

    # Store a specific fact
    memory_service.store(
        content="The project uses PostgreSQL 16 with pgvector",
        user_id=user_id
    )

    # Retrieve with a similar query
    results = memory_service.search(
        query="What database does the project use?",
        user_id=user_id,
        limit=5
    )
    assert len(results) > 0
    assert "PostgreSQL" in results[0]["content"]
    print("PASS: Basic store and retrieve")
```
```python
def test_multiple_memories():
    user_id = "test_user_002"
    memories = [
        "The API is built with FastAPI and Python 3.12",
        "Deployments go to AWS ECS with Fargate",
        "The frontend is React 18 with TypeScript",
        "CI/CD runs on GitHub Actions"
    ]
    for m in memories:
        memory_service.store(content=m, user_id=user_id)

    # Query should return the most relevant memory
    results = memory_service.search(
        query="What cloud provider do they use?",
        user_id=user_id,
        limit=3
    )
    assert any("AWS" in r["content"] for r in results)
    print("PASS: Multiple memories with targeted query")
```
Store memories, then simulate a session boundary (close the connection, create a new client instance) and verify the memories survive. This catches in-memory caches that disappear on restart, uncommitted transactions, and connection pooling issues.
```python
def test_cross_session_persistence():
    user_id = "test_user_003"

    # Session 1: store memories
    session1 = create_memory_client()
    session1.store(
        content="User prefers dark mode interfaces",
        user_id=user_id
    )
    session1.store(
        content="User timezone is US Pacific",
        user_id=user_id
    )
    session1.close()

    # Session 2: new client, verify memories exist
    session2 = create_memory_client()
    results = session2.search(
        query="What are the user preferences?",
        user_id=user_id,
        limit=5
    )
    assert len(results) >= 2
    contents = " ".join(r["content"] for r in results)
    assert "dark mode" in contents
    assert "Pacific" in contents
    session2.close()
    print("PASS: Cross-session persistence")
```
Test that the retrieval system returns relevant memories, not just any memories. Store a set of memories on different topics, then query for specific topics and verify the results are correctly scoped. This catches over-broad retrieval that returns everything and under-specific retrieval that misses relevant entries.
```python
def test_retrieval_accuracy():
    user_id = "test_user_004"

    # Store memories on different topics
    topics = {
        "database": "PostgreSQL 16 with read replicas",
        "auth": "OAuth 2.0 with Auth0 as the provider",
        "deploy": "Kubernetes on GKE with Helm charts",
        "monitor": "Datadog for metrics and PagerDuty for alerts",
        "testing": "pytest with 85% coverage requirement"
    }
    for content in topics.values():
        memory_service.store(content=content, user_id=user_id)

    # Query for authentication; should get the auth memory
    auth_results = memory_service.search(
        query="How does login work?",
        user_id=user_id,
        limit=2
    )
    assert any("Auth0" in r["content"] for r in auth_results)

    # Query for monitoring; should not return auth memories
    monitor_results = memory_service.search(
        query="What alerting system is configured?",
        user_id=user_id,
        limit=2
    )
    assert any("PagerDuty" in r["content"] for r in monitor_results)
    assert not any("Auth0" in r["content"] for r in monitor_results)
    print("PASS: Retrieval accuracy")
```
This is a critical security test. Store memories for two different users and verify that searching as one user never returns memories belonging to another user. A failure here is a data leak.
```python
def test_user_isolation():
    user_a = "test_user_isolation_a"
    user_b = "test_user_isolation_b"

    # Store different facts for each user
    memory_service.store(
        content="Company uses AWS and Python",
        user_id=user_a
    )
    memory_service.store(
        content="Company uses Azure and C#",
        user_id=user_b
    )

    # Search as user A: should see AWS, never Azure
    results_a = memory_service.search(
        query="What cloud provider?",
        user_id=user_a,
        limit=10
    )
    for r in results_a:
        assert "Azure" not in r["content"]
        assert "C#" not in r["content"]

    # Search as user B: should see C#, never Python
    results_b = memory_service.search(
        query="What programming language?",
        user_id=user_b,
        limit=10
    )
    for r in results_b:
        assert "Python" not in r["content"]
        assert "AWS" not in r["content"]
    print("PASS: Multi-user isolation")
```
Verify that memory updates, deduplication, and deletion work correctly. Store a memory, update it with new information, and verify the update persists. Store a duplicate memory and verify it is deduplicated. Delete a memory and verify it no longer appears in search results.
```python
def test_deduplication():
    user_id = "test_user_dedup"
    memory_service.store(
        content="The team uses Slack for communication",
        user_id=user_id
    )
    memory_service.store(
        content="Team communication happens through Slack",
        user_id=user_id
    )

    results = memory_service.search(
        query="What communication tool?",
        user_id=user_id,
        limit=10
    )
    # Should have 1 result, not 2
    slack_results = [r for r in results if "Slack" in r["content"]]
    assert len(slack_results) <= 1
    print("PASS: Deduplication")
```
```python
def test_deletion():
    user_id = "test_user_delete"
    memory_service.store(
        content="Temporary test data for deletion",
        user_id=user_id
    )

    # Verify it exists
    results = memory_service.search(
        query="temporary test data",
        user_id=user_id
    )
    assert len(results) > 0

    # Delete it
    memory_service.delete(
        memory_id=results[0]["id"],
        user_id=user_id
    )

    # Verify it is gone
    results_after = memory_service.search(
        query="temporary test data",
        user_id=user_id
    )
    assert len(results_after) == 0
    print("PASS: Deletion")
```
The final test verifies that the LLM actually uses injected memories in its responses. Store a specific, unusual fact as a memory, then ask the model a question that requires that fact to answer correctly. If the model's response includes the stored information, the full pipeline works.
```python
def test_llm_uses_memory():
    user_id = "test_user_e2e"

    # Store an unusual fact the model could not know
    memory_service.store(
        content="The internal project codename is Thunderbird",
        user_id=user_id
    )

    # Ask the model using memory-enriched context
    response = chat_with_memory(
        user_message="What is our project codename?",
        user_id=user_id
    )
    assert "Thunderbird" in response
    print("PASS: LLM uses injected memory context")
```
Ongoing Quality Monitoring
Beyond initial testing, monitor memory quality in production. Track the retrieval relevance score distribution (are most retrievals high-confidence?), the percentage of responses that reference injected memories (is the model actually using the context?), and user feedback on whether the AI remembered correctly (are memories accurate and timely?).
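As a sketch of what such tracking might look like, the snippet below computes two of these metrics from a log of retrieval events. The event format and field names are assumptions for illustration, not a real Adaptive Recall API.

```python
# Hypothetical retrieval-event log; field names are illustrative.
events = [
    {"top_score": 0.91, "memory_used_in_response": True},
    {"top_score": 0.44, "memory_used_in_response": False},
    {"top_score": 0.87, "memory_used_in_response": True},
    {"top_score": 0.79, "memory_used_in_response": True},
]

# Share of retrievals whose best match cleared a confidence threshold
high_confidence = sum(e["top_score"] >= 0.8 for e in events) / len(events)

# Share of responses that actually referenced an injected memory
usage_rate = sum(e["memory_used_in_response"] for e in events) / len(events)

print(f"high-confidence retrievals: {high_confidence:.0%}")
print(f"responses using memory: {usage_rate:.0%}")
```

Tracked over time, a falling high-confidence share suggests retrieval quality is drifting, while a falling usage rate suggests the model is ignoring the injected context.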
Adaptive Recall provides built-in monitoring through the status tool, which reports memory counts, retrieval performance, confidence distributions, and lifecycle metrics. This gives you visibility into memory health without building custom monitoring infrastructure.
Test memory that works out of the box. Adaptive Recall's free tier gives you 500 memories to evaluate retrieval quality with your real data.
Get Started Free