How to Test That Your AI Actually Remembers
Before You Start
You need a memory system to test (a custom implementation, a framework integration, or a managed service like Adaptive Recall). You also need the ability to create test users, store test memories, and query the memory store programmatically. These tests should run in a test environment with a separate memory store from production to avoid contaminating real user data.
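The examples below assume a `memory_service` object exposing `store`, `search`, and `delete`. If you want to dry-run the test logic before wiring up a real backend, a minimal in-memory stand-in might look like the sketch below. The class name and keyword-overlap ranking are illustrative assumptions; a real implementation would rank by embedding similarity.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class InMemoryMemoryService:
    """Toy stand-in for a real memory backend, for exercising test logic only."""
    _store: dict = field(default_factory=dict)       # user_id -> list of memories
    _ids: itertools.count = field(default_factory=itertools.count)

    def store(self, content: str, user_id: str) -> dict:
        memory = {"id": next(self._ids), "content": content}
        self._store.setdefault(user_id, []).append(memory)
        return memory

    def search(self, query: str, user_id: str, limit: int = 5) -> list:
        # Rank by keyword overlap; a real service would use vector similarity.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(m["content"].lower().split())), m)
            for m in self._store.get(user_id, [])
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:limit] if score > 0]

    def delete(self, memory_id, user_id: str) -> None:
        self._store[user_id] = [
            m for m in self._store.get(user_id, []) if m["id"] != memory_id
        ]
```

Because the stub scopes storage by `user_id` and supports deletion, every test in this guide can run against it unchanged before you point them at a real store.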
Step-by-Step Test Plan
Start with the simplest possible test: store a memory, then retrieve it by querying with similar text. If this fails, nothing else will work. This test verifies that the embedding pipeline, storage backend, and search functionality are all connected correctly.
```python
def test_basic_store_and_retrieve():
    user_id = "test_user_001"

    # Store a specific fact
    memory_service.store(
        content="The project uses PostgreSQL 16 with pgvector",
        user_id=user_id
    )

    # Retrieve with a similar query
    results = memory_service.search(
        query="What database does the project use?",
        user_id=user_id,
        limit=5
    )
    assert len(results) > 0
    assert "PostgreSQL" in results[0]["content"]
    print("PASS: Basic store and retrieve")
```
```python
def test_multiple_memories():
    user_id = "test_user_002"
    memories = [
        "The API is built with FastAPI and Python 3.12",
        "Deployments go to AWS ECS with Fargate",
        "The frontend is React 18 with TypeScript",
        "CI/CD runs on GitHub Actions"
    ]
    for m in memories:
        memory_service.store(content=m, user_id=user_id)

    # Query should return the most relevant memory
    results = memory_service.search(
        query="What cloud provider do they use?",
        user_id=user_id,
        limit=3
    )
    assert any("AWS" in r["content"] for r in results)
    print("PASS: Multiple memories with targeted query")
```
Store memories, then simulate a session boundary (close the connection, create a new client instance) and verify the memories survive. This catches in-memory caches that disappear on restart, uncommitted transactions, and connection pooling issues.
```python
def test_cross_session_persistence():
    user_id = "test_user_003"

    # Session 1: store memories
    session1 = create_memory_client()
    session1.store(
        content="User prefers dark mode interfaces",
        user_id=user_id
    )
    session1.store(
        content="User timezone is US Pacific",
        user_id=user_id
    )
    session1.close()

    # Session 2: new client, verify memories exist
    session2 = create_memory_client()
    results = session2.search(
        query="What are the user preferences?",
        user_id=user_id,
        limit=5
    )
    assert len(results) >= 2
    contents = " ".join(r["content"] for r in results)
    assert "dark mode" in contents
    assert "Pacific" in contents
    session2.close()
    print("PASS: Cross-session persistence")
```
Test that the retrieval system returns relevant memories, not just any memories. Store a set of memories on different topics, then query for specific topics and verify the results are correctly scoped. This catches over-broad retrieval that returns everything and under-specific retrieval that misses relevant entries.
```python
def test_retrieval_accuracy():
    user_id = "test_user_004"

    # Store memories on different topics
    topics = {
        "database": "PostgreSQL 16 with read replicas",
        "auth": "OAuth 2.0 with Auth0 as the provider",
        "deploy": "Kubernetes on GKE with Helm charts",
        "monitor": "Datadog for metrics and PagerDuty for alerts",
        "testing": "pytest with 85% coverage requirement"
    }
    for content in topics.values():
        memory_service.store(content=content, user_id=user_id)

    # Query for authentication; should get the auth memory
    auth_results = memory_service.search(
        query="How does login work?",
        user_id=user_id,
        limit=2
    )
    assert any("Auth0" in r["content"] for r in auth_results)

    # Query for monitoring; should not return auth memories
    monitor_results = memory_service.search(
        query="What alerting system is configured?",
        user_id=user_id,
        limit=2
    )
    assert any("PagerDuty" in r["content"] for r in monitor_results)
    assert not any("Auth0" in r["content"] for r in monitor_results)
    print("PASS: Retrieval accuracy")
```
This is a critical security test. Store memories for two different users and verify that searching as one user never returns memories belonging to another user. A failure here is a data leak.
```python
def test_user_isolation():
    user_a = "test_user_isolation_a"
    user_b = "test_user_isolation_b"

    # Store different facts for each user
    memory_service.store(
        content="Company uses AWS and Python",
        user_id=user_a
    )
    memory_service.store(
        content="Company uses Azure and C#",
        user_id=user_b
    )

    # Search as user A: should see AWS, never Azure
    results_a = memory_service.search(
        query="What cloud provider?",
        user_id=user_a,
        limit=10
    )
    for r in results_a:
        assert "Azure" not in r["content"]
        assert "C#" not in r["content"]

    # Search as user B: should see C#, never Python
    results_b = memory_service.search(
        query="What programming language?",
        user_id=user_b,
        limit=10
    )
    for r in results_b:
        assert "Python" not in r["content"]
        assert "AWS" not in r["content"]
    print("PASS: Multi-user isolation")
```
Verify that memory updates, deduplication, and deletion work correctly. Store a memory, update it with new information, and verify the update persists. Store a duplicate memory and verify it is deduplicated. Delete a memory and verify it no longer appears in search results.
```python
def test_deduplication():
    user_id = "test_user_dedup"
    memory_service.store(
        content="The team uses Slack for communication",
        user_id=user_id
    )
    memory_service.store(
        content="Team communication happens through Slack",
        user_id=user_id
    )

    results = memory_service.search(
        query="What communication tool?",
        user_id=user_id,
        limit=10
    )
    # Should have 1 result, not 2
    slack_results = [r for r in results if "Slack" in r["content"]]
    assert len(slack_results) <= 1
    print("PASS: Deduplication")
```
```python
def test_deletion():
    user_id = "test_user_delete"
    memory_service.store(
        content="Temporary test data for deletion",
        user_id=user_id
    )

    # Verify it exists
    results = memory_service.search(
        query="temporary test data",
        user_id=user_id
    )
    assert len(results) > 0

    # Delete it
    memory_service.delete(
        memory_id=results[0]["id"],
        user_id=user_id
    )

    # Verify it is gone
    results_after = memory_service.search(
        query="temporary test data",
        user_id=user_id
    )
    assert len(results_after) == 0
    print("PASS: Deletion")
```
The final test verifies that the LLM actually uses injected memories in its responses. Store a specific, unusual fact as a memory, then ask the model a question that requires that fact to answer correctly. If the model's response includes the stored information, the full pipeline works.
```python
def test_llm_uses_memory():
    user_id = "test_user_e2e"

    # Store an unusual fact the model could not know
    memory_service.store(
        content="The internal project codename is Thunderbird",
        user_id=user_id
    )

    # Ask the model using memory-enriched context
    response = chat_with_memory(
        user_message="What is our project codename?",
        user_id=user_id
    )
    assert "Thunderbird" in response
    print("PASS: LLM uses injected memory context")
```
Ongoing Quality Monitoring
Beyond initial testing, monitor memory quality in production. Track the retrieval relevance score distribution (are most retrievals high-confidence?), the percentage of responses that reference injected memories (is the model actually using the context?), and user feedback on whether the AI remembered correctly (are memories accurate and timely?).
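As a sketch of what such tracking might look like, the snippet below computes two of these metrics from a log of retrieval events. The event format and field names are assumptions for illustration, not a real Adaptive Recall API.

```python
# Hypothetical retrieval-event log; field names are illustrative.
events = [
    {"top_score": 0.91, "memory_used_in_response": True},
    {"top_score": 0.44, "memory_used_in_response": False},
    {"top_score": 0.87, "memory_used_in_response": True},
    {"top_score": 0.79, "memory_used_in_response": True},
]

# Share of retrievals whose best match cleared a confidence threshold
high_confidence = sum(e["top_score"] >= 0.8 for e in events) / len(events)

# Share of responses that actually referenced an injected memory
usage_rate = sum(e["memory_used_in_response"] for e in events) / len(events)

print(f"high-confidence retrievals: {high_confidence:.0%}")
print(f"responses using memory: {usage_rate:.0%}")
```

Tracked over time, a falling high-confidence share suggests retrieval quality is drifting, while a falling usage rate suggests the model is ignoring the injected context.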
Adaptive Recall provides built-in monitoring through the status tool, which reports memory counts, retrieval performance, confidence distributions, and lifecycle metrics. This gives you visibility into memory health without building custom monitoring infrastructure.
Test memory that works out of the box. Adaptive Recall's free tier gives you 500 memories to evaluate retrieval quality with your real data.
Get Started Free