The LOCOMO Benchmark: How Agent Memory Is Tested
What LOCOMO Measures
LOCOMO was developed to address a gap in AI evaluation: most benchmarks measure recall within a single conversation, but real-world agents operate across multiple sessions over days or weeks. A coding assistant that remembers the project architecture from last Tuesday, a customer service bot that recalls a complaint from three conversations ago, and a research agent that builds on findings from previous investigations all need memory that works across sessions, not just within them.
The benchmark uses long, multi-session conversation datasets where each session contains facts, preferences, events, and entity relationships. Test questions require the system to recall specific information from past sessions, reason about how facts relate across sessions, order events chronologically, and apply accumulated knowledge to new questions.
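To make the setup concrete, here is a rough sketch of what one test item against such a dataset could look like. The schema is illustrative, not LOCOMO's actual format.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str
    timestamp: str  # e.g. "2024-03-04T10:15:00"

@dataclass
class Session:
    session_id: int
    turns: list[Turn]

@dataclass
class TestItem:
    question: str
    answer: str
    category: str                 # "single_hop", "multi_hop", "temporal", ...
    evidence_sessions: list[int]  # which past sessions the answer depends on

# A single-hop item points at one session; multi-hop items point at several.
item = TestItem(
    question="What is the name of the user's dog?",
    answer="Biscuit",
    category="single_hop",
    evidence_sessions=[2],
)
```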
Single-Hop Retrieval
The simplest test: can the system recall a specific fact from a previous session? "In our conversation last week, you mentioned your dog's name. What is it?" This tests basic memory storage and retrieval. Systems with any form of persistent memory (even simple conversation logging) score well here, typically reaching 80 to 95% accuracy. The challenge is not finding the answer but distinguishing the specific fact from similar facts in other sessions.
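As a toy illustration of that disambiguation problem, consider naive keyword retrieval over logged turns. All names and data here are made up.

```python
def retrieve(query: str, log: list[tuple[int, str]]) -> tuple[int, str]:
    """Return the (session_id, text) turn with the most word overlap."""
    terms = set(query.lower().rstrip("?").split())
    def score(entry: tuple[int, str]) -> int:
        return len(terms & set(entry[1].lower().split()))
    return max(log, key=score)

log = [
    (1, "My sister's dog is called Rex"),  # similar fact, wrong answer
    (3, "My dog's name is Biscuit"),       # the fact we actually want
]
print(retrieve("What is your dog's name?", log))  # -> (3, "My dog's name is Biscuit")
```

The distinctive tokens ("dog's", "name") break the tie here, but with more sessions and more look-alike facts, surface overlap alone stops being enough.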
Multi-Hop Reasoning
Multi-hop questions require combining information from multiple sessions. "Based on what you told me about your dietary restrictions and the restaurant you enjoyed last month, what would you order at the Italian place we discussed yesterday?" This requires retrieving the dietary restrictions from one session, the restaurant preferences from another, and the Italian menu details from a third, then reasoning about the intersection. Systems without entity-aware memory score poorly because they cannot connect information across sessions by entity relationship. Their retrieval returns fragments from each session but cannot synthesize them.
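One way to support this, sketched below with made-up names: index every stored fact by the entities it mentions, so a question can pull in facts that share an entity regardless of which session produced them.

```python
from collections import defaultdict

facts = [
    {"session": 1, "text": "User is vegetarian.",
     "entities": {"user", "diet"}},
    {"session": 4, "text": "User loved the mushroom risotto at Trattoria Roma.",
     "entities": {"user", "trattoria_roma"}},
    {"session": 7, "text": "Trattoria Roma was discussed as a dinner option.",
     "entities": {"trattoria_roma"}},
]

# Entity index: each entity maps to every fact that mentions it.
index: dict[str, list[dict]] = defaultdict(list)
for fact in facts:
    for entity in fact["entities"]:
        index[entity].append(fact)

def multi_hop(entities: set[str]) -> list[dict]:
    """Gather every fact touching any queried entity, across sessions."""
    seen: set[int] = set()
    results = []
    for entity in entities:
        for fact in index[entity]:
            if id(fact) not in seen:
                seen.add(id(fact))
                results.append(fact)
    return results

# "What should the user order at Trattoria Roma?" touches both entities,
# so facts from sessions 1, 4, and 7 all surface for the reasoning step.
for fact in multi_hop({"user", "trattoria_roma"}):
    print(fact["session"], fact["text"])
```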
Temporal Reasoning
Temporal questions test whether the system tracks when things happened and can reason about sequences. "Did you mention the job interview before or after the vacation?" This is the hardest category for most systems because it requires not just storing facts but storing them with accurate temporal metadata and being able to compare timestamps across sessions. Systems that store memories without timestamps or that rely on insertion order rather than actual temporal context fail here consistently.
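A minimal sketch of why this matters, assuming each memory carries the time its event was mentioned (the schema is illustrative):

```python
from datetime import datetime

# Timestamps are stored explicitly; insertion order is deliberately
# irrelevant to the comparison below.
memories = [
    {"text": "mentioned the beach vacation", "at": datetime(2024, 3, 18)},
    {"text": "mentioned the job interview", "at": datetime(2024, 3, 5)},
]

def earlier(topic_a: str, topic_b: str) -> str:
    """Return whichever topic was mentioned first, by stored timestamp."""
    def when(topic: str) -> datetime:
        return next(m["at"] for m in memories if topic in m["text"])
    return topic_a if when(topic_a) < when(topic_b) else topic_b

print(earlier("interview", "vacation"))  # -> "interview"
```

A system relying on insertion order would answer "vacation" here, because the vacation memory happens to sit first in the list.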
Open-Ended Application
Open-ended questions test whether the system can apply accumulated knowledge to new situations. "Given everything you know about my preferences, suggest a birthday gift." This requires retrieving multiple relevant memories (hobbies, interests, mentioned desires, budget hints), filtering by relevance, and synthesizing a novel recommendation. Systems score well here when their memory retrieval is broad enough to surface non-obvious connections and their generation model is capable of creative synthesis.
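One way to get that breadth, sketched with an assumed `memory_search(query, k)` lookup and a made-up facet list: fan a single request out over several preference facets and de-duplicate the union before generation.

```python
from typing import Callable

def broad_retrieve(
    memory_search: Callable[[str, int], list[str]],  # assumed: (query, k) -> memories
    user_id: str,
    k: int = 3,
) -> list[str]:
    """Pull memories across several facets so non-obvious connections surface."""
    facets = ["hobbies", "interests", "mentioned wishes", "budget"]
    results: list[str] = []
    for facet in facets:
        results.extend(memory_search(f"{user_id} {facet}", k))
    return list(dict.fromkeys(results))  # de-duplicate, preserve order
```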
Adversarial Challenges
Adversarial questions test whether the system can handle contradictions, updates, and misleading information. "You told me you live in Boston, but last session you mentioned moving to Austin. Where do you live now?" This tests whether the memory system handles fact updates correctly: the old fact (Boston) should be superseded by the new fact (Austin), or both should be available with temporal context so the system can determine the current answer.
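A sketch of timestamp-based supersession, assuming facts are keyed by the attribute they describe (names illustrative):

```python
from datetime import datetime

# Keep every version of a fact; answer with the newest while retaining
# the old value as temporal context ("moved from Boston to Austin").
history: dict[str, list[tuple[datetime, str]]] = {
    "home_city": [
        (datetime(2024, 1, 10), "Boston"),
        (datetime(2024, 4, 2), "Austin"),
    ],
}

def current(attribute: str) -> str:
    """Latest value wins; earlier values remain available for reasoning."""
    return max(history[attribute])[1]  # tuples sort by timestamp first

print(current("home_city"))  # -> "Austin"
```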
How Different Approaches Score
Full conversation replay (stuffing all past conversations into the context) scores highest on single-hop retrieval (90 to 95%) because the information is directly in the context. It degrades on temporal reasoning because the LLM struggles to track time across a massive context, and it becomes impractical for long histories because the context window eventually fills up.
Summarization-based memory (summarizing each session and storing the summaries) scores moderately across all categories (65 to 75%) because summaries lose the specific details needed for single-hop retrieval while retaining the general knowledge needed for open-ended questions. Temporal reasoning depends entirely on whether the summaries include timestamps.
RAG-based memory (embedding conversation chunks and retrieving by similarity) scores well on single-hop retrieval (80 to 85%) when the query matches stored chunks. It struggles on multi-hop (55 to 65%) because similarity search does not connect information across chunks unless the query happens to match all relevant fragments. Temporal reasoning is weak (50 to 60%) because embeddings do not encode temporal relationships.
Structured memory with entity awareness (storing facts with entity extraction, temporal metadata, and relationship graphs) scores best on multi-hop (75 to 85%) and temporal reasoning (70 to 80%) because the entity graph connects information across sessions and timestamps enable temporal ordering. Single-hop is comparable to RAG (80 to 90%). This is the approach that cognitive memory systems like Adaptive Recall use.
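The core of that approach can be sketched as a fact record that carries entities, a timestamp, and relations, so retrieval can follow structure rather than surface similarity. The field names are illustrative, not Adaptive Recall's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Fact:
    text: str
    entities: set[str]     # links the fact into the entity graph
    observed_at: datetime  # enables temporal ordering
    relations: list[tuple[str, str, str]] = field(default_factory=list)

fact = Fact(
    text="User loved the mushroom risotto at Trattoria Roma.",
    entities={"user", "trattoria_roma"},
    observed_at=datetime(2024, 3, 4),
    relations=[("user", "enjoyed", "trattoria_roma")],
)
```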
What This Means for Production Agents
The LOCOMO results map directly to production agent capabilities. If your agent needs to recall specific facts from past sessions (most common), any persistent memory system works. If your agent needs to reason across sessions, connecting what it learned in one context to a question in another, you need entity awareness and relational retrieval. If your agent needs temporal awareness (what happened first, what is the current state after updates), you need timestamps and confidence-based freshness scoring.
Adaptive Recall addresses the benchmark's hardest categories through its cognitive scoring model. Base-level activation tracks recency and access frequency, which supports temporal reasoning. Spreading activation through the knowledge graph connects entities across sessions, which supports multi-hop reasoning. Confidence scoring tracks how well-supported each fact is, which handles adversarial updates (the newer, more confident fact supersedes the older one). These features are not benchmarking tricks; they are implementations of cognitive science principles that happen to align with what good memory systems need.
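Adaptive Recall's exact scoring is not reproduced here, but base-level activation has a standard form in the ACT-R literature the concept comes from: the log of a sum of power-law-decayed access recencies. A minimal sketch, with ACT-R's conventional decay of 0.5:

```python
import math

def base_level_activation(access_ages: list[float], decay: float = 0.5) -> float:
    """ACT-R style base-level activation.

    access_ages: time since each past access (any consistent unit, > 0).
    Frequent and recent accesses raise activation; old ones decay away.
    """
    return math.log(sum(age ** -decay for age in access_ages))

# A memory accessed often and recently outscores one touched once, long ago.
assert base_level_activation([1.0, 3.0, 10.0]) > base_level_activation([200.0])
```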
Build agent memory that handles the hard cases. Adaptive Recall's cognitive scoring and entity awareness address exactly the multi-hop and temporal challenges where naive memory falls short.
Get Started Free