The OS-Inspired Memory Hierarchy in Letta
The Operating System Analogy
In a traditional operating system, physical RAM is limited but programs need access to more data than RAM can hold. The OS solves this with virtual memory: programs operate as if they have unlimited memory, while the OS transparently pages data between fast RAM and slower disk storage. Programs do not know or care which tier their data is in; the OS maintains the illusion of effectively unlimited memory.
Letta applies this same principle to LLM context windows. The context window is the AI equivalent of RAM: fast, limited, and essential for active processing. External storage (databases, files) is the AI equivalent of disk: abundant, slower to access, and capable of holding far more data than fits in active context. Letta gives the LLM agent the ability to manage its own virtual context, deciding what to keep in the context window, what to page out to storage, and what to page back in when needed.
The analogy is architectural, not just metaphorical. Just as an OS has specific mechanisms for page replacement (LRU, clock algorithm, working set model), Letta has specific mechanisms for context management. The agent has explicit tools for reading from and writing to its external storage, for searching across stored information, and for managing what appears in its active context. The agent learns to use these tools effectively over time, developing strategies for what to keep in context and what to offload.
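To make the paging analogy concrete, here is a toy model of an agent-managed virtual context. This is an illustrative sketch, not the Letta API: the class, its size limit, and the `page_out`/`page_in` names are invented for this example. A bounded in-context buffer stands in for the context window, and a plain list stands in for external storage.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualContext:
    """Toy model of agent-managed virtual context: a bounded in-context
    buffer (the 'context window') plus unbounded external storage."""
    max_in_context: int = 3
    in_context: list[str] = field(default_factory=list)
    external: list[str] = field(default_factory=list)

    def page_out(self, item: str) -> None:
        """Move an item out of active context into external storage."""
        self.in_context.remove(item)
        self.external.append(item)

    def page_in(self, query: str) -> list[str]:
        """Search external storage and bring matches back into context,
        respecting the fixed context budget."""
        hits = [m for m in self.external if query.lower() in m.lower()]
        for m in hits:
            if m not in self.in_context and len(self.in_context) < self.max_in_context:
                self.in_context.append(m)
        return hits
```

The key property the sketch captures is that paging is explicit: the agent (not hidden infrastructure) decides what leaves the context and what comes back.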
The Three-Tier Architecture
Letta's memory hierarchy has three tiers, each with different characteristics and purposes.
Core memory is the equivalent of registers or L1 cache: a small, fixed-size block of text that is always present in the agent's context window. Core memory typically contains the agent's persona description, key facts about the current user, and critical context that the agent needs on every turn. Core memory is editable by the agent through explicit tool calls, so the agent can update its understanding of the user or its own role as the conversation evolves. The fixed size forces the agent to be selective about what it keeps in core memory, which is the intended design: only the most important, most frequently needed information should occupy this premium real estate.
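A minimal sketch of the core memory behavior described above, assuming a fixed character budget and explicit edit tools. The class, the `limit_chars` parameter, and the tool names (`core_memory_replace`, `core_memory_append`) are modeled on the pattern described in the text, not copied from the real Letta tool signatures.

```python
class CoreMemory:
    """Toy fixed-size core memory: always in context, edited only
    through explicit tool calls, and hard-capped in size."""
    def __init__(self, limit_chars: int = 200):
        self.limit = limit_chars
        self.blocks = {"persona": "", "human": ""}

    def core_memory_replace(self, block: str, old: str, new: str) -> None:
        """Rewrite part of a block, e.g. when a fact about the user changes."""
        updated = self.blocks[block].replace(old, new)
        if len(updated) > self.limit:
            raise ValueError("core memory over limit; offload detail to archival")
        self.blocks[block] = updated

    def core_memory_append(self, block: str, text: str) -> None:
        """Add a new fact, but only if it fits the fixed budget."""
        updated = (self.blocks[block] + " " + text).strip()
        if len(updated) > self.limit:
            raise ValueError("core memory over limit; offload detail to archival")
        self.blocks[block] = updated
```

The hard size cap is the point: when an append fails, the agent is forced to decide what is important enough to keep in this premium tier.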
Recall memory is the equivalent of RAM: a searchable store of conversation history. Every message in the conversation is automatically stored in recall memory, providing the agent with access to the full conversation history even though only a portion fits in the context window at any time. The agent can search recall memory by keywords, time ranges, or semantic similarity to page relevant conversation segments back into context. This is how Letta handles long conversations that exceed the context window: instead of truncating the conversation (losing information) or summarizing it (losing detail), the full conversation is preserved in recall memory and the agent can retrieve specific segments as needed.
Archival memory is the equivalent of disk storage: a large, persistent store for information that the agent wants to remember across conversations and sessions. Archival memory is explicitly managed by the agent: it decides what to store, how to organize it, and when to retrieve it. Unlike recall memory (which is automatically populated from conversation), archival memory contains information that the agent has deliberately chosen to preserve. This makes archival memory more like a personal knowledge base than a conversation log. The agent stores processed information: conclusions, summaries, important facts, learned preferences, and distilled knowledge from multiple conversations.
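Archival memory differs from recall in that nothing enters it automatically. The sketch below uses a crude word-overlap score as a stand-in for embedding-based semantic similarity; the class and tool names (`archival_memory_insert`, `archival_memory_search`) follow the pattern in the text and are not the real API.

```python
class ArchivalMemory:
    """Toy archival memory: the agent explicitly chooses what to persist,
    storing distilled conclusions rather than raw conversation."""
    def __init__(self):
        self.entries: list[str] = []

    def archival_memory_insert(self, text: str) -> None:
        """Deliberate save: called only when the agent decides to keep something."""
        self.entries.append(text)

    def archival_memory_search(self, query: str, top_k: int = 3) -> list[str]:
        """Rank entries by word overlap with the query (a stand-in for
        embedding similarity) and return the best non-zero matches."""
        q = set(query.lower().split())
        scored = [(len(q & set(e.lower().split())), e) for e in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for score, e in scored[:top_k] if score > 0]
```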
Self-Managed vs. Infrastructure-Managed Memory
The defining characteristic of Letta's approach is that the agent manages its own memory. This is fundamentally different from most memory systems (including Adaptive Recall) where memory management is handled by infrastructure external to the agent.
In an infrastructure-managed system, the agent calls a store API to save information and a retrieve API to find it. The infrastructure handles embedding, indexing, retrieval ranking, lifecycle management, and consolidation. The agent does not know or care how memories are stored or retrieved; it trusts the infrastructure to return relevant results.
In Letta's self-managed system, the agent is responsible for deciding what to store, where to store it (core, recall, or archival), when to retrieve it, and how to use the retrieved information to manage its context. The agent develops memory management strategies through its interactions: it learns that certain types of information belong in core memory (always available), that conversation details should be searched in recall memory when needed, and that important knowledge should be explicitly saved to archival memory.
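The two ownership models can be contrasted in a few lines of code. Both classes below are hypothetical sketches (neither mirrors a real Letta or Adaptive Recall API); the point is where the decisions live, and that in the self-managed pattern every decision is a tool call with a token cost.

```python
class InfrastructureManaged:
    """The agent calls store/retrieve; ranking and lifecycle live here."""
    def __init__(self):
        self._items: list[str] = []

    def store(self, text: str) -> None:
        self._items.append(text)  # embedding/indexing would happen here

    def retrieve(self, query: str) -> list[str]:
        # Deterministic ranking code, independent of the LLM's judgment.
        return [t for t in self._items if query.lower() in t.lower()]

class SelfManaged:
    """The agent decides tier and timing; every operation is a tool call
    that consumes context window tokens."""
    def __init__(self):
        self.core: list[str] = []      # always in context
        self.archival: list[str] = []  # retrieved on demand
        self.tool_calls = 0            # proxy for token overhead

    def core_memory_append(self, text: str) -> None:
        self.tool_calls += 1
        self.core.append(text)

    def archival_memory_insert(self, text: str) -> None:
        self.tool_calls += 1
        self.archival.append(text)

    def archival_memory_search(self, query: str) -> list[str]:
        self.tool_calls += 1
        return [t for t in self.archival if query.lower() in t.lower()]
```

The `tool_calls` counter makes the trade-off discussed below visible: autonomy over tier placement is paid for one context-consuming call at a time.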
Each approach has trade-offs. Self-managed memory gives the agent more autonomy and can produce sophisticated memory strategies tailored to its specific tasks, but it spends tokens on every operation: each store, search, or read is a tool call that consumes context window space. Infrastructure-managed memory provides more consistent, reliable retrieval, because the ranking, scoring, and lifecycle logic is implemented in deterministic code rather than depending on the LLM's judgment, and it is more transparent: you can inspect and tune the retrieval pipeline independently of the agent's behavior.
Strengths of the OS-Inspired Approach
Letta's approach has several genuine strengths. First, it elegantly solves the context window limitation without external summarization pipelines. The full conversation is preserved and accessible, which means no information is permanently lost when it scrolls out of the context window. Second, it gives agents agency over their own knowledge management, which can lead to more intelligent memory strategies than a fixed retrieval algorithm. Third, the tiered architecture provides natural separation between always-needed information (core), session history (recall), and long-term knowledge (archival), which maps well to how different types of information are used in practice.
The approach is particularly well-suited for personal assistant agents that have long-running relationships with individual users. The agent can gradually build up core memory with user preferences, store important events in archival memory, and use recall memory to maintain conversation continuity across sessions. The self-management aspect means the agent's memory strategy can adapt to the specific user's needs without requiring infrastructure changes.
Limitations and Trade-Offs
The self-managed approach has limitations that become apparent at scale and in production environments. First, memory quality depends on the LLM's judgment. If the agent makes poor decisions about what to store, what to retrieve, or how to organize archival memory, the entire memory system degrades. There is no external quality check on the agent's memory management decisions. Second, memory operations consume context window tokens. Every tool call to search, read, or write memory reduces the space available for the actual task. In conversations with heavy memory management, a significant fraction of the context window is occupied by memory operations rather than user interaction. Third, the approach does not include cognitive scoring, consolidation, or confidence tracking. Memories stored in archival memory are retrieved by keyword or semantic similarity, with no consideration of recency, access frequency, or corroboration status. Fourth, multi-agent memory sharing is limited because each agent manages its own memory independently.
These are not flaws in Letta's design; they are trade-offs inherent in the self-managed approach. For applications where agent autonomy and conversation continuity are the primary requirements, the trade-offs are worth it. For applications where retrieval quality at scale, multi-tenant isolation, and lifecycle management are primary requirements, an infrastructure-managed approach is better suited.
Lessons for Architecture Design
Regardless of whether you use Letta's approach, the OS memory analogy provides useful architectural principles. The idea of tiered storage with different access characteristics applies to any memory system. The principle that agents should have agency over what they remember (even if the infrastructure handles how it is stored and retrieved) leads to better memory quality than purely automated extraction. And the recognition that context window management is a first-class architectural concern, not an afterthought, is essential for any production AI application.
Adaptive Recall takes a different approach: infrastructure-managed memory with cognitive scoring, knowledge graphs, and lifecycle automation, so your agents focus on tasks while the memory system handles quality and scale. Try both approaches and see which fits your application.
Get Started Free