Short-Term vs Long-Term Memory in AI Systems
Short-Term Memory: The Context Window
The context window is the model's working memory. It contains everything the model can "see" for a given request: the system prompt, conversation history, any injected context, and the current user message. The model processes all of this simultaneously using its attention mechanism, which is why it can reference information from the beginning of the window while generating a response.
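To make this concrete, here is a minimal sketch of what occupies the context window for a single request, using the common chat-message shape (the exact format varies by provider, and the contents here are illustrative):

```python
system_prompt = "You are a helpful assistant."
injected_context = "Known facts about this user: prefers concise answers."
history = [
    {"role": "user", "content": "What databases support vector search?"},
    {"role": "assistant", "content": "Postgres with pgvector, Elasticsearch, and others."},
]
current_message = {"role": "user", "content": "Which one scales best?"}

# Everything below is short-term memory: the model attends to all of it at
# once, and none of it persists after the response is generated.
context_window = [
    {"role": "system", "content": system_prompt + "\n\n" + injected_context},
    *history,
    current_message,
]
```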
Context window sizes have grown dramatically. GPT-4o supports 128K tokens. Claude supports 200K tokens. Gemini supports up to 1 million tokens. At the upper end, these windows can hold entire codebases, book-length documents, or days of conversation history. But they are still fundamentally temporary. Nothing in the context window persists after the response is generated. The next request starts with a fresh window that must be populated again.
Short-term memory has two key properties. First, it has near-perfect recall within its bounds. Every token in the context window is available to the attention mechanism, so nothing is dropped within a single request, though in practice recall quality can degrade for information buried in the middle of very long contexts. Second, it is expensive. Every token in the context window costs computation time and money, because the model must process all of them to generate each response. With linear per-token pricing, a request that fills a 100K-token window costs roughly 100 times more than one that uses 1K tokens.
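The arithmetic is straightforward under the linear per-token pricing most APIs use (the rate below is illustrative, not a quote from any provider):

```python
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # e.g. $2.50 per million input tokens

def input_cost(tokens: int) -> float:
    """Input cost scales linearly with the number of tokens processed."""
    return tokens * PRICE_PER_INPUT_TOKEN

print(f"1K-token request:   ${input_cost(1_000):.6f}")    # $0.002500
print(f"100K-token request: ${input_cost(100_000):.6f}")  # $0.250000, ~100x
```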
Long-Term Memory: External Storage
Long-term memory exists outside the model in a separate storage system. Memories are extracted from conversations, embedded as vectors, stored in a database, and retrieved when relevant. The storage persists indefinitely, surviving session boundaries, application restarts, and model upgrades. A memory stored today is available tomorrow, next week, and next year.
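The storage side of this pipeline can be sketched in a few lines. The toy hash-based embed function below stands in for a real embedding model, and the SQLite table stands in for a real vector database; both are illustrative assumptions:

```python
import hashlib
import json
import sqlite3

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; a hash-derived vector keeps
    # the sketch runnable without any dependencies.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

db = sqlite3.connect("memories.db")  # survives sessions, restarts, upgrades
db.execute("CREATE TABLE IF NOT EXISTS memories (text TEXT, vector TEXT)")

def store_memory(text: str) -> None:
    db.execute("INSERT INTO memories VALUES (?, ?)",
               (text, json.dumps(embed(text))))
    db.commit()

store_memory("User works at Acme Corp and prefers Python over Go.")
```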
Long-term memory has different properties from the context window. It is selective rather than comprehensive: not every token is stored, only information deemed worth remembering. It uses approximate retrieval rather than perfect recall: memories are found by semantic similarity or scoring, not by position. And it is cheap to store but requires computation to retrieve: maintaining a large memory store costs little, but searching it adds latency and processing time to each request.
The critical capability of long-term memory is persistence across sessions. When a user returns after days or weeks, the memory system retrieves relevant context and injects it into the context window. The model reads this context alongside the new conversation and responds as if it remembers the previous interactions. The user experiences continuity even though the model itself is stateless.
How They Work Together
Short-term and long-term memory are complementary layers. Long-term memory feeds into short-term memory through context injection. At the start of each session or request, the memory system retrieves relevant stored memories and adds them to the context window. The model then has both the injected long-term context and the current conversation history to work with.
The interaction creates a natural flow. Information enters through the context window as part of the conversation. The extraction layer identifies important information and stores it in long-term memory. On the next session, the retrieval layer pulls relevant memories from long-term storage and injects them back into the context window. The cycle repeats, with the long-term store growing richer and more useful over time.
The design challenge is balancing how much context window space to allocate to injected memories versus current conversation history. More injected memories provide richer context but consume tokens that could be used for conversation depth. A typical allocation dedicates 10-20% of the context window to memory context, leaving the rest for the system prompt and conversation. Advanced systems dynamically adjust this allocation based on how many relevant memories are available and how long the current conversation is.
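In code, the allocation reduces to reserving a slice of the window and filling it with ranked memories until the budget runs out. The 15% default and the token counter are assumptions for illustration:

```python
def memory_token_budget(window_size: int, fraction: float = 0.15) -> int:
    # e.g. 15% of a 200K window reserves 30,000 tokens for memory context
    return int(window_size * fraction)

def select_within_budget(ranked_memories: list[str], budget: int,
                         count_tokens) -> list[str]:
    selected, used = [], 0
    for text in ranked_memories:  # assumed pre-sorted by relevance
        cost = count_tokens(text)
        if used + cost > budget:
            break
        selected.append(text)
        used += cost
    return selected

budget = memory_token_budget(200_000)  # 30000 tokens for injected memories
```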
The Promotion Process
Not everything in short-term memory should be promoted to long-term storage. Greetings, clarifying questions, and transient debugging details have no lasting value. The extraction process acts as a filter, selecting information with persistence value and discarding the rest.
The promotion criteria vary by application. For a customer support bot, customer preferences, issue resolutions, and account details have high promotion value. For a coding assistant, architecture decisions, technology choices, and recurring patterns are worth remembering. For a personal assistant, user preferences, scheduled events, and relationship information deserve long-term storage.
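A promotion filter can start as simply as the sketch below. Production systems typically ask an LLM to classify candidate facts; the keyword heuristics here are purely illustrative:

```python
PROMOTE_HINTS = ("prefer", "always", "never", "decided", "my name is", "deadline")
DISCARD_HINTS = ("hello", "thanks", "could you clarify")

def should_promote(utterance: str) -> bool:
    """Return True if an utterance looks worth storing long-term."""
    text = utterance.lower()
    if any(hint in text for hint in DISCARD_HINTS):
        return False
    return any(hint in text for hint in PROMOTE_HINTS)

should_promote("I prefer dark mode in every editor")  # True: lasting preference
should_promote("Thanks, that fixed it!")              # False: transient
```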
Adaptive Recall handles promotion through its seven specialized tools. The store tool saves memories explicitly. The reflect tool synthesizes observations from recent interactions, identifying patterns and insights that are worth long-term storage. The consolidation system runs periodically to merge related memories, extract lasting knowledge from episodic events, and reduce redundancy. This multi-stage promotion process ensures that long-term storage contains high-quality, non-redundant information.
Architectural Patterns
Several architectural patterns have emerged for combining short-term and long-term memory.
Simple injection. Retrieve the top-k most relevant memories and inject them as a block in the system message. This is the simplest pattern and works well for small memory stores. It breaks down when the memory store grows large enough that simple similarity search returns too many loosely related results.
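A sketch of the pattern, assuming memories are stored as (text, vector) pairs and ranked with a toy cosine similarity:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def inject_top_k(system_prompt, query_vec, memories, user_message, k=3):
    # Rank stored (text, vector) pairs by similarity and take the top k.
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    block = "Relevant memories:\n" + "\n".join(f"- {t}" for t, _ in ranked[:k])
    return [
        {"role": "system", "content": system_prompt + "\n\n" + block},
        {"role": "user", "content": user_message},
    ]
```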
Tiered injection. Separate memories into tiers (core facts always included, contextual memories included when relevant, archived memories included only on explicit request) and inject each tier differently. Core facts go in every prompt. Contextual memories are retrieved per-query. Archived memories require specific triggering.
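The routing logic is a small function. The tier names follow the description above; the 0.75 relevance threshold is an illustrative assumption:

```python
def assemble_memory_context(core_facts: list[str],
                            contextual: list[tuple[str, float]],
                            archived: list[str],
                            explicit_recall: bool = False,
                            threshold: float = 0.75) -> list[str]:
    block = list(core_facts)                       # tier 1: always included
    block += [text for text, score in contextual   # tier 2: relevance-gated
              if score >= threshold]
    if explicit_recall:
        block += archived                          # tier 3: on request only
    return block
```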
Dynamic budget. Allocate context window tokens to memories dynamically based on the conversation state. Early in a conversation (before the user has provided much context), inject more memories. Later in the conversation (when the context window is filling with conversation history), inject fewer memories. This maximizes the value of both memory context and conversation depth.
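One way to implement the taper, with the 20% ceiling and 5% floor as illustrative defaults rather than established standards:

```python
def dynamic_memory_budget(window_size: int, history_tokens: int) -> int:
    fill = history_tokens / window_size        # how full the window already is
    fraction = max(0.05, 0.20 * (1 - fill))    # 20% early, tapering toward 5%
    return int(window_size * fraction)

dynamic_memory_budget(128_000, 2_000)   # early conversation: ~25,200 tokens
dynamic_memory_budget(128_000, 90_000)  # late conversation:  ~7,600 tokens
```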
Cognitive scoring. Instead of simple similarity search, rank memories using multiple signals: recency, access frequency, entity connections, and confidence. This is the approach Adaptive Recall uses, based on the ACT-R cognitive architecture. It produces consistently better results than similarity alone because it accounts for how humans actually use memory, prioritizing recent, frequently accessed, well-connected, and well-validated information.
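A sketch of the idea, using ACT-R's base-level activation for the recency-and-frequency signal. The blend weights are illustrative assumptions, not Adaptive Recall's actual parameters:

```python
import math
import time

def base_level_activation(access_times: list[float], decay: float = 0.5) -> float:
    # ACT-R base-level activation: ln(sum of t^-d over past accesses),
    # where t is seconds since each access and d is the decay rate.
    now = time.time()
    total = sum((now - t) ** -decay for t in access_times if now > t)
    return math.log(total) if total > 0 else float("-inf")

def cognitive_score(similarity: float, access_times: list[float],
                    entity_links: int, confidence: float) -> float:
    # Weighted blend of the four signals named above (weights are assumptions).
    return (1.0 * similarity
            + 0.5 * base_level_activation(access_times)
            + 0.3 * entity_links
            + 0.4 * confidence)
```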
Combine short-term and long-term memory in your application. Adaptive Recall provides the long-term layer with cognitive scoring and automatic lifecycle management.
Get Started Free