The Three-Layer Memory Architecture Explained

The three-layer memory architecture separates AI memory into hot (recent, frequently accessed), warm (bulk searchable storage), and cold (archived, rarely accessed) tiers. This mirrors how computer memory hierarchies and human memory systems both work, optimizing for the fact that a small subset of memories handles most retrieval requests.

Why Layers Matter

A single flat memory store works when you have hundreds of memories. It breaks down at scale for two reasons. First, retrieval becomes noisy: with thousands of memories in a single store, similarity searches return many loosely related results that dilute the truly relevant ones. Second, costs scale with corpus size: a brute-force similarity search over a 100,000-memory store does roughly 100 times the work of one over a 1,000-memory store, even though most queries only need context from recent interactions.

The three-layer architecture solves both problems by separating memories based on access patterns. Recent memories that are likely to be needed go in the hot layer, which is small and fast. The bulk of memories go in the warm layer, which supports full vector search. Old, rarely accessed memories go in the cold layer, which is cheap to store and only searched when the other layers do not produce results. This mirrors the CPU cache hierarchy (L1, L2, L3), the storage hierarchy in computer architecture (registers, RAM, disk), and the human memory hierarchy (working memory, long-term recall, deep archive).
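The hot-first, fall-through routing can be sketched in a few lines. This is an illustrative toy, not a real product API: the `Layer` class stands in for whatever store backs each tier, and the `min_results` threshold is an assumed tuning knob.

```python
import math

class Layer:
    """Minimal in-memory stand-in for one storage tier (illustrative only)."""
    def __init__(self, memories):
        self.memories = memories  # list of (memory_id, vector) pairs

    def search(self, query, k):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.memories, key=lambda m: cos(query, m[1]), reverse=True)
        return [m[0] for m in ranked[:k]]

def retrieve(query, hot, warm, cold, k=5, min_results=3):
    """Query tiers cheapest-first; fall through only when results are thin."""
    results = hot.search(query, k)                   # fastest, smallest tier
    if len(results) >= min_results:
        return results
    results += warm.search(query, k - len(results))  # full vector search
    if len(results) >= min_results:
        return results
    results += cold.search(query, k - len(results))  # rare, slow, cheap
    return results
```

The key design choice is that the expensive layers are never touched when the cheap one already answers the query, which is what makes the tiering pay off.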

The Hot Layer

The hot layer contains the most recent and most frequently accessed memories. Think of it as the memories most likely to be needed in the next few interactions. It is small (typically the last 50-200 memories per user), stored in a fast data store like Redis or an in-memory cache, and queried first for every retrieval request.

Access to the hot layer is essentially free in terms of latency, usually under 5 milliseconds. Because the set is small, you can afford expensive ranking operations on every result. Some systems include all hot-layer memories in every prompt as persistent context, similar to how Letta's core memory is always present in the model's context window.

The hot layer has aggressive refresh behavior. New memories enter the hot layer immediately. Memories that are retrieved (regardless of which layer they come from) are promoted to the hot layer. Memories that have not been accessed in a configurable time window (typically 7-30 days) are demoted to the warm layer. This creates a natural recency bias that matches user expectations: the AI should remember recent conversations without being asked.
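The refresh behavior above amounts to two operations: touch on entry or retrieval, and a periodic sweep that demotes stale or overflow entries. A minimal sketch, assuming an in-process index (a real system would likely use a Redis sorted set keyed by last-access time; the class name and defaults here are invented):

```python
import time

class HotLayerIndex:
    """Bookkeeping sketch for hot-layer membership (illustrative, not a
    specific product's API)."""
    def __init__(self, max_size=200, demote_after_days=14):
        self.last_access = {}                      # memory_id -> unix time
        self.max_size = max_size
        self.demote_after = demote_after_days * 86400

    def touch(self, memory_id, now=None):
        """New or just-retrieved memories enter (or refresh) the hot layer."""
        self.last_access[memory_id] = time.time() if now is None else now

    def sweep(self, now=None):
        """Return ids to demote to warm: stale entries, plus any overflow
        beyond max_size (oldest first)."""
        now = time.time() if now is None else now
        stale = {m for m, t in self.last_access.items()
                 if now - t > self.demote_after}
        by_age = sorted(self.last_access, key=self.last_access.get)
        overflow = set(by_age[:max(0, len(by_age) - self.max_size)])
        demoted = stale | overflow
        for m in demoted:
            del self.last_access[m]
        return demoted
```

The size cap matters as much as the time window: it bounds how much persistent context the hot layer can inject into prompts.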

The Warm Layer

The warm layer is the primary searchable archive. It contains the bulk of a user's memories, typically thousands to hundreds of thousands of entries, stored in a vector database with full similarity search capability. This is where embedding-based retrieval happens: the standard cosine-similarity search against the query vector.

Warm-layer queries take 10-100 milliseconds depending on the database, the index type, and the number of memories. This is fast enough for interactive applications but slow enough that you want to minimize unnecessary queries. The typical pattern is to search the hot layer first and only query the warm layer if the hot layer does not produce sufficient results.

The warm layer benefits most from metadata filtering. Rather than searching all memories, scope the search by user ID, by time range, by memory type, or by topic. A filtered search over 50,000 memories can be faster than an unfiltered search over 5,000 if the filter narrows the candidate set effectively. This is where the metadata you store alongside each memory (timestamps, tags, types, sources) pays off in retrieval performance.
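The filter-then-rank pattern looks like this in miniature. The field names (`user_id`, `type`, `created_at`) are illustrative assumptions; in pgvector the equivalent is a `WHERE` clause on those columns combined with `ORDER BY embedding <=> query` (cosine distance).

```python
import math

def filtered_search(memories, query_vec, user_id, mem_type=None, since=None, k=5):
    """Metadata pre-filter, then similarity ranking. `memories` is a list of
    dicts with illustrative fields: id, user_id, type, created_at, vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    # Narrow the candidate set before doing any vector math.
    candidates = [m for m in memories
                  if m["user_id"] == user_id
                  and (mem_type is None or m["type"] == mem_type)
                  and (since is None or m["created_at"] >= since)]
    candidates.sort(key=lambda m: cos(query_vec, m["vector"]), reverse=True)
    return [m["id"] for m in candidates[:k]]
```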

The Cold Layer

The cold layer stores archived memories that have not been accessed in a long time. These memories are unlikely to be needed, but they might contain historical context that becomes relevant in specific situations. A user asking "what did we decide about the auth system six months ago?" triggers a cold-layer search because that context is too old for the hot or warm layers but still valuable.

Cold storage prioritizes cost over speed. Memories in the cold layer can be stored in compressed format, in cheaper storage tiers (S3, blob storage), or even with reduced vector dimensionality. Query latency in the cold layer is acceptable at 200-500 milliseconds because cold queries are rare and the user is asking for something they know is old.
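One concrete way to cut cold-storage cost is scalar quantization: store each vector as one float scale plus one int8 per dimension, roughly a 4x reduction versus float32. This sketch shows the idea; whether a given system quantizes, truncates dimensions, or just compresses is a deployment choice.

```python
def quantize(vec):
    """Lossy int8 compression: one float scale plus one byte per dimension."""
    scale = max(abs(x) for x in vec) or 1.0
    return [round(127 * x / scale) for x in vec], scale

def dequantize(qvec, scale):
    """Approximate reconstruction for the rare cold-layer query."""
    return [scale * q / 127 for q in qvec]
```

The reconstruction error is small relative to typical similarity thresholds, which is why the precision loss is acceptable for a tier that is queried rarely.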

The cold layer also serves as the source of truth for audit and compliance. If a regulation requires retaining all user interactions for a specific period, the cold layer holds that archive without polluting the active retrieval layers. Deletion requests (right to be forgotten) need to sweep all three layers, but day-to-day retrieval only touches hot and warm.

Movement Between Layers

Memories flow between layers based on access patterns and time. The movement rules determine how well the system adapts to changing user needs.

Hot to warm: When a hot-layer memory has not been accessed for the demotion threshold (7-30 days), it moves to the warm layer. This keeps the hot layer lean and focused on current context.

Warm to cold: When a warm-layer memory has not been accessed for the archival threshold (60-180 days), it moves to the cold layer. This keeps the warm layer searchable without growing indefinitely.

Cold to warm (promotion): When a cold-layer memory is retrieved by a user query, it gets promoted back to the warm layer (and often to the hot layer). This allows old but still relevant memories to resurface naturally.

Any layer to deletion: The consolidation process can delete memories that are redundant (merged into a consolidated memory), contradicted (superseded by newer information), or below a confidence threshold. Deletion removes the memory from all layers permanently.
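The movement rules above reduce to two small functions: time-based demotion applied by a background sweep, and access-based promotion applied at retrieval time. The thresholds are picked from the article's example ranges and would be tuned per deployment.

```python
DEMOTE_AFTER_DAYS = 14    # hot -> warm
ARCHIVE_AFTER_DAYS = 90   # warm -> cold

def on_idle(layer, idle_days):
    """Time-based demotion: hot -> warm -> cold as a memory sits unused."""
    if layer == "hot" and idle_days > DEMOTE_AFTER_DAYS:
        return "warm"
    if layer == "warm" and idle_days > ARCHIVE_AFTER_DAYS:
        return "cold"
    return layer

def on_retrieved(layer):
    """Access-based promotion: any retrieved memory moves to the hot layer."""
    return "hot"
```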

Consolidation Across Layers

The consolidation process operates primarily on the warm layer, where the bulk of memories live. It identifies groups of related memories, merges them into more compact representations, and deletes the originals. A user who mentioned "we use PostgreSQL" in five different conversations does not need five separate memories. Consolidation merges these into one fact with higher confidence.

Adaptive Recall's consolidation system goes beyond simple deduplication. It detects contradictions (the user said PostgreSQL in three conversations and MySQL in two, which suggests a change), extracts lasting knowledge from episodic memories (the specific debugging session becomes the general fact "the auth module has a race condition on high traffic"), and builds confidence scores based on how many independent sources corroborate each fact. This process typically reduces memory count by 40-60% while preserving or improving information density.
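A greedy near-duplicate merge conveys the core mechanic, though it is only a stand-in for a real consolidation pipeline (no contradiction detection or knowledge extraction here). The similarity threshold and the confidence formula, where each corroborating source halves the remaining uncertainty, are illustrative assumptions.

```python
import math

def consolidate(memories, threshold=0.92):
    """Greedy near-duplicate merge. `memories` is a list of (text, vector)
    pairs; each corroborating source raises the merged fact's confidence."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    merged = []
    for text, vec in memories:
        for group in merged:
            if cos(vec, group["vector"]) >= threshold:
                group["sources"] += 1
                group["confidence"] = 1 - 0.5 ** group["sources"]
                break
        else:
            merged.append({"text": text, "vector": vec,
                           "sources": 1, "confidence": 0.5})
    return merged
```

Five "we use PostgreSQL" mentions collapse into one fact with five sources and correspondingly higher confidence.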

Implementation Considerations

You can implement the three-layer architecture with different technology stacks at each layer. A common stack is Redis for hot, pgvector for warm, and S3 with a metadata index for cold. Each layer uses the storage technology that matches its access pattern.
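A tier configuration for that stack might look like the following; every key and value here is a hypothetical example of the knobs a deployment would set, not a real library's schema.

```python
# Illustrative configuration for a Redis / pgvector / S3 tiered stack.
TIER_CONFIG = {
    "hot":  {"backend": "redis",    "max_entries": 200, "demote_after_days": 14},
    "warm": {"backend": "pgvector", "index": "hnsw",    "archive_after_days": 90},
    "cold": {"backend": "s3",       "compression": "int8-quantized",
             "metadata_index": "postgres"},
}
```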

For teams that do not want to manage three separate storage systems, managed memory services handle the tiering internally. Adaptive Recall implements the full three-layer architecture with cognitive scoring across all layers, automatic promotion and demotion, and background consolidation. The consumer of the API sees a single memory store; the tiering happens transparently.

Get a three-layer memory architecture without managing the infrastructure. Adaptive Recall handles tiering, consolidation, and lifecycle management automatically.

Get Started Free