How to Build a Memory Layer with Vector Storage
Before You Start
You need access to an embedding API (OpenAI, Voyage, Cohere, or a local model) and a vector database. This guide uses pgvector as the storage backend because many applications already run PostgreSQL, so you can add vector storage without deploying a separate service. The same patterns apply to Pinecone, Qdrant, Weaviate, or any other vector database.
If you want the memory layer without building it, Adaptive Recall provides a managed vector storage layer with cognitive scoring and lifecycle management built on top. This guide is for developers who want to understand the internals or need a custom implementation.
Step-by-Step Setup
Step 1: Choose an Embedding Model

The embedding model converts text into numerical vectors that capture semantic meaning. Two pieces of text with similar meaning produce vectors that are close together in the vector space, enabling search by meaning rather than keyword matching.
The main trade-off is between accuracy and cost. Higher-dimension models (1536 or 3072 dimensions) capture more nuance but cost more to store and search. Lower-dimension models (768 or 1024 dimensions) are faster and cheaper but may miss subtle distinctions. For most memory applications, OpenAI's text-embedding-3-small (1536 dimensions) provides a good balance. Voyage AI and Cohere offer competitive alternatives with different pricing models.
from openai import OpenAI

client = OpenAI()

def embed_text(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

Step 2: Set Up pgvector

Install the pgvector extension in your PostgreSQL instance and create a table for memories. The table stores the vector alongside metadata fields that support filtering during retrieval.
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the memories table
CREATE TABLE memories (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(64) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    source VARCHAR(32) DEFAULT 'conversation',
    confidence FLOAT DEFAULT 1.0,
    access_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    last_accessed TIMESTAMP DEFAULT NOW()
);
-- Create an index for fast similarity search
CREATE INDEX ON memories USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Create an index for user isolation
CREATE INDEX ON memories (user_id);

Step 3: Design the Memory Record

Each memory record includes the raw text, its vector embedding, and metadata that supports filtering and ranking. The metadata fields determine what kinds of queries your memory layer can answer efficiently.
The essential metadata fields are user_id (for multi-tenant isolation), created_at (for recency-based ranking), and content (the original text for display). Useful additional fields include source (where the memory came from), confidence (how reliable the information is), access_count (how often it has been retrieved), and last_accessed (for tracking staleness). These fields enable retrieval strategies beyond simple vector similarity.
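These fields become useful once you combine them with the similarity score a search returns. Below is a minimal re-ranking sketch over result dictionaries shaped like the ones the read path in step 5 produces; the 0.7/0.2/0.1 weights and the 30-day half-life are illustrative assumptions, not recommended values.

from datetime import datetime

def rerank(results, half_life_days=30.0):
    # Blend vector similarity with recency and usage frequency.
    # The weights and half-life here are arbitrary assumptions for the sketch;
    # tune them against your own retrieval quality measurements.
    now = datetime.now()
    scored = []
    for r in results:
        age_days = max((now - datetime.fromisoformat(r["last_accessed"])).days, 0)
        recency = 0.5 ** (age_days / half_life_days)   # exponential decay toward 0
        usage = min(r["access_count"], 10) / 10.0      # cap the effect of popularity
        score = 0.7 * r["similarity"] + 0.2 * recency + 0.1 * usage
        scored.append({**r, "score": round(score, 4)})
    return sorted(scored, key=lambda m: m["score"], reverse=True)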
Step 4: Build the Write Path

The write path takes text, generates an embedding, and inserts the record into the database. Include deduplication logic to prevent storing the same information multiple times.
import psycopg2
import json

conn = psycopg2.connect("postgresql://localhost/myapp")

def store_memory(user_id, content, source="conversation"):
    embedding = embed_text(content)
    with conn.cursor() as cur:
        # Check for near-duplicates
        cur.execute("""
            SELECT id, content FROM memories
            WHERE user_id = %s
              AND embedding <=> %s::vector < 0.05
            LIMIT 1
        """, (user_id, json.dumps(embedding)))
        duplicate = cur.fetchone()
        if duplicate:
            # Update existing memory instead of creating duplicate
            cur.execute("""
                UPDATE memories
                SET access_count = access_count + 1,
                    last_accessed = NOW()
                WHERE id = %s
            """, (duplicate[0],))
        else:
            cur.execute("""
                INSERT INTO memories
                    (user_id, content, embedding, source)
                VALUES (%s, %s, %s::vector, %s)
            """, (user_id, content, json.dumps(embedding), source))
    conn.commit()

Step 5: Build the Read Path

The read path takes a query, embeds it, and searches the vector database for the most similar stored memories. The database handles the similarity computation and returns results ranked by closeness.
def search_memories(user_id, query, limit=5):
    query_embedding = embed_text(query)
    with conn.cursor() as cur:
        # Cosine distance (<=>) orders the results; 1 - distance is returned as similarity
        cur.execute("""
            SELECT content, confidence, access_count,
                   created_at, last_accessed,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM memories
            WHERE user_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (
            json.dumps(query_embedding),
            user_id,
            json.dumps(query_embedding),
            limit
        ))
        results = []
        for row in cur.fetchall():
            results.append({
                "content": row[0],
                "confidence": row[1],
                "access_count": row[2],
                "created_at": row[3].isoformat(),
                "last_accessed": row[4].isoformat(),
                "similarity": round(row[5], 4)
            })
    return results

Step 6: Add Metadata Filters

Combine vector search with metadata filters to scope results by user, recency, confidence, or source. This is where the metadata fields from step 3 pay off. You can filter for only high-confidence memories, only memories from the last 30 days, or only memories from a specific source.
def search_with_filters(user_id, query, limit=5,
                        min_confidence=0.5, days_back=None):
    query_embedding = embed_text(query)

    # Build the WHERE clause dynamically from the requested filters
    filters = ["user_id = %s"]
    params = [json.dumps(query_embedding), user_id]

    if min_confidence > 0:
        filters.append("confidence >= %s")
        params.append(min_confidence)

    if days_back:
        # Keep the placeholder outside the string literal so psycopg2 can bind it safely
        filters.append("created_at >= NOW() - (%s * INTERVAL '1 day')")
        params.append(days_back)

    where_clause = " AND ".join(filters)
    params.append(json.dumps(query_embedding))
    params.append(limit)

    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT content, confidence,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM memories
            WHERE {where_clause}
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, params)
        return [
            {"content": row[0], "confidence": row[1],
             "similarity": round(row[2], 4)}
            for row in cur.fetchall()
        ]

Scaling Considerations
The IVFFlat index used in step 2 works well up to a few million vectors per table. Beyond that, consider HNSW indexes (faster queries, more memory) or partitioning the table by user_id. Embedding API calls add latency to both writes and reads, so batch embedding calls when possible and cache embeddings for frequently queried terms.
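As a sketch of that batching-and-caching idea, reusing the OpenAI client from step 1: the in-process dict cache and the batch size of 64 below are assumptions for illustration, and a production system might swap in an LRU cache or an external store such as Redis.

_embedding_cache = {}  # text -> embedding vector, kept in process memory

def embed_texts(texts, batch_size=64):
    # Embed many texts with one API call per batch instead of one call per text,
    # reusing cached vectors for strings that were embedded before.
    missing = [t for t in dict.fromkeys(texts) if t not in _embedding_cache]
    for i in range(0, len(missing), batch_size):
        batch = missing[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        for item in response.data:
            _embedding_cache[batch[item.index]] = item.embedding
    return [_embedding_cache[t] for t in texts]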
For production workloads, connection pooling, async embedding calls, and query result caching reduce latency significantly. The memory layer should add less than 100 milliseconds to the total request time when properly optimized. If you need faster retrieval with cognitive scoring on top, Adaptive Recall handles the vector infrastructure and adds ACT-R based ranking that improves with usage.
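For the pooling piece, one possible shape is a thin wrapper around psycopg2's built-in ThreadedConnectionPool; the pool bounds and DSN below are placeholders, and the helper simply borrows a connection, runs whatever query function you pass it, and returns the connection.

from psycopg2.pool import ThreadedConnectionPool

# Placeholder pool sizing and DSN; tune both for your deployment.
pool = ThreadedConnectionPool(minconn=1, maxconn=10,
                              dsn="postgresql://localhost/myapp")

def with_pooled_connection(fn):
    # Borrow a connection, run the caller's query function, always give it back.
    conn = pool.getconn()
    try:
        return fn(conn)
    finally:
        pool.putconn(conn)

With a pool in place, the search and store functions above would take the connection as a parameter rather than relying on a single module-level connection.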
Get a production memory layer without managing vector infrastructure. Adaptive Recall provides embedding, storage, cognitive retrieval, and lifecycle management out of the box.
Get Started Free