How to Implement Response Caching for AI
Before You Start
You need a cache backend (Redis is the most common choice), request logging that shows query patterns and repetition rates, and a way to measure hit rates after implementation. Understanding your traffic patterns is critical: caching works best when a meaningful percentage of requests are identical or semantically similar. Review at least one week of request logs to estimate the potential hit rate before investing in implementation.
Step-by-Step Implementation
Step 1: Identify Which Requests to Cache
Not all AI requests should be cached. Good candidates have three properties: deterministic outputs (the same input should produce an equally valid output every time), tolerance for slightly stale results, and meaningful repetition rates. FAQ responses, classification results, entity extraction, document summaries, and knowledge base queries are typically cacheable. Creative writing, personalized recommendations based on recent behavior, and multi-turn conversations with evolving context are typically not. Analyze your logs to identify which request types repeat most frequently and estimate the percentage of traffic each represents.
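As a rough sketch of that analysis, assuming your request logs are JSON Lines with a prompt field (the field name and format here are hypothetical), you can estimate the ceiling on exact-match hit rate by counting repeated prompts:

import json
from collections import Counter

def estimate_hit_rate(log_path):
    # Count how often each normalized prompt appears in the request log.
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            counts[record["prompt"].strip().lower()] += 1

    total = sum(counts.values())
    # With an unlimited TTL, every request after the first occurrence of a
    # prompt would be a cache hit.
    potential_hits = total - len(counts)
    print(f"{total} requests, {len(counts)} unique prompts")
    print(f"Upper bound on exact-match hit rate: {potential_hits / total:.1%}")

If this upper bound is already well above 15 to 20 percent, exact-match caching alone is worth implementing; if not, semantic caching, covered below, may still recover value from paraphrased queries.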
Step 2: Choose a Cache Backend
Redis is the standard choice for AI response caching because it offers sub-millisecond reads, configurable TTL, and efficient memory usage with LRU eviction. For applications processing fewer than 10,000 requests per day, an in-memory dictionary with periodic disk persistence works fine and avoids the operational overhead of a Redis instance. For high-volume applications, Redis Cluster provides horizontal scaling. If you need persistence across restarts and cannot tolerate cache cold starts, consider a database-backed cache (PostgreSQL with a cache table) that trades some read latency for durability.
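For the low-volume case, a minimal in-process alternative might look like the sketch below (the class and method names are illustrative, not from any particular library); it keeps everything in a Python dictionary and expires entries lazily on read:

import time

class InMemoryCache:
    # Single-process TTL cache; fine for modest volumes, not for shared workers.
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self.store[key]  # lazily expire stale entries
            return None
        return value

    def set(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

Periodic disk persistence (for example, serializing the dictionary on a timer) can be layered on if you want the cache to survive restarts.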
Step 3: Implement Exact-Match Caching
Create a hash of the complete request payload (model, system prompt, messages, tools, temperature, and other parameters) and use it as the cache key. Before making an API call, check the cache for this key. If found, return the cached response immediately. If not found, make the API call, store the response in the cache with the hash key, and return the response. Set the model temperature to 0 for cacheable requests to ensure deterministic outputs.
import hashlib
import json

import anthropic
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CACHE_TTL = 3600  # 1 hour

def get_cached_response(request_params):
    # Hash the complete request payload so any change to the model, prompt,
    # messages, or sampling parameters produces a different cache key.
    cache_key = hashlib.sha256(
        json.dumps(request_params, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(f"ai:response:{cache_key}")
    if cached:
        return json.loads(cached)

    # Cache miss: call the API and store the trimmed response with a TTL.
    response = client.messages.create(**request_params)
    payload = {
        "content": response.content[0].text,
        "usage": {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
    }
    cache.setex(f"ai:response:{cache_key}", CACHE_TTL, json.dumps(payload))
    return payload  # same shape whether served from cache or from the API

Step 4: Add Semantic Caching
Exact-match caching misses queries that are worded differently but ask the same thing. Semantic caching solves this by embedding each query and checking for cached responses to semantically similar queries. When a new query arrives, embed it, search your cache embeddings for vectors within a similarity threshold (typically 0.95 cosine similarity for high precision), and return the cached response if a match is found. If no match exceeds the threshold, make the API call and store both the response and its query embedding. The embedding call adds a small cost (roughly $0.0001 per query at current embedding prices) but is far cheaper than the full LLM call it replaces.
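A minimal in-process sketch of that lookup, assuming query embeddings are unit-normalized vectors and embed_query is a placeholder you would wire to your embedding provider:

import numpy as np

SIMILARITY_THRESHOLD = 0.95
semantic_cache = []  # list of (embedding, cached_response) pairs

def embed_query(text):
    # Placeholder: call your embedding provider and return a unit-normalized
    # numpy vector for the query text.
    raise NotImplementedError

def semantic_lookup(query):
    # Returns (cached_response, query_vec); cached_response is None on a miss.
    query_vec = embed_query(query)
    for vec, response in semantic_cache:
        # Dot product equals cosine similarity for unit-normalized vectors.
        if float(np.dot(query_vec, vec)) >= SIMILARITY_THRESHOLD:
            return response, query_vec
    return None, query_vec

def semantic_store(query_vec, response):
    semantic_cache.append((query_vec, response))

The linear scan is fine for a few thousand cached entries; the architecture section below shows the same idea with a proper vector index.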
Step 5: Set TTLs and Invalidation Rules
Set TTL (time-to-live) based on how quickly the underlying data changes. For FAQ responses based on static documentation, 24 to 72 hours is appropriate. For responses based on product inventory or pricing, 15 to 60 minutes keeps the cache fresh. For classification results where the categories do not change, a week or more is fine. Implement active invalidation that clears relevant cache entries when the underlying data changes: when a knowledge base article is updated, invalidate cached responses that referenced that article. Active invalidation is more work to implement but prevents serving stale information.
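One way to implement active invalidation with the Redis client from the exact-match example above, assuming you know which knowledge base articles each cached response drew on (the article_ids parameter here is hypothetical), is to keep a reverse index from article to cache keys:

def cache_with_sources(cache_key, payload, article_ids, ttl=CACHE_TTL):
    # Store the response, and index its key under every article it used.
    cache.setex(f"ai:response:{cache_key}", ttl, json.dumps(payload))
    for article_id in article_ids:
        cache.sadd(f"ai:source:{article_id}", cache_key)

def invalidate_article(article_id):
    # When an article is updated, delete every cached response that cited it.
    keys = cache.smembers(f"ai:source:{article_id}")
    if keys:
        cache.delete(*(f"ai:response:{key.decode()}" for key in keys))
    cache.delete(f"ai:source:{article_id}")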
Step 6: Measure and Tune
Track three metrics continuously: cache hit rate (percentage of requests served from cache), latency reduction (average response time for cached vs uncached requests), and cost savings (API dollars saved by cached responses per day). Start with conservative settings (high similarity threshold, short TTL) and relax them gradually while monitoring response quality. If cache hit rates are below 15 percent after a week, either the traffic is too diverse for caching to be effective or the similarity threshold needs adjustment. If response quality complaints increase, tighten the similarity threshold or reduce TTL for affected categories.
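A lightweight way to track hit rate and savings is a pair of counters stored next to the cache itself; the per-token prices below are placeholders you would replace with your model's actual rates:

def record_hit():
    cache.incr("ai:metrics:hits")

def record_miss(input_tokens, output_tokens):
    cache.incr("ai:metrics:misses")
    # Placeholder pricing: $3 per million input tokens, $15 per million output.
    cost = input_tokens * 3e-06 + output_tokens * 15e-06
    cache.incrbyfloat("ai:metrics:api_spend", cost)

def hit_rate():
    hits = int(cache.get("ai:metrics:hits") or 0)
    misses = int(cache.get("ai:metrics:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0

Every recorded hit also represents avoided spend, so daily savings can be estimated as the hit count multiplied by the average cost of a miss.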
Semantic Cache Architecture
A production semantic cache maintains two data structures: a vector index of query embeddings (stored in a vector database or an in-memory index like FAISS) and a key-value store of cached responses keyed by query ID. When a request arrives, the query is embedded and searched against the vector index. If a match is found within the similarity threshold, the corresponding response is retrieved from the key-value store. If no match is found, the full API call is made, and both the query embedding and response are stored.
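As a sketch of those two structures, assuming unit-normalized embeddings (so inner product equals cosine similarity) and an embedding dimension of 1536, a FAISS flat index plus a plain dictionary is enough to start with:

import faiss
import numpy as np

EMBEDDING_DIM = 1536  # depends on your embedding model
index = faiss.IndexFlatIP(EMBEDDING_DIM)  # vector index of query embeddings
responses = {}  # query ID (position in the index) -> cached response

def lookup(query_vec, threshold=0.95):
    if index.ntotal == 0:
        return None
    scores, ids = index.search(np.asarray([query_vec], dtype="float32"), 1)
    if scores[0][0] >= threshold:
        return responses[int(ids[0][0])]
    return None

def store(query_vec, response):
    responses[index.ntotal] = response  # ID assigned to the vector added next
    index.add(np.asarray([query_vec], dtype="float32"))

A flat index scans every vector on each search; once the cache grows past a few hundred thousand entries, an approximate index or a hosted vector database becomes the better fit.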
The similarity threshold is the most important tuning parameter. At 0.98, the cache only matches nearly identical queries, producing high precision but low hit rates. At 0.90, the cache matches loosely similar queries, producing high hit rates but risking incorrect responses. The optimal threshold depends on the application: factual Q&A systems can tolerate lower thresholds (0.92 to 0.95) because similar questions have similar answers, while personalized or context-dependent responses need higher thresholds (0.96 to 0.99) to avoid returning inappropriate cached results.
Cache warming pre-populates the cache with responses to known common queries before they arrive from users. If you have historical query logs, run the top 100 to 500 most frequent queries through the model and cache the results before a product launch, a traffic spike, or a new deployment. Cache warming eliminates the cold start problem where the first user to ask each question gets full latency while subsequent users get cached responses.
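A warming pass can simply reuse the get_cached_response wrapper from the exact-match example; the model name and query source below are placeholders:

def warm_cache(top_queries):
    # Run the most frequent historical queries through the normal caching
    # path so the first real users already hit a warm cache.
    for query in top_queries:
        get_cached_response({
            "model": "claude-sonnet-4-5",  # placeholder model name
            "max_tokens": 1024,
            "temperature": 0,              # deterministic, cache-friendly output
            "messages": [{"role": "user", "content": query}],
        })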
Adaptive Recall works alongside response caching by providing a persistent memory layer that reduces the tokens you send in the first place. Cache the responses you do make, and use memory to avoid making many of them at all.