How to Implement Response Caching for AI
Before You Start
You need a cache backend (Redis is the most common choice), request logging that shows query patterns and repetition rates, and a way to measure hit rates after implementation. Understanding your traffic patterns is critical: caching works best when a meaningful percentage of requests are identical or semantically similar. Review at least one week of request logs to estimate the potential hit rate before investing in implementation.
Step-by-Step Implementation
Step 1: Identify Which Requests to Cache
Not all AI requests should be cached. Good candidates have three properties: deterministic outputs (the same input should produce an equally valid output every time), tolerance for slightly stale results, and meaningful repetition rates. FAQ responses, classification results, entity extraction, document summaries, and knowledge base queries are typically cacheable. Creative writing, personalized recommendations based on recent behavior, and multi-turn conversations with evolving context are typically not. Analyze your logs to identify which request types repeat most frequently and estimate the percentage of traffic each represents.
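As a rough sketch of that analysis, assuming your request logs are JSON Lines with a prompt field (the field name and format here are hypothetical), you can estimate the ceiling on exact-match hit rate by counting repeated prompts:

import json
from collections import Counter

def estimate_hit_rate(log_path):
    # Count how often each normalized prompt appears in the request log.
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            counts[record["prompt"].strip().lower()] += 1

    total = sum(counts.values())
    # With an unlimited TTL, every request after the first occurrence of a
    # prompt would be a cache hit.
    potential_hits = total - len(counts)
    print(f"{total} requests, {len(counts)} unique prompts")
    print(f"Upper bound on exact-match hit rate: {potential_hits / total:.1%}")

If this upper bound is already well above 15 to 20 percent, exact-match caching alone is worth implementing; if not, semantic caching, covered below, may still recover value from paraphrased queries.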
Step 2: Choose a Cache Backend
Redis is the standard choice for AI response caching because it offers sub-millisecond reads, configurable TTL, and efficient memory usage with LRU eviction. For applications processing fewer than 10,000 requests per day, an in-memory dictionary with periodic disk persistence works fine and avoids the operational overhead of a Redis instance. For high-volume applications, Redis Cluster provides horizontal scaling. If you need persistence across restarts and cannot tolerate cache cold starts, consider a database-backed cache (PostgreSQL with a cache table) that trades some read latency for durability.
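For the low-volume case, a minimal in-process alternative might look like the sketch below (the class and method names are illustrative, not from any particular library); it keeps everything in a Python dictionary and expires entries lazily on read:

import time

class InMemoryCache:
    # Single-process TTL cache; fine for modest volumes, not for shared workers.
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self.store[key]  # lazily expire stale entries
            return None
        return value

    def set(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

Periodic disk persistence (for example, serializing the dictionary on a timer) can be layered on if you want the cache to survive restarts.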
Step 3: Implement Exact-Match Caching
Create a hash of the complete request payload (model, system prompt, messages, tools, temperature, and other parameters) and use it as the cache key. Before making an API call, check the cache for this key. If found, return the cached response immediately. If not found, make the API call, store the response in the cache with the hash key, and return the response. Set the model temperature to 0 for cacheable requests to ensure deterministic outputs.
import hashlib
import json

import anthropic
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CACHE_TTL = 3600  # 1 hour

def get_cached_response(request_params):
    # Hash the complete request payload so any change to the model, prompt,
    # messages, or sampling parameters produces a different cache key.
    cache_key = hashlib.sha256(
        json.dumps(request_params, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(f"ai:response:{cache_key}")
    if cached:
        return json.loads(cached)

    # Cache miss: call the API and store the trimmed response with a TTL.
    response = client.messages.create(**request_params)
    payload = {
        "content": response.content[0].text,
        "usage": {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
    }
    cache.setex(f"ai:response:{cache_key}", CACHE_TTL, json.dumps(payload))
    return payload  # same shape whether served from cache or from the API

Step 4: Add Semantic Caching
Exact-match caching misses queries that are worded differently but ask the same thing. Semantic caching solves this by embedding each query and checking for cached responses to semantically similar queries. When a new query arrives, embed it, search your cache embeddings for vectors within a similarity threshold (typically 0.95 cosine similarity for high precision), and return the cached response if a match is found. If no match exceeds the threshold, make the API call and store both the response and its query embedding. The embedding call adds a small cost (roughly $0.0001 per query at current embedding prices) but is far cheaper than the full LLM call it replaces.
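A minimal in-process sketch of that lookup, assuming query embeddings are unit-normalized vectors and embed_query is a placeholder you would wire to your embedding provider:

import numpy as np

SIMILARITY_THRESHOLD = 0.95
semantic_cache = []  # list of (embedding, cached_response) pairs

def embed_query(text):
    # Placeholder: call your embedding provider and return a unit-normalized
    # numpy vector for the query text.
    raise NotImplementedError

def semantic_lookup(query):
    # Returns (cached_response, query_vec); cached_response is None on a miss.
    query_vec = embed_query(query)
    for vec, response in semantic_cache:
        # Dot product equals cosine similarity for unit-normalized vectors.
        if float(np.dot(query_vec, vec)) >= SIMILARITY_THRESHOLD:
            return response, query_vec
    return None, query_vec

def semantic_store(query_vec, response):
    semantic_cache.append((query_vec, response))

The linear scan is fine for a few thousand cached entries; the architecture section below shows the same idea with a proper vector index.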
Step 5: Set TTLs and Invalidation Rules
Set TTL (time-to-live) based on how quickly the underlying data changes. For FAQ responses based on static documentation, 24 to 72 hours is appropriate. For responses based on product inventory or pricing, 15 to 60 minutes keeps the cache fresh. For classification results where the categories do not change, a week or more is fine. Implement active invalidation that clears relevant cache entries when the underlying data changes: when a knowledge base article is updated, invalidate cached responses that referenced that article. Active invalidation is more work to implement but prevents serving stale information.
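One way to implement active invalidation with the Redis client from the exact-match example above, assuming you know which knowledge base articles each cached response drew on (the article_ids parameter here is hypothetical), is to keep a reverse index from article to cache keys:

def cache_with_sources(cache_key, payload, article_ids, ttl=CACHE_TTL):
    # Store the response, and index its key under every article it used.
    cache.setex(f"ai:response:{cache_key}", ttl, json.dumps(payload))
    for article_id in article_ids:
        cache.sadd(f"ai:source:{article_id}", cache_key)

def invalidate_article(article_id):
    # When an article is updated, delete every cached response that cited it.
    keys = cache.smembers(f"ai:source:{article_id}")
    if keys:
        cache.delete(*(f"ai:response:{key.decode()}" for key in keys))
    cache.delete(f"ai:source:{article_id}")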
Step 6: Measure and Tune
Track three metrics continuously: cache hit rate (percentage of requests served from cache), latency reduction (average response time for cached vs uncached requests), and cost savings (API dollars saved by cached responses per day). Start with conservative settings (high similarity threshold, short TTL) and relax them gradually while monitoring response quality. If cache hit rates are below 15 percent after a week, either the traffic is too diverse for caching to be effective or the similarity threshold needs adjustment. If response quality complaints increase, tighten the similarity threshold or reduce TTL for affected categories.
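A lightweight way to track hit rate and savings is a pair of counters stored next to the cache itself; the per-token prices below are placeholders you would replace with your model's actual rates:

def record_hit():
    cache.incr("ai:metrics:hits")

def record_miss(input_tokens, output_tokens):
    cache.incr("ai:metrics:misses")
    # Placeholder pricing: $3 per million input tokens, $15 per million output.
    cost = input_tokens * 3e-06 + output_tokens * 15e-06
    cache.incrbyfloat("ai:metrics:api_spend", cost)

def hit_rate():
    hits = int(cache.get("ai:metrics:hits") or 0)
    misses = int(cache.get("ai:metrics:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0

Every recorded hit also represents avoided spend, so daily savings can be estimated as the hit count multiplied by the average cost of a miss.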
Semantic Cache Architecture
A production semantic cache maintains two data structures: a vector index of query embeddings (stored in a vector database or an in-memory index like FAISS) and a key-value store of cached responses keyed by query ID. When a request arrives, the query is embedded and searched against the vector index. If a match is found within the similarity threshold, the corresponding response is retrieved from the key-value store. If no match is found, the full API call is made, and both the query embedding and response are stored.
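As a sketch of those two structures, assuming unit-normalized embeddings (so inner product equals cosine similarity) and an embedding dimension of 1536, a FAISS flat index plus a plain dictionary is enough to start with:

import faiss
import numpy as np

EMBEDDING_DIM = 1536  # depends on your embedding model
index = faiss.IndexFlatIP(EMBEDDING_DIM)  # vector index of query embeddings
responses = {}  # query ID (position in the index) -> cached response

def lookup(query_vec, threshold=0.95):
    if index.ntotal == 0:
        return None
    scores, ids = index.search(np.asarray([query_vec], dtype="float32"), 1)
    if scores[0][0] >= threshold:
        return responses[int(ids[0][0])]
    return None

def store(query_vec, response):
    responses[index.ntotal] = response  # ID assigned to the vector added next
    index.add(np.asarray([query_vec], dtype="float32"))

A flat index scans every vector on each search; once the cache grows past a few hundred thousand entries, an approximate index or a hosted vector database becomes the better fit.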
The similarity threshold is the most important tuning parameter. At 0.98, the cache only matches nearly identical queries, producing high precision but low hit rates. At 0.90, the cache matches loosely similar queries, producing high hit rates but risking incorrect responses. The optimal threshold depends on the application: factual Q&A systems can tolerate lower thresholds (0.92 to 0.95) because similar questions have similar answers, while personalized or context-dependent responses need higher thresholds (0.96 to 0.99) to avoid returning inappropriate cached results.
Cache warming pre-populates the cache with responses to known common queries before they arrive from users. If you have historical query logs, run the top 100 to 500 most frequent queries through the model and cache the results before a product launch, a traffic spike, or a new deployment. Cache warming eliminates the cold start problem where the first user to ask each question gets full latency while subsequent users get cached responses.
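A warming pass can simply reuse the get_cached_response wrapper from the exact-match example; the model name and query source below are placeholders:

def warm_cache(top_queries):
    # Run the most frequent historical queries through the normal caching
    # path so the first real users already hit a warm cache.
    for query in top_queries:
        get_cached_response({
            "model": "claude-sonnet-4-5",  # placeholder model name
            "max_tokens": 1024,
            "temperature": 0,              # deterministic, cache-friendly output
            "messages": [{"role": "user", "content": query}],
        })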
Adaptive Recall works alongside response caching by providing a persistent memory layer that reduces the tokens you send in the first place. Cache the responses you do make, and use memory to avoid making many of them at all.