How to Reduce Token Usage by 50% or More
Where Tokens Go
Before optimizing, measure. Add logging to your API calls that records the token count for each prompt component (system prompt, conversation history, retrieved context, tools) and the response token count. Run this for a day or a week of production traffic and calculate the averages. Most developers are surprised by the results. The system prompt they wrote in an afternoon is consuming 3,000 tokens per call. The RAG pipeline is retrieving 10 documents when 3 would suffice. The conversation history is growing without any management.
A typical unoptimized application distributes tokens roughly as follows: 20% system prompt, 10% tool definitions, 30% conversation history, 25% retrieved context, and 15% response. Each of these can be reduced significantly without changing what the model actually does.
Optimization Techniques by Impact
Add instrumentation that logs the token count for every API call, broken down by component. Use your provider's usage API or count tokens locally with the appropriate tokenizer library. Calculate the cost per conversation, per query, and per day. Set a baseline so you can measure the impact of each optimization.
```python
import tiktoken

def log_token_usage(messages, model="gpt-4o"):
    """Print and return a per-role token breakdown for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    breakdown = {}
    for msg in messages:
        role = msg["role"]
        tokens = len(enc.encode(msg["content"]))
        breakdown[role] = breakdown.get(role, 0) + tokens
    total = sum(breakdown.values())
    print(f"Total: {total} tokens")
    for role, count in breakdown.items():
        pct = (count / total) * 100
        print(f"  {role}: {count} ({pct:.0f}%)")
    return breakdown
```

Most system prompts are 2 to 5 times longer than necessary. Rewrite verbose natural-language instructions as concise directives. Remove explanations of why the model should behave a certain way and just tell it what to do. Replace multi-sentence descriptions with bullet points. Remove examples unless they are genuinely necessary for correct behavior. The sketch after the list below shows how to verify the savings from a rewrite.
Common reductions:
- "You are an AI assistant designed to help customers with questions about our product. Always be helpful, professional, and accurate." becomes "Customer support assistant for [Product]. Professional tone."
- Multi-paragraph guardrails can often be expressed as a short list: "Do not: discuss competitors, share pricing without verification, make promises about roadmap."
- Few-shot examples that show obvious behavior can be removed. Only keep examples that demonstrate subtle or non-obvious formatting requirements.
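To confirm a rewrite saves what you expect, count both versions with the same tokenizer used in the logging snippet above. A minimal sketch using the first bullet's before-and-after text; the prompt strings are illustrative, not from a real application:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Illustrative before/after pair from the first bullet above.
verbose = (
    "You are an AI assistant designed to help customers with questions "
    "about our product. Always be helpful, professional, and accurate."
)
compressed = "Customer support assistant for [Product]. Professional tone."

print(f"Before: {len(enc.encode(verbose))} tokens")
print(f"After:  {len(enc.encode(compressed))} tokens")
```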
RAG pipelines often retrieve 10 or more documents by default, but studies show that top-3 retrieval produces 90% of the quality improvement while using 70% fewer tokens. Reduce your retrieval count to 3 to 5 results and increase the relevance threshold. A single highly relevant document is more useful than five moderately relevant ones because the model spends attention budget on each retrieved document, and marginally relevant documents dilute focus.
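A minimal sketch of that filter, assuming your retriever returns (text, score) pairs where a higher score means more relevant; the top_k of 3 and the 0.75 threshold are illustrative values to tune against your own data:

```python
def select_context(results, top_k=3, min_score=0.75):
    """Return at most top_k document texts that clear the relevance threshold."""
    # Drop marginally relevant documents, then keep only the best few.
    relevant = [(text, score) for text, score in results if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:top_k]]
```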
Implement a sliding window that keeps the last 4 to 6 turns verbatim and summarizes everything older. Without a cap, a 30-turn conversation accumulates roughly 12,000 tokens of history. With a 6-turn window and summarization, the same conversation uses about 3,000 tokens. See the sliding window implementation guide for the full approach.
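A minimal sketch of the windowing step, assuming a summarize() helper (for example, a single cheap model call) that condenses older messages into a short note; treating each message as one turn keeps the example simple:

```python
def windowed_history(history, summarize, keep_turns=6):
    """Keep the last keep_turns messages verbatim and summarize everything older."""
    if len(history) <= keep_turns:
        return history
    older, recent = history[:-keep_turns], history[-keep_turns:]
    # summarize() is an assumed helper that returns a short text summary of `older`.
    note = {"role": "system",
            "content": f"Summary of earlier conversation: {summarize(older)}"}
    return [note] + recent
```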
If your provider supports prompt caching (Anthropic does), structure your prompt with static content first and dynamic content last. The static prefix is cached after the first call, reducing the cost of those tokens by up to 90% on subsequent calls. For an application with a 3,000-token system prompt making 10,000 calls per day, caching saves the equivalent of 27 million tokens per day.
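With Anthropic, for example, you mark the end of the static prefix with a cache_control breakpoint so everything up to that point is cached. A minimal sketch assuming the anthropic Python SDK; the model id, prompt text, and query are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "..."  # your long, unchanging instructions go here
user_query = "How do I reset my password?"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cached prefix ends here
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # dynamic content last
)
```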
Set max_tokens to a reasonable limit for your use case. A customer support answer rarely needs more than 500 tokens. A code snippet rarely needs more than 1,000. Add instructions like "Answer in 2-3 sentences" or "Be concise" for queries that do not require detailed explanations. Output tokens are typically more expensive than input tokens, so controlling response length has an outsized impact on cost.
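A minimal sketch of applying both controls, assuming an OpenAI-style chat completions call; the token limits and the needs_detail flag are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def answer(query, needs_detail=False):
    # Cap output length by use case and steer simple queries toward short answers.
    brevity = "" if needs_detail else " Answer in 2-3 sentences."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=1000 if needs_detail else 500,
        messages=[
            {"role": "system",
             "content": "Customer support assistant for [Product]." + brevity},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```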
Combined Impact
| Optimization | Typical Reduction | Effect on Total |
|---|---|---|
| Compress system prompt | 40-60% of prompt tokens | 10-15% total |
| Limit retrieval to top-3 | 60-70% of retrieval tokens | 15-20% total |
| Sliding window history | 50-75% of history tokens | 15-25% total |
| Prompt caching | 90% cost reduction on cached tokens | 10-15% cost |
| Response length limits | 30-50% of output tokens | 5-10% total |
Applied together, these optimizations typically achieve a 50 to 70% reduction in total token cost. For a high-volume application spending $10,000 per month on LLM API calls, that is $5,000 to $7,000 in monthly savings with no degradation in user-facing quality.
The Architectural Solution
The optimizations above are incremental improvements within the existing architecture. The structural solution is to move persistent knowledge out of the context window entirely. An external memory system stores all accumulated knowledge and retrieves only the specific pieces needed for each query. This reduces conversation history to near zero (the model recalls from memory instead of from history), reduces retrieved context to the minimum (targeted memory retrieval instead of broad document search), and makes the system prompt the only fixed cost per call.
Reduce token costs structurally. Adaptive Recall moves knowledge to external memory so every API call uses the minimum tokens necessary.
Get Started Free