
How to Reduce Token Usage by 50% or More

Most LLM applications waste 40 to 60% of their tokens on verbose prompts, excessive retrieved context, unbounded conversation history, and overly long responses. Fixing these issues is straightforward and does not require changing models or degrading quality. This guide walks through each optimization in order of impact, with concrete techniques that typically achieve a 50% or greater reduction in total token cost.

Where Tokens Go

Before optimizing, measure. Add logging to your API calls that records the token count for each prompt component (system prompt, conversation history, retrieved context, tools) and the response token count. Run this for a day or a week of production traffic and calculate the averages. Most developers are surprised by the results. The system prompt they wrote in an afternoon is consuming 3,000 tokens per call. The RAG pipeline is retrieving 10 documents when 3 would suffice. The conversation history is growing without any management.

A typical unoptimized application distributes tokens roughly as follows: 20% system prompt, 10% tool definitions, 30% conversation history, 25% retrieved context, and 15% response. Each of these can be reduced significantly without changing what the model actually does.

Optimization Techniques by Impact

Step 1: Audit your current token spend.
Add instrumentation that logs the token count for every API call, broken down by component. Use your provider's usage API or count tokens locally with the appropriate tokenizer library. Calculate the cost per conversation, per query, and per day. Set a baseline so you can measure the impact of each optimization.
import tiktoken

def log_token_usage(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    breakdown = {}
    # Count tokens per message role (system, user, assistant).
    for msg in messages:
        role = msg["role"]
        tokens = len(enc.encode(msg["content"]))
        breakdown[role] = breakdown.get(role, 0) + tokens
    total = sum(breakdown.values())
    print(f"Total: {total} tokens")
    for role, count in breakdown.items():
        pct = (count / total) * 100
        print(f"  {role}: {count} ({pct:.0f}%)")
    return breakdown
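A quick usage example, with a hypothetical two-message conversation, shows the per-role breakdown this produces:

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
]
log_token_usage(messages)  # prints the total and the share consumed by each role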
Step 2: Compress your system prompt.
Most system prompts are 2 to 5 times longer than necessary. Rewrite verbose natural language instructions as concise directives. Remove explanations of why the model should behave a certain way and just tell it what to do. Replace multi-sentence descriptions with bullet points. Remove examples unless they are genuinely necessary for correct behavior.

Common reductions: cutting rationale in favor of direct instructions, collapsing multi-sentence descriptions into bullets, and dropping examples that do not change behavior; together these typically shrink a system prompt by 40 to 60%.
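As a rough illustration of one such rewrite, the sketch below compares token counts for a verbose instruction and a compressed equivalent; both prompt strings are invented for the example.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Invented example prompts: the same policy written verbosely and as directives.
verbose = (
    "You are a helpful customer support assistant. It is very important that you "
    "always remain polite and professional, because our customers expect a high "
    "standard of service. When you are unsure about something, you should ask a "
    "clarifying question rather than guessing at the answer."
)
compressed = (
    "You are a customer support assistant.\n"
    "- Be polite and professional.\n"
    "- If unsure, ask a clarifying question instead of guessing."
)

print(len(enc.encode(verbose)), "->", len(enc.encode(compressed)), "tokens")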

Step 3: Limit retrieved context.
RAG pipelines often retrieve 10 or more documents by default, but studies show that top-3 retrieval produces 90% of the quality improvement while using 70% fewer tokens. Reduce your retrieval count to 3 to 5 results and increase the relevance threshold. A single highly relevant document is more useful than five moderately relevant ones because the model spends attention budget on each retrieved document, and marginally relevant documents dilute focus.
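One way to enforce this is a small retrieval wrapper that caps both the result count and the token budget. This is a sketch rather than any specific library's API: search stands in for your vector store's query call and is assumed to yield (score, text) pairs in descending relevance order, and the threshold and budget values are illustrative.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def retrieve_context(query, search, k=3, min_score=0.75, max_tokens=1500):
    # Keep at most k results that clear the relevance threshold and fit the budget.
    selected, used = [], 0
    for score, text in search(query):
        if len(selected) >= k or score < min_score:
            break
        cost = len(enc.encode(text))
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return "\n\n".join(selected)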
Step 4: Cap conversation history.
Implement a sliding window that keeps the last 4 to 6 turns verbatim and summarizes everything older. Without a cap, a 30-turn conversation accumulates roughly 12,000 tokens of history. With a 6-turn window and summarization, the same conversation uses about 3,000 tokens. See the sliding window implementation guide for the full approach.
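A minimal sketch of that window, assuming a summarize helper (any cheap summarization call) that condenses a list of older messages into a short paragraph:

def windowed_history(messages, summarize, keep_turns=6):
    # One turn is a user message plus an assistant message.
    keep = keep_turns * 2
    if len(messages) <= keep:
        return messages
    older, recent = messages[:-keep], messages[-keep:]
    summary = summarize(older)  # stand-in for a cheap LLM or extractive summarizer
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent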
Step 5: Enable prompt caching.
If your provider supports prompt caching (Anthropic does), structure your prompt with static content first and dynamic content last. The static prefix is cached after the first call, reducing the cost of those tokens by up to 90% on subsequent calls. For an application with a 3,000-token system prompt making 10,000 calls per day, caching saves the equivalent of 27 million tokens per day.
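With Anthropic's API, caching works by marking the static prefix with cache_control. The sketch below follows that pattern; the model name and prompt text are placeholders, so check the provider's documentation for current values and minimum cacheable sizes.

import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "..."  # your long, unchanging system prompt

def answer(user_query):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=500,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache the static prefix
        }],
        messages=[{"role": "user", "content": user_query}],  # dynamic content last
    )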
Step 6: Control response length.
Set max_tokens to a reasonable limit for your use case. A customer support answer rarely needs more than 500 tokens. A code snippet rarely needs more than 1,000. Add instructions like "Answer in 2-3 sentences" or "Be concise" for queries that do not require detailed explanations. Output tokens are typically more expensive than input tokens, so controlling response length has an outsized impact on cost.
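A minimal sketch using the OpenAI Python client; the 500-token cap and the concision instruction are illustrative values to tune per use case.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=500,  # hard cap suited to a support-style answer
    messages=[
        {"role": "system", "content": "Answer in 2-3 sentences unless asked for more detail."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)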

Combined Impact

Optimization                  Typical Reduction                       Cumulative Effect
Compress system prompt        40-60% of prompt tokens                 10-15% total
Limit retrieval to top-3      60-70% of retrieval tokens              15-20% total
Sliding window history        50-75% of history tokens                15-25% total
Prompt caching                90% cost reduction on cached tokens     10-15% cost
Response length limits        30-50% of output tokens                 5-10% total

Applied together, these optimizations typically achieve a 50 to 70% reduction in total token cost. For a high-volume application spending $10,000 per month on LLM API calls, that is $5,000 to $7,000 in monthly savings with no degradation in user-facing quality.

The Architectural Solution

The optimizations above are incremental improvements within the existing architecture. The structural solution is to move persistent knowledge out of the context window entirely. An external memory system stores all accumulated knowledge and retrieves only the specific pieces needed for each query. This reduces conversation history to near zero (the model recalls from memory instead of from history), reduces retrieved context to the minimum (targeted memory retrieval instead of broad document search), and makes the system prompt the only fixed cost per call.

Reduce token costs structurally. Adaptive Recall moves knowledge to external memory so every API call uses the minimum tokens necessary.
