How to Reduce Token Usage by 50% or More
Where Tokens Go
Before optimizing, measure. Add logging to your API calls that records the token count for each prompt component (system prompt, conversation history, retrieved context, tools) and the response token count. Run this for a day or a week of production traffic and calculate the averages. Most developers are surprised by the results. The system prompt they wrote in an afternoon is consuming 3,000 tokens per call. The RAG pipeline is retrieving 10 documents when 3 would suffice. The conversation history is growing without any management.
A typical unoptimized application distributes tokens roughly as follows: 20% system prompt, 10% tool definitions, 30% conversation history, 25% retrieved context, and 15% response. Each of these can be reduced significantly without changing what the model actually does.
Optimization Techniques by Impact
Add instrumentation that logs the token count for every API call, broken down by component. Use your provider's usage API or count tokens locally with the appropriate tokenizer library. Calculate the cost per conversation, per query, and per day. Set a baseline so you can measure the impact of each optimization.
```python
import tiktoken

def log_token_usage(messages, model="gpt-4o"):
    """Print and return a per-role token breakdown for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    breakdown = {}
    for msg in messages:
        role = msg["role"]
        tokens = len(enc.encode(msg["content"]))
        breakdown[role] = breakdown.get(role, 0) + tokens
    total = sum(breakdown.values())
    print(f"Total: {total} tokens")
    for role, count in breakdown.items():
        pct = (count / total) * 100
        print(f"  {role}: {count} ({pct:.0f}%)")
    return breakdown
```

Most system prompts are 2 to 5 times longer than necessary. Rewrite verbose natural-language instructions as concise directives. Remove explanations of why the model should behave a certain way and just tell it what to do. Replace multi-sentence descriptions with bullet points. Remove examples unless they are genuinely necessary for correct behavior. The sketch after the list below shows how to verify the savings from a rewrite.
Common reductions:
- "You are an AI assistant designed to help customers with questions about our product. Always be helpful, professional, and accurate." becomes "Customer support assistant for [Product]. Professional tone."
- Multi-paragraph guardrails can often be expressed as a short list: "Do not: discuss competitors, share pricing without verification, make promises about roadmap."
- Few-shot examples that show obvious behavior can be removed. Only keep examples that demonstrate subtle or non-obvious formatting requirements.
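To confirm a rewrite saves what you expect, count both versions with the same tokenizer used in the logging snippet above. A minimal sketch using the first bullet's before-and-after text; the prompt strings are illustrative, not from a real application:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Illustrative before/after pair from the first bullet above.
verbose = (
    "You are an AI assistant designed to help customers with questions "
    "about our product. Always be helpful, professional, and accurate."
)
compressed = "Customer support assistant for [Product]. Professional tone."

print(f"Before: {len(enc.encode(verbose))} tokens")
print(f"After:  {len(enc.encode(compressed))} tokens")
```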
RAG pipelines often retrieve 10 or more documents by default, but studies show that top-3 retrieval produces 90% of the quality improvement while using 70% fewer tokens. Reduce your retrieval count to 3 to 5 results and increase the relevance threshold. A single highly relevant document is more useful than five moderately relevant ones because the model spends attention budget on each retrieved document, and marginally relevant documents dilute focus.
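A minimal sketch of that filter, assuming your retriever returns (text, score) pairs where a higher score means more relevant; the top_k of 3 and the 0.75 threshold are illustrative values to tune against your own data:

```python
def select_context(results, top_k=3, min_score=0.75):
    """Return at most top_k document texts that clear the relevance threshold."""
    # Drop marginally relevant documents, then keep only the best few.
    relevant = [(text, score) for text, score in results if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:top_k]]
```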
Implement a sliding window that keeps the last 4 to 6 turns verbatim and summarizes everything older. Without a cap, a 30-turn conversation accumulates roughly 12,000 tokens of history. With a 6-turn window and summarization, the same conversation uses about 3,000 tokens. See the sliding window implementation guide for the full approach.
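A minimal sketch of the windowing step, assuming a summarize() helper (for example, a single cheap model call) that condenses older messages into a short note; treating each message as one turn keeps the example simple:

```python
def windowed_history(history, summarize, keep_turns=6):
    """Keep the last keep_turns messages verbatim and summarize everything older."""
    if len(history) <= keep_turns:
        return history
    older, recent = history[:-keep_turns], history[-keep_turns:]
    # summarize() is an assumed helper that returns a short text summary of `older`.
    note = {"role": "system",
            "content": f"Summary of earlier conversation: {summarize(older)}"}
    return [note] + recent
```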
If your provider supports prompt caching (Anthropic does), structure your prompt with static content first and dynamic content last. The static prefix is cached after the first call, reducing the cost of those tokens by up to 90% on subsequent calls. For an application with a 3,000-token system prompt making 10,000 calls per day, caching saves the equivalent of 27 million tokens per day.
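With Anthropic, for example, you mark the end of the static prefix with a cache_control breakpoint so everything up to that point is cached. A minimal sketch assuming the anthropic Python SDK; the model id, prompt text, and query are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "..."  # your long, unchanging instructions go here
user_query = "How do I reset my password?"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cached prefix ends here
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # dynamic content last
)
```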
Set max_tokens to a reasonable limit for your use case. A customer support answer rarely needs more than 500 tokens. A code snippet rarely needs more than 1,000. Add instructions like "Answer in 2-3 sentences" or "Be concise" for queries that do not require detailed explanations. Output tokens are typically more expensive than input tokens, so controlling response length has an outsized impact on cost.
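A minimal sketch of applying both controls, assuming an OpenAI-style chat completions call; the token limits and the needs_detail flag are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def answer(query, needs_detail=False):
    # Cap output length by use case and steer simple queries toward short answers.
    brevity = "" if needs_detail else " Answer in 2-3 sentences."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=1000 if needs_detail else 500,
        messages=[
            {"role": "system",
             "content": "Customer support assistant for [Product]." + brevity},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```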
Combined Impact
| Optimization | Typical Reduction | Effect on Total |
|---|---|---|
| Compress system prompt | 40-60% of prompt tokens | 10-15% total |
| Limit retrieval to top-3 | 60-70% of retrieval tokens | 15-20% total |
| Sliding window history | 50-75% of history tokens | 15-25% total |
| Prompt caching | 90% cost reduction on cached tokens | 10-15% cost |
| Response length limits | 30-50% of output tokens | 5-10% total |
Applied together, these optimizations typically achieve a 50 to 70% reduction in total token cost. For a high-volume application spending $10,000 per month on LLM API calls, that is $5,000 to $7,000 in monthly savings with no degradation in user-facing quality.
The Architectural Solution
The optimizations above are incremental improvements within the existing architecture. The structural solution is to move persistent knowledge out of the context window entirely. An external memory system stores all accumulated knowledge and retrieves only the specific pieces needed for each query. This reduces conversation history to near zero (the model recalls from memory instead of from history), reduces retrieved context to the minimum (targeted memory retrieval instead of broad document search), and makes the system prompt the only fixed cost per call.
Reduce token costs structurally. Adaptive Recall moves knowledge to external memory so every API call uses the minimum tokens necessary.
Get Started Free