
How to Audit and Reduce Your AI API Spending

An AI cost audit examines every API call your application makes, identifies where tokens are wasted on redundant processing, and produces a prioritized list of optimizations with estimated savings. Most teams that complete their first audit discover they can cut spending by 40 to 70 percent, without degrading output quality, by eliminating uncached repetition, oversized context, and misrouted requests.

Before You Start

You need access to your AI provider's usage dashboard (Anthropic Console, OpenAI Usage, or equivalent) and your application logs. If your application does not currently log token counts per request, that is the first thing you will fix in Step 1. You also need the ability to view or estimate the cost per request by model tier. Have your last three months of API invoices available for trend analysis.

The audit works best as a focused effort with both a developer who understands the application architecture and someone who can connect costs to business outcomes (product manager, engineering lead, or finance partner). The developer identifies where tokens go. The business partner determines which costs are justified and which are waste.

Step-by-Step Audit Process

Step 1: Instrument per-request tracking.
Add middleware or logging that captures the following for every API call: the model used, input token count, output token count, cached token count (if applicable), calculated cost, request latency, the feature or workflow that initiated the call, and the user or tenant responsible. If you use an AI gateway like LiteLLM, Portkey, or Helicone, most of this data is captured automatically. If you call provider APIs directly, add structured logging around your API client. Store at least 7 days of data before proceeding to analysis.
# Python example: logging wrapper for Anthropic API calls
import anthropic
import json
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_cost_audit")
client = anthropic.Anthropic()

# Illustrative per-million-token rates; replace with your provider's current price list.
PRICING = {
    "default": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
}

def calculate_cost(model, input_tokens, output_tokens, cache_read_tokens=0):
    """Rough USD estimate for one request, using the placeholder rates above."""
    rates = PRICING.get(model, PRICING["default"])
    return (
        input_tokens * rates["input"]
        + cache_read_tokens * rates["cache_read"]
        + output_tokens * rates["output"]
    ) / 1_000_000

def tracked_completion(feature, user_id, **kwargs):
    """Call the Messages API and emit one structured cost log line per request."""
    start = time.time()
    response = client.messages.create(**kwargs)
    elapsed = time.time() - start
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "user_id": user_id,
        "model": kwargs.get("model"),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0) or 0,
        "cache_creation_tokens": getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
        "latency_ms": int(elapsed * 1000),
        "cost_usd": calculate_cost(
            kwargs.get("model"),
            response.usage.input_tokens,
            response.usage.output_tokens,
            getattr(response.usage, "cache_read_input_tokens", 0) or 0,
        ),
    }
    logger.info(json.dumps(log_entry))
    return response
Step 2: Build a token breakdown report.
After collecting at least 7 days of data, analyze where input tokens go for your highest-volume features. For each feature, measure the average token count of each input component: system prompt tokens, conversation history tokens (for multi-turn features), RAG retrieval tokens, tool definition tokens, and actual user message tokens. Most teams discover that the user's actual message represents less than 5 percent of total input tokens, with system prompts, history, and retrieved context consuming the other 95 percent.
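The sketch below shows one way to build that breakdown from the Step 1 logs. It assumes each log line also carries per-component token estimates; the field names (system_prompt_tokens, history_tokens, rag_tokens, tool_def_tokens, user_message_tokens) are hypothetical, so adapt them to however you instrument your prompts.
# Sketch: per-feature input token breakdown from Step 1 logs (hypothetical component fields)
import json
from collections import defaultdict

COMPONENTS = ["system_prompt_tokens", "history_tokens", "rag_tokens",
              "tool_def_tokens", "user_message_tokens"]

def token_breakdown(log_path):
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            feature = entry["feature"]
            counts[feature] += 1
            for component in COMPONENTS:
                totals[feature][component] += entry.get(component, 0)
    for feature, components in totals.items():
        total = sum(components.values()) or 1
        print(f"\n{feature} ({counts[feature]} requests)")
        for component in COMPONENTS:
            avg = components[component] / counts[feature]
            share = 100 * components[component] / total
            print(f"  {component:24s} avg {avg:8.0f} tokens  {share:5.1f}%")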
Step 3: Identify redundant processing.
Scan your logs for patterns of waste. Common findings include: the same system prompt (often 1,500 to 3,000 tokens) sent with every request without prompt caching enabled, full conversation histories resent every turn instead of summarized, identical RAG queries retrieving the same chunks for similar user questions, tool definitions included in every request even when 80 percent of requests use only 1 or 2 tools, and the same classification or extraction task run on identical inputs. Calculate the token cost of each redundancy by multiplying the wasted tokens by the request volume. Rank the redundancies by total wasted cost.
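Once the redundancies are identified, the cost math is simple multiplication. The token counts, request volumes, and input rate below are illustrative placeholders; substitute the figures from your own logs and your provider's price list.
# Sketch: rank redundancies by total monthly wasted cost (all figures illustrative)
INPUT_PRICE_PER_MTOK = 3.00  # placeholder rate; use your provider's pricing

redundancies = [
    {"name": "uncached system prompt", "wasted_tokens": 2_000, "requests_per_month": 500_000},
    {"name": "full history resent each turn", "wasted_tokens": 4_000, "requests_per_month": 120_000},
    {"name": "unused tool definitions", "wasted_tokens": 800, "requests_per_month": 500_000},
]

for r in redundancies:
    r["monthly_waste_usd"] = (r["wasted_tokens"] * r["requests_per_month"]
                              / 1_000_000 * INPUT_PRICE_PER_MTOK)

for r in sorted(redundancies, key=lambda r: r["monthly_waste_usd"], reverse=True):
    print(f'{r["name"]:32s} ${r["monthly_waste_usd"]:>10,.0f} / month')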
Step 4: Measure cost per business outcome.
Connect API costs to the value they generate. For a support bot, calculate cost per resolved ticket. For a content generator, calculate cost per published piece. For a search system, calculate cost per query. This metric reveals which features are cost-effective and which are not. A feature costing $0.15 per interaction that replaces a $12 human interaction is a clear win. A feature costing $2.00 per interaction that users abandon 60 percent of the time needs optimization or elimination.
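A minimal sketch of the calculation for a support bot; the monthly spend, resolved-ticket count, and human baseline are illustrative and come from your own usage dashboard and ticketing system.
# Sketch: cost per business outcome (illustrative numbers)
def cost_per_outcome(total_api_cost_usd, outcomes):
    return total_api_cost_usd / outcomes if outcomes else float("inf")

monthly_api_cost = 4_500.00    # from your provider's usage dashboard
resolved_tickets = 30_000      # from your ticketing system
human_cost_per_ticket = 12.00  # your own baseline for a human-handled ticket

per_ticket = cost_per_outcome(monthly_api_cost, resolved_tickets)
print(f"AI cost per resolved ticket: ${per_ticket:.2f} "
      f"(vs ${human_cost_per_ticket:.2f} human baseline)")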
Step 5: Prioritize and implement reductions.
Create a ranked list of optimization opportunities ordered by (estimated annual savings) divided by (estimated implementation effort in engineer-days). Common high-impact, low-effort optimizations include: enabling prompt caching (typically 1 to 2 hours of work for 15 to 30 percent savings on input costs), adding response caching for repeated queries (1 to 2 days for 10 to 40 percent savings depending on repetition rate), reducing conversation history by switching to memory summaries (2 to 5 days for 20 to 40 percent savings on multi-turn features), and routing simple requests to smaller models (3 to 5 days for 20 to 50 percent savings depending on the proportion of simple requests).
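The ranking itself is a one-line score. The savings and effort figures below are illustrative; fill them in from your own audit findings.
# Sketch: rank optimizations by annual savings per engineer-day (illustrative estimates)
opportunities = [
    {"name": "enable prompt caching", "annual_savings": 38_000, "effort_days": 0.5},
    {"name": "response caching for repeated queries", "annual_savings": 25_000, "effort_days": 2},
    {"name": "memory summaries for conversation history", "annual_savings": 52_000, "effort_days": 4},
    {"name": "route simple requests to smaller models", "annual_savings": 60_000, "effort_days": 5},
]

for o in opportunities:
    o["score"] = o["annual_savings"] / o["effort_days"]

for o in sorted(opportunities, key=lambda o: o["score"], reverse=True):
    print(f'{o["name"]:44s} ${o["annual_savings"]:>8,}  {o["effort_days"]:>4} days  '
          f'score {o["score"]:>9,.0f}')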
Step 6: Set up ongoing monitoring.
After implementing optimizations, configure automated monitoring that tracks: daily total cost by feature and model, cost per business outcome (updated daily), cache hit rates (prompt cache and response cache), model routing distribution, and anomalies (daily cost exceeding 150 percent of 7-day average, individual requests exceeding token thresholds). Set up weekly summary reports that show cost trends and monthly detailed reports for audit reviews. The monitoring infrastructure ensures that optimizations are maintained and that new features do not introduce cost regressions.
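The anomaly rule is straightforward once daily totals are aggregated from the Step 1 logs. A minimal sketch, assuming daily_costs is a list of daily USD totals with the most recent day last:
# Sketch: flag days where spend exceeds 150% of the trailing 7-day average
def check_cost_anomaly(daily_costs, threshold=1.5):
    """daily_costs: daily totals in USD, oldest first, most recent day last."""
    if len(daily_costs) < 8:
        return None  # not enough history for a 7-day baseline
    trailing = daily_costs[-8:-1]
    baseline = sum(trailing) / len(trailing)
    today = daily_costs[-1]
    if baseline > 0 and today > threshold * baseline:
        return (f"ALERT: today's spend ${today:,.2f} is "
                f"{today / baseline:.1f}x the 7-day average of ${baseline:,.2f}")
    return None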

Common Audit Findings

Across dozens of AI cost audits, certain patterns appear consistently. System prompts that were written during prototyping and never trimmed for production represent 20 to 40 percent of input tokens in many applications. These prompts typically contain extensive examples, edge case instructions, and formatting guidelines that could be condensed by 50 percent or moved to few-shot examples that are only included when relevant.
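One low-risk way to do that is to keep a short core system prompt and attach a worked example only when the request type calls for it. A sketch with a hypothetical example registry and request-type names; the prompt text and examples are placeholders:
# Sketch: include few-shot examples only for the request types that need them
CORE_SYSTEM_PROMPT = "You are a support assistant. Answer concisely and cite policy."

EXAMPLES_BY_TYPE = {
    "refund": "User: I was charged twice.\nAssistant: I can help with that refund...",
    "shipping": "User: Where is my order?\nAssistant: Let me check the tracking...",
}

def build_system_prompt(request_type):
    """Return the core prompt, plus one worked example if this request type has one."""
    example = EXAMPLES_BY_TYPE.get(request_type)
    if example:
        return f"{CORE_SYSTEM_PROMPT}\n\nExample:\n{example}"
    return CORE_SYSTEM_PROMPT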

Conversation history accumulation is the most expensive component in multi-turn applications. By turn 10, conversation history often exceeds 5,000 tokens, and the oldest messages (which are least relevant to the current turn) consume the most space. Switching to a memory system that stores key facts and decisions from the conversation, then recalls only what is relevant to the current turn, typically reduces history tokens by 80 to 90 percent.
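A minimal sketch of the summarization half of that approach, using the Anthropic Messages API with an illustrative small model to compress older turns. A full memory system would persist and retrieve facts rather than re-summarizing on every turn, so treat this as a starting point.
# Sketch: replace older turns with a compact summary (model name and prompt are illustrative)
import anthropic

client = anthropic.Anthropic()

def compact_history(messages, keep_recent=4, summary_model="claude-3-5-haiku-latest"):
    """Summarize all but the most recent turns into a short context message.
    Choose keep_recent so the kept window starts with a user turn, preserving
    the user/assistant alternation the API expects."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in older)
    summary = client.messages.create(
        model=summary_model,
        max_tokens=300,
        messages=[{"role": "user",
                   "content": "Summarize the key facts and decisions in this "
                              "conversation in under 150 words:\n\n" + transcript}],
    ).content[0].text
    return ([{"role": "user", "content": f"Conversation summary so far: {summary}"},
             {"role": "assistant", "content": "Understood."}] + recent)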

RAG over-retrieval is common in applications that were tuned for recall (returning relevant results) without considering precision (not returning irrelevant results). Retrieving 5 chunks at 500 tokens each adds 2,500 tokens per request. When only 1 or 2 of those chunks are actually relevant, the other 3 are pure waste. Improving retrieval precision through better chunking, reranking, or memory-augmented search reduces RAG token usage while improving response quality.
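A precision improvement that needs no new infrastructure is to filter retrieved chunks by relevance score and cap the total context budget before building the prompt. A sketch, assuming your retriever returns (chunk_text, similarity_score) pairs; the threshold and budget values are illustrative.
# Sketch: keep only high-relevance chunks and cap total context size
def select_context(scored_chunks, min_score=0.75, max_chunks=2, max_tokens=1200):
    """Return a context string built from the highest-scoring chunks only."""
    estimate_tokens = lambda text: len(text) // 4  # rough chars-per-token heuristic
    kept, budget = [], max_tokens
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        if score < min_score or len(kept) >= max_chunks:
            break
        cost = estimate_tokens(text)
        if cost > budget:
            break
        kept.append(text)
        budget -= cost
    return "\n\n".join(kept)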

Quick win: Check whether prompt caching is enabled for your provider. Anthropic's prompt caching reduces cached input token costs by 90 percent, and enabling it requires minimal code changes. If your system prompt exceeds 1,024 tokens, this single change typically saves 15 to 25 percent of total input costs.
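For the Anthropic API, caching is enabled by adding a cache_control block to the large, stable prefix of the request, typically the system prompt. A minimal sketch; the model name and prompt text are illustrative.
# Sketch: enabling Anthropic prompt caching on a large system prompt
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = "..."  # your existing 1,500-3,000 token system prompt

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # marks this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# cache_creation_input_tokens / cache_read_input_tokens in usage confirm the cache is working
print(response.usage)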

Replace expensive context with efficient memory recall. Adaptive Recall stores conversation knowledge, domain facts, and user context in persistent memory, cutting the tokens you send per request while improving response relevance.
