How Much Does It Cost to Run an AI Assistant?
Cost Breakdown by Component
Model API costs are the largest expense for most AI assistants. These costs scale with two factors: the number of tokens processed (both input and output) and the model's per-token price. Input tokens (system prompt, conversation history, retrieved context, tool definitions) are typically cheaper than output tokens (the model's response), but input token count is usually much higher because context windows accumulate information across turns. A conversation with a 4,000-token system prompt, 2,000 tokens of conversation history, 1,500 tokens of retrieved memories, and 1,000 tokens of tool definitions sends 8,500 input tokens per turn, even before the user's message.
At Claude Sonnet's pricing, those 8,500 input tokens plus a 500-token response cost about $0.03 per turn. GPT-4o pricing is similar. Smaller models (Claude Haiku, GPT-4o-mini) cost roughly one-fifth as much for the same token volume. A 10-turn conversation at full model pricing costs $0.30 to $0.50; with model routing that sends 7 simple turns to the cheap model and 3 complex turns to the expensive model, the same conversation costs $0.12 to $0.20.
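The per-turn arithmetic above can be sketched as a small helper. The prices are the illustrative per-million-token figures used in this article, not a live price sheet:

```python
# Per-turn cost estimate. Prices are illustrative (USD per million
# tokens), roughly matching Claude Sonnet-class pricing in this article.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

def turn_cost(input_tokens: int, output_tokens: int,
              in_price: float = SONNET_INPUT_PER_M,
              out_price: float = SONNET_OUTPUT_PER_M) -> float:
    """Dollar cost of one model call."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# 4,000 system prompt + 2,000 history + 1,500 memories + 1,000 tools
# = 8,500 input tokens, plus a 500-token response:
print(round(turn_cost(8_500, 500), 4))  # ≈ 0.033
```

Run the same function with a cheaper model's prices to compare routing scenarios turn by turn.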
Tool execution costs depend on what tools your assistant uses. Database queries, API calls, and web searches have their own pricing (or infrastructure costs if self-hosted). Memory service costs scale with storage volume and request frequency. Adaptive Recall's free tier handles 500 memories with 30 requests per minute, which is sufficient for development and light production use. Paid tiers scale with the number of stored memories and API calls.
Infrastructure costs include the server or serverless compute that runs your application logic, the database for conversation history, any caching layer (Redis for session state and response caching), and monitoring tools. For a moderately trafficked assistant, these costs run $50 to $200 per month on cloud infrastructure, significantly less than the model API costs.
Cost Optimization Strategies
Model routing is the single most effective cost optimization. Classify incoming requests by complexity and route simple ones to cheaper models. A question like "what is our return policy" does not need GPT-4o; GPT-4o-mini handles it equally well at one-fifth the cost. Reserve the expensive model for multi-step reasoning, complex tool orchestration, and nuanced conversations.
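A minimal routing sketch, assuming a cheap heuristic classifier; the model names, keyword signals, and thresholds are illustrative placeholders, and production systems often use a small classifier model instead:

```python
# Heuristic model router: send short, simple requests to the cheap model.
# Model names and thresholds are illustrative assumptions.
CHEAP_MODEL = "claude-haiku"
EXPENSIVE_MODEL = "claude-sonnet"

COMPLEX_SIGNALS = ("step by step", "compare", "analyze", "debug", "refactor")

def route(message: str, tools_requested: int = 0) -> str:
    text = message.lower()
    if tools_requested > 1:                 # multi-step tool orchestration
        return EXPENSIVE_MODEL
    if len(text.split()) > 60:              # long prompts tend to be nuanced
        return EXPENSIVE_MODEL
    if any(signal in text for signal in COMPLEX_SIGNALS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

print(route("what is our return policy"))      # claude-haiku
print(route("debug this race condition", 2))   # claude-sonnet
```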
Context pruning reduces the token count sent with each request. Summarize old conversation turns instead of including the full history. Filter tool results to include only relevant fields. Remove memories from context that were retrieved but not referenced. Because the accumulated context is re-sent with every request, each token pruned saves money on every subsequent turn of the conversation.
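One way to sketch this pruning, with `summarize()` as a placeholder for what would in practice be a cheap summarization model call:

```python
# Context-pruning sketch: keep recent turns verbatim, summarize older
# ones, and drop retrieved memories that were never referenced.
def summarize(turns):
    # Placeholder; in practice this would call a cheap model.
    return "Summary of %d earlier turns." % len(turns)

def prune_context(history, memories, referenced_ids, keep_recent=4):
    old, recent = history[:-keep_recent], history[-keep_recent:]
    pruned = []
    if old:
        pruned.append({"role": "system", "content": summarize(old)})
    pruned.extend(recent)
    kept_memories = [m for m in memories if m["id"] in referenced_ids]
    return pruned, kept_memories
```

On a 6-turn history with `keep_recent=4`, this sends one summary line plus the last 4 turns instead of all 6, and only the memories the assistant actually used.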
Persistent memory actually reduces costs over time. Instead of re-retrieving and re-processing documents, the assistant recalls previously extracted knowledge, which is already concise and relevant. Instead of asking clarifying questions that add conversation turns (and therefore model calls), the assistant retrieves the answer from memory. Over time, the memory investment pays for itself through reduced token consumption and fewer conversation turns per task.
Response caching avoids redundant model calls for common queries. If 50 users ask the same product question today, caching the response after the first query saves 49 model calls. Cache TTLs should match how frequently the underlying information changes.
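A minimal TTL-cache sketch; an in-process dict stands in for what would typically be Redis in production, and the one-hour TTL is an illustrative assumption:

```python
import hashlib
import time

# Response cache keyed on the normalized query. The TTL should match
# how often the underlying information changes.
CACHE: dict = {}
TTL_SECONDS = 3600  # illustrative: product info refreshed hourly

def cache_key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def cached_answer(query, call_model):
    key = cache_key(query)
    hit = CACHE.get(key)
    if hit and time.time() - hit["at"] < TTL_SECONDS:
        return hit["answer"]            # cache hit: no model call spent
    answer = call_model(query)          # the one call later users share
    CACHE[key] = {"answer": answer, "at": time.time()}
    return answer
```

If 50 users ask the same question within the TTL window, `call_model` runs once and the other 49 requests are served from the cache.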
A Worked Example
Consider a developer assistant serving a team of 50 engineers with an average of 8 conversations per person per day, 5 turns per conversation. That is 400 conversations and 2,000 model calls per day (assuming 1 model call per turn with no tool use). With a system prompt of 3,000 tokens, average conversation history of 2,000 tokens, 1,500 tokens of retrieved memories, and 800 tokens of tool definitions, each request sends roughly 7,500 input tokens including the user's message. Average response length is 400 output tokens.
At Claude Sonnet pricing ($3 per million input tokens, $15 per million output tokens), the daily cost for model API calls is: 2,000 requests at 7,500 input tokens equals 15 million input tokens ($45), plus 2,000 requests at 400 output tokens equals 800,000 output tokens ($12), totaling $57 per day or about $1,710 per month. That is the unoptimized baseline.
Now apply optimizations. Model routing sends 60% of requests (simple questions, status checks, greetings) to Claude Haiku at one-tenth the cost, keeping 40% on Sonnet for complex work. Cost drops to roughly $790 per month. Prompt caching reduces the input cost for the repeated 3,000-token system prompt by 90%, saving roughly another $120 per month. Context pruning keeps conversation history lean, reducing average input tokens from 7,500 to 6,000 on later turns, saving roughly $80. Response caching eliminates 8% of model calls for common questions, saving another $50. Total optimized cost: approximately $540 per month, a 68% reduction from the unoptimized baseline. That works out to about $0.50 per engineer per workday.
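The arithmetic can be checked directly from the stated assumptions (50 engineers × 8 conversations × 5 turns = 2,000 calls per day; the savings figures are the rough estimates from this section, not vendor quotes):

```python
# Checking the worked example. Prices and savings are the illustrative
# assumptions used in this section.
CALLS_PER_DAY = 50 * 8 * 5              # engineers x conversations x turns
IN_TOKENS, OUT_TOKENS = 7_500, 400
IN_PRICE, OUT_PRICE = 3.00, 15.00       # dollars per million tokens (Sonnet)

daily = (CALLS_PER_DAY * IN_TOKENS / 1e6 * IN_PRICE
         + CALLS_PER_DAY * OUT_TOKENS / 1e6 * OUT_PRICE)
baseline_monthly = daily * 30           # $57/day -> $1,710/month

# Model routing: 60% of calls move to a model at one-tenth the price.
routed = baseline_monthly * (0.4 + 0.6 * 0.1)

# Subtract the rough monthly savings estimated for prompt caching,
# context pruning, and response caching.
optimized = routed - 120 - 80 - 50
print(round(baseline_monthly), round(routed), round(optimized))  # ≈ 1710, 790, 540
```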
Reduce assistant costs with memory that eliminates redundant processing. Adaptive Recall stores knowledge once and retrieves it efficiently across sessions, cutting token usage and conversation turns.
Get Started Free