
Which LLM API Is Cheapest in 2026

Google's Gemini Flash offers the lowest per-token pricing at $0.075 per million input tokens, but the cheapest per-token API is not always the cheapest per outcome. Anthropic's 90 percent prompt caching discount makes Claude the most cost-effective option for applications with large, stable system prompts. The cheapest approach overall is model routing, sending each request to the least expensive model that handles it well, combined with caching and persistent memory to reduce the tokens sent per request.

Per-Token Pricing Comparison

At the economy tier, Google Gemini Flash leads on raw per-token cost at $0.075 per million input tokens and $0.30 per million output tokens. OpenAI's GPT-4o-mini comes in at $0.15 per million input and $0.60 per million output. Anthropic's Claude Haiku costs $0.80 per million input and $4.00 per million output. On pure per-token pricing, Google and OpenAI's economy models are significantly cheaper than Anthropic's Haiku.

At the mid-tier (where most production workloads run), the gap narrows. Google Gemini Pro costs $1.25 per million input. OpenAI GPT-4o costs $2.50 per million input. Anthropic Claude Sonnet costs $3.00 per million input. Google maintains a price advantage, but the differences are smaller as a percentage of total costs when output tokens, caching, and feature completeness are factored in.

At the premium tier, pricing converges across providers. Claude Opus costs $15.00 per million input and $75.00 per million output. OpenAI's o1 costs $15.00 per million input and $60.00 per million output, with additional costs for thinking tokens in reasoning-heavy tasks. Google's Gemini Ultra costs $5.00 per million input and $15.00 per million output. Google's premium tier is notably cheaper per token, but the quality gap on complex reasoning tasks means that the cheaper model may require more retries or human oversight, increasing the effective cost per successful completion.
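
To make these comparisons concrete, the Python sketch below wraps the per-million-token prices quoted above in a small cost calculator. The model labels are shorthand for this article, not official API model identifiers, and the rates are the standard prices listed here, before any caching or batch discounts.

```python
# Per-million-token prices quoted in this article (USD). Labels are
# shorthand for the comparison, not official API model identifiers.
PRICES = {
    "gemini-flash":  {"in": 0.075, "out": 0.30},
    "gpt-4o-mini":   {"in": 0.15,  "out": 0.60},
    "claude-haiku":  {"in": 0.80,  "out": 4.00},
    "gemini-pro":    {"in": 1.25,  "out": 5.00},
    "gpt-4o":        {"in": 2.50,  "out": 10.00},
    "claude-sonnet": {"in": 3.00,  "out": 15.00},
    "gemini-ultra":  {"in": 5.00,  "out": 15.00},
    "o1":            {"in": 15.00, "out": 60.00},
    "claude-opus":   {"in": 15.00, "out": 75.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at standard rates (no caching, no batch)."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example: a request with 1,500 input tokens and 500 output tokens.
for model in ("gemini-flash", "gpt-4o-mini", "claude-haiku"):
    print(f"{model}: ${request_cost(model, 1_500, 500):.6f}")
```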

Why Per-Token Pricing Is Misleading

Per-token pricing assumes all tokens are created equal, but they are not. Two applications making the same number of API calls with the same token volume can have dramatically different effective costs depending on their traffic patterns and the optimization features they can use.

An application with a 2,000-token system prompt making 1 million daily requests with sustained traffic can use Anthropic's prompt caching to process those 2,000 tokens at $0.30 per million instead of $3.00. That is 2 billion prompt tokens per day, so the effective system prompt cost is $600 per day instead of $6,000, a savings that may offset the higher base price compared to Google or OpenAI. An application with bursty traffic that does not keep the cache warm gets no benefit from prompt caching, making the higher base price a net disadvantage.
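
The arithmetic behind that example as a runnable sketch, ignoring Anthropic's small surcharge for cache writes:

```python
# Daily system-prompt cost with and without prompt caching, using the
# figures above: a 2,000-token prompt, 1M requests/day, $3.00/M standard
# input vs $0.30/M for cached reads. Anthropic's surcharge for cache
# writes is ignored here for simplicity.
prompt_tokens = 2_000
requests_per_day = 1_000_000
daily_tokens = prompt_tokens * requests_per_day    # 2 billion tokens/day

uncached = daily_tokens / 1_000_000 * 3.00         # $6,000/day
cached = daily_tokens / 1_000_000 * 0.30           # $600/day
print(f"uncached: ${uncached:,.0f}/day, cached: ${cached:,.0f}/day")
```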

Quality differences between models also affect effective cost. If Google's economy model handles 90 percent of a task correctly and Anthropic's handles 97 percent, the Google model generates more than three times as many failures (a 10 percent failure rate versus 3 percent). Each failure that requires a retry or human intervention has a cost, and factoring in failure costs can reverse the per-token price advantage. The cheapest model per token is not the cheapest model per successful outcome.
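
A simple expected-cost model makes the point concrete. Only the 90 and 97 percent success rates come from the example above; the per-call prices and the $0.05 failure-handling cost are hypothetical assumptions.

```python
def cost_per_success(call_cost: float, success_rate: float,
                     failure_cost: float) -> float:
    """Expected cost per successful outcome when each failed attempt
    incurs `failure_cost` (retry tokens, human review, cleanup) and the
    task is retried until it succeeds."""
    expected_calls = 1 / success_rate                      # geometric retries
    expected_failures = (1 - success_rate) / success_rate  # failures per success
    return call_cost * expected_calls + failure_cost * expected_failures

# Hypothetical: $0.001/call at 90% success vs $0.003/call at 97%,
# with $0.05 of failure-handling cost per failed attempt.
print(f"{cost_per_success(0.001, 0.90, 0.05):.4f}")  # ~$0.0067 per success
print(f"{cost_per_success(0.003, 0.97, 0.05):.4f}")  # ~$0.0046 per success
```

With those assumptions, the model that costs three times as much per call is still the cheaper one per successful outcome, which is the reversal the paragraph above describes.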

Output-to-input price ratios also shift the comparison. Anthropic charges 5x the input rate for output on Sonnet ($15 vs $3). OpenAI charges 4x on GPT-4o ($10 vs $2.50). Google charges 4x on Gemini Pro ($5 vs $1.25). For applications that generate long outputs (content creation, code generation, detailed analysis), the output multiplier matters more than the input rate. An application where output tokens equal input tokens pays a blended rate of roughly two and a half to three times the headline input rate, and the provider with the lowest output rate wins in that scenario even if its input rate is slightly higher.
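
A small blended-rate helper shows the effect, using the rates quoted above:

```python
def blended_rate(input_rate: float, output_rate: float,
                 output_share: float = 0.5) -> float:
    """Effective $/M-token rate when `output_share` of all tokens are output.
    output_share=0.5 means output volume equals input volume."""
    return input_rate * (1 - output_share) + output_rate * output_share

# Rates quoted above (USD per million tokens), equal input and output volume.
print(blended_rate(3.00, 15.00))  # Claude Sonnet: 9.00, 3.0x the input rate
print(blended_rate(2.50, 10.00))  # GPT-4o:        6.25, 2.5x the input rate
print(blended_rate(1.25, 5.00))   # Gemini Pro:    3.125, 2.5x the input rate
```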

The Batch Pricing Wildcard

Both Anthropic and OpenAI offer batch processing at 50 percent of standard pricing for workloads that tolerate asynchronous processing. Anthropic's Message Batches API processes requests within 24 hours (typically 1 to 4 hours) at half price. OpenAI's Batch API offers similar terms. This effectively cuts the per-token rate for eligible workloads in half, which can change the provider ranking entirely for batch-heavy applications.

A team that processes 60 percent of their workload through batch APIs and 40 percent in real time pays an effective blended rate of 70 percent of the standard rate (60 percent at half price, 40 percent at full price). At this blended rate, Anthropic Sonnet's effective input rate drops to $2.10 per million, closer to OpenAI's standard GPT-4o rate and competitive with Google's Pro rate when caching benefits are factored in. Teams evaluating the cheapest API should calculate their batch-eligible percentage and use blended rates, not just standard rates, for the comparison.
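
The blended-rate arithmetic as a sketch:

```python
def blended_input_rate(standard_rate: float, batch_share: float,
                       batch_discount: float = 0.5) -> float:
    """Effective rate when `batch_share` of traffic runs through a batch
    API priced at `batch_discount` of the standard rate."""
    return standard_rate * (batch_share * batch_discount + 1 - batch_share)

# The 60/40 split above applied to Sonnet's $3.00/M standard input rate.
print(blended_input_rate(3.00, batch_share=0.60))  # 2.1 -> $2.10 per million
```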

Cheapest by Use Case

The cheapest provider depends on what you are building. For high-volume classification and extraction with short prompts and minimal output, Google Gemini Flash is the clear winner on price at $0.075 per million input. For conversational applications with large system prompts and sustained traffic, Anthropic Claude with prompt caching is cheapest because the 90 percent cache discount dominates the cost calculation. For batch processing of documents, reports, or analytical tasks, whichever provider offers the best batch discount for the model quality you need wins. For mixed workloads with a combination of simple and complex tasks, a multi-provider routing approach that sends each request to the cheapest adequate model across providers is cheapest overall.

The practical challenge with multi-provider routing is engineering complexity. Maintaining API integrations, handling authentication, normalizing response formats, and managing rate limits across two or three providers adds development and operational overhead. For teams spending under $5,000 per month on APIs, the engineering cost of multi-provider routing may exceed the savings. For teams spending over $20,000 per month, the savings typically justify the complexity within the first quarter.
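
For illustration, here is a minimal routing sketch. The difficulty heuristic, thresholds, and tier choices are all assumptions, not a production-ready design; real routers typically score requests with a small classifier model rather than prompt length.

```python
# Tiers ordered cheapest-first: (difficulty ceiling, model label, $/M input).
# Labels and rates echo this article; the thresholds are arbitrary examples.
ROUTES = [
    (0.3, "gemini-flash", 0.075),   # classification, extraction, short answers
    (0.7, "claude-sonnet", 3.00),   # multi-step reasoning, longer outputs
    (1.0, "claude-opus", 15.00),    # hardest tasks, accuracy-critical work
]

def estimate_difficulty(prompt: str) -> float:
    """Crude stand-in: score by prompt length, capped at 1.0. A production
    router would use a small classifier model instead."""
    return min(len(prompt) / 8_000, 1.0)

def route(prompt: str) -> str:
    """Return the cheapest model whose tier covers the estimated difficulty."""
    score = estimate_difficulty(prompt)
    for ceiling, model, _rate in ROUTES:
        if score <= ceiling:
            return model
    return ROUTES[-1][1]

print(route("Classify this support ticket: 'My invoice is wrong.'"))
# -> gemini-flash
```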

Total Cost Optimization

The cheapest AI API strategy is not choosing a single provider. It is combining multiple strategies: route simple requests to the cheapest adequate model (often Google Flash or OpenAI Mini), use prompt caching on every request (Anthropic for maximum savings), batch non-urgent workloads for 50 percent discounts (both Anthropic and OpenAI offer batch APIs), and store reusable knowledge in persistent memory to reduce per-request tokens regardless of which model you use. An application that applies all four strategies typically pays 20 to 30 percent of what a single-model, no-optimization approach would cost.
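
One way to see how the combined figure can land in that range is to treat each strategy as a multiplier on spend. Every factor below is an assumed illustration, and the strategies overlap in practice (memory trims the very tokens caching would have discounted), so a simple product is only an approximation.

```python
# Illustrative multiplier stack. Every factor is an assumption; the
# strategies overlap in practice, so a simple product is an approximation.
routing = 0.60   # assumed: easy traffic routed to economy models
caching = 0.70   # assumed: prompt caching on the stable prompt prefix
batching = 0.85  # 50% batch discount on an assumed 30% of traffic
memory = 0.70    # assumed: persistent memory trims per-request context

print(f"effective spend: {routing * caching * batching * memory:.0%} of baseline")
# -> effective spend: 25% of baseline
```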

Persistent memory deserves special attention because it reduces costs regardless of which provider you use. While caching and routing optimize how you pay for tokens, memory optimization reduces the number of tokens you need to send in the first place. A 50 percent reduction in per-request tokens from memory-based context optimization is worth more than a 50 percent reduction in per-token pricing because it applies to every provider, every model tier, and every request type. The cheapest API call is the one you do not need to make, and persistent memory eliminates redundant token processing that no amount of pricing optimization can address.

Reduce costs regardless of which API you choose. Adaptive Recall cuts the tokens you send per request by 50 to 80 percent, making every model cheaper to use.

Get Started Free