
How to Batch AI API Calls for Lower Costs

Batching AI API calls groups multiple requests into a single submission that the provider processes at a discounted rate. Anthropic's Message Batches API and OpenAI's Batch API both discount batch-submitted requests by 50 percent, with results typically available within a few hours. For workloads that tolerate processing delays, batching cuts costs in half with no quality trade-off.

Before You Start

You need a workload that does not require real-time responses. Batching is not compatible with interactive conversations, live customer support, or any workflow where the user waits for the response. You also need a queue or job system to collect requests over a time window before submitting them. Familiarity with the Anthropic or OpenAI batch API endpoints is helpful but not required; this guide covers the implementation from scratch.

Step-by-Step Implementation

Step 1: Identify batch-eligible workloads.
Review your application's AI workflows and categorize each as real-time (user waits for response) or deferrable (response can be delivered later). Common batch-eligible workloads include: document classification and tagging, content generation for scheduled publishing, data extraction from uploaded files, periodic summarization of logs or reports, evaluation and scoring of queued items, bulk content moderation, and nightly analytics processing. Calculate the percentage of your total API spend that comes from deferrable workloads. If it exceeds 20 percent, batching will produce meaningful savings.
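The 20 percent threshold check from Step 1 can be sketched as a quick calculation. The workflow names and spend figures below are illustrative placeholders, not real pricing data:

```python
# Quick check of the 20 percent deferrable-spend threshold from Step 1.
# Spend figures are illustrative placeholders.
monthly_spend = {
    "chat_support": 4_000,       # real-time: user waits for the response
    "doc_classification": 900,   # deferrable
    "nightly_analytics": 600,    # deferrable
}
deferrable = {"doc_classification", "nightly_analytics"}

deferrable_spend = sum(v for k, v in monthly_spend.items() if k in deferrable)
share = deferrable_spend / sum(monthly_spend.values())
print(f"{share:.0%}")  # 27% -- above the 20 percent threshold
```

In this example, 27 percent of spend is deferrable, so batching is worth implementing.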
Step 2: Set up the batch pipeline.
Build a pipeline with three components. A collector receives deferrable requests and stores them in a queue (Redis list, database table, or file-based queue). A submitter runs on a schedule (every 15 minutes, every hour, or when the queue reaches a threshold size) and packages queued requests into a batch submission. A processor polls for batch completion and delivers results to the appropriate destinations (database, webhook, notification, or file output). Keep the pipeline simple: a cron job that runs every 30 minutes, reads pending requests from a database table, submits them as a batch, and writes results back to the table when they complete is sufficient for most workloads.
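A minimal sketch of the collector and submitter halves of that pipeline, using a SQLite table as the queue. The table name, column names, and functions here are illustrative, not from any particular library:

```python
import sqlite3

def init_queue(conn):
    # Queue backing store: one row per deferrable request.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS batch_queue (
               id TEXT PRIMARY KEY,
               prompt TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               result TEXT
           )"""
    )

def enqueue(conn, item_id, prompt):
    # Collector: store a deferrable request for the next batch run.
    conn.execute(
        "INSERT INTO batch_queue (id, prompt) VALUES (?, ?)", (item_id, prompt)
    )

def drain_pending(conn, limit=10_000):
    # Submitter: pull pending items and mark them as submitted.
    rows = conn.execute(
        "SELECT id, prompt FROM batch_queue WHERE status = 'pending' LIMIT ?",
        (limit,),
    ).fetchall()
    if rows:
        conn.executemany(
            "UPDATE batch_queue SET status = 'submitted' WHERE id = ?",
            [(r[0],) for r in rows],
        )
    return rows

conn = sqlite3.connect(":memory:")
init_queue(conn)
enqueue(conn, "t1", "Classify ticket 1")
enqueue(conn, "t2", "Classify ticket 2")
pending = drain_pending(conn)
print(len(pending))  # 2
```

The processor would update each row's status and result column once the batch completes, which also gives you a free audit trail of every request.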
Step 3: Submit batches via the provider API.
Anthropic's Message Batches API accepts an array of message requests, each with a custom_id for tracking, and processes them asynchronously at 50 percent of standard pricing. Results are available within 24 hours, though most batches complete in 1 to 4 hours.
```python
import time

import anthropic

client = anthropic.Anthropic()

# Collect pending requests into batch format
batch_requests = []
for item in pending_items:
    batch_requests.append({
        "custom_id": item["id"],
        "params": {
            "model": "claude-sonnet-4-6-20260414",
            "max_tokens": 1024,
            "messages": [
                {"role": "user", "content": item["prompt"]}
            ],
        },
    })

# Submit the batch
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch submitted: {batch.id}, {len(batch_requests)} requests")

# Poll for completion (or use a webhook)
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Process per-item results
for result in client.messages.batches.results(batch.id):
    item_id = result.custom_id
    if result.result.type == "succeeded":
        response_text = result.result.message.content[0].text
        save_result(item_id, response_text)
    else:
        handle_failure(item_id, result.result)
```
Step 4: Handle results and errors.
Batch processing introduces failure modes that do not exist in synchronous calls. Individual items within a batch can fail while others succeed. Handle partial failures by logging failed items with their error details, retrying failed items in the next batch submission, and alerting on high failure rates (more than 5 percent of items failing indicates a systematic issue). Set a maximum retry count so that consistently failing items are not retried forever. For items that fail after the retry limit, escalate to synchronous processing (at full price) or flag for human review.
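That retry policy can be sketched as follows. MAX_RETRIES, the in-memory dictionaries, and the (item_id, status) tuple shape are illustrative; a real pipeline would persist retry counts alongside the queued items:

```python
# Sketch of partial-failure handling with a capped retry count.
MAX_RETRIES = 3

retry_counts = {}        # item_id -> attempts so far
requeued, escalated = [], []

def handle_batch_results(results):
    # results: list of (item_id, status) pairs from a completed batch.
    failed = 0
    for item_id, status in results:
        if status == "succeeded":
            continue
        failed += 1
        attempts = retry_counts.get(item_id, 0) + 1
        retry_counts[item_id] = attempts
        if attempts < MAX_RETRIES:
            requeued.append(item_id)    # retry in the next batch submission
        else:
            escalated.append(item_id)   # fall back to sync call or human review
    failure_rate = failed / max(len(results), 1)
    if failure_rate > 0.05:
        print(f"ALERT: {failure_rate:.0%} of items failed")
    return failure_rate

rate = handle_batch_results([("a", "succeeded"), ("b", "errored")])
```

Requeued items simply re-enter the collector queue, so the retry costs nothing extra to implement.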
Step 5: Optimize batch size and timing.
Larger batches are more cost-efficient because they amortize the overhead of batch management, but they also increase the delay before results are available. Tune the collection window based on your use case: for document processing where users upload files and check back later, a 30 to 60 minute window is appropriate. For content generation scheduled for the next day, collecting all requests until midnight and submitting a single batch is optimal. Monitor the queue depth and batch completion time to find the sweet spot between cost savings and acceptable delay. Most providers have limits on batch size (Anthropic allows up to 10,000 requests per batch), so large workloads need to be split across multiple batches.
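Splitting a large workload across multiple batches is a one-liner. A sketch, using the 10,000-request cap mentioned above as the default:

```python
# Split a large workload into provider-sized batch submissions.
def chunk_requests(requests, max_batch_size=10_000):
    return [
        requests[i : i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]

batches = chunk_requests(list(range(25_000)))
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each chunk is then submitted as its own batch, and the processor tracks all outstanding batch IDs until every one reports completion.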

Content Batching Within Single Requests

Beyond the batch API, you can batch content within a single request to reduce per-item overhead. If you need to classify 30 support tickets, sending all 30 in a single prompt is cheaper than 30 individual prompts because the system prompt, tool definitions, and instructions are included once instead of 30 times. Structure the prompt clearly with numbered items and request structured output (JSON array) so results are easy to parse.

Content batching works best for simple, uniform tasks (classification, extraction, tagging) where each item is independent. It works poorly for tasks that require deep reasoning per item, because the model's attention to each item decreases as the batch grows. Empirical testing shows that quality remains stable for up to 20 to 30 items per batch for simple tasks but degrades beyond 10 items for complex analytical tasks. Test with your specific data to find the quality threshold.
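A sketch of content batching for the ticket-classification example. The prompt wording, label set, and helper names are illustrative, and the model response is simulated here rather than fetched from an API:

```python
import json

def build_prompt(tickets):
    # Number each item so the model can return labels in order.
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tickets))
    return (
        "Classify each support ticket below as 'billing', 'bug', or 'other'.\n"
        "Respond with a JSON array of labels, one per ticket, in order.\n\n"
        f"{numbered}"
    )

def parse_labels(response_text, expected_count):
    # Structured output makes per-item results easy to recover.
    labels = json.loads(response_text)
    if len(labels) != expected_count:
        raise ValueError("model returned wrong number of labels")
    return labels

tickets = ["Card was charged twice", "App crashes on login", "Love the product!"]
prompt = build_prompt(tickets)
# Simulated model output for the sketch:
labels = parse_labels('["billing", "bug", "other"]', len(tickets))
```

The length check in parse_labels matters: a mismatched count means the labels can no longer be aligned with their tickets, so the whole group should be retried rather than saved.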

Batch pricing math: If your application makes 100,000 deferrable API calls per month at an average cost of $0.03 each, that is $3,000 per month at standard pricing. Batching at 50 percent discount reduces it to $1,500 per month, saving $18,000 per year from a few hours of implementation work.
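The same math as a quick calculator, so you can plug in your own call volume and per-call cost. All numbers below are the article's example figures, not provider quotes:

```python
# Batch savings calculator using the article's example figures.
calls_per_month = 100_000
cost_per_call = 0.03       # average dollars per deferrable call
batch_discount = 0.50      # 50 percent batch pricing

standard = calls_per_month * cost_per_call       # $3,000/month
batched = standard * (1 - batch_discount)        # $1,500/month
annual_savings = (standard - batched) * 12       # $18,000/year
print(standard, batched, annual_savings)
```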

Combine batching with persistent memory to maximize savings. Adaptive Recall stores results from batch processing, so subsequent real-time queries can retrieve the answers from memory instead of making new API calls.
