
How to Choose the Right Model Size for Your Task

The right model for a task is the smallest one that meets your quality requirements. Using Claude Opus for a classification task that Claude Haiku handles equally well costs 60x more per request with no quality benefit. A systematic evaluation process reveals which model tier each task actually needs, typically showing that 60 to 80 percent of production requests can be handled by smaller, cheaper models without quality degradation.

Before You Start

You need a list of the distinct task types your application performs (classification, summarization, extraction, generation, reasoning, conversation, tool use), access to at least two model tiers from your provider, and the ability to run batch evaluations. Budget 2 to 4 hours for building the evaluation dataset and 1 to 2 hours for running benchmarks. The cost of the evaluation itself is minimal compared to the ongoing savings from proper model selection.

Step-by-Step Selection Process

Step 1: Define quality criteria for each task.
Before comparing models, define what "good enough" means for each task type. For classification, this might be 95 percent accuracy against a labeled dataset. For summarization, it might be capturing all key points without hallucinated details. For customer support responses, it might be passing a relevance and helpfulness rubric scored by a more capable model. Write these criteria down as measurable metrics, not subjective impressions. The criteria determine the floor below which a cheaper model is not acceptable, regardless of cost savings.
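The criteria are easiest to enforce later if they are captured as data rather than prose. Below is a minimal sketch of one way to do that; the task names, metric names, and threshold values are illustrative placeholders, not recommendations.

# Illustrative quality thresholds per task type. Task names, metrics, and
# numbers are placeholders; substitute your own measurable criteria.
QUALITY_THRESHOLDS = {
    "classification": {"metric": "accuracy", "threshold": 0.95},
    "summarization": {"metric": "key_point_coverage", "threshold": 0.90},
    "support_response": {"metric": "rubric_score", "threshold": 4.0},  # 1-5 rubric scored by a stronger model
}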
Step 2: Build an evaluation dataset.
For each task type, collect 50 to 200 representative examples that cover the range of difficulty your application encounters. Include easy cases (simple classification, short summaries, straightforward questions), medium cases (multi-step reasoning, nuanced topics, ambiguous inputs), and hard cases (complex analysis, edge cases, adversarial inputs). For tasks with ground truth (classification, extraction), include the correct answer. For open-ended tasks (generation, conversation), include criteria for evaluation. Pull examples from production logs to ensure they reflect real usage, not synthetic test cases.
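One possible shape for that dataset is sketched below; the field names are assumptions for illustration, not a required schema. Ground-truth tasks carry an expected answer, while open-ended tasks carry their evaluation criteria instead.

# Illustrative evaluation dataset format. Field names are assumptions.
evaluation_dataset = [
    {
        "task": "classification",
        "difficulty": "easy",
        "prompt": "Classify this support ticket as billing, technical, or account: ...",
        "expected": "billing",
    },
    {
        "task": "summarization",
        "difficulty": "hard",
        "prompt": "Summarize the following incident report: ...",
        "criteria": ["states the root cause", "states customer impact", "adds no invented details"],
    },
]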
Step 3: Benchmark across model tiers.
Run the evaluation dataset through each model tier you are considering. For Anthropic, this means Haiku, Sonnet, and Opus. For OpenAI, this means GPT-4o-mini, GPT-4o, and o1. Record the quality score (based on your criteria from Step 1), the average input and output token counts, the average latency, and the cost per request. Use temperature 0 for reproducibility. Run each example 3 times if the task has any variability and average the results. Log everything in a structured format for analysis.
# Example: benchmarking classification across model tiers.
# Assumes evaluation_dataset (Step 2) and a calculate_cost helper are defined.
import anthropic

client = anthropic.Anthropic()

models = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6-20260414",
    "claude-opus-4-6-20260515",
]

results = []
for model in models:
    correct = 0
    total_cost = 0
    for example in evaluation_dataset:
        response = client.messages.create(
            model=model,
            max_tokens=100,
            temperature=0,
            messages=[{"role": "user", "content": example["prompt"]}],
        )
        predicted = response.content[0].text.strip()
        correct += int(predicted == example["expected"])
        total_cost += calculate_cost(model, response.usage)
    results.append({
        "model": model,
        "accuracy": correct / len(evaluation_dataset),
        "total_cost": total_cost,
        "avg_cost_per_request": total_cost / len(evaluation_dataset),
    })

for r in results:
    print(f"{r['model']}: {r['accuracy']:.1%} accuracy, "
          f"${r['avg_cost_per_request']:.6f}/request")
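The benchmark calls a calculate_cost helper that is not shown above. A minimal sketch follows; the per-million-token prices are placeholders, so substitute your provider's current published rates before trusting the numbers.

# Sketch of the calculate_cost helper used in the benchmark. Prices are
# placeholders (dollars per million tokens); use your provider's current rates.
PRICING_PER_MTOK = {
    "claude-haiku-4-5-20251001": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-6-20260414": {"input": 3.00, "output": 15.00},
    "claude-opus-4-6-20260515": {"input": 15.00, "output": 75.00},
}

def calculate_cost(model, usage):
    # usage is the API response's usage object with input_tokens and output_tokens
    prices = PRICING_PER_MTOK[model]
    return (usage.input_tokens * prices["input"]
            + usage.output_tokens * prices["output"]) / 1_000_000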
Step 4: Identify the capability floor.
For each task type, find the smallest model that meets your quality threshold. If Haiku achieves 96 percent accuracy on classification and your threshold is 95 percent, Haiku is the capability floor for that task. If Haiku achieves 88 percent on complex reasoning but your threshold is 93 percent, move up to Sonnet. If Sonnet achieves 94 percent, Sonnet is the floor. The capability floor tells you exactly how much quality headroom you have: a model at 96 percent against a 95 percent threshold gives you 1 percent headroom, while a model at 99 percent gives you 4 percent headroom for future drift or edge cases.
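Given the results list from Step 3, finding the floor is a simple scan from cheapest to most capable, as in the sketch below (which assumes results are ordered smallest model first).

# Minimal sketch: pick the cheapest model that clears the quality threshold.
# Assumes `results` from Step 3 is ordered from smallest to largest model.
def capability_floor(results, threshold):
    for r in results:
        if r["accuracy"] >= threshold:
            return r["model"], r["accuracy"] - threshold  # chosen model and its headroom
    return None, None  # no tier meets the threshold; revisit prompts or criteria

floor_model, headroom = capability_floor(results, threshold=0.95)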
Step 5: Build a model selection matrix.
Create a table mapping each task type to its optimal model. Include the task name, the recommended model, the quality score, the cost per request, the headroom above threshold, and any notes about edge cases where the model struggles. This matrix becomes the reference document for your routing implementation. Share it with the team so everyone understands why different tasks use different models. Review and update the matrix quarterly as new model versions are released and as your task mix evolves.
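Keeping the matrix in a machine-readable form lets the routing layer in Step 6 consume it directly. The entries below are illustrative examples, not benchmark results.

# Illustrative selection matrix. Task names and values are examples only.
MODEL_SELECTION_MATRIX = {
    "classification": {
        "model": "claude-haiku-4-5-20251001",
        "quality": 0.96, "cost_per_request": 0.0004, "headroom": 0.01,
        "notes": "struggles with multilingual tickets",
    },
    "complex_reasoning": {
        "model": "claude-sonnet-4-6-20260414",
        "quality": 0.94, "cost_per_request": 0.0090, "headroom": 0.01,
        "notes": "escalate multi-document analysis to Opus",
    },
}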
Step 6: Implement and monitor.
Implement model routing based on the selection matrix (see the companion guide on routing by complexity). After deployment, track quality metrics by task type and model tier. Set alerts for quality scores dropping below the threshold, which indicates model degradation, data distribution shift, or new task patterns that were not represented in the evaluation dataset. Re-run the evaluation quarterly or whenever a new model version is released to ensure your selection matrix remains optimal.
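A minimal routing and monitoring sketch, building on the matrix and thresholds defined in earlier steps, might look like the following; the task_type argument and the alerting hook are assumptions about your application, not a prescribed API.

# Minimal sketch: route by task type and flag quality regressions.
# Builds on MODEL_SELECTION_MATRIX, QUALITY_THRESHOLDS, and `client` from earlier steps.
def route_request(task_type, prompt, default_model="claude-sonnet-4-6-20260414"):
    entry = MODEL_SELECTION_MATRIX.get(task_type)
    model = entry["model"] if entry else default_model  # unknown tasks fall back to the mid tier
    return client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

def check_quality(task_type, observed_score):
    # Alert when a task's production quality drops below its Step 1 threshold.
    threshold = QUALITY_THRESHOLDS[task_type]["threshold"]
    if observed_score < threshold:
        print(f"ALERT: {task_type} quality {observed_score:.2f} is below threshold {threshold:.2f}")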

When Smaller Models Outperform

Larger models are not universally better. There are specific scenarios where smaller models produce better results at lower cost. For tasks with well-defined schemas (JSON extraction, classification into fixed categories, structured data transformation), smaller models often match or exceed larger models because the task does not require deep reasoning, just pattern recognition. Larger models sometimes overthink these tasks, adding qualifications, alternative interpretations, or additional fields that the application does not need.

For latency-sensitive applications, smaller models provide faster responses, and speed is itself a quality metric. A customer support bot that responds in 0.5 seconds using Haiku provides a better user experience than one that responds in 2.5 seconds using Opus, even if the Opus response is marginally more nuanced. When you factor in user satisfaction and engagement metrics alongside response quality, the smaller model can score higher overall.

For high-volume, low-complexity tasks (log analysis, data cleaning, format conversion, content tagging), the cost of using a larger model is difficult to justify because the quality improvement is imperceptible while the cost increase is 10x to 60x. These tasks are the low-hanging fruit of model optimization and should be the first to move to smaller models.

Pair model selection with persistent memory for maximum cost efficiency. Adaptive Recall reduces the context each model needs to process, making smaller models even more effective by giving them curated, relevant information instead of raw context dumps.
