How to Route Queries to Cheaper Models by Complexity
Before You Start
You need benchmark results showing how each model tier performs on your specific tasks (see the companion guide on choosing model sizes). You also need a representative dataset of production requests labeled by the minimum model tier that handles them well. If you do not have labeled data, start with the rule-based approach described in Step 2 and collect labeled data over time through quality monitoring in Step 5.
Step-by-Step Implementation
Step 1: Create three tiers with clear criteria. Tier 1 (simple) includes classification, entity extraction, format conversion, template completion, FAQ responses, and short summaries. These tasks have well-defined outputs and do not require multi-step reasoning. Tier 2 (medium) includes analytical responses, comparison tasks, code generation, detailed explanations, and multi-step workflows with clear parameters. These tasks require understanding and generation but not creative or open-ended reasoning. Tier 3 (complex) includes creative writing, multi-document synthesis, strategic analysis, ambiguous problem solving, and tasks requiring strong world knowledge or nuanced judgment. These tasks need the full capability of a frontier model. Document each tier with 10 to 20 examples from your production traffic to create a shared understanding across your team.
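One lightweight way to record that shared catalog is a small data structure kept next to the routing code. The sketch below is illustrative only: the criteria strings and example requests are placeholders to be replaced with examples from your own production traffic.

TIER_DEFINITIONS = {
    1: {"name": "simple",
        "criteria": "well-defined output, no multi-step reasoning",
        "examples": ["Classify this ticket as billing or technical.",
                     "Extract the invoice number from this email."]},
    2: {"name": "medium",
        "criteria": "analysis or generation within clear parameters",
        "examples": ["Compare these two pricing plans for a five-person team.",
                     "Write a Python function that deduplicates rows by email."]},
    3: {"name": "complex",
        "criteria": "open-ended reasoning, synthesis, or nuanced judgment",
        "examples": ["Synthesize these five reports into a strategy memo.",
                     "Evaluate the tradeoffs of migrating our billing system."]},
}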
Step 2: Start with a rule-based classifier for immediate deployment, then upgrade to ML-based classification as you collect data. The rule-based classifier uses keyword patterns, task type detection, and message length as signals. Short messages with question words ("what is", "how many", "when did") route to Tier 1. Messages requesting analysis, comparison, or code route to Tier 2. Long messages with multiple requirements, ambiguous instructions, or creative requests route to Tier 3.
def classify_complexity(message, task_type=None):
    """Rule-based complexity classifier. Returns 1, 2, or 3."""
    msg_lower = message.lower()
    word_count = len(message.split())

    # Task type overrides when available
    if task_type in ("classify", "extract", "tag", "format"):
        return 1
    if task_type in ("analyze", "compare", "generate_code"):
        return 2
    if task_type in ("creative", "synthesize", "strategy"):
        return 3

    # Keyword patterns for simple tasks
    simple_patterns = [
        "what is", "define", "list", "how many",
        "yes or no", "true or false", "classify",
        "extract the", "summarize in one"
    ]
    if any(p in msg_lower for p in simple_patterns) and word_count < 50:
        return 1

    # Keyword patterns for complex tasks
    complex_patterns = [
        "analyze and compare", "design a system",
        "write a detailed", "evaluate the tradeoffs",
        "create a comprehensive", "multiple perspectives"
    ]
    if any(p in msg_lower for p in complex_patterns) or word_count > 300:
        return 3

    return 2  # Default to medium
# Map tiers to models
TIER_MODELS = {
    1: "claude-haiku-4-5-20251001",
    2: "claude-sonnet-4-6-20260414",
    3: "claude-opus-4-6-20260515",
}

Step 3: Use your benchmark results from the model selection process to assign each tier to the cheapest model that meets your quality threshold. A common mapping is Tier 1 to Haiku (cheapest, handling the majority of requests), Tier 2 to Sonnet (balanced cost and capability), and Tier 3 to Opus (maximum capability for the hardest tasks). If your benchmarks show that Sonnet handles your Tier 1 tasks equally well and the cost difference is acceptable, you can simplify to two tiers. The mapping should be driven by data, not assumptions.
Step 4: Add routing middleware between your application code and your API client. The middleware receives each request, runs the classifier, selects the model, and passes the request to the API client with the selected model. Log every routing decision (the classified tier, the selected model, and the classifier's confidence) for monitoring and analysis. Keep the routing layer thin and fast: classification should add less than 10 milliseconds of latency. If you use an AI gateway like LiteLLM, implement routing as a custom router plugin rather than building standalone middleware.
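A minimal sketch of such middleware, assuming the Anthropic Python SDK and the classify_complexity and TIER_MODELS definitions above; the logging fields and the max_tokens default are illustrative choices, not requirements.

import time
import logging
import anthropic  # assumes the Anthropic Python SDK is installed

logger = logging.getLogger("router")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def route_request(message, task_type=None, max_tokens=1024):
    """Classify the request, pick a model, and forward the call."""
    start = time.perf_counter()
    tier = classify_complexity(message, task_type)
    model = TIER_MODELS[tier]
    classify_ms = (time.perf_counter() - start) * 1000

    # Log every routing decision for later analysis
    logger.info("routing tier=%d model=%s classify_ms=%.2f", tier, model, classify_ms)

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": message}],
    )
    return response, tier

Returning the tier alongside the response lets the monitoring and escalation steps below reuse the same routing decision instead of reclassifying.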
Step 5: After deploying routing, monitor quality metrics by tier and model to verify that cheaper models are meeting standards. For tasks with ground truth (classification, extraction), measure accuracy against labels. For open-ended tasks, use an LLM-as-judge approach: periodically sample responses from each tier and have a larger model evaluate their quality on a rubric. Track quality scores over time and alert when scores for any tier drop below the threshold. Quality degradation indicates that the tier boundary needs adjustment or that the request distribution has shifted.
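A hedged sketch of the sampling-and-judging loop, assuming request records logged by the middleware above. The record fields, sample rate, judge prompt, and strict-JSON reply format are illustrative assumptions; a production version needs retries and parse-failure handling.

import json
import random

JUDGE_MODEL = "claude-opus-4-6-20260515"  # largest tier acts as the judge

JUDGE_PROMPT = """Rate the assistant response below from 1 (poor) to 5 (excellent)
for correctness, completeness, and instruction following.
Reply with JSON only: {{"score": <int>, "reason": "<short reason>"}}

Request:
{request}

Response:
{response}"""

def sample_and_judge(logged_requests, sample_rate=0.02):
    """Score a random sample of routed responses with a larger model."""
    sample = [r for r in logged_requests if random.random() < sample_rate]
    scores = []
    for record in sample:
        judgment = client.messages.create(
            model=JUDGE_MODEL,
            max_tokens=200,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                request=record["message"], response=record["response_text"])}],
        )
        result = json.loads(judgment.content[0].text)
        scores.append({"tier": record["tier"], "score": result["score"]})
    return scores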
Step 6: Add automatic escalation when a smaller model's response does not meet quality checks. After the smaller model generates a response, run a lightweight quality check (format validation, confidence check, or a quick rubric evaluation). If the check fails, automatically retry the request with the next tier model. This catches the cases where the classifier misjudges complexity, ensuring quality is maintained at the cost of processing some requests twice. Track the escalation rate: if it exceeds 15 percent for a tier, the tier boundary needs to move to include those requests in the higher tier by default.
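A sketch of the escalation loop under the same assumptions as the middleware above; the placeholder quality check is deliberately trivial and should be replaced with your format validation or rubric call.

MAX_TIER = 3

def passes_quality_check(response_text):
    """Placeholder check; swap in format validation or a quick rubric evaluation."""
    return bool(response_text.strip()) and len(response_text.split()) > 3

def route_with_escalation(message, task_type=None, max_tokens=1024):
    """Try the classified tier first; retry one tier up if the quality check fails."""
    tier = classify_complexity(message, task_type)
    escalations = 0
    while True:
        response = client.messages.create(
            model=TIER_MODELS[tier],
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": message}],
        )
        text = response.content[0].text
        if passes_quality_check(text) or tier == MAX_TIER:
            return response, tier, escalations
        tier += 1          # escalate to the next tier
        escalations += 1   # track against the 15 percent threshold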
ML-Based Classification Upgrade
After collecting 2,000 to 5,000 labeled examples through quality monitoring, upgrade from rule-based to ML-based classification. Embed each request using a lightweight embedding model, then train a simple classifier (logistic regression or a small neural network) to predict the complexity tier from the embedding. ML-based classification handles novel request patterns that rule-based systems miss and typically improves routing accuracy by 10 to 15 percentage points. The classifier itself is fast (under 5 milliseconds inference) and cheap (no API call needed if you use a local embedding model like sentence-transformers).
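A minimal training sketch using sentence-transformers and scikit-learn; the embedding model name and solver defaults are assumptions, and in practice you would hold out a validation set to measure the accuracy gain before switching over.

# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

def train_tier_classifier(messages, tiers):
    """Fit a logistic regression on request embeddings; tiers are 1, 2, or 3."""
    X = embedder.encode(messages)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, tiers)
    return clf

def classify_complexity_ml(clf, message):
    """Predict the tier for a new request; keep the rule-based path as a fallback."""
    embedding = embedder.encode([message])
    return int(clf.predict(embedding)[0])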
Memory-informed routing adds historical context to the classification. If a persistent memory system tracks previous routing decisions and their outcomes for similar queries, the classifier can factor in this history. A query about a topic that historically required Tier 3 reasoning can be routed directly to Opus without attempting cheaper models first, reducing the wasted cost of failed escalations. Adaptive Recall's cognitive scoring provides this naturally: routing memories that are recent, frequently accessed, and corroborated receive high activation scores and influence routing decisions for similar future queries.
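As a rough sketch of the general pattern only: the memory_store interface, similarity field, and threshold below are hypothetical placeholders, not Adaptive Recall's actual API.

def classify_with_memory(clf, memory_store, message, similarity_threshold=0.85):
    """Hypothetical sketch: consult past routing outcomes before classifying."""
    # memory_store.similar() stands in for whatever lookup your memory system provides
    past = memory_store.similar(message, top_k=5)
    high_tier_hits = [m for m in past
                      if m["similarity"] >= similarity_threshold and m["final_tier"] == 3]
    if high_tier_hits:
        return 3  # skip cheaper attempts for queries that historically needed Tier 3
    return classify_complexity_ml(clf, message)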
Make model routing smarter with memory. Adaptive Recall stores routing outcomes and query patterns, helping your system learn which queries need which models based on actual results, not just rules.
Get Started Free