LLM routing and model selection
Learn how LLM routers pick the cheapest model that can handle each query, why cascading from small to large models cuts costs 60-80%, and how to build a routing layer for production AI systems.
TL;DR
- LLM routing sends each query to the cheapest model capable of handling it, instead of routing everything to the most expensive model.
- A simple cascade (try GPT-4o-mini first, escalate to GPT-4o if confidence is low) cuts API costs 60-80% with minimal quality loss on most workloads.
- Routing can be rule-based (query length, keyword triggers), classifier-based (trained on historical quality data), or LLM-judged (a small model decides which large model to use).
- Martian, Unify, and open-source routers like RouteLLM provide automated routing across providers. Many teams build custom routers tuned to their specific quality/cost tradeoff.
- The key metric is the quality-cost Pareto curve: plot quality (eval score) against cost per query, and find the configuration that maximizes quality per dollar.
The problem it solves
Your AI-powered document analysis feature processes 500K queries per month. You route everything to GPT-4o at $5.00 per million input tokens, which at an average of roughly 6,000 input tokens per query puts your monthly bill at about $15,000. An analysis of your query logs reveals that 75% are simple lookups ("what is the sender's name on this invoice?") that GPT-4o-mini handles with identical accuracy. GPT-4o-mini costs $0.15 per million input tokens, about 33x cheaper.
If you routed 75% of queries to GPT-4o-mini, you would cut costs from $15K to roughly $4K per month with no measurable quality drop. Most teams using frontier models pay a 3-5x cost premium because they do not route by query complexity.
Before routing, every query flows to the most expensive model. After routing, simple and complex queries each reach the right model tier.
The routed system costs roughly $4K per month instead of $15K, a 73% reduction with no quality regression on the simple majority.
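The arithmetic behind those figures can be checked with a short sketch. The prices, volume, and 75/25 split come from the example above; the 6,000 input tokens per query is an assumption, chosen because it is what a $15,000 bill implies at $5.00 per million tokens. Output tokens are ignored for simplicity.

```python
# Prices, volume, and traffic split come from the example above.
QUERIES_PER_MONTH = 500_000
TOKENS_PER_QUERY = 6_000  # assumption: implied by the $15K bill, not measured
PRICE_PER_M_TOKENS = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}


def monthly_cost(share_by_model: dict[str, float]) -> float:
    """Monthly input-token cost in USD for a traffic split across models."""
    total = 0.0
    for model, share in share_by_model.items():
        tokens = QUERIES_PER_MONTH * share * TOKENS_PER_QUERY
        total += tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]
    return total


baseline = monthly_cost({"gpt-4o": 1.0})
routed = monthly_cost({"gpt-4o": 0.25, "gpt-4o-mini": 0.75})
savings = 1 - routed / baseline

print(f"baseline ${baseline:,.0f}, routed ${routed:,.0f}, savings {savings:.1%}")
```

Changing the split passed to `monthly_cost` shows how sensitive the bill is to the share of traffic that still needs the large model: the GPT-4o share dominates the total.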
What is it?
LLM routing is the practice of dispatching each query to the cheapest model that can meet the quality bar for that specific query. Instead of using one model for everything, a router sits in front of a pool of models and makes a per-request model selection decision.
Think of triage in an emergency room. A triage nurse assesses each patient's severity and routes them to the appropriate care level: the general practitioner, urgent care, or the emergency trauma team. Not every patient needs the trauma team, and sending everyone there wastes resources and slows care for genuine emergencies. The nurse makes a fast, imperfect assessment that optimizes resource usage without sacrificing quality for those who truly need the highest level of care.
In LLM routing, the "nurse" is a lightweight decision layer (a classifier, a confidence check, or an embedding comparison) that assesses query complexity and selects the appropriate model tier.
How it works
Cascade routing (small-to-large)
Cascade routing tries the cheapest model first, then assesses the response quality. If the quality meets the threshold, the response is returned immediately. If it does not, the router escalates to the next tier.
Quality assessment options:
- Logprob-based confidence: compute the mean log probability of the generated tokens and exponentiate it to get a per-token confidence between 0 and 1. Low confidence signals a hard query.
- Separate quality classifier: a small model trained to judge output quality without generating a full response.
- Format checks: if the model refuses to answer or produces a generic non-response, escalate.
The cascade terminates as soon as a satisfactory response is found. Typical cascade: GPT-4o-mini then GPT-4o, or Claude Haiku then Claude Sonnet then Claude Opus.
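Of the three quality checks listed above, the format check is the cheapest to implement. Here is a minimal sketch, assuming plain-text responses; the phrase list and the `should_escalate` name are hypothetical and should be tuned against refusals you actually see in your logs.

```python
import re

# Hypothetical refusal / non-answer markers; extend from your own logs.
REFUSAL_PATTERNS = [
    r"\bI(?:'m| am) (?:sorry|unable)\b",
    r"\bI (?:can(?:no|')t|do not|don't) (?:help|assist|answer|have)\b",
    r"\bas an AI\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def should_escalate(response_text: str) -> bool:
    """Return True when the cheap model's output looks like a non-answer."""
    text = response_text.strip()
    if not text:  # empty output is never an acceptable answer
        return True
    return _REFUSAL_RE.search(text) is not None
```

A check like this catches only obvious failures. It complements a confidence score or quality classifier and does not replace one.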
Latency works in your favor for the majority path. If 75% of queries resolve at tier 1, those queries respond faster because GPT-4o-mini is quicker than GPT-4o. Only the escalated 25% see higher latency (tier-1 latency plus tier-2 latency stacked). I have seen teams celebrate their P50 improvement without noticing their P99 quietly tripled on the escalated set.
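# Simplified confidence-based cascade router
import math

async def cascade_router(query: str) -> str:
    # Try the cheap model first. call_model is your own async client wrapper;
    # it must request token logprobs and return an object with .content and .logprobs.
    response = await call_model("gpt-4o-mini", query)

    # Confidence = exp(mean token logprob), i.e. the geometric mean token probability
    if response.logprobs:
        avg_logprob = sum(response.logprobs) / len(response.logprobs)
        confidence = math.exp(avg_logprob)
    else:
        confidence = 0.0  # no logprobs returned: treat as low confidence

    if confidence >= 0.85:
        return response.content

    # Escalate to the expensive model and return its text
    escalated = await call_model("gpt-4o", query)
    return escalated.content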
Classifier-based routing
Classifier-based routing trains a lightweight model (a fine-tuned BERT-scale classifier, roughly 110M parameters) to predict which tier should handle a given query. The classifier runs before any LLM inference, adding roughly 5ms of overhead with no first-pass inference cost wasted.
Training data comes from historical production logs: annotated queries where you know which model tier produced acceptable quality. Build a golden evaluation set through human review, then use those labels to train the classifier. Features that work well include query length, presence of code, numerical reasoning signals, domain-specific keywords, and entity count.
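As a rough illustration, the hand-built features named above can be extracted with a few lines of standard-library Python. This is a sketch under stated assumptions: the keyword list, the regexes, and the capitalized-word entity proxy are all hypothetical stand-ins, and a production router would more likely feed raw text to the fine-tuned classifier or use a proper NER model.

```python
import re

# Hypothetical keyword list; replace with terms from your own domain.
HARD_KEYWORDS = ("compare", "summarize", "explain why", "reconcile", "derive")

# Crude code detector: backticks, a Python function definition, or ; { }.
_CODE_RE = re.compile(r"`|\bdef \w+\(|[;{}]")
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_features(query: str) -> dict[str, float]:
    """Turn a raw query into numeric routing features."""
    words = query.split()
    lowered = query.lower()
    return {
        "length_tokens": float(len(words)),  # whitespace tokens, not model tokens
        "has_code": float(_CODE_RE.search(query) is not None),
        "numeric_count": float(len(_NUMBER_RE.findall(query))),
        "keyword_hits": float(sum(kw in lowered for kw in HARD_KEYWORDS)),
        # Entity proxy: capitalized words after the first word of the query.
        "entity_count": float(sum(w[:1].isupper() for w in words[1:])),
    }
```

The resulting dictionary can be joined with the human-reviewed tier labels to form training rows for whichever classifier you choose.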
Classifier routing beats cascade whenever your escalation rate would be high. If your data shows 60% of queries would escalate in a cascade, the classifier avoids paying tier-1 inference cost on those 60% upfront.
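The trade-off can be written as expected cost per query. The sketch below assumes an idealized classifier that never misroutes, which a real one will not match, and uses hypothetical per-query costs derived from the earlier pricing at about 6,000 input tokens per query.

```python
def cascade_cost(c_small: float, c_large: float, escalation_rate: float) -> float:
    """Expected cost per query: every query pays tier 1; escalations also pay tier 2."""
    return c_small + escalation_rate * c_large


def classifier_cost(
    c_small: float,
    c_large: float,
    escalation_rate: float,
    c_classifier: float = 0.0,
) -> float:
    """Expected cost per query with an idealized (always-correct) classifier."""
    return c_classifier + (1 - escalation_rate) * c_small + escalation_rate * c_large


# Hypothetical per-query costs in USD (6,000 tokens at $0.15 and $5.00 per million).
C_SMALL, C_LARGE = 0.0009, 0.03

cascade = cascade_cost(C_SMALL, C_LARGE, escalation_rate=0.6)
classified = classifier_cost(C_SMALL, C_LARGE, escalation_rate=0.6)
saving_per_query = cascade - classified  # equals escalation_rate * C_SMALL
```

The difference works out to `escalation_rate * c_small - c_classifier`, so the dollar saving grows with the escalation rate and with the price of the small model. The hard queries also skip the stacked tier-1 latency.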
Semantic routing (content-based)
Semantic routing embeds the query using a cheap embedding model and compares it to category centroid vectors derived from labeled training examples. The router selects the most similar category centroid and dispatches to the specialized model for that domain.
This is most valuable when you have domain-specific fine-tuned models that outperform general frontier models in their domain and cost less per token. A medical-query fine-tuned model often delivers better accuracy on clinical questions than GPT-4o while running on smaller, faster hardware.
The overhead is 2-10ms for the embedding call plus cosine similarity computation. Categories with high embedding-space separation (medical vs. legal vs. code) get clean routing. For ambiguous queries near category boundaries, keep a fallback to a general-purpose model using a minimum-similarity threshold.
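Centroid matching with a minimum-similarity fallback can be sketched in plain Python. The three-dimensional vectors are toy values for illustration only; real centroids would be averages of embedding vectors from your labeled examples, and the 0.75 threshold is an assumption to tune.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(
    query_vec: list[float],
    centroids: dict[str, list[float]],
    min_similarity: float = 0.75,  # assumed threshold; tune on held-out queries
    fallback: str = "general",
) -> str:
    """Pick the closest category, or the fallback when nothing is close enough."""
    best_name, best_sim = fallback, -1.0
    for name, centroid in centroids.items():
        sim = cosine(query_vec, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else fallback


# Toy centroids, one per specialized model.
CENTROIDS = {
    "medical": [1.0, 0.0, 0.0],
    "legal": [0.0, 1.0, 0.0],
    "code": [0.0, 0.0, 1.0],
}
```

With these toy values, a query vector of `[0.9, 0.1, 0.0]` routes to `"medical"`, while `[0.5, 0.5, 0.5]` sits equally far from every centroid and falls back to `"general"`.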
Multi-provider routing
Multi-provider routing dispatches queries across OpenAI, Anthropic, Google, and self-hosted models based on four live signals: current cost, current latency (measured p50 per provider in real time), current availability, and capability match. Capability match is a per-task judgment: for example, you might send long-context queries to Gemini 1.5 Pro, structured-output tasks to GPT-4o, and nuanced reasoning to Claude. The router weights all four signals and selects dynamically.
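One way to combine the four signals is a weighted score per provider. The sketch below is an assumption about how such a scorer might look, not a reference design: the field names, equal default weights, and max-normalization are all hypothetical, and the provider entries are placeholders, not real measurements.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    cost_per_1k: float     # USD per 1K tokens (current price)
    p50_latency_ms: float  # rolling median latency
    availability: float    # recent success rate, 0..1
    capability: float      # task fit for this query, 0..1


def select_provider(
    candidates: list[ProviderStats],
    w_cost: float = 0.25,
    w_latency: float = 0.25,
    w_avail: float = 0.25,
    w_cap: float = 0.25,
) -> ProviderStats:
    """Return the candidate with the best weighted score."""
    # Normalize cost and latency by the worst candidate so all terms are 0..1.
    max_cost = max(p.cost_per_1k for p in candidates) or 1.0
    max_lat = max(p.p50_latency_ms for p in candidates) or 1.0

    def score(p: ProviderStats) -> float:
        return (
            w_cap * p.capability
            + w_avail * p.availability
            - w_cost * p.cost_per_1k / max_cost
            - w_latency * p.p50_latency_ms / max_lat
        )

    return max(candidates, key=score)


# Placeholder stats for two hypothetical providers.
providers = [
    ProviderStats("provider_a", 5.0, 800.0, 0.99, 0.9),
    ProviderStats("provider_b", 1.0, 400.0, 0.99, 0.8),
]
```

With equal weights the cheaper, faster `provider_b` wins. Setting `w_cap=1.0` and the other weights to zero flips the choice to `provider_a`, which is how a capability-critical query class could be configured.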
Provider circuit breakers and rate limit distribution are mandatory components. When OpenAI's error rate exceeds a threshold, the circuit breaker should open and reroute traffic to Anthropic or Vertex AI automatically. Rate limit distribution prevents hitting any single provider's per-minute cap, which matters especially at high volume.
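A circuit breaker of the kind described can be sketched with a sliding window of recent outcomes. The window size, threshold, cooldown, and class and function names here are hypothetical defaults. Rate limit distribution is not covered, and a production breaker would usually add a half-open probing state.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the recent error rate exceeds a threshold, then cools down."""

    def __init__(self, error_threshold=0.5, window=20, cooldown_s=30.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.cooldown_s = cooldown_s
        self.clock = clock                   # injectable for testing
        self.opened_at = None                # None means the breaker is closed

    def record(self, success: bool) -> None:
        """Record one call outcome; open the breaker if the window is too noisy."""
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate > self.error_threshold:
            self.opened_at = self.clock()

    def allow(self) -> bool:
        """Return True if traffic may be sent to this provider right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False


def pick_provider(preferred_order, breakers):
    """Return the first provider whose breaker allows traffic, else None."""
    for name in preferred_order:
        if breakers[name].allow():
            return name
    return None
```

The router calls `record` after every provider response and `pick_provider` before every request, so traffic shifts to the next provider in the list as soon as a breaker opens and returns after the cooldown.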