Budget-aware model routing

TL;DR

A routing layer classifies each request by complexity, selects the cheapest model that meets quality requirements, and enforces a hard dollar cap before every LLM call.
The 80/20 rule applies: roughly 80% of agent tasks (summarization, extraction, formatting) can be handled by models that cost 10-30x less than frontier models.
Hard caps are deterministic guardrails in code, not soft guidance in prompts. If estimated cost exceeds the remaining budget, the router downgrades the model or rejects the request.
FrugalGPT (Stanford, 2023) showed that cascading through cheaper models first can cut costs by up to 98% while matching the accuracy of the best single model (GPT-4) on the benchmarks it tested.
The tradeoff: routing adds latency (classifier overhead), engineering complexity (model catalog maintenance), and the risk of under-powering hard tasks if the classifier is weak.

Full pattern: Budget-Aware Model Routing with Hard Cost Caps

The Problem It Solves

Your AI coding assistant processes 50,000 requests per day. Every request hits Claude Opus because the team defaulted to the strongest model during prototyping and never changed it. The monthly bill is $47,000 and climbing. Your finance team is asking questions, but engineering insists that reducing model quality will break the product.

Here's the thing nobody mentions in the pitch deck: most of those 50,000 requests are simple. Auto-complete suggestions, docstring generation, import statement fixes, variable renaming. These tasks don't need a frontier model. A model 30x cheaper handles them just as well. But without a routing mechanism, every request gets the premium treatment.

I've watched this exact scenario play out at three different companies. The pattern is always the same: prototype with the best model, launch, watch costs scale linearly with adoption, panic. The fix isn't "use a cheaper model everywhere." The fix is routing: send hard tasks to expensive models and easy tasks to cheap ones, with deterministic budget enforcement that prevents any single user, session, or workflow from blowing through its allocation.

What Is It?

Budget-aware model routing selects the cheapest LLM that satisfies quality requirements for each request, while enforcing hard cost caps that prevent overspend at the request, session, user, and organization level.

Think of it like a hospital triage system. Not every patient needs a specialist. The triage nurse (complexity classifier) assesses severity and routes accordingly: a broken arm goes to the ER doctor, a common cold goes to the nurse practitioner. The specialist's time (expensive model tokens) is reserved for cases that genuinely need it. And if the hospital hits its daily budget, non-critical patients are deferred, never the critical ones.

The routing layer sits between the application and the LLM providers. It makes three decisions for every request: how hard is this task, how much budget remains, and which model is the cheapest that can handle it within budget.

How It Works

Step 1: Complexity Classification

The classifier scores each incoming request on a difficulty scale. This is the most important component in the system, and getting it wrong cascades into either wasted money (over-routing) or degraded quality (under-routing).

Three classification approaches, ordered by sophistication:

Heuristic rules. Check token count, presence of code blocks, conversation turn count, and keywords. A request under 50 tokens with no code is almost certainly simple. A request with "debug this error" and a 200-line stack trace is complex. Heuristics are fast (sub-millisecond), free, and surprisingly effective for the first 70% of routing decisions.

Small classifier model. A fine-tuned BERT or a small LLM (GPT-3.5 Turbo, Claude Haiku) that takes the request and outputs a complexity score. This adds 50-100ms of latency and costs a fraction of a cent per classification. The classifier is trained on historical data: requests labeled with the cheapest model tier that produced acceptable output.

Cascade routing. Skip the classifier entirely. Try the cheapest model first, evaluate the output quality, and escalate if it fails. FrugalGPT demonstrated this approach: query a cheap model such as GPT-3.5 first, check the output with a quality scorer, and call GPT-4 only if the score falls below a threshold. This is the most accurate option, but hard tasks pay for it twice: they hit two models, which adds both latency and cost.
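To make the cascade concrete, here is a minimal sketch. It is not FrugalGPT's implementation: the `Model` and `Scorer` types, the stub models, and the length-based scorer are all hypothetical stand-ins so the example runs without API access. A real system would plug in provider calls and a learned quality scorer.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a model is any callable from prompt to text,
# and a scorer maps (prompt, answer) to a quality score in [0, 1].
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def cascade(prompt: str, tiers: List[Tuple[str, Model]], scorer: Scorer,
            threshold: float = 0.8) -> Tuple[str, str, int]:
    """Try models cheapest-first; return (model_name, answer, calls_made)."""
    if not tiers:
        raise ValueError("at least one model tier is required")
    calls = 0
    name, answer = "", ""
    for name, model in tiers:
        answer = model(prompt)
        calls += 1
        if scorer(prompt, answer) >= threshold:
            break  # good enough, stop escalating
    # If no tier passed the threshold, the last tier's answer is returned.
    return name, answer, calls


# Stub models and scorer (assumptions for illustration only).
def cheap_model(prompt: str) -> str:
    return "short answer"


def strong_model(prompt: str) -> str:
    return "a longer, more careful answer"


def length_scorer(prompt: str, answer: str) -> float:
    # Toy stand-in for a learned quality scorer: longer answers score higher.
    return min(len(answer) / 20.0, 1.0)


tiers = [("cheap", cheap_model), ("strong", strong_model)]
result = cascade("Explain this stack trace", tiers, length_scorer)
print(result[0], result[2])  # which tier answered, and how many calls it took
```

With these stubs the cheap answer scores below the threshold, so the request escalates and `calls_made` is 2. That second call is the latency and cost penalty hard tasks pay under cascade routing.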

Approach	Latency Added	Cost	Accuracy	Best For
Heuristic rules	< 1ms	Free	70-80%	High-volume, latency-sensitive
Small classifier	50-100ms	$0.001/req	85-92%	Balanced cost/quality
Cascade routing	200-500ms (on escalation)	Varies	95%+	Quality-critical, budget-flexible
Hybrid (heuristic + classifier)	10-50ms	$0.0005/req	88-94%	Production sweet spot

I've found the hybrid approach works best in production: heuristics handle the obvious cases (80% of traffic), and the classifier only runs on ambiguous requests.
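A rough sketch of that hybrid split follows. The thresholds (50 and 200 whitespace-separated tokens), the keyword list, and the `fallback` callable standing in for the small classifier are illustrative assumptions, not tuned values.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # markdown code-fence marker
# Hypothetical keyword list; a real deployment would derive this from traffic.
COMPLEX_HINTS = re.compile(
    r"\b(debug|traceback|stack trace|refactor|race condition)\b", re.I
)


def heuristic_tier(text: str) -> Optional[str]:
    """Return 'simple' or 'complex' for obvious cases, None when unsure."""
    tokens = len(text.split())  # crude whitespace token count
    has_code = FENCE in text
    has_hint = COMPLEX_HINTS.search(text) is not None
    if has_hint and (has_code or tokens > 200):
        return "complex"
    if tokens < 50 and not has_code and not has_hint:
        return "simple"
    return None  # ambiguous: defer to the classifier


def classify(text: str, fallback: Callable[[str], str]) -> str:
    """Hybrid: heuristics first, classifier only on ambiguous requests."""
    tier = heuristic_tier(text)
    return tier if tier is not None else fallback(text)
```

Because `fallback` only runs when the heuristics return `None`, the classifier's latency and cost apply only to the ambiguous slice of traffic.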

Step 2: Budget Tracking and Enforcement

Budget pools operate at multiple granularities. Each pool has a hard cap that the routing layer enforces before every LLM call.

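The cost estimate before each call uses: estimated_cost = estimated_input_tokens * input_price_per_token + estimated_output_tokens * output_price_per_token. Input and output tokens are priced separately because providers typically charge more for output. Token estimation is imprecise (typically +/- 20%), so production systems commonly add a 30% safety margin.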
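The estimate can be sketched in a few lines. This version assumes separate input and output prices quoted per million tokens, uses a rough 4-characters-per-token rule instead of a real tokenizer, and treats the output token limit as the worst case. The function names are hypothetical.

```python
import math

SAFETY_MARGIN = 0.30  # the 30% margin mentioned above


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, math.ceil(len(text) / 4))


def estimate_cost(prompt: str, max_output_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Padded dollar estimate for one call; prices are per million tokens."""
    input_tokens = estimate_tokens(prompt)
    raw = (input_tokens * input_price_per_mtok
           + max_output_tokens * output_price_per_mtok) / 1_000_000
    return raw * (1 + SAFETY_MARGIN)
```

For example, a 4,000-character prompt (about 1,000 tokens) with a 500-token output limit at $3 and $15 per million tokens comes to (3,000 + 7,500) / 1,000,000 = $0.0105 before the margin and $0.01365 after it.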

Budget enforcement happens in code, not in prompts. This is the "hard cap" part. The model never sees the budget. The routing layer simply refuses to make an API call that would exceed the remaining allocation.
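A minimal sketch of that check, assuming one pool per granularity (session, user, organization). `BudgetPool`, `reserve`, and `BudgetExceeded` are hypothetical names, and the sketch is single-threaded: a production version would need atomic updates, for example in a shared store.

```python
from dataclasses import dataclass
from typing import List


class BudgetExceeded(Exception):
    """Raised before any API call that would overspend a pool."""


@dataclass
class BudgetPool:
    name: str
    cap_usd: float
    spent_usd: float = 0.0

    @property
    def remaining(self) -> float:
        return self.cap_usd - self.spent_usd


def reserve(pools: List[BudgetPool], estimated_cost: float) -> None:
    """Check every pool first, then charge all of them (all-or-nothing)."""
    for pool in pools:
        if estimated_cost > pool.remaining:
            raise BudgetExceeded(
                f"{pool.name}: needs ${estimated_cost:.4f}, "
                f"has ${pool.remaining:.4f}"
            )
    # Only reached when every pool can afford the call.
    for pool in pools:
        pool.spent_usd += estimated_cost
```

The router would call `reserve` immediately before the provider call and treat `BudgetExceeded` as the signal to downgrade or reject. Checking all pools before charging any of them keeps a rejected request from leaving a partial charge behind.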

Budget replenishment strategies:

Fixed window: $50/day resets at midnight UTC. Simple but causes end-of-day budget exhaustion.
Sliding window: $50 in any rolling 24-hour period. Smoother distribution but harder to implement.
Token bucket: Steady rate of budget "drip" with burst allowance. Best for bursty workloads.
Priority reserve: Hold 20% of budget for high-priority tasks. Prevents cheap tasks from starving important ones.
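As one illustration, the token bucket and priority reserve can be combined in a single structure. The sketch below measures the bucket in dollars and takes the current time as an argument so it can be tested deterministically. The class name, parameters, and the choice to let only high-priority requests dip below the reserve are assumptions.

```python
class BudgetBucket:
    """Token bucket measured in dollars, with a reserve for high-priority work."""

    def __init__(self, refill_usd_per_sec: float, burst_usd: float,
                 reserve_fraction: float = 0.20, now: float = 0.0):
        self.rate = refill_usd_per_sec
        self.capacity = burst_usd                    # maximum burst allowance
        self.reserve = burst_usd * reserve_fraction  # held back for high priority
        self.level = burst_usd                       # start with a full bucket
        self.last = now

    def _refill(self, now: float) -> None:
        # Drip budget back in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.last = now

    def try_spend(self, cost: float, now: float,
                  high_priority: bool = False) -> bool:
        """Spend if allowed; normal requests may not dip into the reserve."""
        self._refill(now)
        floor = 0.0 if high_priority else self.reserve
        if self.level - cost < floor:
            return False
        self.level -= cost
        return True
```

In a running service, `now` would typically come from `time.monotonic()`. Passing it in explicitly keeps the refill arithmetic easy to verify.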

Step 3: Model Selection and Routing Logic

The selector combines the complexity score with the remaining budget to pick the optimal model. The algorithm is straightforward: find the cheapest model whose capability tier meets or exceeds the classified difficulty, and whose estimated cost fits within the remaining budget.
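That selection rule can be sketched as follows. The `ModelSpec` fields, the integer tier scale, and the downgrade behavior (fall back to the strongest affordable model, otherwise reject) are assumptions for illustration. The catalog entries are placeholders, not real prices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    tier: int            # capability tier: higher handles harder tasks
    est_cost_usd: float  # estimated cost of this request on this model


def select_model(catalog: List[ModelSpec], difficulty: int,
                 remaining_budget: float,
                 allow_downgrade: bool = True) -> Optional[ModelSpec]:
    """Cheapest capable model within budget; else downgrade; else None (reject)."""
    affordable = [m for m in catalog if m.est_cost_usd <= remaining_budget]
    capable = [m for m in affordable if m.tier >= difficulty]
    if capable:
        return min(capable, key=lambda m: m.est_cost_usd)
    if allow_downgrade and affordable:
        # Downgrade: strongest model that fits, cheapest among equal tiers.
        return max(affordable, key=lambda m: (m.tier, -m.est_cost_usd))
    return None


# Placeholder catalog (hypothetical names and costs).
CATALOG = [
    ModelSpec("small", tier=1, est_cost_usd=0.001),
    ModelSpec("mid", tier=2, est_cost_usd=0.01),
    ModelSpec("frontier", tier=3, est_cost_usd=0.10),
]
```

A hard request (difficulty 3) with $0.05 remaining cannot afford `frontier`, so it is downgraded to `mid`. With `allow_downgrade=False` the same request is rejected instead.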
