LLM pricing and cost optimization

TL;DR

LLMs are priced per token; output tokens cost 3-5x more than input tokens. A 10K-input / 2K-output call to Opus 4.6 costs $0.30; the same call to Sonnet 4.6 costs $0.06.
The 60/30/10 tiering rule routes 60% of agent tasks to cheap models, 30% to mid-tier, and 10% to flagships, producing a 3-5x or better cost reduction (about 5.6x in the worked example) with minimal quality loss.
Batch API processing (available on Claude and OpenAI) halves your bill for any workload that tolerates a 24-hour delay.
Prompt caching cuts system prompt costs by up to 87.5% for repeated calls: at Opus 4.6 input pricing, $150 uncached vs $18.75 cached for 1,000 calls with a 10K-token system prompt.
Context size dominates agentic cost. A 10-step agent carrying 50K tokens of context per step costs 5x more than the same agent compressed to 10K tokens per step.
The right model is the cheapest model that reliably completes the task, not the most capable one you can access.

The problem it solves

You build an AI agent. It works. Then you see the invoice.

At one 10K-input / 2K-output call per task, a production agent running 10,000 tasks per day on Opus 4.6 costs $90,000 per month. The same agent on Sonnet 4.6 costs $18,000 per month. If your agent makes 10 LLM calls per task, multiply those numbers by 10. Most teams do not calculate this until after launch, when the bill makes it impossible to ignore.

The gap is not about quality. Most tasks your agent handles (summarize this document, extract these fields, classify this input) do not require the most capable model available. They require a model that is good enough, reliably, at acceptable latency. The difference between "good enough" and "best available" is often 80% of your infrastructure cost.

This is what LLM cost optimization solves: matching task complexity to model capability so you spend money where it produces value.

What is it?

LLM cost optimization is the practice of minimizing token spend in AI systems while preserving acceptable output quality. It covers model selection, context management, batching strategies, caching, and output formatting.

Think of it like cloud compute right-sizing. When you first move to AWS, you run everything on the largest instance because it is easy. Then you right-size: small instances for static sites, medium for APIs, large for databases. LLM optimization is the same discipline applied to inference spend rather than compute spend.

The key insight is that "more expensive = better" is false at the task level. Opus 4.6 outperforms DeepSeek R2 on hard reasoning benchmarks. But on "does this email contain an order number? yes or no" tasks, they perform identically. You pay for the capability gap only when the task actually requires it.

How it works

Token-based pricing: the fundamentals

Every major LLM provider prices by token. A token is roughly 3/4 of a word: 1,000 words is approximately 1,333 tokens. The critical asymmetry is that input and output tokens have separate prices, and output tokens cost 3-5x more.
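As a rough sketch of those two rules of thumb, the snippet below estimates tokens from a word count and computes the output-to-input price ratio per model. The prices are copied from the pricing table in this section; the function names are illustrative, and a real tokenizer will give different counts depending on language and content.

```python
TOKENS_PER_WORD = 4 / 3  # rule of thumb: one token is about 3/4 of a word

# (input, output) in USD per 1M tokens, copied from this article's pricing table.
PRICES_PER_MILLION = {
    "Claude Opus 4.6": (15.00, 75.00),
    "GPT 5.4": (20.00, 80.00),
    "Gemini 3.1 Pro": (7.00, 21.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek R2": (0.55, 2.19),
}


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for English prose; not a substitute for a tokenizer."""
    return round(word_count * TOKENS_PER_WORD)


def output_to_input_ratio(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return output_price / input_price


print(f"1,000 words is roughly {estimate_tokens(1_000)} tokens")
for model in PRICES_PER_MILLION:
    print(f"{model}: output is {output_to_input_ratio(model):.1f}x input")
```

Every listed model lands between 3x and 5x, which is why output length shows up so prominently in the calculations that follow.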

Here are the 2026 pricing benchmarks for major models:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Best For
🧠 Claude Opus 4.6	$15.00	$75.00	Hard planning, synthesis, critical decisions
🧠 GPT 5.4	$20.00	$80.00	Complex reasoning, flagship capability
🔵 Gemini 3.1 Pro	$7.00	$21.00	Mid-tier reasoning, multimodal tasks
🔵 Claude Sonnet 4.6	$3.00	$15.00	Code generation, analysis, reasoning
⚡ DeepSeek R2	$0.55	$2.19	Classification, extraction, low-complexity generation
⚡ Haiku 3.5 / Flash tier	$0.25-$0.80	$1.00-$3.00	Simple routing, classification, fast lookups

A single API call with 10K input tokens and 2K output tokens costs:

Opus 4.6: (10,000 / 1,000,000) × $15 + (2,000 / 1,000,000) × $75 = $0.15 + $0.15 = $0.30
Sonnet 4.6: (10,000 / 1,000,000) × $3 + (2,000 / 1,000,000) × $15 = $0.03 + $0.03 = $0.06
DeepSeek R2: (10,000 / 1,000,000) × $0.55 + (2,000 / 1,000,000) × $2.19 = $0.0055 + $0.0044 ≈ $0.01

At 10,000 tasks per day, with one call per task, those unit costs become $3,000/day, $600/day, and about $100/day respectively. Multiply by 30 days: $90,000, $18,000, and about $3,000 per month. That is a 30x spread from the same workload, driven entirely by model selection.
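The arithmetic above fits in a few lines of Python. This is a minimal sketch: the model keys and function names are made up for illustration, the prices are the table values from this article, and `calls_per_task` is an assumed knob for agents that make several calls per task.

```python
# (input, output) in USD per 1M tokens, taken from the table in this article.
# The dictionary keys are illustrative labels, not official API model IDs.
PRICES_PER_MILLION = {
    "opus-4.6": (15.00, 75.00),
    "sonnet-4.6": (3.00, 15.00),
    "deepseek-r2": (0.55, 2.19),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call, pricing input and output separately."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 tasks_per_day: int, calls_per_task: int = 1, days: int = 30) -> float:
    """Scale a per-call cost up to a monthly bill."""
    per_call = call_cost(model, input_tokens, output_tokens)
    return per_call * calls_per_task * tasks_per_day * days


for model in PRICES_PER_MILLION:
    per_call = call_cost(model, 10_000, 2_000)
    per_month = monthly_cost(model, 10_000, 2_000, tasks_per_day=10_000)
    print(f"{model}: ${per_call:.4f} per call, ${per_month:,.0f} per month")
```

Without rounding, the DeepSeek R2 call comes to $0.00988, so its monthly figure is closer to $2,964 than $3,000. Passing `calls_per_task=10` reproduces the "multiply by 10" case for multi-call agents.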

The output token asymmetry is critical. If you ask a model for a 500-word explanation when a 50-word answer would suffice, you are spending 10x on output unnecessarily. Controlling output length is one of the highest-leverage optimization levers available, and it costs nothing to implement.
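To put a number on that, here is a small sketch that prices a 500-word answer against a 50-word answer at the Opus 4.6 output rate from the table. It assumes the 3/4-word-per-token rule of thumb, and the 10,000 calls per day is only an illustrative volume.

```python
def output_cost(words: int, output_price_per_million: float,
                tokens_per_word: float = 4 / 3) -> float:
    """Approximate USD cost of generating `words` words of output."""
    tokens = words * tokens_per_word
    return tokens * output_price_per_million / 1_000_000


OPUS_OUTPUT_PRICE = 75.00  # USD per 1M output tokens, from the table above

long_answer = output_cost(500, OPUS_OUTPUT_PRICE)
short_answer = output_cost(50, OPUS_OUTPUT_PRICE)
calls_per_day = 10_000  # assumed volume, for illustration only

print(f"500-word answer: ${long_answer:.4f}")
print(f"50-word answer:  ${short_answer:.4f}")
print(f"Daily saving at {calls_per_day:,} calls: "
      f"${(long_answer - short_answer) * calls_per_day:,.2f}")
```

In practice the lever is a tighter instruction ("answer in one sentence") or an output-token cap, whichever your provider supports.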

The 60/30/10 tiering rule

In multi-agent systems, not all tasks are equal. Sort your workload by complexity and route accordingly.

The 60/30/10 rule works because most production workloads follow a complexity power law. The majority of LLM calls do work that any competent model handles correctly: classifying, extracting, formatting, routing. A small minority requires flagship capability.

Applying 60/30/10 to 10,000 tasks per day, using the per-call costs computed above: 6,000 tasks at $0.01 (DeepSeek R2), 3,000 tasks at $0.06 (Sonnet 4.6), and 1,000 tasks at $0.30 (Opus 4.6) = $60 + $180 + $300 = $540/day, versus $3,000/day for all-Opus. That is roughly a 5.6x cost reduction. The quality difference, measured by task success rate across the full workload, is typically under 2%.
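The blended figure is easy to recompute for your own traffic mix. The sketch below uses the shares and rounded per-call costs from this section; the tier names and the function are illustrative.

```python
# (tier name, share of tasks, cost per call in USD), using this article's figures.
TIERS = [
    ("cheap", 0.60, 0.01),
    ("mid", 0.30, 0.06),
    ("flagship", 0.10, 0.30),
]


def blended_daily_cost(tasks_per_day: int, tiers) -> float:
    """Daily spend when each tier handles its share of the workload."""
    return sum(tasks_per_day * share * unit_cost for _, share, unit_cost in tiers)


tasks_per_day = 10_000
tiered = blended_daily_cost(tasks_per_day, TIERS)
all_flagship = tasks_per_day * 0.30  # every task sent to the flagship model

print(f"Tiered: ${tiered:,.0f}/day")
print(f"All flagship: ${all_flagship:,.0f}/day")
print(f"Reduction: {all_flagship / tiered:.1f}x")
```

Changing the shares shows how sensitive the bill is to the flagship slice: it is only 10% of the traffic but more than half of the tiered spend.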

I have seen teams resist this because they fear quality degradation. The right response is to measure it empirically: run each task type against three model tiers, measure success rate, and set the threshold. You will almost always find the cheap model handles 60-70% of tasks with no meaningful quality difference.
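One way to turn that measurement into a routing decision is sketched below: given evaluation results per task type, pick the cheapest tier whose success rate clears your threshold. Everything here is hypothetical, including the class, the tier names, and especially the success counts, which are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierResult:
    model: str
    cost_per_call: float  # USD
    successes: int
    trials: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0


def cheapest_passing(results: list[TierResult],
                     min_success_rate: float) -> Optional[TierResult]:
    """Return the lowest-cost tier that meets the threshold, or None if none does."""
    passing = [r for r in results if r.success_rate >= min_success_rate]
    return min(passing, key=lambda r: r.cost_per_call) if passing else None


# Placeholder evaluation results; replace with numbers from your own test set.
classify_results = [
    TierResult("cheap-tier", 0.01, successes=97, trials=100),
    TierResult("mid-tier", 0.06, successes=99, trials=100),
    TierResult("flagship-tier", 0.30, successes=99, trials=100),
]
planning_results = [
    TierResult("cheap-tier", 0.01, successes=70, trials=100),
    TierResult("mid-tier", 0.06, successes=88, trials=100),
    TierResult("flagship-tier", 0.30, successes=96, trials=100),
]

for task, results in [("classify", classify_results), ("plan", planning_results)]:
    choice = cheapest_passing(results, min_success_rate=0.95)
    print(task, "->", choice.model if choice else "no tier meets the threshold")
```

Returning `None` when nothing passes is deliberate: it forces a decision (lower the threshold, improve the prompt, or escalate) rather than falling back silently to the most expensive model.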

Context cost dominates agentic pipelines

Single-call cost estimates miss the dominant cost driver in agent systems: context accumulation across steps.

Each tool call in an agentic loop is a new LLM invocation. Each invocation includes the full conversation history plus tool outputs. A 10-step agent loop where each step carries 50K tokens of context costs:

50K input tokens × 10 steps = 500K total input tokens per run
At Opus 4.6 ($15/M): (500,000 / 1,000,000) × $15 = $7.50 per run

The same agent with context compression (keeping each step to 10K tokens through summarization and pruning):

10K input tokens × 10 steps = 100K total input tokens per run
At Opus 4.6: (100,000 / 1,000,000) × $15 = $1.50 per run

That is a 5x reduction in input cost from context management alone. If this agent runs 1,000 times per day, the difference is $7,500/day versus $1,500/day, or $225,000/month versus $45,000/month. Context compression is not optional at scale.
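The per-run numbers can be reproduced with a short sketch. The first function models the flat case used above (same context size at every step). The second is an assumed variant in which context grows by a fixed amount per step, which is closer to how an uncompressed history behaves. Both count input tokens only, and the function names and growth parameters are illustrative.

```python
OPUS_INPUT_PRICE = 15.00  # USD per 1M input tokens, from the table above


def flat_run_cost(context_tokens: int, steps: int, price_per_million: float) -> float:
    """Input cost of a run where every step carries the same context size."""
    return context_tokens * steps * price_per_million / 1_000_000


def growing_run_cost(base_tokens: int, added_per_step: int, steps: int,
                     price_per_million: float) -> float:
    """Input cost of a run where each step adds tokens to the carried history."""
    total_tokens = sum(base_tokens + added_per_step * step for step in range(steps))
    return total_tokens * price_per_million / 1_000_000


uncompressed = flat_run_cost(50_000, 10, OPUS_INPUT_PRICE)
compressed = flat_run_cost(10_000, 10, OPUS_INPUT_PRICE)
print(f"Uncompressed: ${uncompressed:.2f} per run")
print(f"Compressed:   ${compressed:.2f} per run")
print(f"Monthly gap at 1,000 runs/day: ${(uncompressed - compressed) * 1_000 * 30:,.0f}")

# Assumed growth pattern: start at 5K tokens and add 10K of tool output per step.
print(f"Growing history: ${growing_run_cost(5_000, 10_000, 10, OPUS_INPUT_PRICE):.2f} per run")
```

With these assumed growth parameters the history averages 50K tokens per step, so the growing case costs the same as the flat 50K case, which suggests the flat model is a reasonable stand-in for an accumulating context.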

Batch API: 50% off for deferred work
