Design LLM inference serving
Walk through designing a high-throughput LLM serving system supporting 10,000 concurrent users with SLAs on TTFT, cost per token, and availability across multiple model tiers.
TL;DR
- Two request tiers require fundamentally different infrastructure: interactive (TTFT under 1s, streaming) and batch (maximize throughput, no streaming needed).
- Route requests to the cheapest capable model using a lightweight classifier. Complex queries go to Sonnet/Opus; simple lookups go to Haiku.
- KV cache management with PagedAttention (as in vLLM) is the single biggest GPU efficiency unlock. Shared prefix pages across requests with long system prompts cut memory use by 40-60%.
- Autoscale on GPU memory utilization plus queue depth, not CPU. GPU memory at 70% means scale up; at 20% means scale down.
- Cost allocation by team creates the right incentives. Without it, every team defaults to the most expensive model.
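The autoscaling rule in the TL;DR can be sketched as a small decision function. This is a minimal illustration, not a production policy: the 70% and 20% memory thresholds come from the text, while `PoolMetrics`, `MAX_QUEUE_DEPTH`, and its value of 50 are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass

SCALE_UP_MEMORY = 0.70    # from the text: GPU memory at 70% means scale up
SCALE_DOWN_MEMORY = 0.20  # from the text: GPU memory at 20% means scale down
MAX_QUEUE_DEPTH = 50      # assumption: queue-depth trigger, tune per pool


@dataclass
class PoolMetrics:
    gpu_memory_utilization: float  # 0.0-1.0, averaged across the pool
    queue_depth: int               # requests waiting for a serving slot


def scaling_decision(metrics: PoolMetrics) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for one serving pool."""
    if (metrics.gpu_memory_utilization >= SCALE_UP_MEMORY
            or metrics.queue_depth > MAX_QUEUE_DEPTH):
        return "scale_up"
    # Only shrink when memory is low AND nothing is waiting.
    if (metrics.gpu_memory_utilization <= SCALE_DOWN_MEMORY
            and metrics.queue_depth == 0):
        return "scale_down"
    return "hold"
```

Scaling down requires both signals to agree, which avoids removing capacity while requests are still queued.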
Requirements
Functional requirements
- 10 product teams can make LLM API calls to models including GPT-4o, Claude 3.5 Sonnet, and a self-hosted Llama 3 70B.
- Interactive requests stream responses (token-by-token) with TTFT under 1 second.
- Batch requests run in the background with no streaming requirement; throughput-maximized.
- Teams can specify model preference or allow automatic routing based on query complexity.
- Per-team usage (tokens in/out, cost) is tracked and reported daily.
Non-functional requirements
- 10K concurrent users across all teams, with individual team peaks up to 2K.
- Interactive tier: TTFT under 1s at P90, total response time under 10s.
- Batch tier: throughput maximized; 24-hour completion guarantee for queued jobs.
- 99.9% availability for the interactive tier. Batch tier allows degraded operation (slower queues) during incidents.
- Self-hosted models must cost at least 40% less than equivalent commercial API spend at scale.
The core entities
InferenceRequest
request_id, team_id, model_preference, tier (interactive/batch), prompt_tokens, max_completion_tokens, stream, priority, created_at
ModelInstance
instance_id, model_id, gpu_node_id, status (active/draining), queue_depth, memory_utilization_pct, requests_per_second
UsageRecord
record_id, team_id, request_id, model_id, input_tokens, output_tokens, cost_usd, latency_ms, timestamp
ModelConfig
model_id, provider (openai/anthropic/self-hosted), tier_eligibility[], cost_per_1k_input, cost_per_1k_output, max_context_tokens, routing_weight
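As a rough sketch of how these entities connect, the `cost_usd` on a UsageRecord can be derived from the per-1K prices on the ModelConfig. Only a subset of the fields is modeled here, and the function name `request_cost_usd` is a hypothetical helper, not something the design prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model_id: str
    provider: str              # "openai", "anthropic", or "self-hosted"
    cost_per_1k_input: float   # USD per 1,000 input tokens
    cost_per_1k_output: float  # USD per 1,000 output tokens


def request_cost_usd(config: ModelConfig, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, priced separately for input and output tokens."""
    input_cost = (input_tokens / 1000) * config.cost_per_1k_input
    output_cost = (output_tokens / 1000) * config.cost_per_1k_output
    return input_cost + output_cost
```

Prices passed to `ModelConfig` would come from each provider's current price list or, for self-hosted models, from an internal amortized GPU cost.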
API design
POST /v1/chat/completions (unified inference endpoint, OpenAI-compatible)
Request: {
"model": "auto",
"messages": [{"role": "user", "content": "Summarize this document..."}],
"stream": true,
"x-tier": "interactive",
"x-team-id": "team_payments"
}
Response: SSE stream of { "delta": { "content": "token" }, "model_used": "claude-haiku-3" }
POST /v1/batch (submit background batch job)
Request: { "requests": [...], "callback_url": "https://...", "priority": "normal" }
Response: { "batch_id": "batch_abc", "estimated_completion": "2026-04-05T14:00:00Z" }
GET /v1/usage?team_id=team_payments&date=2026-04-05
Response: { "team_id": "team_payments", "input_tokens": 4200000, "output_tokens": 840000, "cost_usd": 12.34 }
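The usage endpoint above can be backed by a simple aggregation over UsageRecord rows. The sketch below works in memory for illustration; a real system would more likely run the same grouping as a database query. `daily_usage` is a hypothetical helper name, and only the fields needed for the response are modeled.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable


@dataclass
class UsageRecord:
    team_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: datetime


def daily_usage(records: Iterable[UsageRecord], team_id: str, day: date) -> dict:
    """Sum one team's tokens and cost for a single calendar day."""
    totals = {"team_id": team_id, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    for record in records:
        if record.team_id == team_id and record.timestamp.date() == day:
            totals["input_tokens"] += record.input_tokens
            totals["output_tokens"] += record.output_tokens
            totals["cost_usd"] += record.cost_usd
    # Round once at the end so per-record rounding errors do not accumulate.
    totals["cost_usd"] = round(totals["cost_usd"], 2)
    return totals
```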
High-level design
All requests enter through an API gateway that handles authentication, rate limiting, and tier classification. The gateway stamps each request with the team ID and routes it to the model router. The model router makes two decisions: which model (based on complexity and cost), and which serving pool (interactive or batch) to send the request to.
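The router's two decisions can be sketched as follows. This is an illustrative stand-in, not the actual classifier: the keyword-and-length heuristic, the 0.5 threshold, and the labels `cheap-model` and `capable-model` are all assumptions made for the example. A production router would replace `estimate_complexity` with a trained lightweight classifier.

```python
from dataclasses import dataclass

# Assumption: placeholder labels, not real model identifiers.
CHEAP_MODEL = "cheap-model"
CAPABLE_MODEL = "capable-model"
COMPLEXITY_THRESHOLD = 0.5
REASONING_KEYWORDS = ("analyze", "prove", "step by step", "refactor", "compare")


@dataclass
class RoutingDecision:
    model: str
    pool: str  # "interactive" or "batch"


def estimate_complexity(prompt: str) -> float:
    """Crude 0.0-1.0 score: long prompts and reasoning keywords score higher."""
    length_score = min(len(prompt.split()) / 200, 1.0)
    lowered = prompt.lower()
    keyword_score = 0.5 if any(k in lowered for k in REASONING_KEYWORDS) else 0.0
    return min(length_score + keyword_score, 1.0)


def route(prompt: str, tier: str, model_preference: str = "auto") -> RoutingDecision:
    """Pick a model (unless the team pinned one) and a serving pool."""
    if model_preference != "auto":
        model = model_preference  # explicit team preference wins
    elif estimate_complexity(prompt) >= COMPLEXITY_THRESHOLD:
        model = CAPABLE_MODEL
    else:
        model = CHEAP_MODEL
    pool = "batch" if tier == "batch" else "interactive"
    return RoutingDecision(model=model, pool=pool)
```

Keeping the model choice and the pool choice independent means a batch job can still be routed to the capable model when its prompts warrant it.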
Self-hosted models (Llama 3 70B) run on a dedicated GPU cluster managed by vLLM. Third-party models (GPT-4o, Claude) are proxied through their commercial APIs with retry logic and circuit breakers. The gateway presents a unified OpenAI-compatible interface regardless of the backend, so teams don't need to change code when the routing changes.
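A circuit breaker for the third-party proxies might look like the sketch below. It is a generic version of the pattern under assumed defaults (3 consecutive failures, 30-second reset); the class and its parameters are illustrative and not tied to any specific library.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing upstream until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: after the cool-down, let a probe request through.
        return self._clock() - self._opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
```

The gateway would keep one breaker per upstream provider, check `allow_request()` before proxying, and fall back to another eligible model while the circuit is open.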
The serving layer is where GPU economics change fundamentally. vLLM's continuous batching replaces static batching: as soon as any sequence in a batch finishes generating, a new request takes the freed slot. GPU utilization rises from 40-60% under static batching to 70-90% under continuous batching. Combined with PagedAttention's memory management, the same GPU fleet can serve 4-6x more concurrent requests.
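To make the difference between the two batching strategies concrete, here is a toy simulation that counts decode steps. It is a simplified model, not how vLLM's scheduler is implemented: each request is reduced to the number of decode steps it needs (assumed positive), prefill is ignored, and `slots` is a fixed batch size.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes; short ones idle."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining decode steps per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: every active request emits a token.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps


lengths = [10, 2, 2, 2, 2, 2, 2, 2]
print(static_batching_steps(lengths, slots=2))      # 16
print(continuous_batching_steps(lengths, slots=2))  # 12
```

In this example one long request holds a slot for 10 steps. Static batching leaves the second slot idle for 8 of them, while continuous batching keeps it busy with the short requests, finishing the same work in 12 steps instead of 16.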
Inference request lifecycle
Tracing a single interactive request from arrival to final response shows two effects: the KV cache hit on the shared system prompt prefix eliminates 40% of prefill work, and continuous batching lets the request join a running batch mid-decode rather than waiting for a new batch to form.
The AI-specific challenges
KV cache management with PagedAttention
During autoregressive generation, the key-value vectors for every previously generated token must stay in GPU memory. For a 70B model generating 2K tokens at 16-bit precision, a single request consumes roughly 8GB of KV cache. Without KV cache management, you can only run 5-6 concurrent sequences on an 80GB A100.
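The per-request KV cache footprint follows from a short formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below applies it under assumed architecture numbers (80 layers, head dimension 128, 64 KV heads for full multi-head attention, 8 KV heads for grouped-query attention as used by Llama 3 70B). It counts only 2,048 tokens and ignores prompt context, so the totals will differ from rough figures that include the prompt.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; bytes_per_value=2 means 16-bit precision."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-class shapes; adjust to the model actually being served.
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=2048)
grouped = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=2048)

print(f"multi-head attention:    {full_mha / GIB:.3f} GiB per sequence")
print(f"grouped-query attention: {grouped / GIB:.3f} GiB per sequence")
```

The footprint scales linearly with sequence length and with the number of KV heads, which is why both long contexts and the attention variant matter when estimating how many sequences fit on one GPU.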