Design LLM inference serving

TL;DR

Two request tiers require fundamentally different infrastructure: interactive (TTFT under 1s, streaming) and batch (maximize throughput, no streaming needed).
Route requests to the cheapest capable model using a lightweight classifier. Complex queries go to Sonnet/Opus; simple lookups go to Haiku.
KV cache management with PagedAttention (as in vLLM) is the single biggest GPU efficiency unlock. Shared prefix pages across requests with long system prompts cut memory use by 40-60%.
Autoscale on GPU memory utilization plus queue depth, not CPU. GPU memory at 70% means scale up; at 20% means scale down.
Cost allocation by team creates the right incentives. Without it, every team defaults to the most expensive model.
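
The autoscaling rule above can be sketched as a small decision function. This is a minimal illustration, not a production policy: the 70% and 20% memory thresholds come from the text, while `PoolMetrics`, `scaling_decision`, and the `max_queue_depth` default of 50 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PoolMetrics:
    gpu_memory_utilization: float  # 0.0-1.0, averaged across the serving pool
    queue_depth: int               # requests waiting for a slot


def scaling_decision(m: PoolMetrics, max_queue_depth: int = 50) -> str:
    """Return "scale_up", "scale_down", or "hold" for one serving pool."""
    # Scale up when memory is tight OR requests are piling up (assumed limit).
    if m.gpu_memory_utilization >= 0.70 or m.queue_depth > max_queue_depth:
        return "scale_up"
    # Scale down only when memory is mostly idle AND nothing is waiting.
    if m.gpu_memory_utilization <= 0.20 and m.queue_depth == 0:
        return "scale_down"
    return "hold"
```

Requiring an empty queue before scaling down is a conservative choice made for this sketch; it avoids removing capacity while requests are still waiting.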

Requirements

Functional requirements

10 product teams can make LLM API calls to models including GPT-4o, Claude 3.5 Sonnet, and a self-hosted Llama 3 70B.
Interactive requests stream responses (token-by-token) with TTFT under 1 second.
Batch requests run in the background with no streaming requirement; throughput-maximized.
Teams can specify model preference or allow automatic routing based on query complexity.
Per-team usage (tokens in/out, cost) is tracked and reported daily.

Non-functional requirements

10K concurrent users across all teams, with individual team peaks up to 2K.
Interactive tier: TTFT under 1s at P90, total response time under 10s.
Batch tier: throughput maximized; 24-hour completion guarantee for queued jobs.
99.9% availability for the interactive tier. Batch tier allows degraded operation (slower queues) during incidents.
Self-hosted models must cost at least 40% less than equivalent commercial API spend at scale.

The core entities

InferenceRequest

request_id, team_id, model_preference, tier (interactive/batch), prompt_tokens, max_completion_tokens, stream, priority, created_at

ModelInstance

instance_id, model_id, gpu_node_id, status (active/draining), queue_depth, memory_utilization_pct, requests_per_second

UsageRecord

record_id, team_id, request_id, model_id, input_tokens, output_tokens, cost_usd, latency_ms, timestamp

ModelConfig

model_id, provider (openai/anthropic/self-hosted), tier_eligibility[], cost_per_1k_input, cost_per_1k_output, max_context_tokens, routing_weight
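
As a rough sketch, two of these entities could be expressed as Python dataclasses. The field names are taken from the lists above; the types, defaults, and the `Tier` enum are assumptions made for illustration. `ModelInstance` and `UsageRecord` would follow the same pattern.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Tier(str, Enum):
    INTERACTIVE = "interactive"
    BATCH = "batch"


@dataclass
class InferenceRequest:
    request_id: str
    team_id: str
    tier: Tier
    prompt_tokens: int
    max_completion_tokens: int
    model_preference: Optional[str] = None  # assumed: None means automatic routing
    stream: bool = False
    priority: int = 0                       # assumed: higher value = more urgent
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class ModelConfig:
    model_id: str
    provider: str                 # "openai", "anthropic", or "self-hosted"
    tier_eligibility: list[Tier]
    cost_per_1k_input: float      # USD per 1,000 input tokens
    cost_per_1k_output: float     # USD per 1,000 output tokens
    max_context_tokens: int
    routing_weight: float = 1.0
```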

API design

POST /v1/chat/completions (unified inference endpoint, OpenAI-compatible)

Request:  {
  "model": "auto",
  "messages": [{"role": "user", "content": "Summarize this document..."}],
  "stream": true,
  "x-tier": "interactive",
  "x-team-id": "team_payments"
}
Response: SSE stream of { "delta": { "content": "token" }, "model_used": "claude-haiku-3" }
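
On the client side, consuming this stream amounts to reading `data:` lines and pulling out `delta.content`. The sketch below assumes the chunk shape shown above and an OpenAI-style `[DONE]` sentinel at the end of the stream; the sentinel and the function name `iter_stream_tokens` are assumptions, not part of the API as specified here.

```python
import json
from typing import Iterable, Iterator


def iter_stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from an iterable of raw SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        content = chunk.get("delta", {}).get("content")
        if content:
            yield content
```

Any HTTP client that exposes the response body line by line can feed this function; joining the yielded fragments gives the full completion text.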

POST /v1/batch (submit background batch job)

Request:  { "requests": [...], "callback_url": "https://...", "priority": "normal" }
Response: { "batch_id": "batch_abc", "estimated_completion": "2026-04-05T14:00:00Z" }

GET /v1/usage?team_id=team_payments&date=2026-04-05

Response: { "team_id": "team_payments", "input_tokens": 4200000, "output_tokens": 840000, "cost_usd": 12.34 }
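
The daily usage response can be derived by summing UsageRecord rows against the per-model prices in ModelConfig. The following is a minimal sketch under stated assumptions: records are plain dicts, `prices` maps a model ID to an `(input, output)` price pair per 1,000 tokens, and the function names are hypothetical.

```python
from collections import defaultdict


def request_cost_usd(input_tokens, output_tokens, cost_per_1k_input, cost_per_1k_output):
    """Cost of one request, priced per 1,000 tokens in each direction."""
    return (input_tokens / 1000) * cost_per_1k_input + (
        output_tokens / 1000
    ) * cost_per_1k_output


def daily_usage_by_team(records, prices):
    """Aggregate usage records into one summary dict per team."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    )
    for r in records:
        in_price, out_price = prices[r["model_id"]]
        t = totals[r["team_id"]]
        t["input_tokens"] += r["input_tokens"]
        t["output_tokens"] += r["output_tokens"]
        t["cost_usd"] += request_cost_usd(
            r["input_tokens"], r["output_tokens"], in_price, out_price
        )
    # Round only at the end so per-request rounding errors do not accumulate.
    return {
        team: {"team_id": team, **t, "cost_usd": round(t["cost_usd"], 2)}
        for team, t in totals.items()
    }
```

Floats keep the example short; a billing system would more likely use `decimal.Decimal` or integer micro-dollars.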

High-level design

All requests enter through an API gateway that handles authentication, rate limiting, and tier classification. The gateway stamps each request with the team ID and routes it to the model router. The model router makes two decisions: which model (based on complexity and cost), and which serving pool (interactive or batch) to send the request to.
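
The router's model decision can be sketched as "pick the cheapest eligible model that is capable enough". Everything here is illustrative: `estimate_complexity` is a toy keyword and length heuristic standing in for the lightweight classifier, and the `capability` score on `Candidate` is a hypothetical field that does not appear in ModelConfig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_id: str
    cost_per_1k_input: float
    capability: int   # hypothetical score: 1 = simple lookups, 3 = complex reasoning
    tiers: tuple      # serving tiers this model may be used for


def estimate_complexity(prompt: str) -> int:
    """Toy stand-in for a trained classifier; returns 1 (simple) to 3 (complex)."""
    words = len(prompt.split())
    hard_markers = ("prove", "refactor", "step by step")
    if words > 400 or any(k in prompt.lower() for k in hard_markers):
        return 3
    if words > 50:
        return 2
    return 1


def route(prompt: str, tier: str, candidates, model_preference: str = "auto") -> str:
    """Return the model_id to use for this request."""
    if model_preference != "auto":
        return model_preference  # an explicit team preference wins
    needed = estimate_complexity(prompt)
    eligible = [c for c in candidates if tier in c.tiers and c.capability >= needed]
    if not eligible:
        raise LookupError(f"no model for tier={tier!r} at complexity {needed}")
    return min(eligible, key=lambda c: c.cost_per_1k_input).model_id
```

Raising when no model qualifies keeps the failure visible; a real router might instead fall back to the most capable model available.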

Self-hosted models (Llama 3 70B) run on a dedicated GPU cluster managed by vLLM. Third-party models (GPT-4o, Claude) are proxied through their commercial APIs with retry logic and circuit breakers. The gateway presents a unified OpenAI-compatible interface regardless of the backend, so teams don't need to change code when the routing changes.
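
A circuit breaker in front of each third-party provider can be sketched as follows. This is a minimal illustration rather than the gateway's actual implementation; the class name, the threshold of 5 consecutive failures, and the 30-second cool-down are all assumed values.

```python
import time


class CircuitBreaker:
    """Stop calling a failing provider until a cool-down period has passed."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()
```

The proxy would call `allow_request()` before each upstream call and fail over to another backend when it returns `False`, then report the outcome with `record_success()` or `record_failure()`.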

The serving layer is where GPU economics change fundamentally. vLLM's continuous batching replaces static batching: as soon as any sequence in a batch finishes generating, a new request takes the freed slot instead of waiting for the whole batch to drain. GPU utilization typically rises from 40-60% (static batching) to 70-90% (continuous batching). Combined with PagedAttention's memory management, the same GPU fleet can serve 4-6x more concurrent requests.
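
The difference between the two batching strategies can be seen in a toy simulation that counts decode steps. This is a simplified model, not vLLM's scheduler: each request is represented only by the number of tokens it still has to generate (at least 1), each step produces one token per running sequence, and prefill and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # A sequence with 1 token left finishes on this step and frees its slot.
        running = [r - 1 for r in running if r > 1]
    return steps
```

With a mix of short and long requests, for example `[2, 8, 2, 8, 2, 8, 2, 8]` and a batch size of 2, the continuous scheduler finishes in fewer decode steps because short requests no longer hold a slot idle while a long neighbor completes.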

Inference request lifecycle

The lifecycle below traces a single interactive request from arrival to final response through 8 stages. In this walkthrough, a KV cache hit on the shared system prompt prefix eliminates 40% of prefill work, and continuous batching lets the request join a running GPU batch between decode steps rather than waiting for a new batch to form.

Stages, in order: Request Arrives, Auth + Quota, Model Router, KV Cache Check, Batch Scheduler, GPU Worker, Token Streaming, Response Complete.

The AI-specific challenges

KV cache management with PagedAttention

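During autoregressive generation, the key and value vectors for every token processed so far, prompt and generated alike, must stay in GPU memory. For a 70B model generating 2K tokens at 16-bit precision, a single request can consume several gigabytes of KV cache (roughly 8GB in the worst case; the exact figure depends on layer count, KV head count, and prompt length, and grouped-query attention lowers it considerably). Without KV cache management, that budget allows only 5-6 concurrent sequences per 80GB of A100 memory, before accounting for the model weights themselves.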
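
The per-request KV cache footprint follows from a simple product: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses assumed shapes: 80 layers and 128-dimensional heads for a 70B-class model, with 64 KV heads for full multi-head attention and 8 KV heads for grouped-query attention (the layout Llama 3 70B uses). The function name is hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """KV cache size for one sequence; bytes_per_value=2 means 16-bit precision."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed shapes for a 70B-class model with a 2,048-token sequence.
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, num_tokens=2048)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=2048)

print(f"full multi-head attention: {full_mha / GIB:.3f} GiB")
print(f"grouped-query attention:   {gqa / GIB:.3f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, so a long system prompt counts against the budget just as generated tokens do. That is why sharing prefix pages across requests matters.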
