Design an LLM gateway
Walk through designing an LLM gateway that routes requests across multiple providers, enforces rate limits and cost budgets, provides observability, and handles failover, all while adding less than 50ms of overhead.
TL;DR
- The gateway is a thin reverse proxy (sub-50ms overhead) between all internal services and LLM providers. It centralizes model routing, rate limiting, cost tracking, and failover so individual teams do not manage provider integrations themselves.
- Intelligent routing with a lightweight classifier saves 40-60% on LLM spend by sending simple lookups to GPT-4o-mini ($0.15/1M input) and reserving GPT-4o ($2.50/1M input) or Claude Opus ($15/1M input) for complex reasoning tasks.
- Token-budget rate limiting beats request-count rate limiting. One request with 100K tokens costs 1,000x more than one request with 100 tokens. The rate limiter must track both request count (to protect the gateway) and token budget (to protect the budget).
- Circuit breaker failover with model equivalency mapping (GPT-4o to Claude 3.5 Sonnet, GPT-4o-mini to Claude Haiku) keeps traffic flowing during provider outages. Each provider averages 2-4 incidents per month lasting 5-30 minutes.
- The production lesson that separates senior engineers from juniors: streaming complicates everything. You do not know the output token count until the stream finishes, so rate limiting, cost tracking, and timeout handling all need post-stream reconciliation.
Requirements
Functional requirements
- The gateway proxies all LLM API calls from internal services to external providers (OpenAI, Anthropic, Google) through a unified API interface.
- The gateway routes each request to the cheapest capable model based on query complexity classification (simple, medium, hard).
- Each team has a configurable token budget (daily and monthly caps) with automatic enforcement and alerting at 80% usage.
- The gateway retries failed requests against an alternate provider when the primary provider returns 5xx errors or times out.
- The gateway logs every request with latency, input/output tokens, cost, model used, and team ID for observability and audit.
- The gateway supports a semantic cache that returns cached responses for identical or near-identical prompts.
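The routing requirement can be sketched as a lookup from complexity tier to model. The tier names (simple, medium, hard) come from the requirement above. Everything else is an assumption for illustration: the tier-to-model table, the `route` function, and especially `classify`, which is a keyword-and-length placeholder standing in for the lightweight classifier, not a real one.

```python
# Hypothetical tier-to-model table. Model names follow the ones used in this
# article and are not necessarily exact provider model IDs.
TIER_TO_MODEL = {
    "simple": "gpt-4o-mini",
    "medium": "gpt-4o",
    "hard": "claude-opus",
}

# Placeholder signal words; a production classifier would be trained, not hand-written.
REASONING_HINTS = ("prove", "step by step", "analyze")


def classify(prompt: str) -> str:
    """Toy heuristic: reasoning keywords and prompt length push the tier up."""
    words = len(prompt.split())
    reasoning = any(hint in prompt.lower() for hint in REASONING_HINTS)
    if reasoning and words > 200:
        return "hard"
    if reasoning or words > 50:
        return "medium"
    return "simple"


def route(model_requested: str, prompt: str, classifier=classify) -> str:
    """Honor a pinned model; otherwise pick the model mapped to the prompt's tier."""
    if model_requested != "auto":
        return model_requested
    return TIER_TO_MODEL[classifier(prompt)]
```

Passing the classifier in as a parameter keeps the routing table testable on its own and lets you swap the heuristic for a real model later without touching callers.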
Non-functional requirements
- Gateway routing overhead: P99 under 50ms (excluding LLM inference latency).
- Throughput: 10,000 LLM requests per minute from all teams combined.
- Availability: 99.95% uptime. A gateway outage blocks all AI features across the company.
- Cost reduction: at least 40% savings compared to every team using GPT-4o for all requests.
- Failover time: under 2 seconds to detect a provider failure and route to the backup.
- Cache hit rate: 30-50% for support and FAQ workloads, reducing both latency and cost.
The hardest engineering problem here: streaming breaks your accounting. When a request uses server-sent events, you know the input tokens up front but the output tokens trickle in one chunk at a time. Your rate limiter must reserve a pessimistic token budget on entry, then reconcile the actual usage after the stream completes. Get this wrong and teams either blow past their budgets or get blocked unnecessarily.
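Here is a minimal sketch of that reserve-then-reconcile pattern. It assumes a single process with an in-memory counter and a lock; the `TokenBudget` class and its method names are hypothetical, and a real gateway would keep these counters in shared state rather than in local memory.

```python
import threading
from typing import Optional


class TokenBudget:
    """In-memory sketch of a per-team token budget with pessimistic holds."""

    def __init__(self, daily_limit_tokens: int):
        self.limit = daily_limit_tokens
        self.used = 0       # tokens confirmed by completed requests
        self.reserved = 0   # worst-case holds for in-flight streams
        self._lock = threading.Lock()

    def reserve(self, input_tokens: int, max_output_tokens: int) -> Optional[int]:
        """Hold input plus worst-case output. Returns the hold size, or None if over budget."""
        hold = input_tokens + max_output_tokens
        with self._lock:
            if self.used + self.reserved + hold > self.limit:
                return None
            self.reserved += hold
            return hold

    def reconcile(self, hold: int, input_tokens: int, actual_output_tokens: int) -> None:
        """After the stream ends, release the hold and record what was actually used."""
        with self._lock:
            self.reserved -= hold
            self.used += input_tokens + actual_output_tokens


budget = TokenBudget(daily_limit_tokens=1000)
hold = budget.reserve(input_tokens=100, max_output_tokens=500)  # holds 600
blocked = budget.reserve(input_tokens=100, max_output_tokens=400)  # would exceed 1000
budget.reconcile(hold, input_tokens=100, actual_output_tokens=50)  # only 150 really used
```

The second request is rejected while the first stream is in flight, then fits once the first reconciles. That gap between the hold and the real usage is exactly the "blocked unnecessarily" risk, so the size of the pessimistic estimate is a tuning decision.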
The core entities
GatewayRequest
request_id, team_id, model_requested, model_routed, prompt_hash, input_tokens, estimated_output_tokens, priority, received_at
GatewayResponse
response_id, request_id, provider_used, model_used, output_tokens, latency_ms, cost_usd, cache_hit (boolean), completed_at, status (success/failed/fallback)
TeamBudget
team_id, daily_limit_tokens, monthly_limit_tokens, daily_used_tokens, monthly_used_tokens, burst_allowance, auto_downgrade_enabled, alert_threshold_pct
ProviderHealth
provider_id, endpoint, error_rate_1m, error_rate_5m, p50_latency_ms, p99_latency_ms, circuit_state (closed/open/half-open), last_failure_at, last_success_at
ModelEquivalency
model_a, model_b, task_compatibility_score, format_adapter (openai-to-anthropic, etc.), prompt_rewrite_required (boolean)
CacheEntry
cache_key (prompt_hash + model + temperature), embedding_vector, response_text, token_count, created_at, ttl, hit_count
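To make the CacheEntry key concrete, here is one way the exact-match key could be derived. The choice of SHA-256 over canonical JSON, the `:` separator, and the two-decimal temperature format are all assumptions; the entity definition only says the key combines prompt hash, model, and temperature.

```python
import hashlib
import json


def prompt_hash(messages: list) -> str:
    """Hash a canonical JSON form of the messages so key order and spacing do not matter."""
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(messages: list, model: str, temperature: float) -> str:
    """Combine prompt hash, model, and temperature into one lookup key."""
    return f"{prompt_hash(messages)}:{model}:{temperature:.2f}"


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
key = cache_key(messages, "gpt-4o-mini", 0.0)
```

This covers identical prompts only. Near-identical prompts need a second lookup that compares `embedding_vector` values by similarity, which this sketch does not attempt.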
API design
POST /v1/chat/completions - unified LLM inference endpoint (OpenAI-compatible)
Request: {
  "model": "auto",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "temperature": 0.0,
  "stream": true,
  "metadata": {
    "team_id": "team-search",
    "priority": "high",
    "bypass_cache": false
  }
}
Response (non-streaming): {
  "id": "gw_req_abc123",
  "model": "gpt-4o-mini",
  "routed_from": "auto",
  "provider": "openai",
  "choices": [{ "message": { "content": "The capital of France is Paris." } }],
  "usage": { "input_tokens": 28, "output_tokens": 9, "cost_usd": 0.0000096 },
  "gateway_metadata": {
    "routing_decision": "simple_query",
    "cache_hit": false,
    "gateway_latency_ms": 12,
    "total_latency_ms": 340
  }
}
The unified endpoint keeps the OpenAI chat completions format so teams can switch to the gateway by changing one base URL. The model: "auto" option triggers intelligent routing. Teams can still pin a specific model if needed.
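The `cost_usd` field in the sample response can be reproduced with simple per-token arithmetic. In this sketch the input prices come from the figures quoted earlier in this article; the output prices ($0.60 and $10.00 per 1M tokens) are assumptions, chosen because they reproduce the sample value, and the `PRICE_PER_1M` table and `cost_usd` function are hypothetical names.

```python
# (input price, output price) in USD per 1M tokens.
# Input prices are from this article; output prices are assumed.
PRICE_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price input and output tokens separately, then sum."""
    in_price, out_price = PRICE_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# The sample response: 28 input tokens and 9 output tokens on gpt-4o-mini.
sample_cost = cost_usd("gpt-4o-mini", 28, 9)
```

Under those assumed prices, the 28 input tokens cost $0.0000042 and the 9 output tokens cost $0.0000054, which adds up to the $0.0000096 shown in the response.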
GET /v1/teams/{team_id}/usage - query budget usage and remaining allocation
Response: {
  "team_id": "team-search",
  "daily": { "used_tokens": 4200000, "limit_tokens": 10000000, "pct_used": 42 },
  "monthly": { "used_tokens": 98000000, "limit_tokens": 500000000, "pct_used": 19.6 },
  "top_models": [
    { "model": "gpt-4o-mini", "requests": 12400, "tokens": 3100000, "cost_usd": 1.86 },
    { "model": "gpt-4o", "requests": 340, "tokens": 1100000, "cost_usd": 13.75 }
  ],
  "auto_downgrade_active": false
}
Teams need visibility into their own usage. This endpoint powers dashboards and budget alerts. Finance uses the aggregated version across all teams for forecasting.
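The `pct_used` values and the 80% alert from the requirements reduce to one small calculation per budget window. This is a sketch only: `usage_window` is a hypothetical helper, and the `alert` field is not part of the response shown above.

```python
def usage_window(used_tokens: int, limit_tokens: int, alert_threshold_pct: float = 80.0) -> dict:
    """Summarize one budget window (daily or monthly) and flag it at the alert threshold."""
    pct_used = round(100 * used_tokens / limit_tokens, 1)
    return {
        "used_tokens": used_tokens,
        "limit_tokens": limit_tokens,
        "pct_used": pct_used,
        "alert": pct_used >= alert_threshold_pct,  # hypothetical field, not in the API above
    }


daily = usage_window(4_200_000, 10_000_000)
monthly = usage_window(98_000_000, 500_000_000)
```

With the sample numbers, the daily window comes out at 42% and the monthly window at 19.6%, matching the response, and neither crosses the alert threshold.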
GET /v1/providers/health - provider health status for ops dashboards
Response: {
  "providers": [
    { "provider": "openai", "status": "healthy", "error_rate_1m": 0.2, "p50_latency_ms": 280, "circuit": "closed" },
    { "provider": "anthropic", "status": "healthy", "error_rate_1m": 0.1, "p50_latency_ms": 310, "circuit": "closed" },
    { "provider": "google", "status": "degraded", "error_rate_1m": 8.5, "p50_latency_ms": 1200, "circuit": "half-open" }
  ]
}
The on-call team uses this endpoint to see provider status at a glance and to understand routing decisions during incidents.
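The `circuit` field follows the usual three-state circuit breaker. Below is a minimal sketch of those transitions. It trips on a consecutive-failure count rather than the `error_rate_1m` the gateway tracks, which is a simplification, and the `CircuitBreaker` class, its thresholds, and the injectable `clock` are all assumptions for illustration.

```python
import time


class CircuitBreaker:
    """closed -> open after repeated failures; open -> half-open after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Reject while open; once the cooldown passes, move to half-open and allow probes."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"
                return True
            return False
        return True

    def record_success(self) -> None:
        """Any success closes the circuit and clears the failure count."""
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        """Open on the first failure while half-open, or at the threshold while closed."""
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

In this sketch every request is allowed through while the circuit is half-open. A stricter version would admit a single probe at a time so a still-failing provider does not receive a burst of traffic.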
DELETE /v1/cache - invalidate cache entries (admin only)
Request: {
  "scope": "team",
  "team_id": "team-search",
  "older_than": "2026-04-01T00:00:00Z"
}
Response: { "entries_deleted": 1247 }
Use this endpoint to invalidate cached responses when a team changes its system prompt or when a provider changes model behavior in a version update.
High-level design
The gateway sits between every internal service and every LLM provider. Think of it as an API gateway specifically designed for LLM traffic, with three capabilities that a generic gateway (Kong, Envoy) cannot provide: intelligent model routing based on query complexity, token-aware rate limiting, and cross-provider failover with prompt format adaptation.
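Prompt format adaptation is the least obvious of the three, so here is a rough sketch of an openai-to-anthropic adapter. It relies on my understanding that the Anthropic Messages API takes the system prompt as a top-level `system` field and requires `max_tokens`. The model names in `MODEL_EQUIVALENTS` are placeholders based on this article's pairings, not exact provider model IDs, and `openai_to_anthropic` and its default of 1024 tokens are assumptions.

```python
# Placeholder identifiers based on this article's pairings, not exact model IDs.
MODEL_EQUIVALENTS = {
    "gpt-4o": "claude-3.5-sonnet",
    "gpt-4o-mini": "claude-haiku",
}


def openai_to_anthropic(request: dict, default_max_tokens: int = 1024) -> dict:
    """Translate an OpenAI-style chat request into an Anthropic-style one."""
    messages = request["messages"]
    # System messages move out of the message list into a top-level field.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] != "system"
    ]
    adapted = {
        "model": MODEL_EQUIVALENTS[request["model"]],
        "messages": chat,
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        adapted["system"] = "\n\n".join(system_parts)
    # Copy through the sampling and streaming options both formats share.
    for key in ("temperature", "stream"):
        if key in request:
            adapted[key] = request[key]
    return adapted


adapted = openai_to_anthropic({
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.0,
})
```

The request side is only half of the adapter. The response also has to be mapped back into the OpenAI shape so callers never see which provider answered.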
The architecture has two planes. The data plane handles the hot path: receive request, check cache, route to provider, stream response back. This must be fast (sub-50ms overhead). The control plane handles configuration, health monitoring, and budget enforcement. This runs asynchronously and updates the data plane through shared state in Redis.
I have seen teams try to build this inside their existing API gateway with custom plugins. It works until you need streaming support, token counting, or model equivalency mapping. Those features are LLM-specific and deserve a dedicated service.
For your interview: draw the two-plane architecture early and name the three capabilities that make this more than a generic API gateway. Interviewers want to see that you understand LLM-specific requirements.
In the core request flow, the gateway adds minimal overhead on the hot path (auth check, rate limit lookup, cache check, route decision) and handles the expensive work (provider call, logging) without blocking the response.