Rate Limiter
Design a distributed rate limiter that throttles requests per user or IP across a fleet of servers, covering token bucket, sliding window, and Redis-backed strategies at FAANG scale.
What is an API rate limiter?
A rate limiter controls how many requests a client can make in a given time window. Visit the Twitter API twice a second or a thousand times a minute and you cross a threshold; the next request gets a 429. The interesting engineering problem is not the counting itself; it is counting atomically across a fleet of servers without a per-request distributed lock, while keeping reject latency under 5ms and surviving Redis node failures without cascading 429s that bring your entire service down.
I open every rate limiter interview by asking: "What happens to your users when Redis dies?" The answer reveals whether someone has built one or just read about one.
Functional Requirements
Core Requirements
- Limit requests from a client to N per time window (for example, 1,000 req/min per API key).
- Return HTTP 429 with a `Retry-After` header when the limit is exceeded.
- Rules can vary per endpoint and per client tier (free vs. premium).
- Rules are configurable without a code deploy.
Below the Line (out of scope)
- Billing or quota enforcement
- Full WAF protection or DDoS mitigation
- Geographic restrictions
The hardest part in scope: The distributed counter race condition. Counting requests in a single process is trivial. Counting them accurately across dozens of stateless servers sharing a Redis instance, without a lock per request, under sub-second time windows is the actual engineering challenge. We will spend two full deep dives on it.
Billing and quota enforcement are below the line because they require integration with a payments system and a separate overage billing model. To add them, tie the rate limit rules to a subscription_tier column in a billing table. When a request arrives, the rule lookup resolves the tier and checks the billing system for active quotas.
The rate limiter becomes a quota enforcer rather than just a traffic shaper.
WAF and DDoS mitigation are below the line because they require packet-level inspection, IP reputation scoring, and bot fingerprinting. These belong in a dedicated network appliance (CloudFlare, AWS WAF) sitting upstream of the rate limiter. The rate limiter handles API-level semantics; a WAF handles transport-level threats.
Geographic restrictions are below the line because they require a GeoIP lookup on every request and a rules table with a region dimension. To add them, tag the request context with a region code and add a region column to the LimitRule table. Rules then resolve on (api_key, endpoint, region) instead of (api_key, endpoint).
Non-Functional Requirements
Core Requirements
- Latency: The rate-limiting check adds less than 5ms p99 to every request. At 5ms, rate limiting is invisible to users but measurable in profiling.
- Availability: 99.99% uptime. On limiter failure, the system fails open (allows traffic) rather than failing closed (rejects all traffic).
- Throughput: Handle up to 1M requests per second across the fleet without degrading check latency.
- Scale: Support 10,000 concurrent API keys, each with independent per-key, per-endpoint, per-tier rules.
- Consistency: Prevent more than a 5-10% overshoot on the stated limit under burst conditions. Exact accuracy is not required, but wild overruns (10x the limit) are not acceptable.
Below the Line
- Sub-millisecond p99 latency (requires co-locating Redis with every app server, operationally complex)
- Persistent audit log of every reject event (needed for billing-grade enforcement but not for throttling)
Read/write ratio: Every incoming API request triggers one Redis counter write (INCR) and one implicit read (INCR returns the new value). This is roughly 1:1, not the read-heavy ratio we see in most systems. Rule lookups are the exception: rules update infrequently (perhaps once per hour), so rule reads are cached in memory on each gateway node. The counter INCR is the performance-critical operation on every single request, and it must be atomic.
The 1:1 counter read/write ratio means we cannot rely on read-heavy optimizations like CDN caching. The counter must be incremented and checked atomically on every call. This pushes toward Redis INCR, which combines the read (returns the new value) and the write (increments the counter) in a single atomic command.
I would call out the 1:1 ratio early in the interview because it immediately disqualifies half the caching patterns interviewers expect to hear. That one sentence reframes the entire conversation.
Core Entities
- LimitRule: The configuration for how many requests are allowed per window for a given combination of `(key_type, tier, endpoint)`. Carries the limit N, the window duration in seconds, and the rule ID.
- RequestCounter: The current count of requests for a specific client in the current time window. Stored in Redis as a key with a TTL equal to the window duration. Never persisted to a relational database.
- ClientIdentity: The resolved identifier used for rate-limit bucketing (API key, user ID, or IP address). Determines which `LimitRule` applies to a given request.
Full schema details for the rules table, including indexes, tier columns, and composite key structure, are deferred to the deep dives. The three entities above are sufficient to drive the API design and High-Level Design.
API Design
The rate limiter is middleware, not a public-facing API. Clients never call it directly; it intercepts requests at the gateway layer. Two explicit interfaces are worth defining: the management API for configuring rules, and the internal check API used by gateway components that are not co-located with the limiter logic.
Configure a rate limit rule:
POST /rate-limit-rules
Body: {
key_type: "api_key",
tier: "free",
endpoint: "/search",
limit: 100,
window_seconds: 60
}
Response: { rule_id: "rl_abc123" }
Get a rule:
GET /rate-limit-rules/{rule_id}
Response: { rule_id, key_type, tier, endpoint, limit, window_seconds }
Internal rate check (used by distributed gateway nodes that delegate to a centralized checker):
POST /check
Body: { key: "api-key-abc123", endpoint: "/search", tier: "free" }
Response: {
allowed: true,
remaining: 45,
limit: 100,
reset_at: 1720000060
}
When a request is rejected, the response includes the full set of throttle headers:
HTTP 429 Too Many Requests
Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1720000060
Retry-After: 37
Body: { error: "rate_limit_exceeded", retry_after_seconds: 37 }
The X-RateLimit-* headers appear on every response, including successful ones, so well-behaved clients can self-throttle before hitting the limit. Retry-After is expressed in seconds (not an absolute timestamp) so client libraries can implement exponential backoff without timestamp arithmetic.
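The header contract above is mechanical enough to sketch directly. The following is an illustrative helper (the function name and signature are invented for this example), showing how the headers and the seconds-based `Retry-After` fall out of the counter state:

```python
def throttle_headers(limit: int, count: int, reset_at: int, now: float) -> dict:
    """Build the X-RateLimit-* headers from the current counter state.

    `reset_at` is the Unix timestamp when the current window ends;
    Retry-After is derived from it in seconds, per the contract above.
    """
    remaining = max(limit - count, 0)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_at),
    }
    if count > limit:  # rejected request: add Retry-After in seconds
        headers["Retry-After"] = str(max(int(reset_at - now), 0))
    return headers

# A request at t=1720000023 against a limit of 100 that is already exhausted:
print(throttle_headers(limit=100, count=101, reset_at=1720000060, now=1720000023))
# Retry-After works out to 37 seconds, matching the example response above.
```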
High-Level Design
1. Single server naive: in-memory counter per API key
A single-server in-memory counter handles the basic rate check but silently fails the moment a second gateway node is deployed.
The simplest rate limiter runs on a single server with an in-memory hash map from API key to request count. No external dependencies, no network calls.
Components:
- Client: Makes API requests to the gateway server.
- API Gateway (with in-memory map): On each request, increments the counter for the requesting API key. If the counter exceeds the configured limit for the current window, returns 429. Otherwise, forwards to the backend.
- Backend Service: Handles business logic. Only sees requests that passed the rate limit check.
Request walkthrough:
- Client sends a request with an API key header.
- Gateway extracts the API key from the `X-API-Key` header.
- Gateway looks up the current counter for `api_key` in the local hash map.
- If the counter exceeds the configured limit, return 429 with `Retry-After`.
- Otherwise, increment the counter and forward the request to the Backend Service.
- A background goroutine (or scheduled task) resets counters at each window boundary.
This design works on one server and adds zero latency overhead. The failure is obvious: deploy two gateway instances and each maintains its own counter. A client sending 200 requests per minute routes through round-robin load balancing, hitting each instance with 100 requests per minute.
Neither counter exceeds the 100-request limit. The client pushes 200 requests per minute through a nominal 100-request limit, and neither server knows. The rate limiter is silently broken.
I like to pause here in an interview and let this sink in. The single-server version is correct. The multi-server version is wrong in a way that no monitoring will catch, because every node thinks it is enforcing the limit. This is the kind of failure that surfaces only in load testing or incident review.
2. Distributed tracking with a centralized Redis counter
Moving counters to a shared Redis instance gives all gateway nodes a consistent global view, solving the distributed counting problem.
Move the counters out of each server's memory and into a shared Redis instance. All gateway nodes read from and write to the same counter state.
Components:
- Load Balancer: Distributes incoming requests across the API Gateway fleet. No rate-limiting logic here; it is purely a traffic distributor.
- API Gateway Fleet: Multiple stateless gateway nodes. All share a single Redis connection pool. Rate limit state is no longer local to any node.
- Redis Counter Store: Holds one key per `(api_key, window)` pair. Each key is an integer incremented atomically with Redis `INCR`. The TTL equals the window duration so old window keys expire automatically without any cleanup job.
- Backend Service: Receives only requests that pass the centralized counter check.
Request walkthrough:
- Client sends a request.
- Load Balancer routes to any gateway node (round-robin or least-connections).
- Gateway constructs the Redis key: `rl:{api_key}:{window_start}` where `window_start = floor(now / window_seconds) * window_seconds`.
- Gateway calls `INCR rl:{api_key}:{window_start}`. Redis atomically increments and returns the new count.
- If the returned count exceeds the configured limit, return 429.
- Otherwise, forward to the Backend Service.
- On the first INCR for a new window (count == 1), Gateway also calls `EXPIRE rl:{api_key}:{window_start} {window_seconds}` to attach a TTL.
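The gateway-side check reduces to one atomic round-trip. A sketch under stated assumptions: `redis_client` is anything exposing redis-py style `incr()`/`expire()`, and the `FakeRedis` stub below exists only to exercise the logic without a live server:

```python
def check_rate_limit(redis_client, api_key: str, limit: int,
                     window_seconds: int, now: float) -> bool:
    """Fixed-window check against a shared counter store."""
    window_start = int(now // window_seconds) * window_seconds
    key = f"rl:{api_key}:{window_start}"
    count = redis_client.incr(key)   # atomic read-modify-write in one command
    if count == 1:                   # first hit in this window:
        redis_client.expire(key, window_seconds)  # attach the TTL
    return count <= limit

class FakeRedis:
    """In-memory stand-in for Redis, for illustration only."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, ttl):
        pass  # TTL bookkeeping omitted in this stub

r = FakeRedis()
print([check_rate_limit(r, "abc", limit=2, window_seconds=60, now=30)
       for _ in range(3)])  # [True, True, False]
```

Note the gap between `INCR` and `EXPIRE`: if the gateway crashes between the two calls, the key never expires. Making the pair atomic is one motivation for the Lua-script approach raised in the deep dives.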
All gateway nodes share the same Redis key namespace, so counts are globally accurate across the fleet. This is the baseline design; it solves the distributed counting problem cleanly and is where most teams start. The remaining questions are which windowing algorithm to use, how to handle Redis failures, and how to resolve rules per tier and endpoint.
All of those go to the deep dives.
3. Configurable rules per tier and endpoint
A Rules Service with per-node in-memory caching adds configurable per-tier, per-endpoint rules with zero added latency per request.
Not every client has the same limit. A free-tier API key gets 100 req/min on /search; a premium key gets 1,000. These rules must be configurable at runtime without a code deploy, and they must not add a database query to every request path.
Components:
- Rules Service: A separate service that owns the `LimitRule` table and exposes a CRUD management API. Publishes rule changes to a pub/sub channel so gateway nodes can refresh promptly.
- Rules Cache (in-memory per node): Each Gateway node caches the rules map in local memory, refreshed every 60 seconds from the Rules Service. A rule lookup is now a hash map read (CPU-bound), not a network round-trip.
- Redis Counter Store: Unchanged. The key now includes the endpoint to support per-endpoint limits: `rl:{api_key}:{endpoint}:{window_start}`.
Request walkthrough:
- Client sends a request with an API key header.
- Gateway resolves the `ClientIdentity`: extract API key from header, look up the tier from a cached key-to-tier map.
- Gateway resolves the applicable `LimitRule` from the local in-memory rules cache using `(key_tier, endpoint)`.
- If no explicit rule exists, fall back to the global default rule (for example, 60 req/min for free-tier traffic).
- Gateway calls `INCR rl:{api_key}:{endpoint}:{window_start}` in Redis.
- If the returned count exceeds the rule's limit, return 429 with `Retry-After` and `X-RateLimit-*` headers.
- Otherwise, forward to the Backend Service.
The 60-second local cache on each gateway node means a newly configured rule takes up to 60 seconds to propagate to all nodes. For a rate limiter this is acceptable: the window is typically 60 seconds, so at most one window is enforced under the old rule. For security-critical changes (revoking a compromised API key), add a pub/sub invalidation channel that triggers immediate cache refresh on all nodes.
This tradeoff is worth calling out explicitly when discussing rule distribution consistency. In production, I have seen teams skip the pub/sub channel and rely solely on the 60-second TTL refresh, then get burned when a key revocation takes a full minute to propagate during a security incident.
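Rule resolution itself is a dictionary lookup with a default fallback. A minimal sketch, assuming the cache is a plain dict keyed on `(tier, endpoint)` and the default values mirror the 60 req/min free-tier example above (names are invented for illustration):

```python
DEFAULT_RULE = {"limit": 60, "window_seconds": 60}  # global fallback rule

def resolve_rule(rules_cache: dict, tier: str, endpoint: str) -> dict:
    """Resolve the applicable LimitRule from the per-node in-memory cache.

    The cache is refreshed every 60 seconds from the Rules Service, so this
    lookup never leaves the process; missing combinations fall back to the
    global default rather than failing the request.
    """
    return rules_cache.get((tier, endpoint), DEFAULT_RULE)

cache = {
    ("free", "/search"):    {"limit": 100,  "window_seconds": 60},
    ("premium", "/search"): {"limit": 1000, "window_seconds": 60},
}
print(resolve_rule(cache, "premium", "/search")["limit"])  # 1000
print(resolve_rule(cache, "free", "/unknown")["limit"])    # 60 (default)
```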
Potential Deep Dives
1. Which rate-limiting algorithm should we use?
The baseline design uses a fixed window counter, keyed on {window_start}. This is fast and simple, but it has a well-known spike problem. The deep dive is about whether to replace it with something more accurate.
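The spike problem is easy to reproduce with the fixed-window logic itself (an illustrative sketch): a client that sends its full allowance at the end of one window and again at the start of the next pushes twice the nominal rate through a roughly one-second span, and every request is accepted.

```python
def fixed_window_allows(counters: dict, key: str, now: float,
                        limit: int, window_seconds: int) -> bool:
    """Minimal fixed-window check, used to demonstrate the boundary spike."""
    window_start = int(now // window_seconds) * window_seconds
    bucket = (key, window_start)
    counters[bucket] = counters.get(bucket, 0) + 1
    return counters[bucket] <= limit

counters = {}
limit, window = 100, 60
# 100 requests at t=59.5 (the last half-second of window 0) all pass...
late_burst = sum(fixed_window_allows(counters, "k", 59.5, limit, window)
                 for _ in range(100))
# ...and 100 more at t=60.5 (the first half-second of window 1) also all pass.
early_burst = sum(fixed_window_allows(counters, "k", 60.5, limit, window)
                  for _ in range(100))
print(late_burst + early_burst)  # 200 accepted in ~1 second against a 100/min limit
```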
2. How do we store counters at distributed scale?
The choice of counter storage determines both the accuracy of global limits and the throughput ceiling of the rate limiter as a whole.
3. How do we handle Redis failures without cascading 429s?
Redis is the shared state backbone for the rate limiter. When it goes down, every rate limit check fails. The decision of what to do with those requests during the outage determines whether you have a graceful degradation or a complete service outage.
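The fail-open decision can be sketched as a thin wrapper around the Redis check. This is a simplified illustration under stated assumptions (the class name, the bare per-node counter, and the emergency limit are invented for this example; a real circuit breaker would also track CLOSED/OPEN/HALF-OPEN state and window the fallback counter):

```python
class FailOpenChecker:
    """Wraps the shared-counter check with fail-open behavior."""

    def __init__(self, redis_check, emergency_limit: int):
        self.redis_check = redis_check        # callable hitting Redis
        self.emergency_limit = emergency_limit
        self.local_count = 0                  # per-node fallback counter

    def allow(self, api_key: str) -> bool:
        try:
            return self.redis_check(api_key)
        except ConnectionError:
            # Fail open: degrade to a permissive local limit rather than
            # returning 429 for all traffic during the Redis outage.
            self.local_count += 1
            return self.local_count <= self.emergency_limit

def broken_redis_check(api_key: str) -> bool:
    raise ConnectionError("redis unreachable")

checker = FailOpenChecker(broken_redis_check, emergency_limit=1000)
print(checker.allow("abc"))  # True: traffic still flows while Redis is down
```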
4. How do we identify clients and resolve rules across tiers?
The rate limiter needs to know who is making a request and which rule applies. This sounds simple but has edge cases that repeatedly cause production incidents, especially around shared IPs and anonymous traffic paths.
Final Architecture
The key insight: every request pays for exactly one in-memory hash map lookup (rules resolution) and one atomic Redis round-trip (Lua sliding window check). The circuit breaker ensures the gateway continues serving traffic when Redis is unavailable. Accuracy degrades gracefully from exact (Redis healthy) to approximately permissive (local emergency limiter) rather than from exact to a service outage.
Interview Cheat Sheet
- Scope to four core behaviors up front: per-key limits, 429 with headers, per-tier rules configurable at runtime, and no code deploy required to update rules.
- State the read/write ratio explicitly: every incoming request triggers one Redis INCR. The rate limiter is write-heavy on the counter store, unlike most systems where reads dominate.
- Redis INCR is atomic. It reads, increments, and returns the new value in a single operation. No distributed lock is needed around it, ever.
- Fixed window counter has a boundary spike bug: a client can send double the configured rate across a window boundary. Always name this when asked about algorithm choices; it signals you know the failure mode.
- Sliding window counter beats fixed window with no memory overhead. Blend two adjacent INCR counters weighted by elapsed fraction. A 5-10% overshoot in high-concurrency bursts is acceptable and should be stated in the NFRs.
- Sliding window log is more accurate but stores one sorted set entry per request. Use it only for billing-grade quota enforcement where exact limits are contractually guaranteed.
- Use a Lua script on Redis to make the window check and INCR fully atomic in one round-trip. Use EVALSHA (not EVAL) to avoid resending the script body on every request.
- Use curly-brace key tagging (`rl:{api_key}:current` and `rl:{api_key}:prev`) so both window keys always land on the same Redis Cluster shard. Lua scripts in Redis Cluster can only touch keys on one shard.
- Identity resolution priority: API key first (most specific, carries tier), user ID second (authenticated session), IP third (anonymous, tightest limit). Never downgrade to IP when a higher-priority identity is present.
- Cache rules in memory on each gateway node with a 60-second TTL. A rule lookup is an O(1) hash map read, not a network call. Zero added latency per request for rule resolution.
- Fail-open, never fail-closed. When Redis is unavailable, a per-node circuit breaker trips and the gateway passes traffic through a local emergency limiter set to 10x the normal limit. Service stays up; rate limit accuracy degrades temporarily.
- Redis Sentinel provides HA: 1 primary and 2 replicas per shard, with Sentinel promoting a replica in under 30 seconds on primary failure. This minimizes the time the circuit breaker spends in the OPEN state.
- Emit `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` on every response including successful ones. Well-behaved clients use these to self-throttle before hitting the limit.
- For security-critical key revocations, add a pub/sub invalidation channel to push immediate cache evictions to all gateway nodes rather than waiting for the 60-second TTL refresh.
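The sliding-window blend mentioned in the cheat sheet reduces to one line of arithmetic: weight the previous window's counter by how much of it still overlaps the trailing window, then add the current counter. A sketch (the function name is invented):

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            now: float, window_seconds: int) -> float:
    """Estimate the request count over the trailing window.

    Assumes the previous window's requests were evenly spread, which is
    the source of the 5-10% overshoot tolerance stated in the NFRs.
    """
    elapsed_fraction = (now % window_seconds) / window_seconds
    return prev_count * (1 - elapsed_fraction) + curr_count

# 15 seconds into a 60s window: 75% of the previous window still counts.
print(sliding_window_estimate(prev_count=80, curr_count=30,
                              now=75, window_seconds=60))
# 80 * 0.75 + 30 = 90.0
```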