Stateful vs. stateless services
The architectural difference between stateful and stateless services: what state means, why stateless services are easier to scale, and the patterns for managing state outside the service when it's necessary.
TL;DR
| Dimension | Stateless | Stateful |
|---|---|---|
| Scaling | Horizontal, trivial (add instances behind load balancer) | Requires sticky sessions, connection draining, state migration |
| Deploys | Rolling restart, zero coordination | Drain connections, migrate state, blue-green with care |
| Failure recovery | Instance dies, traffic reroutes instantly | Instance dies, all in-memory state is lost |
| Load balancing | Round-robin, least-connections, random | Session affinity (sticky sessions), connection-aware routing |
| Best for | REST APIs, microservices, batch workers | WebSocket servers, game servers, stream processors, Raft leaders |
Default to stateless. Externalize session state to Redis or a database. Only accept statefulness when the use case genuinely requires in-process memory (real-time connections, stream processing, consensus).
The Framing
Your team runs 4 app servers behind a load balancer. Deploys are simple: pull new image, restart, done. Traffic rebalances automatically. Then the product manager wants a chat feature. You add WebSocket connections. Now each server holds thousands of open connections with user state in memory.
Next deploy: you restart Server 2, and 3,000 users get disconnected mid-conversation. Their reconnections all hit Server 1 (wrong server, no session state). The load balancer doesn't know which server has which user's state. You've accidentally introduced statefulness, and the operational tax is immediate.
This is the core tension: stateless services are simple to scale and deploy, but some problems genuinely require state in memory. The engineering challenge is knowing which problems those are, and which problems only feel like they require statefulness.
The costs compound in this stateful model: if Server 1 crashes, every user with a session on that server loses their state. Taking Server 1 down for a deploy requires draining its sessions first. And adding a Server 3 doesn't help users already pinned to the overloaded Server 1.
How Each Works
Stateless Services
A stateless service processes each request independently. All state lives in external stores (database, cache, object storage). The service itself holds nothing between requests.
// Stateless API handler: any instance can serve this
app.get('/api/orders/:id', async (req, res) => {
  // Auth: validate JWT (self-contained, no server state)
  const user = verifyJWT(req.headers.authorization);

  // Data: read from shared database
  const order = await db.orders.findOne({ id: req.params.id });

  // Cache: read/write shared Redis
  await redis.set(`order:${order.id}:last_viewed`, Date.now());

  res.json(order);
});
Why this scales: any instance can serve any request. The load balancer uses round-robin, random, or least-connections. Adding instances is `kubectl scale deployment api --replicas=10`. Removing instances is equally trivial. No coordination, no migration, no draining.
Moving from server-side sessions to JWTs is the most common stateless migration. Instead of storing sessions server-side (which creates statefulness), the server issues a signed JWT containing user claims. Any instance verifies the signature and extracts the user without a database lookup.
# Stateful session (requires shared state or sticky sessions)
session_id = cookie("session_id")     # "abc123"
user = session_store.get(session_id)  # Redis or DB lookup required

# Stateless JWT (any instance, no lookup)
token = header("Authorization")       # "Bearer eyJhbGci..."
user = jwt.verify(token, PUBLIC_KEY)  # Signature check only
# user = { id: 456, role: "admin", exp: 1712000000 }
The JWT tradeoff: tokens can't be individually revoked until they expire. If a user's permissions change or their account is compromised, the old JWT is still valid until expiry. Mitigation: short expiry (15 minutes) plus refresh token rotation. For high-security systems, maintain a small revocation list in Redis (checked on sensitive operations only, not every request).
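The revocation-list mitigation can be sketched like this. The `jti` (JWT ID) claim and the `revoke`/`authorizeSensitiveOp` names are illustrative; an in-memory Set stands in for the Redis set so the example runs on its own.

```typescript
const revokedJtis = new Set<string>(); // stand-in for a small Redis set

function revoke(jti: string): void {
  // Real code would write to Redis with a TTL matching the token's
  // remaining lifetime, so the entry expires when the token does.
  revokedJtis.add(jti);
}

interface Claims { id: number; role: string; jti: string; exp: number; }

function authorizeSensitiveOp(claims: Claims): boolean {
  // Ordinary requests skip this lookup entirely; only sensitive
  // operations (password change, payout, deletion) pay for it.
  if (claims.exp * 1000 < Date.now()) return false; // expired
  return !revokedJtis.has(claims.jti);              // not revoked
}
```

The key design point: the check runs on a small minority of requests, so the system keeps the stateless fast path for everything else.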
Stateful Services
A stateful service maintains data in memory between requests. Subsequent requests from the same client depend on state from previous interactions held in that specific instance.
// WebSocket server: stateful by nature
const connections = new Map<string, WebSocket>(); // in-memory state
const rooms = new Map<string, Set<string>>();     // in-memory state

wss.on('connection', (ws, req) => {
  const userId = authenticateConnection(req);
  connections.set(userId, ws); // state stored in THIS instance

  ws.on('message', (data) => {
    const msg = JSON.parse(data.toString());
    const room = rooms.get(msg.roomId);
    // Broadcast to all users in the room ON THIS SERVER
    room?.forEach(uid => connections.get(uid)?.send(data));
  });

  ws.on('close', () => {
    connections.delete(userId);
    // State lost if this server restarts
  });
});
Why this is harder to operate: the load balancer must route each user to the specific server holding their state (sticky sessions via cookie, IP hash, or connection ID). Scaling out doesn't help users pinned to an overloaded server. Deploys require connection draining (wait for users to disconnect, or forcibly close connections and accept momentary disruption).
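Hash-based affinity, one of the routing options above, can be sketched in a few lines. The server names and `routeFor` helper are illustrative; the hash is FNV-1a, chosen only because it is simple and stable.

```typescript
const servers = ["ws-1", "ws-2", "ws-3"]; // illustrative instance names

// FNV-1a: a simple, deterministic string hash
function hash(key: string): number {
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Same user always maps to the same server, so the instance holding
// their connection state keeps receiving their traffic.
function routeFor(userId: string): string {
  return servers[hash(userId) % servers.length];
}
```

Note the weakness of naive modulo hashing: adding a fourth server remaps most users to new instances. Consistent hashing (which Discord uses, per the example later in this section) limits that remapping to a small fraction of users.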
Head-to-Head Comparison
| Dimension | Stateless | Stateful | Verdict |
|---|---|---|---|
| Horizontal scaling | Add instances, done | Sticky sessions, affinity routing | Stateless |
| Load balancing | Round-robin, random, least-conn | Session affinity or connection-aware | Stateless |
| Deploy strategy | Rolling restart, zero downtime | Connection drain, blue-green, state migration | Stateless |
| Failure recovery | Instant reroute to healthy instances | State lost, users must reconnect/re-authenticate | Stateless |
| Request latency | External store lookup per request (~1-5ms for Redis) | In-memory access (~0.01ms) | Stateful |
| Connection overhead | New TCP/TLS per request (or keep-alive) | Persistent connection, minimal per-message overhead | Stateful |
| Real-time capability | Polling or SSE (server-push only) | WebSockets, bidirectional, sub-10ms | Stateful |
| Memory efficiency | State in shared store, no per-instance duplication | State in RAM, efficient for hot data | Depends |
| Operational complexity | Low (cattle, not pets) | High (pets that need care) | Stateless |
| Testing | Stateless functions, easy to unit test | Connection lifecycle, reconnection, state sync | Stateless |
The fundamental tension is operational simplicity vs. performance for stateful workloads. Stateless services are easier to build, deploy, and operate. Stateful services are faster and more capable for connection-oriented workloads, but the operational tax is significant.
When Stateless Wins
Stateless is the right default for any service that handles request/response workloads:
- REST/GraphQL APIs. Every request contains everything needed (auth token, parameters). Any instance serves any request. This is the bread and butter of microservice architectures.
- Background job workers. Pull a job from a queue, process it, write results. If the worker dies, another worker picks up the job. No state to lose.
- Serverless functions. Lambda, Cloud Functions, Cloud Run. Stateless by design. The platform manages instance lifecycle entirely.
- Batch processors. Read from input, transform, write to output. Checkpoint progress to external storage. Resume from checkpoint on failure.
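The worker pattern from the list above can be sketched as a loop that holds nothing between jobs. The `jobQueue` and `results` containers are in-memory stand-ins for a real queue (SQS, RabbitMQ) and database, so the example is self-contained.

```typescript
interface Job { id: string; payload: number; }

const jobQueue: Job[] = [{ id: "j1", payload: 21 }, { id: "j2", payload: 5 }];
const results = new Map<string, number>(); // stand-in for the database

function processOne(): boolean {
  const job = jobQueue.shift();   // real code: queue.receiveMessage()
  if (!job) return false;
  const result = job.payload * 2; // the actual work
  results.set(job.id, result);    // real code: db.results.insert(...)
  return true;                    // real code: queue.deleteMessage(...)
}

while (processOne()) { /* drain the queue */ }
```

Because the worker keeps no state between iterations, a crashed worker loses nothing: the unacknowledged job stays on the queue and any replacement picks it up.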
For your interview: "I'd make this service stateless by externalizing session state to Redis. Stateless services scale horizontally with zero coordination, which keeps the operations story simple."
When Stateful Wins
Stateful is the right choice when the workload genuinely requires persistent in-memory state:
- WebSocket servers. Each connection is an open socket with associated state (user identity, room membership, message buffer). This state exists in process memory by nature.
- Real-time game servers. Game state (player positions, physics simulation, match state) must be in memory for sub-millisecond access. Writing to Redis on every frame would add 1-5ms of latency, making the game feel sluggish.
- Stream processors. Kafka Streams, Apache Flink. They maintain windowed aggregations, join state, and running totals in memory. Checkpointing to durable storage happens periodically, not on every event.
- Consensus leaders. Raft/Paxos leaders maintain the log and current state in memory. The leader's in-memory state IS the system's ground truth.
- In-memory caches and databases. Redis itself is a stateful service. The entire value proposition is that data lives in memory.
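The stream-processor item above can be made concrete with a sketch of periodic checkpointing: the running aggregate lives in memory and is flushed durably every N events, not on every event. This is the shape of the pattern, not the Kafka Streams or Flink API; all names here are illustrative, and a plain object stands in for durable storage.

```typescript
const counts = new Map<string, number>();      // in-memory window state
let eventsSinceCheckpoint = 0;
const CHECKPOINT_EVERY = 3;                    // real systems checkpoint by time or size
let checkpointed: Record<string, number> = {}; // stand-in for durable storage

function onEvent(key: string): void {
  counts.set(key, (counts.get(key) ?? 0) + 1); // update hot state in RAM
  if (++eventsSinceCheckpoint >= CHECKPOINT_EVERY) {
    checkpointed = Object.fromEntries(counts); // flush snapshot durably
    eventsSinceCheckpoint = 0;
  }
}

["a", "b", "a", "a"].forEach(onEvent);
```

After the fourth event, the in-memory count for "a" is 3 but the last checkpoint recorded 2: on a crash, the processor restarts from the checkpoint and replays the events since, which is exactly the recovery model the list describes.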
The Nuance
The Externalization Spectrum
Most real systems aren't purely stateless or stateful. They sit on a spectrum of how much state lives in-process vs. in external stores.
The hybrid approach is the most common in production: stateless HTTP APIs for request/response, stateful WebSocket servers for real-time, Redis as the bridge between them. When a WebSocket message triggers an action that affects the HTTP API (like updating a record), the WebSocket server writes to the database and the HTTP API reads it. No cross-service in-memory state sharing.
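The Pub/Sub bridge in the hybrid approach works like this sketch: every WebSocket server subscribes to a channel and forwards incoming messages to its own local connections, so a message published by one server reaches users connected to all of them. A tiny in-process broker stands in for Redis here so the example runs; the channel name and class are illustrative.

```typescript
type Handler = (msg: string) => void;

class Broker { // stand-in for Redis Pub/Sub
  private subs = new Map<string, Handler[]>();
  subscribe(channel: string, fn: Handler): void {
    this.subs.set(channel, [...(this.subs.get(channel) ?? []), fn]);
  }
  publish(channel: string, msg: string): void {
    (this.subs.get(channel) ?? []).forEach(fn => fn(msg));
  }
}

const broker = new Broker();
const delivered: string[] = []; // what each server pushed to its local sockets

// Two WebSocket servers, each holding its own local connections
for (const server of ["ws-1", "ws-2"]) {
  broker.subscribe("room:42", msg => {
    // Real code: forward to this server's local room members via ws.send()
    delivered.push(`${server}:${msg}`);
  });
}

// Published once, delivered by BOTH servers to their own connections
broker.publish("room:42", "hello");
```

The design point: no server ever reads another server's in-memory connection map; the only shared piece is the broker, which is itself an external store.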
Deploy Patterns
The deploy strategy differs dramatically:
For stateless services, a rolling deploy takes seconds: kill an instance, the load balancer stops routing to it, a new instance starts and begins receiving traffic. Zero user impact.
For stateful services, you need connection draining. Send a "reconnecting" frame to all connected WebSocket clients. Wait for them to close gracefully (or force-close after a timeout). Then restart. Users experience a brief disconnection and reconnect to a different server. Their in-flight state (typing indicators, cursor position) is lost unless it was synchronized to an external store.
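The drain sequence above can be sketched as a small function: signal every client, wait up to a timeout for graceful closes, then force-close stragglers. The `Conn` shape is simplified for illustration; in a real server these would be `ws` WebSocket objects and the trigger would be SIGTERM or a Kubernetes preStop hook.

```typescript
interface Conn { open: boolean; send(msg: string): void; close(): void; }

async function drain(conns: Conn[], timeoutMs: number): Promise<void> {
  // 1. Ask clients to reconnect elsewhere
  conns.forEach(c => c.send('{"op":"reconnect"}'));

  // 2. Wait for graceful closes, up to the deadline
  const deadline = Date.now() + timeoutMs;
  while (conns.some(c => c.open) && Date.now() < deadline) {
    await new Promise(r => setTimeout(r, 10));
  }

  // 3. Force-close whatever is left, accepting brief disruption
  conns.filter(c => c.open).forEach(c => c.close());
}
```

The timeout is the tradeoff knob: longer waits mean gentler deploys but slower rollouts; production values of 30-60 seconds (as in the interview table later in this section) balance the two.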
Real-World Examples
Netflix: Their API gateway (Zuul) and microservices are stateless. Any instance serves any request. They scale to millions of concurrent streams by adding instances behind a load balancer. State lives in EVCache (their Memcached layer), Cassandra, and MySQL. But their video encoding pipeline is stateful: each encoding job maintains intermediate state in local disk/memory. If an encoder dies, the job restarts from the last checkpoint, not from scratch.
Discord: Their WebSocket gateway servers are stateful. Each server holds thousands of open connections with user and guild state in memory. They use consistent hashing to assign guilds to gateway instances. When they deploy, they drain connections from the old instance (sending clients a "reconnect" opcode) and clients automatically reconnect to the new instance. They handle 10+ million concurrent WebSocket connections across their gateway fleet.
Uber: Their trip matching service is stateful (driver locations, active trip state in memory for sub-millisecond access). Their pricing service is stateless (compute price from inputs, no memory between requests). Their approach: use the right model per service. Real-time matching needs in-memory state. Price computation doesn't. The geospatial index for driver locations uses partitioned statefulness with consistent hashing by geographic cell.
How This Shows Up in Interviews
Interview tip: default to stateless, justify stateful
Say: "I'd make this service stateless by storing session data in Redis. Stateless services scale horizontally with zero coordination." Only introduce statefulness if the interviewer's scenario genuinely requires it (WebSockets, game state, stream processing). Then explain the operational cost: sticky sessions, connection draining, partition-aware routing.
Common mistake: ignoring the JWT revocation problem
Many candidates say "I'll use JWTs for stateless auth" without addressing revocation. If a user's account is compromised, you need to invalidate their token immediately. Pure JWT can't do this until expiry. Always mention short expiry (15 min) + refresh token rotation, or a lightweight revocation check for sensitive operations.
When to bring this up proactively:
- Any system that needs horizontal scaling (state externalization is the first step)
- Chat, collaboration, or gaming features (inherently stateful)
- Discussions about deploy strategy and zero-downtime deployments
Depth expected at senior/staff level:
- Know the session externalization pattern and JWT tradeoffs
- Explain connection draining for stateful deploys
- Discuss partitioned statefulness (Kafka Streams, consistent hashing for WebSockets)
- Understand when statefulness is correct (real-time, stream processing, consensus)
| Interviewer asks | Strong answer |
|---|---|
| "How do you scale this service?" | "First, make it stateless: externalize sessions to Redis, use JWT for auth. Then scale horizontally behind a load balancer with round-robin routing." |
| "The system needs WebSockets for chat. How do you handle state?" | "WebSocket servers are stateful. I'd use consistent hashing to assign chat rooms to servers, Redis Pub/Sub for cross-server message delivery, and connection draining for deploys." |
| "What happens when a stateful server crashes?" | "All in-memory state is lost. Connected users reconnect to a different server. Persistent state (messages, room membership) is in the database. Ephemeral state (typing indicators) is lost and recovers on reconnect." |
| "When would you NOT use stateless?" | "Game servers (sub-ms state access), stream processors (windowed aggregations), consensus leaders (Raft log in memory). These genuinely need in-process state." |
| "How do you deploy stateful services?" | "Connection draining. Send a 'reconnect' signal, wait for graceful close (30-60s timeout), restart. Kubernetes preStop hooks handle this. It's slower than stateless deploys but prevents state loss." |
Quick Recap
- A stateless service holds no state between requests; all data lives in external stores (database, Redis), enabling trivial horizontal scaling.
- A stateful service maintains in-memory state between requests; it requires sticky sessions, connection draining, and careful deploy orchestration.
- The session externalization pattern (move sessions to Redis, use JWT for auth) is the most common path from stateful to stateless.
- WebSocket servers, game servers, stream processors, and consensus leaders are valid cases for statefulness because they genuinely need sub-millisecond in-memory access.
- Partitioned statefulness (consistent hashing, Kafka Streams) scales stateful workloads by limiting the blast radius of instance failure to one partition, not the whole system.
- Default to stateless for every new service; only introduce statefulness when you can name the specific requirement that demands in-process memory.
Related Trade-offs
- Scalability - Horizontal scaling is the primary benefit of stateless architecture, and statelessness is a prerequisite for it.
- Load balancing - Round-robin vs. sticky sessions, and how the statefulness of your service determines which algorithm works.
- Caching - Redis as an external session store is a caching pattern; understanding cache failure modes matters for externalized state.