Polling anti-pattern
Understand why polling for state changes wastes resources and adds latency, when to replace it with webhooks, Server-Sent Events, or WebSockets, and how to evaluate the trade-offs.
TL;DR
- Polling asks "did anything change?" on a schedule. At scale, most polls return "no," burning compute, connections, and network bandwidth for nothing.
- 10,000 clients polling every 5 seconds = 2,000 requests per second to your server, over 90% of which return no new data.
- Replace polling with webhooks (you call them when something changes), Server-Sent Events (push a stream over HTTP), or WebSockets (full-duplex persistent connection), each suited for different scenarios.
- Polling is acceptable when the update frequency is low and the acceptable latency is high. For anything near real-time, it's the wrong choice.
The Problem
It's Monday morning. Your job processing platform launched last week. Users submit async jobs and want to know when they're done, so your frontend polls GET /jobs/{id}/status every 2 seconds until the job completes.
Jobs typically take 30 to 120 seconds. In that window, here's what happens:
- 10,000 concurrent users = 5,000 status requests per second
- Each request hits your database to read the status field
- 98% of them return "RUNNING" (the same response as the last poll)
- You're load-testing your database with 5,000 reads/second to deliver no new information
Scale this to 100,000 concurrent users and your status endpoint serves 50,000 requests/second, all reading the same "still running" value from the same rows. You've built an in-house DDoS against your own database.
I saw this exact scenario take down a production Postgres instance. The status table had no read replica, no caching layer, and 50,000 identical SELECT queries per second. The database CPU hit 100%, which slowed actual job processing, which made jobs take longer, which meant more polling. A death spiral.
The Wasteful Math
The economics of polling are brutal at scale:
Polling interval: 5 seconds
Active clients: 10,000
Requests per second: 10,000 / 5 = 2,000 req/s
Expected update frequency: 1 update per 60 seconds per client
Useful requests per second: 10,000 / 60 ≈ 167 req/s
Waste ratio: (2,000 - 167) / 2,000 = 91.6% of all requests return no new data
The shorter the interval, the higher the waste. The lower the update frequency, the higher the waste.
| Interval | Clients | Requests/sec | Useful/sec | Waste |
|---|---|---|---|---|
| 2s | 10,000 | 5,000 | 167 | 96.7% |
| 5s | 10,000 | 2,000 | 167 | 91.6% |
| 10s | 10,000 | 1,000 | 167 | 83.3% |
| 2s | 100,000 | 50,000 | 1,667 | 96.7% |
At 50,000 requests/second, you're paying for infrastructure to serve responses that nobody needs. Each request consumes a TCP connection, a thread (or event loop tick), a database read, serialization, and network bandwidth. Multiply by 96.7% waste and you understand why polling doesn't scale.
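The waste arithmetic above generalizes into a one-line formula. Here's a small sketch (the function name is hypothetical) that reproduces the numbers in the table:

```typescript
// Waste ratio = (total polls/sec - useful updates/sec) / total polls/sec.
// Assumes each client produces one real update per updateIntervalSec.
function pollingWasteRatio(
  clients: number,
  pollIntervalSec: number,
  updateIntervalSec: number
): number {
  const requestsPerSec = clients / pollIntervalSec;
  const usefulPerSec = clients / updateIntervalSec;
  return (requestsPerSec - usefulPerSec) / requestsPerSec;
}

// 10,000 clients, 5s polls, one real update per 60s per client
console.log(pollingWasteRatio(10_000, 5, 60)); // ≈ 0.9167, i.e. 91.6% waste
```

Plugging in your own client count and update frequency tells you quickly whether polling is merely inefficient or actively dangerous.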
Why It Happens
Polling is the simplest possible notification mechanism, and simplicity is seductive.
"It works in development." With 5 test users, polling every 2 seconds is 2.5 requests/second. Your laptop handles that without blinking. The problem only surfaces at 10,000+ concurrent users, which you never test locally.
"We don't control the client." If you're building a public API, you can't force clients to implement webhooks or SSE. Polling is the lowest-common-denominator approach that works from any HTTP client, even curl in a loop.
"The frontend team wanted something quick." SSE requires server-side connection management. WebSockets need a load balancer that supports sticky sessions or connection-aware routing. Polling is a GET request in a setInterval. It ships in 30 minutes.
"We'll optimize later." Except "later" is always after the next feature. By the time polling becomes a problem, it's baked into the client SDK, documented in the API docs, and used by external integrators. Ripping it out is a breaking change.
"We tried webhooks but firewalls blocked them." Corporate networks often block inbound connections to client machines. This is a real constraint for B2B integrations, and it's why long polling exists as a middle ground.
How to Detect It
| Symptom | What It Means | How to Check |
|---|---|---|
| High request rate with low information gain | Most responses return no new data | Compare unique status changes/sec vs total requests/sec |
| Database CPU spikes on status tables | Polling queries overwhelming the read path | Check slow query logs for SELECT on status columns |
| Identical responses in access logs | Clients receiving the same payload repeatedly | Sample access logs, compare consecutive responses per client |
| Latency grows linearly with active users | Each poll adds fixed per-request overhead, with no amortization across requests | Plot request rate and p99 latency vs. active users |
| Load balancer connection count climbing steadily | Each poll is a new TCP connection (HTTP/1.1) | Monitor active connections per backend instance |
Quick audit
Run this against your access logs to find polling endpoints:
# Count (client IP, endpoint) pairs; high repeat counts indicate polling
# (assumes combined log format: $1 = client IP, $7 = request path)
awk '{print $1, $7}' access.log | sort | uniq -c | sort -rn | head -20
If the same client IP hits the same endpoint hundreds of times per minute, and most responses are identical, you've found a polling anti-pattern.
Cost estimation
Here's a quick way to estimate the monthly cost of polling waste:
Wasted requests/sec × 86,400 sec/day × 30 days = monthly wasted requests
Monthly wasted requests × cost per request (compute + DB + bandwidth)
Example:
1,833 wasted req/s × 86,400 × 30 = 4.75 billion wasted requests/month
At $0.000001 per request (Lambda-style pricing) = $4,750/month in pure waste
That's $4,750/month in infrastructure cost to deliver zero new information. I've seen teams reduce their cloud bill by 40% just by replacing a polling endpoint with SSE.
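The cost estimate above is simple enough to keep as a reusable calculation. A minimal sketch (function name hypothetical, per-request price an assumption you'd replace with your own billing data):

```typescript
// Monthly cost of wasted polling requests:
// wasted req/s × 86,400 sec/day × 30 days × cost per request
function monthlyWasteCost(
  wastedReqPerSec: number,
  costPerRequestUsd: number
): number {
  const monthlyWastedRequests = wastedReqPerSec * 86_400 * 30;
  return monthlyWastedRequests * costPerRequestUsd;
}

// 1,833 wasted req/s at $0.000001 per request
console.log(monthlyWasteCost(1_833, 0.000001)); // ≈ $4,751/month
```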
The Fix
Choose the right push mechanism based on your constraints.
Fix 1: Webhooks (push callbacks)
Client registers: POST /webhooks { url: "https://client.com/callback", events: ["job.completed"] }
Server fires: POST https://client.com/callback { jobId: "123", status: "COMPLETED" }
Client receives notification immediately when state changes. No polling. Zero waste.
Trade-off: the client must expose an endpoint. Firewall rules, retries on delivery failure, and security (verify webhook signatures) add complexity. Best for server-to-server integrations.
// Webhook delivery with retry and signature verification
async function deliverWebhook(event: JobEvent, webhook: WebhookConfig) {
  const payload = JSON.stringify(event);
  const signature = crypto
    .createHmac('sha256', webhook.secret)
    .update(payload)
    .digest('hex');
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      const res = await fetch(webhook.url, {
        method: 'POST',
        headers: { 'X-Signature': signature, 'Content-Type': 'application/json' },
        body: payload,
      });
      if (res.ok) return; // delivered
      // Non-2xx response: fall through to backoff and retry.
      // Note: fetch only rejects on network errors, not HTTP error statuses.
    } catch {
      // Network failure: fall through to backoff and retry
    }
    await delay(2 ** attempt * 1000); // exponential backoff: 1s, 2s, 4s
  }
  await deadLetterQueue.push(event); // give up, store for manual retry
}
Fix 2: Server-Sent Events (SSE)
Client opens one long-lived HTTP connection. Server pushes events over it.
// Server
app.get('/jobs/:id/events', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.flushHeaders(); // send headers now so the client sees the stream open
  const unsub = jobStatus.subscribe(req.params.id, (update) => {
    res.write(`data: ${JSON.stringify(update)}\n\n`);
  });
  req.on('close', unsub); // clean up the subscription when the client disconnects
});
Good for: notifications, live updates, progress bars. HTTP/2 multiplexes well. Automatic reconnect in the browser via the EventSource API.
// Client (browser)
const source = new EventSource('/jobs/123/events');
source.onmessage = (event) => {
const update = JSON.parse(event.data);
if (update.status === 'COMPLETED') {
showResult(update);
source.close();
} else {
updateProgressBar(update.progress);
}
};
source.onerror = () => console.log('Reconnecting...'); // auto-reconnect built in
Fix 3: Long polling
The client sends a request with a long timeout. The server holds the connection open until the data changes or the timeout expires.
// Server: hold connection until status changes
app.get('/jobs/:id/status', async (req, res) => {
const timeout = parseInt(req.query.timeout as string) || 30;
const lastEtag = req.headers['if-none-match'];
const result = await jobStatus.waitForChange(req.params.id, {
timeout: timeout * 1000,
lastKnownVersion: lastEtag,
});
if (result.changed) {
res.setHeader('ETag', result.version);
res.json(result.data);
} else {
res.status(304).end(); // no change within timeout
}
});
// Client: reconnect immediately after each response
async function longPoll(jobId: string, etag?: string) {
  const headers: Record<string, string> = {};
  if (etag) headers['If-None-Match'] = etag;
  const res = await fetch(`/jobs/${jobId}/status?timeout=30`, { headers });
  if (res.status === 200) {
    const data = await res.json();
    handleUpdate(data);
    if (data.status === 'COMPLETED') return; // terminal state: stop polling
    longPoll(jobId, res.headers.get('ETag') || undefined);
  } else {
    longPoll(jobId, etag); // 304: no change within timeout, reconnect
  }
}
Simulates push without a WebSocket. Works through proxies and firewalls. Higher latency per update (one request per event), but much lower request volume than short polling.
Fix 4: If you must poll, poll smart
Sometimes polling is the only option (third-party API, legacy system). Make it less painful:
- Exponential backoff. Start at 2 seconds, double after each "no change" response, cap at 60 seconds. A change resets the interval.
- ETag/If-Modified-Since. Let the server return 304 Not Modified with zero body when nothing changed. Saves bandwidth and parsing time.
- Cache in front of the database. Add Redis between your API and database. Status reads hit cache first. A cache TTL of 5 seconds means at most one DB read per 5 seconds per job, regardless of how many clients poll.
- Jitter. If 10,000 clients all poll at exactly the same interval, they synchronize into request bursts. Add random jitter (polling interval ± 20%) to spread the load.
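The backoff and jitter rules above reduce to a small scheduling function. A sketch (the function name is hypothetical) implementing the stated policy: start at 2 seconds, double after each "no change" response, cap at 60 seconds, and spread clients out with ±20% random jitter:

```typescript
// Compute the delay before the next poll, given how many consecutive
// "no change" responses we've seen. A real update resets the counter to 0.
function nextPollDelayMs(consecutiveNoChanges: number): number {
  const baseMs = Math.min(2_000 * 2 ** consecutiveNoChanges, 60_000);
  const jitter = (Math.random() * 0.4 - 0.2) * baseMs; // ±20% of the base
  return baseMs + jitter;
}
```

Resetting the counter on a change keeps latency low when the resource is actively updating, while idle clients quickly back off to the 60-second cap.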
Which mechanism to choose
| Scenario | Best mechanism | Why |
|---|---|---|
| Server-to-client one-way updates | Server-Sent Events (SSE) | Simple, HTTP-based, automatic reconnect |
| Two-way real-time communication | WebSockets | Full-duplex, lower overhead per message |
| Server notifies an external system | Webhooks | Push to client endpoint, no persistent connection |
| Low frequency, bursty updates | Long polling | Holds request open until update, then responds |
| Live feed (scores, prices, statuses) | SSE or WebSockets | Persistent connection, no polling overhead |
Severity and Blast Radius
Polling is medium severity but with a wide blast radius. It doesn't cause sudden failures like a missing circuit breaker. Instead, it degrades gradually: database CPU creeps up, connection pools fill, and response times slowly inflate. By the time someone notices, the database is at 95% CPU and every service that shares it is affected.
The blast radius depends on whether the polled resource is shared. If 10 services share a Postgres instance and one team's polling endpoint is hammering it with 50,000 reads/second, all 10 services degrade. Recovery is fast (stop the polling clients or add caching), but diagnosis can take hours because "the database is slow" doesn't immediately point to one endpoint.
When It's Actually OK
Polling is acceptable when:
- Update frequency is deliberately low. A dashboard that refreshes every 60 seconds doesn't generate meaningful waste. At 1,000 clients, that's 17 req/s, trivial for any server.
- The system is simple and real-time delivery is not a requirement. A cron job checking "is deployment done?" every 30 seconds is fine. Don't over-engineer a $5 problem.
- You're integrating with external systems that don't support webhooks. Some third-party APIs only offer a status endpoint. Poll it, but add exponential backoff and caching.
- Exactly-once delivery is critical and you need full control over the read timing. Webhooks can be lost, duplicated, or arrive out of order. Polling gives the client control: "I'll check when I'm ready to process the update."
- You're prototyping. Polling ships in 10 minutes. SSE or WebSockets take a day to set up properly with connection management, reconnection logic, and load balancer configuration. If you're building an MVP, poll now and push later.
The mistake is using 2-second polling intervals for systems that only change every few minutes.
How This Shows Up in Interviews
Interviewers test this whenever your design includes clients waiting for state changes: job completion, order status, payment confirmation, message delivery. The test is whether you proactively choose a push mechanism or default to polling.
A strong answer includes:
- Naming the specific push mechanism you'd use (SSE, WebSockets, webhooks) and why
- Acknowledging the trade-offs (connection management, firewall constraints, complexity)
- Distinguishing between internal push (SSE/WebSocket to browser) and external push (webhooks to third-party servers)
When your design has clients that need to know about state changes (job completion, payment status, message delivery), describe the notification mechanism explicitly. "The client polls" is an acceptable answer only if you immediately acknowledge the overhead and either justify it or describe the alternative. "I'd use webhooks for external system notifications and SSE for in-browser live updates" shows you've thought about the trade-offs.
Interview tip: When you mention polling in a design, preempt the follow-up by saying: "For the MVP, clients can poll with exponential backoff. For production scale, I'd migrate to SSE for browser clients and webhooks for server-to-server. The polling endpoint stays as a fallback for clients behind restrictive firewalls."
Test Your Understanding
Quick Recap
- Polling sends requests whether or not anything has changed. Most requests return no new data.
- At scale, polling creates high request rates against your infrastructure for near-zero information gain.
- Webhooks push to the client immediately on state change. Zero waste, but requires a client-side endpoint.
- SSE provides one long-lived HTTP stream from server to client. Good for live feeds and progress updates.
- WebSockets handle full-duplex real-time communication. Higher setup complexity, lower per-message overhead.
- Long polling is the middle ground when clients can't accept inbound connections. Holds the request until data changes, then responds.
- Polling is acceptable for low-frequency updates (60s+ intervals) or when integrating with systems that don't support push.