epoll and kqueue: event-driven I/O explained
How Linux epoll and BSD kqueue enable a single thread to monitor thousands of file descriptors. The C10K problem epoll solved, how Nginx and Node.js use it, and why event-driven I/O beats thread-per-connection at scale.
The problem
In 1999, Dan Kegel published the C10K problem: how do you handle 10,000 concurrent connections on a single server?
The naive approach is one thread per connection. Each thread blocks on read() waiting for data. With 10,000 connections, you have 10,000 threads. Each thread consumes stack memory (at minimum ~8KB for its kernel stack, plus user-stack pages), so 10,000 threads burn at least 80MB on stacks alone. Worse, the OS spends much of its time context-switching among 10,000 threads instead of doing useful work.
The fundamental issue: most connections are idle most of the time. A chat server with 10,000 users might have only 100 messages per second. That means 99% of threads are sleeping, burning memory and causing context-switch overhead for nothing.
The first fix was select() and poll(). These let a single thread monitor multiple file descriptors. But they required the kernel to scan every registered fd on every call, even if only one was ready. At 10,000 fds, that is O(10,000) work per event. 100 events per second means one million fd scans per second, most of them finding nothing.
This is the problem epoll and kqueue solve: O(1) event notification per ready descriptor, no matter how many total descriptors you are watching.
What it is
epoll (Linux, 2002) and kqueue (FreeBSD, 2000) are kernel event notification mechanisms. They let a single thread register interest in thousands of file descriptors and then sleep until one or more become ready. The kernel wakes the thread and hands it only the ready descriptors, not the entire set.
Analogy: Imagine a hotel concierge watching 10,000 guest rooms. The old method (select/poll) means the concierge walks every floor, knocking on every door to ask "do you need anything?" With epoll/kqueue, each room has a call button wired to the concierge desk. The concierge sits at the desk, does nothing until a button lights up, then walks directly to that room. The number of rooms does not affect response time.
Interview tip: the C10K framing
When an interviewer asks "how does a web server handle thousands of connections," open with the C10K problem and immediately name epoll/kqueue. This shows you understand the kernel-level mechanism, not just the application framework.
How it works
The event loop model
The core idea is an event loop: register file descriptors once, then repeatedly ask the kernel "which ones are ready?" The kernel returns only the ready set. Your application processes those and loops back to waiting.
The key insight: adding or removing a watched fd is O(log N), because epoll stores registrations in a red-black tree (detailed below). But getting the ready set is O(ready), not O(total). If you watch 50,000 fds and 3 are ready, you get back 3, not 50,000.
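This loop can be exercised without any network at all. The sketch below (the function name `demo_epoll_pipe` is mine, not from the original) registers one end of a pipe with epoll, makes it ready by writing a byte, and asks the kernel for the ready set. Linux-only.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register a pipe's read end, make it ready by writing one byte,
 * then ask epoll which fds are ready. Returns the ready count. */
int demo_epoll_pipe(void) {
    int p[2];
    if (pipe(p) < 0) return -1;

    int epfd = epoll_create1(0);               /* create the epoll instance */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev); /* register once */

    write(p[1], "x", 1);                       /* make p[0] readable */

    struct epoll_event out[8];
    int n = epoll_wait(epfd, out, 8, 1000);    /* kernel returns only ready fds */

    close(p[0]); close(p[1]); close(epfd);
    return n;                                  /* exactly one fd is ready */
}
```

Note the shape: registration happens once, and the wait call returns only the fds with pending events.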
select() and poll(): the predecessors
Before epoll, there were select() and poll().
select(fds, ...) watches a set of file descriptors for readiness. When any fd becomes readable or writable, it returns. Your application then iterates the entire fd set to find which ones are ready.
Problems:
- `select()` has a hard limit of 1024 file descriptors (`FD_SETSIZE`)
- Both `select()` and `poll()` require O(N) scanning of all fds on each call, even if only 1 of 10,000 is ready
- The fd set must be re-passed to the kernel on every call (copied from user space to kernel space)
With 10,000 connections using poll(), each event requires scanning 10,000 file descriptors. At 100 events/second, that is 1 million fd scans per second, most of them finding nothing.
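The scan cost is visible in the API itself. In this sketch (the function name `demo_select_pipe` is mine), even after `select()` returns, the caller must still probe every fd in the range with `FD_ISSET` to find the ready ones:

```c
#include <sys/select.h>
#include <unistd.h>

/* Watch a pipe's read end with select(). After select() returns, the
 * caller must scan the fd range with FD_ISSET -- the O(N) cost the
 * text describes. Returns the number of ready fds found by the scan. */
int demo_select_pipe(void) {
    int p[2];
    if (pipe(p) < 0) return -1;

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(p[0], &rfds);          /* the whole set is re-passed every call */

    write(p[1], "x", 1);

    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
    if (select(p[0] + 1, &rfds, NULL, NULL, &tv) < 0) return -1;

    int ready = 0;
    for (int fd = 0; fd <= p[0]; fd++)    /* O(N) scan to find who is ready */
        if (FD_ISSET(fd, &rfds)) ready++;

    close(p[0]); close(p[1]);
    return ready;
}
```

With one pipe the scan is trivial; with 10,000 sockets it is the dominant cost.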
epoll: the three system calls
epoll uses three system calls that separate registration from waiting:
```c
// 1. Create an epoll instance (returns a file descriptor)
int epfd = epoll_create1(0);

// 2. Register a file descriptor to watch
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;   // readable + edge-triggered
ev.data.fd = socket_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

// 3. Wait for events (blocks until at least one fd is ready)
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
for (int i = 0; i < n; i++) {
    handle(events[i].data.fd);   // only iterate ready fds
}
```
Why this is faster:
- O(1) per event: `epoll_wait` returns only the ready file descriptors. No scanning.
- Persistent registration: fds are registered once via `epoll_ctl`. No re-passing on each wait call.
- No fd limit: epoll has no 1024-fd limitation.
kqueue: the BSD/macOS equivalent
kqueue (FreeBSD, macOS) provides similar functionality with more generality. A kevent can watch not just file descriptors but also signals, timers, child process state changes, and filesystem events.
```c
int kq = kqueue();

// Register interest in read-readiness on socket_fd
struct kevent changelist[1];
EV_SET(&changelist[0], socket_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, changelist, 1, NULL, 0, NULL);                  // register

// Wait for events
struct kevent eventlist[MAX_EVENTS];
int n = kevent(kq, NULL, 0, eventlist, MAX_EVENTS, NULL);  // wait
for (int i = 0; i < n; i++) {
    handle(eventlist[i].ident);
}
```
kqueue batches registration and retrieval into a single kevent() call (you can pass both changelist and eventlist simultaneously). This reduces system call overhead when you need to both register new fds and wait in one operation.
Edge-triggered vs level-triggered
This is the most important implementation decision when using epoll. It changes how your application must handle I/O.
Level-triggered (default): epoll_wait returns an fd as long as data is available in the buffer. You can read partial data, return to epoll_wait, and it will notify you again. Simpler to program. More syscall wakeups.
Edge-triggered (EPOLLET): epoll_wait returns an fd only once when new data arrives. You must drain the fd completely (loop read() until EAGAIN) on each notification. If you read only part of the data and return to epoll_wait, the remaining data sits in the buffer with no further notification until more data arrives.
```
// Level-triggered: simple, partial reads are fine
function handle_level_triggered(fd):
    data = read(fd, 4096)   // read some data
    process(data)           // process it
    // return to epoll_wait (it will fire again if more data remains)

// Edge-triggered: must drain completely
function handle_edge_triggered(fd):
    loop:
        data = read(fd, 4096)
        if data == EAGAIN:  // buffer is empty, done
            break
        process(data)
    // return to epoll_wait (safe, buffer is fully drained)
```
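The drain discipline can be made concrete in C. This sketch (function names `drain_fd` and `demo_drain` are mine) reads a non-blocking fd until the kernel reports `EAGAIN`, which is exactly what edge-triggered handlers must do:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Edge-triggered discipline: read until the kernel reports EAGAIN.
 * Returns total bytes drained, or -1 on a real error. */
ssize_t drain_fd(int fd) {
    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == 0) break;                                   /* EOF */
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;  /* fully drained */
        return -1;                                           /* real error */
    }
    return total;
}

/* Exercise it on a non-blocking pipe holding 10 buffered bytes. */
int demo_drain(void) {
    int p[2];
    if (pipe(p) < 0) return -1;
    fcntl(p[0], F_SETFL, O_NONBLOCK);  /* EPOLLET requires non-blocking fds */
    write(p[1], "hello", 5);
    write(p[1], "world", 5);
    return (int)drain_fd(p[0]);        /* all 10 buffered bytes */
}
```

If the loop stopped after the first `read()`, the remaining bytes would sit in the kernel buffer with no further edge-triggered notification.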
| Property | Level-triggered | Edge-triggered |
|---|---|---|
| Notification frequency | Every epoll_wait while data available | Once per new data arrival |
| Read requirement | Can read partial | Must drain to EAGAIN |
| Syscall overhead | Higher (repeated wakeups) | Lower (one wakeup per event) |
| Programming complexity | Simple | Requires drain loop |
| Bug risk | Low | Data stalls if not drained |
| Used by | Redis, older servers | Nginx, most high-perf servers |
Edge-triggered mode is the #1 source of subtle epoll bugs. If you forget to drain the fd completely, data sits in the kernel buffer indefinitely. The connection appears to "hang" with no errors. Always pair EPOLLET with a non-blocking fd and a drain loop.
The readiness list and internal data structures
Understanding epoll's kernel-side implementation explains why it is O(1) per event.
epoll maintains two data structures:
- A red-black tree of all registered file descriptors. `epoll_ctl` (ADD/DEL/MOD) operations are O(log N) on this tree.
- A ready list (a doubly linked list) of file descriptors that have pending events.
When a packet arrives on a socket, the NIC triggers an interrupt. The kernel's network stack processes the packet and calls a callback function that was registered when the fd was added to epoll. This callback appends the fd to the ready list. No scanning is involved.
When your application calls epoll_wait, the kernel checks the ready list. If it is non-empty, the kernel copies the ready entries to user space and returns immediately. If the list is empty, the calling thread sleeps until a callback fires.
```
// Simplified kernel-side epoll implementation
structure EpollInstance:
    rbtree: RedBlackTree<fd, EpollEntry>   // all watched fds
    ready_list: LinkedList<EpollEntry>     // fds with pending events
    wait_queue: WaitQueue                  // sleeping threads

function epoll_ctl(epfd, op, fd, event):
    if op == ADD:
        entry = new EpollEntry(fd, event)
        rbtree.insert(entry)               // O(log N)
        register_callback(fd, entry)       // device driver will call us
    if op == DEL:
        rbtree.remove(fd)                  // O(log N)

function device_callback(entry):           // called by device driver on IRQ
    ready_list.append(entry)               // O(1)
    wake_up(wait_queue)                    // wake sleeping epoll_wait

function epoll_wait(epfd, events, max, timeout):
    if ready_list.empty():
        sleep_on(wait_queue, timeout)      // block until callback fires
    n = min(ready_list.size(), max)
    copy ready_list[0..n] to events        // O(ready), not O(total)
    return n
```
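The O(ready) claim can be checked empirically. This sketch (the function name `demo_ready_subset` is mine) registers many pipes with one epoll instance but makes only a few of them ready; `epoll_wait` hands back exactly the ready count, not the registered count:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register `total` pipes but write to only `ready` of them.
 * epoll_wait returns exactly `ready` entries, no matter how
 * many fds are registered. */
int demo_ready_subset(int total, int ready) {
    if (ready > total || total > 128) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    int pipes[128][2];
    for (int i = 0; i < total; i++) {
        if (pipe(pipes[i]) < 0) return -1;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipes[i][0] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev);
    }
    for (int i = 0; i < ready; i++)
        write(pipes[i][1], "x", 1);           /* make a few fds ready */

    struct epoll_event out[128];
    int n = epoll_wait(epfd, out, 128, 1000); /* O(ready), not O(total) */

    for (int i = 0; i < total; i++) { close(pipes[i][0]); close(pipes[i][1]); }
    close(epfd);
    return n;
}
```

With 64 watched fds and 3 ready ones, the kernel returns 3 entries; a `select()`-based server would have scanned all 64.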
epoll_ctl operations
epoll_ctl supports three operations for managing the interest set:
| Operation | Flag | Effect | When to use |
|---|---|---|---|
| Add | EPOLL_CTL_ADD | Register a new fd | When a new connection is accepted |
| Modify | EPOLL_CTL_MOD | Change watched events on an existing fd | Switch from read-ready to write-ready |
| Delete | EPOLL_CTL_DEL | Remove an fd from the interest set | When a connection closes |
A common pattern in event-driven servers:
```c
// Accept a new connection
int client_fd = accept(listen_fd, NULL, NULL);
fcntl(client_fd, F_SETFL, O_NONBLOCK);  // MUST be non-blocking for EPOLLET

// Register for read events, edge-triggered
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = client_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);

// Later: switch to watching for write-readiness (e.g., response buffer ready)
ev.events = EPOLLOUT | EPOLLET;
epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev);

// Connection done: remove from epoll
epoll_ctl(epfd, EPOLL_CTL_DEL, client_fd, NULL);
close(client_fd);
```
EPOLL_CTL_MOD is critical for protocols like HTTP where you first read a request, then write a response. You toggle between EPOLLIN and EPOLLOUT as the connection moves through its lifecycle.
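The read-to-write toggle can be observed directly. In this sketch (the function name `demo_ctl_mod` is mine), a socket with no pending data reports nothing under `EPOLLIN`, then reports ready immediately after `EPOLL_CTL_MOD` switches the watch to `EPOLLOUT`, because an idle socket is writable:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Toggle one socket from EPOLLIN to EPOLLOUT with EPOLL_CTL_MOD.
 * Returns 1 if the read watch reported nothing and the write watch
 * reported ready, as expected for an idle connected socket. */
int demo_ctl_mod(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    struct epoll_event out[4];
    int before = epoll_wait(epfd, out, 4, 0);   /* 0: nothing to read */

    ev.events = EPOLLOUT;                       /* lifecycle switch */
    epoll_ctl(epfd, EPOLL_CTL_MOD, sv[0], &ev);
    int after = epoll_wait(epfd, out, 4, 0);    /* 1: socket is writable */

    close(sv[0]); close(sv[1]); close(epfd);
    return before == 0 && after == 1;
}
```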
select/poll vs epoll/kqueue comparison
| Property | select() | poll() | epoll (Linux) | kqueue (BSD/macOS) |
|---|---|---|---|---|
| Max fds | 1024 (FD_SETSIZE) | No hard limit | No hard limit | No hard limit |
| Per-call cost | O(N) scan | O(N) scan | O(ready) | O(ready) |
| Registration | Re-pass every call | Re-pass every call | Persistent (epoll_ctl) | Persistent (kevent) |
| Event types | Read/write/error | Read/write/error/hangup | Read/write/error/hangup/ET | Read/write/signal/timer/vnode/proc |
| Edge-triggered | No | No | Yes (EPOLLET) | Yes (EV_CLEAR) |
| Batched changes | No | No | No (one epoll_ctl per change) | Yes (changelist array) |
| Introduced | 1983 (4.2 BSD) | 1986 (SVR3) | 2002 (Linux 2.5.44) | 2000 (FreeBSD 4.1) |
kqueue's ability to batch registration and retrieval in a single syscall gives it an edge when rapidly adding/removing fds. epoll requires one epoll_ctl call per fd change. In practice, both handle C10K-scale workloads with similar performance. The choice is determined by your OS, not by a performance difference.
Interview tip: abstraction layers
Most applications never call epoll or kqueue directly. They use an abstraction: libuv (Node.js), libevent/libev (C), netty (Java), tokio (Rust), asyncio (Python). The abstraction selects the right mechanism per OS. Name the abstraction layer your favorite language uses.
Production usage
| System | Mechanism | Notable behavior |
|---|---|---|
| Nginx | epoll (Linux), kqueue (macOS/BSD) | One event loop per worker process. Each worker handles thousands of connections. No thread-per-connection overhead. The worker_connections directive sets the max fds per worker. |
| Node.js (libuv) | epoll (Linux), kqueue (macOS), IOCP (Windows) | Single-threaded event loop. I/O callbacks run when fds become ready. Blocking file I/O and DNS lookups run on libuv's thread pool; CPU-bound JavaScript can be offloaded to worker_threads. |
| Redis | epoll (Linux), kqueue (macOS) | Single-threaded event loop in ae.c. All client connections handled by one thread. Commands are fast enough that the single thread processes ~100K ops/sec without blocking. Level-triggered mode. |
| HAProxy | epoll with edge-triggered | Handles millions of concurrent connections. Uses EPOLLRDHUP to detect half-closed connections without extra read() calls. |
| Envoy Proxy | epoll via libevent | Thread-per-core model. Each worker thread runs its own epoll instance. Connections are distributed across workers at accept time. |
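HAProxy's half-close trick from the table can be demonstrated on a Unix socket pair. In this sketch (the function name `demo_rdhup` is mine), the peer shuts down its write side and the watcher sees `EPOLLRDHUP` in the event mask, with no speculative `read()` needed:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Detect a half-closed peer via EPOLLRDHUP. Returns 1 if the event
 * carries EPOLLRDHUP after the peer shuts down its write side. */
int demo_rdhup(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP, .data.fd = sv[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    shutdown(sv[1], SHUT_WR);               /* peer half-closes */

    struct epoll_event out[4];
    int n = epoll_wait(epfd, out, 4, 1000);
    int got = (n == 1) && (out[0].events & EPOLLRDHUP) != 0;

    close(sv[0]); close(sv[1]); close(epfd);
    return got;
}
```

Without `EPOLLRDHUP`, the server would only learn of the half-close by issuing a `read()` and getting 0 back.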
Limitations and when NOT to use it
- CPU-bound work blocks the event loop. If a single request handler takes 50ms of CPU time (JSON parsing, encryption, image processing), all other connections wait. CPU work must be offloaded to a thread pool or separate process.
- Edge-triggered mode is fragile. Forgetting to drain an fd, or accidentally using a blocking fd with `EPOLLET`, causes silent data stalls. Level-triggered is safer for most applications.
- File I/O is not truly async on Linux. `epoll` works for sockets, pipes, and eventfds, but it cannot watch regular files at all: `epoll_ctl` rejects them with EPERM, and a plain `read()` on a file blocks on disk regardless. For true async file I/O, you need `io_uring` (Linux 5.1+) or a thread pool.
- Thundering herd on `accept()`. If multiple threads call `epoll_wait` on the same listen socket, a new connection wakes all of them, but only one succeeds. Use `EPOLLEXCLUSIVE` (Linux 4.5+) or `SO_REUSEPORT` to avoid this.
- Debugging is harder than thread-per-connection. Stack traces show the event loop, not the logical request path. Structured logging with request IDs becomes essential.
- epoll is Linux-only. Code that calls `epoll_*` directly will not compile on macOS or Windows. Use a cross-platform abstraction (libuv, libevent, tokio) unless you are writing Linux-specific infrastructure.
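The file-I/O limitation is easy to verify: on Linux, epoll refuses to watch a regular file at all. This sketch (the function name `demo_file_rejected` and the temp-file path are mine) shows `epoll_ctl` failing with `EPERM`:

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Regular files do not fit the readiness model: epoll_ctl rejects
 * them. Returns the errno produced by the failed registration. */
int demo_file_rejected(void) {
    char path[] = "/tmp/epoll_demo_XXXXXX";   /* illustrative temp file */
    int fd = mkstemp(path);
    if (fd < 0) return -1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    int saved = (rc < 0) ? errno : 0;

    close(fd); close(epfd); unlink(path);
    return saved;                              /* EPERM on Linux */
}
```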
Interview cheat sheet
- When asked "how does a server handle 10K+ connections," state epoll/kqueue immediately. The key insight: the kernel notifies you which fds are ready (O(ready)), instead of you scanning all fds (O(total)).
- When discussing Nginx or Node.js architecture, explain the single-threaded event loop backed by epoll. One thread handles thousands of connections because network I/O is mostly waiting, not computing.
- When asked about edge-triggered vs level-triggered, say edge-triggered is more efficient (fewer wakeups) but requires draining the fd to `EAGAIN`. Level-triggered is simpler but generates more syscalls. Nginx uses edge-triggered; Redis uses level-triggered.
- When comparing to thread-per-connection, quantify: 10,000 threads need ~80MB of stack space plus O(N) context-switch overhead. epoll uses one thread, one data structure, O(ready) per event.
- When asked about epoll internals, describe the red-black tree (for fd registration, O(log N)) and the ready list (populated by device-driver callbacks, checked by `epoll_wait`).
- When asked about portability, explain that epoll is Linux-only, kqueue is BSD/macOS, IOCP is Windows. Production code uses an abstraction layer (libuv, libevent, netty, tokio) that selects the right mechanism per OS.
- When discussing limitations, mention that epoll does not help with CPU-bound work or file I/O on Linux. CPU work blocks the event loop; file `read()` always blocks. Use `io_uring` or a thread pool for those.
- When asked about the thundering herd, name `EPOLLEXCLUSIVE` and `SO_REUSEPORT`. Without them, a new connection on a shared listen socket wakes all waiting threads, wasting CPU.
Quick recap
- The C10K problem: thread-per-connection fails above ~10,000 simultaneous connections due to stack memory (80MB+) and context-switch overhead. epoll and kqueue solve this with kernel-level event notification.
- `select()` and `poll()` require O(N) scans of all file descriptors on every call. epoll returns only the ready fds (O(ready)) because device-driver callbacks populate a ready list; no scan is needed.
- epoll uses a red-black tree for fd registration (O(log N) add/remove) and a linked list for ready events. `epoll_ctl` registers once; `epoll_wait` retrieves only ready fds.
- Edge-triggered mode (`EPOLLET`) notifies once per data arrival, requiring a drain loop to `EAGAIN`. Level-triggered (default) re-notifies while data remains. Edge is more efficient; level is safer.
- kqueue (BSD/macOS) provides equivalent functionality plus events for signals, timers, and filesystem changes. Cross-platform code uses an abstraction layer (libuv, libevent, tokio) that selects the right mechanism per OS.
- epoll does not help with CPU-bound work (it blocks the event loop) or file I/O on Linux (regular files cannot be watched with epoll). Use thread pools or `io_uring` for those cases.
Related concepts
- Networking fundamentals - epoll is the kernel mechanism that makes non-blocking network servers possible. Understanding TCP socket buffers helps explain what "ready" means.
- Zero-copy I/O - Once epoll tells you a socket is ready, zero-copy (`sendfile`) eliminates the memory copies when transferring file data to that socket.
- Event loop internals - The event loop is the application-level pattern built on top of epoll/kqueue. Node.js, Nginx, and Redis all implement event loops differently.
- Connection pooling - Connection pools reduce the number of fds epoll must track by reusing connections instead of creating new ones per request.