epoll and kqueue: event-driven I/O explained
How Linux epoll and BSD kqueue enable a single thread to monitor thousands of file descriptors. The C10K problem epoll solved, how Nginx and Node.js use it, and why event-driven I/O beats thread-per-connection at scale.
The problem
In 1999, Dan Kegel published the C10K problem: how do you handle 10,000 concurrent connections on a single server?
The naive approach is one thread per connection. Each thread blocks on read() waiting for data. With 10,000 connections, you have 10,000 threads. Each thread consumes ~8KB of stack space by default, totaling 80MB just for stacks. The OS spends most of its time context-switching between 10,000 threads instead of doing useful work.
The fundamental issue: most connections are idle most of the time. A chat server with 10,000 users might have only 100 messages per second. That means 99% of threads are sleeping, burning memory and causing context-switch overhead for nothing.
The first fix was select() and poll(). These let a single thread monitor multiple file descriptors. But they required the kernel to scan every registered fd on every call, even if only one was ready. At 10,000 fds, that is O(10,000) work per event. 100 events per second means one million fd scans per second, most of them finding nothing.
This is the problem epoll and kqueue solve: O(1) event notification per ready descriptor, no matter how many total descriptors you are watching.
What it is
epoll (Linux, 2002) and kqueue (FreeBSD, 2000) are kernel event notification mechanisms. They let a single thread register interest in thousands of file descriptors and then sleep until one or more become ready. The kernel wakes the thread and hands it only the ready descriptors, not the entire set.
Analogy: Imagine a hotel concierge watching 10,000 guest rooms. The old method (select/poll) means the concierge walks every floor, knocking on every door to ask "do you need anything?" With epoll/kqueue, each room has a call button wired to the concierge desk. The concierge sits at the desk, does nothing until a button lights up, then walks directly to that room. The number of rooms does not affect response time.
Interview tip: the C10K framing
When an interviewer asks "how does a web server handle thousands of connections," open with the C10K problem and immediately name epoll/kqueue. This shows you understand the kernel-level mechanism, not just the application framework.
How it works
The event loop model
The core idea is an event loop: register file descriptors once, then repeatedly ask the kernel "which ones are ready?" The kernel returns only the ready set. Your application processes those and loops back to waiting.
The key insight: adding or removing a watched fd is O(log N) via the red-black tree. But getting the ready set is O(ready), not O(total). If you watch 50,000 fds and 3 are ready, you get back 3, not 50,000.
select() and poll(): the predecessors
Before epoll, there were select() and poll().
select(fds, ...) watches a set of file descriptors for readiness. When any fd becomes readable or writable, it returns. Your application then iterates the entire fd set to find which ones are ready.
Problems:
select()has a hard limit of 1024 file descriptors (FD_SETSIZE)- Both
select()andpoll()require O(N) scanning of all fds on each call, even if only 1 of 10,000 is ready - The fd set must be re-passed to the kernel on every call (copied from user space to kernel space)
With 10,000 connections using poll(), each event requires scanning 10,000 file descriptors. At 100 events/second, that is 1 million fd scans per second, most of them finding nothing.
epoll: the three system calls
epoll uses three system calls that separate registration from waiting:
// 1. Create an epoll instance (returns a file descriptor)
int epfd = epoll_create1(0);
// 2. Register a file descriptor to watch
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET; // readable + edge-triggered
ev.data.fd = socket_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);
// 3. Wait for events (blocks until at least one fd is ready)
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
for (int i = 0; i < n; i++) {
handle(events[i].data.fd); // only iterate ready fds
}
Why this is faster:
- O(1) per event:
epoll_waitreturns only the ready file descriptors. No scanning. - Persistent registration: fds are registered once via
epoll_ctl. No re-passing on each wait call. - No fd limit: epoll has no 1024 fd limitation.
kqueue: the BSD/macOS equivalent
kqueue (FreeBSD, macOS) provides similar functionality with more generality. A kevent can watch not just file descriptors but also signals, timers, child process state changes, and filesystem events.
int kq = kqueue();
struct kevent changelist[1];
EV_SET(&changelist[0], socket_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
kevent(kq, changelist, 1, NULL, 0, NULL); // register
struct kevent eventlist[MAX_EVENTS];
int n = kevent(kq, NULL, 0, eventlist, MAX_EVENTS, NULL); // wait
for (int i = 0; i < n; i++) {
handle(eventlist[i].ident);
}
kqueue batches registration and retrieval into a single kevent() call (you can pass both changelist and eventlist simultaneously). This reduces system call overhead when you need to both register new fds and wait in one operation.
Edge-triggered vs level-triggered
This is the most important implementation decision when using epoll. It changes how your application must handle I/O.
Level-triggered (default): epoll_wait returns an fd as long as data is available in the buffer. You can read partial data, return to epoll_wait, and it will notify you again. Simpler to program. More syscall wakeups.
Edge-triggered (EPOLLET): epoll_wait returns an fd only once when new data arrives. You must drain the fd completely (loop read() until EAGAIN) on each notification. If you read only part of the data and return to epoll_wait, the remaining data sits in the buffer with no further notification until more data arrives.
// Level-triggered: simple, partial reads are fine
function handle_level_triggered(fd):
data = read(fd, 4096) // read some data
process(data) // process it
// return to epoll_wait (it will fire again if more data remains)
// Edge-triggered: must drain completely
function handle_edge_triggered(fd):
loop:
data = read(fd, 4096)
if data == EAGAIN: // buffer is empty, done
break
process(data)
// return to epoll_wait (safe, buffer is fully drained)
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.