Zero-copy I/O
Understand how sendfile and DMA transfer eliminate unnecessary kernel-to-user copies, why Kafka can sustain 1M+ messages per second on commodity hardware, and when zero-copy applies.
The problem
A file server sends a 100MB video file to each of 100 simultaneous clients. For each client, the OS follows four steps: read the file from disk into a kernel read buffer (DMA transfer), copy from the kernel buffer into a user-space buffer (CPU copy), copy from user-space back into a kernel socket buffer (CPU copy), then copy from the socket buffer to the NIC (DMA transfer).
Two of those four steps are pure overhead. Your application code never inspects the bytes. It calls read(), receives a buffer, and immediately calls write(). Nothing useful happens in user space.
At 100 clients, those two CPU copies move 100 clients x 100MB x 2 = 20GB through the CPU and memory bus on every full send cycle. The CPU is not computing anything; it is spending memory bandwidth on memcpy. With a 10Gbps NIC, memory bandwidth can become the bottleneck before network bandwidth does.
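The arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: the variable names are illustrative, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Back-of-the-envelope check of the copy overhead described above.
# Assumes decimal units (1 GB = 1000 MB); all names are illustrative.
clients = 100
file_size_mb = 100
cpu_copies_per_send = 2  # kernel -> user, then user -> kernel

wasted_mb = clients * file_size_mb * cpu_copies_per_send
wasted_gb = wasted_mb / 1000

print(f"Data moved by avoidable CPU copies: {wasted_gb:.0f} GB")
```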
This is the problem zero-copy I/O solves: eliminate the copies that serve no purpose.
What it is
Zero-copy I/O is a kernel mechanism that transfers data from a file descriptor directly to a network socket descriptor without copying the bytes into user-space memory at any point. The data path stays entirely in the kernel, and in many cases the NIC reads directly from the kernel's file page cache via DMA, involving the CPU only to set up the transfer.
Analogy: Think of a warehouse shipping boxes from a receiving dock to a shipping dock. The traditional method has workers carry each box into the break room, set it down, then carry it to the shipping dock. Zero-copy is a conveyor belt running directly from the receiving dock to the shipping dock. Workers only configure the belt's destination; they never touch the boxes.
How it works
Traditional path (4 copies, 4 context switches)
Copies 2 and 3 are pure overhead. Your code calls read(), receives the data in a user-space buffer, and immediately calls write() with that same buffer. The round trip through user space costs two extra context switches, two CPU copies, and double the memory bus pressure.
Zero-copy path with sendfile() (2 copies, 2 context switches)
With sendfile(), the kernel reads the file into the page cache via DMA, then passes a descriptor of those pages to the socket. The NIC reads directly from the page cache via scatter-gather DMA. User space is never involved. The CPU performs zero memcpy operations. Context switches drop from 4 to 2 (one to call sendfile(), one to return).
On modern Linux kernels with scatter-gather DMA support, the socket buffer is not even populated: the NIC reads from the page cache pages directly. The data physically moves through memory only once (disk to RAM), not twice.
System call comparison
# Traditional path: one read() and one write() per chunk, 2 CPU copies per chunk
file_fd = open(path, O_RDONLY)
while not EOF:
    n = read(file_fd, user_buffer, CHUNK_SIZE)   # CPU copy 1: kernel -> user
    write(socket_fd, user_buffer, n)             # CPU copy 2: user -> kernel
close(file_fd)

# Zero-copy path: one sendfile() call moves the data, 0 user-space copies
file_fd = open(path, O_RDONLY)
file_size = stat(path).size
# The kernel handles the whole transfer. sendfile() may send fewer than
# count bytes, so real code loops on its return value.
sendfile(socket_fd, file_fd, offset=0, count=file_size)
close(file_fd)
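The pseudocode above maps fairly directly onto Python's standard library on Unix-like systems. The following is a minimal sketch, not a production server: the function names (send_with_copy, send_zero_copy, demo) are hypothetical, and a local socket pair stands in for a real client connection.

```python
import os
import socket
import tempfile
import threading


def send_with_copy(sock: socket.socket, path: str, chunk_size: int = 64 * 1024) -> int:
    """Traditional path: every chunk is copied into user space and back."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # kernel -> user copy
            if not chunk:
                break
            sock.sendall(chunk)          # user -> kernel copy
            sent += len(chunk)
    return sent


def send_zero_copy(sock: socket.socket, path: str) -> int:
    """sendfile() path: the file bytes never enter this process's memory."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # os.sendfile may send fewer bytes than requested, so loop.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break
            sent += n
    return sent


def demo(sender, payload: bytes) -> bytes:
    """Send payload through a socket pair using sender; return what arrives."""
    a, b = socket.socketpair()
    received = bytearray()

    def drain() -> None:
        # Read until the sending side is closed.
        while chunk := b.recv(65536):
            received.extend(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        sender(a, tmp.name)
    finally:
        a.close()
        reader.join()
        b.close()
        os.unlink(tmp.name)
    return bytes(received)
```

Calling demo(send_with_copy, data) and demo(send_zero_copy, data) should both return data unchanged; the difference is only in how many times the bytes are copied on the way out.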
sendfile() system call
The Linux sendfile(out_fd, in_fd, offset, count) call requires:
- in_fd: a file descriptor opened for reading (a regular file or block device).
- out_fd: a socket file descriptor.
- Both must be kernel-managed descriptors. sendfile() does not work if either side requires user-space processing.
The core limitation: because user space never sees the bytes, your application cannot transform data in flight. Encryption, compression, and content inspection performed by application code all require the data to pass through user space. sendfile() is for pass-through serving, where the bytes go out exactly as they are stored.
| Scenario | sendfile() applicable | Reason |
|---|---|---|
| Serving a static HTML or video file | Yes | No transformation needed |
| Serving a file over HTTPS (without kTLS) | No | TLS encryption requires user-space access |
| Serving a file that must be compressed or decompressed on the fly | No | The transformation requires user-space processing |
| Serving Kafka log segments to consumers | Yes | Consumer receives raw bytes verbatim |
| Proxying a response without modification | Not with sendfile() itself; splice() is the kernel-side alternative | in_fd cannot be a socket, so socket-to-socket forwarding needs a different system call |
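In Python, the standard library's socket.sendfile() method reflects the same applicability rules: it uses os.sendfile() when the platform and socket allow it and falls back to an ordinary send() loop otherwise (for example on TLS-wrapped sockets). A minimal sketch, assuming a blocking stream socket; serve_file is a hypothetical helper name:

```python
import socket
import tempfile


def serve_file(conn: socket.socket, path: str) -> int:
    """Send a file verbatim over conn and return the number of bytes sent.

    socket.sendfile() picks the zero-copy path when it can and falls back
    to reading and send()ing otherwise, so callers need no special cases.
    """
    with open(path, "rb") as f:  # the file must be opened in binary mode
        return conn.sendfile(f)


if __name__ == "__main__":
    # A local socket pair stands in for a real client connection.
    server_side, client_side = socket.socketpair()
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 4096)
        tmp.flush()
        print("bytes sent:", serve_file(server_side, tmp.name))
    server_side.close()
    client_side.close()
```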
DMA and the NIC
DMA (Direct Memory Access) is a hardware mechanism where a peripheral device (disk controller or NIC) can read from or write to main memory without CPU involvement. The CPU programs the DMA transfer by specifying source address, destination address, and byte count. The DMA engine executes the transfer independently and interrupts the CPU only on completion.
With sendfile() and a NIC that supports scatter-gather DMA, the data path is:
1. CPU calls sendfile() and returns to the kernel scheduler immediately after setting up the transfer.
2. Disk DMA controller reads file blocks from storage into the kernel page cache (RAM).
3. NIC scatter-gather DMA reads from the page cache pages and transmits them on the wire.
4. NIC interrupts the CPU on completion.
The CPU performs no memcpy operations in steps 2 and 3. It handles only interrupt dispatch and kernel bookkeeping. At 10Gbps line rate, this frees the CPU for application logic while the NIC runs at full throughput.
mmap() as an alternative
mmap(file_fd, length, offset) maps file pages directly into the calling process's virtual address space. The physical pages are the same kernel page-cache pages; user space receives a pointer that maps onto them without copying. With a shared mapping, writes to the mapped region go directly to those kernel pages.
Benefits over traditional read/write:
- Eliminates one CPU copy: reading through the mmap pointer addresses kernel pages directly, so no kernel-to-user copy occurs.
- Useful when user-space code needs to inspect or transform the data before sending.
Limitations compared to sendfile():
- Page fault overhead on first access: each page triggers a page fault the first time it is touched, and pages not yet in the page cache must also be read from disk. For large files read sequentially once, this overhead can outweigh the benefit.
- Still requires a write() call to push data to the socket, which copies from the mapped pages to the socket buffer. One copy saved, not two.
- Cannot be combined with scatter-gather DMA for the socket transmission path without kernel TLS.
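The mmap() trade-off can be sketched with Python's mmap module. This is an illustrative example, not a tuned implementation: send_if_magic and the 4-byte "magic" header check are hypothetical, chosen only to show user-space code inspecting the mapped pages before sending them.

```python
import mmap
import os
import socket
import tempfile


def send_if_magic(sock: socket.socket, path: str, magic: bytes) -> int:
    """Map the file, check its header in place, then send it from the mapping.

    Returns the number of bytes sent, or 0 if the header does not match.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Inspect the page-cache pages directly; only these few
            # bytes are copied into a Python bytes object.
            if mapped[:len(magic)] != magic:
                return 0
            view = memoryview(mapped)
            try:
                # Still one copy here: mapped pages -> socket buffer.
                sock.sendall(view)
            finally:
                view.release()  # must be released before the map closes
    return size


if __name__ == "__main__":
    a, b = socket.socketpair()
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"DATA" + bytes(1000))
        tmp.flush()
        print("bytes sent:", send_if_magic(a, tmp.name, b"DATA"))
    a.close()
    b.close()
```

The design choice here is that the inspection step justifies mmap(); if the code never looked at the bytes, sendfile() would avoid the remaining copy as well.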
Kafka and zero-copy