Zero-copy I/O
Understand how sendfile and DMA transfer eliminate unnecessary kernel-to-user copies, why Kafka can sustain 1M+ messages per second on commodity hardware, and when zero-copy applies.
The problem
A file server sends a 100MB video file to each of 100 simultaneous clients. For each client, the OS follows four steps: read the file from disk into a kernel read buffer (DMA transfer), copy from the kernel buffer into a user-space buffer (CPU copy), copy from user-space back into a kernel socket buffer (CPU copy), then copy from the socket buffer to the NIC (DMA transfer).
Two of those four steps are pure overhead. Your application code never inspects the bytes. It calls read(), receives a buffer, and immediately calls write(). Nothing useful happens in user space.
At 100 clients, those two CPU copies move 100 clients x 100MB x 2 = 20GB through the CPU and memory bus on every full send cycle. The CPU is not computing anything. It is draining the memory bus doing memcpy. On a 10Gbps NIC, memory bandwidth becomes the bottleneck before network bandwidth does.
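As a back-of-envelope check of the 20GB figure (pure arithmetic, numbers taken from the scenario above):

```python
clients = 100
file_bytes = 100 * 10**6   # 100 MB per client, decimal units
cpu_copies = 2             # kernel-to-user copy + user-to-kernel copy

# Total bytes the CPU moves via memcpy per full send cycle.
total_bytes = clients * file_bytes * cpu_copies
print(total_bytes / 10**9)  # -> 20.0 (GB of pure memcpy traffic)
```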
This is the problem zero-copy I/O solves: eliminate the copies that serve no purpose.
What it is
Zero-copy I/O is a kernel mechanism that transfers data from a file descriptor directly to a network socket descriptor without copying the bytes into user-space memory at any point. The data path stays entirely in the kernel, and in many cases the NIC reads directly from the kernel's file page cache via DMA, involving the CPU only to set up the transfer.
Analogy: Think of a warehouse shipping boxes from a receiving dock to a shipping dock. The traditional method has workers carry each box into the break room, set it down, then carry it to the shipping dock. Zero-copy is a conveyor belt running directly from the receiving dock to the shipping dock. Workers only configure the belt's destination; they never touch the boxes.
How it works
Traditional path (4 copies, 4 context switches)
Copies 2 and 3 are pure overhead. Your code calls read(), receives a buffer pointer, and immediately calls write() with that same pointer. Two context switches, two CPU copies, doubled memory bus pressure.
Zero-copy path with sendfile() (2 copies, 2 context switches)
With sendfile(), the kernel reads the file into the page cache via DMA, then passes a descriptor of those pages to the socket. The NIC reads directly from the page cache via scatter-gather DMA. User space is never involved. The CPU performs zero memcpy operations. Context switches drop from 4 to 2 (one to call sendfile(), one to return).
On modern Linux kernels with scatter-gather DMA support, the socket buffer is not even populated: the NIC reads from the page cache pages directly. The data physically moves through memory only once (disk to RAM), not twice.
System call comparison
```python
import os

# Traditional path: a read/write loop, 2 user-space CPU copies per chunk
file_fd = os.open(path, os.O_RDONLY)
while True:
    chunk = os.read(file_fd, CHUNK_SIZE)   # copy 1: kernel -> user
    if not chunk:                          # empty read means EOF
        break
    os.write(socket_fd, chunk)             # copy 2: user -> kernel
os.close(file_fd)

# Zero-copy path: one sendfile() call, 0 user-space copies
file_fd = os.open(path, os.O_RDONLY)
file_size = os.fstat(file_fd).st_size
os.sendfile(socket_fd, file_fd, 0, file_size)  # kernel handles the transfer
os.close(file_fd)
```
sendfile() system call
The Linux sendfile(out_fd, in_fd, offset, count) call requires:
- in_fd: a file descriptor opened for reading (a regular file or block device).
- out_fd: a socket file descriptor.
- Both must be kernel-managed descriptors. sendfile() does not work if either side requires user-space processing.
The core limitation: because user space never sees the bytes, you cannot transform data in flight. Encryption, compression, and content inspection all require the data to pass through user-space code. sendfile() is for pass-through serving of static content only.
| Scenario | sendfile() applicable | Reason |
|---|---|---|
| Serving a static HTML or video file | Yes | No transformation needed |
| Serving a file over HTTPS (without kTLS) | No | TLS encryption requires user-space access |
| Serving compressed file content | No | Decompression requires user-space processing |
| Serving Kafka log segments to consumers | Yes | Consumer receives raw bytes verbatim |
| Proxying a response without modification | Possible via splice() | Kernel must see bytes as pass-through |
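The pass-through case can be sketched with Python's os.sendfile, a thin wrapper over Linux sendfile(2). This is an illustrative sketch, not any particular server's code; the function name and retry loop are mine:

```python
import os
import socket

def serve_file_zero_copy(conn: socket.socket, path: str) -> int:
    """Send the file at `path` over `conn` without copying bytes
    through user space. os.sendfile wraps sendfile(2): the kernel
    moves file pages from the page cache to the socket, and this
    process never sees the data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            # sendfile may send fewer bytes than asked; loop until done.
            sent = os.sendfile(conn.fileno(), fd, offset, size - offset)
            if sent == 0:
                break
            offset += sent
        return offset
    finally:
        os.close(fd)
```

Note that the application only chooses what to send and where; it never holds a buffer of the file's contents.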
DMA and the NIC
DMA (Direct Memory Access) is a hardware mechanism where a peripheral device (disk controller or NIC) can read from or write to main memory without CPU involvement. The CPU programs the DMA transfer by specifying source address, destination address, and byte count. The DMA engine executes the transfer independently and interrupts the CPU only on completion.
With sendfile() and a NIC that supports scatter-gather DMA, the data path is:
1. CPU calls sendfile() and returns to the kernel scheduler immediately after setting up the transfer.
2. Disk DMA controller reads file blocks from storage into the kernel page cache (RAM).
3. NIC scatter-gather DMA reads from the page cache pages and transmits them on the wire.
4. NIC interrupts the CPU on completion.
In steps 2 and 3 the CPU performs zero memcpy operations; it handles only interrupt dispatch and kernel bookkeeping. At 10Gbps line rate, this frees the CPU for application logic while the NIC runs at full throughput.
mmap() as an alternative
mmap(file_fd, length, offset) maps file pages directly into the calling process's virtual address space. The physical pages are the same kernel page-cache pages; user space receives a pointer that maps onto them without copying. Writes to the mapped region go directly to kernel pages.
Benefits over traditional read/write:
- Eliminates one CPU copy: reading through the mmap pointer addresses kernel pages directly, so no kernel-to-user copy occurs.
- Useful when user-space code needs to inspect or transform the data before sending.
Limitations compared to sendfile():
- Page fault overhead on first access: each page must be faulted in from disk on first read. For large files accessed sequentially once, this overhead exceeds the benefit.
- Still requires a write() call to push data to the socket, which copies from the mapped pages to the socket buffer. One copy saved, not two.
- Cannot be combined with scatter-gather DMA for the socket transmission path without kernel TLS.
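A minimal sketch of the mmap trade-off, assuming a hypothetical handler that must checksum the file before sending (Python's mmap and zlib; the function name and checksum step are illustrative):

```python
import mmap
import os
import socket
import zlib

def send_with_checksum(conn: socket.socket, path: str) -> int:
    """Map a file, inspect it in user space, then send it.

    mmap exposes the kernel's page-cache pages directly, so reading
    through the mapping avoids the kernel-to-user copy of read().
    The socket send still copies from those pages into the socket
    buffer: one copy saved, one remaining."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
            checksum = zlib.crc32(m)  # user-space inspection: possible with mmap
            conn.sendall(m)           # one CPU copy into the socket buffer
            return checksum
    finally:
        os.close(fd)
```

The inspection step is exactly what sendfile() cannot do; mmap trades one remaining copy for user-space visibility.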
Kafka and zero-copy
Kafka's consumer throughput is enabled almost entirely by sendfile(). The broker receives messages and appends them to log segment files on disk. Consumers request log data by offset and byte count. The broker calls sendfile(socket_fd, segment_fd, offset, length) and returns. No broker application code touches the message bytes.
At 1 million messages per second at 1KB per message (1GB/s throughput):
- Without zero-copy: 2GB/s of CPU copies to serve consumers (one read copy, one write copy per byte)
- With zero-copy: 0 bytes of CPU copies; the path is pure DMA
This is why Kafka brokers sustain 1GB/s+ consumer throughput on commodity hardware with modest CPU. The CPU cost of serving consumers is proportional to the number of fetch requests processed, not the number of bytes transmitted.
Replication to follower brokers uses the same path. When a follower fetch arrives, the leader calls sendfile() for the follower's socket exactly as it does for a consumer.
The page cache multiplies this advantage: once a log segment page is read for consumer A, it lands in the kernel page cache. Consumer B fetching the same offset finds the pages already in RAM. The sendfile() call serves them via DMA from RAM (microseconds) rather than from disk (milliseconds).
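The shape of such a fetch handler can be sketched as follows. This is an illustrative Python sketch of the pattern, not Kafka's actual broker code (which is Java/Scala); handle_fetch and its parameters are hypothetical:

```python
import os
import socket

def handle_fetch(conn: socket.socket, segment_fd: int,
                 offset: int, length: int) -> int:
    """Serve `length` bytes of a log segment starting at `offset`.

    The application validates the request, then hands the byte range
    to the kernel. If the pages are already in the page cache (a prior
    consumer read them), the kernel serves them from RAM via DMA with
    no disk access."""
    segment_size = os.fstat(segment_fd).st_size
    if offset >= segment_size:
        return 0  # nothing at or past this offset
    count = min(length, segment_size - offset)
    sent = 0
    while sent < count:
        # sendfile may send fewer bytes than requested; loop until done.
        n = os.sendfile(conn.fileno(), segment_fd, offset + sent, count - sent)
        if n == 0:
            break
        sent += n
    return sent
```

The handler's CPU cost is per request (bookkeeping, bounds checks), not per byte, which matches the throughput claim above.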
Production usage
| System | Zero-copy mechanism | Why |
|---|---|---|
| Kafka broker (consumer path) | sendfile() per consumer fetch | Log segments served directly from page cache to socket; broker application code never touches the bytes |
| nginx static file serving | sendfile on; directive (default on Linux) | Eliminates read and write copies; file served via DMA from page cache to NIC |
| HTTP/2 with kernel TLS (kTLS) | sendfile() with TLS offload to kernel or NIC | kTLS moves TLS record encryption into the kernel, re-enabling zero-copy for encrypted sends |
| Java NIO FileChannel.transferTo() | JVM wrapper around sendfile() on Linux | Same syscall under the hood; the JVM uses the OS optimization transparently for file-to-socket transfers |
Limitations and when NOT to use it
- Cannot transform data in flight. sendfile() bypasses user space entirely. Encryption, compression, and content inspection all require bytes to pass through user-space code. For HTTPS without kTLS, you must use the traditional four-copy path.
- Requires kernel-managed file descriptors on both sides. sendfile() works for a file descriptor to a socket descriptor. It does not work with Java input streams, Go io.Reader interfaces, or any abstraction that interposes user-space buffers between the kernel and the network.
- mmap page fault overhead. On first access, each page must be faulted in from disk. For random access patterns across large files, the per-page fault overhead can exceed the benefit of eliminating one copy.
- Not zero user-space copies for already-in-memory data. Zero-copy saves the user-space CPU copy, not the disk-to-kernel DMA. If data is already in the page cache, you still get one DMA transfer from kernel memory to the NIC. The "zero" refers to zero user-space copies, not zero total copies.
- Linux-specific semantics. sendfile() is a Linux system call. FreeBSD has sendfile with different arguments; macOS has sendfile with yet another signature. It is universally available on modern Linux servers but not guaranteed across all container runtimes or VMs.
- Harder to debug. When bytes never pass through user space, you cannot inspect them with application-level logging, add trace IDs, or measure per-message latency inside the broker. Network-level tools (tcpdump, Wireshark) are your only visibility into what was sent.
kTLS: restoring zero-copy for HTTPS
Linux kernel TLS (kTLS), introduced in Linux 4.13, moves TLS record encryption into the kernel so sendfile() can encrypt pages before handing them to the NIC. With kTLS + OpenSSL 3.0 and the nginx ssl_sendfile directive, a TLS-encrypted nginx server can serve static HTTPS files at near-zero CPU cost rather than the traditional 4-copy encrypted path. As of 2024, kTLS is production-ready on major Linux distributions with AES-NI capable CPUs.
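A configuration sketch for this setup; the directives assume nginx 1.21.4+ built against OpenSSL 3.0+, and the certificate paths are placeholders:

```nginx
# Assumes nginx >= 1.21.4 with OpenSSL 3.0+ on Linux 4.13+.
server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/example.pem;  # placeholder path
    ssl_certificate_key /etc/nginx/certs/example.key;  # placeholder path

    ssl_conf_command Options KTLS;  # ask OpenSSL to enable kernel TLS

    sendfile     on;   # zero-copy file-to-socket transfers
    ssl_sendfile on;   # allow sendfile() on kTLS-enabled TLS connections

    location /static/ {
        root /var/www;  # illustrative document root
    }
}
```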
When to use zero-copy I/O
Interview cheat sheet
- When asked why Kafka is fast, say: zero-copy. The broker calls sendfile() to transfer log segments from the page cache directly to the consumer's socket. No broker application code touches the bytes. CPU cost of serving consumers is proportional to fetch request count, not byte count.
- When asked what zero-copy means, say: the OS transfers data from a file descriptor to a socket descriptor without copying the bytes through user-space memory. The NIC's DMA engine reads from the kernel page cache directly. The CPU performs no memcpy operations.
- When asked how many copies exist in the traditional path, say: four. Disk to kernel read buffer (DMA), kernel to user space (CPU), user space to kernel socket buffer (CPU), socket buffer to NIC (DMA). Zero-copy eliminates the two CPU copies, leaving two DMA copies.
- When asked about the limitation of sendfile, say: you cannot transform data in flight. If you need to encrypt, compress, or modify bytes before sending, user-space code must touch them. For TLS, kernel TLS (kTLS) moves TLS record encryption into the kernel to restore zero-copy; it requires Linux 4.13+, and NIC TLS offload is optional (the kernel can encrypt in software).
- When asked about mmap vs sendfile, say: mmap maps file pages into the user-space address space so your code can read them without a kernel-to-user copy, saving one copy. But sending via a socket still requires write(), which is another copy. sendfile bypasses user space entirely and saves two copies for pure pass-through serving.
- When asked about Java and zero-copy, say: FileChannel.transferTo() calls sendfile() under the hood on Linux. This is the mechanism used by Kafka's Java broker. It is transparent to the application; the JVM chooses the syscall based on the OS.
- When asked about context switches, say: the traditional path has 4 context switches (user to kernel for read, kernel to user return, user to kernel for write, kernel to user return). sendfile() reduces this to 2 (user to kernel for sendfile, kernel to user return).
- When asked to compare DMA with zero-copy, say: DMA is hardware-level (the NIC and disk controller transfer memory without CPU involvement). Zero-copy is OS-level (the kernel routes data between page cache and socket without user-space involvement). They are complementary: zero-copy paths use DMA for both the disk-to-RAM and RAM-to-NIC transfers.
Quick recap
- The traditional I/O path copies data four times per send: disk to kernel read buffer (DMA), kernel to user space (CPU), user space to kernel socket buffer (CPU), and socket buffer to NIC (DMA). The two CPU copies are pure overhead when the application never modifies the bytes.
- sendfile(out_fd, in_fd, offset, count) instructs the kernel to transfer file pages directly to the socket without user-space involvement. On hardware with scatter-gather DMA, the NIC reads from the kernel page cache pages directly, making the CPU copy count zero.
- Zero-copy reduces context switches from 4 to 2 per send, eliminates 2 memcpy operations per chunk, and frees the CPU for application logic while the NIC transmits at line rate.
- Kafka's consumer fetch path calls sendfile() per consumer, which is why brokers sustain 1GB/s+ consumer throughput with minimal CPU: the broker application thread never touches the log bytes being served.
- The core limitation is that data cannot be transformed in flight: encryption, compression, and inspection all require user-space code to touch the bytes, disabling sendfile(). Kernel TLS (kTLS) can restore zero-copy for TLS by moving encryption into the kernel.
- mmap maps file pages into user-space virtual memory so user code can access them without an extra copy, but sending via a socket still requires write(), which is one remaining copy; sendfile is more efficient for pure pass-through serving where user space never needs to inspect the bytes.
Related concepts
- Message queues covers Kafka's architecture at the system design level, including log segments and consumer groups that the zero-copy path serves directly.
- Networking covers TCP, DMA, and kernel I/O primitives that explain why sendfile() reduces context switches and memory bus pressure.
- Scalability covers horizontal scaling strategies; zero-copy is a key vertical optimization that lets a single Kafka broker scale to much higher throughput before horizontal scaling is required.