Zero-copy I/O
Understand how sendfile and DMA transfer eliminate unnecessary kernel-to-user copies, why Kafka can sustain 1M+ messages per second on commodity hardware, and when zero-copy applies.
The problem
A file server sends a 100MB video file to each of 100 simultaneous clients. For each client, the OS follows four steps: read the file from disk into a kernel read buffer (DMA transfer), copy from the kernel buffer into a user-space buffer (CPU copy), copy from user-space back into a kernel socket buffer (CPU copy), then copy from the socket buffer to the NIC (DMA transfer).
Two of those four steps are pure overhead. Your application code never inspects the bytes. It calls read(), receives a buffer, and immediately calls write(). Nothing useful happens in user space.
At 100 clients, those two CPU copies move 100 clients x 100MB x 2 = 20GB through the CPU and memory bus on every full send cycle. The CPU is not computing anything. It is draining the memory bus doing memcpy. On a 10Gbps NIC, memory bandwidth becomes the bottleneck before network bandwidth does.
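As a back-of-envelope check of the 20GB figure (pure arithmetic, numbers taken from the scenario above):

```python
clients = 100
file_bytes = 100 * 10**6   # 100 MB per client, decimal units
cpu_copies = 2             # kernel-to-user copy + user-to-kernel copy

# Total bytes the CPU moves via memcpy per full send cycle.
total_bytes = clients * file_bytes * cpu_copies
print(total_bytes / 10**9)  # -> 20.0 (GB of pure memcpy traffic)
```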
This is the problem zero-copy I/O solves: eliminate the copies that serve no purpose.
What it is
Zero-copy I/O is a kernel mechanism that transfers data from a file descriptor directly to a network socket descriptor without copying the bytes into user-space memory at any point. The data path stays entirely in the kernel, and in many cases the NIC reads directly from the kernel's file page cache via DMA, involving the CPU only to set up the transfer.
Analogy: Think of a warehouse shipping boxes from a receiving dock to a shipping dock. The traditional method has workers carry each box into the break room, set it down, then carry it to the shipping dock. Zero-copy is a conveyor belt running directly from the receiving dock to the shipping dock. Workers only configure the belt's destination; they never touch the boxes.
How it works
Traditional path (4 copies, 4 context switches)
Copies 2 and 3 are pure overhead. Your code calls read(), receives a buffer pointer, and immediately calls write() with that same pointer. Two context switches, two CPU copies, doubled memory bus pressure.
Zero-copy path with sendfile() (2 copies, 2 context switches)
With sendfile(), the kernel reads the file into the page cache via DMA, then passes a descriptor of those pages to the socket. The NIC reads directly from the page cache via scatter-gather DMA. User space is never involved. The CPU performs zero memcpy operations. Context switches drop from 4 to 2 (one to call sendfile(), one to return).
On modern Linux kernels with scatter-gather DMA support, the socket buffer is not even populated: the NIC reads from the page cache pages directly. The data physically moves through memory only once (disk to RAM), not twice.
System call comparison
```python
import os

# Traditional path: a read/write loop, 2 user-space CPU copies per chunk
file_fd = os.open(path, os.O_RDONLY)
while True:
    chunk = os.read(file_fd, CHUNK_SIZE)   # copy 1: kernel -> user
    if not chunk:                          # empty read means EOF
        break
    os.write(socket_fd, chunk)             # copy 2: user -> kernel
os.close(file_fd)

# Zero-copy path: one sendfile() call, 0 user-space copies
file_fd = os.open(path, os.O_RDONLY)
file_size = os.fstat(file_fd).st_size
os.sendfile(socket_fd, file_fd, 0, file_size)  # kernel handles the transfer
os.close(file_fd)
```
sendfile() system call
The Linux sendfile(out_fd, in_fd, offset, count) call requires:
- in_fd: a file descriptor opened for reading (a regular file or block device).
- out_fd: a socket file descriptor.
- Both must be kernel-managed descriptors. sendfile() does not work if either side requires user-space processing.
The core limitation: because user space never sees the bytes, you cannot transform data in flight. Encryption, compression, and content inspection all require the data to pass through user-space code. sendfile() is for pass-through serving of static content only.
| Scenario | sendfile() applicable | Reason |
|---|---|---|
| Serving a static HTML or video file | Yes | No transformation needed |
| Serving a file over HTTPS (without kTLS) | No | TLS encryption requires user-space access |
| Serving compressed file content | No | Decompression requires user-space processing |
| Serving Kafka log segments to consumers | Yes | Consumer receives raw bytes verbatim |
| Proxying a response without modification | Possible via splice() | Kernel must see bytes as pass-through |
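The pass-through case can be sketched with Python's os.sendfile, a thin wrapper over Linux sendfile(2). This is an illustrative sketch, not any particular server's code; the function name and retry loop are mine:

```python
import os
import socket

def serve_file_zero_copy(conn: socket.socket, path: str) -> int:
    """Send the file at `path` over `conn` without copying bytes
    through user space. os.sendfile wraps sendfile(2): the kernel
    moves file pages from the page cache to the socket, and this
    process never sees the data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            # sendfile may send fewer bytes than asked; loop until done.
            sent = os.sendfile(conn.fileno(), fd, offset, size - offset)
            if sent == 0:
                break
            offset += sent
        return offset
    finally:
        os.close(fd)
```

Note that the application only chooses what to send and where; it never holds a buffer of the file's contents.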
DMA and the NIC
DMA (Direct Memory Access) is a hardware mechanism where a peripheral device (disk controller or NIC) can read from or write to main memory without CPU involvement. The CPU programs the DMA transfer by specifying source address, destination address, and byte count. The DMA engine executes the transfer independently and interrupts the CPU only on completion.
With sendfile() and a NIC that supports scatter-gather DMA, the data path is:
1. CPU calls sendfile() and returns to the kernel scheduler immediately after setting up the transfer.
2. Disk DMA controller reads file blocks from storage into the kernel page cache (RAM).
3. NIC scatter-gather DMA reads from the page cache pages and transmits them on the wire.
4. NIC interrupts the CPU on completion.
In steps 2 and 3 the CPU performs zero memcpy operations; it handles only interrupt dispatch and kernel bookkeeping. At 10Gbps line rate, this frees the CPU for application logic while the NIC runs at full throughput.
mmap() as an alternative
mmap(file_fd, length, offset) maps file pages directly into the calling process's virtual address space. The physical pages are the same kernel page-cache pages; user space receives a pointer that maps onto them without copying. Writes to the mapped region go directly to kernel pages.
Benefits over traditional read/write:
- Eliminates one CPU copy: reading through the mmap pointer addresses kernel pages directly, so no kernel-to-user copy occurs.
- Useful when user-space code needs to inspect or transform the data before sending.
Limitations compared to sendfile():
- Page fault overhead on first access: each page must be faulted in from disk on first read. For large files accessed sequentially once, this overhead exceeds the benefit.
- Still requires a write() call to push data to the socket, which copies from the mapped pages to the socket buffer. One copy saved, not two.
- Cannot be combined with scatter-gather DMA for the socket transmission path without kernel TLS.
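A minimal sketch of the mmap trade-off, assuming a hypothetical handler that must checksum the file before sending (Python's mmap and zlib; the function name and checksum step are illustrative):

```python
import mmap
import os
import socket
import zlib

def send_with_checksum(conn: socket.socket, path: str) -> int:
    """Map a file, inspect it in user space, then send it.

    mmap exposes the kernel's page-cache pages directly, so reading
    through the mapping avoids the kernel-to-user copy of read().
    The socket send still copies from those pages into the socket
    buffer: one copy saved, one remaining."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
            checksum = zlib.crc32(m)  # user-space inspection: possible with mmap
            conn.sendall(m)           # one CPU copy into the socket buffer
            return checksum
    finally:
        os.close(fd)
```

The inspection step is exactly what sendfile() cannot do; mmap trades one remaining copy for user-space visibility.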
Kafka and zero-copy
Kafka's consumer throughput is enabled almost entirely by sendfile(). The broker receives messages and appends them to log segment files on disk. Consumers request log data by offset and byte count. The broker calls sendfile(socket_fd, segment_fd, offset, length) and returns. No broker application code touches the message bytes.
At 1 million messages per second at 1KB per message (1GB/s throughput):
- Without zero-copy: 2GB/s of CPU copies to serve consumers (one read copy, one write copy per byte)
- With zero-copy: 0 bytes of CPU copies; the path is pure DMA
This is why Kafka brokers sustain 1GB/s+ consumer throughput on commodity hardware with modest CPU. The CPU cost of serving consumers is proportional to the number of fetch requests processed, not the number of bytes transmitted.
Replication to follower brokers uses the same path. When a follower fetch arrives, the leader calls sendfile() for the follower's socket exactly as it does for a consumer.
The page cache multiplies this advantage: once a log segment page is read for consumer A, it lands in the kernel page cache. Consumer B fetching the same offset finds the pages already in RAM. The sendfile() call serves them via DMA from RAM (microseconds) rather than from disk (milliseconds).
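The shape of such a fetch handler can be sketched as follows. This is an illustrative Python sketch of the pattern, not Kafka's actual broker code (which is Java/Scala); handle_fetch and its parameters are hypothetical:

```python
import os
import socket

def handle_fetch(conn: socket.socket, segment_fd: int,
                 offset: int, length: int) -> int:
    """Serve `length` bytes of a log segment starting at `offset`.

    The application validates the request, then hands the byte range
    to the kernel. If the pages are already in the page cache (a prior
    consumer read them), the kernel serves them from RAM via DMA with
    no disk access."""
    segment_size = os.fstat(segment_fd).st_size
    if offset >= segment_size:
        return 0  # nothing at or past this offset
    count = min(length, segment_size - offset)
    sent = 0
    while sent < count:
        # sendfile may send fewer bytes than requested; loop until done.
        n = os.sendfile(conn.fileno(), segment_fd, offset + sent, count - sent)
        if n == 0:
            break
        sent += n
    return sent
```

The handler's CPU cost is per request (bookkeeping, bounds checks), not per byte, which matches the throughput claim above.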
Production usage
| System | Zero-copy mechanism | Why |
|---|---|---|
| Kafka broker (consumer path) | sendfile() per consumer fetch | Log segments served directly from page cache to socket; broker application code never touches the bytes |
| nginx static file serving | sendfile on; directive (default on Linux) | Eliminates read and write copies; file served via DMA from page cache to NIC |
| HTTP/2 with kernel TLS (kTLS) | sendfile() with TLS offload to kernel or NIC | kTLS moves TLS record encryption into the kernel, re-enabling zero-copy for encrypted sends |
| Java NIO FileChannel.transferTo() | JVM wrapper around sendfile() on Linux | Same syscall under the hood; the JVM uses the OS optimization transparently for file-to-socket transfers |
Limitations and when NOT to use it
- Cannot transform data in flight. sendfile() bypasses user space entirely. Encryption, compression, and content inspection all require bytes to pass through user-space code. For HTTPS without kTLS, you must use the traditional four-copy path.
- Requires kernel-managed file descriptors on both sides. sendfile() works for a file descriptor to a socket descriptor. It does not work with Java input streams, Go io.Reader interfaces, or any abstraction that interposes user-space buffers between the kernel and the network.
- mmap page fault overhead. On first access, each page must be faulted in from disk. For random access patterns across large files, the per-page fault overhead can exceed the benefit of eliminating one copy.
- Not zero user-space copies for already-in-memory data. Zero-copy saves the user-space CPU copy, not the disk-to-kernel DMA. If data is already in the page cache, you still get one DMA transfer from kernel memory to the NIC. The "zero" refers to zero user-space copies, not zero total copies.
- Linux-specific semantics. sendfile() is a Linux system call. FreeBSD has sendfile with different arguments; macOS has sendfile with yet another signature. It is universally available on modern Linux servers but not guaranteed across all container runtimes or VMs.
- Harder to debug. When bytes never pass through user space, you cannot inspect them with application-level logging, add trace IDs, or measure per-message latency inside the broker. Network-level tools (tcpdump, Wireshark) are your only visibility into what was sent.
kTLS: restoring zero-copy for HTTPS
Linux kernel TLS (kTLS), introduced in Linux 4.13, moves TLS record encryption into the kernel so sendfile() can encrypt pages before handing them to the NIC. With kTLS + OpenSSL 3.0 and the nginx ssl_sendfile directive, a TLS-encrypted nginx server can serve static HTTPS files at near-zero CPU cost rather than the traditional 4-copy encrypted path. As of 2024, kTLS is production-ready on major Linux distributions with AES-NI capable CPUs.
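A configuration sketch for this setup; the directives assume nginx 1.21.4+ built against OpenSSL 3.0+, and the certificate paths are placeholders:

```nginx
# Assumes nginx >= 1.21.4 with OpenSSL 3.0+ on Linux 4.13+.
server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/example.pem;  # placeholder path
    ssl_certificate_key /etc/nginx/certs/example.key;  # placeholder path

    ssl_conf_command Options KTLS;  # ask OpenSSL to enable kernel TLS

    sendfile     on;   # zero-copy file-to-socket transfers
    ssl_sendfile on;   # allow sendfile() on kTLS-enabled TLS connections

    location /static/ {
        root /var/www;  # illustrative document root
    }
}
```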
When to use zero-copy I/O
Interview cheat sheet
- When asked why Kafka is fast, say: zero-copy. The broker calls sendfile() to transfer log segments from the page cache directly to the consumer's socket. No broker application code touches the bytes. CPU cost of serving consumers is proportional to fetch request count, not byte count.
- When asked what zero-copy means, say: the OS transfers data from a file descriptor to a socket descriptor without copying the bytes through user-space memory. The NIC's DMA engine reads from the kernel page cache directly. The CPU performs no memcpy operations.
- When asked how many copies exist in the traditional path, say: four. Disk to kernel read buffer (DMA), kernel to user space (CPU), user space to kernel socket buffer (CPU), socket buffer to NIC (DMA). Zero-copy eliminates the two CPU copies, leaving two DMA copies.
- When asked about the limitation of sendfile, say: you cannot transform data in flight. If you need to encrypt, compress, or modify bytes before sending, user-space code must touch them. For TLS, kernel TLS (kTLS) moves TLS record encryption into the kernel to restore zero-copy; it requires Linux 4.13+, and NIC TLS offload is optional (the kernel can encrypt in software).
- When asked about mmap vs sendfile, say: mmap maps file pages into the user-space address space so your code can read them without a kernel-to-user copy, saving one copy. But sending via a socket still requires write(), which is another copy. sendfile bypasses user space entirely and saves two copies for pure pass-through serving.
- When asked about Java and zero-copy, say: FileChannel.transferTo() calls sendfile() under the hood on Linux. This is the mechanism used by Kafka's Java broker. It is transparent to the application; the JVM chooses the syscall based on the OS.
- When asked about context switches, say: the traditional path has 4 context switches (user to kernel for read, kernel to user return, user to kernel for write, kernel to user return). sendfile() reduces this to 2 (user to kernel for sendfile, kernel to user return).
- When asked to compare DMA with zero-copy, say: DMA is hardware-level (the NIC and disk controller transfer memory without CPU involvement). Zero-copy is OS-level (the kernel routes data between page cache and socket without user-space involvement). They are complementary: zero-copy paths use DMA for both the disk-to-RAM and RAM-to-NIC transfers.
Quick recap
- The traditional I/O path copies data four times per send: disk to kernel read buffer (DMA), kernel to user space (CPU), user space to kernel socket buffer (CPU), and socket buffer to NIC (DMA). The two CPU copies are pure overhead when the application never modifies the bytes.
- sendfile(out_fd, in_fd, offset, count) instructs the kernel to transfer file pages directly to the socket without user-space involvement. On hardware with scatter-gather DMA, the NIC reads from the kernel page cache pages directly, making the CPU copy count zero.
- Zero-copy reduces context switches from 4 to 2 per send, eliminates 2 memcpy operations per chunk, and frees the CPU for application logic while the NIC transmits at line rate.
- Kafka's consumer fetch path calls sendfile() per consumer, which is why brokers sustain 1GB/s+ consumer throughput with minimal CPU: the broker application thread never touches the log bytes being served.
- The core limitation is that data cannot be transformed in flight: encryption, compression, and inspection all require user-space code to touch the bytes, disabling sendfile(). Kernel TLS (kTLS) can restore zero-copy for TLS by moving encryption into the kernel.
- mmap maps file pages into user-space virtual memory so user code can access them without an extra copy, but sending via a socket still requires write(), which is one remaining copy; sendfile is more efficient for pure pass-through serving where user space never needs to inspect the bytes.
Related concepts
- Message queues covers Kafka's architecture at the system design level, including log segments and consumer groups that the zero-copy path serves directly.
- Networking covers TCP, DMA, and kernel I/O primitives that explain why sendfile() reduces context switches and memory bus pressure.
- Scalability covers horizontal scaling strategies; zero-copy is a key vertical optimization that lets a single Kafka broker scale to much higher throughput before horizontal scaling is required.