Memory-mapped files
How mmap maps file contents directly into virtual memory, enabling zero-copy file access. How databases use it, the OS page cache interaction, and when mmap is faster (and slower) than read/write system calls.
The problem
Your database stores 50GB of B-tree index data on disk. Every query traverses 3-4 tree levels, each requiring a pread() system call. Each pread() copies 4KB from the kernel page cache into a user-space buffer: a switch into kernel mode, a memcpy, and a switch back. At 50,000 queries per second with 4 reads each, that is 200,000 system calls per second, each copying data that your code reads once and discards.
The data is already in the kernel page cache (your working set fits in RAM). The OS has the bytes sitting in memory. But your application cannot access them without copying them across the kernel/user-space boundary. You are paying for 200,000 copies per second of data that already exists in RAM.
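To put a number on that copy traffic, here is a minimal sketch of the arithmetic. It assumes every read copies one full 4KB page; the function name `copy_bytes_per_second` is illustrative and not part of any API.

```c
#include <stdint.h>

/* Bytes copied per second from the page cache into user buffers,
 * assuming every read copies exactly `bytes_per_read` bytes. */
uint64_t copy_bytes_per_second(uint64_t queries_per_second,
                               uint64_t reads_per_query,
                               uint64_t bytes_per_read)
{
    return queries_per_second * reads_per_query * bytes_per_read;
}
```

With the figures above, `copy_bytes_per_second(50000, 4, 4096)` gives 819,200,000 bytes, roughly 0.8 GB of memcpy per second for data that is already in RAM.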
It gets worse with writes. Your application writes to a user-space buffer, then calls pwrite() to copy that buffer into the kernel page cache, then calls fsync() to flush to disk. Three steps, two copies, and the buffer management code (allocation, eviction, write-back) is thousands of lines of complexity in your application.
This is the problem memory-mapped files solve: eliminate the copy between kernel page cache and user space by letting your process access the page cache directly.
What it is
mmap() is a system call that maps a file (or a portion of a file) into the calling process's virtual address space. After the mapping, the process reads and writes the file using ordinary memory access instructions, with no read() or write() system calls.
Analogy: Think of a library with a special reading room. The traditional approach: you request a book, a librarian fetches it from the shelf, photocopies the pages you need, and hands you the copies. With mmap, the librarian gives you a key to the shelf. You walk to the shelf and read the book directly. No photocopying, no waiting for the librarian. If someone else has a key, they read the same physical book.
// Traditional file I/O: copy through the kernel
int fd = open("data.db", O_RDONLY);
char buffer[4096];
pread(fd, buffer, 4096, 0); // syscall: kernel copies to user buffer
int value = *(int*)(buffer + 16); // read from user buffer

// mmap: access page cache directly
int fd = open("data.db", O_RDONLY);
char* addr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0); // char* so pointer arithmetic is standard C
int value = *(int*)(addr + 16); // direct memory read, no syscall
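The two fragments above leave out error handling and cleanup. A fuller sketch of both paths, assuming a POSIX system, might look like the following. The helper names `read_i32_pread` and `read_i32_mmap` are made up for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read a 32-bit value at `offset` with pread(): one syscall, one copy. */
int read_i32_pread(const char *path, off_t offset, int32_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = pread(fd, out, sizeof *out, offset);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}

/* Read the same value through a read-only shared mapping. */
int read_i32_mmap(const char *path, off_t offset, int32_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || offset < 0 ||
        st.st_size < offset + (off_t)sizeof *out) {
        close(fd);
        return -1;   /* reading past the end of the file would fault */
    }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);       /* the mapping stays valid after close() */
    if (addr == MAP_FAILED) return -1;

    /* memcpy sidesteps alignment and aliasing issues with casting. */
    memcpy(out, (const char *)addr + offset, sizeof *out);

    munmap(addr, (size_t)st.st_size);
    return 0;
}
```

A real database would create the mapping once and keep it for the life of the process. Mapping and unmapping per read, as this sketch does for self-containment, would cost more than the pread() it replaces.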
Interview tip: mmap as zero-copy
When an interviewer asks about zero-copy I/O, mmap is one of the main mechanisms (sendfile is another). The key phrase: "mmap lets the process access the kernel page cache directly, eliminating the user-space buffer copy."
How it works
The virtual memory mapping
When you call mmap(), the kernel does not read any data from disk. It creates a mapping in the process's page table: virtual addresses in the process point to "not yet loaded" entries. The actual data loads on demand via page faults.
Step by step:
- mmap() creates virtual address mappings. No disk I/O happens yet.
- The process dereferences a pointer into the mapped region (e.g., *(addr + 16)).
- The MMU checks the page table. If the page is not resident (no physical mapping), it triggers a page fault.
- The kernel's page fault handler checks if the page is in the page cache. If yes (soft fault), it maps the physical page and returns. If no (hard fault), it reads 4KB from disk into the page cache, maps it, then returns.
- The process's memory access completes. From the process's perspective, it was a normal pointer dereference.
Traditional read() vs mmap path
The critical difference: with pread(), every read is a system call that copies data. With mmap, only the first access to each page involves the kernel. After the page is mapped, reads are pure CPU memory instructions with no kernel involvement.
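One way to watch this from user space is to count the minor faults a process takes while touching a mapping. The sketch below assumes a Linux or BSD-like system where getrusage() fills in ru_minflt. The helper names `map_file_readonly` and `touch_pages_count_faults` are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns MAP_FAILED on error. */
void *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping outlives the descriptor */
    if (addr != MAP_FAILED) *len_out = (size_t)st.st_size;
    return addr;
}

/* Minor (soft) faults this process has taken so far. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Read one byte from every page and return how many minor faults
 * that pass over the mapping caused. */
long touch_pages_count_faults(const volatile unsigned char *base,
                              size_t len, size_t page)
{
    long before = minor_faults();
    unsigned sum = 0;
    for (size_t off = 0; off < len; off += page)
        sum += base[off];   /* volatile read, so the loop is not optimized away */
    (void)sum;
    return minor_faults() - before;
}
```

Calling `touch_pages_count_faults` twice on the same mapping should show the pattern described above: the first pass takes faults, the second takes few or none. On Linux the first pass typically reports fewer faults than pages, because the kernel can map several neighboring cached pages while handling a single fault.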
Page fault handling: soft vs hard faults
Not all page faults are expensive. The cost depends entirely on whether the data is already in the page cache.
| Fault type | Cause | Cost | Frequency |
|---|---|---|---|
| Soft fault (minor) | Page is in page cache but not mapped into this process | ~1-2 microseconds (page table update only) | Common after warm-up |
| Hard fault (major) | Page is not in page cache, must read from disk | 50-200 microseconds (SSD) or 5-15ms (HDD) | Cold start, working set exceeds RAM |
For a database with a working set that fits in RAM, virtually all faults after warm-up are soft faults. The database runs at memory speed with no system call overhead. I have seen LMDB benchmarks where read latency drops to sub-microsecond after the page cache is warm.
When the working set exceeds available RAM, the OS evicts pages from the page cache under memory pressure. Subsequent accesses to evicted pages trigger hard faults. The database has no control over which pages the OS evicts. This is the fundamental tradeoff of mmap: you trade simplicity for control.
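The process cannot choose what gets evicted, but on Linux it can at least observe what is currently resident. A minimal sketch using mincore(), assuming Linux semantics where the low bit of each vector entry marks a resident page (`resident_pages` is an illustrative name):

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mincore() in glibc headers */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of a mapped region are currently resident.
 * `addr` must be page-aligned, as returned by mmap().
 * Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    unsigned char *vec = malloc(npages);
    if (vec == NULL) return -1;

    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;   /* low bit set: page is in memory */

    free(vec);
    return resident;
}
```

The result is only a snapshot. A page reported as resident can be evicted before the next instruction runs, so this is useful for monitoring and not as a guarantee against hard faults.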
// What happens inside the kernel on a page fault (pseudocode)
function handle_page_fault(virtual_address, process):
    file_offset = virtual_address - mmap_base_address
    page_number = file_offset / PAGE_SIZE

    if page_cache.contains(file, page_number):
        // Soft fault: page is in memory, just not mapped
        physical_frame = page_cache.get(file, page_number)
        process.page_table.map(virtual_address, physical_frame)
        return // ~1-2 microseconds

    // Hard fault: must read from disk
    physical_frame = allocate_page_frame()
    disk_read(file, page_number * PAGE_SIZE, physical_frame) // blocks
    page_cache.insert(file, page_number, physical_frame)
    process.page_table.map(virtual_address, physical_frame)
    return // 50us (SSD) to 15ms (HDD)
Hard page faults on mmap are indistinguishable from normal memory accesses in application code. A pointer dereference that normally takes nanoseconds can suddenly block for milliseconds if the page was evicted. This makes latency unpredictable for latency-sensitive databases, which is why PostgreSQL avoids mmap for data files.
msync, dirty pages, and write durability
When you write to an mmap-ed region (with PROT_WRITE), the kernel marks the page as "dirty" in the page cache. The data is in memory but not yet on disk. The OS will eventually flush dirty pages to disk, but "eventually" is not good enough for databases.
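A minimal sketch of forcing that flush with msync(), assuming a POSIX system and a file that is already large enough to hold the write. The helper name `write_durable_mmap` is hypothetical, and whether msync() alone is enough for durability depends on the filesystem and storage stack.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite `len` bytes at `offset` through a shared writable mapping,
 * then block until the kernel reports the dirty pages written back.
 * Returns 0 on success, -1 on error. */
int write_durable_mmap(const char *path, size_t offset,
                       const void *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || (size_t)st.st_size < offset + len) {
        close(fd);
        return -1;   /* a mapping cannot grow the file */
    }
    size_t map_len = (size_t)st.st_size;

    char *addr = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) return -1;

    memcpy(addr + offset, data, len);   /* dirties the touched pages */

    /* MS_SYNC waits for write-back; MS_ASYNC would only schedule it. */
    int rc = msync(addr, map_len, MS_SYNC);

    munmap(addr, map_len);
    return rc;
}
```

MAP_SHARED matters here: with MAP_PRIVATE the writes would go to private copy-on-write pages and never reach the file.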