Memory-mapped files
How mmap maps file contents directly into virtual memory, enabling zero-copy file access. How databases use it, the OS page cache interaction, and when mmap is faster (and slower) than read/write system calls.
The problem
Your database stores 50GB of B-tree index data on disk. Every query traverses 3-4 tree levels, each requiring a pread() system call. Each pread() copies 4KB from the kernel page cache into a user-space buffer: context switch to kernel, memcpy, context switch back. At 50,000 queries per second with 4 reads each, that is 200,000 system calls per second, each copying data that your code reads once and discards.
The data is already in the kernel page cache (your working set fits in RAM). The OS has the bytes sitting in memory. But your application cannot access them without copying them across the kernel/user-space boundary. You are paying for 200,000 copies per second of data that already exists in RAM.
It gets worse with writes. Your application writes to a user-space buffer, then calls pwrite() to copy that buffer into the kernel page cache, then calls fsync() to flush to disk. Three steps, two copies, and the buffer management code (allocation, eviction, write-back) is thousands of lines of complexity in your application.
This is the problem memory-mapped files solve: eliminate the copy between kernel page cache and user space by letting your process access the page cache directly.
What it is
mmap() is a system call that maps a file (or a portion of a file) into the calling process's virtual address space. After the mapping, the process reads and writes the file using ordinary memory access instructions, with no read() or write() system calls.
Analogy: Think of a library with a special reading room. The traditional approach: you request a book, a librarian fetches it from the shelf, photocopies the pages you need, and hands you the copies. With mmap, the librarian gives you a key to the shelf. You walk to the shelf and read the book directly. No photocopying, no waiting for the librarian. If someone else has a key, they read the same physical book.
```c
// Traditional file I/O: copy through the kernel
int fd = open("data.db", O_RDONLY);
char buffer[4096];
pread(fd, buffer, 4096, 0);            // syscall: kernel copies to user buffer
int value = *(int*)(buffer + 16);      // read from user buffer
```

```c
// mmap: access page cache directly
int fd = open("data.db", O_RDONLY);
void* addr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
int value = *(int*)((char*)addr + 16); // direct memory read, no syscall
```
Interview tip: mmap as zero-copy
When an interviewer asks about zero-copy I/O, mmap is one of the two main mechanisms (the other is sendfile). The key phrase: "mmap lets the process access the kernel page cache directly, eliminating the user-space buffer copy."
How it works
The virtual memory mapping
When you call mmap(), the kernel does not read any data from disk. It creates a mapping in the process's page table: virtual addresses in the process point to "not yet loaded" entries. The actual data loads on demand via page faults.
Step by step:
1. `mmap()` creates virtual address mappings. No disk I/O happens yet.
2. The process dereferences a pointer into the mapped region (e.g., `*(addr + 16)`).
3. The MMU checks the page table. If the page is not resident (no physical mapping), it triggers a page fault.
4. The kernel's page fault handler checks if the page is in the page cache. If yes (soft fault), it maps the physical page and returns. If no (hard fault), it reads 4KB from disk into the page cache, maps it, then returns.
5. The process's memory access completes. From the process's perspective, it was a normal pointer dereference.
Traditional read() vs mmap path
The critical difference: with pread(), every read is a system call that copies data. With mmap, only the first access to each page involves the kernel. After the page is mapped, reads are pure CPU memory instructions with no kernel involvement.
Page fault handling: soft vs hard faults
Not all page faults are expensive. The cost depends entirely on whether the data is already in the page cache.
| Fault type | Cause | Cost | Frequency |
|---|---|---|---|
| Soft fault (minor) | Page is in page cache but not mapped into this process | ~1-2 microseconds (page table update only) | Common after warm-up |
| Hard fault (major) | Page is not in page cache, must read from disk | 50-200 microseconds (SSD) or 5-15ms (HDD) | Cold start, working set exceeds RAM |
For a database with a working set that fits in RAM, virtually all faults after warm-up are soft faults. The database runs at memory speed with no system call overhead. I have seen LMDB benchmarks where read latency drops to sub-microsecond after the page cache is warm.
When the working set exceeds available RAM, the OS evicts pages from the page cache under memory pressure. Subsequent accesses to evicted pages trigger hard faults. The database has no control over which pages the OS evicts. This is the fundamental tradeoff of mmap: you trade simplicity for control.
```
// What happens inside the kernel on a page fault (pseudocode)
function handle_page_fault(virtual_address, process):
    file_offset = virtual_address - mmap_base_address
    page_number = file_offset / PAGE_SIZE
    if page_cache.contains(file, page_number):
        // Soft fault: page is in memory, just not mapped
        physical_frame = page_cache.get(file, page_number)
        process.page_table.map(virtual_address, physical_frame)
        return // ~1-2 microseconds
    // Hard fault: must read from disk
    physical_frame = allocate_page_frame()
    disk_read(file, page_number * PAGE_SIZE, physical_frame) // blocks
    page_cache.insert(file, page_number, physical_frame)
    process.page_table.map(virtual_address, physical_frame)
    return // 50us (SSD) to 15ms (HDD)
```
Hard page faults on mmap are indistinguishable from normal memory accesses in application code. A pointer dereference that normally takes nanoseconds can suddenly block for milliseconds if the page was evicted. This makes latency unpredictable for latency-sensitive databases, which is why PostgreSQL avoids mmap for data files.
msync, dirty pages, and write durability
When you write to an mmap-ed region (with PROT_WRITE), the kernel marks the page as "dirty" in the page cache. The data is in memory but not yet on disk. The OS will eventually flush dirty pages to disk, but "eventually" is not good enough for databases.
Three options for flushing dirty pages:
| Method | Behavior | Use case |
|---|---|---|
| `msync(addr, len, MS_SYNC)` | Blocks until pages are written to disk | Transaction commit (LMDB) |
| `msync(addr, len, MS_ASYNC)` | Schedules write but returns immediately | Advisory, no durability guarantee |
| OS background flush | Kernel writes dirty pages every 30s (Linux default) | Not suitable for databases |
LMDB calls msync(MS_SYNC) at transaction commit. This is equivalent to fsync() for the mapped region. The cost is the same as fsync(): disk write latency. The advantage over pwrite() + fsync() is that you avoid the pwrite() copy entirely; the data is already in the page cache from your memory write.
MAP_SHARED vs MAP_PRIVATE
The flags argument to mmap() controls whether writes are visible to other processes and whether they propagate to the underlying file.
| Flag | Writes visible to other processes? | Writes propagate to file? | Use case |
|---|---|---|---|
| `MAP_SHARED` | Yes | Yes (via msync/flush) | Databases (LMDB), shared memory IPC |
| `MAP_PRIVATE` | No | No (copy-on-write) | Reading config files, process isolation |
MAP_PRIVATE uses copy-on-write (CoW): the first write to a page triggers a copy of that page into private memory. The original page cache page is untouched. Other processes sharing the same file see the original data.
```
// MAP_PRIVATE copy-on-write behavior (pseudocode)
process_A = mmap(file, MAP_SHARED)   // shared mapping
process_B = mmap(file, MAP_PRIVATE)  // private mapping
process_A.write(page_0, "hello")     // modifies page cache, visible to all
process_B.write(page_0, "world")     // triggers CoW: B gets private copy
// page_0 in file/page cache: "hello" (A's write)
// page_0 in process B's private memory: "world" (B's copy)
```
I find MAP_PRIVATE most useful for reading configuration or data files that the process might modify in memory without affecting the on-disk copy. Databases almost always use MAP_SHARED because they need writes to reach disk.
madvise: telling the kernel your access pattern
The OS readahead heuristic is generic. It does not know whether your database reads pages sequentially (compaction), randomly (point lookups), or will never access certain pages again (one-time scans). madvise() lets you give the kernel explicit hints.
| Hint | Effect | Use case |
|---|---|---|
| `MADV_SEQUENTIAL` | Aggressive readahead, free pages after reading | Full table scans, compaction reads |
| `MADV_RANDOM` | Disable readahead | B-tree traversals, hash index lookups |
| `MADV_WILLNEED` | Pre-fault pages into page cache | Warm-up on startup, preloading hot indexes |
| `MADV_DONTNEED` | Free pages immediately | After finishing a one-time scan, release memory |
| `MADV_HUGEPAGE` | Request transparent huge pages for this region | Large sequential mappings |
```c
// Example: warm up an mmap-ed index file on startup
void* addr = mmap(NULL, index_size, PROT_READ, MAP_SHARED, fd, 0);
madvise(addr, index_size, MADV_WILLNEED); // pre-fault all pages
madvise(addr, index_size, MADV_RANDOM);   // B-tree: random access pattern

// Example: after scanning a data range, release it
madvise(scan_start, scan_length, MADV_DONTNEED); // free pages
```
RocksDB uses MADV_RANDOM on SST file mappings (B-tree lookups are random) and MADV_SEQUENTIAL during compaction (which reads files sequentially). Without these hints, the OS defaults to moderate readahead that wastes I/O for random workloads and under-reads for sequential ones.
MADV_DONTNEED is particularly powerful for controlling memory usage. After a background compaction reads through a large mmap-ed file, calling MADV_DONTNEED tells the kernel to free those pages immediately rather than keeping them in the page cache (where they would evict hotter pages).
When mmap beats read/write (and when it does not)
This is the most practical question for system design: when should you actually use mmap?
mmap wins:
- Read-heavy, random access, warm cache. After the page cache is warm, mmap reads are pointer dereferences (nanoseconds) vs `pread()` system calls (hundreds of nanoseconds). For a B-tree traversal doing 4 random reads per query at 50K QPS, that is 200K fewer syscalls per second.
- Multiple processes reading the same file. With `MAP_SHARED`, all processes share the same physical pages in the page cache. No per-process buffer duplication.
- Simple code. No buffer pool, no eviction policy, no page lifecycle management. The OS handles it. LMDB's entire storage engine is dramatically simpler than PostgreSQL's because of this.
read/write wins:
- Write-heavy workloads. Dirty page tracking in the OS is coarse-grained. You cannot control which pages flush first. For WAL-based recovery, you need explicit write ordering that mmap cannot provide.
- Working set larger than RAM. Hard faults are unpredictable. A database with its own buffer pool can implement priority eviction (keep hot index pages, evict cold data pages). The OS LRU treats all pages equally.
- Latency-sensitive reads. A hard fault on mmap blocks the calling thread with no way to set a timeout. With `pread()`, you can use async I/O or a thread pool with timeouts.
- 32-bit systems. mmap consumes virtual address space. A 32-bit process has ~3GB of virtual address space, limiting the maximum file size you can map.
Interview tip: PostgreSQL's choice
When asked "why doesn't PostgreSQL use mmap," the answer is: crash recovery requires precise control over write ordering. PostgreSQL must write WAL records before data pages. With mmap, the OS can flush a dirty data page before the corresponding WAL record is on disk, breaking crash recovery guarantees. This is the classic argument against mmap in write-heavy ACID databases.
TLB pressure and huge pages
mmap performance depends heavily on the TLB (Translation Lookaside Buffer), a CPU cache that stores recent virtual-to-physical page mappings. TLB misses are expensive: the CPU must walk the page table in memory.
```
Default page size: 4KB
  1GB file = 262,144 pages = 262,144 TLB entries needed
  Typical TLB: 512-2048 entries → constant misses

Huge pages: 2MB
  1GB file = 512 pages = 512 TLB entries needed
  Fits in TLB → minimal misses
```
For databases mapping multi-gigabyte files, TLB misses dominate performance when using 4KB pages. Huge pages (2MB on x86) reduce this by 512x.
```sh
# Linux: enable transparent huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# Or allocate explicit huge pages (1024 x 2MB = 2GB)
echo 1024 > /proc/sys/vm/nr_hugepages
```
LMDB, MongoDB, and Redis all document huge page configuration as a performance tuning step. On Linux, transparent huge pages (THP) can cause latency spikes during page compaction, so some databases (Redis, MongoDB) recommend disabling THP and using explicit huge pages instead.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| LMDB | Maps entire database file with MAP_SHARED. No application-level buffer pool. | Read transactions are lock-free (they read from a consistent snapshot of mapped pages). Write transactions use copy-on-write B-trees and msync at commit. |
| SQLite | Optional mmap for reads via PRAGMA mmap_size. | Falls back to pread for writes. Recommends disabling mmap for write-heavy workloads due to I/O ordering issues with fsync. |
| MongoDB (legacy MMAP engine) | Mapped entire collection files. Replaced by WiredTiger in 3.2+. | The MMAP engine had no document-level locking and poor write performance, which drove the switch to WiredTiger's own buffer pool. |
| RocksDB | Uses mmap for reading SST files (configurable via allow_mmap_reads). | Does not use mmap for writes. Uses pwrite + fdatasync for WAL and compaction output. |
| Kafka | mmap-s its offset and time index files; log segment data goes to consumers via transferTo (which uses sendfile, not mmap). | Producers write via pwrite. The OS page cache acts as Kafka's read cache, which is why Kafka recommends not running with a JVM heap larger than necessary. |
Limitations and when NOT to use it
- No control over eviction. The OS decides which pages to evict under memory pressure. A database cannot say "keep index pages, evict data pages." This leads to unpredictable performance when the working set exceeds RAM.
- Write ordering is not guaranteed. The OS can flush dirty pages in any order. Databases that need write-ahead logging (WAL) semantics cannot use mmap for data pages because a data page might flush before the corresponding WAL record.
- Hard faults are synchronous and uninterruptible. A thread that hits a hard fault blocks until disk I/O completes. There is no timeout, no cancellation, no async alternative. This makes tail latency unpredictable.
- SIGBUS on truncated files. If another process truncates the file while it is mapped, accessing the truncated region delivers SIGBUS (not a segfault). This crashes the process unless you install a signal handler, and handling SIGBUS correctly is difficult.
- Virtual address space limits on 32-bit. A 32-bit process has ~3GB of virtual address space. You cannot mmap a 10GB file. Even on 64-bit, mapping extremely large files (hundreds of GB) consumes page table memory.
- TLB pressure with small pages. Mapping large files with 4KB pages generates heavy TLB miss traffic. Huge pages help but introduce their own complexity (compaction latency, allocation failures).
Interview cheat sheet
- When asked "how does mmap work," say: it creates a virtual memory mapping to a file. The first access per page triggers a page fault that loads data from disk into the page cache. Subsequent accesses are direct memory reads with no system calls.
- When asked about zero-copy, explain that mmap eliminates the `pread` copy from kernel buffer to user buffer. The process accesses the page cache pages directly through its virtual address space.
- When comparing mmap to `read`/`write`, state: mmap wins for read-heavy random access with warm caches (no syscall overhead). `read`/`write` wins for write-heavy workloads needing explicit flush ordering.
- When asked "why doesn't PostgreSQL use mmap," explain write ordering: PostgreSQL must write WAL before data pages. mmap dirty page flush is OS-controlled and can violate this ordering, breaking crash recovery.
- When asked about LMDB, explain: LMDB maps the entire database, has no buffer pool, and uses the OS page cache directly. Reads are lock-free via consistent snapshots of mapped pages. This is why LMDB is extremely simple and fast for read-heavy workloads.
- When discussing page faults, distinguish soft faults (page in cache, ~1us, just a page table update) from hard faults (not in cache, 50us-15ms, requires disk I/O). Performance depends entirely on this ratio.
- When asked about MAP_SHARED vs MAP_PRIVATE, explain: SHARED propagates writes to the file and is visible to other processes. PRIVATE uses copy-on-write and is isolated. Databases use SHARED; config file readers use PRIVATE.
- When discussing TLB pressure, explain: mapping a 1GB file with 4KB pages needs 262K TLB entries (TLB has ~1K entries). Huge pages (2MB) reduce this to 512 entries. This is why databases recommend huge page configuration for mmap workloads.
Quick recap
- `mmap()` maps a file into virtual memory. The first access per page triggers a page fault that loads from disk; subsequent reads are direct memory accesses with no system calls, eliminating the `pread()` copy overhead.
- Soft faults (page in cache, ~1us) are fast. Hard faults (page not in cache, 50us-15ms) require disk I/O. Performance depends entirely on the soft/hard fault ratio, which depends on working set size vs available RAM.
- `MAP_SHARED` lets multiple processes share the same physical pages and propagates writes to the file. `MAP_PRIVATE` uses copy-on-write for process isolation.
- `msync(MS_SYNC)` flushes dirty pages to disk, but the OS can flush other dirty pages at any time. This lack of write ordering control is why PostgreSQL avoids mmap for data pages.
- LMDB maps the entire database, uses the OS page cache as its buffer pool, and achieves extreme simplicity and read performance. The tradeoff: no control over page eviction when the working set exceeds RAM.
- Huge pages (2MB) reduce TLB pressure by 512x compared to 4KB pages. For large mmap workloads, use explicit huge pages rather than transparent huge pages (THP), which can cause compaction latency spikes.
Related concepts
- Zero-copy I/O - mmap is one of two main zero-copy mechanisms (the other is `sendfile`). Understanding when each applies is essential for high-throughput I/O design.
- Write-ahead log - WAL depends on precise write ordering that mmap cannot guarantee. This interaction is the core reason write-heavy databases avoid mmap.
- Databases - Storage engine design (buffer pool vs mmap) is a foundational database architecture decision. Understanding mmap explains why LMDB and PostgreSQL make opposite choices.
- LSM trees - RocksDB uses mmap for reading SST files, combining LSM tree structure with mmap's zero-copy reads for the read path.