Memory-mapped files
How mmap maps file contents directly into virtual memory, enabling zero-copy file access. How databases use it, the OS page cache interaction, and when mmap is faster (and slower) than read/write system calls.
The problem
Your database stores 50GB of B-tree index data on disk. Every query traverses 3-4 tree levels, each requiring a pread() system call. Each pread() copies 4KB from the kernel page cache into a user-space buffer: context switch to kernel, memcpy, context switch back. At 50,000 queries per second with 4 reads each, that is 200,000 system calls per second, each copying data that your code reads once and discards.
The data is already in the kernel page cache (your working set fits in RAM). The OS has the bytes sitting in memory. But your application cannot access them without copying them across the kernel/user-space boundary. You are paying for 200,000 copies per second of data that already exists in RAM.
It gets worse with writes. Your application writes to a user-space buffer, then calls pwrite() to copy that buffer into the kernel page cache, then calls fsync() to flush to disk. Three steps, two copies, and the buffer management code (allocation, eviction, write-back) is thousands of lines of complexity in your application.
This is the problem memory-mapped files solve: eliminate the copy between kernel page cache and user space by letting your process access the page cache directly.
What it is
mmap() is a system call that maps a file (or a portion of a file) into the calling process's virtual address space. After the mapping, the process reads and writes the file using ordinary memory access instructions, with no read() or write() system calls.
Analogy: Think of a library with a special reading room. The traditional approach: you request a book, a librarian fetches it from the shelf, photocopies the pages you need, and hands you the copies. With mmap, the librarian gives you a key to the shelf. You walk to the shelf and read the book directly. No photocopying, no waiting for the librarian. If someone else has a key, they read the same physical book.
```c
// Traditional file I/O: copy through the kernel
int fd = open("data.db", O_RDONLY);
char buffer[4096];
pread(fd, buffer, 4096, 0);            // syscall: kernel copies to user buffer
int value = *(int*)(buffer + 16);      // read from user buffer
```

```c
// mmap: access page cache directly
int fd = open("data.db", O_RDONLY);
void* addr = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
int value = *(int*)((char*)addr + 16); // direct memory read, no syscall
```
Interview tip: mmap as zero-copy
When an interviewer asks about zero-copy I/O, mmap is one of the two main mechanisms (the other is sendfile). The key phrase: "mmap lets the process access the kernel page cache directly, eliminating the user-space buffer copy."
How it works
The virtual memory mapping
When you call mmap(), the kernel does not read any data from disk. It creates a mapping in the process's page table: virtual addresses in the process point to "not yet loaded" entries. The actual data loads on demand via page faults.
Step by step:
1. `mmap()` creates virtual address mappings. No disk I/O happens yet.
2. The process dereferences a pointer into the mapped region (e.g., `*(addr + 16)`).
3. The MMU checks the page table. If the page is not resident (no physical mapping), it triggers a page fault.
4. The kernel's page fault handler checks if the page is in the page cache. If yes (soft fault), it maps the physical page and returns. If no (hard fault), it reads 4KB from disk into the page cache, maps it, then returns.
5. The process's memory access completes. From the process's perspective, it was a normal pointer dereference.
Traditional read() vs mmap path
The critical difference: with pread(), every read is a system call that copies data. With mmap, only the first access to each page involves the kernel. After the page is mapped, reads are pure CPU memory instructions with no kernel involvement.
Page fault handling: soft vs hard faults
Not all page faults are expensive. The cost depends entirely on whether the data is already in the page cache.
| Fault type | Cause | Cost | Frequency |
|---|---|---|---|
| Soft fault (minor) | Page is in page cache but not mapped into this process | ~1-2 microseconds (page table update only) | Common after warm-up |
| Hard fault (major) | Page is not in page cache, must read from disk | 50-200 microseconds (SSD) or 5-15ms (HDD) | Cold start, working set exceeds RAM |
For a database with a working set that fits in RAM, virtually all faults after warm-up are soft faults. The database runs at memory speed with no system call overhead. I have seen LMDB benchmarks where read latency drops to sub-microsecond after the page cache is warm.
When the working set exceeds available RAM, the OS evicts pages from the page cache under memory pressure. Subsequent accesses to evicted pages trigger hard faults. The database has no control over which pages the OS evicts. This is the fundamental tradeoff of mmap: you trade simplicity for control.
```
// What happens inside the kernel on a page fault (pseudocode)
function handle_page_fault(virtual_address, process):
    file_offset = virtual_address - mmap_base_address
    page_number = file_offset / PAGE_SIZE
    if page_cache.contains(file, page_number):
        // Soft fault: page is in memory, just not mapped
        physical_frame = page_cache.get(file, page_number)
        process.page_table.map(virtual_address, physical_frame)
        return // ~1-2 microseconds
    // Hard fault: must read from disk
    physical_frame = allocate_page_frame()
    disk_read(file, page_number * PAGE_SIZE, physical_frame) // blocks
    page_cache.insert(file, page_number, physical_frame)
    process.page_table.map(virtual_address, physical_frame)
    return // 50us (SSD) to 15ms (HDD)
```
Hard page faults on mmap are indistinguishable from normal memory accesses in application code. A pointer dereference that normally takes nanoseconds can suddenly block for milliseconds if the page was evicted. This makes latency unpredictable for latency-sensitive databases, which is why PostgreSQL avoids mmap for data files.
msync, dirty pages, and write durability
When you write to an mmap-ed region (with PROT_WRITE), the kernel marks the page as "dirty" in the page cache. The data is in memory but not yet on disk. The OS will eventually flush dirty pages to disk, but "eventually" is not good enough for databases.
Three options for flushing dirty pages:
| Method | Behavior | Use case |
|---|---|---|
| `msync(addr, len, MS_SYNC)` | Blocks until pages are written to disk | Transaction commit (LMDB) |
| `msync(addr, len, MS_ASYNC)` | Schedules write but returns immediately | Advisory, no durability guarantee |
| OS background flush | Kernel writes dirty pages every 30s (Linux default) | Not suitable for databases |
LMDB calls msync(MS_SYNC) at transaction commit. This is equivalent to fsync() for the mapped region. The cost is the same as fsync(): disk write latency. The advantage over pwrite() + fsync() is that you avoid the pwrite() copy entirely; the data is already in the page cache from your memory write.
MAP_SHARED vs MAP_PRIVATE
The flags argument to mmap() controls whether writes are visible to other processes and whether they propagate to the underlying file.
| Flag | Writes visible to other processes? | Writes propagate to file? | Use case |
|---|---|---|---|
| `MAP_SHARED` | Yes | Yes (via msync/flush) | Databases (LMDB), shared memory IPC |
| `MAP_PRIVATE` | No | No (copy-on-write) | Reading config files, process isolation |
MAP_PRIVATE uses copy-on-write (CoW): the first write to a page triggers a copy of that page into private memory. The original page cache page is untouched. Other processes sharing the same file see the original data.
```
// MAP_PRIVATE copy-on-write behavior (pseudocode)
process_A = mmap(file, MAP_SHARED)   // shared mapping
process_B = mmap(file, MAP_PRIVATE)  // private mapping
process_A.write(page_0, "hello")     // modifies page cache, visible to all
process_B.write(page_0, "world")     // triggers CoW: B gets private copy
// page_0 in file/page cache: "hello" (A's write)
// page_0 in process B's private memory: "world" (B's copy)
```
I find MAP_PRIVATE most useful for reading configuration or data files that the process might modify in memory without affecting the on-disk copy. Databases almost always use MAP_SHARED because they need writes to reach disk.
madvise: telling the kernel your access pattern
The OS readahead heuristic is generic. It does not know whether your database reads pages sequentially (compaction), randomly (point lookups), or will never access certain pages again (one-time scans). madvise() lets you give the kernel explicit hints.
| Hint | Effect | Use case |
|---|---|---|
| `MADV_SEQUENTIAL` | Aggressive readahead, free pages after reading | Full table scans, compaction reads |
| `MADV_RANDOM` | Disable readahead | B-tree traversals, hash index lookups |
| `MADV_WILLNEED` | Pre-fault pages into page cache | Warm-up on startup, preloading hot indexes |
| `MADV_DONTNEED` | Free pages immediately | After finishing a one-time scan, release memory |
| `MADV_HUGEPAGE` | Request transparent huge pages for this region | Large sequential mappings |
```c
// Example: warm up an mmap-ed index file on startup
void* addr = mmap(NULL, index_size, PROT_READ, MAP_SHARED, fd, 0);
madvise(addr, index_size, MADV_WILLNEED); // pre-fault all pages
madvise(addr, index_size, MADV_RANDOM);   // B-tree: random access pattern

// Example: after scanning a data range, release it
madvise(scan_start, scan_length, MADV_DONTNEED); // free pages
```
RocksDB uses MADV_RANDOM on SST file mappings (B-tree lookups are random) and MADV_SEQUENTIAL during compaction (which reads files sequentially). Without these hints, the OS defaults to moderate readahead that wastes I/O for random workloads and under-reads for sequential ones.
MADV_DONTNEED is particularly powerful for controlling memory usage. After a background compaction reads through a large mmap-ed file, calling MADV_DONTNEED tells the kernel to free those pages immediately rather than keeping them in the page cache (where they would evict hotter pages).
When mmap beats read/write (and when it does not)
This is the most practical question for system design: when should you actually use mmap?
mmap wins:
- Read-heavy, random access, warm cache. After the page cache is warm, mmap reads are pointer dereferences (nanoseconds) vs `pread()` system calls (hundreds of nanoseconds). For a B-tree traversal doing 4 random reads per query at 50K QPS, that is 200K fewer syscalls per second.
- Multiple processes reading the same file. With `MAP_SHARED`, all processes share the same physical pages in the page cache. No per-process buffer duplication.
- Simple code. No buffer pool, no eviction policy, no page lifecycle management. The OS handles it. LMDB's entire storage engine is dramatically simpler than PostgreSQL's because of this.
read/write wins:
- Write-heavy workloads. Dirty page tracking in the OS is coarse-grained. You cannot control which pages flush first. For WAL-based recovery, you need explicit write ordering that mmap cannot provide.
- Working set larger than RAM. Hard faults are unpredictable. A database with its own buffer pool can implement priority eviction (keep hot index pages, evict cold data pages). The OS LRU treats all pages equally.
- Latency-sensitive reads. A hard fault on mmap blocks the calling thread with no way to set a timeout. With `pread()`, you can use async I/O or a thread pool with timeouts.
- 32-bit systems. mmap consumes virtual address space. A 32-bit process has ~3GB of virtual address space, limiting the maximum file size you can map.
Interview tip: PostgreSQL's choice
When asked "why doesn't PostgreSQL use mmap," the answer is: crash recovery requires precise control over write ordering. PostgreSQL must write WAL records before data pages. With mmap, the OS can flush a dirty data page before the corresponding WAL record is on disk, breaking crash recovery guarantees. This is the classic argument against mmap in write-heavy ACID databases.
TLB pressure and huge pages
mmap performance depends heavily on the TLB (Translation Lookaside Buffer), a CPU cache that stores recent virtual-to-physical page mappings. TLB misses are expensive: the CPU must walk the page table in memory.
```
Default page size: 4KB
  1GB file = 262,144 pages = 262,144 TLB entries needed
  Typical TLB: 512-2048 entries → constant misses

Huge pages: 2MB
  1GB file = 512 pages = 512 TLB entries needed
  Fits in TLB → minimal misses
```
For databases mapping multi-gigabyte files, TLB misses dominate performance when using 4KB pages. Huge pages (2MB on x86) reduce this by 512x.
```sh
# Linux: enable transparent huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# Or allocate explicit huge pages (1024 x 2MB = 2GB)
echo 1024 > /proc/sys/vm/nr_hugepages
```
LMDB, MongoDB, and Redis all document huge page configuration as a performance tuning step. On Linux, transparent huge pages (THP) can cause latency spikes during page compaction, so some databases (Redis, MongoDB) recommend disabling THP and using explicit huge pages instead.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| LMDB | Maps entire database file with MAP_SHARED. No application-level buffer pool. | Read transactions are lock-free (they read from a consistent snapshot of mapped pages). Write transactions use copy-on-write B-trees and msync at commit. |
| SQLite | Optional mmap for reads via PRAGMA mmap_size. | Falls back to pread for writes. Recommends disabling mmap for write-heavy workloads due to I/O ordering issues with fsync. |
| MongoDB (legacy MMAP engine) | Mapped entire collection files. Replaced by WiredTiger in 3.2+. | The MMAP engine had no document-level locking and poor write performance, which drove the switch to WiredTiger's own buffer pool. |
| RocksDB | Uses mmap for reading SST files (configurable via allow_mmap_reads). | Does not use mmap for writes. Uses pwrite + fdatasync for WAL and compaction output. |
| Kafka | mmap-s its offset and time index files; log segment data goes to consumers via transferTo (which uses sendfile, not mmap). | Producers write via pwrite. The OS page cache acts as Kafka's read cache, which is why Kafka recommends not running with a JVM heap larger than necessary. |
Limitations and when NOT to use it
- No control over eviction. The OS decides which pages to evict under memory pressure. A database cannot say "keep index pages, evict data pages." This leads to unpredictable performance when the working set exceeds RAM.
- Write ordering is not guaranteed. The OS can flush dirty pages in any order. Databases that need write-ahead logging (WAL) semantics cannot use mmap for data pages because a data page might flush before the corresponding WAL record.
- Hard faults are synchronous and uninterruptible. A thread that hits a hard fault blocks until disk I/O completes. There is no timeout, no cancellation, no async alternative. This makes tail latency unpredictable.
- SIGBUS on truncated files. If another process truncates the file while it is mapped, accessing the truncated region delivers SIGBUS (not a segfault). This crashes the process unless you install a signal handler, and handling SIGBUS correctly is difficult.
- Virtual address space limits on 32-bit. A 32-bit process has ~3GB of virtual address space. You cannot mmap a 10GB file. Even on 64-bit, mapping extremely large files (hundreds of GB) consumes page table memory.
- TLB pressure with small pages. Mapping large files with 4KB pages generates heavy TLB miss traffic. Huge pages help but introduce their own complexity (compaction latency, allocation failures).
Interview cheat sheet
- When asked "how does mmap work," say: it creates a virtual memory mapping to a file. The first access per page triggers a page fault that loads data from disk into the page cache. Subsequent accesses are direct memory reads with no system calls.
- When asked about zero-copy, explain that mmap eliminates the `pread` copy from kernel buffer to user buffer. The process accesses the page cache pages directly through its virtual address space.
- When comparing mmap to `read`/`write`, state: mmap wins for read-heavy random access with warm caches (no syscall overhead). `read`/`write` wins for write-heavy workloads needing explicit flush ordering.
- When asked "why doesn't PostgreSQL use mmap," explain write ordering: PostgreSQL must write WAL before data pages. mmap dirty page flush is OS-controlled and can violate this ordering, breaking crash recovery.
- When asked about LMDB, explain: LMDB maps the entire database, has no buffer pool, and uses the OS page cache directly. Reads are lock-free via consistent snapshots of mapped pages. This is why LMDB is extremely simple and fast for read-heavy workloads.
- When discussing page faults, distinguish soft faults (page in cache, ~1us, just a page table update) from hard faults (not in cache, 50us-15ms, requires disk I/O). Performance depends entirely on this ratio.
- When asked about MAP_SHARED vs MAP_PRIVATE, explain: SHARED propagates writes to the file and is visible to other processes. PRIVATE uses copy-on-write and is isolated. Databases use SHARED; config file readers use PRIVATE.
- When discussing TLB pressure, explain: mapping a 1GB file with 4KB pages needs 262K TLB entries (TLB has ~1K entries). Huge pages (2MB) reduce this to 512 entries. This is why databases recommend huge page configuration for mmap workloads.
Quick recap
- `mmap()` maps a file into virtual memory. The first access per page triggers a page fault that loads from disk; subsequent reads are direct memory accesses with no system calls, eliminating the `pread()` copy overhead.
- Soft faults (page in cache, ~1us) are fast. Hard faults (page not in cache, 50us-15ms) require disk I/O. Performance depends entirely on the soft/hard fault ratio, which depends on working set size vs available RAM.
- `MAP_SHARED` lets multiple processes share the same physical pages and propagates writes to the file. `MAP_PRIVATE` uses copy-on-write for process isolation.
- `msync(MS_SYNC)` flushes dirty pages to disk, but the OS can flush other dirty pages at any time. This lack of write ordering control is why PostgreSQL avoids mmap for data pages.
- LMDB maps the entire database, uses the OS page cache as its buffer pool, and achieves extreme simplicity and read performance. The tradeoff: no control over page eviction when the working set exceeds RAM.
- Huge pages (2MB) reduce TLB pressure by 512x compared to 4KB pages. For large mmap workloads, use explicit huge pages rather than transparent huge pages (THP), which can cause compaction latency spikes.
Related concepts
- Zero-copy I/O - mmap is one of two main zero-copy mechanisms (the other is `sendfile`). Understanding when each applies is essential for high-throughput I/O design.
- Write-ahead log - WAL depends on precise write ordering that mmap cannot guarantee. This interaction is the core reason write-heavy databases avoid mmap.
- Databases - Storage engine design (buffer pool vs mmap) is a foundational database architecture decision. Understanding mmap explains why LMDB and PostgreSQL make opposite choices.
- LSM trees - RocksDB uses mmap for reading SST files, combining LSM tree structure with mmap's zero-copy reads for the read path.