Copy-on-write (COW)
Learn how copy-on-write defers expensive data copies until the moment a write actually happens, how Redis uses it for non-blocking snapshotting, and where COW produces surprising memory spikes.
The problem
Your Redis instance uses 8 GB of memory. At 2 AM, BGSAVE starts automatically to create an RDB snapshot. Thirty seconds later, your monitoring fires: RSS memory has climbed to 14 GB. The host has 16 GB total. One more spike like this and the OOM killer will fire, taking Redis down.
Nothing changed in the dataset size. Write traffic is heavier than usual tonight. That is the variable your team did not account for.
If you do not understand copy-on-write, you will size Redis based on your dataset alone, get surprised by this memory spike on every snapshot, and spend hours chasing what looks like a memory leak but is not.
What copy-on-write is
Copy-on-write (COW) is an OS-level optimization that defers copying a memory region until one of the parties sharing it actually writes to it. Before any modification, two processes share the same physical pages read-only. The physical copy happens lazily, one 4 KB page at a time, exactly as writes occur.
Think of a shared recipe binder in a restaurant kitchen. Two chefs use the same binder every night. When Chef A wants to annotate a recipe, she tears out that one page, photocopies it, and writes on her copy alone. Chef B's original is untouched. Every other page remains shared. Nobody copies the entire binder just because one chef wants to scribble on page 12.
How copy-on-write works
When Redis runs BGSAVE, it calls fork() to spawn a child process that serializes all data to disk. The Linux kernel does not copy all of Redis's memory at that moment. Instead, it marks every page in both the parent and child as read-only and shared.
Step by step, walking through one write during an active BGSAVE:
- Redis calls fork(). The child process inherits the parent's virtual memory map. All pages are marked read-only in both page tables. No physical copy happens yet.
- The child reads every key and serializes it to the RDB file. Reads do not trigger any copy.
- A client sends SET user:99 "active" to the parent. The parent tries to write to the dictEntry for that key.
- The CPU sees the page is marked read-only and raises a page fault, trapping into the kernel.
- The kernel's COW handler allocates a new 4 KB physical page, copies the old page contents into it, and updates the parent's page table to point to the new private page marked writable.
- The original page remains in the child's page table, untouched.
- The parent's write completes on the private copy. The child never sees it.
// Pseudocode: kernel page fault handler for a COW page
on_write_fault(process, virtual_address):
    page = page_table_lookup(process, virtual_address)
    if page.is_cow_shared:
        new_page = alloc_physical_page()
        copy_contents(src=page, dst=new_page)   // 4 KB memcpy
        page_table_update(process, virtual_address, new_page)
        new_page.writable = true
        page.ref_count -= 1                     // the other process still maps the original
    resume_write(process, virtual_address)
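The isolation described in the steps above can be observed from user space. The following is a minimal Python sketch, assuming a Unix-like system where os.fork() is available; the function name and the sample values are illustrative and not part of Redis. It shows only the visible effect (the child keeps its pre-write view of memory), not the kernel's page-level bookkeeping.

```python
import os


def snapshot_view_after_parent_write() -> tuple[bytes, bytes]:
    """Fork, modify data in the parent, and return (parent_view, child_view)."""
    data = bytearray(b"status=idle")
    go_r, go_w = os.pipe()    # parent -> child: "my write is done"
    out_r, out_w = os.pipe()  # child -> parent: the child's view of data

    pid = os.fork()
    if pid == 0:
        # Child: plays the role of the snapshotting process.
        try:
            os.close(go_w)
            os.close(out_r)
            os.read(go_r, 1)              # block until the parent has written
            os.write(out_w, bytes(data))  # report what the child still sees
        finally:
            os._exit(0)                   # never fall through into parent code

    # Parent: plays the role of the Redis main process.
    os.close(go_r)
    os.close(out_w)
    data[:] = b"status=busy"              # this write lands on a private copy
    os.write(go_w, b"x")
    child_view = os.read(out_r, 1024)
    os.waitpid(pid, 0)
    os.close(go_w)
    os.close(out_r)
    return bytes(data), child_view


parent_view, child_view = snapshot_view_after_parent_write()
```

The parent ends up with the new value while the child still reports the value from the moment of the fork, which is the property BGSAVE relies on for a consistent snapshot.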
[Diagram: the flow from fork() to private page allocation, step by step.]
The total extra memory is the number of distinct pages written during BGSAVE multiplied by 4 KB. A read-heavy workload during BGSAVE uses almost no extra memory. A write-heavy workload can dirty nearly every page, pushing total memory use toward 2x the dataset size.
Page granularity and memory amplification
COW copies entire pages, not individual keys. A single SET command that writes 8 bytes to a key sharing a 4 KB page with 50 other keys causes all 4 KB to be copied, even though only 8 bytes changed.
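To make the page-granularity effect concrete, here is a small sketch that counts how many bytes COW would copy for a set of write offsets. The function name and the offsets are hypothetical; the model assumes 4 KB pages and that each write falls within a single page.

```python
PAGE_SIZE = 4096  # bytes; the 4 KB page size used throughout this article


def cow_bytes_copied(write_offsets, page_size=PAGE_SIZE):
    """Bytes copied by COW if each offset is written at least once during a snapshot."""
    # A page is copied on its first write only, so count distinct pages.
    distinct_pages = {offset // page_size for offset in write_offsets}
    return len(distinct_pages) * page_size


# One 8-byte write still copies a whole page.
single = cow_bytes_copied([1024])
amplification = single // 8

# Three writes to the same page copy it once; writes on three pages copy three.
clustered = cow_bytes_copied([0, 8, 16])
scattered = cow_bytes_copied([0, 4096, 8192])
```

Under this model, what matters is how widely writes are scattered across memory rather than how many bytes they change.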
| Write intensity during BGSAVE | Pages copied | Extra memory on 8 GB dataset |
|---|---|---|
| Near zero (read-heavy) | Less than 1% | Less than 100 MB |
| Moderate (balanced workload) | 10-30% | 800 MB to 2.5 GB |
| High (write-heavy) | 50-100% | 4 GB to 8 GB |
The correct formula for Redis host sizing is: host_RAM >= maxmemory + COW_overhead. For write-heavy workloads, budget at least 1.5x the dataset size for the host. I always tell teams to treat the COW buffer as non-negotiable, not optional headroom.
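The sizing rule can be turned into a quick calculation. This sketch assumes the COW overhead is estimated as maxmemory times the fraction of pages written during a snapshot; min_host_ram and dirty_fraction are illustrative names, and the result ignores memory needed by the OS and other processes.

```python
GIB = 1024 ** 3


def min_host_ram(maxmemory_bytes: int, dirty_fraction: float) -> int:
    """host_RAM >= maxmemory + COW_overhead, with overhead as a fraction of maxmemory."""
    if not 0.0 <= dirty_fraction <= 1.0:
        raise ValueError("dirty_fraction must be between 0 and 1")
    cow_overhead = int(maxmemory_bytes * dirty_fraction)
    return maxmemory_bytes + cow_overhead


balanced = min_host_ram(8 * GIB, 0.25)    # 10 GiB
write_heavy = min_host_ram(8 * GIB, 0.5)  # 12 GiB, the 1.5x budget
worst_case = min_host_ram(8 * GIB, 1.0)   # 16 GiB, every page copied
```

In the opening scenario, a climb from 8 GB to 14 GB corresponds to a dirty fraction of 0.75, which already exceeds the 1.5x budget.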
fork() latency and the blocked event loop
fork() itself is not instant, even with COW. The kernel does not copy physical pages at fork time, but it must copy the page table and mark every entry as read-only. For a 32 GB Redis instance using 4 KB pages, that is roughly 8 million page table entries at 8 bytes each, giving a ~64 MB page table to copy and mark. The Redis event loop is fully blocked during this work.
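The page table arithmetic above can be reproduced with a few lines. This is a rough sketch that counts only last-level page table entries and assumes 8-byte entries; upper levels of the page table add a small fraction on top. The function name is illustrative.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def leaf_page_table_bytes(mapped_bytes: int, page_size: int = 4096, pte_size: int = 8) -> int:
    """Approximate size of the last-level page table entries for a mapping."""
    entries = -(-mapped_bytes // page_size)  # ceiling division
    return entries * pte_size


entries_32g = (32 * GIB) // 4096
table_32g = leaf_page_table_bytes(32 * GIB)
```

For 32 GiB this gives 8,388,608 entries and a 64 MiB table, matching the figures in the paragraph.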
| Redis dataset size | Typical fork() latency |
|---|---|
| 1 GB | ~5 ms |
| 8 GB | ~30-50 ms |
| 32 GB | ~80-150 ms |
| 64 GB | ~200-400 ms |
On a 32 GB instance, no commands are processed for those 80-150 ms, and clients waiting for responses see elevated latency. One mitigation is Redis Cluster: shard data across many smaller instances so that no single fork pays a large cost.
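Redis reports the duration of its most recent fork in the latest_fork_usec field of INFO stats, so the latencies in the table can be checked against a real instance. Below is a sketch that parses INFO-style text; the sample text and the 50 ms threshold are made up for illustration, and in practice the text would come from redis-cli INFO stats or a client library.

```python
def parse_info(info_text: str) -> dict:
    """Parse Redis INFO output ("key:value" lines, "# Section" headers) into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def fork_latency_ms(info_text: str) -> float:
    """Duration of the most recent fork, in milliseconds."""
    return int(parse_info(info_text)["latest_fork_usec"]) / 1000.0


# Made-up sample in the shape of INFO stats output.
SAMPLE_INFO = "# Stats\r\ntotal_connections_received:12\r\nlatest_fork_usec:42000\r\n"

latency = fork_latency_ms(SAMPLE_INFO)
too_slow = latency > 50.0  # hypothetical alerting threshold
```

Tracking this value over time shows how fork cost grows with the dataset, before it becomes visible as client latency.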
Transparent Huge Pages and COW amplification
Linux Transparent Huge Pages (THP) consolidates 4 KB pages into 2 MB huge pages to reduce TLB pressure. COW operates at page granularity, whatever the page size is. With THP enabled, a write that would normally copy 4 KB copies 2 MB instead. That is 512x amplification per COW event.
The Redis documentation is explicit: disable THP on Redis hosts.
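On Linux, the current THP mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled, with the active value in square brackets (for example "always madvise [never]"). A minimal sketch for checking it follows; the helper names are illustrative, and the file is absent on systems without THP support.

```python
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the active THP mode, which the kernel marks with square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"unrecognized THP setting: {text!r}")
    return match.group(1)


def current_thp_mode(path: Path = THP_PATH):
    """Active THP mode on this host, or None if the kernel does not expose it."""
    try:
        return parse_thp_mode(path.read_text())
    except FileNotFoundError:
        return None


# Per-fault copy size with and without huge pages.
HUGE_PAGE = 2 * 1024 * 1024
BASE_PAGE = 4 * 1024
amplification = HUGE_PAGE // BASE_PAGE
```

A result of "always" from current_thp_mode() on a Redis host is the case to fix. Writing never to the same file as root disables THP until the next reboot.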