Copy-on-write (COW)
Learn how copy-on-write defers expensive data copies until the moment a write actually happens, how Redis uses it for non-blocking snapshotting, and where COW produces surprising memory spikes.
The problem
Your Redis instance uses 8 GB of memory. At 2 AM, BGSAVE starts automatically to create an RDB snapshot. Thirty seconds later, your monitoring fires: RSS memory has climbed to 14 GB. The host has 16 GB total. One more spike like this and the OOM killer will fire, taking Redis down.
The dataset size has not changed. Write traffic is simply heavier than usual tonight, and that is the variable your team did not account for.
If you do not understand copy-on-write, you will size Redis based on your dataset alone, get surprised by this memory spike on every snapshot, and spend hours chasing what looks like a memory leak but is not.
What copy-on-write is
Copy-on-write (COW) is an OS-level optimization that defers copying a memory region until one of the parties sharing it actually writes to it. Before any modification, two processes share the same physical pages read-only. The physical copy happens lazily, one 4 KB page at a time, exactly as writes occur.
Think of a shared recipe binder in a restaurant kitchen. Two chefs use the same binder every night. When Chef A wants to annotate a recipe, she tears out that one page, photocopies it, and writes on her copy alone. Chef B's original is untouched. Every other page remains shared. Nobody copies the entire binder just because one chef wants to scribble on page 12.
How copy-on-write works
When Redis runs BGSAVE, it calls fork() to spawn a child process that serializes all data to disk. The Linux kernel does not copy all of Redis's memory at that moment. Instead, it marks every page in both the parent and child as read-only and shared.
Step by step, walking through one write during an active BGSAVE:
- Redis calls fork(). The child process inherits the parent's virtual memory map. All pages are marked read-only in both page tables. No physical copy happens yet.
- The child reads every key and serializes it to the RDB file. Reads do not trigger any copy.
- A client sends SET user:99 "active" to the parent. The parent tries to write to the dictEntry for that key.
- The CPU sees the page is marked read-only and raises a page fault, trapping into the kernel.
- The kernel's COW handler allocates a new 4 KB physical page, copies the old page contents into it, and updates the parent's page table to point to the new private page marked writable.
- The original page remains in the child's page table, untouched.
- The parent's write completes on the private copy. The child never sees it.
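The snapshot isolation these steps produce can be observed directly from user space. Below is a minimal Python sketch (Linux/macOS only, since it uses os.fork()); the key name user:99 is just an illustration carried over from the walkthrough:

```python
import os
import time

data = {"user:99": "idle"}  # after fork(), parent and child share this page via COW

pid = os.fork()
if pid == 0:
    # Child: sees the memory image as of fork(); the parent's later write
    # lands on a private copy of the page and never appears here.
    time.sleep(0.2)  # give the parent time to write
    os._exit(0 if data["user:99"] == "idle" else 1)
else:
    data["user:99"] = "active"  # parent write: the kernel copies the page privately
    _, status = os.waitpid(pid, 0)
    print("child saw the fork-time snapshot:", os.WEXITSTATUS(status) == 0)
```

This is exactly why the BGSAVE child can serialize a consistent point-in-time snapshot while the parent keeps accepting writes.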
// Pseudocode: kernel page fault handler for a COW page
on_write_fault(process, virtual_address):
    page = page_table_lookup(process, virtual_address)
    if page.is_cow_shared:
        new_page = alloc_physical_page()
        copy_contents(src=page, dst=new_page)   // 4 KB memcpy
        page_table_update(process, virtual_address, new_page)
        new_page.writable = true
        page.ref_count -= 1                     // one fewer sharer of the old page
    resume_write(process, virtual_address)
The total extra memory equals: number of pages written during BGSAVE multiplied by 4 KB. A read-heavy workload during BGSAVE uses almost no extra memory. A write-heavy workload can approach 2x the dataset size.
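That formula turns directly into a quick capacity estimate. A sketch in Python, where the 30% dirty-page figure is purely an example input:

```python
PAGE_SIZE = 4096  # bytes per page on x86-64 Linux

def cow_overhead_bytes(pages_dirtied: int, page_size: int = PAGE_SIZE) -> int:
    """Extra resident memory during BGSAVE = pages written * page size."""
    return pages_dirtied * page_size

# Example: 8 GB dataset, 30% of pages touched by writes during the snapshot.
dataset_bytes = 8 * 1024**3
total_pages = dataset_bytes // PAGE_SIZE
overhead = cow_overhead_bytes(int(total_pages * 0.30))
print(round(overhead / 1024**3, 2))  # ~2.4 (GiB of extra RSS)
```

Plugging in 0% and 100% dirty pages reproduces the two extremes: near-zero overhead for a read-only workload, a full extra dataset copy for a fully write-churned one.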
Page granularity and memory amplification
COW copies entire pages, not individual keys. A single SET command that writes 8 bytes to a key sharing a 4 KB page with 50 other keys causes all 4 KB to be copied, even though only 8 bytes changed.
| Write intensity during BGSAVE | Pages copied | Extra memory on 8 GB dataset |
|---|---|---|
| Near zero (read-heavy) | Less than 1% | Less than 100 MB |
| Moderate (balanced workload) | 10-30% | 800 MB to 2.5 GB |
| High (write-heavy) | 50-100% | 4 GB to 8 GB |
The correct formula for Redis host sizing is: host_RAM >= maxmemory + COW_overhead. For write-heavy workloads, budget at least 1.5x the dataset size for the host. I always tell teams to treat the COW buffer as non-negotiable, not optional headroom.
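The sizing rule can be expressed as a small check. The 1.5x write-heavy factor restates the guidance above; the read-heavy margin and the function name are illustrative assumptions:

```python
def min_host_ram_bytes(maxmemory_bytes: int, write_heavy: bool) -> int:
    """host_RAM >= maxmemory + COW_overhead.

    Worst case, every page is dirtied during the snapshot (overhead ~= maxmemory);
    for write-heavy workloads this budget requires at least 1.5x the dataset.
    """
    factor = 1.5 if write_heavy else 1.1  # 1.1 is an illustrative read-heavy margin
    return int(maxmemory_bytes * factor)

gb = 1024**3
print(min_host_ram_bytes(8 * gb, write_heavy=True) / gb)  # 12.0
```

On the 16 GB host from the opening scenario, a 12 GB floor for an 8 GB dataset leaves room for the spike that looked like a leak.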
fork() latency and the blocked event loop
fork() itself is not instant, even with COW. The kernel does not copy physical pages at fork time, but it must copy the page table and mark every entry as read-only. For a 32 GB Redis instance using 4 KB pages, that is roughly 8 million page table entries at 8 bytes each, giving a ~64 MB page table to copy and mark. The Redis event loop is fully blocked during this work.
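The 32 GB arithmetic is worth being able to reproduce; the 8-byte entry size below matches the estimate in the text:

```python
PAGE_SIZE = 4096  # bytes per page
PTE_SIZE = 8      # bytes per page table entry, per the estimate above

def page_table_bytes(dataset_bytes: int) -> int:
    """Memory fork() must walk, copy, and mark read-only: one entry per mapped page."""
    return (dataset_bytes // PAGE_SIZE) * PTE_SIZE

mib = 1024**2
print(page_table_bytes(32 * 1024**3) // mib)  # 64 (MiB, matching the text)
```

The key takeaway: fork() cost scales with mapped pages, not with the bytes inside them, which is why the latency table below grows roughly linearly with dataset size.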
| Redis dataset size | Typical fork() latency |
|---|---|
| 1 GB | ~5 ms |
| 8 GB | ~30-50 ms |
| 32 GB | ~80-150 ms |
| 64 GB | ~200-400 ms |
During those 80-150 ms on a large instance, no commands are processed. Clients waiting for responses see elevated latency. The mitigation is Redis Cluster: shard data across many smaller instances so no single fork pays a large cost.
Transparent Huge Pages and COW amplification
Linux Transparent Huge Pages (THP) consolidates 4 KB pages into 2 MB huge pages to reduce TLB pressure. COW operates at page granularity, so with THP enabled a write that would normally copy 4 KB copies 2 MB instead. That is 512x amplification per COW event.
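The 512x figure is just the ratio of the two page sizes, and the worst case per write is stark:

```python
SMALL_PAGE = 4 * 1024    # 4 KB base page
HUGE_PAGE = 2 * 1024**2  # 2 MB transparent huge page

amplification = HUGE_PAGE // SMALL_PAGE
print(amplification)  # 512

# An 8-byte SET on a shared page: bytes physically copied in each mode.
print(SMALL_PAGE, "vs", HUGE_PAGE)  # 4096 vs 2097152
```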
The Redis documentation is explicit: disable THP on Redis hosts.
# Disable THP immediately (until reboot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Persist across reboots by re-running that same command at boot,
# e.g. from a systemd unit or rc.local:
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local
I have seen teams spend hours investigating mysterious Redis memory spikes before discovering THP had been silently re-enabled by their infrastructure automation (Ansible, cloud-init, vendor AMIs) on the last provision run. Add the transparent_hugepage=never setting explicitly to your provisioning scripts and verify it on running hosts: on a fleet of 50 hosts, finding THP enabled on 3 of them is a common source of unexplained memory spikes that take days to diagnose. Check it, then lock it.
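Verification is cheap to script. The sysfs file /sys/kernel/mm/transparent_hugepage/enabled lists all modes with the active one in brackets, so a fleet check only needs to parse that one line. The helper below is a sketch of how you might wire this into your own tooling, not a Redis utility:

```python
def thp_is_disabled(sysfs_line: str) -> bool:
    """True when 'never' is the active (bracketed) THP mode."""
    return "[never]" in sysfs_line

# Typical contents of /sys/kernel/mm/transparent_hugepage/enabled:
print(thp_is_disabled("always madvise [never]"))  # True  -> safe for Redis
print(thp_is_disabled("[always] madvise never"))  # False -> fix before traffic
```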
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Redis | BGSAVE and BGREWRITEAOF both call fork() | Write-heavy workloads during snapshot window cause memory spikes proportional to write rate |
| Linux kernel | All fork()-based process spawning | Every shell command uses COW implicitly via fork + exec |
| PostgreSQL | MVCC row versioning | Old row versions are retained for concurrent readers and persist until the transactions that can see them close |
| ZFS / Btrfs | Copy-on-write filesystem blocks | Snapshots are instantaneous because unchanged blocks are shared; only modified blocks are copied |
| Git | Blob and tree objects | Unchanged files between commits share the same blob objects; new content creates new blobs |
Limitations and when NOT to use it
- Write-heavy workloads during snapshotting cause near-2x memory spikes. If Redis is taking 50K writes/second during BGSAVE, memory can approach twice the dataset size. OOM kills Redis if the host is not sized for this.
- THP amplifies COW cost by 512x. One write to a shared 2 MB huge page copies 2 MB instead of 4 KB. Always disable THP on Redis hosts and verify your provisioning scripts enforce this.
- fork() blocks the event loop proportionally to the page table size. At 64 GB, fork latency reaches 200-400 ms. Oversized single Redis instances are an operational hazard with strict SLAs.
- COW granularity wastes memory on dense pages. Modifying one 8-byte key on a page shared with 50 other keys still copies all 4 KB. High key density amplifies this waste.
- Serialization time determines the total exposure window. The longer BGSAVE runs (slow disks, large datasets), the more writes accumulate COW pages. HDDs at 100 MB/s mean far more writes during a 60-second snapshot than SSDs at 3 GB/s.
- vm.overcommit_memory = 0 (the kernel default) can prevent fork() entirely. Under the default heuristic, the kernel estimates worst-case memory at fork time and refuses if it calculates a shortfall. Set vm.overcommit_memory = 1 on Redis hosts so fork() is always allowed.
Three kernel settings must be locked in provisioning for every Redis host: vm.overcommit_memory = 1 (so fork() always succeeds), transparent_hugepage = never (prevents 512x COW amplification), and vm.swappiness = 0 (prevents Redis pages from being swapped out, which causes severe latency spikes when swapped pages are touched). Run redis-cli INFO server | grep redis_version on a fresh host to verify the binary version, then verify all three kernel values before allowing traffic.
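Those checks are easy to script. The expected values below restate the settings from this section; the function name and dict shape are illustrative:

```python
REQUIRED = {
    "vm.overcommit_memory": "1",      # fork() always allowed
    "vm.swappiness": "0",             # keep Redis pages out of swap
    "transparent_hugepage": "never",  # no 512x COW amplification
}

def misconfigured(observed: dict) -> list:
    """Return the settings whose observed value differs from the requirement."""
    return [key for key, want in REQUIRED.items() if observed.get(key) != want]

print(misconfigured({
    "vm.overcommit_memory": "0",
    "vm.swappiness": "0",
    "transparent_hugepage": "never",
}))  # ['vm.overcommit_memory']
```

Feed it the output of sysctl and the THP sysfs file from each host, and gate traffic on an empty result.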
Interview cheat sheet
- When asked why Redis memory doubles during backups: Explain BGSAVE calls fork(). The child shares all pages read-only with the parent. Every write during the snapshot copies a 4 KB page privately. Write-heavy workloads drive memory toward 2x the dataset size.
- When asked how fork() can be fast for large Redis datasets: COW is why. fork() copies the page table (kilobytes to megabytes), not physical memory. Physical pages are copied only on write, after fork() returns.
- When asked about Transparent Huge Pages: Say THP is an explicit Redis anti-recommendation. 2 MB pages amplify per-write COW cost by 512x. Prescribe echo never > /sys/kernel/mm/transparent_hugepage/enabled and verify provisioning scripts enforce it.
- When asked about Redis memory sizing: Host RAM must cover maxmemory plus COW overhead. For write-heavy workloads, budget 1.5-2x dataset size for the host. Treat this as a hard requirement, not optional headroom.
- When asked about fork() latency: It is proportional to page table size, not data size. At 32 GB, expect 80-150 ms of event loop blocking. Mitigate by keeping individual Redis instances under 10-12 GB and using Redis Cluster.
- When asked about other systems using COW: ZFS/Btrfs for instant snapshots, Git for shared blob objects, Linux fork() for all processes. COW is a universal mechanism, not Redis-specific.
- When asked BGREWRITEAOF vs BGSAVE: Both call fork() and both trigger COW. AOF rewrites can run longer than RDB saves, meaning more COW exposure time and potentially larger memory spikes.
- When asked how to fix "fork: Cannot allocate memory" errors: Set vm.overcommit_memory = 1 immediately. This is the Redis-documented required kernel configuration. The kernel's default conservative estimate refuses fork() even when physical memory would suffice.
Quick recap
- COW defers physical memory copies by marking pages read-only and triggering a kernel-level 4 KB copy only when one party actually writes to a shared page.
- Redis uses fork() for both BGSAVE and BGREWRITEAOF, relying entirely on COW to let the parent continue serving writes while the child serializes a consistent snapshot.
- Every parent write during the snapshot copies a 4 KB page privately; write-heavy workloads during BGSAVE can drive resident memory toward 2x the dataset size.
- Transparent Huge Pages amplify COW cost by 512x per write event; always disable THP on Redis hosts and lock this in provisioning automation.
- fork() itself blocks the Redis event loop proportionally to page table size, not data size; at 32 GB the block is 80-150 ms, making oversized single instances dangerous for latency SLAs.
- Size Redis hosts at 1.5-2x the dataset size, set vm.overcommit_memory = 1, and disable THP to make COW-based snapshotting reliable under write-heavy workloads.
Related concepts
- Redis data structures: Redis's internal data structures determine which operations are most likely to dirty pages during BGSAVE and drive COW memory amplification at page granularity.
- Write-ahead log: WAL provides crash recovery through append-only logging without using fork(); understanding both helps you choose between RDB snapshots and AOF for your durability and memory requirements.
- Databases: COW is not Redis-specific; PostgreSQL, ZFS, and Btrfs all use copy-on-write techniques for snapshot isolation, crash recovery, and efficient storage.