Copy-on-write (COW)
Learn how copy-on-write defers expensive data copies until the moment a write actually happens, how Redis uses it for non-blocking snapshotting, and where COW produces surprising memory spikes.
The problem
Your Redis instance uses 8 GB of memory. At 2 AM, BGSAVE starts automatically to create an RDB snapshot. Thirty seconds later, your monitoring fires: RSS memory has climbed to 14 GB. The host has 16 GB total. One more spike like this and the OOM killer will fire, taking Redis down.
The dataset size has not changed. Write traffic is simply heavier than usual tonight, and that is the variable your team did not account for.
If you do not understand copy-on-write, you will size Redis based on your dataset alone, get surprised by this memory spike on every snapshot, and spend hours chasing what looks like a memory leak but is not.
What copy-on-write is
Copy-on-write (COW) is an OS-level optimization that defers copying a memory region until one of the parties sharing it actually writes to it. Before any modification, two processes share the same physical pages read-only. The physical copy happens lazily, one 4 KB page at a time, exactly as writes occur.
Think of a shared recipe binder in a restaurant kitchen. Two chefs use the same binder every night. When Chef A wants to annotate a recipe, she tears out that one page, photocopies it, and writes on her copy alone. Chef B's original is untouched. Every other page remains shared. Nobody copies the entire binder just because one chef wants to scribble on page 12.
How copy-on-write works
When Redis runs BGSAVE, it calls fork() to spawn a child process that serializes all data to disk. The Linux kernel does not copy all of Redis's memory at that moment. Instead, it marks every page in both the parent and child as read-only and shared.
Step by step, walking through one write during an active BGSAVE:
- Redis calls fork(). The child process inherits the parent's virtual memory map. All pages are marked read-only in both page tables. No physical copy happens yet.
- The child reads every key and serializes it to the RDB file. Reads do not trigger any copy.
- A client sends SET user:99 "active" to the parent. The parent tries to write to the dictEntry for that key.
- The CPU sees the page is marked read-only and raises a page fault, trapping into the kernel.
- The kernel's COW handler allocates a new 4 KB physical page, copies the old page contents into it, and updates the parent's page table to point to the new private page marked writable.
- The original page remains in the child's page table, untouched.
- The parent's write completes on the private copy. The child never sees it.
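The snapshot isolation these steps produce can be observed directly from user space. Below is a minimal Python sketch (Linux/macOS only, since it uses os.fork()); the key name user:99 is just an illustration carried over from the walkthrough:

```python
import os
import time

data = {"user:99": "idle"}  # after fork(), parent and child share this page via COW

pid = os.fork()
if pid == 0:
    # Child: sees the memory image as of fork(); the parent's later write
    # lands on a private copy of the page and never appears here.
    time.sleep(0.2)  # give the parent time to write
    os._exit(0 if data["user:99"] == "idle" else 1)
else:
    data["user:99"] = "active"  # parent write: the kernel copies the page privately
    _, status = os.waitpid(pid, 0)
    print("child saw the fork-time snapshot:", os.WEXITSTATUS(status) == 0)
```

This is exactly why the BGSAVE child can serialize a consistent point-in-time snapshot while the parent keeps accepting writes.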
// Pseudocode: kernel page fault handler for a COW page
on_write_fault(process, virtual_address):
    page = page_table_lookup(process, virtual_address)
    if page.is_cow_shared:
        new_page = alloc_physical_page()
        copy_contents(src=page, dst=new_page)   // 4 KB memcpy
        page_table_update(process, virtual_address, new_page)
        new_page.writable = true
        page.ref_count -= 1                     // one fewer sharer of the old page
    resume_write(process, virtual_address)
The total extra memory equals: number of pages written during BGSAVE multiplied by 4 KB. A read-heavy workload during BGSAVE uses almost no extra memory. A write-heavy workload can approach 2x the dataset size.
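That formula turns directly into a quick capacity estimate. A sketch in Python, where the 30% dirty-page figure is purely an example input:

```python
PAGE_SIZE = 4096  # bytes per page on x86-64 Linux

def cow_overhead_bytes(pages_dirtied: int, page_size: int = PAGE_SIZE) -> int:
    """Extra resident memory during BGSAVE = pages written * page size."""
    return pages_dirtied * page_size

# Example: 8 GB dataset, 30% of pages touched by writes during the snapshot.
dataset_bytes = 8 * 1024**3
total_pages = dataset_bytes // PAGE_SIZE
overhead = cow_overhead_bytes(int(total_pages * 0.30))
print(round(overhead / 1024**3, 2))  # ~2.4 (GiB of extra RSS)
```

Plugging in 0% and 100% dirty pages reproduces the two extremes: near-zero overhead for a read-only workload, a full extra dataset copy for a fully write-churned one.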
Page granularity and memory amplification
COW copies entire pages, not individual keys. A single SET command that writes 8 bytes to a key sharing a 4 KB page with 50 other keys causes all 4 KB to be copied, even though only 8 bytes changed.
| Write intensity during BGSAVE | Pages copied | Extra memory on 8 GB dataset |
|---|---|---|
| Near zero (read-heavy) | Less than 1% | Less than 100 MB |
| Moderate (balanced workload) | 10-30% | 800 MB to 2.5 GB |
| High (write-heavy) | 50-100% | 4 GB to 8 GB |
The correct formula for Redis host sizing is: host_RAM >= maxmemory + COW_overhead. For write-heavy workloads, budget at least 1.5x the dataset size for the host. I always tell teams to treat the COW buffer as non-negotiable, not optional headroom.
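The sizing rule can be expressed as a small check. The 1.5x write-heavy factor restates the guidance above; the read-heavy margin and the function name are illustrative assumptions:

```python
def min_host_ram_bytes(maxmemory_bytes: int, write_heavy: bool) -> int:
    """host_RAM >= maxmemory + COW_overhead.

    Worst case, every page is dirtied during the snapshot (overhead ~= maxmemory);
    for write-heavy workloads this budget requires at least 1.5x the dataset.
    """
    factor = 1.5 if write_heavy else 1.1  # 1.1 is an illustrative read-heavy margin
    return int(maxmemory_bytes * factor)

gb = 1024**3
print(min_host_ram_bytes(8 * gb, write_heavy=True) / gb)  # 12.0
```

On the 16 GB host from the opening scenario, a 12 GB floor for an 8 GB dataset leaves room for the spike that looked like a leak.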
fork() latency and the blocked event loop
fork() itself is not instant, even with COW. The kernel does not copy physical pages at fork time, but it must copy the page table and mark every entry as read-only. For a 32 GB Redis instance using 4 KB pages, that is roughly 8 million page table entries at 8 bytes each, giving a ~64 MB page table to copy and mark. The Redis event loop is fully blocked during this work.
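The 32 GB arithmetic is worth being able to reproduce; the 8-byte entry size below matches the estimate in the text:

```python
PAGE_SIZE = 4096  # bytes per page
PTE_SIZE = 8      # bytes per page table entry, per the estimate above

def page_table_bytes(dataset_bytes: int) -> int:
    """Memory fork() must walk, copy, and mark read-only: one entry per mapped page."""
    return (dataset_bytes // PAGE_SIZE) * PTE_SIZE

mib = 1024**2
print(page_table_bytes(32 * 1024**3) // mib)  # 64 (MiB, matching the text)
```

The key takeaway: fork() cost scales with mapped pages, not with the bytes inside them, which is why the latency table below grows roughly linearly with dataset size.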
| Redis dataset size | Typical fork() latency |
|---|---|
| 1 GB | ~5 ms |
| 8 GB | ~30-50 ms |
| 32 GB | ~80-150 ms |
| 64 GB | ~200-400 ms |
During those 80-150 ms on a large instance, no commands are processed. Clients waiting for responses see elevated latency. The mitigation is Redis Cluster: shard data across many smaller instances so no single fork pays a large cost.
Transparent Huge Pages and COW amplification
Linux Transparent Huge Pages (THP) consolidates 4 KB pages into 2 MB huge pages to reduce TLB pressure. COW operates at page granularity, so with THP enabled a write that would normally copy 4 KB copies 2 MB instead. That is 512x amplification per COW event.
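The 512x figure is just the ratio of the two page sizes, and the worst case per write is stark:

```python
SMALL_PAGE = 4 * 1024    # 4 KB base page
HUGE_PAGE = 2 * 1024**2  # 2 MB transparent huge page

amplification = HUGE_PAGE // SMALL_PAGE
print(amplification)  # 512

# An 8-byte SET on a shared page: bytes physically copied in each mode.
print(SMALL_PAGE, "vs", HUGE_PAGE)  # 4096 vs 2097152
```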
The Redis documentation is explicit: disable THP on Redis hosts.
# Disable THP immediately (until reboot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Persist across reboots by re-running that same command at boot,
# e.g. from a systemd unit or rc.local:
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local
I have seen teams spend hours investigating mysterious Redis memory spikes before discovering THP had been silently re-enabled by their infrastructure automation (Ansible, cloud-init, vendor AMIs) on the last provision run. Add the transparent_hugepage=never setting explicitly to your provisioning scripts and verify it on running hosts: on a fleet of 50 hosts, finding THP enabled on 3 of them is a common source of unexplained memory spikes that take days to diagnose. Check it, then lock it.
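Verification is cheap to script. The sysfs file /sys/kernel/mm/transparent_hugepage/enabled lists all modes with the active one in brackets, so a fleet check only needs to parse that one line. The helper below is a sketch of how you might wire this into your own tooling, not a Redis utility:

```python
def thp_is_disabled(sysfs_line: str) -> bool:
    """True when 'never' is the active (bracketed) THP mode."""
    return "[never]" in sysfs_line

# Typical contents of /sys/kernel/mm/transparent_hugepage/enabled:
print(thp_is_disabled("always madvise [never]"))  # True  -> safe for Redis
print(thp_is_disabled("[always] madvise never"))  # False -> fix before traffic
```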
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Redis | BGSAVE and BGREWRITEAOF both call fork() | Write-heavy workloads during snapshot window cause memory spikes proportional to write rate |
| Linux kernel | All fork()-based process spawning | Every shell command uses COW implicitly via fork + exec |
| PostgreSQL | MVCC row versioning | Old row versions are retained for concurrent readers and persist until the transactions that can see them close |
| ZFS / Btrfs | Copy-on-write filesystem blocks | Snapshots are instantaneous because unchanged blocks are shared; only modified blocks are copied |
| Git | Blob and tree objects | Unchanged files between commits share the same blob objects; new content creates new blobs |
Limitations and when NOT to use it
- Write-heavy workloads during snapshotting cause near-2x memory spikes. If Redis is taking 50K writes/second during BGSAVE, memory can approach twice the dataset size. OOM kills Redis if the host is not sized for this.
- THP amplifies COW cost by 512x. One write to a shared 2 MB huge page copies 2 MB instead of 4 KB. Always disable THP on Redis hosts and verify your provisioning scripts enforce this.
- fork() blocks the event loop proportionally to the page table size. At 64 GB, fork latency reaches 200-400 ms. Oversized single Redis instances are an operational hazard with strict SLAs.
- COW granularity wastes memory on dense pages. Modifying one 8-byte key on a page shared with 50 other keys still copies all 4 KB. High key density amplifies this waste.
- Serialization time determines the total exposure window. The longer BGSAVE runs (slow disks, large datasets), the more writes accumulate COW pages. HDDs at 100 MB/s mean far more writes during a 60-second snapshot than SSDs at 3 GB/s.
- vm.overcommit_memory = 0 (the kernel default) can prevent fork() entirely. Under the default heuristic, the kernel estimates worst-case memory at fork time and refuses if it calculates a shortfall. Set vm.overcommit_memory = 1 on Redis hosts so fork() is always allowed.
Three kernel settings must be locked in provisioning for every Redis host: vm.overcommit_memory = 1 (so fork() always succeeds), transparent_hugepage = never (prevents 512x COW amplification), and vm.swappiness = 0 (prevents Redis pages from being swapped out, which causes severe latency spikes when swapped pages are touched). Run redis-cli INFO server | grep redis_version on a fresh host to verify the binary version, then verify all three kernel values before allowing traffic.
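Those checks are easy to script. The expected values below restate the settings from this section; the function name and dict shape are illustrative:

```python
REQUIRED = {
    "vm.overcommit_memory": "1",      # fork() always allowed
    "vm.swappiness": "0",             # keep Redis pages out of swap
    "transparent_hugepage": "never",  # no 512x COW amplification
}

def misconfigured(observed: dict) -> list:
    """Return the settings whose observed value differs from the requirement."""
    return [key for key, want in REQUIRED.items() if observed.get(key) != want]

print(misconfigured({
    "vm.overcommit_memory": "0",
    "vm.swappiness": "0",
    "transparent_hugepage": "never",
}))  # ['vm.overcommit_memory']
```

Feed it the output of sysctl and the THP sysfs file from each host, and gate traffic on an empty result.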
Interview cheat sheet
- When asked why Redis memory doubles during backups: Explain BGSAVE calls fork(). The child shares all pages read-only with the parent. Every write during the snapshot copies a 4 KB page privately. Write-heavy workloads drive memory toward 2x the dataset size.
- When asked how fork() can be fast for large Redis datasets: COW is why. fork() copies the page table (kilobytes to megabytes), not physical memory. Physical pages are copied only on write, after fork() returns.
- When asked about Transparent Huge Pages: Say THP is an explicit Redis anti-recommendation. 2 MB pages amplify per-write COW cost by 512x. Prescribe echo never > /sys/kernel/mm/transparent_hugepage/enabled and verify provisioning scripts enforce it.
- When asked about Redis memory sizing: Host RAM must cover maxmemory plus COW overhead. For write-heavy workloads, budget 1.5-2x dataset size for the host. Treat this as a hard requirement, not optional headroom.
- When asked about fork() latency: It is proportional to page table size, not data size. At 32 GB, expect 80-150 ms of event loop blocking. Mitigate by keeping individual Redis instances under 10-12 GB and using Redis Cluster.
- When asked about other systems using COW: ZFS/Btrfs for instant snapshots, Git for shared blob objects, Linux fork() for all processes. COW is a universal mechanism, not Redis-specific.
- When asked BGREWRITEAOF vs BGSAVE: Both call fork() and both trigger COW. AOF rewrites can run longer than RDB saves, meaning more COW exposure time and potentially larger memory spikes.
- When asked how to fix "fork: Cannot allocate memory" errors: Set vm.overcommit_memory = 1 immediately. This is the Redis-documented required kernel configuration. The kernel's default conservative estimate refuses fork() even when physical memory would suffice.
Quick recap
- COW defers physical memory copies by marking pages read-only and triggering a kernel-level 4 KB copy only when one party actually writes to a shared page.
- Redis uses fork() for both BGSAVE and BGREWRITEAOF, relying entirely on COW to let the parent continue serving writes while the child serializes a consistent snapshot.
- Every parent write during the snapshot copies a 4 KB page privately; write-heavy workloads during BGSAVE can drive resident memory toward 2x the dataset size.
- Transparent Huge Pages amplify COW cost by 512x per write event; always disable THP on Redis hosts and lock this in provisioning automation.
- fork() itself blocks the Redis event loop proportionally to page table size, not data size; at 32 GB the block is 80-150 ms, making oversized single instances dangerous for latency SLAs.
- Size Redis hosts at 1.5-2x the dataset size, set vm.overcommit_memory = 1, and disable THP to make COW-based snapshotting reliable under write-heavy workloads.
Related concepts
- Redis data structures: Redis's internal data structures determine which operations are most likely to dirty pages during BGSAVE and drive COW memory amplification at page granularity.
- Write-ahead log: WAL provides crash recovery through append-only logging without using fork(); understanding both helps you choose between RDB snapshots and AOF for your durability and memory requirements.
- Databases: COW is not Redis-specific; PostgreSQL, ZFS, and Btrfs all use copy-on-write techniques for snapshot isolation, crash recovery, and efficient storage.