How Docker works
How Docker creates isolated containers using Linux kernel primitives: namespaces, cgroups, UnionFS layers, container networking modes, and what Docker does that the kernel does not.
The Problem Statement
Interviewer: "Walk me through what actually happens when you run docker run nginx. What system calls fire, what kernel features are used, and how is the container isolated from the host? How is this different from a virtual machine?"
This question tests three things: whether you understand that containers are not VMs, whether you know the specific Linux kernel primitives (namespaces, cgroups, OverlayFS) that make isolation work, and whether you can articulate the role Docker plays on top of those primitives.
Most candidates say "containers are lightweight VMs." That is wrong. A container is a regular Linux process with restricted visibility and limited resources. There is no hypervisor, no guest kernel, no hardware emulation. The difference between "lightweight VM" and "isolated process using kernel namespaces" is the difference between a surface-level answer and a senior one.
I like this question because it separates people who have used Docker from people who understand Docker. Everyone can type docker run. Far fewer can explain what happens between pressing Enter and the container serving its first request.
Clarifying the Scenario
You: "Great question. Let me clarify scope before I dive in."
You: "When you say 'how Docker works,' do you want me to focus on the kernel primitives (namespaces, cgroups, filesystem layers) or the Docker daemon architecture (containerd, runc, the image registry protocol)?"
Interviewer: "Both. Start with the kernel primitives, then explain what Docker adds on top."
You: "Got it. Should I also cover networking? Docker has multiple network modes that behave very differently."
Interviewer: "Yes, cover networking. I want the full picture."
You: "One more thing. Should I compare containers to VMs at the start to set the baseline?"
Interviewer: "Briefly, yes."
You: "OK. I will structure this in four parts: how containers differ from VMs at the kernel level, the three kernel primitives that make isolation work, how Docker orchestrates these primitives through its daemon architecture, and how container networking works across the different modes."
My Approach
I break this into five parts:
- Containers vs VMs: The fundamental architectural difference at the kernel boundary
- Namespaces: How Linux gives each container its own view of PID, network, filesystem, and more
- Cgroups: How the kernel enforces CPU, memory, and I/O limits per container
- OverlayFS and image layers: How Docker builds images as stacked read-only layers with a writable top
- Container networking: Bridge, host, overlay, and macvlan modes and when each applies
The key insight most people miss: Docker itself does not isolate anything. The Linux kernel does all the isolation. Docker is a user-friendly toolchain (CLI, daemon, image format, registry protocol) that calls the right kernel APIs in the right order. If you wanted to, you could create a "container" with raw unshare and cgroup commands. Docker just makes it reproducible and portable.
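To make the "raw unshare" point concrete, here is a minimal sketch that only builds the command line for the util-linux unshare tool; it does not run it, since creating these namespaces normally requires root. The function name unshare_argv and its parameters are illustrative, not part of Docker or any library.

```python
import shlex


def unshare_argv(command=("/bin/sh",), hostname_isolated=True):
    """Build an argv for util-linux `unshare` that would start `command`
    in fresh PID, mount, network, IPC and (optionally) UTS namespaces."""
    argv = [
        "unshare",
        "--fork",        # fork so the child becomes PID 1 of the new PID namespace
        "--pid",         # new PID namespace
        "--mount-proc",  # fresh /proc; this also implies a new mount namespace
        "--net",         # new, empty network stack
        "--ipc",         # new IPC namespace
    ]
    if hostname_isolated:
        argv.append("--uts")  # separate hostname
    argv.extend(command)
    return argv


if __name__ == "__main__":
    # Print only. Running the result is left to the reader (as root).
    print(shlex.join(unshare_argv()))
```

A process started this way still sees the host's root filesystem and has no resource limits, which is the gap that image layers and cgroups fill.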
Here is the mental model I use. Think of a VM as renting a separate apartment in a building. You get your own walls, your own plumbing, your own electrical panel. A container is more like renting a desk in a coworking space. You share the building's walls, plumbing, and electrical (the kernel), but you get a divider around your desk (namespaces), a cap on how much electricity your desk can use (cgroups), and your own set of files in a locked drawer (OverlayFS).
| Aspect | Virtual Machine | Container |
|---|---|---|
| Kernel | Own guest kernel | Shares host kernel |
| Boot time | 30-60 seconds | Milliseconds |
| Memory overhead | 512 MB - 4 GB per VM | 5 - 50 MB per container |
| Isolation strength | Hardware-level (hypervisor) | Process-level (kernel features) |
| Density | 10-20 per host | 100-1000+ per host |
| Filesystem | Full disk image | Layered filesystem (OverlayFS) |
| Security boundary | Strong (separate kernel) | Weaker (shared kernel attack surface) |
Containers share the host kernel. This means a kernel vulnerability affects every container on the host. This is why stronger sandboxes exist for multi-tenant workloads where you do not trust the container code: microVMs such as Firecracker and Kata Containers, and user-space kernels such as gVisor.
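One quick way to observe the shared kernel, assuming a Linux host running Docker natively (Docker Desktop on macOS or Windows runs containers inside a Linux VM, so the comparison does not apply there): run the sketch below on the host and again inside a container, and compare the two strings. The helper name kernel_release is just for illustration.

```python
import platform


def kernel_release():
    """Return the release string of the kernel this process is running on."""
    return platform.release()


if __name__ == "__main__":
    print(kernel_release())
```

Inside a VM the same script would report the guest kernel instead, because the guest boots its own.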
The Architecture
Here is what happens when you run docker run nginx, from the CLI command down to the kernel, step by step:
- CLI to daemon: docker run nginx sends a REST API call to the Docker daemon (dockerd). The daemon validates the image reference, pull policy, and runtime flags.
- Daemon to containerd: dockerd calls containerd over gRPC. containerd is the actual container lifecycle manager. It handles image pulling, creating OverlayFS snapshots, and managing container tasks.
- Snapshot creation: containerd creates an OverlayFS mount for the container. The nginx image layers become read-only lower directories, and a new writable upper directory is created for this specific container instance.
- Shim spawn: containerd spawns a containerd-shim process for this container. The shim is the parent process of the actual container. If containerd crashes or restarts, the shim keeps the container alive. This is how you can upgrade Docker without killing running containers.
- runc execution: The shim forks and execs runc, the OCI-compliant runtime. runc calls clone() with namespace flags (CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC). The resulting child process is the container's PID 1.
- cgroup setup: runc creates a new cgroup under /sys/fs/cgroup/ and writes the CPU, memory, and I/O limits from the docker run flags (--memory, --cpus).
- Network setup: dockerd, through its libnetwork component rather than runc, creates a veth pair (virtual ethernet), attaches one end to the container's network namespace and the other to the Docker bridge (docker0). The container gets its own IP address from the bridge's subnet.
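Three of the steps above boil down to small, concrete values handed to the kernel. The sketch below computes them without touching the system: the overlay mount options for the snapshot step, the flag bitmask for the clone() step, and the file contents for the cgroup step. It assumes cgroup v2 file names (memory.max, cpu.max) and the default 100 ms CPU period; the function names and the directory paths in the usage example are hypothetical, not what containerd or runc use internally.

```python
# Namespace flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount namespace
CLONE_NEWUTS = 0x04000000  # hostname and domain name
CLONE_NEWIPC = 0x08000000  # System V IPC and POSIX message queues
CLONE_NEWPID = 0x20000000  # process IDs
CLONE_NEWNET = 0x40000000  # network stack


def clone_flags():
    """OR together the five namespace flags named in the runc step."""
    return (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS
            | CLONE_NEWUTS | CLONE_NEWIPC)


def overlay_options(image_layers, upper, work):
    """Build the option string for an overlay mount.

    image_layers is ordered base layer first. OverlayFS expects the
    topmost lower layer first, so the order is reversed here.
    """
    lower = ":".join(reversed(image_layers))
    return f"lowerdir={lower},upperdir={upper},workdir={work}"


def cgroup_v2_limits(memory_bytes=None, cpus=None, period_us=100_000):
    """Map --memory / --cpus style limits to cgroup v2 file contents."""
    files = {}
    if memory_bytes is not None:
        files["memory.max"] = str(memory_bytes)
    if cpus is not None:
        # A quota of cpus * period microseconds per period.
        quota_us = int(cpus * period_us)
        files["cpu.max"] = f"{quota_us} {period_us}"
    return files


if __name__ == "__main__":
    print(hex(clone_flags()))
    print(overlay_options(["/layers/base", "/layers/nginx"],
                          "/containers/c1/upper", "/containers/c1/work"))
    print(cgroup_v2_limits(memory_bytes=512 * 1024 * 1024, cpus=1.5))
```

Nothing here is Docker-specific, which is the point: each value is something any program with the right privileges could pass to mount, clone(), or a cgroup file.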
For your interview: the critical insight is this layered architecture. Docker is not one monolithic program. It is a stack: CLI calls daemon, daemon calls containerd, containerd calls runc, runc calls the kernel. Each layer has a clear responsibility and sits behind a defined interface. This is why you can swap runc for gVisor's runsc, or replace the Docker CLI and daemon with Podman, while the same OCI images keep working.
Namespaces and Cgroups: The Isolation Primitives
This is where the real isolation happens. Docker does not isolate anything. The kernel does, through two mechanisms: namespaces (what the process can see) and cgroups (what the process can use).
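Both mechanisms are visible from inside any Linux process through /proc. The sketch below, with illustrative function names, reads the namespace identifiers and cgroup membership of the current process. Run on the host and inside a container, the namespace identifiers and cgroup paths would normally differ, though the exact output depends on the system and is not shown here.

```python
import os


def current_namespaces(pid="self"):
    """Map each namespace type to its identifier, read from /proc/<pid>/ns.

    Identifiers have the form 'type:[inode]'. Two processes share a
    namespace exactly when their identifiers for that type match.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):  # not Linux, or /proc is not mounted
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable for this process
    return result


def current_cgroups(pid="self"):
    """Return the lines of /proc/<pid>/cgroup, or [] if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return [line.strip() for line in f if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:20} {ident}")
    for line in current_cgroups():
        print(line)
```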