How Kubernetes works
How Kubernetes orchestrates containers: control plane components, the scheduler, Pod lifecycle, service discovery, rolling deployments, and how it self-heals on node failure.
The Problem Statement
Interviewer: "You have 200 microservices running across a fleet of servers. Every deployment is a manual SSH-and-restart dance, scaling is reactive, and when a server dies at 3 AM your on-call pages you to manually migrate workloads. Walk me through what Kubernetes actually does under the hood, how it decides where to place a container, and how it recovers when a node goes down."
This question tests three things: whether you understand the control plane architecture (not just the kubectl commands), whether you can explain the reconciliation loop that makes Kubernetes declarative, and whether you grasp how the scheduler, kubelet, and controller manager work together to close the gap between desired and actual state.
Most candidates describe Kubernetes as "a thing that runs containers." That is like saying a database is "a thing that stores data." The interesting parts are the mechanics: how the API server serializes desired state into etcd, how the scheduler assigns Pods to nodes by filtering and scoring (historically called predicates and priorities), how the kubelet on each node watches for work, and how controllers constantly loop to fix drift between what you declared and what is actually running.
I like this question because it separates people who have deployed to Kubernetes from people who understand why their deployment survived a node failure.
Clarifying the Scenario
You: "Great question. Before I go deep, let me scope this."
You: "When you say 'what Kubernetes does under the hood,' should I focus on the control plane architecture (API server, etcd, scheduler, controllers) or the data plane (kubelet, kube-proxy, container runtime)? Or both?"
Interviewer: "Both. I want the full picture."
You: "Got it. Should I also cover Services and networking? The service discovery and routing layer is a big part of how K8s works in production."
Interviewer: "Yes, cover Services. Especially how a request reaches a Pod."
You: "Last question: should I cover StatefulSets and DaemonSets, or keep the focus on Deployments?"
Interviewer: "Mention them briefly, but go deep on Deployments and rolling updates."
You: "OK. I will structure my answer in five parts: the control plane and its reconciliation loop, how the scheduler picks a node for a Pod, the Pod lifecycle from Pending to Running, how Services and DNS route traffic to Pods, and how rolling deployments work without downtime."
My Approach
I break this into five parts:
- The control plane reconciliation loop: How desired state in etcd becomes actual state on nodes
- The scheduler algorithm: How Kubernetes picks which node runs a new Pod
- The Pod lifecycle: From Pending through Running to Succeeded or Failed
- Service discovery and networking: How ClusterIP, kube-proxy, and CoreDNS route traffic
- Rolling deployments and self-healing: How Kubernetes replaces Pods without downtime and recovers from node failure
The mental model I always use: Kubernetes is a distributed state machine. You write a YAML file that says "I want 3 replicas of this container, each with 512MB RAM." You submit that desired state to the API server. From that moment on, every component in the cluster has one job: make the actual state match the desired state. If a node dies and kills a replica, a controller notices the mismatch (desired: 3, actual: 2) and creates a replacement. Nobody pages you at 3 AM.
This "desired vs actual" reconciliation loop is the single most important concept in Kubernetes. Everything else is implementation detail.
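To make "desired state" concrete, here is a minimal sketch of the Deployment described above, built as a Python dict and serialized to JSON (kubectl accepts JSON manifests as well as YAML). The name web, the image example.com/web:1.0, and the app label are hypothetical placeholders, and 512Mi stands in for the "512MB RAM" figure.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int, memory: str) -> dict:
    """Build an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}  # hypothetical label; selector and template must agree
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # the desired count the controllers converge on
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # The memory request is what the scheduler reserves on a node.
                            "resources": {"requests": {"memory": memory}},
                        }
                    ]
                },
            },
        },
    }


# Placeholder values for illustration only.
manifest = deployment_manifest("web", "example.com/web:1.0", replicas=3, memory="512Mi")

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

Saved to a file, this output could be submitted with kubectl apply -f. Nothing in it says how to start the containers or where; it only states the end result you want.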
Kubernetes does not run containers directly. It delegates to a container runtime (containerd, CRI-O) through the Container Runtime Interface (CRI). Kubernetes manages the orchestration, not the isolation.
The Architecture
[Figure: a Kubernetes cluster, showing both the control plane and worker nodes]
Here is how the pieces work together:
- You submit a Deployment YAML to the API server via kubectl. The API server validates it, runs admission controllers, and writes the desired state to etcd.
- The Deployment controller (inside controller-manager) sees the new Deployment and creates a ReplicaSet object. The ReplicaSet controller in turn creates Pod objects with nodeName empty.
- The scheduler watches for Pods without a nodeName. It runs its filtering and scoring algorithm, picks a node, and writes the decision back to the API server.
- The kubelet on the chosen node watches the API server for Pods assigned to its node. When it sees a new assignment, it calls the container runtime (containerd) through CRI to pull the image and start the container.
- kube-proxy on every node watches Service and EndpointSlice objects and programs iptables or IPVS rules so that traffic to a Service ClusterIP gets load-balanced across the backing Pods.
This entire flow typically takes 2-10 seconds for a simple Pod on a healthy cluster. Most of that time is image pulling, not scheduling. If the image is already cached on the node, Pods often start within a second or so.
My advice for interviews: walk through this flow step by step. It shows you understand the control plane as a pipeline, not a black box.
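The hand-offs in that flow can be sketched as a toy in-memory pipeline. This is an illustrative model, not how the real components are implemented: Store stands in for the API server plus etcd, every function and class name is hypothetical, and the scheduler here just round-robins instead of filtering and scoring. The point it shows is that components never call each other; each one reacts to object state in the shared store.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    node_name: str = ""   # empty until the scheduler binds the Pod
    phase: str = "Pending"


@dataclass
class Store:
    """Stand-in for the API server plus etcd: the only shared state."""
    deployments: dict = field(default_factory=dict)  # deployment name -> replicas
    replicasets: dict = field(default_factory=dict)  # replicaset name -> replicas
    pods: dict = field(default_factory=dict)         # pod name -> Pod


def deployment_controller(store: Store) -> None:
    # Each Deployment gets a ReplicaSet carrying the same replica count.
    for name, replicas in store.deployments.items():
        store.replicasets[f"{name}-rs"] = replicas


def replicaset_controller(store: Store) -> None:
    # Create Pod objects (with no node assigned) until the count matches.
    for rs, replicas in store.replicasets.items():
        owned = [p for p in store.pods.values() if p.name.startswith(rs + "-")]
        for i in range(len(owned), replicas):
            store.pods[f"{rs}-{i}"] = Pod(name=f"{rs}-{i}")


def scheduler(store: Store, nodes: list) -> None:
    # React only to Pods whose node_name is still empty.
    unscheduled = [p for p in store.pods.values() if not p.node_name]
    for i, pod in enumerate(unscheduled):
        pod.node_name = nodes[i % len(nodes)]  # placeholder for filter and score


def kubelet(store: Store, my_node: str) -> None:
    # React only to Pods bound to this node.
    for pod in store.pods.values():
        if pod.node_name == my_node and pod.phase == "Pending":
            pod.phase = "Running"  # a real kubelet would call the runtime via CRI


def run_pipeline(replicas: int = 3, nodes: tuple = ("node-a", "node-b")) -> Store:
    store = Store()
    store.deployments["web"] = replicas  # the kubectl apply step
    deployment_controller(store)
    replicaset_controller(store)
    scheduler(store, list(nodes))
    for node in nodes:
        kubelet(store, node)
    return store


if __name__ == "__main__":
    for pod in run_pipeline().pods.values():
        print(pod)
```

In a real cluster these components run concurrently and are driven by watches rather than called in sequence, but the filtering conditions (empty nodeName for the scheduler, "assigned to my node" for the kubelet) are the same idea.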
The key components of the control plane:
| Component | What it does | Failure impact |
|---|---|---|
| kube-apiserver | REST front door to all cluster state. Every component talks to it, never directly to etcd | Cluster is unmanageable, but running Pods continue to run |
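| etcd | Stores all cluster state (desired spec and reported status) as key-value pairs. Uses Raft consensus for replication | Writes fail if quorum is lost, so the cluster is effectively read-only. Permanent data loss if the members are lost and there is no backup |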
| kube-scheduler | Assigns unscheduled Pods to nodes via filter-then-score | New Pods stay Pending. Existing Pods are unaffected |
| controller-manager | Runs 30+ controllers (ReplicaSet, Deployment, Node, Job, etc.) | No self-healing. Drift accumulates until it recovers |
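The "filter-then-score" behavior in the kube-scheduler row can be sketched in a few lines. This is a simplified model under stated assumptions: it considers only CPU and memory requests plus node readiness, and the scoring function is loosely modeled on a "prefer the least allocated node" strategy. The Node and PodRequest types, their field names, and the example numbers are all hypothetical; the real scheduler runs many more filter and score plugins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_capacity_m: int          # millicores
    mem_capacity_mi: int         # MiB
    cpu_requested_m: int = 0     # already requested by Pods on this node
    mem_requested_mi: int = 0
    ready: bool = True


@dataclass
class PodRequest:
    cpu_m: int
    mem_mi: int


def fits(pod: PodRequest, node: Node) -> bool:
    """Filter phase: hard constraints. A node either passes or is excluded."""
    return (
        node.ready
        and node.cpu_requested_m + pod.cpu_m <= node.cpu_capacity_m
        and node.mem_requested_mi + pod.mem_mi <= node.mem_capacity_mi
    )


def score(pod: PodRequest, node: Node) -> float:
    """Score phase: 0-100, higher means more capacity left free after placement."""
    cpu_free = (node.cpu_capacity_m - node.cpu_requested_m - pod.cpu_m) / node.cpu_capacity_m
    mem_free = (node.mem_capacity_mi - node.mem_requested_mi - pod.mem_mi) / node.mem_capacity_mi
    return 100 * (cpu_free + mem_free) / 2


def schedule(pod: PodRequest, nodes: list) -> Optional[Node]:
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None  # the Pod would stay Pending
    return max(feasible, key=lambda n: score(pod, n))


if __name__ == "__main__":
    nodes = [
        Node("node-a", 4000, 8192, cpu_requested_m=3500, mem_requested_mi=4096),
        Node("node-b", 4000, 8192, cpu_requested_m=1000, mem_requested_mi=2048),
        Node("node-c", 2000, 4096, cpu_requested_m=500, mem_requested_mi=1024),
        Node("node-d", 8000, 16384, ready=False),
    ]
    chosen = schedule(PodRequest(cpu_m=1000, mem_mi=512), nodes)
    print(chosen.name if chosen else "Pending")
```

Here node-a is filtered out for lack of CPU and node-d for not being ready, so only node-b and node-c are scored. When no node passes the filter, the function returns None, which mirrors the "New Pods stay Pending" behavior in the table.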
The Control Plane and Reconciliation Loop
This is the heart of how Kubernetes works. Every controller in the system follows the same pattern: watch the desired state, compare to actual state, take action to close the gap. Then loop forever.
Here is a concrete example. You create a Deployment with replicas: 3. The ReplicaSet controller sees the desired count is 3 but the actual count is 0. It creates 3 Pod objects. The scheduler assigns each to a node. The kubelets start the containers. Now desired and actual both equal 3.
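If a node crashes and takes one Pod with it, the ReplicaSet controller eventually sees desired: 3, actual: 2. Detection is not instant: the node controller first has to mark the node NotReady, and the Pods on it are evicted only after a toleration timeout (300 seconds by default). The ReplicaSet controller then creates a new Pod. The scheduler places it on a healthy node. The kubelet starts it. Back to 3/3. Nobody had to do anything.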
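The 3/3, then 2/3, then 3/3 sequence above can be sketched as a single reconcile function that is called repeatedly. This is a toy model, not the ReplicaSet controller's actual code: the Cluster class, the web- Pod names, and the "fewest Pods wins" placement rule are assumptions made for illustration, and node failure is modeled as instantaneous.

```python
import itertools


class Cluster:
    """Toy cluster state: a desired replica count plus the Pods that exist."""

    def __init__(self, desired_replicas: int, nodes: list):
        self.desired_replicas = desired_replicas
        self.nodes = set(nodes)
        self.pods = {}                  # pod name -> node name
        self._ids = itertools.count()   # source of unique Pod name suffixes

    def fail_node(self, node: str) -> None:
        # Simulate a node crash: the node and every Pod on it disappear.
        self.nodes.discard(node)
        self.pods = {p: n for p, n in self.pods.items() if n != node}


def reconcile(cluster: Cluster) -> int:
    """One pass of the loop: compare desired to actual and close the gap.

    Returns the number of Pods created or deleted (0 means no drift).
    """
    diff = cluster.desired_replicas - len(cluster.pods)
    if diff > 0:
        if not cluster.nodes:
            return 0  # nowhere to place Pods; try again on the next pass
        for _ in range(diff):
            # Stand-in for the scheduler: pick the node running the fewest Pods.
            node = min(
                sorted(cluster.nodes),
                key=lambda n: sum(1 for x in cluster.pods.values() if x == n),
            )
            cluster.pods[f"web-{next(cluster._ids)}"] = node
    elif diff < 0:
        for name in sorted(cluster.pods)[:-diff]:
            del cluster.pods[name]
    return abs(diff)


if __name__ == "__main__":
    cluster = Cluster(desired_replicas=3, nodes=["n1", "n2", "n3"])
    print("created:", reconcile(cluster))   # initial scale-up
    cluster.fail_node("n1")
    print("after failure:", len(cluster.pods), "pods")
    print("created:", reconcile(cluster))   # replacement Pod
    print("steady state changes:", reconcile(cluster))
```

The design choice worth noting is that reconcile is level-triggered: it never needs to know that a node failed, only that the current count differs from the desired one. Calling it again when nothing has drifted is a no-op, which is what makes "loop forever" safe.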