Kubernetes is a declarative control system. Users express desired state via the API; a set of reconciliation loops continuously drives actual state toward desired state. The cluster is split into a control plane (brain) and worker nodes (muscle).
Control Plane Components
| Component | Role |
|---|---|
| kube-apiserver | Single entry point for all REST operations. Persists state to etcd. Implements authentication, authorization, and admission. |
| etcd | Distributed key-value store (Raft). The only stateful component. All cluster state lives here. |
| kube-scheduler | Watches for unscheduled pods, selects a node via filter + score, writes the binding back to the API server. |
| kube-controller-manager | Runs all built-in control loops (Deployment, ReplicaSet, Node, Endpoints, …) in a single binary. |
| cloud-controller-manager | Optional. Integrates with cloud provider APIs (LoadBalancer, Node, Route). |
Node Components
| Component | Role |
|---|---|
| kubelet | Runs on every node. Watches the API server for pods assigned to its node; drives the container runtime via CRI. |
| kube-proxy | Maintains network rules (iptables / ipvs) that implement Service VIPs. |
| container runtime | Implements CRI (containerd, CRI-O). Pulls images, creates/deletes containers. |
Core design principles
Optimistic concurrency: every object carries a metadata.resourceVersion. Updates include this field; the API server rejects stale writes (HTTP 409 Conflict), forcing clients to re-read and retry.
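The read, modify, write, retry cycle that resourceVersion enforces can be sketched with a toy in-memory store. This is an illustration only: `FakeStore`, `Conflict`, and `update_with_retry` are hypothetical names, and the integer version counter is an assumption (real resourceVersion values are opaque strings).

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class FakeStore:
    """Toy in-memory object store; not the real API server."""

    def __init__(self):
        self._objects = {}

    def create(self, name, spec):
        self._objects[name] = {"resourceVersion": 1, "spec": dict(spec)}

    def get(self, name):
        obj = self._objects[name]
        # Return a copy so callers cannot mutate stored state directly.
        return {"resourceVersion": obj["resourceVersion"], "spec": dict(obj["spec"])}

    def update(self, name, obj):
        current = self._objects[name]
        # Reject the write if the caller read an older version.
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise Conflict(name)
        self._objects[name] = {
            "resourceVersion": current["resourceVersion"] + 1,
            "spec": dict(obj["spec"]),
        }


def update_with_retry(store, name, mutate, max_attempts=5):
    """Re-read and re-apply the mutation until the write is accepted."""
    for _ in range(max_attempts):
        obj = store.get(name)
        mutate(obj["spec"])
        try:
            store.update(name, obj)
            return store.get(name)
        except Conflict:
            continue
    raise RuntimeError("gave up after repeated conflicts")


store = FakeStore()
store.create("web", {"replicas": 1})

stale = store.get("web")  # client A reads version 1
update_with_retry(store, "web", lambda s: s.update(replicas=3))  # client B writes first

stale["spec"]["replicas"] = 5
try:
    store.update("web", stale)  # client A still holds version 1
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True

final = update_with_retry(store, "web", lambda s: s.update(replicas=5))
```

Client A's first write fails because client B bumped the version in between; the retry helper succeeds because it re-reads before writing.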
The API server is the hub of the cluster. It validates and persists objects, enforces policy, and serves a streaming watch mechanism that all other components use to react to state changes.
Request lifecycle
Client → Authentication → Authorization (RBAC) → Admission (Mutating webhooks)
→ Validation → Admission (Validating webhooks) → Persist to etcd → Response
Authentication methods
| Method | How it works |
|---|---|
| X.509 client certs | CN = username, O = groups. Used by system components and kubeadm-generated kubeconfigs. |
| Bearer tokens | ServiceAccount tokens (JWT signed by API server), static token files, OIDC tokens. |
| Bootstrap tokens | Short-lived tokens for node bootstrapping (kubeadm join). |
| Webhook | API server calls an external service to validate the token. |
Watch mechanism
The watch API (?watch=true) keeps an HTTP/2 stream open. The API server pushes ADDED, MODIFIED, and DELETED events as objects change. Controllers use informers (client-go) to multiplex a single watch into an in-memory cache + work queue, avoiding thundering-herd fan-out.
# Watch pods as raw events
kubectl get pods --watch -o json

# Inspect audit log (if enabled)
cat /var/log/kubernetes/audit.log | jq .

# Check API server health
kubectl get --raw /healthz
kubectl get --raw /readyz
kubectl get --raw /livez
Admission controllers
Admission controllers sit between authorization and persistence. Mutating webhooks run first (can modify the object), then validating webhooks (can only accept/reject). Important built-in controllers:
NamespaceLifecycle
Prevents creating objects in terminating namespaces and protects system namespaces from deletion.
LimitRanger
Applies default resource requests/limits when not set, enforces LimitRange constraints.
ServiceAccount
Automatically injects the default ServiceAccount and mounts its token into pods.
ResourceQuota
Rejects objects that would exceed namespace resource quotas.
etcd is a strongly consistent, distributed key-value store based on the Raft consensus algorithm. It is the single source of truth for all cluster state. Losing etcd without a backup means losing the cluster.
Raft basics
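Raft commits a write only after a majority (quorum) of members has acknowledged it, which is why etcd clusters are normally sized with an odd member count. A small sketch of the arithmetic, with hypothetical helper names:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum is still reachable."""
    return members - quorum(members)


# member count -> (quorum size, tolerated failures)
table = {n: (quorum(n), fault_tolerance(n)) for n in (1, 3, 4, 5, 7)}
```

Going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure, so the fourth member adds cost but no resilience.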
etcd storage layout in Kubernetes
# Keys are prefixed by resource path
/registry/pods/default/my-pod
/registry/deployments/production/my-app
/registry/services/endpoints/kube-system/kube-dns

# Read a key directly (requires etcdctl and certs)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/pods/default/my-pod | strings | head -50

# List all keys
etcdctl get / --prefix --keys-only
Backup and restore
# Snapshot backup
etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db

# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

# Restore (run before starting etcd)
etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-node1 \
  --initial-cluster=etcd-node1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls=https://10.0.0.1:2380
Compaction and defragmentation
# Get current revision
etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision'

# Compact to current revision (removes old revisions)
etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')

# Defragment (reclaims disk space after compaction)
etcdctl defrag
The scheduler watches for pods with spec.nodeName == "" and assigns them to nodes. Scheduling is a two-phase process: filtering (eliminate ineligible nodes) then scoring (rank remaining nodes).
Scheduling pipeline
New pod (nodeName="") detected
  │
  ▼
Filter plugins: eliminate nodes that cannot run the pod
  ├─ NodeUnschedulable (node has NoSchedule taint / unschedulable flag)
  ├─ NodeResourcesFit (insufficient CPU/memory)
  ├─ NodeAffinity (nodeSelector, nodeAffinity)
  ├─ TaintToleration (pod must tolerate node taints)
  ├─ PodTopologySpread (spread constraints)
  └─ VolumeBinding (PVCs that need specific node topology)
  │
  ▼
Score plugins: rank feasible nodes (0-100 each, weighted sum)
  ├─ NodeResourcesBalancedAllocation (prefer balanced CPU/mem usage)
  ├─ LeastAllocated (prefer nodes with most free resources)
  ├─ ImageLocality (prefer nodes with image already pulled)
  └─ InterPodAffinity (prefer nodes satisfying pod affinity)
  │
  ▼
Select highest-score node → write Binding object → kubelet picks it up
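The filter-then-score flow can be sketched in a few lines. This is a simplified model, not the scheduler's implementation: it uses a single resource-fit filter and a single least-allocated style score, and the `Node`, `Pod`, `fits`, `least_allocated_score`, and `schedule` names (and the example numbers) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: int   # millicores
    cpu_requested: int
    mem_capacity: int   # MiB
    mem_requested: int
    unschedulable: bool = False


@dataclass
class Pod:
    name: str
    cpu: int
    mem: int


def fits(pod: Pod, node: Node) -> bool:
    """Filter phase: drop cordoned nodes and nodes without enough free resources."""
    return (
        not node.unschedulable
        and node.cpu_requested + pod.cpu <= node.cpu_capacity
        and node.mem_requested + pod.mem <= node.mem_capacity
    )


def least_allocated_score(pod: Pod, node: Node) -> int:
    """Score phase: 0-100, higher when more capacity stays free after placement."""
    cpu_free = (node.cpu_capacity - node.cpu_requested - pod.cpu) / node.cpu_capacity
    mem_free = (node.mem_capacity - node.mem_requested - pod.mem) / node.mem_capacity
    return round(100 * (cpu_free + mem_free) / 2)


def schedule(pod: Pod, nodes: list):
    """Return the name of the best feasible node, or None if nothing fits."""
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: least_allocated_score(pod, n)).name


nodes = [
    Node("node-a", 4000, 3500, 8192, 1024),                    # too little CPU left
    Node("node-b", 4000, 1000, 8192, 4096),
    Node("node-c", 4000, 0, 8192, 0, unschedulable=True),      # cordoned
    Node("node-d", 8000, 2000, 16384, 2048),
]
pod = Pod("web", cpu=1000, mem=1024)
feasible_names = [n.name for n in nodes if fits(pod, n)]
chosen = schedule(pod, nodes)
```

The real scheduler runs many plugins per phase and combines weighted scores; the sketch only shows why filtering happens first (it shrinks the set that has to be scored).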
Taints and tolerations
# Add a taint to a node
kubectl taint nodes node1 dedicated=gpu:NoSchedule
kubectl taint nodes node1 dedicated=gpu:NoExecute

# Remove a taint
kubectl taint nodes node1 dedicated=gpu:NoSchedule-

# Pod toleration (YAML)
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
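A minimal sketch of how a toleration is matched against a taint, assuming the documented rules: an empty effect matches every effect, operator Exists ignores the value, and an empty key with Exists matches every taint. The `tolerates` and `untolerated` helpers are hypothetical names, and taints and tolerations are modelled as plain dicts.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if this single toleration covers this single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Operator Equal: key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def untolerated(taints: list, tolerations: list) -> list:
    """Taints that no toleration covers."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


node_taints = [
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
    {"key": "dedicated", "value": "gpu", "effect": "NoExecute"},
]
pod_tolerations = [
    {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"},
]
remaining = untolerated(node_taints, pod_tolerations)
```

With both taints from the commands above applied, the toleration shown covers only NoSchedule, so the NoExecute taint remains untolerated and the pod would not be allowed on the node.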
Pod affinity and anti-affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: [us-east-1a, us-east-1b]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app
      topologyKey: kubernetes.io/hostname   # one pod per host
Preemption
If a high-priority pod cannot be scheduled, the scheduler may evict lower-priority pods to make room. Pods with PriorityClass set are considered; the scheduler finds a node where evicting low-priority pods would free enough resources.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # or Never
A collection of reconciliation loops in one binary. Each controller watches specific resources, compares current state to desired state, and takes corrective action. Controllers use informers (shared cache + event handlers) and work queues to avoid overwhelming the API server.
Reconciliation loop pattern
loop:
    desired = read desired state from API object
    actual = observe current state (list pods, check nodes, …)
    diff = desired - actual
    if diff != empty:
        apply changes (create/update/delete resources)
    sleep resyncPeriod (or wait for watch event)
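One pass of the loop above can be made concrete for a ReplicaSet-like controller. This is a toy sketch: pods are plain strings in a list, and `reconcile` and the `pod-N` naming are hypothetical.

```python
import itertools

_counter = itertools.count(1)


def reconcile(desired_replicas: int, pods: list) -> list:
    """Drive the pod list toward desired_replicas; return the actions taken."""
    actions = []
    while len(pods) < desired_replicas:
        name = f"pod-{next(_counter)}"
        pods.append(name)
        actions.append(("create", name))
    while len(pods) > desired_replicas:
        actions.append(("delete", pods.pop()))
    return actions


pods = []
first = reconcile(3, pods)    # nothing exists yet: three creates
second = reconcile(3, pods)   # already converged: no actions
pods.remove("pod-2")          # simulate a pod disappearing
third = reconcile(3, pods)    # one replacement is created
fourth = reconcile(1, pods)   # scale down: two deletes
```

The second call returning no actions is the important property: reconciliation is idempotent, so it is safe to run on every watch event and on every resync.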
Key built-in controllers
| Controller | Watches | Action |
|---|---|---|
| ReplicaSet | ReplicaSet, Pod | Creates/deletes pods to match spec.replicas |
| Deployment | Deployment, ReplicaSet | Creates/scales ReplicaSets for rolling updates |
| StatefulSet | StatefulSet, Pod | Manages ordered pod creation with stable identities |
| DaemonSet | DaemonSet, Node, Pod | Ensures one pod per (selected) node |
| Job / CronJob | Job, Pod | Runs pods to completion; CronJob schedules Jobs |
| Node | Node | Marks nodes NotReady/Unknown when heartbeats stop and applies NoExecute taints; pods are evicted once their toleration for the taint expires (300s by default) |
| Endpoints | Service, Pod | Keeps Endpoints objects in sync with ready pod IPs |
| Namespace | Namespace | Cleans up resources in terminating namespaces |
Informer / work-queue internals
API Server watch stream
│ ADDED/MODIFIED/DELETED events
▼
Informer (per resource type, shared across controllers)
├─ ThreadSafeStore (in-memory cache — avoids API calls for reads)
└─ Event handlers → enqueue object key (namespace/name)
│
▼
Work Queue (rate-limited, deduplicating)
│
▼
Worker goroutine → Reconcile(key)
└─ reads from cache (lister)
└─ calls API server only for writes
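The deduplicating behaviour of the work queue can be sketched as follows. The class is loosely modelled on the dirty/processing bookkeeping used by client-go's workqueue, but it is a simplified, single-threaded illustration without rate limiting, and `WorkQueue` here is not the real API.

```python
from collections import deque


class WorkQueue:
    """Queue of object keys where each key is pending at most once."""

    def __init__(self):
        self._queue = deque()
        self._dirty = set()        # keys that need (re)processing
        self._processing = set()   # keys a worker currently holds

    def add(self, key: str) -> None:
        if key in self._dirty:
            return                 # already pending: collapse duplicate events
        self._dirty.add(key)
        if key not in self._processing:
            self._queue.append(key)

    def get(self) -> str:
        key = self._queue.popleft()
        self._dirty.discard(key)
        self._processing.add(key)
        return key

    def done(self, key: str) -> None:
        self._processing.discard(key)
        if key in self._dirty:     # changed again while being processed
            self._queue.append(key)

    def __len__(self) -> int:
        return len(self._queue)


q = WorkQueue()
for event_key in ["default/web", "default/web", "default/web", "default/db"]:
    q.add(event_key)
pending_after_burst = len(q)       # three events for one key collapse into one item

key = q.get()
q.add("default/web")               # new event arrives while the worker is busy
pending_while_busy = len(q)
q.done(key)
pending_after_done = len(q)
```

Because only keys are queued, the worker always reconciles against the latest cached state, however many events arrived for that object.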
The kubelet is the primary node agent. It watches for pods assigned to its node via the API server (and optionally static pod manifests), then drives the container runtime via CRI to converge actual pod state toward desired state.
Pod lifecycle (kubelet perspective)
Pod assigned to node (spec.nodeName set by scheduler)
  │
  ▼
Admit pod: check resources, enforce policies (CPU manager, topology)
  │
  ▼
Set up volumes and network (CSI for volumes, CNI calls for network)
  │
  ▼
Pull images (via CRI → container runtime → image service)
  │
  ▼
Create sandbox (pause container establishes network namespace)
  │
  ▼
Run init containers sequentially (each must succeed before next)
  │
  ▼
Run app containers (in parallel)
  ├─ Execute postStart hook (if defined)
  ├─ Start liveness / readiness / startup probes
  └─ Report status back to API server
  │
  ▼ (on termination)
Execute preStop hook (if defined) → send SIGTERM → SIGKILL once terminationGracePeriodSeconds expires
CRI (Container Runtime Interface)
The kubelet communicates with the container runtime via gRPC over a Unix socket. CRI separates the kubelet from runtime-specific code.
# Inspect containers via crictl (bypasses kubelet)
crictl ps                          # list running containers
crictl pods                        # list pods
crictl inspect <container-id>      # full container state
crictl logs <container-id>
crictl exec -it <container-id> sh

# Check which runtime is in use
kubectl get node -o wide           # CONTAINER-RUNTIME column
cat /var/lib/kubelet/config.yaml | grep containerRuntimeEndpoint
Node conditions and taints
| Condition | Meaning | Auto-taint added |
|---|---|---|
| Ready=True | kubelet healthy, node can accept pods | — |
| MemoryPressure | Node is low on memory | node.kubernetes.io/memory-pressure:NoSchedule |
| DiskPressure | Node disk is nearly full | node.kubernetes.io/disk-pressure:NoSchedule |
| PIDPressure | Too many processes on node | node.kubernetes.io/pid-pressure:NoSchedule |
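| Ready=False/Unknown | kubelet unhealthy (False) or out of contact with the control plane (Unknown) | node.kubernetes.io/not-ready:NoExecute (False) or node.kubernetes.io/unreachable:NoExecute (Unknown), after timeout |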
Static pods
# Static pod manifests (control plane components on kubeadm clusters)
ls /etc/kubernetes/manifests/
# kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml etcd.yaml

# kubelet watches this directory; changes take effect immediately
# Static pods are mirrored as read-only objects in the API server
kube-proxy implements the Service abstraction by programming kernel networking rules. It watches Services and EndpointSlices and translates ClusterIP VIPs into real pod IPs.
Modes
| Mode | Mechanism | Notes |
|---|---|---|
| iptables (default) | DNAT rules in iptables PREROUTING/OUTPUT chains. Selects pod randomly per connection. | O(n) rule lookup; performance degrades with many Services. |
| ipvs | Linux IPVS (LVS) in kernel netfilter. Hash-table lookup, O(1). Supports LB algorithms: rr, lc, sh, sed, nq. | Requires ipvs kernel modules. Better at scale. |
| nftables | nftables sets for lookup (alpha in Kubernetes 1.29, beta in 1.31). | Modern replacement for iptables mode. |
How iptables DNAT works for ClusterIP
# Inspect kube-proxy iptables rules
iptables -t nat -L KUBE-SERVICES -n --line-numbers
iptables -t nat -L KUBE-SVC-<hash> -n   # per-service chain
iptables -t nat -L KUBE-SEP-<hash> -n   # per-endpoint chain (DNAT)

# Example flow for ClusterIP 10.96.0.10:80 with 3 endpoints
# KUBE-SERVICES → KUBE-SVC-xxx → statistic --probability 0.33 → KUBE-SEP-1 (DNAT pod1:8080)
#                              → statistic --probability 0.50 → KUBE-SEP-2 (DNAT pod2:8080)
#                              → (unconditional fallthrough)  → KUBE-SEP-3 (DNAT pod3:8080)

# Check kube-proxy mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode
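Rules in a per-service chain are evaluated in order, so an even split across n endpoints needs a different probability on each rule: 1/n on the first, 1/(n-1) on the second, and an unconditional match on the last. The sketch below checks that arithmetic with exact fractions; the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(endpoints: int) -> list:
    """Per-rule match probability, in chain order, for an even split."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_shares(probabilities: list) -> list:
    """Fraction of all connections that each rule ends up handling."""
    shares = []
    remaining = Fraction(1)   # traffic that has not matched an earlier rule
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
shares = effective_shares(probs)
```

For three endpoints the rule probabilities are 1/3, 1/2, and 1, and each endpoint receives exactly one third of new connections.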
Service types
| Type | How it works |
|---|---|
| ClusterIP | Virtual IP reachable only within the cluster. kube-proxy programs DNAT rules on every node. |
| NodePort | Opens a port (30000–32767) on every node. Traffic → NodePort → ClusterIP → pod. External firewall rules needed. |
| LoadBalancer | Provisions cloud load balancer via cloud-controller-manager. Gets an external IP. Includes NodePort. |
| ExternalName | Returns a CNAME for an external DNS name. No proxying; DNS only. |
| Headless (clusterIP: None) | DNS returns individual pod IPs directly. Used by StatefulSets for stable DNS names per pod. |
Kubernetes networking follows three flat-network rules: every pod gets a unique IP, pods can communicate with any other pod without NAT, and nodes can communicate with pods without NAT. CNI plugins implement the pod network.
Kubernetes networking model
Node A (10.0.0.1) Node B (10.0.0.2)
├─ eth0: 10.0.0.1 ├─ eth0: 10.0.0.2
├─ cni0 bridge: 10.244.0.1/24 ├─ cni0 bridge: 10.244.1.1/24
│ ├─ veth → Pod A1: 10.244.0.2 │ ├─ veth → Pod B1: 10.244.1.2
│ └─ veth → Pod A2: 10.244.0.3 │ └─ veth → Pod B2: 10.244.1.3
│ │
└─── Overlay / BGP routes ────────────┘
(flannel VXLAN, Calico BGP, Cilium eBPF, …)
CNI plugin comparison
| Plugin | Dataplane | NetworkPolicy | Notes |
|---|---|---|---|
| Flannel | VXLAN overlay | No (needs Calico or Kube-router) | Simple; good for learning and small clusters. |
| Calico | BGP routes (no overlay) or VXLAN | Yes + extended policies | Popular in production; works well with BIRD BGP. |
| Cilium | eBPF (no kube-proxy needed) | Yes + L7 HTTP/gRPC | Best performance and observability. Replaces kube-proxy. |
| Weave | VXLAN + fast datapath | Yes | Easy setup; supports optional traffic encryption. |
DNS (CoreDNS)
# DNS name formats
my-svc.my-ns.svc.cluster.local          # ClusterIP service
10-244-0-2.my-ns.pod.cluster.local      # pod record: the pod IP with dots replaced by dashes
pod-0.my-svc.my-ns.svc.cluster.local    # StatefulSet pod via headless service

# Debug DNS from a pod
kubectl run -it --rm dnstest --image=busybox --restart=Never -- sh
/ # nslookup kubernetes.default.svc.cluster.local
/ # cat /etc/resolv.conf

# Check CoreDNS config
kubectl -n kube-system get configmap coredns -o yaml
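The three name formats are mechanical enough to build in code, which can help when generating connection strings or test fixtures. A small sketch with hypothetical helper names, assuming the default cluster domain cluster.local:

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod_ip: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """IP-based pod name: dots in the IP become dashes."""
    dashed_ip = pod_ip.replace(".", "-")
    return f"{dashed_ip}.{namespace}.pod.{cluster_domain}"


def statefulset_pod_fqdn(
    statefulset: str,
    ordinal: int,
    headless_service: str,
    namespace: str,
    cluster_domain: str = "cluster.local",
) -> str:
    """Stable per-pod name behind a headless Service."""
    return f"{statefulset}-{ordinal}.{headless_service}.{namespace}.svc.{cluster_domain}"


names = [
    service_fqdn("my-svc", "my-ns"),
    pod_fqdn("10.244.0.2", "my-ns"),
    statefulset_pod_fqdn("pod", 0, "my-svc", "my-ns"),
]
```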
NetworkPolicy
# Default deny all ingress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}   # select all pods
  policyTypes: [Ingress]
---
# Allow ingress only from app=frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
Kubernetes storage is abstracted through PersistentVolumes (PVs), PersistentVolumeClaims (PVCs), and StorageClasses. The Container Storage Interface (CSI) is the standard plugin mechanism for storage drivers.
PV / PVC lifecycle
StorageClass (provisioner, parameters, reclaimPolicy)
  │
  │ (dynamic provisioning)
  ▼
PersistentVolumeClaim created by user
  │   storageClassName: fast-ssd
  │   storage: 10Gi
  │   accessModes: [ReadWriteOnce]
  │
  ▼ (external-provisioner sidecar watches PVCs)
CSI driver called: CreateVolume RPC
  │
  ▼
PersistentVolume created (Bound to PVC)
  │
  ▼
Pod references PVC → kubelet calls NodeStageVolume + NodePublishVolume
  │
  ▼
Volume mounted at pod path

On PVC delete: ReclaimPolicy
  Retain → PV remains (manual cleanup)
  Delete → CSI DeleteVolume called, PV deleted
Access modes
| Mode | Abbreviation | Meaning |
|---|---|---|
| ReadWriteOnce | RWO | One node mounts read-write. Most block storage (EBS, PD, Azure Disk). |
| ReadOnlyMany | ROX | Many nodes mount read-only. |
| ReadWriteMany | RWX | Many nodes mount read-write. Requires NFS, CephFS, or similar. |
| ReadWriteOncePod | RWOP | Single pod mounts read-write (Kubernetes ≥ 1.22). |
CSI architecture
kubelet                        CSI driver pod
  │                              │
  │ NodeStageVolume (gRPC)       │  ← format + mount to staging path
  │─────────────────────────────►│
  │ NodePublishVolume            │  ← bind-mount staging → pod path
  │─────────────────────────────►│

External controller sidecar (runs alongside driver):
  provisioner → CreateVolume / DeleteVolume
  attacher    → ControllerPublishVolume / ControllerUnpublishVolume
  resizer     → ControllerExpandVolume
Volume types quick reference
emptyDir
Ephemeral dir in pod's node scratch space. Shared between containers in a pod. Deleted when pod dies.
hostPath
Mounts a node path into the pod. Avoid in production — ties pod to a specific node and risks host access.
configMap / secret
Projected as files or env vars. ConfigMap for config data, Secret for sensitive data (base64-encoded in etcd by default; encrypt at rest).
projected
Combines multiple volume sources (ServiceAccount token, ConfigMap, Secret, downwardAPI) into a single directory.
Kubernetes workload controllers manage sets of pods. Understanding how each controller updates pods is critical for operating stateful and stateless services safely.
Deployment rolling update internals
kubectl set image deployment/my-app container=image:v2
  │
  ▼
Deployment controller creates new ReplicaSet (RS-v2, replicas=0)
  │
  ▼
Scale RS-v2 up by 1 (maxSurge=1 → can go 1 above desired)
  │
  ▼
Wait for new pod Ready
  │
  ▼
Scale RS-v1 down by 1 (maxUnavailable=1 → can be 1 below desired)
  │
  ▼
Repeat until RS-v2 = desired, RS-v1 = 0
  │
  ▼
Old RS kept (scale=0) for rollback history (revisionHistoryLimit)
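maxSurge and maxUnavailable bound how far a rollout may drift from the desired replica count. Both accept an absolute number or a percentage; the sketch below assumes the documented rounding (maxSurge rounds up, maxUnavailable rounds down) and the documented 25% defaults. The helper names are hypothetical.

```python
def resolve(value, desired: int, round_up: bool) -> int:
    """Turn an int or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        scaled = desired * percent
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(desired: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Upper bound on total pods and lower bound on available pods during a rollout."""
    surge = resolve(max_surge, desired, round_up=True)
    unavailable = resolve(max_unavailable, desired, round_up=False)
    return {
        "max_total_pods": desired + surge,
        "min_available_pods": desired - unavailable,
    }


defaults_for_10 = rollout_bounds(10)
explicit_for_4 = rollout_bounds(4, max_surge=1, max_unavailable=1)
```

With 10 replicas and the defaults, a rollout may run up to 13 pods in total and must keep at least 8 available; with the absolute values from the diagram above and 4 replicas, the bounds are 5 and 3.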
StatefulSet guarantees
| Property | Behaviour |
|---|---|
| Stable network identity | Pod name is <name>-<ordinal>. DNS: <pod>.<headless-svc>.<ns>.svc.cluster.local. Survives rescheduling. |
| Stable storage | VolumeClaimTemplate creates a PVC per pod. PVC is not deleted when the pod is deleted. |
| Ordered creation | Pods created 0 → N−1. Each must be Running+Ready before the next is created. |
| Ordered deletion | Pods deleted N−1 → 0 (reverse order) by default (OrderedReady policy). |
| Parallel policy | podManagementPolicy: Parallel — create/delete all pods simultaneously (updates still ordered). |
DaemonSet scheduling
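The DaemonSet controller creates one pod for every eligible node. Older releases had the controller set spec.nodeName directly, bypassing the scheduler; since Kubernetes 1.12 the controller instead adds a node-affinity term that pins each pod to its target node and leaves the binding to the default scheduler. The controller also adds tolerations automatically (for example node.kubernetes.io/unschedulable:NoSchedule and node.kubernetes.io/not-ready:NoExecute), so DaemonSet pods can be placed on nodes that are marked unschedulable, and on control plane nodes when the pod template tolerates their taints.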
HorizontalPodAutoscaler internals
HPA controller (in controller-manager) runs every --horizontal-pod-autoscaler-sync-period (15s)
│
├─ Queries metrics-server (or custom/external metrics adapter)
│ currentMetricValue = avg CPU across pods
│
├─ desiredReplicas = ceil(currentReplicas × (currentValue / targetValue))
│
└─ Clamps desiredReplicas to [minReplicas, maxReplicas] and scales the Deployment / ReplicaSet if the count changed
Cooldown: --horizontal-pod-autoscaler-downscale-stabilization (5m default)
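The replica calculation above can be worked through directly. The sketch assumes a tolerance band of 0.1 around the target (the documented default for the controller's tolerance flag), inside which no scaling happens; that band is not part of the formula shown above, and the function name is hypothetical. Stabilization windows are not modelled.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Replica count for one metric, clamped to the configured bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas   # close enough to target: leave the scale alone
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(4, current_value=90, target_value=50, min_replicas=3, max_replicas=10)
no_change = desired_replicas(4, current_value=52, target_value=50, min_replicas=3, max_replicas=10)
scale_down = desired_replicas(10, current_value=10, target_value=50, min_replicas=3, max_replicas=10)
capped = desired_replicas(5, current_value=200, target_value=50, min_replicas=3, max_replicas=10)
```

At 90% average CPU against a 50% target, 4 replicas become ceil(4 × 1.8) = 8. The scale-down case computes 2 but is held at minReplicas = 3, and the last case computes 20 but is capped at maxReplicas = 10.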
Kubernetes security involves cluster-level (RBAC, admission, network policies) and workload-level (pod security, secrets management) controls.
RBAC model
# Role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding: bind Role to a ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
RBAC: Role vs ClusterRole
| Object | Scope | Use |
|---|---|---|
| Role | Namespace | Grant access to namespaced resources within one namespace. |
| ClusterRole | Cluster-wide | Grant access to cluster-scoped resources (Nodes, PVs) or to namespaced resources across all namespaces. |
| RoleBinding | Namespace | Binds a Role or ClusterRole to subjects within a namespace. |
| ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects cluster-wide. |
ServiceAccount token projection
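In current Kubernetes releases, pods receive a short-lived, audience-bound ServiceAccount token that the kubelet obtains through the TokenRequest API, rotates automatically, and mounts via the projected volume type, not via a Secret.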
Secrets encryption at rest
# /etc/kubernetes/enc/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: [secrets]
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}   # fallback: read unencrypted (for migration)

# Then restart kube-apiserver with:
#   --encryption-provider-config=/etc/kubernetes/enc/encryption-config.yaml

# Rewrite all secrets to encrypt them
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
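The `<base64-encoded-32-byte-key>` placeholder needs 32 random bytes, base64-encoded. One way to produce such a value, sketched with Python's standard library (the function name is hypothetical; any cryptographically secure random source works equally well):

```python
import base64
import secrets


def new_encryption_key() -> str:
    """Return 32 cryptographically random bytes, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


key = new_encryption_key()
decoded_length = len(base64.b64decode(key))
```

Treat the generated value like any other secret: keep it out of version control and restrict read access to the configuration file.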
Pod Security Admission (PSA)
| Level | Restrictions |
|---|---|
| privileged | No restrictions. Same as no policy. |
| baseline | Blocks most known privilege escalations: no privileged containers, no hostPID/hostNetwork, restricted capabilities. |
| restricted | Heavily restricted: must run as non-root, drop ALL capabilities, no privilege escalation, seccomp required. |
# Enable PSA on a namespace via labels
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
Inspect cluster internals
# Control plane component health
kubectl get componentstatuses              # deprecated since v1.19; prefer the health endpoints
kubectl get --raw /healthz

# Node internals
kubectl describe node <node>               # conditions, capacity, allocatable, events
kubectl top node                           # requires metrics-server
kubectl get events --sort-by=.metadata.creationTimestamp

# Pod scheduling
kubectl describe pod <pod>                 # Events: FailedScheduling, etc.
kubectl get pod -o wide                    # which node, IP
kubectl get pod <pod> -o jsonpath='{.spec.nodeName}'

# Watch reconciliation
kubectl get rs --watch                     # watch ReplicaSet convergence
kubectl rollout status deploy/my-app       # watch Deployment rollout
kubectl rollout history deploy/my-app
kubectl rollout undo deploy/my-app         # roll back
etcd
etcdctl endpoint health — check cluster
etcdctl snapshot save — backup
etcdctl get /registry/ --prefix --keys-only
Quorum: floor(n/2)+1 nodes needed for writes
API Server
kubectl get --raw /apis — list API groups
kubectl api-resources — all resource types
kubectl explain pod.spec — schema docs
kubectl auth can-i list pods --as user
Networking
iptables -t nat -L KUBE-SERVICES -n
kubectl -n kube-system logs -l k8s-app=kube-dns
kubectl exec -it pod -- curl svc:port
kubectl port-forward svc/my-svc 8080:80
Storage
kubectl get pv,pvc — PV/PVC status
kubectl describe pvc my-claim — binding events
kubectl get sc — StorageClasses
PVC stuck Pending: check StorageClass, CSI driver logs
Workloads
kubectl rollout status deploy/x
kubectl rollout undo deploy/x --to-revision=2
kubectl scale deploy/x --replicas=5
kubectl get hpa — autoscaler status
Security
kubectl auth can-i '*' '*' --all-namespaces
kubectl get rolebindings,clusterrolebindings -A
kubectl get secret <name> -o jsonpath='{.data.<key>}' | base64 -d
Check: pod-security.kubernetes.io/* labels on ns