docker run to a running process
Docker is a layered system. The CLI talks to dockerd via a REST API, dockerd delegates container lifecycle to containerd over gRPC, and containerd spawns an OCI runtime (runc) via a shim process that persists after runc exits.
┌─────────────────────────────────────────────────────────┐
│                       User Space                        │
│                                                         │
│   docker CLI ──REST──▶ dockerd                          │
│                           │                             │
│                gRPC (containerd.sock)                   │
│                           ▼                             │
│                      containerd                         │
│                           │  image pull / snapshots     │
│                           │                             │
│                 fork/exec ▼                             │
│                containerd-shim-runc-v2 ◀── stays alive  │
│                           │                             │
│                 fork/exec ▼  (exits after start)        │
│                          runc                           │
│                           │  clone(2) + pivot_root      │
│                           ▼                             │
│                container init (PID 1)                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
          │                  │                  │
     Namespaces         cgroups v2          OverlayFS
   (pid/net/mnt…)      (cpu/mem/io…)      (layers + rw)
dockerd
High-level daemon. Owns the Docker API, BuildKit, networking IPAM, volume plugins, image store, and Swarm orchestration. Does not directly run containers.
containerd
OCI-compliant container supervisor. Manages snapshots (image layers), pulls OCI images, and drives shim/runc for container start/stop/exec.
containerd-shim
One shim per container. Holds the container's stdio pipes and reports exit status. Allows containerd to restart without killing running containers.
runc
Lightweight CLI that reads an OCI bundle (config.json + rootfs), calls clone(2) with namespace flags, sets up cgroups, drops capabilities, then execves the container process.
dockerd is the Docker Engine daemon. It listens on a Unix socket (/var/run/docker.sock) by default and exposes the Docker Engine API. It is responsible for everything the user sees (images, containers, networks, volumes) but delegates actual container execution to containerd.
Responsibilities
Key Files & Sockets
| Path | Purpose |
|---|---|
| /var/run/docker.sock | Default Unix socket; CLI and SDK connect here |
| /var/lib/docker/ | Image layers, container state, volumes, networks |
| /var/lib/docker/containers/<id>/ | Per-container config, logs, hosts, resolv.conf |
| /etc/docker/daemon.json | Daemon configuration (storage driver, log driver, DNS…) |
| /run/containerd/containerd.sock | gRPC socket dockerd uses to talk to containerd |
daemon.json Examples
{
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": { "max-size": "20m", "max-file": "3" },
  "default-address-pools": [{ "base": "172.30.0.0/16", "size": 24 }],
  "dns": ["8.8.8.8"],
  "live-restore": true
}
With live-restore: true, dockerd reconnects to running shims on restart instead of killing containers, so containers keep running while dockerd is down. Requires containerd shim v2.
containerd is an industry-standard container runtime (CNCF graduated). Docker uses containerd as its low-level runtime. Kubernetes also talks to containerd directly via CRI. containerd manages snapshots, image distribution, and the full container lifecycle via plugins.
Subsystems
Snapshotter
Manages the filesystem layer stack. Default is overlayfs. Provides a view of merged layers for each container rootfs.
Content Store
Content-addressed blob storage for OCI manifests, configs, and layer tarballs under /var/lib/containerd/io.containerd.content.v1.content/.
Metadata Store
bbolt database tracking images, containers, snapshots, and leases. Located at /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db.
Task Service
Creates and manages running containers (tasks). Each task maps to a shim process. Handles exec, pause, resume, kill, and delete.
Interaction via ctr
# list containerd namespaces (docker uses "moby")
ctr namespaces list
# list images in the moby namespace
ctr -n moby images list
# list running tasks (containers)
ctr -n moby tasks list
# pull an image directly via containerd
ctr images pull docker.io/library/nginx:latest
# show snapshots for a container
ctr -n moby snapshots ls
Docker's containers live in containerd's moby namespace. Kubernetes uses k8s.io.
runc is the reference OCI runtime. It reads an OCI bundle (a directory with config.json and a rootfs/) and uses Linux syscalls to start the container. It exits once the container process is running. The shim stays alive to manage stdio and report the exit status back to containerd.
OCI Bundle Structure
/run/containerd/io.containerd.runtime.v2.task/moby/<id>/
├── config.json   # OCI runtime spec (namespaces, cgroups, caps, mounts…)
├── rootfs/       # merged OverlayFS mount (container filesystem)
├── init.pid      # PID of container init written by shim
└── log.json      # shim log
runc Startup Sequence
containerd-shim
  │
  └── fork/exec runc create <bundle>
        │
        ├── parse config.json
        ├── clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | …)
        │     └─ child enters new namespaces
        ├── mount /proc, /sys, /dev, bind mounts, volumes
        ├── pivot_root → container rootfs becomes /
        ├── apply cgroup limits (write to /sys/fs/cgroup/…)
        ├── drop capabilities
        ├── apply seccomp filter
        └── execve(entrypoint) → container PID 1
              │
              └── runc exits; shim holds stdio fds open
Inspecting the OCI Spec
# dump Docker's full container config (the input from which the OCI spec is generated)
docker inspect --format '{{json .}}' <container> | jq .
# read runc state directly (Docker typically keeps runc state under this root)
runc --root /run/docker/runtime-runc/moby state <container-id>
# show the OCI bundle config
jq . /run/containerd/io.containerd.runtime.v2.task/moby/<id>/config.json
End-to-end sequence from docker run nginx to a running process.
docker CLI
│ POST /containers/create (REST)
▼
dockerd
│ 1. resolve image (pull if missing via containerd)
│ 2. create container record (assign ID, copy config)
│ 3. set up networking (allocate IP, create veth pair, iptables rules)
│ 4. prepare OCI bundle (write config.json, set up OverlayFS mount)
│
│ NewContainer() gRPC → containerd
▼
containerd
│ 5. prepare snapshot (OverlayFS upperdir for this container)
│ 6. fork containerd-shim-runc-v2
│
│ shim.Create() → shim
▼
containerd-shim
│ 7. fork/exec runc create --bundle <path>
│
▼
runc
│ 8. clone(2) with namespace flags
│ 9. child: mount /proc /sys /dev and bind mounts, then pivot_root
│ 10. write cgroup limits under /sys/fs/cgroup/…
│ 11. drop capabilities, apply seccomp filter
│ 12. write PID to init.pid, signal parent
│ 13. execve(entrypoint) → container PID 1 starts
│ 14. runc exits
▼
containerd-shim (alive)
│ holds stdio pipes, waits for PID 1 exit
▼
container process (PID 1 inside namespace)
└── nginx, bash, your app…
Linux namespaces provide the isolation that makes containers feel like separate machines. runc calls clone(2) with the appropriate CLONE_NEW* flags to put the container process into new namespaces.
| Namespace | Flag | What it isolates | Docker default |
|---|---|---|---|
| pid | CLONE_NEWPID | Process ID space; container PID 1 is host PID N | Yes |
| net | CLONE_NEWNET | Network interfaces, routes, iptables, sockets | Yes (bridge mode) |
| mnt | CLONE_NEWNS | Filesystem mount points | Yes |
| uts | CLONE_NEWUTS | Hostname and domain name | Yes |
| ipc | CLONE_NEWIPC | SysV IPC, POSIX message queues | Yes |
| user | CLONE_NEWUSER | UID/GID mappings (rootless Docker) | Only rootless mode |
| cgroup | CLONE_NEWCGROUP | cgroup root view | Yes on cgroups v2 hosts (host namespace on v1) |
| time | CLONE_NEWTIME | System clocks (kernel 5.6+) | No |
Inspecting Namespaces
# list namespaces of a running container's PID
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
ls -la /proc/$PID/ns/
# enter the container's network namespace from the host
nsenter -t $PID -n ip addr
# enter all namespaces (like docker exec)
nsenter -t $PID --all -- bash
# show namespace IDs; containers sharing a net ns (--network container:X) report the same ID
lsns -p $PID
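Each entry under /proc/<pid>/ns/ is a symlink whose target looks like net:[4026531992], and two processes share a namespace when those inode numbers match. The sketch below compares two such link targets as plain strings. ns_inode and same_ns are hypothetical helper names, and the inode values are made up for illustration.

```shell
# Extract the inode number from a namespace link target such as "net:[4026531992]".
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9]*\)]$/\1/p'
}

# Succeed when both link targets parse and refer to the same namespace inode.
same_ns() {
    [ -n "$(ns_inode "$1")" ] && [ "$(ns_inode "$1")" = "$(ns_inode "$2")" ]
}

# Illustrative values only; real targets come from readlink /proc/<pid>/ns/<type>.
same_ns 'net:[4026531992]' 'net:[4026531992]' && echo "shared"
```

On a real host the arguments would come from readlink, for example same_ns "$(readlink /proc/1/ns/net)" "$(readlink /proc/$PID/ns/net)", which should succeed for a container started with --network host.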
--pid=host gives the container full visibility into all host processes. Avoid in production; it bypasses most process-level isolation.
Control groups limit and account for resource usage. runc creates a cgroup hierarchy for each container and writes limits before execing the container process. Modern distros use cgroups v2 (unified hierarchy); older systems use v1 (per-subsystem trees).
cgroups v2 Hierarchy
/sys/fs/cgroup/
└── system.slice/
└── docker-<container-id>.scope/
├── memory.max # hard memory limit
├── memory.high # soft limit (throttle before OOM)
├── cpu.max # "quota period" e.g. "50000 100000" = 50% CPU
├── cpu.weight # relative CPU shares
├── io.max # device read/write bps and iops limits
├── pids.max # max PIDs (prevents fork bombs)
└── cgroup.procs # PIDs in this cgroup
Common docker run Flags → cgroup Mappings
| docker run flag | cgroups v2 file | Effect |
|---|---|---|
| --memory 512m | memory.max | Hard limit; OOM kill if exceeded |
| --memory-reservation 256m | memory.low | Soft reservation; memory below this amount is reclaimed last |
| --cpus 1.5 | cpu.max | 150000 100000 (1.5 cores) |
| --cpu-shares 512 | cpu.weight | Relative weight (shares default to 1024 and are rescaled to the 1 to 10000 weight range) |
| --pids-limit 100 | pids.max | Max total threads/processes |
| --blkio-weight 500 | io.weight | Relative I/O weight |
| --device-read-bps | io.max | Per-device read bandwidth cap |
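The --cpus row is plain arithmetic: the quota is the CPU count multiplied by the period. A minimal sketch of that conversion follows, assuming the 100000 µs period used in the table. cpus_to_cpu_max is a hypothetical helper name, not a Docker command.

```shell
# Convert a --cpus value into the "quota period" pair stored in cpu.max.
# The period defaults to 100000 microseconds; pass a second argument to override it.
cpus_to_cpu_max() {
    awk -v cpus="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%d %d\n", int(cpus * period + 0.5), period }'
}

cpus_to_cpu_max 1.5   # prints: 150000 100000
cpus_to_cpu_max 0.5   # prints: 50000 100000
```

The + 0.5 rounds to the nearest microsecond so that fractional values such as 0.29 do not lose a unit to floating-point truncation.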
Inspecting cgroup Usage
# find the cgroup path for a container
CID=$(docker inspect -f '{{.Id}}' <container>)
cat /proc/$(docker inspect -f '{{.State.Pid}}' <container>)/cgroup
# read current memory usage (cgroups v2)
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.current
# read cpu usage stats
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.stat
# live resource stats
docker stats <container>
# check if cgroups v2 is active
stat -f -c '%T' /sys/fs/cgroup # "cgroup2fs" = v2, "tmpfs" = v1
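On a cgroups v2 host, /proc/<pid>/cgroup holds a single line of the form 0::<path>, and that path is relative to /sys/fs/cgroup. The sketch below pulls the path out of such content read from stdin, which avoids hard-coding the system.slice layout. cgroup2_path is a hypothetical helper name and the container ID abc123 is made up.

```shell
# Print the cgroups v2 path from /proc/<pid>/cgroup content supplied on stdin.
# Lines from v1 hierarchies (for example "12:memory:/docker/…") are ignored.
cgroup2_path() {
    sed -n 's/^0:://p'
}

# Sample input standing in for a real /proc/<pid>/cgroup file.
printf '0::/system.slice/docker-abc123.scope\n' | cgroup2_path
```

With a real container this might be used as cat "/sys/fs/cgroup$(cgroup2_path < /proc/$PID/cgroup)/memory.current", assuming the host mounts the unified hierarchy at /sys/fs/cgroup.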
When memory.max is hit, the kernel's OOM killer terminates a process in the cgroup. The container exits with code 137 (128 + 9, the signal number of SIGKILL). Set --oom-kill-disable only if you have host-level protection.
Linux capabilities split root privilege into discrete units. Docker drops most capabilities by default, retaining only those needed for typical workloads. runc writes the bounding, effective, and permitted sets from config.json before execing PID 1.
Default Capability Set (kept by Docker)
Managing Capabilities
# drop all capabilities, add back only what's needed
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
# run privileged (all capabilities + device access; avoid in prod)
docker run --privileged ...
# inspect effective capabilities of a running container process
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
grep CapEff /proc/$PID/status | awk '{print $2}' | xargs -I{} capsh --decode={}
# check at runtime inside container
docker exec <container> capsh --print
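When capsh is not installed, the CapEff value can be checked by hand: it is a hexadecimal bitmask in which bit N corresponds to capability number N from linux/capability.h (CAP_NET_BIND_SERVICE is 10, CAP_NET_RAW is 13, CAP_SYS_ADMIN is 21). The sketch below tests a single bit. cap_is_set is a hypothetical helper name, and the sample mask is one commonly reported for containers running with Docker's default capability set, used here as an assumption.

```shell
# Succeed when capability bit $2 is set in the hex mask $1 (no 0x prefix),
# as found in the CapEff, CapPrm, or CapBnd lines of /proc/<pid>/status.
cap_is_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

mask=00000000a80425fb   # sample mask, assumed to be a default-capability container
cap_is_set "$mask" 13 && echo "CAP_NET_RAW present"
cap_is_set "$mask" 21 || echo "CAP_SYS_ADMIN absent"
```

Running the same check after --cap-drop NET_RAW should show bit 13 cleared.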
NET_RAW allows raw socket creation and ICMP manipulation. Drop it (--cap-drop NET_RAW) if the container doesn't need ping or network sniffing.
Docker applies a default seccomp profile that blocks ~44 syscalls known to be dangerous in containers (e.g. keyctl, ptrace, mount, reboot). The profile is a JSON allowlist/denylist translated by runc into a BPF filter attached to the container process.
Seccomp Profile Modes
| Flag | Behaviour |
|---|---|
| --security-opt seccomp=default | Docker default profile (recommended) |
| --security-opt seccomp=unconfined | Disable seccomp entirely; all syscalls allowed |
| --security-opt seccomp=/path/profile.json | Custom profile |
| --privileged | Disables seccomp, AppArmor, and drops no capabilities |
Debugging Blocked Syscalls
# trace syscalls to find what's being blocked (EPERM / ENOSYS)
# runs under the default profile so blocked calls fail visibly; assumes strace is installed in <image>
docker run --rm --cap-add SYS_PTRACE <image> \
  strace -f myapp 2>&1 | grep -E "EPERM|ENOSYS"
# view Docker's default seccomp profile
curl -s https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json | jq '.syscalls[].names'
Docker also applies an AppArmor profile (docker-default) on Ubuntu/Debian systems. SELinux labels are used on RHEL/Fedora. These operate independently of seccomp and add MAC (mandatory access control) on top of DAC.
overlay2 is Docker's default (and preferred) storage driver. OverlayFS is a union filesystem that stacks read-only image layers (lowerdir) under a single writable layer (upperdir). Reads are served from the topmost layer that has the file; writes go to the upper layer via copy-on-write.
Mount Structure
Container view (merged)
─────────────────────────────────
/etc/nginx/nginx.conf ◀── from upperdir (modified)
/usr/sbin/nginx ◀── from lowerdir layer 3 (image)
/lib/x86_64-linux-gnu/ ◀── from lowerdir layer 1 (base)
OverlayFS mount:
lowerdir = layer3:layer2:layer1 (read-only, :-separated, top-to-bottom)
upperdir = /var/lib/docker/overlay2/<id>/diff (read-write)
workdir = /var/lib/docker/overlay2/<id>/work (kernel internal)
merged = /var/lib/docker/overlay2/<id>/merged (container rootfs)
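The four directories above map directly onto the options of an overlay mount. The sketch below assembles that option string, with lower layers joined by colons in top-to-bottom order. overlay_opts is a hypothetical helper name and the paths in the example are placeholders, not real Docker layer directories.

```shell
# Build the -o option string for an overlay mount.
# Arguments: upperdir, workdir, then the lower layers from top to bottom.
overlay_opts() (
    upper=$1
    work=$2
    shift 2
    lower=$(IFS=:; printf '%s' "$*")   # join remaining arguments with ":"
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
)

overlay_opts /tmp/upper /tmp/work /tmp/layer3 /tmp/layer2 /tmp/layer1
```

The resulting string would be passed as root to mount -t overlay overlay -o "<options>" <merged-dir>, which is roughly what the storage driver does when it prepares a container rootfs.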
Copy-on-Write (CoW) Mechanics
Read
Kernel checks upperdir first. If not found, falls through each lowerdir in order. Only the first matching file is returned — no data is copied.
Write (new file)
New file created directly in upperdir. No interaction with lowerdir layers.
Write (existing file)
Kernel copies the entire file from lowerdir into upperdir (copy-up), then modifies it in-place. First write can be slow for large files.
Delete
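A whiteout is created in upperdir: a character device with device number 0/0 that carries the deleted file's name. In image layer tarballs the same deletion is recorded as a .wh.<filename> entry. The kernel masks the file from lowerdir when building the merged view.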
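The read path described above can be imitated in user space: search the layers from the top and return the first hit. The sketch below does that with throwaway directories standing in for upperdir and two lowerdirs. overlay_resolve is a hypothetical helper, and it deliberately ignores whiteouts and opaque directories, which the kernel also honours.

```shell
# Resolve a relative path against layer directories listed top first
# (upperdir, then lowerdirs top to bottom). Prints the first match.
overlay_resolve() (
    file=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$file" ]; then
            printf '%s\n' "$layer/$file"
            exit 0
        fi
    done
    exit 1
)

# Throwaway directories standing in for real layers.
demo=$(mktemp -d)
mkdir -p "$demo/upper/etc" "$demo/layer2/etc" "$demo/layer1/etc" "$demo/layer1/bin"
echo base     > "$demo/layer1/etc/app.conf"
echo modified > "$demo/upper/etc/app.conf"
echo tool     > "$demo/layer1/bin/tool"

overlay_resolve etc/app.conf "$demo/upper" "$demo/layer2" "$demo/layer1"   # upper copy wins
overlay_resolve bin/tool     "$demo/upper" "$demo/layer2" "$demo/layer1"   # falls through to layer1
```

The modified app.conf in the upper directory hides the base copy without touching it, which is the state a copy-up leaves behind.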
Inspecting OverlayFS Mounts
# show overlay mount options for a running container
docker inspect -f '{{json .GraphDriver}}' <container> | jq .
# same info from the host mount table
mount | grep overlay
# look at the writable layer diff (files changed by the container)
UPPER=$(docker inspect -f '{{.GraphDriver.Data.UpperDir}}' <container>)
ls "$UPPER"
# see all image layers for an image
docker inspect nginx | jq '.[].RootFS.Layers'
# layer directories on disk
ls /var/lib/docker/overlay2/
Use volumes (-v) for write-heavy workloads to avoid this copy-up overhead.
Docker images are built from ordered, immutable layers stored as compressed tarballs in a content-addressed store. Each RUN, COPY, or ADD instruction in a Dockerfile creates a new layer. Layers are shared across images and containers: a base layer pulled once is reused by all images that build on it.
OCI Image Layout
OCI Image Index (manifest list)
└── Manifest (per platform)
├── Config JSON { "Cmd": [...], "Env": [...], "RootFS": { "diff_ids": [...] } }
└── Layers [] [ sha256:abc… (tar.gz), sha256:def… (tar.gz), … ]
▲
content-addressed blobs
/var/lib/docker/image/overlay2/layerdb/sha256/<digest>/
Layer Caching in Builds
# Bad: COPY invalidates cache early, re-runs npm install on every code change
WORKDIR /app
COPY . /app
RUN npm install
# Good: copy dependency manifests first, code second
WORKDIR /app
COPY package.json package-lock.json /app/
# cached as long as package*.json are unchanged
RUN npm install
# invalidated only when source changes
COPY . /app
Useful Layer Commands
# show layer history and sizes
docker history nginx
# detailed layer info
docker inspect nginx | jq '.[].RootFS'
# export a container filesystem (flattened, no layers)
docker export <container> | tar -tv
# save image with all layers as a tar archive
docker save nginx | tar -tv
# show total disk usage by images/containers/volumes
docker system df -v
Three mount types bypass OverlayFS for persistent or shared data. Volumes are managed by Docker, bind mounts expose host paths directly, and tmpfs mounts live in memory.
| Type | Source | Use case | Persists? |
|---|---|---|---|
| volume | /var/lib/docker/volumes/<name>/_data | Databases, persistent app data | Yes, until docker volume rm |
| bind | Any host path | Dev workflows, config injection | Yes (host owns the data) |
| tmpfs | RAM | Secrets, temp scratch space | No; lost on container stop |
Mount Implementation
All three are implemented as entries in the mounts array of the OCI config. runc calls mount(2) inside the container's mount namespace, making the source visible at the destination path. Volumes and bind mounts go in before pivot_root.
# named volume
docker run -v mydata:/var/lib/postgresql/data postgres
# bind mount (absolute host path)
docker run -v /host/config:/etc/app/config:ro myapp
# tmpfs for secrets
docker run --tmpfs /run/secrets:rw,noexec,nosuid,size=64m myapp
# inspect mount details
docker inspect -f '{{json .Mounts}}' <container> | jq .
# list volumes
docker volume ls
docker volume inspect mydata
Third-party volume plugins are selected with the --volume-driver flag. The plugin implements the Docker Volume API and manages the actual storage backend.
Docker networking is implemented through pluggable drivers. Each driver creates a different network topology. The driver sets up the container's network namespace, creates virtual interfaces, assigns IPs, and configures routing before the container starts.
| Driver | Topology | Use case |
|---|---|---|
| bridge | Software bridge on host, NAT to outside | Default; single-host container communication |
| host | Shares host network namespace | Performance-critical workloads; no isolation |
| overlay | VXLAN tunnel across hosts | Docker Swarm, multi-host networking |
| macvlan | Container has a real MAC on the LAN | Legacy apps needing L2 adjacency |
| ipvlan | Shares host MAC, unique IP per container | Environments restricting MAC proliferation |
| none | No networking; loopback only | Batch jobs, security sandboxing |
Bridge is the default network mode. Docker creates a Linux bridge (docker0 for the default network, or a named bridge for user-defined networks). Each container gets a veth pair: one end in the container's net namespace, the other attached to the bridge.
Host network namespace
┌──────────────────────────────────────────────────────┐
│                                                      │
│     eth0 ──── docker0 (bridge 172.17.0.1/16)         │
│                  │                                   │
│            ┌─────┴──────┐                            │
│          veth0a      veth0b …(one pair per container)│
│            │                                         │
└────────────┼─────────────────────────────────────────┘
             │ (peer end inside container net namespace)
Container net namespace
┌────────────────────────────────────────┐
│ eth0 172.17.0.2/16                     │
│ lo   127.0.0.1/8                       │
│ default gw: 172.17.0.1 (docker0)       │
└────────────────────────────────────────┘
User-Defined Bridge vs Default
Default bridge (docker0)
No DNS resolution between containers. Must use --link (legacy) or IP addresses. All containers share one subnet.
User-defined bridge
Automatic DNS: containers resolve each other by name. Isolated subnet per network. Preferred for all new deployments.
# create a user-defined bridge network
docker network create --driver bridge --subnet 192.168.10.0/24 mynet
# attach containers; they resolve each other by name
docker run -d --name db --network mynet postgres
docker run -d --name app --network mynet myapp
# app can reach db at hostname "db"
# inspect network
docker network inspect mynet
# see the bridge on the host
ip link show type bridge
brctl show
veth (virtual Ethernet) pairs are a kernel primitive: a packet sent into one end exits the other. Docker uses them to connect the container's network namespace to the host bridge. iptables handles NAT, port forwarding, and inter-network policy.
veth Pair Lifecycle
# what dockerd/containerd do under the hood (simplified)
ip link add veth0a type veth peer name veth0b   # create pair
ip link set veth0b netns <container-netns>      # move peer into container ns
ip link set veth0a master docker0               # attach host end to bridge
ip link set veth0a up
# inside the container namespace
ip link set veth0b name eth0
ip addr add 172.17.0.2/16 dev eth0
ip link set eth0 up
ip route add default via 172.17.0.1
iptables Rules Added by Docker
# view Docker's iptables rules
iptables -t nat -L -n --line-numbers
iptables -L DOCKER -n --line-numbers
# representative rules, shown in iptables-save format (not commands to run)
# MASQUERADE rule: outbound container traffic → NAT to host IP
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
# DNAT rule for port mapping (-p 8080:80); matches traffic that does not arrive from the bridge
-A DOCKER ! -i docker0 -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80
# DOCKER-ISOLATION chain blocks inter-bridge traffic
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
Tracing a Port Mapping
# incoming packet to host port 8080 → container:80
# Host:8080 ──DNAT──▶ 172.17.0.2:80 ──veth──▶ container eth0:80
# verify port mapping
docker port <container>
iptables -t nat -L DOCKER -n | grep 8080
Docker manages iptables rules directly. On systems that have migrated to nftables, the iptables-nft compatibility layer is used. Do not manually flush Docker's chains (iptables -F); doing so breaks all container networking until Docker is restarted.
Containers on user-defined networks get automatic DNS. Docker runs an embedded DNS resolver at 127.0.0.11 inside each container's network namespace. It resolves container names, network aliases, and service names (in Swarm), then forwards unknown queries to the host's upstream DNS.
How It Works
Container
└── /etc/resolv.conf
nameserver 127.0.0.11 ◀── Docker embedded resolver
options ndots:0
Container makes DNS query for "db"
│
▼
127.0.0.11 (embedded resolver, inside net ns)
│
├── name matches a container on the same network? → return its IP
├── name matches a network alias? → return aliased IP
├── name matches a Swarm service? → return VIP
└── unknown → forward to host resolver (/etc/resolv.conf on host)
DNS Commands
# verify container DNS config
docker exec <container> cat /etc/resolv.conf
# resolve a peer container name from inside
docker exec app nslookup db
# assign a network alias (multiple names for one container)
docker run --network mynet --network-alias cache redis
# override DNS server for a container
docker run --dns 1.1.1.1 myapp
# set search domains
docker run --dns-search internal.corp myapp
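Whether a container uses the embedded resolver can be read straight from its resolv.conf: the nameserver line points at 127.0.0.11. The sketch below checks resolv.conf content supplied on stdin. uses_embedded_dns is a hypothetical helper name, not part of Docker.

```shell
# Succeed when the resolv.conf content on stdin names Docker's embedded
# resolver (127.0.0.11) as a nameserver.
uses_embedded_dns() {
    grep -Eq '^nameserver[[:space:]]+127\.0\.0\.11([[:space:]]|$)'
}

# Sample content matching what a container on a user-defined network sees.
printf 'nameserver 127.0.0.11\noptions ndots:0\n' | uses_embedded_dns \
    && echo "embedded resolver in use"
```

Against a live container the input would come from docker exec <container> cat /etc/resolv.conf piped into the function.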
Only user-defined networks use the embedded resolver at 127.0.0.11. Containers on docker0 (default bridge) use the host's /etc/resolv.conf and cannot resolve each other by name.
When bridge NAT overhead is unacceptable or a container needs a presence directly on the LAN, host and macvlan modes bypass the bridge entirely.
--network host
Container shares the host's network namespace. No veth, no NAT, no port mapping needed. Container binds directly to host interfaces. Loses all network isolation.
macvlan
Kernel creates a virtual NIC with its own MAC attached to a host interface. Container appears as a distinct host on the physical LAN. Requires promiscuous mode on the NIC.
# host networking
docker run --network host nginx
# nginx binds to host :80 directly; no port mapping needed
# macvlan network (container gets a real LAN IP)
docker network create \
  --driver macvlan \
  --subnet 192.168.1.0/24 \
  --gateway 192.168.1.1 \
  -o parent=eth0 \
  macnet
docker run --network macnet --ip 192.168.1.50 nginx
--network host has no effect on Docker Desktop (Mac/Windows) because containers run in a Linux VM. The container shares the VM's network namespace, not the macOS/Windows host.
Quick reference for inspecting containers, images, networks, and the low-level daemon state from the host.
## Container internals
docker inspect <container>                              # full JSON config + state
docker inspect -f '{{.State.Pid}}' <container>          # host PID of container init
docker inspect -f '{{json .HostConfig}}' <c> | jq .     # resource limits
docker top <container>                                  # processes inside
docker stats <container>                                # live cpu/mem/io/net
docker diff <container>                                 # changed files (upperdir diff)
## Filesystem
docker inspect -f '{{.GraphDriver.Data.MergedDir}}' <c> # merged view
docker inspect -f '{{.GraphDriver.Data.UpperDir}}' <c>  # writable layer
## Namespaces from host
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
ls -la /proc/$PID/ns/
nsenter -t $PID -n -- ip addr                           # host-side net namespace entry
## cgroups
CID=$(docker inspect -f '{{.Id}}' <container>)
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.stat
## containerd / runc state (Docker typically keeps runc state under /run/docker/runtime-runc/moby)
ctr -n moby tasks list
runc --root /run/docker/runtime-runc/moby state $CID
runc --root /run/docker/runtime-runc/moby list
| Goal | docker run flag | cgroups v2 file |
|---|---|---|
| Hard memory cap | --memory 1g | memory.max |
| Soft memory limit | --memory-reservation 512m | memory.low |
| Disable OOM kill | --oom-kill-disable | none on v2 (v1 only: memory.oom_control) |
| CPU quota | --cpus 0.5 | cpu.max 50000 100000 |
| CPU relative weight | --cpu-shares 512 | cpu.weight |
| CPU pin | --cpuset-cpus 0,1 | cpuset.cpus |
| Max processes | --pids-limit 200 | pids.max |
| Block I/O read bps | --device-read-bps /dev/sda:10mb | io.max |
| Block I/O weight | --blkio-weight 300 | io.weight |
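The memory rows store plain byte counts, so --memory 1g ends up as 1073741824 in memory.max. A minimal sketch of that conversion follows, assuming a single-letter suffix (b, k, m, g) interpreted as powers of 1024. size_to_bytes is a hypothetical helper and does not handle two-letter forms such as 10mb.

```shell
# Convert a size such as 512m or 1g into bytes (suffixes are powers of 1024).
size_to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                           # leading numeric part
        u = tolower(substr(s, length(s)))   # last character, used as the unit
        m = 1
        if (u == "k") m = 1024
        else if (u == "m") m = 1024 ^ 2
        else if (u == "g") m = 1024 ^ 3
        printf "%.0f\n", n * m
    }'
}

size_to_bytes 1g     # prints: 1073741824
size_to_bytes 512m   # prints: 536870912
```

Comparing the result with the content of memory.max for a running container is a quick way to confirm that a limit was applied.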
Docker supports multiple union/copy-on-write storage drivers. overlay2 is the default and recommended on all modern Linux kernels.
| Driver | Kernel requirement | Status | Notes |
|---|---|---|---|
| overlay2 | kernel ≥ 4.0, xfs/ext4 | Recommended | Uses OverlayFS; best performance and stability |
| fuse-overlayfs | FUSE in kernel | Rootless only | Used by rootless Docker when overlayfs is unavailable |
| btrfs | btrfs filesystem | Supported | Subvolumes for layers; built-in snapshots; complex |
| zfs | ZFS on Linux | Supported | Dataset per layer; good for ZFS-centric hosts |
| devicemapper | device-mapper | Deprecated | Direct-lvm mode requires setup; avoid on new installs |
| vfs | Any | Testing only | No CoW; full copy per layer; very slow |
# check active storage driver
docker info | grep "Storage Driver"
# check backing filesystem
docker info | grep "Backing Filesystem"
# switch to overlay2 in daemon.json
# { "storage-driver": "overlay2" }
# then: systemctl restart docker (images and containers created under the old driver are no longer visible)