Docker Internals Guide

Docker Internals

dockerd, containerd, runc, cgroups, namespaces, OverlayFS, and container networking — from docker run to a running process

CLI / API
│
├── dockerd ──────────────────── REST API (:2375 / unix socket)
│     │   BuildKit, image management, volume/network lifecycle
│     │
│     └── containerd ──────────── gRPC (containerd.sock)
│           │   snapshot management, image pull, task lifecycle
│           │
│           └── containerd-shim-runc-v2
│                 │   keeps container alive if containerd restarts
│                 └── runc ────── OCI runtime: clone(2), cgroups, pivot_root
│
├── Kernel Primitives
│     ├── Namespaces (pid, net, mnt, uts, ipc, user, cgroup)
│     ├── cgroups v2 (cpu, memory, io, pids, devices)
│     ├── Capabilities (drop/add per container)
│     └── Seccomp / AppArmor / SELinux
│
├── Filesystem
│     ├── OverlayFS (lowerdir=image layers, upperdir=writable, workdir)
│     ├── Image Layers (content-addressed, OCI, copy-on-write)
│     └── Volumes / Bind Mounts / tmpfs
│
└── Networking
      ├── bridge (docker0, veth pairs, NAT via iptables)
      ├── host (share host network namespace)
      ├── overlay (multi-host, VXLAN, used by Swarm)
      ├── macvlan / ipvlan
      └── DNS (embedded resolver, /etc/resolv.conf rewrite)

Architecture Component Map

Docker is a layered system. The CLI talks to dockerd via a REST API, dockerd delegates container lifecycle to containerd over gRPC, and containerd spawns an OCI runtime (runc) via a shim process that persists after runc exits.

ASCII

┌─────────────────────────────────────────────────────────┐
│                     User Space                          │
│                                                         │
│  docker CLI ──REST──▶ dockerd                           │
│                          │                              │
│                       gRPC (containerd.sock)            │
│                          ▼                              │
│                      containerd                         │
│                          │   image pull / snapshots     │
│                          │                              │
│              fork/exec   ▼                              │
│           containerd-shim-runc-v2  ◀── stays alive      │
│                          │                              │
│               fork/exec  ▼  (exits after start)         │
│                        runc                             │
│                          │  clone(2) + pivot_root       │
│                          ▼                              │
│                   container init (PID 1)                │
│                                                         │
└─────────────────────────────────────────────────────────┘
         │              │               │
    Namespaces       cgroups v2      OverlayFS
  (pid/net/mnt…)  (cpu/mem/io…)   (layers + rw)

dockerd

High-level daemon. Owns the Docker API, BuildKit, networking IPAM, volume plugins, image store, and Swarm orchestration. Does not directly run containers.

containerd

OCI-compliant container supervisor. Manages snapshots (image layers), pulls OCI images, and drives shim/runc for container start/stop/exec.

containerd-shim

One shim per container. Holds the container's stdio pipes and reports exit status. Allows containerd to restart without killing running containers.

runc

Lightweight CLI that reads an OCI bundle (config.json + rootfs), calls clone(2) with namespace flags, sets up cgroups, drops capabilities, then execves the container process.

Daemon dockerd

The Docker Engine daemon. It listens on a Unix socket (/var/run/docker.sock) by default and exposes the Docker Engine API. It is responsible for everything the user sees — images, containers, networks, volumes — but delegates actual container execution to containerd.

Responsibilities

REST API server, image management, BuildKit integration, network IPAM, volume plugins, Swarm mode, event stream

Key Files & Sockets

Path	Purpose
`/var/run/docker.sock`	Default Unix socket — CLI and SDK connect here
`/var/lib/docker/`	Image layers, container state, volumes, networks
`/var/lib/docker/containers/<id>/`	Per-container config, logs, hosts, resolv.conf
`/etc/docker/daemon.json`	Daemon configuration (storage driver, log driver, DNS…)
`/run/containerd/containerd.sock`	gRPC socket dockerd uses to talk to containerd

daemon.json Examples

JSON

{
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": { "max-size": "20m", "max-file": "3" },
  "default-address-pools": [{ "base": "172.30.0.0/16", "size": 24 }],
  "dns": ["8.8.8.8"],
  "live-restore": true
}

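live-restore: With "live-restore": true, containers keep running while dockerd is stopped, and dockerd reconnects to their shims when it restarts instead of killing them. Requires containerd shim v2. Note that daemon.json is strict JSON and does not accept comments.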

Runtime containerd

An industry-standard container runtime (CNCF graduated). Docker uses containerd as its low-level runtime. Kubernetes also talks to containerd directly via CRI. containerd manages snapshots, image distribution, and the full container lifecycle via plugins.

Subsystems

Snapshotter

Manages the filesystem layer stack. Default is overlayfs. Provides a view of merged layers for each container rootfs.

Content Store

Content-addressed blob storage for OCI manifests, configs, and layer tarballs under /var/lib/containerd/io.containerd.content.v1.content/.

Metadata Store

bbolt database tracking images, containers, snapshots, and leases. Located at /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db.

Task Service

Creates and manages running containers (tasks). Each task maps to a shim process. Handles exec, pause, resume, kill, and delete.

Interaction via ctr

shell

# list containerd namespaces (docker uses "moby")
ctr namespaces list

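# list images in the moby namespace (often empty: unless the containerd image
# store is enabled, dockerd keeps images in its own store under /var/lib/docker)
ctr -n moby images list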

# list running tasks (containers)
ctr -n moby tasks list

# pull an image directly via containerd
ctr images pull docker.io/library/nginx:latest

# show snapshots for a container
ctr -n moby snapshots ls

Namespace isolation: containerd uses its own namespaces (not kernel namespaces) to partition clients. Docker stores everything in the moby namespace. Kubernetes uses k8s.io.

Runtime runc & containerd-shim

runc is the reference OCI runtime. It reads an OCI bundle — a directory with config.json and a rootfs/ — and uses Linux syscalls to start the container. It exits once the container process is running. The shim stays alive to manage stdio and report the exit status back to containerd.

OCI Bundle Structure

shell

/run/containerd/io.containerd.runtime.v2.task/moby/<id>/
  ├── config.json     # OCI runtime spec (namespaces, cgroups, caps, mounts…)
  ├── rootfs/         # merged OverlayFS mount (container filesystem)
  ├── init.pid        # PID of container init written by shim
  └── log.json        # shim log

runc Startup Sequence

ASCII

containerd-shim
  │
  └── fork/exec runc create <bundle>
        │
        ├── parse config.json
        ├── clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | …)
        │     └─ child enters new namespaces
        ├── mount /proc, /sys, /dev, bind mounts, volumes (under the new rootfs)
        ├── pivot_root  → container rootfs becomes /
        ├── apply cgroup limits  (write to /sys/fs/cgroup/…)
        ├── drop capabilities
        ├── apply seccomp filter
        └── execve(entrypoint)  → container PID 1
              │
              └── runc exits; shim holds stdio fds open

Inspecting the OCI Spec

shell

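# dump Docker's view of a container (config and state; this is not the OCI spec itself)
docker inspect --format '{{json .}}' <container> | jq .

# read runc state directly (Docker keeps runc state under its own root directory)
runc --root /run/docker/runtime-runc/moby state <container-id>

# show the OCI runtime spec that runc actually consumed
jq . /run/containerd/io.containerd.runtime.v2.task/moby/<id>/config.json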

Architecture docker run Flow

End-to-end sequence from docker run nginx to a running process.

ASCII

 docker CLI
    │  POST /containers/create  (REST)
    ▼
 dockerd
    │  1. resolve image (pull if missing via containerd)
    │  2. create container record (assign ID, copy config)
    │  3. set up networking (allocate IP, create veth pair, iptables rules)
    │  4. prepare OCI bundle (write config.json, set up OverlayFS mount)
    │
    │  NewContainer() gRPC → containerd
    ▼
 containerd
    │  5. prepare snapshot (OverlayFS upperdir for this container)
    │  6. fork containerd-shim-runc-v2
    │
    │  shim.Create() → shim
    ▼
 containerd-shim
    │  7. fork/exec runc create --bundle <path>
    │
    ▼
 runc
    │  8.  clone(2) with namespace flags
    │  9.  child: mount /proc /sys /dev and bind mounts under rootfs, then pivot_root
    │  10. write cgroup limits under /sys/fs/cgroup/ (exact path depends on the cgroup driver)
    │  11. drop capabilities, apply seccomp filter
    │  12. write PID to init.pid, signal parent
    │  13. execve(entrypoint) → container PID 1 starts
    │  14. runc exits
    ▼
 containerd-shim (alive)
    │  holds stdio pipes, waits for PID 1 exit
    ▼
 container process (PID 1 inside namespace)
    └── nginx, bash, your app…

Why the shim? If containerd crashes and restarts, running containers are unaffected because the shim (not containerd) holds the stdio fds and calls waitpid on PID 1. containerd re-attaches to existing shims on startup.

Kernel Namespaces

Linux namespaces provide the isolation that makes containers feel like separate machines. runc calls clone(2) with the appropriate CLONE_NEW* flags to put the container process into new namespaces.

Namespace	Flag	What it isolates	Docker default
`pid`	`CLONE_NEWPID`	Process ID space — container PID 1 is host PID N	Yes
`net`	`CLONE_NEWNET`	Network interfaces, routes, iptables, sockets	Yes (bridge mode)
`mnt`	`CLONE_NEWNS`	Filesystem mount points	Yes
`uts`	`CLONE_NEWUTS`	Hostname and domain name	Yes
`ipc`	`CLONE_NEWIPC`	SysV IPC, POSIX message queues	Yes
`user`	`CLONE_NEWUSER`	UID/GID mappings (rootless Docker)	Only rootless mode
`cgroup`	`CLONE_NEWCGROUP`	cgroup root view	Yes on cgroups v2 hosts; host namespace on v1
`time`	`CLONE_NEWTIME`	System clocks (kernel 5.6+)	No

Inspecting Namespaces

shell

# list namespaces of a running container's PID
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
ls -la /proc/$PID/ns/

# enter the container's network namespace from the host
nsenter -t $PID -n ip addr

# enter all namespaces (like docker exec)
nsenter -t $PID --all -- bash

# show namespace IDs; containers sharing a net ns (--network container:X) report the same net ID
lsns -p $PID
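
Whether two processes share a namespace can also be checked by comparing the namespace links under /proc directly. The sketch below is illustrative only: same_ns is a hypothetical helper name, and it assumes a Linux host where /proc/<pid>/ns/ is readable for both processes (for container processes that usually means running as root).

```shell
# same_ns PID_A PID_B NS_TYPE
# Succeeds when both processes are in the same namespace of the given type
# (net, pid, mnt, uts, ipc, user, cgroup). Hypothetical helper, not a Docker tool.
same_ns() {
  ns_a=$(readlink "/proc/$1/ns/$3") || return 2
  ns_b=$(readlink "/proc/$2/ns/$3") || return 2
  # Each link target looks like "net:[4026531840]"; equal targets mean the same namespace.
  [ "$ns_a" = "$ns_b" ]
}

# Example (PID_A and PID_B are placeholders for two containers' .State.Pid values):
#   same_ns "$PID_A" "$PID_B" net && echo "shared network namespace"
```

A return status of 2 means one of the links could not be read, which keeps "different namespace" distinguishable from "could not check".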

PID namespace leakage: --pid=host gives the container full visibility into all host processes. Avoid in production; it bypasses most process-level isolation.

Kernel cgroups v1 & v2

Control groups limit and account for resource usage. runc creates a cgroup hierarchy for each container and writes limits before execing the container process. Modern distros use cgroups v2 (unified hierarchy); older systems use v1 (per-subsystem trees).

cgroups v2 Hierarchy

shell

/sys/fs/cgroup/
  └── system.slice/
        └── docker-<container-id>.scope/
              ├── memory.max          # hard memory limit
              ├── memory.high         # soft limit (throttle before OOM)
              ├── cpu.max             # "quota period" e.g. "50000 100000" = 50% CPU
              ├── cpu.weight          # relative CPU shares
              ├── io.max              # device read/write bps and iops limits
              ├── pids.max            # max PIDs (prevents fork bombs)
              └── cgroup.procs        # PIDs in this cgroup

Common docker run Flags → cgroup Mappings

docker run flag	cgroups v2 file	Effect
`--memory 512m`	`memory.max`	Hard limit; OOM kill if exceeded
`--memory-reservation 256m`	`memory.low`	Soft reservation; memory below this level is reclaimed last under pressure
`--cpus 1.5`	`cpu.max`	`150000 100000`, i.e. 1.5 cores
`--cpu-shares 512`	`cpu.weight`	Relative weight; shares (default 1024) are rescaled to the 1–10000 weight range
`--pids-limit 100`	`pids.max`	Max total threads/processes
`--blkio-weight 500`	`io.weight`	Relative I/O weight
`--device-read-bps`	`io.max`	Per-device read bandwidth cap
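
The two CPU rows involve arithmetic that is easy to get wrong, so here is a minimal sketch of both conversions. The function names are hypothetical. The shares-to-weight formula is the linear rescaling that runc has historically applied; treat it as an assumption, since other runtimes or newer releases may map the values differently.

```shell
# cpus_to_cpu_max CPUS
# Prints the "quota period" pair for cpu.max, assuming Docker's default
# period of 100000 microseconds.
cpus_to_cpu_max() {
  awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# shares_to_weight SHARES
# Linear rescaling of cgroups v1 cpu.shares (2..262144) onto the cgroups v2
# cpu.weight range (1..10000). Assumed formula; see the note above.
shares_to_weight() {
  echo $(( 1 + (($1 - 2) * 9999) / 262142 ))
}

cpus_to_cpu_max 1.5      # prints: 150000 100000
shares_to_weight 1024    # prints: 39
```

Under this mapping the v1 default of 1024 shares becomes a weight of 39 rather than the cgroups v2 default of 100, so only the ratio between containers is meaningful, not the absolute number.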

Inspecting cgroup Usage

shell

# find the cgroup path for a container
CID=$(docker inspect -f '{{.Id}}' <container>)
cat /proc/$(docker inspect -f '{{.State.Pid}}' <container>)/cgroup

# read current memory usage (cgroups v2 with the systemd cgroup driver; the path differs under cgroupfs)
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.current

# read cpu usage stats
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.stat

# live resource stats
docker stats <container>

# check if cgroups v2 is active
stat -f -c '%T' /sys/fs/cgroup   # "cgroup2fs" = v2, "tmpfs" = v1

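OOM killer: When memory.max is hit and the kernel cannot reclaim enough memory, the OOM killer terminates a process in the cgroup. If that process is the container's PID 1, the container exits with code 137 (128 + SIGKILL). Set --oom-kill-disable only if you have host-level protection; the flag is honoured only on cgroups v1 hosts.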
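
To confirm that an exit code of 137 really came from the OOM killer, the cgroup's memory.events file can be checked for a non-zero oom_kill counter. The helpers below are a hedged sketch with hypothetical names; they assume a cgroups v2 host, where memory.events holds one "key value" pair per line.

```shell
# oom_kill_count EVENTS_FILE
# Prints the oom_kill counter from a cgroups v2 memory.events file.
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2 }' "$1"
}

# exit_signal STATUS
# For an exit status above 128, prints the signal number that ended the process.
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  fi
}

exit_signal 137    # prints: 9 (SIGKILL)

# Example against a live container (path assumes the systemd cgroup driver):
#   oom_kill_count /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.events
```

The counter has to be read while the container's cgroup still exists; once the container is removed, the file is gone with it.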

Kernel Capabilities

Linux capabilities split root privilege into discrete units. Docker drops most capabilities by default, retaining only those needed for typical workloads. runc writes the bounding, effective, and permitted sets from config.json before execing PID 1.

Default Capability Set (kept by Docker)

CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE

Managing Capabilities

shell

# drop all capabilities, add back only what's needed
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx

# run privileged (all capabilities + device access — avoid in prod)
docker run --privileged ...

# inspect effective capabilities of a running container process
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
grep CapEff /proc/$PID/status | awk '{print $2}' | xargs -I{} capsh --decode={}

# check at runtime inside container
docker exec <container> capsh --print

NET_RAW risk: NET_RAW allows raw socket creation and ICMP manipulation. Drop it (--cap-drop NET_RAW) if the container doesn't need ping or network sniffing.
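
When capsh is not installed, the CapEff value from /proc/<pid>/status can be tested bit by bit, because each capability is one bit in the hex mask. This is a minimal sketch: has_cap is a hypothetical helper, and the bit numbers are taken from linux/capability.h.

```shell
# has_cap CAPEFF_HEX BIT
# Succeeds when capability number BIT is set in the hex mask from the CapEff
# line of /proc/<pid>/status (e.g. NET_BIND_SERVICE=10, NET_RAW=13, SYS_ADMIN=21).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# a80425fb is the mask formed by the 14 default capabilities listed earlier.
mask=00000000a80425fb
has_cap "$mask" 13 && echo "NET_RAW present"
has_cap "$mask" 21 || echo "SYS_ADMIN absent"
```

A container started with --cap-drop NET_RAW would report a mask with bit 13 cleared, which makes the check usable in a hardening script.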

Kernel Seccomp

Docker applies a default seccomp profile that blocks roughly 44 syscalls known to be dangerous in containers (e.g. keyctl, mount, reboot, and ptrace on kernels older than 4.8). The profile is a JSON allowlist: any syscall it does not list fails with an error. runc translates the profile into a BPF filter attached to the container process.

Seccomp Profile Modes

Flag	Behaviour
(no `seccomp` option)	Docker default profile (recommended)
`--security-opt seccomp=unconfined`	Disable seccomp entirely; all syscalls allowed
`--security-opt seccomp=/path/profile.json`	Custom profile
`--privileged`	Disables seccomp and AppArmor confinement and keeps all capabilities

Debugging Blocked Syscalls

shell

# trace syscalls under the default profile to find what is blocked (EPERM / ENOSYS);
# <image> is a placeholder for an image that contains both strace and myapp
docker run --rm <image> \
  strace -f myapp 2>&1 | grep -E "EPERM|ENOSYS"

# view Docker's default seccomp profile
curl -s https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json | jq '.syscalls[].names'

AppArmor / SELinux: Docker also ships an AppArmor profile (docker-default) on Ubuntu/Debian systems. SELinux labels are used on RHEL/Fedora. These operate independently of seccomp and add MAC (mandatory access control) on top of DAC.

Filesystem OverlayFS

Docker's default (and preferred) storage driver. OverlayFS is a union filesystem that stacks read-only image layers (lowerdir) under a single writable layer (upperdir). Reads resolve to the topmost layer that contains the file; writes go to the upper layer via copy-on-write.

Mount Structure

ASCII

  Container view (merged)
  ─────────────────────────────────
  /etc/nginx/nginx.conf   ◀── from upperdir (modified)
  /usr/sbin/nginx         ◀── from lowerdir layer 3 (image)
  /lib/x86_64-linux-gnu/  ◀── from lowerdir layer 1 (base)

  OverlayFS mount:
    lowerdir = layer3:layer2:layer1   (read-only, :-separated, top-to-bottom)
    upperdir = /var/lib/docker/overlay2/<id>/diff    (read-write)
    workdir  = /var/lib/docker/overlay2/<id>/work    (kernel internal)
    merged   = /var/lib/docker/overlay2/<id>/merged  (container rootfs)

Copy-on-Write (CoW) Mechanics

Read

Kernel checks upperdir first. If not found, falls through each lowerdir in order. Only the first matching file is returned — no data is copied.

Write (new file)

New file created directly in upperdir. No interaction with lowerdir layers.

Write (existing file)

Kernel copies the entire file from lowerdir into upperdir (copy-up), then modifies it in-place. First write can be slow for large files.

Delete

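A whiteout (a character device with device number 0/0, named after the deleted file) is created in upperdir. The kernel masks the file from lowerdir when building the merged view. In image layer tarballs the same deletion is recorded as a .wh.<filename> entry.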
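
The read path described above can be imitated with ordinary directories, which makes the "first match wins" rule concrete. This is a simplified user-space sketch, not how the kernel implements it: overlay_lookup is a hypothetical function, and it treats any character device as a whiteout without checking the 0/0 device number.

```shell
# overlay_lookup RELPATH UPPERDIR LOWERDIR...
# Prints the layer directory that would serve RELPATH, searching the layers
# in the order given (upperdir first, then lowerdirs from top to bottom).
overlay_lookup() {
  rel=$1
  shift
  for layer in "$@"; do
    # A whiteout in this layer hides the file in every layer below it.
    if [ -c "$layer/$rel" ]; then
      return 1
    fi
    # First layer that has the path wins; lower copies are never consulted.
    if [ -e "$layer/$rel" ]; then
      printf '%s\n' "$layer"
      return 0
    fi
  done
  return 1
}

# Example with placeholder directories:
#   overlay_lookup etc/nginx/nginx.conf /tmp/upper /tmp/layer3 /tmp/layer2 /tmp/layer1
```

No data is copied during a lookup; copy-up happens only when a file found in a lower layer is opened for writing.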

Inspecting OverlayFS Mounts

shell

# show overlay mount options for a running container
docker inspect -f '{{json .GraphDriver}}' <container> | jq .

# same info from the host mount table
mount | grep overlay

# look at the writable layer diff (files changed by the container)
UPPER=$(docker inspect -f '{{.GraphDriver.Data.UpperDir}}' <container>)
ls $UPPER

# see all image layers for an image
docker inspect nginx | jq '.[].RootFS.Layers'

# layer directories on disk
ls /var/lib/docker/overlay2/

Copy-up latency: The first write to a large file (e.g. a database page) in a container triggers a full copy-up of that file from lowerdir to upperdir. Use volumes (-v) for write-heavy workloads to avoid this overhead.

Filesystem Image Layers

Docker images are built from ordered, immutable layers stored as compressed tarballs in a content-addressed store. Each RUN, COPY, or ADD instruction in a Dockerfile creates a new layer. Layers are shared across images and containers — a base layer pulled once is reused by all images that build on it.

OCI Image Layout

ASCII

OCI Image Index (manifest list)
  └── Manifest (per platform)
        ├── Config JSON   { "Cmd": [...], "Env": [...], "RootFS": { "diff_ids": [...] } }
        └── Layers []     [ sha256:abc… (tar.gz), sha256:def… (tar.gz), … ]
                               ▲
                          content-addressed blobs
                          /var/lib/docker/image/overlay2/layerdb/sha256/<digest>/
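
"Content-addressed" means a blob's name is the hash of its bytes. A minimal sketch of computing such a name follows; blob_digest is a hypothetical helper, and it assumes sha256sum from GNU coreutils is available.

```shell
# blob_digest FILE
# Prints the content address of FILE in the "sha256:<hex>" form used for OCI blobs.
blob_digest() {
  sha256sum "$1" | awk '{ print "sha256:" $1 }'
}

# Example (layer.tar.gz is a placeholder for a layer blob on disk):
#   blob_digest layer.tar.gz
```

Identical bytes always hash to the same digest, which is why a base layer shared by several images is stored and pulled only once. Layer digests in a manifest are computed over the compressed blob, whereas the diff_ids in the config are computed over the uncompressed tar.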

Layer Caching in Builds

dockerfile

# Bad: COPY invalidates cache early, re-runs npm install every code change
COPY . /app
RUN npm install

# Good: copy dependencies first, code second
COPY package.json package-lock.json /app/
RUN npm install           # cached as long as package*.json unchanged
COPY . /app               # invalidated only when source changes

Useful Layer Commands

shell

# show layer history and sizes
docker history nginx

# detailed layer info
docker inspect nginx | jq '.[].RootFS'

# export a container filesystem (flattened, no layers)
docker export <container> | tar -tvf -

# save image with all layers as a tar archive
docker save nginx | tar -tvf -

# show total disk usage by images/containers/volumes
docker system df -v

Filesystem Volumes & Mounts

Three mount types bypass OverlayFS for persistent or shared data. Volumes are managed by Docker, bind mounts expose host paths directly, and tmpfs mounts live in memory.

Type	Source	Use case	Persists?
`volume`	`/var/lib/docker/volumes/<name>/_data`	Databases, persistent app data	Yes, until `docker volume rm`
`bind`	Any host path	Dev workflows, config injection	Yes (host owns the data)
`tmpfs`	RAM	Secrets, temp scratch space	No — lost on container stop

Mount Implementation

All three are implemented as entries in the mounts array of the OCI config. runc calls mount(2) inside the container's mount namespace, making the source visible at the destination path. Volumes and bind mounts are mounted under the container rootfs before pivot_root.

shell

# named volume
docker run -v mydata:/var/lib/postgresql/data postgres

# bind mount (absolute host path)
docker run -v /host/config:/etc/app/config:ro myapp

# tmpfs for secrets
docker run --tmpfs /run/secrets:rw,noexec,nosuid,size=64m myapp

# inspect mount details
docker inspect -f '{{json .Mounts}}' <container> | jq .

# list volumes
docker volume ls
docker volume inspect mydata

Volume drivers: Docker supports third-party volume plugins (e.g. NFS, AWS EBS, Ceph) via the --volume-driver flag. The plugin implements the Docker Volume API and manages the actual storage backend.

Networking Network Drivers

Docker networking is implemented through pluggable drivers. Each driver creates a different network topology. The driver sets up the container's network namespace, creates virtual interfaces, assigns IPs, and configures routing before the container starts.

Driver	Topology	Use case
`bridge`	Software bridge on host, NAT to outside	Default; single-host container communication
`host`	Shares host network namespace	Performance-critical workloads; no isolation
`overlay`	VXLAN tunnel across hosts	Docker Swarm, multi-host networking
`macvlan`	Container has a real MAC on the LAN	Legacy apps needing L2 adjacency
`ipvlan`	Shares host MAC, unique IP per container	Environments restricting MAC proliferation
`none`	No networking — loopback only	Batch jobs, security sandboxing

Networking Bridge Network

The default network mode. Docker creates a Linux bridge (docker0 for the default network, or a named bridge for user-defined networks). Each container gets a veth pair: one end in the container's net namespace, the other attached to the bridge.

ASCII

 Host network namespace
 ┌──────────────────────────────────────────────────────┐
 │                                                      │
 │   eth0 ──── docker0 (bridge 172.17.0.1/16)           │
 │                  │                                   │
 │            ┌─────┴──────┐                            │
 │         veth0a        veth0b  … (one per container)  │
 │            │                                         │
 └────────────┼─────────────────────────────────────────┘
              │ (peer end inside container net namespace)
 Container net namespace
 ┌────────────────────────────────────────┐
 │  eth0  172.17.0.2/16                   │
 │  lo    127.0.0.1/8                     │
 │  default gw: 172.17.0.1 (docker0)      │
 └────────────────────────────────────────┘

User-Defined Bridge vs Default

Default bridge (docker0)

No DNS resolution between containers. Must use --link (legacy) or IP addresses. All containers share one subnet.

User-defined bridge

Automatic DNS: containers resolve each other by name. Isolated subnet per network. Preferred for all new deployments.

shell

# create a user-defined bridge network
docker network create --driver bridge --subnet 192.168.10.0/24 mynet

# attach containers — they resolve each other by name
docker run -d --name db --network mynet postgres
docker run -d --name app --network mynet myapp
# app can reach db at hostname "db"

# inspect network
docker network inspect mynet

# see the bridge on the host
ip link show type bridge
brctl show    # legacy bridge-utils tool; "bridge link" is the iproute2 equivalent

Networking veth pairs & iptables

veth (virtual Ethernet) pairs are a kernel primitive: a packet sent into one end exits the other. Docker uses them to connect the container's network namespace to the host bridge. iptables handles NAT, port forwarding, and inter-network policy.

veth Pair Lifecycle

shell

# what dockerd's network driver does under the hood (simplified)
ip link add veth0a type veth peer name veth0b    # create pair
ip link set veth0b netns <container-netns>       # move peer into container ns
ip link set veth0a master docker0                # attach host end to bridge
ip link set veth0a up

# inside the container namespace
ip link set veth0b name eth0
ip addr add 172.17.0.2/16 dev eth0
ip link set eth0 up
ip route add default via 172.17.0.1

iptables Rules Added by Docker

shell

# view Docker's iptables rules
iptables -t nat -L -n --line-numbers
iptables -L DOCKER -n --line-numbers

# MASQUERADE rule: outbound container traffic → NAT to host IP
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# DNAT rule for port mapping (-p 8080:80); matches traffic that does not arrive from docker0
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80

# DOCKER-ISOLATION chain blocks inter-bridge traffic
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
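
The MASQUERADE rule applies only to packets whose source address falls inside the bridge subnet. The sketch below reproduces that subnet match in plain shell arithmetic so the effect of the -s 172.17.0.0/16 selector can be checked by hand. Both function names are hypothetical, and the code handles IPv4 only.

```shell
# ip_to_int A.B.C.D
# Prints the IPv4 address as a 32-bit integer. The body runs in a subshell,
# so the temporary IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# in_cidr IP NETWORK/BITS
# Succeeds when IP lies inside the given CIDR block.
in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

in_cidr 172.17.0.2 172.17.0.0/16 && echo "172.17.0.2 matches the MASQUERADE source"
in_cidr 192.168.10.5 172.17.0.0/16 || echo "192.168.10.5 does not match"
```

A container on a user-defined network such as 192.168.10.0/24 is therefore covered by that network's own MASQUERADE rule, not the docker0 one.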

Tracing a Port Mapping

shell

# incoming packet to host port 8080 → container:80
# Host:8080 ──DNAT──▶ 172.17.0.2:80 ──veth──▶ container eth0:80

# verify port mapping
docker port <container>
iptables -t nat -L DOCKER -n | grep 8080

iptables vs nftables: Docker manages iptables rules directly. On systems that have migrated to nftables, the iptables-nft compatibility layer is used. Do not manually flush Docker's chains (iptables -F); doing so breaks all container networking until Docker is restarted.

Networking DNS & Service Discovery

Containers on user-defined networks get automatic DNS. Docker runs an embedded DNS resolver at 127.0.0.11 inside each container's network namespace. It resolves container names, network aliases, and service names (in Swarm), then forwards unknown queries to the host's upstream DNS.

How It Works

ASCII

Container
  └── /etc/resolv.conf
        nameserver 127.0.0.11           ◀── Docker embedded resolver
        options ndots:0

Container makes DNS query for "db"
  │
  ▼
127.0.0.11 (embedded resolver, inside net ns)
  │
  ├── name matches a container on the same network? → return its IP
  ├── name matches a network alias?               → return aliased IP
  ├── name matches a Swarm service?               → return VIP
  └── unknown → forward to host resolver (/etc/resolv.conf on host)

DNS Commands

shell

# verify container DNS config
docker exec <container> cat /etc/resolv.conf

# resolve a peer container name from inside
docker exec app nslookup db

# assign a network alias (multiple names for one container)
docker run --network mynet --network-alias cache redis

# override DNS server for a container
docker run --dns 1.1.1.1 myapp

# set search domains
docker run --dns-search internal.corp myapp

Default bridge has no embedded DNS: Only user-defined networks get the embedded resolver at 127.0.0.11. Containers on docker0 (default bridge) get a copy of the host's /etc/resolv.conf and cannot resolve each other by name.
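
A quick way to tell which case a container is in is to look for 127.0.0.11 in its resolv.conf. The helper below is a small sketch under that assumption; uses_embedded_dns is a hypothetical name, and it works on a saved copy of the file.

```shell
# uses_embedded_dns RESOLV_CONF
# Succeeds when the file names Docker's embedded resolver (127.0.0.11) as a nameserver.
uses_embedded_dns() {
  grep -Eq '^nameserver[[:space:]]+127\.0\.0\.11([[:space:]]|$)' "$1"
}

# Example (app is a placeholder container name):
#   docker exec app cat /etc/resolv.conf > /tmp/app-resolv.conf
#   uses_embedded_dns /tmp/app-resolv.conf && echo "name-based discovery available"
```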

Networking Host & Macvlan Networking

When bridge NAT overhead is unacceptable or a container needs a presence directly on the LAN, host and macvlan modes bypass the bridge entirely.

--network host

Container shares the host's network namespace. No veth, no NAT, no port mapping needed. Container binds directly to host interfaces. Loses all network isolation.

macvlan

Kernel creates a virtual NIC with its own MAC attached to a host interface. Container appears as a distinct host on the physical LAN. Requires promiscuous mode on the NIC.

shell

# host networking
docker run --network host nginx
# nginx binds to host :80 directly — no port mapping needed

# macvlan network (container gets a real LAN IP)
docker network create \
  --driver macvlan \
  --subnet 192.168.1.0/24 \
  --gateway 192.168.1.1 \
  -o parent=eth0 \
  macnet

docker run --network macnet --ip 192.168.1.50 nginx

Host mode on macOS / Windows: On Docker Desktop (Mac/Windows), --network host does not behave as it does on Linux because containers run in a Linux VM. The container shares the VM's network namespace, not the macOS/Windows host's.

Reference Inspection Commands

Quick reference for inspecting containers, images, networks, and the low-level daemon state from the host.

shell

## Container internals
docker inspect <container>                      # full JSON config + state
docker inspect -f '{{.State.Pid}}' <container>  # host PID of container init
docker inspect -f '{{json .HostConfig}}' <c> | jq .  # resource limits
docker top <container>                          # processes inside
docker stats <container>                        # live cpu/mem/io/net
docker diff <container>                         # changed files (upperdir diff)

## Filesystem
docker inspect -f '{{.GraphDriver.Data.MergedDir}}' <c>   # merged view
docker inspect -f '{{.GraphDriver.Data.UpperDir}}' <c>    # writable layer

## Namespaces from host
PID=$(docker inspect -f '{{.State.Pid}}' <container>)
ls -la /proc/$PID/ns/
nsenter -t $PID -n -- ip addr                   # host-side net namespace entry

## cgroups
CID=$(docker inspect -f '{{.Id}}' <container>)
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-${CID}.scope/cpu.stat

## containerd / runc state
ctr -n moby tasks list
runc --root /run/docker/runtime-runc/moby state $CID
runc --root /run/docker/runtime-runc/moby list

Reference cgroup Limits Quick Reference

Goal	docker run flag	cgroups v2 file
Hard memory cap	`--memory 1g`	`memory.max`
Soft memory reservation	`--memory-reservation 512m`	`memory.low`
Disable OOM kill	`--oom-kill-disable`	cgroups v1 only (`memory.oom_control`); no cgroups v2 file
CPU quota	`--cpus 0.5`	`cpu.max 50000 100000`
CPU relative weight	`--cpu-shares 512`	`cpu.weight`
CPU pin	`--cpuset-cpus 0,1`	`cpuset.cpus`
Max processes	`--pids-limit 200`	`pids.max`
Block I/O read bps	`--device-read-bps /dev/sda:10mb`	`io.max`
Block I/O weight	`--blkio-weight 300`	`io.weight`

Reference Storage Drivers

Docker supports multiple union/copy-on-write storage drivers. overlay2 is the default and recommended on all modern Linux kernels.

Driver	Kernel requirement	Status	Notes
`overlay2`	kernel ≥ 4.0, xfs (ftype=1) or ext4	Recommended	Uses OverlayFS; best performance and stability
`fuse-overlayfs`	FUSE in kernel	Rootless only	Used by rootless Docker when overlayfs is unavailable
`btrfs`	btrfs filesystem	Supported	Subvolumes for layers; built-in snapshots; complex
`zfs`	ZFS on Linux	Supported	Dataset per layer; good for ZFS-centric hosts
`devicemapper`	device-mapper	Deprecated	Direct-lvm mode requires setup; avoid on new installs
`vfs`	Any	Testing only	No CoW; full copy per layer; very slow

shell

# check active storage driver
docker info | grep "Storage Driver"

# check backing filesystem
docker info | grep "Backing Filesystem"

# switch to overlay2 in daemon.json
# { "storage-driver": "overlay2" }
# then: systemctl restart docker  (images/containers created under the old driver are no longer visible until you switch back)