A packet arriving from the wire goes through hardware receive, early eBPF filtering, kernel buffering, firewall chains, routing, and finally reaches the socket of the waiting application.
Step-by-Step Flow
When an application writes to a socket, the data travels down through the transport and network layers, passes firewall OUTPUT and TC egress hooks, and is handed to the NIC driver for transmission.
Step-by-Step Flow
When the routing decision that follows PREROUTING determines that the destination is not local, the packet is forwarded: PREROUTING → FORWARD chain → POSTROUTING → egress NIC. This is what a Linux router or NAT gateway does.
Flow
Forwarding only takes place when net.ipv4.ip_forward=1 is set. Containers, VMs, and VPN gateways all rely on this sysctl.
The Network Interface Card is where a packet enters or leaves the machine. Modern NICs use DMA (Direct Memory Access) to write frames directly into RAM ring buffers without CPU involvement, then raise an interrupt.
Receive Flow
Key Tunables
| Parameter | Where | Effect |
|---|---|---|
| ethtool -G eth0 rx N | Driver | Enlarge RX ring buffer (reduces drops during bursts) |
| ethtool -C eth0 rx-usecs N | Driver | Interrupt coalescing: trade latency for throughput |
| /proc/sys/net/core/netdev_budget | Kernel | Max packets per NAPI poll cycle (default 300) |
| net.core.rmem_max | sysctl | Max socket receive buffer |
| ethtool -K eth0 gro on | Driver | Generic Receive Offload: merge small segments |
Useful Commands

    # Show NIC ring buffer sizes and current settings
    ethtool -g eth0

    # Show NIC offload settings (GRO, GSO, TSO, …)
    ethtool -k eth0

    # Show interrupt coalescing settings
    ethtool -c eth0

    # Show per-queue NIC stats (drops, errors)
    ethtool -S eth0

    # Show softirq counters per CPU
    grep NET /proc/softirqs
eXpress Data Path is an eBPF hook that runs inside the NIC driver, before the kernel allocates an sk_buff. It provides the earliest possible packet processing point in the receive path.
XDP Actions
| Action | Meaning | Use Case |
|---|---|---|
| XDP_DROP | Discard the packet immediately | DDoS mitigation, firewall |
| XDP_PASS | Pass to normal network stack | Default: let the kernel handle it |
| XDP_TX | Retransmit on same NIC | Inline load balancer, reflection |
| XDP_REDIRECT | Redirect to another NIC or CPU | AF_XDP zero-copy, multi-NIC LB |
| XDP_ABORTED | Error, treated as DROP | Bug detection during development |
XDP Modes
Native XDP
Driver implements the XDP hook directly. Fastest: the program runs in the driver's receive (NAPI poll) path, before any sk_buff is allocated. Requires driver support (mlx5, i40e, ixgbe, virtio_net, …).
Generic XDP (SKB mode)
Implemented in the kernel, after sk_buff allocation. Works on any driver but is much slower — mainly for testing.
Offloaded XDP
Program is compiled for and run on the SmartNIC itself. Zero host CPU overhead. Requires a SmartNIC whose driver supports XDP offload (for example Netronome Agilio).
AF_XDP
Zero-copy path to userspace using XDP_REDIRECT into an AF_XDP socket. Used by DPDK-alternative stacks and custom packet processors.
Attaching an XDP Program

    # Attach XDP program (native mode when the driver supports it, otherwise generic)
    ip link set dev eth0 xdp obj xdp_prog.o sec xdp

    # Attach in generic (SKB) mode; works with any driver
    ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp

    # Remove XDP program
    ip link set dev eth0 xdp off

    # Show XDP info
    ip link show dev eth0 | grep xdp

    # Using bpftool
    bpftool net show dev eth0
    bpftool prog list
Minimal XDP Program (C)

    // Drop all IPv4 UDP packets, pass everything else
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>          /* IPPROTO_UDP */
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>   /* SEC() */
    #include <bpf/bpf_endian.h>    /* bpf_htons() */

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Every header must be bounds-checked before it is read */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_DROP;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_DROP;
        if (ip->protocol == IPPROTO_UDP)
            return XDP_DROP;

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";
The sk_buff (socket buffer) is the central data structure that represents a packet inside the Linux kernel. Every packet that passes through XDP_PASS is wrapped in an sk_buff for its journey through TC, Netfilter, IP, and transport layers.
Structure Layout
The kernel avoids copying packet data as a packet moves between layers by adjusting the data, tail, and head pointers and using reference counting. skb_clone() shares the data buffer; skb_copy() makes a full copy.
Key Fields Used by eBPF
| Field | Type | Description |
|---|---|---|
| skb->mark | u32 | Firewall mark (fwmark): used for policy routing and iptables matching |
| skb->priority | u32 | QoS / TC priority class |
| skb->protocol | __be16 | EtherType in network byte order (ETH_P_IP = 0x0800) |
| skb->cb | u8[48] | Per-layer scratch space (TC uses this for metadata) |
| skb->tstamp | ktime_t | Receive/transmit timestamp |
| skb->hash | u32 | Flow hash: supplied by NIC RSS or computed in software, used by RPS and load balancers |
The TC subsystem controls how packets are queued and scheduled on an interface. It provides qdiscs (queuing disciplines), classes, and filters — and is also the attachment point for TC eBPF programs via the clsact qdisc.
TC Architecture
Common Qdiscs
| Qdisc | Type | Use Case |
|---|---|---|
| pfifo_fast | Classless | Default: 3-band FIFO based on IP TOS bits |
| fq_codel | Classless | Fair Queue + Controlled Delay: reduces bufferbloat |
| htb | Classful | Hierarchical Token Bucket: rate limiting with bursting |
| tbf | Classless | Token Bucket Filter: simple rate cap |
| netem | Classless | Network emulator: add delay, loss, jitter for testing |
| clsact | Special | eBPF TC attachment point (ingress + egress) |
TC Commands

    # Show qdiscs on interface
    tc qdisc show dev eth0

    # Add 100ms delay + 10ms jitter for testing (netem)
    tc qdisc add dev eth0 root netem delay 100ms 10ms

    # Rate-limit to 100 Mbps with HTB
    tc qdisc add dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit burst 15k

    # Attach eBPF program to TC ingress
    tc qdisc add dev eth0 clsact
    tc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc

    # Show TC filters
    tc filter show dev eth0 ingress
    tc filter show dev eth0 egress
Netfilter is the Linux kernel framework for packet filtering, NAT, and connection tracking. iptables and nftables are userspace tools that write rules into Netfilter hooks. Rules are organised into tables (raw, mangle, nat, filter, security) evaluated at five chain hooks.
Five Hook Points
Table Evaluation Order per Hook
| Hook | Table order (iptables) | Typical Use |
|---|---|---|
| PREROUTING | raw → mangle → nat | DNAT, NOTRACK, TTL mangle |
| INPUT | mangle → filter → security | Filter inbound to local process |
| FORWARD | mangle → filter → security | Filter forwarded packets |
| OUTPUT | raw → mangle → nat → filter → security | Filter outbound from local process, DNAT of locally generated traffic |
| POSTROUTING | mangle → nat | SNAT, MASQUERADE |
Common iptables Commands

    # List all rules with packet/byte counters
    iptables -L -n -v

    # List rules in a specific table
    iptables -t nat -L -n -v

    # Drop all traffic from an IP
    iptables -A INPUT -s 1.2.3.4 -j DROP

    # Allow established/related (stateful firewall)
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

    # Port forwarding: redirect :8080 → 192.168.1.10:80 (DNAT)
    iptables -t nat -A PREROUTING -p tcp --dport 8080 \
        -j DNAT --to-destination 192.168.1.10:80

    # Masquerade for NAT gateway
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

    # Trace a packet through all chains (use TRACE in raw table)
    iptables -t raw -A PREROUTING -s 1.2.3.4 -j TRACE
    modprobe nf_log_ipv4
    dmesg | grep TRACE
nftables is the successor to iptables. It uses a single table/chain model without hard-coded hooks, has atomic ruleset updates, and is significantly faster at large rule counts. On distros like Debian 12+ and RHEL 9, iptables is now a compatibility shim over nftables.
conntrack is the Netfilter subsystem that tracks the state of network connections. Every packet is looked up in the connection tracking table at PREROUTING or OUTPUT, right after the raw table and unless a NOTRACK rule exempts it, enabling stateful firewalling (ESTABLISHED / RELATED rules) and NAT.
Connection States
| State | Meaning |
|---|---|
| NEW | First packet of a new connection (e.g. TCP SYN) |
| ESTABLISHED | Connection is established, with packets flowing both ways |
| RELATED | New connection related to an existing one (e.g. FTP data channel) |
| INVALID | Packet doesn't match any known connection, usually dropped |
| UNTRACKED | Explicitly excluded from tracking via NOTRACK rule |
TCP State Machine in conntrack
Commands

    # List all tracked connections
    conntrack -L

    # Watch the conntrack table in real time
    conntrack -E

    # Show conntrack table statistics
    conntrack -S

    # Show current conntrack table size and max
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # Flush the conntrack table (use with care!)
    conntrack -F

    # Delete a specific entry
    conntrack -D -s 1.2.3.4 -d 5.6.7.8
On a busy host the conntrack table can fill up to nf_conntrack_max, causing all new connections to be dropped with "nf_conntrack: table full, dropping packet" in dmesg. Increase the limit or disable conntrack for UDP/high-throughput flows with NOTRACK.
The IP layer performs routing, fragmentation/reassembly, and TTL management. The kernel's Forwarding Information Base (FIB) is the routing table — a lookup here determines which interface and next-hop a packet uses.
Routing Lookup (Inbound)
Useful Commands

    # Show routing table
    ip route show
    ip route show table all

    # Which route/interface would be used for a destination?
    ip route get 8.8.8.8

    # Show ARP / neighbor cache
    ip neigh show

    # Show policy routing rules
    ip rule show

    # Show interface addresses
    ip addr show

    # Trace route to a host
    traceroute 8.8.8.8
    mtr --report 8.8.8.8
Fragmentation
Use ip link show to check MTU, and ping -M do -s 1400 host to probe path MTU.
The transport layer multiplexes data between processes using ports and handles reliability (TCP) or connectionless delivery (UDP). Incoming packets are demultiplexed to the correct socket by looking up the 4-tuple (src IP, src port, dst IP, dst port).
TCP Receive Path
TCP Congestion Control

    # Show available congestion control algorithms
    cat /proc/sys/net/ipv4/tcp_available_congestion_control

    # Show currently active algorithm
    cat /proc/sys/net/ipv4/tcp_congestion_control

    # Switch to BBR (Bottleneck Bandwidth and RTT)
    sysctl -w net.ipv4.tcp_congestion_control=bbr

    # Show TCP stats
    ss -tin
    nstat -az | grep Tcp
Socket Demultiplexing
Key Tunables
| sysctl | Default | Effect |
|---|---|---|
| net.core.somaxconn | 4096 | Max accept queue backlog per socket (128 before Linux 5.4) |
| net.ipv4.tcp_max_syn_backlog | 1024 | SYN queue depth (incomplete handshakes) |
| net.ipv4.tcp_fin_timeout | 60 | Seconds an orphaned socket may stay in FIN_WAIT_2; the TIME_WAIT period itself is fixed at 60 s |
| net.ipv4.tcp_tw_reuse | 0 | Allow reuse of TIME_WAIT sockets for new outgoing connections (recent kernels default to 2, loopback only) |
| net.core.rmem_max | 212992 | Max socket receive buffer size |
| net.core.wmem_max | 212992 | Max socket send buffer size |
A socket is the kernel object that bridges the network stack and userspace. Two queues, the receive buffer and the send buffer, decouple the application from the network rate.
Socket Buffer Model
Socket Types
| Type | Family | Description |
|---|---|---|
| SOCK_STREAM | AF_INET / AF_INET6 | TCP: reliable, ordered byte stream |
| SOCK_DGRAM | AF_INET / AF_INET6 | UDP: connectionless datagrams |
| SOCK_RAW | AF_INET | Raw IP: bypass TCP/UDP, build headers manually |
| SOCK_RAW | AF_PACKET | Raw Ethernet frames: used by tcpdump (SOCK_PACKET is the obsolete predecessor) |
| SOCK_RAW | AF_XDP | Zero-copy XDP socket for kernel-bypass packet processing |
Useful Commands

    # Show all TCP sockets with timer and process info
    ss -tinp

    # Show UDP sockets
    ss -unp

    # Show listen sockets with backlog
    ss -tlnp

    # Show socket memory usage
    ss -m

    # Show per-socket stats including retransmits
    ss -tin dst 10.0.0.1
eBPF programs can be attached to dozens of hook points across the network stack. The choice of hook determines what data is visible and what actions are possible.
Network Hook Map
Hook Summary
| Hook | Program Type | Attach Point | Key Capability |
|---|---|---|---|
| XDP | BPF_PROG_TYPE_XDP | NIC driver | Fastest drop/redirect, pre-sk_buff |
| TC ingress/egress | BPF_PROG_TYPE_SCHED_CLS | clsact qdisc | Full sk_buff access, header rewrite |
| Netfilter | BPF_PROG_TYPE_NETFILTER | Netfilter hooks | Run eBPF verdicts at Netfilter hooks alongside iptables/nftables rules (Linux 6.4+) |
| Socket ops | BPF_PROG_TYPE_SOCK_OPS | TCP events | Observe TCP state, tune socket params |
| Socket filter | BPF_PROG_TYPE_SOCKET_FILTER | Individual socket | Filter packets per-socket (tcpdump) |
| cgroup/skb | BPF_PROG_TYPE_CGROUP_SKB | cgroup v2 | Per-cgroup ingress/egress policy |
| sk_skb | BPF_PROG_TYPE_SK_SKB | sockmap | Stream parser — L7 load balancing |
bpftrace — Quick XDP Tracing

    # Count XDP exceptions (for example XDP_ABORTED) by action code
    bpftrace -e 'tracepoint:xdp:xdp_exception { @[args->act] = count(); }'

    # Count generic-mode (SKB) XDP redirects
    bpftrace -e 'kprobe:xdp_do_generic_redirect { @redirects = count(); }'

    # Show interface counters (driver-specific XDP counters, where exported, appear in ethtool -S)
    ip -s link show dev eth0
Minimal TC Program (C)
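
    // Tag all IPv4 TCP packets with fwmark=1 on ingress
    #include <linux/bpf.h>
    #include <linux/if_ether.h>    /* ETH_P_IP, ETH_HLEN */
    #include <linux/in.h>          /* IPPROTO_TCP */
    #include <linux/ip.h>
    #include <linux/pkt_cls.h>     /* TC_ACT_OK */
    #include <bpf/bpf_helpers.h>   /* SEC() */
    #include <bpf/bpf_endian.h>    /* bpf_htons() */

    SEC("tc")
    int tc_mark_tcp(struct __sk_buff *skb)
    {
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;

        /* Only IPv4 frames carry an iphdr right after the Ethernet header */
        if (skb->protocol != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;

        struct iphdr *ip = data + ETH_HLEN;   /* skip Ethernet header */
        if ((void *)(ip + 1) > data_end)
            return TC_ACT_OK;

        if (ip->protocol == IPPROTO_TCP)
            skb->mark = 1;

        return TC_ACT_OK;
    }

    char LICENSE[] SEC("license") = "GPL";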
sock_ops programs fire on TCP socket events (connection established, retransmit, RTT measurement). They are used for per-connection observability and to set TCP socket options automatically based on policy.

    # Trace new TCP connections with source/dest
    bpftrace -e 'kprobe:tcp_connect {
      $sk = (struct sock *)arg0;
      printf("connect: %s → %s\n",
        ntop($sk->__sk_common.skc_rcv_saddr),
        ntop($sk->__sk_common.skc_daddr));
    }'

    # Trace TCP retransmits
    bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'

    # Trace socket accept latency (successful accepts only)
    bpftrace -e 'kprobe:inet_csk_accept { @start[tid] = nsecs; }
      kretprobe:inet_csk_accept /@start[tid]/ {
        if (retval != 0) {
          @latency_us = hist((nsecs - @start[tid]) / 1000);
        }
        delete(@start[tid]);
      }'
Quick-reference showing which chains fire for each packet scenario.
| Scenario | Chains Traversed (in order) |
|---|---|
| Packet arriving for local process | PREROUTING → INPUT |
| Packet generated by local process | OUTPUT → POSTROUTING |
| Packet being forwarded | PREROUTING → FORWARD → POSTROUTING |
| DNAT (port forwarding, inbound) | PREROUTING (nat) → FORWARD → POSTROUTING |
| SNAT / MASQUERADE (outbound NAT) | PREROUTING → FORWARD → POSTROUTING (nat) for forwarded traffic; OUTPUT → POSTROUTING (nat) for locally generated traffic |
| Docker container traffic (bridge) | PREROUTING → FORWARD → POSTROUTING |
Parse a Packet with tcpdump

    # Capture with full header decode
    tcpdump -i eth0 -v -nn

    # Show raw hex + ASCII
    tcpdump -i eth0 -XX -nn

    # Filter TCP SYN packets
    tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

    # Capture to file for Wireshark
    tcpdump -i eth0 -w capture.pcap

    # Show ICMP only
    tcpdump -i eth0 -nn icmp
| Tool | Layer | Purpose |
|---|---|---|
| tcpdump | L2–L7 | Packet capture and decode: gold standard for raw packet inspection |
| wireshark / tshark | L2–L7 | GUI / CLI packet analysis, protocol dissection |
| ss | L4 | Socket statistics: replaces netstat, shows TCP state, buffers, retransmits |
| ip | L2–L3 | Interface, address, route, ARP management |
| conntrack | Netfilter | View and manage the connection tracking table |
| iptables / nft | Netfilter | Firewall, NAT, packet mangling rules |
| tc | L2–L3 | Traffic shaping, qdisc/filter management, TC eBPF attachment |
| bpftool | eBPF | Inspect loaded eBPF programs and maps |
| bpftrace | eBPF | Dynamic tracing: one-liners for kernel network events |
| ethtool | NIC | NIC settings: ring size, offload, coalescing, stats |
| nstat / netstat | L3–L4 | Protocol-level counters (retransmits, drops, errors) |
| ping / mtr | L3 | ICMP reachability and path MTU testing |
Common sysctl parameters that affect packet throughput, latency, and connection capacity.

    # Socket buffer sizes (bytes)
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

    # Increase connection capacity
    sysctl -w net.core.somaxconn=65535
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"

    # Increase conntrack table size
    sysctl -w net.netfilter.nf_conntrack_max=2097152

    # Enable TCP BBR congestion control (needs the tcp_bbr module) with the fq qdisc
    sysctl -w net.ipv4.tcp_congestion_control=bbr
    sysctl -w net.core.default_qdisc=fq

    # Shorten the FIN_WAIT_2 timeout and allow TIME_WAIT reuse for outgoing connections
    sysctl -w net.ipv4.tcp_fin_timeout=15
    sysctl -w net.ipv4.tcp_tw_reuse=1

    # NAPI polling budget
    sysctl -w net.core.netdev_budget=600
    sysctl -w net.core.netdev_budget_usecs=8000
sysctl -w changes are lost on reboot. Write them to /etc/sysctl.d/99-network.conf and run sysctl --system to apply permanently.