Linux Network Stack — Packet Journey

How a packet travels through the Linux kernel: NIC → XDP → sk_buff → TC → Netfilter → sockets, inbound and outbound

INBOUND (wire → application)

  Physical wire
    ▼
  NIC (DMA → ring buffer)
    ▼  Interrupt / NAPI
  XDP [eBPF hook]: DROP / PASS / REDIRECT
    ▼  sk_buff alloc
  TC ingress [eBPF]: clsact / filter
    ▼
  Netfilter PREROUTING (DNAT, raw table)
    ▼
  IP Layer (L3): routing decision
    ├─ local ───▶ Netfilter INPUT (conntrack, filter)
    │               ▼
    │             Transport (L4): TCP reassembly
    │               ▼
    │             Socket recvbuf (recv/read)
    │               ▼
    │             Userspace App
    └─ forward ─▶ Netfilter FORWARD (conntrack check)
                    ▼
                  Netfilter POSTROUTING
                    ▼
                  NIC egress

OUTBOUND (application → wire)

  Userspace App
    ▼
  Socket sendbuf (write/send)
    ▼
  Transport (L4): TCP/UDP segment
    ▼
  IP Layer (L3): route lookup, TTL
    ▼
  Netfilter OUTPUT (iptables OUTPUT)
    ▼
  Netfilter POSTROUTING
    ▼
  TC egress [eBPF]: qdisc / shaping
    ▼  NIC driver (netdev_tx)
  NIC (TX ring buffer)
    ▼
  Physical wire

Path

Receive Path Inbound Path

A packet arriving from the wire goes through hardware receive, early eBPF filtering, kernel buffering, firewall chains, routing, and finally reaches the socket of the waiting application.

Step-by-Step Flow

Physical wire / wifi
  │
  ▼
1. NIC (Network Interface Card)
   • Frame arrives → DMA copies bytes into RX ring buffer
   • NIC raises hardware interrupt
  │
  ▼
2. Interrupt Handler + NAPI
   • CPU wakes, schedules NAPI poll (softirq NET_RX)
   • Polling drains ring buffer in batches, avoiding per-packet
     interrupt overhead at high pps
  │
  ▼
3. XDP (eXpress Data Path) [optional]                  ◀ eBPF / XDP
   • Earliest hook: runs in driver context, before sk_buff
   • Actions: XDP_DROP · XDP_PASS · XDP_TX · XDP_REDIRECT
   • Used for: DDoS drop, LB redirect, raw packet capture
  │ XDP_PASS
  ▼
4. sk_buff Allocation
   • Kernel wraps the packet data in a socket buffer struct
   • All subsequent layers manipulate this struct
  │
  ▼
5. TC Ingress (clsact qdisc) [optional]                ◀ eBPF / TC
   • Second eBPF hook: full sk_buff context available
   • Can classify, filter, redirect, or modify packet
  │
  ▼
6. Netfilter PREROUTING                                ◀ iptables / nftables
   • raw table: before conntrack, used for NOTRACK
   • conntrack: connection lookup runs here, after raw and before mangle
   • mangle table: modify packet fields (TTL, DSCP, etc.)
   • nat table: DNAT (destination NAT, port forwarding)
  │
  ▼
7. IP Layer: Routing Decision
   • Is the destination IP local?
     → Yes: deliver up to transport layer (INPUT path)
     → No:  forward to FORWARD chain, then egress NIC
  │
  ├─ forward ─▶ Netfilter FORWARD: conntrack check (see Forwarding)
  │
  │ local
  ▼
8. Netfilter INPUT
   • conntrack state match, filter / mangle
  │
  ▼
9. Transport Layer (L4)
   • TCP: segment reassembly, ACK generation, flow control
   • UDP: checksum validation, socket demultiplexing
  │
  ▼
10. Socket Receive Buffer
   • Data queued in sk->sk_receive_queue
   • Application blocked in recv()/read() is woken up
  │
  ▼
Userspace Application ← recv() / read() copies data

Key insight: two eBPF entry points XDP fires before sk_buff allocation — zero-copy, maximum throughput. TC fires after allocation — full packet context but slightly more overhead. Most production uses (DDoS, LB) use XDP; policy enforcement and observability typically use TC.

Send Path Outbound Path

When an application writes to a socket, the data travels down through the transport and network layers, passes firewall OUTPUT and TC egress hooks, and is handed to the NIC driver for transmission.

Step-by-Step Flow

Userspace Application
  │ write() / send() / sendmsg()
  ▼
1. Socket Send Buffer
   • Data copied from userspace into sk->sk_write_queue
   • Backpressure via SO_SNDBUF: write() blocks if full
  │
  ▼
2. Transport Layer (L4)
   • TCP: segmentation, sequence numbers, retransmit timer
   • UDP: datagram framing, checksum computation
  │
  ▼
3. IP Layer (L3)
   • Route lookup (FIB): nexthop, egress interface
   • Fragmentation if pkt > MTU
   • IP header construction (src/dst IP, TTL, protocol)
  │
  ▼
4. Netfilter OUTPUT                                    ◀ iptables / nftables
   • raw → mangle → nat (DNAT for locally generated packets) → filter
   • conntrack new entry created for outgoing connection
  │
  ▼
5. Netfilter POSTROUTING                               ◀ iptables / nftables
   • SNAT / MASQUERADE: rewrite source IP for NAT gateways
  │
  ▼
6. TC Egress (clsact qdisc) [optional]                 ◀ eBPF / TC
   • Shaping (rate limiting), policing, redirect
   • eBPF programs can modify or drop packets here
  │
  ▼
7. NIC Driver / TX Queue
   • Packet placed in TX ring buffer
   • DMA transfer → NIC transmits frame onto wire
  │
  ▼
Physical wire / wifi

SNAT vs DNAT placement DNAT (port forwarding, load balancing) happens in PREROUTING — before routing. SNAT / MASQUERADE happens in POSTROUTING — after routing has selected the egress interface. Placing them in the wrong chain is a common iptables bug.

Forward Path Forwarding Path

When the routing decision that follows PREROUTING determines the destination is not local, the packet is forwarded: FORWARD chain → POSTROUTING → egress NIC. This is what a Linux router or NAT gateway does.

Flow

NIC rx → XDP → sk_buff → TC ingress → Netfilter PREROUTING
  │
  ▼ (routing: destination IP is not local)
Netfilter FORWARD                                      ◀ iptables / nftables
   • mangle: modify fields (TTL decrement happens at IP layer)
   • filter: stateful rules, allow/deny forwarded traffic
   • conntrack: validates packet against tracked connections
  │
  ▼
Netfilter POSTROUTING                                  ◀ iptables / nftables
   • MASQUERADE / SNAT for outbound NAT
  │
  ▼
TC egress → NIC TX queue → wire

Enable forwarding:
  sysctl -w net.ipv4.ip_forward=1
  sysctl -w net.ipv6.conf.all.forwarding=1

ip_forward must be enabled The kernel silently drops packets that need forwarding unless net.ipv4.ip_forward=1 is set. Containers, VMs, and VPN gateways all rely on this sysctl.

Hardware NIC & NAPI

The Network Interface Card is where a packet enters or leaves the machine. Modern NICs use DMA (Direct Memory Access) to write frames directly into RAM ring buffers without CPU involvement, then raise an interrupt.

Receive Flow

NIC hardware
  │
  │ 1. Frame arrives at NIC
  │ 2. NIC writes frame bytes to RX ring buffer via DMA
  │    (ring buffer is a circular array of descriptors in RAM)
  │ 3. NIC raises hardware interrupt → CPU's interrupt handler fires
  ▼
NAPI (New API)
  │
  │ 4. Interrupt handler disables further NIC interrupts
  │    and schedules a softirq (NET_RX_SOFTIRQ)
  │ 5. ksoftirqd / NET_RX poll loop: the driver's NAPI poll callback
  │    (napi->poll) is called
  │    → processes up to budget (default 64) packets per run
  │ 6. When ring is drained, re-enable NIC interrupts
  │
  └──▶ packets handed to network stack one by one

Why NAPI? Under high load, handling an interrupt per packet causes "interrupt storms." NAPI switches to polling after the first interrupt, batching many packets per CPU wake-up. This is why a 10 Gbps NIC stays efficient at millions of packets per second.
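The interrupt-then-poll hand-off described above can be sketched in plain userspace C. This is only a toy model under stated assumptions: `rx_ring`, `nic_receive`, `rx_interrupt`, and `rx_poll` are hypothetical names, the ring stores no real frames, and the real kernel API (`napi_schedule()`, `napi_complete_done()`) is not used here.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 256u           /* hypothetical ring size, power of two */

/* Toy RX ring: the "NIC" advances head, the poll loop advances tail. */
struct rx_ring {
    unsigned int head;           /* next slot the NIC will fill */
    unsigned int tail;           /* next slot the driver will consume */
    bool irq_enabled;            /* models the NIC interrupt mask */
};

/* Number of frames waiting in the ring. */
static unsigned int rx_pending(const struct rx_ring *r)
{
    return (r->head - r->tail) & (RING_SIZE - 1);
}

/* Models the NIC writing n frames into the ring via DMA. */
static void nic_receive(struct rx_ring *r, unsigned int n)
{
    assert(rx_pending(r) + n < RING_SIZE);   /* toy model: no overflow path */
    r->head = (r->head + n) & (RING_SIZE - 1);
}

/* Interrupt handler: mask further interrupts and switch to polling. */
static void rx_interrupt(struct rx_ring *r)
{
    r->irq_enabled = false;
}

/*
 * Poll callback: consume at most `budget` frames. If the ring drained
 * before the budget ran out, re-enable interrupts; otherwise stay in
 * polling mode so the caller schedules another pass.
 */
static unsigned int rx_poll(struct rx_ring *r, unsigned int budget)
{
    unsigned int done = 0;

    while (done < budget && rx_pending(r) > 0) {
        r->tail = (r->tail + 1) & (RING_SIZE - 1);   /* "process" one frame */
        done++;
    }
    if (done < budget)
        r->irq_enabled = true;
    return done;
}
```

With 100 frames queued and a budget of 64, the first pass consumes 64 frames and leaves interrupts masked; the second pass consumes the remaining 36 and unmasks them. One interrupt therefore covers the whole burst.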

Key Tunables

Parameter	Where	Effect
`ethtool -G eth0 rx N`	Driver	Enlarge RX ring buffer (reduces drops at burst)
`ethtool -C eth0 rx-usecs N`	Driver	Interrupt coalescing — trade latency for throughput
`/proc/sys/net/core/netdev_budget`	Kernel	Max packets processed across all NAPI instances in one softirq cycle (default 300); the per-device poll weight is 64
`net.core.rmem_max`	sysctl	Max socket receive buffer
`ethtool -K eth0 gro on`	Driver	Generic Receive Offload — merge small segments

Useful Commands

Shell

# Show NIC ring buffer sizes and current settings
ethtool -g eth0

# Show NIC offload settings (GRO, GSO, TSO, …)
ethtool -k eth0

# Show interrupt coalescing settings
ethtool -c eth0

# Show per-queue NIC stats (drops, errors)
ethtool -S eth0

# Show softirq counters per CPU
cat /proc/softirqs | grep NET

eBPF XDP

eXpress Data Path is an eBPF hook that runs inside the NIC driver, before the kernel allocates an sk_buff. It provides the earliest possible packet processing point in the receive path.

XDP Actions

Action	Meaning	Use Case
`XDP_DROP`	Discard the packet immediately	DDoS mitigation, firewall
`XDP_PASS`	Pass to normal network stack	Default — let kernel handle it
`XDP_TX`	Retransmit on same NIC	Inline load balancer, reflection
`XDP_REDIRECT`	Redirect to another NIC or CPU	AF_XDP zero-copy, multi-NIC LB
`XDP_ABORTED`	Error — treated as DROP	Bug detection during development

XDP Modes

Native XDP

Driver implements XDP hook directly. Fastest: the program runs inside the driver's NAPI poll routine, before any sk_buff is allocated. Requires driver support (mlx5, i40e, ixgbe, virtio_net, …).

Generic XDP (SKB mode)

Implemented in the kernel, after sk_buff allocation. Works on any driver but is much slower — mainly for testing.

Offloaded XDP

Program is compiled for and run on the SmartNIC itself. Zero host CPU overhead. Requires a SmartNIC whose driver supports XDP offload (for example Netronome; support varies by NIC and driver).

AF_XDP

Zero-copy path to userspace using XDP_REDIRECT into an AF_XDP socket. Used by DPDK-alternative stacks and custom packet processors.

Attaching an XDP Program

Shell

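# Attach XDP program (native mode if the driver supports it, otherwise
# falls back to generic; use "xdpdrv" instead of "xdp" to force native)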
ip link set dev eth0 xdp obj xdp_prog.o sec xdp

# Attach in generic (SKB) mode — any driver
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp

# Remove XDP program
ip link set dev eth0 xdp off

# Show XDP info
ip link show dev eth0 | grep xdp

# Using bpftool
bpftool net show dev eth0
bpftool prog list

Minimal XDP Program (C)

C (eBPF)

// Drop all IPv4 UDP packets, pass everything else
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_UDP
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every access must be bounds-checked or the verifier rejects the program
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_DROP;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_DROP;
    if (ip->protocol == IPPROTO_UDP)   return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

Kernel sk_buff

The sk_buff (socket buffer) is the central data structure that represents a packet inside the Linux kernel. Every packet that passes through XDP_PASS is wrapped in an sk_buff for its journey through TC, Netfilter, IP, and transport layers.

Structure Layout

sk_buff (kernel struct)

  head        ──▶ start of allocated buffer: [ headroom | L2 | L3 | L4 | data ]
  data        ──▶ current start of data
  tail        ──▶ end of data
  end         ──▶ end of allocated buffer

  len         : total length of packet data
  data_len    : length of paged (frags) data
  protocol    : ETH_P_IP, ETH_P_IPV6, …
  sk          : pointer to owning socket (or NULL)
  dev         : net_device this packet arrived on
  tstamp      : hardware/software receive timestamp
  mark        : fwmark, used by iptables and routing policy
  cb[48]      : control buffer, per-layer scratch space
  _skb_refdst : routing destination cache entry

Header pointers (set as packet walks up the stack):
  skb_mac_header()       → start of Ethernet frame
  skb_network_header()   → start of IP header
  skb_transport_header() → start of TCP/UDP header

Zero-copy operations Many operations (forwarding, cloning, fragmentation) avoid copying data by manipulating the data, tail, and head pointers and using reference counting. skb_clone() shares the data buffer; skb_copy() makes a full copy.
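The pointer manipulation described above can be illustrated with a toy buffer in userspace C. This is a sketch under stated assumptions: `toy_skb` and the `toy_*` helpers are hypothetical names that mirror the intent of the kernel's `skb_reserve()`, `skb_put()`, `skb_push()`, and `skb_pull()`, not their actual implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy buffer with the same four pointers as sk_buff; not the kernel struct. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
};

static int toy_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

/* Reserve headroom on an empty buffer so headers can be prepended later. */
static void toy_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->data == skb->tail && (size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Append len bytes at the tail; returns the start of the new area. */
static unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= len);
    skb->tail += len;
    return old_tail;
}

/* Prepend a header by moving data back into the headroom (send path). */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    return skb->data;
}

/* Strip a header by moving data forward (receive path). No bytes are copied. */
static unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
    return skb->data;
}

static size_t toy_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

Reserving 54 bytes of headroom, appending a 100-byte payload, and then pushing 20 + 20 + 14 bytes of headers brings `data` back to `head` without moving the payload, which is the point of keeping headroom.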

Key Fields Used by eBPF

Field	Type	Description
`skb->mark`	u32	Firewall mark (fwmark) — used for policy routing and iptables matching
`skb->priority`	u32	QoS / TC priority class
`skb->protocol`	u16	EtherType (ETH_P_IP = 0x0800)
`skb->cb`	u8[48]	Per-layer scratch space (TC uses this for metadata)
`skb->tstamp`	ktime_t	Receive/transmit timestamp
`skb->hash`	u32	Flow hash — used by RSS and load balancers

Traffic Control Traffic Control (TC)

The TC subsystem controls how packets are queued and scheduled on an interface. It provides qdiscs (queuing disciplines), classes, and filters — and is also the attachment point for TC eBPF programs via the clsact qdisc.

TC Architecture

Interface (eth0)
│
├── Ingress side (clsact)
│   └── filter → TC eBPF program (ingress)   [after XDP, before Netfilter]
│
└── Egress side
    ├── Root qdisc (e.g. fq_codel, htb, pfifo_fast)
    │   ├── class 1:1 (high priority)
    │   │   └── leaf qdisc (pfifo)
    │   └── class 1:2 (low priority)
    │       └── leaf qdisc (pfifo)
    └── clsact (for TC eBPF egress programs)

Common Qdiscs

Qdisc	Type	Use Case
`pfifo_fast`	Classless	Kernel default (many distros switch to fq_codel): 3-band FIFO based on IP TOS bits
`fq_codel`	Classless	Fair Queue + Controlled Delay — reduces bufferbloat
`htb`	Classful	Hierarchical Token Bucket — rate limiting with bursting
`tbf`	Classless	Token Bucket Filter — simple rate cap
`netem`	Classless	Network emulator — add delay, loss, jitter for testing
`clsact`	Special	eBPF TC attachment point (ingress + egress)
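The rate-limiting idea behind `tbf` and `htb` is a token bucket. The following userspace sketch shows the arithmetic only; `token_bucket`, `tb_init`, and `tb_allow` are hypothetical names, time is passed in explicitly rather than read from a clock, and real qdiscs queue packets instead of simply refusing them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct token_bucket {
    uint64_t rate;      /* refill rate in bytes per second */
    uint64_t burst;     /* bucket capacity in bytes */
    uint64_t tokens;    /* bytes currently available */
    uint64_t last_ns;   /* timestamp of the last refill */
};

static void tb_init(struct token_bucket *tb, uint64_t rate, uint64_t burst,
                    uint64_t now_ns)
{
    assert(rate > 0 && burst > 0);
    tb->rate = rate;
    tb->burst = burst;
    tb->tokens = burst;          /* start full: an initial burst is allowed */
    tb->last_ns = now_ns;
}

/*
 * Refill according to elapsed time, cap at the burst size, then try to
 * spend pkt_len tokens. Returns true if the packet may be sent now.
 * Simplification: fractional tokens lost to integer division are dropped,
 * and elapsed * rate is assumed not to overflow 64 bits.
 */
static bool tb_allow(struct token_bucket *tb, uint64_t pkt_len, uint64_t now_ns)
{
    uint64_t refill = (now_ns - tb->last_ns) * tb->rate / 1000000000ULL;

    tb->last_ns = now_ns;
    tb->tokens += refill;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;

    if (pkt_len > tb->tokens)
        return false;
    tb->tokens -= pkt_len;
    return true;
}
```

The `burst` value plays the same role as the `burst 15k` argument in an HTB class: it bounds how much traffic can leave back to back after an idle period.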

TC Commands

Shell

# Show qdiscs on interface
tc qdisc show dev eth0

# Add 100ms delay + 10ms jitter for testing (netem)
tc qdisc add dev eth0 root netem delay 100ms 10ms

# Rate-limit to 100 Mbps with HTB
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit burst 15k

# Attach eBPF program to TC ingress
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc

# Show TC filters
tc filter show dev eth0 ingress
tc filter show dev eth0 egress

Firewall Netfilter

Netfilter is the Linux kernel framework for packet filtering, NAT, and connection tracking. iptables and nftables are userspace tools that write rules into Netfilter hooks. Rules are organised into tables (raw, mangle, nat, filter) evaluated at five chain hooks.

Five Hook Points

Packet arrives ──▶ [ PREROUTING ] ──▶ Routing ──▶ [ INPUT ] ──▶ Local Process
                                         │
                                         ▼ (forward)
                                    [ FORWARD ]
                                         │
                                         ▼ (joins POSTROUTING)
Local Process ──▶ [ OUTPUT ] ──▶ Routing ──▶ [ POSTROUTING ] ──▶ NIC TX

Table Evaluation Order per Hook

Hook	Table order (iptables)	Typical Use
PREROUTING	raw → mangle → nat	DNAT, NOTRACK, TTL mangle
INPUT	mangle → filter → security → nat	Filter inbound to local process
FORWARD	mangle → filter → security	Filter forwarded packets
OUTPUT	raw → mangle → nat → filter → security	Filter outbound from local process, DNAT of locally generated traffic
POSTROUTING	mangle → nat	SNAT, MASQUERADE
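The table order at each hook follows from numeric priorities: lower values run first. The sketch below sorts the PREROUTING entries by the IPv4 priority constants from the kernel's `linux/netfilter_ipv4.h`; the struct and the `toy_*` function names are hypothetical and only illustrate the ordering rule.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* IPv4 hook priorities as defined in linux/netfilter_ipv4.h. */
enum {
    PRI_RAW       = -300,
    PRI_CONNTRACK = -200,
    PRI_MANGLE    = -150,
    PRI_NAT_DST   = -100,
    PRI_FILTER    = 0,
    PRI_SECURITY  = 50,
    PRI_NAT_SRC   = 100
};

/* Hypothetical record for one table registered at a hook. */
struct toy_hook_entry {
    const char *name;
    int priority;
};

static int by_priority(const void *a, const void *b)
{
    const struct toy_hook_entry *x = a, *y = b;

    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Sort entries into the order in which they would be evaluated. */
static void toy_sort_hook(struct toy_hook_entry *e, size_t n)
{
    assert(e != NULL);
    qsort(e, n, sizeof *e, by_priority);
}

/* Position of `name` in the list, or -1 if it is not registered. */
static int toy_position(const struct toy_hook_entry *e, size_t n,
                        const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(e[i].name, name) == 0)
            return (int)i;
    return -1;
}
```

Sorting `nat`, `mangle`, `conntrack`, and `raw` for PREROUTING yields raw → conntrack → mangle → nat, which is why a NOTRACK rule in the raw table can exempt a packet before connection tracking sees it.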

Common iptables Commands

Shell

# List all rules with packet/byte counters
iptables -L -n -v

# List rules in a specific table
iptables -t nat -L -n -v

# Drop all traffic from an IP
iptables -A INPUT -s 1.2.3.4 -j DROP

# Allow established/related (stateful firewall)
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Port forwarding: redirect :8080 → 192.168.1.10:80 (DNAT)
iptables -t nat -A PREROUTING -p tcp --dport 8080 \
         -j DNAT --to-destination 192.168.1.10:80

# Masquerade for NAT gateway
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Trace a packet through all chains (use TRACE in raw table)
iptables -t raw -A PREROUTING -s 1.2.3.4 -j TRACE
modprobe nf_log_ipv4
dmesg | grep TRACE

nftables vs iptables Modern Linux (kernel ≥ 3.13) ships with nftables as the successor to iptables. It uses a single table/chain model without hard-coded hooks, has atomic ruleset updates, and is significantly faster at large rule counts. On distros like Debian 12+ and RHEL 9, iptables is now a compatibility shim over nftables.

Firewall Connection Tracking (conntrack)

conntrack is the Netfilter subsystem that tracks the state of network connections. Every packet not exempted in the raw table is looked up in the connection tracking table at PREROUTING (or at OUTPUT for locally generated traffic), before the mangle, nat, and filter tables run. This enables stateful firewalling (ESTABLISHED / RELATED rules) and NAT.

Connection States

State	Meaning
`NEW`	First packet of a new connection (e.g. TCP SYN)
`ESTABLISHED`	Connection is established — packets flowing both ways
`RELATED`	New connection related to an existing one (e.g. FTP data channel)
`INVALID`	Packet cannot be associated with a valid connection state (malformed, out of window, or an unexpected ACK with no entry); usually dropped
`UNTRACKED`	Explicitly excluded from tracking via NOTRACK rule

TCP State Machine in conntrack

SYN                → NEW
SYN + SYN-ACK      → ESTABLISHED
Established data   → ESTABLISHED
FIN/RST            → TIME_WAIT → entry expires & is deleted
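The first two transitions above can be sketched as a small classifier. This is a deliberately simplified model with hypothetical names (`toy_conn`, `toy_classify`): it assumes strict tracking, so a non-SYN packet without an entry is INVALID, and it ignores timeouts, teardown, and window checks that the real tracker performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ctstate { CT_NEW, CT_ESTABLISHED, CT_INVALID };
enum ct_dir  { DIR_ORIGINAL, DIR_REPLY };

/* Toy connection entry: does it exist, and has the peer answered yet? */
struct toy_conn {
    bool exists;
    bool seen_reply;
};

/*
 * Classify one TCP packet against the entry and update the entry.
 * A connection stays NEW until traffic has been seen in the reply
 * direction; from then on every packet is ESTABLISHED.
 */
static enum ctstate toy_classify(struct toy_conn *c, enum ct_dir dir,
                                 bool syn, bool ack)
{
    assert(c != NULL);

    if (!c->exists) {
        if (syn && !ack) {       /* initial SYN creates the entry */
            c->exists = true;
            return CT_NEW;
        }
        return CT_INVALID;       /* strict mode: no mid-stream pickup */
    }
    if (dir == DIR_REPLY)
        c->seen_reply = true;    /* e.g. the SYN-ACK */
    return c->seen_reply ? CT_ESTABLISHED : CT_NEW;
}
```

A retransmitted SYN is still NEW because no reply has been seen; the SYN-ACK itself and everything after it is ESTABLISHED, which is what an `--ctstate ESTABLISHED,RELATED` rule relies on.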

Commands

Shell

# List all tracked connections
conntrack -L

# Watch the conntrack table in real time
conntrack -E

# Show conntrack table statistics
conntrack -S

# Show current conntrack table size and max
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Flush the conntrack table (use with care!)
conntrack -F

# Delete a specific entry
conntrack -D -s 1.2.3.4 -d 5.6.7.8

conntrack table exhaustion High-traffic servers (load balancers, gateways) can exhaust nf_conntrack_max, causing all new connections to be dropped with "nf_conntrack: table full, dropping packet" in dmesg. Increase the limit or disable conntrack for UDP/high-throughput flows with NOTRACK.

Network (L3) IP Layer

The IP layer performs routing, fragmentation/reassembly, and TTL management. The kernel's Forwarding Information Base (FIB) is the routing table — a lookup here determines which interface and next-hop a packet uses.

Routing Lookup (Inbound)

Incoming packet (after PREROUTING)
  │
  ▼
fib_lookup(dst_ip)
  │
  ▼
Is dst_ip assigned to a local interface? (lo, eth0, docker0, …)
  │
  ├─ yes ─▶ LOCAL_IN path
  │           → Netfilter INPUT
  │           → Transport layer
  │
  └─ no ──▶ FORWARD path
              → Netfilter FORWARD
              → Netfilter POSTROUTING
              → Egress NIC
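The lookup step is a longest-prefix match. The following userspace sketch uses a linear scan over a tiny table, which is not how the kernel's FIB is organised; `toy_route`, `ip4`, and `toy_fib_lookup` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy route entry; field names are illustrative, not the kernel's. */
struct toy_route {
    uint32_t prefix;     /* network address, host byte order */
    int plen;            /* prefix length, 0..32 */
    bool local;          /* true: address is owned by this host */
    const char *dev;     /* egress or owning interface */
};

static uint32_t ip4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

static uint32_t prefix_mask(int plen)
{
    assert(plen >= 0 && plen <= 32);
    return plen == 0 ? 0 : UINT32_MAX << (32 - plen);
}

/* Longest-prefix match by linear scan; NULL means "no route to host". */
static const struct toy_route *toy_fib_lookup(const struct toy_route *tbl,
                                              size_t n, uint32_t dst)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((dst & prefix_mask(tbl[i].plen)) != tbl[i].prefix)
            continue;                      /* prefix does not cover dst */
        if (!best || tbl[i].plen > best->plen)
            best = &tbl[i];                /* more specific route wins */
    }
    return best;
}
```

With a default route, a 10.0.0.0/8 route, and a local /32 for 10.0.0.1, a packet to 10.0.0.1 matches all three but the /32 wins, so it takes the LOCAL_IN path; 10.2.3.4 is forwarded via the /8.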

Useful Commands

Shell

# Show routing table
ip route show
ip route show table all

# Which route/interface would be used for a destination?
ip route get 8.8.8.8

# Show ARP / neighbor cache
ip neigh show

# Show policy routing rules
ip rule show

# Show interface addresses
ip addr show

# Trace route to a host
traceroute 8.8.8.8
mtr --report 8.8.8.8

Fragmentation

MTU and fragmentation If a packet exceeds the MTU of the egress interface, IP either fragments it (if DF=0) or drops it and sends an ICMP "Fragmentation Needed" back (if DF=1). Fragmentation is expensive. Use ip link show to check MTU, and ping -M do -s 1400 host to probe path MTU.
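The cost of fragmentation is easier to see with the arithmetic written out. This sketch assumes a 20-byte IPv4 header without options; the function names are hypothetical. Every fragment except the last must carry a payload that is a multiple of 8 bytes, because the offset field counts 8-byte units.

```c
#include <assert.h>

#define IPV4_HDR_LEN 20u   /* assumption: header without options */

/* Payload bytes carried by every fragment except possibly the last. */
static unsigned int frag_payload(unsigned int mtu)
{
    assert(mtu >= IPV4_HDR_LEN + 8);
    return ((mtu - IPV4_HDR_LEN) / 8) * 8;
}

/* Number of packets needed to carry payload_len bytes of L4 data. */
static unsigned int frag_count(unsigned int payload_len, unsigned int mtu)
{
    unsigned int per_frag = frag_payload(mtu);

    if (payload_len + IPV4_HDR_LEN <= mtu)
        return 1;                                  /* fits unfragmented */
    return (payload_len + per_frag - 1) / per_frag;   /* ceiling division */
}

/* Value of the 13-bit fragment offset field for fragment `index` (0-based). */
static unsigned int frag_offset_field(unsigned int index, unsigned int mtu)
{
    return index * frag_payload(mtu) / 8;
}
```

For a 3980-byte payload on a 1500-byte MTU, each full fragment carries 1480 bytes, so three packets are sent (1480 + 1480 + 1020) with offset fields 0, 185, and 370. Losing any one of them forces the whole datagram to be resent.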

Transport (L4) Transport Layer — TCP & UDP

The transport layer multiplexes data between processes using ports and handles reliability (TCP) or connectionless delivery (UDP). Incoming packets are demultiplexed to the correct socket by looking up the 4-tuple (src IP, src port, dst IP, dst port).

TCP Receive Path

sk_buff arrives at TCP layer
  │
  ▼
tcp_rcv_established() or tcp_rcv_state_process()
  │
  ├── Validate sequence number (in-window?)
  ├── Out-of-order? → queue in ofo_queue
  ├── In-order?     → append to sk_receive_queue
  ├── Send ACK (delayed or immediate)
  └── Wake up blocked recv() call if data available
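The sequence-number checks in the flow above depend on comparisons that survive 32-bit wraparound. The sketch below shows that comparison and a simplified classification; `seq_before` mirrors the idea of the kernel's `before()` helper, while `seg_class` and `classify_segment` are hypothetical names and the logic omits many cases the real receive path handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if sequence number a comes before b, allowing for wraparound. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

enum seg_class { SEG_IN_ORDER, SEG_OUT_OF_ORDER, SEG_DUPLICATE, SEG_OUT_OF_WINDOW };

/*
 * rcv_nxt: next byte the receiver expects; rcv_wnd: advertised window.
 * seq/len: first byte and payload length of the arriving segment.
 */
static enum seg_class classify_segment(uint32_t rcv_nxt, uint32_t rcv_wnd,
                                       uint32_t seq, uint32_t len)
{
    assert(len > 0);

    if (!seq_before(rcv_nxt, seq + len))
        return SEG_DUPLICATE;        /* ends at or before rcv_nxt */
    if (!seq_before(seq, rcv_nxt + rcv_wnd))
        return SEG_OUT_OF_WINDOW;    /* starts beyond the window */
    if (!seq_before(rcv_nxt, seq))
        return SEG_IN_ORDER;         /* starts at (or overlaps) rcv_nxt */
    return SEG_OUT_OF_ORDER;         /* leaves a gap: park it for later */
}
```

The cast to a signed 32-bit value is what makes the comparison work when the sequence space wraps from 0xFFFFFFFF back to 0.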

TCP Congestion Control

Shell

# Show available congestion control algorithms
cat /proc/sys/net/ipv4/tcp_available_congestion_control

# Show currently active algorithm
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch to BBR (Bottleneck Bandwidth and RTT)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Show TCP stats
ss -tin
nstat -az | grep Tcp

Socket Demultiplexing

Incoming TCP segment: src=1.2.3.4:54321 dst=10.0.0.1:80
  │
  ▼
Lookup in socket hashtable:
  key = (protocol=TCP, src=1.2.3.4, sport=54321, dst=10.0.0.1, dport=80)
  │
  ├── Match found → deliver to socket recv buffer
  ├── No exact match → look for a listening socket on dst=10.0.0.1:80
  └── No match at all → send TCP RST
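The two-stage lookup can be sketched in userspace C. Assumptions: linear scans stand in for the kernel's hash tables, and every name here (`four_tuple`, `toy_listener`, `toy_demux`) is hypothetical. Addresses are written as host-order hex for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct four_tuple {
    uint32_t saddr; uint16_t sport;
    uint32_t daddr; uint16_t dport;
};

/* Listening socket; addr 0 stands for the wildcard address 0.0.0.0. */
struct toy_listener {
    uint32_t addr;
    uint16_t port;
};

enum demux_verdict { DEMUX_ESTABLISHED, DEMUX_LISTENER, DEMUX_RST };

static int tuple_eq(const struct four_tuple *a, const struct four_tuple *b)
{
    return a->saddr == b->saddr && a->sport == b->sport &&
           a->daddr == b->daddr && a->dport == b->dport;
}

/* Exact 4-tuple match first, then a listener on the destination, else RST. */
static enum demux_verdict toy_demux(const struct four_tuple *est, size_t n_est,
                                    const struct toy_listener *lst, size_t n_lst,
                                    const struct four_tuple *pkt)
{
    assert(pkt != NULL);

    for (size_t i = 0; i < n_est; i++)
        if (tuple_eq(&est[i], pkt))
            return DEMUX_ESTABLISHED;

    for (size_t i = 0; i < n_lst; i++)
        if (lst[i].port == pkt->dport &&
            (lst[i].addr == 0 || lst[i].addr == pkt->daddr))
            return DEMUX_LISTENER;

    return DEMUX_RST;
}
```

Because the exact match is tried first, many connections can share destination port 80 and still reach different sockets; only a segment that matches neither table is answered with a reset.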

Key Tunables

sysctl	Default	Effect
`net.core.somaxconn`	4096	Max accept queue backlog per socket
`net.ipv4.tcp_max_syn_backlog`	1024	SYN queue depth (incomplete handshakes)
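`net.ipv4.tcp_fin_timeout`	60	FIN_WAIT_2 timeout for orphaned sockets, in seconds (the TIME_WAIT interval itself is fixed at 60 s)
`net.ipv4.tcp_tw_reuse`	2	Reuse TIME_WAIT sockets for new outgoing connections (0 = off, 1 = on, 2 = loopback only; older kernels default to 0)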
`net.core.rmem_max`	212992	Max socket receive buffer size
`net.core.wmem_max`	212992	Max socket send buffer size

Userspace Interface Sockets & Socket Buffers

A socket is the kernel object that bridges the network stack and userspace. Two queues, the receive buffer and the send buffer, decouple the application from the network rate.

Socket Buffer Model

Userspace                            Kernel

Application ◀── recv() / read() ──── Socket Receive Buffer
                                       sk_receive_queue (sk_rcvbuf bytes max)
                                       ▲ filled by TCP/UDP

Application ─── send() / write() ──▶ Socket Send Buffer
                                       sk_write_queue (sk_sndbuf bytes max)
                                       │ drained → NIC, paced by TCP ACKs / flow control
                                       ▼

Socket Types

Type	Family	Description
`SOCK_STREAM`	AF_INET / AF_INET6	TCP — reliable, ordered byte stream
`SOCK_DGRAM`	AF_INET / AF_INET6	UDP — connectionless datagrams
`SOCK_RAW`	AF_INET	Raw IP — bypass TCP/UDP, build headers manually
`SOCK_RAW`	AF_PACKET	Raw Ethernet frames, used by tcpdump (the old `SOCK_PACKET` type is obsolete)
`SOCK_RAW`	AF_XDP	Zero-copy XDP socket for kernel-bypass packet processing

Useful Commands

Shell

# Show all TCP sockets with internal TCP info (-i) and owning process (-p); add -o for timers
ss -tinp

# Show UDP sockets
ss -unp

# Show listen sockets with backlog
ss -tlnp

# Show socket memory usage
ss -m

# Show per-socket stats including retransmits
ss -tin dst 10.0.0.1

eBPF eBPF Hook Points

eBPF programs can be attached to dozens of hook points across the network stack. The choice of hook determines what data is visible and what actions are possible.

Network Hook Map

RECEIVE SIDE

  NIC RX
    ▼
  XDP (driver/hw)          ← eBPF: xdp          XDP_DROP/PASS/TX
    ▼
  TC ingress (clsact)      ← eBPF: sched_cls
    ▼
  Netfilter PREROUTING     ← iptables/nf
    ▼
  IP routing
    ├─▶ FORWARD (route)
    └─▶ INPUT (local)
          ▼
        Transport          ← eBPF: sk_skb (stream parser/verdict)
          ▼
        Socket             ← eBPF: sock_ops, cgroup/skb, sk_msg

TRANSMIT SIDE

  Socket sendmsg           ← eBPF: sock_ops
    ▼
  Transport (L4)
    ▼
  Netfilter OUTPUT         ← nftables / iptables
    ▼
  TC egress (clsact)       ← eBPF: sched_cls    TC_ACT_SHOT/TC_ACT_OK
    ▼
  NIC TX

Hook Summary

Hook	Program Type	Attach Point	Key Capability
XDP	`BPF_PROG_TYPE_XDP`	NIC driver	Fastest drop/redirect, pre-sk_buff
TC ingress/egress	`BPF_PROG_TYPE_SCHED_CLS`	clsact qdisc	Full sk_buff access, header rewrite
Netfilter	`BPF_PROG_TYPE_NETFILTER`	Netfilter hooks	Run a BPF program at a chosen Netfilter hook and priority (kernel 6.4+)
Socket ops	`BPF_PROG_TYPE_SOCK_OPS`	TCP events	Observe TCP state, tune socket params
Socket filter	`BPF_PROG_TYPE_SOCKET_FILTER`	Individual socket	Filter packets per-socket (tcpdump)
cgroup/skb	`BPF_PROG_TYPE_CGROUP_SKB`	cgroup v2	Per-cgroup ingress/egress policy
sk_skb	`BPF_PROG_TYPE_SK_SKB`	sockmap	Stream parser — L7 load balancing

eBPF XDP Programs

When to use XDP vs TC Use XDP when you need maximum throughput and minimum latency on receive: DDoS drops, load balancer redirects, hardware offload. Use TC when you need sk_buff metadata (mark, priority, socket association) or need to work on egress, since XDP attaches on the receive side only. Both can rewrite packet headers.

bpftrace — Quick XDP Tracing

bpftrace

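# Count XDP exceptions (aborted or failed actions) by action code
bpftrace -e 'tracepoint:xdp:xdp_exception { @[args->act] = count(); }'

# Count redirects handled by generic (SKB-mode) XDP
bpftrace -e 'kprobe:xdp_do_generic_redirect { @generic_redirects = count(); }'

# Per-interface counters (plain shell, not bpftrace); drivers that
# export XDP-specific counters show them in ethtool -S
ip -s link show dev eth0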

eBPF TC Programs

Minimal TC Program (C)

C (eBPF)

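// Tag all IPv4 TCP packets with fwmark=1 on ingress
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_TCP
#include <linux/ip.h>
#include <linux/pkt_cls.h>     // TC_ACT_OK
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("tc")
int tc_mark_tcp(struct __sk_buff *skb) {
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    // Check the EtherType before treating the payload as IPv4
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return TC_ACT_OK;
    if (ip->protocol == IPPROTO_TCP)
        skb->mark = 1;

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";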

eBPF Socket Programs

sock_ops programs fire on TCP socket events (connection established, retransmit, RTT measurement). They are used for per-connection observability and to set TCP socket options automatically based on policy.

bpftrace

# Trace new TCP connections with source/dest
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    printf("connect: %s → %s\n",
        ntop($sk->__sk_common.skc_rcv_saddr),
        ntop($sk->__sk_common.skc_daddr));
}'

# Trace TCP retransmits
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'

# Trace socket accept latency (successful accepts only)
bpftrace -e 'kprobe:inet_csk_accept { @start[tid] = nsecs; }
             kretprobe:inet_csk_accept /@start[tid]/ {
                 if (retval) { @latency_us = hist((nsecs - @start[tid]) / 1000); }
                 delete(@start[tid]); }'

Reference Netfilter Chains

Quick-reference showing which chains fire for each packet scenario.

Scenario	Chains Traversed (in order)
Packet arriving for local process	PREROUTING → INPUT
Packet generated by local process	OUTPUT → POSTROUTING
Packet being forwarded	PREROUTING → FORWARD → POSTROUTING
DNAT (port forwarding, inbound)	PREROUTING (nat) → FORWARD → POSTROUTING
SNAT / MASQUERADE (outbound NAT)	OUTPUT → POSTROUTING (nat) for local traffic; PREROUTING → FORWARD → POSTROUTING (nat) on a gateway
Docker container traffic (bridge)	PREROUTING → FORWARD → POSTROUTING

Reference Packet Structure (Ethernet / IPv4 / TCP)

──────────────────────────────────────────────────────────────────
 Ethernet Header (14 bytes; field widths in bytes)
   dst_mac[6] src_mac[6] ethertype[2] (0x0800=IPv4, 0x86DD=IPv6)
──────────────────────────────────────────────────────────────────
 IPv4 Header (20 bytes min; field widths in bits)
   version[4] ihl[4] dscp[6] ecn[2] total_len[16]
   id[16] flags[3] frag_offset[13]
   ttl[8] protocol[8] (6=TCP, 17=UDP, 1=ICMP) checksum[16]
   src_ip[32] dst_ip[32]
──────────────────────────────────────────────────────────────────
 TCP Header (20 bytes min; field widths in bits)
   src_port[16] dst_port[16]
   seq_num[32]
   ack_num[32]
   data_offset[4] reserved[3] flags[9] (SYN,ACK,FIN,RST,PSH,…)
   window[16] checksum[16] urgent_ptr[16]
──────────────────────────────────────────────────────────────────
 Application Data (payload)
──────────────────────────────────────────────────────────────────
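The layout above can be walked with a small userspace parser. This is a sketch with hypothetical names (`parsed_hdr`, `parse_frame`); it reads fields byte by byte so it does not depend on host endianness or alignment, honours the IPv4 `ihl` field, and handles only untagged Ethernet carrying IPv4 and TCP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct parsed_hdr {
    uint16_t ethertype;
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t sport, dport;
};

/* Read a 16-bit big-endian (network order) value. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Returns 0 on success, -1 if the frame is truncated, -2 if it is not
 * IPv4, -3 if it is not TCP. Every read is preceded by a length check.
 */
static int parse_frame(const uint8_t *frame, size_t len, struct parsed_hdr *out)
{
    assert(frame != NULL && out != NULL);

    if (len < 14)
        return -1;
    out->ethertype = be16(frame + 12);
    if (out->ethertype != 0x0800)
        return -2;

    const uint8_t *ip = frame + 14;
    if (len < 14 + 20)
        return -1;
    if ((ip[0] >> 4) != 4)
        return -2;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;    /* header length in bytes */
    if (ihl < 20 || len < 14 + ihl)
        return -1;
    out->ttl = ip[8];
    out->protocol = ip[9];
    if (out->protocol != 6)
        return -3;

    const uint8_t *tcp = ip + ihl;
    if (len < 14 + ihl + 20)
        return -1;
    out->sport = be16(tcp);
    out->dport = be16(tcp + 2);
    return 0;
}
```

The length checks play the same role as the `data_end` comparisons in an eBPF program: no field is read until the buffer is known to contain it.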

Parse a Packet with tcpdump

Shell

# Capture with full header decode
tcpdump -i eth0 -v -nn

# Show raw hex + ASCII
tcpdump -i eth0 -XX -nn

# Filter TCP SYN packets
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

# Capture to file for Wireshark
tcpdump -i eth0 -w capture.pcap

# Show ICMP only
tcpdump -i eth0 icmp -nn

Reference Useful Tools

Tool	Layer	Purpose
`tcpdump`	L2–L7	Packet capture and decode — gold standard for raw packet inspection
`wireshark / tshark`	L2–L7	GUI / CLI packet analysis, protocol dissection
`ss`	L4	Socket statistics — replaces `netstat`, shows TCP state, buffers, retransmits
`ip`	L2–L3	Interface, address, route, ARP management
`conntrack`	Netfilter	View and manage the connection tracking table
`iptables / nft`	Netfilter	Firewall, NAT, packet mangling rules
`tc`	L2–L3	Traffic shaping, qdisc/filter management, TC eBPF attachment
`bpftool`	eBPF	Inspect loaded eBPF programs and maps
`bpftrace`	eBPF	Dynamic tracing — one-liners for kernel network events
`ethtool`	NIC	NIC settings: ring size, offload, coalescing, stats
`nstat / netstat`	L3–L4	Protocol-level counters (retransmits, drops, errors)
`ping / mtr`	L3	ICMP reachability and path MTU testing

Reference Tuning Knobs

Common sysctl parameters that affect packet throughput, latency, and connection capacity.

Shell — sysctl

# Socket buffer sizes (bytes)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase connection capacity
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Increase conntrack table size
sysctl -w net.netfilter.nf_conntrack_max=2097152

# Enable TCP BBR (better congestion control)
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq

# Shorten the FIN_WAIT_2 timeout and allow TIME_WAIT reuse for outgoing connections
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_tw_reuse=1

# NAPI polling budget
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=8000

Persist settings sysctl -w changes are lost on reboot. Write them to /etc/sysctl.d/99-network.conf and run sysctl --system to apply permanently.