Linux Network Stack — Packet Journey
How a packet travels through the Linux kernel: NIC → XDP → sk_buff → TC → Netfilter → sockets, inbound and outbound
INBOUND (wire → application)

  Physical wire
      │
      ▼
  NIC (DMA → ring buffer)
      │  Interrupt / NAPI
      ▼
  XDP [eBPF hook]: DROP / PASS / REDIRECT
      │  sk_buff alloc
      ▼
  TC ingress [eBPF]: clsact / filter
      │
      ▼
  Netfilter PREROUTING (DNAT, raw table)
      │
      ▼
  IP Layer (L3): routing decision
      │
      ├── local? ────▶ Netfilter INPUT (conntrack, filter)
      │                    ▼
      │                Transport (L4): TCP reassembly
      │                    ▼
      │                Socket recvbuf (recv/read)
      │                    ▼
      │                Userspace App
      │
      └── forward? ──▶ Netfilter FORWARD (conntrack check)
                           ▼
                       Netfilter POSTROUTING
                           ▼
                       NIC egress

OUTBOUND (application → wire)

  Userspace App
      │
      ▼
  Socket sendbuf (write/send)
      │
      ▼
  Transport (L4): TCP/UDP segment
      │
      ▼
  IP Layer (L3): route lookup, TTL
      │
      ▼
  Netfilter OUTPUT (iptables OUTPUT)
      │
      ▼
  Netfilter POSTROUTING
      │
      ▼
  TC egress [eBPF]: qdisc / shaping
      │  NIC driver (netdev_tx)
      ▼
  NIC (TX ring buffer)
      │
      ▼
  Physical wire
Receive Path Inbound Path

A packet arriving from the wire goes through hardware receive, early eBPF filtering, kernel buffering, firewall chains, routing, and finally reaches the socket of the waiting application.

Step-by-Step Flow

Physical wire / wifi
  │
  ▼
1. NIC (Network Interface Card)
     • Frame arrives → DMA copies bytes into RX ring buffer
     • NIC raises hardware interrupt
  │
  ▼
2. Interrupt Handler + NAPI
     • CPU wakes, schedules NAPI poll (softirq NET_RX)
     • Polling drains ring buffer in batches, avoiding per-packet
       interrupt overhead at high pps
  │
  ▼
3. XDP (eXpress Data Path) [optional]                    ◀ eBPF / XDP
     • Earliest hook: runs in driver context, before sk_buff
     • Actions: XDP_DROP · XDP_PASS · XDP_TX · XDP_REDIRECT
     • Used for: DDoS drop, LB redirect, raw packet capture
  │ XDP_PASS
  ▼
4. sk_buff Allocation
     • Kernel wraps the packet data in a socket buffer struct
     • All subsequent layers manipulate this struct
  │
  ▼
5. TC Ingress (clsact qdisc) [optional]                  ◀ eBPF / TC
     • Second eBPF hook: full sk_buff context available
     • Can classify, filter, redirect, or modify packet
  │
  ▼
6. Netfilter PREROUTING                                  ◀ iptables / nftables
     • raw table: before conntrack, used for NOTRACK
     • mangle table: modify packet fields (TTL, DSCP, etc.)
     • nat table: DNAT (destination NAT, port forwarding)
  │
  ▼
7. IP Layer: Routing Decision
     • Is the destination IP local?
       → Yes: deliver up to transport layer (INPUT path)
       → No: forward to FORWARD chain then egress NIC
  │
  ├── forward ──▶ Netfilter FORWARD: conntrack check (see Forwarding)
  │
  │ local
  ▼
8. Netfilter INPUT
     conntrack lookup, filter/mangle
  │
  ▼
9. Transport Layer (L4)
     TCP: segment reassembly, ACK generation, flow control
     UDP: checksum validation, socket demultiplexing
  │
  ▼
10. Socket Receive Buffer
     Data queued in sk->sk_receive_queue
     Application blocked in recv()/read() is woken up
  │
  ▼
Userspace Application ← recv() / read() copies data
Key insight: two eBPF entry points. XDP fires before sk_buff allocation, so it has the lowest per-packet cost. TC fires after allocation: full packet context but slightly more overhead. High-rate drop and redirect workloads (DDoS mitigation, load balancing) typically use XDP; policy enforcement and observability typically use TC.
Send Path Outbound Path

When an application writes to a socket, the data travels down through the transport and network layers, passes firewall OUTPUT and TC egress hooks, and is handed to the NIC driver for transmission.

Step-by-Step Flow

Userspace Application
  │ write() / send() / sendmsg()
  ▼
1. Socket Send Buffer
     Data copied from userspace into sk->sk_write_queue
     Backpressure via SO_SNDBUF: write() blocks if full
  │
  ▼
2. Transport Layer (L4)
     TCP: segmentation, sequence numbers, retransmit timer
     UDP: datagram framing, checksum computation
  │
  ▼
3. IP Layer (L3)
     Route lookup (FIB): nexthop, egress interface
     IP header construction (src/dst IP, TTL, protocol)
     Fragmentation if pkt > MTU (performed after POSTROUTING, just before the packet is handed to the device)
  │
  ▼
4. Netfilter OUTPUT                                      ◀ iptables / nftables
     raw → mangle → nat (DNAT for locally generated traffic) → filter
     conntrack: new entry created for an outgoing connection
  │
  ▼
5. Netfilter POSTROUTING                                 ◀ iptables / nftables
     SNAT / MASQUERADE: rewrite source IP for NAT gateways
  │
  ▼
6. TC Egress (clsact qdisc) [optional]                   ◀ eBPF / TC
     Shaping (rate limiting), policing, redirect
     eBPF programs can modify or drop packets here
  │
  ▼
7. NIC Driver / TX Queue
     Packet placed in TX ring buffer
     DMA transfer → NIC transmits frame onto wire
  │
  ▼
Physical wire / wifi
SNAT vs DNAT placement DNAT (port forwarding, load balancing) happens in PREROUTING — before routing. SNAT / MASQUERADE happens in POSTROUTING — after routing has selected the egress interface. Placing them in the wrong chain is a common iptables bug.
Forward Path Forwarding Path

When the routing decision that follows PREROUTING determines the destination is not local, the packet is forwarded: FORWARD chain → POSTROUTING → egress NIC. This is what a Linux router or NAT gateway does.

Flow

NIC rx → XDP → sk_buff → TC ingress → Netfilter PREROUTING
  │
  ▼ (routing: destination IP is not local)
Netfilter FORWARD                                        ◀ iptables / nftables
     mangle: modify fields (TTL decrement happens at IP layer)
     filter: stateful rules, allow/deny forwarded traffic
     conntrack: validates packet against tracked connections
  │
  ▼
Netfilter POSTROUTING                                    ◀ iptables / nftables
     MASQUERADE / SNAT for outbound NAT
  │
  ▼
TC egress → NIC TX queue → wire

Enable forwarding:
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1
ip_forward must be enabled The kernel silently drops packets that need forwarding unless net.ipv4.ip_forward=1 is set. Containers, VMs, and VPN gateways all rely on this sysctl.
Hardware NIC & NAPI

The Network Interface Card is where a packet enters or leaves the machine. Modern NICs use DMA (Direct Memory Access) to write frames directly into RAM ring buffers without CPU involvement, then raise an interrupt.

Receive Flow

NIC hardware
  1. Frame arrives at NIC
  2. NIC writes frame bytes to RX ring buffer via DMA
     (ring buffer is a circular array of descriptors in RAM)
  3. NIC raises hardware interrupt → CPU's interrupt handler fires
  │
  ▼
NAPI (New API)
  4. Interrupt handler disables further NIC interrupts
     and schedules a softirq (NET_RX_SOFTIRQ)
  5. NET_RX softirq (or ksoftirqd under load): driver's NAPI poll callback (napi->poll) is called
     → processes up to budget (default 64) packets per run
  6. When ring is drained, re-enable NIC interrupts
  │
  └──▶ packets handed to network stack one by one
Why NAPI? Under high load, handling an interrupt per packet causes "interrupt storms." NAPI switches to polling after the first interrupt, batching many packets per CPU wake-up. This is why a 10 Gbps NIC stays efficient at millions of packets per second.
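
The batching effect can be sketched in plain userspace C. This is a toy model, not kernel code: the `toy_*` names and the struct are invented for illustration, and the only behaviour borrowed from NAPI is "mask the interrupt, poll up to a budget, re-arm once a poll comes back under budget".

```c
#include <stddef.h>

/* Toy model of one NAPI instance; names are illustrative, not kernel API. */
struct toy_ring {
    size_t pending;      /* frames the NIC has DMA'd but not yet processed */
    int    irq_enabled;  /* 1 while the NIC may raise RX interrupts */
};

/* Hard IRQ handler: mask further interrupts and defer work to polling. */
void toy_irq(struct toy_ring *r) { r->irq_enabled = 0; }

/* One poll call: process at most `budget` frames.
 * A poll that uses less than its budget means the ring is drained,
 * so interrupts are re-armed. */
size_t toy_poll(struct toy_ring *r, size_t budget)
{
    size_t done = r->pending < budget ? r->pending : budget;
    r->pending -= done;
    if (done < budget)
        r->irq_enabled = 1;
    return done;
}

/* Count the poll calls needed to drain a burst (budget must be > 0). */
size_t toy_polls_for_burst(size_t frames, size_t budget)
{
    struct toy_ring r = { frames, 1 };
    size_t polls = 0;

    toy_irq(&r);                 /* a single interrupt starts the cycle */
    while (!r.irq_enabled) {
        toy_poll(&r, budget);
        polls++;
    }
    return polls;
}
```

With a budget of 64, a burst of 1000 frames costs one interrupt and 16 poll calls in this model, rather than 1000 interrupts.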

Key Tunables

Parameter | Where | Effect
ethtool -G eth0 rx N | Driver | Enlarge RX ring buffer (reduces drops during bursts)
ethtool -C eth0 rx-usecs N | Driver | Interrupt coalescing: trade latency for throughput
/proc/sys/net/core/netdev_budget | Kernel | Max packets processed per NET_RX softirq run, across all NAPI instances (default 300); each instance's poll is capped by its own weight (default 64)
net.core.rmem_max | sysctl | Max socket receive buffer
ethtool -K eth0 gro on | Driver | Generic Receive Offload: merge small segments

Useful Commands

Shell
# Show NIC ring buffer sizes and current settings
ethtool -g eth0

# Show NIC offload settings (GRO, GSO, TSO, …)
ethtool -k eth0

# Show interrupt coalescing settings
ethtool -c eth0

# Show per-queue NIC stats (drops, errors)
ethtool -S eth0

# Show softirq counters per CPU
cat /proc/softirqs | grep NET
eBPF XDP

eXpress Data Path is an eBPF hook that runs inside the NIC driver, before the kernel allocates an sk_buff. It provides the earliest possible packet processing point in the receive path.

XDP Actions

Action | Meaning | Use Case
XDP_DROP | Discard the packet immediately | DDoS mitigation, firewall
XDP_PASS | Pass to normal network stack | Default: let kernel handle it
XDP_TX | Transmit the (possibly modified) packet back out the NIC it arrived on | Inline load balancer, reflection
XDP_REDIRECT | Redirect to another NIC, another CPU, or an AF_XDP socket | AF_XDP zero-copy, multi-NIC LB
XDP_ABORTED | Program error: packet is dropped and the xdp_exception tracepoint fires | Bug detection during development

XDP Modes

Native XDP

Driver implements the XDP hook directly. Fastest mode: runs inside the driver's NAPI poll routine, before any sk_buff is allocated. Requires driver support (mlx5, i40e, ixgbe, virtio_net, …).

Generic XDP (SKB mode)

Implemented in the kernel, after sk_buff allocation. Works on any driver but is much slower — mainly for testing.

Offloaded XDP

Program is compiled for and executed on the NIC itself. No host CPU overhead. Requires a SmartNIC and driver with XDP offload support (for example Netronome Agilio with the nfp driver).

AF_XDP

Zero-copy path to userspace using XDP_REDIRECT into an AF_XDP socket. Used by DPDK-alternative stacks and custom packet processors.

Attaching an XDP Program

Shell
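# Attach XDP program (tries native mode, falls back to generic if the driver
# lacks support; use xdpdrv instead of xdp to force native mode)
ip link set dev eth0 xdp obj xdp_prog.o sec xdp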

# Attach in generic (SKB) mode — any driver
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp

# Remove XDP program
ip link set dev eth0 xdp off

# Show XDP info
ip link show dev eth0 | grep xdp

# Using bpftool
bpftool net show dev eth0
bpftool prog list

Minimal XDP Program (C)

C (eBPF)
// Drop all IPv4 UDP packets, pass everything else
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_UDP
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // The verifier requires a bounds check before every header access
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_DROP;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_DROP;
    if (ip->protocol == IPPROTO_UDP)   return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
Kernel sk_buff

The sk_buff (socket buffer) is the central data structure that represents a packet inside the Linux kernel. Every packet that passes through XDP_PASS is wrapped in an sk_buff for its journey through TC, Netfilter, IP, and transport layers.

Structure Layout

sk_buff (kernel struct)

  head ──▶ start of allocated buffer: [ headroom | L2 | L3 | L4 | data ]
  data ──▶ current start of data
  tail ──▶ end of data
  end  ──▶ end of allocated buffer

  len         : total length of packet data
  data_len    : length of paged (frags) data
  protocol    : ETH_P_IP, ETH_P_IPV6, …
  sk          : pointer to owning socket (or NULL)
  dev         : net_device this packet arrived on
  tstamp      : hardware/software receive timestamp
  mark        : fwmark, used by iptables and routing policy
  cb[48]      : control buffer, per-layer scratch space
  _skb_refdst : routing destination cache entry

Header pointers (set as packet walks up the stack):
  skb_mac_header()       → start of Ethernet frame
  skb_network_header()   → start of IP header
  skb_transport_header() → start of TCP/UDP header
Zero-copy operations Many operations (forwarding, cloning, fragmentation) avoid copying data by manipulating the data, tail, and head pointers and using reference counting. skb_clone() shares the data buffer; skb_copy() makes a full copy.
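
How the four pointers let headers be added and stripped without moving payload bytes can be sketched in userspace C. The `toy_*` functions below only imitate the spirit of the kernel's skb_reserve(), skb_put(), skb_push() and skb_pull(); the struct, the fixed 256-byte buffer, and the NULL-on-error behaviour are assumptions made for illustration.

```c
#include <stddef.h>

/* Toy buffer with the four sk_buff pointers; not the kernel struct. */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head, *data, *tail, *end;
};

void toy_init(struct toy_skb *s)
{
    s->head = s->data = s->tail = s->buf;
    s->end  = s->buf + sizeof(s->buf);
}

/* Like skb_reserve(): leave headroom for headers added later.
 * Only meaningful on an empty buffer with n <= buffer size. */
void toy_reserve(struct toy_skb *s, size_t n) { s->data += n; s->tail += n; }

/* Like skb_put(): extend the data area at the tail (append payload). */
unsigned char *toy_put(struct toy_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;
    if ((size_t)(s->end - s->tail) < n) return NULL;
    s->tail += n;
    return old_tail;
}

/* Like skb_push(): prepend a header by moving data toward head. */
unsigned char *toy_push(struct toy_skb *s, size_t n)
{
    if ((size_t)(s->data - s->head) < n) return NULL;
    s->data -= n;
    return s->data;
}

/* Like skb_pull(): strip a header by moving data toward tail. */
unsigned char *toy_pull(struct toy_skb *s, size_t n)
{
    if ((size_t)(s->tail - s->data) < n) return NULL;
    s->data += n;
    return s->data;
}

size_t toy_len(const struct toy_skb *s)      { return (size_t)(s->tail - s->data); }
size_t toy_headroom(const struct toy_skb *s) { return (size_t)(s->data - s->head); }
```

On the send path a layer would reserve headroom, put the payload, then push one header per layer; on the receive path each layer pulls its own header before passing the buffer up. In both directions the payload bytes stay where they are.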

Key Fields Used by eBPF

Field | Type | Description
skb->mark | u32 | Firewall mark (fwmark), used for policy routing and iptables matching
skb->priority | u32 | QoS / TC priority class
skb->protocol | __be16 | EtherType in network byte order (ETH_P_IP = 0x0800)
skb->cb | char[48] | Per-layer scratch space (TC uses this for metadata)
skb->tstamp | ktime_t | Receive/transmit timestamp
skb->hash | u32 | Flow hash, used by RSS and load balancers
Traffic Control Traffic Control (TC)

The TC subsystem controls how packets are queued and scheduled on an interface. It provides qdiscs (queuing disciplines), classes, and filters — and is also the attachment point for TC eBPF programs via the clsact qdisc.

TC Architecture

Interface (eth0)
│
├── Ingress side (clsact)
│     └── filter → TC eBPF program (ingress)   [after XDP, before Netfilter]
│
└── Egress side
      ├── Root qdisc (e.g. fq_codel, htb, pfifo_fast)
      │     ├── class 1:1 (high priority)
      │     │     └── leaf qdisc (pfifo)
      │     └── class 1:2 (low priority)
      │           └── leaf qdisc (pfifo)
      └── clsact (for TC eBPF egress programs)

Common Qdiscs

Qdisc | Type | Use Case
pfifo_fast | Classless | Kernel default (many distros switch the default to fq_codel): 3-band FIFO based on IP TOS bits
fq_codel | Classless | Fair Queue + Controlled Delay: reduces bufferbloat
htb | Classful | Hierarchical Token Bucket: rate limiting with bursting
tbf | Classless | Token Bucket Filter: simple rate cap
netem | Classless | Network emulator: add delay, loss, jitter for testing
clsact | Special | eBPF TC attachment point (ingress + egress)
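
The token-bucket idea behind tbf (and each HTB class) can be sketched in a few lines of userspace C. This is a simplified model under stated assumptions, not the kernel implementation: the struct, field names, and `toy_tbf_allow()` are hypothetical, time is passed in by the caller, and the multiplication is not guarded against overflow for very long idle periods.

```c
#include <stdint.h>

/* Minimal token bucket in the spirit of tbf; all names are illustrative. */
struct toy_tbf {
    uint64_t rate_bytes_per_s;  /* refill rate */
    uint64_t burst;             /* bucket capacity in bytes */
    uint64_t tokens;            /* bytes currently available */
    uint64_t last_ns;           /* time of the last refill */
};

/* Returns 1 if a packet of `len` bytes may be sent at time `now_ns`.
 * Returns 0 if it must wait (a shaper queues it, a policer drops it). */
int toy_tbf_allow(struct toy_tbf *b, uint64_t now_ns, uint64_t len)
{
    uint64_t elapsed = now_ns - b->last_ns;
    uint64_t refill  = elapsed * b->rate_bytes_per_s / 1000000000ull;

    if (refill > 0) {
        b->tokens += refill;
        if (b->tokens > b->burst)
            b->tokens = b->burst;   /* the bucket never holds more than burst */
        b->last_ns = now_ns;
    }
    if (len > b->tokens)
        return 0;
    b->tokens -= len;
    return 1;
}
```

The `burst` parameter is what lets a flow briefly exceed `rate`: a full bucket can be spent at line speed, after which throughput settles at the refill rate.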

TC Commands

Shell
# Show qdiscs on interface
tc qdisc show dev eth0

# Add 100ms delay + 10ms jitter for testing (netem)
tc qdisc add dev eth0 root netem delay 100ms 10ms

# Rate-limit to 100 Mbps with HTB
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit burst 15k

# Attach eBPF program to TC ingress
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc

# Show TC filters
tc filter show dev eth0 ingress
tc filter show dev eth0 egress
Firewall Netfilter

Netfilter is the Linux kernel framework for packet filtering, NAT, and connection tracking. iptables and nftables are userspace tools that write rules into Netfilter hooks. Rules are organised into tables (raw, mangle, nat, filter) evaluated at five chain hooks.

Five Hook Points

           Packet arrives ──▶ [ PREROUTING ] ──▶ Routing ──▶ [ INPUT ] ──▶ Local Process
                                                    │
                                                    ▼ (forward)
                                               [ FORWARD ]
                                                    │
                                                    ▼
Local Process ──▶ [ OUTPUT ] ──▶ Routing ──▶ [ POSTROUTING ] ──▶ NIC TX
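
The diagram reduces to a small decision function. The sketch below is illustrative only: the enum and `toy_hooks()` are made-up names, and it ignores the case of a locally generated packet addressed to the host itself (which loops back through PREROUTING and INPUT after POSTROUTING).

```c
enum toy_origin { FROM_WIRE, FROM_LOCAL_PROCESS };

/* Return the Netfilter hooks a packet traverses, in order.
 * dst_is_local is the result of the routing decision for packets
 * that arrived from the wire; it is ignored for local output. */
const char *toy_hooks(enum toy_origin origin, int dst_is_local)
{
    if (origin == FROM_LOCAL_PROCESS)
        return "OUTPUT,POSTROUTING";
    return dst_is_local ? "PREROUTING,INPUT"
                        : "PREROUTING,FORWARD,POSTROUTING";
}
```

No single packet visits all five hooks; a forwarded packet never sees INPUT or OUTPUT, which is why a filter rule in INPUT does not protect hosts behind a router.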

Table Evaluation Order per Hook

Hook | Table order (iptables) | Typical Use
PREROUTING | raw → mangle → nat | DNAT, NOTRACK, TTL mangle
INPUT | mangle → filter → security | Filter inbound to local process
FORWARD | mangle → filter → security | Filter forwarded packets
OUTPUT | raw → mangle → nat → filter → security | Filter outbound from local process, DNAT for locally generated traffic
POSTROUTING | mangle → nat | SNAT, MASQUERADE

Common iptables Commands

Shell
# List all rules with packet/byte counters
iptables -L -n -v

# List rules in a specific table
iptables -t nat -L -n -v

# Drop all traffic from an IP
iptables -A INPUT -s 1.2.3.4 -j DROP

# Allow established/related (stateful firewall)
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Port forwarding: redirect :8080 → 192.168.1.10:80 (DNAT)
iptables -t nat -A PREROUTING -p tcp --dport 8080 \
         -j DNAT --to-destination 192.168.1.10:80

# Masquerade for NAT gateway
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

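# Trace a packet through all chains (use TRACE in raw table)
iptables -t raw -A PREROUTING -s 1.2.3.4 -j TRACE
# iptables-legacy: trace lines go to the kernel log
modprobe nf_log_ipv4
dmesg | grep TRACE
# iptables-nft: read trace events with
xtables-monitor --trace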
nftables vs iptables: nftables was merged in kernel 3.13 as the successor to iptables. Tables and chains are user-defined and attached to Netfilter hooks explicitly instead of being predefined, ruleset updates are atomic, and large rule sets can be matched faster by using sets and maps. On distros such as Debian 10+ and RHEL 8+, the iptables command is a compatibility front end (iptables-nft) over nftables.
Firewall Connection Tracking (conntrack)

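conntrack is the Netfilter subsystem that tracks the state of network connections. Every packet that is not exempted with NOTRACK is looked up in the connection tracking table early in PREROUTING (or OUTPUT for locally generated traffic), after the raw table and before mangle. That lookup is what enables stateful firewalling (ESTABLISHED / RELATED rules) and NAT.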

Connection States

State | Meaning
NEW | First packet of a new connection (e.g. TCP SYN)
ESTABLISHED | Packets have been seen in both directions
RELATED | New connection related to an existing one (e.g. FTP data channel)
INVALID | Packet doesn't match any known connection; usually dropped
UNTRACKED | Explicitly excluded from tracking via NOTRACK rule

TCP State Machine in conntrack

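ctstate as seen by rules:
  SYN (original direction)     → NEW
  SYN-ACK (reply direction)    → ESTABLISHED
  Established data             → ESTABLISHED

Entry teardown (internal TCP tracking state):
  FIN exchange → FIN_WAIT / CLOSE_WAIT / LAST_ACK → TIME_WAIT → entry expires & is deleted
  RST          → CLOSE → entry expires & is deleted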
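
A heavily simplified version of this state tracking can be written as a pure function. The sketch below is an illustration under assumptions, not the nf_conntrack state table: the enums and `toy_ct_update()` are hypothetical, sequence-number window checks are omitted, and all closing states are collapsed into one.

```c
enum toy_ct  { CT_NONE, CT_NEW, CT_ESTABLISHED, CT_CLOSING };
enum toy_dir { DIR_ORIGINAL, DIR_REPLY };

/* Advance the tracked state for one observed TCP segment.
 * syn / fin_or_rst are nonzero if the segment carries those flags.
 * A segment that matches no entry and is not an initial SYN leaves
 * the state at CT_NONE (a real firewall would classify it INVALID). */
enum toy_ct toy_ct_update(enum toy_ct cur, enum toy_dir dir,
                          int syn, int fin_or_rst)
{
    if (cur != CT_NONE && fin_or_rst)
        return CT_CLOSING;

    switch (cur) {
    case CT_NONE:
        return (syn && dir == DIR_ORIGINAL) ? CT_NEW : CT_NONE;
    case CT_NEW:
        /* The first packet in the reply direction makes it ESTABLISHED. */
        return dir == DIR_REPLY ? CT_ESTABLISHED : CT_NEW;
    default:
        return cur;
    }
}
```

Note that the entry becomes ESTABLISHED on the first reply packet (the SYN-ACK), before the TCP handshake itself has completed; the conntrack state and the endpoint's TCP state are related but not identical.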

Commands

Shell
# List all tracked connections
conntrack -L

# Watch the conntrack table in real time
conntrack -E

# Show conntrack table statistics
conntrack -S

# Show current conntrack table size and max
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Flush the conntrack table (use with care!)
conntrack -F

# Delete a specific entry
conntrack -D -s 1.2.3.4 -d 5.6.7.8
conntrack table exhaustion High-traffic servers (load balancers, gateways) can exhaust nf_conntrack_max, causing all new connections to be dropped with "nf_conntrack: table full, dropping packet" in dmesg. Increase the limit or disable conntrack for UDP/high-throughput flows with NOTRACK.
Network (L3) IP Layer

The IP layer performs routing, fragmentation/reassembly, and TTL management. The kernel's Forwarding Information Base (FIB) is the routing table — a lookup here determines which interface and next-hop a packet uses.

Routing Lookup (Inbound)

Incoming packet (after PREROUTING)
        │
        ▼
  fib_lookup(dst_ip)
        │
  Is dst_ip assigned to a local interface? (lo, eth0, docker0, …)
        │
   ┌────┴──────────────────┐
   │ yes                   │ no
   ▼                       ▼
LOCAL_IN path           FORWARD path
 → Netfilter INPUT       → Netfilter FORWARD
 → Transport layer       → Netfilter POSTROUTING
                         → Egress NIC
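
The core rule of a FIB lookup is longest-prefix match: among all routes that cover the destination, the most specific one wins. The sketch below shows that rule with a linear scan over a small array. It is an assumption-laden illustration: the kernel uses a trie rather than a scan, and `struct toy_route` and `toy_fib_lookup()` are invented names.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_route {
    uint32_t    prefix;  /* network address, host byte order */
    uint8_t     len;     /* prefix length, 0..32 */
    const char *dev;     /* egress interface name */
};

/* Return the most specific route covering dst, or NULL if none does. */
const struct toy_route *toy_fib_lookup(const struct toy_route *tbl, size_t n,
                                       uint32_t dst)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        /* len == 0 is the default route: the mask is all zeroes. */
        uint32_t mask = tbl[i].len ? UINT32_MAX << (32 - tbl[i].len) : 0;

        if ((dst & mask) != (tbl[i].prefix & mask))
            continue;
        if (!best || tbl[i].len > best->len)
            best = &tbl[i];
    }
    return best;
}
```

With routes for 0.0.0.0/0, 10.0.0.0/8 and 10.1.0.0/16, a packet to 10.1.2.3 matches all three and is sent via the /16, which is the answer `ip route get` would report for an equivalent table.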

Useful Commands

Shell
# Show routing table
ip route show
ip route show table all

# Which route/interface would be used for a destination?
ip route get 8.8.8.8

# Show ARP / neighbor cache
ip neigh show

# Show policy routing rules
ip rule show

# Show interface addresses
ip addr show

# Trace route to a host
traceroute 8.8.8.8
mtr --report 8.8.8.8

Fragmentation

MTU and fragmentation If a packet exceeds the MTU of the egress interface, IP either fragments it (if DF=0) or drops it and sends an ICMP "Fragmentation Needed" back (if DF=1). Fragmentation is expensive. Use ip link show to check MTU, and ping -M do -s 1400 host to probe path MTU.
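
The arithmetic behind fragmentation and the ping probe can be worked through in a short C sketch. It assumes a 20-byte IPv4 header without options and an 8-byte ICMP header; the function names are hypothetical.

```c
#include <stddef.h>

/* Number of IPv4 fragments needed for `payload` bytes of data after the
 * IP header on a link with the given MTU (20-byte header, no options).
 * Returns 0 if the MTU is too small to carry any fragment data. */
size_t toy_frag_count(size_t payload, size_t mtu)
{
    const size_t ip_hdr = 20;

    if (mtu < ip_hdr + 8)
        return 0;
    if (payload + ip_hdr <= mtu)
        return 1;                                 /* fits unfragmented */

    /* Fragment offsets are counted in 8-byte units, so every fragment
     * except the last carries a multiple of 8 bytes. */
    size_t per_frag = ((mtu - ip_hdr) / 8) * 8;
    return (payload + per_frag - 1) / per_frag;
}

/* Size of the IPv4 packet produced by `ping -s size`:
 * ICMP payload + 8-byte ICMP header + 20-byte IP header. */
size_t toy_ping_packet_size(size_t size) { return size + 8 + 20; }
```

So `ping -M do -s 1400` sends 1428-byte packets, and the largest probe that fits a 1500-byte MTU is `-s 1472`. A 4000-byte datagram payload on the same link would be split into three fragments of 1480, 1480 and 1040 data bytes.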
Transport (L4) Transport Layer — TCP & UDP

The transport layer multiplexes data between processes using ports and handles reliability (TCP) or connectionless delivery (UDP). Incoming packets are demultiplexed to the correct socket by looking up the 4-tuple (src IP, src port, dst IP, dst port).

TCP Receive Path

sk_buff arrives at TCP layer
  │
  ▼
tcp_rcv_established() or tcp_rcv_state_process()
  │
  ├── Validate sequence number (in-window?)
  ├── Out-of-order? → queue in out_of_order_queue
  ├── In-order? → append to sk_receive_queue
  ├── Send ACK (delayed or immediate)
  └── Wake up blocked recv() call if data available

TCP Congestion Control

Shell
# Show available congestion control algorithms
cat /proc/sys/net/ipv4/tcp_available_congestion_control

# Show currently active algorithm
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch to BBR (Bottleneck Bandwidth and RTT)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Show TCP stats
ss -tin
nstat -az | grep Tcp

Socket Demultiplexing

Incoming TCP segment: src=1.2.3.4:54321  dst=10.0.0.1:80
  │
  ▼
Lookup in socket hashtable:
  key = (protocol=TCP, src=1.2.3.4, sport=54321, dst=10.0.0.1, dport=80)
  │
  ├── Established match found → deliver to socket recv buffer
  ├── No exact match, but a listener on dst:dport → handle as a new connection (SYN)
  └── No match → send TCP RST
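
The lookup order can be sketched as follows: an exact 4-tuple match on a connected socket wins, a listening socket bound to the destination port is the fallback, and finding neither is the case that produces a RST. The kernel uses hash tables for this; the linear scan, the struct, and `toy_demux()` are simplifications invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_sock {
    uint32_t laddr, raddr;  /* local / remote IPv4 address (0 = any) */
    uint16_t lport, rport;  /* local / remote port */
    int      listening;     /* 1 for a listener; raddr and rport unused */
};

/* Return the socket that should receive the segment, or NULL
 * when no socket matches (the caller would answer with a RST). */
const struct toy_sock *toy_demux(const struct toy_sock *socks, size_t n,
                                 uint32_t src, uint16_t sport,
                                 uint32_t dst, uint16_t dport)
{
    const struct toy_sock *listener = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct toy_sock *s = &socks[i];

        if (s->lport != dport)
            continue;
        if (!s->listening) {
            /* Connected socket: all four tuple fields must match. */
            if (s->laddr == dst && s->raddr == src && s->rport == sport)
                return s;
        } else if (!listener && (s->laddr == dst || s->laddr == 0)) {
            listener = s;   /* remember the first suitable listener */
        }
    }
    return listener;
}
```

This is why thousands of connections can share local port 80: the listener only handles segments that match no connected socket, and each accepted connection is told apart by its remote address and port.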

Key Tunables

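sysctl | Default | Effect
net.core.somaxconn | 4096 | Max accept queue backlog per socket (128 before kernel 5.4)
net.ipv4.tcp_max_syn_backlog | 1024 | SYN queue depth (incomplete handshakes)
net.ipv4.tcp_fin_timeout | 60 | Time an orphaned socket stays in FIN_WAIT_2 (seconds); does not change the TIME_WAIT duration
net.ipv4.tcp_tw_reuse | 0 | Allow reuse of TIME_WAIT sockets for new outgoing connections (recent kernels default to 2, loopback only)
net.core.rmem_max | 212992 | Max socket receive buffer size
net.core.wmem_max | 212992 | Max socket send buffer size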
Userspace Interface Sockets & Socket Buffers

A socket is the kernel object that bridges the network stack and userspace. Two queues, the receive buffer and the send buffer, decouple the application from the network rate.

Socket Buffer Model

Userspace                            Kernel

Application ◀── recv()/read() ────   Socket Receive Buffer
                                       sk_receive_queue (sk_rcvbuf bytes max)

Application ─── send()/write() ──▶   Socket Send Buffer
                                       sk_write_queue (sk_sndbuf bytes max)
                                         │ TCP/UDP → NIC
                                         ▲ TCP ACKs / flow control

Socket Types

Type | Family | Description
SOCK_STREAM | AF_INET / AF_INET6 | TCP: reliable, ordered byte stream
SOCK_DGRAM | AF_INET / AF_INET6 | UDP: connectionless datagrams
SOCK_RAW | AF_INET | Raw IP: bypass TCP/UDP, build headers manually
SOCK_RAW | AF_PACKET | Raw Ethernet frames, used by tcpdump (the older SOCK_PACKET type is obsolete)
SOCK_RAW | AF_XDP | Zero-copy XDP socket for kernel-bypass packet processing

Useful Commands

Shell
# Show TCP sockets with internal TCP info (-i) and owning process (-p);
# add -o for timers and -a to include listening sockets
ss -tinp

# Show UDP sockets
ss -unp

# Show listen sockets with backlog
ss -tlnp

# Show socket memory usage
ss -m

# Show per-socket stats including retransmits
ss -tin dst 10.0.0.1
eBPF eBPF Hook Points

eBPF programs can be attached to dozens of hook points across the network stack. The choice of hook determines what data is visible and what actions are possible.

Network Hook Map

INBOUND (from NIC RX, top to bottom)

  XDP (driver/hw)          ← eBPF: xdp (XDP_DROP / XDP_PASS / XDP_TX)
      │
      ▼
  TC ingress (clsact)      ← eBPF: sched_cls
      │
      ▼
  Netfilter PREROUTING     ← iptables / nftables
      │
      ▼
  IP routing
      │
      ├── INPUT (local)
      │       ▼
      │   Transport        ← eBPF: sk_skb (stream parser/verdict)
      │       ▼
      │   Socket           ← eBPF: sock_ops, cgroup/skb, sk_msg
      │
      └── FORWARD (route)

OUTBOUND (toward NIC TX, top to bottom)

  Socket sendmsg           ← eBPF: sock_ops
      │
      ▼
  Transport (L4)
      │
      ▼
  Netfilter OUTPUT         ← nftables / iptables
      │
      ▼
  TC egress (clsact)       ← eBPF: sched_cls (TC_ACT_SHOT / TC_ACT_OK)
      │
      ▼
  NIC TX

Hook Summary

Hook | Program Type | Attach Point | Key Capability
XDP | BPF_PROG_TYPE_XDP | NIC driver | Fastest drop/redirect, pre-sk_buff
TC ingress/egress | BPF_PROG_TYPE_SCHED_CLS | clsact qdisc | Full sk_buff access, header rewrite
Netfilter | BPF_PROG_TYPE_NETFILTER | Netfilter hooks | Run a BPF program at a Netfilter hook and return an accept/drop verdict (kernel 6.4+)
Socket ops | BPF_PROG_TYPE_SOCK_OPS | TCP events | Observe TCP state, tune socket params
Socket filter | BPF_PROG_TYPE_SOCKET_FILTER | Individual socket | Filter packets per-socket (tcpdump)
cgroup/skb | BPF_PROG_TYPE_CGROUP_SKB | cgroup v2 | Per-cgroup ingress/egress policy
sk_skb | BPF_PROG_TYPE_SK_SKB | sockmap | Stream parser, L7 load balancing
eBPF XDP Programs
When to use XDP vs TC: Use XDP when you need maximum throughput and minimum latency: DDoS drops, load balancer redirects, hardware offload. Use TC when you need sk_buff metadata (mark, priority, socket association), helpers that only exist for sk_buff programs, or a hook on egress. Both hooks can rewrite headers; XDP does so on the raw frame, before the kernel has parsed it.

bpftrace — Quick XDP Tracing

bpftrace
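# Count XDP exceptions (XDP_ABORTED or invalid return codes) by action
bpftrace -e 'tracepoint:xdp:xdp_exception { @[args->act] = count(); }'

# Count generic-mode (SKB) XDP redirects
bpftrace -e 'kprobe:xdp_do_generic_redirect { @redirects = count(); }'

# Show interface counters (driver-specific XDP counters, where present, are in ethtool -S)
ip -s link show dev eth0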
eBPF TC Programs

Minimal TC Program (C)

C (eBPF)
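// Tag all IPv4 TCP packets with fwmark=1 on ingress
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_TCP
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("tc")
int tc_mark_tcp(struct __sk_buff *skb) {
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK;  // IPv4 only

    struct iphdr *ip = (void *)(eth + 1);   // skip Ethernet header
    if ((void *)(ip + 1) > data_end) return TC_ACT_OK;
    if (ip->protocol == IPPROTO_TCP)
        skb->mark = 1;

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";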
eBPF Socket Programs

sock_ops programs fire on TCP socket events (connection established, retransmit, RTT measurement). They are used for per-connection observability and to set TCP socket options automatically based on policy.

bpftrace
# Trace new TCP connections with source/dest
bpftrace -e 'kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    printf("connect: %s → %s\n",
        ntop($sk->__sk_common.skc_rcv_saddr),
        ntop($sk->__sk_common.skc_daddr));
}'

# Trace TCP retransmits
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'

# Trace time spent blocked in accept()
bpftrace -e 'kprobe:inet_csk_accept { @start[tid] = nsecs; }
             kretprobe:inet_csk_accept /@start[tid]/ {
                 @latency_us = hist((nsecs - @start[tid]) / 1000);
                 delete(@start[tid]); }'
Reference Netfilter Chains

Quick-reference showing which chains fire for each packet scenario.

Scenario | Chains Traversed (in order)
Packet arriving for local process | PREROUTING → INPUT
Packet generated by local process | OUTPUT → POSTROUTING
Packet being forwarded | PREROUTING → FORWARD → POSTROUTING
DNAT (port forwarding, inbound) | PREROUTING (nat) → FORWARD → POSTROUTING (→ INPUT instead if the new destination is local)
SNAT / MASQUERADE (outbound NAT) | Forwarded traffic: PREROUTING → FORWARD → POSTROUTING (nat); locally generated traffic: OUTPUT → POSTROUTING (nat)
Docker container traffic (bridge) | PREROUTING → FORWARD → POSTROUTING
Reference Packet Structure (Ethernet / IPv4 / TCP)
Ethernet Header (14 bytes; field widths in bytes)
  dst_mac[6]  src_mac[6]  ethertype[2]  (0x0800=IPv4, 0x86DD=IPv6)
──────────────────────────────────────────────────────────────────
IPv4 Header (20 bytes min; field widths in bits)
  version[4]  ihl[4]  dscp[6]  ecn[2]  total_len[16]
  id[16]  flags[3]  frag_offset[13]
  ttl[8]  protocol[8] (6=TCP, 17=UDP, 1=ICMP)  checksum[16]
  src_ip[32]  dst_ip[32]
──────────────────────────────────────────────────────────────────
TCP Header (20 bytes min; field widths in bits)
  src_port[16]  dst_port[16]
  seq_num[32]
  ack_num[32]
  data_offset[4]  reserved[3]  flags[9] (SYN,ACK,FIN,RST,PSH,…)
  window[16]  checksum[16]  urgent_ptr[16]
──────────────────────────────────────────────────────────────────
Application Data (payload)
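
The layout above maps directly onto byte offsets. The following userspace C sketch extracts the flow tuple from a raw Ethernet frame. It is a minimal illustration under assumptions: untagged Ethernet (no VLAN header), IPv4 only, and invented names (`struct toy_flow`, `toy_parse()`).

```c
#include <stdint.h>
#include <stddef.h>

struct toy_flow {
    uint8_t  proto;         /* 6 = TCP, 17 = UDP, 1 = ICMP */
    uint32_t src, dst;      /* IPv4 addresses, host byte order */
    uint16_t sport, dport;  /* 0 unless proto is TCP or UDP */
};

/* Read big-endian (network order) integers from a byte buffer. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] << 8 | p[1]); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Parse Ethernet + IPv4 (+ TCP/UDP ports). Returns 0 on success,
 * -1 if the frame is too short or is not IPv4. */
int toy_parse(const uint8_t *pkt, size_t len, struct toy_flow *out)
{
    if (len < 14 + 20) return -1;               /* Ethernet + minimal IPv4 */
    if (rd16(pkt + 12) != 0x0800) return -1;    /* ethertype must be IPv4 */

    const uint8_t *ip = pkt + 14;
    if ((ip[0] >> 4) != 4) return -1;           /* version field */
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;    /* header length in bytes */
    if (ihl < 20 || len < 14 + ihl) return -1;

    out->proto = ip[9];
    out->src   = rd32(ip + 12);
    out->dst   = rd32(ip + 16);
    out->sport = out->dport = 0;

    if (out->proto == 6 || out->proto == 17) {
        if (len < 14 + ihl + 4) return -1;      /* need both port fields */
        out->sport = rd16(ip + ihl);
        out->dport = rd16(ip + ihl + 2);
    }
    return 0;
}
```

Every length check here has a counterpart in an eBPF program, where the verifier refuses to load code that reads a header before proving it lies inside the packet. Using `ihl` instead of a fixed 20 bytes keeps the port offsets correct when IP options are present.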

Parse a Packet with tcpdump

Shell
# Capture with full header decode
tcpdump -i eth0 -v -nn

# Show raw hex + ASCII
tcpdump -i eth0 -XX -nn

# Filter TCP SYN packets
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

# Capture to file for Wireshark
tcpdump -i eth0 -w capture.pcap

# Show ICMP only
tcpdump -i eth0 icmp -nn
Reference Useful Tools
Tool | Layer | Purpose
tcpdump | L2–L7 | Packet capture and decode; the standard tool for raw packet inspection
wireshark / tshark | L2–L7 | GUI / CLI packet analysis, protocol dissection
ss | L4 | Socket statistics: replaces netstat, shows TCP state, buffers, retransmits
ip | L2–L3 | Interface, address, route, ARP management
conntrack | Netfilter | View and manage the connection tracking table
iptables / nft | Netfilter | Firewall, NAT, packet mangling rules
tc | L2–L3 | Traffic shaping, qdisc/filter management, TC eBPF attachment
bpftool | eBPF | Inspect loaded eBPF programs and maps
bpftrace | eBPF | Dynamic tracing: one-liners for kernel network events
ethtool | NIC | NIC settings: ring size, offload, coalescing, stats
nstat / netstat | L3–L4 | Protocol-level counters (retransmits, drops, errors)
ping / mtr | L3 | ICMP reachability and path MTU testing
Reference Tuning Knobs

Common sysctl parameters that affect packet throughput, latency, and connection capacity.

Shell — sysctl
# Socket buffer sizes (bytes)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase connection capacity
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Increase conntrack table size
sysctl -w net.netfilter.nf_conntrack_max=2097152

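# Enable TCP BBR (requires the tcp_bbr module; the fq qdisc provides pacing)
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq

# Shorten the FIN_WAIT_2 timeout and allow TIME_WAIT reuse for outgoing connections
# (the TIME_WAIT duration itself is fixed at 60 s in the kernel)
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_tw_reuse=1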

# NAPI polling budget
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=8000
Persist settings sysctl -w changes are lost on reboot. Write them to /etc/sysctl.d/99-network.conf and run sysctl --system to apply permanently.