Linux Kernel Tuning Guide

Virtual memory, OOM, hugepages, NUMA, CPU scheduler, network stack, and I/O scheduler

Workload → needs tuning based on:

Workload	Characteristics	Tuning focus
Web servers	High connection count, low latency	net socket buffers, TIME_WAIT, backlog
Databases	Large working set, consistent latency	hugepages, swappiness=1, dirty ratios, NUMA
Batch / HPC	Throughput, CPU pinning	cpuset, CPU affinity, NUMA interleave
Storage I/O	Disk throughput, low latency	I/O scheduler, read-ahead, dirty bytes

Virtual Memory

The kernel's virtual memory subsystem manages RAM allocation, page reclaim, swap, and dirty page writeback. These are the most impactful tuning parameters for memory-intensive workloads.

Key parameters

Parameter	Default	Tuning guidance
vm.swappiness	60	0–10 for databases (avoid swapping); 60 for general use; 100 or higher to prefer swapping over dropping page cache (the range extends to 200 on kernel 5.8+)
vm.dirty_ratio	20	% of available memory that may be dirty before a writing process is blocked and forced to write back itself. Lower (5–10) for latency-sensitive workloads.
vm.dirty_background_ratio	10	% of available memory at which background writeback starts. Lower (1–5) for streaming writes.
vm.dirty_expire_centisecs	3000 (30 s)	Age at which dirty pages become eligible for writeback. Lower to shrink the window of unwritten data.
vm.overcommit_memory	0	0 = heuristic; 1 = always allow (e.g. scientific codes that map large sparse arrays); 2 = strict, commit limit is swap plus vm.overcommit_ratio % of RAM
vm.min_free_kbytes	varies	Memory reserved for kernel allocations. Increase on high-memory systems to avoid allocation failures under bursts; setting it too high wastes RAM and can itself trigger OOM.
vm.vfs_cache_pressure	100	Lower (50) to retain dentry/inode caches longer (more RAM used). Higher to reclaim them faster.

Shell

# Read a parameter
sysctl vm.swappiness
cat /proc/sys/vm/swappiness

# Set at runtime
sysctl -w vm.swappiness=10
echo 10 > /proc/sys/vm/swappiness

# Database server tuning
sysctl -w vm.swappiness=1
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=1

# Check memory stats
cat /proc/meminfo
vmstat -s
free -h
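
To make the dirty-page percentages concrete, the following minimal sketch converts a ratio into an approximate threshold in MiB. The function name and the 64 GiB host are illustrative assumptions, and the result is an upper bound: the kernel applies the ratios to available (free plus reclaimable) memory, not total RAM.

```shell
# Approximate dirty-page threshold in MiB for a given memory size.
# Assumption: the ratio is applied to the size passed in; the kernel
# really uses available memory, so true thresholds are somewhat lower.
dirty_threshold_mib() {
  local mem_mib="$1" ratio_pct="$2"
  echo $(( mem_mib * ratio_pct / 100 ))
}

# Hypothetical 64 GiB host (65536 MiB)
echo "defaults: background at $(dirty_threshold_mib 65536 10) MiB, blocking at $(dirty_threshold_mib 65536 20) MiB"
echo "database: background at $(dirty_threshold_mib 65536 1) MiB, blocking at $(dirty_threshold_mib 65536 5) MiB"
```

On hosts with a lot of RAM even a few percent is several gigabytes of unwritten data; vm.dirty_bytes and vm.dirty_background_bytes accept absolute limits instead of percentages.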

OOM Killer

When the kernel cannot reclaim enough memory to satisfy an allocation, the OOM (Out Of Memory) killer selects and kills a process. The selection is based on oom_score, which can be adjusted per process.

OOM score

Shell

# Check OOM score (0-1000, higher = more likely to be killed)
cat /proc/<pid>/oom_score

# OOM score adjustment (-1000 to 1000)
cat /proc/<pid>/oom_score_adj

# Protect a process from OOM killer
echo -1000 > /proc/<pid>/oom_score_adj

# Make a process the preferred OOM kill target
echo 1000 > /proc/<pid>/oom_score_adj

# Protect via systemd service
# [Service]
# OOMScoreAdjust=-900

# Show OOM events
dmesg | grep -i "oom\|killed process"
journalctl -k | grep -i oom

# Show all processes sorted by OOM score
for p in /proc/[0-9]*/oom_score; do
  score=$(cat "$p" 2>/dev/null) || continue   # process may have exited
  printf "%6d %s\n" "$score" "$(cat "${p%oom_score}cmdline" 2>/dev/null | tr '\0' ' ')"
done | sort -rn | head -20

cgroup OOM. In cgroups v2, the memory controller enforces limits per cgroup. When a cgroup's usage reaches memory.max and cannot be reclaimed, the OOM killer picks a victim inside that cgroup rather than system-wide, which keeps the damage contained.
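
One way to see whether a cgroup has hit its limit is the oom_kill counter in its memory.events file. The sketch below reads that counter; the function name and the service path in the comment are assumptions for illustration.

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file.
# Prints 0 when the key is absent.
cgroup_oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$1"
}

# Hypothetical usage on a systemd host (the unit name is made up):
# cgroup_oom_kills /sys/fs/cgroup/system.slice/my-db.service/memory.events
```

A counter that keeps rising points at a memory.max that is too tight for the workload rather than at system-wide memory pressure.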

Hugepages

Standard pages are 4KB on x86-64. Hugepages (2MB or 1GB) reduce TLB pressure for large working sets, which mainly benefits databases (Oracle, PostgreSQL, MySQL) and JVMs with large heaps.

Transparent Hugepages (THP)

Shell

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# THP modes
# always   — kernel maps hugepages wherever possible
# madvise  — only for regions that call madvise(MADV_HUGEPAGE)
# never    — disable THP entirely

# Disable THP (recommended for Redis, MongoDB, Oracle)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Also disable khugepaged compaction (reduces latency spikes)
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

Explicit hugepages

Shell

# Show hugepage status
grep -i huge /proc/meminfo

# Allocate 512 × 2MB hugepages = 1GB
sysctl -w vm.nr_hugepages=512
echo 512 > /proc/sys/vm/nr_hugepages

# Check how many were actually allocated (Total) and how many are unused (Free)
grep -E 'HugePages_(Total|Free)' /proc/meminfo

# Mount hugetlbfs (for shared memory apps); many distros already mount it
mkdir -p /dev/hugepages
mount -t hugetlbfs none /dev/hugepages

# PostgreSQL hugepages: set huge_pages=on in postgresql.conf
# Then allocate: nr_hugepages = shared_buffers / 2MB + overhead

# 1GB hugepages (reserve at boot via kernel cmdline; runtime allocation usually fails once memory is fragmented)
# GRUB: hugepagesz=1G hugepages=4 default_hugepagesz=1G
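
The sizing rule for nr_hugepages (segment size divided by page size, plus overhead) can be sketched as a small helper. The function name and the 5% default overhead are assumptions for illustration; measure the real shared memory size of your database before settling on a value.

```shell
# Estimate vm.nr_hugepages for a shared memory segment.
# Arguments: segment size in MiB, hugepage size in MiB (default 2),
# overhead in percent (default 5, an illustrative assumption).
hugepages_needed() {
  local size_mib="$1" page_mib="${2:-2}" overhead_pct="${3:-5}"
  local padded_mib=$(( size_mib * (100 + overhead_pct) / 100 ))
  # Round up to a whole number of pages.
  echo $(( (padded_mib + page_mib - 1) / page_mib ))
}

hugepages_needed 8192        # 8 GiB shared_buffers, 2 MiB pages
hugepages_needed 1024 2 0    # 1 GiB, no overhead: 512 pages
```

If fewer pages than requested show up in HugePages_Total, memory was too fragmented at allocation time; allocating early in boot avoids this.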

NUMA

On multi-socket systems, each CPU socket has local RAM (fast) and must cross an interconnect to reach remote RAM (slow). NUMA-aware allocation minimises remote access.

Shell

# Show NUMA topology
numactl --hardware
lscpu | grep NUMA

# NUMA statistics
numastat
numastat -p <pid>      # per-process NUMA allocation

# Run a process bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 my-db-server

# Interleave memory across all nodes (good for some workloads)
numactl --interleave=all my-app

# Check NUMA balancing
sysctl kernel.numa_balancing        # 1=on (default on many distros)

# For latency-sensitive apps, disable auto-balancing
sysctl -w kernel.numa_balancing=0

# Show local vs remote allocation counters (compare numa_hit with numa_miss, local_node with other_node)
numastat | grep -E 'node|numa_hit|numa_miss'
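
The per-node counters behind numastat are also exposed in /sys/devices/system/node/node*/numastat. As a rough sketch, the helper below turns numa_hit and numa_miss into a percentage per node; the function name is an assumption, and the counters are cumulative since boot, so compare two samples when judging current behaviour.

```shell
# Percentage of allocations served by this node that were intended for
# another node: numa_miss / (numa_hit + numa_miss).
numa_miss_pct() {
  awk '
    $1 == "numa_hit"  { hit = $2 }
    $1 == "numa_miss" { miss = $2 }
    END {
      total = hit + miss
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * miss / total
    }' "$1"
}

for f in /sys/devices/system/node/node*/numastat; do
  [ -r "$f" ] || continue
  echo "$(basename "$(dirname "$f")"): $(numa_miss_pct "$f")% numa_miss"
done
```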

CPU Scheduler

The Completely Fair Scheduler (CFS, replaced by EEVDF in kernel 6.6) distributes CPU time in proportion to task weight, which is derived from the nice value. For real-time and low-latency workloads, scheduling policy and CPU affinity usually matter more than weight.

nice / renice

Shell

# Run with low priority (nice +19 = batch/background)
nice -n 19 my-backup-job

# Increase priority of running process (requires root for negative values)
renice -n -5 -p <pid>

# Real-time scheduling policy (SCHED_FIFO)
chrt -f 50 my-realtime-app     # priority 1-99
chrt -p <pid>                  # check current policy
chrt -r 50 my-app              # SCHED_RR (round-robin)

# CPU affinity — pin to CPUs 0 and 1
taskset -c 0,1 my-app
taskset -cp 0,1 <pid>         # set on running process
taskset -pc <pid>              # show current affinity

CFS tuning parameters

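Parameter	Default	Effect
kernel.sched_min_granularity_ns	0.75 ms × CPU scaling factor (3 ms on 8+ CPUs)	Minimum time a task runs before being preempted. Lower = more responsive, more context switches.
kernel.sched_wakeup_granularity_ns	1 ms × CPU scaling factor (4 ms on 8+ CPUs)	Prevents tasks from preempting each other on wakeup too aggressively.
kernel.sched_migration_cost_ns	500000 (0.5 ms)	Cost of migrating a task to another CPU. Higher = less migration = better cache locality.

On kernel 5.13 and later these knobs are files under /sys/kernel/debug/sched/ (named without the sched_ prefix) instead of sysctls, and on 6.6+ (EEVDF) the two granularity knobs are replaced by base_slice_ns.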

Shell

# Low-latency tuning (e.g., trading systems); sysctl names apply to kernels before 5.13
sysctl -w kernel.sched_min_granularity_ns=1000000
sysctl -w kernel.sched_wakeup_granularity_ns=2000000
sysctl -w kernel.sched_migration_cost_ns=5000000

# Isolate CPUs from scheduler (requires kernel cmdline: isolcpus=2,3)
# Then use taskset/numactl to pin critical threads to isolated CPUs
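
Because the location of the scheduler slice tunable differs between kernel versions, a script can probe for it before writing. The sketch below checks the three locations I am aware of (sysctl before 5.13, debugfs from 5.13, base_slice_ns from 6.6); the function name and the root argument, which exists only to make the lookup testable, are assumptions.

```shell
# Print the first existing path of the scheduler slice tunable.
# The optional root prefix defaults to "" (the real filesystem).
sched_slice_path() {
  local root="${1:-}" candidate
  for candidate in \
    proc/sys/kernel/sched_min_granularity_ns \
    sys/kernel/debug/sched/min_granularity_ns \
    sys/kernel/debug/sched/base_slice_ns
  do
    if [ -e "$root/$candidate" ]; then
      echo "$root/$candidate"
      return 0
    fi
  done
  return 1
}

sched_slice_path || echo "no slice tunable found (debugfs may be unmounted or root-only)"
```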

Network Stack

The kernel's network stack has many tunable parameters. The most impactful for high-throughput or high-connection-count systems are socket buffers, connection backlog, and TIME_WAIT handling.

Key network parameters

Shell

# Socket receive/send buffer sizes (min, default, max)
sysctl net.core.rmem_max          # max socket receive buffer
sysctl net.core.wmem_max          # max socket send buffer
sysctl net.ipv4.tcp_rmem          # min default max for TCP
sysctl net.ipv4.tcp_wmem

# High-throughput tuning (10GbE+)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Connection backlog
sysctl -w net.core.somaxconn=65535          # max listen() backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# TIME_WAIT handling for high connection-rate servers
sysctl -w net.ipv4.tcp_tw_reuse=1           # reuse TIME_WAIT sockets for new outgoing connections (needs TCP timestamps)
sysctl -w net.ipv4.tcp_fin_timeout=15       # reduce FIN-WAIT-2 timeout (does not shorten TIME_WAIT)

# Ephemeral port range (default 32768-60999); keep listening ports out of the range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Enable BBR congestion control (kernel >= 4.9)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify BBR is active
sysctl net.ipv4.tcp_congestion_control
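
A common rule of thumb for sizing the maximum socket buffer is the bandwidth-delay product (BDP): the amount of data in flight on a path at full speed. The sketch below computes it; the function name and the example link figures are assumptions chosen for illustration.

```shell
# Bandwidth-delay product in bytes: bandwidth (Mbit/s) x RTT (ms).
# 1 Mbit/s sustained for 1 ms carries 1,000,000 / 8 / 1000 = 125 bytes.
bdp_bytes() {
  local mbit_per_s="$1" rtt_ms="$2"
  echo $(( mbit_per_s * rtt_ms * 125 ))
}

bdp_bytes 10000 100   # 10 Gbit/s path with 100 ms RTT
bdp_bytes 1000 20     # 1 Gbit/s path with 20 ms RTT
```

For the 10 Gbit/s, 100 ms case the result is 125,000,000 bytes, just under the 134217728-byte (128 MiB) maximum used above. Low-RTT datacenter links need far less, and oversized buffers only add memory pressure.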

Network offloads

Shell

# Check NIC offload settings
ethtool -k eth0

# Disable offloads (useful when debugging with tcpdump)
ethtool -K eth0 tso off gso off gro off

# Set the number of NIC queues used for receive side scaling (RSS)
ethtool -L eth0 combined 8   # 8 combined queues
ethtool -l eth0              # show current and maximum queue counts

# Check interrupt affinity (spread across CPUs)
grep eth0 /proc/interrupts
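
Interrupt affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The helper below builds such a mask from a list of CPU numbers; the function name and the IRQ number in the comment are made up, and the sketch only covers CPUs 0-31 (larger systems use comma-separated 32-bit groups).

```shell
# Convert a list of CPU numbers (0-31) into the hex bitmask format
# accepted by /proc/irq/<irq>/smp_affinity.
cpus_to_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

cpus_to_mask 0 1      # CPUs 0 and 1
cpus_to_mask 2 3      # CPUs 2 and 3
# Hypothetical use (IRQ 42 is made up; requires root):
# cpus_to_mask 2 3 > /proc/irq/42/smp_affinity
```

If irqbalance is running it may overwrite manual settings. /proc/irq/<irq>/smp_affinity_list accepts a plain CPU list such as 2-3 and avoids the mask arithmetic entirely.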

I/O Scheduler

The I/O scheduler reorders and merges block requests before sending them to the device. For SSDs and NVMe, the none scheduler (noop on pre-5.0 kernels that still use the legacy block layer) avoids unnecessary overhead.

Schedulers

Scheduler	Best for	Notes
none / noop	NVMe, SSDs in VMs	No reordering; the device handles it. Lowest latency overhead. noop is the legacy (pre-5.0) name.
mq-deadline	General SSD, mixed workloads	Deadline-based, prevents starvation. Good default for SSDs.
bfq	HDDs, desktop, interactive	Proportional-share, fairness-oriented. Higher overhead.
kyber	Fast NVMe, low latency	Target-latency based, simple. Good for NVMe.

Shell

# Check current scheduler per device
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/scheduler

# Change scheduler at runtime
echo mq-deadline > /sys/block/sda/queue/scheduler
echo none > /sys/block/nvme0n1/queue/scheduler

# Read-ahead (in KB) — increase for sequential workloads
cat /sys/block/sda/queue/read_ahead_kb
echo 2048 > /sys/block/sda/queue/read_ahead_kb

# Queue depth
cat /sys/block/nvme0n1/queue/nr_requests

# Trace I/O with blktrace
blktrace -d /dev/sda -o - | blkparse -i -
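
The queue/scheduler file lists every available scheduler and marks the active one in square brackets, for example "[mq-deadline] kyber bfq none". A small sketch for extracting the active entry across all block devices follows; the function name is an assumption.

```shell
# Extract the active scheduler (the bracketed entry) from a
# queue/scheduler file. Prints nothing if no entry is bracketed.
active_scheduler() {
  sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

for f in /sys/block/*/queue/scheduler; do
  [ -r "$f" ] || continue
  dev=${f#/sys/block/}
  echo "${dev%%/*}: $(active_scheduler "$f")"
done
```

Writes to this file do not survive a reboot or device re-probe; a udev rule is the usual way to make a scheduler choice persistent.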

Persistent Tuning

Shell

# /etc/sysctl.d/99-tuning.conf — loaded by systemd-sysctl on boot
vm.swappiness = 10
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
net.core.somaxconn = 65535
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_tw_reuse = 1
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Apply without reboot
sysctl --system
# or
sysctl -p /etc/sysctl.d/99-tuning.conf
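
After applying a config file it is worth checking that the running values actually match it, since a later file in /etc/sysctl.d or another tool can override a setting. The following sketch compares each key = value line with the corresponding file under /proc/sys. The function name is an assumption, the second argument exists only for testing, and keys whose components contain dots (such as VLAN interface names) are not handled.

```shell
# Print one line per sysctl whose live value differs from the config.
# Empty output means no drift.
sysctl_drift() {
  conf=$1
  procsys=${2:-/proc/sys}
  grep -Ev '^[[:space:]]*(#|;|$)' "$conf" | while IFS='=' read -r key want; do
    key=$(printf '%s' "$key" | tr -d '[:space:]')
    want=$(printf '%s' "$want" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//')
    path="$procsys/$(printf '%s' "$key" | tr . /)"
    if [ ! -r "$path" ]; then
      echo "MISSING $key"
      continue
    fi
    # /proc/sys separates multi-value entries with tabs; normalise them.
    have=$(tr -s '[:space:]' ' ' < "$path" | sed 's/^ //; s/ $//')
    [ "$have" = "$want" ] || echo "DRIFT $key: want '$want', have '$have'"
  done
}

if [ -r /etc/sysctl.d/99-tuning.conf ]; then
  sysctl_drift /etc/sysctl.d/99-tuning.conf
fi
```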

tuned profiles

Shell

# List available profiles
tuned-adm list

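# Common profiles
# throughput-performance: high throughput; performance CPU governor, larger read-ahead, sysctl tuning
# latency-performance:    low latency; limits deep CPU power-saving states
# network-latency:        latency-performance plus network tuning; disables THP and NUMA balancing
# virtual-guest:          VM guest optimisations
# Database profiles (e.g. oracle, mssql, postgresql) ship in separate tuned-profiles-* packages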

# Apply a profile
tuned-adm profile throughput-performance
tuned-adm active

Workload-specific quick reference

Database (PostgreSQL / MySQL)

vm.swappiness=1
vm.nr_hugepages=N
THP: never
I/O: mq-deadline or none (SSD)

Web server (nginx / node)

net.core.somaxconn=65535
net.ipv4.tcp_tw_reuse=1
ip_local_port_range="1024 65535"
LimitNOFILE=65536 in service

High-throughput networking

tcp_rmem/wmem max = 128MB
BBR congestion control
RSS for multi-queue NICs; RPS/RFS when hardware queues are scarce
ethtool combined queues = nCPU

Low-latency / real-time

isolcpus=N (kernel cmdline)
chrt -f 50 for critical threads
Disable THP + NUMA balancing
tuned: latency-performance

Cheat Sheet

Virtual memory

sysctl vm.swappiness=10
sysctl vm.dirty_ratio=5
cat /proc/meminfo
vmstat -s

OOM killer

cat /proc/PID/oom_score
echo -1000 > /proc/PID/oom_score_adj
dmesg | grep -i oom
systemd: OOMScoreAdjust=-900

Hugepages

echo never > .../transparent_hugepage/enabled
sysctl vm.nr_hugepages=512
grep Huge /proc/meminfo

NUMA

numactl --hardware
numastat -p PID
numactl --membind=0 cmd
sysctl kernel.numa_balancing=0

CPU scheduler

nice -n 19 cmd — low priority
chrt -f 50 cmd — real-time
taskset -c 0,1 cmd — affinity
renice -n -5 -p PID

Network stack

sysctl net.core.somaxconn=65535
sysctl net.ipv4.tcp_tw_reuse=1
BBR: tcp_congestion_control=bbr
ethtool -k eth0