The kernel's virtual memory subsystem manages RAM allocation, page reclaim, swap, and dirty page writeback. These are the most impactful tuning parameters for memory-intensive workloads.
Key parameters
| Parameter | Default | Tuning guidance |
|---|---|---|
| vm.swappiness | 60 | 0–10 for databases (avoid swapping); 60 for general; 100 to prefer swap |
| vm.dirty_ratio | 20% | Maximum dirty pages, as a percentage of available memory, before a writing process is blocked and forced to write back. Lower (5-10%) for latency-sensitive workloads. |
| vm.dirty_background_ratio | 10% | Percentage of available memory at which background writeback starts. Lower (1-5%) for streaming writes. |
| vm.dirty_expire_centisecs | 3000 (30s) | Age at which dirty pages become eligible for writeout. Lower to shorten the window of unwritten data. |
| vm.overcommit_memory | 0 | 0=heuristic, 1=always allow (e.g. scientific codes with large sparse arrays), 2=strict (never over-commit) |
| vm.min_free_kbytes | varies | Memory reserved for the kernel. Increase on high-memory systems to avoid allocation failures under bursts; setting it too high leaves less for applications. |
| vm.vfs_cache_pressure | 100 | Lower (50) to retain dentry/inode cache longer (more RAM used). Higher to reclaim faster. |
    # Read a parameter
    sysctl vm.swappiness
    cat /proc/sys/vm/swappiness
    # Set at runtime
    sysctl -w vm.swappiness=10
    echo 10 > /proc/sys/vm/swappiness
    # Database server tuning
    sysctl -w vm.swappiness=1
    sysctl -w vm.dirty_ratio=5
    sysctl -w vm.dirty_background_ratio=1
    # Check memory stats
    cat /proc/meminfo
    vmstat -s
    free -h
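To make the ratio parameters concrete, the sketch below converts a ratio into an approximate threshold in MiB. It assumes MemTotal as the base, whereas the kernel works from the somewhat smaller amount of dirty-able memory, so the result is best read as an upper bound. The function names are illustrative and not part of any standard tool.

```shell
# Estimate a dirty-page threshold in MiB from a memory size (KiB) and a ratio (%).
# Assumption: the given memory size is used as the base; the kernel's real base
# is dirty-able memory, which is smaller than MemTotal.
dirty_threshold_mib() {
    local mem_kib=$1 ratio_pct=$2
    echo $(( mem_kib * ratio_pct / 100 / 1024 ))
}

# Read MemTotal (KiB) from a meminfo-style file; defaults to /proc/meminfo.
mem_total_kib() {
    awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Report both thresholds for this machine when the files are readable.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ] &&
   [ -r /proc/sys/vm/dirty_background_ratio ]; then
    total=$(mem_total_kib)
    echo "writers block at ~$(dirty_threshold_mib "$total" "$(cat /proc/sys/vm/dirty_ratio)") MiB dirty"
    echo "background writeback at ~$(dirty_threshold_mib "$total" "$(cat /proc/sys/vm/dirty_background_ratio)") MiB dirty"
fi
```

On a 64 GiB host, for example, `dirty_threshold_mib 67108864 20` gives 13107 MiB, while a ratio of 5 gives 3276 MiB, which is why lower ratios shorten writeback stalls on large-memory machines.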
When the kernel cannot reclaim enough memory to satisfy an allocation, the OOM (Out Of Memory) killer selects and kills a process. The selection is based on oom_score, which can be adjusted per process.
OOM score
    # Check OOM score (0-1000, higher = more likely to be killed)
    cat /proc/<pid>/oom_score
    # OOM score adjustment (-1000 to 1000)
    cat /proc/<pid>/oom_score_adj
    # Protect a process from the OOM killer
    echo -1000 > /proc/<pid>/oom_score_adj
    # Make a process the preferred OOM kill target
    echo 1000 > /proc/<pid>/oom_score_adj
    # Protect via systemd service
    # [Service]
    # OOMScoreAdjust=-900
    # Show OOM events
    dmesg | grep -i "oom\|killed process"
    journalctl -k | grep -i oom
    # Show all processes sorted by OOM score
    for p in /proc/[0-9]*/oom_score; do
        printf "%6d %s\n" "$(cat "$p" 2>/dev/null)" \
            "$(cat "${p%oom_score}cmdline" 2>/dev/null | tr '\0' ' ')"
    done | sort -rn | head -20
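OOM kills that happen inside a memory-limited cgroup can also be counted per cgroup rather than by searching the kernel log. The sketch below assumes cgroup v2 mounted at /sys/fs/cgroup and reads the `oom_kill` counter from `memory.events` and the limit from `memory.max`. The service path and function names are hypothetical.

```shell
# Print the number of OOM kills recorded for one cgroup v2 directory.
# memory.events holds "key value" lines; oom_kill counts killed processes.
cgroup_oom_kills() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}

# Print the cgroup's hard limit: a byte count, or "max" when unlimited.
cgroup_memory_max() {
    cat "$1/memory.max"
}

# Hypothetical unit name, for illustration only.
cg=/sys/fs/cgroup/system.slice/my-db.service
if [ -r "$cg/memory.events" ] && [ -r "$cg/memory.max" ]; then
    echo "limit=$(cgroup_memory_max "$cg") oom_kills=$(cgroup_oom_kills "$cg")"
fi
```

A counter that keeps rising for one service points at a limit that is too tight for that workload, independent of overall system memory pressure.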
When a cgroup v2 memory limit (memory.max) is in effect, the OOM killer runs within that cgroup rather than system-wide, which is safer and more targeted.
Standard pages are 4 KB. Hugepages (2 MB or 1 GB) reduce TLB pressure for large working sets. Databases (Oracle, PostgreSQL, MySQL) and JVMs can benefit significantly.
Transparent Hugepages (THP)
    # Check THP status (the active mode is shown in brackets)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    # [always] madvise never
    # THP modes
    #   always:  kernel maps hugepages wherever possible
    #   madvise: only for regions that call madvise(MADV_HUGEPAGE)
    #   never:   disable THP entirely
    # Disable THP (recommended for Redis, MongoDB, Oracle)
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # Also disable khugepaged compaction (reduces latency spikes)
    echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
Explicit hugepages
    # Show hugepage status
    grep -i huge /proc/meminfo
    # Allocate 512 × 2MB hugepages = 1GB (either form)
    sysctl -w vm.nr_hugepages=512
    echo 512 > /proc/sys/vm/nr_hugepages
    # Check how many were successfully allocated
    grep -E 'HugePages_(Total|Free)' /proc/meminfo
    # Mount hugetlbfs (for shared memory apps)
    mkdir -p /dev/hugepages
    mount -t hugetlbfs none /dev/hugepages
    # PostgreSQL hugepages: set huge_pages=on in postgresql.conf
    # Then allocate: nr_hugepages = shared_buffers / 2MB + overhead
    # 1GB hugepages (reserve at boot via kernel cmdline; runtime allocation
    # usually fails once memory is fragmented)
    # GRUB: default_hugepagesz=1G hugepagesz=1G hugepages=4
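The sizing rule in the comments above (shared_buffers / 2MB + overhead) can be sketched as a small helper. The 5% default headroom is an assumption chosen for illustration, not a PostgreSQL recommendation, and the function name is hypothetical.

```shell
# Number of huge pages needed to back a shared-memory segment of segment_mib,
# rounded up to whole pages, plus a percentage of headroom (also rounded up).
# overhead_pct is an assumed safety margin; adjust it for the actual workload.
hugepages_needed() {
    local segment_mib=$1 page_mib=${2:-2} overhead_pct=${3:-5}
    local pages=$(( (segment_mib + page_mib - 1) / page_mib ))
    echo $(( pages + (pages * overhead_pct + 99) / 100 ))
}

# Example: 8 GiB of shared_buffers on 2 MiB pages with 5% headroom.
echo "vm.nr_hugepages = $(hugepages_needed 8192)"
```

With no headroom, `hugepages_needed 1024 2 0` returns 512, matching the 512 × 2MB = 1GB allocation shown earlier.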
On multi-socket systems, each CPU socket has local RAM (fast) and must cross an interconnect to reach remote RAM (slow). NUMA-aware allocation minimises remote access.
    # Show NUMA topology
    numactl --hardware
    lscpu | grep NUMA
    # NUMA statistics
    numastat
    numastat -p <pid>    # per-process NUMA allocation
    # Run a process bound to NUMA node 0
    numactl --cpunodebind=0 --membind=0 my-db-server
    # Interleave memory across all nodes (good for some workloads)
    numactl --interleave=all my-app
    # Check NUMA balancing
    sysctl kernel.numa_balancing    # 1=on (default)
    # For latency-sensitive apps, disable auto-balancing
    sysctl -w kernel.numa_balancing=0
    # Show per-node hit/miss counters (local vs remote allocations)
    numastat | grep -A5 node
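The per-node counters can be reduced to a single figure. The sketch below reads `numa_hit` and `numa_miss` from a node's `numastat` file in sysfs and prints the share of allocations that landed on the intended node. The function name is illustrative, and what counts as an acceptable percentage is workload-dependent.

```shell
# Percentage of page allocations on a node that were satisfied where intended:
# numa_hit / (numa_hit + numa_miss), from a "counter value" formatted file.
numa_hit_pct() {
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END { if (hit + miss > 0) printf "%.1f\n", 100 * hit / (hit + miss) }' "$1"
}

# Report every node on this machine, if the sysfs files exist.
for f in /sys/devices/system/node/node*/numastat; do
    if [ -r "$f" ]; then
        echo "$f: $(numa_hit_pct "$f")% of allocations on the intended node"
    fi
done
```

A falling percentage on one node suggests that processes bound or scheduled there are spilling onto remote memory, which is the case `--membind` and `--cpunodebind` are meant to prevent.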
The Completely Fair Scheduler (CFS) distributes CPU time proportionally by process weight (nice value). For real-time and low-latency workloads, scheduling policy and CPU affinity matter significantly.
nice / renice
    # Run with low priority (nice +19 = batch/background)
    nice -n 19 my-backup-job
    # Increase priority of running process (requires root for negative values)
    renice -n -5 -p <pid>
    # Real-time scheduling policy (SCHED_FIFO)
    chrt -f 50 my-realtime-app    # priority 1-99
    chrt -p <pid>                 # check current policy
    chrt -r 50 my-app             # SCHED_RR (round-robin)
    # CPU affinity: pin to CPUs 0 and 1
    taskset -c 0,1 my-app
    taskset -cp 0,1 <pid>         # set on running process
    taskset -pc <pid>             # show current affinity
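To see what a nice value means in terms of CPU share, the sketch below approximates the CFS load weight as 1024 / 1.25^nice. This is an approximation: the kernel uses a precomputed table with nice 0 at 1024, and the formula only tracks it closely. The function names are illustrative.

```shell
# Approximate CFS load weight for a nice value: 1024 / 1.25^nice.
# Assumption: this formula stands in for the kernel's precomputed weight table.
nice_weight() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 1024 / (1.25 ^ n) }'
}

# Expected CPU share (%) of task A when tasks A and B compete for one CPU,
# given their nice values.
cpu_share() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        wa = 1024 / (1.25 ^ a)
        wb = 1024 / (1.25 ^ b)
        printf "%.0f\n", 100 * wa / (wa + wb)
    }'
}

echo "weight at nice 0:  $(nice_weight 0)"
echo "weight at nice 19: $(nice_weight 19)"
echo "share of a nice 0 task against a nice 5 task: $(cpu_share 0 5)%"
```

Because the share depends only on the ratio of weights, two tasks at the same nice value always split a CPU evenly, whatever that value is.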
CFS tuning parameters
| Parameter | Default | Effect |
|---|---|---|
| kernel.sched_min_granularity_ns | 0.75 ms × CPU scaling factor (3 ms with 8+ CPUs) | Minimum time a task runs before being preempted. Lower = more responsive, at the cost of more context switches. |
| kernel.sched_wakeup_granularity_ns | 1 ms × CPU scaling factor (4 ms with 8+ CPUs) | Prevents tasks from preempting each other on wakeup too aggressively. |
| kernel.sched_migration_cost_ns | 500000 (0.5 ms) | Cost of migrating a task to another CPU. Higher = less migration = better cache locality. |
    # Low-latency tuning (e.g., trading systems)
    # On kernels >= 5.13 these knobs moved from sysctl to /sys/kernel/debug/sched/
    sysctl -w kernel.sched_min_granularity_ns=1000000
    sysctl -w kernel.sched_wakeup_granularity_ns=2000000
    sysctl -w kernel.sched_migration_cost_ns=5000000
    # Isolate CPUs from scheduler (requires kernel cmdline: isolcpus=2,3)
    # Then use taskset/numactl to pin critical threads to isolated CPUs
The kernel's network stack has many tunable parameters. The most impactful for high-throughput or high-connection-count systems are socket buffers, connection backlog, and TIME_WAIT handling.
Key network parameters
    # Socket receive/send buffer sizes (min, default, max)
    sysctl net.core.rmem_max    # max socket receive buffer
    sysctl net.core.wmem_max    # max socket send buffer
    sysctl net.ipv4.tcp_rmem    # min default max for TCP
    sysctl net.ipv4.tcp_wmem
    # High-throughput tuning (10GbE+)
    sysctl -w net.core.rmem_max=134217728
    sysctl -w net.core.wmem_max=134217728
    sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
    # Connection backlog
    sysctl -w net.core.somaxconn=65535    # max listen() backlog
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
    # TIME_WAIT: high connection-rate servers
    sysctl -w net.ipv4.tcp_tw_reuse=1     # reuse TIME_WAIT sockets for new outgoing connections
    sysctl -w net.ipv4.tcp_fin_timeout=15 # reduce FIN-WAIT-2 timeout
    # Ephemeral port range (default 32768-60999)
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"
    # Enable BBR congestion control (kernel >= 4.9)
    sysctl -w net.core.default_qdisc=fq
    sysctl -w net.ipv4.tcp_congestion_control=bbr
    # Verify BBR is active
    sysctl net.ipv4.tcp_congestion_control
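The maximum buffer size is usually chosen from the bandwidth-delay product (BDP): the amount of data in flight on a link at a given round-trip time. The sketch below computes it in bytes; the function name is illustrative, and the link speeds and RTTs are example inputs rather than measurements.

```shell
# Bandwidth-delay product in bytes: link rate (Gbit/s) x round-trip time (ms).
# 1 Gbit/s sustained for 1 ms carries 10^9 / 8 / 1000 = 125000 bytes.
bdp_bytes() {
    local gbit_per_s=$1 rtt_ms=$2
    echo $(( gbit_per_s * rtt_ms * 125000 ))
}

echo "10 Gbit/s at 100 ms RTT: $(bdp_bytes 10 100) bytes"
echo "1 Gbit/s at 20 ms RTT:   $(bdp_bytes 1 20) bytes"
```

A 10 Gbit/s path at 100 ms RTT needs about 125 MB in flight, which is why a maximum of 134217728 bytes (128 MiB) suits long-distance 10GbE transfers; a LAN with sub-millisecond RTT needs far less.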
Network offloads
    # Check NIC offload settings
    ethtool -k eth0
    # Disable offloads (useful when debugging with tcpdump)
    ethtool -K eth0 tso off gso off gro off
    # Set the number of combined queues used for receive-side scaling (multi-queue)
    ethtool -L eth0 combined 8    # 8 combined queues
    # Check interrupt affinity (spread across CPUs)
    grep eth0 /proc/interrupts
The I/O scheduler reorders and merges block requests before sending them to the device. For SSDs and NVMe, the none scheduler (or another lightweight one) avoids unnecessary overhead.
Schedulers
| Scheduler | Best for | Notes |
|---|---|---|
| none (noop on older single-queue kernels) | NVMe, SSDs in VMs | No reordering; the device handles it. Lowest latency overhead. |
| mq-deadline | General SSD, mixed workloads | Deadline-based, prevents starvation. Good default for SSDs. |
| bfq | HDDs, desktop, interactive | Proportional-share, fairness-oriented. Higher overhead. |
| kyber | Fast NVMe, low latency | Target-latency based, simple. Good for NVMe. |
    # Check current scheduler per device (the active one is shown in brackets)
    cat /sys/block/sda/queue/scheduler
    cat /sys/block/nvme0n1/queue/scheduler
    # Change scheduler at runtime
    echo mq-deadline > /sys/block/sda/queue/scheduler
    echo none > /sys/block/nvme0n1/queue/scheduler
    # Read-ahead (in KB): increase for sequential workloads
    cat /sys/block/sda/queue/read_ahead_kb
    echo 2048 > /sys/block/sda/queue/read_ahead_kb
    # Queue depth
    cat /sys/block/nvme0n1/queue/nr_requests
    # Trace I/O with blktrace
    blktrace -d /dev/sda -o - | blkparse -i -
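The scheduler file lists every available choice and marks the active one with brackets, which is awkward to use in scripts. A minimal sketch for extracting the active entry and surveying all block devices follows; the function name is illustrative.

```shell
# Extract the bracketed (active) entry from a sysfs selector line such as
# "[mq-deadline] kyber bfq none". Reads the line from standard input.
active_choice() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Print the active I/O scheduler for every block device on this machine.
for q in /sys/block/*/queue/scheduler; do
    if [ -r "$q" ]; then
        dev=${q#/sys/block/}
        dev=${dev%%/*}
        echo "$dev: $(active_choice < "$q")"
    fi
done
```

The same helper works on other sysfs files that use the bracket convention, such as the transparent hugepage `enabled` file.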
    # /etc/sysctl.d/99-tuning.conf (loaded by systemd-sysctl on boot)
    vm.swappiness = 10
    vm.dirty_ratio = 5
    vm.dirty_background_ratio = 2
    net.core.somaxconn = 65535
    net.ipv4.tcp_rmem = 4096 87380 134217728
    net.ipv4.tcp_wmem = 4096 65536 134217728
    net.ipv4.tcp_tw_reuse = 1
    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    # Apply without reboot (run in a shell; not part of the file)
    sysctl --system
    # or
    sysctl -p /etc/sysctl.d/99-tuning.conf
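After applying a file like this, it can be useful to confirm that the running values match it, since another file or tool may have overridden them. The sketch below is one way to check, under stated assumptions: it handles only plain `key = value` lines, maps dots in the key to path separators under /proc/sys (so keys containing interface names with dots are not supported), and uses hypothetical function names.

```shell
# Collapse runs of whitespace to single spaces and trim both ends.
squash() {
    tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Compare "key = value" lines in a sysctl.d-style file with live values under
# a procfs root (default /proc/sys). Prints one line per mismatch.
sysctl_drift() {
    local conf=$1 root=${2:-/proc/sys}
    # Drop comments and blank lines, then split each line on the first "=".
    grep -Ev '^[[:space:]]*([#;]|$)' "$conf" | while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        want=$(printf '%s' "$want" | squash)
        path="$root/$(printf '%s' "$key" | tr . /)"
        [ -r "$path" ] || { echo "$key: not present"; continue; }
        have=$(squash < "$path")    # multi-value entries are tab-separated
        [ "$have" = "$want" ] || echo "$key: want '$want', have '$have'"
    done
}

# Usage: sysctl_drift /etc/sysctl.d/99-tuning.conf
```

No output means every listed key matches. The optional second argument exists so the function can be pointed at a copy of the tree, for example when testing it without touching the live system.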
tuned profiles
    # List available profiles
    tuned-adm list
    # Common profiles
    #   throughput-performance: high throughput; performance CPU governor, tuned sysctl and read-ahead
    #   latency-performance:    low latency, disables power management
    #   network-latency:        ultra-low-latency networking; also disables THP and NUMA balancing
    #   postgresql, oracle, mssql: database-specific profiles, where the distribution packages them
    #   virtual-guest:          VM guest optimisations
    # Apply a profile
    tuned-adm profile throughput-performance
    tuned-adm active
Workload-specific quick reference
Database (PostgreSQL / MySQL)
vm.swappiness=1
vm.nr_hugepages=N
THP: never
I/O: mq-deadline or none (SSD)
Web server (nginx / node)
net.core.somaxconn=65535
net.ipv4.tcp_tw_reuse=1
net.ipv4.ip_local_port_range="1024 65535"
LimitNOFILE=65536 in service
High-throughput networking
tcp_rmem/wmem max = 128MB
BBR congestion control
RSS for multi-queue NICs; RPS/RFS when queues are fewer than CPUs
ethtool combined queues = nCPU
Low-latency / real-time
isolcpus=N (kernel cmdline)
chrt -f 50 for critical threads
Disable THP + NUMA balancing
tuned: latency-performance
Virtual memory
sysctl vm.swappiness=10
sysctl vm.dirty_ratio=5
cat /proc/meminfo
vmstat -s
OOM killer
cat /proc/PID/oom_score
echo -1000 > /proc/PID/oom_score_adj
dmesg | grep -i oom
systemd: OOMScoreAdjust=-900
Hugepages
echo never > .../transparent_hugepage/enabled
sysctl vm.nr_hugepages=512
grep Huge /proc/meminfo
NUMA
numactl --hardware
numastat -p PID
numactl --membind=0 cmd
sysctl kernel.numa_balancing=0
CPU scheduler
nice -n 19 cmd (low priority)
chrt -f 50 cmd (real-time)
taskset -c 0,1 cmd (affinity)
renice -n -5 -p PID
Network stack
sysctl net.core.somaxconn=65535
sysctl net.ipv4.tcp_tw_reuse=1
BBR: tcp_congestion_control=bbr
ethtool -k eth0