stress-ng is the Swiss Army knife of stress testing: 300+ stressors covering CPU, memory, I/O, network, file system, pipes, timers, and more. It is the successor to stress, with far more control.
Common Use Cases
CPU Stressors
    # Load all CPU cores for 60 seconds
    stress-ng --cpu 0 --timeout 60s
    # Load exactly 4 workers, specific CPU method
    stress-ng --cpu 4 --cpu-method matrixprod --timeout 30s
    # Show metrics at end
    stress-ng --cpu 2 --metrics-brief --timeout 30s
    # Pin to specific CPUs
    stress-ng --cpu 2 --taskset 0,1 --timeout 30s
    # List available CPU methods
    stress-ng --cpu-method list
Memory Stressors
    # Allocate and thrash 2 GB with 2 workers
    stress-ng --vm 2 --vm-bytes 1G --timeout 60s
    # Mmap stress, good for testing page cache pressure
    stress-ng --mmap 2 --mmap-bytes 512M --timeout 60s
    # Push toward OOM: allocate 90% of available RAM and hold it
    stress-ng --vm 1 --vm-bytes 90% --vm-keep --timeout 60s
    # Malloc/free churning
    stress-ng --malloc 4 --timeout 30s
I/O and File System Stressors
    # sync() storm: workers repeatedly commit the buffer cache to disk
    stress-ng --io 4 --timeout 60s
    # HDD stress (read/write temp files)
    stress-ng --hdd 2 --hdd-bytes 1G --timeout 60s
    # File descriptor exhaustion (open(2) up to the per-process limit)
    stress-ng --open 1 --timeout 30s
    # inotify stress
    stress-ng --inotify 4 --timeout 30s
Combined / Realistic Load
    # Realistic mixed load: CPU + memory + I/O
    stress-ng \
        --cpu 4 \
        --vm 2 --vm-bytes 512M \
        --io 2 \
        --timeout 120s \
        --metrics-brief
Red Flags
--vm-keep with a large allocation can fill RAM and trigger heavy swapping. Use --timeout and monitor with vmstat.
Check dmesg | grep -i oom after the run.
Gotchas
Run stress-ng --stressors to see what your version supports.
Install with apt install stress-ng or yum install stress-ng. The package in older distros may be very outdated; consider building from source for the full stressor set.
stress is the original stress tool. Simple, widely available, and good enough for basic saturation. Prefer stress-ng for anything more nuanced.
Common Options
    stress --cpu 4                                      # 4 CPU workers (sqrt loop)
    stress --cpu 4 --timeout 60                         # run for 60 seconds
    stress --vm 2 --vm-bytes 512M                       # 2 workers, each allocating 512 MB
    stress --io 4                                       # 4 I/O workers (sync calls)
    stress --hdd 2 --hdd-bytes 1G                       # write 1 GB temp files per worker
    stress --cpu 2 --vm 1 --vm-bytes 256M --io 1 -t 30  # combined
Gotchas
The CPU worker is just a sqrt() loop: it doesn't stress integer units, cache, or SIMD. For realistic CPU stress, use stress-ng --cpu-method matrixprod or similar.
--io workers call sync() in a loop, which flushes dirty pages from the page cache. It can tank disk throughput but does not generate raw write I/O by itself.
The burn* programs (burnP6 and friends, from the cpuburn package) maximize CPU power consumption and heat output. They are used primarily for thermal stress testing, burn-in, and stability validation, and burn FPU and integer pipelines simultaneously.
Common Use Cases
    # Burn a single core (run N copies in parallel for N cores)
    burnP6 & burnP6 & burnP6 & burnP6 &
    # Using stress-ng as a modern alternative with equivalent heat output
    stress-ng --cpu 0 --cpu-method all --timeout 300s
    # Monitor temperatures while burning
    watch -n 1 sensors
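Where sensors is not installed, the kernel's thermal zones can usually be read directly. The sketch below assumes the common Linux sysfs layout, in which each thermal_zone*/temp file reports millidegrees Celsius; max_zone_temp_c is a hypothetical helper name, not part of any tool above.

```shell
# Hypothetical helper: print the hottest thermal zone in whole degrees C.
# $1 = thermal sysfs root, normally /sys/class/thermal (assumed layout)
max_zone_temp_c() {
    cat "$1"/thermal_zone*/temp 2>/dev/null |
        sort -n | tail -n 1 |
        awk '{ printf "%d\n", $1 / 1000 }'
}

# Example (not executed here): poll once per second during a burn
#   while sleep 1; do max_zone_temp_c /sys/class/thermal; done
```

Taking the root directory as an argument keeps the helper easy to try against a scratch directory before pointing it at /sys.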
Red Flags
package temp near this range, the cooling solution is inadequate.
Gotchas
burnP6 (Intel P6/Core), burnMMX, burnBX are separate programs tuned per microarchitecture. Using the wrong one may not fully stress the CPU.
stress-ng is a more practical modern substitute.
stress-ng memory stressors go beyond simple allocation: they cover mmap, malloc churn, NUMA access patterns, huge pages, and cache thrashing. Essential for testing OOM behaviour and memory subsystem limits.
Common Use Cases
Key Stressors
| Stressor | What it does |
|---|---|
| --vm N | mmap + write + read cycle; most common memory stressor |
| --mmap N | Rapid mmap/munmap calls; stresses the kernel's VMA management |
| --malloc N | malloc/calloc/realloc/free churn; tests the allocator under load |
| --shm N | POSIX shared memory allocation and access |
| --numa N | NUMA memory access patterns across nodes |
| --bigheap N | Grows heap until OOM; good for finding leak-triggered crashes |
| --userfaultfd N | Exercises userfaultfd page fault handling |
Trigger OOM and Observe Kernel Response
    # Allocate 90% of RAM, hold it, force swapping / OOM
    stress-ng --vm 1 --vm-bytes 90% --vm-keep --timeout 60s
    # Watch OOM events in real time
    dmesg -w | grep -i oom
    # Check which process was killed and why
    dmesg | grep -E "(oom_kill|Out of memory)"
    # Test inside a cgroup with a 512 MB memory limit
    systemd-run --scope -p MemoryMax=512M \
        stress-ng --vm 1 --vm-bytes 1G --timeout 30s
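A percentage passed to --vm-bytes scales with the host, so it can help to work out the absolute figure before running. A minimal sketch, assuming a Linux /proc/meminfo that reports MemAvailable in kB; pct_of_kb and mem_available_kb are hypothetical helper names.

```shell
# Hypothetical helper: integer percentage of a size given in kB.
# $1 = size in kB, $2 = percentage (1-100)
pct_of_kb() {
    echo $(( $1 * $2 / 100 ))
}

# MemAvailable as reported by the kernel, in kB (Linux-specific).
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Example (not executed here):
#   target_kb=$(pct_of_kb "$(mem_available_kb)" 90)
#   echo "90% of available memory is $((target_kb / 1024)) MiB"
```

Printing the target first is a cheap sanity check on large hosts, where 90% can be far more than intended.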
Red Flags
Gotchas
90% of RAM on a 128 GB system is 115 GB. Combined with multiple workers this can fill and crash the host instantly.
Without --vm-keep, workers free and reallocate memory continuously. With --vm-keep, memory is held, which is much more aggressive for OOM testing.
memtester allocates a region of memory and runs a battery of read/write patterns to detect faulty RAM. Different goal from stress-ng: memtester validates correctness, not load.
Common Use Cases
    # Test 1 GB of memory, 1 pass
    memtester 1G 1
    # Test 4 GB, 3 passes (long burn-in)
    memtester 4G 3
    # Run in background, log output
    memtester 2G 1 > /tmp/memtester.log 2>&1 &
Sample Failing Output
    Loop 1/1:
      Stuck Address : ok
      Random Value  : ok
      Compare XOR   : FAILURE: 0xdeadbeef != 0xdeadbeff at 0x7f3a2c001008
      Compare SUB   : ok
      Compare MUL   : FAILURE: ...
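When memtester runs in the background with its output logged, the log can be scanned for failures afterwards. A minimal sketch, assuming failing tests are marked with the word FAILURE as in the sample output; count_memtester_failures is a hypothetical helper name.

```shell
# Hypothetical helper: count FAILURE lines in a memtester log.
# $1 = path to the log file
count_memtester_failures() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps "set -e" scripts alive.
    grep -c 'FAILURE' "$1" || true
}

# Example (not executed here):
#   n=$(count_memtester_failures /tmp/memtester.log)
#   [ "$n" -eq 0 ] || echo "memtester reported $n failing tests"
```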
Red Flags
Gotchas
mlock() calls require root or raising the RLIMIT_MEMLOCK limit.
fio is the Flexible I/O tester. It measures throughput, IOPS, and latency for any combination of read/write patterns and I/O depths, and is the standard for disk benchmarking and stress testing.
Common Use Cases
Common Patterns
    # Sequential read throughput
    fio --name=seqread --rw=read --bs=1M --size=4G \
        --numjobs=4 --runtime=60 --group_reporting
    # Sequential write throughput
    fio --name=seqwrite --rw=write --bs=1M --size=4G \
        --numjobs=4 --runtime=60 --group_reporting
    # Random read IOPS (SSD / NVMe benchmark)
    # --iodepth only takes effect with an asynchronous engine such as libaio
    fio --name=randread --rw=randread --bs=4k --size=4G \
        --ioengine=libaio --direct=1 \
        --iodepth=32 --numjobs=4 --runtime=60 --group_reporting
    # Mixed 70/30 read/write (database-like)
    fio --name=mixed --rw=randrw --rwmixread=70 \
        --ioengine=libaio --direct=1 \
        --bs=8k --size=4G --iodepth=16 --runtime=120
    # Stress test: maximum sustained write load
    fio --name=sustained_write --rw=write --bs=128k \
        --size=20G --numjobs=8 --iodepth=64 \
        --ioengine=libaio --direct=1 --runtime=300 --group_reporting
    # Direct I/O (bypass page cache)
    fio --name=direct --rw=randwrite --bs=4k \
        --direct=1 --size=2G --runtime=60
Key Options
| Option | Meaning |
|---|---|
| --rw | I/O pattern: read, write, randread, randwrite, randrw, readwrite |
| --bs | Block size (e.g. 4k, 128k, 1M) |
| --iodepth | Queue depth: how many I/Os to keep in flight per job |
| --numjobs | Number of parallel worker threads/processes |
| --direct=1 | Bypass page cache; measures raw device speed |
| --ioengine | I/O engine: libaio, io_uring, sync, mmap |
| --runtime | Time limit in seconds |
| --filename | Target file or device (e.g. /dev/sdb) |
Reading Output
    read: IOPS=45.2k, BW=176MiB/s (185MB/s)(10.3GiB/60001msec)
      lat (usec): min=78, avg=708, stdev=1423.7, max=58392
      clat percentiles (usec):
       | 50.00th=[ 290], 99.00th=[ 5472], 99.90th=[13697]
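For quick comparisons between runs, the headline numbers can be pulled out of this text summary. The sketch below assumes the human-readable layout shown above; fio_iops and fio_p99_usec are hypothetical helper names. For anything long-lived, fio's --output-format=json is likely a sturdier source than scraping text.

```shell
# Hypothetical helpers: read fio's text summary on stdin.

# Print the first IOPS value, e.g. "45.2k".
fio_iops() {
    sed -n 's/.*IOPS=\([^,]*\),.*/\1/p' | head -n 1
}

# Print the 99th percentile completion latency in the unit fio reported.
fio_p99_usec() {
    sed -n 's/.*99\.00th=\[ *\([0-9]*\)\].*/\1/p' | head -n 1
}

# Example (not executed here):
#   fio --name=randread --rw=randread --bs=4k --size=1G | fio_iops
```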
Red Flags
Gotchas
To measure the device rather than the page cache, use --direct=1.
Point fio at a regular file (--filename=/tmp/fio-test) unless you explicitly want raw device testing on a disk with no data.
dm-flakey is a Linux Device Mapper target that simulates intermittent disk failures. It makes a real block device appear to fail at configurable intervals, which is essential for testing filesystem and application resilience to I/O errors without a physically failing disk.
Common Use Cases
Setup and Usage
    # Load dm-flakey module
    modprobe dm-flakey
    # Get device size in 512-byte sectors
    SECTORS=$(blockdev --getsz /dev/sdb)
    # Create flakey device:
    #   - works fine for 10 seconds
    #   - fails all I/O for 5 seconds
    #   - repeats
    echo "0 $SECTORS flakey /dev/sdb 0 10 5" | \
        dmsetup create flakey-test
    # Use /dev/mapper/flakey-test as a normal block device
    # (mkfs destroys any data on /dev/sdb)
    mkfs.ext4 /dev/mapper/flakey-test
    mkdir -p /mnt/test
    mount /dev/mapper/flakey-test /mnt/test
    # Run I/O against it and observe errors
    fio --name=flakey --filename=/mnt/test/test \
        --rw=write --bs=4k --size=512M --runtime=60
    # Remove device when done
    umount /mnt/test
    dmsetup remove flakey-test
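The table string piped into dmsetup has a fixed field order, and a down interval of 0 produces a device that never fails. A small sketch that builds the string and rejects that case; flakey_table is a hypothetical helper name, and the offset is assumed to be 0 as in the example above.

```shell
# Hypothetical helper: build a basic dm-flakey table line.
# $1 = size in 512-byte sectors, $2 = backing device,
# $3 = up interval (s), $4 = down interval (s)
flakey_table() {
    if [ "$4" -le 0 ]; then
        echo "down interval must be > 0, otherwise no I/O ever fails" >&2
        return 1
    fi
    echo "0 $1 flakey $2 0 $3 $4"
}

# Example (not executed here; needs root and a scratch device):
#   flakey_table "$(blockdev --getsz /dev/sdb)" /dev/sdb 10 5 |
#       dmsetup create flakey-test
```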
dm-flakey Table Format
    "<start> <size> flakey <dev> <offset> <up_interval> <down_interval> [<num_features> <features>]"
    # Features (optional):
    #   drop_writes   - silently discard all writes during down interval
    #   error_writes  - return errors for writes during down interval
    #   corrupt_bio_byte <Nth_byte> <direction> <value> <flags>
    #                 - overwrite byte N of each matching I/O during down interval
    # Example: corrupt writes (silent data corruption)
    # The 5 after the intervals is the number of feature words that follow
    echo "0 $SECTORS flakey /dev/sdb 0 10 5 5 corrupt_bio_byte 32 w 1 0" | \
        dmsetup create flakey-corrupt
Red Flags
Gotchas
Requires dmsetup.
Set down_interval > 0 or the device will always work and you won't test anything.
Check that the module is loaded with lsmod | grep dm_flakey.
tc netem is the Linux kernel's traffic control network emulation. It is built into the kernel, so no extra software is needed. It injects latency, packet loss, reordering, duplication, and corruption at the NIC queue level. The most powerful and widely used network fault injection tool on Linux.
Common Use Cases
Latency
    # Add 100ms fixed latency on eth0
    tc qdisc add dev eth0 root netem delay 100ms
    # Add 100ms ± 20ms jitter (uniform distribution)
    tc qdisc add dev eth0 root netem delay 100ms 20ms
    # Add 100ms ± 20ms jitter with 25% correlation (more realistic)
    tc qdisc add dev eth0 root netem delay 100ms 20ms 25%
    # Change an existing rule (use change, not add)
    tc qdisc change dev eth0 root netem delay 200ms
    # Remove all netem rules
    tc qdisc del dev eth0 root
Packet Loss
    # 5% random packet loss
    tc qdisc add dev eth0 root netem loss 5%
    # 5% loss with 25% correlation (bursty loss, more realistic)
    tc qdisc add dev eth0 root netem loss 5% 25%
    # Gilbert-Elliott model (models loss bursts better)
    tc qdisc add dev eth0 root netem loss gemodel 1% 10% 70% 5%
Packet Corruption, Reorder, Duplicate
    # Corrupt 1% of packets (flip random bit)
    tc qdisc add dev eth0 root netem corrupt 1%
    # Reorder: 25% of packets (50% correlation) are sent immediately, the rest are delayed by 10ms
    tc qdisc add dev eth0 root netem delay 10ms reorder 25% 50%
    # Duplicate 5% of packets
    tc qdisc add dev eth0 root netem duplicate 5%
Combined / Realistic WAN
    # Simulate a flaky mobile link: 150ms latency, 2% loss, some jitter
    tc qdisc add dev eth0 root netem \
        delay 150ms 30ms 25% \
        loss 2% 25% \
        duplicate 0.1%
    # Limit bandwidth to 1 Mbit/s (add tbf on top of netem)
    # Delete any existing root qdisc first, otherwise "add" fails
    tc qdisc del dev eth0 root
    tc qdisc add dev eth0 root handle 1: netem delay 50ms
    tc qdisc add dev eth0 parent 1:1 handle 10: tbf \
        rate 1mbit burst 32kbit latency 400ms
    # Verify what's applied
    tc qdisc show dev eth0
    # Remove everything
    tc qdisc del dev eth0 root
Apply Only to Specific IPs (using filter)
    # Affect only traffic to 10.0.0.5, leave other traffic clean
    tc qdisc add dev eth0 root handle 1: prio
    tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms loss 5%
    tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
        match ip dst 10.0.0.5/32 flowid 1:3
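Leftover netem rules are easy to forget, so one option is to wrap the experiment so the rule is removed as soon as the test command returns. A minimal sketch: with_netem and the TC variable are hypothetical names, the real run needs root or CAP_NET_ADMIN, and setting TC=echo only prints the commands. It does not handle signals; a trap would be needed to clean up after Ctrl-C.

```shell
# TC defaults to the real tc binary; set TC=echo for a dry run.
TC=${TC:-tc}

# Hypothetical wrapper: apply a netem rule, run a command, then clean up.
# $1 = device, $2 = netem arguments as one string, rest = command to run
with_netem() {
    dev=$1; rule=$2; shift 2
    # $rule is intentionally unquoted so it splits into separate arguments.
    $TC qdisc add dev "$dev" root netem $rule || return 1
    "$@"
    status=$?
    $TC qdisc del dev "$dev" root
    return $status
}

# Example (not executed here):
#   with_netem eth0 "delay 100ms loss 2%" ping -c 20 10.0.0.5
```

The wrapper returns the exit status of the wrapped command, so it can sit inside a larger test script without hiding failures.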
Red Flags
Applying netem to lo degrades all local IPC including database connections on the same host.
Confirm the rules are gone with tc qdisc show after testing.
Gotchas
netem shapes egress only. To affect incoming traffic, use ifb (Intermediate Functional Block) to redirect ingress to an egress queue first.
tc qdisc add when a rule already exists returns an error. Use tc qdisc change to modify an existing rule, or del then add.
tc modifications need root or the CAP_NET_ADMIN capability.
Verify the effect with ping -c 20 <target> for latency and iperf3 for throughput/loss.
toxiproxy is a TCP proxy for testing network conditions. It sits between your application and a dependency (database, API, cache) and injects faults per connection. Unlike tc netem, it works at the application layer and can target individual services. Originally built by Shopify.
Common Use Cases
Setup
    # Start toxiproxy server (listens on :8474 for API, proxies on defined ports)
    toxiproxy-server &
    # Create a proxy: app connects to localhost:15432, toxiproxy forwards to postgres:5432
    toxiproxy-cli create mydb --listen localhost:15432 --upstream postgres:5432
    # List proxies
    toxiproxy-cli list
    # Inspect a proxy
    toxiproxy-cli inspect mydb
Injecting Toxics
    # Add 200ms latency on downstream (response path)
    toxiproxy-cli toxic add mydb --type latency --downstream -a latency=200
    # Add latency with jitter
    toxiproxy-cli toxic add mydb --type latency -a latency=100 -a jitter=50
    # Limit bandwidth to 10 KB/s (simulate slow link)
    toxiproxy-cli toxic add mydb --type bandwidth -a rate=10
    # Timeout: stop all data, then close the connection after 5000ms
    toxiproxy-cli toxic add mydb --type timeout -a timeout=5000
    # Slice: send data in small chunks (1 byte every 100ms; delay is in microseconds)
    toxiproxy-cli toxic add mydb --type slicer -a average_size=1 -a delay=100000
    # Take the proxy down entirely (connection refused)
    toxiproxy-cli toggle mydb
    # Remove a toxic
    toxiproxy-cli toxic remove mydb --toxicName latency_downstream
    # Reset all toxics on a proxy
    toxiproxy-cli toxic remove mydb --all
Toxic Types
| Toxic | Effect |
|---|---|
| latency | Add fixed + jitter delay to each chunk of data |
| bandwidth | Throttle transfer rate (KB/s) |
| slow_close | Delay connection close by N ms; simulates half-open connections |
| timeout | Stop all data from getting through, then close the connection after N ms |
| reset_peer | Send TCP RST after N ms; abrupt connection termination |
| slicer | Chop stream into tiny pieces with delays; breaks naive parsers |
| limit_data | Close connection after N bytes transferred |
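The same toxics can also be created through the HTTP API that toxiproxy-server exposes on :8474, which is handy from test harnesses that do not ship the CLI. The sketch below only builds the JSON body for a downstream latency toxic; latency_toxic_json is a hypothetical helper name, and the endpoint in the comment assumes a server on localhost with a proxy named mydb.

```shell
# Hypothetical helper: JSON body for a downstream latency toxic.
# $1 = latency in ms, $2 = jitter in ms
latency_toxic_json() {
    printf '{"type":"latency","stream":"downstream","attributes":{"latency":%d,"jitter":%d}}\n' "$1" "$2"
}

# Example (not executed here; assumes toxiproxy-server on localhost:8474
# and an existing proxy named mydb):
#   latency_toxic_json 200 50 |
#       curl -s -X POST -d @- http://localhost:8474/proxies/mydb/toxics
```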
Gotchas
The Magic SysRq (System Request) key is a kernel interface that remains usable even when the system is largely unresponsive. It is triggered via keyboard (Alt+SysRq+key) or via /proc/sysrq-trigger, and is useful for forcing a kernel panic, invoking the OOM killer, syncing, and remounting filesystems.
Common Use Cases
Trigger via /proc
    # Enable sysrq (may be disabled by default)
    echo 1 > /proc/sys/kernel/sysrq
    # Show memory info (safe, just prints to dmesg)
    echo m > /proc/sysrq-trigger
    # Dump all tasks (t), useful for diagnosing hangs; use w for blocked tasks only
    echo t > /proc/sysrq-trigger
    # Force OOM killer to run immediately
    echo f > /proc/sysrq-trigger
    # Emergency sync all filesystems
    echo s > /proc/sysrq-trigger
    # Remount all filesystems read-only (before power cut)
    echo u > /proc/sysrq-trigger
    # IMMEDIATE kernel panic (crash dump / kdump testing)
    echo c > /proc/sysrq-trigger
Key Reference
| Key | Action | Risk |
|---|---|---|
| b | Reboot immediately (no sync) | Data loss |
| c | Kernel panic / crash | System crash |
| f | Call OOM killer | Processes killed |
| k | Kill all processes on current VT | Session loss |
| m | Dump memory info to console | Safe |
| o | Shut down system | Data loss |
| p | Dump current registers + flags | Safe |
| s | Sync all filesystems | Safe |
| t | Dump task list to console | Safe |
| u | Remount filesystems read-only | Safe |
| w | Dump uninterruptible tasks | Safe |
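Besides 0 and 1, the kernel.sysrq value can be a bitmask that enables only some key groups. The sketch below checks whether a given group is allowed; the bit values in the comments follow the kernel's sysrq documentation as best recalled here and should be treated as assumptions to verify for your kernel. sysrq_allows is a hypothetical helper name. As far as the kernel documentation describes it, the mask restricts keyboard invocation, while root writes to /proc/sysrq-trigger are always honoured.

```shell
# Hypothetical helper: succeed if a kernel.sysrq value permits a key group.
# $1 = value of kernel.sysrq, $2 = group bit to test
# Assumed bits: 8 = process dumps, 16 = sync, 32 = remount read-only,
#               64 = signalling processes, 128 = reboot/poweroff
sysrq_allows() {
    [ "$1" -eq 1 ] && return 0      # 1 enables everything
    [ $(( $1 & $2 )) -ne 0 ]
}

# Example (not executed here):
#   sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 16 && echo "keyboard sync allowed"
```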
Gotchas
Controlled by /proc/sys/kernel/sysrq. Value 0 = disabled, 1 = all enabled. Some distros default to 0 for security.
To persist the setting, add kernel.sysrq = 1 to /etc/sysctl.d/99-sysrq.conf.
The Linux kernel has built-in fault injection infrastructure exposed under /sys/kernel/debug (failslab and the fail_* directories). It allows you to inject failures into specific kernel subsystems (kmalloc, block I/O, futexes, and more) without any external tools. Requires a debug kernel (CONFIG_FAULT_INJECTION=y).
Common Injection Points
| Path | Injects failures into |
|---|---|
| /sys/kernel/debug/fail_page_alloc | Page allocator: simulate memory allocation failures |
| /sys/kernel/debug/fail_make_request | Block layer I/O requests |
| /sys/kernel/debug/fail_futex | Futex operations |
| /sys/kernel/debug/failslab | Slab/kmalloc allocations |
| /sys/kernel/debug/fail_function | Inject errors into arbitrary kernel functions (requires CONFIG_FAIL_FUNCTION) |
Usage: Inject kmalloc Failures
    # Mount debugfs if not already mounted
    mount -t debugfs none /sys/kernel/debug
    # Configure failslab: 10% probability, at most 100 failures, opted-in tasks only
    cd /sys/kernel/debug/failslab
    echo 10 > probability    # 10% of eligible allocations fail
    echo 1 > interval        # every call is eligible (N = only every Nth call)
    echo 100 > times         # inject at most 100 failures then stop (-1 = no limit)
    echo 0 > space           # size budget consumed before failures start (0 = start immediately)
    echo 1 > task-filter     # only fail allocations from tasks that opted in
    # Opt this shell into the fault injection
    echo 1 > /proc/self/make-it-fail
    # Now run your program; it will encounter kmalloc failures
    ./my-program
    # Disable
    echo 0 > /proc/self/make-it-fail
    echo 0 > /sys/kernel/debug/failslab/probability
Gotchas
Check which injection points your kernel exposes with ls -d /sys/kernel/debug/fail*.
Without task-filter=1 and per-task opt-in, you inject failures system-wide, which can destabilize the entire host.
Add debugfs /sys/kernel/debug debugfs defaults 0 0 to /etc/fstab if needed.
libfiu is Fault Injection in Userspace: a library and tool for injecting failures into POSIX API calls (open, read, write, malloc, etc.) in your application without kernel changes. It instruments your binary via LD_PRELOAD or direct library linkage.
Common Use Cases
    # Install libfiu (Debian/Ubuntu)
    apt install libfiu-dev fiu-utils
    # Run a program with fault injection via fiu-run (LD_PRELOAD wrapper)
    # Fail posix/io/rw/read with 10% probability
    fiu-run -x -c "enable_random name=posix/io/rw/read,probability=0.1" ./my-program
    # Fail malloc 5% of the time
    fiu-run -x -c "enable_random name=libc/mm/malloc,probability=0.05" ./my-program
    # Fail every call to open() (deterministic)
    fiu-run -x -c "enable_random name=posix/io/oc/open,probability=1,failinfo=-1" ./my-program
    # Fault point names follow the posix/... and libc/... hierarchy shown above;
    # fiu-run -l takes a path to the preload libraries, it does not list fault points
Gotchas
Code that makes raw system calls (for example via syscall()) bypasses libfiu's LD_PRELOAD hooks.
Without -x, fiu-run does not intercept POSIX functions, only your own code's fiu_fail() calls.
Sending signals is the simplest form of process fault injection: pause, resume, crash, or forcibly terminate processes to test application resilience, supervisor logic, and restart behaviour.
Common Use Cases
    # Graceful shutdown (SIGTERM, process can handle it)
    kill -SIGTERM 1234
    # Immediate kill (SIGKILL, cannot be caught or ignored)
    kill -9 1234
    # Pause a process (SIGSTOP, cannot be caught)
    kill -SIGSTOP 1234
    # Resume it
    kill -SIGCONT 1234
    # Send to all processes in a process group
    kill -SIGTERM -1234    # negative PID = process group
    # Kill by name
    pkill -SIGTERM nginx
    killall -9 java
    # Simulate a crash loop (kill, wait for restart, repeat)
    while true; do
        kill -9 $(pgrep myservice)
        sleep 10
    done
    # SIGSTOP a process for 5s then resume (CPU starvation simulation)
    kill -SIGSTOP 1234 && sleep 5 && kill -SIGCONT 1234
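A graceful-shutdown test boils down to sending SIGTERM and checking how the process exited. The sketch below uses a stand-in worker that traps SIGTERM; graceful_worker and term_exit_status are hypothetical names, and a real test would target your own service instead.

```shell
# Hypothetical stand-in for a service that handles SIGTERM gracefully.
graceful_worker() {
    trap 'echo "cleanup done"; exit 0' TERM
    while :; do sleep 1; done
}

# Send SIGTERM to a fresh worker and print its exit status.
# 0 means the handler ran; 143 (128 + 15) means it died from the signal.
term_exit_status() {
    graceful_worker >/dev/null &
    pid=$!
    sleep 1                 # give the worker time to install its trap
    kill -TERM "$pid"
    wait "$pid"
    echo $?
}

# Example: term_exit_status
```

The shell runs the trap only after the current sleep finishes, so the worker may take up to a second to exit after the signal arrives.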
Red Flags
If a process does not exit after SIGTERM, inspect it with strace -p PID and follow up with SIGKILL.
Gotchas
Linux control groups let you impose hard resource limits on a process or group of processes — CPU, memory, I/O bandwidth, PIDs. Essential for testing how applications behave under resource constraints without changing system-wide configuration.
Common Use Cases
Using systemd-run (easiest)
    # Run a process with 512 MB memory limit
    systemd-run --scope -p MemoryMax=512M ./my-program
    # Limit CPU to 50% of one core (CPUQuota)
    systemd-run --scope -p CPUQuota=50% ./my-program
    # Limit to 2 CPUs worth of time
    systemd-run --scope -p CPUQuota=200% ./my-program
    # Limit I/O write bandwidth to 10 MB/s on /dev/sda
    systemd-run --scope -p "IOWriteBandwidthMax=/dev/sda 10M" ./my-program
    # Limit maximum PIDs (no forks beyond 50)
    systemd-run --scope -p TasksMax=50 ./my-program
    # Combine: memory + CPU limits
    systemd-run --scope \
        -p MemoryMax=256M \
        -p CPUQuota=25% \
        stress-ng --cpu 4 --vm 1 --vm-bytes 1G --timeout 60s
Using cgroupsv2 directly
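    # Create a cgroup (the memory controller must be enabled in the parent's cgroup.subtree_control)
    mkdir /sys/fs/cgroup/mytest
    # Set 256 MB memory limit
    echo 268435456 > /sys/fs/cgroup/mytest/memory.max
    # Add current process to the cgroup
    echo $$ > /sys/fs/cgroup/mytest/cgroup.procs
    # Run a child that will hit the limit
    stress-ng --vm 1 --vm-bytes 1G --timeout 30s
    # Clean up: move the shell back to the root cgroup first, rmdir fails while the cgroup has members
    echo $$ > /sys/fs/cgroup/cgroup.procs
    rmdir /sys/fs/cgroup/mytest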
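To confirm that the limit was actually hit, the cgroup's memory.events file can be checked before the cgroup is removed. A minimal sketch, assuming the cgroup v2 format of one "key value" pair per line; cgroup_event_count is a hypothetical helper name.

```shell
# Hypothetical helper: print one counter from a cgroup v2 memory.events file.
# $1 = path to memory.events, $2 = key such as oom_kill or high
cgroup_event_count() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Example (not executed here):
#   cgroup_event_count /sys/fs/cgroup/mytest/memory.events oom_kill
```

The key is matched exactly, so oom and oom_kill are reported separately.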
Red Flags
Gotchas
MemoryHigh is a soft limit (reclaim pressure). MemoryMax is the hard OOM limit. For OOM testing, use MemoryMax.
cat /sys/fs/cgroup/mytest/memory.current and memory.events show live usage and OOM events.
Guidelines for running fault injection safely and productively, based on the Chaos Engineering discipline.
Define Steady State First
Measure baseline metrics (latency p99, error rate, throughput) before injecting faults. You need a baseline to know if the system degraded or recovered.
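As one way to capture a latency baseline, a percentile can be computed from raw samples with standard tools. The sketch below uses the nearest-rank method and assumes an integer percentile and one numeric sample per line; percentile is a hypothetical helper name, and how the samples are collected is left to your own tooling.

```shell
# Hypothetical helper: nearest-rank percentile of samples read from stdin.
# $1 = integer percentile (1-100)
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # Nearest rank: smallest r with r >= p/100 * N (integer ceiling).
            r = int((p * NR + 99) / 100)
            if (r < 1) r = 1
            print v[r]
        }'
}

# Example: five latency samples in milliseconds
#   printf '%s\n' 5 1 9 3 7 | percentile 50
```

Recording the same percentile before, during, and after the fault gives a like-for-like comparison.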
Blast Radius Control
Start with the smallest possible scope: one host, one service, one connection. Expand only after the smaller experiment is understood.
Run in Production (carefully)
Staging environments don't mirror production traffic patterns. The most valuable experiments run in production with small blast radius and automated rollback.
Automate and Abort
Instrument every experiment with a kill switch. If an unrelated metric degrades, abort automatically. Never leave fault injection running unattended.
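A kill switch can be as small as a loop that polls a health check while the fault runs. The sketch below is one possible shape, not a complete harness: run_with_abort is a hypothetical name, the health check is any command string that exits non-zero when the system looks unhealthy, and only the fault process itself is stopped, so state it left behind (tc rules, cgroups) still needs its own cleanup.

```shell
# Hypothetical helper: run a fault command, abort it if a health check fails.
# $1 = health-check command string, $2 = poll interval in seconds,
# rest = fault command to run. Prints "aborted" or "completed".
run_with_abort() {
    check=$1; interval=$2; shift 2
    "$@" &
    fault_pid=$!
    while kill -0 "$fault_pid" 2>/dev/null; do
        if ! sh -c "$check"; then
            kill -TERM "$fault_pid" 2>/dev/null
            wait "$fault_pid" 2>/dev/null || true
            echo "aborted"
            return 1
        fi
        sleep "$interval"
    done
    wait "$fault_pid" || true
    echo "completed"
}

# Example (not executed here; the health URL is a placeholder):
#   run_with_abort 'curl -fs http://localhost:8080/health >/dev/null' 5 \
#       stress-ng --cpu 0 --timeout 300s
```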
Vary the Fault Type
Latency faults expose different bugs than crash faults. Test both. A service that handles crashes gracefully may still fail badly on 5s latency injections.
Document Findings
Record what broke, what held, and what you changed. Fault injection that doesn't change the system is wasted time — file a ticket for every weakness found.
Tool Selection Guide
| Fault Type | Best Tool | Alternative |
|---|---|---|
| CPU saturation | stress-ng --cpu | stress |
| Memory pressure / OOM | stress-ng --vm --vm-keep | systemd-run -p MemoryMax |
| Disk I/O saturation | fio | stress-ng --hdd |
| Disk I/O errors | dm-flakey | fail_make_request debugfs |
| Network latency / loss | tc netem | toxiproxy |
| Per-service network fault | toxiproxy | tc netem + filter |
| Process crash | kill -9 | stress-ng --kill |
| Resource limits | systemd-run -p | raw cgroupsv2 |
| Kernel panic / crash dump | sysrq c | kdump config |
| Syscall-level fault injection | libfiu | fail debugfs |
Common Linux signals used in fault injection and testing.
| Signal | Number | Default Action | Catchable? | Use in Testing |
|---|---|---|---|---|
| SIGTERM | 15 | Terminate | Yes | Graceful shutdown test: process should clean up |
| SIGKILL | 9 | Kill immediately | No | Hard crash simulation: no cleanup possible |
| SIGSTOP | 19 | Pause | No | CPU starvation / freeze simulation |
| SIGCONT | 18 | Continue | Yes | Resume after SIGSTOP |
| SIGHUP | 1 | Terminate | Yes | Config reload test (many daemons reload on HUP) |
| SIGABRT | 6 | Core dump | Yes | Trigger core dump for post-mortem analysis |
| SIGSEGV | 11 | Core dump | Yes | Simulate memory fault (can send to self via raise()) |
| SIGPIPE | 13 | Terminate | Yes | Test broken pipe handling in network code |
| SIGUSR1/2 | 10/12 | Terminate | Yes | Application-defined: trigger debug output, rotate logs |
| SIGALRM | 14 | Terminate | Yes | Timeout simulation: process-level alarm |
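Some of these numbers (SIGSTOP, SIGCONT, SIGUSR1/2) differ between CPU architectures, so scripts are safer using names and looking numbers up when needed. A small sketch using the POSIX form of kill -l; signal_name is a hypothetical helper name.

```shell
# Hypothetical helper: translate a signal number, or the exit status of a
# process killed by a signal, into the signal name without the SIG prefix.
signal_name() {
    kill -l "$1"
}

# Example: a process that exited with status 143 was killed by
#   signal_name 143
```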