Linux Fault Injection & Stress Tools
Stress testing, fault injection, and chaos engineering on Linux — use cases, options, red flags, and gotchas
Target Layer
│
├── CPU / Load → stress-ng, stress, cpuburn
├── Memory     → stress-ng --vm, memtester
├── Disk / IO  → fio, dm-flakey
├── Network    → tc netem, toxiproxy
├── Kernel     → sysrq, fail debugfs, libfiu
└── Process    → kill/signals, cgroups
CPU / Mem / IO / Net stress-ng

The Swiss Army knife of stress testing. 300+ stressors covering CPU, memory, I/O, network, file system, pipes, timers, and more. Successor to stress with far more control.

Common Use Cases

Saturate all CPU cores · Trigger OOM killer · Exhaust file descriptors · Test thermal throttling · Validate autoscaling · Reproduce swapping

CPU Stressors

Shell
# Load all CPU cores for 60 seconds
stress-ng --cpu 0 --timeout 60s

# Load exactly 4 workers, specific CPU method
stress-ng --cpu 4 --cpu-method matrixprod --timeout 30s

# Show metrics at end
stress-ng --cpu 2 --metrics-brief --timeout 30s

# Pin to specific CPUs
stress-ng --cpu 2 --taskset 0,1 --timeout 30s

# List available CPU methods
stress-ng --cpu-method list

Memory Stressors

Shell
# Allocate and thrash 2 GB with 2 workers
stress-ng --vm 2 --vm-bytes 1G --timeout 60s

# Mmap stress — good for testing page cache pressure
stress-ng --mmap 2 --mmap-bytes 512M --timeout 60s

# Push toward OOM: allocate 90% of RAM and hold it (expect swapping first if swap is enabled)
stress-ng --vm 1 --vm-bytes 90% --vm-keep --timeout 60s

# Malloc/free churning
stress-ng --malloc 4 --timeout 30s

I/O and File System Stressors

Shell
# sync() stress: each worker calls sync() in a loop (no file data is written)
stress-ng --io 4 --timeout 60s

# HDD stress (read/write temp files)
stress-ng --hdd 2 --hdd-bytes 1G --timeout 60s

# File descriptor exhaustion (--open keeps opening files up to the per-process limit)
stress-ng --open 1 --timeout 30s

# inotify stress
stress-ng --inotify 4 --timeout 30s

Combined / Realistic Load

Shell
# Realistic mixed load: CPU + memory + I/O
stress-ng \
  --cpu 4 \
  --vm 2 --vm-bytes 512M \
  --io 2 \
  --timeout 120s \
  --metrics-brief

Red Flags

System becomes unresponsive: Too many workers, or --vm-keep with a large allocation, can fill RAM and trigger heavy swapping. Use --timeout and monitor with vmstat.
OOM killer fires unexpectedly: The kernel kills stress-ng workers (or other processes) without notifying you. Check dmesg | grep -i oom after the run.

Gotchas

--cpu 0 means "all cores": 0 is a special value that spawns one worker per CPU. On a 64-core machine this is very aggressive.
Stressor names vary by version: Newer versions add or rename stressors. Run stress-ng --stressors to see what your version supports.
Install: apt install stress-ng or yum install stress-ng. The package in older distros may be very outdated; consider building from source for the full stressor set.
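The guardrails above (always set a timeout, avoid an unbounded worker count) can be sketched as a small wrapper. This is a minimal illustration, not part of stress-ng: the function name `stress_cmd` and the default cap of 8 workers are assumptions, and the function only prints the command so it can be reviewed before it is run.

```shell
# stress_cmd WORKERS TIMEOUT_SECONDS [MAX_WORKERS]
# Hypothetical helper: prints a stress-ng CPU command line, never runs it.
stress_cmd() (
  workers=$1
  timeout=$2
  max=${3:-8}          # assumed cap; pick one that suits the host
  if [ -z "$timeout" ] || [ "$timeout" -le 0 ]; then
    echo "refusing: a positive timeout is required" >&2
    exit 1
  fi
  # --cpu 0 means one worker per CPU, so 0 is clamped to the cap as well
  if [ "$workers" -le 0 ] || [ "$workers" -gt "$max" ]; then
    workers=$max
  fi
  echo "stress-ng --cpu $workers --timeout ${timeout}s --metrics-brief"
)

stress_cmd 4 60
```

Piping the printed line to `sh` (or copying it) keeps a human in the loop for the first run on a new host.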
CPU / Mem / IO stress

The original stress tool. Simple, widely available, and good enough for basic saturation. Prefer stress-ng for anything more nuanced.

Common Options

Shell
stress --cpu 4                  # 4 CPU workers (sqrt loop)
stress --cpu 4 --timeout 60     # run for 60 seconds
stress --vm 2 --vm-bytes 512M   # 2 workers, each allocating 512 MB
stress --io 4                   # 4 I/O workers (sync calls)
stress --hdd 2 --hdd-bytes 1G   # write 1 GB temp files per worker
stress --cpu 2 --vm 1 --vm-bytes 256M --io 1 -t 30  # combined

Gotchas

CPU method is only sqrt(): stress uses a tight sqrt() loop, so it does not exercise integer units, caches, or SIMD. For more realistic CPU stress, use stress-ng --cpu-method matrixprod or similar.
--io stresses sync(), not disk writes: The --io workers call sync() in a loop, which forces dirty pages out to disk. That can tank disk throughput for other writers, but the workers generate no write I/O of their own.
CPU cpuburn / burnP6

Maximizes CPU power consumption and heat output. Used primarily for thermal stress testing, burn-in, and stability validation. Burns FPU and integer pipelines simultaneously.

Common Use Cases

Thermal throttling · Cooling system validation · CPU stability (overclock testing) · Power supply stress
Shell
# Burn a single core (run N copies in parallel for N cores)
burnP6 &
burnP6 &
burnP6 &
burnP6 &

# Using stress-ng as a modern alternative (cycles through every CPU method; heat output may not match cpuburn)
stress-ng --cpu 0 --cpu-method all --timeout 300s

# Monitor temperatures while burning
watch -n 1 sensors

Red Flags

CPU temperature above TJmax: Most modern CPUs throttle somewhere in the 90–105°C range. If the package temperature sits near that range, the cooling solution is inadequate.
System hangs or reboots: Indicates thermal shutdown, instability, or an insufficient power supply.

Gotchas

Architecture-specific binaries: burnP6 (Intel P6/Core), burnMMX, and burnBX are separate programs tuned per microarchitecture. Using the wrong one may not fully stress the CPU.
Not in most distro repos: Often needs to be compiled from source. stress-ng is a more practical modern substitute.
Memory stress-ng (memory stressors)

stress-ng memory stressors go beyond simple allocation — they cover mmap, malloc churn, NUMA access patterns, huge pages, and cache thrashing. Essential for testing OOM behaviour and memory subsystem limits.

Common Use Cases

Trigger OOM killer · Test swap behaviour · Validate memory limits (cgroups) · NUMA locality testing · Page cache pressure

Key Stressors

Stressor          What it does
--vm N            mmap + write + read cycle; most common memory stressor
--mmap N          Rapid mmap/munmap calls; stresses the kernel's VMA management
--malloc N        malloc/calloc/realloc/free churn; tests the allocator under load
--shm N           POSIX shared memory allocation and access
--numa N          NUMA memory access patterns across nodes
--bigheap N       Grows heap until OOM; good for finding leak-triggered crashes
--userfaultfd N   Exercises userfaultfd page fault handling

Trigger OOM and Observe Kernel Response

Shell
# Allocate 90% of RAM, hold it, force swapping / OOM
stress-ng --vm 1 --vm-bytes 90% --vm-keep --timeout 60s

# Watch OOM events in real time
dmesg -w | grep -i oom

# Check which process was killed and why
dmesg | grep -E "(oom_kill|Out of memory)"

# Test inside a cgroup with a 512 MB memory limit
systemd-run --scope -p MemoryMax=512M \
  stress-ng --vm 1 --vm-bytes 1G --timeout 30s

Red Flags

dmesg shows "oom_kill_process": The OOM killer fired. The victim may not be your stress process; the kernel picks the highest-scoring process, which could be your application.
swap si/so non-zero in vmstat: Active swapping is in progress. Expect major latency degradation across the system.

Gotchas

--vm-bytes accepts % of RAM: 90% of RAM on a 128 GB system is 115 GB. Combined with multiple workers this can fill and crash the host almost instantly.
vm-keep vs default: Without --vm-keep, workers free and reallocate memory continuously. With --vm-keep, memory is held, which is much more aggressive for OOM testing.
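The percentage arithmetic is easy to get wrong once several workers are involved, so it can help to compute the per-worker size explicitly. A minimal sketch, assuming `--vm-bytes` is applied per worker (check the man page of your stress-ng version); `vm_bytes_mb` is a hypothetical helper name.

```shell
# vm_bytes_mb TOTAL_RAM_MB PERCENT WORKERS
# Hypothetical helper: prints the per-worker size in MB so that all workers
# together use PERCENT of TOTAL_RAM_MB (integer arithmetic, rounds down).
vm_bytes_mb() (
  total_mb=$1 pct=$2 workers=$3
  echo $(( total_mb * pct / 100 / workers ))
)

# 128 GB host (131072 MB), 90% target, one worker: about 115 GB in that worker
vm_bytes_mb 131072 90 1
# Same 90% target split across 4 workers
vm_bytes_mb 131072 90 4
```

The result can be passed as, for example, `--vm 4 --vm-bytes 29491M`, which keeps the total bounded regardless of how the percentage form is interpreted.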
Memory memtester

Allocates a region of memory and runs a battery of read/write patterns to detect faulty RAM. Different goal from stress-ng: memtester validates correctness, not load.

Common Use Cases

RAM fault detection · Post-hardware-failure diagnosis · New server burn-in
Shell
# Test 1 GB of memory, 1 pass
memtester 1G 1

# Test 4 GB, 3 passes (long burn-in)
memtester 4G 3

# Run in background, log output
memtester 2G 1 > /tmp/memtester.log 2>&1 &

Sample Failing Output

Output
Loop 1/1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : FAILURE: 0xdeadbeef != 0xdeadbeff at 0x7f3a2c001008
  Compare SUB         : ok
  Compare MUL         : FAILURE: ...

Red Flags

Any FAILURE line: One or more RAM cells are faulty. The DIMM should be replaced. Note the address, since it may help identify which physical DIMM is bad.

Gotchas

Cannot test currently in-use RAM: memtester only tests the region it allocates. The kernel, running processes, and page cache occupy memory it cannot touch. Use memtest86+ (boot-time) for a full test.
Needs root for large allocations: On some systems, large mlock() calls require root or raising the RLIMIT_MEMLOCK limit.
memtest86+ for full hardware test: For a definitive RAM test, boot from memtest86+, which runs outside the OS and can test 100% of installed RAM.
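When memtester runs in the background with its output redirected to a log, the "any FAILURE line" red flag can be checked mechanically. A minimal sketch; `memtester_failures` is a hypothetical helper, and it assumes failures appear as lines containing the word FAILURE, as in the sample output above.

```shell
# memtester_failures LOGFILE
# Hypothetical helper: prints the number of FAILURE lines in the log and
# returns non-zero if there is at least one.
memtester_failures() (
  count=$(grep -c 'FAILURE' "$1")
  echo "$count"
  [ "$count" -eq 0 ]
)

# Usage (log path taken from the background example above):
#   memtester_failures /tmp/memtester.log || echo "RAM errors found"
```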
Disk / IO fio

Flexible I/O tester. Measures throughput, IOPS, and latency for any combination of read/write patterns and I/O depths. The standard for disk benchmarking and stress testing.

Common Use Cases

Measure disk throughput · Saturate I/O subsystem · Simulate database I/O patterns · Test RAID / LVM reliability · Reproduce disk contention

Common Patterns

Shell
# Sequential read throughput
fio --name=seqread --rw=read --bs=1M --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Sequential write throughput
fio --name=seqwrite --rw=write --bs=1M --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Random read IOPS (SSD / NVMe benchmark)
# iodepth > 1 needs an async engine; the default synchronous engine ignores it
fio --name=randread --rw=randread --bs=4k --size=4G \
    --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=4 --runtime=60 --group_reporting

# Mixed 70/30 read/write (database-like)
fio --name=mixed --rw=randrw --rwmixread=70 \
    --ioengine=libaio --direct=1 \
    --bs=8k --size=4G --iodepth=16 --runtime=120

# Stress test: maximum sustained write load
fio --name=sustained_write --rw=write --bs=128k \
    --size=20G --numjobs=8 --iodepth=64 \
    --ioengine=libaio --direct=1 --runtime=300 --group_reporting

# Direct I/O (bypass page cache)
fio --name=direct --rw=randwrite --bs=4k \
    --direct=1 --size=2G --runtime=60

Key Options

Option       Meaning
--rw         I/O pattern: read, write, randread, randwrite, randrw, readwrite
--bs         Block size (e.g. 4k, 128k, 1M)
--iodepth    Queue depth: how many I/Os to keep in flight per job (only takes effect with an async engine such as libaio or io_uring)
--numjobs    Number of parallel worker threads/processes
--direct=1   Bypass page cache; measures raw device speed
--ioengine   I/O engine: libaio, io_uring, sync, mmap
--runtime    Time limit in seconds (the job still ends early once --size has been transferred unless --time_based is set)
--filename   Target file or device (e.g. /dev/sdb)

Reading Output

Output
read: IOPS=45.2k, BW=176MiB/s (185MB/s)(10.3GiB/60001msec)
  lat (usec): min=78, avg=708, stdev=1423.7, max=58392
  clat percentiles (usec):
   | 50.00th=[  290], 99.00th=[ 5472], 99.90th=[13697]

Red Flags

99th percentile latency 10x+ the median: Indicates I/O jitter, which is common with spinning disks under mixed load or cloud storage throttling.
IOPS drops sharply mid-run: The disk may be hitting thermal limits, its write cache may be full, or a RAID rebuild/scrub may be competing for I/O.

Gotchas

Without --direct=1, you're measuring the page cache: For benchmarking the storage device itself, always use --direct=1.
Running against /dev/sdX destroys data: Always use a test file (e.g. --filename=/tmp/fio-test) unless you explicitly want raw device testing on a disk with no data.
numjobs × size is the total file space: 8 jobs × 4 GB = 32 GB of disk space consumed during the test.
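The "99th percentile 10x+ the median" red flag can be computed from fio's text output instead of eyeballed. A minimal sketch, assuming the percentile lines look like the sample in Reading Output (`50.00th=[ ... ]`, `99.00th=[ ... ]`); `clat_ratio` is a hypothetical helper, and with several jobs or a read/write mix it reports the last percentile block it sees.

```shell
# clat_ratio: reads fio text output on stdin and prints p99/p50 as an integer.
# Hypothetical helper; works whether the two percentiles share a line or not.
clat_ratio() {
  awk '
    /50\.00th=\[/ { s = $0; sub(/.*50\.00th=\[ */, "", s); p50 = s + 0 }
    /99\.00th=\[/ { s = $0; sub(/.*99\.00th=\[ */, "", s); p99 = s + 0 }
    END { if (p50 > 0) print int(p99 / p50) }
  '
}

# Sample line from the output above: 5472 / 290 is roughly 18x the median
printf '   | 50.00th=[  290], 99.00th=[ 5472], 99.90th=[13697]\n' | clat_ratio
```

A ratio of 10 or more matches the red flag described above.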
Disk / IO dm-flakey

Linux Device Mapper target that simulates intermittent disk failures. Makes a real block device appear to fail at configurable intervals — essential for testing filesystem and application resilience to I/O errors without a physically failing disk.

Common Use Cases

Test filesystem error recovery · Validate RAID rebuild logic · Simulate cloud disk I/O errors · Test application retry/fallback

Setup and Usage

Shell
# Load dm-flakey module
modprobe dm-flakey

# Get device size in 512-byte sectors
SECTORS=$(blockdev --getsz /dev/sdb)

# Create flakey device:
#   - works fine for 10 seconds
#   - fails all I/O for 5 seconds
#   - repeats
echo "0 $SECTORS flakey /dev/sdb 0 10 5" | \
  dmsetup create flakey-test

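# Use /dev/mapper/flakey-test as a normal block device
# (mkfs overwrites /dev/sdb, and can itself fail if it runs into a down interval)
mkfs.ext4 /dev/mapper/flakey-test
mkdir -p /mnt/test
mount /dev/mapper/flakey-test /mnt/test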

# Run I/O against it and observe errors
fio --name=flakey --filename=/mnt/test/test \
    --rw=write --bs=4k --size=512M --runtime=60

# Remove device when done
umount /mnt/test
dmsetup remove flakey-test

dm-flakey Table Format

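Format
"<start> <size> flakey <dev> <offset> <up_interval> <down_interval> [<num_features> <feature args>]"

# With no features, all I/O fails during the down interval.
# Features (optional); <num_features> counts every word that follows it:
#   drop_writes        : silently discard all writes during down interval
#   error_writes       : return errors for writes during down interval
#   corrupt_bio_byte <Nth_byte> <direction> <value> <flags>
#                      : overwrite byte N (counted from 1) of each matching bio during down interval

# Example: corrupt writes (silent data corruption); 5 feature words follow the count
echo "0 $SECTORS flakey /dev/sdb 0 10 5 5 corrupt_bio_byte 32 w 1 0" | \
  dmsetup create flakey-corrupt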

Red Flags

Application does not detect I/O errors: If your app silently ignores write failures from dm-flakey, you have a data integrity bug. Check return codes from write() / fsync().
Filesystem goes read-only: ext4 mounted with errors=remount-ro remounts read-only after an I/O error in its metadata, and XFS shuts the filesystem down. This is correct behaviour; verify your app handles it gracefully.

Gotchas

Requires root and kernel support: dm-flakey needs the Device Mapper kernel module and root access to use dmsetup.
down_interval = 0 means never fails: Make sure down_interval > 0 or the device will always work and you won't test anything.
Not available in all cloud VMs: Some cloud providers restrict device mapper usage inside VMs or ship kernels without the module. Check with modinfo dm-flakey; lsmod | grep dm_flakey only shows it once it is loaded.
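Hand-writing the table string makes it easy to miscount the feature words. A minimal sketch of a builder that derives the count from its arguments; `flakey_table` is a hypothetical helper, it assumes the feature count is the number of words that follow it (as described in the kernel's dm-flakey documentation), and it only prints the table so it can be inspected before being piped to dmsetup.

```shell
# flakey_table DEV SECTORS UP_SECONDS DOWN_SECONDS [FEATURE_WORD...]
# Hypothetical helper: prints a dm-flakey table line, never calls dmsetup.
flakey_table() (
  dev=$1 sectors=$2 up=$3 down=$4
  shift 4
  table="0 $sectors flakey $dev 0 $up $down"
  # The feature count is simply the number of remaining words
  if [ "$#" -gt 0 ]; then
    table="$table $# $*"
  fi
  echo "$table"
)

flakey_table /dev/sdb 2048 10 5
flakey_table /dev/sdb 2048 10 5 drop_writes
flakey_table /dev/sdb 2048 10 5 corrupt_bio_byte 32 w 1 0
```

Usage, once the printed line looks right: `flakey_table /dev/sdb "$SECTORS" 10 5 drop_writes | dmsetup create flakey-test`.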
Network tc netem

Linux kernel traffic control network emulation. Built into the kernel — no extra software needed. Injects latency, packet loss, reorder, duplication, and corruption at the NIC queue level. The most powerful and widely-used network fault injection tool on Linux.

Common Use Cases

Simulate WAN latency · Test retry / timeout logic · Reproduce packet loss · Simulate flaky connections · Test TCP congestion behaviour

Latency

Shell
# Add 100ms fixed latency on eth0
tc qdisc add dev eth0 root netem delay 100ms

# Add 100ms ± 20ms jitter (uniform distribution)
tc qdisc add dev eth0 root netem delay 100ms 20ms

# Add 100ms ± 20ms jitter with 25% correlation (more realistic)
tc qdisc add dev eth0 root netem delay 100ms 20ms 25%

# Change an existing rule (use change, not add)
tc qdisc change dev eth0 root netem delay 200ms

# Remove all netem rules
tc qdisc del dev eth0 root

Packet Loss

Shell
# 5% random packet loss
tc qdisc add dev eth0 root netem loss 5%

# 5% loss with 25% correlation (bursty loss, more realistic)
tc qdisc add dev eth0 root netem loss 5% 25%

# Gilbert-Elliott model (models loss bursts better)
tc qdisc add dev eth0 root netem loss gemodel 1% 10% 70% 5%

Packet Corruption, Reorder, Duplicate

Shell
# Corrupt 1% of packets (flip random bit)
tc qdisc add dev eth0 root netem corrupt 1%

# Reorder: 25% of packets (50% correlation) are sent immediately; the rest are delayed by 10ms
tc qdisc add dev eth0 root netem delay 10ms reorder 25% 50%

# Duplicate 5% of packets
tc qdisc add dev eth0 root netem duplicate 5%

Combined / Realistic WAN

Shell
# Simulate a flaky mobile link: 150ms latency, 2% loss, some jitter
tc qdisc add dev eth0 root netem \
  delay 150ms 30ms 25% \
  loss 2% 25% \
  duplicate 0.1%

# Limit bandwidth to 1 Mbit/s (add tbf on top of netem)
# Delete the root qdisc added above first, or the next add fails
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: netem delay 50ms
tc qdisc add dev eth0 parent 1:1 handle 10: tbf \
  rate 1mbit burst 32kbit latency 400ms

# Verify what's applied
tc qdisc show dev eth0

# Remove everything
tc qdisc del dev eth0 root

Apply Only to Specific IPs (using filter)

Shell
# Affect only traffic to 10.0.0.5, leave other traffic clean
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms loss 5%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dst 10.0.0.5/32 flowid 1:3

Red Flags

Applied to loopback (lo) affects localhost traffic: Adding netem to lo degrades all local IPC, including database connections on the same host.
Forgot to remove after testing: netem rules stay active until they are deleted or the host reboots, and they come back after a reboot if a persistent network config applies them. Always verify with tc qdisc show after testing.

Gotchas

Outbound only by default: netem applies to the egress (outbound) queue. To affect inbound traffic, use ifb (Intermediate Functional Block) to redirect ingress to an egress queue first.
tc add vs tc change: Running tc qdisc add when a rule already exists returns an error. Use tc qdisc change to modify an existing rule, or del then add.
Requires root / CAP_NET_ADMIN: All tc modifications need root or the CAP_NET_ADMIN capability.
Verify with ping / iperf3: After applying netem, confirm with ping -c 20 <target> for latency and iperf3 for throughput/loss.
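Forgetting to remove a rule is the most common netem mistake, so one option is to tie the rule's lifetime to the test command. A minimal sketch; `with_netem` and the `TC` override are hypothetical names introduced here (setting `TC=echo` previews the tc calls without root), and it assumes the interface has no other root qdisc that must be preserved.

```shell
# with_netem DEV "NETEM ARGS" COMMAND [ARGS...]
# Hypothetical wrapper: adds a netem root qdisc, runs COMMAND, and removes
# the qdisc again when the subshell exits.
with_netem() (
  dev=$1 spec=$2
  shift 2
  tc_bin=${TC:-tc}
  # $spec is deliberately unquoted so "delay 100ms loss 1%" splits into words
  $tc_bin qdisc add dev "$dev" root netem $spec || exit 1
  trap '$tc_bin qdisc del dev "$dev" root' EXIT
  trap 'exit 1' INT TERM      # make Ctrl-C go through the EXIT trap too
  "$@"
  exit $?
)

# Dry run: print the tc calls that would be made around the test command
TC=echo with_netem eth0 "delay 100ms loss 1%" true
```

A real run would look like `with_netem eth0 "delay 100ms" ping -c 20 <target>`, executed as root or with CAP_NET_ADMIN.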
Network toxiproxy

TCP proxy for testing network conditions. Sits between your application and a dependency (database, API, cache) and injects faults per connection. Unlike tc netem, it runs in user space as a proxy for individual TCP connections, so it can target individual services. Originally built by Shopify.

Common Use Cases

Test DB connection failures · Slow downstream API · Connection timeout behaviour · Graceful degradation · Circuit breaker validation

Setup

Shell
# Start toxiproxy server (listens on :8474 for API, proxies on defined ports)
toxiproxy-server &

# Create a proxy: app connects to localhost:15432, toxiproxy forwards to postgres:5432
# (flags go before the proxy name; newer releases may not parse flags placed after it)
toxiproxy-cli create --listen localhost:15432 --upstream postgres:5432 mydb

# List proxies
toxiproxy-cli list

# Inspect a proxy
toxiproxy-cli inspect mydb

Injecting Toxics

Shell
# Add 200ms latency on downstream (response path)
toxiproxy-cli toxic add --type latency --downstream -a latency=200 mydb

# Add latency with jitter
toxiproxy-cli toxic add --type latency -a latency=100 -a jitter=50 mydb

# Limit bandwidth to 10 KB/s (simulate slow link)
toxiproxy-cli toxic add --type bandwidth -a rate=10 mydb

# Timeout: stop all data, then close the connection after 5000ms
toxiproxy-cli toxic add --type timeout -a timeout=5000 mydb

# Slice: send data in small chunks (1 byte per 100ms; delay is in microseconds)
toxiproxy-cli toxic add --type slicer -a average_size=1 -a delay=100000 mydb

# Take the proxy down entirely (connection refused)
toxiproxy-cli toggle mydb

# Remove a toxic
toxiproxy-cli toxic remove --toxicName latency_downstream mydb

# Reset: re-enable every proxy and remove all toxics (HTTP API)
curl -X POST http://localhost:8474/reset

Toxic Types

Toxic        Effect
latency      Add fixed + jitter delay to each chunk of data
bandwidth    Throttle transfer rate (KB/s)
slow_close   Delay connection close by N ms (simulates half-open connections)
timeout      Stop all data and close the connection after N ms (0 = never close, just drop data)
reset_peer   Send TCP RST after N ms: abrupt connection termination
slicer       Chop stream into tiny pieces with delays; breaks naive parsers
limit_data   Close connection after N bytes transferred

Gotchas

Application must connect through the proxy: You must point your app at the toxiproxy port, not the real service. Otherwise toxiproxy has no effect.
Does not affect UDP: toxiproxy is TCP-only. For UDP fault injection, use tc netem.
Also available as a Go library: Useful for integration tests: spin up a toxiproxy programmatically, inject faults, assert your app behaves correctly, tear down.
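For scripted tests, toxics can also be added through the HTTP API on port 8474 rather than the CLI. A minimal sketch, assuming the API accepts a JSON body with `type`, `stream`, `toxicity`, and `attributes` on `POST /proxies/<name>/toxics`; the helper names `toxic_json` and `add_toxic` and the `TOXIPROXY_HOST` variable are hypothetical.

```shell
# toxic_json TYPE STREAM ATTR_NAME ATTR_VALUE
# Hypothetical helper: prints the JSON body for a single-attribute toxic.
toxic_json() {
  printf '{"type":"%s","stream":"%s","toxicity":1.0,"attributes":{"%s":%s}}\n' \
    "$1" "$2" "$3" "$4"
}

# add_toxic PROXY TYPE STREAM ATTR_NAME ATTR_VALUE
# Posts the body to a running toxiproxy server (not executed here).
add_toxic() (
  proxy=$1
  shift
  toxic_json "$@" | curl -s -X POST -d @- \
    "http://${TOXIPROXY_HOST:-localhost:8474}/proxies/$proxy/toxics"
)

# Preview the payload for a 200 ms downstream latency toxic
toxic_json latency downstream latency 200
```

With a server running, `add_toxic mydb latency downstream latency 200` would apply it to the proxy created in Setup.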
Kernel sysrq

Magic System Request key — a kernel interface that remains usable even when the system is largely unresponsive. Triggered via keyboard (Alt+SysRq+key) or via /proc/sysrq-trigger. Useful for forcing kernel panic, OOM killer, sync, and remount.

Common Use Cases

Trigger kernel panic for crash dump testing · Force OOM killer run · Emergency sync + remount read-only · Debug hung system

Trigger via /proc

Shell
# Enable all sysrq functions for keyboard use (root writes to /proc/sysrq-trigger work regardless)
echo 1 > /proc/sys/kernel/sysrq

# Show memory info (safe — just prints to dmesg)
echo m > /proc/sysrq-trigger

# Dump all task states (t); use w to list only blocked (D-state) tasks when diagnosing hangs
echo t > /proc/sysrq-trigger

# Force OOM killer to run immediately
echo f > /proc/sysrq-trigger

# Emergency sync all filesystems
echo s > /proc/sysrq-trigger

# Remount all filesystems read-only (before power cut)
echo u > /proc/sysrq-trigger

# IMMEDIATE kernel panic (crash dump / kdump testing)
echo c > /proc/sysrq-trigger

Key Reference

Key  Action                              Risk
b    Reboot immediately (no sync)        Data loss
c    Kernel panic / crash                System crash
f    Call OOM killer                     Processes killed
k    Kill all processes on current VT    Session loss
m    Dump memory info to console         Safe
o    Shut down system (power off)        Data loss
p    Dump current registers + flags      Safe
s    Sync all filesystems                Safe
t    Dump task list to console           Safe
u    Remount filesystems read-only       Safe
w    Dump uninterruptible tasks          Safe

REISUB (safe emergency reboot sequence): R (take keyboard out of raw mode) → E (SIGTERM all) → I (SIGKILL all) → S (sync) → U (remount RO) → B (reboot). Type slowly with a pause between each key.

Gotchas

sysrq may be restricted: Check /proc/sys/kernel/sysrq. 0 = disabled, 1 = all functions enabled, larger values are a bitmask of allowed functions (many distros ship a restricted mask). The setting governs keyboard use; root can always write to /proc/sysrq-trigger.
echo c will crash the live system immediately: Only use on a test system with kdump configured or where you expect and want the crash.
Permanent via sysctl: To persist, add kernel.sysrq = 1 to /etc/sysctl.d/99-sysrq.conf.
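The value in /proc/sys/kernel/sysrq is not just on/off: per the kernel's sysrq documentation, values above 1 are a bitmask of permitted keyboard functions. A minimal sketch for decoding it; `sysrq_allows` and the `SYSRQ_*` variable names are hypothetical, and the bit values (16 sync, 32 remount read-only, 64 signalling processes, 128 reboot/poweroff) are taken from that documentation.

```shell
# Bit values as listed in the kernel's sysrq documentation
SYSRQ_SYNC=16 SYSRQ_REMOUNT_RO=32 SYSRQ_SIGNAL=64 SYSRQ_REBOOT=128

# sysrq_allows MASK BIT
# Hypothetical helper: succeeds if a function class is enabled for keyboard use.
sysrq_allows() (
  mask=$1 bit=$2
  # 1 means "everything enabled"; otherwise test the individual bit
  [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]
)

# Example with a mask of 176 (128 + 32 + 16): sync yes, signalling (E/I/F keys) no
sysrq_allows 176 "$SYSRQ_SYNC"   && echo "sync allowed"
sysrq_allows 176 "$SYSRQ_SIGNAL" || echo "signalling blocked"
```

On a live system the mask would come from `cat /proc/sys/kernel/sysrq`.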
Kernel fail debugfs interface

The Linux kernel has built-in fault injection infrastructure exposed via /sys/kernel/debug/fail_*. It allows you to inject failures into specific kernel subsystems — kmalloc, block I/O, futexes, and more — without any external tools. Requires a debug kernel (CONFIG_FAULT_INJECTION=y).

Common Injection Points

Path                                  Injects failures into
/sys/kernel/debug/fail_page_alloc     Page allocator: simulate memory allocation failures
/sys/kernel/debug/fail_make_request   Block layer I/O requests
/sys/kernel/debug/fail_futex          Futex operations
/sys/kernel/debug/failslab            Slab/kmalloc allocations
/sys/kernel/debug/fail_function       Inject errors into arbitrary kernel functions (requires CONFIG_FAIL_FUNCTION)

Usage: Inject kmalloc Failures

Shell
# Mount debugfs if not already mounted
mount -t debugfs none /sys/kernel/debug

# Configure failslab: 10% probability, at most 100 failures, opted-in tasks only
cd /sys/kernel/debug/failslab
echo 10   > probability    # 10% of allocations fail
echo 1    > interval       # consider every call (values > 1 space failures out; pair those with probability=100)
echo 100  > times           # inject at most 100 failures then stop (-1 = unlimited)
echo 0    > space           # byte budget consumed before failures start; 0 = start immediately
echo 1    > task-filter     # only fail allocations from tasks that set make-it-fail

# Opt this shell into the fault injection
echo 1 > /proc/self/make-it-fail

# Now run your program — it will encounter kmalloc failures
./my-program

# Disable
echo 0 > /proc/self/make-it-fail
echo 0 > /sys/kernel/debug/failslab/probability

Gotchas

Requires CONFIG_FAULT_INJECTION=y: Not enabled in most production kernels. Typically available on development/debug kernels or custom builds. Check with ls /sys/kernel/debug | grep fail.
task-filter is important: Without task-filter=1 and per-task opt-in, you inject failures system-wide, which can destabilize the entire host.
debugfs must be mounted: On some systems debugfs is not auto-mounted. Add debugfs /sys/kernel/debug debugfs defaults 0 0 to /etc/fstab if needed.
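Because a forgotten non-zero probability keeps injecting failures, it can help to wrap the configure, run, and disable steps in one function that always restores the setting. A minimal sketch under stated assumptions: the function names are hypothetical, the directory is passed in (normally /sys/kernel/debug/failslab) so the logic can be tried against a scratch directory, and it assumes the make-it-fail flag is inherited by the command started from the opted-in shell, as the example above relies on.

```shell
# failslab_on DIR PROBABILITY TIMES: write the settings discussed above
failslab_on() (
  dir=$1
  echo "$2" > "$dir/probability" &&
  echo 1    > "$dir/interval" &&
  echo "$3" > "$dir/times" &&
  echo 0    > "$dir/space" &&
  echo 1    > "$dir/task-filter"
)

# failslab_off DIR: stop injecting
failslab_off() (
  echo 0 > "$1/probability"
)

# run_with_failslab DIR COMMAND [ARGS...]
# Hypothetical wrapper: 10% probability, at most 100 failures, then restore.
run_with_failslab() (
  dir=$1
  shift
  failslab_on "$dir" 10 100 || exit 1
  trap 'failslab_off "$dir"' EXIT
  # Opt this subshell in when the kernel exposes the per-task switch
  if [ -w /proc/self/make-it-fail ]; then
    echo 1 > /proc/self/make-it-fail
  fi
  "$@"
  exit $?
)
```

Usage on a debug kernel, as root: `run_with_failslab /sys/kernel/debug/failslab ./my-program`.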
Kernel / Userspace libfiu

Fault Injection in Userspace. A library + tool for injecting failures into POSIX API calls (open, read, write, malloc, etc.) in your application without kernel changes. Instruments your binary via LD_PRELOAD or direct library linkage.

Common Use Cases

Fail open() / read() calls · Simulate ENOMEM from malloc · Test error handling paths · Increase test coverage of error branches
Shell
# Install libfiu (Debian/Ubuntu)
apt install libfiu-dev fiu-utils

# Run a program with fault injection via fiu-run (LD_PRELOAD wrapper)
# Fail posix/io/rw/read with 10% probability
fiu-run -x -c "enable_random name=posix/io/rw/read,probability=0.1" ./my-program

# Fail malloc 5% of the time
fiu-run -x -c "enable_random name=libc/mm/malloc,probability=0.05" ./my-program

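# Fail every call to open() (probability=1 is deterministic, but it is not "the Nth call")
fiu-run -x -c "enable_random name=posix/io/oc/open,probability=1,failinfo=-1" ./my-program

# fiu-run has no option that lists fault points (-l sets the preload library path);
# names follow the posix/... and libc/... hierarchy described in the libfiu documentation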

Gotchas

Only affects libc-wrapped calls: Direct syscalls (via syscall()) bypass libfiu's LD_PRELOAD hooks.
-x flag enables POSIX wrappers: Without -x, fiu-run does not intercept POSIX functions, only your own code's fiu_fail() calls.
Integrate into tests: libfiu can be linked directly into test binaries for deterministic fault injection at specific call sites, which is better than LD_PRELOAD for CI use.
Process kill / signals

Sending signals is the simplest form of process fault injection — pause, resume, crash, or forcibly terminate processes to test application resilience, supervisor logic, and restart behaviour.

Common Use Cases

Test graceful shutdown · Simulate process crash · Pause / resume (CPU starvation) · Test supervisor restart logic · Validate PID file / lock cleanup
Shell
# Graceful shutdown (SIGTERM — process can handle it)
kill -SIGTERM 1234

# Immediate kill (SIGKILL — cannot be caught or ignored)
kill -9 1234

# Pause a process (SIGSTOP — cannot be caught)
kill -SIGSTOP 1234

# Resume it
kill -SIGCONT 1234

# Send to all processes in a process group
kill -SIGTERM -1234   # negative PID = process group

# Kill by name
pkill -SIGTERM nginx
killall -9 java

# Simulate a crash loop (kill, wait for restart, repeat)
while true; do
  pkill -9 -x myservice   # does nothing if the service has not restarted yet
  sleep 10
done

# SIGSTOP a process for 5s then resume (CPU starvation simulation)
kill -SIGSTOP 1234 && sleep 5 && kill -SIGCONT 1234

Red Flags

Process does not exit on SIGTERM: It either ignores SIGTERM or is stuck in an uninterruptible sleep (D state). Check strace -p PID and follow up with SIGKILL; a D-state process will not die until its blocking kernel call returns.
Supervisor does not restart the process: Your supervisor (systemd, supervisord, Kubernetes) has a restart policy gap. Check logs and restart policies.

Gotchas

SIGKILL (-9) bypasses all cleanup: No defer, no atexit, no signal handler runs. Files are left open, locks held, temp files not cleaned. Use SIGTERM first; only escalate to SIGKILL if needed.
SIGSTOP is invisible to the process: Unlike SIGTSTP, SIGSTOP cannot be caught. The process has no idea it was paused, which makes it effective for simulating CPU starvation or a hung scheduler.
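The "SIGTERM first, SIGKILL only if needed" advice can be sketched as a small escalation helper. `stop_gracefully` is a hypothetical name and the 10-second default grace period is an assumption; the function reports which path was taken so a test can assert that the graceful path was enough.

```shell
# stop_gracefully PID [GRACE_SECONDS]
# Hypothetical helper: sends SIGTERM, polls once per second, and escalates
# to SIGKILL when the grace period runs out.
stop_gracefully() (
  pid=$1
  grace=${2:-10}
  kill -TERM "$pid" 2>/dev/null || { echo "not running"; exit 0; }
  while [ "$grace" -gt 0 ]; do
    # kill -0 only checks that the PID exists; an unreaped zombie still counts
    kill -0 "$pid" 2>/dev/null || { echo "exited after SIGTERM"; exit 0; }
    sleep 1
    grace=$(( grace - 1 ))
  done
  kill -KILL "$pid" 2>/dev/null
  echo "escalated to SIGKILL"
)

# Usage: stop_gracefully 1234 5
```

If the output is "escalated to SIGKILL" for a service that is supposed to shut down cleanly, that is the red flag described above.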
Process / Kernel cgroups limits

Linux control groups let you impose hard resource limits on a process or group of processes — CPU, memory, I/O bandwidth, PIDs. Essential for testing how applications behave under resource constraints without changing system-wide configuration.

Common Use Cases

Enforce memory limit → trigger OOM · Throttle CPU to simulate slow host · Limit I/O bandwidth per service · Test PID exhaustion

Using systemd-run (easiest)

Shell
# Run a process with 512 MB memory limit
systemd-run --scope -p MemoryMax=512M ./my-program

# Limit CPU to 50% of one core (CPUQuota)
systemd-run --scope -p CPUQuota=50% ./my-program

# Limit to 2 CPUs worth of time
systemd-run --scope -p CPUQuota=200% ./my-program

# Limit I/O write bandwidth to 10 MB/s on /dev/sda
systemd-run --scope -p "IOWriteBandwidthMax=/dev/sda 10M" ./my-program

# Limit maximum PIDs (no forks beyond 50)
systemd-run --scope -p TasksMax=50 ./my-program

# Combine: memory + CPU limits
systemd-run --scope \
  -p MemoryMax=256M \
  -p CPUQuota=25% \
  stress-ng --cpu 4 --vm 1 --vm-bytes 1G --timeout 60s

Using cgroupsv2 directly

Shell
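# Create a cgroup (the memory controller must be enabled in the parent's cgroup.subtree_control)
mkdir /sys/fs/cgroup/mytest

# Set 256 MB memory limit
echo 268435456 > /sys/fs/cgroup/mytest/memory.max

# Add current process to the cgroup
echo $$ > /sys/fs/cgroup/mytest/cgroup.procs

# Run a child that will hit the limit
stress-ng --vm 1 --vm-bytes 1G --timeout 30s

# Clean up: move the shell back to the root cgroup first, since rmdir fails while the cgroup has members
echo $$ > /sys/fs/cgroup/cgroup.procs
rmdir /sys/fs/cgroup/mytest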

Red Flags

OOM kill fires but application keeps running: The OOM killer may kill a child process. If the parent continues without noticing, you have a missing error-handling path.
CPUQuota throttling causes latency spikes: CPU throttling under cgroups causes bursty latency. The process runs at full speed, then stalls waiting for the next quota period (100ms by default).

Gotchas

MemoryMax vs MemoryHigh: MemoryHigh is a soft limit (reclaim pressure). MemoryMax is the hard OOM limit. For OOM testing, use MemoryMax.
CPU throttling period matters: The default CPU quota period is 100ms. A CPU-bound process at 50% quota gets 50ms of CPU time, then stalls for 50ms. This creates periodic 50ms latency spikes which can look like GC pauses.
Inspect cgroup stats: cat /sys/fs/cgroup/mytest/memory.current and memory.events show live usage and OOM events.
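The quota/period arithmetic maps directly onto the cgroup v2 cpu.max file, which holds a "<quota> <period>" pair in microseconds. A minimal sketch that converts a CPUQuota-style percentage into that pair; `cpu_max` is a hypothetical helper, and the 100000 µs default matches the 100ms period mentioned above.

```shell
# cpu_max PERCENT [PERIOD_US]
# Hypothetical helper: prints the "<quota> <period>" line for cpu.max.
cpu_max() (
  pct=$1
  period=${2:-100000}
  echo "$(( period * pct / 100 )) $period"
)

cpu_max 50        # 50 ms of CPU time per 100 ms period
cpu_max 200       # two CPUs' worth of time per period
cpu_max 50 10000  # same 50% share, but shorter stalls (5 ms of every 10 ms)
```

Writing the result into a cgroup (for example `cpu_max 50 > /sys/fs/cgroup/mytest/cpu.max`) requires the cpu controller to be enabled for that cgroup; a shorter period trades longer stalls for more frequent, smaller ones.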
Reference Chaos Engineering Principles

Guidelines for running fault injection safely and productively, based on the Chaos Engineering discipline.

Define Steady State First

Measure baseline metrics (latency p99, error rate, throughput) before injecting faults. You need a baseline to know if the system degraded or recovered.

Blast Radius Control

Start with the smallest possible scope: one host, one service, one connection. Expand only after the smaller experiment is understood.

Run in Production (carefully)

Staging environments don't mirror production traffic patterns. The most valuable experiments run in production with small blast radius and automated rollback.

Automate and Abort

Instrument every experiment with a kill switch. If an unrelated metric degrades, abort automatically. Never leave fault injection running unattended.
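A kill switch can be as simple as comparing each new metric sample against the steady-state baseline and rolling back on the first violation. A minimal sketch; `within_tolerance` and `guard` are hypothetical names, samples are assumed to be integers (for example p99 latency in ms) arriving one per line, and the rollback command is whatever removes your fault.

```shell
# within_tolerance BASELINE CURRENT MAX_INCREASE_PERCENT
# Succeeds if CURRENT is at most MAX_INCREASE_PERCENT above BASELINE.
within_tolerance() {
  [ $(( $2 * 100 )) -le $(( $1 * (100 + $3) )) ]
}

# guard BASELINE MAX_INCREASE_PERCENT ROLLBACK_CMD
# Reads one sample per line on stdin; runs ROLLBACK_CMD and stops at the
# first sample outside tolerance.
guard() (
  baseline=$1 pct=$2 rollback=$3
  while read -r sample; do
    if ! within_tolerance "$baseline" "$sample" "$pct"; then
      echo "abort at sample $sample"
      $rollback
      exit 1
    fi
  done
  echo "steady"
)

# Baseline p99 of 200 ms, 20% tolerance: the third sample trips the switch
printf '210\n220\n260\n230\n' | guard 200 20 true
```

In a real experiment the samples would come from your monitoring system and the rollback command would be something like `tc qdisc del dev eth0 root`.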

Vary the Fault Type

Latency faults expose different bugs than crash faults. Test both. A service that handles crashes gracefully may still fail badly on 5s latency injections.

Document Findings

Record what broke, what held, and what you changed. Fault injection that doesn't change the system is wasted time — file a ticket for every weakness found.

Tool Selection Guide

Fault Type                      Best Tool                  Alternative
CPU saturation                  stress-ng --cpu            stress
Memory pressure / OOM           stress-ng --vm --vm-keep   systemd-run -p MemoryMax
Disk I/O saturation             fio                        stress-ng --hdd
Disk I/O errors                 dm-flakey                  fail_make_request debugfs
Network latency / loss          tc netem                   toxiproxy
Per-service network fault       toxiproxy                  tc netem + filter
Process crash                   kill -9                    kill -ABRT (leaves a core dump)
Resource limits                 systemd-run -p             raw cgroupsv2
Kernel panic / crash dump       sysrq c                    kdump config
Libc / syscall-level failures   libfiu                     fail debugfs
Reference Signal Table

Common Linux signals used in fault injection and testing.

Signal     Number  Default Action    Catchable?  Use in Testing
SIGTERM    15      Terminate         Yes         Graceful shutdown test; process should clean up
SIGKILL    9       Kill immediately  No          Hard crash simulation; no cleanup possible
SIGSTOP    19      Pause             No          CPU starvation / freeze simulation
SIGCONT    18      Continue          Yes         Resume after SIGSTOP
SIGHUP     1       Terminate         Yes         Config reload test (many daemons reload on HUP)
SIGABRT    6       Core dump         Yes         Trigger core dump for post-mortem analysis
SIGSEGV    11      Core dump         Yes         Simulate memory fault (can send to self via raise())
SIGPIPE    13      Terminate         Yes         Test broken pipe handling in network code
SIGUSR1/2  10/12   Terminate         Yes         Application-defined: trigger debug output, rotate logs
SIGALRM    14      Terminate         Yes         Timeout simulation; process-level alarm