Linux Performance Tools Guide

CPU top

Real-time view of running processes, CPU, memory, and load average. Pre-installed on virtually every Linux system.

Common Use Cases

Who is using CPU? High load average? Memory pressure? Zombie processes?

Common Options

Shell

top                          # interactive mode
top -b -n 1                   # batch mode, single snapshot (good for scripting)
top -p 1234                    # watch a specific PID
top -u myuser                  # filter by user
top -d 0.5                     # refresh every 0.5s

Interactive Keys

Key	Action
`1`	Toggle per-CPU breakdown
`M`	Sort by memory
`P`	Sort by CPU (default)
`T`	Sort by time
`k`	Kill a process by PID
`r`	Renice a process
`f`	Field manager (add/remove columns)
`W`	Save config to ~/.toprc
`q`	Quit

Reading the Header

Output

top - 14:23:01 up 10 days,  3:12,  2 users,  load average: 0.45, 0.61, 0.72
Tasks: 312 total,   3 running,  308 sleeping,   1 zombie
%Cpu(s):  8.3 us,  2.1 sy,  0.0 ni, 88.4 id,  1.2 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15987.2 total,   1023.4 free,  12845.1 used,   2118.7 buff/cache
MiB Swap:   2048.0 total,   512.0 free,   1536.0 used.   1456.3 avail Mem

CPU Column Meanings

Field	Meaning
`us`	User space CPU time
`sy`	Kernel (system) CPU time
`ni`	Niced (low priority) processes
`id`	Idle — lower = busier CPU
`wa`	I/O wait — CPU waiting for disk
`hi`	Hardware interrupts
`si`	Software interrupts
`st`	Stolen time (VM host taking CPU)

Red Flags

High wa (I/O wait > 10–20%): Processes are blocked on disk. Check iostat for disk saturation.

Load average > number of CPU cores: System is overloaded. Check what's in R or D state.

Swap in use + low avail Mem: Memory pressure. System may be swapping heavily, so expect slowdowns.

Zombie processes (Z state): Parent not reaping children. Usually harmless in small numbers but indicates a bug.
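
The I/O wait threshold above can be checked from a script. This is a minimal sketch, assuming the procps-ng summary line layout shown in the header example and a locale that uses a dot as the decimal separator. `top_iowait` is a hypothetical helper name, not something top provides.

```shell
# Hypothetical helper: print the %wa value from `top -b` output on stdin.
# Assumes a "%Cpu(s):" summary line like the one in the header example.
# If several iterations are piped in, the last one wins.
top_iowait() {
  awk '/^%Cpu/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^wa,?$/) wa = $(i - 1)
  }
  END { if (wa != "") print wa }'
}

# Live usage (two iterations, keep the second):
#   top -b -n 2 -d 1 | top_iowait
```

Taking the last iteration is a deliberate choice: the first batch iteration may not reflect the current interval, so a second sample is usually more representative.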

Gotchas

%CPU is per-core: A process at 200% on a 4-core machine is using 2 full cores, which does not mean the machine is overloaded.

top itself uses CPU: At very fast refresh rates it can skew results. Use -d 1 or higher.

CPU htop

Enhanced interactive process viewer. Color-coded CPU/memory bars, mouse support, tree view, and easier process management than top.

Common Options

Shell

htop                         # interactive
htop -p 1234,5678            # watch specific PIDs
htop -u myuser               # filter by user
htop -d 5                    # 0.5s refresh
htop -t                      # tree view by default

Key Shortcuts

Key	Action
`F2`	Setup / configuration
`F3` / `/`	Search processes
`F4`	Filter processes
`F5`	Tree view
`F6`	Sort by column
`F9`	Kill (signal menu)
`Space`	Tag process
`u`	Filter by user
`H`	Toggle user threads
`K`	Toggle kernel threads

Gotchas

Not installed by default: Install with apt install htop or yum install htop.

CPU % shown differently: htop lists userland threads as separate rows by default, so CPU % appears per thread. Press H to hide them.

CPU mpstat

Reports per-CPU statistics. Essential for spotting uneven load distribution across cores.

Common Options

Shell

mpstat                       # all CPUs summary since boot
mpstat -P ALL 1 5            # per-CPU stats, 1s interval, 5 times
mpstat -P 0,1,2 1           # specific CPUs only
mpstat -I ALL 1              # include interrupt stats

Sample Output

Output

CPU    %usr   %sys   %iowait  %irq   %soft  %idle
all    23.4    4.1     18.2    0.1    0.3    53.9
  0    45.2    8.3     32.1    0.2    0.5    13.7   # CPU 0 hot
  1     1.2    0.2      0.1    0.0    0.1    98.4   # CPU 1 idle

Red Flags

One CPU at 100%, others idle: Single-threaded bottleneck. Application is not parallelizing.

High %iowait on all CPUs: Disk I/O is the bottleneck. Use iostat to drill down.

Part of sysstat package: Install with apt install sysstat.

CPU / Mem / IO vmstat

Reports virtual memory, CPU activity, I/O, and process states in a compact format. Great for a quick overall system snapshot.

Common Options

Shell

vmstat 1                     # update every 1 second
vmstat 1 10                  # 1s interval, 10 samples
vmstat -s                    # memory stats summary
vmstat -d                    # disk stats
vmstat -t 1                  # include timestamp

Output Columns

Group	Column	Meaning
procs	`r`	Processes waiting to run (run queue)
procs	`b`	Processes in uninterruptible sleep (I/O wait)
memory	`swpd`	Virtual memory used (swap)
memory	`free`	Idle memory
memory	`buff`	Buffer memory (I/O buffers)
memory	`cache`	Page cache memory
swap	`si`	Swap in (from disk to mem) KB/s
swap	`so`	Swap out (mem to disk) KB/s
io	`bi`	Blocks read from disk
io	`bo`	Blocks written to disk
cpu	`us`	User CPU %
cpu	`sy`	System CPU %
cpu	`id`	Idle CPU %
cpu	`wa`	I/O wait %
cpu	`st`	Stolen (VM)

Red Flags

r > number of CPUs: Run queue saturated. CPU is the bottleneck.

b > 0 consistently: Processes blocked on I/O. Disk or network issue.

si/so > 0: Active swapping, which means serious memory pressure.

Gotchas

First line is averages since boot: Ignore the first output row and start reading from the second.
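
The red flags and the first-row gotcha can be combined into one filter. This is a sketch under the assumption that vmstat prints its default procps column order (r b swpd free buff cache si so ...). `vmstat_flags` is a hypothetical name, and a single flagged sample is only a hint, since the rules above ask for sustained values.

```shell
# Hypothetical helper: flag suspicious samples in `vmstat 1 N` output on stdin.
# $1 = number of CPUs. Assumes default column order: r=$1 b=$2 si=$7 so=$8.
vmstat_flags() {
  awk -v ncpu="${1:-1}" '
    $1 !~ /^[0-9]+$/ { next }     # skip the two header lines
    ++row == 1       { next }     # skip the first data row (averages since boot)
    $1 > ncpu        { print "sample " row ": run queue " $1 " exceeds " ncpu " CPUs" }
    $2 > 0           { print "sample " row ": " $2 " blocked on I/O" }
    $7 + $8 > 0      { print "sample " row ": swapping, si=" $7 " so=" $8 }
  '
}

# Live usage: vmstat 1 10 | vmstat_flags "$(nproc)"
```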

CPU / IO pidstat

Per-process statistics for CPU, memory, I/O, and context switches. Like top but in time-series form, great for logging.

Common Options

Shell

pidstat 1                    # CPU stats for all active processes
pidstat -u 1                 # CPU usage per process
pidstat -d 1                 # disk I/O per process
pidstat -r 1                 # memory stats per process
pidstat -w 1                 # context switches per process
pidstat -p 1234 1            # watch specific PID
pidstat -t 1                 # include threads

Red Flags

High cswch/s (voluntary context switches): Process is blocking often, waiting on I/O or locks.

High nvcswch/s (involuntary context switches): Process is being preempted because too many threads are competing for CPU.

CPU / Tracing perf

Linux profiling with hardware performance counters. Use for CPU flame graphs, cache misses, branch mispredictions, and kernel tracing.

Common Use Cases

Shell

# CPU profiling - sample call stack at 99Hz for 10s
perf record -F 99 -ag -- sleep 10
perf report

# Count hardware events for a specific command (detailed counters)
perf stat -d ./my-program

# Count events system-wide
perf stat -a sleep 5

# Flame graph (with Brendan Gregg's scripts)
perf record -F 99 -g -p 1234 -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Top functions by CPU
perf top

# Trace syscalls
perf trace -p 1234

# Count cache misses
perf stat -e cache-misses,cache-references ./my-program

Common perf stat Metrics

Metric	What it means
`instructions`	Total instructions executed
`cycles`	CPU cycles consumed
`IPC`	Instructions per cycle — higher is better
`cache-misses`	Cache misses, usually counted at the last-level cache. High = memory bound
`branch-misses`	Branch mispredictions — high = CPU pipeline stalls
`page-faults`	Memory page faults

Gotchas

Requires kernel symbols: Install the perf package that matches the running kernel: linux-tools-$(uname -r) on Ubuntu, linux-perf on Debian.

Frame pointer needed for stacks: Compile with -fno-omit-frame-pointer or use DWARF unwinding (--call-graph dwarf).

Overhead: High frequency sampling (-F 999+) can affect performance. 99Hz is a safe default.

Memory free

Quick snapshot of total, used, free, and available memory including swap.

Common Options

Shell

free -h                      # human-readable (KB/MB/GB)
free -m                      # megabytes
free -s 1                    # update every 1 second
free -t                      # include total row

Sample Output

Output

              total        used        free      shared  buff/cache   available
Mem:          15987        12845        1023         312        2118        2847
Swap:          2048        1536         512

Key: available vs free

Use "available", not "free": "free" excludes buff/cache. "available" is what the kernel can actually give to new processes, because it includes reclaimable cache. This is the number that matters.

Red Flags

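available memory near 0: System is under memory pressure. OOM killer may trigger.

Swap used > 0: Memory has spilled to disk at some point. A small, static amount is usually harmless. Performance degrades when pages are actively swapped in and out (check vmstat si/so).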
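
To alert on "available" rather than "free", the value can be read straight from /proc/meminfo. This is a minimal sketch, assuming a kernel that exposes MemAvailable (Linux 3.14 or later). `mem_available_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: percentage of RAM the kernel reports as available.
# Reads /proc/meminfo-formatted text on stdin (values are in kB).
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%.1f\n", 100 * avail / total }
  '
}

# Live usage: mem_available_pct < /proc/meminfo
```

Reading from stdin keeps the helper testable against saved meminfo snapshots, for example ones collected during an incident.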

Memory vmstat -s

Detailed memory statistics including page faults, swap activity, and kernel memory.

Shell

vmstat -s                    # memory event counters since boot
cat /proc/meminfo             # raw kernel memory breakdown
cat /proc/slabinfo            # kernel slab allocator stats

Memory smem

Reports physical memory usage accounting for shared memory correctly. Shows PSS (proportional set size) — more accurate than RSS.

Common Options

Shell

smem -s rss -r              # sort by RSS descending
smem -s pss -r              # sort by PSS descending
smem -u                     # per-user summary
smem -t                     # show totals
smem -P nginx               # filter by process name

Memory Metrics

Metric	Meaning
`VSZ`	Virtual memory — includes everything mapped (not all physical)
`RSS`	Resident Set Size — physical memory used (double-counts shared)
`PSS`	Proportional Set Size — shared memory split proportionally. Most accurate.
`USS`	Unique Set Size — memory used exclusively by this process

Use PSS for realistic memory accounting: RSS overstates usage when processes share libraries. PSS divides shared pages proportionally.

Memory pmap

Shows the memory map of a process — all mapped regions, sizes, and permissions.

Shell

pmap 1234                    # basic memory map
pmap -x 1234                # extended (RSS, dirty pages)
pmap -d 1234                # show device format
pmap -x 1234 | tail -1     # just the totals line

Use pmap to find memory leaks — watch the total grow over time for the same PID.
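
Watching the total grow can be scripted. This sketch assumes the `pmap -x` totals line has the form `total kB <Kbytes> <RSS> <Dirty>`. Both function names are hypothetical, and steady growth is only a hint of a leak, not proof.

```shell
# Hypothetical helper: extract the RSS total (kB) from `pmap -x PID` output.
pmap_rss_kb() {
  awk '$1 == "total" { print $4 }'
}

# Hypothetical helper: given one RSS sample per line, report the change
# between the first and the last sample.
rss_growth() {
  awk '
    NR == 1 { first = $1 }
            { last = $1 }
    END     { if (NR) print "first=" first " last=" last " delta=" (last - first) }
  '
}

# Live usage: sample PID 1234 once a minute for ten minutes
#   for i in 1 2 3 4 5 6 7 8 9 10; do pmap -x 1234 | pmap_rss_kb; sleep 60; done | rss_growth
```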

Disk / IO iostat

Reports CPU and I/O statistics for devices. The go-to tool for diagnosing disk bottlenecks.

Common Options

Shell

iostat -x 1                 # extended stats, 1s interval
iostat -xz 1                # extended, skip idle devices
iostat -x 1 10              # 10 samples
iostat -p sda 1             # specific device
iostat -t -x 1              # with timestamp

Key Columns (iostat -x)

Column	Meaning
`r/s`	Reads per second
`w/s`	Writes per second
`rkB/s`	KB read per second
`wkB/s`	KB written per second
`await`	Average I/O wait time (ms) — includes queue time
`r_await`	Read wait time (ms)
`w_await`	Write wait time (ms)
`svctm`	Service time (deprecated, ignore)
`%util`	Device busy % — how saturated the disk is
`aqu-sz`	Average queue depth

Red Flags

%util near 100%: Disk is saturated. I/O requests are queuing up. Consider faster disk or I/O optimization.

await > 20–50ms for HDD, >1ms for SSD: High latency. Requests waiting in queue longer than expected.

aqu-sz consistently > 1: Disk queue is building up, a sign of saturation.

Gotchas

First line is since boot, so ignore it: Same as vmstat, always start reading from the second output block.

%util of 100% doesn't mean fully saturated for SSDs: SSDs can handle parallel I/O. Check aqu-sz and await too.
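
A filter for busy devices has to cope with two things noted above: the first report covers the time since boot, and column positions differ between sysstat versions. The sketch below finds `%util` by header name and skips the first report. `iostat_hot` is a hypothetical name and the default threshold of 90 is an arbitrary assumption.

```shell
# Hypothetical helper: list devices whose %util exceeds a threshold.
# Reads `iostat -x 1 N` output on stdin; $1 = threshold (default 90).
iostat_hot() {
  awk -v max="${1:-90}" '
    $1 ~ /^Device/ {
      block++; indev = 1
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    NF == 0 || $1 == "avg-cpu:" { indev = 0; next }   # left the device table
    indev && block > 1 && col && $col + 0 > max { print $1, $col }
  '
}

# Live usage: iostat -x 1 5 | iostat_hot 90
```

As the SSD gotcha says, a hit here is a prompt to look at aqu-sz and await for that device, not a verdict on its own.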

Disk / IO iotop

Shows real-time disk I/O per process. Like top but for disk usage.

Shell

iotop                        # interactive, requires root
iotop -o                    # only show processes doing I/O
iotop -b -n 5               # batch mode, 5 iterations
iotop -p 1234               # watch specific PID
iotop -a                    # accumulated I/O totals

Requires root / CAP_NET_ADMIN: Run with sudo. Also needs a kernel compiled with CONFIG_TASK_IO_ACCOUNTING.

Disk df

Reports disk space usage per filesystem. First tool to check when a disk-full error occurs.

Shell

df -h                       # human readable
df -hT                      # include filesystem type
df -i                       # inode usage (not space!)
df -h /var                  # specific mount point

Red Flags

Use% at 100%: Disk full. Writes will fail. Clean up or expand volume.

IUse% at 100% (df -i): Inode exhaustion. The disk has space but can't create new files. Common with many small files (log spam, mail queues).
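
Both checks can share one filter, because `df -P` and `df -Pi` put the percentage in the fifth column and the mount point in the sixth. This is a sketch under that assumption. `df_over` is a hypothetical name, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: print mount points at or above a usage threshold.
# Reads `df -P` (space) or `df -Pi` (inodes) output on stdin; $1 = percent.
df_over() {
  awk -v max="${1:-90}" '
    NR > 1 {
      pct = $5
      sub(/%$/, "", pct)              # strip the trailing percent sign
      if (pct + 0 >= max) print $6, pct "%"
    }
  '
}

# Live usage:
#   df -P  | df_over 90     # space
#   df -Pi | df_over 90     # inodes
```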

Disk du

Estimates file and directory space usage. Use to find what's consuming disk space.

Shell

du -sh *                    # size of each item in current dir
du -sh /var/log/*           # size breakdown in /var/log
du -h --max-depth=1 /       # top-level directory sizes

# Find top 10 largest directories
du -xh / | sort -rh | head -10

# Largest files anywhere on system
find / -xdev -type f -printf "%s %p\n" | sort -rn | head -20

Use -x to stay on one filesystem: Without -x, du crosses mount boundaries and can scan NFS or tmpfs unintentionally.

Files / Network / Process lsof

Lists open files — and in Linux, everything is a file: sockets, pipes, devices. Invaluable for debugging file descriptor leaks and network connections.

Common Use Cases

Shell

# What files does a process have open?
lsof -p 1234

# Who has a file open?
lsof /var/log/app.log

# What process is using a port?
lsof -i :8080
lsof -i TCP:443

# All network connections
lsof -i

# Files open by a user
lsof -u myuser

# Deleted files still held open (space not reclaimed!)
lsof +L1

# Count open file descriptors per process
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head

Red Flags

lsof +L1 shows large deleted files: A file was deleted but a process still holds it open. Disk space is not reclaimed until the process closes it or restarts. Classic log rotation issue.

Process has thousands of open FDs: File descriptor leak. Will hit ulimit and cause "too many open files" errors.

lsof is slow on busy systems: It reads /proc for every process. Use -n (no DNS) and -P (no port names) to speed it up.

Network ss

Socket statistics — faster replacement for netstat. Shows TCP/UDP connections, listen ports, and socket buffers.

Common Options

Shell

ss -tulnp                   # listening TCP+UDP sockets with PID
ss -tan                     # all TCP connections (numeric)
ss -s                       # socket summary stats
ss -tp                      # TCP with process info
ss -o state established     # only ESTABLISHED connections
ss -o state time-wait       # only TIME_WAIT
ss dst 10.0.0.1             # connections to a specific IP
ss sport = :8080            # connections from source port 8080

# Count connections by state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c    # NR > 1 skips the header row

Flags

Flag	Meaning
`-t`	TCP sockets
`-u`	UDP sockets
`-l`	Listening sockets only
`-n`	Numeric (no DNS/service name resolution)
`-p`	Show process (PID/name)
`-e`	Extended info (timers, uid)
`-m`	Socket memory info
`-i`	Internal TCP info (RTT, cwnd, retransmits)

Red Flags

Large number of TIME_WAIT sockets: Short-lived connections not being reused. Enable tcp_tw_reuse or use connection pooling.

Large number of CLOSE_WAIT: Remote end closed the connection but the local app hasn't. Likely a bug in the application not closing sockets.

Recv-Q > 0 on listening socket: Connections completed by the kernel but not yet accepted by the application. The app is too slow to accept.
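
The listening-socket check can be automated. For sockets in LISTEN state, ss reports the current accept queue length in Recv-Q and the configured backlog in Send-Q. The sketch below assumes the default `ss -ltn` column order (State, Recv-Q, Send-Q, Local Address:Port). `listen_backlog` is a hypothetical name.

```shell
# Hypothetical helper: show listening sockets with a non-empty accept queue.
# Reads `ss -ltn` output on stdin.
listen_backlog() {
  awk 'NR > 1 && $1 == "LISTEN" && $2 > 0 {
    print $4, "recv-q=" $2, "backlog=" $3
  }'
}

# Live usage: ss -ltn | listen_backlog
```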

Network netstat

Classic network statistics tool. Largely superseded by ss but still widely available.

Shell

netstat -tulnp              # listening sockets with PID
netstat -an                 # all connections numeric
netstat -s                  # protocol statistics (retransmits etc)
netstat -rn                 # routing table
netstat -i                  # interface stats

Prefer ss over netstat: netstat reads /proc/net/tcp which is slow on systems with many connections. ss uses netlink and is much faster.

Network iftop

Real-time bandwidth usage per connection. Like top but for network traffic.

Shell

iftop                        # interactive, auto-selects interface
iftop -i eth0               # specific interface
iftop -n                    # no DNS resolution
iftop -P                    # show ports
iftop -B                    # show bytes (not bits)
iftop -f "host 10.0.0.1"   # filter by host

Requires libpcap and root: Run with sudo.

Network nethogs

Shows network bandwidth usage per process — the missing tool between iftop (per connection) and top (per process).

Shell

nethogs                      # all interfaces
nethogs eth0                 # specific interface
nethogs -d 2               # 2s refresh interval
nethogs -t                  # trace mode (non-interactive)

Network tcpdump

Captures and analyzes raw network packets. The definitive tool for deep network debugging.

Common Use Cases

Shell

# Capture all traffic on eth0
tcpdump -i eth0

# Capture specific port
tcpdump -i eth0 port 8080

# Capture to file for Wireshark analysis
tcpdump -i eth0 -w capture.pcap

# Read pcap file
tcpdump -r capture.pcap

# Filter by host and port
tcpdump -i eth0 host 10.0.0.1 and port 443

# Show HTTP traffic on port 80 (only packets that carry a payload)
tcpdump -i eth0 -A 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

# DNS queries
tcpdump -i any port 53

# Count packets by source IP (stops after 1000 packets so sort can finish)
tcpdump -i eth0 -nn -q -c 1000 | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn

Common Flags

Flag	Meaning
`-i`	Interface (`any` for all)
`-n`	No DNS resolution
`-nn`	No DNS and no port name resolution
`-v / -vv`	More verbose output
`-A`	Print packet payload as ASCII
`-X`	Print payload as hex and ASCII
`-s 0`	Snapshot length; 0 selects the default of 262144 bytes, which captures full packets
`-c N`	Capture N packets then stop
`-w file`	Write to pcap file

Gotchas

High volume traffic can drop packets: Use -s 0 carefully and write to file instead of displaying live.

Requires root or pcap group: Run with sudo or add user to wireshark group.

Network ping / mtr

ping tests basic reachability and latency. mtr (Matt's Traceroute) combines traceroute and ping for live hop-by-hop latency and packet loss.

Shell

ping -c 10 8.8.8.8          # 10 pings to Google DNS
ping -i 0.2 8.8.8.8          # fast ping (0.2s interval)

mtr 8.8.8.8                  # interactive traceroute
mtr --report -c 50 8.8.8.8  # 50 packets, report mode (good for sharing)
mtr -n 8.8.8.8               # no DNS resolution

Red Flags

Packet loss at an intermediate hop in mtr: If loss appears at hop N but not hop N+1, it's likely ICMP rate limiting at that router, not real loss.

Packet loss at the final hop: Real connectivity problem to the destination.

RTT spikes at a specific hop: Congestion or routing issue at that network segment.

Process ps

Snapshot of current processes. Essential for finding PIDs, checking process state, and understanding process relationships.

Common Options

Shell

ps aux                       # all processes, all users (BSD style)
ps -ef                      # full format (UNIX style)
ps -ef --forest             # tree view showing parent/child
ps aux --sort=-%cpu        # sort by CPU descending
ps aux --sort=-%mem        # sort by memory descending
ps -p 1234 -o pid,cmd,rss  # custom output columns
ps -u myuser               # by user

# Find a process by name
ps aux | grep nginx
pgrep -la nginx             # cleaner alternative

Process States

State	Meaning
`R`	Running or runnable (on CPU or in run queue)
`S`	Sleeping — waiting for event (interruptible)
`D`	Uninterruptible sleep — usually waiting on I/O. Cannot be killed.
`T`	Stopped (SIGSTOP or traced by debugger)
`Z`	Zombie — exited but parent hasn't called wait()
`I`	Idle kernel thread

Red Flags

Many D state processes: Processes stuck in uninterruptible I/O wait. Usually points to disk or NFS hang. Cannot be killed with SIGKILL.

Accumulating Z (zombie) processes: Parent process has a bug and is not reaping children. The zombies themselves are harmless but indicate a leak.
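
Since zombies point at the parent, it helps to group them by parent PID. This is a minimal sketch that reads `ps -eo ppid=,stat=` output. `zombie_parents` is a hypothetical name.

```shell
# Hypothetical helper: count zombie children per parent PID.
# Reads "PPID STAT" pairs on stdin, prints "PPID COUNT", worst parent first.
zombie_parents() {
  awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print p, n[p] }' |
    sort -k2,2nr -k1,1n
}

# Live usage: ps -eo ppid=,stat= | zombie_parents
```

The parent PID at the top of the list is the process to inspect or restart, since only the parent (or init, after the parent exits) can reap the zombies.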

Process / Tracing strace

Traces system calls made by a process. Invaluable for debugging "what is this process actually doing?" without source code.

Common Options

Shell

# Attach to running process
strace -p 1234

# Trace a new command
strace ls /tmp

# Follow child processes too
strace -f -p 1234

# Filter to specific syscalls
strace -e trace=openat,read,write -p 1234   # modern libc opens files via openat
strace -e trace=network -p 1234
strace -e trace=file -p 1234

# Show timing info
strace -T -p 1234             # time spent in each syscall
strace -t -p 1234             # wall clock timestamps

# Summary: count syscalls and time
strace -c ./my-program

# Write output to file
strace -o /tmp/strace.log -p 1234

Common Use Cases

Problem	strace filter
What files is it opening?	`-e trace=openat,open`
What network calls?	`-e trace=network`
Why is it slow?	`-T` (time in each syscall) or `-c` (summary table)
What is it writing?	`-e trace=write -s 1024`
Signal handling?	`-e signal=all`

Gotchas

strace adds significant overhead: Can slow a process by 10–100x. Never use on a busy production process without understanding the impact.

Use -f to trace forked children: Without -f, you only see the parent. Most multi-process apps need -f.

Process ltrace

Like strace but traces library calls instead of syscalls. Useful for seeing malloc, fopen, and other libc calls.

Shell

ltrace ./my-program          # trace library calls
ltrace -p 1234              # attach to running process
ltrace -c ./my-program       # summary count
ltrace -e malloc+free ./app  # only malloc/free calls

Even higher overhead than strace: Use only for targeted debugging, never on production.

Process lsof (process focus)

Shell

# How many FDs does a process have?
ls /proc/1234/fd | wc -l

# What is the FD limit?
cat /proc/1234/limits | grep "open files"

# Check process memory maps
cat /proc/1234/maps

# Process environment variables
cat /proc/1234/environ | tr '\0' '\n'

# Actual binary path (useful for containers)
ls -la /proc/1234/exe

Tracing / eBPF bpftrace

High-level eBPF tracing language. Trace kernel and userspace events with minimal overhead. The modern replacement for many strace/ltrace use cases.

One-liners

Shell

# Files opened by process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args->filename)); }'

# Syscall count per process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Disk I/O latency histogram
bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
  @usecs = hist((nsecs - @start[arg0]) / 1000); delete(@start[arg0]); }'

# TCP connect latency
bpftrace -e 'kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
  @ms = hist((nsecs - @start[tid]) / 1000000); delete(@start[tid]); }'

# List available tracepoints
bpftrace -l 'tracepoint:syscalls:*'

Gotchas

Requires Linux 4.9+ and root: Best on 5.x+ kernels for full feature support.

Very low overhead vs strace: eBPF runs in kernel safely. Production-safe for most one-liners.

Tracing / eBPF BCC tools

Collection of ready-made eBPF tools from the BCC toolkit. Each tool is production-safe and solves a specific performance question.

Essential BCC Tools

Tool	What it answers
`execsnoop`	What processes are being exec'd right now?
`opensnoop`	What files are being opened?
`biolatency`	Block I/O latency histogram
`biosnoop`	Per-I/O latency with process name
`tcpconnect`	What TCP connections are being made?
`tcpaccept`	What TCP connections are being accepted?
`tcpretrans`	TCP retransmits with details
`runqlat`	CPU run queue latency histogram
`offcputime`	Time spent off CPU (blocked, sleeping)
`profile`	CPU flame graph profiler
`memleak`	Outstanding allocations (memory leaks)
`funccount`	Count function calls
`trace`	Trace arbitrary kernel/user functions
`cachestat`	Page cache hit/miss ratio
`cachetop`	Page cache top by process
`fileslower`	Slow file reads/writes
`ext4slower`	Slow ext4 operations

Shell

runqlat                      # run queue latency — are tasks waiting for CPU?
biolatency -D              # disk I/O latency per device
tcpretrans                   # watch TCP retransmits live
cachestat 1                  # page cache stats every second
memleak -p 1234             # watch for memory leaks in PID
profile -F 99 -af 30        # CPU flame graph for 30s

Tracing ftrace

Built-in kernel tracing framework. Accessed via /sys/kernel/debug/tracing. No external tools needed — available on any Linux system.

Shell

cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers

# Function call tracing
echo function > current_tracer
echo 1 > tracing_on
cat trace
echo 0 > tracing_on

# Trace a specific function
echo my_function > set_ftrace_filter
echo function > current_tracer

# Easier with trace-cmd (wrapper)
trace-cmd record -e sched_switch -p function sleep 5
trace-cmd report

All-in-one sar

System Activity Reporter. Collects, records, and reports historical system performance data. Part of sysstat. The only standard tool for looking back in time.

Common Options

Shell

# CPU usage every 1s, 5 samples
sar -u 1 5

# Memory stats
sar -r 1 5

# Disk I/O
sar -d 1 5

# Network stats
sar -n DEV 1 5

# TCP stats
sar -n TCP 1 5

# Load average and run queue
sar -q 1 5

# Historical data from today's log
sar -u -f /var/log/sysstat/sa$(date +%d)

# Historical data from specific time range
sar -u -s 09:00:00 -e 10:00:00

Gotchas

Must enable collection first: On Debian/Ubuntu, edit /etc/default/sysstat and set ENABLED="true", then run systemctl enable --now sysstat.

Best tool for post-incident review: When something happened at 3am, sar is often the only tool with historical data.

All-in-one dstat

Combines vmstat, iostat, ifstat, and netstat into one colorized output. Great for a live overview of all subsystems at once.

Shell

dstat                        # default: cpu, disk, net, paging, system
dstat -cdngy               # cpu, disk, net, paging, system
dstat --top-cpu             # show top CPU process
dstat --top-io              # show top I/O process
dstat --top-mem             # show top memory process
dstat -t 1 60               # with timestamp, 1s interval, 60 samples
dstat --output /tmp/dstat.csv 1  # export to CSV for analysis

dstat is deprecated: Use dool (fork) on newer systems. Some distros still ship it as dstat.

All-in-one glances

Cross-platform monitoring tool with a rich curses UI. Shows CPU, memory, disk, network, processes, and alerts in one screen.

Shell

glances                      # interactive TUI
glances -w                  # web server mode (port 61208)
glances -s                  # server mode (for remote monitoring)
glances -c remote-host      # connect to remote glances server
glances --export csv        # export metrics to CSV

All-in-one atop

Advanced system and process monitor. Records all activity to disk and allows replaying historical sessions. Captures processes that have already exited.

Shell

atop                         # interactive
atop -r /var/log/atop/atop_20240426  # replay saved log
atop -A                     # show all resources
atop -w /tmp/atop.log 1 60  # write log, 1s interval, 60s

Key advantage over top/htop: atop captures short-lived processes that start and die between top's refresh intervals. Critical for debugging transient process spikes.

Reference USE Method

Brendan Gregg's methodology: for every resource, check Utilization, Saturation, and Errors.

Resource	Utilization	Saturation	Errors
CPU	`mpstat %usr+%sys`	`vmstat r > CPUs`	`dmesg | grep error`
Memory	`free avail`	`vmstat si/so > 0`	`dmesg OOM killer`
Disk	`iostat %util`	`iostat aqu-sz > 1`	`smartctl -a /dev/sda`
Network	`sar -n DEV txkB/s`	`netstat -s retransmits`	`ip -s link`
File descriptors	`lsof -p PID | wc -l`	`/proc/PID/limits`	EMFILE errors in logs

Reference Load Average

Shown in top, uptime, and /proc/loadavg. Represents the average number of processes in R (running) or D (uninterruptible sleep) state over 1, 5, and 15 minutes.

Rule of thumb (load average / CPU count): If you have 4 CPUs and load is 4.0, you're at 100% utilization. Load < CPU count = fine. Load > CPU count = saturated.

Scenario	Meaning
Load 1m > 5m > 15m	Load is increasing — problem is getting worse
Load 1m < 5m < 15m	Load is decreasing — problem is recovering
Load high, CPU idle	D state processes — blocked on I/O, not CPU
Load = CPUs	Fully utilized but not saturated
Load >> CPUs	Saturated — tasks are queuing
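
The rule of thumb above is a simple division. This sketch normalizes a load value by the CPU count and applies the same "above 1.0 means saturated" reading. `load_per_cpu` is a hypothetical helper, and the verdict ignores the caveat from the table that high load can come from D state processes rather than CPU demand.

```shell
# Hypothetical helper: print load per CPU and a rough verdict.
# $1 = load average, $2 = number of CPUs.
load_per_cpu() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN {
    if (ncpu <= 0) exit 1
    ratio = load / ncpu
    printf "%.2f %s\n", ratio, (ratio > 1 ? "saturated" : "ok")
  }'
}

# Live usage: load_per_cpu "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

For example, `load_per_cpu 6.0 4` prints `1.50 saturated`, while `load_per_cpu 4.0 4` prints `1.00 ok`, matching the "fully utilized but not saturated" row.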

Reference Signals

Signal	Number	Default Action	Use case
`SIGHUP`	1	Terminate	Reload config (many daemons)
`SIGINT`	2	Terminate	Ctrl+C
`SIGQUIT`	3	Core dump	Ctrl+\ — quit with core
`SIGKILL`	9	Terminate	Force kill — cannot be caught or ignored
`SIGTERM`	15	Terminate	Graceful shutdown (default kill signal)
`SIGSTOP`	19	Stop	Pause process — cannot be caught
`SIGCONT`	18	Continue	Resume stopped process
`SIGUSR1`	10	Terminate	App-defined (e.g. log rotation)
`SIGUSR2`	12	Terminate	App-defined
`SIGPIPE`	13	Terminate	Write to broken pipe
`SIGCHLD`	17	Ignore	Child process stopped or exited

Shell

kill -15 1234               # graceful terminate (default)
kill -9 1234                # force kill
kill -HUP 1234              # reload config
killall nginx                # kill all processes named nginx
pkill -f "my-app"           # kill by full command line match
kill -0 1234                # check if process exists (no signal sent)

SIGKILL bypasses graceful shutdown: Always try SIGTERM first and give the process time to clean up. SIGKILL can leave temp files, locks, and incomplete writes.
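
The "SIGTERM first, SIGKILL as a last resort" advice can be wrapped in a small function. This is a sketch, not a hardened tool: `stop_gracefully` is a hypothetical name, the 10 second default grace period is an arbitrary assumption, and PID reuse during the wait is not guarded against.

```shell
# Hypothetical helper: ask a process to exit, escalate only if it does not.
# $1 = PID, $2 = grace period in seconds (default 10).
stop_gracefully() {
  pid=$1
  timeout=${2:-10}
  kill -TERM "$pid" 2>/dev/null || return 0     # already gone
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0      # exited after SIGTERM
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -0 "$pid" 2>/dev/null || return 0
  echo "PID $pid still running after SIGTERM, sending SIGKILL" >&2
  kill -KILL "$pid"
}

# Usage: stop_gracefully 1234 30
```

The `kill -0` probe sends no signal and only checks that the PID still exists, which is why it is safe to poll with.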