Linux Performance Tools
Common use cases, options, red flags, and gotchas
Applications
│
├── CPU        → top, htop, mpstat, vmstat, pidstat, perf
├── Memory     → free, vmstat, smem, pmap
├── Disk/IO    → iostat, iotop, df, du, lsof
├── Network    → ss, netstat, iftop, nethogs, tcpdump, ping, mtr
├── Process    → ps, strace, ltrace, lsof
├── Tracing    → bpftrace, BCC tools, ftrace
└── All-in-one → sar, dstat, glances, atop
CPU top

Real-time view of running processes, CPU, memory, and load average. Pre-installed on virtually every Linux system.

Common Use Cases

Who is using CPU? High load average? Memory pressure? Zombie processes?

Common Options

Shell
top                          # interactive mode
top -b -n 1                   # batch mode, single snapshot (good for scripting)
top -p 1234                    # watch a specific PID
top -u myuser                  # filter by user
top -d 0.5                     # refresh every 0.5s

Interactive Keys

Key  Action
1    Toggle per-CPU breakdown
M    Sort by memory
P    Sort by CPU (default)
T    Sort by time
k    Kill a process by PID
r    Renice a process
f    Field manager (add/remove columns)
W    Save config to ~/.toprc
q    Quit

Reading the Header

Output
top - 14:23:01 up 10 days,  3:12,  2 users,  load average: 0.45, 0.61, 0.72
Tasks: 312 total,   3 running,  308 sleeping,   1 zombie
%Cpu(s):  8.3 us,  2.1 sy,  0.0 ni, 88.4 id,  1.2 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15987.2 total,   1023.4 free,  12845.1 used,   2118.7 buff/cache
MiB Swap:   2048.0 total,   512.0 free,   1536.0 used.   1456.3 avail Mem

CPU Column Meanings

Field  Meaning
us     User space CPU time
sy     Kernel (system) CPU time
ni     Niced (low priority) processes
id     Idle (lower = busier CPU)
wa     I/O wait (CPU waiting for disk)
hi     Hardware interrupts
si     Software interrupts
st     Stolen time (VM host taking CPU)

Red Flags

High wa (I/O wait > 10–20%): Processes are blocked on disk. Check iostat for disk saturation.
Load average > number of CPU cores: More tasks are runnable or blocked than the CPUs can serve. Check what is in R or D state.
Swap in use + low avail Mem: Memory pressure. The system may be swapping heavily; expect slowdowns.
Zombie processes (Z state): Parent is not reaping children. Usually harmless in small numbers, but indicates a bug.

Gotchas

%CPU is per-core: A process at 200% on a 4-core machine is using 2 full cores; the machine is not overloaded.
top itself uses CPU: At very fast refresh rates it can skew results. Use -d 1 or higher.
CPU htop

Enhanced interactive process viewer. Color-coded CPU/memory bars, mouse support, tree view, and easier process management than top.

Common Options

Shell
htop                         # interactive
htop -p 1234,5678            # watch specific PIDs
htop -u myuser               # filter by user
htop -d 5                    # 0.5s refresh
htop -t                      # tree view by default

Key Shortcuts

Key      Action
F2       Setup / configuration
F3 or /  Search processes
F4       Filter processes
F5       Tree view
F6       Sort by column
F9       Kill (signal menu)
Space    Tag process
u        Filter by user
H        Toggle user threads
K        Toggle kernel threads

Gotchas

Not installed by default: Install with apt install htop or yum install htop.
CPU % shown differently: htop shows per-thread CPU % by default. Press H to collapse threads.
CPU mpstat

Reports per-CPU statistics. Essential for spotting uneven load distribution across cores.

Common Options

Shell
mpstat                       # all CPUs summary since boot
mpstat -P ALL 1 5            # per-CPU stats, 1s interval, 5 times
mpstat -P 0,1,2 1           # specific CPUs only
mpstat -I ALL 1              # include interrupt stats

Sample Output

Output
CPU    %usr   %sys   %iowait  %irq   %soft  %idle
all    23.4    4.1     18.2    0.1    0.3    53.9
  0    45.2    8.3     32.1    0.2    0.5    13.7   # CPU 0 hot
  1     1.2    0.2      0.1    0.0    0.1    98.4   # CPU 1 idle

Red Flags

One CPU at 100%, others idle: Single-threaded bottleneck. The application is not parallelizing.
High %iowait on all CPUs: Disk I/O is the bottleneck. Use iostat to drill down.

Gotchas

Part of the sysstat package: Install with apt install sysstat.
CPU / Mem / IO vmstat

Reports virtual memory, CPU activity, I/O, and process states in a compact format. Great for a quick overall system snapshot.

Common Options

Shell
vmstat 1                     # update every 1 second
vmstat 1 10                  # 1s interval, 10 samples
vmstat -s                    # memory stats summary
vmstat -d                    # disk stats
vmstat -t 1                  # include timestamp

Output Columns

Group   Column  Meaning
procs   r       Processes waiting to run (run queue)
procs   b       Processes in uninterruptible sleep (I/O wait)
memory  swpd    Virtual memory used (swap)
memory  free    Idle memory
memory  buff    Buffer memory (I/O buffers)
memory  cache   Page cache memory
swap    si      Swap in (from disk to memory), KB/s
swap    so      Swap out (memory to disk), KB/s
io      bi      Blocks read from disk per second
io      bo      Blocks written to disk per second
cpu     us      User CPU %
cpu     sy      System CPU %
cpu     id      Idle CPU %
cpu     wa      I/O wait %
cpu     st      Stolen (VM) %

Red Flags

r > number of CPUs: Run queue saturated; CPU is the bottleneck.
b > 0 consistently: Processes blocked on I/O. Disk or network issue.
si/so > 0: Active swapping, a sign of serious memory pressure.

Gotchas

First line is averages since boot: Ignore the first output row and always start from the second.
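
The rule above can be scripted when logging vmstat output. A minimal sketch, assuming the usual procps layout of two header lines followed by data rows; the function name vmstat_skip_boot is illustrative, not part of any tool:

```shell
# Drop the first data row (averages since boot) from vmstat output.
# Assumes two header lines precede the data rows (procps vmstat layout).
vmstat_skip_boot() {
    awk 'NR <= 2 || NR > 3'
}

# Usage: 6 samples requested, 5 interval samples kept
# vmstat 1 6 | vmstat_skip_boot
```

With -t or very long runs vmstat may repeat its header; this sketch does not handle that case.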
CPU / IO pidstat

Per-process statistics for CPU, memory, I/O, and context switches. Like top but in time-series form, great for logging.

Common Options

Shell
pidstat 1                    # CPU stats for all active processes
pidstat -u 1                 # CPU usage per process
pidstat -d 1                 # disk I/O per process
pidstat -r 1                 # memory stats per process
pidstat -w 1                 # context switches per process
pidstat -p 1234 1            # watch specific PID
pidstat -t 1                 # include threads

Red Flags

High cswch/s (voluntary context switches): Process is blocking often, waiting on I/O or locks.
High nvcswch/s (involuntary context switches): Process is being preempted; too many threads are competing for CPU.
CPU / Tracing perf

Linux profiling with hardware performance counters. Use for CPU flame graphs, cache misses, branch mispredictions, and kernel tracing.

Common Use Cases

Shell
# CPU profiling - sample call stack at 99Hz for 10s
perf record -F 99 -ag -- sleep 10
perf report

# Profile a specific command
perf stat -d ./my-program

# Count events system-wide
perf stat -a sleep 5

# Flame graph (with Brendan Gregg's scripts)
perf record -F 99 -g -p 1234 -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Top functions by CPU
perf top

# Trace syscalls
perf trace -p 1234

# Count cache misses
perf stat -e cache-misses,cache-references ./my-program

Common perf stat Metrics

Metric         What it means
instructions   Total instructions executed
cycles         CPU cycles consumed
IPC            Instructions per cycle (higher is better)
cache-misses   Cache misses, usually last-level cache (high = memory bound)
branch-misses  Branch mispredictions (high = CPU pipeline stalls)
page-faults    Memory page faults

Gotchas

Requires perf and symbols: Install linux-tools-$(uname -r) (Ubuntu) or linux-perf (Debian). Without symbol information, perf report shows raw addresses.
Frame pointer needed for stacks: Compile with -fno-omit-frame-pointer or use DWARF unwinding (--call-graph dwarf).
Overhead: High frequency sampling (-F 999+) can affect performance. 99Hz is a safe default.
Memory free

Quick snapshot of total, used, free, and available memory including swap.

Common Options

Shell
free -h                      # human-readable (Ki/Mi/Gi)
free -m                      # mebibytes
free -s 1                    # update every 1 second
free -t                      # include total row

Sample Output

Output
              total        used        free      shared  buff/cache   available
Mem:          15987       12845        1023         312        2118        2847
Swap:          2048        1536         512

Key: available vs free

Use "available", not "free": "free" excludes buff/cache. "available" is what the kernel can actually give to new processes, including reclaimable cache. This is the number that matters.
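
To track this number in a script, the kernel exposes it directly. A minimal sketch, assuming a kernel that reports MemAvailable in /proc/meminfo (Linux 3.14 and later); the function name mem_available_pct is illustrative:

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads a meminfo-format file; defaults to /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Usage: mem_available_pct
```

Taking the file as an optional argument keeps the function easy to test against a saved copy of /proc/meminfo.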

Red Flags

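available memory near 0: System is under memory pressure. OOM killer may trigger.
Swap used > 0: Memory has spilled to disk at some point. On its own this is not a problem; performance degrades when pages are actively swapped in and out (check vmstat si/so).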
Memory vmstat -s

Detailed memory statistics including page faults, swap activity, and kernel memory.

Shell
vmstat -s                    # memory event counters since boot
cat /proc/meminfo             # raw kernel memory breakdown
cat /proc/slabinfo            # kernel slab allocator stats
Memory smem

Reports physical memory usage accounting for shared memory correctly. Shows PSS (proportional set size) — more accurate than RSS.

Common Options

Shell
smem -s rss -r              # sort by RSS descending
smem -s pss -r              # sort by PSS descending
smem -u                     # per-user summary
smem -t                     # show totals
smem -P nginx               # filter by process name

Memory Metrics

Metric  Meaning
VSZ     Virtual memory: everything mapped (not all physical)
RSS     Resident Set Size: physical memory used (double-counts shared)
PSS     Proportional Set Size: shared memory split proportionally. Most accurate.
USS     Unique Set Size: memory used exclusively by this process

Use PSS for realistic memory accounting: RSS overstates usage when processes share libraries. PSS divides shared pages proportionally.
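
A worked example may make the difference concrete. The numbers below are hypothetical: one process with 10 MB of private memory that also maps a 30 MB library shared by 3 processes in total.

```shell
# Hypothetical figures, in MB
private_mb=10
shared_mb=30
sharers=3

uss=$private_mb                             # private pages only
rss=$((private_mb + shared_mb))             # shared pages counted in full
pss=$((private_mb + shared_mb / sharers))   # shared pages split evenly

echo "USS=${uss}MB PSS=${pss}MB RSS=${rss}MB"
# prints: USS=10MB PSS=20MB RSS=40MB
```

Summing RSS across the three processes would count the library three times; summing PSS counts it once.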
Memory pmap

Shows the memory map of a process — all mapped regions, sizes, and permissions.

Shell
pmap 1234                    # basic memory map
pmap -x 1234                # extended (RSS, dirty pages)
pmap -d 1234                # show device format
pmap -x 1234 | tail -1     # just the totals line
Use pmap to find memory leaks — watch the total grow over time for the same PID.
Disk / IO iostat

Reports CPU and I/O statistics for devices. The go-to tool for diagnosing disk bottlenecks.

Common Options

Shell
iostat -x 1                 # extended stats, 1s interval
iostat -xz 1                # extended, skip idle devices
iostat -x 1 10              # 10 samples
iostat -p sda 1             # specific device
iostat -t -x 1              # with timestamp

Key Columns (iostat -x)

Column   Meaning
r/s      Reads per second
w/s      Writes per second
rkB/s    KB read per second
wkB/s    KB written per second
await    Average I/O wait time (ms), includes queue time
r_await  Read wait time (ms)
w_await  Write wait time (ms)
svctm    Service time (deprecated, ignore)
%util    Device busy %: how saturated the disk is
aqu-sz   Average queue depth

Red Flags

%util near 100%: Disk is saturated. I/O requests are queuing up. Consider a faster disk or I/O optimization.
await > 20–50ms for HDD, > 1ms for SSD: High latency. Requests are waiting in the queue longer than expected.
aqu-sz consistently > 1: Disk queue is building up, a sign of saturation.

Gotchas

First report is since boot, ignore it: Same as vmstat, always start reading from the second output block.
%util of 100% doesn't mean fully saturated for SSDs: SSDs can handle parallel I/O. Check aqu-sz and await too.
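
When scripting against iostat -x, it is safer to find columns by header name, since the column set and order differ between sysstat versions. A minimal sketch; the function name busy_disks and the default threshold of 90 are illustrative choices:

```shell
# Print devices whose %util exceeds a threshold (default 90).
# Reads `iostat -x` output on stdin and locates %util by its header name.
busy_disks() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}

# Usage: iostat -x 1 2 | busy_disks 80
```

The sketch checks every report it sees, including the first since-boot block, so treat early matches with the caution described above.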
Disk / IO iotop

Shows real-time disk I/O per process. Like top but for disk usage.

Shell
iotop                        # interactive, requires root
iotop -o                    # only show processes doing I/O
iotop -b -n 5               # batch mode, 5 iterations
iotop -p 1234               # watch specific PID
iotop -a                    # accumulated I/O totals
Requires root / CAP_NET_ADMIN: Run with sudo. Also needs a kernel compiled with CONFIG_TASK_IO_ACCOUNTING.
Disk df

Reports disk space usage per filesystem. First tool to check when a disk-full error occurs.

Shell
df -h                       # human readable
df -hT                      # include filesystem type
df -i                       # inode usage (not space!)
df -h /var                  # specific mount point

Red Flags

Use% at 100%: Disk full. Writes will fail. Clean up or expand the volume.
IUse% at 100% (df -i): Inode exhaustion. The disk has space but can't create new files. Common with many small files (log spam, mail queues).
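
Both checks can share one filter, because df -P and df -Pi put the percentage in the fifth column. A minimal sketch, assuming mount points without spaces; the function name fs_over is illustrative:

```shell
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` or `df -Pi` output on stdin; -P keeps each filesystem on one line.
fs_over() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit) print $6, $5
    }'
}

# Usage:
# df -P  | fs_over 90    # space
# df -Pi | fs_over 90    # inodes
```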
Disk du

Estimates file and directory space usage. Use to find what's consuming disk space.

Shell
du -sh *                    # size of each item in current dir
du -sh /var/log/*           # size breakdown in /var/log
du -h --max-depth=1 /       # top-level directory sizes

# Find top 10 largest directories
du -xh / | sort -rh | head -10

# Largest files anywhere on system
find / -xdev -type f -printf "%s %p\n" | sort -rn | head -20
Use -x to stay on one filesystem: Without -x, du crosses mount boundaries and can scan NFS or tmpfs unintentionally.
Files / Network / Process lsof

Lists open files — and in Linux, everything is a file: sockets, pipes, devices. Invaluable for debugging file descriptor leaks and network connections.

Common Use Cases

Shell
# What files does a process have open?
lsof -p 1234

# Who has a file open?
lsof /var/log/app.log

# What process is using a port?
lsof -i :8080
lsof -i TCP:443

# All network connections
lsof -i

# Files open by a user
lsof -u myuser

# Deleted files still held open (space not reclaimed!)
lsof +L1

# Count open file descriptors per process
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head

Red Flags

lsof +L1 shows large deleted files: A file was deleted but a process still holds it open. Disk space is not reclaimed until the process closes it or restarts. Classic log rotation issue.
Process has thousands of open FDs: File descriptor leak. Will hit ulimit and cause "too many open files" errors.

Gotchas

lsof is slow on busy systems: It reads /proc for every process. Use -n (no DNS) and -P (no port names) to speed it up.
Network ss

Socket statistics — faster replacement for netstat. Shows TCP/UDP connections, listen ports, and socket buffers.

Common Options

Shell
ss -tulnp                   # listening TCP+UDP sockets with PID
ss -tan                     # all TCP connections (numeric)
ss -s                       # socket summary stats
ss -tp                      # TCP with process info
ss -o state established     # only ESTABLISHED connections
ss -o state time-wait       # only TIME_WAIT
ss dst 10.0.0.1             # connections to a specific IP
ss sport = :8080            # connections from source port 8080

# Count connections by state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c

Flags

Flag  Meaning
-t    TCP sockets
-u    UDP sockets
-l    Listening sockets only
-n    Numeric (no DNS/service name resolution)
-p    Show process (PID/name)
-e    Extended info (timers, uid)
-m    Socket memory info
-i    Internal TCP info (RTT, cwnd, retransmits)

Red Flags

Large number of TIME_WAIT sockets: Short-lived connections are not being reused. Enable tcp_tw_reuse or use connection pooling.
Large number of CLOSE_WAIT: Remote end closed the connection but the local app hasn't. Likely an application bug: sockets are not being closed.
Recv-Q > 0 on listening socket: Connections have completed the handshake in the kernel but the application has not yet called accept(). The app is too slow to accept.
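
The listening-socket check can be filtered out of ss -ltn. A minimal sketch, assuming the default column order (State, Recv-Q, Send-Q, Local Address:Port); for listening sockets Send-Q reports the configured backlog. The function name listen_backlog is illustrative:

```shell
# Print listening TCP sockets whose accept queue (Recv-Q) is non-empty.
# Reads `ss -ltn` output on stdin.
listen_backlog() {
    awk 'NR > 1 && $1 == "LISTEN" && $2 + 0 > 0 {
        print $4, "queued=" $2, "backlog=" $3
    }'
}

# Usage: ss -ltn | listen_backlog
```

A queued value close to the backlog value suggests new connections are about to be dropped or delayed.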
Network netstat

Classic network statistics tool. Largely superseded by ss but still widely available.

Shell
netstat -tulnp              # listening sockets with PID
netstat -an                 # all connections numeric
netstat -s                  # protocol statistics (retransmits etc)
netstat -rn                 # routing table
netstat -i                  # interface stats
Prefer ss over netstat: netstat reads /proc/net/tcp, which is slow on systems with many connections. ss uses netlink and is much faster.
Network iftop

Real-time bandwidth usage per connection. Like top but for network traffic.

Shell
iftop                        # interactive, auto-selects interface
iftop -i eth0               # specific interface
iftop -n                    # no DNS resolution
iftop -P                    # show ports
iftop -B                    # show bytes (not bits)
iftop -f "host 10.0.0.1"   # filter by host
Requires libpcap and root: Run with sudo.
Network nethogs

Shows network bandwidth usage per process — the missing tool between iftop (per connection) and top (per process).

Shell
nethogs                      # all interfaces
nethogs eth0                 # specific interface
nethogs -d 2               # 2s refresh interval
nethogs -t                  # tracemode (non-interactive)
Network tcpdump

Captures and analyzes raw network packets. The definitive tool for deep network debugging.

Common Use Cases

Shell
# Capture all traffic on eth0
tcpdump -i eth0

# Capture specific port
tcpdump -i eth0 port 8080

# Capture to file for Wireshark analysis
tcpdump -i eth0 -w capture.pcap

# Read pcap file
tcpdump -r capture.pcap

# Filter by host and port
tcpdump -i eth0 host 10.0.0.1 and port 443

# Show HTTP packets that carry a payload (IPv4, port 80; skips bare SYN/FIN/ACK)
tcpdump -i eth0 -A 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

# DNS queries
tcpdump -i any port 53

# Count packets by source IP (stop after 1000 packets so sort can finish)
tcpdump -i eth0 -nn -q -c 1000 | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn

Common Flags

Flag       Meaning
-i         Interface (any for all)
-n         No DNS resolution
-nn        No DNS and no port name resolution
-v / -vv   More verbose output
-A         Print packet payload as ASCII
-X         Print payload as hex and ASCII
-s 0       Capture full packets (0 selects the default snaplen, 262144 bytes)
-c N       Capture N packets then stop
-w file    Write to pcap file

Gotchas

High volume traffic can drop packets: Use -s 0 carefully and write to a file instead of displaying live.
Requires root or pcap group: Run with sudo or add user to wireshark group.
Network ping / mtr

ping tests basic reachability and latency. mtr (Matt's Traceroute) combines traceroute and ping for live hop-by-hop latency and packet loss.

Shell
ping -c 10 8.8.8.8          # 10 pings to Google DNS
ping -i 0.2 8.8.8.8          # fast ping (0.2s interval)

mtr 8.8.8.8                  # interactive traceroute
mtr --report -c 50 8.8.8.8  # 50 packets, report mode (good for sharing)
mtr -n 8.8.8.8               # no DNS resolution

Red Flags

Packet loss at an intermediate hop in mtr: If loss appears at hop N but not hop N+1, it's likely ICMP rate limiting at that router, not real loss.
Packet loss at the final hop: Real connectivity problem to the destination.
RTT spikes at a specific hop: Congestion or routing issue at that network segment.
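
Since only loss at the final hop reliably indicates a real problem, a report can be reduced to that one number. A minimal sketch, assuming the standard mtr --report layout where hop lines start with a field like "3.|--"; the function name final_hop_loss is illustrative:

```shell
# Print the loss percentage of the last hop in `mtr --report` output (stdin).
# Hop lines look like: "  3.|-- 10.0.0.1   0.0%   50   1.2 ..."
final_hop_loss() {
    awk '$1 ~ /^[0-9]+\.\|--$/ { loss = $3 }
         END { sub(/%/, "", loss); print loss }'
}

# Usage: mtr --report -n -c 50 8.8.8.8 | final_hop_loss
```

A result of 0.0 alongside loss at earlier hops matches the rate-limiting pattern described above.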
Process ps

Snapshot of current processes. Essential for finding PIDs, checking process state, and understanding process relationships.

Common Options

Shell
ps aux                       # all processes, all users (BSD style)
ps -ef                      # full format (UNIX style)
ps -ef --forest             # tree view showing parent/child
ps aux --sort=-%cpu        # sort by CPU descending
ps aux --sort=-%mem        # sort by memory descending
ps -p 1234 -o pid,cmd,rss  # custom output columns
ps -u myuser               # by user

# Find a process by name
ps aux | grep nginx
pgrep -la nginx             # cleaner alternative

Process States

State  Meaning
R      Running or runnable (on CPU or in run queue)
S      Sleeping, waiting for an event (interruptible)
D      Uninterruptible sleep, usually waiting on I/O. Cannot be killed.
T      Stopped (SIGSTOP or traced by debugger)
Z      Zombie: exited but parent hasn't called wait()
I      Idle kernel thread

Red Flags

Many D state processes: Processes stuck in uninterruptible I/O wait. Usually points to a disk or NFS hang. They cannot be killed, even with SIGKILL.
Accumulating Z (zombie) processes: Parent process has a bug and is not reaping children. The zombies themselves are harmless but indicate a leak.
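
Because the fix for zombies lies with the parent, it helps to list each zombie next to its PPID. A minimal sketch that filters ps output; the function name zombie_parents is illustrative:

```shell
# List zombie processes together with the parent that should be reaping them.
# Reads `ps -eo pid,ppid,stat,comm` output on stdin.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { print "zombie=" $1, "parent=" $2, $4 }'
}

# Usage: ps -eo pid,ppid,stat,comm | zombie_parents
```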
Process / Tracing strace

Traces system calls made by a process. Invaluable for debugging "what is this process actually doing?" without source code.

Common Options

Shell
# Attach to running process
strace -p 1234

# Trace a new command
strace ls /tmp

# Follow child processes too
strace -f -p 1234

# Filter to specific syscalls
strace -e trace=openat,read,write -p 1234
strace -e trace=network -p 1234
strace -e trace=file -p 1234

# Show timing info
strace -T -p 1234             # time spent in each syscall
strace -t -p 1234             # wall clock timestamps

# Summary: count syscalls and time
strace -c ./my-program

# Write output to file
strace -o /tmp/strace.log -p 1234

Common Use Cases

Problem                    strace filter
What files is it opening?  -e trace=openat,open
What network calls?        -e trace=network
Why is it slow?            -T (time per call) or -c (summary with time)
What is it writing?        -e trace=write -s 1024
Signal handling?           -e signal=all

Gotchas

strace adds significant overhead: Can slow a process by 10–100x. Never use it on a busy production process without understanding the impact.
Use -f to trace forked children: Without -f, you only see the parent. Most multi-process apps need -f.
Process ltrace

Like strace but traces library calls instead of syscalls. Useful for seeing malloc, fopen, and other libc calls.

Shell
ltrace ./my-program          # trace library calls
ltrace -p 1234              # attach to running process
ltrace -c ./my-program       # summary count
ltrace -e malloc+free ./app  # only malloc/free calls
Even higher overhead than strace: Use only for targeted debugging, never on production.
Process lsof (process focus)
Shell
# How many FDs does a process have?
ls /proc/1234/fd | wc -l

# What is the FD limit?
cat /proc/1234/limits | grep "open files"

# Check process memory maps
cat /proc/1234/maps

# Process environment variables
cat /proc/1234/environ | tr '\0' '\n'

# Actual binary path (useful for containers)
ls -la /proc/1234/exe
Tracing / eBPF bpftrace

High-level eBPF tracing language. Trace kernel and userspace events with minimal overhead. The modern replacement for many strace/ltrace use cases.

One-liners

Shell
# Files opened by process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args->filename)); }'

# Syscall count per process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Disk I/O latency histogram
bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
  @usecs = hist((nsecs - @start[arg0]) / 1000); delete(@start[arg0]); }'

# TCP connect latency
bpftrace -e 'kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
  @ms = hist((nsecs - @start[tid]) / 1000000); delete(@start[tid]); }'

# List available tracepoints
bpftrace -l 'tracepoint:syscalls:*'

Gotchas

Requires Linux 4.9+ and root: Best on 5.x+ kernels for full feature support.
Very low overhead vs strace: eBPF programs run safely in the kernel. Production-safe for most one-liners.
Tracing / eBPF BCC tools

Collection of ready-made eBPF tools from the BCC toolkit. Each tool is production-safe and solves a specific performance question.

Essential BCC Tools

Tool        What it answers
execsnoop   What processes are being exec'd right now?
opensnoop   What files are being opened?
biolatency  Block I/O latency histogram
biosnoop    Per-I/O latency with process name
tcpconnect  What TCP connections are being made?
tcpaccept   What TCP connections are being accepted?
tcpretrans  TCP retransmits with details
runqlat     CPU run queue latency histogram
offcputime  Time spent off CPU (blocked, sleeping)
profile     CPU flame graph profiler
memleak     Outstanding allocations (memory leaks)
funccount   Count function calls
trace       Trace arbitrary kernel/user functions
cachestat   Page cache hit/miss ratio
cachetop    Page cache top by process
fileslower  Slow file reads/writes
ext4slower  Slow ext4 operations

Shell
runqlat                      # run queue latency — are tasks waiting for CPU?
biolatency -D              # disk I/O latency per device
tcpretrans                   # watch TCP retransmits live
cachestat 1                  # page cache stats every second
memleak -p 1234             # watch for memory leaks in PID
profile -F 99 -af 30        # CPU flame graph for 30s
Tracing ftrace

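Built-in kernel tracing framework. Accessed via /sys/kernel/debug/tracing (also mounted at /sys/kernel/tracing on newer kernels). No external tools needed, but it requires root and a kernel built with ftrace support, which most distribution kernels have.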

Shell
cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers

# Function call tracing
echo function > current_tracer
echo 1 > tracing_on
cat trace
echo 0 > tracing_on

# Trace a specific function
echo my_function > set_ftrace_filter
echo function > current_tracer

# Easier with trace-cmd (wrapper)
trace-cmd record -e sched_switch -p function sleep 5
trace-cmd report
All-in-one sar

System Activity Reporter. Collects, records, and reports historical system performance data. Part of sysstat. The only standard tool for looking back in time.

Common Options

Shell
# CPU usage every 1s, 5 samples
sar -u 1 5

# Memory stats
sar -r 1 5

# Disk I/O
sar -d 1 5

# Network stats
sar -n DEV 1 5

# TCP stats
sar -n TCP 1 5

# Load average and run queue
sar -q 1 5

# Historical data from today's log
sar -u -f /var/log/sysstat/sa$(date +%d)

# Historical data from specific time range
sar -u -s 09:00:00 -e 10:00:00

Gotchas

Must enable collection first: On Debian/Ubuntu, edit /etc/default/sysstat → set ENABLED="true", then systemctl enable --now sysstat.
Best tool for post-incident review: When something happened at 3am, sar is often the only tool with historical data.
All-in-one dstat

Combines vmstat, iostat, ifstat, and netstat into one colorized output. Great for a live overview of all subsystems at once.

Shell
dstat                        # default: cpu, disk, net, paging, system
dstat -cdngy               # cpu, disk, net, paging, system
dstat --top-cpu             # show top CPU process
dstat --top-io              # show top I/O process
dstat --top-mem             # show top memory process
dstat -t 1 60               # with timestamp, 1s interval, 60s
dstat --output /tmp/dstat.csv 1  # export to CSV for analysis
dstat is deprecated: Use dool (fork) on newer systems. Some distros still ship it as dstat.
All-in-one glances

Cross-platform monitoring tool with a rich curses UI. Shows CPU, memory, disk, network, processes, and alerts in one screen.

Shell
glances                      # interactive TUI
glances -w                  # web server mode (port 61208)
glances -s                  # server mode (for remote monitoring)
glances -c remote-host      # connect to remote glances server
glances --export csv        # export metrics to CSV
All-in-one atop

Advanced system and process monitor. Records all activity to disk and allows replaying historical sessions. Captures processes that have already exited.

Shell
atop                         # interactive
atop -r /var/log/atop/atop_20240426  # replay saved log
atop -A                     # sort processes by the most active resource
atop -w /tmp/atop.log 1 60  # write log, 1s interval, 60s
Key advantage over top/htop: atop captures short-lived processes that start and die between top's refresh intervals. Critical for debugging transient process spikes.
Reference USE Method

Brendan Gregg's methodology: for every resource, check Utilization, Saturation, and Errors.

Resource          Utilization          Saturation              Errors
CPU               mpstat %usr+%sys     vmstat r > CPUs         dmesg | grep error
Memory            free avail           vmstat si/so > 0        dmesg OOM killer
Disk              iostat %util         iostat aqu-sz > 1       smartctl -a /dev/sda
Network           sar -n DEV txkB/s    netstat -s retransmits  ip -s link
File descriptors  lsof -p PID | wc -l  /proc/PID/limits        EMFILE errors in logs
Reference Load Average

Shown in top, uptime, and /proc/loadavg. Represents the average number of processes in R (running) or D (uninterruptible sleep) state over 1, 5, and 15 minutes.

Rule of thumb: load average / CPU count. If you have 4 CPUs and load is 4.0, you're at 100% utilization. Load < CPU count = fine. Load > CPU count = saturated.

Scenario             Meaning
Load 1m > 5m > 15m   Load is increasing: the problem is getting worse
Load 1m < 5m < 15m   Load is decreasing: the system is recovering
Load high, CPU idle  D state processes: blocked on I/O, not CPU
Load = CPUs          Fully utilized but not saturated
Load >> CPUs         Saturated: tasks are queuing
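
The rule of thumb and the trend rows can be combined into one check. A minimal sketch; the function name load_status and its output wording are illustrative, and it takes the three averages and the CPU count as arguments so it can be tried with known values:

```shell
# Classify load from the 1m, 5m, 15m averages and the CPU count.
load_status() {
    awk -v a="$1" -v b="$2" -v c="$3" -v n="$4" 'BEGIN {
        trend = (a > b && b > c) ? "rising" : ((a < b && b < c) ? "falling" : "mixed")
        state = (a > n) ? "saturated" : "ok"
        printf "%.2f per CPU, %s, %s\n", a / n, trend, state
    }'
}

load_status 8.0 6.0 4.0 4
# prints: 2.00 per CPU, rising, saturated

# On a live system:
# load_status $(cut -d' ' -f1-3 /proc/loadavg) "$(nproc)"
```

This does not distinguish CPU-bound load from D state load; check CPU idle time alongside it.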
Reference Signals
Signal   Number (x86/ARM)  Default Action  Use case
SIGHUP   1                 Terminate       Reload config (many daemons)
SIGINT   2                 Terminate       Ctrl+C
SIGQUIT  3                 Core dump       Ctrl+\ (quit with core dump)
SIGKILL  9                 Terminate       Force kill; cannot be caught or ignored
SIGTERM  15                Terminate       Graceful shutdown (default kill signal)
SIGSTOP  19                Stop            Pause process; cannot be caught
SIGCONT  18                Continue        Resume stopped process
SIGUSR1  10                Terminate       App-defined (e.g. log rotation)
SIGUSR2  12                Terminate       App-defined
SIGPIPE  13                Terminate       Write to broken pipe
SIGCHLD  17                Ignore          Child process stopped or exited

Shell
kill -15 1234               # graceful terminate (default)
kill -9 1234                # force kill
kill -HUP 1234              # reload config
killall nginx                # kill all processes named nginx
pkill -f "my-app"           # kill by full command line match
kill -0 1234                # check if process exists (no signal sent)
SIGKILL bypasses graceful shutdown: Always try SIGTERM first and give the process time to clean up. SIGKILL can leave temp files, locks, and incomplete writes.
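
The SIGTERM-first advice can be wrapped in a small helper. A minimal sketch; the function name stop_gracefully and the 10 second default are illustrative choices, not a standard tool:

```shell
# Send SIGTERM, wait up to N seconds (default 10), then fall back to SIGKILL.
stop_gracefully() {
    target=$1
    remaining=${2:-10}
    kill -TERM "$target" 2>/dev/null || return 0    # already gone
    while [ "$remaining" -gt 0 ]; do
        kill -0 "$target" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        remaining=$((remaining - 1))
    done
    kill -KILL "$target" 2>/dev/null                # still alive: force it
}

# Usage: stop_gracefully 1234 30
```

Note that kill -0 also succeeds for a zombie that has not been reaped yet, so a child of the calling shell may look alive until it is waited on.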