Async I/O in Linux
From blocking syscalls and epoll to io_uring — how Linux evolved its async I/O model
The I/O concurrency problem: how to handle many connections with one thread?
1993 select(): first multiplexer; O(n) scan; 1024 fd limit (FD_SETSIZE)
1997 poll(): no fd limit; still O(n) scan on every call
2002 epoll: O(1) event delivery; used by nginx, Node.js, Redis, Go
2003 POSIX AIO / kernel AIO: async read/write; the kernel-native path is effective only with O_DIRECT; largely abandoned
2019 io_uring: unified async interface for any I/O; shared ring buffers; zero syscalls on the hot path with SQPOLL; used by Cloudflare, QEMU, tokio-uring
Foundation Blocking I/O

The default model: a thread issues a syscall and blocks until the kernel completes the operation. Simple to program but fundamentally limits concurrency to one in-flight operation per thread.

How It Works

Thread                          Kernel
  │                               │
  │── read(fd, buf, n) ──────────▶│  Checks if data is available
  │                               │
  │        BLOCKED                │  Waits for data to arrive
  │       (sleeping)              │  (NIC interrupt, disk DMA complete, …)
  │                               │
  │◀── returns n bytes ───────────│  Data copied to user buffer
  │                               │
  │  (continues execution)        │

Thread-per-connection model

Client 1 ──▶ Thread 1 ──▶ blocking read()  │
Client 2 ──▶ Thread 2 ──▶ blocking read()  │  Each thread ~1–8 MB stack
Client 3 ──▶ Thread 3 ──▶ blocking read()  │  Context-switch overhead
   …                                       │
Client N ──▶ Thread N ──▶ blocking read()  │  Breaks at ~10k connections
The C10K problem: Handling 10,000 simultaneous connections with one thread per connection requires about 10,000 threads. At roughly 1 MB of stack each, that is about 10 GB of virtual address space, plus the scheduler cost of switching among 10,000 threads. This was the motivation for epoll in 2002.

Simple Blocking Read

C
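#include <stdio.h>
#include <unistd.h>

// Blocks until at least one byte is available, or EOF/error.
// May return fewer than sizeof(buf) bytes (a short read).
ssize_t n = read(fd, buf, sizeof(buf));
if (n < 0) {
    perror("read");            // error: errno is set
} else if (n == 0) {
    /* EOF */
} else {
    // process buf[0..n-1]
}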
When blocking I/O is the right choice: For simple tools, CLIs, batch jobs, or services that handle one connection at a time, blocking I/O is cleaner and easier to reason about. Only reach for async when concurrency is actually the bottleneck.
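One detail of the blocking model is easy to miss: read() returns as soon as some data is available, so a single call may deliver fewer bytes than requested. The sketch below shows one way to loop until a full amount has arrived. The helper name read_full is an assumption made for illustration, not part of any standard API.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling blocking read() until `want` bytes
 * have arrived, EOF is hit, or a real error occurs.
 * Returns the number of bytes read, or -1 on error (errno is set). */
ssize_t read_full(int fd, void *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, (char *)buf + got, want - got);
        if (n > 0) {
            got += (size_t)n;      /* short read: loop for the rest */
        } else if (n == 0) {
            break;                 /* EOF before `want` bytes arrived */
        } else if (errno != EINTR) {
            return -1;             /* real error */
        }
        /* EINTR: interrupted by a signal, simply retry */
    }
    return (ssize_t)got;
}
```

A caller that expects a fixed-size record would compare the return value with the requested size to tell a complete record from an early EOF.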
Foundation Non-blocking I/O

Setting O_NONBLOCK on a file descriptor makes I/O syscalls return immediately with EAGAIN/EWOULDBLOCK instead of blocking when no data is ready. The application must retry or use a poller.

How It Works

Thread Kernel │ │ │── read(fd, buf, n) ──────────▶│ Checks if data is available │ │ │◀── returns -1 EAGAIN ─────────│ No data — returns immediately │ │ │ (app decides: retry, │ │ poll, or do other work) │

Setting O_NONBLOCK

C
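#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

// Set non-blocking on an existing fd (preserving its other status flags)
int flags = fcntl(fd, F_GETFL, 0);
if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
    perror("fcntl");

// Or set it at open time. This matters for FIFOs, devices, and sockets;
// O_NONBLOCK has no effect on reads from regular files.
int fifo_fd = open("/path/to/fifo", O_RDONLY | O_NONBLOCK);

// Read loop
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) { /* got data */ }
    else if (n == 0) { /* EOF */ break; }
    else if (errno == EAGAIN || errno == EWOULDBLOCK) { /* no data yet, try later */ }
    else if (errno == EINTR) { /* interrupted by a signal, retry */ }
    else { perror("read"); break; }
}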
Busy-polling burns CPU: Looping on read() until data arrives is a busy-poll, which means 100% CPU usage while waiting. Non-blocking I/O only becomes useful when combined with a readiness notification mechanism like epoll.
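To make the EAGAIN behaviour concrete, here is a small sketch of a single non-blocking read attempt that reports "not ready" as a distinct result, so the caller can go back to its poller instead of spinning. The helper names set_nonblocking and try_read and the -1/-2 return convention are assumptions chosen for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK while keeping the other status flags.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: one non-blocking read attempt.
 * Returns >0 bytes read, 0 on EOF, -1 if no data is ready yet
 * (caller should wait for readiness), -2 on any other error. */
ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n >= 0)
        return n;
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return -1;
    return -2;
}
```

On an empty pipe whose read end is non-blocking, try_read returns -1 immediately; once a byte has been written it returns 1.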
Multiplexing select

select() was the first I/O multiplexer — it blocks until one or more of a set of file descriptors becomes ready for reading, writing, or has an error. Available since BSD 4.2 and POSIX-standardised.

How select() Works

Application                         Kernel
  │                                   │
  │  Build fd_set (bitmap of fds)     │
  │── select(maxfd+1, &rfds, …) ─────▶│  Scans all fds 0…maxfd
  │                                   │  Blocks until ≥1 fd is ready
  │◀── returns count of ready fds ────│
  │                                   │
  │  Loop over all fds in set         │
  │  to find which ones are ready     │  ← O(n) scan every time
  │  (no direct indication)           │
C
#include <sys/select.h>

fd_set rfds;
struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

FD_ZERO(&rfds);
FD_SET(fd1, &rfds);
FD_SET(fd2, &rfds);

int ret = select(max_fd + 1, &rfds, NULL, NULL, &tv);
if (ret > 0) {
    if (FD_ISSET(fd1, &rfds)) { /* fd1 is readable */ }
    if (FD_ISSET(fd2, &rfds)) { /* fd2 is readable */ }
}

Limitations

| Problem | Detail |
|---|---|
| 1024 fd limit | An fd_set holds only FD_SETSIZE (1024) bits, so select() cannot watch any descriptor numbered 1024 or higher. With glibc on Linux the application cannot raise this limit. |
| O(n) complexity | The kernel scans all fds from 0 to maxfd on every call, regardless of how many are ready. |
| Set rebuilt every call | The fd_sets are modified by select() and must be rebuilt before each call, which adds O(n) copying overhead. |
| No event info | On return, you must scan the whole set to find which fds are ready. |
Multiplexing poll

poll() improves on select() by removing the 1024 fd limit and using an array of pollfd structs instead of bitmaps. But it still suffers from O(n) kernel scanning.

C
#include <poll.h>

struct pollfd fds[2] = {
    { .fd = fd1, .events = POLLIN },
    { .fd = fd2, .events = POLLIN | POLLOUT },
};

int ret = poll(fds, 2, 5000);  // 5 second timeout
if (ret > 0) {
    if (fds[0].revents & POLLIN)  { /* fd1 readable */ }
    if (fds[1].revents & POLLOUT) { /* fd2 writable */ }
}
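| Event Flag | Meaning |
|---|---|
| POLLIN | Data available to read |
| POLLOUT | Space available to write |
| POLLERR | Error condition (always reported in revents, even if not requested) |
| POLLHUP | Hangup: peer closed the connection (always reported in revents) |
| POLLRDHUP | Peer shut down the writing half of the connection (Linux-specific; requires _GNU_SOURCE) |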
poll vs select: poll() removes the 1024-fd ceiling and has a cleaner API, but both share the same fundamental flaw: the entire fd list is copied into the kernel on every call, and the kernel scans every fd. The cost of each call therefore grows as O(n) with the number of watched descriptors.
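Because poll() takes a plain array, the watch list can be sized at run time instead of being capped by a bitmap. The sketch below builds the array dynamically and returns the index of the first descriptor that has something to report. The name poll_first_readable and its return codes are assumptions made for this example.

```c
#include <poll.h>
#include <stdlib.h>

/* Hypothetical helper: wait up to timeout_ms for any of fds[0..n-1] to
 * become readable (or to report hangup/error).
 * Returns the index of the first such fd, -1 on timeout, -2 on error. */
int poll_first_readable(const int *fds, int n, int timeout_ms)
{
    if (n <= 0)
        return -1;

    struct pollfd *pfds = calloc((size_t)n, sizeof *pfds);
    if (!pfds)
        return -2;

    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;          /* POLLHUP/POLLERR are implicit */
    }

    int result = -1;
    int ret = poll(pfds, (nfds_t)n, timeout_ms);
    if (ret < 0) {
        result = -2;
    } else if (ret > 0) {
        for (int i = 0; i < n; i++) {     /* still an O(n) scan in userspace */
            if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR)) {
                result = i;
                break;
            }
        }
    }
    free(pfds);
    return result;
}
```

Note that the whole array is still handed to the kernel on each call, which is the cost epoll was designed to remove.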
Multiplexing epoll

epoll is the Linux-specific scalable I/O event notification mechanism, introduced in development kernel 2.5.44 (2002) and available in stable kernels since 2.6. Instead of scanning all fds on every call, it maintains a kernel-side interest list and delivers only the fds that are ready. Used by nginx, Node.js, Redis, Go's netpoller, and Java NIO.

How epoll Works

epoll_create1()
  │  → returns epfd (the epoll instance)
  │
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event)
  │  → register fd in kernel interest list (one-time cost)
  │
epoll_wait(epfd, events, maxevents, timeout)
  │  → blocks; kernel delivers only READY fds into events[]
  │  → O(1) per ready event, not O(n) over all watched fds
  │
Application handles only the ready fds returned in events[]

epoll vs select/poll

select / poll                         epoll
─────────────────────────────────────────────────────────────────────────
Copy fd set to kernel every call      Register fds once with ctl
Kernel scans all fds O(n)             Kernel delivers ready fds O(1)
App scans all fds for ready ones      App processes events array directly
Rebuild set after each call           Interest list persists
1024 fd limit (select)                No limit
POSIX-portable                        Linux-only

Level-triggered vs Edge-triggered

Level-triggered (LT) — default

epoll_wait returns as long as the fd is ready. If you don't read all available data, the fd is reported again next call. Easier to use — same semantics as select/poll.

Edge-triggered (ET) — EPOLLET

epoll_wait returns only when the fd transitions from not-ready to ready. You must read until EAGAIN to drain the buffer; otherwise the leftover data sits unreported until another event arrives on the fd, which may never happen. More efficient but easy to get wrong.
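The difference between the two modes can be observed with a pipe: write once, then call epoll_wait() twice without reading. The sketch below does exactly that and counts how many of the two waits report the descriptor. The function name count_reports is an assumption made for this demonstration.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: write once to a pipe, then call epoll_wait() twice
 * without reading. `extra` is 0 for level-triggered or EPOLLET for
 * edge-triggered. Returns how many of the two waits reported the fd,
 * or -1 on a setup error. */
int count_reports(unsigned extra)
{
    int p[2], hits = -1;

    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | extra, .data.fd = p[0] };
        struct epoll_event out;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            hits = 0;
            for (int i = 0; i < 2; i++)       /* timeout 0: do not block */
                if (epoll_wait(epfd, &out, 1, 0) == 1)
                    hits++;
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return hits;
}
```

With extra set to 0 the unread byte is reported on both waits; with EPOLLET it is reported only on the first, because no new transition has occurred.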

epoll Event Loop

C
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <fcntl.h>

#define MAX_EVENTS 64

int epfd = epoll_create1(0);

// Register a listening socket
struct epoll_event ev = {
    .events  = EPOLLIN,
    .data.fd = listen_fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

struct epoll_event events[MAX_EVENTS];

while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == listen_fd) {
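            // New connection
            int conn = accept(listen_fd, NULL, NULL);
            if (conn < 0) continue;          // e.g. EAGAIN or ECONNABORTED
            // Add O_NONBLOCK without clobbering other status flags
            fcntl(conn, F_SETFL, fcntl(conn, F_GETFL, 0) | O_NONBLOCK);
            ev.events  = EPOLLIN | EPOLLET;  // edge-triggered
            ev.data.fd = conn;
            epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);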
        } else {
            // Data ready on a connection
            handle_connection(events[i].data.fd);
        }
    }
}

EPOLLONESHOT & EPOLLEXCLUSIVE

| Flag | Behaviour | Use Case |
|---|---|---|
| EPOLLET | Edge-triggered: fire only on transitions | High-performance single-thread loops |
| EPOLLONESHOT | Deliver one event, then disarm; re-arm with EPOLL_CTL_MOD | Multi-threaded servers (avoid races) |
| EPOLLEXCLUSIVE | When one fd is registered in several epoll instances, wake only one (or a few) of the waiters per event instead of all of them | Multi-process accept() without thundering herd |
| EPOLLERR | Error on fd; always reported even if not requested | Error handling |
| EPOLLRDHUP | Peer closed or shut down write half | Detecting half-closed connections |
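The disarm and re-arm cycle of EPOLLONESHOT can be seen on a pipe that stays readable throughout. In this sketch the first wait reports the fd, the second reports nothing even though the data is still unread, and a third wait after EPOLL_CTL_MOD reports it again. The names rearm_oneshot and oneshot_demo are assumptions made for this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: re-arm a one-shot registration after its event
 * has been handled. Returns 0 on success, -1 on error. */
int rearm_oneshot(int epfd, int fd, unsigned events)
{
    struct epoll_event ev = { .events = events | EPOLLONESHOT, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical demo on a pipe holding one unread byte. Fills r[0..2]
 * with the epoll_wait() results for: first wait, second wait (disarmed),
 * third wait (after re-arming). Returns 0 on success, -1 on setup error. */
int oneshot_demo(int r[3])
{
    int p[2], rc = -1;
    struct epoll_event out;

    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = p[0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            r[0] = epoll_wait(epfd, &out, 1, 0);   /* delivered once */
            r[1] = epoll_wait(epfd, &out, 1, 0);   /* disarmed: nothing */
            if (rearm_oneshot(epfd, p[0], EPOLLIN) == 0) {
                r[2] = epoll_wait(epfd, &out, 1, 0);   /* delivered again */
                rc = 0;
            }
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return rc;
}
```

In a multi-threaded server the worker that received the event would typically finish processing the connection before calling the re-arm helper, so that no second thread is woken for the same fd in the meantime.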
Who uses epoll? nginx, Node.js, Redis, Memcached, Go's runtime net poller, Java NIO (via /dev/poll on Solaris, kqueue on BSD, epoll on Linux), libuv, libevent, libev, Twisted, Tornado — essentially every high-performance server on Linux.
Async API POSIX AIO

POSIX AIO (aio_read, aio_write) was designed to provide true kernel-async I/O without threads. In practice, glibc implements it with an internal thread pool, while Linux's separate kernel-native AIO interface (io_submit, usually reached through libaio) is only reliably asynchronous for O_DIRECT files. Both are largely superseded by io_uring.

API Overview

C
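#include <aio.h>
#include <errno.h>
#include <stdio.h>

struct aiocb cb = {
    .aio_fildes = fd,
    .aio_buf    = buf,
    .aio_nbytes = sizeof(buf),
    .aio_offset = 0,
    // No notification: completion is checked explicitly below.
    // (SIGEV_SIGNAL needs a handler installed first; an unhandled
    // SIGUSR1 terminates the process.)
    .aio_sigevent.sigev_notify = SIGEV_NONE,
};

// Submit async read: returns immediately
if (aio_read(&cb) < 0) perror("aio_read");

// Wait for completion without spinning
const struct aiocb *const list[] = { &cb };
while (aio_error(&cb) == EINPROGRESS)
    aio_suspend(list, 1, NULL);

ssize_t n = aio_return(&cb);   // get result (bytes read, or -1)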

Why POSIX AIO Failed to Catch On

| Problem | Detail |
|---|---|
| glibc uses threads | The glibc implementation simulates async I/O using a hidden thread pool, so it is not truly kernel-async. |
| O_DIRECT only (kernel impl) | The kernel's native aio only avoids blocking for O_DIRECT files. Buffered I/O may still block. |
| No network support | Designed around regular files. Sockets, pipes, and ttys get no real asynchronous support; at best the blocking call runs on a helper thread. |
| Cumbersome API | Completion notification via signals, polling, or threads is awkward in every variant. Hard to integrate with event loops. |
| Limited operations | Only read/write/fsync. No accept, connect, send, recv, or splice. |
Avoid POSIX AIO for new code: Unless you are targeting portability to non-Linux systems, use io_uring for file I/O and epoll for sockets. POSIX AIO is effectively a dead end on Linux.
Modern io_uring

Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring is a unified async I/O interface built around two ring buffers shared between userspace and the kernel. It attacks a cost common to all prior async interfaces: syscall overhead. Many operations can be submitted and reaped with a single syscall, and with SQPOLL enabled the hot path can run with no syscalls at all.

The Core Insight

All prior async I/O                   io_uring
────────────────────────────────────────────────────────────────────────
syscall per operation                 Batch N ops in one syscall
(read, write, send, recv, …)          (io_uring_enter), or zero
                                      syscalls with SQPOLL
Context switch per op                 Ring buffer shared between
                                      userspace and kernel:
                                      no copy, no context switch
Limited operation types               Unified: read, write, accept,
(select→fds only, aio→files only)     connect, send, recv, splice,
                                      fsync, fallocate, openat, …
Callbacks or polling to get result    Completion queue (CQ): just
                                      read the ring buffer
io_uring Submission & Completion Queues

Ring Buffer Architecture

           Userspace                                     Kernel
┌──────────────────────────────┐           ┌──────────────────────────────┐
│  Submission Queue (SQ)       │           │                              │
│  ┌────┬────┬────┬────┐       │           │  io_uring instance           │
│  │SQE │SQE │SQE │    │       │◀─────────▶│  (mmap'd shared memory)      │
│  └────┴────┴────┴────┘       │           │                              │
│  sq_tail (app writes here)   │           │  sq_head (kernel reads here) │
│                              │           │                              │
│  Completion Queue (CQ)       │           │                              │
│  ┌────┬────┬────┬────┐       │           │                              │
│  │CQE │CQE │    │    │       │◀─────────▶│                              │
│  └────┴────┴────┴────┘       │           │                              │
│  cq_head (app reads here)    │           │  cq_tail (kernel writes here)│
└──────────────────────────────┘           └──────────────────────────────┘

SQE = Submission Queue Entry (what to do)
CQE = Completion Queue Entry (result)
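The head and tail indices in the diagram follow the usual single-producer, single-consumer ring discipline: one side only advances the tail, the other only advances the head, and a power-of-two size lets a mask turn a free-running counter into a slot index. The toy model below illustrates that discipline in plain C11. It is not the kernel ABI; the struct layout, names, and the 8-entry size are assumptions made purely for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 8u                    /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Toy single-producer / single-consumer ring (illustrative only). */
struct toy_ring {
    _Atomic uint32_t head;                 /* advanced only by the consumer */
    _Atomic uint32_t tail;                 /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int toy_ring_push(struct toy_ring *r, uint64_t v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return -1;                         /* full */
    r->slots[tail & RING_MASK] = v;
    /* release: the slot contents become visible before the new tail */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int toy_ring_pop(struct toy_ring *r, uint64_t *v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return -1;                         /* empty */
    *v = r->slots[head & RING_MASK];
    /* release: the slot is free for reuse once the new head is visible */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}
```

In io_uring terms, the application plays the producer on the SQ and the consumer on the CQ, with the kernel taking the opposite role on each; because the indices live in shared memory, neither direction needs a syscall just to exchange entries.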

SQE Structure (what you submit)

C (struct io_uring_sqe)
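// Simplified view: the real definition in <linux/io_uring.h> wraps several
// of these fields in unions and has more members.
struct io_uring_sqe {
    __u8   opcode;       // IORING_OP_READ, IORING_OP_ACCEPT, …
    __u8   flags;        // IOSQE_FIXED_FILE, IOSQE_IO_LINK, …
    __u16  ioprio;
    __s32  fd;           // target file descriptor
    __u64  off;          // file offset (or 0)
    __u64  addr;         // pointer to buffer
    __u32  len;          // length
    __u32  rw_flags;     // op-specific flags (a union in the real struct)
    __u64  user_data;    // echoed back in CQE; use for correlation
    // … more fields for op-specific params
};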

CQE Structure (what you get back)

C (struct io_uring_cqe)
struct io_uring_cqe {
    __u64  user_data;   // echoed from SQE — identifies the request
    __s32  res;         // result: bytes read/written, or -errno on error
    __u32  flags;
};
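Since user_data is the only link between a completion and the request that produced it, applications usually encode whatever they need to find their own state. A common lightweight scheme, sketched below, packs an operation tag into the upper 32 bits and the file descriptor into the lower 32 bits. The enum values and helper names are assumptions for illustration; storing a pointer to a per-request struct is an equally valid choice.

```c
#include <stdint.h>

/* Hypothetical operation tags for this example. */
enum op_tag { TAG_ACCEPT = 1, TAG_RECV = 2, TAG_SEND = 3 };

/* Pack an operation tag (upper 32 bits) and an fd (lower 32 bits)
 * into one 64-bit user_data value. */
static inline uint64_t pack_user_data(enum op_tag tag, int fd)
{
    return ((uint64_t)tag << 32) | (uint32_t)fd;
}

/* Recover the tag from a completion's user_data. */
static inline enum op_tag user_data_tag(uint64_t ud)
{
    return (enum op_tag)(ud >> 32);
}

/* Recover the fd from a completion's user_data. */
static inline int user_data_fd(uint64_t ud)
{
    return (int)(uint32_t)ud;
}
```

A submit path would set sqe->user_data = pack_user_data(TAG_RECV, conn_fd), and the completion path would switch on user_data_tag(cqe->user_data).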

Submission Modes

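| Mode | How | Trade-off |
|---|---|---|
| Normal | io_uring_enter() syscall to submit + wait | One syscall per batch; good default |
| SQPOLL | Kernel thread polls the SQ continuously (IORING_SETUP_SQPOLL) | No submission syscalls while the poller is awake, but it burns a CPU core; after an idle period the thread sleeps and must be woken with io_uring_enter() |
| IOPOLL | Kernel polls for completion instead of interrupt (IORING_SETUP_IOPOLL) | Lower latency for NVMe; requires O_DIRECT |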
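Whether a given mode is usable depends on the kernel version and on system policy (some container runtimes and distributions disable io_uring entirely). A minimal way to find out is to ask the kernel directly with the raw io_uring_setup syscall, as sketched below. This assumes kernel headers that provide <linux/io_uring.h> and __NR_io_uring_setup; the helper name try_ring_setup is hypothetical, and a real program would normally use liburing and go on to mmap the rings.

```c
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: try to create a ring with `entries` SQ slots and
 * the given setup flags (0 for normal mode, IORING_SETUP_SQPOLL,
 * IORING_SETUP_IOPOLL, ...).
 * Returns 0 if the kernel accepted the request, otherwise -errno
 * (for example -ENOSYS on kernels without io_uring, -EPERM if it is
 * disabled, -EINVAL for unsupported parameters). */
int try_ring_setup(unsigned entries, unsigned flags)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof p);
    p.flags = flags;

    long fd = syscall(__NR_io_uring_setup, entries, &p);
    if (fd < 0)
        return -errno;

    close((int)fd);      /* only probing; nothing is mapped or submitted */
    return 0;
}
```

An application could call try_ring_setup(64, IORING_SETUP_SQPOLL) at startup and fall back to normal mode, or to epoll, when the result is negative.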
io_uring Supported Operations

io_uring supports a growing set of operations. Unlike POSIX AIO, it covers both file and network I/O under a single interface.

| Category | Operations |
|---|---|
| File I/O | read, write, readv, writev, read_fixed, write_fixed, fsync, fdatasync, fallocate, ftruncate |
| File management | openat, openat2, close, statx, renameat, unlinkat, mkdirat, symlinkat |
| Networking | accept, connect, recv, send, recvmsg, sendmsg, shutdown, socket, bind, listen |
| Advanced | splice, tee, sendfile (via splice) |
| Control | nop, timeout, link_timeout, cancel, poll_add, poll_remove |
| Multi-shot | accept_multishot, recv_multishot (one SQE, many CQEs) |

Probing Supported Ops

C (liburing)
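#include <liburing.h>
#include <stdio.h>

struct io_uring_probe *probe = io_uring_get_probe();
if (!probe) {
    // NULL means the probe failed (kernel too old or io_uring disabled)
    fprintf(stderr, "io_uring probe unavailable\n");
} else {
    if (io_uring_opcode_supported(probe, IORING_OP_SOCKET))
        printf("IORING_OP_SOCKET supported\n");
    io_uring_free_probe(probe);
}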
io_uring Advanced Features

Fixed Buffers & Registered Files

Registering buffers and file descriptors with the kernel avoids repeated per-operation setup: page pinning for buffers, and fd table lookups and reference counting for files.

C (liburing)
// Register buffers once — kernel pins the pages
struct iovec bufs[4];
// … populate bufs …
io_uring_register_buffers(&ring, bufs, 4);

// Use fixed buffer in SQE — avoids page-pin on each I/O
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, fd, buf, len, offset, buf_index);

// Register fds: avoids fd table lookup on each op
io_uring_register_files(&ring, fds, num_fds);
// Then set IOSQE_FIXED_FILE on the SQE and pass the index into the
// registered array in place of the fd

Linked Requests

Chain SQEs so the next one only executes if the previous one succeeded. Useful for read-then-write pipelines without returning to userspace.

C (liburing)
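// Chain: open → read → close, all in one submit.
// The fd produced by the open is not known at submit time, so the chain
// uses a "direct" (registered) file slot instead of a normal fd.
// Requires a registered file table and a recent kernel/liburing.
struct io_uring_sqe *sqe;
const unsigned slot = 0;                       // index in the registered table

io_uring_register_files_sparse(&ring, 1);      // one empty slot

sqe = io_uring_get_sqe(&ring);
io_uring_prep_openat_direct(sqe, AT_FDCWD, "/etc/hosts", O_RDONLY, 0, slot);
sqe->flags |= IOSQE_IO_LINK;                   // link to next SQE

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, slot, buf, sizeof(buf), 0);
// Hard link: a short read counts as a failure for plain links and would
// cancel the close; IOSQE_IO_HARDLINK lets the close run regardless.
sqe->flags |= IOSQE_IO_HARDLINK | IOSQE_FIXED_FILE;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_close_direct(sqe, slot);

io_uring_submit(&ring);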

Multishot Operations (accept: kernel 5.19+, recv: kernel 6.0+)

Submit once, receive many completions. Ideal for accept and streaming recv — no need to re-arm after each event.

C (liburing)
// Accept connections continuously — one SQE, many CQEs
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
io_uring_submit(&ring);

// Each new connection produces one CQE with cqe->res = conn_fd
// The accept stays armed — no re-submit needed

Buffer Ring (provided buffers)

Provide a pool of buffers to the kernel; it picks one for each incoming packet without the app having to provide one per recv call. The data is still copied into the selected buffer, so this saves buffer bookkeeping rather than copies.

C (liburing)
// Register a buffer ring (kernel 5.19+; multishot recv needs 6.0+)
struct io_uring_buf_ring *br;
int ret;
br = io_uring_setup_buf_ring(&ring, 128, bgid, 0, &ret);
if (!br) { /* setup failed: ret holds -errno */ }

// Add buffers to the ring
int mask = io_uring_buf_ring_mask(128);
for (int i = 0; i < 128; i++)
    io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i, mask, i);
io_uring_buf_ring_advance(br, 128);

// recv will automatically pick a buffer from the ring; the chosen
// buffer id is reported in cqe->flags >> IORING_CQE_BUFFER_SHIFT
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
sqe->buf_group = bgid;
sqe->flags |= IOSQE_BUFFER_SELECT;
io_uring liburing — Getting Started

liburing is the reference userspace library for io_uring, written by Jens Axboe. It wraps the raw syscalls and ring manipulation into a clean API.

Install

Shell
# Debian / Ubuntu
apt install liburing-dev

# Fedora / RHEL
dnf install liburing-devel

# Build from source
git clone https://github.com/axboe/liburing
cd liburing && ./configure && make && make install

Complete Example: Async File Read

C (liburing)
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 1
#define BUF_SIZE    4096

int main(void) {
    struct io_uring ring;
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    char buf[BUF_SIZE];

    // Get an SQE and fill it
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, BUF_SIZE, 0);
    sqe->user_data = 42;   // tag to identify this request

    // Submit to kernel
    io_uring_submit(&ring);

    // Wait for completion
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        return 1;
    }

    if (cqe->res >= 0)     // 0 means EOF, not an error
        printf("Read %d bytes: %.*s\n", cqe->res, cqe->res, buf);
    else
        fprintf(stderr, "Error: %s\n", strerror(-cqe->res));

    io_uring_cqe_seen(&ring, cqe);    // advance CQ head
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

Complete Example: Echo Server

C sketch (liburing)
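// Minimal io_uring echo server (abbreviated sketch).
// submit_recv(), submit_send() and conn_buf() are application helpers
// (not shown). Each helper stores the operation tag in the upper 32 bits
// of user_data and the connection fd in the lower 32 bits, so that a
// completion identifies both the operation and the connection.
#define UD(tag, fd)  (((__u64)(tag) << 32) | (__u32)(fd))

struct io_uring ring;
io_uring_queue_init(256, &ring, 0);

// Submit initial multishot accept
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
sqe->user_data = UD(TAG_ACCEPT, listen_fd);
io_uring_submit(&ring);

while (1) {
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0)
        continue;                         // e.g. interrupted by a signal

    int tag = (int)(cqe->user_data >> 32);
    int fd  = (int)(cqe->user_data & 0xffffffff);

    switch (tag) {
    case TAG_ACCEPT:
        if (cqe->res >= 0)
            submit_recv(cqe->res);        // cqe->res = new connection fd
        // If IORING_CQE_F_MORE is clear in cqe->flags, the multishot
        // accept has ended and must be submitted again.
        break;
    case TAG_RECV:
        if (cqe->res <= 0)
            close(fd);                    // peer closed (0) or error (-errno)
        else
            submit_send(fd, conn_buf(fd), cqe->res);   // echo it back
        break;
    case TAG_SEND:
        submit_recv(fd);                  // send done: queue next recv
        break;
    }
    io_uring_cqe_seen(&ring, cqe);
}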
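Higher-level options: Most applications shouldn't use liburing directly. Prefer runtimes and frameworks that offer io_uring support: tokio-uring (Rust, a companion crate to Tokio), Netty (Java, io_uring transport), Seastar (C++), or libuv (the Node.js event loop, which can use io_uring for some file operations on Linux). Python's asyncio with uvloop sits on top of libuv and still relies on epoll for sockets.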
Reference Comparison
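| Feature | select | poll | epoll | POSIX AIO | io_uring |
|---|---|---|---|---|---|
| Kernel version | Ancient | 2.1.23 | 2.6 | 2.6 | 5.1 |
| fd limit | 1024 | Unlimited | Unlimited | Unlimited | Unlimited |
| Complexity | O(n) | O(n) | O(1) | O(1) | O(1) |
| Syscall per event | Yes | Yes | Yes (batched) | Yes | No (batched / zero) |
| Works on sockets | Yes | Yes | Yes | No | Yes |
| Works on files | Yes* | Yes* | No* | Yes | Yes |
| Zero-copy buffers | No | No | No | No | Yes |
| Chained ops | No | No | No | No | Yes |
| POSIX portable | Yes | Yes | No (Linux) | Yes | No (Linux) |
| Used in production | Legacy | Legacy | Widely | Rarely | Growing |

* select and poll always report regular files as "ready", so the actual I/O may still block. epoll refuses regular files outright: epoll_ctl fails with EPERM. Only io_uring provides true async file I/O.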

Performance Characteristics

[Figure: illustrative scalability chart. Vertical axis: connections / watched fds (1k to 1M); horizontal axis: throughput / efficiency. select and poll sit near 1k (select hits the 1024 fd limit), epoll spans roughly 10k to 100k, and io_uring extends toward 1M.]
Reference When to Use What
| Scenario | Recommendation | Reason |
|---|---|---|
| Simple script / CLI tool | Blocking I/O | Simplest code; concurrency not needed |
| Handful of connections (< 100) | Blocking + threads, or poll | Overhead of epoll/io_uring not worth it |
| High-concurrency server (sockets) | epoll | Mature, well-understood, O(1) delivery |
| High-throughput file I/O | io_uring | True async file I/O; epoll doesn't help for files |
| Mixed file + network I/O | io_uring | Single interface for both, which simplifies the event loop |
| Kernel ≥ 5.1, latency critical | io_uring + SQPOLL | Zero syscall overhead on hot path |
| Portability (BSD, macOS) | epoll → kqueue abstraction (libuv, libevent) | kqueue is the epoll equivalent on BSD/macOS |
| Rust async networking | Tokio (epoll via mio; io_uring via the separate tokio-uring crate) | Best-in-class async runtime for Rust |
| Go networking | net package (uses epoll internally) | Goroutines + runtime netpoller = transparent |
| Java NIO | Netty or java.nio (epoll; Netty also offers io_uring) | Netty has native epoll and io_uring transports |
Most applications should not use epoll or io_uring directly: Use a runtime, framework, or library that handles the event loop: libuv (Node.js, used via C), libevent, libev, Tokio (Rust), asyncio + uvloop (Python), Netty (Java), Go's net package. These abstract the OS differences and handle edge cases (EINTR, spurious wakeups, etc.).
Reference Tools & Observability
Shell
# Check kernel version (io_uring needs ≥ 5.1)
uname -r

# Trace epoll_wait calls and their return counts
strace -e trace=epoll_wait,epoll_ctl -p <pid>

# Count io_uring syscalls per second
bpftrace -e 'tracepoint:syscalls:sys_enter_io_uring_enter { @[comm] = count(); }'

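# Trace io_uring operations and results
# (tracepoint names have changed across kernel versions, e.g. newer kernels
#  expose io_uring_submit_req instead of io_uring_submit_sqe; list the
#  available ones first if these fail to attach)
bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe { printf("%s op=%d\n", comm, args->opcode); }'
bpftrace -e 'tracepoint:io_uring:io_uring_complete { printf("%s res=%d\n", comm, args->res); }'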

# Show io_uring tracepoints available
ls /sys/kernel/debug/tracing/events/io_uring/

# Check epoll usage (number of epoll fds)
ls /proc/<pid>/fd | xargs -I{} readlink /proc/<pid>/fd/{} | grep -c eventpoll

# perf: count context switches (high count = too many threads)
perf stat -e context-switches -p <pid> -- sleep 5
| Tool | Purpose |
|---|---|
| strace | Trace all I/O syscalls; see what epoll/io_uring calls an app makes |
| bpftrace | Trace io_uring tracepoints, epoll wakeups, socket operations |
| perf | Count context switches, cache misses, syscall rates |
| io_uring_bench | Benchmark tool from the liburing repo |
| /proc/<pid>/fdinfo/<fd> | Shows epoll or io_uring ring details for a given fd |
| lsof | List open fds including eventpoll and io_uring instances |