The default model: a thread issues a syscall and blocks until the kernel completes the operation. Simple to program but fundamentally limits concurrency to one in-flight operation per thread.
How It Works
Thread-per-connection model
Simple Blocking Read

    #include <stdio.h>
    #include <unistd.h>

    // Blocks until at least one byte is available (or EOF/error);
    // may return fewer than sizeof(buf) bytes.
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) perror("read");    // error
    if (n == 0) { /* EOF */ }
    // otherwise process buf[0..n-1]
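Because a blocking read() returns as soon as some data is available, callers that need an exact byte count usually loop. The following is a minimal sketch of that pattern; the helper name `read_full` is hypothetical and not part of any API discussed here.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `len` bytes have arrived,
 * EOF is reached, or a real error occurs. Returns the number of bytes
 * stored in buf (fewer than len only at EOF), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n > 0) {
            got += (size_t)n;       /* short read: loop for the rest */
        } else if (n == 0) {
            break;                  /* EOF before len bytes arrived */
        } else if (errno != EINTR) {
            return -1;              /* real error; errno is set */
        }                           /* EINTR: interrupted by a signal, retry */
    }
    return (ssize_t)got;
}
```

The thread still blocks inside each read() call, so this changes nothing about concurrency; it only hides short reads from the caller.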
Setting O_NONBLOCK on a file descriptor makes I/O syscalls return immediately with EAGAIN/EWOULDBLOCK instead of blocking when no data is ready. The application must retry or use a poller.
How It Works
Setting O_NONBLOCK

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    // Set non-blocking on an existing fd
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    // Or set it at open time (shown with a second descriptor)
    int fd2 = open("/path/to/file", O_RDONLY | O_NONBLOCK);

    // Read loop
    while (1) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* got data */
        } else if (n == 0) {
            /* EOF */
            break;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* no data yet, try later */
        } else {
            perror("read");
            break;
        }
    }
Calling read() in a tight loop until data arrives is a busy-poll: it uses 100% of a CPU core while waiting. Non-blocking I/O only becomes useful when combined with a readiness notification mechanism such as epoll.
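To make the EAGAIN behaviour concrete, here is a small sketch that sets O_NONBLOCK and separates "no data yet" from EOF and real errors. The names `set_nonblocking` and `try_read`, and the `-2` return code, are assumptions made for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK to fd's existing status flags. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical wrapper: returns bytes read, 0 on EOF, -1 on error,
 * or -2 when the read would have blocked (no data yet). */
ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);      /* retry if a signal interrupted us */
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

A caller that gets `-2` should go back to its poller rather than call `try_read` again immediately.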
select() was the first I/O multiplexer — it blocks until one or more of a set of file descriptors becomes ready for reading, writing, or has an error. Available since BSD 4.2 and POSIX-standardised.
How select() Works

    #include <sys/select.h>

    fd_set rfds;
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };   // 5 second timeout
    int max_fd = (fd1 > fd2) ? fd1 : fd2;

    FD_ZERO(&rfds);
    FD_SET(fd1, &rfds);
    FD_SET(fd2, &rfds);

    // First argument is the highest-numbered fd plus one
    int ret = select(max_fd + 1, &rfds, NULL, NULL, &tv);
    if (ret > 0) {
        if (FD_ISSET(fd1, &rfds)) { /* fd1 is readable */ }
        if (FD_ISSET(fd2, &rfds)) { /* fd2 is readable */ }
    }
Limitations
| Problem | Detail |
|---|---|
| 1024 fd limit | FD_SETSIZE is fixed at 1024 in glibc, so select() cannot watch any descriptor numbered 1024 or higher, however few descriptors are being watched. |
| O(n) complexity | The kernel scans all fds from 0 to maxfd on every call, regardless of how many are ready. |
| Set rebuilt every call | The fd_sets are modified by select() and must be rebuilt before each call — O(n) copying overhead. |
| No event info | On return, you must scan the whole set to find which fds are ready. |
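The limitations in the table show up directly in code. The sketch below wraps select() in a hypothetical helper, `select_readable` (a name assumed for this example), that rebuilds the set on every call, rejects descriptors that do not fit in an fd_set, and scans the whole list afterwards.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for any of fds[0..n-1] to
 * become readable. Sets ready[i] to 1 or 0 for each fd and returns the
 * number of ready descriptors, 0 on timeout, or -1 on error (including
 * an fd that does not fit in an fd_set). */
int select_readable(const int *fds, int n, int timeout_ms, int *ready)
{
    fd_set rfds;
    int max_fd = -1;

    FD_ZERO(&rfds);                         /* the set is rebuilt on every call */
    for (int i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;                      /* FD_SET would be undefined here */
        FD_SET(fds[i], &rfds);
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }

    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };
    int ret = select(max_fd + 1, &rfds, NULL, NULL, &tv);
    if (ret < 0)
        return -1;

    for (int i = 0; i < n; i++)             /* scan every fd to find the ready ones */
        ready[i] = FD_ISSET(fds[i], &rfds) ? 1 : 0;
    return ret;
}
```

Both loops run over all `n` descriptors on every call, which is the O(n) cost the table describes.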
poll() improves on select() by removing the 1024 fd limit and using an array of pollfd structs instead of bitmaps. But it still suffers from O(n) kernel scanning.

    #include <poll.h>

    struct pollfd fds[2] = {
        { .fd = fd1, .events = POLLIN },
        { .fd = fd2, .events = POLLIN | POLLOUT },
    };

    int ret = poll(fds, 2, 5000);   // timeout in milliseconds (5 seconds)
    if (ret > 0) {
        if (fds[0].revents & POLLIN)  { /* fd1 readable */ }
        if (fds[1].revents & POLLOUT) { /* fd2 writable */ }
    }
| Event Flag | Meaning |
|---|---|
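| POLLIN | Data available to read |
| POLLOUT | Space available to write |
| POLLERR | Error condition (always reported in revents, even if not requested) |
| POLLHUP | Hangup: peer closed the connection (also reported without being requested) |
| POLLRDHUP | Peer shut down the writing half of the connection (Linux-specific) |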
poll() removes the 1024-fd ceiling and has a cleaner API. But both share the fundamental flaw: the entire fd list is copied to the kernel on every call, and the kernel scans every fd. Both degrade as O(n) with the number of watched descriptors.
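The revents flags can arrive in combination, so the order in which they are tested matters. As a sketch (the helper name `poll_state` and its one-character result codes are assumptions for this example), the function below checks POLLIN before POLLHUP so that data queued ahead of a hangup is not dropped.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: classify one fd with a zero-timeout poll().
 * Returns 'r' if data is readable, 'h' if the peer hung up with nothing
 * left to read, 'e' on POLLERR/POLLNVAL or a poll() failure, and '-'
 * if nothing is pending. */
char poll_state(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int ret = poll(&p, 1, 0);
    if (ret < 0)
        return 'e';
    if (ret == 0)
        return '-';
    if (p.revents & (POLLERR | POLLNVAL))
        return 'e';
    if (p.revents & POLLIN)
        return 'r';             /* test POLLIN first: data may precede a hangup */
    if (p.revents & POLLHUP)
        return 'h';
    return '-';
}
```

On a Linux pipe whose writer has closed, this is expected to report `'r'` until the buffered bytes are consumed and `'h'` afterwards.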
epoll is the Linux-specific scalable I/O event notification mechanism, introduced in kernel 2.6. Instead of scanning all fds on every call, it maintains a kernel-side interest list and delivers only the fds that are ready. Used by nginx, Node.js, Redis, Go's netpoller, and Java NIO.
How epoll Works
epoll vs select/poll
Level-triggered vs Edge-triggered
Level-triggered (LT) — default
epoll_wait returns as long as the fd is ready. If you don't read all available data, the fd is reported again next call. Easier to use — same semantics as select/poll.
Edge-triggered (ET) — EPOLLET
epoll_wait returns only when the fd transitions from not-ready to ready. You must read until EAGAIN to drain the buffer; otherwise the data left behind produces no further notification until new data arrives. More efficient but easy to get wrong.
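The difference between the two modes can be observed with a pipe. This is a small experiment rather than production code; the function name `count_reports` is an assumption. It writes one byte, never reads it, and counts how many consecutive epoll_wait calls report the descriptor.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Writes one byte to a pipe, then calls epoll_wait() `rounds` times
 * without ever reading it. Returns how many calls reported the fd,
 * or -1 on setup failure. Pass 0 for level-triggered, EPOLLET for edge. */
int count_reports(uint32_t mode, int rounds)
{
    int p[2], reports = 0;
    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | mode, .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0 ||
        write(p[1], "x", 1) != 1) {
        reports = -1;
    } else {
        for (int i = 0; i < rounds; i++) {
            struct epoll_event out;
            if (epoll_wait(epfd, &out, 1, 0) == 1)   /* timeout 0: do not block */
                reports++;
        }
    }
    close(epfd);
    close(p[0]);
    close(p[1]);
    return reports;
}
```

With `count_reports(0, 3)` every call reports the unread byte; with `count_reports(EPOLLET, 3)` only the first call does, because no new transition occurs after it.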
epoll Event Loop

    #include <fcntl.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    int epfd = epoll_create1(0);

    // Register a listening socket (level-triggered)
    struct epoll_event ev = {
        .events = EPOLLIN,
        .data.fd = listen_fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    while (1) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                // New connection
                int conn = accept(listen_fd, NULL, NULL);
                if (conn < 0)
                    continue;
                // Add O_NONBLOCK without discarding the existing flags
                int flags = fcntl(conn, F_GETFL, 0);
                fcntl(conn, F_SETFL, flags | O_NONBLOCK);
                ev.events = EPOLLIN | EPOLLET;   // edge-triggered
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                // Data ready on a connection
                handle_connection(events[i].data.fd);
            }
        }
    }
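The loop above leaves handle_connection undefined. Because the connections are registered edge-triggered, whatever it does must read until EAGAIN. One possible shape is sketched below under a different, hypothetical name (`drain_connection`) to make clear it is an illustration and not the handler the loop's author had in mind; it only counts bytes where a real server would process them.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical edge-triggered handler: read until the descriptor reports
 * EAGAIN so no buffered data is left behind. Returns the number of bytes
 * consumed, or -1 if the peer closed, an error occurred, or fd is not in
 * non-blocking mode (the caller should then close fd). */
ssize_t drain_connection(int fd)
{
    /* Guard: on a blocking fd the loop below would stall instead of
     * returning once the buffer is empty. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || !(flags & O_NONBLOCK)) {
        errno = EINVAL;
        return -1;
    }

    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;             /* a real server would process buf[0..n-1] */
        } else if (n == 0) {
            return -1;              /* EOF: peer closed */
        } else if (errno == EINTR) {
            continue;               /* interrupted by a signal, retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;           /* drained: safe to wait for the next edge */
        } else {
            return -1;              /* real error */
        }
    }
}
```

The flag check costs one extra syscall per event and is there for safety in a sketch; a server that controls how its descriptors are created could drop it.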
EPOLLONESHOT & EPOLLEXCLUSIVE
| Flag | Behaviour | Use Case |
|---|---|---|
| EPOLLET | Edge-triggered: fire only on transitions | High-performance single-thread loops |
| EPOLLONESHOT | Deliver one event, then disarm; re-arm with EPOLL_CTL_MOD | Multi-threaded servers (avoid races) |
| EPOLLEXCLUSIVE | One wakeup per ready event across multiple epoll instances | Multi-process accept() without thundering herd |
| EPOLLERR | Error on fd; always reported even if not requested | Error handling |
| EPOLLRDHUP | Peer closed or shut down write half | Detecting half-closed connections |
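The EPOLLONESHOT row implies a two-step cycle: the fd fires once, stays silent, and fires again only after an explicit EPOLL_CTL_MOD. A minimal sketch of the re-arm step follows; the helper names `rearm_oneshot` and `poll_once` are assumptions for this example.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Re-arm an fd that was registered with EPOLLONESHOT and has already fired.
 * Returns 0 on success, -1 on error (same as epoll_ctl). */
int rearm_oneshot(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events | EPOLLONESHOT, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Zero-timeout epoll_wait: 1 if an event is pending, 0 if not, -1 on error. */
int poll_once(int epfd)
{
    struct epoll_event out;
    return epoll_wait(epfd, &out, 1, 0);
}
```

In a multi-threaded server the worker that received the event would typically call `rearm_oneshot` only after it has finished with the connection, so no second thread is woken for the same fd in the meantime.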
POSIX AIO (aio_read, aio_write) was designed to provide true kernel-async I/O without threads. In practice the Linux glibc implementation uses a thread pool internally, and the kernel implementation only works efficiently with O_DIRECT files. Largely superseded by io_uring.
API Overview
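
    #include <aio.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>

    struct aiocb cb = {
        .aio_fildes = fd,
        .aio_buf    = buf,
        .aio_nbytes = sizeof(buf),
        .aio_offset = 0,
        // No notification: completion is polled below. To be notified by
        // signal instead, install a handler first, then use SIGEV_SIGNAL
        // with sigev_signo = SIGUSR1; without a handler, SIGUSR1
        // terminates the process.
        .aio_sigevent.sigev_notify = SIGEV_NONE,
    };

    // Submit async read; returns immediately
    if (aio_read(&cb) < 0)
        perror("aio_read");

    // Check completion (polling)
    while (aio_error(&cb) == EINPROGRESS) { /* wait */ }
    ssize_t n = aio_return(&cb);   // bytes read, or -1 (aio_error gives the code)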
Why POSIX AIO Failed to Catch On
| Problem | Detail |
|---|---|
| glibc uses threads | The glibc implementation simulates async I/O using a hidden thread pool — not truly kernel-async. |
| O_DIRECT only (kernel impl) | The kernel's native aio only avoids blocking for O_DIRECT files. Buffered I/O may still block. |
| No network support | POSIX AIO is spec'd only for regular files — sockets, pipes, and ttys are not supported. |
| Cumbersome API | Completion notification via signals, polling, or threads — all awkward. Hard to integrate with event loops. |
| Limited operations | Only read/write/fsync. No accept, connect, send, recv, or splice. |
Prefer io_uring for file I/O and epoll for sockets. POSIX AIO is effectively a dead end on Linux.
Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring is a unified async I/O interface built around two ring buffers shared between userspace and the kernel. It targets a cost common to the earlier interfaces: syscall overhead. Submissions and completions travel through shared memory, so many operations can be batched into one syscall, and with submission-queue polling (SQPOLL) the hot path can need no syscalls at all.
The Core Insight
Ring Buffer Architecture
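The indexing scheme behind such rings can be shown with a toy model. The sketch below is a single-process, single-producer/single-consumer ring and is not the real io_uring memory layout (the real rings are mmap'ed and shared with the kernel); every name in it is hypothetical. It illustrates free-running head and tail counters, power-of-two masking, and release/acquire ordering on the index updates.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Toy single-producer/single-consumer ring. head and tail are free-running
 * counters; the slot index is (counter & RING_MASK). */
struct toy_ring {
    _Atomic uint32_t head;              /* advanced by the consumer */
    _Atomic uint32_t tail;              /* advanced by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents become visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* Release: the slot may be reused only after it has been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In io_uring terms, userspace plays the producer on the submission ring and the consumer on the completion ring, with the kernel on the opposite side of each.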
SQE Structure (what you submit)

    // Simplified view: the real struct uses unions for several fields
    struct io_uring_sqe {
        __u8  opcode;     // IORING_OP_READ, IORING_OP_ACCEPT, …
        __u8  flags;      // IOSQE_FIXED_FILE, IOSQE_IO_LINK, …
        __u16 ioprio;
        __s32 fd;         // target file descriptor
        __u64 off;        // file offset (or 0)
        __u64 addr;       // pointer to buffer
        __u32 len;        // length
        __u64 user_data;  // echoed back in CQE; use for correlation
        // … more fields for op-specific params
    };
CQE Structure (what you get back)

    struct io_uring_cqe {
        __u64 user_data;  // echoed from SQE; identifies the request
        __s32 res;        // result: bytes read/written, or -errno on error
        __u32 flags;
    };
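Since user_data is the only field that links a completion back to its request, event loops commonly pack more than one value into its 64 bits. One possible encoding is sketched here, with a request kind in the high half and the fd in the low half; the enum and function names are assumptions for illustration, and storing a pointer to a per-request struct is an equally common alternative.

```c
#include <stdint.h>

/* Hypothetical request kinds for an event loop. */
enum req_kind { REQ_ACCEPT = 1, REQ_RECV = 2, REQ_SEND = 3 };

/* Pack a request kind (high 32 bits) and an fd (low 32 bits). */
uint64_t pack_user_data(enum req_kind kind, int fd)
{
    return ((uint64_t)kind << 32) | (uint32_t)fd;
}

/* Recover the request kind from a completion's user_data. */
enum req_kind user_data_kind(uint64_t ud)
{
    return (enum req_kind)(ud >> 32);
}

/* Recover the fd from a completion's user_data. */
int user_data_fd(uint64_t ud)
{
    return (int)(uint32_t)ud;
}
```

A completion handler would call `user_data_kind(cqe->user_data)` to decide what finished and `user_data_fd(cqe->user_data)` to find the connection it belongs to.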
Submission Modes
| Mode | How | Trade-off |
|---|---|---|
| Normal | io_uring_enter() syscall to submit + wait | One syscall per batch — good default |
| SQPOLL | Kernel thread polls the SQ continuously (IORING_SETUP_SQPOLL) | No submission syscalls while the poller is awake, at the cost of a kernel thread busy-polling on a CPU core |
| IOPOLL | Kernel polls for completion instead of interrupt (IORING_SETUP_IOPOLL) | Lower latency for NVMe — requires O_DIRECT |
io_uring supports a growing set of operations. Unlike POSIX AIO, it covers both file and network I/O under a single interface.
| Category | Operations |
|---|---|
| File I/O | read, write, readv, writev, read_fixed, write_fixed, fsync, fdatasync, fallocate, ftruncate |
| File management | openat, openat2, close, statx, renameat, unlinkat, mkdirat, symlinkat |
| Networking | accept, connect, recv, send, recvmsg, sendmsg, shutdown, socket, bind, listen |
| Advanced | splice, tee, sendfile (via splice) |
| Control | nop, timeout, link_timeout, cancel, poll_add, poll_remove |
| Multi-shot | accept_multishot, recv_multishot — one SQE, many CQEs |
Probing Supported Ops
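
    #include <liburing.h>
    #include <stdio.h>

    struct io_uring_probe *probe = io_uring_get_probe();
    if (probe == NULL) {
        fprintf(stderr, "io_uring probe not available\n");
    } else {
        if (io_uring_opcode_supported(probe, IORING_OP_SOCKET))
            printf("IORING_OP_SOCKET supported\n");
        io_uring_free_probe(probe);
    }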
Fixed Buffers & Registered Files
Registering buffers and file descriptors with the kernel avoids repeated page-pinning overhead on each I/O operation.

    // Register buffers once; the kernel pins the pages
    struct iovec bufs[4];
    // … populate bufs …
    io_uring_register_buffers(&ring, bufs, 4);

    // Use a fixed buffer in the SQE; avoids pinning pages on each I/O.
    // buf must lie inside bufs[buf_index].
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, len, offset, buf_index);

    // Register fds; avoids an fd table lookup on each op
    io_uring_register_files(&ring, fds, num_fds);
    // Then set IOSQE_FIXED_FILE and pass the index into fds instead of the fd
Linked Requests
Chain SQEs so the next one only executes if the previous one succeeded. Useful for read-then-write pipelines without returning to userspace.
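
    // Chain: open → read → close, all in one submit.
    // A linked op cannot see an fd returned by an earlier op in the chain,
    // so the open installs the file into registered-file slot 0 (a direct
    // descriptor) and the read and close refer to that slot.
    io_uring_register_files_sparse(&ring, 1);   // table with one empty slot

    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat_direct(sqe, AT_FDCWD, "/etc/hosts", O_RDONLY, 0, 0);
    sqe->flags |= IOSQE_IO_LINK;                // link to next SQE

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, 0, buf, sizeof(buf), 0);   // fd = slot index 0
    sqe->flags |= IOSQE_IO_LINK | IOSQE_FIXED_FILE;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_close_direct(sqe, 0);         // release slot 0

    io_uring_submit(&ring);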
Multishot Operations (kernel 5.19+)
Submit once, receive many completions. Ideal for accept and streaming recv — no need to re-arm after each event.

    // Accept connections continuously: one SQE, many CQEs
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
    io_uring_submit(&ring);

    // Each new connection produces one CQE with cqe->res = conn_fd.
    // The accept stays armed while the CQE carries IORING_CQE_F_MORE;
    // if that flag is absent, submit a new accept.
Buffer Ring (provided buffers for recv)
Provide a pool of buffers to the kernel; it picks one for each incoming packet without the app having to provide one per recv call.

    // Register a buffer ring (kernel 5.19+)
    int ret;
    struct io_uring_buf_ring *br;
    br = io_uring_setup_buf_ring(&ring, 128, bgid, 0, &ret);
    if (br == NULL) {
        /* setup failed; ret holds the error code */
    }

    // Add buffers to the ring
    int mask = io_uring_buf_ring_mask(128);
    for (int i = 0; i < 128; i++)
        io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i, mask, i);
    io_uring_buf_ring_advance(br, 128);

    // recv picks a buffer from the ring automatically
    // (multishot recv needs a newer kernel than the buffer ring itself)
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
    sqe->buf_group = bgid;
    sqe->flags |= IOSQE_BUFFER_SELECT;
liburing is the reference userspace library for io_uring, written by Jens Axboe. It wraps the raw syscalls and ring manipulation into a clean API.
Install

    # Debian / Ubuntu
    apt install liburing-dev

    # Fedora / RHEL
    dnf install liburing-devel

    # Build from source
    git clone https://github.com/axboe/liburing
    cd liburing && ./configure && make && make install
Complete Example: Async File Read

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define QUEUE_DEPTH 1
    #define BUF_SIZE    4096

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
            fprintf(stderr, "io_uring_queue_init failed\n");
            return 1;
        }

        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        char buf[BUF_SIZE];

        // Get an SQE and fill it
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BUF_SIZE, 0);
        sqe->user_data = 42;   // tag to identify this request

        // Submit to kernel
        io_uring_submit(&ring);

        // Wait for completion
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res >= 0)     // 0 means EOF, not an error
            printf("Read %d bytes: %.*s\n", cqe->res, cqe->res, buf);
        else
            fprintf(stderr, "Error: %s\n", strerror(-cqe->res));

        io_uring_cqe_seen(&ring, cqe);   // advance CQ head
        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }
Complete Example: Echo Server
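
    // Minimal io_uring echo server (abbreviated).
    // user_data carries a tag in the high 32 bits and the connection fd in
    // the low 32 bits. submit_recv(), submit_send() and conn_buf() are
    // helpers that are not shown: the first two queue an SQE and set
    // user_data the same way, the last returns a per-connection buffer.
    struct io_uring ring;
    io_uring_queue_init(256, &ring, 0);

    // Submit initial multishot accept
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = (__u64)TAG_ACCEPT << 32;
    io_uring_submit(&ring);

    while (1) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        int fd = (int)(cqe->user_data & 0xffffffff);

        switch (cqe->user_data >> 32) {
        case TAG_ACCEPT:
            if (cqe->res >= 0)
                submit_recv(cqe->res);   // cqe->res = new connection fd
            break;
        case TAG_RECV:
            if (cqe->res > 0)
                submit_send(fd, conn_buf(fd), cqe->res);   // echo it back
            else
                close(fd);               // EOF or error
            break;
        case TAG_SEND:
            submit_recv(fd);             // send done; queue next recv
            break;
        }
        io_uring_cqe_seen(&ring, cqe);
    }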
| Feature | select | poll | epoll | POSIX AIO | io_uring |
|---|---|---|---|---|---|
| Kernel version | Ancient | 2.1.23 | 2.6 | 2.6 | 5.1 |
| fd limit | 1024 | Unlimited | Unlimited | Unlimited | Unlimited |
| Complexity | O(n) | O(n) | O(1) | O(1) | O(1) |
| Syscall per event | Yes | Yes | Yes (batched) | Yes | No (batched / zero) |
| Works on sockets | Yes | Yes | Yes | No | Yes |
| Works on files | Yes* | Yes* | No* | Yes | Yes |
| Zero-copy buffers | No | No | No | No | Yes |
| Chained ops | No | No | No | No | Yes |
| POSIX portable | Yes | Yes | No (Linux) | Yes | No (Linux) |
| Used in production | Legacy | Legacy | Widely | Rarely | Growing |
* select and poll always report regular files as "ready", and epoll refuses to register them (epoll_ctl fails with EPERM), so readiness says nothing about whether file I/O will block. Of the interfaces compared here, only io_uring provides true async I/O for buffered files.
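How readiness APIs treat regular files on Linux can be checked directly. The sketch below uses two hypothetical helpers (`poll_says_readable` and `epoll_add_errno`, names assumed for this example): one asks poll() about a descriptor, the other tries to register it with epoll and returns the resulting errno.

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if poll() reports fd readable right away, 0 if not, -1 on error. */
int poll_says_readable(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int ret = poll(&p, 1, 0);
    if (ret < 0)
        return -1;
    return (ret == 1 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Tries to add fd to a fresh epoll instance. Returns 0 on success,
 * otherwise the errno reported by epoll_create1 or epoll_ctl. */
int epoll_add_errno(int fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return errno;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int err = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 ? errno : 0;
    close(epfd);
    return err;
}
```

For a regular file (for example one obtained from `tmpfile()`), poll() is expected to report it readable immediately even when it is empty, and `epoll_add_errno` is expected to return EPERM. A pipe with no data behaves the opposite way: not readable, and accepted by epoll.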
Performance Characteristics
| Scenario | Recommendation | Reason |
|---|---|---|
| Simple script / CLI tool | Blocking I/O | Simplest code; concurrency not needed |
| Handful of connections (< 100) | Blocking + threads, or poll | Overhead of epoll/io_uring not worth it |
| High-concurrency server (sockets) | epoll | Mature, well-understood, O(1) delivery |
| High-throughput file I/O | io_uring | True async file I/O; epoll doesn't help for files |
| Mixed file + network I/O | io_uring | Single interface for both — simplifies event loop |
| Kernel ≥ 5.1, latency critical | io_uring + SQPOLL | Zero syscall overhead on hot path |
| Portability (BSD, macOS) | epoll → kqueue abstraction (libuv, libevent) | kqueue is the epoll equivalent on BSD/macOS |
| Rust async networking | Tokio (epoll via mio; io_uring via the separate tokio-uring crate) | Most widely used async runtime for Rust |
| Go networking | net package (uses epoll internally) | Goroutines + runtime netpoller = transparent |
| Java NIO | Netty or java.nio (uses epoll/io_uring) | Netty has native epoll and io_uring transports |

    # Check kernel version (io_uring needs ≥ 5.1)
    uname -r

    # Trace epoll_wait calls and their return counts
    strace -e trace=epoll_wait,epoll_ctl -p <pid>

    # Count io_uring_enter syscalls per process (totals print on Ctrl-C)
    bpftrace -e 'tracepoint:syscalls:sys_enter_io_uring_enter { @[comm] = count(); }'

    # Trace io_uring operations and results
    # (tracepoint and field names vary between kernel versions)
    bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe { printf("%s op=%d\n", comm, args->opcode); }'
    bpftrace -e 'tracepoint:io_uring:io_uring_complete { printf("%s res=%d\n", comm, args->res); }'

    # Show io_uring tracepoints available
    ls /sys/kernel/debug/tracing/events/io_uring/

    # Check epoll usage (number of epoll fds)
    ls /proc/<pid>/fd | xargs -I{} readlink /proc/<pid>/fd/{} | grep -c eventpoll

    # perf: count context switches (high count = too many threads)
    perf stat -e context-switches -p <pid> -- sleep 5
| Tool | Purpose |
|---|---|
| strace | Trace all I/O syscalls; see what epoll/io_uring calls an app makes |
| bpftrace | Trace io_uring tracepoints, epoll wakeups, socket operations |
| perf | Count context switches, cache misses, syscall rates |
| io_uring_bench | Benchmark tool from the liburing repo |
| /proc/<pid>/fdinfo/<fd> | Shows epoll or io_uring ring details for a given fd |
| lsof | List open fds including eventpoll and io_uring instances |