Add support for io_uring zero-copy receive (ZCRX) on suitable platform by benjarvis · Pull Request #41 · chimera-nas/libevpl

benjarvis · 2026-05-17T14:44:06Z

WIP.

Audit and expand the io_uring backend to take advantage of modern io_uring features on capable kernels, while remaining buildable and runnable on older kernels and liburing versions.

Public config additions (include/evpl/evpl_config.h, src/core/config.c):
- io_uring_zerocopy_rx (off/on/auto, default auto)
- io_uring_zcrx_interface, io_uring_zcrx_rxq, io_uring_zcrx_area_size, io_uring_zcrx_rq_entries
- io_uring_registered_buffers, io_uring_registered_files, io_uring_send_zc, io_uring_recv_bundle (all tri-state, default auto)
- EVPL_IO_URING_{OFF,ON,AUTO} mode constants

Mode semantics: ON with capability missing -> fatal at ctx create; AUTO with capability missing -> log + silently fall back.

Compile-time gates (CMakeLists.txt): check_symbol_exists / check_c_source_compiles probes for each new symbol/opcode/enum (io_uring_register_ifq, io_uring_prep_send_zc, io_uring_register_buffers_sparse, io_uring_register_files_sparse, io_uring_register_buffers_update_tag, io_uring_register_files_update, io_uring_prep_close_direct, IORING_OP_RECV_ZC, IORING_RECVSEND_BUNDLE, IORING_RECVSEND_FIXED_BUF, IOSQE_FIXED_FILE, IORING_ZCRX_AREA_SHIFT, struct io_uring_zcrx_ifq_reg). HAVE_IO_URING_ZCRX is set only when the full ZCRX symbol set is available.

Runtime probe (io_uring.c): io_uring_get_probe_ring + io_uring_opcode_supported populates a per-ctx capability bitmap; AND'd with HAVE_* macros; resolved against config policy to compute effective_{fixed_file,fixed_buf,send_zc,recv_bundle,zcrx}. Decisions logged via evpl_io_uring_info. Ring setup is conditional: when ZCRX is wanted we drop SQPOLL (incompatible) and use DEFER_TASKRUN + SINGLE_ISSUER instead.

Registered files (io_uring_tcp.c): register_files_sparse(4096) at ctx create; per-socket direct_fd_idx; IOSQE_FIXED_FILE on send/recv SQEs; slot released on close.

Registered buffers (io_uring.c, io_uring_internal.h, allocator.c): evpl_framework_io_uring now implements register_memory/unregister_memory. Each slab gets a globally monotonic buf_index (stored as idx+1 in slab->framework_private[EVPL_FRAMEWORK_IO_URING]). Each ctx calls register_buffers_sparse(1024) and lazily fills its per-ring table via register_buffers_update_tag in a sync helper called from the pump and completion paths. Helper evpl_io_uring_iov_to_fixed() resolves an iov to a (buf_index, offset) pair. On ENOMEM (typically RLIMIT_MEMLOCK exhaustion) the ctx silently downgrades; init bumps RLIMIT_MEMLOCK to RLIM_INFINITY best-effort. evpl_memory_framework_private() returns NULL for refs with no slab, which is what a ZCRX synthetic iovec uses.

RECVSEND_BUNDLE (io_uring_tcp.c): IORING_RECVSEND_BUNDLE set on multishot recv when effective; recv callback walks consecutive buffer ids in the ring until the total byte count is consumed.

SEND_ZC (io_uring_tcp.c): when fixed_buf is in effect, the send path issues io_uring_prep_send_zc + IORING_RECVSEND_FIXED_BUF + buf_index. send_callback handles the two-CQE pattern: first CQE (F_MORE) emits EVPL_NOTIFY_SENT; second CQE (F_NOTIF) releases the iovec ref. The iov is stored on the request itself so the F_NOTIF CQE can find it.

ZCRX (io_uring.c, io_uring_tcp.c): evpl_io_uring_zcrx_setup mmaps the data area + rq region, fills io_uring_zcrx_{area_reg,ifq_reg}, and registers the ifq. Per-connection getsockopt(SO_INCOMING_NAPI_ID) check; the recv path uses IORING_OP_RECV_ZC with zcrx_ifq_idx. CQE32 second half carries the area offset; a synthetic evpl_iovec with a custom release callback posts the rqe back via the rq tail.

NAPI busy-poll registration was intentionally deferred: libevpl's run loop drains CQEs in userland via io_uring_peek_batch_cqe and otherwise blocks in epoll_wait on the eventfd. It never calls io_uring_enter with a wait count, so kernel NAPI busy-poll is inert under this polling model.

Shutdown safety: evpl_io_uring_close detaches in-flight recv/accept requests (req->tcp.socket = NULL); callbacks bail on NULL. Request free is idempotent via an on_freelist flag. ctx destroy drains pending CQEs (without invoking callbacks) before tearing the ring down so cancel/close ops complete cleanly.

Tests: re-enabled hello_world_stream_tcp, hello_world_connected_msg_tcp, ping_pong_stream_tcp, ping_pong_msg_tcp, bulk_msg_tcp, bulk_stream_tcp, rand_full_duplex_stream_tcp; all 8 io_uring tests pass on both Debug (ASan) and Release builds. Full 65-test suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
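The mode semantics described in this commit can be sketched as a small resolver. This is a minimal illustration, not the PR's code: the enum names and `resolve_feature` are hypothetical, and only the OFF/ON/AUTO policy and the AND of compile-time and runtime capability are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the EVPL_IO_URING_{OFF,ON,AUTO} constants. */
enum feature_mode { MODE_OFF = 0, MODE_ON = 1, MODE_AUTO = 2 };

/* Outcome of resolving one feature against policy and capability. */
enum feature_result {
    FEATURE_DISABLED, /* feature not used on this ctx */
    FEATURE_ENABLED,  /* requested and available */
    FEATURE_FATAL     /* ON requested but capability missing */
};

/*
 * compiled_in:   the matching HAVE_* macro was set at build time
 * kernel_has_it: the runtime probe reported support
 * A feature is usable only when both are true.
 */
static enum feature_result
resolve_feature(enum feature_mode mode, bool compiled_in, bool kernel_has_it)
{
    bool capable = compiled_in && kernel_has_it;

    assert(mode == MODE_OFF || mode == MODE_ON || mode == MODE_AUTO);

    if (mode == MODE_OFF) {
        return FEATURE_DISABLED;
    }
    if (capable) {
        return FEATURE_ENABLED;
    }
    /* Capability missing: ON is fatal, AUTO falls back. */
    return mode == MODE_ON ? FEATURE_FATAL : FEATURE_DISABLED;
}
```

A caller would presumably run this once per feature at ctx create and log the result, treating `FEATURE_FATAL` as the "fatal at ctx create" case.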

The first cut bulk-synced every globally-registered slab into a new ring's FIXED_BUF table at ctx-create time, then re-synced from the completion path. With the default 1 GiB slab size and ~16 slabs already populated by fill_recv_ring, the worker thread blocked for ~1.6 s inside io_uring_register_buffers_update_tag pinning all of them, long enough that the first accepted connection never reached evpl_io_uring_attach in time for a pingpong test to make any progress.

Switch to lazy, per-slab on-demand registration:
- Replace ctx->buf_high_water (contiguous water-mark) with a sparse buf_registered[] bitmap. evpl_io_uring_ensure_buf_registered(ctx, idx) registers a single slab into this ring's FIXED_BUF table on first use and is O(1) thereafter.
- Drop the create-time bulk sync and the completion-loop sync sweep. pump() now calls ensure_buf_registered just-in-time for the slab the current iov lives in; iov_to_fixed continues to gate eligibility, and the pump falls back to the legacy provided-buffer send path for any iov whose slab cannot be (or is not yet) registered.

Additional fixes uncovered during validation:
- IORING_RECVSEND_FIXED_BUF is only honored for SEND_ZC on this kernel; plain IORING_OP_SEND with the flag returns -EINVAL. Only set FIXED_BUF when send_zc is also engaging.
- io_uring_buf_ring_advance was gated on !effective.fixed_buf, but the legacy send_ring is also used as the FIXED_BUF fallback path, so the advance must follow whenever we actually added entries (offset > 0).
- evpl_io_uring_request_alloc now resets every tcp.* field on reuse, so a recycled request never carries stale use_fixed_buf / send_iov state from a previous send (caused a destroy-time double release of an already-released iov in rand_full_duplex).
- evpl_io_uring_close now releases any iovs still parked in the per-socket provided-buffer send ring; the destroy-time drain also releases FIXED_BUF / SEND_ZC iovs whose owning bind was already torn down. send_callback NULLs req->tcp.send_iov.data after release so the drain knows not to release twice.

Full 65-test libevpl Debug (ASan) suite passes. flowbench pingpong across io_uring_tcp now sustains ~34 Kops/s with all features on (send_zc default) and ~57 Kops/s with send_zc=off. The cross-over favors plain SEND at very small payloads since SEND_ZC pays a two-CQE overhead; send_zc wins at larger sizes where the page-pinning amortizes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
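The lazy registration scheme above can be sketched with a plain bitmap. This is an illustration under assumptions, not the actual evpl_io_uring_ensure_buf_registered: `struct ring_bufs`, `fake_register_slab`, and the counter are hypothetical, and the expensive kernel call is replaced by a stub so the fast path and slow path are easy to see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BUFS 1024 /* same size as the sparse table in the first commit */

/* Hypothetical per-ring state; the real ctx has many more fields. */
struct ring_bufs {
    uint64_t registered[MAX_BUFS / 64]; /* one bit per buf_index */
    unsigned register_calls;            /* counts slow-path invocations */
};

/* Stub for the expensive kernel registration that pins a slab's pages. */
static int
fake_register_slab(struct ring_bufs *rb, unsigned idx)
{
    (void) idx;
    rb->register_calls++;
    return 0; /* a real registration could fail, e.g. with -ENOMEM */
}

/*
 * Register slab 'idx' on first use only. Returns true when the slab can be
 * used as a fixed buffer on this ring, false when the caller should fall
 * back to the non-fixed send path.
 */
static bool
ensure_buf_registered(struct ring_bufs *rb, unsigned idx)
{
    uint64_t bit;

    assert(rb);

    if (idx >= MAX_BUFS) {
        return false;
    }

    bit = UINT64_C(1) << (idx % 64);

    if (rb->registered[idx / 64] & bit) {
        return true; /* fast path: one load and one mask */
    }

    if (fake_register_slab(rb, idx) != 0) {
        return false;
    }

    rb->registered[idx / 64] |= bit;
    return true;
}
```

Because the bitmap is sparse, a slab that is never sent from on a given ring is never pinned for that ring, which is the property that removes the create-time stall.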

The kernel UAPI added an rx_buf_len field where __resv2 used to live in struct io_uring_zcrx_ifq_reg (ZCRX_FEATURE_RX_PAGE_SIZE). Kernels with this change reject rx_buf_len == 0 with -EINVAL, which is exactly what io_uring_register_ifq was returning on mlx5/ConnectX-7 + fw 28.45.1200.

Add an io_uring_zcrx_rx_buf_len config knob (default: system page size) and write it through the same 4-byte slot (named __resv2 in the installed liburing 2.14 header, rx_buf_len in newer kernel UAPI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Some kernel/driver combinations only honor a user-supplied area when the new enum zcrx_reg_flags ZCRX_REG_IMPORT (= 1) bit is set in io_uring_zcrx_area_reg.flags.

Add an io_uring_zcrx_area_import config knob (default off) so this can be flipped at runtime without rebuilding, gated against drivers/kernels that don't accept the bit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier I defaulted rx_buf_len to PAGE_SIZE because kernels with the new UAPI seemed to reject 0 with -EINVAL. Tracing showed that EINVAL came from a different check ('tcp-data-split is disabled'), not from rx_buf_len.

With tcp-data-split enabled, the actual semantics of rx_buf_len are:
- rx_buf_len == 0 -> driver picks (PAGE_SIZE). Required on drivers that do not yet advertise QCFG_RX_PAGE_SIZE in their netdev_queue_mgmt_ops (mlx5 in v7.0). Setting a non-zero value there triggers __net_mp_open_rxq's -EOPNOTSUPP "device does not support: rx_page_size".
- rx_buf_len != 0 -> requests a specific page size; only valid when the driver advertises QCFG_RX_PAGE_SIZE (mlx5 v7.1+).

Pass the user-configured value through verbatim (default 0). The user can still override via EVPL_ZCRX_RX_BUF_LEN on kernels that support it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The SO_INCOMING_NAPI_ID gate was wrong: when the ifq is registered on this ring, the queue's page-pool memory provider is owned by ZCRX and data fragments land in the ZCRX area. A regular IORING_OP_RECV (even multishot with provided-buffer-ring) cannot consume those fragments: the bytes sit in the socket's kernel recv queue forever and the test hangs. SO_INCOMING_NAPI_ID also turns out to be unreliable right after accept() (per-socket NAPI binding may not be latched yet, getsockopt returns 0), which is what was making us silently fall back to plain RECV.

Once ZCRX is set up on a ring, force IORING_OP_RECV_ZC on every accepted socket and rely on the caller to steer only matching traffic to that ring (typically via ethtool ntuple onto a dedicated RX queue).

Observed on mlx5 ConnectX-7 + kernel v7.0: with the SO_INCOMING_NAPI_ID gate the server kernel TCP-ACKed the 256B ping at TCP level and left it in the socket's recv queue (ss shows CLOSE-WAIT recv-q=257) while our io_uring multishot recv never fired a CQE. With this fix the RECV_ZC SQE is posted as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two cleanups in evpl_io_uring_setup_socket that mattered for the listener thread:

* evpl_io_uring_fill_recv_ring allocates one 2 MiB iovec per ring slot (8 K slots = 16 GiB by default) from the slab allocator. The listen socket never consumes recv buffers (only accepted sockets do), so doing this on listen-socket setup just thrashes the slab allocator and can block the listener thread for several seconds before it can even submit its multishot-accept SQE.
* The per-socket send-buf-ring is the fallback for non-FIXED_BUF sends; a listen socket never sends, so io_uring_setup_buf_ring is wasted work there too.

Tests still pass; this is a no-op for accepted/connected sockets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The ZCRX-enabled path drops SQPOLL and uses IORING_SETUP_DEFER_TASKRUN instead. With DEFER_TASKRUN, task work (including delivering CQEs to the user-side completion ring) is deferred until the ring is entered with IORING_ENTER_GETEVENTS. Plain io_uring_submit() doesn't set that flag, so SQEs got submitted but their CQEs never showed up via io_uring_peek_batch_cqe: the multishot accept ran in the kernel but its completion stayed deferred forever and our accept_callback never fired. The TCP layer happily completed the 3-way handshake and queued the client's 256B in the kernel recv buffer; userspace never read it.

Fix:
- Switch the flush deferral from io_uring_submit() to io_uring_submit_and_get_events() so the very same syscall that submits SQEs also drains pending task work into the CQ ring.
- Call io_uring_get_events() at the top of every completion drain so even idle iterations (when nothing was just submitted) advance deferred task work.

Both calls are cheap when there is no work to do, and have no extra cost relative to plain submit on SQPOLL rings.

With this change ZCRX-enabled servers actually accept connections on the registered ifq queue; without it, accept SQEs sit forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
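The failure mode described in this commit can be reproduced in a toy model that involves no real io_uring at all. Everything below is hypothetical and only mirrors the behaviour the commit message describes: completions are held as deferred work until the ring is entered with a get-events request, and a peek only reads what has already been posted. In liburing terms, `toy_enter(r, false)` plays the role of a plain io_uring_submit() and `toy_enter(r, true)` the role of io_uring_submit_and_get_events() or io_uring_get_events().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: two counters standing in for kernel and CQ ring state. */
struct toy_ring {
    unsigned deferred; /* completions held back as deferred task work */
    unsigned cq_ready; /* completions visible to a userspace peek */
};

/* An operation finishes in the "kernel"; its CQE is not posted yet. */
static void
toy_kernel_complete(struct toy_ring *r)
{
    assert(r);
    r->deferred++;
}

/* Enter the ring; only a get-events entry runs the deferred work. */
static void
toy_enter(struct toy_ring *r, bool get_events)
{
    assert(r);
    if (get_events) {
        r->cq_ready += r->deferred;
        r->deferred  = 0;
    }
}

/* Peek: a pure read of shared state, never enters the "kernel". */
static unsigned
toy_peek(const struct toy_ring *r)
{
    assert(r);
    return r->cq_ready;
}
```

In this model a loop that only ever calls `toy_enter(r, false)` followed by `toy_peek(r)` never observes a completion, which matches the stuck multishot accept described above.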

A single rxq can have only one memory-provider ifq bound at a time. In libevpl's multi-ring server model the listener thread's ring registers the ifq first (and owns the ZCRX page-pool); when the worker thread's ring later runs its create-time zcrx_setup it hits -EEXIST on the same rxq, which we previously treated as fatal under zerocopy_rx=ON and aborted the server right after the first client connection.

Treat -EEXIST specifically as "ZCRX is already established for this queue, use non-ZC recv on THIS ring", even in ON mode. Packets still land in the ZCRX area (the queue's memory provider is unchanged) and the kernel's io_zcrx_copy_chunk path handles non-ZC consumers; data flows on the worker ring, just without zero-copy. That's the closest we can get without sharing rings across threads, which would require a larger libevpl rework.

Other errors (e.g. -EOPNOTSUPP, -EINVAL) still abort in ON mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
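The error policy in this commit amounts to a three-way classification of the ifq registration result. The sketch below is an assumption-laden illustration: `classify_ifq_result` and the enum are hypothetical names, and the AUTO branch (fall back on any error) is inferred from the mode semantics in the first commit rather than stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum zcrx_outcome {
    ZCRX_ACTIVE,   /* ifq registered on this ring: use zero-copy recv */
    ZCRX_FALLBACK, /* use regular (copying) recv on this ring */
    ZCRX_ABORT     /* fatal at ctx create */
};

/*
 * rc:      result of the ifq registration, 0 or a negative errno
 * mode_on: true for zerocopy_rx=ON, false for AUTO
 */
static enum zcrx_outcome
classify_ifq_result(int rc, bool mode_on)
{
    assert(rc <= 0);

    if (rc == 0) {
        return ZCRX_ACTIVE;
    }
    if (rc == -EEXIST) {
        /* Another ring already owns the ifq for this rxq. */
        return ZCRX_FALLBACK;
    }
    /* Any other error: fatal in ON mode, fall back in AUTO (assumed). */
    return mode_on ? ZCRX_ABORT : ZCRX_FALLBACK;
}
```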

… socket)

The kernel only allows one ifq per (rxq, ring) pair: a second io_uring_register_ifq for the same rxq returns -EEXIST. That means the previous architecture (single listener thread owning a ring with a ZCRX ifq, plus worker threads each owning their own rings) could only ever achieve zero-copy on the listener's ring. Connections handed off to workers fell back to copy-mode recv.

This commit teaches the listener path to fan out the listen across workers when the protocol opts in via a new ->listen_distributed hook on struct evpl_protocol. For io_uring_tcp with ZCRX configured:
- Each attached worker creates its own listen socket on its own ring, registers its own ifq on its own rxq (rxq_base + i), and accepts connections directly. No cross-thread handoff needed.
- The accepted bind stays on the ring that accepted it, so IORING_OP_RECV_ZC on it actually delivers zero-copy CQEs.
- For an initial implementation we hard-error if rxq_count exceeds the number of attached worker bindings at evpl_listen time.

Other protocols leave listen_distributed NULL and continue to use the centralized single-listen-socket + cross-thread accept dispatch path unchanged, so no other backend is affected.

Cross-thread plumbing for the fanned-out listen mirrors the existing connect-request mechanism: per-worker listen_distributed_requests list, eventfd kick, condvar completion. The worker drains the list from its ipc_callback. A new per-evpl field zcrx_rxq_override is read by evpl_io_uring_create to know which rxq this worker's ifq should bind to.

New public config: evpl_global_config_set_io_uring_zcrx_rxq_count(cfg, n)

All 8 io_uring tests still pass; full suite (65 tests) clean.
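The queue assignment rule (worker i gets rxq_base + i, with a hard error when rxq_count exceeds the worker count) can be written down as a few lines of C. This is a sketch under assumptions: `assign_rxqs` is a hypothetical helper, and it assumes that only the first rxq_count workers receive a queue, which the commit message does not spell out.

```c
#include <assert.h>

/*
 * Fill out[i] = rxq_base + i for i in [0, rxq_count).
 * Returns 0 on success, or -1 when rxq_count exceeds num_workers
 * (the hard-error case at listen time).
 * 'out' must have room for rxq_count entries.
 */
static int
assign_rxqs(unsigned rxq_base, unsigned rxq_count,
            unsigned num_workers, unsigned *out)
{
    unsigned i;

    assert(out);

    if (rxq_count > num_workers) {
        return -1;
    }

    for (i = 0; i < rxq_count; i++) {
        out[i] = rxq_base + i;
    }

    return 0;
}
```

Each resulting queue number would then play the role of the per-worker zcrx_rxq_override value, so that no two rings try to bind an ifq to the same rxq.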

…> max_num_iovec

evpl_io_uring_recv_deliver alloca'd a fixed-cap iovec array of max_num_iovec slots (default 128) and then called evpl_iovec_ring_copyv into it without any bound; copyv writes one iovec per ring fragment until 'length' bytes are accumulated. For a 1 MiB stream segment fragmented across 4 KiB ZCRX area pages that's 256 entries, which overruns the alloca region and trips __stack_chk_fail at function return.

Replace the stack scratch with a per-evpl_io_uring_context heap buffer that grows monotonically to the recv ring's current element count. One realloc on first big segment; no churn after. Freed in evpl_io_uring_destroy.

Bug visible with: flowbench -t throughput -s 1m -p io_uring_tcp on the ZCRX server side; -s 64k did not trigger because 16 < 128. Backtrace pointed at io_uring_tcp.c:250 with the unwound stack showing the 4 KiB-fragment iovec triples that copyv wrote past the buffer.
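The fragment arithmetic and the grow-only scratch buffer can be sketched as follows. The names `frag_count`, `struct frag`, `struct frag_scratch`, and `scratch_reserve` are hypothetical, and `struct frag` is only a stand-in for evpl_iovec; the point is the ceiling division and the fact that capacity never shrinks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for one scatter/gather entry. */
struct frag {
    void  *data;
    size_t length;
};

/* Grow-only scratch array owned by a context, freed at destroy time. */
struct frag_scratch {
    struct frag *frags;
    size_t       capacity;
};

/* Number of fragments needed to cover 'length' bytes (ceiling division). */
static size_t
frag_count(size_t length, size_t frag_size)
{
    assert(frag_size > 0);
    return (length + frag_size - 1) / frag_size;
}

/* Ensure room for 'needed' entries. Returns 0 on success, -1 on OOM. */
static int
scratch_reserve(struct frag_scratch *s, size_t needed)
{
    struct frag *p;

    assert(s);

    if (needed <= s->capacity) {
        return 0; /* common case after the first large segment */
    }

    p = realloc(s->frags, needed * sizeof(*p));

    if (!p) {
        return -1;
    }

    s->frags    = p;
    s->capacity = needed;
    return 0;
}
```

With 4 KiB fragments, `frag_count` gives 256 for a 1 MiB segment and 16 for a 64 KiB segment, which is why only the larger size exceeded the 128-slot stack array.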

ABI was unsigned int (32 bits on x86_64), so EVPL_ZCRX_AREA_SIZE >= 4 GiB was silently truncated modulo 2^32; a value of exactly 4 GiB wrapped to 0 and fell back to the 256 MiB default. The underlying area_bytes variable in zcrx_setup is already size_t; the config field and setter were the narrow link.
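The truncation is easy to demonstrate in isolation. The two helpers below are hypothetical stand-ins for the old and new setter behaviour; fixed-width types are used so the result does not depend on the platform's `unsigned int` or `size_t` width.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: the value is narrowed to 32 bits on the way in. */
static uint32_t
store_narrow(uint64_t bytes)
{
    return (uint32_t) bytes; /* conversion is modulo 2^32 */
}

/* New behaviour: the full 64-bit value is preserved. */
static uint64_t
store_wide(uint64_t bytes)
{
    return bytes;
}
```

Here `store_narrow(4 GiB)` yields 0, while `store_narrow(4 GiB + 4096)` yields 4096, so larger requests did not all collapse to the default; they became arbitrarily small.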

Two cuts at the per-poll-iteration syscall cost that perf top showed dominating an mlx5 ZCRX single-core run (~30% of CPU was in syscall framing alone, with srso_alias_* Spectre mitigations contributing ~7% and fget another 5.5%):

1. io_uring_register_ring_fd() on ctx setup. Subsequent io_uring_enter calls then take the IORING_ENTER_REGISTERED_RING path which skips the per-syscall fget on the ring fd. Direct elimination of the ~5.5% fget seen on profile.
2. Lazy io_uring_get_events: peek the CQ ring first (a pure read of shared memory), and only enter the kernel if it's empty. Under load, previous task work has already deposited CQEs in the ring; we can drain them with zero syscalls. Collapses the get_events syscall rate from "one per poll-loop turn" to "one per CQE-batch drained."

Measured impact pending hardware re-run, but math: at 1 M CQE/s and ~330 ns per syscall (Spectre tax included), going from 1 syscall per ~5 CQEs to 1 per 64 CQEs recovers ~25% CPU. Combined with fget elimination this should add up to ~30%+ on AMD ZCRX-heavy workloads.
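The back-of-envelope estimate can be made explicit with a one-line model: CPU fraction = (CQEs per second / CQEs per syscall) x seconds per syscall. The function name is hypothetical and the model ignores everything except fixed per-syscall cost.

```c
#include <assert.h>

/*
 * Fraction of one core spent on syscall overhead, given the completion
 * rate, the average number of CQEs drained per syscall, and the fixed
 * cost of one syscall in seconds.
 */
static double
syscall_cpu_fraction(double cqe_per_sec, double cqes_per_syscall,
                     double sec_per_syscall)
{
    assert(cqes_per_syscall > 0.0);
    return (cqe_per_sec / cqes_per_syscall) * sec_per_syscall;
}
```

Plugging in the quoted figures (1 M CQE/s, ~330 ns per syscall), this model gives roughly 6.6% of a core at one syscall per 5 CQEs and roughly 0.5% at one per 64. The saving scales linearly with the starting syscall rate: a loop that enters the kernel about once per CQE would spend roughly 33%, which is in line with the ~30% syscall framing cost seen in the profile, so the recovered share depends heavily on how little batching the loop was getting before the change.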

…r frag

Per-fragment evpl_io_uring_zcrx_frag_release did atomic_store_release on rq_ktail. With 1 MiB recv segments fragmented into 256x 4 KiB ZCRX pages, that's 256 release-fences per delivered message, ~1 M/sec under load. On AMD with mitigations each fence is ~50-100 ns of CPU. The kernel only needs to see the updated rq_ktail eventually, not on every individual rqe write.

Batch:
- z->tail_cached holds the next rqe write position
- frag_release writes the rqe with plain stores, bumps tail_cached
- evpl_io_uring_complete (poll-loop tail) does one atomic_store_release on the kernel-visible rq_ktail to publish everything accumulated this iteration

Cuts ~256x fence rate on bulk-recv. The release-store at poll-loop tail still provides the memory barrier the kernel needs to observe all rqe writes.
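The batching pattern can be sketched with C11 atomics. This is a simplified illustration, not the PR's code: `struct rq`, `rq_frag_release`, and `rq_publish` are hypothetical, the slot type is a bare offset rather than a real rqe, and there is no check for a full queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RQ_ENTRIES 512u /* power of two so the index can be masked */

/* Hypothetical refill-queue state; names loosely follow the commit text. */
struct rq {
    uint64_t         entries[RQ_ENTRIES]; /* stand-in for rqe slots */
    _Atomic uint32_t ktail;               /* tail the consumer reads */
    uint32_t         tail_cached;         /* producer-private next slot */
};

/* Return one fragment: plain store into the slot, no atomic operation. */
static void
rq_frag_release(struct rq *q, uint64_t area_offset)
{
    assert(q);
    q->entries[q->tail_cached & (RQ_ENTRIES - 1)] = area_offset;
    q->tail_cached++;
}

/*
 * Publish everything queued so far with a single release store. The
 * release ordering makes all earlier slot writes visible to a consumer
 * that loads ktail with acquire semantics.
 */
static void
rq_publish(struct rq *q)
{
    assert(q);
    atomic_store_explicit(&q->ktail, q->tail_cached, memory_order_release);
}
```

Calling `rq_frag_release` 256 times followed by one `rq_publish` replaces 256 release stores with one, which is the reduction the commit describes for a 1 MiB segment.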

benjarvis and others added 14 commits May 15, 2026 13:35

benjarvis wants to merge 14 commits into chimera-nas:main from benjarvis:io_uring-zcrx-and-perf-audit
