
Add support for io_uring zero-copy receive (ZCRX) on suitable platform#41

Open
benjarvis wants to merge 14 commits into
chimera-nas:main from
benjarvis:io_uring-zcrx-and-perf-audit

@benjarvis
Member

WIP.

benjarvis and others added 14 commits May 15, 2026 13:35
Audit and expand the io_uring backend to take advantage of modern
io_uring features on capable kernels, while remaining buildable and
runnable on older kernels and liburing versions.

Public config additions (include/evpl/evpl_config.h, src/core/config.c):
- io_uring_zerocopy_rx (off/on/auto, default auto)
- io_uring_zcrx_interface, io_uring_zcrx_rxq,
  io_uring_zcrx_area_size, io_uring_zcrx_rq_entries
- io_uring_registered_buffers, io_uring_registered_files,
  io_uring_send_zc, io_uring_recv_bundle (all tri-state, default auto)
- EVPL_IO_URING_{OFF,ON,AUTO} mode constants

Mode semantics: ON with capability missing -> fatal at ctx create;
AUTO with capability missing -> log the decision and fall back without failing.
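
A minimal sketch of the tri-state policy described above, assuming a capability is usable only when it is both compiled in and reported by the running kernel. The `sketch_*` names are hypothetical and are not the identifiers used in this PR.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the EVPL_IO_URING_{OFF,ON,AUTO} constants. */
enum sketch_mode { SKETCH_OFF = 0, SKETCH_ON = 1, SKETCH_AUTO = 2 };

enum sketch_result { SKETCH_DISABLED, SKETCH_ENABLED, SKETCH_FATAL };

/* Resolve one feature: OFF never enables, ON fails hard when the
 * capability is missing, AUTO quietly falls back to disabled. */
enum sketch_result
sketch_resolve(enum sketch_mode mode, bool compiled_in, bool kernel_supports)
{
    bool capable = compiled_in && kernel_supports;

    if (mode == SKETCH_OFF) {
        return SKETCH_DISABLED;
    }
    if (capable) {
        return SKETCH_ENABLED;
    }
    return mode == SKETCH_ON ? SKETCH_FATAL : SKETCH_DISABLED;
}
```

The caller would presumably log the outcome in every case and abort context creation only on `SKETCH_FATAL`.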

Compile-time gates (CMakeLists.txt): check_symbol_exists /
check_c_source_compiles probes for each new symbol/opcode/enum
(io_uring_register_ifq, io_uring_prep_send_zc,
io_uring_register_buffers_sparse, io_uring_register_files_sparse,
io_uring_register_buffers_update_tag, io_uring_register_files_update,
io_uring_prep_close_direct, IORING_OP_RECV_ZC, IORING_RECVSEND_BUNDLE,
IORING_RECVSEND_FIXED_BUF, IOSQE_FIXED_FILE, IORING_ZCRX_AREA_SHIFT,
struct io_uring_zcrx_ifq_reg). HAVE_IO_URING_ZCRX is set only when the
full ZCRX symbol set is available.

Runtime probe (io_uring.c): io_uring_get_probe_ring +
io_uring_opcode_supported populates a per-ctx capability bitmap; AND'd
with HAVE_* macros; resolved against config policy to compute
effective_{fixed_file,fixed_buf,send_zc,recv_bundle,zcrx}. Decisions
logged via evpl_io_uring_info.

Ring setup is conditional: when ZCRX is wanted we drop SQPOLL
(incompatible) and use DEFER_TASKRUN + SINGLE_ISSUER instead.

Registered files (io_uring_tcp.c): register_files_sparse(4096) at ctx
create; per-socket direct_fd_idx; IOSQE_FIXED_FILE on send/recv SQEs;
slot released on close.

Registered buffers (io_uring.c, io_uring_internal.h, allocator.c):
evpl_framework_io_uring now implements register_memory/unregister_memory.
Each slab gets a globally monotonic buf_index (stored as idx+1 in
slab->framework_private[EVPL_FRAMEWORK_IO_URING]). Each ctx calls
register_buffers_sparse(1024) and lazily fills its per-ring table via
register_buffers_update_tag in a sync helper called from the pump and
completion paths. Helper evpl_io_uring_iov_to_fixed() resolves an iov
to a (buf_index, offset) pair. On ENOMEM (typically RLIMIT_MEMLOCK
exhaustion) the ctx silently downgrades; init bumps RLIMIT_MEMLOCK to
RLIM_INFINITY best-effort.
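
The idx+1 encoding and the (buf_index, offset) lookup can be illustrated roughly as follows. This is a sketch only: `struct sketch_slab` and both functions are hypothetical, and the real slab and iovec types in libevpl may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab; only the "store idx+1 so 0 means unset" idea is from the text. */
struct sketch_slab {
    char     *base;
    size_t    size;
    uintptr_t framework_private; /* 0 = never registered, else buf_index + 1 */
};

void sketch_slab_set_index(struct sketch_slab *slab, unsigned int buf_index)
{
    slab->framework_private = (uintptr_t) buf_index + 1;
}

/* Resolve a region inside a slab to (buf_index, offset).
 * Returns false when there is no slab, no index, or the region is out of range. */
bool sketch_iov_to_fixed(const struct sketch_slab *slab, const char *data,
                         size_t length, unsigned int *buf_index, size_t *offset)
{
    size_t off;

    if (slab == NULL || slab->framework_private == 0) {
        return false;
    }
    if (data < slab->base || length > slab->size) {
        return false;
    }
    off = (size_t) (data - slab->base);
    if (off > slab->size - length) {
        return false;
    }
    *buf_index = (unsigned int) (slab->framework_private - 1);
    *offset    = off;
    return true;
}
```

Storing idx+1 lets a zero-initialized private slot mean "not registered" without a separate flag.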

evpl_memory_framework_private() returns NULL for refs with no slab,
which is what a ZCRX synthetic iovec uses.

RECVSEND_BUNDLE (io_uring_tcp.c): IORING_RECVSEND_BUNDLE set on
multishot recv when effective; recv callback walks consecutive buffer
ids in the ring until the total byte count is consumed.
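
As an illustration of the bundle walk, the following sketch counts the buffers a bundled completion covers. It assumes equally sized provided buffers, a power-of-two ring, and that every buffer except the last is filled completely; the function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Walk consecutive buffer ids starting at first_bid, wrapping with ring_mask,
 * until total_bytes are accounted for. Visited ids are written to bids_out
 * (if non-NULL), at most max_out of them. Returns the number of buffers visited. */
size_t sketch_bundle_walk(unsigned int first_bid, size_t total_bytes,
                          size_t buf_size, unsigned int ring_mask,
                          unsigned int *bids_out, size_t max_out)
{
    size_t       count = 0;
    unsigned int bid   = first_bid;

    while (total_bytes > 0 && count < max_out) {
        size_t chunk = total_bytes < buf_size ? total_bytes : buf_size;

        if (bids_out != NULL) {
            bids_out[count] = bid;
        }
        count++;
        total_bytes -= chunk;
        bid = (bid + 1) & ring_mask;
    }
    return count;
}
```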

SEND_ZC (io_uring_tcp.c): when fixed_buf is in effect, the send path
issues io_uring_prep_send_zc + IORING_RECVSEND_FIXED_BUF + buf_index.
send_callback handles the two-CQE pattern: first CQE (F_MORE) emits
EVPL_NOTIFY_SENT; second CQE (F_NOTIF) releases the iovec ref. The iov
is stored on the request itself so the F_NOTIF CQE can find it.
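
The two-CQE handling could be sketched like this. The flag values mirror IORING_CQE_F_MORE and IORING_CQE_F_NOTIF but are redefined locally so the example builds without liburing; the request struct and its counters are hypothetical and stand in for the real notify and release calls.

```c
#include <stddef.h>

/* Local stand-ins for IORING_CQE_F_MORE and IORING_CQE_F_NOTIF. */
#define SKETCH_CQE_F_MORE  (1U << 1)
#define SKETCH_CQE_F_NOTIF (1U << 3)

struct sketch_send_req {
    void *send_iov;    /* buffer reference held until the kernel is done with it */
    int   sent_events; /* stands in for emitting EVPL_NOTIFY_SENT */
    int   releases;    /* stands in for releasing the iovec reference */
};

void sketch_send_callback(struct sketch_send_req *req, int res,
                          unsigned int cqe_flags)
{
    if (cqe_flags & SKETCH_CQE_F_NOTIF) {
        /* Second CQE: the kernel no longer references the pages. */
        if (req->send_iov != NULL) {
            req->releases++;
            req->send_iov = NULL;
        }
        return;
    }
    if (res >= 0) {
        req->sent_events++;
    }
    if (!(cqe_flags & SKETCH_CQE_F_MORE) && req->send_iov != NULL) {
        /* No notification CQE will follow, so release immediately. */
        req->releases++;
        req->send_iov = NULL;
    }
}
```

Clearing `send_iov` after release is what makes a later drain or duplicate notification harmless in this sketch.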

ZCRX (io_uring.c, io_uring_tcp.c): evpl_io_uring_zcrx_setup mmaps the
data area + rq region, fills io_uring_zcrx_{area_reg,ifq_reg}, and
registers the ifq. Per-connection getsockopt(SO_INCOMING_NAPI_ID)
check; the recv path uses IORING_OP_RECV_ZC with zcrx_ifq_idx. CQE32
second half carries the area offset; a synthetic evpl_iovec with a
custom release callback posts the rqe back via the rq tail.

NAPI busy-poll registration was intentionally deferred: libevpl's run
loop drains CQEs in userland via io_uring_peek_batch_cqe and otherwise
blocks in epoll_wait on the eventfd. It never calls io_uring_enter
with a wait count, so kernel NAPI busy-poll is inert under this
polling model.

Shutdown safety: evpl_io_uring_close detaches in-flight recv/accept
requests (req->tcp.socket = NULL); callbacks bail on NULL. Request
free is idempotent via an on_freelist flag. ctx destroy drains
pending CQEs (without invoking callbacks) before tearing the ring
down so cancel/close ops complete cleanly.
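
A rough sketch of the detach-and-idempotent-free idea, using hypothetical types; the actual request layout in io_uring_internal.h is not shown in this PR text.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_request {
    struct sketch_request *next;
    void                  *socket;      /* NULL once the owning socket is closed */
    bool                   on_freelist; /* guards against a second free */
};

struct sketch_ctx {
    struct sketch_request *free_list;
    int                    free_count;
};

/* Freeing twice is a no-op because of the on_freelist flag. */
void sketch_request_free(struct sketch_ctx *ctx, struct sketch_request *req)
{
    if (req->on_freelist) {
        return;
    }
    req->on_freelist = true;
    req->next        = ctx->free_list;
    ctx->free_list   = req;
    ctx->free_count++;
}

/* Returns true if the completion should be processed, false if the
 * request was detached by close and has simply been recycled. */
bool sketch_recv_callback(struct sketch_ctx *ctx, struct sketch_request *req)
{
    if (req->socket == NULL) {
        sketch_request_free(ctx, req);
        return false;
    }
    return true;
}
```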

Tests: re-enabled hello_world_stream_tcp, hello_world_connected_msg_tcp,
ping_pong_stream_tcp, ping_pong_msg_tcp, bulk_msg_tcp, bulk_stream_tcp,
rand_full_duplex_stream_tcp; all 8 io_uring tests pass on both Debug
(ASan) and Release builds. Full 65-test suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut bulk-synced every globally-registered slab into a new
ring's FIXED_BUF table at ctx-create time, then re-synced from the
completion path. With the default 1 GiB slab size and ~16 slabs already
populated by fill_recv_ring, the worker thread blocked for ~1.6 s
inside io_uring_register_buffers_update_tag pinning all of them — long
enough that the first accepted connection never reached
evpl_io_uring_attach in time for a pingpong test to make any progress.

Switch to lazy, per-slab on-demand registration:

- Replace ctx->buf_high_water (contiguous water-mark) with a sparse
  buf_registered[] bitmap. evpl_io_uring_ensure_buf_registered(ctx, idx)
  registers a single slab into this ring's FIXED_BUF table on first use
  and is O(1) thereafter.
- Drop the create-time bulk sync and the completion-loop sync sweep.
  pump() now calls ensure_buf_registered just-in-time for the slab the
  current iov lives in; iov_to_fixed continues to gate eligibility, and
  the pump falls back to the legacy provided-buffer send path for any
  iov whose slab cannot be (or is not yet) registered.
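
The on-demand registration can be pictured with a bitmap like the one below. This is a sketch under assumptions: the table size of 1024 matches the sparse table mentioned earlier, the kernel call is replaced by a counter, and all `sketch_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BUFS 1024 /* size of the sparse FIXED_BUF table */

struct sketch_ring_ctx {
    uint64_t buf_registered[SKETCH_MAX_BUFS / 64];
    int      register_calls; /* counts simulated kernel registrations */
};

/* Stand-in for the real single-slot register call; always succeeds here. */
static int sketch_register_one(struct sketch_ring_ctx *ctx, unsigned int idx)
{
    (void) idx;
    ctx->register_calls++;
    return 0;
}

/* Register slab idx into this ring's table on first use; later calls
 * only test one bit. Returns false if the slab cannot be registered. */
bool sketch_ensure_buf_registered(struct sketch_ring_ctx *ctx, unsigned int idx)
{
    uint64_t bit;

    if (idx >= SKETCH_MAX_BUFS) {
        return false;
    }
    bit = UINT64_C(1) << (idx % 64);
    if (ctx->buf_registered[idx / 64] & bit) {
        return true;
    }
    if (sketch_register_one(ctx, idx) != 0) {
        return false;
    }
    ctx->buf_registered[idx / 64] |= bit;
    return true;
}
```

A `false` return corresponds to the case where the pump would fall back to the provided-buffer send path.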

Additional fixes uncovered during validation:

- IORING_RECVSEND_FIXED_BUF is only honored for SEND_ZC on this kernel;
  plain IORING_OP_SEND with the flag returns -EINVAL. Only set FIXED_BUF
  when send_zc is also engaging.
- io_uring_buf_ring_advance was gated on !effective.fixed_buf, but the
  legacy send_ring is also used as the FIXED_BUF fallback path, so the
  advance must follow whenever we actually added entries (offset > 0).
- evpl_io_uring_request_alloc now resets every tcp.* field on reuse, so
  a recycled request never carries stale use_fixed_buf / send_iov state
  from a previous send (caused a destroy-time double release of an
  already-released iov in rand_full_duplex).
- evpl_io_uring_close now releases any iovs still parked in the
  per-socket provided-buffer send ring; the destroy-time drain also
  releases FIXED_BUF / SEND_ZC iovs whose owning bind was already torn
  down. send_callback NULLs req->tcp.send_iov.data after release so the
  drain knows not to release twice.

Full 65-test libevpl Debug (ASan) suite passes. flowbench pingpong
across io_uring_tcp now sustains ~34 Kops/s with all features on
(send_zc default) and ~57 Kops/s with send_zc=off — the cross-over
favors plain SEND at very small payloads since SEND_ZC pays a two-CQE
overhead; send_zc wins at larger sizes where the page-pinning amortizes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kernel UAPI added a rx_buf_len field where __resv2 used to live in
struct io_uring_zcrx_ifq_reg (ZCRX_FEATURE_RX_PAGE_SIZE). Kernels with
this change reject rx_buf_len == 0 with -EINVAL, which is exactly what
io_uring_register_ifq was returning on mlx5/ConnectX-7 + fw 28.45.1200.

Add an io_uring_zcrx_rx_buf_len config knob (default: system page size)
and write it through the same 4-byte slot (named __resv2 in the
installed liburing 2.14 header, rx_buf_len in newer kernel UAPI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some kernel/driver combinations only honor a user-supplied area when
the new enum zcrx_reg_flags ZCRX_REG_IMPORT (= 1) bit is set in
io_uring_zcrx_area_reg.flags. Add an io_uring_zcrx_area_import config
knob (default off) so this can be flipped at runtime without rebuilding,
gated against drivers/kernels that don't accept the bit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier I defaulted rx_buf_len to PAGE_SIZE because kernels with the
new UAPI seemed to reject 0 with -EINVAL. Tracing showed that EINVAL
came from a different check ('tcp-data-split is disabled'), not from
rx_buf_len. With tcp-data-split enabled, the actual semantics of
rx_buf_len are:

  rx_buf_len == 0  -> driver picks (PAGE_SIZE). Required on drivers that
                      do not yet advertise QCFG_RX_PAGE_SIZE in their
                      netdev_queue_mgmt_ops (mlx5 in v7.0). Setting a
                      non-zero value there triggers __net_mp_open_rxq's
                      -EOPNOTSUPP "device does not support: rx_page_size".
  rx_buf_len != 0  -> requests a specific page size; only valid when the
                      driver advertises QCFG_RX_PAGE_SIZE (mlx5 v7.1+).

Pass the user-configured value through verbatim (default 0). The user
can still override via EVPL_ZCRX_RX_BUF_LEN on kernels that support it.
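
The rx_buf_len rules above can be modelled in a few lines. This only restates the described outcomes and is not kernel code; the function name and the boolean standing in for "driver advertises QCFG_RX_PAGE_SIZE" are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns 0 when the registration would be accepted with respect to
 * rx_buf_len, or -EOPNOTSUPP when a specific size is requested from a
 * driver that does not advertise rx page size support. */
int sketch_check_rx_buf_len(uint32_t rx_buf_len, bool driver_has_rx_page_size)
{
    if (rx_buf_len == 0) {
        return 0; /* driver picks its own page size */
    }
    if (!driver_has_rx_page_size) {
        return -EOPNOTSUPP;
    }
    return 0;
}
```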

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SO_INCOMING_NAPI_ID gate was wrong: when the ifq is registered on
this ring, the queue's page-pool memory provider is owned by ZCRX and
data fragments land in the ZCRX area. A regular IORING_OP_RECV (even
multishot with provided-buffer-ring) cannot consume those fragments —
the bytes sit in the socket's kernel recv queue forever and the test
hangs.

SO_INCOMING_NAPI_ID also turns out to be unreliable right after
accept() (per-socket NAPI binding may not be latched yet, getsockopt
returns 0), which is what was making us silently fall back to plain
RECV. Once ZCRX is set up on a ring, force IORING_OP_RECV_ZC on every
accepted socket and rely on the caller to steer only matching traffic
to that ring (typically via ethtool ntuple onto a dedicated RX queue).

Observed on mlx5 ConnectX-7 + kernel v7.0: with the SO_INCOMING_NAPI_ID
gate the server kernel ACKed the 256B ping at the TCP level and left
it in the socket's recv queue (ss shows CLOSE-WAIT recv-q=257) while
our io_uring multishot recv never fired a CQE. With this fix the
RECV_ZC SQE is posted as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups in evpl_io_uring_setup_socket that mattered for the
listener thread:

* evpl_io_uring_fill_recv_ring allocates one 2 MiB iovec per ring slot
  (8 K slots = 16 GiB by default) from the slab allocator. The listen
  socket never consumes recv buffers - only accepted sockets do - so
  doing this on listen-socket setup just thrashes the slab allocator
  and can block the listener thread for several seconds before it can
  even submit its multishot-accept SQE.

* The per-socket send-buf-ring is the fallback for non-FIXED_BUF
  sends; a listen socket never sends, so io_uring_setup_buf_ring
  is wasted work there too.

Tests still pass; this is a no-op for accepted/connected sockets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ZCRX-enabled path drops SQPOLL and uses IORING_SETUP_DEFER_TASKRUN
instead. With DEFER_TASKRUN, task work (including delivering CQEs to
the user-side completion ring) is deferred until the ring is entered
with IORING_ENTER_GETEVENTS. Plain io_uring_submit() doesn't set that
flag, so SQEs got submitted but their CQEs never showed up via
io_uring_peek_batch_cqe — the multishot accept ran in the kernel but
its completion stayed deferred forever and our accept_callback never
fired. The TCP layer happily completed the 3-way handshake and queued
the client's 256B in the kernel recv buffer; userspace never read it.

Fix:
  - Switch the deferred flush from io_uring_submit() to
    io_uring_submit_and_get_events() so the very same syscall that
    submits SQEs also drains pending task work into the CQ ring.
  - Call io_uring_get_events() at the top of every completion drain
    so even idle iterations (when nothing was just submitted) advance
    deferred task work.

Both calls are cheap when there is no work to do, and have no extra
cost relative to plain submit on SQPOLL rings.

With this change ZCRX-enabled servers actually accept connections
on the registered ifq queue; without it, accept SQEs sit forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A single rxq can have only one memory-provider ifq bound at a time. In
libevpl's multi-ring server model the listener thread's ring registers
the ifq first (and owns the ZCRX page-pool); when the worker thread's
ring later runs its create-time zcrx_setup it hits -EEXIST on the same
rxq, which we previously treated as fatal under zerocopy_rx=ON and
aborted the server right after the first client connection.

Treat -EEXIST specifically as "ZCRX is already established for this
queue, use non-ZC recv on THIS ring", even in ON mode. Packets still
land in the ZCRX area (the queue's memory provider is unchanged) and
the kernel's io_zcrx_copy_chunk path handles non-ZC consumers — data
flows on the worker ring, just without zero-copy. That's the closest
we can get without sharing rings across threads, which would require a
larger libevpl rework.

Other errors (e.g. -EOPNOTSUPP, -EINVAL) still abort in ON mode.
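
The resulting error policy might be summarized as below. The enum and function names are hypothetical; `rc` stands for the negative errno returned by the ifq registration.

```c
#include <errno.h>

enum sketch_zcrx_mode   { SKETCH_ZCRX_OFF, SKETCH_ZCRX_ON, SKETCH_ZCRX_AUTO };
enum sketch_zcrx_action { SKETCH_USE_ZC, SKETCH_USE_COPY_RECV, SKETCH_ABORT };

/* Decide what a ring does after attempting ZCRX setup. */
enum sketch_zcrx_action
sketch_zcrx_setup_result(enum sketch_zcrx_mode mode, int rc)
{
    if (rc == 0) {
        return SKETCH_USE_ZC;
    }
    if (rc == -EEXIST) {
        /* Another ring already owns the queue: use non-ZC recv here, even in ON mode. */
        return SKETCH_USE_COPY_RECV;
    }
    return mode == SKETCH_ZCRX_ON ? SKETCH_ABORT : SKETCH_USE_COPY_RECV;
}
```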

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… socket)

The kernel only allows one ifq per rxq: a second io_uring_register_ifq
for the same rxq, even from a different ring, returns -EEXIST. That means
the previous architecture — single listener thread owning a ring with
a ZCRX ifq, plus worker threads each owning their own rings — could
only ever achieve zero-copy on the listener's ring. Connections
handed off to workers fell back to copy-mode recv.

This commit teaches the listener path to fan out the listen across
workers when the protocol opts in via a new ->listen_distributed hook
on struct evpl_protocol. For io_uring_tcp with ZCRX configured:

  - Each attached worker creates its own listen socket on its own
    ring, registers its own ifq on its own rxq (rxq_base + i), and
    accepts connections directly. No cross-thread handoff needed.
  - The accepted bind stays on the ring that accepted it, so
    IORING_OP_RECV_ZC on it actually delivers zero-copy CQEs.
  - For an initial implementation we hard-error if rxq_count exceeds
    the number of attached worker bindings at evpl_listen time.
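
The per-worker queue assignment can be sketched as follows, assuming worker i simply takes rxq_base + i. The function is hypothetical and only computes the mapping; the real code passes the value through the zcrx_rxq_override field described below.

```c
#include <stddef.h>

/* Assign rxq_base + i to worker i for the first rxq_count workers.
 * Returns 0 on success, -1 when there are more queues than workers
 * (the hard-error case for the initial implementation). */
int sketch_assign_rxqs(unsigned int rxq_base, size_t rxq_count,
                       size_t num_workers, unsigned int *rxq_out)
{
    size_t i;

    if (rxq_count > num_workers) {
        return -1;
    }
    for (i = 0; i < rxq_count; i++) {
        rxq_out[i] = rxq_base + (unsigned int) i;
    }
    return 0;
}
```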

Other protocols leave listen_distributed NULL and continue to use the
centralized single-listen-socket + cross-thread accept dispatch path
unchanged, so no other backend is affected.

Cross-thread plumbing for the fanned-out listen mirrors the existing
connect-request mechanism: per-worker listen_distributed_requests
list, eventfd kick, condvar completion. The worker drains the list
from its ipc_callback. A new per-evpl field zcrx_rxq_override is read
by evpl_io_uring_create to know which rxq this worker's ifq should
bind to.

New public config:
  evpl_global_config_set_io_uring_zcrx_rxq_count(cfg, n)

All 8 io_uring tests still pass; full suite (65 tests) clean.
…> max_num_iovec

evpl_io_uring_recv_deliver alloca'd a fixed-cap iovec array of
max_num_iovec slots (default 128) and then called
evpl_iovec_ring_copyv into it without any bound — copyv writes one
iovec per ring fragment until 'length' bytes are accumulated. For
a 1 MiB stream segment fragmented across 4 KiB ZCRX area pages
that's 256 entries, which overruns the alloca region and trips
__stack_chk_fail at function return.

Replace the stack scratch with a per-evpl_io_uring_context heap
buffer that grows monotonically to the recv ring's current element
count. One realloc on first big segment; no churn after. Freed in
evpl_io_uring_destroy.
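
A sketch of a monotonically growing scratch array, with hypothetical names. Unlike the description above, it grows to the requested count rather than to the recv ring's element count, which keeps the example self-contained.

```c
#include <stddef.h>
#include <stdlib.h>

struct sketch_iov { void *data; size_t length; };

struct sketch_scratch {
    struct sketch_iov *iov;
    size_t             capacity; /* in entries; never shrinks */
    int                reallocs;
};

/* Number of fragments needed for length bytes in frag_size pieces. */
size_t sketch_fragments(size_t length, size_t frag_size)
{
    return (length + frag_size - 1) / frag_size;
}

/* Make sure at least needed entries are available; returns 0 or -1. */
int sketch_scratch_reserve(struct sketch_scratch *s, size_t needed)
{
    struct sketch_iov *p;

    if (needed <= s->capacity) {
        return 0;
    }
    p = realloc(s->iov, needed * sizeof(*p));
    if (p == NULL) {
        return -1;
    }
    s->iov      = p;
    s->capacity = needed;
    s->reallocs++;
    return 0;
}
```

With 4 KiB fragments, a 1 MiB segment needs 256 entries and a 64 KiB segment needs 16, which matches the numbers in this commit message.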

Bug visible with: flowbench -t throughput -s 1m -p io_uring_tcp on
the ZCRX server side; -s 64k did not trigger because 16 < 128.
Backtrace pointed at io_uring_tcp.c:250 with the unwound stack
showing the 4 KiB-fragment iovec triples that copyv wrote past the
buffer.

The area-size ABI was unsigned int (32 bits on x86_64), so an
EVPL_ZCRX_AREA_SIZE of 4 GiB silently wrapped to 0 and fell back to the
256 MiB default, and larger values were truncated modulo 4 GiB. The
underlying area_bytes variable in zcrx_setup is already size_t; the
config field and setter were the narrow link.
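
The truncation is easy to reproduce in isolation. The sketch below uses uint32_t to stand in for the old unsigned int field; both function names are hypothetical.

```c
#include <stdint.h>

/* What a 32-bit config field ends up holding for a 64-bit byte count. */
uint32_t sketch_narrow_setter(uint64_t bytes)
{
    return (uint32_t) bytes; /* conversion is modulo 2^32 */
}

/* A stored value of 0 means "use the default". */
uint64_t sketch_effective_area(uint32_t stored, uint64_t default_bytes)
{
    return stored != 0 ? (uint64_t) stored : default_bytes;
}
```

A request for exactly 4 GiB stores 0 and selects the default, while 5 GiB stores 1 GiB, so the configured size is wrong in both cases.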

Two cuts at the per-poll-iteration syscall cost that perf top showed
dominating an mlx5 ZCRX single-core run (~30% of CPU was in syscall
framing alone, with srso_alias_* Spectre mitigations contributing
~7% and fget another 5.5%):

1. io_uring_register_ring_fd() on ctx setup. Subsequent io_uring_enter
   calls then take the IORING_ENTER_REGISTERED_RING path which skips
   the per-syscall fget on the ring fd. Direct elimination of the
   ~5.5% fget seen on profile.

2. Lazy io_uring_get_events: peek the CQ ring first (a pure read of
   shared memory), and only enter the kernel if it's empty. Under
   load, previous task work has already deposited CQEs in the ring;
   we can drain them with zero syscalls. Collapses the get_events
   syscall rate from "one per poll-loop turn" to "one per CQE-batch
   drained."
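
The peek-first idea in item 2 can be modelled without a real ring. In this sketch a counter plays the role of the io_uring_get_events syscall, and the struct is a hypothetical stand-in for the shared CQ ring state.

```c
#include <stddef.h>

struct sketch_ring {
    size_t cq_ready; /* CQEs already visible in the shared CQ ring */
    size_t deferred; /* task work not yet flushed into the CQ ring */
    size_t syscalls; /* number of simulated kernel entries */
};

/* Stand-in for io_uring_get_events(): one syscall that flushes deferred work. */
static void sketch_get_events(struct sketch_ring *r)
{
    r->syscalls++;
    r->cq_ready += r->deferred;
    r->deferred = 0;
}

/* Drain up to max_batch CQEs, entering the kernel only if the CQ ring is empty. */
size_t sketch_drain(struct sketch_ring *r, size_t max_batch)
{
    size_t n;

    if (r->cq_ready == 0) {
        sketch_get_events(r);
    }
    n = r->cq_ready < max_batch ? r->cq_ready : max_batch;
    r->cq_ready -= n;
    return n;
}
```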

Measured impact pending hardware re-run, but math: at 1 M CQE/s and
~330 ns per syscall (Spectre tax included), going from 1 syscall per
~5 CQEs to 1 per 64 CQEs recovers ~25% CPU. Combined with fget
elimination this should add up to ~30%+ on AMD ZCRX-heavy workloads.
…r frag

Per-fragment evpl_io_uring_zcrx_frag_release did atomic_store_release
on rq_ktail. With 1 MiB recv segments fragmented into 256x 4 KiB
ZCRX pages, that's 256 release-fences per delivered message,
~1 M/sec under load. On AMD with mitigations each fence is ~50-100 ns
of CPU.

The kernel only needs to see the updated rq_ktail eventually, not
on every individual rqe write. Batch:
  - z->tail_cached holds the next rqe write position
  - frag_release writes the rqe with plain stores, bumps tail_cached
  - evpl_io_uring_complete (poll-loop tail) does one
    atomic_store_release on the kernel-visible rq_ktail to publish
    everything accumulated this iteration

Cuts the fence rate by ~256x on bulk recv. The release-store at poll-loop
tail still provides the memory barrier the kernel needs to observe
all rqe writes.
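
A sketch of the batching with C11 atomics. The struct layout, the ring size, and the function names are hypothetical, and the sketch omits any ring-full check; in the real code rq_ktail lives in memory shared with the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_RQ_ENTRIES 8u /* power of two, for illustration only */

struct sketch_rqe { uint64_t off; uint32_t len; };

struct sketch_zcrx {
    struct sketch_rqe rqes[SKETCH_RQ_ENTRIES];
    _Atomic uint32_t  rq_ktail;    /* tail the consumer side reads */
    uint32_t          tail_cached; /* next write position, private to the producer */
    uint32_t          rq_mask;
    unsigned int      publishes;   /* counts release-stores, for illustration */
};

void sketch_zcrx_init(struct sketch_zcrx *z)
{
    atomic_init(&z->rq_ktail, 0);
    z->tail_cached = 0;
    z->rq_mask     = SKETCH_RQ_ENTRIES - 1;
    z->publishes   = 0;
}

/* Per fragment: plain stores only, no fence. */
void sketch_frag_release(struct sketch_zcrx *z, uint64_t off, uint32_t len)
{
    struct sketch_rqe *rqe = &z->rqes[z->tail_cached & z->rq_mask];

    rqe->off = off;
    rqe->len = len;
    z->tail_cached++;
}

/* Once per poll-loop iteration: a single release-store publishes every
 * rqe written since the previous publish. */
void sketch_publish_tail(struct sketch_zcrx *z)
{
    if (atomic_load_explicit(&z->rq_ktail, memory_order_relaxed) == z->tail_cached) {
        return;
    }
    atomic_store_explicit(&z->rq_ktail, z->tail_cached, memory_order_release);
    z->publishes++;
}
```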