xlio: skip poll group poll when idle, toggle force_poll_mode dynamically by benjarvis · Pull Request #45 · chimera-nas/libevpl

benjarvis · 2026-05-18T10:38:19Z

Summary

The XLIO Ultra API only supports busy polling — there is no eventfd or CQ-event notification we can sleep on, even in current upstream (libxlio 3.61.x). Until now, the first XLIO socket created on a thread pinned the libevpl event loop into force_poll_mode for the lifetime of the framework instance, so any thread that owned an XLIO listener spun at 100% CPU even when no TCP traffic was flowing (e.g. a deployment receiving only RDMA connections).

This change tracks the count of non-listen XLIO sockets per thread and:

- Short-circuits evpl_xlio_poll when the count is zero; the XLIO library is not called at all in that case.
- Drops force_poll_mode on the 1→0 transition so libevpl can drain its spin_ns idle window and park in evpl_core_wait.
- Re-arms force_poll_mode on the 0→1 transition so per-iteration polling resumes for active connections.

A 10ms idle_timer runs alongside the poll callback whenever any XLIO socket exists. It drives a single xlio_poll_group_poll/xlio_poll_group_flush, which is what services listen-socket accepts and any deferred XLIO work while the main poll callback is short-circuited.
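The counting, the short-circuit, and the idle timer described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the struct, its fields, and every function name below are hypothetical, and the XLIO poll/flush pair is replaced by a mock that just counts calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state; names are illustrative, not libevpl's. */
struct xlio_thread_state {
    int  num_connections;  /* non-listen XLIO sockets owned by this thread */
    bool force_poll_mode;  /* true: the event loop polls on every iteration */
    int  library_polls;    /* how many times the mocked XLIO poll ran */
};

/* Mock standing in for one poll-group poll followed by a flush. */
static void mock_poll_and_flush(struct xlio_thread_state *st)
{
    st->library_polls++;
}

/* 0 -> 1 transition re-arms force_poll_mode. */
static void connection_opened(struct xlio_thread_state *st)
{
    if (st->num_connections++ == 0) {
        st->force_poll_mode = true;
    }
}

/* 1 -> 0 transition drops force_poll_mode so the loop can park. */
static void connection_closed(struct xlio_thread_state *st)
{
    assert(st->num_connections > 0);
    if (--st->num_connections == 0) {
        st->force_poll_mode = false;
    }
}

/* Per-iteration poll callback: skips the library entirely when idle. */
static void poll_callback(struct xlio_thread_state *st)
{
    if (st->num_connections == 0) {
        return;
    }
    mock_poll_and_flush(st);
}

/* Idle timer callback: always performs a single poll and flush, which is
 * what lets accepts progress while poll_callback is short-circuited. */
static void idle_timer_callback(struct xlio_thread_state *st)
{
    mock_poll_and_flush(st);
}
```

With a zero-initialised state, calling poll_callback does nothing until connection_opened has run at least once, while idle_timer_callback reaches the mocked library call regardless of the count.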

Trade-off

Up to ~10ms of latency for the first inbound connection on an otherwise-idle thread. Once that connection lands, num_connections goes positive, force_poll_mode re-arms, and subsequent activity is serviced at the usual per-iteration polling latency.

Behaviour matrix

| State | `force_poll_mode` | `evpl_xlio_poll` | `idle_timer` | CPU shape |
| --- | --- | --- | --- | --- |
| No XLIO sockets yet | 0 | not registered | not registered | epoll-driven, idle |
| Only listen socket(s), no connections | 0 | registered, short-circuits | fires every 10ms | epoll_wait with 10ms timeout |
| ≥1 active connection | 1 | full body | still fires (harmless) | spin loop |
| Last connection closed | flips to 0 | short-circuits again | continues firing | drains `spin_ns`, then parks |

evpl_address_alloc() zero-fills the struct, so srcaddr->addrlen is 0 when xlio_socket_getpeername() is called. With a zero-sized buffer the kernel writes nothing and srcaddr->sa is left as all zeros — sa_family ends up 0 (AF_UNSPEC), which evpl_address_get_address() does not recognise, leaving the caller's string buffer uninitialised. The visible symptom is corrupted peer addresses in logs for accepted XLIO connections (e.g. NFS disconnect notifications showing junk where the client IP should be). Compare with the kernel-socket accept path in core/socket/tcp.c which already initialises addrlen to sizeof(remote_addr->sa) before accept().
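The value-result behaviour of addrlen can be reproduced with the ordinary POSIX getpeername(), used here as a stand-in on the assumption that xlio_socket_getpeername() follows the same convention, as the description above implies. The helper name peer_family_with_addrlen is hypothetical, and the AF_UNIX socketpair is only there to get a connected socket without touching the network. The result noted for a zero length is what Linux does; other systems may differ.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the address family getpeername() leaves in a
 * zero-filled buffer when called with the given initial addrlen, or -1 if a
 * system call fails. */
static int peer_family_with_addrlen(socklen_t initial_len)
{
    int                     fds[2];
    struct sockaddr_storage ss;
    socklen_t               len    = initial_len;
    int                     family = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    /* Mimic a zero-filled address allocation. */
    memset(&ss, 0, sizeof(ss));

    /* addrlen is value-result: on input it is the buffer size, so a value
     * of 0 tells the callee it may not write any bytes at all. */
    if (getpeername(fds[0], (struct sockaddr *) &ss, &len) == 0) {
        family = ss.ss_family;
    }

    close(fds[0]);
    close(fds[1]);
    return family;
}
```

Calling the helper with 0 leaves the family at AF_UNSPEC on Linux even though the call succeeds, whereas passing sizeof(struct sockaddr_storage) yields AF_UNIX for this socketpair.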

Without a fallback, an evpl_address whose sa_family is anything other than AF_INET or AF_INET6 (including AF_UNSPEC / 0 from a partially-populated sockaddr) caused the caller's str buffer to be left uninitialised, exposing stack garbage in log lines. Emit "(unspecified)" instead.
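A minimal sketch of such a fallback, written against plain struct sockaddr rather than the real evpl_address type: format_sockaddr is a hypothetical name, and treating an inet_ntop() failure the same way as an unknown family is an assumption of this sketch, not something the commit states.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical formatter: always leaves a NUL-terminated string in str
 * (when len > 0), whatever address family the sockaddr carries. */
static void format_sockaddr(const struct sockaddr *sa, char *str, size_t len)
{
    const void *addr = NULL;

    if (len == 0) {
        return;
    }

    switch (sa->sa_family) {
        case AF_INET:
            addr = &((const struct sockaddr_in *) sa)->sin_addr;
            break;
        case AF_INET6:
            addr = &((const struct sockaddr_in6 *) sa)->sin6_addr;
            break;
        default:
            break; /* AF_UNSPEC and anything else falls through to the fallback */
    }

    if (addr == NULL ||
        inet_ntop(sa->sa_family, addr, str, (socklen_t) len) == NULL) {
        /* Never leave the caller's buffer uninitialised. */
        snprintf(str, len, "(unspecified)");
    }
}
```

The point of the design is that the output buffer is written on every path, so a log line can no longer pick up whatever happened to be on the stack.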

…race

evpl_xlio_socket_accept() runs on the listener thread and previously called xlio_socket_getpeername() on the brand-new sockinfo_tcp before handing the accepted socket off to a worker thread. That call races XLIO's connection state machine: m_connected is populated during the lwip clone, but between the time the new sockinfo's lock is released (accept_lwip_cb) and the time the user callback fires (accept_connection_xlio_socket), state transitions on other contexts can clear or invalidate m_connected.

Even when getpeername returns 0, sock_addr::get_sa_by_family() will silently copy zero bytes and zero the caller's addrlen, producing a sockaddr with sa_family = 0 that gets propagated to bind->remote. Under load (e.g. nconnect=16 with churn), this surfaced as corrupted peer addresses in NFS disconnect logs.

Move the peer-address resolution into evpl_xlio_tcp_attach() on the worker thread, alongside the existing getsockname() call. By the time attach runs the socket has been re-attached to the worker's poll group and there's no concurrent state transition to race.

Pass NULL for remote_address from the listener-thread callback; bind->remote is now populated synchronously inside attach. Supersedes the addrlen-init workaround from 556a1d3; that call site no longer exists.

Worker threads (which only hold accepted/outbound sockets) don't need the 10ms wake-up: when they have connections force_poll_mode keeps the regular poll running every iteration, and when they don't there is no listen socket on the same thread that could possibly hand them new work. Only listener threads need to be woken up periodically to service incoming accepts. Gate evpl_add_timer() on first listen-socket creation and skip it entirely for non-listen call sites. The flag is one-shot — once a thread has hosted any listen socket it keeps the timer until the XLIO framework instance is destroyed, which avoids tracking listen counts without losing the optimisation on worker threads.
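The one-shot gating could look roughly like this. The struct, its fields, and the function names are hypothetical, and the timer registration is mocked with a counter rather than calling the real evpl_add_timer(), whose signature is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread framework state; names are illustrative only. */
struct xlio_framework_state {
    bool idle_timer_armed;   /* one-shot: set on the first listen socket */
    int  timers_registered;  /* how often the mocked timer registration ran */
};

/* Mock standing in for registering the 10ms idle timer. */
static void mock_add_idle_timer(struct xlio_framework_state *st)
{
    st->timers_registered++;
}

/* Called from every socket-creation site; only listen sockets arm the
 * timer, and only the first one on a given thread does so. */
static void socket_created(struct xlio_framework_state *st, bool is_listen)
{
    if (is_listen && !st->idle_timer_armed) {
        mock_add_idle_timer(st);
        st->idle_timer_armed = true;
    }
}
```

Because the flag is never cleared, closing the last listen socket leaves the timer in place, which matches the description above: no listen-socket count has to be maintained, and a thread that never listens never pays for the wake-up.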

benjarvis force-pushed the xlio-idle-optimization branch from b7fdb44 to 46110ef on May 18, 2026 11:22

benjarvis added 4 commits May 18, 2026 11:51

benjarvis wants to merge 5 commits into main from xlio-idle-optimization
