xlio: skip poll group poll when idle, toggle force_poll_mode dynamically #45

Open
benjarvis wants to merge 5 commits into main from xlio-idle-optimization

@benjarvis
Member

Summary

The XLIO Ultra API only supports busy polling — there is no eventfd or CQ-event notification we can sleep on, even in current upstream (libxlio 3.61.x). Until now, the first XLIO socket created on a thread pinned the libevpl event loop into force_poll_mode for the lifetime of the framework instance, so any thread that owned an XLIO listener spun at 100% CPU even when no TCP traffic was flowing (e.g. a deployment receiving only RDMA connections).

This change tracks the count of non-listen XLIO sockets per thread and:

  • Short-circuits evpl_xlio_poll when the count is zero — the XLIO library is not called at all in that case.
  • Drops force_poll_mode on the 1→0 transition so libevpl can drain its spin_ns idle window and park in evpl_core_wait.
  • Re-arms force_poll_mode on the 0→1 transition so per-iteration polling resumes for active connections.
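
The three rules above can be sketched as a small state machine. This is a minimal illustration, not the code in this PR: the struct and function names (`xlio_thread_state`, `connection_added`, `connection_removed`, `poll_callback`) are hypothetical, and `library_polls` merely counts the places where the real callback would call into XLIO.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state; field names are illustrative only. */
struct xlio_thread_state {
    int  num_connections;  /* non-listen XLIO sockets on this thread */
    bool force_poll_mode;  /* stands in for the event loop's flag */
    int  library_polls;    /* counts calls that would reach the XLIO library */
};

/* The 0 -> 1 transition re-arms per-iteration polling. */
void connection_added(struct xlio_thread_state *s)
{
    if (s->num_connections++ == 0) {
        s->force_poll_mode = true;
    }
}

/* The 1 -> 0 transition lets the loop drain its spin window and park. */
void connection_removed(struct xlio_thread_state *s)
{
    assert(s->num_connections > 0);
    if (--s->num_connections == 0) {
        s->force_poll_mode = false;
    }
}

/* Poll callback: returns without touching the library when idle. */
void poll_callback(struct xlio_thread_state *s)
{
    if (s->num_connections == 0) {
        return;
    }
    s->library_polls++;  /* real code would poll and flush the poll group */
}
```

Only the edge transitions touch `force_poll_mode`, so adding a second connection or removing one of several leaves the flag alone.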

A 10ms idle_timer runs alongside the poll callback whenever any XLIO socket exists. It drives a single xlio_poll_group_poll/xlio_poll_group_flush, which is what services listen-socket accepts and any deferred XLIO work while the main poll callback is short-circuited.

Trade-off

Up to ~10ms of latency for the first inbound connection on an otherwise-idle thread. Once that connection lands, num_connections goes positive, force_poll_mode re-arms, and subsequent activity is serviced at the usual per-iteration polling latency.
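
To make the latency bound concrete, here is a rough model of the idle timer. Everything in it is an assumption for illustration: `fake_poll_group` is a stand-in for the opaque XLIO poll group, `idle_timer_fire` only counts where the real callback would call `xlio_poll_group_poll` and `xlio_poll_group_flush`, and `wait_for_next_tick_us` assumes ticks land on exact multiples of 10ms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDLE_TIMER_INTERVAL_US 10000u  /* 10ms, as in the description */

/* Stand-in for an XLIO poll group; the real type is opaque. */
struct fake_poll_group {
    int pending_accepts;  /* inbound connections not yet accepted */
    int polls;            /* times the group was polled */
    int flushes;          /* times the group was flushed */
};

/* One poll plus one flush per tick; returns the accepts serviced. */
int idle_timer_fire(struct fake_poll_group *g)
{
    int accepted;

    assert(g != NULL);
    accepted = g->pending_accepts;
    g->polls++;               /* stands in for xlio_poll_group_poll() */
    g->pending_accepts = 0;
    g->flushes++;             /* stands in for xlio_poll_group_flush() */
    return accepted;
}

/* Time a connection arriving at arrival_us waits for the next tick. */
uint64_t wait_for_next_tick_us(uint64_t arrival_us)
{
    return IDLE_TIMER_INTERVAL_US - (arrival_us % IDLE_TIMER_INTERVAL_US);
}
```

Under this model a connection that arrives just after a tick waits almost the full 10ms, and one that arrives just before a tick waits almost nothing, which matches the "up to ~10ms" bound.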

Behaviour matrix

| State | force_poll_mode | evpl_xlio_poll | idle_timer | CPU shape |
|---|---|---|---|---|
| No XLIO sockets yet | 0 | not registered | not registered | epoll-driven, idle |
| Only listen socket(s), no connections | 0 | registered, short-circuits | fires every 10ms | epoll_wait with 10ms timeout |
| ≥1 active connection | 1 | full body | still fires (harmless) | spin loop |
| Last connection closed | flips to 0 | short-circuits again | continues firing | drains spin_ns, then parks |

benjarvis force-pushed the xlio-idle-optimization branch from b7fdb44 to 46110ef on May 18, 2026 11:22
benjarvis added 4 commits May 18, 2026 11:51
evpl_address_alloc() zero-fills the struct, so srcaddr->addrlen is 0
when xlio_socket_getpeername() is called. With a zero-sized buffer the
call writes nothing and srcaddr->sa is left as all zeros, so sa_family
ends up 0 (AF_UNSPEC), which evpl_address_get_address() does not
recognise, leaving the caller's string buffer uninitialised.

The visible symptom is corrupted peer addresses in logs for accepted
XLIO connections (e.g. NFS disconnect notifications showing junk
where the client IP should be). Compare with the kernel-socket accept
path in core/socket/tcp.c which already initialises addrlen to
sizeof(remote_addr->sa) before accept().
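
The value-result length convention behind this bug can be reproduced with plain POSIX sockets. The sketch below uses the kernel's `getpeername()` on a `socketpair()`, not the XLIO call; that `xlio_socket_getpeername()` treats its length argument the same way is an assumption drawn from the commit message, and `peer_family_with_addrlen` is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the sa_family left in the buffer after getpeername() is called
 * with initial_len as the value-result length, or -1 on error.
 */
int peer_family_with_addrlen(socklen_t initial_len)
{
    int                     fds[2];
    struct sockaddr_storage ss;
    socklen_t               len = initial_len;
    int                     rc;

    /* Never claim more buffer space than actually exists. */
    assert(initial_len <= sizeof(ss));

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    memset(&ss, 0, sizeof(ss));  /* mimics a zero-filled address struct */
    rc = getpeername(fds[0], (struct sockaddr *) &ss, &len);

    close(fds[0]);
    close(fds[1]);

    return rc == 0 ? (int) ss.ss_family : -1;
}
```

On Linux, passing 0 leaves the buffer untouched so the family reads back as `AF_UNSPEC`, while passing `sizeof(struct sockaddr_storage)` yields `AF_UNIX` for this socket pair.
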

Without a fallback, an evpl_address whose sa_family is anything other
than AF_INET or AF_INET6 (including AF_UNSPEC / 0 from a partially-
populated sockaddr) caused the caller's str buffer to be left
uninitialised, exposing stack garbage in log lines. Emit
"(unspecified)" instead.
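
A formatter with this kind of fallback might look like the following. It is a sketch under stated assumptions: `format_address` is a hypothetical name, not the signature of `evpl_address_get_address()`, and it relies only on the standard `inet_ntop()` and `snprintf()` calls.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical formatter: always leaves str NUL-terminated. */
void format_address(const struct sockaddr *sa, char *str, size_t len)
{
    const void *src = NULL;

    assert(len > 0);

    switch (sa->sa_family) {
    case AF_INET:
        src = &((const struct sockaddr_in *) sa)->sin_addr;
        break;
    case AF_INET6:
        src = &((const struct sockaddr_in6 *) sa)->sin6_addr;
        break;
    default:
        break;  /* AF_UNSPEC and anything else take the fallback below */
    }

    /* Unknown family or conversion failure: emit a fixed placeholder. */
    if (src == NULL ||
        inet_ntop(sa->sa_family, src, str, (socklen_t) len) == NULL) {
        snprintf(str, len, "(unspecified)");
    }
}
```

Because every path writes to `str`, a caller that logs the buffer never prints leftover stack contents.
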

…race

evpl_xlio_socket_accept() runs on the listener thread and previously
called xlio_socket_getpeername() on the brand-new sockinfo_tcp before
handing the accepted socket off to a worker thread. That call races
XLIO's connection state machine: m_connected is populated during the
lwip clone, but between the time the new sockinfo's lock is released
(accept_lwip_cb) and the time the user callback fires
(accept_connection_xlio_socket), state transitions on other contexts
can clear or invalidate m_connected. Even when getpeername returns 0,
sock_addr::get_sa_by_family() will silently copy zero bytes and zero
the caller's addrlen — producing a sockaddr with sa_family = 0 that
gets propagated to bind->remote. Under load (e.g. nconnect=16 with
churn), this surfaced as corrupted peer addresses in NFS disconnect
logs.

Move the peer-address resolution into evpl_xlio_tcp_attach() on the
worker thread, alongside the existing getsockname() call. By the time
attach runs the socket has been re-attached to the worker's poll group
and there's no concurrent state transition to race. Pass NULL for
remote_address from the listener-thread callback; bind->remote is now
populated synchronously inside attach.

Supersedes the addrlen-init workaround from 556a1d3 — that call site
no longer exists.

Worker threads (which only hold accepted/outbound sockets) don't need
the 10ms wake-up: when they have connections force_poll_mode keeps the
regular poll running every iteration, and when they don't there is no
listen socket on the same thread that could possibly hand them new
work. Only listener threads need to be woken up periodically to
service incoming accepts.

Gate evpl_add_timer() on first listen-socket creation and skip it
entirely for non-listen call sites. The flag is one-shot — once a
thread has hosted any listen socket it keeps the timer until the XLIO
framework instance is destroyed, which avoids tracking listen counts
without losing the optimisation on worker threads.
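
The one-shot gate described above could be sketched as follows. The names are hypothetical (`xlio_timer_state`, `socket_created`), and `timers_added` only counts the points where the real code would call `evpl_add_timer()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread bookkeeping for the idle timer. */
struct xlio_timer_state {
    bool idle_timer_armed;  /* one-shot: never cleared once set */
    int  timers_added;      /* counts would-be evpl_add_timer() calls */
};

/* Called for every new XLIO socket on the thread. */
void socket_created(struct xlio_timer_state *s, bool is_listen)
{
    assert(s != NULL);

    /* Non-listen sockets never arm the timer; listeners arm it once. */
    if (!is_listen || s->idle_timer_armed) {
        return;
    }

    s->timers_added++;  /* real code would register the 10ms timer here */
    s->idle_timer_armed = true;
}
```

A thread that only ever sees accepted or outbound sockets never arms the timer, and a listener thread arms it exactly once however many listen sockets it creates.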