xlio: skip poll group poll when idle, toggle force_poll_mode dynamically #45
Open
benjarvis wants to merge 5 commits into
The XLIO Ultra API only supports busy polling; there is no eventfd or CQ-event notification we can sleep on. Until now, the first XLIO socket created on a thread pinned the libevpl event loop into force_poll_mode for the lifetime of the framework instance, so any thread that owned an XLIO listener spun at 100% CPU even when no TCP traffic was flowing (e.g. a deployment receiving only RDMA connections).

Track the count of non-listen XLIO sockets per thread and:

- Short-circuit evpl_xlio_poll when the count is zero; the XLIO library is not called at all in that case.
- Drop force_poll_mode on the 1→0 transition so libevpl can drain its spin_ns idle window and park in evpl_core_wait.
- Re-arm force_poll_mode on the 0→1 transition so per-iteration polling resumes for active connections.

A 10ms idle_timer runs alongside the poll callback whenever any XLIO socket exists. It drives a single xlio_poll_group_poll/flush, which is what services listen-socket accepts and any deferred XLIO work while the main poll is short-circuited. The trade-off is up to ~10ms of latency for the first inbound connection on an otherwise-idle thread.
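The counting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `struct xlio_thread_state`, `conn_added`, `conn_removed`, `poll_cb`, `idle_timer_cb`, and the `library_polls` counter are all hypothetical names, and the counter merely stands in for calls into xlio_poll_group_poll/flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread XLIO state; the struct and
 * function names are illustrative and not taken from libevpl. */
struct xlio_thread_state {
    int  num_connections;  /* non-listen XLIO sockets on this thread */
    bool force_poll_mode;  /* true: the event loop polls every iteration */
    int  library_polls;    /* counts calls that would reach the XLIO library */
};

/* A non-listen socket was attached to this thread. */
static void
conn_added(struct xlio_thread_state *ts)
{
    if (ts->num_connections++ == 0) {
        ts->force_poll_mode = true;   /* 0 -> 1: re-arm busy polling */
    }
}

/* A non-listen socket was closed on this thread. */
static void
conn_removed(struct xlio_thread_state *ts)
{
    assert(ts->num_connections > 0);
    if (--ts->num_connections == 0) {
        ts->force_poll_mode = false;  /* 1 -> 0: let the loop go idle */
    }
}

/* Per-iteration poll callback: skipped entirely when there is nothing
 * to service. */
static void
poll_cb(struct xlio_thread_state *ts)
{
    if (ts->num_connections == 0) {
        return;                       /* short-circuit, no library call */
    }
    ts->library_polls++;
}

/* Periodic idle-timer callback: always polls once, so accepts are still
 * serviced while poll_cb is short-circuited. */
static void
idle_timer_cb(struct xlio_thread_state *ts)
{
    ts->library_polls++;
}
```

In this sketch only the 0→1 and 1→0 edges touch the flag, so adding or removing a second connection leaves the polling mode unchanged.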
b7fdb44 to 46110ef
evpl_address_alloc() zero-fills the struct, so srcaddr->addrlen is 0 when xlio_socket_getpeername() is called. With a zero-sized buffer the kernel writes nothing and srcaddr->sa is left as all zeros, so sa_family ends up 0 (AF_UNSPEC), which evpl_address_get_address() does not recognise, leaving the caller's string buffer uninitialised.

The visible symptom is corrupted peer addresses in logs for accepted XLIO connections (e.g. NFS disconnect notifications showing junk where the client IP should be).

Compare with the kernel-socket accept path in core/socket/tcp.c, which already initialises addrlen to sizeof(remote_addr->sa) before accept().
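The same failure mode can be reproduced with ordinary kernel sockets, which may make the bug easier to picture. The sketch below uses only POSIX calls (socketpair, getpeername) rather than the XLIO API, assumes Linux behaviour, and `peer_family_with_addrlen` is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the sa_family that getpeername() leaves in a zero-filled buffer
 * when addrlen starts at initial_len, or -1 on error. Uses an AF_UNIX
 * socketpair so no network access is needed. */
static int
peer_family_with_addrlen(socklen_t initial_len)
{
    int                     fds[2];
    struct sockaddr_storage ss;
    socklen_t               len    = initial_len;
    int                     family = -1;

    assert(initial_len <= sizeof(ss));

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    memset(&ss, 0, sizeof(ss)); /* mirrors a zero-filled address struct */

    if (getpeername(fds[0], (struct sockaddr *) &ss, &len) == 0) {
        /* With len == 0 nothing was copied, so this is still 0. */
        family = ss.ss_family;
    }

    close(fds[0]);
    close(fds[1]);
    return family;
}
```

Calling it with 0 is expected to yield AF_UNSPEC, while calling it with sizeof(struct sockaddr_storage) yields AF_UNIX, which is the difference the addrlen initialisation makes.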
Without a fallback, an evpl_address whose sa_family is anything other than AF_INET or AF_INET6 (including AF_UNSPEC / 0 from a partially-populated sockaddr) caused the caller's str buffer to be left uninitialised, exposing stack garbage in log lines. Emit "(unspecified)" instead.
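A fallback of this kind might look like the sketch below. It is not the body of evpl_address_get_address(); `format_address` is a hypothetical helper built on the standard inet_ntop() call, shown only to illustrate writing a defined string for every address family.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: formats the IP address in sa into str, and always
 * leaves str NUL-terminated, even for families it does not understand. */
static void
format_address(const struct sockaddr *sa, char *str, size_t len)
{
    const void *addr;

    assert(sa != NULL && str != NULL);

    if (len == 0) {
        return;
    }

    switch (sa->sa_family) {
        case AF_INET:
            addr = &((const struct sockaddr_in *) sa)->sin_addr;
            break;
        case AF_INET6:
            addr = &((const struct sockaddr_in6 *) sa)->sin6_addr;
            break;
        default:
            /* AF_UNSPEC or anything else: never leave str untouched. */
            snprintf(str, len, "(unspecified)");
            return;
    }

    if (inet_ntop(sa->sa_family, addr, str, (socklen_t) len) == NULL) {
        snprintf(str, len, "(unspecified)");
    }
}
```

The default branch is the important part: the caller's buffer is written on every path, so a log line can never pick up stale stack contents.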
…race

evpl_xlio_socket_accept() runs on the listener thread and previously called xlio_socket_getpeername() on the brand-new sockinfo_tcp before handing the accepted socket off to a worker thread. That call races XLIO's connection state machine: m_connected is populated during the lwip clone, but between the time the new sockinfo's lock is released (accept_lwip_cb) and the time the user callback fires (accept_connection_xlio_socket), state transitions on other contexts can clear or invalidate m_connected. Even when getpeername returns 0, sock_addr::get_sa_by_family() will silently copy zero bytes and zero the caller's addrlen, producing a sockaddr with sa_family = 0 that gets propagated to bind->remote. Under load (e.g. nconnect=16 with churn), this surfaced as corrupted peer addresses in NFS disconnect logs.

Move the peer-address resolution into evpl_xlio_tcp_attach() on the worker thread, alongside the existing getsockname() call. By the time attach runs the socket has been re-attached to the worker's poll group and there's no concurrent state transition to race. Pass NULL for remote_address from the listener-thread callback; bind->remote is now populated synchronously inside attach.

Supersedes the addrlen-init workaround from 556a1d3; that call site no longer exists.
Worker threads (which only hold accepted/outbound sockets) don't need the 10ms wake-up: when they have connections force_poll_mode keeps the regular poll running every iteration, and when they don't there is no listen socket on the same thread that could possibly hand them new work. Only listener threads need to be woken up periodically to service incoming accepts.

Gate evpl_add_timer() on first listen-socket creation and skip it entirely for non-listen call sites. The flag is one-shot: once a thread has hosted any listen socket it keeps the timer until the XLIO framework instance is destroyed, which avoids tracking listen counts without losing the optimisation on worker threads.
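The one-shot gate could be sketched as below. The names `struct xlio_thread_ctx`, `idle_timer_armed`, `timers_added`, and `socket_created` are hypothetical, and the counter increment only marks where the real code would call evpl_add_timer().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread context; names are illustrative only. */
struct xlio_thread_ctx {
    bool idle_timer_armed;  /* one-shot: set by the first listen socket */
    int  timers_added;      /* stands in for calls to evpl_add_timer() */
};

/* Called for every XLIO socket created on this thread. */
static void
socket_created(struct xlio_thread_ctx *ctx, bool is_listen)
{
    if (!is_listen || ctx->idle_timer_armed) {
        return;             /* worker-only thread, or timer already armed */
    }

    ctx->idle_timer_armed = true;
    ctx->timers_added++;    /* the real code would add the 10ms timer here */

    /* The flag is never cleared, so at most one timer exists per thread. */
    assert(ctx->timers_added == 1);
}
```

Because the flag is never reset, closing the last listen socket leaves the timer in place, which trades a periodic wake-up on former listener threads for not having to count listen sockets.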
Summary
The XLIO Ultra API only supports busy polling; there is no eventfd or CQ-event notification we can sleep on, even in current upstream (libxlio 3.61.x). Until now, the first XLIO socket created on a thread pinned the libevpl event loop into force_poll_mode for the lifetime of the framework instance, so any thread that owned an XLIO listener spun at 100% CPU even when no TCP traffic was flowing (e.g. a deployment receiving only RDMA connections).

This change tracks the count of non-listen XLIO sockets per thread and:

- Short-circuits evpl_xlio_poll when the count is zero; the XLIO library is not called at all in that case.
- Drops force_poll_mode on the 1→0 transition so libevpl can drain its spin_ns idle window and park in evpl_core_wait.
- Re-arms force_poll_mode on the 0→1 transition so per-iteration polling resumes for active connections.

A 10ms idle_timer runs alongside the poll callback whenever any XLIO socket exists. It drives a single xlio_poll_group_poll/xlio_poll_group_flush, which is what services listen-socket accepts and any deferred XLIO work while the main poll callback is short-circuited.

Trade-off

Up to ~10ms of latency for the first inbound connection on an otherwise-idle thread. Once that connection lands, num_connections goes positive, force_poll_mode re-arms, and subsequent activity is serviced at the usual per-iteration polling latency.

Behaviour matrix

[Table: columns force_poll_mode, evpl_xlio_poll, idle_timer; surviving cell: "spin_ns, then parks"]