Add rdmaproxy for uverbs devices and NVIDIA GPU Direct RDMA#3

Draft
atoniolo76 wants to merge 298 commits into master from alessio/development

Conversation

atoniolo76 (Collaborator) commented Apr 23, 2026

Thanks for taking a look! This feature (1) adds a new device proxy, (2) touches the existing nvproxy, and (3) modifies the boot sequence to serialize/deserialize host sysfs and move RDMA netdevs into the container for GID resolution. I have described the high-level architecture and the low-level implementation nuances in this public Notion doc, and I'd recommend reading it before diving into the diff!

atoniolo76 and others added 30 commits April 23, 2026 15:11
…ing, implement HostFD() for nvproxy frontendFD wrappers, make extractCQQP action-aware, cleanup on proxyAsyncEventFD only if CopyOut fails, use readBufPool for Read(), change Write() log from warning to debug.
…arent process for moving netdevs into container netns
Revert UVM_CREATE_EXTERNAL_RANGE back to the generic uvmIoctlSimple
handler. The custom handler was a development-time tracing aid that
logged at Warning level on every external range creation; not suitable
for upstream.

Co-authored-by: Cursor <cursoragent@cursor.com>
When multiple RoCE netdevs share an IP subnet (the common cloud RDMA
fabric layout: many NICs on one big /12), the kernel's permissive
default ARP behavior (arp_announce=0, arp_ignore=0) lets any local NIC
reply to ARP for any local IP. Peers can then cache the wrong MAC for
a given remote IP, the fabric drops the resulting RoCE v2 frames, and
the QP exhausts its retry counter -- surfacing as IBV_WC_RETRY_EXC_ERR
(status 12, mlx5 syndrome 0x81) and an apparent NCCL hang.

Apply the standard multi-rail ARP tightening (arp_announce=2,
arp_ignore=1, arp_filter=1, rp_filter=1) per moved netdev. The new
netns starts with an empty ARP cache for the netdev, so no flush is
needed -- the first ARP exchange after the move is already strict.
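
The per-netdev tightening described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `arp_tightening_sysctls` is hypothetical, and the sketch only prints the sysctl assignments rather than applying them, so it can be run without root.

```sh
# Print the sysctl assignments that tighten ARP handling for one netdev.
# The function name is hypothetical; the four values are the ones listed
# in the commit message (arp_announce=2, arp_ignore=1, arp_filter=1,
# rp_filter=1).
arp_tightening_sysctls() {
    dev=$1
    for setting in arp_announce=2 arp_ignore=1 arp_filter=1 rp_filter=1; do
        printf 'net.ipv4.conf.%s.%s\n' "$dev" "$setting"
    done
}
```

Inside the target netns, and with sufficient privileges, each printed line could be passed to `sysctl -w`, for example `arp_tightening_sysctls rdma0 | while read -r kv; do sysctl -w "$kv"; done`.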

Co-authored-by: Cursor <cursoragent@cursor.com>
… and update pci_devices.go to include current link speed and width attribute from sysfs.

Co-authored-by: Cursor <cursoragent@cursor.com>
The harness previously snapshotted and moved every RDMA-backed netdev
into the container netns, including ethN devices that happen to be
mlx5-backed but serve as the host's primary Ethernet uplinks (e.g.
eth0). Moving those breaks TCP connectivity from the host (and the
NCCL bootstrap path between nodes), forcing the user to either pick a
non-RDMA bootstrap path or set ALLOW_BOOTSTRAP_RDMA_IFACE=1 — which
still moves them and just suppresses the safety check.

Add an exclusion regex (default ^eth[0-9]+$) applied right after
candidate enumeration so:
  - rdmaN, ibN, etc. continue to be moved as before;
  - ethN devices are dropped with a warning printed to stderr;
  - explicit RDMA_IFACES still wins (the early-return path skips the
    filter, so users can force-include something the regex would drop);
  - the regex is fully configurable, and an empty value disables the
    filter entirely.

This matches the typical RoCE deployment where rdma0..N are the
fabric-side netdevs that host the GIDs NCCL programs against, while
ethN is the management/uplink NIC that should remain on the host.
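
A minimal sketch of the exclusion filter described above, assuming candidates arrive one per line on stdin. The variable name `RDMA_EXCLUDE_REGEX` and the function name are hypothetical (the commit message does not name them), and the explicit `RDMA_IFACES` early-return path is left out.

```sh
# Drop candidate netdevs that match an exclusion regex.
# RDMA_EXCLUDE_REGEX is a hypothetical name for the configurable regex.
# Unset means "use the default"; explicitly empty disables the filter.
filter_rdma_candidates() {
    if [ "${RDMA_EXCLUDE_REGEX+set}" = set ]; then
        regex=$RDMA_EXCLUDE_REGEX
    else
        regex='^eth[0-9]+$'
    fi
    while IFS= read -r dev; do
        [ -n "$dev" ] || continue
        if [ -n "$regex" ] && printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            # Dropped devices are reported on stderr, as the commit describes.
            echo "warning: not moving $dev (matches $regex)" >&2
            continue
        fi
        printf '%s\n' "$dev"
    done
}
```

Distinguishing "unset" from "empty" is what lets an empty value switch the filter off while still giving a useful default.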

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously the script unconditionally:
  - warned when nccl_topo.xml was missing under WORKSPACE_MOUNT (during
    both prepare preflight and start_app_container);
  - warned when /workspace/nccl_topo.xml was missing inside the container
    (right before exec);
  - injected NCCL_TOPO_FILE=/workspace/nccl_topo.xml into the container.

This is correct when the user wants to ship a pre-generated topology
XML, but noisy and counter-productive when relying on NCCL's automatic
sysfs-based topology discovery (which rdmaproxy populates with
ibverbs/PCI/NUMA data). For that workflow there's no topology file and
the warnings are spurious.

Add INCLUDE_NCCL_TOPO (default 1, current behavior). When set to 0:
  - all three warnings are skipped;
  - NCCL_TOPO_FILE is not added to the container env, letting NCCL
    auto-discover from /sys.

Caller still wins via NCCL_TOPO_FILE explicit env, since both gates
require -z "${NCCL_TOPO_FILE:-}" to fire.
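
The gate described above might look roughly like this. The function name is hypothetical, and how an explicitly set `NCCL_TOPO_FILE` reaches the container is not shown here, since the commit message does not describe that path.

```sh
# Emit the default NCCL_TOPO_FILE assignment only when both conditions
# from the commit message hold: INCLUDE_NCCL_TOPO is not 0 (default 1)
# and the caller has not set NCCL_TOPO_FILE explicitly.
# The function name is hypothetical.
default_nccl_topo_env() {
    if [ "${INCLUDE_NCCL_TOPO:-1}" != "0" ] && [ -z "${NCCL_TOPO_FILE:-}" ]; then
        printf '%s\n' 'NCCL_TOPO_FILE=/workspace/nccl_topo.xml'
    fi
}
```

With `INCLUDE_NCCL_TOPO=0` the function prints nothing, which corresponds to leaving NCCL to discover the topology from /sys.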

Co-authored-by: Cursor <cursoragent@cursor.com>
NCCL's topology builder walks /sys/bus/pci/devices/ and creates NET
nodes for every class-0200 (Ethernet) and class-0c06 (InfiniBand)
device it finds. It then matches these topology NETs against the IB
verbs plugin's device list by PCI bus ID. When a NIC-class PCI device
has no matching IB device (because its netdev wasn't moved into the
sandbox and it has no usable GIDs), NCCL creates a dangling NET node
in the topology graph. During graph search, the dangling node triggers
"Could not find NET with id N" and ncclInternalError.

On bare metal this is fine: all NICs have valid GID tables and the IB
plugin can open them all. In the sandbox, only the moved netdevs'
RDMA devices have GIDs, so the surplus PCI entries create the mismatch.

Fix: at sysfs construction time, cross-reference NIC-class PCI devices
against the RDMA data. Only include NICs whose IB counterpart has at
least one GID entry with a valid ndev (i.e., the netdev was moved).
GPUs, bridges, and NVSwitch devices are unaffected by the filter.
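
The actual filter lives in the Go sysfs construction code; the shell sketch below only illustrates the rule. It assumes a simplified input of `<pci-address> <class>` lines and a pre-computed list of PCI addresses whose IB device has at least one GID with a valid ndev. The function name and input format are hypothetical.

```sh
# stdin: lines of "<pci-address> <class>", e.g. "0000:0c:00.0 0x020000".
# $1:    space-separated PCI addresses whose IB counterpart has at least
#        one GID entry with a valid ndev (hypothetical input format).
# NIC-class devices (class 0x0200.. or 0x0c06..) are kept only if listed
# in $1; every other class passes through untouched.
filter_pci_for_nccl() {
    usable=" $1 "
    while read -r addr class; do
        case $class in
            0x0200*|0x0c06*)
                case $usable in
                    *" $addr "*) printf '%s %s\n' "$addr" "$class" ;;
                esac
                ;;
            *)
                printf '%s %s\n' "$addr" "$class"
                ;;
        esac
    done
}
```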

Co-authored-by: Cursor <cursoragent@cursor.com>
NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_TOPO_DUMP_FILE, and
NCCL_DEBUG_FILE were only passed through during the exec phase but
not injected into the container environment at start (docker run)
time. When the application is launched via exec, the docker exec
passthrough works, but the container's own environment (visible to
child processes) lacked these vars. This made it appear as if
NCCL_DEBUG had no effect.

Add all four to the start-time env_args loop alongside the existing
NCCL_IB_* passthrough. Also add NCCL_TOPO_DUMP_FILE and
NCCL_DEBUG_FILE to the exec passthrough list.
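
A start-time passthrough loop of the kind described above could be sketched as follows. The function name is hypothetical, and the real script's `env_args` handling may differ; this version just prints one `NAME=value` line per variable that is set and non-empty.

```sh
# Print NAME=value for each NCCL debug variable that is set and non-empty,
# suitable for turning into container environment arguments.
# The function name is hypothetical.
nccl_debug_env_args() {
    for name in NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_TOPO_DUMP_FILE NCCL_DEBUG_FILE; do
        # Indirect lookup of the variable named by $name.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        fi
    done
}
```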

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously, when RDMA_IFACES was set, rdma_netdev_records() short-
circuited the sysfs scan and emitted placeholder records of the form:

    rdma0|manual|?|?|?|

The literal string "manual" in column 2 (ibdev) then propagated
through snapshot_rdma into the snapshot file's ibdevs column, which
derive_nccl_ib_hca() reads to compute NCCL_IB_HCA. The resulting env
var was:

    NCCL_IB_HCA=manual

NCCL passes this to libibverbs as a device filter; it matches no real
HCA, so libibverbs reports "No device found", NCCL falls back to the
socket plugin, and any cross-node collective hangs because the socket
plugin can't carry RDMA traffic between sandbox netns over the bridge.

Fix: always run the sysfs/ibdev2netdev scan, then use RDMA_IFACES
purely as an include-list filter on the resulting records. Each kept
record carries its true ibdev/port/GID metadata, so derive_nccl_ib_hca
yields the real list (e.g. mlx5_0,mlx5_3,...).

Fall back to placeholder records only when no sysfs/ibdev2netdev data
exists at all, and warn so the operator knows to set NCCL_IB_HCA
explicitly.
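
The "scan first, then filter" approach can be sketched as below, assuming pipe-separated records with the netdev in column 1 and the ibdev in column 2, as in the placeholder shown above. Both function names are hypothetical (the second is a stand-in for the idea behind `derive_nccl_ib_hca`, not its implementation), and accepting either spaces or commas in `RDMA_IFACES` is an assumption.

```sh
# Keep only records whose netdev (column 1) is named in RDMA_IFACES.
# An unset or empty RDMA_IFACES keeps every record.
filter_records_by_ifaces() {
    wanted=" $(printf '%s' "${RDMA_IFACES:-}" | tr ',' ' ') "
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        netdev=${line%%"|"*}
        if [ -z "${RDMA_IFACES:-}" ]; then
            printf '%s\n' "$line"
        else
            case $wanted in
                *" $netdev "*) printf '%s\n' "$line" ;;
            esac
        fi
    done
}

# Join the distinct ibdev names (column 2) with commas, in input order.
sketch_ib_hca_list() {
    awk -F'|' '$2 != "" && !seen[$2]++ { printf "%s%s", (n++ ? "," : ""), $2 } END { print "" }'
}
```

Because the filter runs on real scan output, the ibdev column always carries a true device name, so the joined list never contains a placeholder such as `manual`.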

Co-authored-by: Cursor <cursoragent@cursor.com>
NCCL's IB plugin calls realpath("/sys/class/infiniband/<ibdev>/device")
to discover the PCI bus path of each IB device, then walks UP the
/sys/devices/pci*/ hierarchy to compute PCI distances between GPUs and
NICs. On real Linux, device/ is a symlink to the PCI device directory
(e.g. /sys/devices/pci0000:08/.../0000:0c:00.0).

In gVisor's virtual sysfs, device/ was a plain directory containing
uevent, class, vendor, etc. files. realpath() on a directory returns the
path itself (/sys/class/infiniband/mlx5_0/device), which is not under
/sys/devices/pci*/, so NCCL's topology detection reported:

    network path /sys/class/infiniband/mlx5_0/de00c0 is not a PCI device

This caused all NICs to be attached to "first CPU" in the topology graph
instead of under their actual PCI bridges. The resulting graph had no
properly-placed NET nodes, and graph search failed with:

    Could not find NET with id 0

Fix:
1. In newRDMASysfsEntries: create device/ as a symlink to the PCI
   device's relative path (e.g. ../../../devices/pci0000:08/.../...) when
   a matching PCI device exists in the tree. Fall back to the old
   directory approach when no PCI data is available.

2. In newPCIDevicesSysfsEntries: add uevent and modalias files to
   NIC-class PCI device nodes so they're accessible through the symlink
   (NCCL reads PCI_SLOT_NAME from uevent via this path).

3. Return pciAddrToRelPath from newPCIDevicesSysfsEntries so
   newRDMASysfsEntries can construct the symlink targets.
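
The effect of the symlink can be reproduced outside gVisor with a throwaway directory tree. The sketch below is only a demonstration of path resolution, not the Go change itself: the function name is hypothetical, the PCI addresses (including the intermediate bridge `0000:08:01.0`) are made up for the example, and `cd ... && pwd -P` stands in for `realpath()`.

```sh
# Build a miniature sysfs-like tree in a temp dir, make device/ a relative
# symlink into the PCI hierarchy, and print where it resolves.
# All PCI addresses here are invented for the demonstration.
demo_device_realpath() {
    tmp=$(mktemp -d) || return 1
    [ -n "$tmp" ] || return 1
    root=$(cd "$tmp" && pwd -P) || return 1
    ibdir="$root/sys/class/infiniband/mlx5_0"
    mkdir -p "$root/sys/devices/pci0000:08/0000:08:01.0/0000:0c:00.0" "$ibdir"
    # Three levels up from mlx5_0/ reaches sys/, then down into devices/.
    ln -s ../../../devices/pci0000:08/0000:08:01.0/0000:0c:00.0 "$ibdir/device"
    resolved=$(cd "$ibdir/device" && pwd -P)
    # Strip the temp-dir prefix so the output looks like a /sys path.
    printf '%s\n' "${resolved#"$root"}"
    rm -rf "$root"
}
```

Calling `demo_device_realpath` prints a path under `/sys/devices/pci0000:08/`, which is the shape NCCL needs in order to walk up the PCI hierarchy. Had `device` been a plain directory, the resolved path would have stayed under `/sys/class/infiniband/`.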

Co-authored-by: Cursor <cursoragent@cursor.com>
Three related UX improvements to make multi-node prepare flows robust
against per-node IP-provisioning skew, which has been a recurring source
of "first NCCL barrier hangs" failures.

1. RDMA_AUTO_SKIP_NO_IPV4 (default 1): instead of dying with
   "RDMA netdev <dev> has no IPv4 address" when a candidate netdev has
   no IP on this host, silently drop it from the candidate set with a
   stderr note. RoCE v2 needs IP routing, so a netdev without an IPv4
   is unusable anyway — failing the whole prepare just forces the
   operator to manually exclude it. Auto-skip lets the snapshot proceed
   with the netdevs that are actually usable.

2. RDMA_PEER_IFACES: optional space-separated list of netdev names
   known to be present on the peer node. The local candidate set is
   intersected with this list to enforce symmetric NIC topology across
   ranks. NCCL requires every rank to advertise the same set of NICs;
   asymmetry causes the comm to deadlock at the first cross-rank
   collective with confusing "Could not find NET with id N" errors.

3. After prepare completes, print a clearly-formatted summary block
   listing the netdevs and ibdevs that ended up moved, plus a
   copy-pasteable "RDMA_PEER_IFACES=..." line for the other node. This
   eliminates the manual diff-the-route-tables workflow that has been
   needed to figure out the symmetric set.
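
The peer intersection and the copy-pasteable summary line from the list above might be sketched like this. Both function names are hypothetical, candidates are assumed to arrive one per line on stdin, and the IPv4 auto-skip step is omitted because it depends on live interface state.

```sh
# Keep only local candidates that also appear in RDMA_PEER_IFACES
# (space-separated). If the variable is unset or empty, keep everything.
intersect_with_peer() {
    if [ -z "${RDMA_PEER_IFACES:-}" ]; then
        cat
        return 0
    fi
    peers=" $RDMA_PEER_IFACES "
    while IFS= read -r dev; do
        case $peers in
            *" $dev "*) printf '%s\n' "$dev" ;;
        esac
    done
}

# Turn the final netdev list (one per line) into a line the operator can
# paste on the other node.
peer_ifaces_line() {
    printf 'RDMA_PEER_IFACES="%s"\n' "$(paste -sd' ' -)"
}
```

For example, `printf 'rdma0\nrdma2\n' | peer_ifaces_line` produces `RDMA_PEER_IFACES="rdma0 rdma2"`.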

Co-authored-by: Cursor <cursoragent@cursor.com>
Iterating on rdmaproxy / topology / netdev plumbing requires a specific
set of debug env vars (NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS, NCCL_DEBUG_FILE,
NCCL_TOPO_DUMP_FILE, RUNSC_DEBUG, RUNSC_DEBUG_LOG, RUNSC_STRACE,
RUNSC_STRACE_SYSCALLS) to be set together. Setting them by hand on every
prepare/run-test invocation is error-prone — easy to forget one and end
up debugging blind, or inconsistent across nodes.

Add DEBUG_ALL=1 as a one-shot that defaults all of them at once. Each
individual var still wins if explicitly set on the command line (since
the assignments use ${VAR:-default} syntax), so users can mix and match.

Also document the cost — strace alone slows the sandbox 5-10x, so
DEBUG_ALL is for debugging only, not for performance benchmarks.
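
The `${VAR:-default}` pattern mentioned above can be illustrated with a reduced sketch covering three of the variables. The function name is hypothetical, and apart from `NCCL_DEBUG=INFO` the default values shown are placeholders, since the commit message does not list them.

```sh
# When DEBUG_ALL=1, fill in defaults for debug variables that are not
# already set. An explicit value always wins because ${VAR:-default}
# only substitutes when VAR is unset or empty.
# Defaults other than NCCL_DEBUG=INFO are placeholders for illustration.
apply_debug_all() {
    [ "${DEBUG_ALL:-0}" = "1" ] || return 0
    NCCL_DEBUG=${NCCL_DEBUG:-INFO}
    NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS:-INIT,NET,GRAPH}
    RUNSC_STRACE=${RUNSC_STRACE:-1}
    export NCCL_DEBUG NCCL_DEBUG_SUBSYS RUNSC_STRACE
}
```

Running with `DEBUG_ALL=1 NCCL_DEBUG=WARN` would therefore keep `NCCL_DEBUG=WARN` while still defaulting the other variables.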

Co-authored-by: Cursor <cursoragent@cursor.com>