Add rdmaproxy for uverbs devices and NVIDIA GPU Direct RDMA #3
Draft
atoniolo76 wants to merge 298 commits into
…efault runtime in agent.py to runc
…urable in run_torch_pair_node_a.sh
This reverts commit 726b2df.
…ing, implement HostFD() for nvproxy frontendFD wrappers, make extractCQQP action-aware, cleanup on proxyAsyncEventFD only if CopyOut fails, use readBufPool for Read(), change Write() log from warning to debug.
…om applyRDMANetdevSnapshot
…arent process for moving netdevs into container netns
Revert UVM_CREATE_EXTERNAL_RANGE back to the generic uvmIoctlSimple handler. The custom handler was a development-time tracing aid that logged at Warning level on every external range creation; not suitable for upstream.
Co-authored-by: Cursor <cursoragent@cursor.com>
When multiple RoCE netdevs share an IP subnet (the common cloud RDMA fabric layout: many NICs on one big /12), the kernel's permissive default ARP behavior (arp_announce=0, arp_ignore=0) lets any local NIC reply to ARP for any local IP. Peers can then cache the wrong MAC for a given remote IP, the fabric drops the resulting RoCE v2 frames, and the QP exhausts its retry counter -- surfacing as IBV_WC_RETRY_EXC_ERR (status 12, mlx5 syndrome 0x81) and an apparent NCCL hang.
Apply the standard multi-rail ARP tightening (arp_announce=2, arp_ignore=1, arp_filter=1, rp_filter=1) per moved netdev. The new netns starts with an empty ARP cache for the netdev, so no flush is needed -- the first ARP exchange after the move is already strict.
Co-authored-by: Cursor <cursoragent@cursor.com>
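The per-netdev tightening described above can be sketched as a small shell helper. This is only an illustration: the helper name `arp_tightening_sysctls` is hypothetical, and the PR may apply these settings by other means (for example, by writing under /proc/sys directly). The four values are the ones named in the commit message.

```sh
# Hypothetical helper (not from the PR): print the sysctl assignments for one
# netdev. The four key=value pairs are the ones listed in the commit message.
arp_tightening_sysctls() {
    dev=$1
    for kv in arp_announce=2 arp_ignore=1 arp_filter=1 rp_filter=1; do
        printf 'net.ipv4.conf.%s.%s\n' "$dev" "$kv"
    done
}

# Example (needs root; run inside the netns that now owns the netdev):
#   arp_tightening_sysctls rdma0 | while IFS= read -r s; do sysctl -w "$s"; done
```

Printing the assignments instead of applying them keeps the helper side-effect free, so the list can be reviewed or logged before it is piped into `sysctl -w`.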
… and update pci_devices.go to include current link speed and width attributes from sysfs.
Co-authored-by: Cursor <cursoragent@cursor.com>
The harness previously snapshotted and moved every RDMA-backed netdev
into the container netns, including ethN devices that happen to be
mlx5-backed but serve as the host's primary Ethernet uplinks (e.g.
eth0). Moving those breaks TCP connectivity from the host (and the
NCCL bootstrap path between nodes), forcing the user to either pick a
non-RDMA bootstrap path or set ALLOW_BOOTSTRAP_RDMA_IFACE=1 — which
still moves them and just suppresses the safety check.
Add an exclusion regex (default ^eth[0-9]+$) applied right after
candidate enumeration so:
- rdmaN, ibN, etc. continue to be moved as before;
- ethN devices are dropped with a warning printed to stderr;
- explicit RDMA_IFACES still wins (the early-return path skips the
filter, so users can force-include something the regex would drop);
- the regex is fully configurable, and an empty value disables the
filter entirely.
This matches the typical RoCE deployment where rdma0..N are the
fabric-side netdevs that host the GIDs NCCL programs against, while
ethN is the management/uplink NIC that should remain on the host.
Co-authored-by: Cursor <cursoragent@cursor.com>
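A minimal sketch of the exclusion filter described in this commit, assuming candidates arrive one per line on stdin. The function name `filter_rdma_candidates` and the choice to pass the regex as an argument are assumptions for illustration; only the default pattern `^eth[0-9]+$`, the stderr warning, and the "empty disables" rule come from the commit message.

```sh
# Hypothetical sketch: drop candidate netdevs whose name matches an exclusion
# regex. $1 is the regex (empty disables the filter); names come from stdin.
filter_rdma_candidates() {
    regex=$1
    while IFS= read -r dev; do
        [ -n "$dev" ] || continue
        if [ -n "$regex" ] && printf '%s\n' "$dev" | grep -Eq "$regex"; then
            echo "warning: not moving $dev (matches exclusion regex)" >&2
            continue
        fi
        printf '%s\n' "$dev"
    done
    return 0
}

# Example:
#   printf 'eth0\nrdma0\nib1\n' | filter_rdma_candidates '^eth[0-9]+$'
```

In the flow the commit describes, an explicit RDMA_IFACES would return before this filter is reached, which is how a user can still force-include a device the regex would drop.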
Previously the script unconditionally:
- warned when nccl_topo.xml was missing under WORKSPACE_MOUNT (during
both prepare preflight and start_app_container);
- warned when /workspace/nccl_topo.xml was missing inside the container
(right before exec);
- injected NCCL_TOPO_FILE=/workspace/nccl_topo.xml into the container.
This is correct when the user wants to ship a pre-generated topology
XML, but noisy and counter-productive when relying on NCCL's automatic
sysfs-based topology discovery (which rdmaproxy populates with
ibverbs/PCI/NUMA data). For that workflow there's no topology file and
the warnings are spurious.
Add INCLUDE_NCCL_TOPO (default 1, current behavior). When set to 0:
- all three warnings are skipped;
- NCCL_TOPO_FILE is not added to the container env, letting NCCL
auto-discover from /sys.
Caller still wins via NCCL_TOPO_FILE explicit env, since both gates
require -z "${NCCL_TOPO_FILE:-}" to fire.
Co-authored-by: Cursor <cursoragent@cursor.com>
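The gating described above might look roughly like the following. The function name `default_topo_env` is hypothetical; the variable names, the default of 1, the injected path, and the `-z "${NCCL_TOPO_FILE:-}"` condition are taken from the commit message.

```sh
# Hypothetical sketch: emit the default NCCL_TOPO_FILE assignment only when
# INCLUDE_NCCL_TOPO is 1 (the default) and the caller has not set
# NCCL_TOPO_FILE explicitly.
default_topo_env() {
    if [ "${INCLUDE_NCCL_TOPO:-1}" = "1" ] && [ -z "${NCCL_TOPO_FILE:-}" ]; then
        echo "NCCL_TOPO_FILE=/workspace/nccl_topo.xml"
    fi
    return 0
}
```

With INCLUDE_NCCL_TOPO=0 the function prints nothing, so nothing is injected and NCCL is left to discover topology from /sys.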
NCCL's topology builder walks /sys/bus/pci/devices/ and creates NET nodes for every class-0200 (Ethernet) and class-0c06 (InfiniBand) device it finds. It then matches these topology NETs against the IB verbs plugin's device list by PCI bus ID.
When a NIC-class PCI device has no matching IB device (because its netdev wasn't moved into the sandbox and it has no usable GIDs), NCCL creates a dangling NET node in the topology graph. During graph search, the dangling node triggers "Could not find NET with id N" and ncclInternalError.
On bare metal this is fine: all NICs have valid GID tables and the IB plugin can open them all. In the sandbox, only the moved netdevs' RDMA devices have GIDs, so the surplus PCI entries create the mismatch.
Fix: at sysfs construction time, cross-reference NIC-class PCI devices against the RDMA data. Only include NICs whose IB counterpart has at least one GID entry with a valid ndev (i.e., the netdev was moved). GPUs, bridges, and NVSwitch devices are unaffected by the filter.
Co-authored-by: Cursor <cursoragent@cursor.com>
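The "at least one GID entry with a valid ndev" predicate can be approximated from a shell against a sysfs-shaped tree. This is a sketch only: the actual fix lives in the sandbox's sysfs construction code, and the function name `ibdevs_with_usable_gids` is hypothetical. It assumes the standard Linux layout `class/infiniband/<ibdev>/ports/<port>/gid_attrs/ndevs/<index>`.

```sh
# Hypothetical sketch: list IB devices that have at least one GID entry whose
# ndev attribute is readable and non-empty. $1 is the sysfs root (default /sys).
ibdevs_with_usable_gids() {
    root=${1:-/sys}
    for ibdir in "$root"/class/infiniband/*; do
        [ -d "$ibdir" ] || continue
        found=0
        for f in "$ibdir"/ports/*/gid_attrs/ndevs/*; do
            [ -f "$f" ] || continue
            # Reading an unpopulated GID slot can fail; treat that as "no ndev".
            ndev=$(cat "$f" 2>/dev/null) || continue
            if [ -n "$ndev" ]; then
                found=1
                break
            fi
        done
        if [ "$found" = 1 ]; then
            basename "$ibdir"
        fi
    done
    return 0
}
```

A NIC-class PCI device would then be kept only if its IB counterpart appears in this list; mapping an ibdev back to its PCI address is left out of the sketch.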
NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_TOPO_DUMP_FILE, and NCCL_DEBUG_FILE were only passed through during the exec phase but not injected into the container environment at start (docker run) time. When the application is launched via exec, the docker exec passthrough works, but the container's own environment (visible to child processes) lacked these vars. This made it appear as if NCCL_DEBUG had no effect.
Add all four to the start-time env_args loop alongside the existing NCCL_IB_* passthrough. Also add NCCL_TOPO_DUMP_FILE and NCCL_DEBUG_FILE to the exec passthrough list.
Co-authored-by: Cursor <cursoragent@cursor.com>
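A passthrough loop of the kind described above could be sketched as follows. The function name `nccl_debug_env_args` is hypothetical and the real env_args loop may be structured differently; the four variable names are the ones listed in the commit message, and `-e NAME=value` is the standard `docker run` / `docker exec` flag for setting an environment variable.

```sh
# Hypothetical sketch: for each NCCL debug variable that is set and non-empty
# in the caller's environment, emit "-e" and "NAME=value" on separate lines.
nccl_debug_env_args() {
    for name in NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_TOPO_DUMP_FILE NCCL_DEBUG_FILE; do
        eval "val=\${$name:-}"   # indirect lookup; names come from the fixed list above
        if [ -n "$val" ]; then
            printf '%s\n' "-e" "$name=$val"
        fi
    done
    return 0
}
```

Using the same helper to build both the start-time and the exec-time argument lists would keep the two passthrough sets from drifting apart.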
Previously, when RDMA_IFACES was set, rdma_netdev_records() short-
circuited the sysfs scan and emitted placeholder records of the form:
rdma0|manual|?|?|?|
The literal string "manual" in column 2 (ibdev) then propagated
through snapshot_rdma into the snapshot file's ibdevs column, which
derive_nccl_ib_hca() reads to compute NCCL_IB_HCA. The resulting env
var was:
NCCL_IB_HCA=manual
NCCL passes this to libibverbs as a device filter; it matches no real
HCA, so libibverbs reports "No device found", NCCL falls back to the
socket plugin, and any cross-node collective hangs because the socket
plugin can't carry RDMA traffic between sandbox netns over the bridge.
Fix: always run the sysfs/ibdev2netdev scan, then use RDMA_IFACES
purely as an include-list filter on the resulting records. Each kept
record carries its true ibdev/port/GID metadata, so derive_nccl_ib_hca
yields the real list (e.g. mlx5_0,mlx5_3,...).
Fall back to placeholder records only when no sysfs/ibdev2netdev data
exists at all, and warn so the operator knows to set NCCL_IB_HCA
explicitly.
Co-authored-by: Cursor <cursoragent@cursor.com>
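The "scan first, then filter" approach can be illustrated with pipe-separated records like the one shown above. Both function names here are hypothetical (this is not the script's derive_nccl_ib_hca), and treating RDMA_IFACES as a space-separated list is an assumption.

```sh
# Hypothetical sketch: keep only records whose first column (netdev) appears in
# the space-separated include list $1. An empty list keeps every record.
keep_listed_ifaces() {
    ifaces=$1
    while IFS= read -r rec; do
        [ -n "$rec" ] || continue
        netdev=${rec%%|*}
        if [ -z "$ifaces" ]; then
            printf '%s\n' "$rec"
            continue
        fi
        case " $ifaces " in
            *" $netdev "*) printf '%s\n' "$rec" ;;
        esac
    done
    return 0
}

# Hypothetical sketch: comma-join the unique ibdev names (column 2) of the
# records on stdin, suitable as an NCCL_IB_HCA value.
hca_list_from_records() {
    cut -d'|' -f2 | sort -u | paste -sd, -
}
```

Because the filter only selects among scanned records, column 2 always carries a real ibdev name, so the derived list can never contain a placeholder such as "manual".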
NCCL's IB plugin calls realpath("/sys/class/infiniband/<ibdev>/device")
to discover the PCI bus path of each IB device, then walks UP the
/sys/devices/pci*/ hierarchy to compute PCI distances between GPUs and
NICs. On real Linux, device/ is a symlink to the PCI device directory
(e.g. /sys/devices/pci0000:08/.../0000:0c:00.0).
In gVisor's virtual sysfs, device/ was a plain directory containing
uevent, class, vendor, etc. files. realpath() on a directory returns the
path itself (/sys/class/infiniband/mlx5_0/device), which is not under
/sys/devices/pci*/, so NCCL's topology detection reported:
network path /sys/class/infiniband/mlx5_0/device is not a PCI device
This caused all NICs to be attached to "first CPU" in the topology graph
instead of under their actual PCI bridges. The resulting graph had no
properly-placed NET nodes, and graph search failed with:
Could not find NET with id 0
Fix:
1. In newRDMASysfsEntries: create device/ as a symlink to the PCI
device's relative path (e.g. ../../../devices/pci0000:08/.../...) when
a matching PCI device exists in the tree. Fall back to the old
directory approach when no PCI data is available.
2. In newPCIDevicesSysfsEntries: add uevent and modalias files to
NIC-class PCI device nodes so they're accessible through the symlink
(NCCL reads PCI_SLOT_NAME from uevent via this path).
3. Return pciAddrToRelPath from newPCIDevicesSysfsEntries so
newRDMASysfsEntries can construct the symlink targets.
Co-authored-by: Cursor <cursoragent@cursor.com>
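The effect of step 1 can be reproduced in a scratch directory. The layout below is a simplified, hypothetical stand-in for the sandbox's sysfs (a single PCI bridge level, with the infiniband class entry as a plain directory), and the helper names are made up for the illustration.

```sh
# Hypothetical sketch: create <root>/class/infiniband/<ibdev>/device as a
# relative symlink to a PCI device directory given relative to <root>.
link_ib_device() {
    root=$1
    ibdev=$2
    pci_rel=$3
    mkdir -p "$root/class/infiniband/$ibdev"
    # Three levels up from class/infiniband/<ibdev>/ is the sysfs root.
    ln -s "../../../$pci_rel" "$root/class/infiniband/$ibdev/device"
}

# Resolve a directory to its physical path, following symlinks, much as
# realpath() would for the device entry.
resolve_dir() {
    (cd "$1" && pwd -P)
}

# Example:
#   root=$(mktemp -d)
#   mkdir -p "$root/devices/pci0000:08/0000:0c:00.0"
#   link_ib_device "$root" mlx5_0 "devices/pci0000:08/0000:0c:00.0"
#   resolve_dir "$root/class/infiniband/mlx5_0/device"
```

With the symlink in place, the resolved path lands under devices/pci0000:08/, which is the kind of path NCCL walks upward from; with a plain directory, resolution would return the class path unchanged.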
Three related UX improvements to make multi-node prepare flows robust against per-node IP-provisioning skew, which has been a recurring source of "first NCCL barrier hangs" failures.
1. RDMA_AUTO_SKIP_NO_IPV4 (default 1): instead of dying with "RDMA netdev <dev> has no IPv4 address" when a candidate netdev has no IP on this host, drop it from the candidate set and print a note to stderr. RoCE v2 needs IP routing, so a netdev without an IPv4 address is unusable anyway; failing the whole prepare just forces the operator to exclude it manually. Auto-skip lets the snapshot proceed with the netdevs that are actually usable.
2. RDMA_PEER_IFACES: optional space-separated list of netdev names known to be present on the peer node. The local candidate set is intersected with this list to enforce symmetric NIC topology across ranks. NCCL requires every rank to advertise the same set of NICs; asymmetry causes the comm to deadlock at the first cross-rank collective with confusing "Could not find NET with id N" errors.
3. After prepare completes, print a clearly formatted summary block listing the netdevs and ibdevs that ended up moved, plus a copy-pasteable "RDMA_PEER_IFACES=..." line for the other node. This eliminates the manual diff-the-route-tables workflow that has been needed to figure out the symmetric set.
Co-authored-by: Cursor <cursoragent@cursor.com>
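The RDMA_PEER_IFACES intersection can be sketched in a few lines of shell. The function name `intersect_ifaces` is hypothetical, and returning the local set unchanged when the peer list is empty is an assumption based on the variable being described as optional.

```sh
# Hypothetical sketch: print the netdevs from the local candidate set ($1) that
# also appear in the peer list ($2). Both are space-separated; local order is
# preserved. An empty peer list leaves the local set unchanged.
intersect_ifaces() {
    local_set=$1
    peer_set=$2
    if [ -z "$peer_set" ]; then
        printf '%s\n' "$local_set"
        return 0
    fi
    out=
    for dev in $local_set; do
        case " $peer_set " in
            *" $dev "*) out="${out:+$out }$dev" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Example: produce the copy-pasteable line for the other node.
#   echo "RDMA_PEER_IFACES=\"$(intersect_ifaces "$moved" "${RDMA_PEER_IFACES:-}")\""
```

Running the same intersection on both nodes with each other's list yields the same set on each side, which is the symmetry the commit says NCCL needs.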
Iterating on rdmaproxy / topology / netdev plumbing requires a specific
set of debug env vars (NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS, NCCL_DEBUG_FILE,
NCCL_TOPO_DUMP_FILE, RUNSC_DEBUG, RUNSC_DEBUG_LOG, RUNSC_STRACE,
RUNSC_STRACE_SYSCALLS) to be set together. Setting them by hand on every
prepare/run-test invocation is error-prone — easy to forget one and end
up debugging blind, or inconsistent across nodes.
Add DEBUG_ALL=1 as a one-shot that defaults all of them at once. Each
individual var still wins if explicitly set on the command line (since
the assignments use ${VAR:-default} syntax), so users can mix and match.
Also document the cost — strace alone slows the sandbox 5-10x, so
DEBUG_ALL is for debugging only, not for performance benchmarks.
Co-authored-by: Cursor <cursoragent@cursor.com>
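The precedence rule ("each individual var still wins") follows directly from `${VAR:-default}`. A reduced sketch covering two of the variables is below; the function name `apply_debug_all` is hypothetical, NCCL_DEBUG=INFO comes from the commit message, and the value used for RUNSC_STRACE is a placeholder assumption.

```sh
# Hypothetical sketch: when DEBUG_ALL=1, default the debug variables without
# overriding anything the caller already set.
apply_debug_all() {
    if [ "${DEBUG_ALL:-0}" = "1" ]; then
        NCCL_DEBUG=${NCCL_DEBUG:-INFO}
        RUNSC_STRACE=${RUNSC_STRACE:-1}   # placeholder value, not from the PR
        export NCCL_DEBUG RUNSC_STRACE
    fi
    return 0
}

# Example: everything defaulted except the NCCL log level.
#   DEBUG_ALL=1 NCCL_DEBUG=WARN ./prepare.sh   # script name is illustrative
```

Because `${VAR:-default}` only substitutes when the variable is unset or empty, an explicit empty assignment is treated the same as leaving the variable unset.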
Thanks for taking a look! This feature (1) adds a new device proxy, (2) touches the existing nvproxy, and (3) modifies the boot sequence to serialize/deserialize host sysfs and move RDMA netdevs into the container for GID resolution. I have described the high-level architecture and the low-level implementation nuances in this public Notion doc, and I'd recommend reading it before diving into the diff!