Add rdmaproxy for uverbs devices and NVIDIA GPU Direct RDMA by atoniolo76 · Pull Request #3 · modal-labs/gvisor

atoniolo76 · 2026-04-23T19:11:04Z

Thanks for taking a look! This feature (1) adds a new device proxy, (2) touches the existing nvproxy, and (3) modifies the boot sequence to serialize/deserialize host sysfs and move RDMA netdevs into the container for GID resolution. I have described the high-level architecture and low-level implementation nuances in this public Notion doc, and I'd recommend reading it before diving into the diff!

Revert UVM_CREATE_EXTERNAL_RANGE back to the generic uvmIoctlSimple handler. The custom handler was a development-time tracing aid that logged at Warning level on every external range creation; not suitable for upstream.

Co-authored-by: Cursor <cursoragent@cursor.com>

When multiple RoCE netdevs share an IP subnet (the common cloud RDMA fabric layout: many NICs on one big /12), the kernel's permissive default ARP behavior (arp_announce=0, arp_ignore=0) lets any local NIC reply to ARP for any local IP. Peers can then cache the wrong MAC for a given remote IP, the fabric drops the resulting RoCE v2 frames, and the QP exhausts its retry counter -- surfacing as IBV_WC_RETRY_EXC_ERR (status 12, mlx5 syndrome 0x81) and an apparent NCCL hang.

Apply the standard multi-rail ARP tightening (arp_announce=2, arp_ignore=1, arp_filter=1, rp_filter=1) per moved netdev. The new netns starts with an empty ARP cache for the netdev, so no flush is needed -- the first ARP exchange after the move is already strict.

Co-authored-by: Cursor <cursoragent@cursor.com>
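The per-netdev tightening described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `arp_sysctl_settings` is hypothetical, and it only computes the `/proc/sys` paths and values (the standard per-interface IPv4 sysctls) rather than writing them.

```python
# Strict multi-rail ARP values named in the commit message above.
STRICT_ARP = {
    "arp_announce": 2,  # always pick the best local source address for the target
    "arp_ignore": 1,    # reply only if the target IP is configured on the receiving interface
    "arp_filter": 1,    # answer ARP only if the kernel would route the reply via this interface
    "rp_filter": 1,     # strict reverse-path validation
}


def arp_sysctl_settings(netdev):
    """Hypothetical helper: map sysctl file path -> value for one moved netdev."""
    base = f"/proc/sys/net/ipv4/conf/{netdev}"
    return {f"{base}/{name}": str(value) for name, value in STRICT_ARP.items()}


if __name__ == "__main__":
    for path, value in arp_sysctl_settings("rdma0").items():
        print(f"{path} = {value}")
```

Keeping the values in one table makes it easy to apply the same settings to every netdev that is moved into the container netns.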

The harness previously snapshotted and moved every RDMA-backed netdev into the container netns, including ethN devices that happen to be mlx5-backed but serve as the host's primary Ethernet uplinks (e.g. eth0). Moving those breaks TCP connectivity from the host (and the NCCL bootstrap path between nodes), forcing the user to either pick a non-RDMA bootstrap path or set ALLOW_BOOTSTRAP_RDMA_IFACE=1, which still moves them and just suppresses the safety check.

Add an exclusion regex (default ^eth[0-9]+$) applied right after candidate enumeration so:

- rdmaN, ibN, etc. continue to be moved as before;
- ethN devices are dropped with a warning printed to stderr;
- explicit RDMA_IFACES still wins (the early-return path skips the filter, so users can force-include something the regex would drop);
- the regex is fully configurable, and an empty value disables the filter entirely.

This matches the typical RoCE deployment where rdma0..N are the fabric-side netdevs that host the GIDs NCCL programs against, while ethN is the management/uplink NIC that should remain on the host.

Co-authored-by: Cursor <cursoragent@cursor.com>
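The filtering rules listed above can be sketched as below. The harness itself is a shell script, so this Python version is only an illustration of the decision logic; the function name `filter_candidates` and its parameters are hypothetical.

```python
import re


def filter_candidates(candidates, explicit=None, exclude_regex=r"^eth[0-9]+$"):
    """Return (kept, dropped) netdev names.

    Hypothetical sketch of the rules in the commit message:
    - an explicit list wins and skips the filter entirely;
    - an empty regex disables the filter;
    - otherwise names matching the regex are dropped.
    """
    if explicit:
        return list(explicit), []
    if not exclude_regex:
        return list(candidates), []
    pattern = re.compile(exclude_regex)
    kept = [dev for dev in candidates if not pattern.search(dev)]
    dropped = [dev for dev in candidates if pattern.search(dev)]
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = filter_candidates(["eth0", "rdma0", "ib1"])
    print("moving:", kept)
    print("left on host:", dropped)
```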

Previously the script unconditionally:

- warned when nccl_topo.xml was missing under WORKSPACE_MOUNT (during both prepare preflight and start_app_container);
- warned when /workspace/nccl_topo.xml was missing inside the container (right before exec);
- injected NCCL_TOPO_FILE=/workspace/nccl_topo.xml into the container.

This is correct when the user wants to ship a pre-generated topology XML, but noisy and counter-productive when relying on NCCL's automatic sysfs-based topology discovery (which rdmaproxy populates with ibverbs/PCI/NUMA data). For that workflow there's no topology file and the warnings are spurious.

Add INCLUDE_NCCL_TOPO (default 1, current behavior). When set to 0:

- all three warnings are skipped;
- NCCL_TOPO_FILE is not added to the container env, letting NCCL auto-discover from /sys.

Caller still wins via NCCL_TOPO_FILE explicit env, since both gates require -z "${NCCL_TOPO_FILE:-}" to fire.

Co-authored-by: Cursor <cursoragent@cursor.com>
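A rough sketch of the gating described above, written in Python rather than the script's shell. The function name `topo_env` is hypothetical, and the sketch assumes a caller-supplied NCCL_TOPO_FILE is passed through unchanged, which the commit message implies but does not spell out.

```python
DEFAULT_TOPO_PATH = "/workspace/nccl_topo.xml"


def topo_env(env):
    """Hypothetical helper: NCCL_TOPO_FILE entries to inject into the container env."""
    explicit = env.get("NCCL_TOPO_FILE")
    if explicit:
        # Caller wins: the default-injection gate only fires when this is empty.
        return {"NCCL_TOPO_FILE": explicit}
    if env.get("INCLUDE_NCCL_TOPO", "1") == "0":
        # Opt out: leave the variable unset so NCCL auto-discovers from /sys.
        return {}
    return {"NCCL_TOPO_FILE": DEFAULT_TOPO_PATH}


if __name__ == "__main__":
    print(topo_env({}))
    print(topo_env({"INCLUDE_NCCL_TOPO": "0"}))
```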

NCCL's topology builder walks /sys/bus/pci/devices/ and creates NET nodes for every class-0200 (Ethernet) and class-0c06 (InfiniBand) device it finds. It then matches these topology NETs against the IB verbs plugin's device list by PCI bus ID. When a NIC-class PCI device has no matching IB device (because its netdev wasn't moved into the sandbox and it has no usable GIDs), NCCL creates a dangling NET node in the topology graph. During graph search, the dangling node triggers "Could not find NET with id N" and ncclInternalError.

On bare metal this is fine: all NICs have valid GID tables and the IB plugin can open them all. In the sandbox, only the moved netdevs' RDMA devices have GIDs, so the surplus PCI entries create the mismatch.

Fix: at sysfs construction time, cross-reference NIC-class PCI devices against the RDMA data. Only include NICs whose IB counterpart has at least one GID entry with a valid ndev (i.e., the netdev was moved). GPUs, bridges, and NVSwitch devices are unaffected by the filter.

Co-authored-by: Cursor <cursoragent@cursor.com>
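The cross-referencing step can be illustrated with a small sketch. The real change lives in gVisor's Go sysfs code; the dictionary layout below (`addr`, `class`, `pci_addr`, `gids`, `ndev`) and the function name are assumptions made purely for illustration.

```python
# PCI class prefixes treated as NIC-class, per the commit message
# (sysfs "class" files read like "0x020000").
NIC_CLASS_PREFIXES = ("0x0200", "0x0c06")


def filter_pci_devices(pci_devices, rdma_devices):
    """Keep non-NIC devices, plus NICs whose IB device has a GID with a netdev."""
    usable = {
        rdma["pci_addr"]
        for rdma in rdma_devices
        if any(gid.get("ndev") for gid in rdma["gids"])
    }
    kept = []
    for dev in pci_devices:
        is_nic = dev["class"].startswith(NIC_CLASS_PREFIXES)
        if is_nic and dev["addr"] not in usable:
            continue  # would become a dangling NET node in NCCL's graph
        kept.append(dev)
    return kept


if __name__ == "__main__":
    pci = [
        {"addr": "0000:0a:00.0", "class": "0x030200"},  # GPU, never filtered
        {"addr": "0000:0c:00.0", "class": "0x020000"},  # NIC whose netdev was moved
        {"addr": "0000:0d:00.0", "class": "0x020000"},  # NIC left on the host
    ]
    rdma = [
        {"pci_addr": "0000:0c:00.0", "gids": [{"ndev": "rdma0"}]},
        {"pci_addr": "0000:0d:00.0", "gids": [{"ndev": ""}]},
    ]
    print([d["addr"] for d in filter_pci_devices(pci, rdma)])
```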

NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_TOPO_DUMP_FILE, and NCCL_DEBUG_FILE were only passed through during the exec phase but not injected into the container environment at start (docker run) time. When the application is launched via exec, the docker exec passthrough works, but the container's own environment (visible to child processes) lacked these vars. This made it appear as if NCCL_DEBUG had no effect.

Add all four to the start-time env_args loop alongside the existing NCCL_IB_* passthrough. Also add NCCL_TOPO_DUMP_FILE and NCCL_DEBUG_FILE to the exec passthrough list.

Co-authored-by: Cursor <cursoragent@cursor.com>
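As a rough illustration of the start-time passthrough, the sketch below builds `-e NAME=value` arguments (the standard `docker run` / `docker exec` flag for setting an environment variable) for whichever of the four variables are set. The function name `env_args` echoes the loop named in the commit message, but this Python form is an assumption, not the script's code.

```python
DEBUG_PASSTHROUGH = (
    "NCCL_DEBUG",
    "NCCL_DEBUG_SUBSYS",
    "NCCL_TOPO_DUMP_FILE",
    "NCCL_DEBUG_FILE",
)


def env_args(env, names=DEBUG_PASSTHROUGH):
    """Return docker-style '-e NAME=value' args for every listed var that is set."""
    args = []
    for name in names:
        value = env.get(name)
        if value:
            args += ["-e", f"{name}={value}"]
    return args


if __name__ == "__main__":
    print(env_args({"NCCL_DEBUG": "INFO", "HOME": "/root"}))
```

Using the same list for both the start and exec paths avoids the mismatch described above, where a variable reaches one path but not the other.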

Previously, when RDMA_IFACES was set, rdma_netdev_records() short-circuited the sysfs scan and emitted placeholder records of the form:

    rdma0|manual|?|?|?|

The literal string "manual" in column 2 (ibdev) then propagated through snapshot_rdma into the snapshot file's ibdevs column, which derive_nccl_ib_hca() reads to compute NCCL_IB_HCA. The resulting env var was:

    NCCL_IB_HCA=manual

NCCL passes this to libibverbs as a device filter; it matches no real HCA, so libibverbs reports "No device found", NCCL falls back to the socket plugin, and any cross-node collective hangs because the socket plugin can't carry RDMA traffic between sandbox netns over the bridge.

Fix: always run the sysfs/ibdev2netdev scan, then use RDMA_IFACES purely as an include-list filter on the resulting records. Each kept record carries its true ibdev/port/GID metadata, so derive_nccl_ib_hca yields the real list (e.g. mlx5_0,mlx5_3,...). Fall back to placeholder records only when no sysfs/ibdev2netdev data exists at all, and warn so the operator knows to set NCCL_IB_HCA explicitly.

Co-authored-by: Cursor <cursoragent@cursor.com>
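The fix can be sketched as "scan first, filter second". The functions below are hypothetical Python stand-ins for the shell helpers named above (they are not `rdma_netdev_records` or `derive_nccl_ib_hca` themselves), and the record layout is simplified to a netdev/ibdev pair.

```python
def select_records(scanned, rdma_ifaces=None):
    """Use RDMA_IFACES purely as an include-list over already-scanned records."""
    if not rdma_ifaces:
        return list(scanned)
    wanted = set(rdma_ifaces)
    return [rec for rec in scanned if rec["netdev"] in wanted]


def hca_list(records):
    """Comma-separated, de-duplicated ibdev names in first-seen order."""
    seen = []
    for rec in records:
        if rec["ibdev"] not in seen:
            seen.append(rec["ibdev"])
    return ",".join(seen)


if __name__ == "__main__":
    scanned = [
        {"netdev": "rdma0", "ibdev": "mlx5_0"},
        {"netdev": "rdma1", "ibdev": "mlx5_3"},
        {"netdev": "eth0", "ibdev": "mlx5_9"},
    ]
    kept = select_records(scanned, ["rdma0", "rdma1"])
    print("NCCL_IB_HCA=" + hca_list(kept))
```

Because every kept record comes from the scan, the derived value contains only real device names and never a placeholder such as "manual".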

NCCL's IB plugin calls realpath("/sys/class/infiniband/<ibdev>/device") to discover the PCI bus path of each IB device, then walks UP the /sys/devices/pci*/ hierarchy to compute PCI distances between GPUs and NICs.

On real Linux, device/ is a symlink to the PCI device directory (e.g. /sys/devices/pci0000:08/.../0000:0c:00.0). In gVisor's virtual sysfs, device/ was a plain directory containing uevent, class, vendor, etc. files. realpath() on a directory returns the path itself (/sys/class/infiniband/mlx5_0/device), which is not under /sys/devices/pci*/, so NCCL's topology detection reported:

    network path /sys/class/infiniband/mlx5_0/de00c0 is not a PCI device

This caused all NICs to be attached to "first CPU" in the topology graph instead of under their actual PCI bridges. The resulting graph had no properly-placed NET nodes, and graph search failed with:

    Could not find NET with id 0

Fix:

1. In newRDMASysfsEntries: create device/ as a symlink to the PCI device's relative path (e.g. ../../../devices/pci0000:08/.../...) when a matching PCI device exists in the tree. Fall back to the old directory approach when no PCI data is available.
2. In newPCIDevicesSysfsEntries: add uevent and modalias files to NIC-class PCI device nodes so they're accessible through the symlink (NCCL reads PCI_SLOT_NAME from uevent via this path).
3. Return pciAddrToRelPath from newPCIDevicesSysfsEntries so newRDMASysfsEntries can construct the symlink targets.

Co-authored-by: Cursor <cursoragent@cursor.com>
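The difference between a plain directory and a relative symlink is easy to reproduce outside gVisor. The sketch below builds a throwaway sysfs-like tree in a temporary directory and resolves `device` the way `realpath()` would. The helper name and the shortened PCI path are illustrative assumptions, not the PR's Go code.

```python
import os
import tempfile


def build_fake_sysfs(root, ibdev, pci_rel):
    """Create <root>/class/infiniband/<ibdev>/device as a relative symlink
    into <root>/devices/<pci_rel>, and return its resolved path."""
    pci_dir = os.path.join(root, "devices", pci_rel)
    os.makedirs(pci_dir)
    ib_dir = os.path.join(root, "class", "infiniband", ibdev)
    os.makedirs(ib_dir)
    # Three levels up from class/infiniband/<ibdev>/ reaches the sysfs root.
    target = os.path.join("..", "..", "..", "devices", pci_rel)
    os.symlink(target, os.path.join(ib_dir, "device"))
    return os.path.realpath(os.path.join(ib_dir, "device"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        resolved = build_fake_sysfs(root, "mlx5_0", "pci0000:08/0000:0c:00.0")
        # The resolved path lands under devices/pci*, which is what a
        # topology walker that climbs the PCI hierarchy needs.
        print(os.path.relpath(resolved, os.path.realpath(root)))
```

With a plain directory in place of the symlink, the resolved path would stay under `class/infiniband/`, which is the failure mode described above.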

Three related UX improvements to make multi-node prepare flows robust against per-node IP-provisioning skew, which has been a recurring source of "first NCCL barrier hangs" failures.

1. RDMA_AUTO_SKIP_NO_IPV4 (default 1): instead of dying with "RDMA netdev <dev> has no IPv4 address" when a candidate netdev has no IP on this host, silently drop it from the candidate set with a stderr note. RoCE v2 needs IP routing, so a netdev without an IPv4 is unusable anyway; failing the whole prepare just forces the operator to manually exclude it. Auto-skip lets the snapshot proceed with the netdevs that are actually usable.
2. RDMA_PEER_IFACES: optional space-separated list of netdev names known to be present on the peer node. The local candidate set is intersected with this list to enforce symmetric NIC topology across ranks. NCCL requires every rank to advertise the same set of NICs; asymmetry causes the comm to deadlock at the first cross-rank collective with confusing "Could not find NET with id N" errors.
3. After prepare completes, print a clearly-formatted summary block listing the netdevs and ibdevs that ended up moved, plus a copy-pasteable "RDMA_PEER_IFACES=..." line for the other node. This eliminates the manual diff-the-route-tables workflow that has been needed to figure out the symmetric set.

Co-authored-by: Cursor <cursoragent@cursor.com>
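The selection logic in the list above can be sketched as follows. This is a Python illustration of the behavior described, not the harness's shell code; the function names, the `ipv4_by_dev` mapping, and the exact quoting of the printed hint are assumptions.

```python
def usable_candidates(candidates, ipv4_by_dev, peer_ifaces=None, auto_skip_no_ipv4=True):
    """Return (kept, skipped): drop netdevs without IPv4, then intersect with peers."""
    kept, skipped = [], []
    for dev in candidates:
        if not ipv4_by_dev.get(dev):
            if not auto_skip_no_ipv4:
                raise RuntimeError(f"RDMA netdev {dev} has no IPv4 address")
            skipped.append(dev)  # the harness notes this on stderr
            continue
        kept.append(dev)
    if peer_ifaces:
        peers = set(peer_ifaces)
        kept = [dev for dev in kept if dev in peers]
    return kept, skipped


def peer_hint(kept):
    """Copy-pasteable line for the other node (quoting style is an assumption)."""
    return 'RDMA_PEER_IFACES="' + " ".join(kept) + '"'


if __name__ == "__main__":
    addrs = {"rdma0": "10.0.0.1", "rdma2": "10.0.0.3"}
    kept, skipped = usable_candidates(["rdma0", "rdma1", "rdma2"], addrs)
    print("kept:", kept, "skipped:", skipped)
    print(peer_hint(kept))
```

Running the same selection on both nodes, each fed the other's hint line, converges on the intersection of usable netdevs.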

Iterating on rdmaproxy / topology / netdev plumbing requires a specific set of debug env vars (NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS, NCCL_DEBUG_FILE, NCCL_TOPO_DUMP_FILE, RUNSC_DEBUG, RUNSC_DEBUG_LOG, RUNSC_STRACE, RUNSC_STRACE_SYSCALLS) to be set together. Setting them by hand on every prepare/run-test invocation is error-prone: it is easy to forget one and end up debugging blind, or to be inconsistent across nodes.

Add DEBUG_ALL=1 as a one-shot that defaults all of them at once. Each individual var still wins if explicitly set on the command line (since the assignments use ${VAR:-default} syntax), so users can mix and match.

Also document the cost: strace alone slows the sandbox 5-10x, so DEBUG_ALL is for debugging only, not for performance benchmarks.

Co-authored-by: Cursor <cursoragent@cursor.com>
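The "default everything, but explicit values win" behavior can be sketched as below. Only NCCL_DEBUG=INFO comes from the commit message; the other default values are placeholders chosen for illustration, and the function name `apply_debug_all` is hypothetical.

```python
# NCCL_DEBUG=INFO is taken from the commit message; the remaining values
# are illustrative placeholders, not the script's actual defaults.
DEBUG_ALL_DEFAULTS = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "ALL",
    "RUNSC_DEBUG": "1",
    "RUNSC_STRACE": "1",
}


def apply_debug_all(env):
    """Return a copy of env with debug defaults filled in when DEBUG_ALL=1."""
    out = dict(env)
    if env.get("DEBUG_ALL") == "1":
        for name, default in DEBUG_ALL_DEFAULTS.items():
            # Mirrors shell ${VAR:-default}: unset or empty takes the default.
            if not out.get(name):
                out[name] = default
    return out


if __name__ == "__main__":
    print(apply_debug_all({"DEBUG_ALL": "1", "NCCL_DEBUG": "WARN"}))
```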

atoniolo76 added 30 commits March 25, 2026 22:39

Grant CAP_SYS_ADMIN perms on boot

8c8fed4

Remove CAP_SYS_ADMIN privilege

5f6436a

Update CLAUDE.md

8618db3

Update CLAUDE.md

4ced6e2

Clean up context and update CLAUDE.md

08e3054

Add peermem sysfs entries and create TEST.md

77db312

Update NCCL test

02b2a69

Add NCCL multinode bench test

b7c2b1b

Update custom benchmarking script

374a73a

Update TEST.md, RDMA_REFERENCE.md, and RDMA_STATUS.md

5d96114

Update RDMA_REFERENCE.md and TEST.md with debugging steps/commands for gVisor

bbe92ea

Update TEST.md with hardcoded IP address

a3c7bd9

Update CLAUDE.md

dbdf1cb

Benchmark performance difference between gVisor and runc

031f31f

Create extensive benchmark

8fc59e2

Create PCIe sysfs devices in gVisor

6be51c2

Resolve PCIe device symlinks

254a4e2

Update RUNBOOK.md

6014543

Add custom profiling, torch all reduce bench, and update RUNBOOK

4f5c36d

Add NCCL traces

36b6037

Create agent.py for quick iteration loops and update RUNBOOK.md

c49d0d3

Fix agent setup with runbook

da48ca5

Update agent.py and add helper script for run_torch_pair_node_a.sh

a2327e7

Fix literal string handling in agent.py

a052680

Update RUNBOOK.md

3305aa1

Update AGENTS.md with detailed instructions for testing and set the default runtime in agent.py to runc

07e6bca

Update agent.py and modify ioctl proxy for uverbs write()

0b44d58

Add /deploy_runsc endpoint to agent.py and make rank 1 runtime configurable in run_torch_pair_node_a.sh

462deed

Update agent.py to make git pull non-interactive

109ee52

Update agent.py to use async API for /deploy-runsc

b7b8044

atoniolo76 and others added 30 commits April 23, 2026 15:11

Revert "Remove sched_yield optimization"

3674ac2

This reverts commit 726b2df.

Remove dead UVM code

decbf5c

Add rdma_matrix.sh

9247f4a

Add --roce flag to rdma_matrix.sh for non-default interface logic

35596f2

Add flag for using --network=host in rdma_matrix.sh

529bfc8

Set network=host mode for gVisor runsc runtime in rdma_matrix.sh

83e5c38

Add command for viewing detailed InfiniBand information to rdma_matrix.sh

a476e7d

Add VXLAN bootstrap mode to RDMA matrix

d653966

Add runc baseline commands to RDMA matrix

b8b759d

Validate explicit NCCL HCA selection

f58f653

Add temporary measure to switch container netns during ibv_modify_qp syscall

65812e4

Remove netnsFD globals, check method ID on isQPModify instead of probing, implement HostFD() for nvproxy frontendFD wrappers, make extractCQQP action-aware, cleanup on proxyAsyncEventFD only if CopyOut fails, use readBufPool for Read(), change Write() log from warning to debug.

7bfbf64

Remove network namespace switching and snapshot interface information at boot

56c71ef

Exclude primary netdevs from being moved into sandboxed process

d64c812

Add note in rdma_netdev.go warning NICs in DOWN state are excluded from applyRDMANetdevSnapshot

9495687

Add rdma_netdev.go to runsc/sandbox BUILD file

6e48a10

Split rdmaproxy into vendor-neutral core and add cxproxy plug-in package for mlx5

d68b040

Add wait function to check for GID table entries for moved RDMA netdevs

06c4009

Modify sysfs startup in child process to wait on socket signal from parent process for moving netdevs into container netns

26a75be

Add numa.go to expose NUMA topology for multi-node topology discovery and update pci_devices.go to include current link speed and width attribute from sysfs.

770e5be

Co-authored-by: Cursor <cursoragent@cursor.com>

atoniolo76 wants to merge 298 commits into master from alessio/development
