
CollectiveX: experimental cross-vendor collective/EP benchmark #1896

Open
Oseltamivir wants to merge 246 commits into main from collectivex


@Oseltamivir Oseltamivir commented Jun 23, 2026

Collaborator

Adds CollectiveX under experimental/CollectiveX/ — a cross-vendor collective / expert-parallel benchmark — plus an orchestration-only workflow.

What it adds

  • Per-SKU launch adapters (launchers/launch_<sku>.sh, the launch_${RUNNER_NAME%%_*}.sh convention) that run any benchmark via a CX_BENCH selector (nccl|deepep|all) through a shared launchers/run_in_container.sh.
  • Benchmarks: run_nccl.py (stock nccl-tests → parsed flat JSON), run_deepep.py (DeepEP dispatch/combine, normal mode), env_capture.py (Layer-0 provenance), plot.py. Every result is correctness-gated and carries a topology-aware comparison_key.
  • Single multi-arch, digest-pinned container for all NVIDIA SKUs (lmsysorg/sglang@sha256:4219…, amd64+arm64); DeepEP via rebuild-deepep. See CONTAINERS.md.
  • .github/workflows/collectivex-experimental.yml: push to collectivex (paths experimental/CollectiveX/**) → GB200 NCCL smoke; workflow_dispatch → chosen sku+benchmark (B200, DeepEP, larger sweeps). Logic stays under experimental/.
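
The launch_${RUNNER_NAME%%_*}.sh convention strips everything from the first underscore of the runner name onward. A minimal Python sketch of the same resolution, for illustration only: the launchers do this in bash, and the runner name gb200-nv_01 is a hypothetical example.

```python
def launcher_for(runner_name: str) -> str:
    """Mirror bash's ${RUNNER_NAME%%_*}: drop the longest suffix matching '_*'."""
    sku = runner_name.split("_", 1)[0]  # text before the first underscore
    return f"launch_{sku}.sh"


# Hypothetical runner name; a name without an underscore is kept whole.
print(launcher_for("gb200-nv_01"))
print(launcher_for("mi355x-amds"))
```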

Validated on hardware

  • NCCL primitives: B200 (8× NVLink island) + GB200 (4× NVL72 MNNVL), 4 ops, correctness-passed, topology-keyed distinctly.
  • DeepEP dispatch/combine on GB200: correctness-gated (token conservation + combine vs DeepEP's own reference), ~154 µs roundtrip, 1.66M tok/s.
  • Local: shellcheck/bash -n, py_compile, actionlint, parser fixtures.

Notes / deferred

  • Result JSONs are gitignored (captured env embeds hostnames/UUIDs); CI uploads them as workflow artifacts. Headline numbers are summarized in CONTAINERS.md.
  • Importing the exact multi-arch digest needs the runner's registry creds (validated on the pre-staged v0.5.11-cu130).
  • Precision axes (NVFP4/MXFP8/…), low-latency EP, MoRI, EPLB, multinode DeepEP, and other collectives are captured as roadmap in plan.md, not built.

Note

Low Risk
Changes are isolated to experimental/CollectiveX/ and a read-only workflow; no production benchmark matrix or serving launchers are modified. Risk is mainly operational (self-hosted GPU time, Slurm/enroot failures) rather than app or security impact.

Overview
Introduces CollectiveX under experimental/CollectiveX/ — an experimental cross-vendor collective and MoE EP benchmark — plus orchestration-only .github/workflows/collectivex-experimental.yml. Production serving paths are untouched.

Benchmark stack: run_nccl.py wraps nccl-tests/rccl-tests into provenance-tagged JSON; run_deepep.py and run_mori.py add correctness-gated DeepEP and AMD MoRI dispatch/combine; env_capture.py, summarize.py, and plot.py handle environment capture, CI summaries, and plots. Results use topology-aware comparison_keys so unlike fabrics are not merged blindly.

Execution: Per-SKU Slurm launchers (launch_b200-dgxc.sh, launch_gb200-nv.sh, launch_b200-dgxc-slurm.sh, launch_mi355x-amds.sh) follow the same launch_${RUNNER_NAME%%_*}.sh pattern as serving, with shared common.sh (enroot squash by tag, optional CX_STAGE_DIR rsync, in-container nccl/rccl builds). CX_BENCH selects nccl, deepep, mori, or all via run_in_container.sh.

CI: Push to collectivex runs MI355X MoRI on mi355x runners; workflow_dispatch picks SKU and benchmark (GB200/B200 NCCL, DeepEP, etc.), writes markdown to the job summary, and uploads gitignored results/*.json as artifacts.

Reviewed by Cursor Bugbot for commit 871086d.

Per-SKU launch adapters (launch_<sku>.sh) that run any benchmark via a CX_BENCH selector through a shared run_in_container.sh; multi-arch digest-pinned sglang container; NCCL-primitive + DeepEP dispatch/combine benchmarks with provenance + correctness gating; and an on:push workflow (GB200 NCCL smoke; workflow_dispatch for B200/DeepEP/larger sweeps).

Validated on hardware: NCCL primitives on B200 (8x NVLink) and GB200 (4x NVL72 MNNVL); DeepEP dispatch/combine on GB200 (correctness-gated).
Comment thread experimental/CollectiveX/launchers/run_in_container.sh Outdated
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/plot.py Fixed
Comment thread experimental/CollectiveX/run_deepep.py Fixed
The GB200 on:push smoke hung 25 min in enroot import: a bare digest ref (repo@sha256:) can't form an anonymous Docker Hub token scope, so enroot prompted for a password and blocked in non-interactive CI. Import by the multi-arch TAG instead (anonymous auth works, same as the serving launchers) and add </dev/null so a missing token fails fast rather than hanging.

Use v0.5.11-cu130 (multi-arch amd64+arm64, index sha256:061fb71f…): v0.5.12-cu130's 62 layers overflow enroot's overlay-based squash creation on these nodes (failed to mount overlay … Invalid argument). v0.5.11-cu130 imports cleanly and is pre-staged on GB200.
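
The fail-fast part of this fix can be sketched in Python, assuming the import is driven through subprocess rather than bash. The helper names and the timeout_s parameter are hypothetical; only the enroot import -o <file> docker://<image> shape comes from the launcher. With stdin tied to /dev/null, a password prompt reads EOF immediately instead of blocking the job.

```python
import subprocess
import sys


def enroot_import_cmd(image, squash_file):
    """Build the import command; `image` is a tag reference (repo:tag)."""
    return ["enroot", "import", "-o", squash_file, f"docker://{image}"]


def run_noninteractive(cmd, timeout_s=1800):
    """Run cmd with stdin on /dev/null so any prompt sees EOF at once."""
    proc = subprocess.run(cmd, stdin=subprocess.DEVNULL, timeout=timeout_s)
    return proc.returncode


# Stand-in child process (not enroot): exits 0 only if stdin is at EOF.
probe = [sys.executable, "-c",
         "import sys; sys.exit(0 if sys.stdin.read() == '' else 1)"]
print(run_noninteractive(probe, timeout_s=30))
```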
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_nccl.py Outdated
On the GB200 Actions path, CX_STAGE_DIR makes the launcher rsync the tree to compute-visible Lustre and the container writes results/ there; upload-artifact reads the checkout's results/ (empty), so the green smoke produced no artifact. Add cx_collect_results to copy result JSONs from the stage dir back to the checkout after the run (no-op when no staging was used).
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/launchers/launch_gb200-nv.sh Outdated
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
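
On finding #3: deriving the node count as world_size // 8 breaks on SKUs that do not have 8 GPUs per node (a 4-GPU GB200 tray gives 4 // 8 == 0). A minimal sketch of reading it from Slurm instead; falling back to 1 when SLURM_NNODES is unset is an assumption of this sketch, not necessarily what run_deepep.py does.

```python
import os


def num_nodes(env=None):
    """Prefer Slurm's node count; assume a single node when it is absent."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_NNODES", "1"))


world_size = 4                            # one 4-GPU GB200 tray
print(world_size // 8)                    # old derivation yields 0 nodes
print(num_nodes({"SLURM_NNODES": "1"}))   # Slurm-provided count
```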
Comment thread experimental/CollectiveX/run_deepep.py Outdated
is_token_in_rank=is_token_in_rank,
num_tokens_per_expert=num_tokens_per_expert,
)
combined_x, _, _ = buffer.combine(recv_x, handle, topk_weights=recv_topk_weights)

Dispatch dtype not applied

Medium Severity

The --dispatch-dtype / CX_DISPATCH_DTYPE value is stored in result metadata but never used when building inputs or calling buffer.dispatch. Runs always use bfloat16 token tensors regardless of fp8 vs bf16, so provenance and comparison keys can describe a different shape than what was measured.

Reviewed by Cursor Bugbot for commit b384171.

summarize.py --markdown emits GitHub-flavored markdown tables (NCCL + DeepEP); a per-job 'Results summary' workflow step appends it to $GITHUB_STEP_SUMMARY so the run page shows a rendered table (per the GitHub job-summaries feature). Plain-text mode still drives the in-container result gate.
--timestamp "$TS" || cx_log "WARN: parse $op failed"
done

cx_log "done — JSON artifacts under $CX_DIR/results/"

Multinode launcher ignores failures

High Severity

The B200 multinode adapter logs warnings when srun or run_nccl.py fail but always exits successfully. Unlike run_in_container.sh, it never runs summarize.py as a non-zero gate, so workflow_dispatch on b200-multinode can finish green with no valid NCCL results.

Reviewed by Cursor Bugbot for commit f48daed.

run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
- name: Results summary
if: always()
run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"

Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).
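
One way to close this gap is a gate that treats an empty results directory as a failure. The sketch below is not summarize.py; the gate function and the per-document "valid" field are hypothetical stand-ins for whatever the result JSON actually carries.

```python
import json
import tempfile
from pathlib import Path


def gate(results_dir):
    """Return 0 only if at least one result exists and every result is valid."""
    docs = []
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            with open(path) as fh:
                docs.append(json.load(fh))
        except (OSError, json.JSONDecodeError):
            docs.append(None)  # unreadable or malformed counts as invalid
    if not docs:
        return 1  # nothing measured, or nothing copied back
    ok = all(isinstance(d, dict) and d.get("valid") is True for d in docs)
    return 0 if ok else 1


# Demo: an empty results directory must fail the gate.
with tempfile.TemporaryDirectory() as empty_dir:
    print(gate(empty_dir))
```

A workflow step would run such a gate on the checkout's results/ after the launch step and propagate its return value as the exit code.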

Reviewed by Cursor Bugbot for commit f48daed.

dst="$repo_root/experimental/CollectiveX/results"
mkdir -p "$dst"
cp "$mount_src/experimental/CollectiveX/results/"*.json "$dst/" 2>/dev/null || true
cx_log "copied results from stage dir -> $dst (for artifact upload)"

Result copy errors ignored

Medium Severity

cx_collect_results wraps the staged-to-checkout cp in 2>/dev/null || true and always logs success, so a failed or empty copy does not affect the launcher exit code and the workflow can pass without uploadable JSON.
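
A copy-back that surfaces failure could look like the following Python sketch. It is an illustration of the behavior the finding asks for, not the bash cx_collect_results itself; the function name and signature are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def collect_results(stage_results, dst):
    """Copy staged result JSONs into dst; raise if there is nothing to copy."""
    sources = sorted(Path(stage_results).glob("*.json"))
    if not sources:
        raise FileNotFoundError(f"no result JSON under {stage_results}")
    Path(dst).mkdir(parents=True, exist_ok=True)
    for src in sources:
        shutil.copy2(src, Path(dst) / src.name)  # copy errors propagate
    return len(sources)


# Demo: an empty stage directory is reported instead of logged as success.
with tempfile.TemporaryDirectory() as root:
    try:
        collect_results(Path(root), Path(root) / "results")
    except FileNotFoundError as exc:
        print("copy-back failed:", exc)
```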

Reviewed by Cursor Bugbot for commit f48daed.

First AMD / cross-vendor reach, scaffolded ahead of Milestone 1:

- run_mori.py: MoRI dispatch+combine (normal mode), correctness-gated,
  mirroring ROCm/mori's dispatch_combine example — int32 routing indices,
  (n,0) fp8 scales, the zero-copy registered-combine-input-buffer staging
  step, and expected = input x (#unique destination ranks). Emits the same
  flat JSON shape (family=moe, backend=mori) with CUDA-event timing.
- launchers/launch_mi355x-amds.sh: AMD adapter — partition compute, no
  account, --cpus-per-task=128, node-local /var/lib/squash imported via srun
  on the allocated node, --container-writable --container-remap-root, forces
  CX_BENCH=mori, mounts the (compute-visible) checkout at /ix.
- launchers/run_in_container.sh: run_mori_suite + mori case (nccl|deepep|mori|all).
- launchers/common.sh: ROCm MoRI image (rocm/sgl-dev:...-mori-0227-2) in
  cx_default_image for mi355x*/mi350x*/mi325x*/mi300x*.
- workflow: mi355x sku + mori benchmark options for workflow_dispatch.
- docs: CONTAINERS.md AMD section, README files/run/risks, plan.md status.

Not yet hardware-validated (no MI355X access) — MoRI's Python API is
version-sensitive (marked ADAPT HERE); the first runner job is the
validation, as GB200 was for DeepEP. The ROCm image isn't digest-pinned yet.
Comment thread experimental/CollectiveX/run_mori.py Fixed
- workflow: replace the on:push GB200 NCCL smoke with the MI355X MoRI
  dispatch/combine run (runs-on: mi355x, CX_BENCH=mori), and name the job
  "CollectiveX Experimental" (no longer "smoke"). GB200/B200 NCCL + DeepEP
  remain on workflow_dispatch.
- launch_mi355x-amds.sh: adapt more faithfully to runners/launch_mi355x-amds.sh
  — squeue by job-name only (no -u), flock -w 600, and clear ROCm gpucore.*
  dumps after the run so the next checkout is clean. Bump default CX_TIME to 60
  for a cold ROCm-image import.
- summarize.py: drop the "N/N results valid." footer from both the job-summary
  (markdown) and plain output; the failure gate still reports invalid results.
  Relabel the MoE section "MoE dispatch+combine (DeepEP / MoRI)".
- docs: README/plan describe push -> MI355X MoRI.
rm -f \"$SQUASH_FILE\"
enroot import -o \"$SQUASH_FILE\" \"docker://$IMAGE\" </dev/null
fi
"

MI355X import errors ignored

High Severity

The node-local enroot import runs inside an srun bash snippet without set -e and with no check after import. A failed import still yields exit 0 from that snippet, so the job continues into pyxis with a missing or corrupt squash file.
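
In the bash snippet the direct fix is set -e plus a check on the squash file. The same two checks, sketched in Python under the assumption that the import command is passed in as a list; checked_import is a hypothetical helper, and the demo uses a stand-in child process rather than enroot.

```python
import os
import subprocess
import sys
import tempfile


def checked_import(cmd, squash_file):
    """Run an import command; fail on non-zero exit or missing/empty output."""
    subprocess.run(cmd, stdin=subprocess.DEVNULL, check=True)
    if not os.path.isfile(squash_file) or os.path.getsize(squash_file) == 0:
        raise RuntimeError(f"import produced no usable file: {squash_file}")
    return squash_file


# Demo: a command that exits 0 but writes nothing is still rejected.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "image.sqsh")
    try:
        checked_import([sys.executable, "-c", "pass"], target)
    except RuntimeError as exc:
        print("rejected:", exc)
```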

Reviewed by Cursor Bugbot for commit d8ee9bf.

- name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }}
env:
RUNNER_NAME: ${{ runner.name }}
run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"

Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.

Reviewed by Cursor Bugbot for commit d8ee9bf.

… default)

First MI355X run reached the MoRI dispatch kernel — salloc, ROCm-image import,
mount, torchrun, 8-rank Gloo + shmem init, and EpDispatchCombineConfig/op/dispatch
all worked, confirming the API signatures. It OOM'd MoRI's default 2 GiB static
symmetric heap (hidden=7168 dispatch/combine buffers across 8 ranks request
~0.9 GiB each).

run_mori.py now sets MORI_SHMEM_HEAP_SIZE before `import mori` (default 16 GiB,
override CX_MORI_HEAP_BYTES). Docstring + CONTAINERS.md record the finding;
correctness/timing validated by the heap-sized re-run.

salloc --partition="$PARTITION" --exclude="$EXCLUDE_NODES" --gres=gpu:"$NGPUS" \
--exclusive --cpus-per-task=128 --time="$TIME_MIN" --no-shell --job-name="$RUNNER_NAME"
JOB_ID="$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)"

Slurm job ID not scoped

Medium Severity

launch_mi355x-amds.sh resolves JOB_ID with squeue --name="$RUNNER_NAME" and no -u "$USER", while the other CollectiveX NVIDIA launchers filter by user. On a shared cluster, the first matching job name may belong to another account, so subsequent srun/scancel can target the wrong allocation.
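
A user-scoped lookup, sketched in Python for illustration. The squeue flags (--name, -u, -h, -o %A) are standard Slurm options; the helper names are hypothetical, and the launcher itself does this inline in bash.

```python
import getpass


def squeue_cmd(job_name, user=None):
    """Build a squeue query limited to one job name and one user."""
    user = user or getpass.getuser()
    return ["squeue", f"--name={job_name}", "-u", user, "-h", "-o", "%A"]


def first_job_id(squeue_stdout):
    """Return the first job id in squeue output, or None if there is none."""
    for line in squeue_stdout.splitlines():
        if line.strip():
            return line.strip()
    return None


print(squeue_cmd("mi355x-amds_0", user="ci"))  # hypothetical name and user
print(first_job_id("12345\n12346\n"))
```

Returning None on empty output lets the caller fail before any srun or scancel is issued against an unresolved allocation.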

Reviewed by Cursor Bugbot for commit ac3f1b9.

The heap-bump run cleared the 2 GiB OOM but then failed registering the 16 GiB
symmetric heap as an RDMA memory region (errno 22 EINVAL, size=17179869184).
ROCm/mori's reference test uses MORI_SHMEM_HEAP_SIZE="6G" single-node — big
enough for the hidden=7168 dispatch/combine buffers, small enough to register.

Match it: default "6G" (override CX_MORI_HEAP_SIZE). The rest of the config
already matches the reference (max_num_inp_token_per_rank=4096, hidden=7168,
backend cpu:gloo,cuda:nccl), so this lands on the proven single-node setup.
Drove run_mori.py to a correct run on 8x MI355X (on-node via salloc+srun):
dispatch+combine numerically correct (combine within tol, max_rel ~2e-3),
~85us round-trip at the decode shape. The first runs surfaced five issues,
all fixed and re-validated:

- RDMA MR ceiling: MoRI registers the WHOLE symmetric heap as one RDMA MR at
  init (even single-node; no disable-RDMA knob). The ionic_rdma NICs cap GPU
  MRs at ~4 GiB — a 6 GiB heap fails (RegisterRdmaMemoryRegion errno 22), 2 GiB
  registers. Hold heap at MORI_SHMEM_HEAP_SIZE=2G (override CX_MORI_HEAP_SIZE).
- Buffer sizing: max_num_inp_token_per_rank 4096 -> max(512, n) so the buffers
  fit the 2 GiB heap (4096 was inherited from the reference test).
- Correctness shape: combine returns the full max-token buffer; compare only
  combined[:n] against expected.
- recv count: read total_recv BEFORE combine (combine resets recv_num, which
  made recv_nonzero a false negative).
- Teardown: MoRI's shmem teardown asserts (CheckStatusValid -> SIGABRT) when the
  op is destroyed after shmem_finalize(); hard-exit after writing results.

Docs (README/plan/CONTAINERS) updated from "scaffolded" to validated, with the
fabric constraints recorded.
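
A compact sketch of three of these fixes (heap size, buffer sizing, read-before-combine), with no MoRI calls involved. The helper names are hypothetical and the two callables stand in for the real count read and combine op; only MORI_SHMEM_HEAP_SIZE, CX_MORI_HEAP_SIZE, the 2G default, and max(512, n) come from the notes above.

```python
import os

# Set before `import mori`, as run_mori.py does per the notes above.
# CX_MORI_HEAP_SIZE is the override; "2G" is the recorded default.
os.environ["MORI_SHMEM_HEAP_SIZE"] = os.environ.get("CX_MORI_HEAP_SIZE", "2G")


def max_tokens_per_rank(n):
    """Size buffers to the run, with a 512-token floor."""
    return max(512, n)


def recv_then_combine(read_total_recv, combine, n):
    """Read the receive count first, then combine and keep the first n rows."""
    total_recv = read_total_recv()   # combine resets this counter
    combined = combine()             # returns the full max-token buffer
    return total_recv, combined[:n]  # compare only the rows actually sent


print(os.environ["MORI_SHMEM_HEAP_SIZE"], max_tokens_per_rank(8))
```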
Comment thread experimental/CollectiveX/run_mori.py Fixed
Comment thread experimental/CollectiveX/run_mori.py Fixed
…CH=nccl)

Adds the AMD collective-primitive path so all_reduce/reduce_scatter/all_gather/
alltoall run on MI355X, not just MoRI:

- common.sh: cx_build_rccl_tests — clones ROCm/rccl-tests and builds with `make`
  against /opt/rocm (amdclang++/librccl). It's a nccl-tests fork producing the
  same <op>_perf binaries and output format, so run_nccl.py parses it unchanged.
  Validated building + running all 4 ops in-container on MI355X (correctness OK).
- run_in_container.sh: run_nccl_suite picks rccl-tests on ROCm (/opt/rocm or
  hipcc), nccl-tests otherwise; identical op loop + run_nccl.py invocation.
- launch_mi355x-amds.sh: honor CX_BENCH (mori default | nccl) instead of forcing
  mori; same -g N single-node 8-GPU launch.
- docs: README/CONTAINERS note the rccl path.

B200 already has the nccl path; this makes primitives available on all three
SKUs via workflow_dispatch.
Comment thread experimental/CollectiveX/launchers/launch_mi355x-amds.sh
if name:
devices.append(name)
elif _run(["ibstat", "-l"]):
devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]

ibstat fallback may crash capture

Low Severity

In _rdma, the ibstat -l branch calls _run twice. If the first call succeeds but the second returns None, None.splitlines() raises and env_capture.py aborts before writing provenance JSON for that run.
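
The crash goes away if the command is run once and its output reused. A minimal sketch; the _run below is a hypothetical stand-in for the helper in env_capture.py, assumed to return stdout or None.

```python
import subprocess


def _run(cmd):
    """Return stdout of cmd, or None when it is missing or fails."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout


def ibstat_devices(run=_run):
    """Call `ibstat -l` once and parse that single result."""
    out = run(["ibstat", "-l"])
    if not out:
        return []
    return [d.strip() for d in out.splitlines() if d.strip()]


# With a fake runner, so the sketch does not depend on ibstat being installed.
print(ibstat_devices(lambda cmd: "mlx5_0\nmlx5_1\n"))
```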

Reviewed by Cursor Bugbot for commit 2b23573.

…on-node

launch_gb200-nv.sh now branches on CX_NODES: 1 (default) keeps the single-tray
4-GPU dispatcher path; >1 runs across the NVL72 NVLink fabric (e.g. CX_NODES=2
= 8 GPU) by building nccl-tests MPI=1, running each op across WORLD ranks via
`srun --mpi=pmix` (1 GPU/rank) with the MNNVL env, and parsing on the login node
— mirroring launch_b200-dgxc-slurm but staying on NVLink instead of IB.

Validated on GB200 (2x watchtower-navy trays, 8 GPU): all 4 ops valid, peak
busbw all_reduce 822.8 / reduce_scatter 670.6 / all_gather 651.2 / alltoall
625.0 GB/s — ~30% over single-tray and on par with B200 8-GPU NVLink, i.e.
MNNVL engaged (not an IB fallback).
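
For readers comparing these numbers: busbw is the figure nccl-tests derives from measured algorithm bandwidth with a per-collective factor, so that results are comparable across rank counts. A small sketch of that conversion as documented by nccl-tests; the function name and the 100 GB/s input are illustrative, not measurements from this PR.

```python
def busbw_gbps(op, algbw_gbps, nranks):
    """Convert algorithm bandwidth to bus bandwidth using nccl-tests' factors."""
    factors = {
        "all_reduce": 2 * (nranks - 1) / nranks,
        "reduce_scatter": (nranks - 1) / nranks,
        "all_gather": (nranks - 1) / nranks,
        "alltoall": (nranks - 1) / nranks,
    }
    return algbw_gbps * factors[op]


# Illustrative input only: 100 GB/s algbw on 8 ranks.
print(busbw_gbps("all_reduce", 100.0, 8))
print(busbw_gbps("all_gather", 100.0, 8))
```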

- common.sh: cx_build_nccl_tests auto-detects MPI_HOME for MPI=1 (Debian OpenMPI
  headers live under /usr/lib/<arch>/openmpi/include; MPI_HOME=/usr fails). Works
  x86_64 + aarch64.
- launch_b200-dgxc-slurm.sh: fix BUILD_IN_CTR path (.nccl-tests/nccl-tests/build).
- workflow: add `nodes` dispatch input -> CX_NODES.
…e gated.md/CONTAINERS.md claims (gb300 flashinfer+hybrid clean, deepep IS bundled, wheel output_dtype); gitignore results/aggregate/
…uant-combine); arch-gate combine dtypes (nvfp4 combine rejected on Hopper at validate)
…combine asserts SM>=100); capability gates quantized combine to Blackwell
…rmittent not a fabric wall (32/32 this run); flashinfer coverage variance 30->42/46
…via the marker-commit path (same mechanism as the sweep workflow)
…l result MEASURED (h100 all-backends + mi355x, p99 ratios ~1.0, noise-dominated); single profile = normal (100 cases saved)
…:4 minimal envelope, clamped at the 512-token MR ceiling); union overlapping suite ladders into one case (no dropped points, no duplicate same-config docs)
…ministic walls (h200 flashinfer pidfd, uccl aarch64) never dispatch; failed cases preserve records without failing the shard
…d on only the LAST case's files, flipping complete shards red when the trailing case was a failing diagnostic
…publication bundle (validate-all-or-abort) in the sweep aggregate job

- ep_harness stamps schema_version=4 (docs have carried every v4 field since the contract landed; run_in_container's failed-case stamp went 4 in 3dbacd1); schema minimum stays 3 so history validates
- ep-result-v4.schema.json: backend enum gains nccl-ep (700 aggregate docs were schema-invalid under jsonschema); record_type=failed-case skeletons get their own if/then branch (judge-by-data records were unvalidatable)
- validate_results: failed-case records validated as skeletons (fallback path no longer flags them); KNOWN_CONTRACTS synced with the schema enum (mori-quant-combine-v1 reserved)
- make_bundle.py: the previously-checked-but-absent publication bundle — validates EVERY aggregate doc (schema + semantic gates) or aborts, then emits manifest.json (source-run provenance, coverage, validation counts) + report.html + SUMMARY.md + SHA256SUMS; sweep aggregate job runs it and uploads cxsweep-bundle-*; raw aggregate upload is if:always so a validation failure never loses data
…lation.md)

Does EP microbench roundtrip p99 predict serving tok/s? Pre-registered A/B on the
existing dsr1 sglang recipes — vary ONLY moe-a2a-backend/deepep-mode (the recipes
already flip the exact kernels the microbench times, in the same pinned container),
join per-rank decode T to microbench ladder points under the cached-layout contract,
measure rank agreement + ITL regression + in-situ inflation factor via a profiler
window. Companion overlapped-gemm-v1 contract (reuse the copy_engine_bench GEMM
victim) closes the comm-in-isolation critique independently of serving. Falsification
is a publishable Decision-tab result, not a failure.
…-side concentration blows the 2GiB-heap envelope, rc=124 even at 8:1:4; spreading routings clean to T=512)
…oRI; mori fp8 + model shapes enter the sweep matrix (mi355x shard 15 -> 30 cases)

Three filters (none of them capability walls) kept MI355X thin:
- backends.yaml mori dtypes=[bf16] had drifted from capability.py's bf16+fp8, so the
  validated e4m3fnuz direct-cast path (run 28318788729) never entered the matrix. Enabled;
  the harness now runs T<2 points UNSCORED at fp8 — the forced-T=1 gradual-ramp point's
  single-token relErr instability (a metric artifact, not a comm error) was what flipped
  whole fp8 docs invalid. bf16 emission is byte-identical to before.
- ep-models-v1 pins runtime-visible-v1, which MoRI cannot honor, so AMD was silently absent
  from every model shape. ep-models-amd-v1 runs the same 5 workloads on mi355x/mori under
  the cross-vendor common contract (comparison_key keeps contracts distinct).
- the push smoke was mori-only; it is now a 2-leg matrix (mori EP + nccl bench, which
  auto-selects rccl-tests on ROCm) so the RCCL all_reduce/all_gather/reduce_scatter/alltoall
  primitives stay as fresh as the EP line — NCCL-vs-RCCL on identical test binaries is the
  cleanest cross-vendor anchor the benchmark has.
…ned MoRI build's LL-kernel surface

- offload was gated ["nvidia"] with no evidenced wall — it uses only torch.cuda.* APIs
  (pin_memory/events/streams), the standard HIP aliases on ROCm torch, the same surface
  copy_engine_bench already runs green on MI355X. Enabled in capability + the mi355x
  launcher allowlist; judged by the dispatched run's artifact.
- upstream MoRI HAS low-latency kernels (test_dispatch_combine_async_ll.py + the documented
  HT/LL adaptive switch), so the adapter's normal-only is NOT a vendor property. The
  self-introspection probe now prints EpDispatchCombineKernelType members + ll/async attrs,
  so the next MI355X log answers whether the pinned mori-0227-2 build exposes LL; mode=ll
  wiring follows once a build confirms it.
…ability.py)

make_parity.py renders the per-axis NVIDIA/AMD parity table from the same capability
tables the matrix compiler enforces (mechanical rows can't drift; --check is CI-able),
with each gap classed platform / library / build / unwired and its evidence cited.
README scopes the cross-vendor claim to the common contract and points here. Honest
caveats stated: one AMD SKU, MoRI stability envelope, AMD sweep history still accruing.
… baseline anchors both vendor stacks in the same sweep

AMD-native targets were mori-only, so the one backend that runs identically on both
vendors (torch.distributed all_to_all_single over NCCL/RCCL) never swept on AMD despite
capability allowing it and the launcher supporting it. mi355x now resolves two shards:
mori (30 cases incl fp8) + nccl-ep (27, bf16-only — capability filters fp8 correctly).
Single-backend runs are unchanged (nccl-ep is added only when the requested set asks
for it); NVIDIA shards unchanged.
The runner audit found idle mi300x-amds and mi325x-amds pools — the 'one AMD SKU'
caveat was softer than documented. Thin launcher wrappers over the MI355X adapter
carry each cluster's deltas (mi300x: shared squash /home/gharunner/gharunners/squash +
chi-mi300x-049 exclude; mi325x: /raid/squash), both partition compute. RCCL lane first:
rccl-tests builds arch-native in-container; MoRI EP on CDNA3 stays out of the sweep
suites until an image/arch probe passes (the pinned MoRI build targets gfx950) —
platforms.yaml entries carry empty validated sets. sku choices + parity caveat updated.
…3 (enroot userns denied)

Probe evidence (runs 28596592604 / 28596595613): the mia1-* exclude default leaked into
the mi325x salloc as 'Invalid node name' (empty CX_EXCLUDE_NODES falls through :- to the
mi355x default) — the default is now scoped to mi355x runners and --exclude is omitted
when empty. mi300x: chi-mi300x-043 denies enroot user namespaces (pyxis container start
fails; serving runs the same flags on other nodes of this cluster) — excluded, same
node-specific-pyxis class as mia1-p01-g09.
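
The leak comes from bash's ${VAR:-default}, which substitutes the default when the variable is unset or empty. A Python sketch of the corrected behavior: the default applies to mi355x runners only and --exclude is dropped when there is nothing to exclude. The function name and the node names in the demo are hypothetical.

```python
def exclude_args(runner_name, env, mi355x_default):
    """Return salloc --exclude args, or [] when there is nothing to exclude."""
    value = env.get("CX_EXCLUDE_NODES", "")
    if not value and runner_name.startswith("mi355x"):
        value = mi355x_default  # default scoped to mi355x runners only
    return [f"--exclude={value}"] if value else []


# Hypothetical node names, for illustration.
print(exclude_args("mi325x-amds_0", {}, "node-a"))
print(exclude_args("mi355x-amds_0", {"CX_EXCLUDE_NODES": ""}, "node-a"))
```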
…5x RCCL primitives valid

mi300x: enroot-nsenter 'failed to create user namespace: Permission denied' on TWO
different nodes (chi-mi300x-043 run 28596592604, chi-mi300x-057 run 28601041154) — enroot's
unprivileged runtime needs userns clone, so no pyxis flag helps; needs an admin
sysctl/apparmor fix. Pool is dormant (no recent serving runs), consistent with unnoticed
config rot. Wrapper stays wired + gated note in platforms.yaml/gated.md.

mi325x: rccl-tests all four primitives VALID (run 28601042764: all_reduce 302.0 /
all_gather 292.4 / reduce_scatter 312.1 / alltoall 299.9 GB/s peak busbw, 31 sizes). Same
run's env_capture proves the mi35x-targeted image runs on gfx942 (torch 2.9.1+rocm7.2,
8x MI325X visible) — the torch.cuda-alias bench family needs no image switch.
…es' root causes

VALID (judged by artifacts): nccl-ep (correct=True, decode 1..128 + prefill 128..4096),
kv-cache (12 groups), nccl-kv, copy-engine (28 rows), rl-mesh (4 groups), offload (36
rows — the un-gated AMD offload works). With rccl primitives, that's 7 of 10 lanes live
on mi325x in one evening.

Failures, each with a discriminated cause + fix:
- mori: RegisterRdmaMemoryRegion errno=22 at the 2GiB heap — the container's libibverbs
  cannot drive this cluster's Broadcom bnxt_re NICs (kernel ABI 8 vs supported 1), and
  the node ALSO has mlx5 devices MoRI should use instead -> MORI_RDMA_DEVICES=mlx5_0,mlx5_1
  probe in the mi325x wrapper (next rungs: exclude-all, smaller heap, rdma-core upgrade).
- mori-io: same ibverbs path (connect timeout) — covered by the same probe.
- allreduce-fw: aiter SIGSEGV on gfx942 (mi35x image ships gfx950 aiter) killed the whole
  torchrun AFTER the NCCL baseline had been measured; the single end-of-run write lost it.
  Fixed twice over: (1) the doc is now written incrementally+atomically after EVERY impl,
  so an uncatchable signal preserves everything measured before it; (2) allreduce-fw on
  mi325x/mi300x switches to the serving fleet's gfx942 image (bench-scoped switch, same
  pattern as nixl).
- provenance: topology_class was hardcoded mi355x-xgmi for all AMD runners (it is part of
  comparison_key) — now derived from the runner prefix; the first mi325x docs carry the
  wrong label and their re-runs will supersede them.
…ranode XGMI

mori + mori-io fail at the 2GiB symmetric-heap RDMA MR (ibv_reg_mr of GPU memory
-> errno=22 EINVAL) on this cluster's bnxt_re (ABI 8, undriveable) / mlx5 NICs — the
no-GPUDirect wall. But mi325x EP is a single-node 8-GPU XGMI island that needs no RDMA:
  * MORI_ENABLE_SDMA=1 routes same-host peers through the AMD SDMA engine over XGMI
    (context.cpp TransportType::SDMA); the heap is registered as an RDMA MR only when a
    peer is RDMA-classified (symmetric_memory.cpp), so no NIC is touched.
  * MORI_DISABLE_AUTO_XGMI=0 enables mori-io's XGMI-only backend fallback (engine.cpp),
    doing GPU<->GPU over hip P2P instead of ibverbs.
MORI_RDMA_DEVICES=mlx5_0,mlx5_1 kept as the fallback for any residual ibverbs path.
MoRI info logging on by default so the run self-documents the per-peer transport decision.

Also fix the image resolution in launch_mi355x-amds.sh: it hardcoded 'mi355x' and ran
before CX_BENCH was set, so the mi325x/mi300x wrappers never got the SKU-correct image and
the allreduce-fw gfx942 switch never fired (run 28606335663 ran the gfx950 MoRI image ->
aiter SIGSEGV). Resolve IMAGE after CX_BENCH is final, keyed on the actual RUNNER_NAME.
… (inert behind the userns gate, ready for day-one)
…1, dmabuf-first MR + anyRdmaPeer gate); the 0227 image predates it and hits the plain-ibv_reg_mr GPUDirect wall on this fabric
… need P2P/XGMI direct peer access; SDMA=1 wedged the first dispatch 30min); mori-io keeps SDMA=1 (validated xgmi). Per-bench SDMA split + bring-up fail-fast timeout
…fault uncached heap's IPC-mapped peer memory isn't coherent for the intranode barrier's system-scope cross-device atomics on CDNA3/gfx942 (first-dispatch deadlock); hipMalloc heap fixes coherence. gfx950 unaffected
…s tuned path) instead of the IntraNode direct-peer barrier that deadlocks at T=1 on CDNA3 — split dispatch/combine into send+recv halves, SDMA=1
…e default IntraNode direct-peer barrier deadlocks at T=1 on CDNA3, AsyncLL's SDMA copy path is upstream's tuned gfx942 EP path (decode 28619828789 + prefill 28619974616, T=1..512 all correct=valid); mark validated (ep_degrees=[8], backends=[mori]) and wire mi325x into the core EP suites (smoke/nightly/models-amd + models-v1 for symmetry); revert bring-up fail-fast timeout
# Conflicts:
#	.github/workflows/collectivex-sweep.yml