CollectiveX v1: cross-vendor EP benchmark suite #2004
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
  || cx_die "staging CollectiveX failed"
🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (relative), but cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded and drops the shard file — so for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage but only after GPU allocation was spent on the wrong workload. Cheap fix: allow-list the shard file through the rsync (--include='experimental/CollectiveX/results/' --include='experimental/CollectiveX/results/.shard_*.json' before the results/ exclude), copy the shard file into the stage dir after the rsync, or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).
Extended reasoning...
The bug
The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:
# .github/workflows/collectivex-sweep.yml
env:
CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json # RELATIVE path
...
- name: Extract shard from matrix artifact
working-directory: experimental/CollectiveX
run: |
...
json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))
The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.
For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:
# launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"
which rsyncs the tree with an exclude that drops results/:
# experimental/CollectiveX/runtime/common.sh:145-150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"
Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.
The consequence at runtime
The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:
# runtime/run_in_container.sh:458
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
  # SHARD mode: sweep every scheduled case
  ...
else
  # Single-bench (workflow_dispatch) path
  # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
  # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...
The file resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it. The test therefore fails and the else branch runs a single default case that carries none of the shard's identity: one case executes in place of the N that were scheduled.
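The failure mode can be reproduced without rsync or a container. The following is a minimal Python sketch, not the project's code: `stage_tree` and `shard_visible` are hypothetical helpers, `shutil.ignore_patterns` stands in for `--exclude='results/'`, and `.shard_demo.json` is an invented file name.

```python
import shutil
import tempfile
from pathlib import Path


def stage_tree(src: Path, dst: Path) -> None:
    """Copy src to dst, skipping entries named 'results' or '__pycache__'.

    Stand-in for the staging rsync with --exclude='results/'.
    """
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("results", "__pycache__"))


def shard_visible(workdir: Path, shard_rel: str) -> bool:
    """Mimic the `[ -f "$CX_SHARD_FILE" ]` guard: resolve relative to workdir."""
    return (workdir / shard_rel).is_file()


with tempfile.TemporaryDirectory() as tmp:
    # Original checkout, where the workflow wrote the shard file.
    checkout = Path(tmp) / "checkout" / "experimental" / "CollectiveX"
    (checkout / "results").mkdir(parents=True)
    (checkout / "results" / ".shard_demo.json").write_text('{"cases": []}')

    # Staged copy that the container would mount.
    staged = Path(tmp) / "stage" / "experimental" / "CollectiveX"
    staged.parent.mkdir(parents=True)
    stage_tree(checkout, staged)

    in_checkout = shard_visible(checkout, "results/.shard_demo.json")
    in_stage = shard_visible(staged, "results/.shard_demo.json")
    print(in_checkout, in_stage)  # True False
```

The same relative path is valid in the checkout and invalid in the staged tree, which is exactly the condition that sends execution to the else branch.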
Why the rack (EP8) paths escape
The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:
# launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
local sf="${CX_SHARD_FILE:-}"
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"
The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path, which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount, hits the missing file.
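The fallback pattern the rack launchers use (try the path as given, then retry against a known root) can be sketched in Python as follows. This is illustrative only: `resolve_shard_file` and its parameters are hypothetical names, and the sketch assumes the fallback root is a tree that still contains the shard file.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def resolve_shard_file(
    shard: str, fallback_roots: Iterable[Path], cwd: Optional[Path] = None
) -> Optional[Path]:
    """Return the first existing location of `shard`, or None.

    The path is tried relative to `cwd` first (as the shell guard does),
    then relative to each fallback root in order.
    """
    if not shard:
        return None
    base = cwd if cwd is not None else Path.cwd()
    direct = Path(shard) if Path(shard).is_absolute() else base / shard
    if direct.is_file():
        return direct
    for root in fallback_roots:
        candidate = root / shard
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    checkout = Path(tmp) / "checkout"
    staged = Path(tmp) / "staged"
    (checkout / "results").mkdir(parents=True)
    staged.mkdir()
    (checkout / "results" / ".shard_demo.json").write_text("{}")

    # The relative path misses in the staged tree but hits in the checkout.
    found = resolve_shard_file("results/.shard_demo.json", [checkout], cwd=staged)
    found_in_checkout = found is not None and found.parent.parent == checkout
    # A shard that exists nowhere resolves to None instead of raising.
    not_found = resolve_shard_file("results/.shard_other.json", [checkout], cwd=staged)
```

Returning `None` rather than silently falling back to a default lets the caller decide whether a missing shard file is an error.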
Affected sweeps
Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:
- b300 (all shards; launch_b300.sh is single-node)
- gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)
- gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)
The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.
Concrete walk-through (b300 shard)
- Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).
- Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.
- launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.
- srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.
- run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.
- Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.
- One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.
- Aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...). The whole aggregate fails, after b300 GPU-time was spent on the wrong workload.
Impact
For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.
Fix
Any one of:
1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146):
rsync -a --delete --delete-excluded \
  --include='experimental/CollectiveX/results/' \
  --include='experimental/CollectiveX/results/.shard_*.json' \
  --exclude='__pycache__/' --exclude='results/' ...
2. Copy the shard file into the stage dir after the rsync completes:
[ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ] \
  && cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
  "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"
3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:
sf="${CX_SHARD_FILE:-}"
# $CX_DIR is not set inside the container; use the fixed workdir instead.
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
  && sf="/ix/experimental/CollectiveX/$sf"
if [ -n "$sf" ] && [ -f "$sf" ]; then
...
Approach (1) or (2) is the smallest change with the least surface area.
elif _run(["ibstat", "-l"]):
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
    return {
🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.
Extended reasoning...
The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:
listing = _run(["ibv_devinfo", "-l"])  # assigned once, iterated once
if listing:
    for line in listing.splitlines()[1:]:
        ...
elif _run(["ibstat", "-l"]):  # called once (as a truthiness check)
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN
The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.
Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.
Step-by-step proof of the theoretical crash path:
- Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).
- First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.
- Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.
- None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.
Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.
The fix. One-line refactor to mirror the ibv_devinfo branch:
else:
    listing = _run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]
Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.
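The difference between the two idioms can be demonstrated with a stubbed runner. This is a minimal sketch under stated assumptions: `flaky_run`, `list_devices_double_call`, and `list_devices_single_call` are hypothetical names, and the stub replays canned outputs (returning `None` once they run out) instead of invoking `ibstat`.

```python
from typing import Callable, List, Optional, Sequence

Runner = Callable[[List[str]], Optional[str]]


def flaky_run(outputs: Sequence[Optional[str]]) -> Runner:
    """Build a stub runner that returns the given outputs one call at a time."""
    remaining = iter(outputs)

    def run(cmd: List[str]) -> Optional[str]:
        return next(remaining, None)

    return run


def list_devices_double_call(run: Runner) -> List[str]:
    """Shape of the flagged code: the command runs twice."""
    if run(["ibstat", "-l"]):
        return [d.strip() for d in run(["ibstat", "-l"]).splitlines() if d.strip()]
    return []


def list_devices_single_call(run: Runner) -> List[str]:
    """Shape of the suggested fix: run once, reuse the result."""
    listing = run(["ibstat", "-l"])
    if not listing:
        return []
    return [d.strip() for d in listing.splitlines() if d.strip()]


# First call succeeds, second call "fails" by returning None.
crashed = False
try:
    list_devices_double_call(flaky_run(["mlx5_0\nmlx5_1\n", None]))
except AttributeError:
    crashed = True

safe = list_devices_single_call(flaky_run(["mlx5_0\nmlx5_1\n", None]))
print(crashed, safe)  # True ['mlx5_0', 'mlx5_1']
```

With the same sequence of outputs, the double-call version raises and the single-call version returns the device list, because it never consults the second result.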
"required_publication": env("CX_REQUIRED_PUBLICATION") or None,
"backend": backend,
"phase": phase,
"ep": integer("CX_EP", integer("CX_NGPUS", 1)),
"gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
"scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
"dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
"mode": env("CX_MODE", "normal"),
"contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
"routing": env("CX_ROUTING", "uniform"),
"eplb": enabled("CX_EPLB"),
"combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
"resource_mode": env("CX_RESOURCE_MODE", "tuned"),
"activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
"placement": env("CX_PLACEMENT", "packed"),
"routing_step": env("CX_ROUTING_STEP", "0"),
"uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
"tokens_ladder": env("CX_TOKENS_LADDER"),
"canonical": enabled("CX_CANONICAL"),
"sampling_contract": "fixed-512-v1",
"samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
"iters": integer("CX_ITERS", 8),
"trials": integer("CX_TRIALS", 64),
"warmup": integer("CX_WARMUP", 32),
"warmup_semantics": env(
    "CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
),
🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.
Extended reasoning...
The observed behavior
With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:
bundle: expected-matrix coverage failed (
missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[]
identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])
The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.
Step-by-step proof
Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:
{
...,
"hidden": "", # h==7168 -> "" sentinel
"topk": "", # t==8 -> ""
"experts": "", # e==256 -> ""
"nodes": "1", # always str
...
}
When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely: the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.
aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:
- _expected_case_identity(matrix_case): "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.
- _actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.
- _identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.
- validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.
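The mechanics of that mismatch can be modeled in a few lines of Python. This is a simplified sketch, not the code in make_bundle.py: `case_identity` and `identity_differences` are hypothetical stand-ins that keep only the behavior described above (the `""` sentinel falling back to a default, and the `if field in case` guard).

```python
from typing import Dict, List

# Defaults described in the review: 7168 / 8 / 256, and a single node.
DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256, "nodes": 1}


def case_identity(case: Dict[str, str]) -> Dict[str, int]:
    """Normalize shape fields, but only for keys that are present in `case`."""
    identity = {}
    for field, default in DEFAULTS.items():
        if field in case:  # a missing key is skipped, not defaulted
            identity[field] = int(case[field] or default)
    return identity


def identity_differences(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """List every expected field whose actual value differs or is absent."""
    return [
        f"{field}={actual.get(field)}!={value}"
        for field, value in expected.items()
        if actual.get(field) != value
    ]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case: Dict[str, str] = {}  # emitter wrote none of the four keys

diffs = identity_differences(case_identity(matrix_case), case_identity(failed_case))
print(",".join(diffs))
# hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1
```

The asymmetry is that an empty string is present and gets defaulted, while an absent key is skipped, so a failed-case record can never match a scheduled case on these fields.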
validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.
Why this fires in practice
The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.
Impact
Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."
Fix
Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
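The first option amounts to always emitting the four shape fields with defaults. A hedged Python sketch of that logic follows; the real emitter is a shell function, `failed_case_shape` is a hypothetical name, and taking `nodes` from SLURM_NNODES alone with a default of 1 is an assumption made for illustration.

```python
from typing import Dict, Mapping


def failed_case_shape(environ: Mapping[str, str]) -> Dict[str, int]:
    """Build the four shape fields for a failed-case record from the environment."""

    def integer(name: str, default: int) -> int:
        # Unset or blank variables fall back to the default.
        value = environ.get(name, "")
        return int(value) if value.strip() else default

    return {
        "hidden": integer("CX_HIDDEN", 7168),
        "topk": integer("CX_TOPK", 8),
        "experts": integer("CX_EXPERTS", 256),
        "nodes": integer("SLURM_NNODES", 1),  # assumption: node count from Slurm only
    }


defaults_only = failed_case_shape({})
overridden = failed_case_shape({"CX_TOPK": "4", "SLURM_NNODES": "2"})
print(defaults_only)  # {'hidden': 7168, 'topk': 8, 'experts': 256, 'nodes': 1}
```

In a real run the mapping passed in would be `os.environ`; taking it as a parameter keeps the function easy to test.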
758fa52 to 1c5b901
secret = "github_pat_" + "Z" * 24
with tempfile.TemporaryDirectory() as temporary:
    path = Path(temporary) / "artifact.json"
    path.write_text(json.dumps({"note": secret}))
Summary
Adds the sanitized CollectiveX v1 EP benchmark suite under experimental/CollectiveX/ and two GitHub Actions entrypoints that resolve only public runner labels.
What changed
Zipf/EPLB sensitivity anchors
GB200, GB300, MI325X, and MI355X
and exactly 512 percentile observations
declared by the public capability registry
mismatched rack topology, ignored timing/resource controls, and inconsistent GB transport settings
overwriting earlier attempt identity
The historical deepep-v2 label is removed because it did not implement DeepEP PR #605. A real ElasticBuffer adapter remains explicit v1 follow-up work.
Isolation and safety
capability.py, keyed by exact GitHub Actions labels
datasets, and atomic local channel pointers; no managed database or object store is introduced
perf-changelog.yaml entry is changed
Validation
8:64:32 profile, all 512 samples
bash -n and ShellCheck