CollectiveX v1: cross-vendor EP benchmark suite by Oseltamivir · Pull Request #2004 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-07-03T17:38:00Z

Summary

Adds the sanitized CollectiveX v1 EP benchmark suite under experimental/CollectiveX/ and two
GitHub Actions entrypoints that resolve only public runner labels.

What changed

defines one canonical DeepSeek-V3 EP workload and two reduced suites: uniform core coverage plus Zipf/EPLB sensitivity anchors
resolves 39 hardware/backend cells, 232 cases, and 618 token points across H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X
standardizes every case on 8 timed iterations x 64 trials, 32 synchronized full-roundtrip warmups, and exactly 512 percentile observations
supports DeepEP, DeepEP Hybrid, FlashInfer, MoRI, UCCL, and a portable NCCL/RCCL reference where declared by the public capability registry
hardens launchers against false-green distributed failures, missing result shards, stale staging, mismatched rack topology, ignored timing/resource controls, and inconsistent GB transport settings
keeps build preparation allocation-local and fail-closed; retries rack FlashInfer attempts without overwriting earlier attempt identity
validates and bundles artifacts without committing benchmark results or private environment data
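
The sampling profile in the list above can be checked with simple arithmetic. The sketch below assumes each timed iteration of each trial contributes exactly one percentile observation, which is consistent with the stated 8 x 64 = 512 but is not confirmed by the PR text; the constant and function names are illustrative only.

```python
# Sampling profile quoted in the PR description (8:64:32, 512 samples).
ITERS = 8     # timed iterations per trial
TRIALS = 64   # trials per token point
WARMUP = 32   # synchronized full-roundtrip warmups (not counted as samples)


def samples_per_point(iters: int, trials: int) -> int:
    """Assumption: one observation per timed iteration per trial."""
    return iters * trials


# Warmups are excluded, so only iterations x trials feed the percentiles.
assert samples_per_point(ITERS, TRIALS) == 512
```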

The historical deepep-v2 label is removed because it did not implement DeepEP PR #605. A real
ElasticBuffer adapter remains explicit v1 follow-up work.

Isolation and safety

runner/backend support comes from capability.py, keyed by exact GitHub Actions labels
no private runner inventory or endpoint is tracked or used to construct the workflow matrix
results and operator notes are ignored; GitHub artifacts are transient inputs, not a durable store
the planned publisher is a self-hosted filesystem with immutable private bundles, sanitized public datasets, and atomic local channel pointers; no managed database or object store is introduced
no production serving benchmark configuration or perf-changelog.yaml entry is changed
the PR has no sweep label, so this rewrite dispatched no GPU benchmark jobs

Validation

43 Python contract/unit tests
exact matrix assertion: 39 cells, 232 cases, 618 token points, one 8:64:32 profile, all 512 samples
bash -n and ShellCheck
Actionlint
Ruff and Python compileall
GitHub CodeQL

claude · 2026-07-03T18:12:16Z

+  rsync -a --delete --delete-excluded \
+    --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
+    --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
+    --exclude='goal.md' --exclude='notes.md' \
+    "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
+    || cx_die "staging CollectiveX failed"


🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (a relative path). However, cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded, which drops the shard file. As a result, for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream, make_bundle catches this via missing_identity/coverage, but only after the GPU allocation has been spent on the wrong workload. Cheap fixes: allow-list the shard file through the rsync (include rules for the results/ directory and its .shard_*.json files, placed before the exclude); copy the shard file into the stage dir after the rsync; or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard, the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

    # .github/workflows/collectivex-sweep.yml
    env:
      CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
    ...
    - name: Extract shard from matrix artifact
      working-directory: experimental/CollectiveX
      run: |
        ...
        json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

    # launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
    MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

    # experimental/CollectiveX/runtime/common.sh:145-150
    rsync -a --delete --delete-excluded \
      --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
      --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
      --exclude='goal.md' --exclude='notes.md' \
      "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.
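
The effect of the exclude can be reproduced without rsync. The sketch below is only an approximation: it uses shutil.copytree with an ignore filter as a stand-in for `--exclude='results/'`, and the directory layout and the helper names (`stage_tree`, `shard_survives_staging`) are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def stage_tree(src: Path, dst: Path) -> None:
    """Copy src to dst, skipping entries named 'results' or '__pycache__'.

    Stand-in for the rsync excludes; unlike rsync's 'results/' pattern,
    ignore_patterns also skips plain files with those names.
    """
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("results", "__pycache__"))


def shard_survives_staging() -> bool:
    """Build a toy checkout, stage it, and report whether the shard file arrived."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout" / "CollectiveX"
        (checkout / "results").mkdir(parents=True)
        (checkout / "runtime").mkdir()
        (checkout / "results" / ".shard_demo.json").write_text('{"cases": []}')
        (checkout / "runtime" / "common.sh").write_text("# placeholder\n")

        staged = Path(tmp) / "stage" / "CollectiveX"
        stage_tree(checkout, staged)

        # The rest of the tree is staged, but the shard file is not.
        assert (staged / "runtime" / "common.sh").is_file()
        return (staged / "results" / ".shard_demo.json").is_file()
```

Calling `shard_survives_staging()` on this toy tree returns False, which mirrors the missing-shard condition described above.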

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

    # runtime/run_in_container.sh:458
    if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
      # SHARD mode: sweep every scheduled case
      ...
    else
      # Single-bench (workflow_dispatch) path
      # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
      # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The path resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it. The test therefore fails, and the else branch runs a single default case with none of the shard's identity: one benchmark run instead of the intended N-case sweep.

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

    # launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
    local sf="${CX_SHARD_FILE:-}"
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.
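
The fallback used by the rack launchers can be expressed as a small path-resolution helper. This is a Python sketch of the same idea, not the launcher code: `resolve_shard_file` and `demo` are hypothetical names, and unlike the shell one-liner (which leaves the variable pointing at the missing path) this version returns None when neither location has the file.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_shard_file(shard_file: str, cwd: Path, checkout_dir: Path) -> Optional[Path]:
    """Try the path relative to cwd first, then relative to the original checkout."""
    if not shard_file:
        return None
    direct = cwd / shard_file
    if direct.is_file():
        return direct
    fallback = checkout_dir / shard_file
    return fallback if fallback.is_file() else None


def demo() -> Optional[str]:
    """The shard exists only in the checkout, not in the staged working directory."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout"
        staged = Path(tmp) / "staged"
        (checkout / "results").mkdir(parents=True)
        staged.mkdir()
        (checkout / "results" / ".shard_demo.json").write_text("{}")
        found = resolve_shard_file("results/.shard_demo.json", staged, checkout)
        # Report which root the file was found under.
        return found.parent.parent.name if found else None
```

The fallback only helps when the original checkout is reachable from wherever the check runs, which is the case on the submit host but not necessarily inside a container that mounts only the staged tree.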

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

b300 (all shards; launch_b300.sh is single-node)

gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)

gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).

Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.

launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.

srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.

run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.

Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.

One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.

Aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...). The whole aggregate fails, after b300 GPU time was spent on the wrong workload.
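
The final step of the walk-through is essentially a set comparison. The sketch below is a simplified model, not make_bundle.py itself: `coverage_report` and `check_coverage` are hypothetical names, and identity_mismatch and duplicates handling are omitted.

```python
from typing import Dict, Iterable, List


def coverage_report(expected_ids: Iterable[str], result_docs: List[dict]) -> Dict[str, object]:
    """Compare scheduled case ids against the case ids found in result documents."""
    expected = set(expected_ids)
    actual = {doc["case_id"] for doc in result_docs if doc.get("case_id")}
    return {
        # Results that carry no case_id cannot be matched to any scheduled case.
        "missing_identity": sum(1 for doc in result_docs if not doc.get("case_id")),
        "missing": sorted(expected - actual),
        "extra": sorted(actual - expected),
    }


def check_coverage(expected_ids: Iterable[str], result_docs: List[dict]) -> None:
    """Fail closed: any gap in coverage aborts the bundle step."""
    report = coverage_report(expected_ids, result_docs)
    if report["missing_identity"] or report["missing"] or report["extra"]:
        raise SystemExit(f"bundle: expected-matrix coverage failed {report}")


# 24 scheduled cases, but the sweep produced one result with an empty case_id.
scheduled = [f"case-{i:02d}" for i in range(24)]
report = coverage_report(scheduled, [{"case_id": "", "routing": "uniform"}])
```

Under this model the lone default result counts as one missing_identity entry and all 24 scheduled ids remain in missing, so `check_coverage` would exit nonzero.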

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

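Any one of:

1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146). rsync matches filter patterns against paths relative to the transfer root (here CollectiveX/...), and the first matching rule wins, so the include rules must come before the exclude and the exclude must target the directory's contents rather than the directory itself:

    rsync -a --delete --delete-excluded \
      --include='results/' --include='results/.shard_*.json' \
      --exclude='__pycache__/' --exclude='results/*' ...

2. Copy the shard file into the stage dir after the rsync completes, creating results/ first because the rsync does not stage it:

    if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ]; then
      mkdir -p "$(dirname "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE")"
      cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
        "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"
    fi

3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

    sf="${CX_SHARD_FILE:-}"
    # $CX_DIR is not set inside the container; use the fixed workdir instead.
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
      && sf="/ix/experimental/CollectiveX/$sf"
    if [ -n "$sf" ] && [ -f "$sf" ]; then ...

   Note that /ix is the staged mount and the container's cwd is already /ix/experimental/CollectiveX, so this fallback resolves to the same missing path unless the original checkout is also made visible inside the container.

Approach (1) or (2) is the smallest change with the least surface area.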
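
The copy-after-staging option can be sketched as a small helper. This is an illustration in Python rather than the launcher's shell; `copy_shard_into_stage` and `demo_copy` are hypothetical names, and the directory layout is a toy one.

```python
import shutil
import tempfile
from pathlib import Path


def copy_shard_into_stage(repo_cx_dir: Path, stage_cx_dir: Path, shard_file: str) -> bool:
    """Copy a relative shard file from the checkout into the staged tree.

    Returns True if the file was copied, False if there was nothing to copy.
    """
    if not shard_file:
        return False
    src = repo_cx_dir / shard_file
    if not src.is_file():
        return False
    dst = stage_cx_dir / shard_file
    # The staged tree has no results/ directory, so create it before copying.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True


def demo_copy() -> bool:
    """Stage dir starts without results/; the helper should make the shard visible."""
    with tempfile.TemporaryDirectory() as tmp:
        repo_cx = Path(tmp) / "repo" / "CollectiveX"
        stage_cx = Path(tmp) / "stage" / "CollectiveX"
        (repo_cx / "results").mkdir(parents=True)
        stage_cx.mkdir(parents=True)
        (repo_cx / "results" / ".shard_demo.json").write_text("{}")
        copied = copy_shard_into_stage(repo_cx, stage_cx, "results/.shard_demo.json")
        return copied and (stage_cx / "results" / ".shard_demo.json").is_file()
```

A launcher could additionally refuse to start the container when the helper returns False while a shard file is configured, which would fail before any GPU allocation is used; that guard is a suggestion, not something the PR contains.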

claude · 2026-07-03T18:12:17Z

+    elif _run(["ibstat", "-l"]):
+        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
+    return {


🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

    listing = _run(["ibv_devinfo", "-l"])  # assigned once, iterated once
    if listing:
        for line in listing.splitlines()[1:]:
            ...
    elif _run(["ibstat", "-l"]):  # called once (as a truthiness check)
        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).

First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.

Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.

None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

    else:
        listing = _run(["ibstat", "-l"])
        if listing:
            devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.
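
The difference between the two shapes can be demonstrated with a stub in place of `_run`. The sketch below is self-contained and does not call ibstat; `make_flaky_runner`, `devices_double_call`, and `devices_single_call` are hypothetical names, and the stub simply returns a listing on the first call and None afterwards.

```python
from typing import Callable, List, Optional

Runner = Callable[[List[str]], Optional[str]]


def make_flaky_runner() -> Runner:
    """Stub for _run: succeeds on the first call, returns None on later calls."""
    calls = {"n": 0}

    def run(cmd: List[str]) -> Optional[str]:
        calls["n"] += 1
        return "mlx5_0\nmlx5_1\n" if calls["n"] == 1 else None

    return run


def devices_double_call(run: Runner) -> List[str]:
    """Shape of the flagged code: the command runs once to test, once to parse."""
    devices: List[str] = []
    if run(["ibstat", "-l"]):
        devices = [d.strip() for d in run(["ibstat", "-l"]).splitlines() if d.strip()]
    return devices


def devices_single_call(run: Runner) -> List[str]:
    """Shape of the suggested fix: run once, reuse the result."""
    devices: List[str] = []
    listing = run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]
    return devices
```

With the flaky stub, `devices_single_call` returns both device names, while `devices_double_call` raises AttributeError on the second call's None.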

claude · 2026-07-03T18:12:19Z

+    "required_publication": env("CX_REQUIRED_PUBLICATION") or None,
+    "backend": backend,
+    "phase": phase,
+    "ep": integer("CX_EP", integer("CX_NGPUS", 1)),
+    "gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
+    "scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
+    "dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
+    "mode": env("CX_MODE", "normal"),
+    "contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
+    "routing": env("CX_ROUTING", "uniform"),
+    "eplb": enabled("CX_EPLB"),
+    "combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
+    "resource_mode": env("CX_RESOURCE_MODE", "tuned"),
+    "activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
+    "placement": env("CX_PLACEMENT", "packed"),
+    "routing_step": env("CX_ROUTING_STEP", "0"),
+    "uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
+    "tokens_ladder": env("CX_TOKENS_LADDER"),
+    "canonical": enabled("CX_CANONICAL"),
+    "sampling_contract": "fixed-512-v1",
+    "samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
+    "iters": integer("CX_ITERS", 8),
+    "trials": integer("CX_TRIALS", 64),
+    "warmup": integer("CX_WARMUP", 32),
+    "warmup_semantics": env(
+        "CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
+    ),


🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed ( missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[] identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

    {
      ...,
      "hidden": "",   # h==7168 -> "" sentinel
      "topk": "",     # t==8 -> ""
      "experts": "",  # e==256 -> ""
      "nodes": "1",   # always str
      ...
    }

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

_expected_case_identity(matrix_case) — "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.

_actual_case_identity(failed_doc) (the failed-case branch, lines 184-195) copies failure.case verbatim, then calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.

_identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.

validate_expected_coverage (lines 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.
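
The mismatch string can be reproduced with a simplified model of the two identity helpers. This is a sketch built from the description above, not the make_bundle.py source: `case_identity` and `identity_differences` are stand-in names, and only the four shape fields are modeled.

```python
from typing import Dict, List

# Defaults stated in the review for the canonical workload.
SHAPE_DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}


def case_identity(case: dict) -> Dict[str, int]:
    """Normalize the shape fields; a field is included only if its key is present."""
    identity: Dict[str, int] = {}
    for field, default in SHAPE_DEFAULTS.items():
        if field in case:
            # "" is the matrix sentinel for "use the default".
            identity[field] = int(case[field] or default)
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity


def identity_differences(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """List every expected field whose actual value differs or is absent."""
    return [f"{k}={actual.get(k)}!={v}" for k, v in expected.items() if actual.get(k) != v]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case = {}  # the failed-case emitter writes none of the four keys

diffs = identity_differences(case_identity(matrix_case), case_identity(failed_case))
mismatch = ",".join(diffs)
```

In this model `mismatch` comes out as `hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1`, matching the string in the aggregate error.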

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
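
The first option amounts to reading the shape fields with the same defaults the matrix uses. The actual emitter is a shell function; the sketch below shows the idea in Python, with `failed_case_shape` as a hypothetical name. Taking nodes directly from SLURM_NNODES with a default of 1 is an assumption, since the review names CX_NGPUS/SLURM_NNODES without saying how they combine.

```python
import os
from typing import Dict, Mapping


def failed_case_shape(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Return the four shape fields for a failed-case record."""

    def integer(name: str, default: int) -> int:
        raw = env.get(name, "")
        # Unset or empty variables fall back to the canonical default.
        return int(raw) if raw.strip() else default

    return {
        "hidden": integer("CX_HIDDEN", 7168),
        "topk": integer("CX_TOPK", 8),
        "experts": integer("CX_EXPERTS", 256),
        # Assumption: node count comes straight from SLURM_NNODES.
        "nodes": integer("SLURM_NNODES", 1),
    }
```

Merging this dict into failure.case would let the failed record carry the same identity keys as the scheduled matrix entry, so a retry-exhausted case would no longer surface as a shape mismatch.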


Oseltamivir requested a review from a team July 3, 2026 17:38

github-project-automation Bot added this to InferenceMAX Board Jul 3, 2026

claude Bot reviewed Jul 3, 2026

Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 Compare July 4, 2026 01:11

github-advanced-security AI found potential problems Jul 4, 2026

Comment thread experimental/CollectiveX/tests/test_sampling_contract.py

    secret = "github_pat_" + "Z" * 24
    with tempfile.TemporaryDirectory() as temporary:
        path = Path(temporary) / "artifact.json"
        path.write_text(json.dumps({"note": secret}))

Oseltamivir force-pushed the collectivex branch from 1c5b901 to 8f700e5 Compare July 4, 2026 01:19

feat(collectivex): add native v1 benchmark suite

7e5f80a

Oseltamivir force-pushed the collectivex branch from 8f700e5 to 7e5f80a Compare July 4, 2026 01:54

Oseltamivir wants to merge 1 commit into main from collectivex
