
CollectiveX v1: cross-vendor EP benchmark suite#2004

Open
Oseltamivir wants to merge 1 commit into main from collectivex

@Oseltamivir

Collaborator

Summary

Adds the sanitized CollectiveX v1 EP benchmark suite under experimental/CollectiveX/ and two
GitHub Actions entrypoints that resolve only public runner labels.

What changed

  • defines one canonical DeepSeek-V3 EP workload and two reduced suites: uniform core coverage plus
    Zipf/EPLB sensitivity anchors
  • resolves 39 hardware/backend cells, 232 cases, and 618 token points across H100, H200, B200, B300,
    GB200, GB300, MI325X, and MI355X
  • standardizes every case on 8 timed iterations x 64 trials, 32 synchronized full-roundtrip warmups,
    and exactly 512 percentile observations
  • supports DeepEP, DeepEP Hybrid, FlashInfer, MoRI, UCCL, and a portable NCCL/RCCL reference where
    declared by the public capability registry
  • hardens launchers against false-green distributed failures, missing result shards, stale staging,
    mismatched rack topology, ignored timing/resource controls, and inconsistent GB transport settings
  • keeps build preparation allocation-local and fail-closed; retries rack FlashInfer attempts without
    overwriting earlier attempt identity
  • validates and bundles artifacts without committing benchmark results or private environment data
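
The fixed sampling profile in the list above can be sanity-checked in a few lines. This is a minimal sketch that assumes the 512 percentile observations are simply the product of timed iterations and trials; the `SamplingProfile` name is hypothetical and does not come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingProfile:
    """Hypothetical container for the 8:64:32 profile named in the summary."""

    iters: int = 8    # timed iterations per trial
    trials: int = 64  # trials per token point
    warmup: int = 32  # synchronized full-roundtrip warmups

    @property
    def samples_per_point(self) -> int:
        # Assumption: each timed iteration of each trial yields one observation.
        return self.iters * self.trials


profile = SamplingProfile()
assert profile.samples_per_point == 512
```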

The historical deepep-v2 label is removed because it did not implement DeepEP PR #605. A real
ElasticBuffer adapter remains explicit v1 follow-up work.

Isolation and safety

  • runner/backend support comes from capability.py, keyed by exact GitHub Actions labels
  • no private runner inventory or endpoint is tracked or used to construct the workflow matrix
  • results and operator notes are ignored; GitHub artifacts are transient inputs, not a durable store
  • the planned publisher is a self-hosted filesystem with immutable private bundles, sanitized public
    datasets, and atomic local channel pointers; no managed database or object store is introduced
  • no production serving benchmark configuration or perf-changelog.yaml entry is changed
  • the PR has no sweep label, so this rewrite dispatched no GPU benchmark jobs

Validation

  • 43 Python contract/unit tests
  • exact matrix assertion: 39 cells, 232 cases, 618 token points, one 8:64:32 profile, all 512 samples
  • bash -n and ShellCheck
  • Actionlint
  • Ruff and Python compileall
  • GitHub CodeQL

Comment on lines +145 to +150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
  || cx_die "staging CollectiveX failed"

Contributor

🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (relative), but cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded and drops the shard file — so for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage but only after GPU allocation was spent on the wrong workload. Cheap fix: allow-list the shard file through the rsync (--include='experimental/CollectiveX/results/' --include='experimental/CollectiveX/results/.shard_*.json' before the results/ exclude), copy the shard file into the stage dir after the rsync, or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

# .github/workflows/collectivex-sweep.yml
env:
  CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
...
- name: Extract shard from matrix artifact
  working-directory: experimental/CollectiveX
  run: |
    ...
    json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

# launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

# experimental/CollectiveX/runtime/common.sh:145-150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

# runtime/run_in_container.sh:458
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
  # SHARD mode — sweep every scheduled case
  ...
else
  # Single-bench (workflow_dispatch) path
  # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
  # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The file resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json — which is missing because rsync excluded it — so the test fails and the else branch runs a single default case with none of the shard's identity, N times cheaper than the intended N-case sweep.
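
The failure mode described above can be reproduced without rsync or a container. The following is a rough sketch, not the PR's code: `stage_repo` and `select_mode` are hypothetical stand-ins for `cx_stage_repo` and the SHARD guard, and `shutil.ignore_patterns` only approximates `--exclude='results/'`.

```python
import shutil
import tempfile
from pathlib import Path


def stage_repo(repo_root: Path, stage_dir: Path) -> Path:
    """Copy the tree while skipping results/, roughly like --exclude='results/'."""
    src = repo_root / "experimental" / "CollectiveX"
    dst = stage_dir / "experimental" / "CollectiveX"
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("results", "__pycache__"))
    return dst


def select_mode(workdir: Path, shard_file: str) -> str:
    """Mimic the guard: shard mode only if the relative path exists under workdir."""
    if shard_file and (workdir / shard_file).is_file():
        return "shard"
    return "single-bench"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    results = repo / "experimental" / "CollectiveX" / "results"
    results.mkdir(parents=True)
    (results / ".shard_demo.json").write_text('{"cases": []}')

    staged = stage_repo(repo, Path(tmp) / "stage")
    shard = "results/.shard_demo.json"
    original_mode = select_mode(repo / "experimental" / "CollectiveX", shard)
    staged_mode = select_mode(staged, shard)

print(original_mode, staged_mode)
```

Run from the original checkout the guard selects shard mode; run from the staged copy the same relative path is gone and the guard falls through to the single-bench branch.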

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

# launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
local sf="${CX_SHARD_FILE:-}"
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

  • b300 (all shards; launch_b300.sh is single-node)
  • gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)
  • gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

  1. Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).
  2. Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.
  3. launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.
  4. srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.
  5. run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.
  6. Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.
  7. One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.
  8. Aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...), so the whole aggregate fails after b300 GPU time was spent on the wrong workload.

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

Any one of:

  1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146):

    rsync -a --delete --delete-excluded \
      --include='experimental/CollectiveX/results/' \
      --include='experimental/CollectiveX/results/.shard_*.json' \
      --exclude='__pycache__/' --exclude='results/' ...
  2. Copy the shard file into the stage dir after the rsync completes:

    [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ] \
      && cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
              "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"
  3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

    sf="${CX_SHARD_FILE:-}"
    # $CX_DIR is not set inside the container; use the fixed workdir instead.
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
      && sf="/ix/experimental/CollectiveX/$sf"
    if [ -n "$sf" ] && [ -f "$sf" ]; then ...

Approach (1) or (2) is the smallest change with the least surface area.
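
For illustration, the fallback lookup behind approach (3) can be expressed as a small resolver. This is a sketch under stated assumptions: `resolve_shard_file` is a hypothetical helper, and the two base directories stand in for the container workdir and the original checkout.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_shard_file(shard_file: str, workdir: Path, fallback_root: Path) -> Optional[Path]:
    """Return the first existing candidate, or None when the shard is missing."""
    if not shard_file:
        return None
    for base in (workdir, fallback_root):
        candidate = base / shard_file
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    checkout = Path(tmp) / "checkout"   # stands in for the original repo tree
    stage = Path(tmp) / "stage"         # stands in for the staged copy
    (checkout / "results").mkdir(parents=True)
    stage.mkdir()
    (checkout / "results" / ".shard_demo.json").write_text("{}")

    found = resolve_shard_file("results/.shard_demo.json", stage, checkout)
    found_name = found.name if found else None
    missing = resolve_shard_file("results/.shard_other.json", stage, checkout)
```

A caller could also treat a non-empty shard path that resolves to `None` as a hard error instead of falling through to the single-bench branch, which would be in keeping with the fail-closed behavior the summary describes.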

Comment thread experimental/CollectiveX/env_capture.py Outdated
Comment on lines +178 to +180
    elif _run(["ibstat", "-l"]):
        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
        return {

Contributor

🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

listing = _run(["ibv_devinfo", "-l"])   # assigned once, iterated once
if listing:
    for line in listing.splitlines()[1:]:
        ...
elif _run(["ibstat", "-l"]):             # called once (as a truthiness check)
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

  1. Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).
  2. First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.
  3. Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.
  4. None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

else:
    listing = _run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.
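
The difference between the two patterns can be demonstrated with a stub in place of `_run`. This is a minimal sketch: `make_flaky_runner` is a hypothetical test double that returns a listing once and `None` afterwards, and the two `devices_*` functions are simplified stand-ins for the branch in `_rdma()`.

```python
from typing import Callable, List, Optional

Runner = Callable[[List[str]], Optional[str]]


def devices_call_twice(run: Runner) -> List[str]:
    """Pattern flagged in the review: the second call may return None."""
    if run(["ibstat", "-l"]):
        return [d.strip() for d in run(["ibstat", "-l"]).splitlines() if d.strip()]
    return []


def devices_call_once(run: Runner) -> List[str]:
    """Suggested pattern: call once, reuse the result."""
    listing = run(["ibstat", "-l"])
    if listing:
        return [d.strip() for d in listing.splitlines() if d.strip()]
    return []


def make_flaky_runner() -> Runner:
    """Hypothetical stub for _run: succeeds on the first call, then returns None."""
    outputs = iter(["mlx5_0\nmlx5_1\n"])

    def run(cmd: List[str]) -> Optional[str]:
        return next(outputs, None)

    return run


try:
    devices_call_twice(make_flaky_runner())
    crashed = False
except AttributeError:
    crashed = True

safe_devices = devices_call_once(make_flaky_runner())
```

With the flaky stub, the call-twice variant raises `AttributeError` while the call-once variant returns both device names.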

Comment on lines +260 to +286
"required_publication": env("CX_REQUIRED_PUBLICATION") or None,
"backend": backend,
"phase": phase,
"ep": integer("CX_EP", integer("CX_NGPUS", 1)),
"gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
"scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
"dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
"mode": env("CX_MODE", "normal"),
"contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
"routing": env("CX_ROUTING", "uniform"),
"eplb": enabled("CX_EPLB"),
"combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
"resource_mode": env("CX_RESOURCE_MODE", "tuned"),
"activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
"placement": env("CX_PLACEMENT", "packed"),
"routing_step": env("CX_ROUTING_STEP", "0"),
"uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
"tokens_ladder": env("CX_TOKENS_LADDER"),
"canonical": enabled("CX_CANONICAL"),
"sampling_contract": "fixed-512-v1",
"samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
"iters": integer("CX_ITERS", 8),
"trials": integer("CX_TRIALS", 64),
"warmup": integer("CX_WARMUP", 32),
"warmup_semantics": env(
"CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
),

Contributor

🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed (
  missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[]
  identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

{
  ...,
  "hidden": "",     # h==7168 -> "" sentinel
  "topk": "",       # t==8    -> ""
  "experts": "",    # e==256  -> ""
  "nodes": "1",     # always str
  ...
}

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

  1. _expected_case_identity(matrix_case): "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.
  2. _actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.
  3. _identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.
  4. validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.
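
The four-field mismatch string can be reproduced with a stripped-down model of the comparison. This is only a sketch: `expected_identity` and `identity_differences` are simplified, hypothetical stand-ins for `_expected_case_identity` and `_identity_differences`, restricted to the four shape fields discussed above.

```python
from typing import Any, Dict, List

SHAPE_DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}


def expected_identity(case: Dict[str, Any]) -> Dict[str, Any]:
    """Copy shape fields only when the key is present in the case dict."""
    identity: Dict[str, Any] = {}
    for field, default in SHAPE_DEFAULTS.items():
        if field in case:
            identity[field] = int(case[field] or default)  # "" falls back to the default
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity


def identity_differences(expected: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """List every expected field whose actual value differs or is absent."""
    return [f"{k}={actual.get(k)}!={v}" for k, v in expected.items() if actual.get(k) != v]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case: Dict[str, Any] = {}  # the failed-case record carries none of the four keys

diffs = identity_differences(expected_identity(matrix_case), expected_identity(failed_case))
report = ",".join(diffs)
print(report)
```

Because the guard skips absent keys on the failed-case side, every scheduled shape field shows up as `None` in the comparison.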

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
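
A sketch of the first option, written in Python for illustration: `shape_fields` is a hypothetical helper, and deriving `nodes` from `SLURM_NNODES` with a default of 1 is an assumption, since the comment also mentions `CX_NGPUS` without saying how the two combine.

```python
from typing import Dict, Mapping


def shape_fields(environ: Mapping[str, str]) -> Dict[str, int]:
    """Build the four missing identity fields from environment-style settings."""

    def integer(name: str, default: int) -> int:
        value = environ.get(name, "")
        return int(value) if value else default  # empty or unset uses the default

    return {
        "hidden": integer("CX_HIDDEN", 7168),
        "topk": integer("CX_TOPK", 8),
        "experts": integer("CX_EXPERTS", 256),
        "nodes": integer("SLURM_NNODES", 1),  # assumption: node count comes from Slurm
    }


defaults_only = shape_fields({})
overridden = shape_fields({"CX_HIDDEN": "4096", "SLURM_NNODES": "2"})
```

Merging a dict like this into `failure.case` would let a failed record compare equal on the shape fields, leaving only the retry exhaustion itself to report.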

@Oseltamivir Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 Compare July 4, 2026 01:11
secret = "github_pat_" + "Z" * 24
with tempfile.TemporaryDirectory() as temporary:
    path = Path(temporary) / "artifact.json"
    path.write_text(json.dumps({"note": secret}))