Open
247 commits
83761d0
Add CollectiveX experimental cross-vendor collective/EP benchmark
Oseltamivir Jun 23, 2026
b7ed913
CollectiveX: import container by multi-arch tag, fix CI import hang
Oseltamivir Jun 23, 2026
e6fdd84
Merge branch 'main' into collectivex
Oseltamivir Jun 23, 2026
ccfae8e
CollectiveX: copy staged results back to checkout for artifact upload
Oseltamivir Jun 23, 2026
b384171
CollectiveX: per-job summary table + address PR review findings
Oseltamivir Jun 23, 2026
f48daed
CollectiveX: render results as a GitHub Actions job summary
Oseltamivir Jun 23, 2026
be9cc91
CollectiveX: add MI355X / MoRI EP path (dispatch+combine)
Oseltamivir Jun 23, 2026
d8ee9bf
CollectiveX: run MI355X MoRI on push; align launcher with serving script
Oseltamivir Jun 23, 2026
ac3f1b9
CollectiveX: size MoRI symmetric heap (first MI355X run hit the 2 GiB…
Oseltamivir Jun 23, 2026
46208f2
CollectiveX: set MoRI heap to 6G (16 GiB failed RDMA MR registration)
Oseltamivir Jun 23, 2026
b62de99
CollectiveX: MoRI MI355X validated on hardware; fix heap/buffer/teardown
Oseltamivir Jun 23, 2026
481ef59
CollectiveX: wire rccl-tests collective primitives for MI355X (CX_BEN…
Oseltamivir Jun 23, 2026
78322de
CollectiveX: key dispatch concurrency by SKU so B200/MI355X runs don'…
Oseltamivir Jun 23, 2026
2b23573
CollectiveX: render busbw & latency vs bytes/rank sweep tables in the…
Oseltamivir Jun 23, 2026
a3a492c
CollectiveX: GB200 8-GPU multi-node MNNVL path (CX_NODES), validated …
Oseltamivir Jun 23, 2026
871086d
CollectiveX: fix multi-node build cache (MPI=0 vs MPI=1) + gate all-z…
Oseltamivir Jun 23, 2026
368cfbc
CollectiveX: EP dispatch/combine token sweep with separated timing (t…
Oseltamivir Jun 24, 2026
e2717a3
CollectiveX: make MI355X launcher CI-robust (writable lock dir + node…
Oseltamivir Jun 24, 2026
5c7b273
CollectiveX: fair-comparison EP rebuild — deterministic trace, real f…
Oseltamivir Jun 24, 2026
0052b11
CollectiveX: resource-normalized + tuned regimes for the EP comparison
Oseltamivir Jun 24, 2026
3a872a9
CollectiveX: fail-fast timeout guard + cap the MoRI push smoke (T>=32…
Oseltamivir Jun 24, 2026
5876ea0
CollectiveX: floor MoRI normalized block_num — it deadlocks at T>=32 …
Oseltamivir Jun 24, 2026
353c8ee
CollectiveX: FP8 dispatch + low-latency mode + reject-unsupported fra…
Oseltamivir Jun 24, 2026
3bc941c
CollectiveX: fix B300 warmup artifact + GHA matrix for h100-dgxc/b300…
Oseltamivir Jun 24, 2026
9f85d05
CollectiveX: fix h100-dgxc + b300 launcher slurm/storage from serving…
Oseltamivir Jun 24, 2026
c596882
CollectiveX: serialize same-SKU GHA dispatches + add 3-run reproducib…
Oseltamivir Jun 24, 2026
e71ef3c
CollectiveX: per-point clock-ramp burst (gated) — fixes MoRI wedge + …
Oseltamivir Jun 24, 2026
4e217f9
CollectiveX: MoRI repro/validation drivers pass COLLECTIVEX_IMAGE (pr…
Oseltamivir Jun 24, 2026
7a2f94f
CollectiveX: repro driver — match the T row (MoRI ramp-safe) + cap Mo…
Oseltamivir Jun 24, 2026
bbe0578
CollectiveX: dedicated MoRI repro driver (validation-exact invocation)
Oseltamivir Jun 24, 2026
f7b9d35
CollectiveX v3 measurement: explicit contracts, pooled-trial p50/p90/…
Oseltamivir Jun 25, 2026
1afd268
CollectiveX v3 workflow: capability resolver + NCCL phase-dedup + con…
Oseltamivir Jun 25, 2026
6122acb
CollectiveX v3 plotter: percentile + suite selectors, logical-payload…
Oseltamivir Jun 25, 2026
c136ec5
CollectiveX: v3 harness smoke driver (validates contracts/trials/rout…
Oseltamivir Jun 25, 2026
cf34cb3
CollectiveX: MoRI repro driver iters knob (MORI_ITERS, tighter fast-o…
Oseltamivir Jun 25, 2026
82ec864
CollectiveX: v3 re-run drivers (deepep _v3_rerun.sh + mori _v3_mori.s…
Oseltamivir Jun 25, 2026
cad380a
CollectiveX plotter: default to p50 (p99 too noisy a tail estimate at…
Oseltamivir Jun 25, 2026
81cddca
CollectiveX plotter: X-axis Log/Linear toggle (was hardcoded log)
Oseltamivir Jun 25, 2026
e97bc8b
CollectiveX plotter: auto-stitch decode range into prefill curves (co…
Oseltamivir Jun 25, 2026
6a3a185
chore: dispatch CollectiveX snapshot updates [skip ci]
Oseltamivir Jun 25, 2026
270b7b4
CollectiveX: GB300 EP8 across 2 NVL72 trays + EP-degree-aware plotter
Oseltamivir Jun 25, 2026
a6812dc
CollectiveX: routing axis (balanced/zipf) + EPLB expert-replication l…
Oseltamivir Jun 25, 2026
45c4570
CollectiveX v4 (goal Part 1 + scaffolding): workload identity, measur…
Oseltamivir Jun 25, 2026
600e909
CollectiveX: analyze_ep.py — operating-envelope analysis (skew penalt…
Oseltamivir Jun 25, 2026
171c7d1
CollectiveX: --workload-dir canonical-trace consumption + make_worklo…
Oseltamivir Jun 25, 2026
6dba193
CollectiveX: failure taxonomy (classify hang/OOM/registration/deadloc…
Oseltamivir Jun 25, 2026
8ff23bd
CollectiveX plotter: coverage table (publication status per measured …
Oseltamivir Jun 25, 2026
9e52693
CollectiveX: provenance enrichment (GitHub ref/job/artifact, image ar…
Oseltamivir Jun 25, 2026
82c6130
CollectiveX: structured placement metadata + routing locality fractio…
Oseltamivir Jun 25, 2026
e273009
CollectiveX: scaling efficiency (strong/weak from EP4/EP8) + regressi…
Oseltamivir Jun 25, 2026
978d338
CollectiveX: MI355X cross-vendor canonical-workload consume driver (D…
Oseltamivir Jun 25, 2026
a413de2
CollectiveX plotter: fix grid 'undefined' panel title (stale 'serial'…
Oseltamivir Jun 26, 2026
d799e0f
CollectiveX plotter: prefill panels show only the real prefill range …
Oseltamivir Jun 26, 2026
1622dff
CollectiveX plotter: --legacy {all,exclude,only} — v4-only main plot …
Oseltamivir Jun 26, 2026
f5df0ea
CollectiveX GHA: add routing/eplb inputs + h200/gb300 SKUs; wire CX_E…
Oseltamivir Jun 26, 2026
bb296c4
CollectiveX: launch_gb300-nv.sh — GHA launcher for GB300 (EP4 via run…
Oseltamivir Jun 26, 2026
73da67b
CollectiveX GHA: per-(SKU+config) concurrency group so a multi-config…
Oseltamivir Jun 26, 2026
0df55e8
CollectiveX: per-runner stage dir (fix concurrent-dispatch stale-hand…
Oseltamivir Jun 26, 2026
13f0a0f
CollectiveX: fix H200 GHA launcher FS (/home/sa-shared, not /mnt/nfs)
Oseltamivir Jun 26, 2026
9fb6e5d
CollectiveX: H200 partition main (not hpc-gpu-1)
Oseltamivir Jun 26, 2026
2b5e26c
CollectiveX: GB300 launcher uses docker tag, not squash path
Oseltamivir Jun 26, 2026
d2433e3
CollectiveX: pin h200 dispatch to the h200-dgxc runner pool
Oseltamivir Jun 26, 2026
156bf44
CollectiveX: GHA campaign tooling — collector + matrix dry-label fix
Oseltamivir Jun 26, 2026
59a05e0
CollectiveX: gitignore _ssh_v4_archive/ (superseded SSH result JSONs)
Oseltamivir Jun 26, 2026
a767844
CollectiveX: distribution-identity hardening + quant-combine (PR311) …
Oseltamivir Jun 26, 2026
fd23d02
CollectiveX: complete goal Part 1 + Part 2 — runtime-visible contract…
Oseltamivir Jun 26, 2026
70cfef3
CollectiveX: cohort official-membership gate (publication_status==off…
Oseltamivir Jun 26, 2026
60dec7d
CollectiveX: immediate-priority — LL fixed-kernel resource split, res…
Oseltamivir Jun 26, 2026
36d3eb6
CollectiveX: fix UnboundLocalError on EPLB canonical runs — define ro…
Oseltamivir Jun 26, 2026
ee4ffe7
CollectiveX: gitignore _seeded_archive/ (superseded seeded-runtime re…
Oseltamivir Jun 26, 2026
45fa504
CollectiveX: full-suite GHA dispatch — workflow inputs (hidden/topk/e…
Oseltamivir Jun 26, 2026
2c15d94
CollectiveX: full-suite completeness fixes — collect limit 500 (was 1…
Oseltamivir Jun 27, 2026
880f82c
CollectiveX: keep-newest cfg_key includes resource axis (resource_mod…
Oseltamivir Jun 27, 2026
ddc08e7
CollectiveX: add iters workflow input (CX_ITERS) — for the MoRI/MI355…
Oseltamivir Jun 27, 2026
8392632
CollectiveX: add trials/warmup workflow inputs (CX_TRIALS/CX_WARMUP) …
Oseltamivir Jun 27, 2026
74f52e0
CollectiveX: fix workflow_dispatch >25-input limit — consolidate iter…
Oseltamivir Jun 27, 2026
1495866
CollectiveX: add B300 to ep-nightly/ep-models/ep-routing (was missing…
Oseltamivir Jun 27, 2026
0cf9fc6
CollectiveX: DeepEP V2 build hook (CX_DEEPEP_V2 -> build NCCL-Gin V2 …
Oseltamivir Jun 27, 2026
76a3032
CollectiveX: kernel_gen (deepep v1/v2) as a distinct identity axis — …
Oseltamivir Jun 27, 2026
91c7acf
collectivex: fix DeepEP V2 build on PEP 668 images (H200/B300)
Oseltamivir Jun 27, 2026
df7fdde
collectivex: headline defaults, decision/summary/tabs UI, regression …
Oseltamivir Jun 27, 2026
803b785
collectivex: render NCCL all-reduce/all-gather (family=nccl) in plot …
Oseltamivir Jun 27, 2026
b6176a6
collectivex: collect family=nccl (all-reduce/all-gather) + uccl/flash…
Oseltamivir Jun 27, 2026
a504a3e
collectivex: model-shape selector in plot (DeepSeek-V3/V4, MiniMax-M3…
Oseltamivir Jun 27, 2026
1e21c72
collectivex: UCCL EP backend + memcpy-family collective benches (offl…
Oseltamivir Jun 27, 2026
eb6f953
collectivex: document hardware/kernel-gated items (honest blockers)
Oseltamivir Jun 27, 2026
c16f885
collectivex: fix UCCL build-check (import torch first) + capability/c…
Oseltamivir Jun 27, 2026
4c661f9
collectivex: summarize.py recognizes memcpy-family collectives (offlo…
Oseltamivir Jun 27, 2026
95137b8
collectivex: correct UCCL EP status — scaffolded, full run deferred
Oseltamivir Jun 27, 2026
645f9d5
collectivex: collect offload/copy_engine/kvcache files + robust _coll…
Oseltamivir Jun 27, 2026
f531529
collectivex: review upstream precision PRs (MoRI 311, FlashInfer 3376…
Oseltamivir Jun 27, 2026
0e54cde
collectivex: populate offload/copy-engine/kv-cache plot tabs (real data)
Oseltamivir Jun 27, 2026
71477ee
collectivex: RL mesh-to-mesh transfer benchmark (family=rl-mesh)
Oseltamivir Jun 27, 2026
e6224de
collectivex: rl-mesh passes capability pre-flight (non-EP bench passt…
Oseltamivir Jun 27, 2026
c40de99
collectivex: render RL mesh-to-mesh tab (family=rl-mesh) — final coll…
Oseltamivir Jun 27, 2026
925285d
collectivex: launchers/ contains only launch*; runtime/ + tools/ split
Oseltamivir Jun 27, 2026
ca8a505
collectivex: FlashInfer EP adapter + framework all-reduce bench (wire…
Oseltamivir Jun 27, 2026
762eb48
collectivex: direct-cast FP8 + per-token scale-layout dispatch recipes
Oseltamivir Jun 27, 2026
42eddb4
collectivex: fix fp8-variant CLI choices + allreduce-fw gate + surfac…
Oseltamivir Jun 27, 2026
ccb0b4a
collectivex: fix FlashInfer EP Mapping (tp_size=world_size for pure EP)
Oseltamivir Jun 27, 2026
9e1ac40
collectivex: FlashInfer MoeAlltoAll requires hidden_size (Mapping fix…
Oseltamivir Jun 27, 2026
91530dd
collectivex: FlashInfer MNNVL via TorchDistBackend (no MPI) — the rea…
Oseltamivir Jun 27, 2026
e150424
collectivex: FlashInfer EP combine — clone payload + payload_in_works…
Oseltamivir Jun 27, 2026
7aca33d
collectivex: FlashInfer EP — handle stateful dispatch/combine FSM
Oseltamivir Jun 27, 2026
1535869
collectivex: roundtrip-only timing for FlashInfer EP (stateful paired…
Oseltamivir Jun 27, 2026
511188e
collectivex: FlashInfer combine — pass recv as-is (source contract: s…
Oseltamivir Jun 27, 2026
2ebeba9
collectivex: FlashInfer EP correctness factor = distinct ranks per token
Oseltamivir Jun 27, 2026
04d83bf
collectivex: UCCL EP — vendor deep_ep_wrapper (group-based Buffer) + …
Oseltamivir Jun 27, 2026
5d08a93
collectivex: UCCL — pin vendored deep_ep_wrapper to the wheel's tag (…
Oseltamivir Jun 27, 2026
cfa1ec5
collectivex: UCCL EP finalize os._exit past teardown SIGSEGV (result …
Oseltamivir Jun 27, 2026
510fc17
CollectiveX: FlashInfer EP quant dispatch (fp8 e4m3 variants + mxfp8 …
Oseltamivir Jun 28, 2026
0b2753b
CollectiveX: real FlashInfer one-shot/two-shot all-reduce (trtllm_all…
Oseltamivir Jun 28, 2026
5c48dfd
CollectiveX: gate nvfp4 dispatch to Blackwell + refresh gated.md
Oseltamivir Jun 28, 2026
156e9ea
CollectiveX: render framework all-reduce in the All-reduce tab + gate…
Oseltamivir Jun 28, 2026
d8b4764
CollectiveX: document collective-suite serving-use mapping (all-reduc…
Oseltamivir Jun 28, 2026
02ef8d2
CollectiveX: DeepEP hybrid-ep branch backend (NVIDIA TMA HybridEPBuffer)
Oseltamivir Jun 28, 2026
90877fb
CollectiveX: allow AMD collective benches on the MI355X launcher (kv-…
Oseltamivir Jun 28, 2026
3850003
CollectiveX: FlashInfer quantized COMBINE output (fp8) via newer moe_…
Oseltamivir Jun 28, 2026
49dd8db
CollectiveX: fix flashinfer-combine upgrade — match cubin/jit-cache v…
Oseltamivir Jun 28, 2026
f684b37
CollectiveX: raise MI355X wall-clock guard to 1800s (slow shared clus…
Oseltamivir Jun 28, 2026
d9e0423
CollectiveX: install flashinfer from NIGHTLY index for combine output…
Oseltamivir Jun 28, 2026
c2c7feb
CollectiveX: upgrade nvidia-cutlass-dsl with the nightly flashinfer (…
Oseltamivir Jun 28, 2026
43614ad
CollectiveX: record exact upgraded FlashInfer library stack in proven…
Oseltamivir Jun 28, 2026
d4c508a
CollectiveX: build flashinfer main from source if the nightly wheel l…
Oseltamivir Jun 28, 2026
ba7c14a
CollectiveX: force JIT-from-main for combine kernel (uninstall stale …
Oseltamivir Jun 28, 2026
85273c6
CollectiveX: fix combine-quant output_scales to UE8M0 uint8 block-32 …
Oseltamivir Jun 28, 2026
4b3fe29
CollectiveX: NVFP4 quantized combine output (flashinfer fp4 path) — c…
Oseltamivir Jun 28, 2026
ddfbdf7
CollectiveX: gated.md — quant combine OUTPUT now DONE on B300 (flashi…
Oseltamivir Jun 28, 2026
2d65048
CollectiveX: add nvfp4 to harness --combine-dtype argparse choices
Oseltamivir Jun 28, 2026
0e61ac1
CollectiveX: nvfp4 combine dequant — view e4m3 scales as uint8 for e2…
Oseltamivir Jun 28, 2026
d6bf7b1
CollectiveX: gated.md — NVFP4 combine also DONE on B300 (valid, corre…
Oseltamivir Jun 28, 2026
94f03d5
CollectiveX: MXFP4 dispatch via fp4_quantize(ue8m0, swizzled=False) —…
Oseltamivir Jun 28, 2026
99e4ba0
CollectiveX: MoRI fp8 blockwise (e4m3fnuz) dispatch — the FNUZ precis…
Oseltamivir Jun 28, 2026
fe013ce
CollectiveX: NIXL via container switch — transfer bench (wired) + dev…
Oseltamivir Jun 28, 2026
a15bd8b
CollectiveX: AMD SDMA copy path — attempt the off-SM DMA engine on MI…
Oseltamivir Jun 28, 2026
f06b701
CollectiveX: direct-cast FP8 combine — output_scalar_scale-only on th…
Oseltamivir Jun 28, 2026
8405b10
CollectiveX: MoRI-IO transfer bench — the AMD RDMA p2p transfer engin…
Oseltamivir Jun 28, 2026
3ab6feb
CollectiveX: gated.md — NIXL container-switch result + direct-cast ke…
Oseltamivir Jun 28, 2026
83679b0
CollectiveX: methodology — named per-model TP-MoE handoff shapes table
Oseltamivir Jun 28, 2026
ae3032f
CollectiveX: copy-engine — add flash-attention victim for copy-vs-att…
Oseltamivir Jun 28, 2026
0078e31
CollectiveX: MoRI fp8 = fp8_direct_cast (not blockwise) — the validat…
Oseltamivir Jun 28, 2026
08a2f1e
CollectiveX: MoRI fp8_direct_cast needs non-zero-copy (use_external_i…
Oseltamivir Jun 28, 2026
e4f71c4
CollectiveX: MoRI fp8 correctness — gate against the e4m3fnuz consist…
Oseltamivir Jun 28, 2026
8eec44d
CollectiveX: gated.md — FNUZ fp8 VALIDATED (fp8_direct_cast e4m3fnuz,…
Oseltamivir Jun 28, 2026
0cbfe17
CollectiveX: NCCL/RCCL KV-cache transfer backend (p2p send/recv)
Oseltamivir Jun 28, 2026
744426a
CollectiveX: GB200 launcher — add EP multi-srun path (was nccl-only m…
Oseltamivir Jun 28, 2026
001626a
CollectiveX: MoonCake KV transfer backend — pip-import the transfer e…
Oseltamivir Jun 28, 2026
1d7e063
CollectiveX: AITER all-reduce builder (AMD framework-AR tier)
Oseltamivir Jun 28, 2026
a51018c
CollectiveX: workflow concurrency group += inputs.nodes (multi-node E…
Oseltamivir Jun 28, 2026
7a104f2
CollectiveX: gated.md — NVL72 rack-scale EP DONE up to EP64 via Flash…
Oseltamivir Jun 28, 2026
e8b5013
CollectiveX: framework all-reduce — replicate the serving distributed…
Oseltamivir Jun 28, 2026
0688f5d
CollectiveX: vLLM all-reduce via container switch (allreduce-fw-vllm …
Oseltamivir Jun 28, 2026
568b0a7
CollectiveX: AITER all-reduce via serving-init replication (like sglang)
Oseltamivir Jun 28, 2026
f8d87b4
CollectiveX: vLLM AR — enter VllmConfig context; NIXL EP — build UCX-…
Oseltamivir Jun 28, 2026
f594ab9
CollectiveX: gated.md — framework-AR (sglang/vllm/aiter) DONE; NIXL U…
Oseltamivir Jun 28, 2026
e3b1aad
CollectiveX: MI355X cross-node EP path — MoRI RDMA internode (goal 183)
Oseltamivir Jun 28, 2026
79cf2f6
CollectiveX: cross-node H100/H200 EP path — multi-node torchrun + UCC…
Oseltamivir Jun 28, 2026
22c2a12
CollectiveX: add prune_results.py — results hygiene (newest-N-valid p…
Oseltamivir Jun 28, 2026
aaf79c9
CollectiveX: cross-node EP — MASTER_ADDR = routable NodeAddr IP (fix …
Oseltamivir Jun 28, 2026
34943b1
CollectiveX: pin cross-node PG bootstrap iface for EP rendezvous
Oseltamivir Jun 28, 2026
45097ca
CollectiveX: drop superseded DeepEP capability probes
Oseltamivir Jun 28, 2026
308101a
CollectiveX: drop tools/_keep_newest.py — subsumed by prune_results.py
Oseltamivir Jun 28, 2026
53c4575
CollectiveX: xnode-net — always-on net diagnostic + missing-iproute2 …
Oseltamivir Jun 28, 2026
7b93bc0
CollectiveX: opt-in FileStore rendezvous for cross-node EP (CX_RDZV_F…
Oseltamivir Jun 28, 2026
f108874
CollectiveX: H200 cross-node EP via multi-srun + FileStore rendezvous
Oseltamivir Jun 28, 2026
344d051
CollectiveX: cross-node EP local-spawn via FileStore (no torchrun agent)
Oseltamivir Jun 28, 2026
e8d9a77
CollectiveX: add nccl-ep — NCCL/RCCL all-to-all EP (cross-node, both …
Oseltamivir Jun 28, 2026
127785d
CollectiveX: add nccl-ep to run_ep.py --backend argparse choices
Oseltamivir Jun 28, 2026
68d0e18
CollectiveX: gated.md — cross-node EP DONE via nccl-ep (rendezvous + …
Oseltamivir Jun 28, 2026
4113533
CollectiveX: allow nccl-ep on MI355X launcher (was remapped to mori)
Oseltamivir Jun 28, 2026
5a66645
CollectiveX: gated.md — goal 183 DONE, MI355X cross-node EP via nccl-…
Oseltamivir Jun 28, 2026
af2b445
CollectiveX: allow mooncake on MI355X launcher (was remapped to mori)
Oseltamivir Jun 29, 2026
3f2db08
CollectiveX: gated.md — MI355X collective backfill outcomes
Oseltamivir Jun 29, 2026
a274bdf
CollectiveX: capability — accept nccl primitives bench on AMD (rccl-t…
Oseltamivir Jun 29, 2026
ccfb3e3
CollectiveX: _gha_suite.sh — --deepep-v2 + --backend override for ful…
Oseltamivir Jun 29, 2026
680c397
CollectiveX: register b200 + gb200, un-drop gb300, thread rack-scale …
Oseltamivir Jun 29, 2026
fc76925
CollectiveX: collectivex-sweep.yml — setup -> matrix(shards) -> aggre…
Oseltamivir Jun 29, 2026
7e3380b
CollectiveX: fix sweep canonical-manifest failures (shard mode)
Oseltamivir Jun 29, 2026
593d4a4
CollectiveX: fix rack-scale EP8 sweep + b200 DeepEP-V2 arch
Oseltamivir Jun 29, 2026
c53e827
CollectiveX: fix JOB_ID race in salloc launchers (matrix concurrency)
Oseltamivir Jun 29, 2026
38890f6
CollectiveX: fix rack-scale EP8 shard-file path resolution
Oseltamivir Jun 29, 2026
1e4ab46
CollectiveX: plot_ep reads the consolidated ndjson (collapse loose re…
Oseltamivir Jun 29, 2026
40f30cd
CollectiveX: combine per-backend sweeps into ONE dispatch (backend=all)
Oseltamivir Jun 29, 2026
64a2495
CollectiveX: remove superseded tools/ SSH-orchestration scripts
Oseltamivir Jun 29, 2026
5a28f27
CollectiveX: document uccl + deepep-hybrid aarch64 GB200/GB300 wall
Oseltamivir Jun 29, 2026
5a98078
CollectiveX: plot defaults to All publication view (show the full sweep)
Oseltamivir Jun 30, 2026
c308858
CollectiveX: deepep-v2 x86-single-node only (was mislabeling V1 as v2…
Oseltamivir Jun 30, 2026
06dd4e8
CollectiveX: correct stale UCCL 'deferred/scaffold' docs — it produce…
Oseltamivir Jun 30, 2026
fd49614
CollectiveX: gb200/gb300 DeepEP V2 at EP4 (aarch64 V2 builds; only EP…
Oseltamivir Jun 30, 2026
0dfb124
CollectiveX: sweep_matrix sets explicit gb200/gb300 tray count (EP4 w…
Oseltamivir Jun 30, 2026
3c546cb
CollectiveX: gb300 EP8 rack builds V2/quant-combine once per node (pe…
Oseltamivir Jun 30, 2026
3e2eeb4
CollectiveX: gb300 EP8 deepep — force NVSHMEM off MNNVL for DeepEP LL…
Oseltamivir Jun 30, 2026
1630e0b
CollectiveX: gb300 EP8 deepep-v2 — pass allow_mnnvl=True to span tray…
Oseltamivir Jun 30, 2026
dc4e0c5
CollectiveX: gb300 EP8 deepep-v2 DONE — finalize (sweep re-enable, gb…
Oseltamivir Jun 30, 2026
dfaef9c
CollectiveX: h100 launcher gains cross-node EP path (CX_NODES>1, worl…
Oseltamivir Jun 30, 2026
b37c000
CollectiveX: correct h100 cross-node overclaim (WALLED, not 'same pat…
Oseltamivir Jun 30, 2026
81f42c9
CollectiveX sweep: add --max-nodes filter (symmetric to --min-nodes) …
Oseltamivir Jun 30, 2026
b8beb2d
CollectiveX: re-validate gb300 uccl/deepep-hybrid walls (per-backend,…
Oseltamivir Jun 30, 2026
b623948
CollectiveX: fix deepep-hybrid EP8 build-env propagation across srun …
Oseltamivir Jun 30, 2026
d7529a5
CollectiveX: deepep-hybrid build installs to site-packages (persist a…
Oseltamivir Jun 30, 2026
b1f0b4b
CollectiveX: sweep_matrix keeps mori PREFILL (capped), not decode-only
Oseltamivir Jun 30, 2026
c61961f
CollectiveX: correct deepep-hybrid gb300 EP8 — WORKS (not intranode-o…
Oseltamivir Jun 30, 2026
f0a8370
CollectiveX: correct ep_deepep_hybrid docstring/provenance (EP8 MNNVL…
Oseltamivir Jun 30, 2026
aab1172
CollectiveX: doc — deleting all runs de-registers a non-main workflow…
Oseltamivir Jul 1, 2026
6651a24
CollectiveX sweep: raise --max-cases default 14 -> 128 (eliminate chu…
Oseltamivir Jul 1, 2026
1bad711
CollectiveX sweep: drop mode/resource_mode from shard key -> 49 jobs …
Oseltamivir Jul 1, 2026
689861b
CollectiveX: from-source builds idempotent (build once per allocation…
Oseltamivir Jul 1, 2026
ffe663e
CollectiveX sweep: CX_TIME=120 for consolidated shards (up to ~74 cas…
Oseltamivir Jul 1, 2026
081cb90
CollectiveX: chunk flashinfer (per-backend max_cases=16) + settle bet…
Oseltamivir Jul 1, 2026
e2bed69
CollectiveX: revert flashinfer between-case settle (tested — made cra…
Oseltamivir Jul 1, 2026
ff2a1c1
CollectiveX: document h100 flashinfer intermittent CUDA-launch-failur…
Oseltamivir Jul 1, 2026
85d6159
CollectiveX: CX_FLASHINFER_UPGRADE — run plain flashinfer on the newe…
Oseltamivir Jul 1, 2026
8cbd7c8
CollectiveX: retry flashinfer cases (recovers the intermittent MNNVL-…
Oseltamivir Jul 1, 2026
d9f5190
CollectiveX docs: flashinfer retry mitigation + h200 pidfd wall speci…
Oseltamivir Jul 1, 2026
dd2a602
CollectiveX docs: deepep-hybrid h100/h200 works (212/212); empty-rank…
Oseltamivir Jul 1, 2026
de7dec5
CollectiveX docs: flashinfer retry MEASURED (30/46, correlated not 94…
Oseltamivir Jul 1, 2026
8c58b21
CollectiveX docs cleanup: README rewritten to current state; fix stal…
Oseltamivir Jul 2, 2026
0d54356
CollectiveX: flashinfer retry on the single-bench path (covers h100 q…
Oseltamivir Jul 2, 2026
41417d0
CollectiveX: h100 quant-combine = MEASURED kernel arch wall (moe_a2a_…
Oseltamivir Jul 2, 2026
b8eeec3
CollectiveX docs: h100 full-sweep results — uccl LL h100 hang is inte…
Oseltamivir Jul 2, 2026
6c2c515
CollectiveX: experimental workflow triggers the app snapshot refresh …
Oseltamivir Jul 2, 2026
28394cf
CollectiveX: retire ep-activation-sensitivity-v1 from the sweep — nul…
Oseltamivir Jul 2, 2026
0cd0d27
CollectiveX: full mi355x token ladders (large-T via the validated 8:1…
Oseltamivir Jul 2, 2026
07c9283
CollectiveX: job conclusions match the judge-by-data doctrine — deter…
Oseltamivir Jul 2, 2026
3dbacd1
CollectiveX: restore base CX_TS after the shard loop — summarize gate…
Oseltamivir Jul 2, 2026
b649fd8
CollectiveX: ep-result v4 stamp + schema/validator drift fixes; real …
Oseltamivir Jul 2, 2026
6878b1e
CollectiveX: design the e2e serving-correlation study (docs/e2e_corre…
Oseltamivir Jul 2, 2026
5668635
CollectiveX: skip skewed-routing prefill on mi355x (measured: receive…
Oseltamivir Jul 2, 2026
8e2d589
CollectiveX AMD parity: RCCL primitives ride the push job alongside M…
Oseltamivir Jul 2, 2026
67d877f
CollectiveX AMD parity: enable offload bench on MI355X; probe the pin…
Oseltamivir Jul 2, 2026
ece799d
CollectiveX: vendor-parity matrix (docs/parity.md, generated from cap…
Oseltamivir Jul 2, 2026
caaba99
CollectiveX: nccl-ep joins the mi355x sweep shard — the portable RCCL…
Oseltamivir Jul 2, 2026
73bace6
CollectiveX: wire MI300X + MI325X pools for the RCCL/primitives lane
Oseltamivir Jul 2, 2026
2e55a47
CollectiveX: cluster-scope the salloc exclude list; add chi-mi300x-04…
Oseltamivir Jul 2, 2026
b83bef6
CollectiveX: mi300x = evidenced cluster-wide enroot-userns wall; mi32…
Oseltamivir Jul 2, 2026
6f71e44
CollectiveX mi325x: judge the 9-run fleet — 6 valid; fix the 3 failur…
Oseltamivir Jul 2, 2026
8b91e30
CollectiveX mi325x: route MoRI around the GPUDirect-RDMA wall via int…
Oseltamivir Jul 2, 2026
1ed5757
CollectiveX mi300x: wire the same MoRI intranode-XGMI knobs as mi325x…
Oseltamivir Jul 2, 2026
d2522cc
CollectiveX mi325x: route mori/mori-io to the PR#355+ MoRI image (070…
Oseltamivir Jul 2, 2026
507bb39
CollectiveX mi325x: EP mori forces MORI_ENABLE_SDMA=0 (device kernels…
Oseltamivir Jul 2, 2026
db1a203
CollectiveX mi325x: EP mori sets MORI_SHMEM_HEAP_TYPE=normal — the de…
Oseltamivir Jul 2, 2026
53f9426
CollectiveX mi325x: route EP mori through the AsyncLL kernel (gfx942'…
Oseltamivir Jul 2, 2026
b21a720
CollectiveX: mi325x mori EP validated (AsyncLL kernel on gfx942) — th…
Oseltamivir Jul 2, 2026
c480e34
feat(collectivex): standardize timing and artifact contracts
Oseltamivir Jul 3, 2026
eb91f35
Merge remote-tracking branch 'origin/main' into collectivex
Oseltamivir Jul 3, 2026
391e038
docs(collectivex): consolidate v1 contract
Oseltamivir Jul 3, 2026
0462b16
refactor(collectivex): freeze minimal v1 sweep
Oseltamivir Jul 3, 2026
236 changes: 236 additions & 0 deletions .github/workflows/collectivex-experimental.yml
@@ -0,0 +1,236 @@
name: CollectiveX Experimental

# Orchestration only — all benchmark logic lives in experimental/CollectiveX/.
# Manual one-off diagnostics. Promoted v1 coverage uses collectivex-sweep.yml.

on:
  workflow_dispatch:
    inputs:
      sku:
        # Only SKUs with a matching launchers/launch_<prefix>.sh are offered —
        # runner.name's prefix selects the script, so an SKU without one fails.
        description: Self-hosted runner pool (must have a CollectiveX launcher)
        type: choice
        default: gb200
        options: [gb200, b200-dgxc, b200-multinode, mi355x, mi300x, mi325x, h100-dgxc, h200, b300, gb300]
      benchmark:
        # mori runs only on mi355x; nccl/deepep/uccl/all + the collective benches on NVIDIA SKUs.
        # offload/copy-engine/kv-cache are single-process memcpy-family collectives (family!=moe).
        description: Which benchmark to run
        type: choice
        default: nccl
        options: [nccl, deepep, deepep-hybrid, mori, uccl, nccl-ep, flashinfer, flashinfer-combine-fp8, flashinfer-combine-fp8-directcast, flashinfer-combine-nvfp4, nixl, mori-io, nccl-kv, mooncake, offload, copy-engine, kv-cache, rl-mesh, allreduce-fw, allreduce-fw-vllm, all]
      ops:
        description: NCCL ops (space-separated); blank = default set
        type: string
        default: ''
      min_bytes:
        description: nccl-tests min message size
        type: string
        default: '8'
      max_bytes:
        description: nccl-tests max message size
        type: string
        default: '8G'
      nodes:
        description: Node count (gb200 multi-node MNNVL; 2 = 8 GPU). Blank/1 = single node.
        type: string
        default: ''
      phase:
        # EP only. 'both' fans out to one job per phase (decode + prefill).
        description: EP phase — decode (small T) / prefill (large T); 'both' = a job each
        type: choice
        default: both
        options: [both, decode, prefill]
      timing:
        # Combined timing knobs "iters:trials:warmup" (GitHub caps workflow_dispatch at 25 inputs,
        # so these share one). fixed-512-v1 requires this exact profile on every SKU/backend.
        description: 'EP timing "iters:trials:warmup" (fixed-512-v1 requires 8:64:32)'
        type: string
        default: '8:64:32'
      tokens_ladder:
        description: EP source-tokens-per-rank sweep (space/comma sep); blank = phase default
        type: string
        default: ''
      dispatch_dtype:
        description: EP dispatch payload precision (fp8 scale-layout recipes + FlashInfer OCP-microscaling mxfp8/nvfp4)
        type: choice
        default: bf16
        options: [bf16, fp8, fp8-pertoken, fp8-directcast, mxfp8, mxfp4, nvfp4]
      mode:
        # LL is retained for manual diagnostics only; it is not a promoted v1 dimension.
        description: EP kernel path (LL is diagnostic only)
        type: choice
        default: normal
        options: [normal, ll]
      resource_mode:
        # normalized = ~sm_fraction of device units (cross-vendor apples-to-apples);
        # tuned = each backend's own recommended/default launch config.
        description: Comm resource regime
        type: choice
        default: tuned
        options: [normalized, tuned, default]
      contract:
        # [cl]/[rv] are retained for explicit diagnostics, never promoted v1 comparisons.
        description: Measurement contract (non-default contracts are diagnostic only)
        type: choice
        default: layout-and-dispatch-v1
        options: [layout-and-dispatch-v1, cached-layout-comm-only-v1, runtime-visible-v1]
      routing:
        # v1 schedules uniform and zipf only. The remaining choices are one-off diagnostics.
        description: EP routing distribution
        type: choice
        default: uniform
        options: [uniform, zipf, balanced, balanced-rank-local, hotspot-single]
      eplb:
        # EPLB = replicate hot experts + balanced-place (the remedy for skewed routing). A pure
        # routing-trace transform; experts -> num_logical+redundant. Meaningful with zipf*.
        description: Apply EPLB expert replication/placement
        type: boolean
        default: false
      canonical:
        # Consume a CANONICAL serialized workload (generated deterministically in-container) instead
        # of seeded-runtime. A canonical-serialized run with full GHA provenance is publication
        # 'official' — this is the switch that promotes a cohort past comparable-experimental.
        description: Use canonical serialized workload (official-grade workload identity)
        type: boolean
        default: false
      activation_profile:
        # Activation VALUE distribution of expert inputs. normal = headline; the others stress a
        # future quantized combine (latency-neutral under bf16 — the expected null result).
        description: Activation value profile
        type: choice
        default: normal
        options: [normal, zeros, small-amplitude, wide-dynamic-range, fp8-saturation]
      sm_fraction:
        # normalized comm-resource fraction (DeepEP sm_fraction*SMs / MoRI ~*CUs). Sweep this with
        # resource_mode=normalized to build the resource-Pareto (latency vs comm fraction). Blank =
        # harness default 0.18.
        description: Normalized comm-resource fraction (resource_mode=normalized)
        type: string
        default: ''
      hidden:
        # Manual shape override. Blank = deepseek-v3-v1 default 7168.
        description: MoE hidden dim (model-derived workloads); blank = 7168
        type: string
        default: ''
      topk:
        description: MoE top-k (model-derived workloads); blank = 8
        type: string
        default: ''
      experts:
        description: MoE total experts (model-derived workloads); blank = 256
        type: string
        default: ''
      uneven_tokens:
        # Manual diagnostic only; not a promoted v1 dimension.
        description: Uneven source-token allocation
        type: choice
        default: none
        options: [none, linear, empty-rank]
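The `timing` input above packs three knobs into one string because `workflow_dispatch` allows at most 25 inputs. The Python sketch below shows one way a consumer could split and validate that string. `Timing` and `parse_timing` are hypothetical names used only for illustration, and the fallback to `8:64:32` on a blank value is an assumption; the CollectiveX launchers may parse the value differently.

```python
from typing import NamedTuple


class Timing(NamedTuple):
    """Hypothetical container for the three timing knobs."""
    iters: int
    trials: int
    warmup: int


def parse_timing(value: str, default: str = "8:64:32") -> Timing:
    """Split an "iters:trials:warmup" string into three integers."""
    # Assumption: a blank input falls back to the workflow default.
    parts = (value.strip() or default).split(":")
    if len(parts) != 3:
        raise ValueError(f"expected iters:trials:warmup, got {value!r}")
    iters, trials, warmup = (int(p) for p in parts)
    # iters and trials must be positive; warmup may be zero.
    if iters <= 0 or trials <= 0 or warmup < 0:
        raise ValueError(f"timing values out of range: {value!r}")
    return Timing(iters, trials, warmup)
```

Keeping the three values in one colon-separated field trades a little parsing for two freed input slots, which matters once the workflow sits at the input cap.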

concurrency:
  # Group per (SKU + FULL config): GitHub keeps only one running + one pending per group and
  # cancels the rest, so a coarse per-SKU group made a fan-out of many configs on one SKU
  # self-cancel down to ~2. Including dtype/mode/contract/routing/eplb/phase gives each config
  # its OWN group -> all configs survive; they queue only on the runner's own capacity, not on
  # GitHub concurrency. cancel-in-progress FALSE so a re-dispatch of the SAME config queues.
  # Resource/value axes remain in the group so distinct diagnostics do not self-cancel.
  group: collectivex-${{ github.ref }}-${{ inputs.sku }}-${{ inputs.benchmark }}-${{ inputs.dispatch_dtype }}-${{ inputs.mode }}-${{ inputs.contract }}-${{ inputs.routing }}-${{ inputs.eplb }}-${{ inputs.phase }}-${{ inputs.resource_mode }}-${{ inputs.sm_fraction }}-${{ inputs.activation_profile }}-${{ inputs.hidden }}-${{ inputs.topk }}-${{ inputs.experts }}-${{ inputs.uneven_tokens }}-${{ inputs.nodes }}
  cancel-in-progress: false
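To make the grouping rule above concrete, the following Python sketch rebuilds the group string from a dictionary of inputs, in the same key order as the `group:` expression. `GROUP_KEYS` and `concurrency_group` are hypothetical helpers, not part of the workflow, and the sketch treats every input as an already-rendered string (GitHub renders booleans as `true`/`false`).

```python
# Input names in the order they appear in the workflow's `group:` expression.
GROUP_KEYS = [
    "sku", "benchmark", "dispatch_dtype", "mode", "contract", "routing",
    "eplb", "phase", "resource_mode", "sm_fraction", "activation_profile",
    "hidden", "topk", "experts", "uneven_tokens", "nodes",
]


def concurrency_group(ref: str, inputs: dict[str, str]) -> str:
    """Join the ref and every grouped input with '-'; missing inputs are blank."""
    fields = ["collectivex", ref] + [inputs.get(key, "") for key in GROUP_KEYS]
    return "-".join(fields)


base = {"sku": "gb200", "benchmark": "deepep", "dispatch_dtype": "bf16"}
bf16_group = concurrency_group("refs/heads/collectivex", base)
fp8_group = concurrency_group(
    "refs/heads/collectivex", {**base, "dispatch_dtype": "fp8"}
)
```

Here `bf16_group` and `fp8_group` differ, so the two dispatches land in separate groups and neither displaces the other, while a second dispatch of the identical configuration produces the same string and queues behind the first.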

permissions:
  contents: read

jobs:
# Manual dispatch -> chosen SKU + benchmark. Lands on the inputs.sku runner.
dispatch:
# The bare `h200` label spans TWO clusters: 14 h200-dgxc runners (login-0; the EP
# path is validated there) and 2 h200-cw (CoreWeave) runners that have no
# launch_h200-cw.sh and fail with exit 127. Pin h200 to the h200-dgxc pool so every
# dispatch lands where the launcher + FS + partition are known-good. Other SKUs are
# single-pool, so pass the sku through unchanged.
runs-on: ${{ inputs.sku == 'h200' && 'h200-dgxc' || inputs.sku }}
timeout-minutes: 120
strategy:
fail-fast: false
matrix:
# nccl/rccl are collective primitives — phase is meaningless, so run ONE job (not
# the same work twice). EP backends: 'both' -> decode + prefill; else a single job.
phase: ${{ fromJSON((inputs.benchmark == 'nccl' || inputs.benchmark == 'rccl') && '["na"]' || (inputs.phase == 'both' && '["decode","prefill"]' || format('["{0}"]', inputs.phase))) }}
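# Reading aid for the expression above (a sketch; the input values are illustrative):
#   benchmark=nccl or rccl, any phase  -> ["na"]
#   EP backend, phase=both             -> ["decode","prefill"]
#   EP backend, phase=decode           -> ["decode"]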
env:
# flashinfer-combine-{fp8,nvfp4} = the flashinfer EP backend with a QUANTIZED COMBINE OUTPUT
# (MXFP8 e4m3+e8m0, or NVFP4 e2m1, via the flashinfer-main moe_a2a_combine output_dtype). Map to
# CX_BENCH=flashinfer + CX_COMBINE_DTYPE (run_flashinfer_suite builds flashinfer-main when
# CX_COMBINE_DTYPE!=bf16). Input-cap-safe (a benchmark CHOICE, not a new input).
CX_BENCH: ${{ startsWith(inputs.benchmark, 'flashinfer-combine') && 'flashinfer' || (inputs.benchmark == 'allreduce-fw-vllm' && 'allreduce-fw' || inputs.benchmark) }}
# allreduce-fw-vllm = the framework all-reduce bench in a vLLM container (container switch for
# the vLLM custom-AR, goal 215) — set CX_IMAGE to a vLLM cuda image; the launcher uses CX_IMAGE
# when non-empty, else cx_default_image. Input-cap-safe (a benchmark CHOICE).
CX_IMAGE: ${{ inputs.benchmark == 'allreduce-fw-vllm' && 'vllm/vllm-openai:latest' || '' }}
# startsWith catches both flashinfer-combine-fp8 and -fp8-directcast (both fp8 combine output;
# the -directcast variant differs only in CX_QC_SCALE=scalar below — a single output_scalar_scale,
# no per-block scales = the unscaled direct-cast fp8 combine).
CX_COMBINE_DTYPE: ${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'bf16') }}
CX_COMBINE_QUANT_MODE: ${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'none') }}
CX_QC_SCALE: ${{ inputs.benchmark == 'flashinfer-combine-fp8-directcast' && 'scalar' || '' }}
CX_OPS: ${{ inputs.ops }}
CX_MIN_BYTES: ${{ inputs.min_bytes }}
CX_MAX_BYTES: ${{ inputs.max_bytes }}
CX_NODES: ${{ inputs.nodes }}
CX_PHASE: ${{ matrix.phase }}
CX_TOKENS_LADDER: ${{ inputs.tokens_ladder }}
CX_DISPATCH_DTYPE: ${{ inputs.dispatch_dtype }}
CX_MODE: ${{ inputs.mode }}
CX_RESOURCE_MODE: ${{ inputs.resource_mode }}
CX_MEASUREMENT_CONTRACT: ${{ inputs.contract }}
CX_ROUTING: ${{ inputs.routing }}
CX_EPLB: ${{ inputs.eplb && '1' || '' }}
# Canonical serialized workload (official-grade identity) + value diagnostics.
CX_CANONICAL: ${{ inputs.canonical && '1' || '' }}
CX_ACTIVATION_PROFILE: ${{ inputs.activation_profile }}
CX_SM_FRACTION: ${{ inputs.sm_fraction }}
# Manual shape and uneven-allocation diagnostics.
CX_HIDDEN: ${{ inputs.hidden }}
CX_TOPK: ${{ inputs.topk }}
CX_EXPERTS: ${{ inputs.experts }}
CX_UNEVEN_TOKENS: ${{ inputs.uneven_tokens }}
CX_TIMING: ${{ inputs.timing }}
# GHA run provenance: run_ep records git_run (repo/run/attempt/ref/sha/job) -> a GHA result
# is provenance_complete (publication_status >= comparable-experimental, official w/ canonical).
COLLECTIVEX_SOURCE_SHA: ${{ github.sha }}
COLLECTIVEX_ARTIFACT_NAME: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }}
# GB200/watchtower needs a compute-visible workspace; harmless elsewhere.
CX_STAGE_DIR: ${{ inputs.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }}
# MI355X: pin to the warm-squash, writable nodes.
CX_NODELIST: ${{ inputs.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
with: { clean: true }
# Reject an unsupported backend/SKU/mode/dtype/contract BEFORE consuming the runner
# (review #3): fail fast on the login node, not after a salloc. 'all' fans out per
# vendor in-container, so skip the single-combo check for it.
- name: Validate capability
if: inputs.benchmark != 'all'
run: |
python3 experimental/CollectiveX/tests/capability.py \
--sku "${{ inputs.sku }}" \
--backend "${{ startsWith(inputs.benchmark, 'flashinfer-combine') && 'flashinfer' || (inputs.benchmark == 'allreduce-fw-vllm' && 'allreduce-fw' || inputs.benchmark) }}" \
--mode "${{ inputs.mode }}" --dtype "${{ inputs.dispatch_dtype }}" \
--contract "${{ inputs.contract }}" \
--combine-dtype "${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'bf16') }}" \
--combine-quant-mode "${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'none') }}"
- name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }} (${{ matrix.phase }})
env:
RUNNER_NAME: ${{ runner.name }}
run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
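# Reading aid (the runner name here is hypothetical): ${RUNNER_NAME%%_*} drops everything from
# the first underscore onward, so a runner named h200-dgxc_03 would run launch_h200-dgxc.sh.
# A runner name with no underscore is used unchanged.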
Comment thread
cursor[bot] marked this conversation as resolved.

Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.
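
One possible shape for the fix, as a minimal sketch: extend the existing conditional so the b200-multinode target also gets a staging directory. The b200 path below is a placeholder assumption, not a known mount; the real compute-visible location would have to come from launch_b200-dgxc-slurm.sh or the cluster's documentation.

```yaml
# Sketch only: '/path/to/compute-visible/cx-stage' is a hypothetical placeholder.
CX_STAGE_DIR: ${{ inputs.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || (inputs.sku == 'b200-multinode' && '/path/to/compute-visible/cx-stage' || '') }}
```

Nesting the second condition in the `||` branch keeps the existing behaviour: any SKU that matches neither case still gets an empty CX_STAGE_DIR.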

Reviewed by Cursor Bugbot for commit d8ee9bf.

- name: Results summary
if: always()
run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"
- name: Upload results
if: always()
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }}
path: experimental/CollectiveX/results/*.json
if-no-files-found: warn