CollectiveX: experimental cross-vendor collective/EP benchmark #1896
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,236 @@ | ||
| name: CollectiveX Experimental | ||
|
|
||
| # Orchestration only — all benchmark logic lives in experimental/CollectiveX/. | ||
| # Manual one-off diagnostics. Promoted v1 coverage uses collectivex-sweep.yml. | ||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| inputs: | ||
| sku: | ||
| # Only SKUs with a matching launchers/launch_<prefix>.sh are offered — | ||
| # runner.name's prefix selects the script, so an SKU without one fails. | ||
| description: Self-hosted runner pool (must have a CollectiveX launcher) | ||
| type: choice | ||
| default: gb200 | ||
| options: [gb200, b200-dgxc, b200-multinode, mi355x, mi300x, mi325x, h100-dgxc, h200, b300, gb300] | ||
| benchmark: | ||
| # mori runs only on mi355x; nccl/deepep/uccl/all + the collective benches on NVIDIA SKUs. | ||
| # offload/copy-engine/kv-cache are single-process memcpy-family collectives (family!=moe). | ||
| description: Which benchmark to run | ||
| type: choice | ||
| default: nccl | ||
| options: [nccl, deepep, deepep-hybrid, mori, uccl, nccl-ep, flashinfer, flashinfer-combine-fp8, flashinfer-combine-fp8-directcast, flashinfer-combine-nvfp4, nixl, mori-io, nccl-kv, mooncake, offload, copy-engine, kv-cache, rl-mesh, allreduce-fw, allreduce-fw-vllm, all] | ||
| ops: | ||
| description: NCCL ops (space-separated); blank = default set | ||
| type: string | ||
| default: '' | ||
| min_bytes: | ||
| description: nccl-tests min message size | ||
| type: string | ||
| default: '8' | ||
| max_bytes: | ||
| description: nccl-tests max message size | ||
| type: string | ||
| default: '8G' | ||
| nodes: | ||
| description: Node count (gb200 multi-node MNNVL; 2 = 8 GPU). Blank/1 = single node. | ||
| type: string | ||
| default: '' | ||
| phase: | ||
| # EP only. 'both' fans out to one job per phase (decode + prefill). | ||
| description: EP phase — decode (small T) / prefill (large T); 'both' = a job each | ||
| type: choice | ||
| default: both | ||
| options: [both, decode, prefill] | ||
| timing: | ||
| # Combined timing knobs "iters:trials:warmup" (GitHub caps workflow_dispatch at 25 inputs, | ||
| # so these share one). fixed-512-v1 requires this exact profile on every SKU/backend. | ||
| description: 'EP timing "iters:trials:warmup" (fixed-512-v1 requires 8:64:32)' | ||
| type: string | ||
| default: '8:64:32' | ||
| tokens_ladder: | ||
| description: EP source-tokens-per-rank sweep (space/comma sep); blank = phase default | ||
| type: string | ||
| default: '' | ||
| dispatch_dtype: | ||
| description: EP dispatch payload precision (fp8 scale-layout recipes + FlashInfer OCP-microscaling mxfp8/nvfp4) | ||
| type: choice | ||
| default: bf16 | ||
| options: [bf16, fp8, fp8-pertoken, fp8-directcast, mxfp8, mxfp4, nvfp4] | ||
| mode: | ||
| # LL is retained for manual diagnostics only; it is not a promoted v1 dimension. | ||
| description: EP kernel path (LL is diagnostic only) | ||
| type: choice | ||
| default: normal | ||
| options: [normal, ll] | ||
| resource_mode: | ||
| # normalized = ~sm_fraction of device units (cross-vendor apples-to-apples); | ||
| # tuned = each backend's own recommended/default launch config. | ||
| description: Comm resource regime | ||
| type: choice | ||
| default: tuned | ||
| options: [normalized, tuned, default] | ||
| contract: | ||
| # cached-layout-comm-only-v1 [cl] and runtime-visible-v1 [rv] are retained for explicit diagnostics, never promoted v1 comparisons. | ||
| description: Measurement contract (non-default contracts are diagnostic only) | ||
| type: choice | ||
| default: layout-and-dispatch-v1 | ||
| options: [layout-and-dispatch-v1, cached-layout-comm-only-v1, runtime-visible-v1] | ||
| routing: | ||
| # v1 schedules uniform and zipf only. The remaining choices are one-off diagnostics. | ||
| description: EP routing distribution | ||
| type: choice | ||
| default: uniform | ||
| options: [uniform, zipf, balanced, balanced-rank-local, hotspot-single] | ||
| eplb: | ||
| # EPLB = replicate hot experts + balanced-place (the remedy for skewed routing). A pure | ||
| # routing-trace transform; experts -> num_logical+redundant. Meaningful with zipf*. | ||
| description: Apply EPLB expert replication/placement | ||
| type: boolean | ||
| default: false | ||
| canonical: | ||
| # Consume a CANONICAL serialized workload (generated deterministically in-container) instead | ||
| # of seeded-runtime. A canonical-serialized run with full GHA provenance is publication | ||
| # 'official' — this is the switch that promotes a cohort past comparable-experimental. | ||
| description: Use canonical serialized workload (official-grade workload identity) | ||
| type: boolean | ||
| default: false | ||
| activation_profile: | ||
| # Activation VALUE distribution of expert inputs. normal = headline; the others stress a | ||
| # future quantized combine (latency-neutral under bf16 — the expected null result). | ||
| description: Activation value profile | ||
| type: choice | ||
| default: normal | ||
| options: [normal, zeros, small-amplitude, wide-dynamic-range, fp8-saturation] | ||
| sm_fraction: | ||
| # Normalized comm-resource fraction (DeepEP: sm_fraction * SMs; MoRI: approximately sm_fraction * CUs). Sweep this with | ||
| # resource_mode=normalized to build the resource Pareto curve (latency vs. comm fraction). Blank = | ||
| # harness default 0.18. | ||
| description: Normalized comm-resource fraction (resource_mode=normalized) | ||
| type: string | ||
| default: '' | ||
| hidden: | ||
| # Manual shape override. Blank = deepseek-v3-v1 default 7168. | ||
| description: MoE hidden dim (model-derived workloads); blank = 7168 | ||
| type: string | ||
| default: '' | ||
| topk: | ||
| description: MoE top-k (model-derived workloads); blank = 8 | ||
| type: string | ||
| default: '' | ||
| experts: | ||
| description: MoE total experts (model-derived workloads); blank = 256 | ||
| type: string | ||
| default: '' | ||
| uneven_tokens: | ||
| # Manual diagnostic only; not a promoted v1 dimension. | ||
| description: Uneven source-token allocation | ||
| type: choice | ||
| default: none | ||
| options: [none, linear, empty-rank] | ||
|
|
||
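The `timing` input packs three knobs into one string, `iters:trials:warmup`, because of the 25-input cap noted above. A minimal sketch of how a harness might unpack it follows. The names `Timing` and `parse_timing`, the blank-string fallback, and the range checks are assumptions for illustration, not the CollectiveX implementation.

```python
from typing import NamedTuple


class Timing(NamedTuple):
    iters: int
    trials: int
    warmup: int


def parse_timing(spec: str, default: str = "8:64:32") -> Timing:
    """Split an 'iters:trials:warmup' string into three integers.

    Hypothetical helper: a blank spec falls back to the workflow default.
    """
    parts = (spec.strip() or default).split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 'iters:trials:warmup', got {spec!r}")
    # int() raises ValueError on a non-numeric field.
    iters, trials, warmup = (int(p) for p in parts)
    # Assumed sanity limits: at least one iteration and trial, no negative warmup.
    if iters < 1 or trials < 1 or warmup < 0:
        raise ValueError(f"out-of-range timing values in {spec!r}")
    return Timing(iters, trials, warmup)
```

With this sketch, `parse_timing("8:64:32")` yields the profile that the `fixed-512-v1` comment requires, and a malformed value such as `"8:64"` is rejected before any runner time is spent.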
| concurrency: | ||
| # Group per (SKU + FULL config): GitHub keeps only one running + one pending per group and | ||
| # cancels the rest, so a coarse per-SKU group made a fan-out of many configs on one SKU | ||
| # self-cancel down to ~2. Including dtype/mode/contract/routing/eplb/phase gives each config | ||
| # its OWN group -> all configs survive; they queue only on the runner's own capacity, not on | ||
| # GitHub concurrency. cancel-in-progress FALSE so a re-dispatch of the SAME config queues. | ||
| # Resource/value axes remain in the group so distinct diagnostics do not self-cancel. | ||
| group: collectivex-${{ github.ref }}-${{ inputs.sku }}-${{ inputs.benchmark }}-${{ inputs.dispatch_dtype }}-${{ inputs.mode }}-${{ inputs.contract }}-${{ inputs.routing }}-${{ inputs.eplb }}-${{ inputs.phase }}-${{ inputs.resource_mode }}-${{ inputs.sm_fraction }}-${{ inputs.activation_profile }}-${{ inputs.hidden }}-${{ inputs.topk }}-${{ inputs.experts }}-${{ inputs.uneven_tokens }}-${{ inputs.nodes }} | ||
| cancel-in-progress: false | ||
|
|
||
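The concurrency comment above says each config gets its own group. A small sketch of the same string concatenation makes it easier to see which dispatches share a group. The function name `concurrency_group` and the sample values are hypothetical; the key list mirrors the inputs that appear in the `group:` expression, in order. Input values are modelled as plain strings here, since GitHub renders booleans as `true`/`false`.

```python
# Inputs referenced by the workflow's concurrency group expression, in order.
GROUP_INPUTS = [
    "sku", "benchmark", "dispatch_dtype", "mode", "contract", "routing",
    "eplb", "phase", "resource_mode", "sm_fraction", "activation_profile",
    "hidden", "topk", "experts", "uneven_tokens", "nodes",
]


def concurrency_group(ref: str, inputs: dict) -> str:
    """Join the ref and the grouped inputs with '-' like the workflow does."""
    parts = ["collectivex", ref] + [str(inputs.get(k, "")) for k in GROUP_INPUTS]
    return "-".join(parts)


# Hypothetical dispatches for illustration.
base = {"sku": "gb200", "benchmark": "deepep", "routing": "uniform", "eplb": "false"}
zipf = {**base, "routing": "zipf"}
canonical = {**base, "canonical": "true"}

print(concurrency_group("refs/heads/main", base) == concurrency_group("refs/heads/main", zipf))
print(concurrency_group("refs/heads/main", base) == concurrency_group("refs/heads/main", canonical))
```

The first comparison is `False` because `routing` is part of the group. The second is `True`: inputs that the expression does not reference, such as `canonical`, `timing`, and `tokens_ladder`, do not separate dispatches, so two runs that differ only in those inputs queue behind each other in one group.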
| permissions: | ||
| contents: read | ||
|
|
||
| jobs: | ||
| # Manual dispatch -> chosen SKU + benchmark. Lands on the inputs.sku runner. | ||
| dispatch: | ||
| # The bare `h200` label spans TWO clusters: 14 h200-dgxc runners (login-0; the EP | ||
| # path is validated there) and 2 h200-cw (CoreWeave) runners that have no | ||
| # launch_h200-cw.sh and die exit 127. Pin h200 to the h200-dgxc pool so every | ||
| # dispatch lands where the launcher + FS + partition are known-good. Other SKUs are | ||
| # single-pool, so pass the sku through unchanged. | ||
| runs-on: ${{ inputs.sku == 'h200' && 'h200-dgxc' || inputs.sku }} | ||
| timeout-minutes: 120 | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| # nccl/rccl are collective primitives — phase is meaningless, so run ONE job (not | ||
| # the same work twice). EP backends: 'both' -> decode + prefill; else a single job. | ||
| phase: ${{ fromJSON((inputs.benchmark == 'nccl' || inputs.benchmark == 'rccl') && '["na"]' || (inputs.phase == 'both' && '["decode","prefill"]' || format('["{0}"]', inputs.phase))) }} | ||
| env: | ||
| # flashinfer-combine-{fp8,nvfp4} = the flashinfer EP backend with a QUANTIZED COMBINE OUTPUT | ||
| # (MXFP8 e4m3+e8m0, or NVFP4 e2m1, via the flashinfer-main moe_a2a_combine output_dtype). Map to | ||
| # CX_BENCH=flashinfer + CX_COMBINE_DTYPE (run_flashinfer_suite builds flashinfer-main when | ||
| # CX_COMBINE_DTYPE!=bf16). Input-cap-safe (a benchmark CHOICE, not a new input). | ||
| CX_BENCH: ${{ startsWith(inputs.benchmark, 'flashinfer-combine') && 'flashinfer' || (inputs.benchmark == 'allreduce-fw-vllm' && 'allreduce-fw' || inputs.benchmark) }} | ||
| # allreduce-fw-vllm = the framework all-reduce bench in a vLLM container (container switch for | ||
| # the vLLM custom-AR, goal 215) — set CX_IMAGE to a vLLM cuda image; the launcher uses CX_IMAGE | ||
| # when non-empty, else cx_default_image. Input-cap-safe (a benchmark CHOICE). | ||
| CX_IMAGE: ${{ inputs.benchmark == 'allreduce-fw-vllm' && 'vllm/vllm-openai:latest' || '' }} | ||
| # startsWith catches both flashinfer-combine-fp8 and -fp8-directcast (both fp8 combine output; | ||
| # the -directcast variant differs only in CX_QC_SCALE=scalar below — a single output_scalar_scale, | ||
| # no per-block scales = the unscaled direct-cast fp8 combine). | ||
| CX_COMBINE_DTYPE: ${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'bf16') }} | ||
| CX_COMBINE_QUANT_MODE: ${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'none') }} | ||
| CX_QC_SCALE: ${{ inputs.benchmark == 'flashinfer-combine-fp8-directcast' && 'scalar' || '' }} | ||
| CX_OPS: ${{ inputs.ops }} | ||
| CX_MIN_BYTES: ${{ inputs.min_bytes }} | ||
| CX_MAX_BYTES: ${{ inputs.max_bytes }} | ||
| CX_NODES: ${{ inputs.nodes }} | ||
| CX_PHASE: ${{ matrix.phase }} | ||
| CX_TOKENS_LADDER: ${{ inputs.tokens_ladder }} | ||
| CX_DISPATCH_DTYPE: ${{ inputs.dispatch_dtype }} | ||
| CX_MODE: ${{ inputs.mode }} | ||
| CX_RESOURCE_MODE: ${{ inputs.resource_mode }} | ||
| CX_MEASUREMENT_CONTRACT: ${{ inputs.contract }} | ||
| CX_ROUTING: ${{ inputs.routing }} | ||
| CX_EPLB: ${{ inputs.eplb && '1' || '' }} | ||
| # Canonical serialized workload (official-grade identity) + value diagnostics. | ||
| CX_CANONICAL: ${{ inputs.canonical && '1' || '' }} | ||
| CX_ACTIVATION_PROFILE: ${{ inputs.activation_profile }} | ||
| CX_SM_FRACTION: ${{ inputs.sm_fraction }} | ||
| # Manual shape and uneven-allocation diagnostics. | ||
| CX_HIDDEN: ${{ inputs.hidden }} | ||
| CX_TOPK: ${{ inputs.topk }} | ||
| CX_EXPERTS: ${{ inputs.experts }} | ||
| CX_UNEVEN_TOKENS: ${{ inputs.uneven_tokens }} | ||
| CX_TIMING: ${{ inputs.timing }} | ||
| # GHA run provenance: run_ep records git_run (repo/run/attempt/ref/sha/job) -> a GHA result | ||
| # is provenance_complete (publication_status >= comparable-experimental, official w/ canonical). | ||
| COLLECTIVEX_SOURCE_SHA: ${{ github.sha }} | ||
| COLLECTIVEX_ARTIFACT_NAME: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }} | ||
| # GB200/watchtower needs a compute-visible workspace; harmless elsewhere. | ||
| CX_STAGE_DIR: ${{ inputs.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }} | ||
| # MI355X: pin to the warm-squash, writable nodes. | ||
| CX_NODELIST: ${{ inputs.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }} | ||
| steps: | ||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0 | ||
| with: { clean: true } | ||
| # Reject an unsupported backend/SKU/mode/dtype/contract BEFORE consuming the runner | ||
| # (review #3): fail fast on the login node, not after a salloc. 'all' fans out per | ||
| # vendor in-container, so skip the single-combo check for it. | ||
| - name: Validate capability | ||
| if: inputs.benchmark != 'all' | ||
| run: | | ||
| python3 experimental/CollectiveX/tests/capability.py \ | ||
| --sku "${{ inputs.sku }}" \ | ||
| --backend "${{ startsWith(inputs.benchmark, 'flashinfer-combine') && 'flashinfer' || (inputs.benchmark == 'allreduce-fw-vllm' && 'allreduce-fw' || inputs.benchmark) }}" \ | ||
| --mode "${{ inputs.mode }}" --dtype "${{ inputs.dispatch_dtype }}" \ | ||
| --contract "${{ inputs.contract }}" \ | ||
| --combine-dtype "${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'bf16') }}" \ | ||
| --combine-quant-mode "${{ startsWith(inputs.benchmark, 'flashinfer-combine-fp8') && 'fp8' || (inputs.benchmark == 'flashinfer-combine-nvfp4' && 'nvfp4' || 'none') }}" | ||
| - name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }} (${{ matrix.phase }}) | ||
| env: | ||
| RUNNER_NAME: ${{ runner.name }} | ||
| run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh" | ||
|
Review comment: Workflow skips multinode staging (Medium Severity)
Reviewed by Cursor Bugbot for commit d8ee9bf. |
||
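The launch step picks its script with the shell expansion `${RUNNER_NAME%%_*}`, which drops the longest suffix matching `_*`, that is, everything from the first underscore onward. A minimal Python sketch of the same selection is below. The function name `launcher_path` and the sample runner names are hypothetical; only the path layout comes from the workflow.

```python
def launcher_path(runner_name: str) -> str:
    """Mirror ${RUNNER_NAME%%_*}: keep the text before the first underscore."""
    # A name with no underscore is kept whole, as in the shell expansion.
    prefix = runner_name.split("_", 1)[0]
    return f"experimental/CollectiveX/launchers/launch_{prefix}.sh"


# Hypothetical runner names for illustration.
print(launcher_path("h200-dgxc_3"))
print(launcher_path("h200-cw_0"))
```

Under these assumed names, a runner from the CoreWeave pool would resolve to `launch_h200-cw.sh`, the script the `runs-on` comment reports as missing. That is the failure the `h200` to `h200-dgxc` pin is meant to prevent.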
| - name: Results summary | ||
| if: always() | ||
| run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" | ||
| - name: Upload results | ||
| if: always() | ||
| uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1 | ||
| with: | ||
| name: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }} | ||
| path: experimental/CollectiveX/results/*.json | ||
| if-no-files-found: warn | ||
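The `matrix.phase` and `CX_*` expressions in this job rely on the `cond && a || b` idiom, which is hard to read inline. The sketch below restates them as ordinary Python to show what each `benchmark` choice resolves to. The function names `phases` and `bench_env` are hypothetical, and this is a reading aid, not code from the PR. One difference to keep in mind: GitHub's `startsWith` ignores case, while `str.startswith` does not.

```python
def phases(benchmark: str, phase: str) -> list[str]:
    """Phase matrix: one 'na' job for nccl/rccl, else decode/prefill fan-out."""
    # 'rccl' is tested by the workflow expression but is not among the
    # benchmark choices, so only 'nccl' reaches this branch from the form.
    if benchmark in ("nccl", "rccl"):
        return ["na"]
    if phase == "both":
        return ["decode", "prefill"]
    return [phase]


def bench_env(benchmark: str) -> dict[str, str]:
    """Benchmark-derived env vars, following the workflow's expressions."""
    is_combine = benchmark.startswith("flashinfer-combine")
    is_fp8 = benchmark.startswith("flashinfer-combine-fp8")
    is_nvfp4 = benchmark == "flashinfer-combine-nvfp4"
    is_vllm = benchmark == "allreduce-fw-vllm"

    if is_combine:
        cx_bench = "flashinfer"
    elif is_vllm:
        cx_bench = "allreduce-fw"
    else:
        cx_bench = benchmark

    quant = "fp8" if is_fp8 else ("nvfp4" if is_nvfp4 else None)
    return {
        "CX_BENCH": cx_bench,
        "CX_IMAGE": "vllm/vllm-openai:latest" if is_vllm else "",
        "CX_COMBINE_DTYPE": quant or "bf16",
        "CX_COMBINE_QUANT_MODE": quant or "none",
        "CX_QC_SCALE": "scalar" if benchmark == "flashinfer-combine-fp8-directcast" else "",
    }


print(phases("deepep", "both"))
print(bench_env("flashinfer-combine-fp8-directcast"))
```

Read this way, the two fp8 combine choices differ only in `CX_QC_SCALE`, and every benchmark other than `nccl` gets per-phase jobs, including the memcpy-family ones the input comment describes as `family!=moe`.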

