-
Notifications
You must be signed in to change notification settings - Fork 217
Update Minimax M3 FP4 B300 Eagle #2006
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -12631,7 +12631,7 @@ minimaxm3-fp4-b300-vllm: | |
| # /scratch/models/MiniMax-M3-NVFP4 (added to the STAGED_MODELS allow-list in | ||
| # launch_b300-nv.sh); the EAGLE3 draft is downloaded to the writable models dir. | ||
| minimaxm3-fp4-b300-vllm-mtp: | ||
| image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 | ||
| image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 Missing Extended reasoning...What the bug is
This PR does all three of the qualifying trigger conditions but touches only 2 files ( How it manifests / step-by-step proof
Why existing code doesn't cover it There is no wildcard changelog entry that catches this recipe key; the closest existing entry is for Precedent from sibling PRs Every recent sibling recipe-update PR that bumped an image and/or reshuffled the search space appended a How to fix Append a
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 The rationale comments above and inside this recipe are now stale after this PR: (a) the yaml header at lines 12625-12632 still claims the recipe uses "3 speculative tokens" and that NVFP4 support is "baked into the perf container image", and (b) the shell-script header at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh lines 3-7 repeats the "perf container image" phrasing. Both are stale after this diff swapped the perf-branded image for a generic nightly and replaced the constant NUM_SPEC_TOKENS=3 with a per-operating-point NUM_SPEC_TOKENS_MAP; update both to reference the new image (e.g. note PR #46380 is now merged upstream and included in commit 93d8f83) and the varying token count. Extended reasoning...This PR makes two changes whose rationale comments were left behind, and they show up in two adjacent locations. 1. The yaml header at
2. The shell-script header at Impact. Neither comment affects runtime behavior — the recipe still runs. But two failure modes for future readers are worth avoiding:
Fix. Update the yaml header to (a) drop or rephrase "baked into the perf container image" — e.g. "vllm PR #46380 is now merged upstream and included in nightly commit 93d8f83" — and (b) replace "(3 speculative tokens)" with a note that the count varies per operating point (1..4) via Step-by-step proof.
Note on the duplicate objection. One refutation argued this collapses into a single .sh-header nit. Kept as a merged finding because the yaml header carries a distinct stale claim ("3 speculative tokens") that the .sh header does not, and it lives in a different physical comment block; combining them makes it likelier the author touches both while updating rationale rather than only the .sh. |
||
| model: nvidia/MiniMax-M3-NVFP4 | ||
| model-prefix: minimaxm3 | ||
| runner: b300 | ||
|
|
@@ -12640,24 +12640,32 @@ minimaxm3-fp4-b300-vllm-mtp: | |
| multinode: false | ||
| scenarios: | ||
| fixed-seq-len: | ||
| # Search space trimmed to the (parallelism, concurrency) points that lie on | ||
| # the Total-TPS/GPU vs median-interactivity Pareto frontier of the EAGLE3 | ||
| # offline sweep (bench_results.json). The best num_speculative_tokens varies | ||
| # along the frontier (1 at the throughput end, up to 4 at the latency end); | ||
| # the recipe shell (minimaxm3_fp4_b300_mtp.sh) selects it per point from the | ||
| # same ISL:TP:EP_SIZE:DP_ATTENTION:CONC key. TP2 (plain / EP / DP-attention) | ||
| # carries the throughput-to-mid front; TP4/TP8 only survive at the | ||
| # low-concurrency, high-interactivity tail. | ||
| - isl: 1024 | ||
| osl: 1024 | ||
| search-space: | ||
| - { tp: 8, conc-start: 1, conc-end: 64, spec-decoding: mtp } | ||
| - { tp: 8, ep: 8, conc-start: 1, conc-end: 256, spec-decoding: mtp } | ||
| - { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, conc-start: 64, conc-end: 256, spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, dp-attn: true, conc-start: 128, conc-end: 512, spec-decoding: mtp } | ||
| - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512, spec-decoding: mtp } | ||
| - { tp: 2, conc-list: [1, 2, 4, 8, 16, 32, 64, 512], spec-decoding: mtp } | ||
| - { tp: 2, ep: 2, conc-list: [1, 2, 4, 8, 16, 128, 512], spec-decoding: mtp } | ||
| - { tp: 2, ep: 2, dp-attn: true, conc-list: [256, 512], spec-decoding: mtp } | ||
| - { tp: 4, conc-list: [1, 2, 4, 8, 16, 32], spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, conc-list: [1, 2, 32], spec-decoding: mtp } | ||
| - { tp: 8, conc-list: [1, 2, 4, 8], spec-decoding: mtp } | ||
| - isl: 8192 | ||
| osl: 1024 | ||
| search-space: | ||
| - { tp: 8, conc-start: 1, conc-end: 64, spec-decoding: mtp } | ||
| - { tp: 8, ep: 8, conc-start: 1, conc-end: 256, spec-decoding: mtp } | ||
| - { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, conc-start: 64, conc-end: 256, spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 128, spec-decoding: mtp } | ||
| - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256, spec-decoding: mtp } | ||
| - { tp: 2, conc-list: [1, 2, 4, 8, 16, 32, 128, 256], spec-decoding: mtp } | ||
| - { tp: 2, ep: 2, conc-list: [1, 2, 4, 8, 32, 64, 128], spec-decoding: mtp } | ||
| - { tp: 2, ep: 2, dp-attn: true, conc-list: [128], spec-decoding: mtp } | ||
| - { tp: 4, conc-list: [1, 2, 4, 8], spec-decoding: mtp } | ||
| - { tp: 4, ep: 4, conc-list: [1, 2, 4, 8], spec-decoding: mtp } | ||
| - { tp: 8, conc-list: [1, 2, 4, 8], spec-decoding: mtp } | ||
|
|
||
| # MiniMax-M3 day-zero (https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3). | ||
| # 427B total / 26B active MoE with MSA sparse attention; MXFP8 checkpoint | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 The new comment at line 72 says
as in PARALLEL_ARGS below, but thePARALLEL_ARGSif/elif/else block that sets--data-parallel-size=$TPis at lines 57-63, immediately above this comment. Trivials/below/above/(or drop the directional word).Extended reasoning...
What the bug is. In the newly-added block introducing
NUM_SPEC_TOKENS_MAP, the comment ends with:The referent — the
PARALLEL_ARGSif/elif/else block that actually demonstrates theTP-as-data-parallel-size relationship — is not below this comment; it is above it.Proof by walkthrough. In
benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.shafter this PR:if [ "${DP_ATTENTION}" = "true" ]; then PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel" ...block. Line 58 in particular is what the comment is pointing at (it uses$TPas the data-parallel size whenDP_ATTENTION=true).... as in PARALLEL_ARGS below).declare -A NUM_SPEC_TOKENS_MAP=(.PARALLEL_ARGSafter the comment is the bare$PARALLEL_ARGSexpansion in thevllm serveinvocation (~line 105), which does not itself illustrate theTP-as-data-parallel-size definition.So the definition being referenced sits ~9 lines above the comment, and the only
PARALLEL_ARGSmention below is a bare variable expansion that carries no such demonstration.belowis directionally wrong.Impact. Zero runtime impact — this is a pure documentation error in a freshly-written comment. It only matters because a future reader following the reference (
as in PARALLEL_ARGS below) will scan downward and hit$PARALLEL_ARGSin thevllm serveline, which does not explain whyTPnames the DP size — momentary confusion, then they scroll up. Worth flagging while the author is right there in this region, but not merge-blocking.Suggested fix. Change
belowtoabove, or drop the directional word entirely (as in PARALLEL_ARGS). Speculative cause: the comment was likely drafted expecting the map declaration to sit above thePARALLEL_ARGSblock, then the block ordering was swapped during authoring but the wording wasn't updated.