8 changes: 4 additions & 4 deletions .github/configs/amd-master.yaml
@@ -2525,7 +2525,7 @@ dsv4-fp4-mi355x-atom-disagg:
# https://github.com/vllm-project/recipes/commit/2a3728ed9892debfd767a72a58ebc90b33f186e5
# MXFP8 runs from TP=4 on gfx950; block size 128 is mandatory for MSA.
minimaxm3-fp8-mi355x-vllm:
-image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
model: MiniMaxAI/MiniMax-M3-MXFP8
model-prefix: minimaxm3
runner: mi355x
@@ -2562,7 +2562,7 @@ minimaxm3-fp8-mi355x-vllm:
# acceptance dilutes in big batches, and the draft weights + draft KV shave
# headroom — tp2-ep2 is dropped since its KV headroom was already thin.
minimaxm3-fp8-mi355x-vllm-mtp:
-image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
model: MiniMaxAI/MiniMax-M3-MXFP8
model-prefix: minimaxm3
runner: mi355x
@@ -2641,7 +2641,7 @@ minimaxm3-fp4-mi355x-vllm-disagg:
# language-model path and mirror the MXFP8 MI355X search space for a direct
# precision comparison.
minimaxm3-fp4-mi355x-vllm:
-image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
model: amd/MiniMax-M3-MXFP4
model-prefix: minimaxm3
runner: mi355x
@@ -2672,7 +2672,7 @@ minimaxm3-fp4-mi355x-vllm:
# tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
# FP4 sweep at extreme concurrency where speculative decoding loses value.
minimaxm3-fp4-mi355x-vllm-mtp:
-image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
model: amd/MiniMax-M3-MXFP4
model-prefix: minimaxm3
runner: mi355x
@@ -5,6 +5,7 @@
# minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from
# Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD
# MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed.
+# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).

Contributor

🟡 Nit: line 8 says "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419)." but on current main that STP recipe sets no AITER env vars and no --moe-backend — its own header comment says it "lets vLLM select the MoE backend." Those STP knobs come from #1954, which hasn't landed. If #1955 merges first, this cross-reference is wrong. Consider pointing at #1954 directly, or just describing the AITER setup (with the vllm#46419 credit) without claiming the STP file already does the same.

Extended reasoning...

What the comment claims vs. what is true on main

The new line 8 in benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh reads:

# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).

But benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on current main does not yet set VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MOE, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS, or pass --moe-backend aiter. Its line 6 explicitly states the opposite: "…lets vLLM select the MoE backend." Running git grep -e '--moe-backend' (the -e keeps git from parsing the pattern as an option) confirms the only occurrences of --moe-backend aiter in the tree are the two MTP files introduced by this PR.

Step-by-step proof

  1. cat benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on the current branch's view of main shows lines 5–6 saying the recipe "uses the text-only language-model path and lets vLLM select the MoE backend."
  2. grep -E 'VLLM_ROCM_USE_AITER|moe-backend' benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh returns nothing.
  3. git log --all --oneline | grep -i 1954 returns nothing, so #1954 ([AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP) has not landed.
  4. The PR description's own test plan says: "Apply full-sweep-enabled (or full-sweep-fail-fast) after #1954 lands." That acknowledges the ordering dependency exists.
  5. Therefore, if this PR's diff is merged without #1954, a reader opening minimaxm3_fp4_mi355x_vllm_mtp.sh and following the line-8 pointer to minimaxm3_fp4_mi355x_vllm.sh will find a file whose header comment says vLLM chooses the backend, directly contradicting the "mirrors … AITER MoE" claim.
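
The grep check from step 2 can be wrapped in a small helper so it is easy to rerun after either PR lands. This is a minimal sketch only: the function name recipe_uses_aiter_moe and the messages are illustrative and do not exist in the repo, and the pattern is simply the one quoted in step 2.

```shell
#!/usr/bin/env bash
# Sketch: report whether a recipe script configures the AITER MoE path.
# recipe_uses_aiter_moe is an illustrative helper, not part of the repo.
recipe_uses_aiter_moe() {
    local recipe="$1"
    # Same pattern as step 2: any AITER env var or a --moe-backend flag.
    # -q suppresses output; the exit status carries the answer.
    grep -Eq 'VLLM_ROCM_USE_AITER|moe-backend' "$recipe"
}

# Run the check only when a recipe path is passed on the command line.
if [ "$#" -gt 0 ]; then
    if recipe_uses_aiter_moe "$1"; then
        echo "$1: sets AITER MoE knobs"
    else
        echo "$1: no AITER MoE knobs found"
    fi
fi
```

Pointing the script at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh would show whether the cross-reference on line 8 holds for whatever commit is checked out.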

Addressing the refutation

The refutation argues this is (a) a known dependency, (b) flagged in the test plan, (c) a nit, and (d) forward-looking design language is normal for paired PRs. Points (a)–(c) are accurate and are exactly why this is filed at nit severity, not blocking. The remaining concern is narrow: the comment's surface reading is a present-tense factual claim ("mirrors"), and merge order in this repo is not actually pinned — #1955 can land before, after, or instead of #1954. If #1954 is rebased, re-scoped, or abandoned, this comment ships indefinitely as a dangling cross-reference. The refutation's framing of "a comment that's true after the next-PR-in-the-stack merges" assumes a merge order that the PR description hopes for but does not enforce.

Impact

Documentation-only; no runtime effect. The AITER env vars and --moe-backend aiter flag in this file are self-contained and correct on their own. The risk is purely reader confusion if/when they trace the cross-reference and find the STP recipe in a state that contradicts the comment.

How to fix

One of three trivial options:

This is in-scope to flag because the comment is newly added by this PR, the fix is one-line, and option 1 makes the comment robust to any merge order without coupling the two PRs.


source "$(dirname "$0")/../../benchmark_lib.sh"

@@ -36,6 +37,9 @@ fi
SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MOE=1
+export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
@@ -65,6 +69,7 @@ vllm serve "$MODEL" --port "$PORT" \
--language-model-only \
--max-model-len "$MAX_MODEL_LEN" \
--attention-backend TRITON_ATTN \
+--moe-backend aiter \
--speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--tool-call-parser minimax_m3 \
--enable-auto-tool-choice \
@@ -61,6 +61,10 @@ export VLLM_ENGINE_READY_TIMEOUT_S=3600
# Run with CUDA graphs (no --enforce-eager): VLLM_USE_BREAKABLE_CUDAGRAPH=0
# avoids the M3-decode breakable-cudagraph path that previously forced eager.
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
+export VLLM_ROCM_USE_AITER=1

Collaborator

Will need to set VLLM_ROCM_USE_AITER=0 when enabling EP.
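
One way this suggestion might look in the recipe is a guard around the export. This is a sketch under assumptions, not the PR's code: select_aiter_for_ep and the EP_SIZE variable are hypothetical names, and only VLLM_ROCM_USE_AITER is toggled because that is the only variable the comment mentions.

```shell
#!/usr/bin/env bash
# Sketch: choose VLLM_ROCM_USE_AITER from the expert-parallel degree.
# select_aiter_for_ep and EP_SIZE are hypothetical names; the repo may
# expose the EP setting differently.
select_aiter_for_ep() {
    local ep_size="${1:-1}"
    if [ "$ep_size" -gt 1 ]; then
        # EP enabled: turn AITER off, as the review comment asks.
        export VLLM_ROCM_USE_AITER=0
    else
        # No EP: keep the value this diff exports.
        export VLLM_ROCM_USE_AITER=1
    fi
}

# Default to 1 (no expert parallelism) when EP_SIZE is unset.
select_aiter_for_ep "${EP_SIZE:-1}"
```

Whether the other AITER variables or --moe-backend aiter would also need to change under EP is not stated in the thread, so the sketch leaves them alone.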

+export VLLM_ROCM_USE_AITER_MOE=1
+export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
+export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
@@ -176,7 +180,9 @@ vllm serve "$MODEL" --port "$PORT" \
--language-model-only \
--max-model-len "$MAX_MODEL_LEN" \
--kv-cache-dtype fp8 \
+--linear-backend emulation \
--attention-backend TRITON_ATTN \
+--moe-backend aiter \
Comment thread
functionstackx marked this conversation as resolved.
--speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4316,3 +4316,12 @@
description:
- "Update the DeepSeek-V4-Pro B300 disaggregated Dynamo-vLLM benchmark to the vllm/vllm-openai:v0.23.0 image"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1952

+- config-keys:
+- minimaxm3-fp4-mi355x-vllm-mtp
+- minimaxm3-fp8-mi355x-vllm-mtp
+description:
+- "Enable AITER MoE on MiniMax-M3 MI355X single-node vLLM EAGLE3 MTP benchmarks (MXFP4 and MXFP8): export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter."
+- "MXFP8 MTP also exports VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6 and passes --linear-backend emulation. Mirrors the STP AITER MoE knobs from #1954 with three Inferact/MiniMax-M3-EAGLE3 speculative tokens and --use-chat-template for serving."
+- "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e) on all four MiniMax-M3 MI355X single-node vLLM configs."
+pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1955