Merged
@@ -12,11 +12,11 @@ set -eo pipefail
# prompts silently regresses the acceptance rate.
#
# All other serving flags mirror the non-MTP MI355X recipe (TP=8,
- # VLLM_ROCM_USE_AITER=1, triton_unfused MoE, FP8 KV cache, mp executor, async
+ # VLLM_ROCM_USE_AITER=1, AITER MoE, FP8 KV cache, mp executor, async
# scheduling, mode=3 FULL_AND_PIECEWISE compilation). See
# dsv4_fp4_mi355x_vllm.sh for per-flag rationale.

- source "$(dirname "$0")/../benchmark_lib.sh"
+ source "$(dirname "$0")/../../benchmark_lib.sh"

check_env_vars \
MODEL \
@@ -40,6 +40,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
fi

export VLLM_ROCM_USE_AITER=1
+ export VLLM_ROCM_USE_AITER_MOE=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
@@ -74,7 +75,7 @@ vllm serve $MODEL --port $PORT \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
- --moe-backend triton_unfused \
+ --moe-backend aiter \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--speculative-config "{\"method\": \"mtp\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
2 changes: 1 addition & 1 deletion configs/amd-master.yaml
@@ -1939,7 +1939,7 @@ dsv4-fp4-mi355x-vllm:
# above ~conc32 (-37% @ conc32). Image reuses the base entry's v0.22.0 ROCm
# build, which already contains the MTP commit.
dsv4-fp4-mi355x-vllm-mtp:
- image: vllm/vllm-openai-rocm:v0.22.0
+ image: vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa

🟡 The block comment immediately above this entry (lines 1978-1980) still reads "Image reuses the base entry's v0.22.0 ROCm build, which already contains the MTP commit." With this bump, the MTP variant is now on a nightly while the base entry dsv4-fp4-mi355x-vllm stays on v0.22.0, so that rationale is stale. Consider replacing those two sentences with a note about the intentional divergence and the new rationale (two-stage attention kernels + AITER MLA) already documented in the PR description and perf-changelog.

Extended reasoning...

What's stale. The trailing sentences of the block comment at .github/configs/amd-master.yaml:1978-1980 claim:

> Image reuses the base entry's v0.22.0 ROCm build, which already contains the MTP commit.

That rationale explained why the two entries could share an image tag. It no longer holds.

Step-by-step proof of the divergence.

1. Base entry dsv4-fp4-mi355x-vllm at line 1955 still pins image: vllm/vllm-openai-rocm:v0.22.0 (unchanged by this PR).
2. This PR changes the MTP variant at line 1982 from vllm/vllm-openai-rocm:v0.22.0 to vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa.
3. Therefore the two image strings now differ, and "reuses the base entry's v0.22.0 ROCm build" is factually wrong.

Why the existing wording will mislead. A future reader landing on this recipe will read the block comment, see "reuses the base entry's v0.22.0 ROCm build," and assume the two entries track the same image; for example, when doing a future bump they might touch only one entry and expect the other to follow. The PR description already spells out the real reason for the bump (nightly enables two-stage attention kernels / split-KV decode and the AITER MLA backend for the DSv4 MLA path), and the perf-changelog entry restates it. That rationale belongs in the inline comment now that the images have diverged.

Impact. Documentation-only: no functional change, sweep behavior is unaffected. Filing as nit since it's worth fixing while the change is fresh (the author has the context right now) but does not need to block merge.

Suggested fix. Replace the trailing two sentences of the comment (roughly lines 1978-1980) with something like:

> Previously reused the base entry's v0.22.0 image; bumped to a nightly to pick up two-stage attention kernels (split-KV decode) and the AITER MLA backend for the DSv4 MLA path. Base entry stays pinned to v0.22.0 intentionally.
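
A small check could catch this kind of silent divergence on future bumps by comparing the image pinned by two entries. The sketch below is hypothetical and not part of this PR: the function names `image_for_entry` and `images_match` are made up for illustration, and it assumes each entry starts at column 0 as `name:` with an indented `image:` line beneath it, which may not match the real file layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print the image value of one
# top-level entry. Assumes the entry header sits at column 0 as "<name>:"
# and that an indented "image:" line follows before the next entry.
image_for_entry() {
  local file=$1 entry=$2
  awk -v key="${entry}:" '
    $0 == key                { inside = 1; next }  # exact match on the entry header
    inside && /^[^ #]/       { exit }              # next top-level key reached: stop
    inside && $1 == "image:" { print $2; exit }    # first image line inside the entry
  ' "$file"
}

# Succeeds (exit status 0) only when both entries pin the same image string.
images_match() {
  local file=$1 first=$2 second=$3
  [ "$(image_for_entry "$file" "$first")" = "$(image_for_entry "$file" "$second")" ]
}
```

Under those assumptions, a call such as `images_match .github/configs/amd-master.yaml dsv4-fp4-mi355x-vllm dsv4-fp4-mi355x-vllm-mtp || echo "images diverge; check the block comment"` would flag the state this PR introduces, which is the cue to update the inline rationale.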

model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4475,3 +4475,12 @@
description:
- "Update SGLang image from lmsysorg/sglang:v0.5.12-cu130 to lmsysorg/sglang:v0.5.14-cu130"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/2054

+ - config-keys:
+ - dsv4-fp4-mi355x-vllm-mtp
+ description:
+ - "Bump DeepSeek-V4-Pro FP4 MI355X single-node vLLM MTP image from vllm/vllm-openai-rocm:v0.22.0 to the latest nightly vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa."
+ - "The nightly enables two-stage attention kernels (split-KV decode), which reduce decode attention latency across all concurrency levels."
+ - "Employ the AITER MLA attention backend for the DeepSeek-V4 MLA path."
+ - "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1981