2 changes: 1 addition & 1 deletion .github/configs/amd-master.yaml
@@ -1952,7 +1952,7 @@ dsv4-fp4-mi355x-sglang-mtp:
# gpu-mem-util=0.6. TP8 sweeps conc 4-64; DEP8 has a single conc=64
# probe to validate the ROCm DP+EP path.
dsv4-fp4-mi355x-vllm:
-image: vllm/vllm-openai-rocm:v0.22.0
+image: vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
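
The image bump above replaces a release tag with a nightly tag that embeds a full commit SHA. A minimal sketch of a check that a tag has that form follows; the helper name `is_sha_pinned_nightly` is hypothetical and not part of this PR, and the `nightly-<40 hex chars>` shape is inferred only from the tag shown in this diff.

```bash
# Hypothetical helper (not part of the PR): succeed only when the image tag
# looks like "nightly-" followed by a 40-character lowercase hex commit SHA.
is_sha_pinned_nightly() {
    tag="${1##*:}"            # text after the last ':' in the image reference
    sha="${tag#nightly-}"     # strip the "nightly-" prefix if present
    [ "$sha" != "$tag" ] || return 1      # prefix was missing
    [ "${#sha}" -eq 40 ] || return 1      # a full SHA-1 is 40 hex characters
    case "$sha" in
        *[!0-9a-f]*) return 1 ;;          # reject any non-hex character
    esac
    return 0
}
```

A config lint could call `is_sha_pinned_nightly "$IMAGE"` to flag floating tags before a benchmark run; whether the repository wants such a check is left open here.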
13 changes: 7 additions & 6 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh
@@ -12,11 +12,11 @@ set -eo pipefail
# same ROCm recipe while switching parallelism to vLLM's DP+EP form.
# Image-pin details live in amd-master.yaml.
#
-# --moe-backend triton_unfused is required for the FP4 MoE expert
-# weight format used by deepseek-ai/DeepSeek-V4-Pro. Letting --moe-backend
-# default to auto picks a backend that doesn't register the FP4 scale
-# parameters (w13_weight_scale / w2_weight_scale), so safetensors
-# loading raises KeyError.
+# Use the AITER MoE backend (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter)
+# for the FP4 MoE expert weights of deepseek-ai/DeepSeek-V4-Pro. The AITER
+# MXFP4 path registers the FP4 scale parameters (w13_weight_scale /
+# w2_weight_scale), so safetensors loads correctly and decode runs on the
+# fused AITER experts instead of triton_unfused.
#
# --compilation-config mode=3 with FULL_AND_PIECEWISE cudagraph mode
# enables full CUDA graph capture for improved throughput on MI355X.
@@ -45,6 +45,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
fi

export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MOE=1

SERVER_LOG=/workspace/server.log

@@ -75,7 +76,7 @@ vllm serve $MODEL --port $PORT \
--gpu-memory-utilization 0.8 \
--kv-cache-dtype fp8 \
--trust-remote-code \
---moe-backend triton_unfused \
+--moe-backend aiter \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}' > $SERVER_LOG 2>&1 &
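
The script comment pairs `--moe-backend aiter` with `VLLM_ROCM_USE_AITER_MOE=1`, and the script already exports `VLLM_ROCM_USE_AITER=1`. A minimal sketch of a preflight guard for that pairing follows. The function `check_aiter_moe_env` is hypothetical and not part of this PR, and treating both variables as hard requirements is an assumption drawn from the comment, not from vLLM documentation.

```bash
# Hypothetical preflight guard (not part of the PR). Assumes, per the script
# comment, that --moe-backend aiter is meant to be used together with
# VLLM_ROCM_USE_AITER=1 and VLLM_ROCM_USE_AITER_MOE=1.
check_aiter_moe_env() {
    # Other backends are not checked by this sketch.
    if [ "$1" != "aiter" ]; then
        return 0
    fi
    if [ "${VLLM_ROCM_USE_AITER:-0}" != "1" ] ||
       [ "${VLLM_ROCM_USE_AITER_MOE:-0}" != "1" ]; then
        echo "moe-backend=aiter expects VLLM_ROCM_USE_AITER=1 and VLLM_ROCM_USE_AITER_MOE=1" >&2
        return 1
    fi
    return 0
}
```

Because the script runs under `set -eo pipefail`, a bare `check_aiter_moe_env aiter` placed before `vllm serve` would abort the run on a mismatch instead of letting the server start with an unintended configuration.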
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4407,3 +4407,12 @@
description:
- "Bump SGLang image from lmsysorg/sglang:deepseek-v4-blackwell (digest sha256:df18bfc4...) to mainline nightly lmsysorg/sglang:nightly-dev-cu13-20260628-da802ddc."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1923

+- config-keys:
+- dsv4-fp4-mi355x-vllm
+description:
+- "Bump DeepSeek-V4-Pro FP4 MI355X single-node vLLM STP image from vllm/vllm-openai-rocm:v0.22.0 to the latest nightly vllm/vllm-openai-rocm:nightly-09663abde0f50944a8d5ea30120666024b503faa."
+- "The nightly enables two-stage attention kernels (split-KV decode), which reduce decode attention latency across all concurrency levels."
+- "Employ the AITER MLA attention backend for the DeepSeek-V4 MLA path."
+- "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
+pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1980