[DNM][AMD] agentX benchmark (v1.0) #1996
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
| # vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while | ||
| # vllm-project/vllm#44774 applies the same reachability policy to Mooncake's | ||
| # store mask. 32k matches the trace-replay tuning validated for this workload. | ||
| export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 | ||
|
|
||
| # VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba | ||
| # models; this vLLM build raises ValueError if it is set for DSv4. |
There was a problem hiding this comment.
🔴 Lines 79-82 justify and unconditionally export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768, but the immediately following comment on lines 84-85 states: "this vLLM build raises ValueError if it is set for DSv4." Since this benchmark serves DeepSeek-V4-Pro (which is not SWA/Mamba), if the warning is accurate the vLLM engine fails at startup with ValueError and every OFFLOADING mode (none/cpu/lmcache) aborts before the server becomes ready. This warning is unique to the MI355X script — the B200/B300 vLLM siblings export the same variable without the warning — suggesting the AMD ROCm vLLM build pinned in amd-master.yaml behaves differently. Please resolve the contradiction: either remove the export on line 82, or remove the stale warning on lines 84-85.
Extended reasoning...
What the bug is
The script contains two adjacent statements that cannot both be true:
# vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while
# vllm-project/vllm#44774 applies the same reachability policy to Mooncake's
# store mask. 32k matches the trace-replay tuning validated for this workload.
export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768
# VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba
# models; this vLLM build raises ValueError if it is set for DSv4.Lines 79-81 justify setting the variable, line 82 does the export unconditionally, and lines 84-85 warn that this exact export will crash the engine with ValueError because DSv4 is not a sliding-window/Mamba model. The two statements directly contradict each other in the same PR.
Why this matters — step-by-step proof of impact
check_env_varsruns at the top of the script; it does not gate onMODEL.- Line 82 unconditionally runs
export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768before any conditional path, before the OFFLOADING case-switch, and before the vLLM server is started. - Later,
vllm serve "$MODEL_PATH"is invoked with DeepSeek-V4-Pro (--tokenizer-mode deepseek_v4,--reasoning-parser deepseek_v4) — confirming the served model is DSv4, which is not an SWA/Mamba model. - Per the author's own annotation on lines 84-85, this vLLM build will raise
ValueErrorduring engine initialization when the env var is set for DSv4. wait_for_server_readythen blocks until the server becomes healthy — which it never does because the engine crashed. The benchmark aborts before any request is served.
The effect is that every invocation of this benchmark (across all three OFFLOADING modes: none, cpu, lmcache; across all three parallelism modes: pure TP, TP+EP, DEP) fails at startup if the warning comment is accurate.
Why the existing code doesn't prevent this
There is no if [[ "$MODEL" == *swa* ]] / if [[ "$MODEL" != *deepseek_v4* ]] guard around the export. Even if such a guard existed elsewhere, none is present in this script or in benchmark_lib.sh for this variable. The export is a plain unconditional statement.
Cross-referencing sibling scripts
Searching the repo:
dsv4_fp4_b200_vllm.shline 81:export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768— no warning comment.dsv4_fp4_b300_vllm.shline 85:export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768— no warning comment.dsv4_fp4_mi355x_vllm.shline 82: same export, followed by the ValueError warning.
The pattern (comment appears only on the MI355X copy) strongly suggests the author discovered that the ROCm vLLM build pinned in amd-master.yaml handles this env var differently from the NVIDIA CUDA vLLM builds used by the B200/B300 scripts — added the warning to capture that finding — but forgot to remove or gate the export itself.
How to fix
Either of the following resolves the contradiction:
Option A (if the warning is accurate — likely, given it's a first-person authorial claim about observed behavior on this specific build): delete line 82 entirely. The variable becomes unset for DSv4 on MI355X, matching the reason stated in the warning.
Option B (if the warning is stale — e.g., the AMD build has since been patched to accept the flag): delete lines 84-85. The export remains, matching the B200/B300 siblings.
Option A is the safer default: the warning is the newer, more specific annotation, and if it's correct the benchmark cannot run at all. Option B is only correct if the author has since re-validated that the ROCm build accepts the env var for DSv4.
| cleanup_lmcache_server() { | ||
| if [[ -n "$LMCACHE_PID" ]] && kill -0 "$LMCACHE_PID" 2>/dev/null; then | ||
| kill "$LMCACHE_PID" 2>/dev/null || true | ||
| wait "$LMCACHE_PID" 2>/dev/null || true | ||
| fi | ||
| } | ||
|
|
||
| trap cleanup_lmcache_server EXIT | ||
|
|
||
| cleanup_agentic_services() { | ||
| local exit_code=$? | ||
| trap - EXIT INT TERM | ||
| set +e | ||
| stop_background_process_tree "$ROUTER_PID" "vLLM router" | ||
| stop_background_process_tree "$SERVER_PID" "vLLM server" 60 | ||
| stop_background_process_tree "$MOONCAKE_MASTER_PID" "Mooncake master" | ||
| exit "$exit_code" | ||
| } | ||
| trap cleanup_agentic_services EXIT | ||
| trap 'exit 130' INT | ||
| trap 'exit 143' TERM |
There was a problem hiding this comment.
🔴 Cleanup is broken in two ways. (1) trap cleanup_agentic_services EXIT on line 117 silently replaces trap cleanup_lmcache_server EXIT on line 106 (bash EXIT traps do not stack), so cleanup_lmcache_server is dead code — with OFFLOADING=lmcache, LMCACHE_PID is never killed on exit and holds ports 5555/8080, wedging follow-on runs on the same node. (2) cleanup_agentic_services (lines 108–116) references $ROUTER_PID, $SERVER_PID, $MOONCAKE_MASTER_PID under set -u, but the trap is registered at line 117 before those vars are ever assigned. set +e inside the handler does not disable nounset, so any early failure (Mooncake git-clone/cmake/make, LMCache install, rocm-smi) aborts cleanup on the first stop_background_process_tree "$ROUTER_PID" and leaks whichever background processes are up. Fix by initializing all four PIDs (ROUTER_PID, SERVER_PID, MOONCAKE_MASTER_PID, LMCACHE_PID) to "" up front — as the sibling dsv4_fp4_b300_vllm.sh does at lines 93–95 — collapsing to a single EXIT trap that also stops LMCACHE_PID.
Extended reasoning...
Bug 1: Second EXIT trap overrides the first, leaking LMCache
Bash EXIT traps assign, they do not chain. trap CMD EXIT replaces any previously-registered EXIT handler. In this script:
- Line 106:
trap cleanup_lmcache_server EXIT— registers LMCache cleanup. - Line 117:
trap cleanup_agentic_services EXIT— immediately replaces the LMCache cleanup.
After line 117, cleanup_lmcache_server is unreachable — no other code path calls it, and cleanup_agentic_services (lines 108–116) only stops ROUTER_PID, SERVER_PID, and MOONCAKE_MASTER_PID. It never touches LMCACHE_PID.
LMCACHE_PID refers to the standalone lmcache server process started around line 240 ("${LMCACHE_CMD[@]}" > "$LMCACHE_LOG" 2>&1 &). That is an independent background process — it is not part of SERVER_PID's process tree — so stop_background_process_tree "$SERVER_PID" cannot reach it either. In non-interactive shells (SLURM, CI, nohup) backgrounded & children do not inherit SIGHUP from the parent, so the process persists across script exit.
Consequence: with OFFLOADING=lmcache, on any exit (success, error, INT, TERM) the LMCache MP server keeps running and holds LMCACHE_PORT (default 5555, ZMQ) and LMCACHE_HTTP_PORT (default 8080). A follow-on lmcache run on the same node will fail its bind.
Bug 2: Unbound PID vars kill the cleanup handler under set -u
Line 2 sets set -euo pipefail. cleanup_agentic_services does set +e (line 111) but not set +u — nounset is unaffected by set +e. The handler is registered at line 117, but:
SERVER_PIDis only assigned around line 365 (after"${VLLM_CMD[@]}" ... &).ROUTER_PIDis only assigned around line 382, and only in the DP-attention branch.MOONCAKE_MASTER_PIDis only assigned inside thecpucase (around line 194).
So any failure between line 117 and those assignments causes the trap to fire while those vars are unbound. Bash expands "$ROUTER_PID" at the caller site before invoking stop_background_process_tree, so the callee's defensive ${1:-} cannot help — the expansion has already failed.
Step-by-step proof — OFFLOADING=cpu failing in Mooncake build
- Script starts,
set -euo pipefailactive. - Line 106 registers
cleanup_lmcache_serveras EXIT. Line 117 replaces it withcleanup_agentic_services. - Case branch
cpuruns:git clone,bash dependencies.sh,cmake ..,make -j— supposemake -jfails on a fresh node. errexittriggers → EXIT trap runs.- Handler runs
local exit_code=$?,trap - EXIT INT TERM,set +e.set -uis still active. - First substantive line:
stop_background_process_tree "$ROUTER_PID" "vLLM router". Bash tries to expand$ROUTER_PID, which was never assigned. bash: ROUTER_PID: unbound variable→ shell aborts.- Nothing is cleaned up. If
MOONCAKE_MASTER_PIDhad already been assigned (say the failure was later, insudo make install), the Mooncake master is orphaned.
Step-by-step proof — OFFLOADING=lmcache, normal exit
- Script starts. Trap registrations run the same way — line 117 replaces line 106.
lmcachecase installs LMCache and starts the MP server:LMCACHE_PID=$!. Port 5555/8080 are bound.- vLLM starts, benchmark runs, script reaches end successfully.
- EXIT fires →
cleanup_agentic_servicesruns.set +e, then stopsROUTER_PID,SERVER_PID,MOONCAKE_MASTER_PID.LMCACHE_PIDis never referenced. - Script exits. LMCache MP server continues running, holding ports 5555 and 8080.
- Next sweep-triggered run on the same node fails at LMCache bind.
Why the reference recipe already handles this
benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh lines 93–95 initialize SERVER_PID="", ROUTER_PID="", MOONCAKE_MASTER_PID="" at the top of the script — exactly to prevent the unbound-var crash. This new MI355X copy has LMCACHE_PID="" (line 97) but omits the other three. The strong signal is that the project has already been bitten by this and guards against it in the sibling script; the guard was dropped here.
Fix
At the top of the script (before the trap registrations at ~line 106):
SERVER_PID=""
ROUTER_PID=""
MOONCAKE_MASTER_PID=""
LMCACHE_PID=""And fold cleanup_lmcache_server into cleanup_agentic_services (removing the line 106 trap), adding stop_background_process_tree "$LMCACHE_PID" "LMCache server" alongside the other stops. This gives a single EXIT trap that safely handles every early-exit path and every offloading mode.
| # (srok) | ||
| #export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600 | ||
| export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60 | ||
| export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200 | ||
| # (srok) |
There was a problem hiding this comment.
🟡 Lines 57-58 have two back-to-back unconditional exports of AIPERF_AGENTIC_CACHE_WARMUP_DURATION — first =60, then immediately =1200. Bash last-write-wins, so the =60 is dead code and only 1200 takes effect; combined with the commented-out =600 and (srok) sentinels this looks like un-cleaned experimental scratch. Delete the dead export and keep a single intentional value (a comment noting why 1200 was picked over the 600 used in the sibling dsv4_fp4_b300_vllm.sh would help).
Extended reasoning...
The new script benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh contains this block near the top of the environment-setup section (lines 55-59):
# (srok)
#export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200
# (srok)All three lines target the same variable. The first is a commented-out reference value (600). The next two are both live, unconditional exports — first 60, then 1200. In bash, export FOO=x; export FOO=y is not additive: the second assignment simply overwrites the first in the process's environment. Nothing between them mutates or reads the variable, so no subprocess ever observes 60. Only 1200 is exported when the script continues.
Step-by-step proof. After line 57, the shell env contains AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60. Line 58 (export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200) executes immediately after. export VAR=value in bash performs (a) assignment to the shell variable, then (b) marking it exported; both operations replace any prior value. No fork, no subshell, no conditional guards this line. So at line 59 the value is unambiguously 1200, and every downstream vllm serve / aiperf child inherits 1200. The =60 export is unreachable dead code.
Why this is worth flagging. The (srok) markers on lines 55 and 59 wrap the block like a debug sentinel, matching the pattern used elsewhere in the same file (e.g. the commented-out Mooncake RDMA config block, MC_WORKERS_PER_CTX=4, and the commented --attention_config.use_fp4_indexer_cache=True). This is characteristic of experimental scratchwork that got merged rather than cleaned up. The sibling dsv4_fp4_b300_vllm.sh recipe uses a single value (600), so it is not obvious to a reviewer whether the intended warmup for MI355X is 60 (early experiment), 600 (sibling default), or 1200 (final tuning) — leaving all three in the file forces a reader to figure out which one bash chose.
Impact. Purely code hygiene. Runtime behavior is deterministic (1200 always wins), so the sweep will produce well-defined data. Nothing crashes or misconfigures — the only cost is reviewer/maintainer confusion later when someone tries to reason about warmup timing or diff MI355X vs B300 recipes.
How to fix. Delete lines 55, 57 (the commented =600 and the shadowed =60) and keep a single export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200 (or whichever value is actually intended). A one-line comment such as # 20 min warmup: DSv4 FP4 warmup is slower than sibling B300 which uses 600 would document the divergence from dsv4_fp4_b300_vllm.sh. While cleaning this up, the two other (srok)-marked debug blocks in this file (the commented Mooncake RDMA transport config near line 189 and MC_WORKERS_PER_CTX=4 near line 205, and the --attention_config.use_fp4_indexer_cache=True near line 371) are worth removing too, but that is outside the scope of this specific finding.
| *) | ||
| echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)" >&2 | ||
| exit 1 | ||
| ;; |
There was a problem hiding this comment.
🟡 The catch-all error at line 307 says (expected one of: none, cpu) but the case statement above accepts three values — none, cpu, and lmcache (the header docstring at lines 25-28 already lists all three). A user who mistypes OFFLOADING will be told lmcache isn't supported when it is. Update the message to (expected one of: none, cpu, lmcache).
Extended reasoning...
What the bug is
The OFFLOADING variable has three valid values: none, cpu, and lmcache. The header docstring on lines 25–28 explicitly documents all three, and the case "$OFFLOADING" in block at lines 158–308 has three matching arms (none) at line 160, cpu) at line 162, lmcache) at line 238). However, the fallthrough *) branch on line 307 emits:
Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)
lmcache is missing from the list.
How it manifests
Any user (or CI config) that mistypes OFFLOADING — e.g. lmchache, Lmcache, or an empty string — will fall through to this branch and be told the recipe only supports none and cpu. In reality the third arm above is fully wired: it clones LMCache, starts an MP server, waits for its healthcheck, and configures --kv-transfer-config with LMCacheMPConnector. So the diagnostic actively misleads users away from a supported mode.
Why existing code doesn't prevent it
There is no other validation layer for OFFLOADING. check_env_vars only asserts the variable is set, not that its value is one of the accepted enum values. The case statement is the sole validator, and its fallthrough diagnostic is the only feedback a user gets on a bad input.
Step-by-step proof
- User sets
OFFLOADING=lmchache(typo oflmcache). check_env_varspasses because the variable is non-empty.- Execution reaches the
case "$OFFLOADING" inat line 158. lmchachematches neithernone),cpu), norlmcache), so it falls through to*).- The script prints
Error: unsupported OFFLOADING value 'lmchache' (expected one of: none, cpu)and exits 1. - The user, believing
lmcacheis not supported, either abandons the mode or files an issue — even though a one-character fix would have made their run succeed.
Impact
Purely cosmetic — no runtime path is affected, and all three valid modes still work. The impact is bounded to user experience when someone mistypes the variable. Reserve severity is nit.
Fix
Change line 307 to:
echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu, lmcache)" >&2One-word addition, keeps the message in sync with both the case statement and the header docstring.
Summary
Single node
deepseek-ai/DeepSeek-V4-Pro
amd/MiniMax-M3-MXFP4
amd/Kimi-K2.7-Code-MXFP4
amd/Qwen3.5-397B-A17B-MXFP4
PD
deepseek-ai/DeepSeek-V4-Pro
amd/MiniMax-M3-MXFP4
amd/Kimi-K2.7-Code-MXFP4
amd/Qwen3.5-397B-A17B-MXFP4
🤖 Generated with Claude Code