[DNM][AMD] agentX benchmark (v1.0) by seungrokj · Pull Request #1996 · SemiAnalysisAI/InferenceX

seungrokj · 2026-07-03T01:49:27Z

Summary

Single node

deepseek-ai/DeepSeek-V4-Pro
amd/MiniMax-M3-MXFP4
amd/Kimi-K2.7-Code-MXFP4
amd/Qwen3.5-397B-A17B-MXFP4

PD

deepseek-ai/DeepSeek-V4-Pro
amd/MiniMax-M3-MXFP4
amd/Kimi-K2.7-Code-MXFP4
amd/Qwen3.5-397B-A17B-MXFP4

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-07-03T01:49:34Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

claude · 2026-07-03T01:57:17Z

+# vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while
+# vllm-project/vllm#44774 applies the same reachability policy to Mooncake's
+# store mask. 32k matches the trace-replay tuning validated for this workload.
+export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768
+
+# VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba
+# models; this vLLM build raises ValueError if it is set for DSv4.


🔴 Lines 79-82 justify and unconditionally export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768, but the immediately following comment on lines 84-85 states: "this vLLM build raises ValueError if it is set for DSv4." Since this benchmark serves DeepSeek-V4-Pro (which is not SWA/Mamba), if the warning is accurate the vLLM engine fails at startup with ValueError and every OFFLOADING mode (none/cpu/lmcache) aborts before the server becomes ready. This warning is unique to the MI355X script — the B200/B300 vLLM siblings export the same variable without the warning — suggesting the AMD ROCm vLLM build pinned in amd-master.yaml behaves differently. Please resolve the contradiction: either remove the export on line 82, or remove the stale warning on lines 84-85.

Extended reasoning...

What the bug is

The script contains two adjacent statements that cannot both be true:

# vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while # vllm-project/vllm#44774 applies the same reachability policy to Mooncake's # store mask. 32k matches the trace-replay tuning validated for this workload. export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 # VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba # models; this vLLM build raises ValueError if it is set for DSv4.

Lines 79-81 justify setting the variable, line 82 does the export unconditionally, and lines 84-85 warn that this exact export will crash the engine with ValueError because DSv4 is not a sliding-window/Mamba model. The two statements directly contradict each other in the same PR.

Why this matters — step-by-step proof of impact

check_env_vars runs at the top of the script; it does not gate on MODEL.

Line 82 unconditionally runs export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 before any conditional path, before the OFFLOADING case-switch, and before the vLLM server is started.

Later, vllm serve "$MODEL_PATH" is invoked with DeepSeek-V4-Pro (--tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4) — confirming the served model is DSv4, which is not an SWA/Mamba model.

Per the author's own annotation on lines 84-85, this vLLM build will raise ValueError during engine initialization when the env var is set for DSv4.

wait_for_server_ready then blocks until the server becomes healthy — which it never does because the engine crashed. The benchmark aborts before any request is served.

The effect is that every invocation of this benchmark (across all three OFFLOADING modes: none, cpu, lmcache; across all three parallelism modes: pure TP, TP+EP, DEP) fails at startup if the warning comment is accurate.

Why the existing code doesn't prevent this

There is no if [[ "$MODEL" == *swa* ]] / if [[ "$MODEL" != *deepseek_v4* ]] guard around the export. Even if such a guard existed elsewhere, none is present in this script or in benchmark_lib.sh for this variable. The export is a plain unconditional statement.

Cross-referencing sibling scripts

Searching the repo:

dsv4_fp4_b200_vllm.sh line 81: export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 — no warning comment.

dsv4_fp4_b300_vllm.sh line 85: export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 — no warning comment.

dsv4_fp4_mi355x_vllm.sh line 82: same export, followed by the ValueError warning.

The pattern (comment appears only on the MI355X copy) strongly suggests the author discovered that the ROCm vLLM build pinned in amd-master.yaml handles this env var differently from the NVIDIA CUDA vLLM builds used by the B200/B300 scripts — added the warning to capture that finding — but forgot to remove or gate the export itself.

How to fix

Either of the following resolves the contradiction:

Option A (if the warning is accurate — likely, given it's a first-person authorial claim about observed behavior on this specific build): delete line 82 entirely. The variable becomes unset for DSv4 on MI355X, matching the reason stated in the warning.

Option B (if the warning is stale — e.g., the AMD build has since been patched to accept the flag): delete lines 84-85. The export remains, matching the B200/B300 siblings.

Option A is the safer default: the warning is the newer, more specific annotation, and if it's correct the benchmark cannot run at all. Option B is only correct if the author has since re-validated that the ROCm build accepts the env var for DSv4.

claude · 2026-07-03T01:57:19Z

+cleanup_lmcache_server() {
+    if [[ -n "$LMCACHE_PID" ]] && kill -0 "$LMCACHE_PID" 2>/dev/null; then
+        kill "$LMCACHE_PID" 2>/dev/null || true
+        wait "$LMCACHE_PID" 2>/dev/null || true
+    fi
+}
+
+trap cleanup_lmcache_server EXIT
+
+cleanup_agentic_services() {
+    local exit_code=$?
+    trap - EXIT INT TERM
+    set +e
+    stop_background_process_tree "$ROUTER_PID" "vLLM router"
+    stop_background_process_tree "$SERVER_PID" "vLLM server" 60
+    stop_background_process_tree "$MOONCAKE_MASTER_PID" "Mooncake master"
+    exit "$exit_code"
+}
+trap cleanup_agentic_services EXIT
+trap 'exit 130' INT
+trap 'exit 143' TERM


🔴 Cleanup is broken in two ways. (1) trap cleanup_agentic_services EXIT on line 117 silently replaces trap cleanup_lmcache_server EXIT on line 106 (bash EXIT traps do not stack), so cleanup_lmcache_server is dead code — with OFFLOADING=lmcache, LMCACHE_PID is never killed on exit and holds ports 5555/8080, wedging follow-on runs on the same node. (2) cleanup_agentic_services (lines 108–116) references $ROUTER_PID, $SERVER_PID, $MOONCAKE_MASTER_PID under set -u, but the trap is registered at line 117 before those vars are ever assigned. set +e inside the handler does not disable nounset, so any early failure (Mooncake git-clone/cmake/make, LMCache install, rocm-smi) aborts cleanup on the first stop_background_process_tree "$ROUTER_PID" and leaks whichever background processes are up. Fix by initializing all four PIDs (ROUTER_PID, SERVER_PID, MOONCAKE_MASTER_PID, LMCACHE_PID) to "" up front — as the sibling dsv4_fp4_b300_vllm.sh does at lines 93–95 — collapsing to a single EXIT trap that also stops LMCACHE_PID.

Extended reasoning...

Bug 1: Second EXIT trap overrides the first, leaking LMCache

Bash EXIT traps assign, they do not chain. trap CMD EXIT replaces any previously-registered EXIT handler. In this script:

Line 106: trap cleanup_lmcache_server EXIT — registers LMCache cleanup.

Line 117: trap cleanup_agentic_services EXIT — immediately replaces the LMCache cleanup.

After line 117, cleanup_lmcache_server is unreachable — no other code path calls it, and cleanup_agentic_services (lines 108–116) only stops ROUTER_PID, SERVER_PID, and MOONCAKE_MASTER_PID. It never touches LMCACHE_PID.

LMCACHE_PID refers to the standalone lmcache server process started around line 240 ("${LMCACHE_CMD[@]}" > "$LMCACHE_LOG" 2>&1 &). That is an independent background process — it is not part of SERVER_PID's process tree — so stop_background_process_tree "$SERVER_PID" cannot reach it either. In non-interactive shells (SLURM, CI, nohup) backgrounded & children do not inherit SIGHUP from the parent, so the process persists across script exit.

Consequence: with OFFLOADING=lmcache, on any exit (success, error, INT, TERM) the LMCache MP server keeps running and holds LMCACHE_PORT (default 5555, ZMQ) and LMCACHE_HTTP_PORT (default 8080). A follow-on lmcache run on the same node will fail its bind.

Bug 2: Unbound PID vars kill the cleanup handler under set -u

Line 2 sets set -euo pipefail. cleanup_agentic_services does set +e (line 111) but not set +u — nounset is unaffected by set +e. The handler is registered at line 117, but:

SERVER_PID is only assigned around line 365 (after "${VLLM_CMD[@]}" ... &).

ROUTER_PID is only assigned around line 382, and only in the DP-attention branch.

MOONCAKE_MASTER_PID is only assigned inside the cpu case (around line 194).

So any failure between line 117 and those assignments causes the trap to fire while those vars are unbound. Bash expands "$ROUTER_PID" at the caller site before invoking stop_background_process_tree, so the callee's defensive ${1:-} cannot help — the expansion has already failed.

Step-by-step proof — OFFLOADING=cpu failing in Mooncake build

Script starts, set -euo pipefail active.

Line 106 registers cleanup_lmcache_server as EXIT. Line 117 replaces it with cleanup_agentic_services.

Case branch cpu runs: git clone, bash dependencies.sh, cmake .., make -j — suppose make -j fails on a fresh node.

errexit triggers → EXIT trap runs.

Handler runs local exit_code=$?, trap - EXIT INT TERM, set +e. set -u is still active.

First substantive line: stop_background_process_tree "$ROUTER_PID" "vLLM router". Bash tries to expand $ROUTER_PID, which was never assigned.

bash: ROUTER_PID: unbound variable → shell aborts.

Nothing is cleaned up. If MOONCAKE_MASTER_PID had already been assigned (say the failure was later, in sudo make install), the Mooncake master is orphaned.

Step-by-step proof — OFFLOADING=lmcache, normal exit

Script starts. Trap registrations run the same way — line 117 replaces line 106.

lmcache case installs LMCache and starts the MP server: LMCACHE_PID=$!. Port 5555/8080 are bound.

vLLM starts, benchmark runs, script reaches end successfully.

EXIT fires → cleanup_agentic_services runs. set +e, then stops ROUTER_PID, SERVER_PID, MOONCAKE_MASTER_PID. LMCACHE_PID is never referenced.

Script exits. LMCache MP server continues running, holding ports 5555 and 8080.

Next sweep-triggered run on the same node fails at LMCache bind.

Why the reference recipe already handles this

benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh lines 93–95 initialize SERVER_PID="", ROUTER_PID="", MOONCAKE_MASTER_PID="" at the top of the script — exactly to prevent the unbound-var crash. This new MI355X copy has LMCACHE_PID="" (line 97) but omits the other three. The strong signal is that the project has already been bitten by this and guards against it in the sibling script; the guard was dropped here.

Fix

At the top of the script (before the trap registrations at ~line 106):

SERVER_PID="" ROUTER_PID="" MOONCAKE_MASTER_PID="" LMCACHE_PID=""

And fold cleanup_lmcache_server into cleanup_agentic_services (removing the line 106 trap), adding stop_background_process_tree "$LMCACHE_PID" "LMCache server" alongside the other stops. This gives a single EXIT trap that safely handles every early-exit path and every offloading mode.

claude · 2026-07-03T01:57:20Z

+# (srok)
+#export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600
+export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60
+export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200
+# (srok)


🟡 Lines 57-58 have two back-to-back unconditional exports of AIPERF_AGENTIC_CACHE_WARMUP_DURATION — first =60, then immediately =1200. Bash last-write-wins, so the =60 is dead code and only 1200 takes effect; combined with the commented-out =600 and (srok) sentinels this looks like un-cleaned experimental scratch. Delete the dead export and keep a single intentional value (a comment noting why 1200 was picked over the 600 used in the sibling dsv4_fp4_b300_vllm.sh would help).

Extended reasoning...

The new script benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh contains this block near the top of the environment-setup section (lines 55-59):

# (srok) #export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600 export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60 export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200 # (srok)

All three lines target the same variable. The first is a commented-out reference value (600). The next two are both live, unconditional exports — first 60, then 1200. In bash, export FOO=x; export FOO=y is not additive: the second assignment simply overwrites the first in the process's environment. Nothing between them mutates or reads the variable, so no subprocess ever observes 60. Only 1200 is exported when the script continues.

Step-by-step proof. After line 57, the shell env contains AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60. Line 58 (export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200) executes immediately after. export VAR=value in bash performs (a) assignment to the shell variable, then (b) marking it exported; both operations replace any prior value. No fork, no subshell, no conditional guards this line. So at line 59 the value is unambiguously 1200, and every downstream vllm serve / aiperf child inherits 1200. The =60 export is unreachable dead code.

Why this is worth flagging. The (srok) markers on lines 55 and 59 wrap the block like a debug sentinel, matching the pattern used elsewhere in the same file (e.g. the commented-out Mooncake RDMA config block, MC_WORKERS_PER_CTX=4, and the commented --attention_config.use_fp4_indexer_cache=True). This is characteristic of experimental scratchwork that got merged rather than cleaned up. The sibling dsv4_fp4_b300_vllm.sh recipe uses a single value (600), so it is not obvious to a reviewer whether the intended warmup for MI355X is 60 (early experiment), 600 (sibling default), or 1200 (final tuning) — leaving all three in the file forces a reader to figure out which one bash chose.

Impact. Purely code hygiene. Runtime behavior is deterministic (1200 always wins), so the sweep will produce well-defined data. Nothing crashes or misconfigures — the only cost is reviewer/maintainer confusion later when someone tries to reason about warmup timing or diff MI355X vs B300 recipes.

How to fix. Delete lines 55, 57 (the commented =600 and the shadowed =60) and keep a single export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200 (or whichever value is actually intended). A one-line comment such as # 20 min warmup: DSv4 FP4 warmup is slower than sibling B300 which uses 600 would document the divergence from dsv4_fp4_b300_vllm.sh. While cleaning this up, the two other (srok)-marked debug blocks in this file (the commented Mooncake RDMA transport config near line 189 and MC_WORKERS_PER_CTX=4 near line 205, and the --attention_config.use_fp4_indexer_cache=True near line 371) are worth removing too, but that is outside the scope of this specific finding.

claude · 2026-07-03T01:57:22Z

+    *)
+        echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)" >&2
+        exit 1
+        ;;


🟡 The catch-all error at line 307 says (expected one of: none, cpu) but the case statement above accepts three values — none, cpu, and lmcache (the header docstring at lines 25-28 already lists all three). A user who mistypes OFFLOADING will be told lmcache isn't supported when it is. Update the message to (expected one of: none, cpu, lmcache).

Extended reasoning...

What the bug is

The OFFLOADING variable has three valid values: none, cpu, and lmcache. The header docstring on lines 25–28 explicitly documents all three, and the case "$OFFLOADING" in block at lines 158–308 has three matching arms (none) at line 160, cpu) at line 162, lmcache) at line 238). However, the fallthrough *) branch on line 307 emits:

Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)

lmcache is missing from the list.

How it manifests

Any user (or CI config) that mistypes OFFLOADING — e.g. lmchache, Lmcache, or an empty string — will fall through to this branch and be told the recipe only supports none and cpu. In reality the third arm above is fully wired: it clones LMCache, starts an MP server, waits for its healthcheck, and configures --kv-transfer-config with LMCacheMPConnector. So the diagnostic actively misleads users away from a supported mode.

Why existing code doesn't prevent it

There is no other validation layer for OFFLOADING. check_env_vars only asserts the variable is set, not that its value is one of the accepted enum values. The case statement is the sole validator, and its fallthrough diagnostic is the only feedback a user gets on a bad input.

Step-by-step proof

User sets OFFLOADING=lmchache (typo of lmcache).

check_env_vars passes because the variable is non-empty.

Execution reaches the case "$OFFLOADING" in at line 158.

lmchache matches neither none), cpu), nor lmcache), so it falls through to *).

The script prints Error: unsupported OFFLOADING value 'lmchache' (expected one of: none, cpu) and exits 1.

The user, believing lmcache is not supported, either abandons the mode or files an issue — even though a one-character fix would have made their run succeed.

Impact

Purely cosmetic — no runtime path is affected, and all three valid modes still work. The impact is bounded to user experience when someone mistypes the variable. Reserve severity is nit.

Fix

Change line 307 to:

echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu, lmcache)" >&2

One-word addition, keeps the message in sync with both the case statement and the header docstring.

feat: add agentic benchmark for DeepSeek-V4-Pro FP4 MI355X vLLM

61f93d1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

seungrokj requested a review from a team July 3, 2026 01:49

github-project-automation Bot added this to InferenceMAX Board Jul 3, 2026

seungrokj changed the title ~~feat: add agentic benchmark for DeepSeek-V4-Pro FP4 MI355X vLLM~~ [DNM][AMD] agentX benchmark (v1.0) Jul 3, 2026

claude Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DNM][AMD] agentX benchmark (v1.0) #1996

[DNM][AMD] agentX benchmark (v1.0) #1996
seungrokj wants to merge 1 commit into
mainfrom
amd/agentx-v1.0

seungrokj commented Jul 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

claude Bot Jul 3, 2026

Uh oh!

claude Bot Jul 3, 2026

Uh oh!

claude Bot Jul 3, 2026

Uh oh!

claude Bot Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

seungrokj commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Single node

PD

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

What the bug is

Why this matters — step-by-step proof of impact

Why the existing code doesn't prevent this

Cross-referencing sibling scripts

How to fix

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Bug 1: Second EXIT trap overrides the first, leaking LMCache

Bug 2: Unbound PID vars kill the cleanup handler under set -u

Step-by-step proof — OFFLOADING=cpu failing in Mooncake build

Step-by-step proof — OFFLOADING=lmcache, normal exit

Why the reference recipe already handles this

Fix

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

seungrokj commented Jul 3, 2026 •

edited

Loading

Bug 2: Unbound PID vars kill the cleanup handler under `set -u`