391 changes: 391 additions & 0 deletions benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh
@@ -0,0 +1,391 @@
#!/usr/bin/env bash
set -euo pipefail
set -x

# Agentic trace replay benchmark for DeepSeek-V4-Pro FP4 on MI355X using vLLM.
# Mirrors the fixed-seq-len parallelism options (pure TP and DEP) so the
# agentic sweep can probe both interactivity and throughput regimes:
# pure TP (DP_ATTENTION=false, EP_SIZE=1): attention TP-sharded across
# all $TP GPUs in a single engine. Lower TPOT, lower batch.
# TP+EP (DP_ATTENTION=false, EP_SIZE>1): attention TP-sharded, MoE
# experts EP-sharded within the TP group.
# DEP (DP_ATTENTION=true, EP_SIZE>1): per-DP-rank attention with
# experts EP-sharded across DP ranks.
# Highest aggregate throughput at large CONC.
#
# Serving flags follow the validated MI355X recipe from
# vllm-project/recipes#433: AITER+AITER_LINEAR, mp executor,
# triton_unfused MoE (required for the FP4 expert format),
# async scheduling, FULL_AND_PIECEWISE cudagraph capture.
# Image is configured in amd-master.yaml.
#
# Required env vars:
# MODEL, TP, CONC, OFFLOADING, TOTAL_CPU_DRAM_GB, RESULT_DIR, DURATION,
# EP_SIZE, DP_ATTENTION
#
# OFFLOADING values:
# none - vLLM GPU KV only.
# cpu - MooncakeStoreConnector with a configured host-memory KV tier.
# lmcache - LMCache MP server + vLLM LMCacheMPConnector.

source "$(dirname "$0")/../../benchmark_lib.sh"

check_env_vars MODEL TP CONC OFFLOADING TOTAL_CPU_DRAM_GB RESULT_DIR DURATION EP_SIZE DP_ATTENTION

if [[ -n "${SLURM_JOB_ID:-}" ]]; then
echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
fi

if [[ -n "${MODEL_PATH:-}" ]]; then
if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
hf download "$MODEL" --local-dir "$MODEL_PATH"
fi
else
hf download "$MODEL"
export MODEL_PATH="$MODEL"
fi

if [ -n "${ROCR_VISIBLE_DEVICES:-}" ]; then
export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

# ---- Resolve traces and install deps ----------------------------------------
resolve_trace_source
install_agentic_deps

# (srok)
#export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200
# (srok)

Check warning on line 59 in benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh

Claude / Claude Code Review

AIPERF_AGENTIC_CACHE_WARMUP_DURATION exported three times; only the last (=1200) takes effect

Lines 57-58 have two back-to-back unconditional exports of `AIPERF_AGENTIC_CACHE_WARMUP_DURATION` — first `=60`, then immediately `=1200`. Bash last-write-wins, so the `=60` is dead code and only `1200` takes effect; combined with the commented-out `=600` and `(srok)` sentinels this looks like un-cleaned experimental scratch. Delete the dead export and keep a single intentional value (a comment noting why `1200` was picked over the `600` used in the sibling `dsv4_fp4_b300_vllm.sh` would help).
Comment on lines +55 to +59


Extended reasoning...

The new script benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh contains this block near the top of the environment-setup section (lines 55-59):

# (srok)
#export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=600
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60
export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200
# (srok)

All three lines target the same variable. The first is a commented-out reference value (600). The next two are both live, unconditional exports — first 60, then 1200. In bash, export FOO=x; export FOO=y is not additive: the second assignment simply overwrites the first in the process's environment. Nothing between them mutates or reads the variable, so no subprocess ever observes 60. Only 1200 is exported when the script continues.

Step-by-step proof. After line 57, the shell env contains AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60. Line 58 (export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200) executes immediately after. export VAR=value in bash performs (a) assignment to the shell variable, then (b) marking it exported; both operations replace any prior value. No fork, no subshell, no conditional guards this line. So at line 59 the value is unambiguously 1200, and every downstream vllm serve / aiperf child inherits 1200. The =60 export is unreachable dead code.
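The overwrite is easy to confirm in isolation. A minimal sketch, assuming `DEMO_WARMUP` as a hypothetical stand-in for the real variable so the demonstration does not touch the benchmark environment:

```bash
#!/usr/bin/env bash
# Demonstration only. DEMO_WARMUP is a hypothetical stand-in for
# AIPERF_AGENTIC_CACHE_WARMUP_DURATION; it is not used by the PR.
show_last_export_wins() {
    (
        export DEMO_WARMUP=60
        export DEMO_WARMUP=1200
        # A child process inherits only the final value.
        bash -c 'echo "$DEMO_WARMUP"'
    )
}

show_last_export_wins   # prints 1200
```

The subshell keeps the demo variable out of the calling shell; the child `bash -c` plays the role of the `vllm serve` and `aiperf` children.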

Why this is worth flagging. The (srok) markers on lines 55 and 59 wrap the block like a debug sentinel, matching the pattern used elsewhere in the same file (e.g. the commented-out Mooncake RDMA config block, MC_WORKERS_PER_CTX=4, and the commented --attention_config.use_fp4_indexer_cache=True). This is characteristic of experimental scratchwork that got merged rather than cleaned up. The sibling dsv4_fp4_b300_vllm.sh recipe uses a single value (600), so it is not obvious to a reviewer whether the intended warmup for MI355X is 60 (early experiment), 600 (sibling default), or 1200 (final tuning) — leaving all three in the file forces a reader to figure out which one bash chose.

Impact. Purely code hygiene. Runtime behavior is deterministic (1200 always wins), so the sweep will produce well-defined data. Nothing crashes or misconfigures — the only cost is reviewer/maintainer confusion later when someone tries to reason about warmup timing or diff MI355X vs B300 recipes.

How to fix. Delete lines 56 and 57 (the commented =600 and the shadowed =60) and keep a single export AIPERF_AGENTIC_CACHE_WARMUP_DURATION=1200 (or whichever value is actually intended). A one-line comment such as # 20 min warmup: DSv4 FP4 warmup is slower than sibling B300 which uses 600 would document the divergence from dsv4_fp4_b300_vllm.sh. While cleaning this up, the other (srok)-marked debug blocks in this file (the commented Mooncake RDMA transport config near line 189, MC_WORKERS_PER_CTX=4 near line 205, and --attention_config.use_fp4_indexer_cache=True near line 371) are worth removing too, but that is outside the scope of this specific finding.
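If the warmup needs to stay tunable per run, one option beyond what this finding asks for is a single export with an overridable default. This is a sketch under assumptions: the helper name `set_warmup_default` is hypothetical, and 1200 is taken as the intended value only because it is the one that currently takes effect.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: one export, one documented default.
set_warmup_default() {
    # 20 min warmup unless the caller already set a value in the environment.
    # 1200 is assumed to be the intended default because it is the value
    # that wins in the current script.
    export AIPERF_AGENTIC_CACHE_WARMUP_DURATION="${AIPERF_AGENTIC_CACHE_WARMUP_DURATION:-1200}"
}

set_warmup_default
echo "warmup duration: $AIPERF_AGENTIC_CACHE_WARMUP_DURATION"
```

A sweep could then run with `AIPERF_AGENTIC_CACHE_WARMUP_DURATION=60` for a quick smoke test without editing the script.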

export AIPERF_HTTP_TCP_USER_TIMEOUT=900000

# vLLM router for DEP runs: expands one HTTP backend into one logical worker
# per DP rank and routes by X-Session-ID (aliased from X-Correlation-ID).
USE_VLLM_ROUTER=false
VLLM_BACKEND_PORT="$PORT"
if [ "$DP_ATTENTION" = "true" ]; then
USE_VLLM_ROUTER=true
VLLM_BACKEND_PORT=$((PORT + 1))
VLLM_ROUTER_VERSION=0.1.14
VLLM_ROUTER_POLICY=consistent_hash
VLLM_ROUTER_METRICS_PORT=$((PORT + 10000))
export AIPERF_HTTP_X_SESSION_ID_FROM_CORRELATION_ID=1
agentic_pip_install --quiet "vllm-router==$VLLM_ROUTER_VERSION"
fi

# DeepSeek-V4-Pro weights are large; engine startup can exceed default 600s.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

# vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while
# vllm-project/vllm#44774 applies the same reachability policy to Mooncake's
# store mask. 32k matches the trace-replay tuning validated for this workload.
export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768

# VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba
# models; this vLLM build raises ValueError if it is set for DSv4.

Check failure on line 85 in benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh

Claude / Claude Code Review

VLLM_PREFIX_CACHE_RETENTION_INTERVAL export contradicts inline warning that it breaks DSv4

Comment on lines +79 to +85


🔴 Lines 79-82 justify and unconditionally export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768, but the immediately following comment on lines 84-85 states: "this vLLM build raises ValueError if it is set for DSv4." Since this benchmark serves DeepSeek-V4-Pro (which is not SWA/Mamba), if the warning is accurate the vLLM engine fails at startup with ValueError and every OFFLOADING mode (none/cpu/lmcache) aborts before the server becomes ready. This warning is unique to the MI355X script — the B200/B300 vLLM siblings export the same variable without the warning — suggesting the AMD ROCm vLLM build pinned in amd-master.yaml behaves differently. Please resolve the contradiction: either remove the export on line 82, or remove the stale warning on lines 84-85.

Extended reasoning...

What the bug is

The script contains two adjacent statements that cannot both be true:

# vllm-project/vllm#43447 keeps local SWA prefix-cache tails sparsely, while
# vllm-project/vllm#44774 applies the same reachability policy to Mooncake's
# store mask. 32k matches the trace-replay tuning validated for this workload.
export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768

# VLLM_PREFIX_CACHE_RETENTION_INTERVAL only applies to sliding-window/Mamba
# models; this vLLM build raises ValueError if it is set for DSv4.

Lines 79-81 justify setting the variable, line 82 does the export unconditionally, and lines 84-85 warn that this exact export will crash the engine with ValueError because DSv4 is not a sliding-window/Mamba model. The two statements directly contradict each other in the same PR.

Why this matters — step-by-step proof of impact

  1. check_env_vars runs at the top of the script; it does not gate on MODEL.
  2. Line 82 unconditionally runs export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768 before any conditional path, before the OFFLOADING case-switch, and before the vLLM server is started.
  3. Later, vllm serve "$MODEL_PATH" is invoked with DeepSeek-V4-Pro (--tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4) — confirming the served model is DSv4, which is not an SWA/Mamba model.
  4. Per the author's own annotation on lines 84-85, this vLLM build will raise ValueError during engine initialization when the env var is set for DSv4.
  5. wait_for_server_ready then blocks until the server becomes healthy — which it never does because the engine crashed. The benchmark aborts before any request is served.

The effect is that every invocation of this benchmark (across all three OFFLOADING modes: none, cpu, lmcache; across all three parallelism modes: pure TP, TP+EP, DEP) fails at startup if the warning comment is accurate.

Why the existing code doesn't prevent this

There is no if [[ "$MODEL" == *swa* ]] / if [[ "$MODEL" != *deepseek_v4* ]] guard around the export. Even if such a guard existed elsewhere, none is present in this script or in benchmark_lib.sh for this variable. The export is a plain unconditional statement.

Cross-referencing sibling scripts

Searching the repo:

  • dsv4_fp4_b200_vllm.sh line 81: export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768, with no warning comment.
  • dsv4_fp4_b300_vllm.sh line 85: export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768, with no warning comment.
  • dsv4_fp4_mi355x_vllm.sh line 82: same export, followed by the ValueError warning.

The pattern (comment appears only on the MI355X copy) strongly suggests the author discovered that the ROCm vLLM build pinned in amd-master.yaml handles this env var differently from the NVIDIA CUDA vLLM builds used by the B200/B300 scripts — added the warning to capture that finding — but forgot to remove or gate the export itself.

How to fix

Either of the following resolves the contradiction:

Option A (if the warning is accurate — likely, given it's a first-person authorial claim about observed behavior on this specific build): delete line 82 entirely. The variable becomes unset for DSv4 on MI355X, matching the reason stated in the warning.

Option B (if the warning is stale — e.g., the AMD build has since been patched to accept the flag): delete lines 84-85. The export remains, matching the B200/B300 siblings.

Option A is the safer default: the warning is the newer, more specific annotation, and if it's correct the benchmark cannot run at all. Option B is only correct if the author has since re-validated that the ROCm build accepts the env var for DSv4.
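If the author wants to keep both behaviors available while the ROCm build is re-validated, a guard makes the choice explicit. This is a sketch under assumptions: `ENABLE_PREFIX_CACHE_RETENTION` and `configure_prefix_cache_retention` are hypothetical names that do not exist in the PR or in vLLM, and the default follows Option A.

```bash
#!/usr/bin/env bash
# Hypothetical opt-in guard. ENABLE_PREFIX_CACHE_RETENTION is an invented
# toggle for illustration; only VLLM_PREFIX_CACHE_RETENTION_INTERVAL comes
# from the script under review.
configure_prefix_cache_retention() {
    if [[ "${ENABLE_PREFIX_CACHE_RETENTION:-false}" == "true" ]]; then
        # Option B: the build has been re-validated to accept the variable.
        export VLLM_PREFIX_CACHE_RETENTION_INTERVAL=32768
    else
        # Option A: make sure an inherited value cannot reach the engine.
        unset VLLM_PREFIX_CACHE_RETENTION_INTERVAL
    fi
}

configure_prefix_cache_retention
```

The explicit `unset` matters on shared nodes, where a value exported by a previous job step could otherwise leak into the engine environment.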


# ---- Server config ----------------------------------------------------------
SERVER_LOG="$RESULT_DIR/server.log"
ROUTER_LOG="$RESULT_DIR/router.log"
MOONCAKE_MASTER_LOG="$RESULT_DIR/mooncake_master.log"
LMCACHE_LOG="$RESULT_DIR/lmcache_server.log"
mkdir -p "$RESULT_DIR"

OFFLOAD_ARGS=()

# ---- Lmcache config ----------------------------------------------------------
LMCACHE_PID=""

cleanup_lmcache_server() {
if [[ -n "$LMCACHE_PID" ]] && kill -0 "$LMCACHE_PID" 2>/dev/null; then
kill "$LMCACHE_PID" 2>/dev/null || true
wait "$LMCACHE_PID" 2>/dev/null || true
fi
}

trap cleanup_lmcache_server EXIT

cleanup_agentic_services() {
local exit_code=$?
trap - EXIT INT TERM
set +e
stop_background_process_tree "$ROUTER_PID" "vLLM router"
stop_background_process_tree "$SERVER_PID" "vLLM server" 60
stop_background_process_tree "$MOONCAKE_MASTER_PID" "Mooncake master"
exit "$exit_code"
}
trap cleanup_agentic_services EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

Check failure on line 119 in benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh

Claude / Claude Code Review

Cleanup traps broken: second EXIT trap overrides LMCache cleanup, and unbound PID vars fail under set -u

Comment on lines +99 to +119


🔴 Cleanup is broken in two ways. (1) trap cleanup_agentic_services EXIT on line 117 silently replaces trap cleanup_lmcache_server EXIT on line 106 (bash EXIT traps do not stack), so cleanup_lmcache_server is dead code — with OFFLOADING=lmcache, LMCACHE_PID is never killed on exit and holds ports 5555/8080, wedging follow-on runs on the same node. (2) cleanup_agentic_services (lines 108–116) references $ROUTER_PID, $SERVER_PID, $MOONCAKE_MASTER_PID under set -u, but the trap is registered at line 117 before those vars are ever assigned. set +e inside the handler does not disable nounset, so any early failure (Mooncake git-clone/cmake/make, LMCache install, rocm-smi) aborts cleanup on the first stop_background_process_tree "$ROUTER_PID" and leaks whichever background processes are up. Fix by initializing all four PIDs (ROUTER_PID, SERVER_PID, MOONCAKE_MASTER_PID, LMCACHE_PID) to "" up front — as the sibling dsv4_fp4_b300_vllm.sh does at lines 93–95 — collapsing to a single EXIT trap that also stops LMCACHE_PID.

Extended reasoning...

Bug 1: Second EXIT trap overrides the first, leaking LMCache

Bash EXIT traps assign, they do not chain. trap CMD EXIT replaces any previously-registered EXIT handler. In this script:

  • Line 106: trap cleanup_lmcache_server EXIT — registers LMCache cleanup.
  • Line 117: trap cleanup_agentic_services EXIT immediately replaces the LMCache cleanup.

After line 117, cleanup_lmcache_server is unreachable — no other code path calls it, and cleanup_agentic_services (lines 108–116) only stops ROUTER_PID, SERVER_PID, and MOONCAKE_MASTER_PID. It never touches LMCACHE_PID.

LMCACHE_PID refers to the standalone lmcache server process started around line 240 ("${LMCACHE_CMD[@]}" > "$LMCACHE_LOG" 2>&1 &). That is an independent background process — it is not part of SERVER_PID's process tree — so stop_background_process_tree "$SERVER_PID" cannot reach it either. In non-interactive shells (SLURM, CI, nohup) backgrounded & children do not inherit SIGHUP from the parent, so the process persists across script exit.

Consequence: with OFFLOADING=lmcache, on any exit (success, error, INT, TERM) the LMCache MP server keeps running and holds LMCACHE_PORT (default 5555, ZMQ) and LMCACHE_HTTP_PORT (default 8080). A follow-on lmcache run on the same node will fail its bind.
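The replacement behavior can be reproduced outside the benchmark. A minimal sketch, with hypothetical handler names `first` and `second` standing in for the two cleanup functions:

```bash
#!/usr/bin/env bash
# Demonstration only: `first` and `second` are hypothetical stand-ins for
# cleanup_lmcache_server and cleanup_agentic_services.
demo_exit_trap_replacement() {
    bash -c '
        first()  { echo "first handler"; }
        second() { echo "second handler"; }
        trap first EXIT
        trap second EXIT   # replaces the previous handler, does not append
    '
}

demo_exit_trap_replacement   # prints only "second handler"
```

Only the handler registered last runs when the inner shell exits, which is why the LMCache cleanup never fires in the script as written.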

Bug 2: Unbound PID vars kill the cleanup handler under set -u

Line 2 sets set -euo pipefail. cleanup_agentic_services does set +e (line 111) but not set +u; nounset is unaffected by set +e. The handler is registered at line 117, but:

  • SERVER_PID is only assigned around line 365 (after "${VLLM_CMD[@]}" ... &).
  • ROUTER_PID is only assigned around line 382, and only in the DP-attention branch.
  • MOONCAKE_MASTER_PID is only assigned inside the cpu case (around line 194).

So any failure between line 117 and those assignments causes the trap to fire while those vars are unbound. Bash expands "$ROUTER_PID" at the caller site before invoking stop_background_process_tree, so the callee's defensive ${1:-} cannot help — the expansion has already failed.
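The call-site expansion failure can be shown in a few lines. A minimal sketch, assuming a hypothetical `stop_proc` function in place of `stop_background_process_tree` and a hypothetical `NEVER_SET_PID` variable:

```bash
#!/usr/bin/env bash
# Demonstration only. stop_proc and NEVER_SET_PID are hypothetical names.

# Unguarded expansion: the inner shell aborts before stop_proc is entered.
demo_unbound_at_call_site() {
    bash -c '
        set -u
        stop_proc() { echo "stopping pid=${1:-}"; }
        set +e                       # does not turn off nounset
        stop_proc "$NEVER_SET_PID"   # expansion fails here, at the caller
        echo "not reached"
    ' 2>&1
}

# Guarded expansion: the default is applied at the caller, so the call runs.
demo_guarded_expansion() {
    bash -c '
        set -u
        stop_proc() { echo "stopping pid=${1:-}"; }
        stop_proc "${NEVER_SET_PID:-}"
    '
}

demo_unbound_at_call_site || true
demo_guarded_expansion
```

Initializing the variables to `""` up front has the same effect as the guarded form and keeps the call sites unchanged.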

Step-by-step proof — OFFLOADING=cpu failing in Mooncake build

  1. Script starts, set -euo pipefail active.
  2. Line 106 registers cleanup_lmcache_server as EXIT. Line 117 replaces it with cleanup_agentic_services.
  3. Case branch cpu runs: git clone, bash dependencies.sh, cmake .., make -j — suppose make -j fails on a fresh node.
  4. errexit triggers → EXIT trap runs.
  5. Handler runs local exit_code=$?, trap - EXIT INT TERM, set +e. set -u is still active.
  6. First substantive line: stop_background_process_tree "$ROUTER_PID" "vLLM router". Bash tries to expand $ROUTER_PID, which was never assigned.
  7. bash: ROUTER_PID: unbound variable → shell aborts.
  8. Nothing is cleaned up. If MOONCAKE_MASTER_PID had already been assigned (say the failure was later, in sudo make install), the Mooncake master is orphaned.

Step-by-step proof — OFFLOADING=lmcache, normal exit

  1. Script starts. Trap registrations run the same way — line 117 replaces line 106.
  2. lmcache case installs LMCache and starts the MP server: LMCACHE_PID=$!. Port 5555/8080 are bound.
  3. vLLM starts, benchmark runs, script reaches end successfully.
  4. EXIT fires → cleanup_agentic_services runs. set +e, then stops ROUTER_PID, SERVER_PID, MOONCAKE_MASTER_PID. LMCACHE_PID is never referenced.
  5. Script exits. LMCache MP server continues running, holding ports 5555 and 8080.
  6. Next sweep-triggered run on the same node fails at LMCache bind.

Why the reference recipe already handles this

benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh lines 93–95 initialize SERVER_PID="", ROUTER_PID="", MOONCAKE_MASTER_PID="" at the top of the script — exactly to prevent the unbound-var crash. This new MI355X copy has LMCACHE_PID="" (line 97) but omits the other three. The strong signal is that the project has already been bitten by this and guards against it in the sibling script; the guard was dropped here.

Fix

At the top of the script (before the trap registrations at ~line 106):

SERVER_PID=""
ROUTER_PID=""
MOONCAKE_MASTER_PID=""
LMCACHE_PID=""

And fold cleanup_lmcache_server into cleanup_agentic_services (removing the line 106 trap), adding stop_background_process_tree "$LMCACHE_PID" "LMCache server" alongside the other stops. This gives a single EXIT trap that safely handles every early-exit path and every offloading mode.
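Put together, the fix could look like the sketch below. It is not the sibling script's code: `stop_if_running` is a hypothetical stand-in for `stop_background_process_tree` (whose implementation lives in `benchmark_lib.sh` and is not shown in this PR), and `cleanup_all_services` is an invented name for the merged handler.

```bash
#!/usr/bin/env bash
# Initialize every PID before the trap is registered so the handler is safe
# under `set -u` on any early-exit path.
SERVER_PID=""
ROUTER_PID=""
MOONCAKE_MASTER_PID=""
LMCACHE_PID=""

# Hypothetical stand-in for stop_background_process_tree: stops one direct
# child if it is still alive, and is a no-op for an empty PID.
stop_if_running() {
    local pid="${1:-}" name="${2:-process}"
    [[ -n "$pid" ]] || return 0
    if kill -0 "$pid" 2>/dev/null; then
        echo "Stopping $name (pid $pid)" >&2
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
    fi
}

# Single EXIT handler covering all four background services.
cleanup_all_services() {
    local exit_code=$?
    trap - EXIT INT TERM
    set +e
    stop_if_running "$ROUTER_PID" "vLLM router"
    stop_if_running "$SERVER_PID" "vLLM server"
    stop_if_running "$MOONCAKE_MASTER_PID" "Mooncake master"
    stop_if_running "$LMCACHE_PID" "LMCache server"
    exit "$exit_code"
}
trap cleanup_all_services EXIT
```

With one handler there is no ordering hazard between trap registrations, and adding a fifth service later means adding one line rather than another trap.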


wait_for_lmcache_ready() {
{ set +x; } 2>/dev/null
local attempts="${LMCACHE_READY_ATTEMPTS:-120}"
local tail_pid=""

while [ ! -f "$LMCACHE_LOG" ]; do
if [[ -n "$LMCACHE_PID" ]] && ! kill -0 "$LMCACHE_PID" 2>/dev/null; then
echo "LMCache server died before creating log file. Exiting." >&2
exit 1
fi
sleep 1
done

tail -f -n +1 "$LMCACHE_LOG" &
tail_pid=$!

for ((i = 1; i <= attempts; i++)); do
if curl --output /dev/null --silent --fail "http://127.0.0.1:${LMCACHE_HTTP_PORT}/healthcheck"; then
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true
return 0
fi
if [[ -n "$LMCACHE_PID" ]] && ! kill -0 "$LMCACHE_PID" 2>/dev/null; then
echo "LMCache server died before becoming healthy. Log follows:" >&2
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true
cat "$LMCACHE_LOG" >&2 || true
exit 1
fi
sleep 1
done

echo "Timed out waiting for LMCache server healthcheck. Log follows:" >&2
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true
cat "$LMCACHE_LOG" >&2 || true
exit 1
}

case "$OFFLOADING" in
none) ;;
cpu)
# Embedded mode contributes one segment per GPU rank to a shared
# distributed store, so pre-divide the aggregate host-memory budget.
PER_RANK_GB=$((TOTAL_CPU_DRAM_GB / TP))

#MOONCAKE_VERSION=0.3.11.post1
#apt-get update && apt-get install -y libcurl4 libibverbs1 rdma-core librdmacm1 libnuma1 liburing2
#agentic_pip_install --quiet --no-cache-dir --no-deps \
# --force-reinstall "mooncake-transfer-engine-non-cuda==$MOONCAKE_VERSION"

git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
bash dependencies.sh
mkdir build
cd build
cmake ..
make -j
sudo make install # optional, make it ready to be used by vLLM/SGLang
cd ..
cd ..

python3 -c "from mooncake.store import MooncakeDistributedStore" >/dev/null
export INFERENCEX_MOONCAKE_MAX_TRANSFER_BATCH_KEYS=32
python3 "$(dirname "$0")/patch_vllm_mooncake_transfer_batches.py"

MOONCAKE_MASTER_PORT=$((PORT + 12000))
MOONCAKE_CONFIG_PATH="$RESULT_DIR/mooncake_config.json"
cat > "$MOONCAKE_CONFIG_PATH" <<EOF
{
"mode": "embedded",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:$MOONCAKE_MASTER_PORT",
"global_segment_size": "${PER_RANK_GB}GB",
"local_buffer_size": "2GB",
"protocol": "tcp",
"device_name": "",
"enable_offload": false
}
EOF
# (srok)
#"protocol": "rdma",
#"device_name": "mlx5_0",
#"local_buffer_size": "4GB",
export MOONCAKE_CONFIG_PATH
export MC_ENABLE_DEST_DEVICE_AFFINITY=1
export PYTHONHASHSEED=0
export MC_SLICE_SIZE=1048576
# (srok)
#export MC_WORKERS_PER_CTX=4
export MC_WORKERS_PER_CTX=8

MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO=0.80
MOONCAKE_EVICTION_RATIO=0.10
MOONCAKE_KV_LEASE_TTL=60s
#MOONCAKE_KV_LEASE_TTL=3600s

echo "Starting Mooncake master on port $MOONCAKE_MASTER_PORT..."
mooncake_master --port "$MOONCAKE_MASTER_PORT" \
--eviction_high_watermark_ratio="$MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO" \
--eviction_ratio="$MOONCAKE_EVICTION_RATIO" \
--default_kv_lease_ttl="$MOONCAKE_KV_LEASE_TTL" \
> "$MOONCAKE_MASTER_LOG" 2>&1 &

sleep 10
MOONCAKE_MASTER_PID=$!
if ! kill -0 "$MOONCAKE_MASTER_PID" 2>/dev/null; then
echo "Mooncake master died during startup." >&2
cat "$MOONCAKE_MASTER_LOG" >&2
exit 1
fi
unset VLLM_USE_SIMPLE_KV_OFFLOAD
OFFLOAD_ARGS=(
--kv-transfer-config
'{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both","kv_connector_extra_config":{"load_async":true}}'
)
;;
lmcache)
{ set +x; } 2>/dev/null
unset VLLM_USE_SIMPLE_KV_OFFLOAD

git clone https://github.com/LMCache/LMCache.git
cd LMCache
pip install -r requirements/build.txt
CXX=hipcc BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
cd ..

python3 -c "import lmcache.integration.vllm.lmcache_mp_connector" >/dev/null

# Match the B200 Kimi LMCache setup: keep a 2.5 TB semantic CPU KV
# pool, but let the external MP server own that pool so vLLM does not
# split --kv-offloading-size across TP ranks through the integrated
# LMCache backend.
LMCACHE_HOST="${LMCACHE_HOST:-127.0.0.1}"
LMCACHE_PORT="${LMCACHE_PORT:-5555}"
LMCACHE_HTTP_PORT="${LMCACHE_HTTP_PORT:-8080}"
# LMCacheMPConnector concatenates lmcache.mp.host and port into the
# ZMQ endpoint. Bind the server to a raw host, but pass the connector a
# ZMQ-style host string.
LMCACHE_CONNECT_HOST="${LMCACHE_CONNECT_HOST:-tcp://$LMCACHE_HOST}"
LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$TOTAL_CPU_DRAM_GB}"
if [ "$LMCACHE_L1_SIZE_GB" -gt "$TOTAL_CPU_DRAM_GB" ]; then
echo "Error: LMCACHE_L1_SIZE_GB=$LMCACHE_L1_SIZE_GB exceeds configured capacity $TOTAL_CPU_DRAM_GB" >&2
exit 1
fi
LMCACHE_L1_INIT_SIZE_GB="${LMCACHE_L1_INIT_SIZE_GB:-20}"
# LMCache read locks are leases on chunks that lookup has promised
# vLLM can retrieve. The default 300s TTL is too short for this
# long-context agentic queue: TP8/conc32 can spend >300s between
# lookup and retrieve while GPU KV is saturated, which leaves the
# object present in L1 but no longer readable. Keep the 2.5 TB pool
# size unchanged and only extend the lookup-to-retrieve lease.
LMCACHE_L1_READ_TTL_SECONDS="${LMCACHE_L1_READ_TTL_SECONDS:-7200}"
LMCACHE_CHUNK_SIZE="${LMCACHE_CHUNK_SIZE:-256}"
LMCACHE_MAX_WORKERS="${LMCACHE_MAX_WORKERS:-$TP}"
export PYTHONHASHSEED="${PYTHONHASHSEED:-0}"
export LMCACHE_BLOCKING_TIMEOUT_SECS=120

echo "Starting LMCache MP server..."
LMCACHE_CMD=(
lmcache server
--host "$LMCACHE_HOST"
--port "$LMCACHE_PORT"
--http-host "$LMCACHE_HOST"
--http-port "$LMCACHE_HTTP_PORT"
--l1-size-gb "$LMCACHE_L1_SIZE_GB"
--l1-init-size-gb "$LMCACHE_L1_INIT_SIZE_GB"
--l1-read-ttl-seconds "$LMCACHE_L1_READ_TTL_SECONDS"
--chunk-size "$LMCACHE_CHUNK_SIZE"
--max-workers "$LMCACHE_MAX_WORKERS"
--eviction-policy LRU
)
printf '%q ' "${LMCACHE_CMD[@]}" > "$RESULT_DIR/lmcache_command.txt"
printf '\n' >> "$RESULT_DIR/lmcache_command.txt"
"${LMCACHE_CMD[@]}" > "$LMCACHE_LOG" 2>&1 &
LMCACHE_PID=$!
echo "LMCache server PID: $LMCACHE_PID"
wait_for_lmcache_ready

PREFIX_CACHE_ARGS=(--enable-prefix-caching)
OFFLOAD_ARGS=(
--kv-transfer-config
"{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_CONNECT_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}"
)
;;
*)
echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)" >&2
exit 1
;;

Check warning on line 309 in benchmarks/single_node/agentic/dsv4_fp4_mi355x_vllm.sh

Claude / Claude Code Review

OFFLOADING error message omits lmcache from allowed values

The catch-all error at line 307 says `(expected one of: none, cpu)` but the case statement above accepts three values — `none`, `cpu`, and `lmcache` (the header docstring at lines 25-28 already lists all three). A user who mistypes `OFFLOADING` will be told `lmcache` isn't supported when it is. Update the message to `(expected one of: none, cpu, lmcache)`.
Comment on lines +306 to +309


Extended reasoning...

What the bug is

The OFFLOADING variable has three valid values: none, cpu, and lmcache. The header docstring on lines 25–28 explicitly documents all three, and the case "$OFFLOADING" in block at lines 158–308 has three matching arms (none) at line 160, cpu) at line 162, lmcache) at line 238). However, the fallthrough *) branch on line 307 emits:

Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu)

lmcache is missing from the list.

How it manifests

Any user (or CI config) that mistypes OFFLOADING — e.g. lmchache, Lmcache, or an empty string — will fall through to this branch and be told the recipe only supports none and cpu. In reality the third arm above is fully wired: it clones LMCache, starts an MP server, waits for its healthcheck, and configures --kv-transfer-config with LMCacheMPConnector. So the diagnostic actively misleads users away from a supported mode.

Why existing code doesn't prevent it

There is no other validation layer for OFFLOADING. check_env_vars only asserts the variable is set, not that its value is one of the accepted enum values. The case statement is the sole validator, and its fallthrough diagnostic is the only feedback a user gets on a bad input.

Step-by-step proof

  1. User sets OFFLOADING=lmchache (typo of lmcache).
  2. check_env_vars passes because the variable is non-empty.
  3. Execution reaches the case "$OFFLOADING" in at line 158.
  4. lmchache matches neither none), cpu), nor lmcache), so it falls through to *).
  5. The script prints Error: unsupported OFFLOADING value 'lmchache' (expected one of: none, cpu) and exits 1.
  6. The user, believing lmcache is not supported, either abandons the mode or files an issue — even though a one-character fix would have made their run succeed.

Impact

Purely cosmetic: no runtime path is affected, and all three valid modes still work. The impact is bounded to user experience when someone mistypes the variable. Severity is nit.

Fix

Change line 307 to:

echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, cpu, lmcache)" >&2

One-word addition, keeps the message in sync with both the case statement and the header docstring.
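To keep the message from drifting again, the accepted values can be listed once and reused for both the check and the diagnostic. A sketch under assumptions: `SUPPORTED_OFFLOADING` and `validate_offloading` are hypothetical names, not part of the PR.

```bash
#!/usr/bin/env bash
# Hypothetical single source of truth for the accepted OFFLOADING values.
SUPPORTED_OFFLOADING=(none cpu lmcache)

validate_offloading() {
    local value="${1:-}" mode joined
    for mode in "${SUPPORTED_OFFLOADING[@]}"; do
        if [[ "$value" == "$mode" ]]; then
            return 0
        fi
    done
    # Build "none, cpu, lmcache" from the array so the message cannot go stale.
    joined="$(printf '%s, ' "${SUPPORTED_OFFLOADING[@]}")"
    joined="${joined%, }"
    echo "Error: unsupported OFFLOADING value '$value' (expected one of: $joined)" >&2
    return 1
}

validate_offloading "lmcache" && echo "lmcache accepted"
```

The `case` statement would still dispatch on the value; the helper only replaces the hand-written list in the `*)` arm.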

esac

PARALLEL_ARGS=(--tensor-parallel-size "$TP" --data-parallel-size 1)
if [ "$DP_ATTENTION" = "true" ]; then
PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP")
fi

EP_ARGS=()
if [ "$EP_SIZE" -gt 1 ]; then
EP_ARGS=(--enable-expert-parallel)
fi

# AgentX concurrency counts live session trees, not individual requests.
# Subagent fan-out can push instantaneous request concurrency above CONC, so
# leave 2x headroom rather than clipping those bursts at the scheduler.
MAX_NUM_SEQS=$((2 * CONC))

# Workaround for MEC FW <177 RCCL memory reclaim issue
version=$(rocm-smi --showfw 2>/dev/null | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || ${version:-0} -lt 177 ]]; then
export HSA_NO_SCRATCH_RECLAIM=1
fi

echo "Starting vllm server..."
set -x
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

{ set +x; } 2>/dev/null
VLLM_CMD=(
vllm serve "$MODEL_PATH" --served-model-name "$MODEL"
--host 0.0.0.0
--port "$VLLM_BACKEND_PORT"
--trust-remote-code
--async-scheduling
--distributed-executor-backend mp
--kv-cache-dtype fp8
"${PARALLEL_ARGS[@]}"
"${EP_ARGS[@]}"
--moe-backend triton_unfused
--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE"}'
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
--enable-prefix-caching
--no-disable-hybrid-kv-cache-manager
--max-num-seqs "$MAX_NUM_SEQS"
"${OFFLOAD_ARGS[@]}"
)

# (srok), not yet
#--attention_config.use_fp4_indexer_cache=True
printf '%q ' "${VLLM_CMD[@]}" | tee "$RESULT_DIR/vllm_command.txt"
printf '\n' | tee -a "$RESULT_DIR/vllm_command.txt"
"${VLLM_CMD[@]}" > "$SERVER_LOG" 2>&1 &
SERVER_PID=$!
echo "Server PID: $SERVER_PID"

wait_for_server_ready --port "$VLLM_BACKEND_PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

if [ "$USE_VLLM_ROUTER" = "true" ]; then
echo "Starting native vLLM router on port $PORT for $TP DP ranks..."
vllm-router \
--worker-urls "http://localhost:$VLLM_BACKEND_PORT" \
--policy "$VLLM_ROUTER_POLICY" \
--intra-node-data-parallel-size "$TP" \
--host 0.0.0.0 \
--port "$PORT" \
--prometheus-host 127.0.0.1 \
--prometheus-port "$VLLM_ROUTER_METRICS_PORT" \
--request-timeout-secs 14400 \
--disable-retries > "$ROUTER_LOG" 2>&1 &
ROUTER_PID=$!
echo "Router PID: $ROUTER_PID"
wait_for_server_ready --port "$PORT" --server-log "$ROUTER_LOG" --server-pid "$ROUTER_PID"
fi

# ---- Run benchmark ----------------------------------------------------------
build_replay_cmd "$RESULT_DIR"

run_agentic_replay_and_write_outputs "$RESULT_DIR"