111 changes: 92 additions & 19 deletions benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh
@@ -7,7 +7,7 @@ set -x
# Required env vars:
# MODEL, TP, CONC, KV_OFFLOADING, TOTAL_CPU_DRAM_GB, RESULT_DIR
#
# KV_OFFLOADING=dram requires KV_OFFLOAD_BACKEND=native.
# KV_OFFLOADING=dram requires KV_OFFLOAD_BACKEND=native or KV_OFFLOAD_BACKEND=mooncake.

source "$(dirname "$0")/../../benchmark_lib.sh"

@@ -37,39 +37,112 @@ install_agentic_deps

# ---- Server config ----------------------------------------------------------
SERVER_LOG="$RESULT_DIR/server.log"
MOONCAKE_MASTER_LOG="$RESULT_DIR/mooncake_master.log"
mkdir -p "$RESULT_DIR"

SERVER_PID=""
MOONCAKE_MASTER_PID=""

cleanup_agentic_services() {
local exit_code=$?
trap - EXIT INT TERM
set +e
stop_background_process_tree "$SERVER_PID" "vLLM server" 60
stop_background_process_tree "$MOONCAKE_MASTER_PID" "Mooncake master"
exit "$exit_code"
}
trap cleanup_agentic_services EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

OFFLOAD_ARGS=()
PREFIX_CACHE_ARGS=()

if require_agentic_kv_offload_backend native; then
export VLLM_USE_SIMPLE_KV_OFFLOAD=1
OFFLOAD_ARGS=(
--kv_offloading_backend native
--kv_offloading_size "$TOTAL_CPU_DRAM_GB"
--disable-hybrid-kv-cache-manager
)

if agentic_kv_offload_enabled; then
case "$KV_OFFLOAD_BACKEND" in
native)
export VLLM_USE_SIMPLE_KV_OFFLOAD=1
OFFLOAD_ARGS=(
--kv_offloading_backend native
--kv_offloading_size "$TOTAL_CPU_DRAM_GB"
--disable-hybrid-kv-cache-manager
)
;;
mooncake)
{ set +x; } 2>/dev/null
unset VLLM_USE_SIMPLE_KV_OFFLOAD

PER_RANK_GB=$((TOTAL_CPU_DRAM_GB / TP))

MOONCAKE_VERSION=0.3.11.post1
agentic_pip_install --quiet --no-cache-dir --no-deps \
--force-reinstall "mooncake-transfer-engine-cuda13==$MOONCAKE_VERSION"

MOONCAKE_MASTER_PORT=$((PORT + 12000))
MOONCAKE_CONFIG_PATH="$RESULT_DIR/mooncake_config.json"
cat > "$MOONCAKE_CONFIG_PATH" <<EOF
{
"mode": "embedded",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:$MOONCAKE_MASTER_PORT",
"global_segment_size": "${PER_RANK_GB}GB",
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": ""
}
EOF
export MOONCAKE_CONFIG_PATH
export WITH_NVIDIA_PEERMEM=0
export MC_ENABLE_DEST_DEVICE_AFFINITY=1

MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO=0.80
MOONCAKE_EVICTION_RATIO=0.10
MOONCAKE_KV_LEASE_TTL=60s

echo "Starting Mooncake master on port $MOONCAKE_MASTER_PORT..."
mooncake_master --port "$MOONCAKE_MASTER_PORT" \
--default_kv_lease_ttl=1h \
> "$MOONCAKE_MASTER_LOG" 2>&1 &
MOONCAKE_MASTER_PID=$!
sleep 2
if ! kill -0 "$MOONCAKE_MASTER_PID" 2>/dev/null; then
echo "Mooncake master died during startup." >&2
Comment on lines +97 to +108

🟡 The three variables MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO=0.80, MOONCAKE_EVICTION_RATIO=0.10, and MOONCAKE_KV_LEASE_TTL=60s (lines 97-99) are assigned but never referenced by the mooncake_master invocation. That invocation hardcodes --default_kv_lease_ttl=1h, which contradicts the declared 60s, and omits both eviction flags, so upstream defaults (~0.95/0.05) apply. Every sister Mooncake recipe (dsv4_fp4_b200_vllm.sh, dsv4_fp4_b300_vllm.sh, minimaxm3_fp8_*) plumbs all three variables through with --eviction_high_watermark_ratio, --eviction_ratio, and --default_kv_lease_ttl. Either wire the vars in (and pick one lease TTL) or delete the dead assignments.

Extended reasoning...

What the bug is. In the new Mooncake branch of kimik2.5_fp4_b300.sh, three env-style vars are declared at lines 97-99 and never referenced again:

MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO=0.80
MOONCAKE_EVICTION_RATIO=0.10
MOONCAKE_KV_LEASE_TTL=60s

Three lines later, the master is launched with a hardcoded lease TTL and no eviction flags:

mooncake_master --port "$MOONCAKE_MASTER_PORT" \
    --default_kv_lease_ttl=1h \
    > "$MOONCAKE_MASTER_LOG" 2>&1 &

Two concrete defects.

  1. --default_kv_lease_ttl=1h contradicts MOONCAKE_KV_LEASE_TTL=60s. Both values cannot represent the author's intent, so one of them must be wrong. Because the declared variable is dead, the hardcoded 1h is the value that actually takes effect.
  2. The two eviction flags are absent. Without --eviction_high_watermark_ratio and --eviction_ratio, the master falls back to upstream defaults (roughly 0.95 / 0.05), not the 0.80 / 0.10 the vars declare.

Cross-recipe comparison. Every other agentic Mooncake recipe in this repo declares the same three variables and wires them through:

  • benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh:139-141 — passes all three: --eviction_high_watermark_ratio="$MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO" --eviction_ratio="$MOONCAKE_EVICTION_RATIO" --default_kv_lease_ttl="$MOONCAKE_KV_LEASE_TTL".
  • benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:136-142 — same pattern, with a comment stating the eviction tuning exists to "start eviction before an imbalanced rank exhausts its segment, and reclaim enough space for several concurrent multi-GB batch puts".
  • benchmarks/single_node/agentic/minimaxm3_fp8_h100.sh:66-68 (and the h200/mi300x/mi325x siblings) — all wire both eviction flags.

Step-by-step proof.

  1. This PR's config adds { tp: 4, ep: 1, kv-offloading: dram, kv-offload-backend: mooncake, conc-list: [64, 72, 80] } to configs/nvidia-master.yaml.
  2. That takes the script down the mooncake) branch (line 61+).
  3. Lines 97-99 assign the three variables. No other line in the file (or in benchmark_lib.sh) references them — grep confirms zero uses.
  4. Lines 101-104 launch the master with --default_kv_lease_ttl=1h and nothing else. So the actual runtime state is: watermark ≈ 0.95, evict-ratio ≈ 0.05, lease = 1h. Declared intent (0.80 / 0.10 / 60s) never takes effect.
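
The "zero uses" check in step 3 can be reproduced with a small helper. This is a minimal sketch, not something from the PR: the function name count_var_uses is hypothetical, and the three-line sample file stands in for the real script. It counts lines that expand a variable as $NAME or ${NAME...}; a bare assignment such as NAME=value does not match because it has no leading "$".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): count lines in a file that
# expand a shell variable as $NAME or ${NAME...}.
count_var_uses() {
    local name=$1 file=$2
    # Match "$NAME" or "${NAME" followed by a non-identifier character or
    # end of line, so that $NAME_SUFFIX is not counted as a use of $NAME.
    local pattern='\$\{?'"${name}"'([^A-Za-z0-9_]|$)'
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
    grep -cE "$pattern" "$file" || true
}

# Stand-in for the relevant lines of the script under review.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
MOONCAKE_KV_LEASE_TTL=60s
mooncake_master --port "$MOONCAKE_MASTER_PORT" \
    --default_kv_lease_ttl=1h
EOF

count_var_uses MOONCAKE_KV_LEASE_TTL "$tmp"   # prints 0: assigned, never expanded
count_var_uses MOONCAKE_MASTER_PORT "$tmp"    # prints 1: expanded on the launch line
rm -f "$tmp"
```

Running the same helper against the real script and benchmark_lib.sh for each of the three variable names would confirm or refute the dead-assignment claim.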

Impact. The sweep will still run, because Mooncake defaults are functional. But the tuned eviction knobs exist in sister recipes for a documented reason (imbalanced-rank segment exhaustion under concurrent multi-GB puts), and this sweep hits the same regime: TP=4, CONC 64/72/80 with DRAM offload. Without --eviction_high_watermark_ratio and --eviction_ratio, a single rank that drifts hot can exhaust its segment before eviction begins. Additionally, the 60s-vs-1h contradiction leaves a live ambiguity: a future reader looking at MOONCAKE_KV_LEASE_TTL=60s will reasonably assume 60s is in effect and act on that belief.

Fix. Either plumb the variables through, matching the sister B300 recipe:

mooncake_master --port "$MOONCAKE_MASTER_PORT" \
    --eviction_high_watermark_ratio="$MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO" \
    --eviction_ratio="$MOONCAKE_EVICTION_RATIO" \
    --default_kv_lease_ttl="$MOONCAKE_KV_LEASE_TTL" \
    > "$MOONCAKE_MASTER_LOG" 2>&1 &

…or delete the three unused assignments and accept 1h / defaults as the intended configuration.
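
If the variables are kept, one way to stop declared values and launch flags from drifting apart again is to build the flag list from the variables in a single array. The following is a sketch under assumptions: the array name MOONCAKE_MASTER_ARGS and the fallback port 50051 are illustrative only, the flag names are the ones quoted above from the sister recipes, and the sketch prints the command instead of launching mooncake_master.

```shell
#!/usr/bin/env bash
# Illustrative values; the fallback port is a placeholder, not from the PR.
MOONCAKE_MASTER_PORT=${MOONCAKE_MASTER_PORT:-50051}
MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO=0.80
MOONCAKE_EVICTION_RATIO=0.10
MOONCAKE_KV_LEASE_TTL=60s

# Every tunable appears exactly once, and every flag is derived from it,
# so a hardcoded literal can no longer contradict a declared variable.
MOONCAKE_MASTER_ARGS=(
    --port "$MOONCAKE_MASTER_PORT"
    --eviction_high_watermark_ratio="$MOONCAKE_EVICTION_HIGH_WATERMARK_RATIO"
    --eviction_ratio="$MOONCAKE_EVICTION_RATIO"
    --default_kv_lease_ttl="$MOONCAKE_KV_LEASE_TTL"
)

# Print the exact command line, mirroring how the script records VLLM_CMD.
printf '%q ' mooncake_master "${MOONCAKE_MASTER_ARGS[@]}"
echo

# The real launch (not executed in this sketch) would then be:
#   mooncake_master "${MOONCAKE_MASTER_ARGS[@]}" > "$MOONCAKE_MASTER_LOG" 2>&1 &
```

Printing the array with %q also leaves a record of the effective master settings next to vllm_command.txt, if the output is redirected to a file in RESULT_DIR.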

cat "$MOONCAKE_MASTER_LOG" >&2
exit 1
fi
OFFLOAD_ARGS=(
--kv-transfer-config
'{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
)
;;
*) echo "Error: unsupported KV_OFFLOAD_BACKEND value '$KV_OFFLOAD_BACKEND' (expected one of: native, mooncake)" >&2; exit 1 ;;
esac
fi

echo "Starting vllm server..."
export PYTHONNOUSERSITE=1

DCP_ARGS=()
if [[ "$CONC" -ge 16 ]]; then
DCP_ARGS=(--decode-context-parallel-size "$TP")
fi

{ set +x; } 2>/dev/null
VLLM_CMD=(
vllm serve "$MODEL_PATH" --served-model-name "$MODEL"
--host 0.0.0.0
--port "$PORT"
--tensor-parallel-size="$TP"
--gpu-memory-utilization 0.90
--max-num-seqs "$CONC"
--reasoning-parser kimi_k2
--tool-call-parser kimi_k2
--compilation_config.pass_config.fuse_allreduce_rms true
--kv-cache-dtype fp8
--max-cudagraph-capture-size 2048
--stream-interval 20
--trust-remote-code
"${PREFIX_CACHE_ARGS[@]}"
--block-size 64
--language-model-only
--attention-config '{"mla_prefill_backend":"TRTLLM_RAGGED","use_prefill_query_quantization":true}'
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
--max-cudagraph-capture-size 2048
--max-num-batched-tokens 16384
--stream-interval 10
--enable-prefix-caching
--tensor-parallel-size "$TP"
"${DCP_ARGS[@]}"
"${OFFLOAD_ARGS[@]}"
)
printf '%q ' "${VLLM_CMD[@]}" | tee "$RESULT_DIR/vllm_command.txt"
13 changes: 5 additions & 8 deletions configs/nvidia-master.yaml
@@ -2965,12 +2965,7 @@ dsr1-fp8-b300-sglang-mtp:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 512, spec-decoding: mtp }

kimik2.5-fp4-b300-vllm-agentic:
# v0.20.2 (cu129) lacks the flashinfer kernels for B300's reported SM
# (sm_12x); workers hit "Only SM 10.x and 11.x are supported" in the
# trtllm_fp4_block_scale_moe path. v0.20.0-cu130 is the Blackwell-targeted
# build that has the full sm_10x/sm_11x/sm_12x kernel set and is what the
# INT4 B300 sister already uses successfully.
image: vllm/vllm-openai:v0.22.0
image: vllm/vllm-openai:nightly-09663abde0f50944a8d5ea30120666024b503faa
model: nvidia/Kimi-K2.5-NVFP4
model-prefix: kimik2.5
runner: cluster:b300-nv
@@ -2981,8 +2976,10 @@ kimik2.5-fp4-b300-vllm-agentic:
agentic-coding:
- dram-utilization: 0.80
search-space:
- { tp: 8, ep: 1, kv-offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
- { tp: 8, ep: 1, kv-offloading: dram, kv-offload-backend: native, conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
- { tp: 8, ep: 1, kv-offloading: none, conc-list: [1, 2, 4] }
- { tp: 4, ep: 1, kv-offloading: none, conc-list: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80] }
- { tp: 4, ep: 1, kv-offloading: dram, kv-offload-backend: native, conc-list: [64, 72, 80] }
- { tp: 4, ep: 1, kv-offloading: dram, kv-offload-backend: mooncake, conc-list: [64, 72, 80] }


dsr1-fp8-b200-trt:
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -4440,3 +4440,9 @@
- "Update Minimax M3 b200 vllm image tag"
- "Update search space to cover more configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1978

- config-keys:
- kimik2.5-fp4-b300-vllm-agentic
description:
- "Update kimi k2.5 agentx B300"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1998