[codex] Clean up srt-slurm recipes#2002
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. 感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致 如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow 一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。 如需更多帮助,PR 作者可通过 Slack 联系核心维护者。 |
|
Claude finished @Oseltamivir's task in 16m 27s —— View job Review of PR #2002 (srt-slurm recipe cleanup)
Found one blocking bug (inline comment on 🔴 BLOCKING: GB300 launcher can't handle the new selector-suffixed CONFIG_FILEsThis PR points all 22 glm5 GB300 config entries at upstream selectors like 🔴 Policy: master config changed without a perf-changelog entry
What checks out ✅
|
| git checkout sa-submission-q2-2026 | ||
| mkdir -p recipes/sglang/glm5 | ||
| cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/sglang/glm5" recipes/sglang/glm5 | ||
| git checkout main |
There was a problem hiding this comment.
🔴 BLOCKING: This switch to upstream main selectors breaks the job-name override further down in this script. Line 281 uses the raw $CONFIG_FILE:
sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "$CONFIG_FILE"All 22 new glm5 GB300 entries in configs/nvidia-master.yaml now set CONFIG_FILE with a selector suffix (e.g. recipes/gb300-fp8/glm5.yaml:zip_override_8k1k_hightpt[0]), which is exported verbatim by benchmark-multinode-tmpl.yml. sed treats the whole string as a filename, fails with "No such file or directory", and since this script runs under set -exo pipefail (line 5), the launcher aborts before srtctl apply ever runs.
Why it matters: Every glm5-fp4-gb300 and glm5-fp8-gb300 sweep entry using an upstream selector will fail at launch. This wasn't caught by PR CI because bash -n and config generation don't exercise this path — only the b200 launcher previously received selector-suffixed CONFIG_FILEs, and it strips them (runners/launch_b200-dgxc.sh:248).
Fix: strip the selector before sed, same as the b200 launcher (keep the full $CONFIG_FILE in the srtctl apply -f args at line 290 — srtctl consumes the selector):
sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "${CONFIG_FILE%%:*}"| additional-settings: | ||
| - "CONFIG_FILE=recipes/sglang/dsr1/b200-fp4/1k1k/disagg/mtp/1k1k_mtp_lowlat_0.yaml" | ||
| - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[0]" | ||
| decode: |
There was a problem hiding this comment.
🔴 The pre-existing sed at runners/launch_b200-dgxc.sh:252 uses s/^ max_attempts: [0-9]*/ max_attempts: 720/ which anchors exactly 2 leading spaces, but the newly-referenced upstream combined recipes (recipes/b200-fp8/glm5.yaml, referenced by the 15 new CONFIG_FILE entries at lines 10621-10806) nest max_attempts inside a top-level base: block at column 4, so the substitution silently no-ops and the health-check window reverts from the intended 720×10s=7200s back to the recipe default 360×10s=3600s for exactly the glm5-fp8-b200-dynamo-sglang sweeps this PR migrates. GLM-5-FP8 is a ~600 GB checkpoint served off shared Lustre — the 7200s bump was added specifically to prevent this class of timeout (see comment at lines 249-251). Fix is a one-character regex broadening: ^ *max_attempts: matches either indent depth.
Extended reasoning...
What the bug is
runners/launch_b200-dgxc.sh:252 runs (unchanged in this PR):
sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/' "${CONFIG_FILE%%:*}"The regex anchors ^ + exactly two literal spaces + max_attempts:. That matched the deleted flat local recipes (e.g. benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/b200-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml), which had health_check at column 0 and max_attempts at column 2:
health_check:
max_attempts: 360This PR replaces the 15 glm5-fp8-b200-dynamo-sglang CONFIG_FILE entries (nvidia-master.yaml lines 10621, 10634, 10647, 10660, 10673, 10686, 10702, 10715, 10728, 10741, 10754, 10767, 10780, 10793, 10806) with upstream combined-recipe selectors recipes/b200-fp8/glm5.yaml:zip_override_*. The upstream file nests health_check inside a top-level base: block, so max_attempts sits at column 4, not column 2:
base:
...
health_check:
max_attempts: 360
interval_seconds: 10Why the code path triggers
The glm5-fp8-b200-dynamo-sglang top-level entry at nvidia-master.yaml:10600-10604 declares runner: b200-dgxc, which routes to runners/launch_b200-dgxc.sh. ${CONFIG_FILE%%:*} correctly strips the :zip_override_*[N] suffix and resolves to recipes/b200-fp8/glm5.yaml. The sed then runs against that file with the wrong anchor.
Step-by-step proof
- Reproduce the no-op against the new nesting:
$ printf 'base:\n health_check:\n max_attempts: 360\n' | \
sed 's/^ max_attempts: [0-9]*/ max_attempts: 720/'
base:
health_check:
max_attempts: 360 # unchanged, exit 0
- Confirm the broadened regex fixes it:
$ printf 'base:\n health_check:\n max_attempts: 360\n' | \
sed 's/^ *max_attempts: [0-9]*/ max_attempts: 720/'
base:
health_check:
max_attempts: 720
- sed exits 0 on non-match — no error surfaces to the launcher, and
srtctl applyruns against the unmodified recipe.
Why existing code does not prevent it
The sibling name: sed at line 248 uses ^name:.* which happens to match the deleted flat recipes (name at column 0) but has an analogous issue for the upstream nested layout — that is out of scope here, but the pattern demonstrates that neither substitution was written with the base-block layout in mind.
Impact
The launcher comment at lines 249-251 says the 720 bump exists specifically because 3600s is not enough for large model loads off shared FS (calls out DSR1-FP8 ~680GB). GLM-5-FP8 is a ~600 GB checkpoint on the same shared Lustre tree (MODEL_PATH=/lustre/fsw/models/GLM-5-FP8 at launch_b200-dgxc.sh:46). Silently dropping the window from 7200s back to 3600s reintroduces exactly the cold-cache health-check timeout class the workaround was designed to prevent, on the exact sweeps this PR retargets at the upstream file. Because sed exits 0 on no-match, this is invisible until a sweep flakes at health-check exhaustion, at which point diagnosing "the 720 bump silently stopped applying" costs real GPU time.
How to fix
One-character regex broadening at runners/launch_b200-dgxc.sh:252:
- sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/' "${CONFIG_FILE%%:*}"
+ sed -i 's/^ *max_attempts: [0-9]*/ max_attempts: 720/' "${CONFIG_FILE%%:*}"This matches either column-2 (flat) or column-4 (nested-inside-base) layouts.
| dp-attn: true | ||
| additional-settings: | ||
| - "CONFIG_FILE=recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml" | ||
| - "CONFIG_FILE=recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]" |
There was a problem hiding this comment.
🔴 runners/launch_gb300-nv.sh:281 runs sed -i "…" "$CONFIG_FILE" without stripping the :zip_override_*[N] selector suffix. This PR routes many new gb300-nv glm5-fp{4,8}-gb300-dynamo-sglang CONFIG_FILE values (nvidia-master.yaml lines 11112, 11126, 11140, 11232, 11246, 11260, 11278, 11292, 11321, 11334, 11347, … through 11502) at upstream combined recipes carrying that suffix — sed then tries to open a literal path like recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0], fails, and under set -exo pipefail (line 5) aborts the launcher before srtctl apply runs. The sibling launch_b200-dgxc.sh:248,252 already does this correctly with "${CONFIG_FILE%%:*}" (and even carries a comment explaining why); the fix is to apply the same one-line ${CONFIG_FILE%%:*} treatment on launch_gb300-nv.sh:281.
Extended reasoning...
What the bug is
runners/launch_gb300-nv.sh:281 unconditionally invokes:
sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "$CONFIG_FILE"passing $CONFIG_FILE verbatim. In this codebase, the CONFIG_FILE value materialised into the launcher's environment via additional-settings may carry an srtctl-style zip-override selector suffix of the form <path>.yaml:zip_override_<name>[<index>]. sed does not interpret that suffix — it will fopen(2) the literal string and fail with sed: can't read recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]: No such file or directory (exit status 2).
Because launch_gb300-nv.sh:5 declares set -exo pipefail, that non-zero exit terminates the launcher immediately — before the srtctl apply invocation on line 300 is reached.
The specific code path that triggers it
This PR migrates configs/nvidia-master.yaml's glm5-fp4-gb300-dynamo-sglang block (starting at line 11089, runner: gb300-nv) and glm5-fp8-gb300-dynamo-sglang block (starting at line 11303, also runner: gb300-nv) from flat per-topology recipes to upstream combined recipes with zip-override selectors. The new CONFIG_FILE values appear at nvidia-master.yaml:11112, 11126, 11140, 11232, 11246, 11260, 11278, 11292, 11321, 11334, 11347, … 11502 — 27+ configurations. All are routed through runner: gb300-nv, which per configs/runners.yaml maps to gb300-nv_0/1/2 and per .github/workflows/benchmark-multinode-tmpl.yml line 219 (bash ./runners/launch_${RUNNER_NAME%%_*}.sh) dispatches to launch_gb300-nv.sh.
Why existing code doesn't prevent it
The sibling launcher runners/launch_b200-dgxc.sh already handles this at lines 248 and 252 with the correct "${CONFIG_FILE%%:*}" bash parameter expansion to strip everything from the first : onward, and even carries an explanatory inline comment on line 251:
# Uses ${CONFIG_FILE%%:*} because CONFIG_FILE may carry an :override[N] suffix.
launch_gb300-nv.sh was never updated to apply the same treatment. This PR's diff of launch_gb300-nv.sh shows only clone/checkout/container-alias changes — line 281 remained unchanged, exposing this latent gap now that gb300-nv sweeps carry the selector.
Step-by-step proof
- A gb300-nv
glm5-fp4-gb300-dynamo-sglangsweep resolves toCONFIG_FILE=recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0](nvidia-master.yaml:11112). - GHA runner dispatches
bash ./runners/launch_gb300-nv.sh. - Line 5 sets
-exo pipefail. - Line 281 executes
sed -i "s/^name:.*/name: \"gb300-nv_0\"/" "recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]". - sed calls
fopen("recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]", …)→ ENOENT (the actual file isrecipes/gb300-fp4/glm5.yaml; the suffix is an srtctl selector, not a filesystem path). sed exits status 2. pipefailpropagates the failure; the shell terminates beforesrtctl apply(line 300) is invoked. The job never launches.
Impact
Every gb300-nv glm5-fp4/fp8-gb300-dynamo-sglang sweep introduced or migrated by this PR (27+ configurations) hard-fails at launch. This directly contradicts the PR's stated goal of "keeping every remaining recipe referenced by an active configuration functional." The pre-merge validation listed in the PR description (generate_sweep_configs.py + pytest matrix_logic) does not exercise the launcher, so it doesn't catch this.
Fix
One-line change on runners/launch_gb300-nv.sh:281:
sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "${CONFIG_FILE%%:*}"matching the pattern already established on launch_b200-dgxc.sh:248.
Summary
This reduces
benchmarks/multi_node/srt-slurm-recipesfrom 321 files to 158 while keeping every remaining recipe referenced by an active configuration.Validation
uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ -v(179 passed)python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files configs/nvidia-master.yaml --framework dynamo-sglangbash -n runners/launch_b200-dgxc.sh runners/launch_b300-nv.sh runners/launch_gb200-nv.sh runners/launch_gb300-nv.shgit diff --check