Skip to content

[codex] Clean up srt-slurm recipes#2002

Merged
Oseltamivir merged 1 commit into
mainfrom
codex/clean-up-srt-slurm-recipes
Jul 3, 2026
Merged

[codex] Clean up srt-slurm recipes#2002
Oseltamivir merged 1 commit into
mainfrom
codex/clean-up-srt-slurm-recipes

Conversation

@Oseltamivir

Copy link
Copy Markdown
Collaborator

Summary

  • remove 122 local srt-slurm recipe YAMLs that are no longer referenced by the active NVIDIA master config
  • replace 41 split DSR1 and GLM5 recipes with equivalent selectors from NVIDIA/srt-slurm upstream
  • update affected launchers to use upstream main and retain only the five GB300 FP4 GLM5 low-latency overrides that differ from upstream
  • remove obsolete MiniMax M2.5 recipe overlay branches from the launchers

This reduces benchmarks/multi_node/srt-slurm-recipes from 321 files to 158 while keeping every remaining recipe referenced by an active configuration.

Validation

  • uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ -v (179 passed)
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files configs/nvidia-master.yaml --framework dynamo-sglang
  • bash -n runners/launch_b200-dgxc.sh runners/launch_b300-nv.sh runners/launch_gb200-nv.sh runners/launch_gb300-nv.sh
  • git diff --check
  • resolved and compared each replacement upstream override selector against its local flat recipe

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。

如需更多帮助,PR 作者可通过 Slack 联系核心维护者。

@Oseltamivir Oseltamivir marked this pull request as ready for review July 3, 2026 12:21
@Oseltamivir Oseltamivir requested a review from a team July 3, 2026 12:21
@Oseltamivir Oseltamivir merged commit bd710a0 into main Jul 3, 2026
11 checks passed
@Oseltamivir Oseltamivir deleted the codex/clean-up-srt-slurm-recipes branch July 3, 2026 12:21
@claude

claude Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Claude finished @Oseltamivir's task in 16m 27s —— View job


Review of PR #2002 (srt-slurm recipe cleanup)

  • Gather context (diff, config, launcher changes)
  • Verify no dangling references to deleted recipe YAMLs
  • Check launcher recipe-selector changes
  • Post review

Found one blocking bug (inline comment on runners/launch_gb300-nv.sh) plus one policy issue. Since this PR is already merged, both need a follow-up PR.

🔴 BLOCKING: GB300 launcher can't handle the new selector-suffixed CONFIG_FILEs

This PR points all 22 glm5 GB300 config entries at upstream selectors like recipes/gb300-fp8/glm5.yaml:zip_override_8k1k_hightpt[0]. The workflow exports CONFIG_FILE verbatim, but runners/launch_gb300-nv.sh:281 runs sed -i "s/^name:.*/..." "$CONFIG_FILE" on the raw value — sed treats the whole string (suffix included) as a filename, fails, and set -exo pipefail aborts the launcher before srtctl apply. Every glm5-fp4/fp8 GB300 sweep entry will fail at launch. The b200 launcher already handles this by stripping the suffix with ${CONFIG_FILE%%:*} (runners/launch_b200-dgxc.sh:248); the gb200 launcher does too (runners/launch_gb200-nv.sh:455). One-line fix, detailed in the inline comment. PR CI didn't catch this because bash -n and config generation never exercise the launch path.

🔴 Policy: master config changed without a perf-changelog entry

configs/nvidia-master.yaml was modified but perf-changelog.yaml was not updated. When changing configs/amd-master.yaml or configs/nvidia-master.yaml, you must add a corresponding entry to perf-changelog.yaml (appended to the end of the file) documenting the changes with the PR link — even for intended-perf-neutral recipe-path swaps, so the changelog explains any resulting shifts.

What checks out ✅

  • All 122 deleted recipe YAMLs are unreferenced: I cross-checked every remaining CONFIG_FILE=recipes/sglang/... and recipes/vllm/... reference in configs/nvidia-master.yaml against the tree — all resolve to existing files (deepseek-v4, dsr1 8k1k mtp, the five retained glm5 gb300-fp4 lowlat overrides, qwen3.5, kimi, minimax-m3).
  • The removed MiniMax M2.5 launcher branches had no remaining config references.
  • The b200 launcher changes (glm5-fp8 → upstream main, dsr1-fp4 → main + local 8k1k overlay) are consistent with the config's mix of upstream selectors and local paths, and that launcher handles the selector suffix correctly.
  • The retained glm5 gb300-fp4 recipes document they were ported from upstream recipes/gb300-fp4/glm5.yaml (upstream PR SPIKE: investigate ways to slim down Python file used for testing #152), consistent with the new selectors. I could not fetch NVIDIA/srt-slurm from this sandbox to independently verify the upstream zip_override_* key names and index counts on main, so that part relies on the author's stated validation.

git checkout sa-submission-q2-2026
mkdir -p recipes/sglang/glm5
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/sglang/glm5" recipes/sglang/glm5
git checkout main

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: This switch to upstream main selectors breaks the job-name override further down in this script. Line 281 uses the raw $CONFIG_FILE:

sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "$CONFIG_FILE"

All 22 new glm5 GB300 entries in configs/nvidia-master.yaml now set CONFIG_FILE with a selector suffix (e.g. recipes/gb300-fp8/glm5.yaml:zip_override_8k1k_hightpt[0]), which is exported verbatim by benchmark-multinode-tmpl.yml. sed treats the whole string as a filename, fails with "No such file or directory", and since this script runs under set -exo pipefail (line 5), the launcher aborts before srtctl apply ever runs.

Why it matters: Every glm5-fp4-gb300 and glm5-fp8-gb300 sweep entry using an upstream selector will fail at launch. This wasn't caught by PR CI because bash -n and config generation don't exercise this path — only the b200 launcher previously received selector-suffixed CONFIG_FILEs, and it strips them (runners/launch_b200-dgxc.sh:248).

Fix: strip the selector before sed, same as the b200 launcher (keep the full $CONFIG_FILE in the srtctl apply -f args at line 290 — srtctl consumes the selector):

sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "${CONFIG_FILE%%:*}"

Fix this →

additional-settings:
- "CONFIG_FILE=recipes/sglang/dsr1/b200-fp4/1k1k/disagg/mtp/1k1k_mtp_lowlat_0.yaml"
- "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[0]"
decode:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The pre-existing sed at runners/launch_b200-dgxc.sh:252 uses s/^ max_attempts: [0-9]*/ max_attempts: 720/ which anchors exactly 2 leading spaces, but the newly-referenced upstream combined recipes (recipes/b200-fp8/glm5.yaml, referenced by the 15 new CONFIG_FILE entries at lines 10621-10806) nest max_attempts inside a top-level base: block at column 4, so the substitution silently no-ops and the health-check window reverts from the intended 720×10s=7200s back to the recipe default 360×10s=3600s for exactly the glm5-fp8-b200-dynamo-sglang sweeps this PR migrates. GLM-5-FP8 is a ~600 GB checkpoint served off shared Lustre — the 7200s bump was added specifically to prevent this class of timeout (see comment at lines 249-251). Fix is a one-character regex broadening: ^ *max_attempts: matches either indent depth.

Extended reasoning...

What the bug is

runners/launch_b200-dgxc.sh:252 runs (unchanged in this PR):

sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/' "${CONFIG_FILE%%:*}"

The regex anchors ^ + exactly two literal spaces + max_attempts:. That matched the deleted flat local recipes (e.g. benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/b200-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml), which had health_check at column 0 and max_attempts at column 2:

health_check:
  max_attempts: 360

This PR replaces the 15 glm5-fp8-b200-dynamo-sglang CONFIG_FILE entries (nvidia-master.yaml lines 10621, 10634, 10647, 10660, 10673, 10686, 10702, 10715, 10728, 10741, 10754, 10767, 10780, 10793, 10806) with upstream combined-recipe selectors recipes/b200-fp8/glm5.yaml:zip_override_*. The upstream file nests health_check inside a top-level base: block, so max_attempts sits at column 4, not column 2:

base:
  ...
  health_check:
    max_attempts: 360
    interval_seconds: 10

Why the code path triggers

The glm5-fp8-b200-dynamo-sglang top-level entry at nvidia-master.yaml:10600-10604 declares runner: b200-dgxc, which routes to runners/launch_b200-dgxc.sh. ${CONFIG_FILE%%:*} correctly strips the :zip_override_*[N] suffix and resolves to recipes/b200-fp8/glm5.yaml. The sed then runs against that file with the wrong anchor.

Step-by-step proof

  1. Reproduce the no-op against the new nesting:
$ printf 'base:\n  health_check:\n    max_attempts: 360\n' | \
    sed 's/^  max_attempts: [0-9]*/  max_attempts: 720/'
base:
  health_check:
    max_attempts: 360     # unchanged, exit 0
  1. Confirm the broadened regex fixes it:
$ printf 'base:\n  health_check:\n    max_attempts: 360\n' | \
    sed 's/^ *max_attempts: [0-9]*/  max_attempts: 720/'
base:
  health_check:
  max_attempts: 720
  1. sed exits 0 on non-match — no error surfaces to the launcher, and srtctl apply runs against the unmodified recipe.

Why existing code does not prevent it

The sibling name: sed at line 248 uses ^name:.* which happens to match the deleted flat recipes (name at column 0) but has an analogous issue for the upstream nested layout — that is out of scope here, but the pattern demonstrates that neither substitution was written with the base-block layout in mind.

Impact

The launcher comment at lines 249-251 says the 720 bump exists specifically because 3600s is not enough for large model loads off shared FS (calls out DSR1-FP8 ~680GB). GLM-5-FP8 is a ~600 GB checkpoint on the same shared Lustre tree (MODEL_PATH=/lustre/fsw/models/GLM-5-FP8 at launch_b200-dgxc.sh:46). Silently dropping the window from 7200s back to 3600s reintroduces exactly the cold-cache health-check timeout class the workaround was designed to prevent, on the exact sweeps this PR retargets at the upstream file. Because sed exits 0 on no-match, this is invisible until a sweep flakes at health-check exhaustion, at which point diagnosing "the 720 bump silently stopped applying" costs real GPU time.

How to fix

One-character regex broadening at runners/launch_b200-dgxc.sh:252:

-    sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/' "${CONFIG_FILE%%:*}"
+    sed -i 's/^ *max_attempts: [0-9]*/  max_attempts: 720/' "${CONFIG_FILE%%:*}"

This matches either column-2 (flat) or column-4 (nested-inside-base) layouts.

dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml"
- "CONFIG_FILE=recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 runners/launch_gb300-nv.sh:281 runs sed -i "…" "$CONFIG_FILE" without stripping the :zip_override_*[N] selector suffix. This PR routes many new gb300-nv glm5-fp{4,8}-gb300-dynamo-sglang CONFIG_FILE values (nvidia-master.yaml lines 11112, 11126, 11140, 11232, 11246, 11260, 11278, 11292, 11321, 11334, 11347, … through 11502) at upstream combined recipes carrying that suffix — sed then tries to open a literal path like recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0], fails, and under set -exo pipefail (line 5) aborts the launcher before srtctl apply runs. The sibling launch_b200-dgxc.sh:248,252 already does this correctly with "${CONFIG_FILE%%:*}" (and even carries a comment explaining why); the fix is to apply the same one-line ${CONFIG_FILE%%:*} treatment on launch_gb300-nv.sh:281.

Extended reasoning...

What the bug is

runners/launch_gb300-nv.sh:281 unconditionally invokes:

sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "$CONFIG_FILE"

passing $CONFIG_FILE verbatim. In this codebase, the CONFIG_FILE value materialised into the launcher's environment via additional-settings may carry an srtctl-style zip-override selector suffix of the form <path>.yaml:zip_override_<name>[<index>]. sed does not interpret that suffix — it will fopen(2) the literal string and fail with sed: can't read recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]: No such file or directory (exit status 2).

Because launch_gb300-nv.sh:5 declares set -exo pipefail, that non-zero exit terminates the launcher immediately — before the srtctl apply invocation on line 300 is reached.

The specific code path that triggers it

This PR migrates configs/nvidia-master.yaml's glm5-fp4-gb300-dynamo-sglang block (starting at line 11089, runner: gb300-nv) and glm5-fp8-gb300-dynamo-sglang block (starting at line 11303, also runner: gb300-nv) from flat per-topology recipes to upstream combined recipes with zip-override selectors. The new CONFIG_FILE values appear at nvidia-master.yaml:11112, 11126, 11140, 11232, 11246, 11260, 11278, 11292, 11321, 11334, 11347, … 11502 — 27+ configurations. All are routed through runner: gb300-nv, which per configs/runners.yaml maps to gb300-nv_0/1/2 and per .github/workflows/benchmark-multinode-tmpl.yml line 219 (bash ./runners/launch_${RUNNER_NAME%%_*}.sh) dispatches to launch_gb300-nv.sh.

Why existing code doesn't prevent it

The sibling launcher runners/launch_b200-dgxc.sh already handles this at lines 248 and 252 with the correct "${CONFIG_FILE%%:*}" bash parameter expansion to strip everything from the first : onward, and even carries an explanatory inline comment on line 251:

# Uses ${CONFIG_FILE%%:*} because CONFIG_FILE may carry an :override[N] suffix.

launch_gb300-nv.sh was never updated to apply the same treatment. This PR's diff of launch_gb300-nv.sh shows only clone/checkout/container-alias changes — line 281 remained unchanged, exposing this latent gap now that gb300-nv sweeps carry the selector.

Step-by-step proof

  1. A gb300-nv glm5-fp4-gb300-dynamo-sglang sweep resolves to CONFIG_FILE=recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0] (nvidia-master.yaml:11112).
  2. GHA runner dispatches bash ./runners/launch_gb300-nv.sh.
  3. Line 5 sets -exo pipefail.
  4. Line 281 executes sed -i "s/^name:.*/name: \"gb300-nv_0\"/" "recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]".
  5. sed calls fopen("recipes/gb300-fp4/glm5.yaml:zip_override_8k1k_hightpt[0]", …) → ENOENT (the actual file is recipes/gb300-fp4/glm5.yaml; the suffix is an srtctl selector, not a filesystem path). sed exits status 2.
  6. pipefail propagates the failure; the shell terminates before srtctl apply (line 300) is invoked. The job never launches.

Impact

Every gb300-nv glm5-fp4/fp8-gb300-dynamo-sglang sweep introduced or migrated by this PR (27+ configurations) hard-fails at launch. This directly contradicts the PR's stated goal of "keeping every remaining recipe referenced by an active configuration functional." The pre-merge validation listed in the PR description (generate_sweep_configs.py + pytest matrix_logic) does not exercise the launcher, so it doesn't catch this.

Fix

One-line change on runners/launch_gb300-nv.sh:281:

sed -i "s/^name:.*/name: \"${RUNNER_NAME}\"/" "${CONFIG_FILE%%:*}"

matching the pattern already established on launch_b200-dgxc.sh:248.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant