[Klaud Cold] Align codeowner-signoff-verify with latest PR_REVIEW_CHECKLIST.md#2015
Conversation
Adds four checks to the sign-off verifier prompt to cover the checklist items introduced in #1956 and #2011 (plus previously unverified items): - Check 5: sign-off uses the latest PR_REVIEW_CHECKLIST.md template - Check 6: upstream vllm/lmsysorg Docker Hub images on established SKUs (H100/H200/B200/B300/GB200/GB300, MI300X/MI325X/MI355X), and vLLM/SGLang submissions before additional frameworks - Check 7: no benchmark hacks that change model architecture / cut FLOPs - Check 8: spec-decode configs benchmark through chat templates Also fixes the stale '(0-3)' check-range reference and updates the header summary and verdict rules to cover checks 0-8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| - (a) UPSTREAM IMAGE: for entries with `framework: vllm`, the image must come from the | ||
| upstream vLLM Docker Hub org (https://hub.docker.com/u/vllm) — in the master configs | ||
| that looks like `vllm/vllm-openai:<tag>` or `vllm/vllm-openai-rocm:<tag>`. For | ||
| `framework: sglang`, it must come from the upstream org | ||
| (https://hub.docker.com/u/lmsysorg) — `lmsysorg/sglang:<tag>` or | ||
| `lmsysorg/sglang-rocm:<tag>` (digest-pinned `@sha256:...` variants are fine). | ||
| Vendor/private forks — e.g. `rocm/sgl-dev`, `rocm/vllm-dev`, `ghcr.io#...`, or any | ||
| non-`vllm/`/non-`lmsysorg/` repo — FAIL on the established SKUs above unless the | ||
| sign-off's additional detail section documents a genuine exception: truly new | ||
| hardware (e.g. MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8) or a new model | ||
| architecture that upstream vLLM/SGLang does not fundamentally support yet (as | ||
| backed by vLLM/SGLang community maintainers). Name the offending image and the | ||
| missing justification when you FAIL. (Note: some entries write the registry | ||
| separator as `#`, e.g. `nvcr.io#nvidia/...` — treat `#` as `/`.) | ||
| - (b) ENGINE-FIRST ORDERING: if this PR adds a config entry for a NON-vLLM/SGLang | ||
| framework (`framework:` of `trtllm`, `atom`, dynamo variants, etc. — images like |
There was a problem hiding this comment.
🟡 Check 6's framework matching is too literal for actual master-config values. (a) Only exact framework: vllm/sglang matches, missing vllm-disagg/sglang-disagg/dynamo-vllm/dynamo-sglang — which is exactly where vendor forks like rocm/sgl-dev/rocm/vllm-dev currently live (see dsr1-fp8-mi355x-sglang-disagg at configs/amd-master.yaml:967, minimaxm3-fp4-mi355x-vllm-disagg at :2534). (b) Lists trtllm as a framework value, but the actual YAML value is trt (and dynamo-trt) — framework: trtllm appears nowhere. Also, categorizing dynamo variants as non-vLLM/SGLang in (b) contradicts (a)'s intent — a dynamo-vllm PR is still a vLLM submission. Recommend matching by substring (framework contains vllm/sglang) and naming the exact YAML values (trt, dynamo-trt, atom, vllm-disagg, etc.).
Extended reasoning...
What the check says
Check 6(a) at .github/workflows/codeowner-signoff-verify.yml:412-427 gates upstream-image usage on entries with framework: vllm or framework: sglang. The backtick-formatted literals plus the exact-equality wording tell the verifier to match those two strings.
Check 6(b) at :428-434 lists the NON-vLLM/SGLang bucket as framework: of trtllm, atom, dynamo variants, etc.
What the master configs actually contain
Grepping configs/*-master.yaml for framework: values:
- vLLM/SGLang variants used on established SKUs:
vllm,sglang,vllm-disagg,sglang-disagg,dynamo-vllm,dynamo-sglang - TensorRT-LLM:
trtanddynamo-trt—trtllmdoes not appear anywhere - ATOM:
atom,atom-disagg
Concrete examples of vendor-fork images on disagg vLLM/SGLang variants — precisely the class Check 6 was designed to gate:
configs/amd-master.yaml:967-973—dsr1-fp8-mi355x-sglang-disagg(framework:sglang-disagg) usesrocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2configs/amd-master.yaml:2534-2541—minimaxm3-fp4-mi355x-vllm-disagg(framework:vllm-disagg) usesrocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625configs/amd-master.yaml:1119-1127— anothersglang-disaggonrocm/sgl-dev
Why this matters
(a) — silent under-application. A literal reading of framework: vllm/framework: sglang exempts every disagg/dynamo variant from the upstream-image requirement. That means the gate silently passes exactly the configs it was designed to fail: MI355X sglang-disagg / vllm-disagg entries running rocm/sgl-dev / rocm/vllm-dev sail through Check 6(a) untouched.
(b) — wrong search term + self-contradiction. A verifier grepping framework: trtllm finds nothing and skips every actual TRT-LLM entry (the ~14 framework: trt entries in nvidia-master.yaml, plus dynamo-trt). Additionally, (b) puts dynamo variants in the non-vLLM/SGLang bucket, which is self-contradictory with (a): a dynamo-vllm PR is a vLLM submission, yet (b) would demand a plain-vllm entry come first.
Step-by-step proof (part a)
- A hypothetical PR modifies
configs/amd-master.yaml:2534-2541(minimaxm3-fp4-mi355x-vllm-disagg) and keeps the vendor-fork imagerocm/vllm-dev:.... - Check 6(a) asks the verifier: does this entry have
framework: vllm? No — it hasframework: vllm-disagg. Literal match fails. - Because the entry doesn't have
framework: vllm, the (a) rule doesn't apply. The vendor-fork image is not flagged. - Result: MI355X is an established SKU, the image is a vendor fork, no exception was documented — but the gate passes.
Step-by-step proof (part b)
- A hypothetical PR adds a TensorRT-LLM entry —
framework: trt(ordynamo-trt) — before any vLLM/SGLang entry for that model+SKU exists. - Check 6(b) tells the verifier to look for
framework: of trtllm, atom, dynamo variants, etc. - A verifier greps
framework: trtllmin the diff / master configs → 0 matches. - Result: TRT-LLM entry is not detected as non-vLLM/SGLang, engine-first ordering is not enforced.
Why this is a nit, not a blocker
This is a natural-language prompt for an LLM verifier that has interpretive latitude. A competent auditor reading the whole check — including the vendor-fork image examples (rocm/sgl-dev, rocm/vllm-dev) and the informal trtllm label — will likely still catch the obvious cases. The list in (b) is explicitly non-exhaustive (etc.), and (b) also supplies image-name identifiers (nvcr.io...tensorrt-llm...) that independently point at the right entries. So the gate keeps working in the common case; it's the corner cases (disagg variants using vendor forks, first-time TRT-LLM ordering) where a literal reading misses.
Suggested fix
- (a): change wording from equality to substring — "for entries whose
frameworkcontainsvllmorsglang(vllm,vllm-disagg,dynamo-vllm,sglang,sglang-disagg,dynamo-sglang)". - (b): replace
trtllmwithtrt/dynamo-trt, and explicitly excludedynamo-vllm/dynamo-sglangfrom the non-vLLM/SGLang bucket. Something like: "framework values that are NOT vLLM/SGLang engines —trt,dynamo-trt,atom,atom-disagg, etc."
| - PASS if every item in the current template has a corresponding checked (`[x]`) item | ||
| in the sign-off. Match items semantically — minor wording drift is fine; a missing | ||
| ITEM is not. | ||
| - FAIL if the sign-off is missing current-template items (stale copy) or left items | ||
| unchecked without an explanation in the additional detail section. Name the | ||
| missing/unchecked items and link the current template. |
There was a problem hiding this comment.
🟡 Check 5's PASS/FAIL rule doesn't account for the conditional escape-hatch item on line 18 of docs/PR_REVIEW_CHECKLIST.md (If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.). On a perfectly clean sign-off, that box is intentionally left unchecked with no additional-detail entry — which literally matches Check 5's FAIL condition ("left items unchecked without an explanation in the additional detail section"), inverting its semantics. Suggest exempting the conditional item, e.g. "the escape-hatch If any of the above criteria cannot reasonably be satisfied line stays unchecked unless a genuine exception is documented — treat it as PASS in that case".
Extended reasoning...
What the bug is. Check 5 asks the verifier to compare the sign-off body against the current docs/PR_REVIEW_CHECKLIST.md template. Its rules are:
- PASS if every item has a corresponding
[x]in the sign-off. - FAIL if items are missing (stale copy) or "left items unchecked without an explanation in the additional detail section."
Every item in the template is a verification affirmation ("I have verified X") — except line 18, which is inherently conditional:
- [ ] If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.
Its own wording says "IF any of the above criteria cannot reasonably be satisfied" — the reviewer is meant to tick it only when they need to note an exception, and the accompanying reasoning belongs in the additional-detail section.
Why the check fails on the happy path. On a perfectly clean sign-off — every criterion satisfied, no exception needed — line 18 is legitimately unchecked and the additional-detail section is empty (or contains only non-exception info like the recipe link). That is a literal three-of-three match for Check 5's FAIL rule: item exists in template, item unchecked, no explanation in additional-detail. So Check 5 would FAIL precisely the state that is most correct.
Why existing wording doesn't rescue it. Check 5 does say "Match items semantically — minor wording drift is fine," which gives the LLM auditor some latitude. But the FAIL clause is written mechanically ("left items unchecked without an explanation"), and a strict-reading verifier acting on that rule will trip. The safer fix is to remove the ambiguity, not rely on interpretive charity.
Step-by-step proof (happy-path sign-off):
- Reviewer submits a sign-off where every substantive item (lines 8-17 in the template) is
[x]. - Reviewer leaves line 18 (
- [ ] If any of the above criteria cannot reasonably be satisfied ...) unchecked because there are no exceptions to explain. - The additional-detail section contains only the required recipe link — nothing addressing line 18.
- Check 5 runs: item 18 is in the current template ✓, item 18 is left unchecked ✓, there is no explanation for item 18 in the additional-detail section ✓ — the FAIL rule literally matches.
- The verifier reports Check 5 as FAIL, blocking merge on a correct sign-off. Reviewer's only workarounds are (a) tick a box whose text is a false claim ("criteria cannot reasonably be satisfied") or (b) add filler text to the additional-detail section. Both are worse than passing.
Impact. Nit-level, but pointed at the common case: false FAILs will land on precisely the sign-offs that are most correct. The verifier is Claude Fable-5, which will often reason semantically and pass anyway, so this won't fire every time — but the rule text invites it to fire, and even one spurious FAIL wastes a merge cycle. Nothing about merging as-is is unsafe.
How to fix. One clause in Check 5, e.g.:
The conditional item ("If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.") is an escape-hatch — it should be checked ONLY when a genuine exception is documented in the additional-detail section. Leaving it unchecked with no exception is expected and PASSes.
| - (a) UPSTREAM IMAGE: for entries with `framework: vllm`, the image must come from the | ||
| upstream vLLM Docker Hub org (https://hub.docker.com/u/vllm) — in the master configs | ||
| that looks like `vllm/vllm-openai:<tag>` or `vllm/vllm-openai-rocm:<tag>`. For | ||
| `framework: sglang`, it must come from the upstream org | ||
| (https://hub.docker.com/u/lmsysorg) — `lmsysorg/sglang:<tag>` or | ||
| `lmsysorg/sglang-rocm:<tag>` (digest-pinned `@sha256:...` variants are fine). | ||
| Vendor/private forks — e.g. `rocm/sgl-dev`, `rocm/vllm-dev`, `ghcr.io#...`, or any | ||
| non-`vllm/`/non-`lmsysorg/` repo — FAIL on the established SKUs above unless the | ||
| sign-off's additional detail section documents a genuine exception: truly new | ||
| hardware (e.g. MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8) or a new model | ||
| architecture that upstream vLLM/SGLang does not fundamentally support yet (as | ||
| backed by vLLM/SGLang community maintainers). Name the offending image and the | ||
| missing justification when you FAIL. (Note: some entries write the registry | ||
| separator as `#`, e.g. `nvcr.io#nvidia/...` — treat `#` as `/`.) | ||
| - (b) ENGINE-FIRST ORDERING: if this PR adds a config entry for a NON-vLLM/SGLang | ||
| framework (`framework:` of `trtllm`, `atom`, dynamo variants, etc. — images like |
There was a problem hiding this comment.
🟡 Check 6's framework matching uses literal equality that misses real config values: (a) the upstream-image gate at lines 415-419 targets only framework: vllm / framework: sglang, silently exempting vllm-disagg, sglang-disagg, dynamo-vllm, and dynamo-sglang entries — which is exactly where vendor forks like rocm/sgl-dev and rocm/vllm-dev currently appear on established MI355X SKUs (e.g. dsr1-fp8-mi355x-sglang-disagg at configs/amd-master.yaml:967, minimaxm3-fp4-mi355x-vllm-disagg at :2534); (b) the engine-first ordering bullet at line 434 lists trtllm as a framework value, but grep confirms only trt (and dynamo-trt) appears in the master configs — trtllm matches nothing. Suggest matching by substring in (a) (framework contains vllm/sglang), and naming the exact YAML values (trt, dynamo-trt, atom, dynamo-* excluding dynamo-vllm/dynamo-sglang) in (b).
Extended reasoning...
Bug
Check 6 tries to gate two behaviors by literal framework-string matching, but the string set is out of sync with what the master configs actually contain.
Part (a) — upstream-image gate (line 415-419) misses -disagg / dynamo- variants.* The check text reads "for entries with framework: vllm, the image must come from the upstream vLLM Docker Hub org" and similarly for framework: sglang. The backtick code-fence formatting reads as an exact string match. But configs/amd-master.yaml and configs/nvidia-master.yaml also use compound framework values that are still fundamentally vLLM/SGLang engines: vllm-disagg, sglang-disagg, dynamo-vllm, dynamo-sglang.
Crucially, these are the exact entries currently using vendor forks — the class of entries Check 6 was written to catch:
configs/amd-master.yaml:967-973—dsr1-fp8-mi355x-sglang-disaggusesframework: sglang-disagg+image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2configs/amd-master.yaml:1119-1127— anothersglang-disagg+rocm/sgl-devon MI355Xconfigs/amd-master.yaml:2534-2541—minimaxm3-fp4-mi355x-vllm-disaggusesframework: vllm-disagg+image: rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625
All three are on ESTABLISHED SKUs (MI355X is in the check's own established-hardware list). A verifier taking the check text at face value would search for framework: vllm / framework: sglang, not find these entries, and silently pass the exact configurations the gate was designed to fail.
Part (b) — engine-first ordering (line 434) references framework: trtllm, which does not exist. The bullet lists "framework: of trtllm, atom, dynamo variants, etc." as the NON-vLLM/SGLang bucket. But grep across configs/*-master.yaml confirms the actual YAML value for TensorRT-LLM entries is framework: trt (~14 entries in nvidia-master.yaml, e.g. lines 1793, 1814, 1885, 3247, 4429) and framework: dynamo-trt — there is no framework: trtllm anywhere in the configs.
Part (b) is also self-contradictory with (a). The bullet lumps "dynamo variants" into the NON-vLLM/SGLang bucket, so a dynamo-vllm PR would be flagged as needing a plain-vllm entry to land first — even though the PR IS a vLLM submission (just packaged as Dynamo). Meanwhile (a) exempts it from the upstream-image gate for the opposite reason.
Why existing code doesn't prevent it
This check is a natural-language prompt for an LLM auditor, not code. The LLM has reasoning latitude and will likely match trtllm → TensorRT-LLM → trt semantically, and may notice the vendor-fork example images (rocm/sgl-dev, rocm/vllm-dev) in the prompt itself and connect them to disagg entries. But the wording is ambiguous enough that a strict reading is plausible, and (b)'s explicit categorization of dynamo variants as non-vLLM/SGLang is stated outright, not merely implied.
Impact
Soft check under-applied. A verifier that reads (a) strictly would pass entries using rocm/sgl-dev / rocm/vllm-dev on MI355X disagg configs — the exact class of entries the check exists to gate. And (b) as written would demand a plain-vLLM submission come first for a dynamo-vLLM PR that IS a vLLM submission.
Step-by-step proof
Consider a hypothetical PR that adds a new MI355X sglang-disagg config with image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2 (matching the pattern already at configs/amd-master.yaml:967-973):
- The verifier reads Check 6(a): "for entries with
framework: vllm, the image must come from the upstream vLLM Docker Hub org ... Forframework: sglang, it must come from the upstream org..." - The verifier reads the new config entry:
framework: sglang-disagg. - A strict reading of the check finds no match: the entry's framework field is not
sglangand notvllm. - Check 6(a) is reported as N/A or PASS (no matching entries).
- The vendor-fork image on an established SKU is silently accepted.
Compare to the intended behavior: this is precisely the gate's purpose. The MI355X is in the check's own established-hardware list, sglang-disagg is a vLLM/SGLang engine, and rocm/sgl-dev is a vendor fork.
Fix
- In (a), change the wording to match by substring/prefix — e.g. "for entries whose framework contains
vllmorsglang(i.e.vllm,vllm-disagg,dynamo-vllm,sglang,sglang-disagg,dynamo-sglang), the image must come from...". - In (b), name the exact YAML values (
framework:oftrt,dynamo-trt,atom,dynamo-*variants excludingdynamo-vllm/dynamo-sglang) and drop the non-existenttrtllm.
Summary
Aligns the CODEOWNER sign-off verifier with the latest PR_REVIEW_CHECKLIST.md (as revised in #1956 and #2011) by adding four checks to the verification prompt:
docs/PR_REVIEW_CHECKLIST.mdtemplate and fails on stale/missing items.framework: vllm/framework: sglangmaster-config entries on established hardware (NVIDIA H100/H200/B200/B300/GB200/GB300, AMD MI300X/MI325X/MI355X), the image must come from the upstream Docker Hub orgs —vllm(vllm/vllm-openai,vllm/vllm-openai-rocm) orlmsysorg(lmsysorg/sglang,lmsysorg/sglang-rocm) — with vendor forks (rocm/sgl-dev,rocm/vllm-dev, ...) failing unless a new-hardware/new-model-architecture exception is documented in the sign-off. Also verifies vLLM/SGLang submissions exist before additional frameworks (TRT-LLM, ATOM, ...) for the same model+SKU.--hf-overrideslayer/indexer skipping etc.); same-computation speedups and lower-precision FLOPs remain fair game.Also fixes the stale "(0-3)" check-range reference, and updates the header docs and verdict rules to cover checks 0-8 (with explicit
N/A — <reason>rows for non-applicable checks).Image-repo shapes verified against current
configs/amd-master.yaml/configs/nvidia-master.yamlusage.Test plan
workflow_dispatch) and confirm the posted verdict includes the new checks.🤖 Generated with Claude Code