[https://nvbugs/6293712][fix] Patch GSM8K.EVALUATE_KWARGS with scores_filter="exact_match,strict-match"… by tensorrt-cicd · Pull Request #15346 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-14T04:05:26Z

Summary

Root cause: The GSM8K score averaged flexible-extract (44.66, where the regex picks up intermediate CoT numbers) with strict-match (75.13), pulling the mean to 59.89, below the 62.225 threshold, even though the model is healthy.
Fix: Patch GSM8K.EVALUATE_KWARGS with scores_filter="exact_match,strict-match" before evaluation, mirroring the precedent at test_mxfp4_gsm8k (line 1283) in the same file.
Automated fix generated by repair-bot
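
The arithmetic behind the root cause can be checked with a short sketch. It assumes the reported GSM8K score is an equal-weight mean of the two filter scores, which is how the summary describes it; the variable names are illustrative and not taken from the test harness.

```python
from statistics import mean

# Figures quoted in the PR summary.
flexible_extract = 44.66
strict_match = 75.13
threshold = 62.225

# Assumption: the harness reports an equal-weight mean of both filters.
averaged = mean([flexible_extract, strict_match])

print(f"averaged score: {averaged:.3f}")
print(f"averaged score passes threshold: {averaged >= threshold}")
print(f"strict-match alone passes threshold: {strict_match >= threshold}")
```

Under that assumption the averaged score falls short of the threshold while strict-match alone clears it, which matches the failure described above.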

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6293712

Summary by CodeRabbit

Tests

Enhanced accuracy evaluation for Nemotron Nano V3 model with stricter validation standards to ensure more rigorous quality assessment of model responses.

…atch only

Nano V3 is a chain-of-thought reasoning model. lm-eval's flexible-extract regex picks the last numeric span in the response (group_select=-1), which on Nano V3 frequently lands on an intermediate calculation in the long reasoning chain rather than the canonical '#### N' final-answer marker. Strict-match correctly extracts the final answer.

In the failing GB300 (SM103) NVFP4 1-GPU run, GSM8K strict-match was 75.13 (above the 65.428 reference) but flexible-extract was 44.66, dragging the mean to 59.89 and below the 62.225 threshold even though the model itself is producing high-quality answers. MMLU passes the same run with margin.

Mirror the existing pattern in the same file (TestGptOssBase.test_mxfp4_gsm8k, line 1283) and across test_llm_api_pytorch.py: patch GSM8K.EVALUATE_KWARGS to set scores_filter='exact_match,strict-match', so the GSM8K score uses the authoritative CoT filter only. Reference accuracy and threshold are unchanged.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
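
The difference between the two filters described in the commit message can be illustrated with a small sketch. The regexes below are simplified stand-ins chosen for illustration, not the exact patterns from lm-eval's task config, and the sample response is invented.

```python
import re

# Simplified stand-ins for the two filters (assumption: lm-eval's real
# patterns differ in detail; these only mimic the behavior described).
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_match(response: str):
    """Return the number that follows the '#### ' final-answer marker."""
    found = STRICT_PATTERN.search(response)
    return found.group(1) if found else None


def flexible_extract(response: str):
    """Return the last numeric span anywhere in the response."""
    numbers = NUMBER_PATTERN.findall(response)
    return numbers[-1] if numbers else None


# Invented response: the model keeps reasoning after the answer marker.
response = (
    "Janet has 16 - 3 - 4 = 9 eggs left. She earns 9 * 2 = 18 dollars.\n"
    "#### 18\n"
    "Sanity check: 16 - 7 = 9 eggs."
)

print("strict-match:", strict_match(response))
print("flexible-extract:", flexible_extract(response))
```

Here the strict filter returns the value after the marker, while the last-number filter returns a figure from the trailing sanity check, which is the failure mode the commit message attributes to long reasoning chains.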

coderabbitai · 2026-06-14T04:08:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4fb34aac-5b17-40e3-9829-635fb8f05cb9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f46653 and 11b09a5.

📒 Files selected for processing (1)

tests/integration/defs/accuracy/test_llm_api_autodeploy.py

📝 Walkthrough


TestNemotronNanoV3.test_accuracy gains a mocker fixture parameter and uses mocker.patch.dict to override GSM8K.EVALUATE_KWARGS, replacing the default scores_filter with exact_match,strict-match to enforce stricter scoring for this chain-of-thought model.
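
The patching pattern described in the walkthrough can be sketched with the standard library. This is a minimal illustration, not the code from the PR: the `GSM8K` class and `evaluate` function below are hypothetical stand-ins, the default value of `scores_filter` is assumed, and `unittest.mock.patch.dict` is used directly in place of pytest-mock's `mocker.patch.dict`, which wraps it.

```python
from unittest import mock


class GSM8K:
    """Hypothetical stand-in for the accuracy task class named in the PR."""

    # Assumption: the real default value is not shown in the PR text.
    EVALUATE_KWARGS = {"scores_filter": None}


def evaluate():
    """Hypothetical evaluation entry point that reads the class-level kwargs."""
    return GSM8K.EVALUATE_KWARGS["scores_filter"]


# Override one key for the duration of the block; patch.dict restores the
# original dictionary contents on exit, so other tests see the default.
with mock.patch.dict(GSM8K.EVALUATE_KWARGS,
                     {"scores_filter": "exact_match,strict-match"}):
    inside = evaluate()

after = evaluate()

print("inside patch:", inside)
print("after patch:", after)
```

With pytest-mock, the equivalent call inside a test would be `mocker.patch.dict(...)`, with the restore happening automatically at test teardown rather than at the end of a `with` block.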

Changes

NemotronNanoV3 GSM8K scoring override

Layer / File(s)	Summary
mocker fixture and GSM8K.EVALUATE_KWARGS patch `tests/integration/defs/accuracy/test_llm_api_autodeploy.py`	`test_accuracy` signature adds `mocker`; `mocker.patch.dict` overrides `GSM8K.EVALUATE_KWARGS` to set `scores_filter` to `exact_match,strict-match` for the duration of the test.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description covers the root cause, fix, and test plan, but lacks the required structured sections (Description, Test Coverage, PR Checklist) specified in the template.	Reorganize the description to follow the template structure with explicit Description, Test Coverage, and PR Checklist sections.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly describes the main change: patching GSM8K.EVALUATE_KWARGS with a specific scoring filter for the Nemotron Nano V3 test.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


tensorrt-cicd requested review from a team as code owners June 14, 2026 04:05

tensorrt-cicd requested a review from hnover-nv June 14, 2026 04:05

tensorrt-cicd assigned suyoggupta Jun 14, 2026

github-actions Bot assigned tensorrt-cicd Jun 14, 2026

tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6293712

2 participants
