Skip to content

[https://nvbugs/6293712][fix] Patch GSM8K.EVALUATE_KWARGS with scores_filter="exact_match,strict-match"…#15346

Open
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6293712
Open

[https://nvbugs/6293712][fix] Patch GSM8K.EVALUATE_KWARGS with scores_filter="exact_match,strict-match"…#15346
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6293712

Conversation

@tensorrt-cicd

@tensorrt-cicd tensorrt-cicd commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: GSM8K score averaged flexible-extract (44.66, regex picks intermediate CoT numbers) with strict-match (75.13), pulling mean to 59.89 below threshold 62.225 even though model is healthy.
  • Fix: Patch GSM8K.EVALUATE_KWARGS with scores_filter="exact_match,strict-match" before evaluation, mirroring the precedent at test_mxfp4_gsm8k line 1283 in the same file.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

Tests

  • Enhanced accuracy evaluation for Nemotron Nano V3 model with stricter validation standards to ensure more rigorous quality assessment of model responses.

…atch only

Nano V3 is a chain-of-thought reasoning model. lm-eval's flexible-extract
regex picks the last numeric span in the response (group_select=-1), which
on Nano V3 frequently lands on an intermediate calculation in the long
reasoning chain rather than the canonical '#### N' final-answer marker.
Strict-match correctly extracts the final answer.

In the failing GB300 (SM103) NVFP4 1-GPU run, GSM8K strict-match was 75.13
(above the 65.428 reference) but flexible-extract was 44.66, dragging the
mean to 59.89 and below the 62.225 threshold even though the model itself
is producing high-quality answers. MMLU passes the same run with margin.

Mirror the existing pattern in the same file (TestGptOssBase.test_mxfp4_gsm8k,
line 1283) and across test_llm_api_pytorch.py: patch GSM8K.EVALUATE_KWARGS
to set scores_filter='exact_match,strict-match', so the GSM8K score uses
the authoritative CoT filter only. Reference accuracy and threshold are
unchanged.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4fb34aac-5b17-40e3-9829-635fb8f05cb9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f46653 and 11b09a5.

📒 Files selected for processing (1)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py

📝 Walkthrough

Walkthrough

TestNemotronNanoV3.test_accuracy gains a mocker fixture parameter and uses mocker.patch.dict to override GSM8K.EVALUATE_KWARGS, replacing the default scores_filter with exact_match,strict-match to enforce stricter scoring for this chain-of-thought model.

Changes

NemotronNanoV3 GSM8K scoring override

Layer / File(s) Summary
mocker fixture and GSM8K.EVALUATE_KWARGS patch
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
test_accuracy signature adds mocker; mocker.patch.dict overrides GSM8K.EVALUATE_KWARGS to set scores_filter to exact_match,strict-match for the duration of the test.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description covers the root cause, fix, and test plan, but lacks the required structured sections (Description, Test Coverage, PR Checklist) specified in the template. Reorganize the description to follow the template structure with explicit Description, Test Coverage, and PR Checklist sections.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly describes the main change: patching GSM8K.EVALUATE_KWARGS with a specific scoring filter for the Nemotron Nano V3 test.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants