Fix autotune warm restart to retry failed schemes by willg-nv · Pull Request #1722 · NVIDIA/Model-Optimizer

willg-nv · 2026-06-15T03:36:05Z

Fix autotune warm restart to retry failed schemes

Problem

When the ONNX autotuner encounters transient errors during profiling (e.g. connection loss to the TensorRT benchmark server, build timeouts), those schemes are marked with error=True. On warm restart, these error schemes are treated as "already profiled" (is_profiled returns True for error schemes), so they are permanently skipped. This means a single network hiccup can block an entire pattern from ever being fully explored.
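The skip behavior described above can be illustrated with a minimal sketch. The `SchemeSketch` class, the `pending` helper, and the exact `is_profiled` rule are assumptions made for illustration; only the field names `error`, `latency_ms`, and `profile_timestamp` come from the description, and the real class in autotuner_base.py may differ.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchemeSketch:
    """Hypothetical stand-in for an autotuner scheme record."""

    latency_ms: float = math.inf
    error: bool = False
    profile_timestamp: Optional[str] = None

    @property
    def is_profiled(self) -> bool:
        # Assumed rule, per the description: an errored scheme counts as profiled.
        return self.error or math.isfinite(self.latency_ms)


def pending(schemes):
    """Return the schemes a resumed run would still try to profile."""
    return [s for s in schemes if not s.is_profiled]


schemes = [
    SchemeSketch(latency_ms=1.8, profile_timestamp="2026-06-14T10:00:00Z"),  # measured
    SchemeSketch(error=True),  # e.g. benchmark server connection dropped
    SchemeSketch(),            # never attempted
]

# Only the never-attempted scheme is queued; the errored one is silently skipped.
print(len(pending(schemes)))
```

Under this assumed rule the errored scheme never returns to the queue, which is the permanent skip the PR sets out to remove.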

Solution

Two changes in autotuner_base.py:

- load_state(): Reset all error schemes when loading from a checkpoint: clear the error flag, reset latency_ms to inf, and clear profile_timestamp. This makes them "unprofiled" so they re-enter the profiling queue.
- set_profile_region(): Before seeding from the pattern cache, check whether profiled_patterns already holds a partially profiled entry (i.e. one with unprofiled schemes from the reset above). If so, reuse it instead of starting fresh, so the previously failed schemes get retried.
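The two changes can be sketched roughly as follows. This is not the code from the PR: `Scheme`, `PatternSchemes`, `reset_error_schemes`, and `take_partial_pattern` are hypothetical names, and the pattern signature string is made up. The sketch only mirrors the behavior stated above (reset on load, then reuse a partially profiled entry).

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scheme:
    """Hypothetical scheme record; field names follow the PR description."""

    latency_ms: float = math.inf
    error: bool = False
    profile_timestamp: Optional[str] = None

    @property
    def is_profiled(self) -> bool:
        # Assumed rule: errored or measured schemes count as profiled.
        return self.error or math.isfinite(self.latency_ms)


@dataclass
class PatternSchemes:
    """Hypothetical container for all schemes of one pattern."""

    pattern_signature: str
    schemes: List[Scheme] = field(default_factory=list)

    def has_unprofiled(self) -> bool:
        return any(not s.is_profiled for s in self.schemes)


def reset_error_schemes(patterns: List[PatternSchemes]) -> int:
    """Change 1: turn errored schemes back into unprofiled ones; return the count."""
    count = 0
    for pattern in patterns:
        for scheme in pattern.schemes:
            if scheme.error:
                scheme.error = False
                scheme.latency_ms = math.inf
                scheme.profile_timestamp = None
                count += 1
    return count


def take_partial_pattern(
    profiled_patterns: List[PatternSchemes], signature: str
) -> Optional[PatternSchemes]:
    """Change 2: pop an existing entry that still has unprofiled schemes, if any."""
    for i, pattern in enumerate(profiled_patterns):
        if pattern.pattern_signature == signature and pattern.has_unprofiled():
            return profiled_patterns.pop(i)
    return None  # caller falls back to cache seeding or a fresh start


profiled_patterns = [
    PatternSchemes("conv->relu", [Scheme(latency_ms=1.8), Scheme(error=True)])
]
n_reset = reset_error_schemes(profiled_patterns)
resumed = take_partial_pattern(profiled_patterns, "conv->relu")
```

Popping the entry rather than copying it keeps a single record per pattern, so results gathered during the retry presumably land next to the measurements that already succeeded.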

Testing

- Verified that a state file with error records loads correctly and reports the number of reset errors.
- On warm restart, regions with reset-error schemes are no longer skipped by _is_region_profiled().
- Patterns with a mix of successful and previously failed schemes correctly resume profiling the failed ones.
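For readers who want to reproduce these checks without a TensorRT setup, the three observations can be written as assertions over plain dict records. The state layout (`patterns`, `schemes`, `signature`) and the helper names are assumptions for illustration, not the project's actual state file schema or test suite.

```python
import math


def reset_errors(state):
    """Reset errored scheme records in a deserialized state dict; return the count."""
    n = 0
    for pattern in state["patterns"]:
        for rec in pattern["schemes"]:
            if rec.get("error"):
                rec.update(error=False, latency_ms=math.inf, profile_timestamp=None)
                n += 1
    return n


def is_pattern_done(pattern):
    """Assumed rule: done when every scheme has a finite latency or an error flag."""
    return all(
        rec.get("error") or math.isfinite(rec["latency_ms"])
        for rec in pattern["schemes"]
    )


def check_warm_restart():
    state = {"patterns": [{"signature": "conv->relu", "schemes": [
        {"latency_ms": 1.8, "error": False, "profile_timestamp": "t0"},
        {"latency_ms": math.inf, "error": True, "profile_timestamp": "t1"},
    ]}]}
    pattern = state["patterns"][0]

    assert is_pattern_done(pattern)      # old behavior: pattern skipped on resume
    assert reset_errors(state) == 1      # the number of reset errors is reported
    assert not is_pattern_done(pattern)  # pattern is no longer skipped

    retry = [r for r in pattern["schemes"] if not math.isfinite(r["latency_ms"])]
    assert len(retry) == 1               # only the failed scheme is retried
    assert pattern["schemes"][0]["latency_ms"] == 1.8  # successful result is kept
    return len(retry)


check_warm_restart()
```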

Files Changed

File	Change
`modelopt/onnx/quantization/autotune/autotuner_base.py`	Core fix: reset errors in `load_state`, reuse partial patterns in `set_profile_region`
`docs/source/guides/9_autotune.rst`	Document error-retry behavior in "Resume" section and troubleshooting
`CHANGELOG.rst`	Add entry under 0.46

Summary by CodeRabbit

New Features
- Warm restart now automatically retries schemes that previously failed due to transient errors (connection loss, TensorRT build timeouts) instead of permanently skipping them. No manual state file cleanup needed after infrastructure failures.
Documentation
- Updated crash recovery and troubleshooting guidance to reflect automatic error recovery on resume.

Error records (e.g. from connection loss or TensorRT build timeout) were permanently marked as profiled and skipped on resume. Now load_state() resets error schemes so they become unprofiled and get retried on warm restart. set_profile_region() also picks up partially-profiled patterns from prior sessions instead of starting fresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

copy-pr-bot · 2026-06-15T03:36:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-15T03:36:17Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 91c78a57-4362-4722-92f7-9669e6687805

📥 Commits

Reviewing files that changed from the base of the PR and between c4f39bd and 50a2ca4.

📒 Files selected for processing (3)

CHANGELOG.rst
docs/source/guides/9_autotune.rst
modelopt/onnx/quantization/autotune/autotuner_base.py

📝 Walkthrough

QDQAutotunerBase.load_state() now resets error-flagged schemes to a retryable state (clearing the error flag, setting latency_ms to inf, nulling profile_timestamp) on warm restart. set_profile_region() detects partially profiled patterns with unprofiled schemes and reuses them for retry. Autotune docs and the 0.46 changelog entry document this new behavior.

Changes

ONNX Autotune warm restart error retry

Layer / File(s)	Summary
Error reset and partial resume in autotuner_base `modelopt/onnx/quantization/autotune/autotuner_base.py`	`load_state()` iterates loaded schemes and resets any with `error=True` to a retryable state (`latency_ms=inf`, no `profile_timestamp`, error cleared), tracking the count in the "Loaded state" log line. `set_profile_region()` adds logic to detect a previously profiled pattern that still has unprofiled schemes and pops it for retry, falling back to cache-seeding or a fresh start.
Docs and changelog `docs/source/guides/9_autotune.rst`, `CHANGELOG.rst`	Autotune guide gains two notes (crash recovery section and troubleshooting section) stating transient-error schemes are automatically reset and retried on warm restart. CHANGELOG records the behavior for release 0.46.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix autotune warm restart to retry failed schemes' is directly related to the main change in the changeset, which implements fixes to allow the ONNX autotuner to retry previously-failed schemes during warm restart.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No security anti-patterns found. Code uses yaml.safe_load, has no eval/exec/pickle/torch.load with unsafe flags, no hardcoded trust_remote_code, no nosec comments, and adds no new dependencies.


willg-nv requested a review from a team as a code owner June 15, 2026 03:36

willg-nv requested a review from gcunhase June 15, 2026 03:36

coderabbitai Bot approved these changes Jun 15, 2026

willg-nv wants to merge 1 commit into NVIDIA:main from willg-nv:fix-autoqdq-warm-restart
