Fix autotune warm restart to retry failed schemes #1722

Open: willg-nv wants to merge 1 commit into NVIDIA:main from willg-nv:fix-autoqdq-warm-restart

@willg-nv willg-nv commented Jun 15, 2026

Contributor

Fix autotune warm restart to retry failed schemes

Problem

When the ONNX autotuner hits a transient error during profiling (e.g. connection loss to the TensorRT benchmark server, or a build timeout), the affected scheme is marked with error=True. On warm restart, error schemes are treated as "already profiled" (is_profiled returns True for them), so they are permanently skipped. A single network hiccup can therefore prevent a pattern from ever being fully explored.
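The failure mode can be reproduced in isolation with a minimal sketch. The `Scheme` class and `pending` helper below are hypothetical stand-ins, not the code in autotuner_base.py; only the field names (error, latency_ms, profile_timestamp) and the is_profiled behavior are taken from the description above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scheme:
    """Hypothetical stand-in for one profiled scheme record."""

    latency_ms: float = math.inf
    error: bool = False
    profile_timestamp: Optional[str] = None

    @property
    def is_profiled(self) -> bool:
        # Assumption: a scheme counts as profiled if it has a finite
        # latency OR if it ended in an error (the behavior described above).
        return self.error or math.isfinite(self.latency_ms)


def pending(schemes):
    """Return the schemes that would still enter the profiling queue."""
    return [s for s in schemes if not s.is_profiled]


schemes = [
    Scheme(latency_ms=1.8, profile_timestamp="t0"),  # measured successfully
    Scheme(error=True),                              # failed on a connection loss
    Scheme(),                                        # never attempted
]

# Only the never-attempted scheme is queued; the failed one is skipped for good.
todo = pending(schemes)
```

With this definition the failed scheme never reappears in `todo`, no matter how many times the run is resumed.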

Solution

Two changes in autotuner_base.py:

  1. load_state(): Reset all error schemes when loading from checkpoint — clear error flag, reset latency_ms to inf, and clear profile_timestamp. This makes them "unprofiled" so they re-enter the profiling queue.

  2. set_profile_region(): Before seeding from pattern cache, check if there's an existing partially-profiled entry in profiled_patterns (i.e. one with unprofiled schemes from the reset above). If found, reuse it instead of starting fresh, so the previously-failed schemes get retried.
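The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual diff: `Scheme`, `PatternSchemes`, `reset_error_schemes`, and `take_partial_pattern` are hypothetical names, and the real load_state() and set_profile_region() methods carry more state than shown here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scheme:
    """Hypothetical scheme record (field names follow the description)."""

    latency_ms: float = math.inf
    error: bool = False
    profile_timestamp: Optional[str] = None

    @property
    def is_profiled(self) -> bool:
        return self.error or math.isfinite(self.latency_ms)


@dataclass
class PatternSchemes:
    """Hypothetical container: all schemes tried for one pattern."""

    signature: str
    schemes: List[Scheme] = field(default_factory=list)

    def has_unprofiled(self) -> bool:
        return any(not s.is_profiled for s in self.schemes)


def reset_error_schemes(patterns: List[PatternSchemes]) -> int:
    """Step 1: make every error scheme unprofiled again; return the count."""
    count = 0
    for pattern in patterns:
        for scheme in pattern.schemes:
            if scheme.error:
                scheme.error = False
                scheme.latency_ms = math.inf
                scheme.profile_timestamp = None
                count += 1
    return count


def take_partial_pattern(
    profiled_patterns: List[PatternSchemes], signature: str
) -> Optional[PatternSchemes]:
    """Step 2: pop an existing entry that still has unprofiled schemes."""
    for i, pattern in enumerate(profiled_patterns):
        if pattern.signature == signature and pattern.has_unprofiled():
            return profiled_patterns.pop(i)
    return None  # caller falls back to cache seeding or a fresh start


profiled_patterns = [
    PatternSchemes(
        signature="conv_relu",  # made-up pattern signature
        schemes=[
            Scheme(latency_ms=2.5, profile_timestamp="t0"),
            Scheme(error=True, profile_timestamp="t1"),
        ],
    )
]
num_reset = reset_error_schemes(profiled_patterns)
resumed = take_partial_pattern(profiled_patterns, "conv_relu")
```

Popping the entry, rather than copying it, keeps a single record per pattern while it is being re-profiled; the successful measurement is preserved and only the reset scheme is pending.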

Testing

  • Verified that a state file with error records loads correctly and reports the number of reset errors.
  • On warm restart, regions with reset-error schemes are no longer skipped by _is_region_profiled().
  • Patterns with a mix of successful + previously-failed schemes correctly resume profiling the failed ones.
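The checks above could be captured in a small regression test along these lines. It is only a sketch: the dict-based state layout, `reset_errors`, and `region_is_profiled` are assumptions made for illustration and do not mirror the real state file format or the real _is_region_profiled() signature.

```python
import math


def reset_errors(state: dict) -> int:
    """Reset error records in a loaded state dict; return how many were reset."""
    count = 0
    for pattern in state["patterns"]:
        for scheme in pattern["schemes"]:
            if scheme.get("error"):
                scheme["error"] = False
                scheme["latency_ms"] = math.inf
                scheme["profile_timestamp"] = None
                count += 1
    return count


def scheme_is_profiled(scheme: dict) -> bool:
    # Assumption: error records count as profiled, as described in "Problem".
    return bool(scheme.get("error")) or math.isfinite(scheme["latency_ms"])


def region_is_profiled(pattern: dict) -> bool:
    """Hypothetical analogue of _is_region_profiled(): nothing left to try."""
    return all(scheme_is_profiled(s) for s in pattern["schemes"])


def test_warm_restart_retries_failed_schemes():
    state = {
        "patterns": [
            {
                "schemes": [
                    {"latency_ms": 3.2, "error": False, "profile_timestamp": "t0"},
                    {"latency_ms": math.inf, "error": True, "profile_timestamp": "t1"},
                ]
            }
        ]
    }
    pattern = state["patterns"][0]

    # Before the reset, the error record makes the region look finished.
    assert region_is_profiled(pattern)

    # Loading resets exactly one error record ...
    assert reset_errors(state) == 1

    # ... so the region is no longer skipped, and only the failed scheme is pending.
    assert not region_is_profiled(pattern)
    pending = [s for s in pattern["schemes"] if not scheme_is_profiled(s)]
    assert pending == [pattern["schemes"][1]]
    assert pattern["schemes"][0]["latency_ms"] == 3.2


test_warm_restart_retries_failed_schemes()
```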

Files Changed

| File | Change |
| --- | --- |
| modelopt/onnx/quantization/autotune/autotuner_base.py | Core fix: reset errors in load_state, reuse partial patterns in set_profile_region |
| docs/source/guides/9_autotune.rst | Document error-retry behavior in "Resume" section and troubleshooting |
| CHANGELOG.rst | Add entry under 0.46 |

Summary by CodeRabbit

  • New Features

    • Warm restart now automatically retries schemes that previously failed due to transient errors (connection loss, TensorRT build timeouts) instead of permanently skipping them. No manual state file cleanup needed after infrastructure failures.
  • Documentation

    • Updated crash recovery and troubleshooting guidance to reflect automatic error recovery on resume.

Error records (e.g. from connection loss or TensorRT build timeout) were
permanently marked as profiled and skipped on resume. Now load_state()
resets error schemes so they become unprofiled and get retried on warm
restart. set_profile_region() also picks up partially-profiled patterns
from prior sessions instead of starting fresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@willg-nv willg-nv requested a review from a team as a code owner June 15, 2026 03:36
@willg-nv willg-nv requested a review from gcunhase June 15, 2026 03:36

copy-pr-bot Bot commented Jun 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 15, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 91c78a57-4362-4722-92f7-9669e6687805

📥 Commits

Reviewing files that changed from the base of the PR and between c4f39bd and 50a2ca4.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • docs/source/guides/9_autotune.rst
  • modelopt/onnx/quantization/autotune/autotuner_base.py

Walkthrough

QDQAutotunerBase.load_state() now resets error-flagged schemes to a retryable state (clearing the error flag, setting latency_ms to inf, nulling profile_timestamp) on warm restart. set_profile_region() detects partially profiled patterns with unprofiled schemes and reuses them for retry. Autotune docs and the 0.46 changelog entry document this new behavior.

Changes

ONNX Autotune warm restart error retry

| Layer / File(s) | Summary |
| --- | --- |
| Error reset and partial resume in autotuner_base: modelopt/onnx/quantization/autotune/autotuner_base.py | load_state() iterates loaded schemes and resets any with error=True to a retryable state (latency_ms=inf, no profile_timestamp, error cleared), tracking the count in the "Loaded state" log line. set_profile_region() adds logic to detect a previously profiled pattern that still has unprofiled schemes and pops it for retry, falling back to cache-seeding or a fresh start. |
| Docs and changelog: docs/source/guides/9_autotune.rst, CHANGELOG.rst | Autotune guide gains two notes (crash recovery section and troubleshooting section) stating transient-error schemes are automatically reset and retried on warm restart. CHANGELOG records the behavior for release 0.46. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fix autotune warm restart to retry failed schemes' is directly related to the main change in the changeset, which implements fixes to allow the ONNX autotuner to retry previously-failed schemes during warm restart. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns found. Code uses yaml.safe_load, has no eval/exec/pickle/torch.load with unsafe flags, no hardcoded trust_remote_code, no nosec comments, and adds no new dependencies. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.
