Fix autotune warm restart to retry failed schemes #1722
Error records (e.g. from connection loss or TensorRT build timeout) were permanently marked as profiled and skipped on resume. Now load_state() resets error schemes so they become unprofiled and get retried on warm restart. set_profile_region() also picks up partially-profiled patterns from prior sessions instead of starting fresh.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix autotune warm restart to retry failed schemes
Problem
When the ONNX autotuner encounters transient errors during profiling (e.g. connection loss to the TensorRT benchmark server, build timeouts), those schemes are marked with error=True. On warm restart, these error schemes are treated as "already profiled" (is_profiled returns True for error schemes), so they are permanently skipped. This means a single network hiccup can block an entire pattern from ever being fully explored.
Solution
Two changes in autotuner_base.py:
- load_state(): Reset all error schemes when loading from checkpoint: clear the error flag, reset latency_ms to inf, and clear profile_timestamp. This makes them "unprofiled" so they re-enter the profiling queue.
- set_profile_region(): Before seeding from pattern cache, check if there's an existing partially-profiled entry in profiled_patterns (i.e. one with unprofiled schemes from the reset above). If found, reuse it instead of starting fresh, so the previously-failed schemes get retried.
Testing
_is_region_profiled().
Files Changed
- modelopt/onnx/quantization/autotune/autotuner_base.py: load_state, reuse partial patterns in set_profile_region
- docs/source/guides/9_autotune.rst
- CHANGELOG.rst
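The two changes described under Solution can be illustrated with a minimal, self-contained sketch. It is not the code in autotuner_base.py: the Scheme and PatternSchemes classes, the is_profiled rule (errored or finite latency), the string timestamp, the "conv->relu" signature, and the helper names reset_error_schemes and find_partial_pattern are all hypothetical stand-ins chosen to mirror the behavior described above.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scheme:
    """Hypothetical stand-in for one profiled scheme record."""
    latency_ms: float = math.inf
    error: bool = False
    profile_timestamp: Optional[str] = None

    @property
    def is_profiled(self) -> bool:
        # Assumed rule: errored schemes and schemes with a finite
        # measured latency both count as "already profiled".
        return self.error or math.isfinite(self.latency_ms)


@dataclass
class PatternSchemes:
    """Hypothetical container for all schemes tried on one pattern."""
    signature: str
    schemes: list = field(default_factory=list)

    @property
    def has_unprofiled(self) -> bool:
        return any(not s.is_profiled for s in self.schemes)


def reset_error_schemes(patterns):
    """Mimic the load_state() change; return the number of schemes reset."""
    count = 0
    for pattern in patterns:
        for scheme in pattern.schemes:
            if scheme.error:
                scheme.error = False
                scheme.latency_ms = math.inf
                scheme.profile_timestamp = None
                count += 1
    return count


def find_partial_pattern(patterns, signature):
    """Mimic the set_profile_region() change: return an existing entry
    for this signature that still has unprofiled schemes, else None."""
    for pattern in patterns:
        if pattern.signature == signature and pattern.has_unprofiled:
            return pattern
    return None


# State as it might look after loading a checkpoint from a prior session.
profiled_patterns = [
    PatternSchemes("conv->relu", [
        Scheme(latency_ms=1.25, profile_timestamp="t0"),  # measured
        Scheme(error=True, profile_timestamp="t1"),       # transient failure
    ])
]

# Before the reset, the errored scheme makes the pattern look complete.
looked_complete = find_partial_pattern(profiled_patterns, "conv->relu") is None

num_reset = reset_error_schemes(profiled_patterns)
resumed = find_partial_pattern(profiled_patterns, "conv->relu")
```

In this sketch the successfully measured scheme keeps its latency, while the errored one returns to the unprofiled state, so the existing entry is picked up for further profiling rather than being replaced by a fresh one.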