Fix ROCm GPU arch detection: prefer torch device properties, robust hipInfo.exe lookup on Windows #1969
Closed
danielhanchen wants to merge 1 commit into
On Windows, get_rocm_gpu_arch() probed hipinfo.exe via PATH only. In practice hipInfo.exe is rarely on PATH: hosts without the HIP SDK do not have it there, and AMD's PyTorch wheels ship hipInfo.exe into the environment's Scripts directory, which is only on PATH when the venv is activated. The probe then raises FileNotFoundError, every import of bitsandbytes logs an ERROR + WARNING, and ROCM_GPU_ARCH silently degrades to unknown.

Read torch.cuda.get_device_properties(0).gcnArchName first (works on Linux and Windows, no subprocess); keep the rocminfo / hipInfo.exe parsing as a fallback, additionally trying hipInfo.exe next to python.exe on Windows before giving up.

Verified on gfx1151 (Strix Halo, Windows 11, torch 2.11.0+rocm7.13.0): previously unknown + ERROR; now gfx1151 via both the torch path and the forced subprocess fallback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
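As a rough illustration of the torch-first step described above, the following sketch reads `gcnArchName` from the device properties and strips feature-flag suffixes such as `gfx90a:sramecc+:xnack-`. The function names `normalize_gcn_arch` and `arch_from_torch` and the injectable `torch_module` parameter are hypothetical and are not taken from the bitsandbytes source; this is a sketch under those assumptions, not the PR's actual code.

```python
from typing import Any, Optional


def normalize_gcn_arch(raw: str) -> str:
    """Drop feature-flag suffixes, e.g. 'gfx90a:sramecc+:xnack-' -> 'gfx90a'."""
    return raw.split(":", 1)[0].strip()


def arch_from_torch(torch_module: Optional[Any] = None) -> Optional[str]:
    """Return the arch of device 0 as 'gfx...', or None if torch cannot tell.

    `torch_module` is a hypothetical injection point used for testing;
    when omitted, torch is imported lazily.
    """
    if torch_module is None:
        try:
            import torch as torch_module  # may be absent on this host
        except ImportError:
            return None
    try:
        if not torch_module.cuda.is_available():
            return None
        props = torch_module.cuda.get_device_properties(0)
        # Assumption: the attribute may be missing or may not hold a
        # 'gfx' name on non-ROCm builds, hence getattr and the check below.
        raw = getattr(props, "gcnArchName", None)
    except Exception:
        # Any failure here means "fall through to the subprocess probe".
        return None
    if not raw:
        return None
    arch = normalize_gcn_arch(raw)
    return arch if arch.startswith("gfx") else None
```

A caller would try `arch_from_torch()` first and only run the `rocminfo` / `hipInfo.exe` probe when it returns `None`.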
Author
Closing this for now.
Member
@danielhanchen This seems like a reasonable change - what's the reason for closing?
danielhanchen added a commit to unslothai/unsloth that referenced this pull request Jun 11, 2026
* Fix bitsandbytes ROCm GPU arch and warp size detection on Windows

bitsandbytes resolves the ROCm GPU architecture (and warp size on 0.49.x) by shelling out to rocminfo / hipinfo.exe via PATH at import time. On Windows neither tool is normally on PATH (AMD torch wheels ship hipInfo.exe into the venv Scripts dir, only on PATH while activated), so every `import bitsandbytes` logs an ERROR and WARNING, ROCM_GPU_ARCH degrades to unknown, and the 0.49.x warp size defaults to 64, which is wrong on RDNA (wave 32) and silently disables pre-quantized 4-bit models via ALLOW_PREQUANTIZED_MODELS.

Install a one-shot MetaPathFinder before unsloth_zoo is imported (the first bitsandbytes import on ROCm) that swaps get_rocm_gpu_arch and get_rocm_warpsize for torch-device-properties-first implementations right after bitsandbytes.cuda_specs executes, before cextension reads them. Falls back to running hipInfo.exe by absolute path (venv Scripts, conda Scripts, HIP SDK / AMD installer dirs). Repairs the constants in place when bitsandbytes was imported first.

Strict no-op on non-Windows, non-ROCm builds, missing bitsandbytes, and versions that fix this upstream. Opt out with UNSLOTH_DISABLE_BNB_ROCM_FIX=1.

Proposed upstream in bitsandbytes-foundation/bitsandbytes#1969; shipped here so all bitsandbytes versions are covered.

Verified on gfx1151 Strix Halo, Windows 11, torch 2.11.0+rocm7.13.0 against bitsandbytes main, 0.49.2, and a torch-props-fixed variant.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Tighten comments in the bitsandbytes ROCm detection fix

Comment and docstring pass only. AST comparison with docstrings stripped confirms every definition is identical to the version the 12 scenario suite ran against, and the suite plus the drift test pass unchanged on the edited files.

* Keep the bitsandbytes cuda_specs finder installed for reload support

Simulation testing caught a regression in the one-shot design: importlib.reload(bitsandbytes.cuda_specs) re-resolves the spec through sys.meta_path, so with the finder already removed the reload reinstalled the unpatched upstream detector and the Windows ROCm noise returned. Keep the finder on sys.meta_path permanently, matching the lifecycle of the existing causal_conv1d and vllm import blockers. The finder matches a single module name and patching stays idempotent via the sentinel flags, so repeat hits are no-ops.

Validated on gfx1151 Windows 11: 22 simulation scenarios (conda and embedded layouts, Program Files scan ordering, paths with spaces and unicode, hanging probe timeout, lru-wrapped and C-function helper shapes, reload, failed-import retry, threads, spawn, dormant finder, Studio PATH coexistence, early fix-block ordering, bnb 0.45.5 / 0.47.0 / 0.49.2 / main / upstream-fixed) plus the original 12 scenario suite, CPU-torch and stale-HIP_PATH sandboxes, Python 3.10 to 3.13 gates, and a WSL Linux leg proving byte-identical Linux behavior with and without the fix, with and without rocminfo on PATH.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
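The import-hook approach described in this commit message can be sketched in a few lines. The example below is a generic illustration and not the unsloth implementation: the class name `PostImportPatcher` and its `patch` callback are hypothetical. It matches a single module name, delegates spec resolution to the other finders on `sys.meta_path`, and wraps the loader's `exec_module` so the patch runs right after the module body executes. Because the finder stays installed, `importlib.reload` re-applies the patch, which is the behavior the last commit restores.

```python
import importlib.abc
import sys


class PostImportPatcher(importlib.abc.MetaPathFinder):
    """Hypothetical finder that patches one module right after it executes."""

    def __init__(self, target_name, patch):
        self.target_name = target_name
        self.patch = patch  # callable(module); should be safe to call repeatedly

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.target_name:
            return None  # not our module: let the normal machinery handle it

        # Ask the remaining finders for the real spec.
        spec = None
        for finder in sys.meta_path:
            if finder is self or not hasattr(finder, "find_spec"):
                continue
            spec = finder.find_spec(fullname, path, target)
            if spec is not None:
                break
        if spec is None or not hasattr(spec.loader, "exec_module"):
            return spec

        original_exec = spec.loader.exec_module

        def exec_then_patch(module):
            original_exec(module)  # run the unmodified module body first
            self.patch(module)     # then swap in the replacements

        # Shadow the bound method on this loader instance only.
        spec.loader.exec_module = exec_then_patch
        return spec


def install(target_name, patch):
    """Put the finder first so it sees the import before the default finders."""
    finder = PostImportPatcher(target_name, patch)
    sys.meta_path.insert(0, finder)
    return finder
```

In this sketch the patch is re-applied on every execution of the module, so idempotence is the responsibility of the `patch` callable; the commit message describes sentinel flags serving that purpose.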
Problem
On Windows ROCm setups, `get_rocm_gpu_arch()` probes `hipinfo.exe` with `subprocess.run(["hipinfo.exe"], ...)`, which resolves through `PATH` only. In practice `hipInfo.exe` is rarely reachable that way:

- Hosts without the HIP SDK do not have it on `PATH` at all.
- AMD's PyTorch wheels ship `hipInfo.exe` into the environment's `Scripts` directory (next to `python.exe`), which is only on `PATH` while the venv is activated, not when the interpreter is invoked directly, via `uv run`, or embedded.

The probe then raises `FileNotFoundError`, and every `import bitsandbytes` logs an ERROR and a WARNING, while `ROCM_GPU_ARCH` silently degrades to `"unknown"` even though the GPU works fine.

Fix
- Read `torch.cuda.get_device_properties(0).gcnArchName` first: torch already knows the architecture on both Linux and Windows, with no subprocess at all. Feature-flag suffixes (e.g. `gfx90a:sramecc+:xnack-`) are stripped to keep the existing `"gfx..."` format. This introduces no new device initialization: importing bitsandbytes already initializes the device context in `cextension.py` via `get_cuda_specs()` → `torch.cuda.get_device_capability()`.
- Keep the `rocminfo` / `hipInfo.exe` parsing as a fallback, and on Windows additionally try `hipInfo.exe` next to `python.exe` (where AMD's wheels place it) before giving up.

Validation
On a Strix Halo (gfx1151) Windows 11 machine with AMD's wheels (`torch 2.11.0+rocm7.13.0`), venv not activated, so `Scripts` is not on `PATH` and `shutil.which("hipinfo.exe")` is `None`:

- Previously: `ROCM_GPU_ARCH == "unknown"` plus the ERROR/WARNING above on import.
- Now: `gfx1151` with no log output; forcing the subprocess fallback (mocking `torch.cuda.is_available` to `False`) also returns `gfx1151` via the `Scripts`-relative `hipInfo.exe`.

Added two mocked unit tests that run on any backend. On the ROCm box:

All `pre-commit` hooks pass on the changed files.

Context
- `hipinfo.exe` probe (superseding "fix ROCm GPU architecture detection failed on windows" #1843 / "Update cuda_specs.py" #1833, which reported this same failure mode); "[ROCm] Windows workflow for creating wheels with ROCm 7.2.1 support" #1915 added the Windows ROCm wheels.
- `Scripts` directory to `PATH` before importing bitsandbytes ("Windows/WSL installer: fix winget msstore cert failure, amd-smi DiskPart prompt, and enable AMD GPU (Strix Halo gfx1151)" unslothai/unsloth#5940, commit f48fc9a). This PR fixes it at the source so downstream consumers don't need that.

🤖 Generated with Claude Code
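For readers who want to see the Windows fallback from the Fix section in code, here is a minimal sketch of the lookup order: `PATH` first, then the directory that contains the interpreter. The helper name `find_hipinfo` and its `executable` parameter are hypothetical and exist only for illustration; the actual change in this PR may be structured differently.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def find_hipinfo(executable: Optional[str] = None) -> Optional[str]:
    """Return a usable path to hipInfo.exe, or None if it cannot be found.

    `executable` defaults to sys.executable; it is a parameter only so the
    lookup can be exercised against a fake interpreter location.
    """
    # 1. Regular PATH lookup (works when the venv is activated).
    on_path = shutil.which("hipinfo.exe")
    if on_path:
        return on_path

    # 2. Next to the interpreter, where the PR says AMD's wheels place it.
    #    The path is deliberately not resolved, so a venv launcher is not
    #    followed back to the base installation.
    candidate = Path(executable or sys.executable).parent / "hipInfo.exe"
    return str(candidate) if candidate.is_file() else None
```

The returned absolute path can then be passed to `subprocess.run` in place of the bare `"hipinfo.exe"` name, so the probe no longer depends on `PATH`.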