Fix ROCm GPU arch detection: prefer torch device properties, robust hipInfo.exe lookup on Windows #1969

Closed
danielhanchen wants to merge 1 commit into bitsandbytes-foundation:main from danielhanchen:fix/windows-rocm-arch-detection

@danielhanchen

Problem

On Windows ROCm setups, get_rocm_gpu_arch() probes hipinfo.exe with subprocess.run(["hipinfo.exe"], ...), which resolves through PATH only. In practice hipInfo.exe is rarely reachable that way:

  • Hosts without the HIP SDK don't have it on PATH at all.
  • AMD's PyTorch wheels for Windows ship hipInfo.exe into the environment's Scripts directory (next to python.exe), which is on PATH only while the venv is activated, not when the interpreter is invoked directly, via uv run, or embedded.

The probe then raises FileNotFoundError, and every import bitsandbytes logs:

ERROR:bitsandbytes.cuda_specs:Could not detect ROCm GPU architecture: [WinError 2] The system cannot find the file specified
WARNING:bitsandbytes.cuda_specs:
ROCm GPU architecture detection failed despite ROCm being available.

while ROCM_GPU_ARCH silently degrades to "unknown" — even though the GPU works fine.
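
The failure mode can be reproduced without any GPU. The sketch below is illustrative only: probe_tool is a hypothetical helper, not bitsandbytes code, and the executable name in the usage note is deliberately one that does not exist. It shows that subprocess.run with a bare executable name raises FileNotFoundError when PATH lookup fails, and how a caller can turn that into a soft "no result" instead of an error.

```python
import subprocess
from typing import Optional


def probe_tool(executable: str, timeout: float = 10.0) -> Optional[str]:
    """Run a tool by bare name and return its stdout, or None if it cannot run.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    try:
        result = subprocess.run(
            [executable], capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        # A bare name is resolved through PATH only; a miss surfaces here
        # (on Windows as "[WinError 2] The system cannot find the file specified").
        return None
    except subprocess.TimeoutExpired:
        # A hanging probe is treated the same as a missing one.
        return None
    return result.stdout
```

Calling probe_tool("hypothetical-missing-tool-xyz.exe") returns None rather than raising, which lets a caller move on to the next detection strategy.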

Fix

  1. Prefer torch.cuda.get_device_properties(0).gcnArchName: torch already knows the architecture on both Linux and Windows, with no subprocess at all. Feature-flag suffixes (e.g. gfx90a:sramecc+:xnack-) are stripped to keep the existing "gfx..." format. This introduces no new device initialization: importing bitsandbytes already initializes the device context in cextension.py via get_cuda_specs(), which calls torch.cuda.get_device_capability().
  2. Keep the rocminfo / hipInfo.exe parsing as a fallback, and on Windows additionally try hipInfo.exe next to python.exe (where AMD's wheels place it) before giving up.
  3. The ERROR/WARNING logging is preserved for genuine failures; behavior on non-ROCm builds is unchanged.
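
The ordering described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: normalize_gcn_arch, detect_arch, and the two probe callables are hypothetical names. In a real ROCm build the first probe would read torch.cuda.get_device_properties(0).gcnArchName and the second would parse rocminfo / hipInfo.exe output; they are injected here so the sketch runs without torch or a GPU.

```python
from typing import Callable, Optional

# Hypothetical type alias: a probe returns a raw arch string or None.
Probe = Callable[[], Optional[str]]


def normalize_gcn_arch(gcn_arch_name: str) -> str:
    """Strip feature-flag suffixes, keeping only the leading "gfx..." token."""
    return gcn_arch_name.split(":", 1)[0].strip()


def detect_arch(read_torch_arch: Probe, run_subprocess_probe: Probe) -> str:
    """Try the torch-properties probe first, then the subprocess fallback."""
    for probe in (read_torch_arch, run_subprocess_probe):
        try:
            raw = probe()
        except Exception:
            # Any probe failure just means "try the next strategy".
            continue
        if raw:
            arch = normalize_gcn_arch(raw)
            if arch.startswith("gfx"):
                return arch
    # Both strategies failed: this is where a genuine ERROR/WARNING would be logged.
    return "unknown"
```

For example, detect_arch(lambda: "gfx90a:sramecc+:xnack-", lambda: None) yields "gfx90a" without ever invoking the fallback.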

Validation

Tested on a Strix Halo (gfx1151) Windows 11 machine with AMD's wheels (torch 2.11.0+rocm7.13.0) and the venv not activated, so Scripts is not on PATH and shutil.which("hipinfo.exe") returns None:

  • Before: ROCM_GPU_ARCH == "unknown" plus the ERROR/WARNING above on import.
  • After: the torch-properties path returns gfx1151 with no log output; forcing the subprocess fallback (mocking torch.cuda.is_available to False) also returns gfx1151 via the Scripts-relative hipInfo.exe.
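
The Scripts-relative lookup exercised in the second bullet can be sketched roughly like this. It is an assumption-laden illustration: find_hipinfo and its which parameter are hypothetical names, and the PR's real implementation may differ. The idea is simply to check PATH first and then look for hipInfo.exe in the directory that holds the interpreter.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def find_hipinfo(
    python_exe: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a usable hipInfo.exe path, or None if none is found.

    python_exe would normally be sys.executable; it is a parameter here so the
    lookup can be tested against a temporary directory.
    """
    on_path = which("hipinfo.exe")
    if on_path:
        return on_path
    # Fallback: the directory containing the interpreter (the venv's Scripts
    # directory on Windows), which is where the description says AMD's wheels
    # place the tool.
    sibling = Path(python_exe).parent / "hipInfo.exe"
    if sibling.is_file():
        return str(sibling)
    return None
```

Passing the absolute path returned here to subprocess.run avoids any dependence on whether the venv was activated.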

Added two mocked unit tests that run on any backend. On the ROCm box:

tests/test_cuda_setup_evaluator.py: 6 passed, 4 skipped (CUDA-only)

All pre-commit hooks pass on the changed files.

Context

🤖 Generated with Claude Code

On Windows, get_rocm_gpu_arch() probed hipinfo.exe via PATH only. In
practice hipInfo.exe is rarely on PATH: hosts without the HIP SDK do
not have it there, and AMD's PyTorch wheels ship hipInfo.exe into the
environment's Scripts directory, which is only on PATH when the venv
is activated. The probe then raises FileNotFoundError, every import
of bitsandbytes logs an ERROR + WARNING, and ROCM_GPU_ARCH silently
degrades to unknown.

Read torch.cuda.get_device_properties(0).gcnArchName first (works on
Linux and Windows, no subprocess); keep the rocminfo / hipInfo.exe
parsing as a fallback, additionally trying hipInfo.exe next to
python.exe on Windows before giving up.

Verified on gfx1151 (Strix Halo, Windows 11, torch 2.11.0+rocm7.13.0):
previously unknown + ERROR; now gfx1151 via both the torch path
and the forced subprocess fallback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@danielhanchen

Author

Closing this for now.

@matthewdouglas

Member

@danielhanchen This seems like a reasonable change - what's the reason for closing?

danielhanchen added a commit to unslothai/unsloth that referenced this pull request Jun 11, 2026

* Fix bitsandbytes ROCm GPU arch and warp size detection on Windows

bitsandbytes resolves the ROCm GPU architecture (and warp size on
0.49.x) by shelling out to rocminfo / hipinfo.exe via PATH at import
time. On Windows neither tool is normally on PATH (AMD torch wheels
ship hipInfo.exe into the venv Scripts dir, only on PATH while
activated), so every `import bitsandbytes` logs an ERROR and WARNING,
ROCM_GPU_ARCH degrades to unknown, and the 0.49.x warp size defaults
to 64, which is wrong on RDNA (wave 32) and silently disables
pre-quantized 4-bit models via ALLOW_PREQUANTIZED_MODELS.

Install a one-shot MetaPathFinder before unsloth_zoo is imported (the
first bitsandbytes import on ROCm) that swaps get_rocm_gpu_arch and
get_rocm_warpsize for torch-device-properties-first implementations
right after bitsandbytes.cuda_specs executes, before cextension reads
them. Falls back to running hipInfo.exe by absolute path (venv
Scripts, conda Scripts, HIP SDK / AMD installer dirs). Repairs the
constants in place when bitsandbytes was imported first. Strict no-op
on non-Windows, non-ROCm builds, missing bitsandbytes, and versions
that fix this upstream. Opt out with UNSLOTH_DISABLE_BNB_ROCM_FIX=1.
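
The finder mechanism described above can be illustrated with a small, generic sketch. Everything here is hypothetical and is not the unsloth implementation: PatchAfterExecFinder, _PatchingLoader, patch_get_arch, and the get_arch attribute are made-up names, and the replacement returns a placeholder string instead of querying torch. The sketch shows the general technique of wrapping a module's loader so a patch runs immediately after the module body executes, before any other module reads its attributes.

```python
import importlib.abc
import importlib.machinery
import sys


class _PatchingLoader(importlib.abc.Loader):
    """Delegate to the real loader, then patch the freshly executed module."""

    def __init__(self, inner, patch):
        self._inner = inner
        self._patch = patch

    def create_module(self, spec):
        return self._inner.create_module(spec)

    def exec_module(self, module):
        self._inner.exec_module(module)  # run the original module body
        self._patch(module)              # then swap in the replacements


class PatchAfterExecFinder(importlib.abc.MetaPathFinder):
    """Match exactly one module name and wrap its loader."""

    def __init__(self, target_name, patch):
        self._target_name = target_name
        self._patch = patch

    def find_spec(self, fullname, path, target=None):
        if fullname != self._target_name:
            return None
        # Resolve through the regular path-based finder, which also avoids
        # recursing back into this finder.
        spec = importlib.machinery.PathFinder.find_spec(fullname, path)
        if spec is None or spec.loader is None:
            return None
        spec.loader = _PatchingLoader(spec.loader, self._patch)
        return spec


def patch_get_arch(module):
    """Replace module.get_arch once; the sentinel lives on the replacement."""
    original = getattr(module, "get_arch", None)
    if original is None or getattr(original, "_demo_patched", False):
        return  # nothing to patch, or already patched

    def get_arch():
        return "gfx-demo"  # placeholder; a real patch would ask torch first

    get_arch._demo_patched = True
    module.get_arch = get_arch


def install(finder):
    """Put the finder ahead of the default finders on sys.meta_path."""
    sys.meta_path.insert(0, finder)
```

One design note: the sentinel is stored on the replacement function rather than on the module, because importlib.reload re-executes the module body in the same namespace. A module-level flag would survive the reload while the original function is restored, so the patch would wrongly skip itself.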

Proposed upstream in bitsandbytes-foundation/bitsandbytes#1969;
shipped here so all bitsandbytes versions are covered. Verified on
gfx1151 Strix Halo, Windows 11, torch 2.11.0+rocm7.13.0 against
bitsandbytes main, 0.49.2, and a torch-props-fixed variant.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Tighten comments in the bitsandbytes ROCm detection fix

Comment and docstring pass only. AST comparison with docstrings
stripped confirms every definition is identical to the version the
12-scenario suite ran against, and the suite plus the drift test
pass unchanged on the edited files.

* Keep the bitsandbytes cuda_specs finder installed for reload support

Simulation testing caught a regression in the one-shot design:
importlib.reload(bitsandbytes.cuda_specs) re-resolves the spec through
sys.meta_path, so with the finder already removed the reload reinstalled
the unpatched upstream detector and the Windows ROCm noise returned.
Keep the finder on sys.meta_path permanently, matching the lifecycle of
the existing causal_conv1d and vllm import blockers. The finder matches
a single module name and patching stays idempotent via the sentinel
flags, so repeat hits are no-ops.

Validated on gfx1151 Windows 11: 22 simulation scenarios (conda and
embedded layouts, Program Files scan ordering, paths with spaces and
unicode, hanging probe timeout, lru-wrapped and C-function helper
shapes, reload, failed-import retry, threads, spawn, dormant finder,
Studio PATH coexistence, early fix-block ordering, bnb 0.45.5 / 0.47.0
/ 0.49.2 / main / upstream-fixed) plus the original 12-scenario suite,
CPU-torch and stale-HIP_PATH sandboxes, Python 3.10 to 3.13 gates, and
a WSL Linux leg proving byte-identical Linux behavior with and without
the fix, with and without rocminfo on PATH.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>