Skip to content

[hipBLASLt] Expose SM-count-target hint and dynamic persistent tile (ext) toggle#7698

Open
jaopaulolc wants to merge 6 commits into
developfrom
users/jolabega/hipblaslt-sm-count-target-api
Open

[hipBLASLt] Expose SM-count-target hint and dynamic persistent tile (ext) toggle#7698
jaopaulolc wants to merge 6 commits into
developfrom
users/jolabega/hipblaslt-sm-count-target-api

Conversation

@jaopaulolc
Copy link
Copy Markdown
Contributor

@jaopaulolc jaopaulolc commented May 21, 2026

Motivation

hipBLASLt is gaining an internal dynamic persistent tile scheduler (work-stealing StreamK) that benefits from knowing how many compute units (CUs / SMs) the matmul should target — useful when another kernel (e.g. RCCL) is co-running on the device or when a persistent grid should be sized for a known CU budget. Today hipBLASLt has no public API to convey that hint or to opt the matmul into the dynamic persistent tile path.

cuBLAS already exposes the SM-count hint via two complementary surfaces:

  • Handle-level: cublasSetSmCountTarget(handle, int) / cublasGetSmCountTarget(handle, int*) (cuBLAS 13.0 §2.4.24–2.4.25).
  • Per-matmul (cuBLASLt): CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET and CUBLASLT_MATMUL_PREF_SM_COUNT_TARGET descriptor / preference attributes.

This PR mirrors both into the non-ext hipBLASLt C API. For the dynamic persistent tile toggle, cuBLAS has no public analogue (it exposes no toggle for its persistent / Cluster Launch Control code path), so that knob lives in the _EXT attribute namespace and the hipblaslt_ext C++ API.

Technical Details

Public API additions

Handle-level (non-ext, mirrors cuBLAS):

hipblasStatus_t hipblasLtSetSmCountTarget(hipblasLtHandle_t handle, int32_t smCountTarget);
hipblasStatus_t hipblasLtGetSmCountTarget(hipblasLtHandle_t handle, int32_t* smCountTarget);
  • 0 (default) = "no override; use all CUs the device exposes".
  • Negative input rejected with HIPBLAS_STATUS_INVALID_VALUE (matches cuBLAS).
  • Null pointer to the getter rejected with HIPBLAS_STATUS_INVALID_VALUE (matches cuBLAS).
  • Stored on the handle. A per-matmul descriptor or preference attribute, when set to a non-zero value, takes precedence (same layering as cuBLAS).

Per-matmul attributes (non-ext, mirrors cuBLASLt):

  • HIPBLASLT_MATMUL_DESC_SM_COUNT_TARGET = 33int32_t matmul-descriptor attribute.
  • HIPBLASLT_MATMUL_PREF_SM_COUNT_TARGET = 2int32_t preference attribute.
  • HIPBLASLT_MATMUL_PREF_MAX bumped 2 → 3; no existing enum values renumbered; ABI preserved.

Dynamic persistent tile (ext-only, no cuBLAS analogue):

  • HIPBLASLT_MATMUL_DESC_DYN_PERSISTENT_TILE_EXT = 104int32_t ext attribute; non-zero opts the matmul into the dynamic persistent tile (work-stealing StreamK) scheduler.
  • hipblaslt_ext::GemmPreference::setDynPersistentTileEnabled(bool) / getDynPersistentTileEnabled() C++ ext methods.

Internal wiring

  • _rocblaslt_handle gains an sm_count_target field; new rocblaslt_set_sm_count_target / rocblaslt_get_sm_count_target helpers in rocblaslt-auxiliary.h are wired through hipblaslt.cpp with the existing log_api / log_error tracing and RocBlasLtStatusToHIPStatus translation, consistent with how hipblasLtCreate / hipblasLtDestroy are structured.
  • _rocblaslt_matmul_desc / _rocblaslt_matmul_preference carry their own sm_count_target fields (set/get via hipblasLtMatmulDescSetAttribute and hipblasLtMatmulPreferenceSetAttribute), with the same negative-value rejection and size validation as the other attributes.
  • utility.cpp attribute-to-string helper updated for logging.

hipblaslt-bench CLI

  • New options --sm_count_target <int32_t> and --dyn_persistent_tile <bool>. Both default to "off" so behavior is unchanged unless explicitly requested. --sm_count_target is validated to be non-negative at parse time (exits with code 1 on failure).
  • Backed by process-global accessors in clients/common/{include,src}/hipblaslt_bench_options.{hpp,cpp} (new) so testing_matmul.hpp can forward the values into both hipblasLtMatmulDescCreate call-sites without going through the YAML-driven Arguments struct.

Removed (ext duplicates of the cuBLAS-mirrored knob)

  • hipblaslt_ext::GemmPreference::setSmCountTarget(int32_t) and getSmCountTarget() were removed in the second commit on this branch (8ddf47b5c2). They duplicated a knob that has a direct cuBLAS analogue and therefore did not belong in the _EXT namespace. C++ users wanting the same effect should call hipblasLtSetSmCountTarget on the handle, or hipblasLtMatmulPreferenceSetAttribute(pref, HIPBLASLT_MATMUL_PREF_SM_COUNT_TARGET, ...) / the analogous descriptor setter on the per-matmul object.

Tests added

  • Handle-level public-API tests (clients/tests/src/hipblaslt_test.cpp, new aux_handle_test suite):
    • set_sm_count_target_default_is_zero — default value on a fresh handle is 0.
    • set_sm_count_target_round_trip — including the 0 sentinel round-trip.
    • set_sm_count_target_rejects_negative — negative input returns HIPBLAS_STATUS_INVALID_VALUE and leaves the prior value untouched.
    • get_sm_count_target_rejects_null_pointer — null pointer to the getter returns HIPBLAS_STATUS_INVALID_VALUE.
  • Ext gtest (still in hipblaslt_test.cpp): aux_ext_test.gemm_preference_dyn_persistent_tile_round_trip for the surviving GemmPreference boolean.
  • YAML-driven aux_test cases for the descriptor and preference attributes (round-trip, default, negative rejection, undersized buffers):
    • aux_matmul_sm_count_target
    • aux_matmul_dyn_persistent_tile_ext
    • aux_matmul_pref_sm_count_target

Test Plan

Built with cmake --build build -- -j32 against ROCm 7.1.1 on an MI355X (gfx950). libhipblaslt.so, hipblaslt-bench, and hipblaslt-test all built clean. Tests ran from build/projects/hipblaslt/clients/.

Test Result

All new tests pass (8/8) and the pre-existing pref-attribute tests still pass (regression-check on the HIPBLASLT_MATMUL_PREF_MAX bump from 2 → 3):

[ OK ] aux_handle_test.set_sm_count_target_default_is_zero
[ OK ] aux_handle_test.set_sm_count_target_round_trip
[ OK ] aux_handle_test.set_sm_count_target_rejects_negative
[ OK ] aux_handle_test.get_sm_count_target_rejects_null_pointer
[ OK ] aux_ext_test.gemm_preference_dyn_persistent_tile_round_trip
[ OK ] _/aux_test.conversion/pre_checkin_aux_matmul_sm_count_target_f16_r
[ OK ] _/aux_test.conversion/pre_checkin_aux_matmul_dyn_persistent_tile_ext_f16_r
[ OK ] _/aux_test.conversion/pre_checkin_aux_matmul_pref_sm_count_target_f16_r
[ OK ] _/aux_test.conversion/pre_checkin_aux_matmul_pref_get_attr_bad_arg_f16_r
[ OK ] _/aux_test.conversion/pre_checkin_aux_matmul_pref_get_attr_f16_r
[ PASSED ] 10 tests.

hipblaslt-bench CLI validation works:

$ hipblaslt-bench --sm_count_target -1 -m 64 -n 64 -k 64
sm_count_target must be >= 0 (0 means "use all CUs").
EXIT=1

Notes for reviewers

  • Terminology. The PR uses the language "compute units (CUs)" in user-visible docstrings (matching the AMD CDNA 4 whitepaper, the ROCm 7.0.1 gpu-arch-specs table for MI355X, and the existing rocBLAS rocblas_gemm_flags_use_cu_efficiency flag). The C symbol name keeps SM_COUNT_TARGET to preserve the cuBLAS-mirror convention so CUDA porters can locate the attribute by grepping for the cuBLAS name.
  • "Scheduler" wording in the dyn-persistent-tile docstrings. Retained as-is. An in-kernel persistent-thread work-stealing dequeue loop is conventionally called a "scheduler" in the GPU-systems literature; we considered swapping for "code path" but kept the established term.
  • Commit history. Two commits on the branch:
    1. e04723e908 — initial API additions (handle-level setter was missing; SM helpers were in ext).
    2. 8ddf47b5c2 — adds hipblasLtSetSmCountTarget / hipblasLtGetSmCountTarget to the non-ext API and removes the now-redundant GemmPreference::{set,get}SmCountTarget ext methods. Happy to squash on merge if maintainers prefer a single commit.

Submission Checklist

  • Branch named per the rocm-libraries convention users/<gh-username>/<topic>.
  • Pre-commit hooks ran on each commit (project still excluded from full pre-commit per .pre-commit-config.yaml; clang-format / cmake-lint / trailing-whitespace / large-file checks all clean).
  • Public-header API additions are appended (no enum renumbering); HIPBLASLT_MATMUL_PREF_MAX updated to reflect the new entry.
  • Defaults preserve current behavior — 0 / false on every new attribute and on the new handle-level setter.
  • Unit tests added for round-trip, default value, negative rejection, null pointer, and undersized buffers; gtest case for the surviving ext C++ method.
  • CHANGELOG.md updated under the upcoming 1.4.0 section.
  • Reviewed against https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…ext) toggle

Add two matmul-descriptor attributes plus a C++ ext-API toggle that let
callers convey an estimate of how many compute units (CUs / SMs) hipBLASLt
should target, and request the work-stealing StreamK code path:

- HIPBLASLT_MATMUL_DESC_SM_COUNT_TARGET and
  HIPBLASLT_MATMUL_PREF_SM_COUNT_TARGET (int32_t; 0 = "use all CUs";
  negative rejected with HIPBLAS_STATUS_INVALID_VALUE).
- HIPBLASLT_MATMUL_DESC_DYN_PERSISTENT_TILE_EXT (int32_t bool) plus
  hipblaslt_ext::GemmPreference::{set,get}DynPersistentTileEnabled() —
  opts the matmul into the hipBLASLt dynamic persistent tile scheduler
  (work-stealing StreamK). Lives in the ext API.
- hipblaslt_ext::GemmPreference::{set,get}SmCountTarget() C++ helpers.
- hipblaslt-bench: --sm_count_target and --dyn_persistent_tile CLI
  options that forward the values into the matmul descriptor.

Defaults preserve current behavior. The PREF_MAX enum is bumped from 2
to 3; no existing enum values are renumbered, so ABI is preserved.

Unit tests:
- aux_ext_test gtest cases: gemm_preference_sm_count_target_default_is_zero,
  gemm_preference_sm_count_target_round_trip,
  gemm_preference_dyn_persistent_tile_round_trip (host-only).
- YAML-driven aux_test cases: aux_matmul_sm_count_target,
  aux_matmul_dyn_persistent_tile_ext, aux_matmul_pref_sm_count_target.

Co-authored-by: Cursor <cursoragent@cursor.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 54.43787% with 77 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...c/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp 64.66% 43 Missing and 4 partials ⚠️
...cts/hipblaslt/library/src/amd_detail/hipblaslt.cpp 0.00% 24 Missing ⚠️
...hipblaslt/library/src/amd_detail/hipblaslt-ext.cpp 0.00% 6 Missing ⚠️

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7698      +/-   ##
===========================================
- Coverage    61.88%   61.87%   -0.01%     
===========================================
  Files         2086     2086              
  Lines       357161   357195      +34     
  Branches     53836    53827       -9     
===========================================
- Hits        221013   220984      -29     
- Misses      117350   117414      +64     
+ Partials     18798    18797       -1     
Flag Coverage Δ *Carryforward flag
TensileLite 25.94% <ø> (-0.16%) ⬇️ Carriedforward from ffbde87
hipBLAS 90.65% <ø> (ø) Carriedforward from ffbde87
hipBLASLt 41.43% <54.44%> (+0.16%) ⬆️
hipCUB 82.21% <ø> (ø) Carriedforward from ffbde87
hipDNN 85.87% <ø> (ø) Carriedforward from ffbde87
hipFFT 50.00% <ø> (ø) Carriedforward from ffbde87
hipRAND 76.12% <ø> (ø) Carriedforward from ffbde87
hipSOLVER 69.24% <ø> (ø) Carriedforward from ffbde87
hipSPARSE 85.09% <ø> (ø) Carriedforward from ffbde87
rocBLAS 48.11% <ø> (ø) Carriedforward from ffbde87
rocFFT 52.07% <ø> (ø) Carriedforward from ffbde87
rocRAND 57.04% <ø> (ø) Carriedforward from ffbde87
rocSOLVER 77.83% <ø> (ø) Carriedforward from ffbde87
rocSPARSE 72.68% <ø> (ø) Carriedforward from ffbde87

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...blaslt/library/include/hipblaslt/hipblaslt-ext.hpp 0.00% <ø> (ø)
...ts/hipblaslt/library/include/hipblaslt/hipblaslt.h 75.00% <ø> (ø)
...rary/src/amd_detail/rocblaslt/src/include/handle.h 97.14% <100.00%> (+0.17%) ⬆️
...t/library/src/amd_detail/rocblaslt/src/utility.cpp 27.01% <100.00%> (+0.70%) ⬆️
...hipblaslt/library/src/amd_detail/hipblaslt-ext.cpp 56.74% <0.00%> (-0.37%) ⬇️
...cts/hipblaslt/library/src/amd_detail/hipblaslt.cpp 47.54% <0.00%> (-2.10%) ⬇️
...c/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp 68.25% <64.66%> (-0.24%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

jaopaulolc and others added 5 commits May 25, 2026 11:15
…t duplicates

Move the SM-count-target hint out of the ext API and into the non-ext
hipBLASLt C API, mirroring cuBLAS's cublasSetSmCountTarget /
cublasGetSmCountTarget directly:

- New public functions in hipblaslt.h:
    hipblasStatus_t hipblasLtSetSmCountTarget(hipblasLtHandle_t, int32_t);
    hipblasStatus_t hipblasLtGetSmCountTarget(hipblasLtHandle_t, int32_t*);
  Stored on the handle; 0 (default) = "no override"; negative input
  returns HIPBLAS_STATUS_INVALID_VALUE; null pointer to the getter
  returns HIPBLAS_STATUS_INVALID_VALUE.

- Internal rocblaslt_set_sm_count_target / rocblaslt_get_sm_count_target
  helpers wired through rocblaslt-auxiliary.h with matching status codes
  and log_api / log_error tracing, consistent with the existing
  attribute-handler conventions.

- _rocblaslt_handle gains an sm_count_target field. The per-matmul
  descriptor and per-preference attributes
  (HIPBLASLT_MATMUL_DESC_SM_COUNT_TARGET /
  HIPBLASLT_MATMUL_PREF_SM_COUNT_TARGET) are unchanged and, when set,
  take precedence over the handle-level value (matches cuBLAS layering).

- Remove hipblaslt_ext::GemmPreference::{set,get}SmCountTarget — they
  duplicated a cuBLAS-mirrored knob and no longer fit the "ext = no
  cuBLAS analogue" rule. setDynPersistentTileEnabled /
  getDynPersistentTileEnabled stay in ext (no cuBLAS analogue).

- Tests: replace the two aux_ext_test gemm_preference_sm_count_target_*
  cases with four aux_handle_test cases that exercise the new public
  C API (default == 0, round-trip including 0 sentinel, negative
  rejection preserves prior value, null pointer rejection on getter).
  The YAML-driven aux_matmul_sm_count_target / _pref_sm_count_target /
  _dyn_persistent_tile_ext cases are unchanged and still cover the
  per-desc and per-pref attributes plus the ext toggle.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ranches

Codecov flagged the size-validation branches in
rocblaslt_auxiliary.cpp's HIPBLASLT_MATMUL_DESC_SM_COUNT_TARGET,
HIPBLASLT_MATMUL_DESC_DYN_PERSISTENT_TILE_EXT and
HIPBLASLT_MATMUL_PREF_SM_COUNT_TARGET set/get handlers, plus the
GemmPreference dyn-persistent-tile setter/getter, as unexecuted in the
host-only coverage build (the YAML-driven aux_test cases that already
exercise these are gated behind data availability and are not always
run for coverage).

Add six standalone gtest cases under `aux_attr_test` that drive each
undersized-buffer error path through the public hipBLASLt API, plus
two `aux_handle_test` cases that pin the rocblaslt_status_invalid_handle
branch for hipblasLtSet/GetSmCountTarget, and one `aux_ext_test` case
that asserts the default-disabled state of
GemmPreference::getDynPersistentTileEnabled.

All 14 affected tests pass on gfx950.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants