Skip to content

Add reuseA/reuseB support to MFMA instruction for precise reuse bit#7753

Merged
geotseng-amd merged 2 commits into
ROCm:developfrom
geotseng-amd:users/geotseng/develop-reuse-bit
May 29, 2026
Merged

Add reuseA/reuseB support to MFMA instruction for precise reuse bit#7753
geotseng-amd merged 2 commits into
ROCm:developfrom
geotseng-amd:users/geotseng/develop-reuse-bit

Conversation

@geotseng-amd
Copy link
Copy Markdown
Contributor

@geotseng-amd geotseng-amd commented May 26, 2026

AIHPBLAS-1403

Brief

Precisely setting reuse bit for gfx1250.

Implementations


Precisely setting reuse bit for gfx1250

  • Add reuseA /reuseB in mfma.hpp
    •  // Condition for reuseA/B Str setting
       // TODO: Consider MX Datatypes
       std::string reuseAStr       = typeConvert(instType)=="f8f6f4" || !getAsmCaps()["HasWMMA_V3"]? "": (!reuseA ? "" : " matrix_a_reuse");
       std::string reuseBStr       = typeConvert(instType)=="f8f6f4" || !getAsmCaps()["HasWMMA_V3"]? "": (!reuseB ? "" : " matrix_b_reuse");
  • Add arguments in insruction:
    • return acc->toString() + ", " + a->toString() + ", " + b->toString() + ", "
         + acc2->toString() + inputPermuteStr + reuseAStr + reuseBStr + negStr;
  • Precisely setting reuse bit in KernelWriterAssembly.py
    •  # Precisely setting reuse bit
       if kernel["MIWaveTile"][outer]==1 and kernel["MIWaveTile"][inner]==1:
         reuseA   = False
         reuseB   = False
       elif kernel["MIWaveTile"][outer]>1 and kernel["MIWaveTile"][inner]==1:
         reuseA   = True if tPB["tile01Idx"] and idx1 < (kernel["MIWaveTile"][outer]-1) else False
         reuseB   = True if not tPB["tile01Idx"] and idx1 < (kernel["MIWaveTile"][outer]-1) else False
       elif kernel["MIWaveTile"][outer]==1 and kernel["MIWaveTile"][inner]>1:
         reuseA   = True if not tPB["tile01Idx"] and idx0 < (kernel["MIWaveTile"][inner]-1) else False
         reuseB   = True if tPB["tile01Idx"] and idx0 < (kernel["MIWaveTile"][inner]-1) else False
       else:
         reuseA   = True if not tPB["tile01Idx"] and idx0 < (kernel["MIWaveTile"][inner]-1) else False
         reuseB   = True if tPB["tile01Idx"] and idx0 < (kernel["MIWaveTile"][inner]-1) else False

Assembly output

  • /*  mfmaIndex:12  */
    v_wmma_f32_16x16x64_fp8_fp8 v[0:7], v[vgprValuA_X3_I0+0+0+0:vgprValuA_X3_I0+0+0+0+7],  v[vgprValuB_X3_I0+0+0+0:vgprValuB_X3_I0+0+0+0+7], v[0:7] matrix_b_reuse // left value = v[0+0:7+0]
    /*  mfmaIndex:13  */
    ds_load_tr8_b64 v[vgprValuA_X0_I0+0+0:vgprValuA_X0_I0+0+0+1], v[vgprLocalReadAddrA] offset:0 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuB_X0_I0+0+0:vgprValuB_X0_I0+0+0+1], v[vgprLocalReadAddrB] offset:0 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+0+2:vgprValuA_X0_I0+0+2+1], v[vgprLocalReadAddrA] offset:1024 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+0+4:vgprValuA_X0_I0+0+4+1], v[vgprLocalReadAddrA] offset:4096 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+0+6:vgprValuA_X0_I0+0+6+1], v[vgprLocalReadAddrA] offset:5120 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+8+0:vgprValuA_X0_I0+8+0+1], v[vgprLocalReadAddrA] offset:32 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+8+2:vgprValuA_X0_I0+8+2+1], v[vgprLocalReadAddrA] offset:1056 // LDS Transpose
    v_wmma_f32_16x16x64_fp8_fp8 v[8:15], v[vgprValuA_X3_I0+8+0+0:vgprValuA_X3_I0+8+0+0+7], v[vgprValuB_X3_I0+0+0+0:vgprValuB_X3_I0+0+0+0+7], v[8:15] matrix_b_reuse // left value = v[8+0:15+0]
    /*  mfmaIndex:14  */
    ds_load_tr8_b64 v[vgprValuA_X0_I0+8+4:vgprValuA_X0_I0+8+4+1], v[vgprLocalReadAddrA] offset:4128 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+8+6:vgprValuA_X0_I0+8+6+1], v[vgprLocalReadAddrA] offset:5152 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+16+0:vgprValuA_X0_I0+16+0+1], v[vgprLocalReadAddrA] offset:64 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+16+2:vgprValuA_X0_I0+16+2+1], v[vgprLocalReadAddrA] offset:1088 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+16+4:vgprValuA_X0_I0+16+4+1], v[vgprLocalReadAddrA] offset:4160 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+16+6:vgprValuA_X0_I0+16+6+1], v[vgprLocalReadAddrA] offset:5184 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+24+0:vgprValuA_X0_I0+24+0+1], v[vgprLocalReadAddrA] offset:96 // LDS Transpose
    v_wmma_f32_16x16x64_fp8_fp8 v[16:23], v[vgprValuA_X3_I0+16+0+0:vgprValuA_X3_I0+16+0+0+7], v[vgprValuB_X3_I0+0+0+0:vgprValuB_X3_I0+0+0+0+7], v[16:23] matrix_b_reuse // left value = v[16+0:23+0]
    /*  mfmaIndex:15  */
    ds_load_tr8_b64 v[vgprValuA_X0_I0+24+2:vgprValuA_X0_I0+24+2+1], v[vgprLocalReadAddrA] offset:1120 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+24+4:vgprValuA_X0_I0+24+4+1], v[vgprLocalReadAddrA] offset:4192 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuA_X0_I0+24+6:vgprValuA_X0_I0+24+6+1], v[vgprLocalReadAddrA] offset:5216 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuB_X0_I0+0+2:vgprValuB_X0_I0+0+2+1], v[vgprLocalReadAddrB] offset:256 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuB_X0_I0+0+4:vgprValuB_X0_I0+0+4+1], v[vgprLocalReadAddrB] offset:1024 // LDS Transpose
    ds_load_tr8_b64 v[vgprValuB_X0_I0+0+6:vgprValuB_X0_I0+0+6+1], v[vgprLocalReadAddrB] offset:1280 // LDS Transpose
    v_wmma_f32_16x16x64_fp8_fp8 v[24:31], v[vgprValuA_X3_I0+24+0+0:vgprValuA_X3_I0+24+0+0+7], v[vgprValuB_X3_I0+0+0+0:vgprValuB_X3_I0+0+0+0+7], v[24:31] // left value = v[24+0:31+0]

Tests

Test cases are all passed on FFM and Gopher.

FFM(with emulated ECC)

tox passed

- generated xml file: /workspace/rocm-libraries/projects/hipblaslt/tensilelite/python_tests.xml -
========== 65 passed, 167 skipped, 80 warnings in 22226.26s (6:10:26) ==========
py3: exit 0 (22226.54 seconds) /workspace/rocm-libraries/projects/hipblaslt/tensilelite> pytest -v --basetemp=/tmp/.tensile-tox/py3/tmp --junit-xml=/workspace/rocm-libraries/projects/hipblaslt/tensilelite/python_tests.xml --junit-prefix=py3 --color=yes -n 4 --prebuilt-client=/tmp/.tensile-tox/py3/client/0_Build/client/tensile_client Tensile/Tests -m common pid=745752
.pkg: _exit> python /usr/lib/python3/dist-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
.pkg: exit None (0.00 seconds) /workspace/rocm-libraries/projects/hipblaslt/tensilelite> python /usr/lib/python3/dist-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__ pid=742788
  py3: OK (22364.73=setup[9.57]+cmd[0.00,0.90,0.33,0.68,0.27,126.43,22226.54] seconds)
  congratulations :) (22364.83 seconds)

hipblaslt-test passed

[----------] Global test environment tear-down
[==========] 3417 tests from 1 test suite ran. (18512509 ms total)
[  PASSED  ] 3417 tests.
hipBLASLt version: 100100
hipBLASLt git version:
command line: ./build/release/clients/staging/hipblaslt-test --gtest_filter=*matmul_bias_relu_SAV_smoke*:*matmul_one*:*matmul_medium*:*matmul_one_double*:*one_integer_precisions_i8 *:*f8_bf8_dst_fp32_gfx12*:*matmul_f4*:*matmul_f6* --gtest_output=xml --gtest_color=yes --gtest_repeat=1

[----------] Global test environment tear-down
[==========] 336 tests from 1 test suite ran. (1778945 ms total)
[  PASSED  ] 336 tests.
hipBLASLt version: 100100
hipBLASLt git version:
command line: ./build/release/clients/staging/hipblaslt-test --gtest_filter=*matmul_dgelu_fp16*:*matmul_dgelu_bias_fp16*:*matmul_bgrada_fp16*:*matmul_bgradb_fp16*:*matmul_dgelu_bf16*:*matmul_bgradb_bf16* --gtest_output=xml --gtest_color=yes --gtest_repeat=1

Notes

  • Thanks Mark, Anne, and Serge for discussing with me.

@geotseng-amd
Copy link
Copy Markdown
Contributor Author

gfx1250 tox passed.

---------------------------------------------------------------------------------------- generated xml file: /workspace/fork/rocm-libraries/projects/hipblaslt/tensilelite/python_tests.xml -----------------------------------------------------------------------------------------
============================================================================================================= 81 passed, 209 skipped, 4 warnings in 1931.03s (0:32:11) ==============================================================================================================
py3: OK (2007.36=setup[17.84]+cmd[1.09,37.77,14.94,0.54,1935.18] seconds)
congratulations :) (2007.40 seconds)

@geotseng-amd geotseng-amd enabled auto-merge (squash) May 26, 2026 08:46
Copy link
Copy Markdown
Contributor

@hcman2 hcman2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds explicit reuseA/reuseB support for WMMA/MFMA-family instructions to more precisely control the emitted matrix_a_reuse / matrix_b_reuse hints (targeting gfx1250), and propagates those hints through Tensile assembly emission and rocisa→stinkytofu conversion.

Changes:

  • Extend rocisa MFMAInstruction / MXMFMAInstruction with reuseA/reuseB fields, update nanobind bindings, and emit matrix_{a,b}_reuse in toString() when supported.
  • Compute per-instruction reuse decisions in KernelWriterAssembly.py and pass them into MFMAInstruction / MXMFMAInstruction.
  • Teach stinkytofu rocisa string parsing to extract matrix_{a,b}_reuse and store it in MFMAModifiers; remove an obsolete commented “hack” line.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
shared/stinkytofu/src/conversion/rocisa/ToStinkyTofuUtils.cpp Parse matrix_a_reuse/matrix_b_reuse tokens from rocisa instruction strings into MFMAModifiers.
projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Problem.py Remove an unused commented-out TLUB “hack” line.
projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Compute reuseA/reuseB per MFMA and pass into rocisa instruction constructors.
projects/hipblaslt/tensilelite/rocisa/rocisa/src/instruction/mfma.cpp Expose reuseA/reuseB as Python ctor args for MFMA/MXMFMA instructions.
projects/hipblaslt/tensilelite/rocisa/rocisa/include/instruction/mfma.hpp Store reuseA/reuseB on MFMA/MXMFMA instructions and append matrix_{a,b}_reuse to emitted asm when supported.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 0% with 22 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 0.00% 22 Missing ⚠️

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7753      +/-   ##
===========================================
+ Coverage    61.87%   61.98%   +0.11%     
===========================================
  Files         2086     2087       +1     
  Lines       357038   357926     +888     
  Branches     53806    54001     +195     
===========================================
+ Hits        220892   221843     +951     
+ Misses      117348   117281      -67     
- Partials     18798    18802       +4     
Flag Coverage Δ *Carryforward flag
TensileLite 27.28% <0.00%> (+1.35%) ⬆️
hipBLAS 90.65% <ø> (ø) Carriedforward from 49885a8
hipBLASLt 41.27% <ø> (ø)
hipCUB 82.21% <ø> (ø) Carriedforward from 49885a8
hipDNN 85.87% <ø> (ø) Carriedforward from 49885a8
hipFFT 50.00% <ø> (ø) Carriedforward from 49885a8
hipRAND 76.12% <ø> (ø) Carriedforward from 49885a8
hipSOLVER 69.24% <ø> (ø) Carriedforward from 49885a8
hipSPARSE 85.09% <ø> (ø) Carriedforward from 49885a8
rocBLAS 48.10% <ø> (ø) Carriedforward from 49885a8
rocFFT 52.07% <ø> (ø) Carriedforward from 49885a8
rocRAND 57.04% <ø> (ø) Carriedforward from 49885a8
rocSOLVER 77.83% <ø> (ø) Carriedforward from 49885a8
rocSPARSE 72.68% <ø> (ø) Carriedforward from 49885a8

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...slt/tensilelite/Tensile/SolutionStructs/Problem.py 45.59% <ø> (+1.78%) ⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 7.48% <0.00%> (-0.05%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented May 26, 2026

Can we do this at StinkyTofu stage?

@geotseng-amd geotseng-amd force-pushed the users/geotseng/develop-reuse-bit branch from 666331e to 2dd1db0 Compare May 26, 2026 17:13
@geotseng-amd geotseng-amd force-pushed the users/geotseng/develop-reuse-bit branch from 2dd1db0 to 49885a8 Compare May 27, 2026 05:32
@geotseng-amd
Copy link
Copy Markdown
Contributor Author

geotseng-amd commented May 27, 2026

Can we do this at StinkyTofu stage?

hi @nakajee

Since StinkyTofu currently preserves the relative order of WMMA instructions (i.e., WMMA instructions remain stable with respect to each other, even though they may be interleaved with other instructions), there is no need to move this logic to the StinkyTofu stage.

Thank you.

@cycheng cycheng closed this May 27, 2026
auto-merge was automatically disabled May 27, 2026 07:11

Pull request was closed

@cycheng cycheng reopened this May 27, 2026
@geotseng-amd geotseng-amd changed the title Add reuseA/reuseB support to MFMA instruction for precise reuse bit c… Add reuseA/reuseB support to MFMA instruction for precise reuse bit May 27, 2026
@geotseng-amd
Copy link
Copy Markdown
Contributor Author

Hi @cycheng @nakajee

Let me clarify:

Since StinkyTofu currently preserves the relative order of WMMA instructions (i.e., WMMA instructions remain stable with respect to each other, even though they may be interleaved with other instructions), there is no need to move this logic to the StinkyTofu stage.

Thank you.

@geotseng-amd geotseng-amd enabled auto-merge (squash) May 27, 2026 07:20
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
@geotseng-amd geotseng-amd disabled auto-merge May 28, 2026 09:30
@geotseng-amd geotseng-amd enabled auto-merge (squash) May 28, 2026 11:14
@geotseng-amd geotseng-amd disabled auto-merge May 28, 2026 13:33
@geotseng-amd geotseng-amd enabled auto-merge (squash) May 28, 2026 17:29
@geotseng-amd geotseng-amd merged commit 5895a95 into ROCm:develop May 29, 2026
55 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants