[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels#7723

Merged
poyenc merged 3 commits into ROCm:develop from ltqin:ck/ltqin/fix_fmha_v3_lse
May 28, 2026

@ltqin (Contributor) commented May 24, 2026

Motivation

The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not support LSE (Log-Sum-Exp) output. This PR enables LSE output support for fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics alongside attention outputs.

Technical Details

- StandardAttention: lse = softmax_scale * m + log(l)
- LogitsSoftCap: lse = (m / log2(e)) + log(l)
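
As a sanity check on the two formulas above, here is a minimal Python sketch (not the kernel code). It assumes `m` is the running row maximum and `l` the running row sum of exponentials kept by the online softmax; that in the StandardAttention path `m` is taken over the raw, unscaled scores; and that in the LogitsSoftCap path `m` is taken over logits that were already capped and pre-multiplied by log2(e). The function names and the tanh-based soft cap are illustrative assumptions, not CK Tile APIs.

```python
import math

LOG2E = math.log2(math.e)


def reference_lse(logits):
    """Numerically stable natural-log log-sum-exp of one row of logits."""
    top = max(logits)
    return top + math.log(sum(math.exp(x - top) for x in logits))


def lse_standard(scores, softmax_scale):
    """lse = softmax_scale * m + log(l), with m the max of the raw scores."""
    m = max(scores)
    # l accumulates exp(softmax_scale * (s - m)), the softmax denominator.
    l = sum(math.exp(softmax_scale * (s - m)) for s in scores)
    return softmax_scale * m + math.log(l)


def lse_soft_cap(scores, softmax_scale, cap):
    """lse = m / log2(e) + log(l), with m tracked in the log2 domain."""
    # Assumed soft cap: cap * tanh(scale * s / cap), then scaled by log2(e).
    t = [LOG2E * cap * math.tanh(softmax_scale * s / cap) for s in scores]
    m = max(t)
    l = sum(2.0 ** (x - m) for x in t)
    return m / LOG2E + math.log(l)


scores = [0.5, -1.25, 3.0, 2.0]
scale, cap = 0.125, 30.0

capped = [cap * math.tanh(scale * s / cap) for s in scores]
assert math.isclose(lse_standard(scores, scale),
                    reference_lse([scale * s for s in scores]))
assert math.isclose(lse_soft_cap(scores, scale, cap), reference_lse(capped))
```

Under these assumptions both expressions reduce to the ordinary natural-log LSE: since log2(e) * ln(2) = 1, dividing the log2-domain maximum by log2(e) brings it back to natural-log units before log(l) is added.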

Test Plan

Run FMHA forward example with fp8bf16 precision and LSE output enabled:

  • Test 1: Basic LSE functionality
    ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1

  • Test 2: LSE with LogitsSoftCap (requires first removing the Logits filter from the CMakeLists)
    ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0

@poyenc (Contributor) commented May 25, 2026

LGTM — the LSE formula fix for StandardAttention is correct (matches the async pipeline reference) and the plumbing changes are straightforward.

One ask: please make sure the FMHA test is enabled in CI for this PR so the lse=1 paths get exercised before merge.

@poyenc (Contributor) left a comment

LGTM — LSE formula fix is correct, CI all green including FMHA tests on gfx950/gfx942/gfx90a/gfx1201.

@poyenc poyenc merged commit 4ed6c51 into ROCm:develop May 28, 2026
39 checks passed
assistant-librarian Bot pushed a commit to ROCm/composable_kernel that referenced this pull request May 28, 2026
[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels (#7723)

yenong-amd pushed a commit that referenced this pull request May 28, 2026