Commit a71b316
Improve SDPA to handle GQA and update models to use native GQA
Add GQA/MQA support to the Triton SDPA kernel with a "pack GQA" optimization adapted from FlashAttention. When enable_gqa=True and H_q > H_kv, the kernel folds the multiple Q heads that share a KV head into the M (sequence) dimension, so K/V are loaded once per KV head instead of once per Q head. A tile-utilization heuristic, also taken from FlashAttention, decides when packing is beneficial (decode) versus when simple head remapping suffices (prefill).

Update the Qwen 3.5 MoE and Voxtral Realtime models to use native enable_gqa=True instead of manually expanding KV heads via repeat_interleave, eliminating redundant memory traffic.
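The head-folding trick described above can be sketched outside the kernel. The following is a minimal numpy illustration (not the actual Triton code; `sdpa` is a hypothetical plain-attention helper): it shows that expanding K/V with a repeat (the old model-side approach) and reshaping the grouped Q heads into the M dimension (pack GQA) produce the same attention output, while the packed form reads each K/V head only once.

```python
import numpy as np

def sdpa(q, k, v):
    # Plain scaled dot-product attention over [H, M, D] tensors (no mask).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

H_q, H_kv, M, N, D = 8, 2, 4, 6, 16   # 4 Q heads per KV head
G = H_q // H_kv
rng = np.random.default_rng(0)
q = rng.standard_normal((H_q, M, D))
k = rng.standard_normal((H_kv, N, D))
v = rng.standard_normal((H_kv, N, D))

# Baseline: materialize K/V for every Q head (repeat_interleave-style copy).
k_rep = np.repeat(k, G, axis=0)
v_rep = np.repeat(v, G, axis=0)
out_ref = sdpa(q, k_rep, v_rep)

# Pack GQA: fold the G Q heads sharing each KV head into the M dimension.
# Q heads i*G .. (i+1)*G-1 land in group i, matching the repeat above,
# so each K/V head is loaded once for G*M query rows.
q_packed = q.reshape(H_kv, G * M, D)
out_packed = sdpa(q_packed, k, v).reshape(H_q, M, D)

assert np.allclose(out_ref, out_packed)
```

Softmax rows are independent, so stacking extra query rows onto a KV head changes nothing numerically; the win is purely in memory traffic, which is why the commit's heuristic prefers packing when M is small (decode) and per-head tiles would otherwise be underfilled.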
1 parent fa81941 commit a71b316

4 files changed: 876 additions & 123 deletions

0 commit comments