perf: use fused SDPA kernel for Qwen3.5 full attention generation
Route the seq_len=1 (single-token generation) path in Qwen3.5 full
attention layers through the fused_vector_attention Metal kernel via
backend.sdpa(). This replaces the 4+ separate GPU dispatches of the
manual path (repeat_kv head expansion, Q@K^T matmul, dtype cast,
softmax, att@V matmul) with a single fused kernel that handles GQA
natively and uses an online softmax.
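A minimal, PyTorch-flavoured sketch of the routing described above. `backend.sdpa()` and `repeat_kv` are taken from this message; the function name, tensor shapes, and the manual-path details are illustrative assumptions, not the project's actual code.

```python
import torch

def full_attention(q, k, v, mask, backend, scale):
    """q: [B, n_heads, S, D]; k, v: [B, n_kv_heads, T, D] (GQA: n_kv_heads < n_heads)."""
    seq_len = q.shape[2]
    if seq_len == 1:
        # Generation: one fused kernel handles the GQA head mapping and the
        # softmax internally, avoiding the separate dispatches below.
        # `backend.sdpa` is a placeholder for the project's fused SDPA entry point.
        return backend.sdpa(q, k, v, scale=scale, mask=mask)

    # Prefill (seq_len > 1): manual mixed-precision attention.
    n_rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(n_rep, dim=1)          # repeat_kv head expansion
    v = v.repeat_interleave(n_rep, dim=1)
    scores = (q @ k.transpose(-2, -1)) * scale     # Q @ K^T matmul
    scores = scores.float() + mask                 # dtype cast before softmax
    att = torch.softmax(scores, dim=-1).to(v.dtype)
    return att @ v                                 # att @ V matmul
```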
Benchmark (M3 Pro, Qwen3.5-0.8B, 50 tokens):
- Before: 14.7 tok/s
- After: 15.2 tok/s (+3.4%)
The prefill path (seq_len > 1) still uses the manual mixed-precision
attention; that path runs only once, at the start of a conversation.
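For reference, a hypothetical pure-Python illustration (not the project's kernel code) of the online softmax the fused kernel relies on: keys are folded in one at a time while a running max and normalizer keep the result identical to a conventional two-pass softmax.

```python
import math

def online_softmax_attention(q, keys, values, scale):
    """q: [d]; keys/values: sequences of [d] vectors; returns the attended output."""
    m = float("-inf")            # running max of scores seen so far
    denom = 0.0                  # running softmax normalizer
    out = [0.0] * len(values[0]) # running weighted sum of values
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))  # dot(q, k) * scale
        m_new = max(m, s)
        correction = math.exp(m - m_new)   # rescale previously accumulated state
        w = math.exp(s - m_new)            # weight of the current key
        denom = denom * correction + w
        out = [o * correction + w * vi for o, vi in zip(out, v)]
        m = m_new
    return [o / denom for o in out]
```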
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>