Commit 98a1c43
Add YOCO KV sharing support to StaticAttention (pytorch#18517)
Summary:
Add YOCO (You Only Cache Once) support to StaticAttention so models with
num_kv_shared_layers > 0 can be transformed via
transform_attention_mha_to_static_attention and run correctly through the
static attention export pipeline.
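A minimal sketch of how the transform might be invoked, assuming the ExecuTorch llama example layout; the import paths, the ModelArgs field names, and the transform's exact signature are assumptions based on that layout, not the verified API.

```python
# Sketch only: module paths, ModelArgs fields, and the transform signature
# are assumptions, not the verified API from this commit.
from executorch.examples.models.llama.model_args import ModelArgs
from executorch.examples.models.llama.llama_transformer import construct_transformer
from executorch.examples.models.llama.static_attention import (
    transform_attention_mha_to_static_attention,
)

args = ModelArgs(
    dim=256,
    n_layers=8,
    n_heads=8,
    n_kv_heads=2,
    num_kv_shared_layers=2,  # YOCO: the last 2 layers reuse earlier K/V caches
)
model = construct_transformer(args)

# Swap each AttentionMHA for StaticAttention; per this commit, layers in the
# shared range get no K/V projection weights and no per-layer KV caches.
static_model = transform_attention_mha_to_static_attention(model)
```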
Changes:
- StaticAttention.__init__: skip wks/wvs/caches for shared layers
- from_attention_mha: detect shared layers, preserve LoRA on wq/wo
- forward/_forward_mha/_forward_sha: accept shared_kv and skip the K/V projection for shared layers (see the sketch after this list)
- StaticAttentionIOManager: skip cache allocation for shared layers
- Tests: 6 new test methods covering shared/donor layers, LoRA, numerics
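The following is a conceptual sketch of the forward-path change, not the actual StaticAttention code: layers flagged as KV-shared own no K/V projections and instead attend over `shared_kv` produced by a donor layer. All names here (SketchSharedKVAttention, is_kv_shared, shared_kv) are illustrative.

```python
# Illustrative single-head attention showing the YOCO donor/shared split.
# This is a sketch of the idea, not the StaticAttention implementation.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchSharedKVAttention(nn.Module):
    def __init__(self, dim: int, is_kv_shared: bool):
        super().__init__()
        self.is_kv_shared = is_kv_shared
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        if not is_kv_shared:
            # Donor layers keep their K/V projections; shared layers omit
            # them, mirroring the __init__ change described above.
            self.wk = nn.Linear(dim, dim, bias=False)
            self.wv = nn.Linear(dim, dim, bias=False)

    def forward(
        self,
        x: torch.Tensor,
        shared_kv: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    ):
        q = self.wq(x)
        if self.is_kv_shared:
            # Shared layer: no K/V projection; reuse the donor's K/V.
            assert shared_kv is not None, "shared layer needs donor K/V"
            k, v = shared_kv
        else:
            k, v = self.wk(x), self.wv(x)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.wo(out), (k, v)


x = torch.randn(1, 16, 64)
donor = SketchSharedKVAttention(64, is_kv_shared=False)
shared = SketchSharedKVAttention(64, is_kv_shared=True)
y, kv = donor(x)                 # donor computes K/V once ("cache once")
z, _ = shared(y, shared_kv=kv)   # shared layer attends over the donor's K/V
```

The same split explains the IOManager change: only donor layers need cache buffers allocated, since shared layers read the donor's K/V rather than writing their own.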
When num_kv_shared_layers=0 (the default), behavior is unchanged.
No C++ runtime changes needed.
Reviewed By: limintang, YIWENX14
Differential Revision: D97556018
2 files changed: 495 additions & 141 deletions