Commit 98a1c43
Add YOCO KV sharing support to StaticAttention (pytorch#18517)
Summary:
Add YOCO (You Only Cache Once) support to StaticAttention so models with
num_kv_shared_layers > 0 can be transformed via
transform_attention_mha_to_static_attention and run correctly through the
static attention export pipeline.
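A minimal sketch of how the transform might be invoked, assuming the ExecuTorch llama example layout; the import paths, the ModelArgs field names, and the transform's exact signature are assumptions based on that layout, not the verified API.

```python
# Sketch only: module paths, ModelArgs fields, and the transform signature
# are assumptions, not the verified API from this commit.
from executorch.examples.models.llama.model_args import ModelArgs
from executorch.examples.models.llama.llama_transformer import construct_transformer
from executorch.examples.models.llama.static_attention import (
    transform_attention_mha_to_static_attention,
)

args = ModelArgs(
    dim=256,
    n_layers=8,
    n_heads=8,
    n_kv_heads=2,
    num_kv_shared_layers=2,  # YOCO: the last 2 layers reuse earlier K/V caches
)
model = construct_transformer(args)

# Swap each AttentionMHA for StaticAttention; per this commit, layers in the
# shared range get no K/V projection weights and no per-layer KV caches.
static_model = transform_attention_mha_to_static_attention(model)
```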
Changes:
- StaticAttention.__init__: skip wks/wvs/caches for shared layers
- from_attention_mha: detect shared layers, preserve LoRA on wq/wo
- forward/_forward_mha/_forward_sha: accept shared_kv and skip the K/V projection for shared layers (see the sketch after this list)
- StaticAttentionIOManager: skip cache allocation for shared layers
- Tests: 6 new test methods covering shared/donor layers, LoRA, numerics
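The following is a conceptual sketch of the forward-path change, not the actual StaticAttention code: layers flagged as KV-shared own no K/V projections and instead attend over `shared_kv` produced by a donor layer. All names here (SketchSharedKVAttention, is_kv_shared, shared_kv) are illustrative.

```python
# Illustrative single-head attention showing the YOCO donor/shared split.
# This is a sketch of the idea, not the StaticAttention implementation.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchSharedKVAttention(nn.Module):
    def __init__(self, dim: int, is_kv_shared: bool):
        super().__init__()
        self.is_kv_shared = is_kv_shared
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        if not is_kv_shared:
            # Donor layers keep their K/V projections; shared layers omit
            # them, mirroring the __init__ change described above.
            self.wk = nn.Linear(dim, dim, bias=False)
            self.wv = nn.Linear(dim, dim, bias=False)

    def forward(
        self,
        x: torch.Tensor,
        shared_kv: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    ):
        q = self.wq(x)
        if self.is_kv_shared:
            # Shared layer: no K/V projection; reuse the donor's K/V.
            assert shared_kv is not None, "shared layer needs donor K/V"
            k, v = shared_kv
        else:
            k, v = self.wk(x), self.wv(x)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.wo(out), (k, v)


x = torch.randn(1, 16, 64)
donor = SketchSharedKVAttention(64, is_kv_shared=False)
shared = SketchSharedKVAttention(64, is_kv_shared=True)
y, kv = donor(x)                 # donor computes K/V once ("cache once")
z, _ = shared(y, shared_kv=kv)   # shared layer attends over the donor's K/V
```

The same split explains the IOManager change: only donor layers need cache buffers allocated, since shared layers read the donor's K/V rather than writing their own.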
When num_kv_shared_layers=0 (the default), behavior is unchanged.
No C++ runtime changes needed.
Reviewed By: limintang, YIWENX14
Differential Revision: D97556018
2 files changed: 495 additions & 141 deletions