Add YOCO KV sharing support to StaticAttention (#18517)
meta-codesync[bot] merged 1 commit into pytorch:main
Conversation
🔗 Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18517
As of commit e08f97d with merge base b6824d1: ❌ 1 new failure, 2 unrelated failures.
NEW FAILURE: one new job failure on this PR.
BROKEN TRUNK: the unrelated jobs were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary:
Add YOCO (You Only Cache Once) support to StaticAttention so models with
num_kv_shared_layers > 0 can be transformed via
transform_attention_mha_to_static_attention and run correctly through the
static attention export pipeline.
Changes:
- StaticAttention.__init__: skip wks/wvs/caches for shared layers
- from_attention_mha: detect shared layers, preserve LoRA on wq/wo
- forward/_forward_mha/_forward_sha: accept shared_kv, skip K/V projection
- StaticAttentionIOManager: skip cache allocation for shared layers
- Tests: 6 new test methods covering shared/donor layers, LoRA, numerics
When num_kv_shared_layers=0 (default), behavior is completely unchanged.
No C++ runtime changes needed.
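The mechanism described above can be sketched in a few lines. This is a hypothetical, minimal illustration of YOCO-style KV sharing, not the actual `StaticAttention` implementation: the class name `SharedKVAttention`, its constructor arguments, and the `(output, kv)` return convention are all assumptions; only the idea (shared layers drop their K/V projections and reuse a donor layer's K/V via a `shared_kv` argument) comes from the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Hypothetical sketch of YOCO-style KV sharing (not the real
    StaticAttention API). A KV-shared layer keeps only wq/wo and reuses
    the K/V tensors produced by an earlier "donor" layer."""

    def __init__(self, dim: int, kv_shared: bool = False):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        self.kv_shared = kv_shared
        if not kv_shared:
            # Only donor layers own K/V projections (and, in a real
            # model, their KV caches); shared layers skip them entirely.
            self.wk = nn.Linear(dim, dim, bias=False)
            self.wv = nn.Linear(dim, dim, bias=False)

    def forward(self, x, shared_kv=None):
        q = self.wq(x)
        if self.kv_shared:
            # Skip the K/V projection; reuse the donor layer's tensors.
            k, v = shared_kv
        else:
            k, v = self.wk(x), self.wv(x)
        out = F.scaled_dot_product_attention(q, k, v)
        # Return K/V so later shared layers can consume them.
        return self.wo(out), (k, v)

dim, seq = 8, 4
x = torch.randn(1, seq, dim)
donor = SharedKVAttention(dim)
shared = SharedKVAttention(dim, kv_shared=True)
out, kv = donor(x)
out2, _ = shared(out, shared_kv=kv)  # shared layer reuses donor's K/V
```

With `kv_shared=False` (analogous to `num_kv_shared_layers=0`), the layer behaves as ordinary attention, which mirrors the backward-compatibility claim above.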
Reviewed By: limintang, YIWENX14
Differential Revision: D97556018