
Add YOCO KV sharing support to StaticAttention (#18517)

Merged
meta-codesync[bot] merged 1 commit into pytorch:main from billmguo:export-D97556018 on Mar 26, 2026
Conversation


@billmguo billmguo commented Mar 26, 2026

Summary:

Add YOCO (You Only Cache Once) support to StaticAttention so models with
num_kv_shared_layers > 0 can be transformed via
transform_attention_mha_to_static_attention and run correctly through the
static attention export pipeline.

Changes:

  • StaticAttention.__init__: skip wks/wvs/caches for shared layers
  • from_attention_mha: detect shared layers, preserve LoRA on wq/wo
  • forward/_forward_mha/_forward_sha: accept shared_kv, skip K/V projection
  • StaticAttentionIOManager: skip cache allocation for shared layers
  • Tests: 6 new test methods covering shared/donor layers, LoRA, numerics

When num_kv_shared_layers=0 (default), behavior is completely unchanged.
No C++ runtime changes needed.

Reviewed By: limintang, YIWENX14

Differential Revision: D97556018
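The summary above distinguishes "donor" layers (which own K/V projections and caches) from shared layers (which reuse a donor's K/V via a shared_kv argument). The following is a minimal, hypothetical PyTorch sketch of that pattern; it is not the actual ExecuTorch StaticAttention implementation, and the class and argument names mirror the summary only loosely.

```python
import torch
import torch.nn as nn


class StaticAttentionSketch(nn.Module):
    """Hypothetical sketch of YOCO-style KV sharing.

    Shared layers skip wk/wv entirely and consume (k, v) produced
    by a donor layer, mirroring the "skip K/V projection" bullet.
    """

    def __init__(self, dim, n_heads, head_dim, is_shared_layer=False):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = head_dim
        self.is_shared_layer = is_shared_layer
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)
        if not is_shared_layer:
            # Only donor layers own K/V projections (and, in a real
            # pipeline, their caches).
            self.wk = nn.Linear(dim, n_heads * head_dim, bias=False)
            self.wv = nn.Linear(dim, n_heads * head_dim, bias=False)

    def forward(self, x, shared_kv=None):
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        if self.is_shared_layer:
            # Reuse K/V from a donor layer instead of projecting.
            assert shared_kv is not None, "shared layer needs donor K/V"
            k, v = shared_kv
        else:
            k = self.wk(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.wv(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        # Return (k, v) so a downstream shared layer can consume it.
        return self.wo(out), (k, v)
```

With num_kv_shared_layers=0 every layer is a donor and shared_kv is never passed, which is consistent with the summary's note that the default behavior is unchanged.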

@billmguo billmguo requested a review from lucylq as a code owner March 26, 2026 05:57

pytorch-bot bot commented Mar 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18517

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit e08f97d with merge base b6824d1:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Mar 26, 2026

meta-codesync bot commented Mar 26, 2026

@billmguo has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97556018.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@limintang limintang self-requested a review March 26, 2026 06:01
@meta-codesync meta-codesync bot changed the title Add YOCO KV sharing support to StaticAttention Add YOCO KV sharing support to StaticAttention (#18517) Mar 26, 2026
billmguo added a commit to billmguo/executorch that referenced this pull request Mar 26, 2026
@meta-codesync meta-codesync bot merged commit 55f64c1 into pytorch:main Mar 26, 2026
158 of 163 checks passed

Labels

CLA Signed · fb-exported · meta-exported
