
Add YOCO KV sharing support to StaticAttention (#18517)

Merged
meta-codesync[bot] merged 1 commit into pytorch:main from billmguo:export-D97556018 on Mar 26, 2026
Conversation


@billmguo billmguo commented Mar 26, 2026

Summary:

Add YOCO (You Only Cache Once) support to StaticAttention so models with
num_kv_shared_layers > 0 can be transformed via
transform_attention_mha_to_static_attention and run correctly through the
static attention export pipeline.

Changes:

  • StaticAttention.__init__: skip wks/wvs/caches for shared layers
  • from_attention_mha: detect shared layers, preserve LoRA on wq/wo
  • forward/_forward_mha/_forward_sha: accept shared_kv, skip K/V projection
  • StaticAttentionIOManager: skip cache allocation for shared layers
  • Tests: 6 new test methods covering shared/donor layers, LoRA, numerics

When num_kv_shared_layers=0 (default), behavior is completely unchanged.
No C++ runtime changes needed.

Reviewed By: limintang, YIWENX14

Differential Revision: D97556018
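The summary above distinguishes "donor" layers (which own K/V projections and caches) from shared layers (which reuse a donor's K/V via a shared_kv argument). The following is a minimal, hypothetical PyTorch sketch of that pattern; it is not the actual ExecuTorch StaticAttention implementation, and the class and argument names mirror the summary only loosely.

```python
import torch
import torch.nn as nn


class StaticAttentionSketch(nn.Module):
    """Hypothetical sketch of YOCO-style KV sharing.

    Shared layers skip wk/wv entirely and consume (k, v) produced
    by a donor layer, mirroring the "skip K/V projection" bullet.
    """

    def __init__(self, dim, n_heads, head_dim, is_shared_layer=False):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = head_dim
        self.is_shared_layer = is_shared_layer
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)
        if not is_shared_layer:
            # Only donor layers own K/V projections (and, in a real
            # pipeline, their caches).
            self.wk = nn.Linear(dim, n_heads * head_dim, bias=False)
            self.wv = nn.Linear(dim, n_heads * head_dim, bias=False)

    def forward(self, x, shared_kv=None):
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        if self.is_shared_layer:
            # Reuse K/V from a donor layer instead of projecting.
            assert shared_kv is not None, "shared layer needs donor K/V"
            k, v = shared_kv
        else:
            k = self.wk(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
            v = self.wv(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        # Return (k, v) so a downstream shared layer can consume it.
        return self.wo(out), (k, v)
```

With num_kv_shared_layers=0 every layer is a donor and shared_kv is never passed, which is consistent with the summary's note that the default behavior is unchanged.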

@billmguo billmguo requested a review from lucylq as a code owner March 26, 2026 05:57

pytorch-bot bot commented Mar 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18517

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit e08f97d with merge base b6824d1:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Mar 26, 2026

meta-codesync bot commented Mar 26, 2026

@billmguo has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97556018.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@limintang limintang self-requested a review March 26, 2026 06:01
@meta-codesync meta-codesync bot changed the title Add YOCO KV sharing support to StaticAttention Add YOCO KV sharing support to StaticAttention (#18517) Mar 26, 2026
billmguo added a commit to billmguo/executorch that referenced this pull request Mar 26, 2026
@meta-codesync meta-codesync bot merged commit 55f64c1 into pytorch:main Mar 26, 2026
158 of 163 checks passed

Labels

CLA Signed · fb-exported · meta-exported
