Major speedup for model lowering: Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497)

Open: apullin wants to merge 1 commit into pytorch:main
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18497
d300c02 to 076fb18 (force-push)

apullin pushed a commit to apullin/executorch that referenced this pull request on Mar 25, 2026: "[executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (pytorch#18497)". The full commit message matches the PR summary below.
Subsequent force-pushes, each referencing this pull request with the same commit message: 076fb18 to edcba1c (Mar 25), edcba1c to a30f6a7 (Mar 25), 485f99d to 417b280 (Mar 25), ad7b73c to f401907 (Mar 26), f401907 to 897a09c (Mar 26), 897a09c to 575341f (Mar 30), 575341f to 9dd58e8 (Mar 30).
Summary:
Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.
## Changes

### 1. should_run() hook on ExportPass / ArmPass

Subclasses that declare a `targeted_ops` class attribute (a set of op overloads) can be skipped entirely when the graph contains none of their target ops. ArmPass provides a default implementation via inheritance.
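The hook can be sketched in plain Python. The class and attribute names other than `should_run`/`targeted_ops` are illustrative stand-ins for the torch.fx/ExecuTorch types, not the actual API:

```python
class Node:
    """Minimal stand-in for torch.fx.Node."""
    def __init__(self, op, target):
        self.op = op          # e.g. "call_function", "placeholder"
        self.target = target  # the op overload being called

class ExportPassSketch:
    targeted_ops = None  # subclasses may declare a set of op overloads

    def should_run(self, graph_nodes):
        # No declaration: be conservative and always run the pass.
        if self.targeted_ops is None:
            return True
        # Run only if some call_function node targets one of our ops.
        return any(
            n.op == "call_function" and n.target in self.targeted_ops
            for n in graph_nodes
        )

class RemoveCloneSketch(ExportPassSketch):
    targeted_ops = {"aten.clone.default"}

nodes = [Node("placeholder", "x"), Node("call_function", "aten.add.Tensor")]
print(RemoveCloneSketch().should_run(nodes))  # no clone ops -> False
```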
### 2. Fast-copy for cold nodes

When a pass declares `targeted_ops`, nodes whose ops are NOT in the set are copied into the new graph via `graph.node_copy()` instead of full FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes inserted by `call()` overrides before `super().call()`) fall back to full dispatch instead of propagating None.
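The routing decision, including the `val` guard, can be illustrated with simplified stand-ins (the names `run_node` and `meta["val"]` mirror the description above; these classes are not torch.fx internals):

```python
class Node:
    def __init__(self, op, target, meta=None):
        self.op = op
        self.target = target
        self.meta = meta or {}

def run_node(node, targeted_ops, full_dispatch, fast_copy):
    is_cold = (
        targeted_ops is not None
        and node.op == "call_function"
        and node.target not in targeted_ops
    )
    # Safety guard: a node without `val` metadata (e.g. one inserted by a
    # call() override before super().call()) must take the full dispatch
    # path so we never propagate a None FakeTensor value.
    if is_cold and "val" in node.meta:
        return fast_copy(node)   # cheap graph.node_copy()-style path
    return full_dispatch(node)   # full FakeTensor dispatch

hot = Node("call_function", "aten.clone.default", {"val": "fake"})
cold = Node("call_function", "aten.add.Tensor", {"val": "fake"})
no_val = Node("call_function", "aten.add.Tensor")

ops = {"aten.clone.default"}
route = lambda n: run_node(n, ops, full_dispatch=lambda _: "dispatch",
                           fast_copy=lambda _: "copy")
print(route(hot), route(cold), route(no_val))  # dispatch copy dispatch
```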
### 3. FakeTensor cache extension

Context manager `_extend_faketensor_cache_builtins()` temporarily extends the FakeTensor dispatch cache to cover ExecuTorch op namespaces (quantized_decomposed, tosa, dim_order_ops, cortex_m). This avoids redundant re-dispatches for non-builtin ops across 50+ passes.
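The temporary-extension pattern behind such a context manager looks like this. The set below is a stand-in for the real FakeTensor dispatch-cache state, which lives inside torch internals:

```python
from contextlib import contextmanager

CACHEABLE_NAMESPACES = {"aten", "prims"}  # stand-in for the builtin set

@contextmanager
def extend_cacheable_namespaces(extra):
    added = set(extra) - CACHEABLE_NAMESPACES
    CACHEABLE_NAMESPACES.update(added)
    try:
        yield
    finally:
        # Remove only what we added, so nesting and pre-existing
        # entries stay intact.
        CACHEABLE_NAMESPACES.difference_update(added)

with extend_cacheable_namespaces({"quantized_decomposed", "tosa"}):
    print("tosa" in CACHEABLE_NAMESPACES)   # True inside the block
print("tosa" in CACHEABLE_NAMESPACES)       # False afterwards
```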
### 4. __init_subclass__ auto-discovery on ArmPass

Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or `_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated automatically at class definition time; no manual annotation is needed.
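A minimal sketch of that auto-discovery, assuming the attribute conventions named above (the base class here is a stand-in for ArmPass, and the precedence order is illustrative):

```python
class ArmPassSketch:
    targeted_ops = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.__dict__.get("targeted_ops") is not None:
            return  # an explicit annotation wins
        # Single-set conventions first.
        for attr in ("_TARGET_OPS", "_supported_ops"):
            ops = getattr(cls, attr, None)
            if ops:
                cls.targeted_ops = set(ops)
                return
        # Combine edge/aten sets when either convention is present.
        edge = set(getattr(cls, "_EDGE_OPS", ()) or ())
        aten = set(getattr(cls, "_ATEN_OPS", ()) or ())
        if edge or aten:
            cls.targeted_ops = edge | aten

class FoldQDQ(ArmPassSketch):
    _TARGET_OPS = ("quantized_decomposed.quantize_per_tensor.default",)

print(FoldQDQ.targeted_ops)
```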
### 5. targeted_ops annotations on ~60 ARM passes

Each annotation is a one-liner declaring the ops the pass checks in `call_operator()`. Combined with should_run() and fast-copy, this achieves the measured speedup below.
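For illustration, a per-pass annotation might look like the following. The pass name and op string are hypothetical; real annotations use torch op-overload objects rather than strings:

```python
class ExportPassSketch:
    targeted_ops = None

class ConvertMeanDimSketch(ExportPassSketch):
    # One-liner annotation: the only op call_operator() ever rewrites.
    targeted_ops = {"aten.mean.dim"}

    def call_operator(self, op, args, kwargs):
        # With fast-copy enabled, only hot (targeted) nodes reach here.
        assert op in self.targeted_ops
        return ("sum_then_div", args, kwargs)

print(ConvertMeanDimSketch.targeted_ops)
```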
## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline). Graph: ~1200 nodes, 146 ExportPass invocations.

lower() before: 186 s
lower() after:  100 s
Passes skipped: 53 of 146
Delta: -86 s (-46%)
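As a sanity check, the quoted delta follows directly from the before/after timings:

```python
# Recompute the benchmark delta from the quoted lower() timings.
before, after = 186.0, 100.0
delta = after - before
print(f"{delta:+.0f} s ({delta / before:+.0%})")  # -86 s (-46%)
```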
Adds should_run() hook to ExportPass that subclasses can override to skip execution when a pass has no work to do. ArmPass implements a default that checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy instead of full FakeTensor dispatch for cold nodes in passes that declare targeted_ops. Per-node cost drops from ~0.4 ms to ~0.02 ms.
- _extend_faketensor_cache_builtins context manager that extends the FakeTensor dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
lower(): 185 s -> 96 s = -89 s (-48%)

Differential Revision: D97528110