Qualcomm AI Engine Direct - AOT Lowering Time Optimization#18516
abhinaykukkadapu merged 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18516
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally (2 unrelated failures). As of commit 9c901d0 with merge base cdbfbe3 (broken trunk: the following jobs failed but were also present on the merge base). 👉 Rebase onto the `viable/strict` branch to avoid these failures.
@winskuo-quic thanks for pushing the optimizations even further, this is awesome.
```diff
   std::uint32_t num_elements,
-  std::vector<float> scales,
-  std::vector<int32_t> offsets)
+  std::vector<float>& scales,
```
Nit: taking an lvalue reference and then moving from it silently steals the caller's data; maybe we can keep this by value and move explicitly at the caller site. Something like this?
```cpp
quantize_param_wrapper =
    std::make_unique<BwAxisScaleOffsetQuantizeParamsWrapper>(
        bitwidth, axis, num_elements,
        std::move(scales), std::move(offsets));  // explicit opt-in
```
@winskuo-quic any thoughts on this? We can merge after your response.
Hi @abhinaykukkadapu,
Sorry for the late response as I was OOO yesterday.
Thanks for reviewing the PR and providing the suggestions.
I have pushed a new commit that should address the issue you mentioned. Callers now decide whether they want to perform `std::move`.
Commits:
- observer quantizer optimization by ian
- Change some move to reference and lint
- improve block scale node_visitor logic
Force-pushed from 7c5b509 to 9c901d0 (Compare)
Summary
This PR targets optimizing AoT lowering time.
These changes should benefit all models, with larger improvements expected on LLM models.
We have done the following optimizations:

- Previously, the `min` and `max` values were computed in 2 separate operations. We have replaced `min` and `max` with the `aminmax` operation, reducing the computation by half.
  File changed: `backends/qualcomm/quantizer/observers/per_channel_param_observer.py`
- File changed: `backends/qualcomm/quantizer/observers/concat_observer.py`
- File changed: `backends/qualcomm/builders/node_visitor.py` (around lines 180-190)
- Files changed: `backends/qualcomm/builders/node_visitor.py` (line 176, where we create `scale_offset_arr`) and `backends/qualcomm/aot/python/PyQnnManagerAdaptor.cpp` (where we reuse the `py::array` buffer instead of copying it).

We have profiled 2 models to verify that the changes benefit the AoT lowering time (in seconds). We break it down into 2 stages: quantization time and compile time (the rest of the lowering).
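The `aminmax` change above can be illustrated with a minimal PyTorch sketch (not the actual observer code; the tensor shape and reduction dim here are hypothetical):

```python
import torch

x = torch.randn(8, 128)

# Before: two separate reductions, each traversing the tensor once
min_val = torch.min(x, dim=1).values
max_val = torch.max(x, dim=1).values

# After: one fused reduction that returns both extremes in a single pass
amin, amax = torch.aminmax(x, dim=1)

assert torch.equal(amin, min_val)
assert torch.equal(amax, max_val)
```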
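The `py::array` buffer reuse in `PyQnnManagerAdaptor.cpp` follows the same copy-versus-view distinction that NumPy exposes on the Python side; this sketch only illustrates the principle, not the actual binding code (`scales` is a hypothetical stand-in for the quantization buffer):

```python
import numpy as np

scales = np.arange(6, dtype=np.float32)  # stand-in for a scale/offset buffer

# Old path: copying allocates a second buffer and memcpys the data
copied = np.array(scales, copy=True)
assert not np.shares_memory(scales, copied)

# New path: a view wraps the caller's memory with no allocation or copy
view = scales.view()
assert np.shares_memory(scales, view)
```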
Specs:
QNN Version: 2.42
Max_seq_len: 1025
Prefill_ar_len: 128
Model Mode: hybrid
Calibration tasks: 1 wikitext and user prompt
Below is the flame graph comparison, showing improvements in quantization, op validation, and QNN preprocessing time.
Reproducing command for qwen3 0.6B:

```shell
py-spy record -o origin_qwen3_0_6b_profile.svg --subprocesses --native --rate 10 -- python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m SM8750 --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen3-0_6b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 -c -a origin_qwen3-0_6b 2>&1 | tee origin_qwen3-0_6b_log.txt
```

The labeled areas below are some bottlenecks that currently exist in mainline. This PR has either removed or significantly decreased the computation time in these areas.


Test plan
Passing all mainline tests
Author: @haowhsu-quic, @winskuo-quic