Qualcomm AI Engine Direct - AOT Lowering Time Optimization#18516

Merged
abhinaykukkadapu merged 2 commits into pytorch:main from CodeLinaro:dev1/winskuo/inter_threads
Mar 31, 2026
Conversation

@winskuo-quic (Collaborator) commented Mar 26, 2026

Summary

This PR targets optimizing AoT lowering time.
These changes should benefit all models, with larger improvements on LLM models.

We have made the following optimizations:

  1. Per-channel observer: QNN weights are static, so there is no need to recalculate the quantized weights during each calibration pass. We have overridden the forward method of per_channel_observer so that it performs weight calibration only once.
    File Changed: backends/qualcomm/quantizer/observers/per_channel_param_observer.py
  2. Concat observer: the concat observer used to calculate min and max values in two separate operations. We have replaced them with a single aminmax operation, cutting that computation in half.
    File Changed: backends/qualcomm/quantizer/observers/concat_observer.py
  3. GC collect: during node_visitor, calculating quant configs for block quant created many intermediate tensors, which triggered garbage collection. We have reduced the number of intermediate tensors so that GC is no longer triggered.
    File Changed: backends/qualcomm/builders/node_visitor.py, around lines 180-190
  4. Data copied in pybind: We noticed that validating and building convolution ops can take up to 5-10% of the AoT lowering time, depending on context length. This was because we created a list in Python and cast it to a vector in pybind, which causes a data copy. With the new approach, we create a numpy array and reuse its buffer directly in pybind, which prevents the copy.
    Files Changed:
    backends/qualcomm/builders/node_visitor.py, line 176, where we create scale_offset_arr
    backends/qualcomm/aot/python/PyQnnManagerAdaptor.cpp, where we reuse the py::array buffer instead of copying
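The caching idea in item 1 can be sketched in plain Python. This is a minimal illustration under assumed names (`CachedPerChannelObserver`, `reductions`, and the list-of-lists input are all hypothetical), not the actual ExecuTorch observer, which operates on torch tensors:

```python
# Minimal sketch of item 1: since QNN weights are static, the per-channel
# observer can run the expensive min/max reduction once and serve cached
# results on every later calibration pass. Names are illustrative, not the
# real per_channel_param_observer API.
class CachedPerChannelObserver:
    def __init__(self):
        self._calibrated = False
        self._min_vals = None
        self._max_vals = None
        self.reductions = 0  # counts how often the expensive path runs

    def forward(self, weight_rows):
        # weight_rows: one list of values per output channel
        if not self._calibrated:
            self.reductions += 1
            self._min_vals = [min(row) for row in weight_rows]
            self._max_vals = [max(row) for row in weight_rows]
            self._calibrated = True
        return self._min_vals, self._max_vals
```

Repeated calibration passes now return the cached per-channel bounds instead of re-reducing weights that never change.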
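Item 2's change (the real code swaps separate min/max calls for torch.aminmax) amounts to fusing two reductions into a single pass over the data; a plain-Python sketch of the difference:

```python
# Item 2 in plain Python: two separate reductions traverse the data twice,
# while a fused min/max traverses it once. The actual change in
# concat_observer.py uses torch.aminmax; these helpers just illustrate
# why the computation is roughly halved.
def two_pass_min_max(xs):
    return min(xs), max(xs)  # two full traversals

def single_pass_min_max(xs):
    lo = hi = xs[0]
    for x in xs[1:]:         # one traversal tracks both bounds
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi
```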
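The zero-copy idea in item 4 relies on the buffer protocol: a Python list exposes no contiguous underlying buffer, so pybind11 has to copy its elements into a std::vector, whereas a numpy array exposes raw memory that py::array can wrap directly. A stdlib-only sketch (using array.array, which also implements the buffer protocol; the values are illustrative, not real scale/offset data):

```python
import array

# Item 4's zero-copy idea. A contiguous typed buffer like this one can be
# handed to C++ as a pointer + length, with no per-element conversion;
# a plain Python list would force pybind11 to build a std::vector copy.
scale_offset = array.array('f', [0.5, 0.0, 0.25, 0.0])  # illustrative values

view = memoryview(scale_offset)
assert view.c_contiguous            # one contiguous C-order block
assert view.itemsize == 4           # float32 elements
assert view.nbytes == 4 * len(scale_offset)
```

In the PR itself the Python side builds a numpy array (scale_offset_arr) and PyQnnManagerAdaptor.cpp reads the py::array buffer in place instead of copying.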

We profiled two models to verify that the changes reduce AoT lowering time (in seconds). We break it down into two stages: quantization time and compile time (the rest of the lowering).

Specs:
QNN Version: 2.42
Max_seq_len: 1025
Prefill_ar_len: 128
Model Mode: hybrid
Calibration tasks: 1 wikitext and user prompt
| Model | Mainline Quantize (Sec) | Mainline Compile (Sec) | Optimized Quantize (Sec) | Optimized Compile (Sec) | Quantize Improvement (%) | Compile Improvement (%) | Total Improvement (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3 0.6B | 548 | 929 | 469 | 803 | 14.4% | 13.6% | 13.9% |
| Qwen 3 1.7B | 1200 | 1536 | 894 | 964 | 25.5% | 37.24% | 32.09% |
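The improvement columns follow directly from the raw timings; a quick sanity check of the Qwen 3 1.7B row:

```python
# Recomputing the improvement columns from the raw timings in the table.
def improvement(before, after):
    return round((before - after) / before * 100, 2)

# Qwen 3 1.7B row
assert improvement(1200, 894) == 25.5                 # quantize
assert improvement(1536, 964) == 37.24                # compile
assert improvement(1200 + 1536, 894 + 964) == 32.09   # total
```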

Below is the flame graph comparison, showing improvements in quantization, op validation, and QNN preprocessing time.
Reproducing command for Qwen3 0.6B:
py-spy record -o origin_qwen3_0_6b_profile.svg --subprocesses --native --rate 10 -- python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m SM8750 --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen3-0_6b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 -c -a origin_qwen3-0_6b 2>&1 | tee origin_qwen3-0_6b_log.txt

The labeled areas below are bottlenecks that currently exist in mainline. This PR has either removed or significantly reduced the computation time in these areas.
[Flame graph screenshots: mainline vs. optimized]

Test plan

Passing all mainline tests

Author: @haowhsu-quic, @winskuo-quic


pytorch-bot bot commented Mar 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18516

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9c901d0 with merge base cdbfbe3:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 26, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@abhinaykukkadapu
Contributor

@winskuo-quic thanks for pushing the optimizations even further, this is awesome.

    std::uint32_t num_elements,
    std::vector<float> scales,       // before: passed by value
    std::vector<int32_t> offsets)
    std::vector<float>& scales,      // after: lvalue reference
Contributor
Nit: taking an lvalue reference and then moving from it silently steals the caller's data; maybe we can keep this parameter by value and move explicitly at the caller site. Something like this?

  quantize_param_wrapper =                                                                                                                   
      std::make_unique<BwAxisScaleOffsetQuantizeParamsWrapper>(
          bitwidth, axis, num_elements,
          std::move(scales), std::move(offsets));  // explicit opt-in 

Contributor
@winskuo-quic any thoughts on this, we can merge after your response.

Collaborator Author

Hi @abhinaykukkadapu,
Sorry for the late response as I was OOO yesterday.
Thanks for reviewing the PR and providing the suggestions.
I have pushed a new commit that should address the issue you mentioned. Callers now decide whether they want to perform std::move.

observer quantizer optimization by ian

Change some move to reference and lint

improve block scale node_visitor logic
@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/inter_threads branch from 7c5b509 to 9c901d0 Compare March 31, 2026 05:54
@abhinaykukkadapu abhinaykukkadapu merged commit aa4e489 into pytorch:main Mar 31, 2026
158 of 160 checks passed
Jiseong-oh pushed a commit to Jiseong-oh/executorch that referenced this pull request Apr 2, 2026
