Qualcomm AI Engine Direct - AOT Lowering Time Optimization#18516
abhinaykukkadapu merged 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18516
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally (2 unrelated failures). As of commit 9c901d0 with merge base cdbfbe3 (broken trunk: the following jobs failed but were also present on the merge base). 👉 Rebase onto the `viable/strict` branch to avoid these failures.
@winskuo-quic thanks for pushing the optimizations even further, this is awesome.
```diff
   std::uint32_t num_elements,
-  std::vector<float> scales,
-  std::vector<int32_t> offsets)
+  std::vector<float>& scales,
```
Nit: taking an lvalue reference and then moving from it silently steals the caller's data; maybe we can keep this by value and move explicitly at the caller site. Something like this?
```cpp
quantize_param_wrapper =
    std::make_unique<BwAxisScaleOffsetQuantizeParamsWrapper>(
        bitwidth, axis, num_elements,
        std::move(scales), std::move(offsets));  // explicit opt-in
```
@winskuo-quic any thoughts on this? We can merge after your response.
Hi @abhinaykukkadapu,
Sorry for the late response as I was OOO yesterday.
Thanks for reviewing the PR and providing the suggestions.
I have pushed a new commit that should address the issue you mentioned. Callers now decide whether they want to perform `std::move`.
Commits:
- observer quantizer optimization by ian
- Change some move to reference and lint
- improve block scale node_visitor logic
Force-pushed from 7c5b509 to 9c901d0 (Compare)
Summary
This PR targets optimizing AoT lowering time.
These changes should benefit all models, with larger improvements expected on LLM models.
We have done the following optimizations:

- Previously, the `min` and `max` values were computed in 2 separate operations. We have replaced `min` and `max` with the `aminmax` operation, reducing the computation by half.
  File changed: `backends/qualcomm/quantizer/observers/per_channel_param_observer.py`
- File changed: `backends/qualcomm/quantizer/observers/concat_observer.py`
- File changed: `backends/qualcomm/builders/node_visitor.py` (around lines 180-190)
- Files changed: `backends/qualcomm/builders/node_visitor.py` (line 176, where we create `scale_offset_arr`) and `backends/qualcomm/aot/python/PyQnnManagerAdaptor.cpp` (where we reuse the `py::array` buffer instead of copying it).

We have profiled 2 models to verify that the changes benefit the AoT lowering time (in seconds). We break it down into 2 stages: quantization time and compile time (the rest of the lowering).
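The `aminmax` change above can be illustrated with a minimal PyTorch sketch (not the actual observer code; the tensor shape and reduction dim here are hypothetical):

```python
import torch

x = torch.randn(8, 128)

# Before: two separate reductions, each traversing the tensor once
min_val = torch.min(x, dim=1).values
max_val = torch.max(x, dim=1).values

# After: one fused reduction that returns both extremes in a single pass
amin, amax = torch.aminmax(x, dim=1)

assert torch.equal(amin, min_val)
assert torch.equal(amax, max_val)
```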
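The `py::array` buffer reuse in `PyQnnManagerAdaptor.cpp` follows the same copy-versus-view distinction that NumPy exposes on the Python side; this sketch only illustrates the principle, not the actual binding code (`scales` is a hypothetical stand-in for the quantization buffer):

```python
import numpy as np

scales = np.arange(6, dtype=np.float32)  # stand-in for a scale/offset buffer

# Old path: copying allocates a second buffer and memcpys the data
copied = np.array(scales, copy=True)
assert not np.shares_memory(scales, copied)

# New path: a view wraps the caller's memory with no allocation or copy
view = scales.view()
assert np.shares_memory(scales, view)
```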
Specs:
QNN Version: 2.42
Max_seq_len: 1025
Prefill_ar_len: 128
Model Mode: hybrid
Calibration tasks: 1 wikitext and user prompt
Below is the flame graph comparison, showing improvements in quantization, op validation, and QNN preprocessing time.
Reproducing command for qwen3 0.6B:

```shell
py-spy record -o origin_qwen3_0_6b_profile.svg --subprocesses --native --rate 10 -- python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m SM8750 --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen3-0_6b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 -c -a origin_qwen3-0_6b 2>&1 | tee origin_qwen3-0_6b_log.txt
```

The labeled areas below are some bottlenecks that currently exist in mainline. This PR has either removed or significantly decreased the computation time in these areas.


Test plan
Passing all mainline tests
Author: @haowhsu-quic, @winskuo-quic