🐛 Describe the bug
Qwen3-0.6B produces token id 0 every decode step through QNN HTP, at both 16a4w and 8a8w. Same weights/device/binary via XNNPACK produces correct output.
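A quick sanity check that may help triage: under a first-index-wins greedy argmax, token id 0 is what comes out when the logits are all NaN or all identical. The sketch below is an assumption-laden illustration, not the ExecuTorch sampler. `greedy_argmax` and `classify_logits` are hypothetical helper names, and the logits are assumed to be dumped as a flat list of floats for one decode step.

```python
import math
from typing import List


def greedy_argmax(logits: List[float]) -> int:
    """Plain greedy pick. Comparisons with NaN are always False, so an
    all-NaN vector returns 0; an all-equal vector also returns 0."""
    best = 0
    for i, value in enumerate(logits):
        if value > logits[best]:
            best = i
    return best


def classify_logits(logits: List[float]) -> str:
    """Label one step's logits (hypothetical labels for triage only)."""
    if any(math.isnan(v) for v in logits):
        return "nan"
    if any(math.isinf(v) for v in logits):
        return "inf"
    if max(logits) == min(logits):
        return "constant"
    return "ok"


if __name__ == "__main__":
    step_logits = [float("nan")] * 8  # stand-in for a dumped HTP logits vector
    print(classify_logits(step_logits), greedy_argmax(step_logits))
```

If the HTP logits classify as "nan" or "constant", the degenerate output is likely produced upstream of sampling; if they classify as "ok", the sampler or output dequantization would be worth a look.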
Environment
Snapdragon XR2 Gen 2 (SXR2230P), HTP v69, QNN SDK 2.37, Quest 3, ExecuTorch main.
Same-device A/B
XNNPACK and QNN .pte exported from the same edge program. Weights, tokenizer, and runner binary are all identical.
- XNNPACK (CPU, same device): ✅ Correct tool call, stops on EOS
- QNN HTP (qnn_8a8w): ❌ Token id 0 every step, never stops
- QNN HTP (qnn_16a4w): ❌ Same
Ruled out
- Quantization: 8a8w (near-lossless) is equally broken
- Model def / weights / export: shared with the XNNPACK path, which is correct
- Softmax: forced all 28 softmax ops to CPU, still degenerate
Qwen3 features that differ from Llama (suspects)
Gemma 3 (1B) shares qk-norm and head_dim mismatch; testing on HTP to narrow further.
Ask
Per-layer intermediate tensor diff (QNN vs CPU reference, fixed input) to find the first diverging op. Op-skipping can't isolate further — skipping RMSNorm/RoPE
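To make the requested comparison concrete, here is a minimal sketch of the diff step, assuming both runs can dump intermediate outputs as flat float lists keyed by op name in execution order. The tensor names, the `first_divergence` helper, and the 0.99 cosine threshold are all hypothetical; how the tensors are captured on each backend is not covered here.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero vectors match; zero vs non-zero does not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def first_divergence(
    reference: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    min_cosine: float = 0.99,  # assumed threshold, tune per quantization scheme
) -> Optional[Tuple[str, float]]:
    """Walk ops in the reference's insertion order and return the first
    (name, cosine) that is missing, mis-sized, non-finite, or below threshold."""
    for name, ref in reference.items():
        cand = candidate.get(name)
        if cand is None or len(cand) != len(ref):
            return name, float("nan")
        if any(math.isnan(v) or math.isinf(v) for v in cand):
            return name, float("nan")
        cos = cosine_similarity(ref, cand)
        if not cos >= min_cosine:  # also catches a NaN cosine
            return name, cos
    return None


if __name__ == "__main__":
    # Toy data only; real dumps would come from the CPU and HTP runs.
    cpu = {"embed": [1.0, 2.0, 3.0], "layers.0.q_norm": [0.5, -0.5, 1.0]}
    htp = {"embed": [1.0, 2.0, 3.01], "layers.0.q_norm": [-0.5, 0.5, -1.0]}
    print(first_divergence(cpu, htp))
```

Cosine similarity is used rather than an absolute tolerance because quantized outputs are expected to differ slightly in magnitude; a sharp drop at one op is the signal of interest.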
Versions
Will attach logs and repro configs in comments.
cc @cccclai @cbilgin @abhinaykukkadapu