Qwen3-0.6B: QNN HTP produces degenerate output at all precisions (XNNPACK correct on same device) #20168

@psiddh

🐛 Describe the bug

Qwen3-0.6B emits token id 0 at every decode step when run through QNN HTP, at both 16a4w and 8a8w. The same weights, device, and runner binary produce correct output via XNNPACK.
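If the runner decodes greedily, a constant token id 0 is what many argmax implementations return when the logits are all equal or all NaN, so the raw logits may be worth inspecting before sampling. The following is a minimal sketch, assuming per-step logits can be dumped as plain lists of floats; `argmax`, `describe_step`, and `triage` are hypothetical helpers, not part of ExecuTorch or the QNN SDK.

```python
import math
from typing import List, Sequence, Tuple


def argmax(logits: Sequence[float]) -> int:
    """Greedy pick; ties (and NaN comparisons) fall back to the lowest index."""
    best = 0
    for i, value in enumerate(logits):
        if value > logits[best]:  # always False when either side is NaN
            best = i
    return best


def describe_step(logits: Sequence[float]) -> str:
    """Classify one step's logits as 'nan', 'inf', 'constant', or 'ok'."""
    if any(math.isnan(v) for v in logits):
        return "nan"
    if any(math.isinf(v) for v in logits):
        return "inf"
    if max(logits) == min(logits):
        return "constant"
    return "ok"


def triage(steps: List[Sequence[float]]) -> List[Tuple[int, str]]:
    """Return (picked_token_id, classification) for every decode step."""
    return [(argmax(step), describe_step(step)) for step in steps]


# Toy data only: three decode steps over a 4-token vocabulary.
nan = float("nan")
steps = [
    [0.1, 2.5, -1.0, 0.3],   # healthy logits
    [0.0, 0.0, 0.0, 0.0],    # all-equal logits
    [nan, nan, nan, nan],    # all-NaN logits
]
print(triage(steps))
```

Any step reported as `nan`, `inf`, or `constant` would point at the logits themselves rather than the sampler.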

Environment

Snapdragon XR2 Gen 2 (SXR2230P), HTP v69, QNN SDK 2.37, Quest 3, ExecuTorch main.

Same-device A/B

XNNPACK and QNN .pte files were exported from the same edge program. Weights, tokenizer, and runner binary are all identical.

  • XNNPACK (CPU, same device): ✅ Correct tool call, stops on EOS
  • QNN HTP (qnn_8a8w): ❌ Token id 0 every step, never stops
  • QNN HTP (qnn_16a4w): ❌ Same

Ruled out

  • Quantization: 8a8w (near-lossless) is equally broken
  • Model def / weights / export: shared with the XNNPACK path, which is correct
  • Softmax: forced all 28 softmax ops to CPU, still degenerate

Qwen3 features that differ from Llama (suspects)

Gemma 3 (1B) shares qk-norm and head_dim mismatch; testing on HTP to narrow further.

Ask

Per-layer intermediate tensor diff (QNN vs CPU reference, fixed input) to find the first diverging op. Op-skipping can't isolate further — skipping RMSNorm/RoPE
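The requested per-layer diff could look roughly like the sketch below. It assumes the intermediate outputs of both runs are already available as flattened float lists keyed by op name, in execution order; how they are captured is left open. The function names, the op names in the toy data, and the 0.99 cosine threshold are all illustrative assumptions.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two flattened tensors; NaN if any value is non-finite."""
    if any(not math.isfinite(x) for x in a) or any(not math.isfinite(x) for x in b):
        return float("nan")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero tensors match; a single all-zero tensor does not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def first_diverging_op(
    op_order: List[str],
    reference: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
    min_cosine: float = 0.99,
) -> Optional[Tuple[str, float]]:
    """Return (op_name, cosine) for the first op whose outputs disagree, else None."""
    for name in op_order:
        ref, cand = reference[name], candidate[name]
        if len(ref) != len(cand):
            return name, float("nan")
        score = cosine_similarity(ref, cand)
        # NaN compares False against any threshold, so test it explicitly.
        if math.isnan(score) or score < min_cosine:
            return name, score
    return None


# Toy data only: hypothetical op names, CPU reference vs. a collapsed HTP run.
ops = ["embed", "rmsnorm_0", "q_proj_0"]
cpu = {"embed": [0.1, 0.2, 0.3], "rmsnorm_0": [1.0, -1.0, 0.5], "q_proj_0": [0.4, 0.0, -0.2]}
htp = {"embed": [0.1, 0.2, 0.3], "rmsnorm_0": [0.0, 0.0, 0.0], "q_proj_0": [0.0, 0.0, 0.0]}
print(first_diverging_op(ops, cpu, htp))  # ('rmsnorm_0', 0.0)
```

Cosine similarity is used here because it tolerates the scale drift that quantization introduces; a max-absolute-error check could be swapped in if exact magnitudes matter.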

Versions

Will attach logs and repro configs in comments.

cc @cccclai @cbilgin @abhinaykukkadapu

Labels: module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/)