Qwen3-0.6B: QNN HTP produces degenerate output at all precisions (XNNPACK correct on same device)

### 🐛 Describe the bug


 
  Qwen3-0.6B produces token id 0 at every decode step when run through QNN HTP, at both 16a4w and 8a8w. The same weights, device, and runner binary produce correct output via XNNPACK.

  Environment

  Snapdragon XR2 Gen 2 (SXR2230P), HTP v69, QNN SDK 2.37, Quest 3, ExecuTorch main.

  Same-device A/B

  XNNPACK and QNN .pte exported from the same edge program. Weights, tokenizer,
  runner binary all identical.

  - XNNPACK (CPU, same device): ✅ Correct tool call, stops on EOS
  - QNN HTP (qnn_8a8w): ❌ Token id 0 every step, never stops
  - QNN HTP (qnn_16a4w): ❌ Same

  Ruled out

  - Quantization: 8a8w (near-lossless) is equally broken
  - Model def / weights / export: shared with the XNNPACK path, which is correct
  - Softmax: forced all 28 softmax ops to CPU, still degenerate

  Qwen3 features that differ from Llama (suspects)

  Gemma 3 (1B) shares qk-norm and the head_dim mismatch; testing it on HTP to narrow further.

  Ask

 Per-layer intermediate tensor diff (QNN vs CPU reference, fixed input) to find the first diverging op. Op-skipping can't isolate further — skipping RMSNorm/RoPE
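
For reference, a minimal sketch of the kind of comparison being asked for. It assumes the intermediate tensors from both backends have already been dumped as flat lists of floats keyed by op name, in execution order. The op names, the `first_divergence` helper, and the 0.99 cosine threshold are all hypothetical and chosen for illustration. How the tensors get captured on the QNN side is exactly the open question and is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero tensors count as identical; a one-sided zero as diverged.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def first_divergence(reference, candidate, order, min_cosine=0.99):
    """Return (op_name, cosine) for the first op whose candidate output
    diverges from the reference, or None if every op matches.

    A candidate tensor containing NaN or Inf is reported with cosine = NaN.
    The threshold is an assumption and likely needs tuning per precision.
    """
    for name in order:
        ref, cand = reference[name], candidate[name]
        if len(ref) != len(cand):
            raise ValueError(f"shape mismatch at {name}: {len(ref)} vs {len(cand)}")
        if any(math.isnan(v) or math.isinf(v) for v in cand):
            return name, float("nan")
        score = cosine_similarity(ref, cand)
        if score < min_cosine:
            return name, score
    return None


# Toy data with hypothetical op names, standing in for real dumps.
order = ["embedding", "layer0.rmsnorm", "layer0.attention"]
cpu = {
    "embedding": [0.1, -0.2, 0.3],
    "layer0.rmsnorm": [1.0, 2.0, 3.0],
    "layer0.attention": [0.5, 0.5, 0.5],
}
htp = {
    "embedding": [0.1, -0.2, 0.31],      # small quantization-like error
    "layer0.rmsnorm": [3.0, -1.0, 0.0],  # clearly wrong direction
    "layer0.attention": [0.0, 0.0, 0.0],
}

result = first_divergence(cpu, htp, order)
print(result[0])  # -> layer0.rmsnorm
```

Cosine similarity is used here because it tolerates the small scale errors that quantization introduces while still catching outputs that point in the wrong direction. A max-abs-error or SQNR check could be swapped in.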

### Versions

TBD. Logs and repro configs will be attached in the comments.

cc @cccclai @cbilgin @abhinaykukkadapu

Qwen3-0.6B: QNN HTP produces degenerate output at all precisions (XNNPACK correct on same device) #20168
