Qwen3-TTS: end-to-end XNNPACK text-to-speech bring-up #18508
Draft
seyeong-han wants to merge 6 commits into pytorch:main from
Conversation
Implement a conversion/export/runtime path for the Qwen3-TTS speech tokenizer decoder with XNNPACK on CPU: weight conversion from HF snapshots, static-shape export, codec generation helper, and a C++ runner that decodes codec ids to WAV output. Made-with: Cursor
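The weight conversion step maps Hugging Face checkpoint keys into the layout the runner expects. A minimal sketch of such a key-remapping pass, where the concrete rename patterns are illustrative and not the PR's actual mapping table:

```python
import re

# Illustrative HF -> Meta/Llama-style key renames; the real table in the
# conversion script may differ. Each entry is (regex pattern, replacement).
_RENAMES = [
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.", r"layers.\1.attention.wq."),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.", r"layers.\1.attention.wk."),
    (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"layers.\1.feed_forward.w1."),
    (r"^model\.embed_tokens\.", r"tok_embeddings."),
]

def convert_key(hf_key: str) -> str:
    """Apply the first matching rename; pass unknown keys through unchanged."""
    for pattern, repl in _RENAMES:
        new_key, n = re.subn(pattern, repl, hf_key)
        if n:
            return new_key
    return hf_key

def convert_state_dict(hf_sd: dict) -> dict:
    """Remap every key of a Hugging Face state dict."""
    return {convert_key(k): v for k, v in hf_sd.items()}
```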
**Decoder performance:** export at multiple fixed `codes_len` buckets (75/150/300/600/1200) instead of a single 1200. The runner selects the smallest bucket that fits the input, reducing vocoder padding waste from 13x to 1.6x for typical inputs. Measured 10.5x decode speedup (32.4s → 3.1s for 91 codes, 8da4w XNNPACK CPU).

**Talker export:** reuse the existing Llama/Qwen3 infrastructure to export the talker backbone (28-layer transformer) and code predictor (5-layer) as .pte models with static KV cache and 8da4w quantization. Weight conversion maps the HF talker checkpoint to Meta/Llama format. Talker runs at 64ms/step, code predictor at 7.2ms/step on CPU.

**Streaming decode:** interleave code generation with incremental vocoder decoding in 25-code chunks, yielding first audio at 2.15s instead of waiting for all codes (3.97s non-streaming, 32.4s old baseline).

This PR was authored with Claude.
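The bucket-selection logic the runner uses can be sketched as follows; the bucket sizes come from the PR description, but the helper name is hypothetical:

```python
import bisect

# Fixed codes_len buckets the decoder was exported at (from the PR description).
BUCKETS = [75, 150, 300, 600, 1200]

def select_bucket(codes_len: int) -> int:
    """Return the smallest exported bucket that fits the input."""
    i = bisect.bisect_left(BUCKETS, codes_len)
    if i == len(BUCKETS):
        raise ValueError(f"{codes_len} codes exceed the largest bucket {BUCKETS[-1]}")
    return BUCKETS[i]

# For the 91-code example in the PR: a single 1200 bucket pads ~13x
# (1200 / 91), while the 150 bucket pads only ~1.6x (150 / 91).
```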
Replaces the multi-bucket decoder-only pipeline with a single .pte file containing all 6 pipeline stages (encode_text, talker, code_predictor, codec_embed, cp_head, decode_audio), following the Parakeet multi-method export pattern.

Key changes:
- export_unified.py: multi-method export with per-component quantization, dynamic-shape decoder (patched CausalConvNet for SymInt compat), and embedding quantization support (--qembedding 4w/8w)
- qwen3_tts_unified_runner: C++ runner with lazy method loading, XNNPACK warmup, automatic silence trimming, and decode-only backward compat
- generate_codes.py: added --trim-silence to strip the conditioning prefix

Model sizes: 1.0 GB (4w emb) / 1.2 GB (8w emb) / 2.1 GB (no emb quant)
Decode perf: 2.0s for 91 codes (3.6x realtime) after XNNPACK warmup

Authored with Claude.
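Lazy method loading in the runner amounts to loading each of the six methods from the program only on first use and reusing it afterwards. A minimal Python-shaped sketch of that caching pattern; the `load_fn` hook and names are illustrative, not the runner's actual API:

```python
class LazyMethodCache:
    """Load each .pte method on first call and reuse it afterwards."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # stand-in for the runtime's load-method call
        self._methods = {}

    def get(self, name: str):
        if name not in self._methods:
            self._methods[name] = self._load_fn(name)
        return self._methods[name]

# Count how often the (mock) loader actually runs.
loads = []
cache = LazyMethodCache(lambda name: loads.append(name) or f"<{name}>")
cache.get("decode_audio")
cache.get("decode_audio")   # second call hits the cache, no reload
```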
Teach the unified runner and export path to mirror the MLX reference for dynamic text prompts, sampling behavior, and English codec prefix handling so XNNPACK text synthesis stays coherent end to end. Add contract tests, checked-in manifests, and small export compatibility shims so the single-PTE workflow remains reproducible. Made-with: Cursor
Replace the greedy-only unrolled cp_generate export with a sampling-aware v2 contract that performs inverse-CDF top-k(50) sampling inside the fused XNNPACK graph, collapsing 15 host-side sub-code round trips into one call. Add a persistent SynthesisSession with per-session RNG so the runner stays loaded and warm across sequential prompts. Extend main_unified.cpp with --prompts_path, --repeat, --seed, and --disable_fused_cp_generate flags for multi-prompt warm benchmarking with generation-only timing breakdowns.

The runner gates the fast path on exported metadata (contract version, top_k match, temperature threshold) and falls back to the legacy host-side sub-code loop for older .pte artifacts or unsupported sampler modes. Warm benchmark results show the fused path reduces per-step codegen cost by ~15-20% compared to the legacy loop on the same XNNPACK artifact.

Generated with assistance from Claude. Made-with: Cursor
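Inverse-CDF top-k sampling as described above can be sketched in plain Python. This mirrors the idea, not the exported graph; the function name and signature are illustrative:

```python
import math
import random

def topk_inverse_cdf_sample(logits, k=50, temperature=1.0, u=None):
    """Sample an index from the top-k of `logits` by inverting the CDF."""
    # Restrict to the k highest-scoring indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the top-k, with max-subtraction for numerical stability.
    m = max(logits[i] for i in top)
    weights = [math.exp((logits[i] - m) / temperature) for i in top]
    total = sum(weights)
    if u is None:
        u = random.random()          # a per-session RNG would live here
    # Walk the (unnormalized) CDF until it crosses the scaled uniform draw.
    target, acc = u * total, 0.0
    for i, w in zip(top, weights):
        acc += w
        if acc >= target:
            return i
    return top[-1]
```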
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18508
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit f45bb2b with merge base 518daa8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Bring the unified runner closer to upstream streaming behavior, add reproducible contract and benchmark coverage, and capture the XNNPACK, MLX, and hybrid Metal findings in-repo so the next round of performance work starts from a verified baseline. Made-with: Cursor
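The streaming behavior referenced here interleaves code generation with incremental vocoder decoding in fixed-size chunks (25 codes, per the earlier commit). A minimal generator-shaped sketch, with `decode_chunk` standing in for the vocoder call:

```python
CHUNK = 25  # codes per incremental decode, per the PR description

def stream_decode(code_iter, decode_chunk, chunk_size=CHUNK):
    """Yield audio as soon as each chunk of codec ids is available."""
    buf = []
    for code in code_iter:
        buf.append(code)
        if len(buf) == chunk_size:
            yield decode_chunk(buf)   # first audio before generation finishes
            buf = []
    if buf:                           # flush the final partial chunk
        yield decode_chunk(buf)

# e.g. 91 codes -> 4 decode calls of 25 + 25 + 25 + 16 codes
chunk_sizes = list(stream_decode(range(91), lambda b: len(b)))
```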
Summary
Add a complete Qwen3-TTS text-to-speech pipeline for ExecuTorch with XNNPACK backend, from model export through C++ inference to WAV output.
- Unified export (`export_unified.py`): packages `encode_text`, `talker`, `code_predictor`, `codec_embed`, `cp_head`, `cp_generate`, and `decode_audio` into one `model.pte` with 8da4w quantization and optional embedding quantization (4w/8w)
- `cp_generate` v2: collapses the 15-step host-side sub-code loop into a single XNNPACK graph call with inverse-CDF top-k(50) sampling, reducing per-step codegen cost by ~15-20%
- `SynthesisSession` API: keeps the runner warm across sequential prompts with per-session RNG, detailed `SynthesisTiming` breakdowns (prompt prep, talker prefill, codegen, decode-audio), and automatic fast-path/legacy-fallback gating based on exported contract metadata
- `--prompts_path`, `--repeat`, `--seed`, `--disable_fused_cp_generate` flags for honest generation-only latency measurement without startup tax

Architecture
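A minimal sketch of the metadata gating that selects the fused path versus the legacy host-side loop; the field names and the exact temperature condition are assumptions, not the PR's actual metadata schema:

```python
def use_fused_cp_generate(meta: dict, top_k: int, temperature: float,
                          disabled: bool = False) -> bool:
    """Gate the fast path on exported contract metadata.

    Falls back to the legacy host-side sub-code loop for older .pte
    artifacts or sampler settings the export does not support.
    """
    if disabled:                                 # --disable_fused_cp_generate
        return False
    return (
        meta.get("cp_generate_contract") == 2    # assumed: sampling-aware v2
        and meta.get("top_k") == top_k           # sampler must match the export
        and temperature > 0.0                    # assumed threshold: greedy
    )                                            # decoding takes the legacy path
```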
Benchmark (Apple M-series, XNNPACK 8da4w, warm single process)
Commits (review order)
- 53ab54c — Initial XNNPACK bring-up: model wrappers, weight conversion, decode-only runner
- aa37d0f — Multi-bucket decoder, talker export, streaming decode scaffolding
- 510c0ff — Unified single-PTE export and C++ runner with text synthesis
- e3ddd29 — Align text synthesis with MLX reference semantics (English prefix, sampling, EOS)
- 498b6d2 — Fused cp_generate v2, SynthesisSession API, warm benchmark tooling

Test plan
- `python -m unittest` passes all 4 test modules (28 tests)
- `cmake --build cmake-out/examples/models/qwen3-tts`
- `model.pte` exported from updated `export_unified.py` (no missing ops)
- `--text "..." --max_new_tokens 247`
- `--text "..." --top_k 50`
- `--prompts_path`

Generated with assistance from Claude.
Made with Cursor