Qwen3-TTS: end-to-end XNNPACK text-to-speech bring-up #18508
Draft
seyeong-han wants to merge 6 commits into pytorch:main from
Conversation
Implement a conversion/export/runtime path for the Qwen3-TTS speech tokenizer decoder with XNNPACK on CPU: weight conversion from HF snapshots, static-shape export, codec generation helper, and a C++ runner that decodes codec ids to WAV output. Made-with: Cursor
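The weight conversion step maps Hugging Face checkpoint keys into the layout the runner expects. A minimal sketch of such a key-remapping pass, where the concrete rename patterns are illustrative and not the PR's actual mapping table:

```python
import re

# Illustrative HF -> Meta/Llama-style key renames; the real table in the
# conversion script may differ. Each entry is (regex pattern, replacement).
_RENAMES = [
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.", r"layers.\1.attention.wq."),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.", r"layers.\1.attention.wk."),
    (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"layers.\1.feed_forward.w1."),
    (r"^model\.embed_tokens\.", r"tok_embeddings."),
]

def convert_key(hf_key: str) -> str:
    """Apply the first matching rename; pass unknown keys through unchanged."""
    for pattern, repl in _RENAMES:
        new_key, n = re.subn(pattern, repl, hf_key)
        if n:
            return new_key
    return hf_key

def convert_state_dict(hf_sd: dict) -> dict:
    """Remap every key of a Hugging Face state dict."""
    return {convert_key(k): v for k, v in hf_sd.items()}
```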
**Decoder performance:** export at multiple fixed `codes_len` buckets (75/150/300/600/1200) instead of a single 1200. The runner selects the smallest bucket that fits the input, reducing vocoder padding waste from 13x to 1.6x for typical inputs. Measured 10.5x decode speedup (32.4s → 3.1s for 91 codes, 8da4w XNNPACK CPU).

**Talker export:** reuse the existing Llama/Qwen3 infrastructure to export the talker backbone (28-layer transformer) and code predictor (5-layer) as .pte models with static KV cache and 8da4w quantization. Weight conversion maps the HF talker checkpoint to Meta/Llama format. Talker runs at 64ms/step, code predictor at 7.2ms/step on CPU.

**Streaming decode:** interleave code generation with incremental vocoder decoding in 25-code chunks, yielding first audio at 2.15s instead of waiting for all codes (3.97s non-streaming, 32.4s old baseline).

This PR was authored with Claude.
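The bucket-selection logic the runner uses can be sketched as follows; the bucket sizes come from the PR description, but the helper name is hypothetical:

```python
import bisect

# Fixed codes_len buckets the decoder was exported at (from the PR description).
BUCKETS = [75, 150, 300, 600, 1200]

def select_bucket(codes_len: int) -> int:
    """Return the smallest exported bucket that fits the input."""
    i = bisect.bisect_left(BUCKETS, codes_len)
    if i == len(BUCKETS):
        raise ValueError(f"{codes_len} codes exceed the largest bucket {BUCKETS[-1]}")
    return BUCKETS[i]

# For the 91-code example in the PR: a single 1200 bucket pads ~13x
# (1200 / 91), while the 150 bucket pads only ~1.6x (150 / 91).
```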
Replaces the multi-bucket decoder-only pipeline with a single .pte file containing all 6 pipeline stages (encode_text, talker, code_predictor, codec_embed, cp_head, decode_audio), following the Parakeet multi-method export pattern.

Key changes:
- export_unified.py: multi-method export with per-component quantization, dynamic-shape decoder (patched CausalConvNet for SymInt compat), and embedding quantization support (--qembedding 4w/8w)
- qwen3_tts_unified_runner: C++ runner with lazy method loading, XNNPACK warmup, automatic silence trimming, and decode-only backward compat
- generate_codes.py: added --trim-silence to strip the conditioning prefix

Model sizes: 1.0 GB (4w emb) / 1.2 GB (8w emb) / 2.1 GB (no emb quant)
Decode perf: 2.0s for 91 codes (3.6x realtime) after XNNPACK warmup

Authored with Claude.
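Lazy method loading in the runner amounts to loading each of the six methods from the program only on first use and reusing it afterwards. A minimal Python-shaped sketch of that caching pattern; the `load_fn` hook and names are illustrative, not the runner's actual API:

```python
class LazyMethodCache:
    """Load each .pte method on first call and reuse it afterwards."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # stand-in for the runtime's load-method call
        self._methods = {}

    def get(self, name: str):
        if name not in self._methods:
            self._methods[name] = self._load_fn(name)
        return self._methods[name]

# Count how often the (mock) loader actually runs.
loads = []
cache = LazyMethodCache(lambda name: loads.append(name) or f"<{name}>")
cache.get("decode_audio")
cache.get("decode_audio")   # second call hits the cache, no reload
```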
Teach the unified runner and export path to mirror the MLX reference for dynamic text prompts, sampling behavior, and English codec prefix handling so XNNPACK text synthesis stays coherent end to end. Add contract tests, checked-in manifests, and small export compatibility shims so the single-PTE workflow remains reproducible. Made-with: Cursor
Replace the greedy-only unrolled cp_generate export with a sampling-aware v2 contract that performs inverse-CDF top-k(50) sampling inside the fused XNNPACK graph, collapsing 15 host-side sub-code round trips into one call. Add a persistent SynthesisSession with per-session RNG so the runner stays loaded and warm across sequential prompts. Extend main_unified.cpp with --prompts_path, --repeat, --seed, and --disable_fused_cp_generate flags for multi-prompt warm benchmarking with generation-only timing breakdowns.

The runner gates the fast path on exported metadata (contract version, top_k match, temperature threshold) and falls back to the legacy host-side sub-code loop for older .pte artifacts or unsupported sampler modes. Warm benchmark results show the fused path reduces per-step codegen cost by ~15-20% compared to the legacy loop on the same XNNPACK artifact.

Generated with assistance from Claude. Made-with: Cursor
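Inverse-CDF top-k sampling as described above can be sketched in plain Python. This mirrors the idea, not the exported graph; the function name and signature are illustrative:

```python
import math
import random

def topk_inverse_cdf_sample(logits, k=50, temperature=1.0, u=None):
    """Sample an index from the top-k of `logits` by inverting the CDF."""
    # Restrict to the k highest-scoring indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the top-k, with max-subtraction for numerical stability.
    m = max(logits[i] for i in top)
    weights = [math.exp((logits[i] - m) / temperature) for i in top]
    total = sum(weights)
    if u is None:
        u = random.random()          # a per-session RNG would live here
    # Walk the (unnormalized) CDF until it crosses the scaled uniform draw.
    target, acc = u * total, 0.0
    for i, w in zip(top, weights):
        acc += w
        if acc >= target:
            return i
    return top[-1]
```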
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18508
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit f45bb2b with merge base 518daa8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Bring the unified runner closer to upstream streaming behavior, add reproducible contract and benchmark coverage, and capture the XNNPACK, MLX, and hybrid Metal findings in-repo so the next round of performance work starts from a verified baseline. Made-with: Cursor
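The streaming behavior referenced here interleaves code generation with incremental vocoder decoding in fixed-size chunks (25 codes, per the earlier commit). A minimal generator-shaped sketch, with `decode_chunk` standing in for the vocoder call:

```python
CHUNK = 25  # codes per incremental decode, per the PR description

def stream_decode(code_iter, decode_chunk, chunk_size=CHUNK):
    """Yield audio as soon as each chunk of codec ids is available."""
    buf = []
    for code in code_iter:
        buf.append(code)
        if len(buf) == chunk_size:
            yield decode_chunk(buf)   # first audio before generation finishes
            buf = []
    if buf:                           # flush the final partial chunk
        yield decode_chunk(buf)

# e.g. 91 codes -> 4 decode calls of 25 + 25 + 25 + 16 codes
chunk_sizes = list(stream_decode(range(91), lambda b: len(b)))
```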
Summary
Add a complete Qwen3-TTS text-to-speech pipeline for ExecuTorch with XNNPACK backend, from model export through C++ inference to WAV output.
- Unified export (`export_unified.py`): packages `encode_text`, `talker`, `code_predictor`, `codec_embed`, `cp_head`, `cp_generate`, and `decode_audio` into one `model.pte` with 8da4w quantization and optional embedding quantization (4w/8w)
- `cp_generate` v2: collapses the 15-step host-side sub-code loop into a single XNNPACK graph call with inverse-CDF top-k(50) sampling, reducing per-step codegen cost by ~15-20%
- `SynthesisSession` API: keeps the runner warm across sequential prompts with per-session RNG, detailed `SynthesisTiming` breakdowns (prompt prep, talker prefill, codegen, decode-audio), and automatic fast-path/legacy-fallback gating based on exported contract metadata
- `--prompts_path`, `--repeat`, `--seed`, `--disable_fused_cp_generate` flags for honest generation-only latency measurement without startup tax

Architecture
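A minimal sketch of the metadata gating that selects the fused path versus the legacy host-side loop; the field names and the exact temperature condition are assumptions, not the PR's actual metadata schema:

```python
def use_fused_cp_generate(meta: dict, top_k: int, temperature: float,
                          disabled: bool = False) -> bool:
    """Gate the fast path on exported contract metadata.

    Falls back to the legacy host-side sub-code loop for older .pte
    artifacts or sampler settings the export does not support.
    """
    if disabled:                                 # --disable_fused_cp_generate
        return False
    return (
        meta.get("cp_generate_contract") == 2    # assumed: sampling-aware v2
        and meta.get("top_k") == top_k           # sampler must match the export
        and temperature > 0.0                    # assumed threshold: greedy
    )                                            # decoding takes the legacy path
```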
Benchmark (Apple M-series, XNNPACK 8da4w, warm single process)
Commits (review order)
- 53ab54c — Initial XNNPACK bring-up: model wrappers, weight conversion, decode-only runner
- aa37d0f — Multi-bucket decoder, talker export, streaming decode scaffolding
- 510c0ff — Unified single-PTE export and C++ runner with text synthesis
- e3ddd29 — Align text synthesis with MLX reference semantics (English prefix, sampling, EOS)
- 498b6d2 — Fused cp_generate v2, SynthesisSession API, warm benchmark tooling

Test plan
- `python -m unittest` passes all 4 test modules (28 tests)
- `cmake --build cmake-out/examples/models/qwen3-tts`
- `model.pte` exported from updated `export_unified.py` (no missing ops)
- `--text "..." --max_new_tokens 247`
- `--text "..." --top_k 50`
- `--prompts_path`

Generated with assistance from Claude.
Made with Cursor