streaming support #16
Open
srao25 wants to merge 5 commits into
Clears PytestUnknownMarkWarning when running streaming integration tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streaming audio generation
Adds end-to-end streaming to `generate()` via a generator/iterator API. Audio chunks are emitted during LLM generation, not after. No retraining is required, and the change is fully backward compatible.

Usage
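A minimal sketch of the calling pattern, using a stand-in function instead of the real model. The name `fake_generate`, its signature, the chunk size, and the 24 kHz sample rate are assumptions for illustration; only the `stream` flag and the `(chunk, sample_rate)` tuples come from this description.

```python
from typing import Iterator, List, Tuple, Union

Chunk = List[float]  # stand-in for an audio tensor


def fake_generate(
    text: str, stream: bool = False
) -> Union[Iterator[Tuple[Chunk, int]], Tuple[Chunk, int]]:
    """Hypothetical stand-in for generate(); not the real TADA API."""
    sample_rate = 24_000  # assumed value, for illustration only

    def chunks() -> Iterator[Tuple[Chunk, int]]:
        # Pretend each word produces one short chunk of silence.
        for _word in text.split():
            yield [0.0] * 480, sample_rate

    if stream:
        return chunks()  # caller iterates and receives chunks as they are made
    # Non-streaming: collect everything, then return it in one piece.
    audio = [sample for chunk, _ in chunks() for sample in chunk]
    return audio, sample_rate


# Streaming: consume chunks as they arrive.
received = []
for chunk, sr in fake_generate("hello streaming world", stream=True):
    received.append(chunk)

# Non-streaming (default): one result after generation completes.
audio, sr = fake_generate("hello streaming world")
```

In the streaming branch the caller can start playback on the first chunk instead of waiting for the whole utterance.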
When `stream` is not set (default), the existing non-streaming path runs unchanged.

What changed
`generate()` gains two params:

| Param | Default | Description |
| --- | --- | --- |
| `stream` | `False` | If `True`, returns an `AudioStream` iterator yielding `(chunk, sample_rate)` tuples. |
| `streaming_cnn_window_size` | `100` | Use `50` for <32 GB RAM. |

Under the hood, `_generate()` is now a Python generator. It yields one `StepOutput` per predicted token (carrying the new acoustic feature + `time_before`) and a final `SyncTokGenerationOutput` on completion. The non-streaming path simply drains the generator eagerly, so its semantics are unchanged.

How it works
The TADA decoder has two stages, both made streaming:
Transformer (KV-cache, bit-exact). `forward_with_cache()` + `_apply_rope_with_offset()` on `LocalSelfAttention` / `LocalAttentionEncoderLayer` (`encoder.py`). Each new token computes only Q and attends to cached post-RoPE K,V. The v2 block attention mask restricts attention to the current and previous block. Max diff vs. full-sequence pass: ~1.91e-06 (float noise only).

CNN / DACDecoder (sliding window). New `StreamingDecoder` class in `decoder.py`. The CNN uses symmetric padding (non-causal), so it needs both left and right context. The sliding window provides 20 frames of left context and 15 frames of right lookahead; measured against the pretrained weights, the resulting difference is inaudible (≤ 0.0003 max diff). `_all_hidden` is capped at the window size for bounded memory.

E2E wiring lives in `AudioStream.__iter__` (`tada.py`): on each predicted token it denormalizes acoustic features, feeds them to `StreamingDecoder.decode_block()`, and yields any audio that the sliding window has been able to emit. On the first token, leading silence is skipped at the frame level. On the final `SyncTokGenerationOutput`, the streaming decoder is flushed and `stream.result` is populated.

Other fixes bundled in
- `generate()` used to encode the prompt and gen text separately, causing BPE merges to drop characters at the seam (e.g. `".Hello"` → `".H" + "ello"`). It now uses joint encoding with a space separator. This fixes both the streaming and non-streaming paths.
- When `time_before[0]` was small, the leading-silence trim removed almost nothing and exposed a CNN transition artifact. The trim is now clamped to a minimum of 5 frames (100 ms) in both paths.

Performance
True time-to-first-audio (TTFA) is measured from the `generate()` call to the first chunk.

Peak VRAM is lower in streaming mode for medium/long text (up to 3.6 GB less than non-streaming), because the streaming decoder operates on a bounded sliding window instead of materializing all hidden states at once.
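The bounded sliding window can be illustrated with a toy decoder. This is a sketch under stated assumptions, not the PR's `StreamingDecoder`: the "CNN" here is a symmetric moving average with a radius of 2 frames, and the class name `ToyStreamingDecoder` and its method signatures are hypothetical. Because the toy window covers the filter's whole receptive field, streaming output matches the full pass exactly; the real decoder uses 20 frames of left context and 15 of lookahead and is only approximately equal.

```python
from collections import deque
from typing import Deque, List

RADIUS = 2  # half-width of the toy filter's receptive field (assumed)


def full_decode(frames: List[float]) -> List[float]:
    """Reference: symmetric moving average over the whole sequence, zero-padded."""
    n = len(frames)
    out = []
    for t in range(n):
        window = [frames[i] if 0 <= i < n else 0.0
                  for i in range(t - RADIUS, t + RADIUS + 1)]
        out.append(sum(window) / len(window))
    return out


class ToyStreamingDecoder:
    """Keeps at most 2 * radius + 1 frames, however long the input is."""

    def __init__(self, radius: int = RADIUS) -> None:
        self.radius = radius
        self.size = 2 * radius + 1
        # Pre-fill with zeros to play the role of left padding.
        self.buffer: Deque[float] = deque([0.0] * radius, maxlen=self.size)

    def decode_block(self, frame: float) -> List[float]:
        """Push one frame; emit an output once enough lookahead has arrived."""
        self.buffer.append(frame)  # oldest frame is dropped automatically
        if len(self.buffer) < self.size:
            return []
        return [sum(self.buffer) / self.size]

    def flush(self) -> List[float]:
        """Push zeros as right padding to emit the frames still pending."""
        out: List[float] = []
        for _ in range(self.radius):
            out.extend(self.decode_block(0.0))
        return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
decoder = ToyStreamingDecoder()
streamed: List[float] = []
for frame in frames:
    streamed.extend(decoder.decode_block(frame))
streamed.extend(decoder.flush())
```

The `deque(maxlen=...)` is what bounds memory: the buffer never grows with sequence length, whereas `full_decode` needs every frame at once.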
Tests
Unit tests: `StepOutput`, `AudioStream` (fake generator + tiny `Decoder`), `StreamingDecoder` (basic streaming, `skip_leading_frames`, `reset`, buffering, flush), segment attention mask.

Integration tests (`@pytest.mark.integration`, real TADA-1B on GPU):
- `test_non_streaming_unchanged`: non-streaming path still produces audio.
- `test_streaming_produces_chunks`: streaming yields chunks, `stream.result` is populated.
- `test_streaming_vs_nonstreaming_similar_length`: streaming and non-streaming produce audio within 0.5×–2× length of each other on the same text.
- `test_streaming_early_break`: breaking out of the iterator mid-stream is safe.
- `TestGenerateAudios`: generates 5 streaming + 5 non-streaming WAVs for manual A/B listening.

Run with `pytest tests/test_streaming.py -m integration -s` (or `sbatch tests/run_integration.sh`).

All integration tests passed on TADA-1B (H100) on the latest run.
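Two pieces of the test setup can be sketched briefly. The first is one common way to register a custom marker so pytest stops emitting `PytestUnknownMarkWarning`: a `pytest_configure` hook in `conftest.py`. Whether this PR uses the hook or a config-file entry is not stated here, so treat it as an assumption. The second is a hypothetical helper, `lengths_similar`, expressing the 0.5×–2× length check.

```python
def pytest_configure(config) -> None:
    """conftest.py hook: register the custom marker so pytest does not warn."""
    config.addinivalue_line(
        "markers", "integration: needs the real model and a GPU"
    )


def lengths_similar(
    n_stream: int, n_full: int, low: float = 0.5, high: float = 2.0
) -> bool:
    """True if the streaming length is within [low, high] times the full length."""
    if n_full <= 0:
        return False  # nothing to compare against
    ratio = n_stream / n_full
    return low <= ratio <= high
```

A loose ratio bound suits this comparison because the two paths are not expected to produce sample-identical audio.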
Files changed
- `tada/modules/decoder.py`: new `StreamingDecoder` class (~365 lines). Existing `Decoder` untouched.
- `tada/modules/encoder.py`: `forward_with_cache()` + `_apply_rope_with_offset()` added to attention layers. Existing `forward()` paths untouched.
- `tada/modules/tada.py`: `_generate()` becomes a generator yielding `StepOutput`; new `AudioStream` class; `generate()` gains `stream` + `streaming_cnn_window_size` params; tokenization fix; 5-frame trim clamp.
- `tada/modules/__init__.py`: exports `AudioStream`, `StreamingDecoder`.
- `README.md`: new "Streaming Audio Generation" section with examples, perf table, parameter reference.
- `tests/test_streaming.py`, `tests/run_integration.sh`: new test suite.

Future work (not in this PR)
- `torch.compile` on the streaming CNN/transformer path.