Bug Description
When using a custom TTS plugin that outputs raw PCM/WAV audio through the PCM path (AudioByteStream -> AudioSource.capture_frame()),
the first few tens to hundreds of milliseconds of audio are intermittently truncated on the receiving end.
In our tests, this was observed only on the PCM path.
We did not observe the same issue when using the encoded audio path (for example, MP3 via AudioStreamDecoder).
The issue is intermittent and occurs roughly 1 in 3-5 attempts under the same conditions.
What we verified:
- The raw audio bytes from the TTS provider are intact when dumped and played locally.
- Waveform analysis shows the initial samples are valid and do not contain abnormal spikes.
- We added debug logging in the Python SDK path and verified that every frame reaches AudioSource.capture_frame() successfully.
- We were not able to find any frame loss before capture_frame().
In our tests, this appears to happen somewhere below that handoff point.
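For anyone who wants to repeat the frame-accounting check, here is a minimal sketch of the kind of tally we mean. `FrameLedger` is a hypothetical helper written for this report, not part of the LiveKit SDK, and it assumes 16-bit signed PCM at 24kHz mono. In a real pipeline, `record()` would be called with the frame's raw bytes immediately before the frame is handed to `AudioSource.capture_frame()`.

```python
from dataclasses import dataclass

SAMPLE_RATE = 24_000   # Hz, as in this report
NUM_CHANNELS = 1       # mono
BYTES_PER_SAMPLE = 2   # assumption: 16-bit signed PCM


@dataclass
class FrameLedger:
    """Hypothetical helper: tallies every frame handed to the capture call."""

    frames: int = 0
    total_bytes: int = 0

    def record(self, frame_bytes: bytes) -> None:
        # Call once per frame, right before the capture call.
        self.frames += 1
        self.total_bytes += len(frame_bytes)

    def duration_ms(self) -> float:
        # Total audio duration represented by the recorded frames.
        samples = self.total_bytes / (BYTES_PER_SAMPLE * NUM_CHANNELS)
        return 1000.0 * samples / SAMPLE_RATE

    def missing_bytes(self, source_bytes: int) -> int:
        # Positive result means bytes were lost before the capture call.
        return source_bytes - self.total_bytes
```

If `missing_bytes()` returns 0 for the full TTS payload, the loss is not happening upstream of the capture call, which matches what we saw.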
Expected Behavior
All audio frames passed to AudioSource.capture_frame() should be delivered to the remote participant without truncation,
including the beginning of speech.
Reproduction Steps
1. Generate short utterances using a custom TTS plugin that returns WAV/PCM audio at 24kHz mono.
2. Feed the PCM bytes into the agents PCM path (`AudioByteStream`) and forward every emitted frame to `AudioSource.capture_frame()`.
3. Publish the audio track to a LiveKit room and receive it in a browser client.
4. Repeat playback of the same utterance multiple times under the same conditions.
5. Intermittently, the beginning of speech is truncated on the receiving side; approximately the first 50-300ms is lost.
We can provide a minimal reproducer or captured PCM sample if needed.
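To put a number on the truncation in step 5, one option is to compare the voiced duration of the source PCM with a recording captured on the receiving side. The sketch below uses only the standard library and a synthetic tone as the test utterance. The function names and the amplitude threshold are our own assumptions, and the estimate is only meaningful if the tail of the utterance arrives intact and the recording has been decoded back to 16-bit PCM at the same sample rate.

```python
import array
import math

SAMPLE_RATE = 24_000  # Hz, mono, 16-bit signed samples (assumption)


def make_tone(duration_ms: int, freq_hz: float = 440.0, amplitude: int = 12000) -> array.array:
    """Synthetic test utterance: a plain sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return array.array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)) for i in range(n)),
    )


def voiced_span_ms(samples: array.array, threshold: int = 500) -> float:
    """Duration between the first and last sample at or above the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return 0.0
    return 1000.0 * (loud[-1] - loud[0] + 1) / SAMPLE_RATE


def lost_onset_ms(reference: array.array, received: array.array, threshold: int = 500) -> float:
    """Estimate of audio missing from the start, assuming the tail is intact."""
    return voiced_span_ms(reference, threshold) - voiced_span_ms(received, threshold)
```

Running `lost_onset_ms()` over repeated playbacks of the same utterance would give a per-attempt figure instead of a judgment by ear.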
Operating System
Linux (production), macOS (development)
Models Used
No response
Package Versions
livekit-agents==1.4.2
livekit==1.1.2
Python 3.13
Session/Room/Call IDs
No response
Proposed Solution
No concrete fix proposed yet.
We are mainly reporting the issue and sharing the investigation results.
If there is a recommended configuration or known workaround for the PCM path, we would like to try it.
Additional Context
Symptoms:
- A brief "pop" sound at the start of speech
- The first syllable or word is partially or fully cut off
- The same audio may play correctly on one attempt and be truncated on another
Workarounds attempted:
- Insert delay (30/50/100ms) before the first frame: no improvement
- Change AudioByteStream chunk size: no improvement
- Switch Fish Audio output to raw PCM (skip WAV decode): slight improvement, not resolved
- Insert 200ms of silence padding before the first real audio frame: no improvement
- Change rtc.AudioSource queue_size_ms to 50 / 100 / 500: no improvement
This is particularly impactful for voice agent applications because users perceive it as the agent cutting off the beginning of sentences.
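For reference, the silence-padding workaround listed above amounts to the following. This is a minimal sketch assuming 16-bit signed PCM at 24kHz mono; the helper names are ours and are not part of any LiveKit API.

```python
SAMPLE_RATE = 24_000   # Hz
NUM_CHANNELS = 1       # mono
BYTES_PER_SAMPLE = 2   # assumption: 16-bit signed PCM


def silence_bytes(duration_ms: int) -> bytes:
    """Zero-valued PCM covering the requested duration."""
    samples = SAMPLE_RATE * duration_ms // 1000
    return bytes(samples * NUM_CHANNELS * BYTES_PER_SAMPLE)


def pad_with_silence(pcm: bytes, duration_ms: int = 200) -> bytes:
    """Prepend silence so the first real sample is not the first byte sent."""
    return silence_bytes(duration_ms) + pcm
```

At these settings, 200ms of padding is 4800 samples (9600 bytes) ahead of the first real audio sample.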
Screenshots and Recordings
No response