
Intermittent truncation of initial audio on PCM path (AudioByteStream -> AudioSource.capture_frame()) #5158

@satruin

Bug Description

When using a custom TTS plugin that outputs raw PCM/WAV audio through the PCM path (AudioByteStream -> AudioSource.capture_frame()),
the beginning of the audio (roughly the first 50-300 ms) is intermittently truncated on the receiving end.

In our tests, this was observed only on the PCM path.
We did not observe the same issue when using the encoded audio path (for example, MP3 via AudioStreamDecoder).

The issue is intermittent: it occurs in roughly one out of every 3 to 5 attempts under the same conditions.

What we verified:

  • The raw audio bytes from the TTS provider are intact when dumped and played locally.
  • Waveform analysis shows the initial samples are valid and do not contain abnormal spikes.
  • We added debug logging in the Python SDK path and verified that every frame reaches AudioSource.capture_frame() successfully.
  • We were not able to find any frame loss before capture_frame().

In our tests, the loss therefore appears to happen somewhere below that handoff point, after capture_frame() has accepted the frames.
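For anyone who wants to repeat the check above, here is a minimal sketch of the frame accounting. It is not the exact logging we used. `FrameLedger` and `record()` are hypothetical names, and 16-bit signed PCM is assumed. The idea is to call `record()` with each frame's raw bytes immediately before passing the frame to `AudioSource.capture_frame()`, then compare the ledger against the original PCM buffer.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any

BYTES_PER_SAMPLE = 2  # assumption: 16-bit signed PCM


@dataclass
class FrameLedger:
    """Hypothetical helper: tallies every frame handed to capture_frame()."""

    sample_rate: int
    num_channels: int = 1
    frames: int = 0
    total_bytes: int = 0
    _digest: Any = field(default_factory=hashlib.sha256, repr=False)

    def record(self, frame_bytes: bytes) -> None:
        # Call this right before the frame is passed to capture_frame().
        self.frames += 1
        self.total_bytes += len(frame_bytes)
        self._digest.update(frame_bytes)

    @property
    def duration_ms(self) -> float:
        # Total audio duration represented by the recorded frames.
        samples = self.total_bytes / (BYTES_PER_SAMPLE * self.num_channels)
        return 1000.0 * samples / self.sample_rate

    def matches(self, source_pcm: bytes) -> bool:
        # True only if the recorded frames, concatenated in order,
        # are byte-identical to the original PCM buffer.
        return self._digest.hexdigest() == hashlib.sha256(source_pcm).hexdigest()
```

If `matches()` returns True and `duration_ms` equals the utterance length, no audio was dropped or reordered before the handoff.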

Expected Behavior

All audio frames passed to AudioSource.capture_frame() should be delivered to the remote participant without truncation,
including the beginning of speech.

Reproduction Steps

1. Generate short utterances using a custom TTS plugin that returns WAV/PCM audio at 24 kHz mono.
2. Feed the PCM bytes into the agents PCM path (`AudioByteStream`) and forward every emitted frame to `AudioSource.capture_frame()`.
3. Publish the audio track to a LiveKit room and receive it in a browser client.
4. Repeat playback of the same utterance multiple times under the same conditions.
5. Observe that the beginning of speech is intermittently truncated on the receiving side; approximately the first 50-300 ms is lost.

We can provide a minimal reproducer or captured PCM sample if needed.
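One way to put a number on the truncation in step 5 is to compare the length of the audible region in the sent PCM with the same region in a recording made on the receiving side. The sketch below is an illustration, not part of our tooling. `voiced_span_ms` and `estimate_head_loss_ms` are hypothetical names, and the code assumes 16-bit mono PCM in native byte order, a fixed amplitude threshold, and an intact tail (so any shortfall is attributed to the head). Codec and resampling effects can shift the result by a few milliseconds.

```python
import array


def voiced_span_ms(pcm: bytes, sample_rate: int, threshold: int = 500) -> float:
    """Duration from the first to the last sample at or above the threshold."""
    samples = array.array("h")  # assumption: 16-bit mono, native byte order
    samples.frombytes(pcm)
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return 0.0
    return 1000.0 * (loud[-1] - loud[0] + 1) / sample_rate


def estimate_head_loss_ms(
    sent: bytes,
    sent_rate: int,
    received: bytes,
    received_rate: int,
    threshold: int = 500,
) -> float:
    """Shortfall of the received audible span relative to the sent one."""
    return voiced_span_ms(sent, sent_rate, threshold) - voiced_span_ms(
        received, received_rate, threshold
    )
```

Running this over repeated playbacks of the same utterance should separate the clean attempts (result near 0 ms) from the truncated ones.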

Operating System

Linux (production), macOS (development)

Models Used

No response

Package Versions

livekit-agents==1.4.2
livekit==1.1.2
Python 3.13

Session/Room/Call IDs

No response

Proposed Solution

No concrete fix proposed yet. 
We are mainly reporting the issue and sharing the investigation results. 
If there is a recommended configuration or known workaround for the PCM path, we would like to try it.

Additional Context

Symptoms:

  • A brief "pop" sound at the start of speech
  • The first syllable or word is partially or fully cut off
  • The same audio may play correctly on one attempt and be truncated on another

Workarounds attempted:

  • Insert delay (30/50/100ms) before the first frame: no improvement
  • Change AudioByteStream chunk size: no improvement
  • Switch the TTS provider (Fish Audio) output to raw PCM (skip WAV decode): slight improvement, not resolved
  • Insert 200ms of silence padding before the first real audio frame: no improvement
  • Change rtc.AudioSource queue_size_ms to 50 / 100 / 500: no improvement
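For reference, the silence-padding workaround can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from our plugin: `silence_pcm` and `with_leading_silence` are hypothetical names, and the defaults assume 24 kHz mono 16-bit PCM, where zero-valued samples are silence.

```python
def silence_pcm(
    duration_ms: int,
    sample_rate: int = 24000,
    num_channels: int = 1,
    bytes_per_sample: int = 2,  # assumption: 16-bit signed PCM
) -> bytes:
    """Return zero-valued PCM of the requested duration."""
    samples = sample_rate * duration_ms // 1000
    return bytes(samples * num_channels * bytes_per_sample)


def with_leading_silence(pcm: bytes, duration_ms: int = 200, **fmt: int) -> bytes:
    """Prepend silence so the first real sample is not the first sample sent."""
    return silence_pcm(duration_ms, **fmt) + pcm
```

At 24 kHz mono 16-bit, 200 ms of padding is 9600 bytes. In our tests, padding of this length did not prevent the truncation.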

This is particularly impactful for voice agent applications because users perceive it as the agent cutting off the beginning of sentences.

Screenshots and Recordings

No response

Labels: bug
