Bug: end_of_turn_delay collapses to min_endpointing_delay when running without VAD in turn_detection="stt" mode #5669

@miguelmoralai

Summary

When an AgentSession is configured with vad=NOT_GIVEN and turn_detection="stt", the end_of_turn_delay (and transcription_delay) values reported on user ChatMessage metrics no longer measure what their names suggest. They collapse to roughly endpointing.min_delay regardless of how long the STT model actually took to detect end of turn.

This is a problem for users who rely on these metrics to monitor real end-of-utterance latency or to compare STT providers.

Repro

  1. Build an AgentSession with a streaming STT that emits END_OF_SPEECH (for example, Soniox or Deepgram Flux).
  2. Pass turn_detection="stt" and vad=NOT_GIVEN.
  3. Run a normal turn and inspect the metrics dict on the resulting user ChatMessage.

Observed: end_of_turn_delay is approximately equal to endpointing.min_delay on every turn, even on ambiguous turns where the STT model spends close to its MAX_ENDPOINT_DELAY ceiling before firing.

Expected: end_of_turn_delay should reflect the time from the user actually stopping to the EOU pipeline committing, including the STT model decision time.

Root cause

In livekit/agents/voice/audio_recognition.py the _last_speaking_time anchor is set in three places when there is no VAD attached:

# line 846 (FINAL_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 903 (PREFLIGHT_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 944 (END_OF_SPEECH, stt mode)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

Without VAD, the first event handler that runs stamps _last_speaking_time = time.time(). For Soniox and Flux this is essentially the moment END_OF_SPEECH arrives, that is, after the model has already decided end of turn. The bounce task then runs and sleeps for endpointing_delay, then computes:

end_of_turn_delay = time.time() - last_speaking_time  # line 1121

Since last_speaking_time was just set to "now" before the sleep, the resulting value is approximately endpointing_delay and hides the real STT detection time.
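
To make the collapse concrete, here is a minimal sketch of the arithmetic on a made-up timeline. It does not import livekit-agents; `reported_delay` and all the numbers are hypothetical, and the only thing taken from the source is the subtraction at line 1121.

```python
def reported_delay(anchor_time: float, commit_time: float) -> float:
    """Same subtraction as `end_of_turn_delay = time.time() - last_speaking_time`."""
    return commit_time - anchor_time


# Hypothetical timeline for one turn, in seconds on a shared wall clock.
user_stopped_at = 10.0    # the user actually stops talking
stt_decision_time = 2.5   # the STT model deliberates before END_OF_SPEECH
endpointing_delay = 0.5   # sleep performed by the bounce task

eos_arrived_at = user_stopped_at + stt_decision_time
committed_at = eos_arrived_at + endpointing_delay

# Without VAD, the anchor is stamped when the STT event arrives.
observed = reported_delay(eos_arrived_at, committed_at)

# Anchoring on the acoustic stop would include the model decision time.
expected = reported_delay(user_stopped_at, committed_at)

print(f"observed={observed:.2f}s expected={expected:.2f}s")
```

With these numbers the observed value equals `endpointing_delay`, while the expected value also contains the 2.5 s the model spent deciding.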

The TODO already in the file at line 848 acknowledges this:

# vad disabled, use stt timestamp
# TODO: this would screw up transcription latency metrics
# but we'll live with it for now.
# the correct way is to ensure STT fires SpeechEventType.END_OF_SPEECH
# and using that timestamp for _last_speaking_time

Suggested fix

Use SpeechData.end_time from the last FINAL_TRANSCRIPT alternative (which both Soniox and Flux populate as a stream-relative timestamp), plus a stream start anchor captured on the first event of the turn, to compute a wall-clock acoustic stop time. Then assign that value to _last_speaking_time instead of time.time().

That would close the TODO and make end_of_turn_delay and transcription_delay reflect real model behavior in STT-driven mode without depending on VAD being attached.
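
A rough sketch of the proposed computation, under stated assumptions: `end_time` is in seconds relative to the start of the STT stream, and the anchor is the wall-clock time that corresponds to stream time zero. The function name `acoustic_stop_wall_time` and the fallback for a missing `end_time` are hypothetical and not part of livekit-agents; how the anchor is best captured inside audio_recognition.py is left open.

```python
from typing import Optional


def acoustic_stop_wall_time(
    stream_start_wall: Optional[float],
    end_time: float,
    fallback_now: float,
) -> float:
    """Convert a stream-relative end_time into a wall-clock timestamp.

    stream_start_wall: wall-clock time assumed to match stream time 0.
    end_time: stream-relative end of the last final transcript, in seconds.
    fallback_now: value to use when no usable timestamp is available
        (this mirrors the current time.time() behavior).
    """
    if stream_start_wall is None or end_time <= 0:
        # Assumption: an unset or zero end_time means the plugin did not
        # populate it, so keep today's behavior rather than guess.
        return fallback_now
    return stream_start_wall + end_time


# Hypothetical numbers for one turn.
stream_start_wall = 1000.0   # anchor captured on the first event of the turn
end_time = 4.25              # stream-relative end of speech from the STT
commit_time = 1007.5         # wall clock when the EOU pipeline commits

last_speaking_time = acoustic_stop_wall_time(
    stream_start_wall, end_time, fallback_now=commit_time
)
end_of_turn_delay = commit_time - last_speaking_time
print(f"end_of_turn_delay={end_of_turn_delay:.2f}s")
```

Falling back to the arrival time keeps behavior unchanged for STT plugins that do not populate `end_time`, so only providers that report it would see different metrics.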

Environment

  • livekit-agents==1.5.8
  • Python 3.12
  • Plugins reproduced with: livekit-plugins-soniox, livekit-plugins-deepgram (Flux v2)
