Bug: end_of_turn_delay collapses to min_endpointing_delay when running without VAD in turn_detection="stt" mode #5669

## Summary

When an `AgentSession` is configured with `vad=NOT_GIVEN` and `turn_detection="stt"`, the `end_of_turn_delay` (and `transcription_delay`) values reported on user `ChatMessage` metrics no longer measure what their names suggest. They collapse to roughly `endpointing.min_delay` regardless of how long the STT model actually took to detect end of turn.

This is a problem for users who rely on these metrics to monitor real end-of-utterance latency or to compare STT providers.

## Repro

1. Build an `AgentSession` with a streaming STT that emits `END_OF_SPEECH` (for example, Soniox or Deepgram Flux).
2. Pass `turn_detection="stt"` and `vad=NOT_GIVEN`.
3. Run a normal turn and inspect the `metrics` dict on the resulting user `ChatMessage`.

Observed: `end_of_turn_delay` is approximately equal to `endpointing.min_delay` on every turn, even on ambiguous turns where the STT model spends close to its `MAX_ENDPOINT_DELAY` ceiling before firing.

Expected: `end_of_turn_delay` should reflect the time from the user actually stopping to the EOU pipeline committing, including the STT model decision time.

## Root cause

In `livekit/agents/voice/audio_recognition.py` the `_last_speaking_time` anchor is set in three places when there is no VAD attached:

```python
# line 846 (FINAL_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 903 (PREFLIGHT_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 944 (END_OF_SPEECH, stt mode)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()
```

Without VAD, the first event handler that runs stamps `_last_speaking_time = time.time()`. For Soniox and Flux this is essentially the moment `END_OF_SPEECH` arrives, that is, **after** the model has already decided end of turn. The bounce task then runs and sleeps for `endpointing_delay`, then computes:

```python
end_of_turn_delay = time.time() - last_speaking_time  # line 1121
```

Since `last_speaking_time` was just set to "now" before the sleep, the resulting value is approximately `endpointing_delay` and hides the real STT detection time.
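The arithmetic can be illustrated with a small standalone model. This is only a sketch of the timeline described above, not code from `audio_recognition.py`; the `TurnTimeline` class and both function names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnTimeline:
    """Hypothetical timeline of one user turn, all values in seconds."""
    speech_stop: float        # wall-clock time the user actually stopped talking
    stt_decision: float       # time the STT model took to emit END_OF_SPEECH
    endpointing_delay: float  # sleep applied before the turn is committed


def reported_delay_without_vad(t: TurnTimeline) -> float:
    """Models the current behavior: the anchor is stamped on event arrival."""
    end_of_speech_at = t.speech_stop + t.stt_decision
    last_speaking_time = end_of_speech_at  # equivalent to time.time() in the handler
    committed_at = end_of_speech_at + t.endpointing_delay
    return committed_at - last_speaking_time


def true_delay(t: TurnTimeline) -> float:
    """Models the expected behavior: the anchor is the acoustic stop time."""
    committed_at = t.speech_stop + t.stt_decision + t.endpointing_delay
    return committed_at - t.speech_stop


quick = TurnTimeline(speech_stop=10.0, stt_decision=0.25, endpointing_delay=0.5)
ambiguous = TurnTimeline(speech_stop=10.0, stt_decision=1.5, endpointing_delay=0.5)

for turn in (quick, ambiguous):
    print(reported_delay_without_vad(turn), true_delay(turn))
```

In this model both turns report the same value (the endpointing delay), while the true delay differs by the full STT decision time, which is the collapse described in this issue.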

The TODO already in the file at line 848 acknowledges this:

```python
# vad disabled, use stt timestamp
# TODO: this would screw up transcription latency metrics
# but we'll live with it for now.
# the correct way is to ensure STT fires SpeechEventType.END_OF_SPEECH
# and using that timestamp for _last_speaking_time
```

## Suggested fix

Use `SpeechData.end_time` from the last `FINAL_TRANSCRIPT` alternative (which both Soniox and Flux populate as a stream-relative timestamp) plus a stream start anchor captured on the first event of the turn to compute a wall-clock acoustic stop time. Then assign that value to `_last_speaking_time` instead of `time.time()`.
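A rough sketch of that computation follows. The helper name `estimate_last_speaking_time` and the `stream_start_wall` parameter are assumptions made for illustration and do not exist in the library; only `end_time` corresponds to the `SpeechData.end_time` field mentioned above. The sketch also assumes `end_time` is expressed in seconds since the start of the STT stream.

```python
import time
from typing import Optional


def estimate_last_speaking_time(
    stream_start_wall: Optional[float],
    end_time: float,
    now: Optional[float] = None,
) -> float:
    """Convert a stream-relative end time into a wall-clock timestamp.

    stream_start_wall: hypothetical wall-clock anchor for stream offset 0,
        or None if no anchor has been captured yet.
    end_time: stream-relative end of the last final transcript, in seconds.
    now: current wall-clock time; defaults to time.time().
    """
    if now is None:
        now = time.time()

    # Without an anchor or a usable end time, fall back to the current
    # behavior so the metric is never worse than it is today.
    if stream_start_wall is None or end_time <= 0:
        return now

    # Clamp to "now" so clock skew or a bad anchor can never place the
    # acoustic stop in the future and produce a negative delay.
    return min(stream_start_wall + end_time, now)
```

For example, with an anchor of `100.0`, an `end_time` of `4.0`, and `now=106.0`, the helper returns `104.0`, so a subsequent `time.time() - last_speaking_time` would include the two seconds the model spent deciding. How the anchor itself should be captured (and whether it drifts over a long-lived stream) is left open here.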

That would close the TODO and make `end_of_turn_delay` and `transcription_delay` reflect real model behavior in STT-driven mode without depending on VAD being attached.

## Environment

- `livekit-agents==1.5.8`
- Python 3.12
- Plugins reproduced with: `livekit-plugins-soniox`, `livekit-plugins-deepgram` (Flux v2)
