fix(eot): tighten eot cancellation by speech activity #6274
chenghao-mou wants to merge 1 commit into
Previously, cancellation could also be triggered by the inference-done event, which background noise makes flaky. This PR drops that path, so cancellation is now based only on STT/VAD start of speech (SOS).
     if self._end_of_turn_task is not None:
         # TODO(theomonnom): disallow cancel if the extra sleep is done
         self._end_of_turn_task.cancel()

-    task_func = (
-        _bounce_eou_task_with_speaking_guard
-        if isinstance(self._turn_detector, _StreamingTurnDetector)
-        else _bounce_eou_task
-    )
     # copy the last_speaking_time before awaiting (the value can change)
     self._end_of_turn_task = asyncio.create_task(
-        task_func(
+        _bounce_eou_task(
             self._last_speaking_time,
             self._last_final_transcript_time,
             self._user_turn_start,
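
A minimal sketch of the cancellation path this PR keeps, assuming a simplified model in which end of speech schedules a delayed "bounce" task and only a start-of-speech event may cancel it. `EndOfTurnSketch`, its method names, and the delay values are hypothetical stand-ins for illustration; they are not the project's classes or defaults.

```python
import asyncio
from typing import Optional


class EndOfTurnSketch:
    """Hypothetical stand-in for the end-of-turn logic; not the real class."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self.committed = False  # True once the bounce finishes uncancelled
        self._task: Optional[asyncio.Task] = None

    async def _bounce(self) -> None:
        # Wait out the endpointing delay, then commit the end of turn.
        await asyncio.sleep(self.min_delay)
        self.committed = True

    def on_end_of_speech(self) -> None:
        # Restart the bounce, cancelling any bounce still pending.
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.create_task(self._bounce())

    def on_start_of_speech(self) -> None:
        # The only cancellation path in this sketch: SOS cancels the bounce.
        if self._task is not None:
            self._task.cancel()
            self._task = None


async def demo(resume_after: float) -> bool:
    """Return True if the turn was committed before the user resumed."""
    detector = EndOfTurnSketch(min_delay=0.1)
    detector.on_end_of_speech()
    await asyncio.sleep(resume_after)
    detector.on_start_of_speech()
    await asyncio.sleep(0.15)  # give a surviving bounce time to finish
    return detector.committed
```

In this model, `asyncio.run(demo(0.01))` cancels the bounce in time, while `asyncio.run(demo(0.2))` lets it commit because SOS arrives after the delay has elapsed.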
🚩 Wider cancellation window when user resumes speech during endpointing
The removed _bounce_eou_task_with_speaking_guard raced _user_speaking_event.wait() against the bounce task. That event was set on INFERENCE_DONE with raw_accumulated_speech > 0 (audio_recognition.py:1246), which fires ~200-250ms before START_OF_SPEECH (gated by VAD's min_speech_duration). The new code relies solely on SOS cancelling _end_of_turn_task (audio_recognition.py:1236-1237), creating a timing window where the bounce could complete before SOS fires.
Concrete scenario: EOS at T=0 → bounce starts with min_delay=0.3s → user resumes at T=0.1 → INFERENCE_DONE detects speech at T=0.15 (old code cancels here) → bounce completes at T≈0.3 → SOS fires at T≈0.35 (too late in new code).
However, this appears intentional: (1) text-based turn detectors always used raw _bounce_eou_task with this same gap, so the PR makes behavior consistent; (2) the old _user_speaking_event had robustness issues with sub-threshold spikes getting stuck, which was the regression the old TestSubThresholdSpeakingSpike tests covered; (3) all _run_eou_detection call sites already ensure _speaking=False, making the entry guard redundant.
(Refers to lines 1585-1595)
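
The scenario in this comment can be checked with a small timeline model. This is a sketch under stated assumptions: `Timeline` and `bounce_cancelled` are hypothetical helpers, the times are the approximate values quoted in the comment, and a signal is treated as cancelling the bounce only if it arrives strictly before the bounce completes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    """Event times in seconds, relative to end of speech (EOS) at T=0."""

    bounce_done: float      # when the end-of-turn bounce would complete
    inference_done: float   # when INFERENCE_DONE first reports speech
    start_of_speech: float  # when START_OF_SPEECH (SOS) fires


def bounce_cancelled(t: Timeline, *, use_inference_done: bool) -> bool:
    """Return True if a cancellation signal arrives before the bounce completes."""
    signals = [t.start_of_speech]
    if use_inference_done:
        # Old behavior: INFERENCE_DONE could also cancel the bounce.
        signals.append(t.inference_done)
    return min(signals) < t.bounce_done


# Approximate values from the scenario described in the comment.
scenario = Timeline(bounce_done=0.3, inference_done=0.15, start_of_speech=0.35)

old_behavior = bounce_cancelled(scenario, use_inference_done=True)
new_behavior = bounce_cancelled(scenario, use_inference_done=False)
print(old_behavior, new_behavior)  # True False
```

With these numbers the old path cancels at T=0.15, while the SOS-only path misses the bounce by roughly 50 ms.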