pipecat-ai · markbackman · May 29, 2026
diff --git a/api-reference/server/services/s2s/aws.mdx b/api-reference/server/services/s2s/aws.mdx
@@ -297,6 +297,7 @@ llm = AWSNovaSonicLLMService(
 
 ## Notes
 
+- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`, so pipeline processors that depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes stay correct without those frames. To produce the turn frames locally, wire `vad_analyzer=SileroVADAnalyzer()` (or similar) into `LLMUserAggregatorParams`; locally generated turn boundaries are a heuristic and may not match Nova Sonic's server-side turn decisions.
 - **Model versions**: Nova 2 Sonic (`amazon.nova-2-sonic-v1:0`) is the default and recommended model. The older Nova Sonic (`amazon.nova-sonic-v1:0`) has fewer features and requires an assistant response trigger mechanism.
 - **Session continuation**: Enabled by default to handle AWS's ~8-minute session limit. The service automatically rotates sessions in the background with no user-perceptible interruption, preserving conversation context and buffering user audio during the transition. You can tune the threshold or disable it via `session_continuation` parameter.
 - **Endpointing sensitivity**: Only supported with Nova 2 Sonic. Controls how quickly the model decides the user has stopped speaking -- `"HIGH"` causes the model to respond most quickly.

diff --git a/api-reference/server/services/s2s/gemini-live.mdx b/api-reference/server/services/s2s/gemini-live.mdx
@@ -320,6 +320,7 @@ llm = GeminiLiveLLMService(
 
 ## Notes
 
+- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` (the API exposes an `interrupted` event but no turn-start or turn-end event), so pipeline processors that depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes stay correct without those frames. To produce the turn frames locally, see [realtime-gemini-live-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-gemini-live-locally-driven-turns.py); locally generated turn boundaries are a heuristic and may not match Gemini Live's server-side turn decisions.
 - **Model support**: The service supports both Gemini 2.5 and Gemini 3.x models. The service automatically detects and handles model-specific behavior.
 - **Async tool support**: Functions registered with `cancel_on_interruption=False` use Gemini's NON_BLOCKING tool mechanism on models that support it (currently Gemini 2.x), allowing the conversation to continue while the tool runs in the background. The result is delivered via the async-tool mechanism and integrated into the model's next turn. On models that don't support NON_BLOCKING (Gemini 3.x), the service logs a one-time warning explaining the limitation. Note: An intermittent 1008 error can occasionally occur on Gemini 2.5 during long-running tool calls; the service auto-reconnects when this happens.
 - **System instruction precedence**: The `system_instruction` from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.

diff --git a/api-reference/server/services/s2s/grok.mdx b/api-reference/server/services/s2s/grok.mdx
@@ -256,6 +256,7 @@ await task.queue_frame(
 
 ## Notes
 
+- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from Grok's server-side VAD events. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames. If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable Grok's server-side turn detection first via `turn_detection=None` (manual mode); otherwise both sources broadcast duplicate user-turn frames. See [realtime-grok-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-grok-locally-driven-turns.py).
 - **Audio format auto-configuration**: If audio format is not specified in `session_properties`, the service automatically configures PCM input/output using the pipeline's sample rates.
 - **Server-side VAD**: Enabled by default. When VAD is enabled, the server handles speech detection and turn management automatically. Set `turn_detection` to `None` to manage turns manually.
 - **Audio before setup**: Audio is not sent to Grok until the conversation setup is complete, preventing sample rate mismatches.

diff --git a/api-reference/server/services/s2s/inworld.mdx b/api-reference/server/services/s2s/inworld.mdx
@@ -324,6 +324,7 @@ await task.queue_frame(
 
 ## Notes
 
+- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from Inworld's server-side VAD events. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames. If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable Inworld's server-side turn detection first via `turn_detection=None` (manual mode); otherwise both sources broadcast duplicate user-turn frames. See [realtime-inworld-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-inworld-locally-driven-turns.py).
 - **Audio format auto-configuration**: If audio format is not specified in `session_properties`, the service automatically configures PCM input/output using the pipeline's sample rates (defaults to 24000 Hz).
 - **Semantic VAD by default**: The service uses semantic VAD (`"semantic_vad"`) by default for more natural turn detection. When VAD is enabled, the server handles speech detection and turn management automatically.
 - **Cascade architecture**: The service operates as an integrated STT → LLM → TTS pipeline on the server side, simplifying client-side implementation.

diff --git a/api-reference/server/services/s2s/openai.mdx b/api-reference/server/services/s2s/openai.mdx
@@ -347,6 +347,8 @@ await task.queue_frame(
 
 ## Notes
 
+- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from OpenAI's server-side VAD events, so pipeline processors that depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) work out of the box. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames; see the [realtime-openai.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-openai.py) example.
+- **Local VAD**: If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable OpenAI's server-side turn detection first (`turn_detection=False`); otherwise both sources broadcast duplicate user-turn frames. See [realtime-openai-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-openai-locally-driven-turns.py).
 - **Model is connection-level**: The `model` parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
 - **Output modalities are single-mode**: The API supports either `["text"]` or `["audio"]` output, not both simultaneously.
 - **Turn detection options**: Use `TurnDetection` for traditional VAD, `SemanticTurnDetection` for AI-based turn detection, or `False` to disable server-side detection and manage turns manually.

diff --git a/api-reference/server/services/s2s/ultravox.mdx b/api-reference/server/services/s2s/ultravox.mdx
@@ -248,6 +248,7 @@ await task.queue_frame(
 
 ## Notes
 
+- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`, so pipeline processors that depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes stay correct without those frames. To produce the turn frames locally, wire `vad_analyzer=SileroVADAnalyzer()` (or similar) into `LLMUserAggregatorParams`; locally generated turn boundaries are a heuristic and may not match Ultravox's server-side turn decisions.
 - **Audio-native model**: Ultravox processes audio directly rather than relying on a separate STT step. Voice transcriptions are provided for reference but may not always align with the model's understanding of user input.
 - **Server-side context management**: Ultravox handles conversation context server-side. The LLM context in Pipecat is only used for passing function call results back to the service.
 - **Audio sample rate**: The service uses a 48kHz sample rate. Input audio at different sample rates is automatically resampled.
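
The six "User turn frames" notes added in this diff follow one pattern, which can be summarized as a small lookup table. This is only an illustrative sketch in plain Python, not Pipecat code: the `TurnFrameBehavior` class, the `BEHAVIOR` table, its service keys, and `local_vad_advice` are hypothetical names invented here, and the values are transcribed from the notes above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TurnFrameBehavior:
    """Hypothetical summary record; not part of Pipecat."""

    # True if the note says the service emits UserStartedSpeakingFrame /
    # UserStoppedSpeakingFrame from its server-side VAD events.
    emits_turn_frames: bool
    # Setting the note names for disabling server-side turn detection
    # before adding local VAD (None where the note names no such setting).
    disable_server_detection: Optional[str] = None


# Keys are illustrative labels, not Pipecat identifiers.
BEHAVIOR = {
    "aws-nova-sonic": TurnFrameBehavior(False),
    "gemini-live": TurnFrameBehavior(False),
    "grok": TurnFrameBehavior(True, "turn_detection=None"),
    "inworld": TurnFrameBehavior(True, "turn_detection=None"),
    "openai": TurnFrameBehavior(True, "turn_detection=False"),
    "ultravox": TurnFrameBehavior(False),
}


def local_vad_advice(service: str) -> str:
    """Return the guidance the notes give about local VAD for a service."""
    behavior = BEHAVIOR[service]
    if behavior.emits_turn_frames:
        return (
            f"Set {behavior.disable_server_detection} before adding local VAD "
            "to avoid duplicate user-turn frames."
        )
    return "Add a local VAD analyzer if downstream processors need user-turn frames."


if __name__ == "__main__":
    for name in BEHAVIOR:
        print(f"{name}: {local_vad_advice(name)}")
```

In every case the notes recommend pairing the service with `LLMContextAggregatorPair(..., realtime_service_mode=True)`; the table only captures where the services differ.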