Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions api-reference/server/services/s2s/aws.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -297,6 +297,7 @@ llm = AWSNovaSonicLLMService(

## Notes

- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`, so pipeline processors that depend on those frames — RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector — won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are correct anyway. To produce the turn frames locally, wire `vad_analyzer=SileroVADAnalyzer()` (or similar) into `LLMUserAggregatorParams`; locally-generated turn boundaries are a heuristic and may not match Nova Sonic's server-side turn decisions.
- **Model versions**: Nova 2 Sonic (`amazon.nova-2-sonic-v1:0`) is the default and recommended model. The older Nova Sonic (`amazon.nova-sonic-v1:0`) has fewer features and requires an assistant response trigger mechanism.
- **Session continuation**: Enabled by default to handle AWS's ~8-minute session limit. The service automatically rotates sessions in the background with no user-perceptible interruption, preserving conversation context and buffering user audio during the transition. You can tune the threshold or disable it via `session_continuation` parameter.
- **Endpointing sensitivity**: Only supported with Nova 2 Sonic. Controls how quickly the model decides the user has stopped speaking -- `"HIGH"` causes the model to respond most quickly.
Expand Down
1 change: 1 addition & 0 deletions api-reference/server/services/s2s/gemini-live.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -320,6 +320,7 @@ llm = GeminiLiveLLMService(

## Notes

- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` (the API exposes an `interrupted` event but no turn-start/-end), so pipeline processors that depend on those frames — RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector — won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are correct anyway. To produce the turn frames locally, see [realtime-gemini-live-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-gemini-live-locally-driven-turns.py); note that locally-generated turn boundaries are a heuristic and may not match Gemini Live's server-side turn decisions.
- **Model support**: The service supports both Gemini 2.5 and Gemini 3.x models. The service automatically detects and handles model-specific behavior.
- **Async tool support**: Functions registered with `cancel_on_interruption=False` use Gemini's NON_BLOCKING tool mechanism on models that support it (currently Gemini 2.x), allowing the conversation to continue while the tool runs in the background. The result is delivered via the async-tool mechanism and integrated into the model's next turn. On models that don't support NON_BLOCKING (Gemini 3.x), the service logs a one-time warning explaining the limitation. Note: An intermittent 1008 error can occasionally occur on Gemini 2.5 during long-running tool calls; the service auto-reconnects when this happens.
- **System instruction precedence**: The `system_instruction` from service settings takes precedence over an initial system message in the LLM context. A warning is logged when both are set.
Expand Down
1 change: 1 addition & 0 deletions api-reference/server/services/s2s/grok.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -256,6 +256,7 @@ await task.queue_frame(

## Notes

- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from Grok's server-side VAD events. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames. If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable Grok's server-side turn detection first via `turn_detection=None` (manual mode); otherwise both sources broadcast duplicate user-turn frames. See [realtime-grok-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-grok-locally-driven-turns.py).
- **Audio format auto-configuration**: If audio format is not specified in `session_properties`, the service automatically configures PCM input/output using the pipeline's sample rates.
- **Server-side VAD**: Enabled by default. When VAD is enabled, the server handles speech detection and turn management automatically. Set `turn_detection` to `None` to manage turns manually.
- **Audio before setup**: Audio is not sent to Grok until the conversation setup is complete, preventing sample rate mismatches.
Expand Down
1 change: 1 addition & 0 deletions api-reference/server/services/s2s/inworld.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -324,6 +324,7 @@ await task.queue_frame(

## Notes

- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from Inworld's server-side VAD events. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames. If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable Inworld's server-side turn detection first via `turn_detection=None` (manual mode); otherwise both sources broadcast duplicate user-turn frames. See [realtime-inworld-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-inworld-locally-driven-turns.py).
- **Audio format auto-configuration**: If audio format is not specified in `session_properties`, the service automatically configures PCM input/output using the pipeline's sample rates (defaults to 24000 Hz).
- **Semantic VAD by default**: The service uses semantic VAD (`"semantic_vad"`) by default for more natural turn detection. When VAD is enabled, the server handles speech detection and turn management automatically.
- **Cascade architecture**: The service operates as an integrated STT → LLM → TTS pipeline on the server side, simplifying client-side implementation.
Expand Down
2 changes: 2 additions & 0 deletions api-reference/server/services/s2s/openai.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -347,6 +347,8 @@ await task.queue_frame(

## Notes

- **User turn frames**: Emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` from OpenAI's server-side VAD events, so pipeline processors that depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) work out of the box. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are decoupled from those frames; see the [realtime-openai.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-openai.py) example.
- **Local VAD**: If you wire local VAD (`LLMUserAggregatorParams.vad_analyzer`) on top of this service, disable OpenAI's server-side turn detection first (`turn_detection=False`); otherwise both sources broadcast duplicate user-turn frames. See [realtime-openai-locally-driven-turns.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/realtime/realtime-openai-locally-driven-turns.py).
- **Model is connection-level**: The `model` parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
- **Output modalities are single-mode**: The API supports either `["text"]` or `["audio"]` output, not both simultaneously.
- **Turn detection options**: Use `TurnDetection` for traditional VAD, `SemanticTurnDetection` for AI-based turn detection, or `False` to disable server-side detection and manage turns manually.
Expand Down
1 change: 1 addition & 0 deletions api-reference/server/services/s2s/ultravox.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -248,6 +248,7 @@ await task.queue_frame(

## Notes

- **User turn frames**: Does NOT emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`, so pipeline processors that depend on those frames — RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector — won't activate with the default server-VAD-only setup. Pair with `LLMContextAggregatorPair(..., realtime_service_mode=True)` so context writes are correct anyway. To produce the turn frames locally, wire `vad_analyzer=SileroVADAnalyzer()` (or similar) into `LLMUserAggregatorParams`; locally-generated turn boundaries are a heuristic and may not match Ultravox's server-side turn decisions.
- **Audio-native model**: Ultravox processes audio directly rather than relying on a separate STT step. Voice transcriptions are provided for reference but may not always align with the model's understanding of user input.
- **Server-side context management**: Ultravox handles conversation context server-side. The LLM context in Pipecat is only used for passing function call results back to the service.
- **Audio sample rate**: The service uses a 48kHz sample rate. Input audio at different sample rates is automatically resampled.
Expand Down
Loading