175 changes: 163 additions & 12 deletions api-reference/server/services/stt/cartesia.mdx
@@ -1,11 +1,14 @@
---
title: "Cartesia"
description: "Speech-to-text service implementation using Cartesia's real-time transcription API"
description: "Speech-to-text service implementations using Cartesia's real-time transcription APIs"
---

## Overview

`CartesiaSTTService` provides real-time speech recognition using Cartesia's WebSocket API with the `ink-whisper` model, supporting streaming transcription with both interim and final results for low-latency applications.
Cartesia provides two STT service implementations:

- `CartesiaSTTService` for real-time speech recognition using Cartesia's WebSocket API with the `ink-whisper` model, supporting streaming transcription with both interim and final results for low-latency applications
- `CartesiaTurnsSTTService` for turn-based speech recognition using Cartesia's v2 WebSocket API with the `ink-2` model, where the server drives turn boundaries and pushes structured events for turn lifecycle management including start, updates, eager end predictions, resume, and final turn completion

<CardGroup cols={2}>
<Card
@@ -16,12 +19,26 @@ description: "Speech-to-text service implementation using Cartesia's real-time t
Pipecat's API methods for Cartesia STT integration
</Card>
<Card
title="Example Implementation"
title="Cartesia Turns STT API Reference"
icon="code"
href="https://reference-server.pipecat.ai/en/latest/api/pipecat.services.cartesia.turns.stt.html"
>
Pipecat's API methods for Cartesia Turns STT integration
</Card>
<Card
title="Standard STT Example"
icon="play"
href="https://github.com/pipecat-ai/pipecat/blob/main/examples/transcription/transcription-cartesia.py"
>
Complete example with transcription logging
</Card>
<Card
title="Turns STT Example"
icon="play"
href="https://github.com/pipecat-ai/pipecat/blob/main/examples/transcription/transcription-cartesia-turns.py"
>
Complete example with turn-based transcription
</Card>
<Card
title="Cartesia Documentation"
icon="book"
@@ -50,15 +67,13 @@ Before using Cartesia STT services, you need:

1. **Cartesia Account**: Sign up at [Cartesia](https://cartesia.ai/)
2. **API Key**: Generate an API key from your account dashboard
3. **Model Access**: Ensure access to the ink-whisper transcription model
3. **Model Access**: Ensure access to the transcription model you plan to use (`ink-whisper` for `CartesiaSTTService`, `ink-2` for `CartesiaTurnsSTTService`)

### Required Environment Variables

- `CARTESIA_API_KEY`: Your Cartesia API key for authentication
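
As a minimal sketch of reading this variable, the helper below fails early when the key is missing. The function name `require_api_key` is hypothetical and not part of Pipecat or Cartesia; only the `CARTESIA_API_KEY` variable name comes from this page.

```python
import os


def require_api_key(env=None):
    """Return the Cartesia API key, or raise if it is not set.

    `require_api_key` is an illustrative helper, not a Pipecat API.
    """
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY")
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

Checking the key at startup surfaces a configuration problem before the service attempts to open a WebSocket connection.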

## Configuration

### CartesiaSTTService
## CartesiaSTTService

<ParamField path="api_key" type="str" required>
Cartesia API key for authentication.
@@ -107,9 +122,9 @@ Runtime-configurable settings passed via the `settings` constructor argument usi
| `model` | `str` | `"ink-whisper"` | The transcription model to use. _(Inherited from base STT settings.)_ |
| `language` | `Language \| str` | `"en"` | Target language for transcription. _(Inherited from base STT settings.)_ |

## Usage
### Usage

### Basic Setup
#### Basic Setup

```python
from pipecat.services.cartesia.stt import CartesiaSTTService
@@ -119,7 +134,7 @@ stt = CartesiaSTTService(
)
```

### With Custom Options
#### With Custom Options

```python
from pipecat.services.cartesia.stt import CartesiaSTTService
@@ -134,7 +149,7 @@ stt = CartesiaSTTService(
)
```

## Notes
### Notes

- **Inactivity timeout**: Cartesia disconnects WebSocket connections after 3 minutes of inactivity. The timeout resets with each message sent. Silence-based keepalive is enabled by default to prevent disconnections.
- **Auto-reconnect on send**: If the connection is closed (e.g., due to timeout), the service automatically reconnects when the next audio data is sent.
@@ -147,7 +162,7 @@ stt = CartesiaSTTService(
guide](/pipecat/fundamentals/service-settings) for migration details.
</Tip>

## Event Handlers
### Event Handlers

Cartesia STT supports the standard [service connection events](/api-reference/server/events/service-events):

@@ -161,3 +176,139 @@ Cartesia STT supports the standard [service connection events](/api-reference/se
async def on_connected(service):
    print("Connected to Cartesia STT")
```

## CartesiaTurnsSTTService

`CartesiaTurnsSTTService` uses Cartesia's v2 WebSocket API with the `ink-2` model. The server drives turn boundaries and pushes structured events that cover the turn lifecycle: start, updates, eager end predictions, resume, and final turn completion.

<ParamField path="api_key" type="str" required>
Cartesia API key for authentication.
</ParamField>

<ParamField path="url" type="str" default="wss://api.cartesia.ai/stt/turns/websocket">
WebSocket URL for the Cartesia Streaming ASR v2 endpoint.
</ParamField>

<ParamField path="sample_rate" type="int | None" default="None">
Audio sample rate in Hz. If `None`, uses the pipeline sample rate.
</ParamField>

<ParamField path="should_interrupt" type="bool" default="True">
Whether to broadcast an interruption when the server signals the start of a new turn.
</ParamField>

<ParamField path="watchdog_min_timeout" type="float" default="0.5">
Minimum idle timeout (in seconds) before sending silence to prevent dangling turns. The actual threshold is `max(chunk_duration * 2, watchdog_min_timeout)`.
</ParamField>

<ParamField path="extra_headers" type="dict[str, str] | None" default="None">
Optional additional HTTP headers to send with the WebSocket handshake.
</ParamField>

<ParamField path="settings" type="CartesiaTurnsSTTService.Settings" default="None">
Runtime-updatable settings. See [Settings](#settings-2) below.
</ParamField>

### Settings

Runtime-configurable settings passed via the `settings` constructor argument using `CartesiaTurnsSTTService.Settings(...)`. The `ink-2` model family is English-only and does not support runtime model or language switching; attempts to update these fields are reported as unhandled.

| Parameter | Type | Default | Description |
| ---------- | ----------------- | --------- | --------------------------------------------------------------------- |
| `model` | `str` | `"ink-2"` | The transcription model to use. _(Inherited from base STT settings.)_ |
| `language` | `Language \| str` | `None` | Target language (fixed to English). _(Inherited from base STT settings.)_ |

### Usage

#### Basic Setup

```python
import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)
```

#### With Custom Configuration

```python
import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    sample_rate=16000,
    should_interrupt=True,
    watchdog_min_timeout=1.0,
)
```

#### With Event Handlers

```python
import os

from pipecat.services.cartesia.turns.stt import CartesiaTurnsSTTService

stt = CartesiaTurnsSTTService(
    api_key=os.getenv("CARTESIA_API_KEY"),
)

@stt.event_handler("on_turn_start")
async def on_turn_start(service, transcript):
    print(f"User started speaking: {transcript}")

@stt.event_handler("on_turn_end")
async def on_turn_end(service, transcript):
    print(f"Final transcript: {transcript}")
```

### Turn-Based Protocol

The service speaks the v2 turn-based wire protocol:

```
connected → turn.start → turn.update* → (turn.eager_end → turn.resume?)* → turn.end → ...
```

- **`turn.start`**: Server detected the start of a turn. Pushes `UserStartedSpeakingFrame` and optionally broadcasts an interruption.
- **`turn.update`**: Incremental transcript update. Pushes `InterimTranscriptionFrame`.
- **`turn.eager_end`**: Server eagerly predicted the end of turn. Available via event handler for speculative downstream processing.
- **`turn.resume`**: User resumed speaking after an eager end. Available via event handler.
- **`turn.end`**: Final transcript for the completed turn. Pushes `TranscriptionFrame` and `UserStoppedSpeakingFrame`.

Transcripts are cumulative per turn. There is no `is_final` flag and no `finalize` command — closing the socket ends the session.
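
The event sequence above can be sketched as a small state tracker. This is an illustration only, not Pipecat's implementation: `TurnTracker` and its fields are hypothetical, events are modeled as plain `(event_type, transcript)` values rather than the actual wire format, and "cumulative" is assumed to mean that each transcript replaces the previous one instead of being appended to it.

```python
from dataclasses import dataclass, field


@dataclass
class TurnTracker:
    """Hypothetical tracker for the turn lifecycle described above."""

    interim: str = ""          # latest cumulative transcript of the open turn
    in_turn: bool = False      # True between turn.start and turn.end
    eager_ended: bool = False  # True after turn.eager_end until resume or end
    finals: list[str] = field(default_factory=list)

    def handle(self, event_type: str, transcript: str = "") -> None:
        if event_type == "turn.start":
            self.in_turn, self.eager_ended, self.interim = True, False, transcript
        elif event_type == "turn.update":
            # Assumed cumulative: replace the interim text, do not append.
            self.interim = transcript
        elif event_type == "turn.eager_end":
            self.interim, self.eager_ended = transcript, True
        elif event_type == "turn.resume":
            # The eager end prediction was wrong; the turn continues.
            self.eager_ended = False
        elif event_type == "turn.end":
            self.finals.append(transcript)
            self.in_turn, self.eager_ended, self.interim = False, False, ""
```

Feeding the tracker a `turn.eager_end` followed by a `turn.resume` leaves the turn open, and only `turn.end` adds an entry to `finals`.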

### Notes

- **English-only**: The ink-2 model family supports English transcription only at launch.
- **No runtime model switching**: Unlike the v1 API, the ink-2 model does not support runtime model or language switching.
- **Watchdog for dangling turns**: If audio stops flowing after a `turn.start`, the service sends silence to prevent the turn from hanging indefinitely. Configure the threshold with `watchdog_min_timeout`.
- **Server-driven turns**: The server controls turn boundaries. There is no client-side `finalize` command.
- **Interruption support**: `should_interrupt` defaults to `True`, so the service broadcasts an interruption when the server signals the start of a new turn, enabling natural turn-taking. Set it to `False` to disable this behavior.
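
To illustrate the watchdog threshold, the sketch below evaluates the formula `max(chunk_duration * 2, watchdog_min_timeout)` given for the `watchdog_min_timeout` parameter. The function name `watchdog_threshold` is hypothetical, and `chunk_duration` is assumed to be the duration of one audio chunk in seconds.

```python
def watchdog_threshold(chunk_duration: float, watchdog_min_timeout: float = 0.5) -> float:
    """Idle time in seconds before silence is sent (illustrative helper)."""
    return max(chunk_duration * 2, watchdog_min_timeout)


# Short chunks: the minimum timeout dominates.
short_chunks = watchdog_threshold(0.02)
# Long chunks: twice the chunk duration dominates.
long_chunks = watchdog_threshold(0.4)
```

With short audio chunks the configured minimum applies, so raising `watchdog_min_timeout` is the way to make the watchdog more tolerant of brief gaps in audio.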

### Event Handlers

Cartesia Turns STT supports the following event handlers:

| Event | Handler Signature | Description |
| --------------------- | ----------------------------------------- | ---------------------------------------------------------- |
| `on_connected` | `async def(service)` | Connected to Cartesia WebSocket |
| `on_disconnected` | `async def(service)` | Disconnected from Cartesia WebSocket |
| `on_connection_error` | `async def(service, error_msg)` | Connection error occurred |
| `on_turn_start` | `async def(service, transcript: str)` | Server detected start of a turn |
| `on_turn_update` | `async def(service, transcript: str)` | Incremental transcript update |
| `on_turn_eager_end` | `async def(service, transcript: str)` | Server eagerly predicted end of turn |
| `on_turn_resume` | `async def(service)` | User resumed speaking after an eager end |
| `on_turn_end` | `async def(service, transcript: str)` | Final transcript for the completed turn |

Example:

```python
@stt.event_handler("on_turn_eager_end")
async def on_turn_eager_end(service, transcript):
    print(f"Eager end prediction: {transcript}")
    # Optionally start processing speculatively

@stt.event_handler("on_turn_resume")
async def on_turn_resume(service):
    print("User resumed speaking, discard speculative processing")
```
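
One possible way to structure the speculative processing mentioned in the comments above is to run it as an `asyncio` task that is cancelled when the user resumes. The `SpeculativeWork` class below is a hypothetical helper, not part of Pipecat; it uses only the standard library.

```python
import asyncio


class SpeculativeWork:
    """Hypothetical helper: start work on an eager end, cancel it on resume."""

    def __init__(self, worker):
        self._worker = worker  # async callable that takes a transcript
        self._task = None

    def start(self, transcript):
        # Replace any earlier speculative task with one for the new transcript.
        self.cancel()
        self._task = asyncio.create_task(self._worker(transcript))

    def cancel(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def result(self):
        # Returns None when nothing is pending (for example, after a cancel).
        if self._task is None:
            return None
        return await self._task
```

Under this sketch, `on_turn_eager_end` would call `start(transcript)`, `on_turn_resume` would call `cancel()`, and `on_turn_end` would await `result()` and fall back to normal processing when it returns `None`.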