- "description": "Request body for text-to-speech synthesis. Supports both single-speaker and multi-speaker synthesis.\n\n## Single Speaker\nProvide either `reference_id` (string) pointing to a voice model, or `references` (array of ReferenceAudio) for zero-shot cloning.\n\n## Multiple Speakers (Dialogue)\nFor multi-speaker synthesis, provide:\n- `reference_id`: array of voice model IDs, e.g., [\"speaker-a-id\", \"speaker-b-id\"]\n- `text`: use speaker tags [0], [1], etc. to indicate speaker changes, e.g., \"[0]Hello![1]Hi there!\"\n\nAlternatively, for zero-shot multi-speaker:\n- `references`: 2D array where each inner array contains references for one speaker\n- `reference_id`: array of identifiers (can be arbitrary strings for zero-shot)\n\n## Example (Multi-Speaker with Model IDs)\n```json\n{\n \"text\": \"[0]Good morning![1]Good morning! How are you?[0]I'm great, thanks!\",\n \"reference_id\": [\"model-id-alice\", \"model-id-bob\"]\n}\n```",
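The multi-speaker form described in this field can be sketched as a small client-side helper that assembles the request body and checks the speaker tags before sending. This is a minimal sketch, not part of the API: `build_dialogue_request` and `SPEAKER_TAG` are hypothetical names, and the code assumes that a tag `[n]` selects the n-th (zero-based) entry of `reference_id`, which the example in the description suggests but does not state outright.

```python
import json
import re

# Hypothetical pattern for speaker tags such as [0], [1] in the text field.
SPEAKER_TAG = re.compile(r"\[(\d+)\]")


def build_dialogue_request(text, reference_ids):
    """Build a multi-speaker request body (hypothetical helper).

    Assumption: tag [n] refers to reference_ids[n], zero-based.
    """
    used = {int(n) for n in SPEAKER_TAG.findall(text)}
    if not used:
        raise ValueError("text contains no speaker tags such as [0] or [1]")
    # Every tag index must have a corresponding voice model ID.
    missing = sorted(i for i in used if i >= len(reference_ids))
    if missing:
        raise ValueError(f"speaker tags without a reference_id entry: {missing}")
    return {"text": text, "reference_id": list(reference_ids)}


body = build_dialogue_request(
    "[0]Good morning![1]Good morning! How are you?[0]I'm great, thanks!",
    ["model-id-alice", "model-id-bob"],
)
print(json.dumps(body, indent=2))
```

Validating the tags locally is a design choice made for this sketch only; the description does not say how the service itself handles a tag that has no matching entry.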