feat(realtime): support multi-message generation per response #5763

Merged: longcw merged 2 commits into main from longc/multi-message-realtime-v2 on May 20, 2026


@longcw longcw commented May 18, 2026

Summary

  • Process each MessageGeneration from generation_ev.message_stream serially via perform_audio_forwarding + perform_text_forwarding + wait_for_playout. Only one flush is in flight at a time.
  • Per-msg state is derived directly from the playback_finished event:
    • full → emit ChatMessage(interrupted=False) with the msg's message_id
    • partial → emit ChatMessage(interrupted=True) and call _rt_session.truncate(...) with this msg's local playback_position (not a cumulative offset)
    • skipped → drop locally and call update_chat_ctx(...) so the realtime server removes never-played items from its history
  • _on_first_frame now early-returns once started_speaking_at is set, so per-msg first-frame callbacks don't re-fire _update_agent_state("speaking") for each message.
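The per-message handling above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `PlaybackResult`, `FakeSession`, `play_message`, and `process_response` are hypothetical stand-ins, the `truncate` / `update_chat_ctx` signatures are simplified, and batching the skipped ids into one `update_chat_ctx` call is an assumption.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PlaybackResult:
    # Hypothetical stand-in for the playback_finished event payload.
    state: str  # "full", "partial", or "skipped"
    playback_position: float  # seconds of this message actually played


@dataclass
class FakeSession:
    # Hypothetical recorder standing in for the realtime session.
    calls: list = field(default_factory=list)

    def truncate(self, message_id: str, audio_end_s: float) -> None:
        self.calls.append(("truncate", message_id, audio_end_s))

    def update_chat_ctx(self, removed_ids: list) -> None:
        self.calls.append(("update_chat_ctx", tuple(removed_ids)))


async def play_message(message_id: str, results: dict) -> PlaybackResult:
    # Stands in for forwarding audio/text and then awaiting playout.
    await asyncio.sleep(0)
    return results[message_id]


async def process_response(message_ids, results, session):
    emitted = []  # (message_id, interrupted) pairs, i.e. the ChatMessages
    skipped = []
    for message_id in message_ids:
        # Awaiting inside the loop keeps a single flush in flight.
        result = await play_message(message_id, results)
        if result.state == "full":
            emitted.append((message_id, False))
        elif result.state == "partial":
            emitted.append((message_id, True))
            # Position is local to this message, not cumulative.
            session.truncate(message_id, result.playback_position)
        else:  # "skipped": never played, so drop it locally
            skipped.append(message_id)
    if skipped:
        # Assumption: one server-side history update for all skipped items.
        session.update_chat_ctx(skipped)
    return emitted
```

For a response with three messages where the second is interrupted, this yields two emitted messages (the second marked interrupted), one `truncate` call, and one `update_chat_ctx` call for the third.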

Alternative considered

#5690 makes multi-message work by flushing per message; that requires the synchronizer to keep pending/finalizing impls alive and concurrent flushes to be serialized in room_io/_output.py. Our AudioOutput assumes there is only one speech at a time, so serializing per message at the wait_for_playout boundary (this PR) avoids both changes.

close #5690, #5684

Some realtime providers (e.g. GPT-Realtime-2.0) emit multiple message
items in a single response. Process each one serially: push frames,
flush, wait_for_playout. Only one flush is ever in flight at a time, so
room_io and the transcript synchronizer keep their single-segment
invariants without modification.

Per-msg state is derived from the playback_finished event:
- 'full'    -> emit ChatMessage(interrupted=False) with the msg's id
- 'partial' -> emit ChatMessage(interrupted=True); call truncate() with
               the msg's local playback position
- 'skipped' -> drop from local chat ctx; call update_chat_ctx() so the
               realtime server removes never-played items from history

This is a cleaner alternative to flushing per-message, which would
require keeping multiple in-flight flush_tasks / synchronizer segments
alive simultaneously.
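
To illustrate why truncate() takes a message-local position: if playback were tracked as one offset across the whole response, that offset would overshoot for every message after the first. The helper below is purely hypothetical arithmetic and not part of this PR; with per-message flushing the position reported at playout is presumably already local, so no such conversion is needed.

```python
def local_position(durations_s, cumulative_played_s, index):
    """Return how far into message `index` playback got, in seconds.

    durations_s: audio length of each message in the response (hypothetical input)
    cumulative_played_s: playback offset measured from the start of the response
    """
    start = sum(durations_s[:index])  # where this message begins in the response
    # Clamp to [0, duration] so earlier or later messages map to the bounds.
    return min(max(cumulative_played_s - start, 0.0), durations_s[index])
```

With messages of 2.0 s and 3.0 s and an interruption at 2.5 s overall, the second message's local position is 0.5 s; passing 2.5 s to a per-item truncate would keep too much audio.
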
@chenghao-mou chenghao-mou requested a review from a team May 18, 2026 06:46

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.


Comment thread livekit-agents/livekit/agents/voice/agent_activity.py Outdated
Server-side truncation must run independently of local ChatMessage
emission. The previous order skipped truncate() when forwarded_text
was empty (transcription disabled, or interrupt before the text
stream caught up to audio), leaving the realtime server with the
full un-truncated audio.
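
A minimal sketch of the ordering this fix describes, assuming hypothetical `truncate` and `emit` callbacks (simplified stand-ins, not the real session or chat-context APIs): truncation always runs for a partially played message, while the local ChatMessage is only emitted when some text was forwarded.

```python
def finalize_partial(message_id, forwarded_text, playback_position, truncate, emit):
    # Always tell the server where playback stopped, even with no text.
    truncate(message_id, playback_position)
    # Emit the local message only if there is forwarded text to record.
    if forwarded_text:
        emit(message_id, forwarded_text, True)  # True = interrupted
```

With an empty `forwarded_text` (for example, transcription disabled), `truncate` is still called and nothing is emitted locally.
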
@longcw longcw merged commit 187433c into main May 20, 2026
24 checks passed
@longcw longcw deleted the longc/multi-message-realtime-v2 branch May 20, 2026 00:36