feat(webchat): add voice reply (TTS) support to ChatUI #8728

Open
jayzen33 wants to merge 2 commits into AstrBotDevs:master from jayzen33:feat/webui-tts-replies
jayzen33 commented Jun 11, 2026

Contributor

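Adds voice reply support to ChatUI: users can turn on the Voice Reply toggle in the chat interface so the bot replies with TTS audio, accompanied by a text transcript.

Modifications

Backend

  • webchat_event.py: the record message segment passes its text caption through, so the frontend can render the transcript below the voice bubble.
  • routes/chat.py
    • Adds an enable_tts request parameter and persists the client preference as the session's TTS state;
    • When TTS is available, streaming output is turned off so the result-decorate stage can synthesize audio; when TTS is unavailable and the client explicitly requested voice, a localizable reason code is returned through a tts_notice SSE event (applies to both streaming and non-streaming requests);
    • The TTS check is aligned with the pipeline (a trigger_probability of 0 is treated as disabled, and streaming is not given up);
    • When a turn has only agent_stats/refs left, they are merged into the previous record to avoid an empty bubble; when there is no record to attach to, it falls back to saving a metadata-only record;
    • _save_bot_message returns (record, content) to eliminate duplicate construction; thread UMO construction reuses _build_webchat_umo.
  • session_llm_manager.py: the TTS session state write is skipped when the value is unchanged, avoiding one redundant DB write per message.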
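
The gating rules in the routes/chat.py bullets can be sketched as a small pure function. This is only an illustration under assumptions: `resolve_tts`, `TtsDecision`, the reason codes `tts_disabled` and `no_tts_provider`, and the default probability of 1.0 are hypothetical and not taken from the PR; only `enable_tts`, `trigger_probability`, and `tts_notice` come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TtsDecision:
    use_tts: bool                 # synthesize audio in the result-decorate stage
    streaming: bool               # whether the reply may still be streamed
    notice: Optional[str] = None  # hypothetical reason code for a tts_notice event


def resolve_tts(
    enable_tts: bool,
    streaming_requested: bool,
    tts_settings: Optional[dict],
    provider_available: bool,
) -> TtsDecision:
    """Decide whether a reply is spoken, streamed, or needs a notice."""
    settings = tts_settings or {}  # tolerate an explicit null in the config
    if not enable_tts:
        # The client did not ask for voice: nothing to resolve or report.
        return TtsDecision(False, streaming_requested)

    # Hypothetical reason codes; the real ones are defined by the PR.
    reason = None
    if not settings.get("enable"):
        reason = "tts_disabled"
    elif settings.get("trigger_probability", 1.0) <= 0:
        reason = "tts_disabled"  # probability 0 is treated as disabled
    elif not provider_available:
        reason = "no_tts_provider"

    if reason is not None:
        # TTS cannot run: keep the requested streaming mode and say why.
        return TtsDecision(False, streaming_requested, reason)
    # TTS can run: streaming is turned off so audio can be synthesized.
    return TtsDecision(True, False)
```

Keeping the decision free of I/O makes the streaming and non-streaming paths share one rule set, which is the property the description asks for.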

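Frontend

  • Adds the AudioMessagePart.vue voice player: waveform visualization, pointer/keyboard seeking (with a11y), text transcript, and autoplay for new voice replies; waveform decoding is lazy (the audio is downloaded only on first playback) and a single AudioContext is shared globally, so opening a history session does not download every clip.

  • Chat.vue adds a Voice Reply toggle (persisted in localStorage); tts_notice reason codes are explicitly mapped to localized toasts (en-US / zh-CN / ru-RU).

  • This is NOT a breaking change.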

Screenshots or Test Results

  • pytest tests/test_webchat_tts_replies.py tests/test_conversation_checkpoint.py: 18 passed
  • vue-tsc --noEmit: passed
  • ruff check / ruff format --check: passed
[Screenshots: astrbot-chatui-speech-test-01, astrbot-chatui-speech-test-02]

Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
  • 🤓 I have ensured that no new dependencies are introduced.
  • 😮 My changes do not introduce malicious code.

Title: feat(webchat): add voice reply (TTS) support to ChatUI

Summary by Sourcery

Introduce end-to-end voice reply support for webchat sessions, wiring ChatUI preferences through to backend TTS enablement, enriching audio messages with transcripts, and improving audio playback UX.

New Features:

  • Add configurable voice reply (TTS) support in webchat, including a ChatUI toggle and autoplaying audio replies with transcripts.

Bug Fixes:

  • Prevent empty bot message bubbles by merging trailing metadata-only turns into the previous record when possible.

Enhancements:

  • Persist per-session TTS preferences and align TTS enablement checks with backend pipeline behavior, including global, session, and provider availability.
  • Reuse unified message origin construction helpers and avoid redundant session config writes when TTS state is unchanged.
  • Replace native audio elements with a custom accessible audio player featuring waveform visualization and keyboard/pointer seeking.
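
The "avoid redundant session config writes" enhancement above amounts to a compare-before-write guard. A minimal sketch, assuming a toy in-memory store: `SessionTtsStore` and `set_tts_enabled` are hypothetical names, not the API of session_llm_manager.py.

```python
from typing import Dict


class SessionTtsStore:
    """Toy stand-in for the session config store (hypothetical)."""

    def __init__(self) -> None:
        self._state: Dict[str, bool] = {}
        self.writes = 0  # counts simulated DB writes

    def set_tts_enabled(self, session_id: str, enabled: bool) -> bool:
        """Persist the flag; return True only if a write actually happened."""
        if self._state.get(session_id) == enabled:
            return False  # value unchanged: skip the write
        self._state[session_id] = enabled
        self.writes += 1
        return True
```

Because the chat route sends the preference with every message, the guard turns a per-message write into a write per actual toggle.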

Tests:

  • Add backend tests covering webchat TTS session IDs, enablement decisions, and provider selection behavior.

Add a per-client "Voice Reply" toggle to ChatUI. When enabled, the chat
route persists the preference as the session's TTS state, disables
streaming so the result-decorate stage can synthesize audio, and emits a
tts_notice SSE event when TTS was requested but cannot run so the client
can show a localized hint.
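
For readers unfamiliar with Server-Sent Events, the following sketch shows how a notice could be framed on the wire. The `data: ...` line followed by a blank line is standard SSE framing; the payload fields `type` and `code`, the helper name `format_sse`, and the reason code are assumptions made for illustration, since the PR text does not show the actual payload.

```python
import json


def format_sse(payload: dict) -> str:
    """Serialize one Server-Sent Events message carrying a JSON payload."""
    # A message is one or more "data:" lines terminated by a blank line.
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"


# Hypothetical payload shape for a tts_notice event.
notice = format_sse({"type": "tts_notice", "code": "no_tts_provider"})
```

Sending a code rather than a sentence leaves the wording to the client, which is what allows the toast to be localized.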

Backend:
- Pass the Record component's text caption through the webchat queue so
  the UI can render a transcript under the audio bubble.
- Mirror the result-decorate stage's trigger_probability in the route's
  TTS gate: probability 0 keeps streaming instead of waiting for audio
  that will never be synthesized.
- Resolve TTS (and emit tts_notice) for non-streaming requests too, not
  only when streaming was requested.
- Attach trailing agent_stats/refs to the previously saved record
  instead of inserting an empty bubble; fall back to a metadata-only
  record when there is nothing to attach to.
- Return the built content from _save_bot_message so flush no longer
  rebuilds it; skip the session TTS state write when unchanged.
- Reuse _build_webchat_umo for thread UMOs.
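
The handling of trailing agent_stats/refs described in the list above can be sketched as follows. This is a simplified model under assumptions: `save_turn`, the `Record` shape, and the plain list standing in for saved records are hypothetical, and the real route tracks the last saved record differently.

```python
from typing import Any, Dict, List, Optional

# Hypothetical record shape: {"parts": [...], "agent_stats": ..., "refs": ...}
Record = Dict[str, Any]


def save_turn(
    history: List[Record],
    parts: List[Any],
    agent_stats: Optional[dict] = None,
    refs: Optional[list] = None,
) -> Record:
    """Append a turn, or fold trailing metadata into the previous record."""
    metadata: Dict[str, Any] = {}
    if agent_stats is not None:
        metadata["agent_stats"] = agent_stats
    if refs is not None:
        metadata["refs"] = refs

    if not parts and history:
        # Nothing visible to show: attach to the last saved record
        # instead of creating an empty bubble.
        history[-1].update(metadata)
        return history[-1]

    # Either there is visible content, or there is no prior record;
    # in the latter case a metadata-only record keeps stats/refs.
    record: Record = {"parts": list(parts), **metadata}
    history.append(record)
    return record
```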

Frontend:
- New AudioMessagePart player with waveform, seek (pointer + keyboard),
  caption, and autoplay for freshly streamed voice replies. Waveform
  decoding is lazy (first playback) and shares a single AudioContext so
  loading a history full of voice messages does not fetch every clip.
- Map all tts_notice codes to localized toasts with a generic fallback
  (en-US / zh-CN / ru-RU).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
dosubot (bot) added labels on Jun 11, 2026: size:XL (This PR changes 500-999 lines, ignoring generated files.) and feature:chatui (The bug / feature is about astrbot's chatui, webchat).

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a "Voice Reply" (Text-to-Speech) feature to the chat interface, including a new custom AudioMessagePart.vue component with a waveform visualizer, backend support for TTS resolution and session status management, and localization updates. The review feedback highlights several important issues: in AudioMessagePart.vue, the sharedAudioCtx is incorrectly scoped inside <script setup> preventing it from being shared globally, and there are missing defensive checks for zero-width elements, non-2xx fetch responses, and zero-channel audio buffers. On the backend, potential runtime errors could occur if provider_tts_settings is configured as null, and static type-checking warnings may arise from unsafe access to last_saved_content.

// One AudioContext shared by every player instance: browsers cap concurrent
// contexts, and decoding is the only thing we need it for.
let sharedAudioCtx: AudioContext | null = null;

high

In Vue <script setup>, all top-level variables are compiled inside the component's setup() function. This means sharedAudioCtx is instantiated per component instance rather than being shared globally across all player instances.

To truly share a single AudioContext globally (as intended by the comment), sharedAudioCtx should be declared in a separate, standard <script> block or imported from a shared utility module.

Comment thread astrbot/dashboard/routes/chat.py Outdated
Comment on lines +749 to +751
tts_settings = self.core_lifecycle.astrbot_config.get(
    "provider_tts_settings", {}
)

medium

If provider_tts_settings is explicitly configured as null (or None in Python) in the configuration file, self.core_lifecycle.astrbot_config.get("provider_tts_settings", {}) will return None instead of the default {}. This will lead to an AttributeError: 'NoneType' object has no attribute 'get' when calling tts_settings.get("enable").

Using or {} instead of the default argument in .get() ensures robust defensive handling.

Suggested change
- tts_settings = self.core_lifecycle.astrbot_config.get(
-     "provider_tts_settings", {}
- )
+ tts_settings = self.core_lifecycle.astrbot_config.get(
+     "provider_tts_settings"
+ ) or {}
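
The difference between the two forms is easy to verify with a plain dict. The snippet below uses a stand-in `config` dict rather than the real astrbot_config object, on the assumption that its `.get()` behaves like `dict.get`.

```python
config = {"provider_tts_settings": None}  # key present, value explicitly null

# The default only applies when the key is missing, so this is still None.
with_default = config.get("provider_tts_settings", {})

# `or {}` also covers a present-but-None (or otherwise falsy) value.
with_fallback = config.get("provider_tts_settings") or {}

print(with_default)                 # None
print(with_fallback.get("enable"))  # None, and no AttributeError
```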

Comment thread astrbot/dashboard/routes/chat.py Outdated
# to the previously saved record instead of inserting an empty bubble.
# With no prior record to attach to, fall through and persist a
# metadata-only record so stats/refs are not silently dropped.
if not message_parts_to_save and last_saved_record is not None:

medium

last_saved_content is typed as dict | None and initialized to None. Although logically it is populated whenever last_saved_record is not None, static type checkers (like mypy or pyright) cannot infer this correlation and will flag last_saved_content.get(...) as unsafe.

Adding an explicit None check for last_saved_content satisfies type checkers and prevents potential runtime errors.

Suggested change
- if not message_parts_to_save and last_saved_record is not None:
+ if not message_parts_to_save and last_saved_record is not None and last_saved_content is not None:
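
The narrowing issue can be reproduced in isolation. In this sketch the function `attach_stats` and its parameters are hypothetical and only mirror the two variable names from the review; the point is that a type checker narrows each Optional variable independently.

```python
from typing import Optional


def attach_stats(
    last_saved_record: Optional[dict],
    last_saved_content: Optional[dict],
    stats: dict,
) -> bool:
    """Attach stats to the previous content if there is one."""
    # Testing only last_saved_record would leave last_saved_content typed
    # as Optional[dict], so item assignment on it would be flagged.
    if last_saved_record is not None and last_saved_content is not None:
        last_saved_content["agent_stats"] = stats  # narrowed to dict here
        return True
    return False
```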

Comment on lines +149 to +150
const rect = wave.getBoundingClientRect();
const ratio = Math.min(1, Math.max(0, (clientX - rect.left) / rect.width));

medium

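If the waveform element is hidden or not fully rendered, rect.width can be 0, so the expression divides by zero. The clamp absorbs ±Infinity, but when clientX equals rect.left the result is 0 / 0 = NaN, which passes through Math.min/Math.max unchanged; the derived time is then NaN, and assigning a non-finite value to el.currentTime throws a TypeError.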

Adding a defensive check to return early if rect.width is 0 prevents this issue.

  const rect = wave.getBoundingClientRect();
  if (rect.width === 0) return;
  const ratio = Math.min(1, Math.max(0, (clientX - rect.left) / rect.width));

Comment on lines +231 to +232
const resp = await fetch(url);
const arrayBuffer = await resp.arrayBuffer();

medium

fetch(url) does not throw an error on non-2xx HTTP status codes (such as 404 Not Found or 500 Internal Server Error). It will proceed to call resp.arrayBuffer() and attempt to decode invalid data, which eventually throws in decodeAudioData.

Checking resp.ok first avoids unnecessary processing and provides clearer error handling.

    const resp = await fetch(url);
    if (!resp.ok) throw new Error("Failed to fetch audio: " + resp.status);
    const arrayBuffer = await resp.arrayBuffer();

const audioBuffer = await ctx.decodeAudioData(arrayBuffer);
if (token !== decodeToken) return;

const channel = audioBuffer.getChannelData(0);

medium

If a corrupted or empty audio file is decoded, audioBuffer might have 0 channels. Calling getChannelData(0) on an empty buffer will throw an IndexSizeError DOMException.

Adding a guard clause to check numberOfChannels ensures robust error handling.

    if (audioBuffer.numberOfChannels === 0) return;
    const channel = audioBuffer.getChannelData(0);

sourcery-ai (bot) left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In AudioMessagePart.vue, several user-visible strings (e.g. aria-labels Play / Pause / Seek) are hard-coded in English; consider wiring these through the existing i18n layer so they can be localized alongside the rest of the chat UI.
  • The waveform decoding in AudioMessagePart.vue uses a token to ignore stale results but does not cancel the in-flight fetch / decodeAudioData; if users scrub quickly between messages, consider adding an AbortController or equivalent to avoid unnecessary downloads/decodes for audio that will never be shown.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `AudioMessagePart.vue`, several user-visible strings (e.g. aria-labels `Play` / `Pause` / `Seek`) are hard-coded in English; consider wiring these through the existing i18n layer so they can be localized alongside the rest of the chat UI.
- The waveform decoding in `AudioMessagePart.vue` uses a token to ignore stale results but does not cancel the in-flight `fetch` / `decodeAudioData`; if users scrub quickly between messages, consider adding an `AbortController` or equivalent to avoid unnecessary downloads/decodes for audio that will never be shown.

## Individual Comments

### Comment 1
<location path="dashboard/src/components/chat/message_list_comps/AudioMessagePart.vue" line_range="218-219" />
<code_context>
+// downloads the full file, so it only runs lazily on first playback — a chat
+// history full of voice messages must not fetch every clip on mount.
+let decodeToken = 0;
+let waveformStarted = false;
+function ensureWaveform() {
+  if (waveformStarted) return;
+  waveformStarted = true;
</code_context>
<issue_to_address>
**issue (bug_risk):** waveformStarted is shared across all component instances, so only one audio message ever builds a decoded waveform

Since `waveformStarted` is module-scoped, all `AudioMessagePart` instances share it. After the first call to `ensureWaveform()` sets it to true, later instances never decode their own waveform and stay on the fallback bars. Please make this flag instance-specific (e.g., a `ref(false)` or deriving from whether `bars` is still the fallback), so each clip can lazily decode its own waveform while still sharing the `AudioContext`.
</issue_to_address>

Comment on lines +218 to +219
let waveformStarted = false;
function ensureWaveform() {

issue (bug_risk): waveformStarted is shared across all component instances, so only one audio message ever builds a decoded waveform

Since waveformStarted is module-scoped, all AudioMessagePart instances share it. After the first call to ensureWaveform() sets it to true, later instances never decode their own waveform and stay on the fallback bars. Please make this flag instance-specific (e.g., a ref(false) or deriving from whether bars is still the fallback), so each clip can lazily decode its own waveform while still sharing the AudioContext.

- Move the shared AudioContext to a plain <script> block: top-level
  <script setup> state is per component instance, so the previous
  "shared" context was actually created once per audio player.
- Guard waveform seeking against a zero-width element (NaN currentTime).
- Check resp.ok before decoding and skip zero-channel buffers in the
  waveform builder for clearer failure handling.
- Treat an explicit null provider_tts_settings config as empty dict.
- Add an explicit None check for last_saved_content alongside
  last_saved_record to satisfy static type checkers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>