feat(gateway): feishu voice message STT via gateway audio attachment#761
wangyuyan-agent wants to merge 7 commits into openabdev:main
Once the bot replies in a thread, subsequent messages in that thread bypass @mention gating, matching Discord's default 'involved' mode.

- Add participated_threads cache (HashMap<thread_id, Instant>)
- Bypass mention gating when message is in a participated thread
- Record participation on successful reply to a thread
- TTL controlled by FEISHU_SESSION_TTL_HOURS (default 24h)
- Cache eviction at 1000 entries (oldest-half strategy)
- 3 new tests for participation logic
- Extract check_thread_participated() helper to reduce duplication
- Add comments explaining intentional poisoned-mutex recovery
- Improve eviction: drop TTL-expired entries first, then oldest half
- Add comment clarifying session_ttl_secs=0 disables participation tracking
- Update bot_turns comment: remove TODO, note existing eviction pattern
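The participation cache described in these commits can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the `ParticipationCache` type and its method names are hypothetical, and only the `HashMap<thread_id, Instant>` shape, the 1000-entry cap, the "expired first, then oldest half" eviction order, and the "TTL of 0 disables tracking" rule come from the commit messages.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cap cited in the commit message (1000 entries).
const PARTICIPATION_CACHE_MAX: usize = 1000;

/// Hypothetical wrapper around the thread_id -> last-reply-time map.
pub struct ParticipationCache {
    ttl: Duration,
    threads: HashMap<String, Instant>,
}

impl ParticipationCache {
    /// A TTL of 0 seconds disables participation tracking entirely.
    pub fn new(ttl_secs: u64) -> Self {
        Self { ttl: Duration::from_secs(ttl_secs), threads: HashMap::new() }
    }

    /// Record a successful reply in `thread_id` at time `now`.
    pub fn record(&mut self, thread_id: &str, now: Instant) {
        if self.ttl.is_zero() {
            return;
        }
        if self.threads.len() >= PARTICIPATION_CACHE_MAX && !self.threads.contains_key(thread_id) {
            self.evict(now);
        }
        self.threads.insert(thread_id.to_string(), now);
    }

    /// True when the bot replied in this thread less than one TTL ago.
    pub fn is_participated(&self, thread_id: &str, now: Instant) -> bool {
        if self.ttl.is_zero() {
            return false;
        }
        self.threads
            .get(thread_id)
            .map_or(false, |seen| now.saturating_duration_since(*seen) < self.ttl)
    }

    /// Drop TTL-expired entries first; if still full, drop the oldest half.
    fn evict(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.threads.retain(|_, seen| now.saturating_duration_since(*seen) < ttl);
        if self.threads.len() >= PARTICIPATION_CACHE_MAX {
            let mut entries: Vec<(String, Instant)> =
                self.threads.iter().map(|(k, v)| (k.clone(), *v)).collect();
            entries.sort_by_key(|(_, seen)| *seen);
            for (key, _) in entries.iter().take(entries.len() / 2) {
                self.threads.remove(key);
            }
        }
    }

    pub fn len(&self) -> usize {
        self.threads.len()
    }
}
```

Passing `now` explicitly keeps the sketch deterministic in tests; the adapter itself presumably wraps the map in a mutex, which is omitted here.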
Add AllowUsers enum (Involved/Mentions/MultibotMentions) controlled by FEISHU_ALLOW_USER_MESSAGES env var. In multibot-mentions mode, once another bot is @mentioned in a participated thread, require @mention for all bots. This prevents multiple bots from responding simultaneously.

Multibot detection strategy:
- If FEISHU_TRUSTED_BOT_IDS configured: exact match
- Otherwise: infer from allowed_users (mention not self and not in allowed_users → assumed to be another bot)
- Only triggers in threads where bot has already participated

This avoids requiring users to discover per-app open_ids for other bots.
Deduplicate the multibot detection block (~30 lines) that was repeated in both handle_ws_message and webhook(). Both now call a shared detect_and_mark_multibot() helper that handles:
- Thread participation check
- @mention-based other-bot detection (trusted IDs or inference)
- Multibot cache marking with eviction
- Computing is_thread_participated based on allow_user_messages mode

Also update PARTICIPATION_CACHE_MAX comment to note it is intentionally shared between participated_threads and multibot_threads caches.
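The detection strategy and mode handling above might look like the sketch below. The function names `mentions_other_bot` and `bypass_mention_gating`, the string spellings accepted by `parse`, and the reading that Mentions mode always requires an @mention are assumptions for illustration; the enum variants and the trusted-IDs-or-inference rule are taken from the commit messages.

```rust
use std::collections::HashSet;

/// Modes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllowUsers {
    Involved,
    Mentions,
    MultibotMentions,
}

impl AllowUsers {
    /// Parse a FEISHU_ALLOW_USER_MESSAGES value (accepted spellings are assumed).
    pub fn parse(raw: &str) -> Option<Self> {
        match raw.trim().to_ascii_lowercase().as_str() {
            "involved" => Some(Self::Involved),
            "mentions" => Some(Self::Mentions),
            "multibot-mentions" => Some(Self::MultibotMentions),
            _ => None,
        }
    }
}

/// True when any mention looks like another bot: exact match against the
/// trusted IDs when configured, otherwise "not self and not an allowed user".
pub fn mentions_other_bot(
    mentions: &[&str],
    self_id: &str,
    trusted_bot_ids: &HashSet<String>,
    allowed_users: &HashSet<String>,
) -> bool {
    mentions.iter().any(|id| {
        if *id == self_id {
            return false;
        }
        if !trusted_bot_ids.is_empty() {
            trusted_bot_ids.contains(*id)
        } else {
            !allowed_users.contains(*id)
        }
    })
}

/// Decide whether a message may skip @mention gating.
pub fn bypass_mention_gating(
    mode: AllowUsers,
    thread_participated: bool,
    thread_is_multibot: bool,
) -> bool {
    match mode {
        // Assumed: this mode always requires an explicit @mention.
        AllowUsers::Mentions => false,
        AllowUsers::Involved => thread_participated,
        // Once another bot was mentioned in the thread, require @mention again.
        AllowUsers::MultibotMentions => thread_participated && !thread_is_multibot,
    }
}
```

The inference branch trades precision for convenience: a human who is not in allowed_users is also treated as a bot, which only makes gating stricter.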
OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report

## Intent

PR #761 adds Feishu voice message support to the OpenAB gateway. The concrete problem is that Feishu users can currently send text-like messages through the gateway, but voice messages are not converted into prompt input, so the agent cannot respond meaningfully to spoken input. The PR proposes downloading Feishu

## Feat

Feature. Behavioral change: Feishu gateway messages with

The PR also generalizes the gateway protocol by introducing an

## Who It Serves

Primary beneficiaries: Feishu end users who want to interact with OpenAB agents using voice messages.

Secondary beneficiaries: gateway adapter maintainers, because the PR creates a reusable protocol path for audio attachments instead of making Feishu-specific STT logic a one-off.

Operational beneficiaries: deployers who already use

## Rewritten Prompt

Implement Feishu voice-message support through the gateway attachment protocol. When the Feishu adapter receives

In core gateway handling, recognize

Cover both Feishu websocket and webhook paths. Add focused tests for audio message parsing, resource download behavior, gateway event attachment shape, STT-enabled transcript injection, and graceful behavior when STT is disabled or payload decoding fails. Update Feishu and STT docs to describe multi-platform voice attachment support.

## Merge Pitch

This is worth advancing because voice input is a real user-facing capability gap for Feishu deployments, and the proposed architecture mostly reuses OpenAB’s existing STT configuration and transcription path.

Risk profile is moderate. The user-facing feature is straightforward, but the PR touches gateway protocol semantics, Feishu adapter behavior, core prompt construction, and STT error handling. The likely reviewer concern is whether the new generic

## Best-Practice Comparison

Relevant OpenClaw principles:
Relevant Hermes Agent principles:
Overall, the proposed direction fits the reference systems where they emphasize typed handoff, clear execution boundaries, and operator-observable failures. Scheduling and durable job-state principles do not apply.

## Implementation Options

Conservative option: Feishu-only audio support using existing STT in core. Keep the current PR narrow. Add

Balanced option: Formalize gateway audio attachments as a small cross-adapter contract. Accept Feishu support, but also define the gateway protocol expectations for audio attachments: required fields, MIME handling, max size behavior, error logging, and what core does when STT is disabled. Add reusable helper functions so LINE and Telegram adapters can later plug in only their platform-specific download logic.

Ambitious option: Introduce a media ingestion layer for gateway adapters. Create a gateway-level media abstraction for files, images, audio, and future media types, with shared download limits, content-type detection, logging, retry policy, and typed conversion into core attachments. Feishu audio becomes the first consumer, but the system is designed for all rich-message platforms.

## Comparison Table
## Recommendation

Advance the PR using the balanced option. The feature is valuable enough to move forward, but the merge discussion should focus on making

Sequence it as one mergeable step: land Feishu voice support plus the minimal reusable audio attachment contract. Defer a broader media ingestion layer until at least one more adapter needs similar download-and-forward behavior.
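As a rough illustration of what a minimal audio attachment contract could cover on the core side, the sketch below decodes a base64 payload, checks MIME and size, transcribes, and degrades gracefully when STT is disabled or decoding fails. Everything here is hypothetical: the `Attachment` field names, the `Transcriber` trait (a stand-in for the existing STT path), `MAX_AUDIO_BYTES`, the transcript prefix, and the hand-rolled decoder, which exists only to keep the example dependency-free and does not validate padding.

```rust
/// Hypothetical size limit for decoded audio.
const MAX_AUDIO_BYTES: usize = 10 * 1024 * 1024;

/// Hypothetical attachment shape as received by core from the gateway.
pub struct Attachment {
    pub kind: String,        // e.g. "audio"
    pub mime: String,        // e.g. "audio/ogg"
    pub data_base64: String, // payload in the standard base64 alphabet
}

/// Stand-in for the existing STT infrastructure.
pub trait Transcriber {
    fn transcribe(&self, audio: &[u8], mime: &str) -> Result<String, String>;
}

/// Stand-in transcriber that reports the byte count; handy for local runs.
pub struct ByteCountTranscriber;

impl Transcriber for ByteCountTranscriber {
    fn transcribe(&self, audio: &[u8], _mime: &str) -> Result<String, String> {
        Ok(format!("{} bytes of audio", audio.len()))
    }
}

/// Lenient standard-alphabet base64 decoder (std only, sketch quality).
pub fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes().filter(|b| !b.is_ascii_whitespace()) {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of data
            _ => return None,
        };
        buf = (buf << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Build prompt text, placing a transcript line before the user's text for
/// each usable audio attachment. `stt` is None when STT is disabled.
pub fn build_prompt(text: &str, attachments: &[Attachment], stt: Option<&dyn Transcriber>) -> String {
    let mut parts: Vec<String> = Vec::new();
    for att in attachments {
        if att.kind != "audio" || !att.mime.starts_with("audio/") {
            continue;
        }
        let engine = match stt {
            Some(e) => e,
            None => continue, // STT disabled: skip silently
        };
        let bytes = match decode_base64(&att.data_base64) {
            Some(b) if !b.is_empty() && b.len() <= MAX_AUDIO_BYTES => b,
            _ => {
                eprintln!("warn: audio attachment unusable (decode failed, empty, or too large)");
                continue;
            }
        };
        match engine.transcribe(&bytes, &att.mime) {
            Ok(t) => parts.push(format!("[Voice message transcript]: {}", t)),
            Err(e) => eprintln!("warn: transcription failed: {}", e),
        }
    }
    if !text.is_empty() {
        parts.push(text.to_string());
    }
    parts.join("\n")
}
```

Failures are logged rather than surfaced to the user, so a broken voice message simply contributes nothing to the prompt.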
1. session_ttl_secs doc comment: clarify conversion from FEISHU_SESSION_TTL_HOURS
2. Rename is_thread_participated → bypass_mention_gating in parse_message_event with doc comment explaining the parameter semantics
- Add msg_type=audio support to feishu adapter (parse, download, base64 encode)
- Add MediaRef::Audio variant and download_feishu_audio() function
- Add "audio" attachment type to core gateway handler (decode → stt::transcribe)
- Pass SttConfig to gateway handler via GatewayParams
- Update docs/feishu.md and docs/stt.md for multi-platform voice support

Feishu voice messages (opus/ogg) are downloaded by the gateway, passed as base64-encoded audio attachments to core, and transcribed via the existing [stt] infrastructure (Groq Whisper by default). This is the first gateway platform to support audio; LINE/Telegram can reuse the core-side handler.

Tested: 102 gateway tests + 197 core tests pass. E2E verified.
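On the adapter side, the parse and download steps revolve around the file_key in the message content and the message-resource endpoint listed under Feishu API Facts in the PR description. The sketch below only shows how the download URL could be assembled; it is not download_feishu_audio() itself. The helper names are hypothetical, the sample IDs in the usage note are placeholders, and the naive field extraction is an assumption made to avoid a JSON dependency (it ignores escaped quotes, so real code would use a JSON parser).

```rust
/// Feishu Open Platform base URL (Lark tenants use a different host).
const FEISHU_API_BASE: &str = "https://open.feishu.cn/open-apis";

/// Naively pull the "file_key" string out of an audio message's content JSON.
pub fn extract_file_key(content: &str) -> Option<&str> {
    let marker = "\"file_key\"";
    let rest = &content[content.find(marker)? + marker.len()..];
    let rest = rest
        .trim_start()
        .strip_prefix(':')?
        .trim_start()
        .strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Build the message-resource URL for downloading the audio file.
pub fn audio_resource_url(message_id: &str, file_key: &str) -> String {
    format!(
        "{}/im/v1/messages/{}/resources/{}?type=file",
        FEISHU_API_BASE, message_id, file_key
    )
}
```

For a content string such as `{"file_key":"file_v2_abc","duration":3000}` and a message ID `om_123`, the helpers yield a URL ending in `/im/v1/messages/om_123/resources/file_v2_abc?type=file`; the request itself still needs the tenant access token, which is out of scope here.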
c0775f4 to 646d99a
Summary
Adds voice message (speech-to-text) support for the Feishu gateway adapter. When a user sends a voice message, the gateway downloads the opus/ogg audio, passes it to core as a base64-encoded "audio" attachment, and core transcribes it via the existing [stt] infrastructure before injecting the transcript into the LLM prompt.

This also introduces the "audio" attachment type to the gateway protocol, making it trivial for LINE/Telegram adapters to add voice support in the future (only the download logic differs per platform).

Stacks on #746 → #744. Please merge in order: #744 → #746 → this PR. Will rebase onto main once dependencies land.
Prior Art
[Comparison table; row and column headers not recovered. Surviving cells: audio/* MIME on attachment; msg_type == "audio"; [stt]; [stt] section; voice: YAML; [stt], zero new config]

Design Trade-offs
Why STT in Core (not Gateway)?
- reqwest doesn't have multipart feature (needed for Whisper API)
- stt.rs + media.rs: reuse > rewrite

Why base64 over WebSocket (not streaming/binary)?
Why no user feedback on STT failure?
Changes
- gateway/src/adapters/feishu.rs: Allow msg_type=audio, add MediaRef::Audio, add download_feishu_audio(), handle in both WS and webhook paths
- src/gateway.rs: Add stt: SttConfig to GatewayParams, add "audio" attachment handler (decode → transcribe → inject), warn on decode failure
- src/main.rs: Pass cfg.stt.clone() to GatewayParams
- docs/feishu.md: Add audio row to message type table
- docs/stt.md: Update from Discord-only to multi-platform wording

Configuration
Uses the existing [stt] section; no new configuration is needed. See docs/stt.md for the full setup guide.
Testing
Feishu API Facts
- msg_type=audio, content: {"file_key":"...", "duration":N}
- /im/v1/messages/{id}/resources/{key}?type=file)
- im:message (already required)

Known Limitations (v1)
Discussion
https://discord.com/channels/1491295327620169908/1500160821567684660