feat(gateway): feishu voice message STT via gateway audio attachment by wangyuyan-agent · Pull Request #761 · openabdev/openab

wangyuyan-agent · 2026-05-06T15:58:36Z

Summary

Adds voice message (speech-to-text) support for the Feishu gateway adapter. When a user sends a voice message, the gateway downloads the opus/ogg audio, passes it to core as a base64-encoded "audio" attachment, and core transcribes it via the existing [stt] infrastructure before injecting the transcript into the LLM prompt.

This also introduces the "audio" attachment type to the gateway protocol — making it trivial for LINE/Telegram adapters to add voice support in the future (only the download logic differs per platform).

Feishu user sends voice message
    │
    ▼
Gateway: im.message.receive_v1 (msg_type=audio)
    │  parse content → extract file_key
    │  GET /im/v1/messages/{id}/resources/{key}?type=file
    │  base64 encode → Attachment{type:"audio", mime:"audio/ogg"}
    ▼
WebSocket → Core: GatewayEvent with audio attachment
    │
    ▼
Core: [stt] enabled?
    ├── Yes → decode base64 → stt::transcribe() → inject "[Voice message transcript]: ..."
    │         → LLM processes transcript as text
    └── No  → silently skip (graceful degradation)
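The core-side branch of the flow above can be sketched as follows. This is a minimal, std-only illustration and not the PR's actual code: the `Attachment` field names, `decode_base64`, `audio_to_prompt`, and the `transcribe` closure (standing in for `stt::transcribe()`) are all hypothetical, and the real handler presumably uses a base64 crate and logs a warning on decode failure.

```rust
/// Hypothetical mirror of the gateway attachment; field names are assumptions.
#[derive(Debug, Clone)]
pub struct Attachment {
    pub kind: String, // "audio" (the `type` field on the wire)
    pub mime: String, // e.g. "audio/ogg"
    pub data: String, // base64-encoded payload
}

/// Decode standard (RFC 4648) base64 with `=` padding; None on invalid input.
pub fn decode_base64(input: &str) -> Option<Vec<u8>> {
    fn sextet(c: u8) -> Option<u32> {
        match c {
            b'A'..=b'Z' => Some((c - b'A') as u32),
            b'a'..=b'z' => Some((c - b'a') as u32 + 26),
            b'0'..=b'9' => Some((c - b'0') as u32 + 52),
            b'+' => Some(62),
            b'/' => Some(63),
            _ => None,
        }
    }
    let bytes = input.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let chunk_count = bytes.len() / 4;
    let mut out = Vec::with_capacity(chunk_count * 3);
    for (i, chunk) in bytes.chunks(4).enumerate() {
        let pad = chunk.iter().rev().take_while(|&&c| c == b'=').count();
        // Padding is only valid on the final chunk, and at most two characters.
        if pad > 2 || (pad > 0 && i + 1 != chunk_count) {
            return None;
        }
        let mut acc = 0u32;
        for &c in &chunk[..4 - pad] {
            acc = (acc << 6) | sextet(c)?;
        }
        acc <<= 6 * pad as u32;
        let triple = [(acc >> 16) as u8, (acc >> 8) as u8, acc as u8];
        out.extend_from_slice(&triple[..3 - pad]);
    }
    Some(out)
}

/// Returns the line to inject into the prompt, or None to skip silently.
pub fn audio_to_prompt<F>(att: &Attachment, stt_enabled: bool, transcribe: F) -> Option<String>
where
    F: Fn(&[u8], &str) -> Option<String>,
{
    if att.kind != "audio" || !stt_enabled {
        return None; // STT disabled: graceful degradation
    }
    let audio = decode_base64(&att.data)?; // decode failure: skip
    let text = transcribe(&audio, &att.mime)?; // STT failure: skip
    Some(format!("[Voice message transcript]: {text}"))
}
```

Passing the transcriber as a closure keeps the sketch testable without network access; every failure path collapses to `None`, which mirrors the silent-skip behavior described in the diagram.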

⚠️ Dependency

Stacks on #746 → #744. Please merge in order: #744 → #746 → this PR. Will rebase onto main once dependencies land.

Prior Art

| Feature | OpenAB (Discord) | OpenClaw | Hermes Agent | This PR (Feishu) |
|---|---|---|---|---|
| Voice detection | `audio/*` MIME on attachment | Skill-based (plugin) | Built-in voice mode | `msg_type == "audio"` |
| STT engine | OpenAI-compatible (Groq default) | Extensible providers (Yandex, Whisper, Gemini) | Built-in (Whisper, configurable) | Same as Discord: reuses core `[stt]` |
| Audio format | ogg/opus | ogg, wav, mp3 | ogg/opus | opus/ogg (Feishu native) |
| Where STT runs | In adapter (direct core access) | In skill process | In gateway | In core (gateway passes raw audio) |
| Fallback on failure | Silent skip + 🎤 reaction | Error message to user | Configurable | Silent skip (matches Discord) |
| Config | `[stt]` section | Per-skill config | `voice:` YAML | Same `[stt]`: zero new config |

Design Trade-offs

Why STT in Core (not Gateway)?

The gateway's reqwest build does not enable the multipart feature (needed for the Whisper API)
Gateway would need to hold API keys, manage config, handle retries
Core already has stt.rs + media.rs — reuse > rewrite
Verdict: Keep gateway lightweight. STT is "understanding", belongs in core.

Why base64 over WebSocket (not streaming/binary)?

Feishu voice messages are capped at 60s → ~1-2MB opus → ~1.3-2.7MB base64
The Whisper API requires a complete file (no streaming input)
Binary WS frames would avoid the ~33% base64 overhead but require protocol changes
Verdict: Not worth the complexity for <3MB payloads.
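The size estimate above follows from base64 emitting 4 output characters for every 3 input bytes. A small sketch of that arithmetic (the function name `base64_len` is illustrative and not part of the PR):

```rust
/// Length of padded base64 output for `n` raw bytes: 4 * ceil(n / 3).
pub const fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}
```

For a 2,000,000-byte recording this gives 2,666,668 characters, which is where the "~2.7MB" upper bound comes from.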

Why no user feedback on STT failure?

Discord adapter also silently skips (established pattern)
Adding error feedback requires knowing the user's language and platform-specific reply formatting
Verdict: Match Discord behavior for v1. Can add feedback in follow-up.

Changes

gateway/src/adapters/feishu.rs: Allow msg_type=audio, add MediaRef::Audio, add download_feishu_audio(), handle in both WS and webhook paths
src/gateway.rs: Add stt: SttConfig to GatewayParams, add "audio" attachment handler (decode → transcribe → inject), warn on decode failure
src/main.rs: Pass cfg.stt.clone() to GatewayParams
docs/feishu.md: Add audio row to message type table
docs/stt.md: Update from Discord-only to multi-platform wording

Configuration

Uses the existing [stt] section — no new configuration:

[stt]
enabled = true
# Default: Groq free tier (auto-detects GROQ_API_KEY env var)
# model = "whisper-large-v3-turbo"
# base_url = "https://api.groq.com/openai/v1"

See docs/stt.md for full setup guide.
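To make the commented-out defaults concrete, here is a hedged sketch of how such a section might map onto a config struct. The struct below is a hypothetical stand-in for the real `SttConfig` (its actual fields are not shown in this PR), `enabled = false` as the default is an assumption, and `transcription_url` assumes the standard OpenAI-compatible `/audio/transcriptions` path under `base_url`.

```rust
/// Hypothetical stand-in for core's SttConfig; field set is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct SttConfig {
    pub enabled: bool,
    pub model: String,
    pub base_url: String,
}

impl Default for SttConfig {
    fn default() -> Self {
        Self {
            enabled: false, // assumption: STT is opt-in
            model: "whisper-large-v3-turbo".to_string(),
            base_url: "https://api.groq.com/openai/v1".to_string(),
        }
    }
}

/// Build the transcription endpoint for an OpenAI-compatible provider.
pub fn transcription_url(cfg: &SttConfig) -> String {
    format!("{}/audio/transcriptions", cfg.base_url.trim_end_matches('/'))
}
```

With these defaults, setting only `enabled = true` would be enough to target Groq, which matches the "no new configuration" claim.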

Testing

Gateway: 102 tests pass
Core: 197 tests pass
E2E: Feishu private chat → voice message → download → STT → LLM responds ✅

Feishu API Facts

Event: msg_type=audio, content: {"file_key":"...", "duration":N}
Download: same API as file (/im/v1/messages/{id}/resources/{key}?type=file)
Format: opus/ogg, which Whisper supports natively, so no transcoding is needed
Permission: im:message (already required)
Size: typical 0.5-2MB for 60s voice (test messages: 5-16KB for 2-7s)
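As an illustration of the download step, the resource URL can be assembled from the message ID and the `file_key` taken from the event content. The helper name `audio_resource_url` is hypothetical, and the base URL constant is an assumption (the PR only gives the path); IDs are assumed to be URL-safe tokens, so no percent-encoding is done here.

```rust
/// Assumed Feishu Open API base; the PR itself only states the path.
pub const FEISHU_API_BASE: &str = "https://open.feishu.cn/open-apis";

/// Build the message-resource URL used for files and, per this PR, audio.
pub fn audio_resource_url(base: &str, message_id: &str, file_key: &str) -> String {
    format!(
        "{}/im/v1/messages/{}/resources/{}?type=file",
        base.trim_end_matches('/'),
        message_id,
        file_key
    )
}
```

Because audio reuses the same endpoint as file downloads with `type=file`, an adapter that already downloads files should need little beyond routing `msg_type=audio` into the same path.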

Known Limitations (v1)

STT failure → silent skip (no user feedback). Matches Discord behavior.
Base64 overhead (~33%) — negligible for actual voice messages (<2MB).
No duration-based filtering — very short voice messages (accidental taps) still get transcribed.

Discussion

https://discord.com/channels/1491295327620169908/1500160821567684660

Once the bot replies in a thread, subsequent messages in that thread bypass @mention gating, matching Discord's default 'involved' mode.

- Add participated_threads cache (HashMap<thread_id, Instant>)
- Bypass mention gating when message is in a participated thread
- Record participation on successful reply to a thread
- TTL controlled by FEISHU_SESSION_TTL_HOURS (default 24h)
- Cache eviction at 1000 entries (oldest-half strategy)
- 3 new tests for participation logic

- Extract check_thread_participated() helper to reduce duplication
- Add comments explaining intentional poisoned-mutex recovery
- Improve eviction: drop TTL-expired entries first, then oldest half

- Add comment clarifying session_ttl_secs=0 disables participation tracking
- Update bot_turns comment: remove TODO, note existing eviction pattern
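The participation cache described in these commit notes can be sketched as below. This is an illustrative reconstruction, not the PR's code: the `ParticipationCache` type and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to test. It follows the stated rules: a TTL of zero disables tracking, and eviction drops TTL-expired entries first, then the oldest half.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical thread-participation cache keyed by thread ID.
pub struct ParticipationCache {
    entries: HashMap<String, Instant>,
    ttl: Duration,
    max_entries: usize,
}

impl ParticipationCache {
    pub fn new(ttl: Duration, max_entries: usize) -> Self {
        Self { entries: HashMap::new(), ttl, max_entries }
    }

    /// Record that the bot replied in `thread_id` at time `now`.
    pub fn record(&mut self, thread_id: &str, now: Instant) {
        if self.ttl.is_zero() {
            return; // a TTL of 0 disables participation tracking
        }
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(thread_id) {
            self.evict(now);
        }
        self.entries.insert(thread_id.to_string(), now);
    }

    /// True if the bot replied in this thread within the TTL window.
    pub fn participated(&self, thread_id: &str, now: Instant) -> bool {
        self.entries
            .get(thread_id)
            .is_some_and(|&at| now.duration_since(at) < self.ttl)
    }

    /// Drop TTL-expired entries first; if still full, drop the oldest half.
    fn evict(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.entries.retain(|_, at| now.duration_since(*at) < ttl);
        if self.entries.len() < self.max_entries {
            return;
        }
        let mut by_age: Vec<(String, Instant)> =
            self.entries.iter().map(|(k, v)| (k.clone(), *v)).collect();
        by_age.sort_by_key(|(_, at)| *at);
        for (key, _) in by_age.into_iter().take(self.max_entries / 2) {
            self.entries.remove(&key);
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

In the PR's terms, `ttl` would come from FEISHU_SESSION_TTL_HOURS (default 24h) and `max_entries` would be 1000; the real code also guards the map with a mutex, which this sketch omits.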

Add AllowUsers enum (Involved/Mentions/MultibotMentions) controlled by FEISHU_ALLOW_USER_MESSAGES env var. In multibot-mentions mode, once another bot is @mentioned in a participated thread, require @mention for all bots, which prevents multiple bots from responding simultaneously.

Multibot detection strategy:

- If FEISHU_TRUSTED_BOT_IDS configured: exact match
- Otherwise: infer from allowed_users (mention not self and not in allowed_users → assumed to be another bot)
- Only triggers in threads where bot has already participated

This avoids requiring users to discover per-app open_ids for other bots.

Deduplicate the multibot detection block (~30 lines) that was repeated in both handle_ws_message and webhook(). Both now call a shared detect_and_mark_multibot() helper that handles:

- Thread participation check
- @mention-based other-bot detection (trusted IDs or inference)
- Multibot cache marking with eviction
- Computing is_thread_participated based on allow_user_messages mode

Also update PARTICIPATION_CACHE_MAX comment to note it is intentionally shared between participated_threads and multibot_threads caches.
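The mode switch and the other-bot detection rule described above might look roughly like this. It is a sketch under assumptions: the accepted env-var spellings, the fallback to `Involved`, and the function name `mentions_other_bot` are hypothetical; only the enum variants and the trusted-IDs-or-inference rule come from the commit notes.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllowUsers {
    Involved,
    Mentions,
    MultibotMentions,
}

impl AllowUsers {
    /// Parse a FEISHU_ALLOW_USER_MESSAGES value. The spellings and the
    /// fallback to Involved are assumptions made for this sketch.
    pub fn parse(raw: &str) -> Self {
        match raw.trim().to_ascii_lowercase().as_str() {
            "mentions" => Self::Mentions,
            "multibot-mentions" => Self::MultibotMentions,
            _ => Self::Involved,
        }
    }
}

/// True if any mention looks like another bot: an exact match against the
/// trusted IDs when configured, otherwise anyone who is neither this bot
/// nor a known allowed user.
pub fn mentions_other_bot(
    mentioned: &[&str],
    self_id: &str,
    trusted_bot_ids: &[&str],
    allowed_users: &[&str],
) -> bool {
    mentioned.iter().any(|id| {
        if *id == self_id {
            return false;
        }
        if !trusted_bot_ids.is_empty() {
            trusted_bot_ids.contains(id)
        } else {
            !allowed_users.contains(id)
        }
    })
}
```

The inference branch trades precision for convenience: an unknown human mentioned in a participated thread would be treated as a bot, which is the cost of not requiring operators to look up per-app open_ids.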

shaun-agent · 2026-05-06T16:01:35Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: feat(gateway): feishu voice message STT via gateway audio attachment
Source: feat(gateway): feishu voice message STT via gateway audio attachment #761
Status: moved to PR-Screening
Generated at: 2026-05-06T16:01:35.025Z
Discord thread: https://discord.com/channels/1488041051187974246/1501614728768782448

Screening report

Intent

PR #761 adds Feishu voice message support to the OpenAB gateway. The concrete problem is that Feishu users can currently send text-like messages through the gateway, but voice messages are not converted into prompt input, so the agent cannot respond meaningfully to spoken input.

The PR proposes downloading Feishu audio message resources, forwarding them to core as base64 audio attachments, and letting the existing OpenAB STT pipeline transcribe them before prompt injection.

Feat

Feature.

Behavioral change: Feishu gateway messages with msg_type=audio become usable agent input. The gateway extracts the Feishu file_key, downloads the opus/ogg payload, wraps it as an Attachment { type: "audio", mime: "audio/ogg" }, and sends it to core. Core decodes the attachment and, when [stt] is enabled, transcribes it and injects the transcript into the LLM prompt.

The PR also generalizes the gateway protocol by introducing an audio attachment type that future adapters such as LINE or Telegram could reuse.

Who It Serves

Primary beneficiaries: Feishu end users who want to interact with OpenAB agents using voice messages.

Secondary beneficiaries: gateway adapter maintainers, because the PR creates a reusable protocol path for audio attachments instead of making Feishu-specific STT logic a one-off.

Operational beneficiaries: deployers who already use [stt], because the feature claims to require no new configuration.

Rewritten Prompt

Implement Feishu voice-message support through the gateway attachment protocol.

When the Feishu adapter receives im.message.receive_v1 with msg_type == "audio", parse the message content, extract file_key, download the resource from Feishu using the existing message-resource API, and attach the downloaded opus/ogg bytes to the outgoing gateway event as base64 with type: "audio" and an accurate MIME type.

In core gateway handling, recognize audio attachments. If STT is enabled, decode the base64 payload, transcribe it through the existing stt::transcribe path, and inject a clear voice transcript marker into the prompt. If STT is disabled or decoding/transcription fails, degrade without crashing and log enough context for operators to diagnose the failure.

Cover both Feishu websocket and webhook paths. Add focused tests for audio message parsing, resource download behavior, gateway event attachment shape, STT-enabled transcript injection, and graceful behavior when STT is disabled or payload decoding fails. Update Feishu and STT docs to describe multi-platform voice attachment support.

Merge Pitch

This is worth advancing because voice input is a real user-facing capability gap for Feishu deployments, and the proposed architecture mostly reuses OpenAB’s existing STT configuration and transcription path.

Risk profile is moderate. The user-facing feature is straightforward, but the PR touches gateway protocol semantics, Feishu adapter behavior, core prompt construction, and STT error handling. The likely reviewer concern is whether the new generic audio attachment type is well-defined enough for future adapters, and whether core should silently skip failed audio transcription versus surfacing a clearer operator-visible warning.

Best-Practice Comparison

Relevant OpenClaw principles:

Explicit delivery routing is relevant. The gateway should pass audio as a typed attachment with enough metadata for core to handle it predictably.
Isolated executions are partially relevant. STT should remain inside the existing core transcription boundary rather than embedding provider-specific transcription inside the Feishu adapter.
Retry/backoff and run logs are relevant for the Feishu media download path. A failed download should be visible in logs and should not break the whole message pipeline.
Durable job persistence and gateway-owned scheduling are not directly relevant. This is event-driven message handling, not scheduled execution.

Relevant Hermes Agent principles:

Fresh session per scheduled run is not relevant because this PR handles live inbound messages, not scheduled jobs.
Self-contained prompts are relevant in a narrower sense: the injected transcript should be explicit and attributable, such as [Voice message transcript]: ..., so the model understands the source of the text.
Atomic writes and file locking are not relevant unless the implementation persists downloaded audio or intermediate state, which it should avoid if possible.
Gateway daemon tick model is not relevant to this direct event path.

Overall, the proposed direction fits the reference systems where they emphasize typed handoff, clear execution boundaries, and operator-observable failures. Scheduling and durable job-state principles do not apply.

Implementation Options

Conservative option: Feishu-only audio support using existing STT in core.

Keep the current PR narrow. Add msg_type=audio handling only to Feishu, forward an audio attachment, and let core transcribe it through existing [stt]. Avoid broader protocol redesign beyond documenting the new attachment type.

Balanced option: Formalize gateway audio attachments as a small cross-adapter contract.

Accept Feishu support, but also define the gateway protocol expectations for audio attachments: required fields, MIME handling, max size behavior, error logging, and what core does when STT is disabled. Add reusable helper functions so LINE and Telegram adapters can later plug in only their platform-specific download logic.
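One way to picture such a contract is a small validation step that any adapter's audio attachment must pass before core tries to transcribe it. Everything below is hypothetical and only illustrates the kind of checks the balanced option lists (type, MIME, size); none of these names or rules exist in the PR.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum AudioContractError {
    WrongType,
    UnsupportedMime,
    Empty,
    TooLarge { len: usize, max: usize },
}

/// Check an attachment against a minimal, hypothetical audio contract.
/// `data_len` is the length of the base64 payload in bytes.
pub fn check_audio_attachment(
    kind: &str,
    mime: &str,
    data_len: usize,
    max_len: usize,
) -> Result<(), AudioContractError> {
    if kind != "audio" {
        return Err(AudioContractError::WrongType);
    }
    if !mime.starts_with("audio/") {
        return Err(AudioContractError::UnsupportedMime);
    }
    if data_len == 0 {
        return Err(AudioContractError::Empty);
    }
    if data_len > max_len {
        return Err(AudioContractError::TooLarge { len: data_len, max: max_len });
    }
    Ok(())
}
```

Returning a typed error rather than silently skipping would give operators something concrete to log, while still letting the caller decide whether the user sees anything.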

Ambitious option: Introduce a media ingestion layer for gateway adapters.

Create a gateway-level media abstraction for files, images, audio, and future media types, with shared download limits, content-type detection, logging, retry policy, and typed conversion into core attachments. Feishu audio becomes the first consumer, but the system is designed for all rich-message platforms.

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
|---|---|---|---|---|---|---|
| Conservative Feishu-only support | High | Low-Medium | Medium | Medium | High for Feishu users | Good |
| Balanced audio attachment contract | Medium | Medium | High | High | High for Feishu, enables future adapters | Best |
| Full media ingestion layer | Low | High | Potentially High | High if completed well | Broader long-term impact | Premature unless more media work is queued |

Recommendation

Advance the PR using the balanced option.

The feature is valuable enough to move forward, but the merge discussion should focus on making audio attachments a clear gateway contract rather than only a Feishu implementation detail. That gives reviewers a concrete standard to validate: attachment shape, MIME expectations, STT-disabled behavior, failure logging, and test coverage across websocket and webhook paths.

Sequence it as one mergeable step: land Feishu voice support plus the minimal reusable audio attachment contract. Defer a broader media ingestion layer until at least one more adapter needs similar download-and-forward behavior.

1. session_ttl_secs doc comment: clarify conversion from FEISHU_SESSION_TTL_HOURS
2. Rename is_thread_participated → bypass_mention_gating in parse_message_event with doc comment explaining the parameter semantics

- Add msg_type=audio support to feishu adapter (parse, download, base64 encode)
- Add MediaRef::Audio variant and download_feishu_audio() function
- Add "audio" attachment type to core gateway handler (decode → stt::transcribe)
- Pass SttConfig to gateway handler via GatewayParams
- Update docs/feishu.md and docs/stt.md for multi-platform voice support

Feishu voice messages (opus/ogg) are downloaded by the gateway, passed as base64-encoded audio attachments to core, and transcribed via the existing [stt] infrastructure (Groq Whisper by default). This is the first gateway platform to support audio; LINE/Telegram can reuse the core-side handler.

Tested: 102 gateway tests + 197 core tests pass. E2E verified.

wangyuyan-agent and others added 5 commits May 5, 2026 23:13

refactor(gateway): address review nits on openabdev#744

15f16d8

fix(gateway): address second-round review nits on openabdev#744

cb6a56e

wangyuyan-agent requested a review from thepagent as a code owner May 6, 2026 15:58

github-actions bot added the pending-screening label (PR awaiting automated screening) May 6, 2026

wangyuyan-agent added 2 commits May 7, 2026 07:46

refactor(gateway): address review nits on openabdev#746

2ab7577

wangyuyan-agent force-pushed the feat/gateway-feishu-voice-stt branch from c0775f4 to 646d99a Compare May 6, 2026 23:47

wangyuyan-agent wants to merge 7 commits into openabdev:main from wangyuyan-agent:feat/gateway-feishu-voice-stt

3 participants
