feat(agents): add modality-aware Instructions with audio/text variants#1484

Merged
toubatbrian merged 11 commits into main from claude/port-instruction-api-js-pNpnn
May 15, 2026

@toubatbrian (Contributor) commented May 13, 2026

Description

Introduces a new Instructions class for system prompts that adapt to the user's input modality (audio vs. text). This enables agents to provide different guidance to the LLM depending on whether the user is speaking or typing—for example, instructing the LLM to normalize spoken expressions like "next Tuesday" when processing voice input, while treating text input literally.

The pipeline now applies the matching variant before each LLM turn based on SpeechHandle.inputDetails.modality, and AgentSession.generateReply() exposes an inputModality option to control which variant is used.
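
To make the idea concrete, here is a minimal sketch of a class that holds both variants and switches between them. `InstructionsSketch`, its constructor arguments, and the `toJSON()` shape are assumptions for illustration only; the merged `Instructions` class in chat_context.ts may differ.

```typescript
type Modality = 'audio' | 'text';

// Hypothetical stand-in for the PR's Instructions class (not the merged API).
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;
  readonly active: Modality;

  // The description says the active variant defaults to audio.
  constructor(audio: string, text: string, active: Modality = 'audio') {
    this.audio = audio;
    this.text = text;
    this.active = active;
  }

  // Renders whichever variant is currently active.
  get value(): string {
    return this.active === 'audio' ? this.audio : this.text;
  }

  // Returns a copy with the requested variant active; both variants are kept
  // so a later turn can switch back.
  asModality(modality: Modality): InstructionsSketch {
    return new InstructionsSketch(this.audio, this.text, modality);
  }

  // Assumed serialization shape: both variants, no active marker.
  toJSON(): { audio: string; text: string } {
    return { audio: this.audio, text: this.text };
  }
}

const scheduling = new InstructionsSketch(
  'Normalize spoken dates such as "next Tuesday" before booking.',
  'Treat typed dates literally.',
);
const forTyping = scheduling.asModality('text');
```

Because `asModality()` returns a copy rather than mutating, the original object can be shared across turns without one turn's modality leaking into the next.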

Changes Made

  • New Instructions class (chat_context.ts):

    • Holds both audio and text variants of system instructions
    • value property renders the currently active variant (defaults to audio)
    • asModality(modality) returns a copy with the specified variant active, preserving both variants for future switches
    • concat() method propagates both variants when combining instructions
    • toJSON() serializes both variants for persistence
  • New concatInstructions() helper (chat_context.ts):

    • Concatenates any mix of strings and Instructions objects
    • Propagates both audio/text variants from all operands
    • Returns a plain string if no Instructions are involved, otherwise returns Instructions
  • New applyInstructionsModality() function (generation.ts):

    • Locates the instructions message in the chat context
    • Applies the correct variant based on input modality before LLM inference
    • No-op when no modality-aware instructions are present
  • Updated SpeechHandle (speech_handle.ts):

    • Added InputDetails interface with modality: 'audio' | 'text'
    • SpeechHandle.create() now accepts optional inputDetails parameter
    • Getter inputDetails exposes the modality for the current turn
  • Updated AgentSession.generateReply() (agent_session.ts):

    • New optional inputModality parameter (defaults to 'text')
    • Passed through to AgentActivity for modality-aware instruction selection
  • Updated Agent and AgentOptions (agent.ts):

    • instructions field now accepts string | Instructions
  • Updated AgentConfigUpdate (chat_context.ts):

    • instructions field now accepts string | Instructions
    • Serializes Instructions via toJSON() when present
  • Provider format adapters (openai, google, mistralai):

    • Updated to handle Instructions content by extracting the value property
  • Utility updates (utils.ts):

    • validateChatContextStructure() recognizes instructions type
    • formatMessageContentPart() extracts value from Instructions
  • Example agent (instructions_per_modality.ts):

    • Demonstrates a scheduling assistant with different instructions for voice vs. text users
    • Voice users get guidance on parsing spoken expressions and self-corrections
    • Text users get guidance on accepting literal input and skipping unnecessary confirmations
  • Comprehensive test suite (chat_context.test.ts):

    • Tests serialization, concatenation, modality switching, and round-tripping
    • Tests interaction with applyInstructionsModality() and ChatContext.copy()
    • Verifies both variants are preserved across turns
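
The concatenation rule described for concatInstructions() can be sketched as follows. This is an illustration under stated assumptions, not the code in chat_context.ts: `InstructionsSketch` and `concatSketch` are hypothetical names, and joining with no separator is an assumption.

```typescript
// Hypothetical, reduced stand-in for the PR's Instructions class.
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;

  constructor(audio: string, text: string) {
    this.audio = audio;
    this.text = text;
  }
}

// Concatenates any mix of strings and InstructionsSketch objects.
// Plain strings contribute to both variants; the result stays a plain
// string when no InstructionsSketch operand is involved.
function concatSketch(
  ...parts: Array<string | InstructionsSketch>
): string | InstructionsSketch {
  const hasInstructions = parts.some((part) => part instanceof InstructionsSketch);
  if (!hasInstructions) {
    return parts.join(''); // assumption: no separator between operands
  }

  let audio = '';
  let text = '';
  for (const part of parts) {
    if (typeof part === 'string') {
      audio += part;
      text += part;
    } else {
      audio += part.audio;
      text += part.text;
    }
  }
  return new InstructionsSketch(audio, text);
}

const plain = concatSketch('a', 'b');
const combined = concatSketch(
  'You are a scheduler. ',
  new InstructionsSketch('Confirm dates aloud.', 'Skip confirmations.'),
);
```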

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Code is hand-written and follows project conventions
  • Changes explained: All changes are documented above and in code comments
  • Scope appropriate: All changes relate to modality-aware instructions
  • Video demo: Not applicable (framework feature, not user-facing UI)

Testing

  • Added comprehensive unit tests covering:
    • Serialization and JSON round-tripping
    • Concatenation

Port livekit/agents#4987 to JS. Adds an Instructions class that holds
separate audio/text system prompt variants and is resolved per-turn based
on the input modality. The voice pipeline now calls
applyInstructionsModality() before each LLM turn using the modality from
SpeechHandle.inputDetails, and AgentSession.generateReply() takes a new
inputModality option (defaults to 'text', matching Python). Provider
format adapters (openai, google, mistralai) and remote_session render
Instructions as their resolved string value.
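
A rough sketch of the per-turn resolution step described above, assuming a simplified message shape. `MessageSketch`, `InstructionsSketch`, and `applyModalitySketch` are hypothetical names; the real applyInstructionsModality() in generation.ts operates on the library's ChatContext and may locate the message differently.

```typescript
type Modality = 'audio' | 'text';

// Hypothetical stand-in for the PR's Instructions class.
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;
  readonly active: Modality;

  constructor(audio: string, text: string, active: Modality = 'audio') {
    this.audio = audio;
    this.text = text;
    this.active = active;
  }

  get value(): string {
    return this.active === 'audio' ? this.audio : this.text;
  }

  asModality(modality: Modality): InstructionsSketch {
    return new InstructionsSketch(this.audio, this.text, modality);
  }
}

// Simplified chat message; the real chat items carry more fields.
interface MessageSketch {
  role: 'system' | 'user' | 'assistant';
  content: string | InstructionsSketch;
}

// Finds the first system message holding modality-aware instructions and
// activates the variant for this turn. Returns false (a no-op) when the
// context only contains plain-string instructions.
function applyModalitySketch(messages: MessageSketch[], modality: Modality): boolean {
  for (const message of messages) {
    if (message.role === 'system' && message.content instanceof InstructionsSketch) {
      message.content = message.content.asModality(modality);
      return true;
    }
  }
  return false;
}

const systemMessage: MessageSketch = {
  role: 'system',
  content: new InstructionsSketch('Parse spoken dates.', 'Accept typed dates as written.'),
};
const context: MessageSketch[] = [systemMessage, { role: 'user', content: 'Book Tuesday at 3pm' }];
const applied = applyModalitySketch(context, 'text');

const plainContext: MessageSketch[] = [{ role: 'system', content: 'Be brief.' }];
const appliedToPlain = applyModalitySketch(plainContext, 'audio');
```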

changeset-bot Bot commented May 13, 2026

🦋 Changeset detected

Latest commit: 4f635e7

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 31 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch


CLAassistant commented May 13, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ toubatbrian
❌ claude

claude and others added 4 commits May 13, 2026 02:03
chatItemToProto was passing the raw Instructions object through to the
proto's text field, producing corrupt telemetry content. Extract .value
when the content item is an Instructions instance.
@toubatbrian changed the title from "feat(agents): add modality-aware Instructions with audio/text variants" to "brianyin/agt-2873-new-instructions-api" on May 13, 2026
@toubatbrian changed the title from "brianyin/agt-2873-new-instructions-api" to "feat(agents): add modality-aware Instructions with audio/text variants" on May 13, 2026

The ProtoMessage.text field is typed as ChatContent, so the original code
intentionally passed objects through. Only Instructions need
unwrapping to their rendered value; image/audio content should pass
through unchanged so the OTLP exporter can serialize the full structure.
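
The unwrapping rule from the two commits above, sketched with hypothetical types. `InstructionsSketch`, `ImagePartSketch`, and `toTelemetryContentSketch` are illustrative names only; the real chatItemToProto and ChatContent types are not reproduced here.

```typescript
// Hypothetical stand-in for the PR's Instructions class.
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;

  constructor(audio: string, text: string) {
    this.audio = audio;
    this.text = text;
  }

  // For brevity this sketch always renders the audio variant.
  get value(): string {
    return this.audio;
  }
}

// Assumed shape of a structured (non-string) content part.
interface ImagePartSketch {
  type: 'image';
  url: string;
}

type ContentSketch = string | InstructionsSketch | ImagePartSketch;

// Only InstructionsSketch is unwrapped to its rendered string; strings and
// structured parts pass through by reference so an exporter can serialize
// them in full.
function toTelemetryContentSketch(item: ContentSketch): string | ImagePartSketch {
  return item instanceof InstructionsSketch ? item.value : item;
}

const image: ImagePartSketch = { type: 'image', url: 'https://example.com/a.png' };
const unwrapped = toTelemetryContentSketch(new InstructionsSketch('Speak slowly.', 'Be terse.'));
const passedThrough = toTelemetryContentSketch(image);
```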

`===` on `Instructions` objects compares by reference, so two distinct
instances with identical audio/text would falsely mark the realtime
session as non-reusable on handoff. Add an `instructionsEqual` helper
that compares strings by value and Instructions by audio + text, and
use it in `_detachReusableResources`.
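
A minimal sketch of the comparison this commit describes. `InstructionsSketch` and `instructionsEqualSketch` are hypothetical names, and the handling of `undefined` and of a string compared against an Instructions object is an assumption, since the commit message does not specify it.

```typescript
// Hypothetical stand-in for the PR's Instructions class.
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;

  constructor(audio: string, text: string) {
    this.audio = audio;
    this.text = text;
  }
}

type InstructionsLike = string | InstructionsSketch | undefined;

// Compares two InstructionsSketch objects by their audio and text variants,
// and everything else with strict equality. Assumption: a string never
// equals an InstructionsSketch, and undefined only equals undefined.
function instructionsEqualSketch(a: InstructionsLike, b: InstructionsLike): boolean {
  if (a instanceof InstructionsSketch && b instanceof InstructionsSketch) {
    return a.audio === b.audio && a.text === b.text;
  }
  return a === b;
}

const first = new InstructionsSketch('Speak slowly.', 'Be terse.');
const second = new InstructionsSketch('Speak slowly.', 'Be terse.');
const sameByReference = first === second; // false: distinct instances
const sameByValue = instructionsEqualSketch(first, second); // true: same variants
```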
  params: {
    id?: string;
-   instructions?: string;
+   instructions?: string | Instructions;
Member


I think we made a mistake in Python to use | Instructions. I'll remove that on Python.

Let's only use string for everything inside the ChatContext.

Contributor Author


SG, I'll make a follow up PR once python side made the change

Member

@theomonnom theomonnom left a comment


lgtm

@toubatbrian toubatbrian merged commit b7fdbe1 into main May 15, 2026
8 of 9 checks passed
@toubatbrian toubatbrian deleted the claude/port-instruction-api-js-pNpnn branch May 15, 2026 00:20

4 participants