feat(agents): add modality-aware Instructions with audio/text variants #1484
Port livekit/agents#4987 to JS. Adds an Instructions class that holds separate audio/text system prompt variants and is resolved per-turn based on the input modality. The voice pipeline now calls applyInstructionsModality() before each LLM turn using the modality from SpeechHandle.inputDetails, and AgentSession.generateReply() takes a new inputModality option (defaults to 'text', matching Python). Provider format adapters (openai, google, mistralai) and remote_session render Instructions as their resolved string value.
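As a rough illustration of the per-turn resolution described above, the sketch below models a two-variant prompt holder and resolves it to a plain string for a given modality. `InstructionsSketch` and `resolveForTurn` are hypothetical names and the constructor shape is an assumption; the PR's actual `Instructions` class and `applyInstructionsModality()` may be structured differently.

```typescript
type Modality = 'audio' | 'text';

// Hypothetical stand-in for the PR's Instructions class (constructor shape assumed).
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;
  private readonly active: Modality;

  constructor(audio: string, text: string, active: Modality = 'audio') {
    this.audio = audio;
    this.text = text;
    this.active = active; // audio is the default active variant
  }

  // Renders whichever variant is currently active.
  get value(): string {
    return this.active === 'audio' ? this.audio : this.text;
  }

  // Returns a copy with the requested variant active; both variants are preserved.
  asModality(modality: Modality): InstructionsSketch {
    return new InstructionsSketch(this.audio, this.text, modality);
  }
}

// Hypothetical per-turn step: plain strings pass through, variants are resolved.
function resolveForTurn(
  instructions: string | InstructionsSketch,
  modality: Modality,
): string {
  return typeof instructions === 'string'
    ? instructions
    : instructions.asModality(modality).value;
}

const prompt = new InstructionsSketch(
  'Normalize spoken dates such as "next Tuesday".',
  'Treat the user input literally.',
);
console.log(resolveForTurn(prompt, 'text'));
```

Because `asModality()` returns a copy, the original object keeps its default variant and can be resolved again on a later turn with a different modality.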
🦋 Changeset detected. Latest commit: 4f635e7. The changes in this PR will be included in the next version bump. This PR includes changesets to release 31 packages.
chatItemToProto was passing the raw Instructions object through to the proto's text field, producing corrupt telemetry content. Extract .value when the content item is an Instructions instance.
The ProtoMessage.text field is typed as ChatContent, so the original code intentionally passed objects through. Only Instructions need unwrapping to their rendered value; image/audio content should pass through unchanged so the OTLP exporter can serialize the full structure.
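A minimal sketch of the unwrapping rule from the two commit notes above: only `Instructions` items are replaced by their rendered string, and every other content item is passed through by reference. All names here (`InstructionsSketch`, `ImageContentSketch`, `toProtoContent`) are hypothetical stand-ins, not the actual types or the `chatItemToProto` implementation.

```typescript
// Hypothetical minimal stand-in; the real class lives in chat_context.ts.
class InstructionsSketch {
  readonly audio: string;
  readonly text: string;

  constructor(audio: string, text: string) {
    this.audio = audio;
    this.text = text;
  }

  // Assumption for this sketch: the audio variant is the active one.
  get value(): string {
    return this.audio;
  }
}

// Hypothetical structured content that must reach the exporter unchanged.
interface ImageContentSketch {
  type: 'image_content';
  image: string;
}

type ContentItem = string | InstructionsSketch | ImageContentSketch;
type ProtoContent = string | ImageContentSketch;

// Unwrap Instructions to their rendered value; pass everything else through.
function toProtoContent(items: ContentItem[]): ProtoContent[] {
  return items.map((item) =>
    item instanceof InstructionsSketch ? item.value : item,
  );
}

const image: ImageContentSketch = { type: 'image_content', image: 'data:...' };
const converted = toProtoContent([
  'hello',
  new InstructionsSketch('Speak briefly.', 'Use markdown.'),
  image,
]);
console.log(converted);
```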
`===` on `Instructions` objects compares by reference, so two distinct instances with identical audio/text would falsely mark the realtime session as non-reusable on handoff. Add an `instructionsEqual` helper that compares strings by value and Instructions by audio + text, and use it in `_detachReusableResources`.
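The reference-versus-value problem can be reproduced in a few lines. The helper below is a sketch of the comparison described in this commit note, not the code from the PR: `InstructionsLike` and `instructionsEqualSketch` are hypothetical names, and treating a string and an `Instructions` object as unequal is an assumption.

```typescript
// Hypothetical shape covering only the two fields that are compared.
interface InstructionsLike {
  audio: string;
  text: string;
}

type MaybeInstructions = string | InstructionsLike | undefined;

function instructionsEqualSketch(a: MaybeInstructions, b: MaybeInstructions): boolean {
  // Two missing values are equal; one missing value is not.
  if (a === undefined || b === undefined) return a === b;
  // Strings compare by value; a string never equals an object (assumption).
  if (typeof a === 'string' || typeof b === 'string') return a === b;
  // Two Instructions-like objects compare by both variants.
  return a.audio === b.audio && a.text === b.text;
}

const first: InstructionsLike = { audio: 'Speak briefly.', text: 'Use markdown.' };
const second: InstructionsLike = { ...first };

console.log(first === second); // false: distinct references
console.log(instructionsEqualSketch(first, second)); // true: same audio and text
```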
  params: {
    id?: string;
-   instructions?: string;
+   instructions?: string | Instructions;
I think we made a mistake in Python by using `| Instructions`. I'll remove that on the Python side.
Let's only use `string` for everything inside the ChatContext.
SG, I'll make a follow-up PR once the Python side has made the change.
Description

Introduces a new `Instructions` class for system prompts that adapt to the user's input modality (audio vs. text). This enables agents to provide different guidance to the LLM depending on whether the user is speaking or typing: for example, instructing the LLM to normalize spoken expressions like "next Tuesday" when processing voice input, while treating text input literally.

The pipeline now applies the matching variant before each LLM turn based on `SpeechHandle.inputDetails.modality`, and `AgentSession.generateReply()` exposes an `inputModality` option to control which variant is used.

Changes Made
New `Instructions` class (`chat_context.ts`):
- `audio` and `text` variants of system instructions
- `value` property renders the currently active variant (defaults to audio)
- `asModality(modality)` returns a copy with the specified variant active, preserving both variants for future switches
- `concat()` method propagates both variants when combining instructions
- `toJSON()` serializes both variants for persistence

New `concatInstructions()` helper (`chat_context.ts`):
- … `Instructions` objects
- … `Instructions` are involved, otherwise returns `Instructions`

New `applyInstructionsModality()` function (`generation.ts`):

Updated `SpeechHandle` (`speech_handle.ts`):
- `InputDetails` interface with `modality: 'audio' | 'text'`
- `SpeechHandle.create()` now accepts optional `inputDetails` parameter
- `inputDetails` exposes the modality for the current turn

Updated `AgentSession.generateReply()` (`agent_session.ts`):
- `inputModality` parameter (defaults to `'text'`)
- … `AgentActivity` for modality-aware instruction selection

Updated `Agent` and `AgentOptions` (`agent.ts`):
- `instructions` field now accepts `string | Instructions`

Updated `AgentConfigUpdate` (`chat_context.ts`):
- `instructions` field now accepts `string | Instructions`
- … `Instructions` via `toJSON()` when present

Provider format adapters (openai, google, mistralai):
- … `Instructions` content by extracting the `value` property

Utility updates (`utils.ts`):
- `validateChatContextStructure()` recognizes `instructions` type
- `formatMessageContentPart()` extracts `value` from `Instructions`

Example agent (`instructions_per_modality.ts`):

Comprehensive test suite (`chat_context.test.ts`):
- … `applyInstructionsModality()` and `ChatContext.copy()`

Pre-Review Checklist

Testing