feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412
declan-scale wants to merge 21 commits into
    elif isinstance(event, StreamTaskMessageDelta):
        if current_ctx is not None and event.delta is not None:
            # Reconstruct the delta with parent_task_message set from
            # the context's task_message (mirrors _langgraph_async.py
            # lines 72-78 and 117-127).
            delta_with_parent = StreamTaskMessageDelta(
                parent_task_message=current_ctx.task_message,
                delta=event.delta,
                type="delta",
                index=event.index,
            )
            await current_ctx.stream_update(delta_with_parent)
            if isinstance(event.delta, TextDelta) and event.delta.text_delta:
                final_text_parts.append(event.delta.text_delta)

    elif isinstance(event, StreamTaskMessageDone):
        await _close_current()
`auto_send` keeps only one `current_ctx`, so every delta is sent to the most recently opened text/reasoning context and every `Done` closes that context. Canonical streams can have multiple open parts keyed by `index`; when a tool-call or reasoning delta arrives while a text message is open, that delta is forwarded to the text context, or a `Done` for another index closes the text stream early. This can corrupt or truncate the auto-sent task messages. Please key the active contexts by `event.index`, or ignore deltas and done events whose index does not match the current text/reasoning context.
Artifacts
Repro: focused auto_send overlapping-index harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: harness output showing misrouted reasoning delta and truncated text
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
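The index-keyed routing requested in the comment above might look roughly like the sketch below. It is not the PR's code: `Start`, `Delta`, `Done`, `FakeContext`, and `route` are hypothetical stand-ins for the canonical stream events and the streaming context, reduced to the fields needed to show the idea. Only text and reasoning parts get a context, and events for any other index are ignored rather than forwarded.

```python
import asyncio
from dataclasses import dataclass, field


# Hypothetical stand-ins for the canonical stream events; the real
# agentex types carry more fields than shown here.
@dataclass
class Start:
    index: int
    kind: str  # e.g. "text", "reasoning", "tool_request"


@dataclass
class Delta:
    index: int
    text: str


@dataclass
class Done:
    index: int


@dataclass
class FakeContext:
    """Stand-in for a streaming context; records what it is sent."""

    kind: str
    received: list = field(default_factory=list)
    closed: bool = False

    async def stream_update(self, text: str) -> None:
        self.received.append(text)

    async def close(self) -> None:
        self.closed = True


async def route(events) -> dict:
    """Route each event to the context opened for the same index."""
    open_ctxs = {}  # index -> FakeContext
    finished = {}
    for event in events:
        if isinstance(event, Start):
            # Only text/reasoning parts are auto-sent in this sketch.
            if event.kind in ("text", "reasoning"):
                open_ctxs[event.index] = FakeContext(event.kind)
        elif isinstance(event, Delta):
            ctx = open_ctxs.get(event.index)
            if ctx is not None:  # deltas for untracked indices are ignored
                await ctx.stream_update(event.text)
        elif isinstance(event, Done):
            ctx = open_ctxs.pop(event.index, None)
            if ctx is not None:  # a Done for another index closes nothing
                await ctx.close()
                finished[event.index] = ctx
    # Close anything still open, similar in spirit to a finally-flush.
    for index, ctx in open_ctxs.items():
        await ctx.close()
        finished[index] = ctx
    return finished


# A tool-request part (index 1) interleaves with an open text part (index 0).
events = [
    Start(0, "text"), Delta(0, "Hel"),
    Start(1, "tool_request"), Delta(1, '{"q":'), Done(1),
    Delta(0, "lo"), Done(0),
]
result = asyncio.run(route(events))
```

With this routing, the text context for index 0 receives only its own deltas and stays open until its own `Done` arrives.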
Path: src/agentex/lib/core/harness/auto_send.py
Line: 75-91
Comment: **Route by stream index**
Approach A (Agentex event stream as canonical source of truth): one tap per harness feeds shared yield/auto-send delivery adapters and a span-deriving tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance + live-matrix testing (3 test agents per harness: sync/async/temporal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration
- Make tracing-tap span derivation explicit (tool open on `Done` of a `ToolRequestContent` index, close on matching `ToolResponseContent` by `tool_call_id`; parallel-safe; reasoning start->done). Flag missing `is_error` on `ToolResponseContent` as an additive upstream decision.
- Add first-class `TurnUsage`/`TurnResult` shape (aligned to llm_metrics token taxonomy) attached to the turn span via `span(data=)` and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling stays in the agent; only parsing/streaming/tracing/usage move to SDK taps + emitter; sandbox-setup events chain before the harness stream.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3) Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter, yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI job. Migration/parser PRs (4-9) listed as follow-on plans. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d21c54a to ebc468d
    def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
        content = event.content
        if isinstance(content, ToolResponseContent):
            tcid = content.tool_call_id
            if tcid in self._open_tool_ids:
                self._open_tool_ids.pop(tcid, None)
                return [CloseSpan(key=tcid, output=content.content, is_complete=True)]
        return []
`SpanDeriver` only opens tool spans from `Start(ToolRequestContent)` followed by `Done`, but existing canonical streams can emit tool calls as `StreamTaskMessageFull(ToolRequestContent)`. For example, the LangGraph sync stream emits a full tool request and then a full tool response. In that path, this branch ignores the request, so the response `tool_call_id` is never open and no tool span is produced. Please treat `Full(ToolRequestContent)` as an immediate tool-span open using its `tool_call_id`, `name`, and `arguments`.
Artifacts
Repro: focused SpanDeriver full tool request and response harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Stack trace captured during the T-Rex run
- Keeps the raw stack trace available without making the summary code-heavy.
Ran code and verified through T-Rex
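One way to read the request above is sketched below. It is not the PR's `SpanDeriver`: `FullAwareDeriver` and the small dataclasses are hypothetical stand-ins. The `tool_call_id`, `name`, and `arguments` fields follow the comment, while the `OpenSpan` fields (`key`, `name`, `input`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins for the content and signal types.
@dataclass
class ToolRequest:
    tool_call_id: str
    name: str
    arguments: dict


@dataclass
class ToolResponse:
    tool_call_id: str
    content: Any


@dataclass
class Full:
    content: Any


@dataclass
class OpenSpan:
    key: str
    name: str   # assumed field
    input: Any  # assumed field


@dataclass
class CloseSpan:
    key: str
    output: Any
    is_complete: bool = True


class FullAwareDeriver:
    """Reducer fragment that handles only Full events."""

    def __init__(self) -> None:
        self._open_tool_ids = {}  # tool_call_id -> tool name

    def on_full(self, event: Full) -> list:
        content = event.content
        if isinstance(content, ToolRequest):
            # A Full request has no Start/Done pair, so open immediately.
            self._open_tool_ids[content.tool_call_id] = content.name
            return [OpenSpan(content.tool_call_id, content.name, content.arguments)]
        if isinstance(content, ToolResponse):
            if content.tool_call_id in self._open_tool_ids:
                self._open_tool_ids.pop(content.tool_call_id)
                return [CloseSpan(content.tool_call_id, content.content)]
        return []


# Full request followed by Full response, as in the LangGraph sync example.
deriver = FullAwareDeriver()
signals = []
signals += deriver.on_full(Full(ToolRequest("call-1", "search", {"q": "x"})))
signals += deriver.on_full(Full(ToolResponse("call-1", "3 hits")))
```

In this sketch the request opens the span and the matching response closes it, so the full/full sequence yields one complete tool span instead of none.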
Path: src/agentex/lib/core/harness/span_derivation.py
Line: 107-114
Comment: **Derive full tool spans**

    async with streaming.streaming_task_message_context(
        task_id=task_id,
        initial_content=event.content,
    ):
        pass
This path handles a canonical `StreamTaskMessageFull` by opening and immediately closing a streaming context. The real streaming context emits `StreamTaskMessageStart` when it opens and `StreamTaskMessageDone` when it closes; it only publishes a full event when `stream_update()` receives a `StreamTaskMessageFull`. When a tool response arrives as `Full(ToolResponseContent)`, auto-send turns it into a start/done pair, so consumers that rely on the canonical full tool result event can miss the result shape.
Artifacts
Repro: focused auto_send full tool response harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: failing execution output with captured outbound events
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
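The behaviour described in this comment can be modelled with a toy context, shown below. Everything here is a hypothetical stand-in: `fake_streaming_context` only mimics the start-on-open, done-on-close, full-on-`stream_update` behaviour as the reviewer describes it, and `forward_full` is one possible direction rather than a confirmed fix.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class FullEvent:
    """Hypothetical stand-in for a canonical full message event."""

    content: str


class RecordingContext:
    """Stand-in context that publishes "full" only via stream_update()."""

    def __init__(self, outbound: list) -> None:
        self._outbound = outbound

    async def stream_update(self, event) -> None:
        if isinstance(event, FullEvent):
            self._outbound.append("full")


@asynccontextmanager
async def fake_streaming_context(outbound: list, initial_content: str):
    # Models the description: start on open, done on close.
    outbound.append("start")
    try:
        yield RecordingContext(outbound)
    finally:
        outbound.append("done")


async def open_and_close(event: FullEvent) -> list:
    """Mirrors the reviewed path: open the context, then close it at once."""
    outbound = []
    async with fake_streaming_context(outbound, event.content):
        pass
    return outbound


async def forward_full(event: FullEvent) -> list:
    """Possible alternative: pass the full event through stream_update()."""
    outbound = []
    async with fake_streaming_context(outbound, event.content) as ctx:
        await ctx.stream_update(event)
    return outbound


event = FullEvent("tool result")
as_written = asyncio.run(open_and_close(event))
with_update = asyncio.run(forward_full(event))
```

In this model the reviewed path publishes only a start/done pair, while forwarding the event also publishes the full message. Whether the surrounding start/done pair is acceptable to consumers is a separate question that the sketch does not settle.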
Path: src/agentex/lib/core/harness/auto_send.py
Line: 101-105
Comment: **Preserve full messages**
What this is
Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex `StreamTaskMessage*` stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels: `adk.streaming` (async + temporal agents, from inside an activity), with tracing on by default (derived from the same stream) and overridable, and a unified `TurnUsage`/`TurnResult` shape for per-harness usage normalization.

Design: `docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md`
Plan: `docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md`

What's in
- `src/agentex/lib/core/harness/types.py`: `StreamTaskMessage`, `OpenSpan`/`CloseSpan`/`SpanSignal`, `TurnUsage`, `TurnResult`, `HarnessTurn` protocol.
- `span_derivation.py`: `SpanDeriver`, a pure reducer (no `adk` dep) from the canonical stream to span signals. A tool span opens on the `Done` of a `ToolRequestContent` index and closes on the matching `ToolResponseContent` by `tool_call_id`; a reasoning span opens on Start and closes on Done; parallel-safe; `flush()` closes unclosed spans.
- `tracer.py`: `SpanTracer`, a best-effort adapter from span signals to `adk.tracing` (never raises; overridable; guarded `make_logger`).
- `yield_delivery.py` / `auto_send.py`: the two delivery adapters (both feed the same `SpanDeriver`/`SpanTracer`; `finally`-flush on early close/error).
- `emitter.py`: `UnifiedEmitter`, which ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
- `conformance/`: shared conformance scaffold each future harness tap registers fixtures with.
- `.github/workflows/harness-integration.yml`: conformance CI job (via `./scripts/test`) + an `if: false` `live-matrix` placeholder enabled by the migration PRs.

Scope / what's NOT here
Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) and are not in this branch.

Quality gates
- … (`./scripts/test tests/lib/core/harness/`).
- … `# type: ignore` in the package.

Follow-ups (filed)
- … `Full` tool-message wire shape (blocks migration backward-compat claims).
- … `adk` facade before the first consumer migration.
- … `paths` to `agentex.types`; `SpanTracer` duplicate-open guard.
- … `is_error` on `ToolResponseContent` (tool-span error status).

🤖 Generated with Claude Code
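The "best-effort, never raises" behaviour that the description attributes to `SpanTracer` could be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `BestEffortTracer`, `FlakyBackend`, and its `start_span`/`end_span` methods are hypothetical names and do not reflect the real `adk.tracing` API.

```python
import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("harness.tracer.sketch")


# Hypothetical, simplified span signals.
@dataclass
class OpenSpan:
    key: str
    name: str


@dataclass
class CloseSpan:
    key: str
    output: Any


class FlakyBackend:
    """Hypothetical tracing backend that fails for one span name."""

    def __init__(self) -> None:
        self.started = []
        self.ended = []

    def start_span(self, key: str, name: str) -> None:
        if name == "boom":
            raise RuntimeError("tracing backend unavailable")
        self.started.append(key)

    def end_span(self, key: str, output: Any) -> None:
        self.ended.append(key)


class BestEffortTracer:
    """Applies span signals to a backend without ever raising."""

    def __init__(self, backend) -> None:
        self._backend = backend
        self._open = set()

    def handle(self, signal) -> bool:
        """Return True if the signal was applied, False if dropped."""
        try:
            if isinstance(signal, OpenSpan):
                self._backend.start_span(signal.key, signal.name)
                self._open.add(signal.key)
                return True
            if isinstance(signal, CloseSpan):
                if signal.key not in self._open:
                    return False  # nothing to close; drop quietly
                self._open.discard(signal.key)
                self._backend.end_span(signal.key, signal.output)
                return True
            return False
        except Exception:
            # Tracing must not break delivery: log and carry on.
            logger.warning("dropping span signal %r", signal, exc_info=True)
            return False


backend = FlakyBackend()
tracer = BestEffortTracer(backend)
results = [
    tracer.handle(OpenSpan("a", "tool:search")),
    tracer.handle(OpenSpan("b", "boom")),  # backend raises; swallowed
    tracer.handle(CloseSpan("a", "ok")),
    tracer.handle(CloseSpan("b", None)),   # never opened; dropped
]
```

The design point is that a tracing failure costs only the affected span, while the message stream being delivered is unaffected.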
Greptile Summary
… `UnifiedEmitter` facade.
Confidence Score: 4/5
The harness foundation is well scoped, but full-message handling in span derivation and auto-send delivery needs attention before relying on existing canonical streams.
The implementation has focused tests and clean separation across tracing and delivery adapters, with two concrete stream-shape gaps affecting tool-call behavior.
src/agentex/lib/core/harness/span_derivation.py and src/agentex/lib/core/harness/auto_send.py
Reviews (2): Last reviewed commit: "style: ruff import-sort + format fixes a..."