feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412

Open: declan-scale wants to merge 21 commits into next from declan-scale/unified-harness-surface

declan-scale commented Jun 18, 2026

Contributor

What this is

Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex StreamTaskMessage* stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:

  • yield — pass the canonical stream through to the caller (sync HTTP ACP agents),
  • auto-send — push to the task stream via adk.streaming (async + temporal agents, from inside an activity),

with tracing on by default (derived from the same stream) and overridable, and a unified TurnUsage/TurnResult shape for per-harness usage normalization.
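
The two channels described above can be pictured as thin wrappers around one canonical event iterator. The following is a minimal sketch under stated assumptions, not the PR's implementation: `RecordingDeriver`, `canonical_stream`, `publish`, and the string events are hypothetical stand-ins, and the real adapters operate on typed `StreamTaskMessage*` objects and `adk.streaming`.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


class RecordingDeriver:
    """Hypothetical stand-in for the shared span machinery: it only records."""

    def __init__(self) -> None:
        self.seen: List[str] = []
        self.flushed = False

    def observe(self, event: str) -> None:
        self.seen.append(event)

    def flush(self) -> None:
        self.flushed = True


async def canonical_stream() -> AsyncIterator[str]:
    # Stand-in for the canonical stream; real events are typed objects.
    for event in ("start", "delta", "done"):
        yield event


async def yield_delivery(
    stream: AsyncIterator[str], deriver: RecordingDeriver
) -> AsyncIterator[str]:
    """Pass events through to the caller while feeding the deriver."""
    try:
        async for event in stream:
            deriver.observe(event)
            yield event
    finally:
        # Runs on exhaustion, on error, and when the generator is closed early.
        deriver.flush()


async def auto_send(
    stream: AsyncIterator[str],
    deriver: RecordingDeriver,
    publish: Callable[[str], Awaitable[None]],
) -> None:
    """Push events to a task stream instead of yielding them."""
    try:
        async for event in stream:
            deriver.observe(event)
            await publish(event)
    finally:
        deriver.flush()


async def demo():
    yielded_deriver = RecordingDeriver()
    yielded = [e async for e in yield_delivery(canonical_stream(), yielded_deriver)]

    sent: List[str] = []
    sent_deriver = RecordingDeriver()

    async def publish(event: str) -> None:
        sent.append(event)

    await auto_send(canonical_stream(), sent_deriver, publish)
    return yielded, sent, yielded_deriver, sent_deriver


yielded, sent, yielded_deriver, sent_deriver = asyncio.run(demo())
```

Because both wrappers feed the same kind of deriver from the same events, the derived spans should not depend on which channel delivered the stream.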

Design: docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
Plan: docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md

What's in src/agentex/lib/core/harness/

  • types.py: StreamTaskMessage, OpenSpan/CloseSpan/SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol.
  • span_derivation.py: SpanDeriver, a pure reducer (no adk dep) from the canonical stream to span signals. Tool span opens on the Done of a ToolRequestContent index, closes on the matching ToolResponseContent by tool_call_id; reasoning span open-on-Start / close-on-Done; parallel-safe; flush() closes unclosed spans.
  • tracer.py: SpanTracer, a best-effort adapter from span signals to adk.tracing (never raises; overridable; guarded make_logger).
  • yield_delivery.py / auto_send.py: the two delivery adapters (both feed the same SpanDeriver/SpanTracer; finally-flush on early close/error).
  • emitter.py: UnifiedEmitter, which ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
  • conformance/: shared conformance scaffold each future harness tap registers fixtures with.
  • .github/workflows/harness-integration.yml: conformance CI job (via ./scripts/test) + an if: false live-matrix placeholder enabled by the migration PRs.
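
The span rules listed for span_derivation.py can be sketched as a small reducer. This is an illustration under assumptions, not the PR's `SpanDeriver`: the event classes `Start`, `Done`, and `ToolResponse`, the `kind` field, the `name` field on `OpenSpan`, and the choice of `is_complete=False` on flush are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Start:  # hypothetical stand-in for a Start event
    index: int
    kind: str  # "tool_request", "reasoning", or "text"
    tool_call_id: Optional[str] = None
    name: Optional[str] = None


@dataclass
class Done:  # hypothetical stand-in for a Done event
    index: int


@dataclass
class ToolResponse:  # hypothetical stand-in for a full tool response
    tool_call_id: str
    output: str


@dataclass
class OpenSpan:
    key: str
    name: str


@dataclass
class CloseSpan:
    key: str
    output: Optional[str] = None
    is_complete: bool = True


Signal = Union[OpenSpan, CloseSpan]


class MiniSpanDeriver:
    """Pure reducer: no I/O, only events in and span signals out."""

    def __init__(self) -> None:
        self._pending_tools: Dict[int, Start] = {}  # index -> request still streaming
        self._open_tools: Dict[str, str] = {}  # tool_call_id -> tool name
        self._open_reasoning: Dict[int, str] = {}  # index -> span key

    def observe(self, event: Union[Start, Done, ToolResponse]) -> List[Signal]:
        if isinstance(event, Start):
            if event.kind == "tool_request":
                self._pending_tools[event.index] = event  # span opens on Done
            elif event.kind == "reasoning":
                key = f"reasoning-{event.index}"
                self._open_reasoning[event.index] = key
                return [OpenSpan(key=key, name="reasoning")]
            return []
        if isinstance(event, Done):
            request = self._pending_tools.pop(event.index, None)
            if request is not None and request.tool_call_id is not None:
                name = request.name or "tool"
                self._open_tools[request.tool_call_id] = name
                return [OpenSpan(key=request.tool_call_id, name=name)]
            key = self._open_reasoning.pop(event.index, None)
            return [CloseSpan(key=key)] if key is not None else []
        if self._open_tools.pop(event.tool_call_id, None) is not None:
            return [CloseSpan(key=event.tool_call_id, output=event.output)]
        return []

    def flush(self) -> List[Signal]:
        # Close whatever is still open; marking it incomplete is an assumption.
        keys = list(self._open_tools) + list(self._open_reasoning.values())
        self._pending_tools.clear()
        self._open_tools.clear()
        self._open_reasoning.clear()
        return [CloseSpan(key=k, is_complete=False) for k in keys]


deriver = MiniSpanDeriver()
events = [
    Start(0, "reasoning"),
    Done(0),
    Start(1, "tool_request", "call-1", "search"),
    Done(1),
    ToolResponse("call-1", "42"),
    Start(2, "reasoning"),  # never closed, so flush() closes it
]
signals = [s for e in events for s in deriver.observe(e)]
signals += deriver.flush()
```

Keying tool spans by `tool_call_id` and reasoning spans by stream index is what lets overlapping parts be tracked independently in this sketch.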

Scope / what's NOT here

Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.

Quality gates

  • 30 tests passing on Python 3.12 + 3.13 (via ./scripts/test tests/lib/core/harness/).
  • pyright clean (0 errors/warnings), no # type: ignore in the package.
  • Each task spec- + quality-reviewed; final whole-branch review passed with no Critical issues.

Follow-ups (filed)

  • AGX1-373 (High) — make conformance assert true yield-vs-auto-send equivalence + reconcile Full tool-message wire shape (blocks migration backward-compat claims).
  • AGX1-374 (Medium) — auto_send reasoning + mixed-ordering tests.
  • AGX1-375 (Medium) — expose the surface via the public adk facade before the first consumer migration.
  • AGX1-376 (Low) — widen CI paths to agentex.types; SpanTracer duplicate-open guard.
  • AGX1-371 — deferred optional is_error on ToolResponseContent (tool-span error status).

Note: total diff is ~3k lines but ~1.6k of that is the spec + plan docs; the package code + tests + CI is ~1.4k. Reviewable per-commit (one commit per plan task).

🤖 Generated with Claude Code

Greptile Summary

  • Adds a shared harness layer for canonical stream messages, span signals, turn usage, and turn results.
  • Implements span derivation, tracing adapters, yield delivery, auto-send delivery, and a UnifiedEmitter facade.
  • Adds harness conformance scaffolding, focused tests, and a CI workflow skeleton for future harness migrations.

Confidence Score: 4/5

The harness foundation is well scoped, but full-message handling in span derivation and auto-send delivery needs attention before relying on existing canonical streams.

The implementation has focused tests and clean separation across tracing and delivery adapters, with two concrete stream-shape gaps affecting tool-call behavior.

src/agentex/lib/core/harness/span_derivation.py and src/agentex/lib/core/harness/auto_send.py

T-Rex Logs

What T-Rex did

  • T-Rex ran a focused Python repro feeding SpanDeriver a StreamTaskMessageFull ToolRequestContent followed by a matching StreamTaskMessageFull ToolResponseContent to derive full tool spans; the repro showed observe calls returned empty signal lists, flush returned no signals, and the expected OpenSpan and CloseSpan were missing.
  • T-Rex ran a focused Python repro against the real auto_send implementation with a StreamTaskMessageFull containing ToolResponseContent; the instrumented backend captured only context creation, context open as a start event, and context close as done for the tool response, with FULL_STREAM_UPDATE_COUNT 0 indicating the full tool response was not preserved.
  • T-Rex compared the base run to head by running the validation script again; base run failed with ModuleNotFoundError for agentex.lib.core.harness.span_derivation, while the head run completed and printed the expected OpenSpan/CloseSpan sequences with signals opening on Start and closing on Done.
  • T-Rex reviewed delivery-adapters artifacts, noting the Before run captured base command, cwd, import failure, and EXIT_CODE: 1, and the After run captured the head command, cwd, canonical messages, backend logs, tracing call logs, early close evidence, and EXIT_CODE: 0.
  • T-Rex examined emitter-tracer usage artifacts, where the base run failed with ImportError cannot import name UnifiedEmitter and the head run completed with ok: true for all scenarios, including swallowed fake tracing errors not propagated.
  • T-Rex examined harness conformance artifacts, describing the base environment where uv and pytest were unavailable and tests could not run, and the head environment where changed files were present and the scaffold shows OpenSpan/CloseSpan and PASS.

T-Rex Ran code and verified through T-Rex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/agentex/lib/core/harness/span_derivation.py:107-114
**Derive full tool spans**

`SpanDeriver` only opens tool spans from `Start(ToolRequestContent)` followed by `Done`, but existing canonical streams can emit tool calls as `StreamTaskMessageFull(ToolRequestContent)`. For example, the LangGraph sync stream emits a full tool request and then a full tool response. In that path, this branch ignores the request, so the response `tool_call_id` is never open and no tool span is produced. Please treat `Full(ToolRequestContent)` as an immediate tool-span open using its `tool_call_id`, `name`, and `arguments`.

### Issue 2 of 2
src/agentex/lib/core/harness/auto_send.py:101-105
**Preserve full messages**

This path handles a canonical `StreamTaskMessageFull` by opening and immediately closing a streaming context. The real streaming context emits `StreamTaskMessageStart` when it opens and `StreamTaskMessageDone` when it closes; it only publishes a full event when `stream_update()` receives a `StreamTaskMessageFull`. When a tool response arrives as `Full(ToolResponseContent)`, auto-send turns it into a start/done pair, so consumers that rely on the canonical full tool result event can miss the result shape.

Reviews (2): Last reviewed commit: "style: ruff import-sort + format fixes a..."

Greptile also left 2 inline comments on this PR.

Comment thread src/agentex/lib/core/harness/auto_send.py
Comment on lines +75 to +91
elif isinstance(event, StreamTaskMessageDelta):
    if current_ctx is not None and event.delta is not None:
        # Reconstruct the delta with parent_task_message set from
        # the context's task_message (mirrors _langgraph_async.py
        # lines 72-78 and 117-127).
        delta_with_parent = StreamTaskMessageDelta(
            parent_task_message=current_ctx.task_message,
            delta=event.delta,
            type="delta",
            index=event.index,
        )
        await current_ctx.stream_update(delta_with_parent)
        if isinstance(event.delta, TextDelta) and event.delta.text_delta:
            final_text_parts.append(event.delta.text_delta)

elif isinstance(event, StreamTaskMessageDone):
    await _close_current()

P1 Route by stream index

auto_send keeps only one current_ctx, so every delta is sent to the most recently opened text/reasoning context and every Done closes that context. Canonical streams can have multiple open parts keyed by index; when a tool-call or reasoning delta arrives while a text message is open, that delta is forwarded to the text context, or a Done for another index closes the text stream early. This can corrupt or truncate the auto-sent task messages. Please key the active contexts by event.index, or ignore deltas and done events whose index does not match the current text/reasoning context.
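
One way to picture the "key the active contexts by event.index" suggestion is a small router that owns one context per index. This is a sketch under assumptions: `FakeContext` and `IndexRouter` are hypothetical names, and the real streaming context in `adk.streaming` has a different interface.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeContext:
    """Hypothetical stand-in for a streaming context; records what it receives."""

    kind: str
    updates: List[str] = field(default_factory=list)
    closed: bool = False

    async def stream_update(self, text: str) -> None:
        self.updates.append(text)

    async def close(self) -> None:
        self.closed = True


class IndexRouter:
    """Routes deltas and Done events to the context opened for the same index."""

    def __init__(self) -> None:
        self._active: Dict[int, FakeContext] = {}
        self.finished: List[FakeContext] = []

    async def on_start(self, index: int, kind: str) -> None:
        self._active[index] = FakeContext(kind)

    async def on_delta(self, index: int, text: str) -> None:
        ctx = self._active.get(index)
        if ctx is not None:  # deltas for unknown indices are dropped
            await ctx.stream_update(text)

    async def on_done(self, index: int) -> None:
        ctx = self._active.pop(index, None)
        if ctx is not None:  # a Done for another index cannot close this one
            await ctx.close()
            self.finished.append(ctx)


async def demo() -> List[FakeContext]:
    router = IndexRouter()
    await router.on_start(0, "text")
    await router.on_start(1, "reasoning")
    await router.on_delta(0, "Hel")
    await router.on_delta(1, "thinking")  # arrives while the text part is open
    await router.on_done(1)  # closes only the reasoning part
    await router.on_delta(0, "lo")
    await router.on_done(0)
    return router.finished


finished = asyncio.run(demo())
```

In this model the interleaved reasoning delta no longer lands in the text context, and the text part stays open until its own Done arrives.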

Artifacts

Repro: focused auto_send overlapping-index harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing misrouted reasoning delta and truncated text

  • Keeps the command output available without making the summary code-heavy.

T-Rex Ran code and verified through T-Rex

declan-scale and others added 21 commits June 18, 2026 13:28
Approach A (Agentex event stream as canonical source of truth): one tap per
harness feeds shared yield/auto-send delivery adapters and a span-deriving
tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance +
live-matrix testing (3 test agents per harness: sync/async/temporal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration

- Make tracing-tap span derivation explicit (tool open on Done of a
  ToolRequestContent index, close on matching ToolResponseContent by
  tool_call_id; parallel-safe; reasoning start->done). Flag missing
  is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token
  taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling
  stays in the agent; only parsing/streaming/tracing/usage move to SDK taps +
  emitter; sandbox-setup events chain before the harness stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3)

Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter,
yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI
job. Migration/parser PRs (4-9) listed as follow-on plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
declan-scale force-pushed the declan-scale/unified-harness-surface branch from d21c54a to ebc468d on June 18, 2026 17:29
Comment thread src/agentex/lib/core/harness/span_derivation.py
Comment on lines +107 to +114
def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
    content = event.content
    if isinstance(content, ToolResponseContent):
        tcid = content.tool_call_id
        if tcid in self._open_tool_ids:
            self._open_tool_ids.pop(tcid, None)
            return [CloseSpan(key=tcid, output=content.content, is_complete=True)]
    return []

P1 Derive full tool spans

SpanDeriver only opens tool spans from Start(ToolRequestContent) followed by Done, but existing canonical streams can emit tool calls as StreamTaskMessageFull(ToolRequestContent). For example, the LangGraph sync stream emits a full tool request and then a full tool response. In that path, this branch ignores the request, so the response tool_call_id is never open and no tool span is produced. Please treat Full(ToolRequestContent) as an immediate tool-span open using its tool_call_id, name, and arguments.
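
The requested behavior, treating a full tool request as an immediate span open, might look roughly like the sketch below. The classes here are simplified stand-ins that reuse the names from the review comment; only `tool_call_id`, `name`, `arguments`, `content`, and the `CloseSpan` fields come from the source, while the `OpenSpan` fields and `FullAwareDeriver` are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class ToolRequestContent:  # stand-in with only the fields the comment names
    tool_call_id: str
    name: str
    arguments: Dict[str, Any]


@dataclass
class ToolResponseContent:  # stand-in
    tool_call_id: str
    content: Any


@dataclass
class StreamTaskMessageFull:  # stand-in
    content: Union[ToolRequestContent, ToolResponseContent]


@dataclass
class OpenSpan:  # field names are assumed for illustration
    key: str
    name: str
    input: Dict[str, Any]


@dataclass
class CloseSpan:
    key: str
    output: Any
    is_complete: bool


class FullAwareDeriver:
    """Hypothetical reducer that handles Full requests as well as Full responses."""

    def __init__(self) -> None:
        self._open_tool_ids: Dict[str, str] = {}

    def on_full(self, event: StreamTaskMessageFull) -> List[Union[OpenSpan, CloseSpan]]:
        content = event.content
        if isinstance(content, ToolRequestContent):
            # A full request is already complete, so the span can open at once.
            self._open_tool_ids[content.tool_call_id] = content.name
            return [OpenSpan(content.tool_call_id, content.name, content.arguments)]
        if isinstance(content, ToolResponseContent):
            if content.tool_call_id in self._open_tool_ids:
                self._open_tool_ids.pop(content.tool_call_id, None)
                return [CloseSpan(content.tool_call_id, content.content, True)]
        return []


deriver = FullAwareDeriver()
opened = deriver.on_full(
    StreamTaskMessageFull(ToolRequestContent("call-1", "search", {"q": "x"}))
)
closed = deriver.on_full(StreamTaskMessageFull(ToolResponseContent("call-1", "result")))
orphan = deriver.on_full(StreamTaskMessageFull(ToolResponseContent("call-2", "late")))
```

A response whose `tool_call_id` was never opened still produces no signal here, which matches the existing guard in `_on_full`.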

Artifacts

Repro: focused SpanDeriver full tool request and response harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Stack trace captured during the T-Rex run

  • Keeps the raw stack trace available without making the summary code-heavy.

T-Rex Ran code and verified through T-Rex

Comment thread src/agentex/lib/core/harness/auto_send.py
Comment on lines +101 to +105
async with streaming.streaming_task_message_context(
    task_id=task_id,
    initial_content=event.content,
):
    pass

P1 Preserve full messages

This path handles a canonical StreamTaskMessageFull by opening and immediately closing a streaming context. The real streaming context emits StreamTaskMessageStart when it opens and StreamTaskMessageDone when it closes; it only publishes a full event when stream_update() receives a StreamTaskMessageFull. When a tool response arrives as Full(ToolResponseContent), auto-send turns it into a start/done pair, so consumers that rely on the canonical full tool result event can miss the result shape.
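
The difference described in this comment can be modeled with a toy context manager. This is only an illustration of the reported behavior under assumptions: `fake_streaming_context` and `FakeStreamingContext` are hypothetical, the update is reduced to a type string plus content, and the sketch does not claim to be the fix the PR should adopt.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List, Tuple

Log = List[Tuple[str, str]]


class FakeStreamingContext:
    """Hypothetical model: only a 'full' update is published as a full event."""

    def __init__(self, log: Log) -> None:
        self._log = log

    async def stream_update(self, update_type: str, content: str) -> None:
        if update_type == "full":
            self._log.append(("full", content))


@asynccontextmanager
async def fake_streaming_context(
    log: Log, initial_content: str
) -> AsyncIterator[FakeStreamingContext]:
    log.append(("start", initial_content))  # emitted when the context opens
    try:
        yield FakeStreamingContext(log)
    finally:
        log.append(("done", initial_content))  # emitted when the context closes


async def open_and_close(log: Log, content: str) -> None:
    # Mirrors the commented code path: open the context and do nothing.
    async with fake_streaming_context(log, content):
        pass


async def forward_full(log: Log, content: str) -> None:
    # Alternative in this model: hand the full message to stream_update().
    async with fake_streaming_context(log, content) as ctx:
        await ctx.stream_update("full", content)


async def demo() -> Tuple[Log, Log]:
    before: Log = []
    after: Log = []
    await open_and_close(before, "tool result")
    await forward_full(after, "tool result")
    return before, after


before, after = asyncio.run(demo())
```

In this model the open-and-close path publishes only a start/done pair, while forwarding the message through `stream_update()` adds the full event that consumers of the canonical stream would expect.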

Artifacts

Repro: focused auto_send full tool response harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: failing execution output with captured outbound events

  • Keeps the command output available without making the summary code-heavy.

T-Rex Ran code and verified through T-Rex
