feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter) by declan-scale · Pull Request #412 · scaleapi/scale-agentex-python

declan-scale · 2026-06-18T17:12:01Z

What this is

Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex StreamTaskMessage* stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:

yield — pass the canonical stream through to the caller (sync HTTP ACP agents),
auto-send — push to the task stream via adk.streaming (async + temporal agents, from inside an activity),

with tracing on by default (derived from the same stream) and overridable, and a unified TurnUsage/TurnResult shape for per-harness usage normalization.
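
As a rough illustration of the two channels described above, the sketch below runs one toy canonical stream through a yield-style adapter and an auto-send-style adapter, each calling the same observer before delivering. Everything here (`Event`, `canonical_stream`, `yield_delivery`, `auto_send_delivery`, the `push` callback) is a hypothetical stand-in and not the package's actual API; the real adapters work with StreamTaskMessage* types and adk.streaming.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a canonical StreamTaskMessage* event."""
    kind: str  # "start", "delta", or "done"
    index: int
    text: str = ""


async def canonical_stream() -> AsyncIterator[Event]:
    """Toy canonical stream: one text message delivered in two deltas."""
    yield Event("start", 0)
    yield Event("delta", 0, "hel")
    yield Event("delta", 0, "lo")
    yield Event("done", 0)


async def yield_delivery(
    stream: AsyncIterator[Event], observe: Callable[[Event], None]
) -> AsyncIterator[Event]:
    """Yield channel: observe each event (for tracing), then pass it through."""
    async for event in stream:
        observe(event)
        yield event


async def auto_send_delivery(
    stream: AsyncIterator[Event],
    observe: Callable[[Event], None],
    push: Callable[[Event], Awaitable[None]],
) -> None:
    """Auto-send channel: observe each event, then push it to a backend."""
    async for event in stream:
        observe(event)
        await push(event)


async def demo():
    traced_yield: List[Event] = []
    traced_send: List[Event] = []
    pushed: List[Event] = []

    # Yield channel: the caller consumes the stream directly.
    yielded = [e async for e in yield_delivery(canonical_stream(), traced_yield.append)]

    # Auto-send channel: events go to a (fake) backend instead of the caller.
    async def push(event: Event) -> None:
        pushed.append(event)

    await auto_send_delivery(canonical_stream(), traced_send.append, push)
    return yielded, pushed, traced_yield, traced_send
```

Because both adapters observe the same events in the same order, whatever is derived from the observer (spans, in the real package) should not depend on which channel delivered the stream.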

Design: docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
Plan: docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md

What's in `src/agentex/lib/core/harness/`

types.py — StreamTaskMessage, OpenSpan/CloseSpan/SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol.
span_derivation.py — SpanDeriver: pure reducer (no adk dep), canonical stream → span signals. Tool span opens on the Done of a ToolRequestContent index, closes on the matching ToolResponseContent by tool_call_id; reasoning span open-on-Start / close-on-Done; parallel-safe; flush() closes unclosed spans.
tracer.py — SpanTracer: best-effort adapter from span signals to adk.tracing (never raises; overridable; guarded make_logger).
yield_delivery.py / auto_send.py — the two delivery adapters (both feed the same SpanDeriver/SpanTracer; finally-flush on early close/error).
emitter.py — UnifiedEmitter: ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
conformance/ — shared conformance scaffold each future harness tap registers fixtures with.
.github/workflows/harness-integration.yml — conformance CI job (via ./scripts/test) + an if: false live-matrix placeholder enabled by the migration PRs.
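
To make the span rules listed for span_derivation.py concrete, here is a minimal sketch of a pure reducer following them: a tool span opens on the Done of a tool-request index, closes on the matching tool response by tool_call_id, reasoning opens on Start and closes on Done, and flush() closes whatever is left. The `Event`, `OpenSpan`, `CloseSpan`, and `ToySpanDeriver` definitions are simplified assumptions for illustration only; the real types live in types.py and differ in shape.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified stand-in for a StreamTaskMessage* event."""
    kind: str                      # "start" | "delta" | "done"
    index: int
    content: Optional[str] = None  # "tool_request" | "tool_response" | "reasoning" | "text"
    tool_call_id: Optional[str] = None
    name: Optional[str] = None


@dataclass(frozen=True)
class OpenSpan:
    span_id: str
    name: str


@dataclass(frozen=True)
class CloseSpan:
    span_id: str


Signal = Union[OpenSpan, CloseSpan]


class ToySpanDeriver:
    """Pure reducer: feed events to observe(), collect span signals."""

    def __init__(self) -> None:
        self._pending_tools: Dict[int, Event] = {}  # index -> tool request awaiting Done
        self._reasoning: Dict[int, str] = {}        # index -> open reasoning span id
        self._open: Dict[str, str] = {}             # span id -> span name

    def _open_span(self, span_id: str, name: str) -> List[Signal]:
        self._open[span_id] = name
        return [OpenSpan(span_id, name)]

    def _close_span(self, span_id: str) -> List[Signal]:
        self._open.pop(span_id, None)
        return [CloseSpan(span_id)]

    def observe(self, event: Event) -> List[Signal]:
        if event.kind == "start":
            if event.content == "tool_request":
                # Remember the request; the span opens when this index is Done.
                self._pending_tools[event.index] = event
            elif event.content == "reasoning":
                span_id = f"reasoning:{event.index}"
                self._reasoning[event.index] = span_id
                return self._open_span(span_id, "reasoning")
            elif event.content == "tool_response" and event.tool_call_id in self._open:
                # Close by tool_call_id, so parallel calls cannot be confused.
                return self._close_span(event.tool_call_id)
        elif event.kind == "done":
            request = self._pending_tools.pop(event.index, None)
            if request is not None:
                span_id = request.tool_call_id or f"tool:{event.index}"
                return self._open_span(span_id, request.name or "tool")
            span_id = self._reasoning.pop(event.index, None)
            if span_id is not None:
                return self._close_span(span_id)
        return []

    def flush(self) -> List[Signal]:
        """Close every span still open, e.g. on early close or error."""
        signals: List[Signal] = [CloseSpan(span_id) for span_id in self._open]
        self._open.clear()
        self._reasoning.clear()
        self._pending_tools.clear()
        return signals
```

Keying open tool spans by tool_call_id rather than by arrival order is what lets two in-flight tool calls finish in either order without closing each other's span.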

Scope / what's NOT here

Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.

Quality gates

30 tests passing on Python 3.12 + 3.13 (via ./scripts/test tests/lib/core/harness/).
pyright clean (0 errors/warnings), no # type: ignore in the package.
Each task spec- + quality-reviewed; final whole-branch review passed with no Critical issues.

Follow-ups (filed)

AGX1-373 (High) — make conformance assert true yield-vs-auto-send equivalence + reconcile Full tool-message wire shape (blocks migration backward-compat claims).
AGX1-374 (Medium) — auto_send reasoning + mixed-ordering tests.
AGX1-375 (Medium) — expose the surface via the public adk facade before the first consumer migration.
AGX1-376 (Low) — widen CI paths to agentex.types; SpanTracer duplicate-open guard.
AGX1-371 — deferred optional is_error on ToolResponseContent (tool-span error status).

Note: total diff is ~3k lines but ~1.6k of that is the spec + plan docs; the package code + tests + CI is ~1.4k. Reviewable per-commit (one commit per plan task).

🤖 Generated with Claude Code

Greptile Summary

Adds a shared harness layer for canonical stream messages, span signals, turn usage, and turn results.
Implements span derivation, tracing adapters, yield delivery, auto-send delivery, and a UnifiedEmitter facade.
Adds harness conformance scaffolding, focused tests, and a CI workflow skeleton for future harness migrations.

Confidence Score: 4/5

The harness foundation is well scoped, but full-message handling in span derivation and auto-send delivery needs attention before relying on existing canonical streams.

The implementation has focused tests and clean separation across tracing and delivery adapters, with two concrete stream-shape gaps affecting tool-call behavior.

src/agentex/lib/core/harness/span_derivation.py and src/agentex/lib/core/harness/auto_send.py

T-Rex Logs

What T-Rex did

T-Rex ran a focused Python repro that fed SpanDeriver a StreamTaskMessageFull carrying ToolRequestContent, followed by a matching StreamTaskMessageFull carrying ToolResponseContent. Every observe call returned an empty signal list, flush returned no signals, and the expected OpenSpan and CloseSpan were missing.
T-Rex ran a focused Python repro against the real auto_send implementation with a StreamTaskMessageFull containing ToolResponseContent. The instrumented backend captured only context creation, a start event on context open, and a done event on context close; FULL_STREAM_UPDATE_COUNT was 0, indicating the full tool response was not preserved.
T-Rex reran the validation script on base and head. The base run failed with ModuleNotFoundError for agentex.lib.core.harness.span_derivation; the head run completed and printed the expected OpenSpan/CloseSpan sequences, with signals opening on Start and closing on Done.
T-Rex reviewed the delivery-adapters artifacts. The Before run captured the base command, cwd, the import failure, and EXIT_CODE: 1; the After run captured the head command, cwd, canonical messages, backend logs, tracing call logs, early-close evidence, and EXIT_CODE: 0.
T-Rex examined the emitter-tracer usage artifacts. The base run failed with an ImportError (cannot import name UnifiedEmitter); the head run completed with ok: true for all scenarios, including the one where fake tracing errors were swallowed rather than propagated.
T-Rex examined the harness conformance artifacts. In the base environment, uv and pytest were unavailable and the tests could not run; in the head environment, the changed files were present and the scaffold showed OpenSpan/CloseSpan and PASS.

Ran code and verified through T-Rex

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/agentex/lib/core/harness/span_derivation.py:107-114
**Derive full tool spans**

`SpanDeriver` only opens tool spans from `Start(ToolRequestContent)` followed by `Done`, but existing canonical streams can emit tool calls as `StreamTaskMessageFull(ToolRequestContent)`. For example, the LangGraph sync stream emits a full tool request and then a full tool response. In that path, this branch ignores the request, so the response `tool_call_id` is never open and no tool span is produced. Please treat `Full(ToolRequestContent)` as an immediate tool-span open using its `tool_call_id`, `name`, and `arguments`.
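
One possible shape for the requested behavior, shown as a hedged sketch rather than the actual patch: treat a full tool request as an immediate open and a full tool response as the matching close. `FullEvent` and `derive_full_tool_spans` are hypothetical names invented for this example, and the tuple signals are a simplification of OpenSpan/CloseSpan.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class FullEvent:
    """Hypothetical stand-in for StreamTaskMessageFull carrying tool content."""
    content: str  # "tool_request" | "tool_response"
    tool_call_id: str
    name: Optional[str] = None
    arguments: Optional[dict] = None


def derive_full_tool_spans(events: List[FullEvent]) -> List[Tuple]:
    """Open a tool span on a full request, close it on the matching full response."""
    open_calls: Set[str] = set()
    signals: List[Tuple] = []
    for event in events:
        if event.content == "tool_request" and event.tool_call_id not in open_calls:
            # No Start/Done pair will follow a Full message, so open right away.
            open_calls.add(event.tool_call_id)
            signals.append(("open", event.tool_call_id, event.name, event.arguments))
        elif event.content == "tool_response" and event.tool_call_id in open_calls:
            open_calls.discard(event.tool_call_id)
            signals.append(("close", event.tool_call_id))
    return signals
```

With a stream shaped like the one described for the LangGraph sync path (a full request followed by a full response), this produces one open and one close for the same `tool_call_id`.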

### Issue 2 of 2
src/agentex/lib/core/harness/auto_send.py:101-105
**Preserve full messages**

This path handles a canonical `StreamTaskMessageFull` by opening and immediately closing a streaming context. The real streaming context emits `StreamTaskMessageStart` when it opens and `StreamTaskMessageDone` when it closes; it only publishes a full event when `stream_update()` receives a `StreamTaskMessageFull`. When a tool response arrives as `Full(ToolResponseContent)`, auto-send turns it into a start/done pair, so consumers that rely on the canonical full tool result event can miss the result shape.
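
The difference can be sketched with a fake context that mimics the behavior described above (start on open, done on close, full only via `stream_update()`). `FakeStreamingContext`, `send_full_lossy`, and `send_full_preserving` are hypothetical names for illustration; the real streaming context and its signatures are not reproduced here, and forwarding the full message through `stream_update()` is only one possible fix.

```python
import asyncio
from typing import List


class FakeStreamingContext:
    """Hypothetical context: logs 'start' on open, 'done' on close."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    async def __aenter__(self) -> "FakeStreamingContext":
        self.log.append("start")
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.log.append("done")
        return False  # never swallow exceptions

    async def stream_update(self, update: dict) -> None:
        # A full event is published only when a full update is received.
        self.log.append("full" if update.get("type") == "full" else "delta")


async def send_full_lossy(full: dict, log: List[str]) -> None:
    """Open and immediately close: consumers see start/done, never full."""
    async with FakeStreamingContext(log):
        pass


async def send_full_preserving(full: dict, log: List[str]) -> None:
    """Forward the full message before closing so the full event is published."""
    async with FakeStreamingContext(log) as ctx:
        await ctx.stream_update(full)
```

In this toy model the lossy path logs `start, done`, while the preserving path logs `start, full, done`, which mirrors the FULL_STREAM_UPDATE_COUNT 0 observation in the T-Rex repro.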

Reviews (2): Last reviewed commit: "style: ruff import-sort + format fixes a..."

Greptile also left 2 inline comments on this PR.

greptile-apps · 2026-06-18T17:24:34Z

+            elif isinstance(event, StreamTaskMessageDelta):
+                if current_ctx is not None and event.delta is not None:
+                    # Reconstruct the delta with parent_task_message set from
+                    # the context's task_message (mirrors _langgraph_async.py
+                    # lines 72-78 and 117-127).
+                    delta_with_parent = StreamTaskMessageDelta(
+                        parent_task_message=current_ctx.task_message,
+                        delta=event.delta,
+                        type="delta",
+                        index=event.index,
+                    )
+                    await current_ctx.stream_update(delta_with_parent)
+                    if isinstance(event.delta, TextDelta) and event.delta.text_delta:
+                        final_text_parts.append(event.delta.text_delta)
+
+            elif isinstance(event, StreamTaskMessageDone):
+                await _close_current()


Route by stream index

auto_send keeps only one current_ctx, so every delta is sent to the most recently opened text/reasoning context and every Done closes that context. Canonical streams can have multiple open parts keyed by index; when a tool-call or reasoning delta arrives while a text message is open, that delta is forwarded to the text context, or a Done for another index closes the text stream early. This can corrupt or truncate the auto-sent task messages. Please key the active contexts by event.index, or ignore deltas and done events whose index does not match the current text/reasoning context.
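
A minimal sketch of the first suggestion (keying active contexts by index), using invented stand-ins: `Event`, `FakeContext`, and `auto_send_by_index` are hypothetical and only model the routing, not the real streaming contexts or the `parent_task_message` handling in the diff above.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a Start/Delta/Done stream event."""
    kind: str  # "start" | "delta" | "done"
    index: int
    text: str = ""


class FakeContext:
    """Hypothetical streaming context that records what it was sent."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.parts: List[str] = []
        self.closed = False

    async def stream_update(self, text: str) -> None:
        self.parts.append(text)

    async def close(self) -> None:
        self.closed = True


async def auto_send_by_index(events: List[Event]) -> List[FakeContext]:
    active: Dict[int, FakeContext] = {}  # one open context per stream index
    finished: List[FakeContext] = []
    try:
        for event in events:
            if event.kind == "start":
                active[event.index] = FakeContext(event.index)
            elif event.kind == "delta":
                ctx = active.get(event.index)
                if ctx is not None:  # ignore deltas for unknown indexes
                    await ctx.stream_update(event.text)
            elif event.kind == "done":
                ctx = active.pop(event.index, None)
                if ctx is not None:  # a Done only closes its own index
                    await ctx.close()
                    finished.append(ctx)
    finally:
        # Close anything still open on early exit or error.
        for ctx in active.values():
            await ctx.close()
            finished.append(ctx)
        active.clear()
    return finished
```

With interleaved events for indexes 0 and 1, each delta lands in its own context and a Done for index 1 leaves the index 0 stream open.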

Artifacts

Repro: focused auto_send overlapping-index harness

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing misrouted reasoning delta and truncated text

Keeps the command output available without making the summary code-heavy.

Ran code and verified through T-Rex

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/agentex/lib/core/harness/auto_send.py
Line: 75-91
Comment:
**Route by stream index**

`auto_send` keeps only one `current_ctx`, so every delta is sent to the most recently opened text/reasoning context and every `Done` closes that context. Canonical streams can have multiple open parts keyed by `index`; when a tool-call or reasoning delta arrives while a text message is open, that delta is forwarded to the text context, or a `Done` for another index closes the text stream early. This can corrupt or truncate the auto-sent task messages. Please key the active contexts by `event.index`, or ignore deltas and done events whose index does not match the current text/reasoning context.

How can I resolve this? If you propose a fix, please make it concise.

Approach A (Agentex event stream as canonical source of truth): one tap per harness feeds shared yield/auto-send delivery adapters and a span-deriving tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance + live-matrix testing (3 test agents per harness: sync/async/temporal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… golden-agent integration

- Make tracing-tap span derivation explicit (tool open on Done of a ToolRequestContent index, close on matching ToolResponseContent by tool_call_id; parallel-safe; reasoning start->done). Flag missing is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling stays in the agent; only parsing/streaming/tracing/usage move to SDK taps + emitter; sandbox-setup events chain before the harness stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… 1-3)

Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter, yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI job. Migration/parser PRs (4-9) listed as follow-on plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>