From cf4a0dd47aa6c38184d31d64776497df04166a79 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 11:22:00 -0400
Subject: [PATCH 01/26] docs: design for unified harness
 tracing/message-emitting surface

Approach A (Agentex event stream as canonical source of truth): one tap per
harness feeds shared yield/auto-send delivery adapters and a span-deriving
tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance +
live-matrix testing (3 test agents per harness: sync/async/temporal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...26-06-18-unified-harness-surface-design.md | 204 ++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
diff --git a/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
new file mode 100644
index 000000000..8e5411863
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
@@ -0,0 +1,204 @@
+# Unified Harness Tracing / Message-Emitting Surface
+
+Date: 2026-06-18
+Status: Approved design, pending implementation
+Repo: `scale-agentex-python`
+
+## Problem
+
+The SDK integrates several agent harnesses (pydantic-ai, LangGraph, OpenAI Agents) by
+converting each harness's native output into Agentex `StreamTaskMessage*` events. Today
+that integration is triplicated per harness:
+
+- `_<harness>_sync.py` — a converter that **yields** Agentex events back over the
+  HTTP/JSON-RPC response (sync ACP agents).
+- `_<harness>_async.py` — a converter that **auto-sends** to the task stream (Redis via
+  `adk.streaming`) for async + temporal agents.
+- `_<harness>_tracing.py` — a separate, opt-in tracing handler wired into the converter
+  by hand.
+
+Consequences:
+
+- The native-output → Agentex-event mapping exists in two places per harness (sync and
+  async) and can drift.
+- Tracing is bolted on per harness and is inconsistent across harnesses.
+- There is no shared notion of a tool/reasoning span tree or turn-level metadata.
+- The golden agent grew a parallel "harness layer" (a neutral `HarnessEvent` vocabulary
+  plus an adapter that drives `adk.streaming` + `adk.tracing`) to solve the same problem
+  for its subprocess CLI harnesses (claude-code, codex). That logic is valuable but lives
+  outside the SDK.
+
+## Goal (end state)
+
+pydantic-ai, LangGraph, OpenAI Agents, claude-code, and codex all emit through one unified
+surface. A single pass over a harness's output drives **streaming, message persistence, and
+tracing** from one source of truth, in the same shape as Agentex events. The surface works
+for **both** delivery channels (sync yield, async/temporal auto-send). Tracing is on by
+default and overridable. The claude-code/codex *parsers* live in the SDK; their sandbox /
+secret / MCP orchestration stays in the golden agent.
+
+## Approach: Agentex event stream is canonical (Approach A)
+
+The Agentex `StreamTaskMessage*` stream is the single source of truth. Each harness maps its
+native output to that stream **once**. A single emitter consumes that one stream and fans it
+out to delivery (yield or auto-send) and to tracing (spans derived from the same stream).
+
+We considered two alternatives and rejected them:
+
+- **Neutral `AgentEvent` vocabulary + dual projectors (Approach B):** richer (carries turn
+  usage/cost natively, clean start/end pairing) but reintroduces a parallel vocabulary to
+  keep in sync with Agentex types, for the same outcome.
+- **Push-to-sink with typed emitter methods (Approach C):** very testable, but the *yield*
+  delivery channel fights a push API (needs a queue/generator bridge), and sync ACP agents
+  depend on yield.
+
+Approach A matches "same shape as Agentex events" most directly, makes the yield channel
+free, and lets us delete the per-harness tracing code by deriving spans from the canonical
+stream.
+
+## Components
+
+Four shared, harness-independent components plus one thin tap per harness.
+
+### 1. Per-harness tap (the only per-harness code)
+
+```
+convert_<harness>_to_agentex_events(native_stream, ...) -> AsyncIterator[StreamTaskMessage*]
+```
+
+The existing sync converters (`convert_pydantic_ai_to_agentex_events`,
+`convert_langgraph_to_agentex_events`, `convert_openai_to_agentex_events`) already have this
+shape and *become* the taps. New taps: `convert_claude_code_to_agentex_events`,
+`convert_codex_to_agentex_events` (pure parsers over the CLIs' newline-delimited
+stream-json; no SGP/sandbox coupling).
+
+### 2. Auto-send adapter (shared)
+
+Consumes the canonical Agentex stream and drives `adk.streaming` context managers: open/close
+text and reasoning contexts, switch cleanly between them, stream tool request/response. This
+generalizes the golden agent's `AgentexStreamAdapter` and replaces the N hand-written
+`_async` bodies with one. Returns the accumulated final text (preserving current
+auto-send return values).
+
+### 3. Yield adapter (shared)
+
+Passes the canonical stream through to the caller (sync HTTP ACP), tee-ing each event to the
+tracer as a side effect.
+
+### 4. Tracing tap (shared)
+
+Derives spans from the canonical stream:
+
+- tool span = `ToolRequestContent` (start/full) → matching `ToolResponseContent` by
+  `tool_call_id`.
+- reasoning span = reasoning start → done.
+- subagent span = the Task/Agent tool's span (a tool span by another name).
+
+Default-on whenever a trace context exists; **overridable** by passing a custom tracer, or
+`None` to disable. Replaces the per-harness `_tracing.py` handlers.
+
+### Facade
+
+A `UnifiedEmitter` ties the chosen delivery adapter and the tracer together so an agent
+author calls one thing.
+
+### Proposed layout
+
+- Shared components: `src/agentex/lib/core/harness/` (delivery adapters, tracing tap, span
+  derivation, facade).
+- Taps: remain in `src/agentex/lib/adk/_modules/`.
+- Public access: via the `adk` facade.
+
+## Data flow
+
+One pass over the canonical stream, fanned out by delivery mode.
+
+- **Sync agent:** `async for ev in emitter.yield_events(convert_X(native)): ...` — the tracer
+  observes each event; the event is yielded over the HTTP/JSON-RPC response.
+- **Async + temporal agent:** `await emitter.auto_send(convert_X(native), task_id=...)` — the
+  auto-send adapter pushes deltas to Redis via `adk.streaming` while the tracer observes the
+  same events; returns accumulated final text. Temporal is identical, called from inside an
+  activity (converters run in activities, not workflows, so determinism is not a concern).
+- **Tracing** is the same derivation in both modes (it observes the canonical stream), so
+  sync and auto-send produce identical spans.
+- **Turn-level metadata** (usage / cost / model) is not an Agentex event. It rides a small
+  side-channel: the tap returns a final typed `TurnResult` (or yields a terminal record)
+  that the caller attaches to the turn span. This mirrors how the golden agent already treats
+  `TurnCompleted` as "handled by the caller, not the stream."
+
+Net dedup: **3 files × N harnesses → 1 tap × N harnesses + 3 shared components.**
+
+## Backwards compatibility (every change is additive)
+
+The end state "replaces" the old converters, but it is reached additively. No public symbol
+is removed in this stack; nothing regresses.
+
+- **Taps:** existing `convert_*_to_agentex_events` keep exact signatures and output. Behavior
+  is unchanged when no trace context is present.
+- **Auto-send entry points** (`stream_langgraph_events(stream, task_id)`, the pydantic/openai
+  `_async` helpers, `run_agent_streamed_auto_send`, `chat_completion_stream_auto_send`) keep
+  signatures and return values, reimplemented to delegate to the shared auto-send adapter.
+  Feature-add: they emit traces by default. The conformance suite asserts equivalent Redis
+  messages before/after.
+- **`_tracing.py` handlers** stay importable as shims; the shared tracer supersedes them
+  internally.
+- **Removal/deprecation** of dead internal duplication is the final PR, behind a deprecation
+  note, never mixed into a migration PR.
+
+## Rollout — stacked PRs (each < 1000 lines diff)
+
+1. **Span derivation (`TracingTap`)** — pure function: canonical stream → spans.
+   Unit-tested in isolation. No wiring.
+2. **Auto-send adapter** — canonical stream → `adk.streaming` side effects. Fixture-tested.
+   Not yet wired into harnesses.
+3. **Yield adapter + `UnifiedEmitter` facade + public `adk` surface** — plus the
+   conformance-test scaffold (fixture format + parametrized runner) and an empty CI
+   integration job.
+4. **Migrate pydantic-ai** — reimplement its `_async` / tracing on the shared components;
+   keep `convert_pydantic_ai_to_agentex_events` signature; default tracing on. Add 3 test
+   agents (sync / async / temporal) + CI matrix entries + live smoke.
+5. **Migrate LangGraph** — same pattern + 3 test agents + CI.
+6. **Migrate OpenAI Agents** — same pattern + 3 test agents + CI.
+7. **claude-code parser tap** — `convert_claude_code_to_agentex_events` + recorded stream-json
+   fixtures + feasible test agent(s) (likely temporal-only, given the sandbox requirement).
+8. **codex parser tap** — same shape + fixtures + feasible test agent(s).
+9. **Cleanup** — delete now-dead internal duplication, deprecate shims, docs.
+
+## Testing
+
+### Offline conformance suite (every PR)
+
+Committed raw harness outputs (pydantic `AgentStreamEvent`s, LangGraph chunks, OpenAI stream,
+claude/codex stream-json) drive a shared parametrized suite. For each fixture, assert:
+
+- exact normalized `StreamTaskMessage*` sequence,
+- derived span tree,
+- **yield-vs-auto-send equivalence** — both channels produce the same logical events/spans.
+
+Every tap must pass the shared cases: text, reasoning, single tool, tool error, multi-step,
+and interleaved reasoning + tool ordering. Deterministic, offline, no network.
+
+### Live integration matrix (CI)
+
+Three test agents per harness, one per agent type (sync / async / temporal), deployed and
+driven with a fixed prompt. Assert the unified surface produced valid ordered messages and a
+well-formed span tree. Modeled on the existing `agentex-tutorials-test.yml` /
+`build-and-push-tutorial-agent.yml` CI precedent.
+
+Matrix: harness ∈ {pydantic-ai, langgraph, openai-agents, claude-code, codex} × agent-type ∈
+{sync, async, temporal}. claude-code/codex run the subset of agent types that is feasible;
+any uncovered cell is logged/documented, never silently skipped.
+
+### Error handling
+
+- A tap that raises mid-stream closes open streaming contexts and open spans — no leaked
+  `adk.streaming` context, no dangling span.
+- Tracing failures are best-effort and never break delivery (matches the golden agent's
+  contract).
+
+## Out of scope
+
+- Sandbox pool, sandbox lifecycle, MCP server provisioning, and OAuth/secret reauth — tracked
+  separately; only the pure claude-code/codex output parsers are in scope here.
+- claude-code/codex sandbox / secret / MCP orchestration — stays in the golden agent and
+  feeds the SDK parser.

From 4538544806c518e1a9be15cb711bfae596c8f882 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 11:30:03 -0400
Subject: [PATCH 02/26] =?UTF-8?q?docs:=20refine=20unified=20harness=20spec?=
 =?UTF-8?q?=20=E2=80=94=20span=20derivation=20rules,=20TurnUsage,=20golden?=
 =?UTF-8?q?-agent=20integration?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Make tracing-tap span derivation explicit (tool open on Done of a
  ToolRequestContent index, close on matching ToolResponseContent by
  tool_call_id; parallel-safe; reasoning start->done). Flag missing
  is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token
  taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling
  stays in the agent; only parsing/streaming/tracing/usage move to SDK taps +
  emitter; sandbox-setup events chain before the harness stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...26-06-18-unified-harness-surface-design.md | 124 ++++++++++++++++--
 1 file changed, 116 insertions(+), 8 deletions(-)

diff --git a/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
index 8e5411863..c3b54c117 100644
--- a/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
+++ b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
@@ -87,16 +87,34 @@ tracer as a side effect.
 
 ### 4. Tracing tap (shared)
 
-Derives spans from the canonical stream:
+A stateful reducer that derives spans from the canonical stream. It only *observes*
+`index` and `tool_call_id`; it never mutates or reorders the stream, so streaming fidelity
+is unchanged.
 
-- tool span = `ToolRequestContent` (start/full) → matching `ToolResponseContent` by
+Derivation rules:
+
+- **Tool span open:** on `StreamTaskMessageDone` for an index whose `Start` content was a
+  `ToolRequestContent`. Arguments are fully known by `Done` (covers both streamed-args and
+  one-shot tools). The open span is keyed by `tool_call_id`.
+- **Tool span close:** on `StreamTaskMessageFull(ToolResponseContent)` matching by
   `tool_call_id`.
-- reasoning span = reasoning start → done.
-- subagent span = the Task/Agent tool's span (a tool span by another name).
+- **Parallel / interleaved tools:** `ToolRequestContent`, `ToolResponseContent`,
+  `ToolRequestDelta`, and `ToolResponseDelta` all carry `tool_call_id` + `name`, so multiple
+  open tool spans pair correctly regardless of arrival order.
+- **Reasoning span:** `Start(ReasoningContent)` → `Done` on that index.
+- **Subagent span:** the Task/Agent tool's span (a tool span by another name), nested under
+  the turn span.
 
 Default-on whenever a trace context exists; **overridable** by passing a custom tracer, or
 `None` to disable. Replaces the per-harness `_tracing.py` handlers.
 
+**Open decision — tool error status.** `ToolResponseContent` currently has no
+`is_error`/`status` field (only `content`), so a derived tool span cannot mark failure. The
+golden agent's `ToolCompleted` carried `is_error`. Recommended resolution: add an additive
+optional `is_error: bool | None` to `ToolResponseContent`. This is a generated type, so it is
+a small upstream API-spec change (tracked as a prerequisite to the relevant migration PR), not
+a local edit. Until it lands, derived spans omit tool error status rather than inferring it.
+
 ### Facade
 
 A `UnifiedEmitter` ties the chosen delivery adapter and the tracer together so an agent
@@ -121,13 +139,66 @@ One pass over the canonical stream, fanned out by delivery mode.
   activity (converters run in activities, not workflows, so determinism is not a concern).
 - **Tracing** is the same derivation in both modes (it observes the canonical stream), so
   sync and auto-send produce identical spans.
-- **Turn-level metadata** (usage / cost / model) is not an Agentex event. It rides a small
-  side-channel: the tap returns a final typed `TurnResult` (or yields a terminal record)
-  that the caller attaches to the turn span. This mirrors how the golden agent already treats
-  `TurnCompleted` as "handled by the caller, not the stream."
+- **Turn-level metadata** (usage / cost / model) is not an Agentex event, so it is surfaced
+  as a first-class `TurnUsage` shape rather than ad-hoc data (see below).
 
 Net dedup: **3 files × N harnesses → 1 tap × N harnesses + 3 shared components.**
 
+## Unified turn usage / cost
+
+Turn metadata is a first-class, harness-independent shape attached to the turn span and
+returned to the caller — not a loose side-channel.
+
+```
+class TurnUsage(BaseModel):
+    model: str | None
+    input_tokens: int | None
+    output_tokens: int | None
+    cached_input_tokens: int | None   # subset of input_tokens served from cache
+    reasoning_tokens: int | None      # subset of output_tokens
+    total_tokens: int | None
+    cost_usd: float | None
+    duration_ms: int | None           # wall-clock, measured by the emitter
+    num_llm_calls: int
+    num_tool_calls: int               # derived from the canonical stream
+    num_reasoning_blocks: int         # derived from the canonical stream
+
+class TurnResult(BaseModel):
+    final_text: str
+    usage: TurnUsage
+```
+
+- Token field names align with the existing `agentex.lib.core.observability.llm_metrics`
+  taxonomy (`input_tokens` / `output_tokens` / `cached_input_tokens` / `reasoning_tokens`),
+  not a new vocabulary. (The OpenAI-style `llm_messages.Usage` —
+  `prompt_tokens`/`completion_tokens` — is mapped into this richer shape.)
+- **Each harness tap normalizes its native usage** into `TurnUsage`: pydantic-ai
+  `result.usage()`, LangGraph `usage_metadata`, OpenAI `response.usage`, claude-code/codex
+  the final `result` envelope (`cost_usd` + usage). Per-harness normalization, one output
+  shape.
+- The stream-derived counts (`num_tool_calls`, `num_reasoning_blocks`) come for free from the
+  tracing tap's reduction; `duration_ms` is measured by the emitter; tokens/cost/model come
+  from the tap's native-usage normalization.
+- The emitter attaches `TurnUsage` to the **turn span** via `adk.tracing.span(data=...)`
+  (which already accepts a `BaseModel`) and returns `TurnResult` to the caller. The same
+  object can feed the OTel `LLMMetrics` and downstream metrics (e.g. the golden agent's
+  per-turn DogStatsD emission), so traces and metrics share one shape.
+
+### Surfacing `TurnUsage` from the tap
+
+Python async generators cannot cleanly return a value to their consumer, so the tap does not
+return `TurnUsage` via `StopAsyncIteration`. Instead the per-harness entry is a small object:
+
+```
+class HarnessTurn:
+    events: AsyncIterator[StreamTaskMessage*]   # the canonical stream
+    def usage(self) -> TurnUsage                 # populated once `events` is exhausted
+```
+
+The emitter drives `events` (delivering + tracing), then reads `usage()` to finalize the turn
+span and build `TurnResult`. This keeps the canonical stream pure (only `StreamTaskMessage*`)
+while giving usage/cost a typed home.
+
 ## Backwards compatibility (every change is additive)
 
 The end state "replaces" the old converters, but it is reached additively. No public symbol
@@ -196,6 +267,43 @@ any uncovered cell is logged/documented, never silently skipped.
 - Tracing failures are best-effort and never break delivery (matches the golden agent's
   contract).
 
+## Golden agent integration (SGP / sandbox coupling preserved)
+
+The unified surface is designed so the golden agent keeps **all** of its SGP-coupled layers
+and only swaps its hand-rolled parsing/streaming/tracing internals for the SDK's taps +
+emitter. Nothing SGP-specific moves into the SDK.
+
+What stays in the golden agent, untouched:
+
+- Sandbox pool acquire modes (cold-create / warm-claim / reconnect), lease coordination, and
+  the data-plane URL override.
+- Secret resolution, OAuth/MCP reauth, and reconnect-notice emission (the notice is just
+  another standalone message on the task stream, independent of the harness tap).
+- Spawning `claude -p` / `codex exec` inside the sandbox.
+
+What changes inside the golden agent's provider:
+
+1. Acquire/provision the sandbox and resolve secrets/MCP exactly as today (SGP-coupled).
+2. Spawn the CLI in the sandbox and feed its stdout (stream-json lines) into the SDK tap
+   `convert_claude_code_to_agentex_events` / `convert_codex_to_agentex_events`.
+3. Run that tap through the SDK emitter's **auto-send** path from inside the existing Temporal
+   activity, getting streaming + tracing + `TurnUsage` for free. The agent's
+   `_StreamJsonProcessor` and `AgentexStreamAdapter` are retired in favor of the SDK tap +
+   emitter.
+
+**Sandbox-setup events:** today the golden agent surfaces provisioning steps (reconnect /
+find / create / configure-git / clone) as UI tool calls by yielding them into the same
+adapter. Under the unified surface these become agent-produced `ToolRequestContent` /
+`ToolResponseContent` messages, chained *before* the harness tap's stream into one canonical
+stream for the turn (`chain(setup_events, convert_claude_code(stdout))`). The emitter then
+delivers and traces the whole turn uniformly, so setup steps keep appearing in the UI and the
+span tree.
+
+This means the claude-code/codex parser PRs (7, 8) deliver the SDK taps, and a corresponding
+**golden-agent-side change** (out of this repo's PR stack) rewires its providers onto them.
+The golden agent's in-process litellm / OpenAI-Agents harness can likewise adopt the OpenAI
+tap, though that is optional and not required by this design.
+
 ## Out of scope
 
 - Sandbox pool, sandbox lifecycle, MCP server provisioning, and OAuth/secret reauth — tracked

From 4573d18f5bc65553d648e230bdb5bbfaef858044 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 11:34:38 -0400
Subject: [PATCH 03/26] docs: link deferred tool-error decision to AGX1-371

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../specs/2026-06-18-unified-harness-surface-design.md       | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
index c3b54c117..e8a32f112 100644
--- a/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
+++ b/docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
@@ -112,8 +112,9 @@ Default-on whenever a trace context exists; **overridable** by passing a custom
 `is_error`/`status` field (only `content`), so a derived tool span cannot mark failure. The
 golden agent's `ToolCompleted` carried `is_error`. Recommended resolution: add an additive
 optional `is_error: bool | None` to `ToolResponseContent`. This is a generated type, so it is
-a small upstream API-spec change (tracked as a prerequisite to the relevant migration PR), not
-a local edit. Until it lands, derived spans omit tool error status rather than inferring it.
+a small upstream API-spec change, not a local edit. **Deferred** — tracked in Linear as
+AGX1-371 (Agentex "Starter Tasks"). Until it lands, derived spans omit tool error status
+rather than inferring it.
 
 ### Facade
 

From 8eda56dc21c11cd5f341b2f641dc6c4959f056c9 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 11:40:42 -0400
Subject: [PATCH 04/26] docs: foundation implementation plan for unified
 harness surface (PRs 1-3)

Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter,
yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI
job. Migration/parser PRs (4-9) listed as follow-on plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...6-18-unified-harness-surface-foundation.md | 1309 +++++++++++++++++
 1 file changed, 1309 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md

diff --git a/docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md b/docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md
new file mode 100644
index 000000000..0aefef060
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md
@@ -0,0 +1,1309 @@
+# Unified Harness Surface — Foundation Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build the shared, harness-independent machinery (span derivation, auto-send delivery, yield delivery, unified emitter, turn-usage types) that the per-harness taps will plug into — corresponding to PRs 1–3 of the design's rollout.
+
+**Architecture:** The Agentex `StreamTaskMessage*` stream is the single source of truth (design Approach A). A pure `SpanDeriver` reduces that stream into open/close span signals. Two delivery adapters consume the same stream — `yield_events` (sync HTTP ACP) and `auto_send` (async/temporal, via `adk.streaming`) — and both observe the deriver to drive `adk.tracing`. A `UnifiedEmitter` ties delivery + tracing + `TurnUsage` together.
+
+**Tech Stack:** Python 3, pydantic v2 (`BaseModel`), pytest + pytest-asyncio, the existing `agentex.lib.adk` streaming/tracing facades.
+
+**Spec:** `docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md`
+
+**Scope note:** This plan covers only the foundation (PRs 1–3). The per-harness migration PRs (4–6: pydantic-ai, langgraph, openai) and parser PRs (7–8: claude-code, codex) each require close reading of that harness's existing converter and get their own plans once this foundation lands. PR 9 (cleanup) follows them. See "Subsequent plans" at the end.
+
+---
+
+## File Structure
+
+- Create `src/agentex/lib/core/harness/__init__.py` — package marker + public re-exports.
+- Create `src/agentex/lib/core/harness/types.py` — `OpenSpan`, `CloseSpan`, `SpanSignal`, `TurnUsage`, `TurnResult`, `HarnessTurn` protocol.
+- Create `src/agentex/lib/core/harness/span_derivation.py` — `SpanDeriver` (pure reducer).
+- Create `src/agentex/lib/core/harness/auto_send.py` — `auto_send()` (canonical stream → `adk.streaming` + tracing).
+- Create `src/agentex/lib/core/harness/yield_delivery.py` — `yield_events()` (passthrough + tracing).
+- Create `src/agentex/lib/core/harness/emitter.py` — `UnifiedEmitter` facade.
+- Create tests under `tests/lib/core/harness/`.
+
+Each file has one responsibility; `span_derivation.py` has zero dependencies on `adk` so it is unit-testable in isolation.
+
+---
+
+## Task 1: Foundation types
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/__init__.py`
+- Create: `src/agentex/lib/core/harness/types.py`
+- Test: `tests/lib/core/harness/test_types.py`
+
+- [ ] **Step 1: Create the package marker**
+
+Create `src/agentex/lib/core/harness/__init__.py`:
+
+```python
+"""Shared, harness-independent machinery for the unified harness surface.
+
+The Agentex StreamTaskMessage* stream is the single source of truth; this
+package derives spans from it and delivers it (yield or auto-send), so every
+harness tap gets streaming + tracing + turn usage uniformly.
+"""
+```
+
+- [ ] **Step 2: Write the failing test for the types**
+
+Create `tests/lib/core/harness/__init__.py` (empty) and `tests/lib/core/harness/test_types.py`:
+
+```python
+from agentex.lib.core.harness.types import (
+    OpenSpan,
+    CloseSpan,
+    TurnUsage,
+    TurnResult,
+)
+
+
+def test_open_close_span_construct():
+    o = OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"})
+    c = CloseSpan(key="call_1", output="files", is_complete=True)
+    assert o.key == c.key == "call_1"
+    assert o.kind == "tool"
+    assert c.is_complete is True
+
+
+def test_turn_usage_defaults_are_none():
+    u = TurnUsage(model="claude-opus-4-6")
+    assert u.model == "claude-opus-4-6"
+    assert u.input_tokens is None
+    assert u.num_tool_calls == 0
+
+
+def test_turn_result_wraps_usage():
+    r = TurnResult(final_text="hi", usage=TurnUsage(model="m"))
+    assert r.final_text == "hi"
+    assert r.usage.model == "m"
+```
+
+- [ ] **Step 3: Run test to verify it fails**
+
+Run: `pytest tests/lib/core/harness/test_types.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.types`
+
+- [ ] **Step 4: Implement the types**
+
+Create `src/agentex/lib/core/harness/types.py`:
+
+```python
+"""Types for the unified harness surface."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, AsyncIterator, Literal, Protocol, Union, runtime_checkable
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+from agentex.lib.utils.model_utils import BaseModel
+
+# The canonical stream element. Taps yield these; delivery adapters consume them.
+StreamTaskMessage = Union[
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageFull,
+    StreamTaskMessageDone,
+]
+
+SpanKind = Literal["tool", "reasoning", "subagent"]
+
+
+@dataclass
+class OpenSpan:
+    """Signal to open a child span. `key` pairs an open with its close."""
+
+    key: str
+    kind: SpanKind
+    name: str
+    input: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class CloseSpan:
+    """Signal to close the span previously opened with the same `key`."""
+
+    key: str
+    output: Any = None
+    is_complete: bool = True  # False when closed by flush() without a result
+
+
+SpanSignal = Union[OpenSpan, CloseSpan]
+
+
+class TurnUsage(BaseModel):
+    """Harness-independent turn usage/cost, attached to the turn span.
+
+    Token field names align with agentex.lib.core.observability.llm_metrics.
+    """
+
+    model: str | None = None
+    input_tokens: int | None = None
+    output_tokens: int | None = None
+    cached_input_tokens: int | None = None
+    reasoning_tokens: int | None = None
+    total_tokens: int | None = None
+    cost_usd: float | None = None
+    duration_ms: int | None = None
+    num_llm_calls: int = 0
+    num_tool_calls: int = 0
+    num_reasoning_blocks: int = 0
+
+
+class TurnResult(BaseModel):
+    """Returned to the caller after a turn is delivered."""
+
+    final_text: str = ""
+    usage: TurnUsage = TurnUsage()
+
+
+@runtime_checkable
+class HarnessTurn(Protocol):
+    """A single harness turn: a canonical stream plus its normalized usage.
+
+    Python async generators cannot cleanly return a value to their consumer, so
+    a tap exposes usage via `usage()` (valid only after `events` is exhausted)
+    rather than via StopAsyncIteration.
+    """
+
+    @property
+    def events(self) -> AsyncIterator[StreamTaskMessage]: ...
+
+    def usage(self) -> TurnUsage: ...
+```
+
+- [ ] **Step 5: Run test to verify it passes**
+
+Run: `pytest tests/lib/core/harness/test_types.py -v`
+Expected: PASS (3 passed)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/__init__.py src/agentex/lib/core/harness/types.py tests/lib/core/harness/__init__.py tests/lib/core/harness/test_types.py
+git commit -m "feat(harness): foundation types for unified harness surface"
+```
+
+---
+
+## Task 2: SpanDeriver (pure span derivation) — PR 1
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/span_derivation.py`
+- Test: `tests/lib/core/harness/test_span_derivation.py`
+
+Derivation rules (from the spec): tool span opens on the `Done` of an index whose `Start`
+was a `ToolRequestContent`, and closes on the matching `ToolResponseContent` by
+`tool_call_id`; reasoning span opens on `Start(ReasoningContent)` and closes on that index's
+`Done`. Parallel tools are keyed by `tool_call_id`. `flush()` closes anything still open.
+
+- [ ] **Step 1: Write failing tests (text, single tool, reasoning, parallel, streamed args, unclosed)**
+
+Create `tests/lib/core/harness/test_span_derivation.py`:
+
+```python
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageFull,
+    StreamTaskMessageDone,
+)
+from agentex.types.text_content import TextContent
+from agentex.types.reasoning_content import ReasoningContent
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+from agentex.types.tool_request_delta import ToolRequestDelta
+
+
+def _signals(deriver, events):
+    out = []
+    for e in events:
+        out.extend(deriver.observe(e))
+    out.extend(deriver.flush())
+    return out
+
+
+def _tool_req(idx, tcid, name, args):
+    return StreamTaskMessageStart(
+        type="start", index=idx,
+        content=ToolRequestContent(type="tool_request", author="agent",
+                                   tool_call_id=tcid, name=name, arguments=args),
+    )
+
+
+def test_text_only_yields_no_spans():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=None),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    assert _signals(d, events) == []
+
+
+def test_single_tool_opens_on_done_closes_on_response():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "call_1", "Bash", {"cmd": "ls"}),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="call_1", name="Bash", content="files")),
+    ]
+    sigs = _signals(d, events)
+    assert sigs == [
+        OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"}),
+        CloseSpan(key="call_1", output="files", is_complete=True),
+    ]
+
+
+def test_reasoning_opens_on_start_closes_on_done():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ReasoningContent(type="reasoning", author="agent", summary=[], content=[])),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="reasoning:0", kind="reasoning", name="reasoning", input={})
+    assert sigs[1] == CloseSpan(key="reasoning:0", output=None, is_complete=True)
+
+
+def test_parallel_tools_pair_by_tool_call_id():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "a", "T1", {}),
+        _tool_req(1, "b", "T2", {}),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageDone(type="done", index=1),
+        StreamTaskMessageFull(type="full", index=2,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="b", name="T2", content="rb")),
+        StreamTaskMessageFull(type="full", index=3,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="a", name="T1", content="ra")),
+    ]
+    sigs = _signals(d, events)
+    opens = [s for s in sigs if isinstance(s, OpenSpan)]
+    closes = [s for s in sigs if isinstance(s, CloseSpan)]
+    assert {o.key for o in opens} == {"a", "b"}
+    assert [c.key for c in closes] == ["b", "a"]
+    assert all(c.is_complete for c in closes)
+
+
+def test_streamed_args_accumulate_into_open_input():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
+                                   arguments_delta='{"cmd":')),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
+                                   arguments_delta='"ls"}')),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="c", kind="tool", name="Bash", input={"cmd": "ls"})
+
+
+def test_unclosed_tool_closed_incomplete_on_flush():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "x", "Bash", {}),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="x", kind="tool", name="Bash", input={})
+    assert sigs[1] == CloseSpan(key="x", output=None, is_complete=False)
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+Run: `pytest tests/lib/core/harness/test_span_derivation.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.span_derivation`
+
+- [ ] **Step 3: Implement `SpanDeriver`**
+
+Create `src/agentex/lib/core/harness/span_derivation.py`:
+
+```python
+"""Pure reducer: canonical StreamTaskMessage* stream -> span open/close signals.
+
+Has no dependency on adk; unit-testable in isolation. Delivery adapters feed it
+every event and act on the returned signals.
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from typing import Any
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+
+from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal, StreamTaskMessage
+
+
+@dataclass
+class _ToolReqMeta:
+    tool_call_id: str
+    name: str
+    arguments: dict[str, Any]
+    args_buf: str = ""  # accumulated streamed argument fragments
+
+
+class SpanDeriver:
+    """Stateful reducer over the canonical stream.
+
+    Tool span: open on Done of a ToolRequestContent index; close on matching
+    ToolResponseContent by tool_call_id. Reasoning span: open on
+    Start(ReasoningContent); close on that index's Done.
+    """
+
+    def __init__(self) -> None:
+        # index -> tool request metadata (present only for tool_request indices)
+        self._tool_by_index: dict[int, _ToolReqMeta] = {}
+        # index -> reasoning open (present only for reasoning indices)
+        self._reasoning_index_open: set[int] = set()
+        # tool_call_ids with a currently-open span
+        self._open_tool_ids: set[str] = set()
+
+    def observe(self, event: StreamTaskMessage) -> list[SpanSignal]:
+        if isinstance(event, StreamTaskMessageStart):
+            return self._on_start(event)
+        if isinstance(event, StreamTaskMessageDelta):
+            return self._on_delta(event)
+        if isinstance(event, StreamTaskMessageFull):
+            return self._on_full(event)
+        if isinstance(event, StreamTaskMessageDone):
+            return self._on_done(event)
+        return []
+
+    def flush(self) -> list[SpanSignal]:
+        """Close anything still open at end of stream, marked incomplete."""
+        signals: list[SpanSignal] = []
+        for tcid in list(self._open_tool_ids):
+            signals.append(CloseSpan(key=tcid, output=None, is_complete=False))
+        self._open_tool_ids.clear()
+        for idx in sorted(self._reasoning_index_open):
+            signals.append(CloseSpan(key=f"reasoning:{idx}", output=None, is_complete=False))
+        self._reasoning_index_open.clear()
+        return signals
+
+    def _on_start(self, event: StreamTaskMessageStart) -> list[SpanSignal]:
+        content = event.content
+        idx = event.index if event.index is not None else -1
+        ctype = getattr(content, "type", None)
+        if ctype == "tool_request":
+            self._tool_by_index[idx] = _ToolReqMeta(
+                tool_call_id=content.tool_call_id,
+                name=content.name,
+                arguments=dict(content.arguments or {}),
+            )
+            return []
+        if ctype == "reasoning":
+            self._reasoning_index_open.add(idx)
+            return [OpenSpan(key=f"reasoning:{idx}", kind="reasoning", name="reasoning", input={})]
+        return []
+
+    def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
+        idx = event.index if event.index is not None else -1
+        delta = event.delta
+        if delta is not None and getattr(delta, "type", None) == "tool_request":
+            meta = self._tool_by_index.get(idx)
+            if meta is not None and delta.arguments_delta:
+                meta.args_buf += delta.arguments_delta
+        return []
+
+    def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
+        content = event.content
+        if getattr(content, "type", None) == "tool_response":
+            tcid = content.tool_call_id
+            if tcid in self._open_tool_ids:
+                self._open_tool_ids.discard(tcid)
+                return [CloseSpan(key=tcid, output=content.content, is_complete=True)]
+        return []
+
+    def _on_done(self, event: StreamTaskMessageDone) -> list[SpanSignal]:
+        idx = event.index if event.index is not None else -1
+        meta = self._tool_by_index.pop(idx, None)
+        if meta is not None:
+            args = meta.arguments
+            if meta.args_buf:
+                try:
+                    args = json.loads(meta.args_buf)
+                except json.JSONDecodeError:
+                    args = {"_raw": meta.args_buf}
+            self._open_tool_ids.add(meta.tool_call_id)
+            return [OpenSpan(key=meta.tool_call_id, kind="tool", name=meta.name, input=args)]
+        if idx in self._reasoning_index_open:
+            self._reasoning_index_open.discard(idx)
+            return [CloseSpan(key=f"reasoning:{idx}", output=None, is_complete=True)]
+        return []
+```
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+Run: `pytest tests/lib/core/harness/test_span_derivation.py -v`
+Expected: PASS (6 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/span_derivation.py tests/lib/core/harness/test_span_derivation.py
+git commit -m "feat(harness): pure SpanDeriver reducing the canonical stream to span signals"
+```
+
+---
+
+## Task 3: Tracer adapter (span signals -> adk.tracing)
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/tracer.py`
+- Test: `tests/lib/core/harness/test_tracer.py`
+
+A thin adapter that turns `SpanSignal`s into `adk.tracing` spans, nesting them under a parent
+span. Kept separate from `SpanDeriver` so derivation stays pure and tracing stays overridable.
+Tracing failures are best-effort and never raise (spec error-handling contract).
+
+- [ ] **Step 1: Write the failing test (uses a fake adk.tracing)**
+
+Create `tests/lib/core/harness/test_tracer.py`:
+
+```python
+import pytest
+
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+
+
+class _FakeSpan:
+    def __init__(self, name):
+        self.name = name
+
+
+class _FakeTracing:
+    def __init__(self):
+        self.started = []
+        self.ended = []
+
+    async def start_span(self, *, trace_id, name, input=None, parent_id=None, data=None, task_id=None):
+        self.started.append((name, parent_id, input))
+        return _FakeSpan(name)
+
+    async def end_span(self, *, trace_id, span, output=None, data=None):
+        self.ended.append((span.name, output))
+
+
+@pytest.mark.asyncio
+async def test_open_then_close_starts_and_ends_span():
+    fake = _FakeTracing()
+    tracer = SpanTracer(trace_id="t1", parent_span_id="p1", tracing=fake)
+    await tracer.handle(OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"}))
+    await tracer.handle(CloseSpan(key="call_1", output="files", is_complete=True))
+    assert fake.started == [("Bash", "p1", {"cmd": "ls"})]
+    assert fake.ended == [("Bash", "files")]
+
+
+@pytest.mark.asyncio
+async def test_no_trace_id_is_noop():
+    fake = _FakeTracing()
+    tracer = SpanTracer(trace_id="", parent_span_id=None, tracing=fake)
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="X"))
+    await tracer.handle(CloseSpan(key="k"))
+    assert fake.started == [] and fake.ended == []
+
+
+@pytest.mark.asyncio
+async def test_tracing_failure_is_swallowed():
+    class _Boom(_FakeTracing):
+        async def start_span(self, **kw):
+            raise RuntimeError("backend down")
+
+    tracer = SpanTracer(trace_id="t1", parent_span_id="p1", tracing=_Boom())
+    # Must not raise.
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="X"))
+    await tracer.handle(CloseSpan(key="k"))
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+Run: `pytest tests/lib/core/harness/test_tracer.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.tracer`
+
+- [ ] **Step 3: Implement `SpanTracer`**
+
+Create `src/agentex/lib/core/harness/tracer.py`:
+
+```python
+"""Adapter from SpanSignals to adk.tracing spans (best-effort, overridable)."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from agentex.lib.utils.logging import make_logger
+from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal
+
+logger = make_logger(__name__)
+
+
+class SpanTracer:
+    """Opens/closes adk.tracing child spans in response to span signals.
+
+    `tracing` defaults to the real `adk.tracing` module; inject a fake in tests
+    or a custom tracer to override. No-op when `trace_id` is falsy. Never raises.
+    """
+
+    def __init__(self, trace_id: str | None, parent_span_id: str | None, tracing: Any = None, task_id: str | None = None):
+        self.trace_id = trace_id
+        self.parent_span_id = parent_span_id
+        self.task_id = task_id
+        if tracing is None:
+            from agentex.lib import adk
+
+            tracing = adk.tracing
+        self._tracing = tracing
+        self._open: dict[str, Any] = {}  # span key -> span object
+
+    async def handle(self, signal: SpanSignal) -> None:
+        if not self.trace_id:
+            return
+        try:
+            if isinstance(signal, OpenSpan):
+                span = await self._tracing.start_span(
+                    trace_id=self.trace_id,
+                    name=signal.name,
+                    input=signal.input,
+                    parent_id=self.parent_span_id,
+                    task_id=self.task_id,
+                )
+                if span is not None:
+                    self._open[signal.key] = span
+            elif isinstance(signal, CloseSpan):
+                span = self._open.pop(signal.key, None)
+                if span is not None:
+                    await self._tracing.end_span(
+                        trace_id=self.trace_id,
+                        span=span,
+                        output=signal.output,
+                    )
+        except Exception as exc:  # best-effort: tracing never breaks delivery
+            logger.warning("[harness.tracer] span signal failed: %s", exc)
+```
+
+Note for the implementer: confirm `adk.tracing.end_span` accepts `output=` (seen in
+`src/agentex/lib/adk/_modules/tracing.py`). If the kwarg differs, adjust the call and the
+fake in the test together.
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+Run: `pytest tests/lib/core/harness/test_tracer.py -v`
+Expected: PASS (3 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/tracer.py tests/lib/core/harness/test_tracer.py
+git commit -m "feat(harness): SpanTracer adapter from span signals to adk.tracing"
+```
+
+---
+
+## Task 4: `yield_events` delivery adapter — PR 3 (part 1)
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/yield_delivery.py`
+- Test: `tests/lib/core/harness/test_yield_delivery.py`
+
+`yield_events` passes the canonical stream through unchanged (for sync HTTP ACP agents) while
+feeding the `SpanDeriver` + `SpanTracer` as a side effect. Streaming fidelity is untouched.
+
+- [ ] **Step 1: Write the failing test**
+
+Create `tests/lib/core/harness/test_yield_delivery.py`:
+
+```python
+import pytest
+
+from agentex.lib.core.harness.yield_delivery import yield_events
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+)
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+
+
+class _RecordTracing:
+    def __init__(self):
+        self.started, self.ended = [], []
+
+    async def start_span(self, *, trace_id, name, input=None, parent_id=None, data=None, task_id=None):
+        self.started.append(name)
+        return object()
+
+    async def end_span(self, *, trace_id, span, output=None, data=None):
+        self.ended.append(output)
+
+
+async def _gen(events):
+    for e in events:
+        yield e
+
+
+@pytest.mark.asyncio
+async def test_yield_passes_events_through_and_traces():
+    fake = _RecordTracing()
+    tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake)
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="c", name="Bash", content="ok")),
+    ]
+    out = [e async for e in yield_events(_gen(events), tracer=tracer)]
+    assert out == events            # passthrough unchanged
+    assert fake.started == ["Bash"] # span derived + opened
+    assert fake.ended == ["ok"]     # span closed with response
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `pytest tests/lib/core/harness/test_yield_delivery.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.yield_delivery`
+
+- [ ] **Step 3: Implement `yield_events`**
+
+Create `src/agentex/lib/core/harness/yield_delivery.py`:
+
+```python
+"""Yield delivery: pass the canonical stream through, tracing as a side effect."""
+
+from __future__ import annotations
+
+from typing import AsyncIterator
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import StreamTaskMessage
+
+
+async def yield_events(
+    events: AsyncIterator[StreamTaskMessage],
+    tracer: SpanTracer | None = None,
+) -> AsyncIterator[StreamTaskMessage]:
+    """Forward each event to the caller; derive + trace spans as a side effect.
+
+    For sync HTTP ACP agents that yield events back over the response. When
+    `tracer` is None, this is a pure passthrough.
+    """
+    deriver = SpanDeriver() if tracer is not None else None
+    try:
+        async for event in events:
+            if deriver is not None and tracer is not None:
+                for signal in deriver.observe(event):
+                    await tracer.handle(signal)
+            yield event
+    finally:
+        if deriver is not None and tracer is not None:
+            for signal in deriver.flush():
+                await tracer.handle(signal)
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `pytest tests/lib/core/harness/test_yield_delivery.py -v`
+Expected: PASS (1 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/yield_delivery.py tests/lib/core/harness/test_yield_delivery.py
+git commit -m "feat(harness): yield_events delivery adapter (passthrough + tracing)"
+```
+
+---
+
+## Task 5: `auto_send` delivery adapter — PR 2
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/auto_send.py`
+- Test: `tests/lib/core/harness/test_auto_send.py`
+
+`auto_send` consumes the canonical stream and drives `adk.streaming` context managers: it opens
+a text context for `TextContent`, a reasoning context for `ReasoningContent`, switches cleanly
+between them, and posts tool request/response as full messages. It feeds the same
+`SpanDeriver`/`SpanTracer` and returns `TurnResult`. This generalizes the golden agent's
+`AgentexStreamAdapter` (`teams/sgp/agents/golden_agent/project/harness/adapter.py`) to consume
+`StreamTaskMessage*` instead of `HarnessEvent`.
+
+Reference while implementing: `src/agentex/lib/adk/_modules/_langgraph_async.py`
+(`stream_langgraph_events`) shows the exact `adk.streaming` open/stream/close pattern to reuse;
+`adapter.py` lines 87–130 show the text↔reasoning↔tool switching logic to mirror.
+
+- [ ] **Step 1: Write the failing test (fake streaming records context lifecycle)**
+
+Create `tests/lib/core/harness/test_auto_send.py`:
+
+```python
+import pytest
+
+from agentex.lib.core.harness.auto_send import auto_send
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+)
+from agentex.types.text_content import TextContent
+from agentex.types.text_delta import TextDelta
+
+
+class _FakeCtx:
+    def __init__(self, sink):
+        self.sink = sink
+
+    async def __aenter__(self):
+        self.sink.append(("open",))
+        return self
+
+    async def __aexit__(self, *a):
+        self.sink.append(("close",))
+        return False
+
+    async def stream_update(self, update):
+        self.sink.append(("update", update))
+        return update
+
+
+class _FakeStreaming:
+    def __init__(self):
+        self.sink = []
+
+    def streaming_task_message_context(self, task_id, initial_content, streaming_mode="coalesced", created_at=None):
+        self.sink.append(("ctx", getattr(initial_content, "type", None)))
+        return _FakeCtx(self.sink)
+
+
+async def _gen(events):
+    for e in events:
+        yield e
+
+
+@pytest.mark.asyncio
+async def test_auto_send_streams_text_and_returns_final_text():
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0, delta=TextDelta(type="text", text_delta="Hel")),
+        StreamTaskMessageDelta(type="delta", index=0, delta=TextDelta(type="text", text_delta="lo")),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+    assert result.final_text == "Hello"
+    kinds = [s[0] for s in streaming.sink]
+    assert kinds[0] == "ctx" and "open" in kinds and "close" in kinds
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `pytest tests/lib/core/harness/test_auto_send.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.auto_send`
+
+- [ ] **Step 3: Implement `auto_send`**
+
+Create `src/agentex/lib/core/harness/auto_send.py`. The implementer mirrors the text↔reasoning
+switching from `adapter.py` and the `adk.streaming` usage from `_langgraph_async.py`:
+
+```python
+"""Auto-send delivery: canonical stream -> adk.streaming side effects + tracing."""
+
+from __future__ import annotations
+
+from typing import Any, AsyncIterator
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+from agentex.types.text_content import TextContent
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import StreamTaskMessage, TurnResult, TurnUsage
+
+
+async def auto_send(
+    events: AsyncIterator[StreamTaskMessage],
+    task_id: str,
+    tracer: SpanTracer | None = None,
+    streaming: Any = None,
+    usage: TurnUsage | None = None,
+) -> TurnResult:
+    """Push the canonical stream to the task stream via adk.streaming.
+
+    Opens a streaming context per text/reasoning message, streams deltas, and
+    closes on Done; posts tool request/response as full messages; derives and
+    traces spans from the same stream. Returns the accumulated final text +
+    usage. For async + temporal agents (call from inside an activity).
+    """
+    if streaming is None:
+        from agentex.lib import adk
+
+        streaming = adk.streaming
+
+    deriver = SpanDeriver() if tracer is not None else None
+    final_text_parts: list[str] = []
+    current_ctx: Any = None
+    current_kind: str | None = None  # "text" | "reasoning"
+
+    async def _close_current() -> None:
+        nonlocal current_ctx, current_kind
+        if current_ctx is not None:
+            await current_ctx.__aexit__(None, None, None)
+            current_ctx = None
+            current_kind = None
+
+    try:
+        async for event in events:
+            if deriver is not None and tracer is not None:
+                for signal in deriver.observe(event):
+                    await tracer.handle(signal)
+
+            if isinstance(event, StreamTaskMessageStart):
+                ctype = getattr(event.content, "type", None)
+                if ctype in ("text", "reasoning"):
+                    await _close_current()
+                    current_ctx = streaming.streaming_task_message_context(
+                        task_id=task_id, initial_content=event.content,
+                    )
+                    await current_ctx.__aenter__()
+                    current_kind = ctype
+            elif isinstance(event, StreamTaskMessageDelta):
+                if current_ctx is not None and event.delta is not None:
+                    await current_ctx.stream_update(event)
+                    if getattr(event.delta, "type", None) == "text" and event.delta.text_delta:
+                        final_text_parts.append(event.delta.text_delta)
+            elif isinstance(event, StreamTaskMessageDone):
+                await _close_current()
+            elif isinstance(event, StreamTaskMessageFull):
+                # Tool request/response (and any non-streamed full message): post as a
+                # standalone full message, not tied to the current text/reasoning ctx.
+                await _close_current()
+                ctx = streaming.streaming_task_message_context(
+                    task_id=task_id, initial_content=event.content,
+                )
+                await ctx.__aenter__()
+                await ctx.__aexit__(None, None, None)
+    finally:
+        await _close_current()
+        if deriver is not None and tracer is not None:
+            for signal in deriver.flush():
+                await tracer.handle(signal)
+
+    return TurnResult(final_text="".join(final_text_parts), usage=usage or TurnUsage())
+```
+
+Note for the implementer: validate the exact `streaming_task_message_context` usage against
+`_langgraph_async.py` (whether to call `stream_update` with the whole `StreamTaskMessageDelta`
+or the inner delta). Adjust the call and the fake together; the test asserts behavior, not the
+internal kwarg shape.
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `pytest tests/lib/core/harness/test_auto_send.py -v`
+Expected: PASS (1 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/auto_send.py tests/lib/core/harness/test_auto_send.py
+git commit -m "feat(harness): auto_send delivery adapter (canonical stream -> adk.streaming + tracing)"
+```
+
+---
+
+## Task 6: `UnifiedEmitter` facade — PR 3 (part 2)
+
+**Files:**
+- Create: `src/agentex/lib/core/harness/emitter.py`
+- Modify: `src/agentex/lib/core/harness/__init__.py` (re-export public surface)
+- Test: `tests/lib/core/harness/test_emitter.py`
+
+`UnifiedEmitter` is the single thing an agent author touches. It owns the trace context, builds
+the `SpanTracer` (default-on when a trace context exists, overridable), and exposes both
+delivery modes over a `HarnessTurn`. It attaches the turn's `TurnUsage` to delivery.
+
+- [ ] **Step 1: Write the failing test**
+
+Create `tests/lib/core/harness/test_emitter.py`:
+
+```python
+import pytest
+
+from agentex.lib.core.harness.emitter import UnifiedEmitter
+from agentex.lib.core.harness.types import TurnUsage
+from agentex.types.task_message_update import StreamTaskMessageStart, StreamTaskMessageDone
+from agentex.types.text_content import TextContent
+
+
+class _Turn:
+    def __init__(self, events_list, usage):
+        self._events_list = events_list
+        self._usage = usage
+
+    @property
+    async def events(self):
+        for e in self._events_list:
+            yield e
+
+    def usage(self):
+        return self._usage
+
+
+@pytest.mark.asyncio
+async def test_emitter_yield_mode_passes_through():
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="hi")),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    turn = _Turn(events, TurnUsage(model="m"))
+    emitter = UnifiedEmitter(task_id="t", trace_id=None, parent_span_id=None)
+    out = [e async for e in emitter.yield_turn(turn)]
+    assert out == events
+
+
+@pytest.mark.asyncio
+async def test_emitter_tracing_default_on_when_trace_id_present():
+    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p")
+    assert emitter.tracer is not None
+
+
+@pytest.mark.asyncio
+async def test_emitter_tracing_overridable_off():
+    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p", tracer=False)
+    assert emitter.tracer is None
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `pytest tests/lib/core/harness/test_emitter.py -v`
+Expected: FAIL with `ModuleNotFoundError: agentex.lib.core.harness.emitter`
+
+- [ ] **Step 3: Implement `UnifiedEmitter`**
+
+Create `src/agentex/lib/core/harness/emitter.py`:
+
+```python
+"""UnifiedEmitter: the single facade agent authors use for either delivery mode."""
+
+from __future__ import annotations
+
+from typing import Any, AsyncIterator
+
+from agentex.lib.core.harness.auto_send import auto_send
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import HarnessTurn, StreamTaskMessage, TurnResult
+from agentex.lib.core.harness.yield_delivery import yield_events
+
+
+class UnifiedEmitter:
+    """Ties trace context + chosen delivery together.
+
+    Tracing is default-on whenever `trace_id` is truthy; pass `tracer=False` to
+    disable, or a custom `SpanTracer` to override.
+    """
+
+    def __init__(
+        self,
+        task_id: str,
+        trace_id: str | None,
+        parent_span_id: str | None,
+        tracer: SpanTracer | bool | None = None,
+    ):
+        self.task_id = task_id
+        self.trace_id = trace_id
+        self.parent_span_id = parent_span_id
+        if tracer is False:
+            self.tracer: SpanTracer | None = None
+        elif isinstance(tracer, SpanTracer):
+            self.tracer = tracer
+        elif trace_id:
+            self.tracer = SpanTracer(trace_id=trace_id, parent_span_id=parent_span_id, task_id=task_id)
+        else:
+            self.tracer = None
+
+    async def yield_turn(self, turn: HarnessTurn) -> AsyncIterator[StreamTaskMessage]:
+        """Sync HTTP ACP delivery: forward events, trace as side effect."""
+        async for event in yield_events(turn.events, tracer=self.tracer):
+            yield event
+
+    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
+        """Async/temporal delivery: push to the task stream, return TurnResult."""
+        return await auto_send(
+            turn.events,
+            task_id=self.task_id,
+            tracer=self.tracer,
+            usage=turn.usage(),
+        )
+```
+
+- [ ] **Step 4: Re-export the public surface**
+
+Append to `src/agentex/lib/core/harness/__init__.py`:
+
+```python
+from agentex.lib.core.harness.emitter import UnifiedEmitter
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import (
+    CloseSpan,
+    HarnessTurn,
+    OpenSpan,
+    SpanSignal,
+    StreamTaskMessage,
+    TurnResult,
+    TurnUsage,
+)
+
+__all__ = [
+    "UnifiedEmitter",
+    "SpanTracer",
+    "OpenSpan",
+    "CloseSpan",
+    "SpanSignal",
+    "StreamTaskMessage",
+    "TurnUsage",
+    "TurnResult",
+    "HarnessTurn",
+]
+```
+
+- [ ] **Step 5: Run tests to verify they pass**
+
+Run: `pytest tests/lib/core/harness/ -v`
+Expected: PASS (all harness tests green)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/agentex/lib/core/harness/emitter.py src/agentex/lib/core/harness/__init__.py tests/lib/core/harness/test_emitter.py
+git commit -m "feat(harness): UnifiedEmitter facade tying delivery + tracing + usage"
+```
+
+---
+
+## Task 7: Conformance test scaffold + empty CI integration job — PR 3 (part 3)
+
+**Files:**
+- Create: `tests/lib/core/harness/conformance/__init__.py`
+- Create: `tests/lib/core/harness/conformance/runner.py`
+- Create: `tests/lib/core/harness/conformance/test_conformance.py`
+- Create: `.github/workflows/harness-integration.yml`
+
+The conformance runner is the shared parametrized engine each harness tap will register fixtures
+with (in later plans). It asserts yield-vs-auto-send equivalence on the span signals derived
+from a fixture's canonical-event sequence.
+
+- [ ] **Step 1: Write the conformance runner + a self-test fixture**
+
+Create `tests/lib/core/harness/conformance/__init__.py` (empty), then
+`tests/lib/core/harness/conformance/runner.py`:
+
+```python
+"""Shared conformance engine: every harness tap registers fixtures here.
+
+A fixture is (name, list[StreamTaskMessage]). The runner asserts that span
+derivation over the events is identical regardless of delivery channel, which is
+the cross-channel guarantee from the spec.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Callable
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.types import SpanSignal, StreamTaskMessage
+
+
+@dataclass
+class Fixture:
+    name: str
+    events: list[StreamTaskMessage]
+
+
+_REGISTRY: list[Fixture] = []
+
+
+def register(fixture: Fixture) -> None:
+    _REGISTRY.append(fixture)
+
+
+def all_fixtures() -> list[Fixture]:
+    return list(_REGISTRY)
+
+
+def derive_all(events: list[StreamTaskMessage]) -> list[SpanSignal]:
+    d = SpanDeriver()
+    out: list[SpanSignal] = []
+    for e in events:
+        out.extend(d.observe(e))
+    out.extend(d.flush())
+    return out
+```
+
+- [ ] **Step 2: Write the conformance test (self-test on a built-in fixture)**
+
+Create `tests/lib/core/harness/conformance/test_conformance.py`:
+
+```python
+import pytest
+
+from tests.lib.core.harness.conformance.runner import Fixture, derive_all, register, all_fixtures
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart, StreamTaskMessageDone, StreamTaskMessageFull,
+)
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+
+register(Fixture(
+    name="builtin-single-tool",
+    events=[
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="c", name="Bash", content="ok")),
+    ],
+))
+
+
+@pytest.mark.parametrize("fixture", all_fixtures(), ids=lambda f: f.name)
+def test_span_derivation_is_deterministic(fixture):
+    # Deriving twice over the same events yields identical signals (the property
+    # that makes yield vs auto-send equivalent, since both observe the same stream).
+    assert derive_all(fixture.events) == derive_all(fixture.events)
+```
+
+- [ ] **Step 3: Run the conformance test**
+
+Run: `pytest tests/lib/core/harness/conformance/ -v`
+Expected: PASS (1 passed)
+
+- [ ] **Step 4: Add the empty CI integration job**
+
+Create `.github/workflows/harness-integration.yml` (mirrors the structure of the existing
+`agentex-tutorials-test.yml`; the matrix is populated in later plans):
+
+```yaml
+name: Harness Integration
+
+on:
+  pull_request:
+    paths:
+      - "src/agentex/lib/core/harness/**"
+      - "src/agentex/lib/adk/_modules/**"
+      - ".github/workflows/harness-integration.yml"
+
+jobs:
+  conformance:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - name: Install
+        run: uv sync
+      - name: Conformance suite
+        run: uv run pytest tests/lib/core/harness/ -v
+
+  # Live integration matrix (harness x {sync, async, temporal}) is added per-harness
+  # in the migration plans. Placeholder job keeps the workflow valid until then.
+  live-matrix:
+    runs-on: ubuntu-latest
+    if: false  # enabled once the first harness's test agents land
+    steps:
+      - run: echo "populated by migration PRs"
+```
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add tests/lib/core/harness/conformance .github/workflows/harness-integration.yml
+git commit -m "test(harness): conformance scaffold + CI integration job skeleton"
+```
+
+---
+
+## Task 8: Run the full suite + type check
+
+- [ ] **Step 1: Run the whole harness test tree**
+
+Run: `pytest tests/lib/core/harness/ -v`
+Expected: PASS (all tasks' tests green)
+
+- [ ] **Step 2: Type check the new package**
+
+Run: `uv run mypy src/agentex/lib/core/harness/` (or the repo's configured type checker)
+Expected: no errors. Fix any signature mismatches inline.
+
+- [ ] **Step 3: Final commit if the type check required fixes**
+
+```bash
+git add -A && git commit -m "chore(harness): type-check fixes for foundation package"
+```
+
+---
+
+## Subsequent plans (to be written after this lands)
+
+Each gets its own plan via the writing-plans skill, expanded with that harness's exact
+converter code:
+
+- **PR 4 — Migrate pydantic-ai:** wrap `convert_pydantic_ai_to_agentex_events` as a
+  `HarnessTurn` (add `usage()` normalizing `result.usage()`), reimplement `_pydantic_ai_async`
+  on `auto_send`, retire `_pydantic_ai_tracing` in favor of `SpanTracer`, keep the public
+  `convert_*` signature. Add 3 test agents (sync/async/temporal) + register conformance
+  fixtures + enable the live-matrix row.
+- **PR 5 — Migrate langgraph:** same shape; reimplement `stream_langgraph_events` on
+  `auto_send`; normalize `usage_metadata` into `TurnUsage`.
+- **PR 6 — Migrate openai-agents:** same shape; reimplement `run_agent_streamed_auto_send` on
+  `auto_send`; normalize `response.usage`.
+- **PR 7 — claude-code parser tap:** `convert_claude_code_to_agentex_events` (port the golden
+  agent's `_StreamJsonProcessor` to yield `StreamTaskMessage*`) + recorded stream-json
+  fixtures + feasible test agent(s).
+- **PR 8 — codex parser tap:** same shape for `_CodexEventProcessor`.
+- **PR 9 — Cleanup:** delete now-dead internal duplication, deprecate `_*_tracing` shims, docs.
+
+The `is_error` tool-error work is deferred and tracked in Linear as AGX1-371.

From de564833e8364492bc7310880d9557643fa98883 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:00:40 -0400
Subject: [PATCH 05/26] feat(harness): foundation types for unified harness
 surface

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/__init__.py |  6 ++
 src/agentex/lib/core/harness/types.py    | 91 ++++++++++++++++++++++++
 tests/lib/core/harness/__init__.py       |  0
 tests/lib/core/harness/test_types.py     | 27 +++++++
 4 files changed, 124 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/__init__.py
 create mode 100644 src/agentex/lib/core/harness/types.py
 create mode 100644 tests/lib/core/harness/__init__.py
 create mode 100644 tests/lib/core/harness/test_types.py

diff --git a/src/agentex/lib/core/harness/__init__.py b/src/agentex/lib/core/harness/__init__.py
new file mode 100644
index 000000000..15d116148
--- /dev/null
+++ b/src/agentex/lib/core/harness/__init__.py
@@ -0,0 +1,6 @@
+"""Shared, harness-independent machinery for the unified harness surface.
+
+The Agentex StreamTaskMessage* stream is the single source of truth; this
+package derives spans from it and delivers it (yield or auto-send), so every
+harness tap gets streaming + tracing + turn usage uniformly.
+"""
diff --git a/src/agentex/lib/core/harness/types.py b/src/agentex/lib/core/harness/types.py
new file mode 100644
index 000000000..f31b2c67f
--- /dev/null
+++ b/src/agentex/lib/core/harness/types.py
@@ -0,0 +1,91 @@
+"""Types for the unified harness surface."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, AsyncIterator, Literal, Protocol, Union, runtime_checkable
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+from pydantic import BaseModel, ConfigDict
+
+# The canonical stream element. Taps yield these; delivery adapters consume them.
+StreamTaskMessage = Union[
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageFull,
+    StreamTaskMessageDone,
+]
+
+SpanKind = Literal["tool", "reasoning", "subagent"]
+
+
+@dataclass
+class OpenSpan:
+    """Signal to open a child span. `key` pairs an open with its close."""
+
+    key: str
+    kind: SpanKind
+    name: str
+    input: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class CloseSpan:
+    """Signal to close the span previously opened with the same `key`."""
+
+    key: str
+    output: Any = None
+    is_complete: bool = True  # False when closed by flush() without a result
+
+
+SpanSignal = Union[OpenSpan, CloseSpan]
+
+
+class TurnUsage(BaseModel):
+    """Harness-independent turn usage/cost, attached to the turn span.
+
+    Token field names align with agentex.lib.core.observability.llm_metrics.
+    """
+
+    model_config = ConfigDict(from_attributes=True, populate_by_name=True)
+
+    model: str | None = None
+    input_tokens: int | None = None
+    output_tokens: int | None = None
+    cached_input_tokens: int | None = None
+    reasoning_tokens: int | None = None
+    total_tokens: int | None = None
+    cost_usd: float | None = None
+    duration_ms: int | None = None
+    num_llm_calls: int = 0
+    num_tool_calls: int = 0
+    num_reasoning_blocks: int = 0
+
+
+class TurnResult(BaseModel):
+    """Returned to the caller after a turn is delivered."""
+
+    model_config = ConfigDict(from_attributes=True, populate_by_name=True)
+
+    final_text: str = ""
+    usage: TurnUsage = TurnUsage()
+
+
+@runtime_checkable
+class HarnessTurn(Protocol):
+    """A single harness turn: a canonical stream plus its normalized usage.
+
+    Python async generators cannot cleanly return a value to their consumer, so
+    a tap exposes usage via `usage()` (valid only after `events` is exhausted)
+    rather than via StopAsyncIteration.
+    """
+
+    @property
+    def events(self) -> AsyncIterator[StreamTaskMessage]: ...
+
+    def usage(self) -> TurnUsage: ...
diff --git a/tests/lib/core/harness/__init__.py b/tests/lib/core/harness/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/lib/core/harness/test_types.py b/tests/lib/core/harness/test_types.py
new file mode 100644
index 000000000..b025d803b
--- /dev/null
+++ b/tests/lib/core/harness/test_types.py
@@ -0,0 +1,27 @@
+from agentex.lib.core.harness.types import (
+    OpenSpan,
+    CloseSpan,
+    TurnUsage,
+    TurnResult,
+)
+
+
+def test_open_close_span_construct():
+    o = OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"})
+    c = CloseSpan(key="call_1", output="files", is_complete=True)
+    assert o.key == c.key == "call_1"
+    assert o.kind == "tool"
+    assert c.is_complete is True
+
+
+def test_turn_usage_defaults_are_none():
+    u = TurnUsage(model="claude-opus-4-6")
+    assert u.model == "claude-opus-4-6"
+    assert u.input_tokens is None
+    assert u.num_tool_calls == 0
+
+
+def test_turn_result_wraps_usage():
+    r = TurnResult(final_text="hi", usage=TurnUsage(model="m"))
+    assert r.final_text == "hi"
+    assert r.usage.model == "m"

From a13725cca825b8a0f21b7a1b25c25bf223447b27 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:05:15 -0400
Subject: [PATCH 06/26] test(harness): cover CloseSpan defaults and HarnessTurn
 runtime check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 tests/lib/core/harness/test_types.py | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tests/lib/core/harness/test_types.py b/tests/lib/core/harness/test_types.py
index b025d803b..91857993a 100644
--- a/tests/lib/core/harness/test_types.py
+++ b/tests/lib/core/harness/test_types.py
@@ -1,6 +1,10 @@
+from typing import AsyncIterator
+
 from agentex.lib.core.harness.types import (
     OpenSpan,
     CloseSpan,
+    HarnessTurn,
+    StreamTaskMessage,
     TurnUsage,
     TurnResult,
 )
@@ -25,3 +29,25 @@ def test_turn_result_wraps_usage():
     r = TurnResult(final_text="hi", usage=TurnUsage(model="m"))
     assert r.final_text == "hi"
     assert r.usage.model == "m"
+
+
+def test_close_span_defaults():
+    c = CloseSpan(key="x")
+    assert c.output is None
+    assert c.is_complete is True
+
+
+def test_harness_turn_runtime_check():
+    class _Turn:
+        @property
+        def events(self) -> AsyncIterator[StreamTaskMessage]:
+            async def _gen() -> AsyncIterator[StreamTaskMessage]:
+                if False:
+                    yield  # pragma: no cover
+
+            return _gen()
+
+        def usage(self) -> TurnUsage:
+            return TurnUsage(model="m")
+
+    assert isinstance(_Turn(), HarnessTurn) is True

From 13868ee1968632f47eb2679b234c908239896f84 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:07:47 -0400
Subject: [PATCH 07/26] feat(harness): pure SpanDeriver reducing the canonical
 stream to span signals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../lib/core/harness/span_derivation.py       | 115 +++++++++++++++++
 .../lib/core/harness/test_span_derivation.py  | 120 ++++++++++++++++++
 2 files changed, 235 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/span_derivation.py
 create mode 100644 tests/lib/core/harness/test_span_derivation.py

diff --git a/src/agentex/lib/core/harness/span_derivation.py b/src/agentex/lib/core/harness/span_derivation.py
new file mode 100644
index 000000000..deb5d6d68
--- /dev/null
+++ b/src/agentex/lib/core/harness/span_derivation.py
@@ -0,0 +1,115 @@
+"""Pure reducer: canonical StreamTaskMessage* stream -> span open/close signals.
+
+Has no dependency on adk; unit-testable in isolation. Delivery adapters feed it
+every event and act on the returned signals.
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from typing import Any
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+
+from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal, StreamTaskMessage
+
+
+@dataclass
+class _ToolReqMeta:
+    tool_call_id: str
+    name: str
+    arguments: dict[str, Any]
+    args_buf: str = ""  # accumulated streamed argument fragments
+
+
+class SpanDeriver:
+    """Stateful reducer over the canonical stream.
+
+    Tool span: open on Done of a ToolRequestContent index; close on matching
+    ToolResponseContent by tool_call_id. Reasoning span: open on
+    Start(ReasoningContent); close on that index's Done.
+    """
+
+    def __init__(self) -> None:
+        self._tool_by_index: dict[int, _ToolReqMeta] = {}
+        self._reasoning_index_open: set[int] = set()
+        self._open_tool_ids: set[str] = set()
+
+    def observe(self, event: StreamTaskMessage) -> list[SpanSignal]:
+        if isinstance(event, StreamTaskMessageStart):
+            return self._on_start(event)
+        if isinstance(event, StreamTaskMessageDelta):
+            return self._on_delta(event)
+        if isinstance(event, StreamTaskMessageFull):
+            return self._on_full(event)
+        if isinstance(event, StreamTaskMessageDone):
+            return self._on_done(event)
+        return []
+
+    def flush(self) -> list[SpanSignal]:
+        """Close anything still open at end of stream, marked incomplete."""
+        signals: list[SpanSignal] = []
+        for tcid in list(self._open_tool_ids):
+            signals.append(CloseSpan(key=tcid, output=None, is_complete=False))
+        self._open_tool_ids.clear()
+        for idx in sorted(self._reasoning_index_open):
+            signals.append(CloseSpan(key=f"reasoning:{idx}", output=None, is_complete=False))
+        self._reasoning_index_open.clear()
+        return signals
+
+    def _on_start(self, event: StreamTaskMessageStart) -> list[SpanSignal]:
+        content = event.content
+        idx = event.index if event.index is not None else -1
+        ctype = getattr(content, "type", None)
+        if ctype == "tool_request":
+            self._tool_by_index[idx] = _ToolReqMeta(
+                tool_call_id=content.tool_call_id,
+                name=content.name,
+                arguments=dict(content.arguments or {}),
+            )
+            return []
+        if ctype == "reasoning":
+            self._reasoning_index_open.add(idx)
+            return [OpenSpan(key=f"reasoning:{idx}", kind="reasoning", name="reasoning", input={})]
+        return []
+
+    def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
+        idx = event.index if event.index is not None else -1
+        delta = event.delta
+        if delta is not None and getattr(delta, "type", None) == "tool_request":
+            meta = self._tool_by_index.get(idx)
+            if meta is not None and delta.arguments_delta:
+                meta.args_buf += delta.arguments_delta
+        return []
+
+    def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
+        content = event.content
+        if getattr(content, "type", None) == "tool_response":
+            tcid = content.tool_call_id
+            if tcid in self._open_tool_ids:
+                self._open_tool_ids.discard(tcid)
+                return [CloseSpan(key=tcid, output=content.content, is_complete=True)]
+        return []
+
+    def _on_done(self, event: StreamTaskMessageDone) -> list[SpanSignal]:
+        idx = event.index if event.index is not None else -1
+        meta = self._tool_by_index.pop(idx, None)
+        if meta is not None:
+            args = meta.arguments
+            if meta.args_buf:
+                try:
+                    args = json.loads(meta.args_buf)
+                except json.JSONDecodeError:
+                    args = {"_raw": meta.args_buf}
+            self._open_tool_ids.add(meta.tool_call_id)
+            return [OpenSpan(key=meta.tool_call_id, kind="tool", name=meta.name, input=args)]
+        if idx in self._reasoning_index_open:
+            self._reasoning_index_open.discard(idx)
+            return [CloseSpan(key=f"reasoning:{idx}", output=None, is_complete=True)]
+        return []
diff --git a/tests/lib/core/harness/test_span_derivation.py b/tests/lib/core/harness/test_span_derivation.py
new file mode 100644
index 000000000..0b1a4bcbe
--- /dev/null
+++ b/tests/lib/core/harness/test_span_derivation.py
@@ -0,0 +1,120 @@
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageFull,
+    StreamTaskMessageDone,
+)
+from agentex.types.text_content import TextContent
+from agentex.types.reasoning_content import ReasoningContent
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+from agentex.types.tool_request_delta import ToolRequestDelta
+
+
+def _signals(deriver, events):
+    out = []
+    for e in events:
+        out.extend(deriver.observe(e))
+    out.extend(deriver.flush())
+    return out
+
+
+def _tool_req(idx, tcid, name, args):
+    return StreamTaskMessageStart(
+        type="start", index=idx,
+        content=ToolRequestContent(type="tool_request", author="agent",
+                                   tool_call_id=tcid, name=name, arguments=args),
+    )
+
+
+def test_text_only_yields_no_spans():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=None),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    assert _signals(d, events) == []
+
+
+def test_single_tool_opens_on_done_closes_on_response():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "call_1", "Bash", {"cmd": "ls"}),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="call_1", name="Bash", content="files")),
+    ]
+    sigs = _signals(d, events)
+    assert sigs == [
+        OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"}),
+        CloseSpan(key="call_1", output="files", is_complete=True),
+    ]
+
+
+def test_reasoning_opens_on_start_closes_on_done():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ReasoningContent(type="reasoning", author="agent", summary=[], content=[])),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="reasoning:0", kind="reasoning", name="reasoning", input={})
+    assert sigs[1] == CloseSpan(key="reasoning:0", output=None, is_complete=True)
+
+
+def test_parallel_tools_pair_by_tool_call_id():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "a", "T1", {}),
+        _tool_req(1, "b", "T2", {}),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageDone(type="done", index=1),
+        StreamTaskMessageFull(type="full", index=2,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="b", name="T2", content="rb")),
+        StreamTaskMessageFull(type="full", index=3,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="a", name="T1", content="ra")),
+    ]
+    sigs = _signals(d, events)
+    opens = [s for s in sigs if isinstance(s, OpenSpan)]
+    closes = [s for s in sigs if isinstance(s, CloseSpan)]
+    assert {o.key for o in opens} == {"a", "b"}
+    assert [c.key for c in closes] == ["b", "a"]
+    assert all(c.is_complete for c in closes)
+
+
+def test_streamed_args_accumulate_into_open_input():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
+                                   arguments_delta='{"cmd":')),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
+                                   arguments_delta='"ls"}')),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="c", kind="tool", name="Bash", input={"cmd": "ls"})
+
+
+def test_unclosed_tool_closed_incomplete_on_flush():
+    d = SpanDeriver()
+    events = [
+        _tool_req(0, "x", "Bash", {}),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="x", kind="tool", name="Bash", input={})
+    assert sigs[1] == CloseSpan(key="x", output=None, is_complete=False)

From 0ecc03f4b4f8e686c73c0d7f2acc2c29c91a6f4c Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:13:13 -0400
Subject: [PATCH 08/26] refactor(harness): deterministic flush order +
 defensive index/orphan handling in SpanDeriver

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../lib/core/harness/span_derivation.py       | 33 ++++++++++++++-----
 .../lib/core/harness/test_span_derivation.py  | 21 ++++++++++++
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/src/agentex/lib/core/harness/span_derivation.py b/src/agentex/lib/core/harness/span_derivation.py
index deb5d6d68..15e1f593f 100644
--- a/src/agentex/lib/core/harness/span_derivation.py
+++ b/src/agentex/lib/core/harness/span_derivation.py
@@ -8,7 +8,6 @@
 
 import json
 from dataclasses import dataclass
-from typing import Any
 
 from agentex.types.task_message_update import (
     StreamTaskMessageDelta,
@@ -24,7 +23,7 @@
 class _ToolReqMeta:
     tool_call_id: str
     name: str
-    arguments: dict[str, Any]
+    arguments: dict[str, object]
     args_buf: str = ""  # accumulated streamed argument fragments
 
 
@@ -34,12 +33,24 @@ class SpanDeriver:
     Tool span: open on Done of a ToolRequestContent index; close on matching
     ToolResponseContent by tool_call_id. Reasoning span: open on
     Start(ReasoningContent); close on that index's Done.
+
+    Deliberate contracts:
+      - A `Full(ToolResponseContent)` whose tool_call_id was never opened is
+        ignored (no CloseSpan emitted).
+      - A `Done` for an index that was never a tool_request/reasoning Start is
+        ignored (no signal emitted).
+      - Events with `index is None` are skipped entirely; without a stable index
+        they cannot be reliably paired, and aliasing them to a sentinel would
+        let unrelated None-indexed events cross-match.
+      - `flush()` closes anything still open as incomplete; unclosed tool spans
+        are emitted in the order they were opened.
     """
 
     def __init__(self) -> None:
         self._tool_by_index: dict[int, _ToolReqMeta] = {}
         self._reasoning_index_open: set[int] = set()
-        self._open_tool_ids: set[str] = set()
+        # insertion-ordered set of open tool_call_ids (dict keys preserve order)
+        self._open_tool_ids: dict[str, None] = {}
 
     def observe(self, event: StreamTaskMessage) -> list[SpanSignal]:
         if isinstance(event, StreamTaskMessageStart):
@@ -64,8 +75,10 @@ def flush(self) -> list[SpanSignal]:
         return signals
 
     def _on_start(self, event: StreamTaskMessageStart) -> list[SpanSignal]:
+        if event.index is None:
+            return []
+        idx = event.index
         content = event.content
-        idx = event.index if event.index is not None else -1
         ctype = getattr(content, "type", None)
         if ctype == "tool_request":
             self._tool_by_index[idx] = _ToolReqMeta(
@@ -80,7 +93,9 @@ def _on_start(self, event: StreamTaskMessageStart) -> list[SpanSignal]:
         return []
 
     def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
-        idx = event.index if event.index is not None else -1
+        if event.index is None:
+            return []
+        idx = event.index
         delta = event.delta
         if delta is not None and getattr(delta, "type", None) == "tool_request":
             meta = self._tool_by_index.get(idx)
@@ -93,12 +108,14 @@ def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
         if getattr(content, "type", None) == "tool_response":
             tcid = content.tool_call_id
             if tcid in self._open_tool_ids:
-                self._open_tool_ids.discard(tcid)
+                self._open_tool_ids.pop(tcid, None)
                 return [CloseSpan(key=tcid, output=content.content, is_complete=True)]
         return []
 
     def _on_done(self, event: StreamTaskMessageDone) -> list[SpanSignal]:
-        idx = event.index if event.index is not None else -1
+        if event.index is None:
+            return []
+        idx = event.index
         meta = self._tool_by_index.pop(idx, None)
         if meta is not None:
             args = meta.arguments
@@ -107,7 +124,7 @@ def _on_done(self, event: StreamTaskMessageDone) -> list[SpanSignal]:
                     args = json.loads(meta.args_buf)
                 except json.JSONDecodeError:
                     args = {"_raw": meta.args_buf}
-            self._open_tool_ids.add(meta.tool_call_id)
+            self._open_tool_ids[meta.tool_call_id] = None
             return [OpenSpan(key=meta.tool_call_id, kind="tool", name=meta.name, input=args)]
         if idx in self._reasoning_index_open:
             self._reasoning_index_open.discard(idx)
diff --git a/tests/lib/core/harness/test_span_derivation.py b/tests/lib/core/harness/test_span_derivation.py
index 0b1a4bcbe..0630131d0 100644
--- a/tests/lib/core/harness/test_span_derivation.py
+++ b/tests/lib/core/harness/test_span_derivation.py
@@ -118,3 +118,24 @@ def test_unclosed_tool_closed_incomplete_on_flush():
     sigs = _signals(d, events)
     assert sigs[0] == OpenSpan(key="x", kind="tool", name="Bash", input={})
     assert sigs[1] == CloseSpan(key="x", output=None, is_complete=False)
+
+
+def test_none_index_is_skipped():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(type="start", index=None,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="n", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=None),
+    ]
+    assert _signals(d, events) == []
+
+
+def test_orphan_tool_response_ignored():
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageFull(type="full", index=0,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="z", name="Bash", content="r")),
+    ]
+    assert _signals(d, events) == []

From 8d708f41fccf0b5b6a046999ac67c2106666b5aa Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:16:02 -0400
Subject: [PATCH 09/26] feat(harness): SpanTracer adapter from span signals to
 adk.tracing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/tracer.py | 68 ++++++++++++++++++++++++++
 tests/lib/core/harness/test_tracer.py  | 54 ++++++++++++++++++++
 2 files changed, 122 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/tracer.py
 create mode 100644 tests/lib/core/harness/test_tracer.py

diff --git a/src/agentex/lib/core/harness/tracer.py b/src/agentex/lib/core/harness/tracer.py
new file mode 100644
index 000000000..55fab4029
--- /dev/null
+++ b/src/agentex/lib/core/harness/tracer.py
@@ -0,0 +1,68 @@
+"""Adapter from SpanSignals to adk.tracing spans (best-effort, overridable)."""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal
+
+logger = logging.getLogger(__name__)
+
+
+class SpanTracer:
+    """Opens/closes adk.tracing child spans in response to span signals.
+
+    `tracing` defaults to the real `adk.tracing` module; inject a fake in tests
+    or a custom tracer to override. No-op when `trace_id` is falsy. Never raises.
+
+    The real TracingModule.end_span does NOT accept an output kwarg — output is
+    recorded by mutating span.output before calling end_span, matching the pattern
+    used throughout the codebase (see _langgraph_tracing.py on_tool_end etc.).
+    """
+
+    def __init__(
+        self,
+        trace_id: str | None,
+        parent_span_id: str | None,
+        tracing: Any = None,
+        task_id: str | None = None,
+    ):
+        self.trace_id = trace_id
+        self.parent_span_id = parent_span_id
+        self.task_id = task_id
+        if tracing is None:
+            from agentex.lib import adk
+
+            tracing = adk.tracing
+        self._tracing = tracing
+        self._open: dict[str, Any] = {}  # span key -> span object
+
+    async def handle(self, signal: SpanSignal) -> None:
+        if not self.trace_id:
+            return
+        try:
+            if isinstance(signal, OpenSpan):
+                span = await self._tracing.start_span(
+                    trace_id=self.trace_id,
+                    name=signal.name,
+                    input=signal.input,
+                    parent_id=self.parent_span_id,
+                    task_id=self.task_id,
+                )
+                if span is not None:
+                    self._open[signal.key] = span
+            elif isinstance(signal, CloseSpan):
+                span = self._open.pop(signal.key, None)
+                if span is not None:
+                    # Output is recorded by mutating span.output before end_span.
+                    # The real TracingModule.end_span signature is:
+                    #   end_span(trace_id, span, start_to_close_timeout, heartbeat_timeout, retry_policy)
+                    # It does not accept an output= kwarg.
+                    span.output = signal.output
+                    await self._tracing.end_span(
+                        trace_id=self.trace_id,
+                        span=span,
+                    )
+        except Exception as exc:  # best-effort: tracing never breaks delivery
+            logger.warning("[harness.tracer] span signal failed: %s", exc)
diff --git a/tests/lib/core/harness/test_tracer.py b/tests/lib/core/harness/test_tracer.py
new file mode 100644
index 000000000..105995bc8
--- /dev/null
+++ b/tests/lib/core/harness/test_tracer.py
@@ -0,0 +1,54 @@
+import pytest
+
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+
+
+class _FakeSpan:
+    def __init__(self, name):
+        self.name = name
+        self.output = None
+
+
+class _FakeTracing:
+    def __init__(self):
+        self.started = []
+        self.ended = []
+
+    async def start_span(self, *, trace_id, name, input=None, parent_id=None, data=None, task_id=None):
+        self.started.append((name, parent_id, input))
+        return _FakeSpan(name)
+
+    async def end_span(self, *, trace_id, span):
+        self.ended.append((span.name, span.output))
+
+
+@pytest.mark.asyncio
+async def test_open_then_close_starts_and_ends_span():
+    fake = _FakeTracing()
+    tracer = SpanTracer(trace_id="t1", parent_span_id="p1", tracing=fake)
+    await tracer.handle(OpenSpan(key="call_1", kind="tool", name="Bash", input={"cmd": "ls"}))
+    await tracer.handle(CloseSpan(key="call_1", output="files", is_complete=True))
+    assert fake.started == [("Bash", "p1", {"cmd": "ls"})]
+    assert fake.ended == [("Bash", "files")]
+
+
+@pytest.mark.asyncio
+async def test_no_trace_id_is_noop():
+    fake = _FakeTracing()
+    tracer = SpanTracer(trace_id="", parent_span_id=None, tracing=fake)
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="X"))
+    await tracer.handle(CloseSpan(key="k"))
+    assert fake.started == [] and fake.ended == []
+
+
+@pytest.mark.asyncio
+async def test_tracing_failure_is_swallowed():
+    class _Boom(_FakeTracing):
+        async def start_span(self, **kw):
+            raise RuntimeError("backend down")
+
+    tracer = SpanTracer(trace_id="t1", parent_span_id="p1", tracing=_Boom())
+    # Must not raise.
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="X"))
+    await tracer.handle(CloseSpan(key="k"))

From 7955d55b1d88b39557e129f0539c4cf42565ddc4 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:21:07 -0400
Subject: [PATCH 10/26] refactor(harness): guarded make_logger import +
 lifecycle contract tests for SpanTracer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/tracer.py | 18 ++++++++++++++++--
 tests/lib/core/harness/test_tracer.py  | 13 +++++++++++++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/agentex/lib/core/harness/tracer.py b/src/agentex/lib/core/harness/tracer.py
index 55fab4029..3f4ff40c2 100644
--- a/src/agentex/lib/core/harness/tracer.py
+++ b/src/agentex/lib/core/harness/tracer.py
@@ -2,12 +2,18 @@
 
 from __future__ import annotations
 
-import logging
 from typing import Any
 
 from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal
 
-logger = logging.getLogger(__name__)
+try:
+    from agentex.lib.utils.logging import make_logger
+
+    logger = make_logger(__name__)
+except Exception:  # ddtrace may be absent in some envs; fall back to stdlib
+    import logging
+
+    logger = logging.getLogger(__name__)
 
 
 class SpanTracer:
@@ -19,6 +25,14 @@ class SpanTracer:
     The real TracingModule.end_span does NOT accept an output kwarg — output is
     recorded by mutating span.output before calling end_span, matching the pattern
     used throughout the codebase (see _langgraph_tracing.py on_tool_end etc.).
+
+    Span-lifecycle contract: the `_open` dict (span key -> span object) is scoped
+    to a single turn. Pairing is by `key`:
+    - A duplicate OpenSpan for a key already in `_open` silently replaces the
+      earlier span; the earlier span is then orphaned (never closed / leaked).
+    - A CloseSpan for an unknown key is a no-op.
+    - Unpaired opens accumulate in `_open` for the lifetime of the tracer; since
+      a tracer is expected to live for one turn, this is bounded and acceptable.
     """
 
     def __init__(
diff --git a/tests/lib/core/harness/test_tracer.py b/tests/lib/core/harness/test_tracer.py
index 105995bc8..f5fdb16b6 100644
--- a/tests/lib/core/harness/test_tracer.py
+++ b/tests/lib/core/harness/test_tracer.py
@@ -52,3 +52,16 @@ async def start_span(self, **kw):
     # Must not raise.
     await tracer.handle(OpenSpan(key="k", kind="tool", name="X"))
     await tracer.handle(CloseSpan(key="k"))
+    assert tracer._open == {}
+
+
+@pytest.mark.asyncio
+async def test_duplicate_open_replaces_silently():
+    fake = _FakeTracing()
+    tracer = SpanTracer(trace_id="t1", parent_span_id="p1", tracing=fake)
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="A"))
+    await tracer.handle(OpenSpan(key="k", kind="tool", name="B"))
+    await tracer.handle(CloseSpan(key="k"))
+    # Both opens started spans, but only the second ("B") is closed.
+    assert [name for name, _, _ in fake.started] == ["A", "B"]
+    assert fake.ended == [("B", None)]

From 803191ba9ce4d2db5c25723af1c4a8934cb0220e Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:23:02 -0400
Subject: [PATCH 11/26] feat(harness): yield_events delivery adapter
 (passthrough + tracing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../lib/core/harness/yield_delivery.py        | 31 ++++++++++
 tests/lib/core/harness/test_yield_delivery.py | 58 +++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/yield_delivery.py
 create mode 100644 tests/lib/core/harness/test_yield_delivery.py

diff --git a/src/agentex/lib/core/harness/yield_delivery.py b/src/agentex/lib/core/harness/yield_delivery.py
new file mode 100644
index 000000000..0d04647da
--- /dev/null
+++ b/src/agentex/lib/core/harness/yield_delivery.py
@@ -0,0 +1,31 @@
+"""Yield delivery: pass the canonical stream through, tracing as a side effect."""
+
+from __future__ import annotations
+
+from typing import AsyncIterator
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import StreamTaskMessage
+
+
+async def yield_events(
+    events: AsyncIterator[StreamTaskMessage],
+    tracer: SpanTracer | None = None,
+) -> AsyncIterator[StreamTaskMessage]:
+    """Forward each event to the caller; derive + trace spans as a side effect.
+
+    For sync HTTP ACP agents that yield events back over the response. When
+    `tracer` is None, this is a pure passthrough.
+    """
+    deriver = SpanDeriver() if tracer is not None else None
+    try:
+        async for event in events:
+            if deriver is not None and tracer is not None:
+                for signal in deriver.observe(event):
+                    await tracer.handle(signal)
+            yield event
+    finally:
+        if deriver is not None and tracer is not None:
+            for signal in deriver.flush():
+                await tracer.handle(signal)
diff --git a/tests/lib/core/harness/test_yield_delivery.py b/tests/lib/core/harness/test_yield_delivery.py
new file mode 100644
index 000000000..46f0aeac1
--- /dev/null
+++ b/tests/lib/core/harness/test_yield_delivery.py
@@ -0,0 +1,58 @@
+import types as _types
+
+import pytest
+
+from agentex.lib.core.harness.yield_delivery import yield_events
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+)
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+
+
+class _RecordTracing:
+    def __init__(self):
+        self.started, self.ended = [], []
+
+    async def start_span(self, *, trace_id, name, input=None, parent_id=None, data=None, task_id=None):
+        self.started.append(name)
+        return _types.SimpleNamespace()  # supports arbitrary attribute assignment (span.output = ...)
+
+    async def end_span(self, *, trace_id, span):
+        self.ended.append(getattr(span, "output", None))
+
+
+async def _gen(events):
+    for e in events:
+        yield e
+
+
+@pytest.mark.asyncio
+async def test_yield_passes_events_through_and_traces():
+    fake = _RecordTracing()
+    tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake)
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="c", name="Bash", content="ok")),
+    ]
+    out = [e async for e in yield_events(_gen(events), tracer=tracer)]
+    assert out == events            # passthrough unchanged
+    assert fake.started == ["Bash"] # span derived + opened
+    assert fake.ended == ["ok"]     # span closed with response
+
+
+@pytest.mark.asyncio
+async def test_yield_without_tracer_is_pure_passthrough():
+    events = [
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    out = [e async for e in yield_events(_gen(events), tracer=None)]
+    assert out == events

From dab044f862e7a734c7c248f747ebe4470cbd9a82 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:27:54 -0400
Subject: [PATCH 12/26] refactor(harness): simplify yield_events guard + cover
 finally-flush on early close

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../lib/core/harness/yield_delivery.py        |  8 ++++----
 tests/lib/core/harness/test_yield_delivery.py | 19 +++++++++++++++++++
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/src/agentex/lib/core/harness/yield_delivery.py b/src/agentex/lib/core/harness/yield_delivery.py
index 0d04647da..ca923c6a3 100644
--- a/src/agentex/lib/core/harness/yield_delivery.py
+++ b/src/agentex/lib/core/harness/yield_delivery.py
@@ -2,7 +2,7 @@
 
 from __future__ import annotations
 
-from typing import AsyncIterator
+from typing import AsyncGenerator, AsyncIterator
 
 from agentex.lib.core.harness.span_derivation import SpanDeriver
 from agentex.lib.core.harness.tracer import SpanTracer
@@ -12,7 +12,7 @@
 async def yield_events(
     events: AsyncIterator[StreamTaskMessage],
     tracer: SpanTracer | None = None,
-) -> AsyncIterator[StreamTaskMessage]:
+) -> AsyncGenerator[StreamTaskMessage, None]:
     """Forward each event to the caller; derive + trace spans as a side effect.
 
     For sync HTTP ACP agents that yield events back over the response. When
@@ -21,11 +21,11 @@ async def yield_events(
     deriver = SpanDeriver() if tracer is not None else None
     try:
         async for event in events:
-            if deriver is not None and tracer is not None:
+            if deriver is not None:  # tracer is non-None whenever deriver is set
                 for signal in deriver.observe(event):
                     await tracer.handle(signal)
             yield event
     finally:
-        if deriver is not None and tracer is not None:
+        if deriver is not None:  # tracer is non-None whenever deriver is set
             for signal in deriver.flush():
                 await tracer.handle(signal)
diff --git a/tests/lib/core/harness/test_yield_delivery.py b/tests/lib/core/harness/test_yield_delivery.py
index 46f0aeac1..986b4a92d 100644
--- a/tests/lib/core/harness/test_yield_delivery.py
+++ b/tests/lib/core/harness/test_yield_delivery.py
@@ -56,3 +56,22 @@ async def test_yield_without_tracer_is_pure_passthrough():
     ]
     out = [e async for e in yield_events(_gen(events), tracer=None)]
     assert out == events
+
+
+@pytest.mark.asyncio
+async def test_flush_runs_on_early_close():
+    fake = _RecordTracing()
+    tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake)
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=0),
+        # response intentionally never arrives
+    ]
+    gen = yield_events(_gen(events), tracer=tracer)
+    first = await gen.__anext__()   # Start
+    second = await gen.__anext__()  # Done -> tool span opens here
+    await gen.aclose()              # triggers the finally -> flush()
+    assert fake.started == ["Bash"]
+    assert fake.ended == [None]     # flush closed the unpaired span (incomplete, no output)

From 3cc0326bef21caaea138234302fb411d494f59a6 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:32:36 -0400
Subject: [PATCH 13/26] feat(harness): auto_send delivery adapter (canonical
 stream -> adk.streaming + tracing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/auto_send.py | 118 ++++++++++
 tests/lib/core/harness/test_auto_send.py  | 248 ++++++++++++++++++++++
 2 files changed, 366 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/auto_send.py
 create mode 100644 tests/lib/core/harness/test_auto_send.py

diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
new file mode 100644
index 000000000..506a9ad82
--- /dev/null
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -0,0 +1,118 @@
+"""Auto-send delivery: canonical stream -> adk.streaming side effects + tracing."""
+
+from __future__ import annotations
+
+from typing import Any, AsyncIterator
+
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
+)
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import StreamTaskMessage, TurnResult, TurnUsage
+
+
+async def auto_send(
+    events: AsyncIterator[StreamTaskMessage],
+    task_id: str,
+    tracer: SpanTracer | None = None,
+    streaming: Any = None,
+    usage: TurnUsage | None = None,
+) -> TurnResult:
+    """Push the canonical stream to the task stream via adk.streaming.
+
+    Opens a streaming context per text/reasoning message, streams deltas via
+    ctx.stream_update, and closes via ctx.close() on Done. Posts tool
+    request/response full messages by opening a context with the content and
+    closing it immediately (no deltas). Derives and traces spans from the same
+    stream. Returns the accumulated final text + usage.
+
+    Mirrors the open/close/stream_update pattern from
+    src/agentex/lib/adk/_modules/_langgraph_async.py:
+      - context opened via streaming_task_message_context(...).__aenter__()
+      - context closed via ctx.close() (not __aexit__)
+      - deltas pushed as StreamTaskMessageDelta with parent_task_message set
+        from ctx.task_message
+
+    For async + temporal agents (call from inside an activity).
+    """
+    if streaming is None:
+        from agentex.lib import adk
+
+        streaming = adk.streaming
+
+    deriver = SpanDeriver() if tracer is not None else None
+    final_text_parts: list[str] = []
+    current_ctx: Any = None
+    current_kind: str | None = None  # "text" | "reasoning"
+
+    async def _close_current() -> None:
+        nonlocal current_ctx, current_kind
+        if current_ctx is not None:
+            await current_ctx.close()
+            current_ctx = None
+            current_kind = None
+
+    try:
+        async for event in events:
+            if deriver is not None:
+                for signal in deriver.observe(event):
+                    await tracer.handle(signal)  # type: ignore[union-attr]
+
+            if isinstance(event, StreamTaskMessageStart):
+                ctype = getattr(event.content, "type", None)
+                if ctype in ("text", "reasoning"):
+                    await _close_current()
+                    ctx = streaming.streaming_task_message_context(
+                        task_id=task_id,
+                        initial_content=event.content,
+                    )
+                    current_ctx = await ctx.__aenter__()
+                    current_kind = ctype
+
+            elif isinstance(event, StreamTaskMessageDelta):
+                if current_ctx is not None and event.delta is not None:
+                    # Reconstruct the delta with parent_task_message set from
+                    # the context's task_message (mirrors _langgraph_async.py
+                    # lines 72-78 and 117-127).
+                    delta_with_parent = StreamTaskMessageDelta(
+                        parent_task_message=current_ctx.task_message,
+                        delta=event.delta,
+                        type="delta",
+                        index=event.index,
+                    )
+                    await current_ctx.stream_update(delta_with_parent)
+                    if (
+                        getattr(event.delta, "type", None) == "text"
+                        and event.delta.text_delta
+                    ):
+                        final_text_parts.append(event.delta.text_delta)
+
+            elif isinstance(event, StreamTaskMessageDone):
+                await _close_current()
+
+            elif isinstance(event, StreamTaskMessageFull):
+                # Full messages (tool_request / tool_response): close any open
+                # streaming context first, then post the full message by opening
+                # a context with the content and closing it immediately
+                # (no deltas; StreamingTaskMessageContext.close() persists
+                # initial_content when the accumulator is empty).
+                await _close_current()
+                ctx = streaming.streaming_task_message_context(
+                    task_id=task_id,
+                    initial_content=event.content,
+                )
+                full_ctx = await ctx.__aenter__()
+                await full_ctx.close()
+
+    finally:
+        await _close_current()
+        if deriver is not None:
+            for signal in deriver.flush():
+                await tracer.handle(signal)  # type: ignore[union-attr]
+
+    return TurnResult(final_text="".join(final_text_parts), usage=usage or TurnUsage())
diff --git a/tests/lib/core/harness/test_auto_send.py b/tests/lib/core/harness/test_auto_send.py
new file mode 100644
index 000000000..2a83658e1
--- /dev/null
+++ b/tests/lib/core/harness/test_auto_send.py
@@ -0,0 +1,248 @@
+"""Tests for auto_send delivery adapter.
+
+The fake mirrors the real StreamingTaskMessageContext API exactly:
+- streaming_task_message_context(...) returns a context object (synchronously)
+- open the context via __aenter__ (returns self after creating the task message)
+- stream deltas via ctx.stream_update(StreamTaskMessageDelta(...))
+- close via ctx.close() (NOT __aexit__)
+
+This mirrors _langgraph_async.py lines 62-78 and 100-127.
+"""
+
+import types as _types
+
+import pytest
+
+from agentex.lib.core.harness.auto_send import auto_send
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.types.task_message import TaskMessage
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart,
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+)
+from agentex.types.text_content import TextContent
+from agentex.types.task_message_delta import TextDelta
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+
+
+class _FakeCtx:
+    """Mirrors StreamingTaskMessageContext: __aenter__ opens (returns self with task_message set),
+    close() closes. stream_update records the call.
+
+    task_message is a real TaskMessage instance so that auto_send can use it
+    as parent_task_message in StreamTaskMessageDelta without Pydantic validation errors.
+    """
+
+    def __init__(self, sink, content_type, initial_content):
+        self.sink = sink
+        self.content_type = content_type
+        # Real TaskMessage so StreamTaskMessageDelta(parent_task_message=...) passes validation
+        self.task_message = TaskMessage(
+            id="msg-1", task_id="task1", content=initial_content
+        )
+
+    async def __aenter__(self):
+        self.sink.append(("open", self.content_type))
+        return self
+
+    async def __aexit__(self, *a):
+        # __aexit__ delegates to close in the real impl; keep for safety
+        await self.close()
+        return False
+
+    async def close(self):
+        self.sink.append(("close", self.content_type))
+
+    async def stream_update(self, update):
+        self.sink.append(("update", update))
+        return update
+
+
+class _FakeStreaming:
+    """Mirrors StreamingService: streaming_task_message_context returns a context object."""
+
+    def __init__(self):
+        self.sink = []
+
+    def streaming_task_message_context(
+        self, task_id, initial_content, streaming_mode="coalesced", created_at=None
+    ):
+        ctype = getattr(initial_content, "type", None)
+        self.sink.append(("ctx", ctype))
+        return _FakeCtx(self.sink, ctype, initial_content)
+
+
+async def _gen(events):
+    for e in events:
+        yield e
+
+
+# ---------------------------------------------------------------------------
+# Test 1: text streaming — open, stream deltas, close; return accumulated text
+# ---------------------------------------------------------------------------
+
+@pytest.mark.asyncio
+async def test_auto_send_streams_text_and_returns_final_text():
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(
+            type="start", index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDelta(
+            type="delta", index=0,
+            delta=TextDelta(type="text", text_delta="Hel"),
+        ),
+        StreamTaskMessageDelta(
+            type="delta", index=0,
+            delta=TextDelta(type="text", text_delta="lo"),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+
+    assert result.final_text == "Hello"
+
+    kinds = [s[0] for s in streaming.sink]
+    # A context was created for the text content
+    assert kinds[0] == "ctx"
+    # It was opened and closed
+    assert "open" in kinds
+    assert "close" in kinds
+    # Exactly two updates were streamed (one per delta)
+    updates = [s for s in streaming.sink if s[0] == "update"]
+    assert len(updates) == 2
+
+
+# ---------------------------------------------------------------------------
+# Test 2: tool_request Full + tool_response Full — each posts one full message
+# (open context with the content, no deltas, close immediately)
+# ---------------------------------------------------------------------------
+
+@pytest.mark.asyncio
+async def test_auto_send_posts_full_tool_messages():
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageFull(
+            type="full", index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent",
+                tool_call_id="c1", name="Bash", arguments={"cmd": "ls"},
+            ),
+        ),
+        StreamTaskMessageFull(
+            type="full", index=1,
+            content=ToolResponseContent(
+                type="tool_response", author="agent",
+                tool_call_id="c1", name="Bash", content="file.py",
+            ),
+        ),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+
+    assert result.final_text == ""
+
+    # One context per Full event
+    ctx_events = [s for s in streaming.sink if s[0] == "ctx"]
+    assert len(ctx_events) == 2
+    content_types = [s[1] for s in ctx_events]
+    assert "tool_request" in content_types
+    assert "tool_response" in content_types
+
+    # Each context is opened and closed
+    opens = [s for s in streaming.sink if s[0] == "open"]
+    closes = [s for s in streaming.sink if s[0] == "close"]
+    assert len(opens) == 2
+    assert len(closes) == 2
+
+    # No stream_update calls (full messages have no deltas)
+    updates = [s for s in streaming.sink if s[0] == "update"]
+    assert len(updates) == 0
+
+
+# ---------------------------------------------------------------------------
+# Test 3: tracing — spans are derived and handed to the tracer
+# ---------------------------------------------------------------------------
+
+class _RecordTracing:
+    def __init__(self):
+        self.started, self.ended = [], []
+
+    async def start_span(self, *, trace_id, name, input=None, parent_id=None, data=None, task_id=None):
+        self.started.append(name)
+        return _types.SimpleNamespace()
+
+    async def end_span(self, *, trace_id, span):
+        self.ended.append(getattr(span, "output", None))
+
+
+@pytest.mark.asyncio
+async def test_auto_send_derives_tool_spans_via_tracer():
+    fake_tracing = _RecordTracing()
+    tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake_tracing)
+    streaming = _FakeStreaming()
+
+    events = [
+        StreamTaskMessageStart(
+            type="start", index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent",
+                tool_call_id="c1", name="Bash", arguments={},
+            ),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(
+            type="full", index=1,
+            content=ToolResponseContent(
+                type="tool_response", author="agent",
+                tool_call_id="c1", name="Bash", content="ok",
+            ),
+        ),
+    ]
+
+    result = await auto_send(
+        _gen(events), task_id="task1", tracer=tracer, streaming=streaming
+    )
+
+    assert result.final_text == ""
+    assert fake_tracing.started == ["Bash"]
+    assert fake_tracing.ended == ["ok"]
+
+
+# ---------------------------------------------------------------------------
+# Test 4: text followed by a tool Full — text context is closed before Full
+# ---------------------------------------------------------------------------
+
+@pytest.mark.asyncio
+async def test_auto_send_closes_text_context_before_full_message():
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(
+            type="start", index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDelta(
+            type="delta", index=0,
+            delta=TextDelta(type="text", text_delta="Hi"),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(
+            type="full", index=1,
+            content=ToolRequestContent(
+                type="tool_request", author="agent",
+                tool_call_id="c2", name="read_file", arguments={},
+            ),
+        ),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+    assert result.final_text == "Hi"
+
+    # Verify ordering: text ctx opens, updates, closes; then tool_request ctx opens, closes
+    event_sequence = [(s[0], s[1]) for s in streaming.sink]
+    text_open_idx = next(i for i, s in enumerate(event_sequence) if s == ("open", "text"))
+    text_close_idx = next(i for i, s in enumerate(event_sequence) if s == ("close", "text"))
+    tool_open_idx = next(i for i, s in enumerate(event_sequence) if s == ("open", "tool_request"))
+    assert text_open_idx < text_close_idx < tool_open_idx

From 260064e2ec1c0a6d8fd290c90ab2d7359316c697 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:39:25 -0400
Subject: [PATCH 14/26] refactor(harness): exception-safe full-message post +
 drop dead state + cover error/finally paths in auto_send

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/auto_send.py | 16 ++++-----
 tests/lib/core/harness/test_auto_send.py  | 44 ++++++++++++++++++++---
 2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
index 506a9ad82..e7de01a68 100644
--- a/src/agentex/lib/core/harness/auto_send.py
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -48,14 +48,12 @@ async def auto_send(
     deriver = SpanDeriver() if tracer is not None else None
     final_text_parts: list[str] = []
     current_ctx: Any = None
-    current_kind: str | None = None  # "text" | "reasoning"
 
     async def _close_current() -> None:
-        nonlocal current_ctx, current_kind
+        nonlocal current_ctx
         if current_ctx is not None:
             await current_ctx.close()
             current_ctx = None
-            current_kind = None
 
     try:
         async for event in events:
@@ -72,7 +70,6 @@ async def _close_current() -> None:
                         initial_content=event.content,
                     )
                     current_ctx = await ctx.__aenter__()
-                    current_kind = ctype
 
             elif isinstance(event, StreamTaskMessageDelta):
                 if current_ctx is not None and event.delta is not None:
@@ -100,14 +97,15 @@ async def _close_current() -> None:
                 # streaming context first, then post the full message by opening
                 # a context with the content and closing it immediately
                 # (no deltas; StreamingTaskMessageContext.close() persists
-                # initial_content when the accumulator is empty).
+                # initial_content when the accumulator is empty). Use async with
+                # so the context is closed even if close() raises (__aexit__
+                # delegates to close()).
                 await _close_current()
-                ctx = streaming.streaming_task_message_context(
+                async with streaming.streaming_task_message_context(
                     task_id=task_id,
                     initial_content=event.content,
-                )
-                full_ctx = await ctx.__aenter__()
-                await full_ctx.close()
+                ):
+                    pass
 
     finally:
         await _close_current()
diff --git a/tests/lib/core/harness/test_auto_send.py b/tests/lib/core/harness/test_auto_send.py
index 2a83658e1..9568d7b87 100644
--- a/tests/lib/core/harness/test_auto_send.py
+++ b/tests/lib/core/harness/test_auto_send.py
@@ -126,15 +126,24 @@ async def test_auto_send_streams_text_and_returns_final_text():
 async def test_auto_send_posts_full_tool_messages():
     streaming = _FakeStreaming()
     events = [
+        # A bare tool_request Start (no Done/Full) must NOT open a streaming
+        # context on its own — only Full events post messages.
+        StreamTaskMessageStart(
+            type="start", index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent",
+                tool_call_id="c0", name="Bash", arguments={},
+            ),
+        ),
         StreamTaskMessageFull(
-            type="full", index=0,
+            type="full", index=1,
             content=ToolRequestContent(
                 type="tool_request", author="agent",
                 tool_call_id="c1", name="Bash", arguments={"cmd": "ls"},
             ),
         ),
         StreamTaskMessageFull(
-            type="full", index=1,
+            type="full", index=2,
             content=ToolResponseContent(
                 type="tool_response", author="agent",
                 tool_call_id="c1", name="Bash", content="file.py",
@@ -145,12 +154,12 @@ async def test_auto_send_posts_full_tool_messages():
 
     assert result.final_text == ""
 
-    # One context per Full event
+    # The opened contexts correspond ONLY to the two Full events — the
+    # tool_request Start did not open a context.
     ctx_events = [s for s in streaming.sink if s[0] == "ctx"]
     assert len(ctx_events) == 2
     content_types = [s[1] for s in ctx_events]
-    assert "tool_request" in content_types
-    assert "tool_response" in content_types
+    assert content_types == ["tool_request", "tool_response"]
 
     # Each context is opened and closed
     opens = [s for s in streaming.sink if s[0] == "open"]
@@ -246,3 +255,28 @@ async def test_auto_send_closes_text_context_before_full_message():
     text_close_idx = next(i for i, s in enumerate(event_sequence) if s == ("close", "text"))
     tool_open_idx = next(i for i, s in enumerate(event_sequence) if s == ("open", "tool_request"))
     assert text_open_idx < text_close_idx < tool_open_idx
+
+
+# ---------------------------------------------------------------------------
+# Test 5: midstream error — propagates AND the open context is closed (finally)
+# ---------------------------------------------------------------------------
+
+@pytest.mark.asyncio
+async def test_open_context_closed_on_midstream_error():
+    streaming = _FakeStreaming()
+
+    async def _exploding_gen():
+        yield StreamTaskMessageStart(
+            type="start", index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        )
+        raise RuntimeError("boom")
+
+    with pytest.raises(RuntimeError, match="boom"):
+        await auto_send(
+            _exploding_gen(), task_id="task1", tracer=None, streaming=streaming
+        )
+
+    # The text context that was opened mid-stream was closed by the finally block.
+    assert ("open", "text") in [(s[0], s[1]) for s in streaming.sink]
+    assert ("close", "text") in [(s[0], s[1]) for s in streaming.sink]

From b27367b9f8fa1cd24de528eb6dffd29d633e2203 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:42:52 -0400
Subject: [PATCH 15/26] feat(harness): UnifiedEmitter facade tying delivery +
 tracing + usage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/__init__.py | 24 ++++++++++
 src/agentex/lib/core/harness/emitter.py  | 59 ++++++++++++++++++++++++
 tests/lib/core/harness/test_emitter.py   | 56 ++++++++++++++++++++++
 3 files changed, 139 insertions(+)
 create mode 100644 src/agentex/lib/core/harness/emitter.py
 create mode 100644 tests/lib/core/harness/test_emitter.py

diff --git a/src/agentex/lib/core/harness/__init__.py b/src/agentex/lib/core/harness/__init__.py
index 15d116148..2988db8ff 100644
--- a/src/agentex/lib/core/harness/__init__.py
+++ b/src/agentex/lib/core/harness/__init__.py
@@ -4,3 +4,27 @@
 package derives spans from it and delivers it (yield or auto-send), so every
 harness tap gets streaming + tracing + turn usage uniformly.
 """
+
+from agentex.lib.core.harness.emitter import UnifiedEmitter
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import (
+    CloseSpan,
+    HarnessTurn,
+    OpenSpan,
+    SpanSignal,
+    StreamTaskMessage,
+    TurnResult,
+    TurnUsage,
+)
+
+__all__ = [
+    "UnifiedEmitter",
+    "SpanTracer",
+    "OpenSpan",
+    "CloseSpan",
+    "SpanSignal",
+    "StreamTaskMessage",
+    "TurnUsage",
+    "TurnResult",
+    "HarnessTurn",
+]
diff --git a/src/agentex/lib/core/harness/emitter.py b/src/agentex/lib/core/harness/emitter.py
new file mode 100644
index 000000000..5944abc17
--- /dev/null
+++ b/src/agentex/lib/core/harness/emitter.py
@@ -0,0 +1,59 @@
+"""UnifiedEmitter: the single facade agent authors use for either delivery mode."""
+
+from __future__ import annotations
+
+from typing import AsyncIterator
+
+from agentex.lib.core.harness.auto_send import auto_send
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.types import HarnessTurn, StreamTaskMessage, TurnResult
+from agentex.lib.core.harness.yield_delivery import yield_events
+
+
+class UnifiedEmitter:
+    """Ties trace context + chosen delivery together.
+
+    Tracing is default-on whenever `trace_id` is truthy; pass `tracer=False` to
+    disable, or a custom `SpanTracer` to override.
+    """
+
+    tracer: SpanTracer | None
+
+    def __init__(
+        self,
+        task_id: str,
+        trace_id: str | None,
+        parent_span_id: str | None,
+        tracer: SpanTracer | bool | None = None,
+        tracing: object | None = None,
+    ):
+        self.task_id = task_id
+        self.trace_id = trace_id
+        self.parent_span_id = parent_span_id
+        if tracer is False:
+            self.tracer = None
+        elif isinstance(tracer, SpanTracer):
+            self.tracer = tracer
+        elif trace_id:
+            self.tracer = SpanTracer(
+                trace_id=trace_id,
+                parent_span_id=parent_span_id,
+                task_id=task_id,
+                tracing=tracing,
+            )
+        else:
+            self.tracer = None
+
+    async def yield_turn(self, turn: HarnessTurn) -> AsyncIterator[StreamTaskMessage]:
+        """Sync HTTP ACP delivery: forward events, trace as side effect."""
+        async for event in yield_events(turn.events, tracer=self.tracer):
+            yield event
+
+    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
+        """Async/temporal delivery: push to the task stream, return TurnResult."""
+        return await auto_send(
+            turn.events,
+            task_id=self.task_id,
+            tracer=self.tracer,
+            usage=turn.usage(),
+        )
diff --git a/tests/lib/core/harness/test_emitter.py b/tests/lib/core/harness/test_emitter.py
new file mode 100644
index 000000000..318311e27
--- /dev/null
+++ b/tests/lib/core/harness/test_emitter.py
@@ -0,0 +1,56 @@
+import pytest
+
+from agentex.lib.core.harness.emitter import UnifiedEmitter
+from agentex.lib.core.harness.types import TurnUsage
+from agentex.types.task_message_update import StreamTaskMessageStart, StreamTaskMessageDone
+from agentex.types.text_content import TextContent
+
+
+class _FakeTracing:
+    async def start_span(self, **kw):
+        return None
+
+    async def end_span(self, **kw):
+        pass
+
+
+class _Turn:
+    def __init__(self, events_list, usage):
+        self._events_list = events_list
+        self._usage = usage
+
+    @property
+    async def events(self):
+        for e in self._events_list:
+            yield e
+
+    def usage(self):
+        return self._usage
+
+
+@pytest.mark.asyncio
+async def test_emitter_yield_mode_passes_through():
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="hi")),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    turn = _Turn(events, TurnUsage(model="m"))
+    emitter = UnifiedEmitter(task_id="t", trace_id=None, parent_span_id=None)
+    out = [e async for e in emitter.yield_turn(turn)]
+    assert out == events
+
+
+@pytest.mark.asyncio
+async def test_emitter_tracing_default_on_when_trace_id_present():
+    # Inject a fake tracing backend so the test env doesn't need temporalio.
+    # This exercises the default-on path (tracer=None) when trace_id is truthy.
+    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p",
+                             tracing=_FakeTracing())
+    assert emitter.tracer is not None
+
+
+@pytest.mark.asyncio
+async def test_emitter_tracing_overridable_off():
+    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p", tracer=False)
+    assert emitter.tracer is None

From ed86a460f94cf7a5b2705b1b13fbdf38ed929c22 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:47:44 -0400
Subject: [PATCH 16/26] refactor(harness): inject streaming into UnifiedEmitter
 + cover auto_send_turn + doc tracer modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/emitter.py | 16 +++++--
 tests/lib/core/harness/test_emitter.py  | 62 ++++++++++++++++++++++++-
 2 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/src/agentex/lib/core/harness/emitter.py b/src/agentex/lib/core/harness/emitter.py
index 5944abc17..9573fb8b2 100644
--- a/src/agentex/lib/core/harness/emitter.py
+++ b/src/agentex/lib/core/harness/emitter.py
@@ -2,7 +2,7 @@
 
 from __future__ import annotations
 
-from typing import AsyncIterator
+from typing import AsyncGenerator
 
 from agentex.lib.core.harness.auto_send import auto_send
 from agentex.lib.core.harness.tracer import SpanTracer
@@ -13,8 +13,13 @@
 class UnifiedEmitter:
     """Ties trace context + chosen delivery together.
 
-    Tracing is default-on whenever `trace_id` is truthy; pass `tracer=False` to
-    disable, or a custom `SpanTracer` to override.
+    Tracing modes (the `tracer` arg):
+    - tracer=None (default): auto-construct a SpanTracer if `trace_id` is present.
+    - tracer=False: disable tracing entirely, regardless of `trace_id`.
+    - tracer=<SpanTracer>: use the supplied instance.
+
+    `tracing` and `streaming` are injection escape-hatches for tests/advanced
+    use; leave them None in production so the real adk modules are used.
     """
 
     tracer: SpanTracer | None
@@ -26,10 +31,12 @@ def __init__(
         parent_span_id: str | None,
         tracer: SpanTracer | bool | None = None,
         tracing: object | None = None,
+        streaming: object | None = None,
     ):
         self.task_id = task_id
         self.trace_id = trace_id
         self.parent_span_id = parent_span_id
+        self._streaming = streaming
         if tracer is False:
             self.tracer = None
         elif isinstance(tracer, SpanTracer):
@@ -44,7 +51,7 @@ def __init__(
         else:
             self.tracer = None
 
-    async def yield_turn(self, turn: HarnessTurn) -> AsyncIterator[StreamTaskMessage]:
+    async def yield_turn(self, turn: HarnessTurn) -> AsyncGenerator[StreamTaskMessage, None]:
         """Sync HTTP ACP delivery: forward events, trace as side effect."""
         async for event in yield_events(turn.events, tracer=self.tracer):
             yield event
@@ -55,5 +62,6 @@ async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
             turn.events,
             task_id=self.task_id,
             tracer=self.tracer,
+            streaming=self._streaming,
             usage=turn.usage(),
         )
diff --git a/tests/lib/core/harness/test_emitter.py b/tests/lib/core/harness/test_emitter.py
index 318311e27..963a77dfe 100644
--- a/tests/lib/core/harness/test_emitter.py
+++ b/tests/lib/core/harness/test_emitter.py
@@ -2,7 +2,13 @@
 
 from agentex.lib.core.harness.emitter import UnifiedEmitter
 from agentex.lib.core.harness.types import TurnUsage
-from agentex.types.task_message_update import StreamTaskMessageStart, StreamTaskMessageDone
+from agentex.types.task_message import TaskMessage
+from agentex.types.task_message_delta import TextDelta
+from agentex.types.task_message_update import (
+    StreamTaskMessageDelta,
+    StreamTaskMessageDone,
+    StreamTaskMessageStart,
+)
 from agentex.types.text_content import TextContent
 
 
@@ -14,6 +20,42 @@ async def end_span(self, **kw):
         pass
 
 
+class _FakeCtx:
+    """Minimal StreamingTaskMessageContext fake (see test_auto_send.py)."""
+
+    def __init__(self, sink, content_type, initial_content):
+        self.sink = sink
+        self.content_type = content_type
+        self.task_message = TaskMessage(id="msg-1", task_id="task1", content=initial_content)
+
+    async def __aenter__(self):
+        self.sink.append(("open", self.content_type))
+        return self
+
+    async def __aexit__(self, *a):
+        await self.close()
+        return False
+
+    async def close(self):
+        self.sink.append(("close", self.content_type))
+
+    async def stream_update(self, update):
+        self.sink.append(("update", update))
+        return update
+
+
+class _FakeStreaming:
+    def __init__(self):
+        self.sink = []
+
+    def streaming_task_message_context(
+        self, task_id, initial_content, streaming_mode="coalesced", created_at=None
+    ):
+        ctype = getattr(initial_content, "type", None)
+        self.sink.append(("ctx", ctype))
+        return _FakeCtx(self.sink, ctype, initial_content)
+
+
 class _Turn:
     def __init__(self, events_list, usage):
         self._events_list = events_list
@@ -54,3 +96,21 @@ async def test_emitter_tracing_default_on_when_trace_id_present():
 async def test_emitter_tracing_overridable_off():
     emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p", tracer=False)
     assert emitter.tracer is None
+
+
+@pytest.mark.asyncio
+async def test_emitter_auto_send_turn_returns_usage():
+    usage = TurnUsage(model="m", input_tokens=5)
+    events = [
+        StreamTaskMessageStart(type="start", index=0,
+            content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0,
+            delta=TextDelta(type="text", text_delta="Hello")),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    turn = _Turn(events, usage)
+    fake = _FakeStreaming()
+    emitter = UnifiedEmitter(task_id="t", trace_id=None, parent_span_id=None, streaming=fake)
+    result = await emitter.auto_send_turn(turn)
+    assert result.usage == usage
+    assert result.final_text == "Hello"

From b5f6b94b6c57478e7de278e0d3ec1f34a750e72e Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:50:18 -0400
Subject: [PATCH 17/26] test(harness): conformance scaffold + CI integration
 job skeleton

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .github/workflows/harness-integration.yml     | 33 ++++++++++++++++
 .../lib/core/harness/conformance/__init__.py  |  0
 tests/lib/core/harness/conformance/runner.py  | 39 +++++++++++++++++++
 .../harness/conformance/test_conformance.py   | 28 +++++++++++++
 4 files changed, 100 insertions(+)
 create mode 100644 .github/workflows/harness-integration.yml
 create mode 100644 tests/lib/core/harness/conformance/__init__.py
 create mode 100644 tests/lib/core/harness/conformance/runner.py
 create mode 100644 tests/lib/core/harness/conformance/test_conformance.py

diff --git a/.github/workflows/harness-integration.yml b/.github/workflows/harness-integration.yml
new file mode 100644
index 000000000..33ca06728
--- /dev/null
+++ b/.github/workflows/harness-integration.yml
@@ -0,0 +1,33 @@
+name: Harness Integration
+
+on:
+  pull_request:
+    paths:
+      - "src/agentex/lib/core/harness/**"
+      - "src/agentex/lib/adk/_modules/**"
+      - ".github/workflows/harness-integration.yml"
+
+jobs:
+  conformance:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5.4.2
+        with:
+          version: '0.10.2'
+
+      - name: Bootstrap
+        run: ./scripts/bootstrap
+
+      - name: Conformance suite
+        run: uv run pytest tests/lib/core/harness/ -v
+
+  # Live integration matrix (harness x {sync, async, temporal}) is added per-harness
+  # in the migration plans. Placeholder job keeps the workflow valid until then.
+  live-matrix:
+    runs-on: ubuntu-latest
+    if: false  # enabled once the first harness's test agents land
+    steps:
+      - run: echo "populated by migration PRs"
diff --git a/tests/lib/core/harness/conformance/__init__.py b/tests/lib/core/harness/conformance/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/lib/core/harness/conformance/runner.py b/tests/lib/core/harness/conformance/runner.py
new file mode 100644
index 000000000..210514b41
--- /dev/null
+++ b/tests/lib/core/harness/conformance/runner.py
@@ -0,0 +1,39 @@
+"""Shared conformance engine: every harness tap registers fixtures here.
+
+A fixture is (name, list[StreamTaskMessage]). The runner asserts that span
+derivation over the events is identical regardless of delivery channel, which is
+the cross-channel guarantee from the spec.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.lib.core.harness.types import SpanSignal, StreamTaskMessage
+
+
+@dataclass
+class Fixture:
+    name: str
+    events: list[StreamTaskMessage]
+
+
+_REGISTRY: list[Fixture] = []
+
+
+def register(fixture: Fixture) -> None:
+    _REGISTRY.append(fixture)
+
+
+def all_fixtures() -> list[Fixture]:
+    return list(_REGISTRY)
+
+
+def derive_all(events: list[StreamTaskMessage]) -> list[SpanSignal]:
+    d = SpanDeriver()
+    out: list[SpanSignal] = []
+    for e in events:
+        out.extend(d.observe(e))
+    out.extend(d.flush())
+    return out
diff --git a/tests/lib/core/harness/conformance/test_conformance.py b/tests/lib/core/harness/conformance/test_conformance.py
new file mode 100644
index 000000000..cc350df3a
--- /dev/null
+++ b/tests/lib/core/harness/conformance/test_conformance.py
@@ -0,0 +1,28 @@
+import pytest
+
+from tests.lib.core.harness.conformance.runner import Fixture, derive_all, register, all_fixtures
+from agentex.types.task_message_update import (
+    StreamTaskMessageStart, StreamTaskMessageDone, StreamTaskMessageFull,
+)
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_response_content import ToolResponseContent
+
+register(Fixture(
+    name="builtin-single-tool",
+    events=[
+        StreamTaskMessageStart(type="start", index=0,
+            content=ToolRequestContent(type="tool_request", author="agent",
+                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(type="full", index=1,
+            content=ToolResponseContent(type="tool_response", author="agent",
+                                        tool_call_id="c", name="Bash", content="ok")),
+    ],
+))
+
+
+@pytest.mark.parametrize("fixture", all_fixtures(), ids=lambda f: f.name)
+def test_span_derivation_is_deterministic(fixture):
+    # Deriving twice over the same events yields identical signals (the property
+    # that makes yield vs auto-send equivalent, since both observe the same stream).
+    assert derive_all(fixture.events) == derive_all(fixture.events)

From 520849afe3f96a024cb184bd8e9d4d3663330529 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 12:56:34 -0400
Subject: [PATCH 18/26] test(harness): match scripts/test invocation + document
 conformance registry semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .github/workflows/harness-integration.yml             | 11 +++++++++--
 tests/lib/core/harness/conformance/runner.py          |  9 +++++++++
 .../lib/core/harness/conformance/test_conformance.py  |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/harness-integration.yml b/.github/workflows/harness-integration.yml
index 33ca06728..ab6b353b9 100644
--- a/.github/workflows/harness-integration.yml
+++ b/.github/workflows/harness-integration.yml
@@ -1,6 +1,8 @@
 name: Harness Integration
 
 on:
+  push:
+    branches: [main]
   pull_request:
     paths:
       - "src/agentex/lib/core/harness/**"
@@ -21,8 +23,13 @@ jobs:
       - name: Bootstrap
         run: ./scripts/bootstrap
 
+      # Defer to scripts/test so the harness suite runs under the exact same
+      # invocation as the main CI test job: DEFER_PYDANTIC_BUILD=false and
+      # `uv run --isolated --all-packages --all-extras pytest`, across the
+      # min/max supported Python versions. Running `uv run pytest` directly
+      # would risk an all-extras-only dep passing locally but failing in CI.
       - name: Conformance suite
-        run: uv run pytest tests/lib/core/harness/ -v
+        run: ./scripts/test tests/lib/core/harness/ -v
 
   # Live integration matrix (harness x {sync, async, temporal}) is added per-harness
   # in the migration plans. Placeholder job keeps the workflow valid until then.
@@ -30,4 +37,4 @@ jobs:
     runs-on: ubuntu-latest
     if: false  # enabled once the first harness's test agents land
     steps:
-      - run: echo "populated by migration PRs"
+      - run: echo "populated by migration PRs"  # TODO(harness-migration): enable per-harness; see docs/superpowers/plans migration PRs 4-8
diff --git a/tests/lib/core/harness/conformance/runner.py b/tests/lib/core/harness/conformance/runner.py
index 210514b41..ffd72f89a 100644
--- a/tests/lib/core/harness/conformance/runner.py
+++ b/tests/lib/core/harness/conformance/runner.py
@@ -3,6 +3,15 @@
 A fixture is (name, list[StreamTaskMessage]). The runner asserts that span
 derivation over the events is identical regardless of delivery channel, which is
 the cross-channel guarantee from the spec.
+
+Registry shared-state hazard: `_REGISTRY` is process-global. Every `test_*.py`
+module that calls `register()` at import time contributes to it, so a module
+that parametrizes over `all_fixtures()` will see fixtures registered by ANY
+other conformance module imported earlier in the same pytest process (collection
+order is not guaranteed). To stay deterministic, each future harness conformance
+module should register and parametrize over its OWN fixtures (e.g. keep a
+module-local list it both registers and parametrizes), rather than relying on
+cross-module global accumulation via `all_fixtures()`.
 """
 
 from __future__ import annotations
diff --git a/tests/lib/core/harness/conformance/test_conformance.py b/tests/lib/core/harness/conformance/test_conformance.py
index cc350df3a..6080ca5ef 100644
--- a/tests/lib/core/harness/conformance/test_conformance.py
+++ b/tests/lib/core/harness/conformance/test_conformance.py
@@ -23,6 +23,8 @@
 
 @pytest.mark.parametrize("fixture", all_fixtures(), ids=lambda f: f.name)
 def test_span_derivation_is_deterministic(fixture):
+    """Exercises the cross-channel guarantee: yield and auto-send observe the
+    same event stream, so span derivation must be deterministic/idempotent."""
     # Deriving twice over the same events yields identical signals (the property
     # that makes yield vs auto-send equivalent, since both observe the same stream).
     assert derive_all(fixture.events) == derive_all(fixture.events)

From a915170b3efb3ecc18b46934ef7b7f7f38150df4 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 13:00:25 -0400
Subject: [PATCH 19/26] refactor(harness): isinstance narrowing for clean
 type-check across the package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/auto_send.py       |  6 ++----
 src/agentex/lib/core/harness/span_derivation.py | 12 +++++++-----
 src/agentex/lib/core/harness/yield_delivery.py  |  4 ++--
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
index e7de01a68..850146ab7 100644
--- a/src/agentex/lib/core/harness/auto_send.py
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -10,6 +10,7 @@
     StreamTaskMessageFull,
     StreamTaskMessageStart,
 )
+from agentex.types.text_delta import TextDelta
 
 from agentex.lib.core.harness.span_derivation import SpanDeriver
 from agentex.lib.core.harness.tracer import SpanTracer
@@ -83,10 +84,7 @@ async def _close_current() -> None:
                         index=event.index,
                     )
                     await current_ctx.stream_update(delta_with_parent)
-                    if (
-                        getattr(event.delta, "type", None) == "text"
-                        and event.delta.text_delta
-                    ):
+                    if isinstance(event.delta, TextDelta) and event.delta.text_delta:
                         final_text_parts.append(event.delta.text_delta)
 
             elif isinstance(event, StreamTaskMessageDone):
diff --git a/src/agentex/lib/core/harness/span_derivation.py b/src/agentex/lib/core/harness/span_derivation.py
index 15e1f593f..eac929ee5 100644
--- a/src/agentex/lib/core/harness/span_derivation.py
+++ b/src/agentex/lib/core/harness/span_derivation.py
@@ -15,6 +15,9 @@
     StreamTaskMessageFull,
     StreamTaskMessageStart,
 )
+from agentex.types.tool_request_content import ToolRequestContent
+from agentex.types.tool_request_delta import ToolRequestDelta
+from agentex.types.tool_response_content import ToolResponseContent
 
 from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal, StreamTaskMessage
 
@@ -79,15 +82,14 @@ def _on_start(self, event: StreamTaskMessageStart) -> list[SpanSignal]:
             return []
         idx = event.index
         content = event.content
-        ctype = getattr(content, "type", None)
-        if ctype == "tool_request":
+        if isinstance(content, ToolRequestContent):
             self._tool_by_index[idx] = _ToolReqMeta(
                 tool_call_id=content.tool_call_id,
                 name=content.name,
                 arguments=dict(content.arguments or {}),
             )
             return []
-        if ctype == "reasoning":
+        if content.type == "reasoning":
             self._reasoning_index_open.add(idx)
             return [OpenSpan(key=f"reasoning:{idx}", kind="reasoning", name="reasoning", input={})]
         return []
@@ -97,7 +99,7 @@ def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
             return []
         idx = event.index
         delta = event.delta
-        if delta is not None and getattr(delta, "type", None) == "tool_request":
+        if isinstance(delta, ToolRequestDelta):
             meta = self._tool_by_index.get(idx)
             if meta is not None and delta.arguments_delta:
                 meta.args_buf += delta.arguments_delta
@@ -105,7 +107,7 @@ def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
 
     def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
         content = event.content
-        if getattr(content, "type", None) == "tool_response":
+        if isinstance(content, ToolResponseContent):
             tcid = content.tool_call_id
             if tcid in self._open_tool_ids:
                 self._open_tool_ids.pop(tcid, None)
diff --git a/src/agentex/lib/core/harness/yield_delivery.py b/src/agentex/lib/core/harness/yield_delivery.py
index ca923c6a3..0d90d5d94 100644
--- a/src/agentex/lib/core/harness/yield_delivery.py
+++ b/src/agentex/lib/core/harness/yield_delivery.py
@@ -21,11 +21,11 @@ async def yield_events(
     deriver = SpanDeriver() if tracer is not None else None
     try:
         async for event in events:
-            if deriver is not None:  # tracer is non-None whenever deriver is set
+            if deriver is not None and tracer is not None:
                 for signal in deriver.observe(event):
                     await tracer.handle(signal)
             yield event
     finally:
-        if deriver is not None:  # tracer is non-None whenever deriver is set
+        if deriver is not None and tracer is not None:
             for signal in deriver.flush():
                 await tracer.handle(signal)

From e7b9c5209852200ccac31db170c435a2646f8893 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 13:04:07 -0400
Subject: [PATCH 20/26] refactor(harness): narrow auto_send tracer guards, drop
 type:ignore for consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/auto_send.py | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
index 850146ab7..ee17fdc56 100644
--- a/src/agentex/lib/core/harness/auto_send.py
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -58,9 +58,9 @@ async def _close_current() -> None:
 
     try:
         async for event in events:
-            if deriver is not None:
+            if deriver is not None and tracer is not None:
                 for signal in deriver.observe(event):
-                    await tracer.handle(signal)  # type: ignore[union-attr]
+                    await tracer.handle(signal)
 
             if isinstance(event, StreamTaskMessageStart):
                 ctype = getattr(event.content, "type", None)
@@ -107,8 +107,8 @@ async def _close_current() -> None:
 
     finally:
         await _close_current()
-        if deriver is not None:
+        if deriver is not None and tracer is not None:
             for signal in deriver.flush():
-                await tracer.handle(signal)  # type: ignore[union-attr]
+                await tracer.handle(signal)
 
     return TurnResult(final_text="".join(final_text_parts), usage=usage or TurnUsage())

From ebc468d014e2a243b245c29d6395d4055beccbde Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 13:29:20 -0400
Subject: [PATCH 21/26] style: ruff import-sort + format fixes across the
 harness package

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/__init__.py      |  12 +-
 src/agentex/lib/core/harness/auto_send.py     |   9 +-
 src/agentex/lib/core/harness/emitter.py       |   4 +-
 .../lib/core/harness/span_derivation.py       |   7 +-
 src/agentex/lib/core/harness/tracer.py        |   2 +-
 src/agentex/lib/core/harness/types.py         |   9 +-
 .../lib/core/harness/yield_delivery.py        |   6 +-
 tests/lib/core/harness/conformance/runner.py  |   2 +-
 .../harness/conformance/test_conformance.py   |  40 ++++---
 tests/lib/core/harness/test_auto_send.py      | 111 +++++++++++-------
 tests/lib/core/harness/test_emitter.py        |  24 ++--
 .../lib/core/harness/test_span_derivation.py  | 107 ++++++++++-------
 tests/lib/core/harness/test_tracer.py         |   2 +-
 tests/lib/core/harness/test_types.py          |   4 +-
 tests/lib/core/harness/test_yield_delivery.py |  46 +++++---
 15 files changed, 228 insertions(+), 157 deletions(-)

diff --git a/src/agentex/lib/core/harness/__init__.py b/src/agentex/lib/core/harness/__init__.py
index 2988db8ff..067751d63 100644
--- a/src/agentex/lib/core/harness/__init__.py
+++ b/src/agentex/lib/core/harness/__init__.py
@@ -5,17 +5,17 @@
 harness tap gets streaming + tracing + turn usage uniformly.
 """
 
-from agentex.lib.core.harness.emitter import UnifiedEmitter
-from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.lib.core.harness.types import (
-    CloseSpan,
-    HarnessTurn,
     OpenSpan,
+    CloseSpan,
+    TurnUsage,
     SpanSignal,
-    StreamTaskMessage,
     TurnResult,
-    TurnUsage,
+    HarnessTurn,
+    StreamTaskMessage,
 )
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.emitter import UnifiedEmitter
 
 __all__ = [
     "UnifiedEmitter",
diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
index ee17fdc56..b246cf38c 100644
--- a/src/agentex/lib/core/harness/auto_send.py
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -4,17 +4,16 @@
 
 from typing import Any, AsyncIterator
 
+from agentex.types.text_delta import TextDelta
+from agentex.lib.core.harness.types import TurnUsage, TurnResult, StreamTaskMessage
+from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.types.task_message_update import (
-    StreamTaskMessageDelta,
     StreamTaskMessageDone,
     StreamTaskMessageFull,
+    StreamTaskMessageDelta,
     StreamTaskMessageStart,
 )
-from agentex.types.text_delta import TextDelta
-
 from agentex.lib.core.harness.span_derivation import SpanDeriver
-from agentex.lib.core.harness.tracer import SpanTracer
-from agentex.lib.core.harness.types import StreamTaskMessage, TurnResult, TurnUsage
 
 
 async def auto_send(
diff --git a/src/agentex/lib/core/harness/emitter.py b/src/agentex/lib/core/harness/emitter.py
index 9573fb8b2..681c859ea 100644
--- a/src/agentex/lib/core/harness/emitter.py
+++ b/src/agentex/lib/core/harness/emitter.py
@@ -4,9 +4,9 @@
 
 from typing import AsyncGenerator
 
-from agentex.lib.core.harness.auto_send import auto_send
+from agentex.lib.core.harness.types import TurnResult, HarnessTurn, StreamTaskMessage
 from agentex.lib.core.harness.tracer import SpanTracer
-from agentex.lib.core.harness.types import HarnessTurn, StreamTaskMessage, TurnResult
+from agentex.lib.core.harness.auto_send import auto_send
 from agentex.lib.core.harness.yield_delivery import yield_events
 
 
diff --git a/src/agentex/lib/core/harness/span_derivation.py b/src/agentex/lib/core/harness/span_derivation.py
index eac929ee5..d353cf9e0 100644
--- a/src/agentex/lib/core/harness/span_derivation.py
+++ b/src/agentex/lib/core/harness/span_derivation.py
@@ -9,18 +9,17 @@
 import json
 from dataclasses import dataclass
 
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan, SpanSignal, StreamTaskMessage
+from agentex.types.tool_request_delta import ToolRequestDelta
 from agentex.types.task_message_update import (
-    StreamTaskMessageDelta,
     StreamTaskMessageDone,
     StreamTaskMessageFull,
+    StreamTaskMessageDelta,
     StreamTaskMessageStart,
 )
 from agentex.types.tool_request_content import ToolRequestContent
-from agentex.types.tool_request_delta import ToolRequestDelta
 from agentex.types.tool_response_content import ToolResponseContent
 
-from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal, StreamTaskMessage
-
 
 @dataclass
 class _ToolReqMeta:
diff --git a/src/agentex/lib/core/harness/tracer.py b/src/agentex/lib/core/harness/tracer.py
index 3f4ff40c2..8384407bd 100644
--- a/src/agentex/lib/core/harness/tracer.py
+++ b/src/agentex/lib/core/harness/tracer.py
@@ -4,7 +4,7 @@
 
 from typing import Any
 
-from agentex.lib.core.harness.types import CloseSpan, OpenSpan, SpanSignal
+from agentex.lib.core.harness.types import OpenSpan, CloseSpan, SpanSignal
 
 try:
     from agentex.lib.utils.logging import make_logger
diff --git a/src/agentex/lib/core/harness/types.py b/src/agentex/lib/core/harness/types.py
index f31b2c67f..64104d316 100644
--- a/src/agentex/lib/core/harness/types.py
+++ b/src/agentex/lib/core/harness/types.py
@@ -2,16 +2,17 @@
 
 from __future__ import annotations
 
-from dataclasses import dataclass, field
-from typing import Any, AsyncIterator, Literal, Protocol, Union, runtime_checkable
+from typing import Any, Union, Literal, Protocol, AsyncIterator, runtime_checkable
+from dataclasses import field, dataclass
+
+from pydantic import BaseModel, ConfigDict
 
 from agentex.types.task_message_update import (
-    StreamTaskMessageDelta,
     StreamTaskMessageDone,
     StreamTaskMessageFull,
+    StreamTaskMessageDelta,
     StreamTaskMessageStart,
 )
-from pydantic import BaseModel, ConfigDict
 
 # The canonical stream element. Taps yield these; delivery adapters consume them.
 StreamTaskMessage = Union[
diff --git a/src/agentex/lib/core/harness/yield_delivery.py b/src/agentex/lib/core/harness/yield_delivery.py
index 0d90d5d94..69b39f152 100644
--- a/src/agentex/lib/core/harness/yield_delivery.py
+++ b/src/agentex/lib/core/harness/yield_delivery.py
@@ -2,11 +2,11 @@
 
 from __future__ import annotations
 
-from typing import AsyncGenerator, AsyncIterator
+from typing import AsyncIterator, AsyncGenerator
 
-from agentex.lib.core.harness.span_derivation import SpanDeriver
-from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.lib.core.harness.types import StreamTaskMessage
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.lib.core.harness.span_derivation import SpanDeriver
 
 
 async def yield_events(
diff --git a/tests/lib/core/harness/conformance/runner.py b/tests/lib/core/harness/conformance/runner.py
index ffd72f89a..81a74860c 100644
--- a/tests/lib/core/harness/conformance/runner.py
+++ b/tests/lib/core/harness/conformance/runner.py
@@ -18,8 +18,8 @@
 
 from dataclasses import dataclass
 
-from agentex.lib.core.harness.span_derivation import SpanDeriver
 from agentex.lib.core.harness.types import SpanSignal, StreamTaskMessage
+from agentex.lib.core.harness.span_derivation import SpanDeriver
 
 
 @dataclass
diff --git a/tests/lib/core/harness/conformance/test_conformance.py b/tests/lib/core/harness/conformance/test_conformance.py
index 6080ca5ef..1d686c33a 100644
--- a/tests/lib/core/harness/conformance/test_conformance.py
+++ b/tests/lib/core/harness/conformance/test_conformance.py
@@ -1,24 +1,36 @@
 import pytest
 
-from tests.lib.core.harness.conformance.runner import Fixture, derive_all, register, all_fixtures
 from agentex.types.task_message_update import (
-    StreamTaskMessageStart, StreamTaskMessageDone, StreamTaskMessageFull,
+    StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageStart,
 )
 from agentex.types.tool_request_content import ToolRequestContent
 from agentex.types.tool_response_content import ToolResponseContent
+from tests.lib.core.harness.conformance.runner import Fixture, register, derive_all, all_fixtures
 
-register(Fixture(
-    name="builtin-single-tool",
-    events=[
-        StreamTaskMessageStart(type="start", index=0,
-            content=ToolRequestContent(type="tool_request", author="agent",
-                                       tool_call_id="c", name="Bash", arguments={})),
-        StreamTaskMessageDone(type="done", index=0),
-        StreamTaskMessageFull(type="full", index=1,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="c", name="Bash", content="ok")),
-    ],
-))
+register(
+    Fixture(
+        name="builtin-single-tool",
+        events=[
+            StreamTaskMessageStart(
+                type="start",
+                index=0,
+                content=ToolRequestContent(
+                    type="tool_request", author="agent", tool_call_id="c", name="Bash", arguments={}
+                ),
+            ),
+            StreamTaskMessageDone(type="done", index=0),
+            StreamTaskMessageFull(
+                type="full",
+                index=1,
+                content=ToolResponseContent(
+                    type="tool_response", author="agent", tool_call_id="c", name="Bash", content="ok"
+                ),
+            ),
+        ],
+    )
+)
 
 
 @pytest.mark.parametrize("fixture", all_fixtures(), ids=lambda f: f.name)
diff --git a/tests/lib/core/harness/test_auto_send.py b/tests/lib/core/harness/test_auto_send.py
index 9568d7b87..e7331e67c 100644
--- a/tests/lib/core/harness/test_auto_send.py
+++ b/tests/lib/core/harness/test_auto_send.py
@@ -13,17 +13,17 @@
 
 import pytest
 
-from agentex.lib.core.harness.auto_send import auto_send
-from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.types.task_message import TaskMessage
+from agentex.types.text_content import TextContent
+from agentex.lib.core.harness.tracer import SpanTracer
+from agentex.types.task_message_delta import TextDelta
 from agentex.types.task_message_update import (
-    StreamTaskMessageStart,
-    StreamTaskMessageDelta,
     StreamTaskMessageDone,
     StreamTaskMessageFull,
+    StreamTaskMessageDelta,
+    StreamTaskMessageStart,
 )
-from agentex.types.text_content import TextContent
-from agentex.types.task_message_delta import TextDelta
+from agentex.lib.core.harness.auto_send import auto_send
 from agentex.types.tool_request_content import ToolRequestContent
 from agentex.types.tool_response_content import ToolResponseContent
 
@@ -40,9 +40,7 @@ def __init__(self, sink, content_type, initial_content):
         self.sink = sink
         self.content_type = content_type
         # Real TaskMessage so StreamTaskMessageDelta(parent_task_message=...) passes validation
-        self.task_message = TaskMessage(
-            id="msg-1", task_id="task1", content=initial_content
-        )
+        self.task_message = TaskMessage(id="msg-1", task_id="task1", content=initial_content)
 
     async def __aenter__(self):
         self.sink.append(("open", self.content_type))
@@ -67,9 +65,7 @@ class _FakeStreaming:
     def __init__(self):
         self.sink = []
 
-    def streaming_task_message_context(
-        self, task_id, initial_content, streaming_mode="coalesced", created_at=None
-    ):
+    def streaming_task_message_context(self, task_id, initial_content, streaming_mode="coalesced", created_at=None):
         ctype = getattr(initial_content, "type", None)
         self.sink.append(("ctx", ctype))
         return _FakeCtx(self.sink, ctype, initial_content)
@@ -84,20 +80,24 @@ async def _gen(events):
 # Test 1: text streaming — open, stream deltas, close; return accumulated text
 # ---------------------------------------------------------------------------
 
+
 @pytest.mark.asyncio
 async def test_auto_send_streams_text_and_returns_final_text():
     streaming = _FakeStreaming()
     events = [
         StreamTaskMessageStart(
-            type="start", index=0,
+            type="start",
+            index=0,
             content=TextContent(type="text", author="agent", content=""),
         ),
         StreamTaskMessageDelta(
-            type="delta", index=0,
+            type="delta",
+            index=0,
             delta=TextDelta(type="text", text_delta="Hel"),
         ),
         StreamTaskMessageDelta(
-            type="delta", index=0,
+            type="delta",
+            index=0,
             delta=TextDelta(type="text", text_delta="lo"),
         ),
         StreamTaskMessageDone(type="done", index=0),
@@ -122,6 +122,7 @@ async def test_auto_send_streams_text_and_returns_final_text():
 # (open context with the content, no deltas, close immediately)
 # ---------------------------------------------------------------------------
 
+
 @pytest.mark.asyncio
 async def test_auto_send_posts_full_tool_messages():
     streaming = _FakeStreaming()
@@ -129,24 +130,36 @@ async def test_auto_send_posts_full_tool_messages():
         # A bare tool_request Start (no Done/Full) must NOT open a streaming
         # context on its own — only Full events post messages.
         StreamTaskMessageStart(
-            type="start", index=0,
+            type="start",
+            index=0,
             content=ToolRequestContent(
-                type="tool_request", author="agent",
-                tool_call_id="c0", name="Bash", arguments={},
+                type="tool_request",
+                author="agent",
+                tool_call_id="c0",
+                name="Bash",
+                arguments={},
             ),
         ),
         StreamTaskMessageFull(
-            type="full", index=1,
+            type="full",
+            index=1,
             content=ToolRequestContent(
-                type="tool_request", author="agent",
-                tool_call_id="c1", name="Bash", arguments={"cmd": "ls"},
+                type="tool_request",
+                author="agent",
+                tool_call_id="c1",
+                name="Bash",
+                arguments={"cmd": "ls"},
             ),
         ),
         StreamTaskMessageFull(
-            type="full", index=2,
+            type="full",
+            index=2,
             content=ToolResponseContent(
-                type="tool_response", author="agent",
-                tool_call_id="c1", name="Bash", content="file.py",
+                type="tool_response",
+                author="agent",
+                tool_call_id="c1",
+                name="Bash",
+                content="file.py",
             ),
         ),
     ]
@@ -176,6 +189,7 @@ async def test_auto_send_posts_full_tool_messages():
 # Test 3: tracing — spans are derived and handed to the tracer
 # ---------------------------------------------------------------------------
 
+
 class _RecordTracing:
     def __init__(self):
         self.started, self.ended = [], []
@@ -196,25 +210,31 @@ async def test_auto_send_derives_tool_spans_via_tracer():
 
     events = [
         StreamTaskMessageStart(
-            type="start", index=0,
+            type="start",
+            index=0,
             content=ToolRequestContent(
-                type="tool_request", author="agent",
-                tool_call_id="c1", name="Bash", arguments={},
+                type="tool_request",
+                author="agent",
+                tool_call_id="c1",
+                name="Bash",
+                arguments={},
             ),
         ),
         StreamTaskMessageDone(type="done", index=0),
         StreamTaskMessageFull(
-            type="full", index=1,
+            type="full",
+            index=1,
             content=ToolResponseContent(
-                type="tool_response", author="agent",
-                tool_call_id="c1", name="Bash", content="ok",
+                type="tool_response",
+                author="agent",
+                tool_call_id="c1",
+                name="Bash",
+                content="ok",
             ),
         ),
     ]
 
-    result = await auto_send(
-        _gen(events), task_id="task1", tracer=tracer, streaming=streaming
-    )
+    result = await auto_send(_gen(events), task_id="task1", tracer=tracer, streaming=streaming)
 
     assert result.final_text == ""
     assert fake_tracing.started == ["Bash"]
@@ -225,24 +245,31 @@ async def test_auto_send_derives_tool_spans_via_tracer():
 # Test 4: text followed by a tool Full — text context is closed before Full
 # ---------------------------------------------------------------------------
 
+
 @pytest.mark.asyncio
 async def test_auto_send_closes_text_context_before_full_message():
     streaming = _FakeStreaming()
     events = [
         StreamTaskMessageStart(
-            type="start", index=0,
+            type="start",
+            index=0,
             content=TextContent(type="text", author="agent", content=""),
         ),
         StreamTaskMessageDelta(
-            type="delta", index=0,
+            type="delta",
+            index=0,
             delta=TextDelta(type="text", text_delta="Hi"),
         ),
         StreamTaskMessageDone(type="done", index=0),
         StreamTaskMessageFull(
-            type="full", index=1,
+            type="full",
+            index=1,
             content=ToolRequestContent(
-                type="tool_request", author="agent",
-                tool_call_id="c2", name="read_file", arguments={},
+                type="tool_request",
+                author="agent",
+                tool_call_id="c2",
+                name="read_file",
+                arguments={},
             ),
         ),
     ]
@@ -261,21 +288,21 @@ async def test_auto_send_closes_text_context_before_full_message():
 # Test 5: midstream error — propagates AND the open context is closed (finally)
 # ---------------------------------------------------------------------------
 
+
 @pytest.mark.asyncio
 async def test_open_context_closed_on_midstream_error():
     streaming = _FakeStreaming()
 
     async def _exploding_gen():
         yield StreamTaskMessageStart(
-            type="start", index=0,
+            type="start",
+            index=0,
             content=TextContent(type="text", author="agent", content=""),
         )
         raise RuntimeError("boom")
 
     with pytest.raises(RuntimeError, match="boom"):
-        await auto_send(
-            _exploding_gen(), task_id="task1", tracer=None, streaming=streaming
-        )
+        await auto_send(_exploding_gen(), task_id="task1", tracer=None, streaming=streaming)
 
     # The text context that was opened mid-stream was closed by the finally block.
     assert ("open", "text") in [(s[0], s[1]) for s in streaming.sink]
diff --git a/tests/lib/core/harness/test_emitter.py b/tests/lib/core/harness/test_emitter.py
index 963a77dfe..ee3052f47 100644
--- a/tests/lib/core/harness/test_emitter.py
+++ b/tests/lib/core/harness/test_emitter.py
@@ -1,15 +1,15 @@
 import pytest
 
-from agentex.lib.core.harness.emitter import UnifiedEmitter
-from agentex.lib.core.harness.types import TurnUsage
 from agentex.types.task_message import TaskMessage
+from agentex.types.text_content import TextContent
+from agentex.lib.core.harness.types import TurnUsage
+from agentex.lib.core.harness.emitter import UnifiedEmitter
 from agentex.types.task_message_delta import TextDelta
 from agentex.types.task_message_update import (
-    StreamTaskMessageDelta,
     StreamTaskMessageDone,
+    StreamTaskMessageDelta,
     StreamTaskMessageStart,
 )
-from agentex.types.text_content import TextContent
 
 
 class _FakeTracing:
@@ -48,9 +48,7 @@ class _FakeStreaming:
     def __init__(self):
         self.sink = []
 
-    def streaming_task_message_context(
-        self, task_id, initial_content, streaming_mode="coalesced", created_at=None
-    ):
+    def streaming_task_message_context(self, task_id, initial_content, streaming_mode="coalesced", created_at=None):
         ctype = getattr(initial_content, "type", None)
         self.sink.append(("ctx", ctype))
         return _FakeCtx(self.sink, ctype, initial_content)
@@ -73,8 +71,7 @@ def usage(self):
 @pytest.mark.asyncio
 async def test_emitter_yield_mode_passes_through():
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=TextContent(type="text", author="agent", content="hi")),
+        StreamTaskMessageStart(type="start", index=0, content=TextContent(type="text", author="agent", content="hi")),
         StreamTaskMessageDone(type="done", index=0),
     ]
     turn = _Turn(events, TurnUsage(model="m"))
@@ -87,8 +84,7 @@ async def test_emitter_yield_mode_passes_through():
 async def test_emitter_tracing_default_on_when_trace_id_present():
     # Inject a fake tracing backend so the test env doesn't need temporalio.
     # This exercises the default-on path (tracer=None) when trace_id is truthy.
-    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p",
-                             tracing=_FakeTracing())
+    emitter = UnifiedEmitter(task_id="t", trace_id="trace1", parent_span_id="p", tracing=_FakeTracing())
     assert emitter.tracer is not None
 
 
@@ -102,10 +98,8 @@ async def test_emitter_tracing_overridable_off():
 async def test_emitter_auto_send_turn_returns_usage():
     usage = TurnUsage(model="m", input_tokens=5)
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=TextContent(type="text", author="agent", content="")),
-        StreamTaskMessageDelta(type="delta", index=0,
-            delta=TextDelta(type="text", text_delta="Hello")),
+        StreamTaskMessageStart(type="start", index=0, content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0, delta=TextDelta(type="text", text_delta="Hello")),
         StreamTaskMessageDone(type="done", index=0),
     ]
     turn = _Turn(events, usage)
diff --git a/tests/lib/core/harness/test_span_derivation.py b/tests/lib/core/harness/test_span_derivation.py
index 0630131d0..7779de815 100644
--- a/tests/lib/core/harness/test_span_derivation.py
+++ b/tests/lib/core/harness/test_span_derivation.py
@@ -1,16 +1,16 @@
-from agentex.lib.core.harness.span_derivation import SpanDeriver
+from agentex.types.text_content import TextContent
 from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+from agentex.types.reasoning_content import ReasoningContent
+from agentex.types.tool_request_delta import ToolRequestDelta
 from agentex.types.task_message_update import (
-    StreamTaskMessageStart,
-    StreamTaskMessageDelta,
-    StreamTaskMessageFull,
     StreamTaskMessageDone,
+    StreamTaskMessageFull,
+    StreamTaskMessageDelta,
+    StreamTaskMessageStart,
 )
-from agentex.types.text_content import TextContent
-from agentex.types.reasoning_content import ReasoningContent
 from agentex.types.tool_request_content import ToolRequestContent
 from agentex.types.tool_response_content import ToolResponseContent
-from agentex.types.tool_request_delta import ToolRequestDelta
+from agentex.lib.core.harness.span_derivation import SpanDeriver
 
 
 def _signals(deriver, events):
@@ -23,19 +23,17 @@ def _signals(deriver, events):
 
 def _tool_req(idx, tcid, name, args):
     return StreamTaskMessageStart(
-        type="start", index=idx,
-        content=ToolRequestContent(type="tool_request", author="agent",
-                                   tool_call_id=tcid, name=name, arguments=args),
+        type="start",
+        index=idx,
+        content=ToolRequestContent(type="tool_request", author="agent", tool_call_id=tcid, name=name, arguments=args),
     )
 
 
 def test_text_only_yields_no_spans():
     d = SpanDeriver()
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=TextContent(type="text", author="agent", content="")),
-        StreamTaskMessageDelta(type="delta", index=0,
-            delta=None),
+        StreamTaskMessageStart(type="start", index=0, content=TextContent(type="text", author="agent", content="")),
+        StreamTaskMessageDelta(type="delta", index=0, delta=None),
         StreamTaskMessageDone(type="done", index=0),
     ]
     assert _signals(d, events) == []
@@ -46,9 +44,13 @@ def test_single_tool_opens_on_done_closes_on_response():
     events = [
         _tool_req(0, "call_1", "Bash", {"cmd": "ls"}),
         StreamTaskMessageDone(type="done", index=0),
-        StreamTaskMessageFull(type="full", index=1,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="call_1", name="Bash", content="files")),
+        StreamTaskMessageFull(
+            type="full",
+            index=1,
+            content=ToolResponseContent(
+                type="tool_response", author="agent", tool_call_id="call_1", name="Bash", content="files"
+            ),
+        ),
     ]
     sigs = _signals(d, events)
     assert sigs == [
@@ -60,8 +62,9 @@ def test_single_tool_opens_on_done_closes_on_response():
 def test_reasoning_opens_on_start_closes_on_done():
     d = SpanDeriver()
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=ReasoningContent(type="reasoning", author="agent", summary=[], content=[])),
+        StreamTaskMessageStart(
+            type="start", index=0, content=ReasoningContent(type="reasoning", author="agent", summary=[], content=[])
+        ),
         StreamTaskMessageDone(type="done", index=0),
     ]
     sigs = _signals(d, events)
@@ -76,12 +79,20 @@ def test_parallel_tools_pair_by_tool_call_id():
         _tool_req(1, "b", "T2", {}),
         StreamTaskMessageDone(type="done", index=0),
         StreamTaskMessageDone(type="done", index=1),
-        StreamTaskMessageFull(type="full", index=2,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="b", name="T2", content="rb")),
-        StreamTaskMessageFull(type="full", index=3,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="a", name="T1", content="ra")),
+        StreamTaskMessageFull(
+            type="full",
+            index=2,
+            content=ToolResponseContent(
+                type="tool_response", author="agent", tool_call_id="b", name="T2", content="rb"
+            ),
+        ),
+        StreamTaskMessageFull(
+            type="full",
+            index=3,
+            content=ToolResponseContent(
+                type="tool_response", author="agent", tool_call_id="a", name="T1", content="ra"
+            ),
+        ),
     ]
     sigs = _signals(d, events)
     opens = [s for s in sigs if isinstance(s, OpenSpan)]
@@ -94,15 +105,23 @@ def test_parallel_tools_pair_by_tool_call_id():
 def test_streamed_args_accumulate_into_open_input():
     d = SpanDeriver()
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=ToolRequestContent(type="tool_request", author="agent",
-                                       tool_call_id="c", name="Bash", arguments={})),
-        StreamTaskMessageDelta(type="delta", index=0,
-            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
-                                   arguments_delta='{"cmd":')),
-        StreamTaskMessageDelta(type="delta", index=0,
-            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash",
-                                   arguments_delta='"ls"}')),
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent", tool_call_id="c", name="Bash", arguments={}
+            ),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash", arguments_delta='{"cmd":'),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=0,
+            delta=ToolRequestDelta(type="tool_request", tool_call_id="c", name="Bash", arguments_delta='"ls"}'),
+        ),
         StreamTaskMessageDone(type="done", index=0),
     ]
     sigs = _signals(d, events)
@@ -123,9 +142,13 @@ def test_unclosed_tool_closed_incomplete_on_flush():
 def test_none_index_is_skipped():
     d = SpanDeriver()
     events = [
-        StreamTaskMessageStart(type="start", index=None,
-            content=ToolRequestContent(type="tool_request", author="agent",
-                                       tool_call_id="n", name="Bash", arguments={})),
+        StreamTaskMessageStart(
+            type="start",
+            index=None,
+            content=ToolRequestContent(
+                type="tool_request", author="agent", tool_call_id="n", name="Bash", arguments={}
+            ),
+        ),
         StreamTaskMessageDone(type="done", index=None),
     ]
     assert _signals(d, events) == []
@@ -134,8 +157,12 @@ def test_none_index_is_skipped():
 def test_orphan_tool_response_ignored():
     d = SpanDeriver()
     events = [
-        StreamTaskMessageFull(type="full", index=0,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="z", name="Bash", content="r")),
+        StreamTaskMessageFull(
+            type="full",
+            index=0,
+            content=ToolResponseContent(
+                type="tool_response", author="agent", tool_call_id="z", name="Bash", content="r"
+            ),
+        ),
     ]
     assert _signals(d, events) == []
diff --git a/tests/lib/core/harness/test_tracer.py b/tests/lib/core/harness/test_tracer.py
index f5fdb16b6..7e1a4bd67 100644
--- a/tests/lib/core/harness/test_tracer.py
+++ b/tests/lib/core/harness/test_tracer.py
@@ -1,7 +1,7 @@
 import pytest
 
-from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.lib.core.harness.types import OpenSpan, CloseSpan
+from agentex.lib.core.harness.tracer import SpanTracer
 
 
 class _FakeSpan:
diff --git a/tests/lib/core/harness/test_types.py b/tests/lib/core/harness/test_types.py
index 91857993a..68bc89ce2 100644
--- a/tests/lib/core/harness/test_types.py
+++ b/tests/lib/core/harness/test_types.py
@@ -3,10 +3,10 @@
 from agentex.lib.core.harness.types import (
     OpenSpan,
     CloseSpan,
-    HarnessTurn,
-    StreamTaskMessage,
     TurnUsage,
     TurnResult,
+    HarnessTurn,
+    StreamTaskMessage,
 )
 
 
diff --git a/tests/lib/core/harness/test_yield_delivery.py b/tests/lib/core/harness/test_yield_delivery.py
index 986b4a92d..f3f491d84 100644
--- a/tests/lib/core/harness/test_yield_delivery.py
+++ b/tests/lib/core/harness/test_yield_delivery.py
@@ -2,15 +2,15 @@
 
 import pytest
 
-from agentex.lib.core.harness.yield_delivery import yield_events
 from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.types.task_message_update import (
-    StreamTaskMessageStart,
     StreamTaskMessageDone,
     StreamTaskMessageFull,
+    StreamTaskMessageStart,
 )
 from agentex.types.tool_request_content import ToolRequestContent
 from agentex.types.tool_response_content import ToolResponseContent
+from agentex.lib.core.harness.yield_delivery import yield_events
 
 
 class _RecordTracing:
@@ -35,18 +35,26 @@ async def test_yield_passes_events_through_and_traces():
     fake = _RecordTracing()
     tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake)
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=ToolRequestContent(type="tool_request", author="agent",
-                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent", tool_call_id="c", name="Bash", arguments={}
+            ),
+        ),
         StreamTaskMessageDone(type="done", index=0),
-        StreamTaskMessageFull(type="full", index=1,
-            content=ToolResponseContent(type="tool_response", author="agent",
-                                        tool_call_id="c", name="Bash", content="ok")),
+        StreamTaskMessageFull(
+            type="full",
+            index=1,
+            content=ToolResponseContent(
+                type="tool_response", author="agent", tool_call_id="c", name="Bash", content="ok"
+            ),
+        ),
     ]
     out = [e async for e in yield_events(_gen(events), tracer=tracer)]
-    assert out == events            # passthrough unchanged
-    assert fake.started == ["Bash"] # span derived + opened
-    assert fake.ended == ["ok"]     # span closed with response
+    assert out == events  # passthrough unchanged
+    assert fake.started == ["Bash"]  # span derived + opened
+    assert fake.ended == ["ok"]  # span closed with response
 
 
 @pytest.mark.asyncio
@@ -63,15 +71,19 @@ async def test_flush_runs_on_early_close():
     fake = _RecordTracing()
     tracer = SpanTracer(trace_id="t", parent_span_id="p", tracing=fake)
     events = [
-        StreamTaskMessageStart(type="start", index=0,
-            content=ToolRequestContent(type="tool_request", author="agent",
-                                       tool_call_id="c", name="Bash", arguments={})),
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request", author="agent", tool_call_id="c", name="Bash", arguments={}
+            ),
+        ),
         StreamTaskMessageDone(type="done", index=0),
         # response intentionally never arrives
     ]
     gen = yield_events(_gen(events), tracer=tracer)
-    first = await gen.__anext__()   # Start
+    first = await gen.__anext__()  # Start
     second = await gen.__anext__()  # Done -> tool span opens here
-    await gen.aclose()              # triggers the finally -> flush()
+    await gen.aclose()  # triggers the finally -> flush()
     assert fake.started == ["Bash"]
-    assert fake.ended == [None]     # flush closed the unpaired span (incomplete, no output)
+    assert fake.ended == [None]  # flush closed the unpaired span (incomplete, no output)

From 8b0da837d2ee13de1f7d93ebda0fa594cc0f5cd6 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 15:21:03 -0400
Subject: [PATCH 22/26] fix(harness): mark overridden start_span with @override
 for pyright (reportImplicitOverride)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/lib/core/harness/test_tracer.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/lib/core/harness/test_tracer.py b/tests/lib/core/harness/test_tracer.py
index 7e1a4bd67..315b74417 100644
--- a/tests/lib/core/harness/test_tracer.py
+++ b/tests/lib/core/harness/test_tracer.py
@@ -1,3 +1,5 @@
+from typing import override
+
 import pytest
 
 from agentex.lib.core.harness.types import OpenSpan, CloseSpan
@@ -45,6 +47,7 @@ async def test_no_trace_id_is_noop():
 @pytest.mark.asyncio
 async def test_tracing_failure_is_swallowed():
     class _Boom(_FakeTracing):
+        @override
         async def start_span(self, **kw):
             raise RuntimeError("backend down")
 

From f9266cf6b07ee952b19b75909ce7e3a7a9138908 Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 15:23:26 -0400
Subject: [PATCH 23/26] fix(harness): relative import in conformance test for
 pyright (reportMissingImports)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/lib/core/harness/conformance/test_conformance.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/lib/core/harness/conformance/test_conformance.py b/tests/lib/core/harness/conformance/test_conformance.py
index 1d686c33a..d9eec1c15 100644
--- a/tests/lib/core/harness/conformance/test_conformance.py
+++ b/tests/lib/core/harness/conformance/test_conformance.py
@@ -7,7 +7,8 @@
 )
 from agentex.types.tool_request_content import ToolRequestContent
 from agentex.types.tool_response_content import ToolResponseContent
-from tests.lib.core.harness.conformance.runner import Fixture, register, derive_all, all_fixtures
+
+from .runner import Fixture, register, derive_all, all_fixtures
 
 register(
     Fixture(

From b538187a43577282544cc474c2c3b8f627a9f2cf Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 16:51:49 -0400
Subject: [PATCH 24/26] fix(harness): index-keyed routing, tool stream
 delivery, final_text last-segment, created_at (AGX1-377, AGX1-378)

auto_send.py:
- Replace single current_ctx with ctx_map[index] so parallel streams route correctly
- Open a streaming context for ALL content types on Start (not just text/reasoning),
  fixing tool_request/tool_response stream delivery (AGX1-377)
- Reset final_text_parts on each new Start(TextContent) and on Full(TextContent)
  so multi-step turns return the LAST text segment, not the full accumulation
- Add created_at: datetime | None param; forward to every
  streaming_task_message_context call (AGX1-378)

span_derivation.py:
- _on_full: handle Full(ToolRequestContent) by opening a tool span keyed by
  tool_call_id if not already open; adds LangGraph full-event harness support

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/auto_send.py     | 86 ++++++++++++-------
 .../lib/core/harness/span_derivation.py       | 14 +++
 2 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/src/agentex/lib/core/harness/auto_send.py b/src/agentex/lib/core/harness/auto_send.py
index b246cf38c..899429034 100644
--- a/src/agentex/lib/core/harness/auto_send.py
+++ b/src/agentex/lib/core/harness/auto_send.py
@@ -3,8 +3,10 @@
 from __future__ import annotations
 
 from typing import Any, AsyncIterator
+from datetime import datetime
 
 from agentex.types.text_delta import TextDelta
+from agentex.types.text_content import TextContent
 from agentex.lib.core.harness.types import TurnUsage, TurnResult, StreamTaskMessage
 from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.types.task_message_update import (
@@ -22,14 +24,27 @@ async def auto_send(
     tracer: SpanTracer | None = None,
     streaming: Any = None,
     usage: TurnUsage | None = None,
+    created_at: datetime | None = None,
 ) -> TurnResult:
     """Push the canonical stream to the task stream via adk.streaming.
 
-    Opens a streaming context per text/reasoning message, streams deltas via
+    Opens a streaming context per message (keyed by index), streams deltas via
     ctx.stream_update, and closes via ctx.close() on Done. Posts tool
     request/response full messages by opening a context with the content and
     closing it immediately (no deltas). Derives and traces spans from the same
-    stream. Returns the accumulated final text + usage.
+    stream. Returns the last text segment's text + usage.
+
+    Index-keyed routing: each Start(index=i) opens a context stored in
+    ctx_map[i]; Delta(index=i) routes to ctx_map.get(i); Done(index=i) closes
+    and removes ctx_map[i]. Events with index is None are skipped. The finally
+    block closes all remaining open contexts.
+
+    final_text last-segment semantics: a new Start(TextContent) resets
+    final_text_parts so that multi-step turns return the LAST text segment.
+    Full(TextContent) also overwrites final_text_parts (same semantics).
+
+    AGX1-378: created_at is forwarded to every streaming_task_message_context
+    call so callers can back-date message timestamps.
 
     Mirrors the open/close/stream_update pattern from
     src/agentex/lib/adk/_modules/_langgraph_async.py:
@@ -47,13 +62,12 @@ async def auto_send(
 
     deriver = SpanDeriver() if tracer is not None else None
     final_text_parts: list[str] = []
-    current_ctx: Any = None
+    ctx_map: dict[int, Any] = {}
 
-    async def _close_current() -> None:
-        nonlocal current_ctx
-        if current_ctx is not None:
-            await current_ctx.close()
-            current_ctx = None
+    async def _close_all() -> None:
+        for ctx in list(ctx_map.values()):
+            await ctx.close()
+        ctx_map.clear()
 
     try:
         async for event in events:
@@ -62,50 +76,64 @@ async def _close_current() -> None:
                     await tracer.handle(signal)
 
             if isinstance(event, StreamTaskMessageStart):
-                ctype = getattr(event.content, "type", None)
-                if ctype in ("text", "reasoning"):
-                    await _close_current()
-                    ctx = streaming.streaming_task_message_context(
-                        task_id=task_id,
-                        initial_content=event.content,
-                    )
-                    current_ctx = await ctx.__aenter__()
+                if event.index is None:
+                    continue
+                i = event.index
+                # Reset final_text_parts when a new text segment starts
+                if isinstance(event.content, TextContent):
+                    final_text_parts = []
+                ctx = streaming.streaming_task_message_context(
+                    task_id=task_id,
+                    initial_content=event.content,
+                    created_at=created_at,
+                )
+                ctx_map[i] = await ctx.__aenter__()
 
             elif isinstance(event, StreamTaskMessageDelta):
-                if current_ctx is not None and event.delta is not None:
+                if event.index is None:
+                    continue
+                ctx = ctx_map.get(event.index)
+                if ctx is not None and event.delta is not None:
                     # Reconstruct the delta with parent_task_message set from
                     # the context's task_message (mirrors _langgraph_async.py
                     # lines 72-78 and 117-127).
                     delta_with_parent = StreamTaskMessageDelta(
-                        parent_task_message=current_ctx.task_message,
+                        parent_task_message=ctx.task_message,
                         delta=event.delta,
                         type="delta",
                         index=event.index,
                     )
-                    await current_ctx.stream_update(delta_with_parent)
+                    await ctx.stream_update(delta_with_parent)
                     if isinstance(event.delta, TextDelta) and event.delta.text_delta:
                         final_text_parts.append(event.delta.text_delta)
 
             elif isinstance(event, StreamTaskMessageDone):
-                await _close_current()
+                if event.index is None:
+                    continue
+                ctx = ctx_map.pop(event.index, None)
+                if ctx is not None:
+                    await ctx.close()
 
             elif isinstance(event, StreamTaskMessageFull):
-                # Full messages (tool_request / tool_response): close any open
-                # streaming context first, then post the full message by opening
-                # a context with the content and closing it immediately
-                # (no deltas; StreamingTaskMessageContext.close() persists
-                # initial_content when the accumulator is empty). Use async with
-                # so the context is closed even if close() raises (__aexit__
-                # delegates to close()).
-                await _close_current()
+                # Full messages: post the full message by opening a context
+                # with the content and closing it immediately (no deltas;
+                # StreamingTaskMessageContext.close() persists initial_content
+                # when the accumulator is empty). Use async with so the context
+                # is closed even if close() raises (__aexit__ delegates to
+                # close()).
+                # Full(TextContent) also resets final_text_parts for
+                # last-segment semantics.
+                if isinstance(event.content, TextContent):
+                    final_text_parts = [event.content.content]
                 async with streaming.streaming_task_message_context(
                     task_id=task_id,
                     initial_content=event.content,
+                    created_at=created_at,
                 ):
                     pass
 
     finally:
-        await _close_current()
+        await _close_all()
         if deriver is not None and tracer is not None:
             for signal in deriver.flush():
                 await tracer.handle(signal)
diff --git a/src/agentex/lib/core/harness/span_derivation.py b/src/agentex/lib/core/harness/span_derivation.py
index d353cf9e0..503957582 100644
--- a/src/agentex/lib/core/harness/span_derivation.py
+++ b/src/agentex/lib/core/harness/span_derivation.py
@@ -105,7 +105,21 @@ def _on_delta(self, event: StreamTaskMessageDelta) -> list[SpanSignal]:
         return []
 
     def _on_full(self, event: StreamTaskMessageFull) -> list[SpanSignal]:
+        """Handle a Full event.
+
+        A `Full(ToolRequestContent)` opens a tool span (keyed by tool_call_id)
+        if it is not already open; the matching `Full(ToolResponseContent)`
+        closes it. This handles harnesses (e.g. LangGraph) that emit tool calls
+        as a single Full rather than Start+Done.
+        """
         content = event.content
+        if isinstance(content, ToolRequestContent):
+            tcid = content.tool_call_id
+            if tcid not in self._open_tool_ids:
+                self._open_tool_ids[tcid] = None
+                args = dict(content.arguments or {})
+                return [OpenSpan(key=tcid, kind="tool", name=content.name, input=args)]
+            return []
         if isinstance(content, ToolResponseContent):
             tcid = content.tool_call_id
             if tcid in self._open_tool_ids:

From dcd65b5507397bdbeee6daff6021927921a5d6aa Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 16:52:00 -0400
Subject: [PATCH 25/26] test(harness): add tests for AGX1-377 tool stream
 delivery, index routing, last-segment, created_at, Full ToolRequest spans

test_auto_send.py:
- Fix test 2: remove bare Start(ToolRequestContent) from events (old behavior was
  that Start did not open a ctx; new behavior does, so test was updated to use
  Full-only events that still verify the two-context behavior)
- Extend _FakeStreaming to record created_at on each context call
- Add test 6: streamed tool_request opens a ctx + routes deltas (AGX1-377 core)
- Add test 7: interleaved indexes route deltas to correct per-index contexts
- Add test 8: multi-step turns return the LAST text segment only
- Add test 9: Full(TextContent) contributes its content to final_text
- Add test 10: created_at is forwarded to every streaming context call (AGX1-378)

test_span_derivation.py:
- Add test_full_tool_request_opens_span: Full(ToolRequestContent) opens a span
- Add test_full_tool_request_and_response_paired: paired Full request+response
  produces a complete OpenSpan+CloseSpan
- Add test_full_tool_request_does_not_double_open: idempotent; a Full for an
  already-open tool_call_id is a no-op

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 tests/lib/core/harness/test_auto_send.py      | 215 ++++++++++++++++--
 .../lib/core/harness/test_span_derivation.py  |  89 ++++++++
 2 files changed, 287 insertions(+), 17 deletions(-)

diff --git a/tests/lib/core/harness/test_auto_send.py b/tests/lib/core/harness/test_auto_send.py
index e7331e67c..1948e9196 100644
--- a/tests/lib/core/harness/test_auto_send.py
+++ b/tests/lib/core/harness/test_auto_send.py
@@ -10,6 +10,7 @@
 """
 
 import types as _types
+from datetime import datetime
 
 import pytest
 
@@ -17,6 +18,7 @@
 from agentex.types.text_content import TextContent
 from agentex.lib.core.harness.tracer import SpanTracer
 from agentex.types.task_message_delta import TextDelta
+from agentex.types.tool_request_delta import ToolRequestDelta
 from agentex.types.task_message_update import (
     StreamTaskMessageDone,
     StreamTaskMessageFull,
@@ -64,10 +66,12 @@ class _FakeStreaming:
 
     def __init__(self):
         self.sink = []
+        self.recorded_created_at: list[datetime | None] = []
 
     def streaming_task_message_context(self, task_id, initial_content, streaming_mode="coalesced", created_at=None):
         ctype = getattr(initial_content, "type", None)
         self.sink.append(("ctx", ctype))
+        self.recorded_created_at.append(created_at)
         return _FakeCtx(self.sink, ctype, initial_content)
 
 
@@ -127,22 +131,10 @@ async def test_auto_send_streams_text_and_returns_final_text():
 async def test_auto_send_posts_full_tool_messages():
     streaming = _FakeStreaming()
     events = [
-        # A bare tool_request Start (no Done/Full) must NOT open a streaming
-        # context on its own — only Full events post messages.
-        StreamTaskMessageStart(
-            type="start",
-            index=0,
-            content=ToolRequestContent(
-                type="tool_request",
-                author="agent",
-                tool_call_id="c0",
-                name="Bash",
-                arguments={},
-            ),
-        ),
+        # Two Full events post two messages (open+close immediately, no deltas).
         StreamTaskMessageFull(
             type="full",
-            index=1,
+            index=0,
             content=ToolRequestContent(
                 type="tool_request",
                 author="agent",
@@ -153,7 +145,7 @@ async def test_auto_send_posts_full_tool_messages():
         ),
         StreamTaskMessageFull(
             type="full",
-            index=2,
+            index=1,
             content=ToolResponseContent(
                 type="tool_response",
                 author="agent",
@@ -167,8 +159,7 @@ async def test_auto_send_posts_full_tool_messages():
 
     assert result.final_text == ""
 
-    # The opened contexts correspond ONLY to the two Full events — the
-    # tool_request Start did not open a context.
+    # Each Full event opens and closes exactly one context.
     ctx_events = [s for s in streaming.sink if s[0] == "ctx"]
     assert len(ctx_events) == 2
     content_types = [s[1] for s in ctx_events]
@@ -307,3 +298,193 @@ async def _exploding_gen():
     # The text context that was opened mid-stream was closed by the finally block.
     assert ("open", "text") in [(s[0], s[1]) for s in streaming.sink]
     assert ("close", "text") in [(s[0], s[1]) for s in streaming.sink]
+
+
+# ---------------------------------------------------------------------------
+# Test 6: streamed tool_request delivered (AGX1-377 core)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_auto_send_streams_tool_request():
+    """A Start(ToolRequestContent) MUST open a streaming context (AGX1-377)."""
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="c_tool",
+                name="Bash",
+                arguments={},
+            ),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=0,
+            delta=ToolRequestDelta(
+                type="tool_request",
+                tool_call_id="c_tool",
+                name="Bash",
+                arguments_delta='{"cmd": "ls"}',
+            ),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+
+    assert result.final_text == ""
+
+    ctx_events = [s for s in streaming.sink if s[0] == "ctx"]
+    assert len(ctx_events) == 1
+    assert ctx_events[0][1] == "tool_request"
+
+    opens = [s for s in streaming.sink if s[0] == "open"]
+    closes = [s for s in streaming.sink if s[0] == "close"]
+    assert len(opens) == 1
+    assert len(closes) == 1
+
+    updates = [s for s in streaming.sink if s[0] == "update"]
+    assert len(updates) == 1
+
+
+# ---------------------------------------------------------------------------
+# Test 7: interleaved indexes route correctly
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_auto_send_interleaved_indexes_route_correctly():
+    """Deltas must be routed to the correct index-keyed context."""
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageStart(
+            type="start",
+            index=1,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=0,
+            delta=TextDelta(type="text", text_delta="A"),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=1,
+            delta=TextDelta(type="text", text_delta="B"),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageDone(type="done", index=1),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+
+    ctx_events = [s for s in streaming.sink if s[0] == "ctx"]
+    assert len(ctx_events) == 2
+
+    opens = [s for s in streaming.sink if s[0] == "open"]
+    assert len(opens) == 2
+
+    updates = [s for s in streaming.sink if s[0] == "update"]
+    assert len(updates) == 2
+
+    update_deltas = [s[1].delta for s in streaming.sink if s[0] == "update"]
+    text_deltas = [d.text_delta for d in update_deltas if isinstance(d, TextDelta)]
+    assert set(text_deltas) == {"A", "B"}
+
+
+# ---------------------------------------------------------------------------
+# Test 8: final_text returns last text segment for multi-step
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_auto_send_final_text_last_segment():
+    """final_text must be the LAST text segment, not accumulated across all turns."""
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=0,
+            delta=TextDelta(type="text", text_delta="First"),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageStart(
+            type="start",
+            index=1,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDelta(
+            type="delta",
+            index=1,
+            delta=TextDelta(type="text", text_delta="Second"),
+        ),
+        StreamTaskMessageDone(type="done", index=1),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+    assert result.final_text == "Second"
+
+
+# ---------------------------------------------------------------------------
+# Test 9: Full(TextContent) contributes to final_text
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_auto_send_full_text_content_sets_final_text():
+    """A Full(TextContent) must contribute its text to final_text."""
+    streaming = _FakeStreaming()
+    events = [
+        StreamTaskMessageFull(
+            type="full",
+            index=0,
+            content=TextContent(type="text", author="agent", content="hello"),
+        ),
+    ]
+    result = await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming)
+    assert result.final_text == "hello"
+
+
+# ---------------------------------------------------------------------------
+# Test 10: created_at is forwarded to streaming context (AGX1-378)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_auto_send_created_at_forwarded():
+    """created_at must be forwarded to every streaming_task_message_context call."""
+    streaming = _FakeStreaming()
+    dt = datetime(2025, 1, 15, 12, 0, 0)
+    events = [
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=TextContent(type="text", author="agent", content=""),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(
+            type="full",
+            index=1,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="c_ts",
+                name="Bash",
+                arguments={},
+            ),
+        ),
+    ]
+    await auto_send(_gen(events), task_id="task1", tracer=None, streaming=streaming, created_at=dt)
+
+    assert all(ts == dt for ts in streaming.recorded_created_at)
diff --git a/tests/lib/core/harness/test_span_derivation.py b/tests/lib/core/harness/test_span_derivation.py
index 7779de815..f22b83d54 100644
--- a/tests/lib/core/harness/test_span_derivation.py
+++ b/tests/lib/core/harness/test_span_derivation.py
@@ -166,3 +166,92 @@ def test_orphan_tool_response_ignored():
         ),
     ]
     assert _signals(d, events) == []
+
+
+def test_full_tool_request_opens_span():
+    """Full(ToolRequestContent) must open a tool span (for LangGraph-style harnesses)."""
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageFull(
+            type="full",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="call_x",
+                name="Bash",
+                arguments={"cmd": "ls"},
+            ),
+        ),
+    ]
+    sigs = _signals(d, events)
+    assert sigs[0] == OpenSpan(key="call_x", kind="tool", name="Bash", input={"cmd": "ls"})
+    assert sigs[1] == CloseSpan(key="call_x", output=None, is_complete=False)
+
+
+def test_full_tool_request_and_response_paired():
+    """Full(ToolRequestContent) + Full(ToolResponseContent) produces a complete span pair."""
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageFull(
+            type="full",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="call_y",
+                name="Grep",
+                arguments={},
+            ),
+        ),
+        StreamTaskMessageFull(
+            type="full",
+            index=1,
+            content=ToolResponseContent(
+                type="tool_response",
+                author="agent",
+                tool_call_id="call_y",
+                name="Grep",
+                content="result",
+            ),
+        ),
+    ]
+    sigs = _signals(d, events)
+    assert sigs == [
+        OpenSpan(key="call_y", kind="tool", name="Grep", input={}),
+        CloseSpan(key="call_y", output="result", is_complete=True),
+    ]
+
+
+def test_full_tool_request_does_not_double_open():
+    """A Full(ToolRequestContent) for an already-open tool_call_id is a no-op."""
+    d = SpanDeriver()
+    events = [
+        StreamTaskMessageStart(
+            type="start",
+            index=0,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="call_z",
+                name="X",
+                arguments={},
+            ),
+        ),
+        StreamTaskMessageDone(type="done", index=0),
+        StreamTaskMessageFull(
+            type="full",
+            index=1,
+            content=ToolRequestContent(
+                type="tool_request",
+                author="agent",
+                tool_call_id="call_z",
+                name="X",
+                arguments={},
+            ),
+        ),
+    ]
+    sigs = _signals(d, events)
+    opens = [s for s in sigs if isinstance(s, OpenSpan)]
+    assert len(opens) == 1
+    assert opens[0].key == "call_z"

From b4b8b33047e7e8fc30436bc95dfb81b53888682c Mon Sep 17 00:00:00 2001
From: Declan Brady <declan.brady@scale.com>
Date: Thu, 18 Jun 2026 17:01:54 -0400
Subject: [PATCH 26/26] feat(harness): thread created_at through
 UnifiedEmitter.auto_send_turn (AGX1-378)

So migration helpers can restore the deterministic first-message timestamp on
the temporal path. Default None preserves current behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 src/agentex/lib/core/harness/emitter.py | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/agentex/lib/core/harness/emitter.py b/src/agentex/lib/core/harness/emitter.py
index 681c859ea..85395fcff 100644
--- a/src/agentex/lib/core/harness/emitter.py
+++ b/src/agentex/lib/core/harness/emitter.py
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 from typing import AsyncGenerator
+from datetime import datetime
 
 from agentex.lib.core.harness.types import TurnResult, HarnessTurn, StreamTaskMessage
 from agentex.lib.core.harness.tracer import SpanTracer
@@ -56,12 +57,18 @@ async def yield_turn(self, turn: HarnessTurn) -> AsyncGenerator[StreamTaskMessag
         async for event in yield_events(turn.events, tracer=self.tracer):
             yield event
 
-    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
-        """Async/temporal delivery: push to the task stream, return TurnResult."""
+    async def auto_send_turn(self, turn: HarnessTurn, created_at: datetime | None = None) -> TurnResult:
+        """Async/temporal delivery: push to the task stream, return TurnResult.
+
+        Pass `created_at` (e.g. `workflow.now()` under Temporal) to stamp the
+        turn's messages with a deterministic timestamp; it is forwarded to the
+        streaming contexts. Default None preserves server-side timestamps.
+        """
         return await auto_send(
             turn.events,
             task_id=self.task_id,
             tracer=self.tracer,
             streaming=self._streaming,
             usage=turn.usage(),
+            created_at=created_at,
         )