
fix(harness): assert cross-channel (yield vs auto-send) conformance equivalence [AGX1-373] #414

Open
declan-scale wants to merge 3 commits into declan-scale/unified-harness-surface from declan-scale/agx1-373-conformance-equivalence


declan-scale commented Jun 18, 2026


Summary

Fast-follow on the unified harness surface foundation. Upgrades the conformance runner to actually assert cross-channel equivalence between yield_events and auto_send, replacing the prior determinism-only test that merely ran the same deriver twice.

Equivalence approach

Both channels are driven over each fixture using in-test fakes (mirroring patterns from test_yield_delivery.py and test_auto_send.py). The results are normalised to LogicalDelivery(content_type, identity) tuples that strip the streaming-envelope difference:

  • yield channel delivers StreamTaskMessageFull(ToolResponseContent) verbatim.
  • auto_send channel delivers the same content by opening a streaming context with initial_content and closing it immediately (no deltas).

Both collapse to LogicalDelivery("tool_response", frozenset({("tool_call_id", ...), ("name", ...)})) and compare equal.
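
To make the collapse concrete, here is a minimal sketch of the normalisation for a tool response. It is not the code in runner.py: `ToolResponse`, `from_yield`, `from_auto_send`, and the `("open", ...)` / `("close",)` sink entries are hypothetical stand-ins chosen for illustration; only the `LogicalDelivery(content_type, identity)` shape is taken from the description above.

```python
from dataclasses import dataclass
from typing import NamedTuple


class LogicalDelivery(NamedTuple):
    content_type: str
    identity: frozenset


# Hypothetical stand-in for ToolResponseContent (illustration only).
@dataclass(frozen=True)
class ToolResponse:
    tool_call_id: str
    name: str


def _normalise(content: ToolResponse) -> LogicalDelivery:
    # Keep only the fields that identify the delivery; drop the envelope.
    identity = frozenset(
        {("tool_call_id", content.tool_call_id), ("name", content.name)}
    )
    return LogicalDelivery("tool_response", identity)


def from_yield(full_event_content: ToolResponse) -> LogicalDelivery:
    # Yield channel: the Full event carries the content verbatim.
    return _normalise(full_event_content)


def from_auto_send(sink: list) -> list[LogicalDelivery]:
    # auto_send channel (assumed sink shape): an ("open", content) entry
    # followed immediately by a ("close",) entry, with no deltas between.
    out = []
    pending = None
    for entry in sink:
        if entry[0] == "open":
            pending = entry[1]
        elif entry[0] == "close" and pending is not None:
            out.append(_normalise(pending))
            pending = None
    return out


content = ToolResponse(tool_call_id="call-1", name="search")
yield_side = [from_yield(content)]
auto_side = from_auto_send([("open", content), ("close",)])
assert yield_side == auto_side
```

Because the identity is a frozenset of key/value pairs, the comparison ignores field order and envelope type and depends only on what was delivered.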

Text/reasoning deliveries are normalised to sequential position within their type (since auto_send has no event index in its streaming sink).
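
The re-keying can be sketched as follows, assuming each delivery arrives as a `(content_type, event_index)` pair. The function name `rekey_by_sequence` and the `"seq"` identity key are illustrative assumptions, not names confirmed by the PR.

```python
from collections import defaultdict
from typing import NamedTuple


class LogicalDelivery(NamedTuple):
    content_type: str
    identity: frozenset


def rekey_by_sequence(deliveries):
    """Replace each event index with the delivery's ordinal within its type.

    `deliveries` is a list of (content_type, event_index) pairs in arrival
    order; the index is discarded because auto_send cannot supply one.
    """
    counters = defaultdict(int)
    out = []
    for content_type, _index in deliveries:
        seq = counters[content_type]
        counters[content_type] += 1
        out.append(LogicalDelivery(content_type, frozenset({("seq", seq)})))
    return out


# Yield side knows event indices; the auto_send side does not.
yield_side = rekey_by_sequence([("text", 0), ("reasoning", 1), ("text", 4)])
auto_side = rekey_by_sequence([("text", None), ("reasoning", None), ("text", None)])
assert yield_side == auto_side
```

The trade-off is that a channel which reorders two deliveries of the same type with identical content would not be detected by position alone.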

Span signals are asserted identical: both channels call SpanDeriver.observe() on the same event sequence, so the derived signals must match.

Full-message decision: keep open+immediate-close

auto_send retains the existing approach of posting a StreamTaskMessageFull (tool_request/tool_response) via streaming_task_message_context(...).__aenter__() + immediate close(). Rationale:

  • StreamingTaskMessageContext.close() persists initial_content when the accumulator is empty, so the message is correctly written.
  • This mirrors the _langgraph_async.py pattern already in production.
  • Switching to adk.messages.create would require a new injectable dependency for no observable benefit.

The envelope difference (Full vs Start+Done on the wire) is documented as an acceptable design choice in runner.py alongside the decision rationale.
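
The open+immediate-close pattern can be illustrated with a toy context. `FakeStreamingContext` and `post_full_message` below are hypothetical; they only mimic the behaviour described above (close() persists initial_content when no deltas were accumulated) and are not the real StreamingTaskMessageContext.

```python
import asyncio


class FakeStreamingContext:
    """Toy stand-in for a streaming context (illustration only)."""

    def __init__(self, initial_content, store):
        self.initial_content = initial_content
        self._deltas = []
        self._store = store
        self._closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.close()

    async def stream_update(self, delta):
        self._deltas.append(delta)

    async def close(self):
        if self._closed:  # closing twice must not persist twice
            return
        self._closed = True
        if not self._deltas:
            # Empty accumulator: persist the seed content unchanged.
            self._store.append(self.initial_content)
        else:
            self._store.append("".join(self._deltas))


async def post_full_message(content, store):
    # Open the context with the full content, then close at once (no deltas).
    ctx = await FakeStreamingContext(content, store).__aenter__()
    await ctx.close()


store = []
message = {"type": "tool_response", "name": "search"}
asyncio.run(post_full_message(message, store))
assert store == [message]
```

Under these assumptions the persisted message is the same as a direct create call would produce, which is why the extra dependency buys nothing observable.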

Fixtures

  • builtin-single-tool — retained (existing fixture, tool request+response cycle)
  • streaming-text — new: text Start/delta/delta/Done path
  • reasoning-block — new: reasoning Start/delta/Done (exercises reasoning span open/close)

Results

  • ./scripts/test tests/lib/core/harness/ → 35 passed on Python 3.12 and 3.13
  • uv run pyright src/agentex/lib/core/harness/ → 0 errors

🤖 Generated with Claude Code

Greptile Summary

This fast-follow PR upgrades the conformance runner from a determinism-only check to a genuine cross-channel equivalence assertion between yield_events and auto_send, directly addressing five previously-reported weaknesses in the prior implementation.

  • Adds LogicalDelivery(content_type, identity, payload) normalisation that compares actual delivered content (including initial_content seeds, delta accumulation, and tool arguments/response values) across both channels for four fixtures covering text, reasoning, tool-request, and tool-response paths.
  • Replaces the tautological span comparison with _RecordingTracer, which intercepts and records every SpanSignal each channel's tracer actually receives, so a channel that skips deriver.observe() for any event type is caught.
  • Fixes AGX1-377 by removing suppression of Start(tool_request)+Done deliveries and verifying auto_send now delivers them, confirmed by the new streamed-tool-request fixture.
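
Why recording at the tracer matters can be shown with a small sketch. Everything here is a simplified assumption: `SpanSignal`, `NullTracer`, `derive`, and the two channel functions are toy stand-ins, and this `RecordingTracer` wraps an inner tracer rather than reproducing the PR's `_RecordingTracer`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSignal:
    kind: str  # "open" or "close"
    name: str


class NullTracer:
    def handle(self, signal: SpanSignal) -> None:
        pass


class RecordingTracer:
    """Records every signal a channel actually hands to the tracer."""

    def __init__(self, inner):
        self._inner = inner
        self.received_signals: list[SpanSignal] = []

    def handle(self, signal: SpanSignal) -> None:
        self.received_signals.append(signal)
        self._inner.handle(signal)


def derive(event):
    # Toy deriver: Start opens a span, Done closes it, deltas emit nothing.
    phase, content_type = event
    if phase == "start":
        return SpanSignal("open", content_type)
    if phase == "done":
        return SpanSignal("close", content_type)
    return None


def faithful_channel(events, tracer):
    for event in events:
        signal = derive(event)
        if signal is not None:
            tracer.handle(signal)


def lossy_channel(events, tracer):
    # Simulated bug: reasoning events never reach the deriver.
    for event in events:
        if event[1] == "reasoning":
            continue
        signal = derive(event)
        if signal is not None:
            tracer.handle(signal)


events = [
    ("start", "text"), ("delta", "text"), ("done", "text"),
    ("start", "reasoning"), ("done", "reasoning"),
]
good = RecordingTracer(NullTracer())
bad = RecordingTracer(NullTracer())
faithful_channel(events, good)
lossy_channel(events, bad)
assert good.received_signals != bad.received_signals
```

Running the deriver twice over the same events in the test would compare equal regardless of what either channel does; recording what each tracer receives is what exposes the lossy channel.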

Confidence Score: 5/5

Safe to merge — all five previously-flagged gaps are addressed and the two remaining observations are minor comment/coverage nits in test helper code.

Both changed files are test infrastructure only. The new RecordingTracer genuinely captures per-channel span signals, the LogicalDelivery payload field covers initial_content seeding and delta accumulation, and the tool-request suppression is correctly removed. No production code is touched and the conformance suite now runs 35 passing tests.
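
The payload rules can be sketched roughly as below. The helper names `text_payload` and `tool_request_payload` are hypothetical; the join of deltas, the initial_content seed, and `json.dumps(..., sort_keys=True)` follow the commit messages in this PR, but the real helpers may be structured differently.

```python
import json
from typing import NamedTuple


class LogicalDelivery(NamedTuple):
    content_type: str
    identity: frozenset
    payload: str = ""


def text_payload(initial_content: str, deltas: list[str]) -> str:
    # Seed with the Start event's content, then append accumulated deltas.
    return initial_content + "".join(deltas)


def tool_request_payload(arguments: dict) -> str:
    # sort_keys gives a canonical string, so dict ordering cannot cause
    # a spurious mismatch between channels.
    return json.dumps(arguments, sort_keys=True)


seeded = LogicalDelivery(
    "text", frozenset({("seq", 0)}), text_payload("Init", ["ial ", "text"])
)
# A channel that drops initial_content produces a different payload.
dropped = LogicalDelivery(
    "text", frozenset({("seq", 0)}), text_payload("", ["ial ", "text"])
)
assert seeded != dropped
assert tool_request_payload({"b": 1, "a": 2}) == tool_request_payload({"a": 2, "b": 1})
```

With an empty seed the two deliveries would compare equal, which is why the fixture needs a non-empty initial_content to exercise this path.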

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/lib/core/harness/conformance/runner.py | Core conformance engine rewritten: adds LogicalDelivery normalisation with payload comparison, fake streaming/tracing backends, RecordingTracer for genuine per-channel span capture, and run_cross_channel_conformance. All previously-flagged gaps (tautological spans, missing payloads, suppressed tool requests) addressed. |
| tests/lib/core/harness/conformance/test_conformance.py | Adds test_cross_channel_equivalence parametrized over four fixtures and three new fixtures (streaming-text, reasoning-block, streamed-tool-request). Backward-compatible determinism test retained. One inaccurate comment in builtin-single-tool fixture definition. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant T as test_cross_channel_equivalence
    participant R as run_cross_channel_conformance
    participant Y as yield_events
    participant RT_Y as _RecordingTracer (yield)
    participant A as auto_send
    participant RT_A as _RecordingTracer (auto)
    participant FS as _FakeStreaming

    T->>R: fixture
    R->>RT_Y: "_RecordingTracer(tracing=_FakeTracing())"
    R->>Y: "yield_events(_gen(fixture.events), tracer=RT_Y)"
    Y-->>RT_Y: handle(SpanSignal) per event
    RT_Y-->>R: received_signals → yield_spans
    Y-->>R: yield_out (events verbatim)
    R->>R: _yield_logical_deliveries(yield_out)
    R->>R: _yield_text_reasoning_seq(...) → yield_deliveries
    R->>RT_A: "_RecordingTracer(tracing=_FakeTracing())"
    R->>FS: _FakeStreaming()
    R->>A: "auto_send(_gen(fixture.events), tracer=RT_A, streaming=FS)"
    A-->>RT_A: handle(SpanSignal) per event
    RT_A-->>R: received_signals → auto_spans
    A-->>FS: streaming_task_message_context → ctx/open/update/close entries
    R->>R: _auto_send_logical_deliveries(FS.sink) → auto_deliveries
    R-->>T: yield_deliveries, auto_deliveries, yield_spans, auto_spans
    T->>T: "assert yield_deliveries == auto_deliveries"
    T->>T: "assert yield_spans == auto_spans"

Reviews (7): Last reviewed commit: "test(harness): propagate AGX1-377/378 fi..."

@declan-scale declan-scale force-pushed the declan-scale/unified-harness-surface branch from d21c54a to ebc468d Compare June 18, 2026 17:29
Comment thread tests/lib/core/harness/conformance/runner.py Outdated
Comment thread tests/lib/core/harness/conformance/runner.py Outdated
Comment thread tests/lib/core/harness/conformance/runner.py Outdated
Comment thread tests/lib/core/harness/conformance/runner.py Outdated
@declan-scale declan-scale force-pushed the declan-scale/agx1-373-conformance-equivalence branch 3 times, most recently from b4c53ca to cae14d4 Compare June 18, 2026 19:24
@declan-scale


@greptile review

Comment thread tests/lib/core/harness/conformance/runner.py
declan-scale and others added 3 commits June 18, 2026 17:03
…quivalence [AGX1-373]

Rebased on the pyright-clean foundation. Includes @override on _RecordingTracer.handle
and relative conformance imports so the whole-repo pyright (scripts/lint) passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arison

- Add `payload: str` field to LogicalDelivery (NamedTuple, default "").
- _yield_logical_deliveries: track TextDelta / ReasoningContentDelta
  accumulation per-index; include "".join(deltas) as payload for text/
  reasoning deliveries. Include json.dumps(arguments, sort_keys=True) as
  payload for tool_request; str(content) for tool_response.
- _auto_send_logical_deliveries: collect ("update", delta) entries from
  the _FakeCtx sink between open and close; extract TextDelta /
  ReasoningContentDelta text and accumulate. Carry same tool payload
  fields.
- _yield_text_reasoning_seq: forward payload through when re-keying
  index → seq.
- All 35 harness tests pass; ruff + pyright clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ess streamed tool-request delivery, include initial_content in payload

- Remove the Start(tool_request)+Done suppression in _yield_logical_deliveries:
  auto_send now delivers streamed tool-request messages (AGX1-377 fix), so both
  channels emit a LogicalDelivery for a streamed tool_request. The cross-channel
  assertion verifies delivery on both sides.

- Include StreamTaskMessageStart.content in payload comparison for text and
  reasoning types: TextContent.content is prepended to accumulated deltas;
  ReasoningContent.summary items are prepended. This catches a channel that
  drops initial_content or reasoning summary (Greptile id 3438655533, P1).
  _auto_send_logical_deliveries mirrors the same seeding from ctx initial_content.

- Add "streamed-tool-request" fixture (Start + Done, no Full) to confirm
  delivery on both channels under the new auto_send behaviour.

- Update "streaming-text" fixture to use non-empty initial_content ("Init") so
  the initial_content seeding is actually exercised by the test.

- Update module/docstring comments that referenced the AGX1-377 suppression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@declan-scale declan-scale force-pushed the declan-scale/agx1-373-conformance-equivalence branch from 8cd851c to 2e820c7 Compare June 18, 2026 21:08
@declan-scale


@greptile review
