Add --llm and --llm-reduce flags to assembly eval #184

Merged
alexkroman merged 1 commit into main from claude/focused-clarke-utzabg
Jun 16, 2026

Conversation

@alexkroman

Collaborator

Adds LLM-Gateway integration to the assembly eval command, enabling post-transcription text transformation and cross-item analysis.

Summary

Extends assembly eval with two new optional flags that integrate with LLM Gateway:

  • --llm: Runs a chain over each transcript (the WER score still uses the raw transcript)
  • --llm-reduce: Runs a prompt over all items' results to summarize patterns across the run

The implementation mirrors the existing --llm / --llm-reduce behavior in assembly stream, adapted for the eval context where results are collected before processing.

Key Changes

  • New command options (aai_cli/commands/evaluate/__init__.py):

    • --llm: Repeatable option to transform transcripts through LLM Gateway
    • --llm-reduce: Repeatable option for cross-item aggregation
    • --model and --max-tokens: Gateway configuration options
    • Updated help text and examples
  • Core evaluation logic (aai_cli/commands/evaluate/_exec.py):

    • Added llm_prompt, llm_reduce, model, max_tokens fields to EvalOptions
    • New _LlmOptions frozen dataclass to encapsulate LLM settings
    • _ItemResult now carries hypothesis (transcript text) for post-scoring LLM processing
    • _run_llm_map(): Applies the per-item chain, attaching {"model", "steps"} under each row's llm key
    • _run_reduce(): Aggregates all transcripts/LLM outputs and runs the reduce chain once
    • Helper functions: _reduce_input(), _gather_reduce_inputs(), _final_llm_output(), _llm_block(), _reduce_block()
    • Updated _render() to display LLM outputs in human-readable format
    • Updated run_evaluate() to orchestrate the map and reduce phases
  • Comprehensive test suite (tests/test_eval_llm.py):

    • 294 lines of tests covering map phase, reduce phase, error handling, and rendering
    • Tests verify WER scoring uses raw transcripts (not LLM outputs)
    • Tests confirm failed rows are skipped gracefully
    • Tests validate model/token parameters reach the gateway
    • Tests ensure reduce skips the gateway call when there's nothing to aggregate
    • Helper function tests for reduce input gathering and output extraction
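
The map phase described above can be sketched roughly as follows. This is a minimal illustration, not the code from `_exec.py`: `LlmOptions`, its field names and defaults, the `gateway` callable, the `error` key on failed rows, and the `{"prompt", "output"}` shape of each step are all assumptions made for the example. It also assumes each prompt in the chain receives the previous step's output.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for the LLM Gateway call:
# (prompt, text, model, max_tokens) -> output text.
Gateway = Callable[[str, str, str, int], str]


@dataclass(frozen=True)
class LlmOptions:
    """Illustrative counterpart to _LlmOptions; fields and defaults are assumed."""

    prompts: tuple[str, ...] = ()
    model: str = "example-model"
    max_tokens: int = 1000


def run_llm_map(rows: list[dict], opts: LlmOptions, gateway: Gateway) -> None:
    """Run the prompt chain over each successful row and attach it under row["llm"]."""
    if not opts.prompts:
        return  # no --llm prompts given: leave the rows untouched
    for row in rows:
        if row.get("error") is not None:
            continue  # failed transcription: skip the row
        text = row["hypothesis"]  # the raw transcript stays in the row unchanged
        steps = []
        for prompt in opts.prompts:
            # Assumed chaining: each step's output is the next step's input.
            text = gateway(prompt, text, opts.model, opts.max_tokens)
            steps.append({"prompt": prompt, "output": text})
        row["llm"] = {"model": opts.model, "steps": steps}
```

Passing the gateway in as a callable keeps the sketch testable without network access, which is roughly what the listed tests for "model/token parameters reach the gateway" would need.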

Notable Implementation Details

  • WER scoring invariant: The WER score always uses the raw transcript, not the LLM-transformed output. The --llm chain never affects scoring; its output is reported alongside the score and feeds the reduce.
  • Map-reduce pattern: The --llm flag is a map (per-item) while --llm-reduce is a reduce (across all items). Both are optional and independent.
  • Graceful degradation: Failed rows (transcription errors) are skipped in both map and reduce phases; the reduce also skips the gateway call if there's nothing to aggregate.
  • Payload structure: LLM outputs land in the JSON payload under rows[].llm (map) and at the top level under reduce (reduce), alongside the existing WER metrics.
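
To make the scoring invariant and the payload layout concrete, here is a small sketch. It assumes the textbook word-level WER (edit distance divided by reference length); the PR does not say how `assembly eval` computes WER. Only `rows[].llm`, the top-level `reduce`, and the `{"model", "steps"}` keys of the `llm` block come from the description above. The `wer`, `reference`, and `hypothesis` keys, the shape of the `reduce` block, and the contents of each step are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def build_payload(reference, hypothesis, llm_output, reduce_output,
                  model="example-model"):
    """Assemble an illustrative payload; llm_output never reaches the scorer."""
    row = {
        "reference": reference,
        "hypothesis": hypothesis,
        # Scored on the raw transcript, not on llm_output.
        "wer": word_error_rate(reference, hypothesis),
        "llm": {"model": model, "steps": [{"output": llm_output}]},
    }
    return {
        "rows": [row],
        # Assumed shape for the top-level reduce block.
        "reduce": {"model": model, "steps": [{"output": reduce_output}]},
    }
```

Because `word_error_rate` only ever sees `hypothesis`, changing the `--llm` prompts in this sketch cannot move the score.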

https://claude.ai/code/session_0133Txdj2E5So6ZU1293W4px

`assembly eval` now takes the same LLM-Gateway flags as `transcribe`:

- `--llm` (repeatable) runs a prompt chain over each transcribed
  hypothesis and attaches the result under the row's `llm` key. The WER
  score still uses the raw transcript; the chain output is attached as
  extra data and feeds the reduce.
- `--llm-reduce` (repeatable) runs one chain over every item's result
  (last `--llm` output, else the transcript text) and adds a top-level
  `reduce` block to the single JSON payload (rendered as a section in
  human mode). Skips the billable call when there's nothing to aggregate.
- `--model` / `--max-tokens` select the gateway model and token budget.
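
The reduce rule above (last `--llm` output, else the transcript text, and no gateway call when there is nothing to aggregate) might look like the sketch below. The function names, the `error` key, the step shape, the blank-line join, and the `call_gateway` callable are assumptions for illustration; they are not taken from `_exec.py`.

```python
from typing import Callable, Optional


def final_output(row: dict) -> Optional[str]:
    """Pick what a row contributes to the reduce: last chain output, else transcript."""
    if row.get("error") is not None:
        return None  # failed rows contribute nothing
    steps = (row.get("llm") or {}).get("steps") or []
    if steps:
        return steps[-1]["output"]
    return row.get("hypothesis")


def run_reduce(
    rows: list[dict],
    prompts: list[str],
    call_gateway: Callable[[str, str], str],  # hypothetical: (prompt, text) -> output
) -> Optional[dict]:
    """Run the reduce chain once over all contributions, or skip it entirely."""
    inputs = [text for text in map(final_output, rows) if text]
    if not prompts or not inputs:
        return None  # nothing to aggregate: skip the billable call
    text = "\n\n".join(inputs)  # assumed way of combining the items
    steps = []
    for prompt in prompts:
        text = call_gateway(prompt, text)
        steps.append({"prompt": prompt, "output": text})
    return {"steps": steps}
```

Returning `None` rather than an empty block lets the caller omit the `reduce` section altogether when every row failed.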

https://claude.ai/code/session_0133Txdj2E5So6ZU1293W4px
@alexkroman alexkroman enabled auto-merge June 16, 2026 19:22
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit 6684d19 Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the claude/focused-clarke-utzabg branch June 16, 2026 19:31