Add --llm and --llm-reduce flags to `assembly eval` by alexkroman · Pull Request #184 · AssemblyAI/cli

alexkroman · 2026-06-16T19:22:49Z

Adds LLM-Gateway integration to the assembly eval command, enabling post-transcription text transformation and cross-item analysis.

Summary

Extends assembly eval with two new optional flags that integrate with LLM Gateway:

- --llm: Runs a chain over each transcript (the WER score still uses the raw transcript)
- --llm-reduce: Runs a prompt over all items' results to summarize patterns across the run

The implementation mirrors the existing --llm / --llm-reduce behavior in assembly stream, adapted for the eval context where results are collected before processing.
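
As a rough illustration of what a repeatable --llm chain could look like, the sketch below assumes each prompt is applied to the previous step's output, so the last step holds the final result. `run_chain`, `fake_gateway`, and the `prompt`/`output` step fields are hypothetical names chosen for this example, not the CLI's actual code, and the gateway call is replaced by a local stub.

```python
from typing import Callable

# Hypothetical stand-in for an LLM Gateway call: (prompt, text) -> output text.
Gateway = Callable[[str, str], str]


def fake_gateway(prompt: str, text: str) -> str:
    """Stub that makes the data flow visible without any network call."""
    return f"[{prompt}] {text}"


def run_chain(prompts: list[str], text: str, call: Gateway) -> list[dict[str, str]]:
    """Apply each prompt in order, feeding every step the previous step's output."""
    steps = []
    current = text
    for prompt in prompts:
        current = call(prompt, current)
        steps.append({"prompt": prompt, "output": current})
    return steps


# Two repeated prompts form a two-step chain (assumed semantics).
steps = run_chain(["fix punctuation", "summarize"], "hello world", fake_gateway)
final_output = steps[-1]["output"]
```

Under these assumptions, `final_output` is the value a later reduce phase would pick up for this item.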

Key Changes

New command options (aai_cli/commands/evaluate/__init__.py):
- --llm: Repeatable option to transform transcripts through LLM Gateway
- --llm-reduce: Repeatable option for cross-item aggregation
- --model and --max-tokens: Gateway configuration options
- Updated help text and examples
Core evaluation logic (aai_cli/commands/evaluate/_exec.py):
- Added llm_prompt, llm_reduce, model, max_tokens fields to EvalOptions
- New _LlmOptions frozen dataclass to encapsulate LLM settings
- _ItemResult now carries hypothesis (transcript text) for post-scoring LLM processing
- _run_llm_map(): Applies the per-item chain, attaching {"model", "steps"} under each row's llm key
- _run_reduce(): Aggregates all transcripts/LLM outputs and runs the reduce chain once
- Helper functions: _reduce_input(), _gather_reduce_inputs(), _final_llm_output(), _llm_block(), _reduce_block()
- Updated _render() to display LLM outputs in human-readable format
- Updated run_evaluate() to orchestrate the map and reduce phases
Comprehensive test suite (tests/test_eval_llm.py):
- 294 lines of tests covering map phase, reduce phase, error handling, and rendering
- Tests verify WER scoring uses raw transcripts (not LLM outputs)
- Tests confirm failed rows are skipped gracefully
- Tests validate model/token parameters reach the gateway
- Tests ensure reduce skips the gateway call when there's nothing to aggregate
- Helper function tests for reduce input gathering and output extraction
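
The map phase described above can be sketched roughly as follows. This is a minimal illustration, not the code in `_exec.py`: `Row`, `run_llm_map`, the step field names, and the model name `example-model` are all assumptions made for the example, and the gateway is a lambda stub. It shows the two properties the tests check: the WER value is never touched by the chain, and failed rows are skipped.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Row:
    """Hypothetical stand-in for one eval result row."""
    hypothesis: Optional[str]            # raw transcript; None if transcription failed
    wer: Optional[float]                 # already scored against the raw transcript
    error: Optional[str] = None
    llm: Optional[dict[str, Any]] = None


def run_llm_map(rows: list[Row], prompts: list[str], model: str,
                call: Callable[[str, str], str]) -> None:
    """Run the per-item chain and attach {"model", "steps"} under each row's llm key."""
    for row in rows:
        if row.error is not None or row.hypothesis is None:
            continue  # failed rows are skipped; no gateway call is made for them
        text = row.hypothesis
        steps = []
        for prompt in prompts:
            text = call(prompt, text)
            steps.append({"prompt": prompt, "output": text})
        # Only the llm key is written; row.wer stays as computed from the raw transcript.
        row.llm = {"model": model, "steps": steps}


rows = [
    Row(hypothesis="the quick brown fox", wer=0.25),
    Row(hypothesis=None, wer=None, error="transcription failed"),
]
run_llm_map(rows, ["uppercase it"], "example-model", lambda prompt, text: text.upper())
```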

Notable Implementation Details

WER scoring invariant: The WER score always uses the raw transcript, not the LLM-transformed output. The --llm chain is purely informational.
Map-reduce pattern: The --llm flag is a map (per-item) while --llm-reduce is a reduce (across all items). Both are optional and independent.
Graceful degradation: Failed rows (transcription errors) are skipped in both map and reduce phases; the reduce also skips the gateway call if there's nothing to aggregate.
Payload structure: LLM outputs land in the JSON payload under rows[].llm (map) and at the top level under reduce (reduce), alongside the existing WER metrics.
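
A minimal sketch of the reduce phase and the resulting payload shape, under stated assumptions: `final_text` and `run_reduce` are hypothetical helpers (the PR's own helpers are `_final_llm_output()`, `_gather_reduce_inputs()`, and `_run_reduce()`, whose signatures are not shown here), the inputs are joined with blank lines, and the reduce block is assumed to carry the same `{"model", "steps"}` shape as the per-row block.

```python
from typing import Any, Callable, Optional


def final_text(row: dict[str, Any]) -> Optional[str]:
    """Last --llm step output if present, else the transcript; None for failed rows."""
    if row.get("error"):
        return None
    llm = row.get("llm")
    if llm and llm.get("steps"):
        return llm["steps"][-1]["output"]
    return row.get("hypothesis")


def run_reduce(rows: list[dict[str, Any]], prompts: list[str], model: str,
               call: Callable[[str, str], str]) -> Optional[dict[str, Any]]:
    """Run the reduce chain once over all usable items, or skip it entirely."""
    inputs = [t for t in (final_text(r) for r in rows) if t]
    if not inputs:
        return None  # nothing to aggregate: skip the billable gateway call
    text = "\n\n".join(inputs)  # joining strategy is an assumption
    steps = []
    for prompt in prompts:
        text = call(prompt, text)
        steps.append({"prompt": prompt, "output": text})
    return {"model": model, "steps": steps}


calls: list[str] = []


def stub(prompt: str, text: str) -> str:
    """Records each call so skipped gateway calls are observable."""
    calls.append(prompt)
    return f"<{prompt}> {text}"


rows = [
    {"hypothesis": "a b", "wer": 0.1,
     "llm": {"model": "example-model", "steps": [{"prompt": "p", "output": "A B"}]}},
    {"hypothesis": "c d", "wer": 0.2},        # no --llm output: falls back to transcript
    {"error": "transcription failed"},        # failed row: skipped
]
reduce_block = run_reduce(rows, ["find patterns"], "example-model", stub)
empty = run_reduce([{"error": "x"}], ["find patterns"], "example-model", stub)

# Map output lives under rows[].llm, reduce output at the top level under "reduce".
payload = {"rows": rows, "reduce": reduce_block}
```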

https://claude.ai/code/session_0133Txdj2E5So6ZU1293W4px

`assembly eval` now takes the same LLM-Gateway flags as `transcribe`:

- `--llm` (repeatable) runs a prompt chain over each transcribed hypothesis and attaches the result under the row's `llm` key. The WER score still uses the raw transcript; the chain output rides along as extra data and feeds the reduce.
- `--llm-reduce` (repeatable) runs one chain over every item's result (last `--llm` output, else the transcript text) and adds a top-level `reduce` block to the single JSON payload (rendered as a section in human mode). Skips the billable call when there's nothing to aggregate.
- `--model` / `--max-tokens` select the gateway model and token budget.

alexkroman enabled auto-merge June 16, 2026 19:22

alexkroman added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit 6684d19 Jun 16, 2026
19 checks passed

alexkroman deleted the claude/focused-clarke-utzabg branch June 16, 2026 19:31

alexkroman merged 1 commit into main from claude/focused-clarke-utzabg
