Add --llm and --llm-reduce flags to assembly eval #184
Merged
`assembly eval` now takes the same LLM-Gateway flags as `transcribe`:

- `--llm` (repeatable) runs a prompt chain over each transcribed hypothesis and attaches the output under the row's `llm` key. The WER score still uses the raw transcript; the chain output rides along as extra data and feeds the reduce.
- `--llm-reduce` (repeatable) runs one chain over every item's result (last `--llm` output, else the transcript text) and adds a top-level `reduce` block to the single JSON payload (rendered as a section in human mode). It skips the billable call when there's nothing to aggregate.
- `--model` / `--max-tokens` select the gateway model and token budget.

https://claude.ai/code/session_0133Txdj2E5So6ZU1293W4px
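The map and reduce behavior described above can be sketched roughly as follows. This is an illustration, not the code from this PR: `Row`, `run_chain`, `run_map`, `reduce_input`, and `run_reduce` are hypothetical names, the gateway is replaced by a plain callable, and three details are assumptions (each chain step receives the previous step's output, `steps` is a list of strings, and reduce inputs are joined with blank lines).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a gateway call: (prompt, input_text) -> output_text.
GatewayCall = Callable[[str, str], str]


@dataclass
class Row:
    """One eval item. Field names are illustrative, not the CLI's real types."""
    hypothesis: str             # raw transcript text; WER is scored on this
    llm: Optional[dict] = None  # filled by the map step when --llm is given


def run_chain(prompts: list[str], text: str, call: GatewayCall) -> list[str]:
    """Run the prompts in order; each step sees the previous output (assumption)."""
    outputs = []
    current = text
    for prompt in prompts:
        current = call(prompt, current)
        outputs.append(current)
    return outputs


def run_map(rows: list[Row], prompts: list[str], model: str, call: GatewayCall) -> None:
    """Per-item map: attach the chain result under each row's `llm` key."""
    for row in rows:
        row.llm = {"model": model, "steps": run_chain(prompts, row.hypothesis, call)}


def reduce_input(row: Row) -> str:
    """Last --llm output if the map ran, else the transcript text."""
    if row.llm and row.llm["steps"]:
        return row.llm["steps"][-1]
    return row.hypothesis


def run_reduce(rows: list[Row], prompts: list[str], model: str, call: GatewayCall) -> Optional[dict]:
    """Run one chain over every item's result; skip the call if nothing to aggregate."""
    inputs = [text for text in map(reduce_input, rows) if text.strip()]
    if not inputs or not prompts:
        return None  # no billable call is made
    joined = "\n\n".join(inputs)  # joining format is an assumption
    return {"model": model, "steps": run_chain(prompts, joined, call)}


def fake_call(prompt: str, text: str) -> str:
    """Deterministic fake so the sketch runs without network access."""
    return f"[{prompt}] {text}"


rows = [Row("hello world"), Row("good morning")]
run_map(rows, ["fix punctuation"], "example-model", fake_call)
reduce_block = run_reduce(rows, ["summarize errors"], "example-model", fake_call)
```

Because `reduce_input` falls back to `hypothesis`, the reduce step in this sketch works whether or not the map step ran, which matches the "last `--llm` output, else the transcript text" rule in the description.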
Adds LLM-Gateway integration to the `assembly eval` command, enabling post-transcription text transformation and cross-item analysis.

Summary

Extends `assembly eval` with two new optional flags that integrate with LLM Gateway:

- `--llm`: Runs a chain over each transcript (the WER score still uses the raw transcript)
- `--llm-reduce`: Runs a prompt over all items' results to summarize patterns across the run

The implementation mirrors the existing `--llm`/`--llm-reduce` behavior in `assembly stream`, adapted for the eval context where results are collected before processing.

Key Changes

New command options (`aai_cli/commands/evaluate/__init__.py`):

- `--llm`: Repeatable option to transform transcripts through LLM Gateway
- `--llm-reduce`: Repeatable option for cross-item aggregation
- `--model` and `--max-tokens`: Gateway configuration options

Core evaluation logic (`aai_cli/commands/evaluate/_exec.py`):

- Adds `llm_prompt`, `llm_reduce`, `model`, `max_tokens` fields to `EvalOptions`
- `_LlmOptions`: frozen dataclass to encapsulate LLM settings
- `_ItemResult` now carries `hypothesis` (transcript text) for post-scoring LLM processing
- `_run_llm_map()`: Applies the per-item chain, attaching `{"model", "steps"}` under each row's `llm` key
- `_run_reduce()`: Aggregates all transcripts/LLM outputs and runs the reduce chain once
- `_reduce_input()`, `_gather_reduce_inputs()`, `_final_llm_output()`, `_llm_block()`, `_reduce_block()`
- `_render()`: displays LLM outputs in human-readable format
- `run_evaluate()`: orchestrates the map and reduce phases

Comprehensive test suite (`tests/test_eval_llm.py`)

Notable Implementation Details

- The `--llm` chain is purely informational.
- The `--llm` flag is a map (per item), while `--llm-reduce` is a reduce (across all items). Both are optional and independent.
- LLM output appears under `rows[].llm` (map) and at the top level under `reduce` (reduce), alongside the existing WER metrics.

https://claude.ai/code/session_0133Txdj2E5So6ZU1293W4px
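
Since both flags are optional and independent, a consumer of the JSON payload probably should not assume that either `rows[].llm` or `reduce` is present. The sketch below reads a payload defensively. Only the `rows`, `llm`, `reduce`, `model`, and `steps` keys come from the description above; the `wer` field name, the sample values, and the `summarize_payload` helper are hypothetical.

```python
import json


def summarize_payload(payload: dict) -> dict:
    """Report which LLM sections are present without assuming either flag was used."""
    rows = payload.get("rows", [])
    mapped = [row["llm"] for row in rows if row.get("llm")]
    return {
        "items": len(rows),
        "items_with_llm": len(mapped),               # rows carrying map output
        "has_reduce": payload.get("reduce") is not None,
    }


# Illustrative payload: the "wer" key and all values are made up for this example.
sample = json.loads("""
{
  "rows": [
    {"wer": 0.1, "llm": {"model": "example-model", "steps": ["cleaned text"]}},
    {"wer": 0.25}
  ],
  "reduce": {"model": "example-model", "steps": ["summary of the run"]}
}
""")

summary = summarize_payload(sample)
wer_only = summarize_payload({"rows": [{"wer": 0.3}]})
```

In this sketch, a run with neither flag simply yields zero rows with LLM output and no reduce section, so the same reader handles all four flag combinations.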