Add the triage skills and strategies experiments #12
These came from fullsend-ai/fullsend#170 and were used to form the basis of our real triage agent from fullsend-ai/fullsend#279
waynesun09 left a comment
Reviewed with a 10-agent review squad. Posting the top 5 most actionable findings inline — 2 are script bugs that crash or fail on macOS, 2 are data integrity issues affecting experiment results, and 1 is a JSON parsing bug that silently truncates output.
The $SCENARIO_NAME_ unbound variable (github-adapter.sh:76) and grep -oP portability issue (github-adapter.sh:80) are the quickest wins. The data integrity findings in the README and judge.sh are worth addressing before drawing conclusions from the experiment results.
---
_This issue was created by the triage-skill-comparison experiment._
_Strategy: $STRATEGY_NAME | Scenario: $SCENARIO_NAME_" \
Bug — Unbound variable crash on every run
$SCENARIO_NAME_ (with trailing underscore) is parsed by bash as a single variable name, since _ is a valid identifier character. That variable is never set, so with set -euo pipefail (line 3) every invocation aborts on this line with SCENARIO_NAME_: unbound variable.
- _Strategy: $STRATEGY_NAME | Scenario: $SCENARIO_NAME_" \
+ --body "_Strategy: $STRATEGY_NAME | Scenario: ${SCENARIO_NAME}_"
Use ${SCENARIO_NAME}_ to explicitly delimit the variable name from the trailing underscore literal.
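A minimal sketch of the difference, assuming placeholder values for both variables (the real values come from the adapter script). It runs the unbraced form in a subshell so the failure can be observed without ending the script:

```shell
set -u  # the -u in the script's `set -euo pipefail` is what makes unset variables fatal

STRATEGY_NAME="structured-triage"   # placeholder value for illustration
SCENARIO_NAME="crash-on-save"       # placeholder value for illustration

# Unbraced: the shell looks up a variable named SCENARIO_NAME_ and, with -u,
# aborts. The subshell contains the failure so this sketch keeps running.
if ! ( : "$SCENARIO_NAME_" ) 2>/dev/null; then
  echo "unbraced form failed: SCENARIO_NAME_ is unset"
fi

# Braced: the name ends at the closing brace, so the underscore stays literal.
footer="_Strategy: $STRATEGY_NAME | Scenario: ${SCENARIO_NAME}_"
echo "$footer"
```

Bracing every variable that is directly followed by a letter, digit, or underscore avoids this class of bug.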
--label "$LABEL_TRIAGE" \
2>/dev/null)"

ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oP '\d+$')"
Bug — grep -oP is GNU-only, fails on macOS
grep -P (Perl regex) is not available on macOS's default BSD grep. This will fail with grep: invalid option -- P on any macOS contributor's machine.
- ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oP '\d+$')"
+ ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')"
grep -oE with POSIX extended regex achieves the same result and works on both GNU and BSD grep.
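A quick way to check the portable form, using a made-up issue URL (the real one is returned by the issue-creation command in the script). The second variant is an alternative that is not part of the suggestion above: it avoids grep entirely and assumes the URL ends in the issue number with no trailing slash.

```shell
# Hypothetical URL for illustration only.
ISSUE_URL="https://github.com/example-org/example-repo/issues/123"

# Portable: -o and -E are accepted by both GNU and BSD grep.
ISSUE_NUMBER="$(printf '%s\n' "$ISSUE_URL" | grep -oE '[0-9]+$')"

# Alternative without grep: strip everything up to and including the last slash.
ISSUE_NUMBER_ALT="${ISSUE_URL##*/}"

echo "$ISSUE_NUMBER $ISSUE_NUMBER_ALT"
```

The parameter-expansion form saves a subprocess, but unlike the grep form it does not verify that the result is numeric.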
| Rank | Strategy | Mean score | Reliability |
|------|----------|-----------|-------------|
| 1 (tie) | omo-prometheus | 4.38 | 98% |
| 1 (tie) | omc-deep-interview | 4.38 | 97% |
Data integrity — Results table is incomplete and reliability numbers don't match trial data
Two issues with this rankings table:
- Incomplete results presented as final rankings: the slow-search and wrong-search-results scenarios have zero results, and silent-data-corruption only has 2 of 5 strategies. The rankings here are drawn from partial data and may change significantly once all scenarios are run.
- Reliability percentages contradict trial data: the table shows values like 98% and 97%, but in the actual result files all trials show parse_failures: 0, which suggests either 100% reliability or a different calculation method that isn't documented.
Consider either marking this table as preliminary/partial, or holding it until all scenario × strategy combinations have results.
}

echo "$JUDGE_JSON" | jq '.' > "$TRIAL_DIR/judge-assessment.json"
SCORE="$(echo "$JUDGE_JSON" | jq -r '.weighted_total // 0')"
Data integrity — weighted_total values are unreliable
Two problems with trusting the LLM-provided weighted_total:
- Arithmetic drift: spot-checking ~33 of 120 judge assessment files shows 0.05–0.15 point discrepancies between the LLM's weighted_total and the sum you'd get from applying the stated weights to the individual scores. These small errors can change rankings.
- Inconsistent nesting: at least one file (crash-on-save/structured-triage/trial-8/judge-assessment.json) has weighted_total nested inside .scores instead of at the top level, causing this jq expression to return 0 via the // 0 fallback, silently zeroing out the score.
Consider computing weighted_total deterministically from the component scores rather than trusting the LLM's arithmetic, and normalize the JSON structure before reading it.
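One possible shape for both suggestions, sketched with jq. The criterion names and weights in WEIGHTS are placeholders, not the experiment's actual rubric, and the sketch assumes the component scores are plain numbers under .scores; both function names are hypothetical.

```shell
# Placeholder rubric: these names and weights are assumptions for illustration.
WEIGHTS='{"accuracy": 0.5, "completeness": 0.25, "clarity": 0.25}'

# Recompute the total from the component scores, ignoring any
# LLM-supplied weighted_total wherever it happens to be nested.
compute_weighted_total() {
  jq -r --argjson w "$WEIGHTS" '
    (.scores // {}) as $s
    | [ $w | to_entries[] | .value * ($s[.key] // 0) ]
    | add
  ' "$1"
}

# Read the LLM-supplied value from either location; print "missing"
# instead of silently falling back to 0.
read_weighted_total() {
  jq -r '.weighted_total // .scores.weighted_total // "missing"' "$1"
}
```

Comparing the two values per trial would also give a direct measure of how far the judge's arithmetic drifts from the rubric.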
# Try first { ... } block
local braced
braced="$(echo "$raw" | awk '/{/{found=1} found{print} /}/{if(found) exit}')"
if [[ -n "$braced" ]] && echo "$braced" | jq . &>/dev/null; then
  echo "$braced"; return 0
fi

echo "$raw"
return 1
Bug — extract_json truncates nested JSON objects
The awk program stops at the first line that contains a closing }, regardless of nesting depth. For any JSON with nested objects (which is the expected output format for triage responses), this silently truncates the response, cutting off fields that appear after the first nested object closes.
For example, given:
{
"priority": { "level": "high", "reason": "crash" },
"component": "auth"
}
The function would return only { "priority": { "level": "high", "reason": "crash" }, dropping "component" entirely.
Consider using a brace-depth counter in awk, or piping through jq to extract the first valid JSON object from the mixed output.
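A brace-depth counter could look roughly like the sketch below. The function name extract_first_object is hypothetical, and the counter does not understand JSON strings, so a brace inside a string value would throw it off; the result should still be validated (for example with jq) before use.

```shell
# Print the first balanced { ... } block found on stdin.
# Limitation: braces inside JSON string values are counted as well.
extract_first_object() {
  awk '
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "{") { depth++; started = 1 }
        if (started) out = out c
        if (c == "}" && started) {
          depth--
          # Depth back at zero means the outermost object just closed.
          if (depth == 0) { print out; exit }
        }
      }
      # Preserve line breaks inside a block that spans several lines.
      if (started) out = out "\n"
    }
  '
}
```

Applied to the example above with surrounding non-JSON text, this returns the whole object including "component", because the inner closing brace only brings the depth back to 1.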