Bug
The Commit0 benchmark report (output.report.json) does not count error instances, leading to inconsistent reporting compared to other benchmarks like GAIA.
Observed behavior
In evaluation run 22836394952-claude-son (Commit0, litellm_proxy/claude-sonnet-4-5-20250929), 6 out of 16 instances failed with "Remote conversation ended with error" after 3 retries. These errors are correctly recorded in output_errors.jsonl and output.critic_attempt_1.jsonl, but the output.report.json reports:
```json
{
  "total_instances": 16,
  "submitted_instances": 10,
  "error_instances": 0
}
```
Comparison with GAIA
In the same batch, GAIA evaluation 22836395400-claude-son had 8 error instances and reports them correctly:
| Field | GAIA | Commit0 |
|---|---|---|
| total_instances | 165 | 16 |
| submitted_instances | 165 (includes errors) | 10 (excludes errors) |
| error_instances | 8 | 0 (should be 6) |
| incomplete_instances | 8 | field missing |
| Actual errors in output_errors.jsonl | 8 | 6 |
Expected behavior
- error_instances should be 6 (matching the 6 lines in output_errors.jsonl)
- incomplete_instances field should be present (like in the GAIA report)
- submitted_instances should arguably be 16 (including errored instances), consistent with how GAIA counts them
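The three expectations above amount to GAIA-style report semantics, which could be sketched as follows (a hypothetical helper for illustration, not the benchmark's actual code; field names match output.report.json):

```python
def build_report(total_instances, error_instances, incomplete_instances):
    """Assemble report counters under GAIA-style semantics."""
    return {
        "total_instances": total_instances,
        # GAIA counts errored instances as submitted
        "submitted_instances": total_instances,
        "error_instances": error_instances,
        # always present, even when zero
        "incomplete_instances": incomplete_instances,
    }
```

For the Commit0 run above this would yield submitted_instances = 16 and error_instances = 6.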
Impact
- Downstream consumers of output.report.json (Slack notifications, dashboards, index) may underreport failures
- The success-rate denominator is total_instances (16), so the rate itself is correct (6/16 = 37.5%), but the error breakdown is misleading
Reproduction
gs://openhands-evaluation-results/commit0/litellm_proxy-claude-sonnet-4-5-20250929/22836394952/results.tar.gz
gs://openhands-evaluation-results/gaia/litellm_proxy-claude-sonnet-4-5-20250929/22836395400/results.tar.gz
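Assuming the results.tar.gz archives above have been downloaded (e.g. with gsutil cp), the mismatch can be confirmed without fully extracting them. The member-path suffixes are assumptions about the archive layout:

```python
import json
import tarfile


def report_vs_errors(results_tar):
    """Return (error_instances from output.report.json, line count of
    output_errors.jsonl) read directly out of a results tarball."""
    reported = actual = None
    with tarfile.open(results_tar, "r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("output.report.json"):
                reported = json.load(tar.extractfile(member)).get("error_instances")
            elif member.name.endswith("output_errors.jsonl"):
                data = tar.extractfile(member).read().decode()
                actual = sum(1 for line in data.splitlines() if line.strip())
    return reported, actual
```

For the Commit0 tarball this would return (0, 6); for GAIA, (8, 8).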