Commit0 report undercounts error instances (reports 0 instead of actual 6) #493

@simonrosenberg

Description

Bug

The Commit0 benchmark report (output.report.json) does not count error instances, leading to inconsistent reporting compared to other benchmarks like GAIA.

Observed behavior

In evaluation run 22836394952-claude-son (Commit0, litellm_proxy/claude-sonnet-4-5-20250929), 6 out of 16 instances failed with "Remote conversation ended with error" after 3 retries. These errors are correctly recorded in output_errors.jsonl and output.critic_attempt_1.jsonl, but the output.report.json reports:

{
    "total_instances": 16,
    "submitted_instances": 10,
    "error_instances": 0
}
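The mismatch can be checked locally with something like the sketch below. It assumes `output.report.json` and `output_errors.jsonl` sit in the same extracted results directory and that `output_errors.jsonl` holds one record per line; the function names are hypothetical and not part of the benchmark code.

```python
import json
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file (assumed: one record per line)."""
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def check_error_count(results_dir: Path) -> tuple[int, int]:
    """Return (reported, actual) error counts for one results directory."""
    report_path = results_dir / "output.report.json"
    report = json.loads(report_path.read_text(encoding="utf-8"))
    # A missing field is treated as 0 here; that is a choice of this sketch.
    reported = report.get("error_instances", 0)
    actual = count_jsonl_records(results_dir / "output_errors.jsonl")
    return reported, actual
```

For the Commit0 run described here, the two values would be expected to differ (0 reported versus 6 lines in `output_errors.jsonl`).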

Comparison with GAIA

In the same batch, GAIA evaluation 22836395400-claude-son had 8 error instances and reports them correctly:

| Field | GAIA | Commit0 |
| --- | --- | --- |
| total_instances | 165 | 16 |
| submitted_instances | 165 (includes errors) | 10 (excludes errors) |
| error_instances | 8 | 0 (should be 6) |
| incomplete_instances | 8 | field missing |
| Actual errors in output_errors.jsonl | 8 | 6 |

Expected behavior

  • error_instances should be 6 (matching the 6 lines in output_errors.jsonl)
  • incomplete_instances field should be present (like in the GAIA report)
  • submitted_instances should arguably be 16 (including errored instances), consistent with how GAIA counts them
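As a rough illustration of the counting proposed above, the sketch below derives the report fields from sets of instance IDs. It is not the actual Commit0 or GAIA report code: `build_report` and its parameters are hypothetical, and treating `incomplete_instances` as "instances without a completed result" is an assumption based only on the GAIA numbers in the comparison (8 errors, 8 incomplete).

```python
from typing import Iterable


def build_report(
    all_ids: Iterable[str],
    completed_ids: Iterable[str],
    errored_ids: Iterable[str],
) -> dict[str, int]:
    """Build report counts so that errored instances are never dropped."""
    all_set = set(all_ids)
    completed = set(completed_ids) & all_set
    errored = set(errored_ids) & all_set
    return {
        "total_instances": len(all_set),
        # GAIA-style: errored instances still count as submitted.
        "submitted_instances": len(completed | errored),
        "error_instances": len(errored),
        # Assumption: "incomplete" means no completed result was produced.
        "incomplete_instances": len(all_set - completed),
    }
```

With 16 instances, 10 completed and 6 errored, this yields `submitted_instances` of 16 and both `error_instances` and `incomplete_instances` of 6.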

Impact

  • Downstream consumers of output.report.json (Slack notifications, dashboards, index) may underreport failures
  • Success rate denominator is total_instances (16), so the rate is correct (6/16 = 37.5%), but the error breakdown is misleading

Reproduction

gs://openhands-evaluation-results/commit0/litellm_proxy-claude-sonnet-4-5-20250929/22836394952/results.tar.gz
gs://openhands-evaluation-results/gaia/litellm_proxy-claude-sonnet-4-5-20250929/22836395400/results.tar.gz
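Once an archive has been downloaded (for example with `gsutil cp`), the report can be read straight from the tarball. This is a minimal sketch: the internal layout of `results.tar.gz` is not described here, so the helper (a hypothetical name) simply searches for the first regular file called `output.report.json` at any depth.

```python
import json
import tarfile
from pathlib import PurePosixPath


def load_report_from_tarball(tar_path: str, name: str = "output.report.json") -> dict:
    """Return the parsed JSON of the first archive member whose basename is `name`."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Match on the file name only, since the directory layout is unknown.
            if member.isfile() and PurePosixPath(member.name).name == name:
                fh = tar.extractfile(member)
                return json.load(fh)
    raise FileNotFoundError(f"{name} not found in {tar_path}")
```

Comparing `load_report_from_tarball(...)["error_instances"]` for the two archives should show the discrepancy described above.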
