Bug
The Commit0 benchmark report (output.report.json) does not count error instances, leading to inconsistent reporting compared to other benchmarks like GAIA.
Observed behavior
In evaluation run 22836394952-claude-son (Commit0, litellm_proxy/claude-sonnet-4-5-20250929), 6 out of 16 instances failed with "Remote conversation ended with error" after 3 retries. These errors are correctly recorded in output_errors.jsonl and output.critic_attempt_1.jsonl, but the output.report.json reports:
```json
{
  "total_instances": 16,
  "submitted_instances": 10,
  "error_instances": 0
}
```
Comparison with GAIA
In the same batch, GAIA evaluation 22836395400-claude-son had 8 error instances and reports them correctly:
| Field | GAIA | Commit0 |
|---|---|---|
| total_instances | 165 | 16 |
| submitted_instances | 165 (includes errors) | 10 (excludes errors) |
| error_instances | 8 | 0 (should be 6) |
| incomplete_instances | 8 | field missing |
| Actual errors in output_errors.jsonl | 8 | 6 |
Expected behavior
- error_instances should be 6 (matching the 6 lines in output_errors.jsonl)
- incomplete_instances field should be present (like in the GAIA report)
- submitted_instances should arguably be 16 (including errored instances), consistent with how GAIA counts them
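The three expectations above amount to GAIA-style report semantics, which could be sketched as follows (a hypothetical helper for illustration, not the benchmark's actual code; field names match output.report.json):

```python
def build_report(total_instances, error_instances, incomplete_instances):
    """Assemble report counters under GAIA-style semantics."""
    return {
        "total_instances": total_instances,
        # GAIA counts errored instances as submitted
        "submitted_instances": total_instances,
        "error_instances": error_instances,
        # always present, even when zero
        "incomplete_instances": incomplete_instances,
    }
```

For the Commit0 run above this would yield submitted_instances = 16 and error_instances = 6.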
Impact
- Downstream consumers of output.report.json (Slack notifications, dashboards, index) may underreport failures
- The success-rate denominator is total_instances (16), so the rate itself is correct (6/16 = 37.5%), but the error breakdown is misleading
Reproduction
gs://openhands-evaluation-results/commit0/litellm_proxy-claude-sonnet-4-5-20250929/22836394952/results.tar.gz
gs://openhands-evaluation-results/gaia/litellm_proxy-claude-sonnet-4-5-20250929/22836395400/results.tar.gz
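Assuming the results.tar.gz archives above have been downloaded (e.g. with gsutil cp), the mismatch can be confirmed without fully extracting them. The member-path suffixes are assumptions about the archive layout:

```python
import json
import tarfile


def report_vs_errors(results_tar):
    """Return (error_instances from output.report.json, line count of
    output_errors.jsonl) read directly out of a results tarball."""
    reported = actual = None
    with tarfile.open(results_tar, "r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("output.report.json"):
                reported = json.load(tar.extractfile(member)).get("error_instances")
            elif member.name.endswith("output_errors.jsonl"):
                data = tar.extractfile(member).read().decode()
                actual = sum(1 for line in data.splitlines() if line.strip())
    return reported, actual
```

For the Commit0 tarball this would return (0, 6); for GAIA, (8, 8).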