Run the terminal-bench benchmark suite with either the amplifier agent or the baseline agent:
uv run --with terminal-bench tests/terminal_bench/run_terminal_bench.py --agent baseline

Generate failure analysis reports for a terminal-bench run:
uv run tests/terminal_bench/generate_benchmark_report.py --run-dir "ai_working/tmp/2025-10-14__09-39-16"

Error Message:
Command '['docker', 'compose', '-p', 'task-name', '-f', '/path/to/docker-compose.yaml', 'up', '-d']' returned non-zero exit status 1.
Root Cause: This error can occur when Docker runs out of available network address pools. Terminal-bench creates a new Docker network for each task run. If these networks are not cleaned up, Docker eventually exhausts its predefined address pools and fails with:
failed to create network: Error response from daemon: all predefined address pools have been fully subnetted
Solution: Clean up unused Docker networks:
# Remove all unused networks
docker network prune -f
# Check how many networks exist (should be < 30; the count includes the header line)
docker network ls | wc -l
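
The check and cleanup above could also be automated before a benchmark run. The following is a minimal sketch, not part of terminal-bench: the helper names (`count_networks`, `should_prune`, `prune_if_needed`) and the `NETWORK_LIMIT` constant are hypothetical, and the limit of 30 is simply taken from the guidance above. It assumes the `docker` CLI is on `PATH` and uses `docker network ls --format '{{.Name}}'`, which prints one network name per line with no header row.

```python
import subprocess

# Hypothetical threshold, borrowed from the "should be < 30" guidance above.
NETWORK_LIMIT = 30


def count_networks(names_output: str) -> int:
    """Count non-empty lines in `docker network ls --format '{{.Name}}'` output."""
    return sum(1 for line in names_output.splitlines() if line.strip())


def should_prune(names_output: str, limit: int = NETWORK_LIMIT) -> bool:
    """Return True when the number of listed networks has reached the limit."""
    return count_networks(names_output) >= limit


def prune_if_needed(limit: int = NETWORK_LIMIT) -> bool:
    """List Docker networks and prune unused ones if there are too many.

    Returns True if a prune was run. Requires the docker CLI on PATH.
    """
    listing = subprocess.run(
        ["docker", "network", "ls", "--format", "{{.Name}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    if not should_prune(listing.stdout, limit):
        return False
    # Same cleanup as the manual command above: remove all unused networks.
    subprocess.run(["docker", "network", "prune", "-f"], check=True)
    return True
```

Calling `prune_if_needed()` before starting a run would perform the cleanup only when the network count has reached the limit. The counting logic is kept separate from the Docker calls so it can be checked without a running daemon.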