Merged
8 changes: 8 additions & 0 deletions .cursor/skills/analyze-bot-failures/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
---
name: analyze-bot-failures
description: Analyze failures from the Metaculus bot forecasting GitHub Actions workflow. Pulls failed job logs, aggregates failure reasons, and investigates whether real bugs need fixing. Use when the user asks why bot runs/workflows are failing, wants a summary of forecasting errors, or mentions analyzing bot failure logs from GitHub Actions.
---

# Analyze Bot Forecasting Failures

The full skill lives with its script in the repo. Read `scripts/skills/analyze_bot_failures/SKILL.md` and follow its instructions.
89 changes: 89 additions & 0 deletions code_tests/unit_tests/test_analyze_bot_run_failures.py
@@ -0,0 +1,89 @@
from scripts.skills.analyze_bot_failures.analyze_bot_run_failures import (
    group_failures,
    normalize_message_to_signature,
    parse_failures_from_log,
)

GH_PREFIX = "2026-06-11T17:13:45.1234567Z "

TRACEBACK_LOG = (
    f"{GH_PREFIX}2026-06-11 17:13:40,100 - root - INFO - Running bot METAC_GPT_5_5_HIGH with 3 questions\n"
    f"{GH_PREFIX}2026-06-11 17:13:45,123 - forecasting_tools.forecast_bots.forecast_bot - ERROR - Exception occurred during forecasting:\n"
    f"{GH_PREFIX}Traceback (most recent call last):\n"
    f'{GH_PREFIX} File "/home/runner/work/forecasting-tools/forecasting-tools/forecasting_tools/forecast_bots/forecast_bot.py", line 341, in _run_individual_question_with_error_propagation\n'
    f"{GH_PREFIX} return await self._run_individual_question(question)\n"
    f'{GH_PREFIX} File "/home/runner/work/forecasting-tools/forecasting-tools/.venv/lib/python3.11/site-packages/litellm/main.py", line 99, in completion\n'
    f"{GH_PREFIX} raise error\n"
    f"{GH_PREFIX}RuntimeError: Error while processing question url: 'https://www.metaculus.com/questions/12345/': Rate limit hit for model gpt-5.5\n"
    f"{GH_PREFIX}2026-06-11 17:14:00,500 - root - INFO - done\n"
)

SHORT_SUMMARY_ONLY_LOG = f"{GH_PREFIX}2026-06-11 17:13:45,123 - root - INFO - ❌ Exception: ValueError | Message: Error while processing question url: 'https://www.metaculus.com/questions/678/': bad probability\n"

UNPARSABLE_LOG = (
    f"{GH_PREFIX}2026-06-11 17:13:45,123 - root - INFO - starting up\n"
    f"{GH_PREFIX}The operation was canceled.\n"
)


def test_parses_traceback_failure_with_question_url_and_repo_frame() -> None:
    events = parse_failures_from_log(
        TRACEBACK_LOG,
        "METAC_GPT_5_5_HIGH",
        111,
        "bot_gpt_5_5_high / run_bot",
        "http://job-url",
    )
    assert len(events) == 1
    event = events[0]
    assert event.exception_type == "RuntimeError"
    assert "Rate limit hit" in event.message
    assert event.question_url == "https://www.metaculus.com/questions/12345/"
    assert event.deepest_repo_frame is not None
    assert (
        event.deepest_repo_frame.file_path
        == "forecasting_tools/forecast_bots/forecast_bot.py"
    )
    assert event.deepest_repo_frame.line_number == 341
    assert "Traceback (most recent call last):" in (event.traceback_text or "")


def test_falls_back_to_short_summary_lines() -> None:
    events = parse_failures_from_log(
        SHORT_SUMMARY_ONLY_LOG, "BOT_A", 222, "bot_a / run_bot", "http://job-url"
    )
    assert len(events) == 1
    assert events[0].exception_type == "ValueError"
    assert events[0].question_url == "https://www.metaculus.com/questions/678/"


def test_falls_back_to_log_tail_when_nothing_parseable() -> None:
    events = parse_failures_from_log(
        UNPARSABLE_LOG, "BOT_B", 333, "bot_b / run_bot", "http://job-url"
    )
    assert len(events) == 1
    assert events[0].exception_type == "UnparsedFailure"
    assert "The operation was canceled." in (events[0].traceback_text or "")


def test_grouping_normalizes_urls_and_numbers() -> None:
    signature_one = normalize_message_to_signature(
        "RuntimeError",
        "Error while processing question url: 'https://www.metaculus.com/questions/12345/': Rate limit hit after 30 seconds",
    )
    signature_two = normalize_message_to_signature(
        "RuntimeError",
        "Error while processing question url: 'https://www.metaculus.com/questions/99999/': Rate limit hit after 61 seconds",
    )
    assert signature_one == signature_two

    events_one = parse_failures_from_log(
        TRACEBACK_LOG, "BOT_A", 111, "bot_a / run_bot", "http://job-a"
    )
    events_two = parse_failures_from_log(
        TRACEBACK_LOG, "BOT_B", 112, "bot_b / run_bot", "http://job-b"
    )
    groups = group_failures(events_one + events_two)
    assert len(groups) == 1
    assert len(groups[0].events) == 2
    assert {event.bot_name for event in groups[0].events} == {"BOT_A", "BOT_B"}
Original file line number Diff line number Diff line change
@@ -228,11 +228,18 @@ async def _binary_prompt_to_forecast(
     ) -> ReasonedPrediction[float]:
         reasoning = await self.get_llm("default", "llm").invoke(prompt)
         logger.info(f"Reasoning for URL {question.page_url}: {reasoning}")
+        parsing_instructions = clean_indents(
+            f"""
+            The text given to you is trying to give a probability forecast for a binary question.
+            {self._create_resolved_question_parsing_message()}
+            """
+        )
         binary_prediction: BinaryPrediction = await structure_output(
             reasoning,
             BinaryPrediction,
             model=self.get_llm("parser", "llm"),
             num_validation_samples=self._structure_output_validation_samples,
+            additional_instructions=parsing_instructions,
         )
         decimal_pred = max(0.01, min(0.99, binary_prediction.prediction_in_decimal))

@@ -298,6 +305,7 @@ async def _multiple_choice_prompt_to_forecast(

             The text you are parsing may prepend these options with some variation of "Option" which you should remove if not part of the option names I just gave you.
             Additionally, you may sometimes need to parse a 0% probability. Please do not skip options with 0% but rather make it an entry in your final list with 0% probability.
+            {self._create_resolved_question_parsing_message()}
             """
         )
         reasoning = await self.get_llm("default", "llm").invoke(prompt)
@@ -389,6 +397,8 @@ async def _numeric_prompt_to_forecast(
             f"""
             The text given to you is trying to give a forecast distribution for a numeric question.
             - This text is trying to answer the numeric question: "{question.question_text}".
+            {self._create_single_distribution_parsing_message(question)}
+            {self._create_resolved_question_parsing_message()}
             - When parsing the text, please make sure to give the values (the ones assigned to percentiles) in terms of the correct units.
             - The units for the forecast are: {question.unit_of_measure}
             - Your work will be shown publicly with these units stated verbatim after the numbers your parse.
@@ -483,6 +493,8 @@ async def _date_prompt_to_forecast(
             f"""
             The text given to you is trying to give a forecast distribution for a date question.
             - This text is trying to answer the question: "{question.question_text}".
+            {self._create_single_distribution_parsing_message(question)}
+            {self._create_resolved_question_parsing_message()}
             - As an example, someone else guessed that the answer will be between {question.lower_bound} and {question.upper_bound}, so the numbers parsed from an answer like this would be verbatim "{question.lower_bound}" and "{question.upper_bound}".
             - The output is given as dates/times please format it into a valid datetime parsable string. Assume midnight UTC if no hour is given.
             - If percentiles are not explicitly given (e.g. only a single value is given) please don't return a parsed output, but rather indicate that the answer is not explicitly given in the text.
@@ -509,6 +521,24 @@ async def _date_prompt_to_forecast(
         )
         return ReasonedPrediction(prediction_value=prediction, reasoning=reasoning)

+    def _create_resolved_question_parsing_message(self) -> str:
+        return "- If the text concludes that the question has already resolved *in the past* (i.e. it treats the question as decided rather than something to forecast), please DO NOT return a parsed output, even if a final forecast is also given. Instead indicate that the answer is not explicitly given in the text.\n"
+
+    def _create_single_distribution_parsing_message(
+        self, question: NumericQuestion | DateQuestion
+    ) -> str:
+        message = (
+            "- The text may contain multiple percentile distributions (e.g. forecasts for several related questions/entities, or intermediate drafts before a final answer). You must return exactly ONE distribution: the single final distribution that answers the question stated above.\n"
+            "- Never merge or concatenate percentile lists that refer to different entities, options, or scenarios. Each percentile should appear at most once in your output.\n"
+            "- If there are multiple final distributions and you cannot tell which one answers the question stated above, do not guess or combine them. Instead indicate that the answer is not explicitly given in the text."
+        )
+        if question.group_question_option is not None:
+            message += (
+                f'\n- This question is specifically about "{question.group_question_option}" (one subquestion within a group of related questions). '
+                f'If the text gives distributions for multiple subjects, return only the distribution for "{question.group_question_option}".'
+            )
+        return message
+
     def _create_upper_and_lower_bound_messages(
         self, question: NumericQuestion | DateQuestion
     ) -> tuple[str, str]:
29 changes: 23 additions & 6 deletions run_bots.py
@@ -54,6 +54,9 @@
     40280,  # https://www.metaculus.com/questions/40280/ is rejected since noisy workflow errors
     39138,  # https://www.metaculus.com/questions/39138/ is rejected the best value is way out of bounds, and bots are constrained to not be able to make these forecasts
 ]
+POST_IDS_TO_NOT_RAISE_ERRORS_FOR = [
+    # 43335,  # https://www.metaculus.com/questions/43335/ is still forecasted but should not fail the workflow if it errors
+]


class ScheduleConfig:
@@ -173,10 +176,24 @@ async def configure_and_run_bot(
     for i, question_report in enumerate(zip(questions, all_reports)):
         question, report = question_report
         if isinstance(report, BaseException) and "TimeoutError" in str(report):
+            logger.warning(
+                f"TimeoutError occurred for question {question.id_of_post}, retrying..."
+            )
             new_report = await bot.forecast_question(question, return_exceptions=True)
             all_reports[i] = new_report

-    bot.log_report_summary(all_reports)
+    bot.log_report_summary(all_reports, raise_errors=False)
+
+    errors_to_raise = [
+        report
+        for question, report in zip(questions, all_reports)
+        if isinstance(report, BaseException)
+        and question.id_of_post not in POST_IDS_TO_NOT_RAISE_ERRORS_FOR
+    ]
+    if errors_to_raise:
+        raise RuntimeError(
+            f"{len(errors_to_raise)} errors occurred while forecasting: {errors_to_raise}"
+        )

     return all_reports

@@ -1209,7 +1226,7 @@ def get_default_bot_dict() -> dict[str, RunBotConfig]:  # NOSONAR
                     # **flex_price_settings,
                 ),
             ),
-            "tournaments": [AllowedTourn.METACULUS_CUP],
+            "tournaments": TournConfig.NONE,  # NOTE: gpt-5 (gpt-5-2025-08-07) deprecated by OpenAI, API shutoff Dec 10, 2026
         },
         "METAC_GPT_5": {
             "estimated_cost_per_question": roughly_gpt_5_cost,
@@ -1221,7 +1238,7 @@ def get_default_bot_dict() -> dict[str, RunBotConfig]:  # NOSONAR
                     # **flex_price_settings,
                 ),
             ),
-            "tournaments": TournConfig.NONE,
+            "tournaments": TournConfig.NONE,  # NOTE: gpt-5 (gpt-5-2025-08-07) deprecated by OpenAI, API shutoff Dec 10, 2026
         },
         "METAC_GPT_5_MINI": {
             "estimated_cost_per_question": roughly_gpt_4o_mini_cost,
@@ -1232,7 +1249,7 @@ def get_default_bot_dict() -> dict[str, RunBotConfig]:  # NOSONAR
                     **flex_price_settings,
                 ),
             ),
-            "tournaments": TournConfig.NONE,
+            "tournaments": TournConfig.NONE,  # NOTE: gpt-5-mini (gpt-5-mini-2025-08-07) deprecated by OpenAI, API shutoff Dec 10, 2026
         },
         "METAC_GPT_5_NANO": {
             "estimated_cost_per_question": roughly_deepseek_r1_cost,
@@ -1242,7 +1259,7 @@ def get_default_bot_dict() -> dict[str, RunBotConfig]:  # NOSONAR
                     temperature=default_temperature,
                 ),
             ),
-            "tournaments": TournConfig.NONE,
+            "tournaments": TournConfig.NONE,  # NOTE: gpt-5-nano (gpt-5-nano-2025-08-07) deprecated by OpenAI, API shutoff Dec 10, 2026
         },
         "METAC_CLAUDE_4_SONNET_HIGH_16K": {
             "estimated_cost_per_question": 0.33980,
@@ -1414,7 +1431,7 @@ def get_default_bot_dict() -> dict[str, RunBotConfig]:  # NOSONAR
                 llm=gpt_5_with_search,
                 bot_type="research_only",
             ),
-            "tournaments": TournConfig.experimental,
+            "tournaments": TournConfig.NONE,  # NOTE: gpt-5 (gpt-5-2025-08-07) deprecated by OpenAI, API shutoff Dec 10, 2026
         },
         "METAC_GROK_4_LIVE_SEARCH": {
             "estimated_cost_per_question": 3 * roughly_one_call_to_grok_4_llm,
66 changes: 66 additions & 0 deletions scripts/skills/analyze_bot_failures/SKILL.md
@@ -0,0 +1,66 @@
---
name: analyze-bot-failures
description: Analyze failures from the Metaculus bot forecasting GitHub Actions workflow. Pulls failed job logs, aggregates failure reasons, and investigates whether real bugs need fixing. Use when the user asks why bot runs/workflows are failing, wants a summary of forecasting errors, or mentions analyzing bot failure logs from GitHub Actions.
---

# Analyze Bot Forecasting Failures

Workflow for diagnosing failures in the `run-bot-aib-tournament.yaml` GitHub Actions workflow, which runs ~30 bot jobs every 30 minutes via `run_bots.py`.

## Step 1: Pull and aggregate the failure logs

A GitHub token is required (the job-log API rejects unauthenticated requests). Check the `GITHUB_TOKEN` environment variable or `gh auth token`; if neither yields a token, ask the user to run `gh auth login`.
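
The token lookup described above could be sketched roughly as follows. This is an illustration only, not the script's actual code: the function name `resolve_github_token` is hypothetical, and the 10-second timeout is an arbitrary assumption.

```python
import os
import subprocess
from typing import Optional


def resolve_github_token() -> Optional[str]:
    """Hypothetical helper: return a token from the environment or the gh CLI."""
    # Prefer the environment variable when it is set and non-empty.
    token = os.environ.get("GITHUB_TOKEN", "").strip()
    if token:
        return token
    try:
        # `gh auth token` prints the token of the currently logged-in account.
        result = subprocess.run(
            ["gh", "auth", "token"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,  # assumed limit so a hung CLI does not block the analysis
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None  # gh is not installed or did not answer in time
    token = result.stdout.strip()
    return token if result.returncode == 0 and token else None
```

A `None` result is the point at which to ask the user to run `gh auth login`.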

```bash
poetry run python scripts/skills/analyze_bot_failures/analyze_bot_run_failures.py --since 1d
```

Useful options:
- `--since 12h|2d|1w` or an ISO datetime (default `1d`)
- `--run-id <id>` to analyze one specific run
- `--max-runs <n>` cap on runs fetched (default 50)
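
As a rough sketch of how a `--since` value such as `12h`, `2d`, `1w`, or an ISO datetime might be turned into a cutoff time: the function name `parse_since`, the unit table, and the choice to treat naive ISO datetimes as UTC are all assumptions for illustration, not the script's confirmed behavior.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed unit suffixes, taken from the examples in the option list above.
_UNIT_TO_HOURS = {"h": 1, "d": 24, "w": 24 * 7}


def parse_since(value: str, now: Optional[datetime] = None) -> datetime:
    """Hypothetical helper: convert a relative or ISO `--since` value to a cutoff."""
    now = now or datetime.now(timezone.utc)
    text = value.strip()
    match = re.fullmatch(r"(\d+)([hdw])", text.lower())
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(hours=amount * _UNIT_TO_HOURS[unit])
    # Anything else is treated as an ISO datetime; naive values are assumed UTC.
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```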

Output goes to `logs/workflow_failure_analysis/<timestamp>/`:
- `report.md` — counts by category/bot/question, plus failure groups (deduped by normalized signature) with an example message, traceback, and the deepest repo code frame
- `failures.json` — every parsed failure with full messages and signatures (machine-readable)
- `raw_logs/run<id>_<bot>.log` — complete raw log of each failed job
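
To illustrate what "deduped by normalized signature" can mean, here is a minimal sketch that masks URLs and numbers so that failures differing only in question ID or timing collapse into one group. The name `sketch_signature` and the placeholder tokens `<URL>` and `<N>` are assumptions; the real `normalize_message_to_signature` may normalize differently.

```python
import re

# Stop the URL match at whitespace or quotes so surrounding punctuation survives.
_URL_PATTERN = re.compile(r"https?://[^\s'\"]+")
_NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def sketch_signature(exception_type: str, message: str) -> str:
    """Hypothetical normalizer: mask URLs, then numbers, then collapse whitespace."""
    normalized = _URL_PATTERN.sub("<URL>", message)
    normalized = _NUMBER_PATTERN.sub("<N>", normalized)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return f"{exception_type}: {normalized}"
```

URLs are masked before numbers so that the digits inside a question URL do not leave a partially masked address behind.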

## Step 2: Read the report, then verify against raw logs

Read `report.md` first. The parser extracts full tracebacks when present (falling back to `❌ Exception: ... | Message: ...` summary lines, then to the raw log tail), so each failure group usually includes a "deepest repo frame" — the last traceback frame inside this repo rather than a dependency — which is the first place to look in the code. Still **spot-check 2-3 raw logs** for the most frequent signatures to see surrounding context (retry attempts, which LLM call failed). Failures of type `UnparsedFailure` mean nothing parseable was found — the job likely failed at the infrastructure level (poetry install failures, token resolution, job-level timeout at 55 min, cancellation) — read their raw logs directly.
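
The "deepest repo frame" idea can be sketched as below: walk the traceback's `File "...", line N, in func` entries and keep the last one that lives in the repo rather than in `site-packages`. The `Frame` type, the function name, and the default `repo_root` (taken from the runner paths that appear in the unit test fixture) are illustrative assumptions, not the script's actual implementation.

```python
import re
from typing import NamedTuple, Optional

_FRAME_PATTERN = re.compile(
    r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)'
)


class Frame(NamedTuple):
    file_path: str
    line_number: int
    function_name: str


def find_deepest_repo_frame(
    traceback_text: str,
    repo_root: str = "/home/runner/work/forecasting-tools/forecasting-tools/",  # assumed checkout path
) -> Optional[Frame]:
    """Hypothetical helper: return the last traceback frame inside the repo."""
    deepest = None
    for match in _FRAME_PATTERN.finditer(traceback_text):
        path = match.group("path")
        # Dependencies installed in the virtualenv are not repo code.
        if "site-packages" in path or not path.startswith(repo_root):
            continue
        relative_path = path[len(repo_root):]
        deepest = Frame(relative_path, int(match.group("line")), match.group("func"))
    return deepest
```

Because later matches overwrite earlier ones, the returned frame is the one closest to where the exception was raised while still being repo code.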

## Step 3: Triage — transient noise vs real bug

Likely transient (usually not worth fixing, just note frequency):
- Provider 5xx / overloaded / rate-limit errors that hit one bot in one run
- Occasional timeouts that succeeded on the in-run retry (`run_bots.py` retries reports whose error contains "TimeoutError")

Likely real bugs (investigate the code):
- The same bot failing across most runs (config/model-name/credentials issue in its `RunBotConfig` in `run_bots.py`)
- Structured-output or validation errors clustering on one question type (binary/MC/numeric/group) — points at parsing or prompt code
- Prediction validation errors (probabilities out of bounds, CDF/percentile issues) — points at report data models
- The same question ID recurring across bots/runs — candidate for `POST_IDS_TO_SKIP` or `POST_IDS_TO_NOT_RAISE_ERRORS_FOR` in `run_bots.py`
- Tracebacks ending inside `forecasting_tools/` rather than a provider SDK
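
One of the signals above, the same question recurring across bots, is easy to check mechanically. The sketch below assumes a simplified record shape (`FailureRecord`) and a hypothetical `min_bots` threshold; the actual fields in `failures.json` may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set


class FailureRecord(NamedTuple):
    # Simplified, assumed shape of one parsed failure.
    bot_name: str
    run_id: int
    question_url: Optional[str]


def find_skip_candidates(
    failures: Iterable[FailureRecord], min_bots: int = 2
) -> Dict[str, int]:
    """Return {question_url: failure_count} for questions failing on >= min_bots bots."""
    bots_by_question: Dict[str, Set[str]] = defaultdict(set)
    counts: Dict[str, int] = defaultdict(int)
    for failure in failures:
        if failure.question_url is None:
            continue  # infrastructure-level failures have no question attached
        bots_by_question[failure.question_url].add(failure.bot_name)
        counts[failure.question_url] += 1
    return {
        url: count
        for url, count in counts.items()
        if len(bots_by_question[url]) >= min_bots
    }
```

Questions returned here are only candidates for `POST_IDS_TO_SKIP` or `POST_IDS_TO_NOT_RAISE_ERRORS_FOR`; the raw logs still need a look before anything is added to a list.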

## Step 4: Map errors to code

| Symptom | Where to look |
|---|---|
| Bot config, skip lists, question selection, final RuntimeError | `run_bots.py` |
| Forecast orchestration, retries, report summary format | `forecasting_tools/forecast_bots/forecast_bot.py` |
| Structured output parsing failures | `forecasting_tools/helpers/structure_output.py` |
| LLM call errors, model names, retries, token limits | `forecasting_tools/ai_models/` (esp. `general_llm.py`) |
| AskNews/research errors, cache misses | `forecasting_tools/helpers/` and `.github/scripts/precache_asknews.py` |
| Metaculus API / publish errors | `forecasting_tools/helpers/metaculus_client.py` |
| Job setup/infra failures, secrets, timeouts | `.github/workflows/run-bot-launcher.yaml`, `run-bot-aib-tournament.yaml` |

The cause may not be in any of the files above, so be willing to search the rest of the codebase.

## Step 5: Summarize findings

Report back with:
1. Failure counts by category and bot (from `report.md`)
2. **Real bugs found**: each with evidence (log excerpt + code location) and a proposed fix
3. **Transient noise**: what to ignore and why
4. **Question skip candidates**: recurring question IDs with failure counts
5. Do not change code or skip lists without confirming the proposed fixes with the user first