Skip to content

fix: Handling of site connection issues during outage#3470

Closed
rippyboii wants to merge 0 commit intopython-discord:mainfrom
rippyboii:main
Closed

fix: Handling of site connection issues during outage#3470
rippyboii wants to merge 0 commit intopython-discord:mainfrom
rippyboii:main

Conversation

@rippyboii
Copy link

Summary

This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load.

  • Aggregate extension + cog load failures during startup and report them as a single alert in #mod-log.
  • Add retry + exponential backoff for cogs that depend on external sites/APIs (e.g., temporary 5xx/429/timeouts), with clear mod notifications on final failure.
  • Add unit tests to validate retry behavior, error classification, and startup reporting.

During setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either:

  • stop startup unexpectedly, or
  • fail noisily/fragmentedly, making it hard to see what broke and why.

This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).

Changes

1) Startup failure aggregation (single #mod-log alert)

  • Added utils/startup_reporting.py to format a standardized startup failure message.
  • Updated bot.py to:
    • collect extension + cog load failures (import/setup/add_cog)
    • wait for all load tasks to complete
    • send one aggregated alert summarizing all failures
  • Reporting is defensive: it does not crash if the log channel is unavailable.
  • Startup continues for non-critical failures.

2) Retry + backoff for external/API-dependent cogs

Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.

Covered cogs include:

  • Filtering: 3 attempts with backoff (1s, 2s, 4s); retries on HTTP 429, HTTP 5xx, TimeoutError, OSError; final failure logs + alerts #mod-alerts.
  • Reminders: retry count is configured via URLs.connect_max_retries; warns are logged to Sentry; final failure posts to #mod-log.
  • PythonNews: retries on 408, 429, 5xx, TimeoutError, OSError; on max retries logs + alerts mod_alerts and re-raises to stop startup.
  • Superstarify cogs: added retry + notification and corresponding tests.

Tests / Verification

  • Added unit tests covering:
    • retry-then-success
    • max-retries then alert + failure behavior
    • non-retriable errors
    • retry classification logic
    • aggregated startup failure reporting for faulty extensions/cogs

Suggested checks:

  1. uv run task test
  2. Run the bot and simulate a faulty extension/cog load to confirm a single aggregated #mod-log startup alert.

Moderator Alert in Discord:

mod_alert

Closes #2918

@rippyboii rippyboii requested a review from mbaruh as a code owner March 1, 2026 16:04

return isinstance(error, (TimeoutError, OSError))

async def _alert_mods_filter_load_failure(self, error: Exception, attempts: int) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is never actually getting called.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Thank you for mentioning it.

Initially we alerted mods directly via the individual cogs call. However, later when we refactored to centralize the process during the startup period, we missed to refactor it.

This unused method is removed

Comment on lines +172 to +175
await mod_alerts_channel.send(
":warning: Filtering failed to load filter lists during startup "
f"after {attempts} attempt(s). Error: `{error_details}`"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should ping Moderators.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is any failure in startup process, including the cog load. Now it is all collected directly by StartupFailureReporter in startup_reporting.py, and it mentions the moderator when reporting the error

Comment on lines +19 to +20
MAX_ATTEMPTS = constants.URLs.connect_max_retries
INITIAL_BACKOFF_SECONDS = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be added to constants and instead of being redefined as constants at the top of the file usages should just use the imported class (e.g. do from bot.constants import URLs and then use URLs.connect_max_retries as an argument)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion,

We have fixed this issue!

Comment on lines +68 to +69
FILTER_LOAD_MAX_ATTEMPTS = constants.URLs.connect_max_retries
INITIAL_BACKOFF_SECONDS = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move both to constants and reference the constants.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion!

We have fixed this issue!

Comment on lines +152 to +158
@staticmethod
def _retryable_filter_load_error(error: Exception) -> bool:
"""Return whether loading filter lists failed due to some temporary error, thus retrying could help."""
if isinstance(error, ResponseCodeError):
return error.status in (408, 429) or error.status >= 500

return isinstance(error, (TimeoutError, OSError))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be made a generic utility and moved to a separate file for all backoff-compatible cogs to use.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Thank you for your suggestion.

We have now added new file in utility for this retry mechanism and is reflected across the code

if TYPE_CHECKING:
from bot.bot import Bot

@dataclass(frozen=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a dataclass?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

This issue has been fixed!

Comment on lines +54 to +55
for k in keys:
e = failures[k]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use properly named variables.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for mentioning it,

we have fixed this one too


return textwrap.dedent(f"""
Failed items:
{chr(10).join(lines)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the way to do this in so many ways, you can just do "\n".join(lines)

Can do something like:

await ctx.send(
  "Failed items:\n" +
  "\n".join(lines)
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, we have fixed this issue!

bot/bot.py Outdated
Comment on lines +22 to +24
_current_extension: contextvars.ContextVar[str | None] = contextvars.ContextVar(
"current_extension", default=None
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure why you need a context var to implement this, it seems like unnecessary overcomplication and just makes things harder to follow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for mentioning it

We have fixed this issue, bot.py no longer uses contextvars or _content_extensions at all.

Comment on lines +87 to +100
def test_retryable_python_news_load_error(self):
"""`_retryable_site_load_error` should classify temporary failures as retryable."""
test_cases = (
(ResponseCodeError(MagicMock(status=408)), True),
(ResponseCodeError(MagicMock(status=429)), True),
(ResponseCodeError(MagicMock(status=500)), True),
(ResponseCodeError(MagicMock(status=503)), True),
(ResponseCodeError(MagicMock(status=400)), False),
(ResponseCodeError(MagicMock(status=404)), False),
(TimeoutError("timeout"), True),
(OSError("os error"), True),
(AttributeError("attr"), False),
(ValueError("value"), False),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests should all be moved to a general utility (like the retriable error method in each of the cogs as mentioned).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Thank you for mentioning this,

I have moved this one to it's respective test file to follow the structure, following from the utility in main.

@rippyboii rippyboii requested a review from jb3 March 2, 2026 23:43
@rippyboii
Copy link
Author

Hi @jb3 , Thank you so much for bringing out the suggestions. We have followed your guidance and implemented the refactor to most of it. Could you please review it once again in your free time?

Thank you so much

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling of site connection issues during outage.

2 participants