Skip to content

fix: prevent imports stuck in Pending on stale DB connections#327

Open
level09 wants to merge 1 commit into
mainfrom
fix/etl-import-stuck-pending-and-stale-pg-connections
Open

fix: prevent imports stuck in Pending on stale DB connections#327
level09 wants to merge 1 commit into
mainfrom
fix/etl-import-stuck-pending-and-stale-pg-connections

Conversation

@level09
Copy link
Copy Markdown
Collaborator

@level09 level09 commented May 6, 2026

Summary

Web/upload imports could silently get stuck in Pending with no error log on the import row, even though the celery worker logged a traceback. We hit this in production when Postgres dropped an idle connection: etl_process_file raised PendingRollbackError, the original exception was masked, and the import row was never marked Failed.

Two fixes:

  1. Defensive error handling in etl_process_file (enferno/tasks/data_import.py)

    • Call db.session.rollback() before reusing the session in the except branch. Otherwise a poisoned session causes the fail-marker query to raise, masking the real exception and leaving the row stuck in Pending.
    • Guard the fail-marker itself so a secondary failure logs and doesn't replace the original exception.
  2. Connection health checks on the SQLAlchemy engine (enferno/settings.py)

    • pool_pre_ping=True: ping each pooled connection before use, so workers don't get handed a dropped connection.
    • pool_recycle=300 (overridable via SQLALCHEMY_POOL_RECYCLE): cycle connections every 5 min, well under typical NAT/firewall idle timeouts.

Together: the first fix makes the symptom impossible regardless of root cause; the second stops the underlying class of error from happening in the first place.

Observed in production

PendingRollbackError: Can't reconnect until invalid transaction is rolled back.
  File "enferno/tasks/data_import.py", line 33, in etl_process_file
    log = db.session.get(DataImport, data_import_id)

The line number points inside the except clause; the original exception inside MediaImport(...) / di.process(file) was lost. We also saw OperationalError: server closed the connection unexpectedly on the same worker pool earlier the same day, confirming the connection-drop pattern.

Test plan

  • Run existing test suite (uv run pytest) on a Postgres with a low idle timeout and confirm no regressions.
  • Trigger a web import and verify normal happy path still produces a bulletin and Ready status.
  • Force a transient DB error during etl_process_file (e.g. kill the worker's connection) and confirm the import row ends in Failed with the original exception in the log instead of Pending.

etl_process_file's except handler called db.session.get(DataImport, ...)
without rolling back first. When an upstream error poisoned the session
(e.g. Postgres dropping an idle connection), the fail-marker query
raised PendingRollbackError, masked the original exception, and left
the import row stuck in 'Pending' forever.

- Roll back the session before recovering, and guard the fail-marker
  so a follow-up failure doesn't replace the original exception.
- Enable pool_pre_ping and pool_recycle on the SQLAlchemy engine so
  workers don't get handed dropped connections in the first place.
@level09 level09 requested a review from apodacaduron as a code owner May 6, 2026 16:10
@level09 level09 self-assigned this May 6, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c515149-ee5e-448c-8d2c-2b549dc50154

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/etl-import-stuck-pending-and-stale-pg-connections

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant