Skip to content

Bound the count query with a server-side statement timeout#1463

Open
selesse wants to merge 1 commit intomainfrom
count-timeout
Open

Bound the count query with a server-side statement timeout#1463
selesse wants to merge 1 commit intomainfrom
count-timeout

Conversation

@selesse
Copy link
Copy Markdown

@selesse selesse commented May 1, 2026

Why

Task#count populates the progress bar before iteration begins and is called from TaskJobConcern#on_start. On large or poorly-indexed collections this query can run for tens of seconds and stall the task — even though the count is only a UI hint and a failure here should not fail the run.

We've seen this hit production tasks repeatedly on my team, with count queries running long enough to time out the entire run.

Why not Timeout.timeout?

Active Record has no cross-platform query timeout API, so the obvious fix is Timeout.timeout(5) { @task.count }. That's unsafe to use around database calls — it raises across thread boundaries at arbitrary bytecode points, which can leave the connection mid-protocol with the server-side query still running on the database. That's the path #1388 currently uses for its count-preview endpoint, and it's the path I'd like to avoid for on_start.

What this does instead

Wrap the count call in a small adapter-aware helper that issues the database's native cancellation:

Adapter Mechanism
PostgreSQL / PostGIS SET LOCAL statement_timeout inside transaction(requires_new: true), auto-reset on commit/rollback
MySQL / Trilogy SET SESSION max_execution_time, prior value restored in ensure
SQLite / unknown no-op (the helper yields unmodified)

When the database cancels the query it raises ActiveRecord::QueryCanceled. safe_count rescues it; the run starts with a nil count (the progress bar shows progress without a percentage) and the cancellation is reported to Rails.error as a handled warning so it remains visible without paging anyone.

Default is 5 seconds, configurable via MaintenanceTasks.count_timeout_ms (nil or 0 to disable).

Tests

Unit tests use mocked connections to assert the right SQL is issued for each adapter, that prior values are restored, that errors propagate, and that nil/0/negative timeouts short-circuit before touching the connection.

Integration tests against a real PostgreSQL connection (the existing gemfiles/postgresql.gemfile CI matrix) verify that a slow query is actually cancelled, that fast queries pass through, and that statement_timeout is reset after both successful and failed blocks. They skip on SQLite.

Questions for reviewers

  1. Default value. I picked 5s as a "behave better by default" floor — long enough that almost every reasonable count finishes, short enough to prevent stalls. Open to making it nil (opt-in) if you'd prefer to avoid behaviour changes for existing users on first upgrade.
  2. Multi-DB. The timeout is applied to ActiveRecord::Base.connection. If a task's #count queries a model on a different connection, the timeout won't apply there. Documented as a caveat; could be extended to introspect the relation's connection if useful.

@selesse selesse force-pushed the count-timeout branch 2 times, most recently from 66bf990 to 379a057 Compare May 1, 2026 16:00
@selesse selesse requested a review from etiennebarrie May 1, 2026 16:03
@etiennebarrie
Copy link
Copy Markdown
Member

The count == :no_count@collection_enum.size fallback is preserved as-is. That fallback is the bigger footgun — operators returning :no_count to opt out of expensive counts get the count run for them anyway — but changing it is a separate behaviour change that needs a deprecation cycle, and I wanted to keep this PR scoped to the timeout safety net so it can land on its own.

Can you explain the footgun exactly? If I change Maintenance::UpdatePostsTask to return nil as described in the README, there's no count emitted. If you do return :no_count, the count is calculated but that's user error, we could change :no_count to be something else, it's not documented or anything. It doesn't mean "don't count" it means "there's no count method defined in the task".

@selesse
Copy link
Copy Markdown
Author

selesse commented May 4, 2026

The count == :no_count → @collection_enum.size fallback is preserved as-is. That fallback is the bigger footgun — operators returning :no_count to opt out of expensive counts get the count run for them anyway — but changing it is a separate behaviour change that needs a deprecation cycle, and I wanted to keep this PR scoped to the timeout safety net so it can land on its own.

Can you explain the footgun exactly? If I change Maintenance::UpdatePostsTask to return nil as described in the README, there's no count emitted. If you do return :no_count, the count is calculated but that's user error, we could change :no_count to be something else, it's not documented or anything. It doesn't mean "don't count" it means "there's no count method defined in the task".

Fair. The footgun documented is based on my team's experience in the past few weeks. AI tools are confidently recommending :no_count instead of nil as a way to skip counting.

`TaskJobConcern#on_start` calls `Task#count` to populate the progress
bar before iteration begins. On large or poorly-indexed collections
this query can run for tens of seconds and stall the task, even though
the count is only a UI hint and a failure here should not fail the
run. We've seen this hit production tasks repeatedly at Shopify.

Active Record has no cross-platform query timeout API, and
`Timeout.timeout` is unsafe to use around database calls — it raises
across thread boundaries at arbitrary bytecode points, which can leave
a connection mid-protocol and the server-side query still running.

Wrap the count call in a small adapter-aware helper that issues the
database's native cancellation:

  PostgreSQL/PostGIS: `SET LOCAL statement_timeout` inside a
    `transaction(requires_new: true)` so cleanup is automatic on
    commit or rollback.
  MySQL/Trilogy: `SET SESSION max_execution_time`, with the prior
    value restored in an `ensure`.
  SQLite/unknown: no-op.

When the database cancels the query it raises
`ActiveRecord::QueryCanceled`, which `safe_count` rescues. The run
starts with a `nil` count (the progress bar shows progress without a
percentage) and the cancellation is reported to `Rails.error` as a
handled warning so it remains visible without paging anyone.

Default is 5 seconds, configurable via
`MaintenanceTasks.count_timeout_ms` (set to `nil` or `0` to disable).
The `:no_count` → `@collection_enum.size` fallback is intentionally
preserved here; changing that is a separate behaviour change.

Co-Authored-By: pi <pi@shopify.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants