Skip to content

fix: lock main row in legacy next_batch_custom#235

Merged
NikolayS merged 3 commits into
NikolayS:mainfrom
victoraugustolls:fix/PGQUE-208/legacy-next-batch-coop-race
May 23, 2026
Merged

fix: lock main row in legacy next_batch_custom#235
NikolayS merged 3 commits into
NikolayS:mainfrom
victoraugustolls:fix/PGQUE-208/legacy-next-batch-coop-race

Conversation

@victoraugustolls
Copy link
Copy Markdown
Contributor

@victoraugustolls victoraugustolls commented May 15, 2026

What

When two sessions race, with one calling the legacy pgque.next_batch_custom(queue, consumer, ...) and another calling pgque.register_subconsumer(queue, consumer, sub, convert_normal => true), the subscription row can end up with sub_role = 'coop_main' and sub_batch IS NOT NULL at the same time. That combination is explicitly forbidden by the cooperative-consumer spec: the main row owns the group cursor, members own batches.

The bug is in the cooperative override of next_batch_custom: it reads the main subscription row without FOR UPDATE, so it can pass the "is this row a coop_main?" check while another transaction is simultaneously promoting that same row from normal to coop_main. By the time the function reaches its UPDATE ... SET sub_batch = ..., the row is already coop_main, and the stamp lands on the wrong kind of row.

Why it matters

Once a coop_main row carries a sub_batch, downstream behavior gets weird: finish_batch on that batch_id resolves to the main row instead of a member, the cooperative group cursor stops advancing cleanly, and tools that scan for "active cooperative batches" double-count. It's the kind of thing that doesn't show up in dev and then bites someone the first time a worker calls register_subconsumer(..., convert_normal => true) against a queue that already has a busy legacy consumer.

The fix

One line. Add for update of s to the main subscription SELECT inside the cooperative override of next_batch_custom(5-arg) at sql/pgque-api/cooperative_consumers.sql. The cooperative module CREATE OR REPLACEs the PgQ-derived next_batch_custom to add a coop_main-with-members rejection check; this PR adds the row lock that protects that check from a concurrent role transition.

A related (but separate) concurrent-receive race against the PgQ-derived non-cooperative next_batch_custom is being addressed in another PR/branch (fix/concurrent-receive); that is a different function and a different code path. This PR only touches the cooperative override.

Changed in three places (source + two bundled outputs), all in sync with bash build/transform.sh:

  • sql/pgque-api/cooperative_consumers.sql
  • sql/pgque.sql
  • sql/pgque-tle.sql

Test plan

tests/two_session_legacy_coop_race.sh is a new two-session harness, modeled on the existing tests/two_session_receive_lock.sh. It deterministically reproduces the race by temporarily swapping pgque.find_tick_helper for a variant that pauses on a session-level advisory lock — that gives us a clean window between the function's SELECT and its UPDATE to slip session B's register_subconsumer in. The original find_tick_helper is captured via pg_get_functiondef at the start and restored on exit.

Red/green evidence on a fresh PG16:

# pre-fix
session B elapsed: 102 ms
final billing-row state (sub_role|sub_batch): coop_main|4
FAIL: legacy next_batch_custom raced with register_subconsumer

# post-fix
session B elapsed: 3071 ms              (B blocked on the row lock)
final billing-row state (sub_role|sub_batch): normal|1
PASS: legacy next_batch_custom serializes against register_subconsumer

Also green:

  • tests/run_all.sql: all 47 suites
  • tests/two_session_receive_lock.sh: existing receive-lock harness still serializes correctly (no impact on that path)
  • bash build/transform.sh regenerates sql/pgque.sql and sql/pgque-tle.sql byte-identical to the manual edits

Things to know before merging

  • The new harness isn't wired into CI yet. That matches the current state of tests/two_session_receive_lock.sh, which is also unwired on main. Happy to do both in a follow-up, just didn't want to bundle CI plumbing into a behavioral fix.
  • No API change. Pre-existing callers see the same return shape, same errors. The only observable difference is that a concurrent register_subconsumer(..., convert_normal => true) will now block briefly behind a legacy next_batch instead of racing it.
  • Manual verification command (matches what the harness runs):
PGQUE_TEST_DSN=postgresql://postgres:pgque_test@localhost:5432/pgque_test \
  tests/two_session_legacy_coop_race.sh

NOTES:

This PR and Analysis was done with the help of claude Opus 4.7 with maximum effort.

The cooperative override of pgque.next_batch_custom(5-arg) selected the
main subscription row without FOR UPDATE before checking sub_role. A
concurrent pgque.register_subconsumer(..., convert_normal => true) could
promote 'normal' -> 'coop_main' between the SELECT and the UPDATE that
stamps sub_batch, leaving a coop_main row with sub_batch IS NOT NULL --
violating the spec invariant that "coop_main must never have sub_batch
IS NOT NULL".

Mirrors the same FOR UPDATE pattern that build/transform.sh injects into
the PgQ-derived next_batch.sql; the cooperative override at
sql/pgque-api/cooperative_consumers.sql replaces that function and must
hold the lock for the same reason.

The new tests/two_session_legacy_coop_race.sh deterministically races
the two sessions by temporarily replacing pgque.find_tick_helper with a
pausing variant so session A wedges between its initial SELECT and its
UPDATE while session B runs register_subconsumer. The original
find_tick_helper is captured via pg_get_functiondef and restored on
exit. Pre-fix the harness fails with final state coop_main|<batch>;
post-fix it passes with normal|<batch> and B blocks on the row lock.

Red/green evidence:

  pre-fix : final billing-row state (sub_role|sub_batch): coop_main|4
            FAIL: legacy next_batch_custom raced with register_subconsumer

  post-fix: final billing-row state (sub_role|sub_batch): normal|1
            PASS: legacy next_batch_custom serializes against
            register_subconsumer; invariant intact

tests/run_all.sql (47 suites) and tests/two_session_receive_lock.sh both
remain green. The new harness is not yet wired into .github/workflows/ci.yml
since tests/two_session_receive_lock.sh is also unwired on main; both can
be wired in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@victoraugustolls
Copy link
Copy Markdown
Contributor Author

If the use of AI is frowned upon to this type of PRs, let me know and I'll gladly find another way to help with this nice project

victoraugustolls and others added 2 commits May 15, 2026 13:05
Defense-in-depth follow-up to the FOR UPDATE fix in the parent commit.
The cooperative override of pgque.next_batch_custom(5-arg) already
rejects coop_main rows that have at least one coop_member, but a
*memberless* coop_main bypasses that check and falls through to the
UPDATE. Today that state is unreachable (register_subconsumer always
inserts a member in the same tx; unregister_subconsumer demotes the
main back to 'normal' when the last member is removed), but the UPDATE
keyed only on (sub_queue, sub_consumer) admits a spec violation if a
future code path ever leaves a memberless coop_main behind.

Adds `and pgque.subscription.sub_role = 'normal'` to the UPDATE WHERE
clause. The column reference is fully qualified because the function
declares a local PL/pgSQL variable also named `sub_role`. With FOR
UPDATE held since the initial SELECT, sub_role cannot change here, so
this filter is a guard against future regressions rather than a fix
for a currently-reachable bug.

tests/test_legacy_next_batch_role_guard.sql exercises the memberless-
coop_main passthrough by manually flipping sub_role and confirms the
UPDATE no longer stamps sub_batch on the coop_main row. Wired into
tests/run_all.sql.

Red/green:

  pre-fix : ERROR: invariant violated: coop_main row has sub_batch = 5
            (psql exit 3)

  post-fix: PASS: legacy next_batch_custom rejects writing sub_batch
            on a coop_main row
            (psql exit 0, full run_all.sql suite green)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If tests/two_session_legacy_coop_race.sh is killed between installing
the pausing variant of pgque.find_tick_helper and restoring the original
on exit, the database is left with the test override. A naive re-run
captures that override as "original" via pg_get_functiondef and never
restores the real function -- the schema stays silently broken until
sql/pgque.sql is re-installed.

Refuse to start when the captured "original" contains the harness's
own advisory-lock key or the $test$ dollar-quote tag. The reviewer
suggestion from PR NikolayS#235.

Verified:
  - Normal run: still PASS (no false positive).
  - With a leftover override manually installed: harness aborts with
    a clear "re-install pgque first" hint instead of restoring the
    override on exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NikolayS
Copy link
Copy Markdown
Owner

REV Code Review Report


BLOCKING ISSUES

None found.


NON-BLOCKING

LOW tests/two_session_legacy_coop_race.sh:48 — Fixed advisory lock key can collide across concurrent manual runs

The queue/application names are suffixed per run, but the advisory lock key is global (77889911). Two developers/CI jobs running this harness against the same database could block or release each other’s test lock and produce confusing results.

Suggestion: derive the advisory key from the run suffix/queue name, or use a two-int advisory lock namespace like pg_advisory_lock(hashtext('pgque_legacy_coop_race'), hashtext(queue_name)).


POTENTIAL ISSUES

MEDIUM sql/pgque-api/cooperative_consumers.sql:384-400 — Memberless coop_main path returns a phantom batch_id (confidence: 6/10)

The new sub_role = 'normal' guard prevents stamping sub_batch on a stray memberless coop_main, which preserves the invariant. But the function assigns batch_id := nextval(...) before the guarded UPDATE and then returns without checking row_count. In the exact state covered by test_legacy_next_batch_role_guard.sql, callers can receive a non-null batch id that is not stored in pgque.subscription; finish_batch(batch_id) would not find it.

This may be acceptable if memberless coop_main is considered unreachable/corrupt state, but it contradicts the nearby comment saying memberless coop_main rows “behave as normal consumers and pass through.”

Suggestion: either explicitly reject coop_main in this legacy function, or check row_count after the update and return NULL/raise if no subscription row was updated. The test should also assert the returned batch_id behavior, not only sub_batch is null.


Verification performed

bash build/transform.sh
# ALL CHECKS PASSED; generated pgque.sql / pgque-tle.sql stayed consistent

gh pr checks 235 --repo NikolayS/PgQue
# all checks passing

PG18 focused local run:
psql -h localhost -p 55432 -U postgres -d pgque_test -v ON_ERROR_STOP=1 -f sql/pgque.sql
psql -h localhost -p 55432 -U postgres -d pgque_test -v ON_ERROR_STOP=1 -f tests/test_legacy_next_batch_role_guard.sql
PGQUE_TEST_DSN=postgresql://postgres:***@localhost:55432/pgque_test tests/two_session_legacy_coop_race.sh

# PASS: legacy next_batch_custom rejects writing sub_batch on a coop_main row
# session B elapsed: 3238 ms
# session B error: canceling statement due to lock timeout
# final billing-row state: normal|2
# PASS: legacy next_batch_custom serializes against register_subconsumer; invariant intact

Summary

The core race fix (FOR UPDATE OF s) looks right and the generated SQL is in sync. I’d treat the phantom-batch behavior as the only thing worth a maintainer decision before merge; it’s probably easy to tighten while this code is open.

@NikolayS
Copy link
Copy Markdown
Owner

Thanks, @victoraugustolls — this is a solid fix.

The race analysis makes sense, the change is nicely scoped, and the red/green TDD evidence is exactly the right way to handle this kind of concurrency bug. CI is green across the PostgreSQL matrix and I also ran the focused PG18 checks locally.

AI-assisted development is totally fine here when the contribution is fully understood and the code quality bar is maintained. This PR clears that bar.

Your follow-up ideas also make sense — especially wiring the new two-session harness into CI together with the existing unwired receive-lock harness. Follow-up contributions are welcome.

Copy link
Copy Markdown
Owner

@NikolayS NikolayS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. CI is green, focused local verification passed, and the fix is appropriately scoped.

@NikolayS NikolayS merged commit 0a778bf into NikolayS:main May 23, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants