fix: lock main row in legacy next_batch_custom #235
The cooperative override of pgque.next_batch_custom(5-arg) selected the
main subscription row without FOR UPDATE before checking sub_role. A
concurrent pgque.register_subconsumer(..., convert_normal => true) could
promote 'normal' -> 'coop_main' between the SELECT and the UPDATE that
stamps sub_batch, leaving a coop_main row with sub_batch IS NOT NULL --
violating the spec invariant that "coop_main must never have sub_batch
IS NOT NULL".
Mirrors the same FOR UPDATE pattern that build/transform.sh injects into
the PgQ-derived next_batch.sql; the cooperative override at
sql/pgque-api/cooperative_consumers.sql replaces that function and must
hold the lock for the same reason.
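The locking pattern described above can be sketched in isolation. This is a minimal illustration, not the pgque source: `demo_subscription` is a hypothetical stand-in that models only the columns named in this PR, and the column types and the literal batch id are assumptions.

```sql
-- Hypothetical stand-in for pgque.subscription (assumed column types).
create temp table demo_subscription (
    sub_queue    int    not null,
    sub_consumer int    not null,
    sub_role     text   not null default 'normal',
    sub_batch    bigint,
    primary key (sub_queue, sub_consumer)
);
insert into demo_subscription (sub_queue, sub_consumer) values (1, 1);

begin;

-- Read the row and lock it in the same statement. On a regular (non-temp)
-- table, a concurrent transaction that tries to UPDATE this row, for example
-- to promote it to 'coop_main', waits until this transaction ends.
select s.sub_role, s.sub_batch
  from demo_subscription s
 where s.sub_queue = 1
   and s.sub_consumer = 1
   for update of s;

-- Because the row lock is held, the role checked above is still the row's
-- role when sub_batch is stamped here.
update demo_subscription
   set sub_batch = 42
 where sub_queue = 1
   and sub_consumer = 1;

commit;
```

Without the `for update of s` clause, the SELECT takes no row lock, so the role check and the UPDATE can straddle a concurrent role change.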
The new tests/two_session_legacy_coop_race.sh deterministically races
the two sessions by temporarily replacing pgque.find_tick_helper with a
pausing variant so session A wedges between its initial SELECT and its
UPDATE while session B runs register_subconsumer. The original
find_tick_helper is captured via pg_get_functiondef and restored on
exit. Pre-fix the harness fails with final state coop_main|<batch>;
post-fix it passes with normal|<batch> and B blocks on the row lock.
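One way a pausing helper of this kind could work is to gate on an advisory lock. The sketch below is an assumption about the general technique, not the harness's actual code: `pause_point`, its signature, and the key `12345` are all hypothetical.

```sql
-- Hypothetical gate function; name, signature, and key are illustrative only.
create function pg_temp.pause_point(p_key bigint) returns void
language plpgsql as $$
begin
    -- Blocks for as long as another session holds the same advisory key.
    perform pg_advisory_lock(p_key);
    -- Release immediately: the lock is used only as a gate.
    perform pg_advisory_unlock(p_key);
end;
$$;

-- Controller session (a separate connection), sketched as comments:
--   select pg_advisory_lock(12345);    -- close the gate
--   ... session A reaches pause_point(12345) and waits ...
--   ... session B runs its conflicting call ...
--   select pg_advisory_unlock(12345);  -- open the gate; session A resumes

-- With no other session holding the key, the call returns immediately.
select pg_temp.pause_point(12345);
```

Gating on an advisory lock instead of sleeping makes the interleaving deterministic: the paused session resumes only when the controller releases the key.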
Red/green evidence:
  pre-fix : final billing-row state (sub_role|sub_batch): coop_main|4
            FAIL: legacy next_batch_custom raced with register_subconsumer
  post-fix: final billing-row state (sub_role|sub_batch): normal|1
            PASS: legacy next_batch_custom serializes against
            register_subconsumer; invariant intact
tests/run_all.sql (47 suites) and tests/two_session_receive_lock.sh both
remain green. The new harness is not yet wired into .github/workflows/ci.yml
since tests/two_session_receive_lock.sh is also unwired on main; both can
be wired in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the use of AI is frowned upon for this type of PR, let me know and I'll gladly find another way to help with this nice project.
Defense-in-depth follow-up to the FOR UPDATE fix in the parent commit.
The cooperative override of pgque.next_batch_custom(5-arg) already
rejects coop_main rows that have at least one coop_member, but a
*memberless* coop_main bypasses that check and falls through to the
UPDATE. Today that state is unreachable (register_subconsumer always
inserts a member in the same tx; unregister_subconsumer demotes the
main back to 'normal' when the last member is removed), but the UPDATE
keyed only on (sub_queue, sub_consumer) admits a spec violation if a
future code path ever leaves a memberless coop_main behind.
Adds `and pgque.subscription.sub_role = 'normal'` to the UPDATE WHERE
clause. The column reference is fully qualified because the function
declares a local PL/pgSQL variable also named `sub_role`. With FOR
UPDATE held since the initial SELECT, sub_role cannot change here, so
this filter is a guard against future regressions rather than a fix
for a currently-reachable bug.
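The effect of the qualified guard can be sketched outside pgque. In this hypothetical example, `demo_sub_guard` stands in for `pgque.subscription`, the row is manually set to `coop_main`, and the batch id is made up. Under PL/pgSQL's default `variable_conflict = error` behavior, an unqualified `sub_role` in the WHERE clause would be rejected as ambiguous with the local variable, which is why the column is table-qualified.

```sql
-- Hypothetical stand-in table; only the columns named in this PR are modeled.
create temp table demo_sub_guard (
    sub_queue    int    not null,
    sub_consumer int    not null,
    sub_role     text   not null,
    sub_batch    bigint,
    primary key (sub_queue, sub_consumer)
);
insert into demo_sub_guard values (1, 1, 'coop_main', null);

do $$
declare
    sub_role text;   -- deliberately shares its name with the column
    n_rows   bigint;
begin
    -- Lock the row and read its role into the local variable.
    select s.sub_role into sub_role
      from demo_sub_guard s
     where s.sub_queue = 1 and s.sub_consumer = 1
       for update of s;

    -- Table-qualified column references avoid the name clash with the
    -- local variable; the role filter keeps coop_main rows untouched.
    update demo_sub_guard
       set sub_batch = 5
     where demo_sub_guard.sub_queue = 1
       and demo_sub_guard.sub_consumer = 1
       and demo_sub_guard.sub_role = 'normal';

    get diagnostics n_rows = row_count;
    -- The only row is 'coop_main', so no row matches and n_rows is 0.
    raise notice 'role read: %, rows stamped: %', sub_role, n_rows;
end;
$$;
```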
tests/test_legacy_next_batch_role_guard.sql exercises the memberless-
coop_main passthrough by manually flipping sub_role and confirms the
UPDATE no longer stamps sub_batch on the coop_main row. Wired into
tests/run_all.sql.
Red/green:
  pre-fix : ERROR: invariant violated: coop_main row has sub_batch = 5
            (psql exit 3)
  post-fix: PASS: legacy next_batch_custom rejects writing sub_batch
            on a coop_main row
            (psql exit 0, full run_all.sql suite green)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If tests/two_session_legacy_coop_race.sh is killed between installing the pausing variant of pgque.find_tick_helper and restoring the original on exit, the database is left with the test override. A naive re-run captures that override as "original" via pg_get_functiondef and never restores the real function -- the schema stays silently broken until sql/pgque.sql is re-installed.
Refuse to start when the captured "original" contains the harness's own advisory-lock key or the $test$ dollar-quote tag. Reviewer suggestion from PR NikolayS#235.
Verified:
- Normal run: still PASS (no false positive).
- With a leftover override manually installed: harness aborts with a clear "re-install pgque first" hint instead of restoring the override on exit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REV Code Review Report
BLOCKING ISSUES
None found.
NON-BLOCKING
LOW
POTENTIAL ISSUES
MEDIUM
Verification performed
Summary
The core race fix (
Thanks, @victoraugustolls. This is a solid fix. The race analysis makes sense, the change is nicely scoped, and the red/green TDD evidence is exactly the right way to handle this kind of concurrency bug. CI is green across the PostgreSQL matrix and I also ran the focused PG18 checks locally.
AI-assisted development is totally fine here when the contribution is fully understood and the code quality bar is maintained. This PR clears that bar.
Your follow-up ideas also make sense, especially wiring the new two-session harness into CI together with the existing unwired receive-lock harness. Follow-up contributions are welcome.
NikolayS
left a comment
Approved. CI is green, focused local verification passed, and the fix is appropriately scoped.
What
When two sessions race, with one calling the legacy `pgque.next_batch_custom(queue, consumer, ...)` and another calling `pgque.register_subconsumer(queue, consumer, sub, convert_normal => true)`, the subscription row can end up with `sub_role = 'coop_main'` and `sub_batch IS NOT NULL` at the same time. That combination is explicitly forbidden by the cooperative-consumer spec: the main row owns the group cursor, members own batches.

The bug is in the cooperative override of `next_batch_custom`: it reads the main subscription row without `FOR UPDATE`, so it can pass the "is this row a `coop_main`?" check while another transaction is simultaneously promoting that same row from `normal` to `coop_main`. By the time the function reaches its `UPDATE ... SET sub_batch = ...`, the row is already `coop_main`, and the stamp lands on the wrong kind of row.

Why it matters
Once a `coop_main` row carries a `sub_batch`, downstream behavior gets weird: `finish_batch` on that `batch_id` resolves to the main row instead of a member, the cooperative group cursor stops advancing cleanly, and tools that scan for "active cooperative batches" double-count. It's the kind of thing that doesn't show up in dev and then bites someone the first time a worker calls `register_subconsumer(..., convert_normal => true)` against a queue that already has a busy legacy consumer.

The fix
One line. Add `for update of s` to the main subscription `SELECT` inside the cooperative override of `next_batch_custom(5-arg)` at `sql/pgque-api/cooperative_consumers.sql`. The cooperative module `CREATE OR REPLACE`s the PgQ-derived `next_batch_custom` to add a `coop_main`-with-members rejection check; this PR adds the row lock that protects that check from a concurrent role transition.

A related (but separate) concurrent-receive race against the PgQ-derived non-cooperative `next_batch_custom` is being addressed in another PR/branch (`fix/concurrent-receive`); that is a different function and a different code path. This PR only touches the cooperative override.

Changed in three places (source + two bundled outputs), all in sync with `bash build/transform.sh`:
- `sql/pgque-api/cooperative_consumers.sql`
- `sql/pgque.sql`
- `sql/pgque-tle.sql`

Test plan
`tests/two_session_legacy_coop_race.sh` is a new two-session harness, modeled on the existing `tests/two_session_receive_lock.sh`. It deterministically reproduces the race by temporarily swapping `pgque.find_tick_helper` for a variant that pauses on a session-level advisory lock. That gives us a clean window between the function's `SELECT` and its `UPDATE` to slip session B's `register_subconsumer` in. The original `find_tick_helper` is captured via `pg_get_functiondef` at the start and restored on exit.

Red/green evidence on a fresh PG16:

Also green:
- `tests/run_all.sql`: all 47 suites
- `tests/two_session_receive_lock.sh`: existing receive-lock harness still serializes correctly (no impact on that path)
- `bash build/transform.sh` regenerates `sql/pgque.sql` and `sql/pgque-tle.sql` byte-identical to the manual edits

Things to know before merging
- `tests/two_session_receive_lock.sh`, which is also unwired on `main`. Happy to do both in a follow-up, just didn't want to bundle CI plumbing into a behavioral fix.
- `register_subconsumer(..., convert_normal => true)` will now block briefly behind a legacy `next_batch` instead of racing it.

NOTES:
This PR and analysis were done with the help of Claude Opus 4.7 with maximum effort.