Apply blocks at production round entry by heifner · Pull Request #322 · Wire-Network/wire-sysio

heifner · 2026-05-01T16:34:50Z

Summary

Companion to #320 (strong-QC lockout fix). Where #320 handles the case where our applied head is on a different branch from the one that will win fork-choice, this PR handles the case where our applied head is on the same branch but behind it -- i.e., blocks from the canonical chain are already in fork_db but haven't been applied by the time our producing slot starts.

Plus a follow-on optimization to controller::is_head_descendant_of_pending_lib since it sits adjacent to this work, and a latent timer-correlation race in schedule_delayed_production_loop that becomes load-bearing once the apply loop runs in producing mode.

This PR is stacked on #320; merge target will rebase to master after #320 lands.

Why now -- continuing the Vaulta diagnosis

Incident 2 in the Vaulta operator's reports showed exactly this case. When titan's slot ended, blocks 408-411 were in fork_db on the canonical branch by 41.567, but the BP's applied head was still at 408 (apply was interrupted by the next block's arrival, and in_producing_mode then suppressed the retry). The BP entered its own slot at 41.962 with chain.head() == 408 and fork_db_head == 411, both on the canonical chain. Production proceeded on the stale head; all 12 produced blocks orphaned at slot end.

The strong-QC lockout fix in #320 doesn't trigger here -- titan's branch is our branch, just further along, so locks_out_branch_of correctly returns false. We need a separate condition for "same chain, behind."

Mechanism

producer_plugin::start_block already has a process_pending_blocks loop that fires apply_blocks when in_speculating_mode or when our head is not a descendant of pending_savanna_lib (an off-canonical-chain check). This PR extends the disjunct with a third condition: fork_db_head is ahead of chain.head() and fork_db_head.extends(chain.head().id()) (same branch, behind). When that fires, the existing time-bounded apply runs once, head advances, and start_block proceeds normally on the now-current head.

The bail-out condition is widened: the loop now exits whenever apply_blocks reports zero blocks applied, regardless of whether the status is complete (queue empty) or incomplete (the block at the front of the queue couldn't complete within deadline -- e.g. an infinite trx). Re-entering the loop in that case would just hit the same wall while starving the main thread of net_plugin and fork-choice updates that could resolve the situation. In producing mode the bail falls through to produce_block, building a competing block at the same height; under Savanna fork choice the producer's competing block wins by timestamp once committed and the unapplyable block is orphaned. interrupt_trx_test exercises this path.

Adds block_handle::extends(block_id_type) -- a thin wrapper around finality_core::extends. Used both by the new producer_plugin condition and by the optimization below.

Latent race in schedule_delayed_production_loop

Pre-existing bug, latent on master: schedule_delayed_production_loop checks _timer_corelation_id in its timer callback before posting schedule_production_loop to the executor, but the posted lambda itself doesn't recheck. Between the timer callback firing and the executor running the post, another scheduling call (e.g. schedule_maybe_produce_block) can bump _timer_corelation_id. The stale post then runs anyway, calling schedule_production_loop -> start_block -> the next deadline-interrupt schedule -> another cid bump -- which preempts the just-set produce_block timer (whose own posted lambda does check cid and skips).

schedule_maybe_produce_block already mirrors the cid check in its posted lambda. Apply the same pattern to schedule_delayed_production_loop.

Latent because today the apply-blocks loop in start_block doesn't fire in producing mode, so the deadline interrupt's schedule_production_loop post doesn't race against schedule_maybe_produce_block. The new same-chain-behind condition makes that combination reachable. Without this fix, interrupt_trx_test deadlocks: every apply attempt on the bad block is deadline-interrupted, the producer never reaches produce_block, and the chain stalls.

Savanna context: the drop-late-blocks property is gone

Spring/EOSIO has historically had a "drop late blocks" property: under DPoS, the BP that propagated late was the one that suffered, because the next BP would refuse their blocks and produce on top of their own (longer-by-chain-length) view. Timing accountability stayed with the offending BP.

Savanna fork choice does not have this property. Whether a chain wins is decided by the finalizer set's QC voting, not by chain length at any individual node. By the time titan's late chain has a strong QC, no local action by the next BP can recover the slot -- the only choices are "apply and accept the loss" or "don't apply and add a fork-switch loss on top."

This change accepts that trade-off: it can't restore the drop-late-blocks defense (which would require a protocol change), but it minimizes the local damage by ensuring the BP at least produces on the right chain when it does produce. Same-chain-behind is the most clearly-this-is-pointless case -- there's no alternative branch we'd be sticking with by refusing to apply, we're just sitting on an older block of the chain everyone else already has.

Optimization: is_head_descendant_of_pending_lib via core.extends

The previous implementation walked the fork_db index from chain_head following previous() links until it hit pending_savanna_lib_id (or went below it), under the fork_db mutex -- O(N) in head-vs-pending-lib distance with mutex contention.

Reimplemented using chain_head.extends(pending_id), which reads the canonical block_ref at pending_lib's height directly out of the head's finality_core. The core tracks block_refs across [last_final_block_num(), current_block_num()) on the chain leading to the head, so extends answers the same question in O(1) without any fork_db lock. The head == pending_lib edge case is handled with an explicit equality check.

Removes the now-unused fork_db.is_descendant_of_pending_savanna_lib (and impl) since this was its only caller.

Consensus invariants

No new signed artifact -- only chain.head() advances (it walks blocks already in fork_db, which got there via the network-validated accept_block path).
No fork-choice override -- fork_database's comparator still picks the branch; this code only governs whether we apply blocks on the chosen branch promptly.
No finalizer voting change -- consider_voting fires per applied block as it does today.
Time-boundedness preserved -- apply_blocks uses the existing slot deadline.
Schedule transitions during apply are handled by the existing re-determine path: if the apply advances head past a producer schedule change that reassigns our upcoming slot, determine_pending_block_mode will correctly transition to speculative and we won't produce in someone else's slot.

Spring

Same change applies cleanly to Spring; not opening that PR here. Spring's maintainers may decide independently whether and when to adopt.

Thin wrapper around finality_core::extends so callers can ask whether a given block_id is in this block's ancestry without reaching into the internal block_state. Used by the is_head_descendant_of_pending_lib refactor in the next commit; kept separate to keep that change focused.

…ant_of_pending_lib The previous implementation walked the fork_db index from chain_head following previous() links until it hit pending_savanna_lib_id (or went below it), under the fork_db mutex -- O(N) in head-vs-pending-lib distance with mutex contention. Reimplement using chain_head.extends(pending_id), which reads the canonical block_ref at pending_lib's height directly out of the head's finality_core. The finality_core tracks block_refs across [last_final_block_num(), current_block_num()) on the chain leading to the head, so the extends call answers the same question in O(1) without acquiring any fork_db lock. The head==pending_lib edge case is handled with an explicit equality check. Removes the now-unused fork_db.is_descendant_of_pending_savanna_lib (and impl) since this was its only caller. Adds 5 fork_database_tests cases (pending_lib_*) covering pending_lib at the genesis root, at fork points, on shared branches, on sub-branches, and at the deepest tip of a branch -- ensuring the new path returns correct values for the representative cases.

…corelation_id schedule_delayed_production_loop posts schedule_production_loop to the executor from its timer callback after a cid match check, but the posted lambda itself doesn't recheck _timer_corelation_id. Between the timer callback firing and the executor running the posted lambda, another scheduling call (e.g. schedule_maybe_produce_block after process_pending_blocks bails out, or another schedule_delayed_production_loop) can bump _timer_corelation_id. The stale post then runs anyway, calling schedule_production_loop -> start_block -> process_pending_blocks -> schedule_delayed_production_loop, which bumps cid again and starves the freshly-set produce_block timer (whose own posted lambda does check cid and skips). Mirror the double-check pattern schedule_maybe_produce_block already uses: capture cid in the posted lambda and skip if _timer_corelation_id has moved past it. Latent under master because the apply-blocks loop in start_block doesn't fire in producing mode today, so schedule_delayed_production_loop's deadline interrupt post doesn't race against schedule_maybe_produce_block. Becomes load-bearing once the apply-blocks loop also runs in producing mode (next commit).

…same chain Companion to the strong-QC lockout fix. When fork_db_head is ahead of our applied head on the *same* branch -- meaning blocks from the canonical chain are queued in fork_db but have not yet been applied -- start_block today proceeds to produce on the stale head. Under Savanna fork-choice (latest_qc_block_timestamp), producing on a stale head yields blocks that are orphaned by the next fork-switch as soon as apply_blocks runs against the queued blocks; "build longer locally and hope to win" no longer recovers since chain choice is decided by the finalizer set's QC, not by individual nodes' chain length. Extend the existing process_pending_blocks while-loop disjunct in start_block: in addition to firing apply_blocks for "speculating" or "head not on canonical chain (not descendant of pending savanna lib)", also fire when fork_db_head extends our applied head and is ahead of it. Other producers' blocks that are already on our chain just get applied before we decide what to build on. Time-bounded by the slot deadline as today. When apply_blocks reports zero blocks applied -- whether because the queue was empty (status complete) or because the block currently at the front of the queue cannot complete within deadline (status incomplete; e.g. infinite trx) -- exit the loop. Re-entering would just hit the same wall while starving the main thread of net_plugin and fork-choice updates. In producing mode, fall through to produce a competing block at the same height; fork choice can then prefer ours by timestamp once we commit. interrupt_trx_test exercises this path: a swap-injected doitforever block sits at fork_db_head, every apply attempt is deadline-interrupted, and the producer needs to build a competing block at the same height to advance the chain. Pairs with the prior commit's _timer_corelation_id fix in schedule_delayed_production_loop: without that fix, the deadline-interrupt's posted schedule_production_loop would race with schedule_maybe_produce_block and starve the producer. With it, the produce_block timer runs and the recovery completes.

block_handle::extends is a thin wrapper over finality_core::extends that adds a null-bsp guard. It was exercised only indirectly through is_head_descendant_of_pending_lib and producer_plugin's fork_db_ahead_on_same_chain check. Adds direct test coverage so a future change (e.g., dropping the null guard) does not slip past integration tests.

The pending_lib_* fixture tests verify the underlying logic via a local mirror of the controller method. Add a direct test that calls controller::is_head_descendant_of_pending_lib() on a real controller and exercises each branch (equality, ancestry, negative). Catches wiring bugs (wrong destructured field, swapped comparison, wrong fork_db accessor) that the mirror cannot. Uses setup_policy::none and drives pending_savanna_lib manually via set_savanna_lib to keep the test light and avoid depending on QC formation.

…_of_test The existing assertions all use head at block_num <= this. Add two cases for head at block_num > this: head on the same branch (extends this) returns false, head on a different branch returns true. The same-branch case exercises the qc_target check after the first extends returns false (head_id is past this->current_block_num).

…production_loop Replace the indirect "schedule_maybe_produce_block taking over after process_pending_blocks bails" with the more direct "the schedule_maybe_produce_block invoked after the next start_block". Same meaning, fewer hops to trace.

The producing-mode-locked-out path called chain.fork_db_head() twice (once for the lockout check, once for the head==fhead comparison after falling through). Hoist the call so both uses share the same snapshot.

bsp12bbb is created by the fixture (a third sibling at block 12 off bsp11b) but was missing from the heads vector in each pending_lib_* test. Add it. Only pending_lib_eq_root needs the corresponding expected_true update (which it gets implicitly via `heads`); the other four cases correctly keep bsp12bbb out of expected_true (it doesn't extend bsp11a, bsp12a, bsp12bb, or bsp14b).

…ly-fork-db-blocks-at-slot-entry Brings in the locks_out_branch_of false-positive fix for the case where head's last_final has advanced past the QC target.

brianjohnson5972

Looks good, a few comments, nothing needed though

brianjohnson5972 · 2026-05-19T23:53:01Z

From Claude:
schedule_delayed_production_loop cid recheck (producer_plugin.cpp:2880-2890)

Mirrors the existing pattern in schedule_maybe_produce_block at line 2898-2906. Correct fix for the described race. The PR description's explanation of why this becomes
load-bearing now (the apply-loop deadline-interrupt path becomes reachable in producing mode) is convincing.

Related issue not addressed: schedule_production_loop's own failure-retry path at line 2832-2838 posts unconditionally without an inner cid check. It has the same shape
and could in principle suffer the same race, though the retry interval (block_interval/10) makes it less likely to interleave. Worth flagging as follow-up.

The previous parenthetical "or is this block itself within the finality_core's tracking range" contradicts finality_core::extends's strict block_num < current_block_num requirement, which the unit test pins as false. Rewrite to spell out the tracking range and the strict-ancestor semantics so callers cannot misread the predicate as inclusive.

…ame_chain finality_core::extends only returns true when the queried id's block_num is strictly less than current_block_num, which already implies fork_db_head.block_num() > head.block_num(). The explicit comparison was redundant and not a meaningful optimization (extends is O(1) with cheap ops). The case of head having fallen below fork_db_head's last_final is the responsibility of !is_head_descendant_of_pending_lib(), not this predicate.

…ind predicate Three additions strengthening the test footprint around the producer_plugin fork_db_ahead_on_same_chain change: - block_handle_extends_test now exercises the boundaries of finality_core::extends's [last_final, current) range with a synthetic id below last_final and two real siblings at block_num == current and block_num > current, catching a regression where extends widens either bound. - controller_is_head_descendant_of_pending_lib_test walks pending_lib forward across eight stages alternating equality, distant ancestor, immediate-parent ancestor, and two flavors of negative case (pending_lib past head, and pending_lib at a block_num head later catches up to with a synthetic id). set_savanna_lib is monotonic in block_num, so head is advanced via produce_blocks between stages to expose new (head, pending_lib) pairs. - A new fork_db_ahead_of_head_on_same_chain savanna_cluster test localizes the predicate + apply pair on a real controller. Partitions node D off the cluster, produces two blocks on node A (kept to num_chains_to_final so pre_head stays in fork_db_head's tracking range), feeds them into D's fork_db via accept_block directly, asserts the predicate is true, and verifies apply_blocks catches head up to fork_db_head. A larger gap is the responsibility of the sibling !is_head_descendant_of_pending_lib() condition.

…locks-at-slot-entry Brings origin/master up to 3806434, picking up #292 (msig chunked storage), #318 (host builtin intrinsic removal), and #346 (locks_out_branch_of genesis-core and last_final boundary cases). Conflict in unittests/fork_db_tests.cpp resolved by keeping our h14a/h14b extensions to locks_out_branch_of_test alongside master's h_root genesis case. block_handle.hpp deduplicated -- both sides had appended extends() in different positions; kept the one with the tightened comment.

Brian flagged this as the third instance of the same shape: schedule_production_loop's failed-start retry posts schedule_production_loop unconditionally, while the matching paths in schedule_maybe_produce_block and schedule_delayed_production_loop recheck _timer_corelation_id inside the posted lambda. The race is less likely here (retry interval is block_interval/10 ~= 20ms, much shorter than the deadline path) but the shape is identical, so apply the same guard for consistency and so future schedule_* additions cannot quietly tip this path into a real race.

brianjohnson5972

Looks good, but there is one nit:
_timer_corelation_id should be _timer_correlation_id
but doesn't have to be done

heifner mentioned this pull request May 1, 2026

Block production diagnostic logging #323

Merged

heifner added 2 commits May 1, 2026 15:27

heifner force-pushed the fix/apply-fork-db-blocks-at-slot-entry branch from 15f873e to ebb021a Compare May 1, 2026 20:28

heifner added 9 commits May 1, 2026 15:40

producer_plugin: cache fork_db_head once at the top of on_incoming_block

8a4ee2a

The producing-mode-locked-out path called chain.fork_db_head() twice (once for the lockout check, once for the head==fhead comparison after falling through). Hoist the call so both uses share the same snapshot.

Merge branch 'fix/abort-production-on-strong-qc-lockout' into fix/app…

8d71562

…ly-fork-db-blocks-at-slot-entry Brings in the locks_out_branch_of false-positive fix for the case where head's last_final has advanced past the QC target.

Base automatically changed from fix/abort-production-on-strong-qc-lockout to master May 19, 2026 13:35

heifner requested review from brianjohnson5972 and huangminghuang May 19, 2026 13:42

huangminghuang approved these changes May 19, 2026

View reviewed changes

brianjohnson5972 approved these changes May 19, 2026

View reviewed changes

heifner added 5 commits May 20, 2026 09:54

heifner requested a review from brianjohnson5972 May 20, 2026 15:55

brianjohnson5972 approved these changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply blocks at production round entry#322

Apply blocks at production round entry#322
heifner wants to merge 16 commits into
masterfrom
fix/apply-fork-db-blocks-at-slot-entry

heifner commented May 1, 2026 •

edited

Loading

Uh oh!

brianjohnson5972 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brianjohnson5972 commented May 19, 2026

Uh oh!

brianjohnson5972 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

heifner commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why now -- continuing the Vaulta diagnosis

Mechanism

Latent race in schedule_delayed_production_loop

Savanna context: the drop-late-blocks property is gone

Optimization: is_head_descendant_of_pending_lib via core.extends

Consensus invariants

Spring

Uh oh!

brianjohnson5972 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brianjohnson5972 commented May 19, 2026

Uh oh!

brianjohnson5972 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

heifner commented May 1, 2026 •

edited

Loading