Apply blocks at production round entry#322
Conversation
Thin wrapper around finality_core::extends so callers can ask whether a given block_id is in this block's ancestry without reaching into the internal block_state. Used by the is_head_descendant_of_pending_lib refactor in the next commit; kept separate to keep that change focused.
…ant_of_pending_lib The previous implementation walked the fork_db index from chain_head following previous() links until it hit pending_savanna_lib_id (or went below it), under the fork_db mutex -- O(N) in head-vs-pending-lib distance with mutex contention. Reimplement using chain_head.extends(pending_id), which reads the canonical block_ref at pending_lib's height directly out of the head's finality_core. The finality_core tracks block_refs across [last_final_block_num(), current_block_num()) on the chain leading to the head, so the extends call answers the same question in O(1) without acquiring any fork_db lock. The head==pending_lib edge case is handled with an explicit equality check. Removes the now-unused fork_db.is_descendant_of_pending_savanna_lib (and impl) since this was its only caller. Adds 5 fork_database_tests cases (pending_lib_*) covering pending_lib at the genesis root, at fork points, on shared branches, on sub-branches, and at the deepest tip of a branch -- ensuring the new path returns correct values for the representative cases.
15f873e to
ebb021a
Compare
…corelation_id schedule_delayed_production_loop posts schedule_production_loop to the executor from its timer callback after a cid match check, but the posted lambda itself doesn't recheck _timer_corelation_id. Between the timer callback firing and the executor running the posted lambda, another scheduling call (e.g. schedule_maybe_produce_block after process_pending_blocks bails out, or another schedule_delayed_production_loop) can bump _timer_corelation_id. The stale post then runs anyway, calling schedule_production_loop -> start_block -> process_pending_blocks -> schedule_delayed_production_loop, which bumps cid again and starves the freshly-set produce_block timer (whose own posted lambda does check cid and skips). Mirror the double-check pattern schedule_maybe_produce_block already uses: capture cid in the posted lambda and skip if _timer_corelation_id has moved past it. Latent under master because the apply-blocks loop in start_block doesn't fire in producing mode today, so schedule_delayed_production_loop's deadline interrupt post doesn't race against schedule_maybe_produce_block. Becomes load-bearing once the apply-blocks loop also runs in producing mode (next commit).
…same chain Companion to the strong-QC lockout fix. When fork_db_head is ahead of our applied head on the *same* branch -- meaning blocks from the canonical chain are queued in fork_db but have not yet been applied -- start_block today proceeds to produce on the stale head. Under Savanna fork-choice (latest_qc_block_timestamp), producing on a stale head yields blocks that are orphaned by the next fork-switch as soon as apply_blocks runs against the queued blocks; "build longer locally and hope to win" no longer recovers since chain choice is decided by the finalizer set's QC, not by individual nodes' chain length. Extend the existing process_pending_blocks while-loop disjunct in start_block: in addition to firing apply_blocks for "speculating" or "head not on canonical chain (not descendant of pending savanna lib)", also fire when fork_db_head extends our applied head and is ahead of it. Other producers' blocks that are already on our chain just get applied before we decide what to build on. Time-bounded by the slot deadline as today. When apply_blocks reports zero blocks applied -- whether because the queue was empty (status complete) or because the block currently at the front of the queue cannot complete within deadline (status incomplete; e.g. infinite trx) -- exit the loop. Re-entering would just hit the same wall while starving the main thread of net_plugin and fork-choice updates. In producing mode, fall through to produce a competing block at the same height; fork choice can then prefer ours by timestamp once we commit. interrupt_trx_test exercises this path: a swap-injected doitforever block sits at fork_db_head, every apply attempt is deadline-interrupted, and the producer needs to build a competing block at the same height to advance the chain. Pairs with the prior commit's _timer_corelation_id fix in schedule_delayed_production_loop: without that fix, the deadline-interrupt's posted schedule_production_loop would race with schedule_maybe_produce_block and starve the producer. With it, the produce_block timer runs and the recovery completes.
block_handle::extends is a thin wrapper over finality_core::extends that adds a null-bsp guard. It was exercised only indirectly through is_head_descendant_of_pending_lib and producer_plugin's fork_db_ahead_on_same_chain check. Adds direct test coverage so a future change (e.g., dropping the null guard) does not slip past integration tests.
The pending_lib_* fixture tests verify the underlying logic via a local mirror of the controller method. Add a direct test that calls controller::is_head_descendant_of_pending_lib() on a real controller and exercises each branch (equality, ancestry, negative). Catches wiring bugs (wrong destructured field, swapped comparison, wrong fork_db accessor) that the mirror cannot. Uses setup_policy::none and drives pending_savanna_lib manually via set_savanna_lib to keep the test light and avoid depending on QC formation.
…_of_test The existing assertions all use head at block_num <= this. Add two cases for head at block_num > this: head on the same branch (extends this) returns false, head on a different branch returns true. The same-branch case exercises the qc_target check after the first extends returns false (head_id is past this->current_block_num).
…production_loop Replace the indirect "schedule_maybe_produce_block taking over after process_pending_blocks bails" with the more direct "the schedule_maybe_produce_block invoked after the next start_block". Same meaning, fewer hops to trace.
The producing-mode-locked-out path called chain.fork_db_head() twice (once for the lockout check, once for the head==fhead comparison after falling through). Hoist the call so both uses share the same snapshot.
bsp12bbb is created by the fixture (a third sibling at block 12 off bsp11b) but was missing from the heads vector in each pending_lib_* test. Add it. Only pending_lib_eq_root needs the corresponding expected_true update (which it gets implicitly via `heads`); the other four cases correctly keep bsp12bbb out of expected_true (it doesn't extend bsp11a, bsp12a, bsp12bb, or bsp14b).
…ly-fork-db-blocks-at-slot-entry Brings in the locks_out_branch_of false-positive fix for the case where head's last_final has advanced past the QC target.
brianjohnson5972
left a comment
There was a problem hiding this comment.
Looks good, a few comments, nothing needed though
|
From Claude: Mirrors the existing pattern in schedule_maybe_produce_block at line 2898-2906. Correct fix for the described race. The PR description's explanation of why this becomes Related issue not addressed: schedule_production_loop's own failure-retry path at line 2832-2838 posts unconditionally without an inner cid check. It has the same shape |
The previous parenthetical "or is this block itself within the finality_core's tracking range" contradicts finality_core::extends's strict block_num < current_block_num requirement, which the unit test pins as false. Rewrite to spell out the tracking range and the strict-ancestor semantics so callers cannot misread the predicate as inclusive.
…ame_chain finality_core::extends only returns true when the queried id's block_num is strictly less than current_block_num, which already implies fork_db_head.block_num() > head.block_num(). The explicit comparison was redundant and not a meaningful optimization (extends is O(1) with cheap ops). The case of head having fallen below fork_db_head's last_final is the responsibility of !is_head_descendant_of_pending_lib(), not this predicate.
…ind predicate Three additions strengthening the test footprint around the producer_plugin fork_db_ahead_on_same_chain change: - block_handle_extends_test now exercises the boundaries of finality_core::extends's [last_final, current) range with a synthetic id below last_final and two real siblings at block_num == current and block_num > current, catching a regression where extends widens either bound. - controller_is_head_descendant_of_pending_lib_test walks pending_lib forward across eight stages alternating equality, distant ancestor, immediate-parent ancestor, and two flavors of negative case (pending_lib past head, and pending_lib at a block_num head later catches up to with a synthetic id). set_savanna_lib is monotonic in block_num, so head is advanced via produce_blocks between stages to expose new (head, pending_lib) pairs. - A new fork_db_ahead_of_head_on_same_chain savanna_cluster test localizes the predicate + apply pair on a real controller. Partitions node D off the cluster, produces two blocks on node A (kept to num_chains_to_final so pre_head stays in fork_db_head's tracking range), feeds them into D's fork_db via accept_block directly, asserts the predicate is true, and verifies apply_blocks catches head up to fork_db_head. A larger gap is the responsibility of the sibling !is_head_descendant_of_pending_lib() condition.
…locks-at-slot-entry Brings origin/master up to 3806434, picking up #292 (msig chunked storage), #318 (host builtin intrinsic removal), and #346 (locks_out_branch_of genesis-core and last_final boundary cases). Conflict in unittests/fork_db_tests.cpp resolved by keeping our h14a/h14b extensions to locks_out_branch_of_test alongside master's h_root genesis case. block_handle.hpp deduplicated -- both sides had appended extends() in different positions; kept the one with the tightened comment.
Brian flagged this as the third instance of the same shape: schedule_production_loop's failed-start retry posts schedule_production_loop unconditionally, while the matching paths in schedule_maybe_produce_block and schedule_delayed_production_loop recheck _timer_corelation_id inside the posted lambda. The race is less likely here (retry interval is block_interval/10 ~= 20ms, much shorter than the deadline path) but the shape is identical, so apply the same guard for consistency and so future schedule_* additions cannot quietly tip this path into a real race.
brianjohnson5972
left a comment
There was a problem hiding this comment.
Looks good, but there is one nit:
_timer_corelation_id should be _timer_correlation_id
but doesn't have to be done
Summary
Companion to #320 (strong-QC lockout fix). Where #320 handles the case where our applied head is on a different branch from the one that will win fork-choice, this PR handles the case where our applied head is on the same branch but behind it -- i.e., blocks from the canonical chain are already in fork_db but haven't been applied by the time our producing slot starts.
Plus a follow-on optimization to
controller::is_head_descendant_of_pending_libsince it sits adjacent to this work, and a latent timer-correlation race inschedule_delayed_production_loopthat becomes load-bearing once the apply loop runs in producing mode.This PR is stacked on #320; merge target will rebase to master after #320 lands.
Why now -- continuing the Vaulta diagnosis
Incident 2 in the Vaulta operator's reports showed exactly this case. When titan's slot ended, blocks 408-411 were in fork_db on the canonical branch by 41.567, but the BP's applied head was still at 408 (apply was interrupted by the next block's arrival, and
in_producing_modethen suppressed the retry). The BP entered its own slot at 41.962 withchain.head() == 408andfork_db_head == 411, both on the canonical chain. Production proceeded on the stale head; all 12 produced blocks orphaned at slot end.The strong-QC lockout fix in #320 doesn't trigger here -- titan's branch is our branch, just further along, so
locks_out_branch_ofcorrectly returns false. We need a separate condition for "same chain, behind."Mechanism
producer_plugin::start_blockalready has aprocess_pending_blocksloop that firesapply_blockswhen in_speculating_mode or when our head is not a descendant of pending_savanna_lib (an off-canonical-chain check). This PR extends the disjunct with a third condition:fork_db_headis ahead ofchain.head()andfork_db_head.extends(chain.head().id())(same branch, behind). When that fires, the existing time-bounded apply runs once, head advances, andstart_blockproceeds normally on the now-current head.The bail-out condition is widened: the loop now exits whenever
apply_blocksreports zero blocks applied, regardless of whether the status iscomplete(queue empty) orincomplete(the block at the front of the queue couldn't complete within deadline -- e.g. an infinite trx). Re-entering the loop in that case would just hit the same wall while starving the main thread of net_plugin and fork-choice updates that could resolve the situation. In producing mode the bail falls through toproduce_block, building a competing block at the same height; under Savanna fork choice the producer's competing block wins by timestamp once committed and the unapplyable block is orphaned.interrupt_trx_testexercises this path.Adds
block_handle::extends(block_id_type)-- a thin wrapper aroundfinality_core::extends. Used both by the new producer_plugin condition and by the optimization below.Latent race in schedule_delayed_production_loop
Pre-existing bug, latent on master:
schedule_delayed_production_loopchecks_timer_corelation_idin its timer callback before postingschedule_production_loopto the executor, but the posted lambda itself doesn't recheck. Between the timer callback firing and the executor running the post, another scheduling call (e.g.schedule_maybe_produce_block) can bump_timer_corelation_id. The stale post then runs anyway, callingschedule_production_loop->start_block-> the next deadline-interrupt schedule -> another cid bump -- which preempts the just-set produce_block timer (whose own posted lambda does check cid and skips).schedule_maybe_produce_blockalready mirrors the cid check in its posted lambda. Apply the same pattern toschedule_delayed_production_loop.Latent because today the apply-blocks loop in
start_blockdoesn't fire in producing mode, so the deadline interrupt'sschedule_production_looppost doesn't race againstschedule_maybe_produce_block. The new same-chain-behind condition makes that combination reachable. Without this fix,interrupt_trx_testdeadlocks: every apply attempt on the bad block is deadline-interrupted, the producer never reachesproduce_block, and the chain stalls.Savanna context: the drop-late-blocks property is gone
Spring/EOSIO has historically had a "drop late blocks" property: under DPoS, the BP that propagated late was the one that suffered, because the next BP would refuse their blocks and produce on top of their own (longer-by-chain-length) view. Timing accountability stayed with the offending BP.
Savanna fork choice does not have this property. Whether a chain wins is decided by the finalizer set's QC voting, not by chain length at any individual node. By the time titan's late chain has a strong QC, no local action by the next BP can recover the slot -- the only choices are "apply and accept the loss" or "don't apply and add a fork-switch loss on top."
This change accepts that trade-off: it can't restore the drop-late-blocks defense (which would require a protocol change), but it minimizes the local damage by ensuring the BP at least produces on the right chain when it does produce. Same-chain-behind is the most clearly-this-is-pointless case -- there's no alternative branch we'd be sticking with by refusing to apply, we're just sitting on an older block of the chain everyone else already has.
Optimization: is_head_descendant_of_pending_lib via core.extends
The previous implementation walked the fork_db index from
chain_headfollowingprevious()links until it hitpending_savanna_lib_id(or went below it), under the fork_db mutex -- O(N) in head-vs-pending-lib distance with mutex contention.Reimplemented using
chain_head.extends(pending_id), which reads the canonical block_ref at pending_lib's height directly out of the head'sfinality_core. The core tracks block_refs across[last_final_block_num(), current_block_num())on the chain leading to the head, soextendsanswers the same question in O(1) without any fork_db lock. Thehead == pending_libedge case is handled with an explicit equality check.Removes the now-unused
fork_db.is_descendant_of_pending_savanna_lib(and impl) since this was its only caller.Consensus invariants
chain.head()advances (it walks blocks already in fork_db, which got there via the network-validatedaccept_blockpath).fork_database's comparator still picks the branch; this code only governs whether we apply blocks on the chosen branch promptly.consider_votingfires per applied block as it does today.apply_blocksuses the existing slot deadline.determine_pending_block_modewill correctly transition to speculative and we won't produce in someone else's slot.Spring
Same change applies cleanly to Spring; not opening that PR here. Spring's maintainers may decide independently whether and when to adopt.