Block production diagnostic logging#323
Conversation
…y latency in Received block The existing "latency" field on the Received block log is apply-relative -- block_timestamp -> log_applied -- which combines actual network propagation delay with any local apply-queue / in-producing-mode delay. Operators investigating "BP receiving blocks late" had no easy way to tell those apart, leading to misdiagnosis of network issues that were actually local processing delays. Add a wire_arrival timestamp to block_state, set by net_plugin when a block first arrives off the wire, and surface it as a separate "wire latency" field in the Received block log. Existing "latency" field unchanged. The difference (latency - wire_latency) is the local apply-queue delay. wire_latency is 0 for blocks that weren't received over the network (replay, blocks loaded from disk).
…nsitions
Two diagnostic improvements adjacent in net_plugin:
(1) Reformat the over-threshold time-message-asymmetry log. Previous form printed the raw NTP arithmetic
(rec/org/xmt/dst nanoseconds) and a single "offset" number, which operators rationally read as "clock
drift" and dismissed. The same data is much more actionable expressed as outbound vs inbound one-way
latency: a large outbound suggests our send queue is backed up, large inbound suggests our connection-
strand processing is behind real-time, large symmetric offset across many peers points at our node,
a single peer drifting alone points at NTP / clock skew on that peer.
(2) Log when a peer crosses the consecutive_block_nacks threshold (and back). When a peer keeps nacking
blocks we send -- they were always faster than us on those blocks -- we switch to sending block_notice
instead of full blocks. Useful for operators to know which peers are normally faster than them and to
spot when those peers stop being faster (which makes them a fallback path subject to notice-request
round-trip latency).
Both are once-per-event, small log volume.
Two related once-per-round diagnostic logs from the producer's start_block path, plus a small block_handle helper they share with other diagnostic checks. Round-entry log fires when the scheduled producer for the pending block differs from head.producer() (i.e. the previous producer just finished and ours is starting). Includes head and fork_db_head numbers/ids and how many blocks are in fork_db ahead of head; if non-zero, the round is being entered with un-applied blocks queued -- the precondition for the in-producing-mode-gate failure mode. Round-complete log fires at the next round-transition (when a different producer is scheduled). Walks our recorded signed_blocks for the just-finished round and counts how many landed on head's chain vs were orphaned. Survival check uses head.extends; valid for blocks still above LIB which is true at next-round- start (sub-second window after round end may shift count by a block or two if a fork-switch fires in that window). signed_blocks.reserve(producer_repetitions) avoids reallocation as the round fills. Adds block_handle::extends(block_id_type) -- a thin wrapper around finality_core::extends, used for the survival check above.
| // signed_blocks form one chain in the common case (no mid-round fork switch); iterate reverse and break on the | ||
| // first canonical id, since all earlier signed blocks are then ancestors and necessarily canonical too. | ||
| size_t survived = 0; | ||
| for (auto it = _producing_round->signed_blocks.rbegin(); it != _producing_round->signed_blocks.rend(); ++it) { |
There was a problem hiding this comment.
counts earlier signed blocks as survived once it finds the latest signed block on the current head branch. That only works if every block signed during the round stayed on one chain. If the producer signs on branch A, then a mid-round fork switch occurs and it signs on branch B, the reverse search can find the branch-B block and count the earlier branch-A blocks as “finalized/survived” even though they were orphaned. This is exactly the kind of forky case these diagnostics are meant to clarify. Safer fix: check each signed id independently against head.id() || head.extends(id), or reset the round tracker when the producer switches branches.
There was a problem hiding this comment.
Fixed in 4ad3b54. Switched to a per-id head.id() == id || head.extends(id) loop over signed_blocks, so a mid-round branch switch can no longer let earlier branch-A blocks be counted as on-head once the producer fork-switches to branch B. The reverse-iter shortcut was a premature optimization for a 12-iteration loop.
| signed_count, survived, signed_count - survived); | ||
| _producing_round.reset(); | ||
| } | ||
| if (in_producing_mode() && head.producer() != scheduled_producer.producer_name && _last_round_entry_logged != block_time) { |
There was a problem hiding this comment.
only starts a producing-round record when head.producer() != scheduled_producer.producer_name. On a single-producer network, or any schedule where the same account continues across a round boundary, this suppresses both the “Entering producing round” log and the later summary. If the goal is round diagnostics, keying this off the round boundary/block number is more reliable than comparing current head producer to scheduled producer.
There was a problem hiding this comment.
Fixed in 4ad3b54. Both transitions now hang off the round-start slot (block_timestamp_type(block_time).slot % chain::config::producer_repetitions == 0), the same idiom used a few lines up at line 2174. Single-producer schedules and same-account-across-boundary rotations now log round-entry and round-complete normally. _last_round_entry_logged was redundant once the slot edge plus the _producing_round optional dedup the calls, so it's gone.
| } | ||
| } | ||
| const auto signed_count = _producing_round->signed_blocks.size(); | ||
| ilog("Round complete for {} starting #{} at {}: signed {}, finalized {}, orphaned {}", |
There was a problem hiding this comment.
logs the survived count as finalized. The code is checking whether signed blocks are on the current head branch, not whether they are irreversible/final. In this codebase “finalized” has a specific consensus meaning, so this can mislead operators reading the new diagnostics. Consider renaming the field to survived or canonical.
There was a problem hiding this comment.
Renamed to on head in 4ad3b54 (signed S, on head H, orphaned O). Agreed that finalized collided with the consensus meaning.
Three corrections to be4a7a2, all from review on #323: - Per-id head-branch check instead of the reverse-iter shortcut. Under a mid-round fork switch where the producer signs on branch A and then branch B, the reverse iteration lands on a branch-B id and silently counts the earlier branch-A blocks as on-head. The shortcut isn't safe for the exact case these diagnostics exist to surface; the loop is at most producer_repetitions iterations either way. - Round-entry/round-complete trigger off the round-start slot (slot % producer_repetitions == 0), not off head.producer() != scheduled_producer. The producer-comparison form silently suppresses both logs on single-producer schedules and on any rotation where the same account continues across the boundary. _last_round_entry_logged drops out -- the slot edge plus the optional round-state are enough to dedup. - Rename the on-head count from "finalized" to "on head" in the log message; "finalized" overlaps the consensus meaning of finalization (irreversible) and was misleading for "still on the head branch".
Summary
Five operator-facing diagnostic logging improvements that surfaced as gaps during the recent Vaulta BP investigations (the always-11-orphaned-blocks pattern). All small, low-volume, and structured for grepping. Independent of the producer-side fixes in #320 and #322 -- this PR is just diagnostics, no behavior changes to block validation, application, or fork choice.
What you get
wire latency added to "Received block" log lines. Splits the existing combined
latency(block_timestamp -> apply complete) into a wire-side number (block_timestamp -> first network arrival, set by net_plugin) and the existing latency. The difference is the local apply-queue / in-producing-mode delay. In the Vaulta incidents we spent hours chasing a "9-second wire gap" that was actually a ~1-second wire gap plus 8 seconds of local hold; this would have surfaced that immediately. Existinglatencyfield unchanged.Entering producing round. One info line at the start of each producing round, with head, fork_db_head, and the gap. If the gap is non-zero at round start, the producer is entering its slot with un-applied blocks queued in fork_db -- the exact condition that triggered the always-11 collapse. New-round detection compares head.producer() to scheduled_producer.
Round complete. Companion to (2): one info line at round transition summarizing the just-finished round -- producer, starting block num, signed count, finalized count, orphaned count. Surfaces "always 11 missed" patterns directly without operators having to hand-correlate produce_block logs against fork-switch logs. Survival check is via
head.extends(...), valid at next-round-start (sub-second window where a fork-switch could still shift counts by a block or two).Reformatted clock-offset log at over-threshold time-message asymmetry. Old form printed raw NTP arithmetic (
rec/org/xmt/dstnanoseconds plus anoffsetinteger), which operators rationally read as "clock drift" and dismissed. New form expresses the same data as outbound vs inbound one-way latency plus the offset. Same trigger threshold, same volume, much more actionable.block_notice mode transitions. One info line when a peer crosses the
consecutive_block_nacksthreshold (we've stopped sending them full blocks, only notices) and one when they fall back. Helps operators understand when peers are normally faster than them (and therefore become slow request-respond fallback paths when the usual fast path fails).Plus a small helper
block_handle::extends(block_id_type)shared by (3); same as the helper added in #322.