Skip to content

Block production diagnostic logging#323

Merged
heifner merged 4 commits into
masterfrom
feature/block-production-diagnostic-logging
May 8, 2026
Merged

Block production diagnostic logging#323
heifner merged 4 commits into
masterfrom
feature/block-production-diagnostic-logging

Conversation

@heifner
Copy link
Copy Markdown
Contributor

@heifner heifner commented May 1, 2026

Summary

Five operator-facing diagnostic logging improvements that surfaced as gaps during the recent Vaulta BP investigations (the always-11-orphaned-blocks pattern). All small, low-volume, and structured for grepping. Independent of the producer-side fixes in #320 and #322 -- this PR is just diagnostics, no behavior changes to block validation, application, or fork choice.

What you get

  1. wire latency added to "Received block" log lines. Splits the existing combined latency (block_timestamp -> apply complete) into a wire-side number (block_timestamp -> first network arrival, set by net_plugin) and the existing latency. The difference is the local apply-queue / in-producing-mode delay. In the Vaulta incidents we spent hours chasing a "9-second wire gap" that was actually a ~1-second wire gap plus 8 seconds of local hold; this would have surfaced that immediately. Existing latency field unchanged.

  2. Entering producing round. One info line at the start of each producing round, with head, fork_db_head, and the gap. If the gap is non-zero at round start, the producer is entering its slot with un-applied blocks queued in fork_db -- the exact condition that triggered the always-11 collapse. New-round detection compares head.producer() to scheduled_producer.

  3. Round complete. Companion to (2): one info line at round transition summarizing the just-finished round -- producer, starting block num, signed count, finalized count, orphaned count. Surfaces "always 11 missed" patterns directly without operators having to hand-correlate produce_block logs against fork-switch logs. Survival check is via head.extends(...), valid at next-round-start (sub-second window where a fork-switch could still shift counts by a block or two).

  4. Reformatted clock-offset log at over-threshold time-message asymmetry. Old form printed raw NTP arithmetic (rec/org/xmt/dst nanoseconds plus an offset integer), which operators rationally read as "clock drift" and dismissed. New form expresses the same data as outbound vs inbound one-way latency plus the offset. Same trigger threshold, same volume, much more actionable.

  5. block_notice mode transitions. One info line when a peer crosses the consecutive_block_nacks threshold (we've stopped sending them full blocks, only notices) and one when they fall back. Helps operators understand when peers are normally faster than them (and therefore become slow request-respond fallback paths when the usual fast path fails).

Plus a small helper block_handle::extends(block_id_type) shared by (3); same as the helper added in #322.

heifner added 3 commits May 1, 2026 13:32
…y latency in Received block

The existing "latency" field on the Received block log is apply-relative -- block_timestamp -> log_applied --
which combines actual network propagation delay with any local apply-queue / in-producing-mode delay. Operators
investigating "BP receiving blocks late" had no easy way to tell those apart, leading to misdiagnosis of network
issues that were actually local processing delays.

Add a wire_arrival timestamp to block_state, set by net_plugin when a block first arrives off the wire, and
surface it as a separate "wire latency" field in the Received block log. Existing "latency" field unchanged.
The difference (latency - wire_latency) is the local apply-queue delay. wire_latency is 0 for blocks that
weren't received over the network (replay, blocks loaded from disk).
…nsitions

Two diagnostic improvements adjacent in net_plugin:

(1) Reformat the over-threshold time-message-asymmetry log. Previous form printed the raw NTP arithmetic
    (rec/org/xmt/dst nanoseconds) and a single "offset" number, which operators rationally read as "clock
    drift" and dismissed. The same data is much more actionable expressed as outbound vs inbound one-way
    latency: a large outbound suggests our send queue is backed up, large inbound suggests our connection-
    strand processing is behind real-time, large symmetric offset across many peers points at our node,
    a single peer drifting alone points at NTP / clock skew on that peer.

(2) Log when a peer crosses the consecutive_block_nacks threshold (and back). When a peer keeps nacking
    blocks we send -- they were always faster than us on those blocks -- we switch to sending block_notice
    instead of full blocks. Useful for operators to know which peers are normally faster than them and to
    spot when those peers stop being faster (which makes them a fallback path subject to notice-request
    round-trip latency).

Both are once-per-event, small log volume.
Two related once-per-round diagnostic logs from the producer's start_block path, plus a small block_handle
helper they share with other diagnostic checks.

Round-entry log fires when the scheduled producer for the pending block differs from head.producer() (i.e.
the previous producer just finished and ours is starting). Includes head and fork_db_head numbers/ids and
how many blocks are in fork_db ahead of head; if non-zero, the round is being entered with un-applied
blocks queued -- the precondition for the in-producing-mode-gate failure mode.

Round-complete log fires at the next round-transition (when a different producer is scheduled). Walks our
recorded signed_blocks for the just-finished round and counts how many landed on head's chain vs were
orphaned. Survival check uses head.extends; valid for blocks still above LIB which is true at next-round-
start (sub-second window after round end may shift count by a block or two if a fork-switch fires in that
window). signed_blocks.reserve(producer_repetitions) avoids reallocation as the round fills.

Adds block_handle::extends(block_id_type) -- a thin wrapper around finality_core::extends, used for the
survival check above.
// signed_blocks form one chain in the common case (no mid-round fork switch); iterate reverse and break on the
// first canonical id, since all earlier signed blocks are then ancestors and necessarily canonical too.
size_t survived = 0;
for (auto it = _producing_round->signed_blocks.rbegin(); it != _producing_round->signed_blocks.rend(); ++it) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counts earlier signed blocks as survived once it finds the latest signed block on the current head branch. That only works if every block signed during the round stayed on one chain. If the producer signs on branch A, then a mid-round fork switch occurs and it signs on branch B, the reverse search can find the branch-B block and count the earlier branch-A blocks as “finalized/survived” even though they were orphaned. This is exactly the kind of forky case these diagnostics are meant to clarify. Safer fix: check each signed id independently against head.id() || head.extends(id), or reset the round tracker when the producer switches branches.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4ad3b54. Switched to a per-id head.id() == id || head.extends(id) loop over signed_blocks, so a mid-round branch switch can no longer let earlier branch-A blocks be counted as on-head once the producer fork-switches to branch B. The reverse-iter shortcut was a premature optimization for a 12-iteration loop.

signed_count, survived, signed_count - survived);
_producing_round.reset();
}
if (in_producing_mode() && head.producer() != scheduled_producer.producer_name && _last_round_entry_logged != block_time) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only starts a producing-round record when head.producer() != scheduled_producer.producer_name. On a single-producer network, or any schedule where the same account continues across a round boundary, this suppresses both the “Entering producing round” log and the later summary. If the goal is round diagnostics, keying this off the round boundary/block number is more reliable than comparing current head producer to scheduled producer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4ad3b54. Both transitions now hang off the round-start slot (block_timestamp_type(block_time).slot % chain::config::producer_repetitions == 0), the same idiom used a few lines up at line 2174. Single-producer schedules and same-account-across-boundary rotations now log round-entry and round-complete normally. _last_round_entry_logged was redundant once the slot edge plus the _producing_round optional dedup the calls, so it's gone.

}
}
const auto signed_count = _producing_round->signed_blocks.size();
ilog("Round complete for {} starting #{} at {}: signed {}, finalized {}, orphaned {}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logs the survived count as finalized. The code is checking whether signed blocks are on the current head branch, not whether they are irreversible/final. In this codebase “finalized” has a specific consensus meaning, so this can mislead operators reading the new diagnostics. Consider renaming the field to survived or canonical.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to on head in 4ad3b54 (signed S, on head H, orphaned O). Agreed that finalized collided with the consensus meaning.

Three corrections to be4a7a2, all from review on #323:

- Per-id head-branch check instead of the reverse-iter shortcut. Under a mid-round fork switch where
  the producer signs on branch A and then branch B, the reverse iteration lands on a branch-B id and
  silently counts the earlier branch-A blocks as on-head. The shortcut isn't safe for the exact case
  these diagnostics exist to surface; the loop is at most producer_repetitions iterations either way.

- Round-entry/round-complete trigger off the round-start slot (slot % producer_repetitions == 0), not
  off head.producer() != scheduled_producer. The producer-comparison form silently suppresses both
  logs on single-producer schedules and on any rotation where the same account continues across the
  boundary. _last_round_entry_logged drops out -- the slot edge plus the optional round-state are
  enough to dedup.

- Rename the on-head count from "finalized" to "on head" in the log message; "finalized" overlaps
  the consensus meaning of finalization (irreversible) and was misleading for "still on the head
  branch".
@huangminghuang huangminghuang self-requested a review May 7, 2026 19:23
@heifner heifner merged commit 7c536dd into master May 8, 2026
28 checks passed
@heifner heifner deleted the feature/block-production-diagnostic-logging branch May 8, 2026 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants