Block production diagnostic logging by heifner · Pull Request #323 · Wire-Network/wire-sysio

heifner · 2026-05-01T18:35:45Z

Summary

Five operator-facing diagnostic logging improvements that surfaced as gaps during the recent Vaulta BP investigations (the always-11-orphaned-blocks pattern). All small, low-volume, and structured for grepping. Independent of the producer-side fixes in #320 and #322 -- this PR is just diagnostics, no behavior changes to block validation, application, or fork choice.

What you get

wire latency added to "Received block" log lines. Splits the existing combined latency (block_timestamp -> apply complete) into a wire-side number (block_timestamp -> first network arrival, set by net_plugin) and the existing latency. The difference is the local apply-queue / in-producing-mode delay. In the Vaulta incidents we spent hours chasing a "9-second wire gap" that was actually a ~1-second wire gap plus 8 seconds of local hold; this would have surfaced that immediately. Existing latency field unchanged.
Entering producing round. One info line at the start of each producing round, with head, fork_db_head, and the gap. If the gap is non-zero at round start, the producer is entering its slot with un-applied blocks queued in fork_db -- the exact condition that triggered the always-11 collapse. New-round detection compares head.producer() to scheduled_producer.
Round complete. Companion to (2): one info line at round transition summarizing the just-finished round -- producer, starting block num, signed count, finalized count, orphaned count. Surfaces "always 11 missed" patterns directly without operators having to hand-correlate produce_block logs against fork-switch logs. Survival check is via head.extends(...), valid at next-round-start (sub-second window where a fork-switch could still shift counts by a block or two).
Reformatted clock-offset log at over-threshold time-message asymmetry. Old form printed raw NTP arithmetic (rec/org/xmt/dst nanoseconds plus an offset integer), which operators rationally read as "clock drift" and dismissed. New form expresses the same data as outbound vs inbound one-way latency plus the offset. Same trigger threshold, same volume, much more actionable.
block_notice mode transitions. One info line when a peer crosses the consecutive_block_nacks threshold (we've stopped sending them full blocks, only notices) and one when they fall back. Helps operators understand when peers are normally faster than them (and therefore become slow request-respond fallback paths when the usual fast path fails).

Plus a small helper block_handle::extends(block_id_type) shared by (3); same as the helper added in #322.

…y latency in Received block The existing "latency" field on the Received block log is apply-relative -- block_timestamp -> log_applied -- which combines actual network propagation delay with any local apply-queue / in-producing-mode delay. Operators investigating "BP receiving blocks late" had no easy way to tell those apart, leading to misdiagnosis of network issues that were actually local processing delays. Add a wire_arrival timestamp to block_state, set by net_plugin when a block first arrives off the wire, and surface it as a separate "wire latency" field in the Received block log. Existing "latency" field unchanged. The difference (latency - wire_latency) is the local apply-queue delay. wire_latency is 0 for blocks that weren't received over the network (replay, blocks loaded from disk).

…nsitions Two diagnostic improvements adjacent in net_plugin: (1) Reformat the over-threshold time-message-asymmetry log. Previous form printed the raw NTP arithmetic (rec/org/xmt/dst nanoseconds) and a single "offset" number, which operators rationally read as "clock drift" and dismissed. The same data is much more actionable expressed as outbound vs inbound one-way latency: a large outbound suggests our send queue is backed up, large inbound suggests our connection- strand processing is behind real-time, large symmetric offset across many peers points at our node, a single peer drifting alone points at NTP / clock skew on that peer. (2) Log when a peer crosses the consecutive_block_nacks threshold (and back). When a peer keeps nacking blocks we send -- they were always faster than us on those blocks -- we switch to sending block_notice instead of full blocks. Useful for operators to know which peers are normally faster than them and to spot when those peers stop being faster (which makes them a fallback path subject to notice-request round-trip latency). Both are once-per-event, small log volume.

Two related once-per-round diagnostic logs from the producer's start_block path, plus a small block_handle helper they share with other diagnostic checks. Round-entry log fires when the scheduled producer for the pending block differs from head.producer() (i.e. the previous producer just finished and ours is starting). Includes head and fork_db_head numbers/ids and how many blocks are in fork_db ahead of head; if non-zero, the round is being entered with un-applied blocks queued -- the precondition for the in-producing-mode-gate failure mode. Round-complete log fires at the next round-transition (when a different producer is scheduled). Walks our recorded signed_blocks for the just-finished round and counts how many landed on head's chain vs were orphaned. Survival check uses head.extends; valid for blocks still above LIB which is true at next-round- start (sub-second window after round end may shift count by a block or two if a fork-switch fires in that window). signed_blocks.reserve(producer_repetitions) avoids reallocation as the round fills. Adds block_handle::extends(block_id_type) -- a thin wrapper around finality_core::extends, used for the survival check above.

huangminghuang · 2026-05-06T19:25:41Z

+      // signed_blocks form one chain in the common case (no mid-round fork switch); iterate reverse and break on the
+      // first canonical id, since all earlier signed blocks are then ancestors and necessarily canonical too.
+      size_t survived = 0;
+      for (auto it = _producing_round->signed_blocks.rbegin(); it != _producing_round->signed_blocks.rend(); ++it) {


counts earlier signed blocks as survived once it finds the latest signed block on the current head branch. That only works if every block signed during the round stayed on one chain. If the producer signs on branch A, then a mid-round fork switch occurs and it signs on branch B, the reverse search can find the branch-B block and count the earlier branch-A blocks as “finalized/survived” even though they were orphaned. This is exactly the kind of forky case these diagnostics are meant to clarify. Safer fix: check each signed id independently against head.id() || head.extends(id), or reset the round tracker when the producer switches branches.

Fixed in 4ad3b54. Switched to a per-id head.id() == id || head.extends(id) loop over signed_blocks, so a mid-round branch switch can no longer let earlier branch-A blocks be counted as on-head once the producer fork-switches to branch B. The reverse-iter shortcut was a premature optimization for a 12-iteration loop.

huangminghuang · 2026-05-06T19:25:57Z

+           signed_count, survived, signed_count - survived);
+      _producing_round.reset();
+   }
+   if (in_producing_mode() && head.producer() != scheduled_producer.producer_name && _last_round_entry_logged != block_time) {


only starts a producing-round record when head.producer() != scheduled_producer.producer_name. On a single-producer network, or any schedule where the same account continues across a round boundary, this suppresses both the “Entering producing round” log and the later summary. If the goal is round diagnostics, keying this off the round boundary/block number is more reliable than comparing current head producer to scheduled producer.

Fixed in 4ad3b54. Both transitions now hang off the round-start slot (block_timestamp_type(block_time).slot % chain::config::producer_repetitions == 0), the same idiom used a few lines up at line 2174. Single-producer schedules and same-account-across-boundary rotations now log round-entry and round-complete normally. _last_round_entry_logged was redundant once the slot edge plus the _producing_round optional dedup the calls, so it's gone.

huangminghuang · 2026-05-06T19:26:13Z

+         }
+      }
+      const auto signed_count = _producing_round->signed_blocks.size();
+      ilog("Round complete for {} starting #{} at {}: signed {}, finalized {}, orphaned {}",


logs the survived count as finalized. The code is checking whether signed blocks are on the current head branch, not whether they are irreversible/final. In this codebase “finalized” has a specific consensus meaning, so this can mislead operators reading the new diagnostics. Consider renaming the field to survived or canonical.

Renamed to on head in 4ad3b54 (signed S, on head H, orphaned O). Agreed that finalized collided with the consensus meaning.

Three corrections to be4a7a2, all from review on #323: - Per-id head-branch check instead of the reverse-iter shortcut. Under a mid-round fork switch where the producer signs on branch A and then branch B, the reverse iteration lands on a branch-B id and silently counts the earlier branch-A blocks as on-head. The shortcut isn't safe for the exact case these diagnostics exist to surface; the loop is at most producer_repetitions iterations either way. - Round-entry/round-complete trigger off the round-start slot (slot % producer_repetitions == 0), not off head.producer() != scheduled_producer. The producer-comparison form silently suppresses both logs on single-producer schedules and on any rotation where the same account continues across the boundary. _last_round_entry_logged drops out -- the slot edge plus the optional round-state are enough to dedup. - Rename the on-head count from "finalized" to "on head" in the log message; "finalized" overlaps the consensus meaning of finalization (irreversible) and was misleading for "still on the head branch".

heifner added 3 commits May 1, 2026 13:32

huangminghuang reviewed May 6, 2026

View reviewed changes

huangminghuang approved these changes May 7, 2026

View reviewed changes

huangminghuang self-requested a review May 7, 2026 19:23

huangminghuang approved these changes May 7, 2026

View reviewed changes

heifner merged commit 7c536dd into master May 8, 2026
28 checks passed

heifner deleted the feature/block-production-diagnostic-logging branch May 8, 2026 12:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Block production diagnostic logging#323

Block production diagnostic logging#323
heifner merged 4 commits into
masterfrom
feature/block-production-diagnostic-logging

heifner commented May 1, 2026

Uh oh!

huangminghuang May 6, 2026

Uh oh!

heifner May 7, 2026

Uh oh!

huangminghuang May 6, 2026

Uh oh!

heifner May 7, 2026

Uh oh!

huangminghuang May 6, 2026

Uh oh!

heifner May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

heifner commented May 1, 2026

Summary

What you get

Uh oh!

huangminghuang May 6, 2026

Choose a reason for hiding this comment

Uh oh!

heifner May 7, 2026

Choose a reason for hiding this comment

Uh oh!

huangminghuang May 6, 2026

Choose a reason for hiding this comment

Uh oh!

heifner May 7, 2026

Choose a reason for hiding this comment

Uh oh!

huangminghuang May 6, 2026

Choose a reason for hiding this comment

Uh oh!

heifner May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants