fix(snapshot): detect and recover validator vote snapshot inconsisten…#112

Open
On1x wants to merge 16 commits into master from snapshot-fix

@On1x On1x commented May 21, 2026

…cies

  • Add sanity check during export to warn if validators exist but validator votes are absent
  • Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
  • Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
  • Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts
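
The export check and import fallback described above can be sketched as follows. This is only an illustration, not the PR's code: the `Snapshot` map, the `validator` section key, and the helper names are hypothetical; only the `validator_vote` and `witness_vote` keys come from the description.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory view of a snapshot: section name -> serialized records.
using Snapshot = std::map<std::string, std::vector<std::string>>;

// Number of records in a section, or 0 when the section is missing.
static std::size_t section_size(const Snapshot& s, const std::string& key) {
    auto it = s.find(key);
    return it == s.end() ? 0 : it->second.size();
}

// Export-side sanity check: validators with no votes at all look suspicious.
// The "validator" section name is an assumption made for this sketch.
bool export_looks_consistent(const Snapshot& s) {
    return !(section_size(s, "validator") > 0 &&
             section_size(s, "validator_vote") == 0);
}

// Import-side fallback: prefer validator_vote, else the legacy witness_vote key.
// Returns an empty string when neither section holds any records.
std::string select_vote_section(const Snapshot& s) {
    if (section_size(s, "validator_vote") > 0) return "validator_vote";
    if (section_size(s, "witness_vote") > 0) return "witness_vote";
    return "";
}
```

A caller would log a warning when `export_looks_consistent` returns false, and read votes from whichever key `select_vote_section` returns.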

On1x added 16 commits May 21, 2026 06:41
…cies

- Add sanity check during export to warn if validators exist but validator votes are absent
- Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
- Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
- Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts
…options

- Deleted all mentions of `LOW_MEMORY_NODE` from build scripts, environment variables, and documentation
- Removed low-memory node build instructions and flags from Linux, macOS, and Windows build guides
- Updated CMake options and environment variables to exclude low-memory settings
- Simplified Docker image CMake flags by removing `LOW_MEMORY_NODE`
- Cleared low-memory related config references in node setup and getting started guides
- Cleaned up example config files by removing deprecated plugins and options related to low-memory builds

- Delete config_debug_mongo.ini to clean up obsolete debug mongo configuration
- Remove config_mongo.ini to eliminate outdated mongo production configuration
- Simplify project configuration by removing unused or legacy mongo ini files

- Changed info-level logs (ilog) to debug-level logs (dlog) when connecting to peers and sending DLT hello messages
- Updated rate-limit notification from ilog to dlog for peer exchange requests
- Ensured logging reflects appropriate verbosity level for peer communication events

- Handle CORS preflight by responding to OPTIONS method with proper headers
- Append Access-Control-Allow-Origin header to all HTTP responses
- Add Access-Control-Allow-Methods, Allow-Headers, and Max-Age headers for OPTIONS responses
- Ensure CORS headers are included on error and success responses
- Prevent CORS issues for cross-origin API calls through the webserver plugin
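
As a rough illustration of the header logic listed above, the sketch below builds the CORS headers for a response. The `Headers` alias and `cors_headers` function are hypothetical, and the values for Allow-Methods, Allow-Headers, and Max-Age are assumed placeholders; only the `Access-Control-Allow-Origin: *` value is stated in this PR.

```cpp
#include <map>
#include <string>

using Headers = std::map<std::string, std::string>;

// Builds CORS headers for one response. Every response carries Allow-Origin;
// a preflight (OPTIONS) request additionally gets the Allow-* and Max-Age fields.
Headers cors_headers(const std::string& method) {
    Headers h{{"Access-Control-Allow-Origin", "*"}};
    if (method == "OPTIONS") {
        h["Access-Control-Allow-Methods"] = "POST, GET, OPTIONS"; // assumed value
        h["Access-Control-Allow-Headers"] = "Content-Type";       // assumed value
        h["Access-Control-Max-Age"] = "86400";                    // assumed value
    }
    return h;
}
```

The same function would be called on both success and error paths so that no response leaves the server without the origin header.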
- Add check to skip logging if disconnect is already in progress for a peer
- Avoid re-entrance in send_message calls during handle_disconnect coroutine
- Prevent excessive log entries when send queue is at max depth and peer disconnects

…ad fiber

- Close socket first to unblock pending I/O and avoid multi-second hangs
- Erase connection after closing to prevent dangling shared_ptr references
- Cancel read fiber only after socket is closed to ensure immediate exit
- Retain reentrancy guard to keep peer state valid during disconnect handling
- Adjust order of operations to fix deadlock when multiple peers disconnect simultaneously
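
The ordering described above might look roughly like this. The `Peer` struct and its `steps` log are hypothetical stand-ins that only record the sequence; the real socket, connection table, and fiber types are not shown in this PR.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a peer connection; records each step taken.
struct Peer {
    bool disconnecting = false;      // reentrancy guard, stays set once tripped
    std::vector<std::string> steps;  // order of operations, for inspection
};

// Returns false when a disconnect is already in progress for this peer,
// so re-entrant calls do no work (and would emit no further log lines).
bool handle_disconnect(Peer& p) {
    if (p.disconnecting) return false;
    p.disconnecting = true;
    p.steps.push_back("close_socket");      // 1. unblock pending I/O first
    p.steps.push_back("erase_connection");  // 2. drop the shared_ptr reference
    p.steps.push_back("cancel_read_fiber"); // 3. fiber exits at once, socket is closed
    return true;
}
```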
- Introduced Ƶ as the short symbol for VIZ chosen by the community
- Explained common practice of showing balances with 2 decimal places
- Noted that even staked funds (SHARES) are displayed as Ƶ with staking notes
- Clarified symbol usage in wallets, explorers, and applications

docs(webserver): document native CORS support in webserver plugin

- Detailed handling of browser cross-origin requests without reverse proxy
- Specified preflight (OPTIONS) response headers and values
- Confirmed all other responses include Access-Control-Allow-Origin: *
- Mentioned compatibility with production setups using nginx proxy
- Highlighted use cases for browser-based wallets and dApps calling JSON-RPC endpoints directly

After shared-memory corruption triggers attempt_auto_recovery(), the
function sets currently_syncing=true so the validator plugin defers
block production during the wipe / snapshot import / dlt_block_log
replay sequence.  Once the database is rebuilt and P2P is resumed,
the flag was expected to self-clear on the next applied block via
plugin_impl::accept_block(), which stores the caller-supplied
sync_mode flag whenever a block is successfully pushed.

That self-clearing path never runs on the DLT pipeline.  The DLT P2P
delegate (dlt_delegate::accept_block in plugins/p2p/p2p_plugin.cpp)
calls chain.db().push_block() directly and bypasses
plugin_impl::accept_block() entirely, so neither broadcast blocks nor
gap-fill replies ever update currently_syncing.  The only remaining
clearer is transition_to_forward(), but a node that was in FORWARD
mode at the moment of corruption stays in FORWARD throughout
pause/resume — transition_to_forward() is never invoked, so the flag
is permanently stuck at true.

The validator gate at plugins/validator/validator.cpp checks
chain().is_syncing() in DLT mode and returns not_synced, producing
the observed indefinite "Block production deferred: not_synced
(head=#X, catching_up=false)" loop where head keeps advancing via
P2P but no local block is produced.

Fix: explicitly clear currently_syncing immediately after
do_snapshot_load(data_dir, true) returns successfully in
attempt_auto_recovery().  Post-recovery catchup remains correctly
gated by _catchup_after_pause in the P2P layer, which the periodic
task clears once no peer is ahead of our head.
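
A minimal sketch of the fix, under stated assumptions: the `load_snapshot` callback stands in for do_snapshot_load(data_dir, true), and leaving the flag set on failure is an assumption of this sketch, not something the commit message states.

```cpp
#include <atomic>
#include <functional>

std::atomic<bool> currently_syncing{false};

// Simplified recovery sequence. Returns true when the snapshot load succeeded.
bool attempt_auto_recovery(const std::function<bool()>& load_snapshot) {
    currently_syncing.store(true);   // defer block production during the rebuild
    if (!load_snapshot()) {
        return false;                // assumption: flag stays set when the load fails
    }
    currently_syncing.store(false);  // the fix: clear explicitly, do not wait for accept_block()
    return true;
}
```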
The deferred-snapshot wake-up in on_applied_block previously used
head_block_time() >= pending_snapshot_safe_after_time, which fires on
the very block the local validator just produced.

The applied_block signal is dispatched synchronously from _push_block
inside db.generate_block(), and the validator only calls
p2p().broadcast_block() after generate_block() returns. So firing the
snapshot on the same block let the snapshot read-lock start before the
produced block had been broadcast to peers.

Change the condition to strictly greater than: the deferred snapshot
now waits until a SUBSEQUENT block is applied. That block is built by
another validator on top of ours, proving our block was produced,
applied locally, and propagated through the network. Only then does
the snapshot start reading state.

Cost is ~one block interval of additional delay, and only on slots
where the local validator was the deferral target. The non-producer
path is unchanged: snapshots still fire immediately at the originating
block when is_validator_producing_soon() is false.

Also expanded the surrounding comment block and updated the wake-up
log messages to reflect the new semantics.
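
The change in the wake-up condition can be illustrated with a small predicate. The `TimePoint` alias and the function name are hypothetical; the real code compares head_block_time() against pending_snapshot_safe_after_time.

```cpp
#include <cstdint>

// Seconds since epoch; stands in for the chain's own time type.
using TimePoint = std::uint32_t;

// True only once a block strictly later than the safe-after time is applied,
// so the block produced at safe_after_time itself never triggers the snapshot.
bool should_start_deferred_snapshot(TimePoint head_block_time,
                                    TimePoint safe_after_time) {
    return head_block_time > safe_after_time;  // previously >=
}
```

With the old `>=` comparison the first call in a sequence such as `(100, 100)` would already return true; with `>` the snapshot waits for the next applied block.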
Replace hardcoded b.validator with get_scheduled_validator(i + 2) so each
missed block line shows the validator scheduled for the slot immediately
after the miss, instead of repeating the current block producer for every
line.

…FORWARD oscillation

The static atomic recovery_in_progress flag in attempt_auto_recovery() was
never reset to false after successful recovery, making any subsequent
corruption event permanently unrecoverable ("already in progress, skipping
duplicate attempt").  Reset it after P2P resume so the node can recover from
future corruption events.

Add a consecutive recovery counter (max 3 within 5 minutes) to prevent
infinite recovery loops when the snapshot or block log is itself corrupted.

In request_gap_fill(), remove the SYNC transition and peer request loop from
the "no peer available" fallback path.  When no peer has a higher head,
transition_to_sync() followed by request_blocks_from_peer() immediately
detects all peers as "caught up" and calls transition_to_forward(), producing
rapid SYNC->FORWARD oscillation every 5 seconds.  Instead, just log and let
the periodic task retry when new peers connect.
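
One way to read the "max 3 within 5 minutes" limit is as a sliding window over recent attempts. The sketch below takes that interpretation; the `RecoveryLimiter` class is hypothetical and the PR's actual counter may be implemented differently.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Allows at most max_attempts recovery attempts inside any window of the given length.
class RecoveryLimiter {
public:
    using Clock = std::chrono::steady_clock;

    RecoveryLimiter(std::size_t max_attempts, Clock::duration window)
        : max_attempts_(max_attempts), window_(window) {}

    // Returns true and records the attempt if it is allowed at time `now`;
    // returns false (recording nothing) when the limit is already reached.
    bool try_begin(Clock::time_point now) {
        // Forget attempts that have fallen out of the window.
        while (!attempts_.empty() && now - attempts_.front() > window_) {
            attempts_.pop_front();
        }
        if (attempts_.size() >= max_attempts_) return false;
        attempts_.push_back(now);
        return true;
    }

private:
    std::size_t max_attempts_;
    Clock::duration window_;
    std::deque<Clock::time_point> attempts_;
};
```

Constructed as `RecoveryLimiter(3, std::chrono::minutes(5))`, a fourth attempt inside the window is refused, which breaks the loop when the snapshot or block log is itself corrupted.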
Node crashes silently between DLT block log open and "Done opening
block log" with no error output. Add step-by-step ilog() calls to
every major operation in the critical path so the exact failing
step is visible in the next crash log:

- block_log and dlt_block_log head after open
- Before/after undo_all() with revision values
- Revision mismatch detection with values
- Before reading head block from block_log
- fork_db seeding start in both normal and DLT modes
- Before/after init_hardforks() (second call)
- Before validator schedule integrity check

Also add db.open() success log in chain plugin_startup.

All reads and writes to the currently_syncing atomic flag used relaxed
ordering, which does not guarantee cross-thread visibility on non-x86
architectures.  The recovery thread writes currently_syncing=false after
rebuilding the database, and the validator production thread reads it to
decide whether to produce blocks.  Upgrade to release/acquire ordering to
ensure the store is visible to the reader on all platforms.

store  → memory_order_release (3 sites)
load   → memory_order_acquire (1 site)
exchange → memory_order_acq_rel (1 site)
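
The three orderings in the table above map onto std::atomic calls as in the sketch below. The helper function names are hypothetical. A release store paired with an acquire load also guarantees that writes made before the store are visible to a reader that observes the stored value.

```cpp
#include <atomic>

std::atomic<bool> currently_syncing{true};

// Recovery thread: publish "sync finished" after the database is rebuilt.
void mark_synced() {
    currently_syncing.store(false, std::memory_order_release);
}

// Validator thread: read the flag before deciding whether to produce a block.
bool is_syncing() {
    return currently_syncing.load(std::memory_order_acquire);
}

// Set the flag and return its previous value in one atomic step.
bool begin_sync() {
    return currently_syncing.exchange(true, std::memory_order_acq_rel);
}
```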
undo_all() in database::open() causes a silent SIGSEGV when shared
memory is corrupted after a hard crash. Since segfaults bypass all
C++ exception handlers, the node enters an infinite restart loop in
Docker without ever reaching the recovery path.

Introduce a marker file (state/undo_all_in_progress) that is created
before undo_all() and removed after it completes. If the process
crashes inside undo_all(), the marker survives and triggers
database_revision_exception on the next startup, which activates
the existing snapshot recovery path.

Marker cleanup is added to:
- database::open() — removed after successful undo_all()
- database::open_from_snapshot() — cleaned before snapshot import
- database::wipe() — cleaned during shared memory wipe
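
The marker-file idea can be sketched with std::filesystem (C++17). This is an illustration only: `guarded_undo_all` is a hypothetical wrapper, and std::runtime_error stands in for database_revision_exception.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <stdexcept>

namespace fs = std::filesystem;

// Runs undo_all guarded by a marker file in state_dir.
// Throws if a previous run left the marker behind, which signals that the
// process died inside undo_all() and the state should be rebuilt.
void guarded_undo_all(const fs::path& state_dir,
                      const std::function<void()>& undo_all) {
    const fs::path marker = state_dir / "undo_all_in_progress";
    if (fs::exists(marker)) {
        // Stand-in for database_revision_exception in the real code.
        throw std::runtime_error("previous undo_all() did not complete");
    }
    fs::create_directories(state_dir);
    { std::ofstream touch(marker); }  // create the empty marker file
    undo_all();                       // a crash (or throw) here leaves the marker on disk
    fs::remove(marker);               // reached only after undo_all() returns
}
```

Because a segfault cannot be caught, the check has to happen on the next startup, which is why the marker lives on disk rather than in memory.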