fix(snapshot): detect and recover validator vote snapshot inconsisten…#112
Open
On1x wants to merge 16 commits into
…cies
- Add sanity check during export to warn if validators exist but validator votes are absent
- Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
- Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
- Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts

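The export check and import fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Snapshot` map, the `"validator"` section key, and both function names are assumptions made for the example; only the `validator_vote` and `witness_vote` keys come from the commit message.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed snapshot: section key -> serialized rows.
using Snapshot = std::map<std::string, std::vector<std::string>>;

// Export-side sanity check: validators exist but no validator votes were written.
// The "validator" key is an assumption made for this sketch.
bool votes_look_missing(const Snapshot& snap) {
    auto validators = snap.find("validator");
    auto votes = snap.find("validator_vote");
    bool has_validators = validators != snap.end() && !validators->second.empty();
    bool has_votes = votes != snap.end() && !votes->second.empty();
    return has_validators && !has_votes;
}

// Import-side fallback: prefer validator_vote, fall back to the legacy
// witness_vote key when validator_vote is absent or empty.
const std::vector<std::string>& select_vote_rows(const Snapshot& snap) {
    static const std::vector<std::string> empty;
    auto current = snap.find("validator_vote");
    if (current != snap.end() && !current->second.empty()) {
        return current->second;
    }
    auto legacy = snap.find("witness_vote");
    if (legacy != snap.end()) {
        return legacy->second;
    }
    return empty;
}
```

An exporter would log a warning when `votes_look_missing` returns true, and an importer would feed the rows returned by `select_vote_rows` into its normal vote-loading path.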
…options
- Deleted all mentions of `LOW_MEMORY_NODE` from build scripts, environment variables, and documentation
- Removed low-memory node build instructions and flags from Linux, macOS, and Windows build guides
- Updated CMake options and environment variables to exclude low-memory settings
- Simplified Docker image CMake flags by removing `LOW_MEMORY_NODE`
- Cleared low-memory related config references in node setup and getting started guides
- Cleaned up example config files by removing deprecated plugins and options related to low-memory builds

- Delete config_debug_mongo.ini to clean up obsolete debug mongo configuration
- Remove config_mongo.ini to eliminate outdated mongo production configuration
- Simplify project configuration by removing unused or legacy mongo ini files

- Changed info-level logs (ilog) to debug-level logs (dlog) when connecting to peers and sending DLT hello messages
- Updated rate-limit notification from ilog to dlog for peer exchange requests
- Ensured logging reflects appropriate verbosity level for peer communication events

- Handle CORS preflight by responding to OPTIONS method with proper headers
- Append Access-Control-Allow-Origin header to all HTTP responses
- Add Access-Control-Allow-Methods, Allow-Headers, and Max-Age headers for OPTIONS responses
- Ensure CORS headers are included on error and success responses
- Prevent CORS issues for cross-origin API calls through the webserver plugin

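The header handling listed above could look roughly like the sketch below. It is an illustration under stated assumptions: `HttpResponse` and `apply_cors` are hypothetical names, and the 204 status and the values chosen for the Allow-Methods, Allow-Headers, and Max-Age headers are placeholders rather than the values the webserver plugin actually sends.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical minimal response type; the real plugin uses its own HTTP types.
struct HttpResponse {
    int status = 200;
    std::map<std::string, std::string> headers;
    std::string body;
};

// Adds CORS headers to any response and answers OPTIONS preflights directly.
HttpResponse apply_cors(const std::string& method, HttpResponse resp) {
    // Attached to every response, whether it is a success or an error.
    resp.headers["Access-Control-Allow-Origin"] = "*";
    if (method == "OPTIONS") {
        // Status code and header values below are illustrative assumptions.
        resp.status = 204;
        resp.headers["Access-Control-Allow-Methods"] = "GET, POST, OPTIONS";
        resp.headers["Access-Control-Allow-Headers"] = "Content-Type";
        resp.headers["Access-Control-Max-Age"] = "86400";
        resp.body.clear();  // a preflight reply carries no payload
    }
    return resp;
}
```

Routing every outgoing response through one helper like this is one way to make sure error paths do not miss the Allow-Origin header.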
- Add check to skip logging if disconnect is already in progress for a peer
- Avoid re-entrance in send_message calls during handle_disconnect coroutine
- Prevent excessive log entries when send queue is at max depth and peer disconnects

…ad fiber
- Close socket first to unblock pending I/O and avoid multi-second hangs
- Erase connection after closing to prevent dangling shared_ptr references
- Cancel read fiber only after socket is closed to ensure immediate exit
- Retain reentrancy guard to keep peer state valid during disconnect handling
- Adjust order of operations to fix deadlock when multiple peers disconnect simultaneously

- Introduced Ƶ as the short symbol for VIZ chosen by the community
- Explained common practice of showing balances with 2 decimal places
- Noted that even staked funds (SHARES) are displayed as Ƶ with staking notes
- Clarified symbol usage in wallets, explorers, and applications

docs(webserver): document native CORS support in webserver plugin
- Detailed handling of browser cross-origin requests without reverse proxy
- Specified preflight (OPTIONS) response headers and values
- Confirmed all other responses include Access-Control-Allow-Origin: *
- Mentioned compatibility with production setups using nginx proxy
- Highlighted use cases for browser-based wallets and dApps calling JSON-RPC endpoints directly

After shared-memory corruption triggers attempt_auto_recovery(), the function sets currently_syncing=true so the validator plugin defers block production during the wipe / snapshot import / dlt_block_log replay sequence. Once the database is rebuilt and P2P is resumed, the flag was expected to self-clear on the next applied block via plugin_impl::accept_block(), which stores the caller-supplied sync_mode flag whenever a block is successfully pushed.

That self-clearing path never runs on the DLT pipeline. The DLT P2P delegate (dlt_delegate::accept_block in plugins/p2p/p2p_plugin.cpp) calls chain.db().push_block() directly and bypasses plugin_impl::accept_block() entirely, so neither broadcast blocks nor gap-fill replies ever update currently_syncing. The only remaining clearer is transition_to_forward(), but a node that was in FORWARD mode at the moment of corruption stays in FORWARD throughout pause/resume, so transition_to_forward() is never invoked and the flag is permanently stuck at true.

The validator gate at plugins/validator/validator.cpp checks chain().is_syncing() in DLT mode and returns not_synced, producing the observed indefinite "Block production deferred: not_synced (head=#X, catching_up=false)" loop where head keeps advancing via P2P but no local block is produced.

Fix: explicitly clear currently_syncing immediately after do_snapshot_load(data_dir, true) returns successfully in attempt_auto_recovery(). Post-recovery catchup remains correctly gated by _catchup_after_pause in the P2P layer, which the periodic task clears once no peer is ahead of our head.

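The shape of the fix can be sketched as below. This is a simplified model, not the code from the PR: the snapshot load is passed in as a callback, `may_produce_block` is a hypothetical stand-in for the validator gate, and leaving the flag set when the load fails is an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Simplified model of the chain plugin's flag.
std::atomic<bool> currently_syncing{false};

// Sketch of the recovery path: the flag is raised for the duration of the
// rebuild and cleared explicitly once the snapshot load succeeds, instead of
// relying on a later accept_block() call that the DLT pipeline never makes.
bool attempt_auto_recovery(const std::function<bool()>& do_snapshot_load) {
    currently_syncing.store(true, std::memory_order_release);
    if (!do_snapshot_load()) {
        return false;  // assumption: on failure the flag stays set
    }
    currently_syncing.store(false, std::memory_order_release);
    return true;
}

// Hypothetical stand-in for the validator gate that checks is_syncing().
bool may_produce_block() {
    return !currently_syncing.load(std::memory_order_acquire);
}
```

With the explicit clear in place, a node that stays in FORWARD mode throughout recovery no longer depends on transition_to_forward() to unblock production.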
The deferred-snapshot wake-up in on_applied_block previously used head_block_time() >= pending_snapshot_safe_after_time, which fires on the very block the local validator just produced. The applied_block signal is dispatched synchronously from _push_block inside db.generate_block(), and the validator only calls p2p().broadcast_block() after generate_block() returns. So firing the snapshot on the same block let the snapshot read-lock start before the produced block had been broadcast to peers.

Change the condition to strictly greater than: the deferred snapshot now waits until a SUBSEQUENT block is applied. That block is built by another validator on top of ours, proving our block was produced, applied locally, and propagated through the network. Only then does the snapshot start reading state.

Cost is ~one block interval of additional delay, and only on slots where the local validator was the deferral target. The non-producer path is unchanged: snapshots still fire immediately at the originating block when is_validator_producing_soon() is false.

Also expanded the surrounding comment block and updated the wake-up log messages to reflect the new semantics.

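The difference between the old and new wake-up conditions comes down to a single comparison operator. The sketch below isolates it; the function names and the use of plain integer timestamps are assumptions for illustration, and it models only the case described above where the safe-after time equals the timestamp of the locally produced block.

```cpp
#include <cassert>
#include <cstdint>

// Old condition: also true on the block whose timestamp equals the
// safe-after time, i.e. the block the local validator just produced.
bool snapshot_ready_old(std::uint32_t head_block_time,
                        std::uint32_t pending_snapshot_safe_after_time) {
    return head_block_time >= pending_snapshot_safe_after_time;
}

// New condition: only true once a later block has been applied.
bool snapshot_ready_new(std::uint32_t head_block_time,
                        std::uint32_t pending_snapshot_safe_after_time) {
    return head_block_time > pending_snapshot_safe_after_time;
}
```

For example, with a safe-after time of 100 the old check fires when the head block time is 100, while the new check stays false until a block with a later timestamp arrives.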
Replace hardcoded b.validator with get_scheduled_validator(i + 2) so each missed block line shows the validator scheduled for the slot immediately after the miss, instead of repeating the current block producer for every line.
…FORWARD oscillation
The static atomic recovery_in_progress flag in attempt_auto_recovery() was
never reset to false after successful recovery, making any subsequent
corruption event permanently unrecoverable ("already in progress, skipping
duplicate attempt"). Reset it after P2P resume so the node can recover from
future corruption events.
Add a consecutive recovery counter (max 3 within 5 minutes) to prevent
infinite recovery loops when the snapshot or block log is itself corrupted.
In request_gap_fill(), remove the SYNC transition and peer request loop from
the "no peer available" fallback path. When no peer has a higher head,
transition_to_sync() followed by request_blocks_from_peer() immediately
detects all peers as "caught up" and calls transition_to_forward(), producing
rapid SYNC->FORWARD oscillation every 5 seconds. Instead, just log and let
the periodic task retry when new peers connect.
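The flag reset and the recovery limit described in this commit message can be modelled with a small guard object. This is a sketch only: `RecoveryGuard` is a hypothetical class, and counting attempts in a sliding five-minute window is one possible reading of "max 3 within 5 minutes", not necessarily how the PR implements its counter.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical guard combining the in-progress flag with an attempt limit.
class RecoveryGuard {
public:
    using Clock = std::chrono::steady_clock;

    // Returns true if a recovery may start at time `now`.
    bool try_begin(Clock::time_point now) {
        // Reject a duplicate attempt while one is already running.
        if (in_progress_.exchange(true, std::memory_order_acq_rel)) {
            return false;
        }
        const std::chrono::minutes window(5);
        const std::size_t max_attempts = 3;
        // Forget attempts that fell out of the five-minute window.
        while (!recent_.empty() && now - recent_.front() > window) {
            recent_.pop_front();
        }
        if (recent_.size() >= max_attempts) {
            in_progress_.store(false, std::memory_order_release);
            return false;  // too many recent recoveries: stop looping
        }
        recent_.push_back(now);
        return true;
    }

    // Must be called when recovery ends so that later corruption events
    // can trigger a new attempt.
    void finish() { in_progress_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> in_progress_{false};
    std::deque<Clock::time_point> recent_;  // not thread-safe; guarded by in_progress_
};
```

The deque is only touched by the caller that won the `exchange`, which is why the sketch does not add a separate mutex.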
Node crashes silently between DLT block log open and "Done opening block log" with no error output. Add step-by-step ilog() calls to every major operation in the critical path so the exact failing step is visible in the next crash log:
- block_log and dlt_block_log head after open
- Before/after undo_all() with revision values
- Revision mismatch detection with values
- Before reading head block from block_log
- fork_db seeding start in both normal and DLT modes
- Before/after init_hardforks() (second call)
- Before validator schedule integrity check

Also add db.open() success log in chain plugin_startup.

All reads and writes to the currently_syncing atomic flag used relaxed ordering, which does not order the flag update relative to surrounding memory operations, so on weakly ordered (non-x86) architectures a reader is not guaranteed to observe the state written before the flag changed. The recovery thread writes currently_syncing=false after rebuilding the database, and the validator production thread reads it to decide whether to produce blocks. Upgrade to release/acquire ordering so that a reader that observes the store also observes the writes that preceded it, on all platforms.
- store → memory_order_release (3 sites)
- load → memory_order_acquire (1 site)
- exchange → memory_order_acq_rel (1 site)

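The ordering upgrade maps onto the three kinds of access as in the sketch below. `SyncFlag` and its method names are hypothetical wrappers introduced for illustration; the PR changes the call sites directly.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper showing the ordering used for each kind of access.
struct SyncFlag {
    std::atomic<bool> currently_syncing{false};

    // Writer side: release makes earlier writes (for example the rebuilt
    // database state) visible to a thread that later acquires the flag.
    void set(bool value) {
        currently_syncing.store(value, std::memory_order_release);
    }

    // Reader side: acquire pairs with the release store above.
    bool is_syncing() const {
        return currently_syncing.load(std::memory_order_acquire);
    }

    // Read-modify-write: acq_rel covers both directions.
    bool swap(bool value) {
        return currently_syncing.exchange(value, std::memory_order_acq_rel);
    }
};
```

A production thread that sees `is_syncing()` return false is then guaranteed to also see everything the recovery thread wrote before calling `set(false)`.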
undo_all() in database::open() causes a silent SIGSEGV when shared memory is corrupted after a hard crash. Since segfaults bypass all C++ exception handlers, the node enters an infinite restart loop in Docker without ever reaching the recovery path.

Introduce a marker file (state/undo_all_in_progress) that is created before undo_all() and removed after it completes. If the process crashes inside undo_all(), the marker survives and triggers database_revision_exception on the next startup, which activates the existing snapshot recovery path.

Marker cleanup is added to:
- database::open(): removed after successful undo_all()
- database::open_from_snapshot(): cleaned before snapshot import
- database::wipe(): cleaned during shared memory wipe
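The marker-file pattern can be sketched as follows. This is a simplified illustration, not the PR's code: `guarded_undo_all` and `marker_exists` are hypothetical helpers, `std::runtime_error` stands in for database_revision_exception, and the sketch does not fsync the marker the way a crash-safe implementation might need to.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>

// True if the marker file is present on disk.
bool marker_exists(const std::string& path) {
    std::ifstream f(path);
    return f.good();
}

// Runs `undo_all` bracketed by a marker file. A marker left over from a
// previous run means that run died inside undo_all(), so we refuse to
// repeat it and signal the caller to take the recovery path instead.
void guarded_undo_all(const std::string& marker,
                      const std::function<void()>& undo_all) {
    if (marker_exists(marker)) {
        // Stand-in for database_revision_exception in this sketch.
        throw std::runtime_error("previous undo_all() did not complete");
    }
    { std::ofstream create(marker); }  // create the empty marker file
    undo_all();                        // a crash here leaves the marker behind
    std::remove(marker.c_str());       // reached only on normal completion
}
```

In this sketch the marker also survives if `undo_all` throws a C++ exception, which is a simplification; the snapshot import and wipe paths would delete the marker themselves before rebuilding state.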