debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474
Draft
wen-coding wants to merge 18 commits into
Draft
debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474wen-coding wants to merge 18 commits into
wen-coding wants to merge 18 commits into
Conversation
Bash twin of #3406's lib.js work. Submit cosmos txs with -b sync and poll the actual on-chain side effect rather than relying on -b block — under Autobahn the KV indexer that BroadcastTxCommit polls isn't populated, so -b block can hang to its 60s timeout. New shared helper integration_test/contracts/_tx_helpers.sh: - store_wasm — submit `wasm store` -b sync, return new code_id once max wasm code_id grows (mirrors lib.js storeWasm). - instantiate_wasm — submit `wasm instantiate` -b sync, return new contract address once it appears under code_id via list-contract- by-code set-diff (mirrors lib.js instantiateWasm). - bank_send_and_wait — submit `bank send` -b sync, wait for sender's account sequence to advance. Denom-agnostic causal commit signal used between consecutive sends from the same key (mirrors lib.js getAccountSequence-based wait). - _wait_until — generic check-loop with 30s default timeout, for one-off side-effect waits not covered by typed helpers. Migrated scripts: - deploy_wasm_contracts.sh: 30 × (store + instantiate). Helpers capture the new code_id / contract address inside the wait closure so a concurrent store/instantiate from another test can't substitute its value. - create_tokenfactory_denoms.sh: 30 × create-denom. The denom is deterministic (factory/<creator>/<subdenom>); the inline create_denom_and_wait polls denom-authority-metadata for non-empty admin (matches the lib.js createTokenFactoryTokenAndMint pattern). - deploy_timelocked_token_contract.sh: 7 × bank send (admin funding 7 admin/op/dest accounts sequentially) + 2 × store + 2 × instantiate. bank_send_and_wait between consecutive funds prevents "account sequence mismatch" rejections when -b sync returns before the prior tx commits. - deploy_dex_contract.sh: store + instantiate + dex register-contract + 4 × dex register-pairs. The dex txs have no convenient single-query side effect; inline dex_tx_and_wait waits for the sender's sequence to advance. Wait-timeout cases preserve the original error contract: scripts print to stderr and exit 1, same as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers proposal_submit.sh + proposal_vote.sh onto the deploy bash
script migration introduced in the prior commit. Both use the shared
_tx_helpers.sh; two new helpers added there:
- _get_max_proposal_id — highest existing gov proposal id (0 if none).
- find_proposal_by_title — scan (max_id_before, cur] for a proposal
whose title matches and echo its id. The diff-against-snapshot
pins it to *this* submission rather than a stale prior-run
proposal. Mirrors lib.js's findProposalByTitle (#3406).
proposal_submit.sh:
- Submit gov submit-proposal software-upgrade via -b sync, capture
txhash for error reporting, then poll gov state for a proposal
whose title matches our $VERSION arg. Same shape lib.js's
proposeParamChange uses.
proposal_vote.sh:
- Submit gov vote via -b sync and wait for the voter's account
sequence to advance — the upstream proposal-status poll in the
upgrade test yaml tolerates the brief window between this script's
return and the tally update, and the yaml caller discards this
script's stdout. CheckTx-rejection path still echoes the rejection
code so the rare bad-input case stays visible.
Together this unblocks Upgrade Module Major + Upgrade Module Minor CI
jobs under Autobahn (proposal_submit's -b block was the gating call
that prevented proposal_id capture, making PROPOSAL_ID empty and
breaking every downstream eval step).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the gov module has no existing proposals, `seid q gov proposals` exits non-zero with "no proposals found"; 2>/dev/null swallows the error and jq reads empty stdin, emitting nothing. Callers (e.g. find_proposal_by_title's loop) then hit "integer expression expected" on `[ "$cur" -gt ... ]` and the search never matches, so a freshly- booted cluster's first proposal_submit times out at 30s with no proposal_id returned. Fix: capture the seid output explicitly and fall back to 0 when it's empty or when jq fails. Mirrors the try/catch path in lib.js's maxProposalId. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under -b sync, seid surfaces "chain-id ()" in the signature-verify error if --chain-id isn't on the command line; -b block was reading it from the client config in some path. Setting it explicitly here matches what proposal_vote.sh already does and what the cosmos test framework expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies the bash-script migrations in this PR end-to-end under
Autobahn in CI's isolated per-job clusters (couldn't reproduce locally
because upgrade tests' binary swaps degrade a shared cluster).
- Autobahn Upgrade Module (Major) / (Minor) — exercises
proposal_submit.sh + proposal_vote.sh migrations from this PR
- Autobahn FlatKV Integration — exercises deploy_flatkv_evm_fixture
+ downstream historical-state and crash-recovery scripts
- Autobahn Disable WASM Tests — passed 16/16 locally; adding CI
coverage
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't panic at the scheduled upgrade height under Autobahn (Autobahn Upgrade Module Major/Minor jobs in #3469's CI failed: proposal_submit migration works, proposal passes, but LOG_AT_BLOCK_HEIGHT_NODE_X assertions fail because no node panics with "UPGRADE NEEDED"). Two log sites: - giga_router.go executeBlock: log every FinalizeBlock call with GlobalNumber, the height value being passed to ABCI, and tx_count. Confirms Autobahn block delivery is feeding ABCI correctly. - sei-cosmos/x/upgrade/abci.go BeginBlocker entry: log every invocation with ctx.BlockHeight, planFound, and if found the plan's target Height + Name + ShouldExecute verdict + hasHandler. Should expose whether the plan is committed (planFound=true) but not firing (ShouldExecute=false unexpectedly), or whether BeginBlocker isn't even called at the upgrade height. To be reverted once the root cause is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ock' into wen/debug_autobahn_upgrade_panic
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3474 +/- ##
==========================================
+ Coverage 59.07% 59.09% +0.01%
==========================================
Files 2186 2186
Lines 182269 182369 +100
==========================================
+ Hits 107671 107764 +93
- Misses 64947 64954 +7
Partials 9651 9651
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Two changes to make the debug logs actually reach us:
1. Append (>>) instead of truncate (>) when starting seid in the
cluster bringup + upgrade scripts. Previously every panic-restart
cycle wiped the prior process's log, so by the time we read the
log file all the BeginBlocker / FinalizeBlock log lines from the
pre-panic seid were gone — only the latest restart's startup
sequence remained.
Sites: docker/localnode/scripts/step5_start_sei.sh,
integration_test/upgrade_module/scripts/seid_upgrade.sh,
integration_test/upgrade_module/scripts/seid_downgrade.sh
2. Strip the CI matrix to just the two Autobahn Upgrade Module jobs
so we don't wait through 18+ other matrix rows that don't help
this investigation.
To be reverted once the upgrade-panic-under-Autobahn root cause is
identified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…grade_panic # Conflicts: # .github/workflows/integration-test.yml
Sample two heights 3s apart and divide by elapsed wall-clock to derive the real block interval, instead of assuming 300ms. Under Autobahn the chain runs at ~400ms; with the hardcoded 300ms estimate, target_height ended up ~33% closer to now than intended, the voting period elapsed past it, and ScheduleUpgrade rejected the plan with "upgrade cannot be scheduled in the past" — which is why planFound was never true in the Autobahn upgrade test logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under Autobahn the cluster reaches TARGET_HEIGHT and panics within ~16s of the proposal passing (block production runs at ~100ms/block during catchup, not the 400ms steady-state). That window isn't long enough for the early-upgraded node to finish its restart bootstrap (~14s) and execute its first block, so its BeginBlocker never gets a chance to fire the "BINARY UPDATED BEFORE TRIGGER" panic and the test's verify_panic step times out. 60s buys ~30s of cluster lifetime past vote-pass, comfortably more than node-0's restart bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add AUTOBAHN-DEBUG logs at each Autobahn subsystem boundary so we can tell, on a stuck post-restart cluster, which loop never advanced: - giga_router.runExecute: start, App.Info result, restored-appHash push, per-iteration "awaiting GlobalBlock" / "received GlobalBlock" / "executeBlock complete". If the block-fetch line repeats without a matching received line, the data layer isn't delivering the next block (consensus didn't agree). - producer.Run: start, per-iteration "awaiting capacity" / "building payload" / "calling ProduceBlock". Identifies whether producer is blocked on capacity (consensus stalled) or actually emitting blocks. - consensus.runPropose: per-view "view tick" with am_leader, plus a "reproposing" / "new proposal stored" entry on the leader path. Distinguishes "no proposer for this view" from "proposer can't build a proposal". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So a single CI run on this branch produces 4 jobs (CometBFT major, CometBFT minor, Autobahn major, Autobahn minor) with the same debug logging applied. Diffing the CometBFT vs Autobahn log timelines after the coordinated panic-and-restart should make it obvious which Autobahn subsystem stalls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHash is the deadlock site on coordinated-restart: it waits on n < nextBlockToPersist, and on restart the data WAL recovers at the same height the app committed, so nextBlockToPersist <= n and the wait never unblocks. Log the (n, nextAppProposal, nextBlockToPersist, nextBlock, nextQC, first) tuple on entry and the wait result. With this, one CI run will show the exact cursor values on the failing call, confirming or refuting the equality-stall theory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHash's WaitUntil(n < nextBlockToPersist) protects the steady-state
invariant "don't broadcast an appHash for a block whose bytes aren't
durable yet". On runExecute's restart path this is over-conservative:
the app reports n as committed, so its hash is durable in the app's
persisted state regardless of the data WAL's catchup status. On
coordinated panic-recovery the data-persist goroutine was killed
mid-batch and the WAL lags the app's commit; no new block arrives to
satisfy the wait because consensus hasn't restarted on this node, so
the cluster deadlocks.
Add PushAppHashOnRecovery: same intent as PushAppHash on restart
(advance the app-proposal cursor to n+1 so subsequent steady-state
calls don't have to fast-forward through pruned heights), but:
- no WaitUntil — durability is asserted by the app's commit
- no per-block metrics or appProposals population — the recovered
cursor jump may span heights whose blocks/QCs aren't in memory
- advances all cursors to >= n+1 to preserve
first <= nextAppProposal <= nextBlockToPersist <= nextBlock <= nextQC
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
To verify whether the recovery deadlock is really caused by the data
WAL being behind app.LastBlockHeight, instrument three new sites:
- NewState entry: which heights the Blocks and CommitQCs WALs offer
(LoadedFirst, persister Next cursor), plus the committee's first.
- NewState completion: final cursors and how many QCs / blocks ended
up in memory after the restore loop.
- runPersist batch persisted: how many QCs/blocks each batch wrote
and where the persistedQC/persistedBlock cursors landed.
With these plus the PushAppHash entry log, a single CI run shows:
(a) what was actually on disk at restart (NewState load)
(b) how runPersist's in-memory advance rate compares to the app's
forward progress before the panic
(c) the exact cursor gap PushAppHash sees on restart
If (a) shows the Blocks WAL only reached 802 while app.LastBlockHeight
is 803, the OS-page-cache-vs-fsync hypothesis is confirmed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHashOnRecovery advanced data-layer cursors past blocks that
weren't actually loaded into inner.qcs / inner.blocks. Logs from
26194247294 (Minor) show the data WAL is *empty* on every restart
(loaded_qc_count=0, loaded_block_count=0) even after a graceful
shutdown — runPersist's writes don't survive process exit. With
cursors advanced, runPersist's next batch nil-panics dereferencing
inner.qcs[n].QC() for n that was never inserted.
The real story the logs revealed:
- Minor passes pre-fix-C not because the data WAL persists, but
because PushAppHash's WaitUntil is satisfied via peer-fed PushQC
after restart (peers nodes 1/2/3 still have data in memory and
deliver it). In the previous Minor run, PushAppHash(411) waited
17 seconds for peers to catch node-0 up.
- Major hangs because all 4 nodes restart simultaneously: nobody
has the missing data in memory or on disk, so the wait can never
be satisfied.
The deeper bug is data-WAL durability, not PushAppHash logic. Keep
the diagnostic logs (NewState load summary, runPersist batch progress,
PushAppHash entry) — they're the evidence — but revert the
behavioral change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The data layer's block and commitqc WALs were constructed with utils.None[string](), making them no-ops: writes returned success but nothing landed on disk, and NewState loaded an empty WAL on every restart. Single-node restarts survived only by peer-fed PushQC; a coordinated cluster restart had no recovery path because every node had simultaneously lost its in-memory data. The autobahn config file already has persistent_state_dir, and it was already wired into the consensus layer. Thread the same value into the data layer: - gen_autobahn_config: add --persistent-state-dir flag, emit the value into AutobahnFileConfig.PersistentStateDir - step4_config_override.sh: pass --persistent-state-dir $HOME/.sei/data/autobahn when generating the docker cluster's autobahn.json - p2p.GigaRouterConfig: add PersistentStateDir field - node/setup.go: forward fc.PersistentStateDir into the new field - p2p/giga_router.go: pass cfg.PersistentStateDir to data.NewDataWAL instead of utils.None Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't fire the panic under Autobahn. Two log sites:
giga_router.goexecuteBlock — logs every FinalizeBlock call with GlobalNumber, height, tx_countsei-cosmos/x/upgrade/abci.goBeginBlocker entry — logs ctx.BlockHeight, planFound, plan.Height, plan.Name, shouldExecute, hasHandlerMerged in #3469's upgrade script migration so proposal_submit + proposal_vote actually work under Autobahn (otherwise the proposal would never get submitted).
CI's "Autobahn Upgrade Module (Major)/(Minor)" jobs should produce logs we can grep for
UPGRADE-DEBUGandAUTOBAHN-DEBUGto identify which of these is true:planFound=falseat upgrade height → gov proposal handler didn't persist the upgrade planplanFound=true, shouldExecute=false→ height-flow bugplanFound=true, shouldExecute=true, hasHandler=true→ applyUpgrade ran, panic was correctly skippedTo be reverted after diagnosis.