debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE) by wen-coding · Pull Request #3474 · sei-protocol/sei-chain

wen-coding · 2026-05-20T17:55:15Z

Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't fire the panic under Autobahn. Two log sites:

giga_router.go executeBlock — logs every FinalizeBlock call with GlobalNumber, height, tx_count
sei-cosmos/x/upgrade/abci.go BeginBlocker entry — logs ctx.BlockHeight, planFound, plan.Height, plan.Name, shouldExecute, hasHandler

Merged in #3469's upgrade script migration so proposal_submit + proposal_vote actually work under Autobahn (otherwise the proposal would never get submitted).

CI's "Autobahn Upgrade Module (Major)/(Minor)" jobs should produce logs we can grep for UPGRADE-DEBUG and AUTOBAHN-DEBUG to identify which of these is true:

planFound=false at upgrade height → gov proposal handler didn't persist the upgrade plan
planFound=true, shouldExecute=false → height-flow bug
planFound=true, shouldExecute=true, hasHandler=true → applyUpgrade ran, panic was correctly skipped

To be reverted after diagnosis.

Bash twin of #3406's lib.js work. Submit cosmos txs with -b sync and poll the actual on-chain side effect rather than relying on -b block — under Autobahn the KV indexer that BroadcastTxCommit polls isn't populated, so -b block can hang to its 60s timeout. New shared helper integration_test/contracts/_tx_helpers.sh: - store_wasm — submit `wasm store` -b sync, return new code_id once max wasm code_id grows (mirrors lib.js storeWasm). - instantiate_wasm — submit `wasm instantiate` -b sync, return new contract address once it appears under code_id via list-contract- by-code set-diff (mirrors lib.js instantiateWasm). - bank_send_and_wait — submit `bank send` -b sync, wait for sender's account sequence to advance. Denom-agnostic causal commit signal used between consecutive sends from the same key (mirrors lib.js getAccountSequence-based wait). - _wait_until — generic check-loop with 30s default timeout, for one-off side-effect waits not covered by typed helpers. Migrated scripts: - deploy_wasm_contracts.sh: 30 × (store + instantiate). Helpers capture the new code_id / contract address inside the wait closure so a concurrent store/instantiate from another test can't substitute its value. - create_tokenfactory_denoms.sh: 30 × create-denom. The denom is deterministic (factory/<creator>/<subdenom>); the inline create_denom_and_wait polls denom-authority-metadata for non-empty admin (matches the lib.js createTokenFactoryTokenAndMint pattern). - deploy_timelocked_token_contract.sh: 7 × bank send (admin funding 7 admin/op/dest accounts sequentially) + 2 × store + 2 × instantiate. bank_send_and_wait between consecutive funds prevents "account sequence mismatch" rejections when -b sync returns before the prior tx commits. - deploy_dex_contract.sh: store + instantiate + dex register-contract + 4 × dex register-pairs. The dex txs have no convenient single-query side effect; inline dex_tx_and_wait waits for the sender's sequence to advance. Wait-timeout cases preserve the original error contract: scripts print to stderr and exit 1, same as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Layers proposal_submit.sh + proposal_vote.sh onto the deploy bash script migration introduced in the prior commit. Both use the shared _tx_helpers.sh; two new helpers added there: - _get_max_proposal_id — highest existing gov proposal id (0 if none). - find_proposal_by_title — scan (max_id_before, cur] for a proposal whose title matches and echo its id. The diff-against-snapshot pins it to *this* submission rather than a stale prior-run proposal. Mirrors lib.js's findProposalByTitle (#3406). proposal_submit.sh: - Submit gov submit-proposal software-upgrade via -b sync, capture txhash for error reporting, then poll gov state for a proposal whose title matches our $VERSION arg. Same shape lib.js's proposeParamChange uses. proposal_vote.sh: - Submit gov vote via -b sync and wait for the voter's account sequence to advance — the upstream proposal-status poll in the upgrade test yaml tolerates the brief window between this script's return and the tally update, and the yaml caller discards this script's stdout. CheckTx-rejection path still echoes the rejection code so the rare bad-input case stays visible. Together this unblocks Upgrade Module Major + Upgrade Module Minor CI jobs under Autobahn (proposal_submit's -b block was the gating call that prevented proposal_id capture, making PROPOSAL_ID empty and breaking every downstream eval step). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the gov module has no existing proposals, `seid q gov proposals` exits non-zero with "no proposals found"; 2>/dev/null swallows the error and jq reads empty stdin, emitting nothing. Callers (e.g. find_proposal_by_title's loop) then hit "integer expression expected" on `[ "$cur" -gt ... ]` and the search never matches, so a freshly- booted cluster's first proposal_submit times out at 30s with no proposal_id returned. Fix: capture the seid output explicitly and fall back to 0 when it's empty or when jq fails. Mirrors the try/catch path in lib.js's maxProposalId. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under -b sync, seid surfaces "chain-id ()" in the signature-verify error if --chain-id isn't on the command line; -b block was reading it from the client config in some path. Setting it explicitly here matches what proposal_vote.sh already does and what the cosmos test framework expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verifies the bash-script migrations in this PR end-to-end under Autobahn in CI's isolated per-job clusters (couldn't reproduce locally because upgrade tests' binary swaps degrade a shared cluster). - Autobahn Upgrade Module (Major) / (Minor) — exercises proposal_submit.sh + proposal_vote.sh migrations from this PR - Autobahn FlatKV Integration — exercises deploy_flatkv_evm_fixture + downstream historical-state and crash-recovery scripts - Autobahn Disable WASM Tests — passed 16/16 locally; adding CI coverage Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't panic at the scheduled upgrade height under Autobahn (Autobahn Upgrade Module Major/Minor jobs in #3469's CI failed: proposal_submit migration works, proposal passes, but LOG_AT_BLOCK_HEIGHT_NODE_X assertions fail because no node panics with "UPGRADE NEEDED"). Two log sites: - giga_router.go executeBlock: log every FinalizeBlock call with GlobalNumber, the height value being passed to ABCI, and tx_count. Confirms Autobahn block delivery is feeding ABCI correctly. - sei-cosmos/x/upgrade/abci.go BeginBlocker entry: log every invocation with ctx.BlockHeight, planFound, and if found the plan's target Height + Name + ShouldExecute verdict + hasHandler. Should expose whether the plan is committed (planFound=true) but not firing (ShouldExecute=false unexpectedly), or whether BeginBlocker isn't even called at the upgrade height. To be reverted once the root cause is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ock' into wen/debug_autobahn_upgrade_panic

github-actions · 2026-05-20T17:57:02Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	May 20, 2026, 11:37 PM

codecov · 2026-05-20T17:59:18Z

Codecov Report

❌ Patch coverage is 94.23077% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.09%. Comparing base (63067ee) to head (7889ff7).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...int/cmd/tendermint/commands/gen_autobahn_config.go	0.00%	4 Missing ⚠️
sei-tendermint/internal/p2p/giga_router.go	91.30%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3474      +/-   ##
==========================================
+ Coverage   59.07%   59.09%   +0.01%     
==========================================
  Files        2186     2186              
  Lines      182269   182369     +100     
==========================================
+ Hits       107671   107764      +93     
- Misses      64947    64954       +7     
  Partials     9651     9651

Flag	Coverage Δ
sei-chain-pr	`68.24% <94.23%> (?)`
sei-db	`70.41% <ø> (-0.22%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
sei-cosmos/x/upgrade/abci.go	`82.43% <100.00%> (+4.09%)`	⬆️
...ei-tendermint/internal/autobahn/consensus/state.go	`89.20% <100.00%> (+0.71%)`	⬆️
sei-tendermint/internal/autobahn/data/state.go	`80.82% <100.00%> (+1.83%)`	⬆️
sei-tendermint/internal/autobahn/producer/state.go	`84.05% <100.00%> (+3.35%)`	⬆️
sei-tendermint/node/setup.go	`69.77% <100.00%> (+0.09%)`	⬆️
sei-tendermint/internal/p2p/giga_router.go	`71.80% <91.30%> (+2.05%)`	⬆️
...int/cmd/tendermint/commands/gen_autobahn_config.go	`15.38% <0.00%> (-1.01%)`	⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Two changes to make the debug logs actually reach us: 1. Append (>>) instead of truncate (>) when starting seid in the cluster bringup + upgrade scripts. Previously every panic-restart cycle wiped the prior process's log, so by the time we read the log file all the BeginBlocker / FinalizeBlock log lines from the pre-panic seid were gone — only the latest restart's startup sequence remained. Sites: docker/localnode/scripts/step5_start_sei.sh, integration_test/upgrade_module/scripts/seid_upgrade.sh, integration_test/upgrade_module/scripts/seid_downgrade.sh 2. Strip the CI matrix to just the two Autobahn Upgrade Module jobs so we don't wait through 18+ other matrix rows that don't help this investigation. To be reverted once the upgrade-panic-under-Autobahn root cause is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…grade_panic # Conflicts: # .github/workflows/integration-test.yml

Sample two heights 3s apart and divide by elapsed wall-clock to derive the real block interval, instead of assuming 300ms. Under Autobahn the chain runs at ~400ms; with the hardcoded 300ms estimate, target_height ended up ~33% closer to now than intended, the voting period elapsed past it, and ScheduleUpgrade rejected the plan with "upgrade cannot be scheduled in the past" — which is why planFound was never true in the Autobahn upgrade test logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under Autobahn the cluster reaches TARGET_HEIGHT and panics within ~16s of the proposal passing (block production runs at ~100ms/block during catchup, not the 400ms steady-state). That window isn't long enough for the early-upgraded node to finish its restart bootstrap (~14s) and execute its first block, so its BeginBlocker never gets a chance to fire the "BINARY UPDATED BEFORE TRIGGER" panic and the test's verify_panic step times out. 60s buys ~30s of cluster lifetime past vote-pass, comfortably more than node-0's restart bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add AUTOBAHN-DEBUG logs at each Autobahn subsystem boundary so we can tell, on a stuck post-restart cluster, which loop never advanced: - giga_router.runExecute: start, App.Info result, restored-appHash push, per-iteration "awaiting GlobalBlock" / "received GlobalBlock" / "executeBlock complete". If the block-fetch line repeats without a matching received line, the data layer isn't delivering the next block (consensus didn't agree). - producer.Run: start, per-iteration "awaiting capacity" / "building payload" / "calling ProduceBlock". Identifies whether producer is blocked on capacity (consensus stalled) or actually emitting blocks. - consensus.runPropose: per-view "view tick" with am_leader, plus a "reproposing" / "new proposal stored" entry on the leader path. Distinguishes "no proposer for this view" from "proposer can't build a proposal". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

So a single CI run on this branch produces 4 jobs (CometBFT major, CometBFT minor, Autobahn major, Autobahn minor) with the same debug logging applied. Diffing the CometBFT vs Autobahn log timelines after the coordinated panic-and-restart should make it obvious which Autobahn subsystem stalls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PushAppHash is the deadlock site on coordinated-restart: it waits on n < nextBlockToPersist, and on restart the data WAL recovers at the same height the app committed, so nextBlockToPersist <= n and the wait never unblocks. Log the (n, nextAppProposal, nextBlockToPersist, nextBlock, nextQC, first) tuple on entry and the wait result. With this, one CI run will show the exact cursor values on the failing call, confirming or refuting the equality-stall theory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PushAppHash's WaitUntil(n < nextBlockToPersist) protects the steady-state invariant "don't broadcast an appHash for a block whose bytes aren't durable yet". On runExecute's restart path this is over-conservative: the app reports n as committed, so its hash is durable in the app's persisted state regardless of the data WAL's catchup status. On coordinated panic-recovery the data-persist goroutine was killed mid-batch and the WAL lags the app's commit; no new block arrives to satisfy the wait because consensus hasn't restarted on this node, so the cluster deadlocks. Add PushAppHashOnRecovery: same intent as PushAppHash on restart (advance the app-proposal cursor to n+1 so subsequent steady-state calls don't have to fast-forward through pruned heights), but: - no WaitUntil — durability is asserted by the app's commit - no per-block metrics or appProposals population — the recovered cursor jump may span heights whose blocks/QCs aren't in memory - advances all cursors to >= n+1 to preserve first <= nextAppProposal <= nextBlockToPersist <= nextBlock <= nextQC Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

To verify whether the recovery deadlock is really caused by the data WAL being behind app.LastBlockHeight, instrument three new sites: - NewState entry: which heights the Blocks and CommitQCs WALs offer (LoadedFirst, persister Next cursor), plus the committee's first. - NewState completion: final cursors and how many QCs / blocks ended up in memory after the restore loop. - runPersist batch persisted: how many QCs/blocks each batch wrote and where the persistedQC/persistedBlock cursors landed. With these plus the PushAppHash entry log, a single CI run shows: (a) what was actually on disk at restart (NewState load) (b) how runPersist's in-memory advance rate compares to the app's forward progress before the panic (c) the exact cursor gap PushAppHash sees on restart If (a) shows the Blocks WAL only reached 802 while app.LastBlockHeight is 803, the OS-page-cache-vs-fsync hypothesis is confirmed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PushAppHashOnRecovery advanced data-layer cursors past blocks that weren't actually loaded into inner.qcs / inner.blocks. Logs from 26194247294 (Minor) show the data WAL is *empty* on every restart (loaded_qc_count=0, loaded_block_count=0) even after a graceful shutdown — runPersist's writes don't survive process exit. With cursors advanced, runPersist's next batch nil-panics dereferencing inner.qcs[n].QC() for n that was never inserted. The real story the logs revealed: - Minor passes pre-fix-C not because the data WAL persists, but because PushAppHash's WaitUntil is satisfied via peer-fed PushQC after restart (peers nodes 1/2/3 still have data in memory and deliver it). In the previous Minor run, PushAppHash(411) waited 17 seconds for peers to catch node-0 up. - Major hangs because all 4 nodes restart simultaneously: nobody has the missing data in memory or on disk, so the wait can never be satisfied. The deeper bug is data-WAL durability, not PushAppHash logic. Keep the diagnostic logs (NewState load summary, runPersist batch progress, PushAppHash entry) — they're the evidence — but revert the behavioral change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The data layer's block and commitqc WALs were constructed with utils.None[string](), making them no-ops: writes returned success but nothing landed on disk, and NewState loaded an empty WAL on every restart. Single-node restarts survived only by peer-fed PushQC; a coordinated cluster restart had no recovery path because every node had simultaneously lost its in-memory data. The autobahn config file already has persistent_state_dir, and it was already wired into the consensus layer. Thread the same value into the data layer: - gen_autobahn_config: add --persistent-state-dir flag, emit the value into AutobahnFileConfig.PersistentStateDir - step4_config_override.sh: pass --persistent-state-dir $HOME/.sei/data/autobahn when generating the docker cluster's autobahn.json - p2p.GigaRouterConfig: add PersistentStateDir field - node/setup.go: forward fc.PersistentStateDir into the new field - p2p/giga_router.go: pass cfg.PersistentStateDir to data.NewDataWAL instead of utils.None Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wen-coding and others added 7 commits May 19, 2026 21:12

Merge remote-tracking branch 'origin/wen/migrate_deploy_bash_off_b_bl…

3074739

…ock' into wen/debug_autobahn_upgrade_panic

wen-coding added the non-app-hash-breaking label May 20, 2026

wen-coding and others added 11 commits May 20, 2026 11:40

Merge remote-tracking branch 'origin/main' into wen/debug_autobahn_up…

8d6403e

…grade_panic # Conflicts: # .github/workflows/integration-test.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474

debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474
wen-coding wants to merge 18 commits into
mainfrom
wen/debug_autobahn_upgrade_panic

wen-coding commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wen-coding commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading