Skip to content

debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474

Draft
wen-coding wants to merge 18 commits into
mainfrom
wen/debug_autobahn_upgrade_panic
Draft

debug(autobahn): log FinalizeBlock + upgrade BeginBlocker (DO NOT MERGE)#3474
wen-coding wants to merge 18 commits into
mainfrom
wen/debug_autobahn_upgrade_panic

Conversation

@wen-coding
Copy link
Copy Markdown
Contributor

Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't fire the panic under Autobahn. Two log sites:

  • giga_router.go executeBlock — logs every FinalizeBlock call with GlobalNumber, height, tx_count
  • sei-cosmos/x/upgrade/abci.go BeginBlocker entry — logs ctx.BlockHeight, planFound, plan.Height, plan.Name, shouldExecute, hasHandler

Merged in #3469's upgrade script migration so proposal_submit + proposal_vote actually work under Autobahn (otherwise the proposal would never get submitted).

CI's "Autobahn Upgrade Module (Major)/(Minor)" jobs should produce logs we can grep for UPGRADE-DEBUG and AUTOBAHN-DEBUG to identify which of these is true:

  1. planFound=false at upgrade height → gov proposal handler didn't persist the upgrade plan
  2. planFound=true, shouldExecute=false → height-flow bug
  3. planFound=true, shouldExecute=true, hasHandler=true → applyUpgrade ran, panic was correctly skipped

To be reverted after diagnosis.

wen-coding and others added 7 commits May 19, 2026 21:12
Bash twin of #3406's lib.js work. Submit cosmos txs with -b sync and
poll the actual on-chain side effect rather than relying on -b block —
under Autobahn the KV indexer that BroadcastTxCommit polls isn't
populated, so -b block can hang to its 60s timeout.

New shared helper integration_test/contracts/_tx_helpers.sh:

  - store_wasm — submit `wasm store` -b sync, return new code_id once
    max wasm code_id grows (mirrors lib.js storeWasm).
  - instantiate_wasm — submit `wasm instantiate` -b sync, return new
    contract address once it appears under code_id via list-contract-
    by-code set-diff (mirrors lib.js instantiateWasm).
  - bank_send_and_wait — submit `bank send` -b sync, wait for sender's
    account sequence to advance. Denom-agnostic causal commit signal
    used between consecutive sends from the same key (mirrors lib.js
    getAccountSequence-based wait).
  - _wait_until — generic check-loop with 30s default timeout, for
    one-off side-effect waits not covered by typed helpers.

Migrated scripts:

  - deploy_wasm_contracts.sh: 30 × (store + instantiate). Helpers
    capture the new code_id / contract address inside the wait
    closure so a concurrent store/instantiate from another test can't
    substitute its value.
  - create_tokenfactory_denoms.sh: 30 × create-denom. The denom is
    deterministic (factory/<creator>/<subdenom>); the inline
    create_denom_and_wait polls denom-authority-metadata for non-empty
    admin (matches the lib.js createTokenFactoryTokenAndMint pattern).
  - deploy_timelocked_token_contract.sh: 7 × bank send (admin funding
    7 admin/op/dest accounts sequentially) + 2 × store + 2 ×
    instantiate. bank_send_and_wait between consecutive funds prevents
    "account sequence mismatch" rejections when -b sync returns before
    the prior tx commits.
  - deploy_dex_contract.sh: store + instantiate + dex register-contract
    + 4 × dex register-pairs. The dex txs have no convenient
    single-query side effect; inline dex_tx_and_wait waits for the
    sender's sequence to advance.

Wait-timeout cases preserve the original error contract: scripts
print to stderr and exit 1, same as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers proposal_submit.sh + proposal_vote.sh onto the deploy bash
script migration introduced in the prior commit. Both use the shared
_tx_helpers.sh; two new helpers added there:

  - _get_max_proposal_id — highest existing gov proposal id (0 if none).
  - find_proposal_by_title — scan (max_id_before, cur] for a proposal
    whose title matches and echo its id. The diff-against-snapshot
    pins it to *this* submission rather than a stale prior-run
    proposal. Mirrors lib.js's findProposalByTitle (#3406).

proposal_submit.sh:
  - Submit gov submit-proposal software-upgrade via -b sync, capture
    txhash for error reporting, then poll gov state for a proposal
    whose title matches our $VERSION arg. Same shape lib.js's
    proposeParamChange uses.

proposal_vote.sh:
  - Submit gov vote via -b sync and wait for the voter's account
    sequence to advance — the upstream proposal-status poll in the
    upgrade test yaml tolerates the brief window between this script's
    return and the tally update, and the yaml caller discards this
    script's stdout. CheckTx-rejection path still echoes the rejection
    code so the rare bad-input case stays visible.

Together this unblocks Upgrade Module Major + Upgrade Module Minor CI
jobs under Autobahn (proposal_submit's -b block was the gating call
that prevented proposal_id capture, making PROPOSAL_ID empty and
breaking every downstream eval step).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the gov module has no existing proposals, `seid q gov proposals`
exits non-zero with "no proposals found"; 2>/dev/null swallows the
error and jq reads empty stdin, emitting nothing. Callers (e.g.
find_proposal_by_title's loop) then hit "integer expression expected"
on `[ "$cur" -gt ... ]` and the search never matches, so a freshly-
booted cluster's first proposal_submit times out at 30s with no
proposal_id returned.

Fix: capture the seid output explicitly and fall back to 0 when it's
empty or when jq fails. Mirrors the try/catch path in lib.js's
maxProposalId.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under -b sync, seid surfaces "chain-id ()" in the signature-verify
error if --chain-id isn't on the command line; -b block was reading
it from the client config in some path. Setting it explicitly here
matches what proposal_vote.sh already does and what the cosmos test
framework expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies the bash-script migrations in this PR end-to-end under
Autobahn in CI's isolated per-job clusters (couldn't reproduce locally
because upgrade tests' binary swaps degrade a shared cluster).

  - Autobahn Upgrade Module (Major) / (Minor) — exercises
    proposal_submit.sh + proposal_vote.sh migrations from this PR
  - Autobahn FlatKV Integration — exercises deploy_flatkv_evm_fixture
    + downstream historical-state and crash-recovery scripts
  - Autobahn Disable WASM Tests — passed 16/16 locally; adding CI
    coverage

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnostic logs to trace why x/upgrade's BeginBlocker doesn't panic
at the scheduled upgrade height under Autobahn (Autobahn Upgrade
Module Major/Minor jobs in #3469's CI failed: proposal_submit
migration works, proposal passes, but LOG_AT_BLOCK_HEIGHT_NODE_X
assertions fail because no node panics with "UPGRADE NEEDED").

Two log sites:

  - giga_router.go executeBlock: log every FinalizeBlock call with
    GlobalNumber, the height value being passed to ABCI, and tx_count.
    Confirms Autobahn block delivery is feeding ABCI correctly.

  - sei-cosmos/x/upgrade/abci.go BeginBlocker entry: log every
    invocation with ctx.BlockHeight, planFound, and if found the
    plan's target Height + Name + ShouldExecute verdict + hasHandler.
    Should expose whether the plan is committed (planFound=true) but
    not firing (ShouldExecute=false unexpectedly), or whether
    BeginBlocker isn't even called at the upgrade height.

To be reverted once the root cause is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMay 20, 2026, 11:37 PM

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 94.23077% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.09%. Comparing base (63067ee) to head (7889ff7).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...int/cmd/tendermint/commands/gen_autobahn_config.go 0.00% 4 Missing ⚠️
sei-tendermint/internal/p2p/giga_router.go 91.30% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3474      +/-   ##
==========================================
+ Coverage   59.07%   59.09%   +0.01%     
==========================================
  Files        2186     2186              
  Lines      182269   182369     +100     
==========================================
+ Hits       107671   107764      +93     
- Misses      64947    64954       +7     
  Partials     9651     9651              
Flag Coverage Δ
sei-chain-pr 68.24% <94.23%> (?)
sei-db 70.41% <ø> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-cosmos/x/upgrade/abci.go 82.43% <100.00%> (+4.09%) ⬆️
...ei-tendermint/internal/autobahn/consensus/state.go 89.20% <100.00%> (+0.71%) ⬆️
sei-tendermint/internal/autobahn/data/state.go 80.82% <100.00%> (+1.83%) ⬆️
sei-tendermint/internal/autobahn/producer/state.go 84.05% <100.00%> (+3.35%) ⬆️
sei-tendermint/node/setup.go 69.77% <100.00%> (+0.09%) ⬆️
sei-tendermint/internal/p2p/giga_router.go 71.80% <91.30%> (+2.05%) ⬆️
...int/cmd/tendermint/commands/gen_autobahn_config.go 15.38% <0.00%> (-1.01%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

wen-coding and others added 11 commits May 20, 2026 11:40
Two changes to make the debug logs actually reach us:

1. Append (>>) instead of truncate (>) when starting seid in the
   cluster bringup + upgrade scripts. Previously every panic-restart
   cycle wiped the prior process's log, so by the time we read the
   log file all the BeginBlocker / FinalizeBlock log lines from the
   pre-panic seid were gone — only the latest restart's startup
   sequence remained.

   Sites: docker/localnode/scripts/step5_start_sei.sh,
          integration_test/upgrade_module/scripts/seid_upgrade.sh,
          integration_test/upgrade_module/scripts/seid_downgrade.sh

2. Strip the CI matrix to just the two Autobahn Upgrade Module jobs
   so we don't wait through 18+ other matrix rows that don't help
   this investigation.

To be reverted once the upgrade-panic-under-Autobahn root cause is
identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…grade_panic

# Conflicts:
#	.github/workflows/integration-test.yml
Sample two heights 3s apart and divide by elapsed wall-clock to derive
the real block interval, instead of assuming 300ms. Under Autobahn the
chain runs at ~400ms; with the hardcoded 300ms estimate, target_height
ended up ~33% closer to now than intended, the voting period elapsed
past it, and ScheduleUpgrade rejected the plan with "upgrade cannot
be scheduled in the past" — which is why planFound was never true in
the Autobahn upgrade test logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under Autobahn the cluster reaches TARGET_HEIGHT and panics within
~16s of the proposal passing (block production runs at ~100ms/block
during catchup, not the 400ms steady-state). That window isn't long
enough for the early-upgraded node to finish its restart bootstrap
(~14s) and execute its first block, so its BeginBlocker never gets
a chance to fire the "BINARY UPDATED BEFORE TRIGGER" panic and the
test's verify_panic step times out.

60s buys ~30s of cluster lifetime past vote-pass, comfortably more
than node-0's restart bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add AUTOBAHN-DEBUG logs at each Autobahn subsystem boundary so we can
tell, on a stuck post-restart cluster, which loop never advanced:

- giga_router.runExecute: start, App.Info result, restored-appHash push,
  per-iteration "awaiting GlobalBlock" / "received GlobalBlock" /
  "executeBlock complete". If the block-fetch line repeats without a
  matching received line, the data layer isn't delivering the next
  block (consensus didn't agree).

- producer.Run: start, per-iteration "awaiting capacity" /
  "building payload" / "calling ProduceBlock". Identifies whether
  producer is blocked on capacity (consensus stalled) or actually
  emitting blocks.

- consensus.runPropose: per-view "view tick" with am_leader, plus a
  "reproposing" / "new proposal stored" entry on the leader path.
  Distinguishes "no proposer for this view" from "proposer can't
  build a proposal".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So a single CI run on this branch produces 4 jobs (CometBFT major,
CometBFT minor, Autobahn major, Autobahn minor) with the same debug
logging applied. Diffing the CometBFT vs Autobahn log timelines after
the coordinated panic-and-restart should make it obvious which
Autobahn subsystem stalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHash is the deadlock site on coordinated-restart: it waits on
n < nextBlockToPersist, and on restart the data WAL recovers at the
same height the app committed, so nextBlockToPersist <= n and the
wait never unblocks.

Log the (n, nextAppProposal, nextBlockToPersist, nextBlock, nextQC,
first) tuple on entry and the wait result. With this, one CI run will
show the exact cursor values on the failing call, confirming or
refuting the equality-stall theory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHash's WaitUntil(n < nextBlockToPersist) protects the steady-state
invariant "don't broadcast an appHash for a block whose bytes aren't
durable yet". On runExecute's restart path this is over-conservative:
the app reports n as committed, so its hash is durable in the app's
persisted state regardless of the data WAL's catchup status. On
coordinated panic-recovery the data-persist goroutine was killed
mid-batch and the WAL lags the app's commit; no new block arrives to
satisfy the wait because consensus hasn't restarted on this node, so
the cluster deadlocks.

Add PushAppHashOnRecovery: same intent as PushAppHash on restart
(advance the app-proposal cursor to n+1 so subsequent steady-state
calls don't have to fast-forward through pruned heights), but:
  - no WaitUntil — durability is asserted by the app's commit
  - no per-block metrics or appProposals population — the recovered
    cursor jump may span heights whose blocks/QCs aren't in memory
  - advances all cursors to >= n+1 to preserve
    first <= nextAppProposal <= nextBlockToPersist <= nextBlock <= nextQC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
To verify whether the recovery deadlock is really caused by the data
WAL being behind app.LastBlockHeight, instrument three new sites:

- NewState entry: which heights the Blocks and CommitQCs WALs offer
  (LoadedFirst, persister Next cursor), plus the committee's first.
- NewState completion: final cursors and how many QCs / blocks ended
  up in memory after the restore loop.
- runPersist batch persisted: how many QCs/blocks each batch wrote
  and where the persistedQC/persistedBlock cursors landed.

With these plus the PushAppHash entry log, a single CI run shows:
  (a) what was actually on disk at restart (NewState load)
  (b) how runPersist's in-memory advance rate compares to the app's
      forward progress before the panic
  (c) the exact cursor gap PushAppHash sees on restart

If (a) shows the Blocks WAL only reached 802 while app.LastBlockHeight
is 803, the OS-page-cache-vs-fsync hypothesis is confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PushAppHashOnRecovery advanced data-layer cursors past blocks that
weren't actually loaded into inner.qcs / inner.blocks. Logs from
26194247294 (Minor) show the data WAL is *empty* on every restart
(loaded_qc_count=0, loaded_block_count=0) even after a graceful
shutdown — runPersist's writes don't survive process exit. With
cursors advanced, runPersist's next batch nil-panics dereferencing
inner.qcs[n].QC() for n that was never inserted.

The real story the logs revealed:

  - Minor passes pre-fix-C not because the data WAL persists, but
    because PushAppHash's WaitUntil is satisfied via peer-fed PushQC
    after restart (peers nodes 1/2/3 still have data in memory and
    deliver it). In the previous Minor run, PushAppHash(411) waited
    17 seconds for peers to catch node-0 up.
  - Major hangs because all 4 nodes restart simultaneously: nobody
    has the missing data in memory or on disk, so the wait can never
    be satisfied.

The deeper bug is data-WAL durability, not PushAppHash logic. Keep
the diagnostic logs (NewState load summary, runPersist batch progress,
PushAppHash entry) — they're the evidence — but revert the
behavioral change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The data layer's block and commitqc WALs were constructed with
utils.None[string](), making them no-ops: writes returned success but
nothing landed on disk, and NewState loaded an empty WAL on every
restart. Single-node restarts survived only by peer-fed PushQC; a
coordinated cluster restart had no recovery path because every node
had simultaneously lost its in-memory data.

The autobahn config file already has persistent_state_dir, and it was
already wired into the consensus layer. Thread the same value into
the data layer:

- gen_autobahn_config: add --persistent-state-dir flag, emit the
  value into AutobahnFileConfig.PersistentStateDir
- step4_config_override.sh: pass --persistent-state-dir
  $HOME/.sei/data/autobahn when generating the docker cluster's
  autobahn.json
- p2p.GigaRouterConfig: add PersistentStateDir field
- node/setup.go: forward fc.PersistentStateDir into the new field
- p2p/giga_router.go: pass cfg.PersistentStateDir to data.NewDataWAL
  instead of utils.None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant