Skip to content

fix: backward-forward-let#1570

Open
arnaubennassar wants to merge 12 commits intodevelopfrom
feat/imporve-bw-fw-lettool
Open

fix: backward-forward-let#1570
arnaubennassar wants to merge 12 commits intodevelopfrom
feat/imporve-bw-fw-lettool

Conversation

@arnaubennassar
Copy link
Copy Markdown
Collaborator

@arnaubennassar arnaubennassar commented Apr 1, 2026

🔄 Changes Summary

  • Extend backward-forward-let with the staging-drill workflow validated on a real staging network:
    • add craft-cert to build and sign malicious certificates for controlled BFL recovery drills,
    • allow send-cert --no-db so a crafted certificate can be sent to AggLayer without persisting it in the aggsender DB,
    • improve craft-cert fallback behavior so it can reconstruct historical leaves from aggsender RPC, aggsender DB, bridge service, and --cert-exits-file override data,
    • tighten diagnosis so missing settled certificate exits are not misclassified as NoDivergence, and fail fast when bridge-service indexing is still behind.
  • Update the BFL operator documentation to match the full staging validation flow:
    • document the normal recovery flow,
    • document the AggLayer admin / --cert-exits-file fallback procedure,
    • document staging-drill constraints such as stopping aggsender before malicious sends, waiting for bridge-service indexing, and waiting for honest post-recovery certificates to settle.
  • Keep aggsender startup recovery intentionally strict:
    • revert the branch-only change that auto-reconciled stale/conflicting local certificate state from AggLayer,
    • add explicit fatal logging that tells the operator to wipe the aggsender DB and restart aggsender when local DB state no longer matches a further AggLayer certificate.

⚠️ Breaking Changes

  • 🛠️ Config: None
  • 🔌 API/CLI: None for existing commands; this PR only adds staging-oriented BFL subcommands/flags (craft-cert, send-cert --no-db, --cert-exits-file reuse during crafting).
  • 🗑️ Deprecated Features: None

📋 Config Updates

  • No required config migration.
  • craft-cert reuses the existing AggSender.AggsenderPrivateKey signer config by default, so existing shared signer backends such as GCP KMS continue to work.

✅ Testing

  • 🤖 Automatic:
    • go test ./aggsender/statuschecker/...
    • go test ./tools/backward_forward_let/...
  • 🖱️ Manual:
    • validated the backward/forward LET workflow end to end on staging, including crafted malicious cert drills, fallback override-file recovery, Case 1 through Case 4 classification/recovery, and post-recovery convergence back to NoDivergence
    • validated the aggsender mismatch incident discovered during staging and confirmed the final documented operator action is manual DB wipe, not automatic reconciliation

🐞 Issues

  • Closes #none

🔗 Related PRs

  • None

📝 Notes

  • The aggsender strictness is intentional. When AggLayer is already on a further or conflicting certificate than the local aggsender DB, the supported recovery is to wipe the aggsender DB and restart aggsender.
  • craft-cert remains explicitly gated behind --staging-only and is intended for staging drills, not production operation.
  • The fallback override flow remains an operator procedure because recovery still depends on externally supplied bridge exits when aggsender RPC/DB cannot provide them.

@arnaubennassar arnaubennassar marked this pull request as draft April 14, 2026 13:25
@arnaubennassar arnaubennassar changed the title docs: update backward-forward-let runbook fix: backward-forward-let Apr 16, 2026
@agglayer agglayer deleted a comment from claude bot Apr 16, 2026
@arnaubennassar arnaubennassar marked this pull request as ready for review April 16, 2026 14:51
@joanestebanr joanestebanr requested a review from Copilot April 16, 2026 15:37
@joanestebanr
Copy link
Copy Markdown
Collaborator

@claude

@joanestebanr
Copy link
Copy Markdown
Collaborator

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens AggSender startup recovery against stale/conflicting local certificate state (treating AggLayer as authoritative), and extends the backward-forward-let tooling/runbooks to support staging drills (including crafting and sending certificates without persisting them locally).

Changes:

  • AggSender startup recovery now deletes/replaces conflicting local cert entries instead of failing, syncing local state back to AggLayer.
  • backward-forward-let adds craft-cert (staging-gated) and enhances send-cert with --no-db; diagnosis/reporting logic is refined for partial results.
  • Documentation/runbooks are updated to reflect the operator workflow and fallback procedures when aggsender exit data is unavailable.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tools/backward_forward_let/types.go Adds a helper for “complete + no divergence” diagnosis checks.
tools/backward_forward_let/send_cert.go Adds --no-db path to send certs without persisting in aggsender DB.
tools/backward_forward_let/send_cert_test.go Adds coverage for --no-db and CLI flag wiring.
tools/backward_forward_let/run.go Uses the new “complete no divergence” helper to avoid false “nothing to do”.
tools/backward_forward_let/diagnosis.go Adjusts printing flow and changes bridge-service NotFound handling to fail fast.
tools/backward_forward_let/diagnosis_test.go Adds tests for partial results and updates bridge-service NotFound expectations.
tools/backward_forward_let/craft_cert.go Introduces craft-cert command implementation and multi-source exit reconstruction logic.
tools/backward_forward_let/craft_cert_test.go Adds comprehensive tests for crafted cert behavior and fallback reconstruction paths.
tools/backward_forward_let/config.go Extends tool config to reuse AggSender.AggsenderPrivateKey for craft-cert signing.
tools/backward_forward_let/cmd/main.go Registers craft-cert and adds --no-db to send-cert.
tools/backward_forward_let/RECOVERY_PROCEDURE.md Documents admin API auth header option and operational staging notes.
tools/backward_forward_let/README.md Adds full tool/command documentation including staging drill flows.
docs/backward_forward_let_runbook.md Rewrites runbook to a tool-driven operator workflow and fallback steps.
aggsender/statuschecker/initial_state.go Adds delete-local-cert recovery actions and makes reconciliation permissive.
aggsender/statuschecker/initial_state_test.go Updates expectations to validate new recovery actions (including delete/insert sequencing).
aggsender/statuschecker/cert_status_checker.go Executes the new delete-local-cert action and surfaces warnings.
aggsender/statuschecker/cert_status_checker_test.go Adds coverage for executing delete-local-cert actions.
.gitignore Ignores debug/ and bin/ directories.
Comments suppressed due to low confidence (1)

tools/backward_forward_let/cmd/main.go:104

  • The send-cert command Usage string still says it records the cert in the aggsender DB, but this PR makes DB storage optional via --no-db. Update the help text to reflect the new behavior, and consider validating flag combinations (e.g., require exactly one of --db-path or --no-db, and error if both are set) to avoid surprising no-op/ignored flags.
			Name:  "send-cert",
			Usage: "Send a certificate to the agglayer and record it in the aggsender DB",
			Flags: []cli.Flag{
				&cli.StringFlag{
					Name:  "cert-json",
					Usage: "Certificate JSON string (mutually exclusive with --cert-file)",

return nil, fmt.Errorf("parse --amount %q as decimal big.Int", c.String("amount"))
}

originTokenAddr := common.HexToAddress(c.String("origin-token-address"))
Comment on lines +551 to +553
func makeFakeBridgeExit(opts *craftCertOptions, exitIndex int) *agglayertypes.BridgeExit {
addrBytes := crypto.Keccak256(append(append([]byte(nil), opts.nonce...), byte(exitIndex)))
return &agglayertypes.BridgeExit{
return nil, err
}
if cert == nil {
return nil, fmt.Errorf("certificate not found")
Comment on lines +510 to +528
func callCraftCertRPCWithTimeout[T any](fn func() (T, error)) (T, error) {
type result struct {
value T
err error
}

resultCh := make(chan result, 1)
go func() {
value, err := fn()
resultCh <- result{value: value, err: err}
}()

select {
case result := <-resultCh:
return result.value, result.err
case <-time.After(craftCertRPCRequestTimeout):
var zero T
return zero, fmt.Errorf("aggsender RPC request timed out after %s", craftCertRPCRequestTimeout)
}
@sonarqubecloud
Copy link
Copy Markdown

@agglayer agglayer deleted a comment from claude bot Apr 17, 2026
@arnaubennassar
Copy link
Copy Markdown
Collaborator Author

@claude

@claude
Copy link
Copy Markdown

claude bot commented Apr 17, 2026

Claude finished @arnaubennassar's task in 2m 9s —— View job


Code Review Complete ✅

Reviewed PR #1570: fix: backward-forward-let


🚀 What's New

This PR extends the backward-forward-let tool with staging drill capabilities, including a new craft-cert command for building malicious certificates and send-cert --no-db for sending certificates without persisting them. It also hardens aggsender startup recovery to require manual DB wipes when AggLayer state diverges from local state, replacing auto-reconciliation with strict fail-fast behavior.

🐛 Bug Fixes

  • Fixed diagnosis logic to fail fast when bridge-service data is missing (not indexed yet) rather than silently treating it as "NoDivergence"
  • Fixed aggsender recovery to properly detect and require manual intervention when local certificate state conflicts with AggLayer state

📋 Config Updates

None. The craft-cert command reuses existing AggSender.AggsenderPrivateKey configuration.

⚠️ Breaking Changes

  • Aggsender startup behavior: Previously, aggsender would attempt to auto-reconcile when AggLayer had a different certificate than the local DB. Now it requires manual DB wipe and restart. This is intentional and documented in error messages with clear operator instructions.

🔍 Security Review

🔴 Required Issues

1. Address collision in fake bridge exit generation (craft_cert.go:552)

The makeFakeBridgeExit function uses only a single byte of the exit index to generate destination addresses:

addrBytes := crypto.Keccak256(append(append([]byte(nil), opts.nonce...), byte(exitIndex)))

Issue: byte(exitIndex) truncates to 0-255, causing collisions for exitIndex >= 256. If an operator uses --starting-exit-index 256 or --num-fake-exits > 256, this will generate duplicate destination addresses and identical bridge exit leaves, which could cause unexpected behavior in staging drills.

Recommendation: Encode the full exit index (as uint32 big-endian bytes) into the hash input instead of casting to a single byte.

Fix this →

2. Silent address parsing failure (craft_cert.go:132)

common.HexToAddress silently returns the zero address for invalid input:

originTokenAddr := common.HexToAddress(c.String("origin-token-address"))

Issue: If the operator provides a malformed --origin-token-address, the tool will use 0x0000... without warning, potentially crafting unintended certificates.

Recommendation: Validate the address using common.IsHexAddress and return a clear error for malformed input before proceeding.

Fix this →

🟡 Suggested Improvements

3. Goroutine leak in RPC timeout wrapper (craft_cert.go:510-529)

callCraftCertRPCWithTimeout spawns a goroutine per RPC call but cannot cancel the underlying request on timeout. If the RPC hangs indefinitely, the goroutine continues running even after the timeout fires. Repeated retries can accumulate leaked goroutines.

Recommendation: Use a context-aware approach with context.WithTimeout for the RPC calls, or document that timeouts may leak goroutines in long-running scenarios.

4. Missing height in error message (craft_cert.go:440)

When reading from the DB fails with cert == nil, the error message doesn't include the height being queried:

return nil, fmt.Errorf("certificate not found")

Recommendation: Include the height in the error message for better debugging during staging drills.

5. Help text inconsistency (cmd/main.go:100)

The send-cert command usage says "record it in the aggsender DB", but this PR makes DB storage optional via --no-db. The help text should reflect this:

Usage: "Send a certificate to the agglayer and optionally record it in the aggsender DB",

Also consider validating that exactly one of --db-path or --no-db is set to avoid confusion.


✅ Best Practices Review

Code Quality

  • Error handling: Comprehensive wrapping with fmt.Errorf and %w
  • Naming: Follows Go conventions (CamelCase for exports, descriptive names)
  • Testing: Strong test coverage including unit tests for all new functions (87.5% coverage per SonarQube)
  • Comments: Exported functions have proper doc comments
  • ⚠️ Line length: Some lines in craft_cert.go exceed 120 chars (lines 189, 299-301, 442-444) - consider breaking these up

Go Style (CLAUDE.md compliance)

  • ✅ Uses require (not assert) in tests as required
  • ✅ Error wrapping follows project conventions
  • ✅ Typed constants used appropriately (craftCertFetchMaxAttempts, etc.)
  • ✅ Mocks properly placed in test files

Security

  • ⚠️ craft-cert is properly gated behind --staging-only flag (line 59-61)
  • ✅ File permissions use restrictive 0o600 for crafted cert output (line 38)
  • ✅ No hardcoded secrets or credentials
  • ✅ Proper input validation on most parameters

🧪 Test Coverage Assessment

Backward Forward LET Tool Tests

  • craft_cert_test.go (614 lines): Comprehensive coverage of certificate crafting, fallback logic, RPC retries, signer config resolution
  • send_cert_test.go: Tests both --db and --no-db paths
  • diagnosis_test.go: Tests partial results and missing cert handling

Aggsender Recovery Tests

  • initial_state_test.go: Extensive test matrix covering 17 regular cases + 10 inconsistency cases
  • ✅ Tests explicitly validate the new "manual DB wipe required" error messages (lines 84, 92, 100, 115, 130, 295, 315)

Overall: Excellent test coverage. The test suite properly validates both happy paths and error conditions.


🟢 Optional Notes

Staging-only enforcement: The craft-cert command's --staging-only flag provides good documentation of intent, but consider adding runtime checks (e.g., refusing to run if config points to mainnet endpoints) for additional safety.

Recovery strictness rationale: The intentional removal of auto-reconciliation in aggsender recovery is well-justified in the PR description. The explicit fatal logging with operator instructions (lines 75-78 in initial_state.go) is good UX.

Documentation quality: The new README.md and updated runbook are thorough and provide clear operator workflows. The missing cert report output (diagnosis.go:438-498) with copy-pasteable commands is especially helpful.


📊 Summary

This PR successfully extends the backward-forward-let tool for staging drill workflows while hardening aggsender recovery semantics. The implementation is generally well-structured with comprehensive test coverage.

Priority fixes needed before merge:

  1. 🔴 Fix address collision in makeFakeBridgeExit (exitIndex > 255)
  2. 🔴 Add validation for --origin-token-address input

Recommended improvements:
3. 🟡 Address potential goroutine leak in RPC timeout wrapper
4. 🟡 Improve error messages with height context
5. 🟡 Update send-cert help text to reflect optional DB storage

The changes are well-tested, follow project conventions, and the intentional strictness in aggsender recovery is appropriately documented.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants