
Fix raft leader re-election delays after SIGTERM#3230

Draft
auricom wants to merge 2 commits into main from
fix/3229-raft-re-election

Conversation

@auricom (Contributor)

auricom commented Apr 7, 2026

Summary

  • wait for the raft FSM to catch up to the commit index instead of the last log index when leadership changes
  • clear the retriever apply callback on stop so a follower->leader transition does not keep writing into a dead channel
  • add regression coverage for clearing the apply callback

Testing

  • ok github.com/evstack/ev-node/pkg/raft (cached)
  • ok github.com/evstack/ev-node/block/internal/syncing (cached)

Closes #3229

Summary by CodeRabbit

  • Bug Fixes

    • Improved Raft node shutdown behavior by properly clearing callback handlers during graceful termination.
    • Refined message synchronization wait condition to better track committed log entries.
  • Tests

    • Added test coverage for Raft node shutdown callback clearing behavior.

@coderabbitai (Contributor)

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e576b86c-04d5-4ca1-a990-34242140e6a8

📝 Walkthrough

The changes fix raft retriever shutdown behavior and message landing detection to prevent apply channel congestion and timeout issues during graceful shutdown. A callback clearing step was added to the retriever's Stop method, and the shutdown wait condition was changed to compare AppliedIndex against CommitIndex rather than LastIndex.

Changes

Raft Retriever Shutdown Cleanup (block/internal/syncing/raft_retriever.go, block/internal/syncing/raft_retriever_test.go)
  Added an explicit call to SetApplyCallback(nil) when stopping the retriever to clear the previously registered apply callback. A new unit test verifies this cleanup occurs correctly.
Raft Node Message Landing Detection (pkg/raft/node.go)
  Modified the waitForMsgsLanded loop condition to check AppliedIndex >= CommitIndex instead of AppliedIndex >= LastIndex, changing the point at which shutdown considers raft messages fully applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The apply channel clears with a whispered goodbye,
As callbacks find rest where they once stood high,
From LastIndex to CommitIndex we trace,
A rabbit's refactor brings shutdown to grace!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check | ✅ Passed | The title 'Fix raft leader re-election delays after SIGTERM' accurately summarizes the main objective of the changes: addressing delayed leader re-elections in raft consensus after termination signals.
  • Description check | ✅ Passed | The description covers the key changes, includes testing results, and links to issue #3229. It explains the three main improvements: FSM catch-up logic, callback cleanup, and regression test coverage.
  • Linked Issues check | ✅ Passed | The code changes directly address issue #3229's requirements: waiting for commit index instead of last index improves FSM synchronization, clearing the apply callback prevents dead channel writes, and regression tests ensure against recurrence.
  • Out of Scope Changes check | ✅ Passed | All changes directly support the linked objective to fix raft leader re-election delays. Modifications to waitForMsgsLanded, raftRetriever.Stop(), and test coverage are scoped to the issue requirements without extraneous changes.


@claude (Contributor)

claude bot commented Apr 7, 2026

Claude finished @auricom's task in 3m 39s


Code Review


The two bug fixes are well-scoped and address real issues. Here's my analysis:


pkg/raft/node.go: CommitIndex vs LastIndex

This change is correct. LastIndex includes log entries that may never be committed (e.g., uncommitted entries appended by a pre-shutdown leader), which means AppliedIndex >= LastIndex could be permanently unsatisfied during shutdown — causing a guaranteed timeout and warning log on every clean shutdown. CommitIndex is the right target: it represents the highest entry that quorum has agreed on, which the FSM will eventually apply.

One subtle edge: if the node is a follower during shutdown and the leader commits new entries during the waitForMsgsLanded polling loop, CommitIndex could advance beyond what the FSM has applied. With only 2 poll cycles (ticker at SendTimeout/2, timeout at SendTimeout), a briefly lagging FSM could still hit the timeout. This is a pre-existing design constraint rather than a bug introduced here.


block/internal/syncing/raft_retriever.go: SetApplyCallback(nil) on Stop

The sequencing of wg.Wait() before SetApplyCallback(nil) is correct: the consumer goroutine is confirmed stopped before the callback is cleared, eliminating the "follower→leader transition writes to dead channel" scenario described in the PR.

Data race concern (Medium): FSM.Apply in pkg/raft/node.go reads f.applyCh without any mutex:

// pkg/raft/node.go:322
if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{...}:
    ...
    }
}

After wg.Wait() returns, the raft node is still running, meaning FSM.Apply can still be invoked by the raft library's internal apply goroutine. SetApplyCallback(nil) then concurrently writes f.applyCh. This is a data race per the Go memory model and will be flagged by go test -race.

The practical impact is limited (a nil channel in a select just takes the default), but the race detector will flag this. CodeRabbit's suggested fix is correct: add a sync.RWMutex to FSM, protect the field with a write lock in SetApplyCallback, and snapshot the pointer under a read lock in Apply before the select.
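
That pattern can be shown as a self-contained runnable sketch. The fsm and applyMsg types below are simplified stand-ins, not the real pkg/raft types: the channel field is guarded by a sync.RWMutex, snapshotted under a read lock, and the send happens outside the lock so a slow consumer cannot block the setter.

```go
package main

import (
	"fmt"
	"sync"
)

// applyMsg is a simplified stand-in for RaftApplyMsg.
type applyMsg struct{ index uint64 }

// fsm guards its apply channel so SetApplyCallback(nil) during shutdown cannot
// race with a concurrent Apply invoked by the raft library.
type fsm struct {
	mu      sync.RWMutex
	applyCh chan<- applyMsg
}

func (f *fsm) SetApplyCallback(ch chan<- applyMsg) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.applyCh = ch
}

func (f *fsm) Apply(m applyMsg) {
	// Snapshot the channel under a read lock, then send outside the lock.
	f.mu.RLock()
	ch := f.applyCh
	f.mu.RUnlock()
	if ch != nil {
		select {
		case ch <- m:
		default: // drop rather than block the raft apply path
		}
	}
}

func main() {
	f := &fsm{}
	out := make(chan applyMsg, 1)
	f.SetApplyCallback(out)
	f.Apply(applyMsg{index: 7})
	f.SetApplyCallback(nil) // safe even if Apply runs concurrently
	f.Apply(applyMsg{index: 8}) // silently dropped: callback is cleared
	fmt.Println(len(out), (<-out).index) // → 1 7
}
```

Sending outside the lock is the key design choice: holding even a read lock across a channel send would let a full channel stall SetApplyCallback during shutdown.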

Missing Stop() call in Syncer.Stop(): Syncer.Stop() (block/internal/syncing/syncer.go:231) cancels the parent context and calls s.fiRetriever.Stop() / s.daFollower.Stop(), but never calls s.raftRetriever.Stop(). This means the SetApplyCallback(nil) call added by this PR is never reached during normal Syncer.Stop() — only context cancellation stops the raftApplyLoop. If the intent is to clear the callback on every clean shutdown (not just explicit raftRetriever.Stop() calls), Syncer.Stop() should be updated.


block/internal/syncing/raft_retriever_test.go — New test

The test correctly verifies the stop/clear behavior and uses a properly mutex-protected stubRaftNode. The nil eventSink is safe since the stub never triggers an apply that would call PipeEvent.

Per AGENTS.md, Go unit tests should be table-driven. CodeRabbit also flagged this. Consider wrapping the current case in a tests slice so idempotent-stop and start/stop-cycle variants can be added without duplicating setup.
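
A table-driven shape for that test might look like the sketch below. The stubNode, retriever, and checkCase helper are hypothetical stand-ins (the real test would use the repo's types and run each entry under t.Run with require assertions); the sketch only demonstrates the case-table structure with fresh state per case.

```go
package main

import "fmt"

// stubNode records every value passed to SetApplyCallback, including the
// final nil set by Stop.
type stubNode struct{ recorded []any }

func (s *stubNode) SetApplyCallback(cb any) { s.recorded = append(s.recorded, cb) }

// retriever is a minimal stand-in: Start registers a callback, Stop clears it.
type retriever struct{ node *stubNode }

func (r *retriever) Start() { r.node.SetApplyCallback(func() {}) }
func (r *retriever) Stop()  { r.node.SetApplyCallback(nil) }

// checkCase runs one table entry against a fresh stub and retriever.
func checkCase(run func(r *retriever), wantCalls int, wantLastNil bool) error {
	node := &stubNode{} // fresh state per case, as the table loop requires
	r := &retriever{node: node}
	run(r)
	if len(node.recorded) != wantCalls {
		return fmt.Errorf("got %d recorded callbacks, want %d", len(node.recorded), wantCalls)
	}
	if last := node.recorded[len(node.recorded)-1]; (last == nil) != wantLastNil {
		return fmt.Errorf("last callback nil=%v, want %v", last == nil, wantLastNil)
	}
	return nil
}

func main() {
	cases := []struct {
		name        string
		run         func(r *retriever)
		wantCalls   int
		wantLastNil bool
	}{
		{"start then stop", func(r *retriever) { r.Start(); r.Stop() }, 2, true},
		{"stop is idempotent", func(r *retriever) { r.Start(); r.Stop(); r.Stop() }, 3, true},
		{"start/stop cycle ends started", func(r *retriever) { r.Start(); r.Stop(); r.Start() }, 3, false},
	}
	for _, tc := range cases {
		if err := checkCase(tc.run, tc.wantCalls, tc.wantLastNil); err != nil {
			fmt.Println(tc.name, "FAILED:", err)
			return
		}
		fmt.Println(tc.name, "ok")
	}
}
```

New variants then become one-line table entries rather than copied setup blocks.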


Summary

Issue | Severity | File
Data race on FSM.applyCh (will fail -race) | Medium | pkg/raft/node.go
Syncer.Stop() never calls raftRetriever.Stop(), so the fix doesn't fire on normal shutdown | Medium | block/internal/syncing/syncer.go
Test should be table-driven (AGENTS.md guideline) | Low | raft_retriever_test.go

The waitForMsgsLanded fix and the SetApplyCallback(nil) intent are both correct — the data race and missing call site are the blocking concerns before merge.

@github-actions (Contributor)

github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 8, 2026, 12:19 AM

@coderabbitai (Contributor) coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

r.mtx.Unlock()

r.wg.Wait()
r.raftNode.SetApplyCallback(nil)

⚠️ Potential issue | 🟠 Major


Synchronize callback access to prevent data race with FSM.Apply.

Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.

Suggested fix (in pkg/raft/node.go)
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever.go` at line 77, The call to
r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends
on applyCh while the raft node may concurrently invoke the callback; fix by
adding a mutex to the raft node to guard access to the apply callback: protect
the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and
any internal invocation sites) with the new mutex so that FSM.Apply (which
reads/sends on applyCh via the callback) cannot see a nil or changing callback
mid-invocation; update the raft node's invocation path that calls the callback
(where Apply is invoked) to acquire the same mutex (or use a read lock) when
reading the callback and release it immediately after obtaining the pointer,
then call the callback outside the lock if needed to avoid long-held locks.

@auricom auricom marked this pull request as draft April 7, 2026 15:24
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 61.92%. Comparing base (d2a29e8) to head (857b44b).

Files with missing lines | Patch % | Lines
pkg/raft/node.go | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.92%   +0.24%     
==========================================
  Files         120      120              
  Lines       12635    12636       +1     
==========================================
+ Hits         7793     7825      +32     
+ Misses       3968     3936      -32     
- Partials      874      875       +1     
Flag: combined | Coverage: 61.92% <50.00%> (+0.24% ⬆️)



Development

Successfully merging this pull request may close these issues.

Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster
