
Fix raft leader re-election delays after SIGTERM#3230

Draft
auricom wants to merge 2 commits into main from
fix/3229-raft-re-election

Conversation

@auricom (Contributor)

auricom commented Apr 7, 2026

Summary

  • wait for the raft FSM to catch up to the commit index instead of the last log index when leadership changes
  • clear the retriever apply callback on stop so a follower->leader transition does not keep writing into a dead channel
  • add regression coverage for clearing the apply callback

Testing

  • ok github.com/evstack/ev-node/pkg/raft (cached)
  • ok github.com/evstack/ev-node/block/internal/syncing (cached)

Closes #3229

Summary by CodeRabbit

  • Bug Fixes

    • Improved Raft node shutdown behavior by properly clearing callback handlers during graceful termination.
    • Refined message synchronization wait condition to better track committed log entries.
  • Tests

    • Added test coverage for Raft node shutdown callback clearing behavior.

@coderabbitai (Contributor)

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e576b86c-04d5-4ca1-a990-34242140e6a8

📝 Walkthrough

The changes fix raft retriever shutdown behavior and message landing detection to prevent apply channel congestion and timeout issues during graceful shutdown. A callback clearing step was added to the retriever's Stop method, and the shutdown wait condition was changed to compare AppliedIndex against CommitIndex rather than LastIndex.

Changes

Raft Retriever Shutdown Cleanup (block/internal/syncing/raft_retriever.go, block/internal/syncing/raft_retriever_test.go)
  Added an explicit call to SetApplyCallback(nil) when stopping the retriever to clear the previously registered apply callback. A new unit test verifies this cleanup occurs correctly.
Raft Node Message Landing Detection (pkg/raft/node.go)
  Modified the waitForMsgsLanded loop condition to check AppliedIndex >= CommitIndex instead of AppliedIndex >= LastIndex, changing the point at which shutdown considers raft messages fully applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The apply channel clears with a whispered goodbye,
As callbacks find rest where they once stood high,
From LastIndex to CommitIndex we trace,
A rabbit's refactor brings shutdown to grace!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check | ✅ Passed | The title 'Fix raft leader re-election delays after SIGTERM' accurately summarizes the main objective of the changes: addressing delayed leader re-elections in raft consensus after termination signals.
  • Description check | ✅ Passed | The description covers the key changes, includes testing results, and links to issue #3229. It explains the three main improvements: FSM catch-up logic, callback cleanup, and regression test coverage.
  • Linked Issues check | ✅ Passed | The code changes directly address issue #3229's requirements: waiting for commit index instead of last index improves FSM synchronization, clearing the apply callback prevents dead channel writes, and regression tests ensure against recurrence.
  • Out of Scope Changes check | ✅ Passed | All changes directly support the linked objective to fix raft leader re-election delays. Modifications to waitForMsgsLanded, raftRetriever.Stop(), and test coverage are scoped to the issue requirements without extraneous changes.


@claude (Contributor)

claude bot commented Apr 7, 2026

Claude finished @auricom's task in 3m 39s


Code Review


The two bug fixes are well-scoped and address real issues. Here's my analysis:


pkg/raft/node.go: CommitIndex vs LastIndex

This change is correct. LastIndex includes log entries that may never be committed (e.g., uncommitted entries appended by a pre-shutdown leader), which means AppliedIndex >= LastIndex could be permanently unsatisfied during shutdown — causing a guaranteed timeout and warning log on every clean shutdown. CommitIndex is the right target: it represents the highest entry that quorum has agreed on, which the FSM will eventually apply.

One subtle edge: if the node is a follower during shutdown and the leader commits new entries during the waitForMsgsLanded polling loop, CommitIndex could advance beyond what the FSM has applied. With only 2 poll cycles (ticker at SendTimeout/2, timeout at SendTimeout), a briefly lagging FSM could still hit the timeout. This is a pre-existing design constraint rather than a bug introduced here.


block/internal/syncing/raft_retriever.go: SetApplyCallback(nil) on Stop

The sequencing of wg.Wait() before SetApplyCallback(nil) is correct: the consumer goroutine is confirmed stopped before the callback is cleared, eliminating the "follower→leader transition writes to dead channel" scenario described in the PR.

Data race concern (Medium): FSM.Apply in pkg/raft/node.go reads f.applyCh without any mutex:

// pkg/raft/node.go:322
if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{...}:
    ...
    }
}

After wg.Wait() returns, the raft node is still running, meaning FSM.Apply can still be invoked by the raft library's internal apply goroutine. SetApplyCallback(nil) then concurrently writes f.applyCh. This is a data race per the Go memory model and will be flagged by go test -race.

The practical impact is limited (a nil channel in a select just takes the default), but the race detector will flag this. CodeRabbit's suggested fix is correct: add a sync.RWMutex to FSM, protect the field with a write lock in SetApplyCallback, and snapshot the pointer under a read lock in Apply before the select.
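
That pattern can be shown as a self-contained runnable sketch. The fsm and applyMsg types below are simplified stand-ins, not the real pkg/raft types: the channel field is guarded by a sync.RWMutex, snapshotted under a read lock, and the send happens outside the lock so a slow consumer cannot block the setter.

```go
package main

import (
	"fmt"
	"sync"
)

// applyMsg is a simplified stand-in for RaftApplyMsg.
type applyMsg struct{ index uint64 }

// fsm guards its apply channel so SetApplyCallback(nil) during shutdown cannot
// race with a concurrent Apply invoked by the raft library.
type fsm struct {
	mu      sync.RWMutex
	applyCh chan<- applyMsg
}

func (f *fsm) SetApplyCallback(ch chan<- applyMsg) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.applyCh = ch
}

func (f *fsm) Apply(m applyMsg) {
	// Snapshot the channel under a read lock, then send outside the lock.
	f.mu.RLock()
	ch := f.applyCh
	f.mu.RUnlock()
	if ch != nil {
		select {
		case ch <- m:
		default: // drop rather than block the raft apply path
		}
	}
}

func main() {
	f := &fsm{}
	out := make(chan applyMsg, 1)
	f.SetApplyCallback(out)
	f.Apply(applyMsg{index: 7})
	f.SetApplyCallback(nil) // safe even if Apply runs concurrently
	f.Apply(applyMsg{index: 8}) // silently dropped: callback is cleared
	fmt.Println(len(out), (<-out).index) // → 1 7
}
```

Sending outside the lock is the key design choice: holding even a read lock across a channel send would let a full channel stall SetApplyCallback during shutdown.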

Missing Stop() call in Syncer.Stop(): Syncer.Stop() (block/internal/syncing/syncer.go:231) cancels the parent context and calls s.fiRetriever.Stop() / s.daFollower.Stop(), but never calls s.raftRetriever.Stop(). This means the SetApplyCallback(nil) call added by this PR is never reached during normal Syncer.Stop() — only context cancellation stops the raftApplyLoop. If the intent is to clear the callback on every clean shutdown (not just explicit raftRetriever.Stop() calls), Syncer.Stop() should be updated.


block/internal/syncing/raft_retriever_test.go — New test

The test correctly verifies the stop/clear behavior and uses a properly mutex-protected stubRaftNode. The nil eventSink is safe since the stub never triggers an apply that would call PipeEvent.

Per AGENTS.md, Go unit tests should be table-driven. CodeRabbit also flagged this. Consider wrapping the current case in a tests slice so idempotent-stop and start/stop-cycle variants can be added without duplicating setup.
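
A table-driven shape for that test might look like the sketch below. The stubNode, retriever, and checkCase helper are hypothetical stand-ins (the real test would use the repo's types and run each entry under t.Run with require assertions); the sketch only demonstrates the case-table structure with fresh state per case.

```go
package main

import "fmt"

// stubNode records every value passed to SetApplyCallback, including the
// final nil set by Stop.
type stubNode struct{ recorded []any }

func (s *stubNode) SetApplyCallback(cb any) { s.recorded = append(s.recorded, cb) }

// retriever is a minimal stand-in: Start registers a callback, Stop clears it.
type retriever struct{ node *stubNode }

func (r *retriever) Start() { r.node.SetApplyCallback(func() {}) }
func (r *retriever) Stop()  { r.node.SetApplyCallback(nil) }

// checkCase runs one table entry against a fresh stub and retriever.
func checkCase(run func(r *retriever), wantCalls int, wantLastNil bool) error {
	node := &stubNode{} // fresh state per case, as the table loop requires
	r := &retriever{node: node}
	run(r)
	if len(node.recorded) != wantCalls {
		return fmt.Errorf("got %d recorded callbacks, want %d", len(node.recorded), wantCalls)
	}
	if last := node.recorded[len(node.recorded)-1]; (last == nil) != wantLastNil {
		return fmt.Errorf("last callback nil=%v, want %v", last == nil, wantLastNil)
	}
	return nil
}

func main() {
	cases := []struct {
		name        string
		run         func(r *retriever)
		wantCalls   int
		wantLastNil bool
	}{
		{"start then stop", func(r *retriever) { r.Start(); r.Stop() }, 2, true},
		{"stop is idempotent", func(r *retriever) { r.Start(); r.Stop(); r.Stop() }, 3, true},
		{"start/stop cycle ends started", func(r *retriever) { r.Start(); r.Stop(); r.Start() }, 3, false},
	}
	for _, tc := range cases {
		if err := checkCase(tc.run, tc.wantCalls, tc.wantLastNil); err != nil {
			fmt.Println(tc.name, "FAILED:", err)
			return
		}
		fmt.Println(tc.name, "ok")
	}
}
```

New variants then become one-line table entries rather than copied setup blocks.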


Summary

Issue | Severity | File
Data race on FSM.applyCh (will fail -race) | Medium | pkg/raft/node.go
Syncer.Stop() never calls raftRetriever.Stop(), so the fix doesn't fire on normal shutdown | Medium | block/internal/syncing/syncer.go
Test should be table-driven (AGENTS.md guideline) | Low | raft_retriever_test.go

The waitForMsgsLanded fix and the SetApplyCallback(nil) intent are both correct — the data race and missing call site are the blocking concerns before merge.

@github-actions (Contributor)

github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 8, 2026, 12:19 AM

@coderabbitai (Contributor) coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

r.mtx.Unlock()

r.wg.Wait()
r.raftNode.SetApplyCallback(nil)

⚠️ Potential issue | 🟠 Major


Synchronize callback access to prevent data race with FSM.Apply.

Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.

Suggested fix (in pkg/raft/node.go)
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever.go` at line 77, The call to
r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends
on applyCh while the raft node may concurrently invoke the callback; fix by
adding a mutex to the raft node to guard access to the apply callback: protect
the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and
any internal invocation sites) with the new mutex so that FSM.Apply (which
reads/sends on applyCh via the callback) cannot see a nil or changing callback
mid-invocation; update the raft node's invocation path that calls the callback
(where Apply is invoked) to acquire the same mutex (or use a read lock) when
reading the callback and release it immediately after obtaining the pointer,
then call the callback outside the lock if needed to avoid long-held locks.

@auricom auricom marked this pull request as draft April 7, 2026 15:24
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 61.92%. Comparing base (d2a29e8) to head (857b44b).

Files with missing lines | Patch % | Lines
pkg/raft/node.go | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.92%   +0.24%     
==========================================
  Files         120      120              
  Lines       12635    12636       +1     
==========================================
+ Hits         7793     7825      +32     
+ Misses       3968     3936      -32     
- Partials      874      875       +1     
Flag: combined | Coverage: 61.92% <50.00%> (+0.24% ⬆️)



Development

Successfully merging this pull request may close these issues.

Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster
