Fix raft leader re-election delays after SIGTERM #3230
Conversation
📝 Walkthrough
The changes fix raft retriever shutdown behavior and message-landing detection to prevent apply-channel congestion and timeout issues during graceful shutdown. A new callback-clearing step was added to the retriever's Stop method, and the shutdown's message-landing condition was changed from comparing against LastIndex to comparing against CommitIndex.
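The LastIndex-vs-CommitIndex distinction can be sketched as follows. The `raftStatus` type and its field names are illustrative assumptions for this sketch, not the real pkg/raft API:

```go
package main

import "fmt"

// raftStatus is a hypothetical snapshot of raft progress; the field names
// are assumptions for illustration, not the real pkg/raft types.
type raftStatus struct {
	CommitIndex  uint64 // highest entry known to be committed by the cluster
	LastIndex    uint64 // highest entry in the local log (possibly uncommitted)
	AppliedIndex uint64 // highest entry already handed to the FSM
}

// msgsLanded reports whether every committed entry has been applied.
// Waiting on CommitIndex rather than LastIndex avoids blocking shutdown
// on log entries that may never commit once the leader steps down.
func msgsLanded(s raftStatus) bool {
	return s.AppliedIndex >= s.CommitIndex
}

func main() {
	s := raftStatus{CommitIndex: 10, LastIndex: 12, AppliedIndex: 10}
	fmt.Println(msgsLanded(s))                 // all committed entries have landed
	fmt.Println(s.AppliedIndex >= s.LastIndex) // a LastIndex check would still wait
}
```

With two uncommitted trailing entries, the CommitIndex check lets shutdown proceed while the old LastIndex comparison would keep waiting, which matches the timeout symptom the PR fixes.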
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Claude finished @auricom's task in 3m 39s (Code Review)
The two bug fixes are well-scoped and address real issues. Here's my analysis:
| Issue | Severity | File |
|---|---|---|
| Data race on FSM.applyCh (will fail `-race`) | Medium | pkg/raft/node.go |
| Syncer.Stop() never calls raftRetriever.Stop(), so the fix doesn't fire on normal shutdown | Medium | block/internal/syncing/syncer.go |
| Test should be table-driven (AGENTS.md guideline) | Low | raft_retriever_test.go |

The waitForMsgsLanded fix and the SetApplyCallback(nil) intent are both correct; the data race and the missing call site are the blocking concerns before merge.
Actionable comments posted: 1
🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)
Lines 42-61: Consider converting this to a table-driven test. The current case is good, but a table shape will make it easier to add stop-idempotency and start/stop-cycle variants without duplicating setup.
As per coding guidelines "Use table-driven tests in Go unit tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven test to cover multiple scenarios (current stop behavior, stop idempotency, start/stop cycles) without duplicating setup: create a slice of test cases each with a name and a sequence of actions (e.g., start, stop, stop again, start/stop cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via newRaftRetriever, call retriever.Start and retriever.Stop according to the case, then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep using require.NoError for Start and require assertions on callback length and nil/non-nil entries as in the original test. Ensure each case isolates state by creating new retriever and stubRaftNode within the loop.
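A minimal sketch of the table-driven shape the reviewer describes. The `stubRaftNode` and `raftRetriever` types here are simplified stand-ins for illustration; the real test would use the actual constructors and `require` assertions:

```go
package main

import "fmt"

// stubRaftNode is a simplified stand-in for the test double in
// raft_retriever_test.go; its behavior is an assumption for this sketch.
type stubRaftNode struct {
	recordedCallbacks []bool // true = callback set, false = cleared with nil
}

func (n *stubRaftNode) SetApplyCallback(cb func()) {
	n.recordedCallbacks = append(n.recordedCallbacks, cb != nil)
}

type raftRetriever struct{ node *stubRaftNode }

func (r *raftRetriever) Start() { r.node.SetApplyCallback(func() {}) }
func (r *raftRetriever) Stop()  { r.node.SetApplyCallback(nil) }

// runCases executes each table entry against a fresh stub and retriever,
// reporting whether the recorded callbacks matched expectations.
func runCases() map[string]bool {
	cases := []struct {
		name    string
		actions func(*raftRetriever)
		want    []bool
	}{
		{"stop clears callback", func(r *raftRetriever) { r.Start(); r.Stop() }, []bool{true, false}},
		{"stop is idempotent", func(r *raftRetriever) { r.Start(); r.Stop(); r.Stop() }, []bool{true, false, false}},
		{"start/stop cycle", func(r *raftRetriever) { r.Start(); r.Stop(); r.Start(); r.Stop() }, []bool{true, false, true, false}},
	}
	results := make(map[string]bool)
	for _, tc := range cases {
		node := &stubRaftNode{} // fresh state per case, as the guideline asks
		tc.actions(&raftRetriever{node: node})
		results[tc.name] = fmt.Sprint(node.recordedCallbacks) == fmt.Sprint(tc.want)
	}
	return results
}

func main() {
	for name, ok := range runCases() {
		fmt.Println(name, ok)
	}
}
```

Each case owns its action sequence and expected callback trace, so new variants are one-line additions instead of copied setup blocks.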
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@block/internal/syncing/raft_retriever.go`:
- Line 77: The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply
because Apply reads/sends on applyCh while the raft node may concurrently invoke
the callback; fix by adding a mutex to the raft node to guard access to the
apply callback: protect the callback field and its setter Get/SetApplyCallback
(or SetApplyCallback and any internal invocation sites) with the new mutex so
that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil
or changing callback mid-invocation; update the raft node's invocation path that
calls the callback (where Apply is invoked) to acquire the same mutex (or use a
read lock) when reading the callback and release it immediately after obtaining
the pointer, then call the callback outside the lock if needed to avoid
long-held locks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a
📒 Files selected for processing (3)
- block/internal/syncing/raft_retriever.go
- block/internal/syncing/raft_retriever_test.go
- pkg/raft/node.go
```go
	r.mtx.Unlock()

	r.wg.Wait()
	r.raftNode.SetApplyCallback(nil)
```
🧩 Analysis chain
🏁 Scripts executed:

```bash
#!/bin/bash
# Verify unsynchronized callback read/write sites in the raft node implementation.
rg -n -C3 'type FSM struct|applyCh|func \(n \*Node\) SetApplyCallback|func \(f \*FSM\) Apply' pkg/raft/node.go

# Check the actual raft_retriever Stop() implementation to understand what wg.Wait() is waiting for.
cat -n block/internal/syncing/raft_retriever.go | grep -A 15 "func (r \*raftRetriever) Stop"

# Check if there's synchronization in the raft node's Stop or shutdown methods.
rg -n -B3 -A10 "func \(n \*Node\) Stop" pkg/raft/node.go

# Check if the raft library itself (from the ethraft package) provides synchronization around FSM.Apply.
rg -n "applyCh" pkg/raft/node.go | head -20

# Check where raftNode is initialized and where Stop is called in context.
rg -n -B5 -A15 "func.*NewRaftRetriever|raftNode.*Stop\(\)" block/internal/syncing/raft_retriever.go

# Check if there's a parent Syncer that coordinates shutdown of both raftNode and raftRetriever.
rg -n -B3 -A8 "type Syncer struct|raftRetriever\)" block/internal/syncing/syncer.go | head -40

# Check the full Stop() implementation for any shutdown calls to raftNode before SetApplyCallback.
cat -n block/internal/syncing/raft_retriever.go | sed -n '60,90p'

# Find where raftRetriever.Stop() is called.
rg -n "raftRetriever.*Stop\(\)|\.Stop\(\)" block/internal/syncing/syncer.go | head -20

# Get the Syncer.Stop() method to see the full shutdown sequence.
rg -n -B3 -A20 "func \(s \*Syncer\) Stop" block/internal/syncing/syncer.go

# Check where r.raftNode is stopped or shut down.
rg -n "raftNode\.*Stop\|raftRetriever.*Stop" block/internal/syncing/raft_retriever.go
```
Synchronize callback access to prevent data race with FSM.Apply.
Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.
Suggested fix (in pkg/raft/node.go)
```diff
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
```
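The copy-under-read-lock pattern from the suggested fix can be exercised in isolation. This `fsm` type is a simplified stand-in for the one in pkg/raft/node.go, using a plain `chan int` in place of `RaftApplyMsg`:

```go
package main

import (
	"fmt"
	"sync"
)

// fsm is a simplified stand-in for the FSM in pkg/raft/node.go;
// the int channel replaces RaftApplyMsg for this sketch.
type fsm struct {
	mu      sync.RWMutex
	applyCh chan<- int
}

// setApplyCh mirrors SetApplyCallback: writes to the channel field
// happen only under the write lock.
func (f *fsm) setApplyCh(ch chan<- int) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.applyCh = ch
}

// apply copies the channel under a read lock, then sends outside the
// lock, so a concurrent setApplyCh(nil) can never be observed mid-send
// and the lock is never held across a channel operation.
func (f *fsm) apply(v int) bool {
	f.mu.RLock()
	ch := f.applyCh
	f.mu.RUnlock()
	if ch == nil {
		return false
	}
	select {
	case ch <- v:
		return true
	default:
		return false // drop rather than block the raft cluster
	}
}

func main() {
	ch := make(chan int, 1)
	f := &fsm{}
	f.setApplyCh(ch)
	fmt.Println(f.apply(7)) // delivered: callback is set and buffer has room
	f.setApplyCh(nil)
	fmt.Println(f.apply(8)) // dropped safely: callback was cleared
}
```

Calling the send on the copied channel outside the lock keeps lock hold times short, which is the trade-off the review prompt describes.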
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.92%   +0.24%
==========================================
  Files         120      120
  Lines       12635    12636       +1
==========================================
+ Hits         7793     7825      +32
+ Misses       3968     3936      -32
- Partials      874      875       +1
```
Flags with carried forward coverage won't be shown.
Summary
Testing
ok github.com/evstack/ev-node/block/internal/syncing (cached)
Closes #3229