SDSTOR-22424: redo destroy pg #435
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## stable/v4.x #435 +/- ##
==============================================
Coverage ? 53.72%
==============================================
Files ? 36
Lines ? 5392
Branches ? 678
==============================================
Hits ? 2897
Misses ? 2195
Partials ? 300
0b5e4a4 to
c686ce2
Compare
@@ -499,7 +503,7 @@ void ReplicationStateMachine::write_snapshot_obj(std::shared_ptr< homestore::sna
if (home_object_->pg_exists(pg_data->pg_id())) {
    LOGI("pg already exists, clean pg resources before snapshot, pg={} {}", pg_data->pg_id(), log_suffix);
    // Need to pause state machine before destroying the PG, if fail, let raft retry.
The comment is out of date, and we also don't have a branch that returns false as of now.
Let's remove this out-of-date comment after addressing the other comments on this PR.
c686ce2 to
41de75e
Compare
// error. but actually, this is not a problem. since before we starting pg_destroy in baseline resync,
// m_rd_sb->last_snapshot_lsn will be persisted upto the snapshot.get_last_log_idx(). then all the log less than
// or equal to m_rd_sb->last_snapshot_lsn will not be replayed or committed after recovery. so, the concern is
@xiaoxichen In the baseline resync case, before we start destroying the PG, we persist m_rd_sb->last_snapshot_lsn up to snapshot.get_last_log_idx(). raft_repl_dev#need_skip_processing then skips replaying all of those logs in the recovery path, so we will not hit the destroyed resources (pg_index_table, etc.). So in the BR case, pg_destroy does not need to wait for all appended logs to be committed.
so basically you reverted your changes
I don't get the point of this comment: you said that for BR there is no need to redo destroy PG, but we do it anyway.
The concern about log replay vs. destroy is not valid here: if we reach this point, log replay has already been done. If we want to record why waiting for log commit is not needed, better to rephrase this paragraph and move it to destroy_pg.
Similarly for L1043-1052: those lines explain the situations in which a PG can be destroyed. Better to move them to destroy_pg rather than keep them here, especially since we use the same action in the recovery path for all sources.
xiaoxichen
left a comment
LGTM aside from the inline comments cleanup.
Try using an LLM to polish the language a bit.
I need to rethink the implementation here; it seems I missed some repl_dev details.
41de75e to
a3118a7
Compare
Let me explain redo destroy pg in more detail. Basically, destroyed PGs fall into two cases:
1. RaftReplDev::leave() is not called, so m_rd_sb->destroy_pending = 0x0. Only BR belongs to this case. Here the repl_dev will be recovered and the log will be replayed. We need to do nothing to redo destroy pg, whether the PG state is alive or destroyed. The reason is that if the PG super blk exists, the first snapshot message (obj_id.shard_seq_num == 0) was not handled successfully and the crash happened before the PG super blk was destroyed. On recovery, the leader will resend the first snapshot message, and the follower will handle it again and call pg_destroy to redo the PG destroy.
2. RaftReplDev::leave() is called, so m_rd_sb->destroy_pending = 0x1. remove_member, destroy_repl_dev and destroy_raft_group all belong to this case. If m_rd_sb->destroy_pending = 0x1, this repl_dev will not be loaded and there is no log replay, so permenant_destroy will not see this repl_dev, but the PG resources are probably still there if a crash happened. So we need to redo the PG destroy to carefully reclaim them.
@xiaoxichen ptal
Is there any possibility that it is case 2 but the crash happened just before setting destroy_pending = 0x1?
Yes, of course, but it does not matter, since pg_destroy is always called after RaftReplDev::leave() has been called successfully (it is called in permenant_destroy in gc_repl_devs()#leave_group), and RaftReplDev::leave() is the only way to change destroy_pending to 0x1. Let's discuss case by case:
1. destroy_raft_group: will propose a log of type HS_CTRL_DESTROY, and
2. destroy_repl_dev (exit_pg): will call leave() directly. If a crash happens before destroy_pending is changed to 0x1, nothing happens, since all exit_pg does now is change destroy_pending to 0x1; no PG resource is destroyed there.
3. remove_from_cluster: will call leave() when receiving a RemovedFromCluster. If a crash happens before destroy_pending is changed to 0x1, then after recovery it will receive RemovedFromCluster again.
In summary, if it is case 2 but the crash happened just before setting destroy_pending = 0x1, the PG resources will not have been touched and everything will go well.
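To make the two cases in this thread concrete, here is a minimal sketch of the recovery-time decision. It is not the HomeObject code: `RecoveredPg` and `needs_local_redo_destroy` are hypothetical names, and the struct only mirrors the two facts discussed above (whether the PG super blk is still on disk, and the value of destroy_pending).

```cpp
// Hypothetical, simplified view of what recovery sees for one PG.
// These are illustration-only types, not the real HomeObject/HomeStore ones.
struct RecoveredPg {
    bool pg_superblk_exists; // the PG super blk is still on disk
    bool destroy_pending;    // mirrors m_rd_sb->destroy_pending (set only by RaftReplDev::leave())
};

// Returns true when recovery itself has to redo the PG destroy (case 2).
// In case 1 (destroy_pending == 0x0) the repl_dev is recovered normally and,
// for baseline resync, the leader resends the first snapshot message, which
// calls pg_destroy again, so nothing has to be done locally.
inline bool needs_local_redo_destroy(const RecoveredPg& pg) {
    return pg.pg_superblk_exists && pg.destroy_pending;
}
```

Under these assumptions, a crash just before destroy_pending is set leaves the flag at 0x0, so the function returns false and the PG resources are left untouched, which matches the summary above.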
Fix crash-recovery gap in PG destroy, handle stale destroyed PGs on restart
Extracted PG resource cleanup into a reusable destroy_pg_resource() helper and fixed two crash-recovery bugs:
Problem 1 — Stale destroyed PGs after crash:
When a repl_dev is marked destroyed (via leave()) but a crash occurs before pg_destroy cleans up the PG superblk, subsequent recovery finds a PG superblk with no corresponding repl_dev.
Previously this triggered a fatal error log and left the PG resources dangling.
Fix: on_pg_meta_blk_found now detects this case and tracks the pg_id in destoryed_stale_pgs_. On on_replica_restart, destroy_pg_resource is called for each stale PG to reclaim its chunks,
index table, and superblk.
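The Problem 1 fix can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: `StalePgTracker`, the `pg_id_t` alias, and the `reclaim` callback (standing in for destroy_pg_resource) are hypothetical, and the real handlers take different arguments.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

using pg_id_t = std::uint16_t; // assumed width, for illustration only

// Hypothetical helper that records PGs whose super blk was found without a
// matching repl_dev, and reclaims them once on restart.
class StalePgTracker {
public:
    // Called when a PG super blk is found during recovery.
    // Returns true if the PG was recorded as stale (no repl_dev for it).
    bool on_pg_meta_blk_found(pg_id_t pg_id, bool repl_dev_found) {
        if (repl_dev_found) { return false; }
        stale_pgs_.insert(pg_id);
        return true;
    }

    // Called on restart; `reclaim` stands in for destroy_pg_resource.
    // Returns the number of stale PGs handed to `reclaim`.
    template < typename ReclaimFn >
    std::size_t on_replica_restart(ReclaimFn&& reclaim) {
        std::size_t reclaimed = 0;
        for (auto const pg_id : stale_pgs_) {
            reclaim(pg_id);
            ++reclaimed;
        }
        stale_pgs_.clear(); // each stale PG is reclaimed at most once
        return reclaimed;
    }

private:
    std::set< pg_id_t > stale_pgs_;
};
```

Deferring the reclaim to restart, instead of doing it inside the meta blk callback, keeps the callback cheap and lets all stale PGs be handled in one place.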
Problem 2 — Crash mid-destroy during baseline resync:
If a crash occurs after pg_destroy marks the PG destroyed but before the superblk is removed, log replay on recovery could attempt to write to the destroyed index table. In practice this is safe
because m_rd_sb->last_snapshot_lsn is persisted before pg_destroy is triggered, so all logs at or before that LSN are skipped on recovery (see raft_repl_dev::need_skip_processing). Added a
guard in on_log_replay_done to return early when the PG is in DESTROYED state with a live repl_dev, letting the next snapshot message re-trigger pg_destroy.
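The two checks described for Problem 2 can be written down as small predicates. This is only a sketch: the free functions, the `PgState` enum, and the `lsn_t` alias are stand-ins for illustration, not the signatures of raft_repl_dev::need_skip_processing or on_log_replay_done.

```cpp
#include <cstdint>

using lsn_t = std::int64_t; // assumed type, for illustration only

// Stand-in for the skip check: a log entry at or before the persisted
// last_snapshot_lsn is neither replayed nor committed after recovery.
inline bool need_skip_processing(lsn_t lsn, lsn_t last_snapshot_lsn) {
    return lsn <= last_snapshot_lsn;
}

enum class PgState { ALIVE, DESTROYED };

// Stand-in for the guard added in on_log_replay_done: return early when the
// PG is already marked destroyed but its repl_dev is still alive, and let the
// next snapshot message re-trigger pg_destroy.
inline bool should_return_early_on_replay_done(PgState state, bool repl_dev_alive) {
    return state == PgState::DESTROYED && repl_dev_alive;
}
```

With last_snapshot_lsn persisted as 100, for example, entries 99 and 100 are skipped while entry 101 is processed.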
Additional fixes: