
Refactor disk-based replication checkpoint shipping and safe hlog segment truncation #1773

Draft
vazois wants to merge 21 commits into dev from vazois/fsync-ref

vazois commented May 6, 2026

Summary

Refactors the checkpoint shipping pipeline for disk-based replication, introducing clean abstractions for reading and transmitting Tsavorite checkpoint data. Also adds safe hybrid log segment truncation to prevent deletion of segments actively being read by syncing replicas.

Key changes

Checkpoint shipping abstractions (send side):

  • ISnapshotReader / ISnapshotTransmitSource / ISnapshotDataSource interfaces
  • TsavoriteSnapshotReader / TsavoriteCheckpointReader for reading checkpoint files
  • FileTransmitSource / TsavoriteMetadataTransmitSource for transmitting data
  • SnapshotTransmissionDriver orchestrating the send pipeline
  • ReplicaSyncSession refactored to use the new abstractions
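One way to picture how the send-side pieces compose is the minimal Python sketch below. It assumes a reader simply enumerates the data sources for one checkpoint token, and that the driver emits one SNAPSHOT_DATA-style message per chunk followed by an empty end-of-stream message. All class and method names are hypothetical Python stand-ins for the C# types listed above; the real Garnet interfaces may expose different members.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical message shape: (token, file_type, start_address, data),
# mirroring CLUSTER SNAPSHOT_DATA <token> <type> <startAddress> <data>.
Message = Tuple[str, str, int, bytes]


@dataclass
class FileDataSource:
    """Stand-in for a data source over one checkpoint file (held in memory here)."""
    file_type: str
    payload: bytes

    def chunks(self, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
        # Yield (start_address, data) pairs covering the whole payload.
        for offset in range(0, len(self.payload), chunk_size):
            yield offset, self.payload[offset:offset + chunk_size]


class SnapshotReader:
    """Enumerates the data sources that make up one checkpoint token."""

    def __init__(self, token: str, sources: List[FileDataSource]):
        self.token = token
        self.sources = sources

    def __iter__(self) -> Iterator[FileDataSource]:
        return iter(self.sources)


class SnapshotTransmissionDriver:
    """Drives every reader and streams each of its sources in order."""

    def __init__(self, send: Callable[[Message], None], chunk_size: int = 4):
        self.send = send
        self.chunk_size = chunk_size

    def run(self, readers: Iterable[SnapshotReader]) -> None:
        for reader in readers:
            for source in reader:
                for start, data in source.chunks(self.chunk_size):
                    self.send((reader.token, source.file_type, start, data))
                # Empty payload marks end-of-stream; using the file length as its
                # start address is an assumption of this sketch.
                self.send((reader.token, source.file_type, len(source.payload), b""))


# Usage: "HLOG" is a hypothetical file-type label.
sent: List[Message] = []
driver = SnapshotTransmissionDriver(sent.append, chunk_size=4)
driver.run([SnapshotReader("t1", [FileDataSource("HLOG", b"abcdef")])])
```

Keeping enumeration (reader), chunking (source), and sending (driver) separate is what lets a session such as ReplicaSyncSession register new artifact types without touching the transmit loop.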

Checkpoint shipping abstractions (receive side):

  • ISnapshotDataSink interface
  • FileDataSink / MetadataDataSink implementations
  • Unified ReceiveCheckpointHandler with ProcessSnapshotData entry point
  • Unified CLUSTER SNAPSHOT_DATA command (previous per-type commands are deprecated but not removed)

Safe hlog segment truncation (PerformInternalCleanup):

  • Added PerformInternalCleanup property to ICheckpointManager interface
  • GarnetClusterCheckpointManager overrides it as false, so Tsavorite skips internal cleanup
  • CheckpointStore.DeleteOutdatedCheckpoints() calls ShiftBeginAddress with the oldest active checkpoint's begin address as the safe truncation boundary
  • Ensures hlog segments are not deleted while replicas are actively reading them
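The truncation rule can be illustrated with a small Python sketch. It assumes each checkpoint entry records the hlog begin address it depends on plus a count of replicas currently reading it. Older entries are dropped only when no replica holds them, and the log is shifted only up to the oldest surviving entry's begin address. `CheckpointEntry`, `readers`, and `log_begin_address` are hypothetical names; `log_begin_address` merely stands in for the effect of ShiftBeginAddress.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckpointEntry:
    token: str
    begin_address: int  # hlog begin address this checkpoint depends on
    readers: int = 0    # replicas currently streaming this checkpoint (hypothetical)


@dataclass
class CheckpointStore:
    entries: List[CheckpointEntry] = field(default_factory=list)  # oldest first
    log_begin_address: int = 0  # stand-in for the hlog BeginAddress

    def delete_outdated_checkpoints(self) -> int:
        if not self.entries:
            return self.log_begin_address
        latest = self.entries[-1]
        # Older checkpoints survive only while some replica is still reading them.
        self.entries = [e for e in self.entries[:-1] if e.readers > 0] + [latest]
        # The oldest surviving checkpoint bounds how far the log may be truncated.
        safe_boundary = min(e.begin_address for e in self.entries)
        if safe_boundary > self.log_begin_address:
            self.log_begin_address = safe_boundary  # analogue of ShiftBeginAddress
        return self.log_begin_address


# Usage: checkpoint "a" is still being read by a syncing replica.
store = CheckpointStore(entries=[
    CheckpointEntry("a", 100, readers=1),
    CheckpointEntry("b", 200),
    CheckpointEntry("c", 300),
])
boundary = store.delete_outdated_checkpoints()
```

Here "b" is removed, but the log is only truncated to address 100 because the replica reading "a" still needs those segments.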

RangeIndexManager refactoring:

  • Moved AOF replication methods to RangeIndexManager.Replication.cs partial file

Testing

  • Added ClusterReplicationHlogSegmentCleanupTest validating hlog segment truncation during concurrent replica sync (25+ stable runs)
  • All existing replication tests pass

vazois and others added 21 commits April 27, 2026 16:27
Consolidate file segment and metadata transmission into a single
CLUSTER SNAPSHOT_DATA <token> <type> <startAddress> <data> command.
A startAddress of -1 signals a single-message payload (e.g., metadata)
committed directly. Any other startAddress indicates a streamed file
segment where empty data signals end-of-stream.
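The dispatch rule described above can be sketched on the receive side in Python. This sketch assumes one sink per (token, type) pair, with chunks written at their start address. The names `process_snapshot_data`, `FileDataSink`, and `MetadataDataSink` loosely mirror the C# types in this PR but are hypothetical Python stand-ins, as are the file-type labels in the usage lines.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (token, file_type)


class FileDataSink:
    """Collects a streamed file; an in-memory buffer stands in for the on-disk file."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.complete = False

    def write(self, start: int, data: bytes) -> None:
        end = start + len(data)
        if len(self.buffer) < end:
            self.buffer.extend(b"\x00" * (end - len(self.buffer)))
        self.buffer[start:end] = data  # place the chunk at its start address

    def finish(self) -> None:
        self.complete = True


class MetadataDataSink:
    """Receives a single-message payload and commits it at once."""

    def __init__(self) -> None:
        self.committed: Optional[bytes] = None

    def commit(self, data: bytes) -> None:
        self.committed = bytes(data)


class ReceiveCheckpointHandler:
    def __init__(self) -> None:
        self.files: Dict[Key, FileDataSink] = {}
        self.metadata: Dict[Key, MetadataDataSink] = {}

    def process_snapshot_data(self, token: str, file_type: str,
                              start_address: int, data: bytes) -> None:
        key = (token, file_type)
        if start_address == -1:
            # Single-message payload (e.g. metadata): commit directly.
            self.metadata.setdefault(key, MetadataDataSink()).commit(data)
            return
        sink = self.files.setdefault(key, FileDataSink())
        if len(data) == 0:
            sink.finish()  # empty data signals end-of-stream
        else:
            sink.write(start_address, data)


# Usage with hypothetical file-type labels.
handler = ReceiveCheckpointHandler()
handler.process_snapshot_data("t1", "HLOG", 0, b"abcd")
handler.process_snapshot_data("t1", "HLOG", 4, b"ef")
handler.process_snapshot_data("t1", "HLOG", 6, b"")
handler.process_snapshot_data("t1", "INDEX_META", -1, b"meta")
```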

The previous CLUSTER SEND_CKPT_FILE_SEGMENT and SEND_CKPT_METADATA
commands are not removed but are deprecated in favor of SNAPSHOT_DATA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ship BfTree snapshot files (snapshot.{token}.bftree) during checkpoint
synchronization to replicas. This enables replicas to lazily restore
RangeIndex trees from checkpoint snapshots after recovery.

New types following the existing ISnapshotDataSource/ISnapshotTransmitSource/
ISnapshotReader pattern:
- RangeIndexFileDataSource: reads .bftree files via FileStream
- RangeIndexFileTransmitSource: sends chunks with per-file key hash header
- RangeIndexCheckpointReader: enumerates snapshot files for a checkpoint token
- RangeIndexFileSink: writes received .bftree data to disk on replicas

Wire protocol: for each RINDEX_SNAPSHOT file, a header message (startAddress=-1)
carries the 32-char hex key hash directory name, followed by file data chunks,
followed by an empty EOT packet. Receiver validates key hash format.
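The per-file RINDEX_SNAPSHOT message sequence can be sketched in Python as follows. This sketch assumes the key hash is accepted in either hex case, and that the EOT packet carries the file length as its start address (the commit message does not specify this). `is_valid_key_hash` and `rindex_messages` are hypothetical helpers, not functions from this PR.

```python
import re
from typing import Iterator, Tuple

# 32 hex characters; accepting both cases is an assumption of this sketch.
KEY_HASH_RE = re.compile(r"[0-9a-fA-F]{32}")


def is_valid_key_hash(name: str) -> bool:
    return KEY_HASH_RE.fullmatch(name) is not None


def rindex_messages(key_hash: str, payload: bytes,
                    chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (start_address, data) messages for one .bftree snapshot file."""
    if not is_valid_key_hash(key_hash):
        raise ValueError(f"invalid key hash directory name: {key_hash!r}")
    # Header: startAddress = -1 carries the key hash directory name.
    yield -1, key_hash.encode("ascii")
    # File data chunks at their byte offsets.
    for offset in range(0, len(payload), chunk_size):
        yield offset, payload[offset:offset + chunk_size]
    # Empty EOT packet; its start address here is an assumption.
    yield len(payload), b""


msgs = list(rindex_messages("0123456789abcdef" * 2, b"abcde", chunk_size=2))
```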

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cation

Add cluster-aware purging to PurgeOldCheckpointSnapshots so that in
cluster mode, snapshot deletion is deferred to CheckpointStore which
verifies no active readers hold the checkpoint entry. CheckpointStore
now calls PurgeOldCheckpointSnapshots with enforceClusterSafety:true
after confirming reader safety, both in DeleteOutdatedCheckpoints and
PurgeAllCheckpointsExceptEntry.

- Add clusterEnabled field to RangeIndexManager constructor
- Add enforceClusterSafety parameter to PurgeOldCheckpointSnapshots
- Wire CheckpointStore to purge BfTree snapshots alongside HLOG/index
- Pass clusterEnabled from GarnetServer to RangeIndexManager

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t truncation

Introduce PerformInternalCleanup property on ICheckpointManager to control
whether Tsavorite performs internal cleanup of checkpoint snapshot files and
hybrid log segments during the checkpoint state machine. When false, the
external layer (cluster mode) manages cleanup with reader-safety checks.

- Add PerformInternalCleanup to ICheckpointManager interface
- Add virtual property to DeviceLogCommitCheckpointManager (default: true)
- Override as false in GarnetClusterCheckpointManager
- Guard CleanupLogCheckpoint/CleanupIndexCheckpoint in Checkpoint.cs
- Activate safe ShiftBeginAddress in CheckpointStore.DeleteOutdatedCheckpoints
  using the oldest active checkpoint's begin address as the truncation boundary
- Add ClusterReplicationHlogSegmentCleanupTest to validate hlog segment
  truncation does not interfere with concurrent replica sync
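The PerformInternalCleanup switch amounts to a base-class default of true that the cluster checkpoint manager overrides to false, with Tsavorite's checkpoint path skipping its own cleanup when the flag is off. The Python sketch below models that. `perform_internal_cleanup` and `finish_checkpoint` are hypothetical names; the cleanup calls are recorded as strings instead of deleting files.

```python
from typing import List


class DeviceLogCommitCheckpointManager:
    @property
    def perform_internal_cleanup(self) -> bool:
        return True  # default: Tsavorite cleans up old checkpoints itself


class GarnetClusterCheckpointManager(DeviceLogCommitCheckpointManager):
    @property
    def perform_internal_cleanup(self) -> bool:
        return False  # cluster mode: CheckpointStore cleans up with reader-safety checks


def finish_checkpoint(manager: DeviceLogCommitCheckpointManager) -> List[str]:
    """Hypothetical end of the checkpoint state machine; returns cleanup steps run."""
    actions: List[str] = []
    if manager.perform_internal_cleanup:
        actions.append("CleanupLogCheckpoint")
        actions.append("CleanupIndexCheckpoint")
    return actions


standalone = finish_checkpoint(DeviceLogCommitCheckpointManager())
cluster = finish_checkpoint(GarnetClusterCheckpointManager())
```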

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the ISnapshotReader and ISnapshotTransmitSource implementations for
RangeIndex BfTree checkpoints, along with the receive-side sink and helper
methods for enumerating/purging BfTree snapshot files during replication.

Deleted:
- RangeIndexCheckpointReader.cs
- RangeIndexFileTransmitSource.cs
- RangeIndexFileSink.cs

Cleaned up:
- CheckpointFileType: removed RINDEX_SNAPSHOT enum value
- CheckpointStore: removed PurgeOldCheckpointSnapshots calls
- ReceiveCheckpointHandler: removed RINDEX_SNAPSHOT handling
- ReplicaSyncSession: removed RangeIndex reader registration
- RangeIndexManager.cs: moved replication methods to partial file
- StoreWrapper: removed public RangeIndexManager property
- GarnetServer: reverted clusterEnabled parameter

The RangeIndexManager.Replication.cs partial file is retained as the
separation of AOF replication methods remains useful.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>