[BFTree] Add RangeIndex cluster migration #1731

Open
tiagonapoli wants to merge 17 commits into dev from tiagonapoli/bftree-migration

@tiagonapoli (Collaborator) commented Apr 23, 2026

Summary

Adds end-to-end cluster migration for RangeIndex keys, supporting both MIGRATE SLOTS and MIGRATE KEYS paths. RangeIndex keys are backed by a native BfTree whose on-disk state (data.bftree) lives outside Tsavorite — shipping just the 51-byte stub is insufficient. This change streams the entire BfTree snapshot file alongside the stub over the existing migration transport.

Architecture

  • Serializer (RangeIndexChunkedSerializer): Pure state machine over Span<byte> — no I/O. Takes file data as input via MoveNext(dest, fileData, out consumed).
  • Migration Reader (RangeIndexMigrationReader): Async wrapper that reads the snapshot file and feeds bytes to the serializer.
  • Deserializer (RangeIndexChunkedDeserializer): Sync state machine that writes received file data to a temp file, validates xxHash64 checksum, recovers the native BfTree, and publishes the stub to the store.
  • Factory (RangeIndexManager.SnapshotRangeIndexAndCreateReader): Snapshots the BfTree under an exclusive lock to a temp file, then creates the reader.
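As a rough illustration of the "pure state machine, no I/O" design, the following Python sketch mimics a chunked serializer whose move_next takes a destination capacity and a window of file bytes, and reports how many file bytes it consumed. It is not the RangeIndexChunkedSerializer code. The class and function names, the little-endian length fields, and the 8-byte BLAKE2b digest used as a stand-in for xxHash64 (which is not in the Python standard library) are all assumptions made for the example. The element order follows the stream layout given under "Wire format".

```python
import hashlib
import struct


class ChunkedSerializer:
    """Hypothetical sketch: emits [keyLen][key][fileSize][file][hash][stubLen][stub]."""

    def __init__(self, key: bytes, file_size: int, stub: bytes):
        self._stub = stub
        self._file_remaining = file_size
        # Stand-in for xxHash64 (assumption): any 8-byte running digest works here.
        self._hash = hashlib.blake2b(digest_size=8)
        # Pending (payload, may_split) elements; only the key may span chunks.
        self._queue = [
            (struct.pack("<i", len(key)), False),
            (key, True),
            (struct.pack("<q", file_size), False),
        ]
        self._trailer_queued = False
        self.done = False

    def move_next(self, capacity: int, file_data: bytes):
        """Fill one chunk of at most `capacity` bytes; return (chunk, file bytes consumed)."""
        out = bytearray()
        consumed = 0
        while not self.done:
            room = capacity - len(out)
            if self._queue:
                payload, may_split = self._queue[0]
                if len(payload) <= room:
                    out += payload
                    self._queue.pop(0)
                elif may_split and room > 0:
                    out += payload[:room]
                    self._queue[0] = (payload[room:], True)
                    break
                elif not out:
                    raise ValueError("chunk too small for a fixed-size element")
                else:
                    break  # element must not be split; start it in the next chunk
            elif self._file_remaining > 0:
                n = min(room, self._file_remaining, len(file_data) - consumed)
                if n == 0:
                    break  # chunk is full, or the caller must supply more file bytes
                piece = file_data[consumed:consumed + n]
                out += piece
                self._hash.update(piece)
                consumed += n
                self._file_remaining -= n
            elif not self._trailer_queued:
                self._queue = [
                    (self._hash.digest(), False),
                    (struct.pack("<i", len(self._stub)), False),
                    (self._stub, False),
                ]
                self._trailer_queued = True
            else:
                self.done = True
        return bytes(out), consumed


def serialize_all(key, file_bytes, stub, capacity, read_size):
    """Drive the state machine the way an I/O wrapper might, `read_size` bytes at a time."""
    ser = ChunkedSerializer(key, len(file_bytes), stub)
    chunks, offset = [], 0
    while not ser.done:
        chunk, consumed = ser.move_next(capacity, file_bytes[offset:offset + read_size])
        offset += consumed
        chunks.append(chunk)
    return chunks
```

Because the state machine never touches a file or socket, a driver such as serialize_all can play the role of the async reader, and unit tests can feed it arbitrary chunk and read sizes to probe buffer boundaries.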

Wire format

A single record type, MigrationRecordSpanType.SerializedRangeIndexStream (tag 4), carries the stream. The logical stream, which may be split across chunks, is laid out as:

[4B keyLen][key bytes][8B fileSize][file bytes][8B xxHash64][4B stubLen][stub]

Key and file bytes may span chunks; all other elements must fit within a single chunk.
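To make the layout concrete, here is a minimal Python sketch that packs and unpacks one fully reassembled stream. The helper names are hypothetical, the little-endian byte order is an assumption (the description does not state endianness), and the checksum is passed in as an integer rather than computed, since xxHash64 is not part of the Python standard library. It does not model chunking.

```python
import struct


def encode_stream(key: bytes, file_bytes: bytes, checksum: int, stub: bytes) -> bytes:
    """Lay out [4B keyLen][key][8B fileSize][file][8B checksum][4B stubLen][stub]."""
    return b"".join([
        struct.pack("<i", len(key)), key,
        struct.pack("<q", len(file_bytes)), file_bytes,
        struct.pack("<Q", checksum),
        struct.pack("<i", len(stub)), stub,
    ])


def decode_stream(stream: bytes):
    """Parse a reassembled stream back into (key, file_bytes, checksum, stub)."""
    pos = 0

    def take(n: int) -> bytes:
        nonlocal pos
        if n < 0 or pos + n > len(stream):
            raise ValueError("truncated or malformed stream")
        piece = stream[pos:pos + n]
        pos += n
        return piece

    (key_len,) = struct.unpack("<i", take(4))
    key = take(key_len)
    (file_size,) = struct.unpack("<q", take(8))
    file_bytes = take(file_size)
    (checksum,) = struct.unpack("<Q", take(8))
    (stub_len,) = struct.unpack("<i", take(4))
    stub = take(stub_len)
    if pos != len(stream):
        raise ValueError("trailing bytes after stub")
    return key, file_bytes, checksum, stub
```

Under these assumptions the framing adds 24 bytes of fixed-size fields (4 + 8 + 8 + 4) on top of the key, file, and stub payloads.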

SLOTS path

  1. The scan detects RI keys via RecordType == 2 and captures them in MigrateOperation.RangeIndexes
  2. After scan completes, MigrateRangeIndexKeysAsync runs a sketch-protected batch cycle:
    • Add all RI keys to sketch (INITIALIZING)
    • TRANSMITTING + epoch barrier — blocks writes during snapshot+transmit
    • Transmit each key via TransmitRangeIndexAsync
    • DELETING + epoch barrier — blocks reads+writes during delete
    • Delete each key
    • finally: clear sketch (unblocks clients)
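The ordering of the batch cycle above can be sketched in Python as follows. This is an illustration of the control flow only, not the MigrateRangeIndexKeysAsync implementation. The Sketch class, the callback parameters, and the choice to stop before the delete phase when a transmit returns False are assumptions made for the example.

```python
from enum import Enum


class SketchStatus(Enum):
    INITIALIZING = 1
    TRANSMITTING = 2
    DELETING = 3


class Sketch:
    """Hypothetical stand-in for the migration sketch: a key set plus a status."""

    def __init__(self):
        self.keys = set()
        self.status = SketchStatus.INITIALIZING

    def clear(self):
        self.keys.clear()
        self.status = SketchStatus.INITIALIZING


def migrate_range_index_keys(keys, sketch, transmit, delete, epoch_barrier):
    """Run one batch cycle; return True only if every key was transmitted and deleted."""
    try:
        # Phase 1: register every RI key while the sketch is still INITIALIZING.
        sketch.status = SketchStatus.INITIALIZING
        sketch.keys.update(keys)

        # Phase 2: once the barrier completes, writers observe TRANSMITTING.
        sketch.status = SketchStatus.TRANSMITTING
        epoch_barrier()
        for key in keys:
            if not transmit(key):
                return False  # assumption: abort before deleting anything

        # Phase 3: once the barrier completes, readers and writers are blocked.
        sketch.status = SketchStatus.DELETING
        epoch_barrier()
        for key in keys:
            delete(key)
        return True
    finally:
        # Always clear the sketch so clients are unblocked, even on failure.
        sketch.clear()
```

Putting the cleanup in finally means that an early return or an exception in either phase still releases the keys.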

KEYS path

  1. GetRangeIndexKeysForMigration discovers RI keys via RIGET
  2. TransmitKeysAsync skips RI keys (in rangeIndexKeysToIgnore)
  3. Each RI key transmitted via TransmitRangeIndexAsync, then marked in sketch for DeleteKeysAsync
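The three KEYS-path steps can be pictured with the short Python sketch below. It is a simplified model with hypothetical function and parameter names. It assumes ordinary keys are sent before RangeIndex keys and that any failed transmit aborts the whole operation, neither of which the description states explicitly.

```python
def migrate_keys(sketch_keys, range_index_keys, transmit_key, transmit_range_index):
    """Return the keys marked for deletion, or None if any transmit fails."""
    marked_for_delete = []

    # Ordinary pass: RangeIndex keys are skipped, mirroring rangeIndexKeysToIgnore.
    for key in sketch_keys:
        if key in range_index_keys:
            continue
        if not transmit_key(key):
            return None
        marked_for_delete.append(key)

    # RangeIndex pass: each key goes through the dedicated snapshot-and-stream path
    # and is marked for the later delete step only after it was sent successfully.
    for key in sketch_keys:
        if key in range_index_keys:
            if not transmit_range_index(key):
                return None
            marked_for_delete.append(key)

    return marked_for_delete
```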

Bug fixes

  • Remove RI+cluster startup guard (GarnetServer.cs)
  • Fix round-trip migration: Publish deletes existing data file before move
  • Fix Publish registration: accept InPlaceUpdated/CopyUpdated status
  • Fix serializer empty-buffer bug in FileData phase
  • TransmitRangeIndexAsync catches all exceptions (never throws)
  • Sketch protection for RI keys in SLOTS path (previously unprotected)

Tests

  • 26 unit tests for serializer/deserializer (round-trip, checksums, error states, buffer boundaries, chunk sizes)
  • 11 cluster integration tests: SingleBySlot, ByKeys, ManyBySlot, WhileModifying, MigrateBack, LargeTree (1KB/256KB chunks), ChunkSize variants, StressAsync (Explicit)

TODO

  • Vector Set index keys have the same sketch protection gap (documented in code)
  • AOF replication of migrated trees to destination replicas

@tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 6 times, most recently from 61a467b to 3a6e52f on May 5, 2026 17:15
@tiagonapoli marked this pull request as ready for review on May 5, 2026 17:26
Copilot AI review requested due to automatic review settings on May 5, 2026 17:26
@tiagonapoli force-pushed the tiagonapoli/bftree-migration branch from 3a6e52f to dc78c02 on May 5, 2026 17:36
@tiagonapoli changed the title from "[BFTree] Add RangeIndex cluster migration" to "Add RangeIndex cluster migration with sketch protection" on May 5, 2026
@tiagonapoli changed the title from "Add RangeIndex cluster migration with sketch protection" to "[BFTree] Add RangeIndex cluster migration" on May 5, 2026
@tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 3 times, most recently from 74564ec to 74a1e44 on May 5, 2026 17:53
Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 3 times, most recently from 4851503 to 5754d5d on May 5, 2026 20:34
Migration plumbing:
- Chunked serializer (pure state machine, no I/O) + async MigrationReader wrapper
- Chunked deserializer with FileStream I/O for receiving migration data
- TransmitRangeIndexAsync source-side driver with configurable chunk size
- RangeIndexMigrationReceiveState receiver with state machine
- Remove redundant content-length prefix in migration wire format
- SnapshotRangeIndexAndCreateReader factory on RangeIndexManager

Bug fixes:
- Remove RI+cluster startup guard (GarnetServer.cs)
- Fix round-trip migration: Publish deletes existing data file
- Fix Publish registration: accept InPlaceUpdated/CopyUpdated
- Fix serializer empty-buffer bug in FileData phase
- Add missing buffer-too-small guard for KeyHeader phase
- TransmitRangeIndexAsync catches all exceptions (never throws)

Sketch protection (SLOTS path):
- All RI keys added to sketch in one batch
- TRANSMITTING epoch barrier blocks writes during snapshot+transmit
- DELETING epoch barrier blocks reads+writes during delete
- try/finally ensures sketch cleanup on failure
- TODO: Vector Sets have the same unprotected pattern

Sketch protection (KEYS path):
- RI keys already in sketch from user enumeration
- Mark transmitted keys for DeleteKeysAsync()

Tests:
- 26 unit tests for serializer/deserializer
- 11 cluster migration integration tests (SingleBySlot, ByKeys,
  ManyBySlot, WhileModifying, MigrateBack, LargeTree, ChunkSize,
  StressAsync)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tiagonapoli force-pushed the tiagonapoli/bftree-migration branch from 5754d5d to ac06150 on May 5, 2026 22:50
Tiago Napoli and others added 9 commits May 5, 2026 15:51
…of out int consumed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atedIndex

The deserializer now only exposes parsed data (Key, Stub, TempPath).
The store operations (file move, BfTree recovery, RMW, registration) live
in RangeIndexManager where they belong.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…operty

Deserializer now takes tempPath and optional ILogger directly instead of
the full RangeIndexManager. This completes the separation: the deserializer
is a pure parser with no knowledge of the manager's internals.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Close the FileStream as soon as fileBytesRemaining hits zero (ensuring
data is flushed to disk) rather than waiting for trailer data to arrive
in the same chunk. The trailer is now parsed in a separate state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test cases crafted from raw bytes (no real BfTree needed):
- FileDataExactlyFillsChunk_TrailerInNextChunk: file flush verified
- FileDataOneBytePerChunk: single byte file data chunks
- EmptyChunkDuringWaitingForTrailer: empty chunk accepted
- KeySplitAcrossTwoChunks: partial key in first chunk
- ZeroFileDataGoesDirectlyToTrailer: no file bytes
- CorruptedFileDataFailsChecksumInTrailer: checksum mismatch

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tiago Napoli and others added 7 commits May 5, 2026 17:03
- ClusterMigrateEmptyRangeIndex: creates RI key with no data, migrates,
  verifies RI.SET works on target after migration
- Log snapshot file size in SnapshotForMigration for debugging

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vazois (Contributor) commented May 6, 2026

Missing RangeIndex feature-enabled guards

This PR doesn't consistently check whether the RangeIndex feature is enabled (i.e., whether RangeIndexManager is non-null) before using it. Two spots are at risk of NullReferenceException:

1. Source side — MigrateSession.RangeIndex.cs:29

var rangeIndexManager = clusterProvider.storeWrapper.DefaultDatabase.RangeIndexManager;
// No null check before calling .SnapshotRangeIndexAndCreateReader(...)

The KEYS path correctly uses ?.GetRangeIndexKeysForMigration(...) with a fallback to an empty dictionary, but TransmitRangeIndexAsync assumes the manager is always available. If the SLOTS scan somehow discovers an RI key on a node where the feature is disabled, this will crash.

2. Receive side — RespClusterMigrateCommands.cs:181

if (!rangeIndexMigrationState.ProcessRecord(...))

rangeIndexMigrationState is set to null when RangeIndexManager is null (line 70 in ClusterSession.cs), but the dispatch at line 181 doesn't guard against that. If a receiving node has RangeIndex disabled and the sender transmits a SerializedRangeIndexStream record, this will throw.

Suggestion: Add a null check on the receive side (e.g., log a warning and skip/error the record), and consider a defensive null check in TransmitRangeIndexAsync as well.

}

- public async Task<bool> TransmitKeysAsync(Dictionary<byte[], byte[]> vectorSetKeysToIgnore)
+ public async Task<bool> TransmitKeysAsync(Dictionary<byte[], byte[]> vectorSetKeysToIgnore, Dictionary<byte[], byte[]> rangeIndexKeysToIgnore)

Is it possible to combine these two instead of having to maintain two dictionaries?


var delRes = localServerSession.BasicGarnetApi.DELETE(key);

session.logger?.LogDebug("Deleting RangeIndex {key} after migration: {delRes}", System.Text.Encoding.UTF8.GetString(key), delRes);

nit: the log message is confusing. It should say 'Deleted ...'; otherwise, move it before the actual delete.

@vazois (Contributor) left a comment

libs/cluster/Server/Migration/MigrateScanFunctions.cs:53 — maybe not for this PR, but can't we just get these values from an enum?


return true;
}
finally

Maybe catch and log a possible exception here, and make sure to re-throw it.

await WaitForConfigPropagationAsync().ConfigureAwait(false);

// Discover Vector Sets linked namespaces
var allKeys = migrateTask.sketch.Keys.Select(t => t.Item1.ToArray());

can we avoid the copy here and just iterate over the container?

@vazois (Contributor) left a comment

libs/cluster/Server/Migration/MigrateSessionKeys.cs:43: It seems really wasteful to maintain a separate dictionary just for skipping keys. We could store the key type inline and skip the key once we read it from the sketch list.

}

/// <summary>Reset state for the next key stream.</summary>
private void Reset()

Ensure that the reset does not race with back-to-back migration calls or with parallel sessions on different slots.

/// from one migration arrive on the same <see cref="ClusterSession"/>, guaranteeing
/// in-order delivery.
/// </remarks>
internal sealed class RangeIndexMigrationReceiveState : IDisposable

Since this implementation is specific to RangeIndex, can you rename the file to indicate that? The current filename MigrationReceiveSession.cs is too generic.
