
Fix MultiDatabaseSaveInProgressTest #1767

Draft
badrishc wants to merge 5 commits into dev from badrishc/multidb-test-fix

Conversation

badrishc (Collaborator) commented May 4, 2026

The general BGSAVE path returned 'Background saving started' before its async helper had pause-locked any of the per-DB checkpoint locks. A subsequent BGSAVE <dbId> could therefore win the race against TryPauseCheckpoints(dbId) and succeed, instead of failing with 'ERR checkpoint already in progress'.

This caused MultiDatabaseTests.MultiDatabaseSaveInProgressTest to flake in CI. SingleDatabaseManager.TakeCheckpointAsync already does the right thing by pausing synchronously before returning, which is why the single-DB equivalent test (SeSaveInProgressTest) does not flake.

Fix:

  • Restructure MultiDatabaseManager.TakeCheckpointAsync(bool, ILogger, CancellationToken) so the synchronous portion now acquires databasesContentLock, multiDbCheckpointingLock (when multi-db), AND calls TryPauseCheckpoints(dbId) for every active DB before returning. Compactly store only successfully paused DB IDs in dbIdsToCheckpoint. Roll back any partially-acquired state on exception in the sync phase.
  • Add alreadyPaused parameter to TakeDatabasesCheckpointAsync so the new caller can skip the inline TryPauseCheckpoints. Add a catch fallback that resumes pre-paused DB IDs not yet handed off to per-DB helpers, preventing stranded locks.
  • Existing AOF-size-driven caller (TaskCheckpointBasedOnAofSizeLimitAsync) is unaffected; it uses the default alreadyPaused=false.

Regression test: MultiDatabaseGeneralSaveBlocksGeneralSaveTest verifies that a second general BGSAVE while one is in flight (multi-db) reliably returns 'ERR checkpoint already in progress' (multiDbCheckpointingLock is now held synchronously).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
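For readers unfamiliar with the locking scheme, the sketch below models the pause-before-return sequence described above. It is a minimal illustration using hypothetical names (CheckpointCoordinator, SemaphoreSlim stand-ins for the checkpoint locks), not Garnet's actual SingleWriterMultiReaderLock-based implementation.

```csharp
// Minimal sketch of the pause-before-return sequence, assuming SemaphoreSlim
// stand-ins for Garnet's checkpoint locks. Names like CheckpointCoordinator
// and RunCheckpointsAsync are hypothetical.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class CheckpointCoordinator
{
    readonly SemaphoreSlim multiDbCheckpointingLock = new(1, 1);
    readonly Dictionary<int, SemaphoreSlim> perDbLocks = new();

    public CheckpointCoordinator(IEnumerable<int> activeDbIds)
    {
        foreach (var id in activeDbIds)
            perDbLocks[id] = new SemaphoreSlim(1, 1);
    }

    // Everything below runs synchronously, BEFORE the caller replies
    // 'Background saving started', so a concurrent BGSAVE observes the
    // held locks and fails with 'ERR checkpoint already in progress'.
    public bool TryStartCheckpoint(out Task checkpointTask)
    {
        checkpointTask = Task.CompletedTask;
        if (!multiDbCheckpointingLock.Wait(0))
            return false;

        var paused = new List<int>();
        foreach (var (dbId, dbLock) in perDbLocks)
        {
            if (dbLock.Wait(0))
            {
                paused.Add(dbId);
                continue;
            }
            // Roll back partially acquired state so no lock is stranded.
            foreach (var id in paused)
                perDbLocks[id].Release();
            multiDbCheckpointingLock.Release();
            return false;
        }

        // Only now hand off to the async runner, which owns the releases.
        checkpointTask = RunCheckpointsAsync(paused);
        return true;
    }

    async Task RunCheckpointsAsync(List<int> pausedDbIds)
    {
        try
        {
            foreach (var dbId in pausedDbIds)
                await Task.Delay(10);   // stand-in for the per-DB checkpoint
        }
        finally
        {
            foreach (var dbId in pausedDbIds)
                perDbLocks[dbId].Release();
            multiDbCheckpointingLock.Release();
        }
    }
}
```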
Copilot AI review requested due to automatic review settings May 4, 2026 22:18
…sAsync

The test has been timing out in CI. Set an explicit 180s cancellation
timeout so the shared ClusterTestContext.cts is configured accordingly
and polling loops (BackOff(cts.Token)) can exit cleanly instead of
hanging until the test runner kills the process.

Matches existing convention in this file (lines 357, 1291).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
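As a rough illustration of that convention, here is a minimal sketch; BackOff is a stand-in for the test helper, and only the explicit 180s cancellation budget mirrors the change described above.

```csharp
// Sketch of a polling loop bounded by an explicit cancellation timeout,
// so a stuck condition fails fast instead of hanging the CI run.
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedPollingSketch
{
    static async Task WaitForConditionAsync(Func<bool> condition)
    {
        // Explicit 180s budget; BackOff observes the token and throws
        // OperationCanceledException once the budget is exhausted.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(180));

        while (!condition())
            await BackOff(cts.Token);
    }

    static Task BackOff(CancellationToken token) => Task.Delay(100, token);
}
```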
Copilot AI (Contributor) left a comment

Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Fixes a race in multi-database background checkpointing where a general BGSAVE could return before per-DB checkpoint locks were synchronously paused, causing MultiDatabaseSaveInProgressTest to flake.

Changes:

  • Restructures MultiDatabaseManager.TakeCheckpointAsync to synchronously acquire the multi-DB checkpoint lock and pause per-DB checkpoints before returning.
  • Extends TakeDatabasesCheckpointAsync with an alreadyPaused option and adds rollback logic to avoid stranded per-DB pause locks on early failure/exception.
  • Adds a regression test ensuring a second general BGSAVE reliably fails while a multi-DB checkpoint is in progress.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File: test/Garnet.test/MultiDatabaseTests.cs
  Adds regression coverage for concurrent general BGSAVE during an active multi-DB background save.
File: libs/server/Databases/MultiDatabaseManager.cs
  Moves critical lock acquisition/pause steps into the synchronous portion of checkpoint initiation and adds safer rollback paths.


GPT 5.5 review of the prior commit flagged that handing pre-paused DB
IDs through the shared instance fields dbIdsToCheckpoint and
checkpointTasks is unsafe: HandleDatabaseAdded reallocates both fields
without coordinating with multiDbCheckpointingLock, so a SELECT that
adds a new active DB between the synchronous pause phase and the async
helper resuming after Task.Yield could swap in a fresh zero-initialized
array. The async helper would then read default 0 entries as 'paused
DB IDs', leaking the lock on the actually-paused DBs and double-resuming
DB 0 (which spins forever in SingleWriterMultiReaderLock.WriteUnlock
when called on an unlocked lock).

Fix:
- Remove the shared dbIdsToCheckpoint and checkpointTasks instance
  fields and the matching reallocation block in HandleDatabaseAdded.
- TakeCheckpointAsync(general) now allocates a local pausedDbIds buffer
  and passes it explicitly to its async helper / TakeDatabasesCheckpointAsync.
- TaskCheckpointBasedOnAofSizeLimitAsync allocates a local 1-element
  buffer (the loop breaks after the first oversized AOF anyway).
- TakeDatabasesCheckpointAsync now takes (int[] dbIds, int dbIdsCount)
  and allocates its own checkpointTasks array of exactly dbIdsCount,
  explicitly assigning Task.CompletedTask in the skip branch so
  Task.WhenAll never sees a null.

These buffers are tiny and BGSAVE / AOF-size-driven checkpoints are not
hot paths, so per-operation allocation is acceptable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
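To make the before/after concrete, a compressed sketch of the local-buffer handoff follows; the type and member names are hypothetical stand-ins, not the real MultiDatabaseManager members.

```csharp
// Sketch of the local-buffer handoff; names are hypothetical stand-ins,
// not the actual MultiDatabaseManager members.
using System.Threading.Tasks;

public sealed class LocalBufferCheckpointSketch
{
    // Previously: a shared instance field (e.g. dbIdsToCheckpoint) that
    // HandleDatabaseAdded could reallocate while the async helper still
    // held a reference, exposing a fresh zero-initialized array.
    //
    // Now: the caller allocates a local buffer and passes it explicitly,
    // so nothing can swap the array out from under the helper.
    public Task StartCheckpointAsync(int[] activeDbIds)
    {
        var pausedDbIds = new int[activeDbIds.Length];
        var count = 0;
        foreach (var dbId in activeDbIds)
            if (TryPauseCheckpoints(dbId))
                pausedDbIds[count++] = dbId;

        return RunPausedCheckpointsAsync(pausedDbIds, count);
    }

    async Task RunPausedCheckpointsAsync(int[] dbIds, int dbIdsCount)
    {
        // Exactly dbIdsCount entries; a skipped DB still gets
        // Task.CompletedTask so Task.WhenAll never observes a null element.
        var checkpointTasks = new Task[dbIdsCount];
        for (var i = 0; i < dbIdsCount; i++)
            checkpointTasks[i] = ShouldSkip(dbIds[i])
                ? Task.CompletedTask
                : CheckpointDatabaseAsync(dbIds[i]);
        await Task.WhenAll(checkpointTasks);
    }

    static bool ShouldSkip(int dbId) => false;                       // stand-in
    static bool TryPauseCheckpoints(int dbId) => true;               // stand-in
    static Task CheckpointDatabaseAsync(int dbId) => Task.Delay(10); // stand-in
}
```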
badrishc and others added 2 commits May 4, 2026 16:38
1. Test variable naming clarity (was 'db1=GetDatabase(0)' / 'db2=GetDatabase(1)';
   now 'db0' / 'db1' to match the underlying Redis db index, and updated comment).

2. Assert.Throws<T>(action, message) treats the second argument as a *failure*
   message, not the expected exception message - it never validated that the
   server returned 'ERR checkpoint already in progress'. Capture the exception
   and assert on Message explicitly.

3. LASTSAVE wait loop hardened: capture baseline as long (no 2038 truncation),
   wait for advance past baseline, and add a 30s bounded timeout with a final
   ClassicAssert.Greater so a hang fails the test instead of stalling CI.

4/5. Add 'contentLockAlreadyHeld' parameter to TakeDatabasesCheckpointAsync so
    callers that already hold databasesContentLock as a reader skip the
    redundant nested re-acquisition. The lock is reentrant for readers (just
    a counter), so this was correct but redundant work; both call sites
    (TakeCheckpointAsync general and TaskCheckpointBasedOnAofSizeLimitAsync)
    now pass contentLockAlreadyHeld: true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
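Point 2 above is an easy NUnit pitfall, so here is an illustrative before/after; the exception type and assertion helpers are assumptions for the sketch, not copied from the actual test.

```csharp
// Illustrative only: shows why Assert.Throws<T>(code, message) does not
// check the exception text, and how to assert on Message explicitly.
// The exception type and helpers are assumed, not taken from the test.
using System;
using NUnit.Framework;
using NUnit.Framework.Legacy;

[TestFixture]
public class SaveInProgressAssertionSketch
{
    [Test]
    public void SecondBgSaveIsRejectedWithExpectedError()
    {
        // BEFORE: the string here is NUnit's *failure* message, so the
        // expected server error text was never actually validated.
        // Assert.Throws<InvalidOperationException>(IssueSecondBgSave,
        //     "ERR checkpoint already in progress");

        // AFTER: capture the exception and assert on Message explicitly.
        var ex = Assert.Throws<InvalidOperationException>(IssueSecondBgSave);
        ClassicAssert.AreEqual("ERR checkpoint already in progress", ex.Message);
    }

    // Stand-in for issuing a second general BGSAVE against the test server.
    static void IssueSecondBgSave() =>
        throw new InvalidOperationException("ERR checkpoint already in progress");
}
```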
Per the cleanup plan in plan.md, restructure the checkpoint code to remove
the accumulated complexity from the previous fix:

- Drop the alreadyPaused and contentLockAlreadyHeld parameters by adopting
  a single convention: the caller is responsible for synchronously
  acquiring databasesContentLock (read), multiDbCheckpointingLock (write,
  if multi-db), and pausing per-DB checkpoint locks before handing off
  to the shared async runner.

- Replace the dual-flag TakeDatabasesCheckpointAsync helper with two
  small, single-purpose runners:
    RunPausedCheckpointsAndReleaseLocksAsync(pausedDbIds, count,
      multiDbLockHeld, …) — used by the background-capable entry points
      (general BGSAVE, per-DB BGSAVE), runs all pre-paused per-DB
      checkpoints in parallel and releases the outer locks in finally.
    RunPausedCheckpointAsync(db, dbId, …) — single per-DB checkpoint +
      LASTSAVE update + per-DB lock resume; used by AOF-size-driven path.

- Per-DB BGSAVE (TakeCheckpointAsync(bool, int, …)) and
  TakeOnDemandCheckpointAsync now also take databasesContentLock as a
  reader. Without this, a concurrent swap-db can move the GarnetDatabase
  out from under an in-flight per-DB checkpoint: UpdateLastSaveData looks
  up databases.Map[dbId] at write time and would record LASTSAVE on the
  swapped wrapper, and a second BGSAVE for the same dbId would race
  against the in-flight checkpoint because the swapped slot has a fresh
  CheckpointingLock.

- TakeOnDemandCheckpointAsync also now guards ResumeCheckpoints behind
  the checkpointsPaused flag - otherwise it would unconditionally
  WriteUnlock a per-DB CheckpointingLock that TryPauseCheckpoints had
  refused to acquire (corrupting the lock in debug, spinning forever in
  release).

Net diff: -52 lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
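The last point is the subtle one; the sketch below shows the guarded resume, with a SemaphoreSlim standing in for the per-DB CheckpointingLock and the early return on contention assumed for brevity.

```csharp
// Sketch of the guarded resume; SemaphoreSlim stands in for the per-DB
// CheckpointingLock, and the early return on contention is an assumption.
using System.Threading;
using System.Threading.Tasks;

public sealed class GuardedResumeSketch
{
    readonly SemaphoreSlim checkpointingLock = new(1, 1);

    public async Task TakeOnDemandCheckpointAsync()
    {
        // Record whether THIS call paused checkpoints; another checkpoint
        // may already hold the lock, in which case we must not release it.
        var checkpointsPaused = checkpointingLock.Wait(0);
        try
        {
            if (!checkpointsPaused)
                return;                // a checkpoint is already in flight

            await Task.Delay(10);      // stand-in for the checkpoint work
        }
        finally
        {
            // Unconditionally releasing here would unlock a lock this call
            // never acquired (the bug described in the last bullet above).
            if (checkpointsPaused)
                checkpointingLock.Release();
        }
    }
}
```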
@badrishc badrishc marked this pull request as draft May 5, 2026 18:06