fix: replication for data loaded by DFLY LOAD command #6740
Conversation
Pull request overview
Fixes replication correctness when data is introduced via DFLY LOAD (which bypasses journaling) by forcing replicas to fall back to a full resync, and adds regression coverage for both standalone and cluster mode.
Changes:
- Reset per-shard journal backlog/LSN after `DFLY LOAD` to invalidate partial sync offsets.
- Cancel active replica sessions after load so replicas reconnect and perform FULL SYNC.
- Add new integration tests validating replication correctness after snapshot load (standalone + cluster); a rough flow sketch follows.
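For orientation, here is a minimal sketch of the standalone scenario this fix targets, assuming the test suite's async client fixtures; the final consistency assertions are illustrative, not the PR's exact code:

```python
async def replicate_after_dfly_load(c_master, c_replica, master_port, dbfilename):
    # Seed the master and take a DF-format snapshot.
    await c_master.execute_command("DEBUG", "POPULATE", "1000", "key", "100", "RAND")
    await c_master.execute_command("SAVE", "DF", dbfilename)
    await c_master.execute_command("FLUSHALL")

    # Attach a replica while the master is empty.
    await c_replica.execute_command("REPLICAOF", "localhost", str(master_port))

    # DFLY LOAD bypasses the journal; the fix makes the master cancel
    # replica sessions so they reconnect and run a FULL SYNC.
    await c_master.execute_command("DFLY", "LOAD", f"{dbfilename}-summary.dfs")

    # After the forced full resync, both sides should see the loaded keys.
    assert await c_master.dbsize() == 1000
    assert await c_replica.dbsize() == 1000
```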
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/server/dflycmd.cc` | Forces replica full resync after DFLY LOAD by invalidating journal offsets and cancelling replica sessions. |
| `tests/dragonfly/replication_test.py` | Adds standalone regression test for replication after snapshot load. |
| `tests/dragonfly/cluster_test.py` | Adds cluster-mode regression test for replication after snapshot load. |
Comments suppressed due to low confidence (2)
tests/dragonfly/replication_test.py:4207
- The fixed `asyncio.sleep(0.5)` is a timing heuristic and can be flaky on slow/loaded CI machines. Prefer waiting on an observable condition (e.g., the replica disconnect/reconnect state transition) with a bounded timeout instead of a hardcoded sleep.

```python
# After DFLY LOAD, the master cancels all replicas to force a full resync.
# Wait for the replica to detect disconnection and complete the new full sync.
await asyncio.sleep(0.5)
await wait_for_replicas_state(c_replica)
```
tests/dragonfly/cluster_test.py:3840
- The fixed `asyncio.sleep(0.5)` can be flaky across environments. Prefer waiting on a concrete signal (role/state change, reconnect, or replica offset progression) with a bounded timeout rather than a hardcoded sleep.

```python
# After DFLY LOAD, the master cancels all replicas to force a full resync.
# Wait for the replica to detect disconnection and complete the new full sync.
await asyncio.sleep(0.5)
await wait_for_replicas_state(r1_node.client)
```
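One way to implement the bounded wait both comments suggest; a minimal sketch in which the helper name, timeout values, and the INFO-based predicate are assumptions for illustration, not code from this PR:

```python
import asyncio


async def wait_until(predicate, timeout=10.0, interval=0.1):
    # Poll an async predicate until it holds or the deadline expires,
    # instead of relying on a fixed sleep that can race on loaded CI.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if await predicate():
            return
        await asyncio.sleep(interval)
    raise TimeoutError("condition not met within timeout")


# Example predicate: the replica reports its master link as up again
# (the INFO field name is an assumption here).
async def replica_link_up(c_replica):
    info = await c_replica.info("replication")
    return info.get("master_link_status") == "up"
```

With such a helper, the fixed `await asyncio.sleep(0.5)` could become `await wait_until(lambda: replica_link_up(c_replica))`.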
| await c_master.execute_command("DEBUG", "POPULATE", "1000", "key", "100", "RAND") | ||
| assert await c_master.dbsize() == 1000 | ||
|
|
||
| await c_master.execute_command("SAVE", "DF", dbfilename) | ||
|
|
||
| await c_master.execute_command("FLUSHALL") | ||
|
|
||
| await c_replica.execute_command("REPLICAOF", "localhost", str(master.port)) | ||
| await wait_available_async(c_replica) | ||
|
|
||
| await c_master.execute_command("DFLY", "LOAD", f"{dbfilename}-summary.dfs") | ||
|
|
This test can be improved. The main issue is that it:
- Calls DEBUG POPULATE
- Saves a snapshot
- Flushes the datastore (so everything is empty)
- Calls REPLICAOF -> nothing gets replicated, since the datastore is empty
- Calls DFLY LOAD -> loads the new snapshot

Wouldn't it be better if:
- You actually had some data in both master and replica before loading a new snapshot
- You streamed data while trying to load the snapshot via DFLY LOAD (see the sketch after this list); this would trigger the bug I explained in the other comment
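A rough sketch of the streaming part of that suggestion, assuming a redis-py-style async client; the key names, write cadence, and function names are placeholders, not the PR's code:

```python
import asyncio


async def stream_writes(c_master, stop: asyncio.Event):
    # Journaled traffic that keeps flowing while the snapshot load runs.
    i = 0
    while not stop.is_set():
        await c_master.set(f"stream:{i}", "x")
        i += 1
        await asyncio.sleep(0.01)


async def load_under_traffic(c_master, dbfilename):
    stop = asyncio.Event()
    writer = asyncio.create_task(stream_writes(c_master, stop))
    try:
        # Interleaving journaled SETs with DFLY LOAD (which bypasses the
        # journal) is what should surface the replication bug described above.
        await c_master.execute_command("DFLY", "LOAD", f"{dbfilename}-summary.dfs")
    finally:
        stop.set()
        await writer
```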
@BorysTheDev what's the status of this PR?
This PR is incorrect. I have postponed this task, but I remember it.
Force-pushed ec72e08 to b7d467b
augment review
Force-pushed bc0119e to b9a9795
Force-pushed e3851fe to ac7eecf
```cpp
dfly_cmd_->CancelAllReplicas();
shard_set->RunBriefInParallel([](EngineShard* shard) {
  if (shard->journal())
    journal::StartInThreadAtLsn(1);
```
This is a bug. Even Copilot found it, and you resolved it 🤣 The correct fix here is to call `journal::ClearBuffer();` instead of resetting the LSN to 1.
There is no `journal::ClearBuffer()` in the code. Previously I did `journal::StartInThreadAtLsn(shard_journal->GetLsn() + 1);` and got a review comment asking to make it 1.
We can reduce the number of changed lines in this PR, plus some small fixes.
Force-pushed ac7eecf to 3365240
| LOG(INFO) << "Load finished, num keys read: " << aggregated_result->keys_read; | ||
|
|
||
| // Loaded data bypasses the journal, so force replicas into full sync. | ||
| dfly_cmd_->Shutdown(); |
Shutdown is a tricky name; it just cancels all replicas. But if it ever does something non-restorable, this code will break.
I wanted to use another name, but @kostasrim doesn't like this idea.
I agree, `DflyCmd::Shutdown` should be renamed to `BreakReplication` or `CancelReplicas` or something like that.
Force-pushed 3365240 to 8b41f62
fixes: #6739