
fix: replication for data loaded by DFLY LOAD command #6740

Merged
BorysTheDev merged 3 commits into main from fix_load_replication_in_cluster_mode on Apr 23, 2026

Conversation

Contributor

@BorysTheDev BorysTheDev commented Feb 25, 2026

fixes: #6739

@BorysTheDev BorysTheDev marked this pull request as ready for review February 27, 2026 16:05
Copilot AI review requested due to automatic review settings February 27, 2026 16:05

augmentcode Bot commented Feb 27, 2026

🤖 Augment PR Summary

Summary: Fixes replication correctness when importing a snapshot via DFLY LOAD by forcing connected replicas to re-sync.

Changes:

  • Exposes DflyCmd::CancelAllReplicas() and reuses it from shutdown.
  • After a successful ServerFamily::Load, restarts per-shard journals (dropping backlog) and cancels active replica sessions so they reconnect and full-sync.
  • Adds standalone + cluster-mode regression tests that load a saved snapshot while replicas are in stable sync and assert full state-hash equality.

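The failure mode described in the summary can be illustrated with a toy model (plain Python, no Dragonfly APIs; all names here are illustrative, not Dragonfly's actual implementation): normal writes flow through a journal that the replica tails, but a snapshot load bypasses the journal, so the replica silently diverges until the master forces a full sync.

```python
# Toy model of journal-based replication, as a sketch only.
master, replica, journal = {}, {}, []

def write(key, val):
    # Normal write path: mutate the master and append to the journal.
    master[key] = val
    journal.append((key, val))

def replicate():
    # The replica consumes journal entries (stable/partial sync).
    for key, val in journal:
        replica[key] = val
    journal.clear()

def dfly_load(snapshot):
    # Snapshot load bypasses the journal entirely.
    master.update(snapshot)

def full_sync():
    # The fix: cancel the replica session and re-send the full state.
    replica.clear()
    replica.update(master)

write("a", 1)
replicate()
dfly_load({"b": 2})
replicate()
assert replica != master  # replica missed the loaded data
full_sync()
assert replica == master  # forced full sync restores parity
```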


@augmentcode augmentcode Bot left a comment


Review completed. 2 suggestions posted.


Contributor

Copilot AI left a comment


Pull request overview

Fixes replication correctness when data is introduced via DFLY LOAD (which bypasses journaling) by forcing replicas to fall back to a full resync, and adds regression coverage for both standalone and cluster mode.

Changes:

  • Reset per-shard journal backlog/LSN after DFLY LOAD to invalidate partial sync offsets.
  • Cancel active replica sessions after load so replicas reconnect and perform FULL SYNC.
  • Add new integration tests validating replication correctness after snapshot load (standalone + cluster).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/server/dflycmd.cc: Forces replica full resync after DFLY LOAD by invalidating journal offsets and cancelling replica sessions.
  • tests/dragonfly/replication_test.py: Adds standalone regression test for replication after snapshot load.
  • tests/dragonfly/cluster_test.py: Adds cluster-mode regression test for replication after snapshot load.
Comments suppressed due to low confidence (2)

tests/dragonfly/replication_test.py:4207

  • The fixed asyncio.sleep(0.5) is a timing heuristic and can be flaky on slow/loaded CI machines. Prefer waiting on an observable condition (e.g., the replica disconnect/reconnect state transition) with a bounded timeout instead of a hardcoded sleep.
    # After DFLY LOAD, the master cancels all replicas to force a full resync.
    # Wait for the replica to detect disconnection and complete the new full sync.
    await asyncio.sleep(0.5)
    await wait_for_replicas_state(c_replica)

tests/dragonfly/cluster_test.py:3840

  • The fixed asyncio.sleep(0.5) can be flaky across environments. Prefer waiting on a concrete signal (role/state change, reconnect, or replica offset progression) with a bounded timeout rather than a hardcoded sleep.
    # After DFLY LOAD, the master cancels all replicas to force a full resync.
    # Wait for the replica to detect disconnection and complete the new full sync.
    await asyncio.sleep(0.5)
    await wait_for_replicas_state(r1_node.client)
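Both suggestions boil down to the same pattern: replace the fixed sleep with a bounded poll on an observable condition. A minimal sketch of such a helper (the `wait_until` name and the demo predicate are illustrative, not part of the test harness):

```python
import asyncio

async def wait_until(predicate, timeout=10.0, interval=0.05):
    # Poll an async `predicate` until it returns truthy; raise
    # TimeoutError if `timeout` seconds elapse first.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not await predicate():
        if loop.time() >= deadline:
            raise TimeoutError("condition not met within timeout")
        await asyncio.sleep(interval)

async def demo():
    state = {"full_sync_done": False}

    async def finish_sync():
        # Stand-in for the replica completing its new full sync.
        await asyncio.sleep(0.1)
        state["full_sync_done"] = True

    async def synced():
        return state["full_sync_done"]

    task = asyncio.create_task(finish_sync())
    await wait_until(synced, timeout=5.0)
    await task
    return state["full_sync_done"]
```

In the tests, the predicate could check the replica's reconnect/stable-sync state instead of a dictionary flag, keeping the wait bounded without depending on machine speed.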

Comment on lines +4190 to +4201
await c_master.execute_command("DEBUG", "POPULATE", "1000", "key", "100", "RAND")
assert await c_master.dbsize() == 1000

await c_master.execute_command("SAVE", "DF", dbfilename)

await c_master.execute_command("FLUSHALL")

await c_replica.execute_command("REPLICAOF", "localhost", str(master.port))
await wait_available_async(c_replica)

await c_master.execute_command("DFLY", "LOAD", f"{dbfilename}-summary.dfs")

Contributor


This test can be improved. The main issue is that it:

  1. Calls DEBUG POPULATE
  2. Calls SAVE
  3. Flushes the datastore (so everything is empty)
  4. Calls REPLICAOF -> nothing gets replicated, since the datastore is empty
  5. Calls DFLY LOAD -> loads the new snapshot

Wouldn't it be better if:

  1. You actually had some data in both master and replica before loading a new snapshot?
  2. You streamed data while loading the snapshot via DFLY LOAD? This would trigger the bug I explained in the other comment.

Collaborator

romange commented Apr 9, 2026

@BorysTheDev what's the status of this PR?

@BorysTheDev
Contributor Author

> @BorysTheDev what's the status of this PR?

This PR is incorrect as written. I have postponed this task, but I haven't forgotten it.

@BorysTheDev BorysTheDev force-pushed the fix_load_replication_in_cluster_mode branch from ec72e08 to b7d467b on April 10, 2026 10:33
@BorysTheDev BorysTheDev requested a review from Copilot April 10, 2026 10:45
@BorysTheDev
Contributor Author

augment review


@augmentcode augmentcode Bot left a comment


Review completed. 2 suggestions posted.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

@BorysTheDev BorysTheDev force-pushed the fix_load_replication_in_cluster_mode branch from bc0119e to b9a9795 on April 12, 2026 07:31
@BorysTheDev BorysTheDev force-pushed the fix_load_replication_in_cluster_mode branch 2 times, most recently from e3851fe to ac7eecf on April 16, 2026 10:40
@BorysTheDev BorysTheDev requested a review from dranikpg April 17, 2026 08:20
Comment on src/server/server_family.cc:
dfly_cmd_->CancelAllReplicas();
shard_set->RunBriefInParallel([](EngineShard* shard) {
  if (shard->journal())
    journal::StartInThreadAtLsn(1);
});
Contributor


This is a bug. Even Copilot found it, and you resolved it 🤣 The correct fix here is to call journal::ClearBuffer() instead of resetting the LSN to 1.

Contributor Author


There is no journal::ClearBuffer() in the code. And previously I did journal::StartInThreadAtLsn(shard_journal->GetLsn() + 1); but got a comment asking to make it 1.

@kostasrim
Contributor

We can reduce the number of changed lines in this PR, plus a few small fixes.

@BorysTheDev BorysTheDev force-pushed the fix_load_replication_in_cluster_mode branch from ac7eecf to 3365240 on April 22, 2026 08:24
Comment on src/server/server_family.cc:
LOG(INFO) << "Load finished, num keys read: " << aggregated_result->keys_read;

// Loaded data bypasses the journal, so force replicas into full sync.
dfly_cmd_->Shutdown();
Contributor


Shutdown is a tricky name; it just cancels all replicas. But if it ever does something non-restorable, this code will break.

Contributor Author


I wanted to use another name, but @kostasrim doesn't like this idea.

Collaborator


I agree DflyCmd::Shutdown should be renamed to BreakReplication or CancelReplicas or something like that

@BorysTheDev BorysTheDev force-pushed the fix_load_replication_in_cluster_mode branch from 3365240 to 8b41f62 on April 23, 2026 16:13
Contributor

@kostasrim kostasrim left a comment


💯

🚢 🇮🇹

@BorysTheDev BorysTheDev enabled auto-merge (squash) April 23, 2026 18:15
@BorysTheDev BorysTheDev merged commit ece10b0 into main Apr 23, 2026
19 of 33 checks passed
@BorysTheDev BorysTheDev deleted the fix_load_replication_in_cluster_mode branch April 23, 2026 18:17

Development

Successfully merging this pull request may close these issues.

support replication for data loaded by DFLY LOAD command

5 participants