Skip to content

fix(storage): markSnapshotAsVolume before delete origin volume bits#4014

Open
MatheMatrix wants to merge 1 commit into
4.8.36from
sync/zstackio/TIC-5745
Open

fix(storage): markSnapshotAsVolume before delete origin volume bits#4014
MatheMatrix wants to merge 1 commit into
4.8.36from
sync/zstackio/TIC-5745

Conversation

@MatheMatrix
Copy link
Copy Markdown
Owner

问题

ZSTAC-76704 / TIC-5745:APIUndoSnapshotCreationMsg 在 blockCommit 数据面成功后,删除源云盘底层数据(delete-origin-volume-bits)失败会触发控制面回滚,导致管理面 installPath 仍指向已被 commit 掉的旧路径 —— 控制面/数据面不一致,后续快照删除报 unable to find volume

根因

undo 流程中 delete-origin-volume-bits 排在 update-db-install-path(MarkSnapshotAsVolume) 之前。删除底层数据失败 → trigger.fail → MarkSnapshotAsVolume 没执行 → 卷安装路径未更新。

修复

backport 5.4.0 MR 8211 到 4.8.36:

  • MarkSnapshotAsVolume(update-db-install-path) 调到 delete-origin-volume-bits 之前
  • 删除底层数据失败时不再 trigger.fail,改为提交 PrimaryStorageDeleteBitGCtrigger.next() 继续

测试

新增 UndoSnapshotCreationDeleteBitsFailureCase 集成测试,注入 delete-bits 失败后断言:

  1. undo API 仍成功
  2. 卷 installPath 更新到新(已 commit)路径
  3. flow 顺序:delete-bits 被调用时 DB installPath 已是新路径(MarkSnapshotAsVolume 先于 delete)
  4. 删除失败只提交 PrimaryStorageDeleteBitGC job

./runMavenProfile premium 全量编译通过;mvn test -Dtest=UndoSnapshotCreationDeleteBitsFailureCaseTests run: 1, Failures: 0, Errors: 0

Closes ZSTAC-76704

sync from gitlab !9907

ZSTAC-76704 / TIC-5745: in APIUndoSnapshotCreationMsg, after blockCommit
succeeds on the data plane, a failure when deleting the origin volume
bits aborted the undo flow and rolled back, leaving the management-plane
install path pointing at the origin path that was already committed away
on the data plane (control/data plane inconsistency).

Re-order the undo flows so MarkSnapshotAsVolume (update-db-install-path)
runs before delete-origin-volume-bits, and on delete failure submit a
PrimaryStorageDeleteBitGC job instead of failing the whole undo.

Backport of 5.4.0 MR 8211 to 4.8.36.

Add UndoSnapshotCreationDeleteBitsFailureCase covering: undo still
succeeds on delete failure, volume install path updated, flow ordering,
and GC job submission.

Change-Id: Ia82c3126bf4ad5815584f6ceaaee42beb8d25b38
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Review Change Stack

变更概览

在快照撤销流程中,调整了步骤执行顺序:先执行 MarkSnapshotAsVolumeMsg 将快照标记为新卷,再执行 DeleteVolumeBitsOnPrimaryStorageMsg 删除源卷位数据。当删除失败且错误码为 VOLUME_IN_USE 时,改为提交异步垃圾回收任务而非直接失败,从而确保撤销流程仍可完整完成。新增集成测试验证该场景。

变更详情

快照撤销流程的弹性删除与垃圾回收

层 / 文件 概要
垃圾回收基础设施导入
storage/src/main/java/org/zstack/storage/volume/VolumeBase.java
新增 PrimaryStorageDeleteBitGCPrimaryStorageGlobalConfigTimeUnit 导入,为删除失败时的异步 GC 调度做准备。
调序后的 undoSnapShotCreation 流程与弹性删除位数据
storage/src/main/java/org/zstack/storage/volume/VolumeBase.java
重新调序 NoRollbackFlow 的两个步骤:MarkSnapshotAsVolumeMsg 优先执行,将快照标记为新卷并在数据库中提交新的安装路径;随后执行 DeleteVolumeBitsOnPrimaryStorageMsg 删除源卷位数据。当删除遇到 VOLUME_IN_USE 错误时,改为根据 PrimaryStorageGlobalConfig.PRIMARY_STORAGE_DELETEBITS_GARBAGE_COLLECTOR_INTERVAL 配置提交 PrimaryStorageDeleteBitGC 异步垃圾回收任务,而非导致整个撤销流程失败。
删除位数据失败场景的集成测试
test/src/test/groovy/org/zstack/test/integration/storage/snapshot/UndoSnapshotCreationDeleteBitsFailureCase.groovy
完整的集成测试覆盖 ZSTAC-76704 / TIC-5745:在 undoSnapshotCreation 执行过程中注入 VOLUME_IN_USE 失败,验证 MarkSnapshotAsVolumeMsg 步骤先于删除步骤执行(从快照的安装路径被正确捕获验证),撤销成功完成且数据库中卷的 installPath 已更新为快照路径,以及仅有 PrimaryStorageDeleteBitGC 类型的 GC 任务被提交而不阻断撤销流程。

评估代码审查工作量

🎯 3 (中等复杂度) | ⏱️ ~25 分钟

诗集

快照轻轻一撤销,
位数据不急着飘,
标记先于删除跑,
GC 默默在关照,
流程顺利往前翘!🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰准确地反映了主要变更内容:重新排序快照撤销流程,使MarkSnapshotAsVolume在删除源云盘位之前执行。
Description check ✅ Passed 描述详细说明了问题、根因、修复方案和测试验证,与代码变更完全相关且信息充分。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/zstackio/TIC-5745

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/src/test/groovy/org/zstack/test/integration/storage/snapshot/UndoSnapshotCreationDeleteBitsFailureCase.groovy (1)

83-97: ⚡ Quick win

建议把 delete-bits 的目标路径也断言出来。

现在这里只用 DB 中的 installPath 证明 MarkSnapshotAsVolume 先执行,但没有校验 DeleteVolumeBitsOnPrimaryStorageMsg 实际删除的仍然是 originPath。如果以后误把新路径传给 delete-bits,这个用例仍可能通过。建议在 simulator 里同时记录 delete 请求里的路径并断言它等于 originPath

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/snapshot/UndoSnapshotCreationDeleteBitsFailureCase.groovy`
around lines 83 - 97, In the
env.afterSimulator(LocalStorageKvmBackend.DELETE_BITS_PATH) handler (the
DeleteBitsRsp block) record the delete request's target path from the incoming
simulator message and assert it equals the expected originPath in addition to
the existing DB installPath check; specifically, inside the afterSimulator
closure capture the path field from the delete-bits request payload, store it
(e.g. deleteRequestPath) and add an assertion that deleteRequestPath ==
originPath so the test fails if delete-bits is invoked on a different path than
the original snapshot.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@test/src/test/groovy/org/zstack/test/integration/storage/snapshot/UndoSnapshotCreationDeleteBitsFailureCase.groovy`:
- Around line 83-97: In the
env.afterSimulator(LocalStorageKvmBackend.DELETE_BITS_PATH) handler (the
DeleteBitsRsp block) record the delete request's target path from the incoming
simulator message and assert it equals the expected originPath in addition to
the existing DB installPath check; specifically, inside the afterSimulator
closure capture the path field from the delete-bits request payload, store it
(e.g. deleteRequestPath) and add an assertion that deleteRequestPath ==
originPath so the test fails if delete-bits is invoked on a different path than
the original snapshot.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e786214e-da75-47f2-92ce-04c14acdaad5

📥 Commits

Reviewing files that changed from the base of the PR and between 97a38fa and 0f86ae0.

📒 Files selected for processing (2)
  • storage/src/main/java/org/zstack/storage/volume/VolumeBase.java
  • test/src/test/groovy/org/zstack/test/integration/storage/snapshot/UndoSnapshotCreationDeleteBitsFailureCase.groovy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant