fix(satellite): bounded DRBD teardown on Resource delete + pod preStop (Bug 82) by kvaps · Pull Request #4 · cozystack/blockstor

kvaps · 2026-05-14T15:58:09Z

Summary

Deleting an RD whose kernel slot was suspended:quorum blocked drbdadm down indefinitely, stalling the satellite finalizer-strip path. Recycled port to minor allocations then failed create-md with "Device 'N' is configured!" and pod restarts hit the same hang in preStop (DaemonSet stuck Terminating, requiring kubectl delete --force).
New Adm.DownForce wraps drbdadm down with a 15s bounded context and on failure falls back to per-volume drbdsetup detach --force + kernel-direct drbdsetup down — neither command waits for quorum. Reconciler.DeleteResource routes through DownForce so finalizer-strip is unconditional after best-effort cleanup.
DaemonSet preStop hook gains the same fallback so pod restarts can no longer stall on quorum-stuck slots.

Observed on dev-kvaps stand

linstor rd d cc-autoplace deletes the RD via REST.
Resource CRDs lose the satellite finalizer (normal flow or operator force-strip).
/dev/drbd1000 survives on every satellite — drbdsetup status still shows cc-autoplace as suspended:quorum disk:Inconsistent.
Next RD allocated the same minor fails create-md with Device '1000' is configured! exit 20.
Satellite pod restart blocks indefinitely on drbdadm down all in preStop; DaemonSet rollout stuck Terminating until kubectl delete pod --force.

Test plan

go test ./pkg/drbd/... ./pkg/satellite/... -count=1 — green
Unit tests: TestAdmDownForceHappyPath, TestAdmDownForceFallback, TestAdmDownForceFallbackMissingSlot, TestAdmDownForceBoundedDeadline, TestDeleteResourceForceTeardownOnSuspendedQuorum
Stand verification: re-deploy DaemonSet, delete RD on a suspended-quorum resource, confirm kernel slot released within 15s and no Device '<minor>' is configured! on the next allocation.

…p (Bug 82) Deleting an RD whose kernel slot was `suspended:quorum` blocked `drbdadm down` indefinitely, stalling the satellite finalizer-strip path and stranding `/dev/drbd<minor>` slots. Recycled port to minor allocations then failed `create-md` with "Device 'N' is configured!" and a satellite pod restart hit the same hang in its preStop hook (DaemonSet stuck Terminating, only `kubectl delete --force` unblocks). `Adm.DownForce` wraps `drbdadm down` with a bounded context (`DownForceTimeout` = 15s) and on failure falls back to per-volume `drbdsetup detach --force` + kernel-direct `drbdsetup down`, neither of which wait for quorum. `Reconciler.DeleteResource` routes through `DownForce` so finalizer-strip is unconditional after best-effort cleanup. The DaemonSet preStop hook gains the same fallback so pod restarts can't stall on quorum-stuck slots. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-05-14T15:58:17Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 531294bd-3cdd-4fff-b1e0-e804c06f9b02

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/satellite-prestop-drbd-teardown

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

kvaps · 2026-05-14T15:58:40Z

Duplicate of #2 — the combined Bug 81+Bug 82 PR landed first with the same DownForce/preStop pattern. Closing as redundant.

gemini-code-assist

Code Review

This pull request introduces a robust teardown mechanism for DRBD resources to address issues where drbdadm down hangs due to lost quorum. A new DownForce method is implemented that attempts a graceful shutdown with a timeout, falling back to kernel-direct drbdsetup commands if necessary. These changes are integrated into the satellite reconciler and the daemonset's pre-stop lifecycle hook, accompanied by comprehensive unit tests. Feedback suggests enhancing the DownForce implementation by explicitly checking for context cancellation before and during the fallback sequence to avoid unnecessary operations if the parent context has already expired.

gemini-code-assist · 2026-05-14T16:00:46Z

+
+	// Fallback path: force-detach every volume (no quorum wait),
+	// then kernel-direct `drbdsetup down`. Use the parent ctx so a
+	// tighter caller deadline still applies.


The fallback path should respect the parent context cancellation. If ctx was cancelled by the caller (e.g., the reconciler timed out or the pod is shutting down), we should bail out immediately instead of attempting the fallback sequence. This avoids unnecessary shell-outs when the operation is already doomed to fail due to context expiration.

if ctx.Err() != nil { return err } // Fallback path: force-detach every volume (no quorum wait), // then kernel-direct `drbdsetup down`. Use the parent ctx so a // tighter caller deadline still applies.

gemini-code-assist · 2026-05-14T16:00:46Z

+	for _, vol := range volumeNumbers {
+		target := fmt.Sprintf("%s/%d", resource, vol)
+		_, _ = a.exec.Run(ctx, "drbdsetup", "detach", "--force", target)
+	}


Inside the volume detach loop, it's safer to check for context cancellation. If the context is cancelled while iterating through many volumes, continuing to issue drbdsetup detach commands is wasteful. Checking ctx.Err() at the start of each iteration ensures we stop as soon as the context is no longer valid.

for _, vol := range volumeNumbers { if ctx.Err() != nil { return ctx.Err() } target := fmt.Sprintf("%s/%d", resource, vol) _, _ = a.exec.Run(ctx, "drbdsetup", "detach", "--force", target) }

kvaps closed this May 14, 2026

kvaps deleted the fix/satellite-prestop-drbd-teardown branch May 14, 2026 15:58

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(satellite): bounded DRBD teardown on Resource delete + pod preStop (Bug 82)#4

fix(satellite): bounded DRBD teardown on Resource delete + pod preStop (Bug 82)#4
kvaps wants to merge 1 commit into
mainfrom
fix/satellite-prestop-drbd-teardown

kvaps commented May 14, 2026

Uh oh!

coderabbitai Bot commented May 14, 2026

Review skipped

Uh oh!

kvaps commented May 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kvaps commented May 14, 2026

Summary

Observed on dev-kvaps stand

Test plan

Uh oh!

coderabbitai Bot commented May 14, 2026

Review skipped

Uh oh!

kvaps commented May 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant