Skip to content

fix(satellite): bounded DRBD teardown on Resource delete + pod preStop (Bug 82)#4

Closed
kvaps wants to merge 1 commit into
mainfrom
fix/satellite-prestop-drbd-teardown
Closed

fix(satellite): bounded DRBD teardown on Resource delete + pod preStop (Bug 82)#4
kvaps wants to merge 1 commit into
mainfrom
fix/satellite-prestop-drbd-teardown

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps kvaps commented May 14, 2026

Summary

  • Deleting an RD whose kernel slot was suspended:quorum blocked drbdadm down indefinitely, stalling the satellite finalizer-strip path. Recycled port to minor allocations then failed create-md with "Device 'N' is configured!" and pod restarts hit the same hang in preStop (DaemonSet stuck Terminating, requiring kubectl delete --force).
  • New Adm.DownForce wraps drbdadm down with a 15s bounded context and on failure falls back to per-volume drbdsetup detach --force + kernel-direct drbdsetup down — neither command waits for quorum. Reconciler.DeleteResource routes through DownForce so finalizer-strip is unconditional after best-effort cleanup.
  • DaemonSet preStop hook gains the same fallback so pod restarts can no longer stall on quorum-stuck slots.

Observed on dev-kvaps stand

  1. linstor rd d cc-autoplace deletes the RD via REST.
  2. Resource CRDs lose the satellite finalizer (normal flow or operator force-strip).
  3. /dev/drbd1000 survives on every satellite — drbdsetup status still shows cc-autoplace as suspended:quorum disk:Inconsistent.
  4. Next RD allocated the same minor fails create-md with Device '1000' is configured! exit 20.
  5. Satellite pod restart blocks indefinitely on drbdadm down all in preStop; DaemonSet rollout stuck Terminating until kubectl delete pod --force.

Test plan

  • go test ./pkg/drbd/... ./pkg/satellite/... -count=1 — green
  • Unit tests: TestAdmDownForceHappyPath, TestAdmDownForceFallback, TestAdmDownForceFallbackMissingSlot, TestAdmDownForceBoundedDeadline, TestDeleteResourceForceTeardownOnSuspendedQuorum
  • Stand verification: re-deploy DaemonSet, delete RD on a suspended-quorum resource, confirm kernel slot released within 15s and no Device '<minor>' is configured! on the next allocation.

…p (Bug 82)

Deleting an RD whose kernel slot was `suspended:quorum` blocked
`drbdadm down` indefinitely, stalling the satellite finalizer-strip
path and stranding `/dev/drbd<minor>` slots. Recycled port to minor
allocations then failed `create-md` with "Device 'N' is configured!"
and a satellite pod restart hit the same hang in its preStop hook
(DaemonSet stuck Terminating, only `kubectl delete --force` unblocks).

`Adm.DownForce` wraps `drbdadm down` with a bounded context
(`DownForceTimeout` = 15s) and on failure falls back to per-volume
`drbdsetup detach --force` + kernel-direct `drbdsetup down`, neither
of which wait for quorum. `Reconciler.DeleteResource` routes through
`DownForce` so finalizer-strip is unconditional after best-effort
cleanup. The DaemonSet preStop hook gains the same fallback so pod
restarts can't stall on quorum-stuck slots.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 531294bd-3cdd-4fff-b1e0-e804c06f9b02

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/satellite-prestop-drbd-teardown

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kvaps
Copy link
Copy Markdown
Member Author

kvaps commented May 14, 2026

Duplicate of #2 — the combined Bug 81+Bug 82 PR landed first with the same DownForce/preStop pattern. Closing as redundant.

@kvaps kvaps closed this May 14, 2026
@kvaps kvaps deleted the fix/satellite-prestop-drbd-teardown branch May 14, 2026 15:58
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust teardown mechanism for DRBD resources to address issues where drbdadm down hangs due to lost quorum. A new DownForce method is implemented that attempts a graceful shutdown with a timeout, falling back to kernel-direct drbdsetup commands if necessary. These changes are integrated into the satellite reconciler and the daemonset's pre-stop lifecycle hook, accompanied by comprehensive unit tests. Feedback suggests enhancing the DownForce implementation by explicitly checking for context cancellation before and during the fallback sequence to avoid unnecessary operations if the parent context has already expired.

Comment thread pkg/drbd/drbdadm.go
Comment on lines +108 to +111

// Fallback path: force-detach every volume (no quorum wait),
// then kernel-direct `drbdsetup down`. Use the parent ctx so a
// tighter caller deadline still applies.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback path should respect the parent context cancellation. If ctx was cancelled by the caller (e.g., the reconciler timed out or the pod is shutting down), we should bail out immediately instead of attempting the fallback sequence. This avoids unnecessary shell-outs when the operation is already doomed to fail due to context expiration.

	if ctx.Err() != nil {
		return err
	}

	// Fallback path: force-detach every volume (no quorum wait),
	// then kernel-direct `drbdsetup down`. Use the parent ctx so a
	// tighter caller deadline still applies.

Comment thread pkg/drbd/drbdadm.go
Comment on lines +112 to +115
for _, vol := range volumeNumbers {
target := fmt.Sprintf("%s/%d", resource, vol)
_, _ = a.exec.Run(ctx, "drbdsetup", "detach", "--force", target)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Inside the volume detach loop, it's safer to check for context cancellation. If the context is cancelled while iterating through many volumes, continuing to issue drbdsetup detach commands is wasteful. Checking ctx.Err() at the start of each iteration ensures we stop as soon as the context is no longer valid.

	for _, vol := range volumeNumbers {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		target := fmt.Sprintf("%s/%d", resource, vol)
		_, _ = a.exec.Run(ctx, "drbdsetup", "detach", "--force", target)
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant