Fix device plugin pod not evicted during node drain#1802
Conversation
When a node is drained, the device plugin DaemonSet pod remains running because DaemonSet pods auto-tolerate NoSchedule taints and the only nodeSelector was the kmm-ready label (which stays while the kmod is loaded). This creates a deadlock: the unloader cannot run while the device plugin holds device files, and the device plugin won't leave because kmm-ready is never removed. Introduce a new device-plugin-target node label managed by the DevicePluginReconciler. The DaemonSet nodeSelector now requires both kmm-ready AND device-plugin-target. The DevicePluginReconciler watches node taint changes: it adds device-plugin-target to schedulable nodes matching the Module selector, and removes it from unschedulable nodes. This causes the DaemonSet controller to evict the device plugin pod during drain, breaking the deadlock.
✅ Deploy Preview for openshift-kmm ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: TomerNewman The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe PR implements device-plugin pod eviction during node drain by refactoring node selection into separate list and filter operations, introducing a device-plugin-target label managed by the DevicePluginReconciler, and updating the device-plugin DaemonSet to require this label alongside the existing kernel-module-ready label. The device-plugin reconciler now watches node taint changes and dynamically labels schedulable nodes while removing the label from unschedulable nodes. ChangesDevice Plugin Pod Eviction During Node Drain
Sequence DiagramsequenceDiagram
participant DevicePluginReconciler
participant NodeWatch as Node Event Watch
participant NodeAPI as Node API
participant LabelHandler as Label Handler
participant Kubernetes as Kubernetes API
NodeWatch->>DevicePluginReconciler: Node taint changed (non-deletion)
DevicePluginReconciler->>DevicePluginReconciler: Reconcile module
DevicePluginReconciler->>LabelHandler: handleDevicePluginTargetLabels
LabelHandler->>NodeAPI: GetAllNodesBySelector (module selector)
NodeAPI-->>LabelHandler: nodes list
LabelHandler->>NodeAPI: IsNodeSchedulable (per-node check)
alt Schedulable
LabelHandler->>Kubernetes: Add device-plugin-target label
else Unschedulable
LabelHandler->>Kubernetes: Remove device-plugin-target label
end
LabelHandler-->>DevicePluginReconciler: aggregated errors
DevicePluginReconciler->>Kubernetes: Update DaemonSet (requires device-plugin-target label)
Kubernetes->>Kubernetes: Evict pod from unschedulable nodes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/lgtm |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/node/node.go (1)
34-47:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle cordoned nodes in
IsNodeSchedulable.
internal/node/node.go’sIsNodeSchedulableonly checksnode.Spec.Taintsand ignoresnode.Spec.Unschedulable(set bykubectl cordon).handleDevicePluginTargetLabelsusesIsNodeSchedulableto decide when to add/remove thedevice-plugin-targetlabel; since DaemonSets tolerate the implicitnode.kubernetes.io/unschedulable:NoScheduletaint, cordoned nodes can keep the label unlessSpec.Unschedulableis accounted for.- Add a regression test in
internal/node/node_test.goassertingIsNodeSchedulable(...)=falsewhenSpec.Unschedulable=true(even if there are no blocking taints).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/node/node.go` around lines 34 - 47, The IsNodeSchedulable function currently only examines node.Spec.Taints; update IsNodeSchedulable to return false when node.Spec.Unschedulable == true (i.e., treat cordoned nodes as unschedulable before checking taints) so handleDevicePluginTargetLabels will drop the label for cordoned nodes; then add a unit test in internal/node/node_test.go that constructs a v1.Node with Spec.Unschedulable = true (and no blocking taints) and asserts IsNodeSchedulable(...) == false to prevent regressions.
🧹 Nitpick comments (1)
internal/controllers/module_reconciler_test.go (1)
125-128: ⚡ Quick winAdd one case that exercises non-empty
mod.Spec.Tolerations.These expectations only cover the nil/empty case, so they won't catch a regression where
Reconcilestops forwarding user-defined tolerations and only passesmodule.InternalTolerations. A single happy-path test with a custom module toleration and an expectation on the combined slice would lock down the new contract.Also applies to: 189-190, 209-209
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controllers/module_reconciler_test.go` around lines 125 - 128, Add a test case that sets mod.Spec.Tolerations to a non-empty slice and assert Reconcile forwards the combined tolerations (user-defined tolerations followed by module.InternalTolerations) to mn.GetSchedulableNodesBySelector; specifically, in the test create a module with a custom toleration, set an expectation on GetSchedulableNodesBySelector to receive mod.Spec.Selector and the combined slice (e.g., append(module.InternalTolerations... ) to mod.Spec.Tolerations) and return targetedNodes,nil, then exercise the Reconcile path; apply the same pattern to the other similar test blocks that currently only cover the nil/empty tolerations case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/node/node.go`:
- Around line 34-47: The IsNodeSchedulable function currently only examines
node.Spec.Taints; update IsNodeSchedulable to return false when
node.Spec.Unschedulable == true (i.e., treat cordoned nodes as unschedulable
before checking taints) so handleDevicePluginTargetLabels will drop the label
for cordoned nodes; then add a unit test in internal/node/node_test.go that
constructs a v1.Node with Spec.Unschedulable = true (and no blocking taints) and
asserts IsNodeSchedulable(...) == false to prevent regressions.
---
Nitpick comments:
In `@internal/controllers/module_reconciler_test.go`:
- Around line 125-128: Add a test case that sets mod.Spec.Tolerations to a
non-empty slice and assert Reconcile forwards the combined tolerations
(user-defined tolerations followed by module.InternalTolerations) to
mn.GetSchedulableNodesBySelector; specifically, in the test create a module with
a custom toleration, set an expectation on GetSchedulableNodesBySelector to
receive mod.Spec.Selector and the combined slice (e.g.,
append(module.InternalTolerations... ) to mod.Spec.Tolerations) and return
targetedNodes,nil, then exercise the Reconcile path; apply the same pattern to
the other similar test blocks that currently only cover the nil/empty
tolerations case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 97a01d47-d5f3-4c2f-b9b6-20b5187aebd7
📒 Files selected for processing (10)
internal/controllers/device_plugin_reconciler.gointernal/controllers/device_plugin_reconciler_test.gointernal/controllers/mock_device_plugin_reconciler.gointernal/controllers/module_reconciler.gointernal/controllers/module_reconciler_test.gointernal/filter/filter.gointernal/node/mock_node.gointernal/node/node.gointernal/node/node_test.gointernal/utils/kmmlabels.go
a4a5d0b
into
rh-ecosystem-edge:main
|
/cherry-pick release-2.16 |
|
@TomerNewman: cannot checkout DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-2.6 |
|
@TomerNewman: new pull request created: #1803 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-2.5 |
|
@TomerNewman: #1802 failed to apply on top of branch "release-2.5": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-2.5 |
|
@TomerNewman: #1802 failed to apply on top of branch "release-2.5": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
When a node is drained, the device plugin DaemonSet pod remains running because DaemonSet pods auto-tolerate NoSchedule taints and the only nodeSelector was the kmm-ready label (which stays while the kmod is loaded). This creates a deadlock: the unloader cannot run while the device plugin holds device files, and the device plugin won't leave because kmm-ready is never removed.
Introduce a new device-plugin-target node label managed by the DevicePluginReconciler. The DaemonSet nodeSelector now requires both kmm-ready AND device-plugin-target. The DevicePluginReconciler watches node taint changes: it adds device-plugin-target to schedulable nodes matching the Module selector, and removes it from unschedulable nodes. This causes the DaemonSet controller to evict the device plugin pod during drain, breaking the deadlock.
fixes #1801
/cc @yevgeny-shnaidman @ybettan
Summary by CodeRabbit
New Features
Improvements