OCPBUGS-86246: Clean up old ovnkube-client certificates#4120
OCPBUGS-86246: Clean up old ovnkube-client certificates#4120ranjithrajaram wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds a certificate housekeeping mechanism to the Windows controller reconciliation loop. After reconciling services, the controller attempts to clean up old 🚥 Pre-merge checks | ✅ 14 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi @ranjithrajaram. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/daemon/controller/controller.go (1)
315-318: ⚡ Quick winAlign cleanup logging with project's logr standard.
The cleanup paths use
klog.*fcalls, but other controllers (csr.go,nodeconfig.go) uselogr.Loggerwith structured logging. Switch lines 315–318 and 450–458 to logr-stylelog.Info(),log.Error(), andlog.V(1).Info()for consistency. The manager is already configured withklogr(line 191), so logr is available.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/daemon/controller/controller.go` around lines 315 - 318, Replace the klog.*f calls around the certificate cleanup with the package's logr Logger calls: when calling sc.cleanupOldCertificates() use log.Error(err, "failed to cleanup old certificates") on error and log.V(1).Info("completed old certificate cleanup") (or log.Info for an important informational message) on success; do the same for the other certificate cleanup block currently using klog (the later cleanup that removes old cert files) by converting its klog.Infof/klog.Warningf calls to log.Info / log.Error / log.V(1).Info and pass the error as the first argument to log.Error plus short structured fields (e.g., "path" or "count") as needed so the logging matches csr.go/nodeconfig.go style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/daemon/controller/controller_test.go`:
- Around line 1090-1094: The test assumes fileTimes are in newest-first order
but filepath.Glob yields lexical order, so sort the slice before asserting:
apply a descending-time sort to the fileTimes slice (e.g., using sort.Slice or
sort.SliceStable comparing fileTimes[i].After(fileTimes[j])) right before the
loop that checks ordering; keep the variable name fileTimes and the existing
assertion loop unchanged so the subsequent asserts verify the sorted, descending
list.
---
Nitpick comments:
In `@pkg/daemon/controller/controller.go`:
- Around line 315-318: Replace the klog.*f calls around the certificate cleanup
with the package's logr Logger calls: when calling sc.cleanupOldCertificates()
use log.Error(err, "failed to cleanup old certificates") on error and
log.V(1).Info("completed old certificate cleanup") (or log.Info for an important
informational message) on success; do the same for the other certificate cleanup
block currently using klog (the later cleanup that removes old cert files) by
converting its klog.Infof/klog.Warningf calls to log.Info / log.Error /
log.V(1).Info and pass the error as the first argument to log.Error plus short
structured fields (e.g., "path" or "count") as needed so the logging matches
csr.go/nodeconfig.go style.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ad07803a-0f28-4c7b-a9b7-0ee4d2be7958
📒 Files selected for processing (2)
pkg/daemon/controller/controller.gopkg/daemon/controller/controller_test.go
| // Ensure kept files are the most recent ones | ||
| for i := 1; i < len(fileTimes); i++ { | ||
| assert.True(t, fileTimes[i-1].After(fileTimes[i]) || fileTimes[i-1].Equal(fileTimes[i]), | ||
| "kept files should be sorted by modification time (newest first)") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does Go's filepath.Glob return matches sorted by pathname rather than file modification time?
💡 Result:
Yes, Go's filepath.Glob returns matches sorted by pathname in lexicographical order, not by file modification time [1][2]. The implementation specifically appends new matches to the result slice in a way that maintains this lexicographical order [1]. If you require a different ordering, you must sort the returned slice manually [3].
Citations:
- 1: https://go.dev/src/io/fs/glob.go
- 2: https://golang-nuts.narkive.com/6NUcUpyU/order-of-files-returned-by-filepath-glob
- 3: path/filepath: ensure Glob() returns in lexical order and comment about that golang/go#17153
🏁 Script executed:
cd pkg/daemon/controller && sed -n '1060,1110p' controller_test.goRepository: openshift/windows-machine-config-operator
Length of output: 1382
Sort fileTimes before asserting descending order.
The assertion assumes fileTimes are in descending order, but filepath.Glob returns matches in lexicographical pathname order, not modification time order. Since the filenames here happen to be chronologically sequential (2026-05-01, 2026-05-02, etc.), Glob's lexicographical output produces times in ascending order (oldest-first), causing this check to fail. Sort fileTimes in descending order before asserting.
Suggested fix
+ // Ensure kept files are the most recent ones
+ sort.Slice(fileTimes, func(i, j int) bool {
+ return fileTimes[i].After(fileTimes[j])
+ })
for i := 1; i < len(fileTimes); i++ {
assert.True(t, fileTimes[i-1].After(fileTimes[i]) || fileTimes[i-1].Equal(fileTimes[i]),
"kept files should be sorted by modification time (newest first)")
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/daemon/controller/controller_test.go` around lines 1090 - 1094, The test
assumes fileTimes are in newest-first order but filepath.Glob yields lexical
order, so sort the slice before asserting: apply a descending-time sort to the
fileTimes slice (e.g., using sort.Slice or sort.SliceStable comparing
fileTimes[i].After(fileTimes[j])) right before the loop that checks ordering;
keep the variable name fileTimes and the existing assertion loop unchanged so
the subsequent asserts verify the sorted, descending list.
|
/test ? |
jrvaldes
left a comment
There was a problem hiding this comment.
thanks @ranjithrajaram for working on this, PTAL at the comments
| filesToDelete := matches[3:] | ||
| klog.Infof("cleaning up %d old certificate files from %s", len(filesToDelete), certDir) |
There was a problem hiding this comment.
consider including an e2e to validate this logic
There was a problem hiding this comment.
Given the complexity of E2E test setup for Windows nodes, I've added comprehensive unit tests that cover the production scenario. We can add E2E coverage in a follow-up if needed
| // If we have 5 or fewer files, no cleanup needed | ||
| if len(matches) <= 5 { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
What drove the decision to only clean up more than 5?
There was a problem hiding this comment.
The threshold of 5 was chosen to avoid unnecessary cleanup operations when few files exist.
However, we could lower this to 3 (matching the retention count) since cleanup is
inexpensive. Let me know what your suggestion here ?
| infoI, errI := os.Stat(matches[i]) | ||
| infoJ, errJ := os.Stat(matches[j]) |
There was a problem hiding this comment.
OS system calls are expensive, for this case, you can use the filenames directly to compare, given the date is present in the naming convention, for example:
k/cni/config/ovnkube-client-2026-01-02-20-30-40.pem
k/cni/config/ovnkube-client-2026-01-03-20-30-40.pem
k/cni/config/ovnkube-client-current.pem
There was a problem hiding this comment.
Thanks will update , yes understand,parsing dates from filenames is much faster than os.Stat
| // Keep the 3 most recent certificates, delete the rest | ||
| filesToDelete := matches[3:] |
There was a problem hiding this comment.
why are the 3 most recent certificates needed?
There was a problem hiding this comment.
Good point. Keeping 3 was conservative. Given that certificate rotation typically
needs the current + previous cert, keeping 2 should be sufficient
Current certificate is actively used
Previous certificate might still be referenced during rotation
Third certificate is likely unnecessary
| } else { | ||
| klog.V(1).Infof("removed old certificate file: %s", file) | ||
| } |
There was a problem hiding this comment.
consider removing the else statement, unreachable after the continue.
There was a problem hiding this comment.
Thanks for the feedback, will update it
|
/test unit |
|
@ranjithrajaram add the jira id |
|
/retitle OCPBUGS-86246: Clean up old ovnkube-client certificates |
|
@ranjithrajaram: This pull request references Jira Issue OCPBUGS-86246, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WICD now periodically removes old ovnkube-client certificate files from the CNI config directory to prevent disk space exhaustion. The cleanup runs during the normal reconciliation loop (every 2 minutes) and keeps only the 2 most recent certificates, deleting older ones. Without this cleanup, certificate files accumulate indefinitely as the hybrid-overlay service generates a new timestamped certificate daily. This can lead to hundreds of certificate files consuming disk space. The cleanup uses filename-based sorting instead of expensive os.Stat() calls, leveraging the ISO-8601-like timestamp format in the filenames (ovnkube-client-YYYY-MM-DD-HH-MM-SS.pem) for efficient lexicographical comparison. The cleanup logic is implemented in a separate function for testability, with comprehensive unit tests covering various scenarios including the production case of 150+ accumulated files.
|
Thanks for working on this @ranjithrajaram |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mansikulkarni96, ranjithrajaram The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| // cleanupOldCertificatesInDir removes old ovnkube-client certificate files from the specified directory, | ||
| // keeping only the most recent ones to prevent disk space exhaustion. | ||
| // This function is separated for testability. | ||
| func cleanupOldCertificatesInDir(certDir string) error { | ||
| certPattern := filepath.Join(certDir, "ovnkube-client-*.pem") |
There was a problem hiding this comment.
Consider adding a link to source the function comment so it is clear where the naming convention is coming from.
|
|
||
| // Keep the 2 most recent certificates, delete the rest | ||
| filesToDelete := matches[2:] | ||
| klog.Infof("cleaning up %d old certificate files from %s", len(filesToDelete), certDir) |
There was a problem hiding this comment.
this log will overload the log stream; consider lowering the verbosity level to debug klog.V(1).Infof
|
/jira refresh |
|
@jrvaldes: This pull request references Jira Issue OCPBUGS-86246, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@ranjithrajaram: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Trying to address OCPBUGS-86246
WICD now periodically removes old ovnkube-client certificate files from the CNI config directory to prevent disk space exhaustion. The cleanup runs during the normal reconciliation loop (every 2 minutes) and keeps only the 3 most recent certificates, deleting older ones.
Without this cleanup, certificate files accumulate indefinitely as the hybrid-overlay service generates a new timestamped certificate daily. This can lead to hundreds of certificate files consuming disk space.
The cleanup logic is implemented in a separate function for testability, with comprehensive unit tests covering various scenarios including the production case of 150+ accumulated files.
Summary by CodeRabbit
New Features
Tests