OCPBUGS-86246: Clean up old ovnkube-client certificates#4120

Open
ranjithrajaram wants to merge 1 commit into openshift:master from ranjithrajaram:master
Conversation

@ranjithrajaram ranjithrajaram commented May 20, 2026

Trying to address OCPBUGS-86246

WICD now periodically removes old ovnkube-client certificate files from the CNI config directory to prevent disk space exhaustion. The cleanup runs during the normal reconciliation loop (every 2 minutes) and keeps only the 3 most recent certificates, deleting older ones.

Without this cleanup, certificate files accumulate indefinitely as the hybrid-overlay service generates a new timestamped certificate daily. This can lead to hundreds of certificate files consuming disk space.

The cleanup logic is implemented in a separate function for testability, with comprehensive unit tests covering various scenarios including the production case of 150+ accumulated files.

Summary by CodeRabbit

  • New Features

    • Added automatic cleanup of old certificate files during service reconciliation. The system retains the 3 most recent certificates and removes older ones. Cleanup failures emit warnings but do not halt reconciliation.
  • Tests

    • Added test coverage for certificate cleanup functionality.

coderabbitai Bot commented May 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5b5f4cd4-6628-46ae-a4a6-54237f6cb6c6

Walkthrough

This PR adds a certificate housekeeping mechanism to the Windows controller reconciliation loop. After reconciling services, the controller attempts to clean up old ovnkube-client-*.pem files from the CNI config directory, keeping only the 3 most recent when more than 5 are present. Errors during cleanup emit a warning but do not block reconciliation progress. The implementation includes glob-based file discovery, sorting by modification time (newest first), and iterative deletion with failure tolerance. Test coverage verifies the retention count and file ordering semantics.
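
The behaviour described in the walkthrough can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the PR's code: the names `cleanupByModTime`, `certFile`, and `demo` are hypothetical, failures are written to stderr instead of klog, and each file is stat'ed once up front rather than inside the sort comparator.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// certFile pairs a path with its modification time so each file is stat'ed once.
type certFile struct {
	path    string
	modTime time.Time
}

// cleanupByModTime is a hypothetical helper. When more than threshold files
// match ovnkube-client-*.pem in dir, it keeps the `keep` newest by modification
// time and tries to remove the rest. It returns the number of files removed.
func cleanupByModTime(dir string, threshold, keep int) (int, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "ovnkube-client-*.pem"))
	if err != nil {
		return 0, fmt.Errorf("listing certificates in %s: %w", dir, err)
	}
	if len(matches) <= threshold {
		return 0, nil
	}
	files := make([]certFile, 0, len(matches))
	for _, m := range matches {
		info, statErr := os.Stat(m)
		if statErr != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", m, statErr)
			continue
		}
		files = append(files, certFile{path: m, modTime: info.ModTime()})
	}
	// Newest first.
	sort.Slice(files, func(i, j int) bool { return files[i].modTime.After(files[j].modTime) })
	if len(files) <= keep {
		return 0, nil
	}
	removed := 0
	for _, f := range files[keep:] {
		if rmErr := os.Remove(f.path); rmErr != nil {
			// Tolerate the failure and move on to the next file.
			fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", f.path, rmErr)
			continue
		}
		removed++
	}
	return removed, nil
}

// demo creates eight daily files in a temp dir, runs the cleanup with
// threshold 5 and keep 3, and returns the base names that remain.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "ovnkube-certs")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	first := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
	for day := 0; day < 8; day++ {
		ts := first.AddDate(0, 0, day)
		p := filepath.Join(dir, "ovnkube-client-"+ts.Format("2006-01-02-15-04-05")+".pem")
		if err := os.WriteFile(p, nil, 0o600); err != nil {
			return nil, err
		}
		if err := os.Chtimes(p, ts, ts); err != nil {
			return nil, err
		}
	}
	if _, err := cleanupByModTime(dir, 5, 3); err != nil {
		return nil, err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var left []string
	for _, e := range entries {
		left = append(left, e.Name())
	}
	return left, nil
}

func main() {
	left, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("remaining:", left)
}
```

Collecting the modification times before sorting keeps the number of `os.Stat` calls equal to the number of files, which matters once a directory holds hundreds of certificates.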

🚥 Pre-merge checks | ✅ 14 | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Kubernetes Controller Patterns | ⚠️ Warning | Test assertion fails: fileTimes from lexicographic glob ordering not sorted before checking descending modification-time order. | Apply review suggestion to sort fileTimes in descending order before assertion to fix the test logic. |
| Test Structure And Quality | ⚠️ Warning | fileTimes are in ascending order from Glob's lexicographical sort, but assertion checks they're descending, causing the test to fail on the assertion logic itself. | Sort fileTimes in descending order before asserting: sort.Slice(fileTimes, func(i, j int) bool { return fileTimes[i].After(fileTimes[j]) }) |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Go Best Practices & Build Tags | ✅ Passed | Files have correct //go:build windows tags, errors wrapped with fmt.Errorf %w, nil checks precede pointer dereferences, no error ignoring with _, zero panic calls, proper design. |
| Security: Secrets, Ssh & Csr | ✅ Passed | PR adds certificate file cleanup logic that only performs filesystem operations and logs file counts/paths, never reading or logging certificate contents. No CSR or SSH code is affected. |
| Windows Service Management | ✅ Passed | This PR is not related to Windows Service Management per the check's requirements. It implements certificate file cleanup to prevent disk exhaustion, not service control operations. |
| Platform-Specific Requirements | ✅ Passed | PR implements Windows-only certificate cleanup (CNI config directory) that is platform-agnostic; no vSphere, AWS, Azure, GCP, or machine naming considerations apply to local file cleanup. |
| Stable And Deterministic Test Names | ✅ Passed | Test names use standard Go testing (not Ginkgo); all names are static, descriptive, and deterministic with no dynamic values, generated suffixes, or timestamps. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. The new TestCleanupOldCertificates is a standard Go unit test using testing.T, not a Ginkgo e2e test. Check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. The new TestCleanupOldCertificates is a standard Go unit test using testing.T and testify, not a Ginkgo test. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only local certificate file cleanup utilities. No deployment manifests, pod scheduling constraints, affinity rules, topology assumptions, nodeSelectors, or replica logic are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR adds certificate cleanup functions with no process-level code or stdout writes; all logging uses klog, tests are standard Go table-driven with testing.T. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. Only standard Go unit test TestCleanupOldCertificates using *testing.T, testing local certificate cleanup with no IPv4 or external connectivity. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: implementing cleanup of old ovnkube-client certificates, which directly addresses the certificate accumulation issue (OCPBUGS-86246) that the PR objectives highlight as the core purpose. |

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 20, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

Hi @ranjithrajaram. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/daemon/controller/controller.go (1)

315-318: ⚡ Quick win

Align cleanup logging with project's logr standard.

The cleanup paths use klog.*f calls, but other controllers (csr.go, nodeconfig.go) use logr.Logger with structured logging. Switch lines 315–318 and 450–458 to logr-style log.Info(), log.Error(), and log.V(1).Info() for consistency. The manager is already configured with klogr (line 191), so logr is available.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/controller/controller.go` around lines 315 - 318, Replace the
klog.*f calls around the certificate cleanup with the package's logr Logger
calls: when calling sc.cleanupOldCertificates() use log.Error(err, "failed to
cleanup old certificates") on error and log.V(1).Info("completed old certificate
cleanup") (or log.Info for an important informational message) on success; do
the same for the other certificate cleanup block currently using klog (the later
cleanup that removes old cert files) by converting its klog.Infof/klog.Warningf
calls to log.Info / log.Error / log.V(1).Info and pass the error as the first
argument to log.Error plus short structured fields (e.g., "path" or "count") as
needed so the logging matches csr.go/nodeconfig.go style.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ad07803a-0f28-4c7b-a9b7-0ee4d2be7958

📥 Commits

Reviewing files that changed from the base of the PR and between 8640d4e and 0bbced8.

📒 Files selected for processing (2)
  • pkg/daemon/controller/controller.go
  • pkg/daemon/controller/controller_test.go

Comment on lines +1090 to +1094
// Ensure kept files are the most recent ones
for i := 1; i < len(fileTimes); i++ {
	assert.True(t, fileTimes[i-1].After(fileTimes[i]) || fileTimes[i-1].Equal(fileTimes[i]),
		"kept files should be sorted by modification time (newest first)")
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does Go's filepath.Glob return matches sorted by pathname rather than file modification time?

💡 Result:

Yes, Go's filepath.Glob returns matches sorted by pathname in lexicographical order, not by file modification time [1][2]. The implementation specifically appends new matches to the result slice in a way that maintains this lexicographical order [1]. If you require a different ordering, you must sort the returned slice manually [3].


🏁 Script executed:

cd pkg/daemon/controller && sed -n '1060,1110p' controller_test.go

Repository: openshift/windows-machine-config-operator

Length of output: 1382


Sort fileTimes before asserting descending order.

The assertion assumes fileTimes are in descending order, but filepath.Glob returns matches in lexicographical pathname order, not modification time order. Since the filenames here happen to be chronologically sequential (2026-05-01, 2026-05-02, etc.), Glob's lexicographical output produces times in ascending order (oldest-first), causing this check to fail. Sort fileTimes in descending order before asserting.

Suggested fix
+			// Ensure kept files are the most recent ones
+			sort.Slice(fileTimes, func(i, j int) bool {
+				return fileTimes[i].After(fileTimes[j])
+			})
			for i := 1; i < len(fileTimes); i++ {
				assert.True(t, fileTimes[i-1].After(fileTimes[i]) || fileTimes[i-1].Equal(fileTimes[i]),
					"kept files should be sorted by modification time (newest first)")
			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/controller/controller_test.go` around lines 1090 - 1094, The test
assumes fileTimes are in newest-first order but filepath.Glob yields lexical
order, so sort the slice before asserting: apply a descending-time sort to the
fileTimes slice (e.g., using sort.Slice or sort.SliceStable comparing
fileTimes[i].After(fileTimes[j])) right before the loop that checks ordering;
keep the variable name fileTimes and the existing assertion loop unchanged so
the subsequent asserts verify the sorted, descending list.
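
The ordering issue raised in this comment can be reproduced without touching the filesystem. The sketch below is illustrative only: `sampleTimes`, `sortNewestFirst`, and `isNewestFirst` are hypothetical names, and the sample data simply assumes that date-stamped files listed in pathname order yield their modification times oldest first.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// sampleTimes returns three modification times in ascending order, as they
// would appear if date-stamped files were listed in pathname order.
func sampleTimes() []time.Time {
	first := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
	times := make([]time.Time, 3)
	for i := range times {
		times[i] = first.AddDate(0, 0, i)
	}
	return times
}

// sortNewestFirst orders times in place, most recent first.
func sortNewestFirst(times []time.Time) {
	sort.Slice(times, func(i, j int) bool { return times[i].After(times[j]) })
}

// isNewestFirst mirrors the test's assertion loop: every entry must be
// later than or equal to the entry that follows it.
func isNewestFirst(times []time.Time) bool {
	for i := 1; i < len(times); i++ {
		if times[i-1].Before(times[i]) {
			return false
		}
	}
	return true
}

func main() {
	times := sampleTimes()
	fmt.Println("as listed:", isNewestFirst(times))
	sortNewestFirst(times)
	fmt.Println("after sorting:", isNewestFirst(times))
}
```

Sorting inside the test makes the assertion independent of whatever order the directory listing returns, at the cost of no longer checking that order itself.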

@jrvaldes
Contributor

/test ?

Contributor

@jrvaldes jrvaldes left a comment

thanks @ranjithrajaram for working on this, PTAL at the comments

Comment thread pkg/daemon/controller/controller.go Outdated
Comment on lines +449 to +450
filesToDelete := matches[3:]
klog.Infof("cleaning up %d old certificate files from %s", len(filesToDelete), certDir)
Contributor

consider including an e2e to validate this logic

Author

Given the complexity of E2E test setup for Windows nodes, I've added comprehensive unit tests that cover the production scenario. We can add E2E coverage in a follow-up if needed.

Comment thread pkg/daemon/controller/controller.go Outdated
Comment on lines +433 to +436
// If we have 5 or fewer files, no cleanup needed
if len(matches) <= 5 {
return nil
}
Contributor

What drove the decision to only clean up more than 5?

Author

The threshold of 5 was chosen to avoid unnecessary cleanup operations when few files exist.
However, we could lower it to 3 (matching the retention count) since cleanup is
inexpensive. Let me know what you suggest.
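
To make the trade-off concrete, here is a small sketch of how a cleanup threshold and a retention count interact. The function `staleCount` and its parameters are hypothetical; it only models file counts and does not reflect the PR's file handling.

```go
package main

import "fmt"

// staleCount reports how many files one cleanup pass would delete, given the
// number of files present, the threshold above which cleanup runs, and the
// number of newest files to keep.
func staleCount(present, threshold, keep int) int {
	if present <= threshold || present <= keep {
		return 0
	}
	return present - keep
}

func main() {
	for present := 3; present <= 7; present++ {
		fmt.Printf("present=%d  threshold 5: delete %d  threshold 3: delete %d\n",
			present, staleCount(present, 5, 3), staleCount(present, 3, 3))
	}
}
```

With a threshold of 5 and a retention count of 3, a directory holding four or five files is left alone and cleanup only starts at six. Setting the threshold equal to the retention count removes that gap.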

Comment thread pkg/daemon/controller/controller.go Outdated
Comment on lines +440 to +441
infoI, errI := os.Stat(matches[i])
infoJ, errJ := os.Stat(matches[j])
Contributor

OS system calls are expensive. In this case you can compare the filenames directly, since the date is part of the naming convention. For example:

k/cni/config/ovnkube-client-2026-01-02-20-30-40.pem
k/cni/config/ovnkube-client-2026-01-03-20-30-40.pem
k/cni/config/ovnkube-client-current.pem
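
A minimal sketch of the filename-based comparison suggested here, assuming the timestamp layout implied by the example names (zero-padded year through second). The helpers `timestamped` and `newestFirst` are hypothetical. One detail worth noting: a name such as `ovnkube-client-current.pem` would sort ahead of every dated name in a descending string sort, so this sketch filters out names without a parseable timestamp. Whether the PR needs such a filter is not stated in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

const (
	certPrefix = "ovnkube-client-"
	certSuffix = ".pem"
	// tsLayout is Go's reference-time spelling of YYYY-MM-DD-HH-MM-SS
	// (an assumption based on the example file names).
	tsLayout = "2006-01-02-15-04-05"
)

// timestamped reports whether name looks like ovnkube-client-<timestamp>.pem.
func timestamped(name string) bool {
	if !strings.HasPrefix(name, certPrefix) || !strings.HasSuffix(name, certSuffix) {
		return false
	}
	stamp := strings.TrimSuffix(strings.TrimPrefix(name, certPrefix), certSuffix)
	_, err := time.Parse(tsLayout, stamp)
	return err == nil
}

// newestFirst returns only the timestamped names, newest first. Because the
// timestamp fields are zero-padded and run from year down to second, plain
// string comparison agrees with chronological order.
func newestFirst(names []string) []string {
	var dated []string
	for _, n := range names {
		if timestamped(n) {
			dated = append(dated, n)
		}
	}
	sort.Sort(sort.Reverse(sort.StringSlice(dated)))
	return dated
}

func main() {
	names := []string{
		"ovnkube-client-2026-01-02-20-30-40.pem",
		"ovnkube-client-current.pem",
		"ovnkube-client-2026-01-03-20-30-40.pem",
	}
	fmt.Println(newestFirst(names))
}
```

No file metadata is read at any point, so the ordering costs only string comparisons.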

Author

Thanks, will update. Understood: parsing dates from filenames is much faster than calling os.Stat.

Comment thread pkg/daemon/controller/controller.go Outdated
Comment on lines +448 to +449
// Keep the 3 most recent certificates, delete the rest
filesToDelete := matches[3:]
Contributor

why are the 3 most recent certificates needed?

Author

Good point. Keeping 3 was conservative. Certificate rotation typically needs only
the current and previous certificates, so keeping 2 should be sufficient:

  • The current certificate is actively used.
  • The previous certificate might still be referenced during rotation.
  • A third certificate is likely unnecessary.

Comment thread pkg/daemon/controller/controller.go Outdated
Comment on lines +456 to +458
} else {
klog.V(1).Infof("removed old certificate file: %s", file)
}
Contributor

Consider removing the else branch; it is redundant after the continue.

Author

Thanks for the feedback, will update it

@jrvaldes
Contributor

/test unit

@jrvaldes
Contributor

/retitle OCPBUGS-86246: Clean up old ovnkube-client certificates

@openshift-ci openshift-ci Bot changed the title Clean up old ovnkube-client certificates OCPBUGS-86246: Clean up old ovnkube-client certificates May 20, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 20, 2026
@openshift-ci-robot

@ranjithrajaram: This pull request references Jira Issue OCPBUGS-86246, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

WICD now periodically removes old ovnkube-client certificate files from
the CNI config directory to prevent disk space exhaustion. The cleanup
runs during the normal reconciliation loop (every 2 minutes) and keeps
only the 2 most recent certificates, deleting older ones.

Without this cleanup, certificate files accumulate indefinitely as the
hybrid-overlay service generates a new timestamped certificate daily.
This can lead to hundreds of certificate files consuming disk space.

The cleanup uses filename-based sorting instead of expensive os.Stat()
calls, leveraging the ISO-8601-like timestamp format in the filenames
(ovnkube-client-YYYY-MM-DD-HH-MM-SS.pem) for efficient lexicographical
comparison.

The cleanup logic is implemented in a separate function for testability,
with comprehensive unit tests covering various scenarios including the
production case of 150+ accumulated files.
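
The approach described in this commit message can be sketched as below. This is a hedged illustration, not the PR's implementation: `cleanupByName` and `demo` are hypothetical names, failures go to stderr rather than klog, and the `[0-9]` character class in the glob is an assumption added here so that names without a timestamp are never considered for deletion.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// cleanupByName is a hypothetical stand-in for the cleanup function. It keeps
// the `keep` newest timestamped certificates in dir, judged by file name
// alone, and tries to delete the rest. It returns the number removed.
func cleanupByName(dir string, keep int) (int, error) {
	// Assumption of this sketch: the [0-9] class skips names such as
	// ovnkube-client-current.pem that carry no timestamp.
	matches, err := filepath.Glob(filepath.Join(dir, "ovnkube-client-[0-9]*.pem"))
	if err != nil {
		return 0, fmt.Errorf("listing certificates in %s: %w", dir, err)
	}
	if len(matches) <= keep {
		return 0, nil
	}
	// Zero-padded timestamps make descending string order newest-first.
	sort.Sort(sort.Reverse(sort.StringSlice(matches)))
	removed := 0
	for _, path := range matches[keep:] {
		if err := os.Remove(path); err != nil {
			// Report and skip; a later pass can retry this file.
			fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", path, err)
			continue
		}
		removed++
	}
	return removed, nil
}

// demo builds a temp directory with 150 daily certificates plus a
// -current.pem file, runs the cleanup with keep=2, and reports the result.
func demo() (removed int, left []string, err error) {
	dir, err := os.MkdirTemp("", "ovnkube-certs")
	if err != nil {
		return 0, nil, err
	}
	defer os.RemoveAll(dir)

	names := []string{"ovnkube-client-current.pem"}
	first := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	for day := 0; day < 150; day++ {
		stamp := first.AddDate(0, 0, day).Format("2006-01-02-15-04-05")
		names = append(names, "ovnkube-client-"+stamp+".pem")
	}
	for _, name := range names {
		if err := os.WriteFile(filepath.Join(dir, name), nil, 0o600); err != nil {
			return 0, nil, err
		}
	}
	removed, err = cleanupByName(dir, 2)
	if err != nil {
		return removed, nil, err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return removed, nil, err
	}
	for _, e := range entries {
		left = append(left, e.Name())
	}
	return removed, left, nil
}

func main() {
	removed, left, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("removed %d, remaining %v\n", removed, left)
}
```

Because every match shares the same directory prefix, sorting the full paths gives the same order as sorting the base names.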
@mansikulkarni96
Member

Thanks for working on this @ranjithrajaram
/approve

@openshift-ci
Contributor

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mansikulkarni96, ranjithrajaram

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2026
Contributor

@jrvaldes jrvaldes left a comment

/ok-to-test

Comment on lines +421 to +425
// cleanupOldCertificatesInDir removes old ovnkube-client certificate files from the specified directory,
// keeping only the most recent ones to prevent disk space exhaustion.
// This function is separated for testability.
func cleanupOldCertificatesInDir(certDir string) error {
certPattern := filepath.Join(certDir, "ovnkube-client-*.pem")
Contributor

Consider adding a link to the source in the function comment so it is clear where the naming convention comes from.

https://github.com/ovn-kubernetes/ovn-kubernetes/blob/df4b95aaeb7dfa2015bd4995e47e7e49dd79b9b7/go-controller/pkg/util/kube.go#L137


// Keep the 2 most recent certificates, delete the rest
filesToDelete := matches[2:]
klog.Infof("cleaning up %d old certificate files from %s", len(filesToDelete), certDir)
Contributor

This log will flood the log stream; consider moving it to debug verbosity with klog.V(1).Infof.

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2026
@jrvaldes
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 26, 2026
@openshift-ci-robot

@jrvaldes: This pull request references Jira Issue OCPBUGS-86246, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci
Contributor

openshift-ci Bot commented May 26, 2026

@ranjithrajaram: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/vsphere-disconnected-e2e-operator | 86e8a59 | link | true | /test vsphere-disconnected-e2e-operator |
| ci/prow/wicd-unit-vsphere | 86e8a59 | link | true | /test wicd-unit-vsphere |
| ci/prow/azure-e2e-operator | 86e8a59 | link | true | /test azure-e2e-operator |

Full PR test history. Your PR dashboard.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

4 participants