Skip to content

[SREP-3734] Port AlertmanagerInhibitions e2e tests from osde2e#515

Open
smarthall wants to merge 1 commit into
openshift:masterfrom
smarthall:move-inhibitions-test-to-cao
Open

[SREP-3734] Port AlertmanagerInhibitions e2e tests from osde2e#515
smarthall wants to merge 1 commit into
openshift:masterfrom
smarthall:move-inhibitions-test-to-cao

Conversation

@smarthall
Copy link
Copy Markdown
Member

@smarthall smarthall commented Apr 29, 2026

Summary

Ports the AlertmanagerInhibitions e2e tests from osde2e into this repo, adapted to the CAMO test framework.

What's tested

  • Inhibit rules exist: Validates that the alertmanager config contains inhibit rules with a severity=critical source match (existing Describe("Alertmanager configuration structure") block).
  • Inhibition actually works: Injects synthetic ClusterOperatorDegraded and ClusterOperatorDown alerts directly into the Alertmanager v2 API, then verifies ClusterOperatorDown is present with state=suppressed. This proves the inhibition rule is working end-to-end.

Approach: synthetic alert injection

Instead of degrading a real cluster operator (risky, slow, doesn't verify actual suppression), we POST synthetic alerts to Alertmanager's /api/v2/alerts endpoint. Alertmanager treats them identically to real alerts — inhibition rules apply immediately.

POST uses pod exec to reach each alertmanager instance directly, since alerts posted via the API are not gossiped between peers and the external route load-balances across replicas. Cleanup resolves the synthetic alerts by posting them with endsAt in the past.

Test completes in ~11 seconds with no cluster state changes.

New utilities in test/e2e/utils/

  • AlertmanagerClient — authenticated HTTP client for the Alertmanager v2 API (GET via external route, POST via pod exec)
  • PostAlerts() / ResolveAlerts() — inject and clean up synthetic alerts
  • GetAlerts() / GetAlertsByName() — query alert state including inhibition status

Test plan

  • go build -tags osde2e ./test/e2e/... compiles cleanly
  • All inhibition specs pass against a live cluster (~11s)
  • Existing tests unaffected

Related

Corresponding osde2e test removal: openshift/osde2e#3222

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@smarthall has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 44 minutes and 20 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bbf01ead-cc87-4887-a370-0a93c1f87de9

📥 Commits

Reviewing files that changed from the base of the PR and between 0a9d537 and 37690a4.

📒 Files selected for processing (2)
  • test/e2e/configure_alertmanager_operator_tests.go
  • test/e2e/utils/utils.go

Walkthrough

Adds an Alertmanager v2 test client and e2e tests that post synthetic alerts to Alertmanager, poll the API until a target alert becomes suppressed by inhibition rules, and resolve the synthetic alerts during cleanup.

Changes

Cohort / File(s) Summary
E2E Test
test/e2e/configure_alertmanager_operator_tests.go
Initializes a typed Kubernetes client and an Alertmanager v2 client in the test harness; adds tests that post synthetic ClusterOperatorDegraded (source) and ClusterOperatorDown (target) alerts, poll Alertmanager v2 until the target is suppressed, and resolve injected alerts on cleanup.
E2E Utilities / Alertmanager Client
test/e2e/utils/utils.go
Adds AlertmanagerClient and related types (AlertmanagerV2Alert, AlertmanagerV2AlertStatus, AlertmanagerV2PostAlert); implements NewAlertmanagerClient (route host discovery, SA token, router CA, TLS-configured http.Client), GetAlerts/GetAlertsByName (v2 API filtering), PostAlerts (pod exec curl to post synthetic alerts), and ResolveAlerts (post with past endsAt).

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant KubeAPI as Kubernetes API
  participant Route as Route (alertmanager-main)
  participant AlertmgrPod as Alertmanager Pod
  participant AlertmgrAPI as Alertmanager v2 API

  Test->>KubeAPI: discover `alertmanager-main` route host, get SA token & router CA
  KubeAPI-->>Test: route host, bearer token, CA bundle
  Test->>Route: use host + CA + token to construct AlertmanagerClient (HTTPS)
  Note right of Test: static config checks may read operator-managed alertmanager.yaml
  Test->>AlertmgrPod: exec `curl` POST /api/v2/alerts (post synthetic source & target)
  AlertmgrPod->>AlertmgrAPI: ingest posted alerts
  AlertmgrAPI-->>AlertmgrAPI: evaluate inhibition rules (source -> suppress target)
  loop poll until suppressed
    Test->>AlertmgrAPI: GET /api/v2/alerts?filter=...
    AlertmgrAPI-->>Test: alert list (target status becomes "suppressed")
  end
  Test->>AlertmgrPod: exec `curl` POST /api/v2/alerts to resolve alerts (endsAt in past)
  AlertmgrAPI-->>Test: alerts resolved
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 4

❌ Failed checks (2 warnings, 2 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Inhibition test uses 10-second timeout (inadequate for cluster operations), positional Eventually arguments (inconsistent with fluent API pattern), incorrect function signature, and lacks table-driven validation of specific inhibit rule pairs. Use Eventually with fluent API (WithPolling/WithTimeout 2-minute), correct function signature receiving context parameter, and add table-driven test validating specific inhibit rule pairs with negative test case.
Microshift Test Compatibility ⚠️ Warning Tests access ClusterOperator API, Alertmanager v2 API, and openshift-monitoring namespace components unavailable on MicroShift. Add [apigroup:config.openshift.io] tag to test names or guard with exutil.IsMicroShiftCluster() check to skip on MicroShift.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive The file test/e2e/configure_alertmanager_operator_tests.go was not found or accessible for inspection. Provide the file path or contents directly, or verify the file exists in the repository and is accessible.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Unable to access the specified test files. Direct file inspection requires access to the repository code. Provide file contents or repository access to perform IPv4 hardcoding and external connectivity analysis.
✅ Passed checks (8 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: porting AlertmanagerInhibitions e2e tests from osde2e to this repository, which aligns with all the changes shown in the raw summary.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic, using clear descriptive static strings without dynamic content like pod names, timestamps, UUIDs, or IP addresses.
Topology-Aware Scheduling Compatibility ✅ Passed Pull request contains only end-to-end test code and utilities with no deployment manifests, operator code, or topology-affecting Kubernetes resources.
Ote Binary Stdout Contract ✅ Passed The pull request complies with the OTE Binary Stdout Contract. All output is properly routed through test-framework-managed channels.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 44 minutes and 20 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from iamkirkbater and ravitri April 29, 2026 05:04
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarthall

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026
@smarthall smarthall changed the title Port AlertmanagerInhibitions e2e tests from osde2e [SREP-3734] Port AlertmanagerInhibitions e2e tests from osde2e Apr 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.26%. Comparing base (81b35d1) to head (37690a4).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #515   +/-   ##
=======================================
  Coverage   68.26%   68.26%           
=======================================
  Files           8        8           
  Lines        1087     1087           
=======================================
  Hits          742      742           
  Misses        315      315           
  Partials       30       30           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/configure_alertmanager_operator_tests.go`:
- Around line 1444-1498: The test currently removes any IDP named by
identityProviderName in cleanup() and later only deletes by that name, which can
permanently mutate cluster auth; instead capture the original identityProviders
slice from oauth (e.g. store in a new originalIDPs variable after the first Get
in the test before appending degradingIDP), then modify oauth.Object to add
degradingIDP as you do, and change cleanup() to restore the entire originalIDPs
slice (using unstructured.SetNestedSlice and
dynamicClient.Resource(oauthGVR).Update) rather than deleting by name; also
invoke the deferred cleanup with a non-cancelable context (e.g.
context.Background()) so it runs even if the test context times out.
- Around line 1520-1544: The test currently treats absence or any non-"active"
state as successful inhibition; update the assertion to ensure
ClusterOperatorDown was actually suppressed by Alertmanager: first verify the
alert is firing in Prometheus (query Prometheus for
ClusterOperatorDown{name="authentication"} via the existing Prometheus client)
or, if you choose the minimal change, require that amClient.GetAlertsByName(...)
returns at least one alert whose alert.Status.State == "suppressed" and
len(alert.Status.InhibitedBy) > 0; replace the len(alerts)==0 branch with a
failure (or an explicit Prometheus firing check) and add the explicit checks for
"suppressed" and non-empty InhibitedBy in the loop that currently logs state to
guarantee suppression rather than silence/absence.

In `@test/e2e/utils/utils.go`:
- Around line 477-487: The code currently uses the first entry in status.ingress
(ingress[0]) which can be non-admitted; update the logic in the function that
builds the Alertmanager client (the code around NewAlertmanagerClient / the
route handling where variables ingress, ingressMap, host are used) to iterate
over the ingress slice and pick the first ingress entry whose conditions include
an object with type "Admitted" and status "True", then extract the host from
that admitted ingress; if none are admitted, return the existing error. Ensure
you check types when casting each ingress element to map[string]interface{} and
use unstructured.NestedSlice/NestedString on the admitted ingress map to get the
host.
- Around line 131-145: The HasInhibitRuleMatching function miscompares unordered
equal label lists and treats a missing target_match_re["alertname"] as matching;
fix by performing an order-insensitive comparison for equal (either sort both
slices before using slices.Equal or check lengths and that every element in the
provided equal slice exists in rule.Equal) and skip the rule if
rule.TargetMatchRE does not contain "alertname" (use val, ok :=
rule.TargetMatchRE["alertname"]; if !ok { continue }). Keep the existing
sourceMatch check and only call regexp.MatchString with the validated val.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5aa31286-f99c-492b-bc2c-b7983a1663f0

📥 Commits

Reviewing files that changed from the base of the PR and between 81b35d1 and 11f656c.

📒 Files selected for processing (2)
  • test/e2e/configure_alertmanager_operator_tests.go
  • test/e2e/utils/utils.go

Comment thread test/e2e/configure_alertmanager_operator_tests.go Outdated
Comment thread test/e2e/configure_alertmanager_operator_tests.go Outdated
Comment thread test/e2e/utils/utils.go Outdated
Comment thread test/e2e/utils/utils.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/utils/utils.go`:
- Line 566: The linter flags the unchecked error from defer resp.Body.Close();
update the defer to explicitly discard the Close() error so it satisfies
errcheck (e.g., use a defer that assigns the return to the blank identifier)
where resp is used in this helper in test/e2e/utils/utils.go; ensure you only
change the defer expression (referencing resp.Body.Close()) and not other logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 85889864-3b3a-422c-82ce-ca0ac5f08f31

📥 Commits

Reviewing files that changed from the base of the PR and between 11f656c and f52d558.

📒 Files selected for processing (2)
  • test/e2e/configure_alertmanager_operator_tests.go
  • test/e2e/utils/utils.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/configure_alertmanager_operator_tests.go

Comment thread test/e2e/utils/utils.go Outdated
@smarthall smarthall force-pushed the move-inhibitions-test-to-cao branch 2 times, most recently from 0a9d537 to 38f913e Compare April 30, 2026 02:05
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
test/e2e/utils/utils.go (2)

464-475: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Select an admitted ingress entry instead of assuming status.ingress[0].

Line 468 can pick a non-admitted router entry, which makes NewAlertmanagerClient flaky on clusters with multiple ingress controllers.

Proposed fix
-	ingressMap, ok := ingress[0].(map[string]interface{})
-	if !ok {
-		return nil, fmt.Errorf("unexpected ingress format in alertmanager-main route")
-	}
-	host, found, err := unstructured.NestedString(ingressMap, "host")
-	if err != nil || !found || host == "" {
-		return nil, fmt.Errorf("alertmanager-main route has no host in ingress status")
-	}
+	var host string
+	for _, item := range ingress {
+		ingressMap, ok := item.(map[string]interface{})
+		if !ok {
+			continue
+		}
+		conditions, _, _ := unstructured.NestedSlice(ingressMap, "conditions")
+		admitted := false
+		for _, c := range conditions {
+			conditionMap, ok := c.(map[string]interface{})
+			if !ok {
+				continue
+			}
+			condType, _, _ := unstructured.NestedString(conditionMap, "type")
+			condStatus, _, _ := unstructured.NestedString(conditionMap, "status")
+			if condType == "Admitted" && condStatus == "True" {
+				admitted = true
+				break
+			}
+		}
+		if admitted {
+			host, _, _ = unstructured.NestedString(ingressMap, "host")
+			if host != "" {
+				break
+			}
+		}
+	}
+	if host == "" {
+		return nil, fmt.Errorf("alertmanager-main route has no admitted ingress host")
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/utils/utils.go` around lines 464 - 475, The code currently assumes
status.ingress[0] is the admitted route which can be non-admitted; update the
selection logic in NewAlertmanagerClient (or the function containing
route/ingress parsing) to iterate the ingress slice, cast each entry to
map[string]interface{} (ingressMap), and pick the first entry whose "conditions"
array contains an object with type=="Admitted" and status=="True"; then read the
"host" from that selected ingressMap instead of using ingress[0]; keep existing
error returns if no admitted ingress or no host is found.

545-545: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle deferred Close() return to satisfy errcheck.

Line 545 currently ignores the returned error and is flagged by lint.

Proposed fix
-	defer resp.Body.Close()
+	defer func() { _ = resp.Body.Close() }()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/utils/utils.go` at line 545, Replace the bare defer
resp.Body.Close() so the returned error is handled; change it to a deferred
closure that calls resp.Body.Close() and checks the error (e.g. defer func() {
if err := resp.Body.Close(); err != nil { log.Printf("resp.Body.Close error:
%v", err) } }()), locating the call to resp.Body.Close() in the same helper in
test/e2e/utils/utils.go and use the package-appropriate logger (t.Logf,
log.Printf, or similar) available in that context.
test/e2e/configure_alertmanager_operator_tests.go (1)

1410-1415: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid first-item short-circuit; assert inhibition explicitly.

Line 1414 returns from the loop on the first non-suppressed alert, and Line 1411 accepts suppressed without checking inhibitedBy. This can produce false results when multiple matching alerts are returned.

Proposed fix
-				for _, alert := range alerts {
-					if alert.Status.State == "suppressed" {
-						return nil
-					}
-					return fmt.Errorf("ClusterOperatorDown state=%q, expected suppressed", alert.Status.State)
-				}
-				return nil
+				for _, alert := range alerts {
+					if alert.Status.State != "suppressed" || len(alert.Status.InhibitedBy) == 0 {
+						return fmt.Errorf(
+							"ClusterOperatorDown is not inhibited (state=%q, inhibitedBy=%v)",
+							alert.Status.State, alert.Status.InhibitedBy,
+						)
+					}
+				}
+				return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/configure_alertmanager_operator_tests.go` around lines 1410 - 1415,
The loop over alerts currently short-circuits on the first item and treats any
"suppressed" state as sufficient without checking inhibition; change the logic
in the alerts iteration to (1) examine every alert instead of returning from the
first iteration, (2) for each alert assert that alert.Status.State ==
"suppressed" and that the alert's inhibition marker (e.g.
alert.Status.InhibitedBy or alert.InhibitedBy—use the actual field present) is
non-empty, and (3) only succeed (return nil) if all matching alerts are
suppressed and have a non-empty inhibitedBy; otherwise return an
aggregated/first meaningful error after the loop mentioning which alerts failed
the checks. Ensure you update the loop that references alerts and
alert.Status.State to perform these checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/configure_alertmanager_operator_tests.go`:
- Around line 1410-1415: The loop over alerts currently short-circuits on the
first item and treats any "suppressed" state as sufficient without checking
inhibition; change the logic in the alerts iteration to (1) examine every alert
instead of returning from the first iteration, (2) for each alert assert that
alert.Status.State == "suppressed" and that the alert's inhibition marker (e.g.
alert.Status.InhibitedBy or alert.InhibitedBy—use the actual field present) is
non-empty, and (3) only succeed (return nil) if all matching alerts are
suppressed and have a non-empty inhibitedBy; otherwise return an
aggregated/first meaningful error after the loop mentioning which alerts failed
the checks. Ensure you update the loop that references alerts and
alert.Status.State to perform these checks.

In `@test/e2e/utils/utils.go`:
- Around line 464-475: The code currently assumes status.ingress[0] is the
admitted route which can be non-admitted; update the selection logic in
NewAlertmanagerClient (or the function containing route/ingress parsing) to
iterate the ingress slice, cast each entry to map[string]interface{}
(ingressMap), and pick the first entry whose "conditions" array contains an
object with type=="Admitted" and status=="True"; then read the "host" from that
selected ingressMap instead of using ingress[0]; keep existing error returns if
no admitted ingress or no host is found.
- Line 545: Replace the bare defer resp.Body.Close() so the returned error is
handled; change it to a deferred closure that calls resp.Body.Close() and checks
the error (e.g. defer func() { if err := resp.Body.Close(); err != nil {
log.Printf("resp.Body.Close error: %v", err) } }()), locating the call to
resp.Body.Close() in the same helper in test/e2e/utils/utils.go and use the
package-appropriate logger (t.Logf, log.Printf, or similar) available in that
context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a60d3211-5b39-4147-877e-44e109ce6431

📥 Commits

Reviewing files that changed from the base of the PR and between f52d558 and 0a9d537.

📒 Files selected for processing (2)
  • test/e2e/configure_alertmanager_operator_tests.go
  • test/e2e/utils/utils.go

@smarthall smarthall force-pushed the move-inhibitions-test-to-cao branch from 38f913e to b4bd314 Compare April 30, 2026 02:14
Move the AlertmanagerInhibitions tests into this repo's e2e suite.
Instead of degrading a real cluster operator, inject synthetic alerts
directly into Alertmanager's v2 API and verify the inhibition rule
suppresses ClusterOperatorDown when ClusterOperatorDegraded is active.

POST uses pod exec to reach each alertmanager instance directly,
since alerts posted via the API are not gossiped between peers.
Test completes in seconds with no cluster state changes.

Signed-off-by: Daniel Hall <danhall@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@smarthall smarthall force-pushed the move-inhibitions-test-to-cao branch from b4bd314 to 37690a4 Compare April 30, 2026 02:17
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 30, 2026

@smarthall: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant