Skip to content

refactor(site-explorer): clarify the ACPowercycle-fallback log and scope its test#2738

Merged
chet merged 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2635-02
Jun 22, 2026
Merged

refactor(site-explorer): clarify the ACPowercycle-fallback log and scope its test#2738
chet merged 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2635-02

Conversation

@chet

@chet chet commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Follow-up to #2718, addressing CodeRabbit's review of that PR.

Two small changes to the PowerCycle -> ACPowercycle fallback from #2718:

  • Log clarity. The warning announced "PowerCycle refused", but the fallback
    fires on any PowerCycle failure, not only a vendor that declines the
    action. It now reports the observed failure and leaves the cause to the
    attached error field.
  • Test precision. The fallback-order test computed its
    PowerCycle-before-ACPowercycle positions across every recorded power call;
    they are now scoped to the host under test.

Deliberately not done -- gating the fallback on an explicit "refused" signal
(CodeRabbit's other suggestion). The signal that distinguishes a real refusal
(RedfishError::NotSupported / a 4xx from the BMC) exists in libredfish, but
site-explorer flattens every Redfish error into a stringly
EndpointExplorationError before the fallback sees it -- so gating cleanly would
mean a new error variant + mapper plumbing + test-mock rework. It also would not
be reliable for the vendor that matters: Dell returns a generic error in lockdown
(per a standing libredfish TODO), not NotSupported. Since ACPowercycle is
the normal reset for most vendors and a power cycle is the intended action
regardless, the broad fallback is correct as-is.

Supports #2635.

…ope its test

The ACPowercycle fallback (NVIDIA#2718) tries `PowerCycle` first and escalates to
`ACPowercycle` when that call returns any error. Its warning announced
"PowerCycle refused", but the fallback fires on every `PowerCycle` failure --
a transient transport or auth error, not only a vendor that declines the
action. The log now reports what it observed and leaves the cause to the
attached `error` field, so the line never claims a refusal that did not happen.

The fallback-order test computed its `PowerCycle`-before-`ACPowercycle`
positions across every recorded power call; they are now scoped to the host
under test so an unrelated endpoint's power action cannot skew the assertion.

- Log the `PowerCycle` -> `ACPowercycle` fallback by its observed failure rather
  than an assumed vendor refusal.
- Scope the fallback-order assertions to the host's BMC so unrelated power calls
  cannot skew them.
- Keep the fallback firing on any `PowerCycle` failure on purpose -- `ACPowercycle`
  is the normal reset for most vendors, so gating it on a parsed "refused" signal
  would add brittle error-classification for no behavioral gain.

Verified with clippy (`--all-features --all-targets`) and the
`test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused`
integration test.

This supports NVIDIA#2635

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet

chet commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@copy-pr-bot

copy-pr-bot Bot commented Jun 22, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 805af257-787a-482a-84d6-351ac09f9fe7

📥 Commits

Reviewing files that changed from the base of the PR and between f947966 and d4f3496.

📒 Files selected for processing (2)
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs

Summary by CodeRabbit

  • Bug Fixes

    • Fixed test logic to correctly validate power control call ordering for specific hosts, preventing interference from other endpoints.
  • Chores

    • Updated error logging message for improved clarity when power cycling operations fail.

Walkthrough

A warning log message in redfish_powercycle is reworded from "refused" to "failed" when PowerCycle does not succeed. The corresponding fallback test is updated to filter recorded power calls by the specific host BMC IP before asserting PowerCycle ordering relative to ACPowercycle.

Changes

PowerCycle Fallback: Log Wording and Test Scoping

Layer / File(s) Summary
Log wording update and host-scoped ordering assertion
crates/site-explorer/src/lib.rs, crates/site-explorer/tests/site_explorer.rs
Warning log on PowerCycle failure is changed from "refused" to "failed". The test test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused now filters power_calls to host_power_calls using addr.ip() == host_bmc_ip before computing the positional ordering of PowerCycle vs ACPowercycle, eliminating interference from other endpoints' power actions.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2718: Directly related — introduces the PowerCycle → ACPowercycle fallback behavior and the originating test that this PR's log and assertion corrections refine.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the two primary changes: clarifying the ACPowercycle-fallback log message and narrowing the scope of its test.
Description check ✅ Passed The description comprehensively documents both changes, explains the rationale for not implementing stricter error gating, and provides context for the design decision.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet chet marked this pull request as ready for review June 22, 2026 07:35
@chet chet requested a review from a team as a code owner June 22, 2026 07:35
@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 704 34 183 258 35 194
machine_validation 704 34 183 258 35 194
nvmetal-carbide 704 34 183 258 35 194
TOTAL 2382 108 572 879 111 712

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@chet chet merged commit 7ec3878 into NVIDIA:main Jun 22, 2026
55 checks passed
chet added a commit that referenced this pull request Jun 22, 2026
Instruments the DPU NIC-mode migration flow with a labeled counter, plus
a stale-comment fix.

Site explorer migrates a DPU into the mode its host's `dpu_mode`
declares -- find a mismatch, issue `set_nic_mode`, power-cycle the host,
register a NicMode host with zero DPUs. That flow was only observable
through logs; it now records each step as
`carbide_site_explorer_dpu_migration_signals_count`, labeled by signal:

- `mode_mismatch_found`, `set_nic_mode_issued`, `reset_requested`,
`registered_zero_dpu_for_nic_mode`
- exposed via the same observable-gauge pattern as the existing host/DPU
pairing-blocker counters
- emitted at each signal's true event site (the `metrics` handle is
threaded into the mode-check helpers)

Also corrects a `handle_no_dpu_error` doc comment in machine-controller
that equated a zero-DPU host with `NoDpu` alone -- it's equally a
`NicMode` host.

**Deferred:** the `reset-fallback-used` signal belongs in
`redfish_powercycle`, which open PR #2738 is editing; added once that
merges.

Supports #2634.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants