refactor(site-explorer): clarify the ACPowercycle-fallback log and scope its test#2738
Conversation
…ope its test The ACPowercycle fallback (NVIDIA#2718) tries `PowerCycle` first and escalates to `ACPowercycle` when that call returns any error. Its warning announced "PowerCycle refused", but the fallback fires on every `PowerCycle` failure -- a transient transport or auth error, not only a vendor that declines the action. The log now reports what it observed and leaves the cause to the attached `error` field, so the line never claims a refusal that did not happen. The fallback-order test computed its `PowerCycle`-before-`ACPowercycle` positions across every recorded power call; they are now scoped to the host under test so an unrelated endpoint's power action cannot skew the assertion. - Log the `PowerCycle` -> `ACPowercycle` fallback by its observed failure rather than an assumed vendor refusal. - Scope the fallback-order assertions to the host's BMC so unrelated power calls cannot skew them. - Keep the fallback firing on any `PowerCycle` failure on purpose -- `ACPowercycle` is the normal reset for most vendors, so gating it on a parsed "refused" signal would add brittle error-classification for no behavioral gain. Verified with clippy (`--all-features --all-targets`) and the `test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused` integration test. This supports NVIDIA#2635 Signed-off-by: Chet Nichols III <chetn@nvidia.com>
|
@coderabbitai review |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
✅ Action performedReview finished.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
Summary by CodeRabbit
WalkthroughA warning log message in ChangesPowerCycle Fallback: Log Wording and Test Scoping
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Instruments the DPU NIC-mode migration flow with a labeled counter, plus a stale-comment fix. Site explorer migrates a DPU into the mode its host's `dpu_mode` declares -- find a mismatch, issue `set_nic_mode`, power-cycle the host, register a NicMode host with zero DPUs. That flow was only observable through logs; it now records each step as `carbide_site_explorer_dpu_migration_signals_count`, labeled by signal: - `mode_mismatch_found`, `set_nic_mode_issued`, `reset_requested`, `registered_zero_dpu_for_nic_mode` - exposed via the same observable-gauge pattern as the existing host/DPU pairing-blocker counters - emitted at each signal's true event site (the `metrics` handle is threaded into the mode-check helpers) Also corrects a `handle_no_dpu_error` doc comment in machine-controller that equated a zero-DPU host with `NoDpu` alone -- it's equally a `NicMode` host. **Deferred:** the `reset-fallback-used` signal belongs in `redfish_powercycle`, which open PR #2738 is editing; added once that merges. Supports #2634. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Follow-up to #2718, addressing CodeRabbit's review of that PR.
Two small changes to the
PowerCycle->ACPowercyclefallback from #2718:fires on any
PowerCyclefailure, not only a vendor that declines theaction. It now reports the observed failure and leaves the cause to the
attached
errorfield.PowerCycle-before-ACPowercyclepositions across every recorded power call;they are now scoped to the host under test.
Deliberately not done -- gating the fallback on an explicit "refused" signal
(CodeRabbit's other suggestion). The signal that distinguishes a real refusal
(
RedfishError::NotSupported/ a 4xx from the BMC) exists inlibredfish, butsite-explorer flattens every Redfish error into a stringly
EndpointExplorationErrorbefore the fallback sees it -- so gating cleanly wouldmean a new error variant + mapper plumbing + test-mock rework. It also would not
be reliable for the vendor that matters: Dell returns a generic error in lockdown
(per a standing
libredfishTODO), notNotSupported. SinceACPowercycleisthe normal reset for most vendors and a power cycle is the intended action
regardless, the broad fallback is correct as-is.
Supports #2635.