feat(site-explorer): instrument the DPU NIC-mode migration flow#2748
Conversation
Site explorer drives a DPU into the mode its host's `dpu_mode` declares -- finding a mismatch, issuing `set_nic_mode`, power-cycling the host to apply it, and registering a NicMode host with zero managed DPUs. That flow was observable only through scattered logs. It now records each step as a labeled counter (`carbide_site_explorer_dpu_migration_signals_count`), so the migration shows up in metrics -- and the gap between "mismatch found" and "set_nic_mode issued" surfaces flips that failed. It also corrects a stale comment in the machine-controller's NoDpu handling that read `expected_dpu_count == 0` as `NoDpu` alone: a zero-DPU host is equally a `NicMode` host whose DPU runs as a plain NIC, and both rightly have no DPU to answer Redfish. - Add a `DpuMigrationSignal` counter for the four migration steps, exposed via the same observable gauge as the host/DPU pairing blockers. - Emit each signal at its true event site, threading `metrics` into the mode-check helpers so the mismatch and the flip are counted where they happen rather than at a caller that would miscount. - Correct the `handle_no_dpu_error` doc comment to name both `NoDpu` and `NicMode`. The `reset-fallback-used` signal is deferred: it belongs in `redfish_powercycle`, which an open follow-up (NVIDIA#2738) is editing, and will be added once that merges. This supports NVIDIA#2634 Signed-off-by: Chet Nichols III <chetn@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
@coderabbitai review |
Summary by CodeRabbit
WalkthroughIntroduces a ChangesDPU NIC-mode migration metrics instrumentation
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/site-explorer/src/lib.rs (1)
1491-1500: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winRemove the new
#[allow(clippy::too_many_arguments)]by narrowing function inputs.Line 1491 adds a new clippy suppression. Please refactor
record_host_dpu_deviceto accept a small context struct (or grouped params) instead of allowing the lint.Refactor direction
- #[allow(clippy::too_many_arguments)] - async fn record_host_dpu_device( - &self, - part_number: Option<&str>, - serial_number: Option<&str>, - dpu_sn_to_endpoint: &HashMap<String, ExploredEndpoint>, - host_dpu_mode: DpuMode, - host_ep: &ExploredEndpoint, - exploration: &mut DpuExplorationState, - metrics: &mut SiteExplorationMetrics, - ) + struct HostDpuDeviceContext<'a> { + part_number: Option<&'a str>, + serial_number: Option<&'a str>, + dpu_sn_to_endpoint: &'a HashMap<String, ExploredEndpoint>, + host_dpu_mode: DpuMode, + host_ep: &'a ExploredEndpoint, + exploration: &'a mut DpuExplorationState, + metrics: &'a mut SiteExplorationMetrics, + } + + async fn record_host_dpu_device(&self, ctx: HostDpuDeviceContext<'_>)As per coding guidelines, “Enable all clippy lints by default, and treat all warnings as errors… Fix warnings and clippy lints instead of using
#[allow(...)]unless there is a strong reason.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/lib.rs` around lines 1491 - 1500, Remove the #[allow(clippy::too_many_arguments)] annotation from the record_host_dpu_device function by refactoring the function signature. Instead of accepting eight separate parameters, create a new context struct that groups related parameters together (such as part_number and serial_number, or DPU-related fields like dpu_sn_to_endpoint and host_dpu_mode), then update the function to accept this struct. This will reduce the total number of parameters below the clippy threshold and eliminate the need for the lint suppression.Source: Coding guidelines
🧹 Nitpick comments (2)
crates/site-explorer/src/metrics.rs (2)
83-93: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd a table-driven test for the
DpuMigrationSignallabel contract.This
Displaymapping is now a metric label API contract. Please add a table-driven test that asserts each variant maps to the expected string to prevent silent label regressions.As per coding guidelines, “Prefer table-driven tests for any function that maps inputs to outputs … Use
carbide_test_support::value_scenarios!macro for total operations.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/metrics.rs` around lines 83 - 93, The DpuMigrationSignal Display implementation now serves as a metric label API contract and needs test coverage to prevent regressions. Create a table-driven test using the carbide_test_support::value_scenarios! macro that verifies each variant of DpuMigrationSignal (ModeMismatchFound, SetNicModeIssued, ResetRequested, RegisteredZeroDpuForNicMode) correctly maps to its expected string representation ("mode_mismatch_found", "set_nic_mode_issued", "reset_requested", "registered_zero_dpu_for_nic_mode"). The test should assert the Display trait output for each variant matches the expected string to catch any silent label changes.Source: Coding guidelines
158-162: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winUse
DpuMigrationSignalas the map key instead ofString.Line 285 currently converts every signal to
Stringfor storage, and Line 646 then reuses raw strings as labels. KeepingHashMap<DpuMigrationSignal, usize>in-memory is safer and avoids key drift; convert to label only at emit time.Suggested refactor
- pub dpu_migration_signals: HashMap<String, usize>, + pub dpu_migration_signals: HashMap<DpuMigrationSignal, usize>,pub fn increment_dpu_migration_signal(&mut self, signal: DpuMigrationSignal) { *self .dpu_migration_signals - .entry(signal.to_string()) + .entry(signal) .or_default() += 1; }- for (signal, &count) in metrics.dpu_migration_signals.iter() { + for (signal, &count) in metrics.dpu_migration_signals.iter() { observer.observe( count as u64, - &[attrs, &[KeyValue::new("signal", signal.clone())]].concat(), + &[attrs, &[KeyValue::new("signal", signal.to_string())]].concat(), ); }Also applies to: 285-290, 646-650
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/metrics.rs` around lines 158 - 162, The dpu_migration_signals HashMap currently uses String as the key type, which causes the signal to be converted to String at line 285 for storage and then reused as raw strings at line 646 for labels, creating potential for key drift. Change the dpu_migration_signals field definition from HashMap<String, usize> to HashMap<DpuMigrationSignal, usize>, then update the code at line 285 to store the DpuMigrationSignal directly without converting it to String, and finally update the code at line 646 to convert the DpuMigrationSignal to its label string representation only at emit time when creating the metric output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 1491-1500: Remove the #[allow(clippy::too_many_arguments)]
annotation from the record_host_dpu_device function by refactoring the function
signature. Instead of accepting eight separate parameters, create a new context
struct that groups related parameters together (such as part_number and
serial_number, or DPU-related fields like dpu_sn_to_endpoint and host_dpu_mode),
then update the function to accept this struct. This will reduce the total
number of parameters below the clippy threshold and eliminate the need for the
lint suppression.
---
Nitpick comments:
In `@crates/site-explorer/src/metrics.rs`:
- Around line 83-93: The DpuMigrationSignal Display implementation now serves as
a metric label API contract and needs test coverage to prevent regressions.
Create a table-driven test using the carbide_test_support::value_scenarios!
macro that verifies each variant of DpuMigrationSignal (ModeMismatchFound,
SetNicModeIssued, ResetRequested, RegisteredZeroDpuForNicMode) correctly maps to
its expected string representation ("mode_mismatch_found",
"set_nic_mode_issued", "reset_requested", "registered_zero_dpu_for_nic_mode").
The test should assert the Display trait output for each variant matches the
expected string to catch any silent label changes.
- Around line 158-162: The dpu_migration_signals HashMap currently uses String
as the key type, which causes the signal to be converted to String at line 285
for storage and then reused as raw strings at line 646 for labels, creating
potential for key drift. Change the dpu_migration_signals field definition from
HashMap<String, usize> to HashMap<DpuMigrationSignal, usize>, then update the
code at line 285 to store the DpuMigrationSignal directly without converting it
to String, and finally update the code at line 646 to convert the
DpuMigrationSignal to its label string representation only at emit time when
creating the metric output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4d4d48c0-ec33-45b6-aeb5-13d054ae8cb4
📒 Files selected for processing (3)
crates/machine-controller/src/handler.rscrates/site-explorer/src/lib.rscrates/site-explorer/src/metrics.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Instruments the DPU NIC-mode migration flow with a labeled counter, plus a stale-comment fix.
Site explorer migrates a DPU into the mode its host's
dpu_modedeclares -- find a mismatch, issueset_nic_mode, power-cycle the host, register a NicMode host with zero DPUs. That flow was only observable through logs; it now records each step ascarbide_site_explorer_dpu_migration_signals_count, labeled by signal:mode_mismatch_found,set_nic_mode_issued,reset_requested,registered_zero_dpu_for_nic_modemetricshandle is threaded into the mode-check helpers)Also corrects a
handle_no_dpu_errordoc comment in machine-controller that equated a zero-DPU host withNoDpualone -- it's equally aNicModehost.Deferred: the
reset-fallback-usedsignal belongs inredfish_powercycle, which open PR #2738 is editing; added once that merges.Supports #2634.