
feat(site-explorer): instrument the DPU NIC-mode migration flow #2748

Merged: chet merged 1 commit into NVIDIA:main from chet:gh-issue-2634 on Jun 22, 2026

chet commented Jun 22, 2026

Instruments the DPU NIC-mode migration flow with a labeled counter, plus a stale-comment fix.

Site explorer migrates a DPU into the mode its host's dpu_mode declares -- find a mismatch, issue set_nic_mode, power-cycle the host, register a NicMode host with zero DPUs. That flow was only observable through logs; it now records each step as carbide_site_explorer_dpu_migration_signals_count, labeled by signal:

  • mode_mismatch_found, set_nic_mode_issued, reset_requested, registered_zero_dpu_for_nic_mode
  • exposed via the same observable-gauge pattern as the existing host/DPU pairing-blocker counters
  • emitted at each signal's true event site (the metrics handle is threaded into the mode-check helpers)
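The four labels and the counter behind them can be sketched as follows. This is a minimal illustration built only from the names in this description: `MigrationCounts` is a hypothetical stand-in for `SiteExplorationMetrics`, and the real code registers an OTel observable gauge that is omitted here.

```rust
use std::collections::HashMap;
use std::fmt;

/// The four migration steps named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DpuMigrationSignal {
    ModeMismatchFound,
    SetNicModeIssued,
    ResetRequested,
    RegisteredZeroDpuForNicMode,
}

impl fmt::Display for DpuMigrationSignal {
    /// Renders the value used for the `signal` metric label.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self {
            Self::ModeMismatchFound => "mode_mismatch_found",
            Self::SetNicModeIssued => "set_nic_mode_issued",
            Self::ResetRequested => "reset_requested",
            Self::RegisteredZeroDpuForNicMode => "registered_zero_dpu_for_nic_mode",
        };
        f.write_str(label)
    }
}

/// Hypothetical stand-in for `SiteExplorationMetrics`, reduced to one field.
#[derive(Debug, Default)]
pub struct MigrationCounts {
    pub dpu_migration_signals: HashMap<String, usize>,
}

impl MigrationCounts {
    /// Adds one to the count stored under the signal's label.
    pub fn increment_dpu_migration_signal(&mut self, signal: DpuMigrationSignal) {
        *self
            .dpu_migration_signals
            .entry(signal.to_string())
            .or_default() += 1;
    }
}
```

A label that has never been incremented has no entry in the map, so a consumer of this sketch should treat a missing label as zero.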

Also corrects a handle_no_dpu_error doc comment in machine-controller that equated a zero-DPU host with NoDpu alone -- it's equally a NicMode host.

Deferred: the reset-fallback-used signal belongs in redfish_powercycle, which open PR #2738 is editing; added once that merges.

Supports #2634.

Site explorer drives a DPU into the mode its host's `dpu_mode` declares --
finding a mismatch, issuing `set_nic_mode`, power-cycling the host to apply
it, and registering a NicMode host with zero managed DPUs. That flow was
observable only through scattered logs. It now records each step as a labeled
counter (`carbide_site_explorer_dpu_migration_signals_count`), so the
migration shows up in metrics -- and the gap between "mismatch found" and
"set_nic_mode issued" surfaces flips that failed.
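
The gap mentioned above is a subtraction over two label values. A minimal sketch, assuming one snapshot of the counts has been read into a map keyed by signal label; `unissued_flips` and the map shape are hypothetical and not part of this change.

```rust
use std::collections::HashMap;

/// Mismatches that never reached `set_nic_mode`, derived from labeled counts.
/// A missing label is treated as zero, and `saturating_sub` keeps the result
/// at zero if the two values were read at slightly different moments.
pub fn unissued_flips(counts: &HashMap<String, u64>) -> u64 {
    let found = counts.get("mode_mismatch_found").copied().unwrap_or(0);
    let issued = counts.get("set_nic_mode_issued").copied().unwrap_or(0);
    found.saturating_sub(issued)
}
```

With five mismatches found and three `set_nic_mode` calls issued, the function returns 2.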

It also corrects a stale comment in the machine-controller's NoDpu handling
that read `expected_dpu_count == 0` as `NoDpu` alone: a zero-DPU host is
equally a `NicMode` host whose DPU runs as a plain NIC, and both rightly have
no DPU to answer Redfish.

- Add a `DpuMigrationSignal` counter for the four migration steps, exposed via
  the same observable gauge as the host/DPU pairing blockers.
- Emit each signal at its true event site, threading `metrics` into the
  mode-check helpers so the mismatch and the flip are counted where they
  happen rather than at a caller that would miscount.
- Correct the `handle_no_dpu_error` doc comment to name both `NoDpu` and
  `NicMode`.
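
The second bullet can be illustrated with a reduced helper that takes the metrics handle by mutable reference. This is a sketch under assumptions: `Mode`, `Counts`, and `check_and_configure_mode` are hypothetical simplifications of the real `DpuMode`, `SiteExplorationMetrics`, and `check_and_configure_dpu_mode`, and it assumes the "issued" count is recorded only after the call succeeds.

```rust
/// Hypothetical two-state mode; the real `DpuMode` type is not shown in this PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Dpu,
    Nic,
}

/// Hypothetical counters standing in for the metrics handle.
#[derive(Debug, Default)]
pub struct Counts {
    pub mode_mismatch_found: usize,
    pub set_nic_mode_issued: usize,
}

/// Compares declared and observed mode and issues a flip on mismatch.
/// Returns `Ok(true)` when a flip was issued and `Ok(false)` when none was needed.
pub fn check_and_configure_mode(
    declared: Mode,
    observed: Mode,
    set_mode: impl FnOnce(Mode) -> Result<(), String>,
    counts: &mut Counts,
) -> Result<bool, String> {
    if declared == observed {
        return Ok(false);
    }
    // Counted where the mismatch is detected, not at the caller.
    counts.mode_mismatch_found += 1;
    // A failed call returns here, so the next counter is not touched.
    set_mode(declared)?;
    // Counted only once the flip has actually been issued (assumption).
    counts.set_nic_mode_issued += 1;
    Ok(true)
}
```

Counting inside the helper means a caller that invokes it several times per host cannot double-count or skip a step, which is the miscount the bullet refers to.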

The `reset-fallback-used` signal is deferred: it belongs in `redfish_powercycle`,
which an open follow-up (NVIDIA#2738) is editing, and will be added once that merges.

This supports NVIDIA#2634.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

copy-pr-bot Bot commented Jun 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


chet commented Jun 22, 2026

@coderabbitai review

coderabbitai Bot commented Jun 22, 2026

Summary by CodeRabbit

  • New Features

    • Added enhanced metrics tracking for DPU NIC-mode migration steps, providing improved observability into the DPU mode reconciliation process and host configuration changes.
  • Documentation

    • Clarified documentation regarding zero-DPU host behavior in both dedicated DPU and NIC-mode configurations.

Walkthrough

Introduces a DpuMigrationSignal enum in site-explorer/src/metrics.rs, extends SiteExplorationMetrics with a dpu_migration_signals counter map and increment_dpu_migration_signal() method, registers a new OTel gauge, threads metrics through record_host_dpu_device and check_and_configure_dpu_mode, and increments signals at mode-mismatch, NIC-mode set, reset-request, and NIC-mode registration points. A comment in handler.rs is updated to cover the NicMode case alongside NoDpu.

Changes

DPU NIC-mode migration metrics instrumentation

| Layer / File(s) | Summary |
|---|---|
| DpuMigrationSignal contract and SiteExplorationMetrics extension (crates/site-explorer/src/metrics.rs) | Defines the DpuMigrationSignal enum (ResetRequested, RegisteredZeroDpuForNicMode, ModeMismatchFound, SetNicModeIssued) with Display label mappings. Adds dpu_migration_signals: HashMap<String, usize> field to SiteExplorationMetrics, initializes it in new(), and adds increment_dpu_migration_signal(). Registers the carbide_site_explorer_dpu_migration_signals_count OTel observable gauge emitting counts per signal label. |
| Metrics threading and signal emission in site-explorer (crates/site-explorer/src/lib.rs, crates/machine-controller/src/handler.rs) | Imports DpuMigrationSignal. Extends record_host_dpu_device and check_and_configure_dpu_mode signatures with metrics: &mut SiteExplorationMetrics. All call sites in identify_managed_hosts pass metrics. Increments ModeMismatchFound and SetNicModeIssued inside check_and_configure_dpu_mode, ResetRequested at the power-cycle scheduling point, and RegisteredZeroDpuForNicMode on the NicMode early-return path. Adds #[allow(clippy::too_many_arguments)] to record_host_dpu_device. Updates handle_no_dpu_error comment in handler.rs to include the NicMode case. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2687: Modifies check_and_configure_dpu_mode and the fallback-serial match path in lib.rs, the exact same functions this PR extends with metrics threading.
  • NVIDIA/infra-controller#2718: Changes the redfish_powercycle implementation used on the reset path where this PR now emits DpuMigrationSignal::ResetRequested, directly connecting the two flows.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: instrumentation of the DPU NIC-mode migration flow in site-explorer. It accurately summarizes the primary objective of the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly addresses the changeset, detailing the metric instrumentation, signal labels, implementation approach, and the doc comment correction. It provides sufficient context about the migration flow and rationale. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot commented Jun 22, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/site-explorer/src/lib.rs (1)

1491-1500: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Remove the new #[allow(clippy::too_many_arguments)] by narrowing function inputs.

Line 1491 adds a new clippy suppression. Please refactor record_host_dpu_device to accept a small context struct (or grouped params) instead of allowing the lint.

Refactor direction
- #[allow(clippy::too_many_arguments)]
- async fn record_host_dpu_device(
-     &self,
-     part_number: Option<&str>,
-     serial_number: Option<&str>,
-     dpu_sn_to_endpoint: &HashMap<String, ExploredEndpoint>,
-     host_dpu_mode: DpuMode,
-     host_ep: &ExploredEndpoint,
-     exploration: &mut DpuExplorationState,
-     metrics: &mut SiteExplorationMetrics,
- )
+ struct HostDpuDeviceContext<'a> {
+     part_number: Option<&'a str>,
+     serial_number: Option<&'a str>,
+     dpu_sn_to_endpoint: &'a HashMap<String, ExploredEndpoint>,
+     host_dpu_mode: DpuMode,
+     host_ep: &'a ExploredEndpoint,
+     exploration: &'a mut DpuExplorationState,
+     metrics: &'a mut SiteExplorationMetrics,
+ }
+
+ async fn record_host_dpu_device(&self, ctx: HostDpuDeviceContext<'_>)

As per coding guidelines, “Enable all clippy lints by default, and treat all warnings as errors… Fix warnings and clippy lints instead of using #[allow(...)] unless there is a strong reason.”
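
A compilable miniature of the context-struct direction, for illustration only. `Endpoint`, `Metrics`, `HostDeviceContext`, and `record_host_device` are hypothetical stand-ins with fewer fields than the real signature, and the function is synchronous to keep the sketch self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `ExploredEndpoint`.
pub struct Endpoint {
    pub address: String,
}

/// Hypothetical stand-in for `SiteExplorationMetrics`.
#[derive(Default)]
pub struct Metrics {
    pub devices_seen: usize,
}

/// Groups the borrowed inputs so the function takes a single argument.
pub struct HostDeviceContext<'a> {
    pub serial_number: Option<&'a str>,
    pub sn_to_endpoint: &'a HashMap<String, Endpoint>,
    pub metrics: &'a mut Metrics,
}

/// Looks up the endpoint address for the device's serial number, if any.
pub fn record_host_device<'a>(ctx: HostDeviceContext<'a>) -> Option<&'a str> {
    // Destructuring gives each field its own binding, including the `&mut`.
    let HostDeviceContext {
        serial_number,
        sn_to_endpoint,
        metrics,
    } = ctx;
    metrics.devices_seen += 1;
    let endpoint = sn_to_endpoint.get(serial_number?)?;
    Some(endpoint.address.as_str())
}
```

The trade-off is that every call site now builds a struct literal with named fields, which is more verbose but makes argument order mistakes between the two `Option<&str>` parameters harder to make.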

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 1491 - 1500, Remove the
#[allow(clippy::too_many_arguments)] annotation from the record_host_dpu_device
function by refactoring the function signature. Instead of accepting eight
separate parameters, create a new context struct that groups related parameters
together (such as part_number and serial_number, or DPU-related fields like
dpu_sn_to_endpoint and host_dpu_mode), then update the function to accept this
struct. This will reduce the total number of parameters below the clippy
threshold and eliminate the need for the lint suppression.

Source: Coding guidelines

🧹 Nitpick comments (2)
crates/site-explorer/src/metrics.rs (2)

83-93: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a table-driven test for the DpuMigrationSignal label contract.

This Display mapping is now a metric label API contract. Please add a table-driven test that asserts each variant maps to the expected string to prevent silent label regressions.

As per coding guidelines, “Prefer table-driven tests for any function that maps inputs to outputs … Use carbide_test_support::value_scenarios! macro for total operations.”
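
A table-driven check of the label contract might look like the sketch below. It uses plain Rust rather than `carbide_test_support::value_scenarios!`, whose syntax is not shown in this thread, and it redefines the enum locally so the example stands alone; `LABEL_TABLE` and `label_mismatches` are hypothetical names.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DpuMigrationSignal {
    ModeMismatchFound,
    SetNicModeIssued,
    ResetRequested,
    RegisteredZeroDpuForNicMode,
}

impl fmt::Display for DpuMigrationSignal {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::ModeMismatchFound => "mode_mismatch_found",
            Self::SetNicModeIssued => "set_nic_mode_issued",
            Self::ResetRequested => "reset_requested",
            Self::RegisteredZeroDpuForNicMode => "registered_zero_dpu_for_nic_mode",
        })
    }
}

/// One row per variant: the value and the label it must keep.
pub const LABEL_TABLE: [(DpuMigrationSignal, &str); 4] = [
    (DpuMigrationSignal::ModeMismatchFound, "mode_mismatch_found"),
    (DpuMigrationSignal::SetNicModeIssued, "set_nic_mode_issued"),
    (DpuMigrationSignal::ResetRequested, "reset_requested"),
    (DpuMigrationSignal::RegisteredZeroDpuForNicMode, "registered_zero_dpu_for_nic_mode"),
];

/// Returns a description of every row whose rendered label differs from the table.
pub fn label_mismatches() -> Vec<String> {
    LABEL_TABLE
        .iter()
        .filter(|(signal, expected)| signal.to_string() != *expected)
        .map(|(signal, expected)| format!("{signal:?}: expected {expected}, got {signal}"))
        .collect()
}

#[test]
fn dpu_migration_signal_labels_are_stable() {
    assert_eq!(label_mismatches(), Vec::<String>::new());
}
```

A table like this catches a renamed label but not a newly added variant that has no row, so it complements the exhaustive `match` in `Display` and does not replace it.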

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/metrics.rs` around lines 83 - 93, The
DpuMigrationSignal Display implementation now serves as a metric label API
contract and needs test coverage to prevent regressions. Create a table-driven
test using the carbide_test_support::value_scenarios! macro that verifies each
variant of DpuMigrationSignal (ModeMismatchFound, SetNicModeIssued,
ResetRequested, RegisteredZeroDpuForNicMode) correctly maps to its expected
string representation ("mode_mismatch_found", "set_nic_mode_issued",
"reset_requested", "registered_zero_dpu_for_nic_mode"). The test should assert
the Display trait output for each variant matches the expected string to catch
any silent label changes.

Source: Coding guidelines


158-162: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Use DpuMigrationSignal as the map key instead of String.

Line 285 currently converts every signal to String for storage, and Line 646 then reuses the raw strings as labels. Keeping HashMap<DpuMigrationSignal, usize> in memory is safer and avoids key drift; convert to the label only at emit time. This requires DpuMigrationSignal to implement Eq and Hash.

Suggested refactor
- pub dpu_migration_signals: HashMap<String, usize>,
+ pub dpu_migration_signals: HashMap<DpuMigrationSignal, usize>,
 pub fn increment_dpu_migration_signal(&mut self, signal: DpuMigrationSignal) {
     *self
         .dpu_migration_signals
-        .entry(signal.to_string())
+        .entry(signal)
         .or_default() += 1;
 }
 for (signal, &count) in metrics.dpu_migration_signals.iter() {
     observer.observe(
         count as u64,
-        &[attrs, &[KeyValue::new("signal", signal.clone())]].concat(),
+        &[attrs, &[KeyValue::new("signal", signal.to_string())]].concat(),
     );
 }

Also applies to: 285-290, 646-650

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/metrics.rs` around lines 158 - 162, The
dpu_migration_signals HashMap currently uses String as the key type, which
causes the signal to be converted to String at line 285 for storage and then
reused as raw strings at line 646 for labels, creating potential for key drift.
Change the dpu_migration_signals field definition from HashMap<String, usize> to
HashMap<DpuMigrationSignal, usize>, then update the code at line 285 to store
the DpuMigrationSignal directly without converting it to String, and finally
update the code at line 646 to convert the DpuMigrationSignal to its label
string representation only at emit time when creating the metric output.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d4d48c0-ec33-45b6-aeb5-13d054ae8cb4

📥 Commits

Reviewing files that changed from the base of the PR and between f947966 and 48ceb6c.

📒 Files selected for processing (3)
  • crates/machine-controller/src/handler.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/src/metrics.rs

@chet chet marked this pull request as ready for review June 22, 2026 07:35
@chet chet requested a review from a team as a code owner June 22, 2026 07:35
@github-actions

🔍 Container Scan Summary

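| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 264 | 6 | 23 | 99 | 6 | 130 |
| machine-validation-runner | 704 | 34 | 183 | 258 | 35 | 194 |
| machine_validation | 704 | 34 | 183 | 258 | 35 | 194 |
| nvmetal-carbide | 704 | 34 | 183 | 258 | 35 | 194 |
| TOTAL | 2382 | 108 | 572 | 879 | 111 | 712 |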

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@chet chet merged commit 09809c2 into NVIDIA:main Jun 22, 2026
55 checks passed