fix: harden probe reliability against post-provision host settling #299

Open
l50 wants to merge 2 commits into dreadnode:main from l50:upstream/validator-transient-hardening

@l50 l50 commented Jun 22, 2026

Contributor

Key Changes:

  • Introduced a retryMustExist helper and probeOutcome type to distinguish genuine negatives from transient blips, preventing healthy labs from being reported as broken during the post-provision settling window
  • Refactored runPSErr to retry empty-but-successful responses separately from transport errors, and extracted runPSTransport to own the dead-host marking logic
  • Added mssqlProbe with a sentinel-based completion signal (sqlProbeSentinel) so empty SQL result sets are definitively distinguished from truncated/incomplete output
  • Added comprehensive unit tests covering all retry and short-circuit paths for the new probe infrastructure

Added:

  • probeOutcome enum (probePositive, probeNegative, probeIncomplete) and retryMustExist method to Validator — retries only genuine negatives, returns immediately on positive or transport error to avoid amplifying latency against slow/dead hosts
  • mssqlProbe method that appends sqlProbeSentinel to every SQL script so callers can distinguish a completed query with zero rows (ok=true, empty string) from a probe that never finished (ok=false) — checks.go
  • sqlProbeSentinel and transientRetries constants, a backoffBase variable, and a backoffSleep helper in script_runner.go, giving all probe types one shared retry configuration
  • PowerShell script constants scriptADUserExists, scriptADUserWithGroups, and scriptAdminShares extracted as named constants with improved error handling (e.g., Get-ADUser with -ErrorAction Stop to surface RPC failures rather than silently returning false negatives) — checks.go
  • probe_retry_test.go with 11 tests covering mssqlProbe, runPSErr, retryMustExist, and runScriptJSON retry and recovery scenarios using a stubProvider that avoids real network calls
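
The retry rule described for retryMustExist can be sketched as below. This is a minimal illustration, not the PR's code: it is a free function rather than a Validator method, the value of transientRetries is assumed, "up to transientRetries" is read as the total number of attempts, and the backoff between attempts is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// probeOutcome classifies one probe attempt (names taken from the PR description).
type probeOutcome int

const (
	probePositive   probeOutcome = iota // the object exists
	probeNegative                       // the probe completed and found nothing
	probeIncomplete                     // the probe could not complete (e.g. transport error)
)

// transientRetries is an assumed value; the PR does not state the real one.
const transientRetries = 3

// retryMustExist re-runs probe only while it reports probeNegative.
// Positive and incomplete outcomes return at once, so a slow or dead
// host is never probed more than one time.
func retryMustExist(probe func() (probeOutcome, error)) (probeOutcome, error) {
	var lastErr error
	for attempt := 0; attempt < transientRetries; attempt++ {
		outcome, err := probe()
		if outcome != probeNegative {
			return outcome, err
		}
		lastErr = err
	}
	// Every attempt was negative, so the negative is trusted.
	return probeNegative, lastErr
}

func main() {
	calls := 0
	// Negative twice, then positive, as a settling host might behave.
	outcome, _ := retryMustExist(func() (probeOutcome, error) {
		calls++
		if calls < 3 {
			return probeNegative, nil
		}
		return probePositive, nil
	})
	fmt.Println(outcome == probePositive, calls)

	calls = 0
	outcome, err := retryMustExist(func() (probeOutcome, error) {
		calls++
		return probeIncomplete, errors.New("transport error")
	})
	fmt.Println(outcome == probeIncomplete, calls, err)
}
```

Retrying only the negative case keeps the worst-case cost bounded: a host that cannot be reached costs one attempt, while a host that answers "not found" during settling gets several chances to change its answer.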

Changed:

  • runPSErr now retries empty-but-successful output up to transientRetries times with backoff before returning blank, and delegates all transport-level retry and dead-host marking to the new runPSTransport helper — validator.go
  • runScriptJSON retries when the JSON envelope is absent from a non-empty response (a settling host emitting banners before the real payload), while still failing fast on malformed envelopes (a script bug, not a transient condition) — script_runner.go
  • MSSQL check functions (checkMSSQL, checkMSSQLExtendedFeatures) updated to use mssqlProbe and emit WARN instead of FAIL when a probe cannot complete, preventing false failures during host settling — checks.go
  • AD user and admin share checks (checkUsernamePasswordEqual, checkConfiguredUsers, checkAdminShares) updated to use retryMustExist with typed outcomes, replacing ad-hoc string matching and single-shot probes — checks.go
  • mssqlQueryFn type signature updated from func(...) string to func(...) (string, bool) to propagate probe completeness to callers — checks.go
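
The runScriptJSON change separates "no envelope yet" from "envelope present but broken". A rough sketch of that split follows. The envelope markers, the attempt budget, and the helper extractEnvelope are all hypothetical (the PR does not show the real format), an unterminated envelope is treated as malformed by assumption, and backoff is omitted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Hypothetical envelope markers; the real format is not shown in the PR.
const (
	envelopeOpen  = "<<<JSON"
	envelopeClose = "JSON>>>"
)

const maxAttempts = 3 // assumed attempt budget

var errNoEnvelope = errors.New("no JSON envelope in output")

// extractEnvelope returns the payload between the markers. A missing opening
// marker yields errNoEnvelope (treated as transient); any other error means
// the envelope was present but malformed.
func extractEnvelope(out string) (map[string]any, error) {
	start := strings.Index(out, envelopeOpen)
	if start < 0 {
		return nil, errNoEnvelope
	}
	rest := out[start+len(envelopeOpen):]
	end := strings.Index(rest, envelopeClose)
	if end < 0 {
		return nil, errors.New("envelope not terminated")
	}
	var payload map[string]any
	if err := json.Unmarshal([]byte(rest[:end]), &payload); err != nil {
		return nil, fmt.Errorf("malformed envelope: %w", err)
	}
	return payload, nil
}

// runScriptJSON re-runs the script only when the envelope is absent.
func runScriptJSON(run func() (string, error)) (map[string]any, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		out, err := run()
		if err != nil {
			return nil, err // transport error: not retried at this layer
		}
		payload, err := extractEnvelope(out)
		if err == nil {
			return payload, nil
		}
		if !errors.Is(err, errNoEnvelope) {
			return nil, err // malformed envelope: a script bug, fail fast
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	outputs := []string{
		"banner text from a settling host",
		"banner\n" + envelopeOpen + `{"ok":true}` + envelopeClose,
	}
	i := 0
	payload, err := runScriptJSON(func() (string, error) {
		out := outputs[i]
		i++
		return out, nil
	})
	fmt.Println(payload, err, i)
}
```

Using a sentinel error for the "absent" case lets the retry loop stay a plain errors.Is check, and any new failure mode defaults to failing fast rather than being retried silently.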

l50 added 2 commits June 21, 2026 20:43
…lts (#2)

**Key Changes:**

- Introduced `mssqlProbe` with a sentinel-based protocol to reliably separate a genuine empty SQL result set from a host still settling after provisioning
- Added retry logic in `runPSErr` and `runScriptJSON` to absorb transient empty-but-successful responses without masking real "not found" results
- Upgraded all MSSQL check call sites to handle the new `(string, bool)` return so probes that cannot complete emit WARN instead of a bogus FAIL
- Added comprehensive unit tests covering the sentinel, retry, recovery, and transport-error paths

**Added:**

- `mssqlProbe` method - new dedicated SQL probe that appends `sqlProbeSentinel` to every query script and uses its presence to distinguish a completed-but-empty result (`ok=true, rows=""`) from a probe that never finished (`ok=false`); callers WARN rather than FAIL on the latter
- `sqlProbeSentinel` and `transientRetries` constants plus `backoffBase` variable and `backoffSleep` helper in `script_runner.go` to centralise retry policy and allow tests to shrink backoff to milliseconds
- `runPSTransport` method extracted from `runPSErr` to own transport-level retries and dead-host tracking, keeping the two concerns separate
- `probe_retry_test.go` - seven focused tests covering: genuine empty is definitive (no retry), transient empty retries then warns, mid-probe recovery, transport error does not amplify retries, `runPSErr` retry behaviour, non-empty returns immediately, and `runScriptJSON` envelope retry
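
As an illustration of the sentinel protocol behind `mssqlProbe`, the sketch below appends a marker to the query and then looks for it in the output. The sentinel value, the use of a T-SQL `PRINT` statement to emit it, and the helper names `withSentinel` and `parseProbeOutput` are assumptions made for this example; the PR only states that a sentinel is appended and checked.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed sentinel value; the PR names sqlProbeSentinel but not its contents.
const sqlProbeSentinel = "__SQL_PROBE_DONE__"

// withSentinel appends a final statement that prints the sentinel, so the
// marker appears in the output only if the earlier statements have run.
func withSentinel(query string) string {
	return query + "\nPRINT '" + sqlProbeSentinel + "';"
}

// parseProbeOutput splits raw probe output into rows and a completion flag.
// ok=false means the sentinel never arrived, so the rows cannot be trusted.
func parseProbeOutput(raw string) (rows string, ok bool) {
	idx := strings.LastIndex(raw, sqlProbeSentinel)
	if idx < 0 {
		return "", false
	}
	return strings.TrimSpace(raw[:idx]), true
}

func main() {
	fmt.Println(withSentinel("SELECT name FROM sys.server_principals;"))

	for _, raw := range []string{
		"sa\n" + sqlProbeSentinel + "\n", // completed with one row
		sqlProbeSentinel + "\n",          // completed with zero rows
		"sa\n",                           // truncated: no sentinel
	} {
		rows, ok := parseProbeOutput(raw)
		fmt.Printf("rows=%q ok=%v\n", rows, ok)
	}
}
```

The second case is the one the PR cares about: an empty string with `ok=true` is a definitive "no rows", while the third case would map to a WARN rather than a FAIL.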

**Changed:**

- `runPSErr` now wraps `runPSTransport` and adds an outer loop that retries empty-but-successful responses up to `transientRetries` times with backoff before returning a blank result with `nil` error, preserving the existing contract for callers
- `runScriptJSON` replaced the single `runPS` call with a retry loop over `runPSErr` that re-runs when the JSON envelope is absent (a transient settling signature) and fails fast only on malformed envelopes, which indicate a script bug rather than a transient blip
- `mssqlQueryFn` type signature updated from `func(...) string` to `func(...) (string, bool)` to propagate probe completion status through `checkMSSQLExtendedFeatures`
- All three MSSQL check loops (sysadmins, `EXECUTE AS LOGIN`, linked servers) and the `xp_cmdshell`/`TRUSTWORTHY` checks now use `switch`/`if !ok` guards to emit WARN with a `(host settling?)` hint when a probe does not complete, instead of silently treating the empty string as a definitive negative
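
The outer loop that `runPSErr` adds around `runPSTransport` might look roughly like this. It is a sketch under assumptions: the transport is passed in as a plain function, the value of `transientRetries` and the linear backoff schedule are invented, and the budget is read as total attempts (the PR's wording could also mean that many retries after the first call).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

const transientRetries = 3 // assumed value

// backoffBase is a variable so tests can shrink it to almost nothing.
var backoffBase = 5 * time.Millisecond

// backoffSleep waits longer on each attempt. The linear schedule is an
// assumption; the PR does not state the real one.
func backoffSleep(attempt int) {
	time.Sleep(backoffBase * time.Duration(attempt+1))
}

// runPSErr retries empty-but-successful output. Transport errors are
// returned unchanged, because the transport layer owns its own retries.
func runPSErr(transport func() (string, error)) (string, error) {
	for attempt := 0; attempt < transientRetries; attempt++ {
		out, err := transport()
		if err != nil {
			return "", err
		}
		if strings.TrimSpace(out) != "" {
			return out, nil // non-empty output returns immediately
		}
		if attempt < transientRetries-1 {
			backoffSleep(attempt)
		}
	}
	// Still empty after every attempt: blank result, nil error.
	return "", nil
}

func main() {
	calls := 0
	out, err := runPSErr(func() (string, error) {
		calls++
		if calls == 1 {
			return "", nil // settling host: success with no output
		}
		return "True", nil
	})
	fmt.Println(out, err, calls)

	_, err = runPSErr(func() (string, error) {
		return "", errors.New("connection refused")
	})
	fmt.Println(err)
}
```

Keeping `backoffBase` as a package variable rather than a constant is what lets a test suite run the full retry path in milliseconds.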
…are probes (#3)

**Key Changes:**

- Introduced `retryMustExist` to distinguish transient absences (e.g., DC mid-replication, admin shares briefly de-registering) from genuine configuration defects, reducing false FAILs
- Replaced inline PowerShell strings with named constants (`scriptADUserExists`, `scriptADUserWithGroups`, `scriptAdminShares`) that use `-ErrorAction Stop` to prevent silent false negatives
- Admin share checks now emit a WARN on transport errors, a case that previously produced no result at all
- Added four targeted unit tests covering all retry branches of `retryMustExist`

**Added:**

- `probeOutcome` type and constants (`probePositive`, `probeNegative`, `probeIncomplete`) to classify probe results with clear semantics - `checks.go`
- `retryMustExist` helper that retries only `probeNegative` outcomes up to `transientRetries` attempts, returning `probeIncomplete` immediately to avoid amplifying latency against slow or dead hosts - `checks.go`
- Named PowerShell script constants (`scriptADUserExists`, `scriptADUserWithGroups`, `scriptAdminShares`) extracted from inline strings, with `scriptAdminShares` updated to exit non-zero on Server service query failure rather than silently returning empty output - `checks.go`
- Unit tests for all `retryMustExist` branches: positive short-circuits on first call, persistent negative is trusted after all retries, transient negative recovers to positive, and incomplete does not retry - `probe_retry_test.go`

**Changed:**

- AD user existence checks in `checkUsernamePasswordEqual` and `checkConfiguredUsers` now wrap the probe in `retryMustExist`, converting the outcome enum back to PASS/FAIL/WARN results and preserving the probe error for WARN messages
- Admin share enumeration in `checkAdminShares` now uses `retryMustExist` with `scriptAdminShares`, adding the previously missing `default` branch so that transport errors emit a WARN instead of producing no result
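
Converting the outcome enum back into check results, including the `default` WARN branch, could be sketched as follows. The `result` type, its fields, the message wording, and the function name `userResult` are hypothetical stand-ins; only the outcome names and the PASS/FAIL/WARN mapping come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// probeOutcome mirrors the three outcomes named in the PR.
type probeOutcome int

const (
	probePositive probeOutcome = iota
	probeNegative
	probeIncomplete
)

// result is a hypothetical stand-in for the validator's check result type.
type result struct {
	Status  string // "PASS", "FAIL", or "WARN"
	Message string
}

// userResult converts the outcome of an AD user probe into a check result.
func userResult(user string, outcome probeOutcome, probeErr error) result {
	switch outcome {
	case probePositive:
		return result{Status: "PASS", Message: fmt.Sprintf("user %s exists", user)}
	case probeNegative:
		return result{Status: "FAIL", Message: fmt.Sprintf("user %s not found", user)}
	default:
		// probeIncomplete, or any outcome added later: always emit something,
		// and never report an unverified state as a FAIL.
		return result{
			Status:  "WARN",
			Message: fmt.Sprintf("could not verify user %s (host settling?): %v", user, probeErr),
		}
	}
}

func main() {
	fmt.Println(userResult("alice", probePositive, nil))
	fmt.Println(userResult("bob", probeNegative, nil))
	fmt.Println(userResult("carol", probeIncomplete, errors.New("RPC server unavailable")))
}
```

Routing the incomplete case through `default` rather than an explicit `case probeIncomplete` means an unhandled outcome can never fall through without a result, which is the gap the admin share check had before.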