fix: harden probe reliability against post-provision host settling#299
Open
l50 wants to merge 2 commits into
Open
fix: harden probe reliability against post-provision host settling#299l50 wants to merge 2 commits into
l50 wants to merge 2 commits into
Conversation
…lts (#2) **Key Changes:** - Introduced `mssqlProbe` with a sentinel-based protocol to reliably separate a genuine empty SQL result set from a host still settling after provisioning - Added retry logic in `runPSErr` and `runScriptJSON` to absorb transient empty-but-successful responses without masking real "not found" results - Upgraded all MSSQL check call sites to handle the new `(string, bool)` return so probes that cannot complete emit WARN instead of a bogus FAIL - Added comprehensive unit tests covering the sentinel, retry, recovery, and transport-error paths **Added:** - `mssqlProbe` method - new dedicated SQL probe that appends `sqlProbeSentinel` to every query script and uses its presence to distinguish a completed-but-empty result (`ok=true, rows=""`) from a probe that never finished (`ok=false`); callers WARN rather than FAIL on the latter - `sqlProbeSentinel` and `transientRetries` constants plus `backoffBase` variable and `backoffSleep` helper in `script_runner.go` to centralise retry policy and allow tests to shrink backoff to milliseconds - `runPSTransport` method extracted from `runPSErr` to own transport-level retries and dead-host tracking, keeping the two concerns separate - `probe_retry_test.go` - seven focused tests covering: genuine empty is definitive (no retry), transient empty retries then warns, mid-probe recovery, transport error does not amplify retries, `runPSErr` retry behaviour, non-empty returns immediately, and `runScriptJSON` envelope retry **Changed:** - `runPSErr` now wraps `runPSTransport` and adds an outer loop that retries empty-but-successful responses up to `transientRetries` times with backoff before returning a blank result with `nil` error, preserving the existing contract for callers - `runScriptJSON` replaced the single `runPS` call with a retry loop over `runPSErr` that re-runs when the JSON envelope is absent (a transient settling signature) and fails fast only on malformed envelopes, which indicate a script bug rather than a transient blip - `mssqlQueryFn` type signature updated from `func(...) string` to `func(...) (string, bool)` to propagate probe completion status through `checkMSSQLExtendedFeatures` - All three MSSQL check loops (sysadmins, `EXECUTE AS LOGIN`, linked servers) and the `xp_cmdshell`/`TRUSTWORTHY` checks now use `switch`/`if !ok` guards to emit WARN with a `(host settling?)` hint when a probe does not complete, instead of silently treating the empty string as a definitive negative
…are probes (#3) **Key Changes:** - Introduced `retryMustExist` to distinguish transient absences (e.g., DC mid-replication, admin shares briefly de-registering) from genuine configuration defects, reducing false FAILs - Replaced inline PowerShell strings with named constants (`scriptADUserExists`, `scriptADUserWithGroups`, `scriptAdminShares`) that use `-ErrorAction Stop` to prevent silent false negatives - Admin share checks now emit a WARN instead of silently missing the transport-error case - Added four targeted unit tests covering all retry branches of `retryMustExist` **Added:** - `probeOutcome` type and constants (`probePositive`, `probeNegative`, `probeIncomplete`) to classify probe results with clear semantics - `checks.go` - `retryMustExist` helper that retries only `probeNegative` outcomes up to `transientRetries` attempts, returning `probeIncomplete` immediately to avoid amplifying latency against slow or dead hosts - `checks.go` - Named PowerShell script constants (`scriptADUserExists`, `scriptADUserWithGroups`, `scriptAdminShares`) extracted from inline strings, with `scriptAdminShares` updated to exit non-zero on Server service query failure rather than silently returning empty output - `checks.go` - Unit tests for all `retryMustExist` branches: positive short-circuits on first call, persistent negative is trusted after all retries, transient negative recovers to positive, and incomplete does not retry - `probe_retry_test.go` **Changed:** - AD user existence checks in `checkUsernamePasswordEqual` and `checkConfiguredUsers` now wrap the probe in `retryMustExist`, converting the outcome enum back to PASS/FAIL/WARN results and preserving the probe error for WARN messages - Admin share enumeration in `checkAdminShares` now uses `retryMustExist` with `scriptAdminShares`, adding a missing `default` WARN branch for transport errors that previously had no result emitted
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Key Changes:
retryMustExisthelper andprobeOutcometype to distinguish genuine negatives from transient blips, preventing healthy labs from being reported as broken during the post-provision settling windowrunPSErrto retry empty-but-successful responses separately from transport errors, and extractedrunPSTransportto own the dead-host marking logicmssqlProbewith a sentinel-based completion signal (sqlProbeSentinel) so empty SQL result sets are definitively distinguished from truncated/incomplete outputAdded:
probeOutcomeenum (probePositive,probeNegative,probeIncomplete) andretryMustExistmethod toValidator— retries only genuine negatives, returns immediately on positive or transport error to avoid amplifying latency against slow/dead hostsmssqlProbemethod that appendssqlProbeSentinelto every SQL script so callers can distinguish a completed query with zero rows (ok=true, empty string) from a probe that never finished (ok=false) —checks.gosqlProbeSentinel,transientRetries,backoffBase, andbackoffSleepconstants/helpers toscript_runner.gofor shared retry configuration across all probe typesscriptADUserExists,scriptADUserWithGroups, andscriptAdminSharesextracted as named constants with improved error handling (e.g.,Get-ADUserwith-ErrorAction Stopto surface RPC failures rather than silently returning false negatives) —checks.goprobe_retry_test.gowith 11 tests coveringmssqlProbe,runPSErr,retryMustExist, andrunScriptJSONretry and recovery scenarios using astubProviderthat avoids real network callsChanged:
runPSErrnow retries empty-but-successful output up totransientRetriestimes with backoff before returning blank, and delegates all transport-level retry and dead-host marking to the newrunPSTransporthelper —validator.gorunScriptJSONretries when the JSON envelope is absent from a non-empty response (a settling host emitting banners before the real payload), while still failing fast on malformed envelopes (a script bug, not a transient condition) —script_runner.gocheckMSSQL,checkMSSQLExtendedFeatures) updated to usemssqlProbeand emitWARNinstead ofFAILwhen a probe cannot complete, preventing false failures during host settling —checks.gocheckUsernamePasswordEqual,checkConfiguredUsers,checkAdminShares) updated to useretryMustExistwith typed outcomes, replacing ad-hoc string matching and single-shot probes —checks.gomssqlQueryFntype signature updated fromfunc(...) stringtofunc(...) (string, bool)to propagate probe completeness to callers —checks.go