test/storm: shared WaitForLogin SSH fallback for reboot waits #695

Draft
bfjelds wants to merge 1 commit into main from user/bfjelds/storm-ssh-login-fallback

bfjelds commented Jun 24, 2026

Summary

Introduces a single shared helper, stormvm.WaitForLoginWithSshFallback, for waiting on a VM to return after a reboot, and consolidates both the rollback and servicing test suites onto it.

On QEMU it waits for the serial login: prompt and, if that times out, falls back to confirming the reboot over SSH by comparing uptime --since before and after. This tolerates the known serial-getty udev race (systemd#10850, ~2% of boots) where dev-ttyS0.device is skipped so serial-getty never starts even though the VM is healthy. On genuine failure it captures a screenshot and scans the serial log for dracut/initramfs symptoms (CheckSerialLogForDracutIssues, bug 15086). It also proactively (re)starts serial-getty@ttyS0 so later boots are detected via the serial log, and handles the Azure platform via SSH liveness polling.

Changes

  • tools/storm/utils/vm/wait_login.go (new): shared WaitForLoginWithSshFallback and CheckSerialLogForDracutIssues.
  • tools/storm/rollback/tests/helper.go: the update, rollback, and split-rollback reboot paths call the shared helper instead of QemuConfig.WaitForLogin.
  • tools/storm/servicing/tests/update.go: the finalize-reboot path calls the shared helper, removing ~90 lines of inline SSH-fallback logic and the duplicated local checkSerialLogForDracutIssues.
  • tools/storm/utils/vm/qemu/qemu.go: remove the stale local serial.log accumulator before each WaitForLogin so each saved NNN-serial.log contains only that boot.

Notes

  • The i%10 periodic-reboot path in update.go still uses RebootQemuVm with a lightweight SSH liveness check. It cannot call the shared helper without an import cycle (RebootQemuVm is in package qemu, which package vm imports), and it bundles libvirt reboot+wait rather than duplicating the fallback.

Validation

  • go build ./... and go vet pass for the affected packages (one pre-existing %w-in-logrus.Errorf vet warning is unrelated to this change).

Introduce stormvm.WaitForLoginWithSshFallback, a single helper for waiting
on a VM to return after a reboot. On QEMU it waits for the serial "login:"
prompt and, if that times out, falls back to confirming the reboot over SSH
by comparing "uptime --since" before and after. This tolerates the known
serial-getty udev race (systemd#10850, ~2% of boots) where dev-ttyS0.device
is skipped so serial-getty never starts even though the VM is healthy. On
genuine failure it captures a screenshot and scans the serial log for
dracut/initramfs symptoms (CheckSerialLogForDracutIssues, bug 15086). It also
proactively (re)starts serial-getty@ttyS0 so later boots are detected via the
serial log, and handles the Azure platform via SSH liveness polling.

Consolidate both test suites onto the shared helper:
- rollback tests (helper.go): the update, rollback, and split-rollback reboot
  paths now call WaitForLoginWithSshFallback instead of QemuConfig.WaitForLogin.
- servicing tests (update.go): the finalize-reboot path now calls the shared
  helper, removing ~90 lines of inline SSH-fallback logic and the duplicated
  local checkSerialLogForDracutIssues.

Also remove the stale local serial.log accumulator before each WaitForLogin
(qemu.go) so every saved NNN-serial.log contains only that boot rather than
accumulating output from all prior iterations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>