
docs(cozystack-upgrade): add KubeVirt 1.6→1.8 VM cold-restart workflow#7

Open
kvaps wants to merge 1 commit into main from feat/kubevirt-1.6-to-1.8-vm-restart

Conversation


@kvaps kvaps commented Apr 28, 2026

Summary

Adds a procedure to the cozystack-upgrade skill for the KubeVirt 1.6.x → 1.8.x bump that's coming with Cozystack release-1.4 (cozystack/cozystack#2502).

When that upgrade is applied via helm upgrade cozystack, every VM that was running pre-upgrade fails to live-migrate afterwards, because the new QEMU can't reload the old in-memory virtio-net device state (kubevirt/kubevirt#16386). KubeVirt's workloadUpdateMethods logic keeps retrying the migrations, so the cluster ends up flapping.
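From the operator's side, the flapping shows up as an ever-growing pile of failed VirtualMachineInstanceMigration objects. A minimal sketch of a check for that (the `vmim` short name is KubeVirt's standard alias; the phase column position is an assumption about the current kubectl table output):

```shell
# Sketch: count failed live-migrations cluster-wide. If this number keeps
# climbing after the upgrade, you are hitting the retry loop described above.
count_failed_migrations() {
  # With -A the columns are NAMESPACE NAME PHASE VMI, so phase is $3.
  kubectl get vmim -A --no-headers 2>/dev/null | awk '$3 == "Failed"' | wc -l
}
```

Running it periodically (e.g. `watch count_failed_migrations` in a sourced shell) makes the non-convergence obvious.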

Validated end-to-end on staging (hidora-hikube-lab) and production (hidora-hikube): 161 running VMs, ~85 minutes total, no customer-visible incidents.

Changes

  • references/known-failures.md — new entry (#8) with the exact pre-upgrade prep (workloadUpdateMethods: [], suspend the kubevirt HR), the paced cold-restart loop, post-upgrade verification, and the steady-state cleanup.
  • SKILL.md — adds a red-flag table row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the skill catches this before running helm upgrade.
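The pre-upgrade prep boils down to two commands. A minimal sketch, assuming the KubeVirt CR and HelmRelease are both named `kubevirt` in namespace `cozy-kubevirt` and the cluster uses Flux (all of these names are assumptions; verify them on your cluster before running):

```shell
# Sketch of the pre-upgrade prep (resource/namespace names are assumptions):
pre_upgrade_prep() {
  # 1. Stop KubeVirt from auto-migrating running workloads after the bump.
  kubectl -n cozy-kubevirt patch kubevirt kubevirt --type=merge \
    -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":[]}}}'
  # 2. Suspend the HelmRelease so chart reconciliation can't undo step 1
  #    mid-upgrade.
  flux suspend helmrelease kubevirt -n cozy-kubevirt
}
```

The order matters: suspending the HelmRelease without clearing workloadUpdateMethods would leave the retry loop armed.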

The flow is built around the conventional Cozystack upgrade path (helm upgrade cozystack ...), not ad-hoc make apply. Coordination with VM owners is the main requirement: every non-excluded VM gets ~30-60s downtime in a controlled order.
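The pacing logic of the cold-restart loop can be sketched as a dry run without a cluster (`RESTART_CMD` stands in for the real virt-launcher pod deletion; the `namespace vmi` worklist format mirrors the documented procedure):

```shell
# Dry-run sketch of the paced cold-restart loop: read "namespace vmi" pairs
# and handle one VM per iteration with fixed spacing. RESTART_CMD=echo keeps
# this runnable without a cluster; in production it would delete the VMI's
# virt-launcher pod instead.
RESTART_CMD="${RESTART_CMD:-echo}"
SPACING="${SPACING:-0}"   # 30 (seconds) in production; 0 for the dry run

restart_paced() {
  while read -r ns vmi; do
    "$RESTART_CMD" "restart ${ns}/${vmi}"
    sleep "$SPACING"
  done
}

printf '%s\n' 'tenant-a vm-1' 'tenant-b vm-2' | restart_paced
```

With `SPACING=30` the total wall time is simply 30s × N VMs, which is where the ~85 minutes for 161 VMs comes from.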

Why "do not merge"

Blocked on cozystack/cozystack#2502 (the actual KubeVirt 1.8.2 bump). This skill change describes the workflow for a Cozystack release that doesn't exist yet — merging earlier would point users at a procedure they don't need.

Merge condition: merge once cozystack/cozystack#2502 lands in a Cozystack release (currently targeted at release-1.4). If a better upstream fix appears for kubevirt/kubevirt#16386 before then (e.g. a way to pin per-VMI launcher images so existing VMs don't need cold-restart), revisit this PR — the workflow may no longer be needed.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guidance for managing KubeVirt upgrades from 1.6.x to 1.8.x, including documented known failures and workaround procedures.
    • Included detailed procedures for VM restart coordination, state snapshots, and troubleshooting evacuation issues during the upgrade process.

Review Change Stack

Cozystack release-1.4 will bump KubeVirt from 1.6.3 to 1.8.2 (cozystack PR
#2502). Every VM that was running before the upgrade then fails to live-migrate
because the in-memory QEMU device state can't be reloaded by the new QEMU on
the target launcher (kubevirt/kubevirt#16386, virtio-net specifically).

Add a known-failures entry covering:
- pre-upgrade: set workloadUpdateMethods=[] and suspend the kubevirt HelmRelease
- post-upgrade: paced cold-restart of all running VMs (with an exclusion list
  for tenants who can't take the downtime window)
- steady state: re-enable workloadUpdateMethods once the cluster is uniformly
  on the new launcher image

Also add a SKILL.md red-flag row and a top-level "KubeVirt 1.6.x → 1.8.x
special handling" note so the operator catches this before running helm upgrade.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps added the documentation ("Improvements or additions to documentation") and do not merge ("Do not merge until linked dependency is resolved") labels on Apr 28, 2026

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough


This PR adds documentation for a KubeVirt upgrade issue that manifests as live-migration failures during upgrades across QEMU version boundaries (KubeVirt 1.6.x → 1.8.x). The changes provide complete operator guidance: problem definition, pre-upgrade preparation, post-upgrade phased recovery procedures, steady-state reconciliation, and coordination checklists with tenant opt-out mechanisms.

Changes

KubeVirt 1.6.x → 1.8.x Upgrade Workaround

  • Known-failure symptom and root cause — skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md: documents the live-migration failure on virtio-net causing accumulating Failed evacuations without convergence, the upstream QEMU root cause, and why workloadUpdateMethods adjustments do not address the issue.
  • Pre-upgrade preparation and migration prevention — same file: pre-upgrade steps covering baseline state snapshots, patches to disable workloadUpdateMethods, and HelmRelease suspension to prevent chart-driven auto-migration during the upgrade.
  • Post-upgrade phased cold-restart and steady-state restoration — same file: the recovery workflow of generating a VM restart list with exclusions, iteratively deleting launcher pods with fixed spacing, verifying launcher images, confirming VMIs are not wedged, and restoring the HelmRelease and workloadUpdateMethods configuration.
  • Troubleshooting reference and operator coordination guidance — skills/cozystack-upgrade/skills/cozystack-upgrade/SKILL.md and references/known-failures.md: adds a SKILL.md troubleshooting entry for when the KubeVirt 1.6.x → 1.8.x workflow is required, plus operator coordination guidance covering downtime expectations, restart ordering/ETA, HA-sensitivity warnings, and the exclusion-based tenant opt-out mechanism.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

A rabbit hops through upgrade woes,
KubeVirt's path now clearly shows—
Pre and post steps, paced and planned,
Cold restarts with steady hand,
Migration falters, but workarounds bloom! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title clearly and specifically describes the main change: adding documentation for a KubeVirt 1.6→1.8 VM cold-restart workflow, which aligns with both file changes (the new SKILL.md section and the known-failures.md entry).
  • Docstring Coverage — ✅ Passed. No functions found in the changed files; docstring coverage check skipped.
  • Linked Issues Check — ✅ Passed. Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed. Skipped because no linked issues were found for this pull request.



gemini-code-assist Bot left a comment


Code Review

This pull request adds documentation and a detailed recovery procedure for KubeVirt upgrades from version 1.6.x to 1.8.x, addressing a known issue where live-migrations fail due to a QEMU version bump. The feedback suggests improving the provided bash scripts by using a generic default for the namespace exclusion list and adding a status filter to ensure only running pods are targeted during the phased cold-restart process.

```bash
# 6. Build the worklist of VMIs to restart. Excludes any that the operator
# must leave alone (replace EXCLUDED_NS as needed).
EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
```

Severity: medium

The example value tenant-edoors is very specific and might be accidentally used if the user copy-pastes the block. It's better to provide an empty default. Also, the comment mentions "adjust grep below" but the implementation uses awk.

Suggested change:

```diff
-EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
+EXCLUDED_NS="" # comma-separated list of namespaces to exclude
```
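Whichever default is chosen, the comma-separated exclusion logic itself can be exercised offline. A self-contained sketch (function name and the `namespace vmi` input format are illustrative, mirroring the worklist shape in the reviewed docs):

```shell
# Sketch: drop VMIs whose namespace appears in a comma-separated exclusion
# list. Reads "namespace vmi" lines on stdin; an empty list keeps everything.
filter_excluded() {
  awk -v ex="$1" '
    BEGIN { n = split(ex, e, ",") }
    {
      skip = 0
      for (i = 1; i <= n; i++) if ($1 == e[i]) skip = 1
      if (!skip) print
    }'
}

printf '%s\n' 'tenant-a vm-1' 'tenant-b vm-2' 'tenant-c vm-3' \
  | filter_excluded "tenant-a,tenant-c"
```

Because `split("")` returns 0, leaving the exclusion list empty passes every line through, which matches the empty-default suggestion above.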

Comment on lines +330 to +331:

```bash
pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```

Severity: medium

To ensure the script targets the active workload and avoids issues with pods in Terminating or Failed states (which might exist if a VM is undergoing issues), it's safer to filter for Running pods.

Suggested change:

```diff
-pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
-  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
+  --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```

@kvaps removed the do not merge label on May 13, 2026
@kvaps marked this pull request as ready for review on May 13, 2026 at 14:55

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md (4)

317-323: ⚡ Quick win

Consider adding a multi-namespace exclusion example.

The comment on line 317 mentions "comma-separated if more than one" but only shows a single namespace. Adding an example would help operators handle multiple exclusions correctly.

📝 Suggested clarification:

```diff
-EXCLUDED_NS=tenant-edoors      # comma-separated if more than one; adjust grep below
+EXCLUDED_NS=tenant-edoors      # comma-separated if more than one: tenant-foo,tenant-bar
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md`
around lines 317 - 323, Add a concrete multi-namespace example for EXCLUDED_NS
so operators see the intended format; update the EXCLUDED_NS line (the symbol
EXCLUDED_NS in the snippet) to show a comma-separated example like
"EXCLUDED_NS=tenant-edoors,tenant-foo" and add a short clarifying sentence after
the awk invocation explaining that the awk split logic (variable skip and
split(ex,e,",")) will skip all listed namespaces and that the output is written
to /tmp/vms-to-restart.txt.

340-340: 💤 Low value

Consider noting storage considerations.

The pacing note mentions "storage IO surges (DRBD/LINSTOR resyncs)" as a reason to loosen spacing, but operators might benefit from more explicit guidance on how to detect if storage is "hot" or what tighter spacing might look like.

💡 Suggested expansion (optional):

```diff
-**Pacing.** 30s spacing × N VMs = total wall time. For 161 VMs that's ~85 min. Tighter spacing risks storage IO surges (DRBD/LINSTOR resyncs). Loosen if storage is hot, tighten if maintenance window is short.
+**Pacing.** 30s spacing × N VMs = total wall time. For 161 VMs that's ~85 min. Tighter spacing risks storage IO surges (DRBD/LINSTOR resyncs). Monitor storage metrics during the first few restarts; loosen to 45-60s if you see latency spikes. Tighten to 15-20s if storage is lightly loaded and the maintenance window is short.
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md`
at line 340, Update the "Pacing. 30s spacing × N VMs = total wall time..."
paragraph in known-failures.md to explicitly list how to detect a "hot" storage
subsystem and give example tighter spacing values: recommend checking storage
metrics (IOPS, avg latency in ms, device/util%, queue depth, DRBD/LINSTOR resync
activity and throughput) and concrete thresholds (e.g., latency >10–20ms, util
>70–80%, sustained high IOPS or active DRBD resyncs) to justify loosening
spacing, and provide example alternative spacings (e.g., 60–120s for hot
storage, 10–20s when maintenance window is tight) so operators know when to
loosen or tighten the 30s spacing.

368-376: ⚡ Quick win

Consider documenting the exclusion list for post-upgrade reference.

The coordination section mentions adding tenants to the exclusion list but doesn't suggest recording which VMs were excluded. This information would be valuable for tracking which VMs still need manual restart and for audit trails.

📋 Suggested addition:

````diff
 - Tenants who need to defer should be added to the exclusion list; their VM will keep running on the old QEMU until they restart it themselves.
+
+**Record exclusions**: Save the exclusion list and the final worklist to a file for reference. This helps track which VMs remain on the old launcher image and need follow-up:
+```bash
+echo "EXCLUDED_NS=$EXCLUDED_NS" > /tmp/kubevirt-upgrade-exclusions.txt
+cp /tmp/vms-to-restart.txt /tmp/vms-restarted-$(date +%Y%m%d).txt
+```
````
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md`
around lines 368 - 376, Add a step in the "Coordination with the user" section
that records which tenants/namespaces were placed on the exclusion list by
persisting the EXCLUDED_NS variable to a file and snapshotting the VM lists so
post-upgrade operators can see which VMs were skipped and which were restarted;
specifically, write EXCLUDED_NS to a stable path (e.g.,
/tmp/kubevirt-upgrade-exclusions.txt) and copy the existing vms-to-restart.txt
to a dated vms-restarted-YYYYMMDD.txt for auditing and follow-up.

344-354: ⚡ Quick win

Consider adding a brief wait before verification.

The verification commands run immediately after the restart loop completes. If the last VM in the worklist is still starting its new launcher pod, the verification might show a transitional state. A brief note or explicit wait could prevent confusion.

⏱️ Suggested addition:

```diff
 done < /tmp/vms-to-restart.txt
+
+# Wait a moment for the last launcher to stabilize
+echo "Waiting 60s for final launchers to reach Running..."
+sleep 60

 # 8. Verify everything landed on the new launcher image
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md`
around lines 344 - 354, Add a short synchronization step before the two
verification blocks: wait for the new virt-launcher pods to reach Ready (or
sleep briefly) before running the kubectl get pods -l kubevirt.io=virt-launcher
... and kubectl get vmi -A --no-headers ... checks so transient launcher
startups don't produce false-failure output; implement this by inserting a small
wait or a kubectl wait for Ready on pods labeled kubevirt.io=virt-launcher (or
equivalent) immediately before the existing verification commands.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03ba9f5e-0a6b-465d-a94d-696e72408c67

📥 Commits

Reviewing files that changed from the base of the PR and between 768754f and 3c61311.

📒 Files selected for processing (2)
  • skills/cozystack-upgrade/skills/cozystack-upgrade/SKILL.md
  • skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md
