docs(cozystack-upgrade): add KubeVirt 1.6→1.8 VM cold-restart workflow #7

kvaps wants to merge 1 commit into
Conversation
Cozystack release-1.4 will bump KubeVirt from 1.6.3 to 1.8.2 (cozystack PR #2502). Every VM that was running before the upgrade then fails to live-migrate because the in-memory QEMU device state can't be reloaded by the new QEMU on the target launcher (kubevirt/kubevirt#16386, virtio-net specifically). Add a known-failures entry covering:

- pre-upgrade: set `workloadUpdateMethods=[]` and suspend the kubevirt HelmRelease
- post-upgrade: paced cold-restart of all running VMs (with an exclusion list for tenants who can't take the downtime window)
- steady state: re-enable `workloadUpdateMethods` once the cluster is uniformly on the new launcher image

Also add a SKILL.md red-flag row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the operator catches this before running `helm upgrade`.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
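A minimal sketch of this pre-upgrade prep and the steady-state cleanup, assuming Flux manages a `kubevirt` HelmRelease and that the KubeVirt CR is named `kubevirt` in the `cozy-kubevirt` namespace (both names are assumptions; `spec.workloadUpdateStrategy.workloadUpdateMethods` is the upstream KubeVirt field):

```bash
# Pre-upgrade: suspend reconciliation first so the patch isn't reverted,
# then stop KubeVirt from auto-migrating VMs onto the new launcher image.
flux suspend helmrelease kubevirt -n cozy-kubevirt
kubectl -n cozy-kubevirt patch kubevirt kubevirt --type merge \
  -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":[]}}}'

# Steady state, once every VM runs on the new launcher image: undo both.
kubectl -n cozy-kubevirt patch kubevirt kubevirt --type merge \
  -p '{"spec":{"workloadUpdateStrategy":{"workloadUpdateMethods":["LiveMigrate"]}}}'
flux resume helmrelease kubevirt -n cozy-kubevirt
```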
📝 Walkthrough

This PR adds documentation for a KubeVirt upgrade issue that manifests as live-migration failures during upgrades across QEMU version boundaries (KubeVirt 1.6.x → 1.8.x). The changes provide complete operator guidance: problem definition, pre-upgrade preparation, post-upgrade phased recovery procedures, steady-state reconciliation, and coordination checklists with tenant opt-out mechanisms.

Changes: KubeVirt 1.6.x → 1.8.x Upgrade Workaround
Code Review
This pull request adds documentation and a detailed recovery procedure for KubeVirt upgrades from version 1.6.x to 1.8.x, addressing a known issue where live migrations fail due to a QEMU version bump. The feedback suggests improving the provided bash scripts by using a generic default for the namespace exclusion list and adding a status filter to ensure only running pods are targeted during the phased cold-restart process.
```bash
# 6. Build the worklist of VMIs to restart. Excludes any that the operator
# must leave alone (replace EXCLUDED_NS as needed).
EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
```
The example value tenant-edoors is very specific and might be accidentally used if the user copy-pastes the block. It's better to provide an empty default. Also, the comment mentions "adjust grep below" but the implementation uses awk.
```diff
-EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
+EXCLUDED_NS=""            # comma-separated list of namespaces to exclude
```
```bash
pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```
To ensure the script targets the active workload and avoids issues with pods in Terminating or Failed states (which might exist if a VM is undergoing issues), it's safer to filter for Running pods.
```diff
-pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
-  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+pod=$(kubectl -n "$ns" get pods -l kubevirt.io=virt-launcher,vm.kubevirt.io/name="$vmi" \
+  --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```
🧹 Nitpick comments (4)
skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md (4)
317-323: ⚡ Quick win. Consider adding a multi-namespace exclusion example.
The comment on line 317 mentions "comma-separated if more than one" but only shows a single namespace. Adding an example would help operators handle multiple exclusions correctly.
📝 Suggested clarification
```diff
-EXCLUDED_NS=tenant-edoors # comma-separated if more than one; adjust grep below
+EXCLUDED_NS=tenant-edoors # comma-separated if more than one: tenant-foo,tenant-bar
```

🤖 Prompt for AI Agents

```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md` around lines 317-323: add a concrete multi-namespace example for EXCLUDED_NS so operators see the intended format; update the EXCLUDED_NS line (the symbol EXCLUDED_NS in the snippet) to show a comma-separated example like "EXCLUDED_NS=tenant-edoors,tenant-foo" and add a short clarifying sentence after the awk invocation explaining that the awk split logic (variable skip and split(ex,e,",")) will skip all listed namespaces and that the output is written to /tmp/vms-to-restart.txt.
```
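For illustration, a hypothetical reconstruction of the worklist builder that this prompt describes (the awk variable names follow the prompt; everything else is assumed, not copied from known-failures.md):

```bash
EXCLUDED_NS="tenant-edoors,tenant-foo"   # comma-separated namespaces to skip
kubectl get vmi -A --no-headers \
  | awk -v ex="$EXCLUDED_NS" '
      BEGIN { n = split(ex, e, ",") }    # e[] holds the excluded namespaces
      {
        skip = 0
        for (i = 1; i <= n; i++) if ($1 == e[i]) skip = 1
        if (!skip) print $1, $2          # "<namespace> <vmi-name>"
      }' > /tmp/vms-to-restart.txt
wc -l /tmp/vms-to-restart.txt            # sanity-check the count
```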
340: 💤 Low value. Consider noting storage considerations.
The pacing note mentions "storage IO surges (DRBD/LINSTOR resyncs)" as a reason to loosen spacing, but operators might benefit from more explicit guidance on how to detect if storage is "hot" or what tighter spacing might look like.
💡 Suggested expansion (optional)
```diff
-**Pacing.** 30s spacing × N VMs = total wall time. For 161 VMs that's ~85 min. Tighter spacing risks storage IO surges (DRBD/LINSTOR resyncs). Loosen if storage is hot, tighten if maintenance window is short.
+**Pacing.** 30s spacing × N VMs = total wall time. For 161 VMs that's ~85 min. Tighter spacing risks storage IO surges (DRBD/LINSTOR resyncs). Monitor storage metrics during the first few restarts; loosen to 45-60s if you see latency spikes. Tighten to 15-20s if storage is lightly loaded and the maintenance window is short.
```

🤖 Prompt for AI Agents

```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md` at line 340, update the "Pacing. 30s spacing × N VMs = total wall time..." paragraph in known-failures.md to explicitly list how to detect a "hot" storage subsystem and give example tighter spacing values: recommend checking storage metrics (IOPS, avg latency in ms, device/util%, queue depth, DRBD/LINSTOR resync activity and throughput) and concrete thresholds (e.g., latency >10-20ms, util >70-80%, sustained high IOPS or active DRBD resyncs) to justify loosening spacing, and provide example alternative spacings (e.g., 60-120s for hot storage, 10-20s when maintenance window is tight) so operators know when to loosen or tighten the 30s spacing.
```
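As a concrete illustration of "detecting hot storage", a rough probe could look like this; it assumes shell access on a storage node, and the thresholds mirror the prompt's examples rather than anything in the skill:

```bash
# Any DRBD resync in flight is a reason to keep the spacing loose.
drbdadm status | grep -Ei 'sync|inconsistent' \
  && echo "resync in progress: keep spacing at 30-60s or more"

# Watch the backing devices: sustained await above ~10-20 ms or %util
# above ~70-80% means the storage layer is already working hard.
iostat -x 5 3
```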
368-376: ⚡ Quick win. Consider documenting the exclusion list for post-upgrade reference.
The coordination section mentions adding tenants to the exclusion list but doesn't suggest recording which VMs were excluded. This information would be valuable for tracking which VMs still need manual restart and for audit trails.
📋 Suggested addition
````diff
 - Tenants who need to defer should be added to the exclusion list; their VM will keep running on the old QEMU until they restart it themselves.
+
+**Record exclusions**: Save the exclusion list and the final worklist to a file for reference. This helps track which VMs remain on the old launcher image and need follow-up:
+```bash
+echo "EXCLUDED_NS=$EXCLUDED_NS" > /tmp/kubevirt-upgrade-exclusions.txt
+cp /tmp/vms-to-restart.txt /tmp/vms-restarted-$(date +%Y%m%d).txt
+```
````

🤖 Prompt for AI Agents

```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md` around lines 368-376: add a step in the "Coordination with the user" section that records which tenants/namespaces were placed on the exclusion list by persisting the EXCLUDED_NS variable to a file and snapshotting the VM lists so post-upgrade operators can see which VMs were skipped and which were restarted; specifically, write EXCLUDED_NS to a stable path (e.g., /tmp/kubevirt-upgrade-exclusions.txt) and copy the existing vms-to-restart.txt to a dated vms-restarted-YYYYMMDD.txt for auditing and follow-up.
```
344-354: ⚡ Quick win. Consider adding a brief wait before verification.
The verification commands run immediately after the restart loop completes. If the last VM in the worklist is still starting its new launcher pod, the verification might show a transitional state. A brief note or explicit wait could prevent confusion.
⏱️ Suggested addition
```diff
 done < /tmp/vms-to-restart.txt
+
+# Wait a moment for the last launcher to stabilize
+echo "Waiting 60s for final launchers to reach Running..."
+sleep 60

 # 8. Verify everything landed on the new launcher image
```

🤖 Prompt for AI Agents

```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md` around lines 344-354: add a short synchronization step before the two verification blocks: wait for the new virt-launcher pods to reach Ready (or sleep briefly) before running the kubectl get pods -l kubevirt.io=virt-launcher ... and kubectl get vmi -A --no-headers ... checks so transient launcher startups don't produce false-failure output; implement this by inserting a small wait or a kubectl wait for Ready on pods labeled kubevirt.io=virt-launcher (or equivalent) immediately before the existing verification commands.
```
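Instead of a fixed sleep, the `kubectl wait` the prompt mentions could look like the following on a reasonably recent kubectl (the 5m timeout is an arbitrary choice):

```bash
# Block until every virt-launcher pod in the cluster is Ready, or time out.
kubectl wait pod --all-namespaces -l kubevirt.io=virt-launcher \
  --for=condition=Ready --timeout=5m
```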
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md`:
- Around line 317-323: Add a concrete multi-namespace example for EXCLUDED_NS so
operators see the intended format; update the EXCLUDED_NS line (the symbol
EXCLUDED_NS in the snippet) to show a comma-separated example like
"EXCLUDED_NS=tenant-edoors,tenant-foo" and add a short clarifying sentence after
the awk invocation explaining that the awk split logic (variable skip and
split(ex,e,",")) will skip all listed namespaces and that the output is written
to /tmp/vms-to-restart.txt.
- Line 340: Update the "Pacing. 30s spacing × N VMs = total wall time..."
paragraph in known-failures.md to explicitly list how to detect a "hot" storage
subsystem and give example tighter spacing values: recommend checking storage
metrics (IOPS, avg latency in ms, device/util%, queue depth, DRBD/LINSTOR resync
activity and throughput) and concrete thresholds (e.g., latency >10–20ms, util
>70–80%, sustained high IOPS or active DRBD resyncs) to justify loosening
spacing, and provide example alternative spacings (e.g., 60–120s for hot
storage, 10–20s when maintenance window is tight) so operators know when to
loosen or tighten the 30s spacing.
- Around line 368-376: Add a step in the "Coordination with the user" section
that records which tenants/namespaces were placed on the exclusion list by
persisting the EXCLUDED_NS variable to a file and snapshotting the VM lists so
post-upgrade operators can see which VMs were skipped and which were restarted;
specifically, write EXCLUDED_NS to a stable path (e.g.,
/tmp/kubevirt-upgrade-exclusions.txt) and copy the existing vms-to-restart.txt
to a dated vms-restarted-YYYYMMDD.txt for auditing and follow-up.
- Around line 344-354: Add a short synchronization step before the two
verification blocks: wait for the new virt-launcher pods to reach Ready (or
sleep briefly) before running the kubectl get pods -l kubevirt.io=virt-launcher
... and kubectl get vmi -A --no-headers ... checks so transient launcher
startups don't produce false-failure output; implement this by inserting a small
wait or a kubectl wait for Ready on pods labeled kubevirt.io=virt-launcher (or
equivalent) immediately before the existing verification commands.
📒 Files selected for processing (2)

- skills/cozystack-upgrade/skills/cozystack-upgrade/SKILL.md
- skills/cozystack-upgrade/skills/cozystack-upgrade/references/known-failures.md
Summary
Adds a procedure to the `cozystack-upgrade` skill for the KubeVirt 1.6.x → 1.8.x bump that's coming with Cozystack `release-1.4` (cozystack/cozystack#2502).

When that upgrade is applied via `helm upgrade cozystack`, every VM that was running pre-upgrade fails to live-migrate afterwards because the new QEMU can't reload the old in-memory `virtio-net` device state (kubevirt/kubevirt#16386). KubeVirt's `workloadUpdateMethods` keeps retrying, and the cluster ends up flapping.

Validated end-to-end on staging (hidora-hikube-lab) and production (hidora-hikube): 161 running VMs, ~85 minutes total, no customer-visible incidents.
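For orientation, the paced restart loop validated here plausibly has this shape; a sketch only, with `virtctl restart` and the 30s pacing assumed from the numbers above (161 VMs × ~30s ≈ 85 min), not copied from the skill:

```bash
# Cold-restart one VM at a time, paced so storage resyncs never pile up.
# Worklist lines are "<namespace> <vm-name>" in /tmp/vms-to-restart.txt.
while read -r ns vmi; do
  echo "restarting $ns/$vmi"
  virtctl restart -n "$ns" "$vmi"  # tears down the VMI; a fresh launcher
                                   # pod comes up on the new QEMU
  sleep 30                         # pacing; tune per the review notes above
done < /tmp/vms-to-restart.txt
```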
Changes
- `references/known-failures.md` — new entry (cozy-bump: SKILL.md backtick sequence breaks slash-command preprocessor #8) with the exact pre-upgrade prep (`workloadUpdateMethods: []`, suspend the `kubevirt` HR), the paced cold-restart loop, post-upgrade verification, and the steady-state cleanup.
- `SKILL.md` — adds a red-flag table row and a top-level "KubeVirt 1.6.x → 1.8.x special handling" note so the skill catches this before running `helm upgrade`.

The flow is built around the conventional Cozystack upgrade path (`helm upgrade cozystack ...`), not ad-hoc `make apply`. Coordination with VM owners is the main requirement: every non-excluded VM gets ~30-60s downtime in a controlled order.

Why "do not merge"
Blocked on cozystack/cozystack#2502 (the actual KubeVirt 1.8.2 bump). This skill change describes the workflow for a Cozystack release that doesn't exist yet — merging earlier would point users at a procedure they don't need.
Merge condition: merge once cozystack/cozystack#2502 lands in a Cozystack release (currently targeted at `release-1.4`). If a better upstream fix appears for kubevirt/kubevirt#16386 before then (e.g. a way to pin per-VMI launcher images so existing VMs don't need cold-restart), revisit this PR — the workflow may no longer be needed.