Summary
After `cocoon vm clone --on-demand <hibernate-import-snapshot>` resumes an Ubuntu 24.04 guest, the guest is reachable on neither its pre-hibernate IP nor any new IP. `cocoon vm ls` shows `state=running, ip=-` indefinitely. The symptom is identical to #28 (Windows, fixed via the per-image CocoonNicAutoHeal scheduled task + in-guest PnP rebind), but the in-guest recovery path that is image-resident on Windows has no Linux equivalent: our Ubuntu base image ships nothing cocoon-specific that could rebind the NIC after wake.
Environment
- Cocoon cluster: cocoonset-gke, vk-cocoon on `cocoonset-node-2`
- vm-service env: testing
- Hot snapshot: `epoch.simular.cloud/simular/ubuntu-hot-testing:v1` (fresh bake at 12:00 UTC today)
- Per-VM hibernate snapshot: `vk-default-vm-c547fa0a-0` (saved at 2026-05-14 12:21:48 UTC)
Reproduce
vm-service-driven, but the underlying vk-cocoon CLI sequence is:

```shell
sudo cocoon vm rm --force <pre-hibernate-vm-id>
sudo cocoon snapshot inspect vk-default-vm-c547fa0a-0
sudo cocoon vm clone --output json --name vk-default-vm-c547fa0a-0 \
  --network cocoon-dhcp --on-demand vk-default-vm-c547fa0a-0
```

(This is exactly what vk-cocoon logs during a `spec.suspend=false` reconcile after hibernate.)
Observations
Pre-hibernate VM:
- guest MAC: (whatever was leased originally)
- DHCP-assigned IP: 172.20.1.58
- working agent → vm-service token exchange, etc.
Post-wake (after the clone above):
- `cocoon vm ls`:

```
ID                          NAME                      STATE    CPU  MEMORY  STORAGE  IP  ...
E5LFZLS2QQXYPBRQEQ5OYQISOQ  vk-default-vm-c547fa0a-0  running  4    8GiB    20GiB    -   ...
```

- Host-side veth/netns: present, MAC `2a:98:96:a6:fc:65` on `veth8e430d83`, peer in `cocoon-E5LFZLS2QQXYPBRQEQ5OYQISOQ`.
- `/var/lib/cocoon/net/leases.json`: no entry for `2a:98:96:a6:fc:65` (the new MAC). The old MAC's lease (for 172.20.1.58) is also gone, so cocoon-dhcp's IPAM lost the binding too.
- `ping 172.20.1.58` from `cni0` / from a sibling cocoon pod: "Destination Host Unreachable"; `ip neigh` shows the entry as FAILED.
- `kubectl exec` and `cocoon vm exec` both hang (no vsock progress) — the guest is alive but never progresses past the wake point, because its NIC was hot-swapped to a new MAC and there is no in-guest path to renegotiate DHCP.
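For the record, the lease-file check above was just greps against the paths from this report (no assumptions about the file's internal format):

```shell
# Confirm neither the new MAC nor the old IP survives in cocoon-dhcp's
# lease store. Path, MAC, and IP are the ones from this report.
LEASES=${LEASES:-/var/lib/cocoon/net/leases.json}
grep -q '2a:98:96:a6:fc:65' "$LEASES" || echo "no lease for new MAC"
grep -q '172.20.1.58' "$LEASES" || echo "old 172.20.1.58 lease gone too"
```

Both messages print on `cocoonset-node-2` right now, i.e. IPAM has no binding for this VM at all.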
This is the same shape as #28: virtio-net hot-swap leaves the guest with a fresh MAC it has never bound to. The Linux symptom is that systemd-networkd / NetworkManager (or whatever is managing `eth0`) doesn't notice the new device, so no DHCPDISCOVER goes out on the new interface; without a DISCOVER there's no lease, without a lease no IPAM entry, and cocoon-dhcp doesn't even know the VM exists.
Why this matters
`vm clone --on-demand <hibernate-snapshot>` is the wake path that vk-cocoon uses for `spec.suspend=false` on a CocoonSet. For us, that's every hibernate cycle on the Linux cocoon path. As shipped today it's a one-way road: hibernate works, but the woken guest is never reachable again.
#28's resolution baked CocoonNicAutoHeal into the Windows base image. The Linux analog would have to be image-resident as well (we can't run anything via `cocoon vm exec` from the host until the guest comes back), but there's no `cocoonstack/ubuntu` analog of `cocoonstack/windows` shipping such a recovery hook in the base. Two paths I can see:

1. Image-side fix: ship a small systemd unit in the cocoon Ubuntu base that watches for link-up on a freshly attached virtio-net interface and triggers `networkctl renew` / `dhclient -r && dhclient` on it. Belt-and-suspenders, but it's a property of the image, not of cocoon, and we'd have to add it to every Ubuntu base that downstream wants to wake from hibernate.
2. Host-side fix in cocoon: at clone-from-hibernate time, reuse the saved MAC instead of regenerating a new one. The saved snapshot already encodes the guest's view of its NIC (driver state, IP, etc.); regenerating the MAC is what breaks the guest. If the post-wake MAC matches the pre-hibernate MAC, the guest never knows anything changed and DHCP/leases just keep working. That's an in-cocoon change to `vm clone` for the case where the snapshot is a hibernate import.
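A minimal sketch of what (1) could look like. The rule and unit names are invented, and it assumes systemd-networkd is managing the interface in-guest; under NetworkManager or plain dhclient the `ExecStart` would be `nmcli device reconnect %i` or `dhclient -r %i && dhclient %i` instead:

```ini
# /etc/udev/rules.d/99-cocoon-nic-heal.rules  (hypothetical name)
# Tag any newly added virtio-net interface so systemd pulls in the heal unit.
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="virtio_net", TAG+="systemd", \
  ENV{SYSTEMD_WANTS}+="cocoon-nic-heal@%k.service"

# /etc/systemd/system/cocoon-nic-heal@.service  (hypothetical name)
[Unit]
Description=Re-run DHCP on freshly attached virtio-net interface %i

[Service]
Type=oneshot
ExecStart=/usr/bin/networkctl renew %i
```

This only papers over the symptom, though: the guest still sees a "new" NIC, and any state keyed on the old MAC/IP is lost.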
(2) is the cleaner fix — it makes hibernate→wake actually transparent to the guest regardless of OS, and cocoon-dhcp's existing lease for the old MAC stays valid for the lease duration. (1) is the workaround if (2) isn't desirable for some reason (e.g. MAC collisions across cross-node clones).
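To make (2) concrete, here is a sketch of the wake path it would enable. Everything cocoon-side is hypothetical: `snapshot inspect` is not known to expose a `.nic.mac` field, and `vm clone` has no `--mac` flag today; growing one (or doing the equivalent internally when the snapshot is a hibernate import) is exactly the proposed change:

```shell
# HYPOTHETICAL wake wrapper for option (2). The .nic.mac field and the
# --mac flag do not exist in cocoon today; this is the proposed shape.
SNAP=vk-default-vm-c547fa0a-0
MAC=$(sudo cocoon snapshot inspect --output json "$SNAP" | jq -r '.nic.mac')
sudo cocoon vm clone --output json --name "$SNAP" \
  --network cocoon-dhcp --mac "$MAC" --on-demand "$SNAP"
```

Doing this inside `vm clone` itself (no wrapper) keeps vk-cocoon unchanged.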
Repro artifacts
- vk-cocoon journal on `cocoonset-node-2` around `2026-05-14T12:24:07Z` to `12:24:19Z` — full sequence.
- cocoonset name: `default/vm-c547fa0a`, vm-id `E5LFZLS2QQXYPBRQEQ5OYQISOQ`, still in this state at time of filing.
If you want hands-on access let me know and I'll keep the VM around; otherwise vm-service will tear it down after the e2e times out.