Linux guest unreachable after vm clone --on-demand from hibernate snapshot (no DHCP lease for hot-swapped NIC MAC) #47

@tonicmuroq

Summary

After cocoon vm clone --on-demand <hibernate-import-snapshot> resumes an Ubuntu 24.04 guest, the guest is reachable on neither its pre-hibernate IP nor any new IP. cocoon vm ls shows state=running, ip=- indefinitely. The symptom is identical to #28 (Windows, fixed via the per-image CocoonNicAutoHeal scheduled task plus an in-guest PnP rebind), but that image-resident recovery path has no Linux equivalent: our Ubuntu base image ships nothing cocoon-specific to handle this.

Environment

  • Cocoon cluster: cocoonset-gke, vk-cocoon on cocoonset-node-2
  • vm-service env: testing
  • Hot snapshot: epoch.simular.cloud/simular/ubuntu-hot-testing:v1 (fresh bake at 12:00 UTC today)
  • Per-VM hibernate snapshot: vk-default-vm-c547fa0a-0 (saved at 2026-05-14 12:21:48 UTC)

Reproduce

vm-service-driven, but the underlying vk-cocoon CLI sequence is:

sudo cocoon vm rm --force <pre-hibernate-vm-id>
sudo cocoon snapshot inspect vk-default-vm-c547fa0a-0
sudo cocoon vm clone --output json --name vk-default-vm-c547fa0a-0 \
     --network cocoon-dhcp --on-demand vk-default-vm-c547fa0a-0

(This is exactly what vk-cocoon logs during a spec.suspend=false reconcile after hibernate.)

Observations

Pre-hibernate VM:

  • guest MAC: (whatever was assigned originally)
  • DHCP-assigned IP: 172.20.1.58
  • working agent → vm-service token-exchange, etc.

Post-wake (after the clone above):

  • cocoon vm ls:
    ID                          NAME                      STATE    CPU  MEMORY  STORAGE  IP  ...
    E5LFZLS2QQXYPBRQEQ5OYQISOQ  vk-default-vm-c547fa0a-0  running  4    8GiB    20GiB    -   ...
    
  • Host-side veth/netns: present, MAC 2a:98:96:a6:fc:65 on veth8e430d83, peer in cocoon-E5LFZLS2QQXYPBRQEQ5OYQISOQ.
  • /var/lib/cocoon/net/leases.json: no entry for 2a:98:96:a6:fc:65 (the new MAC). The old MAC's lease (for 172.20.1.58) is also gone. So cocoon-dhcp IPAM lost the binding too.
  • ping 172.20.1.58 from cni0 / from a sibling cocoon pod: "Destination Host Unreachable", ip neigh shows the entry as FAILED.
  • kubectl exec and cocoon vm exec both hang (no vsock progress). The guest appears to be alive but does not progress past the wake point: its NIC has been hot-swapped to a new MAC and there is no in-guest path to renegotiate DHCP.
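
The leases.json observation above can be re-checked on the host with a small helper. This is a minimal sketch: has_lease is a hypothetical name, and because the schema of leases.json is not documented here, it does a plain case-insensitive string search rather than parsing the JSON. LEASES and MAC default to the values from this report and can be overridden.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the MAC appears anywhere in the leases file.
# Schema-agnostic on purpose: a fixed-string, case-insensitive search,
# so it makes no assumption about how leases.json is laid out.
has_lease() {
    leases_file=$1
    mac=$2
    grep -qiF -- "$mac" "$leases_file" 2>/dev/null
}

# Defaults taken from this report; override via the environment.
LEASES="${LEASES:-/var/lib/cocoon/net/leases.json}"
MAC="${MAC:-2a:98:96:a6:fc:65}"

if has_lease "$LEASES" "$MAC"; then
    echo "lease present for $MAC"
else
    echo "no lease for $MAC"
fi
```

Reading the file may require root. A missing or unreadable file is reported the same way as a missing lease.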

This is the same shape as #28 — virtio-net hot-swap leaves the guest with a fresh MAC the guest hasn't bound to. The Linux symptom is that systemd-networkd / NetworkManager (or whatever's managing eth0) doesn't notice the new device, so no DHCPDISCOVER goes out on the new interface, so no lease, so no IPAM entry, so cocoon-dhcp doesn't even know the VM exists.

Why this matters

vm clone --on-demand <hibernate-snapshot> is the wake path that vk-cocoon uses for spec.suspend=false on a CocoonSet. For us, that's every hibernate-cycle on the Linux cocoon path. As shipped today, it's a one-way road: hibernate works, but the woken guest is never reachable again.

#28's resolution baked CocoonNicAutoHeal into the Windows base image. The Linux analog would have to be image-resident as well (we can't run anything via cocoon vm exec from the host until the guest comes back), but there's no cocoonstack/ubuntu analog of cocoonstack/windows shipping such a recovery hook in the base. Two paths I can see:

  1. Image-side fix: ship a small systemd unit in the cocoon Ubuntu base that watches for link-up on a freshly attached virtio-net interface and triggers networkctl renew / dhclient -r && dhclient on it. Belt-and-suspenders, but it's a property of the image, not of cocoon, and we'd have to add it to every Ubuntu base that downstream users want to wake from hibernate.

  2. Host-side fix in cocoon: at clone-from-hibernate time, re-use the saved MAC instead of generating a new one. The saved snapshot already encodes the guest's view of its NIC (driver state, IP, etc.); regenerating the MAC is what breaks the guest. If the post-wake MAC matches the pre-hibernate MAC, the guest never knows anything changed and DHCP/leases just keep working. That's an in-cocoon change to vm clone when the snapshot is a hibernate-import.
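
For option 1, the in-guest hook might look roughly like the sketch below. Everything named here is an assumption for illustration: the function names, the state file path, and the idea of comparing MACs against the last-seen set. It assumes systemd-networkd manages the interface so that networkctl renew applies. On images that use dhclient instead, RENEW_CMD could point at a wrapper script. Whether a renew alone is enough on the affected image is untested.

```shell
#!/bin/sh
# Hypothetical in-guest helper, e.g. invoked by a systemd unit or udev rule.
# All three settings are overridable so the logic can be exercised off-box.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
STATE_FILE="${STATE_FILE:-/var/lib/cocoon-nic-heal/macs}"   # assumed path
RENEW_CMD="${RENEW_CMD:-networkctl renew}"

# Print one "iface mac" line for every non-loopback interface.
list_macs() {
    for dir in "$SYSFS_NET"/*; do
        [ -r "$dir/address" ] || continue
        iface=$(basename "$dir")
        [ "$iface" = "lo" ] && continue
        printf '%s %s\n' "$iface" "$(cat "$dir/address")"
    done
}

# Trigger a DHCP renew on each interface whose (name, MAC) pair was not seen
# on the previous run, then record the current pairs for next time.
# On the very first run there is no state file, so every interface is renewed.
heal_changed_nics() {
    current=$(list_macs)
    printf '%s\n' "$current" | while read -r iface mac; do
        [ -n "$iface" ] || continue
        if ! grep -qsxF -- "$iface $mac" "$STATE_FILE"; then
            echo "renewing DHCP on $iface (MAC $mac)"
            # Intentionally unquoted so "networkctl renew" splits into words.
            $RENEW_CMD "$iface"
        fi
    done
    mkdir -p "$(dirname "$STATE_FILE")"
    printf '%s\n' "$current" > "$STATE_FILE"
}
```

A unit in the base image would source this and call heal_changed_nics after resume or on a net device add event. The trigger mechanism is left open here.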

(2) is the cleaner fix: it makes hibernate→wake actually transparent to the guest regardless of OS, and cocoon-dhcp's existing lease for the old MAC stays valid for the lease duration, provided that lease is retained across the wake (in the run above, the old MAC's lease was gone from leases.json). (1) is the workaround if (2) isn't desirable for some reason (e.g. MAC collisions across cross-node clones).
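
The MAC selection that option 2 implies can be sketched as follows. This is not cocoon's code. The function names are hypothetical, and how the saved MAC is read from the snapshot or passed to the hypervisor is deliberately left out. It only shows the decision: prefer the saved MAC when one is available and well formed, otherwise fall back to a fresh locally administered unicast address.

```shell
#!/bin/sh
# Hypothetical selection logic for a clone from a hibernate snapshot.

# Succeeds if $1 looks like aa:bb:cc:dd:ee:ff (hex digits, either case).
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eqi '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'
}

# Generate a random MAC with the locally-administered bit set and the
# multicast bit cleared in the first octet.
random_local_mac() {
    # od prints six space-separated hex bytes; set -- splits them into $1..$6.
    set -- $(od -An -N6 -tx1 /dev/urandom)
    first=$(( (0x$1 & 0xfe) | 0x02 ))
    printf '%02x:%s:%s:%s:%s:%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

# Re-use the saved MAC (normalised to lowercase) when valid, else generate one.
choose_clone_mac() {
    saved_mac=$1
    if is_valid_mac "$saved_mac"; then
        printf '%s\n' "$saved_mac" | tr 'A-F' 'a-f'
    else
        random_local_mac
    fi
}
```

For example, choose_clone_mac "2A:98:96:a6:fc:65" echoes the saved address in lowercase, while choose_clone_mac "" produces a new random one. A real implementation would also have to handle the collision concern mentioned above, for instance when the pre-hibernate VM or another clone with the same MAC still exists on the network.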

Repro artifacts

  • vk-cocoon journal on cocoonset-node-2 around 2026-05-14T12:24:07Z to 12:24:19Z — full sequence.
  • cocoonset name: default/vm-c547fa0a, vm-id E5LFZLS2QQXYPBRQEQ5OYQISOQ, still in this state at time of filing.

If you want hands-on access let me know and I'll keep the VM around; otherwise vm-service will tear it down after the e2e times out.
