Linux guest unreachable after vm clone --on-demand from hibernate snapshot (no DHCP lease for hot-swapped NIC MAC) #47

## Summary

After `cocoon vm clone --on-demand <hibernate-import-snapshot>` resumes an **Ubuntu 24.04** guest, the guest is reachable on **neither** its pre-hibernate IP nor any new IP. `cocoon vm ls` shows `state=running, ip=-` indefinitely. The symptom is identical to #28 (Windows, fixed via the per-image `CocoonNicAutoHeal` scheduled task + in-guest PnP rebind), but that recovery path is image-resident on Windows and our Ubuntu base image has no equivalent: nothing cocoon-specific is baked in to handle this.

## Environment

- Cocoon cluster: `cocoonset-gke`, vk-cocoon on `cocoonset-node-2`
- vm-service env: `testing`
- Hot snapshot: `epoch.simular.cloud/simular/ubuntu-hot-testing:v1` (fresh bake at 12:00 UTC today)
- Per-VM hibernate snapshot: `vk-default-vm-c547fa0a-0` (saved at 2026-05-14 12:21:48 UTC)

## Reproduce

vm-service-driven, but the underlying vk-cocoon CLI sequence is:

```
sudo cocoon vm rm --force <pre-hibernate-vm-id>
sudo cocoon snapshot inspect vk-default-vm-c547fa0a-0
sudo cocoon vm clone --output json --name vk-default-vm-c547fa0a-0 \
     --network cocoon-dhcp --on-demand vk-default-vm-c547fa0a-0
```

(This is exactly what vk-cocoon logs during a `spec.suspend=false` reconcile after hibernate.)

## Observations

Pre-hibernate VM:

- guest MAC: (whatever was assigned originally)
- DHCP-assigned IP: `172.20.1.58`
- working agent → vm-service token-exchange, etc.

Post-wake (after the clone above):

- `cocoon vm ls`:
  ```
  ID                          NAME                      STATE    CPU  MEMORY  STORAGE  IP  ...
  E5LFZLS2QQXYPBRQEQ5OYQISOQ  vk-default-vm-c547fa0a-0  running  4    8GiB    20GiB    -   ...
  ```
- Host-side veth/netns: present, MAC `2a:98:96:a6:fc:65` on `veth8e430d83`, peer in `cocoon-E5LFZLS2QQXYPBRQEQ5OYQISOQ`.
- `/var/lib/cocoon/net/leases.json`: no entry for `2a:98:96:a6:fc:65` (the new MAC). The old MAC's lease (for `172.20.1.58`) is also gone. So `cocoon-dhcp` IPAM lost the binding too.
- `ping 172.20.1.58` from cni0 / from a sibling cocoon pod: "Destination Host Unreachable", `ip neigh` shows the entry as FAILED.
- `kubectl exec` and `cocoon vm exec` both hang (no vsock progress) — guest is alive but doesn't progress past the wake point because its NIC stack is hot-swapped to a new MAC and there's no in-guest path to renegotiate DHCP.

This is the same shape as #28 — virtio-net hot-swap leaves the guest with a fresh MAC the guest hasn't bound to. The Linux symptom is that `systemd-networkd` / `NetworkManager` (or whatever's managing eth0) doesn't notice the new device, so no DHCPDISCOVER goes out on the new interface, so no lease, so no IPAM entry, so cocoon-dhcp doesn't even know the VM exists.
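
To re-check the lease observation on another node, a lookup along these lines may help. It is only a sketch: the schema of `leases.json` isn't shown in this report, so the helper makes no assumption about it and simply walks the whole JSON tree looking for the MAC as a key or a string value. The function names are made up for illustration.

```python
import json


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and use ':' separators so comparisons are uniform."""
    return mac.strip().lower().replace("-", ":")


def find_mac(node, mac: str, path: str = "$") -> list[str]:
    """Return the JSON paths where `mac` appears as a dict key or string value."""
    target = normalize_mac(mac)
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if isinstance(key, str) and normalize_mac(key) == target:
                hits.append(child)
            hits.extend(find_mac(value, mac, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits.extend(find_mac(value, mac, f"{path}[{i}]"))
    elif isinstance(node, str) and normalize_mac(node) == target:
        hits.append(path)
    return hits


def lease_paths(lease_file: str, mac: str) -> list[str]:
    """Load a lease file (layout unknown) and search it for `mac`."""
    with open(lease_file, encoding="utf-8") as fh:
        return find_mac(json.load(fh), mac)
```

If the file matches what is described above, `lease_paths("/var/lib/cocoon/net/leases.json", "2a:98:96:a6:fc:65")` would be expected to return an empty list.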

## Why this matters

`vm clone --on-demand <hibernate-snapshot>` is the wake path that vk-cocoon uses for `spec.suspend=false` on a CocoonSet. For us, that's every hibernate-cycle on the Linux cocoon path. As shipped today, it's a one-way road: hibernate works, but the woken guest is never reachable again.

#28's resolution baked `CocoonNicAutoHeal` into the Windows base image. The Linux analog would have to be image-resident as well (we can't run anything via `cocoon vm exec` from the host until the guest comes back), but there's no `cocoonstack/ubuntu` analog of `cocoonstack/windows` shipping such a recovery hook in the base. Two paths I can see:

1. **Image-side fix**: ship a small systemd unit in the cocoon Ubuntu base that watches for link-up on a freshly-attached virtio-net interface and triggers `networkctl renew` / `dhclient -r && dhclient` on it. Belt-and-suspenders, but it's a property of the image, not of cocoon, and we'd have to add it to every Ubuntu base that downstream wants to wake from hibernate.

2. **Host-side fix in cocoon**: at clone-from-hibernate time, *re-use the saved MAC* instead of regenerating a new one. The saved snapshot already encodes the guest's view of its NIC (driver state, IP, etc.); regenerating the MAC is what breaks the guest. If the post-wake MAC matches the pre-hibernate MAC, the guest never knew anything changed and DHCP/leases just keep working. That's an in-cocoon change to `vm clone` when the snapshot is a hibernate-import.
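
For option 1, the in-guest helper could look roughly like the sketch below. It assumes systemd-networkd manages the interface (this report doesn't establish which manager the base image uses) and that `ip` from iproute2 is available with JSON output. The systemd unit or udev rule that would call `heal()` on link-up is not shown, and `has_ipv4`, `renew_command` and `heal` are hypothetical names, not anything shipped today.

```python
import json
import subprocess


def has_ipv4(ip_addr_json: str) -> bool:
    """True if `ip -j -4 addr show dev <iface>` output lists a global IPv4 address."""
    for link in json.loads(ip_addr_json.strip() or "[]"):
        for addr in link.get("addr_info", []):
            if addr.get("family") == "inet" and addr.get("scope") == "global":
                return True
    return False


def renew_command(iface: str) -> list[str]:
    """Command used to ask systemd-networkd to renew DHCP on one interface."""
    return ["networkctl", "renew", iface]


def heal(iface: str, run=subprocess.run) -> bool:
    """Renew DHCP on `iface` if it has no global IPv4 address.

    Returns True if a renew was attempted, False if the interface was fine.
    """
    out = run(
        ["ip", "-j", "-4", "addr", "show", "dev", iface],
        capture_output=True, text=True, check=True,
    ).stdout
    if has_ipv4(out):
        return False
    run(renew_command(iface), check=True)
    return True
```

If the guest's network configuration matches interfaces by MAC address, a renew alone may not be enough, because the manager would not consider the re-MACed device its own; that case would need a reconfigure or a config change and is out of scope for this sketch.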

(2) is the cleaner fix: it makes hibernate→wake actually transparent to the guest regardless of OS, and `cocoon-dhcp`'s existing lease for the old MAC stays valid for the lease duration, provided that lease is still present at wake time (in this repro it was already gone, see Observations). (1) is the workaround if (2) isn't desirable for some reason (e.g. MAC collisions across cross-node clones).
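
The MAC selection that option 2 implies can be sketched as follows. This is not cocoon's actual code; `choose_clone_mac`, its parameters, and the idea that the host can enumerate MACs already in use are all assumptions made for illustration. The fallback generates a random unicast, locally administered address, which is the same class of address as the `2a:98:96:a6:fc:65` seen above.

```python
import secrets


def random_local_mac() -> str:
    """Generate a random unicast, locally administered MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Set the locally-administered bit (0x02) and clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)


def choose_clone_mac(saved_mac, is_hibernate_snapshot, macs_in_use) -> str:
    """Pick the MAC for a cloned VM's NIC.

    Reuse the MAC stored with the snapshot when waking from hibernate,
    unless another VM on this host already uses it; otherwise fall back
    to a fresh random MAC (today's behaviour, as described in this issue).
    """
    if is_hibernate_snapshot and saved_mac:
        mac = saved_mac.lower()
        if mac not in {m.lower() for m in macs_in_use}:
            return mac
    return random_local_mac()
```

The collision check only covers MACs visible to the caller, so it does not by itself address the cross-node clone case mentioned above.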

## Repro artifacts

- vk-cocoon journal on cocoonset-node-2 around `2026-05-14T12:24:07Z` to `12:24:19Z` — full sequence.
- cocoonset name: `default/vm-c547fa0a`, vm-id `E5LFZLS2QQXYPBRQEQ5OYQISOQ`, still in this state at time of filing.

If you want hands-on access let me know and I'll keep the VM around; otherwise vm-service will tear it down after the e2e times out.

