
vm clone substitutes base layers from a different cached image (snapshot's cocoon.json digests ignored) #46

@tonicmuroq


Summary

cocoon vm clone produces a runtime disk config that does not match the snapshot manifest's declared base-layer digests. It instead substitutes layers from a different cached image that happens to share the same repo:tag "stem" path. The substitution silently downgrades the cloned VM's lowerdirs from the snapshot's declared N layers to whatever shorter-tag image happens to be cached locally, masking files that exist only in the upper base layers.

Related to but distinct from #37/#38: in our case the base image is present locally at the correct digest, but cocoon still constructs the runtime config from a different image's layer set.

Environment

  • Cocoon cluster: cocoonset-gke, vk-cocoon on cocoonset-node-2
  • vm-service env: testing
  • Hot snapshot: epoch.simular.cloud/simular/ubuntu-hot-testing:v1
  • Snapshot baseimage annotation: epoch.simular.cloud/simular/ubuntu:24.04-xface

Evidence

1. Snapshot manifest on epoch declares the correct layer set

cocoon.json blob from the hot-snapshot manifest references 11 base layers:

cocoon-layer0:  b40150c1c2717d... (29 MB)
cocoon-layer1:  3a6844925eb6c8... (444 MB)
cocoon-layer2:  e91c01bdde5a23...
cocoon-layer3:  f930570bc53388...
cocoon-layer4:  e88e723e1e6992...
cocoon-layer5:  f230ba36f6926f...
cocoon-layer6:  65be6c6a51e92d...
cocoon-layer7:  d8e7381f1b6d5d...
cocoon-layer8:  01f018bd59be27...
cocoon-layer9:  a91b26a4acfbe4...
cocoon-layer10: 4f81e1bad188e0...

The snapshot's vm.config layer agrees — 12 disks (11 base + cow.raw).

2. The referenced base manifest at epoch matches

crane manifest epoch.simular.cloud/simular/ubuntu:24.04-xface returns digest sha256:fff6f7a6786e... with exactly those 11 layers.

3. All 11 base-layer .erofs blobs are present on the cloning node

$ ls /var/lib/cocoon/oci/blobs/
b40150c1c2717d...erofs  3a6844925eb6c8...erofs  e91c01bdde5a23...erofs
f930570bc53388...erofs  e88e723e1e6992...erofs  f230ba36f6926f...erofs
65be6c6a51e92d...erofs  d8e7381f1b6d5d...erofs  01f018bd59be27...erofs
a91b26a4acfbe4...erofs  4f81e1bad188e0...erofs
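This presence check can be mechanized. A minimal sketch, assuming cocoon.json exposes the digests under storage_configs[*].path as sha256:&lt;hex&gt; and that blobs are stored as &lt;hex&gt;.erofs (the paths come from this report; the exact JSON shape is a guess):

```python
import json
from pathlib import Path

def missing_blobs(cocoon_json: str, blob_dir: str) -> list[str]:
    """Return declared base-layer digests with no matching <hex>.erofs blob locally."""
    manifest = json.loads(cocoon_json)
    declared = [
        cfg["path"].split(":", 1)[-1]          # "sha256:<hex>" -> "<hex>"
        for cfg in manifest.get("storage_configs", [])
        if cfg.get("id", "").startswith("cocoon-layer")   # skip cow.raw etc.
    ]
    blobs = Path(blob_dir)
    return [d for d in declared if not (blobs / f"{d}.erofs").exists()]
```

Running this against the node's /var/lib/cocoon/oci/blobs/ returns an empty list for the snapshot above, i.e. every declared blob is already cached.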

4. The cloned VM's runtime config uses different layers

/var/lib/cocoon/run/cloudhypervisor/<ULID>/config.json of the live cloned VM lists only 6 base disks:

cocoon-layer0: b40150c1c2717d...erofs   (matches new base layer 0)
cocoon-layer1: 3a6844925eb6c8...erofs   (matches new base layer 1)
cocoon-layer2: 51189d853427f2...erofs   ← NOT in new base manifest
cocoon-layer3: 268433ca369440...erofs   ← NOT in new base manifest
cocoon-layer4: bab9e12b5c15d0...erofs   ← NOT in new base manifest
cocoon-layer5: ff10e322f9ef77...erofs   ← NOT in new base manifest

Inspecting the mount inside the VM confirms a 6-lowerdir overlay (vs. the 11-lowerdir overlay produced by cocoon vm run epoch.simular.cloud/simular/ubuntu:24.04-xface directly on the same node).

5. The four "alien" digests are from a different cached image

$ crane manifest epoch.simular.cloud/simular/ubuntu:24.04-xface-20260513 | jq '.layers[].digest'
sha256:b40150c1c2717d...
sha256:3a6844925eb6c8...
sha256:51189d853427f2...   ← match
sha256:268433ca369440...   ← match
sha256:bab9e12b5c15d0...   ← match
sha256:ff10e322f9ef77...   ← match

simular/ubuntu:24.04-xface-20260513 is yesterday's date-pinned alias of the same image stream. It shares prefix layers (b40150, 3a6844) with today's :24.04-xface and diverges thereafter (6 layers total vs. 11 today). On disk the node has both :24.04-xface and :24.04-xface-20260513 cached.
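The prefix/divergence relationship between the two tags can be expressed as a small diff over the ordered layer lists (digests truncated here, as above, for illustration):

```python
def layer_divergence(a: list[str], b: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Split two ordered layer-digest lists into (shared prefix, rest of a, rest of b)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i], a[i:], b[i:]

# First four layers of each manifest, per the crane output above:
xface = ["b40150", "3a6844", "e91c01", "f93057"]   # today's :24.04-xface (11 total)
dated = ["b40150", "3a6844", "51189d", "268433"]   # :24.04-xface-20260513 (6 total)
shared, only_today, only_dated = layer_divergence(xface, dated)
# shared == ["b40150", "3a6844"]: the streams diverge at layer index 2,
# which is exactly where the cloned VM's config starts pulling alien digests.
```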

6. Result

The cloned VM is missing every file added in the upper 5 layers of the snapshot's declared base — in our case the agent watchdog assets (/usr/local/bin/simular-agent-watchdog.sh, /etc/cron.d/simular-pro-agent, /etc/xdg/autostart/simular-pro-agent.desktop, /var/lib/systemd/linger/root). They were present in the bake VM at snapshot time but never end up in cow.raw because they live in the (read-only) base, and at clone time the wrong base layer set is mounted.

Hypothesis

cocoon vm clone does not faithfully reproduce the layer set declared in the snapshot's cocoon.json. It appears that clone resolves base layers by consulting the local image DB under the snapshot's baseimage tag, and that lookup can return a different cached image whose tag shares the same stem. With :24.04-xface colliding against the cached :24.04-xface-20260513, the wrong layer set is selected.

Expected

cocoon vm clone should treat the snapshot's cocoon.json storage_configs[*].path digests as authoritative. If a declared .erofs blob is missing locally, auto-pull (or fail loudly) instead of substituting from an arbitrary local image entry that shares prefix layers.
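In pseudocode terms, the expected behavior is something like the sketch below — resolve each declared digest directly against the blob store, never through the tag index, and fail loudly when a blob can't be fetched. blob_dir layout and the pull_blob callback are assumptions for illustration, not cocoon's actual internals:

```python
from pathlib import Path

def resolve_layer(digest: str, blob_dir: Path, pull_blob) -> Path:
    """Return the local .erofs path for a digest declared in cocoon.json.

    pull_blob(digest, dest) is a hypothetical registry fetch. On any miss that
    cannot be filled, raise instead of falling back to another image's layers.
    """
    blob = blob_dir / f"{digest}.erofs"
    if blob.exists():
        return blob
    try:
        pull_blob(digest, blob)
    except Exception as exc:
        raise RuntimeError(
            f"snapshot declares layer {digest} but no local blob exists and "
            f"pull failed; refusing to substitute layers from another image"
        ) from exc
    return blob
```

The key property is that the tag (:24.04-xface vs. :24.04-xface-20260513) never enters layer resolution at all; only the content-addressed digests from the snapshot manifest do.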

Repro

  1. Build base image A and push as :tag-A (e.g. 6 layers).
  2. Pull :tag-A on a cocoon node.
  3. Rebuild the image (more steps) and push as :tag (e.g. 11 layers). Both :tag and :tag-A exist in epoch; the node has both in cache.
  4. Bake a hot snapshot from :tag and push it.
  5. cocoon vm clone <hot-snapshot> on the same node.

Observed: cloned VM mounts 6 lowerdirs from :tag-A, not 11 from :tag. Files added by the upper 5 layers of :tag are gone.

Workaround for our pipeline: write all snapshot-critical assets via the install/warm script so they end up in cow.raw instead of base layers — but this defeats the purpose of layered base images.
