Summary
cocoon vm clone produces a runtime disk config that does not match the snapshot manifest's declared base-layer digests. It instead substitutes layers from a different cached image that happens to share the same repo:tag "stem" path. The substitution silently downgrades the cloned VM's lowerdirs from the snapshot's declared N layers to whatever shorter-tag image happens to be cached locally, masking files that exist only in the upper base layers.
Related to but distinct from #37/#38: in our case the base image is present locally at the correct digest, but cocoon still constructs the runtime config from a different image's layer set.
Environment
- Cocoon cluster: cocoonset-gke, vk-cocoon on cocoonset-node-2
- vm-service env: testing
- Hot snapshot: epoch.simular.cloud/simular/ubuntu-hot-testing:v1
- Snapshot baseimage annotation: epoch.simular.cloud/simular/ubuntu:24.04-xface
Evidence
1. Snapshot manifest on epoch declares the correct layer set
cocoon.json blob from the hot-snapshot manifest references 11 base layers:
cocoon-layer0: b40150c1c2717d... (29 MB)
cocoon-layer1: 3a6844925eb6c8... (444 MB)
cocoon-layer2: e91c01bdde5a23...
cocoon-layer3: f930570bc53388...
cocoon-layer4: e88e723e1e6992...
cocoon-layer5: f230ba36f6926f...
cocoon-layer6: 65be6c6a51e92d...
cocoon-layer7: d8e7381f1b6d5d...
cocoon-layer8: 01f018bd59be27...
cocoon-layer9: a91b26a4acfbe4...
cocoon-layer10: 4f81e1bad188e0...
The snapshot's vm.config layer agrees — 12 disks (11 base + cow.raw).
2. The referenced base manifest at epoch matches
crane manifest epoch.simular.cloud/simular/ubuntu:24.04-xface returns digest sha256:fff6f7a6786e... with exactly those 11 layers.
3. All 11 base-layer .erofs blobs are present on the cloning node
$ ls /var/lib/cocoon/oci/blobs/
b40150c1c2717d...erofs 3a6844925eb6c8...erofs e91c01bdde5a23...erofs
f930570bc53388...erofs e88e723e1e6992...erofs f230ba36f6926f...erofs
65be6c6a51e92d...erofs d8e7381f1b6d5d...erofs 01f018bd59be27...erofs
a91b26a4acfbe4...erofs 4f81e1bad188e0...erofs
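The presence check in step 3 can be scripted for a node-side sanity pass. This is a sketch, not a cocoon feature; the blob path is the one shown above, and the digest list comes from crane/jq as in step 5:

```shell
# Sketch: verify that every digest in a manifest's layer list has a matching
# .erofs blob in the local cache. The digest file holds one sha256:<hex> per line.
check_blobs() {
  local list=$1 blobdir=$2 missing=0 d
  while read -r d; do
    [ -f "$blobdir/${d#sha256:}.erofs" ] || { echo "MISSING $d"; missing=1; }
  done < "$list"
  return "$missing"
}

# On the node:
#   crane manifest epoch.simular.cloud/simular/ubuntu:24.04-xface \
#     | jq -r '.layers[].digest' > /tmp/declared
#   check_blobs /tmp/declared /var/lib/cocoon/oci/blobs && echo "all 11 present"
```

In our case this passes — all 11 blobs are cached — which is what makes the substitution in step 4 surprising.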
4. The cloned VM's runtime config uses different layers
/var/lib/cocoon/run/cloudhypervisor/<ULID>/config.json of the live cloned VM lists only 6 base disks:
cocoon-layer0: b40150c1c2717d...erofs (matches new base layer 0)
cocoon-layer1: 3a6844925eb6c8...erofs (matches new base layer 1)
cocoon-layer2: 51189d853427f2...erofs ← NOT in new base manifest
cocoon-layer3: 268433ca369440...erofs ← NOT in new base manifest
cocoon-layer4: bab9e12b5c15d0...erofs ← NOT in new base manifest
cocoon-layer5: ff10e322f9ef77...erofs ← NOT in new base manifest
Inside the VM, mount confirms a 6-lowerdir overlay (vs. the 11-lowerdir overlay produced by cocoon vm run epoch.simular.cloud/simular/ubuntu:24.04-xface directly on the same node).
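The declared-vs-mounted gap can be made explicit by diffing the two digest sets. Below, the digest prefixes are the ones from this report; in practice the lists would come from crane/jq for the base manifest and from parsing the .erofs paths in the clone's config.json:

```shell
# Sketch: diff the snapshot's declared base-layer digests against what the
# clone actually mounted (digest prefixes abbreviated as in this report).
printf '%s\n' b40150 3a6844 e91c01 f93057 e88e72 f230ba 65be6c d8e738 01f018 a91b26 4f81e1 | sort > declared.txt
printf '%s\n' b40150 3a6844 51189d 268433 bab9e1 ff10e3 | sort > mounted.txt

echo "declared but NOT mounted:"; comm -13 mounted.txt declared.txt   # 9 digests
echo "mounted but alien:";        comm -23 mounted.txt declared.txt   # 4 digests
```

The 4 "alien" digests are exactly the ones traced to :24.04-xface-20260513 in step 5.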
5. The four "alien" digests are from a different cached image
$ crane manifest epoch.simular.cloud/simular/ubuntu:24.04-xface-20260513 | jq -r '.layers[].digest'
sha256:b40150c1c2717d...
sha256:3a6844925eb6c8...
sha256:51189d853427f2... ← match
sha256:268433ca369440... ← match
sha256:bab9e12b5c15d0... ← match
sha256:ff10e322f9ef77... ← match
simular/ubuntu:24.04-xface-20260513 is yesterday's date-pinned alias of the same image stream. It shares prefix layers (b40150, 3a6844) with today's :24.04-xface and diverges thereafter (6 layers total vs. 11 today). On disk the node has both :24.04-xface and :24.04-xface-20260513 cached.
6. Result
The cloned VM is missing every file added in the upper 5 layers of the snapshot's declared base — in our case the agent watchdog assets (/usr/local/bin/simular-agent-watchdog.sh, /etc/cron.d/simular-pro-agent, /etc/xdg/autostart/simular-pro-agent.desktop, /var/lib/systemd/linger/root). They were present in the bake VM at snapshot time but never end up in cow.raw because they live in the (read-only) base, and at clone time the wrong base layer set is mounted.
Hypothesis
cocoon vm clone does not faithfully reproduce the layer set declared in the snapshot's cocoon.json. It looks like clone resolves base layers by consulting the local image DB with the snapshot's baseimage tag, and that lookup can return a different cached image whose tag shares the same stem. With short stems like :24.04-xface colliding with :24.04-xface-20260513, the wrong layer set is selected.
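If the lookup really is keyed on the tag stem, the collision is easy to reproduce in miniature. This is a toy model — the real cocoon image DB format is unknown; only the matching behavior is the point:

```shell
# Toy model of the suspected failure: a prefix-style lookup against the local
# image DB matches both cached tags, and the first hit wins.
# (Hypothetical entry format: "<tag> <layer-count>".)
db='24.04-xface-20260513 6-layers
24.04-xface 11-layers'

lookup() { printf '%s\n' "$db" | grep -m1 "^$1"; }

lookup 24.04-xface
# -> 24.04-xface-20260513 6-layers   (wrong entry: stem collision)
```

An exact-match lookup (or resolution by digest, never by tag) would not have this failure mode.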
Expected
cocoon vm clone should treat the snapshot's cocoon.json storage_configs[*].path digests as authoritative. If a declared .erofs blob is missing locally, auto-pull (or fail loudly) instead of substituting from an arbitrary local image entry that shares prefix layers.
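What "digest-authoritative" resolution could look like, sketched in shell. Cocoon's internals are unknown; resolve_layer and the auto-pull via crane blob are illustrative assumptions, not actual cocoon code:

```shell
# Sketch of the expected behavior: resolve each digest declared in the
# snapshot's cocoon.json to its local .erofs blob; if absent, pull the exact
# blob by digest (never by tag) or fail loudly.
resolve_layer() {
  local digest=$1 blobdir=$2 repo=$3
  local path="$blobdir/${digest#sha256:}.erofs"
  if [ ! -f "$path" ]; then
    # Fetch by digest so a tag lookup never enters the picture.
    crane blob "${repo}@${digest}" > "$path" || {
      rm -f "$path"
      echo "FATAL: declared layer $digest not available" >&2
      return 1
    }
  fi
  echo "$path"
}
```

Either branch keeps the snapshot's declared layer set intact; the one thing it never does is substitute a different image's layers.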
Repro
- Build base image A and push as :tag-A (e.g. 6 layers).
- Pull :tag-A on a cocoon node.
- Rebuild the image (more steps) and push as :tag (e.g. 11 layers). Both :tag and :tag-A exist in epoch; the node has both in cache.
- Bake a hot snapshot from :tag and push it.
- cocoon vm clone <hot-snapshot> on the same node.

Observed: cloned VM mounts 6 lowerdirs from :tag-A, not 11 from :tag. Files added by the upper 5 layers of :tag are gone.
Workaround for our pipeline: write all snapshot-critical assets via the install/warm script so they end up in cow.raw instead of base layers — but this defeats the purpose of layered base images.