| 1 | +# bcvk and Apple container integration |
| 2 | + |
| 3 | +Apple's [`container`](https://github.com/apple/container) is a Swift-based |
| 4 | +tool that runs Linux containers as lightweight virtual machines on Apple Silicon |
| 5 | +Macs using the macOS Virtualization framework. It is macOS-only (requires macOS |
| 6 | +26+ and Apple Silicon) and targets standard OCI container images. |
| 7 | + |
| 8 | +bcvk runs *bootable* container images as VMs using QEMU/libvirt on Linux. |
| 9 | + |
| 10 | +Both tools use VMs, but for different purposes. Apple's `container` |
| 11 | +runs container processes inside lightweight VMs — the VM is an isolation |
| 12 | +mechanism wrapping what is conceptually still a container. bcvk boots a |
| 13 | +complete OS from a container image — the VM *is* the end product, not an |
| 14 | +implementation detail. |
| 15 | + |
| 16 | +The interesting integration opportunity is that Apple's tool creates ext4 |
| 17 | +filesystem images from OCI layers and caches them on disk. bcvk could read |
| 18 | +those ext4 images directly, extract the kernel, and boot them as full VMs — |
| 19 | +avoiding the `bootc install to-disk` step on macOS. |
| 20 | + |
| 21 | +## Background |
| 22 | + |
| 23 | +Apple's `container` converts OCI images into ext4 filesystem images using |
| 24 | +`EXT4Unpacker` from the |
| 25 | +[Containerization](https://github.com/apple/containerization) Swift package, |
| 26 | +then attaches them to VMs as virtio-blk devices. A `SnapshotStore` caches the |
| 27 | +ext4 images keyed by manifest digest. The VM boots a separate minimal kernel |
| 28 | +and `vminitd` guest agent — the kernel is *not* from the container image. |
| 29 | + |
| 30 | +bcvk's current flows (ephemeral run via VirtioFS, to-disk via `bootc install`) |
| 31 | +require bootc images that contain a kernel, initramfs, and systemd. The kernel |
| 32 | +is always extracted from the container image itself. |
| 33 | + |
| 34 | +## How Apple's ext4 pipeline works |
| 35 | + |
| 36 | +Apple's `EXT4Unpacker` (in the |
| 37 | +[Containerization](https://github.com/apple/containerization) package) does |
| 38 | +roughly the following: |
| 39 | + |
| 40 | +1. Creates a sparse ext4 filesystem image file via `EXT4.Formatter(path, |
| 41 | + minDiskSize: N)`. The default minimum size is 512 GiB for regular |
| 42 | + containers (sparse, so actual disk usage is much smaller). |
| 43 | + |
| 44 | +2. Iterates through the OCI image manifest's layers in order. For each layer, |
| 45 | + it calls `filesystem.unpack(source: layer.path, ...)`, which reads the |
| 46 | + layer tarball (gzip, zstd, or uncompressed) and writes its contents |
| 47 | + directly into the ext4 image. OCI whiteout files (`.wh.*` and |
| 48 | + `.wh..wh..opq`) are handled inline — whiteout entries delete files from |
| 49 | + previous layers. |
| 50 | + |
| 51 | +3. The result is a flat ext4 image containing the fully merged container |
| 52 | + rootfs. No union filesystem or overlay is needed at runtime. |
| 53 | + |
| 54 | +This ext4 image is then attached to a lightweight VM as a virtio-blk device. |
| 55 | +Inside the VM, a minimal guest agent (`vminitd`) mounts it and runs container |
| 56 | +processes within Linux cgroups and namespaces. Critically, the kernel and |
| 57 | +vminitd are *not* from the container image — they're provided separately by the |
| 58 | +`container` tool's own "init image." |
| 59 | + |
| 60 | +## How bcvk's current flows work |
| 61 | + |
| 62 | +bcvk has two main paths for getting from container image to running VM: |
| 63 | + |
| 64 | +**Ephemeral run** (`bcvk ephemeral run`): The container image is pulled via |
| 65 | +podman and mounted directly as the VM's root filesystem using VirtioFS (via |
| 66 | +virtiofsd). The kernel and initramfs are extracted from *within* the container |
| 67 | +image (from `/usr/lib/modules/<version>/` or `/boot/EFI/Linux/*.efi`). The VM |
| 68 | +boots with `rootfstype=virtiofs root=rootfs` and systemd takes over as init. |
| 69 | +This requires a *bootc* image — one that contains a kernel, initramfs, and |
| 70 | +systemd. |
| 71 | + |
| 72 | +**To-disk** (`bcvk to-disk`): An ephemeral VM is launched using the approach |
| 73 | +above, and within it, `bootc install to-disk` runs to install the OS to an |
| 74 | +attached virtio-blk disk. The output is a full disk image (with partition |
| 75 | +table, bootloader, etc.) suitable for libvirt or QEMU. |
| 76 | + |
| 77 | +Both flows fundamentally require bootc images. Standard OCI containers (e.g. |
| 78 | +`docker.io/library/nginx`) lack a kernel, initramfs, and systemd, so bcvk |
| 79 | +can't boot them. |
| 80 | + |
| 81 | +## Reusing Apple's ext4 images directly |
| 82 | + |
| 83 | +Since Apple's `container` tool already synthesizes ext4 rootfs images and |
| 84 | +caches them on disk (see the "Apple's storage APIs" section below), bcvk |
| 85 | +doesn't need to reimplement ext4 synthesis. On macOS, if the user has already |
| 86 | +pulled an image with Apple's `container` tool, the ext4 is sitting at a |
| 87 | +well-known path: |
| 88 | +`~/Library/Application Support/com.apple.containerization/snapshots/<manifest-digest>/snapshot`. |
| 89 | + |
| 90 | +To boot that ext4 as a VM, bcvk needs to: |
| 91 | + |
| 92 | +1. **Locate the ext4 snapshot** — resolve the image reference to a |
| 93 | + platform-specific manifest digest, strip the `sha256:` prefix, and look |
| 94 | + for the file at the snapshot store path. |
| 95 | + |
| 96 | +2. **Extract the kernel** — read the kernel and initramfs out of the ext4 |
| 97 | + image without mounting it (see below). |
| 98 | + |
| 99 | +3. **Boot via QEMU** — direct kernel boot (`-kernel`/`-initrd`) with the |
| 100 | + ext4 image attached as a virtio-blk device and the right kernel command |
| 101 | + line to mount it as root. |
| 102 | + |
| 103 | +This avoids reimplementing Apple's `EXT4Unpacker` entirely. bcvk becomes a |
| 104 | +consumer of Apple's snapshot store rather than a competing image pipeline. |
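
Step 1 above can be sketched as a small path helper. This is a minimal sketch, not bcvk code: the function names `default_base` and `snapshot_path` are hypothetical, and it assumes the default base directory quoted above has not been relocated.

```rust
use std::path::{Path, PathBuf};

/// Default base directory of Apple's stores under a given home directory.
/// Hypothetical helper; the relative path is the one quoted in this document.
fn default_base(home: &Path) -> PathBuf {
    home.join("Library/Application Support/com.apple.containerization")
}

/// Expected location of the cached ext4 image for a platform-specific
/// manifest digest. Accepts the digest with or without the `sha256:` prefix.
fn snapshot_path(base: &Path, manifest_digest: &str) -> PathBuf {
    // The snapshot directory name is the digest without its algorithm prefix.
    let hex = manifest_digest
        .strip_prefix("sha256:")
        .unwrap_or(manifest_digest);
    base.join("snapshots").join(hex).join("snapshot")
}
```

A caller would still need to check that the returned path exists, since the snapshot is only present once `container` has unpacked the image.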
| 105 | + |
| 106 | +## Kernel extraction from the ext4 image |
| 107 | + |
| 108 | +Apple's `container` ships its own pre-built kernel separately from the |
| 109 | +container image. bcvk takes a different approach: the kernel always comes from |
| 110 | +the container image itself. This is a core design principle — bcvk boots the |
| 111 | +image's own kernel so the VM matches what would run in production. There is no |
| 112 | +"ship a separate kernel" option. |
| 113 | + |
| 114 | +For bootc images accessed via VirtioFS, bcvk already extracts the kernel from |
| 115 | +the mounted filesystem using `find_kernel()` in `crates/kit/src/kernel.rs`. |
| 116 | +That function searches for UKIs in `/boot/EFI/Linux/*.efi` and |
| 117 | +`/usr/lib/modules/<version>/*.efi`, and for traditional `vmlinuz` + |
| 118 | +`initramfs.img` pairs in `/usr/lib/modules/<version>/`. It operates on a |
| 119 | +`cap_std::fs::Dir`, which requires the filesystem to be mounted or otherwise |
| 120 | +accessible as a directory tree. |
| 121 | + |
| 122 | +When working with Apple's ext4 snapshots, the rootfs is an ext4 image file |
| 123 | +rather than a mounted directory. The kernel needs to be extracted from that |
| 124 | +ext4 image *before* the VM boots (since QEMU's `-kernel` flag needs the kernel |
| 125 | +as a host file). This creates a constraint: we need to read the kernel out |
| 126 | +of the ext4 image, but we don't want to mount the image (that would require |
| 127 | +root or FUSE). |
| 128 | + |
| 129 | +The solution is to use a userspace ext4 reader. The |
| 130 | +[`ext4-view`](https://github.com/nicholasbishop/ext4-view-rs) crate provides |
| 131 | +read-only access to ext4 filesystems from a file or byte buffer, without |
| 132 | +mounting. It's pure Rust, no unsafe, `no_std` compatible, and its API follows |
| 133 | +`std::fs` conventions (`read()`, `read_dir()`, `metadata()`, `exists()`). |
| 134 | + |
| 135 | +The implementation would work roughly as follows: |
| 136 | + |
| 137 | +1. After locating the ext4 snapshot from Apple's store, open it with |
| 138 | + `ext4_view::Ext4::load_from_path()`. |
| 139 | + |
| 140 | +2. Run the same kernel search logic that `find_kernel()` uses, but against |
| 141 | + the `Ext4` filesystem API instead of `cap_std::fs::Dir`. The search paths |
| 142 | + are identical: `/boot/EFI/Linux/*.efi`, `/usr/lib/modules/<version>/*.efi`, |
| 143 | + `/usr/lib/modules/<version>/vmlinuz` + `initramfs.img`. |
| 144 | + |
| 145 | +3. Extract the kernel (and initramfs if present) to a temporary file on the |
| 146 | + host using `Ext4::read()`, which returns the file contents as `Vec<u8>`. |
| 147 | + |
| 148 | +4. Pass the extracted kernel to QEMU via `-kernel` (and `-initrd` if |
| 149 | + applicable), with the ext4 image as a virtio-blk device. |
| 150 | + |
| 151 | +This approach is attractive because `ext4-view`'s API maps closely to |
| 152 | +`cap_std::fs::Dir`. The kernel search logic could be refactored to be generic |
| 153 | +over a filesystem trait — something like a `ReadDir + Read + Metadata` |
| 154 | +abstraction — that both `Dir` and `Ext4` implement. Alternatively, a simpler |
| 155 | +approach: a second `find_kernel_in_ext4()` function that duplicates the search |
| 156 | +logic against the `Ext4` type. Given that the search logic is ~90 lines, a |
| 157 | +small amount of duplication may be acceptable for a first pass, with |
| 158 | +deduplication via a trait coming later. |
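
The trait-based option could look roughly like the sketch below. Everything here is hypothetical: `FsView`, `Kernel`, `find_kernel_generic`, and the in-memory `MemFs` stand-in are illustration names, and the search order (UKIs in `/boot/EFI/Linux` first, then per-version module directories) is an assumption rather than a description of the existing `find_kernel()`. A real implementation would add `FsView` impls for `cap_std::fs::Dir` and for the `ext4-view` filesystem type.

```rust
use std::collections::BTreeSet;

/// Minimal read-only filesystem view (hypothetical trait, not part of bcvk).
trait FsView {
    /// Names of entries directly under `dir`; empty if `dir` is missing.
    fn list(&self, dir: &str) -> Vec<String>;
    /// Whether a file exists at `path`.
    fn exists(&self, path: &str) -> bool;
}

#[derive(Debug, PartialEq)]
enum Kernel {
    Uki(String),
    Traditional { vmlinuz: String, initramfs: String },
}

/// First `*.efi` entry in `dir`, in sorted order, as a full path.
fn first_efi<F: FsView>(fs: &F, dir: &str) -> Option<String> {
    let mut names: Vec<String> = fs
        .list(dir)
        .into_iter()
        .filter(|n| n.ends_with(".efi"))
        .collect();
    names.sort();
    names.into_iter().next().map(|n| format!("{}/{}", dir, n))
}

/// Search the three locations named in this document (order is an assumption).
fn find_kernel_generic<F: FsView>(fs: &F) -> Option<Kernel> {
    if let Some(uki) = first_efi(fs, "/boot/EFI/Linux") {
        return Some(Kernel::Uki(uki));
    }
    let mut versions = fs.list("/usr/lib/modules");
    versions.sort();
    for v in versions {
        let dir = format!("/usr/lib/modules/{}", v);
        if let Some(uki) = first_efi(fs, &dir) {
            return Some(Kernel::Uki(uki));
        }
        let vmlinuz = format!("{}/vmlinuz", dir);
        let initramfs = format!("{}/initramfs.img", dir);
        if fs.exists(&vmlinuz) && fs.exists(&initramfs) {
            return Some(Kernel::Traditional { vmlinuz, initramfs });
        }
    }
    None
}

/// In-memory stand-in for a filesystem: a set of absolute file paths.
struct MemFs(BTreeSet<String>);

impl FsView for MemFs {
    fn list(&self, dir: &str) -> Vec<String> {
        let prefix = format!("{}/", dir.trim_end_matches('/'));
        let mut out: Vec<String> = Vec::new();
        for p in &self.0 {
            if let Some(rest) = p.strip_prefix(&prefix) {
                // Keep only the first path component below `dir`.
                let first = rest.split('/').next().unwrap_or("");
                if !first.is_empty() && !out.iter().any(|n| n == first) {
                    out.push(first.to_string());
                }
            }
        }
        out
    }

    fn exists(&self, path: &str) -> bool {
        self.0.contains(path)
    }
}
```

The in-memory implementation also makes the search logic unit-testable without a container image or an ext4 file on hand.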
| 159 | + |
| 160 | +The `ext4-view` crate is Apache-2.0/MIT dual-licensed (compatible with bcvk's |
| 161 | +licensing), has no unsafe code, and is actively maintained. It handles the ext4 |
| 162 | +format details (block groups, extent trees, directory entries) that would be |
| 163 | +tedious to implement from scratch. |
| 164 | + |
| 165 | +## What would be different from default `apple/container` |
| 166 | + |
| 167 | +Even though bcvk would read `apple/container`'s ext4 images, the boot model is |
| 168 | +fundamentally different. `apple/container`'s `vminitd` is a purpose-built gRPC |
| 169 | +agent that manages container processes using Linux cgroups and namespaces |
| 170 | +*within* the VM — essentially a container runtime inside a VM. bcvk boots |
| 171 | +systemd and runs the full OS using the image's own kernel. The container |
| 172 | +image *is* the OS. |
| 173 | + |
| 174 | +This means the images bcvk can boot from `apple/container`'s snapshot store are |
| 175 | +limited to those that contain a kernel — bootc-style images. For images that |
| 176 | +lack a kernel entirely (e.g. `docker.io/library/nginx`), bcvk would not |
| 177 | +attempt to boot them. That's not bcvk's use case. |
| 178 | + |
| 179 | +## Practical assessment |
| 180 | + |
| 181 | +The implementation path for booting `apple/container`'s ext4 snapshots: |
| 182 | + |
| 183 | +1. Locate the ext4 snapshot on disk. Resolve the image reference to a |
| 184 | + manifest digest (via the OCI index in `apple/container`'s content store |
| 185 | + or by querying the `container` CLI) and find the file at |
| 186 | + `~/Library/Application Support/com.apple.containerization/snapshots/<digest>/snapshot`. |
| 187 | + |
| 188 | +2. Use `ext4-view` to read the ext4 image and extract the kernel and |
| 189 | + initramfs to temporary host files, using the same search logic as the |
| 190 | + existing `find_kernel()`. |
| 191 | + |
| 192 | +3. Boot via QEMU with `-kernel`/`-initrd` pointing to the extracted files |
| 193 | + and the ext4 image as a virtio-blk root device. |
| 194 | + |
| 195 | +4. Wire this into `bcvk ephemeral run` as a new path on macOS. |
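
Step 3 can be sketched as a function that assembles the QEMU arguments. This is an illustrative sketch under assumptions: `qemu_boot_args` is a hypothetical name, the `root=/dev/vda rw` command line assumes the snapshot is the first virtio-blk disk, and `snapshot=on` is a design choice made here so that guest writes go to a temporary overlay instead of modifying Apple's cached image.

```rust
use std::path::Path;

/// Build QEMU arguments for direct kernel boot with an ext4 image as root.
/// Hypothetical helper; machine, CPU, and memory flags are left to the caller.
fn qemu_boot_args(kernel: &Path, initrd: Option<&Path>, rootfs: &Path) -> Vec<String> {
    let mut args = vec!["-kernel".to_string(), kernel.display().to_string()];
    if let Some(initrd) = initrd {
        args.push("-initrd".to_string());
        args.push(initrd.display().to_string());
    }
    // Attach the ext4 image as a raw virtio-blk disk. `snapshot=on` keeps
    // the backing file unmodified (assumption: the cache should stay pristine).
    args.push("-drive".to_string());
    args.push(format!(
        "file={},format=raw,if=virtio,snapshot=on",
        rootfs.display()
    ));
    // Assumed kernel command line: the image is the first virtio-blk device.
    args.push("-append".to_string());
    args.push("root=/dev/vda rw".to_string());
    args
}
```

Each argument is a separate element, so the space in `Application Support` needs no shell quoting when the vector is passed to `std::process::Command`. A path containing a comma would still need QEMU's `,,` escaping inside the `-drive` value, which this sketch does not handle.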
| 196 | + |
| 197 | +Reading the ext4 and extracting the kernel are not the hard parts; both are |
| 198 | +straightforward with `ext4-view`. The harder design question is digest |
| 199 | +resolution: mapping an image reference to the right snapshot directory. |
| 200 | + |
| 201 | +## Apple's storage APIs and what they expose |
| 202 | + |
| 203 | +The `Containerization` Swift package and the `container` tool's services expose |
| 204 | +a layered set of APIs for accessing stored container images and their |
| 205 | +synthesized ext4 filesystems. Understanding these APIs is useful for evaluating |
| 206 | +whether bcvk (or any external tool) could reuse Apple's image storage directly. |
| 207 | + |
| 208 | +### The content store: OCI blobs as files |
| 209 | + |
| 210 | +The lowest layer is `LocalContentStore` (in `ContainerizationOCI`), which |
| 211 | +implements a standard OCI content-addressable storage layout. Blobs are stored |
| 212 | +as flat files at `<basePath>/blobs/sha256/<digest>`, where the default base |
| 213 | +path is `~/Library/Application Support/com.apple.containerization/content/`. |
| 214 | + |
| 215 | +The `ContentStore` protocol provides `get(digest:) -> Content?`, which returns |
| 216 | +a `Content` object for any blob. The `Content` protocol exposes: |
| 217 | + |
| 218 | +- `path: URL` — the filesystem path to the blob file |
| 219 | +- `data() -> Data` — read the entire blob into memory |
| 220 | +- `data(offset:length:) -> Data?` — read a range of the blob |
| 221 | +- `size() -> UInt64` — file size |
| 222 | +- `digest() -> SHA256.Digest` — content hash |
| 223 | + |
| 224 | +`Image.getContent(digest:)` wraps this: given a digest that the image |
| 225 | +references, it returns the `Content` object, from which you can get the `.path` |
| 226 | +to the raw layer tarball on disk. The layers are stored as compressed tarballs |
| 227 | +(gzip or zstd), exactly as pulled from the registry. |
| 228 | + |
| 229 | +Any external tool that knows the digest of a layer can read it directly from |
| 230 | +the filesystem without going through the Swift API — the layout is just files |
| 231 | +in a well-known directory. |
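
As a sketch of that direct access, the helper below maps a digest to its blob path. The name `blob_path` is hypothetical, and validating the digest as 64 lowercase hex characters is a precaution added here (it also rules out path traversal through a malformed digest), not something taken from Apple's code.

```rust
use std::path::{Path, PathBuf};

/// Map a `sha256:<hex>` digest to `<base>/blobs/sha256/<hex>`.
/// Returns `None` for other algorithms or malformed digests.
fn blob_path(base: &Path, digest: &str) -> Option<PathBuf> {
    let hex = digest.strip_prefix("sha256:")?;
    // A SHA-256 digest is 32 bytes, so 64 hex characters.
    let valid = hex.len() == 64
        && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !valid {
        return None;
    }
    Some(base.join("blobs").join("sha256").join(hex))
}
```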
| 232 | + |
| 233 | +### The snapshot store: cached ext4 images |
| 234 | + |
| 235 | +Above the content store sits `SnapshotStore` (in the `container` tool's |
| 236 | +`ContainerImagesService`). This is where synthesized ext4 images are cached. |
| 237 | +The layout on disk is `<basePath>/snapshots/<manifest-digest>/snapshot`, where |
| 238 | +each `snapshot` file is a regular (sparse) ext4 filesystem image. |
| 239 | + |
| 240 | +`SnapshotStore.get(for:platform:)` returns a `Filesystem` object describing |
| 241 | +the cached ext4. The `Filesystem` type has a `source: String` field containing |
| 242 | +the absolute path to the ext4 file, along with `type` (block, virtiofs, etc.), |
| 243 | +`destination` (mount point), and `options` (mount options). For snapshots, the |
| 244 | +type is `.block(format: "ext4", ...)` and the source points to the `snapshot` |
| 245 | +file. |
| 246 | + |
| 247 | +`SnapshotStore.unpack(image:platform:)` creates the ext4 if it doesn't already |
| 248 | +exist: it delegates to `EXT4Unpacker.unpack()`, which iterates the image's |
| 249 | +layers in order, unpacking each compressed tarball directly into an ext4 image |
| 250 | +via `EXT4.Formatter`. The result is moved atomically into the snapshot |
| 251 | +directory. Alongside the `snapshot` file, a `snapshot-info` JSON file stores |
| 252 | +the serialized `Filesystem` metadata. |
| 253 | + |
| 254 | +### Can an external tool read these files? |
| 255 | + |
| 256 | +Yes, straightforwardly. Both the layer tarballs in the content store and the |
| 257 | +ext4 images in the snapshot store are regular files. There is no database, no |
| 258 | +proprietary container format, no locking mechanism that would prevent another |
| 259 | +process from reading them. If Apple's `container` tool has already pulled an |
| 260 | +image and unpacked it, bcvk could read the ext4 file directly from |
| 261 | +`~/Library/Application Support/com.apple.containerization/snapshots/<digest>/snapshot`. |
| 262 | + |
| 263 | +There are caveats. The snapshot store is keyed by the platform-specific |
| 264 | +manifest digest (not the image reference or index digest), so you'd need to |
| 265 | +resolve the image reference to the correct manifest digest to find the right |
| 266 | +snapshot directory. On-disk names use the hex digest with the `sha256:` |
| 267 | +prefix stripped (the `trimmingDigestPrefix` convention), so lookups must strip |
| 268 | +it too. Both stores could be relocated if the user changes the base path. |
| 269 | + |
| 270 | +### Could bcvk use the Swift APIs directly? |
| 271 | + |
| 272 | +Not practically. The `Containerization` package is Swift-only and the ext4 |
| 273 | +writing code (`ContainerizationEXT4`) is gated behind `#if os(macOS)` — it |
| 274 | +won't compile on Linux. The APIs are designed to be consumed from Swift |
| 275 | +processes running on macOS. |
| 276 | + |
| 277 | +However, bcvk doesn't *need* the APIs. Since the on-disk layout is simple and |
| 278 | +well-defined, bcvk can read the ext4 snapshot files directly using their |
| 279 | +filesystem paths. The `ext4-view` crate handles reading the ext4 contents |
| 280 | +(for kernel extraction) without any dependency on Apple's Swift packages. |
| 281 | + |
| 282 | +### Summary of the storage surface |
| 283 | + |
| 284 | +The content store and snapshot store together form a clean two-tier cache: |
| 285 | +compressed layer tarballs keyed by content digest, and materialized ext4 images |
| 286 | +keyed by manifest digest. Both tiers are plain files on disk with predictable |
| 287 | +paths. An external tool running on the same macOS system can read them without |
| 288 | +any API dependency on Apple's Swift packages — all you need is the image digest |
| 289 | +and knowledge of the directory layout. |