Add OCI image support: pull, unpack, run, prune, status, policy #34

Open
Max042004 wants to merge 61 commits into sysprog21:main from Max042004:oci-image

@Max042004 Max042004 commented May 15, 2026

This PR lands the full elfuse OCI image support. It supersedes the
original Phase 1 scope of this PR (CLI scaffold + pull/inspect) and
now covers Phases 1-4 plus the post-Phase-3 improvements plan: image
layout alignment, GC/prune, layer + stack snapshot caches, store
status, parallel pull, registry policy.json, and a heavy-mode compat
matrix.

Scope

  • Pull / inspect: content-addressable blob store, HTTPS + bearer
    token, OCI index walk to the linux/arm64 leaf manifest,
    partial-store-aware inspect renderer.
  • Unpack: tar reader (ustar + PAX x/g records), gzip + decode-only
    vendored zstd, whiteout-aware layer apply (typeflag '1'/'2'/'5'
    plus .wh.* markers), per-image sysroot on a case-sensitive APFS
    sparsebundle.
  • Run: elfuse oci run clones the unpacked tree via clonefile(2),
    honors Entrypoint / Cmd / Env / WorkingDir / User, and reuses the
    existing elfuse launch path so a dynamically-linked guest binary
    runs through the same shim + syscall surface as the non-OCI mode.
  • Lifecycle: oci prune with --older-than / --keep-bytes;
    layer + stack prune sweep; oci status (text + --json);
    oci rebuild-cache for pre-snapshot stores.
  • Performance: parallel blob fetch with HTTP Range resume;
    per-layer raw snapshot cache; ChainID stack snapshot cache; APFS
    COW clone-rootfs reuse between runs.
  • Policy: podman / skopeo-style policy.json + registries.d
    overlay (per-registry insecure / ca_bundle / auth_file). CLI flags
    override; loopback-only --insecure.
  • Test coverage: 25 OCI unit suites (test-oci-*), compat-shell
    smoke (tests/test-oci-compat.sh), and an opt-in heavy mode
    (OCI_COMPAT_TEST=1) that drives three layered fixtures
    (alpine-shaped, busybox-shaped hardlink dispatch, two-layer
    whiteout) end-to-end through a freshly-provisioned scratch
    sparsebundle.

Manual smoke test (docker.io/library/python:3.12)

A real end-to-end pull-and-run against a mainstream multi-layer glibc
image. The image's default Entrypoint is docker-entrypoint.sh (a
shell script, which elfuse does not execute), so the commands below
override --entrypoint to the python3 binary directly.

make elfuse
SCRATCH=$(mktemp -d)
echo "store: $SCRATCH"

# 1. Pull (~400 MB across 7 layers, ~3 minutes on a fast link).
#    If your terminal mishandles CSI cursor-up and the progress
#    output stacks duplicate rows, prepend ELFUSE_OCI_PROGRESS=plain
#    to fall back to one summary line per blob.
./build/elfuse oci pull --store "$SCRATCH" python:3.12

# 2. Offline inspect: image index -> linux/arm64 manifest -> config
#    runtime block (Entrypoint / Cmd / Env / WorkingDir / User).
./build/elfuse oci inspect --store "$SCRATCH" python:3.12

# 3. Cold run. First invocation triggers layer unpack onto the
#    sysroot APFS sparsebundle, then clone-rootfs, then launch. The
#    unpack step dominates the ~50 s wall on a fresh store.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'print("hello from elfuse", 1+2)'
# expected stdout:  hello from elfuse 3

# 4. Warm run. clone-rootfs reuses the unpacked image tree, so wall
#    drops to ~2 s and is dominated by VM bring-up + dynamic-linker
#    bring-up + Python interp init.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import sys, platform; print(sys.version); print(platform.platform()); print(platform.machine())'
# expected stdout:  Python 3.12.x ... / Linux-<kernel>-aarch64-with-glibc2.41 / aarch64

# 5. stdlib smoke. Confirms json + math + f-string formatting all
#    flow through the emulated syscall surface.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import json, math; print(json.dumps({"pi": round(math.pi, 5), "ok": True}))'
# expected stdout:  {"pi": 3.14159, "ok": true}

Performance characterization (vs OrbStack)

Measured on Apple M4 / macOS 15.4.1 (Darwin 24.4.0). OrbStack 2.1.3
acts as the ground-truth aarch64-linux runtime: it executes the same
docker.io/library/python:3.12 image inside a Virtualization.framework-
backed Linux VM with a real Linux kernel, so the comparison isolates
the cost of elfuse's user-mode ABI emulation against a native syscall
surface.

Pure CPU (factorial big-int multiply, no syscall)

import sys, math, time
sys.set_int_max_str_digits(0)   # Python 3.12 default cap is 4300 digits
N = 200000
t = time.perf_counter()
f = math.factorial(N)
s = sum(int(d) for d in str(f))
print("fact(%d) digit_sum=%d digits=%d compute=%.3fs" %
      (N, s, len(str(f)), time.perf_counter() - t))

Each engine ran twice; the second is warm. compute is the
time.perf_counter() delta inside Python (pure interpreter +
big-int multiply work); real is the outer wall (includes engine
startup); startup ≈ real - compute.

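| Engine   | Run      | compute (s) | real (s) | startup (s) |
|----------|----------|-------------|----------|-------------|
| elfuse   | 1        | 0.791       | 3.72     | 2.93        |
| elfuse   | 2 (warm) | 0.804       | 3.35     | 2.55        |
| orbstack | 1        | 0.792       | 1.10     | 0.31        |
| orbstack | 2 (warm) | 0.796       | 0.97     | 0.17        |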

Both engines emit digit_sum=4154076 digits=973351 — correctness
parity confirmed. Pure compute ratio: 1.01× (within measurement
noise). HVF runs guest aarch64 instructions directly so big-int
multiply + Python bytecode dispatch pay zero translation overhead.
Startup ratio: 15.0× (constant ~2.5 s for elfuse vs ~0.17 s for
orbstack), independent of N: a separate run at N=50000 drops compute
to ~0.14 s on both engines while elfuse startup stays at 2.53 s.

Syscall density (Python loop hammering syscalls)

import os, time
N_BASE = 1_000_000
N_READ = 100_000

def time_loop(label, fn, n):
    fn(min(n // 100, 10_000))   # warm-up
    t = time.perf_counter()
    fn(n)
    return label, time.perf_counter() - t, n

def baseline(n):
    for _ in range(n): pass

def getppid(n):
    g = os.getppid
    for _ in range(n): g()

def clock_ns(n):
    g = time.monotonic_ns
    for _ in range(n): g()

def urandom_read(n):
    fd = os.open("/dev/urandom", os.O_RDONLY)
    try:
        rd = os.read
        for _ in range(n): rd(fd, 1)
    finally:
        os.close(fd)

results = [
    time_loop("baseline (pass)",              baseline,     N_BASE),
    time_loop("getppid",                      getppid,      N_BASE),
    time_loop("clock_gettime (monotonic_ns)", clock_ns,     N_BASE),
    time_loop("/dev/urandom 1B read",         urandom_read, N_READ),
]
base_per = results[0][1] / results[0][2]
for label, secs, n in results:
    per = secs / n
    overhead = (per - base_per) * 1e6 if label != "baseline (pass)" else 0.0
    print("%-38s total=%.3fs n=%d per=%.3fus  syscall_overhead=%.3fus" %
          (label, secs, n, per * 1e6, overhead))

syscall_overhead strips the Python loop interpreter cost (measured
from the baseline band) so the residual is the pure trap+return
cost of a single syscall.

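| Band                         | elfuse (μs/call) | orbstack (μs/call) | ratio |
|------------------------------|------------------|--------------------|-------|
| baseline (pass)              | 0.007            | 0.007              | 1.0×  |
| getppid                      | 0.960            | 0.091              | 10.5× |
| clock_gettime (monotonic_ns) | 1.006            | 0.018              | 55.9× |
| /dev/urandom 1B read         | 1.704            | 0.210              | 8.1×  |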

getppid is the cleanest measurement: no kernel work, just trap +
return. elfuse pays roughly 1 μs per syscall versus ~0.1 μs native.
Rough HVF round-trip breakdown: vCPU state sync ~200 ns, Linux→macOS
semantics ~100 ns, the macOS syscall itself ~100 ns, errno + sync
back ~100 ns, HVF re-entry + ERET ~500 ns. This ~1 μs round trip is
the structural floor for any elfuse syscall path that traps.

vDSO observation: time.monotonic_ns should hit the synthetic
vDSO under src/core/vdso.{c,h} and skip the trap (orbstack does, at
0.018 μs), but the measured 1.006 μs matches the trapping baseline.
elfuse's vDSO entry is not being picked up by glibc 2.41 in this
image. This is an existing optimization opportunity unrelated to the
scope of this PR; left untouched here so the patch series stays
focused on image-distribution and runtime correctness.

Wall-clock model

For a pure-CPU workload of compute time W:

elfuse_total   ≈ 2.5 s + W
orbstack_total ≈ 0.17 s + W

| W     | elfuse | orbstack | ratio | scenario     |
|-------|--------|----------|-------|--------------|
| 0.1 s | 2.6 s  | 0.27 s   | 9.6×  | CLI one-shot |
| 1 s   | 3.5 s  | 1.17 s   | 3.0×  | short script |
| 10 s  | 12.5 s | 10.17 s  | 1.23× | medium task  |
| 60 s  | 62.5 s | 60.17 s  | 1.04× | batch job    |

elfuse is competitive for long-running workloads (where the constant
startup amortizes out) and a known tradeoff for short CLI one-shots
where startup dominates total wall.
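
The model above can be reproduced with a few lines of Python. This is only a sketch: the two startup constants are the rounded figures from this description, and break_even is a hypothetical helper (not part of elfuse) that solves the model for the compute time at which a target ratio is reached.

```python
# Startup constants taken from the measurements quoted above (rounded).
ELFUSE_STARTUP_S = 2.5
ORBSTACK_STARTUP_S = 0.17


def wall_clock(w):
    """Return (elfuse_total, orbstack_total, ratio) for compute time w in seconds."""
    elfuse = ELFUSE_STARTUP_S + w
    orbstack = ORBSTACK_STARTUP_S + w
    return elfuse, orbstack, elfuse / orbstack


def break_even(ratio):
    """Compute time W at which elfuse_total / orbstack_total equals ratio (> 1)."""
    # Solve (2.5 + W) / (0.17 + W) = ratio for W.
    return (ELFUSE_STARTUP_S - ratio * ORBSTACK_STARTUP_S) / (ratio - 1.0)


for w in (0.1, 1, 10, 60):
    e, o, r = wall_clock(w)
    print(f"W={w:>5} s  elfuse={e:.2f} s  orbstack={o:.2f} s  ratio={r:.2f}x")
```

Under this model, break_even(1.1) gives the compute time beyond which elfuse stays within 10% of the OrbStack wall clock.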

Known limitations

  • fork() followed by execve() of a dynamically-linked ELF crashes
    in the child during dynamic-linker bring-up. This blocks Python's
    subprocess.run([...other_dynamic_binary...]), shell pipelines that
    spawn external binaries, and timeout(1). Single-process Python
    workloads, stdlib computation, and file I/O are unaffected.
  • Multi-arch image selection is hardcoded to linux/arm64. There is
    no --platform flag; cross-arch image support is out of scope for
    this PR.
  • pull progress uses CSI cursor-up + clear-line for in-place
    redraw. Terminal panes that ignore those escapes show stacking
    rows; set ELFUSE_OCI_PROGRESS=plain to disable the redraw and
    emit one summary line per blob instead.

Summary by cubic

Expands elfuse oci from pull/inspect into a full image lifecycle: run, unpack/clone on a case‑sensitive APFS volume, with parallel/resumable pulls, caching, GC, and status. Adds policy‑driven auth/TLS, richer inspect, and runtime wiring to execute images directly.

  • New Features

    • New CLI: oci run|unpack|clone|prune|rebuild-cache|status; pull gains --refresh and progress; inspect shows image runtime and layer‑reuse stats.
    • Unpack pipeline: tar reader + gzip/zstd decode, whiteout‑aware layer apply, per‑image sysroot on a sparse APFS volume; per‑run rootfs via clonefile(2).
    • Caches: raw per‑layer snapshots and ChainID stack snapshots; parallel blob fetch with HTTP Range resume; schema marker under layers/ (v2).
    • Runtime: inject /etc/{resolv.conf,hosts,hostname}; emulate /dev/{full,console}; add /proc cgroup/hostname/comm/statm; PATH resolver; runspec merge; shared VM launcher.
    • Auth/TLS policy: podman/skopeo‑style policy.json + registries.d overlay merged with CLI; Basic auth, custom CA, loopback‑gated --insecure.
    • Store ops: prune (blobs/layers/stacks) with --older-than/--keep-bytes; status summary; dedup metrics in inspect; writes oci-layout.
  • Migration

    • Pins moved to OCI index.json; store auto‑migrates from refs/ on open.
    • Layer cache marked schema v2; first open wipes legacy v1 entries (blobs/images untouched).
    • Vendored decode‑only zstd and cJSON included; relies on system zlib and libcurl.

Written for commit 700ac9d. Summary will update on new commits.

Max042004 added 7 commits May 15, 2026 23:00
Lays the first slice of Phase 1 from issue sysprog21#31: the elfuse oci
subcommand surface and a self-contained OCI image reference parser.
No registry, store, or unpack code lands here; this is the routing
and parsing scaffold that every later piece depends on.

src/main.c routes argv[1] == "oci" to oci_cli_main before the
Hypervisor.framework setup runs, so image distribution never has to
satisfy the host DC ZVA assertion or the HVF entitlement check. The
existing arg parser, --help, --version, --fork-child, and guest
execution paths are otherwise untouched.

src/oci/cli.c implements pull, inspect, prune, and list dispatch.
inspect parses a reference and prints the canonical form along with
the registry, repository, tag, and digest fields, which proves the
end-to-end wiring. The remaining subcommands return rc=2 with an
explicit "not implemented yet" message rather than crashing or
silently succeeding so users get a stable surface to script against.

src/oci/ref.c implements the de-facto containerd/docker reference
grammar:

  reference := name [":" tag] ["@" digest]
  name      := [domain "/"] path
  domain    := first slash component containing "." or ":" or
               equal to "localhost"
  path      := component ("/" component)*
  component := [a-z0-9]+ ((["._-"] | "__") [a-z0-9]+)*
  tag       := [A-Za-z0-9_] [A-Za-z0-9_.-]{0,127}
  digest    := ("sha256" | "sha512") ":" lowercase-hex

Defaults match Docker conventions: missing registry becomes
docker.io, single-segment paths under docker.io pick up the library/
prefix, and missing tag/digest defaults the tag to latest. A digest-
only reference leaves tag NULL so the canonical form does not
fabricate a tag the user never wrote. Digest hex is required to be
lowercase because the local content-addressable store will key off
the canonical digest string and uppercase encodings would otherwise
cause silent dedup misses.
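
To make the grammar and defaults concrete, here is a minimal Python sketch of the same rules. It is an illustration, not a port of src/oci/ref.c: parse_ref and its tuple return shape are hypothetical, and the length limits mentioned in the tests are checked only for tags and digest hex.

```python
import re

TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")
COMPONENT_RE = re.compile(r"[a-z0-9]+(?:(?:[._-]|__)[a-z0-9]+)*")
HEX_LEN = {"sha256": 64, "sha512": 128}


def looks_like_domain(first):
    return "." in first or ":" in first or first == "localhost"


def parse_ref(text):
    """Return (registry, repository, tag, digest); raise ValueError on bad input."""
    if not text:
        raise ValueError("empty reference")
    name, at, digest = text.partition("@")
    if at:
        algo, _, hexpart = digest.partition(":")
        # Lowercase hex only, so the store sees one canonical encoding.
        if HEX_LEN.get(algo) != len(hexpart) or not re.fullmatch(r"[0-9a-f]+", hexpart):
            raise ValueError("malformed digest")
    else:
        digest = None
    # A tag colon can only follow the rightmost slash; an earlier colon
    # belongs to a registry port such as localhost:5000.
    tag = None
    if name.rfind(":") > name.rfind("/"):
        name, _, tag = name.rpartition(":")
        if not TAG_RE.fullmatch(tag):
            raise ValueError("malformed tag")
    first, sep, rest = name.partition("/")
    if sep and looks_like_domain(first):
        registry, path = first, rest
    else:
        registry, path = "docker.io", name
    if not path or not all(COMPONENT_RE.fullmatch(c) for c in path.split("/")):
        raise ValueError("malformed repository path")
    if registry == "docker.io" and "/" not in path:
        path = "library/" + path
    if tag is None and digest is None:
        tag = "latest"  # digest-only references keep tag=None
    return registry, path, tag, digest
```

For example, parse_ref("python:3.12") yields ("docker.io", "library/python", "3.12", None), while a digest-only reference comes back with tag None.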

memrchr is GNU-only and Darwin libc does not ship it, so a small
memrchr_local helper handles the rightmost-slash search the tag
detector needs. The looks_like_domain helper compares localhost as
a 9-byte literal (the earlier draft had a length bug here that the
unit tests caught).

tests/test-oci-ref.c is a native macOS test program (not cross-
compiled, no Hypervisor.framework, no codesign) that links directly
against src/oci/ref.c. It runs 14 happy-path cases covering Docker
defaults, registry detection, port handling, sha256 and sha512
digests, tag+digest pinning, and every separator variant in the
component grammar, plus 20 error cases covering empty input, NULL
input, uppercase, malformed digests, double @, empty tag/digest
suffixes, length limits, and structural validation. All 34 cases
pass.

mk/config.mk adds tests/test-oci-ref.c to NATIVE_TESTS so the cross-
compile pattern rule does not pick it up. Makefile adds the link
rule for build/test-oci-ref (no codesign because there is no HVF
dependency). mk/tests.mk exposes test-oci-ref as a phony target and
runs it as the last stage of make check, alongside the existing
proctitle, busybox, sysroot, and timeout-disable validations.

Second slice of Phase 1 from issue sysprog21#31. Lands the on-disk storage
substrate that the upcoming registry client will spill manifests,
configs, and layers into. No HTTP, no unpack, no CLI surface yet;
this slice is intentionally a pure library plus offline unit tests so
the storage semantics can be audited without standing up a network.

src/oci/digest.{c,h} wraps CommonCrypto SHA-256 and SHA-512 in a
streaming digester so multi-gigabyte layers can be hashed without
buffering. Calls into CommonCrypto are clamped to 1 GiB chunks because
CC_LONG is 32-bit and OCI layers can legitimately exceed that. Hex
output is lowercase to match the reference parser (src/oci/ref.c);
the OCI image reference grammar already rejects uppercase digest hex,
so the entire pipeline -- parser, manifest fetcher, local store --
shares one canonical encoding and cannot silently miss a dedup match.
A separate one-shot helper, hex validator, and "<algo>:<hex>" parser
sit on top of the same streaming primitive.
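
The streaming shape described above can be sketched in Python with hashlib standing in for CommonCrypto. The Digester class and parse_digest function are hypothetical names for illustration; the 1 GiB clamp is kept only to mirror the C-side constraint, since hashlib itself does not need it.

```python
import hashlib
import re

CHUNK_CAP = 1 << 30  # mirrors the 1 GiB clamp; not required by hashlib
HEX_LEN = {"sha256": 64, "sha512": 128}


class Digester:
    """Streaming digester: feed bytes in any chunking, read lowercase hex."""

    def __init__(self, algo):
        if algo not in HEX_LEN:
            raise ValueError("unknown algorithm")
        self.algo = algo
        self._h = hashlib.new(algo)

    def update(self, data):
        if not data:  # None and zero-length updates leave the state untouched
            return
        view = memoryview(data)
        for off in range(0, len(view), CHUNK_CAP):
            self._h.update(view[off:off + CHUNK_CAP])

    def hexdigest(self):
        return self._h.hexdigest()  # hashlib emits lowercase hex


def parse_digest(text):
    """Split "<algo>:<hex>" into (algo, hex); reject anything non-canonical."""
    if not isinstance(text, str) or ":" not in text:
        raise ValueError("missing colon")
    algo, _, hexpart = text.partition(":")
    if algo not in HEX_LEN:
        raise ValueError("unknown algorithm")
    if len(hexpart) != HEX_LEN[algo] or not re.fullmatch(r"[0-9a-f]+", hexpart):
        raise ValueError("bad hex")
    return algo, hexpart
```

Feeding the same payload in 17-byte chunks or in one call must produce the same hex string, which is the property the chunked test vectors lock down.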

src/oci/blob-store.{c,h} is the content-addressable store. Layout
matches the OCI image-layout convention: <root>/blobs/<algo>/<hex>
for committed blobs plus <root>/tmp/blob-<pid>-<seq>-XXXXXX for the
in-flight staging file. mkstemp supplies global uniqueness; an
in-process counter is added to the template so failures of the rand
pool cannot defeat in-process disambiguation. The commit path hashes
streamed bytes, fsyncs the staging file, and uses link(2) rather than
rename(2) to publish the final inode. link returning EEXIST is the
dedup hit signal: two writers racing on the same digest both unlink
their staging files and report success, because the content is by
definition identical when the digest matched. Digest mismatch returns
-1 with errno EINVAL and unlinks the staging file, so an interrupted
or hostile pull never leaves a visible-complete blob behind. The
abort path takes the same cleanup. STORE_PATH_MAX is set comfortably
above PATH_MAX so snprintf truncation cannot silently corrupt a path;
callers passing smaller buffers still detect overflow via the return
value.
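
The commit discipline (stage, fsync, verify, link, always unlink the staging file) can be sketched as follows. This is a simplified Python illustration, not the C implementation: commit_blob is a hypothetical helper, it handles SHA-256 only, takes the whole payload in memory, and raises ValueError where the C code returns -1 with errno EINVAL.

```python
import hashlib
import os
import tempfile


def commit_blob(root, payload, expected_hex):
    """Publish payload under <root>/blobs/sha256/<hex>; return the final path."""
    blobs = os.path.join(root, "blobs", "sha256")
    tmpdir = os.path.join(root, "tmp")
    os.makedirs(blobs, exist_ok=True)
    os.makedirs(tmpdir, exist_ok=True)
    fd, staging = tempfile.mkstemp(prefix="blob-", dir=tmpdir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        if hashlib.sha256(payload).hexdigest() != expected_hex:
            raise ValueError("digest mismatch")
        final = os.path.join(blobs, expected_hex)
        try:
            os.link(staging, final)  # atomic publish; never overwrites
        except FileExistsError:
            pass  # dedup hit: identical content is already committed
        return final
    finally:
        os.unlink(staging)  # success, dedup, and failure all clean up
```

Because link(2) refuses to replace an existing name, a second writer with the same digest leaves the first inode in place, which is what the "dedup commit leaves the same inode" test checks.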

Per oci-roadmap.md Q1, the store will eventually sit on a
case-sensitive APFS sparse volume managed by elfuse, but the volume
bootstrap is its own later slice. For now the store API takes a plain
directory path; the same API survives the volume migration unchanged.

tests/test-oci-digest.c exercises 25 cases: NIST FIPS-180-4 vectors
(empty, "abc", 56-byte, one-million-'a') for both SHA-256 and
SHA-512, the same one-million-'a' streamed in 4 KiB and 17-byte
chunks to lock down the chunking loop, hex validator boundary cases,
and every "<algo>:<hex>" parse rejection (missing colon, unknown
algorithm, short hex, uppercase hex, NULL input). NULL and zero-
length updates must be safe and must not perturb the running state.

tests/test-oci-blob-store.c drives 14 cases inside an mkdtemp scratch
directory: layout creation, idempotent reopen, path() formatting,
one-shot put + has() round-trip, dedup commit leaves the same inode,
digest mismatch is rejected with EINVAL and tmp/ stays empty,
streaming writer over multiple chunks, abort leaves no leftover, and
close + reopen still sees the committed blob (issue sysprog21#31 DoD: "store
survives restart"). dir_is_empty / path_is_dir / path_is_file helpers
keep the assertions terse.

Makefile adds oci/digest.c and oci/blob-store.c to SRCS, plus the
two new native-test link rules. mk/config.mk extends NATIVE_TESTS so
the cross-compile pattern rule does not pick the new tests up.
mk/tests.mk exposes test-oci-digest and test-oci-blob-store as phony
targets and runs them as the final two stages of make check, beside
the existing test-oci-ref stage. All 39 (25 + 14) new assertions
pass; the rest of make check stays green (unit suite 81 passed / 0
failed, busybox, proctitle, procfs-exec, timeout-disable, OCI-ref
34/34).

Third slice of Phase 1 from issue sysprog21#31. Lands the JSON deserialization
substrate the upcoming registry client will run every fetched manifest,
index, and config blob through. No HTTP, no unpack, no CLI surface yet;
this slice is intentionally a pure offline library plus a 76-case unit
test driven by inline JSON fixtures so the parse contract is auditable
without standing up a network.

externals/cjson/ vendors cJSON v1.7.18 verbatim (MIT-licensed, single
.c/.h pair) per oci-roadmap.md Q9. No local modifications; future
security updates re-fetch via the three curl commands in
externals/cjson/VENDORING.md. .gitignore switches from ignoring all of
externals/ to ignoring externals/* with an explicit !externals/cjson/
exception so the vendored tree stays tracked while the downloaded test
fixtures stay out of git. The Makefile compiles cJSON with the same
project CFLAGS the rest of the codebase uses; cJSON happens to be clean
under -Wall -Wextra -Wpedantic on this version, so no per-file warning
override is required.

src/oci/media-type.{c,h} is the canonical enum + table for every OCI
and Docker media type the manifest/index/config/layer code branches
on. Foreign (nondistributable) layers are recognized and distinguishable
so the parser can name the actual offending layer type instead of
collapsing them to a generic "unknown", but the supported-layer
predicate excludes them per oci-roadmap.md Q3 (elfuse cannot fetch the
out-of-band payload they reference). The parser strips charset/boundary
parameters and surrounding whitespace before lookup so the registry's
Content-Type header value canonicalizes the same way the manifest's
mediaType JSON field does.
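
A small Python sketch of the canonicalization and of the recognized-versus-supported split follows. The function names are hypothetical and the table lists only a handful of layer types for illustration; the real table in src/oci/media-type.c covers more.

```python
# Layer media types mapped to whether they count as "supported".
# Foreign (nondistributable) layers are recognized but not supported.
LAYER_TYPES = {
    "application/vnd.oci.image.layer.v1.tar+gzip": True,
    "application/vnd.oci.image.layer.v1.tar+zstd": True,
    "application/vnd.docker.image.rootfs.diff.tar.gzip": True,
    "application/vnd.oci.image.layer.nondistributable.v1.tar+gzip": False,
    "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip": False,
}


def canonical_media_type(value):
    """Drop ";charset=..." style parameters and surrounding whitespace."""
    if value is None:
        return None
    base = value.split(";", 1)[0].strip()
    return base or None


def is_recognized_layer(value):
    return canonical_media_type(value) in LAYER_TYPES


def is_supported_layer(value):
    return LAYER_TYPES.get(canonical_media_type(value), False)
```

With this split, an error path can report the exact foreign layer type it saw while still refusing to fetch it.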

src/oci/manifest.{c,h} parses image manifests, image indexes, and image
configs against schemaVersion 2. Every descriptor digest is validated
through oci_digest_parse so a parsed oci_descriptor_t carries both the
original "<algo>:<hex>" string and a populated (algo, hex[]) pair the
blob store from slice 2 can consume directly. Size fields go through a
fractional-part / negative / round-trip-precision check because cJSON
returns numbers in a double; the parser rejects sizes beyond 2**53 - 1
where IEEE 754 precision starts dropping integers and rejects fractional
sizes that would otherwise truncate silently to a near-but-wrong
integer. Manifest config descriptors are required to carry a config
media type, layer descriptors must carry a layer media type, and
foreign layers are rejected with a precise error. Image configs require
rootfs.type == "layers" (the only value the OCI image-spec defines) and
validate every rootfs.diff_ids entry as a lowercase digest. Platform
fields default empty variant / os.version strings to "" rather than NULL
so the selector can use unconditional strcmp.
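
The size check can be illustrated in Python, where JSON numbers arriving as floats raise the same concerns. checked_size is a hypothetical helper, sketched under the assumption that the input is whatever number the JSON layer produced.

```python
import math

MAX_SAFE_SIZE = 2**53 - 1  # largest integer a double represents without gaps


def checked_size(number):
    """Convert a JSON number to an exact non-negative int or raise ValueError."""
    x = float(number)
    if not math.isfinite(x):
        raise ValueError("size is not finite")
    if x < 0:
        raise ValueError("negative size")
    if x != math.floor(x):
        raise ValueError("fractional size")
    if x > MAX_SAFE_SIZE:
        raise ValueError("size exceeds 2**53 - 1")
    return int(x)
```

Rejecting 1.5 outright, rather than truncating it to 1, avoids the near-but-wrong integer the paragraph above warns about.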

oci_index_pick_linux_arm64 prefers variant "v8", then empty variant,
then any other arm64 variant. It also skips entries whose manifest
media type is not recognized -- even when the platform matches, the
registry-fetch path cannot consume the resulting manifest, so picking
such an entry would only defer a failure.
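
The selection order can be sketched as a ranked filter. This is a Python illustration with a hypothetical flattened entry shape (plain dicts with os, architecture, variant, and mediaType keys); the C function works on parsed descriptors instead.

```python
KNOWN_MANIFEST_TYPES = {
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
}


def pick_linux_arm64(entries):
    """Return the preferred linux/arm64 entry, or None if nothing qualifies."""

    def rank(entry):
        variant = entry.get("variant", "")
        if variant == "v8":
            return 0
        if variant == "":
            return 1
        return 2  # any other arm64 variant

    candidates = [
        e for e in entries
        if e.get("os") == "linux"
        and e.get("architecture") == "arm64"
        # Skip manifests the fetch path could not consume anyway.
        and e.get("mediaType") in KNOWN_MANIFEST_TYPES
    ]
    return min(candidates, key=rank, default=None)
```

Filtering on media type before ranking means an unrecognized v8 entry never shadows a usable lower-ranked one.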

tests/test-oci-manifest.c exercises 76 cases inline: every recognized
media type lookup, charset/whitespace stripping, NULL and bogus
strings, every predicate, both compression results; OCI and Docker
happy-path manifest parses with two-layer gzip + zstd mix; the eight
manifest rejection paths (malformed JSON, schemaVersion != 2, missing
config, uppercase digest, negative size, fractional size, foreign
layer, non-config media type on the config descriptor); the five
index paths (multi-arch v8 wins; no-v8 picks empty variant over v7;
no linux/arm64 returns NULL; Docker manifest list; unknown manifest
mediaType is recorded but the selector skips it); and the four image
config paths (happy with User/Env/Entrypoint/Cmd/WorkingDir/diff_ids;
missing rootfs; non-layers rootfs.type; malformed diff_id).

Makefile / mk/config.mk / mk/tests.mk wire the new translation units
into elfuse's link line, add oci/media-type.o + oci/manifest.o + the
vendored cJSON object, register tests/test-oci-manifest.c in
NATIVE_TESTS so the cross-compile pattern rule does not pick it up,
and run the new test as the final stage of make check beside the
existing test-oci-ref / test-oci-digest / test-oci-blob-store stages.
All 76 new assertions pass; the rest of make check stays green
(unit suite 81 passed / 0 failed / 3 skipped, busybox, proctitle,
procfs-exec, timeout-disable, OCI-ref 34/34, OCI-digest 25/25,
OCI-blob-store 14/14).

elfuse oci pull / prune / list still return rc=2; wiring the parser
into the CLI is gated on slice 4 (HTTPS + token challenge + blob
fetch). The parsers exist now so that work can land without also
adding deserialization.

Fourth slice of Phase 1 from issue sysprog21#31, split into 4a here. Lands the
HTTP fetch substrate that connects the slice-3 manifest parsers to a
real registry and streams blob bodies into the slice-2 content-addressed
store, all behind a single fetcher handle. No CLI wiring yet (elfuse oci
pull still returns rc=2); slice 5 connects the pull command to this
layer, persists the manifest graph, and pins the resolved tag-to-digest.

Slice 4 was cut into 4a / 4b per oci-roadmap.md Q7 so each slice stays
under the ~800 LOC review budget. 4a covers the anonymous Docker Hub /
GHCR public-pull subset: anonymous GET, 401 + Www-Authenticate Bearer
challenge, token fetch, retry, blob streaming with declared-size cap
and on-commit digest verification. 4b will add basic auth, --insecure-ca
custom CA, and --insecure loopback-gated TLS verify off.

src/oci/fetch.{c,h} wraps libcurl. A fetcher owns one CURL easy handle,
one cached bearer token, and the most recent Www-Authenticate challenge.
The first request is anonymous. If the registry replies 401, the header
parser captures realm / service / scope, fetch_token GETs the realm with
those parameters, the JSON response is parsed with cJSON, and the
original request is retried once with Authorization: Bearer <token>.
The cached token is reused for subsequent calls on the same fetcher so
a manifest plus N layer pulls cost one token round trip rather than
N+1. docker.io is rewritten to registry-1.docker.io because the
reference parser stores the canonical name while the actual API host
differs.
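
The challenge handling can be sketched in Python as a header parse plus a token URL builder. The function names here are hypothetical, and the example header in the usage note follows the shape Docker Hub commonly returns; the real fetcher does this in C around libcurl.

```python
import re
from urllib.parse import urlencode

_PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_bearer_challenge(header):
    """Parse a Www-Authenticate value; return a dict of params or None."""
    scheme, _, rest = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        return None
    return dict(_PARAM_RE.findall(rest))


def token_url(challenge):
    """Build the token endpoint URL from realm plus service/scope."""
    query = {k: challenge[k] for k in ("service", "scope") if k in challenge}
    return challenge["realm"] + ("?" + urlencode(query) if query else "")


def api_host(registry):
    """Map the canonical registry name to the host that serves the API."""
    return "registry-1.docker.io" if registry == "docker.io" else registry
```

For a header such as Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/alpine:pull", token_url returns the realm with service and scope percent-encoded in the query string.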

The blob path is content-addressed end to end. oci_fetch_blob short
circuits when the descriptor is already present in the store; otherwise
it opens an oci_blob_writer keyed by the descriptor digest, streams
response body chunks through the writer, and tracks a running byte
count capped at the descriptor's declared size so a hostile server
cannot stream forever. The writer's own digest check at commit time
rejects any payload that hashes to anything other than the descriptor
hex. Size mismatch, digest mismatch, transport error, and non-2xx all
unwind via oci_blob_writer_abort so an interrupted pull never leaves a
visible-complete blob behind. CURLOPT_FOLLOWLOCATION is enabled so the
common case where a registry 307s blob fetches to S3 / Cloudfront with
a pre-signed URL works transparently; libcurl strips the Authorization
header on cross-host redirects, which is exactly what the storage
backend expects.
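
The size cap and commit-time digest check can be illustrated with an in-memory stand-in for the blob writer. CappedBlobSink is a hypothetical class for this sketch only; the real writer streams to a staging file and aborts by unlinking it.

```python
import hashlib


class CappedBlobSink:
    """Accumulate streamed chunks, enforcing declared size and SHA-256 digest."""

    def __init__(self, declared_size, expected_hex):
        self.declared_size = declared_size
        self.expected_hex = expected_hex
        self.received = 0
        self._hash = hashlib.sha256()
        self._chunks = []

    def write(self, chunk):
        self.received += len(chunk)
        if self.received > self.declared_size:
            # Stop as soon as the server exceeds what the descriptor declared.
            raise ValueError("response exceeds declared size")
        self._hash.update(chunk)
        self._chunks.append(chunk)

    def commit(self):
        if self.received != self.declared_size:
            raise ValueError("short response")
        if self._hash.hexdigest() != self.expected_hex:
            raise ValueError("digest mismatch")
        return b"".join(self._chunks)
```

The cap fires during streaming, so an oversized reply is rejected without reading it to the end, while a correctly sized but wrong payload is caught only at commit.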

The header parser keys on Content-Type, Docker-Content-Digest, and
Www-Authenticate. Content-Type is stripped of charset/parameters before
the manifest parser sees it so the canonicalization matches the
mediaType field inside the JSON body. Docker-Content-Digest is captured
verbatim so the upcoming tag-to-digest pinning in slice 5 can record
the registry's resolved digest without recomputing.

Response body accumulation has a 16 MiB ceiling (FETCH_BODY_MAX) so an
unbounded reply cannot fill memory; real manifests, indexes, and image
configs are orders of magnitude below this. Blob responses bypass the
buffer entirely and stream straight through the writer.

tests/test-oci-fetch.c spawns an in-process HTTP/1.1 mock server bound
to 127.0.0.1 on an ephemeral port and drives the fetcher against
scripted handlers. Nine offline cases exercise anonymous manifest GET
(body, Content-Type stripping, Docker-Content-Digest capture); manifest
404 surfaces with the right status; bearer challenge runs the full
401 then token then retry sequence and inspects the request log to
verify the second hop hits /token and the third carries the Bearer
header; cached token reuse on a second fetch confirms no re-challenge
round trip; blob success commits a known-good payload to the store;
already-cached blob short-circuits with zero server requests; oversize
response is rejected and leaves no visible blob; digest mismatch on a
correctly-sized payload is rejected at commit; blob 404 fails cleanly.
An opt-in tenth case behind OCI_FETCH_ONLINE=1 pulls alpine:3.20 from
Docker Hub through the real bearer flow as a smoke test; it is wired
as make test-oci-fetch-online and is not part of make check.

Makefile adds src/oci/fetch.c to SRCS and -lcurl to HVF_LDFLAGS so the
production elfuse binary links libcurl from the macOS SDK (no vendoring
per oci-roadmap.md Q7 and Q9). build/test-oci-fetch links libcurl plus
pthread for the mock server. mk/config.mk registers the test source in
NATIVE_TESTS so the cross-compile pattern rule does not try to
aarch64-compile it. mk/tests.mk adds test-oci-fetch as the final stage
of make check and exposes test-oci-fetch-online as a separate target.

make check stays green: 78 unit tests, busybox 81/0/3, proctitle,
procfs-exec, timeout-disable, OCI-ref 34/34, OCI-digest 25/25,
OCI-blob-store 14/14, OCI-manifest 76/76, OCI-fetch 9/9.

…ecure)

Fourth slice of Phase 1 from issue sysprog21#31, 4b half. Closes out the
oci-roadmap.md Q7 ship list by extending the slice-4a fetcher with
HTTP Basic authentication, custom CA bundle, and a loopback-gated
TLS verify-off path. fetch_manifest / fetch_blob signatures are
unchanged; everything new lives in oci_fetcher_options_t and a new
per-easy-handle helper.

src/oci/fetch.h grows four fields on oci_fetcher_options_t: username,
password, ca_file, allow_insecure. oci_fetcher_new now stashes
username/password as a pre-joined "user:pass" string (CURLOPT_USERPWD
takes the joined form), strdup's ca_file, and records allow_insecure
verbatim. apply_security_opts() is called from every GET callsite
(perform_manifest_get, perform_blob_get, fetch_token) right after
curl_easy_reset, which attaches CURLOPT_USERPWD plus
CURLAUTH_BASIC, CURLOPT_CAINFO, and CURLOPT_SSL_VERIFY{PEER,HOST}=0
when each is set. This shape gives the token endpoint the basic
credentials too: a registry that bridges Basic for the token
exchange and Bearer for the data API sees both. libcurl drops the
USERPWD-derived Authorization header in favor of the manually
appended Authorization: Bearer on the retry, so basic gives way to
bearer once a token is in hand.

The loopback policy gate runs at the entry of oci_fetch_manifest
and oci_fetch_blob, not in oci_fetcher_new: ref is not available at
construction time, and policy is about which host the fetcher is
actually about to talk to. extract_host_from_registry strips the
optional :port (and the [] of bracketed IPv6 literals) from
ref->registry, is_loopback_host case-insensitively matches against
127.0.0.1 / localhost / ::1, and check_insecure_policy combines
them so a non-loopback target with allow_insecure=true returns -1
with errno=EPERM before a single byte is sent. The policy reads
ref->registry rather than the test-only base_url_override so unit
tests can drive a non-loopback ref while still pointing the mock
URL at 127.0.0.1, and the production surface (no override) gets
the same answer it would in deployment.
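
The gate can be sketched in Python as three small functions. The names follow the helpers mentioned above, but the bodies are an illustration under assumptions (for instance, how an unbracketed IPv6 literal is treated), not a transcription of the C code.

```python
import errno

LOOPBACK_HOSTS = ("127.0.0.1", "localhost", "::1")


def extract_host_from_registry(registry):
    """Strip an optional :port and the brackets of an IPv6 literal."""
    if registry.startswith("["):
        end = registry.find("]")
        return registry[1:end] if end != -1 else registry
    if registry.count(":") > 1:
        return registry  # assumed: unbracketed IPv6 literal, no port
    host, sep, port = registry.rpartition(":")
    if sep and port.isdigit():
        return host
    return registry


def is_loopback_host(host):
    return host.lower() in LOOPBACK_HOSTS


def check_insecure_policy(registry, allow_insecure):
    """Raise PermissionError (EPERM) before any request if the gate fails."""
    if allow_insecure and not is_loopback_host(extract_host_from_registry(registry)):
        raise PermissionError(errno.EPERM, "insecure TLS is loopback-only")
```

Running the check on the registry name rather than on the URL actually dialed is what lets a test point a non-loopback reference at a local mock and still see the rejection.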

tests/test-oci-fetch.c upgrades the in-process mock from plain HTTP
to TLS. The mock generates an ephemeral RSA-2048 keypair and a
self-signed certificate at startup via OpenSSL EVP, signed for
CN=127.0.0.1 with SAN IP:127.0.0.1 + DNS:localhost, valid for one
day. The certificate PEM is written into the scratch directory and
the fetcher receives the path through opts.ca_file. accept loop
wraps each connection in SSL_accept; read/write go through a small
io_t abstraction so handler signatures change only in the IO
parameter type. mock_send_full keeps the same response shape but
writes through SSL_write.

libcurl's SSL backend is forced to OpenSSL (LibreSSL on macOS) via
curl_global_sslset() called before any other libcurl entry. macOS
system libcurl is a multi-SSL build that defaults to Secure
Transport, and Secure Transport ignores CURLOPT_CAINFO. Without
this pin the ca_file negative cases would pass for the wrong
reason: the handshake would succeed against the keychain, not the
supplied PEM. LibreSSL on macOS still finds the system trust roots
for the OCI_FETCH_ONLINE=1 case, so the online docker.io smoke
test continues to work.

mk/toolchain.mk auto-detects OPENSSL_PREFIX from
/opt/homebrew/opt/openssl@3 (Apple Silicon) or
/usr/local/opt/openssl@3 (Intel) and exposes OPENSSL_CFLAGS /
OPENSSL_LDFLAGS. The Makefile attaches them only to
build/test-oci-fetch (target-specific CFLAGS plus link flags), so
the production elfuse binary still has no OpenSSL dependency: the
new TLS plumbing is testing scaffolding, not runtime code.

Test count grows from 9 to 15 cases. New cases: basic auth success
(verifies the server saw "Basic YWxpY2U6c2VjcmV0" exactly once);
basic auth carried into the token endpoint (verifies the token GET
saw the same basic credentials and the manifest retry switched to
Bearer); insecure on a loopback registry is allowed (HTTPS request
goes through despite no ca_file); insecure on a non-loopback
registry is rejected with errno=EPERM and zero bytes leak to the
mock server (request log stays empty); ca_file unset against the
self-signed mock fails the handshake with http_status=0; ca_file
pointing at an unrelated self-signed certificate also fails the
handshake. The 9 existing cases continue to pass over TLS by
supplying the mock's CA PEM as ca_file.

make check stays green: 78 unit tests, busybox 81/0/3, proctitle,
procfs-exec, timeout-disable, OCI-ref 34/34, OCI-digest 25/25,
OCI-blob-store 14/14, OCI-manifest 76/76, OCI-fetch 15/15.
make test-oci-fetch-online (opt-in) also passes.

Slice 5a of Phase 1 from issue sysprog21#31. Wires the slice 4a/4b fetcher and
the slice 3 manifest parser into the elfuse oci pull command and
persists the resolved blob graph on disk. inspect still renders only
the canonical reference; the offline manifest-tree renderer ships in
slice 5b.

src/oci/store.{c,h} wraps the slice-2 content-addressable blob store
with a tag-to-digest pin table. On-disk layout under <root>:

  blobs/<algo>/<hex>                       (immutable, from slice 2)
  tmp/blob-<pid>-<seq>-XXXXXX              (in-flight staging)
  refs/<registry>/<repository>/<tag>       (pin file, one line: <algo>:<hex>)

oci_store_open creates the refs/ subtree, then opens a blob store
rooted at the same path so the two layers share one directory.
oci_store_put_ref refuses digest-only refs (their digest is the pin,
no file needed), validates the supplied digest string with
oci_digest_parse, mkdir -p's the registry/repository prefix on demand,
writes <digest>\n into a tmp file alongside the final path, fsyncs,
and renames into place. Rename rather than link because tag pins are
mutable: pulling alpine:3.20 today may resolve to a different digest
than yesterday and overwriting the pin is the correct semantic. The
blob layer keeps its link(2) discipline because content-addressed
blobs stay immutable.
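
The write-then-rename discipline described above can be sketched as follows. This is a minimal illustration, not the code in src/oci/store.c: the helper name write_pin_atomic and the temp-file naming are assumptions, and the sketch does not fsync the parent directory.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: atomically publish "<digest>\n" at `path`.
 * A sibling temp file is written and fsynced, then rename(2)d over the
 * destination, so a reader sees either the old pin or the new one. */
static int write_pin_atomic(const char *path, const char *digest)
{
    assert(path != NULL && digest != NULL);

    char tmp[1024];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp-%ld", path, (long) getpid());
    if (n < 0 || (size_t) n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(digest);
    int ok = write(fd, digest, len) == (ssize_t) len &&
             write(fd, "\n", 1) == 1 && fsync(fd) == 0;
    int saved = errno; /* only meaningful when ok == 0 */
    if (close(fd) < 0 && ok) {
        ok = 0;
        saved = errno;
    }
    /* rename replaces an existing pin, which is what a mutable tag needs. */
    if (ok && rename(tmp, path) < 0) {
        ok = 0;
        saved = errno;
    }
    if (!ok) {
        unlink(tmp);
        errno = saved;
        return -1;
    }
    return 0;
}
```

Calling the helper twice on the same path with different digests leaves only the second digest in the file, which matches the overwrite semantic for tag pins.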

oci_store_get_ref reads the pin file, strips the trailing newline,
validates the digest via oci_digest_parse, and returns a heap-
allocated copy. Miss reports errno=ENOENT so callers can distinguish
"never pulled" from "io error reading pin".

oci_store_default_root returns the platform default: $XDG_DATA_HOME/
elfuse/store when set, otherwise $HOME/Library/Application Support/
elfuse/store. Phase 2 will mount a sparse case-sensitive APFS volume
at the same path (oci-roadmap.md Q1); the API does not change.
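
A rough sketch of the default-root lookup, assuming a caller-supplied buffer. The function name store_default_root and the choice to treat an empty variable as unset are assumptions for illustration; the real oci_store_default_root signature is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for oci_store_default_root: fills `out` with
 * $XDG_DATA_HOME/elfuse/store, else
 * $HOME/Library/Application Support/elfuse/store.
 * Returns 0 on success, -1 if neither variable is usable or the
 * buffer is too small. */
static int store_default_root(char *out, size_t cap)
{
    assert(out != NULL && cap > 0);

    const char *xdg = getenv("XDG_DATA_HOME");
    const char *home = getenv("HOME");
    int n;

    if (xdg && *xdg)
        n = snprintf(out, cap, "%s/elfuse/store", xdg);
    else if (home && *home)
        n = snprintf(out, cap,
                     "%s/Library/Application Support/elfuse/store", home);
    else
        return -1;

    return (n < 0 || (size_t) n >= cap) ? -1 : 0;
}
```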

src/oci/pull.{c,h} implements the pipeline. oci_pull runs six phases
linearly:

  1. Fetch the top-level manifest by ref->digest or ref->tag,
     advertising Accept for both OCI and Docker index + manifest types.
  2. Hash the body with SHA-256 and cross-check against the
     Docker-Content-Digest header when the registry sent one. Body /
     header mismatch is a hostile-registry signal and aborts before
     anything else writes to the store. When the user pulled by digest,
     also cross-check the body digest against ref->digest.
  3. Persist the manifest body into blob store at sha256:<computed-hex>.
  4. If the top-level was an image index, parse it, run
     oci_index_pick_linux_arm64, fetch the sub-manifest by its
     descriptor digest with expected-digest verification, persist it,
     and switch to the sub-manifest body for the next phase. The pin
     digest stays at the top-level (index) digest so that the next
     inspect / pull by tag re-walks index then manifest.
  5. Parse the manifest, fetch the config blob, fetch each layer blob
     in manifest order via oci_fetch_blob. Each blob fetch short-
     circuits when oci_blob_store_has reports a hit, so a re-pull
     issues zero layer downloads (only the two manifest bodies are
     re-fetched in the index case; manifest caching is its own future
     slice).
  6. Write the tag-to-manifest-digest pin via oci_store_put_ref. Skip
     for digest-only refs (no tag to pin).
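
The cross-check in phase 2 can be pictured with a small sketch. It assumes the body digest has already been computed and formatted as "sha256:<hex>"; the function name check_manifest_digest is hypothetical, and reporting EPROTO for a ref->digest mismatch is an assumption (the source only states EPROTO for the header mismatch).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare the digest computed from the manifest
 * body against the optional Docker-Content-Digest header value and the
 * optional digest the user pulled by. NULL means "not supplied".
 * Returns 0 when everything supplied agrees, -1 with errno=EPROTO
 * otherwise, so the caller can abort before writing to the store. */
static int check_manifest_digest(const char *computed,
                                 const char *header_digest,
                                 const char *ref_digest)
{
    assert(computed != NULL);

    if (header_digest && strcmp(computed, header_digest) != 0) {
        errno = EPROTO;
        return -1;
    }
    if (ref_digest && strcmp(computed, ref_digest) != 0) {
        errno = EPROTO; /* assumption: same errno as the header case */
        return -1;
    }
    return 0;
}
```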

Schema v1 manifests and foreign / nondistributable layers are rejected
by oci_manifest_parse from slice 3; oci_pull surfaces those
diagnostics and aborts before any partial layer hits the store. errno
is preserved across the cleanup goto so callers can key tests off
EPROTO / ENOENT / EINVAL without seeing a value clobbered by free() or
other cleanup calls.
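
The errno-preserving cleanup shape looks roughly like this. All names here are hypothetical, and the cleanup helper deliberately resets errno to stand in for any cleanup call that might overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Stand-in for a fallible step that reports its failure through errno. */
static int parse_step(int fail)
{
    if (fail) {
        errno = EPROTO;
        return -1;
    }
    return 0;
}

/* Stand-in for cleanup that clobbers errno as a side effect. */
static void cleanup(char *buf)
{
    free(buf);
    errno = 0;
}

static int do_work(int fail)
{
    int rc = -1;
    int saved = 0;
    char *buf = malloc(64);
    if (!buf)
        return -1;

    if (parse_step(fail) < 0) {
        saved = errno; /* capture before any cleanup runs */
        goto out;
    }
    rc = 0;

out:
    cleanup(buf);
    if (rc < 0)
        errno = saved; /* restore the diagnostic the caller keys on */
    return rc;
}
```

With this shape, a caller that sees do_work return -1 can still observe EPROTO even though the cleanup path touched errno.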

Progress output is one line per descriptor with a truncated digest,
size, state (downloaded vs cached), and media-type name. -q / --quiet
silences it. The full hex still goes into the pin file and the blob
store for verification.

src/oci/cli.c grows pull argument parsing: --store DIR, -u | --user
USER[:PASS], --insecure-ca PEM, --insecure, -q | --quiet, plus the
positional reference. Defaults come from oci_store_default_root.
split_userpass handles "user", "user:", and "user:pass" forms with one
dynamically-allocated buffer that the cleanup path frees. inspect, prune,
and list keep their slice-1 behaviour for now.
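
One way the three -u forms could be split with a single allocation is sketched below. The signature is an assumption, as is the mapping of "user" to a NULL password and "user:" to an empty password; the actual split_userpass in cli.c may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of split_userpass. On success *buf owns one heap
 * copy of the argument; *user and *pass point into it, so a single
 * free(*buf) in the cleanup path releases both.
 *   "user"      -> user="user", pass=NULL
 *   "user:"     -> user="user", pass=""
 *   "user:pass" -> user="user", pass="pass" */
static int split_userpass(const char *arg, char **buf,
                          const char **user, const char **pass)
{
    assert(arg && buf && user && pass);

    size_t len = strlen(arg);
    char *copy = malloc(len + 1);
    if (!copy)
        return -1;
    memcpy(copy, arg, len + 1);

    char *colon = strchr(copy, ':'); /* split on the first colon only */
    if (colon) {
        *colon = '\0';
        *pass = colon + 1;
    } else {
        *pass = NULL;
    }
    *user = copy;
    *buf = copy;
    return 0;
}
```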

tests/lib/oci-mock.{c,h} extracts the TLS-terminated HTTP/1.1 mock
server from test-oci-fetch.c. The accept loop, ephemeral self-signed
RSA-2048 + SAN cert generator, header parser, request log, and
mock_send_full response helper all move out so both the fetch and the
pull suites share one ~400 LOC implementation. Public symbols gain an
oci_mock_ prefix to make the helper boundary explicit. Three small
helpers (wipe_dir, scratch_root, base_url) tag along because both
suites need them. test-oci-fetch.c shrinks by 380 lines, switches to
the new header, and keeps its 15/15 passing.

tests/test-oci-store.c covers 9 cases: layout creation, put + get
round trip, miss returns ENOENT with out_digest=NULL, digest-only ref
is rejected with EINVAL (its digest is the pin), malformed digest
string is rejected with EINVAL, deep repository slashes get mkdir -p,
pin overwrite replaces the file, blob and pin share the same root,
and default_root respects XDG_DATA_HOME / falls back to HOME.

tests/test-oci-pull.c covers 6 end-to-end cases against the mock. The
test builds a synthetic image at runtime: three layer byte strings,
one image config JSON referencing the layer digests, one manifest JSON
referencing the config + layer digests, one index JSON referencing the
manifest digest. All five digests are real SHA-256 of the actual bytes
the mock serves, so the cross-check inside oci_pull exercises a real
verification path. The cases are: tag resolves to index resolves to
arm64 sub-manifest with config + 3 layers stored and pin written;
tag resolves directly to manifest (no index) with pin written; digest-
only ref pulls but no pin is written (and get_ref returns EINVAL);
re-pull short-circuits layer + config downloads (second pull issues
exactly 2 requests: index + sub-manifest); body / Docker-Content-Digest
mismatch aborts with EPROTO and no pin written; index without
linux/arm64 entry aborts with ENOENT.

Makefile / mk/config.mk / mk/tests.mk wire the new translation units:
oci/store.o and oci/pull.o join SRCS; test-oci-store.c and
test-oci-pull.c land in NATIVE_TESTS so the cross-compile rule skips
them; new link rules build test-oci-store and test-oci-pull; tests/lib/
oci-mock.o is a separate object linked into both test-oci-fetch and
test-oci-pull with OPENSSL_CFLAGS applied; make check gains two new
stages running test-oci-store and test-oci-pull after the existing OCI
suites.

make check stays fully green: 78 unit tests; busybox 81/0/3;
proctitle low-stack; procfs-exec; timeout-disable; OCI-ref 34/34;
OCI-digest 25/25; OCI-blob-store 14/14; OCI-manifest 76/76;
OCI-fetch 15/15; OCI-store 9/9; OCI-pull 6/6. make
test-oci-fetch-online (opt-in) still passes.

Slice 5b of Phase 1 from issue sysprog21#31. Closes out Phase 1 by giving
elfuse oci inspect an actual function beyond the slice-1 canonical-ref
print: it reads the local store the slice 5a pull pipeline populated
and renders the manifest graph without touching the network. Phase 2
follows: sparse APFS volume bootstrap, layer unpack with whiteouts,
clonefile copy-up.

src/oci/inspect.{c,h} owns the offline renderer. oci_inspect resolves
the manifest digest in three steps:

  1. ref->digest when set (digest-pinned reference)
  2. pin file <root>/refs/<registry>/<repository>/<tag> when ref->tag
     is set
  3. Neither: print "(no local manifest; run 'elfuse oci pull' first)"
     on stdout and return 0. This preserves the slice-1 inspect smoke
     output shape for refs that were never pulled.

The pinned digest goes through oci_digest_parse to reject corrupt pin
files, then read_blob_file slurps <root>/blobs/<algo>/<hex> into a
heap buffer. read_blob_file caps the read at 64 MiB (real manifests
are well under 1 MiB; the cap prevents a corrupted store from forcing
a pathological malloc) and reports errno=ENOENT when the blob file
is absent.
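
A capped slurp along these lines might look like the sketch below. The name read_capped and the EFBIG errno for an oversized file are assumptions; only the 64 MiB cap and the ENOENT behaviour come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOB_READ_CAP ((size_t) 64 << 20) /* 64 MiB, per the text above */

/* Hypothetical sketch: read a whole file into a heap buffer, refusing
 * to grow past `cap` bytes. A missing file surfaces fopen's ENOENT.
 * The buffer is not NUL-terminated; the length goes to *out_len. */
static char *read_capped(const char *path, size_t cap, size_t *out_len)
{
    assert(path != NULL && out_len != NULL);

    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    size_t used = 0, alloc = 4096;
    char *buf = malloc(alloc);
    int err = buf ? 0 : ENOMEM;

    while (!err) {
        size_t n = fread(buf + used, 1, alloc - used, f);
        used += n;
        if (used > cap) {
            err = EFBIG; /* assumption: errno choice for "too large" */
        } else if (n == 0) {
            if (ferror(f))
                err = EIO;
            break; /* EOF or read error */
        } else if (used == alloc) {
            char *grown = realloc(buf, alloc * 2);
            if (!grown) {
                err = ENOMEM;
            } else {
                buf = grown;
                alloc *= 2;
            }
        }
    }

    fclose(f);
    if (err) {
        free(buf);
        errno = err;
        return NULL;
    }
    *out_len = used;
    return buf;
}
```

A caller would pass BLOB_READ_CAP as the cap for manifest blobs.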

Classification between index and manifest is structural: the slice-3
parsers reject disjoint shapes (oci_index_parse requires a manifests
array; oci_manifest_parse requires config + layers), so trying index
first and falling back to manifest is unambiguous. Image config blobs
never reach this path because pins point at manifest-shaped blobs.

Index rendering prints a platforms table. Default mode shows only the
picked linux/arm64 entry (tagged "[arm64]") and drills into the
sub-manifest blob to print its config descriptor + layer table. The
--all-platforms flag lists every platform entry and skips the drill;
the flag answers "what does this image cover", not "what is inside
the arm64 variant". Both decisions are documented inline at the
oci_inspect_options_t definition.

Failure mode for a partial store: index loads fine but the linux/arm64
sub-manifest blob is missing. The platform table still goes to stdout
(the user sees what is available), a warning lands on stderr, and the
call returns -1 with errno=ENOENT and err_msg = "indexed manifest
blob missing from local store". Scripts key on the exit code; humans
read the table. The errno is preserved across the cleanup goto in the
same shape slice-5a oci_pull adopted.

Digest formatting follows the slice-5a progress lines for visual
consistency: full digests appear in the pinned: line and in index
entry tagging (so users can copy / grep the exact value), and a 22-
column short form ("sha256:" + 12 hex + "...") appears in the layer
tables. short_digest takes a caller-supplied buffer so two short
digests in one printf do not clobber a shared static.
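
The short form is 7 + 12 + 3 = 22 characters for a sha256 digest. A sketch of a caller-buffer formatter, with an assumed signature, could be:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SHORT_DIGEST_BUF 23 /* 22 visible columns plus the NUL */

/* Hypothetical sketch of short_digest: writes "<algo>:" followed by the
 * first 12 hex characters and "..." into the caller's buffer and
 * returns that buffer, so two calls can feed one printf safely. */
static const char *short_digest(const char *digest, char *out, size_t cap)
{
    assert(digest != NULL && out != NULL && cap > 0);

    const char *colon = strchr(digest, ':');
    size_t algo = colon ? (size_t) (colon - digest) + 1 : 0;

    snprintf(out, cap, "%.*s%.12s...", (int) algo, digest, digest + algo);
    return out;
}
```

Because each call writes to its own buffer, `printf("%s %s\n", short_digest(a, b1, sizeof(b1)), short_digest(b, b2, sizeof(b2)))` prints two distinct values, which a shared static buffer could not guarantee.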

src/oci/cli.c grows parse_inspect_args + a cmd_inspect rewrite. The
new flag set is --store DIR (override the platform default) and
--all-platforms (the flag described above); the canonical-ref header
print stays in cli.c so the slice-1 smoke output continues working
when the store has no record. After the header, cmd_inspect opens
the store and calls oci_inspect. rc 0 means success or pin miss; rc
1 means a real failure (malformed blob, blob missing, IO).

tests/test-oci-inspect.c drives 6 cases against a pre-populated
scratch store. The store is built directly with oci_blob_store_put_
bytes + oci_store_put_ref, not through oci_pull, so the test stays
independent of the slice-4 fetcher and the slice-5a pipeline.
open_memstream captures stdout into a heap buffer and the assertions
grep for distinctive substrings (digest hex prefixes, "[arm64]",
section headers) so format tweaks do not cause spurious failures.
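
The capture technique can be sketched with POSIX open_memstream. The renderer below is a hypothetical stand-in that writes to a FILE pointer; the real test's plumbing around oci_inspect is not shown in this description.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical renderer standing in for the inspect output path. */
static void render_demo(FILE *out)
{
    fprintf(out, "platforms:\n  linux/arm64 [arm64]\n");
}

/* Run `fn` against an in-memory stream and return the captured text.
 * The caller frees the returned buffer. */
static char *capture(void (*fn)(FILE *))
{
    assert(fn != NULL);

    char *buf = NULL;
    size_t len = 0;
    FILE *mem = open_memstream(&buf, &len);
    if (!mem)
        return NULL;

    fn(mem);
    if (fclose(mem) != 0) { /* fclose finalizes buf and len */
        free(buf);
        return NULL;
    }
    return buf;
}
```

An assertion such as `strstr(captured, "[arm64]") != NULL` then checks for a distinctive substring without pinning the exact layout.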

The 6 cases are: a direct image manifest (config + 2 layers, asserts
no [2] index appears so off-by-one shows up); an image index where
default mode drills the arm64 sub-manifest and amd64 / s390x stay
hidden; the same index with --all-platforms (all three platforms
listed, drill section absent); a pin miss for an unknown tag (rc=0,
informational line); a digest reference whose blob is absent (rc=-1,
errno=ENOENT, "error: manifest blob ... not found"); and the index-ok
sub-manifest-missing case (stdout still has the platform table, rc=-1,
errno=ENOENT, err_msg identifies the missing inner blob). The last
case dup2's stderr to /dev/null around the run so the warning line
does not pollute the test driver output.

Makefile adds oci/inspect.c to SRCS. mk/config.mk registers
tests/test-oci-inspect.c in NATIVE_TESTS so the cross-compile pattern
rule skips it. The new link rule pulls in inspect.o, store.o,
blob-store.o, digest.o, manifest.o, media-type.o, ref.o, and cJSON;
no libcurl, no openssl. mk/tests.mk gains a test-oci-inspect target
and runs it as a make-check stage after OCI-pull.

make check stays fully green: 78 unit tests; busybox 81/0/3;
proctitle low-stack; procfs-exec; timeout-disable; OCI-ref 34/34;
OCI-digest 25/25; OCI-blob-store 14/14; OCI-manifest 76/76;
OCI-fetch 15/15; OCI-store 9/9; OCI-pull 6/6; OCI-inspect 6/6.
make test-oci-fetch-online (opt-in) still passes.

elfuse oci inspect now has a real second pane: the slice-1 canonical
header followed by either the rendered manifest tree or a clear
"never pulled" notice. prune and list still return rc=2.

@cubic-dev-ai cubic-dev-ai Bot left a comment

11 issues found across 40 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/oci/pull.c">

<violation number="1" location="src/oci/pull.c:253">
P2: Error-path leak: `sub_resp` may be allocated but not freed when sub-manifest fetch fails before `have_sub` is set.</violation>
</file>

<file name="src/oci/media-type.c">

<violation number="1" location="src/oci/media-type.c:100">
P2: Media type parsing is case-sensitive, but media type type/subtype tokens are case-insensitive; valid values with different casing will be misclassified as unknown.</violation>
</file>

<file name="src/oci/ref.c">

<violation number="1" location="src/oci/ref.c:83">
P2: Repository-path validation incorrectly rejects valid names with repeated dashes (for example `my--repo`).</violation>

<violation number="2" location="src/oci/ref.c:356">
P2: `docker.io` default-namespace detection is case-sensitive, so mixed-case hostnames can skip the required `library/` prefix.</violation>
</file>

<file name="src/oci/fetch.c">

<violation number="1" location="src/oci/fetch.c:782">
P2: Manifest fetch skips bearer-challenge parsing when a token is already cached, so 401 responses from expired/stale tokens are not retried with a refreshed token.</violation>

<violation number="2" location="src/oci/fetch.c:945">
P2: Blob fetch also disables challenge parsing when a token is cached, preventing 401-triggered token refresh and causing avoidable pull failures.</violation>
</file>

<file name="src/oci/blob-store.c">

<violation number="1" location="src/oci/blob-store.c:354">
P2: The commit path is not crash-durable because it never fsyncs the destination directory after linking the blob into place.</violation>
</file>

<file name="src/oci/store.c">

<violation number="1" location="src/oci/store.c:285">
P2: Fsync the pin directory after `rename` to make tag->digest updates crash-safe; file fsync alone does not persist the directory entry change.</violation>
</file>

<file name="src/oci/manifest.c">

<violation number="1" location="src/oci/manifest.c:295">
P2: `schemaVersion` parsing can accept fractional JSON numbers because `valueint` is used without an integer round-trip check.</violation>

<violation number="2" location="src/oci/manifest.c:385">
P2: Layer descriptor memory is leaked on post-parse validation failures because `nlayers` is incremented too late.</violation>

<violation number="3" location="src/oci/manifest.c:481">
P2: Index descriptor memory leaks when platform parsing fails because `nentries` is incremented after the fallible parse.</violation>
</file>

Comment thread src/oci/pull.c
fflush(progress);
}

if (fetch_and_persist_manifest(fetcher, store, ref,

P2: Error-path leak: sub_resp may be allocated but not freed when sub-manifest fetch fails before have_sub is set.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/pull.c, line 253:

<comment>Error-path leak: `sub_resp` may be allocated but not freed when sub-manifest fetch fails before `have_sub` is set.</comment>

<file context>
@@ -0,0 +1,346 @@
+            fflush(progress);
+        }
+
+        if (fetch_and_persist_manifest(fetcher, store, ref,
+                                       entry->desc.digest_str,
+                                       entry->desc.digest_str, &sub_resp,
</file context>

Comment thread src/oci/media-type.c
return OCI_MT_UNKNOWN;

for (size_t i = 0; i < MEDIA_TYPE_COUNT; i++) {
if (!strcmp(MEDIA_TYPES[i].name, buf))

P2: Media type parsing is case-sensitive, but media type type/subtype tokens are case-insensitive; valid values with different casing will be misclassified as unknown.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/media-type.c, line 100:

<comment>Media type parsing is case-sensitive, but media type type/subtype tokens are case-insensitive; valid values with different casing will be misclassified as unknown.</comment>

<file context>
@@ -0,0 +1,189 @@
+        return OCI_MT_UNKNOWN;
+
+    for (size_t i = 0; i < MEDIA_TYPE_COUNT; i++) {
+        if (!strcmp(MEDIA_TYPES[i].name, buf))
+            return MEDIA_TYPES[i].kind;
+    }
</file context>

Comment thread src/oci/ref.c
} else {
return false;
}
if (i >= len || !is_lower_alnum(s[i]))

P2: Repository-path validation incorrectly rejects valid names with repeated dashes (for example my--repo).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/ref.c, line 83:

<comment>Repository-path validation incorrectly rejects valid names with repeated dashes (for example `my--repo`).</comment>

<file context>
@@ -0,0 +1,429 @@
+        } else {
+            return false;
+        }
+        if (i >= len || !is_lower_alnum(s[i]))
+            return false;
+    }
</file context>

Comment thread src/oci/ref.c
goto oom;

bool needs_library_prefix =
strcmp(out->registry, DEFAULT_REGISTRY) == 0 &&

P2: docker.io default-namespace detection is case-sensitive, so mixed-case hostnames can skip the required library/ prefix.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/ref.c, line 356:

<comment>`docker.io` default-namespace detection is case-sensitive, so mixed-case hostnames can skip the required `library/` prefix.</comment>

<file context>
@@ -0,0 +1,429 @@
+        goto oom;
+
+    bool needs_library_prefix =
+        strcmp(out->registry, DEFAULT_REGISTRY) == 0 &&
+        memchr(path_start, '/', path_len) == NULL;
+    if (needs_library_prefix) {
</file context>

Comment thread src/oci/fetch.c Outdated
Comment thread src/oci/blob-store.c
return -1;
}

if (link(w->tmp_path, final_path) < 0) {

P2: The commit path is not crash-durable because it never fsyncs the destination directory after linking the blob into place.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/blob-store.c, line 354:

<comment>The commit path is not crash-durable because it never fsyncs the destination directory after linking the blob into place.</comment>

<file context>
@@ -0,0 +1,399 @@
+        return -1;
+    }
+
+    if (link(w->tmp_path, final_path) < 0) {
+        if (errno != EEXIST) {
+            int saved = errno;
</file context>

Comment thread src/oci/store.c
*err_msg = "close on pin tmp file failed";
return -1;
}
if (rename(tmp, path) < 0) {

P2: Fsync the pin directory after rename to make tag->digest updates crash-safe; file fsync alone does not persist the directory entry change.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/store.c, line 285:

<comment>Fsync the pin directory after `rename` to make tag->digest updates crash-safe; file fsync alone does not persist the directory entry change.</comment>

<file context>
@@ -0,0 +1,360 @@
+            *err_msg = "close on pin tmp file failed";
+        return -1;
+    }
+    if (rename(tmp, path) < 0) {
+        int saved = errno;
+        unlink(tmp);
</file context>

Comment thread src/oci/manifest.c
goto fail;
const cJSON *plat =
cJSON_GetObjectItemCaseSensitive(entry, "platform");
if (parse_platform(plat, &slot->platform, err_msg) < 0)

P2: Index descriptor memory leaks when platform parsing fails because nentries is incremented after the fallible parse.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/manifest.c, line 481:

<comment>Index descriptor memory leaks when platform parsing fails because `nentries` is incremented after the fallible parse.</comment>

<file context>
@@ -0,0 +1,707 @@
+            goto fail;
+        const cJSON *plat =
+            cJSON_GetObjectItemCaseSensitive(entry, "platform");
+        if (parse_platform(plat, &slot->platform, err_msg) < 0)
+            goto fail;
+        out->nentries++;
</file context>
Suggested change
-        if (parse_platform(plat, &slot->platform, err_msg) < 0)
+        out->nentries++;
+        if (parse_platform(plat, &slot->platform, err_msg) < 0)
+            goto fail;

Comment thread src/oci/manifest.c
Comment on lines +385 to +405
if (parse_descriptor(desc, &out->layers[out->nlayers], err_msg) < 0)
goto fail;
oci_media_type_t lmt = out->layers[out->nlayers].media_type;
if (!oci_media_type_is_layer(lmt)) {
set_parse_err(err_msg,
"manifest layer has non-layer media type");
goto fail;
}
if (oci_media_type_is_foreign(lmt)) {
set_parse_err(err_msg,
"manifest references foreign (nondistributable) "
"layer; not supported");
goto fail;
}
if (!oci_media_type_is_layer_supported(lmt)) {
set_parse_err(err_msg,
"manifest layer media type is not supported "
"(only tar / tar+gzip / tar+zstd)");
goto fail;
}
out->nlayers++;

P2: Layer descriptor memory is leaked on post-parse validation failures because nlayers is incremented too late.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/manifest.c, line 385:

<comment>Layer descriptor memory is leaked on post-parse validation failures because `nlayers` is incremented too late.</comment>

<file context>
@@ -0,0 +1,707 @@
+            set_parse_err(err_msg, "manifest layer entry is not an object");
+            goto fail;
+        }
+        if (parse_descriptor(desc, &out->layers[out->nlayers], err_msg) < 0)
+            goto fail;
+        oci_media_type_t lmt = out->layers[out->nlayers].media_type;
</file context>
Suggested change
-        if (parse_descriptor(desc, &out->layers[out->nlayers], err_msg) < 0)
-            goto fail;
-        oci_media_type_t lmt = out->layers[out->nlayers].media_type;
-        if (!oci_media_type_is_layer(lmt)) {
-            set_parse_err(err_msg,
-                          "manifest layer has non-layer media type");
-            goto fail;
-        }
-        if (oci_media_type_is_foreign(lmt)) {
-            set_parse_err(err_msg,
-                          "manifest references foreign (nondistributable) "
-                          "layer; not supported");
-            goto fail;
-        }
-        if (!oci_media_type_is_layer_supported(lmt)) {
-            set_parse_err(err_msg,
-                          "manifest layer media type is not supported "
-                          "(only tar / tar+gzip / tar+zstd)");
-            goto fail;
-        }
-        out->nlayers++;
+        oci_descriptor_t *slot = &out->layers[out->nlayers];
+        if (parse_descriptor(desc, slot, err_msg) < 0)
+            goto fail;
+        out->nlayers++;
+        oci_media_type_t lmt = slot->media_type;
+        if (!oci_media_type_is_layer(lmt)) {
+            set_parse_err(err_msg,
+                          "manifest layer has non-layer media type");
+            goto fail;
+        }
+        if (oci_media_type_is_foreign(lmt)) {
+            set_parse_err(err_msg,
+                          "manifest references foreign (nondistributable) "
+                          "layer; not supported");
+            goto fail;
+        }
+        if (!oci_media_type_is_layer_supported(lmt)) {
+            set_parse_err(err_msg,
+                          "manifest layer media type is not supported "
+                          "(only tar / tar+gzip / tar+zstd)");
+            goto fail;
+        }

Comment thread src/oci/manifest.c
*err_msg = type_msg;
return -1;
}
*out = item->valueint;

P2: schemaVersion parsing can accept fractional JSON numbers because valueint is used without an integer round-trip check.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/oci/manifest.c, line 295:

<comment>`schemaVersion` parsing can accept fractional JSON numbers because `valueint` is used without an integer round-trip check.</comment>

<file context>
@@ -0,0 +1,707 @@
+            *err_msg = type_msg;
+        return -1;
+    }
+    *out = item->valueint;
+    return 0;
+}
</file context>

Max042004 added 22 commits May 20, 2026 22:02
Phase 2 of issue sysprog21#31 needs zstd to decompress OCI image layers that
carry application/vnd.oci.image.layer.v1.tar+zstd (or the Docker
equivalent). zstd has wide registry support beyond gzip and is the only
other compression in the OCI spec that real-world images use.

Per oci-roadmap.md Q9, the OCI work stays hand-rolled C with no Go or
Rust toolchain dependency. zstd cleanly separates decode-only from the
full encoder, so the vendored subset is intentionally minimal:

  lib/zstd.h, lib/zstd_errors.h
  lib/common/*.{c,h}        (allocator, FSE/Huff decoders, xxhash,
                             portability shims, threading stubs)
  lib/decompress/*.{c,h}    (streaming decode state machine)

Compression, dictBuilder, deprecated, and legacy v01-v06 paths are
excluded. lib/decompress/huf_decompress_amd64.S is also dropped: the
build sets -DZSTD_DISABLE_ASM=1 so huf_decompress.c skips the AMD64
asm symbols, and the elfuse host is Apple Silicon in any case.

Build wiring mirrors externals/cjson/: a per-file rule under
build/externals/zstd/, project warning posture relaxed via -Wno-*
because zstd is third-party code, configuration macros
-DZSTD_DISABLE_ASM=1 -DZSTD_LEGACY_SUPPORT=0 -DZSTD_MULTITHREAD=0,
and the objects statically embedded into elfuse, so no -lzstd link flag
is needed.

-lz is appended to HVF_LDFLAGS for gzip-compressed layers (zlib is a
macOS system library, no vendoring needed).

externals/zstd/VENDORING.md records the upstream tag, the exact curl
and cp commands, and the rule that only src/oci/decompress.c may
include externals/zstd/lib/zstd.h.

Phase 2 layer unpack needs to walk tar entries out of decompressed
layer streams. This commit adds the streaming reader on its own so the
applier in a later commit consumes a stable typed-entry API instead of
parsing tar headers inline.

src/oci/tar.{c,h} parses POSIX 1003.1-1990 ustar headers, the
ustar prefix+name join (allowing names up to 255 chars without the GNU
extension), and the GNU '././@LongLink' typeflag-'L'/'K' records used
when an OCI registry hands the reader a path or symlink target longer
than 100 bytes. The reader collapses block, char, fifo, and socket
typeflags into a single OCI_TAR_UNSUPPORTED variant so the applier can
emit one precise refusal message per unpack-time rejection without
re-decoding the typeflag.

PAX extended headers are rejected outright with EPROTONOSUPPORT, per
the asymmetric subset chosen in oci-roadmap.md Q3. If a real-world image
is found to depend on a PAX-only field, expand the accept list with
targeted parsing (mtime / size / path) rather than enabling generic PAX
extension support. The roadmap's risk register records this caveat.

Header chksums are verified against both unsigned and signed-byte sums
so historic and modern tar implementations interoperate. The base-256
GNU encoding for sizes that overflow 8 GiB octal is accepted; layer
blobs that large are not realistic but the parser stays honest.
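
The dual checksum acceptance can be sketched as below. The field offsets follow the ustar header layout (chksum is 8 bytes at offset 148 and counts as spaces while summing); the function names are hypothetical and not taken from src/oci/tar.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum { TAR_BLOCK = 512, CHKSUM_OFF = 148, CHKSUM_LEN = 8 };

/* Parse a space-padded octal field, stopping at the first non-digit. */
static long parse_octal(const unsigned char *p, size_t n)
{
    long v = 0;
    size_t i = 0;
    while (i < n && p[i] == ' ')
        i++;
    for (; i < n && p[i] >= '0' && p[i] <= '7'; i++)
        v = v * 8 + (p[i] - '0');
    return v;
}

/* Accept the header when the stored value matches either the unsigned
 * byte sum or the signed-char sum some historic tar writers produced. */
static int tar_chksum_ok(const unsigned char hdr[TAR_BLOCK])
{
    long usum = 0, ssum = 0;
    for (size_t i = 0; i < TAR_BLOCK; i++) {
        int in_field = i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN;
        unsigned char c = in_field ? ' ' : hdr[i];
        usum += c;
        ssum += (signed char) c;
    }
    long want = parse_octal(hdr + CHKSUM_OFF, CHKSUM_LEN);
    return want == usum || want == ssum;
}

/* Test-side helper: stamp the conventional "%06o\0 " checksum field. */
static void tar_set_chksum(unsigned char hdr[TAR_BLOCK])
{
    unsigned long sum = 0;
    memset(hdr + CHKSUM_OFF, ' ', CHKSUM_LEN);
    for (size_t i = 0; i < TAR_BLOCK; i++)
        sum += hdr[i];
    /* Six octal digits plus NUL; the eighth byte stays a space. */
    snprintf((char *) hdr + CHKSUM_OFF, 7, "%06lo", sum);
}
```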

The reader exposes a callback-driven byte source so the future
oci/decompress.c can hand it zlib, libzstd, or passthrough streams
without the tar parser caring which side feeds it. Short reads and
sub-block chunking are handled internally via a 512-byte block
realignment loop, validated by the unit test feeding 1, 5, 256, and
512-byte chunks of the same fixture.

tests/test-oci-tar.c builds tar payloads in memory (no external tar)
and exercises 19 cases covering: empty archive EOF, regular files at
four chunk sizes, directories with trailing-slash normalization,
symlinks, hardlinks, GNU long-name >100-char paths, PAX rejection,
char/block/fifo collapsing to UNSUPPORTED, unknown typeflag rejection,
chksum mismatch, .wh.<name> and .wh..wh..opq whiteout flagging, and
implicit payload drain on next-iter calls.
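
Whiteout flagging keys off the entry's basename. A minimal sketch, with a hypothetical enum and function name, might be:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { WH_NONE, WH_FILE, WH_OPAQUE } wh_kind_t;

/* Hypothetical classifier: inspects the last path component.
 *   ".wh..wh..opq" marks the containing directory opaque.
 *   ".wh.<name>"   hides <name> from lower layers; *target points at it.
 * *target is NULL for the other two outcomes. */
static wh_kind_t classify_whiteout(const char *path, const char **target)
{
    assert(path != NULL && target != NULL);

    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;
    *target = NULL;

    if (strcmp(base, ".wh..wh..opq") == 0)
        return WH_OPAQUE;
    if (strncmp(base, ".wh.", 4) == 0 && base[4] != '\0') {
        *target = base + 4;
        return WH_FILE;
    }
    return WH_NONE;
}
```

The opaque marker is checked first because it also starts with the ".wh." prefix.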

Makefile, mk/config.mk, and mk/tests.mk register the new translation
unit plus the test-oci-tar build and run rules; the test is wired into
make check after test-oci-inspect.

Phase 2 layer unpack needs to consume OCI layers in
application/vnd.oci.image.layer.v1.tar+gzip,
application/vnd.oci.image.layer.v1.tar+zstd, and the uncompressed
application/vnd.oci.image.layer.v1.tar shapes. This commit puts gzip,
zstd, and passthrough behind one oci_stream_t so the tar reader stays
compression-agnostic.

src/oci/decompress.{c,h} provides oci_decompress_open(fd, alg) plus
streaming oci_stream_read / oci_stream_close. gzip routes through
zlib's inflate with windowBits = 15 + 32 so the decoder auto-detects
the gzip wrapper (raw deflate without a header is intentionally
rejected because real OCI layers always carry the gzip wrapper). zstd
routes through libzstd's streaming ZSTD_DCtx with
ZSTD_d_windowLogMax = 27 (128 MiB), so a pathologically large window
parameter is rejected with EINVAL before any output is produced.

decompress.c is the only translation unit in elfuse that includes
externals/zstd/lib/zstd.h. The build rule attaches -I$(ZSTD_DIR)/lib
as a target-specific CFLAG so the rest of the codebase never sees
zstd headers, keeping the public include surface to oci/decompress.h.

A passthrough mode lets the tar reader consume OCI_COMPRESSION_NONE
layers through the same API. The implementation does not buffer
beyond the initial input buffer in passthrough mode; once exhausted
it hands the caller's buf directly to read(2) so large uncompressed
payloads stream without an extra copy.

tests/test-oci-decompress.c covers five cases: passthrough,
gzip roundtrip with a zlib-generated fixture, gzip truncated frame
rejection, zstd roundtrip with an embedded byte-array fixture
(produced once via the system zstd CLI because the vendored libzstd
is decode-only), and the 28-bit window cap rejection regression.

Makefile, mk/config.mk, and mk/tests.mk register the new translation
unit, the target-specific zstd include path, and the test build / run
rules. The test binary links the vendored zstd objects plus system
zlib (-lz, already in HVF_LDFLAGS from the zstd vendoring commit).

elfuse unpacks layers as the invoking macOS user; chown to arbitrary
uids/gids fails, and the host inode mode cannot always carry the
Linux setuid/setgid/sticky bits a tar entry requests. Phase 3 still
needs the original Linux view at runtime, so Phase 2 records the
authoritative uid/gid/mode per guest path in a sidecar JSON file
that lives alongside the unpacked tree.

src/oci/layer-meta.{c,h} provides oci_meta_table_t with record /
lookup / remove / count plus write and read helpers that serialize
to <root_dir>/.elfuse-meta.json. The on-disk schema is

  { "version": 1,
    "entries": [ { "p": "/path", "u": NNN, "g": NNN, "m": NNN } ] }

Mode bits are stored decimal because cJSON has no native octal; the
bottom 12 bits encode rwx + setuid + setgid + sticky verbatim. The
unit test confirms setuid 0104755 and sticky 0101777 round-trip
faithfully.
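
As an illustration of the decimal encoding, the sketch below formats
one entry in the "p"/"u"/"g"/"m" shape with snprintf. The helper name
meta_entry_format and the META_PERM_MASK macro are hypothetical, the
path is assumed to need no JSON escaping, and whether the real writer
stores the file-type bits alongside the low 12 bits is an assumption
here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* rwx triplets plus setuid (04000), setgid (02000), sticky (01000). */
#define META_PERM_MASK 07777u

/* Formats one sidecar entry. Returns the length written, or -1 if out
 * is too small. No JSON escaping is applied to path in this sketch. */
static int meta_entry_format(char *out, size_t cap, const char *path,
                             unsigned uid, unsigned gid, unsigned mode)
{
    assert(out && path);
    int n = snprintf(out, cap,
                     "{ \"p\": \"%s\", \"u\": %u, \"g\": %u, \"m\": %u }",
                     path, uid, gid, mode);
    return (n < 0 || (size_t) n >= cap) ? -1 : n;
}
```

Octal 0104755 is 35309 in decimal, and masking with META_PERM_MASK
recovers 04755, so the setuid bit survives the decimal detour.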

oci_meta_remove keeps the persisted table tight: whiteouts and tar
overwrites drop the prior entry so a redundant sidecar tuple never
shadows a path that no longer exists in the unpacked tree. Storage
is a linear-scan dynamic array because OCI layers typically hold a
few hundred to a few thousand entries; if profiling later shows the
scan is hot, the same struct can sit behind an open-addressing
FNV-1a hash without touching callers.
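
A rough sketch of such a linear-scan table follows. The type and
function names are illustrative stand-ins, not the oci_meta_table_t
API, and swap-with-last removal is an assumption about how remove
could stay O(1) after the scan:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *path; unsigned uid, gid, mode; } meta_entry_t;
typedef struct { meta_entry_t *v; size_t n, cap; } meta_table_t;

/* Linear scan; returns the index or -1 when the path is absent. */
static long meta_find(const meta_table_t *t, const char *path)
{
    for (size_t i = 0; i < t->n; i++)
        if (strcmp(t->v[i].path, path) == 0)
            return (long) i;
    return -1;
}

/* Insert or overwrite in place, so recording a path twice is idempotent. */
static int meta_record(meta_table_t *t, const char *path,
                       unsigned uid, unsigned gid, unsigned mode)
{
    assert(t && path);
    long i = meta_find(t, path);
    if (i < 0) {
        if (t->n == t->cap) {
            size_t cap = t->cap ? t->cap * 2 : 16;
            meta_entry_t *v = realloc(t->v, cap * sizeof *v);
            if (!v) return -ENOMEM;
            t->v = v;
            t->cap = cap;
        }
        size_t len = strlen(path) + 1;
        char *copy = malloc(len);
        if (!copy) return -ENOMEM;
        memcpy(copy, path, len);
        i = (long) t->n++;
        t->v[i].path = copy;
    }
    t->v[i].uid = uid;
    t->v[i].gid = gid;
    t->v[i].mode = mode;
    return 0;
}

/* Swap-with-last removal; a missing path is a no-op. */
static void meta_remove(meta_table_t *t, const char *path)
{
    long i = meta_find(t, path);
    if (i < 0) return;
    free(t->v[i].path);
    t->v[i] = t->v[--t->n];
}
```

Because callers only see record / find / remove, swapping the array
for a hash table later would not change their code.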

Writes go through a tmp + fsync + atomic rename so an interrupted
write never publishes a partially-flushed sidecar. Reads cap the
file at 64 MiB so a hostile or corrupt sidecar cannot drag the host
into swap. Malformed JSON or a version mismatch is rejected with
EINVAL and a missing file with ENOENT (the latter is the cold-cache
signal that an old unpack predated the sidecar feature).
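
The write side of that contract can be sketched as below. write_atomic
and the "<path>.tmp" naming are assumptions for illustration; the
actual helper in layer-meta.c may name and place its temporary file
differently:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Writes len bytes to "<path>.tmp", fsyncs, then renames over path.
 * Returns 0 or a negative errno. */
static int write_atomic(const char *path, const void *data, size_t len)
{
    assert(path && data);
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t) n >= sizeof tmp) return -ENAMETOOLONG;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -errno;

    const char *p = data;
    int rc = 0;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR) continue;
            rc = -errno;
            break;
        }
        p += w;
        len -= (size_t) w;
    }
    /* Flush data before the rename makes the new file visible. */
    if (rc == 0 && fsync(fd) < 0) rc = -errno;
    if (close(fd) < 0 && rc == 0) rc = -errno;
    if (rc == 0 && rename(tmp, path) < 0) rc = -errno;
    if (rc != 0) unlink(tmp);
    return rc;
}
```

A reader that opens path sees either the old sidecar or the new one in
full, never a prefix of the new one.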

xattr storage is intentionally absent: oci-roadmap.md Q3 commits
Phase 2 to ignore-with-warning on xattr entries rather than
fabricate a half-supported mapping between Linux user/security/
system xattr namespaces and the macOS extended-attribute domain.

tests/test-oci-meta.c covers six cases: record / lookup / count /
miss-ENOENT, idempotent overwrite, remove with no-op on missing
path, write+read roundtrip preserving setuid and sticky, missing
sidecar reports ENOENT, and malformed JSON rejected with EINVAL.

Phase 2 layer unpack drives every entry of every layer's decompressed
tar stream into the unpack root in strict manifest order. The applier
ties tar reader, decompression dispatch, and sidecar metadata together;
the next commit (sparse APFS volume bootstrap) and the one after
(clonefile per-run rootfs) build on this surface.

src/oci/layer-apply.{c,h} exposes oci_layer_apply(reader, root, stats,
meta, err) plus oci_path_join_safe and oci_symlink_target_check. The
latter two are exported for unit tests because the path containment
rules they enforce are the security boundary unpack relies on:

  oci_path_join_safe mirrors src/syscall/path.h::path_translate_at
  from PR sysprog21#33. Reject leading '/' (absolute), reject any segment equal
  to `..`, reject empty paths. Real OCI layers ship paths relative to
  the layer root; anything else is hostile.

  oci_symlink_target_check parses the symlink target as if a follower
  started at link_dir under sysroot=root. Absolute targets get treated
  as sysroot-relative (which matches how the guest will follow them at
  runtime through src/syscall/proc-state.c::sysroot_path_is_contained).
  The check tracks running depth and rejects any drop below zero with
  ELOOP, so `escape -> ../../../etc/passwd` from inside the unpack root
  is refused before symlink(2) ever fires.
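
Both rules can be sketched in a few lines. The functions below are
simplified stand-ins for oci_path_join_safe and
oci_symlink_target_check (the names, signatures, and the depth
bookkeeping are assumptions, and no path joining is performed):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* 0 if rel is a non-empty relative path with no ".." segment,
 * -EINVAL otherwise. */
static int path_is_safe_relative(const char *rel)
{
    assert(rel);
    if (rel[0] == '\0' || rel[0] == '/') return -EINVAL;
    for (const char *s = rel; *s;) {
        size_t n = strcspn(s, "/");
        if (n == 2 && s[0] == '.' && s[1] == '.') return -EINVAL;
        s += n;
        while (*s == '/') s++;
    }
    return 0;
}

/* Walks path segment by segment, adjusting *depth (directories below
 * the root). Returns -ELOOP once a ".." would climb above the root. */
static int walk_depth(const char *path, int *depth)
{
    for (const char *s = path; *s;) {
        size_t n = strcspn(s, "/");
        if (n == 2 && s[0] == '.' && s[1] == '.') {
            if (--*depth < 0) return -ELOOP;
        } else if (n > 0 && !(n == 1 && s[0] == '.')) {
            ++*depth;
        }
        s += n;
        while (*s == '/') s++;
    }
    return 0;
}

/* link_dir is the root-relative directory holding the symlink ("" for
 * the root itself). An absolute target restarts at depth 0, which
 * models "absolute means sysroot-relative". */
static int symlink_target_check(const char *link_dir, const char *target)
{
    assert(link_dir && target);
    int depth = 0;
    if (target[0] != '/' && walk_depth(link_dir, &depth) != 0)
        return -ELOOP;
    return walk_depth(target, &depth);
}
```

The check is purely lexical in this sketch: it never touches the
filesystem, so it can run before symlink(2) creates anything.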

Whiteout handling follows the OCI image-spec layer change-set rules:

  .wh.<name>         the lower-layer entry <name> is removed
                     recursively from the unpack root; the sidecar
                     drops any prior tuple for the same guest path so
                     the persisted .elfuse-meta.json never references
                     a path that no longer exists on disk.

  .wh..wh..opq       the containing directory's lower-layer contents
                     are cleared; subsequent entries in this same
                     layer (e.g. dir/kept) survive because they land
                     after the marker.
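
Classifying an entry's final path component into those two marker
kinds might look like the sketch below. wh_kind_t and
whiteout_classify are hypothetical names; only the ".wh." prefix and
the ".wh..wh..opq" name come from the image spec:

```c
#include <assert.h>
#include <string.h>

typedef enum { WH_NONE, WH_ENTRY, WH_OPAQUE } wh_kind_t;

/* Classifies the basename of a tar entry. For WH_ENTRY, *victim points
 * at the name to delete (a pointer into base); otherwise it is NULL. */
static wh_kind_t whiteout_classify(const char *base, const char **victim)
{
    assert(base && victim);
    *victim = NULL;
    /* The opaque marker also starts with ".wh.", so test it first. */
    if (strcmp(base, ".wh..wh..opq") == 0)
        return WH_OPAQUE;
    if (strncmp(base, ".wh.", 4) == 0 && base[4] != '\0') {
        *victim = base + 4;
        return WH_ENTRY;
    }
    return WH_NONE;
}
```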

Hardlinks resolve the target as an intra-archive guest path through
oci_path_join_safe and require lstat(target_host) to succeed; a
hardlink to a missing target is rejected with ENOLINK, since the OCI
spec mandates apply order and forward references would mean a
malformed archive.

Mode bits propagate via fchmod to whatever the host inode can carry;
setuid/setgid/sticky and the rwx triplets are also recorded in the
running oci_meta_table_t so Phase 3 has the authoritative Linux view
regardless of what the host kernel let elfuse apply as a non-root
user.

Block, char, fifo, and socket entries are refused with ENOTSUP at
this layer because the tar reader already collapses them into
OCI_TAR_UNSUPPORTED; the applier surfaces the precise error per
oci-roadmap.md Q3.

A local la_strchrnul shim sidesteps the macOS 15.4 deployment-target
gate on strchrnul, so the applier builds against older SDKs without
an __builtin_available block.
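
One plausible shape for such a shim is shown below; the body is a
sketch written for this note, not copied from layer-apply.c:

```c
#include <assert.h>
#include <stddef.h>

/* Like strchr, but returns a pointer to the terminating NUL instead of
 * NULL when c does not occur in s. */
static char *la_strchrnul(const char *s, int c)
{
    assert(s);
    while (*s != '\0' && *s != (char) c)
        s++;
    return (char *) s;
}
```

Returning the terminator rather than NULL lets a segment walker compute
the segment length as (end - start) without a separate not-found branch.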

tests/test-oci-layer-apply.c covers nine cases: basic mixed-entry
apply (regular + dir + symlink + hardlink with inode parity check),
symlink escape rejected with ELOOP, hardlink missing target rejected
with ENOLINK, whiteout removal, opaque whiteout clears prior dir
contents while later entries survive, char-device entry rejected with
ENOTSUP, path-join `..` rejected with EINVAL, path-join absolute
rejected with EINVAL, and the legal-target acceptance path.

Phase 2 unpack requires a case-sensitive filesystem (oci-roadmap.md
Q1) so Linux layers that ship colliding names (Foo and foo in the
same directory, common in man pages and many distros) survive without
silent merging. macOS data volumes default to case-insensitive APFS,
so elfuse provisions its own sparsebundle.

src/oci/volume.{c,h} resolves the sysroot volume root and provisions
on first use. The default path is

    $HOME/Library/Application Support/elfuse/sysroots/

with a sparsebundle backing image at

    $HOME/Library/Application Support/elfuse/sysroots.sparsebundle

Bootstrap delegates to src/core/sysroot.h::sysroot_create_mount,
which already wraps the hdiutil create + attach sequence with a
case-sensitive APFS format. No duplicated hdiutil orchestration.

A pthread-mutex-protected cache keeps the mount handle alive for the
lifetime of one elfuse process, so a subcommand that touches the
volume more than once (`elfuse oci clone` unpacks and then clones)
does not re-attach the sparsebundle at every step.

`--volume DIR` overrides go through sysroot_probe_case_sensitivity
from PR sysprog21#33. Non-case-sensitive directories are refused with EINVAL
rather than silently engaging the case-fold sidecar in
src/syscall/sidecar.c; the sidecar is a runtime fallback for guests,
not a Phase 2 unpack policy (see the design note in oci-roadmap.md
Q1: a single sparse APFS volume beats both "require user-provided
case-sensitive volume" and "strict collision rejection").

oci_volume_subdir creates intermediate components for the
images/, runs/, and images/.staging/ subtrees the next two commits
(clonefile copy-up + unpack orchestrator) will write to. Existing
directories are tolerated; assembly failures surface the underlying
errno.

tests/test-oci-volume.c covers the override rejection path and the
subdir creation. The default-sparsebundle bootstrap is gated behind
OCI_VOLUME_TEST=1 because hdiutil orchestration costs ~150 ms and
~16 MiB of disk on every first invocation; make check runs the
ungated subset (case-insensitive rejection, ENOENT on missing path,
subdir creation).

Phase 2 commits oci-roadmap.md Q2 to APFS clonefile-based copy-up:
each `elfuse oci clone` invocation gets a fresh directory tree
cloned from the immutable image sysroot. APFS file-level CoW means
the clone copies no file data up front and only allocates new blocks
for files the guest actually modifies, so the rootfs model is cheap
on both wall time and disk.

This is structurally a place elfuse beats VM-backed runtimes:
Docker Desktop and OrbStack run Linux overlayfs inside a guest
kernel; elfuse has no guest kernel, and the closest macOS-native
primitive (clonefile) gives roughly the same "cheap per-container
view of a shared base" property without the in-guest daemon.
fuse-overlayfs via the PR sysprog21#35 guest FUSE transport would also work
in theory, but oci-roadmap.md Q2 deliberately rejects it: it adds
an in-guest daemon, costs IPC on every syscall, and offers nothing
clonefile cannot already do for the unpack tree.

src/oci/clone-rootfs.{c,h} exposes:

  oci_clone_rootfs(src_image_dir, volume_root, **out_run_dir, **err)
      Allocates a fresh <volume>/runs/<random>/ slot, calls
      clonefile(src, dst, CLONE_NOFOLLOW), returns the absolute
      path. CLONE_NOFOLLOW prevents a symlink at the src root from
      pulling the clone off the immutable image; the layer applier
      from the previous commit already rejected escape-symlinks
      inside the tree.

  oci_clone_rootfs_remove(run_dir, **err)
      Recursive cleanup. Tolerates ENOENT so a CLI flow can call
      remove unconditionally on the success path without surfacing
      "file not found" when there is nothing to remove.

  oci_clone_rootfs_gc(volume_root, older_than, **err)
      Phase 2 stub. Phase 3 will walk volume_root/runs/ and unlink
      entries older than older_than for `elfuse oci prune`.

Apple's clonefile(2) is recursive across directories since macOS
10.12. Hardlinks INSIDE the source tree survive the clone metadata
pass; cross-tree hardlinks back to the immutable image are NOT
created, so the layer applier's intra-archive hardlink handling in
the previous commit was load-bearing.

Run-id generation uses getentropy for 12 hex chars (48 bits), which
is ample for elfuse process lifetimes and avoids the predictability
of a time-or-counter-based scheme.
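
A sketch of that id scheme, assuming a hypothetical run_id_generate
helper (only getentropy and the 12-hex-char width come from the text
above):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/random.h>
#include <unistd.h>

#define RUN_ID_BYTES 6 /* 6 bytes = 48 bits = 12 hex chars */

/* Fills out with 12 lowercase hex chars plus NUL. Returns 0, or -1 if
 * getentropy fails. */
static int run_id_generate(char out[RUN_ID_BYTES * 2 + 1])
{
    static const char hex[] = "0123456789abcdef";
    unsigned char raw[RUN_ID_BYTES];
    assert(out);
    if (getentropy(raw, sizeof raw) != 0)
        return -1;
    for (size_t i = 0; i < sizeof raw; i++) {
        out[2 * i] = hex[raw[i] >> 4];
        out[2 * i + 1] = hex[raw[i] & 0x0f];
    }
    out[RUN_ID_BYTES * 2] = '\0';
    return 0;
}
```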

tests/test-oci-clone.c covers three cases: CoW preservation (mutate
the clone, assert source unchanged), no-op remove on a missing path,
and the gc stub. The CoW test skips with a clear message if
clonefile returns ENOTSUP, so a future non-APFS scratch directory
does not turn the suite red.

This commit ties tar reader, decompression dispatch, layer applier,
sidecar metadata, sparse APFS volume bootstrap, and clonefile-based
copy-up together into one orchestrator and exposes the
user-facing surface via two new subcommands.

src/oci/unpack.{c,h} provides oci_unpack(store, ref, opts,
**out_image_dir, **err). Pipeline:

  1. oci_volume_ensure resolves and provisions the sysroot volume
     (default sparsebundle or --volume override); non-case-sensitive
     overrides are rejected with EINVAL before any disk write.
  2. oci_volume_subdir provisions images/ and images/.staging/.
  3. resolve_manifest_digest prefers ref->digest when set, otherwise
     reads the tag pin via oci_store_get_ref; pin miss surfaces
     ENOENT so the CLI can print a "run oci pull first" hint.
  4. read_blob loads the manifest body via oci_blob_store_path.
     If the body is an image index, oci_index_pick_linux_arm64
     selects the arm64 sub-manifest and re-reads it.
  5. For each layer in manifest order: foreign / nondistributable
     layers refuse with ENOTSUP, reverify_layer_digest re-runs
     SHA-256 over the on-disk blob bytes (defensive, even though
     Phase 1 already verified at write time), then
     oci_decompress_open + oci_tar_reader_new + oci_layer_apply
     drive the entry stream into a staging tree under
     <volume>/images/.staging/<random>/.
  6. oci_meta_write commits .elfuse-meta.json into the staging tree.
  7. rename(2) atomically moves the staging tree into the final
     images/sha256-<hex>/ slot.

Re-running with the same ref short-circuits when the final slot
exists; --force removes the prior commit (via rm -rf) before staging.

src/oci/cli.c gains cmd_unpack and cmd_clone:

  elfuse oci unpack [--store DIR] [--volume DIR] [--force] [-q] <ref>
  elfuse oci clone  [--store DIR] [--volume DIR] [--name N] [--keep] <ref>

Both subcommands print exactly one line on stdout: the absolute
path of the unpacked or cloned tree. Trailing slash on unpack lets
$(elfuse oci unpack alpine)/bin compose cleanly. Diagnostic noise
and per-layer apply progress flow to stderr so the stdout contract
stays scriptable.

clone implies unpack: it calls oci_unpack first and then
oci_clone_rootfs to materialize a fresh <volume>/runs/<random>/
under the same sparsebundle. --keep is forward-looking (Phase 2
does not auto-clean either way), --name is reserved for Phase 3.

tests/test-oci-unpack.c is the integration smoke. Every constituent
module already has dedicated unit coverage in
test-oci-{tar,decompress,layer-apply,meta,volume,clone}, so this
file confirms the link-time dependency edges plus a useful invariant:
oci_unpack must NOT spin up hdiutil or hit the network without a
valid volume context. The full end-to-end fixture is reserved for
Phase 3 where the e2e suite gains a shared tests/lib/oci-fixture
alongside tests/lib/oci-mock.

End-to-end smoke (on the author's Apple Silicon Mac, with network):

  ./build/elfuse oci pull alpine:latest
  IMG=$(./build/elfuse oci unpack alpine:latest)
  test -f "${IMG}lib/ld-musl-aarch64.so.1"  # interpreter present
  ROOT=$(./build/elfuse oci clone alpine:latest)
  ./build/elfuse --sysroot "$ROOT" /bin/sh -c 'echo ok'

Phase 2 is intentionally a no-op for `elfuse run IMAGE`; Phase 3
wires that and the Entrypoint / Cmd / Env / User merge.

hdiutil prints messages like '"diskN" ejected.' to stdout even on
success, and 'created: <path>' on hdiutil create. Phase 2's oci
unpack and oci clone subcommands promise a single-line stdout
contract (the unpacked or cloned tree path), so a downstream

    ROOT=$(elfuse oci clone alpine:latest)

would otherwise capture an hdiutil progress line into $ROOT and break
every subsequent flow that treats $ROOT as a path. Phase 1 never had
a caller that strictly cared about stdout, so this surfaced only with
Phase 2 in place.

Add spawn_simple_silent, which uses posix_spawn_file_actions_addopen
to open /dev/null over the child's fd 1, and route the two
stdout-printing hdiutil callers (detach via
sysroot_detach_mountpoint_force, create via the sparsebundle
bootstrap path inside sysroot_create_mount) through it. Stderr is
intentionally left alone so genuine hdiutil error output still
surfaces for diagnostics.
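
The redirection technique can be sketched as follows. spawn_silent is
a hypothetical stand-in for spawn_simple_silent; its signature and the
use of posix_spawnp for PATH lookup are assumptions, and EINTR handling
around waitpid is omitted:

```c
#include <assert.h>
#include <fcntl.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Runs argv[0] with stdout redirected to /dev/null and stderr
 * inherited. Returns the child's exit status, or -1 on failure. */
static int spawn_silent(char *const argv[])
{
    assert(argv && argv[0]);
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    int rc = -1;
    pid_t pid;
    /* In the child, open /dev/null on fd 1 before exec. */
    if (posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "/dev/null",
                                         O_WRONLY, 0) == 0 &&
        posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ) == 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            rc = WEXITSTATUS(status);
    }
    posix_spawn_file_actions_destroy(&fa);
    return rc;
}
```

Because only fd 1 is replaced, anything the child writes to fd 2 still
reaches the terminal.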

The hdiutil attach path already used spawn_capture_stdout to parse
the plist, so it was never a leak source.

End-to-end smoke after the fix:

    ROOT=$(./build/elfuse oci clone alpine:latest)
    ./build/elfuse --sysroot "$ROOT" "$ROOT/bin/busybox" \
        sh -c 'echo hello from inside oci alpine'

now prints exactly the canonical greeting on stdout with no hdiutil
noise mixed in.

Extends the inspect renderer with a runtime: section below the layer
table that surfaces the launch contract elfuse oci run will honor in a
later Phase 3 commit. The block lists User, WorkingDir, Entrypoint,
Cmd, and Env from the image-config blob referenced by the manifest's
config descriptor, giving an operator a single view of how a pulled
image expects to be invoked.

Reads the config blob with the existing read_blob_file helper and the
oci_image_config_parse parser, both already present in Phase 1. The
read is best-effort: a missing or malformed config blob leaves the
block out silently instead of failing the whole inspect, since the
manifest tree is the primary signal and the config digest is already
named in the layer table. Both the direct-manifest path and the index
drill path now pass blobs into render_manifest so they share the same
rendering.

Absent fields skip their bullet entirely so an image that only sets
Cmd does not advertise four empty rows; explicit empty arrays render
as [] so an operator can tell "field present but empty" apart from
"field absent". Entrypoint and Cmd render as JSON-style arrays with
backslash and double-quote escaping; Env folds onto continuation
lines indented to the value column to keep grep-friendly VAR=value
shape across multiple variables.
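
The array rendering can be illustrated with the sketch below.
render_array is a hypothetical helper, and the ", " separator is an
assumption about the renderer's exact output; only the backslash and
double-quote escaping and the [] form for empty arrays come from the
description above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Renders n strings as ["a", "b"] into out, escaping '\\' and '"'.
 * Returns 0, or -1 if out is too small. */
static int render_array(char *out, size_t cap,
                        const char *const *items, size_t n)
{
    size_t pos = 0;
    assert(out && (items || n == 0));
/* Append one char while always leaving room for the trailing NUL. */
#define PUT(ch) do { if (pos + 1 >= cap) return -1; out[pos++] = (ch); } while (0)
    PUT('[');
    for (size_t i = 0; i < n; i++) {
        if (i > 0) { PUT(','); PUT(' '); }
        PUT('"');
        for (const char *s = items[i]; *s; s++) {
            if (*s == '\\' || *s == '"')
                PUT('\\');
            PUT(*s);
        }
        PUT('"');
    }
    PUT(']');
#undef PUT
    out[pos] = '\0';
    return 0;
}
```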

Extends tests/test-oci-inspect.c with full image-config coverage in
the direct-manifest case (all five runtime fields populated, with
substring assertions for each rendered line) and adds a new empty-Env
case that verifies the explicit-empty bullet and the absent-field
omissions for User, WorkingDir, and Entrypoint.

Introduces src/oci/runspec.{c,h}, a pure-data module that folds the
image-config runtime block (User, WorkingDir, Entrypoint, Cmd, Env)
together with elfuse oci run CLI overrides into a concrete launch
bundle: guest cwd, argv, envp, and optional uid/gid credentials. No
filesystem touches, no PATH search, no syscalls -- those concerns
belong to follow-up Phase 3 commits. The split keeps the override
matrix and the Env policy verifiable by a unit test that builds
oci_image_runtime_t literals in C.

Argv assembly walks the override matrix documented in the Phase 3
plan. --entrypoint clobbers both image Entrypoint and image Cmd
([override] ++ CLI args). Image Entrypoint plus image Cmd ride
together when no CLI args were given; once any CLI positional appears,
the image Cmd is dropped and the CLI args take its slot. The one
hard-fail case is the all-empty path (image has neither Entrypoint
nor Cmd and the CLI supplied no argv), which returns EINVAL with
"image has no entrypoint or cmd; pass one on the CLI".

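The override matrix above reduces to a small function. The sketch
below uses hypothetical names (strv_t, argv_assemble) and borrowed
pointers instead of heap copies; it is not the oci_runspec_build
signature:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* NULL-terminated string vector; NULL means "absent". */
typedef const char *const *strv_t;

static size_t strv_len(strv_t v)
{
    size_t n = 0;
    while (v && v[n]) n++;
    return n;
}

static size_t strv_append(const char **out, size_t pos, strv_t v)
{
    for (size_t i = 0; v && v[i]; i++)
        out[pos++] = v[i];
    return pos;
}

/* Fills out (capacity cap, including the NULL terminator). Returns
 * argc, -EINVAL when there is nothing to run, -E2BIG when out is too
 * small. */
static int argv_assemble(const char *entry_override, strv_t img_entry,
                         strv_t img_cmd, strv_t cli_args,
                         const char **out, size_t cap)
{
    assert(out);
    size_t n_cli = strv_len(cli_args);
    size_t need = entry_override
        ? 1 + n_cli
        : strv_len(img_entry) + (n_cli ? n_cli : strv_len(img_cmd));
    if (need == 0) return -EINVAL;
    if (need + 1 > cap) return -E2BIG;

    size_t pos = 0;
    if (entry_override) {
        out[pos++] = entry_override; /* clobbers Entrypoint and Cmd */
        pos = strv_append(out, pos, cli_args);
    } else {
        pos = strv_append(out, pos, img_entry);
        /* CLI positionals replace the image Cmd when present. */
        pos = strv_append(out, pos, n_cli ? cli_args : img_cmd);
    }
    out[pos] = NULL;
    return (int) pos;
}
```
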
Env merge starts from the image Env array, applies CLI -e overrides
in order (KEY=VAL set-or-replace; bare KEY imports the matching host
environ value when present, otherwise drops silently), auto-imports
TERM from the host when the merged Env has no TERM, injects the
Linux PAM-default PATH when no PATH key has landed, and forces
container=elfuse so systemd-style sandbox detection works regardless
of what the image declared. CLI overrides whose KEY starts with
DYLD_ hard-fail with EINVAL because DYLD_* is a macOS-only loader
contract with no guest meaning; image-provided DYLD_* entries pass
through (aarch64 Linux ignores them, so the runtime cost of stripping
exceeds the safety win).
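
A reduced sketch of the merge order is given below. It covers
set-or-replace, the DYLD_ rejection, the PATH default, and the forced
container=elfuse entry; the bare-KEY host import and the TERM
auto-import are left out for brevity. The env_t type, the fixed
capacity, and the literal PATH value are assumptions, since the text
above does not spell out the PAM default:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define ENV_MAX 64

typedef struct { const char *v[ENV_MAX + 1]; size_t n; } env_t;

/* Length of the KEY part of "KEY=VAL" (or of a bare "KEY"). */
static size_t key_len(const char *kv) { return strcspn(kv, "="); }

/* Index of the entry whose KEY matches kv's KEY, or -1. */
static long env_find(const env_t *e, const char *kv)
{
    size_t k = key_len(kv);
    for (size_t i = 0; i < e->n; i++)
        if (key_len(e->v[i]) == k && strncmp(e->v[i], kv, k) == 0)
            return (long) i;
    return -1;
}

/* Set-or-replace: a later entry with the same KEY wins in place. */
static int env_set(env_t *e, const char *kv)
{
    long i = env_find(e, kv);
    if (i >= 0) { e->v[i] = kv; return 0; }
    if (e->n == ENV_MAX) return -E2BIG;
    e->v[e->n++] = kv;
    e->v[e->n] = NULL;
    return 0;
}

/* image: image Env; cli: KEY=VAL overrides. Both NULL-terminated. */
static int env_merge(env_t *e, const char *const *image,
                     const char *const *cli)
{
    assert(e);
    int rc = 0;
    e->n = 0;
    e->v[0] = NULL;
    for (; rc == 0 && image && *image; image++)
        rc = env_set(e, *image); /* image DYLD_* passes through */
    for (; rc == 0 && cli && *cli; cli++) {
        if (strncmp(*cli, "DYLD_", 5) == 0) return -EINVAL;
        rc = env_set(e, *cli);
    }
    /* Assumed placeholder for the Linux default PATH. */
    if (rc == 0 && env_find(e, "PATH") < 0)
        rc = env_set(e, "PATH=/usr/local/sbin:/usr/local/bin:"
                        "/usr/sbin:/usr/bin:/sbin:/bin");
    if (rc == 0) rc = env_set(e, "container=elfuse"); /* always forced */
    return rc;
}
```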

WorkingDir defaults to "/" when neither the image nor the CLI sets
it; relative paths and any path containing a ".." segment hard-fail
with EINVAL. Sysroot containment is enforced later by the
path-resolve module and the syscall layer.
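
The WorkingDir rules fit in one small function. workdir_resolve is a
hypothetical name; CLI-over-image precedence and treating an empty
string as unset are assumptions of this sketch:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Picks the effective guest cwd: CLI value, else image value, else "/".
 * Returns 0 and sets *out, or -EINVAL for a relative path or any ".."
 * segment. */
static int workdir_resolve(const char *cli, const char *image,
                           const char **out)
{
    assert(out);
    const char *wd = (cli && *cli) ? cli : (image && *image) ? image : "/";
    if (wd[0] != '/') return -EINVAL;
    for (const char *s = wd; *s;) {
        size_t n = strcspn(s, "/");
        if (n == 2 && s[0] == '.' && s[1] == '.') return -EINVAL;
        s += n;
        while (*s == '/') s++;
    }
    *out = wd;
    return 0;
}
```

The segment test is exact, so a name such as "..data" is accepted while
a bare ".." is refused.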

User accepts numeric "UID" or "UID:GID". Symbolic users such as
nginx fail with the deterministic Phase 4 pointer message
("NSS resolution not yet implemented"). UID-only inputs default GID
to the same value to match the proc_set_ids triple-set call shape
the Phase 3 plan describes. CLI --user takes precedence over image
User; both routes share the same numeric parser but emit distinct
diagnostics so the user can tell whether the bad value came from
their flag or from the pulled image.
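
A sketch of the numeric parser follows. user_parse and parse_id are
hypothetical names, the 32-bit range check is an assumption, and the
distinct CLI / image diagnostics are not modeled:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parses decimal digits that must end exactly at stop.
 * Returns 0 or -EINVAL; *out and *end are written only on success. */
static int parse_id(const char *s, char stop, unsigned *out,
                    const char **end)
{
    if (*s < '0' || *s > '9') return -EINVAL; /* rejects "", "+1", "-1" */
    errno = 0;
    char *e;
    unsigned long v = strtoul(s, &e, 10);
    if (errno != 0 || v > 0xffffffffUL || *e != stop) return -EINVAL;
    *out = (unsigned) v;
    *end = e;
    return 0;
}

/* Accepts "UID" (GID defaults to UID) or "UID:GID". On failure *uid
 * may already have been updated. */
static int user_parse(const char *spec, unsigned *uid, unsigned *gid)
{
    assert(spec && uid && gid);
    const char *end;
    if (parse_id(spec, '\0', uid, &end) == 0) {
        *gid = *uid;
        return 0;
    }
    if (parse_id(spec, ':', uid, &end) != 0) return -EINVAL;
    return parse_id(end + 1, '\0', gid, &end);
}
```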

Error reporting uses a thread-local 512-byte buffer for dynamic
messages and static string literals for the fixed ones. The header
documents the shared "*err valid until the next call from this
thread" lifetime contract so the caller does not have to branch on
which path failed.

The new tests/test-oci-runspec.c covers every row of the argv
override matrix (8), every step of the Env policy including TERM
and PATH gates (11), every User parse outcome (6), and every
WorkingDir validation case (5) -- 30 cases total, all green.
Wires the binary into Makefile / mk/config.mk / mk/tests.mk under
'make check'.

Introduces src/oci/path-resolve.{c,h}, the pre-launch helper that takes
a guest argv[0] (POSIX execvp semantics), the merged PATH from the
runspec env, and the guest cwd, and returns both the host filesystem
path elfuse should open() to load the binary and the guest-absolute
path the guest itself thinks it is running. The split is necessary
because the host opens a file inside the cloned rootfs while the guest
reads /proc/self/exe and argv[0] expecting its own absolute view.

Containment policy: every candidate is fed to realpath(3) and the
resolved path must land inside the sysroot. Escape symlinks (a layer
mistake or a malicious image dropping
/usr/bin/foo -> ../../../etc/passwd) are silently skipped so the PATH
search continues past them to the next entry. This matches runc's
escape-symlink handling and
keeps the launch deterministic regardless of layer order. The
containment uses realpath internally but the returned host_path stays
as the symlink-as-found so the guest sees argv[0] under the name it
was invoked with (the kernel handles symlink resolution at open
time).

POSIX execvp semantics: argv0 containing '/' bypasses PATH (absolute
argv0 mapped to <sysroot><argv0>; relative argv0 anchored to
cwd_guest). Otherwise PATH is split on ':' and each entry is treated
as a guest-absolute directory; empty entries fall back to cwd_guest
per POSIX. Executability is decided by host stat(2) (which follows
symlinks) against st_mode & 0111. PATH search records the first
found-but-not-executable candidate and surfaces EACCES if no later
entry succeeds, mirroring execvp's "first noexec wins" behaviour.
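
The PATH walk can be sketched as below. path_search is a hypothetical
name, the realpath containment step is omitted, and treating a wholly
empty PATH as one empty entry is an assumption that may differ from
the module's empty-PATH handling:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Searches path_env for name under sysroot. Each entry is a
 * guest-absolute directory; an empty entry means cwd_guest. On success
 * writes the host path to out and returns 0. Returns -EACCES if some
 * candidate existed but was not executable, otherwise -ENOENT. */
static int path_search(const char *sysroot, const char *path_env,
                       const char *cwd_guest, const char *name,
                       char *out, size_t cap)
{
    assert(sysroot && path_env && cwd_guest && name && out && cap > 0);
    int saw_noexec = 0;
    for (const char *s = path_env;;) {
        size_t n = strcspn(s, ":");
        int len = n
            ? snprintf(out, cap, "%s%.*s/%s", sysroot, (int) n, s, name)
            : snprintf(out, cap, "%s%s/%s", sysroot, cwd_guest, name);
        struct stat st;
        /* stat(2) follows symlinks, so the target's mode is tested. */
        if (len > 0 && (size_t) len < cap && stat(out, &st) == 0 &&
            S_ISREG(st.st_mode)) {
            if (st.st_mode & 0111) return 0;
            saw_noexec = 1; /* remember, but keep searching */
        }
        if (s[n] == '\0') break;
        s += n + 1;
    }
    out[0] = '\0';
    return saw_noexec ? -EACCES : -ENOENT;
}
```

Passing an empty sysroot makes the function search the host directly,
which is convenient for trying it out against /bin.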

Diagnostics carry the guest argv[0] quoted. PATH search misses also
quote a colon-separated list of directories that were actually probed
(empty searched-dirs annotation when PATH was empty, no annotation at
all for direct-mode argv0 with '/'). Escape symlinks and broken
chains do NOT show up in the searched list: they are directories
that contributed no host candidate, so an operator reading the error
sees the dirs that were genuinely walked. The module owns a
thread-local 1 KiB err buffer for dynamic messages.

The module deliberately does NOT reuse src/syscall/path.c's
path_translate_at because that resolver is tied to the running
guest's live sysroot/cwd plumbing while this resolver runs before
the vCPU starts. Containment via realpath is the same idea but the
input/output contracts differ.

The new tests/test-oci-path-resolve.c covers PATH-search hits and
misses, internal symlink follow (host_path keeps the symlink-as-found),
escape-symlink filter (skipped from search and from searched-dirs list),
EACCES on noexec (both PATH and direct modes), ENOENT diagnostics with
and without the searched-dirs suffix, relative argv0 anchored to
cwd_guest, and empty-PATH handling -- 11 cases, all green. macOS
expands /tmp -> /private/tmp via realpath, so the test scratch root is
realpath'd at construction time to keep the assert-equal comparisons
honest.

Wires the binary into Makefile / mk/config.mk / mk/tests.mk under
'make check'.

Splits the post-CLI VM bring-up out of src/main.c into a new
src/core/launch.{c,h} module so the Phase 3 oci run orchestrator
(commit 5) can share one launch path with the legacy positional-ELF
main. No functional change: the same guest_bootstrap_prepare ->
sysroot casefold probe -> guest_bootstrap_create_vcpu -> GDB stub ->
vcpu_run_loop -> teardown sequence runs in the same order against
the same inputs.

launch_args_t carries everything elfuse_launch needs in one struct so
the call shape is stable across future callers: elf_path, sysroot,
guest_argv (NULL-terminated heap copy), envp (NULL -> host environ),
gdb port and stop-on-entry, timeout, verbose, plus three forward-
looking fields that Phase 3 commit 5 will start populating
(has_creds + uid/gid for OCI User spoofing, cwd_guest for image
WorkingDir, fork_child_fd / vfork_notify_fd for any future routing
of the fork-child entry through one launch struct).

The proctitle rewriting call stays in main(). Its old position was
between guest_bootstrap_prepare and guest_bootstrap_create_vcpu,
neither of which read the original argv block, so moving it to
right before the elfuse_launch call is a behavior-preserving move.
The reason it must stay in the caller at all is the same as before:
runtime_set_process_title needs the live argv pointer the kernel
handed in, not the strdup'd shadow that guest_argv carries.

shim_blob.h follows the bring-up into launch.c. main.c no longer
references shim_bin / shim_bin_len, and a single definition site
keeps the linker honest. The HVF headers (Hypervisor/hv_vcpu)
drop out of main.c with the rest of the VM types.

cleanup_main_resources shrinks: the guest_t and guest_initialized
arguments are gone (elfuse_launch owns those), so main()'s remaining
cleanup is just the host cwd restore, the --create-sysroot detach,
the heap argv free, and the elf_path / sysroot_path free. Every
pre-launch error path in main() now calls cleanup with the
new (smaller) signature.

Regression gate (mandatory per Phase 3 plan):
  - 'make check' captured before the refactor (Phase 3 commit 3 HEAD).
  - Refactor applied, 'make check' re-run. Both summary blocks are
    byte-identical: 78 internal aarch64 tests + 81/84 busybox applets
    + every OCI suite (ref/digest/blob-store/manifest/fetch/store/
    pull/inspect/tar/decompress/meta/layer-apply/volume/clone/unpack/
    runspec/path-resolve) at the same pass counts.
  - 'build/elfuse build/test-hello' smoke prints "hello" as expected.
  - tests/test-matrix.sh elfuse-aarch64 was not run; the local
    worktree has no externals/test-fixtures checkout. The 'make check'
    proctitle low-stack regression + busybox applet suite cover the
    same proctitle / signal / dynamic-linking surface that the matrix
    would have exercised.

Closes the Phase 3 launch loop. The new src/oci/run.{c,h} module
walks the orchestration the plan calls for:

  1. oci_unpack into the APFS sysroot volume (idempotent; no-op if
     layers already extracted, hard fails if the image was never
     pulled)
  2. resolve the volume root via oci_volume_ensure so clone-rootfs
     lands in the same sparsebundle as unpack
  3. oci_clone_rootfs into <volume>/runs/<id>/ via clonefile(2)
  4. read + parse the manifest, then the image config blob, off
     the local blob store
  5. fold the image runtime block and the CLI overrides into one
     launch bundle via oci_runspec_build (Phase 3 C2)
  6. mkdir -p the resolved WorkingDir under the cloned rootfs,
     best-effort chown to spec.uid:spec.gid (macOS rejects fchown
     for non-root callers spoofing arbitrary uids; sidecar
     metadata will record the intended owner once Phase 4 lands)
  7. resolve argv[0] inside the cloned rootfs via oci_path_resolve
     (Phase 3 C3) so PATH search and sysroot containment happen
     before bring-up
  8. swap argv[0] for the guest-absolute path so the guest's
     /proc/self/exe matches the name it was invoked under
  9. save host cwd and chdir into <run_dir><spec.cwd> so the
     guest inherits its OCI WorkingDir
 10. assemble launch_args_t and dispatch through elfuse_launch
     (Phase 3 C4); a process-global launch override hook lets the
     unit test substitute a capture-and-return-0 stub instead of
     spinning up a real HVF VM
 11. restore host cwd, free intermediate state, remove the clone
     dir unless --keep is set; the cleanup runs on launch failure
     too so a failed run does not leave stale clones on the volume

oci_cli_run handles the user-facing CLI surface: --store /
--volume / --entrypoint / -e KEY[=VAL] (repeatable) / -w / -u /
--keep / --name (reserved; clone-rootfs has no deterministic-name
slot today) / IMAGE / ARG-tail. Parsing follows the same shape as
cmd_pull / cmd_clone; flag-walk until the first non-flag, IMAGE
next, everything after is positional argv. The dispatcher in
src/oci/cli.c gains the "run" case between "clone" and "prune".

Test coverage in tests/test-oci-run.c (6 cases, all green):

  - cli: -h prints the run usage block, rc=0
  - cli: missing IMAGE returns rc=2
  - cli: unknown option returns rc=2
  - cli: -e without a value returns rc=2
  - run: --volume=/tmp (case-insensitive on default macOS APFS)
    fails fast inside oci_unpack -> oci_volume_ensure; the launch
    override never fires
  - run: ref with no local pin reports an ENOENT-class failure;
    the launch override never fires

The test ships a process-local elfuse_launch stub that abort()s if
called. Every case installs a hook via oci_run_set_launch_for_testing
before invoking the orchestrator, so the stub is purely a linker
satisfier and lets the test binary skip core/launch.o and the
entire VM/syscall transitive chain. End-to-end launch coverage
(actually running a guest from a hand-built fixture store) is the
job of the Phase 3 commit 6 compat shell harness, which has the
fixture builder and a real sparsebundle path.

The orchestrator owns a thread-local 2 KiB err buffer so dynamic
diagnostics (quoted argv[0] + searched PATH list propagated up
from path-resolve) can flow through *err to the CLI driver.

Lands the Phase 3 commit-6 surface: a standalone OCI fixture builder
tool, a shell harness that drives the new oci run subcommand end-to-end
against a hand-built store, and the user-facing docs/usage.md section
that documents the override matrix, env policy, and scope guardrails.

tests/lib/oci-fixture-builder.c is a self-contained CLI that takes a
store root, a ref, image-config flags (--entrypoint, --cmd, --env,
--workdir, --user), and one or more uncompressed-tar layer files,
then hashes + writes the layer blobs, synthesizes the image-config
JSON (with rootfs.diff_ids tying back to the uncompressed-layer
digests per OCI spec), writes the config blob, builds + writes the
manifest blob, and pins the ref. The tool is offline and reusable for
any "shape an image from local files" workflow, not just the compat
suite. cJSON does the JSON assembly so the output matches what the
Phase 1 parser expects byte for byte.

tests/test-oci-compat.sh runs in three layers:

  1. Default mode (always under make check):
     - CLI surface smokes: --help renders the run usage block (rc=0),
       missing IMAGE / unknown option / -e without value all return
       rc=2.
     - Fixture-builder integration: assemble a tiny one-layer fixture
       under a scratch tmpdir (the layer is a tar containing the
       project's existing test-hello aarch64 assembly stub), assert
       exit 0, assert the store now has 3 blobs (layer + config +
       manifest) and a ref pin file, then drive `elfuse oci inspect`
       against the fixture and assert the runtime block renders with
       the entrypoint / env / user lines we supplied.

  2. OCI_COMPAT_TEST=1 (gated):
     - Reserves a slot for the alpine-shaped / busybox-shaped /
       two-layer-whiteout end-to-end fixtures from the Phase 3 plan,
       which need an hdiutil-backed sparsebundle (case-sensitive APFS)
       to exercise the actual elfuse oci run launch. The heavy
       harness lands in a follow-up patch alongside the Phase 4 work
       so this commit stays scoped to what default `make check` can
       verify.

  3. OCI_FETCH_ONLINE=1 (gated):
     - Sibling slot for the docker.io/library/alpine:3 pull+run check.
       Skipped by default; the heavy compat matrix ships the harness.

mk/tests.mk gets two new targets: oci-fixture-builder (build the tool
on its own) and test-oci-compat (run the shell harness, wired into
make check). The harness depends on the ELFUSE_BIN, the builder, and
the existing TEST_HELLO_DEP so make picks the right order.

docs/usage.md gains a "Running OCI Images" section before the
compatibility model: a tabular options list, the full argv override
matrix, the env merge policy (image base -> -e KEY=VAL -> -e KEY
host import -> TERM auto-import -> Linux PAM default PATH ->
container=elfuse forced injection, with the DYLD_* CLI-reject rule
explicit), the User / WorkingDir guardrails (numeric only, no `..`
segments), and the Phase 3 scope notes spelling out what is Phase 4
work and what is permanently out of scope.

Counts: tests/test-oci-compat.sh reports 10/10 passes in default
mode (4 CLI smokes + 6 fixture-builder integration checks) plus 2
SKIP lines for the gated harnesses. Full make check still green
across every OCI suite (no Phase 1 / Phase 2 / earlier Phase 3
regression).

The store root now carries a spec-compliant <root>/oci-layout marker so
external tools (skopeo, umoci, crane) can consume the directory as
oci:<root>. The write is atomic (tmp + link, EEXIST = happy path) and
idempotent: a pre-existing marker is never rewritten, preserving any
third-party version bump.
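
The tmp + link pattern might look like the sketch below.
layout_marker_ensure and the temporary file name are assumptions; the
marker body follows the imageLayoutVersion form from the image-layout
spec:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Creates <root>/oci-layout if absent. Returns 0 when the marker was
 * created or already existed, else a negative errno. */
static int layout_marker_ensure(const char *root)
{
    static const char body[] = "{\"imageLayoutVersion\": \"1.0.0\"}\n";
    char tmp[4096], dst[4096];
    assert(root);
    int a = snprintf(tmp, sizeof tmp, "%s/.oci-layout.%ld.tmp", root,
                     (long) getpid());
    int b = snprintf(dst, sizeof dst, "%s/oci-layout", root);
    if (a < 0 || b < 0 || (size_t) a >= sizeof tmp ||
        (size_t) b >= sizeof dst)
        return -ENAMETOOLONG;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -errno;
    int rc = 0;
    errno = 0;
    if (write(fd, body, sizeof body - 1) != (ssize_t) (sizeof body - 1) ||
        fsync(fd) != 0)
        rc = errno ? -errno : -EIO;
    close(fd);
    /* link(2) fails with EEXIST rather than replacing the target, so an
     * existing marker is never rewritten. */
    if (rc == 0 && link(tmp, dst) != 0 && errno != EEXIST)
        rc = -errno;
    unlink(tmp);
    return rc;
}
```

rename(2) would replace an existing marker silently; link(2) is what
gives the "first writer wins" behavior.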

Adopt the OCI image-layout v1.0.0 index.json schema as the single
source of truth for tag-to-digest pins, replacing the per-tag flat
files under refs/<registry>/<repository>/<tag>. Each pin is one
manifests[] descriptor keyed by org.opencontainers.image.ref.name;
mediaType, digest, and size mirror the manifest blob on disk.

Writers serialize the read-modify-write of index.json via
flock(<root>/index.json.lock, LOCK_EX) and publish atomically through
tmp + rename, so concurrent pulls of distinct tags both land. Readers
parse the rename-atomic snapshot lock-free.
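
A rough sketch of the writer-side protocol, assuming a hypothetical helper publish_index that receives the already-merged index body (the real writer performs the read-modify-write under the same lock). The lock file name follows the commit message; the temp file name is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: replace <root>/index.json with `body` while
 * holding an exclusive flock on <root>/index.json.lock.
 * Returns 0 on success, -1 on error. */
static int publish_index(const char *root, const char *body)
{
    char lock[4096], tmp[4096], dst[4096];

    assert(root != NULL && body != NULL);
    snprintf(lock, sizeof lock, "%s/index.json.lock", root);
    snprintf(tmp, sizeof tmp, "%s/index.json.tmp.%ld", root, (long) getpid());
    snprintf(dst, sizeof dst, "%s/index.json", root);

    int lfd = open(lock, O_RDWR | O_CREAT, 0644);
    if (lfd < 0)
        return -1;
    if (flock(lfd, LOCK_EX) != 0) {   /* serialize concurrent writers */
        close(lfd);
        return -1;
    }

    int rc = -1;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        size_t len = strlen(body);
        int ok = write(fd, body, len) == (ssize_t) len && fsync(fd) == 0;
        close(fd);
        /* rename(2) swaps the file atomically, so lock-free readers see
         * either the old snapshot or the new one, never a partial write. */
        if (ok && rename(tmp, dst) == 0)
            rc = 0;
        else
            unlink(tmp);
    }
    close(lfd);                        /* closing the fd drops the lock */
    return rc;
}
```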

Stop writing refs/ entirely. A pre-existing refs/ directory is left
untouched so a downgrade still finds the legacy data; C2.3 will
migrate older stores on open.

Also expose oci_store_list_refs for downstream callers (Plan 1
root-set, Plan 4 oci status). test-oci-store gains schema validation,
enumeration, and concurrent-writer coverage.
C2.2 stopped writing the legacy refs/ pin tree but left readers
unable to see pins that pre-dated the index.json schema. C2.3
detects such a store on oci_store_open and rebuilds index.json
from refs/ contents in place, keeping refs/ on disk so a downgrade
to the pre-C2.2 binary still finds its data.

Migration acquires flock(index.json.lock, LOCK_EX) before the
read-modify-write so it cannot race a concurrent first put_ref;
a re-stat under the lock bails out when another opener completed
the migration first. Pins whose manifest blob is missing from
blobs/ are skipped with a stderr warning rather than aborting the
open, since a single dangling pin should not block recovery of
the rest of the store. The walker handles arbitrarily deep
repository paths (ghcr.io/owner/group/sub/img) and rejects
malformed leaves (too shallow, dotfiles) with explicit log lines.
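
As an illustration of the leaf validation, the sketch below splits a path relative to refs/ into registry, repository, and tag. The function name split_legacy_ref and the exact rejection rules (fewer than three segments, empty segments, any segment starting with '.') are assumptions made for this example; the real walker may apply different checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "registry/repo/.../tag" in place.
 * Returns 0 and fills the three out pointers, or -1 for a malformed leaf. */
static int split_legacy_ref(char *rel, const char **registry,
                            const char **repository, const char **tag)
{
    assert(rel && registry && repository && tag);

    /* Reject empty segments and dot-leading segments (dotfiles). */
    for (const char *seg = rel;;) {
        if (*seg == '.' || *seg == '/' || *seg == '\0')
            return -1;
        const char *next = strchr(seg, '/');
        if (!next)
            break;
        seg = next + 1;
    }

    char *first = strchr(rel, '/');
    char *last = strrchr(rel, '/');
    if (!first || first == last)
        return -1;              /* too shallow: need registry/repo/tag */

    *first = '\0';
    *last = '\0';
    *registry = rel;            /* first segment */
    *repository = first + 1;    /* everything in between, any depth */
    *tag = last + 1;            /* final segment */
    return 0;
}
```

With the input "ghcr.io/owner/group/sub/img/latest" the repository comes back as "owner/group/sub/img", while "docker.io/latest" is rejected as too shallow.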

Migration can be suppressed by setting ELFUSE_OCI_NO_MIGRATE in
the environment, which leaves refs/ visible only to a downgraded
binary and makes oci_store_get_ref return ENOENT until the env
var is cleared on a later open. This is the documented escape
hatch for downgrade tests and recovery workflows.

Three new test cases cover the path: a two-pin fixture migrates
and survives a reopen without re-running; ELFUSE_OCI_NO_MIGRATE
keeps index.json absent and the legacy pin invisible; and a
coexisting refs/ alongside an existing index.json leaves the
index.json byte-for-byte untouched on reopen. All 18 store unit
tests plus the wider OCI suites (blob-store, pull, inspect, run,
compat) remain green.
The oci_unpack pipeline now writes .elfuse-origin.json into the staging
directory before the final rename. The file records manifest_digest,
config_digest, and the rootfs.diff_ids array parsed from the image
config blob. This is the on-disk attribution Plan 1's root-set walker
needs to map an unpacked sysroot back to every blob it depends on, so
a future prune sweep does not delete layer blobs still in use.

A new oci_origin_write helper in src/oci/origin-meta.c implements the
atomic tmp+fsync+rename pattern, mirroring src/oci/layer-meta.c. The
helper is failure-fatal at the unpack call site: a missing origin
file would silently break GC, which is unrecoverable, so the staging
directory is torn down and oci_unpack returns -1 on any write error.

New unit suite tests/test-oci-origin.c covers single-diff, multi-diff
order, empty diff arrays, rewrite-overwrites-atomically, NULL/empty
guards, and NULL diff_ids fallback. Build wiring adds origin-meta.o
to test-oci-unpack and test-oci-run link lists and registers
test-oci-origin in make check.
oci_store_collect_roots accumulates every blob digest still reachable
from on-disk state into a sorted digest set: pins in index.json drive
the manifest/config/layer walk, and unpacked image trees under
<volume>/images/sha256-<hex>/ contribute their origin sidecar's
manifest digest as another walk root. This is the mark phase that Plan
1's prune sweep needs in C1.3; it is a pure read with no mutation.

For each manifest digest the walker reads the blob and parses it. An
image-manifest contributes its config descriptor plus every layer
descriptor. An image-index contributes every sub-manifest descriptor
and recurses into the ones whose blob is on disk so config + layers
join the keep set; sub-manifests for un-fetched platforms still get
their descriptor digest recorded so a sweep cannot delete the
platform that did materialise.

Failure model is fail-fast on anything that would let prune later
delete a reachable blob: a missing manifest blob, an unparseable
manifest, a missing or malformed .elfuse-origin.json, or a missing
image-config blob all return -1 with err populated. A missing
<volume>/images/ tree is the fresh-store case and treated as zero
contribution rather than an error. The opposite policy (soft skip on
corrupt origin) would let one broken tree drop out of the keep set, so
a later sweep could delete blobs that tree still needs, which is
unrecoverable; fail-fast is the safer side of the data-loss vs.
ergonomics trade-off.

New module src/oci/digest-set.c/h: sorted strdup array with bsearch
contains and lower-bound add. Working set is in the low hundreds, so
O(n) insertion stays cheap and the API leaves room for a hash-backed
implementation later if profiling proves the sweep hot.
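
A minimal sketch of such a sorted set, assuming a hypothetical digest_set type; the names and layout are not taken from src/oci/digest-set.c. It uses one hand-written lower-bound search for both lookup and insertion where the module is described as using bsearch for contains.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char **items;   /* sorted, unique, heap-owned copies */
    size_t len, cap;
} digest_set;

/* Index of the first element that is >= d. */
static size_t digest_lower_bound(const digest_set *s, const char *d)
{
    size_t lo = 0, hi = s->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(s->items[mid], d) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

static int digest_set_contains(const digest_set *s, const char *d)
{
    size_t i = digest_lower_bound(s, d);
    return i < s->len && strcmp(s->items[i], d) == 0;
}

/* Returns 0 on success (duplicates are a no-op), -1 on allocation failure. */
static int digest_set_add(digest_set *s, const char *d)
{
    assert(s != NULL && d != NULL);
    size_t i = digest_lower_bound(s, d);
    if (i < s->len && strcmp(s->items[i], d) == 0)
        return 0;
    if (s->len == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 8;
        char **grown = realloc(s->items, ncap * sizeof *grown);
        if (!grown)
            return -1;
        s->items = grown;
        s->cap = ncap;
    }
    size_t n = strlen(d) + 1;
    char *copy = malloc(n);
    if (!copy)
        return -1;
    memcpy(copy, d, n);
    /* O(n) shift to keep the array sorted. */
    memmove(&s->items[i + 1], &s->items[i], (s->len - i) * sizeof *s->items);
    s->items[i] = copy;
    s->len++;
    return 0;
}

static void digest_set_free(digest_set *s)
{
    for (size_t i = 0; i < s->len; i++)
        free(s->items[i]);
    free(s->items);
    s->items = NULL;
    s->len = s->cap = 0;
}
```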

src/oci/volume-list.c is split off from volume.c on purpose. The
mount/provisioning path pulls in core/sysroot.o and the hdiutil
chain; the read-only enumerator only needs opendir + lstat. Keeping
them in the same translation unit would force every store-linking
test to link the sysroot stack just to walk a directory. The new
oci_volume_list_unpacked stays in the same oci/volume.h namespace.

src/oci/origin-meta.c gains oci_origin_read / oci_origin_free
alongside the C1.1 writer; the layer-meta module is the precedent for
read + write in the same translation unit. Reader validates that
manifest_digest, config_digest, and layer_diffids are present and
correctly typed; mistyped or missing fields surface as EINVAL so the
garbage collector treats a malformed sidecar as a fatal root-set
hole.

Seven new cases in test-oci-store.c cover the empty store, a single
pin (manifest + config + layer), two pins with one shared layer
(dedup), an unpacked tree without any pin, the pin + unpacked
combination, a corrupt origin sidecar (fail-fast), and a pin whose
manifest blob has been unlinked from blobs/ (fail-fast). 25/25 store
tests green; full OCI matrix (origin, unpack, pull, inspect, run,
blob-store, compat) all still pass.
C1.3 wires the previously-stub elfuse oci prune command to a real
mark-and-sweep collector. The mark phase reuses oci_store_collect_roots
(C1.2) over pins in index.json and unpacked sysroots under --volume.
The sweep walks blobs/sha256/ and blobs/sha512/, unlinking any blob
whose digest is not in the keep set.

Locking: prune runs under flock(index.json.lock, LOCK_EX) for the
duration of mark + sweep so a concurrent put_ref cannot publish a new
pin between collect_roots and sweep. Pull-side blob commit remains
lock-free; a pull interrupted mid-blob is treated as a transient that
the caller re-fetches on retry.

CLI: --commit gates the unlink (default is dry-run with a "(dry-run;
pass --commit to delete)" footer). --volume mirrors unpack/clone so
the same volume root contributes its unpacked sysroots to the keep
set. Output is two lines (reclaimable + kept) to stdout, ready for
shell composition.

Subdirectories under blobs/<algo>/ and files whose names are not
valid lowercase hex of the right length are skipped without surfacing
as errors, leaving foreign state (external tool metadata, hand-created
directories) untouched.
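
The name filter can be pictured with a small predicate like the one below. The function name is_blob_name is hypothetical; the expected lengths (64 hex characters for sha256, 128 for sha512) follow from the digest sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: 1 if `name` looks like a blob file for `algo`
 * (lowercase hex of the exact digest length), 0 otherwise. */
static int is_blob_name(const char *algo, const char *name)
{
    size_t want;

    assert(algo != NULL && name != NULL);
    if (strcmp(algo, "sha256") == 0)
        want = 64;
    else if (strcmp(algo, "sha512") == 0)
        want = 128;
    else
        return 0;

    if (strlen(name) != want)
        return 0;
    for (size_t i = 0; i < want; i++) {
        char c = name[i];
        int is_hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
        if (!is_hex)
            return 0;   /* uppercase and non-hex bytes are foreign state */
    }
    return 1;
}
```

A sweep built on this predicate would skip any entry for which it returns 0 rather than treating it as an error.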

Tests: 8 new cases in test-oci-store.c covering dry-run vs commit,
no-pins-no-volume, unpacked-tree as sole root, mark failure abort,
idempotent re-run, decoy subdir + non-hex filename ignored, and
NULL-arg EINVAL. compat shell gains 5 prune-smoke lines exercising
the full CLI dispatch on the fixture-builder store.
oci_store_prune now classifies dangling blobs into a candidate list
before unlinking and runs two optional filter passes that flip
candidates from PRUNE to SKIP:

  --older-than DUR vetoes per-blob: a dangling blob whose mtime is
  younger than (now - DUR) survives. The grace window protects the
  blob committed by a half-completed pull whose put_ref has not
  landed yet.

  --keep-bytes SIZE enforces a global LRU budget over candidates
  that survived the older-than veto. Survivors are sorted by mtime
  ascending and walked from the newest end; the newest blobs whose
  cumulative size fits SIZE are reclassified as SKIP, and the
  walk terminates at the first blob that does not fit so older
  candidates are always evicted ahead of newer ones even when an
  older blob could fit alone. Order matters: filter older-than
  first so a transient blob in the grace window never enters the
  LRU computation.

Both flags default to 0, which the store API documents as "no
filter" so the C1.3 behaviour (every dangling blob is pruned)
is preserved when the caller does not opt in. SKIP candidates
contribute to a new skipped_blobs / skipped_bytes pair on
oci_store_prune_options_t so callers can render a three-way
kept / pruned / skipped split.
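
The two filter passes can be sketched as follows. The candidate_t layout and the apply_filters signature are assumptions for illustration, as is the exact boundary comparison for the grace window; candidates are expected to arrive with verdict PRUNE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef enum { VERDICT_PRUNE, VERDICT_SKIP } verdict_t;

typedef struct {
    time_t mtime;
    uint64_t size;
    verdict_t verdict;   /* starts as VERDICT_PRUNE */
} candidate_t;

static int by_mtime_asc(const void *a, const void *b)
{
    const candidate_t *x = a, *y = b;
    return (x->mtime > y->mtime) - (x->mtime < y->mtime);
}

/* older_than == 0 and keep_bytes == 0 each mean "no filter". */
static void apply_filters(candidate_t *c, size_t n, time_t now,
                          time_t older_than, uint64_t keep_bytes)
{
    assert(c != NULL || n == 0);

    /* Pass 1: per-blob veto for anything inside the grace window. */
    if (older_than > 0) {
        for (size_t i = 0; i < n; i++)
            if (c[i].mtime > now - older_than)
                c[i].verdict = VERDICT_SKIP;
    }

    /* Pass 2: keep the newest remaining candidates that fit the budget. */
    if (keep_bytes > 0) {
        uint64_t used = 0;
        qsort(c, n, sizeof *c, by_mtime_asc);
        for (size_t i = n; i-- > 0;) {
            if (c[i].verdict == VERDICT_SKIP)
                continue;   /* grace-window blobs stay outside the budget */
            if (c[i].size > keep_bytes - used)
                break;      /* first misfit ends the walk; older ones go */
            used += c[i].size;
            c[i].verdict = VERDICT_SKIP;
        }
    }
}
```

With a 60-byte budget and candidates of 10, 30, and 40 bytes from oldest to newest, only the 40-byte blob is spared: the 30-byte blob does not fit the remaining 20 bytes, and the walk stops there even though the oldest 10-byte blob would fit on its own.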

CLI parsing accepts s/m/h/d/w suffixes for --older-than and
K/M/G (optional B trailer) for --keep-bytes; the size grammar
is KiB-based to match du and df. Negative inputs, ERANGE, and
unrecognised suffixes are rejected with EINVAL and a stderr
message that names the failing argument. The prune output
prints a "skipped: N blobs (M bytes)" line only when at least
one candidate was spared so unfiltered prunes stay quiet.
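
A possible shape for the two parsers, under stated assumptions: the function names are hypothetical, a bare number is taken as seconds or bytes, and a lone "B" suffix is rejected. The commit message does not spell out those last two points, so treat them as guesses.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a leading unsigned decimal; rejects "", signs, and whitespace. */
static int parse_number(const char *s, uint64_t *v, const char **rest)
{
    char *end;

    if (*s < '0' || *s > '9') {
        errno = EINVAL;
        return -1;
    }
    errno = 0;
    unsigned long long n = strtoull(s, &end, 10);
    if (errno == ERANGE) {
        errno = EINVAL;
        return -1;
    }
    *v = n;
    *rest = end;
    return 0;
}

static int scale(uint64_t v, uint64_t mult, uint64_t *out)
{
    if (v > UINT64_MAX / mult) {   /* multiplication would overflow */
        errno = EINVAL;
        return -1;
    }
    *out = v * mult;
    return 0;
}

/* --older-than: N with an s/m/h/d/w suffix, result in seconds. */
static int parse_duration(const char *s, uint64_t *secs)
{
    uint64_t v, mult;
    const char *r;

    assert(s != NULL && secs != NULL);
    if (parse_number(s, &v, &r) != 0)
        return -1;
    switch (*r) {
    case '\0': return scale(v, 1, secs);   /* assumption: bare = seconds */
    case 's': mult = 1; break;
    case 'm': mult = 60; break;
    case 'h': mult = 3600; break;
    case 'd': mult = 86400; break;
    case 'w': mult = 604800; break;
    default: errno = EINVAL; return -1;
    }
    if (r[1] != '\0') {
        errno = EINVAL;
        return -1;
    }
    return scale(v, mult, secs);
}

/* --keep-bytes: N with K/M/G (powers of 1024) and an optional B trailer. */
static int parse_size(const char *s, uint64_t *bytes)
{
    uint64_t v, mult;
    const char *r;

    assert(s != NULL && bytes != NULL);
    if (parse_number(s, &v, &r) != 0)
        return -1;
    switch (*r) {
    case '\0': return scale(v, 1, bytes);
    case 'K': mult = 1024ULL; break;
    case 'M': mult = 1024ULL * 1024; break;
    case 'G': mult = 1024ULL * 1024 * 1024; break;
    default: errno = EINVAL; return -1;
    }
    r++;
    if (*r == 'B')
        r++;
    if (*r != '\0') {
        errno = EINVAL;
        return -1;
    }
    return scale(v, mult, bytes);
}
```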

Tests:
- 6 new cases in test-oci-store.c covering the older-than veto,
  zero-disables, the keep-bytes LRU eviction order, zero-budget
  equivalence, both filters composed, and a dry-run smoke that
  asserts stats track without touching disk. New helpers
  set_blob_mtime (utimes wrapper) and stage_dated_dangling drive
  blob mtime deterministically.
- 4 new compat shell smokes covering --older-than 1d on a touch -t
  backdated blob, --keep-bytes 0 as the disable form, and the two
  parser-rejection paths (invalid duration, invalid byte size).
- Full OCI suite green: store 39/39, origin 6/6, unpack 1/1,
  pull 6/6, inspect 7/7, run 6/6, blob-store 14/14, compat 20/20.
Max042004 added 27 commits May 21, 2026 22:30
Replaces the C3.2 cumulative-by-diff_id snapshot scheme with the
Plan 3 C3.3c-ii two-tier cache. The per-layer cache at
<store>/layers/sha256/<diff_id>/ now holds the raw tar payload of
one layer (whiteout markers preserved as 0-byte files via
oci_layer_apply_raw_tar) plus a per-layer .elfuse-meta.layer.json
sidecar. A new <store>/layers/stacks/sha256/<chain_hex>/ cache
holds the assembled cumulative stage_dir state through some
prefix of an image's layer list, keyed by the OCI image-spec
ChainID, plus the cumulative .elfuse-meta.json sidecar.

oci_unpack now:

  1. Computes ChainID(L0..Lk) up front for every layer.
  2. Searches the stack cache from the longest prefix down. On
     hit, rm stage_dir + clonefile(stack_dir, stage_dir) and
     re-load cum_meta from the snapshot's .elfuse-meta.json so
     trailing layers accumulate on top.
  3. For each layer the prefix did not cover, raw cache hit
     skips populate; raw cache miss stages a raw_dir,
     drives oci_unpack_layer_raw, writes the per-layer sidecar,
     then oci_store_layer_commit publishes it.
  4. oci_unpack_assemble_layer runs a two-pass overlay walker
     against the raw cache entry. Pass 1 honours .wh.<name>
     (rm-r the named entry) and .wh..wh..opq (clear the parent
     dir contents) against stage_dir; pass 2 clonefiles every
     non-whiteout entry on top. Both passes skip the
     .elfuse-meta.layer.json sidecar.
  5. After each layer, the orchestrator writes the cumulative
     .elfuse-meta.json into stage_dir and snapshots it into
     layers/stacks/sha256/<chain>/ via clonefile +
     oci_store_stack_commit so any future unpack sharing this
     prefix short-circuits.
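
The entry classification that step 4's two passes rely on can be sketched like this. The enum and function name are hypothetical; the marker names (.wh.<name>, .wh..wh..opq) and the sidecar filename come from the text above. Note that the opaque marker must be tested before the generic .wh. prefix, since it also starts with it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ENTRY_REGULAR,    /* pass 2: clone on top of stage_dir */
    ENTRY_WHITEOUT,   /* pass 1: remove *target from stage_dir */
    ENTRY_OPAQUE,     /* pass 1: clear the parent directory's contents */
    ENTRY_SIDECAR     /* skipped by both passes */
} entry_kind_t;

/* Hypothetical classifier for one basename in a raw cache entry.
 * For ENTRY_WHITEOUT, *target points at the name being deleted. */
static entry_kind_t classify_entry(const char *name, const char **target)
{
    assert(name != NULL && target != NULL);
    *target = NULL;

    if (strcmp(name, ".elfuse-meta.layer.json") == 0)
        return ENTRY_SIDECAR;
    if (strcmp(name, ".wh..wh..opq") == 0)
        return ENTRY_OPAQUE;
    if (strncmp(name, ".wh.", 4) == 0 && name[4] != '\0') {
        *target = name + 4;
        return ENTRY_WHITEOUT;
    }
    return ENTRY_REGULAR;
}
```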

oci_unpack_layer reverts to the C3.1 seven-arg shape and is now
a pure stage_dir overlay-extract primitive. The cache plumbing
that previously lived inside the helper moves entirely to the
orchestrator. A sibling oci_unpack_layer_raw drives the raw-tar
applier so the orchestrator does not duplicate the reverify +
decompress + applier scaffold for the populate path.
oci_unpack_assemble_layer is exposed in unpack.h so multi-layer
test fixtures can drive the assembly without a hdiutil-backed
volume.

layer-meta.{c,h} gains oci_meta_read_named and _write_named so
callers can pick the on-disk filename. The existing oci_meta_read
/ _write become thin wrappers passing ".elfuse-meta.json" so
cumulative call sites are unchanged. raw cache writers pass
".elfuse-meta.layer.json"; cumulative stack writers keep the
default. Filenames must be relative basenames (no embedded '/').

Hardlink relationships from the tar are not reconstructed across
the assembly step. Each clonefile produces an independent inode;
APFS copy-on-write keeps disk usage flat regardless. This is a
known limitation documented in oci_unpack_assemble_layer's doc.

Stack snapshot commit failure is fatal: silently degrading the
cache would defeat the dedup path C3.3c-ii exists to enable.
EXDEV during clonefile (cache and stage on different APFS
volumes) is also a hard failure, matching the C3.2 policy.

test-oci-unpack: deletes the three C3.2 cumulative cache cases
(cache_miss_populates_layers_subtree, cache_hit_skips_re_extract,
cache_hit_merges_meta) whose semantics no longer exist. The
three C3.1 helper cases stay with the signature updated to the
new seven-arg oci_unpack_layer shape. Nine new cases cover the
C3.3c-ii surface:

  unpack_layer_raw_single_file_populates_raw_dir
  unpack_layer_raw_preserves_whiteout_as_file
  two_layer_overlay_assembly_no_whiteout
  two_layer_whiteout_removes_lower
  two_layer_opaque_clears_dir
  cross_image_raw_cache_dedup
  cross_image_stack_prefix_dedup
  same_image_full_stack_short_circuits
  meta_sidecar_split_round_trip

Tests drive building-block APIs directly because oci_unpack's
hdiutil-backed volume gate is still wired to OCI_VOLUME_TEST=1;
the multi-layer flows compose oci_unpack_layer_raw +
oci_unpack_assemble_layer + oci_store_{layer,stack}_commit to
verify cache state across image scenarios.

Test surface: test-oci-unpack 13/13, test-oci-store 53/53,
test-oci-origin 6/6, test-oci-pull 6/6, test-oci-inspect 7/7,
test-oci-run 6/6, test-oci-blob-store 14/14, test-oci-layer-apply
12/12, test-oci-tar 19/19, test-oci-decompress 5/5,
test-oci-digest 33/33, test-oci-compat.sh 20/20. make elfuse
links + signs.
Plan 3 C3.4 surfaces a "layer reuse:" section in oci inspect output and
exposes the helper Plan 4 oci status will reuse for store-wide aggregate
stats.

The new oci_dedup_metrics_compute walker in src/oci/dedup-metrics.c
takes a target manifest digest plus an optional volume_root and reports
five fields: total_layers (target's rootfs.diff_ids count), shared_layers
(|target ∩ others|), shared_bytes (raw cache st_size sum over the
intersection), compared_images (other images deduped by manifest digest),
and deepest_shared_prefix (longest k where ChainID(target[0..k-1]) is
also reached by some other image's chain). Two dedup axes (per-diff_id
and per-ChainID prefix) match how the C3.3c-ii orchestrator dedupes:
raw cache shares any same-diff_id payload regardless of position, while
the stack cache only short-circuits a leading prefix.
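
The two axes can be illustrated against a single other image, as in the sketch below. The helper names are hypothetical, and the prefix helper compares diff_ids directly as a stand-in for ChainID equality: a ChainID is a digest over the ordered prefix, so equal prefixes give equal ChainIDs. The real walker aggregates over every other image in the store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int has_diff_id(const char *const *ids, size_t n, const char *id)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(ids[i], id) == 0)
            return 1;
    return 0;
}

/* Per-diff_id axis: target layers that appear anywhere in `other`. */
static size_t count_shared_layers(const char *const *target, size_t nt,
                                  const char *const *other, size_t no)
{
    size_t shared = 0;

    assert((target != NULL || nt == 0) && (other != NULL || no == 0));
    for (size_t i = 0; i < nt; i++)
        shared += (size_t) has_diff_id(other, no, target[i]);
    return shared;
}

/* Per-prefix axis: length of the longest common leading run. */
static size_t shared_prefix_len(const char *const *target, size_t nt,
                                const char *const *other, size_t no)
{
    size_t k = 0;

    while (k < nt && k < no && strcmp(target[k], other[k]) == 0)
        k++;
    return k;
}
```

For a target of layers a, b, c compared against an image with layers a, c, b, all three layers are shared through the raw cache, but only the first one forms a common prefix the stack cache could reuse.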

Walker sources:
  - Pins from index.json. Image-index pins resolve to their linux/arm64
    sub-manifest before contributing diff_ids; pins that fail to resolve
    or whose image-config is missing/unparseable are skipped silently
    (dedup is informational, not a GC keep-set).
  - Unpacked sysroots under volume_root/images/sha256-<hex>/. The origin
    sidecar already carries manifest_digest + diff_ids so no blob read is
    needed. Unpacked trees whose manifest digest matches a pin already
    counted are dropped via the compared_manifests set so the same image
    never inflates compared_images.

Target failures are surfaced as -1 with errno + *err; the inspect render
catches that and prints "layer reuse: (image-config unavailable)" so the
surrounding manifest tree still renders. compared_images == 0 (only the
target itself in the store) prints "(no other images to compare)".

oci_inspect_options_t gains volume_root and suppress_layer_reuse; the
new section renders after the layer table in both the direct-manifest
and the indexed-drill paths, but is skipped under --all-platforms (no
manifest is picked). The CLI plumbs --volume DIR into inspect; help text
documents the new flag.

Failure model:
  - Target manifest/config missing or unparseable -> degrade sentinel,
    rc unchanged.
  - Other-image manifest/config missing or unparseable -> silently
    skipped; compared_images counts only fully usable images.
  - Bytes formatting: >= 1 MiB renders as "~X.Y MiB on cache"; smaller
    non-zero values render in bytes; zero bytes are omitted so the
    line never implies a populated cache that does not exist.
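
The byte-formatting rule could look roughly like this. The helper name is hypothetical and the wording of the sub-MiB form ("N bytes on cache") is an assumption; only the "~X.Y MiB on cache" shape and the omit-on-zero rule come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical formatter for the shared_bytes suffix.
 * Writes "" for zero so the caller can omit the suffix entirely.
 * Returns the snprintf result (0 for the empty case). */
static int format_shared_bytes(uint64_t bytes, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);

    if (bytes == 0) {
        buf[0] = '\0';
        return 0;
    }
    if (bytes >= 1024ULL * 1024ULL)
        return snprintf(buf, cap, "~%.1f MiB on cache",
                        (double) bytes / (1024.0 * 1024.0));
    /* Assumed wording for the small, non-zero case. */
    return snprintf(buf, cap, "%llu bytes on cache",
                    (unsigned long long) bytes);
}
```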

Reusable bits for Plan 4 (oci status): oci_dedup_metrics_compute is the
per-image entry point. A store-wide aggregate is one iteration over
oci_store_list_refs away.

Tests: test-oci-dedup-metrics is new (8 cases covering single image,
shared layers with byte accounting, disjoint diff_ids, same-manifest
self-exclusion, reordered shared layer with no prefix, image-index pin
arm64 resolution, missing other-config skip, and unpacked sysroot
contribution via volume_root). test-oci-inspect grows from 7 to 13
(+6 integration cases verifying section render shape, zero-shared
output, single-image sentinel, target config degrade sentinel,
--all-platforms suppression, and --volume unpacked-tree counting).
Plan 3 C3.5 introduces elfuse oci rebuild-cache. The new subcommand
walks every <volume>/images/sha256-<hex>/ unpacked sysroot, reads its
.elfuse-origin.json sidecar to recover the original layer diff_id
ordering, recomputes the terminating ChainID via oci_chainid_compute,
and (when --commit is set) clonefiles the tree into a fresh stack cache
entry at <store>/layers/stacks/sha256/<chain>/. Subsequent unpacks of
any image sharing the same ordered layer list short-circuit through
the C3.3c-ii orchestrator's stack-cache fast path instead of paying
the full extract cost.

This closes the migration gap left by C3.3c-ii. The stack cache only
grows as a side effect of oci_unpack, so trees unpacked before C3.3c
landed (or unpacked into a store while the C3.3b schema marker had
just wiped v1 entries) leave no stack snapshot on disk even though
their assembled stage_dir state is sitting under images/.

Design decisions:
  - Only the terminating ChainID is back-filled per tree. Intermediate
    prefix entries cannot be reconstructed because an unpacked tree
    only captures the final overlay state; the per-layer raw cache at
    layers/sha256/<diff_id>/ similarly remains empty until a re-pull
    plus re-unpack of the source image repopulates it.
  - Detection: every tree with a non-empty diff_ids list is eligible;
    the stack_has probe filters out already-cached chains. Repeated
    invocations are idempotent.
  - CLI shape matches the flat dispatch the other subcommands use:
    elfuse oci rebuild-cache --store DIR --volume DIR [--commit].
    Dry-run is default (the prune convention); --commit writes.
  - The unpacked tree's .elfuse-origin.json is stripped from the
    staged snapshot before stack_commit so the rebuilt entry matches
    the byte shape a fresh oci_unpack produces (origin_write runs
    AFTER the stack snapshot in the fresh-unpack path).
  - Per-tree failure is non-fatal: origin read errors split into
    no_origin (ENOENT) vs bad_origin counters; chainid_compute,
    clonefile, or stack_commit failure increments trees_failed and
    logs to stderr. The walk never aborts so a single corrupt tree
    cannot block back-filling the rest. Listing-level failure (such
    as failing to traverse images/) returns -1.
  - No interaction with raw cache entries, blob storage, pin
    metadata, or the C3.3b schema marker; rebuild-cache only
    manipulates layers/stacks/. EEXIST during stack_commit is
    treated as benign (matches the C3.3c-i contract) so concurrent
    rebuild + unpack do not race destructively.

API surface lives in the new src/oci/rebuild-cache.{c,h}; the cli.c
addition is just argument parsing + the human report. The internal
rm_recursive helper is duplicated from src/oci/unpack.c (rather than
lifting both copies to a shared util) so the small back-fill module
does not have to pull in the full unpack / layer-apply / decompress
graph for one staging-tree cleanup helper. Lift to a shared util when
a third call site appears.

Tests: test-oci-rebuild-cache is new (9 cases: empty volume, single-
tree commit with origin sidecar strip verification, dry-run reports
without touching disk, already-cached skip via pre-seeded stack dir,
missing origin sidecar skip, malformed origin JSON skip, empty
diff_ids array skip, two-tree two-round idempotency, three-layer
ChainID matches the on-disk path computed via an independent
oci_chainid_compute walk). test-oci-compat.sh grows from 20 to 21
with a rebuild-cache --dry-run smoke against the existing fixture.
make elfuse links + signs.
Plan 3 C3.3d extends oci_store_prune to garbage-collect entries in
the two Plan 3 cache families alongside the Plan 1 blob sweep. The
existing blob-only mark plus sweep stays byte-identical when layers/
and layers/stacks/ are empty; CLI output preserves the
"reclaimable: N blobs" / "kept: M blobs" / "dry-run" lines verbatim
so operator scripts and the existing compat smoke patterns keep
matching.

Mark walker:

A new public oci_store_collect_layer_roots in src/oci/store.{h,c} runs
parallel to oci_store_collect_roots and produces two sorted digest
sets: one of diff_ids (keys for <root>/layers/<algo>/<hex>/) and one
of every ChainID prefix (keys for <root>/layers/stacks/<algo>/<hex>/).
The walker shares the pin and unpacked-sysroot sources with blob
mark. For each pinned image-manifest it drills into the image-config
blob and reads rootfs.diff_ids; for image-index pins it picks the
linux/arm64 sub-manifest and recurses, treating "no linux/arm64
entry" and "sub-manifest blob not on disk" as soft no-contribution
to match expand_manifest_digest's policy for the same shapes.
Unpacked sysroots come from the origin sidecar's layer_diffids field
so no blob read is needed for that source. Missing or unparseable
manifest / image-config blobs are fatal mark failures so prune
cannot delete a reachable cache entry on the false belief that
nothing references it.

ChainID expansion records every prefix chain, not just the
terminating chain, because oci_unpack writes one stack snapshot per
prefix during the apply loop (src/oci/unpack.c around line 1063); a
walker that only tracked terminal chains would let prune delete a
prefix entry an unpack actually committed.

Three small statics in store.c (resolve_image_diff_ids,
add_diff_ids_and_chains, dir_tree_size_sum) are pattern-duplicates of
sibling helpers in src/oci/dedup-metrics.c. They were copied rather
than lifted to follow the rebuild-cache.c::rm_recursive precedent
(commit 4df17b1) of deferring a shared util module until a third
caller appears.

Sweep:

The blob-specific classify_algo_dir + apply_verdicts machinery from
Plan 1 stays in place but moves from "writes stats->kept_blobs /
stats->pruned_blobs / ..." to "writes through caller-supplied output
pointers". The change is local and additive at every call site. A
new classify_tree_cache_dir sweeps both <root>/layers/<algo>/ and
<root>/layers/stacks/<algo>/ via the same base_subpath parameter: it
recognises sha256/sha512 directory entries whose name is a valid
lowercase hex digest, looks the canonical "<algo>:<hex>" up in the
family's keep set, and either bumps the kept counter or appends a
candidate with the entry's recursive st_size sum and its st_mtime
(set by rename(2) at commit time, so newer entries sort newer for the
LRU budget).

apply_verdicts gains a prune_family_t parameter selecting the
removal primitive: BLOB calls unlink(2); TREE calls the existing
layer_stage_rm recursive rm helper so a populated cache directory
goes down in one call. ENOENT mid-removal stays a benign skip in
both branches; other errno values are fatal.

oci_store_prune now runs three back-to-back family pipelines (blobs,
layers, stacks). Each family classifies, filters, and applies its
own candidate list against its own keep set. The mark phase still
runs once: collect_roots produces the blob set first, then
collect_layer_roots produces the diff_id + chain_id sets, both
under the same flock(index.json.lock, LOCK_EX) window so all three
keep sets are derived from one snapshot of pins + unpacked sysroots.
The keep-bytes budget applies per family (each family runs its own
apply_filters pass) so a fat blob cannot crowd a layer eviction off
a shared global budget.

Lock model is unchanged: index.json.lock LOCK_EX covers the whole
operation. Layer / stack cache writers (oci_unpack,
oci_rebuild_cache) do not take this lock so they may publish new
entries while prune is running; those entries are reachable from
their image's pin or unpacked sysroot, both of which the mark phase
captured, so their diff_id / chain_id is in the keep set even if the
directory did not exist at sweep time. The narrow remaining window
(layer extracted before put_ref lands) matches the C1.3 blob
mid-pull semantic: the operator retries the pull.

oci_store_prune_options_t gains ten new output fields
(kept/pruned/skipped counts + pruned/skipped byte sums for layers
and stacks). The struct grows additively so designated-init callers
do not break. C3.3d does not bump the layers/.schema marker (still
v2): the on-disk layout did not change, only the prune behaviour.

CLI:

src/oci/cli.c::cmd_prune renders new "layers:" and "stacks:" lines
right after the existing blob "reclaimable: / reclaimed:" line, and
new "kept: N layers" / "kept: N stacks" lines after "kept: N blobs".
Both groups render only when their counter is non-zero so an
empty-cache store still produces the legacy two-line output, and
the compat smoke patterns that grep for "reclaimable: N blobs",
"kept: N blobs", and "dry-run" all keep matching. No new CLI flags:
layer + stack sweep is the default behaviour because an operator
running prune wants the store cleaned in full.

Tests:

tests/test-oci-store.c gains 14 new cases (53 -> 67) covering:

  - oci_store_collect_layer_roots: empty store, single-layer pin
    (L0 ChainID identity), three-layer pin (every prefix chain
    present), unpacked-tree contribution, fatal failure on missing
    image-config blob.
  - layer sweep: dangling entry unlinked, entry kept via pin, entry
    kept via unpacked tree, dry-run does not touch disk, recursive
    st_size sum drives pruned_layer_bytes.
  - stack sweep: dangling entry unlinked, every prefix chain kept
    when its image is pinned.
  - filter integration: older-than veto skips fresh layer entry,
    keep-bytes budget evicts oldest layer first.

A new stage_image_v2 helper writes a parseable image-config blob
with caller-supplied rootfs.diff_ids so the mark walker can drill
into rootfs.diff_ids. The existing stage_image helper upgrades from
an opaque config payload to a minimal valid image-config (empty
rootfs.diff_ids, opaque payload folded into an "author" annotation
to keep per-test digest distinctness); every Plan 1 / C1.3 / C1.4
test stays green because layer mark walks an empty diff_id list and
contributes nothing to the keep set.

tests/test-oci-compat.sh gains a C3.3d smoke block (21 -> 23) that
drops a dangling layer dir and a dangling stack dir into the store
after the prune-smoke fixtures, runs prune --commit, and asserts
both the new "layers: reclaimed" / "stacks: reclaimed" lines and the
on-disk teardown.

Test surface verified locally on oci-plan3-layer-snapshot:
test-oci-store 67/67, test-oci-origin 6/6, test-oci-inspect 13/13,
test-oci-unpack 13/13, test-oci-pull 6/6, test-oci-run 6/6,
test-oci-blob-store 14/14, test-oci-layer-apply 12/12,
test-oci-tar 19/19, test-oci-decompress 5/5, test-oci-digest 33/33,
test-oci-dedup-metrics 8/8, test-oci-rebuild-cache 9/9,
test-oci-compat 23/23. make elfuse links and signs.
New public API oci_status_compute(store, opts, out, err) plus
oci_status_free in src/oci/status.{c,h}. The walker iterates pins via
oci_store_list_refs, optionally walks unpacked sysroots under
volume_root/images/, runs three disk sweeps (blobs/, layers/,
layers/stacks/), and computes raw + stack cache populate ratios over
the reachable diff_id / ChainID union sets. Per-pin / per-tree failures
land in a status enum (missing-manifest, corrupt-manifest,
corrupt-config, missing-origin, ...) so a single bad row does not hide
the rest of the snapshot; fatal exits are reserved for store-open and
sweep IO failures.

CLI gains elfuse oci status --store DIR --volume DIR --json
--no-disk-usage. Human render emits PINS / UNPACKED SYSROOTS / STORE
TOTALS sections; --json emits a schemaVersion 1 document keyed for
jq consumers. --no-disk-usage zeroes every byte total while keeping
the entry counts and populate ratios intact, for operators running
status on stores too large to walk recursively.

Implementation duplicates slurp_blob / sum_tree_size /
resolve_config_digest / load_diff_ids / accumulate_chain from
dedup-metrics.c. This is the third caller of the diff_id walker
pattern after dedup-metrics.c and store.c; the rebuild-cache
rm_recursive precedent is the model for deferring a shared
src/oci/image-walk module until a fourth caller appears.

Tests: new test-oci-status binary with 10 cases (empty store, single
pin with size and mtime asserts, missing manifest sentinel, corrupt
manifest sentinel with sibling-still-OK assertion, image-index pin
drills to linux/arm64 sub-manifest, unpacked sysroot row with bytes,
unpacked missing origin sentinel, skip_disk_usage zeroes byte fields,
populate ratio with 5 reachable diff_ids and 3 cached entries, two
images sharing one layer dedupe to 3 in the reachable union).
tests/test-oci-compat.sh gains 3 status smoke assertions (human PINS
+ STORE TOTALS shape, --json schemaVersion 1 with pins / totals
substrings, --no-disk-usage zeroes blob_bytes while
disk_usage_skipped is true). make elfuse links + signs; full OCI
unit suite (store, origin, inspect, unpack, pull, run, blob-store,
layer-apply, tar, decompress, digest, dedup-metrics, rebuild-cache)
stays green.
The default oci pull behaviour stays unchanged: every step re-runs and
the manifest body is always re-fetched. The new --refresh flag wires
an opt-in conditional GET path so a repeat pull of a tag whose pinned
digest is still on disk emits If-None-Match: "<pinned-digest>" on the
top-level manifest request. On 304 the cached manifest body is loaded
from the blob store, the config and layer loops short-circuit via the
existing oci_blob_store_has cache check, and the pin write is skipped.
On 200 with a new digest the pipeline runs in full; the previous
manifest blob stays on disk until prune sweeps it.
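
The conditional-GET decision can be sketched in two small helpers. Both names and the refresh_action_t enum are hypothetical; the header shape (If-None-Match carrying the quoted pinned digest) and the 304 / 200 handling follow the description above, with any other non-2xx status mapped to an error as the fetch layer does with EPROTO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum {
    REFRESH_USE_CACHED,   /* 304: load the manifest from the blob store */
    REFRESH_FULL_PULL,    /* 2xx: run the normal pipeline */
    REFRESH_ERROR         /* anything else */
} refresh_action_t;

/* Builds the conditional header only when both the pin and the manifest
 * blob are available. Returns 1 if a header was written, 0 if the caller
 * should fall through to a normal pull, -1 if the buffer is too small. */
static int format_if_none_match(const char *pinned_digest,
                                int manifest_on_disk,
                                char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
    if (pinned_digest == NULL || !manifest_on_disk)
        return 0;
    int n = snprintf(buf, cap, "If-None-Match: \"%s\"", pinned_digest);
    if (n < 0 || (size_t) n >= cap) {
        buf[0] = '\0';
        return -1;
    }
    return 1;
}

static refresh_action_t classify_manifest_status(int http_status)
{
    if (http_status == 304)
        return REFRESH_USE_CACHED;
    if (http_status >= 200 && http_status < 300)
        return REFRESH_FULL_PULL;
    return REFRESH_ERROR;
}
```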

Plumbing changes:

  - oci_fetch_manifest gains an if_none_match parameter and an etag
    field on the response. 304 is now treated as success (body NULL,
    body_len 0, http_status 304); other non-2xx statuses still raise
    EPROTO.

  - oci_pull_options_t gains a refresh bool. The new prologue in
    oci_pull reads the pin, stats the manifest blob, and forwards the
    quoted digest only when both are available; missing pin or
    missing blob falls through to a normal pull.

  - pull.c gains a small load_manifest_blob helper for the 304 path
    (lift to image-walk pending a fourth caller; the diff_id walker
    duplication watchlist still applies).

  - CLI gains a --refresh flag with a usage line; cli/cmd_pull
    forwards args.refresh into oci_pull_options_t.

  - The TLS mock server captures inbound If-None-Match into the
    request struct, and oci_mock_send_full grows an etag argument so
    handlers can emit ETag headers and respond 304 when the inbound
    digest matches. Existing callers receive NULL.

Test surface:

  - test-oci-pull adds four cases: --refresh produces a single
    network request on an unchanged tag (no blob re-fetch);
    --refresh against a tag whose registry digest has flipped
    re-pulls in full and keeps the old manifest blob on disk for
    prune; --refresh against an empty store falls through with no
    If-None-Match sent; --refresh on a digest-only ref is a noop
    (no conditional header, no pin written).

  - test-oci-compat asserts oci pull --help advertises the new flag.

  - The full OCI regression matrix (fetch / pull / store / origin /
    inspect / unpack / run / blob-store / layer-apply / tar /
    decompress / digest / dedup-metrics / rebuild-cache / status /
    compat) stays green, and make check (aarch64 unit + busybox)
    reports 81 passed.

Scope notes: the sub-manifest fetch after an index drill does not yet
participate in conditional revalidation (no semantic anchor for a
by-digest GET), so a tag whose top-level is an image index still
issues one extra network round trip on 304. Layer / config blob
short-circuits cover the heavy bytes either way. Future work can add
a sub-manifest cache to elide that hop.
Plan 6 C6.1 lands a podman/skopeo-style policy.json reader at
src/oci/policy.{c,h} that fetch.c (C6.2) will consult before applying
CLI overrides. The loader walks the candidate path chain
ELFUSE_POLICY_FILE > $XDG_CONFIG_HOME/elfuse/policy.json (fallback
$HOME/.config/elfuse/policy.json) > $HOME/Library/Application Support/
elfuse/policy.json > built-in default. The supported schema subset is
{default{insecure, ca_bundle}, registries{<host>{insecure, ca_bundle,
auth_file, sigstore{publicKey}}}}. The sigstore.publicKey field parses
into oci_policy_effective_t.sigstore_public_key but is otherwise
unused; it reserves the slot for a future Phase 4+ verify hook so
operators can author the field today without churning the schema later.

Path expansion handles a leading "~/" or pure "~" by joining against
$HOME; "~user/" forms pass through verbatim so the loader does not
pull in getpwnam. ca_bundle existence is checked at load time so a
fetcher consulting the policy never races a missing trust bundle
mid-pull; auth_file existence and 0600 mode-checking land with the
fetcher's credential reader in C6.2 because they share failure-mode
ergonomics. Unknown JSON keys at every level are accepted and
recorded so the C6.3 registries.d overlay and future schema
extensions can roll out without coordinated reader changes.
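
The path-expansion rule above can be sketched as follows. The function name (expand_tilde) and the choice to pass the path through unchanged when no home directory is supplied are assumptions made for illustration, not details taken from policy.c.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: expands a leading "~" or "~/" against home.
 * "~user/..." and every other form are copied verbatim.
 * Returns 0 on success, -1 when out is too small. */
int expand_tilde(const char *path, const char *home, char *out, size_t cap)
{
    int n;
    if (home && path[0] == '~' && (path[1] == '\0' || path[1] == '/'))
        n = snprintf(out, cap, "%s%s", home, path + 1);
    else
        n = snprintf(out, cap, "%s", path);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Because "~user/" is never resolved, the sketch needs no getpwnam call, which matches the stated intent.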

Default oci pull behaviour stays byte-identical: no caller consumes
oci_policy_t yet (C6.2 plumbs it through cli.c + fetch.c). The
translation unit is added to SRCS to keep the warnings posture in
sync with the rest of OCI.

Tests: new tests/test-oci-policy.c with 16 sub-tests:
  - path chain (empty, override wins, override miss is hard error,
    override empty string falls through, xdg fallback, home/.config
    fallback, library fallback)
  - full schema round trip (ghcr.io + 127.0.0.1:5000 + quay.io
    sigstore + unknown-host falls back to default)
  - "~/" path expansion
  - unknown top-level + entry keys tolerated
  - four invalid shapes hard-error with diagnostic
  - ca_bundle missing hard-error
  - ca_bundle:null inherits default, missing entry inherits default

make elfuse links + signs; the loader is unused but compiles clean.
C6.2 wires the C6.1 policy loader into the registry fetcher and into
cmd_pull. Pull now reads a policy.json from the documented config-path
chain (ELFUSE_POLICY_FILE / XDG / HOME / Library) and merges the
per-registry insecure / ca_bundle / auth_file settings with the CLI
flags. CLI wins: an explicit -u, --insecure, or --insecure-ca shadows
the matching policy field for the same registry.

The fetcher gains a const oci_policy_t* (caller-owned lifetime) and a
new file-local effective_opts_t. Each manifest / blob / token request
calls resolve_effective(f, ref, &eff) which performs an
oci_policy_lookup on ref->registry, merges with the CLI defaults, and
loads any policy auth_file via a new oci_policy_load_auth helper. The
auth file must be {"username","password"} JSON with mode (st_mode &
077) == 0; group- or other-readable files fail with EPERM. The loopback
gate around allow_insecure now reads the effective bit, so a policy
insecure=true on a non-loopback host fails the same way a CLI
--insecure on a non-loopback host does.
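
The auth-file mode gate can be illustrated with plain stat(2). This is a sketch under the assumption of a hypothetical helper name (check_auth_file_mode); only the (st_mode & 077) == 0 rule and the EPERM result come from the description above.

```c
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 0 when no group/other permission bits are
 * set on path, -EPERM when any are, or another negative errno from stat. */
int check_auth_file_mode(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -errno;
    if ((st.st_mode & 077) != 0)
        return -EPERM; /* e.g. mode 0644 is rejected, mode 0600 passes */
    return 0;
}
```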

cmd_pull loads the policy before constructing the fetcher and prints
one stderr warning per CLI flag that overrides a non-default policy
value, gated on a non-empty policy source path and silenced by
--quiet. The Pull usage block documents the lookup chain.

Default oci pull behaviour without a policy file stays byte-identical:
oci_policy_load with no candidate file returns the built-in zero
policy, resolve_effective treats every policy field as NULL/false, and
the fetcher behaves exactly as it did before.

Tests:
  - test-oci-policy +4 (oci_policy_load_auth: happy path, mode 0644
    rejected, missing username rejected, malformed JSON rejected);
    16 -> 20 sub-tests green.
  - test-oci-pull +3 (policy insecure=true for loopback host pulls
    without CLI --insecure; policy auth_file with mode 0644 aborts
    the pull with a mode diagnostic; CLI --insecure overrides policy
    insecure=false); 10 -> 13 tests green.
  - test-oci-compat.sh +1 (oci pull --help mentions the Policy
    lookup block).
  - test-oci-fetch 15/15 green after the apply_security_opts /
    check_insecure_policy refactor.
  - test-oci-store / origin / inspect / unpack / run / blob-store /
    layer-apply / tar / decompress / digest / dedup-metrics /
    rebuild-cache / status all green; full make check 0 FAIL.

Not covered by automated tests: cmd_pull's warn output itself
(stderr-from-binary integration would need a mock + elfuse launcher
harness that does not exist today). The fetcher-side override
behaviour is exercised by the three new test-oci-pull cases; the
warn print remains a manual-verification surface for now.
Plan 6 C6.3. Each per-host JSON snippet under registries.d/ next to the
base policy file field-merges into the matching entry (or grafts a new
one if absent). The overlay path reuses the policy_entry_t shape minus
the registries-wrapper; the filename minus its .json suffix is the
target host. Files are processed in lexicographic order for determinism.

Implementation:

- policy_entry_t gains has_ca_bundle / has_auth_file /
  has_sigstore_public_key alongside the existing has_insecure so the
  field-level merge can distinguish "field declared" from "field
  omitted" for every overlayable slot. Lookup keeps reading the NULL
  pointer the way C6.1 wrote it; the flags are merge-only state.
- parse_entry_block is split: parse_entry_fields is the shared
  per-field walker, taking a src_path that is NULL for base-policy
  entries (host-scoped diagnostics) and the overlay file path for
  registries.d entries (file-scoped diagnostics). field_err picks the
  right format. parse_sigstore_block becomes parse_sigstore_fields
  along the same lines.
- load_overlay_dir scans <base-policy-parent>/registries.d/ for
  *.json. ENOENT on the directory is silent (overlay is optional);
  any other opendir errno (ENOTDIR, EACCES, ...) is a hard error so an
  operator pointing at an unreadable tree gets told. Non-regular
  candidates are skipped defensively; non-*.json filenames are
  ignored (README, .DS_Store, ...).
- parse_overlay_file slurps the file, requires an object root, walks
  the same fields as a base entry into a scratch policy_entry_t, then
  merge_overlay_into_entry transfers the declared slots into the
  target. ca_bundle stat-check runs against the overlay-declared path
  with a file-scoped diagnostic. unknown_keys append (no dedup).
- sigstore.publicKey now parses identically in base and overlay, and
  surfaces through oci_policy_lookup. fetch.c still does not consume
  it; the slot stays reserved for the Phase 4+ sigstore verify hook.
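
A sketch of the field-level merge that the has_* flags enable. The struct here (entry_t) is a simplified stand-in for policy_entry_t with borrowed string pointers, and merge_overlay is a hypothetical name; a real loader would own its strings and also carry the sigstore and unknown_keys slots.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for policy_entry_t (assumption for illustration). */
typedef struct {
    bool has_insecure;
    bool insecure;
    bool has_ca_bundle;
    const char *ca_bundle;
    bool has_auth_file;
    const char *auth_file;
} entry_t;

/* Copies only the slots the overlay declared; omitted slots keep the
 * value the base policy wrote. */
void merge_overlay(entry_t *dst, const entry_t *ov)
{
    if (ov->has_insecure) {
        dst->has_insecure = true;
        dst->insecure = ov->insecure;
    }
    if (ov->has_ca_bundle) {
        dst->has_ca_bundle = true;
        dst->ca_bundle = ov->ca_bundle;
    }
    if (ov->has_auth_file) {
        dst->has_auth_file = true;
        dst->auth_file = ov->auth_file;
    }
}
```

Without the per-field flags, an overlay that omits ca_bundle would be indistinguishable from one that clears it, which is the case the merge has to tell apart.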

No public API change: oci_policy_load / _free / _lookup / _source /
_load_auth signatures and oci_policy_effective_t shape are
byte-identical to C6.2. fetch.c, cli.c, and pull.c are untouched.

Tests (tests/test-oci-policy.c, 20 -> 28 green):

- overlay_field_level_merge:        base ca_bundle + overlay auth_file
                                    -> both present after lookup
- overlay_adds_new_host:            overlay introduces a host the base
                                    policy never declared
- overlay_overrides_base_field:     overlay insecure=true beats base
                                    insecure=false
- overlay_dir_missing_silent:       no registries.d/ next to base
                                    policy is a successful load
- overlay_malformed_json_hard_error: bad overlay JSON propagates an
                                     error with overlay path + "JSON"
- overlay_ignores_non_json_files:   README.md sibling does not derail
                                    the scan
- overlay_sigstore_public_key_surfaced: overlay sigstore.publicKey is
                                        readable via
                                        oci_policy_effective_t
- overlay_multiple_hosts:           two overlay files for two hosts
                                    each surface independently

Default oci pull behaviour with no policy file and no registries.d/ is
byte-identical to C6.2. test-oci-fetch (15/15), test-oci-pull (13/13),
and test-oci-compat (28/28) stay green; make elfuse LD + SIGN clean.
Implements Plan 5 C5.1. oci_fetch_blob_batch dispatches a descriptor
array through libcurl's multi interface so a pull's config and layer
blobs flow over the network in parallel instead of one after another.
The concurrency cap reads OCI_FETCH_MAX_CONCURRENT (default 4, clamped
to [1, 16]). A single effective_opts_t resolves the CLI + policy merge
once at batch entry and every easy handle borrows it. A first-round 401
with a Bearer challenge triggers one serial token refresh, and the
retry round restarts with the refreshed token. Any single-blob failure
aborts every in-flight writer atomically before any commit lands so a
partial pull never leaves a visible blob behind.
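
The cap parsing can be sketched as below. The function name (fetch_max_concurrent) is hypothetical, and falling back to the default on unparsable input is an assumption; the description above only fixes the default of 4 and the [1, 16] clamp.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: turns the raw environment value into a cap.
 * NULL or empty yields the default 4; parsed values clamp to [1, 16]. */
int fetch_max_concurrent(const char *value)
{
    if (!value || !*value)
        return 4;
    char *end = NULL;
    errno = 0;
    long n = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0')
        return 4; /* assumption: garbage falls back to the default */
    if (n < 1)
        return 1;
    if (n > 16)
        return 16;
    return (int)n;
}
```

A caller would pass getenv("OCI_FETCH_MAX_CONCURRENT") once at batch entry.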

oci_fetch_blob now forwards onto oci_fetch_blob_batch as a one-element
wrapper. pull.c collapses its serial config + layers loop into a
single batch call and still prints per-blob cached vs downloaded lines
because the store-has lookup is captured before the batch hides the
transfer.

blob-store gains oci_blob_writer_begin_named, a writer entry that
stages partials at tmp/blob-<hex prefix 16>-XXXXXX so the C5.2 resume
sweep can find them by digest. The previous tmp/blob-<pid>-<seq>
naming stays available via the unchanged oci_blob_writer_begin so
existing callers keep working.
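
For illustration, the digest-derived staging name could be built as a mkstemp(3) template like this. The helper name (partial_template) and the acceptance of both "sha256:<hex>" and bare hex input are assumptions; only the tmp/blob-<hex prefix 16>-XXXXXX shape comes from the text.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical helper: writes "<tmp_dir>/blob-<first 16 hex>-XXXXXX".
 * Returns 0 on success, -1 on a short digest or a too-small buffer. */
int partial_template(const char *tmp_dir, const char *digest,
                     char *out, size_t cap)
{
    const char *hex = strchr(digest, ':');
    hex = hex ? hex + 1 : digest; /* skip the "sha256:" algorithm prefix */
    if (strlen(hex) < 16)
        return -1;
    int n = snprintf(out, cap, "%s/blob-%.16s-XXXXXX", tmp_dir, hex);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Since the prefix is a pure function of the digest, a later scan for blob-<hex16>-* can find a partial without any side index.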

The test mock now spawns one worker thread per accepted connection
(detached) and exposes a per-connection response delay and an
in-flight watermark. Without the worker change, parallel batch tests
would see only one transfer at a time and the wall-clock speedup
assertion would be vacuous. Five new fetch cases cover the new path:
parallel wall-time beats serial by at least 1.5x (8 blobs, 150 ms
mock delay), any blob failure aborts the whole batch with no tmp
leak, duplicate digests fetch once, a single token refresh covers
every first-round 401, and OCI_FETCH_MAX_CONCURRENT=2 caps the
in-flight count.
Each oci_fetch_blob_batch entry sweeps tmp/ for partials older than
seven days, then per-blob calls oci_blob_writer_resume_named, which
scandirs tmp/ for blob-<hex16>-* matches, picks the largest survivor,
reopens it O_RDWR, replays its bytes through the digester, and seeks
to end-of-file. The caller sets CURLOPT_RANGE = bytes=<offset>- on the
easy handle and seeds bctx.bytes_seen with the partial size so the
streaming overflow gate measures total-blob progress.

Servers that ignore the Range and reply 200, or reply 416 Range Not
Satisfiable, trip a per-handle BH_NEEDS_RESTART state. The body
callback peeks CURLINFO_RESPONSE_CODE on its first invocation while
the request carried a Range header and aborts early when the status
is not 206; a 416 with no body falls through to the score path's
status-only restart trigger. The outer multi loop processes restarts
with a new batch_reset_handle_fresh helper (the renamed retry path,
now shared between token-refresh and Range-restart). After the reset
resume_offset is zero, so a second 200/416 cannot pick the restart
branch again and self-caps at one attempt per handle.

resume_named pre-rejects partials whose size is >= the descriptor's
declared size: a partial at or past expected size would otherwise tip
the streaming overflow gate into a non-recoverable failure instead of
a clean restart. Reopen / re-hash failures unlink the partial and
fall back to oci_blob_writer_begin_named so default behaviour stays
byte-identical to C5.1 when no partial is present.
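
The size pre-check and the resulting Range value can be sketched as a small pure function. The name (plan_resume) is hypothetical, and treating a zero-byte partial as "nothing to resume" is an assumption not stated above.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 and writes the HTTP Range header value
 * ("bytes=<offset>-") when the partial is usable, 0 when the transfer
 * should start fresh. Note that libcurl's CURLOPT_RANGE option itself
 * takes only the "<offset>-" part and adds the "bytes=" unit. */
int plan_resume(uint64_t partial_size, uint64_t declared_size,
                char *range, size_t cap)
{
    if (partial_size == 0 || partial_size >= declared_size)
        return 0; /* empty, complete, or oversized partial: restart */
    int n = snprintf(range, cap, "bytes=%llu-",
                     (unsigned long long)partial_size);
    return (n > 0 && (size_t)n < cap) ? 1 : 0;
}
```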

oci_blob_store_sweep_partials is a new public entry that unlinks
blob-* files in tmp/ older than ttl_secs. The batch invokes it once
per call with seven days. The wide blob-* prefix is safe because
the blob store owns tmp/ exclusively.

oci_mock_send_full gains a content_range parameter so handlers can
issue 206 with the correct Content-Range header. The mock Range
parser that landed in C5.1 is now consumed by a batch_range_mode_t
flag the h_batch handler reads (honour / ignore / 416).

Tests: 4 new sub-cases in test-oci-fetch (resume 206 happy path,
server-ignores-Range restart, 416 restart, 7-day stale sweep) and
3 in test-oci-blob-store (resume reopens partial and commits, resume
falls back to fresh writer when no partial, sweep TTL unlinks aged
files only). make check 0 FAIL: 639 OK across all suites.
batch_handle_t gains progress_cb + progress_user borrowed from the
batch entry's parameters. batch_configure_easy wires
CURLOPT_XFERINFOFUNCTION (CURLOPT_NOPROGRESS=0) only when the caller
supplies a callback, so the C5.1 fast path (no progress) keeps zero
xferinfo overhead. batch_xferinfo_cb forwards into the user callback
with bytes_dl adjusted to (dlnow + resume_offset) so a resumed
transfer's progress pairs correctly with desc->size as the total.
dltotal is ignored because libcurl reports the remaining-bytes count
when a Range header is in flight, while desc->size is the
authoritative whole-blob total. batch_score_done fires one explicit
final invocation at the BH_DONE_OK boundary so the renderer always
sees a bytes_dl == bytes_total event regardless of libcurl's
xferinfo pacing.

pull.c grows a file-local renderer (pull_progress_t) that splits
descriptors into cached (printed immediately, byte-identical to the
C5.1 / C5.2 wording) and to-be-downloaded (rendered through the
callback). TTY mode prints n placeholder lines and uses CSI nF +
CSI 2K to redraw the zone in place on every xferinfo tick. Non-TTY
mode defers per-blob output until the final bytes_dl == bytes_total
event, preserving the line-per-completion log shape that scripted
consumers grep against. isatty(fileno(progress)) is the detection
gate; --quiet keeps fp == NULL and short-circuits all formatter
output. The cached-vs-downloaded annotation stays byte-identical to
the pre-C5.3 output.

Tests: test-oci-fetch grows one case verifying the cb contract --
every committed blob produces at least one event with bytes_dl ==
bytes_total == desc->size, and every event's bytes_total matches the
descriptor size. test-oci-pull grows one case running oci_pull with
opts.progress = tmpfile() so the buffer is captured non-TTY; the
assertion tallies one downloaded line per blob (config + N layers),
two manifest lines, zero cached lines, and asserts the buffer is
free of any CSI escape sequence. make check 0 FAIL: 641 OK across
all suites.
Phase 4 F4.1 acceptance asks for two distinct properties from the
APFS clonefile-based per-run rootfs: a write inside the clone must
succeed without backing into the source, and a delete inside the
clone must not unlink the corresponding source entry. The existing
test_clone_cow covers only the mutate-existing path; this commit
adds:

  test_clone_new_file_isolated     - touch a brand-new file in the
                                     clone; source dir stays without
                                     that path (the literal "touch
                                     /hello" wording from issue sysprog21#31
                                     Phase 4 acceptance 1)

  test_clone_unlink_preserves_src  - delete an existing file in the
                                     clone; the same path in the
                                     source still resolves

Both reuse the existing mkdtemp / write_file / file_has scaffolding
and inherit the ENOTSUP skip so non-APFS scratch volumes report
SKIP rather than fail. No src/ change; F4.1 production code (the
APFS clonefile call site in src/oci/clone-rootfs.c) was committed
in f317c81 during Phase 2.
Phase 4 F4.2 (/etc/resolv.conf) and F4.3 (/etc/hosts + /etc/hostname)
ask elfuse to synthesise host-truth files into the per-run rootfs so
guest libc lookups (getaddrinfo, gethostname, /etc/hosts walks) see
values matching the macOS host rather than the image's containerd
defaults.

New src/oci/runtime-files.{c,h} exposes a single
oci_runtime_files_inject(run_dir, err) entry point that creates
<run_dir>/etc/ at mode 0755 if missing and writes three files,
unlinking any pre-existing symlink first (image distros often ship
/etc/resolv.conf as a symlink to /run/systemd/resolve/stub-resolv.conf
that would otherwise dangle inside the guest):

  /etc/resolv.conf - "nameserver <ip>" lines extracted from
                     scutil --dns stdout via a posix_spawn + pipe
                     reader; falls back to 8.8.8.8 / 1.1.1.1 when
                     scutil fails or reports zero configured
                     resolvers
  /etc/hosts       - fixed five-line block: 127.0.0.1 localhost,
                     ::1 with the ip6-loopback aliases, the two
                     link-local multicast names, and 127.0.0.1
                     host.elfuse.internal as the documented host-
                     loopback hook. The image's own /etc/hosts is
                     overwritten unconditionally; no merge.
  /etc/hostname    - the literal string "elfuse\n" matching the
                     container=elfuse env injection runspec already
                     sets

src/oci/run.c gains a step-3.5 call to oci_runtime_files_inject
between oci_clone_rootfs and the manifest parse; failures abort the
run before launch with the inject diagnostic surfaced through *err
and the clone-rootfs torn down by the existing cleanup epilogue.

Six unit tests in tests/test-oci-runtime-files.c cover the policy:
fresh /etc creation, symlink overwrite, regular-file overwrite,
literal hostname content, the required /etc/hosts entries, and
/etc/resolv.conf containing a nameserver line regardless of whether
scutil succeeded or the fallback fired.
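
A sketch of the resolv.conf rendering step, working on already-captured scutil --dns text rather than spawning the tool. It assumes scutil prints resolvers as "nameserver[<n>] : <ip>" lines; the function name (render_resolv_conf) and the de-duplication of repeated resolvers are illustrative assumptions, not taken from runtime-files.c.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical helper: writes "nameserver <ip>" lines into out and returns
 * how many were written (fallback lines included), or -1 if out is too
 * small. */
int render_resolv_conf(const char *scutil_out, char *out, size_t cap)
{
    size_t used = 0;
    int count = 0;
    if (cap == 0)
        return -1;
    out[0] = '\0';
    const char *p = scutil_out ? scutil_out : "";
    while (*p) {
        size_t len = strcspn(p, "\n");
        char line[256], ip[64], entry[96];
        if (len < sizeof line) {
            memcpy(line, p, len);
            line[len] = '\0';
            if (sscanf(line, " nameserver[%*d] : %63s", ip) == 1) {
                int n = snprintf(entry, sizeof entry, "nameserver %s\n", ip);
                if (!strstr(out, entry)) { /* assumption: skip repeats */
                    if ((size_t)n >= cap - used)
                        return -1;
                    memcpy(out + used, entry, (size_t)n + 1);
                    used += (size_t)n;
                    count++;
                }
            }
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    if (count == 0) {
        /* Fallback named in the commit message: 8.8.8.8 / 1.1.1.1. */
        int n = snprintf(out, cap, "nameserver 8.8.8.8\nnameserver 1.1.1.1\n");
        if (n < 0 || (size_t)n >= cap)
            return -1;
        count = 2;
    }
    return count;
}
```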
Linux /dev/full reads return a NUL stream and writes always fail with
ENOSPC.  Container runtimes synthesise /dev/console from the controlling
tty because the host /dev/console is reserved for kernel use.  Neither
node can come from an OCI layer (layer-apply rejects char device tar
entries with ENOTSUP), so both are added to the procemu runtime
intercept path.

/dev/full opens host /dev/zero so reads naturally return zeros and lseek
works, then tags the FD via proc_path so proc_intercept_write returns
ENOSPC for any non-zero-length write while preserving the POSIX rule
that a zero-length write succeeds.  /dev/console maps to host /dev/tty,
matching the
runc/containerd controlling-tty redirect.
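
The write rule for a /dev/full-tagged FD reduces to a one-line decision, sketched here with a hypothetical function name (dev_full_write) and the negative-errno return style assumed for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: emulated write(2) result for an FD tagged as
 * /dev/full. Zero-length writes succeed with 0; anything else reports
 * ENOSPC as a negative errno. */
ssize_t dev_full_write(size_t len)
{
    return len == 0 ? 0 : -(ssize_t)ENOSPC;
}
```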

Extend tests/test-proc.c with /dev/full read/write/writev/lseek cases
and a best-effort /dev/console open case that tolerates non-tty CI
environments.
Adds six new synthetic /proc files for container-style detection and
sysinfo introspection:

  /proc/self/cgroup          - cgroup v2 "0::/" (not containerized)
  /proc/self/comm            - basename of the loaded ELF + LF
  /proc/self/statm           - seven page-count fields, source same as
                               /proc/self/stat
  /proc/sys/kernel/ostype    - literal "Linux"
  /proc/sys/kernel/osrelease - mirrors cached uname release
  /proc/sys/kernel/hostname  - mirrors cached uname nodename

systemd-detect-virt, runc-internal, and podman read /proc/self/cgroup
to decide whether they are running inside a container; the canonical
v2 "0::/" form tells them elfuse is a plain host environment. The
sysctl files keep procfs and uname(2) agreed on so init scripts that
cross-check do not abort.

Adds sys_uname_cached() so procemu can read the static uname struct
without duplicating literal strings.

Eight new procfs cases in tests/test-procfs.c cover the new files.

Follow-up not in scope: /proc/cpuinfo currently reports host
_SC_NPROCESSORS_ONLN. Wiring it to a guest vCPU count would need a
new guest_t accessor; defer until that API has a second caller.
issue sysprog21#31 Phase 4 F4.7. The image-config User field accepts seven shapes
per OCI image-spec: empty, uid, uid:gid, name, name:group, uid:group,
name:gid. The runspec resolver previously parsed only the two numeric
shapes and rejected symbolic forms with a Phase 4 pointer; container
detection tooling and most base images use nobody / www-data / postgres
style strings, so a guest run falling through to host uid was the
practical outcome.

oci_user_lookup() parses passwd-shaped and group-shaped tokens against
the per-run clone-rootfs (rootfs/etc/passwd, rootfs/etc/group),
preferring numeric interpretation when the token is all-digit (matching
runc). A symbolic User with a rootfs missing /etc/passwd fails closed
with EINVAL rather than silently degrading to root, so a misconfigured
image surfaces at launch instead of at first guest decision.

The lookup helper lives in its own translation unit so the runspec
module stays pure-data: oci_runspec_build only touches the filesystem
when the caller passes a rootfs through flags->rootfs_for_nss. CLI
--user is extended to accept symbolic forms through the same path.

Coverage: tests/test-oci-user.c (12 cases) drives the parser and
filesystem path against scratch rootfses; tests/test-oci-runspec.c
adds three runspec-seam cases, rewrites the symbolic-rejected case to
assert the no-rootfs branch, and converts the legacy non-numeric
--user case into a no-rootfs diagnostic check.
Replace Phase 4 -> later pointers in docs/usage.md (Scope guardrails
and User and WorkingDir) with the surface that actually landed in
C4.1..C4.5: per-run writable rootfs via APFS clonefile, /etc/
{resolv.conf,hosts,hostname} injection, /dev/{full,console} plus the
existing null/zero/random/urandom/tty set, /proc/self/{cgroup,comm,
statm} and /proc/sys/kernel/{ostype,osrelease,hostname}, and the
seven-shape User resolver against rootfs /etc/passwd + /etc/group.

Add a Libc-adjacent compatibility section that fixes elfuse's
position on the six host-fs-adjacent payloads the spec leaves to the
image: nsswitch.conf (only files and dns backends work), NSS shared
objects (no host dlopen of guest .so), tzdata (image carries; no
format conversion), locale-archive (image carries; C fallback when
absent), gconv-modules (image carries; iconv yields EILSEQ when
absent), and ld.so.cache (dynamic linker handles its own). Includes
a three-row symptom matrix covering getaddrinfo, date / TZ-dependent
output, and locale-aware sort / printf.
Docker.io multi-arch tags such as alpine:3 pin the ref at the image
index digest, not at the leaf manifest digest, because the index is
the natural refresh anchor (a new arm64 manifest only changes the
index entry, not the index digest expectation in client tooling).
oci_pull already preserves this shape: pin -> index blob.

oci_run previously fed the pinned blob straight into oci_manifest_parse
and failed with "manifest parse failed: manifest config descriptor
missing" because an index has "manifests"[] instead of "config" plus
"layers". oci_inspect already does the classify-then-walk pattern;
oci_run now mirrors it through a new resolve_image_manifest() static
helper. The helper:

  1. parses the pinned digest string
  2. loads the blob
  3. tries oci_index_parse first
  4. on index, picks linux/arm64 via oci_index_pick_linux_arm64,
     loads the sub-manifest blob, swaps the body
  5. parses the final body with oci_manifest_parse

Step 4 in oci_run shrinks to a single call into the helper plus the
caller-side cleanup that already existed.

A test-only hook oci_run_resolve_image_manifest_for_testing exposes the
helper so tests/test-oci-run.c can drive multi-arch fixtures without
needing a case-sensitive APFS sysroot volume. Three new cases cover the
shapes:

  - leaf-pinned: ref pinned at the manifest digest (fixture-builder
    path, tests/test-oci-compat.sh path); parses as a leaf without
    index drilling
  - index-walked: ref pinned at a three-platform index whose arm64
    leaf the helper must drill into; the helper returns the leaf-
    manifest body, not the index body
  - index without arm64: helper rejects with ENOENT and an error
    message that mentions linux/arm64

End-to-end sanity: build/elfuse oci run alpine:3 /bin/busybox echo
"hi from alpine" now prints "hi from alpine"; previously it failed
at manifest parse before reaching unpack.
The placeholder skip block in tests/test-oci-compat.sh always said
the alpine:3 online harness "lands in a follow-up patch". Land it
now, on the back of the index-walk fix (76303c2): when OCI_FETCH_ONLINE
is set, the suite pulls docker.io/library/alpine:3 into a scratch
store under SCRATCH, then runs alpine:3 against /bin/busybox echo
with a fixed sentinel string. Two assertions:

  - oci pull alpine:3 succeeds (cycles the registry HTTPS client and
    the index-aware pin storage that 5b10f432 already records)
  - oci run alpine:3 returns 0 and stdout matches the sentinel line
    verbatim; if oci_run regresses to the pre-fix behavior, the log
    substring "manifest config descriptor missing" triggers a
    targeted failure message instead of the generic rc check

This is the regression anchor for the multi-arch index-walk path
that 76303c2 introduced: anything that breaks the docker.io image-
index unwrap surface trips this case the moment a developer flips
OCI_FETCH_ONLINE=1. The scratch store keeps the test isolated from
the user's default store; the default sparsebundle volume is reused
for unpack since the on-volume image content is content-addressed
and idempotent.

OCI_FETCH_ONLINE remains gated, so make check stays offline-only.
Local verification: OCI_FETCH_ONLINE=1 bash tests/test-oci-compat.sh
reports 30/30, including both new online cases.
Every container layer tar carries a root-directory entry encoded
as "./". The DIR-type trailing-slash strip in src/oci/tar.c
collapses it to ".", which oci_path_join_safe then explicitly
rejected as "empty path" (EINVAL). Cold unpack of any real-world
image - including busybox:latest as the smallest reproducer - died
on the first tar entry before producing a single file on disk.

Skip the entry in layer_apply_impl after the leading-slash strip:
the unpack root is created by the assembler before this loop runs,
so the root entry has no work to drive. Empty paths are skipped
the same way for archives that record a zero-length root name.
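
The skip condition can be expressed as a small predicate. The name (is_root_entry) is hypothetical and the sketch folds the leading-slash and trailing-slash strips into one function, whereas the PR splits them between tar.c and layer_apply_impl.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when a tar entry name denotes the archive
 * root, i.e. it normalises to "." or to the empty string. */
bool is_root_entry(const char *name)
{
    while (*name == '/')
        name++; /* leading-slash strip */
    size_t len = strlen(name);
    while (len > 0 && name[len - 1] == '/')
        len--; /* trailing-slash strip for directory entries */
    return len == 0 || (len == 1 && name[0] == '.');
}
```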

Wall 1 of the cold-unpack repair sweep. Walls 2 (EXDEV) and 3 (PAX)
follow in the next two commits and become visible only after this
patch clears the path-join error.
The default elfuse layout puts the store on the root APFS volume
and the stage on a hdiutil-mounted sparsebundle, so the three
clonefile(2) call sites in oci/unpack.c returned EXDEV on every
fresh unpack and had no fallback path. Cold unpack of busybox
(and any other image) failed with "assemble: clonefile EXDEV (raw
cache and stage must share an APFS volume)" before any layer file
landed on disk.

Switch all three sites (per-file raw assembly, stack restore, stack
snapshot) to copyfile with COPYFILE_CLONE. The clone flag keeps the
APFS COW path on same-volume copies and falls back to a real byte
copy across volumes, so the default layout works without changing
where the store or sparsebundle live. The dir-tree sites also pass
COPYFILE_RECURSIVE and COPYFILE_NOFOLLOW so symlinks are preserved.

Wall 2 of the cold-unpack repair sweep. Walls 1 (root tar entry)
and 3 (PAX) cover the other two failure modes the cold path hits.
Real-world container layers (anything glibc-shaped with long
pathnames or filenames over 100 bytes) emit POSIX.1-2001 PAX
extended headers with typeflag 'x' to carry path / linkpath /
size / mtime keys. The tar parser previously refused the typeflag
outright with EPROTONOSUPPORT, so python:alpine and every other
image that carries even one PAX-encoded path failed cold unpack
with "tar PAX extensions not supported".

Add consume_pax_record alongside the GNU 'L' / 'K' long-name path.
Per-file 'x' records have their payload parsed for "<len> key=val\n"
tuples; `path` and `linkpath` keys promote into the same
pending_long_name and pending_long_link buffers the GNU path
populates, so downstream code stays unaware of which long-name
format produced the override. Other keys (size, mtime, atime, uid,
gid, xattrs) are silently ignored - the unpack pipeline does not
track them.

Global 'g' records establish defaults for all subsequent entries.
Container builders use 'g' for mtime / uid defaults that this
project does not consume, so the implementation skips the payload
byte-for-byte (advancing past its full length) without parsing it.
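
A sketch of parsing one "<len> key=val\n" tuple from an 'x' payload. The struct and function names (pax_kv_t, pax_next_record) are hypothetical; the record layout follows the pax extended-header format, where <len> counts the whole record including its own digits, the space, and the trailing newline.

```c
#include <string.h>
#include <stddef.h>

/* Borrowed views into the payload buffer (not NUL-terminated). */
typedef struct {
    const char *key;
    size_t key_len;
    const char *val;
    size_t val_len;
} pax_kv_t;

/* Hypothetical helper: parses the record at buf and returns its length
 * (the number of bytes to advance), or 0 on malformed input. */
size_t pax_next_record(const char *buf, size_t avail, pax_kv_t *kv)
{
    size_t i = 0, len = 0;
    while (i < avail && buf[i] >= '0' && buf[i] <= '9') {
        len = len * 10 + (size_t)(buf[i] - '0');
        i++;
        if (len > avail)
            return 0; /* declared length runs past the payload */
    }
    if (i == 0 || i >= avail || buf[i] != ' ')
        return 0;
    if (len < i + 3 || buf[len - 1] != '\n')
        return 0;
    const char *key = buf + i + 1;
    const char *eq = memchr(key, '=', len - i - 2);
    if (!eq)
        return 0;
    kv->key = key;
    kv->key_len = (size_t)(eq - key);
    kv->val = eq + 1;
    kv->val_len = (size_t)((buf + len - 1) - (eq + 1));
    return len;
}
```

A caller would loop over the payload, compare the key against "path" and "linkpath", and ignore every other key.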

tests/test-oci-tar.c retires test_pax_rejected (the old contract
was rejection with EPROTONOSUPPORT) and gains two replacements:
test_pax_extended_path verifies that a per-file 'x' record's path
key latches onto the next entry, and test_pax_global_skipped
verifies that a 'g' record is consumed silently without disturbing
the entry that follows.

Wall 3 of the cold-unpack repair sweep. With Walls 1 and 2 already
landed, this completes the path from registry pull to a running
guest binary for production-shape images.
OCI_COMPAT_TEST=1 has been a SKIP slot since Phase 3, carrying the
note that "sparsebundle volume provisioning and the three Phase 3
plan fixtures land in a follow-up compat-matrix patch". This commit
lands the first leg: a scratch case-sensitive APFS sparsebundle
that the heavy block creates on demand and detaches in the EXIT
trap, plus the first of the three fixtures - alpine-shaped, a
single-layer image with /bin/busybox + /etc/os-release.

The scratch volume keeps the heavy E2E from polluting
$HOME/Library/Application Support/elfuse/sysroots.sparsebundle
on a developer laptop, which is the practical reason the gate
existed in the first place. hdiutil create + attach goes through
the same case-sensitive APFS path that oci_volume_ensure
validates, so --volume points at the fresh mountpoint with no
extra plumbing.

Fixture A drives busybox as both the entrypoint and the applet
dispatcher: oci run ... echo "elfuse-alpine-shaped-ok" produces
the canonical stdout line via the echo applet. busybox is the
static aarch64-linux-musl binary at
externals/test-fixtures/aarch64-musl/staticbin/bin/busybox; the
fixture skips with a fetch-fixtures.sh pointer when missing so
clean clones still pass make check. Default mode (no
OCI_COMPAT_TEST=1) keeps the original SKIP behavior, so
$HOME never sees an hdiutil mount.

Validation:
  - OCI_COMPAT_TEST=1 bash tests/test-oci-compat.sh: 31/31
    (28 default + 3 heavy/A)
  - default bash tests/test-oci-compat.sh: 28/28
  - 25 OCI unit suites (test-oci-*): all green
Fixture B drives the apply_hardlink path that none of the
mainstream registry images exercise at any meaningful scale
(debian:bookworm-slim ships 2 hardlinks, python:3.12 ships 1,
ruby:alpine ships 0 in its core layer). The layer tar carries
/bin/busybox plus /bin/echo and /bin/cat as on-disk hardlinks;
BSD tar detects the shared inode and emits two typeflag '1'
records that layer-apply must turn back into real hardlinks on
the unpacked tree.

A pre-flight check rejects the case where the build host's tar
silently turns hardlinks into duplicates, so the fixture cannot
quietly degrade into a busybox-only smoke test if a later host swap
changes that behavior. BSD tar tags hardlink rows with a leading
'h' in the mode column ("hrwxr-xr-x ... link to ..."), distinct
from regular '-' and symlink 'l' rows.

The entrypoint is /bin/echo, which is the hardlink itself, so
busybox's argv[0] applet dispatch picks the echo applet and the
CLI tail becomes its argument verbatim. The expected stdout is
the canonical "elfuse-busybox-shaped-ok" line. A regression
where unpack drops the hardlink (or links to the wrong target)
shows up as either a launch failure or as the busybox usage
banner instead of the echoed line.
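
A minimal sketch of the property the unpacked tree has to satisfy: two paths are hardlinks of one file only if they share a device and inode number. The helper name `same_inode` is hypothetical and is not taken from this PR; it only illustrates the check a disk-state assertion could make.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>

/* Returns true when both paths name the same inode on the same device,
 * i.e. they are hardlinks of one file rather than independent copies.
 * lstat() is used so a symlink pointing at the target does not count.
 * Hypothetical helper, not part of the elfuse sources. */
bool same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (lstat(a, &sa) != 0 || lstat(b, &sb) != 0)
        return false;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Applied to the fixture, `same_inode` on the unpacked `bin/busybox` and `bin/echo` should hold; a tar that silently duplicated the file would make it fail even though both binaries still run.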

Validation:
  - OCI_COMPAT_TEST=1 bash tests/test-oci-compat.sh: 34/34
    (31 prior heavy/A baseline + 3 heavy/B)
  - default bash tests/test-oci-compat.sh: 28/28 unchanged

Fixture C closes the third leg of the heavy compat matrix. Layer
1 stages /bin/busybox plus /bin/ls (hardlink) and a /data dir
with keep.txt + remove.txt; layer 2 carries a single empty file
at /data/.wh.remove.txt. After layer apply the unpacked rootfs
must contain /data/keep.txt and nothing else under /data.

The OCI image-spec is explicit that the ".wh.<name>" marker
must never appear in the final filesystem, so the test asserts
on two surfaces: the runtime stdout shape (`/bin/ls /data`
emits "keep.txt" and not "remove.txt") and the on-disk
unpacked tree under HEAVY_MOUNT/images/sha256-<hex>/data (must
have keep.txt, must not have remove.txt, must not have
.wh.remove.txt). A regression that forwards the marker as a
real file would slip past the runtime check on layered tooling
but would fail the disk-state check immediately.
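
The marker convention can be sketched as a small path helper: given a tar entry path, decide whether its last component is a ".wh.<name>" whiteout and, if so, which sibling it deletes. The function name `whiteout_target` is hypothetical and this is not the PR's layer-apply code; opaque-directory markers (".wh..wh..opq") are deliberately reported as not being a plain whiteout.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WH_PREFIX ".wh."
#define WH_OPAQUE ".wh..wh..opq"

/* If the last component of `path` is a whiteout marker, write the path
 * of the entry it deletes into `out` and return true. Returns false for
 * ordinary entries, opaque markers, a bare ".wh.", or a too-small buffer.
 * Hypothetical helper, not part of the elfuse sources. */
bool whiteout_target(const char *path, char *out, size_t outsz)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    size_t plen = strlen(WH_PREFIX);

    if (strcmp(base, WH_OPAQUE) == 0)
        return false;
    if (strncmp(base, WH_PREFIX, plen) != 0 || base[plen] == '\0')
        return false;

    int dirlen = (int) (base - path); /* keeps the trailing '/' if any */
    int n = snprintf(out, outsz, "%.*s%s", dirlen, path, base + plen);
    return n >= 0 && (size_t) n < outsz;
}
```

For the fixture's layer 2 entry, `data/.wh.remove.txt` maps to `data/remove.txt`; the caller would remove that path and never create the marker itself.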

This completes the Phase 3 follow-up the original SKIP comment
named: sparsebundle volume provisioning + all three plan
fixtures (alpine-shaped, busybox-shaped, two-layer-whiteout)
now run end-to-end under OCI_COMPAT_TEST=1.

Validation:
  - OCI_COMPAT_TEST=1 bash tests/test-oci-compat.sh: 37/37
    (28 default + 9 heavy across the three fixtures)
  - default bash tests/test-oci-compat.sh: 28/28 unchanged
  - 25 OCI unit suites (test-oci-*): all green

pull_progress_tty_redraw uses CSI cursor-up ("\033[NF") plus
CSI clear-line ("\033[2K") to redraw N blob rows in place each
time the curl xferinfo callback fires. Some terminal panes
emulate a pty (so isatty reports true) but silently ignore the
cursor-up sequence; the result is that every redraw cycle
prints the same N rows below the previous ones, stacking
hundreds of duplicate lines across a single pull, and the
ignored clear-line lets shorter media-type strings bleed into
the suffix of the prior longer one ("config.v1+jsontar+gzip"
instead of "config.v1+json").
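
To make the failure mode concrete, here is a rough sketch of how one redraw frame could be assembled. It is not the actual pull_progress_tty_redraw body; `format_redraw` and its parameters are hypothetical. Every frame after the first starts with "\033[NF" to move up N rows, and each row starts with "\033[2K" so a shorter string cannot leave the tail of a longer one behind.

```c
#include <stdbool.h>
#include <stdio.h>

/* Format one redraw frame for `nrows` status rows into `buf`.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the frame does not fit. Hypothetical helper, not the elfuse code. */
int format_redraw(char *buf, size_t cap, const char *const *rows,
                  int nrows, bool first)
{
    size_t off = 0;

    if (cap == 0)
        return -1;
    buf[0] = '\0';

    /* After the first frame, jump back to the top row before repainting. */
    if (!first && nrows > 0) {
        int n = snprintf(buf, cap, "\033[%dF", nrows);
        if (n < 0 || (size_t) n >= cap)
            return -1;
        off = (size_t) n;
    }

    /* Clear each line before writing its new content. */
    for (int i = 0; i < nrows; i++) {
        int n = snprintf(buf + off, cap - off, "\033[2K%s\n", rows[i]);
        if (n < 0 || (size_t) n >= cap - off)
            return -1;
        off += (size_t) n;
    }
    return (int) off;
}
```

A terminal that drops the cursor-up sequence still honors the newlines, which is why each frame lands below the previous one instead of over it.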

The fix is a one-line escape hatch: ELFUSE_OCI_PROGRESS=plain
(also accepted: lines, off) forces is_tty=false even on a real
TTY, sending the renderer down the line-per-completion path
that already exists for non-TTY callers (test-oci-pull's
test_pull_progress_non_tty covers it). Operators on a misbehaving
terminal pane can export the variable once and never see the stacking
again; the default behavior on cooperative terminals is unchanged.
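
The override logic amounts to a small predicate; the sketch below is only an illustration. The function names are hypothetical, and exact, case-sensitive matching of the three accepted values is an assumption rather than something this commit message states.

```c
#include <stdbool.h>
#include <string.h>

/* True when the ELFUSE_OCI_PROGRESS value requests the plain renderer.
 * Accepted spellings come from the commit message; case-sensitive
 * matching is an assumption. Hypothetical helper. */
bool progress_forced_plain(const char *val)
{
    if (val == NULL)
        return false;
    return strcmp(val, "plain") == 0 || strcmp(val, "lines") == 0 ||
           strcmp(val, "off") == 0;
}

/* Effective is_tty: a real TTY unless the environment overrides it. */
bool progress_use_tty(bool fd_is_tty, const char *env_val)
{
    return fd_is_tty && !progress_forced_plain(env_val);
}
```

A caller would pass the result of isatty() on whichever descriptor the renderer writes to, together with getenv("ELFUSE_OCI_PROGRESS"); an unset or unrecognized value leaves the TTY path in place.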

Validation:
  - 25 OCI unit suites: all green (test-oci-pull 14/14 unchanged)
  - bash tests/test-oci-compat.sh: 28/28
  - OCI_COMPAT_TEST=1 bash tests/test-oci-compat.sh: 37/37
@Max042004 changed the title from "Add elfuse oci subcommand for pulling and inspecting images" to "Add OCI image support: pull, unpack, run, prune, status, policy" on May 23, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 131 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/oci/pull.c">

<violation number="1" location="src/oci/pull.c:253">
P2: Error-path leak: `sub_resp` may be allocated but not freed when sub-manifest fetch fails before `have_sub` is set.</violation>
</file>

<file name="src/oci/media-type.c">

<violation number="1" location="src/oci/media-type.c:100">
P2: Media type parsing is case-sensitive, but media type type/subtype tokens are case-insensitive; valid values with different casing will be misclassified as unknown.</violation>
</file>

<file name="src/oci/ref.c">

<violation number="1" location="src/oci/ref.c:83">
P2: Repository-path validation incorrectly rejects valid names with repeated dashes (for example `my--repo`).</violation>

<violation number="2" location="src/oci/ref.c:356">
P2: `docker.io` default-namespace detection is case-sensitive, so mixed-case hostnames can skip the required `library/` prefix.</violation>
</file>

<file name="src/oci/fetch.c">

<violation number="1" location="src/oci/fetch.c:782">
P2: Manifest fetch skips bearer-challenge parsing when a token is already cached, so 401 responses from expired/stale tokens are not retried with a refreshed token.</violation>

<violation number="2" location="src/oci/fetch.c:945">
P2: Blob fetch also disables challenge parsing when a token is cached, preventing 401-triggered token refresh and causing avoidable pull failures.</violation>
</file>

<file name="src/oci/blob-store.c">

<violation number="1" location="src/oci/blob-store.c:354">
P2: The commit path is not crash-durable because it never fsyncs the destination directory after linking the blob into place.</violation>
</file>

<file name="src/oci/store.c">

<violation number="1" location="src/oci/store.c:285">
P2: Fsync the pin directory after `rename` to make tag->digest updates crash-safe; file fsync alone does not persist the directory entry change.</violation>
</file>

<file name="src/oci/manifest.c">

<violation number="1" location="src/oci/manifest.c:295">
P2: `schemaVersion` parsing can accept fractional JSON numbers because `valueint` is used without an integer round-trip check.</violation>

<violation number="2" location="src/oci/manifest.c:385">
P2: Layer descriptor memory is leaked on post-parse validation failures because `nlayers` is incremented too late.</violation>

<violation number="3" location="src/oci/manifest.c:481">
P2: Index descriptor memory leaks when platform parsing fails because `nentries` is incremented after the fallible parse.</violation>
</file>

<file name="docs/usage.md">

<violation number="1" location="docs/usage.md:135">
P2: Contradictory documentation for `--user`. The options table describes it as 'numeric only', but the User and WorkingDir section immediately below describes detailed symbolic-name resolution (accepting symbolic `name`, `name:group`, reading /etc/passwd and /etc/group). These cannot both be correct.</violation>
</file>

<file name="src/oci/inspect.h">

<violation number="1" location="src/oci/inspect.h:57">
P3: The `suppress_layer_reuse` comment is inverted and documents the opposite runtime behavior, which can cause callers to pass the wrong value.</violation>
</file>

<file name="externals/zstd/VENDORING.md">

<violation number="1" location="externals/zstd/VENDORING.md:12">
P3: The file references 'oci-roadmap.md', which does not exist in the codebase. Remove the broken reference or update it to point to the actual document containing the policy commitment.</violation>
</file>
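
For the two crash-durability items (src/oci/blob-store.c and src/oci/store.c), the usual POSIX pattern is to fsync the containing directory once the new name is in place. The sketch below assumes a rename-based commit and uses a hypothetical helper name, `rename_durable`; it is not the PR's code, and the same directory fsync would follow a link(2)-based commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Rename `tmp` to `dst` (both inside `dir`), then fsync the directory so
 * the new directory entry itself reaches stable storage. The file's data
 * is assumed to have been fsynced already. Returns 0 on success, -1 on
 * failure. Hypothetical helper, not part of the elfuse sources. */
int rename_durable(const char *dir, const char *tmp, const char *dst)
{
    if (rename(tmp, dst) != 0)
        return -1;

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

On macOS, fsync may not force the drive to flush its cache; fcntl with F_FULLFSYNC is the stronger option if that guarantee is needed.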

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

Comment thread docs/usage.md
| `-e KEY=VAL`, `--env KEY=VAL` | Set or replace one env var (repeatable) |
| `-e KEY`, `--env KEY` | Import `KEY` from the host environ (repeatable) |
| `-w DIR`, `--workdir DIR` | Override image WorkingDir |
| `-u UID[:GID]`, `--user UID[:GID]` | Override image User (numeric only) |

P2: Contradictory documentation for --user. The options table describes it as 'numeric only', but the User and WorkingDir section immediately below describes detailed symbolic-name resolution (accepting symbolic name, name:group, reading /etc/passwd and /etc/group). These cannot both be correct.

<file context>
@@ -99,6 +99,179 @@ and memory access, and per-thread inspection. Implementation details, including
+| `-e KEY=VAL`, `--env KEY=VAL` | Set or replace one env var (repeatable) |
+| `-e KEY`, `--env KEY` | Import `KEY` from the host environ (repeatable) |
+| `-w DIR`, `--workdir DIR` | Override image WorkingDir |
+| `-u UID[:GID]`, `--user UID[:GID]` | Override image User (numeric only) |
+| `--keep` | Keep the per-run cloned rootfs after exit |
+| `--name NAME` | Reserved: deterministic clone-dir suffix (ignored today) |
</file context>
Suggested change
-| `-u UID[:GID]`, `--user UID[:GID]` | Override image User (numeric only) |
+| `-u UID[:GID]`, `--user UID[:GID]` | Override image User (supports numeric UID[:GID] or symbolic name[:group]) |

Comment thread src/oci/inspect.h
Comment on lines +57 to +62
/* When true (default), render a "layer reuse:" section after the
* manifest layer table. Setting this to false suppresses the section
* entirely (useful for tests that only want to verify the renderer
* baseline without dedup compute side-effects). The CLI never sets
* this to false.
*/

P3: The suppress_layer_reuse comment is inverted and documents the opposite runtime behavior, which can cause callers to pass the wrong value.

<file context>
@@ -45,9 +46,21 @@ typedef struct {
+     * convention. Pure information: dedup metrics never write to disk.
+     */
+    const char *volume_root;
+    /* When true (default), render a "layer reuse:" section after the
+     * manifest layer table. Setting this to false suppresses the section
+     * entirely (useful for tests that only want to verify the renderer
</file context>
Suggested change
-/* When true (default), render a "layer reuse:" section after the
- * manifest layer table. Setting this to false suppresses the section
- * entirely (useful for tests that only want to verify the renderer
- * baseline without dedup compute side-effects). The CLI never sets
- * this to false.
- */
+/* When false (default), render a "layer reuse:" section after the
+ * manifest layer table. Setting this to true suppresses the section
+ * entirely (useful for tests that only want to verify the renderer
+ * baseline without dedup compute side-effects). The CLI never sets
+ * this to true.
+ */


Comment thread externals/zstd/VENDORING.md
## Why vendored, decode-only

`oci-roadmap.md` Q9 commits the OCI work to hand-rolled C: no Go, no Rust,

P3: The file references 'oci-roadmap.md', which does not exist in the codebase. Remove the broken reference or update it to point to the actual document containing the policy commitment.

<file context>
@@ -0,0 +1,72 @@
+
+## Why vendored, decode-only
+
+`oci-roadmap.md` Q9 commits the OCI work to hand-rolled C: no Go, no Rust,
+no `cargo` / `go` in the build matrix. zstd is the only OCI-spec layer
+compression beyond gzip that has wide registry support, and the upstream
</file context>

Contributor

@jserv jserv left a comment

Rebase onto the latest main branch and squash/rework the commits into fewer, cleaner ones.
