feat(worker-image): superadmin set/validate per-project worker image (MNG-1698) by aaight · Pull Request #1466 · mongrel-intelligence/cascade

aaight · 2026-06-26T12:20:13Z

Per-project worker image — plan 3/4: set-validation (CLI/API + router-side validation job + audit)

Implements the operator-facing backend for spec 022 (per-project worker image). A superadmin sets/clears a project's worker image via tRPC + CLI. Because the Docker socket is router-only, the set mutation does not validate inline — it records the reference as pending and enqueues an eager router-side validation job that pulls the image, pins its immutable @sha256: digest, runs the runtime smoke-test, and marks the project verified (digest pinned) or failed (precise reason). Plan 2 already wired spawn to launch a verified digest, so after this plan the feature is fully functional headless.

Issue: https://linear.app/issue/MNG-1698

What changed

1. Extended runtime smoke-test (the validation contract)

New src/router/worker-image-checks.ts — the single source of truth for the in-container checks (WORKER_IMAGE_HARD_CHECKS = cascade-tools / node / git / engine CLI; WORKER_IMAGE_VALIDATION_CHECKS adds the python shim + Playwright) and a buildWorkerImageCheckScript() that fails fast with a grep-stable FAIL: <label> line.
tests/docker/worker-runtime-tools/run-test.sh extended with the HARD-contract block (cascade-tools / node / git / engine CLI), keeping the existing python + Playwright blocks.

2. tRPC set/clear mutation — superadmin, syntactic validation, enqueue, audit (src/api/routers/projects.ts)

create/update accept workerImage (z.string().nullish()). Worker-image changes are superadmin-gated (FORBIDDEN otherwise).
Malformed references are rejected synchronously with BAD_REQUEST (nothing persisted) via the new pure grammar validator src/config/workerImageRef.ts.
A valid ref persists workerImage + workerImageStatus='pending', clears digest/error, and enqueues a validation job. null clears all four columns (revert to global default); no enqueue.
Every set/clear emits a structured, grep-stable audit line: { event: 'project_worker_image_changed', actorId, projectId, from, to }.
defaults now exposes the global routerConfig.workerImage.
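
For illustration, the synchronous grammar check could look roughly like the sketch below. This is not the code in src/config/workerImageRef.ts (the file is not shown on this page): the function name `isValidWorkerImageRef`, the 255-character cap, and the regular expressions are assumptions loosely modelled on the Docker image reference grammar, and the real validator may differ on edge cases.

```typescript
// Simplified, hypothetical grammar: [domain[:port]/]path[:tag][@sha256:<64 hex>].
// Not the PR's validator; edge cases (normalization, other digest algorithms) are ignored.
const COMPONENT = "[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*";
const LABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";
const DOMAIN = `${LABEL}(?:\\.${LABEL})*(?::[0-9]+)?`;
const TAG = "[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}";
const DIGEST = "sha256:[a-f0-9]{64}";

const IMAGE_REF = new RegExp(
  `^(?:${DOMAIN}/)?${COMPONENT}(?:/${COMPONENT})*(?::${TAG})?(?:@${DIGEST})?$`,
);

// Pure function: no Docker access, so it can run inside the tRPC mutation
// and reject a malformed ref before anything is persisted.
function isValidWorkerImageRef(ref: string): boolean {
  return ref.length > 0 && ref.length <= 255 && IMAGE_REF.test(ref);
}
```

Because the check is pure, the mutation can map a `false` result to BAD_REQUEST without touching the database or the queue.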

3. Validation job + router-side handler

src/queue/client.ts — new WorkerImageValidationJob type + enqueueWorkerImageValidationJob() (deterministic, self-deduplicating job id per project).
src/router/worker-image-validation.ts — the handler: pullImageOnce(ref) → inspect → resolve the launchable repo@sha256:… digest from RepoDigests → run the extended smoke-test in a one-shot docker run --rm → persist verified+digest or failed+reason. Fail-closed: every non-verified path records failed, so a project is never stranded in pending. The persist is ref-guarded (recordWorkerImageValidationResult) so a stale result can't clobber a ref the operator changed mid-flight.
src/router/worker-manager.ts — the dashboard-jobs processor routes worker-image-validation straight to the handler (no worker slot, no container spawn); all other dashboard jobs still go through guardedSpawn.
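
The ref-guarded persist can be pictured with a small in-memory sketch. It assumes a row shape with four worker-image fields; `workerImageError` is a hypothetical column name, and the real recordWorkerImageValidationResult presumably does the comparison in the database write rather than on an object.

```typescript
type WorkerImageStatus = "pending" | "verified" | "failed";

// Hypothetical row shape; `workerImageError` is an assumed column name.
interface ProjectRow {
  workerImage: string | null;
  workerImageStatus: WorkerImageStatus | null;
  workerImageDigest: string | null;
  workerImageError: string | null;
}

type ValidationOutcome =
  | { status: "verified"; digest: string }
  | { status: "failed"; reason: string };

// Writes the outcome only if the project still points at the ref that was
// validated. Returns false (and writes nothing) when the operator changed or
// cleared the ref while the job was in flight.
function recordValidationResult(
  row: ProjectRow,
  validatedRef: string,
  outcome: ValidationOutcome,
): boolean {
  if (row.workerImage !== validatedRef) return false;
  if (outcome.status === "verified") {
    row.workerImageStatus = "verified";
    row.workerImageDigest = outcome.digest;
    row.workerImageError = null;
  } else {
    row.workerImageStatus = "failed";
    row.workerImageDigest = null;
    row.workerImageError = outcome.reason;
  }
  return true;
}
```

In a SQL-backed repository the same guard would typically be a condition on the update (match on project id and current ref), so the check and the write happen atomically.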

4. CLI surface (src/cli/dashboard/projects/)

update: --worker-image <ref> and --clear-worker-image (mutually exclusive).
create: --worker-image <ref>.
show: renders the worker image + lifecycle (pending / verified → <digest> / failed: <reason> / (global default)). Typed FORBIDDEN/BAD_REQUEST envelopes already surface cleanly through the shared error mapper.
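
As a rough picture of the four lifecycle states `show` distinguishes, here is a sketch of a renderer. The function name, the field names, and the exact output strings are assumptions; only the four states themselves come from the description above.

```typescript
// Hypothetical view of the worker-image fields relevant to `show`.
interface WorkerImageView {
  workerImage: string | null;
  workerImageStatus: "pending" | "verified" | "failed" | null;
  workerImageDigest: string | null;
  workerImageError: string | null;
}

// Maps the stored fields to one human-readable lifecycle line.
function renderWorkerImageLifecycle(p: WorkerImageView): string {
  if (p.workerImage === null) return "(global default)";
  switch (p.workerImageStatus) {
    case "verified":
      return `${p.workerImage}: verified → ${p.workerImageDigest ?? "unknown digest"}`;
    case "failed":
      return `${p.workerImage}: failed: ${p.workerImageError ?? "unknown reason"}`;
    default:
      // A set ref without a terminal status is treated as still pending.
      return `${p.workerImage}: pending`;
  }
}
```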

Digest format note. The pinned digest is the full launchable repo@sha256:… RepoDigests entry (matched to the pulled repository), not a bare sha256:… — plan 2's resolveEffectiveBaseImage launches workerImageDigest directly, so it must be a pull-by-digest reference.
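
A minimal sketch of the digest selection described in this note, assuming the RepoDigests array has already been read from the image inspect result. The helper names `repositoryOf` and `resolveLaunchDigest` are hypothetical, and the sketch does no Docker name normalization (for example, it will not match a short name against a fully qualified entry).

```typescript
// Strips an optional @digest and an optional :tag, keeping any registry port.
function repositoryOf(ref: string): string {
  const at = ref.indexOf("@");
  const name = at === -1 ? ref : ref.slice(0, at);
  const lastSlash = name.lastIndexOf("/");
  const lastColon = name.lastIndexOf(":");
  // A colon before the last slash belongs to a registry port, not a tag.
  return lastColon > lastSlash ? name.slice(0, lastColon) : name;
}

// Picks the RepoDigests entry that belongs to the pulled repository, so the
// stored value is a pull-by-digest reference (repo@sha256:...), never a bare
// sha256 image id. Returns null when no entry matches, e.g. for an image
// that only exists locally and has no RepoDigests at all.
function resolveLaunchDigest(ref: string, repoDigests: readonly string[]): string | null {
  const repo = repositoryOf(ref);
  return repoDigests.find((entry) => entry.startsWith(`${repo}@sha256:`)) ?? null;
}
```

A `null` result here corresponds to a failed validation, which is the fail-closed behaviour the handler description calls for.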

Acceptance criteria

Superadmin sets via CLI + tRPC; non-superadmin → FORBIDDEN (AC #8 authz)
Malformed ref → synchronous BAD_REQUEST (nothing persisted); valid ref → pending + validation enqueued (AC #4 partial)
Handler pulls, pins the immutable digest, marks verified on a passing smoke-test (AC #3 / #4)
Failing image (missing tool / unpullable) → failed + precise reason, never launches; never stuck in pending (AC #4 fail-closed)
Clearing reverts to the global default: all four columns null (AC #2 partial)
Every set/clear emits a structured, grep-stable audit log line (AC #8 audit)
npm run build, npm test, npm run test:integration, npm run lint, npm run typecheck pass
Docs updated (README.md, CHANGELOG.md)

Testing

Unit: worker-runtime-tools-contract, workerImageRef, worker-image-validation-job (queue), worker-image-validation (handler), projects-worker-image (tRPC), projects-worker-image (CLI), plus a dashboard-dispatch routing case in worker-manager. Full unit suite: 10,461 passed.
Integration: recordWorkerImageValidationResult round-trip + ref-guard in projectsRepository. Full integration suite: 627 passed.
npm run typecheck, npm run build, and scoped biome check on changed/new files all clean.

Out of scope (later plans)

Dashboard UI (plan 4 — MNG-1699).
Dockerfile-build, per-agent images, per-project registry creds.

🕵️ claude-code · claude-opus-4-8 · run details

…(MNG-1698)

codecov · 2026-06-26T12:25:58Z

Codecov Report

❌ Patch coverage is 80.36649% with 75 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/router/worker-image-validation.ts	65.54%	51 Missing ⚠️
src/db/repositories/projectsRepository.ts	0.00%	18 Missing ⚠️
src/cli/dashboard/projects/show.ts	84.00%	4 Missing ⚠️
src/cli/dashboard/projects/create.ts	75.00%	1 Missing ⚠️
src/config/workerImageRef.ts	94.44%	1 Missing ⚠️

nhopeatall

Summary

REQUEST_CHANGES — the set/clear tRPC surface, queue wiring, ref grammar validator, ref-guarded persistence, and CLI are clean and well-tested. But the router-side validator is non-functional: the runtime smoke-test script it generates is unparseable by bash, so every image — valid or not — is marked failed and can never launch. That defeats the core purpose of this plan (AC #3 / #4).

Code Issues

Blocking

src/router/worker-image-checks.ts:83 (root cause) + :56 (trigger) — buildWorkerImageCheckScript() interpolates the raw ${check.command} into a double-quoted echo "FAIL: ... (${check.command})". The Playwright SOFT check command contains literal double quotes (node -e "require('@playwright/test/package.json')"); embedded in the double-quoted echo they close the string and leave require('@playwright/test/package.json') unquoted, so bash aborts with syntax error near unexpected token '('. defaultRunImageCheck calls buildWorkerImageCheckScript() with the default HARD+SOFT list and runs docker.run(ref, ['bash','-lc', script]); bash fails at parse time, so the script exits 2 before any check runs — for every candidate image, including valid cascade-worker images. handleWorkerImageValidation then records failed, and since plan 2 only launches verified digests, no per-project worker image can ever be verified or used. The feature is dead on arrival.

Verified locally — bash -n on the generated script:
```
syntax error near unexpected token `('
```
on the playwright line; HARD-only checks parse and run fine. Running it as the handler does (bash -lc <script>) returns exit code 2 with no FAIL: line, so summarizeFailure stores the bash error as the reason.

CI stays green because nothing exercises the real script: worker-runtime-tools-contract.test.ts only does substring toContain assertions, and worker-image-validation.test.ts injects a stubbed runImageCheck via deps. Please add a test that pipes buildWorkerImageCheckScript() through bash -n (or bash -c).

Should Fix

src/api/routers/projects.ts:135 — the project_worker_image_changed audit line is emitted only after the validation enqueue resolves, but the column write was already persisted by then. If the enqueue throws (e.g. Redis unavailable) the mutation rejects and the persisted change is left with no audit record (AC #8 wants every set/clear audited). Emit the audit line before / independently of the enqueue.

Questions

Local-only images vs. digest pinning — inspectImageDigest resolves the launch digest from RepoDigests, which is empty for an image that exists only locally (built, never pushed/pulled). Such a ref is marked failed ("no RepoDigests after pull"), yet the work order lists "local" as in v1 scope ("host-pullable/public/local"). Is a purely locally-built per-project image intended to be supported here, or does "local" mean "already pulled from a registry"? If the former, the digest path needs a fallback, since plan 2 launches workerImageDigest as a pull-by-digest reference (a bare sha256: image Id isn't pullable).

🕵️ claude-code · claude-opus-4-8 · run details

nhopeatall · 2026-06-26T12:33:30Z

+	const lines = ['set -u'];
+	for (const check of checks) {
+		lines.push(
+			`if ! { ${check.command} ; } >/dev/null 2>&1 ; then echo "FAIL: ${check.label} check failed (${check.command})" >&2 ; exit 1 ; fi`,


BLOCKING — this generated smoke-test script is unparseable, so every per-project image validation fails.

echo "FAIL: ${check.label} check failed (${check.command})" embeds the raw command. The Playwright SOFT check (line 56) is NODE_PATH=$(npm root -g) node -e "require('@playwright/test/package.json')" — it contains literal double quotes. Interpolated into this double-quoted echo, the inner " close the string and leave require('@playwright/test/package.json') unquoted, so bash aborts with syntax error near unexpected token '('.

defaultRunImageCheck calls buildWorkerImageCheckScript() with the default HARD+SOFT list, then runs docker.run(ref, ['bash','-lc', script]). Bash detects the syntax error at parse time, so the whole script exits 2 before any check runs — for every image, including valid cascade-worker images. summarizeFailure finds no FAIL: line and stores the bash error; the project is marked failed. Since plan 2 only launches verified digests, no per-project image can ever be used.

Verified locally: bash -n on the generated script reports syntax error near unexpected token '(' on the playwright line; HARD-only checks parse fine. CI misses it because the contract test only does substring toContain checks and the handler test stubs runImageCheck, so nothing parses/executes the real script.

Fix: don't embed the raw command in a double-quoted echo — e.g. drop (${check.command}) and keep just the label (enough for summarizeFailure's FAIL: grep), or shell-escape the command. Add a test that runs buildWorkerImageCheckScript() through bash -n.

Fixed in 0c3f475c. Reproduced your bash -n finding locally first (syntax error near unexpected token '(' on the Playwright line, exit 2, no FAIL: line), then fixed the root cause exactly as you diagnosed.

The generator no longer interpolates the raw ${check.command} into a double-quoted echo. It now emits the label via:

printf 'FAIL: %s check failed\n' '<label>' >&2

and drops the raw command from the message — the label alone is enough for summarizeFailure's FAIL: grep. The label is POSIX single-quote-escaped via a new shellSingleQuote() helper, so even labels containing shell metacharacters (the real engine CLI (claude/codex/opencode) label has parens) are printed literally and never re-parsed. The check command still runs inside if ! { ${command} ; }, where its own balanced double quotes are valid.
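
For readers without the diff open, the fixed generator has roughly the following shape. This is a sketch reconstructed from the description above, not the committed code: `buildCheckScript` stands in for buildWorkerImageCheckScript(), and the `WorkerImageCheck` fields are inferred from the `check.label` / `check.command` usage in the original snippet.

```typescript
interface WorkerImageCheck {
  label: string;
  command: string;
}

// POSIX single-quote escaping: close the quote, emit an escaped quote, reopen.
// Inside single quotes the shell performs no expansion at all.
function shellSingleQuote(value: string): string {
  return `'${value.replace(/'/g, "'\\''")}'`;
}

// One fail-fast line per check. The command runs unquoted inside `{ ... ; }`,
// where its own balanced quotes are valid; only the label reaches printf, and
// it is passed as a single-quoted argument so it is never re-parsed.
function buildCheckScript(checks: readonly WorkerImageCheck[]): string {
  const lines = ["set -u"];
  for (const check of checks) {
    const label = shellSingleQuote(check.label);
    lines.push(
      `if ! { ${check.command} ; } >/dev/null 2>&1 ; then printf 'FAIL: %s check failed\\n' ${label} >&2 ; exit 1 ; fi`,
    );
  }
  return lines.join("\n");
}
```

Passing the label as a `%s` argument rather than splicing it into the format string also keeps any `%` or backslash in a label from being interpreted by printf.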

Added two regression tests in worker-runtime-tools-contract.test.ts that exercise the real generated script (not a stub):

pipes buildWorkerImageCheckScript() through bash -n and asserts it parses (status 0, no syntax error);

forces a failing check with a parens-containing label and asserts the runtime stderr is exactly FAIL: engine CLI (claude/codex/opencode) check failed (metacharacters literal, nothing executed).
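
The parse-level regression check can also be reproduced outside the test suite. The sketch below assumes `bash` is on PATH; the helper name `bashSyntaxCheck` and the `playwright` label are illustrative, while the Playwright command and the two line shapes are the ones quoted in this thread.

```typescript
import { spawnSync } from "node:child_process";

// `bash -n` reads the script from stdin and parses it without executing it.
function bashSyntaxCheck(script: string): { ok: boolean; stderr: string } {
  const result = spawnSync("bash", ["-n"], { input: script, encoding: "utf8" });
  return { ok: result.status === 0, stderr: result.stderr ?? "" };
}

// The Playwright check command quoted in the review; note its literal double quotes.
const playwright = `NODE_PATH=$(npm root -g) node -e "require('@playwright/test/package.json')"`;

// Old shape: the raw command is spliced into a double-quoted echo, so its
// inner quotes close the string and leave `(` unquoted.
const brokenLine = `if ! { ${playwright} ; } >/dev/null 2>&1 ; then echo "FAIL: playwright check failed (${playwright})" >&2 ; exit 1 ; fi`;

// New shape: only a single-quoted label is printed.
const fixedLine = `if ! { ${playwright} ; } >/dev/null 2>&1 ; then printf 'FAIL: %s check failed\\n' 'playwright' >&2 ; exit 1 ; fi`;

const brokenResult = bashSyntaxCheck(`set -u\n${brokenLine}`);
const fixedResult = bashSyntaxCheck(`set -u\n${fixedLine}`);
```

Since `-n` never runs the commands, this check needs neither Docker nor Playwright installed, which makes it cheap to keep in a unit suite.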

nhopeatall · 2026-06-26T12:33:30Z

+	from: string | null;
+}): Promise<void> {
+	if (opts.change.enqueueRef) {
+		await enqueueWorkerImageValidationJob({


SHOULD_FIX — the audit line can be lost on an already-persisted change. The worker-image columns are written by updateProject/createProject before this runs, but the project_worker_image_changed audit line below is only emitted after enqueueWorkerImageValidationJob resolves. If the enqueue throws (e.g. Redis down), the mutation rejects and no audit line is logged — even though the change WAS persisted. For a superadmin-only, audited mutation (AC #8), that's a gap. Emit the audit line before / independently of the enqueue so every persisted change is always audited.

Fixed in 0c3f475c. finalizeWorkerImageChange now emits the project_worker_image_changed audit line before the enqueue, so a persisted set/clear is always audited even if enqueueWorkerImageValidationJob throws (e.g. Redis unavailable). The enqueue failure still propagates — the operator must know validation wasn't scheduled — but the already-committed change is recorded first.

Added a regression test (still audits a persisted change when the validation enqueue throws (Redis down)) that rejects the enqueue, asserts the mutation rejects with the enqueue error, and asserts the audit line was still logged with the correct from/to.
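
The ordering fix boils down to the sketch below. It is a simplified stand-in for finalizeWorkerImageChange: the function name, the injected `log` and `enqueue` dependencies, and the option shape are assumptions made so the example is self-contained; the audit payload fields are the ones listed in the PR description.

```typescript
interface WorkerImageAuditEvent {
  event: "project_worker_image_changed";
  actorId: string;
  projectId: string;
  from: string | null;
  to: string | null;
}

interface FinalizeOptions {
  audit: WorkerImageAuditEvent;
  // Ref to validate, or null when the change was a clear (no enqueue).
  enqueueRef: string | null;
  log: (event: WorkerImageAuditEvent) => void;
  enqueue: (projectId: string, ref: string) => Promise<void>;
}

// The column write has already been committed by the caller, so the audit
// line is emitted first. An enqueue failure still propagates to the operator,
// but it can no longer swallow the audit record.
async function finalizeWorkerImageChangeSketch(opts: FinalizeOptions): Promise<void> {
  opts.log(opts.audit);
  if (opts.enqueueRef !== null) {
    await opts.enqueue(opts.audit.projectId, opts.enqueueRef);
  }
}
```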

… audit before enqueue

The router-side validator's generated smoke-test script was unparseable: the Playwright SOFT check command contains literal double quotes, and the previous `echo "FAIL: ... (${command})"` interpolated it raw into a double-quoted string, closing it and leaving `(` unquoted. bash aborted at PARSE time (exit 2, no FAIL: line) for EVERY candidate image, so no per-project worker image could ever be verified.

Emit the label via `printf '%s' <single-quoted label>` and drop the raw command from the message (the label is enough for summarizeFailure's grep). Add bash -n parse + runtime FAIL-line regression tests.

Also emit the `project_worker_image_changed` audit line BEFORE enqueueing the validation job: the column write is already committed by then, so a persisted set/clear must always be audited even if the enqueue throws (e.g. Redis down).

Clarify in-code that v1 intentionally requires a registry digest (RepoDigests); a purely-local, build-only image is out of scope (Dockerfile-build follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

nhopeatall

Summary

APPROVE — spec 022 plan 3 is implemented completely and correctly. The prior REQUEST_CHANGES blocker (the buildWorkerImageCheckScript() bash parse error that marked every image failed) is genuinely fixed: the label is now emitted via printf with POSIX single-quote escaping and the raw command is no longer echoed. I reproduced the generated script and confirmed it (a) parses under bash -n and (b) prints a grep-stable FAIL: engine CLI (claude/codex/opencode) check failed line with the parens rendered literally. Two regression tests (bash -n parse + runtime FAIL line) lock it in. The audit-before-enqueue fix and the local-image clarification are also in place. CI is green (7/7).

I verified end-to-end against plan 2: resolveEffectiveBaseImage launches workerImageDigest directly as a pull-by-digest reference and throws a terminal error unless workerImageStatus === 'verified', so storing the full repo@sha256:… RepoDigests entry and the fail-closed pending/failed handling are correct and consistent. Superadmin gate, synchronous BAD_REQUEST grammar check, pending+enqueue, clear-to-default, ref-guarded persistence, audit line, and the dashboard-job routing (no slot / no spawn) all match the ACs and are well-tested.

Code Issues

Should Fix (non-blocking)

src/queue/client.ts:160 — Re-setting the worker image while a prior validation job for the same project is actively running can silently drop the new ref's validation. The jobId is per-project (worker-image-validation-<projectId>), and removeDashboardJob no-ops when the job is active (can't remove an active job — its own comment only claims to clear "completed/failed"). The subsequent submitDashboardJob → queue.add(..., { jobId }) is then deduped by BullMQ against the still-active job, so the new ref never gets a job. The mutation has already persisted the new ref as pending, and the active job's result is correctly dropped by the recordResult ref-guard (wrote === false) — but nothing re-enqueues for the new ref, so the project stays pending until the operator sets again after the in-flight job finishes. Fail-closed is preserved (plan 2 fails loud on non-verified, so no bad image launches) and it's recoverable, hence non-blocking — but the multi-minute pull+smoke-test window makes the race reachable. Consider re-enqueuing for the project's current ref when a stale result is dropped, or keying the job so an active run can be superseded.

Notes (NITPICK — no action required)

src/router/worker-image-checks.ts:52 — The SOFT checks (python, Playwright) are documented as "not strictly required to boot CASCADE," yet a missing one marks the image failed and it can never launch, with no override. This matches AC #3 (which lists python/Playwright in the passing smoke-test) and is clearly documented, so it's intended — flagging only because an otherwise cascade-capable image that happens to lack Playwright is rejected.
src/router/worker-image-validation.ts — On smoke-test timeout the Promise.race rejects but the docker run container isn't actively stopped; AutoRemove reaps it only when it exits on its own. Negligible impact given the checks are quick --version probes.
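
If the timeout note is ever picked up, one possible shape is a race that also triggers a cleanup hook. This is only a sketch: `runWithTimeout` and the `onTimeout` callback (which would stop the container) are hypothetical names, and nothing here reflects how the handler is actually written.

```typescript
// Races `run` against a timer. On timeout, the cleanup hook is invoked
// (best effort) before the returned promise rejects.
async function runWithTimeout<T>(
  run: Promise<T>,
  timeoutMs: number,
  onTimeout: () => Promise<void> | void,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(async () => {
      try {
        await onTimeout(); // e.g. stop the one-shot container
      } catch {
        // Cleanup failures must not mask the timeout itself.
      }
      reject(new Error(`smoke-test timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([run, timeout]);
  } finally {
    // Always clear the timer so a fast run does not trigger the cleanup later.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```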

🕵️ claude-code · claude-opus-4-8 · run details

nhopeatall · 2026-06-26T12:55:31Z

+	const jobId = workerImageValidationJobId(payload.projectId);
+	// Clear any prior (completed/failed) job for this project so a re-set always
+	// schedules a fresh validation rather than colliding with a stale job id.
+	await removeDashboardJob(jobId);


Race: removeDashboardJob no-ops when the prior job is active (you can't queue.remove an active job — the comment above correctly says it only clears completed/failed). If an operator re-sets the image while the previous validation is mid pull/smoke-test, this remove is a no-op and the following submitDashboardJob is deduped by BullMQ against the still-active job (same per-project jobId), so the new ref's validation is silently dropped. The mutation already persisted the new ref as pending, and the active job's result is correctly discarded by the recordResult ref-guard — but nothing re-enqueues for the new ref, so the project stays stuck in pending until the operator sets it again. Fail-closed is preserved (plan 2 fails loud on non-verified), so this is recoverable / non-blocking, but worth closing: e.g. re-enqueue for the project's current ref when recordResult returns false.
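
A sketch of the suggested follow-up, assuming the handler can read the project's current ref and enqueue through injected dependencies. All names here (`ValidationDeps`, `persistOrReenqueue`, `getCurrentRef`, `enqueueValidation`) are hypothetical. The sketch also leaves open how the new job avoids being deduplicated against the still-active job id, which a real fix would have to address.

```typescript
type ValidationOutcome =
  | { status: "verified"; digest: string }
  | { status: "failed"; reason: string };

// Hypothetical dependencies, injected so the logic can be exercised in memory.
interface ValidationDeps {
  recordResult(projectId: string, validatedRef: string, outcome: ValidationOutcome): Promise<boolean>;
  getCurrentRef(projectId: string): Promise<string | null>;
  enqueueValidation(projectId: string, ref: string): Promise<void>;
}

// When the ref-guard drops a stale result, schedule validation for whatever
// ref the project points at now, so it does not sit in pending indefinitely.
async function persistOrReenqueue(
  deps: ValidationDeps,
  projectId: string,
  validatedRef: string,
  outcome: ValidationOutcome,
): Promise<"recorded" | "reenqueued" | "dropped"> {
  const wrote = await deps.recordResult(projectId, validatedRef, outcome);
  if (wrote) return "recorded";

  const currentRef = await deps.getCurrentRef(projectId);
  // Cleared (null) or unchanged ref: nothing new to validate.
  if (currentRef === null || currentRef === validatedRef) return "dropped";

  await deps.enqueueValidation(projectId, currentRef);
  return "reenqueued";
}
```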

nhopeatall · 2026-06-26T12:55:31Z

+ * stays explicit. Playwright is verified by package presence only — a full
+ * Chromium launch belongs in the CI smoke-test, not the per-project validator.
+ */
+export const WORKER_IMAGE_SOFT_CHECKS: readonly WorkerImageCheck[] = [


Non-blocking note: these SOFT checks are documented as "not strictly required to boot CASCADE," but the validator runs WORKER_IMAGE_VALIDATION_CHECKS (HARD + SOFT) and fails the image on any non-zero exit, so an image missing only python or Playwright is marked failed and can never launch — with no override. This matches AC #3 (python/Playwright are part of the passing smoke-test) and is documented, so it appears intentional; flagging for awareness in case operators provide HARD-complete images without Playwright.

feat(worker-image): superadmin set/validate per-project worker image …

fa0e450

…(MNG-1698)

nhopeatall requested changes Jun 26, 2026

nhopeatall approved these changes Jun 26, 2026

aaight merged commit 99c6693 into dev Jun 26, 2026
9 checks passed

aaight merged 2 commits into dev from feature/MNG-1698-worker-image-validation
