feat(worker-image): superadmin set/validate per-project worker image (MNG-1698)#1466
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
REQUEST_CHANGES — the set/clear tRPC surface, queue wiring, ref grammar validator, ref-guarded persistence, and CLI are clean and well-tested. But the router-side validator is non-functional: the runtime smoke-test script it generates is unparseable by bash, so every image — valid or not — is marked failed and can never launch. That defeats the core purpose of this plan (AC #3 / #4).
Code Issues
Blocking
-
src/router/worker-image-checks.ts:83 (root cause) + :56 (trigger) —
buildWorkerImageCheckScript()interpolates the raw${check.command}into a double-quotedecho "FAIL: ... (${check.command})". The Playwright SOFT check command contains literal double quotes (node -e "require('@playwright/test/package.json')"); embedded in the double-quoted echo they close the string and leaverequire('@playwright/test/package.json')unquoted, so bash aborts withsyntax error near unexpected token '('.defaultRunImageCheckcallsbuildWorkerImageCheckScript()with the default HARD+SOFT list and runsdocker.run(ref, ['bash','-lc', script]); bash fails at parse time, so the script exits 2 before any check runs — for every candidate image, including valid cascade-worker images.handleWorkerImageValidationthen recordsfailed, and since plan 2 only launchesverifieddigests, no per-project worker image can ever be verified or used. The feature is dead on arrival.Verified locally —
bash -non the generated script:syntax error near unexpected token `('on the playwright line; HARD-only checks parse and run fine. Running it as the handler does (
bash -lc <script>) returns exit code 2 with noFAIL:line, sosummarizeFailurestores the bash error as the reason.CI stays green because nothing exercises the real script:
worker-runtime-tools-contract.test.tsonly does substringtoContainassertions, andworker-image-validation.test.tsinjects a stubbedrunImageCheckviadeps. Please add a test that pipesbuildWorkerImageCheckScript()throughbash -n(orbash -c).
Should Fix
- src/api/routers/projects.ts:135 — the
project_worker_image_changedaudit line is emitted only after the validation enqueue resolves, but the column write was already persisted by then. If the enqueue throws (e.g. Redis unavailable) the mutation rejects and the persisted change is left with no audit record (AC #8 wants every set/clear audited). Emit the audit line before / independently of the enqueue.
Questions
- Local-only images vs. digest pinning —
inspectImageDigestresolves the launch digest fromRepoDigests, which is empty for an image that exists only locally (built, never pushed/pulled). Such a ref is markedfailed("no RepoDigests after pull"), yet the work order lists "local" as in v1 scope ("host-pullable/public/local"). Is a purely locally-built per-project image intended to be supported here, or does "local" mean "already pulled from a registry"? If the former, the digest path needs a fallback, since plan 2 launchesworkerImageDigestas a pull-by-digest reference (a baresha256:image Id isn't pullable).
🕵️ claude-code · claude-opus-4-8 · run details
| const lines = ['set -u']; | ||
| for (const check of checks) { | ||
| lines.push( | ||
| `if ! { ${check.command} ; } >/dev/null 2>&1 ; then echo "FAIL: ${check.label} check failed (${check.command})" >&2 ; exit 1 ; fi`, |
There was a problem hiding this comment.
BLOCKING — this generated smoke-test script is unparseable, so every per-project image validation fails.
echo "FAIL: ${check.label} check failed (${check.command})" embeds the raw command. The Playwright SOFT check (line 56) is NODE_PATH=$(npm root -g) node -e "require('@playwright/test/package.json')" — it contains literal double quotes. Interpolated into this double-quoted echo, the inner " close the string and leave require('@playwright/test/package.json') unquoted, so bash aborts with syntax error near unexpected token '('.
defaultRunImageCheck calls buildWorkerImageCheckScript() with the default HARD+SOFT list, then runs docker.run(ref, ['bash','-lc', script]). Bash detects the syntax error at parse time, so the whole script exits 2 before any check runs — for every image, including valid cascade-worker images. summarizeFailure finds no FAIL: line and stores the bash error; the project is marked failed. Since plan 2 only launches verified digests, no per-project image can ever be used.
Verified locally: bash -n on the generated script reports syntax error near unexpected token '(' on the playwright line; HARD-only checks parse fine. CI misses it because the contract test only does substring toContain checks and the handler test stubs runImageCheck, so nothing parses/executes the real script.
Fix: don't embed the raw command in a double-quoted echo — e.g. drop (${check.command}) and keep just the label (enough for summarizeFailure's FAIL: grep), or shell-escape the command. Add a test that runs buildWorkerImageCheckScript() through bash -n.
There was a problem hiding this comment.
Fixed in 0c3f475c. Reproduced your bash -n finding locally first (syntax error near unexpected token '(' on the Playwright line, exit 2, no FAIL: line), then fixed the root cause exactly as you diagnosed.
The generator no longer interpolates the raw ${check.command} into a double-quoted echo. It now emits the label via:
printf 'FAIL: %s check failed\n' '<label>' >&2
and drops the raw command from the message — the label alone is enough for summarizeFailure's FAIL: grep. The label is POSIX single-quote-escaped via a new shellSingleQuote() helper, so even labels containing shell metacharacters (the real engine CLI (claude/codex/opencode) label has parens) are printed literally and never re-parsed. The check command still runs inside if ! { ${command} ; }, where its own balanced double quotes are valid.
Added two regression tests in worker-runtime-tools-contract.test.ts that exercise the real generated script (not a stub):
- pipes
buildWorkerImageCheckScript()throughbash -nand asserts it parses (status 0, nosyntax error); - forces a failing check with a parens-containing label and asserts the runtime stderr is exactly
FAIL: engine CLI (claude/codex/opencode) check failed(metacharacters literal, nothing executed).
| from: string | null; | ||
| }): Promise<void> { | ||
| if (opts.change.enqueueRef) { | ||
| await enqueueWorkerImageValidationJob({ |
There was a problem hiding this comment.
SHOULD_FIX — the audit line can be lost on an already-persisted change. The worker-image columns are written by updateProject/createProject before this runs, but the project_worker_image_changed audit line below is only emitted after enqueueWorkerImageValidationJob resolves. If the enqueue throws (e.g. Redis down), the mutation rejects and no audit line is logged — even though the change WAS persisted. For a superadmin-only, audited mutation (AC #8), that's a gap. Emit the audit line before / independently of the enqueue so every persisted change is always audited.
There was a problem hiding this comment.
Fixed in 0c3f475c. finalizeWorkerImageChange now emits the project_worker_image_changed audit line before the enqueue, so a persisted set/clear is always audited even if enqueueWorkerImageValidationJob throws (e.g. Redis unavailable). The enqueue failure still propagates — the operator must know validation wasn't scheduled — but the already-committed change is recorded first.
Added a regression test (still audits a persisted change when the validation enqueue throws (Redis down)) that rejects the enqueue, asserts the mutation rejects with the enqueue error, and asserts the audit line was still logged with the correct from/to.
… audit before enqueue
The router-side validator's generated smoke-test script was unparseable: the
Playwright SOFT check command contains literal double quotes, and the previous
`echo "FAIL: ... (${command})"` interpolated it raw into a double-quoted string,
closing it and leaving `(` unquoted. bash aborted at PARSE time (exit 2, no
FAIL: line) for EVERY candidate image, so no per-project worker image could ever
be verified. Emit the label via `printf '%s' <single-quoted label>` and drop the
raw command from the message (the label is enough for summarizeFailure's grep).
Add bash -n parse + runtime FAIL-line regression tests.
Also emit the `project_worker_image_changed` audit line BEFORE enqueueing the
validation job: the column write is already committed by then, so a persisted
set/clear must always be audited even if the enqueue throws (e.g. Redis down).
Clarify in-code that v1 intentionally requires a registry digest (RepoDigests);
a purely-local, build-only image is out of scope (Dockerfile-build follow-up).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
APPROVE — spec 022 plan 3 is implemented completely and correctly. The prior REQUEST_CHANGES blocker (the buildWorkerImageCheckScript() bash parse error that marked every image failed) is genuinely fixed: the label is now emitted via printf with POSIX single-quote escaping and the raw command is no longer echoed. I reproduced the generated script and confirmed it (a) parses under bash -n and (b) prints a grep-stable FAIL: engine CLI (claude/codex/opencode) check failed line with the parens rendered literally. Two regression tests (bash -n parse + runtime FAIL line) lock it in. The audit-before-enqueue fix and the local-image clarification are also in place. CI is green (7/7).
I verified end-to-end against plan 2: resolveEffectiveBaseImage launches workerImageDigest directly as a pull-by-digest reference and throws a terminal error unless workerImageStatus === 'verified', so storing the full repo@sha256:… RepoDigests entry and the fail-closed pending/failed handling are correct and consistent. Superadmin gate, synchronous BAD_REQUEST grammar check, pending+enqueue, clear-to-default, ref-guarded persistence, audit line, and the dashboard-job routing (no slot / no spawn) all match the ACs and are well-tested.
Code Issues
Should Fix (non-blocking)
- src/queue/client.ts:160 — Re-setting the worker image while a prior validation job for the same project is actively running can silently drop the new ref's validation. The jobId is per-project (
worker-image-validation-<projectId>), andremoveDashboardJobno-ops when the job isactive(can't remove an active job — its own comment only claims to clear "completed/failed"). The subsequentsubmitDashboardJob→queue.add(..., { jobId })is then deduped by BullMQ against the still-active job, so the new ref never gets a job. The mutation has already persisted the new ref aspending, and the active job's result is correctly dropped by therecordResultref-guard (wrote === false) — but nothing re-enqueues for the new ref, so the project stayspendinguntil the operator sets again after the in-flight job finishes. Fail-closed is preserved (plan 2 fails loud on non-verified, so no bad image launches) and it's recoverable, hence non-blocking — but the multi-minute pull+smoke-test window makes the race reachable. Consider re-enqueuing for the project's current ref when a stale result is dropped, or keying the job so an active run can be superseded.
Notes (NITPICK — no action required)
- src/router/worker-image-checks.ts:52 — The SOFT checks (
python, Playwright) are documented as "not strictly required to boot CASCADE," yet a missing one marks the imagefailedand it can never launch, with no override. This matches AC #3 (which lists python/Playwright in the passing smoke-test) and is clearly documented, so it's intended — flagging only because an otherwise cascade-capable image that happens to lack Playwright is rejected. - src/router/worker-image-validation.ts — On smoke-test timeout the
Promise.racerejects but thedocker runcontainer isn't actively stopped;AutoRemovereaps it only when it exits on its own. Negligible impact given the checks are quick--versionprobes.
🕵️ claude-code · claude-opus-4-8 · run details
| const jobId = workerImageValidationJobId(payload.projectId); | ||
| // Clear any prior (completed/failed) job for this project so a re-set always | ||
| // schedules a fresh validation rather than colliding with a stale job id. | ||
| await removeDashboardJob(jobId); |
There was a problem hiding this comment.
Race: removeDashboardJob no-ops when the prior job is active (you can't queue.remove an active job — the comment above correctly says it only clears completed/failed). If an operator re-sets the image while the previous validation is mid pull/smoke-test, this remove is a no-op and the following submitDashboardJob is deduped by BullMQ against the still-active job (same per-project jobId), so the new ref's validation is silently dropped. The mutation already persisted the new ref as pending, and the active job's result is correctly discarded by the recordResult ref-guard — but nothing re-enqueues for the new ref, so the project stays stuck in pending until the operator sets it again. Fail-closed is preserved (plan 2 fails loud on non-verified), so this is recoverable / non-blocking, but worth closing: e.g. re-enqueue for the project's current ref when recordResult returns false.
| * stays explicit. Playwright is verified by package presence only — a full | ||
| * Chromium launch belongs in the CI smoke-test, not the per-project validator. | ||
| */ | ||
| export const WORKER_IMAGE_SOFT_CHECKS: readonly WorkerImageCheck[] = [ |
There was a problem hiding this comment.
Non-blocking note: these SOFT checks are documented as "not strictly required to boot CASCADE," but the validator runs WORKER_IMAGE_VALIDATION_CHECKS (HARD + SOFT) and fails the image on any non-zero exit, so an image missing only python or Playwright is marked failed and can never launch — with no override. This matches AC #3 (python/Playwright are part of the passing smoke-test) and is documented, so it appears intentional; flagging for awareness in case operators provide HARD-complete images without Playwright.
Per-project worker image — plan 3/4: set-validation (CLI/API + router-side validation job + audit)
Implements the operator-facing backend for spec 022 (per-project worker image). A superadmin sets/clears a project's worker image via tRPC + CLI. Because the Docker socket is router-only, the set mutation does not validate inline — it records the reference as
pendingand enqueues an eager router-side validation job that pulls the image, pins its immutable@sha256:digest, runs the runtime smoke-test, and marks the projectverified(digest pinned) orfailed(precise reason). Plan 2 already wired spawn to launch a verified digest, so after this plan the feature is fully functional headless.Issue: https://linear.app/issue/MNG-1698
What changed
1. Extended runtime smoke-test (the validation contract)
src/router/worker-image-checks.ts— the single source of truth for the in-container checks (WORKER_IMAGE_HARD_CHECKS=cascade-tools/node/git/ engine CLI;WORKER_IMAGE_VALIDATION_CHECKSadds the python shim + Playwright) and abuildWorkerImageCheckScript()that fails fast with a grep-stableFAIL: <label>line.tests/docker/worker-runtime-tools/run-test.shextended with the HARD-contract block (cascade-tools/node/git/ engine CLI), keeping the existing python + Playwright blocks.2. tRPC set/clear mutation — superadmin, syntactic validation, enqueue, audit (
src/api/routers/projects.ts)create/updateacceptworkerImage(z.string().nullish()). Worker-image changes are superadmin-gated (FORBIDDENotherwise).BAD_REQUEST(nothing persisted) via the new pure grammar validatorsrc/config/workerImageRef.ts.workerImage+workerImageStatus='pending', clears digest/error, and enqueues a validation job.nullclears all four columns (revert to global default); no enqueue.{ event: 'project_worker_image_changed', actorId, projectId, from, to }.defaultsnow exposes the globalrouterConfig.workerImage.3. Validation job + router-side handler
src/queue/client.ts— newWorkerImageValidationJobtype +enqueueWorkerImageValidationJob()(deterministic, self-deduplicating job id per project).src/router/worker-image-validation.ts— the handler:pullImageOnce(ref)→inspect→ resolve the launchablerepo@sha256:…digest fromRepoDigests→ run the extended smoke-test in a one-shotdocker run --rm→ persistverified+digest orfailed+reason. Fail-closed: every non-verified path recordsfailed, so a project is never stranded inpending. The persist is ref-guarded (recordWorkerImageValidationResult) so a stale result can't clobber a ref the operator changed mid-flight.src/router/worker-manager.ts— the dashboard-jobs processor routesworker-image-validationstraight to the handler (no worker slot, no container spawn); all other dashboard jobs still go throughguardedSpawn.4. CLI surface (
src/cli/dashboard/projects/)update:--worker-image <ref>and--clear-worker-image(mutually exclusive).create:--worker-image <ref>.show: renders the worker image + lifecycle (pending/verified → <digest>/failed: <reason>/(global default)). TypedFORBIDDEN/BAD_REQUESTenvelopes already surface cleanly through the shared error mapper.Digest format note. The pinned digest is the full launchable
repo@sha256:…RepoDigests entry (matched to the pulled repository), not a baresha256:…— plan 2'sresolveEffectiveBaseImagelaunchesworkerImageDigestdirectly, so it must be a pull-by-digest reference.Acceptance criteria
FORBIDDEN(AC fix: use CI environment for deploy workflow secrets #8 authz)BAD_REQUEST(nothing persisted); valid ref →pending+ validation enqueued (AC Improved performance and stability #4 partial)verifiedon a passing smoke-test (AC chore: update llmist and zangief dependencies #3 / Improved performance and stability #4)failed+ precise reason, never launches; never stuck inpending(AC Improved performance and stability #4 fail-closed)npm run build,npm test,npm run test:integration,npm run lint,npm run typecheckpassREADME.md,CHANGELOG.md)Testing
worker-runtime-tools-contract,workerImageRef,worker-image-validation-job(queue),worker-image-validation(handler),projects-worker-image(tRPC),projects-worker-image(CLI), plus a dashboard-dispatch routing case inworker-manager. Full unit suite: 10,461 passed.recordWorkerImageValidationResultround-trip + ref-guard inprojectsRepository. Full integration suite: 627 passed.npm run typecheck,npm run build, and scopedbiome checkon changed/new files all clean.Out of scope (later plans)
🕵️ claude-code · claude-opus-4-8 · run details