feat(worker-image): superadmin set/validate per-project worker image (MNG-1698) #1466

Merged
aaight merged 2 commits into dev from feature/MNG-1698-worker-image-validation
Jun 26, 2026

@aaight commented Jun 26, 2026

Collaborator

Per-project worker image — plan 3/4: set-validation (CLI/API + router-side validation job + audit)

Implements the operator-facing backend for spec 022 (per-project worker image). A superadmin sets/clears a project's worker image via tRPC + CLI. Because the Docker socket is router-only, the set mutation does not validate inline — it records the reference as pending and enqueues an eager router-side validation job that pulls the image, pins its immutable @sha256: digest, runs the runtime smoke-test, and marks the project verified (digest pinned) or failed (precise reason). Plan 2 already wired spawn to launch a verified digest, so after this plan the feature is fully functional headless.

Issue: https://linear.app/issue/MNG-1698

What changed

1. Extended runtime smoke-test (the validation contract)

  • New src/router/worker-image-checks.ts — the single source of truth for the in-container checks (WORKER_IMAGE_HARD_CHECKS = cascade-tools / node / git / engine CLI; WORKER_IMAGE_VALIDATION_CHECKS adds the python shim + Playwright) and a buildWorkerImageCheckScript() that fails fast with a grep-stable FAIL: <label> line.
  • tests/docker/worker-runtime-tools/run-test.sh extended with the HARD-contract block (cascade-tools / node / git / engine CLI), keeping the existing python + Playwright blocks.

2. tRPC set/clear mutation — superadmin, syntactic validation, enqueue, audit (src/api/routers/projects.ts)

  • create/update accept workerImage (z.string().nullish()). Worker-image changes are superadmin-gated (FORBIDDEN otherwise).
  • Malformed references are rejected synchronously with BAD_REQUEST (nothing persisted) via the new pure grammar validator src/config/workerImageRef.ts.
  • A valid ref persists workerImage + workerImageStatus='pending', clears digest/error, and enqueues a validation job. null clears all four columns (revert to global default); no enqueue.
  • Every set/clear emits a structured, grep-stable audit line: { event: 'project_worker_image_changed', actorId, projectId, from, to }.
  • defaults now exposes the global routerConfig.workerImage.
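
The set/clear rules in this section can be read as a small pure decision function. The following is a minimal sketch under stated assumptions, not the code in src/api/routers/projects.ts: `planWorkerImageChange`, `isValidRef`, and the column name `workerImageError` are hypothetical, and only the explicit set (string) and clear (null) cases are modelled.

```typescript
// Sketch only. `planWorkerImageChange`, `isValidRef`, and `workerImageError`
// are hypothetical names; the real grammar validator is src/config/workerImageRef.ts.
type WorkerImageStatus = "pending" | "verified" | "failed";

interface WorkerImageColumns {
  workerImage: string | null;
  workerImageStatus: WorkerImageStatus | null;
  workerImageDigest: string | null;
  workerImageError: string | null;
}

type PlanResult =
  | { ok: false; code: "FORBIDDEN" | "BAD_REQUEST" }
  | { ok: true; columns: WorkerImageColumns; enqueueRef: string | null };

function planWorkerImageChange(input: {
  isSuperadmin: boolean;
  next: string | null; // null means "clear, revert to the global default"
  isValidRef: (ref: string) => boolean; // injected stand-in for the grammar check
}): PlanResult {
  // Worker-image changes are superadmin-gated.
  if (!input.isSuperadmin) return { ok: false, code: "FORBIDDEN" };

  // Clearing resets all four columns and schedules nothing.
  if (input.next === null) {
    return {
      ok: true,
      columns: {
        workerImage: null,
        workerImageStatus: null,
        workerImageDigest: null,
        workerImageError: null,
      },
      enqueueRef: null,
    };
  }

  // Malformed references are rejected before anything is persisted.
  if (!input.isValidRef(input.next)) return { ok: false, code: "BAD_REQUEST" };

  // A valid ref is stored as pending; digest/error are cleared until validation runs.
  return {
    ok: true,
    columns: {
      workerImage: input.next,
      workerImageStatus: "pending",
      workerImageDigest: null,
      workerImageError: null,
    },
    enqueueRef: input.next,
  };
}
```

Keeping the decision separate from persistence and enqueueing makes each branch (FORBIDDEN, BAD_REQUEST, set, clear) testable without a database or a queue.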

3. Validation job + router-side handler

  • src/queue/client.ts — new WorkerImageValidationJob type + enqueueWorkerImageValidationJob() (deterministic, self-deduplicating job id per project).
  • src/router/worker-image-validation.ts: the handler runs pullImageOnce(ref) → inspect → resolve the launchable repo@sha256:… digest from RepoDigests → run the extended smoke-test in a one-shot docker run --rm → persist verified+digest or failed+reason. Fail-closed: every non-verified path records failed, so a project is never stranded in pending. The persist is ref-guarded (recordWorkerImageValidationResult) so a stale result can't clobber a ref the operator changed mid-flight.
  • src/router/worker-manager.ts — the dashboard-jobs processor routes worker-image-validation straight to the handler (no worker slot, no container spawn); all other dashboard jobs still go through guardedSpawn.

4. CLI surface (src/cli/dashboard/projects/)

  • update: --worker-image <ref> and --clear-worker-image (mutually exclusive).
  • create: --worker-image <ref>.
  • show: renders the worker image + lifecycle (pending / verified → <digest> / failed: <reason> / (global default)). Typed FORBIDDEN/BAD_REQUEST envelopes already surface cleanly through the shared error mapper.
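
For illustration, the four lifecycle renderings that show prints could be produced by a function like the one below. This is a hedged sketch: `renderWorkerImageLifecycle`, the field name `workerImageError`, and the fallback strings for missing digest or reason are assumptions, not the CLI's actual code.

```typescript
// Sketch only: the function name, `workerImageError`, and the fallback
// strings are hypothetical; the four output shapes come from the PR text.
interface ProjectWorkerImageView {
  workerImage: string | null;
  workerImageStatus: "pending" | "verified" | "failed" | null;
  workerImageDigest: string | null;
  workerImageError: string | null;
}

function renderWorkerImageLifecycle(p: ProjectWorkerImageView): string {
  // No per-project image set: the project uses the global default.
  if (p.workerImage === null) return "(global default)";

  switch (p.workerImageStatus) {
    case "verified":
      return `verified → ${p.workerImageDigest ?? "(digest missing)"}`;
    case "failed":
      return `failed: ${p.workerImageError ?? "(no reason recorded)"}`;
    default:
      // A set image that has not been validated yet is shown as pending.
      return "pending";
  }
}
```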

Digest format note. The pinned digest is the full launchable repo@sha256:… RepoDigests entry (matched to the pulled repository), not a bare sha256:… — plan 2's resolveEffectiveBaseImage launches workerImageDigest directly, so it must be a pull-by-digest reference.
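
To make the digest format concrete, here is a minimal sketch of picking the RepoDigests entry that belongs to the pulled repository. The helper names `repositoryOf` and `pickLaunchableDigest` are hypothetical, and the sketch uses plain string matching: it assumes the ref and the RepoDigests entries spell the repository identically and does not model Docker's name normalization.

```typescript
// Sketch only: hypothetical helpers, exact-string repository matching.

// Strip an optional @digest and an optional :tag from an image reference.
// A colon only counts as a tag separator if it comes after the last slash,
// so a registry port such as "localhost:5000/img" is left intact.
function repositoryOf(ref: string): string {
  const at = ref.indexOf("@");
  const withoutDigest = at === -1 ? ref : ref.slice(0, at);
  const lastSlash = withoutDigest.lastIndexOf("/");
  const lastColon = withoutDigest.lastIndexOf(":");
  return lastColon > lastSlash ? withoutDigest.slice(0, lastColon) : withoutDigest;
}

// Return the full "repo@sha256:..." entry for the pulled repository, or null
// when no entry matches (for example an image that was only built locally).
function pickLaunchableDigest(ref: string, repoDigests: readonly string[]): string | null {
  const prefix = `${repositoryOf(ref)}@sha256:`;
  return repoDigests.find((entry) => entry.startsWith(prefix)) ?? null;
}
```

Returning the whole entry rather than the bare sha256 value is what keeps the stored digest usable as a pull-by-digest reference.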

Acceptance criteria

Testing

  • Unit: worker-runtime-tools-contract, workerImageRef, worker-image-validation-job (queue), worker-image-validation (handler), projects-worker-image (tRPC), projects-worker-image (CLI), plus a dashboard-dispatch routing case in worker-manager. Full unit suite: 10,461 passed.
  • Integration: recordWorkerImageValidationResult round-trip + ref-guard in projectsRepository. Full integration suite: 627 passed.
  • npm run typecheck, npm run build, and scoped biome check on changed/new files all clean.

Out of scope (later plans)

  • Dashboard UI (plan 4 — MNG-1699).
  • Dockerfile-build, per-agent images, per-project registry creds.

🕵️ claude-code · claude-opus-4-8 · run details

@nhopeatall left a comment

Collaborator

Summary

REQUEST_CHANGES — the set/clear tRPC surface, queue wiring, ref grammar validator, ref-guarded persistence, and CLI are clean and well-tested. But the router-side validator is non-functional: the runtime smoke-test script it generates is unparseable by bash, so every image — valid or not — is marked failed and can never launch. That defeats the core purpose of this plan (AC #3 / #4).

Code Issues

Blocking

  • src/router/worker-image-checks.ts:83 (root cause) + :56 (trigger) — buildWorkerImageCheckScript() interpolates the raw ${check.command} into a double-quoted echo "FAIL: ... (${check.command})". The Playwright SOFT check command contains literal double quotes (node -e "require('@playwright/test/package.json')"); embedded in the double-quoted echo they close the string and leave require('@playwright/test/package.json') unquoted, so bash aborts with syntax error near unexpected token '('. defaultRunImageCheck calls buildWorkerImageCheckScript() with the default HARD+SOFT list and runs docker.run(ref, ['bash','-lc', script]); bash fails at parse time, so the script exits 2 before any check runs — for every candidate image, including valid cascade-worker images. handleWorkerImageValidation then records failed, and since plan 2 only launches verified digests, no per-project worker image can ever be verified or used. The feature is dead on arrival.

    Verified locally — bash -n on the generated script:

    syntax error near unexpected token `('
    

    on the playwright line; HARD-only checks parse and run fine. Running it as the handler does (bash -lc <script>) returns exit code 2 with no FAIL: line, so summarizeFailure stores the bash error as the reason.

    CI stays green because nothing exercises the real script: worker-runtime-tools-contract.test.ts only does substring toContain assertions, and worker-image-validation.test.ts injects a stubbed runImageCheck via deps. Please add a test that pipes buildWorkerImageCheckScript() through bash -n (or bash -c).

Should Fix

  • src/api/routers/projects.ts:135 — the project_worker_image_changed audit line is emitted only after the validation enqueue resolves, but the column write was already persisted by then. If the enqueue throws (e.g. Redis unavailable) the mutation rejects and the persisted change is left with no audit record (AC #8 wants every set/clear audited). Emit the audit line before / independently of the enqueue.

Questions

  • Local-only images vs. digest pinning: inspectImageDigest resolves the launch digest from RepoDigests, which is empty for an image that exists only locally (built, never pushed/pulled). Such a ref is marked failed ("no RepoDigests after pull"), yet the work order lists "local" as in v1 scope ("host-pullable/public/local"). Is a purely locally-built per-project image intended to be supported here, or does "local" mean "already pulled from a registry"? If the former, the digest path needs a fallback, since plan 2 launches workerImageDigest as a pull-by-digest reference (a bare sha256: image Id isn't pullable).

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/router/worker-image-checks.ts Outdated
const lines = ['set -u'];
for (const check of checks) {
lines.push(
`if ! { ${check.command} ; } >/dev/null 2>&1 ; then echo "FAIL: ${check.label} check failed (${check.command})" >&2 ; exit 1 ; fi`,

Collaborator

BLOCKING — this generated smoke-test script is unparseable, so every per-project image validation fails.

echo "FAIL: ${check.label} check failed (${check.command})" embeds the raw command. The Playwright SOFT check (line 56) is NODE_PATH=$(npm root -g) node -e "require('@playwright/test/package.json')" — it contains literal double quotes. Interpolated into this double-quoted echo, the inner " close the string and leave require('@playwright/test/package.json') unquoted, so bash aborts with syntax error near unexpected token '('.

defaultRunImageCheck calls buildWorkerImageCheckScript() with the default HARD+SOFT list, then runs docker.run(ref, ['bash','-lc', script]). Bash detects the syntax error at parse time, so the whole script exits 2 before any check runs — for every image, including valid cascade-worker images. summarizeFailure finds no FAIL: line and stores the bash error; the project is marked failed. Since plan 2 only launches verified digests, no per-project image can ever be used.

Verified locally: bash -n on the generated script reports syntax error near unexpected token '(' on the playwright line; HARD-only checks parse fine. CI misses it because the contract test only does substring toContain checks and the handler test stubs runImageCheck, so nothing parses/executes the real script.

Fix: don't embed the raw command in a double-quoted echo — e.g. drop (${check.command}) and keep just the label (enough for summarizeFailure's FAIL: grep), or shell-escape the command. Add a test that runs buildWorkerImageCheckScript() through bash -n.

Collaborator Author

Fixed in 0c3f475c. Reproduced your bash -n finding locally first (syntax error near unexpected token '(' on the Playwright line, exit 2, no FAIL: line), then fixed the root cause exactly as you diagnosed.

The generator no longer interpolates the raw ${check.command} into a double-quoted echo. It now emits the label via:

printf 'FAIL: %s check failed\n' '<label>' >&2

and drops the raw command from the message — the label alone is enough for summarizeFailure's FAIL: grep. The label is POSIX single-quote-escaped via a new shellSingleQuote() helper, so even labels containing shell metacharacters (the real engine CLI (claude/codex/opencode) label has parens) are printed literally and never re-parsed. The check command still runs inside if ! { ${command} ; }, where its own balanced double quotes are valid.

Added two regression tests in worker-runtime-tools-contract.test.ts that exercise the real generated script (not a stub):

  • pipes buildWorkerImageCheckScript() through bash -n and asserts it parses (status 0, no syntax error);
  • forces a failing check with a parens-containing label and asserts the runtime stderr is exactly FAIL: engine CLI (claude/codex/opencode) check failed (metacharacters literal, nothing executed).
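
The fix described in this reply can be sketched roughly as follows. It is an illustration rather than the merged code: `buildCheckScriptSketch` is a hypothetical stand-in for buildWorkerImageCheckScript(), and the body of `shellSingleQuote` is an assumed implementation of the standard POSIX single-quote escape (close the quote, emit an escaped quote, reopen).

```typescript
// Sketch only: an assumed shape for the fixed generator, not the merged code.
interface WorkerImageCheck {
  label: string;
  command: string;
}

// POSIX single-quote escaping: inside '...' nothing is special, so the only
// character to handle is ' itself, written as '\'' (close, escaped quote, reopen).
function shellSingleQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

function buildCheckScriptSketch(checks: readonly WorkerImageCheck[]): string {
  const lines = ["set -u"];
  for (const check of checks) {
    // The command runs inside { ...; }, where its own balanced quotes are valid.
    // Only the single-quoted label appears in the failure message; the raw
    // command is never interpolated into a quoted string.
    lines.push(
      `if ! { ${check.command} ; } >/dev/null 2>&1 ; then ` +
        `printf 'FAIL: %s check failed\\n' ${shellSingleQuote(check.label)} >&2 ; exit 1 ; fi`,
    );
  }
  return lines.join("\n");
}
```

Feeding the output of such a generator to `bash -n` is a cheap way to confirm it parses, which is the same property the new regression test asserts on the real script.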

Comment thread src/api/routers/projects.ts Outdated
from: string | null;
}): Promise<void> {
if (opts.change.enqueueRef) {
await enqueueWorkerImageValidationJob({

Collaborator

SHOULD_FIX — the audit line can be lost on an already-persisted change. The worker-image columns are written by updateProject/createProject before this runs, but the project_worker_image_changed audit line below is only emitted after enqueueWorkerImageValidationJob resolves. If the enqueue throws (e.g. Redis down), the mutation rejects and no audit line is logged — even though the change WAS persisted. For a superadmin-only, audited mutation (AC #8), that's a gap. Emit the audit line before / independently of the enqueue so every persisted change is always audited.

Collaborator Author

Fixed in 0c3f475c. finalizeWorkerImageChange now emits the project_worker_image_changed audit line before the enqueue, so a persisted set/clear is always audited even if enqueueWorkerImageValidationJob throws (e.g. Redis unavailable). The enqueue failure still propagates — the operator must know validation wasn't scheduled — but the already-committed change is recorded first.

Added a regression test (still audits a persisted change when the validation enqueue throws (Redis down)) that rejects the enqueue, asserts the mutation rejects with the enqueue error, and asserts the audit line was still logged with the correct from/to.
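
The ordering change can be illustrated with a small dependency-injected sketch. The names `finalizeWorkerImageChangeSketch`, `FinalizeDeps`, and the job payload shape are assumptions made for this example; only the audit line fields and the audit-then-enqueue order come from the thread.

```typescript
// Sketch only: hypothetical deps and payload shape; the real function is
// finalizeWorkerImageChange in src/api/routers/projects.ts.
interface AuditLine {
  event: "project_worker_image_changed";
  actorId: string;
  projectId: string;
  from: string | null;
  to: string | null;
}

interface FinalizeDeps {
  logAudit: (line: AuditLine) => void;
  enqueueValidation: (job: { projectId: string; ref: string }) => Promise<void>;
}

async function finalizeWorkerImageChangeSketch(
  deps: FinalizeDeps,
  opts: { actorId: string; projectId: string; from: string | null; to: string | null; enqueueRef: string | null },
): Promise<void> {
  // 1. Audit first: the column write is already committed at this point,
  //    so the change must be recorded regardless of what the queue does.
  deps.logAudit({
    event: "project_worker_image_changed",
    actorId: opts.actorId,
    projectId: opts.projectId,
    from: opts.from,
    to: opts.to,
  });

  // 2. Then enqueue. A failure here still rejects, so the operator learns
  //    that validation was not scheduled.
  if (opts.enqueueRef !== null) {
    await deps.enqueueValidation({ projectId: opts.projectId, ref: opts.enqueueRef });
  }
}
```

Because the audit call sits before the first await, it has already run by the time the returned promise can reject.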

… audit before enqueue

The router-side validator's generated smoke-test script was unparseable: the
Playwright SOFT check command contains literal double quotes, and the previous
`echo "FAIL: ... (${command})"` interpolated it raw into a double-quoted string,
closing it and leaving `(` unquoted. bash aborted at PARSE time (exit 2, no
FAIL: line) for EVERY candidate image, so no per-project worker image could ever
be verified. Emit the label via `printf '%s' <single-quoted label>` and drop the
raw command from the message (the label is enough for summarizeFailure's grep).
Add bash -n parse + runtime FAIL-line regression tests.

Also emit the `project_worker_image_changed` audit line BEFORE enqueueing the
validation job: the column write is already committed by then, so a persisted
set/clear must always be audited even if the enqueue throws (e.g. Redis down).

Clarify in-code that v1 intentionally requires a registry digest (RepoDigests);
a purely-local, build-only image is out of scope (Dockerfile-build follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@nhopeatall left a comment

Collaborator

Summary

APPROVE — spec 022 plan 3 is implemented completely and correctly. The prior REQUEST_CHANGES blocker (the buildWorkerImageCheckScript() bash parse error that marked every image failed) is genuinely fixed: the label is now emitted via printf with POSIX single-quote escaping and the raw command is no longer echoed. I reproduced the generated script and confirmed it (a) parses under bash -n and (b) prints a grep-stable FAIL: engine CLI (claude/codex/opencode) check failed line with the parens rendered literally. Two regression tests (bash -n parse + runtime FAIL line) lock it in. The audit-before-enqueue fix and the local-image clarification are also in place. CI is green (7/7).

I verified end-to-end against plan 2: resolveEffectiveBaseImage launches workerImageDigest directly as a pull-by-digest reference and throws a terminal error unless workerImageStatus === 'verified', so storing the full repo@sha256:… RepoDigests entry and the fail-closed pending/failed handling are correct and consistent. Superadmin gate, synchronous BAD_REQUEST grammar check, pending+enqueue, clear-to-default, ref-guarded persistence, audit line, and the dashboard-job routing (no slot / no spawn) all match the ACs and are well-tested.

Code Issues

Should Fix (non-blocking)

  • src/queue/client.ts:160: Re-setting the worker image while a prior validation job for the same project is actively running can silently drop the new ref's validation. The jobId is per-project (worker-image-validation-<projectId>), and removeDashboardJob no-ops when the job is active (an active job can't be removed; its own comment only claims to clear "completed/failed"). The subsequent submitDashboardJob → queue.add(..., { jobId }) is then deduped by BullMQ against the still-active job, so the new ref never gets a job. The mutation has already persisted the new ref as pending, and the active job's result is correctly dropped by the recordResult ref-guard (wrote === false), but nothing re-enqueues for the new ref, so the project stays pending until the operator sets again after the in-flight job finishes. Fail-closed is preserved (plan 2 fails loud on non-verified, so no bad image launches) and it's recoverable, hence non-blocking, but the multi-minute pull+smoke-test window makes the race reachable. Consider re-enqueuing for the project's current ref when a stale result is dropped, or keying the job so an active run can be superseded.

Notes (NITPICK — no action required)

  • src/router/worker-image-checks.ts:52 — The SOFT checks (python, Playwright) are documented as "not strictly required to boot CASCADE," yet a missing one marks the image failed and it can never launch, with no override. This matches AC #3 (which lists python/Playwright in the passing smoke-test) and is clearly documented, so it's intended — flagging only because an otherwise cascade-capable image that happens to lack Playwright is rejected.
  • src/router/worker-image-validation.ts — On smoke-test timeout the Promise.race rejects but the docker run container isn't actively stopped; AutoRemove reaps it only when it exits on its own. Negligible impact given the checks are quick --version probes.

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/queue/client.ts
const jobId = workerImageValidationJobId(payload.projectId);
// Clear any prior (completed/failed) job for this project so a re-set always
// schedules a fresh validation rather than colliding with a stale job id.
await removeDashboardJob(jobId);

Collaborator

Race: removeDashboardJob no-ops when the prior job is active (you can't queue.remove an active job — the comment above correctly says it only clears completed/failed). If an operator re-sets the image while the previous validation is mid pull/smoke-test, this remove is a no-op and the following submitDashboardJob is deduped by BullMQ against the still-active job (same per-project jobId), so the new ref's validation is silently dropped. The mutation already persisted the new ref as pending, and the active job's result is correctly discarded by the recordResult ref-guard — but nothing re-enqueues for the new ref, so the project stays stuck in pending until the operator sets it again. Fail-closed is preserved (plan 2 fails loud on non-verified), so this is recoverable / non-blocking, but worth closing: e.g. re-enqueue for the project's current ref when recordResult returns false.
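
One way to picture the suggested remedy is the in-memory sketch below. It models only the ref-guard and the follow-up re-enqueue; `InMemoryProjects`, `finishValidation`, and the `reenqueue` callback are hypothetical, and a real implementation would additionally have to make sure the re-enqueued job does not collide with the still-active per-project job id, which this sketch leaves abstract.

```typescript
// Sketch only: an in-memory model of the ref-guard plus re-enqueue idea.
type Outcome = { status: "verified"; digest: string } | { status: "failed"; reason: string };

interface Row {
  ref: string;
  status: "pending" | "verified" | "failed";
  detail: string | null;
}

class InMemoryProjects {
  private readonly rows = new Map<string, Row>();

  // Operator sets (or re-sets) the image: always back to pending.
  setRef(projectId: string, ref: string): void {
    this.rows.set(projectId, { ref, status: "pending", detail: null });
  }

  get(projectId: string): Row | undefined {
    return this.rows.get(projectId);
  }

  // Ref-guarded write: applies only if the stored ref is still the one that
  // was validated. Returns false when the result is stale.
  recordResult(projectId: string, validatedRef: string, outcome: Outcome): boolean {
    const row = this.rows.get(projectId);
    if (!row || row.ref !== validatedRef) return false;
    row.status = outcome.status;
    row.detail = outcome.status === "verified" ? outcome.digest : outcome.reason;
    return true;
  }
}

function finishValidation(
  projects: InMemoryProjects,
  reenqueue: (projectId: string, ref: string) => void,
  projectId: string,
  validatedRef: string,
  outcome: Outcome,
): void {
  if (projects.recordResult(projectId, validatedRef, outcome)) return;
  // Stale result dropped: schedule validation for whatever ref is current,
  // so the project does not sit in pending until the operator sets it again.
  const current = projects.get(projectId);
  if (current) reenqueue(projectId, current.ref);
}
```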

* stays explicit. Playwright is verified by package presence only — a full
* Chromium launch belongs in the CI smoke-test, not the per-project validator.
*/
export const WORKER_IMAGE_SOFT_CHECKS: readonly WorkerImageCheck[] = [

Collaborator

Non-blocking note: these SOFT checks are documented as "not strictly required to boot CASCADE," but the validator runs WORKER_IMAGE_VALIDATION_CHECKS (HARD + SOFT) and fails the image on any non-zero exit, so an image missing only python or Playwright is marked failed and can never launch — with no override. This matches AC #3 (python/Playwright are part of the passing smoke-test) and is documented, so it appears intentional; flagging for awareness in case operators provide HARD-complete images without Playwright.

@aaight aaight merged commit 99c6693 into dev Jun 26, 2026
9 checks passed