feat(runs): early agent_runs visibility via pre-created queued row (MNG-1695)#1465
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
REQUEST_CHANGES. Improvement A (reorder tryCreateRun before the clone) and the repository/threading/compensator work are correct and well-tested — I verified capacity math stays running-only, stats filter to terminal statuses, the compensator fast-path is correctly scoped to manual-run, and engine/model handling is consistent with spec 018's deferred-fill. One real regression in Improvement B needs addressing before merge.
Code Issues
Should Fix (blocking)
src/api/routers/runs.ts (queued-row pre-creation) — the pre-created queued row leaks permanently when the worker skips dispatch before activating it.
The queued row is born at tRPC trigger time, but it is only ever resolved to a terminal state in three places: (1) enqueue-failure rollback, (2) the compensator fast-path on a spawn failure, and (3) tryCreateRun→activateQueuedRun once the worker actually reaches executeWithEngine. The problem is that the manual-run worker can return cleanly (exit 0) without ever reaching executeWithEngine, leaving the row stuck as queued forever.
Trace (queue mode, manual implementation run on a work item with an open PR):
triggercreatesqueuedrow X and enqueues the job (success — no rollback).- Worker boots →
processDashboardJob('manual-run')→triggerManualRun→runAgentExecutionPipeline. runAgentExecutionPipeline(src/triggers/shared/agent-execution.ts:81-84) hitsrunFreshnessGate→ blocked →return;beforerunAgentForContext(line 100). SoexecuteWithEngine/tryCreateRun/activateQueuedRunis never called and X is never activated.- The pipeline returns void;
triggerManualRun's catch (manual-runner.ts:188) is not even entered (no throw).processDashboardJobreturns; the container exits 0. cleanupWorker(src/router/active-workers.ts:116) only callsfailOrphanedRunwhenexitCode !== 0, so it is skipped. The periodic orphan sweep (orphan-cleanup.ts:145) only acts on Docker containers that still exist, but an AutoRemove container is already gone. Row X staysqueuedindefinitely.
The same leak occurs on the budget-abort early return (agent-execution.ts:93 / agent-execution-lifecycle.ts:96) and, more generally, on any error thrown between the start of triggerManualRun's try-block and activation, because that catch swallows the error (logs only) and the container still exits 0.
Consequences (both user-visible):
- A permanent yellow
queuedbadge that never resolves, and the dashboard keeps polling it forever (isRunActive('queued') === true). - Because
hasActiveRunForWorkItemnow countsqueued(includeQueued: true, bounded byDEFAULT_STALE_RUN_THRESHOLD_MS≈ 2h onstartedAt), every subsequent manual trigger and retry on that work item returnsCONFLICT("A worker is already active on this work item") for ~2 hours. This is a real lock-out reachable via a normal operator action (manually re-running implementation on a card that already has an open PR, or triggering any agent on an over-budget work item).
Before this PR these skip/abort paths simply created no row (the very invisibility being fixed), so there was nothing to leak. Improvement B turns each of them into a stuck row + a multi-hour lock-out.
Suggested direction: resolve the pre-created row to a terminal state on the skip/abort paths. E.g. have the pre-activation early-returns in runAgentExecutionPipeline (and/or the swallowing catch in triggerManualRun) call failQueuedOrRunningRun(agentInput.preCreatedRunId, ...) when a preCreatedRunId is present and was not activated — failQueuedOrRunningRun already no-ops safely on a row that reached a terminal state, and it matches a still-queued row. (A dedicated non-error terminal reason like "Skipped: freshness gate / budget" would read better than failed, but failed is at least self-healing.)
Questions
- Was the manual-run-skips-dispatch case (freshness gate / budget abort) considered when scoping the cleanup paths? The Risks section covers spawn failure and the
failOrphanedRunFallbackwindow, but not a clean (exit-0) worker that skips dispatch after the row was pre-created.
🕵️ claude-code · claude-opus-4-8 · run details
| // (`resolveEngineName`) so the stored engine matches the activated run | ||
| // (`activateQueuedRun` never overwrites the `engine` column). | ||
| const engine = resolveEngineName(input.agentType, pc.project); | ||
| const runId = await createQueuedRun({ |
There was a problem hiding this comment.
Pre-created queued row leaks on the worker-side skip/abort paths.
This row is only resolved to terminal by (a) the enqueue-failure rollback below, (b) the compensator fast-path on a spawn failure, or (c) activateQueuedRun once the worker reaches executeWithEngine. But a manual implementation run can return cleanly (exit 0) without ever activating it: runAgentExecutionPipeline returns early when runFreshnessGate blocks (agent-execution.ts:82) or checkPreRunBudget aborts (agent-execution.ts:93), both before runAgentForContext → executeWithEngine → tryCreateRun. The pipeline returns void (no throw), the container exits 0, and cleanupWorker skips failOrphanedRun (it only fires on non-zero exit). The AutoRemove container is gone, so the periodic orphan sweep can't catch it either.
Result: the row is stuck queued forever (perpetual yellow badge + endless polling), and because hasActiveRunForWorkItem now counts queued with maxAgeMs = DEFAULT_STALE_RUN_THRESHOLD_MS (~2h), every later trigger/retry on this work item returns CONFLICT for ~2 hours. Reachable by manually re-running implementation on a card with an open PR, or triggering on an over-budget work item.
Consider failing/closing the pre-created row (failQueuedOrRunningRun(preCreatedRunId, ...)) on these skip/abort paths when it was never activated.
There was a problem hiding this comment.
Fixed in c4d8ae74. The pre-created queued row is now resolved to a terminal failed state on every worker-side skip/abort path that returns before activation:
runAgentExecutionPipelinecallsfailQueuedOrRunningRun(agentInput.preCreatedRunId, …)via a guardedresolvePreCreatedRunOnSkiphelper on each pre-activation early return — integration validation, the freshness gate, and the budget abort — each with a descriptive reason (Skipped before start: …).triggerManualRun's swallowing catch resolves the row for any error thrown before activation (PM-provider/credential setup or any pre-activation pipeline step).
failQueuedOrRunningRun no-ops on a row that already reached terminal — so a genuinely-activated run keeps its own status + error from the executeAgentPipeline finalizeRun path — and no-ops entirely when there is no preCreatedRunId (every non-manual source). That clears both the perpetual yellow queued badge and the ~2h CONFLICT lock-out.
Re: your question — yes, the freshness-gate / budget-abort clean-exit case is now explicitly covered, alongside the catch-all for any pre-activation throw. Added unit coverage for the validation / budget / no-id / success / db-error skip cases plus the manual-runner catch path.
…(MNG-1695) The pre-created `queued` row (Improvement B) leaked permanently whenever the manual-run worker returned cleanly (exit 0) without ever activating it. The freshness gate, budget abort, and integration-validation early returns in `runAgentExecutionPipeline` all stop before `tryCreateRun`/`activateQueuedRun`, and the swallowing catch in `triggerManualRun` turns any pre-activation error into a clean exit too. On exit 0, `cleanupWorker` skips `failOrphanedRun` and the AutoRemove container is gone before the orphan sweep runs, so the row stayed `queued` forever — a perpetual yellow badge plus, because `hasActiveRunForWorkItem` counts `queued`, a ~2h CONFLICT lock-out of every later trigger/retry on the work item. Resolve the pre-created row to a terminal state on those paths: - `runAgentExecutionPipeline` now calls `failQueuedOrRunningRun` (via a guarded `resolvePreCreatedRunOnSkip` helper) on each pre-activation early return, with a descriptive reason (validation / freshness / budget). - `triggerManualRun`'s catch resolves the row for any error thrown before activation. `failQueuedOrRunningRun` no-ops on a row that already reached terminal (the normal post-activation success/failure path), so genuine runs keep their own status, and the helper no-ops entirely when no `preCreatedRunId` is present (every non-manual source). Adds unit coverage for the validation/budget/no-id/success/db-error skip cases and the manual-runner catch path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
COMMENT. The prior blocking review (perpetual queued leak + ~2h CONFLICT lock-out on the freshness-gate / budget-abort skip paths) is correctly fixed: resolvePreCreatedRunOnSkip now resolves the pre-created row on the integration-validation, freshness-gate, and budget-abort early returns, and triggerManualRun's catch covers any throw before activation. Improvement A (reorder tryCreateRun before the clone), the repository primitives (ACTIVE_RUN_STATUSES / createQueuedRun / activateQueuedRun / failQueuedOrRunningRun), the running-only capacity math (includeQueued opt-in), and the manual-run-scoped compensator fast-path are all correct and well-tested.
One residual gap of the same class remains — reachable only via engine misconfiguration, so non-blocking, but worth closing.
Code Issues
Should Fix
src/agents/registry.ts:29-43 (via runAgentForContext → runAgent) — a clean (exit-0) engine-guard return skips activation and re-leaks the pre-created queued row.
resolvePreCreatedRunOnSkip enumerates the pipeline's pre-activation early returns, but activation happens deeper, inside runAgentForContext → runAgent → executeWithEngine → tryCreateRun. runAgent has two early returns that produce a failed AgentResult without throwing and without reaching executeWithEngine/tryCreateRun:
if (!engine)—resolveEngineName(src/backends/resolution.ts:13) returns the project'sagentEngine.default/overrides[agentType]verbatim with no registry validation, andgetEngine(src/backends/registry.ts:12) returnsundefinedfor an unknown name. A typo'd / non-existent engine name therefore hits this branch.if (!engine.supportsAgentType(agentType))— currently unreachable (all built-in engines returntrueviaNativeToolEngine/llmist), but a future custom engine would expose it.
Trace (queue mode, project with a misconfigured engine name, manual implementation run on a card):
runs.triggerpasses the conflict/enabled checks,createQueuedRunwrites row X (status='queued'), enqueues successfully.- Worker boots →
triggerManualRunpre-checks pass →runAgentExecutionPipelineclears validation, freshness gate, and budget. runAgentForContext→runAgenthitsif (!engine)→ returns{ success: false, ... }.tryCreateRun/activateQueuedRunis never called; X staysqueued.- The pipeline treats it as a normal failed result (post-side-effects + failure lifecycle), returns void;
triggerManualRun's try completes — no throw, so the catch is not entered — and the container exits 0. cleanupWorker(src/router/active-workers.ts:116) only callsfailOrphanedRunon a non-zero exit, so it's skipped → row X leaks asqueuedforever.
Same user-visible consequences the prior review called out: a perpetual yellow queued badge that the dashboard polls indefinitely, and (for runs with a workItemId) every later manual trigger and retry on that card returns CONFLICT for ~DEFAULT_STALE_RUN_THRESHOLD_MS (~2h), because hasActiveRunForWorkItem now counts queued. Reachability is lower than the freshness/budget case (a standing engine misconfig rather than a normal action), but once present it leaks on every manual run for that project.
Note the broader fragility: resolving the row by enumerating every pre-activation exit is inherently incomplete — the runAgent guards are proof one was missed. A more robust alternative is a single terminal-state reconciliation that does not depend on the path taken — e.g. in triggerManualRun, after the pipeline returns (in the finally, covering both clean-return and throw), call failQueuedOrRunningRun(input.preCreatedRunId, ...) when preCreatedRunId is set. It already no-ops on terminal rows (a successfully-activated run is completed/failed by then), so it would rescue a never-activated row regardless of which skip/clean-failure path produced it, and would let the per-early-return calls be dropped.
Questions
- Was the deeper
runAgentengine-guard return (unknown / unsupported engine) considered alongside the pipeline early returns when scopingresolvePreCreatedRunOnSkip? It's the same leak class one call level below the covered returns.
🕵️ claude-code · claude-opus-4-8 · run details
| * always safe. Its own failure must never crash the worker on an otherwise-clean | ||
| * skip, so DB errors are logged and swallowed. | ||
| */ | ||
| async function resolvePreCreatedRunOnSkip( |
There was a problem hiding this comment.
This resolves the pipeline's pre-activation early returns, but activation actually happens one level deeper in runAgentForContext → runAgent. runAgent (src/agents/registry.ts:29-43) has two guard returns — if (!engine) and if (!engine.supportsAgentType(...)) — that return a failed AgentResult without throwing and without reaching executeWithEngine/tryCreateRun. Since resolveEngineName does no registry validation and getEngine returns undefined for an unknown name, a misconfigured agentEngine default/override hits if (!engine), the worker exits 0, cleanupWorker skips failOrphanedRun, and the pre-created queued row leaks (perpetual badge + ~2h CONFLICT lock-out via hasActiveRunForWorkItem) — the exact symptom this helper was added to prevent.
Consider a path-independent reconciliation instead of enumerating every exit: in triggerManualRun's finally, call failQueuedOrRunningRun(input.preCreatedRunId, ...) when set. It no-ops on already-terminal rows (an activated run is completed/failed by then), so it rescues a never-activated row regardless of which skip/clean-failure path produced it.
There was a problem hiding this comment.
Fixed in 604e0eb2. You're right — the runAgent engine-guard returns (!engine for an unknown/misconfigured engine name, !supportsAgentType) are the same leak class one call level below the pipeline early returns, and they were not covered.
I took your recommended path-independent approach rather than enumerating the deeper exit:
- Removed
resolvePreCreatedRunOnSkipand its three call sites fromagent-execution.ts(back to plainreturn;). - Moved + generalized the reconciliation into
triggerManualRun'sfinally, so it resolves the pre-createdqueuedrow regardless of how the pipeline ended — clean pre-activation skip, clean engine-guard return (no throw, never reachesexecuteWithEngine), or thrown error. It can't be defeated by a future exit we forget to enumerate.
Why the finally is sufficient and safe:
preCreatedRunIdonly ever originates from the manual-run path, andtriggerManualRunis the single worker-side entry thatawaits the entire pipeline — so thefinallycovers 100% of leak cases.failQueuedOrRunningRunno-ops at the DB level (WHERE status IN (queued, running)) on a row already finalized byexecuteAgentPipeline'sfinalizeRun, so the unconditional call is safe for genuinely-activated runs (they keep their owncompleted/failed/timed_outstatus + error). The reason string also distinguishes the throw case (Manual run failed before completion: …) from the clean never-activated case.
Coverage:
- Unit (
manual-runner.test.ts): added the clean (exit-0) engine-guard / skip return path asserting the row is still resolved, plus the throw, no-id, and swallowed-DB-error cases. - Integration (
runsRepository.test.ts): addedfailQueuedOrRunningRun no-ops on an already-terminal (completed) rowto lock the DB-level guarantee the unconditionalfinallyrelies on.
…(MNG-1695) The previous fix resolved the pre-created `queued` row on the pipeline's pre-activation early returns (integration validation, freshness gate, budget abort) via `resolvePreCreatedRunOnSkip`. But activation happens one level deeper — in `runAgent` (src/agents/registry.ts) → `executeWithEngine` → `tryCreateRun`. `runAgent` has two guard returns (`!engine` for an unknown/misconfigured engine name, and `!supportsAgentType`) that produce a failed AgentResult WITHOUT throwing and WITHOUT reaching `executeWithEngine`. A standing engine misconfig therefore exits the worker 0 with the queued row never activated, so it re-leaked as a perpetual `queued` badge and locked out every later trigger/retry on the work item for ~2h (hasActiveRunForWorkItem counts `queued`). Replace the per-early-return enumeration with a single path-independent reconciliation in `triggerManualRun`'s `finally`: it resolves the pre-created `queued` row regardless of HOW the pipeline ended — clean skip, clean engine-guard return, or throw — so it cannot be defeated by a future exit we forgot to enumerate. `preCreatedRunId` only ever originates from the manual-run path, and `triggerManualRun` awaits the entire pipeline, so the `finally` covers 100% of leak cases. `failQueuedOrRunningRun` no-ops at the DB level on a row already finalized by `executeAgentPipeline`, making the unconditional call safe for genuinely-activated runs. - Remove `resolvePreCreatedRunOnSkip` + its 3 call sites from agent-execution.ts - Move + generalize the reconciliation into triggerManualRun's `finally` - Add integration coverage: failQueuedOrRunningRun no-ops on a completed row - Add unit coverage for the clean (exit-0) engine-guard / skip return path Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
APPROVE. This is a well-scoped, thoroughly-tested fix for early agent_runs visibility, and it has correctly absorbed the fixes from the two prior review rounds. I re-traced the full run lifecycle and the previously-blocking leak class is now closed.
What I verified
Improvement A (reorder tryCreateRun before the clone) — model/maxIterations are deferred-filled via tryUpdateRunPlanResolution(runId, …) (spec 018, adapter.ts:125), so reusing the pre-created runId loses nothing; the watchdog closure now captures a defined runId during the clone, and a setupRepoDir throw surfaces a visible failed row (covered by the new pipeline tests).
The queued-row leak (the prior blocking finding) — the path-independent reconciliation in triggerManualRun's finally is the robust fix. I walked every exit path for a workItem manual run and each resolves the pre-created row:
- normal activation →
completeRun/finalizeRun(andfinallythen no-ops on the terminal row); - clean pre-activation skip (validation / freshness gate / budget) and
runAgent!engine/!supportsAgentTypereturns →finallyresolves it; - throw inside the try →
finally(pipelineErrorbranch); - throw before the try/finally (
validateIntegrationsline 123) or hard crash → non-zero exit →cleanupWorker→failOrphanedRun, which now matchesACTIVE_RUN_STATUSES; - spawn failure → compensator fast-path (scoped to
manual-run); enqueue failure → tRPC rollback.
failQueuedOrRunningRun is guarded on status IN (queued, running), so a genuinely-completed run keeps its own status — confirmed by the integration test.
Other invariants — capacity math stays running-only (agent-type-lock, work-item-lock, implementation-freshness-gate all call countActiveRuns without includeQueued; only hasActiveRunForWorkItem opts in); stats/charts filter to terminal statuses (runStatsRepository/prWorkItemsRepository), so queued rows don't pollute aggregates; the compensator fast-path correctly excludes retry-run/debug-analysis; cancelRunById is intentionally running-only; startedAt has a DB default so createQueuedRun rows sort/display correctly; frontend polling/badge/filter/retry-hide are consistently driven by isRunActive.
The residual edges (no-workItemId hard-crash before activation hitting the failOrphanedRunFallback window — no CONFLICT lock-out since there's no work item; and the manual+webhook race inside the brief queued window under running-only capacity) are explicitly acknowledged in the card's Risks section. No blocking issues.
🕵️ claude-code · claude-opus-4-8 · run details
Summary
Closes the "manual run invisible for 4+ min" gap with two complementary fixes for early
agent_runsvisibility (MNG-1695):tryCreateRunnow runs before the git clone inexecuteAgentPipeline, so the run row appears in the dashboard ~10–15s sooner for all run types. A clone/setup failure now happens after the row exists, sohandleAgentPipelineError → finalizeRunmarks a visiblefailedrow instead of silently no-op'ing on an undefinedrunId.runs.triggermutation pre-creates astatus='queued'row at tRPC time (visible in ~1s), threaded aspreCreatedRunIdthrough the BullMQ manual-run job →manual-runner→AgentInput→executeWithEngine→tryCreateRun, which activates the queued row (queued → running) instead of inserting a duplicate.No DB migration —
agent_runs.statusis alreadytext(...).default('running');'queued'is a new string value. Stats queries filter to terminal statuses, so charts are unaffected.Issue: https://linear.app/issue/MNG-1695
What changed
Repository layer (
runsRepository.ts)ACTIVE_RUN_STATUSES = ['running', 'queued']+isActiveRunStatus().createQueuedRun()— insertsstatus='queued'.activateQueuedRun()—queued → running(resetsstartedAt, never touchesengine); idempotent (returnsfalseon a 2nd attempt so BullMQ retries are safe).failQueuedOrRunningRun()— fails a still-active row (enqueue-failure rollback + compensator fast-path).countActiveRuns()gains opt-inincludeQueued— default staysrunning-only so capacity math (agent-type-lock,work-item-lock,implementation-freshness-gate) is unchanged.hasActiveRunForWorkItemopts in, hardening both CONFLICT guards (runs.trigger+runs.retry).failOrphanedRun/failOrphanedRunFallbacknow match both active statuses so a worker that crashes before activation still gets its queued row failed.Threading (
types,runTracking,adapter,queue/client,worker-entry,manual-runner)preCreatedRunIdadded toAgentInput,RunTrackingInput,ManualTriggerInput, and theManualRunJob/ManualRunJobDatapayloads;executeWithEngineforwards it intorunTracking.tryCreateRunbranches:preCreatedRunIdset →activateQueuedRun+ reuse id; unset → legacycreateRunINSERT. JOB_ID storage preserved on both paths.tRPC (
runs.ts)triggerresolves the engine via the sameresolveEngineNameresolverrunAgentuses, pre-creates the queued row, and enqueues withrunId. Enqueue failure rolls the row back tofailedand surfaces aTRPCError. The in-process (non-queue) branch is unchanged.Compensator (
dispatch-compensator.ts)recordSpawnFailureStubfast-path: amanual-runjob carrying arunIdfails that pre-created row instead of inserting a duplicate stub. Scoped tomanual-runsoretry-run/debug-analysisrunIds (which reference other runs) are never flipped.Frontend
isRunActive(status)helper (mirrorsACTIVE_RUN_STATUSES) drivesrefetchIntervalacross 5 run pages so the dashboard keeps polling while a row isqueued.queuedbadge,queuedfilter option, and retry-button hidden while a run is active. Optional CLIformatStatusparity (queued→ yellow).Intentional scope decisions (see card Risks)
running-only by default.activateQueuedRunresetsstartedAt(duration measured from worker boot, matches today).running-only (aqueuedrun has no worker to cancel).Testing
activateQueuedRunvscreateRun), pipeline call-order (A),isRunActivetruth table, compensator fast-path vs insert path, tRPC queue-mode pre-create + enqueue-failure rollback, repository insert/update shapes.createQueuedRun → activateQueuedRun → failQueuedOrRunningRunlifecycle, idempotent re-activation, engine preserved,queuedcounted by CONFLICT guard (not by default capacity math),failOrphanedRunmatches queued.npm run typecheck(root + web) and biome lint on changed files clean.🤖 Generated with Claude Code
🕵️ claude-code · claude-opus-4-8 · run details