Skip to content

feat(runs): early agent_runs visibility via pre-created queued row (MNG-1695)#1465

Merged
aaight merged 3 commits into
devfrom
feature/MNG-1695-early-agent-runs-visibility
Jun 26, 2026
Merged

feat(runs): early agent_runs visibility via pre-created queued row (MNG-1695)#1465
aaight merged 3 commits into
devfrom
feature/MNG-1695-early-agent-runs-visibility

Conversation

@aaight

@aaight aaight commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

Closes the "manual run invisible for 4+ min" gap with two complementary fixes for early agent_runs visibility (MNG-1695):

  • Improvement A (universal): tryCreateRun now runs before the git clone in executeAgentPipeline, so the run row appears in the dashboard ~10–15s sooner for all run types. A clone/setup failure now happens after the row exists, so handleAgentPipelineError → finalizeRun marks a visible failed row instead of silently no-op'ing on an undefined runId.
  • Improvement B (manual triggers): the runs.trigger mutation pre-creates a status='queued' row at tRPC time (visible in ~1s), threaded as preCreatedRunId through the BullMQ manual-run job → manual-runnerAgentInputexecuteWithEnginetryCreateRun, which activates the queued row (queued → running) instead of inserting a duplicate.

No DB migrationagent_runs.status is already text(...).default('running'); 'queued' is a new string value. Stats queries filter to terminal statuses, so charts are unaffected.

Issue: https://linear.app/issue/MNG-1695

What changed

Repository layer (runsRepository.ts)

  • New single source of truth ACTIVE_RUN_STATUSES = ['running', 'queued'] + isActiveRunStatus().
  • createQueuedRun() — inserts status='queued'.
  • activateQueuedRun()queued → running (resets startedAt, never touches engine); idempotent (returns false on a 2nd attempt so BullMQ retries are safe).
  • failQueuedOrRunningRun() — fails a still-active row (enqueue-failure rollback + compensator fast-path).
  • countActiveRuns() gains opt-in includeQueueddefault stays running-only so capacity math (agent-type-lock, work-item-lock, implementation-freshness-gate) is unchanged. hasActiveRunForWorkItem opts in, hardening both CONFLICT guards (runs.trigger + runs.retry).
  • failOrphanedRun / failOrphanedRunFallback now match both active statuses so a worker that crashes before activation still gets its queued row failed.

Threading (types, runTracking, adapter, queue/client, worker-entry, manual-runner)

  • preCreatedRunId added to AgentInput, RunTrackingInput, ManualTriggerInput, and the ManualRunJob/ManualRunJobData payloads; executeWithEngine forwards it into runTracking.
  • tryCreateRun branches: preCreatedRunId set → activateQueuedRun + reuse id; unset → legacy createRun INSERT. JOB_ID storage preserved on both paths.

tRPC (runs.ts)

  • In queue mode only, trigger resolves the engine via the same resolveEngineName resolver runAgent uses, pre-creates the queued row, and enqueues with runId. Enqueue failure rolls the row back to failed and surfaces a TRPCError. The in-process (non-queue) branch is unchanged.

Compensator (dispatch-compensator.ts)

  • recordSpawnFailureStub fast-path: a manual-run job carrying a runId fails that pre-created row instead of inserting a duplicate stub. Scoped to manual-run so retry-run/debug-analysis runIds (which reference other runs) are never flipped.

Frontend

  • New isRunActive(status) helper (mirrors ACTIVE_RUN_STATUSES) drives refetchInterval across 5 run pages so the dashboard keeps polling while a row is queued.
  • Yellow queued badge, queued filter option, and retry-button hidden while a run is active. Optional CLI formatStatus parity (queued → yellow).

Intentional scope decisions (see card Risks)

  • Capacity/lock math + freshness-gate stay running-only by default.
  • activateQueuedRun resets startedAt (duration measured from worker boot, matches today).
  • Retry runs are out of scope (still benefit from Improvement A's speedup).
  • Cancel button stays running-only (a queued run has no worker to cancel).

Testing

  • Unit: run-tracking branch selection (activateQueuedRun vs createRun), pipeline call-order (A), isRunActive truth table, compensator fast-path vs insert path, tRPC queue-mode pre-create + enqueue-failure rollback, repository insert/update shapes.
  • Integration: createQueuedRun → activateQueuedRun → failQueuedOrRunningRun lifecycle, idempotent re-activation, engine preserved, queued counted by CONFLICT guard (not by default capacity math), failOrphanedRun matches queued.
  • Full unit suite (10401 tests) green; targeted integration file green; npm run typecheck (root + web) and biome lint on changed files clean.

🤖 Generated with Claude Code

🕵️ claude-code · claude-opus-4-8 · run details

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

REQUEST_CHANGES. Improvement A (reorder tryCreateRun before the clone) and the repository/threading/compensator work are correct and well-tested — I verified capacity math stays running-only, stats filter to terminal statuses, the compensator fast-path is correctly scoped to manual-run, and engine/model handling is consistent with spec 018's deferred-fill. One real regression in Improvement B needs addressing before merge.

Code Issues

Should Fix (blocking)

src/api/routers/runs.ts (queued-row pre-creation) — the pre-created queued row leaks permanently when the worker skips dispatch before activating it.

The queued row is born at tRPC trigger time, but it is only ever resolved to a terminal state in three places: (1) enqueue-failure rollback, (2) the compensator fast-path on a spawn failure, and (3) tryCreateRunactivateQueuedRun once the worker actually reaches executeWithEngine. The problem is that the manual-run worker can return cleanly (exit 0) without ever reaching executeWithEngine, leaving the row stuck as queued forever.

Trace (queue mode, manual implementation run on a work item with an open PR):

  1. trigger creates queued row X and enqueues the job (success — no rollback).
  2. Worker boots → processDashboardJob('manual-run')triggerManualRunrunAgentExecutionPipeline.
  3. runAgentExecutionPipeline (src/triggers/shared/agent-execution.ts:81-84) hits runFreshnessGate → blocked → return; before runAgentForContext (line 100). So executeWithEngine/tryCreateRun/activateQueuedRun is never called and X is never activated.
  4. The pipeline returns void; triggerManualRun's catch (manual-runner.ts:188) is not even entered (no throw). processDashboardJob returns; the container exits 0.
  5. cleanupWorker (src/router/active-workers.ts:116) only calls failOrphanedRun when exitCode !== 0, so it is skipped. The periodic orphan sweep (orphan-cleanup.ts:145) only acts on Docker containers that still exist, but an AutoRemove container is already gone. Row X stays queued indefinitely.

The same leak occurs on the budget-abort early return (agent-execution.ts:93 / agent-execution-lifecycle.ts:96) and, more generally, on any error thrown between the start of triggerManualRun's try-block and activation, because that catch swallows the error (logs only) and the container still exits 0.

Consequences (both user-visible):

  • A permanent yellow queued badge that never resolves, and the dashboard keeps polling it forever (isRunActive('queued') === true).
  • Because hasActiveRunForWorkItem now counts queued (includeQueued: true, bounded by DEFAULT_STALE_RUN_THRESHOLD_MS ≈ 2h on startedAt), every subsequent manual trigger and retry on that work item returns CONFLICT ("A worker is already active on this work item") for ~2 hours. This is a real lock-out reachable via a normal operator action (manually re-running implementation on a card that already has an open PR, or triggering any agent on an over-budget work item).

Before this PR these skip/abort paths simply created no row (the very invisibility being fixed), so there was nothing to leak. Improvement B turns each of them into a stuck row + a multi-hour lock-out.

Suggested direction: resolve the pre-created row to a terminal state on the skip/abort paths. E.g. have the pre-activation early-returns in runAgentExecutionPipeline (and/or the swallowing catch in triggerManualRun) call failQueuedOrRunningRun(agentInput.preCreatedRunId, ...) when a preCreatedRunId is present and was not activated — failQueuedOrRunningRun already no-ops safely on a row that reached a terminal state, and it matches a still-queued row. (A dedicated non-error terminal reason like "Skipped: freshness gate / budget" would read better than failed, but failed is at least self-healing.)

Questions

  • Was the manual-run-skips-dispatch case (freshness gate / budget abort) considered when scoping the cleanup paths? The Risks section covers spawn failure and the failOrphanedRunFallback window, but not a clean (exit-0) worker that skips dispatch after the row was pre-created.

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/api/routers/runs.ts
// (`resolveEngineName`) so the stored engine matches the activated run
// (`activateQueuedRun` never overwrites the `engine` column).
const engine = resolveEngineName(input.agentType, pc.project);
const runId = await createQueuedRun({

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-created queued row leaks on the worker-side skip/abort paths.

This row is only resolved to terminal by (a) the enqueue-failure rollback below, (b) the compensator fast-path on a spawn failure, or (c) activateQueuedRun once the worker reaches executeWithEngine. But a manual implementation run can return cleanly (exit 0) without ever activating it: runAgentExecutionPipeline returns early when runFreshnessGate blocks (agent-execution.ts:82) or checkPreRunBudget aborts (agent-execution.ts:93), both before runAgentForContextexecuteWithEnginetryCreateRun. The pipeline returns void (no throw), the container exits 0, and cleanupWorker skips failOrphanedRun (it only fires on non-zero exit). The AutoRemove container is gone, so the periodic orphan sweep can't catch it either.

Result: the row is stuck queued forever (perpetual yellow badge + endless polling), and because hasActiveRunForWorkItem now counts queued with maxAgeMs = DEFAULT_STALE_RUN_THRESHOLD_MS (~2h), every later trigger/retry on this work item returns CONFLICT for ~2 hours. Reachable by manually re-running implementation on a card with an open PR, or triggering on an over-budget work item.

Consider failing/closing the pre-created row (failQueuedOrRunningRun(preCreatedRunId, ...)) on these skip/abort paths when it was never activated.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c4d8ae74. The pre-created queued row is now resolved to a terminal failed state on every worker-side skip/abort path that returns before activation:

  • runAgentExecutionPipeline calls failQueuedOrRunningRun(agentInput.preCreatedRunId, …) via a guarded resolvePreCreatedRunOnSkip helper on each pre-activation early return — integration validation, the freshness gate, and the budget abort — each with a descriptive reason (Skipped before start: …).
  • triggerManualRun's swallowing catch resolves the row for any error thrown before activation (PM-provider/credential setup or any pre-activation pipeline step).

failQueuedOrRunningRun no-ops on a row that already reached terminal — so a genuinely-activated run keeps its own status + error from the executeAgentPipeline finalizeRun path — and no-ops entirely when there is no preCreatedRunId (every non-manual source). That clears both the perpetual yellow queued badge and the ~2h CONFLICT lock-out.

Re: your question — yes, the freshness-gate / budget-abort clean-exit case is now explicitly covered, alongside the catch-all for any pre-activation throw. Added unit coverage for the validation / budget / no-id / success / db-error skip cases plus the manual-runner catch path.

…(MNG-1695)

The pre-created `queued` row (Improvement B) leaked permanently whenever the
manual-run worker returned cleanly (exit 0) without ever activating it. The
freshness gate, budget abort, and integration-validation early returns in
`runAgentExecutionPipeline` all stop before `tryCreateRun`/`activateQueuedRun`,
and the swallowing catch in `triggerManualRun` turns any pre-activation error
into a clean exit too. On exit 0, `cleanupWorker` skips `failOrphanedRun` and
the AutoRemove container is gone before the orphan sweep runs, so the row stayed
`queued` forever — a perpetual yellow badge plus, because `hasActiveRunForWorkItem`
counts `queued`, a ~2h CONFLICT lock-out of every later trigger/retry on the work
item.

Resolve the pre-created row to a terminal state on those paths:
- `runAgentExecutionPipeline` now calls `failQueuedOrRunningRun` (via a guarded
  `resolvePreCreatedRunOnSkip` helper) on each pre-activation early return, with
  a descriptive reason (validation / freshness / budget).
- `triggerManualRun`'s catch resolves the row for any error thrown before
  activation.

`failQueuedOrRunningRun` no-ops on a row that already reached terminal (the normal
post-activation success/failure path), so genuine runs keep their own status, and
the helper no-ops entirely when no `preCreatedRunId` is present (every non-manual
source). Adds unit coverage for the validation/budget/no-id/success/db-error skip
cases and the manual-runner catch path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

COMMENT. The prior blocking review (perpetual queued leak + ~2h CONFLICT lock-out on the freshness-gate / budget-abort skip paths) is correctly fixed: resolvePreCreatedRunOnSkip now resolves the pre-created row on the integration-validation, freshness-gate, and budget-abort early returns, and triggerManualRun's catch covers any throw before activation. Improvement A (reorder tryCreateRun before the clone), the repository primitives (ACTIVE_RUN_STATUSES / createQueuedRun / activateQueuedRun / failQueuedOrRunningRun), the running-only capacity math (includeQueued opt-in), and the manual-run-scoped compensator fast-path are all correct and well-tested.

One residual gap of the same class remains — reachable only via engine misconfiguration, so non-blocking, but worth closing.

Code Issues

Should Fix

src/agents/registry.ts:29-43 (via runAgentForContextrunAgent) — a clean (exit-0) engine-guard return skips activation and re-leaks the pre-created queued row.

resolvePreCreatedRunOnSkip enumerates the pipeline's pre-activation early returns, but activation happens deeper, inside runAgentForContext → runAgent → executeWithEngine → tryCreateRun. runAgent has two early returns that produce a failed AgentResult without throwing and without reaching executeWithEngine/tryCreateRun:

  • if (!engine)resolveEngineName (src/backends/resolution.ts:13) returns the project's agentEngine.default / overrides[agentType] verbatim with no registry validation, and getEngine (src/backends/registry.ts:12) returns undefined for an unknown name. A typo'd / non-existent engine name therefore hits this branch.
  • if (!engine.supportsAgentType(agentType)) — currently unreachable (all built-in engines return true via NativeToolEngine/llmist), but a future custom engine would expose it.

Trace (queue mode, project with a misconfigured engine name, manual implementation run on a card):

  1. runs.trigger passes the conflict/enabled checks, createQueuedRun writes row X (status='queued'), enqueues successfully.
  2. Worker boots → triggerManualRun pre-checks pass → runAgentExecutionPipeline clears validation, freshness gate, and budget.
  3. runAgentForContextrunAgent hits if (!engine) → returns { success: false, ... }. tryCreateRun/activateQueuedRun is never called; X stays queued.
  4. The pipeline treats it as a normal failed result (post-side-effects + failure lifecycle), returns void; triggerManualRun's try completes — no throw, so the catch is not entered — and the container exits 0.
  5. cleanupWorker (src/router/active-workers.ts:116) only calls failOrphanedRun on a non-zero exit, so it's skipped → row X leaks as queued forever.

Same user-visible consequences the prior review called out: a perpetual yellow queued badge that the dashboard polls indefinitely, and (for runs with a workItemId) every later manual trigger and retry on that card returns CONFLICT for ~DEFAULT_STALE_RUN_THRESHOLD_MS (~2h), because hasActiveRunForWorkItem now counts queued. Reachability is lower than the freshness/budget case (a standing engine misconfig rather than a normal action), but once present it leaks on every manual run for that project.

Note the broader fragility: resolving the row by enumerating every pre-activation exit is inherently incomplete — the runAgent guards are proof one was missed. A more robust alternative is a single terminal-state reconciliation that does not depend on the path taken — e.g. in triggerManualRun, after the pipeline returns (in the finally, covering both clean-return and throw), call failQueuedOrRunningRun(input.preCreatedRunId, ...) when preCreatedRunId is set. It already no-ops on terminal rows (a successfully-activated run is completed/failed by then), so it would rescue a never-activated row regardless of which skip/clean-failure path produced it, and would let the per-early-return calls be dropped.

Questions

  • Was the deeper runAgent engine-guard return (unknown / unsupported engine) considered alongside the pipeline early returns when scoping resolvePreCreatedRunOnSkip? It's the same leak class one call level below the covered returns.

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/triggers/shared/agent-execution.ts Outdated
* always safe. Its own failure must never crash the worker on an otherwise-clean
* skip, so DB errors are logged and swallowed.
*/
async function resolvePreCreatedRunOnSkip(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resolves the pipeline's pre-activation early returns, but activation actually happens one level deeper in runAgentForContext → runAgent. runAgent (src/agents/registry.ts:29-43) has two guard returns — if (!engine) and if (!engine.supportsAgentType(...)) — that return a failed AgentResult without throwing and without reaching executeWithEngine/tryCreateRun. Since resolveEngineName does no registry validation and getEngine returns undefined for an unknown name, a misconfigured agentEngine default/override hits if (!engine), the worker exits 0, cleanupWorker skips failOrphanedRun, and the pre-created queued row leaks (perpetual badge + ~2h CONFLICT lock-out via hasActiveRunForWorkItem) — the exact symptom this helper was added to prevent.

Consider a path-independent reconciliation instead of enumerating every exit: in triggerManualRun's finally, call failQueuedOrRunningRun(input.preCreatedRunId, ...) when set. It no-ops on already-terminal rows (an activated run is completed/failed by then), so it rescues a never-activated row regardless of which skip/clean-failure path produced it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 604e0eb2. You're right — the runAgent engine-guard returns (!engine for an unknown/misconfigured engine name, !supportsAgentType) are the same leak class one call level below the pipeline early returns, and they were not covered.

I took your recommended path-independent approach rather than enumerating the deeper exit:

  • Removed resolvePreCreatedRunOnSkip and its three call sites from agent-execution.ts (back to plain return;).
  • Moved + generalized the reconciliation into triggerManualRun's finally, so it resolves the pre-created queued row regardless of how the pipeline ended — clean pre-activation skip, clean engine-guard return (no throw, never reaches executeWithEngine), or thrown error. It can't be defeated by a future exit we forget to enumerate.

Why the finally is sufficient and safe:

  • preCreatedRunId only ever originates from the manual-run path, and triggerManualRun is the single worker-side entry that awaits the entire pipeline — so the finally covers 100% of leak cases.
  • failQueuedOrRunningRun no-ops at the DB level (WHERE status IN (queued, running)) on a row already finalized by executeAgentPipeline's finalizeRun, so the unconditional call is safe for genuinely-activated runs (they keep their own completed/failed/timed_out status + error). The reason string also distinguishes the throw case (Manual run failed before completion: …) from the clean never-activated case.

Coverage:

  • Unit (manual-runner.test.ts): added the clean (exit-0) engine-guard / skip return path asserting the row is still resolved, plus the throw, no-id, and swallowed-DB-error cases.
  • Integration (runsRepository.test.ts): added failQueuedOrRunningRun no-ops on an already-terminal (completed) row to lock the DB-level guarantee the unconditional finally relies on.

…(MNG-1695)

The previous fix resolved the pre-created `queued` row on the pipeline's
pre-activation early returns (integration validation, freshness gate, budget
abort) via `resolvePreCreatedRunOnSkip`. But activation happens one level
deeper — in `runAgent` (src/agents/registry.ts) → `executeWithEngine` →
`tryCreateRun`. `runAgent` has two guard returns (`!engine` for an
unknown/misconfigured engine name, and `!supportsAgentType`) that produce a
failed AgentResult WITHOUT throwing and WITHOUT reaching `executeWithEngine`.
A standing engine misconfig therefore exits the worker 0 with the queued row
never activated, so it re-leaked as a perpetual `queued` badge and locked out
every later trigger/retry on the work item for ~2h (hasActiveRunForWorkItem
counts `queued`).

Replace the per-early-return enumeration with a single path-independent
reconciliation in `triggerManualRun`'s `finally`: it resolves the pre-created
`queued` row regardless of HOW the pipeline ended — clean skip, clean
engine-guard return, or throw — so it cannot be defeated by a future exit we
forgot to enumerate. `preCreatedRunId` only ever originates from the manual-run
path, and `triggerManualRun` awaits the entire pipeline, so the `finally`
covers 100% of leak cases. `failQueuedOrRunningRun` no-ops at the DB level on a
row already finalized by `executeAgentPipeline`, making the unconditional call
safe for genuinely-activated runs.

- Remove `resolvePreCreatedRunOnSkip` + its 3 call sites from agent-execution.ts
- Move + generalize the reconciliation into triggerManualRun's `finally`
- Add integration coverage: failQueuedOrRunningRun no-ops on a completed row
- Add unit coverage for the clean (exit-0) engine-guard / skip return path

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

APPROVE. This is a well-scoped, thoroughly-tested fix for early agent_runs visibility, and it has correctly absorbed the fixes from the two prior review rounds. I re-traced the full run lifecycle and the previously-blocking leak class is now closed.

What I verified

Improvement A (reorder tryCreateRun before the clone)model/maxIterations are deferred-filled via tryUpdateRunPlanResolution(runId, …) (spec 018, adapter.ts:125), so reusing the pre-created runId loses nothing; the watchdog closure now captures a defined runId during the clone, and a setupRepoDir throw surfaces a visible failed row (covered by the new pipeline tests).

The queued-row leak (the prior blocking finding) — the path-independent reconciliation in triggerManualRun's finally is the robust fix. I walked every exit path for a workItem manual run and each resolves the pre-created row:

  • normal activation → completeRun/finalizeRun (and finally then no-ops on the terminal row);
  • clean pre-activation skip (validation / freshness gate / budget) and runAgent !engine / !supportsAgentType returns → finally resolves it;
  • throw inside the try → finally (pipelineError branch);
  • throw before the try/finally (validateIntegrations line 123) or hard crash → non-zero exit → cleanupWorkerfailOrphanedRun, which now matches ACTIVE_RUN_STATUSES;
  • spawn failure → compensator fast-path (scoped to manual-run); enqueue failure → tRPC rollback.

failQueuedOrRunningRun is guarded on status IN (queued, running), so a genuinely-completed run keeps its own status — confirmed by the integration test.

Other invariants — capacity math stays running-only (agent-type-lock, work-item-lock, implementation-freshness-gate all call countActiveRuns without includeQueued; only hasActiveRunForWorkItem opts in); stats/charts filter to terminal statuses (runStatsRepository/prWorkItemsRepository), so queued rows don't pollute aggregates; the compensator fast-path correctly excludes retry-run/debug-analysis; cancelRunById is intentionally running-only; startedAt has a DB default so createQueuedRun rows sort/display correctly; frontend polling/badge/filter/retry-hide are consistently driven by isRunActive.

The residual edges (no-workItemId hard-crash before activation hitting the failOrphanedRunFallback window — no CONFLICT lock-out since there's no work item; and the manual+webhook race inside the brief queued window under running-only capacity) are explicitly acknowledged in the card's Risks section. No blocking issues.

🕵️ claude-code · claude-opus-4-8 · run details

@aaight aaight merged commit 046289c into dev Jun 26, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants