Skip to content

codex-rescue reliability pass: completion signal, launch/cancel, and worker-lifecycle hardening#346

Open
zknpr wants to merge 18 commits into
openai:mainfrom
zknpr:fix/clarify-agent-resume-token-suffix
Open

codex-rescue reliability pass: completion signal, launch/cancel, and worker-lifecycle hardening#346
zknpr wants to merge 18 commits into
openai:mainfrom
zknpr:fix/clarify-agent-resume-token-suffix

Conversation

@zknpr
Copy link
Copy Markdown

@zknpr zknpr commented May 22, 2026

Rolling reliability pass on the codex-rescue / companion runtime — the original "stop waiting on finished runs" fix plus everything surfaced using it in anger.

Completion signal (original)

A synchronous codex-rescue return carried no machine-checkable "done" signal, so the main thread misread finished runs as still running. Adds a deterministic sentinel ([[codex-task status=complete|dispatched]]) + a PostToolUse hook that injects an authoritative completion line from companion state, plus a status --wait watcher for background jobs (a backgrounded Bash command isn't bound by the foreground cap, so it re-invokes the main thread on terminal).

Launch / cancel hardening

  • Atomic launch/cancel via a single updateState read-modify-write: no cancel-resurrect race, spawn errors no longer crash the parent, and a terminal launch (cancelled/failed before worker spawn) emits no dispatched sentinel.
  • status --wait reconciles a dead background worker to failed instead of polling a stale running to timeout; a plain status read reconciles too, so result/hook/resume all see the durable terminal state.
  • The hook trusts the structured Agent tool status over an echoed [[codex-task ...]] token; sub_agent_entered is treated as non-terminal; task --background exits non-zero on a failed launch.

Worker-lifecycle fixes (surfaced in real-world use)

  • app-server leak: a locally-spawned codex app-server orphaned on normal host exit, snowballing under broker-unavailable conditions; each client now reaps only its own child via a process.once("exit") SIGKILL backstop (no process scanning — parallel sessions own their own app-servers).
  • default --background: a foreground (blocking) task call running inside the codex-rescue subagent is auto-backgrounded by the harness and reaped when the subagent ends (~143s), killing the worker mid-run. codex-rescue now defaults to --background (the worker reparents to launchd and survives); foreground is reserved for short, explicit --wait.
  • cwd write-guard: a --write task is told up front it can only write within the run's cwd and must not probe the sandbox; rescue.md documents that cross-repo work needs --cwd.
  • dead-worker honesty: because the shared app-server applies edits, in-flight edits can land after the worker pid dies; the worker-exited failure message now carries the verify-on-disk discipline (re-check git status, don't trust a single snapshot).

Tests: 119/119.

… running signal

Every Agent(codex:codex-rescue) return is suffixed by the Claude Code harness
with `agentId: <id> (use SendMessage with to: '<id>' to continue this agent)`.
This is a resume token for the same agent thread, not a "still running" or
"in background" status indicator. Callers (including Claude itself) have
repeatedly misread it as the latter and then incorrectly told users that
Codex is still running in the background, when the work has actually
completed.

Adds one explicit caller-side rule in two places:

- plugins/codex/skills/codex-result-handling/SKILL.md — the internal guidance
  the caller consults when presenting Codex output.
- plugins/codex/commands/rescue.md Operating rules — the most common
  user-facing entry point (`/codex:rescue`).

No code changes; both files are caller-side documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zknpr zknpr requested a review from a team May 22, 2026 19:45
…ops waiting on finished runs

A synchronous codex-rescue return carried no machine-checkable completion
signal — only raw Codex text plus the harness resume-token suffix — so the
main thread inferred status from prose and concluded "still running". For long
runs the Claude Code Bash tool auto-backgrounds the companion call past its
~600s cap and returns "Command running in background... you will be notified",
which the subagent faithfully relays; the detached run then completes but is
orphaned. This is model-independent (reproduced on Opus), so it is not fixable
by subagent instructions or a model bump.

Fix at the structural layer:
- Companion stamps a deterministic sentinel: [[codex-task status=complete]]
  (foreground) / [[codex-task status=dispatched id=<jobId>]] (background); the
  background dispatch message no longer falsely promises a notification.
- New PostToolUse hook (codex-rescue-completion-hook.mjs) injects an
  authoritative line into the main thread on every codex-rescue return:
  complete -> done/verify-on-disk; Bash auto-background -> still finishing
  detached, don't wait; dispatched -> poll /codex:status.
- Reliable --background routing: explicit --background is honored, long
  open-ended work defaults to task --background, foreground only for short
  bounded runs.
- Docs (rescue.md, codex-cli-runtime, codex-result-handling) tell the main
  thread to trust the sentinel/hook over prose.

Tests: 92/92, incl. new task-status-token + hook coverage and updated command/
runtime assertions. Validated a real --background lifecycle end to end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zknpr zknpr changed the title docs: clarify Agent harness's agentId suffix is a resume token, not a running signal fix: deterministic completion signal for codex-rescue (stop waiting on finished runs) May 24, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d651c502a3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


function isSynchronousAgentReturn(toolResponse) {
const status = String(toolResponse?.status ?? "").toLowerCase();
return status !== "async_launched" && status !== "running";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict completion context to successful Agent returns

Treating every non-running/async_launched status as synchronous completion causes false “COMPLETED and exited” guidance for failed rescue calls. In isSynchronousAgentReturn, statuses like failed/errored pass this check, and buildCompletionContext then falls back to COMPLETE_CONTEXT when no sentinel is present, so the parent thread is told the run finished successfully even when the Agent tool actually failed. This can suppress retries/escalation and mislead users about the true execution state.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e1b8ae3. Success is now asserted only when the [[codex-task status=complete]] sentinel is present. A failure status (failed/error/timeout/...) or an empty return now yields a non-success line — "exited WITHOUT a success signal ... may have failed or produced no result ... re-run or escalate" — and a tokenless-but-otherwise-fine return gets a neutral "exited, verify on disk" line, never a false "COMPLETED ... final result". Every branch still emits "do not wait", so a failed run can't reintroduce the original hang. Added failed-status and empty-return test cases.

Addresses the PR openai#346 Codex review (P1): buildCompletionContext fell back to
the "COMPLETED ... final result" line for any synchronous return without a
sentinel — including failed/errored/empty returns — falsely signaling success
and suppressing retry/escalation.

Now success is asserted only when the [[codex-task status=complete]] sentinel
is present. A failure status (failed/error/timeout/...) or an empty return
(codex-rescue returns nothing when Codex can't be invoked) gets a failure-aware
"exited without a success signal — may have failed, verify/retry" line; any
other tokenless return gets a neutral "exited, verify on disk" line. Every
branch still says "do not wait" so a failed run doesn't reintroduce the
original waiting bug.

Tests: 94/94 (added failed-status and empty-return cases; tokenless default is
now neutral).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1b8ae3407

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/render.mjs Outdated
Comment on lines +327 to +328
const message = String(parsedResult?.failureMessage ?? "").trim() || "Codex did not return a final message.";
return `${message}\n`;
return appendTaskStatusToken(message, completeToken);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid stamping complete token on failure fallback output

When rawOutput is empty, this branch still appends [[codex-task status=complete]] to a failure fallback message, so non-zero/failed task runs can carry a synthetic "complete" sentinel. The new PostToolUse hook treats that token as authoritative and emits the "COMPLETED and exited" context before checking failure status, which can misclassify failed rescue executions as successful completions and suppress retry/escalation behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 65d8320. renderTaskResult now appends the complete sentinel only on a successful exit (status 0); empty/failure fallback output renders without it, so a failed run no longer carries a success signal. Added a failure-no-sentinel test.

A foreground `task` ran the Codex turn in-process, so when the Bash tool
auto-backgrounded a long foreground call and the owning subagent was reaped,
the run was killed mid-task and the work was lost with no resumable record.

Now the foreground path enqueues the same detached (`detached` + `unref`)
worker the `--background` path uses and waits for it inline, printing the
worker's stored result. If the inline wait is killed, the detached worker
survives and the result is recoverable via /codex:status. Completed-run output
is byte-identical (the worker stamps the [[codex-task status=complete]]
sentinel). The inline wait degrades to a "still running" status after the
existing wait cap instead of blocking forever. enqueueBackgroundTask now writes
the job record before spawning to close a worker-startup race.

Tests: 95/95 (new test proves foreground goes through the detached worker via
the queued-job record + log line, with identical stdout).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review


P2 Badge Reject conflicting --background and --wait flags

The task handler documents --wait as an explicit foreground choice, but when both flags are present this branch still routes to background because it only checks options.background and never errors on the conflict. In that input case, the command silently does the opposite of an explicit wait request, which can unexpectedly detach work and require manual status polling.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +733 to +735
const snapshot = await waitForSingleJobSnapshot(cwd, job.id, {
pollIntervalMs: FOREGROUND_TASK_POLL_INTERVAL_MS
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid timing out foreground task waits after 4 minutes

runForegroundTaskWorker waits on waitForSingleJobSnapshot without a timeout override, so it inherits the 240000ms default and exits non-zero once that deadline is hit even though the detached worker may still be running. Any foreground rescue that legitimately runs longer than ~4 minutes will be reported as failed/unfinished to the caller, which can trigger incorrect retry/escalation behavior and duplicate work while the original task continues in the background.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 65d8320. runForegroundTaskWorker now passes an unbounded timeout (timeoutMs: Infinity) instead of inheriting the 240s status default, so long xhigh foreground runs wait for the real result; the detached worker still provides survival if the waiting process is killed.

status: "queued",
phase: "queued",
pid: child.pid ?? null,
pid: null,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve worker PID for queued task cancellation

The queued job record now stores pid: null, and the detached worker PID is not persisted before returning. /codex:cancel relies on job.pid to terminate active jobs, so cancelling immediately after dispatch (while still queued) can mark the job as cancelled without actually stopping the worker; that worker can then start and finish anyway, potentially applying writes after the user thought cancellation succeeded.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 65d8320. After spawning the detached worker, the job record is updated with the worker pid (guarded by isActiveJobStatus to avoid clobbering a finished job), so /codex:cancel can terminate it. The pre-spawn record write is kept to avoid the worker-startup race.

Comment thread plugins/codex/scripts/lib/render.mjs Outdated
Comment on lines +327 to +328
const message = String(parsedResult?.failureMessage ?? "").trim() || "Codex did not return a final message.";
return `${message}\n`;
return appendTaskStatusToken(message, completeToken);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate complete sentinel on successful task exit status

renderTaskResult now appends [[codex-task status=complete]] unconditionally, including fallback failure messages. runTrackedJob marks non-zero task exits as failed, but the stored rendered output still carries this success token, and the new PostToolUse hook treats that token as authoritative completion success. This lets failed/interrupted rescue runs be misclassified as completed, suppressing retry/escalation exactly when the task did not succeed.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 65d8320. The complete sentinel is now gated on a successful exit status (status 0) rather than on rawOutput presence, so non-zero/failed task exits no longer emit it.

zknpr and others added 2 commits May 25, 2026 01:38
…reground wait, worker pid

Three regressions from e1b8ae3/c4722b5:
- renderTaskResult stamped [[codex-task status=complete]] on failure/empty
  fallback output, so a failed run carried the success sentinel and the
  completion hook read it as success. The sentinel is now appended only on a
  successful exit (status 0); failure/empty output renders without it.
- runForegroundTaskWorker inherited the 240s status-wait default, so foreground
  xhigh runs (routinely >4min) timed out instead of returning the result. The
  inline wait is now unbounded; the detached worker provides survival.
- enqueueBackgroundTask left the queued record pid: null and never persisted
  the spawned worker pid, breaking /codex:cancel. The worker pid is now written
  after spawn, guarded against clobbering a finished job.

Tests: 97/97 (added failure-no-sentinel + worker-pid coverage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A companion --background job has no push channel to the main thread, and the
subagent often paraphrases the [[...dispatched...]] sentinel away — so the hook
could even report a still-running job as "done" (found in real testing).

- The PostToolUse hook now uses companion STATE as ground truth: if the
  workspace has an active (queued/running) task job after a codex-rescue return,
  it injects a watcher instruction — arm `status <jobId> --wait` via the Bash
  tool with run_in_background=true, which exits and re-invokes Claude on terminal
  status. Robust against paraphrased/dropped dispatch tokens and the Bash
  auto-background case.
- The companion --background dispatch message and the rescue/result-handling
  docs now carry the same watcher command.

Gives background codex-rescue a reliable completion notification — the
backgrounded `status --wait` watcher isn't subject to the 600s foreground cap.

Tests: 98/98 (new state-seeded watcher test incl. a paraphrased-dispatch case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2071e212aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +162 to +164
const activeTaskJob = findNewestActiveTaskJob(getHookCwd(input));
if (activeTaskJob) {
return buildWatcherContext(activeTaskJob.id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Check completion token before inferring active task state

buildCompletionContext returns a watcher message as soon as any queued/running task exists in companion state, before it inspects the current tool output for [[codex-task status=complete]]. In a workspace with an unrelated background task already running, a newly completed codex:codex-rescue foreground run will be misclassified as still running, causing the parent thread to ignore the actual final result and keep waiting/polling. This regression is introduced by prioritizing global active-job state over the response’s explicit completion sentinel.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 11db7d5. buildCompletionContext now checks the completion/dispatch token BEFORE the active-task-state inference, so a return carrying the complete sentinel returns COMPLETE even when an unrelated active task job exists; the state inference is only a fallback for paraphrased/dropped dispatch tokens.

Comment on lines +106 to +107
[...loadState(cwd).jobs]
.filter((job) => job?.jobClass === "task" && ACTIVE_TASK_STATUSES.has(String(job.status ?? "")))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope active-task detection to the current Claude session

findNewestActiveTaskJob scans all tracked task jobs in the workspace without any session filter, so the PostToolUse hook can attach a watcher for a different Claude session’s job. When two sessions run in the same repo, a rescue return in session A can be annotated with session B’s running job id, sending the user to poll/cancel the wrong job and masking the actual state of the current run.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 11db7d5. findNewestActiveTaskJob is now scoped to the hook input's session_id (matching job.sessionId), with an unscoped fallback only when no session_id is provided, so it won't point the watcher at another Claude session's active job.

const BASH_AUTO_BACKGROUND_OUTPUT_PREFIX = "Output is being written to:";

function buildWatcherContext(jobId) {
return `codex-rescue background job ${jobId} is RUNNING — there is no automatic push notification. To be notified, arm a watcher: run this via the Bash tool with run_in_background=true: node ${COMPANION_SCRIPT_PATH} status ${jobId} --wait --timeout-ms 1800000 — it blocks until the job is terminal, then exits and re-invokes you. If it returns and the job is still running, re-arm the same command. Do NOT treat the job as done until the watcher reports a terminal status.`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Quote companion script path in watcher command text

The watcher guidance interpolates COMPANION_SCRIPT_PATH into a shell command without quoting it. On installations where the plugin path contains spaces (for example under Application Support), the suggested node ... status command is split into multiple arguments and fails, so users who follow the injected context cannot arm the completion watcher and may miss terminal job transitions.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 11db7d5. The watcher command now quotes the companion path: node "<COMPANION_SCRIPT_PATH>" status ... — handles paths with spaces.

zknpr and others added 2 commits May 25, 2026 11:21
…ession scope, quoting

- Check the completion/dispatch token BEFORE inferring active-task state, so a
  return that actually completed isn't masked by an unrelated active job.
- Scope findNewestActiveTaskJob to the hook's session_id (job.sessionId match),
  with an unscoped fallback when no session_id is present — avoids pointing the
  watcher at another Claude session's job.
- Quote the companion path in the watcher command (handles paths with spaces).

Tests: 100/100 (added complete-token-beats-active-job, session-scoping, and
quoted-path cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ackground/--wait conflict

- Hook: the dispatched-token branch now emits the full watcher instruction
  (buildWatcherContext) instead of a bare "poll /codex:status", matching the
  state-inference fallback; removed the now-unused buildDispatchedContext.
- Companion: handleTask now rejects conflicting --background + --wait with an
  error, mirroring the --resume/--fresh guard (PR openai#346 review P2 — previously
  --background silently won, ignoring the explicit foreground --wait).

Tests: 101/101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zknpr
Copy link
Copy Markdown
Author

zknpr commented May 25, 2026

Addressed the P2 from the c4722b5 review body ("Reject conflicting --background and --wait flags") in 153bd63: handleTask now throws "Choose either --background or --wait, not both." when both flags are present — mirroring the existing --resume/--fresh guard — instead of silently backgrounding and ignoring the explicit foreground --wait.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 153bd63aec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +178 to +181
const activeTaskJob = findNewestActiveTaskJob(getHookCwd(input), input?.session_id);
if (activeTaskJob) {
return buildWatcherContext(activeTaskJob.id);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard active-job fallback with invocation-specific evidence

This fallback infers that the current codex:codex-rescue call is still running whenever any task in the same session is queued/running, and it runs before the failure/empty-return handling. In a session with an unrelated background rescue already active, a new foreground rescue that exits without a sentinel (for example, Codex invocation failure or token loss in agent output) will be misclassified as still running and mapped to the wrong job ID, which can hide a failed run and send users to poll/cancel the wrong task. The fallback should only trigger when the current invocation is tied to that job (e.g., dispatched token/job ID or other invocation-specific signal).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 186e103. The global active-job fallback now only fires when THIS return contains invocation-specific background evidence (a BACKGROUND_INTENT_PATTERN match), so a sentinel-less/failed foreground return with an unrelated concurrent active task is no longer misclassified as still-running — it falls through to the failure/neutral branch. Session scoping retained.

Comment on lines +677 to +680
writeJobFile(job.workspaceRoot, job.id, queuedRecord);
upsertJob(job.workspaceRoot, queuedRecord);

const child = spawnDetachedTaskWorker(cwd, job.id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-check cancellation before spawning detached worker

Fresh evidence in this commit: enqueueBackgroundTask now persists a cancellable queued record (pid: null) before spawn, but still unconditionally spawns the detached worker afterward. If /codex:cancel lands in that window, the job is marked cancelled with no PID to kill, and this code can still start the worker anyway, allowing writes to continue after the user saw a successful cancel. Re-read job state before spawn (or atomically claim start) and skip spawning when status is no longer active.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 186e103. enqueueBackgroundTask re-reads the job record immediately before spawnDetachedTaskWorker and skips the spawn (returning the terminal status, logging the skip) when /codex:cancel marked it non-active in the write->spawn window.

Comment on lines +178 to +180
const activeTaskJob = findNewestActiveTaskJob(getHookCwd(input), input?.session_id);
if (activeTaskJob) {
return buildWatcherContext(activeTaskJob.id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prefer Bash auto-background marker over global active-job fallback

When the Agent output explicitly contains Bash’s auto-background marker, that is invocation-specific evidence that this run detached past the foreground cap; however, the hook first checks for any active task in session state and returns a watcher for that job. In sessions with another active task, auto-backgrounded foreground runs will be misattributed to the unrelated job and lose the output-path guidance needed to follow the detached process. Check the auto-background marker before global state fallback (or require job correlation for fallback).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 186e103. The Bash auto-background marker (invocation-specific) is now checked BEFORE the global state fallback, so an auto-backgrounded run keeps its streamed-output-path guidance instead of being misattributed to an unrelated active job.

Comment on lines +747 to +750
const snapshot = await waitForSingleJobSnapshot(cwd, job.id, {
// PR #346 review: foreground xhigh waits must survive beyond the 240s status default.
timeoutMs: Infinity,
pollIntervalMs: FOREGROUND_TASK_POLL_INTERVAL_MS
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound foreground waits when detached worker fails to start

Foreground task now always waits with timeoutMs: Infinity after dispatching a detached worker. If that worker never transitions the job out of queued/running (for example, process spawn failure/resource exhaustion or early worker bootstrap failure), this wait loop never terminates and the caller hangs indefinitely instead of surfacing an actionable error. Keep a finite guard timeout or detect launch failure before entering an unbounded wait.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 186e103. The foreground Infinity wait now detects when the detached worker pid has exited while the job is still queued/running (failWhenWorkerExits + pid-liveness) and bails with 'background worker exited before completing; check /codex:status ' instead of hanging. Liveness-based, so legitimately long xhigh runs still wait.

- Hook: gate the global active-job fallback on invocation-specific background
  evidence (BACKGROUND_INTENT_PATTERN) and check the Bash auto-background marker
  before it, so a sentinel-less foreground failure isn't misrouted to an
  unrelated concurrent task's job. (P1 + P2)
- Companion: enqueueBackgroundTask re-reads the record before spawning and skips
  the worker if /codex:cancel already marked it terminal in the write->spawn
  window. (P1)
- Companion: the foreground Infinity wait now bails with an actionable error
  when the detached worker pid exits while the job is still active (pid-liveness,
  so long runs still wait), plus a brief missing-job retry for the
  worker-writes-record race. (P2)

Tests: 104/104.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 186e103504

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +622 to +625
const statusToken = buildTaskDispatchedStatusToken(payload.jobId);
return [
statusToken,
`${payload.title} dispatched as background job ${payload.jobId}.`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip dispatched token when launch status is terminal

renderQueuedTaskLaunch always emits [[codex-task status=dispatched ...]] and “dispatched as background job ...” text, but enqueueBackgroundTask can now return a terminal payload (for example status: "cancelled") when cancellation wins before worker spawn. In that race, non-JSON callers are told the job was dispatched and should be watched even though no worker was launched, which misreports state and can send users into unnecessary polling/wait logic; only queued/running launches should produce the dispatched sentinel.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a7aaff1. renderQueuedTaskLaunch is now status-aware: a terminal enqueue payload (e.g. cancelled/failed before a worker launches) renders a plain terminal message with no [[codex-task status=dispatched …]] sentinel and no watcher guidance; only queued/running launches emit the dispatched token.

Covered by task --background cancelled before launch renders terminal text without a dispatched token, which asserts the output contains neither the dispatched sentinel nor "dispatched as background job" and instead shows "was cancelled before a worker launched; no work started."

Comment on lines +767 to +770
writeJobFile(job.workspaceRoot, job.id, {
...spawnedRecord,
pid: child.pid ?? null
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-check cancellation before writing spawned worker PID

After spawning the worker, this branch writes pid back based on a previously read spawnedRecord status. If /codex:cancel lands between that read and this write, cancellation can be overwritten with a queued record and live PID, effectively resurrecting a job the user already cancelled and allowing detached work to continue. Re-read state (or atomically compare-and-set) immediately before persisting the PID so cancelled terminal state cannot be clobbered during launch races.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a7aaff1. persistSpawnedTaskWorkerPid now writes the pid and re-checks cancellation inside a single updateState read-modify-write: if the job is already terminal (a /codex:cancel won the post-spawn race), the pid is not written and the just-spawned worker is killed via terminateProcessTree instead of being resurrected by stale launch data. The terminal check (isTerminalJobRecord) consults both the summary state row and the per-job file, since either can be observed first during a race.

Covered by task --background does not persist a worker pid when cancellation wins after spawn, which injects the cancellation in the post-spawn window and asserts the job stays cancelled, pid stays null, and the orphaned worker process group is killed.

}

async function runForegroundTaskWorker(cwd, job, request, options = {}) {
enqueueBackgroundTask(cwd, job, request);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard foreground path against detached spawn error events

Foreground task now always routes through enqueueBackgroundTask, which spawns a detached worker process; if that spawn fails (for example due transient ENOENT/resource errors), Node emits an error event on the child process and this path has no handler, so the parent process can crash before returning a structured failure or updating job state. Because this commit moved all foreground runs onto this launch path, the failure mode now affects normal synchronous rescue execution as well, not just optional background mode.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a7aaff1. spawnDetachedTaskWorker now wraps spawn in try/catch for synchronous failures, attaches child.once("error", …) for async error events, and treats a missing child.pid as a launch failure. Any spawn failure routes through markTaskLaunchFailed, which atomically marks the job failed so the foreground waiter surfaces a structured error instead of letting an unhandled error event crash the parent.

Covered by foreground task reports a structured failure when the detached worker spawn fails, which emits an async ENOENT error event with no pid and asserts no ETIMEDOUT hang, no Unhandled 'error' event, and a failed payload with the spawn-failure message.

…penai#346 review)

Resolve the launch/cancel race class and spawn-error crash flagged on 186e103
by routing every launch/cancel job-record mutation through a single atomic
updateState read-modify-write instead of read-then-write checks:

- P1 cancel-resurrect race (persistSpawnedTaskWorkerPid): the worker pid is now
  written and cancellation re-checked inside one updateState transition. If a
  concurrent /codex:cancel already made the job terminal, the pid is NOT written
  and the just-spawned worker is killed (terminateProcessTree) instead of being
  resurrected by stale launch data.
- P1 unhandled detached spawn error (spawnDetachedTaskWorker): wrap spawn in
  try/catch for sync failures, attach child.once('error') for async failures,
  and treat a missing child.pid as a launch failure. Any spawn failure
  atomically marks the job failed (markTaskLaunchFailed) so the foreground
  waiter surfaces a structured error instead of crashing or hanging.
- P2 dispatched sentinel on terminal launch (renderQueuedTaskLaunch): a terminal
  enqueue payload (cancelled/failed before worker launch) now renders a plain
  terminal message with NO dispatched sentinel or watcher guidance, so hooks and
  humans never treat a non-started job as live background work.

New helpers isTerminalJobRecord/resolveLaunchStatus check both the summary state
row and the per-job file, since either can be observed first during a race.

Tests: 3 new runtime cases (cancel-after-spawn kills the orphan and keeps the
job cancelled with no pid; spawn failure yields a structured failed payload with
no hang/crash; cancel-before-launch renders terminal text with no dispatched
token). Full suite 107/107 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7aaff1ebd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +171 to +172
if (token?.status === "complete") {
return COMPLETE_CONTEXT;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Ignore forged complete tokens on failed Agent responses

Treating [[codex-task status=complete]] as authoritative before checking the Agent tool status allows a failed codex:codex-rescue return to be misclassified as success whenever the model output itself contains that literal line (for example, if the prompt or generated text echoes it). In that case the hook emits the “COMPLETED and exited” guidance even though tool_response.status is failed, which can suppress retries/escalation and mislead the parent thread about the real run outcome.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb53f47. The structured Agent tool status is now authoritative over any [[codex-task …]] line in the body: buildCompletionContext short-circuits to the failure context via a new hasFailureStatus(tool_response) check before parsing the token, so a failed return whose text merely echoes [[codex-task status=complete]] is reported as a failure, not success. A genuine success (status completed/success/absent — none of which are in FAILURE_STATUSES) still falls through and honors the complete token, so no regression.

Covered by codex-rescue hook reports failure (not success) when a failed return echoes the complete token: status: "failed" + a standalone complete-token line asserts the emitted context matches /exited WITHOUT a success signal/ and not /COMPLETED and exited/.

Comment on lines +176 to +177
if (token?.status === "dispatched" && token.id) {
return buildWatcherContext(token.id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject dispatched tokens from failed Agent responses

The hook trusts a [[codex-task status=dispatched id=...]] line even when the Agent call itself failed, so any failed return that echoes that literal token is reclassified as an active background launch and gets watcher instructions for that job id. This can hide real failures and redirect users to poll/cancel the wrong task, because the dispatched branch runs before failure-status handling.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb53f47 by the same early failure-status gate: because hasFailureStatus returns the failure context before token handling, the dispatched-token branch is never reached for a failed return, so an echoed [[codex-task status=dispatched id=…]] no longer redirects to watcher/poll instructions for a phantom job.

Covered by codex-rescue hook reports failure (not dispatch) when a failed return echoes a dispatched token: status: "failed" + a dispatched-token line asserts the failure context and that no status … --wait watcher instruction is emitted.

Comment on lines +90 to +93
function isSynchronousAgentReturn(toolResponse) {
const status = String(toolResponse?.status ?? "").toLowerCase();
return status !== "async_launched" && status !== "running";
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat sub_agent_entered as a non-terminal Agent status

isSynchronousAgentReturn currently treats every status except running/async_launched as terminal, which includes Agent responses with status: "sub_agent_entered" (interactive subagent handoff). In that state the rescue agent has not finished, but this hook will emit “exited / do not wait” context, causing the parent flow to stop waiting while the subagent is still active.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb53f47. isSynchronousAgentReturn now treats sub_agent_entered as non-terminal (added to a NON_TERMINAL_AGENT_STATUSES set alongside async_launched/running), so the hook stays silent during an interactive subagent handoff instead of emitting "exited / do not wait."

One note in the interest of honesty: I could not independently confirm sub_agent_entered against the harness's Agent tool status enum, so I've implemented this as a defensive guard — it is a no-op if that status never occurs and correct if it does. If you can point me at the authoritative status list I'll align the set exactly (and add any sibling non-terminal statuses).

Covered by codex-rescue hook stays silent on sub_agent_entered handoff: a sub_agent_entered return asserts exit 0 and empty stdout (no additionalContext).

@@ -774,20 +1186,23 @@ async function handleTask(argv) {
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exit non-zero when background task launch returns failed

handleTask always returns success from the --background path after printing the payload, even when enqueueBackgroundTask reports a terminal failure (for example worker spawn/pid failure). This makes task --background exit with code 0 while returning { "status": "failed" }, so scripts and calling agents can treat a failed dispatch as successful and skip retry/escalation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb53f47. The --background branch of handleTask now sets process.exitCode = 1 when payload.status === "failed" (a worker spawn/pid failure from enqueueBackgroundTask), so a failed dispatch no longer exits 0 while printing { "status": "failed" }. A cancelled launch stays exit 0, since a user-initiated cancel before launch is not an error.

Covered by task --background exits non-zero when the worker spawn fails: a stubbed async spawn ENOENT asserts a non-zero exit, payload.status === "failed", and no Unhandled 'error' event.

zknpr and others added 6 commits May 25, 2026 13:26
…instead of hanging

The background-completion watcher reaches detached workers through
`status <jobId> --wait`, but that path called waitForSingleJobSnapshot
WITHOUT failWhenWorkerExits — the dead-worker liveness check (hasExitedActiveWorker)
was only wired into the foreground task path. So when a --background worker died
without writing a terminal status (crash/kill/teardown), the job stayed `running`
forever, the watcher polled that stale status until --timeout-ms elapsed (e.g. 30
min) and returned a misleading timeout, and the same stale `running` blocked
/codex:result and kept the completion hook treating the job as active.

Fix (reuse the proven foreground helpers, no new machinery):
- handleStatus --wait now passes failWhenWorkerExits: true; on workerExited it
  reconciles via failActiveWorkerJob (persists status=failed, pid=null,
  errorMessage) and renders the crash, mirroring runForegroundTaskWorker.
- A plain status <jobId> read also reconciles a dead active worker, so result,
  hooks, and later status callers see the durable failure (failActiveWorkerJob
  persists globally, so the first read repairs state for everyone).

No false positives: hasExitedActiveWorker requires a recorded pid that is not
alive, so pid-null/queued and live-pid jobs are never reconciled.

Tests (+3, suite 110/110): a dead-pid running job makes status --wait return
promptly as failed (no ETIMEDOUT) and persists failed/pid null; a plain status
read reconciles the same; a live-pid (process.pid) running job still times out
and is NOT reconciled (false-positive guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on-zero on failed launch (PR openai#346 review)

Address the 4 findings from the Codex review of a7aaff1:

- P1 (hook): the completion hook honored a [[codex-task status=complete]] /
  [[codex-task status=dispatched]] line in the response body BEFORE checking the
  Agent tool status, so a failed return whose text merely echoed the literal
  token (model output or prompt echo) was misclassified as success / active
  dispatch — suppressing retries and pointing watchers at a phantom job. The
  structured tool status is authoritative: a known failure status now
  short-circuits to the failure context before any token handling. A genuine
  success (status completed/success/none) still honors the complete token.
- P2 (hook): isSynchronousAgentReturn treated every status except
  async_launched/running as terminal, including sub_agent_entered (interactive
  subagent handoff). The hook now treats sub_agent_entered as non-terminal and
  stays silent so it never tells the parent "exited" while the subagent is live.
- P1 (companion): handleTask --background printed the launch payload and always
  returned exit 0, even when enqueueBackgroundTask reported status=failed (worker
  spawn/pid failure). It now sets a non-zero exit code on a failed launch so
  callers don't treat a failed dispatch as success (cancelled stays exit 0).

Tests (+4, suite 114/114): a failed return echoing the complete token yields the
failure context (not COMPLETED); a failed return echoing a dispatched token yields
the failure context with no watcher instruction; a sub_agent_entered return emits
no context; task --background exits non-zero with status=failed on spawn failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ss leak)

A SpawnedCodexAppServerClient spawns `codex app-server` as a NON-detached child
but only killed it via a deferred, unref'd setTimeout in close(). On a normal
host exit (or process.exit()) before close() ran or that timer fired, the OS
does not reap the child, so the app-server orphaned. When the shared broker
can't start (resource pressure / after /reload-plugins), every command falls
back to a local spawn, so orphans snowballed (observed 564 processes), which
then exhausted resources and crashed new workers at ~2 min.

Fix: each client reaps ONLY its own spawned child on host exit via a
process.once("exit") SIGKILL backstop, removed on child error/exit and in
close() (idempotent). No process scanning (parallel sessions own their own
app-servers) and no SIGTERM/SIGINT handlers (a non-exiting signal listener
would keep plain CLI hosts alive; the existing process-tree kill already
covers signals). The only gap closed is normal host exit.

Tests (+2): a locally-spawned app-server is dead after its host exits WITHOUT
close(); close() restores process.listenerCount("exit") to baseline (no leak).
Suite 88/88.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a write-capable `task` run is dispatched against a target outside the
run's cwd, Codex had no up-front signal that its workspace-write sandbox is
rooted at the cwd, so it burned the whole run probing the sandbox
(test -w / ls -ldOe / stat / id) and then failed. (This bites the Agent-tool
codex-rescue dispatch, which inherits the session cwd and cannot change it.)

- Implementer-facing: executeTaskRun now injects a concise workspace-boundary
  notice as a leading, EPHEMERAL element of the Codex turn input for write
  runs only ("write access ONLY within <cwd>; do NOT probe the sandbox; if you
  must write outside it, stop and report it needs --cwd <target>"). It is added
  via buildTurnInput(prompt, { boundaryNote }) so it never touches the stored
  request.prompt, the persistent thread name, or --resume-last. Read-only and
  review runs are unchanged (boundaryNote null).
- Dispatcher-facing: commands/rescue.md documents that codex-rescue writes are
  confined to the run's cwd, that the Agent-tool dispatch inherits the session
  cwd, and that cross-repo work must use the companion with --cwd <target>.

Tests (+3): the notice is prepended to write-task turn input; it is absent from
the stored prompt and thread name (ephemeral); a read-only run gets no notice.
Suite 89/89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agent is reaped at ~143s

Root cause (reproduced cleanly, not the daemon leak): a foreground (blocking)
`node companion task` call running INSIDE the codex-rescue Agent subagent is
auto-backgrounded by the harness and then reaped when the subagent ends —
observed at ~143s — which kills the detached Codex worker mid-run and silently
loses the work. Confirmed by lifetimes: foreground-via-subagent died at
142.1s / 143.3s (multiwarm) and 143.9s (controlled repro), while --background
via the subagent survived 309s and via direct Bash 461s / 492s. --background
returns immediately, so the worker reparents to launchd and survives.

Fix: codex-rescue now defaults to `task --background` whenever the user did not
explicitly pass `--wait`; foreground is reserved for short, clearly-bounded
`--wait` requests (well under ~140s), and substantial/long/write/uncertain work
always goes `--background`. The agent md and rescue.md document why (the
subagent reaping at ~143s, not the ~600s direct-Bash cap that the docs wrongly
cited). The main thread continues to arm the status --wait watcher for
background completion — the pattern proven reliable across this work.

Tests: assertions updated to the new default-background guidance + that the
incorrect ~600s foreground-cap claim is gone. Suite 119/119.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… the app-server

The worker drives the turn but the shared app-server applies file edits, so
edits in flight when the worker dies can land AFTER the worker pid is dead and
after a status snapshot (the "failed-but-wrote" race — confirmed in the wild:
a reaped worker's edits surfaced after the job already read dead/failed). The
dead-worker reconciliation marks such a job `failed` promptly, which is correct
(no result was produced) but can fire mid-write.

buildWorkerExitedError now carries the verify-on-disk discipline in its message:
"... files it was editing may have already landed or may still be landing via
the shared app-server — re-check disk state (`git status`) before concluding,
and do not trust a single snapshot." So any caller (foreground waiter, status
--wait reconcile, result) is warned to re-verify rather than act on one snapshot.

This is now rare in practice: default-`--background` (93e29aa) stops workers
being reaped mid-turn, so they complete normally; this covers the residual
genuine-crash case. Message keeps the existing prefix so callers/tests are
unaffected. Suite 119/119.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zknpr zknpr changed the title fix: deterministic completion signal for codex-rescue (stop waiting on finished runs) codex-rescue reliability pass: completion signal, launch/cancel, and worker-lifecycle hardening May 25, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ebc407a7c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +203 to +207
this._exitReaper = () => {
try {
if (this.proc && this.proc.exitCode === null && !this.proc.killed) {
this.proc.kill("SIGKILL");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid leaving zombie app-server children on host exit

Killing the direct codex app-server child inside an exit handler sends SIGKILL but gives the parent no chance to reap it, so in environments where PID 1 does not promptly reap orphans (common in containers), the child can remain as a persistent <defunct> process. This turns the leak fix into a zombie-process leak over repeated runs and can eventually exhaust process slots; the cleanup path should avoid creating unreaped children at process teardown.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real edge, but I've kept the SIGKILL backstop deliberately — reasoning:

  • The exit handler is the last-resort path; it only fires when close() didn't run (abnormal host exit). close() is the normal path and reaps properly (it awaits the child's exit). So the zombie window only opens on an abrupt host exit.
  • When the host exits right after SIGKILL, the killed child is orphaned and reparented to PID 1, which reaps it promptly under any normal init (launchd on macOS, systemd/init on Linux). On the companion's actual runtime that's a transient <defunct>, not accumulation.
  • The <defunct>-accumulation you describe requires a non-reaping PID 1 (an app-as-PID-1 container with no tini/dumb-init), which isn't a typical environment for the codex companion. There the correct fix is an init shim, not dropping the kill.
  • Crucially, the alternative — not killing to avoid the transient zombie — reinstates the original live-process leak (a running app-server), which is strictly worse than a reaped-by-init zombie. And a synchronous exit handler can't waitpid/reap (beforeExit doesn't fire on process.exit()/signals), so SIGKILL is the only reliable last-resort there.

So I'm treating this as a known low-severity limitation of the sync-exit backstop rather than changing it. Happy to reconsider if there's a synchronous reap approach I'm missing.

Comment on lines +205 to +207
if (this.proc && this.proc.exitCode === null && !this.proc.killed) {
this.proc.kill("SIGKILL");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Kill the whole app-server tree on Windows exit reaping

On Windows this client is spawned with shell: true, so this.proc is cmd.exe and the real codex app-server is a child process; this exit reaper calls this.proc.kill("SIGKILL") directly, which can terminate only the shell while leaving the app-server running. The result is a leaked background app-server after host exit, especially in non-broker/direct mode where this hook is relied on for cleanup.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0a8ecc6. The exit reaper now mirrors close()'s platform branch: on Windows — where shell: true makes this.proc cmd.exe and the real codex app-server is its child — it calls terminateProcessTree(this.proc.pid), which is synchronous (taskkill /PID <pid> /T /F) and therefore safe inside the exit handler, killing the whole tree including the app-server. Non-Windows keeps the direct-child SIGKILL (no shell wrapper there). This reuses the exact terminateProcessTree path close() already uses on Windows.

…openai#346 review P1)

The leak-fix exit reaper called this.proc.kill("SIGKILL") unconditionally, but on
Windows the client is spawned with shell:true, so this.proc is cmd.exe and the
real codex app-server is its child — a flat SIGKILL kills only the shell and
leaks the app-server. Mirror close()'s platform branch: on Windows use
terminateProcessTree (synchronous taskkill /T /F, safe in an exit handler) to
kill the whole tree; non-Windows keeps the direct-child SIGKILL.

(The P2 zombie note is addressed in-thread: the SIGKILL backstop is the right
tradeoff — close() reaps in the normal path, the orphan is reaped by init on any
normal host, and not killing reinstates the worse live-process leak.)

Suite 119/119.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a8ecc61f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1198 to +1199
if (payload.status === "failed") {
process.exitCode = 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat cancelled background launches as command failures

handleTask now exits non-zero only when payload.status === "failed", but enqueueBackgroundTask can also return a terminal "cancelled" payload when cancellation wins before worker spawn. In that race, task --background returns exit code 0 even though no worker was launched, so callers that key off process status can incorrectly treat dispatch as successful and skip retry/escalation logic.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest push. Generalized the exit-code rule to a single invariant: task --background exits 0 iff a live job was dispatched (queued/running); any terminal launch status — failed or cancelled (when a concurrent cancel wins before worker spawn) — now exits non-zero. The check is if (!isActiveJobStatus(payload.status)) process.exitCode = 1;, replacing the narrower status === "failed". So a caller keying off exit status can rely on it to mean "a background job is now running, go poll it." Updated the three cancelled-launch tests to assert non-zero exit (the dispatched/queued case still asserts 0); suite 119/119.

…o (PR openai#346 review P2)

handleTask --background exited non-zero only on status=failed, but
enqueueBackgroundTask can also return a terminal `cancelled` payload when a
concurrent /codex:cancel wins before worker spawn. In that race no worker
launched, yet the command exited 0 — so a caller keying off exit status treats
dispatch as successful and skips retry/escalation.

Generalize to the clean invariant: exit 0 iff a live job was dispatched
(queued/running); any terminal launch status (failed OR cancelled) exits
non-zero. Replaces the `status === "failed"` check with `!isActiveJobStatus`.

Tests: the three cancelled-launch cases now assert non-zero exit (the
dispatched/queued case still asserts 0). Suite 119/119.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant