fix(panel): judge cwd = PR review worktree, not operator's CWD#182
Conversation
Operator observation: writers were spending massive token churn running their own pnpm test / typecheck / lint loops, and judges were burning tokens flagging 'this won't build' findings. Both wasteful — deterministic verify can be run once by cube, authoritatively. New phase folded into phase2_run_writers (keeps phase numbering stable): 1. Writer commits as usual 2. cube runs verify.cmd in the writer worktree, captures combined stdout/stderr to .cube/verify-logs/<task>-attempt-<N>.log 3. On failure: tiny feedback prompt with absolute log path, resume the writer. Writer Reads the log, fixes, commits. send_feedback_async auto-commits on exit (PR #178). 4. Loop up to max_attempts (default 3). On final failure, judges still run. Writer prompt updated: 'Don't run tests / lint / typecheck — cube does it after you exit.' Removes the writer's most expensive habit. Config in cube.yaml verify section. Empty cmd disables the gate. v2 cube.yaml ships with: pnpm install --frozen-lockfile && pnpm typecheck && pnpm lint && pnpm test
WalkthroughThis pull request introduces an automated deterministic verify gate that executes repo-configured verification commands after writers commit, captures logs, retries on failure by resuming the writer with feedback, and reports results to the orchestrator. ChangesVerify Gate Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error)
✅ Passed checks (2 passed)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cube/automation/verify.py`:
- Around line 124-170: The final return currently hardcodes
attempts=max_attempts which misreports when the loop exits early; change the
final VerifyResult return to use the actual last attempt count (e.g.,
attempts=attempt) and ensure attempt is referenced from the for-loop (or fall
back to 0 if not set) so that VerifyResult(ok=False, attempts=attempt,
last_log_path=last_log_path, last_exit_code=last_exit_code) reflects the real
number of verify attempts; reference the for-loop variable "attempt" and the
VerifyResult construction to locate where to update.
In `@python/cube/commands/orchestrate/handlers.py`:
- Around line 89-94: The loop over writer_keys in handlers.py currently uses
"continue" when a writer worktree (computed via WORKTREE_BASE / project_name /
f\"writer-{wconf.name}-{ctx.task_id}\") is missing, which silently bypasses
verification; change this so missing worktrees cause a hard failure instead of
quietly skipping: when get_writer_config(wkey) yields wconf but
worktree.exists() is False, log an error including wkey and wconf.name (use the
existing logger), and either raise a clear exception or record the failure in
the verification result so the phase returns non-success (do not use continue).
Apply the same behavior to the analogous block handling lines 119-124 so missing
writer worktrees consistently fail verify rather than being ignored.
In `@python/cube/commands/orchestrate/prompts.py`:
- Around line 46-59: Update the writer prompt generation to only include the "do
NOT self-verify" paragraph when the repo has a verify command configured (i.e.,
verify.cmd is set/non-empty); detect the verify setting and conditionally append
the string "Focus on the code change. Don't run tests / typecheck / lint — cube
runs them deterministically after you exit. If verify fails, cube will resume
you with the log path; Read it and fix." to the prompt output in
python/cube/commands/orchestrate/prompts.py instead of unconditionally embedding
it, referencing the verify configuration key (verify.cmd) when constructing the
prompt.
In `@python/cube/core/user_config.py`:
- Around line 246-251: The parsing for verify config in load_config() assumes
verify_raw is a dict and that timeout_seconds/max_attempts are int-coercible,
which will raise for malformed YAML like "verify: true" or "timeout_seconds:
'fast'"; update the block that builds verify_raw and verify_cfg (symbols:
verify_raw, VerifyConfig, verify_cfg, load_config) to first ensure verify_raw is
a mapping (fall back to {} if not), extract cmd using str(...) but only if
present, and safely parse timeout_seconds and max_attempts by attempting int()
in a try/except (or using conditional isinstance checks) falling back to the
existing defaults (600 and 3) when parsing fails or values are missing; keep the
creation of VerifyConfig but feed it these validated/coerced values so malformed
types do not abort load_config().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3001ad13-a248-408c-a4bf-8ad0dac6d668
📒 Files selected for processing (4)
python/cube/automation/verify.pypython/cube/commands/orchestrate/handlers.pypython/cube/commands/orchestrate/prompts.pypython/cube/core/user_config.py
📜 Review details
🧰 Additional context used
🪛 Ruff (0.15.13)
python/cube/automation/verify.py
[error] 61-61: subprocess call: check for execution of untrusted input
(S603)
[error] 62-62: Starting a process with a partial executable path
(S607)
| for attempt in range(1, max_attempts + 1): | ||
| log_path = _logs_dir() / f"{task_id}-attempt-{attempt}.log" | ||
| print_info(f"🔬 Verify (attempt {attempt}/{max_attempts}): {verify_cmd}") | ||
| exit_code, tail = _run_once(verify_cmd, worktree, timeout_seconds, log_path) | ||
| last_log_path = log_path | ||
| last_exit_code = exit_code | ||
|
|
||
| if exit_code == 0: | ||
| print_success(f"Verify passed (attempt {attempt}/{max_attempts})") | ||
| return VerifyResult(ok=True, attempts=attempt, last_log_path=log_path, last_exit_code=0) | ||
|
|
||
| print_warning(f"Verify failed (exit {exit_code}). Log: {log_path}") | ||
| console.print(f"[dim]Tail:\n{tail[-1500:]}[/dim]") | ||
|
|
||
| if attempt >= max_attempts: | ||
| print_error( | ||
| f"Verify still failing after {max_attempts} attempts. " | ||
| "Handing current state to judges; they will grade against the failing build." | ||
| ) | ||
| break | ||
|
|
||
| # Write a feedback prompt to disk, then resume the writer with it. | ||
| feedback_path = Path(PROJECT_ROOT) / ".prompts" / f"verify-feedback-{task_id}-{attempt}.md" | ||
| feedback_path.parent.mkdir(parents=True, exist_ok=True) | ||
| feedback_path.write_text(_feedback_prompt(log_path, exit_code, attempt, max_attempts, verify_cmd)) | ||
|
|
||
| from ..core.session import load_session | ||
|
|
||
| session_id = load_session(writer_info.key.upper(), task_id) | ||
| if not session_id: | ||
| print_warning(f"No session to resume for {writer_info.label}; aborting verify loop") | ||
| break | ||
|
|
||
| await send_feedback_async( | ||
| task_id=task_id, | ||
| feedback_file=feedback_path, | ||
| session_id=session_id, | ||
| worktree=worktree, | ||
| writer_name=writer_info.name, | ||
| writer_model=writer_info.model, | ||
| writer_label=writer_info.label, | ||
| writer_key=writer_info.key, | ||
| writer_color=writer_info.color, | ||
| ) | ||
| # send_feedback_async commits any writer changes (PR #178). | ||
|
|
||
| return VerifyResult(ok=False, attempts=max_attempts, last_log_path=last_log_path, last_exit_code=last_exit_code) |
There was a problem hiding this comment.
Return the actual number of verify attempts performed.
If the loop exits early (e.g., Line 154 no session to resume), Line 170 still reports attempts=max_attempts, which misreports execution state.
💡 Suggested fix
- for attempt in range(1, max_attempts + 1):
+ attempts_run = 0
+ for attempt in range(1, max_attempts + 1):
+ attempts_run = attempt
log_path = _logs_dir() / f"{task_id}-attempt-{attempt}.log"
@@
- break
+ break
@@
- break
+ break
@@
- return VerifyResult(ok=False, attempts=max_attempts, last_log_path=last_log_path, last_exit_code=last_exit_code)
+ return VerifyResult(
+ ok=False,
+ attempts=attempts_run,
+ last_log_path=last_log_path,
+ last_exit_code=last_exit_code,
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cube/automation/verify.py` around lines 124 - 170, The final return
currently hardcodes attempts=max_attempts which misreports when the loop exits
early; change the final VerifyResult return to use the actual last attempt count
(e.g., attempts=attempt) and ensure attempt is referenced from the for-loop (or
fall back to 0 if not set) so that VerifyResult(ok=False, attempts=attempt,
last_log_path=last_log_path, last_exit_code=last_exit_code) reflects the real
number of verify attempts; reference the for-loop variable "attempt" and the
VerifyResult construction to locate where to update.
| for wkey in writer_keys: | ||
| wconf = get_writer_config(wkey) | ||
| worktree = WORKTREE_BASE / project_name / f"writer-{wconf.name}-{ctx.task_id}" | ||
| if not worktree.exists(): | ||
| continue | ||
| writers.append( |
There was a problem hiding this comment.
Don’t silently bypass verify when writer worktrees are missing.
Lines 92-94 continue quietly skips verification, and the phase still reports success. If worktree resolution regresses, verify can be effectively disabled with no hard signal.
💡 Suggested fix
- writers: list[WriterInfo] = []
+ writers: list[WriterInfo] = []
+ missing_worktrees: list[str] = []
@@
- if not worktree.exists():
- continue
+ if not worktree.exists():
+ missing_worktrees.append(f"{wconf.label}: {worktree}")
+ continue
@@
+ for item in missing_worktrees:
+ print_warning(f"Verify skipped for missing worktree: {item}")
+
+ if not writers:
+ print_error("Verify is configured but no writer worktrees were found.")
+ raise typer.Exit(1)Also applies to: 119-124
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cube/commands/orchestrate/handlers.py` around lines 89 - 94, The loop
over writer_keys in handlers.py currently uses "continue" when a writer worktree
(computed via WORKTREE_BASE / project_name /
f\"writer-{wconf.name}-{ctx.task_id}\") is missing, which silently bypasses
verification; change this so missing worktrees cause a hard failure instead of
quietly skipping: when get_writer_config(wkey) yields wconf but
worktree.exists() is False, log an error including wkey and wconf.name (use the
existing logger), and either raise a clear exception or record the failure in
the verification result so the phase returns non-success (do not use continue).
Apply the same behavior to the analogous block handling lines 119-124 so missing
writer worktrees consistently fail verify rather than being ignored.
| ### Do NOT self-verify — cube runs the deterministic verify gate | ||
| Cube runs the repo's verify command (typecheck + lint + tests) automatically | ||
| after the writer commits. **Writers must NOT run `pnpm verify` / `task verify` / | ||
| `npm test` / `pytest` / lint themselves.** Reasons: | ||
| - Cube's run is authoritative; writer churn on the same commands is wasted tokens. | ||
| - If verify fails, cube re-resumes the writer with a pointer to the log on disk | ||
| (writer uses `Read` to inspect, fixes, commits — cube runs verify again). | ||
| - Judges only see code that passes verify (or has hit the retry cap), so the | ||
| panel never burns tokens grading "this won't build" findings. | ||
|
|
||
| **Include this in the writer prompt** as an explicit instruction: | ||
| "Focus on the code change. Don't run tests / typecheck / lint — cube runs them | ||
| deterministically after you exit. If verify fails, cube will resume you with | ||
| the log path; Read it and fix." |
There was a problem hiding this comment.
Make verify-gate instructions conditional on config.
This block assumes verify is always enabled, but verify.cmd can be unset/empty. In that case, telling writers “do NOT self-verify” is incorrect and can let unverified changes through.
Suggested patch
async def generate_writer_prompt(task_id: str, task_content: str, prompts_dir: Path) -> Path:
@@
- prompt = f"""# Task: Generate Writer Prompt
+ from ...automation.verify import is_verify_configured
+ from ...core.user_config import load_config
+
+ verify_enabled = is_verify_configured(load_config())
+ verify_instructions = (
+ """### Do NOT self-verify — cube runs the deterministic verify gate
+Cube runs the repo's verify command (typecheck + lint + tests) automatically
+after the writer commits. **Writers must NOT run `pnpm verify` / `task verify` /
+`npm test` / `pytest` / lint themselves.** Reasons:
+- Cube's run is authoritative; writer churn on the same commands is wasted tokens.
+- If verify fails, cube re-resumes the writer with a pointer to the log on disk
+ (writer uses `Read` to inspect, fixes, commits — cube runs verify again).
+- Judges only see code that passes verify (or has hit the retry cap), so the
+ panel never burns tokens grading "this won't build" findings.
+
+**Include this in the writer prompt** as an explicit instruction:
+"Focus on the code change. Don't run tests / typecheck / lint — cube runs them
+deterministically after you exit. If verify fails, cube will resume you with
+the log path; Read it and fix."
+"""
+ if verify_enabled
+ else """### Verification responsibility
+Verify gate is not configured for this repo run. Writers must run the project's
+verification checks (tests/lint/typecheck) before commit/push."""
+ )
+
+ prompt = f"""# Task: Generate Writer Prompt
@@
-### Do NOT self-verify — cube runs the deterministic verify gate
-Cube runs the repo's verify command (typecheck + lint + tests) automatically
-after the writer commits. **Writers must NOT run `pnpm verify` / `task verify` /
-`npm test` / `pytest` / lint themselves.** Reasons:
-- Cube's run is authoritative; writer churn on the same commands is wasted tokens.
-- If verify fails, cube re-resumes the writer with a pointer to the log on disk
- (writer uses `Read` to inspect, fixes, commits — cube runs verify again).
-- Judges only see code that passes verify (or has hit the retry cap), so the
- panel never burns tokens grading "this won't build" findings.
-
-**Include this in the writer prompt** as an explicit instruction:
-"Focus on the code change. Don't run tests / typecheck / lint — cube runs them
-deterministically after you exit. If verify fails, cube will resume you with
-the log path; Read it and fix."
+{verify_instructions}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cube/commands/orchestrate/prompts.py` around lines 46 - 59, Update the
writer prompt generation to only include the "do NOT self-verify" paragraph when
the repo has a verify command configured (i.e., verify.cmd is set/non-empty);
detect the verify setting and conditionally append the string "Focus on the code
change. Don't run tests / typecheck / lint — cube runs them deterministically
after you exit. If verify fails, cube will resume you with the log path; Read it
and fix." to the prompt output in python/cube/commands/orchestrate/prompts.py
instead of unconditionally embedding it, referencing the verify configuration
key (verify.cmd) when constructing the prompt.
| verify_raw = data.get("verify") or {} | ||
| verify_cfg = VerifyConfig( | ||
| cmd=str(verify_raw.get("cmd", "")).strip(), | ||
| timeout_seconds=int(verify_raw.get("timeout_seconds", 600)), | ||
| max_attempts=int(verify_raw.get("max_attempts", 3)), | ||
| ) |
There was a problem hiding this comment.
Harden verify config parsing against malformed YAML values.
Line 246 and Lines 249-250 assume valid types (dict + int-coercible values). A config like verify: true or timeout_seconds: "fast" will throw and abort load_config().
💡 Suggested fix
- verify_raw = data.get("verify") or {}
- verify_cfg = VerifyConfig(
- cmd=str(verify_raw.get("cmd", "")).strip(),
- timeout_seconds=int(verify_raw.get("timeout_seconds", 600)),
- max_attempts=int(verify_raw.get("max_attempts", 3)),
- )
+ verify_raw = data.get("verify")
+ if not isinstance(verify_raw, dict):
+ verify_raw = {}
+
+ def _as_int(value: object, default: int, *, minimum: int) -> int:
+ try:
+ parsed = int(value)
+ except (TypeError, ValueError):
+ return default
+ return max(parsed, minimum)
+
+ verify_cfg = VerifyConfig(
+ cmd=str(verify_raw.get("cmd", "")).strip(),
+ timeout_seconds=_as_int(verify_raw.get("timeout_seconds", 600), 600, minimum=1),
+ max_attempts=_as_int(verify_raw.get("max_attempts", 3), 3, minimum=1),
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| verify_raw = data.get("verify") or {} | |
| verify_cfg = VerifyConfig( | |
| cmd=str(verify_raw.get("cmd", "")).strip(), | |
| timeout_seconds=int(verify_raw.get("timeout_seconds", 600)), | |
| max_attempts=int(verify_raw.get("max_attempts", 3)), | |
| ) | |
| verify_raw = data.get("verify") | |
| if not isinstance(verify_raw, dict): | |
| verify_raw = {} | |
| def _as_int(value: object, default: int, *, minimum: int) -> int: | |
| try: | |
| parsed = int(value) | |
| except (TypeError, ValueError): | |
| return default | |
| return max(parsed, minimum) | |
| verify_cfg = VerifyConfig( | |
| cmd=str(verify_raw.get("cmd", "")).strip(), | |
| timeout_seconds=_as_int(verify_raw.get("timeout_seconds", 600), 600, minimum=1), | |
| max_attempts=_as_int(verify_raw.get("max_attempts", 3), 3, minimum=1), | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cube/core/user_config.py` around lines 246 - 251, The parsing for
verify config in load_config() assumes verify_raw is a dict and that
timeout_seconds/max_attempts are int-coercible, which will raise for malformed
YAML like "verify: true" or "timeout_seconds: 'fast'"; update the block that
builds verify_raw and verify_cfg (symbols: verify_raw, VerifyConfig, verify_cfg,
load_config) to first ensure verify_raw is a mapping (fall back to {} if not),
extract cmd using str(...) but only if present, and safely parse timeout_seconds
and max_attempts by attempting int() in a try/except (or using conditional
isinstance checks) falling back to the existing defaults (600 and 3) when
parsing fails or values are missing; keep the creation of VerifyConfig but feed
it these validated/coerced values so malformed types do not abort load_config().
) (#184) PR #182 and #183 squash-merged to empty diffs because they were stacked off the verify-gate branch instead of main. Re-applying both fixes against the actual main HEAD: 1. judge_panel.run_judge: use judge_info.review_worktree as the cwd when the PR review flow synced one. Stops judges from inheriting the operator's Claude Code session worktree (which may be on a stale unrelated branch). Previously: judges reviewed code from cool-satoshi worktree on commits dozens behind the actual PR. 2. config._find_git_root: use 'git rev-parse --git-common-dir' to find the MAIN repo working tree, not the per-Claude-Code-session worktree. All worktrees of the same repo now share .agent-sessions/, .prompts/decisions/, .cube/. Eliminates the 'No session found' regression that came from state being scattered across worktrees. Verified: PROJECT_ROOT resolves to the main repo root from both the main checkout and any Claude Code session worktree. 237 tests pass. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
… canonical path (#187) Codex judges run with --sandbox workspace-write and --cd <pr-N worktree> (PR #182). The sandbox blocks writes anywhere outside the cwd workspace — including the main repo's .prompts/decisions/ where decision JSONs MUST land. Result: gpt-5.5 judges (Backend, Frontend & UX) silently failed to write their decisions; only opus/claude judges (which use a different sandbox) actually persisted. Pass --add-dir <project_root> to codex when worktree differs from PROJECT_ROOT. Sandbox stays in place; just adds the main repo to the writeable allowlist so judges can write the decision file at its canonical absolute path. find_decision_file's worktree-scan fallback still acts as a safety net for any judge that writes to the worktree's .prompts/decisions/ instead. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Symptom
Operator running cube from a Claude Code session worktree (e.g. `.claude/worktrees/cool-satoshi-bd8d21`) was seeing judges report findings against code that didn't exist on the actual PR. Investigation showed judges were Reading from the operator's session worktree — dozens of commits behind, on an unrelated branch — not from the cube-synced PR review worktree.
Specifically: judges flagged `apps/api/.../line:728` as buggy. Line 728 on the PR head was a totally different function from line 728 on the operator's session branch.
Root cause
`run_dir = WORKTREE_BASE.parent if cli_name == "gemini" else PROJECT_ROOT`
For all non-gemini CLIs, the judge inherits cwd = PROJECT_ROOT (the operator's repo checkout). Relative-path Reads land there, regardless of what worktree the prompt instructed.
Fix
Use `judge_info.review_worktree` as the cwd when set (PR #172 already populates it from the synced PR worktree at `~/.cube/worktrees//pr-/`). Falls back to the legacy paths when no review worktree was wired (writer reviews, gemini, etc).
Test plan
🤖 Generated with Claude Code
Deterministic Verify Gate Between Writer and Judge Phases
This PR introduces a deterministic verification system that runs between the writer and judge phases to eliminate wasteful token churn.
Problem Solved:
Writers were spending massive tokens running their own test/lint/typecheck loops, and judges were burning tokens flagging "this won't build" findings. The verify gate runs once authoritatively instead.
How It Works:
verify.cmdin the writer's worktree, capturing logs to.cube/verify-logs/<task>-attempt-<N>.logmax_attempts(default 3)Key Changes:
verify.py(new, 181 lines): ImplementsVerifyResultdataclass andrun_verify_loop()which executes the verify command with timeout handling, log truncation, and async writer feedback resumption viasend_feedback_asynchandlers.pyPhase 2: Extendedphase2_run_writersto integrate the verify-and-repair loop, deriving writer worktrees and independently running verification for eachVerifyConfigblock incube.yamlwith fieldscmd,timeout_seconds(600s default), andmax_attempts(3 default); emptycmddisables the gatepnpm install --frozen-lockfile && pnpm typecheck && pnpm lint && pnpm testImpact:
Removes the writer's most expensive habit, streamlines the workflow by centralising verification, and surfaces failure context to judges when verification does fail.