diff --git a/codev/plans/984-multi-architect-coordination-l.md b/codev/plans/984-multi-architect-coordination-l.md new file mode 100644 index 000000000..600ef8fa1 --- /dev/null +++ b/codev/plans/984-multi-architect-coordination-l.md @@ -0,0 +1,320 @@ +# Implementation Plan: Multi-Architect Coordination Layer + +## Metadata +- **ID**: plan-2026-06-04-984-multi-architect-coordination-l +- **Status**: draft +- **Specification**: [`codev/specs/984-multi-architect-coordination-l.md`](../specs/984-multi-architect-coordination-l.md) +- **Created**: 2026-06-04 +- **GitHub Issue**: [#984](https://github.com/cluesmith/codev/issues/984) + +## Executive Summary + +Implements the six-point coordination layer from the approved spec, in dependency order. The **ownership ledger** (#3) is the data foundation; the **roster** (#1) reads it; **bounded state files** (#5) supply the template the **lifecycle** (#4) creates and archives; the **board** (#2) extends the dashboard overview; **builder-base SHA-pin** (#6 — architect-directed, replacing worktree isolation) is a localized change to the spawn base-resolution path. A final **documentation** phase surfaces every new command/flag. + +All phases ship as commits within a **single PR** (per the builder PR strategy and the spec's Amendment 1 — #6 is now small enough that no slicing is needed). Each phase is independently committable and testable. Backward compatibility for single-`main` workspaces is verified per phase (N=1 must be unchanged). + +**Integration anchors** (verified against the codebase): +- SQLite schema: `packages/codev/src/agent-farm/db/schema.ts` (`LOCAL_SCHEMA`, `CREATE TABLE IF NOT EXISTS`); versioned migrations in `db/index.ts` `ensureLocalDatabase()`. +- State CRUD: `packages/codev/src/agent-farm/state.ts` (`upsertBuilder`, `setArchitectByName`, `loadState`, `removeArchitect`). +- Spawn: `commands/spawn.ts` (`SPAWNING_ARCHITECT_NAME` at module load; `upsertBuilder` in `spawnSpec`); base resolution in `commands/spawn-worktree.ts` (`createWorktree`). +- Tower: routes in `servers/tower-routes.ts`; architect add/remove impl in `servers/tower-instances.ts` (`addArchitect` ~L987, `removeArchitect` ~L1161, both around `setArchitectByName`); REST client in `packages/core/src/tower-client.ts`. +- Overview/board: `servers/overview.ts` (`OverviewCache`, `discoverBuilders`, porch `status.yaml` parsing); dashboard `packages/dashboard/src/components/WorkView.tsx` + `BuilderCard.tsx`; shared types in `packages/types/src/api.ts`. +- CLI: `packages/codev/src/agent-farm/cli.ts` (Commander `.command().action()`). +- Tests: Vitest in `packages/codev/src/agent-farm/__tests__/` (`state.test.ts`, `tower-routes.test.ts`, e2e under `__tests__/e2e/`). + +## Success Metrics +- [ ] All spec Success Criteria (#1–#6 + backward compat) met. +- [ ] Each phase has unit tests; critical paths have integration tests; dashboard changes have Playwright coverage at N=1 and N>1. +- [ ] No reduction in existing coverage; the afx/Tower/dashboard suites stay green. +- [ ] N=1 (single `main` architect) behavior and dashboard rendering are unchanged. +- [ ] Documentation (`CLAUDE.md`/`AGENTS.md`, `agent-farm.md`, role docs, skeleton templates) updated for every new surface. + +## Phases (Machine Readable) + + + +```json +{ + "phases": [ + {"id": "phase_1_ledger", "title": "Issue-ownership ledger + dedup-at-spawn (#3): SQLite table + migration, ledger CRUD, spawn recording, dedup gate with --override-owner, CODEV_ARCHITECT_NAME N>1 validation"}, + {"id": "phase_2_roster", "title": "Architect roster (#1): afx architects [--json] command + GET /api/workspaces/:enc/architects Tower endpoint, joining architect table to ledger + builders + overview cache + last-activity"}, + {"id": "phase_3_bounded_state", "title": "Bounded/rotated state files (#5): templated head//history, afx state rotate command + entry-boundary rotation util (loss-free), role-doc/template updates"}, + {"id": "phase_4_lifecycle", "title": "Architect lifecycle (#4): create architect state file from template on add-architect (idempotent); archive state file + release ledger + re-home builders to main on remove-architect"}, + {"id": "phase_5_board", "title": "Unified board / who-owes-next (#2): extend overview API + WorkView to group open threads by owning architect with a total who-owes-next function; optional afx architects --board text digest"}, + {"id": "phase_6_sha_pin", "title": "Builder-base SHA-pin (#6): at spawn, fetch + rev-parse the integration/default branch tip and branch the builder from that SHA; fail-fast on fetch error; --base override; read default branch (no hardcoded main)"}, + {"id": "phase_7_docs", "title": "Documentation: CLAUDE.md/AGENTS.md, agent-farm.md, role docs, skeleton templates for afx architects, --override-owner, --base, state rotation, lifecycle semantics"} + ] +} +``` + +## Phase Breakdown + +### Phase 1: Issue-ownership ledger + dedup-at-spawn (#3) +**Dependencies**: None — the data foundation. + +#### Objectives +- Persist a durable `issue_number → architect` ownership record at spawn, independent of builder lifecycle. +- Block a second architect from spawning on an issue already owned by a different architect, unless explicitly overridden. + +#### Files +- `packages/codev/src/agent-farm/db/schema.ts` — add `issue_ownership` table to `LOCAL_SCHEMA` (`CREATE TABLE IF NOT EXISTS`) **plus** the partial unique index `CREATE UNIQUE INDEX IF NOT EXISTS … WHERE released = 0`. Because `db.exec(LOCAL_SCHEMA)` runs on every DB init (confirmed — Gemini/Claude), this auto-applies to existing DBs; a discrete numbered migration in `db/index.ts` is **optional** (only needed if we later alter columns/backfill) — note it but don't require it. +- `packages/codev/src/agent-farm/state.ts` — ledger CRUD: `recordOwnership`, `getOwner`, `releaseOwnership`, `listOwnershipForArchitect`, `overrideOwnership` (release-and-reinsert with `override_of`). +- `packages/codev/src/agent-farm/commands/spawn.ts` — **claim ownership *before* worktree/session creation** (see Implementation Details); resolve the spawning architect (with `--override-owner`), validate `CODEV_ARCHITECT_NAME` in N>1, run the dedup check. Touches **only the numbered-issue spawn paths** (`spawnSpec` and `spawnIssueDrivenBuilder`), **not** `spawnTask`/`spawnShell`/`spawnWorktree`/`spawnProtocol` (those have no issue number — Claude 3b). +- `packages/codev/src/agent-farm/types.ts` — add `overrideOwner?` / `base?` to `SpawnOptions` (Codex — required plumbing for the new flags). +- `packages/codev/src/agent-farm/cli.ts` — add `--override-owner ` to the `spawn` command. +- `packages/codev/src/agent-farm/__tests__/state.test.ts` (+ a focused `ledger.test.ts`) — CRUD + dedup + atomicity tests; **`__tests__/spawn.test.ts`** — extend the existing spawn-validation tests for `--override-owner` (Codex). + +#### Implementation Details +- **Schema**: columns `workspace_path TEXT NOT NULL, issue_number TEXT NOT NULL, architect TEXT NOT NULL, created_at TEXT DEFAULT (datetime('now')), released INTEGER NOT NULL DEFAULT 0, released_at TEXT, override_of TEXT`. Partial unique index `CREATE UNIQUE INDEX IF NOT EXISTS idx_issue_ownership_live ON issue_ownership(workspace_path, issue_number) WHERE released = 0` — makes the dedup check-then-insert **atomic** (a concurrent double-spawn loses on the constraint; surface that as the dedup warning). +- **Ownership is claimed BEFORE side effects (Codex — important).** Today `spawnSpec` creates the worktree/session *before* `upsertBuilder`. If the ledger write/check happened after `upsertBuilder`, a duplicate would only be refused *after* the worktree/session already exist. Therefore: run the dedup check **and** insert the live ledger row (atomically, via the unique index) **first**, before `createWorktree`/session creation. The `builder_id` may not exist yet at claim time — claim with `architect` + `issue_number` (nullable `builder_id`, backfilled on success). **Rollback**: if a later spawn step fails, release the just-claimed ledger row (or mark it released) so a failed spawn does not leave a phantom owner. This makes the dedup gate actually preventive. +- **Architect resolution at spawn**: `SPAWNING_ARCHITECT_NAME` currently = `process.env.CODEV_ARCHITECT_NAME || 'main'` (module-scope, `spawn.ts:36`). Replace with a **per-invocation resolver helper** (so `--override-owner` can supersede it and N>1 validation runs). **N>1 validation**: if more than one architect is registered and the resolved name is not a registered architect, fail loud (no `main` default). N=1 keeps the `main` fallback. +- **Dedup gate**: look up the live owner for `issue_number`; if it exists and differs from the spawning architect, print a clear warning naming the owner and **refuse** unless `--override-owner` was passed. `--resume` and same-architect spawns never trip. Override = `releaseOwnership(prior)` + `recordOwnership(new, override_of=prior)`. +- **Closed-issue dedup (resolves spec nice-to-know OQ)**: for now, **treat a closed prior-owned issue the same as open** (block + `--override-owner`). Revisit only if feedback says warn-but-not-block is preferable; documented so the gate logic is unambiguous for the builder. +- **Ledger durability**: builder `cleanup` and issue close do **not** release entries (per spec); release happens only via override, `remove-architect` (Phase 4), or a future explicit command. + +#### Acceptance Criteria +- [ ] Spawning issue N records `N→`; re-spawn/resume by the same architect does not block. +- [ ] A different architect spawning N is refused with a named warning; `--override-owner` proceeds and records `override_of`. +- [ ] Concurrent double-spawn yields exactly one live owner (unique-index test). +- [ ] N>1 spawn with unset/unknown `CODEV_ARCHITECT_NAME` is refused; N=1 unchanged. +- [ ] Migration creates the table + index on a pre-existing DB without data loss. + +#### Test Plan +- **Unit**: ledger CRUD; override release-and-reinsert; partial-unique-index rejects a second live owner; N>1 name validation. +- **Integration**: a spawn path test (mocked git/Tower) that records ownership and blocks a cross-architect spawn. + +#### Risks +- **Migration on existing DBs** → use `CREATE TABLE/INDEX IF NOT EXISTS` + a versioned migration; test against a seeded old DB. +- **Resolving architect per-invocation without breaking module-scope readers** → centralize resolution in a helper used by all spawn paths (`spawnSpec`, `spawnTask`, etc.). + +--- + +### Phase 2: Architect roster — `afx architects` (#1) +**Dependencies**: Phase 1 (reads the ledger). + +#### Objectives +- One authoritative table: per architect → owned issues (+open/closed), live builders, last-activity, state-file path. + +#### Files +- `packages/codev/src/agent-farm/commands/architects.ts` — new command (table + `--json`). +- `packages/codev/src/agent-farm/cli.ts` — register `afx architects [--json]`. +- `packages/codev/src/agent-farm/servers/tower-routes.ts` — new `GET /api/workspaces/:encoded/architects` handler (the POST/DELETE on this path already exist for add/remove; add the GET branch). +- `packages/core/src/tower-client.ts` — `getArchitects(workspacePath)` client method. +- `packages/codev/src/agent-farm/servers/overview.ts` — reuse `OverviewCache` issue list for open/closed; expose a helper if needed. +- `packages/types/src/api.ts` — `ArchitectRosterRow` shape. +- `packages/codev/src/agent-farm/__tests__/` — roster join + degraded-forge unit tests; a route test. + +#### Implementation Details +- **Join**: `architect` table (via `loadState`) ⋈ ledger (Phase 1) ⋈ `builders` (`spawned_by_architect = name`) ⋈ last-activity. +- **Open/closed**: intersect ledger issue ids against the **existing overview/issue cache** (no per-issue forge shell-out). When the cache is unavailable/stale, render status `unknown` (never block on forge). +- **Last-activity**: `max(architect terminal-session timestamp, most-recent owned-builder updated_at)`; `—` if neither. +- **State-file path**: `codev/state/architects/.md` (created by Phase 4; the roster shows the path whether or not it exists yet, flagging missing). +- **N=1**: a single coherent `main` row; `--json` parses. + +#### Acceptance Criteria +- [ ] `afx architects` renders the joined table at N=1 and N>1; `--json` is valid and parseable. +- [ ] Open/closed labels come from the cache; forge outage degrades to `unknown` without error or >1s latency. +- [ ] Last-activity matches the most-recent of terminal/builder activity. + +#### Test Plan +- **Unit**: roster assembly from seeded architect/ledger/builder rows; degraded-forge path; last-activity selection. +- **Integration**: `GET …/architects` route returns the expected JSON; CLI renders it. + +#### Risks +- **<1s budget** → no per-issue shell-out; reuse the cache. Test with a stubbed slow/absent forge. + +--- + +### Phase 3: Bounded / rotated state files (#5) +**Dependencies**: None (templates + rotation util). Ordered before Phase 4 because Phase 4 creates the architect state file from this template. + +#### Objectives +- A templated head/history structure with a machine-parseable boundary, and a loss-free rotation that keeps cold-resume cost bounded. + +#### Files +- `codev/state/` template assets — an **architect-state template** and the **boundary marker** convention; mirror into `codev-skeleton/` where adopters need it. +- `codev/roles/builder.md` + `codev-skeleton/roles/builder.md` — teach the head/``/history structure for builder `*_thread.md` (extends the #823 thread-file section). +- `codev/roles/architect.md` + `codev-skeleton/roles/architect.md` — same structure for the architect state file. +- `packages/codev/src/lib/state-rotation.ts` — pure rotation function (committed home: codev-specific, not a cross-package `packages/core` utility — Claude 3a): parse below the boundary, move oldest **whole entries** into `codev/state/archive/-.md`, never split a fenced code block. Creates `codev/state/archive/` (`mkdir -p` equivalent) on first archive. +- `packages/codev/src/agent-farm/commands/state-rotate.ts` + `cli.ts` — `afx state rotate ` (and an internal entry point for opportunistic triggers). +- `.gitignore` — ensure `codev/state/archive/` retention is handled per project convention (commit vs ignore — follow existing `codev/state/` disposition; archives are history, default commit). +- Tests — rotation correctness + loss-free reconstruction. + +#### Implementation Details +- **Template regions**: `# ` + bounded head (AI-maintained "current state"), then `<!-- ARCHIVE BOUNDARY -->`, then `## History` (append-only entries delimited by `### ` or `---`). +- **Rotation**: operates **only** below the boundary, at whole-entry boundaries; when the history region exceeds the cap (configurable; default in bytes/lines), move the oldest entries to the dated archive. The head is never auto-truncated (convention-bounded). **Loss-free invariant**: `archive + live` concatenation reconstructs the original (tested). +- **Trigger**: explicit `afx state rotate <id>` + opportunistic call on lifecycle/digest events (Phase 4 add/remove, Phase 5 board regeneration). **No daemon / file watcher.** +- **Date in archive name**: passed in by the caller (the rotation util takes a date argument; the CLI supplies it) — keeps the util pure/testable. + +#### Acceptance Criteria +- [ ] A state file exceeding the cap rotates only whole history entries below the boundary; a fenced code block is never split; head untouched. +- [ ] `archive + live` reconstructs the original content (loss-free). +- [ ] Builder and architect role docs/templates carry the structure; `copy-skeleton` validation confirms the shipped skeleton matches. +- [ ] No new always-on process is introduced. + +#### Test Plan +- **Unit**: rotation on fixtures (with/without code blocks, under/over cap); reconstruction invariant; idempotence (rotating an under-cap file is a no-op). +- **Manual**: run `afx state rotate` on a large fixture thread file. + +#### Risks +- **History loss / markdown corruption** → relocate-not-delete; entry-boundary parsing; reconstruction test is a hard gate. +- **AI not following the template head** → role-doc guidance; rotation only touches the history region, so a sloppy head never causes data loss. + +--- + +### Phase 4: Architect lifecycle — state file on add, archive + release on retire (#4) +**Dependencies**: Phase 1 (ledger release) + Phase 3 (state-file template). + +#### Objectives +- Add-architect always yields a state file (from template, idempotent); remove-architect archives it and releases owned builders/issues — no hand-renaming, no orphans. + +#### Files +- `packages/codev/src/agent-farm/servers/tower-instances.ts` — hook into `addArchitect` (function ~L868; the `setArchitectByName` call within is ~L987) and `removeArchitect` (function ~L1100; the `setArchitectByName(...,null)` call within is ~L1161) — line numbers per Claude 3c. **On add**: create `codev/state/architects/<name>.md` from the Phase 3 template if absent (never clobber), creating the `codev/state/architects/` directory if missing (`mkdir -p` equivalent). **On remove**: archive the state file → `codev/state/archive/architects/<name>-<date>.md` (create dir if missing), release ledger entries, re-home builders. +- `packages/codev/src/agent-farm/state.ts` — `releaseOwnershipForArchitect(name)` and `rehomeBuildersToMain(name)` (`UPDATE builders SET spawned_by_architect='main' WHERE spawned_by_architect=name`). +- (Possibly) `commands/workspace-add-architect.ts` / `workspace-remove-architect.ts` — only if any client-side messaging needs updating; the substantive work is server-side. +- Tests — add creates file (idempotent); remove archives + releases + re-homes. + +#### Implementation Details +- **On add** (ordered): create state file from template if missing (idempotent — re-adding a retired name restores rather than overwrites). +- **On remove** (ordered): (1) archive state file (recoverable, not deleted); (2) `releaseOwnershipForArchitect`; (3) `rehomeBuildersToMain` — builders **keep running**; routing already falls back to `main` post-#774, so the column update makes the new home explicit, no routing-code change; (4) remove architect row (existing). **No worktree removal** (per Amendment 1, #6 is SHA-pin — there is no architect worktree). +- **`main` guard**: `remove-architect` already refuses to remove `main` — unchanged. + +#### Acceptance Criteria +- [ ] `add-architect <name>` creates `codev/state/architects/<name>.md`; re-running does not clobber an edited file. +- [ ] `remove-architect <name>` archives the state file (recoverable), marks its ledger entries released, and re-homes its live builders to `main` (still running). +- [ ] Routing from a re-homed builder lands on `main`. + +#### Test Plan +- **Unit**: idempotent create; archive move; ledger release; builder re-home SQL. +- **Integration**: route-level add then remove, asserting filesystem + DB + SSE side effects (reuse the `architects-updated` SSE from #823). + +#### Risks +- **Workspace-relative path resolution (Claude — elevated to a risk)**: these filesystem ops run inside Tower route handlers (`tower-instances.ts`), so a relative-vs-absolute path mistake puts the state file in the wrong place. Resolve `codev/state/architects/` from the **workspace path the handler already has** (the same one passed to `setArchitectByName`), and `mkdir -p` the new subdirectories (`codev/state/architects/`, `codev/state/archive/architects/`) since they don't pre-exist. +- **Filesystem race with a live AI writing the file** → archive is a move of the file as-is; the AI re-creates on next write if needed (acceptable; documented). + +--- + +### Phase 5: Unified board / who-owes-next (#2) +**Dependencies**: Phase 1 (ledger) + Phase 2 (roster data/types). + +#### Objectives +- The dashboard Work view can group open threads by owning architect, each with a deterministic who-owes-next signal; optional CLI text digest. + +#### Files +- `packages/codev/src/agent-farm/servers/overview.ts` — add the who-owes-next derivation (a **total function** over already-parsed porch/overview state) and architect-grouping data to the overview payload. **Crucially, join the ownership ledger (Phase 1)** so that *owned-but-unspawned* issues appear: for a live ledger entry with no active builder, **synthesize an overview item** (a minimal `OverviewBuilder`-shaped row, or a dedicated `OverviewOwnedIssue` entry in the payload) so the board can render it (Codex + Gemini — the board data model is the ledger ⋈ builders, not builders alone). +- `packages/types/src/api.ts` — extend the overview shape with who-owes-next + grouping fields, and the synthesized owned-issue shape (additive, optional). +- `packages/dashboard/src/components/WorkView.tsx` (+ CSS) — group by owning architect and render who-owes-next **only when `architectCount > 1`** (reuse `architectCount` + `OverviewBuilder.spawnedByArchitect` from #823 for the live-builder rows; render synthesized owned-issue rows alongside — **no new dashboard deps**). +- `packages/codev/src/agent-farm/commands/architects.ts` — optional `--board` text digest (nice-to-know; may defer). +- Tests — who-owes-next unit tests (all cases incl. `unknown` and owned-unspawned); Playwright N=1-identical + N>1 grouping. + +#### Implementation Details +- **Board data model**: the board groups by owning architect over the union of (a) live builders enriched with `spawnedByArchitect` and (b) **ledger-owned issues with no active builder** (synthesized from the Phase 1 ledger join in `overview.ts`). Without (b), the "owned-but-unspawned" who-owes-next case has no row to attach to. +- **who-owes-next** (total, with `unknown` fallback): pending human gate → architect; mid-phase progressing → builder; review-complete / PR-open-awaiting-merge → architect; ledger entry with no active builder → architect; idle/stuck builder → builder (stalled); else → `unknown`. +- **N=1 invariant**: grouping/who-owes-next render is gated behind `architectCount > 1`; N=1 Work view renders byte-identically to today. +- **Live-update (optional, noted by Claude)**: the roster/board refresh on the existing overview poll. A dedicated SSE event for ledger writes is a possible nice-to-have but is **not required** (the poll already picks up changes); left out of scope unless requested. + +#### Acceptance Criteria +- [ ] Work view groups open threads by owning architect at N>1, each showing item, state/phase, who-owes-next. +- [ ] who-owes-next returns a defined value (never crashes) for every state, including `unknown`. +- [ ] N=1 Work view is byte-identical to pre-984 (Playwright snapshot/textContent). + +#### Test Plan +- **Unit**: who-owes-next over every enumerated state + a deliberately ambiguous one (→ `unknown`). +- **Playwright**: N=1 identical; N=2/3 grouping + who-owes-next labels render (per the UI-visual-verification discipline). + +#### Risks +- **N=1 regression** → gate all new rendering behind `architectCount > 1`; visual-verify. +- **who-owes-next ambiguity** → enforced `unknown` fallback; tested. + +--- + +### Phase 6: Builder-base SHA-pin (#6) +**Dependencies**: None (localized to the spawn base-resolution path). Independent — could be reordered earlier; kept here per the spec's "staged last" guidance. + +#### Objectives +- Every builder branches from a fresh, known-good integration-branch-tip SHA captured at spawn — never the shared checkout's stale live HEAD. + +#### Files +- `packages/codev/src/agent-farm/commands/spawn-worktree.ts` — modify **`createWorktree()` only** (the fresh-spawn path at ~L158). **Do NOT** modify `createWorktreeFromBranch()` (~L297) — it already fetches and handles existing remote branches for `--resume`/fork PRs; touching it would double-fetch and break resume (Claude 3d). +- `packages/codev/src/agent-farm/commands/spawn.ts` — thread a resolved `--base` through to `createWorktree`; resolve the default branch (helper). +- `packages/codev/src/agent-farm/types.ts` — `base?` already added in Phase 1's `SpawnOptions` change; ensure it reaches `createWorktree`. +- `packages/codev/src/agent-farm/cli.ts` — add `--base <ref|sha>` to `spawn`. +- Tests — base-pin from a deliberately-stale checkout; fail-fast on fetch error; `--base` override; non-`main` default branch. + +#### Implementation Details +- **Attached-branch flow (Codex — the original sketch would detach):** `git branch <builder-branch> <sha>` followed by `git worktree add <path> <sha>` would leave the worktree **detached**. Correct sequence: (1) resolve base SHA (below); (2) `git branch <builder-branch> <base-sha>` to create the branch *pointing at the pinned SHA*; (3) `git worktree add <path> <builder-branch>` to attach the worktree to that branch. The builder ends up on a named branch whose tip is the fresh base — not detached HEAD. +- **Base SHA resolution**: (1) determine the integration/default branch via a **fallback chain** (Claude 3e): workspace config → `git symbolic-ref refs/remotes/origin/HEAD` → `git remote show origin` (heavier but reliable) → **fail loud** (never assume `main`); (2) `git fetch <remote> <branch>`; (3) `git rev-parse <remote>/<branch>` → base SHA. +- **Fail-fast**: if the fetch fails, **error loudly**; do **not** silently fall back to stale local HEAD. `--base <ref|sha>` lets the architect pin an explicit base deliberately (incl. `--base HEAD` for an intentional offline spawn — Gemini's offline note). +- **Interaction with `--branch`/resume**: when an explicit `--branch`/existing-branch/resume path is taken (i.e. `createWorktreeFromBranch`), SHA-pin does **not** apply — that path already targets a specific branch; `--base` is for the fresh-spawn path only. +- **Backward compat**: common case resolves to a fresh `origin/main` — what operators already expect; no new directories/config. + +#### Acceptance Criteria +- [ ] With the shared checkout left on a stale commit/other branch, a spawned builder's base is the freshly-fetched integration tip (merge-base with `origin/main` = `origin/main` tip). +- [ ] Fetch failure fails the spawn loud; `--base <sha>` branches from the explicit SHA. +- [ ] Default branch is read, not hardcoded (a workspace defaulting to `ci` pins `origin/ci`). + +#### Test Plan +- **Integration fixture setup (Claude)**: create a temp git repo where the local checkout's HEAD is deliberately **behind** `origin/<default>` (e.g. init a bare "origin", clone, add commits to origin, leave the local behind), then spawn → assert the builder branch's tip == origin's tip (merge-base check), **not** the stale local HEAD. +- **Unit/Integration**: simulate fetch failure → loud error (no silent fallback); `--base <sha>` override path; non-`main` default branch (workspace defaulting to `ci`); assert the worktree is **attached** to a named branch (not detached). + +#### Risks +- **Wrong base / silent stale fallback** → read the real default branch; fail-fast (no silent fallback); `--base` escape hatch — all tested. +- **Added fetch latency** → bounded single fetch; correctness-critical; acceptable. + +--- + +### Phase 7: Documentation +**Dependencies**: All prior phases (documents their surfaces). + +#### Objectives +- Every new command/flag/semantic is discoverable in the canonical docs and role files. + +#### Files +- `CLAUDE.md` + `AGENTS.md` (repo root) — `afx architects`, `--override-owner`, `--base`, state rotation, lifecycle/re-home semantics; keep the two byte-identical. +- `codev/resources/commands/agent-farm.md` — full reference entries. +- `codev-skeleton/templates/CLAUDE.md` + `…/AGENTS.md` — adopter-facing variants. +- `codev/roles/architect.md` + `codev/roles/builder.md` (+ skeleton copies) — any role-facing notes not already added in Phase 3. +- Run the repo's `copy-skeleton`/validation to confirm shipped skeleton matches. + +#### Implementation Details +- Mirror the #823 documentation pattern (one coherent section per surface; CLAUDE.md/AGENTS.md identical; skeleton templates adopter-friendly). + +#### Acceptance Criteria +- [ ] All new surfaces documented; CLAUDE.md == AGENTS.md; skeleton templates carry equivalent content; `copy-skeleton` validation passes. + +#### Test Plan +- **Manual/CI**: skeleton-sync validation; doc-link check if present. + +#### Risks +- **Drift between CLAUDE.md and AGENTS.md** → edit atomically; verify byte-identity. + +--- + +## Cross-Phase Risk Assessment +- **Six interdependent points in one PR** → strict dependency ordering (1 → 2; 3 → 4; 1 → 5; 6 independent; 7 last); each phase a self-contained commit; CMAP at PR. +- **N=1 backward compatibility** → every phase that touches a user surface verifies N=1 is unchanged; dashboard changes gated behind `architectCount > 1` and Playwright-verified. +- **DB migration safety** → `IF NOT EXISTS` + versioned migration tested against a seeded old DB. +- **#6 correctness** → fail-fast + read-real-default-branch + `--base`, all tested (the spec's highest-impact risk row). +- **Privacy** → no source-workspace identifiers in code/tests/fixtures/docs (generalized fixtures only). + +## Consultation Log + +### Iteration 1 (2026-06-04) — Gemini APPROVE · Claude APPROVE · Codex REQUEST_CHANGES (all HIGH) +Strong consensus; Codex's REQUEST_CHANGES + the two APPROVERs' notes were all incorporated: +- **Phase 1 — ownership claimed before side effects (Codex)**: the ledger check+write now happens *before* `createWorktree`/session creation (the old "after `upsertBuilder`" order couldn't refuse a duplicate before the worktree existed), with rollback-on-failure so a failed spawn leaves no phantom owner. Scoped to numbered spawns only (`spawnSpec`/`spawnIssueDrivenBuilder`) per Claude 3b. Added `types.ts` `SpawnOptions` + `__tests__/spawn.test.ts` plumbing per Codex. Closed-issue dedup default resolved (treat same as open) per Claude. +- **Phase 6 — attached-branch flow (Codex)**: corrected the git sequence so the worktree is **attached** to a named branch pointing at the pinned SHA (the original `worktree add <sha>` would detach). Modify `createWorktree()` only, not `createWorktreeFromBranch()` (already fetches) per Claude 3d. Added the default-branch fallback chain (config → `origin/HEAD` → `git remote show origin` → fail) per Claude 3e, and a concrete stale-HEAD fixture for the test. +- **Phase 5 — board needs a ledger join (Codex + Gemini)**: who-owes-next's "owned-but-unspawned" case requires synthesizing overview rows from the ledger (not just `architectCount`+`spawnedByArchitect`); `overview.ts` now joins the Phase 1 ledger and the payload/types carry the synthesized owned-issue shape. +- **Minor/trivial**: `state-rotation` committed to `packages/codev/src/lib/` (codev-specific, not `packages/core`) per Claude 3a; `addArchitect`/`removeArchitect` function line numbers corrected (~L868/~L1100) per Claude 3c; `mkdir -p` for new `codev/state/architects/` and `codev/state/archive/` dirs called out (Phase 3/4); workspace-relative path resolution elevated to a Phase 4 risk; optional ledger-write SSE noted as out-of-scope (poll suffices). +- **Migration note (Gemini)**: clarified that `db.exec(LOCAL_SCHEMA)` auto-applies `CREATE TABLE/INDEX IF NOT EXISTS` to existing DBs, so a discrete numbered migration is optional. + +**Date**: 2026-06-04 +**Models Consulted**: gemini, codex, claude + +## Phase Status Tracking +- [ ] Phase 1 — Ledger + dedup (`pending`) +- [ ] Phase 2 — Roster (`pending`) +- [ ] Phase 3 — Bounded state files (`pending`) +- [ ] Phase 4 — Lifecycle (`pending`) +- [ ] Phase 5 — Board / who-owes-next (`pending`) +- [ ] Phase 6 — Builder-base SHA-pin (`pending`) +- [ ] Phase 7 — Documentation (`pending`) diff --git a/codev/projects/984-multi-architect-coordination-l/984-plan-iter1-rebuttals.md b/codev/projects/984-multi-architect-coordination-l/984-plan-iter1-rebuttals.md new file mode 100644 index 000000000..2418e659b --- /dev/null +++ b/codev/projects/984-multi-architect-coordination-l/984-plan-iter1-rebuttals.md @@ -0,0 +1,45 @@ +# Plan 984 — Rebuttal to Iteration-1 Consultation + +**Verdicts**: Gemini **APPROVE** · Claude **APPROVE** · Codex **REQUEST_CHANGES** (all HIGH confidence). +**Disposition**: Accepted all of Codex's four points and every actionable note from the two approvers. No disagreements. Plan revised in commit on the builder branch (Plan iter-1). + +--- + +## Codex (REQUEST_CHANGES) — all four accepted + +### 1 — Ownership write placed too late (would not prevent side effects) +**Accepted; corrected.** Codex is right: `spawnSpec` creates the worktree/session *before* `upsertBuilder`, so a ledger check/write "after `upsertBuilder`" would only refuse a duplicate *after* the worktree/session already exist. Phase 1 now **claims ownership (dedup check + atomic ledger insert via the partial unique index) before `createWorktree`/session creation**, claiming on `architect + issue_number` (builder_id nullable, backfilled on success), with **rollback** (release the just-claimed row) if a later spawn step fails — so a failed spawn leaves no phantom owner and the gate is genuinely preventive. + +### 2 — Phase 6 git sketch would detach the worktree +**Accepted; corrected.** `git branch <branch> <sha>` + `git worktree add <path> <sha>` detaches. Phase 6 now specifies the **attached-branch flow**: `git branch <builder-branch> <base-sha>` then `git worktree add <path> <builder-branch>` — the worktree is attached to a named branch whose tip is the pinned SHA. Added a test assertion that the result is attached (not detached HEAD), and documented that the explicit-`--branch`/resume path (`createWorktreeFromBranch`) is unaffected. + +### 3 — Phase 5 board under-scoped (owned-but-unspawned has no row) +**Accepted; corrected.** The who-owes-next "ledger entry with no active builder → architect" case has no builder row to attach to. Phase 5 now **joins the Phase 1 ledger in `overview.ts`** and **synthesizes an overview row** (an `OverviewBuilder`-shaped/`OverviewOwnedIssue` entry) for owned-but-unspawned issues; the payload/types carry the synthesized shape and `WorkView` renders these alongside live builders. (Gemini flagged the same — same fix.) + +### 4 — Missing plumbing for new flags/validation +**Accepted.** Added `packages/codev/src/agent-farm/types.ts` (`SpawnOptions.overrideOwner?` / `base?`) and `__tests__/spawn.test.ts` (extending the existing spawn-validation tests) to the Phase 1 file list. + +--- + +## Claude (APPROVE) — minor notes, all applied + +- **3a** state-rotation home → committed to `packages/codev/src/lib/` (codev-specific, not a `packages/core` cross-package util). +- **3b** spawn-path scope → Phase 1 now states explicitly it touches only the **numbered** spawn paths (`spawnSpec`, `spawnIssueDrivenBuilder`), not `spawnTask`/`spawnShell`/`spawnWorktree`/`spawnProtocol`. +- **3c** line numbers → corrected: `addArchitect` function ~L868, `removeArchitect` ~L1100 (the ~L987/~L1161 are the `setArchitectByName` calls within). +- **3d** Phase 6 → clarified it modifies `createWorktree()` **only**, not `createWorktreeFromBranch()` (which already fetches; touching it would double-fetch/break resume). +- **3e** default-branch discovery → added the fallback chain: workspace config → `git symbolic-ref refs/remotes/origin/HEAD` → `git remote show origin` → **fail** (never assume `main`). +- **Phase 6 test** → added concrete fixture setup (temp git repo with local HEAD deliberately behind `origin/<default>`; assert builder tip == origin tip). +- **Phase 4 risk** → workspace-relative path resolution elevated to a named risk; `mkdir -p` for the new `codev/state/architects/` and `codev/state/archive/architects/` directories called out (Phase 3 + Phase 4). +- **Closed-issue dedup** → resolved with a stated default (treat closed same as open; block + `--override-owner`), so the Phase 1 gate logic is unambiguous. +- **Ledger-write SSE** → noted as an optional nice-to-have, **out of scope** (the existing overview poll already surfaces changes). + +## Gemini (APPROVE) — notes applied + +- **Migration** → clarified that `db.exec(LOCAL_SCHEMA)` runs on every init, so `CREATE TABLE/INDEX IF NOT EXISTS` auto-applies to existing DBs; a discrete numbered migration in `db/index.ts` is **optional** (only for column alterations/backfills). +- **Phase 5 ledger-only injection** → addressed by the same ledger-join/synthesized-row fix as Codex #3. +- **Phase 4 rehoming** and **Phase 6 fail-fast/offline `--base HEAD`** → confirmed as designed; no change needed. + +--- + +## Net +Two APPROVE + one REQUEST_CHANGES whose four points were all genuine implementation-ordering/correctness gaps (claim-before-side-effects, attached-branch flow, board ledger join, flag plumbing) — all fixed. The plan now reflects verified integration points and a build-ready phase breakdown. No open disagreements. diff --git a/codev/projects/984-multi-architect-coordination-l/984-specify-iter1-rebuttals.md b/codev/projects/984-multi-architect-coordination-l/984-specify-iter1-rebuttals.md new file mode 100644 index 000000000..e4d7f0dc8 --- /dev/null +++ b/codev/projects/984-multi-architect-coordination-l/984-specify-iter1-rebuttals.md @@ -0,0 +1,59 @@ +# Spec 984 — Rebuttal to Iteration-1 Consultation + +**Verdicts**: Gemini REQUEST_CHANGES · Codex REQUEST_CHANGES · Claude COMMENT (all HIGH confidence). +**Disposition**: Accepted essentially all feedback. No substantive disagreements; one item narrowed in scope (below). The spec was revised in commit `d7e17715`. + +--- + +## Gemini (REQUEST_CHANGES) + +### G1 — Worktree data loss on retire (#6) +**Accepted.** Added a **dirty-worktree guard**: `remove-architect` inspects the architect's `.architects/<name>/` worktree; if it has uncommitted/staged/untracked work it **aborts with a clear error listing the dirty paths** and requires `--force`. `--force` archives the state file (#4) before removal and surfaces (never silently destroys) committed-but-unpushed branches. See #6 approach + #4 step 4 + Test Scenario 14. + +### G2 — Architect worktree path unspecified (#6) +**Accepted.** Fixed location: **`.architects/<name>/`** at the workspace root (symmetric with `.builders/<id>/`), added to `.gitignore` and AI-context ignore files so nested worktrees don't pollute the main checkout or AI context. See #6 approach. + +### G3 — Markdown truncation safety (#5) +**Accepted.** Rotation no longer does arbitrary line/byte cuts. The template carries a machine-parseable **`<!-- ARCHIVE BOUNDARY -->`** delimiter; rotation operates only on the history region below it, moving **whole entries** (never splitting a fenced code block), into a dated archive. Added a loss-free reconstruction invariant + Test Scenarios 15 and Non-Functional Test 4. See #5 approach. + +### Gemini's OQ recommendations — all adopted +State file at `codev/state/architects/<name>.md`; last-activity = max(terminal session, owned-builder `updated_at`); re-home to `main` without killing builders; `--override-owner`. Recorded in **Resolved Design Decisions**. + +--- + +## Codex (REQUEST_CHANGES) + +### C1 — Acceptance-critical decisions left open +**Accepted.** Added a **Resolved Design Decisions** section promoting state-file location, override flag, rotation mechanism/cap/trigger, dedup override semantics, retire policy, and last-activity source from "open" to "fixed," so #1/#4/#5 are implementable and testable. Only genuinely architect-level forks (#6 PR-slicing, migration ergonomics) remain open. + +### C2 — #6 scope contradiction (own-checkout vs. leave-pre-existing-shared) +**Accepted; contradiction resolved.** Restated the invariant as *"no architect's branch switch disturbs a sibling,"* not *"every architect physically has a worktree."* Concretely: `main` = main checkout; architects **added after this feature** get `.architects/<name>/` by default; pre-existing non-`main` architects keep the shared checkout until migrated via an explicit `afx workspace migrate-architect`. The #6 success criterion was reworded to match. Internally consistent now. + +### C3 — Ledger semantics imprecise (override / close / reopen / cleanup / release) +**Accepted.** Added precise rules: first-owner-wins; `--override-owner` = mark prior `released` + insert new live entry with `override_of` (auditable, not silent transfer); issue close does **not** auto-release (ownership = who's working it); builder `cleanup` does **not** release (durability is the point); release happens only via override / `remove-architect` / explicit future command. See #3 "Precise ledger semantics." + +### C4 — Degraded/error behavior underspecified (forge unavailable; ambiguous who-owes-next) +**Accepted.** Roster intersects ledger ids against the **existing overview/issue cache** and renders `unknown` when forge data is unavailable — never blocks, stays deterministic. who-owes-next is now a **total function** with an explicit `unknown` fallback for non-standard states. See #1 + #2 approaches. + +--- + +## Claude (COMMENT — 7 gaps, all addressed) + +1. **Forge query mechanism for "open issues"** → resolved via overview-cache intersection (no per-issue shell-out; keeps the <1s budget). +2. **who-owes-next edge cases** (stalled / awaiting-merge / owned-unspawned) → enumerated explicitly in the total function. +3. **Re-homing semantics concrete steps** → #4 now lists the ordered DB/file steps (archive file → release ledger rows → `UPDATE builders SET spawned_by_architect='main'` → remove row → remove worktree). Routing already falls back to `main` post-#774, so no routing-code change — the column update makes the new home explicit. +4. **#6 spawn-base resolution sketch** → added: `spawn.ts` resolves the spawning architect's worktree from the `architect` table and runs the worktree-add against it; `main` resolves to the main checkout (unchanged path). +5. **`CODEV_ARCHITECT_NAME` fallback in N>1** → spawn now **requires** a registered name in multi-architect workspaces (no silent `main`); security note + Test Scenario 13 added. +6. **Concurrent-spawn race on dedup** → partial unique index on `(workspace_path, issue_number) WHERE released = 0` makes check-then-insert atomic; Test Scenario 12 added. +7. **No new dashboard deps for #2** → confirmed; grouping reuses existing components + `architectCount`/`spawnedByArchitect` from #823. + +--- + +## Scope narrowing (the one place I didn't expand) + +- **Fuzzy "same-symptom" dedup** stays **out of scope** (issue-level only) — this is explicit in the issue and none of the reviewers contested it. The duplicate-investigation failure mode is addressed at the issue grain, which is where the observed overlap actually occurred (3 architects, 3 *issues*). +- **#6 migration of pre-existing architects** is provided as an *explicit, opt-in* command rather than an automatic on-upgrade migration — the least-disruptive choice, flagged for architect confirmation at the gate. + +## Net + +Two REQUEST_CHANGES → all blocking points resolved in `d7e17715`; one COMMENT → all seven gaps closed. The remaining Open Questions are deliberate architect-level decisions (most prominently: does #6 ship in this PR or its own — all three reviewers recommend its own), not unresolved design gaps. diff --git a/codev/projects/984-multi-architect-coordination-l/status.yaml b/codev/projects/984-multi-architect-coordination-l/status.yaml new file mode 100644 index 000000000..8d399b908 --- /dev/null +++ b/codev/projects/984-multi-architect-coordination-l/status.yaml @@ -0,0 +1,23 @@ +id: '984' +title: multi-architect-coordination-l +protocol: spir +phase: plan +plan_phases: [] +current_plan_phase: null +gates: + spec-approval: + status: approved + requested_at: '2026-06-04T04:27:13.748Z' + approved_at: '2026-06-04T05:54:59.349Z' + plan-approval: + status: pending + requested_at: '2026-06-04T06:11:58.987Z' + pr: + status: pending + verify-approval: + status: pending +iteration: 1 +build_complete: true +history: [] +started_at: '2026-06-04T04:10:24.406Z' +updated_at: '2026-06-04T06:11:58.987Z' diff --git a/codev/specs/984-multi-architect-coordination-l.md b/codev/specs/984-multi-architect-coordination-l.md new file mode 100644 index 000000000..1a5fa1a9c --- /dev/null +++ b/codev/specs/984-multi-architect-coordination-l.md @@ -0,0 +1,352 @@ +# Specification: Multi-Architect Coordination Layer (roster · board · dedup-at-spawn · lifecycle · bounded state · builder-base SHA-pin) + +## Metadata +- **ID**: spec-2026-06-03-984-multi-architect-coordination-l +- **Status**: draft +- **Created**: 2026-06-03 +- **GitHub Issue**: [#984](https://github.com/cluesmith/codev/issues/984) +- **Area**: `area/tower` +- **Predecessors**: #755 (per-architect primitive), #761 (tab strip), #774 (affinity routing), #786 (lifecycle/persistence/UX), **#823 (coordination-b — builder attribution, messaging docs, builder thread files, VSCode refresh)** + +## Clarifying Questions Asked + +Issue #984 is a detailed six-point field report, generalized from a private heavy-use workspace (peak 5 architects + 7 builders) and routed via Shannon's operator. The deliverables, scope boundaries, and "build all six as one coherent SPIR" framing are explicit in the issue body, so no pre-draft clarification was solicited from the architect. However, mapping the field report onto *this* repository surfaced design forks that are recorded in **Open Questions** (most consequentially: the architect-state-file convention, the state-rotation mechanism, and the checkout-isolation approach). These are flagged for resolution at the spec-approval gate rather than guessed silently. + +One scoping reconciliation was made during drafting and is documented in **Current State**: the issue describes "architect state files" with unbounded strikethrough history that get hand-renamed to `*-inactive`. That convention lived in the private source workspace. In *this* codebase, architect identity is persisted in SQLite (not markdown), and the only on-disk state files are the per-*builder* `codev/state/<id>_thread.md` logs introduced by #823. This spec therefore treats point #4 as *introducing* a first-class architect state-file convention, and point #5 as bounding both that new file and the existing builder thread files. + +## Problem Statement + +Codev/afx lets an operator **spawn** N architects in one workspace (the `workspace add-architect` / `remove-architect` primitives shipped across #755/#786, and #823 made the *builder→architect* relationship visible). But the workspace still gives the operator no first-class way to **see, route, dedup, or retire** architects as a fleet. Every coordination function that a second human would perform falls back onto the single operator. The source workspace observed five failure modes, every one of which recurred: + +1. **The human is the registry.** No command answers "which architect owns what." The operator holds the roster in their head. +2. **The human is the message router.** Cross-architect coordination is relayed by hand. Sibling messaging (`afx send architect:<name>`) exists and was documented by #823, but it remains under-surfaced — there is no place that *shows* the operator who to route to. +3. **Duplicate investigation.** One symptom was independently investigated by three architects under three issues; the overlap was discovered only after the fact, at roughly 3× the cost. (An audit confirmed the fleet's actual code changes were largely orthogonal — the waste is coordination, not diff contention.) +4. **Lifecycle / state drift.** ~14 architect identities churned in ~2 weeks. Retirement was performed by hand-renaming state files to `*-inactive`; some live architects had no state file at all; one appointed architect's state file was never created. Where state files existed, they grew unbounded (tens of KB of strikethrough history), making a cold resume of an architect's context expensive. +5. **Shared checkout couples the fleet.** All architects share one git checkout. A branch switch by any one architect yanks the working tree out from under the siblings, and builders branch from the shared HEAD — so one stale checkout makes *everyone's* new builders stale. The current mitigation is a fragile, human-enforced "never switch branches" discipline rule. + +The unifying diagnosis: **the coordination layer that a second operator would provide does not exist as software.** This spec builds it. + +## Current State + +### What already exists (do not rebuild) + +| Capability | Status | Where | +|---|---|---| +| `afx workspace add-architect --name <n>` / `remove-architect` | ✅ exists | `cli.ts` → `commands/workspace-add-architect.ts`, `…-remove-architect.ts` → Tower `addArchitect`/`removeArchitect` | +| Architect persistence (survives Tower stop/start) | ✅ exists (#786) | SQLite `architect` table, composite PK `(workspace_path, id)`; `state.ts` load/reconcile | +| Spawn affinity (builder remembers spawning architect) | ✅ exists (#774) | `CODEV_ARCHITECT_NAME` env → `commands/spawn.ts` → `builders.spawned_by_architect` | +| `afx status` enumerates architects + builders | ✅ exists (#786) | `commands/status.ts` | +| Builder attribution visible on dashboard Work view | ✅ exists (#823) | `OverviewBuilder.spawnedByArchitect`, `BuilderCard` | +| Sibling messaging `afx send architect:<name>` + docs | ✅ exists (#774) + documented (#823) | `tower-messages.ts`; `CLAUDE.md`/`AGENTS.md`/`agent-farm.md` | +| Per-builder thread file `codev/state/<id>_thread.md` | ✅ convention exists (#823) | free-text markdown, written by the builder AI; no code reads/writes it | +| Cross-project board (`afx tower`) + dashboard Work view | ✅ exists | `servers/tower-routes.ts`, `servers/overview.ts`, `packages/dashboard/src/components/WorkView.tsx` | + +### What does **not** exist (the gaps this spec fills) + +- **No `afx architects` roster.** `afx status` lists architects but does not join them to owned issues, live builders, last-activity, or a state-file path. There is no single "who owns what" surface. +- **No issue-ownership ledger.** `builders.spawned_by_architect` records *builder*→architect affinity, but it is per-builder and lifecycle-bound (it disappears semantics-wise once a builder is cleaned up). There is no durable *issue*→architect record, and therefore no dedup-at-spawn check. +- **No architect state-file convention.** Architect identity is in SQLite; there is no per-architect markdown working-memory file, no template for one, and no lifecycle that creates/archives it. +- **No bounded-state mechanism.** Builder thread files (and any future architect state file) are free-text markdown an AI appends to; nothing caps the head or rotates the history. Cold-resume cost is unbounded. +- **No pinned builder base.** Builders `git worktree add` from the shared checkout's **live HEAD**, so a builder inherits whatever commit (possibly badly stale, possibly the wrong branch) the shared checkout happens to be on. Observed in practice: builders branched from a `main` that was ~145 commits behind. There is no capture of a fresh known-good base SHA at spawn. + +## Desired State + +A workspace operator running N architects can: + +1. Run **`afx architects`** and see a single table: each architect, the open issues it owns, its live builders, its last activity, and its state-file path. This is the authoritative answer to "who owns what." +2. See the same ownership/state picture, **grouped by architect**, on the dashboard (extending the existing Work view rather than adding a separate artifact), including a "who-owes-next" signal per open thread (is the ball with the architect or the builder?). +3. Have **issue ownership recorded automatically at spawn**, and be **warned (and required to override)** when spawning a second architect onto an issue already owned by a *different* architect — catching the duplicate-investigation failure before the work is done, not after. +4. **Add** an architect and have its state file created from a template automatically; **retire** an architect and have its state file archived and its owned builders/issues released — no hand-renaming, no orphaned or missing state files. +5. Trust that state files stay **bounded**: a capped "current state" head with older history auto-rotated into an archive, so a cold resume reads a predictable, small amount. +6. Trust that every builder branches from a **fresh, known-good base SHA** captured at spawn (the integration/default branch tip), never from whatever stale commit the shared checkout happens to be sitting on — eliminating the stale/wrong-base-builder hazard without restructuring how architects share the checkout. + +Backward compatibility is a hard requirement: a workspace with a single `main` architect (the overwhelmingly common case) must see **zero behavior change** unless it opts into the new surfaces. N=1 dashboards render identically; `afx architects` works but is trivially a one-row table; builder-base SHA-pinning degenerates to "branch from a fresh `origin/main`," which is what operators already expect. + +## Stakeholders +- **Primary Users**: Workspace operators running multi-architect fleets (today: the private source workspace; Shannon as the concrete external N>1 adopter). +- **Secondary Users**: Single-architect operators (must be unaffected); architect AI sessions (consume the roster/board, write bounded state files); builder AI sessions (branch from a pinned known-good base SHA). +- **Technical Team**: Codev maintainers (Tower, afx CLI, dashboard). +- **Business Owners**: M Waleed Kadous (architect / approver). + +## Success Criteria + +- [ ] **#1 Roster** — `afx architects` prints a table with, per architect: name, owned open issues (count + ids), live builders (count + ids), last-activity timestamp, and state-file path. Works at N=1 (one row) and N>1. A machine-readable form (`--json`) is available for tooling. +- [ ] **#2 Board** — The dashboard Work view can present open threads grouped by owning architect, each row showing item, owning architect, state/phase, and who-owes-next. Implemented as an extension of the existing Work view / overview API, not a separate artifact. N=1 renders identically to today. +- [ ] **#3 Ledger + dedup** — Every numbered spawn records an issue→architect ownership entry keyed by the spawning architect (via existing affinity). A spawn onto an issue already owned by a *different* architect prints a clear warning and **refuses** unless an override flag is passed. Same-architect re-spawn (e.g. `--resume`) is never blocked. Issue-level dedup only (no fuzzy symptom matching). +- [ ] **#4 Lifecycle** — `workspace add-architect` creates the architect's state file from a template if absent (idempotent: never clobbers an existing file). `workspace remove-architect` archives the state file and releases the retiring architect's owned issues and live builders per a defined, documented policy. No state file is ever silently missing for a live architect. +- [ ] **#5 Bounded state** — A defined mechanism caps the "current state" head of a state file and rotates older history to an archive, keeping cold-resume cost predictable. Applies to `codev/state/<id>_thread.md` (builders) and the architect state file from #4. The mechanism does not corrupt or lose history (it relocates, not deletes). +- [ ] **#6 Builder-base SHA-pin** — At spawn, the builder's base is a fresh SHA captured by fetching + `rev-parse`-ing the integration/default branch tip (e.g. `origin/main`, or the workspace's configured default), and the builder branch is created from that SHA — not from the shared checkout's live HEAD. A fetch failure fails the spawn loud (no silent fallback to stale local HEAD). A `--base <ref|sha>` override allows a deliberate non-default base. The common case (fresh `origin/main`) matches existing operator expectations. (Architects continue to share the main checkout; per the architect's decision, no per-architect worktrees and no `.architects/` directories are introduced.) +<!-- REVIEW(@architect): This one I'm not quite so sure about because architects have to interact with builders and if they're started in a subdir they may not be able to do that directly. --> +- [ ] **Backward compat** — A single-`main`-architect workspace exhibits no behavioral or visual change without opt-in. +- [ ] All new tests pass; no reduction in existing coverage; the existing afx/Tower/dashboard suites stay green. +- [ ] Documentation updated (`CLAUDE.md`/`AGENTS.md`, `codev/resources/commands/agent-farm.md`, role docs) for every new surface. + +## Constraints + +### Technical Constraints +- **Tooling is fixed** (baked by repo conventions): TypeScript; Commander.js for the afx CLI; the Tower HTTP server + REST client (`packages/core/tower-client.ts`); SQLite via the existing `db/schema.ts` for persisted state; React/Vite for the dashboard; **Vitest** for tests. New persisted state (e.g. the ownership ledger) must use the existing SQLite database and the established `(workspace_path, …)` scoping pattern — no new datastore. +- **Forge abstraction**: any issue metadata (open/closed, title) must go through the existing forge concept layer (`lib/forge.ts`), not a hardcoded `gh` call, so non-GitHub providers degrade gracefully. +- **Architect identity is SQLite-backed** (`architect` table, PK `(workspace_path, id)`); the roster, ledger, and lifecycle build on this, not a parallel store. +- **Builder-base SHA-pin must read the workspace's real default branch**, not hardcode `main` (some workspaces default to `ci`). It must capture the base via a fresh fetch + `rev-parse` of the remote integration ref, and **fail loud** on fetch error rather than silently using a stale local HEAD. +- **No new checkout topology**: per the architect's decision, #6 introduces no per-architect worktrees, no `.architects/` directories, and no Tower checkout-root changes — it is a localized change to the spawn base-resolution path only. +- **`main` is reserved** as an architect identity (symmetric with the existing `afx dev main` reserved-target pattern). +- **No new always-on background process**: rotation/bounding must be triggered (on a lifecycle/digest event or by an explicit command), not a new daemon. + +### Business Constraints +- **No time estimates** (per protocol). +- Backward compatibility for N=1 workspaces is non-negotiable. +- The private source workspace's specifics (names, project ids, product terminology) must NOT leak into code, tests, fixtures, or docs — only the generalized pattern ships. + +## Assumptions +- The spawning architect is reliably known at spawn time via `CODEV_ARCHITECT_NAME` (established by #774); ownership recording can rely on it. +- The architect-state-file convention is new and owned by this spec; no external tool depends on a prior format. +- Live builders spawned by a retiring architect should continue running (not be killed); "release" means re-homing ownership, not termination (consistent with the #823 observation that such builders keep running and route to `main`). +- The workspace's integration/default branch tip (e.g. `origin/main`, or `origin/ci`) is the correct known-good base for new builders, and is reachable via a fetch at spawn time; its name is discoverable from config/forge defaults rather than assumed. + +## Solution Approaches + +The six points are interdependent but separable. The cheapest/highest-leverage pieces (#3 ledger + #1 roster) share a data foundation; #2 reads it; #4 manages lifecycle around it; #5 is contained; #6 is the large structural change. Below, each point lists the recommended approach plus the alternatives weighed. + +### Point #3 — Issue-ownership ledger + dedup-at-spawn (foundation, highest leverage) + +**Recommended — durable ledger table, dedup gate at spawn.** Add an `issue_ownership` table (SQLite, scoped by `workspace_path`) recording `issue_number → architect`, first-owner-wins, with `created_at`, an `released` flag (+ `released_at`), and an `override_of` column (the prior owner, populated only when an override created the row). At numbered spawn, the spawning architect is recorded; if a *different* architect already owns an **unreleased** entry for that issue, spawn prints a warning naming the owner and refuses unless `--override-owner` is passed. `--resume` and same-architect spawns never trip the gate. + +**Precise ledger semantics** (resolving Codex/Claude feedback): +- **First-owner-wins.** The first unreleased entry for an `(workspace_path, issue_number)` is authoritative. A partial unique index on `(workspace_path, issue_number) WHERE released = 0` enforces at most one live owner and makes the check-then-insert **atomic** — a concurrent double-spawn cannot create two live owners (the loser's `INSERT` fails the constraint and is surfaced as the dedup warning). This closes the race Claude raised. +- **Override behavior.** `--override-owner` does **not** silently transfer. It marks the existing entry `released` (recording the releasing actor) and inserts a new live entry owned by the overriding architect with `override_of = <prior owner>`. The override is therefore auditable in the ledger, never silent. +- **Issue close / reopen.** Closing an issue does not auto-release its ledger entry (ownership is about *who is working it*, not issue state); the roster simply renders it under "owned" with status `closed`. Re-spawning the same architect on a reopened issue is the same-architect no-block path. (Whether the dedup check should *warn but not block* on a closed prior-owned issue is the one remaining nice-to-know Open Question.) +- **Builder cleanup** (`afx cleanup`) does **not** release the ledger entry — durability across cleanup is the entire point (it's what catches late-discovered overlap). Release happens only via `--override-owner`, `remove-architect` (#4), or an explicit future release command. +- **`CODEV_ARCHITECT_NAME` validation.** In an N>1 workspace, spawn must resolve a **registered** architect name; it must NOT silently fall back to `main` when the env var is unset/unknown (that would let a misconfigured terminal claim ownership as `main`). At N=1 the existing fall-back-to-`main` behavior is unchanged. + +- **Pros**: Durable across builder cleanup (catches the real failure mode — overlap discovered late); cheap; reuses affinity; clear, auditable UX. +- **Cons**: New table + migration; release semantics couple to #4; needs a deliberate override path. +- **Alternatives considered**: (a) *Derive ownership from `builders.spawned_by_architect`* — rejected: lifecycle-bound and per-builder, so it can't catch overlap once a builder is cleaned up, and it has no entry for an issue an architect is investigating without a builder. (b) *Fuzzy "same symptom" matching* — explicitly out of scope per the issue (issue-level dedup only). + +**Complexity**: Low–Medium. **Risk**: Low. + +### Point #1 — Architect roster (`afx architects`) + +**Recommended — read-only join command + Tower endpoint.** New `afx architects [--json]` command and a Tower API endpoint that joins the `architect` table to: owned issues (from the #3 ledger), live builders (`builders` where `spawned_by_architect = name`), last-activity, and the architect's state-file path (#4). Renders a table. + +**Open/closed status source + degraded behavior** (resolving Claude/Codex feedback): the roster must stay interactive (<1s), so it must **not** shell out to the forge once per owned issue. Instead it reuses the **existing overview/issue cache** (`servers/overview.ts` already fetches the issue/PR lists via the forge concept layer and caches them) and **intersects** the ledger's owned issue ids against that cached set to label each as open/closed. When the forge cache is unavailable or stale, the roster still renders every owned issue id with status `unknown` (it never blocks on the forge) — local SQLite data (architect, ledger, builders) is always authoritative for the roster's structure. This keeps the command deterministic and within budget regardless of forge health. + +**Last-activity source** (resolved): the most recent of (a) the architect's active terminal-session timestamp and (b) the `updated_at` of its most-recently-active owned builder — whichever is newer; `—` if neither exists. + +- **Pros**: Single source of truth for "who owns what"; pure read; trivially correct at N=1; forge-outage tolerant. +- **Cons**: Reuses the overview cache, so open/closed labels are as fresh as that cache (acceptable; the roster is a coordination aid, not a forge mirror). +- **Alternatives**: Fold the roster into `afx status` — rejected: `afx status` is already dense and the roster's join (issues, state-file path) is a distinct concern; a dedicated `--json`-friendly command is clearer. Per-issue forge shell-out — rejected on the <1s budget. + +**Complexity**: Low. **Risk**: Low. + +### Point #2 — Unified board / digest + +**Recommended — extend the dashboard Work view + overview API; optional CLI digest.** Per the issue's explicit preference ("prefer extending the existing dashboard Work view / `afx tower` over a separate artifact"), add an architect-grouped presentation to the Work view: open threads grouped by owning architect, each showing item, state/phase, and **who-owes-next**. The grouping reuses the existing React component structure plus the already-present `architectCount` and `OverviewBuilder.spawnedByArchitect` (both shipped by #823) — **no new dashboard dependencies** are required (resolving Claude's scope question). Optionally expose the same digest as text via the roster command (e.g. `afx architects --board`) for terminal users. + +**Who-owes-next derivation** (resolving Claude/Codex edge-case feedback) — a small, total function over already-parsed porch/overview state, with an explicit fallback so the board is deterministic: +- Pending **human gate** (spec/plan/pr/dev approval) → **architect**. +- Builder mid-phase, actively progressing → **builder**. +- Review complete / **PR open awaiting merge** → **architect**. +- Issue in the ledger with **no active builder** (owned but not yet spawned) → **architect**. +- Builder present but **idle/stuck** (no pending gate, no recent activity) → **builder (stalled)** — surfaced distinctly so it reads as "needs a nudge." +- Any state not matching the above → **`unknown`** (never a crash; the board degrades gracefully). + +- **Pros**: One artifact, auto-derived, no new file to keep in sync (directly answers failure mode #4's "one artifact vs N state files"); reuses overview plumbing; no new deps. +- **Cons**: The who-owes-next rule must be implemented as a total function with the `unknown` fallback; grouping UI must preserve N=1 identical rendering. +- **Alternatives**: A generated markdown digest file regenerated on spawn/gate/merge — rejected as the primary surface (the issue prefers extending existing surfaces; a file reintroduces a sync burden). May still be offered as a CLI text view. + +**Complexity**: Medium. **Risk**: Low–Medium (dashboard regression surface — must verify N=1 visually). + +### Point #4 — Formal lifecycle (state file on add, archive/release on retire) + +**Recommended — enforce state file from template on add; archive + release on remove.** Extend the existing add/remove paths. + +- **Architect state-file convention (resolved):** `codev/state/architects/<name>.md` — a dedicated subdirectory that cleanly namespaces architect state away from builder `*_thread.md` files in `codev/state/` (no collision risk). +- **On add:** if `codev/state/architects/<name>.md` does not exist, create it from the architect-state template; **never clobber** an existing file (idempotent — re-adding a previously-retired name reuses/restores rather than overwriting). +- **On remove — concrete release steps** (resolving Claude's "what does re-home mean" feedback), executed in this order: + 1. **Archive the state file**: move `codev/state/architects/<name>.md` → `codev/state/archive/architects/<name>-<date>.md` (recoverable; not deleted). + 2. **Release ledger entries**: mark every unreleased `issue_ownership` row owned by `<name>` as `released` (with `released_at`); the issues become re-claimable. + 3. **Re-home live builders**: `UPDATE builders SET spawned_by_architect = 'main'` for that architect's live builders — builders **keep running**; this is the concrete meaning of "re-home," and it makes `afx send architect` from those builders route to `main` (which the post-#774 routing in `tower-messages.ts` already does as a fallback, so no routing-code change is required — but the column update makes the new home explicit rather than implicit). + 4. **Remove the architect row** (existing behavior). (No worktree to remove — #6 is SHA-pin, not per-architect checkouts.) + +- **Pros**: Eliminates hand-renaming, missing files, and orphaned ownership in one pass; builds directly on #3's ledger; release steps are explicit and ordered. +- **Cons**: "Re-home to main" is a deliberate policy (documented); couples to #3. +- **Alternatives**: Keep lifecycle hand-managed (status quo) — rejected, it is the source of failure mode #4. + +**Complexity**: Medium. **Risk**: Low–Medium. + +### Point #5 — Bounded, templated state files + +**Recommended — templated head/history split with an explicit machine-parseable boundary marker + a triggered rotation command.** Define a state-file template with two regions separated by an **unambiguous delimiter comment** so rotation is never a blind line/byte cut (resolving Gemini's markdown-truncation-safety concern): + +``` +# <title> +... bounded "current state" head — AI-maintained, capped ... + +<!-- ARCHIVE BOUNDARY --> +## History +... append-only log entries below this line ... +``` + +- **Rotation operates only on the region *below* `<!-- ARCHIVE BOUNDARY -->`**, and only at whole **entry** boundaries (e.g. `### `-delimited or `---`-delimited log entries) — it never splits a fenced code block or a partial markdown element. When the file exceeds its cap, the **oldest** complete history entries are *moved* (not deleted) into a dated archive file (`codev/state/archive/<id>-<date>.md`), leaving the head and the most-recent history intact. Concatenating archive + live file reconstructs the original (loss-free invariant, tested). +- **The cap** is a configurable threshold (default measured in bytes/lines of the history region); the head region is bounded by *convention* (the role docs instruct the AI to keep the head a small "current state" summary) and is never auto-truncated by the tool — only the history region rotates. +- **Trigger**: an explicit `afx state rotate <id>` (or equivalent) subcommand, plus opportunistic invocation on lifecycle/digest events (add/remove-architect, board regeneration). **No daemon, no file watcher** (honors the no-new-always-on-process constraint). +- Applies uniformly to builder `*_thread.md` and the architect `codev/state/architects/<name>.md` files (both adopt the boundary marker in their templates). + +- **Pros**: Bounds cold-resume cost; loss-free (relocate at entry boundaries, never split markdown); deterministic boundary; no background process. +- **Cons**: Requires the templates and role docs to teach the head/boundary/history structure so the AI writes within it; the rotation tool must parse entry boundaries, not raw offsets. +- **Alternatives**: (a) Convention-only guidance with no tooling — rejected: relies on the AI self-policing length, which is exactly what failed. (b) Arbitrary line/byte truncation — rejected: splits markdown/code blocks (Gemini). (c) A hook/daemon that watches and rotates — rejected: violates the no-new-daemon constraint. + +**Complexity**: Medium. **Risk**: Medium (correctness of rotation; must not lose history or split markdown). + +### Point #6 — Builder-base SHA-pinning (architect-directed: fallback B chosen over worktree isolation) + +> **Post-approval architect decision (2026-06-04):** At the spec-approval gate the architect **replaced full per-architect checkout isolation with the documented fallback (B): builder-base SHA-pinning.** Rationale: the damaging failure mode observed in practice is the *stale/wrong-base builder* (builders repeatedly branched from a `main` that was ~145 commits behind), which SHA-pinning fixes directly and cheaply — without the high-risk Tower/spawn refactor that per-architect worktrees require. Architects continue to share the main checkout; the "never switch branches" discipline rule **stays** for architects (accepted — architects sit on the integration branch and rarely need independent branches). This resolves both former Open Questions (PR-slicing and migration ergonomics) — see below. + +**Approach — capture a fresh known-good base SHA at spawn and branch the builder from it.** Today `createWorktree()` runs `git branch` + `git worktree add` from the shared checkout's **live HEAD**, so a builder inherits whatever (possibly stale, possibly wrong-branch) commit the shared checkout happens to be sitting on. Instead, at spawn: +1. **Resolve the integration/default branch** for the workspace — the repo's default branch tip, e.g. `origin/main` (or `origin/ci` in workspaces whose default branch is `ci`). The branch name is read from existing config/forge defaults, not hardcoded. +2. **Fetch + `rev-parse`** that remote ref to capture a fresh, explicit base SHA (`git fetch <remote> <branch>` then `git rev-parse <remote>/<branch>`). +3. **Branch the new builder from that SHA** (`git branch <builder-branch> <sha>` / `git worktree add <path> <sha>`), not from the shared checkout's working HEAD. + +This eliminates the stale/wrong-base hazard regardless of what state any architect left the shared checkout in, and it is a localized change to the spawn base-resolution path — **no Tower refactor, no per-architect worktrees, no `.architects/<name>/` directories, no lifecycle/migration surface.** + +- **Edge / failure handling:** if the fetch fails (offline / forge unreachable), spawn must **fail loud** with a clear error rather than silently falling back to the stale local HEAD (fail-fast — a silent fallback would reintroduce exactly the hazard this fixes). A `--base <ref|sha>` escape hatch lets the architect pin an explicit base when they deliberately want a non-default base (e.g. stacking on an unmerged branch). +- **Backward compatibility:** for the common case the captured SHA *is* `origin/main`'s tip, so behavior is "branch from a fresh main" — what operators already expect. No new directories, no gitignore changes, no dashboard/Tower changes. + +- **Pros**: Directly kills the observed stale-base failure; small, localized diff in the spawn path; low blast radius; ships inside the main PR; no migration story needed. +- **Cons**: Does **not** give architects independent branch state — so the "never switch branches" rule remains for architects (explicitly accepted by the architect; architects rarely need it). Adds a fetch to the spawn path (small latency; acceptable, and correctness-critical). +- **Alternatives (now historical):** (a) *Per-architect git worktrees* — the spec's original recommendation; **rejected by the architect** as too high-risk (Tower/spawn refactor) for the benefit, given architects rarely switch branches. (b) *Separate full clone per architect* — rejected (heavy disk). (c) *Branch from live shared HEAD* — the status quo that causes the hazard. + +**Complexity**: Low–Medium. **Risk**: Low–Medium (must fail-fast on fetch error; must read the correct default branch, not hardcode `main`). + +### Cross-cutting: PR sequencing + +The issue suggests sequencing #3+#1 (cheap, high-leverage) first, then #5 (contained), then #4/#2 (enhancements), with #6 last. Per the builder PR strategy, all phases ship as commits within a single PR. With the architect's decision to make #6 a small SHA-pin change (not worktree isolation), **all six points ship in one PR** — there is no longer any blast-radius reason to slice #6 out. + +## Resolved Design Decisions + +The first consultation pushed (Gemini + Codex REQUEST_CHANGES) to resolve the acceptance-critical forks rather than leave them open. These are now **fixed in the approaches above** and listed here for visibility; the architect can still overturn any at the gate: + +- **Architect state-file convention** → `codev/state/architects/<name>.md` (dedicated subdir; no collision with builder `*_thread.md`). +- **Dedup override flag** → `--override-owner` (distinct from git-dirty `--force`); overrides are recorded in the ledger (`override_of`), never silent. +- **Ledger atomicity** → partial unique index on `(workspace_path, issue_number) WHERE released = 0`, making check-then-insert atomic (closes the concurrent-spawn race). +- **Retire / re-home policy** → builders keep running; their `spawned_by_architect` is updated to `main`; ledger entries marked `released`; state file archived. (No architect worktree to remove under SHA-pin.) +- **Roster last-activity** → max(terminal-session ts, most-recent owned-builder `updated_at`). +- **Roster open/closed source** → intersect ledger ids against the existing overview/issue cache; `unknown` when forge data is unavailable (never blocks). +- **State rotation** → machine-parseable `<!-- ARCHIVE BOUNDARY -->` delimiter; rotate only complete history entries below it into `codev/state/archive/…`; explicit command + opportunistic on lifecycle/digest events; no daemon. +- **`CODEV_ARCHITECT_NAME` validation** → in N>1 workspaces spawn requires a registered architect name (no silent fallback to `main`); N=1 fallback unchanged. +- **#6 approach (architect-directed, post-approval)** → **builder-base SHA-pin**, not per-architect worktrees. Capture a fresh integration-branch-tip SHA (fetch + `rev-parse`) at spawn and branch the builder from it; fail-fast on fetch error; `--base` override available. Architects keep sharing the main checkout; "never switch branches" stays for architects. Ships inside the main PR; no migration surface. + +## Open Questions + +### Resolved at the spec-approval gate (2026-06-04) +- ✅ **#6 approach** — architect chose **builder-base SHA-pin** over per-architect worktree isolation. *Consequence:* **#6 PR-slicing is moot** (SHA-pin is small/low-risk → ships in the main PR), and **pre-existing-architect migration ergonomics is moot** (no worktrees to migrate into; the `migrate-architect` surface is dropped entirely). +- ✅ **Resolved defaults confirmed** — `codev/state/architects/<name>.md` location, `--override-owner` release-and-reinsert semantics, and the re-home-to-`main` retire policy all stand as written (architect: "all other resolved decisions stand"). + +### Nice-to-Know (Optimization — defer to plan/implementation) +- [ ] Should `afx architects` ship a `--board` text digest (#2 in the terminal) in this spec, or defer to dashboard-only? +- [ ] Should the dedup check **warn-but-not-block** on a *closed/merged* prior-owned issue (vs. block the same as an open one)? +- [ ] Exact name/source of the integration branch for SHA-pin per workspace (config key vs. forge default) — a plan-level detail; the constraint (read it, don't hardcode `main`) is fixed. + +## Performance Requirements +- `afx architects` and the overview/board endpoint must remain interactive (well under ~1s for a realistic fleet of ≤ ~10 architects / ~20 builders); they are read joins over local SQLite + cached forge data, so no heavy queries. +- State-file rotation operates on small markdown files; it must be fast enough to run inline on a lifecycle event without a perceptible stall. +- Builder-base SHA-pin adds one `git fetch` of the integration ref to the spawn path (small, bounded latency); it is correctness-critical and acceptable. No new background or per-architect disk cost. + +## Security Considerations +- **No new auth surface.** All new commands/endpoints are workspace-local, consistent with existing afx/Tower scope. The ownership ledger and roster expose only data already visible within the workspace. +- **Spoofing**: ownership recording relies on `CODEV_ARCHITECT_NAME`, the same trust basis as existing affinity routing; the dedup override must not become a way to silently reassign ownership without it being visible in the ledger (record overrides via `override_of`). +- **N>1 attribution integrity**: in a multi-architect workspace, an unset/unknown `CODEV_ARCHITECT_NAME` must NOT silently default ownership to `main` (a misconfigured terminal would otherwise claim ownership it didn't earn); spawn refuses until a registered name resolves. N=1 keeps the existing `main` default. +- **Privacy**: the source workspace's identifying details must not appear in code/tests/fixtures/docs. + +## Test Scenarios + +### Functional Tests +1. **Ledger happy path**: architect `main` spawns issue 100 → ledger records `100→main`; `afx architects` shows `main` owning 100. +2. **Dedup block**: architect `b` spawns issue 100 (owned by `main`) → spawn refuses with a warning naming `main`; with `--override-owner` it proceeds and the override is recorded. +3. **Same-architect re-spawn / resume**: `main` re-spawns/resumes issue 100 → no block. +4. **Roster at N=1**: fresh workspace, only `main` → `afx architects` prints a single coherent row; `--json` parses. +5. **Roster at N>1**: two architects with distinct owned issues + live builders → table shows correct joins and state-file paths. +6. **Add-architect creates state file**: `add-architect --name b` with no existing file → file created from template; running it again does not clobber an edited file. +7. **Remove-architect archives + releases**: retire `b` → its state file is archived (recoverable), its ledger entries marked released, its live builders keep running with attribution re-homed to `main`. +8. **Rotation is loss-free**: a state file whose history exceeds the cap → head stays bounded, overflow relocated to an archive, total history preserved (concatenation reconstructs the original). +9. **Board / who-owes-next**: an issue at a pending human gate shows "owed by architect"; an issue mid-phase shows "owed by builder"; grouping by architect is correct. +10. **Builder-base SHA-pin**: with the shared checkout deliberately left on a stale commit (or a different branch), spawning a builder branches it from the freshly-fetched integration-branch-tip SHA — **not** the stale local HEAD. The resulting builder branch's merge-base with `origin/main` is `origin/main`'s tip. +11. **Backward compat**: single-`main` workspace — dashboard Work view renders byte-identically; SHA-pin resolves to a fresh `origin/main`; no new directories/config; all commands behave as before. +12. **Concurrent-spawn race**: two architects spawn on the same unowned issue near-simultaneously → exactly one wins the ledger (partial unique index), the other gets the dedup warning; no double live-owner. +13. **Env-var validation (N>1)**: spawn with `CODEV_ARCHITECT_NAME` unset/unknown in a >1-architect workspace → refused (no silent `main` ownership). At N=1, unchanged fallback to `main`. +14. **SHA-pin fail-fast**: with the integration ref unreachable (fetch fails), spawn **errors loudly** and does not silently branch from a stale local HEAD. A `--base <sha>` override branches from the explicit SHA instead. +15. **Rotation boundary safety**: a state file whose history region contains a fenced code block → rotation moves only whole entries below `<!-- ARCHIVE BOUNDARY -->`, never splitting the code block; head untouched. +16. **Override is auditable**: `--override-owner` releases the prior entry and inserts a new one with `override_of` set; the ledger shows the transfer history. + +### Non-Functional Tests +1. **Performance**: roster/board endpoints return within budget for a ≤10-architect/≤20-builder fleet, including when forge data is stale (must not shell out per issue). +2. **SHA-pin isolation from shared-checkout state**: the captured base SHA is independent of the shared checkout's current branch/HEAD, so a builder's base is correct no matter what state any architect left the shared checkout in. +3. **No-daemon check**: rotation runs only on triggers/commands; no new always-on process is introduced. +4. **Loss-free rotation invariant**: archive file + live file concatenated reconstructs the original pre-rotation content. + +## Dependencies +- **External Services**: GitHub (via the forge abstraction) for issue open/closed/title; must degrade gracefully for other forges. +- **Internal Systems**: Tower server + REST client; SQLite state DB + `db/schema.ts`; afx CLI (Commander); dashboard overview API + Work view; git worktree machinery (`spawn-worktree.ts`). +- **Libraries/Frameworks**: Existing only — Commander, Vitest, React/Vite, the project's SQLite layer. No new runtime dependency anticipated. + +## References +- Issue [#984](https://github.com/cluesmith/codev/issues/984) +- Predecessor specs/reviews: `codev/specs/823-multi-architect-coordination-b.md` (+ its review), `…/786-…`, `…/774`-era affinity, `…/755-…`, `…/761-…` +- Surfaces: `packages/codev/src/agent-farm/cli.ts`, `commands/spawn.ts`, `commands/spawn-worktree.ts`, `commands/workspace-add-architect.ts`, `…-remove-architect.ts`, `db/schema.ts`, `state.ts`, `servers/tower-routes.ts`, `servers/overview.ts`, `lib/forge.ts`; `packages/dashboard/src/components/WorkView.tsx`; `packages/types/src/api.ts` + +## Risks and Mitigation +| Risk | Probability | Impact | Mitigation Strategy | +|------|------------|--------|-------------------| +| #6 SHA-pin reads the wrong base (hardcoded `main`) or silently falls back to stale HEAD on fetch failure | Med | High | Read the workspace's real default branch from config/forge, never hardcode; **fail-fast** on fetch error (no silent fallback); `--base` override; common case = fresh `origin/main` (unchanged expectation) | +| Dashboard N>1 grouping regresses N=1 rendering | Med | Med | Gate all grouping behind `architectCount > 1`; visual-verify N=1 identical (per UI-verification discipline) | +| State rotation loses history | Low | High | Relocate-not-delete; test that concatenation reconstructs the original; never operate in place destructively | +| Ledger semantics confuse same-architect resume / closed issues | Med | Med | Explicit "same-architect never blocks" rule; Open Question on closed-issue behavior resolved before coding | +| Scope creep across six interdependent points in one PR | High | Med | Phase the plan by leverage (#3+#1 → #5 → #4+#2 → #6); each phase independently committable; #6 is now a small SHA-pin change, reducing total risk | +| Source-workspace details leak into the public repo | Low | Med | Generalized fixtures only; scrub names/ids/terminology | + +## Expert Consultation + +### Consultation Log — Iteration 1 (2026-06-03) +**Models Consulted**: Gemini (REQUEST_CHANGES), Codex (REQUEST_CHANGES), Claude (COMMENT). All HIGH confidence; codebase claims verified accurate by Claude. + +**Convergent themes and how they were addressed:** +- **Resolve acceptance-critical Open Questions now** (Gemini + Codex) → added the **Resolved Design Decisions** section; promoted state-file location, override flag, rotation mechanism, last-activity source, and retire policy from open to fixed. +- **#5 markdown-truncation safety** (Gemini) → rotation now uses a parseable `<!-- ARCHIVE BOUNDARY -->` delimiter and only moves whole history entries; never splits code blocks. Added loss-free reconstruction test. +- **#6 data-loss / worktree path / spawn-base resolution** (Gemini + Claude + Codex) → added `.architects/<name>/` (gitignored) location, dirty-worktree abort-unless-`--force` retire guard, and a spawn-base-resolution sketch ("builder branches from the spawning architect's worktree HEAD"). +- **#6 scope contradiction** (Codex) → resolved: `main` = main checkout; new architects isolated by default; pre-existing migrate via explicit command. Success criterion reworded for consistency. +- **Ledger precision + concurrent-spawn race** (Codex + Claude) → added override-is-release-and-reinsert semantics, close/reopen + cleanup rules, and a partial unique index making check-then-insert atomic. +- **Forge query mechanism + degraded behavior** (Claude + Codex) → roster intersects ledger ids against the existing overview cache; renders `unknown` on forge outage; never per-issue shell-out. +- **Who-owes-next edge cases** (Claude + Codex) → made it a total function with explicit cases (gate→architect, mid-phase→builder, awaiting-merge→architect, owned-unspawned→architect, idle→builder-stalled, else→`unknown`). +- **`CODEV_ARCHITECT_NAME` fallback in N>1** (Claude) → spawn now requires a registered name in multi-architect workspaces; added security note + test. +- **No new dashboard deps for #2** (Claude) → confirmed; grouping reuses existing components + `architectCount`/`spawnedByArchitect` from #823. + +**Remaining open (deferred to architect at the gate):** #6 PR-slicing, pre-existing-architect migration ergonomics, and two nice-to-knows (`--board` text digest; closed-issue dedup behavior). + +> **Post-gate note:** the architect resolved #6 by switching to **builder-base SHA-pin** (see Amendment 1), which dissolved both the PR-slicing and migration-ergonomics questions. The iter-1 consultation analysis above is preserved as the historical record of the *worktree-isolation* design that was considered and then superseded. + +**Date**: 2026-06-03 +**Models Consulted**: gemini, codex, claude + +## Approval +- [ ] Architect (human) review at `spec-approval` gate +- [ ] Expert AI Consultation Complete (3-way) + +## Notes +- This spec shipped recommended defaults with explicitly-flagged Open Questions for the consequential forks; at the spec-approval gate the architect confirmed the defaults and **redirected #6 from per-architect worktree isolation to builder-base SHA-pin** (see the Amendment below). +- "Build all six as one coherent SPIR" is honored: the points share a data foundation (the ledger) and a presentation (roster/board), and lifecycle/state/SHA-pin all hang off the same architect identity. All six ship in one PR. + +--- + +## Amendments + +### Amendment 1 — #6 redirected to builder-base SHA-pin (2026-06-04, at spec-approval) + +**Summary**: The architect approved the spec at the spec-approval gate with one change: **Point #6 becomes builder-base SHA-pinning (the spec's documented fallback B), replacing per-architect checkout isolation (`.architects/<name>/` worktrees).** + +**Why**: The damaging failure mode in practice is the *stale/wrong-base builder* (builders repeatedly branched from a `main` ~145 commits behind), which SHA-pin fixes directly and cheaply. Per-architect worktrees carry a high-risk Tower/spawn refactor whose main benefit — independent architect branch state — is rarely needed, since architects sit on the integration branch. Architects continue sharing the main checkout; the "never switch branches" discipline rule stays for architects (accepted). + +**Spec changes** (applied throughout, not deferred): Point #6 section rewritten to SHA-pin (capture a fresh integration-branch-tip SHA via fetch + `rev-parse`, branch the builder from it, fail-fast on fetch error, `--base` override); Desired State point 6, Success Criterion #6, Current State gap, Constraints, Assumptions, Performance, Risks, Test Scenarios 10/11/14 + non-functional #2 all updated to match. The `migrate-architect` surface and the dirty-worktree retire guard are **dropped** (no architect worktrees to migrate or clean). #4's retire steps simplified accordingly. + +**Resolved consequences**: #6 PR-slicing → **moot** (ships in the main PR); pre-existing-architect migration ergonomics → **moot** (no worktrees). All #1–#5 resolved decisions stand unchanged. + +**Plan impact**: #6 becomes a small, low-risk phase in the spawn base-resolution path; the plan should still order it last but no longer treat it as a sliceable high-blast-radius piece. diff --git a/codev/state/spir-984_thread.md b/codev/state/spir-984_thread.md new file mode 100644 index 000000000..f2e64c932 --- /dev/null +++ b/codev/state/spir-984_thread.md @@ -0,0 +1,48 @@ +# spir-984 — Multi-architect coordination layer + +## Project +Issue #984 (SPIR, area/tower). Six coordinated points: +1. Architect roster (`afx architects`) +2. Unified board / digest (prefer extending dashboard Work view / `afx tower`) +3. Issue-ownership ledger + dedup-at-spawn +4. Formal lifecycle (enhance `afx workspace add-architect`/`remove-architect`) +5. Bounded, templated state files (`codev/state/<id>_thread.md`) +6. Per-architect checkout isolation (largest/riskiest) + +## Status log + +### Specify — started 2026-06-03 +- Strict mode, porch-driven. Phase = `specify`. Spec did not pre-exist; porch spawned me to create it. +- Reading protocol + template; exploring codebase surfaces for all six points. + +### Findings that shape the spec +- **Spec 823 ("coordination-b") already shipped** (review on main): builder attribution in dashboard, inter-agent messaging docs, per-builder thread-file convention, VSCode Architects-tree auto-refresh. 984 builds the NEXT layer — must not re-do these. +- **Architects are stored in SQLite** `architect` table (`packages/codev/src/agent-farm/db/schema.ts`), composite PK `(workspace_path, id)`. `id` is the architect name ('main', 'architect-2', …). No per-architect markdown state file exists in code today. +- **Spawn affinity**: `CODEV_ARCHITECT_NAME` env injected by Tower into each architect terminal; read at `commands/spawn.ts`; builders carry `spawned_by_architect`. +- **`codev/state/<id>_thread.md`** are *builder* free-text logs, written by the builder AI via the Write tool (no code produces them). Point #5 "bounded state" targets these + any new architect state file. +- **Worktrees**: builders get `git worktree add` from the shared main checkout (`commands/spawn-worktree.ts`); all architects share that one checkout → the coupling #6 addresses. Natural fix = per-architect worktrees (mirrors builder model, cheap object-store sharing). +- afx CLI = Commander.js in `agent-farm/cli.ts`; Tower routes in `servers/tower-routes.ts`; dashboard Work view in `packages/dashboard/src/components/WorkView.tsx` + `/api/overview` (`servers/overview.ts`). Tests = Vitest in `agent-farm/__tests__/`. + +### Key scoping decision +The issue's "architect state files" description comes from the private source workspace (markdown working-memory files). In OUR repo that convention doesn't formally exist — so #4 must *introduce* it, and #5 bounds both builder thread files and the new architect state file. Surfacing this in the spec's Current State + Open Questions rather than blocking. + +Writing spec now with recommended defaults + explicit Open Questions for: #6 checkout-isolation approach (worktree vs clone vs pin-base) and whether it ships as its own PR; #5 rotation mechanism; architect state-file convention/location. + +- Spec drafted + committed (37b1987a). Ran porch-orchestrated 3-way consult (iter1). +- **Consult verdicts**: Gemini REQUEST_CHANGES, Codex REQUEST_CHANGES, Claude COMMENT (all HIGH conf; Claude verified codebase claims accurate). +- Incorporated everything → committed d7e17715. Key resolutions: state file at `codev/state/architects/<name>.md`; `--override-owner` w/ release-and-reinsert + `override_of`; partial unique index for atomic dedup; rotation via `<!-- ARCHIVE BOUNDARY -->` whole-entry moves (loss-free); `.architects/<name>/` gitignored worktrees, main=main checkout, new architects isolated by default, pre-existing migrate explicitly; dirty-worktree retire aborts unless --force; roster open/closed via overview cache (unknown on outage); who-owes-next total function; CODEV_ARCHITECT_NAME required in N>1. +- **Left for architect at gate**: #6 PR-slicing (all 3 recommend own PR), pre-existing-architect migration ergonomics, 2 nice-to-knows. +- Wrote rebuttal (984-specify-iter1-rebuttals.md, committed) documenting how each REQUEST_CHANGES point was addressed. +- **GATE: spec-approval reached.** Registered via `porch gate 984`; notified architect via `afx send`. STOPPED, awaiting human `porch approve 984 spec-approval`. Not calling approve myself (strict mode). +- Architect decisions pending at gate: (1) #6 PR-slicing (own PR recommended by all 3), (2) pre-existing-architect migration ergonomics. + +### Spec APPROVED (2026-06-04) + Plan phase +- **Architect approved spec at gate** with ONE modification: **#6 → builder-base SHA-pin** (documented fallback B), NOT per-architect worktrees. Rationale: the damaging failure mode is stale-base builders (hit a `main` ~145 commits behind repeatedly); SHA-pin fixes it cheaply without the Tower/spawn refactor. Architects keep sharing main checkout; "never switch branches" stays for architects. Resolves both open gate Qs: PR-slicing moot (ships in main PR), migration ergonomics moot (no worktrees). +- Amended spec comprehensively (Amendment 1 + all #6-touching sections: title, Desired State, Success Criterion #6, Current State, Constraints, Assumptions, Performance, Risks, Test Scenarios 10/11/14, Resolved Decisions, Open Questions, Notes). Committed. +- NOTE: I did NOT self-approve the porch gate. After I committed the amendment, porch status had already advanced past spec-approval (human ran `porch approve`), so I proceeded to Plan. +- **Plan phase**: ran a 2nd Explore to nail integration points (schema migrations via db/index.ts ensureLocalDatabase; state.ts CRUD; spawn.ts SPAWNING_ARCHITECT_NAME + upsertBuilder; spawn-worktree.ts createWorktree base; Tower add/remove in tower-instances.ts ~L987/L1161; tower-client.ts; overview.ts OverviewCache; WorkView.tsx; cli.ts Commander; Vitest tests). +- Wrote plan: 7 phases (ledger → roster → bounded-state → lifecycle → board → SHA-pin → docs), machine-readable phases JSON, all porch checks pass (plan_exists/has_phases_json/min_two_phases=7). Committed. +- Ran 3-way plan consult (iter1): **Gemini APPROVE, Claude APPROVE, Codex REQUEST_CHANGES** (all HIGH). +- Codex's 4 points all valid + fixed: (1) claim ownership BEFORE worktree/session creation + rollback (old order couldn't prevent side effects); (2) Phase 6 attached-branch flow (old sketch detached); (3) Phase 5 board must join ledger + synthesize owned-but-unspawned rows; (4) types.ts/spawn.test.ts plumbing. Plus approver notes (state-rotation→codev/src/lib, line numbers, mkdir -p, default-branch fallback chain, createWorktree-only, closed-issue dedup default=block, stale-HEAD test fixture). Wrote rebuttal. +- **GATE: plan-approval reached.** Registered via `porch gate 984`; notified architect. STOPPED, awaiting human `porch approve 984 plan-approval`. Not self-approving. +- After approval → Implement phase (7 phases, IDE cycle each, all one PR).