diff --git a/.claude/commands/grill-me.md b/.claude/commands/grill-me.md new file mode 100755 index 00000000..b9a83c59 --- /dev/null +++ b/.claude/commands/grill-me.md @@ -0,0 +1,9 @@ +--- +description: Interview me relentlessly to stress-test a plan or design. +--- + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. + +Ask the questions one at a time. + +If a question can be answered by exploring the codebase, explore the codebase instead. diff --git a/.claude/commands/memo.md b/.claude/commands/memo.md new file mode 100755 index 00000000..fb47a417 --- /dev/null +++ b/.claude/commands/memo.md @@ -0,0 +1,43 @@ +--- +description: Save current task state to auto-memory, then promote reusable lessons to skills and trim memory. +--- + +# Memo + +Save a snapshot of current work to persistent memory, then clean up. + +## Step 1 — Save current state + +Write a concise summary of in-progress or recently completed work to the +auto-memory `MEMORY.md` for this project. Include: + +- What was done (feature, bug, refactor, area of code) +- Current status (completed, blocked, in-progress) +- Key decisions or outcomes worth remembering across conversations + +Do not duplicate information already in skills, CLAUDE.md, or README-CLAUDE.md. + +## Step 2 — Promote to skills + +Review the memory file for items that represent **reusable patterns or +lessons** — things that would help future sessions on this project. For +each such item: + +1. Identify which skill file it belongs in (or create a new one under + `.claude/skills//SKILL.md`). +2. Add it to the appropriate skill. +3. Remove it from memory (it now lives in the skill). + +Examples of promotable items: +- A non-obvious convention specific to this project +- A "foot-gun" pattern worth warning future-you about +- A reusable recipe (test invocation, deploy command, debugging trick) + +## Step 3 — Trim memory + +Remove from memory anything that is: +- Already captured in skills, CLAUDE.md, or README-CLAUDE.md +- Too specific to a single completed task to be useful again +- Stale or superseded by later work + +Keep memory concise — ideally under 30 lines. diff --git a/.claude/commands/to-issues.md b/.claude/commands/to-issues.md new file mode 100755 index 00000000..4ec9a561 --- /dev/null +++ b/.claude/commands/to-issues.md @@ -0,0 +1,96 @@ +--- +description: Break a plan/PRD into independently-grabbable issues using tracer-bullet vertical slices. +argument-hint: "[issue ref or path, optional]" +--- + +# To Issues + +Break a plan into independently-grabbable issues using vertical slices (tracer bullets). + +The issue tracker and triage label vocabulary should have been provided to you — run `/setup-matt-pocock-skills` if not. + +## Process + +### 1. Gather context + +Work from whatever is already in the conversation context. If the user passes an issue reference (issue number, URL, or path) as an argument, fetch it from the issue tracker and read its full body and comments. + +### 2. Explore the codebase (optional) + +If you have not already explored the codebase, do so to understand the current state of the code. Issue titles and descriptions should use the project's domain glossary vocabulary, and respect ADRs in the area you're touching. + +### 3. Draft vertical slices + +Break the plan into **tracer bullet** issues. Each issue is a thin vertical slice that cuts through ALL integration layers end-to-end, NOT a horizontal slice of one layer. + +Slices may be 'HITL' or 'AFK'. HITL slices require human interaction, such as an architectural decision or a design review. AFK slices can be implemented and merged without human interaction. Prefer AFK over HITL where possible. + + +- Each slice delivers a narrow but COMPLETE path through every layer (schema, API, UI, tests) +- A completed slice is demoable or verifiable on its own +- Prefer many thin slices over few thick ones + + +### 4. Quiz the user + +Present the proposed breakdown as a numbered list. For each slice, show: + +- **Title**: short descriptive name +- **Type**: HITL / AFK +- **Blocked by**: which other slices (if any) must complete first +- **User stories covered**: which user stories this addresses (if the source material has them) + +Ask the user: + +- Does the granularity feel right? (too coarse / too fine) +- Are the dependency relationships correct? +- Should any slices be merged or split further? +- Are the correct slices marked as HITL and AFK? + +Iterate until the user approves the breakdown. + +### 5. Publish the issues to the issue tracker + +For each approved slice, publish a new issue to the issue tracker. Use the issue body template below. Apply the `needs-triage` triage label so each issue enters the normal triage flow. + +Publish issues in dependency order (blockers first) so you can reference real issue identifiers in the "Blocked by" field. + +### 6. Tell the user how to consume them + +Each slice is meant to be picked up in **its own context**. Do not implement multiple slices in a single Claude session — context bloat undermines reviewability, blurs the per-slice commit discipline, and means a single mistake mid-stream contaminates everything that follows. + +After publishing, tell the user explicitly: + +> Each slice should be implemented in its own context. Either: +> +> - Run `/clear` between slices and pick them up one at a time in fresh sessions, OR +> - Spawn one subagent per non-blocked slice and let them work in parallel (each subagent gets its own context). +> +> Don't pick up the next slice in the same session you finished the last one in. + +If the agent that ran `/to-issues` is the same agent that would naturally pick up slice #1: stop after publishing. Hand back to the user. Let them start a fresh session. + + +## Parent + +A reference to the parent issue on the issue tracker (if the source was an existing issue, otherwise omit this section). + +## What to build + +A concise description of this vertical slice. Describe the end-to-end behavior, not layer-by-layer implementation. + +## Acceptance criteria + +- [ ] Criterion 1 +- [ ] Criterion 2 +- [ ] Criterion 3 + +## Blocked by + +- A reference to the blocking ticket (if any) + +Or "None - can start immediately" if no blockers. + + + +Do NOT close or modify any parent issue. diff --git a/.claude/commands/to-prd.md b/.claude/commands/to-prd.md new file mode 100755 index 00000000..48756d8f --- /dev/null +++ b/.claude/commands/to-prd.md @@ -0,0 +1,73 @@ +--- +description: Turn the current conversation context into a PRD and publish it to the issue tracker. +--- + +Take the current conversation context and codebase understanding and produce a PRD. Do NOT interview the user — just synthesize what you already know. + +The issue tracker and triage label vocabulary should have been provided to you — run `/setup-matt-pocock-skills` if not. + +## Process + +1. Explore the repo to understand the current state of the codebase, if you haven't already. Use the project's domain glossary vocabulary throughout the PRD, and respect any ADRs in the area you're touching. + +2. Sketch out the major modules you will need to build or modify to complete the implementation. Actively look for opportunities to extract deep modules that can be tested in isolation. + +A deep module (as opposed to a shallow module) is one which encapsulates a lot of functionality in a simple, testable interface which rarely changes. + +Check with the user that these modules match their expectations. Check with the user which modules they want tests written for. + +3. Write the PRD using the template below, then publish it to the project issue tracker. Apply the `needs-triage` triage label so it enters the normal triage flow. + + + +## Problem Statement + +The problem that the user is facing, from the user's perspective. + +## Solution + +The solution to the problem, from the user's perspective. + +## User Stories + +A LONG, numbered list of user stories. Each user story should be in the format of: + +1. As an , I want a , so that + + +1. As a mobile bank customer, I want to see balance on my accounts, so that I can make better informed decisions about my spending + + +This list of user stories should be extremely extensive and cover all aspects of the feature. + +## Implementation Decisions + +A list of implementation decisions that were made. This can include: + +- The modules that will be built/modified +- The interfaces of those modules that will be modified +- Technical clarifications from the developer +- Architectural decisions +- Schema changes +- API contracts +- Specific interactions + +Do NOT include specific file paths or code snippets. They may end up being outdated very quickly. + +## Testing Decisions + +A list of testing decisions that were made. Include: + +- A description of what makes a good test (only test external behavior, not implementation details) +- Which modules will be tested +- Prior art for the tests (i.e. similar types of tests in the codebase) + +## Out of Scope + +A description of the things that are out of scope for this PRD. + +## Further Notes + +Any further notes about the feature. + + diff --git a/.claude/commands/toolbox-update.md b/.claude/commands/toolbox-update.md new file mode 100755 index 00000000..9ac1d54e --- /dev/null +++ b/.claude/commands/toolbox-update.md @@ -0,0 +1,43 @@ +--- +description: Rescan BOTH ~/.claude and the current workspace .claude and refresh the hardcoded list inside /toolbox. +--- + +Refresh the hardcoded listing inside `~/.claude/commands/toolbox.md` (and the workspace `./.claude/commands/toolbox.md` if present) so it matches the current state of disk. claude-sandbox's adapted `toolbox` enumerates BOTH locations — Claude actually loads from both, so the listing must reflect both. + +## Steps + +1. Enumerate user-global commands: every `*.md` file in `~/.claude/commands/`. The command name is the filename minus `.md`. + +2. Enumerate user-global skills: every `SKILL.md` under `~/.claude/skills/**/`. Take `name` from frontmatter. + +3. Enumerate workspace commands: every `*.md` file in `./.claude/commands/` (relative to the current working directory). + +4. Enumerate workspace skills: every `SKILL.md` under `./.claude/skills/**/`. + +5. For each file, extract the frontmatter `description` field. Trim to the first sentence (stop at the first `.` followed by space or end of string). If a description spans multiple sentences with extra detail, keep only the first. + +6. Build the new listing block in this exact format, sorted alphabetically within each section: + + ``` + **User-global commands (`~/.claude/commands/`)** + - `/` — + ... + + **User-global skills (`~/.claude/skills/`)** + - `/` — + ... + + **Workspace commands (`./.claude/commands/`)** + - `/` — + ... + + **Workspace skills (`./.claude/skills/`)** + - `/` — + ... + ``` + + Show skill names with a leading `/` so they render with the same highlighting as commands, even though skills aren't slash-invocable. If a section is empty, write `- (none)`. + +7. Edit `~/.claude/commands/toolbox.md`. Replace everything **after** the line `Output exactly the following text verbatim, with no preamble, commentary, or trailing summary:` (and the blank line that follows) with the freshly-built listing. Preserve the frontmatter and the instruction line above it. If `./.claude/commands/toolbox.md` exists, do the same to it. + +8. Print a single line confirmation: `Updated toolbox.md — + commands, + skills.` diff --git a/.claude/commands/toolbox.md b/.claude/commands/toolbox.md new file mode 100755 index 00000000..b7aaceec --- /dev/null +++ b/.claude/commands/toolbox.md @@ -0,0 +1,27 @@ +--- +description: List the commands and skills available from BOTH the user-global ~/.claude/ toolkit AND the current workspace's .claude/ overrides (fast, hardcoded — refresh with /toolbox-update). +--- + +Output exactly the following text verbatim, with no preamble, commentary, or trailing summary. The two sections below correspond to the two directories Claude actually has access to: the user-global `~/.claude/` toolkit and the current workspace's `.claude/` overrides. If a name appears in both, the workspace copy wins for the running Claude (it's loaded later); both are listed here so you can see what's available at each scope. + +**User-global commands (`~/.claude/commands/`)** +- `/grill-me` — Interview me relentlessly to stress-test a plan or design. +- `/to-issues` — Break a plan/PRD into independently-grabbable issues using tracer-bullet vertical slices. +- `/to-prd` — Turn the current conversation context into a PRD and publish it to the issue tracker. +- `/toolbox` — List the user-scoped commands and skills in ~/.claude. +- `/toolbox-update` — Rescan ~/.claude and the workspace .claude and refresh the hardcoded list inside /toolbox. +- `/write-a-skill` — Create a new agent skill with proper structure and progressive disclosure. +- `/zoom-out` — Zoom out and give a higher-level map of the surrounding code. + +**User-global skills (`~/.claude/skills/`)** +- `/diagnose` — Disciplined diagnosis loop for hard bugs and performance regressions. +- `/grill-with-docs` — Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation inline as decisions crystallise. +- `/improve-codebase-architecture` — Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. +- `/tdd` — Test-driven development with red-green-refactor loop. +- `/triage` — Triage issues through a state machine driven by triage roles. + +**Workspace commands (`./.claude/commands/`)** +- `/verify-sandbox` — Run the 17-check sandbox PASS/FAIL battery against the live process. + +**Workspace skills (`./.claude/skills/`)** +- (none unless installed via `claude-sandbox install-skill`) diff --git a/.claude/commands/verify-sandbox.md b/.claude/commands/verify-sandbox.md new file mode 100755 index 00000000..d63a0edd --- /dev/null +++ b/.claude/commands/verify-sandbox.md @@ -0,0 +1,362 @@ +--- +description: Verify the Claude sandbox is intact — runs the 16-check PASS/FAIL battery + 10 adversarial breakout probes when the battery passes, and exits non-zero on any failure so the command is usable as a CI assertion. +--- + +`/verify-sandbox` runs **two phases** against the live Claude process: + +1. The deterministic **16-check battery** — small bash tests that each + return PASS or FAIL with a one-line explanation. Covers every + defence in `README-CLAUDE.md`'s "What's locked down" table. +2. When (and only when) the 16 checks all pass, **10 adversarial + breakout probes** — open-ended attempts to escape the sandbox or + exfiltrate credentials, designed by reasoning about gaps the + deterministic checks don't directly exercise. + +Run phase 1 below in order, capture PASS/FAIL, and print the table +described under "Output format". If every check passes, run phase 2. +Any FAIL in either phase must cause the overall command to exit +non-zero (so CI assertions work). + +## Check 01 — IS_SANDBOX sentinel + +`IS_SANDBOX=1` is set inside the sandbox by `bwrap --setenv`. If +unset, Claude was launched against the real binary +(`/.runtime/claude`) directly, bypassing the sandbox entirely. +This is the fall-through sentinel. + +```bash +[ "${IS_SANDBOX:-}" = "1" ] +``` + +## Check 02 — NO_NEW_PRIVS + +bwrap sets `PR_SET_NO_NEW_PRIVS=1` before exec'ing the target, so +setuid binaries inside the sandbox cannot gain privileges. With +NO_NEW_PRIVS in effect, `/proc/self/status` reports `NoNewPrivs: 1`. +Without it, `sudo` / setuid-root binaries inside the sandbox could +elevate (in concert with a userns escape) and break the rest of +the threat model. + +The earlier check 02 read `/proc/1/comm` and expected `bwrap|claude| +node`. That was a victim of the same procfs-leak failure mode the +new check 07 documents — on rootless nested-userns hosts procfs is +mounted in the outer pidns, so `/proc/1/comm` reads the devcontainer +init (`sh`) instead of the sandbox target. The "bwrap is in our +ancestry" property is already covered by check 01 (`IS_SANDBOX=1` +is only set by `bwrap --setenv`), so check 02 was redundant *and* +broken on the hosts we care about. Repurposed to cover NO_NEW_PRIVS, +which was previously listed as "Implicit" in README-CLAUDE.md with +no PASS/FAIL check of its own. + +```bash +grep -q '^NoNewPrivs:[[:space:]]*1$' /proc/self/status +``` + +## Check 03 — strict-under-/root + +`$HOME` (typically `/root`) is a tmpfs with only `.claude`, +`.claude.json` (Claude Code's account state), and (optionally) +`.cache` bound back in, plus a `.config` intermediate tmpfs that holds +the `gh` / `glab-cli` credential binds. Claude Code itself writes +`.local/{bin,share,state}/claude` and a `.local/share/applications` +`.desktop` URL handler into the tmpfs on first launch, so `.local` is +also expected (contents live in the tmpfs, not bound from the host). +The defence-in-depth file masks (checks 14–15) also bind `/dev/null` +over `.netrc`, `.Xauthority`, and `.ICEauthority` — so those names +are expected to appear too, as size-zero entries (which checks 14–15 +verify; `.ICEauthority` is masked without a dedicated check because +it shares the X11 cookie attack surface). Anything else under `$HOME`, +or anything besides `gh` / `glab-cli` under `$HOME/.config`, means the +strict-under-/root inversion regressed. `.gitconfig` is no longer +masked — it doesn't normally appear under the tmpfs `$HOME`, but the +allow-list still permits the name in case a tool drops one. + +Claude Code, left to its own devices, would drop a Chrome native- +messaging-host manifest (`com.anthropic.claude_code_browser_extension. +json`) into each chromium-family browser's `NativeMessagingHosts` +directory on launch — `BraveSoftware`, `chromium`, `google-chrome`, +`microsoft-edge`, `opera`, `vivaldi`. That manifest registers the +in-sandbox Claude as an RPC target for any installed browser +extension, which is outside the threat model. The shadow injects +`--no-chrome` and strips user-supplied `--chrome` so the manifests +never get written, and check 03 enforces that: if any of those six +browser-named dirs reappears under `$HOME/.config`, the disable +regressed. + +```bash +# ls -A skips . and ..; the allowed top-level entries are the +# .claude/.cache binds, the .claude.json account-state bind, the +# .config intermediate tmpfs for the selectively-exposed gh/glab +# binds, the .local tree Claude Code writes into the tmpfs at +# runtime, and the four masked dotfiles intentionally bound to /dev/null. +extras="$(ls -A "$HOME" 2>/dev/null | grep -vxE '\.claude|\.claude\.json|\.cache|\.config|\.local|\.gitconfig|\.netrc|\.Xauthority|\.ICEauthority' || true)" +[ -z "$extras" ] || exit 1 +# When .config is present (bwrap intermediate for the credential +# binds), assert it contains only the trusted subdirs — anything else +# means either a sibling ~/.config tool (VS Code, etc.) leaked through +# or the shadow's --no-chrome injection regressed (browser dirs from +# Claude Code's Chrome native-messaging-host self-registration). +if [ -d "$HOME/.config" ]; then + config_extras="$(ls -A "$HOME/.config" 2>/dev/null | grep -vxE 'gh|glab-cli' || true)" + [ -z "$config_extras" ] +fi +``` + +## Check 04 — env scrub: GH_TOKEN + +With `--clearenv` and an explicit allow-list, `GH_TOKEN` from the +host shell must be empty inside the sandbox. + +```bash +[ -z "${GH_TOKEN:-}" ] +``` + +## Check 05 — env scrub: DISPLAY + +`DISPLAY` is deliberately not in the `--clearenv` allow-list — it +closes the X11 reachability path. + +```bash +[ -z "${DISPLAY:-}" ] +``` + +## Check 06 — cap_drop ALL + +`--cap-drop ALL` empties the effective capability set. `CapEff` in +`/proc/self/status` reads all zeros. + +```bash +grep -q '^CapEff:\s*0\{16\}$' /proc/self/status +``` + +## Check 07 — --unshare-pid (kernel pidns isolation) + +`--unshare-pid` puts the sandbox in a nested PID namespace. The +kernel-level effect is what matters for the threat model: `kill()` / +`ptrace()` are scoped to the new pidns, so the sandbox cannot signal +or attach to host or devcontainer processes. We positively assert +the nesting via `/proc/self/status:NSpid:` — outside any sandbox +this has one entry; inside one nested pidns it has two. + +The companion property (procfs *view* aligned with the new pidns) is +not checked here. On rootless devcontainer hosts bwrap's `--proc /proc` +mounts procfs against its outer pidns rather than the spawned child's, +so process-tree visibility leaks even though kernel kill/ptrace +scoping is intact. The launch-time probe in claude-shadow detects this +and sets `CLAUDE_SANDBOX_FRESH_PROC=0`. Credential-bearing procfs +entries (`/proc//environ`, `/maps`, `/fd`, `/mem`) stay gated by +`PTRACE_MODE_READ_FSCREDS` + YAMA `ptrace_scope=1`, so leaked +visibility does not become credential exfil — but see README-CLAUDE.md +for the honest tally. + +```bash +# NSpid: lists our PID across each pidns level (outermost first). +# With --unshare-pid in effect we sit in at least one nested pidns, +# so the line has >= 2 fields after the label. +nspid_count=$(awk '$1=="NSpid:"{print NF-1;exit}' /proc/self/status) +[ "${nspid_count:-1}" -ge 2 ] +``` + +## Check 08 — --unshare-ipc + +The SysV IPC namespace differs from the host's. We compare the +inode of `/proc/self/ns/ipc` to PID 1's (PID 1 is bwrap-or-claude +inside, by check 02; the inodes differ from the host's by virtue of +unshare). + +```bash +# inside an unshared ipcns, /proc/self/ns/ipc resolves to a different +# inode than the un-namespaced kernel default. We can't sample the +# host inode from inside, but we CAN assert /proc/self/ns/ipc exists +# and is a symlink to a unique ipc:[]. +ipc_link="$(readlink /proc/self/ns/ipc 2>/dev/null || true)" +case "$ipc_link" in ipc:\[*\]) exit 0 ;; *) exit 1 ;; esac +``` + +## Check 09 — --unshare-uts + +The UTS namespace is unshared, so a hostname change inside doesn't +affect the host. We assert the namespace symlink exists with the +expected shape; the integration test exercises the behavioural property. + +```bash +uts_link="$(readlink /proc/self/ns/uts 2>/dev/null || true)" +case "$uts_link" in uts:\[*\]) exit 0 ;; *) exit 1 ;; esac +``` + +## Check 10 — private /dev (TIOCSTI blocked) + +We dropped `--new-session` so SIGWINCH and job control reach the +sandbox. The TIOCSTI defence is now delivered by two coupled +mechanisms: the shadow wraps bwrap in `script(1)` (the in-sandbox +process inherits script's allocated pty as its controlling terminal, +not the host's), and `bwrap_argv.sh` uses `--dev /dev` (a fresh +devtmpfs with a fresh devpts mount — the host's `/dev/pts/*` is +not visible). An ioctl(TIOCSTI) inside the sandbox can therefore +only inject into script's pty, whose contents script reads and +writes as *output bytes* to the host terminal — never as input to +the parent shell. + +```bash +# /dev must be a fresh mount inside the sandbox (not a bind of the +# host's /dev). Under --dev /dev bwrap mounts a private devtmpfs; +# under --dev-bind /dev /dev it would be a bind mount. mountinfo +# field 9 (fs type) distinguishes them. +awk '$5 == "/dev" { print $9; exit }' /proc/self/mountinfo \ + | grep -qE '^(tmpfs|devtmpfs)$' +``` + +## Check 11 — /tmp is tmpfs and empty + +The host's `/tmp` carries VS Code IPC sockets (`vscode-ipc-*.sock`, +`vscode-git-*.sock`). `--tmpfs /tmp` masks them. We assert no such +socket is visible. + +```bash +# No vscode-ipc-*.sock and no vscode-git-*.sock visible inside. +! ls /tmp/vscode-ipc-*.sock /tmp/vscode-git-*.sock >/dev/null 2>&1 +``` + +## Check 12 — /run/user is tmpfs and empty + +`--tmpfs /run/user` masks the user's runtime directory which can hold +DBus sockets and other IPC bridges. + +```bash +[ -z "$(ls -A /run/user 2>/dev/null)" ] +``` + +## Check 13 — /run/secrets is tmpfs and empty + +`--tmpfs /run/secrets` closes the Docker/Compose secrets path even +when the host has populated `/run/secrets/*`. + +```bash +[ -z "$(ls -A /run/secrets 2>/dev/null)" ] +``` + +## Check 14 — file mask: .netrc empty + +`--bind-try /dev/null /root/.netrc` masks any host `.netrc` +credentials. + +```bash +[ ! -s "$HOME/.netrc" ] +``` + +## Check 15 — file mask: .Xauthority empty + +`--bind-try /dev/null /root/.Xauthority` masks the X11 cookie that +would otherwise authenticate against a host X server. + +```bash +[ ! -s "$HOME/.Xauthority" ] +``` + +## Check 16 — curated gitconfig active + +`GIT_CONFIG_GLOBAL=/etc/claude-gitconfig` is exported and the file's +`user.email` matches the host's. Verifies that the curated gitconfig +is in effect at every launch. + +```bash +[ "${GIT_CONFIG_GLOBAL:-}" = "/etc/claude-gitconfig" ] && \ + [ -n "$(git config --get user.email 2>/dev/null)" ] +``` + +## Phase 2 — Adversarial probes (only when 01–16 all PASS) + +When the deterministic battery is clean, think of **10 novel breakout +approaches and try them**. The goal is to find a gap the 16-check +matrix doesn't directly exercise — anything that lets the sandbox +escape its filesystem inversion, recover scrubbed env vars, reach +the host's network identity, signal/observe processes outside the +pidns, or otherwise violate the threat model in +`README-CLAUDE.md`. + +Constraints on the probes: + +- Distinct from each other and from phase 1. Don't just re-test + `--cap-drop ALL` or `--clearenv` from a different angle. +- Each probe is a single bash snippet (or a short sequence) that + attempts the breakout, then a one-line classification: + - **[BLOCKED]** — the attempt failed in the way the sandbox + expects (EACCES, EPERM, ENOENT for masked paths, etc.). + - **[ESCAPED]** — the attempt succeeded in a way that violates + the threat model (e.g., readable host credential, writable + host path outside the workspace, observable host process tree + beyond what `/proc` leak already discloses, signal delivered + to a process outside the pidns). + - **[INCONCLUSIVE]** — the attempt didn't error but didn't + demonstrate a breach either; explain why. +- Bias toward novelty: kernel interfaces (eBPF, perf events, kernel + keyrings, io_uring), filesystem corners (proc, sys, debugfs, + cgroup, securityfs, `/proc//root` traversal), env-var + recovery paths, IPC channels (abstract unix sockets, signalfd, + pidfd, fanotify), network reachability (loopback services, + /etc/resolv.conf, AF_NETLINK, raw sockets), credential paths + (shells/CLIs that look in unexpected places), exec-chain + escalation (setuid binaries despite NO_NEW_PRIVS, file + capabilities), bwrap-specific (`--die-with-parent` race, + `--new-session` bypass), env-redirect bypasses that would route + `git` back to a host gitconfig despite GIT_CONFIG_GLOBAL. + +Print the probes as a numbered list under a header +`Adversarial probes:`, each line `[BLOCKED|ESCAPED|INCONCLUSIVE] +NN `. Any **[ESCAPED]** makes +the overall result `SANDBOX LEAKING` regardless of phase 1, and +the command exits non-zero. **[INCONCLUSIVE]** is informational +and does not change the exit code, but every inconclusive probe +should be followed by a "Suggested follow-up:" line proposing what +a more targeted test would look like. + +If all 10 probes are **[BLOCKED]**, the sandbox passes both phases +and the final line becomes `RESULT: SANDBOX OK (16 deterministic + +10 adversarial)`. + +## Output format + +Print a header line `"/verify-sandbox: 16 checks"`, then one +`[PASS]` / `[FAIL]` line per check (zero-padded number, name, +one-line explanation on FAIL), then a `Summary:` line. + +``` +/verify-sandbox: 16 checks + [PASS] 01 IS_SANDBOX sentinel set + [PASS] 02 NO_NEW_PRIVS: setuid escalation blocked + [PASS] 03 strict-under-/root: only .claude (+.cache/.local) under $HOME + [PASS] 04 env scrub: GH_TOKEN empty + [PASS] 05 env scrub: DISPLAY empty + [PASS] 06 cap_drop ALL: CapEff=0000000000000000 + [PASS] 07 --unshare-pid: NSpid has >= 2 entries (kernel pidns isolated) + [PASS] 08 --unshare-ipc: ipcns symlink present + [PASS] 09 --unshare-uts: utsns symlink present + [PASS] 10 --new-session: no controlling tty (TIOCSTI blocked) + [PASS] 11 /tmp tmpfs: no vscode-ipc-*.sock visible + [PASS] 12 /run/user empty + [PASS] 13 /run/secrets empty (Docker/Compose secrets masked) + [PASS] 14 file mask: $HOME/.netrc is empty + [PASS] 15 file mask: $HOME/.Xauthority is empty + [PASS] 16 curated gitconfig: GIT_CONFIG_GLOBAL set, user.email present + Summary: 16 PASS / 0 FAIL + +Adversarial probes: + [BLOCKED] 01 read /proc//environ — EACCES (YAMA ptrace_scope=1) + [BLOCKED] 02 reach VS Code IPC via /tmp/vscode-ipc-*.sock — ENOENT (tmpfs masks) + [BLOCKED] 03 abuse /proc/self/exe to re-launch with caps — exec'd binary still caps=0 + ... (8 more) + Adversarial summary: 10 BLOCKED / 0 ESCAPED / 0 INCONCLUSIVE +``` + +If any phase-1 check FAILs, replace `[PASS]` with `[FAIL]` and +append the specific reason to that line. Then exit non-zero and +SKIP phase 2 entirely (no point red-teaming a known-broken +sandbox). + +If any phase-2 probe is `[ESCAPED]`, exit non-zero regardless of +phase-1 results. + +Final result line: +- All 16 PASS + 10 BLOCKED → `RESULT: SANDBOX OK (16 deterministic + 10 adversarial)` +- All 16 PASS + ≥1 INCONCLUSIVE + 0 ESCAPED → `RESULT: SANDBOX OK (16 deterministic + N BLOCKED, M INCONCLUSIVE)` +- Any FAIL or ESCAPED → `RESULT: SANDBOX LEAKING — open an issue against gilesknap/claude-sandbox` diff --git a/.claude/commands/write-a-skill.md b/.claude/commands/write-a-skill.md new file mode 100755 index 00000000..e40d93c5 --- /dev/null +++ b/.claude/commands/write-a-skill.md @@ -0,0 +1,116 @@ +--- +description: Create a new agent skill with proper structure and progressive disclosure. +--- + +# Writing Skills + +## Process + +1. **Gather requirements** - ask user about: + - What task/domain does the skill cover? + - What specific use cases should it handle? + - Does it need executable scripts or just instructions? + - Any reference materials to include? + +2. **Draft the skill** - create: + - SKILL.md with concise instructions + - Additional reference files if content exceeds 500 lines + - Utility scripts if deterministic operations needed + +3. **Review with user** - present draft and ask: + - Does this cover your use cases? + - Anything missing or unclear? + - Should any section be more/less detailed? + +## Skill Structure + +``` +skill-name/ +├── SKILL.md # Main instructions (required) +├── REFERENCE.md # Detailed docs (if needed) +├── EXAMPLES.md # Usage examples (if needed) +└── scripts/ # Utility scripts (if needed) + └── helper.js +``` + +## SKILL.md Template + +```md +--- +name: skill-name +description: Brief description of capability. Use when [specific triggers]. +--- + +# Skill Name + +## Quick start + +[Minimal working example] + +## Workflows + +[Step-by-step processes with checklists for complex tasks] + +## Advanced features + +[Link to separate files: See [REFERENCE.md](REFERENCE.md)] +``` + +## Description Requirements + +The description is **the only thing your agent sees** when deciding which skill to load. It's surfaced in the system prompt alongside all other installed skills. Your agent reads these descriptions and picks the relevant skill based on the user's request. + +**Goal**: Give your agent just enough info to know: + +1. What capability this skill provides +2. When/why to trigger it (specific keywords, contexts, file types) + +**Format**: + +- Max 1024 chars +- Write in third person +- First sentence: what it does +- Second sentence: "Use when [specific triggers]" + +**Good example**: + +``` +Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when user mentions PDFs, forms, or document extraction. +``` + +**Bad example**: + +``` +Helps with documents. +``` + +The bad example gives your agent no way to distinguish this from other document skills. + +## When to Add Scripts + +Add utility scripts when: + +- Operation is deterministic (validation, formatting) +- Same code would be generated repeatedly +- Errors need explicit handling + +Scripts save tokens and improve reliability vs generated code. + +## When to Split Files + +Split into separate files when: + +- SKILL.md exceeds 100 lines +- Content has distinct domains (finance vs sales schemas) +- Advanced features are rarely needed + +## Review Checklist + +After drafting, verify: + +- [ ] Description includes triggers ("Use when...") +- [ ] SKILL.md under 100 lines +- [ ] No time-sensitive info +- [ ] Consistent terminology +- [ ] Concrete examples included +- [ ] References one level deep diff --git a/.claude/commands/zoom-out.md b/.claude/commands/zoom-out.md new file mode 100755 index 00000000..1fc05204 --- /dev/null +++ b/.claude/commands/zoom-out.md @@ -0,0 +1,5 @@ +--- +description: Zoom out and give a higher-level map of the surrounding code. +--- + +I don't know this area of code well. Go up a layer of abstraction. Give me a map of all the relevant modules and callers, using the project's domain glossary vocabulary. diff --git a/.claude/hooks/sandbox-check.sh b/.claude/hooks/sandbox-check.sh new file mode 100755 index 00000000..f85a35c0 --- /dev/null +++ b/.claude/hooks/sandbox-check.sh @@ -0,0 +1,36 @@ +#!/usr/bin/env bash +# UserPromptSubmit hook. Verifies the Claude sandbox is intact before +# every prompt. Exit 2 blocks the prompt and surfaces the message. +# +# Belt-and-suspenders against the "user invoked Claude via a non-shadow +# path" bypass — the bwrap launcher sets IS_SANDBOX=1, so an unset +# value means we are not in the sandbox. + +fail() { echo "BLOCKED: $1" >&2; exit 2; } + +[ "${IS_SANDBOX:-}" = "1" ] || \ + fail "IS_SANDBOX unset — Claude was launched outside the bwrap shadow. Run via /usr/local/bin/claude." + +# Strict-under-/root: the host gitconfig must NOT be readable. +[ ! -e "$HOME/.gitconfig" ] || ! [ -s "$HOME/.gitconfig" ] || \ + fail "$HOME/.gitconfig is reachable — strict-under-/root inversion broken or the file mask regressed." + +# Env scrub: tokens that may have been on the host shell must be empty. +[ -z "${GH_TOKEN:-}" ] || fail "GH_TOKEN is set inside the sandbox — --clearenv allowlist regressed." +[ -z "${GITHUB_TOKEN:-}" ] || fail "GITHUB_TOKEN is set inside the sandbox — --clearenv allowlist regressed." +[ -z "${ANTHROPIC_API_KEY:-}" ] || fail "ANTHROPIC_API_KEY is set inside the sandbox — --clearenv allowlist regressed." +[ -z "${SSH_AUTH_SOCK:-}" ] || fail "SSH_AUTH_SOCK is set inside the sandbox — --clearenv allowlist regressed." +[ -z "${DISPLAY:-}" ] || fail "DISPLAY is set inside the sandbox — --clearenv allowlist regressed." + +# Curated gitconfig steering. +[ "${GIT_CONFIG_GLOBAL:-}" = "/etc/claude-gitconfig" ] || \ + fail "GIT_CONFIG_GLOBAL is '${GIT_CONFIG_GLOBAL:-}', not /etc/claude-gitconfig — git would fall back to the host gitconfig." +[ "${GIT_CONFIG_SYSTEM:-}" = "/dev/null" ] || \ + fail "GIT_CONFIG_SYSTEM is '${GIT_CONFIG_SYSTEM:-}', not /dev/null — git would read the host /etc/gitconfig." + +# /run/secrets must be empty. +if [ -d /run/secrets ] && [ -n "$(ls -A /run/secrets 2>/dev/null)" ]; then + fail "/run/secrets is non-empty — Docker/Compose secrets are reachable. tmpfs mask regressed." +fi + +exit 0 diff --git a/.claude/settings.json b/.claude/settings.json new file mode 100644 index 00000000..24d8de4b --- /dev/null +++ b/.claude/settings.json @@ -0,0 +1,18 @@ +{ + "hooks": { + "UserPromptSubmit": [ + { + "hooks": [ + { + "type": "command", + "command": ".claude/hooks/sandbox-check.sh" + } + ] + } + ] + }, + "statusLine": { + "type": "command", + "command": ".claude/statusline-command.sh" + } +} diff --git a/.claude/skills/claude-sandbox/SKILL.md b/.claude/skills/claude-sandbox/SKILL.md new file mode 100755 index 00000000..0df0eec8 --- /dev/null +++ b/.claude/skills/claude-sandbox/SKILL.md @@ -0,0 +1,248 @@ +--- +name: claude-sandbox +description: Architecture decisions and historical reversals for this repo's bwrap-based Claude sandbox. Covers real claude off PATH, container-scoped PATs, Ubuntu-24.04 CI bwrap workarounds, dogfood ≈ guest, the `just promote` three-layer model (no JSONC editing), and two walked-back paths (Python orchestration; embedding in python-copier-template). Surface before edits to `.devcontainer/claude-sandbox/{claude-shadow,install.sh,promote.sh}`, `install`, `tests/`, `.github/workflows/ci.yml`, or `.claude/commands/verify-sandbox.md`; or before any suggestion to re-introduce Python tooling, embed in python-copier-template, persist gh/glab PATs across containers, or auto-edit JSONC devcontainer.json. +--- + +# claude-sandbox + +Project-specific architecture decisions. The code documents *what*; +this skill documents *why* and *what regressions to refuse*. Threat +model: `README-CLAUDE.md`; live verification: `/verify-sandbox` +(`.claude/commands/verify-sandbox.md`). + +## Invariant 1 — plain `claude` MUST resolve to the shadow + +Anthropic's `curl install.sh` drops the real binary at +`~/.local/bin/claude` AND prepends `$HOME/.local/bin` to the user's +shell rc. After the next shell, `which claude` resolves past the +bwrap shadow at `/usr/local/bin/claude` → **sandbox escape via +plain `claude`**. + +`install_claude_binary` fixes this by relocating the real binary to +`/usr/libexec/claude-sandbox/claude` (off the user's PATH). The +shadow binds it back to `~/.local/bin/claude` *inside* the sandbox +so Claude's `installMethod=native` self-check still sees the +conventional path. + +**Refuse as regressions:** +- Any "simplification" that skips the relocate-after-curl step. +- Removing the unconditional bind-back of `~/.local/bin/claude` + inside the sandbox — the dest is created on the in-sandbox tmpfs + `$HOME`, so don't gate it on the host file existing. +- `tests/bwrap_argv.sh` scenarios 1 & 4a guard the bind pair; update + both if you change the bind. + +**Acceptable swap:** if Anthropic adds `--no-modify-path`, drop the +relocate — provided plain `claude` still cannot resolve past +`/usr/local/bin/claude`. + +## Invariant 2 — PATs are container-scoped; `just gh-auth` per rebuild is deliberate + +The re-paste-on-rebuild ceremony for `gh` / `glab` PATs is the cost +of keeping blast radius small: fine-grained PATs typically cover +multiple repos, so any path mounted across devcontainers would let +a compromised session reach every repo the PAT touches. + +`~/.claude` and `~/.claude.json` *are* cross-container (via +`link_terminal_config` symlinks) because they hold one Claude login, +not repo-scoped credentials. Don't conflate the two. + +**Refuse as regressions:** +- New persistent-credential mounts (volume, bind, anywhere) for + `gh` or `glab` PATs. +- Re-purposing the (currently deleted) `/cache` Docker volume for + tokens. Restoring `/cache` for *caches* is fine; for tokens, not. + +If a future request says "stop re-pasting the PAT" — surface this +tradeoff before implementing the shortcut. + +## Invariant 3 — bwrap on Ubuntu 24.04 GitHub runners needs three workarounds + +`ubuntu-latest` ships configured in ways that break bwrap. The +failure modes cascade in this order: + +1. **`setting up uid map: Permission denied`** — + `kernel.apparmor_restrict_unprivileged_userns=1` is the runner + default. Relax the sysctl and install an unconfined AppArmor + profile for `/usr/bin/bwrap`. +2. **`/run/secrets` doesn't exist** — sandbox does + `--tmpfs /run/secrets`; `sudo mkdir -p /run/secrets` first. +3. **`$GITHUB_WORKSPACE` lives under `$HOME=/home/runner`** — + path-positional checks that assert "$HOME contains only X" trip + on the workspace bind. `export HOME=/tmp/sandbox-home` before + the bwrap step (+ `mkdir -p "$HOME/.claude" "$HOME/.cache"`). + +All three are required, in order. `.github/workflows/ci.yml` applies +them — five push-and-iterate cycles to land this; don't re-discover. + +## Design principle — keep dogfood ≈ guest + +The repo's own devcontainer (dogfood) and a `git clone + ./install` +inside any other devcontainer (guest) should go through the same +setup path. Prefer `install.sh` over `devcontainer.json` / +`postCreate.sh` / `initializeCommand.sh` when a fix can live in +either — guest devcontainers then get it for free, and the audit +surface stays single-track. + +Sample: per-file binds for `/root/.claude{,.json}` were dropped once +`link_terminal_config` covered both paths uniformly; only the shared +`/user-terminal-config` bind remains in `devcontainer.json`. + +**Refuse as regressions:** dogfood-only `postCreate` / +`initializeCommand` work, or `devcontainer.json` mounts that could +have been done in `install.sh`. Ask "would this work for a +clone+install inside an unrelated devcontainer?" — if not, push it +into `install.sh`. + +## Design principle — `just promote` does three layers, never edits JSONC + +`just promote ` (PR #20, issue #18) makes a target workspace +a self-sufficient claude-sandbox host: + +1. **Curated `.claude/`** — commands, skills, hooks, statusline, + plus `wire_settings_{hook,statusline}` against the target's + `settings.json`. +2. **Install machinery** — `.devcontainer/claude-sandbox/{install.sh, + claude-shadow, promote.sh}` + root `justfile`. The justfile is + shipped verbatim, so its recipes must all be promote-target-safe; + source-repo-only recipes (`test`, `upgrade`, `verify`) were dropped + for this reason. The root `install` shim is *not* copied; it's the + source repo's manual-UX entry (`./install`), not a target workflow. +3. **`.devcontainer/postCreate.sh`** running + `bash .devcontainer/claude-sandbox/install.sh` (created if absent, + idempotently appended otherwise). Promote then prints a one-line + `postCreateCommand` snippet for the user to paste into + `devcontainer.json` — we do **not** edit it. + +**Refuse as a regression**: auto-editing `devcontainer.json`. It's +JSONC in the wild and comment-preserving structured edits need +either ~50 lines of awk (string/block-comment state-tracking) or a +node/python lib dependency — both rejected in PR #20. The user knows +whether they've wired the line or need to chain it. "Strip and +re-insert comments" isn't simpler either — re-insert needs stable +anchors that survive the edit. Print the snippet; trust them. + +**Two intentional don't-update edges in re-promote** — the only +gaps in the "re-promote = full sync" mental model: + +- `wire_settings_statusline` is *create-if-absent*. An existing + `.statusLine` (ours or the user's) is left alone. +- `wire_postcreate_script` only checks whether `bash install` is on + any line of `postCreate.sh`. The file body is never rewritten if + the file exists. + +Everything else propagates via `install_file`'s `cmp -s` +overwrite-on-diff. + +**Source-guard pattern**: `install.sh` ends with +`[ "${BASH_SOURCE[0]}" = "$0" ] && main "$@"` so `promote.sh` can +`source install.sh` to reuse `install_file` + `wire_settings_*` +without re-running `main`. Don't remove the guard. + +## Historical reversals — raise before re-treading + +Two paths walked back. If a change suggests either, surface the +history and re-justify against the underlying principle — **the +sandbox's surface must stay small enough to audit in one read** — +before proceeding. + +### Reversal 1 — Python orchestration + +Trajectory: `embedded bash → standalone bash → Python package + typer +CLI → bash-only` (commits `25e67ce`, `a35b8ee`, then `bf65407` +"feat: bash-only rewrite — drop Python package, self-contained +shadow", 2026-05-12, issue #14 / PR #15). + +The tool is fundamentally one bash function building a bwrap argv. +A ~110 KB Python package (pyproject, uv lockfile, pytest scaffolding, +37 unit tests, typer CLI) made the security-critical bits harder to +audit across multiple modules. Bash-only is ~80 lines shadow + ~80 +lines installer. + +**Refuse without justification:** +- "Let's add a small Python CLI for nicer error messages / config / + arg parsing." +- "Let's bring back pytest / uv / a `src/` package — it's only a + little code." +- Anything that re-introduces `pyproject.toml`, `uv.lock`, + `src/claude_sandbox/`, or `test_*.py`. + +Root `CLAUDE.md` says "Bash-only. No Python package, no uv, no +pytest — don't add them back." This skill explains the why. + +### Reversal 2 — extracted from python-copier-template + +The sandbox originally lived embedded in `python-copier-template` as +`.devcontainer/claude-sandbox.sh` (a single bash script using +`unshare -m` + tmpfs overlays). Extracted because: + +- A security tool needs **one canonical, audit-friendly home**, not + a templated copy in every project. +- The bwrap-based defences (`--cap-drop ALL`, `--clearenv` + allow-list, strict-under-`/root` inversion, `NO_NEW_PRIVS`, …) + replace the older `unshare -m` and would be awkward inside a + per-project template. +- A standalone repo gets a versioned release surface, its own CI, + and `/verify-sandbox` as a first-class command. + +`/workspaces/python-copier-template/.devcontainer/claude-sandbox.sh` +exists as prior art but is **not** maintained. + +**Refuse without justification:** +- Adding a `template/` directory or `copier.yml`. +- "Let's keep a copy synced into python-copier-template" — the + template should *consume* this repo, not embed it. + +## Diagnostic discipline — silent in-sandbox check failures + +When a check inside the sandbox fails silently (subprocess swallows +stdout/stderr), inject a debug `INNER` step that runs the same body +verbatim and prints its output *before* exec'ing the real verifier. +The original Check 03 silent failure was unsolvable until we printed +`extras` directly — `--bind-try /dev/null` masks themselves create +entries under `$HOME` (the spec hadn't whitelisted them). One +`printf` beats hours of guessing from outside. + +## Diagnostic discipline — bind-mount vs runtime tmpfs write + +When unexpected entries appear inside the sandbox (typically under +`$HOME` or `$HOME/.config`), **first determine whether they're a +host bind-mount leak or a sandboxed-process tmpfs write**. The +remediations are completely different. + +```bash +# Bind from the host? +grep " /root/.config/ " /proc/self/mountinfo +# stat -c '%D' compares device IDs — tmpfs entries share /root's dev. +stat -c '%n: dev=%D inode=%i' /root /root/.config/ +``` + +No mountinfo entry + same `dev` as `/root` → tmpfs write by +sandboxed code (a feature self-registering). Fix upstream by +disabling the feature, not by widening the allow-list. Mountinfo +entry → genuine inversion leak; tighten the bwrap argv. + +Concrete miss (2026-05): Chrome `NativeMessagingHosts` dirs under +`~/.config/` — initially flagged as a bind leak; mountinfo showed +no bind. It was Claude Code's startup write registering the browser +extension. Fix: `--no-chrome` injection in the shadow, check 03 +stayed strict. + +## Where things live + +| Concern | File | +|-------------------------------|-----------------------------------------------------| +| bwrap argv construction | `.devcontainer/claude-sandbox/claude-shadow` | +| Installer (relocate + wire) | `.devcontainer/claude-sandbox/install.sh` | +| Promote orchestrator | `.devcontainer/claude-sandbox/promote.sh` | +| Root-shim installer entry | `install` | +| bwrap argv unit tests | `tests/bwrap_argv.sh` | +| End-to-end install smoke test | `tests/smoke.sh` | +| Promote smoke test | `tests/promote.sh` | +| CI workflow | `.github/workflows/ci.yml` | +| Live verification spec | `.claude/commands/verify-sandbox.md` | +| Pre-prompt gate hook | `.claude/hooks/sandbox-check.sh` | +| Threat model + binds rationale| `README-CLAUDE.md` | +| Recipes (promote, gh-auth, …) | `justfile` (shipped verbatim by `just promote`) | + +Touching any of these → re-read this skill first. diff --git a/.claude/skills/diagnose/SKILL.md b/.claude/skills/diagnose/SKILL.md new file mode 100755 index 00000000..ed55bda2 --- /dev/null +++ b/.claude/skills/diagnose/SKILL.md @@ -0,0 +1,117 @@ +--- +name: diagnose +description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression. +--- + +# Diagnose + +A discipline for hard bugs. Skip phases only when explicitly justified. + +When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching. + +## Phase 1 — Build a feedback loop + +**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you. + +Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.** + +### Ways to construct one — try them in roughly this order + +1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. +2. **Curl / HTTP script** against a running dev server. +3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. +4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network. +5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation. +6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call. +7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode. +8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it. +9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. +10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you. + +Build the right feedback loop, and the bug is 90% fixed. + +### Iterate on the loop itself + +Treat the loop as a product. Once you have _a_ loop, ask: + +- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.) +- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".) +- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.) + +A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower. + +### Non-deterministic bugs + +The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable. + +### When you genuinely cannot build a loop + +Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop. + +Do not proceed to Phase 2 until you have a loop you believe in. + +## Phase 2 — Reproduce + +Run the loop. Watch the bug appear. + +Confirm: + +- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix. +- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against). +- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it. + +Do not proceed until you reproduce the bug. + +## Phase 3 — Hypothesise + +Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea. + +Each hypothesis must be **falsifiable**: state the prediction it makes. + +> Format: "If is the cause, then will make the bug disappear / will make it worse." + +If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it. + +**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK. + +## Phase 4 — Instrument + +Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.** + +Tool preference: + +1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs. +2. **Targeted logs** at the boundaries that distinguish hypotheses. +3. Never "log everything and grep". + +**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die. + +**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second. + +## Phase 5 — Fix + regression test + +Write the regression test **before the fix** — but only if there is a **correct seam** for it. + +A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence. + +**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase. + +If a correct seam exists: + +1. Turn the minimised repro into a failing test at that seam. +2. Watch it fail. +3. Apply the fix. +4. Watch it pass. +5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario. + +## Phase 6 — Cleanup + post-mortem + +Required before declaring done: + +- [ ] Original repro no longer reproduces (re-run the Phase 1 loop) +- [ ] Regression test passes (or absence of seam is documented) +- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix) +- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location) +- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns + +**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started. diff --git a/.claude/skills/grill-with-docs/ADR-FORMAT.md b/.claude/skills/grill-with-docs/ADR-FORMAT.md new file mode 100755 index 00000000..da7e78ec --- /dev/null +++ b/.claude/skills/grill-with-docs/ADR-FORMAT.md @@ -0,0 +1,47 @@ +# ADR Format + +ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc. + +Create the `docs/adr/` directory lazily — only when the first ADR is needed. + +## Template + +```md +# {Short title of the decision} + +{1-3 sentences: what's the context, what did we decide, and why.} +``` + +That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections. + +## Optional sections + +Only include these when they add genuine value. Most ADRs won't need them. + +- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited +- **Considered Options** — only when the rejected alternatives are worth remembering +- **Consequences** — only when non-obvious downstream effects need to be called out + +## Numbering + +Scan `docs/adr/` for the highest existing number and increment by one. + +## When to offer an ADR + +All three of these must be true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing." + +### What qualifies + +- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres." +- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP." +- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out. +- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit no-s are as valuable as the yes-s. +- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate. +- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract." +- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months. diff --git a/.claude/skills/grill-with-docs/CONTEXT-FORMAT.md b/.claude/skills/grill-with-docs/CONTEXT-FORMAT.md new file mode 100755 index 00000000..ddfa247c --- /dev/null +++ b/.claude/skills/grill-with-docs/CONTEXT-FORMAT.md @@ -0,0 +1,77 @@ +# CONTEXT.md Format + +## Structure + +```md +# {Context Name} + +{One or two sentence description of what this context is and why it exists.} + +## Language + +**Order**: +{A concise description of the term} +_Avoid_: Purchase, transaction + +**Invoice**: +A request for payment sent to a customer after delivery. +_Avoid_: Bill, payment request + +**Customer**: +A person or organization that places orders. +_Avoid_: Client, buyer, account + +## Relationships + +- An **Order** produces one or more **Invoices** +- An **Invoice** belongs to exactly one **Customer** + +## Example dialogue + +> **Dev:** "When a **Customer** places an **Order**, do we create the **Invoice** immediately?" +> **Domain expert:** "No — an **Invoice** is only generated once a **Fulfillment** is confirmed." + +## Flagged ambiguities + +- "account" was used to mean both **Customer** and **User** — resolved: these are distinct concepts. +``` + +## Rules + +- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others as aliases to avoid. +- **Flag conflicts explicitly.** If a term is used ambiguously, call it out in "Flagged ambiguities" with a clear resolution. +- **Keep definitions tight.** One sentence max. Define what it IS, not what it does. +- **Show relationships.** Use bold term names and express cardinality where obvious. +- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs. +- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine. +- **Write an example dialogue.** A conversation between a dev and a domain expert that demonstrates how the terms interact naturally and clarifies boundaries between related concepts. + +## Single vs multi-context repos + +**Single context (most repos):** One `CONTEXT.md` at the repo root. + +**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other: + +```md +# Context Map + +## Contexts + +- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders +- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments +- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping + +## Relationships + +- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking +- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices +- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money` +``` + +The skill infers which structure applies: + +- If `CONTEXT-MAP.md` exists, read it to find contexts +- If only a root `CONTEXT.md` exists, single context +- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved + +When multiple contexts exist, infer which one the current topic relates to. If unclear, ask. diff --git a/.claude/skills/grill-with-docs/SKILL.md b/.claude/skills/grill-with-docs/SKILL.md new file mode 100755 index 00000000..6dad6ad7 --- /dev/null +++ b/.claude/skills/grill-with-docs/SKILL.md @@ -0,0 +1,88 @@ +--- +name: grill-with-docs +description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions. +--- + + + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. + +Ask the questions one at a time, waiting for feedback on each question before continuing. + +If a question can be answered by exploring the codebase, explore the codebase instead. + + + + + +## Domain awareness + +During codebase exploration, also look for existing documentation: + +### File structure + +Most repos have a single context: + +``` +/ +├── CONTEXT.md +├── docs/ +│ └── adr/ +│ ├── 0001-event-sourced-orders.md +│ └── 0002-postgres-for-write-model.md +└── src/ +``` + +If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives: + +``` +/ +├── CONTEXT-MAP.md +├── docs/ +│ └── adr/ ← system-wide decisions +├── src/ +│ ├── ordering/ +│ │ ├── CONTEXT.md +│ │ └── docs/adr/ ← context-specific decisions +│ └── billing/ +│ ├── CONTEXT.md +│ └── docs/adr/ +``` + +Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed. + +## During the session + +### Challenge against the glossary + +When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?" + +### Sharpen fuzzy language + +When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things." + +### Discuss concrete scenarios + +When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts. + +### Cross-reference with code + +When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?" + +### Update CONTEXT.md inline + +When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md). + +Don't couple `CONTEXT.md` to implementation details. Only include terms that are meaningful to domain experts. + +### Offer ADRs sparingly + +Only offer to create an ADR when all three are true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will wonder "why did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md). + + diff --git a/.claude/skills/improve-codebase-architecture/DEEPENING.md b/.claude/skills/improve-codebase-architecture/DEEPENING.md new file mode 100755 index 00000000..ecaf5d7d --- /dev/null +++ b/.claude/skills/improve-codebase-architecture/DEEPENING.md @@ -0,0 +1,37 @@ +# Deepening + +How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**. + +## Dependency categories + +When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam. + +### 1. In-process + +Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed. + +### 2. Local-substitutable + +Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface. + +### 3. Remote but owned (Ports & Adapters) + +Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter. + +Recommendation shape: *"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."* + +### 4. True external (Mock) + +Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter. + +## Seam discipline + +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection. +- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them. + +## Testing strategy: replace, don't layer + +- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them. +- Write new tests at the deepened module's interface. The **interface is the test surface**. +- Tests assert on observable outcomes through the interface, not internal state. +- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface. diff --git a/.claude/skills/improve-codebase-architecture/INTERFACE-DESIGN.md b/.claude/skills/improve-codebase-architecture/INTERFACE-DESIGN.md new file mode 100755 index 00000000..3197723a --- /dev/null +++ b/.claude/skills/improve-codebase-architecture/INTERFACE-DESIGN.md @@ -0,0 +1,44 @@ +# Interface Design + +When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best. + +Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**. + +## Process + +### 1. Frame the problem space + +Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate: + +- The constraints any new interface would need to satisfy +- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md)) +- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete + +Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel. + +### 2. Spawn sub-agents + +Spawn 3+ sub-agents in parallel using the Agent tool. Each must produce a **radically different** interface for the deepened module. + +Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint: + +- Agent 1: "Minimize the interface — aim for 1–3 entry points max. Maximise leverage per entry point." +- Agent 2: "Maximise flexibility — support many use cases and extension." +- Agent 3: "Optimise for the most common caller — make the default case trivial." +- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies." + +Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and CONTEXT.md vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language. + +Each sub-agent outputs: + +1. Interface (types, methods, params — plus invariants, ordering, error modes) +2. Usage example showing how callers use it +3. What the implementation hides behind the seam +4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md)) +5. Trade-offs — where leverage is high, where it's thin + +### 3. Present and compare + +Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**. + +After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu. diff --git a/.claude/skills/improve-codebase-architecture/LANGUAGE.md b/.claude/skills/improve-codebase-architecture/LANGUAGE.md new file mode 100755 index 00000000..530c2763 --- /dev/null +++ b/.claude/skills/improve-codebase-architecture/LANGUAGE.md @@ -0,0 +1,53 @@ +# Language + +Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point. + +## Terms + +**Module** +Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice. +_Avoid_: unit, component, service. + +**Interface** +Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics. +_Avoid_: API, signature (too narrow — those refer only to the type-level surface). + +**Implementation** +What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise. + +**Depth** +Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation. + +**Seam** _(from Michael Feathers)_ +A place where you can alter behaviour without editing in that place. The *location* at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it. +_Avoid_: boundary (overloaded with DDD's bounded context). + +**Adapter** +A concrete thing that satisfies an interface at a seam. Describes *role* (what slot it fills), not substance (what's inside). + +**Leverage** +What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests. + +**Locality** +What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere. + +## Principles + +- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface. +- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep. +- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape. +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it. + +## Relationships + +- A **Module** has exactly one **Interface** (the surface it presents to callers and tests). +- **Depth** is a property of a **Module**, measured against its **Interface**. +- A **Seam** is where a **Module**'s **Interface** lives. +- An **Adapter** sits at a **Seam** and satisfies the **Interface**. +- **Depth** produces **Leverage** for callers and **Locality** for maintainers. + +## Rejected framings + +- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead. +- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know. +- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**. diff --git a/.claude/skills/improve-codebase-architecture/SKILL.md b/.claude/skills/improve-codebase-architecture/SKILL.md new file mode 100755 index 00000000..05984a60 --- /dev/null +++ b/.claude/skills/improve-codebase-architecture/SKILL.md @@ -0,0 +1,71 @@ +--- +name: improve-codebase-architecture +description: Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable. +--- + +# Improve Codebase Architecture + +Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability. + +## Glossary + +Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md). + +- **Module** — anything with an interface and an implementation (function, class, package, slice). +- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature. +- **Implementation** — the code inside. +- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation. +- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.") +- **Adapter** — a concrete thing satisfying an interface at a seam. +- **Leverage** — what callers get from depth. +- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place. + +Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list): + +- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep. +- **The interface is the test surface.** +- **One adapter = hypothetical seam. Two adapters = real seam.** + +This skill is _informed_ by the project's domain model. The domain language gives names to good seams; ADRs record decisions the skill should not re-litigate. + +## Process + +### 1. Explore + +Read the project's domain glossary and any ADRs in the area you're touching first. + +Then use the Agent tool with `subagent_type=Explore` to walk the codebase. Don't follow rigid heuristics — explore organically and note where you experience friction: + +- Where does understanding one concept require bouncing between many small modules? +- Where are modules **shallow** — interface nearly as complex as the implementation? +- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)? +- Where do tightly-coupled modules leak across their seams? +- Which parts of the codebase are untested, or hard to test through their current interface? + +Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want. + +### 2. Present candidates + +Present a numbered list of deepening opportunities. For each candidate: + +- **Files** — which files/modules are involved +- **Problem** — why the current architecture is causing friction +- **Solution** — plain English description of what would change +- **Benefits** — explained in terms of locality and leverage, and also in how tests would improve + +**Use CONTEXT.md vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If `CONTEXT.md` defines "Order," talk about "the Order intake module" — not "the FooBarHandler," and not "the Order service." + +**ADR conflicts**: if a candidate contradicts an existing ADR, only surface it when the friction is real enough to warrant revisiting the ADR. Mark it clearly (e.g. _"contradicts ADR-0007 — but worth reopening because…"_). Don't list every theoretical refactor an ADR forbids. + +Do NOT propose interfaces yet. Ask the user: "Which of these would you like to explore?" + +### 3. Grilling loop + +Once the user picks a candidate, drop into a grilling conversation. Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive. + +Side effects happen inline as decisions crystallize: + +- **Naming a deepened module after a concept not in `CONTEXT.md`?** Add the term to `CONTEXT.md` — same discipline as `/grill-with-docs` (see [CONTEXT-FORMAT.md](../grill-with-docs/CONTEXT-FORMAT.md)). Create the file lazily if it doesn't exist. +- **Sharpening a fuzzy term during the conversation?** Update `CONTEXT.md` right there. +- **User rejects the candidate with a load-bearing reason?** Offer an ADR, framed as: _"Want me to record this as an ADR so future architecture reviews don't re-suggest it?"_ Only offer when the reason would actually be needed by a future explorer to avoid re-suggesting the same thing — skip ephemeral reasons ("not worth it right now") and self-evident ones. See [ADR-FORMAT.md](../grill-with-docs/ADR-FORMAT.md). +- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md). diff --git a/.claude/skills/tdd/SKILL.md b/.claude/skills/tdd/SKILL.md new file mode 100755 index 00000000..6014e649 --- /dev/null +++ b/.claude/skills/tdd/SKILL.md @@ -0,0 +1,132 @@ +--- +name: tdd +description: Test-driven development with red-green-refactor loop. Use when user wants to build features or fix bugs using TDD, mentions "red-green-refactor", wants integration tests, or asks for test-first development. +--- + +# Test-Driven Development + +## Philosophy + +**Core principle**: Tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't. + +**Good tests** are integration-style: they exercise real code paths through public APIs. They describe _what_ the system does, not _how_ it does it. A good test reads like a specification - "user can checkout with valid cart" tells you exactly what capability exists. These tests survive refactors because they don't care about internal structure. + +**Bad tests** are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means (like querying a database directly instead of using the interface). The warning sign: your test breaks when you refactor, but behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior. + +See [tests.md](tests.md) for examples and [mocking.md](mocking.md) for mocking guidelines. + +## Anti-Pattern: Horizontal Slices + +**DO NOT write all tests first, then all implementation.** This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code." + +This produces **crap tests**: + +- Tests written in bulk test _imagined_ behavior, not _actual_ behavior +- You end up testing the _shape_ of things (data structures, function signatures) rather than user-facing behavior +- Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine +- You outrun your headlights, committing to test structure before understanding the implementation + +**Correct approach**: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it. + +``` +WRONG (horizontal): + RED: test1, test2, test3, test4, test5 + GREEN: impl1, impl2, impl3, impl4, impl5 + +RIGHT (vertical): + RED→GREEN: test1→impl1 + RED→GREEN: test2→impl2 + RED→GREEN: test3→impl3 + ... +``` + +## Workflow + +### 1. Planning + +When exploring the codebase, use the project's domain glossary so that test names and interface vocabulary match the project's language, and respect ADRs in the area you're touching. + +Before writing any code: + +- [ ] Confirm with user what interface changes are needed +- [ ] Confirm with user which behaviors to test (prioritize) +- [ ] Identify opportunities for [deep modules](deep-modules.md) (small interface, deep implementation) +- [ ] Design interfaces for [testability](interface-design.md) +- [ ] List the behaviors to test (not implementation steps) +- [ ] Get user approval on the plan + +Ask: "What should the public interface look like? Which behaviors are most important to test?" + +**You can't test everything.** Confirm with the user exactly which behaviors matter most. Focus testing effort on critical paths and complex logic, not every possible edge case. + +### 2. Tracer Bullet + +Write ONE test that confirms ONE thing about the system: + +``` +RED: Write test for first behavior → test fails +GREEN: Write minimal code to pass → test passes +``` + +This is your tracer bullet - proves the path works end-to-end. + +### 3. Incremental Loop + +For each remaining behavior: + +``` +RED: Write next test → fails +GREEN: Minimal code to pass → passes +``` + +Rules: + +- One test at a time +- Only enough code to pass current test +- Don't anticipate future tests +- Keep tests focused on observable behavior + +### 4. Commit Per Slice + +**Land each slice as its own commit, the moment it goes GREEN.** A slice is the smallest self-contained, shippable unit — usually one RED→GREEN cycle, sometimes a tight cluster of cycles that together form a single coherent capability (e.g., a pure utility plus its handful of example tests). + +Why per-slice commits matter: + +- **Reviewability**: small commits read like a story. A reviewer can follow the build-up cycle by cycle instead of unpicking a single mega-diff. +- **Bisectability**: each commit is green, so `git bisect` can pinpoint regressions to a slice. +- **Reversibility**: a bad slice is one `git revert` away — no manual untangling. +- **Forced honesty**: if a slice can't stand on its own as a commit, that's a signal it isn't really a vertical slice yet. + +When you outline the plan in step 1, the slices in that plan are the commits. Confirm the slice list with the user up front; don't accumulate multiple slices in the working tree before committing. + +Rules: + +- Commit immediately after a slice goes GREEN. Don't defer commits to "the end." +- Each commit's diff should match exactly one item in the plan. If a single file's hunks belong to two different slices, untangle them — temporarily revert one part, commit, then re-apply. +- The commit message names the slice and the user-visible behaviour it adds, not the cycle number. +- Existing tests stay green at every commit; new tests added in this slice go in the same commit as the impl that makes them pass. + +If you find yourself five slices deep with no commits, stop and split before continuing — a single `git reset --soft` plus per-slice `git add` is faster than letting the divergence grow. + +### 5. Refactor + +After all tests pass, look for [refactor candidates](refactoring.md): + +- [ ] Extract duplication +- [ ] Deepen modules (move complexity behind simple interfaces) +- [ ] Apply SOLID principles where natural +- [ ] Consider what new code reveals about existing code +- [ ] Run tests after each refactor step + +**Never refactor while RED.** Get to GREEN first. Commit refactors separately from the slices that motivated them. + +## Checklist Per Cycle + +``` +[ ] Test describes behavior, not implementation +[ ] Test uses public interface only +[ ] Test would survive internal refactor +[ ] Code is minimal for this test +[ ] No speculative features added +[ ] Slice committed as soon as it went GREEN +``` diff --git a/.claude/skills/tdd/deep-modules.md b/.claude/skills/tdd/deep-modules.md new file mode 100755 index 00000000..0d9720cf --- /dev/null +++ b/.claude/skills/tdd/deep-modules.md @@ -0,0 +1,33 @@ +# Deep Modules + +From "A Philosophy of Software Design": + +**Deep module** = small interface + lots of implementation + +``` +┌─────────────────────┐ +│ Small Interface │ ← Few methods, simple params +├─────────────────────┤ +│ │ +│ │ +│ Deep Implementation│ ← Complex logic hidden +│ │ +│ │ +└─────────────────────┘ +``` + +**Shallow module** = large interface + little implementation (avoid) + +``` +┌─────────────────────────────────┐ +│ Large Interface │ ← Many methods, complex params +├─────────────────────────────────┤ +│ Thin Implementation │ ← Just passes through +└─────────────────────────────────┘ +``` + +When designing interfaces, ask: + +- Can I reduce the number of methods? +- Can I simplify the parameters? +- Can I hide more complexity inside? diff --git a/.claude/skills/tdd/interface-design.md b/.claude/skills/tdd/interface-design.md new file mode 100755 index 00000000..a0a20ca4 --- /dev/null +++ b/.claude/skills/tdd/interface-design.md @@ -0,0 +1,31 @@ +# Interface Design for Testability + +Good interfaces make testing natural: + +1. **Accept dependencies, don't create them** + + ```typescript + // Testable + function processOrder(order, paymentGateway) {} + + // Hard to test + function processOrder(order) { + const gateway = new StripeGateway(); + } + ``` + +2. **Return results, don't produce side effects** + + ```typescript + // Testable + function calculateDiscount(cart): Discount {} + + // Hard to test + function applyDiscount(cart): void { + cart.total -= discount; + } + ``` + +3. **Small surface area** + - Fewer methods = fewer tests needed + - Fewer params = simpler test setup diff --git a/.claude/skills/tdd/mocking.md b/.claude/skills/tdd/mocking.md new file mode 100755 index 00000000..71cbfee6 --- /dev/null +++ b/.claude/skills/tdd/mocking.md @@ -0,0 +1,59 @@ +# When to Mock + +Mock at **system boundaries** only: + +- External APIs (payment, email, etc.) +- Databases (sometimes - prefer test DB) +- Time/randomness +- File system (sometimes) + +Don't mock: + +- Your own classes/modules +- Internal collaborators +- Anything you control + +## Designing for Mockability + +At system boundaries, design interfaces that are easy to mock: + +**1. Use dependency injection** + +Pass external dependencies in rather than creating them internally: + +```typescript +// Easy to mock +function processPayment(order, paymentClient) { + return paymentClient.charge(order.total); +} + +// Hard to mock +function processPayment(order) { + const client = new StripeClient(process.env.STRIPE_KEY); + return client.charge(order.total); +} +``` + +**2. Prefer SDK-style interfaces over generic fetchers** + +Create specific functions for each external operation instead of one generic function with conditional logic: + +```typescript +// GOOD: Each function is independently mockable +const api = { + getUser: (id) => fetch(`/users/${id}`), + getOrders: (userId) => fetch(`/users/${userId}/orders`), + createOrder: (data) => fetch('/orders', { method: 'POST', body: data }), +}; + +// BAD: Mocking requires conditional logic inside the mock +const api = { + fetch: (endpoint, options) => fetch(endpoint, options), +}; +``` + +The SDK approach means: +- Each mock returns one specific shape +- No conditional logic in test setup +- Easier to see which endpoints a test exercises +- Type safety per endpoint diff --git a/.claude/skills/tdd/refactoring.md b/.claude/skills/tdd/refactoring.md new file mode 100755 index 00000000..8a444392 --- /dev/null +++ b/.claude/skills/tdd/refactoring.md @@ -0,0 +1,10 @@ +# Refactor Candidates + +After TDD cycle, look for: + +- **Duplication** → Extract function/class +- **Long methods** → Break into private helpers (keep tests on public interface) +- **Shallow modules** → Combine or deepen +- **Feature envy** → Move logic to where data lives +- **Primitive obsession** → Introduce value objects +- **Existing code** the new code reveals as problematic diff --git a/.claude/skills/tdd/tests.md b/.claude/skills/tdd/tests.md new file mode 100755 index 00000000..ff22f809 --- /dev/null +++ b/.claude/skills/tdd/tests.md @@ -0,0 +1,61 @@ +# Good and Bad Tests + +## Good Tests + +**Integration-style**: Test through real interfaces, not mocks of internal parts. + +```typescript +// GOOD: Tests observable behavior +test("user can checkout with valid cart", async () => { + const cart = createCart(); + cart.add(product); + const result = await checkout(cart, paymentMethod); + expect(result.status).toBe("confirmed"); +}); +``` + +Characteristics: + +- Tests behavior users/callers care about +- Uses public API only +- Survives internal refactors +- Describes WHAT, not HOW +- One logical assertion per test + +## Bad Tests + +**Implementation-detail tests**: Coupled to internal structure. + +```typescript +// BAD: Tests implementation details +test("checkout calls paymentService.process", async () => { + const mockPayment = jest.mock(paymentService); + await checkout(cart, payment); + expect(mockPayment.process).toHaveBeenCalledWith(cart.total); +}); +``` + +Red flags: + +- Mocking internal collaborators +- Testing private methods +- Asserting on call counts/order +- Test breaks when refactoring without behavior change +- Test name describes HOW not WHAT +- Verifying through external means instead of interface + +```typescript +// BAD: Bypasses interface to verify +test("createUser saves to database", async () => { + await createUser({ name: "Alice" }); + const row = await db.query("SELECT * FROM users WHERE name = ?", ["Alice"]); + expect(row).toBeDefined(); +}); + +// GOOD: Verifies through interface +test("createUser makes user retrievable", async () => { + const user = await createUser({ name: "Alice" }); + const retrieved = await getUser(user.id); + expect(retrieved.name).toBe("Alice"); +}); +``` diff --git a/.claude/skills/triage/AGENT-BRIEF.md b/.claude/skills/triage/AGENT-BRIEF.md new file mode 100755 index 00000000..2efecdfe --- /dev/null +++ b/.claude/skills/triage/AGENT-BRIEF.md @@ -0,0 +1,168 @@ +# Writing Agent Briefs + +An agent brief is a structured comment posted on a GitHub issue when it moves to `ready-for-agent`. It is the authoritative specification that an AFK agent will work from. The original issue body and discussion are context — the agent brief is the contract. + +## Principles + +### Durability over precision + +The issue may sit in `ready-for-agent` for days or weeks. The codebase will change in the meantime. Write the brief so it stays useful even as files are renamed, moved, or refactored. + +- **Do** describe interfaces, types, and behavioral contracts +- **Do** name specific types, function signatures, or config shapes that the agent should look for or modify +- **Don't** reference file paths — they go stale +- **Don't** reference line numbers +- **Don't** assume the current implementation structure will remain the same + +### Behavioral, not procedural + +Describe **what** the system should do, not **how** to implement it. The agent will explore the codebase fresh and make its own implementation decisions. + +- **Good:** "The `SkillConfig` type should accept an optional `schedule` field of type `CronExpression`" +- **Bad:** "Open src/types/skill.ts and add a schedule field on line 42" +- **Good:** "When a user runs `/triage` with no arguments, they should see a summary of issues needing attention" +- **Bad:** "Add a switch statement in the main handler function" + +### Complete acceptance criteria + +The agent needs to know when it's done. Every agent brief must have concrete, testable acceptance criteria. Each criterion should be independently verifiable. + +- **Good:** "Running `gh issue list --label needs-triage` returns issues that have been through initial classification" +- **Bad:** "Triage should work correctly" + +### Explicit scope boundaries + +State what is out of scope. This prevents the agent from gold-plating or making assumptions about adjacent features. + +## Template + +```markdown +## Agent Brief + +**Category:** bug / enhancement +**Summary:** one-line description of what needs to happen + +**Current behavior:** +Describe what happens now. For bugs, this is the broken behavior. +For enhancements, this is the status quo the feature builds on. + +**Desired behavior:** +Describe what should happen after the agent's work is complete. +Be specific about edge cases and error conditions. + +**Key interfaces:** +- `TypeName` — what needs to change and why +- `functionName()` return type — what it currently returns vs what it should return +- Config shape — any new configuration options needed + +**Acceptance criteria:** +- [ ] Specific, testable criterion 1 +- [ ] Specific, testable criterion 2 +- [ ] Specific, testable criterion 3 + +**Out of scope:** +- Thing that should NOT be changed or addressed in this issue +- Adjacent feature that might seem related but is separate +``` + +## Examples + +### Good agent brief (bug) + +```markdown +## Agent Brief + +**Category:** bug +**Summary:** Skill description truncation drops mid-word, producing broken output + +**Current behavior:** +When a skill description exceeds 1024 characters, it is truncated at exactly +1024 characters regardless of word boundaries. This produces descriptions +that end mid-word (e.g. "Use when the user wants to confi"). + +**Desired behavior:** +Truncation should break at the last word boundary before 1024 characters +and append "..." to indicate truncation. + +**Key interfaces:** +- The `SkillMetadata` type's `description` field — no type change needed, + but the validation/processing logic that populates it needs to respect + word boundaries +- Any function that reads SKILL.md frontmatter and extracts the description + +**Acceptance criteria:** +- [ ] Descriptions under 1024 chars are unchanged +- [ ] Descriptions over 1024 chars are truncated at the last word boundary + before 1024 chars +- [ ] Truncated descriptions end with "..." +- [ ] The total length including "..." does not exceed 1024 chars + +**Out of scope:** +- Changing the 1024 char limit itself +- Multi-line description support +``` + +### Good agent brief (enhancement) + +```markdown +## Agent Brief + +**Category:** enhancement +**Summary:** Add `.out-of-scope/` directory support for tracking rejected feature requests + +**Current behavior:** +When a feature request is rejected, the issue is closed with a `wontfix` label +and a comment. There is no persistent record of the decision or reasoning. +Future similar requests require the maintainer to recall or search for the +prior discussion. + +**Desired behavior:** +Rejected feature requests should be documented in `.out-of-scope/.md` +files that capture the decision, reasoning, and links to all issues that +requested the feature. When triaging new issues, these files should be +checked for matches. + +**Key interfaces:** +- Markdown file format in `.out-of-scope/` — each file should have a + `# Concept Name` heading, a `**Decision:**` line, a `**Reason:**` line, + and a `**Prior requests:**` list with issue links +- The triage workflow should read all `.out-of-scope/*.md` files early + and match incoming issues against them by concept similarity + +**Acceptance criteria:** +- [ ] Closing a feature as wontfix creates/updates a file in `.out-of-scope/` +- [ ] The file includes the decision, reasoning, and link to the closed issue +- [ ] If a matching `.out-of-scope/` file already exists, the new issue is + appended to its "Prior requests" list rather than creating a duplicate +- [ ] During triage, existing `.out-of-scope/` files are checked and surfaced + when a new issue matches a prior rejection + +**Out of scope:** +- Automated matching (human confirms the match) +- Reopening previously rejected features +- Bug reports (only enhancement rejections go to `.out-of-scope/`) +``` + +### Bad agent brief + +```markdown +## Agent Brief + +**Summary:** Fix the triage bug + +**What to do:** +The triage thing is broken. Look at the main file and fix it. +The function around line 150 has the issue. + +**Files to change:** +- src/triage/handler.ts (line 150) +- src/types.ts (line 42) +``` + +This is bad because: +- No category +- Vague description ("the triage thing is broken") +- References file paths and line numbers that will go stale +- No acceptance criteria +- No scope boundaries +- No description of current vs desired behavior diff --git a/.claude/skills/triage/OUT-OF-SCOPE.md b/.claude/skills/triage/OUT-OF-SCOPE.md new file mode 100755 index 00000000..cc8ea257 --- /dev/null +++ b/.claude/skills/triage/OUT-OF-SCOPE.md @@ -0,0 +1,101 @@ +# Out-of-Scope Knowledge Base + +The `.out-of-scope/` directory in a repo stores persistent records of rejected feature requests. It serves two purposes: + +1. **Institutional memory** — why a feature was rejected, so the reasoning isn't lost when the issue is closed +2. **Deduplication** — when a new issue comes in that matches a prior rejection, the skill can surface the previous decision instead of re-litigating it + +## Directory structure + +``` +.out-of-scope/ +├── dark-mode.md +├── plugin-system.md +└── graphql-api.md +``` + +One file per **concept**, not per issue. Multiple issues requesting the same thing are grouped under one file. + +## File format + +The file should be written in a relaxed, readable style — more like a short design document than a database entry. Use paragraphs, code samples, and examples to make the reasoning clear and useful to someone encountering it for the first time. + +```markdown +# Dark Mode + +This project does not support dark mode or user-facing theming. + +## Why this is out of scope + +The rendering pipeline assumes a single color palette defined in +`ThemeConfig`. Supporting multiple themes would require: + +- A theme context provider wrapping the entire component tree +- Per-component theme-aware style resolution +- A persistence layer for user theme preferences + +This is a significant architectural change that doesn't align with the +project's focus on content authoring. Theming is a concern for downstream +consumers who embed or redistribute the output. + +```ts +// The current ThemeConfig interface is not designed for runtime switching: +interface ThemeConfig { + colors: ColorPalette; // single palette, resolved at build time + fonts: FontStack; +} +``` + +## Prior requests + +- #42 — "Add dark mode support" +- #87 — "Night theme for accessibility" +- #134 — "Dark theme option" +``` + +### Naming the file + +Use a short, descriptive kebab-case name for the concept: `dark-mode.md`, `plugin-system.md`, `graphql-api.md`. The name should be recognizable enough that someone browsing the directory understands what was rejected without opening the file. + +### Writing the reason + +The reason should be substantive — not "we don't want this" but why. Good reasons reference: + +- Project scope or philosophy ("This project focuses on X; theming is a downstream concern") +- Technical constraints ("Supporting this would require Y, which conflicts with our Z architecture") +- Strategic decisions ("We chose to use A instead of B because...") + +The reason should be durable. Avoid referencing temporary circumstances ("we're too busy right now") — those aren't real rejections, they're deferrals. + +## When to check `.out-of-scope/` + +During triage (Step 1: Gather context), read all files in `.out-of-scope/`. When evaluating a new issue: + +- Check if the request matches an existing out-of-scope concept +- Matching is by concept similarity, not keyword — "night theme" matches `dark-mode.md` +- If there's a match, surface it to the maintainer: "This is similar to `.out-of-scope/dark-mode.md` — we rejected this before because [reason]. Do you still feel the same way?" + +The maintainer may: + +- **Confirm** — the new issue gets added to the existing file's "Prior requests" list, then closed +- **Reconsider** — the out-of-scope file gets deleted or updated, and the issue proceeds through normal triage +- **Disagree** — the issues are related but distinct, proceed with normal triage + +## When to write to `.out-of-scope/` + +Only when an **enhancement** (not a bug) is rejected as `wontfix`. The flow: + +1. Maintainer decides a feature request is out of scope +2. Check if a matching `.out-of-scope/` file already exists +3. If yes: append the new issue to the "Prior requests" list +4. If no: create a new file with the concept name, decision, reason, and first prior request +5. Post a comment on the issue explaining the decision and mentioning the `.out-of-scope/` file +6. Close the issue with the `wontfix` label + +## Updating or removing out-of-scope files + +If the maintainer changes their mind about a previously rejected concept: + +- Delete the `.out-of-scope/` file +- The skill does not need to reopen old issues — they're historical records +- The new issue that triggered the reconsideration proceeds through normal triage diff --git a/.claude/skills/triage/SKILL.md b/.claude/skills/triage/SKILL.md new file mode 100755 index 00000000..3dee68f9 --- /dev/null +++ b/.claude/skills/triage/SKILL.md @@ -0,0 +1,103 @@ +--- +name: triage +description: Triage issues through a state machine driven by triage roles. Use when user wants to create an issue, triage issues, review incoming bugs or feature requests, prepare issues for an AFK agent, or manage issue workflow. +--- + +# Triage + +Move issues on the project issue tracker through a small state machine of triage roles. + +Every comment or issue posted to the issue tracker during triage **must** start with this disclaimer: + +``` +> *This was generated by AI during triage.* +``` + +## Reference docs + +- [AGENT-BRIEF.md](AGENT-BRIEF.md) — how to write durable agent briefs +- [OUT-OF-SCOPE.md](OUT-OF-SCOPE.md) — how the `.out-of-scope/` knowledge base works + +## Roles + +Two **category** roles: + +- `bug` — something is broken +- `enhancement` — new feature or improvement + +Five **state** roles: + +- `needs-triage` — maintainer needs to evaluate +- `needs-info` — waiting on reporter for more information +- `ready-for-agent` — fully specified, ready for an AFK agent +- `ready-for-human` — needs human implementation +- `wontfix` — will not be actioned + +Every triaged issue should carry exactly one category role and one state role. If state roles conflict, flag it and ask the maintainer before doing anything else. + +These are canonical role names — the actual label strings used in the issue tracker may differ. The mapping should have been provided to you - run `/setup-matt-pocock-skills` if not. + +State transitions: an unlabeled issue normally goes to `needs-triage` first; from there it moves to `needs-info`, `ready-for-agent`, `ready-for-human`, or `wontfix`. `needs-info` returns to `needs-triage` once the reporter replies. The maintainer can override at any time — flag transitions that look unusual and ask before proceeding. + +## Invocation + +The maintainer invokes `/triage` and describes what they want in natural language. Interpret the request and act. Examples: + +- "Show me anything that needs my attention" +- "Let's look at #42" +- "Move #42 to ready-for-agent" +- "What's ready for agents to pick up?" + +## Show what needs attention + +Query the issue tracker and present three buckets, oldest first: + +1. **Unlabeled** — never triaged. +2. **`needs-triage`** — evaluation in progress. +3. **`needs-info` with reporter activity since the last triage notes** — needs re-evaluation. + +Show counts and a one-line summary per issue. Let the maintainer pick. + +## Triage a specific issue + +1. **Gather context.** Read the full issue (body, comments, labels, reporter, dates). Parse any prior triage notes so you don't re-ask resolved questions. Explore the codebase using the project's domain glossary, respecting ADRs in the area. Read `.out-of-scope/*.md` and surface any prior rejection that resembles this issue. + +2. **Recommend.** Tell the maintainer your category and state recommendation with reasoning, plus a brief codebase summary relevant to the issue. Wait for direction. + +3. **Reproduce (bugs only).** Before any grilling, attempt reproduction: read the reporter's steps, trace the relevant code, run tests or commands. Report what happened — successful repro with code path, failed repro, or insufficient detail (a strong `needs-info` signal). A confirmed repro makes a much stronger agent brief. + +4. **Grill (if needed).** If the issue needs fleshing out, run a `/grill-with-docs` session. + +5. **Apply the outcome:** + - `ready-for-agent` — post an agent brief comment ([AGENT-BRIEF.md](AGENT-BRIEF.md)). + - `ready-for-human` — same structure as an agent brief, but note why it can't be delegated (judgment calls, external access, design decisions, manual testing). + - `needs-info` — post triage notes (template below). + - `wontfix` (bug) — polite explanation, then close. + - `wontfix` (enhancement) — write to `.out-of-scope/`, link to it from a comment, then close ([OUT-OF-SCOPE.md](OUT-OF-SCOPE.md)). + - `needs-triage` — apply the role. Optional comment if there's partial progress. + +## Quick state override + +If the maintainer says "move #42 to ready-for-agent", trust them and apply the role directly. Confirm what you're about to do (role changes, comment, close), then act. Skip grilling. If moving to `ready-for-agent` without a grilling session, ask whether they want to write an agent brief. + +## Needs-info template + +```markdown +## Triage Notes + +**What we've established so far:** + +- point 1 +- point 2 + +**What we still need from you (@reporter):** + +- question 1 +- question 2 +``` + +Capture everything resolved during grilling under "established so far" so the work isn't lost. Questions must be specific and actionable, not "please provide more info". + +## Resuming a previous session + +If prior triage notes exist on the issue, read them, check whether the reporter has answered any outstanding questions, and present an updated picture before continuing. Don't re-ask resolved questions. diff --git a/.claude/statusline-command.sh b/.claude/statusline-command.sh new file mode 100755 index 00000000..53dbbcba --- /dev/null +++ b/.claude/statusline-command.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# Claude Code status line: model + context usage. +# +# Reads Claude's JSON status payload from stdin and prints a colored +# one-liner: username · model · cwd · ctx · cost. Uses jq for JSON +# parsing so no python is needed — works fine inside the bwrap sandbox +# where the host's python is masked off. If jq is missing, falls +# through to a bash-only degraded line. + +input=$(cat) + +degraded_line() { + local username cwd short_cwd + username=$(whoami 2>/dev/null || echo "?") + cwd=$(printf '%s' "$input" | sed -n 's/.*"current_dir"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p') + [ -z "$cwd" ] && cwd="$PWD" + short_cwd="${cwd/#$HOME/~}" + printf "\033[0;35m%s\033[0m \033[0;33m%s\033[0m \033[2;37m(no jq — degraded statusline)\033[0m" \ + "$username" "$short_cwd" +} + +command -v jq >/dev/null 2>&1 || { degraded_line; exit 0; } + +# Single jq pass emits tab-separated fields so a malformed value can't +# bleed across columns. `// empty` returns empty strings rather than +# the literal "null"; cost defaults to 0 for the printf below. +IFS=$'\t' read -r model cwd used remaining cost < <( + printf '%s' "$input" | jq -r ' + [ + (.model.display_name // "unknown model"), + (.workspace.current_dir // .cwd // ""), + (.context_window.used_percentage // empty | tostring), + (.context_window.remaining_percentage // empty | tostring), + (.cost.total_cost_usd // 0 | tostring) + ] | @tsv + ' 2>/dev/null +) || { degraded_line; exit 0; } + +if [ -z "$model" ]; then + degraded_line + exit 0 +fi + +short_cwd="${cwd/#$HOME/~}" +username=$(whoami 2>/dev/null || echo "unknown") +cost_info=$(printf 'cost: $%.2f' "${cost:-0}") + +if [ -n "$used" ] && [ -n "$remaining" ]; then + # printf %.0f rounds half-away-from-zero, matching the old + # int(round(...)) behaviour closely enough for a status line. + context_info=$(printf 'ctx: %.0f%% used / %.0f%% left' "$used" "$remaining") +else + context_info="ctx: new session" +fi + +printf "\033[0;35m%s\033[0m \033[0;36m%s\033[0m \033[0;33m%s\033[0m \033[0;32m%s\033[0m \033[0;31m%s\033[0m" \ + "$username" "$model" "$short_cwd" "$context_info" "$cost_info" diff --git a/.devcontainer/claude-sandbox/claude-shadow b/.devcontainer/claude-sandbox/claude-shadow new file mode 100755 index 00000000..c06359c1 --- /dev/null +++ b/.devcontainer/claude-sandbox/claude-shadow @@ -0,0 +1,247 @@ +#!/usr/bin/env bash +# Shadow `claude` binary. Wraps the real Claude in a bwrap sandbox so +# every `claude` invocation on $PATH runs sandboxed. +# +# Self-contained: no @@PLACEHOLDERS@@, no `source $SRC_DIR/...`. The +# bwrap argv builder is inlined as `bwrap_argv_build` below so the +# shadow is a single file you can read top-to-bottom. +# +# Recursion guard: if IS_SANDBOX=1 is already set we are inside the +# sandbox (e.g. a hook or skill spawned `claude`), so fall through to +# the real binary directly. Internal claude-spawns-claude does not +# double-wrap. + +set -euo pipefail + +# Real Claude lives off the user's PATH so plain `claude` always +# resolves to /usr/local/bin/claude (this shadow). The official +# claude.ai installer drops the binary at ~/.local/bin/claude AND +# prepends ~/.local/bin to the user's shell rc; the installer +# relocates it here so that rc-mutation becomes inert. +REAL_CLAUDE="/usr/libexec/claude-sandbox/claude" +GITCONFIG_PATH="/etc/claude-gitconfig" + +# Test hook: source-only mode lets tests pull `bwrap_argv_build` into +# scope without running the launch body (nor the recursion guard, which +# would exec the real binary inside an already-sandboxed session). +: "${CLAUDE_SHADOW_SOURCE_ONLY:=0}" + +# Recursion guard: if IS_SANDBOX=1 is already set we are inside the +# sandbox (e.g. a hook or skill spawned `claude`), so fall through to +# the real binary directly. Inside the sandbox the real binary is +# bind-mounted at ~/.local/bin/claude — exec that path so argv[0] +# matches the conventional install location. --no-chrome injection +# (see filter_chrome_args below) applies here too so a nested spawn +# can't re-enable the browser-extension RPC channel. +if [ "$CLAUDE_SHADOW_SOURCE_ONLY" != "1" ] && [ "${IS_SANDBOX:-}" = "1" ]; then + _filtered=() + for _a in "$@"; do + case "$_a" in --chrome) ;; *) _filtered+=( "$_a" ) ;; esac + done + exec "${HOME:-/root}/.local/bin/claude" --no-chrome "${_filtered[@]+"${_filtered[@]}"}" +fi + +# Export so the inlined bwrap_argv_build picks up the gitconfig path. +export CLAUDE_SANDBOX_GITCONFIG_PATH="$GITCONFIG_PATH" + +# BwrapArgvBuilder: pure-function emitting the bwrap argv. Deterministic +# given (workspace, real_claude, "$@") and ($HOME, +# $CLAUDE_SANDBOX_GITCONFIG_PATH). No I/O, no subprocesses — sourced by +# tests/bwrap_argv.sh and called in isolation. +# +# Procfs is always `--ro-bind /proc /proc` (no per-launch probe). Trade- +# off: host PIDs are enumerable from inside the sandbox (info-disclosure), +# but credential-bearing procfs entries stay gated by YAMA +# ptrace_scope=1. Kernel-level pidns isolation (kill/ptrace scoping) +# remains intact via --unshare-pid. +bwrap_argv_build() { + local workspace="$1"; shift + local real_claude="$1"; shift + + local home="${HOME:-/root}" + local gitconfig_path="${CLAUDE_SANDBOX_GITCONFIG_PATH:-/etc/claude-gitconfig}" + + local -a argv=( + bwrap + --ro-bind / / + # Fresh /dev (not --dev-bind) hides the host's /dev/pts so a + # TIOCSTI inside the sandbox can only inject into the script(1)- + # allocated pty the shadow wraps us in. + --dev /dev + # Unconditional ro-bind of host /proc — host PIDs visible + # (info-disclosure, accepted) but kernel pidns isolation intact. + --ro-bind /proc /proc + --tmpfs /tmp + ) + + # /run/{user,secrets} masks are emitted only when the host has the + # source dir. Bwrap can't mkdir into a read-only /run when the + # parent has no such subdir (typical of GHA's ubuntu-24.04 runner). + if [ -d /run/user ]; then + argv+=( --tmpfs /run/user ) + fi + if [ -d /run/secrets ]; then + argv+=( --tmpfs /run/secrets ) + fi + + # Strict-under-/root by inversion: wipe $HOME, then bind back only + # what Claude legitimately needs. Anything we forgot to enumerate + # stays masked — the whole point of inverting. + argv+=( --tmpfs "$home" ) + + # Single bind-back list. --bind on a missing source would abort + # bwrap, so each entry is gated on existence. Directories use + # `-d`, files use `-f`; both flavours map source→dest identically. + local rel + for rel in .claude .cache .config/gh .config/glab-cli .local/share/uv; do + if [ -d "$home/$rel" ]; then + argv+=( --bind "$home/$rel" "$home/$rel" ) + fi + done + for rel in .claude.json .local/bin/uv .local/bin/uvx; do + if [ -f "$home/$rel" ]; then + argv+=( --bind "$home/$rel" "$home/$rel" ) + fi + done + + # Real binary lives off-PATH on the host; expose it inside the + # sandbox at the conventional ~/.local/bin/claude location so + # Claude's self-inspection and any internal claude-spawns-claude + # path lookups see the path they expect. Unconditional — the + # shadow's loud-fail upstream catches a missing real binary. + argv+=( --bind "$real_claude" "$home/.local/bin/claude" ) + + if [ -n "$workspace" ] && [ -d "$workspace" ]; then + argv+=( --bind "$workspace" "$workspace" ) + fi + + # Defence-in-depth file masks. Strict-under-/root already hides the + # $HOME dotfiles, but masking them with /dev/null is free and + # survives if the strict-root bind ever regresses. /etc masks are + # gated on readability so non-root hosts don't trip EROFS. + local mask + for mask in "$home/.netrc" "$home/.Xauthority" "$home/.ICEauthority"; do + argv+=( --bind-try /dev/null "$mask" ) + done + for mask in /etc/shadow /etc/gshadow /etc/sudoers; do + if [ -r "$mask" ]; then + argv+=( --bind /dev/null "$mask" ) + fi + done + + argv+=( + --cap-drop ALL + # --unshare-user-try is required when bwrap runs as root inside + # a nested container that lacks CAP_SYS_ADMIN. When bwrap runs + # as non-root it implicitly unshares user anyway, so this is a + # no-op in that path. + --unshare-user-try + --unshare-pid + --unshare-ipc + --unshare-uts + --unshare-cgroup-try + # No --new-session: setsid() severs SIGWINCH delivery. The + # TIOCSTI defence is delegated to the script(1) wrap around + # bwrap (the inner pty's input queue is unreachable from the + # host shell). + --die-with-parent + ) + + # Scrub the env by default, then re-export only what Claude needs. + # $HOME/.local/bin is APPENDED so system tools take precedence in + # PATH resolution. + argv+=( --clearenv ) + argv+=( --setenv PATH "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$home/.local/bin" ) + argv+=( --setenv HOME "$home" ) + argv+=( --setenv USER "root" ) + argv+=( --setenv IS_SANDBOX "1" ) + argv+=( --setenv GIT_CONFIG_GLOBAL "$gitconfig_path" ) + argv+=( --setenv GIT_CONFIG_SYSTEM "/dev/null" ) + local pass_through_var + for pass_through_var in TERM LANG LC_ALL LC_CTYPE LC_MESSAGES LC_TIME LC_COLLATE LC_NUMERIC LC_MONETARY; do + if [ -n "${!pass_through_var:-}" ]; then + argv+=( --setenv "$pass_through_var" "${!pass_through_var}" ) + fi + done + + # Disable the Chrome browser-extension RPC channel: strip any + # user-supplied --chrome so it can't override our --no-chrome + # injection. The browser extension's native-messaging-host bridge + # would let any installed Chrome extension on the host invoke + # tools inside this in-sandbox Claude — outside the threat model. + # Manifests Claude would otherwise write into ~/.config// + # NativeMessagingHosts/ are gated on this flag, so check 03 + # (strict-under-/root) catches any regression. + local -a user_args=() + local user_arg + for user_arg in "$@"; do + case "$user_arg" in + --chrome) ;; + *) user_args+=( "$user_arg" ) ;; + esac + done + + # Exec via the in-sandbox conventional path so Claude's argv[0] + # matches what the official installer would have placed. + argv+=( -- "$home/.local/bin/claude" --no-chrome "${user_args[@]+"${user_args[@]}"}" ) + printf '%s\n' "${argv[@]}" +} + +# Guard: tests source this file with CLAUDE_SHADOW_SOURCE_ONLY=1 so +# only the function definitions are pulled into scope. +if [ "$CLAUDE_SHADOW_SOURCE_ONLY" = "1" ]; then + return 0 2>/dev/null || exit 0 +fi + +# Loud-fail when the real binary is missing — clearer than the +# downstream errno from bwrap exec'ing a non-existent path. +if [ ! -x "$REAL_CLAUDE" ]; then + echo "claude-sandbox: real Claude binary missing at $REAL_CLAUDE." >&2 + echo " Re-run \`./install\` from a fresh clone of claude-sandbox." >&2 + exit 1 +fi + +# Pre-create the gh/glab credential dirs so the argv builder's --bind +# succeeds the first time the user runs `claude` after install. +mkdir -p "${HOME:-/root}/.config/gh" "${HOME:-/root}/.config/glab-cli" + +# Ensure ~/.claude.json exists for the bind-back. Without this, Claude +# Code's OAuth token writes into the in-sandbox tmpfs and vanishes on +# exit. touch is a no-op on an existing file apart from mtime. +touch "${HOME:-/root}/.claude.json" + +# Refresh /etc/claude-gitconfig from the host's current git identity on +# every launch. VS Code's dev.containers.copyGitConfig fires AFTER +# postCreate so the install-time render can have empty user.name. By +# the time the user types `claude`, copyGitConfig has run — re-render +# here so the [user] block always reflects the current host identity. +git_name=$(git config --get user.name 2>/dev/null || true) +git_email=$(git config --get user.email 2>/dev/null || true) +cat > "$CLAUDE_SANDBOX_GITCONFIG_PATH" </dev/null 2>&1; then + echo "claude-sandbox: refusing — Debian/Ubuntu only (no apt-get on PATH)." >&2 + exit 1 + fi +} + +apt_install() { + if [ "$SMOKE" = "1" ]; then + return 0 + fi + export DEBIAN_FRONTEND=noninteractive + apt-get update -qq + apt-get install -y -qq --no-install-recommends \ + bubblewrap just jq curl ca-certificates git nodejs gh + # glab isn't in every Ubuntu repo; install-try. + apt-get install -y -qq --no-install-recommends glab 2>/dev/null || true +} + +probe_userns_or_refuse() { + if [ "$SMOKE" = "1" ]; then + return 0 + fi + if ! bwrap --ro-bind / / --unshare-user-try --unshare-pid -- /bin/true \ + >/dev/null 2>&1; then + cat >&2 <<'EOF' +claude-sandbox: refusing — kernel unprivileged user namespaces are +forbidden. The bwrap sandbox cannot start without them. + +On Ubuntu 24.04: + sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0 +On rootful Docker with default AppArmor: rebuild the devcontainer +under rootless podman, or relax AppArmor for bwrap. +EOF + exit 1 + fi +} + +# install_claude_binary: fetch the real Claude via the official +# installer, then relocate it to a path that is NOT on the user's +# PATH. The official installer drops the binary at ~/.local/bin/claude +# AND prepends ~/.local/bin to the user's shell rc — meaning plain +# `claude` would resolve past our shadow once a new shell starts. By +# moving the binary to /usr/libexec/claude-sandbox/, ~/.local/bin/ +# stays empty and the rc-mutation becomes harmless. +install_claude_binary() { + if [ "$SMOKE" = "1" ]; then + return 0 + fi + local real_dest + real_dest="$(prefixed /usr/libexec/claude-sandbox/claude)" + if [ -x "$real_dest" ]; then + # Idempotent: purge any stale copy a prior curl-install may have + # left at ~/.local/bin/claude so the shadow remains the only + # `claude` on the user's PATH. + rm -f "$HOME/.local/bin/claude" + return 0 + fi + curl -fsSL https://claude.ai/install.sh | bash + if [ ! -x "$HOME/.local/bin/claude" ]; then + echo "claude-sandbox: official installer did not produce \$HOME/.local/bin/claude" >&2 + exit 1 + fi + mkdir -p "$(dirname "$real_dest")" + mv "$HOME/.local/bin/claude" "$real_dest" +} + +# install_file: byte-stable copy of src → dst at mode 0755. Refuses +# if src is missing (loud-fail beats a downstream errno). cmp -s +# short-circuits so a re-run is a true no-op when content matches. +install_file() { + local src="$1" dst="$2" + if [ ! -f "$src" ]; then + echo "claude-sandbox: cannot find $src" >&2 + exit 1 + fi + mkdir -p "$(dirname "$dst")" + if [ -f "$dst" ] && cmp -s "$src" "$dst"; then + return 0 + fi + install -m 0755 "$src" "$dst" +} + +ensure_cred_dirs() { + mkdir -p "$HOME/.config/gh" "$HOME/.config/glab-cli" + touch "$HOME/.claude.json" +} + +# link_terminal_config: when /user-terminal-config is mounted (the +# convention used by terminal-config-style devcontainers), symlink +# ~/.claude and ~/.claude.json into it so Claude's settings and OAuth +# state are shared across every devcontainer on the host. Runs before +# install_claude_binary so the destinations are guaranteed-clean; a +# pre-existing destination (this repo's bind mount, or a previous +# install) makes ln a no-op via the -e/-L guards. +link_terminal_config() { + local shared="${CLAUDE_SHARED_CONFIG:-/user-terminal-config}" + [ -d "$shared" ] || return 0 + mkdir -p "$shared/.claude" + [ -e "$shared/.claude.json" ] || : > "$shared/.claude.json" + [ -e "$HOME/.claude" ] || [ -L "$HOME/.claude" ] || ln -s "$shared/.claude" "$HOME/.claude" + [ -e "$HOME/.claude.json" ] || [ -L "$HOME/.claude.json" ] || ln -s "$shared/.claude.json" "$HOME/.claude.json" +} + +# wire_settings_hook: surgical UserPromptSubmit-hook merge into +# /.claude/settings.json. +# - file absent → write minimal {"hooks":{"UserPromptSubmit":[...]}}. +# - file parses as JSON via jq → merge, dedup by command basename. +# - file is JSONC (jq parse fails) → refuse with paste-this snippet. +# - existing entry with same basename but different command → refuse. +wire_settings_hook() { + local settings="$WORKSPACE/.claude/settings.json" + local hook_cmd=".claude/hooks/sandbox-check.sh" + mkdir -p "$(dirname "$settings")" + + local minimal + minimal="$(jq -n --arg cmd "$hook_cmd" '{ + hooks: { + UserPromptSubmit: [ + {hooks: [{type: "command", command: $cmd}]} + ] + } + }')" + + if [ ! -f "$settings" ]; then + printf '%s\n' "$minimal" > "$settings" + chmod 0644 "$settings" + return 0 + fi + + if ! jq -e . "$settings" >/dev/null 2>&1; then + cat >&2 <&2 + exit 1 + fi + + local merged tmp + merged="$(jq --arg cmd "$hook_cmd" ' + .hooks //= {} + | .hooks.UserPromptSubmit //= [] + | if any(.hooks.UserPromptSubmit[].hooks[]?; .command == $cmd) then . + else .hooks.UserPromptSubmit += [ + {hooks: [{type: "command", command: $cmd}]} + ] + end + ' "$settings")" + tmp="$(mktemp "$settings.XXXXXX")" + printf '%s\n' "$merged" > "$tmp" + chmod 0644 "$tmp" + mv "$tmp" "$settings" +} + +# wire_settings_statusline: stamp our .statusLine into settings.json +# iff the field is absent. Any pre-existing .statusLine — ours or the +# user's — is left alone, so a user who customised theirs keeps it +# across rebuilds. wire_settings_hook runs first and guarantees the +# file exists and parses as JSON, so no JSONC branch is needed here. +wire_settings_statusline() { + local settings="$WORKSPACE/.claude/settings.json" + local sl_cmd=".claude/statusline-command.sh" + + if jq -e '.statusLine' "$settings" >/dev/null 2>&1; then + return 0 + fi + + local merged tmp + merged="$(jq --arg cmd "$sl_cmd" ' + .statusLine = {type: "command", command: $cmd} + ' "$settings")" + tmp="$(mktemp "$settings.XXXXXX")" + printf '%s\n' "$merged" > "$tmp" + chmod 0644 "$tmp" + mv "$tmp" "$settings" +} + +main() { + probe_or_refuse + # Shadow first: with /usr/local/bin/claude in place before the + # official installer runs, any `claude` lookup during the rest of + # install resolves (and bash-hashes) to the shadow path, even if + # the shadow itself transiently fails because bwrap or the real + # binary haven't landed yet. + install_file "$SCRIPT_DIR/claude-shadow" "$(prefixed /usr/local/bin/claude)" + apt_install + probe_userns_or_refuse + link_terminal_config + install_claude_binary + ensure_cred_dirs + install_file "$REPO_ROOT/.claude/hooks/sandbox-check.sh" \ + "$WORKSPACE/.claude/hooks/sandbox-check.sh" + install_file "$REPO_ROOT/.claude/statusline-command.sh" \ + "$WORKSPACE/.claude/statusline-command.sh" + wire_settings_hook + wire_settings_statusline + + echo "claude-sandbox: install complete." + echo " shadow: $(prefixed /usr/local/bin/claude)" + echo " real claude: $(prefixed /usr/libexec/claude-sandbox/claude)" + echo " workspace: $WORKSPACE" + echo " run \`/verify-sandbox\` inside Claude for the live battery." +} + +# Source guard: `promote.sh` re-uses `install_file`, +# `wire_settings_hook`, and `wire_settings_statusline` by sourcing this +# file. The guard keeps main() from auto-running in that case. +if [ "${BASH_SOURCE[0]}" = "$0" ]; then + main "$@" +fi diff --git a/.devcontainer/claude-sandbox/promote.sh b/.devcontainer/claude-sandbox/promote.sh new file mode 100755 index 00000000..4af0df9d --- /dev/null +++ b/.devcontainer/claude-sandbox/promote.sh @@ -0,0 +1,172 @@ +#!/usr/bin/env bash +# claude-sandbox promote: make a target host workspace a self-sufficient +# claude-sandbox host. After a successful promote, a teammate who +# clones the target only needs the devcontainer to come up — the +# installer runs from postCreate and the curated `.claude/` is in tree. +# +# Three layers of effect, in order: +# +# 1. Curated `.claude/` content (commands, skills, hooks, statusline, +# sandbox-check hook + statusLine wiring into settings.json). +# 2. Install machinery (`.devcontainer/claude-sandbox/{install.sh, +# claude-shadow, promote.sh}` + root `justfile`) so postCreate can +# run install.sh directly and `just promote`/`just gh-auth` work in +# the target the same way they do in the source clone. The source +# repo's root `install` shim is NOT copied — it's the source-repo +# UX entry, not a promoted-target workflow. +# 3. `.devcontainer/postCreate.sh` runs +# `bash .devcontainer/claude-sandbox/install.sh`; we then print the +# one-line JSON snippet the user pastes into devcontainer.json so +# install runs automatically on devcontainer create. We do NOT edit +# devcontainer.json — it's JSONC in the wild and the user knows +# whether they've already wired it or need to combine with their own. +# +# Idempotent: re-runs are byte-stable via install_file's `cmp -s` +# short-circuit and the dedup/refuse logic in each wire_* step. +# +# Does NOT touch `~/.claude` — that channel is reserved for +# cross-container shared state (OAuth, memories). Issue #18 spells out +# the rationale. +# +# Usage: +# bash .devcontainer/claude-sandbox/promote.sh [TARGET] +# just promote [TARGET] # preferred +# +# TARGET defaults to $PWD. The script refuses when TARGET resolves to +# the sandbox clone itself — promoting onto yourself is a no-op the +# user almost certainly didn't mean. +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# REPO_ROOT is the clone — two levels above .devcontainer/claude-sandbox. +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + +TARGET_INPUT="${1:-$PWD}" +if [ ! -d "$TARGET_INPUT" ]; then + echo "claude-sandbox: refusing — target '$TARGET_INPUT' is not a directory." >&2 + exit 1 +fi +TARGET="$(cd "$TARGET_INPUT" && pwd)" + +if [ "$TARGET" = "$REPO_ROOT" ]; then + echo "claude-sandbox: refusing — target is the sandbox repo itself; nothing to promote." >&2 + exit 1 +fi + +# Hand WORKSPACE off to install.sh's wire_settings_{hook,statusline} — +# they operate on "$WORKSPACE/.claude/settings.json". +INSTALL_WORKSPACE="$TARGET" +export INSTALL_WORKSPACE +# shellcheck source=./install.sh +. "$SCRIPT_DIR/install.sh" + +# copy_tree: install_file every regular file under $1 to the same +# relative path under $2. install_file's `cmp -s` short-circuit makes +# re-runs no-ops; mode 0755 matches what install.sh already uses for +# the hook and statusline (skill .md files end up 0755 too — harmless, +# users can `chmod 0644` if they care). +copy_tree() { + local src_dir="$1" dst_dir="$2" + [ -d "$src_dir" ] || return 0 + local src rel + while IFS= read -r -d '' src; do + rel="${src#"$src_dir/"}" + install_file "$src" "$dst_dir/$rel" + done < <(find "$src_dir" -type f -print0) +} + +# wire_postcreate_script: ensure /.devcontainer/postCreate.sh +# exists and runs `.devcontainer/claude-sandbox/install.sh`. If a +# postCreate.sh is already there, append our line unless an installer +# invocation is already present. Final mode is 0755 so +# devcontainer.json's `"postCreateCommand": ".devcontainer/postCreate.sh"` +# form works without an explicit `bash` prefix. +wire_postcreate_script() { + local pc="$TARGET/.devcontainer/postCreate.sh" + local install_cmd="bash .devcontainer/claude-sandbox/install.sh" + mkdir -p "$(dirname "$pc")" + + if [ ! -f "$pc" ]; then + cat > "$pc" <> "$pc" + chmod 0755 "$pc" +} + +# print_devcontainer_snippet: print the JSON the user should paste into +# their devcontainer.json. We deliberately do NOT inspect or edit +# devcontainer.json — it is almost always JSONC, structured editing +# while preserving comments is non-trivial, and the user is the one +# who knows whether they already wired this in (or whether their +# existing postCreateCommand needs combining with ours). Trust them. +print_devcontainer_snippet() { + local pc_cmd=".devcontainer/postCreate.sh" + cat >&2 < just this repo + - Expiration: short (e.g. 30 days) so a leaked token expires quickly + - Repository permissions (Read and Write): + Contents, Issues, Pull requests + (Metadata: Read-only is added automatically) + - Leave everything else unset / no access + + EOF + read -sp "GitHub PAT: " t && echo + echo "$t" | gh auth login --with-token + unset t + gh auth setup-git + gh auth status + +# Authenticate glab CLI with a GitLab PAT (token not stored in shell history). +# --git-protocol https prevents glab's SSH insteadOf rewrite. +glab-auth hostname="gitlab.com": + #!/usr/bin/env bash + cat <<'EOF' + Create or renew a fine-grained PAT at: + https://gitlab.com/-/user_settings/personal_access_tokens + (or your organisation's GitLab instance equivalent) + + Recommended scopes for a sandboxed Claude Code: + - api, read_repository, write_repository + - Short expiration so a leaked token expires quickly + + EOF + read -sp "GitLab PAT for {{ hostname }}: " t && echo + echo "$t" | glab auth login --stdin --hostname {{ hostname }} --git-protocol https + unset t + glab auth status