optave · carlos-alm · Mar 24, 2026 · Mar 21, 2026 · Mar 21, 2026 · Mar 21, 2026
diff --git a/.claude/skills/bench-check/SKILL.md b/.claude/skills/bench-check/SKILL.md
@@ -153,6 +153,13 @@ For each metric in the current run:
 
 ## Phase 4 — Verdict
 
+### Pre-condition check
+Before evaluating verdicts, verify that at least one benchmark produced valid numeric results.
+If `metrics` is empty (all suites produced `"error"` or `"timeout"` records):
+- Print: `BENCH-CHECK ABORTED — no valid benchmark results (all suites failed or timed out)`
+- Do NOT proceed to Phase 5 — the baseline must not be overwritten with empty data
+- Stop here and skip to Phase 6 to write the report with verdict `ABORTED`.
+
 Based on comparison results:
 
 ### No regressions found
@@ -169,7 +176,7 @@ Based on comparison results:
   - Re-run individual benchmarks to confirm (not flaky)
 
 ### First run (no baseline)
-- If `COMPARE_ONLY` is set: print a warning that no baseline exists and exit without saving
+- If `COMPARE_ONLY` is set: print a warning that no baseline exists and **stop here — do not proceed to Phase 5 or Phase 6**. No baseline is saved and no report is written.
 - Otherwise: print `BENCH-CHECK — initial baseline saved` and save current results as baseline
 
 ### Save-baseline with existing baseline (`--save-baseline`)
@@ -180,6 +187,7 @@ Based on comparison results:
 
 **Skip this phase if `COMPARE_ONLY` is set.** Compare-only mode never writes or commits baselines.
 **Skip this phase if regressions were detected in Phase 4.** The baseline is only updated on a clean run.
+**Skip this phase if the ABORTED pre-condition was triggered in Phase 4.** The baseline must not be overwritten with empty data.
 
 When saving (initial run, `--save-baseline`, or passed comparison):
 
@@ -211,8 +219,26 @@ git diff --cached --quiet -- generated/bench-check/baseline.json generated/bench
 
 ## Phase 6 — Report
 
+**Skip this phase (write no report) if `COMPARE_ONLY` was set and no baseline existed, AND the ABORTED pre-condition was not triggered.** That early-exit case was already handled in Phase 4 — writing a "BASELINE SAVED" report here would be misleading since no baseline was saved. When ABORTED, always write the ABORTED report regardless of other flags.
+
 Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.
 
+**If the ABORTED pre-condition was triggered (no valid benchmark results):** write a minimal report — this check MUST come before the SAVE_ONLY/first-run check, because when all benchmarks fail on a `--save-baseline` or first run, SAVE_ONLY would also be true but no baseline was actually saved:
+
+```markdown
+# Benchmark Report — <date>
+
+**Version:** X.Y.Z | **Git ref:** abc1234 | **Threshold:** $THRESHOLD%
+
+## Verdict: ABORTED — no valid benchmark results
+
+All benchmark suites failed or timed out. See Phase 1 error records for details.
+
+## Raw Results
+
+<!-- Error/timeout records from each suite -->
+```
+
 **If `SAVE_ONLY` is set or no prior baseline existed (first run):** write a shortened report — omit the "Comparison vs Baseline" and "Regressions" sections since no comparison was performed:
 
 ```markdown
@@ -257,7 +283,7 @@ Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.
 
 1. If report was written, print its path
 2. If baseline was updated, print confirmation
-3. Print one-line summary: `PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED`
+3. Print one-line summary: `PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED | ABORTED (all suites failed)`
 
 ## Rules
 
@@ -266,6 +292,6 @@ Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.
 - **Don't update baseline on regression** — the user must investigate first
 - **Recall/quality metrics are inverted** — a decrease is a regression
 - **Count metrics are informational** — graph growing isn't a regression
-- **The baseline file is committed to git** — it's a shared reference point; Phase 5 always commits it
+- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it on clean (non-regressed) runs where COMPARE_ONLY is not set
 - **history.ndjson is append-only** — never truncate or rewrite it
- **Count metrics are informational** — graph growing isn't a regression
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 always commits it
- **history.ndjson is append-only** — never truncate or rewrite it
+- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it on clean (non-regressed) runs where `COMPARE_ONLY` is not set
- **Count metrics are informational** — graph growing isn't a regression
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 always commits it
- **history.ndjson is append-only** — never truncate or rewrite it
+- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it on clean (non-regressed) runs where `COMPARE_ONLY` is not set
 - Generated files go in `generated/bench-check/` — create the directory if needed
diff --git a/.claude/skills/deps-audit/SKILL.md b/.claude/skills/deps-audit/SKILL.md
@@ -22,14 +22,10 @@ Audit the project's dependency tree for security vulnerabilities, outdated packa
 5. **If `AUTO_FIX` is set:** Save the original manifests now, before any npm commands run, so pre-existing unstaged changes are preserved:
    ```bash
    git stash push -m "deps-audit-backup" -- package.json package-lock.json
-   STASH_CREATED=$?
-   ```
-   Track `STASH_CREATED` — when `0`, a stash entry was actually created; when `1`, the files had no changes so nothing was stashed.
-   If `STASH_CREATED` is `0`, immediately capture the stash ref for later use:
-   ```bash
    STASH_REF=$(git stash list --format='%gd %s' | grep 'deps-audit-backup' | head -1 | awk '{print $1}')
    ```
-   Use `$STASH_REF` (not `stash@{0}`) in all later stash drop/pop commands to avoid targeting the wrong entry if other stashes are pushed in the interim.
+   `STASH_REF` is non-empty if and only if a stash entry was actually created. Do **not** use `$?` — modern git (2.16+) returns 0 even when nothing was stashed.
+   Use `[ -n "$STASH_REF" ]` (stash created) / `[ -z "$STASH_REF" ]` (nothing stashed) for all branching. Use `$STASH_REF` (not `stash@{0}`) in all later stash drop/pop commands to avoid targeting the wrong entry.
 
 ## Phase 1 — Security Vulnerabilities
 
@@ -165,13 +161,25 @@ If `AUTO_FIX` was set:
 Summarize all changes made:
 1. List each package updated/fixed
 2. Run `npm test` to verify nothing broke
-3. If tests pass and `STASH_CREATED` is `0`: drop the saved state (`git stash drop $STASH_REF`) — the npm changes are good, no rollback needed
-   If tests pass and `STASH_CREATED` is `1`: no action needed — the npm changes are good and no stash entry exists to clean up
-4. If tests fail and `STASH_CREATED` is `0`:
-   - Restore the saved manifests: `git stash pop $STASH_REF`
+3. If tests pass and `STASH_REF` is non-empty: pop and merge the saved state (`git stash pop $STASH_REF`) — this restores any pre-existing uncommitted changes alongside the npm fix results. Note: the step 2 test run validated the npm changes alone; step 3b below is the authoritative test of the final merged state.
+   - If the pop applies cleanly:
+     a. Run `npm install` to re-sync `node_modules/` with the merged manifest.
+     b. Re-run `npm test` to confirm the merged state is consistent (this is the authoritative check — step 2 only validated the npm changes in isolation).
+     c. If tests still pass: confirm the project is consistent.
+     d. If tests now fail: warn the user — the pre-existing manifest changes conflict with the audit fixes.
+        Recovery options:
+        - To undo **all** manifest changes (both audit fixes and pre-existing): `git checkout -- package.json package-lock.json && npm ci`
+        - To keep only the audit fixes and discard pre-existing changes: manually edit `package.json`/`package-lock.json` to remove the pre-existing delta, then `npm ci`
+        - To keep only the pre-existing changes and discard the audit fixes: `git checkout HEAD -- package.json package-lock.json && npm ci` to revert manifests to their clean state, then manually re-apply only your pre-existing changes
+   - If the pop causes conflicts in `package.json`/`package-lock.json`: warn the user, leave conflict markers for manual resolution, and instruct: "After you resolve the conflicts, run `npm install` to re-sync `node_modules/` with the resolved lock file before committing."
+   - For conflicts in other files, resolve them by keeping both the npm fixes and the pre-existing changes.
+   If tests pass and `STASH_REF` is empty: no action needed — the npm changes are good and no stash entry exists to clean up
+4. If tests fail and `STASH_REF` is non-empty:
+   - Reset manifests to HEAD first (undoes npm changes): `git checkout HEAD -- package.json package-lock.json`
+   - Then re-apply the pre-existing changes cleanly: `git stash pop $STASH_REF`
    - Restore `node_modules/` to match the reverted lock file: `npm ci`
    - Report what failed
-5. If tests fail and `STASH_CREATED` is `1`:
+5. If tests fail and `STASH_REF` is empty:
    - Discard manifest changes: `git checkout -- package.json package-lock.json`
    - Restore `node_modules/` to match the reverted lock file: `npm ci`
    - Report what failed
@@ -181,6 +189,6 @@ Summarize all changes made:
 - **Never run `npm audit fix --force`** — breaking changes need human review
 - **Never remove a dependency** without asking the user, even if it appears unused — flag it in the report instead
 - **Always run tests** after any auto-fix changes
-- **If `--fix` causes test failures**, restore manifests from the saved state (`git stash pop $STASH_REF` if `STASH_CREATED=0`, or `git checkout` if stash was a no-op) then run `npm ci` to resync `node_modules/`, and report the failure
+- **If `--fix` causes test failures**, first reset manifests to HEAD (`git checkout HEAD -- package.json package-lock.json`), then re-apply pre-existing changes (`git stash pop $STASH_REF` if `STASH_REF` is non-empty, or no-op if nothing was stashed), then run `npm ci` to resync `node_modules/`, and report the failure
 - Treat `optionalDependencies` separately — they're expected to fail on some platforms
 - The report goes in `generated/deps-audit/` — create the directory if it doesn't exist
diff --git a/.claude/skills/housekeep/SKILL.md b/.claude/skills/housekeep/SKILL.md
@@ -66,28 +66,65 @@ For stale worktrees with merged branches:
 
 ## Phase 2 — Delete Dirt Files
 
-Remove temporary and generated files that accumulate over time:
+Remove temporary and generated files that accumulate over time. There are two distinct categories of dirt that require different discovery commands:
+
+- **Gitignored dirt** (files matching `.gitignore` patterns — e.g. `coverage/`, `.DS_Store`, `*.log`, `.codegraph/graph.db-journal`): use `git clean -fdX --dry-run` to list them. `git ls-files --others --exclude-standard` silently omits these because `--exclude-standard` suppresses gitignored entries.
+- **Untracked non-ignored files** (stray files not in `.gitignore` — e.g. `*.tmp.*`, `*.bak`, `*.orig`): use `git ls-files --others --exclude-standard` to list them.
+
+Run both commands and union the results to get the full set of candidate dirt files.
 
 ### 2a. Known dirt patterns
 
-Search for and remove:
+Search for and remove files found by the two discovery commands above (never touch tracked files):
 - `*.tmp.*`, `*.bak`, `*.orig` files in the repo (but NOT in `node_modules/`)
 - `.DS_Store` files
 - `*.log` files in repo root (not in `node_modules/`)
 - Empty directories (except `.codegraph/`, `.claude/`, `node_modules/`)
 - `coverage/` directory (regenerated by `npm run test:coverage`)
 - `.codegraph/graph.db-journal` (SQLite WAL leftovers)
-- Stale lock files: `.codegraph/*.lock` older than 1 hour
+
+**Stale lock files** (`.codegraph/*.lock` older than 1 hour): Before removing, first check if `lsof` is available (`command -v lsof`). If `lsof` is **not installed** (common in Docker/CI minimal containers where it exits 127), **skip lock file removal entirely** and print a warning: `"lsof not available — skipping lock file cleanup (cannot verify no process holds the file)"`. When `lsof` IS available, use `lsof "$f"` to verify no process holds the file. If the file is held, **skip it** and warn — concurrent Claude Code sessions may hold legitimate long-lived locks.
+
+```bash
+if ! command -v lsof > /dev/null 2>&1; then
+  echo "lsof not available — skipping lock file cleanup (cannot verify no process holds the file)"
+else
+  for f in .codegraph/*.lock; do
+    [ -f "$f" ] || continue
+    age=$(( $(date +%s) - $(stat --format='%Y' "$f" 2>/dev/null || stat -f '%m' "$f" 2>/dev/null) ))
+    [ -z "$age" ] && continue
+    if [ "$age" -gt 3600 ] && ! lsof "$f" > /dev/null 2>&1; then
+      if [ "$DRY_RUN" = "true" ]; then
+        echo "[DRY RUN] Would remove stale lock: $f"
+      else
+        echo "Removing stale lock: $f"
+        rm "$f"
+      fi
+    elif [ "$age" -gt 3600 ]; then
+      echo "Lock file $f is old but still held by a process — ask user before removing"
+    fi
+  done
+fi
+```
 
 ### 2b. Large untracked files
 
-Find untracked files larger than 1MB:
+Find untracked files (both gitignored and non-ignored) larger than 1MB. Use both discovery commands and union the paths:
 ```bash
+# Non-ignored untracked files
 git ls-files --others --exclude-standard | while read f; do
   size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
   [ -z "$size" ] && continue
   if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
 done
+# Gitignored files (strip the leading "Would remove " prefix from dry-run output)
+git clean -fdX --dry-run | sed 's/^Would remove //' | while read f; do
+  # Skip directory entries — stat returns inode size, not content size
+  [ -d "$f" ] && continue
+  size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
+  [ -z "$size" ] && continue
+  if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes) [gitignored]"; fi
+done
 ```
-git ls-files --others --exclude-standard | while read f; do
-  size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
-  if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
-done
-```
+      size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
+      [ -z "$size" ] && continue
+      if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
-git ls-files --others --exclude-standard | while read f; do
-  size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
-  if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
-done
-```
+      size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
+      [ -z "$size" ] && continue
+      if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
 
 Flag these for user review — they might be accidentally untracked binaries.
@@ -117,7 +154,7 @@ git log HEAD..origin/main --oneline
 ```
 
 If main has new commits:
-- If on main: `git pull origin main`
+- If on main: `git pull --no-rebase origin main`
 - If on a feature branch: inform the user how many commits behind main they are
   - Suggest: `git merge origin/main` (never rebase — per project rules)
 

diff --git a/.claude/skills/test-health/SKILL.md b/.claude/skills/test-health/SKILL.md
@@ -41,8 +41,14 @@ Run the full test suite `FLAKY_RUNS` times and track per-test pass/fail:
 RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
 for i in $(seq 1 $FLAKY_RUNS); do
   timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
-  if [ $? -eq 124 ]; then
+  exit_code=$?
+  if [ $exit_code -eq 124 ]; then
     echo '{"timeout":true}' > "$RUN_DIR/run-$i.json"
+  elif [ $exit_code -ne 0 ] && [ $exit_code -ne 1 ]; then
+    # Use jq to safely JSON-escape stderr content (may contain quotes, newlines, backslashes)
+    stderr_content=$(cat "$RUN_DIR/run-$i.err")
+    jq -n --argjson code "$exit_code" --arg stderr "$stderr_content" \
+      '{"error":true,"exit_code":$code,"stderr":$stderr}' > "$RUN_DIR/run-$i.json"
   fi
 done
-RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
-for i in $(seq 1 $FLAKY_RUNS); do
-  npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
-done
+RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
+for i in $(seq 1 $FLAKY_RUNS); do
+  timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
+  if [ $? -eq 124 ]; then
+    echo "run-$i timed out after 3 minutes — partial results recorded" >> "$RUN_DIR/run-$i.err"
+  fi
+done
-RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
-for i in $(seq 1 $FLAKY_RUNS); do
-  npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
-done
+RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
+for i in $(seq 1 $FLAKY_RUNS); do
+  timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
+  if [ $? -eq 124 ]; then
+    echo "run-$i timed out after 3 minutes — partial results recorded" >> "$RUN_DIR/run-$i.err"
+  fi
+done
 ```
@@ -56,7 +62,9 @@ rm -rf "$RUN_DIR"
 
 ### Analysis
 
-A test is **flaky** if it passes in some runs and fails in others.
+Before analyzing, **exclude invalid runs**: skip any run file containing `{"timeout":true}` or `{"error":true}` — these runs produced no reliable per-test data and must not be counted as "all tests failed." Require a **minimum of 2 valid runs** (runs with parseable vitest JSON output) for flaky detection to be conclusive. If fewer than 2 valid runs remain, report that flaky detection was inconclusive due to too many errored/timed-out runs.
+
+A test is **flaky** if it passes in some valid runs and fails in others.
 
 For each flaky test found:
 1. Record: test file, test name, pass count, fail count, failure messages
@@ -139,7 +147,7 @@ Find files with < 50% line coverage. For each:
 Compare against `main` branch to find recently changed files:
 
 ```bash
-git diff --name-only main...HEAD -- src/
+git diff --name-only origin/main...HEAD -- src/
 ```
 
 For each changed source file, check if: