Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
d52a3a4
feat: add maintenance skills — deps-audit, bench-check, test-health, …
carlos-alm Mar 21, 2026
b187fe1
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 21, 2026
a562b52
fix(bench-check): capture stderr, guard division-by-zero, commit base…
carlos-alm Mar 21, 2026
4fc994d
fix(deps-audit): run npm ci after revert, document tokenizer skip reason
carlos-alm Mar 21, 2026
89aef6b
fix(housekeep): guard Phase 5 in source repo, fix stale-worktree crit…
carlos-alm Mar 21, 2026
3e892d1
Merge remote-tracking branch 'origin/feat/maintenance-skills' into fe…
carlos-alm Mar 21, 2026
ce5d811
fix: address Round 3 Greptile review feedback
carlos-alm Mar 22, 2026
01b5110
fix: move deps-audit stash to Phase 0, before npm commands modify man…
carlos-alm Mar 22, 2026
3b0e293
fix: capture flaky-detection loop output to per-run files for comparison
carlos-alm Mar 22, 2026
52de495
fix: always require confirmation for stale worktree removal
carlos-alm Mar 22, 2026
8be5cec
fix: use parsed threshold in baseline.json, guard --compare-only on f…
carlos-alm Mar 22, 2026
0691ffc
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 22, 2026
87d9213
fix(deps-audit): track stash creation to avoid operating on wrong entry
carlos-alm Mar 22, 2026
65d9836
fix(test-health): use mktemp for flaky-run directory to avoid concurr…
carlos-alm Mar 22, 2026
eef2c03
fix(bench-check): add save-baseline verdict path, fix em-dash, use ex…
carlos-alm Mar 22, 2026
19b14e9
docs(roadmap): update Phase 5 TypeScript migration with accurate prog…
carlos-alm Mar 22, 2026
5bda6ba
fix: deps-audit success path should keep npm changes, not revert (#565)
carlos-alm Mar 22, 2026
bd0ba1a
fix: bench-check use git add + diff --cached to detect new files (#565)
carlos-alm Mar 22, 2026
7b91e3c
fix: housekeep require confirmation before branch deletion (#565)
carlos-alm Mar 22, 2026
7fcdd93
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 23, 2026
5462d32
fix: scope git diff --cached to bench-check files only (#565)
carlos-alm Mar 23, 2026
457e6b9
fix: use json-summary reporter to match coverage-summary.json output …
carlos-alm Mar 23, 2026
852003d
fix: capture stash ref by name to avoid position-based targeting (#565)
carlos-alm Mar 23, 2026
eea2954
fix: remove unreachable Phase 5 subphases since source-repo guard alw…
carlos-alm Mar 23, 2026
baf6797
Merge remote-tracking branch 'origin/feat/maintenance-skills' into fi…
carlos-alm Mar 23, 2026
9b4869c
fix: use dynamic threshold variable in bench-check Phase 6 report tem…
carlos-alm Mar 23, 2026
8d92c99
fix: address open review items in maintenance skills (#565)
carlos-alm Mar 23, 2026
9ad37ea
fix: add COVERAGE_ONLY guards to Phase 2 and Phase 4 in test-health
carlos-alm Mar 23, 2026
30ab30e
fix: add regression skip guard to bench-check Phase 5, expand deps-au…
carlos-alm Mar 23, 2026
a8631d2
fix: add empty-string guard for stat size check in housekeep (#565)
carlos-alm Mar 23, 2026
8fd7430
fix: add BASELINE SAVED verdict path and clarify if/else-if in bench-…
carlos-alm Mar 23, 2026
23f2f76
docs(roadmap): mark Phase 4 complete, update Phase 5 progress (5 of 7)
carlos-alm Mar 23, 2026
2616c78
docs(roadmap): correct Phase 5 progress — 5.3/5.4/5.5 still in progress
carlos-alm Mar 23, 2026
cdcc086
fix: use --no-rebase flag on git pull in housekeep Phase 3b (#565)
carlos-alm Mar 23, 2026
ac1e0b5
fix: handle non-timeout vitest crash exit codes in test-health flaky …
carlos-alm Mar 23, 2026
5434550
fix: restore pre-existing changes on deps-audit success path (#565)
carlos-alm Mar 23, 2026
316105c
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 23, 2026
c7115c2
fix(test-health): use jq to safely JSON-escape stderr in flaky detect…
carlos-alm Mar 23, 2026
740299f
fix(housekeep): enforce untracked-only dirt removal and safe lock del…
carlos-alm Mar 23, 2026
a762236
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 23, 2026
ed12127
fix: correct stale Phase 5 commit rule in bench-check skill (#565)
carlos-alm Mar 23, 2026
3d52aaa
fix: distinguish gitignored vs untracked files in housekeep skill (#565)
carlos-alm Mar 23, 2026
912109f
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 23, 2026
a94ddf3
fix: replace STASH_CREATED exit code check with STASH_REF presence ch…
carlos-alm Mar 23, 2026
ad80d18
Merge remote-tracking branch 'origin/main' into feat/maintenance-skills
carlos-alm Mar 23, 2026
6d4de9f
docs(roadmap): mark Phase 5.3 Leaf Module Migration complete, update …
carlos-alm Mar 23, 2026
34c50b9
fix(bench-check): guard against empty baseline when all benchmarks fa…
carlos-alm Mar 23, 2026
8b636ae
fix(deps-audit): instruct npm install after pop conflicts (#565)
carlos-alm Mar 23, 2026
8c467d3
fix(test-health): use origin/main for coverage gap diff (#565)
carlos-alm Mar 23, 2026
ec1fd56
fix(skills): add BSD stat stderr guard and npm install after clean st…
carlos-alm Mar 23, 2026
c9ac370
fix(skill): add ABORTED report template to bench-check Phase 6
carlos-alm Mar 23, 2026
a2e8be0
fix(skill): respect DRY_RUN in housekeep lock file removal
carlos-alm Mar 23, 2026
7a1c86a
fix(skill): re-run tests after clean stash pop in deps-audit
carlos-alm Mar 23, 2026
55c5a22
fix(skill): skip directory entries in housekeep gitignored file size …
carlos-alm Mar 24, 2026
0dc2605
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 24, 2026
aa3e1f4
fix(skill): add ABORTED skip guard to bench-check Phase 5 and narrow …
carlos-alm Mar 24, 2026
75350c7
fix(skill): add recovery options for deps-audit clean-pop test failure
carlos-alm Mar 24, 2026
7aab540
fix(skill): check ABORTED before SAVE_ONLY in bench-check Phase 6 (#565)
carlos-alm Mar 24, 2026
0b08a2b
fix(skill): reset manifests to HEAD before stash pop on failure path …
carlos-alm Mar 24, 2026
e4f8c3d
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 24, 2026
933c0d3
fix(skill): guard against missing lsof in housekeep lock file removal…
carlos-alm Mar 24, 2026
005f806
fix(skill): replace unachievable recovery option in deps-audit (#565)
carlos-alm Mar 24, 2026
2e6d37d
fix(skill): exclude error/timeout runs from test-health flaky analysi…
carlos-alm Mar 24, 2026
854f248
fix(skill): embed lsof guard in housekeep lock removal snippet (#565)
carlos-alm Mar 24, 2026
0db47cb
Merge branch 'main' into feat/maintenance-skills
carlos-alm Mar 24, 2026
dcbe349
fix: address Greptile review feedback (#565)
carlos-alm Mar 24, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 29 additions & 3 deletions .claude/skills/bench-check/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,13 @@ For each metric in the current run:

## Phase 4 — Verdict

### Pre-condition check
Before evaluating verdicts, verify that at least one benchmark produced valid numeric results.
If `metrics` is empty (all suites produced `"error"` or `"timeout"` records):
- Print: `BENCH-CHECK ABORTED — no valid benchmark results (all suites failed or timed out)`
- Do NOT proceed to Phase 5 — the baseline must not be overwritten with empty data
- Stop here and skip to Phase 6 to write the report with verdict `ABORTED`.
Comment on lines +156 to +161
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Phase 6 has no ABORTED report template, and Phase 7 summary is also missing the ABORTED case

The pre-condition check in Phase 4 correctly says "Stop here and skip to Phase 6 to write the report with verdict ABORTED." However, Phase 6 only defines two report branches:

  1. If SAVE_ONLY or first-run → shortened "BASELINE SAVED" report
  2. Otherwise (comparison was performed) → full "PASSED / FAILED" report

An agent following an ABORTED verdict would fall through to the "Otherwise" branch (there was a prior baseline, no SAVE_ONLY) and produce a report using the "PASSED / FAILED" template, complete with empty "Comparison vs Baseline" and "Regressions" sections. This is misleading — no comparison was performed because all benchmarks failed.

Similarly, Phase 7 line 269's summary only lists PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED, with no ABORTED option.

Add a dedicated ABORTED branch to Phase 6, for example:

**If the ABORTED pre-condition was triggered (no valid benchmark results):** write a minimal report:

# Benchmark Report — <date>

**Version:** X.Y.Z | **Git ref:** abc1234 | **Threshold:** $THRESHOLD%

## Verdict: ABORTED — no valid benchmark results

All benchmark suites failed or timed out. See Phase 1 error records for details.

## Raw Results
<!-- Error/timeout records from each suite -->

And update Phase 7's summary line to include | ABORTED (all suites failed).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added a dedicated ABORTED branch to Phase 6 with a minimal report template (version, git ref, verdict, raw error records). Also added ABORTED to the Phase 7 summary line.


Based on comparison results:

### No regressions found
Expand All @@ -169,7 +176,7 @@ Based on comparison results:
- Re-run individual benchmarks to confirm (not flaky)

### First run (no baseline)
- If `COMPARE_ONLY` is set: print a warning that no baseline exists and exit without saving
- If `COMPARE_ONLY` is set: print a warning that no baseline exists and **stop here — do not proceed to Phase 5 or Phase 6**. No baseline is saved and no report is written.
- Otherwise: print `BENCH-CHECK — initial baseline saved` and save current results as baseline
Comment on lines +178 to +180
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 --compare-only with no baseline: "exit" is ambiguous and Phase 6 reports a misleading verdict

Phase 4's first-run path says:

"If COMPARE_ONLY is set: print a warning that no baseline exists and exit without saving"

The word "exit" here is ambiguous — an agent may interpret it as "exit this verdict path and continue to Phase 5/6/7" rather than "terminate the entire skill." If Phase 6 is reached in this scenario, its branch condition reads:

"If SAVE_ONLY is set or no prior baseline existed (first run)"

A --compare-only run with no baseline satisfies the second condition (no prior baseline existed), so Phase 6 writes a report with "Verdict: BASELINE SAVED — no comparison performed" — but no baseline was actually saved! The verdict is factually wrong.

Disambiguate Phase 4 by making the early exit explicit:

### First run (no baseline)
- If `COMPARE_ONLY` is set: print a warning (`BENCH-CHECK: no baseline exists — nothing to compare against`) and **exit the skill without writing a report or committing anything**.
- Otherwise: print `BENCH-CHECK — initial baseline saved` and proceed to Phase 5.

And add a corresponding guard to Phase 6 so the "BASELINE SAVED" verdict path is explicitly excluded for COMPARE_ONLY.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added explicit 'stop here' early-exit wording in Phase 4's first-run path and a guard in Phase 6 that skips the BASELINE SAVED report when --compare-only was set with no baseline.


### Save-baseline with existing baseline (`--save-baseline`)
Expand All @@ -180,6 +187,7 @@ Based on comparison results:

**Skip this phase if `COMPARE_ONLY` is set.** Compare-only mode never writes or commits baselines.
**Skip this phase if regressions were detected in Phase 4.** The baseline is only updated on a clean run.
**Skip this phase if the ABORTED pre-condition was triggered in Phase 4.** The baseline must not be overwritten with empty data.

When saving (initial run, `--save-baseline`, or passed comparison):

Expand Down Expand Up @@ -211,8 +219,26 @@ git diff --cached --quiet -- generated/bench-check/baseline.json generated/bench

## Phase 6 — Report

**Skip this phase (write no report) if `COMPARE_ONLY` was set and no baseline existed, AND the ABORTED pre-condition was not triggered.** That early-exit case was already handled in Phase 4 — writing a "BASELINE SAVED" report here would be misleading since no baseline was saved. When ABORTED, always write the ABORTED report regardless of other flags.

Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.

**If the ABORTED pre-condition was triggered (no valid benchmark results):** write a minimal report — this check MUST come before the SAVE_ONLY/first-run check, because when all benchmarks fail on a `--save-baseline` or first run, SAVE_ONLY would also be true but no baseline was actually saved:

```markdown
# Benchmark Report — <date>

**Version:** X.Y.Z | **Git ref:** abc1234 | **Threshold:** $THRESHOLD%

## Verdict: ABORTED — no valid benchmark results

All benchmark suites failed or timed out. See Phase 1 error records for details.

## Raw Results

<!-- Error/timeout records from each suite -->
```

**If `SAVE_ONLY` is set or no prior baseline existed (first run):** write a shortened report — omit the "Comparison vs Baseline" and "Regressions" sections since no comparison was performed:

```markdown
Expand Down Expand Up @@ -257,7 +283,7 @@ Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.

1. If report was written, print its path
2. If baseline was updated, print confirmation
3. Print one-line summary: `PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED`
3. Print one-line summary: `PASSED (0 regressions) | FAILED (N regressions) | BASELINE SAVED | ABORTED (all suites failed)`

## Rules

Expand All @@ -266,6 +292,6 @@ Write a human-readable report to `generated/bench-check/BENCH_REPORT_<date>.md`.
- **Don't update baseline on regression** — the user must investigate first
- **Recall/quality metrics are inverted** — a decrease is a regression
- **Count metrics are informational** — graph growing isn't a regression
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 always commits it
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it on clean (non-regressed) runs where COMPARE_ONLY is not set
- **history.ndjson is append-only** — never truncate or rewrite it
Comment on lines +294 to +296
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Stale rule: "Phase 5 always commits it" is no longer accurate

After the multiple rounds of fixes, Phase 5 now has two explicit skip guards:

  • COMPARE_ONLY → skip
  • Regressions detected → skip

The rule at line 269 still reads: "The baseline file is committed to git — it's a shared reference point; Phase 5 always commits it." This is factually incorrect and could mislead a future editor into removing the skip guards, believing they contradict the rule.

Suggested change
- **Count metrics are informational** — graph growing isn't a regression
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 always commits it
- **history.ndjson is append-only** — never truncate or rewrite it
- **The baseline file is committed to git** — it's a shared reference point; Phase 5 commits it on clean (non-regressed) runs where `COMPARE_ONLY` is not set

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — updated the rule to: 'Phase 5 commits it on clean (non-regressed) runs where COMPARE_ONLY is not set.'

- Generated files go in `generated/bench-check/` — create the directory if needed
32 changes: 20 additions & 12 deletions .claude/skills/deps-audit/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,14 +22,10 @@ Audit the project's dependency tree for security vulnerabilities, outdated packa
5. **If `AUTO_FIX` is set:** Save the original manifests now, before any npm commands run, so pre-existing unstaged changes are preserved:
```bash
git stash push -m "deps-audit-backup" -- package.json package-lock.json
STASH_CREATED=$?
```
Track `STASH_CREATED` — when `0`, a stash entry was actually created; when `1`, the files had no changes so nothing was stashed.
If `STASH_CREATED` is `0`, immediately capture the stash ref for later use:
```bash
STASH_REF=$(git stash list --format='%gd %s' | grep 'deps-audit-backup' | head -1 | awk '{print $1}')
```
Use `$STASH_REF` (not `stash@{0}`) in all later stash drop/pop commands to avoid targeting the wrong entry if other stashes are pushed in the interim.
`STASH_REF` is non-empty if and only if a stash entry was actually created. Do **not** use `$?` — modern git (2.16+) returns 0 even when nothing was stashed.
Use `[ -n "$STASH_REF" ]` (stash created) / `[ -z "$STASH_REF" ]` (nothing stashed) for all branching. Use `$STASH_REF` (not `stash@{0}`) in all later stash drop/pop commands to avoid targeting the wrong entry.

## Phase 1 — Security Vulnerabilities

Expand Down Expand Up @@ -165,13 +161,25 @@ If `AUTO_FIX` was set:
Summarize all changes made:
1. List each package updated/fixed
2. Run `npm test` to verify nothing broke
3. If tests pass and `STASH_CREATED` is `0`: drop the saved state (`git stash drop $STASH_REF`) — the npm changes are good, no rollback needed
If tests pass and `STASH_CREATED` is `1`: no action needed — the npm changes are good and no stash entry exists to clean up
4. If tests fail and `STASH_CREATED` is `0`:
- Restore the saved manifests: `git stash pop $STASH_REF`
3. If tests pass and `STASH_REF` is non-empty: pop and merge the saved state (`git stash pop $STASH_REF`) — this restores any pre-existing uncommitted changes alongside the npm fix results. Note: the step 2 test run validated the npm changes alone; step 3b below is the authoritative test of the final merged state.
- If the pop applies cleanly:
a. Run `npm install` to re-sync `node_modules/` with the merged manifest.
b. Re-run `npm test` to confirm the merged state is consistent (this is the authoritative check — step 2 only validated the npm changes in isolation).
c. If tests still pass: confirm the project is consistent.
d. If tests now fail: warn the user — the pre-existing manifest changes conflict with the audit fixes.
Comment on lines +165 to +169
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 No recovery path when tests fail after clean-pop + npm install

When the stash pops cleanly and npm install re-syncs node_modules/, the skill re-runs npm test. If those tests fail (step d), the skill warns the user that pre-existing changes conflict with the audit fixes — but the stash has already been consumed by git stash pop. There is no way to return to either prior state:

  • The stash entry is gone, so the pre-existing manifest state cannot be automatically restored.
  • npm audit fix/npm update changes are already merged with the pre-existing changes in the working tree.

Without a recovery path, the user is left with a mixed, broken state and must manually reconstruct which changes to keep. Add explicit recovery guidance:

   d. If tests now fail: warn the user — the pre-existing manifest changes conflict with the audit fixes.
      Recovery options:
      - To undo **all** manifest changes (both audit fixes and pre-existing): `git checkout -- package.json package-lock.json && npm ci`
      - To keep only the audit fixes and discard pre-existing changes: manually edit `package.json`/`package-lock.json` to remove the pre-existing delta, then `npm ci`
      - To keep only the pre-existing changes and discard the audit fixes: re-run `/deps-audit` without `--fix`

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — step 169d now lists three explicit recovery options: undo all changes (git checkout + npm ci), keep only audit fixes (manual edit + npm ci), or keep only pre-existing changes (re-run without --fix).

Recovery options:
- To undo **all** manifest changes (both audit fixes and pre-existing): `git checkout -- package.json package-lock.json && npm ci`
- To keep only the audit fixes and discard pre-existing changes: manually edit `package.json`/`package-lock.json` to remove the pre-existing delta, then `npm ci`
- To keep only the pre-existing changes and discard the audit fixes: `git checkout HEAD -- package.json package-lock.json && npm ci` to revert manifests to their clean state, then manually re-apply only your pre-existing changes
- If the pop causes conflicts in `package.json`/`package-lock.json`: warn the user, leave conflict markers for manual resolution, and instruct: "After you resolve the conflicts, run `npm install` to re-sync `node_modules/` with the resolved lock file before committing."
- For conflicts in other files, resolve them by keeping both the npm fixes and the pre-existing changes.
If tests pass and `STASH_REF` is empty: no action needed — the npm changes are good and no stash entry exists to clean up
Comment on lines 163 to +176
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 node_modules/ not re-synced after a clean stash pop on the success path

When STASH_REF is non-empty and tests pass, the success path now correctly pops the saved state to restore pre-existing manifest changes. However, only the conflict path instructs the user to run npm install:

"After you resolve the conflicts, run npm install to re-sync node_modules/…"

There is no equivalent instruction for the clean-pop path. If the stashed pre-existing changes include a dependency addition or version pin (e.g., the user was mid-way through adding a new package), the merged package.json/package-lock.json now describes a dependency tree that differs from what npm applied during the audit — but node_modules/ still reflects the pre-pop state. This leaves the project in an inconsistent state that isn't caught by the tests already run (those ran against the pre-pop manifest).

Suggest adding a re-sync step after a clean pop:

3. If tests pass and `STASH_REF` is non-empty: pop and merge the saved state (`git stash pop $STASH_REF`).
   - If the pop applies cleanly: run `npm install` to re-sync `node_modules/` with the merged manifest, then confirm the project is consistent.
   - If the pop causes conflicts in `package.json`/`package-lock.json`: warn the user, leave conflict markers for manual resolution, and instruct: "After you resolve the conflicts, run `npm install` to re-sync `node_modules/` with the resolved lock file before committing."

The npm install on the clean-pop path is a no-op if the pre-existing changes didn't affect installed packages, so adding it is safe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the clean-pop success path now runs \Unknown command: install"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the clean-pop success path now runs npm install to re-sync node_modules with the merged manifest before confirming consistency. This is a no-op if the pre-existing changes did not affect installed packages.

4. If tests fail and `STASH_REF` is non-empty:
- Reset manifests to HEAD first (undoes npm changes): `git checkout HEAD -- package.json package-lock.json`
- Then re-apply the pre-existing changes cleanly: `git stash pop $STASH_REF`
- Restore `node_modules/` to match the reverted lock file: `npm ci`
- Report what failed
Comment on lines +177 to 181
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Failure-path git stash pop applied to npm-modified manifests will conflict

When tests fail and STASH_REF is non-empty, the stash was created from:

working tree = HEAD + pre-existing changes

By this point in step 4, the working tree contains:

HEAD + npm's modifications (audit fix / update)

git stash pop $STASH_REF applies the stash as a patch on top of the current state — it does not restore the working tree to a previous snapshot. Since the stash patch (pre-existing changes) and the current state (npm modifications) both modify package.json/package-lock.json from the same base (HEAD), git will almost certainly report conflicts, leaving the manifests with conflict markers and node_modules/ in an undefined state.

The correct two-step restore is:

  1. First, reset the manifest files to HEAD (undoing npm's changes):
    git checkout HEAD -- package.json package-lock.json
  2. Then re-apply the pre-existing changes cleanly from HEAD:
    git stash pop $STASH_REF

At step 2 the working tree matches HEAD, so the stash applies exactly as it was originally created — no conflicts.

Suggested replacement for step 4:

4. If tests fail and `STASH_REF` is non-empty:
   - Reset manifests to HEAD first (undoes npm changes):
     `git checkout HEAD -- package.json package-lock.json`
   - Then re-apply the pre-existing changes cleanly:
     `git stash pop $STASH_REF`
   - Restore `node_modules/` to match the reverted lock file: `npm ci`
   - Report what failed

Note that the success path (step 3) intentionally does a merge (pop on the npm-modified state) to preserve both sets of changes. The failure path semantics are different — we want a full restore — so the merge approach is wrong here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the failure path now resets manifests to HEAD first (git checkout HEAD -- package.json package-lock.json) before popping the stash. This ensures the stash applies cleanly against the same base it was created from, avoiding conflicts. Also updated the Rules section to match.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in prior commit 0b08a2b — the failure path already resets manifests to HEAD first (git checkout HEAD -- package.json package-lock.json) before popping the stash, preventing conflicts between npm-modified and stashed versions.

5. If tests fail and `STASH_CREATED` is `1`:
5. If tests fail and `STASH_REF` is empty:
- Discard manifest changes: `git checkout -- package.json package-lock.json`
- Restore `node_modules/` to match the reverted lock file: `npm ci`
- Report what failed
Expand All @@ -181,6 +189,6 @@ Summarize all changes made:
- **Never run `npm audit fix --force`** — breaking changes need human review
- **Never remove a dependency** without asking the user, even if it appears unused — flag it in the report instead
- **Always run tests** after any auto-fix changes
- **If `--fix` causes test failures**, restore manifests from the saved state (`git stash pop $STASH_REF` if `STASH_CREATED=0`, or `git checkout` if stash was a no-op) then run `npm ci` to resync `node_modules/`, and report the failure
- **If `--fix` causes test failures**, first reset manifests to HEAD (`git checkout HEAD -- package.json package-lock.json`), then re-apply pre-existing changes (`git stash pop $STASH_REF` if `STASH_REF` is non-empty, or no-op if nothing was stashed), then run `npm ci` to resync `node_modules/`, and report the failure
- Treat `optionalDependencies` separately — they're expected to fail on some platforms
- The report goes in `generated/deps-audit/` — create the directory if it doesn't exist
47 changes: 42 additions & 5 deletions .claude/skills/housekeep/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,28 +66,65 @@ For stale worktrees with merged branches:

## Phase 2 — Delete Dirt Files

Remove temporary and generated files that accumulate over time:
Remove temporary and generated files that accumulate over time. There are two distinct categories of dirt that require different discovery commands:

- **Gitignored dirt** (files matching `.gitignore` patterns — e.g. `coverage/`, `.DS_Store`, `*.log`, `.codegraph/graph.db-journal`): use `git clean -fdX --dry-run` to list them. `git ls-files --others --exclude-standard` silently omits these because `--exclude-standard` suppresses gitignored entries.
- **Untracked non-ignored files** (stray files not in `.gitignore` — e.g. `*.tmp.*`, `*.bak`, `*.orig`): use `git ls-files --others --exclude-standard` to list them.

Run both commands and union the results to get the full set of candidate dirt files.

### 2a. Known dirt patterns

Search for and remove:
Search for and remove files found by the two discovery commands above (never touch tracked files):
- `*.tmp.*`, `*.bak`, `*.orig` files in the repo (but NOT in `node_modules/`)
- `.DS_Store` files
- `*.log` files in repo root (not in `node_modules/`)
- Empty directories (except `.codegraph/`, `.claude/`, `node_modules/`)
- `coverage/` directory (regenerated by `npm run test:coverage`)
- `.codegraph/graph.db-journal` (SQLite WAL leftovers)
- Stale lock files: `.codegraph/*.lock` older than 1 hour

**Stale lock files** (`.codegraph/*.lock` older than 1 hour): Before removing, first check if `lsof` is available (`command -v lsof`). If `lsof` is **not installed** (common in Docker/CI minimal containers where it exits 127), **skip lock file removal entirely** and print a warning: `"lsof not available — skipping lock file cleanup (cannot verify no process holds the file)"`. When `lsof` IS available, use `lsof "$f"` to verify no process holds the file. If the file is held, **skip it** and warn — concurrent Claude Code sessions may hold legitimate long-lived locks.

```bash
if ! command -v lsof > /dev/null 2>&1; then
echo "lsof not available — skipping lock file cleanup (cannot verify no process holds the file)"
else
for f in .codegraph/*.lock; do
[ -f "$f" ] || continue
age=$(( $(date +%s) - $(stat --format='%Y' "$f" 2>/dev/null || stat -f '%m' "$f" 2>/dev/null) ))
[ -z "$age" ] && continue
if [ "$age" -gt 3600 ] && ! lsof "$f" > /dev/null 2>&1; then
if [ "$DRY_RUN" = "true" ]; then
echo "[DRY RUN] Would remove stale lock: $f"
else
echo "Removing stale lock: $f"
rm "$f"
fi
elif [ "$age" -gt 3600 ]; then
echo "Lock file $f is old but still held by a process — ask user before removing"
fi
done
fi
```
Comment on lines +88 to +108
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Lock file removal code does not respect DRY_RUN — violates the "DRY_RUN is sacred" rule

The provided code block in Phase 2a runs rm "$f" unconditionally — it does not check DRY_RUN. Phase 2c's guard ("If DRY_RUN: List all files that would be removed…") only covers the general dirt-file removal that follows; it does not retroactively protect against the rm already embedded in this Phase 2a snippet.

An agent following the skill sequentially will execute Phase 2a's code block (including rm) before reaching Phase 2c's DRY_RUN check. Running /housekeep --dry-run would still delete lock files, directly violating the Rules section: "--dry-run is sacred — it must NEVER modify anything, only report."

Add a DRY_RUN guard inside the loop:

for f in .codegraph/*.lock; do
  [ -f "$f" ] || continue
  age=$(( $(date +%s) - $(stat --format='%Y' "$f" 2>/dev/null || stat -f '%m' "$f" 2>/dev/null) ))
  [ -z "$age" ] && continue
  if [ "$age" -gt 3600 ] && ! lsof "$f" > /dev/null 2>&1; then
    if [ "$DRY_RUN" = "true" ]; then
      echo "[DRY RUN] Would remove stale lock: $f"
    else
      echo "Removing stale lock: $f"
      rm "$f"
    fi
  elif [ "$age" -gt 3600 ]; then
    echo "Lock file $f is old but still held by a process — ask user before removing"
  fi
done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added a DRY_RUN guard inside the lock file removal loop. When DRY_RUN=true, it prints what would be removed instead of calling rm.


### 2b. Large untracked files

Find untracked files larger than 1MB:
Find untracked files (both gitignored and non-ignored) larger than 1MB. Use both discovery commands and union the paths:
```bash
# Non-ignored untracked files
git ls-files --others --exclude-standard | while read f; do
size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
[ -z "$size" ] && continue
if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
done
# Gitignored files (strip the leading "Would remove " prefix from dry-run output)
git clean -fdX --dry-run | sed 's/^Would remove //' | while read f; do
# Skip directory entries — stat returns inode size, not content size
[ -d "$f" ] && continue
size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
[ -z "$size" ] && continue
if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes) [gitignored]"; fi
done
Comment on lines +121 to +127
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 stat on directory paths returns metadata size, not content size

git clean -fdX --dry-run emits directory entries with trailing slashes (e.g. Would remove coverage/). After the sed strip, stat coverage/ returns the directory inode size (typically 4096 bytes on Linux / 512 on macOS) — not the total size of the directory's contents. A coverage/ directory containing 200 MB of reports will therefore produce size=4096, which is far below the 1 MB threshold and will never be flagged.

Replace the raw stat call with du -sk for directories, or add a -d check to skip directory entries entirely (they'll be cleaned by Phase 2c regardless):

# Gitignored files (strip the leading "Would remove " prefix from dry-run output)
git clean -fdX --dry-run | sed 's/^Would remove //' | while read f; do
  # Skip directory entries — their size can't be reliably measured with stat
  [ -d "$f" ] && continue
  size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
  [ -z "$size" ] && continue
  if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes) [gitignored]"; fi
done

Alternatively, if flagging large gitignored directories matters (e.g. a 500 MB accidentally-untracked dataset), use du -sk "$f" | cut -f1 and compare against 1024 (KB).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added [ -d "$f" ] && continue guard before the stat call in the gitignored files loop. Directory entries are skipped since stat returns inode metadata size, not content size. Directories are cleaned by Phase 2c regardless.

```
Comment on lines +115 to +128
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 stat size comparison fails silently when both stat variants fail

size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi

If both stat invocations fail (e.g., the file is deleted between git ls-files and stat, or running in a container without either version of stat), $size is the empty string. The arithmetic comparison [ "" -gt 1048576 ] will cause bash to exit with:

bash: [: : integer expression expected

This terminates the entire while loop body, silently skipping all subsequent untracked files.

Add an empty-string guard before the comparison:

Suggested change
git ls-files --others --exclude-standard | while read f; do
size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi
done
```
size=$(stat --format='%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)
[ -z "$size" ] && continue
if [ "$size" -gt 1048576 ]; then echo "$f ($size bytes)"; fi

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added \ guard before the arithmetic comparison. If both stat variants (GNU \ and BSD ) fail, the loop now skips the file instead of erroring on the empty-string comparison.


Flag these for user review — they might be accidentally untracked binaries.
Expand Down Expand Up @@ -117,7 +154,7 @@ git log HEAD..origin/main --oneline
```

If main has new commits:
- If on main: `git pull origin main`
- If on main: `git pull --no-rebase origin main`
- If on a feature branch: inform the user how many commits behind main they are
- Suggest: `git merge origin/main` (never rebase — per project rules)

Expand Down
14 changes: 11 additions & 3 deletions .claude/skills/test-health/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,14 @@ Run the full test suite `FLAKY_RUNS` times and track per-test pass/fail:
RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
for i in $(seq 1 $FLAKY_RUNS); do
timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
if [ $? -eq 124 ]; then
exit_code=$?
if [ $exit_code -eq 124 ]; then
echo '{"timeout":true}' > "$RUN_DIR/run-$i.json"
elif [ $exit_code -ne 0 ] && [ $exit_code -ne 1 ]; then
# Use jq to safely JSON-escape stderr content (may contain quotes, newlines, backslashes)
stderr_content=$(cat "$RUN_DIR/run-$i.err")
jq -n --argjson code "$exit_code" --arg stderr "$stderr_content" \
'{"error":true,"exit_code":$code,"stderr":$stderr}' > "$RUN_DIR/run-$i.json"
fi
Comment on lines +43 to +52
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-timeout, non-zero vitest exit codes leave corrupt run files unhandled

The loop only guards for exit code 124 (timeout). When vitest exits non-zero for any other reason — config error, missing plugin, compilation failure — the guard is skipped and $RUN_DIR/run-$i.json may contain empty or truncated JSON. The parsing step at line 50 will then silently fail for that run, and a test that was simply unable to run may be misclassified as flaky (it "fails" in that run and "passes" in others where vitest succeeded).

Add an explicit fallback for other non-zero exit codes:

  timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
  run_exit=$?
  if [ $run_exit -eq 124 ]; then
    echo '{"timeout":true}' > "$RUN_DIR/run-$i.json"
  elif [ $run_exit -ne 0 ] && [ $run_exit -ne 1 ]; then
    # Exit 1 = test failures (valid JSON written); other codes = vitest crash
    echo '{"error":true,"code":'$run_exit'}' > "$RUN_DIR/run-$i.json"
  fi

(vitest exits 1 when tests fail but JSON is still valid; any other non-zero exit means the runner itself crashed.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the flaky-detection loop now captures the exit code and handles three cases: 124 (timeout marker), 0 or 1 (normal vitest pass/fail), and anything else (error marker with exit code and stderr). This prevents corrupt/empty run files from misclassifying crashing tests as flaky.

Comment on lines +47 to +52
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unescaped stderr content produces malformed JSON

The echo command embeds $(cat "$RUN_DIR/run-$i.err") raw into a JSON string. Any stderr output containing double-quotes, backslashes, or newlines (all common in compiler/tool crash messages) will produce syntactically invalid JSON. When the downstream parsing step at line 53 tries to parse run-$i.json, it will silently fail, and a vitest crash run will be indistinguishable from a run with no data — potentially misclassifying a suite that can't run at all as a suite where all tests pass (i.e., never-flaky).

Replace with a properly escaped construction using jq (available in the environment per allowed-tools: Bash):

  elif [ $exit_code -ne 0 ] && [ $exit_code -ne 1 ]; then
    # Exit 1 = test failures (valid JSON written); other codes = vitest crash
    stderr_content=$(cat "$RUN_DIR/run-$i.err")
    jq -n --argjson code "$exit_code" --arg stderr "$stderr_content" \
      '{"error":true,"exit_code":$code,"stderr":$stderr}' > "$RUN_DIR/run-$i.json"
  fi

jq --arg automatically JSON-escapes the string value, making the output valid regardless of stderr content.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — replaced raw echo with jq -n --arg for proper JSON escaping of stderr content. Quotes, backslashes, and newlines are now safely handled.

done
Comment on lines +41 to +53
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Timeout note is documented but never enforced

Line 66 states: "Each full suite run gets 3 minutes. If it times out, record partial results and continue." However, the loop invocation has no timeout wrapper:

for i in $(seq 1 $FLAKY_RUNS); do
  npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
done

If any vitest run hangs (e.g., a test opens a port and never closes it), the entire Phase 1 blocks indefinitely. With the default FLAKY_RUNS=5, a single hung run means the skill never completes.

This is exactly the same enforcement gap that was identified for bench-check's benchmark invocations. Apply the same fix here:

Suggested change
RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
for i in $(seq 1 $FLAKY_RUNS); do
npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
done
RUN_DIR=$(mktemp -d /tmp/test-health-XXXXXX)
for i in $(seq 1 $FLAKY_RUNS); do
timeout 180 npx vitest run --reporter=json > "$RUN_DIR/run-$i.json" 2>"$RUN_DIR/run-$i.err"
if [ $? -eq 124 ]; then
echo "run-$i timed out after 3 minutes — partial results recorded" >> "$RUN_DIR/run-$i.err"
fi
done

Exit code 124 is the POSIX standard value returned by timeout when the process is killed, consistent with the pattern used in bench-check's fix.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — flaky detection loop now uses timeout 180 wrapper. Exit code 124 writes a {"timeout":true} marker to the run output file so the analysis phase can account for it.

```
Expand All @@ -56,7 +62,9 @@ rm -rf "$RUN_DIR"

### Analysis

A test is **flaky** if it passes in some runs and fails in others.
Before analyzing, **exclude invalid runs**: skip any run file containing `{"timeout":true}` or `{"error":true}` — these runs produced no reliable per-test data and must not be counted as "all tests failed." Require a **minimum of 2 valid runs** (runs with parseable vitest JSON output) for flaky detection to be conclusive. If fewer than 2 valid runs remain, report that flaky detection was inconclusive due to too many errored/timed-out runs.

A test is **flaky** if it passes in some valid runs and fails in others.

For each flaky test found:
1. Record: test file, test name, pass count, fail count, failure messages
Expand Down Expand Up @@ -139,7 +147,7 @@ Find files with < 50% line coverage. For each:
Compare against `main` branch to find recently changed files:

```bash
git diff --name-only main...HEAD -- src/
git diff --name-only origin/main...HEAD -- src/
```

For each changed source file, check if:
Expand Down
Loading
Loading