fix(native): purge stale rows when WASM-only files are deleted#1122

Open
carlos-alm wants to merge 6 commits into main from fix/1073-wasm-stale-purge

@carlos-alm
Contributor

Summary

  • Add computeWasmOnlyStaleFiles helper that finds files present in nodes/file_hashes but absent from disk, scoped to extensions installed for WASM and outside NATIVE_SUPPORTED_EXTENSIONS so the Rust orchestrator still owns its own purge path.
  • Wire it into backfillNativeDroppedFiles: after the better-sqlite3 handoff, call purgeFilesData for the stale set. Early-return now requires both missingAbs and staleRel to be empty.
  • 8 unit tests covering: WASM-only detection, native-supported skipping, grammar-missing skipping, single-table presence (nodes-only / hashes-only), dedup across both tables, case normalization, and no-op when DB and disk agree.

Context

Closes #1073.

PR #1070 made Rust's detect_removed_files skip files outside is_supported_extension to stop purging-and-reinserting WASM-only files on every incremental rebuild (#1066). Side effect: when a WASM-only file is genuinely deleted from disk, no path removes its rows — Rust skips it, and JS backfillNativeDroppedFiles only inserts.

The fix is scoped narrowly: the new helper restricts purges to extensions where the WASM grammar is installed and Rust has no extractor, so the natively-handled delete path is untouched. Every currently-registered language now has a Rust extractor, so the bug surface today is the empty set; the fix protects against any future WASM-only language entering the registry before its native port lands.
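
The scoping rule described above can be sketched as a small predicate. This is a minimal illustration, not the PR's code: `extOf` and `isWasmOnlyExtension` are hypothetical names, the extension sets are passed in as plain parameters, and `.gleam` is used only as an example of an extension that might have a WASM grammar but no native extractor.

```typescript
// Hypothetical helper: lower-cased extension of the last path segment ("" if none).
function extOf(file: string): string {
  const lastSep = Math.max(file.lastIndexOf("/"), file.lastIndexOf("\\"));
  const base = file.slice(lastSep + 1);
  const dot = base.lastIndexOf(".");
  // dot <= 0 covers both "no dot" and dotfiles such as ".gitignore".
  return dot <= 0 ? "" : base.slice(dot).toLowerCase();
}

// Hypothetical predicate: true only when a WASM grammar is installed for the
// extension AND the native engine has no extractor for it.
function isWasmOnlyExtension(
  file: string,
  installedExts: ReadonlySet<string>,
  nativeSupported: ReadonlySet<string>,
): boolean {
  const ext = extOf(file);
  return ext !== "" && installedExts.has(ext) && !nativeSupported.has(ext);
}

// Example sets (assumed for illustration only).
const installedExample = new Set([".gleam", ".ts"]);
const nativeExample = new Set([".ts"]);
console.log(isWasmOnlyExtension("src/a.gleam", installedExample, nativeExample));
console.log(isWasmOnlyExtension("src/a.ts", installedExample, nativeExample));
```

Passing both sets as parameters, rather than reading module-level constants, is what lets a test exercise the WASM-only branch even when every registered language has a native extractor.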

Test plan

  • 8/8 new unit tests pass (npx vitest run tests/builder/wasm-only-stale-files.test.ts)
  • All 34 builder tests pass
  • Build / incremental-parity integration tests pass (35/35)
  • tsc --noEmit clean
  • biome check clean on changed files

Closes #1073

When a file in a WASM-only language is deleted from disk, neither engine
cleans up its DB rows: Rust's detect_removed_files (#1070) skips files
outside is_supported_extension, and the JS-side backfill only inserts.

Add computeWasmOnlyStaleFiles to detect (existingNodes ∪ existingHashes)
not on disk, filtered to extensions installed for WASM but absent from
NATIVE_SUPPORTED_EXTENSIONS so Rust still owns its own purge path.
Wire it into backfillNativeDroppedFiles to call purgeFilesData after the
better-sqlite3 handoff. Unit tests cover the helper.

docs check acknowledged — internal pipeline fix, no doc-visible surface.
@claude

claude Bot commented May 14, 2026

Claude finished @carlos-alm's task


Comprehensive PR Review

Root Cause Analysis

Problem Being Solved: When WASM-only files are deleted from disk, their rows persist in the database indefinitely. This happens because:

  1. PR #1070 (fix(native): skip unsupported-extension files in detect_removed_files) made Rust's detect_removed_files skip files outside NATIVE_SUPPORTED_EXTENSIONS to prevent purge-and-reinsert cycles (#1066, publish gate: native incremental rebuilds regress to ~2s, JS fast-skip not firing in CI)
  2. The JS backfillNativeDroppedFiles function only inserts rows, never removes them
  3. Result: deleted WASM-only files leave stale data until the next full rebuild

Why This Approach: The fix is correctly scoped - it only handles purges for extensions with installed WASM grammars that are NOT in NATIVE_SUPPORTED_EXTENSIONS. This preserves the Rust orchestrator's ownership of natively-supported file deletion while closing the gap for WASM-only files.

Risk Assessment: Low risk. The approach is conservative and well-isolated. Currently, all registered languages have Rust extractors (per PR description), so the bug surface is empty today. The fix protects against future WASM-only languages entering the registry before their native ports.

Code Quality Assessment

⭐ Excellent Implementation Quality

Strengths:

  1. Pure function design: computeWasmOnlyStaleFiles is exported, pure, and fully unit-testable
  2. Comprehensive test coverage: 8 focused unit tests covering edge cases, deduplication, case normalization, and both table states
  3. Clear scope boundaries: Properly restricts purges to extensions that are in installedExts but not in nativeSupported, to avoid conflicts with Rust deletion paths
  4. Defensive programming: Handles both nodes and file_hashes tables, deduplicates across both, normalizes extension case
  5. Performance conscious: Early returns when both missingAbs and staleRel are empty
  6. Excellent documentation: Function comments clearly explain the filtering logic and interaction with Rust paths

Technical Excellence:

  • Lines 814-829: Clean, readable implementation with appropriate abstractions
  • Lines 894-901: Proper integration into existing backfillNativeDroppedFiles flow
  • Lines 918-931: Good error handling and diagnostic logging for unexpected edge cases
  • Tests cover all meaningful permutations without being excessive

Configuration & Breaking Changes

  • No configuration changes - no linting rules relaxed, no test thresholds lowered
  • No breaking changes - purely additive internal fix
  • Clean integration - uses existing APIs (purgeFilesData, classifyNativeDrops)

Test Integrity

  • Strong test design - mocks extension sets as parameters so tests remain valid even when all languages become native-supported
  • Edge case coverage - single-table presence, deduplication, case normalization
  • Clear test names - each test documents its specific scenario

Backlog Compliance

  • Zero-dep: ✅ No new dependencies, builds on existing SQLite/tree-sitter infrastructure
  • Foundation-aligned: ✅ Fully compliant - fixes data consistency without changing core architecture
  • Problem-fit: 4/5 - Directly prevents data inconsistency that could confuse AI agents about file existence and dependencies
  • Breaking: No - purely internal fix with no API changes
  • Tier: 1 (zero-dep ✅ + foundation-aligned ✅)

Minor Observations

Positive Notes:

  • Line 925-929: Good defensive logging for the unexpected case where native-supported files leak through the filter
  • Test line 92-105: Thoughtful case normalization test for R's .R/.r convention
  • Clean separation between computation (computeWasmOnlyStaleFiles) and integration (backfillNativeDroppedFiles)

No Issues Found:

  • Security: Clean - no command injection, SQL injection, or unsafe file operations
  • Performance: Efficient - minimal DB queries, early returns, proper set operations
  • Architecture: Well-integrated - follows existing patterns and conventions

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Stale database rows persist when WASM-only files are deleted from disk due to Rust's purge filter skipping non-native extensions
**Why This Approach**: Scoped narrowly to WASM-only extensions, preserves Rust ownership of native deletion paths
**Risk Assessment**: Low - conservative scope, defensive implementation, comprehensive tests

## Backlog Compliance
- Zero-dep: ✅ - builds on existing infrastructure  
- Foundation-aligned: ✅ - pure data consistency fix
- Problem-fit: 4/5 - prevents data inconsistency affecting AI agent accuracy
- Breaking: No - internal implementation only
- Tier: 1

## Critical Concerns
None identified.

## Final Recommendation
- Rating: ⭐⭐⭐⭐⭐ (5/5)
- Action: **APPROVE**
- Reasoning: Exemplary implementation with excellent testing, clear scope, and conservative approach. Addresses a real consistency issue that could mislead AI agents. The pure function design and comprehensive test coverage make this maintainable and reliable.

This PR represents high-quality defensive programming that directly supports the tool's core mission of providing accurate dependency information to AI agents.


@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR fixes a gap where WASM-only files deleted from disk left stale nodes/file_hashes rows indefinitely, because Rust's detect_removed_files skips extensions outside its supported set (PR #1070) and the JS backfill only inserts.

  • Adds computeWasmOnlyStaleFiles to identify DB rows for WASM-only extensions (installed grammar, no native extractor) that are absent from disk, and exports it for unit testing.
  • Wires the stale-file set into detectDroppedLanguageGap and backfillNativeDroppedFiles, where purgeFilesData removes the rows; the early-return and trigger conditions are updated to consider staleRel alongside missingAbs.
  • Adds a groupByExtension helper to build the per-extension log summary without going through the now-unnecessary classifyNativeDrops pass, and includes 9 unit tests covering dedup, case normalisation, single-table presence, and the Windows back-slash regression.

Confidence Score: 5/5

The change is narrowly scoped to WASM-only extensions and does not touch the native purge path; the only anomaly is an unreachable duplicate return statement left from refactoring.

The core logic is correct and well-tested. The back-slash normalisation fix is properly applied (rawRel pushed into stale while seen deduplicates on the normalised form). The only issue is a duplicate dead-code guard that has no behavioural impact.

The duplicate early-return in backfillNativeDroppedFiles (pipeline.ts ~line 1016) is harmless dead code but worth cleaning up before merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/domain/graph/builder/pipeline.ts | Adds computeWasmOnlyStaleFiles, groupByExtension, and wires them into detectDroppedLanguageGap/backfillNativeDroppedFiles; contains one duplicate dead-code return statement left from refactoring. |
| tests/builder/wasm-only-stale-files.test.ts | New test file with 9 unit tests covering WASM-only stale detection, native-skip, grammar-missing skip, single-table presence, dedup, case normalisation, no-op, and back-slash regression cases. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[tryNativeOrchestrator] --> B[detectDroppedLanguageGap]
    B --> C[Collect existingNodes & existingHashes from DB]
    C --> D[Build expected set from disk files]
    D --> E[Compute missingRel / missingAbs]
    D --> F[computeWasmOnlyStaleFiles]
    E --> G{missingAbs OR staleRel non-empty?}
    F --> G
    G -- No --> H[Return]
    G -- Yes --> I[backfillNativeDroppedFiles]
    I --> J{staleRel non-empty?}
    J -- Yes --> K[purgeFilesData DELETE rows]
    J -- No --> L{missingAbs non-empty?}
    K --> L
    L -- No --> M[Return]
    L -- Yes --> N[WASM backfill + batchInsertNodes]
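
The decision nodes in the flowchart can be sketched as a small planning function. This is only an illustration of the branching order shown in the diagram, under the assumption that purge and backfill are independent steps; `planBackfill` and the `Step` type are hypothetical names and do not appear in the PR.

```typescript
// Hypothetical step labels standing in for the real side effects
// (purgeFilesData for "purge", the WASM backfill + insert for "backfill").
type Step = "purge" | "backfill";

// Mirrors the flowchart: return early only when BOTH lists are empty,
// purge stale rows first, then backfill missing files if there are any.
function planBackfill(
  missingAbs: readonly string[],
  staleRel: readonly string[],
): Step[] {
  const steps: Step[] = [];
  if (missingAbs.length === 0 && staleRel.length === 0) return steps;
  if (staleRel.length > 0) steps.push("purge");
  if (missingAbs.length > 0) steps.push("backfill");
  return steps;
}

console.log(planBackfill([], ["src/a.gleam"]));
```

The point of the combined guard is that a run with only stale rows (nothing to backfill) must still reach the purge step, which an early return on `missingAbs` alone would skip.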

Reviews (7): Last reviewed commit: "fix(stale-purge): preserve raw path so D..."

Comment thread src/domain/graph/builder/pipeline.ts Outdated
Comment on lines +919 to +930
const { byReason: staleByReason, totals: staleTotals } = classifyNativeDrops(staleRel);
info(
  `Detected ${staleRel.length} deleted WASM-only file(s) the native orchestrator skipped; purging stale rows: ${formatDropExtensionSummary(staleByReason['unsupported-by-native'])}`,
);
// staleRel is restricted above to extensions outside NATIVE_SUPPORTED_EXTENSIONS,
// so the native-extractor-failure bucket should always be empty here.
if (staleTotals['native-extractor-failure'] > 0) {
  debug(
    `backfillNativeDroppedFiles: stale-purge classified ${staleTotals['native-extractor-failure']} native-supported file(s) — unexpected; inspect the filter`,
  );
}
purgeFilesData(dbConn, staleRel);
Contributor

P2 classifyNativeDrops call is redundant for the stale-purge path

computeWasmOnlyStaleFiles already guarantees every path in staleRel has an extension outside NATIVE_SUPPORTED_EXTENSIONS, so classifyNativeDrops will always put 100 % of staleRel into byReason['unsupported-by-native'] and staleTotals['native-extractor-failure'] can never be > 0. The defensive debug log is harmless, but the full classification pass (including building the Map<string, string[]> buckets) is unused except to feed formatDropExtensionSummary. Consider building the extension summary directly from staleRel — or at minimum remove the staleTotals['native-extractor-failure'] guard block, since its impossible-condition silently contradicts the invariant comment above it.

Contributor Author

Fixed in 8af09cc. Replaced the classifyNativeDrops call with a dedicated groupByExtension helper that builds the Map<string, string[]> summary directly from staleRel. Also dropped the unreachable native-extractor-failure guard block — the comment above already documents the invariant, no need for impossible-condition debug logging contradicting it.
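
A helper of the kind described in this reply might look like the sketch below. It is an assumption-laden illustration rather than the code from 8af09cc: the name `groupByExtensionSketch`, the `"(none)"` bucket for extension-less paths, and the lower-casing of the key are all choices made here for the example.

```typescript
// Hypothetical sketch: bucket relative paths by lower-cased extension so a
// per-extension summary can be logged without a full classification pass.
function groupByExtensionSketch(files: readonly string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const file of files) {
    const lastSep = Math.max(file.lastIndexOf("/"), file.lastIndexOf("\\"));
    const dot = file.lastIndexOf(".");
    // A dot only counts as an extension marker if it sits in the last segment.
    const ext = dot > lastSep ? file.slice(dot).toLowerCase() : "(none)";
    const bucket = groups.get(ext);
    if (bucket) bucket.push(file);
    else groups.set(ext, [file]);
  }
  return groups;
}

const grouped = groupByExtensionSketch(["src/a.gleam", "src/b.gleam", "lib/c.R"]);
for (const [ext, paths] of grouped) console.log(`${ext}: ${paths.length}`);
```

Because every input path is already known to be outside the native-supported set, a single pass that only groups is enough; there is no second bucket to populate.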

Comment thread src/domain/graph/builder/pipeline.ts Outdated
Comment on lines +818 to +819
const consider = (rel: string): void => {
  if (expected.has(rel) || seen.has(rel)) return;
Contributor

P2 expected.has(rel) uses raw DB paths against normalised on-disk paths

expected is built with normalizePath(path.relative(ctx.rootDir, f)), while existingNodes and existingHashes are raw file column values from the DB. On Windows, a DB row stored with back-slashes (src\a.gleam) would fail the expected.has(rel) check even when the file still exists on disk, causing a false-positive stale detection and an unintended purge of live rows. The same path-normalisation mismatch already exists in the missingRel loop above (where it causes under-detection rather than over-deletion), so this is a pre-existing assumption — but the consequence is worse on the new purge path. Worth adding a normalizePath call on rel inside consider if Windows support is ever on the table, or a comment documenting the invariant that DB paths are always forward-slash-normalised.

Contributor Author

Fixed in 8af09cc. consider() now applies rawRel.replace(/\\/g, '/') before comparing against expected (which is already forward-slash-normalised), so a stale DB row carrying back-slashes — e.g. one migrated from a Windows-built DB — no longer triggers a false-positive purge of a live file. I used an explicit regex replace rather than normalizePath because normalizePath only touches path.sep, so it would be a no-op on POSIX against a Windows-flavoured string. Added a regression test in wasm-only-stale-files.test.ts for the back-slash case.
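
The difference between the two normalisation strategies mentioned in this reply can be shown in isolation. In the sketch below, `normalizeBySep` is a hypothetical stand-in for a normaliser that only rewrites the platform separator (its `sep` parameter plays the role of `path.sep`); it is not the project's `normalizePath`, whose exact behaviour is taken from the reply above rather than verified here.

```typescript
// Hypothetical stand-in for a separator-based normaliser. On POSIX the
// platform separator is "/", so a back-slash path passes through unchanged.
function normalizeBySep(p: string, sep: string): string {
  return p.split(sep).join("/");
}

// Explicit replacement: rewrites back-slashes regardless of host platform.
function toForwardSlashes(p: string): string {
  return p.replace(/\\/g, "/");
}

const windowsFlavoured = "src\\a.gleam";
console.log(normalizeBySep(windowsFlavoured, "/"));  // POSIX-style separator
console.log(normalizeBySep(windowsFlavoured, "\\")); // Windows-style separator
console.log(toForwardSlashes(windowsFlavoured));
```

With a POSIX separator the first call leaves the string untouched, which is why a Windows-flavoured DB row would still miss a forward-slash `expected` set unless the back-slashes are replaced explicitly.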

@github-actions
Contributor

github-actions Bot commented May 14, 2026

Codegraph Impact Analysis

12 functions changed, 9 callers affected across 5 files

  • tryNativeOrchestrator in src/domain/graph/builder/pipeline.ts:606 (5 transitive callers)
  • DroppedLanguageGap.staleRel in src/domain/graph/builder/pipeline.ts:804 (0 transitive callers)
  • WasmOnlyStaleFilesInput.existingNodes in src/domain/graph/builder/pipeline.ts:814 (0 transitive callers)
  • WasmOnlyStaleFilesInput.existingHashes in src/domain/graph/builder/pipeline.ts:816 (0 transitive callers)
  • WasmOnlyStaleFilesInput.expected in src/domain/graph/builder/pipeline.ts:818 (0 transitive callers)
  • WasmOnlyStaleFilesInput.installedExts in src/domain/graph/builder/pipeline.ts:820 (0 transitive callers)
  • WasmOnlyStaleFilesInput.nativeSupported in src/domain/graph/builder/pipeline.ts:822 (0 transitive callers)
  • computeWasmOnlyStaleFiles in src/domain/graph/builder/pipeline.ts:851 (3 transitive callers)
  • consider in src/domain/graph/builder/pipeline.ts:855 (3 transitive callers)
  • groupByExtension in src/domain/graph/builder/pipeline.ts:882 (3 transitive callers)
  • detectDroppedLanguageGap in src/domain/graph/builder/pipeline.ts:923 (4 transitive callers)
  • backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:981 (4 transitive callers)

…tching (#1122)

Address Greptile review on the WASM-only stale purge:

- Replace classifyNativeDrops + unreachable native-extractor-failure guard
  in the stale-purge branch with a direct groupByExtension helper. The
  paths returned by computeWasmOnlyStaleFiles are guaranteed to be outside
  NATIVE_SUPPORTED_EXTENSIONS, so the classification pass was always a
  no-op for the unreachable bucket.
- Normalise back-slashes to forward-slashes inside consider() before
  comparing against the expected set. Defends against a stale DB row
  (e.g. migrated from a Windows-built DB) being treated as missing and
  purged even when the file still exists on disk. Replace `\\` explicitly
  rather than calling normalizePath so the defence works on POSIX too.
- Add a regression test for the back-slash path-matching case.
@carlos-alm
Contributor Author

@greptileai

Comment thread src/domain/graph/builder/pipeline.ts Outdated
Comment on lines +833 to +834
seen.add(rel);
stale.push(rel);
Contributor

P1 stale.push(rel) stores the forward-slash-normalised path, but purgeFilesData executes DELETE FROM nodes WHERE file = ? with that value. If the DB row was stored with back-slashes (the Windows-migration case the comment describes), the SQL predicate won't match and the DELETE hits 0 rows — the stale row silently persists, which is exactly the regression this PR is trying to fix.

The dedup key should remain the normalised form (so src/a.gleam and src\a.gleam are still treated as one file), but the value pushed into stale must be the original rawRel so the later WHERE file = ? matches the actual stored path. The existing back-slash test only covers the "file still on disk" path; adding a counterpart with an empty expected set would catch this.

Suggested change
- seen.add(rel);
- stale.push(rel);
+ seen.add(rel);
+ stale.push(rawRel);

Contributor Author

Fixed in 4c22410. consider() now pushes rawRel (not the normalised rel) into the stale list, so purgeFilesData's DELETE FROM nodes WHERE file = ? is byte-identical to the stored row. The dedup seen set still uses the normalised form so a path written once with \ and once with / is treated as one entry. Added the counterpart regression test (preserves back-slash form so DELETE matches the actual DB row) covering a stale back-slash row with an empty expected set.
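
The behaviour described in this reply (normalised key for matching and dedup, raw stored value for the delete) can be sketched end to end. This is a reconstruction from the thread, not the merged code: the field names follow the `WasmOnlyStaleFilesInput` members listed in the impact analysis, while the function name, the use of arrays for the DB columns, and the extension parsing are assumptions made for the example.

```typescript
// Field names follow the thread; the concrete types are assumed here.
interface StaleInputSketch {
  existingNodes: readonly string[];   // raw `file` values from the nodes table
  existingHashes: readonly string[];  // raw `file` values from file_hashes
  expected: ReadonlySet<string>;      // forward-slash relative paths on disk
  installedExts: ReadonlySet<string>;
  nativeSupported: ReadonlySet<string>;
}

function computeStaleSketch(input: StaleInputSketch): string[] {
  const seen = new Set<string>();
  const stale: string[] = [];
  const consider = (rawRel: string): void => {
    // Normalised form is used only for comparison and dedup.
    const rel = rawRel.replace(/\\/g, "/");
    if (input.expected.has(rel) || seen.has(rel)) return;
    const dot = rel.lastIndexOf(".");
    if (dot <= rel.lastIndexOf("/")) return; // no extension in last segment
    const ext = rel.slice(dot).toLowerCase();
    if (!input.installedExts.has(ext) || input.nativeSupported.has(ext)) return;
    seen.add(rel);
    // Push the raw value so a later `WHERE file = ?` matches the stored row.
    stale.push(rawRel);
  };
  for (const f of input.existingNodes) consider(f);
  for (const f of input.existingHashes) consider(f);
  return stale;
}

console.log(
  computeStaleSketch({
    existingNodes: ["src\\a.gleam"],
    existingHashes: [],
    expected: new Set<string>(),
    installedExts: new Set([".gleam"]),
    nativeSupported: new Set<string>(),
  }),
);
```

One property of this sketch worth noting: when the same file appears under both spellings, only the first-seen raw spelling is returned, since the second is filtered by the normalised `seen` key.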

carlos-alm and others added 3 commits May 15, 2026 00:08
Integrate main's detectDroppedLanguageGap/backfillNativeDroppedFiles
refactor (#1123) with PR's stale-row purge (#1073). Combined gap
detection now returns missingRel/missingAbs (gap repair) and staleRel
(WASM-only deletes), wired through both the earlyExit and dirty paths.

Greptile P1: pushing the forward-slash-normalised path into the stale
list meant 'DELETE FROM nodes WHERE file = ?' missed rows that had been
written with back-slashes (e.g. legacy Windows-migrated DBs) — the exact
regression #1073 is trying to fix.

The dedup key keeps the normalised form so a file written once with '\'
and once with '/' is still treated as one entry, but the value the SQL
sees is now byte-identical to what's stored. Added a regression test
covering a back-slash row with an empty 'expected' set.
@carlos-alm
Contributor Author

@greptileai

Development

Successfully merging this pull request may close these issues.

follow-up: clean up DB rows when WASM-only files are deleted from disk

1 participant