fix(native): skip unsupported-extension files in detect_removed_files #1070
The Rust file_collector only collects files whose extension is recognized by `LanguageKind::from_extension` or listed in `SUPPORTED_EXTENSIONS`. The JS LANGUAGE_REGISTRY is broader: Clojure, Gleam, Julia, and F# files exist in `file_hashes` because the JS-side WASM backfill writes them (#1068), but Rust's narrower collector never sees them.

Before this fix, `detect_removed_files` flagged every such file as "removed" on every incremental rebuild because it was absent from `current` (the just-collected file list). The orchestrator's purge step then deleted its `nodes` and `file_hashes` rows, and the JS-side `backfillNativeDroppedFiles` (now running on every pass per #1069) re-parsed the file with WASM and re-inserted the rows. That cycle is the ~2s 1-file rebuild floor reported in #1066.

Add `is_supported_extension` to `file_collector` (exposing the same predicate used by `collect_files`) and apply it as a pre-filter in `detect_removed_files`. Files outside Rust's capability are now left alone: their absence from `current` is a capability boundary, not a deletion. Rows owned by the JS layer persist across incremental rebuilds and the backfill's missing-file early-return at `pipeline.ts:811` finally fires.

Refs #1066

Impact: 3 functions changed, 4 affected
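A minimal sketch of the shared predicate described above, assuming it is a plain OR over two extension lists. The two constant arrays, their contents, and the function signature are placeholders invented for illustration; the real `LanguageKind::from_extension` and `SUPPORTED_EXTENSIONS` live in the crate and are not reproduced here.

```rust
use std::path::Path;

/// Placeholder for the extensions `LanguageKind::from_extension` recognizes.
/// The values are illustrative assumptions, not the crate's real list.
const LANGUAGE_KIND_EXTENSIONS: &[&str] = &["rs", "ts", "js", "py"];

/// Placeholder for `SUPPORTED_EXTENSIONS`. Values are illustrative only.
const SUPPORTED_EXTENSIONS_PLACEHOLDER: &[&str] = &["go", "java"];

/// Returns true when the path's extension is in either list.
/// A path with no extension is treated as unsupported (an assumption).
pub fn is_supported_extension(path: &str) -> bool {
    let ext = match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some(ext) => ext,
        None => return false,
    };
    LANGUAGE_KIND_EXTENSIONS.iter().any(|e| *e == ext)
        || SUPPORTED_EXTENSIONS_PLACEHOLDER.iter().any(|e| *e == ext)
}
```

Exposing one function like this, rather than repeating the OR in each caller, is what lets collection and removal detection agree on which files the orchestrator owns.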
Claude finished @carlos-alm's task
PR Review Analysis
Review Checklist
Root Cause Analysis
Problem Being Solved: The native Rust orchestrator was incorrectly flagging WASM-only files (.clj, .gleam, .jl, .fs) as "removed" during incremental builds because these files are outside Rust's supported extension set. This caused a wasteful delete → re-insert cycle on every rebuild, creating the ~2s performance floor reported in issue #1066.
Why This Approach: The solution correctly addresses the root cause by filtering out unsupported extensions.
Risk Assessment: Low risk. This is a targeted fix that maintains existing behavior for supported files while correctly excluding unsupported ones.
Backlog Compliance
Critical Concerns
None identified. This is a well-implemented performance fix.
Code Quality Highlights
Final Recommendation
Greptile Summary
Fixes a ~2 s incremental-rebuild floor caused by
Confidence Score: 5/5
Safe to merge: the guard is narrowly scoped to unsupported extensions, and the helper function provably matches the predicate already used. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[detect_removed_files called] --> B[Build 'current' set from all_files]
B --> C[Iterate existing DB file_hashes rows]
C --> D{is_supported_extension?}
D -- No: WASM-only file --> E[Skip — leave row alone]
D -- Yes: Rust-native file --> F{scoped rebuild?}
F -- Yes --> G{in scope AND not in current?}
F -- No --> H{not in current?}
G -- Yes --> I[Flag as removed]
G -- No --> J[Skip]
H -- Yes --> I
H -- No --> J
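The decision flow in the chart can be sketched as a single filter. This is an illustration under assumptions, not the crate's code: the function name, the slice and `Option<&HashSet<String>>` parameters, and passing the predicate as a closure are all hypothetical choices made to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Sketch of the removal check from the flowchart.
/// `existing`     : file paths that currently have `file_hashes` rows.
/// `all_files`    : the just-collected file list (becomes `current`).
/// `scope`        : hypothetical stand-in for a scoped rebuild; `None` means unscoped.
/// `is_supported` : stands in for `file_collector::is_supported_extension`.
pub fn detect_removed_sketch<F>(
    existing: &[String],
    all_files: &[String],
    scope: Option<&HashSet<String>>,
    is_supported: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let current: HashSet<&str> = all_files.iter().map(String::as_str).collect();
    existing
        .iter()
        .filter(|f| {
            // WASM-only file: absence from `current` is a capability
            // boundary, not a deletion, so never flag it.
            if !is_supported(f.as_str()) {
                return false;
            }
            // In a scoped rebuild, only files inside the scope can be flagged.
            let in_scope = scope.map_or(true, |s| s.contains(f.as_str()));
            in_scope && !current.contains(f.as_str())
        })
        .cloned()
        .collect()
}
```

Putting the extension guard first means an unsupported file short-circuits before either the scope or the `current` lookup is consulted, which matches the ordering of the chart's first decision node.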
Reviews (2): Last reviewed commit: "fix(native): skip unsupported-extension ..."
    if !is_supported_extension(f) {
        return false;
    }
Stale rows when WASM-only files are genuinely deleted
With this guard in place, Rust will never call purge_changed_files for a WASM-only file — even when that file is actually deleted from disk. The PR description relies on backfillNativeDroppedFiles's "missing-file early-return" (pipeline.ts:811) to prevent needless re-parsing, but that early return fires when the file_hashes row already exists. If the file is truly gone, the row now persists forever: Rust skips the purge, and the JS backfill sees an existing row and returns early without re-parsing (or deleting). Could you confirm whether there is a separate deletion path in backfillNativeDroppedFiles that detects the file no longer exists on disk and removes its DB rows?
Good catch — confirmed the gap. backfillNativeDroppedFiles only adds rows for files missing from the DB; it has no path to delete rows for files missing from disk. Before this PR, Rust's detect_removed_files happened to clean those up as a side effect of the buggy purge-and-reinsert loop. With the new guard, that side effect is gone and a deleted .clj/.gleam/.jl/.fs leaves stale rows until the next full rebuild.
The hot-path fix here is intentionally narrow — adding the deletion path touches a different module (pipeline.ts) and warrants its own test coverage. Tracked as #1073 with a fix sketch (compute (existingNodes ∪ existingHashes) − expected filtered to non-native extensions, then purgeFilesData).
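The set arithmetic in that fix sketch can be illustrated as follows. The actual fix is tracked for `pipeline.ts`, so this Rust version only demonstrates the computation; the function name, parameter types, and the `is_native` closure are hypothetical, and the final purge step (`purgeFilesData`) is deliberately left out.

```rust
use std::collections::BTreeSet;

/// Computes (existing_nodes ∪ existing_hashes) − expected, keeping only
/// files whose extension the native engine does NOT handle.
/// `is_native` is a hypothetical stand-in for the native-extension check.
pub fn stale_wasm_only_files<F>(
    existing_nodes: &BTreeSet<String>,
    existing_hashes: &BTreeSet<String>,
    expected: &BTreeSet<String>,
    is_native: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    existing_nodes
        .union(existing_hashes)
        // Drop everything that is still expected on disk.
        .filter(|f| !expected.contains(f.as_str()))
        // Native files are already handled by the Rust removal check.
        .filter(|f| !is_native(f.as_str()))
        .cloned()
        .collect()
}
```

Restricting the result to non-native extensions keeps the two deletion paths disjoint: the Rust side purges removed native files, and this computation covers only the rows the JS layer owns.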
Codegraph Impact Analysis
3 functions changed → 4 callers affected across 1 file
Summary
- `detect_removed_files` now filters out files outside Rust's supported-extension set before flagging them as removed. Files whose extension is unknown to `LanguageKind::from_extension` and absent from `SUPPORTED_EXTENSIONS` (e.g. `.clj`, `.gleam`, `.jl`, `.fs`) are owned by the JS-side WASM backfill (#967, #1068); the orchestrator must leave their rows alone.
- Adds `is_supported_extension` to `file_collector`, exposing the same OR predicate used by `collect_files`. `change_detection` reuses it so the two stages can never drift.

Refs #1066.

Why this fixes the 1-file rebuild ~2s floor

Issue #1066 has two halves. PR #1069 closed the no-op half by ensuring `file_hashes` rows are persisted for every collected file. The 1-file half remained: fast-skip can't fire when there's a real change, so each rebuild went through `tryNativeOrchestrator`, which:

1. `file_collector::collect_files`: WASM-only files (`.clj`/`.gleam`/`.jl`/`.fs`) were absent from `current`.
2. `detect_removed_files` flagged every such row as removed.
3. `purge_changed_files` ran `DELETE FROM nodes WHERE file = ?1` and `DELETE FROM file_hashes WHERE file = ?1` for each (`native_db.rs:1322`).
4. `backfillNativeDroppedFiles` (now on every pass per #1069) re-parsed the WASM-only files and re-inserted the rows.

On the codegraph repo, that loop hits the multi-language fixtures in `tests/benchmarks/resolution/fixtures/`, which is exactly the ~2s floor reported by the regression guard.

With this change, those rows persist across incremental rebuilds and the missing-file early-return in `backfillNativeDroppedFiles` (`pipeline.ts:811`) finally fires.

Test plan

- `detect_removed_finds_missing` test still passes
- `detect_removed_skips_unsupported_extensions` test passes (covers `.clj`/`.gleam`/`.jl`/`.fs`)
- `.clj` etc. paths are still owned by the JS backfill, untouched here

Notes

- `cargo check -p codegraph-core --lib -j 1` ran locally; full `cargo test` could not run on the maintainer's Windows machine due to a rustc/LLVM toolchain memory issue (unrelated to these changes). CI runs the suite.
- The `is_supported_extension` helper deliberately mirrors the OR predicate at `file_collector.rs:248-256`. Keeping it as a single function lets both stages stay in lockstep: one source of truth for what the orchestrator considers "in scope".