Skip to content

fix(native): skip unsupported-extension files in detect_removed_files#1070

Merged
carlos-alm merged 1 commit into
mainfrom
fix/1066-rust-detect-removed-by-extension
May 6, 2026
Merged

fix(native): skip unsupported-extension files in detect_removed_files#1070
carlos-alm merged 1 commit into
mainfrom
fix/1066-rust-detect-removed-by-extension

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

Refs #1066.

Why this fixes the 1-file rebuild ~2s floor

Issue #1066 has two halves. PR #1069 closed the no-op half by ensuring file_hashes rows are persisted for every collected file. The 1-file half remained — fast-skip can't fire when there's a real change, so each rebuild went through tryNativeOrchestrator, which:

  1. Ran Rust's narrower file_collector::collect_files. WASM-only files (.clj/.gleam/.jl/.fs) were absent from current.
  2. detect_removed_files flagged every such row as removed.
  3. purge_changed_files ran DELETE FROM nodes WHERE file = ?1 and DELETE FROM file_hashes WHERE file = ?1 for each (native_db.rs:1322).
  4. backfillNativeDroppedFiles (now on every pass per fix(native): persist file_hashes for dropped/symbol-less files #1069) re-parsed the WASM-only files and re-inserted the rows.

On the codegraph repo, that loop hits the multi-language fixtures in tests/benchmarks/resolution/fixtures/, which is exactly the ~2s floor reported by the regression guard:

1-file rebuild: 78 → 1986 (+2446%)

With this change, those rows persist across incremental rebuilds and the missing-file early-return in backfillNativeDroppedFiles (pipeline.ts:811) finally fires.

Test plan

  • CI regression-guard gate shows native 1-file rebuild back under 100ms (was 1986ms in run 25415028370)
  • CI regression-guard gate shows native no-op rebuild back under 30ms (was 21ms post-fix(native): persist file_hashes for dropped/symbol-less files #1069 — not blocking but should also benefit)
  • Existing detect_removed_finds_missing test still passes
  • New detect_removed_skips_unsupported_extensions test passes (covers .clj/.gleam/.jl/.fs)
  • No build-parity regression (the .clj etc. paths are still owned by the JS backfill, untouched here)

Notes

  • Verification was limited to cargo check -p codegraph-core --lib -j 1 locally; full cargo test could not run on the maintainer's Windows machine due to a rustc/LLVM toolchain memory issue (unrelated to these changes). CI runs the suite.
  • The is_supported_extension helper deliberately mirrors the OR predicate at file_collector.rs:248-256. Keeping it as a single function lets both stages stay in lockstep — one source of truth for what the orchestrator considers "in scope".

The Rust file_collector only collects files whose extension is recognized
by `LanguageKind::from_extension` or listed in `SUPPORTED_EXTENSIONS`.
The JS LANGUAGE_REGISTRY is broader — Clojure, Gleam, Julia, F# files
exist in `file_hashes` because the JS-side WASM backfill writes them
(#1068), but Rust's narrower collector never sees them.

Before this fix, `detect_removed_files` flagged every such file as
"removed" on every incremental rebuild because they were absent from
`current` (the just-collected file list). The orchestrator's purge step
then deleted their `nodes` and `file_hashes` rows, and the JS-side
`backfillNativeDroppedFiles` (now running on every pass per #1069)
re-parsed them with WASM and re-inserted the rows — the ~2s 1-file
rebuild floor reported in #1066.

Add `is_supported_extension` to `file_collector` (exposing the same
predicate used by `collect_files`) and apply it as a pre-filter in
`detect_removed_files`. Files outside Rust's capability are now left
alone: their absence from `current` is a capability boundary, not a
deletion. Rows owned by the JS layer persist across incremental
rebuilds and the backfill's missing-file early-return at
`pipeline.ts:811` finally fires.

Refs #1066

Impact: 3 functions changed, 4 affected
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Claude finished @carlos-alm's task —— View job


PR Review Analysis

Review Checklist

  • Phase 1: Root cause analysis - understand the performance problem
  • Phase 2: Configuration change detection - check for risky changes
  • Phase 3: Test integrity - verify test quality and coverage
  • Phase 4: Code quality review - examine implementation details
  • Phase 5: Backlog compliance - evaluate against project principles
  • Final assessment: Provide rating and recommendation

Root Cause Analysis

Problem Being Solved: The native Rust orchestrator was incorrectly flagging WASM-only files (.clj, .gleam, .jl, .fs) as "removed" during incremental builds because these files are outside Rust's supported extension set. This caused a wasteful delete → re-insert cycle on every rebuild, creating the ~2s performance floor reported in issue #1066.

Why This Approach: The solution correctly addresses the root cause by filtering out unsupported extensions in detect_removed_files before flagging them as removed. This prevents the orchestrator from touching files it doesn't own (which belong to the JS-side WASM backfill).

Risk Assessment: Low risk - this is a targeted fix that maintains existing behavior for supported files while correctly excluding unsupported ones.

Backlog Compliance

  • Zero-dep: ✓ — No new dependencies, builds on existing tree-sitter/SQLite infrastructure
  • Foundation-aligned: ✓ — Directly serves P1 (always-current graph) by fixing incremental rebuild performance
  • Problem-fit: 5 — Directly removes a 2s performance floor that was blocking "fast enough to run on every commit" (FOUNDATION.md P1)
  • Breaking: No — Purely additive optimization, no API/schema changes
  • Tier: 1

Critical Concerns

None identified. This is a well-implemented performance fix.

Code Quality Highlights

  • Surgical fix: Only adds 3 lines of filtering logic at the precise bottleneck
  • Proper abstraction: is_supported_extension() provides single source of truth, preventing drift
  • Comprehensive testing: New regression test covers the exact issue + positive control case
  • Clear documentation: Comments explain the capability boundary concept and issue context

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: This PR delivers an excellent fix for a real performance issue. It correctly identifies and solves the root cause (ownership boundary confusion between native and WASM engines), implements the fix surgically without affecting other logic, includes proper regression testing, and directly serves the project's core principle of keeping the graph always current by removing a major incremental rebuild bottleneck. The implementation quality is high and the approach is sound.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

Fixes a ~2 s incremental-rebuild floor caused by detect_removed_files incorrectly flagging WASM-only files (.clj, .gleam, .jl, .fs) as removed on every pass, triggering a purge-and-reinsert loop with the JS backfill. The fix adds is_supported_extension to file_collector — exposing the same OR-predicate already used by collect_files — and applies it as an early-exit guard in detect_removed_files.

  • file_collector.rs: new pub fn is_supported_extension(path: &str) -> bool that mirrors the collect_files extension check exactly (LanguageKind::from_extensionSUPPORTED_EXTENSIONS fallback); verified against collect_files lines 270-276.
  • change_detection.rs: detect_removed_files now skips any DB row whose extension isn't in Rust's supported set, plus a regression test asserting that the four WASM-only extensions are skipped while a legitimately missing .ts file is still flagged.

Confidence Score: 5/5

Safe to merge — the guard is narrowly scoped to unsupported extensions, and the helper function provably matches the predicate already used in collect_files.

The new predicate in is_supported_extension was verified against the collect_files walker logic at lines 270-276: both use LanguageKind::from_extension as the primary check and fall back to the same SUPPORTED_EXTENSIONS slice. The test correctly validates that unsupported-extension rows are skipped while a genuinely missing supported file is still flagged. The known gap around truly deleted WASM-only files leaving stale rows is intentionally deferred to #1073 and was already discussed in the prior review thread.

No files require special attention.

Important Files Changed

Filename Overview
crates/codegraph-core/src/file_collector.rs Adds is_supported_extension(path: &str) -> bool that exactly mirrors the OR-predicate in collect_files (LanguageKind check first, SUPPORTED_EXTENSIONS fallback). Logic verified against lines 270-276 of collect_files.
crates/codegraph-core/src/change_detection.rs Imports and applies is_supported_extension in detect_removed_files to skip WASM-only files; adds a regression test covering .clj/.gleam/.jl/.fs. Logic and test coverage are sound.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[detect_removed_files called] --> B[Build 'current' set from all_files]
    B --> C[Iterate existing DB file_hashes rows]
    C --> D{is_supported_extension?}
    D -- No: WASM-only file --> E[Skip — leave row alone]
    D -- Yes: Rust-native file --> F{scoped rebuild?}
    F -- Yes --> G{in scope AND not in current?}
    F -- No --> H{not in current?}
    G -- Yes --> I[Flag as removed]
    G -- No --> J[Skip]
    H -- Yes --> I
    H -- No --> J
Loading

Reviews (2): Last reviewed commit: "fix(native): skip unsupported-extension ..." | Re-trigger Greptile

Comment on lines +155 to +157
if !is_supported_extension(f) {
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Stale rows when WASM-only files are genuinely deleted

With this guard in place, Rust will never call purge_changed_files for a WASM-only file — even when that file is actually deleted from disk. The PR description relies on backfillNativeDroppedFiles's "missing-file early-return" (pipeline.ts:811) to prevent needless re-parsing, but that early return fires when the file_hashes row already exists. If the file is truly gone, the row now persists forever: Rust skips the purge, and the JS backfill sees an existing row and returns early without re-parsing (or deleting). Could you confirm whether there is a separate deletion path in backfillNativeDroppedFiles that detects the file no longer exists on disk and removes its DB rows?

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — confirmed the gap. backfillNativeDroppedFiles only adds rows for files missing from the DB; it has no path to delete rows for files missing from disk. Before this PR, Rust's detect_removed_files happened to clean those up as a side effect of the buggy purge-and-reinsert loop. With the new guard, that side effect is gone and a deleted .clj/.gleam/.jl/.fs leaves stale rows until the next full rebuild.

The hot-path fix here is intentionally narrow — adding the deletion path touches a different module (pipeline.ts) and warrants its own test coverage. Tracked as #1073 with a fix sketch (compute (existingNodes ∪ existingHashes) − expected filtered to non-native extensions, then purgeFilesData).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Codegraph Impact Analysis

3 functions changed4 callers affected across 1 files

  • detect_removed_files in crates/codegraph-core/src/change_detection.rs:141 (3 transitive callers)
  • detect_removed_skips_unsupported_extensions in crates/codegraph-core/src/change_detection.rs:776 (0 transitive callers)
  • is_supported_extension in crates/codegraph-core/src/file_collector.rs:49 (4 transitive callers)

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 2082fb3 into main May 6, 2026
37 checks passed
@carlos-alm carlos-alm deleted the fix/1066-rust-detect-removed-by-extension branch May 6, 2026 05:59
@github-actions github-actions Bot locked and limited conversation to collaborators May 6, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant