
fix(native): backfill new dropped-language files on quiet incrementals #1123

Merged
carlos-alm merged 4 commits into main from fix/1083-dropped-language-gap-repair
May 15, 2026

@carlos-alm
Contributor

Summary

Closes #1083, closes #1091.

The post-orchestrator gate (isFullBuild || removedCount>0 || changedCount>0) left a silent gap: when a brand-new file with an extension outside the Rust binary's file_collector (e.g. on a published binary that predates a newly-added Rust extractor) is added on an otherwise-quiet incremental, the orchestrator reports no activity and the JS backfill is skipped. The file never enters nodes/file_hashes until a forced full rebuild.

Fix

Extract the missing-file detection from backfillNativeDroppedFiles into a new detectDroppedLanguageGap helper, run it once before the gate, and pass the precomputed gap into backfillNativeDroppedFiles. The scan extends both post-native paths:

  • result.earlyExit branch now also repairs the gap before returning, so no-op rebuilds with stale nodes rows for dropped-extension files are healed instead of left silent.
  • Non-earlyExit gate adds gap.missingAbs.length > 0, so quiet incrementals with a real gap trigger backfill.
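
The gate change described above can be sketched as follows. This is a minimal illustration, not the code in pipeline.ts: the `OrchestratorResult` shape and both function names are hypothetical, and only the field names `missingRel` and `missingAbs` on `DroppedLanguageGap` come from this PR (their types are assumed).

```typescript
// Hypothetical result shape; the real orchestrator result has more fields.
interface OrchestratorResult {
  isFullBuild: boolean;
  removedCount: number;
  changedCount: number;
}

// Field names taken from the PR; string[] is an assumption.
interface DroppedLanguageGap {
  missingRel: string[];
  missingAbs: string[];
}

// Pre-fix gate: only orchestrator signals can trigger backfill.
function shouldBackfillBefore(r: OrchestratorResult): boolean {
  return r.isFullBuild || r.removedCount > 0 || r.changedCount > 0;
}

// Post-fix gate: a non-empty precomputed gap acts as a fourth trigger.
function shouldBackfillAfter(r: OrchestratorResult, gap: DroppedLanguageGap): boolean {
  return shouldBackfillBefore(r) || gap.missingAbs.length > 0;
}

// A quiet incremental: the orchestrator saw nothing, but one file is missing.
const quiet: OrchestratorResult = { isFullBuild: false, removedCount: 0, changedCount: 0 };
const gapWithFile: DroppedLanguageGap = {
  missingRel: ["new-file.ext"],
  missingAbs: ["/proj/new-file.ext"],
};
const emptyGap: DroppedLanguageGap = { missingRel: [], missingAbs: [] };

const skippedBeforeFix = !shouldBackfillBefore(quiet);
const repairedAfterFix = shouldBackfillAfter(quiet, gapWithFile);
const stillQuietWithoutGap = !shouldBackfillAfter(quiet, emptyGap);
```

With the old predicate the quiet incremental skips backfill entirely; with the augmented predicate it only skips when the gap is empty as well.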

Perf

No redundant fs walks: incrementals where the orchestrator signals fire would have done the same walk inside the old backfillNativeDroppedFiles anyway. The new cost on a truly-quiet incremental with no gap is one collectFiles walk + two cheap DB queries (early-return before any parsing or WAL handoff).
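
As a rough model of the detection step, the gap can be thought of as a set difference between files found on disk and files that already have a `nodes` row. The sketch below assumes that model and replaces the `collectFiles` walk and the DB queries with in-memory inputs; `detectGapSketch` and its parameters are hypothetical names, not the real `detectDroppedLanguageGap` signature.

```typescript
import * as path from "node:path";

interface DroppedLanguageGap {
  missingRel: string[];
  missingAbs: string[];
}

// filesOnDisk: relative paths, standing in for the result of a collectFiles walk.
// filesWithNodeRows: relative paths that already have a file node, standing in
// for a DB query. Both inputs are assumptions made for this illustration.
function detectGapSketch(
  rootDir: string,
  filesOnDisk: string[],
  filesWithNodeRows: Set<string>,
): DroppedLanguageGap {
  const missingRel = filesOnDisk.filter((rel) => !filesWithNodeRows.has(rel));
  // Cheap early return: nothing to parse and nothing to hand off.
  if (missingRel.length === 0) return { missingRel: [], missingAbs: [] };
  return {
    missingRel,
    missingAbs: missingRel.map((rel) => path.join(rootDir, rel)),
  };
}

const sketchGap = detectGapSketch(
  "/proj",
  ["math.js", "utils.js", "new-file.ext"],
  new Set(["math.js", "utils.js"]),
);
```

Because the result is computed once and passed along, the same object can serve both as the gate input and as the work list for backfill.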

Test plan

  • New tests/integration/dropped-language-gap.test.ts simulates the issue's DB state (file on disk + file_hashes row, no nodes row + bumped mtime to force fast-skip fall-through) and asserts the row is restored after an incremental. Fails on main, passes after the fix.
  • npx vitest run tests/integration/ — 573/573 pass (includes incremental-parity, build-parity, incr-edge-gap).
  • npm run lint — clean (1 unrelated pre-existing warning).
  • npx tsc --noEmit — clean.
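
The mtime bump used by the new test relies on the fact that `fs.utimesSync` changes a file's timestamps without touching its bytes. A small standalone sketch of that property is below; SHA-256 is used only as an example hash (the hash function codegraph actually uses is not stated in this PR), and the temp directory and file names are illustrative.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Example content hash; the algorithm here is an assumption for illustration.
function sha256(file: string): string {
  return crypto.createHash("sha256").update(fs.readFileSync(file)).digest("hex");
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "mtime-bump-"));
const file = path.join(dir, "math.js");
fs.writeFileSync(file, "export const add = (a, b) => a + b;\n");

const hashBefore = sha256(file);
const mtimeBefore = fs.statSync(file).mtimeMs;

// Push atime and mtime five seconds into the future; content is untouched.
const future = new Date(Date.now() + 5000);
fs.utimesSync(file, future, future);

const hashAfter = sha256(file);
const mtimeAfter = fs.statSync(file).mtimeMs;

fs.rmSync(dir, { recursive: true, force: true });
```

An mtime-based check sees the file as modified, while a content-hash check still sees it as unchanged, which is the combination the regression test depends on.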

The post-orchestrator gate (`isFullBuild || removedCount>0 || changedCount>0`)
left a silent gap: when a brand-new file in an extension outside the Rust
binary's `file_collector` (e.g. on an older binary) is added on an
otherwise-quiet incremental, the orchestrator reports no activity and
backfill is skipped — the file never enters `nodes`/`file_hashes` until
a forced full rebuild.

Extract the missing-file detection from `backfillNativeDroppedFiles` into
`detectDroppedLanguageGap`, run it once before the gate, and pass the
result to `backfillNativeDroppedFiles`. The scan extends both code paths:

- The `result.earlyExit` branch now also repairs the gap before returning,
  so no-op rebuilds with stale `nodes` rows for dropped-extension files
  are healed instead of left silent.
- The non-earlyExit gate adds `gap.missingAbs.length > 0` so quiet
  incrementals with a real gap trigger backfill.

No redundant fs walks: incrementals where the orchestrator signals fire
would have done the same walk inside the old `backfillNativeDroppedFiles`
anyway. The new cost on a truly-quiet incremental with no gap is one
`collectFiles` walk + two cheap DB queries (early-return before any
parsing/WAL handoff).

Closes #1083, closes #1091.
@claude

claude Bot commented May 14, 2026

Claude finished @carlos-alm's task.


PR Review: Backfill New Dropped-Language Files on Quiet Incrementals

Root Cause Analysis

Problem Being Solved: The PR fixes a silent gap where brand-new files with dropped-language extensions (extensions not supported by the Rust native engine but supported by WASM) are never added to the graph when introduced during quiet incremental builds.

The specific issue occurs when:

  1. A new file with an extension the native binary doesn't support is added
  2. The orchestrator's file_collector skips the file
  3. The orchestrator reports changedCount=0, removedCount=0, isFullBuild=false
  4. The pre-fix gate isFullBuild || removedCount>0 || changedCount>0 evaluates to false
  5. Backfill is skipped and the file never enters nodes/file_hashes until a forced full rebuild

Why This Approach: The solution extracts missing-file detection into a separate detectDroppedLanguageGap() function and runs it BEFORE the gate. This allows both early-exit and regular paths to detect and repair gaps, with the pre-computed gap used for both gating and backfill execution. This avoids redundant filesystem walks while ensuring completeness.

Risk Assessment: LOW RISK. The change is surgical and well-isolated. It preserves existing behavior while closing a specific gap. The filesystem walk cost (~20-30ms) is acceptable given it prevents silent data loss.

Configuration Change Detection

CLEAN - No configuration changes, lint rule relaxations, or test expectation modifications detected.

Test Integrity

EXCELLENT - The new test file is exemplary:

  • Surgical simulation: Creates exact issue conditions by deleting nodes rows while preserving file_hashes
  • No false positives: Uses mtime bump to bypass fast-skip without content changes
  • Verifiable outcomes: Asserts gap repair with clear expectations
  • Proper isolation: Uses temp directories and cleanup
  • Integration-focused: Tests the actual pipeline behavior, not just unit functions

The test strategy aligns perfectly with CLAUDE.md guidance - it tests the fix without documenting wrong behavior as expected.

Code Quality

Strengths:

  • Clean separation of concerns: detection vs. execution
  • Well-documented functions with clear purpose
  • Proper error handling for missing tables
  • Performance-conscious (early returns, minimal DB handoff)
  • Type safety with DroppedLanguageGap interface

Architecture alignment:

  • Maintains existing dual-engine parity guarantees
  • Preserves performance characteristics
  • Follows established patterns for native/WASM fallback

Technical soundness:

  • ✅ Correct SQL queries for gap detection
  • ✅ Proper filesystem vs DB state comparison
  • ✅ Safe handling of legacy DB schemas (file_hashes table may not exist)
  • ✅ Memory management (WASM tree cleanup)
  • ✅ Comprehensive backfill logic

Minor observations:

  • Code is well-structured and follows existing conventions
  • Comments are informative and accurate
  • Error handling is appropriate for the context

Backlog Compliance

Zero-dep: ✅ - No new runtime dependencies added
Foundation-aligned: ✅ - Fully compliant

  • Maintains always-current principle (P1) by ensuring completeness
  • Preserves zero-infrastructure goal (P4)
  • No agent-like behavior (P8)
  • Strengthens embeddable API (P5)

Problem-fit: 4/5 - Directly improves agent accuracy
This fix prevents AI coding assistants from missing files in their dependency analysis, which directly addresses the core problem of "hallucinated dependencies" and incomplete blast radius analysis.

Breaking: No - Purely additive behavior that improves existing functionality

Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None. The implementation is solid, well-tested, and addresses a real gap without introducing risks.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Excellent fix for a specific edge case with comprehensive test coverage. The solution is architecturally sound, performance-conscious, and maintains engine parity. The only reason for not giving 5 stars is that this addresses an edge case rather than a fundamental improvement, but within its scope, the execution is exemplary.

Key strengths:

  1. Root cause precision: Targets the exact gap without over-engineering
  2. Test quality: Regression test is surgical and verifiable
  3. Performance awareness: Minimal overhead on the happy path
  4. Engine parity: Maintains dual-engine correctness guarantees
  5. Code quality: Clean, well-documented, properly isolated

This PR demonstrates the engineering discipline expected in a tool that analyzes itself - it fixes the gap without introducing new ones.


@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR closes a silent gap where a brand-new file in a dropped-language extension (one outside the Rust binary's file_collector) added during an otherwise-quiet incremental would never enter nodes/file_hashes until a forced full rebuild. The fix extracts gap detection into a new synchronous detectDroppedLanguageGap helper and wires it into both post-native paths before any early return.

  • earlyExit=true branch (pipeline.ts:665–668): runs detectDroppedLanguageGap before closeDbPair; if the gap is non-empty, calls backfillNativeDroppedFiles so no-op rebuilds with stale DB state are healed.
  • Non-earlyExit gate (pipeline.ts:775–778): pre-computes the gap before the existing gate, adds gap.missingAbs.length > 0 as a fourth trigger, and passes the pre-computed gap into backfillNativeDroppedFiles to avoid a redundant fs walk.
  • New integration test (dropped-language-gap.test.ts): two describeOrSkip blocks cover both backfill call sites by surgically deleting the target file's nodes row post-full-build, then verifying the row is re-inserted after the incremental.

Confidence Score: 5/5

Safe to merge. The change is a targeted refactor of an existing backfill path, touching no new external surfaces.

The extraction of detectDroppedLanguageGap is straightforward: the gap-detection query logic is identical to what was already inside backfillNativeDroppedFiles, the advisory-lock lifecycle is preserved (the openDb call inside backfill sets __lockPath on the new connection so closeDb still releases it), and NativeDbProxy supports .prepare().all() so the detection query works on either connection type. Both new call sites are exercised by integration tests that fail on main and pass after the fix.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/domain/graph/builder/pipeline.ts | Refactors backfill gap detection into a separate detectDroppedLanguageGap helper; wires it into both the earlyExit=true and non-earlyExit code paths so quiet incrementals now heal missing dropped-language nodes. No logic errors found; DB and advisory-lock lifecycle are handled correctly in all branches. |
| tests/integration/dropped-language-gap.test.ts | New integration regression test covering both backfill call sites: Scenario 1 drives earlyExit=true via an mtime-only bump, Scenario 2 drives earlyExit=false via a real content change to a sibling file. Both scenarios verify the gap file's nodes row is re-inserted after the incremental. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[tryNativeOrchestrator] --> B[nativeDb.buildGraph]
    B --> C{result.earlyExit?}
    C -- yes --> D[detectDroppedLanguageGap]
    D --> E{gap.missingAbs.length > 0?}
    E -- yes --> F["backfillNativeDroppedFiles<br/>closes NativeDb, opens better-sqlite3<br/>inserts nodes + file_hashes"]
    E -- no --> G[closeDbPair]
    F --> G
    G --> H[return 'early-exit']
    C -- no --> I[setBuildMeta + analysis phases]
    I --> J[detectDroppedLanguageGap]
    J --> K{"isFullBuild or removedCount > 0<br/>or changedCount > 0<br/>or gap.missingAbs.length > 0?"}
    K -- yes --> L["backfillNativeDroppedFiles<br/>pre-computed gap"]
    K -- no --> M[closeDbPair]
    L --> M
    M --> N[return BuildResult]

Reviews (5): Last reviewed commit: "Merge branch 'main' into fix/1083-droppe..."

Comment on lines +42 to +91
describeOrSkip('Dropped-language gap repair on quiet incremental (#1083)', () => {
  let projectDir: string;
  let dbPath: string;
  let tmpBase: string;
  const targetFile = 'math.js';

  beforeAll(async () => {
    tmpBase = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-1083-'));
    projectDir = path.join(tmpBase, 'proj');
    copyDirSync(FIXTURE_DIR, projectDir);
    dbPath = path.join(projectDir, '.codegraph', 'graph.db');

    // Full build → every file has nodes + file_hashes rows.
    await buildGraph(projectDir, {
      engine: 'native',
      incremental: false,
      skipRegistry: true,
    });

    // Simulate the issue's DB state: a file is on disk and tracked in
    // `file_hashes` (so the orchestrator's hash tier sees it as unchanged
    // and reports `changedCount=0`), but its `kind='file'` node row is
    // absent, the same shape produced when an old binary's collector
    // never inserts dropped-extension files in the first place.
    //
    // Foreign keys are disabled for this surgical delete; edges referencing
    // the file node are left intact so we only test the gap-detection path.
    const db = new Database(dbPath);
    db.pragma('foreign_keys = OFF');
    db.prepare("DELETE FROM nodes WHERE kind='file' AND file = ?").run(targetFile);
    db.close();

    // Bump the file's mtime without changing content. This forces the
    // JS-side fast-skip pre-flight (#1054) to fall through to the
    // orchestrator (mtime mismatch with file_hashes), while the
    // orchestrator's content-hash tier still classifies the file as
    // unchanged (metadata-only update), so `changedCount=0` and
    // `removedCount=0`. That's the exact orchestrator-state the issue
    // describes for a brand-new dropped-language file.
    const targetAbs = path.join(projectDir, targetFile);
    const future = new Date(Date.now() + 5000);
    fs.utimesSync(targetAbs, future, future);

    // Incremental rebuild: must detect the gap and re-insert the row.
    await buildGraph(projectDir, {
      engine: 'native',
      incremental: true,
      skipRegistry: true,
    });
  }, 60_000);
Contributor

P2 earlyExit=true path has no test coverage

The new test exercises only the changedCount=0 / removedCount=0 / earlyExit=false quiet-incremental code path (the mtime bump forces the orchestrator to run and report changedCount=0, but result.earlyExit is false). The separate result.earlyExit branch added in pipeline.ts (lines 665–668) — where the orchestrator itself returns before any count fields are populated — is not exercised by any test in this PR. If that branch has a bug (e.g., incorrect DB state when backfillNativeDroppedFiles mutates ctx.db before the closeDbPair call), it would go undetected. A second describeOrSkip block that keeps file_hashes intact but never bumps mtime (so the orchestrator's hash tier triggers an early exit) would cover this path directly.


Contributor Author

Addressed in 6f21bfe.

You're right that the existing test only made one of the two backfill paths obvious. I split the test into two describeOrSkip blocks so each branch is explicit:

  1. earlyExit=true: mtime bump with unchanged content. The orchestrator's content-hash tier classifies the file as metadata-only, so parse_changes and removed come out empty after the !metadata_only filter in build_pipeline.rs:159; the orchestrator returns early_exit=true and backfill fires from pipeline.ts:665-667. Verified via the debug log 'No changes detected' (emitted at pipeline.ts:660), which is distinct from the JS-side fast-skip's 'No changes detected. Graph is up to date.'
  2. earlyExit=false: a real content change to a sibling file drives changedCount=1, so the orchestrator falls through to the non-early-exit path. The gate (now augmented with gap.missingAbs.length > 0) drives backfill from pipeline.ts:776-778. Verified via the debug log 'Incremental: 1 changed, 0 removed'.

Note: your suggestion of "keeps file_hashes intact but never bumps mtime" would actually short-circuit at the JS-side fast-skip pre-flight (pipeline.ts:1162-1188) before the orchestrator is ever invoked — so the orchestrator's earlyExit branch wouldn't be exercised by that path. The mtime bump is what forces the JS fast-skip to fall through to the orchestrator, where the content-hash tier then drives early_exit=true.

Both scenarios end with the deleted nodes row repaired.
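
The three outcomes discussed in this thread (JS-side fast-skip, orchestrator early exit, orchestrator main path) can be modelled with a simplified two-tier classifier. This is only a sketch of the reasoning above: `classifyIncremental`, `FileState`, and the outcome labels are hypothetical, and the real pre-flight and orchestrator may consult more than mtime and content hash.

```typescript
interface FileState {
  mtimeMs: number;
  hash: string;
}

type Outcome = "js-fast-skip" | "orchestrator-early-exit" | "orchestrator-main-path";

function classifyIncremental(
  stored: Map<string, FileState>,
  onDisk: Map<string, FileState>,
): Outcome {
  const diskFiles = Array.from(onDisk.keys());
  const storedFiles = Array.from(stored.keys());

  // Tier 1 (JS-side pre-flight): same file set and every mtime matches,
  // so the orchestrator is never invoked.
  const sameSet =
    diskFiles.length === storedFiles.length && diskFiles.every((f) => stored.has(f));
  if (sameSet && diskFiles.every((f) => stored.get(f)!.mtimeMs === onDisk.get(f)!.mtimeMs)) {
    return "js-fast-skip";
  }

  // Tier 2 (orchestrator): compare content hashes. Files whose mtime moved
  // but whose hash did not are metadata-only and are filtered out.
  const changed = diskFiles.filter((f) => stored.get(f)?.hash !== onDisk.get(f)!.hash);
  const removed = storedFiles.filter((f) => !onDisk.has(f));
  if (changed.length === 0 && removed.length === 0) return "orchestrator-early-exit";

  return "orchestrator-main-path";
}

const stored = new Map<string, FileState>([
  ["math.js", { mtimeMs: 1000, hash: "aaa" }],
  ["utils.js", { mtimeMs: 1000, hash: "bbb" }],
]);

// No mtime bump: short-circuits before the orchestrator runs.
const untouched = classifyIncremental(stored, new Map(stored));

// Scenario 1: mtime bump on math.js, content unchanged.
const mtimeBump = classifyIncremental(
  stored,
  new Map<string, FileState>([
    ["math.js", { mtimeMs: 6000, hash: "aaa" }],
    ["utils.js", { mtimeMs: 1000, hash: "bbb" }],
  ]),
);

// Scenario 2: mtime bump on math.js plus a real edit to the sibling utils.js.
const siblingEdit = classifyIncremental(
  stored,
  new Map<string, FileState>([
    ["math.js", { mtimeMs: 6000, hash: "aaa" }],
    ["utils.js", { mtimeMs: 7000, hash: "ccc" }],
  ]),
);
```

Under this model, leaving mtimes untouched never reaches the orchestrator at all, which is why the test needs the bump to exercise the earlyExit branch.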

@github-actions
Contributor

github-actions Bot commented May 14, 2026

Codegraph Impact Analysis

5 functions changed, 6 callers affected across 5 files

  • tryNativeOrchestrator in src/domain/graph/builder/pipeline.ts:604 (5 transitive callers)
  • DroppedLanguageGap.missingRel in src/domain/graph/builder/pipeline.ts:787 (0 transitive callers)
  • DroppedLanguageGap.missingAbs in src/domain/graph/builder/pipeline.ts:789 (0 transitive callers)
  • detectDroppedLanguageGap in src/domain/graph/builder/pipeline.ts:812 (4 transitive callers)
  • backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:858 (4 transitive callers)

Greptile flagged that the test only exercised one of the two backfill
call sites added by this PR. Split the test into two describeOrSkip
blocks that explicitly target each path:

- earlyExit=true: mtime bump with unchanged content → orchestrator's
  content-hash tier classifies as metadata-only → after filtering,
  parse_changes/removed are empty → returns early_exit=true → backfill
  fires from pipeline.ts:665-667.
- earlyExit=false: real content change to a sibling file (utils.js)
  drives changedCount=1 → orchestrator returns early_exit=false → the
  non-early-exit gate (now augmented with gap.missingAbs.length > 0)
  drives backfill from pipeline.ts:776-778.

Verified via debug logs: scenario 1 emits 'No changes detected'
(orchestrator earlyExit branch), scenario 2 emits 'Incremental: 1
changed, 0 removed' (main path). Both end with the gap repaired.
@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 51aaf94 into main May 15, 2026
21 checks passed
@carlos-alm carlos-alm deleted the fix/1083-dropped-language-gap-repair branch May 15, 2026 07:54
@github-actions github-actions Bot locked and limited conversation to collaborators May 15, 2026