Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .erpaval/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ development sessions. Solutions are reusable; specs are per-feature.
- [Replace raw-SQL escape hatches with typed finders on the storage interface](solutions/architecture-patterns/typed-finders-replace-raw-sql-in-consumers.md) — 108 raw-SQL sites collapse into 15 named finders. Adapters internalize dialect; consumers stay backend-agnostic. Liskov-clean parity harness via public-method rebuilder.
- [Parallel Act subagents on a shared git tree — interleaving + cherry-pick discipline](solutions/best-practices/parallel-act-subagents-with-shared-git-tree.md) — verify branch state, spawn on non-overlapping packages, watch for stale dist + phantom test counts, watch the test-fixup tail.
- [Fixed-offset asset resolvers break on bundle collapse](solutions/best-practices/fixed-offset-asset-resolvers-break-on-bundle-collapse.md) — after a tsup collapse flattens `dist/commands/x.js`→`dist/x-<hash>.js`, EVERY `import.meta.url` resolver with a fixed `..` offset shifts by a level. #189 fixed only doctor.ts; 6 others shipped broken ~5 days (init/ci-init/setup/betterleaks + 2 ingestion WASM resolvers), two SILENTLY (`analyze` → 0 symbols, exit 0). Convert all to walk-up probes; the CLI bundles deps from `dist/` so rebuild in dep order; verify from a packed-tarball global install, not the hot node_modules.
- [Silent degradation needs a run-level guard](solutions/best-practices/silent-degradation-needs-a-run-level-guard.md) — a `catch→warn→continue` loop hides a GLOBAL stage failure: every item fails identically, the run exits 0 with empty output. Fix is a PAIR — (a) a thrown sentinel distinguishing global-init failure from per-item failure (set `.name`, Piscina drops the prototype across threads), (b) a run-level "processed N but produced 0 expected outputs" guard with a distinct exit code. Pick the denominator carefully (exclude regex-provider/placeholder output) and make smoke tests assert a KNOWN symbol, not "any non-empty result".
- [Squash-merge masks pre-existing repo-wide debt](solutions/best-practices/squash-merge-masks-pre-existing-debt.md) — first action on a fresh branch from main is `mise run check` BEFORE starting work; lint rules / transitive deps / cross-package test assertions drift across squash boundaries even when per-commit gating was green inside the prior PR.
- [No spec-coordinate leakage into source](solutions/best-practices/no-spec-coordinate-leakage-into-source.md) — ERPAVal `AC-*`, `M-*`, `W-*`, `CL-*` prefixes belong in commits, PR bodies, ADR refs sections — NOT in JSDoc, inline comments, CLI flag help, MCP tool descriptions, or test names. Sweep `rg -n "AC-[A-Z]-[0-9]" packages/` before every PR-open; LLM clients pick up the leakage and start citing it back.
- [release: published events need PAT or inline](solutions/conventions/release-published-event-needs-pat-or-inline.md) — release-please-action with default `GITHUB_TOKEN` does NOT fire downstream `release: [published]` workflows; inline asset-attach in `release-please.yml` gated on `steps.release.outputs.release_created`. Fixed AC-D-4; sbom.yml has same latent bug for follow-on.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
---
name: silent-degradation-needs-a-run-level-guard
description: A pipeline stage that catches per-item errors and continues (returns empty, logs a per-item warning) will hide a GLOBAL failure of that stage — every item fails identically, the run completes, prints a count, and exits 0. Per-item warnings collapse and never become a verdict. The fix is a pair — (a) at the failure site, distinguish a global/init failure from a per-item failure and THROW on global; (b) a run-level aggregate guard ("processed N items but produced 0 of the expected output") that maps to a distinct exit code. Smoke tests must assert real output, not "any non-empty result".
metadata:
type: bug
category: best-practices
tags: [analyze, parse, wasm, silent-failure, exit-code, error-handling, piscina, smoke-test, guard]
discovered: 2026-06-08
session: session-analyze-hardening
related:
- fixed-offset-asset-resolvers-break-on-bundle-collapse
- doctor-probe-drift-after-rip-and-replace
- kill-escalation-races-the-exit-handler
---

# Silent degradation needs a run-level guard, not just per-item warnings

## What bit us

The fixed-offset WASM-resolver bug ([[fixed-offset-asset-resolvers-break-on-bundle-collapse]])
shipped invisibly for ~5 days. The resolver bug was the *cause*; the reason it
was *invisible* is a separate, more general defect worth its own lesson:

`codehub analyze` parses every source file in a Piscina worker. The worker's
per-file `try/catch` (parse-worker.ts) mapped **any** thrown error to
`{captures: [], warnings: [msg]}` and continued. So when the grammar runtime was
globally dead (vendored `vendor/wasms/` unresolvable in the flat bundle), EVERY
file took the catch, produced zero captures, and the pipeline built a
File/Directory-only **skeleton graph** — then printed `"5 nodes"` and **exited
0**. The per-file warnings existed but `logWarnings` collapsed them into
`"parse: N warnings (use --verbose)"`, and nothing aggregated them into a
run-level verdict or an exit code. A broken parser looked exactly like a healthy
parse of a symbol-free repo.

## The shape of the bug class

Any stage with this shape is vulnerable:

```
for (const item of items) {
try { results.push(await process(item)); }
catch (e) { warnings.push(e.message); results.push(EMPTY); } // continue
}
return results; // a GLOBAL failure = N identical EMPTY results = silent success
```

The per-item resilience (good — one malformed file shouldn't fail the run) is
exactly what masks a global failure (bad — a dead runtime should fail the run).

## The fix is a PAIR, not one change

You need both halves; either alone is insufficient.

**(a) At the failure site, distinguish global from per-item and THROW on global.**
Introduce a sentinel error (`WasmRuntimeUnavailableError`) for the
init/deployment-level failure (vendored dir missing, runtime won't init). Throw
it from the init path; rethrow it through the per-item catch *before* the
generic warning-mapping. A global failure then aborts the run loudly with an
actionable message, instead of becoming N warnings. Per-item errors (one bad
file) still warn-and-skip.

**(b) A run-level aggregate guard as the backstop.** Even with (a), a softer
degradation can slip through (e.g. the runtime package is genuinely absent and
the init path returns a soft `undefined` by design). So add a run-level
predicate: "the run processed >= K items that SHOULD have produced output, but
produced 0" → push a loud warning AND set a **distinct advisory exit code**
(here: 3, separate from generic-failure 1) so CI can detect a silent-skeleton
run without scraping logs. Make the predicate a pure exported function and
table-test the threshold.

Keep the two orthogonal: (a) keys off the thrown error, (b) keys off the output
count. They must not double-fire on the same condition — (a) handles hard death,
(b) handles the residual soft case.

## Gotchas that matter

- **Piscina structured-clones the rejection across threads** — a custom
`class XError extends Error` arrives on the main thread as a plain `Error`
with the prototype lost. Set `this.name` in the constructor and match on
`err.name === "XError"` on the main thread (NOT `instanceof`). `instanceof`
is fine *inside* the worker, before serialization.
- **Pick the denominator carefully.** "Files scanned" is the wrong count if some
legitimately produce no output (here: cobol routes through a regex provider,
not tree-sitter; config-only/empty repos have zero parseable files). The guard
must key off "items that SHOULD have produced output" (tree-sitter files), or
it false-positives on legitimately-empty repos. And don't count placeholder
output (external import stubs are `CodeElement` nodes — counting them as
"symbols" would let an import-only repo mask a broken parser).
- **Smoke tests must assert REAL output, not "any non-empty result".** The
global-install verifier queried `'export default'` and passed on "≥1 hit" —
but the query command's stderr header (`query: "..." (0 results)`) is itself
non-empty, so the gate passed on a 0-symbol graph. Tighten to a KNOWN symbol
from a KNOWN fixture with the expected KIND (`Function`) and FILE — something
that can only appear if extraction actually ran.
- **A separate `doctor`-style health command does NOT protect the hot path.**
`doctor` had a vendored-wasms check, but it's a manual command never run by
`analyze`. Don't assume an existing health probe covers the pipeline.

## The one-line takeaway

Per-item error resilience hides global stage failure. Whenever you write
`catch → warn → continue` in a loop, ask "what does a GLOBAL failure of this
stage look like?" — and add (a) a thrown sentinel for the global case + (b) a
run-level "produced 0 of the expected output" guard with its own exit code.
Then make the smoke test assert real output, not a non-empty string.
29 changes: 29 additions & 0 deletions packages/cli/src/asset-resolver.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,11 @@ const CI_TEMPLATE_CANDIDATES = [
["ci-templates"],
["src", "commands", "ci-templates"],
] as const;
// vendor/wasms is resolved by doctor.ts's inline walk-up (resolveVendorWasmsDir,
// accepts on manifest.json) and by @opencodehub/ingestion's vendor-wasms.ts.
const VENDOR_WASMS_CANDIDATES = [["vendor", "wasms"]] as const;
// java/cobol_to_scip.java is resolved by cobol-proleap-setup.ts's inline walk-up.
const JAVA_WRAPPER_CANDIDATE = [["java", "cobol_to_scip.java"]] as const;

test("resolveAsset: flat post-collapse bundle — finds dist/plugin-assets as a sibling", async () => {
const root = await mkTree();
Expand Down Expand Up @@ -209,4 +214,28 @@ test("drift guard: production resolver candidates hit real files in the built di
join(dist, "config", "betterleaks.default.toml"),
"betterleaks default config",
);

// vendored grammar WASMs — doctor.ts's resolveVendorWasmsDir walks up probing
// vendor/wasms and accepts on manifest.json. Also the runtime parse path
// (ingestion vendor-wasms.ts) loads from here. A fixed-offset regression would
// resolve outside dist/ and this assert.equal would fail.
const wasms = resolveAsset(VENDOR_WASMS_CANDIDATES, { startDir: dist });
assert.equal(wasms, join(dist, "vendor", "wasms"), "ingestion vendored grammars");
assert.ok(
statSync(join(wasms as string, "manifest.json")).isFile(),
"vendor/wasms/manifest.json must ship (doctor + parse pipeline accept on it)",
);
assert.ok(
statSync(join(wasms as string, "web-tree-sitter.wasm")).isFile(),
"vendor/wasms/web-tree-sitter.wasm runtime must ship",
);

// COBOL ProLeap JVM bridge source — cobol-proleap-setup.ts walks up probing
// java/cobol_to_scip.java.
const javaSrc = resolveAsset(JAVA_WRAPPER_CANDIDATE, { startDir: dist, kind: "file" });
assert.equal(
javaSrc,
join(dist, "java", "cobol_to_scip.java"),
"cobol proleap wrapper java source",
);
});
18 changes: 18 additions & 0 deletions packages/cli/src/commands/analyze.ts
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,12 @@ export interface AnalyzeSummary {
readonly durationMs: number;
readonly upToDate: boolean;
readonly warnings: readonly string[];
/**
* Set when the parse phase produced zero code symbols from a non-trivial
* number of tree-sitter files (likely a globally-broken parser). The CLI maps
* this to a distinct advisory exit code so CI catches a silent-skeleton run.
*/
readonly zeroSymbolGuard?: boolean;
}

export async function runAnalyze(path: string, opts: AnalyzeOptions = {}): Promise<AnalyzeSummary> {
Expand Down Expand Up @@ -309,6 +315,17 @@ export async function runAnalyze(path: string, opts: AnalyzeOptions = {}): Promi

logWarnings(result.warnings, opts.verbose === true);

// Surface the zero-symbol guard prominently — the detailed message is already
// in result.warnings, but logWarnings collapses grouped warnings, so emit an
// explicit run-level banner that can't be missed (and that the exit-code
// mapping in index.ts keys off via the returned summary flag).
if (result.zeroSymbolGuardTripped === true) {
log(
"codehub analyze: WARNING — extracted 0 code symbols from a non-trivial source tree; " +
"the parser is likely broken (see the parse warning above). Run 'codehub doctor'.",
);
}

// Persist to the composed graph + temporal store. Storage is always
// graph.lbug (graph-tier) + temporal.duckdb sidecar (cochanges, summary
// cache); the temporal-tier writes (`bulkLoadCochanges`,
Expand Down Expand Up @@ -497,6 +514,7 @@ export async function runAnalyze(path: string, opts: AnalyzeOptions = {}): Promi
durationMs,
upToDate: false,
warnings: result.warnings,
...(result.zeroSymbolGuardTripped === true ? { zeroSymbolGuard: true } : {}),
};
}

Expand Down
6 changes: 5 additions & 1 deletion packages/cli/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -149,7 +149,7 @@ program
const embeddingsWorkers = parseWorkerCount(workersRaw);
const embeddingsBatchSize = parsePositiveInt(opts["embeddingsBatchSize"]);

await mod.runAnalyze(path ?? process.cwd(), {
const analyzeSummary = await mod.runAnalyze(path ?? process.cwd(), {
force: opts["force"] === true,
embeddings: opts["embeddings"] === true,
embeddingsVariant: opts["embeddingsInt8"] === true ? "int8" : "fp32",
Expand Down Expand Up @@ -177,6 +177,10 @@ program
strictDetectors: opts["strictDetectors"] === true,
...(allowBuildScripts !== undefined ? { allowBuildScripts } : {}),
});
// Advisory exit code 3: analyze built a graph but extracted zero code
// symbols (likely a broken parser). Distinct from the generic failure
// exit 1 so CI can detect a silent-skeleton run without parsing logs.
if (analyzeSummary.zeroSymbolGuard === true) process.exitCode = 3;
});

program
Expand Down
11 changes: 10 additions & 1 deletion packages/cli/tsup.config.ts
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
import { cp, mkdir } from "node:fs/promises";
import { cp, mkdir, rm } from "node:fs/promises";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import { defineConfig } from "tsup";
Expand Down Expand Up @@ -59,6 +59,15 @@ const distDir = join(here, "dist");
const EXTERNAL = [/^[^.]/];

async function copyTree(from: string, to: string): Promise<void> {
// rm the leaf dest FIRST so a rename/delete in the source tree doesn't leave
// a stale entry behind in `dist/`. `cp(force:true)` overwrites files present
// in both trees but never removes dest-only entries, so without this an
// incremental/watch rebuild accumulates renamed dirs (e.g. a renamed skill
// left both old and new names under dist/plugin-assets/skills/). `clean:true`
// only wipes dist on a full build, not in --watch. rm the exact leaf `to`
// (never `dirname(to)`) — every onSuccess dest is a distinct non-nested leaf,
// and plugin-assets is copied as one tree, so this can't clobber a sibling.
await rm(to, { recursive: true, force: true });
await mkdir(dirname(to), { recursive: true });
await cp(from, to, { recursive: true, force: true, errorOnExist: false });
}
Expand Down
20 changes: 20 additions & 0 deletions packages/ingestion/src/parse/parse-worker.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import { describe, it } from "node:test";
import type { LanguageId } from "@opencodehub/core-types";
import { MAX_FILE_BYTES, parseOne } from "./parse-worker.js";
import type { ParseCapture, ParseTask } from "./types.js";
import { WasmRuntimeUnavailableError } from "./wasm-runtime.js";

function task(content: Buffer, language: LanguageId = "typescript"): ParseTask {
return { filePath: "src/sample.ts", content, language };
Expand Down Expand Up @@ -75,6 +76,25 @@ describe("parseOne — error to warning mapping", () => {

assert.equal(result.warnings?.[0], "plain string failure");
});

it("RETHROWS a global WasmRuntimeUnavailableError instead of mapping it to a warning", async () => {
// A global runtime death is broken-for-every-file; it must abort the run,
// not become a per-file warning that hides behind a 0-symbol skeleton graph.
await assert.rejects(
parseOne(task(Buffer.from("x")), async () => {
throw new WasmRuntimeUnavailableError("vendor/wasms missing");
}),
/vendor\/wasms missing/,
);
});

it("keeps an ordinary per-file Error as a warning (not the global sentinel)", async () => {
const result = await parseOne(task(Buffer.from("x")), () => {
throw new Error("one bad file");
});
assert.deepEqual(result.captures, []);
assert.equal(result.warnings?.[0], "one bad file");
});
});

describe("parseOne — happy path", () => {
Expand Down
11 changes: 10 additions & 1 deletion packages/ingestion/src/parse/parse-worker.ts
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,11 @@ import { Buffer } from "node:buffer";
import { performance } from "node:perf_hooks";
import type { LanguageId, ParseBatch, ParseCapture, ParseResult, ParseTask } from "./types.js";
import { getUnifiedQuery } from "./unified-queries.js";
import { openWasmParser, type WasmParserHandle } from "./wasm-runtime.js";
import {
openWasmParser,
type WasmParserHandle,
WasmRuntimeUnavailableError,
} from "./wasm-runtime.js";

const PER_FILE_TIMEOUT_MS = 30_000;
/** Pre-parse byte cap. Inputs above this are skipped before any WASM call. */
Expand Down Expand Up @@ -101,6 +105,11 @@ export async function parseOne(
...(warnings.length > 0 ? { warnings } : {}),
};
} catch (err) {
// A global runtime failure means the parser is dead for every file — let it
// propagate so the batch (and the run) abort loudly, instead of recording N
// identical per-file warnings and a silently symbol-free graph. Per-file
// errors (syntax, timeout, one bad grammar) stay warnings and skip the file.
if (err instanceof WasmRuntimeUnavailableError) throw err;
const message = err instanceof Error ? err.message : String(err);
warnings.push(message);
return {
Expand Down
2 changes: 1 addition & 1 deletion packages/ingestion/src/parse/vendor-wasms.ts
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ import { statSync } from "node:fs";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";

function isDirSync(p: string): boolean {
export function isDirSync(p: string): boolean {
try {
return statSync(p).isDirectory();
} catch {
Expand Down
52 changes: 52 additions & 0 deletions packages/ingestion/src/parse/wasm-runtime.test.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
/**
* Unit tests for the WASM-runtime global-failure detection.
*
* The load-bearing behavior: a globally-broken parse runtime (vendored
* `vendor/wasms/` missing, or web-tree-sitter.wasm un-loadable) must surface as
* a thrown {@link WasmRuntimeUnavailableError} — NOT a soft per-file skip that
* lets `analyze` complete with a symbol-free skeleton graph and exit 0.
*
* `VENDOR_WASMS_DIR` is computed once at module load, so the global-detection
* logic is split into the pure `assertRuntimeAvailable(dirExists, dir)` seam,
* which we test directly without standing up Emscripten or mutating the const.
*/

import { strict as assert } from "node:assert";
import { describe, it } from "node:test";
import {
assertRuntimeAvailable,
vendorWasmsDirExists,
WasmRuntimeUnavailableError,
} from "./wasm-runtime.js";

describe("assertRuntimeAvailable — global runtime probe", () => {
it("throws WasmRuntimeUnavailableError naming the dir when vendored grammars are missing", () => {
assert.throws(
() => assertRuntimeAvailable(false, "/nope/vendor/wasms"),
(err: unknown) => {
assert.ok(err instanceof WasmRuntimeUnavailableError);
assert.equal((err as Error).name, "WasmRuntimeUnavailableError");
assert.match((err as Error).message, /\/nope\/vendor\/wasms/);
return true;
},
);
});

it("does not throw when the vendored directory exists", () => {
assert.doesNotThrow(() => assertRuntimeAvailable(true, "/anywhere"));
});

it("sets a name that survives structured-clone across the worker boundary", () => {
// Piscina loses the subclass prototype across threads; the main thread must
// be able to match on `name`. Constructing fresh confirms the name is set.
const err = new WasmRuntimeUnavailableError("boom");
assert.equal(err.name, "WasmRuntimeUnavailableError");
});

it("reports the real vendored dir as present in this (built) package", () => {
// Sanity: the standalone ingestion build ships vendor/wasms, so the probe
// is true here. This also guards the resolver — a regression that pointed
// VENDOR_WASMS_DIR outside the package would flip this to false.
assert.equal(vendorWasmsDirExists(), true);
});
});
Loading
Loading