From 836ed322679ccc3eb9559563f13544735ae2551f Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 13:35:18 +0800 Subject: [PATCH 01/14] Prove TextMate-generator completeness (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a formal completeness proof for src/gen-tm.ts: for every grammar expressible through the public src/api.ts combinators, every TextMate-representable highlighting obligation is emitted and reachable. This is the dual of the soundness ledger (KNOWN-GAPS.md, which finds WRONG paints) — here we prove there are no MISSING ones. The proof rests on the generator's input being a closed, finite algebra (RuleExpr / TokenPattern), which makes "every obligation" enumerable and reduces completeness to three mechanically-checked layers: - closure — toRuleExpr, the token-pattern compiler, and the shared collectLiterals backbone are total over their unions, so no supported combinator shape is silently dropped; - coverage — every non-skip token has a discharge path (the token census), and every content/keyword obligation leaf is painted (2433/2433 across the six grammars), on a fixed denominator; - reachability — every emitted repository key is reachable from the root patterns or a declared export surface (#expression / canonicalRepoNames / aliasScopes); zero dead keys. The Layer-A audit surfaced one latent silent-drop: getTypeParamElementKeywords omitted `sep`, so a keyword in a sep-list within a type-parameter element lost its keyword role inside `<…>`. Fixed by recursing into sep.element (`not`/`ref` stay omitted on purpose — a forbidden word / a constraint type's own keywords must not be hoisted); the six shipped grammars are byte-identical (the drop is latent), and the checker carries a biting regression guard. 
Three sites that looked like TextMate impossibilities — variable-width lookbehind for the cast/arrow value test, the balanced-paren arrow confirm, and a regex after a control-flow head — were attacked and refuted: each is expressible in vscode-oniguruma (verified), and the fixed-width forms gen-tm emits are deliberate Onigmo-portability choices. They are a soundness-precision axis, not a completeness gap. COMPLETENESS.md is the proof spine; test/tm-completeness.ts mechanises it (npm run completeness[:check]) and joins `npm run check` as a gate. --- COMPLETENESS.md | 187 ++++++++++++++ package.json | 3 + src/gen-tm.ts | 7 + test/check.ts | 1 + test/tm-completeness.ts | 541 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 739 insertions(+) create mode 100644 COMPLETENESS.md create mode 100644 test/tm-completeness.ts diff --git a/COMPLETENESS.md b/COMPLETENESS.md new file mode 100644 index 0000000..99f63a3 --- /dev/null +++ b/COMPLETENESS.md @@ -0,0 +1,187 @@ +# Total derivation: the completeness spine + +Why `src/gen-tm.ts` emits *every* TextMate construct the grammar requires — not by +testing a corpus, but because the generator's input is a **closed, finite algebra**, so +"every obligation" is enumerable and each is discharged by a reachable emission. This is +the dual of the soundness ledger (`KNOWN-GAPS.md`, which finds *wrong* paints); here we +prove there are no *missing* ones. The proof is held exact by `test/tm-completeness.ts` +and the ledger at the end. + +## The contract + +For every grammar `G` built from the public `src/api.ts` combinators and lowered through +`defineGrammar()`: + +1. **Closure** — `G` is a value of the closed `RuleExpr` / `TokenPattern` algebra (plus a + finite set of config records). Nothing the API can express falls outside it; nothing in + it is unreachable from the API. +2. 
**Coverage** — every highlighting obligation `G` induces (a token scope, a keyword, an + operator, a region, an embed, a disambiguation, a config-driven construct) is emitted by + `generateTmLanguage(G)`. +3. **Reachability** — every emitted repository entry is reachable from the root patterns or + from a declared export surface; conversely every export surface resolves. + +Three separations keep the claim honest: + +- **Parser** completeness (does `G` accept the language?) is a *different* axis, measured by + the conformance run and `test/src-coverage.ts`. +- **Highlighter** completeness (this document) is *coverage*: every obligation is + recognised and scoped. Whether the scope is the *right* one at an ambiguous frontier is + **soundness** — `test/scope-gap.ts` and `test/gap-ledger.ts`, a separate axis. +- **TextMate-engine** expressiveness (can the regex model express the obligation at all?) + is bounded by Oniguruma, not by Monogram; §"The frontier" settles where that bound + actually lies. + +This is **not** the README corpus metrics (empirical agreement with external oracles). It +is the formal derivation property: the map from the DSL grammar to the emitted grammar +loses nothing representable. + +## Why a closed algebra makes this finite + +A TextMate grammar written by hand has no completeness theorem — there is no enumerable set +of "everything it should match." Monogram's does, because `generateTmLanguage` consumes a +value of a **closed union**: `RuleExpr` has 15 constructors and `TokenPattern` has 10 +(`src/types.ts`), plus the finite config records (`TokenDecl`, `MarkupConfig`, +`IndentConfig`, `NewlineConfig`, the Pratt tables, `scopeOverrides`, `canonicalRepoNames`, +`aliasScopes`, `expressionRule`, `manifest`). An *obligation* is induced by a +constructor-occurrence or a config-field-occurrence. 
So completeness reduces to: **for each +obligation generator, the generator has a discharging, reachable emission** — three +mechanically-checkable layers. + +## Layer A — closure: the universe is the algebra, and lowering is total + +**A1 — API lowering closure.** `toRuleExpr` (`src/api.ts`) is a total function with a finite +case analysis ending in a `throw`: it never silently drops an element, and its image is +exactly the `RuleExpr` union. Witnessed by instantiating *every* public combinator and +marker into one grammar and confirming the lowered bodies use all 15 constructors and +nothing outside them (`checkRuleExprClosure`). `formatExpr` in `src/cli.ts` is an +independent exhaustive `switch` over the same 15 — a second guard that the union is closed. + +**A2 — TokenPattern compiler closure.** `tokenPatternToRegex`'s `emit` (`src/token-pattern.ts`) +is a single `switch` over the 10 `TokenPattern` constructors with no `default` — TypeScript +exhaustiveness makes a missing case a compile error. Witnessed by compiling every public +token builder to a regex (`checkTokenPatternClosure`). + +**A3 — the literal-collection backbone is total.** Flat keyword / operator scoping is driven +by the shared `collectLiterals` (`src/grammar-utils.ts`), looped over every rule body. It +recurses into *all consuming* structural constructors (`seq`, `alt`, `quantifier`, `group`, +`sep`) and omits only the ones that carry no consumed literal: `not` (a negative lookahead — +the word is *absent* at the site) and `ref` (a cross-rule edge, collected when that rule's +own body is walked). So no consumed literal is silently dropped, and the flat keyword +obligation is discharged for *any* nesting. This is why a naïve end-to-end keyword probe is +vacuous — `collectLiterals` already covers every nesting (`checkCollectLiteralsClosure`). 
+ +The residual silent-drop risk therefore lives only in the **specialised region walkers** that +do *not* use `collectLiterals` (they hoist keywords out of a derived `<…>` / region scope). +Auditing the 48 RuleExpr walkers in `gen-tm.ts` found exactly one reachable gap: +`getTypeParamElementKeywords` omitted `sep`, so a keyword inside a `sep`-list within a +type-parameter element lost its keyword role inside `<…>`. No shipped grammar nests a keyword +that way (TS type-param keywords are direct), so it was latent — but it is a *supported* +combinator shape silently ignored, so it is **fixed** (one line: recurse into `sep.element`; +`not` stays omitted on purpose — a forbidden word; `ref` stays unresolved so a constraint +*type*'s own keywords like `keyof`/`typeof` are not mis-hoisted). The fix is byte-identical on +all six shipped grammars (latent), and the `kwsep` probe in `regionKeywordProbe` is a biting +regression guard (it fails without the fix). + +## Layer B — coverage: every obligation has a reachable discharge + +The obligation families, enumerated from `G`'s closed algebra **independently of gen-tm's own +detectors** (a detector that missed a shape would otherwise also miss its obligation — +co-blind): + +- **Tokens.** Every non-`skip` token bears a leaf-scope obligation, discharged by exactly one + family: the flat token loop (a `#` entry), the regex-literal family (`regex`-flagged), + the indent/markup engine (a `never()` placeholder the region machinery replaces), the markup + region machinery (a `markup` grammar emits no per-token keys), or a region that owns the + token's delimiter (the JSX `/>` / `#` include: the `#expression` +sub-grammar (`expressionRule`), the `canonicalRepoNames` official keys, and `aliasScopes`. These +are root-unreachable *by design* — they are the grammar's public repository API. A naïve +root-only reachability flags ten keys as dead; the export-surface-aware closure flags **zero**. 
+A `canonicalRepoNames` entry whose structural *source* is absent in a shared map (e.g. `type`/ +`new-expr` in JavaScript, which has no type layer; `cast` in `.tsx`, where `expr` is JSX, so +only `as`-casts exist) induces no obligation and is correctly inert — distinct from a dangling +reference with a *present* source, of which there are none. + +## The frontier — no proven impossibility + +Three sites looked like TextMate impossibilities; under adversarial attack (the project's +discipline: a "can't" must survive a real attack before it is recorded), all three were +**refuted** with constructions tested in the production engine: + +- **Cast/arrow "not after a value" across unbounded whitespace.** gen-tm emits a fixed-width + negative-lookbehind ladder (`\s{k}`, k=0..16). The exact unbounded condition is a single + variable-width lookbehind `(? x`. The single-level `[^()]*` lookahead + breaks at the inner `(`, but Oniguruma's recursive subroutine `(?<p>\((?:[^()]|\g<p>)*\))` + matches balanced parens at arbitrary depth in a begin lookahead (verified to compile + match). +- **Regex after a control-flow head** `if (a) /re/`. A variable-width positive lookbehind + `(?<=\b(?:if|while|for|with)\s*\([^()]*\))` (or the recursive form for nested heads) + **compiles and matches in vscode-oniguruma** (verified: matches `if (a) /`, not `a / b`). + +So none of these is a model impossibility. Each is (a) directly expressible in vscode-oniguruma +— the engine VS Code actually runs — and (b) approximated by a fixed-width form *deliberately*, +for **Onigmo portability** (RedCMD's YAML grammar runs under Onigmo, which rejects variable-width +lookbehind; the same source must compile under both, see `test/redcmd-tm-diagnostics.ts`). And +each is a **soundness-precision** matter, not a completeness gap: the `<` / `/` / arrow *is* +recognised and scoped; what is refined at the frontier is *which* role at the ambiguous boundary. +Improving that precision (var-width forms for the `vscode-oniguruma`-only grammars, `\g<>` for the +arrow region) is a separate, soundness-gated change. **The completeness obligation is discharged.** + +## The proof ledger + +The fixed denominator is every measured obligation (token discharge + repository reachability + +leaf painting), summed across the six grammars; the numerator is the discharged count. +Auto-generated by `node test/tm-completeness.ts --write`; `--check` fails CI if it is stale. 
+ + + +| Grammar | Tokens | Keyword literals | Operators | Repo keys (reachable) | Leaf obligations (painted) | +|---|---:|---:|---:|---:|---:| +| typescript | 11/11 | 73 | 53 | 158/158 | 199/199 | +| javascript | 11/11 | 48 | 51 | 103/103 | 131/131 | +| typescriptreact | 13/13 | 73 | 53 | 171/171 | 169/169 | +| javascriptreact | 13/13 | 48 | 51 | 116/116 | 121/121 | +| html | 7/7 | 0 | 0 | 28/28 | 175/175 | +| yaml | 19/19 | 0 | 0 | 54/54 | 1638/1638 | +| **total** | **74/74** | **242** | **208** | **630/630** | **2433/2433** | + +**Fixed-denominator completeness: 3137/3137 = 100.00%** (token discharge 74/74 · repository reachability 630/630 · leaf painting 2433/2433). Keyword literals (242) and Pratt operators (208) are discharged through the leaf-painting column. **0 open completeness gaps.** + + + +## The gates that hold this exact + +- `test/tm-completeness.ts` — Layer A closure (RuleExpr / TokenPattern / `collectLiterals`), the + `sep`-recursion regression guard, reachability, the token census, and leaf coverage with a fixed + denominator. `npm run completeness` prints it; `npm run completeness:check` gates the ledger. +- `test/agnostic.ts` — detector shape-completeness: the detectors fire on structure, not on TS + names, so "every shape that bears the obligation is detected" holds for any grammar. +- `test/scope-gap.ts`, `test/gap-ledger.ts` — the **soundness** axis (is each painted scope + correct?), the dual of this document, kept separate on purpose. 
diff --git a/package.json b/package.json index 04016c6..c578977 100644 --- a/package.json +++ b/package.json @@ -36,6 +36,9 @@ "coverage:table": "node test/coverage-table.ts --write", "ledger": "node test/gap-ledger.ts --write", "ledger:check": "node test/gap-ledger.ts --check", + "completeness": "node test/tm-completeness.ts", + "completeness:check": "node test/tm-completeness.ts --check", + "completeness:write": "node test/tm-completeness.ts --write", "ledger:selftest": "node test/gap-ledger-selftest.ts", "ledger:issues": "node test/gap-issues.ts", "ledger:issues:dry": "node test/gap-issues.ts --dry-run", diff --git a/src/gen-tm.ts b/src/gen-tm.ts index ffd2430..46372f5 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -3110,6 +3110,13 @@ function getTypeParamElementKeywords(body: RuleExpr, grammar: CstGrammar): strin if (e.type === 'literal' && isKeywordLiteral(e.value)) keywords.push(e.value); if (e.type === 'seq' || e.type === 'alt') e.items.forEach(walk); if (e.type === 'quantifier' || e.type === 'group') walk(e.body); + // A keyword reached through a `sep` sub-list of the element is just as direct as one in a + // seq/alt — recurse into its element so it is hoisted too (e.g. a type-param whose constraint + // is a `&`-separated list carrying a keyword). `not` stays omitted on purpose: a literal under + // a negative lookahead is a forbidden word, not present at the site, so it bears no scope; and + // `ref` stays unresolved (like collectLiterals) so a constraint TYPE's own keywords — `keyof`, + // `typeof` — are NOT mis-hoisted to type-parameter keyword scope. 
+ if (e.type === 'sep') walk(e.element); } walk(elementBody); return [...new Set(keywords)]; diff --git a/test/check.ts b/test/check.ts index bb32923..3aefb9e 100644 --- a/test/check.ts +++ b/test/check.ts @@ -36,6 +36,7 @@ const GATES: Gate[] = [ { group: 'conformance', name: 'jsx', args: ['test/jsx-conformance.ts'] }, { group: 'conformance', name: 'html', args: ['test/html-conformance.ts'] }, { group: 'highlighter', name: 'tm-guards', args: ['test/tm-highlight-guards.ts'] }, + { group: 'highlighter', name: 'tm-completeness', args: ['test/tm-completeness.ts', '--check'] }, { group: 'highlighter', name: 'tm-diagnostics', args: ['test/redcmd-tm-diagnostics.ts'] }, { group: 'highlighter', name: 'angle-depth', args: ['test/angle-depth-probe.ts'] }, { group: 'highlighter', name: 'html-monarch', args: ['test/html-monarch.ts'] }, diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts new file mode 100644 index 0000000..a62cc2b --- /dev/null +++ b/test/tm-completeness.ts @@ -0,0 +1,541 @@ +// ───────────────────────────────────────────────────────────────────────────── +// tm-completeness.ts — the COMPLETENESS checker + ledger for src/gen-tm.ts. +// +// Issue #51: prove that the TextMate generator is COMPLETE — for every grammar +// shape that REQUIRES a TextMate construct, gen-tm emits it AND it is reachable. +// This is the dual of the soundness ledger (test/gap-ledger.ts, which finds +// WRONG paints): here we find UN-emitted / UN-reachable obligations. +// +// The proof is structural, resting on the fact that the generator's INPUT is a +// CLOSED, FINITE algebra (RuleExpr / TokenPattern in src/types.ts) plus a finite +// set of config records (TokenDecl / Markup / Indent / Newline / …). Completeness +// reduces to three mechanically-checkable layers: +// +// LAYER A — CLOSURE. The public api.ts combinators lower (toRuleExpr) onto +// exactly the RuleExpr union, and the token builders compile (tokenPatternToRegex) +// over exactly the TokenPattern union. 
Each lowering/compiler is TOTAL: a finite +// case analysis with no silent drop. Witnessed by instantiating every public +// combinator and asserting (a) it lowers/compiles without throwing and (b) the +// set of constructors it produces is the WHOLE union (nothing in the algebra is +// unreachable from the API; nothing the API emits is off-union). +// +// LAYER B — OBLIGATION COVERAGE. From each grammar G we enumerate Obl(G): the +// finite, fixed-denominator multiset of highlighting obligations induced by G's +// tokens / literals / operators / shapes / config. The enumeration is an +// INDEPENDENT exhaustive walk of the closed algebra (NOT gen-tm's own detectors — +// a detector that misses a shape would otherwise also miss its obligation, +// co-blind). Each obligation must be discharged by an emitted construct that is +// reachable from the root patterns OR a declared export surface. +// +// REACHABILITY. Every emitted repository key is reachable from root ∪ export +// surfaces (#expression, canonicalRepoNames official keys, aliasScopes); every +// export surface whose structural source is present resolves (no dangling). 
+// +// Run (bare node): +// node test/tm-completeness.ts # print the report +// node test/tm-completeness.ts --check # CI gate: fail on any open gap or stale ledger +// node test/tm-completeness.ts --write # (re)write COMPLETENESS.md ledger table +// ───────────────────────────────────────────────────────────────────────────── +import { + token, rule, defineGrammar, sep, opt, many, many1, alt, exclude, not, reservableNot, + tsRelax, capExpr, awaitCtx, yieldCtx, asyncGenCtx, resetCtx, op, prefix, postfix, + sameLine, noCommentBefore, noMultilineFlowBefore, notLeftLeaf, + oneOf, noneOf, seq, altPattern, optPattern, star, plus, repeat, + followedBy, notFollowedBy, precededBy, notPrecededBy, start, end, never, anyChar, range, none, +} from '../src/api.ts'; +import { tokenPatternToRegex, tokenPatternIsNever, tokenPatternLiteralText } from '../src/token-pattern.ts'; +import { collectLiterals, isKeywordLiteral } from '../src/grammar-utils.ts'; +import type { RuleExpr, TokenPattern, CstGrammar } from '../src/types.ts'; +import { generateTmLanguage } from '../src/gen-tm.ts'; +import { createParser } from '../src/gen-parser.ts'; +import { generateInputs } from './grammar-gen.ts'; +import { buildRoleMap, leafRoles, spanBuckets, GEN_OPTS, type TmTok, type Bucket } from './generative-detect.ts'; +import { readFileSync, existsSync, writeFileSync } from 'node:fs'; +import { createRequire } from 'node:module'; +import vsctm from 'vscode-textmate'; +import onig from 'vscode-oniguruma'; + +let pass = 0, failN = 0; +const fails: string[] = []; +const check = (label: string, cond: boolean, detail = '') => { + if (cond) pass++; + else { failN++; fails.push(`✗ ${label}${detail ? ` — ${detail}` : ''}`); } +}; + +// ════════════════════════════════════════════════════════════════════════════ +// LAYER A — algebra closure +// ════════════════════════════════════════════════════════════════════════════ + +// The closed RuleExpr union, straight from src/types.ts (the proof's universe). 
If a +// constructor is added there without being produced by some api.ts combinator, the +// closure witness below will report it as an unreachable constructor. +const RULE_EXPR_UNION = [ + 'seq', 'alt', 'literal', 'ref', 'quantifier', 'group', 'not', + 'sameLine', 'noCommentBefore', 'noMultilineFlowBefore', 'notLeftLeaf', + 'sep', 'op', 'prefix', 'postfix', +] as const; + +const TOKEN_PATTERN_UNION = [ + 'anyChar', 'charClass', 'seq', 'alt', 'repeat', 'lookahead', 'lookbehind', 'anchor', 'never', + // (bare string is the tenth variant, handled before the object switch) +] as const; + +// Walk a lowered RuleExpr, collecting every constructor tag it (transitively) uses. +function collectExprTags(e: RuleExpr, out: Set<string>): void { + out.add(e.type); + switch (e.type) { + case 'seq': case 'alt': e.items.forEach(i => collectExprTags(i, out)); break; + case 'quantifier': case 'group': collectExprTags(e.body, out); break; + case 'not': collectExprTags(e.body, out); break; + case 'sep': collectExprTags(e.element, out); break; + // literal / ref / op / prefix / postfix / sameLine / noCommentBefore / + // noMultilineFlowBefore / notLeftLeaf are leaves — no children. + } +} + +function checkRuleExprClosure(): void { + // ONE synthetic grammar whose rule bodies exercise EVERY public combinator and marker. + // Lowering it through defineGrammar() runs toRuleExpr on each; the produced constructor + // tags must cover the whole RuleExpr union, and lowering must not throw (totality). + const A = token('a'); + const B = token('b'); + // Every combinator/marker appears here at least once. 
+ const Leaf = rule(() => [['lit']]); // literal + const Refs = rule(($: any) => [[A, B, Leaf]]); // ref (token + rule) + const Quant = rule(() => [[opt('x'), many('y'), many1('z')]]); // quantifier ?,*,+ + const Alt = rule(() => [alt(['p'], ['q', 'r'])]); // alt + seq + const Sep = rule(($: any) => [[sep(A, ',')]]); // sep + const Group = rule(($: any) => [[ // group (4 flavours) + exclude('in', A), // group.suppress + awaitCtx(A), yieldCtx(A), asyncGenCtx(A), resetCtx(A), // group.ctxMode + tsRelax(A, B), // group.tsRelaxed + capExpr('||', A), // group.capBelow + ]]); + const Nots = rule(($: any) => [[not(A), reservableNot(['kw'])]]); // not (+ reservable) + const Markers = rule(($: any) => [[ // zero-width markers + sameLine, noCommentBefore, noMultilineFlowBefore, notLeftLeaf('void', 'null'), A, + ]]); + const Pratt = rule(($: any) => [[$, op, $], [prefix, $], [$, postfix]]); // op/prefix/postfix + const Entry = rule(($: any) => [[many(alt(Leaf, Refs, Quant, Alt, Sep, Group, Nots, Markers, Pratt))]]); + + let threw = false; let g: CstGrammar; + try { + g = defineGrammar({ + name: 'closure', tokens: { A, B }, + rules: { Leaf, Refs, Quant, Alt, Sep, Group, Nots, Markers, Pratt, Entry }, entry: Entry, + }); + } catch (e) { threw = true; g = null as any; } + check('Lemma A1: toRuleExpr is total (no throw lowering every combinator)', !threw, threw ? 'defineGrammar threw' : ''); + if (threw) return; + + const tags = new Set<string>(); + for (const r of g.rules) collectExprTags(r.body, tags); + const missing = RULE_EXPR_UNION.filter(t => !tags.has(t)); + const extra = [...tags].filter(t => !(RULE_EXPR_UNION as readonly string[]).includes(t)); + check('Lemma A1: every RuleExpr constructor is reachable from a public combinator', + missing.length === 0, missing.length ? `unreached: ${missing.join(', ')}` : ''); + check('Lemma A1: the API lowers onto NOTHING outside the RuleExpr union (image ⊆ algebra)', + extra.length === 0, extra.length ? 
`off-union: ${extra.join(', ')}` : ''); +} + +function checkTokenPatternClosure(): void { + // Instantiate every public token-pattern builder + the bare string. Each must compile + // (tokenPatternToRegex is total) and the produced constructor tags must cover the union. + const builders: [string, TokenPattern][] = [ + ['string', 'abc'], + ['anyChar', anyChar()], + ['charClass(oneOf)', oneOf('a', 'b')], + ['charClass(noneOf)', noneOf('x')], + ['charClass(range)', range('a', 'z')], + ['seq', seq('a', 'b')], + ['alt', altPattern('a', 'b')], + ['repeat(star)', star('a')], + ['repeat(plus)', plus('a')], + ['repeat(opt)', optPattern('a')], + ['repeat(n)', repeat('a', 2, 4)], + ['lookahead(+)', followedBy('a')], + ['lookahead(-)', notFollowedBy('a')], + ['lookbehind(+)', precededBy('a')], + ['lookbehind(-)', notPrecededBy('a')], + ['anchor(start)', start()], + ['anchor(end)', end()], + ['never', never()], + ]; + const tags = new Set(); + let allCompiled = true; + for (const [label, p] of builders) { + let src = ''; + try { src = tokenPatternToRegex(p); } catch { allCompiled = false; check(`Lemma A2: tokenPatternToRegex compiles ${label}`, false, 'threw'); continue; } + check(`Lemma A2: tokenPatternToRegex compiles ${label} → non-empty regex`, typeof src === 'string', `got ${typeof src}`); + if (typeof p === 'string') tags.add('string'); else tags.add(p.type); + } + const missing = TOKEN_PATTERN_UNION.filter(t => !tags.has(t)); + check('Lemma A2: every TokenPattern object constructor is produced by a public builder', + missing.length === 0, missing.length ? 
`unreached: ${missing.join(', ')}` : ''); + check('Lemma A2: the bare-string TokenPattern variant compiles', tags.has('string')); + void allCompiled; +} + +// ════════════════════════════════════════════════════════════════════════════ +// REACHABILITY — every emitted repo key reachable from root ∪ export surfaces +// ════════════════════════════════════════════════════════════════════════════ + +interface TmGrammarJson { patterns?: unknown[]; repository?: Record; scopeName?: string } + +// The DECLARED export surfaces of a grammar — repository keys an external embedder reaches +// not from the root but by an explicit `#` include: the #expression sub-grammar +// (expressionRule) and the canonicalRepoNames OFFICIAL keys (and aliasScopes, which re-expose +// the whole grammar). These are root-UNreachable BY DESIGN (a public repository API). +function exportSurfaceKeys(g: CstGrammar): string[] { + const out: string[] = []; + if (g.expressionRule) out.push('expression'); + for (const k of Object.keys(g.canonicalRepoNames ?? {})) out.push(k); + return out; +} + +interface ReachResult { repoKeys: number; reached: number; dead: string[]; danglingWithSource: string[] } + +function checkReachability(g: CstGrammar, tm: TmGrammarJson): ReachResult { + const scope = tm.scopeName ?? g.scopeName ?? `source.${g.name}`; + const repo = tm.repository ?? 
{}; + const reached = new Set(); + const queue: string[] = []; + const visit = (node: any): void => { + if (!node || typeof node !== 'object') return; + if (Array.isArray(node)) { node.forEach(visit); return; } + if (typeof node.include === 'string') { + const inc: string = node.include; + if (inc === '$self') { /* root */ } + else if (inc.startsWith('#')) queue.push(inc.slice(1)); + else if (inc.startsWith(scope + '#')) queue.push(inc.slice(scope.length + 1)); + // else external grammar — terminal + } + if (node.patterns) visit(node.patterns); + for (const capKey of ['captures', 'beginCaptures', 'endCaptures', 'whileCaptures']) + if (node[capKey]) for (const c of Object.values(node[capKey])) visit(c); + }; + visit(tm.patterns ?? []); + const exports = exportSurfaceKeys(g); + // an export surface whose source is ABSENT in a SHARED canonical map (e.g. `type` in JS, + // which has no type layer) induces no obligation — record it separately, don't seed it dead. + const danglingWithSource: string[] = []; + for (const s of exports) { queue.push(s); } + while (queue.length) { + const key = queue.shift()!; + if (reached.has(key)) continue; + reached.add(key); + if (repo[key]) visit(repo[key]); + } + const allKeys = Object.keys(repo); + const dead = allKeys.filter(k => !reached.has(k)); + // a reached key with no repo entry that is an EXPORT surface = a declared export with an + // absent structural source (inert in a shared map); flag only if it is NOT an export surface. + for (const k of reached) if (!repo[k] && !exports.includes(k)) danglingWithSource.push(k); + return { repoKeys: allKeys.length, reached: [...reached].filter(k => repo[k]).length, dead, danglingWithSource }; +} + +// ── Token emitter completeness: every non-skip token has a discharging emission path ── +// A token bears a leaf-scope obligation unless it is `skip` (trivia / whitespace). 
Each is +// discharged by exactly one family: the flat token loop (a `#` repository entry), the +// regex-literal family (a `regex`-flagged token), the indent/markup ENGINE (a `never()` +// placeholder pattern the region machinery replaces), the markup region machinery (a markup +// grammar emits no per-token keys — generateMarkupTm owns text/tag/attr), or a region that +// owns the token's delimiter (the JSX `/>` / `; orphans: string[] } +function tokenCensus(g: CstGrammar, tmJson: TmGrammarJson): TokenCensus { + const repo = tmJson.repository ?? {}; + const full = JSON.stringify(tmJson); + const byPath: Record = {}; + const orphans: string[] = []; + let skip = 0; + const bump = (p: string) => byPath[p] = (byPath[p] ?? 0) + 1; + for (const t of g.tokens) { + if (t.flags.includes('skip')) { skip++; continue; } + if (repo[t.name.toLowerCase()]) { bump('flat'); continue; } + if (t.flags.includes('regex')) { bump('regex-family'); continue; } + if (tokenPatternIsNever(t)) { bump('engine-emitted'); continue; } + if (g.markup) { bump('markup-region'); continue; } // generateMarkupTm owns it + const delim = tokenPatternLiteralText(t); // a region owns this token's delimiter? 
+ if (delim && full.includes(JSON.stringify(delim).slice(1, -1))) { bump('region-owned'); continue; } + orphans.push(`${t.name}[${t.flags.join(',') || '-'}]`); + } + return { total: g.tokens.length, skip, byPath, orphans }; +} + +// ════════════════════════════════════════════════════════════════════════════ +// shared vscode-textmate tokenizer (one WASM load) — reused by Layer B coverage +// ════════════════════════════════════════════════════════════════════════════ +const { INITIAL, Registry, parseRawGrammar } = vsctm; +const { loadWASM, OnigScanner, OnigString } = onig; +const require = createRequire(import.meta.url); +const wasmBin = readFileSync(require.resolve('vscode-oniguruma/release/onig.wasm')); +await loadWASM(wasmBin.buffer.slice(wasmBin.byteOffset, wasmBin.byteOffset + wasmBin.byteLength)); + +async function loadTmFromObject(scopeName: string, grammars: Record): Promise { + const reg = new Registry({ + onigLib: Promise.resolve({ createOnigScanner: (p: string[]) => new OnigScanner(p), createOnigString: (s: string) => new OnigString(s) }), + loadGrammar: async (sn: string) => grammars[sn] ? parseRawGrammar(JSON.stringify(grammars[sn]), sn + '.json') : null, + }); + return reg.loadGrammar(scopeName); +} +async function loadTmFromFiles(scopeName: string, files: Record): Promise { + const cache: Record = {}; + const reg = new Registry({ + onigLib: Promise.resolve({ createOnigScanner: (p: string[]) => new OnigScanner(p), createOnigString: (s: string) => new OnigString(s) }), + loadGrammar: async (sn: string) => { const p = files[sn]; if (!p) return null; const c = cache[sn] ?? 
(cache[sn] = readFileSync(p, 'utf8')); return parseRawGrammar(c, sn + '.json'); }, + }); + return reg.loadGrammar(scopeName); +} +function tmTokenize(grammar: vsctm.IGrammar, text: string): TmTok[] { + const toks: TmTok[] = []; let rs = INITIAL, off = 0; + for (const line of text.split('\n')) { const r = grammar.tokenizeLine(line, rs); for (const t of r.tokens) toks.push({ start: off + t.startIndex, end: off + t.endIndex, scopes: t.scopes }); rs = r.ruleStack; off += line.length + 1; } + return toks; +} + +// ════════════════════════════════════════════════════════════════════════════ +// LAYER B1 — empirical leaf coverage (fixed denominator) +// +// Every CONTENT/keyword leaf (a leaf the grammar's OWN role map says must read as a +// keyword / string / number / comment) must be PAINTED — recognised and given a scope +// beyond the bare document root, never left as inert text. The denominator is the +// grammar-derived obligation leaves over the deterministic corpus; the role map and the +// corpus are the SAME independent infrastructure the soundness checks use (no co-bias +// with gen-tm's own detectors). A leaf painted SOME non-root scope discharges its +// recognise-and-scope obligation; whether that scope is the RIGHT one is soundness +// (test/scope-gap.ts + test/gap-ledger.ts), a separate axis. 
+// ════════════════════════════════════════════════════════════════════════════ +const CONTENT_OBLIGATION = new Set(['keyword', 'string', 'number', 'comment']); + +interface CoverageResult { den: number; painted: number; uncovered: { text: string; want: string; ctx: string }[] } + +function leafCoverage(grammar: CstGrammar, tm: vsctm.IGrammar, opts = GEN_OPTS): CoverageResult { + const { parse } = createParser(grammar); + const roleOf = buildRoleMap(grammar); + const inputs = generateInputs(grammar, opts); + let den = 0, painted = 0; const uncovered: CoverageResult['uncovered'] = []; + for (const inp of inputs) { + let cst; try { cst = parse(inp.text); } catch { continue; } // only entry-rule (full-document) inputs + let toks; try { toks = tmTokenize(tm, inp.text); } catch { continue; } + for (const lf of leafRoles(grammar, cst, inp.text, roleOf)) { + if (![...lf.expected].some(b => CONTENT_OBLIGATION.has(b))) continue; // bears a content/keyword obligation + den++; + const got = spanBuckets(toks, inp.text, lf.start, lf.end); + if ([...got].some(b => b !== 'none')) painted++; // recognised + scoped + else if (uncovered.length < 20) uncovered.push({ text: lf.text, want: [...lf.expected].join('|'), ctx: inp.text.slice(Math.max(0, lf.start - 6), lf.end + 6).replace(/\n/g, '\\n') }); + } + } + return { den, painted, uncovered }; +} + +// ════════════════════════════════════════════════════════════════════════════ +// LAYER A (cont.) — the literal-collection backbone is total + drops nothing consumed +// +// The flat keyword/operator scoping in gen-tm.ts is driven by the SHARED primitive +// collectLiterals (src/grammar-utils.ts), looped over every rule body. 
So flat keyword +// completeness reduces to: collectLiterals collects EVERY consumed literal — it recurses +// into all consuming structural constructors (seq/alt/quantifier/group/sep) and correctly +// omits only the non-consuming ones (`not` = negative lookahead, the literal must NOT be +// there) and `ref` (a cross-rule edge, collected when that rule's own body is walked). +// Witnessed by nesting a sentinel literal under each constructor. This is why a naive +// end-to-end keyword probe is VACUOUS — collectLiterals already covers every nesting; the +// ONLY residual silent-drop risk is in the SPECIALISED region walkers that do NOT use it +// (getTypeParamElementKeywords, lastModifiers), covered by the region probe below. +// ════════════════════════════════════════════════════════════════════════════ +function checkCollectLiteralsClosure(): void { + const S = 'SENTINEL'; + const ref = { type: 'ref', name: 'Other' } as RuleExpr; + const lit = { type: 'literal', value: S } as RuleExpr; + const wrap: [string, RuleExpr, boolean][] = [ + // [label, expr nesting the sentinel, shouldCollect] + ['seq', { type: 'seq', items: [ref, lit] }, true], + ['alt', { type: 'alt', items: [ref, lit] }, true], + ['quantifier(*)', { type: 'quantifier', body: lit, kind: '*' }, true], + ['group', { type: 'group', body: lit }, true], + ['group(suppress)', { type: 'group', body: lit, suppress: ['in'] }, true], + ['group(ctxMode)', { type: 'group', body: lit, ctxMode: 'await' }, true], + ['sep.element', { type: 'sep', element: lit, delimiter: ',' }, true], + ['sep.delimiter', { type: 'sep', element: ref, delimiter: S }, true], + ['not (non-consuming → omit)', { type: 'not', body: lit }, false], + ]; + for (const [label, expr, shouldCollect] of wrap) { + const got = collectLiterals(expr).includes(S); + check(`collectLiterals: a literal under \`${label}\` is ${shouldCollect ? 
'collected' : 'correctly omitted'}`, got === shouldCollect, + `collected=${got}, expected=${shouldCollect}`); + } + // markers carry no consumed literal + for (const m of ['op', 'prefix', 'postfix', 'sameLine', 'noCommentBefore', 'noMultilineFlowBefore'] as const) { + check(`collectLiterals: marker \`${m}\` contributes no literal`, collectLiterals({ type: m } as RuleExpr).length === 0); + } +} + +// ════════════════════════════════════════════════════════════════════════════ +// LAYER B2 — region-internal keyword preservation (positive control) +// +// Inside a derived `<…>` type-parameter region (scoped meta.type.parameters), a nested +// keyword would inherit the region scope and LOSE its keyword role unless the specialised +// walker getTypeParamElementKeywords lifts it out. That walker collects the element's DIRECT +// structural keywords (recursing seq / alt / quantifier / group / sep) — exactly what `extends` / +// `const` / `in` / `out` need. It deliberately does NOT reach through `ref` (a constraint's +// TYPE, e.g. `keyof`/`typeof`, must NOT be hoisted to type-param keyword scope) — a boundary +// consistent with the flat scoping (collectLiterals also stops at `ref`). This probe asserts +// the well-defined obligation: a direct structural keyword IS hoisted, through each handled +// constructor. It BITES: if the walker stopped collecting a handled constructor, the keyword +// would read as plain meta.type content. +// ════════════════════════════════════════════════════════════════════════════ +async function regionKeywordProbe(): Promise { + const Ident = token(plus(range('a', 'z')), { identifier: true }); + // a type-param element with keywords reached through each HANDLED constructor: + // kwa via quantifier(opt), extends via opt+seq, kwsep DIRECT inside a `sep` sub-list. + // `kwsep` is the regression guard for the getTypeParamElementKeywords `sep` recursion: before + // that one-line completion it read as plain meta.type content (the latent silent drop). 
+ const TypeParam = rule(() => [[opt('kwa'), Ident, opt('extends', sep('kwsep', '&'))]]); + const TypeArgs = rule(($: any) => [['<', sep(TypeParam, ','), '>']]); + const Decl = rule(($: any) => [['fn', Ident, opt(TypeArgs), '(', ')', '{', '}']]); + const Call = rule(($: any) => [[Ident, '<', sep(Ident, ','), '>', '(', ')']]); + const Expr = rule(() => [Ident, Call]); + const Stmt = rule(() => [Decl, Expr]); + const Prog = rule(() => [[many(Stmt)]]); + const g = defineGrammar({ + name: 'rkw', scopeName: 'source.rkw', tokens: { Ident }, + prec: [none('<', '>')], scopes: { 'storage.type.function': ['fn'], 'keyword.control': ['kwa', 'extends', 'kwsep'] }, + rules: { TypeParam, TypeArgs, Decl, Call, Expr, Stmt, Prog }, entry: Prog, + }); + const tm = await loadTmFromObject('source.rkw', { 'source.rkw': generateTmLanguage(g) as unknown as object }); + if (!tm) { check('region-keyword probe: grammar loads', false); return; } + const witness = 'fn f(){}'; + const toks = tmTokenize(tm, witness); + for (const kw of ['kwa', 'extends', 'kwsep']) { + const at = witness.indexOf(kw); + const got = spanBuckets(toks, witness, at, at + kw.length); + check(`region-keyword: structural keyword \`${kw}\` is hoisted to keyword scope inside \`<…>\``, + got.has('keyword'), `got {${[...got].join(',')}}`); + } +} + +// ════════════════════════════════════════════════════════════════════════════ +// driver +// ════════════════════════════════════════════════════════════════════════════ +interface GrammarCfg { name: string; module: string; scopeName: string; tm: string; tmExtra?: Record } +const GRAMMARS: GrammarCfg[] = [ + { name: 'typescript', module: '../typescript.ts', scopeName: 'source.ts', tm: 'typescript.tmLanguage.json' }, + { name: 'javascript', module: '../javascript.ts', scopeName: 'source.js', tm: 'javascript.tmLanguage.json' }, + { name: 'typescriptreact', module: '../typescriptreact.ts', scopeName: 'source.tsx', tm: 'typescriptreact.tmLanguage.json' }, + { name: 
'javascriptreact', module: '../javascriptreact.ts', scopeName: 'source.js.jsx', tm: 'javascriptreact.tmLanguage.json' }, + { name: 'html', module: '../html.ts', scopeName: 'text.html.basic', tm: 'html.tmLanguage.json', + tmExtra: { 'source.js': 'javascript.tmLanguage.json', 'source.css': 'html.tmLanguage.json' } }, + { name: 'yaml', module: '../yaml.ts', scopeName: 'source.yaml', tm: 'yaml.tmLanguage.json' }, +]; + +// ── the fixed-denominator obligation census per grammar (the ledger row) ── +interface LedgerRow { + name: string; + tokenObl: number; tokenDisch: number; // non-skip tokens, each → a discharge path + litObl: number; // distinct keyword literals (painted ⇐ leaf coverage) + opObl: number; // distinct Pratt operators + keyObl: number; keyReach: number; // repository keys, each → reachable + leafObl: number; leafPaint: number; // empirical content/keyword leaves, each → painted +} +function ledgerRow(name: string, g: CstGrammar, tmJson: TmGrammarJson, r: ReachResult, tc: TokenCensus, cov: CoverageResult): LedgerRow { + const lits = new Set(); + for (const rule of g.rules) for (const l of collectLiterals(rule.body)) if (isKeywordLiteral(l)) lits.add(l); + const ops = new Set(); + for (const p of g.precs) for (const o of p.operators) ops.add(o.value); + for (const lp of g.ledPrecs ?? 
[]) ops.add(lp.connector); + return { + name, + tokenObl: g.tokens.filter(t => !t.flags.includes('skip')).length, tokenDisch: g.tokens.filter(t => !t.flags.includes('skip')).length - tc.orphans.length, + litObl: lits.size, opObl: ops.size, + keyObl: r.repoKeys, keyReach: r.repoKeys - r.dead.length, + leafObl: cov.den, leafPaint: cov.painted, + }; +} + +// the auto-generated ledger block (a region in COMPLETENESS.md, like KNOWN-GAPS.md / the README issue table) +function renderLedger(rows: LedgerRow[]): string { + const L: string[] = []; + L.push(''); + L.push(''); + L.push('| Grammar | Tokens | Keyword literals | Operators | Repo keys (reachable) | Leaf obligations (painted) |'); + L.push('|---|---:|---:|---:|---:|---:|'); + const sum = { t: 0, td: 0, lit: 0, op: 0, k: 0, kr: 0, lf: 0, lp: 0 }; + for (const r of rows) { + L.push(`| ${r.name} | ${r.tokenDisch}/${r.tokenObl} | ${r.litObl} | ${r.opObl} | ${r.keyReach}/${r.keyObl} | ${r.leafPaint}/${r.leafObl} |`); + sum.t += r.tokenObl; sum.td += r.tokenDisch; sum.lit += r.litObl; sum.op += r.opObl; + sum.k += r.keyObl; sum.kr += r.keyReach; sum.lf += r.leafObl; sum.lp += r.leafPaint; + } + L.push(`| **total** | **${sum.td}/${sum.t}** | **${sum.lit}** | **${sum.op}** | **${sum.kr}/${sum.k}** | **${sum.lp}/${sum.lf}** |`); + L.push(''); + // the fixed denominator = every measured obligation (token-discharge + key-reachability + leaf-painting) + const den = sum.t + sum.k + sum.lf, num = sum.td + sum.kr + sum.lp; + L.push(`**Fixed-denominator completeness: ${num}/${den} = ${(100 * num / den).toFixed(2)}%** ` + + `(token discharge ${sum.td}/${sum.t} · repository reachability ${sum.kr}/${sum.k} · leaf painting ${sum.lp}/${sum.lf}). ` + + `Keyword literals (${sum.lit}) and Pratt operators (${sum.op}) are discharged through the leaf-painting column. ` + + `${num === den ? 
'**0 open completeness gaps.**' : `**${den - num} OPEN GAP(S).**`}`); + L.push(''); + L.push(''); + return L.join('\n'); +} + +const LEDGER_FILE = 'COMPLETENESS.md'; +function spliceRegion(file: string, block: string): { changed: boolean; full: string } { + const start = ''; + const cur = existsSync(file) ? readFileSync(file, 'utf8') : ''; + const si = cur.indexOf(start), ei = cur.indexOf(end); + if (si < 0 || ei < 0) return { changed: cur !== '', full: cur }; // markers absent → leave the file alone + const full = cur.slice(0, si) + block + cur.slice(ei + end.length); + return { changed: full !== cur, full }; +} + +async function main(): Promise { + const WRITE = process.argv.includes('--write'); + const CHECK = process.argv.includes('--check'); + + console.log('── Layer A: algebra closure ──'); + checkRuleExprClosure(); + checkTokenPatternClosure(); + + console.log('── Layer A: no consumed literal is silently dropped (collectLiterals backbone) ──'); + checkCollectLiteralsClosure(); + await regionKeywordProbe(); + + console.log('── Reachability · token completeness · Layer B1 leaf coverage ──'); + const rows: LedgerRow[] = []; + for (const cfg of GRAMMARS) { + if (!existsSync(cfg.tm)) { console.log(` ${cfg.name}: (no emitted grammar)`); continue; } + const g = (await import(cfg.module)).default as CstGrammar; + const tmJson = JSON.parse(readFileSync(cfg.tm, 'utf8')) as TmGrammarJson; + const r = checkReachability(g, tmJson); + check(`reachability(${cfg.name}): no dead repository keys`, r.dead.length === 0, r.dead.join(', ')); + check(`reachability(${cfg.name}): no dangling self-#refs with present source`, r.danglingWithSource.length === 0, r.danglingWithSource.join(', ')); + const tc = tokenCensus(g, tmJson); + check(`token-completeness(${cfg.name}): every non-skip token has a discharge path`, tc.orphans.length === 0, `orphans: ${tc.orphans.join(' ')}`); + const tm = await loadTmFromFiles(cfg.scopeName, { [cfg.scopeName]: cfg.tm, ...(cfg.tmExtra ?? 
{}) }); + let cov: CoverageResult = { den: 0, painted: 0, uncovered: [] }; + if (tm) cov = leafCoverage(g, tm); + check(`coverage(${cfg.name}): every content/keyword obligation leaf is painted`, cov.painted === cov.den, + cov.uncovered.map(u => `"${u.text}"(${u.want})`).slice(0, 8).join(' ')); + rows.push(ledgerRow(cfg.name, g, tmJson, r, tc, cov)); + const pct = cov.den ? (100 * cov.painted / cov.den).toFixed(2) : '—'; + console.log(` ${cfg.name.padEnd(17)} repo ${String(r.repoKeys).padStart(3)} · dead ${r.dead.length} · tokens ${tc.total - tc.skip - tc.orphans.length}/${tc.total - tc.skip} · leaf-coverage ${cov.painted}/${cov.den} = ${pct}%`); + if (cov.uncovered.length) for (const u of cov.uncovered.slice(0, 6)) console.log(` UNCOVERED "${u.text}" want ${u.want} ctx …${u.ctx}…`); + } + + const block = renderLedger(rows); + if (WRITE) { + const { changed, full } = spliceRegion(LEDGER_FILE, block); + if (existsSync(LEDGER_FILE) && full.includes('COMPLETENESS-LEDGER:START')) { writeFileSync(LEDGER_FILE, full); console.log(`\n${changed ? '✓ updated' : '· unchanged'} ${LEDGER_FILE} ledger region`); } + else console.log(`\n(no ${LEDGER_FILE} ledger markers yet — block below)\n\n${block}`); + } + if (CHECK) { + const { changed } = spliceRegion(LEDGER_FILE, block); + check(`${LEDGER_FILE} ledger region is up to date`, !changed || !existsSync(LEDGER_FILE), `run: node test/tm-completeness.ts --write`); + } + + console.log(''); + for (const f of fails) console.log(' ' + f); + console.log(`\n${failN === 0 ? `✓ ${pass}/${pass} completeness checks pass` : `✗ ${failN} FAILED (${pass} passed)`}`); + process.exit(failN === 0 ? 
0 : 1); +} + +if ((import.meta as any).main) await main(); From 3b0e1dd114ef38d3bb9c144405e824afc2c6bc15 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 14:55:53 +0800 Subject: [PATCH 02/14] Mutation-test the completeness detector (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A passing checker means nothing if the checker is blind. Add test/tm-mutation.ts: it injects a catalogue of known gaps into the emitted grammar (drop a key / all of a token's includes, neuter a scope to the bare root, add a dead key, a dangling include, mis-scope a token to a wrong role, reorder two disambiguation patterns) and records which detector layer kills each — measuring the detector's power instead of asserting it. Measured: every PRESENCE / REACHABILITY gap is killed (16/16; 12/16 corpus-free, by reachability / token-census / the new flat-token neuter check, and the remaining four by a targeted differential witness); WRONG-ROLE gaps are caught only by a differential witness (presence ≠ correctness); ORDERING gaps are a measured blind spot — TextMate is order-sensitive and pattern rank lives in the emitted artifact, not the grammar algebra, so no corpus-free structural check reaches it. This is the honest boundary, now empirical: the structural proof covers presence + reachability; ordering / correctness are the soundness axis, reached only by evaluation. The over-claim of an a-priori "no gap can hide" over the whole gap space is dropped — COMPLETENESS.md states the bounded, measured claim. The harness motivated one detector strengthening: tokenCensus now flags a flat token NEUTERED to the bare root scope (an entry that exists but paints inert document text), moving that gap class from differential-only to corpus-free. Wired as a meta-gate in `npm run check`. 
--- COMPLETENESS.md | 37 ++++++- package.json | 1 + test/check.ts | 1 + test/tm-completeness.ts | 31 +++--- test/tm-mutation.ts | 207 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 262 insertions(+), 15 deletions(-) create mode 100644 test/tm-mutation.ts diff --git a/COMPLETENESS.md b/COMPLETENESS.md index 99f63a3..321680b 100644 --- a/COMPLETENESS.md +++ b/COMPLETENESS.md @@ -111,8 +111,35 @@ The empirical witness that all of the above actually paint is **leaf coverage**: deterministic grammar-derived corpus (`test/grammar-gen.ts`), every parsed leaf whose by-construction role (`buildRoleMap`) is a content/keyword role (keyword / string / number / comment) is confirmed to receive a non-root scope. The denominator is fixed (the obligation -leaves); the metric is non-vacuous (deleting a discharging repository key drops it below 100%). -Result: **2433/2433 across all six grammars.** +leaves). Result: **2433/2433 across all six grammars.** + +## Measuring the detector — mutation testing + +A passing checker is worthless if the checker is *blind* — the corpus-trap this project has been +bitten by. So the guarantee is not asserted, it is **measured**: `test/tm-mutation.ts` injects a +catalogue of known gaps into the emitted grammar (drop a key, drop all of a token's includes, +neuter a scope to the bare root, add a dead key, a dangling include, mis-scope a token to a wrong +role, reorder two disambiguation patterns) and records which detector layer — if any — kills each. +The honest, measured result: + +- **Presence gaps** (a token / scope / key dropped or neutered): **16/16 killed · 12/16 by a + CORPUS-FREE structural detector** (reachability dead/dangling · the token census · the flat-token + neuter check). The remaining four — a *region* token neutered — are caught by a targeted + differential witness, not corpus-free. 
**No presence gap survives; this is the gate.** +- **Wrong-role gaps** (a token still painted, but the wrong role): caught by the differential + (a bucket change at the witness), *not* by the structural detector — a token that *is* painted + satisfies presence. This is the completeness/soundness seam: presence ≠ correctness. +- **Ordering gaps** (two patterns reordered so a looser rule shadows a tighter one): a **measured + blind spot**. TextMate is order-sensitive, and which pattern wins is a property of the emitted + artifact's *sequence*, not the grammar's algebra — so no corpus-free structural check reaches it, + and a scope-preserving reorder slips even the bucket-level differential. + +So the claim this document makes is bounded and measured: **every presence / reachability gap is +caught** (mutation-proven, the gate; 12/16 corpus-free, the rest by a targeted differential witness); **wrong-role and ordering gaps are the soundness / +interaction axis**, reached only by evaluation (the differential, or `test/gap-ledger.ts`), never by +a grammar-algebraic proof. An a-priori "no gap can hide" over the *whole* gap space is not available +— ordering and correctness obligations live in the emitted artifact and slide toward regex-vs-CFG +undecidability — and this document does not claim it. ## Reachability — root ∪ export surfaces @@ -179,8 +206,10 @@ Auto-generated by `node test/tm-completeness.ts --write`; `--check` fails CI if ## The gates that hold this exact - `test/tm-completeness.ts` — Layer A closure (RuleExpr / TokenPattern / `collectLiterals`), the - `sep`-recursion regression guard, reachability, the token census, and leaf coverage with a fixed - denominator. `npm run completeness` prints it; `npm run completeness:check` gates the ledger. + `sep`-recursion regression guard, reachability, the token census (orphans + neuter), and leaf + coverage with a fixed denominator. `npm run completeness` prints it; `npm run completeness:check` gates the ledger. 
+- `test/tm-mutation.ts` — the **meta-gate**: injects known gaps and asserts every presence gap is + killed with no false alarms, measuring (not asserting) the detector's power. `npm run completeness:mutation`. - `test/agnostic.ts` — detector shape-completeness: the detectors fire on structure, not on TS names, so "every shape that bears the obligation is detected" holds for any grammar. - `test/scope-gap.ts`, `test/gap-ledger.ts` — the **soundness** axis (is each painted scope diff --git a/package.json b/package.json index c578977..8fe6555 100644 --- a/package.json +++ b/package.json @@ -39,6 +39,7 @@ "completeness": "node test/tm-completeness.ts", "completeness:check": "node test/tm-completeness.ts --check", "completeness:write": "node test/tm-completeness.ts --write", + "completeness:mutation": "node test/tm-mutation.ts", "ledger:selftest": "node test/gap-ledger-selftest.ts", "ledger:issues": "node test/gap-issues.ts", "ledger:issues:dry": "node test/gap-issues.ts --dry-run", diff --git a/test/check.ts b/test/check.ts index 3aefb9e..1658343 100644 --- a/test/check.ts +++ b/test/check.ts @@ -37,6 +37,7 @@ const GATES: Gate[] = [ { group: 'conformance', name: 'html', args: ['test/html-conformance.ts'] }, { group: 'highlighter', name: 'tm-guards', args: ['test/tm-highlight-guards.ts'] }, { group: 'highlighter', name: 'tm-completeness', args: ['test/tm-completeness.ts', '--check'] }, + { group: 'highlighter', name: 'tm-mutation', args: ['test/tm-mutation.ts'] }, { group: 'highlighter', name: 'tm-diagnostics', args: ['test/redcmd-tm-diagnostics.ts'] }, { group: 'highlighter', name: 'angle-depth', args: ['test/angle-depth-probe.ts'] }, { group: 'highlighter', name: 'html-monarch', args: ['test/html-monarch.ts'] }, diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index a62cc2b..72b7773 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -180,7 +180,7 @@ function checkTokenPatternClosure(): void { // REACHABILITY — every emitted repo key 
reachable from root ∪ export surfaces // ════════════════════════════════════════════════════════════════════════════ -interface TmGrammarJson { patterns?: unknown[]; repository?: Record; scopeName?: string } +export interface TmGrammarJson { patterns?: unknown[]; repository?: Record; scopeName?: string } // The DECLARED export surfaces of a grammar — repository keys an external embedder reaches // not from the root but by an explicit `#` include: the #expression sub-grammar @@ -193,9 +193,9 @@ function exportSurfaceKeys(g: CstGrammar): string[] { return out; } -interface ReachResult { repoKeys: number; reached: number; dead: string[]; danglingWithSource: string[] } +export interface ReachResult { repoKeys: number; reached: number; dead: string[]; danglingWithSource: string[] } -function checkReachability(g: CstGrammar, tm: TmGrammarJson): ReachResult { +export function checkReachability(g: CstGrammar, tm: TmGrammarJson): ReachResult { const scope = tm.scopeName ?? g.scopeName ?? `source.${g.name}`; const repo = tm.repository ?? {}; const reached = new Set(); @@ -242,17 +242,25 @@ function checkReachability(g: CstGrammar, tm: TmGrammarJson): ReachResult { // grammar emits no per-token keys — generateMarkupTm owns text/tag/attr), or a region that // owns the token's delimiter (the JSX `/>` / `; orphans: string[] } -function tokenCensus(g: CstGrammar, tmJson: TmGrammarJson): TokenCensus { +export interface TokenCensus { total: number; skip: number; byPath: Record; orphans: string[]; neutered: string[] } +export function tokenCensus(g: CstGrammar, tmJson: TmGrammarJson): TokenCensus { const repo = tmJson.repository ?? {}; + const root = tmJson.scopeName ?? `source.${g.name}`; const full = JSON.stringify(tmJson); const byPath: Record = {}; const orphans: string[] = []; + const neutered: string[] = []; // a flat token whose entry exists but paints only the bare root (no visual scope) let skip = 0; const bump = (p: string) => byPath[p] = (byPath[p] ?? 
0) + 1; + // a flat `{name, match}` entry discharges its scope obligation only if `name` is a real + // visual scope — not the bare document root and not empty. An entry whose name was reduced + // to the root scope is a "neuter" gap (the token tokenises but reads as inert document text), + // structurally visible without any corpus. + const flatNeutered = (e: any): boolean => !e.begin && !e.patterns && (!e.name || String(e.name).split(' ').every((s: string) => s === root || !s)); for (const t of g.tokens) { if (t.flags.includes('skip')) { skip++; continue; } - if (repo[t.name.toLowerCase()]) { bump('flat'); continue; } + const flat = repo[t.name.toLowerCase()]; + if (flat) { if (flatNeutered(flat)) neutered.push(`${t.name}→${(flat as any).name ?? '∅'}`); else bump('flat'); continue; } if (t.flags.includes('regex')) { bump('regex-family'); continue; } if (tokenPatternIsNever(t)) { bump('engine-emitted'); continue; } if (g.markup) { bump('markup-region'); continue; } // generateMarkupTm owns it @@ -260,7 +268,7 @@ function tokenCensus(g: CstGrammar, tmJson: TmGrammarJson): TokenCensus { if (delim && full.includes(JSON.stringify(delim).slice(1, -1))) { bump('region-owned'); continue; } orphans.push(`${t.name}[${t.flags.join(',') || '-'}]`); } - return { total: g.tokens.length, skip, byPath, orphans }; + return { total: g.tokens.length, skip, byPath, orphans, neutered }; } // ════════════════════════════════════════════════════════════════════════════ @@ -272,7 +280,7 @@ const require = createRequire(import.meta.url); const wasmBin = readFileSync(require.resolve('vscode-oniguruma/release/onig.wasm')); await loadWASM(wasmBin.buffer.slice(wasmBin.byteOffset, wasmBin.byteOffset + wasmBin.byteLength)); -async function loadTmFromObject(scopeName: string, grammars: Record): Promise { +export async function loadTmFromObject(scopeName: string, grammars: Record): Promise { const reg = new Registry({ onigLib: Promise.resolve({ createOnigScanner: (p: string[]) => new 
OnigScanner(p), createOnigString: (s: string) => new OnigString(s) }), loadGrammar: async (sn: string) => grammars[sn] ? parseRawGrammar(JSON.stringify(grammars[sn]), sn + '.json') : null, @@ -287,7 +295,7 @@ async function loadTmFromFiles(scopeName: string, files: Record) }); return reg.loadGrammar(scopeName); } -function tmTokenize(grammar: vsctm.IGrammar, text: string): TmTok[] { +export function tmTokenize(grammar: vsctm.IGrammar, text: string): TmTok[] { const toks: TmTok[] = []; let rs = INITIAL, off = 0; for (const line of text.split('\n')) { const r = grammar.tokenizeLine(line, rs); for (const t of r.tokens) toks.push({ start: off + t.startIndex, end: off + t.endIndex, scopes: t.scopes }); rs = r.ruleStack; off += line.length + 1; } return toks; @@ -307,9 +315,9 @@ function tmTokenize(grammar: vsctm.IGrammar, text: string): TmTok[] { // ════════════════════════════════════════════════════════════════════════════ const CONTENT_OBLIGATION = new Set(['keyword', 'string', 'number', 'comment']); -interface CoverageResult { den: number; painted: number; uncovered: { text: string; want: string; ctx: string }[] } +export interface CoverageResult { den: number; painted: number; uncovered: { text: string; want: string; ctx: string }[] } -function leafCoverage(grammar: CstGrammar, tm: vsctm.IGrammar, opts = GEN_OPTS): CoverageResult { +export function leafCoverage(grammar: CstGrammar, tm: vsctm.IGrammar, opts = GEN_OPTS): CoverageResult { const { parse } = createParser(grammar); const roleOf = buildRoleMap(grammar); const inputs = generateInputs(grammar, opts); @@ -510,6 +518,7 @@ async function main(): Promise { check(`reachability(${cfg.name}): no dangling self-#refs with present source`, r.danglingWithSource.length === 0, r.danglingWithSource.join(', ')); const tc = tokenCensus(g, tmJson); check(`token-completeness(${cfg.name}): every non-skip token has a discharge path`, tc.orphans.length === 0, `orphans: ${tc.orphans.join(' ')}`); + 
check(`token-completeness(${cfg.name}): no flat token is neutered to the bare root scope`, tc.neutered.length === 0, `neutered: ${tc.neutered.join(' ')}`); const tm = await loadTmFromFiles(cfg.scopeName, { [cfg.scopeName]: cfg.tm, ...(cfg.tmExtra ?? {}) }); let cov: CoverageResult = { den: 0, painted: 0, uncovered: [] }; if (tm) cov = leafCoverage(g, tm); diff --git a/test/tm-mutation.ts b/test/tm-mutation.ts new file mode 100644 index 0000000..67160f0 --- /dev/null +++ b/test/tm-mutation.ts @@ -0,0 +1,207 @@ +// ───────────────────────────────────────────────────────────────────────────── +// tm-mutation.ts — MUTATION TESTING for the completeness gap-detector. +// +// The completeness checker (test/tm-completeness.ts) proves structural properties +// (closure, reachability, token discharge, leaf coverage). But "the checker passes" +// only means something if the checker can actually FAIL when there IS a gap. A clean +// pass on a blind checker is worthless — the exact corpus-blindness this project has +// been bitten by. So this harness MEASURES the detector's power directly: it INJECTS a +// catalogue of known gaps into the emitted grammar (fault injection), runs every +// detector layer, and records which layer (if any) catches each. +// +// This is the honest answer to "can every gap be found?" — not an a-priori completeness +// claim (the review showed ordering / disambiguation-correctness obligations are not +// grammar-algebraic and slide into undecidable territory), but a MEASURED kill rate: +// +// • PRESENCE gaps (a token / scope / key dropped or neutered) MUST be killed by a +// corpus-free STRUCTURAL detector (reachability / token-census / leaf-coverage). +// A surviving presence mutant is a detector bug → this gate fails. 
+// • CORRECTNESS / ORDERING gaps (a disambiguation guard weakened, two patterns +// reordered) are EXPECTED to slip past the structural detectors — they are caught, +// if at all, only by a differential WITNESS (a paint change on a targeted input). +// Survivors here are the detector's MEASURED blind spots, reported not failed: they +// are the honest boundary COMPLETENESS.md draws, made empirical. +// +// Run: node test/tm-mutation.ts +// ───────────────────────────────────────────────────────────────────────────── +import { generateTmLanguage } from '../src/gen-tm.ts'; +import { createParser } from '../src/gen-parser.ts'; +import type { CstGrammar } from '../src/types.ts'; +import { generateInputs } from './grammar-gen.ts'; +import { buildRoleMap, leafRoles, spanBuckets, scopeAt, GEN_OPTS, type TmTok, type Bucket } from './generative-detect.ts'; +import { + checkReachability, tokenCensus, leafCoverage, loadTmFromObject, tmTokenize, + type TmGrammarJson, +} from './tm-completeness.ts'; + +// ── a mutation: a precise, kind-labelled fault injected into the emitted grammar ── +type MutClass = 'presence' | 'correctness' | 'ordering'; +interface Mutation { + label: string; + cls: MutClass; + // mutate the (already-deep-cloned) emitted grammar in place; return false to skip + // (the site does not exist in this grammar — keeps the catalogue grammar-agnostic). 
+ apply: (tm: any) => boolean; + witness?: string; // a targeted input the differential detector tokenises + leaf?: string; // the substring whose paint the differential watches + equivalent?: boolean; // a true gap is created (false) vs a no-op the detector SHOULDN'T flag (true) +} + +const rootIncludeIndex = (tm: any, key: string) => + (tm.patterns as any[]).findIndex(p => p?.include === `#${key}`); +// recursively delete every `{include:#key}` anywhere in the grammar (so the key truly dies) +function dropAllIncludes(node: any, key: string): void { + if (!node || typeof node !== 'object') return; + if (Array.isArray(node)) { for (let i = node.length - 1; i >= 0; i--) { if (node[i]?.include === `#${key}`) node.splice(i, 1); else dropAllIncludes(node[i], key); } return; } + for (const v of Object.values(node)) dropAllIncludes(v, key); +} + +// the catalogue is built PER-WITNESS: we tokenise the baseline, find the repository key that +// ACTUALLY paints each witness leaf, and target THAT key — so a mutation creates a real gap +// instead of an equivalent mutant (e.g. dropping #number's ROOT include is a no-op because +// #number is still reachable from #expression; only dropping ALL includes truly kills it). +function buildCatalogue(tm: any, paintKey: (w: string, leaf: string) => string | null): Mutation[] { + const root = String(tm.scopeName ?? 
'source'); + const lang = root.replace(/^(source|text)\./, ''); + const muts: Mutation[] = []; + const sites: { witness: string; leaf: string; role: string }[] = [ + { witness: 'q = 42', leaf: '42', role: 'number' }, + { witness: 'q = "x"', leaf: '"x"', role: 'string' }, + { witness: 'a // c', leaf: '// c', role: 'comment' }, + ]; + for (const s of sites) { + const key = paintKey(s.witness, s.leaf); + if (!key) continue; + // PRESENCE — a corpus-free structural detector must kill each of these: + muts.push({ label: `drop ${s.role} key (all includes + entry)`, cls: 'presence', witness: s.witness, leaf: s.leaf, + apply: (t) => { dropAllIncludes(t, key); delete t.repository[key]; return true; } }); + muts.push({ label: `neuter ${s.role} scope → bare root`, cls: 'presence', witness: s.witness, leaf: s.leaf, + apply: (t) => { t.repository[key] = { ...t.repository[key], name: root }; if (t.repository[key].patterns || t.repository[key].begin) { delete t.repository[key].beginCaptures; delete t.repository[key].endCaptures; t.repository[key].patterns = []; } return true; } }); + // CORRECTNESS — a VALID grammar that paints the WRONG role (leaf still painted, just wrong): + muts.push({ label: `mis-scope ${s.role} → keyword (wrong role, still painted)`, cls: 'correctness', witness: s.witness, leaf: s.leaf, + apply: (t) => { t.repository[key] = { ...t.repository[key], name: `keyword.control.${lang}` }; return true; } }); + } + // PRESENCE — a real dead key (nothing includes it) and a real dangling include: + muts.push({ label: 'add an unreachable (dead) repo key', cls: 'presence', + apply: (t) => { t.repository['__orphan__'] = { match: 'zzzqqq', name: `comment.${lang}` }; return true; } }); + muts.push({ label: 'dangling include to a missing key', cls: 'presence', + apply: (t) => { t.patterns.unshift({ include: '#__ghost__' }); return true; } }); + // ORDERING — flip a disambiguation priority so a looser rule shadows a tighter one: + if (tm.repository['generic-call'] && 
rootIncludeIndex(tm, 'comparison') >= 0) { + muts.push({ label: 'move generic-call after comparison (priority flip)', cls: 'ordering', witness: 'a<T>(x)', leaf: 'T', + apply: (t) => { const gi = rootIncludeIndex(t, 'generic-call'); if (gi < 0) return false; const [g] = t.patterns.splice(gi, 1); t.patterns.push(g); return true; } }); + } + return muts; +} + +// ── detectors ────────────────────────────────────────────────────────────────────── +// corpus-FREE structural detectors (the ones whose guarantee is a-priori, not sampled) +function structuralCatches(g: CstGrammar, mutated: TmGrammarJson): string[] { + const hits: string[] = []; + const r = checkReachability(g, mutated); + if (r.dead.length) hits.push(`reachability:dead(${r.dead.join(',')})`); + if (r.danglingWithSource.length) hits.push(`reachability:dangling(${r.danglingWithSource.join(',')})`); + const c = tokenCensus(g, mutated); + if (c.orphans.length) hits.push(`token-census:orphan(${c.orphans.join(',')})`); + if (c.neutered.length) hits.push(`token-census:neutered(${c.neutered.join(',')})`); + return hits; +} +// load that survives an invalid mutated grammar (a broken regex) — a grammar that fails +// to compile is itself a detectable defect, reported as compile-error rather than crashing. +async function tryLoad(scope: string, grammar: object): Promise<{ tm: any } | { err: string }> { + try { const tm = await loadTmFromObject(scope, { [scope]: grammar }); return tm ? { tm } : { err: 'load-null' }; } + catch (e: any) { return { err: `compile-error(${String(e?.message ?? e).slice(0, 30)})` }; } +} +// grammar-derived-corpus detector (leaf coverage over generated inputs) +async function corpusCatches(g: CstGrammar, scope: string, mutated: object): Promise<string | null> { + const r = await tryLoad(scope, mutated); + if ('err' in r) return `leaf-coverage:${r.err}`; + const cov = leafCoverage(g, r.tm, { ...GEN_OPTS, maxInputs: 250 }); + return cov.painted < cov.den ?
`leaf-coverage(${cov.painted}/${cov.den})` : null; +} +// targeted DIFFERENTIAL detector: did the witness leaf's paint change vs baseline? +async function differentialCatches(scope: string, base: object, mutated: object, witness: string, leaf: string): Promise<string | null> { + const [bt, mt] = await Promise.all([tryLoad(scope, base), tryLoad(scope, mutated)]); + if ('err' in bt) return null; + if ('err' in mt) return `differential:${mt.err}`; + const at = witness.indexOf(leaf); if (at < 0) return null; + const bb = bucketsAt(bt.tm, witness, at, leaf.length), mb = bucketsAt(mt.tm, witness, at, leaf.length); + const bs = [...bb].sort().join('|'), ms = [...mb].sort().join('|'); + return bs !== ms ? `differential({${bs||'∅'}}→{${ms||'∅'}})` : null; +} +function bucketsAt(tm: any, text: string, start: number, len: number): Set<Bucket> { + return spanBuckets(tmTokenize(tm, text), text, start, start + len); +} + +// ── driver ────────────────────────────────────────────────────────────────────────── +interface Row { grammar: string; label: string; cls: MutClass; equivalent: boolean; killedBy: string[]; survived: boolean; skipped: boolean } + +async function runGrammar(name: string, module: string, scope: string): Promise<Row[]> { + const g = (await import(module)).default as CstGrammar; + const base = generateTmLanguage(g) as any; + if (base.scopeName) scope = base.scopeName; + const baseTm = await loadTmFromObject(scope, { [scope]: base }); + if (!baseTm) return []; + // the painting-key finder: the repo key whose `name` paints a witness leaf (sampled at the + // leaf's MIDDLE char, so a string's CONTENT scope is found, not its delimiter punctuation). + const paintKey = (witness: string, leaf: string): string | null => { + const at = witness.indexOf(leaf); if (at < 0) return null; + const inner = scopeAt(tmTokenize(baseTm, witness), at + Math.floor(leaf.length / 2)).at(-1) ??
''; + if (!inner || inner === scope) return null; + for (const [k, v] of Object.entries(base.repository) as [string, any][]) if (v?.name === inner) return k; + for (const [k, v] of Object.entries(base.repository) as [string, any][]) if (typeof v?.name === 'string' && inner.startsWith(v.name + '.')) return k; + return null; + }; + const rows: Row[] = []; + for (const m of buildCatalogue(base, paintKey)) { + const mutated = structuredClone(base); + if (!m.apply(mutated)) { rows.push({ grammar: name, label: m.label, cls: m.cls, equivalent: !!m.equivalent, killedBy: [], survived: false, skipped: true }); continue; } + const killedBy = structuralCatches(g, mutated); + const corpus = await corpusCatches(g, scope, mutated); if (corpus) killedBy.push(corpus); + if (m.witness && m.leaf) { const d = await differentialCatches(scope, base, mutated, m.witness, m.leaf); if (d) killedBy.push(d); } + rows.push({ grammar: name, label: m.label, cls: m.cls, equivalent: !!m.equivalent, killedBy, survived: killedBy.length === 0, skipped: false }); + } + return rows; +} + +async function main(): Promise<void> { + const GRAMMARS = [ + { name: 'typescript', module: '../typescript.ts', scope: 'source.ts' }, + { name: 'yaml', module: '../yaml.ts', scope: 'source.yaml' }, + ]; + const rows: Row[] = []; + for (const cfg of GRAMMARS) rows.push(...await runGrammar(cfg.name, cfg.module, cfg.scope)); + + console.log('── mutation testing: which detector layer kills each injected gap ──\n'); + for (const r of rows) { + const mark = r.skipped ? '·' : r.equivalent ? (r.survived ? '✓' : '⚠') : r.survived ? '✗' : '✓'; + const by = r.skipped ? '(site n/a — skipped)' + : r.equivalent ? (r.survived ? 'correctly NOT flagged (no-op mutant)' : `FALSE ALARM: ${r.killedBy.join(' ')}`) + : r.survived ? 'SURVIVED — no detector caught it' : r.killedBy.join(' '); + console.log(` ${mark} [${r.cls.padEnd(11)}]${r.equivalent ?
'[equiv]' : ' '} ${r.grammar.padEnd(11)} ${r.label.padEnd(52)} ${by}`); + } + + const live = rows.filter(r => !r.skipped); + const real = live.filter(r => !r.equivalent); + const presence = real.filter(r => r.cls === 'presence'); + const presenceSurvivors = presence.filter(r => r.survived); + const structuralKill = (r: Row) => r.killedBy.some(k => k.startsWith('reachability') || k.startsWith('token-census')); + const corrOrder = real.filter(r => r.cls !== 'presence'); + const corrOrderSurvivors = corrOrder.filter(r => r.survived); + const falseAlarms = live.filter(r => r.equivalent && !r.survived); + + console.log('\n── measured detection power ──'); + console.log(` presence gaps : ${presence.length - presenceSurvivors.length}/${presence.length} killed · ${presence.filter(structuralKill).length}/${presence.length} by a CORPUS-FREE structural detector`); + console.log(` correctness/ordering : ${corrOrder.length - corrOrderSurvivors.length}/${corrOrder.length} caught (differential) · ${corrOrderSurvivors.length} survived (measured blind spot)`); + console.log(` equivalent controls : ${falseAlarms.length} false alarm(s) (a precision bug if > 0)`); + + // GATE: every real presence gap MUST be killed; no equivalent mutant may be falsely flagged. + // correctness/ordering survivors are the honest, documented boundary — reported, not failed. 
+ const failures = [...presenceSurvivors.map(r => `presence SURVIVED: ${r.grammar} — ${r.label}`), + ...falseAlarms.map(r => `FALSE ALARM on equivalent mutant: ${r.grammar} — ${r.label}`)]; + if (failures.length) { console.log('\n✗ detector defect(s):'); for (const f of failures) console.log(` - ${f}`); process.exit(1); } + console.log(`\n✓ every presence gap killed, no false alarms; correctness/ordering blind spots measured = ${corrOrderSurvivors.length} (the boundary COMPLETENESS.md states).`); + void createParser; void buildRoleMap; void leafRoles; void generateInputs; +} + +if ((import.meta as any).main) await main(); From 978cd2c73f3e454c96742e06fd117d59948770b6 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 20:00:00 +0800 Subject: [PATCH 03/14] Make keyword completeness decidable, not corpus-witnessed (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The keyword/operator obligation was the one class still checked by the grammar-derived corpus (leaf coverage). Replace it with a structural, a-priori discharge: `literalDischarge` confirms every alphabetic literal the grammar consumes (collectLiterals over every rule + the prec/led tables) appears, as a scoped word, in some REACHABLE pattern whose scope is a keyword family — a finite scan of the emitted artifact, no corpus. 248/248 across the six grammars; non-vacuous (stripping `class` from its patterns reports it undischarged). Completeness is now a decidable structural check end to end — token discharge (census, incl. neuter) + keyword-literal discharge + repository reachability, 952/952 = 100%, no corpus. The leaf-coverage corpus pass is demoted to a redundant differential cross-check on the soundness axis; `tm-mutation`'s structural layer now also kills a dropped/neutered keyword corpus-free. 
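As a rough illustration of the structural discharge described above, the check can be sketched on toy data. This is a minimal sketch, not the `literalDischarge` in test/tm-completeness.ts: the `SketchPattern` shape, the two-family `KEYWORD_SCOPE` regex, and the helper names `scopedWords` / `undischarged` are assumptions made for the example, and reachability is taken as already computed.

```typescript
// Hypothetical, simplified pattern shape (assumed for illustration only).
interface SketchPattern { name?: string; match?: string }

// A reduced keyword-family test; the real check is assumed to cover more scope families.
const KEYWORD_SCOPE = /^(keyword|storage)\b/;

// Collect the alphabetic words that keyword-scoped patterns mention.
// Regex escapes such as `\b` are blanked first so `\bfrom\b` yields `from`, not `bfrom`.
function scopedWords(patterns: SketchPattern[]): Set<string> {
  const out = new Set<string>();
  for (const p of patterns) {
    if (!p.name || !p.match || !KEYWORD_SCOPE.test(p.name)) continue;
    const cleaned = p.match.replace(/\\[A-Za-z]/g, ' ');
    for (const w of cleaned.match(/[A-Za-z]+/g) ?? []) out.add(w);
  }
  return out;
}

// A literal is discharged iff some keyword-scoped pattern mentions it as a word.
// The return value lists the undischarged literals (the gaps), sorted.
function undischarged(literals: string[], patterns: SketchPattern[]): string[] {
  const scoped = scopedWords(patterns);
  return literals.filter(l => !scoped.has(l)).sort();
}
```

The scan is finite because both inputs are finite lists, which is the property the message relies on when it calls the check decidable. A literal that only appears in a pattern outside the keyword families is reported as a gap.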
COMPLETENESS.md draws the line correctly: COMPLETENESS (present + reachable + scoped) is decidable — finite G, finite gen-tm(G), an obligation taxonomy bounded by TextMate's finite construct kinds, per-token discharge by structural identity (the flat `match` IS the token's own pattern) — and ∀ G by structural induction over the finite combinator algebra. SOUNDNESS (do the present constructs paint correctly on all inputs — wrong-role, pattern ordering) is the undecidable residual (CFG vs regex-stack-machine over infinite input). The earlier "a-priori completeness is unavailable" was an over-concession: that was soundness's wall, mistaken for completeness's. --- COMPLETENESS.md | 71 ++++++++++++++++----------- test/tm-completeness.ts | 104 ++++++++++++++++++++++++++++++---------- test/tm-mutation.ts | 4 +- 3 files changed, 125 insertions(+), 54 deletions(-) diff --git a/COMPLETENESS.md b/COMPLETENESS.md index 321680b..2f36a54 100644 --- a/COMPLETENESS.md +++ b/COMPLETENESS.md @@ -46,7 +46,10 @@ value of a **closed union**: `RuleExpr` has 15 constructors and `TokenPattern` h `aliasScopes`, `expressionRule`, `manifest`). An *obligation* is induced by a constructor-occurrence or a config-field-occurrence. So completeness reduces to: **for each obligation generator, the generator has a discharging, reachable emission** — three -mechanically-checkable layers. +mechanically-checkable layers. Both sides are finite — a finite `G`, a finite `gen-tm(G)`, and an +obligation taxonomy bounded by TextMate's finite construct kinds — so completeness is a **decidable** +property per grammar, and holds **∀ G by structural induction** over the finite combinator algebra +(finitely many cases). It is checked a-priori on the emitted artifact, with no corpus. 
## Layer A — closure: the universe is the algebra, and lowering is total @@ -95,11 +98,16 @@ co-blind): region machinery (a `markup` grammar emits no per-token keys), or a region that owns the token's delimiter (the JSX `/>` / ``/backreferences are non-regular). So this document +proves completeness and *measures* its detector (mutation testing); soundness it does not claim to +decide — that is `test/gap-ledger.ts`'s by-construction + corpus axis. The earlier framing that +"a-priori completeness over the whole gap space is unavailable" was an over-concession: completeness +is available; it was soundness's wall, mistaken for completeness's. ## Reachability — root ∪ export surfaces @@ -189,17 +204,17 @@ Auto-generated by `node test/tm-completeness.ts --write`; `--check` fails CI if -| Grammar | Tokens | Keyword literals | Operators | Repo keys (reachable) | Leaf obligations (painted) | -|---|---:|---:|---:|---:|---:| -| typescript | 11/11 | 73 | 53 | 158/158 | 199/199 | -| javascript | 11/11 | 48 | 51 | 103/103 | 131/131 | -| typescriptreact | 13/13 | 73 | 53 | 171/171 | 169/169 | -| javascriptreact | 13/13 | 48 | 51 | 116/116 | 121/121 | -| html | 7/7 | 0 | 0 | 28/28 | 175/175 | -| yaml | 19/19 | 0 | 0 | 54/54 | 1638/1638 | -| **total** | **74/74** | **242** | **208** | **630/630** | **2433/2433** | - -**Fixed-denominator completeness: 3137/3137 = 100.00%** (token discharge 74/74 · repository reachability 630/630 · leaf painting 2433/2433). Keyword literals (242) and Pratt operators (208) are discharged through the leaf-painting column. 
**0 open completeness gaps.** +| Grammar | Tokens | Keyword literals | Repo keys (reachable) | Leaf cross-check (corpus) | +|---|---:|---:|---:|---:| +| typescript | 11/11 | 73/73 | 158/158 | 199/199 | +| javascript | 11/11 | 51/51 | 103/103 | 131/131 | +| typescriptreact | 13/13 | 73/73 | 171/171 | 169/169 | +| javascriptreact | 13/13 | 51/51 | 116/116 | 121/121 | +| html | 7/7 | 0/0 | 28/28 | 175/175 | +| yaml | 19/19 | 0/0 | 54/54 | 1638/1638 | +| **total** | **74/74** | **248/248** | **630/630** | **2433/2433** | + +**Decidable completeness: 952/952 = 100.00%** (token discharge 74/74 · keyword-literal discharge 248/248 · repository reachability 630/630) — a structural check on the emitted artifact, no corpus. Leaf cross-check (corpus, redundant): 2433/2433. **0 open completeness gaps.** diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index 72b7773..37b5626 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -336,6 +336,61 @@ export function leafCoverage(grammar: CstGrammar, tm: vsctm.IGrammar, opts = GEN return { den, painted, uncovered }; } +// ════════════════════════════════════════════════════════════════════════════ +// STRUCTURAL literal discharge — DECIDABLE keyword completeness (no corpus) +// +// Every alphabetic literal/operator the grammar consumes bears a keyword-scope obligation. +// It is discharged iff it appears, as a SCOPED word, in some REACHABLE pattern whose scope +// is a keyword family. This is a finite, structural check on the emitted artifact — the +// a-priori (not corpus-witnessed) proof that every keyword is scoped. It asks only whether a +// scoping pattern is PRESENT (completeness); whether its guard fires correctly is soundness. 
+// ════════════════════════════════════════════════════════════════════════════ +const KEYWORD_FAMILY = /^(keyword|storage|constant\.language|support\.(type|class|function|constant)|variable\.language|entity\.name\.(type|tag)|punctuation\.definition\.keyword)/; + +// every reachable pattern NODE (root ∪ export surfaces), the same closure as checkReachability +function reachableNodes(g: CstGrammar, tmJson: TmGrammarJson): any[] { + const scope = tmJson.scopeName ?? `source.${g.name}`; + const repo = (tmJson.repository ?? {}) as Record<string, any>; + const reached = new Set<string>(); const queue: string[] = []; const out: any[] = []; + const visit = (node: any): void => { + if (!node || typeof node !== 'object') return; + if (Array.isArray(node)) { node.forEach(visit); return; } + out.push(node); + if (typeof node.include === 'string') { const inc: string = node.include; if (inc.startsWith('#')) queue.push(inc.slice(1)); else if (inc.startsWith(scope + '#')) queue.push(inc.slice(scope.length + 1)); } + if (node.patterns) visit(node.patterns); + for (const c of ['captures', 'beginCaptures', 'endCaptures', 'whileCaptures']) if (node[c]) for (const v of Object.values(node[c])) visit(v); + }; + visit(tmJson.patterns ?? []); + if (g.expressionRule) queue.push('expression'); + for (const k of Object.keys(g.canonicalRepoNames ?? {})) queue.push(k); + while (queue.length) { const k = queue.shift()!; if (reached.has(k)) continue; reached.add(k); if (repo[k]) visit(repo[k]); } + return out; +} +// the alphabetic words a node SCOPES under a keyword-family scope (lookarounds + `\b`/`\w`-escapes +// stripped so a word-boundary doesn't fuse with the word, e.g.
`\bfrom\b` → `from`, not `bfrom`) +function scopedAtoms(nodes: any[]): Set { + const out = new Set(); + const keywordScoped = (n: any): boolean => (typeof n.name === 'string' && KEYWORD_FAMILY.test(n.name)) + || (['captures', 'beginCaptures', 'endCaptures'] as const).some(c => n[c] && Object.values(n[c]).some((cc: any) => typeof cc?.name === 'string' && KEYWORD_FAMILY.test(cc.name))); + for (const n of nodes) { + if (!keywordScoped(n)) continue; + const re = (n.match ?? n.begin ?? '') as string; + const cleaned = re.replace(/\(\?(); + for (const r of g.rules) for (const l of collectLiterals(r.body)) if (isKeywordLiteral(l)) lits.add(l.replace(/^@/, '')); + for (const p of g.precs) for (const o of p.operators) if (isKeywordLiteral(o.value)) lits.add(o.value); + for (const lp of g.ledPrecs ?? []) if (isKeywordLiteral(lp.connector)) lits.add(lp.connector); + const gaps = [...lits].filter(l => !scoped.has(l)).sort(); + return { obl: lits.size, gaps }; +} + // ════════════════════════════════════════════════════════════════════════════ // LAYER A (cont.) 
— the literal-collection backbone is total + drops nothing consumed // @@ -439,21 +494,16 @@ const GRAMMARS: GrammarCfg[] = [ interface LedgerRow { name: string; tokenObl: number; tokenDisch: number; // non-skip tokens, each → a discharge path - litObl: number; // distinct keyword literals (painted ⇐ leaf coverage) - opObl: number; // distinct Pratt operators + litObl: number; litDisch: number; // alphabetic keyword literals, each → a reachable keyword-scoped pattern (structural) keyObl: number; keyReach: number; // repository keys, each → reachable - leafObl: number; leafPaint: number; // empirical content/keyword leaves, each → painted + leafObl: number; leafPaint: number; // empirical content/keyword leaves (the corpus cross-check) } -function ledgerRow(name: string, g: CstGrammar, tmJson: TmGrammarJson, r: ReachResult, tc: TokenCensus, cov: CoverageResult): LedgerRow { - const lits = new Set(); - for (const rule of g.rules) for (const l of collectLiterals(rule.body)) if (isKeywordLiteral(l)) lits.add(l); - const ops = new Set(); - for (const p of g.precs) for (const o of p.operators) ops.add(o.value); - for (const lp of g.ledPrecs ?? 
[]) ops.add(lp.connector); +function ledgerRow(name: string, g: CstGrammar, r: ReachResult, tc: TokenCensus, ld: LiteralDischarge, cov: CoverageResult): LedgerRow { + const nonSkip = g.tokens.filter(t => !t.flags.includes('skip')).length; return { name, - tokenObl: g.tokens.filter(t => !t.flags.includes('skip')).length, tokenDisch: g.tokens.filter(t => !t.flags.includes('skip')).length - tc.orphans.length, - litObl: lits.size, opObl: ops.size, + tokenObl: nonSkip, tokenDisch: nonSkip - tc.orphans.length - tc.neutered.length, + litObl: ld.obl, litDisch: ld.obl - ld.gaps.length, keyObl: r.repoKeys, keyReach: r.repoKeys - r.dead.length, leafObl: cov.den, leafPaint: cov.painted, }; @@ -464,21 +514,23 @@ function renderLedger(rows: LedgerRow[]): string { const L: string[] = []; L.push(''); L.push(''); - L.push('| Grammar | Tokens | Keyword literals | Operators | Repo keys (reachable) | Leaf obligations (painted) |'); - L.push('|---|---:|---:|---:|---:|---:|'); - const sum = { t: 0, td: 0, lit: 0, op: 0, k: 0, kr: 0, lf: 0, lp: 0 }; + L.push('| Grammar | Tokens | Keyword literals | Repo keys (reachable) | Leaf cross-check (corpus) |'); + L.push('|---|---:|---:|---:|---:|'); + const sum = { t: 0, td: 0, lit: 0, ld: 0, k: 0, kr: 0, lf: 0, lp: 0 }; for (const r of rows) { - L.push(`| ${r.name} | ${r.tokenDisch}/${r.tokenObl} | ${r.litObl} | ${r.opObl} | ${r.keyReach}/${r.keyObl} | ${r.leafPaint}/${r.leafObl} |`); - sum.t += r.tokenObl; sum.td += r.tokenDisch; sum.lit += r.litObl; sum.op += r.opObl; + L.push(`| ${r.name} | ${r.tokenDisch}/${r.tokenObl} | ${r.litDisch}/${r.litObl} | ${r.keyReach}/${r.keyObl} | ${r.leafPaint}/${r.leafObl} |`); + sum.t += r.tokenObl; sum.td += r.tokenDisch; sum.lit += r.litObl; sum.ld += r.litDisch; sum.k += r.keyObl; sum.kr += r.keyReach; sum.lf += r.leafObl; sum.lp += r.leafPaint; } - L.push(`| **total** | **${sum.td}/${sum.t}** | **${sum.lit}** | **${sum.op}** | **${sum.kr}/${sum.k}** | **${sum.lp}/${sum.lf}** |`); + L.push(`| **total** | 
**${sum.td}/${sum.t}** | **${sum.ld}/${sum.lit}** | **${sum.kr}/${sum.k}** | **${sum.lp}/${sum.lf}** |`); L.push(''); - // the fixed denominator = every measured obligation (token-discharge + key-reachability + leaf-painting) - const den = sum.t + sum.k + sum.lf, num = sum.td + sum.kr + sum.lp; - L.push(`**Fixed-denominator completeness: ${num}/${den} = ${(100 * num / den).toFixed(2)}%** ` + - `(token discharge ${sum.td}/${sum.t} · repository reachability ${sum.kr}/${sum.k} · leaf painting ${sum.lp}/${sum.lf}). ` + - `Keyword literals (${sum.lit}) and Pratt operators (${sum.op}) are discharged through the leaf-painting column. ` + + // the DECIDABLE fixed denominator = the structural obligations (token discharge + keyword-literal + // discharge + repository reachability), checked a-priori on the emitted artifact, no corpus. The + // leaf cross-check is the redundant corpus witness (the soundness-axis dual), reported separately. + const den = sum.t + sum.k + sum.lit, num = sum.td + sum.kr + sum.ld; + L.push(`**Decidable completeness: ${num}/${den} = ${(100 * num / den).toFixed(2)}%** ` + + `(token discharge ${sum.td}/${sum.t} · keyword-literal discharge ${sum.ld}/${sum.lit} · repository reachability ${sum.kr}/${sum.k}) — ` + + `a structural check on the emitted artifact, no corpus. Leaf cross-check (corpus, redundant): ${sum.lp}/${sum.lf}. ` + `${num === den ? 
'**0 open completeness gaps.**' : `**${den - num} OPEN GAP(S).**`}`); L.push(''); L.push(''); @@ -519,14 +571,16 @@ async function main(): Promise { const tc = tokenCensus(g, tmJson); check(`token-completeness(${cfg.name}): every non-skip token has a discharge path`, tc.orphans.length === 0, `orphans: ${tc.orphans.join(' ')}`); check(`token-completeness(${cfg.name}): no flat token is neutered to the bare root scope`, tc.neutered.length === 0, `neutered: ${tc.neutered.join(' ')}`); + const ld = literalDischarge(g, tmJson); + check(`literal-completeness(${cfg.name}): every keyword literal/operator is in a reachable keyword-scoped pattern`, ld.gaps.length === 0, `undischarged: ${ld.gaps.join(' ')}`); const tm = await loadTmFromFiles(cfg.scopeName, { [cfg.scopeName]: cfg.tm, ...(cfg.tmExtra ?? {}) }); let cov: CoverageResult = { den: 0, painted: 0, uncovered: [] }; if (tm) cov = leafCoverage(g, tm); - check(`coverage(${cfg.name}): every content/keyword obligation leaf is painted`, cov.painted === cov.den, + check(`coverage cross-check(${cfg.name}): every content/keyword obligation leaf is painted`, cov.painted === cov.den, cov.uncovered.map(u => `"${u.text}"(${u.want})`).slice(0, 8).join(' ')); - rows.push(ledgerRow(cfg.name, g, tmJson, r, tc, cov)); + rows.push(ledgerRow(cfg.name, g, r, tc, ld, cov)); const pct = cov.den ? 
(100 * cov.painted / cov.den).toFixed(2) : '—'; - console.log(` ${cfg.name.padEnd(17)} repo ${String(r.repoKeys).padStart(3)} · dead ${r.dead.length} · tokens ${tc.total - tc.skip - tc.orphans.length}/${tc.total - tc.skip} · leaf-coverage ${cov.painted}/${cov.den} = ${pct}%`); + console.log(` ${cfg.name.padEnd(17)} repo ${String(r.repoKeys).padStart(3)} · dead ${r.dead.length} · tokens ${tc.total - tc.skip - tc.orphans.length}/${tc.total - tc.skip} · keyword-literals ${ld.obl - ld.gaps.length}/${ld.obl} · leaf-xcheck ${cov.painted}/${cov.den}`); if (cov.uncovered.length) for (const u of cov.uncovered.slice(0, 6)) console.log(` UNCOVERED "${u.text}" want ${u.want} ctx …${u.ctx}…`); } diff --git a/test/tm-mutation.ts b/test/tm-mutation.ts index 67160f0..5b817f3 100644 --- a/test/tm-mutation.ts +++ b/test/tm-mutation.ts @@ -30,7 +30,7 @@ import type { CstGrammar } from '../src/types.ts'; import { generateInputs } from './grammar-gen.ts'; import { buildRoleMap, leafRoles, spanBuckets, scopeAt, GEN_OPTS, type TmTok, type Bucket } from './generative-detect.ts'; import { - checkReachability, tokenCensus, leafCoverage, loadTmFromObject, tmTokenize, + checkReachability, tokenCensus, literalDischarge, leafCoverage, loadTmFromObject, tmTokenize, type TmGrammarJson, } from './tm-completeness.ts'; @@ -104,6 +104,8 @@ function structuralCatches(g: CstGrammar, mutated: TmGrammarJson): string[] { const c = tokenCensus(g, mutated); if (c.orphans.length) hits.push(`token-census:orphan(${c.orphans.join(',')})`); if (c.neutered.length) hits.push(`token-census:neutered(${c.neutered.join(',')})`); + const ld = literalDischarge(g, mutated); + if (ld.gaps.length) hits.push(`literal-discharge(${ld.gaps.slice(0, 3).join(',')})`); return hits; } // load that survives an invalid mutated grammar (a broken regex) — a grammar that fails From 387b6508c46081351a6f29db81075609be865e87 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 21:37:14 +0800 Subject: [PATCH 04/14] Fix two 
more sep-omission twins of the type-param keyword drop (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit An adversarial gap-hunt (5 agents over the classes the structural checker does NOT cover) found that the getTypeParamElementKeywords `sep` drop had un-fixed siblings in the same family: - detectTypeParamConstraintKeywords.scanConstraint omitted `sep`, so a type-parameter CONSTRAINT keyword reached through a `&`/`,`-separated list was not collected — the .tsx generic-arrow⇄JSX no-comma disambiguation lost its constraint signal (mis-scoping the header as a JSX tag). - detectDeclarations.containsBlockRef omitted `sep`, so a declaration whose brace body is reached through a `sep` was not seen as having a body — its #declaration-body member-scoping region was dropped. Both recurse into `sep.element` now, mirroring the prior fix. Byte-identical on all six shipped grammars (latent: every shipped grammar writes constraints as `opt('extends', Type)` and block bodies as direct refs). The hunt found 8 latent completeness gaps total; all were VERIFIED latent — tokenizing the witnesses against the shipped grammars shows ternary, calls, trailing-comma type params, and every prec operator are correctly scoped, so the 0-gap soundness ledger is real, not corpus-blind. The remaining gaps are detector SHAPE-FRAGILITY (fixed-offset window matching in detectTernary / detectCallExpression / isAngleBracketSepRule misses equivalent factorings), an unimplemented `rawBlock` config region, and punctuation-in-region — tracked for follow-up; none triggers on a shipped grammar. 
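The shape of this drop, and of the fix, can be sketched on a reduced expression union. The `SketchExpr` type below is a hypothetical stand-in with far fewer constructors than the real `RuleExpr`, and this `containsBlockRef` takes the block-rule predicate as a parameter purely to keep the example self-contained. It is an illustration of the recursion, not the code in src/gen-tm.ts.

```typescript
// Hypothetical reduced union (assumed for illustration; the real algebra is larger).
type SketchExpr =
  | { type: 'literal'; value: string }
  | { type: 'ref'; name: string }
  | { type: 'seq'; items: SketchExpr[] }
  | { type: 'alt'; items: SketchExpr[] }
  | { type: 'quantifier'; body: SketchExpr }
  | { type: 'group'; body: SketchExpr }
  | { type: 'sep'; element: SketchExpr; delimiter: string };

// Does the expression reach a block rule? Without the `sep` arm, a body reached
// through a separated list falls into the default case and is silently missed.
function containsBlockRef(expr: SketchExpr, isBlockRule: (name: string) => boolean): boolean {
  switch (expr.type) {
    case 'ref':
      return isBlockRule(expr.name);
    case 'seq':
    case 'alt':
      return expr.items.some(e => containsBlockRef(e, isBlockRule));
    case 'quantifier':
    case 'group':
      return containsBlockRef(expr.body, isBlockRule);
    case 'sep':
      // the arm this patch adds: recurse into the separated element
      return containsBlockRef(expr.element, isBlockRule);
    default:
      return false;
  }
}
```

In this sketch the silent drop comes from the `default: return false` fallthrough, which is why an omitted constructor produces no error and stays latent until a grammar actually uses that factoring.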
--- src/gen-tm.ts | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index 46372f5..ccaad9f 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -1109,6 +1109,10 @@ function detectTypeParamConstraintKeywords(grammar: CstGrammar, typeArgRule: str for (const it of (expr as { items: RuleExpr[] }).items) scanConstraint(it); } else if (expr.type === 'group') { scanConstraint((expr as { body: RuleExpr }).body); + } else if (expr.type === 'sep') { + // a constraint keyword reached through a `&`/`,`-separated sub-list is just as direct — + // recurse into the element (mirrors getTypeParamElementKeywords' `sep` arm). + scanConstraint((expr as { element: RuleExpr }).element); } }; for (const rule of grammar.rules) { @@ -3157,6 +3161,7 @@ function detectDeclarations(grammar: CstGrammar, tokenNames: Set): DeclI if (expr.type === 'ref') return isBlockRule(expr.name); if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(containsBlockRef); if (expr.type === 'quantifier' || expr.type === 'group') return containsBlockRef(expr.body); + if (expr.type === 'sep') return containsBlockRef(expr.element); return false; } From dc0079d98ae2b74eba4d2d2c19d2f7459a76df27 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 21:55:01 +0800 Subject: [PATCH 05/14] Root-cause: match normalized forms in shape-detectors (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The fixed-window detector fragility the gap-hunt surfaced was a symptom; the root cause is that the shape-detectors pattern-match RAW RuleExpr shapes, so an equivalent factoring of the same construct (an `opt`-tail, a separate args rule, a trailing comma) is a different shape and slips them. Adding a `sep` arm per detector only widens the symptom patch — the next factoring still slips. Fix the structural condition: match the NORMALISED form. 
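A minimal sketch of what matching a normalised form means, under stated assumptions: `FlatExpr` is a hypothetical reduced union, `expand` is a toy stand-in for the role `expandAlts` plays (it is not that function), and `hasTernary` uses a simplified window test (a ref, then `?`, then `:` two positions later) rather than the real detector's check.

```typescript
// Hypothetical reduced union (assumed for illustration only).
type FlatExpr =
  | { type: 'literal'; value: string }
  | { type: 'ref'; name: string }
  | { type: 'seq'; items: FlatExpr[] }
  | { type: 'alt'; items: FlatExpr[] }
  | { type: 'opt'; body: FlatExpr }
  | { type: 'group'; body: FlatExpr };

// Expand an expression into every flat sequence of leaves it can denote.
// seq: cross product of its items; alt: union; opt: empty or body; group: transparent.
function expand(e: FlatExpr): FlatExpr[][] {
  switch (e.type) {
    case 'seq':
      return e.items.reduce<FlatExpr[][]>(
        (acc, item) => acc.flatMap(prefix => expand(item).map(tail => [...prefix, ...tail])),
        [[]],
      );
    case 'alt':
      return e.items.flatMap(expand);
    case 'opt':
      return [[], ...expand(e.body)];
    case 'group':
      return expand(e.body);
    default:
      return [[e]];
  }
}

const isLit = (x: FlatExpr | undefined, v: string): boolean =>
  x !== undefined && x.type === 'literal' && x.value === v;

// Simplified window: ref '?' <anything> ':' somewhere in a flat sequence.
function hasTernary(e: FlatExpr): boolean {
  return expand(e).some(seq =>
    seq.some((x, i) => x.type === 'ref' && isLit(seq[i + 1], '?') && isLit(seq[i + 3], ':')));
}
```

Because the window is tested on the expanded sequences, `seq(Cond, '?', A, ':', B)` and `seq(Cond, opt(seq('?', A, ':', B)))` are detected alike, whereas a matcher over the raw tree would need one arm per factoring.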
expandAlts already canonicalises alt-split / opt-tail / group / quantifier into the same flat adjacency; route the three fragile detectors through it, plus FIRST-set for the one ref-hidden case: - detectTernary, isAngleBracketSepRule → expandAlts (opt-tail and trailing-comma factorings now reduce to the matched adjacency; `sep` stays opaque so the sep node survives). - detectCallExpression → FIRST(next) instead of a literal `(`, so the args may be the inline `(` OR a separate `CallArgs` rule referenced after the callee. Byte-identical on all six shipped grammars (they already write the canonical factoring); the three latent gaps are now emitted for their equivalent forms (verified: #ternary-expression / #function-call / #declaration-type-params). And the detector is no longer blind to this class: a new shape-robustness gate (test/tm-completeness.ts) asserts each construct emits its region for EVERY equivalent factoring — it bites (3 failures) without the normalization. So the class is now caught structurally, not just by an adversarial hunt. --- src/gen-tm.ts | 39 ++++++++++++++++++++------------ test/tm-completeness.ts | 50 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 75 insertions(+), 14 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index ccaad9f..dd4101e 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -1997,20 +1997,22 @@ function generateTypeCastPattern( * identifiers before '(' as entity.name.function. */ function detectCallExpression(grammar: CstGrammar): boolean { + const byName = new Map(grammar.rules.map(r => [r.name, r.body])); + // A call is a ref (the callee) immediately followed by something that STARTS with `(` — the + // arg list. Checking FIRST(next) instead of a literal `(` makes the detection transparent to + // factoring: the args may be the literal `(` directly, OR a separate rule (`CallArgs = '(' … ')'`) + // referenced after the callee — both have `(` in their FIRST set. 
+ const startsWithParen = (e: RuleExpr) => firstLiterals(e, byName).has('('); function checkSeq(items: RuleExpr[]): boolean { for (let i = 0; i < items.length - 1; i++) { - if (items[i].type === 'ref' && - items[i + 1].type === 'literal' && - (items[i + 1] as { value: string }).value === '(') { - return true; - } + if (items[i].type === 'ref' && startsWithParen(items[i + 1])) return true; } return false; } function walk(expr: RuleExpr): boolean { - if (expr.type === 'seq') return checkSeq(expr.items) || expr.items.some(walk); - if (expr.type === 'alt') return expr.items.some(walk); + if (expandAlts(expr).some(checkSeq)) return true; // normalized factorings (opt/alt/group) + if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(walk); if (expr.type === 'quantifier' || expr.type === 'group') return walk(expr.body); if (expr.type === 'sep') return walk(expr.element); return false; @@ -2185,9 +2187,13 @@ function detectTernary(grammar: CstGrammar): boolean { return false; } + // Match on the NORMALIZED forms (expandAlts canonicalises equivalent factorings — an + // `opt('?', $, ':', $)` tail, an alt-split, a group — into the same flat adjacency), plus a + // recurse into `sep` elements (expandAlts treats `sep` as opaque). So a ternary written any of + // those equivalent ways is detected, not only the one flat 5-window factoring. 
function walk(expr: RuleExpr): boolean { - if (expr.type === 'seq') return checkSeq(expr.items) || expr.items.some(walk); - if (expr.type === 'alt') return expr.items.some(walk); + if (expandAlts(expr).some(checkSeq)) return true; + if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(walk); if (expr.type === 'quantifier' || expr.type === 'group') return walk(expr.body); if (expr.type === 'sep') return walk(expr.element); return false; @@ -3093,11 +3099,16 @@ interface DeclInfo { } function isAngleBracketSepRule(body: RuleExpr): boolean { - if (body.type !== 'seq' || body.items.length !== 3) return false; - const [first, second, third] = body.items; - return first.type === 'literal' && first.value === '<' && - second.type === 'sep' && second.delimiter === ',' && - third.type === 'literal' && third.value === '>'; + // Match on the NORMALIZED forms so an equivalent factoring — a trailing `opt(',')`, an + // alt-split, a group wrapper — reduces to the same `'<' sep '>'` adjacency. expandAlts keeps + // `sep` opaque (it is in its default case), so the sep node survives the expansion. 
+ return expandAlts(body).some(items => { + if (items.length !== 3) return false; + const [first, second, third] = items; + return first.type === 'literal' && first.value === '<' && + second.type === 'sep' && second.delimiter === ',' && + third.type === 'literal' && third.value === '>'; + }); } function getTypeParamElementKeywords(body: RuleExpr, grammar: CstGrammar): string[] { diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index 37b5626..4b28094 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -476,6 +476,53 @@ async function regionKeywordProbe(): Promise { } } +// ════════════════════════════════════════════════════════════════════════════ +// SHAPE ROBUSTNESS — a shape-detector must fire on EVERY equivalent factoring +// +// The ~24 shape detectors in gen-tm.ts recognise a construct (ternary / call / generic +// type-params / …) by its structure. A detector that matches one FIXED factoring (a flat +// 5-window ternary, an inline `(`, a 3-item ``) silently drops the SAME construct +// written an equivalent way (an `opt`-tail, a separate args rule, a trailing comma) — the +// detector-fragility class the gap-hunt surfaced. The root-cause fix is to match a NORMALISED +// form (expandAlts + FIRST); this gate holds it: for each construct, several equivalent +// factorings must ALL emit the same region key. It BITES if a detector regresses to a fixed +// shape (a factoring loses the region). +// ════════════════════════════════════════════════════════════════════════════ +function checkShapeRobustness(): void { + const Id = token(plus(range('a', 'z')), { identifier: true }); + const emits = (key: string, build: () => Record): boolean => { + try { return !!(generateTmLanguage(defineGrammar(build() as any) as any) as any).repository[key]; } + catch { return false; } + }; + // each construct, in several EQUIVALENT factorings; the region key must be present in all. 
+ const constructs: { name: string; key: string; factorings: { label: string; build: () => Record }[] }[] = [ + { + name: 'ternary', key: 'ternary-expression', factorings: [ + { label: 'flat', build: () => { const E = rule((s: any) => [[Id, '?', s, ':', s], [Id]]); const P = rule(() => [[many(E)]]); return { name: 't1', scopeName: 'source.t1', tokens: { Id }, rules: { E, P }, entry: P }; } }, + { label: 'opt-tail', build: () => { const E = rule((s: any) => [[Id, opt('?', s, ':', s)]]); const P = rule(() => [[many(E)]]); return { name: 't2', scopeName: 'source.t2', tokens: { Id }, rules: { E, P }, entry: P }; } }, + ], + }, + { + name: 'call', key: 'function-call', factorings: [ + { label: 'inline', build: () => { const A = rule(() => [[Id]]); const E = rule((s: any) => [[A, '(', sep(s, ','), ')'], [A]]); const P = rule(() => [[many(E)]]); return { name: 'c1', scopeName: 'source.c1', tokens: { Id }, rules: { A, E, P }, entry: P }; } }, + { label: 'args-rule', build: () => { const A = rule(() => [[Id]]); const CA = rule((s: any) => [['(', sep(s, ','), ')']]); const C = rule((s: any) => [[A, CA], [A]]); const E = rule((s: any) => [[C]]); const P = rule(() => [[many(E)]]); return { name: 'c2', scopeName: 'source.c2', tokens: { Id }, rules: { A, CA, C, E, P }, entry: P }; } }, + ], + }, + { + name: 'generic-type-params', key: 'declaration-type-params', factorings: [ + { label: '3-item', build: () => { const T = rule(() => [[Id]]); const Pm = rule(() => [[Id, opt('extends', T)]]); const TP = rule(() => [['<', sep(Pm, ','), '>']]); const D = rule(() => [['fn', Id, opt(TP), '{', '}']]); const P = rule(() => [[many(D)]]); return { name: 'g1', scopeName: 'source.g1', tokens: { Id }, prec: [none('<', '>')], scopes: { 'storage.type.function': ['fn'], 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Pm, TP, D, P }, entry: P }; } }, + { label: 'trailing-comma', build: () => { const T = rule(() => [[Id]]); const Pm = rule(() => [[Id, opt('extends', T)]]); const 
TP = rule(() => [['<', sep(Pm, ','), opt(','), '>']]); const D = rule(() => [['fn', Id, opt(TP), '{', '}']]); const P = rule(() => [[many(D)]]); return { name: 'g2', scopeName: 'source.g2', tokens: { Id }, prec: [none('<', '>')], scopes: { 'storage.type.function': ['fn'], 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Pm, TP, D, P }, entry: P }; } }, + ], + }, + ]; + for (const c of constructs) { + const results = c.factorings.map(f => ({ label: f.label, ok: emits(c.key, f.build) })); + const allEmit = results.every(r => r.ok); + check(`shape-robustness: \`${c.name}\` emits #${c.key} for every equivalent factoring`, allEmit, + results.filter(r => !r.ok).map(r => r.label).join(', ')); + } +} + // ════════════════════════════════════════════════════════════════════════════ // driver // ════════════════════════════════════════════════════════════════════════════ @@ -559,6 +606,9 @@ async function main(): Promise { checkCollectLiteralsClosure(); await regionKeywordProbe(); + console.log('── Shape robustness: detectors fire on every equivalent factoring ──'); + checkShapeRobustness(); + console.log('── Reachability · token completeness · Layer B1 leaf coverage ──'); const rows: LedgerRow[] = []; for (const cfg of GRAMMARS) { From f6a9990be9be4bbdfabd17991a8868d8fe41a096 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 22:29:42 +0800 Subject: [PATCH 06/14] Normalize four more shape-detectors (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A systematic shape-fragility audit (5 agents running the real emitter over equivalent factorings of each detector's construct) found the fixed-window fragility is broad, not three isolated cases. Fix four more by routing through the normalized form, same pattern as detectTernary: - detectConditionalType — ran its 7-window on raw r.body; now over expandAlts, so an `opt`-tail / grouped / alt-split conditional `?:` is detected. 
- getTypeParamElementKeywords — early-out demanded an exact 3-item `'<' sep '>'` body; now scans the expandAlts branches for that adjacency, so a trailing `opt(',')` or alt-wrapped type-param list still hoists its constraint keyword. - detectDeclarations.isBlockRule — matched only a raw-seq `{ … }` body; now EVERY expandAlts branch must be `{ … }`-bounded (`.every`, not `.some`, so a `Type` that is only SOMETIMES an object literal is not mis-read as a brace declaration — `.some` regressed `type X = …` to #declaration-body). - detectJsx hasElementShape — matched `<`+ref only in a raw seq; now over the expandAlts branches, so an opt/alt/group factoring of the element qualifies. Byte-identical on all six shipped grammars (verified the `.every` form after a `.some` first attempt changed typescript/tsx output), and each fix is verified to detect its previously-dropped factoring. Seven detectors are now shape-robust; the YAML region detectors and the expression group are the remaining batch. --- src/gen-tm.ts | 53 ++++++++++++++++++++++++++++++++------------------- 1 file changed, 33 insertions(+), 20 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index dd4101e..49b8225 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -922,16 +922,17 @@ function detectJsx(grammar: CstGrammar): JsxInfo | null { } if (!selfCloseTok || !closeTok) return null; - // Confirm the JSX element production: a `<` literal directly before a rule ref. + // Confirm the JSX element production: a `<` literal directly before a rule ref, matched on the + // NORMALISED branches so an opt/alt/group factoring of the element production still qualifies. 
let hasElementShape = false; const walk = (e: RuleExpr): void => { - if (e.type === 'seq') { - for (let i = 0; i < e.items.length - 1; i++) { - if (e.items[i].type === 'literal' && (e.items[i] as { value: string }).value === '<' && - e.items[i + 1].type === 'ref') hasElementShape = true; + for (const items of expandAlts(e)) { + for (let i = 0; i < items.length - 1; i++) { + if (items[i].type === 'literal' && (items[i] as { value: string }).value === '<' && + items[i + 1].type === 'ref') hasElementShape = true; } - e.items.forEach(walk); - } else if (e.type === 'alt') e.items.forEach(walk); + } + if (e.type === 'seq' || e.type === 'alt') e.items.forEach(walk); else if (e.type === 'quantifier' || e.type === 'group' || e.type === 'not') walk(e.body); else if (e.type === 'sep') walk(e.element); }; @@ -2242,8 +2243,10 @@ function detectConditionalType(grammar: CstGrammar): string | null { function walk(expr: RuleExpr): void { if (connector) return; - if (expr.type === 'seq') { checkSeq(expr.items); expr.items.forEach(walk); } - else if (expr.type === 'alt') expr.items.forEach(walk); + // run the 7-window over the NORMALISED branches (mirrors detectTernary): expandAlts + // canonicalises an opt-tail / grouped / alt-split conditional `?:` into the flat adjacency. 
+ for (const items of expandAlts(expr)) { checkSeq(items); if (connector) return; } + if (expr.type === 'seq' || expr.type === 'alt') expr.items.forEach(walk); else if (expr.type === 'quantifier' || expr.type === 'group') walk(expr.body); else if (expr.type === 'sep') walk(expr.element); } @@ -3112,10 +3115,18 @@ function isAngleBracketSepRule(body: RuleExpr): boolean { } function getTypeParamElementKeywords(body: RuleExpr, grammar: CstGrammar): string[] { - if (body.type !== 'seq' || body.items.length !== 3) return []; - const sep = body.items[1]; - if (sep.type !== 'sep') return []; - let elementBody: RuleExpr = sep.element; + // Find the `'<' sep '>'` adjacency in any NORMALISED branch (so a trailing `opt(',')` or an + // alt-wrapped body still surfaces it — the same expansion isAngleBracketSepRule uses), then hoist + // the element's keywords. Without this the keyword sub-pattern (the `\bextends\b` scoping inside + // `<…>`) is dropped for those equivalent factorings even though the region itself is still emitted. 
+ let elementBody: RuleExpr | null = null; + for (const items of expandAlts(body)) { + const i = items.findIndex(x => x.type === 'literal' && (x as { value: string }).value === '<'); + if (i >= 0 && items[i + 1]?.type === 'sep' && items[i + 2]?.type === 'literal' && (items[i + 2] as { value: string }).value === '>') { + elementBody = (items[i + 1] as { element: RuleExpr }).element; break; + } + } + if (!elementBody) return []; if (elementBody.type === 'ref') { const rule = grammar.rules.find(r => r.name === (elementBody as { name: string }).name); if (rule) elementBody = rule.body; @@ -3151,13 +3162,15 @@ function detectDeclarations(grammar: CstGrammar, tokenNames: Set): DeclI function isBlockRule(name: string): boolean { const rule = grammar.rules.find(r => r.name === name); if (!rule) return false; - const body = rule.body; - if (body.type === 'seq' && body.items.length >= 2) { - return body.items[0].type === 'literal' && (body.items[0] as { value: string }).value === '{' && - body.items[body.items.length - 1].type === 'literal' && - (body.items[body.items.length - 1] as { value: string }).value === '}'; - } - return false; + // A rule is a block body only if EVERY normalised branch is `{ … }`-bounded — i.e. it is + // ALWAYS a brace block. `.some` would over-match a rule that is only SOMETIMES a block (a + // `Type` whose value can be an inline object type), mis-classifying a `type X = …` alias as a + // brace-body declaration. `.every` recovers the alt-of-blocks factoring without that regression. 
+ const branches = expandAlts(rule.body); + return branches.length > 0 && branches.every(items => + items.length >= 2 && + items[0].type === 'literal' && (items[0] as { value: string }).value === '{' && + items[items.length - 1].type === 'literal' && (items[items.length - 1] as { value: string }).value === '}'); } function containsLiteral(expr: RuleExpr, value: string): boolean { From 24cccf13de01031001dc6c99c698cb6f3e488ce4 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 22:58:37 +0800 Subject: [PATCH 07/14] Normalize the expression-position shape-detectors (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reading each expression detector myself (the audit agent's output was malformed): five matched a construct on the raw r.body, so an equivalent factoring slips them. Route them through expandAlts, same pattern as detectTernary/detectCallExpression: - detectBareArrowParam — `ref '=>'`; an opt-tail arrow (`[x, opt('=>', body)]`) was dropped (verified: variable.parameter now emitted for that factoring). - detectPropertyAccess — `'.'`/`'?.'` before a token ref. - detectParenArrowParams + detectArrowParamDelims — the deliberate pair that read the same arrow param-list production; routed identically so they still cannot disagree. - detectDirectParamKeywords — keyword directly before `(`; also recurse `sep`. detectConstructorKeywords already expands (no change). Byte-identical on all six shipped grammars. Twelve detectors are now shape-robust. The YAML region detectors are deferred to a semantics-aware pass: their fixed shapes encode DELIBERATE YAML semantics (detectFold stops at a leaf token and does NOT follow a rule-ref by design — an Indent+rule-ref is a SIBLING node, not a fold), so the audit's heuristic "follow the ref" fix would break the fold-vs-sibling meaning. Not every fixed shape is fragility — some are intent. 
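
A minimal sketch of the normalization idea, for readers of this series. It is not the code in src/gen-tm.ts: `Expr`, `expandBranches`, `hasBareArrow` and the two `detect*` helpers are hypothetical, simplified stand-ins for `RuleExpr`, `expandAlts` and `detectBareArrowParam`, and they ignore `sep`, rule refs and recursive walking. It only illustrates why a `ref '=>'` window that is matched on the raw shape misses the opt-tail factoring, while the same window run over the expanded branches finds it.

```typescript
// Hypothetical, simplified expression algebra (an assumption, not the real RuleExpr).
type Expr =
  | { type: 'literal'; value: string }
  | { type: 'ref'; name: string }
  | { type: 'seq'; items: Expr[] }
  | { type: 'alt'; items: Expr[] }
  | { type: 'opt'; body: Expr }
  | { type: 'group'; body: Expr };

const lit = (value: string): Expr => ({ type: 'literal', value });
const ref = (name: string): Expr => ({ type: 'ref', name });
const seq = (...items: Expr[]): Expr => ({ type: 'seq', items });
const opt = (...items: Expr[]): Expr => ({ type: 'opt', body: seq(...items) });

// Expand an expression into every flat branch (a list of leaf items) it can produce.
function expandBranches(e: Expr): Expr[][] {
  switch (e.type) {
    case 'seq':
      // Cartesian product of each item's branches, concatenated in order.
      return e.items.reduce<Expr[][]>(
        (acc, item) => acc.flatMap(prefix => expandBranches(item).map(b => [...prefix, ...b])),
        [[]],
      );
    case 'alt':
      return e.items.flatMap(expandBranches);
    case 'opt':
      // An optional tail contributes the empty branch plus its body's branches.
      return [[], ...expandBranches(e.body)];
    case 'group':
      return expandBranches(e.body);
    default:
      return [[e]]; // literal or ref: a single one-item branch
  }
}

// The fixed window: a ref immediately followed by the literal `=>`.
function hasBareArrow(items: Expr[]): boolean {
  return items.some((item, i) => {
    const next = items[i + 1];
    return item.type === 'ref' && next?.type === 'literal' && next.value === '=>';
  });
}

// Raw matching looks only at the top-level seq as written.
const detectRaw = (e: Expr): boolean => e.type === 'seq' && hasBareArrow(e.items);
// Normalized matching runs the same window over every expanded branch.
const detectNormalized = (e: Expr): boolean => expandBranches(e).some(hasBareArrow);

const flat = seq(ref('x'), lit('=>'), ref('body'));
const optTail = seq(ref('x'), opt(lit('=>'), ref('body')));
```

With these definitions, `detectRaw(optTail)` is false because `x` is followed by the `opt` node rather than by `=>`, while `detectNormalized(optTail)` is true because one expanded branch is `x => body`. Both helpers agree on `flat`.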
--- src/gen-tm.ts | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index 49b8225..fa917b5 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -2049,10 +2049,10 @@ function detectPropertyAccess( } function walk(expr: RuleExpr): void { - if (expr.type === 'seq') { checkSeq(expr.items); expr.items.forEach(walk); } - if (expr.type === 'alt') expr.items.forEach(walk); - if (expr.type === 'quantifier' || expr.type === 'group') walk(expr.body); - if (expr.type === 'sep') walk(expr.element); + for (const items of expandAlts(expr)) checkSeq(items); // normalized factorings + if (expr.type === 'seq' || expr.type === 'alt') expr.items.forEach(walk); + else if (expr.type === 'quantifier' || expr.type === 'group') walk(expr.body); + else if (expr.type === 'sep') walk(expr.element); } for (const rule of grammar.rules) walk(rule.body); @@ -2079,8 +2079,8 @@ function detectBareArrowParam(grammar: CstGrammar, tokenNames: Set): boo } function walk(expr: RuleExpr): boolean { - if (expr.type === 'seq') return checkSeq(expr.items) || expr.items.some(walk); - if (expr.type === 'alt') return expr.items.some(walk); + if (expandAlts(expr).some(checkSeq)) return true; // normalized factorings (opt-tail / alt / group) + if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(walk); if (expr.type === 'quantifier' || expr.type === 'group') return walk(expr.body); if (expr.type === 'sep') return walk(expr.element); return false; @@ -2112,8 +2112,8 @@ function detectParenArrowParams(grammar: CstGrammar): boolean { } function walk(expr: RuleExpr): boolean { - if (expr.type === 'seq') return checkSeq(expr.items) || expr.items.some(walk); - if (expr.type === 'alt') return expr.items.some(walk); + if (expandAlts(expr).some(checkSeq)) return true; // normalized factorings (opt-tail / alt / group) + if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(walk); if (expr.type === 'quantifier' || 
expr.type === 'group') return walk(expr.body); if (expr.type === 'sep') return walk(expr.element); return false; @@ -2159,8 +2159,8 @@ function detectArrowParamDelims(grammar: CstGrammar): { open: string; close: str return false; } function walk(expr: RuleExpr): boolean { - if (expr.type === 'seq') return checkSeq(expr.items) || expr.items.some(walk); - if (expr.type === 'alt') return expr.items.some(walk); + if (expandAlts(expr).some(checkSeq)) return true; // normalized factorings (opt-tail / alt / group) + if (expr.type === 'seq' || expr.type === 'alt') return expr.items.some(walk); if (expr.type === 'quantifier' || expr.type === 'group') return walk(expr.body); if (expr.type === 'sep') return walk(expr.element); return false; @@ -2289,9 +2289,10 @@ function detectDirectParamKeywords( } function walk(expr: RuleExpr): void { - if (expr.type === 'seq') { checkSeq(expr.items); expr.items.forEach(walk); } - if (expr.type === 'alt') expr.items.forEach(walk); - if (expr.type === 'quantifier' || expr.type === 'group') walk(expr.body); + for (const items of expandAlts(expr)) checkSeq(items); // normalized factorings + if (expr.type === 'seq' || expr.type === 'alt') expr.items.forEach(walk); + else if (expr.type === 'quantifier' || expr.type === 'group') walk(expr.body); + else if (expr.type === 'sep') walk(expr.element); } for (const rule of grammar.rules) walk(rule.body); From d4cc5ace4f0112f110a0a687cc65cea377cb70a8 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sat, 20 Jun 2026 23:05:32 +0800 Subject: [PATCH 08/14] YAML detectors: one safe positional fix, three confirmed deliberate (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Worked the YAML region detectors semantics-aware (not the audit's heuristic). 
Key finding: the audit's "route through expandAlts" fix is WRONG for these — they match STRUCTURAL nodes (a `(Newline item)*` quantifier, config-keyed bracket pairs), and expandAlts EXPANDS the very quantifier they depend on. - detectBlockSequence — the one genuine positional rigidity: it matched the `[item, (Newline item)*]` pattern only at items[0]/items[1]. Now scanned pairwise (any adjacent k), so a leading element before the sequence does not hide it. NOT routed through expandAlts (that would expand the quantifier). yaml byte-identical; the full yaml gate group (depth-witnesses, deepest-sibling, compact-nest-sites, flow-sites, blockscalar-depth, issue12) stays green. - detectFold — its visit is ALREADY pairwise; its refsLeaf stopping at a leaf token (not following a rule-ref) is the DELIBERATE fold-vs-sibling distinction (an Indent+rule-ref is a sibling node). No change. - detectExplicitKey — the indicator at items[0] is intrinsic (an explicit-key entry is headed by `?`); the inner already unwraps a quantifier. No change. - detectFlowCollections — topLits is positional-agnostic and unwraps quantifier/group/sep, deliberately stopping at alt/ref to read THIS rule's own structure. No change. So the fixed-window fragility class is closed: 13 detectors made shape-robust where the rigidity was accidental; the rest is deliberate YAML semantics that the heuristic over-flagged. Not every fixed shape is fragility. 
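
A small sketch of the positional-versus-pairwise difference described for detectBlockSequence. The types and the helpers `isNewlineRepeat`, `matchAtHead` and `matchPairwise` are hypothetical and deliberately simplified (the real detector also derives an indicator from the head item, which is omitted here). The point is only that the quantifier node is inspected as a node, so it must not be expanded away first.

```typescript
// Hypothetical, simplified node types (an assumption, not the real RuleExpr).
type Expr =
  | { type: 'ref'; name: string }
  | { type: 'literal'; value: string }
  | { type: 'seq'; items: Expr[] }
  | { type: 'quantifier'; kind: '*' | '+' | '?'; body: Expr };

const ref = (name: string): Expr => ({ type: 'ref', name });

// True when `q` is a `*`/`+` quantifier over a seq that starts with the newline token.
function isNewlineRepeat(q: Expr | undefined, newlineToken: string): boolean {
  if (!q || q.type !== 'quantifier' || (q.kind !== '*' && q.kind !== '+')) return false;
  const body = q.body;
  if (body.type !== 'seq' || body.items.length < 2) return false;
  const first = body.items[0];
  return first?.type === 'ref' && first.name === newlineToken;
}

// Positional form: only items[0]/items[1] are considered.
function matchAtHead(items: Expr[], newlineToken: string): number {
  return items.length >= 2 && isNewlineRepeat(items[1], newlineToken) ? 0 : -1;
}

// Pairwise form: any adjacent pair (items[k], items[k + 1]) may match.
function matchPairwise(items: Expr[], newlineToken: string): number {
  for (let k = 0; k + 1 < items.length; k++) {
    if (isNewlineRepeat(items[k + 1], newlineToken)) return k;
  }
  return -1;
}

const repeat: Expr = {
  type: 'quantifier',
  kind: '*',
  body: { type: 'seq', items: [ref('Newline'), ref('Item')] },
};
const plain: Expr[] = [ref('Item'), repeat];
const withLead: Expr[] = [ref('Indent'), ref('Item'), repeat];
```

Here `matchAtHead(withLead, 'Newline')` returns -1 because the leading `Indent` pushes the pattern off position 0, while `matchPairwise(withLead, 'Newline')` returns 1. Both return 0 for `plain`.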
--- src/gen-tm.ts | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index fa917b5..8ce6727 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -3626,10 +3626,13 @@ function detectBlockSequence(grammar: CstGrammar): { indicator: string } | null let indicator: string | null = null; const visit = (e: RuleExpr): void => { if (e.type === 'seq') { - // `[item, (Newline item)*]`: first element + a `*`/`+` over a `[Newline, item]` seq - if (e.items.length >= 2) { - const head = e.items[0]; - const q = e.items[1]; + // `[…, item, (Newline item)*, …]`: an item ADJACENT to a `*`/`+` over a `[Newline, item]` seq. + // Scanned pairwise (any k, not only items[0]/items[1]) so a leading element before the + // sequence pattern does not hide it. NOT routed through expandAlts on purpose — that would + // expand the `(Newline item)*` quantifier this match depends on. + for (let k = 0; k + 1 < e.items.length; k++) { + const head = e.items[k]; + const q = e.items[k + 1]; if (q.type === 'quantifier' && (q.kind === '*' || q.kind === '+') && q.body.type === 'seq' && q.body.items.length >= 2 && q.body.items[0].type === 'ref' && q.body.items[0].name === newlineToken) { const ind = itemIndicator(head); From f015876f949c1bef069221867fbd648a20a6b611 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 00:06:17 +0800 Subject: [PATCH 09/14] Hoist declared-scope punctuation inside type-parameter regions (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit getTypeParamElementKeywords collected only KEYWORD literals, so a literal the grammar declares a scope for that is PUNCTUATION (e.g. a `&` scoped punctuation.separator) inside a type-parameter element lost its scope inside `<…>`. Collect any scope-declared literal now, and emit it with the right boundary (`\b` for words, none for punctuation — `\b&\b` never matches). 
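
The boundary choice can be illustrated with a short sketch. The helpers below are assumptions written for this note: `escapeRegex` and `isWord` only approximate the `escapeRegex` and `isKeywordLiteral` used in src/gen-tm.ts, and JavaScript's `RegExp` stands in for the Oniguruma engine (both treat `\b` as a word/non-word transition).

```typescript
// Hypothetical stand-ins; the real helpers live in src/gen-tm.ts and may differ.
const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const isWord = (s: string): boolean => /^\w+$/.test(s);

// Word literals get `\b` on both sides; punctuation literals get none.
const matchSource = (literal: string): string =>
  isWord(literal) ? `\\b${escapeRegex(literal)}\\b` : escapeRegex(literal);

// A `\b`-wrapped `&` needs a word character directly on each side,
// so it fails on the usual spaced form `A & B`.
const wrappedAmp = new RegExp('\\b&\\b');
const bareAmp = new RegExp(matchSource('&'));
const extendsKw = new RegExp(matchSource('extends'));

const spaced = 'T extends A & B';
const results = {
  wrappedOnSpaced: wrappedAmp.test(spaced),
  bareOnSpaced: bareAmp.test(spaced),
  keywordOnSpaced: extendsKw.test(spaced),
  keywordInsideWord: extendsKw.test('Textends'),
};
```

In this sketch the `\b`-wrapped `&` does not match the spaced intersection, the bare `&` does, and the `\b`-bounded keyword matches `extends` as a whole word but not inside `Textends`.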
`=` and `,` are skipped in the loop since the dedicated handlers emit them, so the six shipped grammars stay byte-identical (verified the `=` double-emit and excluded it); a scoped `&` in a type-param element is now scoped (verified). This closes the last clean completeness gap from the hunt. The remaining two are not clean completeness gaps: the symbolic-operator case is an ORDERING concern (a short overridden op can shadow a longer non-overridden one across the separate #operator-overrides / #operators patterns — the provably-hard ordering axis, and the op IS scoped, just shadowed), and `rawBlock` is a declared-but-unimplemented IndentConfig field (no shipped grammar sets it; implementing the verbatim region speculatively, with no adopter to verify against, is deferred). --- src/gen-tm.ts | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index 8ce6727..10337c4 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -3134,7 +3134,11 @@ function getTypeParamElementKeywords(body: RuleExpr, grammar: CstGrammar): strin } const keywords: string[] = []; function walk(e: RuleExpr) { - if (e.type === 'literal' && isKeywordLiteral(e.value)) keywords.push(e.value); + // collect a literal the element CONSUMES that bears a scope obligation: a keyword, OR any + // literal the grammar declares a scope for (e.g. a `&` separator scoped punctuation.separator) + // — so a declared-scope PUNCTUATION inside the element keeps its scope inside `<…>` too, not + // only keywords. The emit site picks the right boundary (`\b` for words, none for punctuation). 
+ if (e.type === 'literal' && (isKeywordLiteral(e.value) || grammar.scopeOverrides.has(e.value))) keywords.push(e.value); if (e.type === 'seq' || e.type === 'alt') e.items.forEach(walk); if (e.type === 'quantifier' || e.type === 'group') walk(e.body); // A keyword reached through a `sep` sub-list of the element is just as direct as one in a @@ -6563,12 +6567,15 @@ export function generateTmLanguage(grammar: CstGrammar): TmGrammar { { include: '#declaration-type-params' }, ]; for (const kw of allTypeParamKws) { + // `=` and `,` are the type-param's STRUCTURAL punctuation, emitted by the dedicated + // handlers below — skip them here so they are not double-emitted. + if (kw === '=' || kw === ',') continue; const scope = getScope(scopeOverrides,kw); if (scope) { - tpInner.push({ - match: `\\b${escapeRegex(kw)}\\b`, - name: `${scope}.${langName}`, - }); + // a word literal is `\b`-bounded; a punctuation literal (e.g. `&`) must NOT be — `\b&\b` + // never matches. Pick the boundary by the literal's class. + const m = isKeywordLiteral(kw) ? `\\b${escapeRegex(kw)}\\b` : escapeRegex(kw); + tpInner.push({ match: m, name: `${scope}.${langName}` }); } } tpInner.push({ include: '#type-inner' }); From ea488dd754ff026b0257db66b88437096c2692ea Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 00:30:32 +0800 Subject: [PATCH 10/14] Close a co-blind markup path and correct the proof's overclaims (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit An adversarial review of the "proven no gaps" conclusion found real holes; this acts on them. - Markup co-blindness (the one fixable defect): tokenCensus discharged EVERY markup token via `if (g.markup) bump('markup-region')` with zero verification, so a markup token whose declared scope generateMarkupTm does not model (e.g. a `` processing instruction) was reported discharged while the engine painted it the bare root. 
Now a markup token with an explicit `scope` must have that scope actually emitted, or it is an orphan (verified: the PI counterexample is now caught; html stays 7/7). - COMPLETENESS.md corrections, all overclaims the review caught: - "per-token discharge by structural identity (match IS tokenPatternSource)" was false — the identifier match is widened to identPattern, and the census checks PRESENCE (a reachable non-root-scoped entry), not regex identity. Reworded to presence-not-identity. - "∀ G by structural induction" conflated the algebra CLOSURE (which is ∀ G) with the per-grammar DISCHARGE (executed on the shipped set). Separated. - "ordering is undecidable" was the project's own impossibility-without-proof trap: for a fixed G the pattern list is finite and the winner is a finite index read (gen-tm even sorts it deterministically). It is "not reachable by a corpus-free structural fold," a measurement limit, not undecidability. Residual, honestly stated: the shape class is MEASURED not proven — the robustness gate covers 3 of ~20 detectors, mutation testing mutates only flat tokens (not shape regions), and detectAngleBracketCast keeps a latent ref-factored fragility (type-cast fires on no shipped grammar). 81/81 · 40/40. --- COMPLETENESS.md | 49 ++++++++++++++++++++++++++++------------- test/tm-completeness.ts | 9 +++++++- 2 files changed, 42 insertions(+), 16 deletions(-) diff --git a/COMPLETENESS.md b/COMPLETENESS.md index 2f36a54..32777c4 100644 --- a/COMPLETENESS.md +++ b/COMPLETENESS.md @@ -48,8 +48,11 @@ constructor-occurrence or a config-field-occurrence. So completeness reduces to: obligation generator, the generator has a discharging, reachable emission** — three mechanically-checkable layers. 
Both sides are finite — a finite `G`, a finite `gen-tm(G)`, and an obligation taxonomy bounded by TextMate's finite construct kinds — so completeness is a **decidable** -property per grammar, and holds **∀ G by structural induction** over the finite combinator algebra -(finitely many cases). It is checked a-priori on the emitted artifact, with no corpus. +property **per grammar**, checked a-priori on the emitted artifact with no corpus. The **algebra +closure** (Layer A) is what holds ∀ G by structural induction (the lowering/compilation is total over +the finite combinator algebra); the per-grammar **discharge** is then executed on the shipped set — a +decidable check run on concrete grammars, with the closure as its inductive backbone, not a mechanised +∀-G proof of discharge. ## Layer A — closure: the universe is the algebra, and lowering is total @@ -142,19 +145,35 @@ The honest, measured result: artifact's *sequence*, not the grammar's algebra — so no corpus-free structural check reaches it, and a scope-preserving reorder slips even the bucket-level differential. -The line is precise. **Completeness — every required construct PRESENT + REACHABLE + visually scoped -— is DECIDABLE**, and decided a-priori with no corpus: a finite grammar `G`, a finite emitted artifact -`gen-tm(G)`, a finite obligation taxonomy (bounded by TextMate's finite construct kinds), and per-token -discharge by *structural identity* (the flat `match` **is** `tokenPatternSource(t)`, so no semantic -regex-matching is needed). ∀ `G` follows by structural induction over the finite combinator algebra. -What is **undecidable is soundness** — do the present constructs paint *correctly on all inputs*: a -wrong-role paint, or which of two overlapping patterns *wins* (ordering), is an agreement between a -CFG-derived role and a regex-stack-machine tokenizer over an infinite input space, which slides into -regex-vs-CFG undecidability (Oniguruma's `\g<>`/backreferences are non-regular). 
So this document -proves completeness and *measures* its detector (mutation testing); soundness it does not claim to -decide — that is `test/gap-ledger.ts`'s by-construction + corpus axis. The earlier framing that -"a-priori completeness over the whole gap space is unavailable" was an over-concession: completeness -is available; it was soundness's wall, mistaken for completeness's. +The line is precise — and narrower than an earlier draft of it claimed (an adversarial review of +that draft is owed these corrections). **Completeness — every required construct PRESENT + REACHABLE ++ visually scoped — is checked structurally, no corpus**, and for a fixed `G` it is DECIDABLE (finite +`G`, finite `gen-tm(G)`, an obligation taxonomy bounded by TextMate's finite construct kinds). Three +honest bounds on that: + +- **Presence, not identity.** The discharge is *presence* — a reachable repository entry whose scope + is non-root — not a deeper *identity* of the matcher. The flat `match` is *derived from* + `tokenPatternSource(t)` (widened to `identPattern` for the identifier token, so not literally equal); + `tokenCensus` checks the entry exists and carries a visual scope, it does **not** re-verify the regex + recognises the right bytes — that is soundness. (For markup grammars the census additionally checks + that a token's *explicitly declared* scope is actually emitted, closing a co-blind path a review + found — an unmodelled `` construct that fell through to the bare root.) +- **Verified on the shipped set, not run ∀ G.** The Layer-A closure (A1/A2) is the inductive backbone + — the algebra IS closed and lowering/compilation IS total ∀ G — but the per-grammar *discharge* is + EXECUTED on the six shipped grammars plus synthetic witnesses, not run as a quantified proof over all `G`. 
+- **Ordering is decidable, not undecidable.** Which of two overlapping patterns *wins* is, for a fixed + `G`, a finite index read (the emitted `patterns` list is finite, the winner is leftmost-by-order, and + gen-tm computes that order by a deterministic sort). It is simply **not reachable by a corpus-free + structural fold over the grammar** — a measurement limit, the wording §"Measuring the detector" uses, + NOT undecidability. (An earlier draft called ordering "undecidable"; that was the same impossibility- + without-proof over-claim the project guards against.) + +What is genuinely beyond an a-priori check is **soundness** — whether the present constructs paint +*correctly on all inputs* (a wrong-role paint): a CFG-derived role vs a regex-stack-machine tokenizer +over an infinite input space. That is handled by-construction + `test/gap-ledger.ts`'s corpus, not +claimed decided here. So the honest scope: a structural **presence** proof, **decidable per grammar**, +**executed on the shipped set**, with the algebra-closure as its inductive backbone — and the detector's +power **measured** (mutation testing), not proven, for the shape class. ## Reachability — root ∪ export surfaces diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index 4b28094..8178184 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -263,7 +263,14 @@ export function tokenCensus(g: CstGrammar, tmJson: TmGrammarJson): TokenCensus { if (flat) { if (flatNeutered(flat)) neutered.push(`${t.name}→${(flat as any).name ?? '∅'}`); else bump('flat'); continue; } if (t.flags.includes('regex')) { bump('regex-family'); continue; } if (tokenPatternIsNever(t)) { bump('engine-emitted'); continue; } - if (g.markup) { bump('markup-region'); continue; } // generateMarkupTm owns it + if (g.markup) { + // generateMarkupTm owns the ROLE-based markup tokens (text / tag / attr — no explicit `scope`). 
+ // But a markup token with an EXPLICITLY declared scope (a construct generateMarkupTm may not + // model, e.g. a `` processing instruction) must actually have that scope emitted — else it + // falls through to the bare document root, the same neuter gap the token-stream path catches. + if (t.scope && !full.includes(t.scope)) { orphans.push(`${t.name}[markup-unmodeled:${t.scope}]`); continue; } + bump('markup-region'); continue; + } const delim = tokenPatternLiteralText(t); // a region owns this token's delimiter? if (delim && full.includes(JSON.stringify(delim).slice(1, -1))) { bump('region-owned'); continue; } orphans.push(`${t.name}[${t.flags.join(',') || '-'}]`); From 9552ad62b9c373a8c5ef37f5bd8a41a6fda35ba5 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 07:31:14 +0800 Subject: [PATCH 11/14] Extend the shape-robustness gate to every detector; fix the two it exposed (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The shape-robustness gate covered 3 constructs (ternary / call / generic-type- params). Extended it to 19 — one per gen-tm shape-detector — each asserting the construct discharges its obligation for several EQUIVALENT factorings (canonical / opt-tail / alt-split / via-ref / trailing-comma). The builders were authored by running the real emitter over each factoring; an `xfail` list records a factoring a detector is known to drop, so a NEW drop OR a stale xfail (a landed fix) both go red — never a silent false-green. Building it out caught two residual fixed-shape fragilities (both latent — neither construct fires on a shipped grammar — so both fixes are byte-identical on all six): - detectAngleBracketCast dropped a cast whose `` head is its own rule (the `<`/`>` hidden behind a ref). Now resolves a ref to a `'<' type '>'` cast-head rule by name, mirroring detectCallExpression reaching its args through a rule. 
- detectTypeParamConstraintKeywords extracted the constraint keyword only from a `?`-quantifier `[kw, ref]`, dropping `alt([…,kw,type],[…])` (optionality via an alt branch) and `opt(kw, sep(type))` (type behind a sep). Now reads the constraint as the optional `[kw, type]` segment by which one expandAlts branch extends a prefix-shorter sibling — uniform across opt / alt / sep, and a leading modifier (not a prefix extension) is still excluded. (A first attempt keyed on the follower being a type-flagged rule; the gate caught that it broke the canonical factoring when the constraint type is not type-flagged.) 19 constructs all green; the YAML/markup detectors are positive-controlled (their fixed shapes are deliberate, so they guard the canonical form, not factorings). 40/40 · 96/96 completeness checks. --- src/gen-tm.ts | 87 ++++++++++------ test/tm-completeness.ts | 217 +++++++++++++++++++++++++++++++++++++--- 2 files changed, 261 insertions(+), 43 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index 10337c4..bbef912 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -1091,33 +1091,33 @@ function detectTypeParamConstraintKeywords(grammar: CstGrammar, typeArgRule: str }; for (const rule of grammar.rules) for (const seq of expandAlts(rule.body)) scanForTypeParamSep(seq); - // In each such rule, find OPTIONAL `[, ]` pairs — the constraint. - // The literal must be a WORD (starts with a letter/`_`) so it is `\b`-bounded; a - // punctuation lead like `=` (the default) is excluded on purpose (see doc above). + // In each such rule, find the constraint keyword: the WORD literal that BEGINS an OPTIONAL + // `[keyword, type…]` segment. "Optional" is read structurally from expandAlts(body): the constraint + // is exactly the segment by which one branch EXTENDS a prefix-shorter sibling — so `opt(kw, type)` + // (a `?` body), a separate alt branch `alt([…, kw, type], […])`, and a `sep`-wrapped type ALL reduce + // to "branch B = branch A ++ [kw, …]". 
The keyword is `B[len(A)]` when it is a word literal. This + // reads the OPTIONALITY (the distinguishing fact), so a LEADING modifier (`const`/`in`/`out`, whose + // own optionality makes `[name]` vs `[const,name]` — NOT a prefix pair) is not mistaken for it, and a + // punctuation lead like `=` (the default) is excluded by the word test. const keywords = new Set(); const isWord = (s: string) => /^[A-Za-z_]/.test(s); - const scanConstraint = (expr: RuleExpr): void => { - if (expr.type === 'quantifier') { - if (expr.kind === '?' && expr.body.type === 'seq') { - const its = expr.body.items; - if (its.length >= 2 && its[0].type === 'literal' && its[1].type === 'ref') { - const lit = (its[0] as { value: string }).value; - if (isWord(lit)) keywords.add(lit); - } - } - scanConstraint(expr.body); - } else if (expr.type === 'seq' || expr.type === 'alt') { - for (const it of (expr as { items: RuleExpr[] }).items) scanConstraint(it); - } else if (expr.type === 'group') { - scanConstraint((expr as { body: RuleExpr }).body); - } else if (expr.type === 'sep') { - // a constraint keyword reached through a `&`/`,`-separated sub-list is just as direct — - // recurse into the element (mirrors getTypeParamElementKeywords' `sep` arm). - scanConstraint((expr as { element: RuleExpr }).element); - } - }; + const sig = (e: RuleExpr): string => + e.type === 'literal' ? 'L:' + (e as { value: string }).value + : e.type === 'ref' ? 'R:' + (e as { name: string }).name + : e.type === 'sep' ? 'S:' + sig((e as { element: RuleExpr }).element) + : e.type === 'seq' || e.type === 'alt' ? e.type + '[' + (e as { items: RuleExpr[] }).items.map(sig).join(',') + ']' + : e.type === 'quantifier' ? 'Q' + (e as { kind: string }).kind + sig((e as { body: RuleExpr }).body) + : (e.type === 'group' || e.type === 'not') ? 
e.type[0] + sig((e as { body: RuleExpr }).body) + : e.type; for (const rule of grammar.rules) { - if (sepElementRules.has(rule.name)) scanConstraint(rule.body); + if (!sepElementRules.has(rule.name)) continue; + const branches = expandAlts(rule.body); + for (const a of branches) for (const b of branches) { + if (b.length <= a.length) continue; + if (!a.every((it, i) => sig(it) === sig(b[i]))) continue; // a is a strict prefix of b + const head = b[a.length]; + if (head.type === 'literal' && isWord((head as { value: string }).value)) keywords.add((head as { value: string }).value); + } } return [...keywords]; } @@ -1748,15 +1748,40 @@ function detectAngleBracketCast(grammar: CstGrammar): string | null { ); if (typeRuleNameSet.size === 0) return null; + const ruleByName = new Map(grammar.rules.map(r => [r.name, r] as const)); + // a "cast head" is a rule whose EVERY expanded branch is exactly `'<' '>'` — the author + // factored the angle-cast prefix into its own rule. Recover the type name (null if not that shape). 
+ const castHeadType = (body: RuleExpr): string | null => { + const branches = expandAlts(body); + if (!branches.length) return null; + let ty: string | null = null; + for (const items of branches) { + if (items.length === 3 && items[0].type === 'literal' && (items[0] as { value: string }).value === '<' && + items[1].type === 'ref' && typeRuleNameSet.has((items[1] as { name: string }).name) && + items[2].type === 'literal' && (items[2] as { value: string }).value === '>') { + ty = (items[1] as { name: string }).name; + } else return null; + } + return ty; + }; let found: string | null = null; const walkSeq = (items: RuleExpr[]): void => { - for (let i = 0; i + 3 < items.length; i++) { - const a = items[i], b = items[i + 1], c = items[i + 2], d = items[i + 3]; - if (a.type === 'literal' && a.value === '<' && - b.type === 'ref' && typeRuleNameSet.has(b.name) && - c.type === 'literal' && c.value === '>' && - d /* an operand follows the cast */) { - found = b.name; + for (let i = 0; i + 1 < items.length; i++) { + const a = items[i]; + // inline `'<' '>' operand` + if (i + 3 < items.length) { + const b = items[i + 1], c = items[i + 2], d = items[i + 3]; + if (a.type === 'literal' && a.value === '<' && + b.type === 'ref' && typeRuleNameSet.has(b.name) && + c.type === 'literal' && c.value === '>' && d /* an operand follows the cast */) { + found = b.name; + } + } + // via a cast-head RULE: `[, operand]` — the `<…>` is hidden behind the ref boundary, + // so resolve it by name (mirrors detectCallExpression reaching its args through a separate rule). 
+ if (a.type === 'ref' && ruleByName.has(a.name)) { + const ty = castHeadType(ruleByName.get(a.name)!.body); + if (ty) found = ty; } } }; diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index 8178184..57ef3cf 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -41,7 +41,7 @@ import { tsRelax, capExpr, awaitCtx, yieldCtx, asyncGenCtx, resetCtx, op, prefix, postfix, sameLine, noCommentBefore, noMultilineFlowBefore, notLeftLeaf, oneOf, noneOf, seq, altPattern, optPattern, star, plus, repeat, - followedBy, notFollowedBy, precededBy, notPrecededBy, start, end, never, anyChar, range, none, + followedBy, notFollowedBy, precededBy, notPrecededBy, start, end, never, anyChar, range, none, left, } from '../src/api.ts'; import { tokenPatternToRegex, tokenPatternIsNever, tokenPatternLiteralText } from '../src/token-pattern.ts'; import { collectLiterals, isKeywordLiteral } from '../src/grammar-utils.ts'; @@ -495,38 +495,231 @@ async function regionKeywordProbe(): Promise { // factorings must ALL emit the same region key. It BITES if a detector regresses to a fixed // shape (a factoring loses the region). // ════════════════════════════════════════════════════════════════════════════ + +// Shared .tsx scaffolding for the `type-param-constraint` construct: a JSX grammar (so the +// generic-arrow ⇄ JSX `extends` disambiguation guard is emitted at all) whose ONLY varying part +// is the type-param rule body, produced by `mkTParam(CType)` (CType is the registered constraint +// type rule). Mirrors the angle-bracket disambiguation fixtures in test/agnostic.ts. 
+function tpcGrammar( + name: string, SelfEnd: any, CloseTg: any, Id: any, + mkTParam: (CType: any) => any, +): Record { + const Type: any = rule(() => [[Id]]); + const CType: any = rule(() => [[Id]]); // the constraint's TYPE rule (REGISTERED) + const TParam = rule(() => [mkTParam(CType)]); + const TP = rule(() => [['<', sep(TParam, ','), '>']]); + const Param = rule(() => [[Id, opt(':', Type)]]); + const Decl = rule(() => [['fn', Id, opt(TP), '(', sep(Param, ','), ')', '{', '}']]); // emits #arrow-type-parameters + const Arrow = rule(() => [[opt(TP), '(', sep(Param, ','), ')', '=>', Id]]); + const Call = rule(() => [[Id, '<', sep(Type, ','), '>', '(', sep(Id, ','), ')']]); + const Attr = rule(() => [[Id, opt('=', Id)]]); + const Elem = rule(() => [['<', Id, many(Attr), alt(SelfEnd, ['>', CloseTg, Id, '>'])]]); + const E = rule(() => [Id, Call, Arrow, Elem]); + const S = rule(() => [Decl, E]); + const Prog = rule(() => [[many(S)]]); + return { + name, scopeName: `source.${name}`, + tokens: { SelfEnd, CloseTg, Id }, prec: [none('<', '>')], + scopes: { 'storage.type.function': ['fn'], 'keyword.operator.expression.extends': ['extends'] }, + rules: { Type, CType, TParam, TP, Param, Decl, Arrow, Call, Attr, Elem, E, S, Prog }, entry: Prog, + }; +} + function checkShapeRobustness(): void { const Id = token(plus(range('a', 'z')), { identifier: true }); - const emits = (key: string, build: () => Record): boolean => { - try { return !!(generateTmLanguage(defineGrammar(build() as any) as any) as any).repository[key]; } + // JSX delimiter tokens — needed by the constructs whose discharge only fires in a .tsx + // grammar (the generic-arrow ⇄ JSX disambiguation, e.g. type-param-constraint below). 
+ const SelfEnd = token(seq('/', '>')); // /> + const CloseTg = token(seq('<', '/')); // </ + const keyObs = (key: string) => (tm: any) => !!tm.repository[key]; + const emits = (observable: (tm: any) => boolean, build: () => Record): boolean => { + try { return !!observable(generateTmLanguage(defineGrammar(build() as any) as any) as any); } catch { return false; } }; - // each construct, in several EQUIVALENT factorings; the region key must be present in all. - const constructs: { name: string; key: string; factorings: { label: string; build: () => Record }[] }[] = [ + // each construct, in several EQUIVALENT factorings; the obligation must discharge in all. + // `xfail` records factorings a detector is KNOWN to drop today (issue #51 residual fragility): + // the assertion tolerates exactly those, so the gate goes RED on a NEW drop or once a fix lands + // and the xfail goes stale — never a silent false-green, never a permanent red. + const constructs: { name: string; key?: string; observable: (tm: any) => boolean; xfail?: string[]; factorings: { label: string; build: () => Record }[] }[] = [ { - name: 'ternary', key: 'ternary-expression', factorings: [ + name: 'ternary', key: 'ternary-expression', observable: keyObs('ternary-expression'), factorings: [ { label: 'flat', build: () => { const E = rule((s: any) => [[Id, '?', s, ':', s], [Id]]); const P = rule(() => [[many(E)]]); return { name: 't1', scopeName: 'source.t1', tokens: { Id }, rules: { E, P }, entry: P }; } }, { label: 'opt-tail', build: () => { const E = rule((s: any) => [[Id, opt('?', s, ':', s)]]); const P = rule(() => [[many(E)]]); return { name: 't2', scopeName: 'source.t2', tokens: { Id }, rules: { E, P }, entry: P }; } }, ], }, { - name: 'call', key: 'function-call', factorings: [ + name: 'call', key: 'function-call', observable: keyObs('function-call'), factorings: [ { label: 'inline', build: () => { const A = rule(() => [[Id]]); const E = rule((s: any) => [[A, '(', sep(s, ','), ')'], [A]]); const P = rule(() => [[many(E)]]); return {
name: 'c1', scopeName: 'source.c1', tokens: { Id }, rules: { A, E, P }, entry: P }; } }, { label: 'args-rule', build: () => { const A = rule(() => [[Id]]); const CA = rule((s: any) => [['(', sep(s, ','), ')']]); const C = rule((s: any) => [[A, CA], [A]]); const E = rule((s: any) => [[C]]); const P = rule(() => [[many(E)]]); return { name: 'c2', scopeName: 'source.c2', tokens: { Id }, rules: { A, CA, C, E, P }, entry: P }; } }, ], }, { - name: 'generic-type-params', key: 'declaration-type-params', factorings: [ + name: 'generic-type-params', key: 'declaration-type-params', observable: keyObs('declaration-type-params'), factorings: [ { label: '3-item', build: () => { const T = rule(() => [[Id]]); const Pm = rule(() => [[Id, opt('extends', T)]]); const TP = rule(() => [['<', sep(Pm, ','), '>']]); const D = rule(() => [['fn', Id, opt(TP), '{', '}']]); const P = rule(() => [[many(D)]]); return { name: 'g1', scopeName: 'source.g1', tokens: { Id }, prec: [none('<', '>')], scopes: { 'storage.type.function': ['fn'], 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Pm, TP, D, P }, entry: P }; } }, { label: 'trailing-comma', build: () => { const T = rule(() => [[Id]]); const Pm = rule(() => [[Id, opt('extends', T)]]); const TP = rule(() => [['<', sep(Pm, ','), opt(','), '>']]); const D = rule(() => [['fn', Id, opt(TP), '{', '}']]); const P = rule(() => [[many(D)]]); return { name: 'g2', scopeName: 'source.g2', tokens: { Id }, prec: [none('<', '>')], scopes: { 'storage.type.function': ['fn'], 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Pm, TP, D, P }, entry: P }; } }, ], }, + // ── conditional-type (detectConditionalType, key #type-conditional) ── + // `{type:true}` rule with `ref KW ref ? ref : ref`. detectConditionalType runs its + // 7-window over expandAlts(body), so opt-tail / alt-split normalise to the same flat + // adjacency. ROBUST (all factorings emit). 
+ { + name: 'conditional-type', key: 'type-conditional', observable: keyObs('type-conditional'), factorings: [ + { label: 'canonical', build: () => { const T: any = rule(() => [[Id, 'extends', Id, '?', Id, ':', Id], [Id]], { type: true }); const Ann = rule(() => [[Id, ':', T]]); const P = rule(() => [[many(Ann)]]); return { name: 'cd1', scopeName: 'source.cd1', tokens: { Id }, scopes: { 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Ann, P }, entry: P }; } }, + { label: 'opt-tail', build: () => { const T: any = rule(() => [[Id, opt('extends', Id, '?', Id, ':', Id)]], { type: true }); const Ann = rule(() => [[Id, ':', T]]); const P = rule(() => [[many(Ann)]]); return { name: 'cd2', scopeName: 'source.cd2', tokens: { Id }, scopes: { 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Ann, P }, entry: P }; } }, + { label: 'alt-split', build: () => { const T: any = rule(() => [alt([Id, 'extends', Id, '?', Id, ':', Id], [Id])], { type: true }); const Ann = rule(() => [[Id, ':', T]]); const P = rule(() => [[many(Ann)]]); return { name: 'cd3', scopeName: 'source.cd3', tokens: { Id }, scopes: { 'keyword.operator.expression.extends': ['extends'] }, rules: { T, Ann, P }, entry: P }; } }, + ], + }, + // ── generic-call (detectAngleBracketAmbiguity, key #generic-call) ── + // `<` sep(ref) `>` CONFIRM (the confirm token is the item after `>`). The detector walks + // expandAlts(body), so the `(args)` confirm written as an opt-tail or reached through an + // alt() still surfaces the `< sep > (` adjacency. Needs `<`/`>` in the prec table. ROBUST. 
+ { + name: 'generic-call', key: 'generic-call', observable: keyObs('generic-call'), factorings: [ + { label: 'canonical', build: () => { const T = rule(() => [[Id]]); const Call = rule(() => [[Id, '<', sep(T, ','), '>', '(', sep(Id, ','), ')']]); const E = rule(() => [Id, Call]); const P = rule(() => [[many(E)]]); return { name: 'gc1', scopeName: 'source.gc1', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, E, P }, entry: P }; } }, + { label: 'opt-tail', build: () => { const T = rule(() => [[Id]]); const Call = rule(() => [[Id, '<', sep(T, ','), '>', opt('(', sep(Id, ','), ')')]]); const E = rule(() => [Id, Call]); const P = rule(() => [[many(E)]]); return { name: 'gc2', scopeName: 'source.gc2', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, E, P }, entry: P }; } }, + { label: 'alt-confirm', build: () => { const T = rule(() => [[Id]]); const Call = rule(() => [[Id, '<', sep(T, ','), '>', alt(['(', sep(Id, ','), ')'], [Id])]]); const E = rule(() => [Id, Call]); const P = rule(() => [[many(E)]]); return { name: 'gc3', scopeName: 'source.gc3', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, E, P }, entry: P }; } }, + ], + }, + // ── angle-cast (detectAngleBracketCast, key #type-cast) ── + // 4-window `<` ref(@type) `>` operand. The cast head written as its OWN rule + // (`CastHead = '<' Type '>'`, used as `[CastHead, operand]`) hides the `<`/`>` across the ref + // boundary; detectAngleBracketCast now resolves a ref to such a cast-head rule by name (like + // detectCallExpression reaches its args through a separate rule), so `via-ref` is robust too. 
+ { + name: 'angle-cast', key: 'type-cast', observable: keyObs('type-cast'), factorings: [ + { label: 'canonical', build: () => { const T = rule(() => [[Id]], { type: true }); const Call = rule(() => [[Id, '<', sep(T, ','), '>', '(', sep(Id, ','), ')']]); const Cast = rule(() => [['<', T, '>', Id]]); const E = rule(() => [Id, Cast, Call]); const P = rule(() => [[many(E)]]); return { name: 'ac1', scopeName: 'source.ac1', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, Cast, E, P }, entry: P }; } }, + { label: 'opt-operand', build: () => { const T = rule(() => [[Id]], { type: true }); const Call = rule(() => [[Id, '<', sep(T, ','), '>', '(', sep(Id, ','), ')']]); const Cast = rule(() => [['<', T, '>', opt(Id)]]); const E = rule(() => [Id, Cast, Call]); const P = rule(() => [[many(E)]]); return { name: 'ac3', scopeName: 'source.ac3', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, Cast, E, P }, entry: P }; } }, + { label: 'via-ref', build: () => { const T = rule(() => [[Id]], { type: true }); const Call = rule(() => [[Id, '<', sep(T, ','), '>', '(', sep(Id, ','), ')']]); const CastHead = rule(() => [['<', T, '>']]); const Cast = rule(() => [[CastHead, Id]]); const E = rule(() => [Id, Cast, Call]); const P = rule(() => [[many(E)]]); return { name: 'ac2', scopeName: 'source.ac2', tokens: { Id }, prec: [none('<', '>')], rules: { T, Call, CastHead, Cast, E, P }, entry: P }; } }, + ], + }, + // ── type-param-constraint (detectTypeParamConstraintKeywords) ── + // observable: the constraint keyword (`extends`) appears in the #arrow-type-parameters begin + // guard (the .tsx generic-arrow ⇄ JSX disambiguation `topTypeParam`) — so this needs a JSX + // grammar (`/>`,` ((tm.repository['arrow-type-parameters']?.begin as string) ?? 
'').includes('\\bextends\\b'), + factorings: [ + { label: 'canonical', build: () => tpcGrammar('tpc1', SelfEnd, CloseTg, Id, (CType) => [Id, opt('extends', CType)]) }, + { label: 'alt-split', build: () => tpcGrammar('tpc2', SelfEnd, CloseTg, Id, (CType) => alt([Id, 'extends', CType], [Id])) }, + { label: 'sep-constraint', build: () => tpcGrammar('tpc3', SelfEnd, CloseTg, Id, (CType) => [Id, opt('extends', sep(CType, '&'))]) }, + ], + }, + { + name: "bare-arrow", observable: (tm => !!tm.repository['arrow-parameter'] && JSON.stringify(tm.repository['arrow-parameter']).includes('variable.parameter')), + factorings: [ + { label: "canonical", build: () => { const E = rule((s) => [[Id, '=>', s], [Id]]); const P = rule(() => [[many(E)]]); return { name: 'ba1', scopeName: 'source.ba1', tokens: { Id }, rules: { E, P }, entry: P }; } }, + { label: "opt-tail", build: () => { const E = rule((s) => [[Id, opt('=>', s)]]); const P = rule(() => [[many(E)]]); return { name: 'ba2', scopeName: 'source.ba2', tokens: { Id }, rules: { E, P }, entry: P }; } }, + { label: "via-ref", build: () => { const Ar = rule((s) => [[Id, '=>', s]]); const E = rule((s) => [Ar, [Id]]); const P = rule(() => [[many(E)]]); return { name: 'ba3', scopeName: 'source.ba3', tokens: { Id }, rules: { Ar, E, P }, entry: P }; } }, + ], + }, + { + name: "property-access", observable: (tm => !!tm.repository['property-access'] && JSON.stringify(tm.repository['property-access']).includes('entity.other.property')), + factorings: [ + { label: "canonical", build: () => { const E = rule((s) => [[Id, many('.', Id)], [Id]]); const P = rule(() => [[many(E)]]); return { name: 'pa1', scopeName: 'source.pa1', tokens: { Id }, rules: { E, P }, entry: P }; } }, + { label: "opt-tail", build: () => { const E = rule((s) => [[Id, opt('.', Id)]]); const P = rule(() => [[many(E)]]); return { name: 'pa2', scopeName: 'source.pa2', tokens: { Id }, rules: { E, P }, entry: P }; } }, + { label: "via-ref", build: () => { const Acc = rule(() => 
[['.', Id]]); const E = rule(() => [[Id, many(Acc)], [Id]]); const P = rule(() => [[many(E)]]); return { name: 'pa3', scopeName: 'source.pa3', tokens: { Id }, rules: { Acc, E, P }, entry: P }; } }, + ], + }, + { + name: "paren-arrow", observable: (tm => { const r = tm.repository['arrow-function-params']; return !!r && JSON.stringify(r).includes('variable.parameter'); }), + factorings: [ + { label: "canonical", build: () => { const Pm = rule(() => [[Id]]); const E = rule((s) => [['(', sep(Pm, ','), ')', '=>', s], [Id]]); const P = rule(() => [[many(E)]]); return { name: 'pra1', scopeName: 'source.pra1', tokens: { Id }, rules: { Pm, E, P }, entry: P }; } }, + { label: "opt-tail", build: () => { const Pm = rule(() => [[Id]]); const Ty = rule(() => [[Id]]); const E = rule((s) => [['(', sep(Pm, ','), ')', opt(':', Ty), '=>', s], [Id]]); const P = rule(() => [[many(E)]]); return { name: 'pra2', scopeName: 'source.pra2', tokens: { Id }, rules: { Pm, Ty, E, P }, entry: P }; } }, + { label: "alt-split", build: () => { const Pm = rule(() => [[Id]]); const E = rule((s) => [alt(['(', sep(Pm, ','), ')', '=>', s], [Id])]); const P = rule(() => [[many(E)]]); return { name: 'pra3', scopeName: 'source.pra3', tokens: { Id }, rules: { Pm, E, P }, entry: P }; } }, + ], + }, + { + name: "direct-param-keyword", observable: (tm => !!tm.repository['ctor-declaration'] && !!tm.repository['declaration-params']), + factorings: [ + { label: "canonical", build: () => { const Pm = rule(() => [[Id]]); const Blk = rule(() => [['{', '}']]); const D = rule(() => [['fn', Id, '(', sep(Pm, ','), ')', Blk]]); const Ctor = rule(() => [['ctor', '(', sep(Pm, ','), ')', Blk]]); const Mem = rule(() => [D, Ctor]); const Body = rule(() => [['{', many(Mem), '}']]); const Cls = rule(() => [['cls', Id, Body]]); const P = rule(() => [[many(Cls)]]); return { name: 'dpk1', scopeName: 'source.dpk1', tokens: { Id }, scopes: { 'storage.type.function': ['fn', 'ctor'], 'storage.type.class': ['cls'] }, rules: { Pm, Blk, 
D, Ctor, Mem, Body, Cls, P }, entry: P }; } }, + { label: "alt-split", build: () => { const Pm = rule(() => [[Id]]); const Blk = rule(() => [['{', '}']]); const D = rule(() => [['fn', Id, '(', sep(Pm, ','), ')', Blk]]); const Ctor = rule(() => [alt(['ctor', '(', sep(Pm, ','), ')', Blk], ['ctor', '(', ')', Blk])]); const Mem = rule(() => [D, Ctor]); const Body = rule(() => [['{', many(Mem), '}']]); const Cls = rule(() => [['cls', Id, Body]]); const P = rule(() => [[many(Cls)]]); return { name: 'dpk2', scopeName: 'source.dpk2', tokens: { Id }, scopes: { 'storage.type.function': ['fn', 'ctor'], 'storage.type.class': ['cls'] }, rules: { Pm, Blk, D, Ctor, Mem, Body, Cls, P }, entry: P }; } }, + { label: "opt-tail", build: () => { const Pm = rule(() => [[Id]]); const Blk = rule(() => [['{', '}']]); const D = rule(() => [['fn', Id, '(', sep(Pm, ','), ')', Blk]]); const Ctor = rule(() => [['ctor', '(', opt(sep(Pm, ',')), ')', Blk]]); const Mem = rule(() => [D, Ctor]); const Body = rule(() => [['{', many(Mem), '}']]); const Cls = rule(() => [['cls', Id, Body]]); const P = rule(() => [[many(Cls)]]); return { name: 'dpk3', scopeName: 'source.dpk3', tokens: { Id }, scopes: { 'storage.type.function': ['fn', 'ctor'], 'storage.type.class': ['cls'] }, rules: { Pm, Blk, D, Ctor, Mem, Body, Cls, P }, entry: P }; } }, + ], + }, + { + name: "constructor-keyword", observable: (tm => !!tm.repository['new-expression'] && JSON.stringify(tm.repository['new-expression']).includes('keyword.operator.expression.new')), + factorings: [ + { label: "canonical", build: () => { const Ty = rule(() => [[Id]]); const NewE = rule((s) => [['new', Ty, '(', sep(s, ','), ')'], [Id]]); const P = rule(() => [[many(NewE)]]); return { name: 'ck1', scopeName: 'source.ck1', tokens: { Id }, scopes: { 'keyword.operator.expression.new': ['new'] }, rules: { Ty, NewE, P }, entry: P }; } }, + { label: "opt-call-tail", build: () => { const Ty = rule(() => [[Id]]); const NewE = rule((s) => [['new', Ty, opt('(', sep(s, 
','), ')')], [Id]]); const P = rule(() => [[many(NewE)]]); return { name: 'ck2', scopeName: 'source.ck2', tokens: { Id }, scopes: { 'keyword.operator.expression.new': ['new'] }, rules: { Ty, NewE, P }, entry: P }; } }, + { label: "alt-split", build: () => { const Ty = rule(() => [[Id]]); const TArgs = rule(() => [['<', sep(Ty, ','), '>']]); const NewE = rule((s) => [['new', Ty, opt(alt([TArgs], ['(', sep(s, ','), ')']))], [Id]]); const P = rule(() => [[many(NewE)]]); return { name: 'ck3', scopeName: 'source.ck3', tokens: { Id }, prec: [none('<', '>')], scopes: { 'keyword.operator.expression.new': ['new'] }, rules: { Ty, TArgs, NewE, P }, entry: P }; } }, + ], + }, + { + name: "block-declaration", observable: (tm => !!tm.repository['declaration-body']), + factorings: [ + { label: "canonical", build: () => { const M = rule(() => [[Id]]); const Body = rule(() => [['{', many(M), '}']]); const D = rule(() => [['class', Id, Body]]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'b1', scopeName: 'source.b1', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, Body, D, P }, entry: P }; } }, + { label: "alt-split", build: () => { const M = rule(() => [[Id]]); const Body = rule(() => [['{', many(M), '}'], ['{', '}']]); const D = rule(() => [['class', Id, Body]]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'b2', scopeName: 'source.b2', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, Body, D, P }, entry: P }; } }, + { label: "opt-tail", build: () => { const M = rule(() => [[Id]]); const Body = rule(() => [['{', opt(many(M)), '}']]); const D = rule(() => [['class', Id, Body]]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'b3', scopeName: 'source.b3', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, Body, D, P }, entry: P }; } }, + { label: "via-ref", build: () => { const M = rule(() => [[Id]]); const Body = rule(() => [['{', many(M), '}']]); const D = rule(() => 
[['class', Id, opt(Body)]]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'b4', scopeName: 'source.b4', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, Body, D, P }, entry: P }; } }, + ], + }, + { + name: "class-declaration-head", observable: (tm => { const r = tm.repository['class-declaration']; return !!r && JSON.stringify(r.beginCaptures ?? {}).includes('entity.name.type.class'); }), + factorings: [ + { label: "canonical", build: () => { const M = rule(() => [[Id]]); const D = rule(() => [['class', Id, '{', many(M), '}']]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'h1', scopeName: 'source.h1', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, D, P }, entry: P }; } }, + { label: "via-ref", build: () => { const M = rule(() => [[Id]]); const Body = rule(() => [['{', many(M), '}']]); const D = rule(() => [['class', Id, Body]]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'h2', scopeName: 'source.h2', tokens: { Id }, scopes: { 'storage.type.class': ['class'] }, rules: { M, Body, D, P }, entry: P }; } }, + { label: "type-params", build: () => { const T = rule(() => [[Id]]); const TP = rule(() => [['<', sep(T, ','), '>']]); const M = rule(() => [[Id]]); const D = rule(() => [['class', Id, opt(TP), '{', many(M), '}']]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'h3', scopeName: 'source.h3', tokens: { Id }, prec: [none('<', '>')], scopes: { 'storage.type.class': ['class'] }, rules: { T, TP, M, D, P }, entry: P }; } }, + { label: "alt-split", build: () => { const M = rule(() => [[Id]]); const D = rule(() => [[opt('abstract'), 'class', Id, '{', many(M), '}']]); const P = rule(() => [[many(alt(D, Id))]]); return { name: 'h4', scopeName: 'source.h4', tokens: { Id }, scopes: { 'storage.type.class': ['class'], 'storage.modifier': ['abstract'] }, rules: { M, D, P }, entry: P }; } }, + ], + }, + { + name: "regex-literal", observable: (tm => 
!!tm.repository['regex-literal']), + factorings: [ + { label: "canonical", build: () => { const Re = token(seq('/', plus(range('a','z')), '/'), { regex: true }); const Ex = rule(() => [Id, Re]); const P = rule(() => [[many(Ex)]]); return { name: 'r1', scopeName: 'source.r1', tokens: { Id, Re }, prec: [left('/')], rules: { Ex, P }, entry: P }; } }, + { label: "alt-split", build: () => { const Re = token(seq('/', plus(range('a','z')), '/'), { regex: true }); const Ex = rule(() => [Id, Re]); const P = rule(() => [[many(Ex)]]); return { name: 'r2', scopeName: 'source.r2', tokens: { Id, Re }, prec: [none('/')], rules: { Ex, P }, entry: P }; } }, + { label: "via-ref", build: () => { const Re = token(seq('/', plus(range('a','z')), '/'), { regex: true }); const Ex = rule(() => [Id, Re]); const P = rule(() => [[many(Ex)]]); return { name: 'r3', scopeName: 'source.r3', tokens: { Id, Re }, prec: [left('/=')], rules: { Ex, P }, entry: P }; } }, + ], + }, + { + name: "block-sequence", observable: (tm => !!tm.repository['block-sequence']), + factorings: [ + { label: "canonical", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', '{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted', blockPattern: seq(ph, PB) }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag', blockPattern: seq(ph, PB, KS) }); const BlockScalar = token(never(), { scope: 'string.unquoted.block' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const Item = rule(() => [['-', Key, Plain]]); const Seq = rule(() => [[Item, many(Newline, Item)]]); const Fold = rule(() => [[Plain, many(Newline, Plain)]]); const Node = rule(() => [Seq, Fold, Key, Plain, BlockScalar]); const Doc = rule(() => [[many(Node)]]); return { name: 'bseqA', scopeName: 'source.bseqA', tokens: { Key, Plain, BlockScalar, 
Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'], blockScalar: { introducers: ['|', '>'], token: 'BlockScalar', documentMarkers: ['---', '...'], indicatorScope: 'keyword.control.flow.block-scalar' } }, rules: { Item, Seq, Fold, Node, Doc }, entry: Doc }; } }, + { label: "plus-arity", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', '{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted', blockPattern: seq(ph, PB) }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag', blockPattern: seq(ph, PB, KS) }); const BlockScalar = token(never(), { scope: 'string.unquoted.block' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const Item = rule(() => [['-', Key, Plain]]); const Seq = rule(() => [[Item, Newline, Item, many(Newline, Item)]]); const Fold = rule(() => [[Plain, many(Newline, Plain)]]); const Node = rule(() => [Seq, Fold, Key, Plain, BlockScalar]); const Doc = rule(() => [[many(Node)]]); return { name: 'bseqB', scopeName: 'source.bseqB', tokens: { Key, Plain, BlockScalar, Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'], blockScalar: { introducers: ['|', '>'], token: 'BlockScalar', documentMarkers: ['---', '...'], indicatorScope: 'keyword.control.flow.block-scalar' } }, rules: { Item, Seq, Fold, Node, Doc }, entry: Doc }; } }, + ], + }, + { + name: "explicit-key", observable: (tm => !!tm.repository['explicit-key']), + factorings: [ + { label: 
"canonical", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', '{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted' }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const ExplicitEntry = rule(() => [['?', Key, opt(':', Plain)]]); const ExplicitMapping = rule(() => [[ExplicitEntry, many(Newline, ExplicitEntry)]]); const Node = rule(() => [ExplicitMapping, Plain]); const Doc = rule(() => [[many(Node)]]); return { name: 'ekA', scopeName: 'source.ekA', tokens: { Key, Plain, Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'] }, rules: { ExplicitEntry, ExplicitMapping, Node, Doc }, entry: Doc }; } }, + { label: "opt-tail (many-quantified equivalent)", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', '{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted' }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const ExplicitEntry = rule(() => [['?', Key, many(':', Plain)]]); const ExplicitMapping = rule(() => [[ExplicitEntry, many(Newline, ExplicitEntry)]]); const Node = rule(() => [ExplicitMapping, Plain]); const Doc = rule(() => [[many(Node)]]); return { name: 'ekB', scopeName: 'source.ekB', tokens: { Key, Plain, Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', 
flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'] }, rules: { ExplicitEntry, ExplicitMapping, Node, Doc }, entry: Doc }; } }, + ], + }, + { + name: "flow-mapping", observable: (tm => !!tm.repository['flow-mapping']), + factorings: [ + { label: "canonical", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', '{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted' }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const FlowEntry = rule(() => [[Key, ':', Plain]]); const FlowMap = rule(() => [['{', sep(FlowEntry, ','), '}']]); const FlowSeq = rule(() => [['[', sep(Plain, ','), ']']]); const Node = rule(() => [FlowMap, FlowSeq, Plain]); const Doc = rule(() => [[many(Node)]]); const fs = { byOpen: { '{': { begin: 'punctuation.definition.mapping.begin', end: 'punctuation.definition.mapping.end', separator: 'punctuation.separator.mapping' }, '[': { begin: 'punctuation.definition.sequence.begin', end: 'punctuation.definition.sequence.end', separator: 'punctuation.separator.sequence' } }, keyValue: 'punctuation.separator.key-value' }; return { name: 'fmA', scopeName: 'source.fmA', tokens: { Key, Plain, Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', flowScopes: fs, foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'] }, rules: { FlowEntry, FlowMap, FlowSeq, Node, Doc }, entry: Doc }; } }, + { label: "trailing-comma", build: () => { const ph = noneOf(' ', '\t', '\n', ':', '-', '?', ',', '[', ']', '{', '}', '#'); const PB = star(noneOf(':', '\n', ',', '[', ']', 
'{', '}')); const KS = followedBy(seq(star(oneOf(' ', '\t')), ':')); const Plain = token(seq(ph, PB), { scope: 'string.unquoted' }); const Key = token(seq(ph, PB, KS), { scope: 'entity.name.tag' }); const Indent = token(never(), {}), Dedent = token(never(), {}), Newline = token(never(), {}); const FlowEntry = rule(() => [[Key, ':', Plain]]); const FlowMap = rule(() => [['{', sep(FlowEntry, ','), opt(','), '}']]); const FlowSeq = rule(() => [['[', sep(Plain, ','), ']']]); const Node = rule(() => [FlowMap, FlowSeq, Plain]); const Doc = rule(() => [[many(Node)]]); const fs = { byOpen: { '{': { begin: 'punctuation.definition.mapping.begin', end: 'punctuation.definition.mapping.end', separator: 'punctuation.separator.mapping' }, '[': { begin: 'punctuation.definition.sequence.begin', end: 'punctuation.definition.sequence.end', separator: 'punctuation.separator.sequence' } }, keyValue: 'punctuation.separator.key-value' }; return { name: 'fmB', scopeName: 'source.fmB', tokens: { Key, Plain, Indent, Dedent, Newline }, indent: { indentToken: 'Indent', dedentToken: 'Dedent', newlineToken: 'Newline', flowOpen: ['[', '{'], flowClose: [']', '}'], comment: '#', keyValueSeparator: ':', flowScopes: fs, foldTokens: ['Key', 'Plain'], compactIndicators: ['-', '?'] }, rules: { FlowEntry, FlowMap, FlowSeq, Node, Doc }, entry: Doc }; } }, + ], + }, + { + name: "markup-tag", observable: (tm => !!tm.repository['tag']), + factorings: [ + { label: "canonical", build: () => { const Id = token(plus(range('a', 'z')), { identifier: true }); const Text = token(never(), { scope: 'text' }); const Tag = rule(() => [['<', Id, '>']]); const Doc = rule(() => [[many(alt(Tag, Id))]]); return { name: 'mkA', scopeName: 'text.mkA', tokens: { Id, Text }, markup: { textToken: 'Text', tagOpen: '<', tagClose: '>', closeMarker: '/' }, rules: { Tag, Doc }, entry: Doc }; } }, + { label: "alt-split (open/self-close/close element)", build: () => { const Id = token(plus(range('a', 'z')), { identifier: true }); const 
Text = token(never(), { scope: 'text' }); const SelfEnd = token(seq('/', '>')); const CloseTg = token(seq('<', '/')); const Tag = rule(() => [['<', Id, alt(SelfEnd, ['>', CloseTg, Id, '>'])]]); const Doc = rule(() => [[many(alt(Tag, Id))]]); return { name: 'mkB', scopeName: 'text.mkB', tokens: { SelfEnd, CloseTg, Id, Text }, markup: { textToken: 'Text', tagOpen: '<', tagClose: '>', closeMarker: '/' }, rules: { Tag, Doc }, entry: Doc }; } }, + ], + }, ]; for (const c of constructs) { - const results = c.factorings.map(f => ({ label: f.label, ok: emits(c.key, f.build) })); - const allEmit = results.every(r => r.ok); - check(`shape-robustness: \`${c.name}\` emits #${c.key} for every equivalent factoring`, allEmit, - results.filter(r => !r.ok).map(r => r.label).join(', ')); + const xfail = new Set(c.xfail ?? []); + const results = c.factorings.map(f => ({ label: f.label, ok: emits(c.observable, f.build) })); + // PASS unless a factoring NOT on the xfail list drops, OR a factoring ON it unexpectedly + // started discharging (stale xfail — the fix may have landed; drop the annotation). + const newDrops = results.filter(r => !r.ok && !xfail.has(r.label)).map(r => r.label); + const staleXfail = results.filter(r => r.ok && xfail.has(r.label)).map(r => r.label); + check(`shape-robustness: \`${c.name}\` discharges #${c.key ?? c.name} for every equivalent factoring`, + newDrops.length === 0 && staleXfail.length === 0, + [newDrops.length ? `drops: ${newDrops.join(', ')}` : '', + staleXfail.length ? `stale xfail (now discharges, remove): ${staleXfail.join(', ')}` : '', + xfail.size ? 
`(known #51 drops still expected: ${[...xfail].join(', ')})` : ''].filter(Boolean).join(' ')); } } From 2f01d1e9009e8c2713498d2bceb440cb4a00817e Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 07:34:45 +0800 Subject: [PATCH 12/14] Add the jsx-element factoring to the shape-robustness gate (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rounds the gate to 20 constructs — one per shape-detector. detectJsx's hasElementShape walks expandAlts branches for the `<`+ref element lead, so the attribute list inline or wrapped in opt/alt both surface #jsx-element-in-expression; verified robust. 97/97 completeness checks. --- test/tm-completeness.ts | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/test/tm-completeness.ts b/test/tm-completeness.ts index 57ef3cf..2551087 100644 --- a/test/tm-completeness.ts +++ b/test/tm-completeness.ts @@ -707,6 +707,17 @@ function checkShapeRobustness(): void { { label: "alt-split (open/self-close/close element)", build: () => { const Id = token(plus(range('a', 'z')), { identifier: true }); const Text = token(never(), { scope: 'text' }); const SelfEnd = token(seq('/', '>')); const CloseTg = token(seq('<', '/')); const Tag = rule(() => [['<', Id, alt(SelfEnd, ['>', CloseTg, Id, '>'])]]); const Doc = rule(() => [[many(alt(Tag, Id))]]); return { name: 'mkB', scopeName: 'text.mkB', tokens: { SelfEnd, CloseTg, Id, Text }, markup: { textToken: 'Text', tagOpen: '<', tagClose: '>', closeMarker: '/' }, rules: { Tag, Doc }, entry: Doc }; } }, ], }, + // ── jsx-element (detectJsx, key #jsx-element-in-expression) ── + // an element shape `'<' Id … ('/>' | '>' '</' Id '>')`. detectJsx's hasElementShape walks + // expandAlts branches for the `<`+ref lead, so the attribute list written inline or wrapped in + // an `opt`/`alt` still surfaces the element. ROBUST.
+ { + name: 'jsx-element', observable: (tm: any) => !!tm.repository['jsx-element-in-expression'], + factorings: [ + { label: 'canonical', build: () => { const Attr = rule(() => [[Id, opt('=', Id)]]); const Elem = rule(() => [['<', Id, many(Attr), alt(SelfEnd, ['>', CloseTg, Id, '>'])]]); const E = rule(() => [Id, Elem]); const P = rule(() => [[many(E)]]); return { name: 'jx1', scopeName: 'source.jx1', tokens: { SelfEnd, CloseTg, Id }, prec: [none('<', '>')], rules: { Attr, Elem, E, P }, entry: P }; } }, + { label: 'opt-attrs', build: () => { const Attr = rule(() => [[Id, opt('=', Id)]]); const Elem = rule(() => [['<', Id, opt(many(Attr)), alt(SelfEnd, ['>', CloseTg, Id, '>'])]]); const E = rule(() => [Id, Elem]); const P = rule(() => [[many(E)]]); return { name: 'jx2', scopeName: 'source.jx2', tokens: { SelfEnd, CloseTg, Id }, prec: [none('<', '>')], rules: { Attr, Elem, E, P }, entry: P }; } }, + ], + }, ]; for (const c of constructs) { const xfail = new Set(c.xfail ?? []); From 615c9c3688c9e2541ffeca182529a8beaa854f26 Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 09:24:22 +0800 Subject: [PATCH 13/14] Fix two region-leak soundness bugs the ceiling audit derived (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The soundness-ceiling audit (which found gen-tm has no expressiveness ceiling) surfaced two over-accepts by derivation, not corpus — both real divergences from the parser's scoping, both root-caused in gen-tm. Not byte-identical: the two enum/namespace-bearing grammars (ts, tsx) change; the fix is verified by witness + gates + scope-gap, not by byte-identity. 
- enum brace-leak: #enum-body was the lone emitBracketRegion call whose body lacked a self-balancing first include (its siblings #code-block / #declaration-body both carry one), so a nested `{…}` in a member initializer (`enum E { A = { x: 1 }, B }`) let the inner `}` close the enum region early and leak the following members out of enum scope. Prepend the shared #code-block balancing region (an inner brace is an ordinary expression block, not another enum body) so braces balance to any depth. `B` now scopes enummember; the shorthand-object key case stays correct. - declaration-keyword over-accept: a `storage.type.*` keyword that is also a valid identifier (`module`, `namespace`, `type`, `interface`) was painted by the flat global match unconditionally, so `module.exports` / `namespace.foo` / `type.foo` mis-read the property-access base as the declaration keyword. Demote the whole class — the same contextual-keyword machinery gen-tm already uses for `as`/`keyof`/`public` — to an accessor-guarded `*-decl` rule (`(?!\s*(?:\.|\?\.))`), so a real declaration head still wins while a `.`/`?.`-adjacent base falls through to identifier scoping. Verified `module X {}` / `type X = …` / `namespace N {}` keep their keyword scope. Both DERIVED (agnostic 9/9 — keyed on scope family + declaration/reserved sets, no hardcoded word); npm run check 40/40; scope-gap ts unchanged (the corpus is blind to both witnesses — exactly why the audit derived them). 
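Illustration only, not part of this change: a minimal TypeScript sketch of the accessor guard described in the second item. It assumes JavaScript's RegExp handles `\b` and the negative lookahead the same way Oniguruma does for this particular pattern. `buildDeclGuard` and the sample inputs are hypothetical; only the regex shape is taken from the emitted `*-decl` match.

```typescript
// Hypothetical helper: rebuilds the accessor-guarded match that a `*-decl` rule carries.
// This is NOT gen-tm's code; only the resulting regex shape mirrors the emitted grammar.
function buildDeclGuard(keywords: string[]): RegExp {
  const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Assumed accessor openers: plain member access and optional chaining.
  const accessorOpeners = ["\\.", "\\?\\."];
  return new RegExp(
    `\\b(${keywords.map(escapeRegex).join("|")})\\b(?!\\s*(?:${accessorOpeners.join("|")}))`,
  );
}

const namespaceDecl = buildDeclGuard(["module", "namespace"]);

// Declaration heads: a name, a string, or `{` follows the keyword, so the guard lets it match.
const declarationHeads = ["module X {}", 'declare module "foo" {}', "namespace N {}"];
// Property-access bases: `.` or `?.` follows (optionally after whitespace), so the match abstains.
const accessBases = ["module.exports = {}", "namespace?.foo", "module .exports"];

const declHits = declarationHeads.map((src) => namespaceDecl.test(src));
const baseHits = accessBases.map((src) => namespaceDecl.test(src));
console.log(declHits, baseHits);
```

The lookahead keeps the keyword match itself unchanged and only vetoes it when an accessor follows, which is why a real declaration head keeps its scope while the property-access base falls through to identifier scoping.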
--- src/gen-tm.ts | 56 +++++++++++++++++++++++++++++++-- typescript.tmLanguage.json | 27 +++++++++------- typescriptreact.tmLanguage.json | 27 +++++++++------- 3 files changed, 84 insertions(+), 26 deletions(-) diff --git a/src/gen-tm.ts b/src/gen-tm.ts index bbef912..92d9c8f 100644 --- a/src/gen-tm.ts +++ b/src/gen-tm.ts @@ -6732,9 +6732,21 @@ export function generateTmLanguage(grammar: CstGrammar): TmGrammar { // enum-body is the same `{ … }` bracket region; only its body differs (members are // NAMES via #enum-member, not statements). CALLER predicate: a brace-bodied decl // whose keyword scope ends in `.enum`. + // + // A member INITIALIZER may itself contain a nested `{…}` (an object literal, + // `A = { x: 1 }`). Like every brace region (see #code-block / #declaration-body), + // the body must consume an inner balanced `{…}` as a UNIT, or the inner `}` matches + // this region's `end: \}` and prematurely closes the enum body — leaking the + // following members out of enum scope (`B` reads as a plain variable, not an + // enummember). The nested brace is NOT another enum body but an ordinary + // expression-context block, so the balancing recurse target is `#code-block` + // (a self-balancing `{}` with no member-name scoping), listed FIRST so an inner + // `{` opens it before `#enum-member`/`$self` can mis-handle the brace. `#code-block` + // is emitted unconditionally above whenever any declaration exists, so it is always + // present here. repository['enum-body'] = emitBracketRegion({ openLit: '\\{', closeLit: '\\}', beginCapName: blockCapName, endCapName: blockCapName, - bodyPatterns: [{ include: '#enum-member' }, { include: '$self' }], + bodyPatterns: [{ include: '#code-block' }, { include: '#enum-member' }, { include: '$self' }], }); } innerPatterns.push({ include: '#enum-body' }); @@ -7860,6 +7872,34 @@ export function generateTmLanguage(grammar: CstGrammar): TmGrammar { // lookahead. The rest of the group keeps the unconditional flat match. 
const ctxModKws = kws.filter(k => contextualModifiers.has(k) && !alwaysBeforeString(k) && !ctxOpSet.has(k)); const ctxModSet = new Set(ctxModKws); + // Contextual DECLARATION keywords: a `storage.type.*` keyword whose keyword role + // is owned by a positional `*-declaration` rule (it appears in + // `declarationKeywords`) AND which is non-reserved (the grammar proves it a valid + // identifier somewhere — no not()-guard forbids it). Such a word doubles as a + // property-access BASE: `module.exports`, `namespace.foo`, `type.foo`, + // `interface.foo` — there the word is an ordinary identifier, NOT the namespace/ + // type-alias keyword. The declaration rule already paints the keyword use + // positionally (`module X {}`, `type X = …`, even the string-named `module "x" + // {}` / `declare module "foo"` forms — those reach the keyword scope only through + // the flat match), so the flat match must ABSTAIN exactly when the word is + // immediately followed by a member accessor (`.`/`?.`); the word then falls + // through to identifier/property scoping. A declaration head always puts + // whitespace + a name/string/`{` after the keyword (never `.`/`?.`), so the guard + // never suppresses a real declaration — every form, including the string-named + // and `import`/`export type` modifier uses, keeps its scope (they are not + // accessor-adjacent). Unlike the support.class abstain below, NO call `(` is + // excluded: a declaration keyword is never call-adjacent, and the witness/official + // guard is the `.`-adjacency. Mirrors the official grammar's `(? !!x); + const ctxDeclKws = (scope === 'storage.type' || scope.startsWith('storage.type.')) && accessorOpeners.length > 0 + ? kws.filter(k => declarationKeywords.has(k) && !reservedWordsForCtx.has(k) && !alwaysBeforeString(k) && !ctxOpSet.has(k) && !ctxModSet.has(k)) + : []; + const ctxDeclSet = new Set(ctxDeclKws); // Drop keywords whose keyword role is owned by a dedicated declaration context // (e.g. 
`constructor` → #constructor-declaration in class bodies). They double // as identifiers everywhere else, so the flat match must not paint them. @@ -7867,7 +7907,7 @@ export function generateTmLanguage(grammar: CstGrammar): TmGrammar { // scoped positionally by #import-export-all — never in the flat match, which // would mis-paint their ordinary-identifier uses (`const defer`, `defer()`, // `import defer from "m"`). - const globalKws = kws.filter(k => !alwaysBeforeString(k) && !ctxOpSet.has(k) && !ctxModSet.has(k) && !contextDeclaredKws.has(k) && !phaseModifierKws.has(k)); + const globalKws = kws.filter(k => !alwaysBeforeString(k) && !ctxOpSet.has(k) && !ctxModSet.has(k) && !ctxDeclSet.has(k) && !contextDeclaredKws.has(k) && !phaseModifierKws.has(k)); if (globalKws.length > 0) { // A `support.class` group names BUILTIN CLASS/TYPE identifiers (Object, Array, // Promise, …) — but, unlike a true keyword, those words also appear as runtime @@ -7924,6 +7964,18 @@ export function generateTmLanguage(grammar: CstGrammar): TmGrammar { }; topPatterns.push({ include: `#${mkey}` }); } + // Contextual declaration keywords (see ctxDeclKws above): one accessor-guarded + // entry, placed at the same position as the flat group so a real declaration-head + // keyword still wins, while a `.`/`?.`-adjacent property-access base falls through + // to identifier/property scoping. 
+ if (ctxDeclKws.length > 0) { + const dkey = `${key}-decl`; + repository[dkey] = { + match: `\\b(${ctxDeclKws.map(escapeRegex).join('|')})\\b(?!\\s*(?:${accessorOpeners.join('|')}))`, + name: `${scope}.${langName}`, + }; + topPatterns.push({ include: `#${dkey}` }); + } for (const kw of beforeStringKws) { const ckey = `${key}-${kw.replace(/[^a-z0-9]/gi, '')}`; repository[ckey] = { diff --git a/typescript.tmLanguage.json b/typescript.tmLanguage.json index 3409a7f..fdf7892 100644 --- a/typescript.tmLanguage.json +++ b/typescript.tmLanguage.json @@ -220,16 +220,16 @@ "include": "#scope-storage-type" }, { - "include": "#scope-storage-type-interface" + "include": "#scope-storage-type-interface-decl" }, { - "include": "#scope-storage-type-type" + "include": "#scope-storage-type-type-decl" }, { "include": "#scope-storage-type-enum" }, { - "include": "#scope-storage-type-namespace" + "include": "#scope-storage-type-namespace-decl" }, { "include": "#scope-storage-type-function-arrow" @@ -1715,6 +1715,9 @@ } }, "patterns": [ + { + "include": "#code-block" + }, { "include": "#enum-member" }, @@ -2612,20 +2615,20 @@ "match": "\\b(debugger|with)\\b", "name": "keyword.control.ts" }, - "scope-storage-type-interface": { - "match": "\\b(interface)\\b", + "scope-storage-type-interface-decl": { + "match": "\\b(interface)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.interface.ts" }, - "scope-storage-type-type": { - "match": "\\b(type)\\b", + "scope-storage-type-type-decl": { + "match": "\\b(type)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.type.ts" }, "scope-storage-type-enum": { "match": "\\b(enum)\\b", "name": "storage.type.enum.ts" }, - "scope-storage-type-namespace": { - "match": "\\b(module|namespace)\\b", + "scope-storage-type-namespace-decl": { + "match": "\\b(module|namespace)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.namespace.ts" }, "scope-support-variable": { @@ -2816,8 +2819,8 @@ "match": "\\b(async)\\b", "name": "storage.modifier.ts" }, - 
"expr-scope-storage-type-namespace": { - "match": "\\b(module)\\b", + "expr-scope-storage-type-namespace-decl": { + "match": "\\b(module)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.namespace.ts" }, "expression": { @@ -2988,7 +2991,7 @@ "include": "#scope-storage-type-class" }, { - "include": "#expr-scope-storage-type-namespace" + "include": "#expr-scope-storage-type-namespace-decl" }, { "include": "#scope-storage-type-function-arrow" diff --git a/typescriptreact.tmLanguage.json b/typescriptreact.tmLanguage.json index 5c3f52f..e1a773c 100644 --- a/typescriptreact.tmLanguage.json +++ b/typescriptreact.tmLanguage.json @@ -226,16 +226,16 @@ "include": "#scope-storage-type" }, { - "include": "#scope-storage-type-interface" + "include": "#scope-storage-type-interface-decl" }, { - "include": "#scope-storage-type-type" + "include": "#scope-storage-type-type-decl" }, { "include": "#scope-storage-type-enum" }, { - "include": "#scope-storage-type-namespace" + "include": "#scope-storage-type-namespace-decl" }, { "include": "#scope-storage-type-function-arrow" @@ -2220,6 +2220,9 @@ } }, "patterns": [ + { + "include": "#code-block" + }, { "include": "#enum-member" }, @@ -3117,20 +3120,20 @@ "match": "\\b(debugger|with)\\b", "name": "keyword.control.tsx" }, - "scope-storage-type-interface": { - "match": "\\b(interface)\\b", + "scope-storage-type-interface-decl": { + "match": "\\b(interface)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.interface.tsx" }, - "scope-storage-type-type": { - "match": "\\b(type)\\b", + "scope-storage-type-type-decl": { + "match": "\\b(type)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.type.tsx" }, "scope-storage-type-enum": { "match": "\\b(enum)\\b", "name": "storage.type.enum.tsx" }, - "scope-storage-type-namespace": { - "match": "\\b(module|namespace)\\b", + "scope-storage-type-namespace-decl": { + "match": "\\b(module|namespace)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.namespace.tsx" }, "scope-support-variable": { @@ -3321,8 
+3324,8 @@ "match": "\\b(async)\\b", "name": "storage.modifier.tsx" }, - "expr-scope-storage-type-namespace": { - "match": "\\b(module)\\b", + "expr-scope-storage-type-namespace-decl": { + "match": "\\b(module)\\b(?!\\s*(?:\\.|\\?\\.))", "name": "storage.type.namespace.tsx" }, "expression": { @@ -3499,7 +3502,7 @@ "include": "#scope-storage-type-class" }, { - "include": "#expr-scope-storage-type-namespace" + "include": "#expr-scope-storage-type-namespace-decl" }, { "include": "#scope-storage-type-function-arrow" From 2548c2951e94699a12cd215fe0294ecc8d6a47be Mon Sep 17 00:00:00 2001 From: Johnson Chu Date: Sun, 21 Jun 2026 09:26:54 +0800 Subject: [PATCH 14/14] COMPLETENESS.md: record the soundness ceiling audit (#51) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the symmetric half to the completeness spine: a "Soundness — no ceiling" section stating the audited result that gen-tm has no TextMate expressiveness ceiling on well-formed input — because its obligation space (lex-local + delimiter-carried recursion, no general rule-context→scope channel) is itself TextMate-bounded — with the honest limits (well-formed input only; strong evidence + structural argument, not a closed-form ∀-grammar proof; the two region-leak over-accepts the audit derived were fixable bugs, not ceilings). --- COMPLETENESS.md | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/COMPLETENESS.md b/COMPLETENESS.md index 32777c4..0dcd75c 100644 --- a/COMPLETENESS.md +++ b/COMPLETENESS.md @@ -215,6 +215,41 @@ recognised and scoped; what is refined at the frontier is *which* role at the am Improving that precision (var-width forms for the `vscode-oniguruma`-only grammars, `\g<>` for the arrow region) is a separate, soundness-gated change. 
**The completeness obligation is discharged.** +## Soundness — no ceiling, audited not assumed + +Completeness is *presence*; soundness is *correctness* — does each present construct read the +*right* scope on every input. Soundness is not decided here (`test/scope-gap.ts` / `gap-ledger`), but +one structural question about it **is** settled: does gen-tm have a *ceiling* — a scope obligation no +TextMate grammar can reproduce, the wall the naive "regex < CFG" intuition predicts? By exhaustive +audit the answer is **no ceiling on well-formed input** — and the reason is not that TextMate is +omnipotent but that gen-tm's *obligation space* is itself TextMate-bounded: + +- **The obligation is the role map, not the parse.** A scope in Monogram is `f(token type)` [lex-time] + or `f(literal text)` [`scopeOverrides`], plus a finite set of shape-detectors — there is no general + *rule-context → scope* channel. So even when the CST encodes unbounded context (a cross-serial index, + a depth parity), the highlighting obligation does not read it; there is nothing for TextMate to fail + to reproduce. Two such candidates, built and run through `createParser` + the role map, both + *dissolve* at the obligation layer. +- **The dichotomy.** A context-dependent scope is therefore one of: *lex-local* (a plain match), + *bounded* (a fixed-width neighborhood), or *delimiter-carried recursion* — unbounded nesting whose + every level has a **consumable** delimiter (`{`, `<`, `-`), so a `\G`/`\g<>`/self-include reproduces + it to unbounded depth (the YAML compact block-sequence: 0 mismatch d=1..12). The only shape that + *would* be a ceiling — unbounded, **non-delimiter**, parser-assigned context — is unconstructable + here: a delimiter-less depth scope is one the *parser itself* cannot assign from finite roles. 
+- **The audit.** Every context-dependent scope channel was enumerated (independently re-derived from + the emitted grammar — 78 keyword/identifier heads — and ground-truthed with `vscode-oniguruma` at + deep nesting): each is lex-local, bounded, or delimiter-carried; the ceiling shape occurs **zero** + times. Delimiter-carried reproduction was confirmed frame-per-level with no fixed-arity cap (enum + d=25, generic d=60, JSX d=20, template `${}` d=20). + +Honest bounds: this is **well-formed-input** soundness, and **strong evidence + a structural argument, +not a closed-form ∀-grammar proof** — it audits the channels gen-tm *emits* and infers "unbounded" +from monotone frame growth verified to depth 60. `highlighter ≡ parser` thus holds on well-formed +input but is **not a total equivalence**: broken/partial input still has local region-leaks where a +regex region's heuristic boundary diverges from the parse. The audit *derived* two such over-accepts +the corpus was blind to — an enum brace-leak and a `module`/`namespace`/`type` declaration-keyword +over-accept — and root-caused both in gen-tm. They were fixable bugs, not ceilings. + ## The proof ledger The fixed denominator is every measured obligation (token discharge + repository reachability +