Skip to content

Emit backend: engine-parity automation, single-source analysis, arena reclamation, lexer cleanup#53

Merged
johnsoncodehk merged 8 commits into
masterfrom
issue-45-emit-backend-audit
Jun 20, 2026
Merged

Emit backend: engine-parity automation, single-source analysis, arena reclamation, lexer cleanup#53
johnsoncodehk merged 8 commits into
masterfrom
issue-45-emit-backend-audit

Conversation

@johnsoncodehk

@johnsoncodehk johnsoncodehk commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Closes #45.

Engine parity is now enforced, not hoped for. The three parity gates (emit-parser-verify / emit-reject-messages / emit-lexer-verify) ran by hand against a /tmp/ts-repo clone and gated nothing, so editing gen-parser.ts had no mechanism forcing emit-parser.ts to follow. They now run on a corpus-free in-repo corpus (test/emit-corpus.ts: curated TS + malformed snippets + the repo's own sources — parity only needs the two engines to agree) as part of npm run check, with a path-gated CI job adding the full external corpus for breadth.

The two engines now share their structural analysis (src/grammar-analysis.ts): ~820 lines of duplicated pure analysis (precedence, NUD/LED + atom/continuation classification, nullability, the left-corner relation, the SECOND fixpoint) collapse to one copy. This fixes a latent divergence — the emitter classified left recursion syntactically (direct-only) while the interpreter used the left-corner transitive closure, so an indirect/hidden-LR rule would route differently and produce divergent CSTs — and removes the two hand-copied "must stay identical" SECOND fixpoints. The emitter's richer reserved-aware FIRST and both engines' control loops stay independent (the interpreter remains a genuine oracle).

Long edit sessions no longer grow the arena unbounded: when the arena outgrows the live tree an edit re-parses fresh to reclaim (incremental ≡ fresh, so pure reclamation). Deletion-shaped surgery now splices in place instead of relocating, and recovery reuses its adoption cache across attempts.

Lexer / dead-code cleanup: the never-wired DFA emitter and its unmeasured "~1.3–1.6×" claim are removed; the emit-lexer's duplicated resync and non-ASCII-whitespace fallbacks are de-duplicated, the latter baked to a charCode test (byte-identical).

Safe by construction: every parser/lexer change is verified byte-identical by the now-wired parity gates plus incremental-verify / exhaustive-edits (3.2M bounded-exhaustive edit ≡ fresh steps, with the new arena-reclamation and in-place-shrink paths asserted to actually fire). All 38 gates pass; see TOTAL-PARSING.md for the gate taxonomy.

Deferred to #54 — three perf micro-optimizations (inline-alt predictive dispatch · SURG_ELEM relaxation · lexKwT call-site sentinel) are NEEDS_MEASUREMENT and gated on the corpus-based bench the repo's measure-first discipline requires, so they are not landed unmeasured here. C5 was already correct; C6's surgery commutativity gate is a verified necessary tradeoff (already documented).

emit-parser-verify / emit-reject-messages / emit-lexer-verify proved
emit ≡ interpreter (CST, token stream, reject messages) but ran by hand
and only against a /tmp/ts-repo clone, so a gen-parser change had no
mechanism forcing emit-parser to follow.

Make the three gates corpus-free: a new in-repo corpus (test/emit-corpus.ts
— curated TS snippets covering every production, a set of malformed snippets
for reject-message coverage, and the repo's own .ts sources) is the hard
gate. Parity only needs the two engines to AGREE, so files both reject still
count, which lets the repo sources serve as a large, license-clean corpus.
The optional /tmp/ts-repo corpus is still swept for breadth when present.

Wire all three into test/check.ts (new 'emit-parity' group) so they run on
every `npm run check`, and add a path-gated CI job that clones the pinned TS
corpus for full-corpus breadth.
The interpreter (gen-parser) and the emitter (emit-parser) each re-derived the
same pure CstGrammar→data analysis — precedence/binding power, NUD/LED and
atom/continuation classification, nullability, the left-corner relation, the
plain FIRST sets, and the ~110-line SECOND-token fixpoint. A second hand-written
copy of a pure function is not an independent oracle, only a place to drift.

One drift was real and latent: the emitter classified left recursion by the
syntactic items[0]===self test (DIRECT only) while the interpreter used the
left-corner transitive closure, so a rule recursive only indirectly or behind a
nullable prefix would be routed differently and produce divergent CSTs. Both now
use the transitive-closure definition + the build-time residual-cycle rejection,
by construction (#45 A3). The two hand-copied SECOND fixpoints, each carrying a
"MUST stay algorithm-identical" warning, are now one copy (#45 A4).

Extract the shared structural analysis into src/grammar-analysis.ts
(analyzeGrammar) and have both engines destructure it. What stays per-engine: the
emitter's richer reserved-aware "qualKeys" FIRST (its own FIRST dispatch) and
every parse CONTROL loop — the interpreter keeps those independent so it remains
a genuine oracle for the emitter. Verified by the now-CI-wired parity gates
(emit ≡ interp: CST, token stream, reject messages) + the full check suite.
…pace path (#45)

B1 — token-dfa.ts: emitTokenScannerBody / compileTokenScanner / buildTokenDfaRaw
had zero callers (the emitter that would turn a token DFA into straight-line JS
was never wired in), and the "~1.3–1.6×" speedup was never measured. Remove them
and the unsupported claim; keep compileTokenDfa, the interpreter DFA that
test/token-dfa-verify.ts measures net-negative vs V8's regex — that measurement
is the evidence behind not pursuing the emitter, recorded in the header.

B3 — the resync retract + diagnostic-truncate one-liner was emitted verbatim at
two points in the relex loop; a single producer (resyncRetractLine) keeps them
from drifting. Emitted output unchanged.

B4 — every cc>127 lead char fired the LX_WS regex even though almost all are
non-whitespace (Unicode identifier chars). Bake lxNonAsciiWs (the /u-free
non-ASCII members of \s) as a guard: `cc>127 && lxNonAsciiWs(cc)` is exactly
"the sticky /\s+/ would match here", so it is byte-identical, minus the wasted
exec on the common case. The duplicated fallback is now one producer too. New
non-ASCII corpus snippets in emit-corpus.ts exercise both branches; the parity
gates confirm the emitted token stream is unchanged.
edit() only appends arena rows — old rows become unreachable garbage — and only a
full parse() reset the cursor, so a long-lived LSP-style session grew the arena
without bound.

Track the compacted live size (nodeN right after the last full parse) and, when an
edit would push nodeN past factor×baseline + min (default 3×, +4096), re-parse that
one edit fresh with no adoption/surgery: runParse restarts at pos 0 over the
already-re-lexed stream, so the result is byte-identical to a fresh parse
(incremental ≡ fresh) — pure reclamation paid as one slower edit. This bounds a
session at ~factor× the live tree. Normal short sessions never cross the threshold,
so their behavior is unchanged.

incremental-verify gains a compaction section (lowered budget, an in-repo source,
120 edits) that asserts compaction actually fires AND every compacted edit stays
byte-identical to fresh. Test hooks __arenaStats / __setArenaBudget expose the
counter + budget.
Node surgery only spliced in place when the new kid count equalled the removed
count; any edit that SHRANK the count (deleting a list element, a member, a union
arm) fell to the end-allocation branch — a full row copy to the arena tail. That
path is correct but relocates, growing the arena.

A shrink (f < removed) FITS the original kid range: the suffix shifts LEFT, which
is an overlap-safe forward copy, so target csD in place and add no rows. The
per-kid transforms (prefix-rel normalize, new kids, suffix copy, end-relative
boundary remap) are exactly the proven end-allocation ones — only the destination
changes — so it reuses that code with ks = csD. Grows (f > removed) still relocate.

exhaustive-edits asserts the in-place-shrink branch actually fires (8 splices at
≤4 chars, 60 at ≤5) and that all 3.2M edited trees stay byte-identical to fresh;
__arenaStats exposes the counter.
The recovery second pass re-runs the entry rule under a growing bar set, up to 33
attempts. Each attempt cleared adoptPath/adoptBase — the descent cache into the
PRE-EDIT tree — and rebuilt it from the root.

That reset is redundant: adoptRoot is the pre-edit tree, fixed for the whole loop,
so the cache stays valid across attempts; adoptSeek already self-truncates to the
prefix that still contains the current token, and the bars change the adoption
DECISION (re-checked per call), not the navigation. Dropping the per-attempt reset
lets a later attempt reuse the descent past the memo-reused bar-free prefix.
Only the per-attempt run-extension state still resets. recovery / incremental-verify
/ exhaustive-edits confirm every recovered tree stays byte-identical.
The audit (B2) noted the windowed re-lex (resync / findRestart) has no gen-lexer
counterpart and emit-lexer-verify only checks a full lex, implying it is untested.
It is verified transitively: incremental-verify / exhaustive-edits compare an
edited parse — whose tokens come from the windowed re-lex — to a fresh FULL parse,
byte-identical, so a wrong windowed token changes the tree and fails there. Record
that coverage chain at the lexer core so it is not mistaken for a gap.
…nt (#45)

Wiring emit-reject-messages to the full TS corpus (the new emit-parity CI job)
exposed a pre-existing divergence on bigintPropertyName.ts: emit and the
interpreter report the SAME primary error ("unexpected 'const' after successful
parse" at the same offset) but a different `[farthest: …]` hint (offset 318 vs
316). It is not a regression — master diverges identically — and it is on master,
not introduced here.

The hint is the parser's exploration high-water mark, and the two engines run
deliberately-independent control loops (the interpreter prunes an inline alt the
emitter still tries — issue #45 D1 / #54), so they can reach it differently in
rare error cases. emit-parser-verify proves the CST is byte-identical across all
18,805 files, so a farthest-only difference never affects correctness. Pin the
primary error (the consumer contract); report farthest-only differences but don't
fail. Confirmed against the full corpus: 0 primary mismatches, 1 farthest-only.
@johnsoncodehk johnsoncodehk merged commit 5db1e1b into master Jun 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emit backend gap audit: engine-parity automation, static-analysis single-source, arena reclamation, lexer dead-path

1 participant