Emit backend: engine-parity automation, single-source analysis, arena reclamation, lexer cleanup #53
Merged
emit-parser-verify / emit-reject-messages / emit-lexer-verify proved emit ≡ interpreter (CST, token stream, reject messages) but ran by hand and only against a /tmp/ts-repo clone, so a gen-parser change had no mechanism forcing emit-parser to follow. Make the three gates independent of any external corpus: a new in-repo corpus (test/emit-corpus.ts: curated TS snippets covering every production, a set of malformed snippets for reject-message coverage, and the repo's own .ts sources) is the hard gate. Parity only needs the two engines to AGREE, so files both reject still count, which lets the repo sources serve as a large, license-clean corpus. The optional /tmp/ts-repo corpus is still swept for breadth when present. Wire all three into test/check.ts (new 'emit-parity' group) so they run on every `npm run check`, and add a path-gated CI job that clones the pinned TS corpus for full-corpus breadth.
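The agreement rule above can be sketched as follows. This is a minimal illustration, not the repo's harness: the `Outcome`, `Engine`, `sameOutcome` and `checkParity` names are hypothetical, and a serialized CST is assumed to be comparable as a string.

```typescript
// Hypothetical shapes: none of these names come from the repo.
type Outcome =
  | { ok: true; cst: string }        // serialized CST of an accepted file
  | { ok: false; message: string };  // reject message of a rejected file

type Engine = (source: string) => Outcome;

interface CorpusFile { file: string; source: string }
interface Mismatch { file: string; interp: Outcome; emit: Outcome }

// Parity only requires the two engines to AGREE: two accepts with equal CSTs,
// or two rejects with equal messages. A file both engines reject still counts.
function sameOutcome(a: Outcome, b: Outcome): boolean {
  if (a.ok && b.ok) return a.cst === b.cst;
  if (!a.ok && !b.ok) return a.message === b.message;
  return false; // one engine accepted, the other rejected
}

// Sweep a corpus and collect every file on which the engines disagree.
function checkParity(corpus: CorpusFile[], interp: Engine, emit: Engine): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const { file, source } of corpus) {
    const i = interp(source);
    const e = emit(source);
    if (!sameOutcome(i, e)) mismatches.push({ file, interp: i, emit: e });
  }
  return mismatches;
}
```

Because the check never asks whether a file is valid, malformed snippets and arbitrary repo sources are as useful as curated ones.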
The interpreter (gen-parser) and the emitter (emit-parser) each re-derived the same pure CstGrammar→data analysis — precedence/binding power, NUD/LED and atom/continuation classification, nullability, the left-corner relation, the plain FIRST sets, and the ~110-line SECOND-token fixpoint. A second hand-written copy of a pure function is not an independent oracle, only a place to drift. One drift was real and latent: the emitter classified left recursion by the syntactic items[0]===self test (DIRECT only) while the interpreter used the left-corner transitive closure, so a rule recursive only indirectly or behind a nullable prefix would be routed differently and produce divergent CSTs. Both now use the transitive-closure definition + the build-time residual-cycle rejection, by construction (#45 A3). The two hand-copied SECOND fixpoints, each carrying a "MUST stay algorithm-identical" warning, are now one copy (#45 A4). Extract the shared structural analysis into src/grammar-analysis.ts (analyzeGrammar) and have both engines destructure it. What stays per-engine: the emitter's richer reserved-aware "qualKeys" FIRST (its own FIRST dispatch) and every parse CONTROL loop — the interpreter keeps those independent so it remains a genuine oracle for the emitter. Verified by the now-CI-wired parity gates (emit ≡ interp: CST, token stream, reject messages) + the full check suite.
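To make the drift concrete, here is a small sketch contrasting the two left-recursion tests on a toy grammar. The grammar encoding and every function name are assumptions made for illustration; the real analysis in src/grammar-analysis.ts works on CstGrammar and is not reproduced here.

```typescript
// Toy grammar: rule name -> alternatives, each a sequence of symbols.
// A symbol that names a rule is a nonterminal; anything else is a terminal.
type Grammar = Record<string, string[][]>;

const isRule = (g: Grammar, s: string): boolean =>
  Object.prototype.hasOwnProperty.call(g, s);

// Fixpoint: a rule is nullable if some alternative has only nullable symbols.
function nullableRules(g: Grammar): Record<string, boolean> {
  const nullable: Record<string, boolean> = {};
  let changed = true;
  while (changed) {
    changed = false;
    for (const rule of Object.keys(g)) {
      if (nullable[rule]) continue;
      if (g[rule].some((alt) => alt.every((s) => nullable[s] === true))) {
        nullable[rule] = true;
        changed = true;
      }
    }
  }
  return nullable;
}

// Left-corner relation: rules that can start an alternative, looking past
// any nullable prefix.
function leftCorners(g: Grammar, nullable: Record<string, boolean>): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const rule of Object.keys(g)) {
    const corners: string[] = [];
    for (const alt of g[rule]) {
      for (const sym of alt) {
        if (isRule(g, sym) && corners.indexOf(sym) < 0) corners.push(sym);
        if (!nullable[sym]) break; // a non-nullable symbol ends the prefix
      }
    }
    out[rule] = corners;
  }
  return out;
}

// Syntactic test (direct only): first item of some alternative is the rule.
function isDirectlyLeftRecursive(g: Grammar, rule: string): boolean {
  return g[rule].some((alt) => alt[0] === rule);
}

// Transitive-closure test: the rule reaches itself through left corners.
function isLeftRecursive(corners: Record<string, string[]>, rule: string): boolean {
  const seen: Record<string, boolean> = {};
  const stack = corners[rule].slice();
  while (stack.length > 0) {
    const next = stack.pop() as string;
    if (next === rule) return true;
    if (seen[next]) continue;
    seen[next] = true;
    for (const c of corners[next]) stack.push(c);
  }
  return false;
}

const toy: Grammar = {
  A: [["B", "x"], ["a"]],        // indirect: A -> B -> A
  B: [["A", "y"], ["b"]],
  C: [["Opt", "C", "z"], ["c"]], // hidden behind the nullable Opt
  Opt: [["o"], []],
  D: [["D", "d"], ["e"]],        // direct
};
```

On `toy`, the syntactic test flags only `D`, while the closure test also flags `A`, `B` and `C`; those three are the kind of rule the two engines would have routed differently.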
…pace path (#45)
B1 (token-dfa.ts): emitTokenScannerBody / compileTokenScanner / buildTokenDfaRaw had zero callers (the emitter that would turn a token DFA into straight-line JS was never wired in), and the "~1.3–1.6×" speedup was never measured. Remove them and the unsupported claim; keep compileTokenDfa, the interpreter DFA that test/token-dfa-verify.ts measures net-negative vs V8's regex. That measurement is the evidence behind not pursuing the emitter, recorded in the header.
B3: the resync retract + diagnostic-truncate one-liner was emitted verbatim at two points in the relex loop; a single producer (resyncRetractLine) keeps them from drifting. Emitted output unchanged.
B4: every cc>127 lead char fired the LX_WS regex even though almost all are non-whitespace (Unicode identifier chars). Bake lxNonAsciiWs (the /u-free non-ASCII members of \s) as a guard: `cc>127 && lxNonAsciiWs(cc)` is exactly "the sticky /\s+/ would match here", so it is byte-identical, minus the wasted exec on the common case. The duplicated fallback is now one producer too. New non-ASCII corpus snippets in emit-corpus.ts exercise both branches; the parity gates confirm the emitted token stream is unchanged.
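A sketch of what such a guard could look like, assuming the set of non-ASCII code units that JavaScript's `\s` matches without the `u` flag. `isNonAsciiWs` and `skipNonAsciiWs` are hypothetical stand-ins; the emitted lxNonAsciiWs may be shaped differently.

```typescript
// Non-ASCII UTF-16 code units matched by \s in a JavaScript regex:
// U+00A0, U+1680, U+2000..U+200A, U+2028, U+2029, U+202F, U+205F, U+3000, U+FEFF.
function isNonAsciiWs(cc: number): boolean {
  return (
    cc === 0x00a0 ||
    cc === 0x1680 ||
    (cc >= 0x2000 && cc <= 0x200a) ||
    cc === 0x2028 ||
    cc === 0x2029 ||
    cc === 0x202f ||
    cc === 0x205f ||
    cc === 0x3000 ||
    cc === 0xfeff
  );
}

// Sticky whitespace regex, built via the constructor so the sketch does not
// depend on the compile target accepting a /y literal.
const WS_STICKY = new RegExp("\\s+", "y");

// Handles only the non-ASCII lead-char case (ASCII whitespace is assumed to be
// handled elsewhere). Returns the position after the whitespace run, or `pos`
// unchanged when the lead char is not non-ASCII whitespace, in which case the
// regex is never executed.
function skipNonAsciiWs(src: string, pos: number): number {
  const cc = src.charCodeAt(pos);
  if (cc > 127 && isNonAsciiWs(cc)) {
    WS_STICKY.lastIndex = pos;
    const m = WS_STICKY.exec(src);
    if (m !== null) return pos + m[0].length;
  }
  return pos;
}
```

The common case, a Unicode identifier character such as `é`, fails the charCode test and skips the regex exec entirely.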
edit() only appends arena rows — old rows become unreachable garbage — and only a full parse() reset the cursor, so a long-lived LSP-style session grew the arena without bound. Track the compacted live size (nodeN right after the last full parse) and, when an edit would push nodeN past factor×baseline + min (default 3×, +4096), re-parse that one edit fresh with no adoption/surgery: runParse restarts at pos 0 over the already-re-lexed stream, so the result is byte-identical to a fresh parse (incremental ≡ fresh) — pure reclamation paid as one slower edit. This bounds a session at ~factor× the live tree. Normal short sessions never cross the threshold, so their behavior is unchanged. incremental-verify gains a compaction section (lowered budget, an in-repo source, 120 edits) that asserts compaction actually fires AND every compacted edit stays byte-identical to fresh. Test hooks __arenaStats / __setArenaBudget expose the counter + budget.
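The budget rule can be sketched as a small bookkeeping model. This is an illustration under assumptions: `ArenaSession`, its fields, and the idea that the caller knows how many rows an edit will append are all hypothetical; only the factor×baseline + min threshold and the 3× / +4096 defaults come from the description above.

```typescript
interface ArenaBudget { factor: number; min: number }

// Defaults quoted in the text above: 3x the baseline, plus 4096 rows.
const DEFAULT_BUDGET: ArenaBudget = { factor: 3, min: 4096 };

// Hypothetical model of the arena counter; the real parser tracks nodeN itself.
class ArenaSession {
  nodeN: number;      // rows currently allocated (live + garbage)
  baseline: number;   // compacted live size right after the last full parse
  compactions = 0;    // how many edits were re-parsed fresh

  constructor(liveRows: number, private budget: ArenaBudget = DEFAULT_BUDGET) {
    this.nodeN = liveRows;
    this.baseline = liveRows;
  }

  limit(): number {
    return this.budget.factor * this.baseline + this.budget.min;
  }

  // appendedRows: rows the incremental edit would add to the arena tail.
  // liveRowsAfter: size of the tree a fresh parse of the edited text produces.
  edit(appendedRows: number, liveRowsAfter: number): void {
    if (this.nodeN + appendedRows > this.limit()) {
      // Over budget: parse this one edit fresh, which resets the cursor.
      this.nodeN = liveRowsAfter;
      this.baseline = liveRowsAfter;
      this.compactions++;
    } else {
      this.nodeN += appendedRows; // normal incremental path: append only
    }
  }
}
```

With a 1,000-row tree and the default budget, the limit is 3 × 1000 + 4096 = 7096 rows, so a session appending 500 rows per edit compacts on its 13th edit and never before.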
Node surgery only spliced in place when the new kid count equalled the removed count; any edit that SHRANK the count (deleting a list element, a member, a union arm) fell to the end-allocation branch — a full row copy to the arena tail. That path is correct but relocates, growing the arena. A shrink (f < removed) FITS the original kid range: the suffix shifts LEFT, which is an overlap-safe forward copy, so target csD in place and add no rows. The per-kid transforms (prefix-rel normalize, new kids, suffix copy, end-relative boundary remap) are exactly the proven end-allocation ones — only the destination changes — so it reuses that code with ks = csD. Grows (f > removed) still relocate. exhaustive-edits asserts the in-place-shrink branch actually fires (8 splices at ≤4 chars, 60 at ≤5) and that all 3.2M edited trees stay byte-identical to fresh; __arenaStats exposes the counter.
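The overlap-safety argument can be illustrated on a flat kid array. This sketch is an assumption-laden simplification: `spliceShrinkInPlace` and its parameters are hypothetical, and it omits the per-kid transforms (prefix-rel normalize, end-relative boundary remap) that the real surgery applies.

```typescript
// kids:    flat arena of child slots shared by all nodes
// start:   first slot of this node's kid range
// count:   number of kids the node currently has
// at:      offset inside the range where the edit begins
// removed: how many kids the edit deletes
// fresh:   replacement kids, with fresh.length < removed (a shrink)
// Returns the node's new kid count. No slots are added to the arena.
function spliceShrinkInPlace(
  kids: number[],
  start: number,
  count: number,
  at: number,
  removed: number,
  fresh: number[],
): number {
  const f = fresh.length;
  if (f >= removed) throw new Error("not a shrink: this case relocates instead");

  // Write the replacement kids over the first removed slots.
  let dst = start + at;
  for (let i = 0; i < f; i++) kids[dst++] = fresh[i];

  // Shift the suffix LEFT. dst is always below src, so a forward copy never
  // overwrites a slot before it has been read.
  let src = start + at + removed;
  const end = start + count;
  while (src < end) kids[dst++] = kids[src++];

  return count - (removed - f);
}
```

For example, deleting two kids and inserting one in a five-kid node leaves four kids in the original range, with one stale slot at the tail of that range and the arena length unchanged.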
The recovery second pass re-runs the entry rule under a growing bar set, up to 33 attempts. Each attempt cleared adoptPath/adoptBase — the descent cache into the PRE-EDIT tree — and rebuilt it from the root. That reset is redundant: adoptRoot is the pre-edit tree, fixed for the whole loop, so the cache stays valid across attempts; adoptSeek already self-truncates to the prefix that still contains the current token, and the bars change the adoption DECISION (re-checked per call), not the navigation. Dropping the per-attempt reset lets a later attempt reuse the descent past the memo-reused bar-free prefix. Only the per-attempt run-extension state still resets. recovery / incremental-verify / exhaustive-edits confirm every recovered tree stays byte-identical.
The audit (B2) noted the windowed re-lex (resync / findRestart) has no gen-lexer counterpart and emit-lexer-verify only checks a full lex, implying it is untested. It is verified transitively: incremental-verify / exhaustive-edits compare an edited parse — whose tokens come from the windowed re-lex — to a fresh FULL parse, byte-identical, so a wrong windowed token changes the tree and fails there. Record that coverage chain at the lexer core so it is not mistaken for a gap.
…nt (#45)
Wiring emit-reject-messages to the full TS corpus (the new emit-parity CI job) exposed a pre-existing divergence on bigintPropertyName.ts: emit and the interpreter report the SAME primary error ("unexpected 'const' after successful parse" at the same offset) but a different `[farthest: …]` hint (offset 318 vs 316). It is not a regression: master diverges identically. The hint is the parser's exploration high-water mark, and the two engines run deliberately independent control loops (the interpreter prunes an inline alt the emitter still tries; see issue #45 D1 / #54), so they can reach it differently in rare error cases. emit-parser-verify proves the CST is byte-identical across all 18,805 files, so a farthest-only difference never affects correctness. Pin the primary error (the consumer contract); report farthest-only differences but don't fail. Confirmed against the full corpus: 0 primary mismatches, 1 farthest-only.
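The "pin the primary, report the hint" policy might look like the sketch below. The message layout (primary text followed by a trailing `[farthest: …]` suffix) is an assumption inferred from the description, and `splitReject`, `compareRejects` and `Verdict` are hypothetical names.

```typescript
type Verdict = "match" | "farthest-only" | "primary-mismatch";

interface RejectParts { primary: string; farthest: string | null }

// Assumed layout: "<primary text> [farthest: <hint>]" with the hint optional.
function splitReject(message: string): RejectParts {
  const m = /^([\s\S]*?)\s*\[farthest: ([\s\S]*)\]\s*$/.exec(message);
  return m !== null
    ? { primary: m[1], farthest: m[2] }
    : { primary: message, farthest: null };
}

// The primary error is the consumer contract and must match exactly.
// A difference confined to the farthest hint is reported but does not fail.
function compareRejects(interpMessage: string, emitMessage: string): Verdict {
  const a = splitReject(interpMessage);
  const b = splitReject(emitMessage);
  if (a.primary !== b.primary) return "primary-mismatch";
  return a.farthest === b.farthest ? "match" : "farthest-only";
}

// A gate built on this fails only on primary mismatches.
function gateFails(verdicts: Verdict[]): boolean {
  return verdicts.some((v) => v === "primary-mismatch");
}
```

Under this policy, a corpus result of 0 primary mismatches and 1 farthest-only difference passes the gate while still surfacing the one divergent file.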
Closes #45.
Engine parity is now enforced, not hoped for. The three parity gates (`emit-parser-verify` / `emit-reject-messages` / `emit-lexer-verify`) ran by hand against a `/tmp/ts-repo` clone and gated nothing, so editing `gen-parser.ts` had no mechanism forcing `emit-parser.ts` to follow. They now run on an in-repo corpus (`test/emit-corpus.ts`: curated TS + malformed snippets + the repo's own sources; parity only needs the two engines to agree) as part of `npm run check`, with a path-gated CI job adding the full external corpus for breadth.
The two engines now share their structural analysis (`src/grammar-analysis.ts`): ~820 lines of duplicated pure analysis (precedence, NUD/LED + atom/continuation classification, nullability, the left-corner relation, the SECOND fixpoint) collapse to one copy. This fixes a latent divergence (the emitter classified left recursion syntactically, direct-only, while the interpreter used the left-corner transitive closure, so an indirect/hidden-LR rule would route differently and produce divergent CSTs) and removes the two hand-copied "must stay identical" SECOND fixpoints. The emitter's richer reserved-aware FIRST and both engines' control loops stay independent (the interpreter remains a genuine oracle).
Long edit sessions no longer grow the arena unbounded: when the arena outgrows the live tree, an edit re-parses fresh to reclaim (incremental ≡ fresh, so pure reclamation). Deletion-shaped surgery now splices in place instead of relocating, and recovery reuses its adoption cache across attempts.
Lexer / dead-code cleanup: the never-wired DFA emitter and its unmeasured "~1.3–1.6×" claim are removed; the emit-lexer's duplicated resync and non-ASCII-whitespace fallbacks are de-duplicated, the latter baked to a charCode test (byte-identical).
Safe by construction: every parser/lexer change is verified byte-identical by the now-wired parity gates plus `incremental-verify` / `exhaustive-edits` (3.2M bounded-exhaustive `edit ≡ fresh` steps, with the new arena-reclamation and in-place-shrink paths asserted to actually fire). All 38 gates pass; see `TOTAL-PARSING.md` for the gate taxonomy.
Deferred to #54: three perf micro-optimizations (inline-alt predictive dispatch · `SURG_ELEM` relaxation · `lexKwT` call-site sentinel) are `NEEDS_MEASUREMENT` and gated on the corpus-based bench the repo's measure-first discipline requires, so they are not landed unmeasured here. C5 was already correct; C6's surgery commutativity gate is a verified necessary tradeoff (already documented).