feat: add parse() (list-of-successes) to the grammar combinators (#467 step 2) #469

Open
SJrX wants to merge 2 commits into issue-345 from issue-345-2

@SJrX SJrX commented Jun 21, 2026

Owner

What

Grows a second matching method, Combinator.parse(), directly on the existing combinators — alongside SyntacticMatch/SemanticMatch, not replacing them. The caller (GrammarOptionValue) is untouched and still uses the old methods, and none of the 225 grammar definitions change. This lets us validate the new engine against the real production grammars before deciding on any migration.

Stacked on #468 (issue-345). This PR's diff is the engine method + result types + tests.

(Supersedes the earlier standalone grammar2 PoC, which this PR removes: a parallel package had a jarringly different surface and implied rewriting every grammar. Same idea, grown in place instead.)

The one idea

fun parse(value: String, offset: Int): Sequence<Parse>   // every way it can match, lazily

instead of today's single greedy first-match. Seq threads each possibility of one part into the next; a value is valid if any path consumes the whole input. So Seq(ZeroOrMore("a"), "a") matches "aa", even though it is built from the same SequenceCombinator/ZeroOrMore/LiteralChoiceTerminal classes whose SemanticMatch still rejects that input (the test asserts both). Alt offers all options, so option ordering no longer affects correctness; ZeroOrMore/OneOrMore/Repeat offer every count.
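
To make the list-of-successes idea concrete, here is a minimal standalone sketch. It is not the code in this PR: `Matcher`, `lit`, `seq`, `alt`, `zeroOrMore` and `matches` are hypothetical toy names, and each matcher yields bare end offsets rather than the real `Parse` results.

```kotlin
// Hypothetical toy types for illustration; the real Combinator/Parse classes differ.
fun interface Matcher {
    // Lazily yields every end offset at which this matcher can stop.
    fun parse(value: String, offset: Int): Sequence<Int>
}

// Matches one literal string at the given offset.
fun lit(text: String) = Matcher { value, offset ->
    if (value.startsWith(text, offset)) sequenceOf(offset + text.length) else emptySequence()
}

// Threads every possible end of one part into the next part.
fun seq(vararg parts: Matcher) = Matcher { value, offset ->
    parts.fold(sequenceOf(offset)) { ends, part ->
        ends.flatMap { part.parse(value, it) }
    }
}

// Offers the results of all options, so their order cannot hide a match.
fun alt(vararg options: Matcher) = Matcher { value, offset ->
    options.asSequence().flatMap { it.parse(value, offset) }
}

// Offers every repetition count, starting with zero.
fun zeroOrMore(inner: Matcher): Matcher = object : Matcher {
    override fun parse(value: String, offset: Int): Sequence<Int> = sequence {
        yield(offset) // zero repetitions
        for (end in inner.parse(value, offset)) {
            // Only recurse on progress, so an empty inner match cannot loop forever.
            if (end > offset) yieldAll(parse(value, end))
        }
    }
}

// A value is accepted if any path consumes the whole input.
fun matches(m: Matcher, value: String): Boolean =
    m.parse(value, 0).any { it == value.length }
```

In this sketch, `matches(seq(zeroOrMore(lit("a")), lit("a")), "aa")` is true because `zeroOrMore` also offers the one-repetition path, and `alt(lit("a"), lit("ab"))` accepts "ab" regardless of option order.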

One pass instead of two

Parse.kt adds ParsedToken/Parse and a validate() free function. Each leaf carries a valid flag (the strict/"semantic" check), so one lenient parse answers both:

  • syntactic ("could be this") = did any path consume the whole value?
  • semantic ("actually valid") = did any such path use only valid tokens?

No second traversal, and no change to the authoring DSL.
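
As a rough illustration of how one lenient pass can answer both questions, the sketch below classifies a sequence of candidate paths. The shapes are assumptions: `Token`, `Path`, `Verdict` and `verdict` are hypothetical stand-ins, and the real ParsedToken/Parse/validate() in Parse.kt may look different.

```kotlin
// Hypothetical shapes for illustration; not the actual contents of Parse.kt.
data class Token(val start: Int, val end: Int, val valid: Boolean)
data class Path(val end: Int, val tokens: List<Token>)

enum class Verdict { VALID, SEMANTIC_ERROR, SYNTAX_ERROR }

fun verdict(paths: Sequence<Path>, inputLength: Int): Verdict {
    var sawFullParse = false
    for (path in paths) {
        // Skip paths that did not consume the whole value.
        if (path.end != inputLength) continue
        // A full parse made only of valid tokens is semantically valid.
        if (path.tokens.all { it.valid }) return Verdict.VALID
        // A full parse with an invalid token is still a syntactic match.
        sawFullParse = true
    }
    return if (sawFullParse) Verdict.SEMANTIC_ERROR else Verdict.SYNTAX_ERROR
}
```

Because the paths arrive lazily, the loop can stop at the first fully valid path instead of enumerating every alternative.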

Tests (ParseTest, plain JUnit)

  • runs validate() against the actual ConfigParseAddressFamiliesOptionValue grammar (valid + invalid cases from the canary);
  • runs it against the real IPV6_ADDR combinator (15+ hand-ordered alternatives + IPv4 suffix) — the old engine needed that ordering to dodge greedy traps; parse() explores all forms;
  • an integer-range grammar (config_parse_ip_port shape);
  • the greedy Seq(ZeroOrMore("a"), "a") case, asserting the old SemanticMatch returns -1 while parse() accepts it.

The existing GrammarTest (old engine) is untouched and still passes; the full suite is green.

Known limitations (next layers, deliberately out of scope)

  • Error localization is best-effort. SyntaxError.furthest collapses to 0 when a trailing EOF() discards a partial path (e.g. for the input "AF_INET, AF_INET6" the ideal value would be 7). Pinned in a test with a comment. Precise localization needs the frontier/expected-set layer, the same machinery as completion (Grammar Based Completion #343), which is the next step.
  • No completion yet, no role-labeling/coloring yet (those build on this).
  • No state merging — like any backtracking matcher, pathological grammars could blow up; fine for short option values, but a path cap + a neutral cancellation hook are planned before wiring parse() into IntelliJ (EDT safety).
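
One possible shape for the planned path cap and cancellation hook is sketched below. Nothing here exists in this PR: `bounded`, `PathLimitExceeded`, `maxPaths` and `checkCanceled` are hypothetical names, and the sketch only counts paths that are actually yielded, not work spent on dead ends inside a combinator.

```kotlin
// Hypothetical helper, not part of this PR.
class PathLimitExceeded(limit: Int) : RuntimeException("explored more than $limit paths")

// Wraps a lazy sequence of paths with a hard cap and a neutral cancellation hook.
fun <T> Sequence<T>.bounded(maxPaths: Int, checkCanceled: () -> Unit = {}): Sequence<T> = sequence {
    var seen = 0
    for (item in this@bounded) {
        // The hook takes no IDE types; an adapter layer could throw from it to abort.
        checkCanceled()
        // Fail loudly rather than truncate, so a cut-off search is never
        // mistaken for "no path matched".
        if (++seen > maxPaths) throw PathLimitExceeded(maxPaths)
        yield(item)
    }
}
```

A caller would wrap the lazy result before consuming it, for example `combinator.parse(value, 0).bounded(maxPaths = 10_000)`, where the cap value is purely illustrative.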

Refs #467 #345 #343 #342

🤖 Generated with Claude Code

Parallel, self-contained proof of concept for a new option-value grammar engine
in a `grammar2` package. It does not touch the existing `grammar` package; it
sits beside it so we can play with the approach and validate it against existing
behaviour first.

Core idea: every matcher returns ALL the ways it can match, lazily
(`parse(input, offset): Sequence<Parse>` — Wadler's "list of successes"), instead
of one greedy first match. Seq threads each possibility into the next and a value
is valid if any path consumes the whole input, so Seq(ZeroOrMore("a"), "a") now
matches "aa" (the case the current engine's own docs warn it fails).

Two more ideas come along for free:
- a labeled parse tree (Branch with a Role), so a span like an IPv4 address is one
  labeled unit rather than per-terminal colors;
- per-leaf validity flags, collapsing the old syntactic/semantic two passes into
  one lenient parse (a token can match yet be flagged invalid).

Capabilities (validate, and the seed of coloring) are free functions over the
parse result — no per-combinator code. No IntelliJ types in the engine; it speaks
plain Int offsets so the IntelliJ layer can adapt later.

Tests reproduce the RestrictAddressFamilies= canary cases, show well-formed-but-
unknown vs malformed error kinds, expose leaf roles for coloring, and prove the
greedy completeness win.

Refs #467 #345 #343 #342

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 21, 2026

Test Results

1 117 tests  +5   1 117 ✅ +5   49s ⏱️ -1s
  295 suites +1       0 💤 ±0 
  295 files   +1       0 ❌ ±0 

Results for commit c3b8f6d. ± Comparison against base commit 8b48357.

♻️ This comment has been updated with latest results.

…ammar2

Replaces the standalone grammar2 PoC with the same list-of-successes idea grown
directly on the existing combinators, per review feedback: the parallel package
had a jarringly different surface and implied a rewrite of all 225 grammars.

Instead, add one new method — Combinator.parse(value, offset): Sequence<Parse> —
ALONGSIDE the existing SyntacticMatch/SemanticMatch, implemented on each of the 12
combinators next to its existing match logic. The caller (GrammarOptionValue) is
untouched and still uses the old methods; nothing in the 225 grammar definitions
changes. parse() can therefore be validated against the REAL production grammars
before any migration decision.

- Parse.kt: ParsedToken / Parse result types + validate() free function. One
  lenient pass folds the strict "semantic" check into a per-token `valid` flag, so
  it answers both syntactic (any full parse?) and semantic (a full parse with only
  valid tokens?) without two traversals.
- Each combinator returns every way it can match, lazily; Alt offers all options
  (ordering no longer matters), ZeroOrMore/OneOrMore/Repeat offer every count.
- ParseTest runs validate() against the actual ConfigParseAddressFamiliesOptionValue
  grammar, the real IPV6_ADDR combinator (15+ hand-ordered alternatives), an
  integer-range grammar, and the greedy Seq(ZeroOrMore("a"),"a") case, where it
  shows that the old SemanticMatch still fails while parse() succeeds.

Known limitation pinned in a test: SyntaxError `furthest` is best-effort and
collapses to 0 when a trailing EOF() discards partial progress; precise
localization needs the frontier/expected-set layer (the same machinery as
completion, #343), deliberately out of scope here.

Refs #467 #345 #343 #342

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

SJrX changed the title from "feat: grammar engine PoC — list-of-successes matcher (#467 step 2)" to "feat: add parse() (list-of-successes) to the grammar combinators (#467 step 2)" on Jun 21, 2026