Skip to content

Grammar engine rework: complete matching, role-based coloring, completion & canonicalization (PoC) #467

Description

@SJrX

Summary

This is the umbrella/PoC issue for reworking the option-value grammar engine
(semanticdata/optionvalues/grammar/) so that it can drive capabilities
role-based coloring, grammar-based completion, deprecation/semantic warnings, and
value canonicalization — instead of only answering "is this value valid?".

It consolidates and supersedes the analysis behind:

The plan is to build a proof of concept end-to-end (likely on
RestrictAddressFamilies= as the canary), live with it / mentally chew on it, and
then cut it up into smaller shippable PRs in roughly the order below.


Background: how the engine works today

Every Combinator implements two near-identical recursive traversals:

  • SyntacticMatch — lenient ("could be this — color it / locate the error")
  • SemanticMatch — strict ("is this actually valid?")

Both return a flat MatchResult:

data class MatchResult(
  val tokens: List<String>,
  val matchResult: Int,                 // new offset, or -1 = no match
  val terminals: List<TerminalCombinator>,  // parallel to tokens
  val longestMatch: Int                 // furthest offset reached (error localization)
)

Three properties of the current design drive everything below:

  1. Single-path, greedy, no-backtracking (PEG-like, minus backtracking). Each
    combinator commits to its first success and never reconsiders.
    ZeroOrMore/OneOrMore/Repeat consume greedily and never give back;
    AlternativeCombinator returns the first alternative that matches (not the
    longest) and never retries if a later sequence element fails. The 15
    hand-ordered, most-specific-first IPv6 alternatives in Combinators.kt exist
    purely to dodge this.
  2. The result is flat and lossytokens + a parallel terminals list. No
    tree, no record of which alternative was taken, no nesting. (Tokens do tile
    the matched region contiguously, so TextRanges are reconstructable by summing
    lengths — useful, but there's no grouping.)
  3. Syntactic vs Semantic is duplicated traversal. The recursion is identical;
    only the terminal predicate differs (RegexTerminal's two regexes,
    IntegerTerminal's bounds, FlexibleLiteralChoiceTerminal's loose-vs-exact).

The one limitation behind #345, #343, and the alternative-ordering worry

These are the same deficiency wearing three hats: single-path greedy matching
over a flat result.

  • Ensure Grammar Matching is Complete #345 (completeness): greedy commitment can reject strings the grammar
    should accept. The real present-day cost is the maintenance tax — authors must
    hand-order alternatives and avoid greedy overlap (the "subtle bugs" worry).
  • Alternative descending-order fragility: same root. If the matcher explored
    all paths, ordering would stop mattering.
  • Grammar Based Completion #343 (completion): needs "given the prefix typed so far, what could come
    next?" Single-path matching surfaces at most one continuation, never the full
    set — which is why Grammar Based Completion #343 notes it's blocked on a redesign.

Design direction

A. Color/intelligence is by semantic role, not terminal type

127.0.0.1 is one literal (an IPv4 address), not blue-octet/black-dot/blue-octet.
The unit of meaning is a labeled span, usually a whole sub-grammar. So we add a
wrapper that tags a sub-grammar with a role:

Labeled(IPV4_ADDRESS, SequenceCombinator(octet, DOT, octet, DOT, octet, DOT, octet))
Labeled(ADDRESS_FAMILY, FlexibleLiteralChoiceTerminal("AF_INET", "AF_INET6", ...))

That single label powers multiple capabilities, all as "walk the labeled result,
act per node":

  • Coloring (Grammar Based Syntax Highlighting #342): role → TextAttributesKey. Inner dots never get their own color.
  • Deprecation/semantic warnings (new): role ADDRESS_FAMILY → look the matched
    text up in a deprecated set (e.g. AF_WANPIPE, AF_DECnet, AF_NETBEUI, …) and warn.
  • Canonicalization (IPv6 Recommended Formats #363): role IPV6_ADDRESS → grab the span, rewrite it.

This argues for building real labeled structure into the result (a parse tree),
not lightweight ad-hoc captures.

B. One richer primitive, capabilities as free functions (NOT a visitor split)

We explicitly reject pulling the algorithms out into per-capability visitors,
because that imposes an "implement-it-again for every new combinator type" tax.

Instead, each combinator implements one method — parse() — that returns
all the ways it can match (Wadler's "list of successes": a lazy
Sequence<ParseState>), where each ParseState carries a labeled sub-tree plus
per-node validity flags plus the frontier expected-set. Everything else
(coloring, completion, deprecation, canonicalization, validation) becomes a free
function over the parse result
— zero combinator code.

per new combinator per new capability
today 2 methods (SyntacticMatch+SemanticMatch) a method on every combinator
proposed 1 method (parse) 1 free function, no edits

So the refactor reduces per-combinator code and removes the duplication.

C. Collapse Syntactic/Semantic into one pass

Lenient is always a superset of strict (the design intent). So run one lenient
parse
, tagging each terminal node with a validity flag:

  • Syntactically valid = some success reaches end-of-input.
  • Semantically valid = some end-of-input success has every node valid.

One traversal answers both; the duplicated SyntacticMatch/SemanticMatch bodies
go away. (Edge to settle in the PoC: for full completeness a terminal may need to
yield multiple match lengths rather than always-greedy-one.)

D. Completion and errors are the same mechanism

The hard part of completion is partial input: public class isn't a valid program,
but you must complete it, not red-squiggle it. Both completion and good error
messages come from one thing:

the set of terminals the grammar expected at the furthest input position reached.

  • expected-set at end-of-document → completion (AF_INAF_INET, AF_INET6)
  • expected-set at a failure point → error message ("expected one of: …")

longestMatch is already the seed of "furthest position reached." Terminals get one
small optional hook — suggest() — so a literal-choice can enumerate candidates
while a regex/integer contributes a generic "expected a number"; most combinators
inherit a default.

So #345 (completeness), #343 (completion), and good errors are three readouts of one
engine.


Target result shape (sketch — will evolve in the PoC)

// Each combinator: fun parse(input: String, offset: Int): Sequence<ParseState>

data class ParseState(
  val offset: Int,            // how far we consumed
  val node: ParseNode,       // labeled sub-tree built so far
  val expected: Set<Expectation> // terminals attempted at the frontier (completion/errors)
)

sealed interface ParseNode {
  // role label (nullable), source TextRange, validity flag, children
}

Capabilities (all free functions over the above):


Proposed sequence (PoC first, then cut into PRs)

  1. Spec + fix RestrictAddressFamilies=. Replace the loose
    RegexTerminal("AF_[A-Z0-9_]+") with a FlexibleLiteralChoiceTerminal of the
    real af_from_name families, wrapped in an ADDRESS_FAMILY role label. Add a
    golden test suite (valid / invalid / completion / ~-inversion). This is
    grammar-definition-level, so it's engine-agnostic and carries across the
    refactor — and doubles as the refactor's regression canary.
  2. New parse()-based engine (list-of-successes + labeled tree + per-node
    validity + frontier expected-set), validated against the step-1 tests. Closes
    Ensure Grammar Matching is Complete #345 and removes alternative-ordering fragility; merges Syntactic/Semantic into
    one pass.
  3. Capabilities as free functions:
    • role-based coloring + annotator + color-settings page (Grammar Based Syntax Highlighting #342)
    • grammar completion (Grammar Based Completion #343) — note GrammarOptionValue.getAutoCompleteOptions
      currently returns emptySet(), so all grammar options offer zero completion today
    • deprecated-address-family annotator (new capability; ref
      address_families(7))
    • IPv6 canonicalization (IPv6 Recommended Formats #363) — RFC 5952; likely a hand-rolled ~40-line
      canonicalizer with tests (IntelliJ bundles Guava whose
      InetAddresses.toAddrString() is RFC 5952-compliant, but a self-contained
      impl avoids depending on a bundled lib)

RestrictAddressFamilies= is the canary throughout: small, but it exercises an
alternative, a repeat/list, and the ~ inversion — enough to stress backtracking
without drowning the PoC.

Notes / open questions

  • Terminal multi-length matching for full completeness (B/C edge above).
  • Performance: option values are short strings, so list-of-successes cost is a
    non-issue, but worth a sanity check on the IPv6 grammar.
  • Whether the parse tree fully replaces the flat MatchResult or wraps it during a
    transition period.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions