You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is the umbrella/PoC issue for reworking the option-value grammar engine
(semanticdata/optionvalues/grammar/) so that it can drive capabilities —
role-based coloring, grammar-based completion, deprecation/semantic warnings, and
value canonicalization — instead of only answering "is this value valid?".
It consolidates and supersedes the analysis behind:
The plan is to build a proof of concept end-to-end (likely on RestrictAddressFamilies= as the canary), live with it / mentally chew on it, and
then cut it up into smaller shippable PRs in roughly the order below.
Background: how the engine works today
Every Combinator implements two near-identical recursive traversals:
SyntacticMatch — lenient ("could be this — color it / locate the error")
SemanticMatch — strict ("is this actually valid?")
Both return a flat MatchResult:
data classMatchResult(
valtokens:List<String>,
valmatchResult:Int, // new offset, or -1 = no matchvalterminals:List<TerminalCombinator>, // parallel to tokensvallongestMatch:Int// furthest offset reached (error localization)
)
Three properties of the current design drive everything below:
Single-path, greedy, no-backtracking (PEG-like, minus backtracking). Each
combinator commits to its first success and never reconsiders. ZeroOrMore/OneOrMore/Repeat consume greedily and never give back; AlternativeCombinator returns the first alternative that matches (not the
longest) and never retries if a later sequence element fails. The 15
hand-ordered, most-specific-first IPv6 alternatives in Combinators.kt exist
purely to dodge this.
The result is flat and lossy — tokens + a parallel terminals list. No
tree, no record of which alternative was taken, no nesting. (Tokens do tile
the matched region contiguously, so TextRanges are reconstructable by summing
lengths — useful, but there's no grouping.)
Syntactic vs Semantic is duplicated traversal. The recursion is identical;
only the terminal predicate differs (RegexTerminal's two regexes, IntegerTerminal's bounds, FlexibleLiteralChoiceTerminal's loose-vs-exact).
The one limitation behind #345, #343, and the alternative-ordering worry
These are the same deficiency wearing three hats: single-path greedy matching
over a flat result.
Ensure Grammar Matching is Complete #345 (completeness): greedy commitment can reject strings the grammar should accept. The real present-day cost is the maintenance tax — authors must
hand-order alternatives and avoid greedy overlap (the "subtle bugs" worry).
Alternative descending-order fragility: same root. If the matcher explored all paths, ordering would stop mattering.
Grammar Based Completion #343 (completion): needs "given the prefix typed so far, what could come
next?" Single-path matching surfaces at most one continuation, never the full
set — which is why Grammar Based Completion #343 notes it's blocked on a redesign.
Design direction
A. Color/intelligence is by semantic role, not terminal type
127.0.0.1 is one literal (an IPv4 address), not blue-octet/black-dot/blue-octet.
The unit of meaning is a labeled span, usually a whole sub-grammar. So we add a
wrapper that tags a sub-grammar with a role:
Deprecation/semantic warnings (new): role ADDRESS_FAMILY → look the matched
text up in a deprecated set (e.g. AF_WANPIPE, AF_DECnet, AF_NETBEUI, …) and warn.
This argues for building real labeled structure into the result (a parse tree),
not lightweight ad-hoc captures.
B. One richer primitive, capabilities as free functions (NOT a visitor split)
We explicitly reject pulling the algorithms out into per-capability visitors,
because that imposes an "implement-it-again for every new combinator type" tax.
Instead, each combinator implements one method — parse() — that returns all the ways it can match (Wadler's "list of successes": a lazy Sequence<ParseState>), where each ParseState carries a labeled sub-tree plus
per-node validity flags plus the frontier expected-set. Everything else
(coloring, completion, deprecation, canonicalization, validation) becomes a free
function over the parse result — zero combinator code.
per new combinator
per new capability
today
2 methods (SyntacticMatch+SemanticMatch)
a method on every combinator
proposed
1 method (parse)
1 free function, no edits
So the refactor reduces per-combinator code and removes the duplication.
C. Collapse Syntactic/Semantic into one pass
Lenient is always a superset of strict (the design intent). So run one lenient
parse, tagging each terminal node with a validity flag:
Syntactically valid = some success reaches end-of-input.
Semantically valid = some end-of-input success has every node valid.
One traversal answers both; the duplicated SyntacticMatch/SemanticMatch bodies
go away. (Edge to settle in the PoC: for full completeness a terminal may need to
yield multiple match lengths rather than always-greedy-one.)
D. Completion and errors are the same mechanism
The hard part of completion is partial input: public class isn't a valid program,
but you must complete it, not red-squiggle it. Both completion and good error
messages come from one thing:
the set of terminals the grammar expected at the furthest input position reached.
expected-set at end-of-document → completion (AF_IN → AF_INET, AF_INET6)
expected-set at a failure point → error message ("expected one of: …")
longestMatch is already the seed of "furthest position reached." Terminals get one
small optional hook — suggest() — so a literal-choice can enumerate candidates
while a regex/integer contributes a generic "expected a number"; most combinators
inherit a default.
So #345 (completeness), #343 (completion), and good errors are three readouts of one
engine.
Target result shape (sketch — will evolve in the PoC)
// Each combinator: fun parse(input: String, offset: Int): Sequence<ParseState>data classParseState(
valoffset:Int, // how far we consumedvalnode:ParseNode, // labeled sub-tree built so farvalexpected:Set<Expectation> // terminals attempted at the frontier (completion/errors)
)
sealedinterfaceParseNode {
// role label (nullable), source TextRange, validity flag, children
}
Capabilities (all free functions over the above):
validate(result) → problem descriptors (replaces current generateProblemDescriptors)
Spec + fix RestrictAddressFamilies=. Replace the loose RegexTerminal("AF_[A-Z0-9_]+") with a FlexibleLiteralChoiceTerminal of the
real af_from_name families, wrapped in an ADDRESS_FAMILY role label. Add a golden test suite (valid / invalid / completion / ~-inversion). This is
grammar-definition-level, so it's engine-agnostic and carries across the
refactor — and doubles as the refactor's regression canary.
New parse()-based engine (list-of-successes + labeled tree + per-node
validity + frontier expected-set), validated against the step-1 tests. Closes Ensure Grammar Matching is Complete #345 and removes alternative-ordering fragility; merges Syntactic/Semantic into
one pass.
grammar completion (Grammar Based Completion #343) — note GrammarOptionValue.getAutoCompleteOptions
currently returns emptySet(), so all grammar options offer zero completion today
deprecated-address-family annotator (new capability; ref address_families(7))
IPv6 canonicalization (IPv6 Recommended Formats #363) — RFC 5952; likely a hand-rolled ~40-line
canonicalizer with tests (IntelliJ bundles Guava whose InetAddresses.toAddrString() is RFC 5952-compliant, but a self-contained
impl avoids depending on a bundled lib)
RestrictAddressFamilies= is the canary throughout: small, but it exercises an
alternative, a repeat/list, and the ~ inversion — enough to stress backtracking
without drowning the PoC.
Notes / open questions
Terminal multi-length matching for full completeness (B/C edge above).
Performance: option values are short strings, so list-of-successes cost is a
non-issue, but worth a sanity check on the IPv6 grammar.
Whether the parse tree fully replaces the flat MatchResult or wraps it during a
transition period.
Summary
This is the umbrella/PoC issue for reworking the option-value grammar engine
(
semanticdata/optionvalues/grammar/) so that it can drive capabilities —role-based coloring, grammar-based completion, deprecation/semantic warnings, and
value canonicalization — instead of only answering "is this value valid?".
It consolidates and supersedes the analysis behind:
The plan is to build a proof of concept end-to-end (likely on
RestrictAddressFamilies=as the canary), live with it / mentally chew on it, andthen cut it up into smaller shippable PRs in roughly the order below.
Background: how the engine works today
Every
Combinatorimplements two near-identical recursive traversals:SyntacticMatch— lenient ("could be this — color it / locate the error")SemanticMatch— strict ("is this actually valid?")Both return a flat
MatchResult:Three properties of the current design drive everything below:
combinator commits to its first success and never reconsiders.
ZeroOrMore/OneOrMore/Repeatconsume greedily and never give back;AlternativeCombinatorreturns the first alternative that matches (not thelongest) and never retries if a later sequence element fails. The 15
hand-ordered, most-specific-first IPv6 alternatives in
Combinators.ktexistpurely to dodge this.
tokens+ a parallelterminalslist. Notree, no record of which alternative was taken, no nesting. (Tokens do tile
the matched region contiguously, so
TextRanges are reconstructable by summinglengths — useful, but there's no grouping.)
only the terminal predicate differs (
RegexTerminal's two regexes,IntegerTerminal's bounds,FlexibleLiteralChoiceTerminal's loose-vs-exact).The one limitation behind #345, #343, and the alternative-ordering worry
These are the same deficiency wearing three hats: single-path greedy matching
over a flat result.
should accept. The real present-day cost is the maintenance tax — authors must
hand-order alternatives and avoid greedy overlap (the "subtle bugs" worry).
all paths, ordering would stop mattering.
next?" Single-path matching surfaces at most one continuation, never the full
set — which is why Grammar Based Completion #343 notes it's blocked on a redesign.
Design direction
A. Color/intelligence is by semantic role, not terminal type
127.0.0.1is one literal (an IPv4 address), not blue-octet/black-dot/blue-octet.The unit of meaning is a labeled span, usually a whole sub-grammar. So we add a
wrapper that tags a sub-grammar with a role:
That single label powers multiple capabilities, all as "walk the labeled result,
act per node":
TextAttributesKey. Inner dots never get their own color.ADDRESS_FAMILY→ look the matchedtext up in a deprecated set (e.g.
AF_WANPIPE,AF_DECnet,AF_NETBEUI, …) and warn.IPV6_ADDRESS→ grab the span, rewrite it.This argues for building real labeled structure into the result (a parse tree),
not lightweight ad-hoc captures.
B. One richer primitive, capabilities as free functions (NOT a visitor split)
We explicitly reject pulling the algorithms out into per-capability visitors,
because that imposes an "implement-it-again for every new combinator type" tax.
Instead, each combinator implements one method —
parse()— that returnsall the ways it can match (Wadler's "list of successes": a lazy
Sequence<ParseState>), where eachParseStatecarries a labeled sub-tree plusper-node validity flags plus the frontier expected-set. Everything else
(coloring, completion, deprecation, canonicalization, validation) becomes a free
function over the parse result — zero combinator code.
SyntacticMatch+SemanticMatch)parse)So the refactor reduces per-combinator code and removes the duplication.
C. Collapse Syntactic/Semantic into one pass
Lenient is always a superset of strict (the design intent). So run one lenient
parse, tagging each terminal node with a validity flag:
One traversal answers both; the duplicated
SyntacticMatch/SemanticMatchbodiesgo away. (Edge to settle in the PoC: for full completeness a terminal may need to
yield multiple match lengths rather than always-greedy-one.)
D. Completion and errors are the same mechanism
The hard part of completion is partial input:
public classisn't a valid program,but you must complete it, not red-squiggle it. Both completion and good error
messages come from one thing:
AF_IN→AF_INET,AF_INET6)longestMatchis already the seed of "furthest position reached." Terminals get onesmall optional hook —
suggest()— so a literal-choice can enumerate candidateswhile a regex/integer contributes a generic "expected a number"; most combinators
inherit a default.
So #345 (completeness), #343 (completion), and good errors are three readouts of one
engine.
Target result shape (sketch — will evolve in the PoC)
Capabilities (all free functions over the above):
validate(result)→ problem descriptors (replaces currentgenerateProblemDescriptors)colorize(result)→List<Pair<TextRange, Role>>for an annotator (Grammar Based Syntax Highlighting #342)complete(grammar, prefix)→Set<String>for the completion contributor (Grammar Based Completion #343)semanticWarnings(result)→ deprecation/role-based warnings (new)canonicalize(result, role)→ text rewrite, e.g. RFC 5952 IPv6 (IPv6 Recommended Formats #363)Proposed sequence (PoC first, then cut into PRs)
RestrictAddressFamilies=. Replace the looseRegexTerminal("AF_[A-Z0-9_]+")with aFlexibleLiteralChoiceTerminalof thereal
af_from_namefamilies, wrapped in anADDRESS_FAMILYrole label. Add agolden test suite (valid / invalid / completion /
~-inversion). This isgrammar-definition-level, so it's engine-agnostic and carries across the
refactor — and doubles as the refactor's regression canary.
parse()-based engine (list-of-successes + labeled tree + per-nodevalidity + frontier expected-set), validated against the step-1 tests. Closes
Ensure Grammar Matching is Complete #345 and removes alternative-ordering fragility; merges Syntactic/Semantic into
one pass.
GrammarOptionValue.getAutoCompleteOptionscurrently returns
emptySet(), so all grammar options offer zero completion todayaddress_families(7))
canonicalizer with tests (IntelliJ bundles Guava whose
InetAddresses.toAddrString()is RFC 5952-compliant, but a self-containedimpl avoids depending on a bundled lib)
RestrictAddressFamilies=is the canary throughout: small, but it exercises analternative, a repeat/list, and the
~inversion — enough to stress backtrackingwithout drowning the PoC.
Notes / open questions
non-issue, but worth a sanity check on the IPv6 grammar.
MatchResultor wraps it during atransition period.