
Support Python regex a and L flags #1589

Open
bcmeireles wants to merge 1 commit into lark-parser:master from bcmeireles:support-a-and-l-regex-flags


@bcmeireles

Added grammar-level support for Python's a and L regular expression flags, closing the gap between Lark's supported regex suffix syntax and Python's re flags. Fixes #1527

@erezsh
Member

erezsh commented May 1, 2026

Did you use an LLM to write this?

@bcmeireles
Author

@erezsh no

@erezsh
Member

erezsh commented May 1, 2026

So, can you explain a few parts?

Why use regexp.encode('latin-1') ?

And what is _strip_width_only_locale_flags for?

And what is test_token_flags_locale_bytes testing?

@bcmeireles
Author

Why use regexp.encode('latin-1') ?

Because Python only accepts (?L) in bytes patterns; compiling a str pattern with that flag raises an error.
latin-1 is chosen because it maps code points 0-255 one-to-one onto byte values. With UTF-8, a character above 127 would become multiple bytes, and the regex would then mean something slightly different.
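
To illustrate the point about bytes patterns and encodings, here is a minimal sketch using only the standard `re` module. It is not code from this PR; the variable names are made up for the example, and it only assumes the default behaviour of CPython's `re`.

```python
import re

pattern = r"\w+"

# An inline (?L) flag is rejected when the pattern is a str...
try:
    re.compile("(?L)" + pattern)
    str_pattern_ok = True
except (re.error, ValueError):
    str_pattern_ok = False

# ...but accepted once the same pattern text is encoded to bytes.
bytes_regex = re.compile(("(?L)" + pattern).encode("latin-1"))
matched = bytes_regex.match(b"abc").group()

# latin-1 keeps one byte per code point in 0-255; UTF-8 does not.
char = "\xe9"  # U+00E9
latin1_bytes = char.encode("latin-1")  # a single byte
utf8_bytes = char.encode("utf-8")      # two bytes

# Consequence: a character class changes meaning under UTF-8.
# Encoded as latin-1, [\xe9] matches exactly the byte 0xE9.
# Encoded as UTF-8, it becomes a class of two separate bytes,
# so it also matches a lone 0xC3 lead byte.
latin1_class = re.compile("[\xe9]".encode("latin-1"))
utf8_class = re.compile("[\xe9]".encode("utf-8"))
utf8_matches_stray_byte = utf8_class.fullmatch(b"\xc3") is not None
latin1_matches_stray_byte = latin1_class.fullmatch(b"\xc3") is not None
```

The last two lines show the "slightly different" meaning concretely: the UTF-8 encoded class accepts a byte that the original pattern never mentioned.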

And what is _strip_width_only_locale_flags for?

Flags like a, L, and u affect character-class semantics, such as \w, \b, and case handling, but they do not affect regex width. For width analysis, like checking whether a token has fixed/min/max length, those flags are irrelevant.
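
The claim that these flags leave the width untouched can be checked with a small sketch. It assumes the width is computed through CPython's regex parser (`sre_parse`, exposed as the private `re._parser` module from Python 3.11), which is how I understand Lark's `get_regexp_width` to work; the `width` helper below is a stand-in written for this example, not Lark's function.

```python
try:
    import re._parser as sre_parse  # Python 3.11+ (private module)
except ImportError:
    import sre_parse  # older Python versions


def width(regexp):
    """Return the (min, max) match width reported by the regex parser."""
    return tuple(sre_parse.parse(regexp).getwidth())


plain = width(r"\w{2,5}")
with_ascii = width(r"(?a)\w{2,5}")
with_unicode = width(r"(?u)\w{2,5}")
# (?L) is only legal in a bytes pattern, so this one is parsed as bytes.
with_locale = width(rb"(?L)\w{2,5}")
```

All four calls report the same width, which is why the flags can be dropped for width analysis. Note that the `(?L)` case only parses because the pattern is bytes; parsing it as a str would fail before any width is computed.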

And what is test_token_flags_locale_bytes testing?

It verifies the bytes-specific locale case:

  • regex has locale flag
  • input is bytes
  • Python gets a bytes regex, not a string regex
  • latin-1 doesn’t mess up the bytes
  • the width-check helper doesn’t break actual matching

@erezsh
Member

erezsh commented May 9, 2026

Ok, I think I have a better understanding now.

I think _strip_width_only_locale_flags() is the wrong approach.

_get_width() calls get_regexp_width(self.to_regexp()), so we can just generate the regexp without the ?L flag, instead of having to do a brittle search and replace over it.

i.e. something like get_regexp_width(self.to_regexp(locale_flags=False)), and get_regexp_width doesn't have to change.
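
A rough sketch of that suggestion, for discussion only. `SketchPatternRE`, `LOCALE_FLAGS`, and the `locale_flags` parameter are hypothetical names for this example and are not Lark's actual classes or signatures; the `get_regexp_width` below is a simplified stand-in that assumes the width comes from CPython's regex parser.

```python
try:
    import re._parser as sre_parse  # Python 3.11+ (private module)
except ImportError:
    import sre_parse  # older Python versions

# Hypothetical: flags that str patterns cannot carry and width does not need.
LOCALE_FLAGS = frozenset("L")


def get_regexp_width(regexp):
    """Simplified stand-in: (min, max) match width of a pattern."""
    return tuple(sre_parse.parse(regexp).getwidth())


class SketchPatternRE:
    """Hypothetical, simplified pattern class; not Lark's PatternRE."""

    def __init__(self, value, flags=()):
        self.value = value
        self.flags = frozenset(flags)

    def to_regexp(self, locale_flags=True):
        # Wrap the pattern in one scoped flag group per flag,
        # optionally leaving out the locale flag.
        regexp = self.value
        for f in sorted(self.flags):
            if f in LOCALE_FLAGS and not locale_flags:
                continue
            regexp = "(?%s:%s)" % (f, regexp)
        return regexp

    def get_width(self):
        # The width helper stays unchanged; it never sees the L flag.
        return get_regexp_width(self.to_regexp(locale_flags=False))


p = SketchPatternRE(r"\w{2,5}", flags="iL")
full = p.to_regexp()
for_width = p.to_regexp(locale_flags=False)
w = p.get_width()
```

With this shape there is no search and replace over an already generated regexp: the flag is simply never emitted on the width path, while matching still uses the full `to_regexp()` output.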

It would also be nice to see better error handling, and tests for that error handling, and possibly some edge cases.


Development

Successfully merging this pull request may close these issues.

Add support for the a and L regex flags
