gh-136757: Reuse static strings for exact operator tokens#151838

Open
omkar-334 wants to merge 3 commits into
python:mainfrom
omkar-334:gh-136757-intern-operator-strings

@omkar-334 omkar-334 commented Jun 21, 2026

Contributor

This PR reduces repeated string allocations in the tokenizer by reusing static interned strings for exact multi-character operator tokens such as ==, ->, //=, and ... (the ellipsis token).

Before -

"==" repeated 1,000 times -> 1,000 distinct "==" string objects

After -

"==" repeated 1,000 times -> 1 distinct "==" string object

Benchmarks

These are local synthetic measurements from pyperf --rigorous on a quieter machine. The public tokenize timing cases are the strongest timing evidence; one direct _tokenize.TokenizerIter streaming case still had a pyperf stability warning.

Workload:

  • 2,000 repetitions of a source snippet containing all 24 candidate multi-character exact operators.
  • Compared public tokenize.tokenize() and private _tokenize.TokenizerIter.
  • Measured both streaming token consumption and retained token streams.
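A workload along these lines could be built as follows. This is a minimal sketch, not the script used for the numbers in this PR: the helper names are hypothetical, and deriving the operator list from token.EXACT_TOKEN_TYPES is an assumption about which operators count as candidates. The generated lines are not all valid Python, which is fine here because only the tokenizer runs.

```python
import io
import token
import tokenize

# Assumption: the "candidate" operators are the multi-character keys of
# token.EXACT_TOKEN_TYPES; the PR description does not list them.
MULTI_CHAR_OPS = sorted(op for op in token.EXACT_TOKEN_TYPES if len(op) > 1)


def build_source(repetitions: int = 2000) -> bytes:
    """Repeat a snippet that has one line per candidate operator."""
    snippet = "".join(f"a {op} b\n" for op in MULTI_CHAR_OPS)
    return (snippet * repetitions).encode("utf-8")


def count_tokens(source: bytes) -> int:
    """Streaming consumption: each token is dropped right after it is seen."""
    return sum(1 for _ in tokenize.tokenize(io.BytesIO(source).readline))


def retain_tokens(source: bytes) -> list:
    """Retained stream: every token, and its string, stays alive in a list."""
    return list(tokenize.tokenize(io.BytesIO(source).readline))
```

The streaming case mostly exercises allocation and deallocation speed, while the retained case is where sharing one string object per operator can lower peak memory.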

Object reuse:

Repeated "==" tokens:
before: 1,000 distinct str objects
after:      1 distinct str object

Retained operator token strings:
before: 48,000 distinct str objects for 24 values
after:      24 distinct str objects for 24 values
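The object counts above can be reproduced with a check along these lines. It is a sketch with a hypothetical helper name, and the result depends on whether the interpreter includes this change, so no expected output is shown.

```python
import io
import tokenize


def distinct_operator_objects(op: str = "==", repetitions: int = 1000):
    """Return (occurrences, distinct str objects) for one operator."""
    source = f"a {op} b\n" * repetitions
    # Keep every token alive so that id() values cannot be reused.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    matches = [tok.string for tok in tokens if tok.string == op]
    return len(matches), len({id(s) for s in matches})


occurrences, distinct = distinct_operator_objects()
print(f"{occurrences} occurrences, {distinct} distinct str objects")
```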

Peak memory for retained token streams:

list(tokenize.tokenize):
before: 70.091 MiB
after:  68.113 MiB
change: -2.82%

list(_tokenize.TokenizerIter):
before: 65.054 MiB
after:  63.076 MiB
change: -3.04%
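One way to take a peak-memory reading like this is with tracemalloc. The PR does not say which tool produced the figures above, so treat this as an assumed approach; the helper name and the sample source are hypothetical.

```python
import io
import tokenize
import tracemalloc


def peak_retained_mib(source: bytes) -> float:
    """Peak traced memory (MiB) while a full token list is held alive."""
    tracemalloc.start()
    try:
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del tokens
    return peak / (1024 * 1024)


# Hypothetical stand-in for the benchmark source; not the PR's snippet.
sample = b"x = a == b\ny //= 2\nz = ...\n" * 2000
print(f"peak: {peak_retained_mib(sample):.3f} MiB")
```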

Timing:

tokenize.tokenize count:
51.0 ms +- 0.7 ms -> 48.5 ms +- 0.8 ms
1.05x faster

list(tokenize.tokenize):
91.1 ms +- 1.6 ms -> 88.6 ms +- 1.1 ms
1.03x faster

_tokenize.TokenizerIter count:
28.6 ms +- 0.6 ms -> 26.8 ms +- 0.7 ms
1.07x faster
note: this case had a pyperf stability warning

list(_tokenize.TokenizerIter):
51.8 ms +- 1.0 ms -> 50.7 ms +- 0.6 ms
1.02x faster

ast.parse control:
no significant difference

@omkar-334

Contributor Author

cc @ZeroIntensity for review

@StanFromIreland

Member

Thanks for benchmarking. Could you please also share the script you used?

2,000 repetitions of a source snippet containing all 24 candidate multi-character exact operators.

So that means that 1.05x is the upper bound for a synthetic corpus, but not what anyone tokenizing real code would see. Can you please also benchmark a realistic corpus (e.g., a file from the stdlib)?
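To gauge how far a real file sits from that upper bound, one could measure what share of its tokens are multi-character operators. A rough sketch, assuming the candidates are the multi-character keys of token.EXACT_TOKEN_TYPES and using tokenize.py itself as an arbitrary stdlib file; the helper name is hypothetical:

```python
import token
import tokenize

# Assumption: candidate operators are the multi-character exact tokens.
MULTI_CHAR_OPS = frozenset(op for op in token.EXACT_TOKEN_TYPES if len(op) > 1)


def operator_share(path: str):
    """Return (multi-character operator tokens, total tokens) for a file."""
    with tokenize.open(path) as f:
        tokens = list(tokenize.generate_tokens(f.readline))
    hits = sum(
        1 for tok in tokens
        if tok.type == tokenize.OP and tok.string in MULTI_CHAR_OPS
    )
    return hits, len(tokens)


hits, total = operator_share(tokenize.__file__)
print(f"{hits} of {total} tokens ({hits / total:.1%}) are candidate operators")
```

The smaller that share, the less of the synthetic speedup a real workload can be expected to show.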

list(_tokenize.TokenizerIter):
51.8 ms +- 1.0 ms -> 50.7 ms +- 0.6 ms
1.02x faster

This is noise: the delta is below the combined error, and the ranges overlap.
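The check can be spelled out with the figures quoted from the PR. This sketch assumes "combined error" means the sum of the two +- values, which is the same condition as the two ranges overlapping; the function name is hypothetical.

```python
def looks_like_noise(before_mean, before_err, after_mean, after_err):
    """True if the change is no larger than the two error bars combined.

    Assumption: "combined error" is the sum of the two +- values. With that
    definition the test is equivalent to the intervals
    [mean - err, mean + err] overlapping.
    """
    delta = abs(before_mean - after_mean)
    return delta <= before_err + after_err


# list(_tokenize.TokenizerIter): 51.8 +- 1.0 -> 50.7 +- 0.6
print(looks_like_noise(51.8, 1.0, 50.7, 0.6))  # True: 1.1 ms < 1.6 ms

# tokenize.tokenize count: 51.0 +- 0.7 -> 48.5 +- 0.8
print(looks_like_noise(51.0, 0.7, 48.5, 0.8))  # False: 2.5 ms > 1.5 ms
```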

The change is small, but so is the win. I'm -0 on this change, but I won't object if someone else thinks it's worth it.

Comment thread Lib/test/test_tokenize.py
Comment on lines +1917 to +1925
    def test_old_not_equal_spelling_is_not_rewritten(self):
        source = BytesIO(
            b'from __future__ import barry_as_FLUFL\n'
            b'a <> b\n'
        )
        tokens = list(tokenize._tokenize.TokenizerIter(
            source.readline, encoding='utf-8', extra_tokens=True))
        self.assertIn('<>', [tok[1] for tok in tokens])
        self.assertNotIn('!=', [tok[1] for tok in tokens])

Member

This test passes on main and isn't related to this change; please remove it.

Comment thread Lib/test/test_tokenize.py
Comment on lines +1911 to +1912
tokens = list(tokenize._tokenize.TokenizerIter(
    source.readline, encoding='utf-8', extra_tokens=True))

Member

I'm not a fan of getting _tokenize this way, nor do I think this test stresses anything different from the first one. Let's keep just the first test and remove this one.

Comment thread Lib/test/test_tokenize.py
Comment on lines +1898 to +1900
matches = [tok.string for tok in tokens if tok.string == op]
self.assertEqual(len(matches), 2)
self.assertIs(matches[0], matches[1])

Member

Use sys._is_interned instead of relying on is.
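A sketch of what that check could look like outside the test class. sys._is_interned is only available on Python 3.13 and later, so the hypothetical helper below returns None on older versions; in the test itself this would presumably become an assertTrue on each flag.

```python
import io
import sys
import tokenize


def operator_interned_flags(source: str, op: str):
    """Return sys._is_interned() for each token string equal to op."""
    is_interned = getattr(sys, "_is_interned", None)
    if is_interned is None:
        return None  # sys._is_interned is only available on Python 3.13+
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [is_interned(tok.string) for tok in tokens if tok.string == op]


flags = operator_interned_flags("a == b\nc == d\n", "==")
print(flags)
```

Unlike an identity comparison between two tokens, this asserts the property the change is meant to provide for each token on its own.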
