gh-136757: Reuse static strings for exact operator tokens by omkar-334 · Pull Request #151838 · python/cpython

omkar-334 · 2026-06-21T11:08:42Z

This PR reduces repeated string allocations in the tokenizer by reusing static interned strings for exact multi-character operator tokens such as "==", "->", "//=", and "...".

Before -

"==" repeated 1,000 times -> 1,000 distinct "==" string objects

After -

"==" repeated 1,000 times -> 1 distinct "==" string object
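One way to observe this from Python is to keep every token alive and count object identities. The sketch below is an assumption about how such numbers could be reproduced, not the script used for this PR; the helper name count_distinct_token_strings is hypothetical, and the distinct count it reports depends on the interpreter build it runs on.

```python
import io
import tokenize


def count_distinct_token_strings(source, op):
    """Return (occurrences, distinct str objects) for one operator token."""
    # Keep the whole token list alive so that id() values cannot be reused
    # by the allocator while we are still counting.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    matches = [tok.string for tok in tokens
               if tok.type == tokenize.OP and tok.string == op]
    return len(matches), len({id(s) for s in matches})


total, distinct = count_distinct_token_strings("a == b\n" * 1000, "==")
print(f"{total} '==' tokens, {distinct} distinct str objects")
```

A build without operator-string reuse would be expected to report a distinct count close to the number of occurrences, while a build with reuse should report a much smaller one.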

Issue: Intern string representation of operators and some other symbolic literals #136757

Benchmarks

These are local synthetic measurements from pyperf --rigorous on a quieter machine. The public tokenize timing cases are the strongest timing evidence; one direct _tokenize.TokenizerIter streaming case still had a pyperf stability warning.

Workload:

2,000 repetitions of a source snippet containing all 24 candidate multi-character exact operators.
Compared public tokenize.tokenize() and private _tokenize.TokenizerIter.
Measured both streaming token consumption and retained token streams.
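The benchmark script itself is not included in this description. As a rough sketch of what such a workload might look like, the code below builds a snippet from every multi-character entry in token.EXACT_TOKEN_TYPES and consumes it in both streaming and retained form. Treating that set as "the candidate operators" is an assumption, and the helper names (build_source, count_tokens, retain_tokens) are hypothetical.

```python
import io
import token
import tokenize

# Assumption: the candidate operators are the multi-character entries of
# token.EXACT_TOKEN_TYPES; the PR's actual list may differ.
OPERATORS = sorted(op for op in token.EXACT_TOKEN_TYPES if len(op) > 1)


def build_source(repetitions):
    """Return bytes containing one 'a OP b' line per operator, repeated."""
    snippet = "".join(f"a {op} b\n" for op in OPERATORS)
    return (snippet * repetitions).encode("utf-8")


def count_tokens(source):
    """Streaming consumption: each token is dropped right after it is seen."""
    return sum(1 for _ in tokenize.tokenize(io.BytesIO(source).readline))


def retain_tokens(source):
    """Retained stream: every TokenInfo stays alive in the returned list."""
    return list(tokenize.tokenize(io.BytesIO(source).readline))


source = build_source(2000)
print(len(OPERATORS), "operators,", count_tokens(source), "tokens")
```

The streaming case mostly measures allocation and deallocation cost per token, while the retained case is the one where sharing a single string per operator can lower peak memory.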

Object reuse:

Repeated "==" tokens:
before: 1,000 distinct str objects
after:      1 distinct str object

Retained operator token strings:
before: 48,000 distinct str objects for 24 values
after:      24 distinct str objects for 24 values

Peak memory for retained token streams:

list(tokenize.tokenize):
before: 70.091 MiB
after:  68.113 MiB
change: -2.82%

list(_tokenize.TokenizerIter):
before: 65.054 MiB
after:  63.076 MiB
change: -3.04%
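The description does not say how peak memory was measured. A minimal sketch using the standard library's tracemalloc, which is an assumption and may not match the tool actually used, together with the arithmetic behind the reported percentages:

```python
import io
import tokenize
import tracemalloc


def peak_mib_for_retained_tokens(source):
    """Peak traced memory (MiB) while building list(tokenize.tokenize(...))."""
    tracemalloc.start()
    try:
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Figures reported above for the two retained-stream cases.
print(f"{percent_change(70.091, 68.113):.2f}%")  # -2.82%
print(f"{percent_change(65.054, 63.076):.2f}%")  # -3.04%
```

Both cases drop by the same absolute amount (1.978 MiB); the percentages differ only because the baselines differ.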

Timing:

tokenize.tokenize count:
51.0 ms +- 0.7 ms -> 48.5 ms +- 0.8 ms
1.05x faster

list(tokenize.tokenize):
91.1 ms +- 1.6 ms -> 88.6 ms +- 1.1 ms
1.03x faster

_tokenize.TokenizerIter count:
28.6 ms +- 0.6 ms -> 26.8 ms +- 0.7 ms
1.07x faster
note: this case had a pyperf stability warning

list(_tokenize.TokenizerIter):
51.8 ms +- 1.0 ms -> 50.7 ms +- 0.6 ms
1.02x faster

ast.parse control:
no significant difference
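For reference, the "Nx faster" figures above follow from dividing the mean time before by the mean time after. The sketch below only reproduces that arithmetic from the reported means; it does not rerun any benchmark.

```python
# (mean before, mean after) in milliseconds, copied from the results above.
TIMINGS_MS = {
    "tokenize.tokenize count": (51.0, 48.5),
    "list(tokenize.tokenize)": (91.1, 88.6),
    "_tokenize.TokenizerIter count": (28.6, 26.8),
    "list(_tokenize.TokenizerIter)": (51.8, 50.7),
}


def speedup(before_ms, after_ms):
    """Ratio of mean times; values above 1.0 mean the new build is faster."""
    return before_ms / after_ms


for name, (before_ms, after_ms) in TIMINGS_MS.items():
    print(f"{name}: {speedup(before_ms, after_ms):.2f}x faster")
```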

omkar-334 · 2026-06-21T11:09:29Z

cc @ZeroIntensity for review

StanFromIreland · 2026-06-21T19:07:29Z

Thanks for benchmarking, could you please also share the script you used?

> 2,000 repetitions of a source snippet containing all 24 candidate multi-character exact operators.

So that means that 1.05x is the upper bound for a synthetic corpus, but not what anyone tokenizing real code would see. Can you please also benchmark a realistic corpus (e.g., a file from the stdlib)?

> list(_tokenize.TokenizerIter):
> 51.8 ms +- 1.0 ms -> 50.7 ms +- 0.6 ms
> 1.02x faster

This is noise: the delta is below the combined error, and the ranges overlap.

The change is small, but so is the win. I'm -0 on this change, but I won't object if someone else thinks it's worth it.
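The noise argument can be checked with a few lines of arithmetic. This is a rough heuristic sketch, not pyperf's own significance test; combining the two errors in quadrature is an assumption, and the helper name noise_check is hypothetical.

```python
import math


def noise_check(before, after):
    """Compare two (mean, error) pairs given in the same unit.

    Returns (delta, combined_error, ranges_overlap).
    """
    b_mean, b_err = before
    a_mean, a_err = after
    delta = abs(b_mean - a_mean)
    # Assumption: independent errors, combined in quadrature.
    combined = math.hypot(b_err, a_err)
    # The intervals mean +- error overlap when the means are no further
    # apart than the sum of the two errors.
    overlap = delta <= b_err + a_err
    return delta, combined, overlap


# list(_tokenize.TokenizerIter): 51.8 ms +- 1.0 ms -> 50.7 ms +- 0.6 ms
delta, combined, overlap = noise_check((51.8, 1.0), (50.7, 0.6))
print(f"delta={delta:.2f} ms, combined error={combined:.2f} ms, overlap={overlap}")
```

For this case the delta is about 1.1 ms against a combined error of about 1.17 ms, and the ranges 50.8 to 52.8 ms and 50.1 to 51.3 ms overlap.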

ZeroIntensity · 2026-06-21T22:31:37Z

+    def test_old_not_equal_spelling_is_not_rewritten(self):
+        source = BytesIO(
+            b'from __future__ import barry_as_FLUFL\n'
+            b'a <> b\n'
+        )
+        tokens = list(tokenize._tokenize.TokenizerIter(
+            source.readline, encoding='utf-8', extra_tokens=True))
+        self.assertIn('<>', [tok[1] for tok in tokens])
+        self.assertNotIn('!=', [tok[1] for tok in tokens])


This test passes on main and isn't related to this change; please remove it.

ZeroIntensity · 2026-06-21T22:34:30Z

+                tokens = list(tokenize._tokenize.TokenizerIter(
+                    source.readline, encoding='utf-8', extra_tokens=True))


I'm not a fan of getting _tokenize this way, nor do I think this test stresses anything different from the first one. Let's keep just the first test and remove this one.

ZeroIntensity · 2026-06-21T22:36:26Z

+                matches = [tok.string for tok in tokens if tok.string == op]
+                self.assertEqual(len(matches), 2)
+                self.assertIs(matches[0], matches[1])


Use sys._is_interned instead of relying on the is operator.
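A minimal sketch of what such a check might look like outside the test suite. sys._is_interned is a private CPython helper that only exists on newer versions (3.13 and later), so the sketch falls back to returning None when it is missing; the function name operator_tokens_interned is hypothetical, and this is not the test that ended up in the PR.

```python
import io
import sys
import tokenize


def operator_tokens_interned(source, op):
    """Return True/False if every `op` token string is interned, else None.

    None means sys._is_interned is unavailable on this interpreter.
    """
    is_interned = getattr(sys, "_is_interned", None)
    if is_interned is None:
        return None
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    matches = [tok.string for tok in tokens
               if tok.type == tokenize.OP and tok.string == op]
    return all(is_interned(s) for s in matches)


print(operator_tokens_interned("a == b\nc == d\n", "=="))
```

Unlike an identity comparison between two tokens, this asks the interpreter directly whether each string is interned, so it does not depend on two occurrences happening to share an object.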

Reuse static strings for exact operator tokens

8be2a0c

omkar-334 requested review from lysnikolaou and pablogsal as code owners June 21, 2026 11:08

bedevere-app Bot added the awaiting review label Jun 21, 2026

bedevere-app Bot mentioned this pull request Jun 21, 2026

Intern string representation of operators and some other symbolic literals #136757

Open

omkar-334 added 2 commits June 21, 2026 16:40

add news label

7d232f8

Merge branch 'main' into pythongh-136757-intern-operator-strings

4a65219

StanFromIreland requested a review from ZeroIntensity June 21, 2026 11:42

ZeroIntensity reviewed Jun 21, 2026

omkar-334 wants to merge 3 commits into python:main from omkar-334:gh-136757-intern-operator-strings
