gh-136757: Reuse static strings for exact operator tokens #151838
omkar-334 wants to merge 3 commits into
cc @ZeroIntensity for review
Thanks for benchmarking, could you please also share the script you used?
So that means that 1.05x is the upper bound for a synthetic corpus, but not what anyone tokenizing real code would see. Can you please also benchmark a realistic corpus (e.g., a file from the stdlib)?
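A realistic-corpus run of the kind requested here could look roughly like the sketch below. It is not the script used in this PR: `bench_tokenize` is a hypothetical helper, the choice of `tokenize.py` as the corpus is an assumption, and it uses the stdlib `timeit` instead of `pyperf` so that it runs without extra dependencies.

```python
import io
import statistics
import timeit
import tokenize


def bench_tokenize(path, repeat=5, number=3):
    """Return (mean, stdev) in seconds per tokenize.tokenize() pass over `path`."""
    with open(path, "rb") as f:
        data = f.read()

    def run():
        # Consume the whole token stream, as a real caller would.
        for _ in tokenize.tokenize(io.BytesIO(data).readline):
            pass

    # timeit.repeat returns total seconds per batch; divide to get per-pass time.
    timings = [t / number for t in timeit.repeat(run, repeat=repeat, number=number)]
    return statistics.mean(timings), statistics.stdev(timings)


if __name__ == "__main__":
    # Assumption: tokenize.py itself serves as the "real code" corpus.
    mean, stdev = bench_tokenize(tokenize.__file__)
    print(f"{mean * 1e3:.2f} ms +- {stdev * 1e3:.2f} ms per pass")
```

Running the same script on `main` and on the PR branch gives two mean and deviation pairs to compare. For publishable numbers, `pyperf` remains the better tool because it controls for warmup and process-level variance.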
This is noise: the delta is below the combined error, and the ranges overlap. The change is small, but so is the win. I'm -0 on this change, but I won't object if someone else thinks it's worth it.
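The noise criterion can be written down as a small check. This is a sketch with hypothetical numbers, not the figures from the PR's benchmark output, and it assumes "combined error" means the plain sum of the two reported deviations.

```python
def is_within_noise(mean_a, err_a, mean_b, err_b):
    """True if |mean_a - mean_b| does not exceed err_a + err_b.

    For symmetric ranges mean +- err, this is equivalent to the two
    ranges overlapping.
    """
    return abs(mean_a - mean_b) <= err_a + err_b


# Hypothetical timings in milliseconds, for illustration only.
print(is_within_noise(10.2, 0.3, 10.0, 0.2))  # True: delta 0.2 <= 0.5
print(is_within_noise(10.0, 0.1, 9.5, 0.1))   # False: delta 0.5 > 0.2
```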
    def test_old_not_equal_spelling_is_not_rewritten(self):
        source = BytesIO(
            b'from __future__ import barry_as_FLUFL\n'
            b'a <> b\n'
        )
        tokens = list(tokenize._tokenize.TokenizerIter(
            source.readline, encoding='utf-8', extra_tokens=True))
        self.assertIn('<>', [tok[1] for tok in tokens])
        self.assertNotIn('!=', [tok[1] for tok in tokens])
This test passes on main and isn't related to this change; please remove it.
        tokens = list(tokenize._tokenize.TokenizerIter(
            source.readline, encoding='utf-8', extra_tokens=True))
I'm not a fan of getting `_tokenize` this way, nor do I think this test stresses anything different from the first one. Let's keep just the first test and remove this one.
        matches = [tok.string for tok in tokens if tok.string == op]
        self.assertEqual(len(matches), 2)
        self.assertIs(matches[0], matches[1])
Use `sys._is_interned` instead of relying on `is`.
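A sketch of what that check might look like outside the test class follows. `operator_strings` is a hypothetical helper, not code from the PR, and `sys._is_interned` is a private CPython helper that, as far as I know, only exists on 3.13 and later, so the call is guarded. An `is` comparison only shows that two tokens share one object; it does not show that the object is interned.

```python
import sys
import tokenize
from io import BytesIO


def operator_strings(source: bytes, op: str):
    """Collect the string objects of every token spelled exactly `op`."""
    tokens = tokenize.tokenize(BytesIO(source).readline)
    return [tok.string for tok in tokens if tok.string == op]


matches = operator_strings(b"a == b\nc == d\n", "==")
assert len(matches) == 2

# Private API: guard the lookup so the sketch also runs on older versions.
is_interned = getattr(sys, "_is_interned", None)
if is_interned is not None:
    # The result depends on whether the tokenizer change is applied.
    print([is_interned(s) for s in matches])
```

In the test itself this would presumably become `self.assertTrue(sys._is_interned(matches[0]))`.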
This PR reduces repeated string allocations in the tokenizer by reusing static interned strings for exact multi-character operator tokens such as `==`, `->`, `//=`, and `...`.
Before -
After -
Benchmarks
These are local synthetic measurements from `pyperf --rigorous` on a quieter machine. The public `tokenize` timing cases are the strongest timing evidence; one direct `_tokenize.TokenizerIter` streaming case still had a pyperf stability warning.
Workload: `tokenize.tokenize()` and private `_tokenize.TokenizerIter`.
Object reuse:
Peak memory for retained token streams:
Timing:
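The object-reuse and peak-memory figures for a retained token stream could be gathered along these lines. This is a minimal sketch, not the measurement code behind the numbers in this PR: `measure_retained_tokens` and the synthetic `a == b` corpus are assumptions made for illustration, and only the public `tokenize.tokenize()` path is exercised.

```python
import io
import tokenize
import tracemalloc


def measure_retained_tokens(source: bytes, op: str):
    """Tokenize `source`, keep every token alive, and report reuse and peak memory."""
    tracemalloc.start()
    # Retaining the full list keeps every token string alive at once.
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    op_strings = [tok.string for tok in tokens if tok.string == op]
    # ids are comparable here because all the strings are still referenced.
    distinct = len({id(s) for s in op_strings})
    return len(op_strings), distinct, peak


source = b"a == b\n" * 1000
count, distinct, peak = measure_retained_tokens(source, "==")
print(f"{count} '==' tokens, {distinct} distinct string objects, peak {peak} bytes")
```

If operator strings are shared, `distinct` should drop toward 1 while `count` stays the same, and the peak should shrink accordingly.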