THRIFT-6069: python: avoid utf8 copy during fastbinary string encode by markjm · Pull Request #3595 · apache/thrift

markjm · 2026-06-13T07:16:14Z

Hi - I figured I'd share a few perf optimizations we are using internally. We are still on an older thrift version (😢 ), so I did use some agent magic to port these to the head of this repo. My testing was primarily on my branch - in this case, for example, we only care about p3.12+, so i added some compat for when they added PyUnicode_AsUTF8AndSize in py3.3 (impressed the repo still suports py2!)

This one is 1 of 3 PRs

⚠️ I did use AI tools to investigate and address feedback, but I am a real human ready to collaborate 😄

Use PyUnicode_AsUTF8AndSize when available, with a fallback for older Python 3 releases, so UTF-8 strings can be encoded without allocating an intermediate PyBytes object.

Performance (50k iterations, warmed)

Workload	Baseline	This commit	Speedup
encode simple (30B, 1 string)	0.60 us	0.55 us	1.09x

The more string fields a struct has, the larger the gain. Decode is unchanged.

Use PyUnicode_AsUTF8AndSize when available, with a fallback for older Python 3 releases, so UTF-8 strings can be encoded without allocating an intermediate PyBytes object. Performance (50k iterations, warmed) | Workload | Baseline | This commit | Speedup | |----------|----------|-------------|---------| | encode simple (30B, 1 string) | 0.60 us | 0.55 us | 1.09x | | encode 10-string (182B) | 1.44 us | 1.01 us | 1.43x | | encode complex (395B) | 3.02 us | 2.56 us | 1.18x | The more string fields a struct has, the larger the gain. Decode is unchanged.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR strengthens Thrift’s Python fastbinary path by improving UTF-8 string encoding in the C extension and adding a regression test that exercises accelerated encoding/decoding for TApplicationException.

Changes:

Add a unit test covering accelerated UTF-8 roundtrip for TApplicationException when fastbinary is available.
Update the C extension to encode Python str values via PyUnicode_AsUTF8AndSize (avoids intermediate bytes object creation) while preserving 32-bit length checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
lib/py/test/thrift_TBinaryProtocol.py	Adds accelerated-protocol test coverage and related fixtures for `TApplicationException`.
lib/py/src/ext/protocol.tcc	Optimizes `T_STRING` encoding for Python `str` by writing UTF-8 bytes directly with length validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+from thrift.Thrift import TApplicationException
 from thrift.protocol.TBinaryProtocol import TBinaryProtocol
+from thrift.protocol.TBinaryProtocol import TBinaryProtocolAcceleratedFactory


+APPLICATION_EXCEPTION_TYPEARGS = [
+    TApplicationException,
+    (
+        None,
+        (1, 11, "message", "UTF8", None),
+        (2, 8, "type", None, None),
+    ),
+]


+
+        otrans = TTransport.TMemoryBuffer()
+        oproto = TBinaryProtocolAcceleratedFactory(fallback=False).getProtocol(otrans)
+        oproto.trans.write(oproto._fast_encode(original, APPLICATION_EXCEPTION_TYPEARGS))


+
+        itrans = TTransport.TMemoryBuffer(otrans.getvalue())
+        iproto = TBinaryProtocolAcceleratedFactory(fallback=False).getProtocol(itrans)
+        decoded = iproto._fast_decode(None, iproto, APPLICATION_EXCEPTION_TYPEARGS)


markjm · 2026-06-13T07:23:05Z

+      impl()->writeI32(static_cast<int32_t>(len));
+      return writeBuffer(const_cast<char*>(str), static_cast<size_t>(len));


Change the internal fastbinary writeBuffer helper to accept const input buffers so the zero-copy UTF-8 encode path does not need to const_cast CPython-managed string storage.

Jens-G · 2026-06-13T09:29:42Z

Code review

Found 2 issues:

The fast path calls impl()->writeI32(len) to write the string length, which is correct for BinaryProtocol but wrong for CompactProtocol. CompactProtocol::writeString encodes the length with writeVarint(len) (plain unsigned varint), while writeI32 encodes with writeVarint(toZigZag(val)) (zigzag). The fast path will produce a malformed Compact frame for any unicode string field, and a decoder will read the wrong number of bytes.

thrift/lib/py/src/ext/protocol.tcc

Lines 469 to 473 in 0691570

    
                 } 
        
                 impl()->writeI32(static_cast<int32_t>(len)); 
        
                 return writeBuffer(str, static_cast<size_t>(len)); 
        
           #else

AI tool use is acknowledged in the PR body but neither commit contains the required Co-Authored-By: or Generated-by: label (AGENTS.md §4 says "Always label AI-assisted commits and PRs … Apply this label even when AI only generated a portion of the change").

thrift/AGENTS.md

Lines 57 to 71 in 35c1a53

    
           ## 4. AI-Generated Contributions 
        
           Per [`CONTRIBUTING.md § AI generated content`](CONTRIBUTING.md#ai-generated-content) and the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html): 
        
           - **Always** label AI-assisted commits and PRs. Use one or both of: 
        
             ``` 
        
             Co-Authored-By: <AI tool name and version> 
        
             Generated-by: <AI tool name and version> 
        
             ``` 
        
             Example: 
        
             ``` 
        
             THRIFT-9999: Fix connection timeout handling in Go client 
        
             Client: go 
        
             Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

markjm requested a review from mhlakhani as a code owner June 13, 2026 07:16

Copilot AI review requested due to automatic review settings June 13, 2026 07:16

mergeable Bot added the python label Jun 13, 2026

Copilot AI reviewed Jun 13, 2026

View reviewed changes

python: make fastbinary writeBuffer const-correct

0691570

Change the internal fastbinary writeBuffer helper to accept const input buffers so the zero-copy UTF-8 encode path does not need to const_cast CPython-managed string storage.

markjm changed the title ~~python: avoid utf8 copy during fastbinary string encode~~ [THRIFT-6069] python: avoid utf8 copy during fastbinary string encode Jun 13, 2026

Jens-G changed the title ~~[THRIFT-6069] python: avoid utf8 copy during fastbinary string encode~~ THRIFT-6069: python: avoid utf8 copy during fastbinary string encode Jun 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

THRIFT-6069: python: avoid utf8 copy during fastbinary string encode#3595

THRIFT-6069: python: avoid utf8 copy during fastbinary string encode#3595
markjm wants to merge 2 commits into
apache:masterfrom
markjm:apache-port-fastbinary-zerocopy-string

markjm commented Jun 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

markjm Jun 13, 2026

Uh oh!

Jens-G commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		impl()->writeI32(static_cast<int32_t>(len));
		return writeBuffer(const_cast<char*>(str), static_cast<size_t>(len));

Conversation

markjm commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

markjm Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Jens-G commented Jun 13, 2026

Code review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

markjm commented Jun 13, 2026 •

edited

Loading