Skip to content

THRIFT-6069: python: avoid utf8 copy during fastbinary string encode#3595

Open
markjm wants to merge 2 commits into
apache:masterfrom
markjm:apache-port-fastbinary-zerocopy-string
Open

THRIFT-6069: python: avoid utf8 copy during fastbinary string encode#3595
markjm wants to merge 2 commits into
apache:masterfrom
markjm:apache-port-fastbinary-zerocopy-string

Conversation

@markjm

@markjm markjm commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Hi - I figured I'd share a few perf optimizations we are using internally. We are still on an older thrift version (😢 ), so I did use some agent magic to port these to the head of this repo. My testing was primarily on my branch - in this case, for example, we only care about p3.12+, so i added some compat for when they added PyUnicode_AsUTF8AndSize in py3.3 (impressed the repo still suports py2!)

This one is 1 of 3 PRs

⚠️ I did use AI tools to investigate and address feedback, but I am a real human ready to collaborate 😄


Use PyUnicode_AsUTF8AndSize when available, with a fallback for older Python 3 releases, so UTF-8 strings can be encoded without allocating an intermediate PyBytes object.

Performance (50k iterations, warmed)

Workload Baseline This commit Speedup
encode simple (30B, 1 string) 0.60 us 0.55 us 1.09x

The more string fields a struct has, the larger the gain. Decode is unchanged.

Use PyUnicode_AsUTF8AndSize when available, with a fallback for older Python 3 releases, so UTF-8 strings can be encoded without allocating an intermediate PyBytes object.

Performance (50k iterations, warmed)

| Workload | Baseline | This commit | Speedup |
|----------|----------|-------------|---------|
| encode simple (30B, 1 string) | 0.60 us | 0.55 us | 1.09x |
| encode 10-string (182B) | 1.44 us | 1.01 us | 1.43x |
| encode complex (395B) | 3.02 us | 2.56 us | 1.18x |

The more string fields a struct has, the larger the gain. Decode is unchanged.
@markjm markjm requested a review from mhlakhani as a code owner June 13, 2026 07:16
Copilot AI review requested due to automatic review settings June 13, 2026 07:16
@mergeable mergeable Bot added the python label Jun 13, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR strengthens Thrift’s Python fastbinary path by improving UTF-8 string encoding in the C extension and adding a regression test that exercises accelerated encoding/decoding for TApplicationException.

Changes:

  • Add a unit test covering accelerated UTF-8 roundtrip for TApplicationException when fastbinary is available.
  • Update the C extension to encode Python str values via PyUnicode_AsUTF8AndSize (avoids intermediate bytes object creation) while preserving 32-bit length checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
lib/py/test/thrift_TBinaryProtocol.py Adds accelerated-protocol test coverage and related fixtures for TApplicationException.
lib/py/src/ext/protocol.tcc Optimizes T_STRING encoding for Python str by writing UTF-8 bytes directly with length validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +25 to +27
from thrift.Thrift import TApplicationException
from thrift.protocol.TBinaryProtocol import TBinaryProtocol
from thrift.protocol.TBinaryProtocol import TBinaryProtocolAcceleratedFactory
Comment on lines +172 to +179
APPLICATION_EXCEPTION_TYPEARGS = [
TApplicationException,
(
None,
(1, 11, "message", "UTF8", None),
(2, 8, "type", None, None),
),
]

otrans = TTransport.TMemoryBuffer()
oproto = TBinaryProtocolAcceleratedFactory(fallback=False).getProtocol(otrans)
oproto.trans.write(oproto._fast_encode(original, APPLICATION_EXCEPTION_TYPEARGS))

itrans = TTransport.TMemoryBuffer(otrans.getvalue())
iproto = TBinaryProtocolAcceleratedFactory(fallback=False).getProtocol(itrans)
decoded = iproto._fast_decode(None, iproto, APPLICATION_EXCEPTION_TYPEARGS)
Comment thread lib/py/src/ext/protocol.tcc Outdated
Comment on lines +471 to +472
impl()->writeI32(static_cast<int32_t>(len));
return writeBuffer(const_cast<char*>(str), static_cast<size_t>(len));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks bot

Change the internal fastbinary writeBuffer helper to accept const input buffers so the zero-copy UTF-8 encode path does not need to const_cast CPython-managed string storage.
@markjm markjm changed the title python: avoid utf8 copy during fastbinary string encode [THRIFT-6069] python: avoid utf8 copy during fastbinary string encode Jun 13, 2026
@Jens-G Jens-G changed the title [THRIFT-6069] python: avoid utf8 copy during fastbinary string encode THRIFT-6069: python: avoid utf8 copy during fastbinary string encode Jun 13, 2026
@Jens-G

Jens-G commented Jun 13, 2026

Copy link
Copy Markdown
Member

Code review

Found 2 issues:

  1. The fast path calls impl()->writeI32(len) to write the string length, which is correct for BinaryProtocol but wrong for CompactProtocol. CompactProtocol::writeString encodes the length with writeVarint(len) (plain unsigned varint), while writeI32 encodes with writeVarint(toZigZag(val)) (zigzag). The fast path will produce a malformed Compact frame for any unicode string field, and a decoder will read the wrong number of bytes.

}
impl()->writeI32(static_cast<int32_t>(len));
return writeBuffer(str, static_cast<size_t>(len));
#else

  1. AI tool use is acknowledged in the PR body but neither commit contains the required Co-Authored-By: or Generated-by: label (AGENTS.md §4 says "Always label AI-assisted commits and PRs … Apply this label even when AI only generated a portion of the change").

thrift/AGENTS.md

Lines 57 to 71 in 35c1a53

## 4. AI-Generated Contributions
Per [`CONTRIBUTING.md § AI generated content`](CONTRIBUTING.md#ai-generated-content) and the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html):
- **Always** label AI-assisted commits and PRs. Use one or both of:
```
Co-Authored-By: <AI tool name and version>
Generated-by: <AI tool name and version>
```
Example:
```
THRIFT-9999: Fix connection timeout handling in Go client
Client: go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants