GH-50312: [Python] Fix UUID extension type round-trip to pandas returning bytes #50325
Open
parker-cassar wants to merge 2 commits into
Hey @parker-cassar, thanks so much for jumping on this so quickly. Just out of curiosity: since this delegates to to_pylist(), do you think this pattern might end up being useful for other Arrow extension types down the line?
Rationale for this change
Converting a Table with an arrow.uuid extension column to pandas currently produces a column of bytes instead of uuid.UUID objects. This happens because UuidType does not implement to_pandas_dtype(), so Table.to_pandas() falls back to the storage type (fixed_size_binary(16)) and produces bytes. The bug occurs even without a Parquet round trip.

Note: the original issue suggested this might be specific to Python 3.14, but I tested on Python 3.10 through 3.14 and still hit the issue, since UuidType has never implemented to_pandas_dtype().

What changes are included in this PR?
Added UuidType.to_pandas_dtype(): returns a dtype wrapper implementing __from_arrow__, which delegates to to_pylist() since UuidScalar.as_py() already produces uuid.UUID objects.

Are these changes tested?
Yes. Added test_uuid_roundtrip, which covers pandas DataFrame with a UUID column -> pyarrow Table -> Parquet on disk -> pyarrow Table -> pandas DataFrame. The final conversion is what this PR fixes.

Are there any user-facing changes?
Yes. Table.to_pandas() now returns uuid.UUID for arrow.uuid columns instead of bytes.
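
For readers skimming the description, here is a minimal, stdlib-only sketch of the delegation idea described under "What changes are included in this PR?". It is not the code in this PR and does not use pyarrow or pandas: FakeUuidArray and UuidDtypeSketch are hypothetical stand-ins, and the real __from_arrow__ hook would presumably return a pandas-compatible array rather than a plain list. The sketch only shows why delegating to to_pylist() yields uuid.UUID objects where the raw 16-byte storage would yield bytes.

```python
import uuid


class FakeUuidArray:
    """Hypothetical stand-in for an Arrow array of the UUID extension type.

    The storage is a list of 16-byte values (or None for nulls), mirroring
    the fixed_size_binary(16) storage mentioned in the description.
    """

    def __init__(self, storage):
        self._storage = list(storage)

    def to_pylist(self):
        # Stand-in for the per-scalar conversion: each 16-byte value becomes
        # a uuid.UUID, and nulls stay None.
        return [None if b is None else uuid.UUID(bytes=b) for b in self._storage]


class UuidDtypeSketch:
    """Hypothetical dtype wrapper exposing the __from_arrow__ hook."""

    def __from_arrow__(self, array):
        # Delegate to to_pylist() so values arrive as uuid.UUID, not bytes.
        # Assumption: a plain list is returned here for simplicity.
        return array.to_pylist()


ids = [uuid.UUID(int=1), None, uuid.UUID(int=2)]
storage = [None if u is None else u.bytes for u in ids]

result = UuidDtypeSketch().__from_arrow__(FakeUuidArray(storage))
print(result)
```

Without the wrapper, a consumer reading the storage directly would see the 16-byte values in storage, which corresponds to the bytes column reported in the issue.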