New Parquet-Variant encoding#7130

Open
AdamGS wants to merge 13 commits into develop from adamg/parquet-variant
Conversation

@AdamGS
Contributor

@AdamGS AdamGS commented Mar 23, 2026

Summary

This PR introduces a new encoding to support Arrow's canonical Parquet Variant extension type, but does not integrate it with anything else.

It does include basic pieces like slice/take/filter, and it can (or at least should) roundtrip with the equivalent arrow array.

Testing

The encoding includes a bunch of basic tests, both for making sure it roundtrips with arrow and for the various nullability cases.
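To illustrate what slice/take/filter mean for a nullable column, here is a minimal sketch. It assumes a plain `Vec<Option<i64>>` as a stand-in for the array; the `Column` alias and the three function names are hypothetical and are not the API added in this PR.

```rust
/// Hypothetical stand-in for a nullable column; not the PR's array type.
pub type Column = Vec<Option<i64>>;

/// Keep the rows in the half-open range `start..end`.
pub fn slice(col: &[Option<i64>], start: usize, end: usize) -> Column {
    col[start..end].to_vec()
}

/// Gather rows by index; a `None` index produces a null row.
pub fn take(col: &[Option<i64>], indices: &[Option<usize>]) -> Column {
    indices.iter().map(|i| i.and_then(|i| col[i])).collect()
}

/// Keep only the rows whose mask entry is `true`.
pub fn filter(col: &[Option<i64>], mask: &[bool]) -> Column {
    assert_eq!(col.len(), mask.len(), "mask must cover every row");
    col.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

In all three operations a null row stays null, which is the kind of nullability case the tests are described as covering.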

@AdamGS AdamGS added the changelog/feature A new feature label Mar 23, 2026
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 62280a5 to f266c90 Compare March 23, 2026 16:21
@codspeed-hq

codspeed-hq bot commented Mar 23, 2026

Merging this PR will not alter performance

✅ 1106 untouched benchmarks
⏩ 1522 skipped benchmarks [1]


Comparing adamg/parquet-variant (338af8a) with develop (2383946)


Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y
Contributor

a10y commented Mar 23, 2026

It looks like this supports one typed_value per column. How would you represent something like this example, which shreds out multiple nested columns from an object?

image

Physically I think this differs from shredding out a struct column with two fields.

@AdamGS
Contributor Author

AdamGS commented Mar 24, 2026

I think in that case it is a struct, because they are named

@AdamGS
Contributor Author

AdamGS commented Mar 24, 2026

typed_value will be a struct (I think that's "object" in variant terms), though not directly a struct of the event_type and event_ts types. Its fields are themselves ParquetVariant arrays, roughly:

Struct {
    "event_ts": Variant {
        ..,
        typed_value: Timestamp
    },
    "event_type": Variant {
        ..,
        typed_value: String
    },
}
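The nesting described above can be modelled in a few lines of Rust. This is only a sketch: `VariantArray`, `TypedValue`, `shredded_paths` and `event_example` are hypothetical names for illustration, the fields elided as `..` above are left out, and the sample values are made up. It is not the layout implemented in this PR.

```rust
/// What a shredded `typed_value` can hold in this sketch (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum TypedValue {
    /// Typed timestamps (microseconds since the epoch), one slot per row.
    Timestamp(Vec<Option<i64>>),
    /// Typed strings, one slot per row.
    Utf8(Vec<Option<String>>),
    /// A shredded object: every named field is itself a variant array.
    Object(Vec<(String, VariantArray)>),
}

/// Hypothetical variant array; the fields elided as `..` above are omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct VariantArray {
    pub len: usize,
    pub typed_value: Option<TypedValue>,
}

impl VariantArray {
    /// Dotted paths of every typed leaf reachable through shredded objects.
    pub fn shredded_paths(&self) -> Vec<String> {
        let mut out = Vec::new();
        self.collect_paths("", &mut out);
        out
    }

    fn collect_paths(&self, prefix: &str, out: &mut Vec<String>) {
        match &self.typed_value {
            // Recurse into each named field of a shredded object.
            Some(TypedValue::Object(fields)) => {
                for (name, child) in fields {
                    let path = if prefix.is_empty() {
                        name.clone()
                    } else {
                        format!("{prefix}.{name}")
                    };
                    child.collect_paths(&path, out);
                }
            }
            // A typed leaf: record the path that led here.
            Some(_) => out.push(prefix.to_string()),
            // Nothing shredded at this level.
            None => {}
        }
    }
}

/// Builds the two-field example from the comment above with made-up rows.
pub fn event_example() -> VariantArray {
    let event_ts = VariantArray {
        len: 2,
        typed_value: Some(TypedValue::Timestamp(vec![Some(1_700_000_000_000_000), None])),
    };
    let event_type = VariantArray {
        len: 2,
        typed_value: Some(TypedValue::Utf8(vec![
            Some("login".to_string()),
            Some("logout".to_string()),
        ])),
    };
    VariantArray {
        len: 2,
        typed_value: Some(TypedValue::Object(vec![
            ("event_ts".to_string(), event_ts),
            ("event_type".to_string(), event_type),
        ])),
    }
}
```

Calling `event_example().shredded_paths()` walks the outer object and returns the two leaf paths, `event_ts` and `event_type`. Each leaf is reached through its own variant wrapper rather than being a direct field of the outer struct.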

@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 3 times, most recently from 3d7ecb4 to ad93dcf Compare March 24, 2026 13:28
@AdamGS AdamGS requested review from a10y, connortsui20 and gatesn and removed request for a10y, connortsui20 and gatesn March 24, 2026 15:33
AdamGS added a commit that referenced this pull request Mar 24, 2026
## Summary

Fills in the missing parts of making Variant a canonical array. It is
still not fully supported, but this is a step towards it.

These changes started as part of
#7130, but I figured they are
just noise in that already oversized PR.

## API Changes

Makes variant officially a canonical array and type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
arrow-buffer = { workspace = true }
arrow-schema = { workspace = true }
chrono = { workspace = true }
parquet-variant = { workspace = true }
Contributor

❤️

@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 6e32d0f to fbf4d4e Compare March 25, 2026 12:36
"vortex-jni",
"vortex-python",
"vortex-tui",
"vortex-sqllogictest",
Contributor Author

just moving things around

@AdamGS AdamGS marked this pull request as ready for review March 25, 2026 14:18
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from fbf4d4e to 4848c82 Compare March 25, 2026 14:18
@AdamGS AdamGS requested review from a10y, gatesn and robert3005 March 25, 2026 14:19
AdamGS added 12 commits March 25, 2026 16:04
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from 510a8c8 to e5a799e Compare March 25, 2026 16:06
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from e5a799e to 338af8a Compare March 25, 2026 16:07
Labels

changelog/feature A new feature

4 participants