
feat: sequence cast compute #8403

Open
siddarth2810 wants to merge 4 commits into vortex-data:develop from siddarth2810:feat/sequence-cast-compute

@siddarth2810 siddarth2810 commented Jun 13, 2026

Summary

This PR fixes SequenceArray casts where internal arithmetic needs to use a different type than the declared output dtype.

Previously, SequenceArray effectively used the same primitive type for both internal arithmetic and for exposing values. That broke cases like a negative-step signed sequence cast to an unsigned dtype, where every generated value still fits the unsigned output type.

Sequence operations now compute in an internal arithmetic type, while materialized arrays and scalars are exposed using the array's output dtype.

Added regression coverage for casts where scalar_at must return a scalar matching the array output dtype rather than the internal arithmetic type.
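
To make the failing case concrete, here is a minimal sketch; it is not the actual vortex-sequence code. It assumes values are generated as `base + i * multiplier`, and `materialize_as_u8` is a hypothetical helper name. With base 10 and multiplier -1, the multiplier itself is not representable as u8, yet every generated value is, so the arithmetic runs in i64 and only the results are converted to the output type.

```rust
/// Hypothetical helper (not part of vortex): materialize `base + i * multiplier`
/// for `i in 0..len`, computing in i64 and converting each value to u8.
/// Returns `None` if the arithmetic overflows i64 or a value does not fit in u8.
fn materialize_as_u8(base: i64, multiplier: i64, len: usize) -> Option<Vec<u8>> {
    (0..len)
        .map(|i| {
            // The index is widened into the arithmetic type first.
            let i = i64::try_from(i).ok()?;
            // Checked arithmetic in the internal (signed) type.
            let value = multiplier.checked_mul(i)?.checked_add(base)?;
            // Only the final value has to fit the declared output type.
            u8::try_from(value).ok()
        })
        .collect()
}
```

For example, `materialize_as_u8(10, -1, 5)` yields `Some(vec![10, 9, 8, 7, 6])`, while `materialize_as_u8(2, -1, 5)` yields `None` because the fourth value would be -1.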

Closes: #5102

API Changes

None

Review Feedback

Based on review feedback, this PR avoids storing an extra calculation_ptype in SequenceMetadata.

Instead, it determines a normalized arithmetic type from the sequence values:
i64 when either base or multiplier is signed, otherwise u64.
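
The selection rule above could be sketched roughly as follows. `IntValue` and `ArithType` are hypothetical stand-ins for illustration only (they are not the crate's `PValue` or `PType`), and the sketch models integer values only.

```rust
/// Hypothetical stand-in for an integer primitive value.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntValue {
    Signed(i64),
    Unsigned(u64),
}

/// Hypothetical tag for the normalized arithmetic type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArithType {
    I64,
    U64,
}

/// Pick i64 when either operand is signed, otherwise u64.
fn arithmetic_type(base: IntValue, multiplier: IntValue) -> ArithType {
    match (base, multiplier) {
        (IntValue::Unsigned(_), IntValue::Unsigned(_)) => ArithType::U64,
        _ => ArithType::I64,
    }
}
```

Because the type is derived from the two values, nothing extra needs to be carried in the metadata for this decision.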

Testing

For testing code changes

cargo nextest run -p vortex-sequence
cargo +nightly fmt --all
cargo clippy --all-targets --all-features

For testing the build

cargo build --workspace

AI tools Disclosure

Used ChatGPT for understanding the code and OpenCode for updating callers and generating tests.

SequenceArray previously used the same ptype for arithmetic and for the declared
output dtype, which made the model too narrow for casts.

Store calculation_ptype separately from the output dtype, preserve it through
metadata, and validate that generated values fit the declared output type.

Update decompression, filter, take, slice, scalar access, and min/max paths
to compute in calculation_ptype while emitting values using the array output
ptype.

Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
Sequence::try_new now accepts both the calculation ptype and output ptype.

Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
@siddarth2810 siddarth2810 requested a review from a team June 13, 2026 18:06
@connortsui20 connortsui20 self-requested a review June 13, 2026 23:47
@connortsui20 connortsui20 added the changelog/fix A bug fix label Jun 13, 2026
@connortsui20

Member

Don't worry about the rustsec issue, that is a known problem

codspeed-hq Bot commented Jun 13, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 4 improved benchmarks
❌ 2 regressed benchmarks
✅ 1579 untouched benchmarks
⏩ 4 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | sequence_decompress_u32 | 206.1 µs | 498.8 µs | -58.68% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 168.9 µs | 205.6 µs | -17.84% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 26.7 µs | 16.1 µs | +65.2% |
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 259 µs | 223.8 µs | +15.7% |
| Simulation | chunked_varbinview_into_canonical[(100, 100)] | 305.7 µs | 270.6 µs | +12.99% |
| Simulation | eq_i64_constant | 318.1 µs | 288.2 µs | +10.37% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing siddarth2810:feat/sequence-cast-compute (61b79bd) with develop (f793584)

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@gatesn gatesn left a comment

Contributor

It's a good catch that this was broken before.

But why is it important to store this information? Vs just always computing in i64/u64 space?

@siddarth2810

Author

> It's a good catch that this was broken before.
>
> But why is it important to store this information? Vs just always computing in i64/u64 space?

Thanks for the quick review :)

The maintainer in the previous PR suggested using two ptypes, so I went ahead with that design in mind.

But after experimenting a bit, I think we could derive calculation_ptype from base.ptype() instead of passing it around or storing it separately.

For deserialization, I think we can inspect the scalar kind before decoding the scalar, so we may not need to store calculation_ptype in SequenceMetadata either.

Comment thread encodings/sequence/src/array.rs Outdated
Comment on lines +73 to +74
multiplier: PValue,
calculation_ptype: PType,

@connortsui20 connortsui20 Jun 15, 2026

Member

This is a bit outside of what this PR is trying to do, but could you document this now that we are adding a third type here that is different from the base and multiplier? With just base and multiplier it is obvious what this is doing, but with the addition of calculation_ptype it is harder to understand on a first read. And if you could also document the other places that now use this, that would be great, thanks!

Edit: I am interested to see if your idea about not storing this at all works. That would probably be better for us since that is not a breaking change.

Author

My understanding of this issue was to use two types: calculation_ptype for computing and output_ptype for the result type. But yeah, adding another type makes it less obvious on a first read, makes sense.

I'll work on the idea of not storing this at all.

Thanks a lot!

@joseph-isaacs joseph-isaacs changed the title Feat/sequence cast compute feat: sequence cast compute Jun 15, 2026
Comment thread encodings/sequence/src/array.rs Outdated
#[prost(message, tag = "2")]
multiplier: Option<vortex_proto::scalar::ScalarValue>,
#[prost(enumeration = "PType", optional, tag = "3")]
calculation_ptype: Option<i32>,

Contributor

Can you explain in a doc string why we need this, if we need it?

Author

I was trying to decode the base and multiplier as calculation_ptype during deserialization, so I added this. But after Gates's comments, I found that I could use scalar_value::Kind to get the type instead. I'll remove this in the next change.

Thanks for the review :)

siddarth2810 and others added 2 commits June 24, 2026 00:53
Remove the stored calculation_ptype from Sequence metadata and avoid
passing it through constructors. Normalize base and multiplier to i64
when either value is signed, otherwise u64, and use that internal ptype
for sequence arithmetic.

Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
Signed-off-by: Siddarth <110726331+siddarth2810@users.noreply.github.com>
@siddarth2810 siddarth2810 marked this pull request as draft June 29, 2026 13:15
@siddarth2810 siddarth2810 marked this pull request as ready for review June 29, 2026 13:15
@siddarth2810 siddarth2810 requested a review from gatesn June 29, 2026 13:15
Self { base, multiplier }
}

pub fn ptype(&self) -> PType {

@gatesn gatesn Jun 29, 2026

Contributor

I think this entire module is a massive foot gun.... (not your fault!)

We should just be passing around Scalar values. Not PValue.

This PType used to return the output ptype (i.e. matches array.dtype().ptype()), but now it returns the ptype used to calculate the result values I think?

Can we remove this function maybe? Or rename it to calculation_ptype()? Make it pub(crate) and not pub?

I think that's all I would change here!
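
One possible shape for the rename and visibility change suggested above, as a rough sketch only. `PType` here is a reduced stand-in and `SequenceTypes` is a hypothetical type; neither is the crate's real definition. Both ptypes are stored purely to keep the example short, which says nothing about how the real code obtains them.

```rust
/// Stand-in for the crate's primitive type tag (hypothetical, reduced to three variants).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PType {
    U8,
    I64,
    U64,
}

/// Hypothetical holder that keeps the two roles apart.
/// Both fields are stored here only for brevity of the example.
pub struct SequenceTypes {
    output: PType,
    calculation: PType,
}

impl SequenceTypes {
    pub fn new(output: PType, calculation: PType) -> Self {
        Self { output, calculation }
    }

    /// Public: the ptype callers observe, assumed to match the array dtype.
    pub fn output_ptype(&self) -> PType {
        self.output
    }

    /// Crate-internal: the ptype used for arithmetic, under an explicit name.
    pub(crate) fn calculation_ptype(&self) -> PType {
        self.calculation
    }
}
```

With the accessor named for its role and limited to `pub(crate)`, code outside the crate cannot mistake the arithmetic type for the output type.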

Labels

changelog/fix A bug fix

Development

Successfully merging this pull request may close these issues.

SequenceArray cannot cast to a narrower type

4 participants