Skip to content

fix: round large UInt64 values without narrowing#23033

Open
Kevin-Li-2025 wants to merge 1 commit into
apache:mainfrom
Kevin-Li-2025:kevin/fix-spark-round-large-uint64
Open

fix: round large UInt64 values without narrowing#23033
Kevin-Li-2025 wants to merge 1 commit into
apache:mainfrom
Kevin-Li-2025:kevin/fix-spark-round-large-uint64

Conversation

@Kevin-Li-2025

Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

Spark round currently narrows UInt64 inputs through i64. Values above i64::MAX therefore fail even when a non-negative scale makes rounding a no-op. The narrowing also prevents correct negative-scale rounding for the upper half of the UInt64 domain.

What changes are included in this PR?

  • Add a native u64 HALF_UP rounding path with the same ANSI overflow and non-ANSI wrapping behavior as the signed implementation.
  • Use it for both scalar and array UInt64 inputs.
  • Add helper-level coverage for large values and overflow behavior.
  • Add SQL regression coverage for u64::MAX with default/positive scales and a large negative-scale case.

Are these changes tested?

Yes:

  • cargo test -p datafusion-spark function::math::round::tests --lib
  • cargo test -p datafusion-sqllogictest --test sqllogictests -- spark/math/round --test-threads 1
  • cargo clippy -p datafusion-spark --features core --all-targets -- -D warnings
  • cargo fmt --all -- --check

Are there any user-facing changes?

Yes. Spark round now accepts UInt64 values above i64::MAX and returns correct results instead of reporting a narrowing error. There are no public API changes.

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) spark labels Jun 19, 2026

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — this correctly fixes the narrowing path. A few notes from reviewing it:

The new round_unsigned_integer faithfully mirrors round_integer, and correctly omits the remainder <= -threshold branch since value % factor is always non-negative for u64 — that's a valid simplification, not a missing case. The HALF_UP tie-breaking, ANSI checked-overflow vs. non-ANSI wrapping, and the factor overflow → 0 short-circuit all match the signed version.

I verified the overflow test's expected value by hand: round(u64::MAX, -1, non-ansi)remainder = (2^64-1) % 10 = 5, 5 >= 5 rounds up → wrapping_sub(5).wrapping_add(10) = 4, which is what the test asserts.

I ran both unit tests locally and they pass. (I couldn't run the SLT cases in my environment, but they exercise the same round_unsigned_integer paths the unit tests cover, and the expected values check out by hand.)

One optional nit: round_integer hoists let threshold = factor / 2; into a local and reuses it; the unsigned version inlines factor / 2 in the comparison. Matching the sibling function's style would read a bit more consistently — minor, take it or leave it.

@Jefffrey

Copy link
Copy Markdown
Contributor

Fyi we have an existing PR:

This PR seems to fix only spark, while the original issue seems to describe for datafusion function. Also worth noting Spark doesn't support unsigned types in the first place (I made a comment to this affect on the linked PR)

@viirya

viirya commented Jun 20, 2026

Copy link
Copy Markdown
Member

Following up on @Jefffrey's note about the overlap with #22697 — I dug into both and wanted to lay out the relationship so the maintainers can decide which path to take.

Scope: @Jefffrey is right that this PR addresses only the Spark side. Issue #22696 has two halves: (1) core datafusion-functions round() losing precision on Int64 above 2^53 by routing through Float64, and (2) Spark round() failing on UInt64 above i64::MAX. This PR fixes only (2); #22697 covers both.

But the two PRs fix the Spark UInt64 path differently, and this one is more complete there:

  • fix: Preserve integer values in round() for large Int64 and UInt64 inputs #22697 (current state) only short-circuits when scale >= 0 (the no-op case). For negative scale with a value above i64::MAX, it still falls back to i64::try_from(...) and errors:
    if scale >= 0 { Ok(...v...) } else {
        let v_i64 = i64::try_from(*v).map_err(...)?;  // still narrows → errors for u64 > i64::MAX
        ...
    }
  • This PR's round_unsigned_integer handles negative scale natively in u64, so e.g. round(u64::MAX, -1) rounds correctly instead of erroring. That's the case the unit test round_unsigned_integer_overflow and the -1 SLT case exercise.

On whether the Spark UInt64 codepath should exist at all: @Jefffrey raised this on #22697 too ("spark doesn't have a u64 type"). Worth noting the SparkRound signature coerces via TypeSignatureClass::Integer, which includes unsigned types, so arrow_cast(x, 'UInt64') does reach this arm today — it's live, not dead code. Whether it should be supported is a separate design question; if the answer is "drop unsigned support," that would supersede both this fix and #22697's Spark portion.

No strong opinion from me on which PR lands — just flagging that if the Spark UInt64 path stays, this PR's negative-scale handling is the more correct of the two.

@viirya

viirya commented Jun 20, 2026

Copy link
Copy Markdown
Member

Some archaeology on the "Spark doesn't have a u64 type" point, in case it helps decide the direction here.

You're right that Spark can't produce a UInt64. Spark SQL's integer types are ByteType/ShortType/IntegerType/LongType — all signed (JVM has no unsigned integers). This PR's own doc comment on round_integer reflects that: it says it matches "Spark's RoundBase behaviour for ByteType, ShortType, IntegerType, and LongType" — only the signed types.

So why does the code accept UInt64 at all? It looks incidental, not intentional. The signature has used Coercion::new_exact(TypeSignatureClass::Integer) since the function was first added in #21062, and TypeSignatureClass::Integer matches on NativeType::is_integer(), which includes the unsigned types:

// datafusion/common/src/types/native.rs
pub fn is_integer(&self) -> bool {
    matches!(self, UInt8 | UInt16 | UInt32 | UInt64 | Int8 | Int16 | Int32 | Int64)
}

So UInt8..UInt64 came in "for free" via that coercion class rather than by deliberate design, and the UInt64 dispatch arm exists only because the signature lets those values through (e.g. via arrow_cast(x, 'UInt64')) — Spark itself never generates them.

That seems to leave two coherent directions:

  1. Keep the codepath and make it correct — what this PR (and fix: Preserve integer values in round() for large Int64 and UInt64 inputs #22697's Spark portion) does.
  2. Narrow the signature to signed-only and drop the UInt* dispatch arms, so Spark round matches Spark's actual type domain — which would make both this fix and fix: Preserve integer values in round() for large Int64 and UInt64 inputs #22697's Spark change unnecessary.

No opinion from me on which is preferable — just flagging that the unsigned support appears to be a side effect of the coercion class rather than a requirement, in case that informs the call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

round() mishandles large Int64 and UInt64 values

3 participants