Finish removing the BigInts from * for FD{Int128}!#94
Conversation
We do not explicitly introduce support for FD{BitIntegers.Int256} here,
though that should work out of the box both before and after this PR.
Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations
from Int128 widening to BigInt in FD operations.
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
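The div-by-const trick described above can be sketched at a smaller width. The following is a hypothetical illustration of the general technique, not the package's actual `fldmod_by_const` implementation (all names here are made up for the example): dividing a `UInt32` by the constant 1000 via one widening multiply by a precomputed reciprocal plus a shift, which is the same transformation LLVM applies for native-width constant divisors.

```julia
# Illustrative sketch of unsigned div-by-const via a precomputed "magic"
# reciprocal. These names are hypothetical, not FixedPointDecimals API.
const D     = UInt32(1000)                         # the 10^f coefficient, f = 3
const SHIFT = 32 + 10                              # nbits(UInt32) + ceil(log2(D))
const MAGIC = cld(UInt64(1) << SHIFT, UInt64(D))   # ceil(2^42 / 1000)

# q = (x * MAGIC) >> SHIFT; the product is computed in a wider type so it
# cannot overflow -- this mirrors how the PR briefly widens (U)Int256
# operations into Int512 to divide by 10^f.
div_by_const(x::UInt32) = ((UInt128(x) * MAGIC) >> SHIFT) % UInt32

# For unsigned x, fldmod coincides with divrem; signed inputs need the
# flooring fixup that the PR's real fldmod_by_const handles.
fldmod_by_const_sketch(x::UInt32) = (q = div_by_const(x); (q, x - q * D))
```

The round-up magic `ceil(2^42 / 1000)` is exact for every 32-bit input here because the approximation error stays below `2^10`; at 128 bits the same idea forces the brief widening to 512 bits.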
Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (and thus `*` for FixedDecimal{Int64}). See here for my note on FixedDecimal{Int64} support:
Now that there's Julia's effects system, this doesn't need
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Force-pushed from 82eb32d to f2958ba
Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined and called through a function call.

julia> function bench(fd)
for _ in 1:10000
fd = fd * fd
end
fd
end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols

After:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols

EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, the new code is still just as much faster:

julia> @noinline function bench(fd, N)
for _ in 1:N
fd = fd * fd
end
fd
end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
omus left a comment:
Did another surface level review. I can do a full review if no one else will. I will need to block off some time though as there's a bit going on in this PR.
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
I think Tomas should be able to review when he's back from vacation. He's out 'til Thursday. :)
…change
Tests all FD{(U)Int16} values.
Tests most corner cases for FD{(U)Int128} values.
Force-pushed from 44686f3 to ac3302e
We must use the precise value of 2^nbits(T) in order to get the correct division in all cases. ....... UGH except now the Int64 tests aren't passing.
LLVM _does_ do this automatically if you directly call `fldmod`. Improve the comments as well.
… not fld(x,y),mod(x,y), even if y is a constant! :) Improved that for julia versions 1.8 - 1.9
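To illustrate the point of this commit message (the function names here are just for the example): calling `fldmod` once computes the quotient and remainder together, whereas separate `fld` and `mod` calls are two independent computations that the compiler may or may not merge into one division.

```julia
# Two ways to get a floored quotient and remainder by a constant divisor.
# Semantically identical, but the combined call gives the compiler a single
# site at which to apply the div-by-const transformation.
split_version(x)    = (fld(x, 1000), mod(x, 1000))  # two separate calls
combined_version(x) = fldmod(x, 1000)               # one combined call
```

Both agree even for negative `x`, where floored division differs from truncated division: `fldmod(-1234, 1000)` is `(-2, 766)`, while `divrem(-1234, 1000)` is `(-1, -234)`.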
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!
I also updated the PR comment with the final perf numbers, which look 👌
By the way,
Oh wow, very nice! Thanks @EdsterG. 💪 This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪
- Added comments.
- Introduced new tests for FixedDecimal multiplication using JET.
- Enhanced fldmod_by_const tests to cover a wider range of divisors and edge cases.
- Updated Project.toml to include JET as a dependency.
Ok, so I've reviewed the code and I concluded that it is correct. I was mainly concerned about whether we should or should not add 1 for negative inputs. I've also added JET tests and manually examined the generated typed IR for all possible multiplication methods to verify that none of them mentioned any errors or unreachable branches. I've also run the exhaustive tests for
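A JET-based test along these lines could look as follows (a sketch under assumed package versions; `JET.@test_opt` fails the enclosing test if the optimized IR for the call contains runtime dispatch or possibly-erroring branches):

```julia
using Test, JET, FixedPointDecimals

@testset "FD{Int128} multiply is statically optimizable" begin
    x = FixedDecimal{Int128,3}(1.5)
    # Fails if the call's optimized typed IR contains any dynamic dispatch
    # or unreachable/error branches.
    JET.@test_opt x * x
end
```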
…test things like simd-compatibility)
Didn't actually fix it.. I guess we can just leave the nightly tests broken, until they fix JET? :/ This reverts commit 72156f7.

Finally implements the fast-multiplication optimization from #45! :)
This PR optimizes multiplication for `FixedDecimal{Int64}` and `FixedDecimal{Int128}`. In the process, we also undid an earlier optimization which is no longer needed after Julia 1.8, which makes multiplication about 2x as fast for the smaller int types as well! 🎉

This is a follow-up to #93, which introduces an Int256 type for widemul. However, after that PR the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiply performance scales linearly with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.
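The scaling claim can be spot-checked with a benchmark along these lines (a sketch; exact timings depend on hardware and Julia version, and the zero-allocation property for `Int128` is the post-PR behavior):

```julia
using BenchmarkTools, FixedPointDecimals

# Time FD multiply at each bit width; interpolation ($) keeps the
# benchmark from treating the values as global variables.
for T in (Int32, Int64, Int128)
    fd = FixedDecimal{T,3}(1.234)
    print(rpad("FD{$T,3} *:", 16))
    @btime $fd * $fd
end
```

A quick allocation check for the 128-bit case (run the function once first so compilation isn't counted): `f(a) = a * a; a = FixedDecimal{Int128,3}(1.234); f(a); @allocated(f(a))` should report 0 after this PR.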
master:
After PR #93:
After this PR: