Finish removing the BigInts from * for FD{Int128}!#94
Conversation
We do not explicitly introduce support for FD{BitIntegers.Int256} here,
though that should work out of the box both before and after this PR.
Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations
from Int128 widening to BigInt in FD operations.
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
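The div-by-const trick described above can be sketched at a smaller width. The following is a hypothetical illustration of the general technique, not the package's actual `fldmod_by_const` implementation (all names here are made up for the example): dividing a `UInt32` by the constant 1000 via one widening multiply by a precomputed reciprocal plus a shift, which is the same transformation LLVM applies for native-width constant divisors.

```julia
# Illustrative sketch of unsigned div-by-const via a precomputed "magic"
# reciprocal. These names are hypothetical, not FixedPointDecimals API.
const D     = UInt32(1000)                         # the 10^f coefficient, f = 3
const SHIFT = 32 + 10                              # nbits(UInt32) + ceil(log2(D))
const MAGIC = cld(UInt64(1) << SHIFT, UInt64(D))   # ceil(2^42 / 1000)

# q = (x * MAGIC) >> SHIFT; the product is computed in a wider type so it
# cannot overflow -- this mirrors how the PR briefly widens (U)Int256
# operations into Int512 to divide by 10^f.
div_by_const(x::UInt32) = ((UInt128(x) * MAGIC) >> SHIFT) % UInt32

# For unsigned x, fldmod coincides with divrem; signed inputs need the
# flooring fixup that the PR's real fldmod_by_const handles.
fldmod_by_const_sketch(x::UInt32) = (q = div_by_const(x); (q, x - q * D))
```

The round-up magic `ceil(2^42 / 1000)` is exact for every 32-bit input here because the approximation error stays below `2^10`; at 128 bits the same idea forces the brief widening to 512 bits.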
Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (and thus `*` for FixedDecimal{Int64}). See here for my note on FixedDecimal{Int64} support:
Now that there's Julia's effects system, this doesn't need
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Force-pushed from 82eb32d to f2958ba
Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined and called through a function call.

julia> function bench(fd)
for _ in 1:10000
fd = fd * fd
end
fd
end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols

After:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols

EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, the new code is still just as much faster:

julia> @noinline function bench(fd, N)
for _ in 1:N
fd = fd * fd
end
fd
end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
omus left a comment:
Did another surface level review. I can do a full review if no one else will. I will need to block off some time though as there's a bit going on in this PR.
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
I think Tomas should be able to review when he's back from vacation. He's out 'til Thursday. :)
…change
Tests all FD{(U)Int16} values.
Tests most corner cases for FD{(U)Int128} values.
Force-pushed from 44686f3 to ac3302e
We must use the precise value of 2^nbits(T) in order to get the correct division in all cases. ....... UGH except now the Int64 tests aren't passing.
LLVM _does_ do this automatically if you directly call `fldmod`. Improve the comments as well.
… not fld(x,y),mod(x,y), even if y is a constant! :) Improved that for julia versions 1.8 - 1.9
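To illustrate the point of this commit message (the function names here are just for the example): calling `fldmod` once computes the quotient and remainder together, whereas separate `fld` and `mod` calls are two independent computations that the compiler may or may not merge into one division.

```julia
# Two ways to get a floored quotient and remainder by a constant divisor.
# Semantically identical, but the combined call gives the compiler a single
# site at which to apply the div-by-const transformation.
split_version(x)    = (fld(x, 1000), mod(x, 1000))  # two separate calls
combined_version(x) = fldmod(x, 1000)               # one combined call
```

Both agree even for negative `x`, where floored division differs from truncated division: `fldmod(-1234, 1000)` is `(-2, 766)`, while `divrem(-1234, 1000)` is `(-1, -234)`.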
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!
I also updated the PR comment with the final perf numbers, which look 👌
By the way,
Oh wow, very nice! Thanks @EdsterG. 💪 This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪
- Added comments.
- Introduced new tests for FixedDecimal multiplication using JET.
- Enhanced fldmod_by_const tests to cover a wider range of divisors and edge cases.
- Updated Project.toml to include JET as a dependency.
Ok, so I've reviewed the code and I concluded that it is correct. I was mainly concerned about whether we should or should not add 1 for negative inputs. I've also added JET tests and manually examined the generated typed IR for all possible multiplication methods to verify that none of them mentioned any errors or unreachable branches. I've also run the exhaustive tests for
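A JET-based test along these lines could look as follows (a sketch under assumed package versions; `JET.@test_opt` fails the enclosing test if the optimized IR for the call contains runtime dispatch or possibly-erroring branches):

```julia
using Test, JET, FixedPointDecimals

@testset "FD{Int128} multiply is statically optimizable" begin
    x = FixedDecimal{Int128,3}(1.5)
    # Fails if the call's optimized typed IR contains any dynamic dispatch
    # or unreachable/error branches.
    JET.@test_opt x * x
end
```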
…test things like simd-compatibility)
Didn't actually fix it.. I guess we can just leave the nightly tests broken, until they fix JET? :/ This reverts commit 72156f7.

Finally implements the fast-multiplication optimization from #45! :)
This PR optimizes multiplication for `FixedDecimal{Int64}` and `FixedDecimal{Int128}`. In the process, we also undid an earlier optimization which is no longer needed after Julia 1.8, which makes multiplication about 2x as fast for the smaller int types as well! 🎉

This is a follow-up to #93, which introduces an Int256 type for widemul. However, after that PR the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiply performance scales linearly with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.
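The scaling claim can be spot-checked with a benchmark along these lines (a sketch; exact timings depend on hardware and Julia version, and the zero-allocation property for `Int128` is the post-PR behavior):

```julia
using BenchmarkTools, FixedPointDecimals

# Time FD multiply at each bit width; interpolation ($) keeps the
# benchmark from treating the values as global variables.
for T in (Int32, Int64, Int128)
    fd = FixedDecimal{T,3}(1.234)
    print(rpad("FD{$T,3} *:", 16))
    @btime $fd * $fd
end
```

A quick allocation check for the 128-bit case (run the function once first so compilation isn't counted): `f(a) = a * a; a = FixedDecimal{Int128,3}(1.234); f(a); @allocated(f(a))` should report 0 after this PR.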
master:
After PR #93:
After this PR: