jaja360
left a comment
Very cool!
I ran some benchmarks to see whether it impacts performance. With clang, it doesn't at all. With g++, however, the new version is 7% faster on the Twitter dataset but 15% slower on Uniform-8 (it goes from 4.01 to 4.62 c/n). g++ has always given us suboptimal results (on Uniform-8 it generates 30 instructions, versus 27 for clang), so this seems more like a GCC codegen quirk than a problem on our side.
I think we can merge!
@jaja360 Logically, it should not make a difference... We save one named register, but the number of registers was not a limiting factor. We also save loading that register, which might help a little (maybe one fewer instruction), but it was almost surely not a bottleneck. Of course, the main point is to simplify the algorithm and the code. Merging.
Indeed, it should not make a difference (and it does not on
We can simplify the algorithm. This should not affect the performance, although it might be slightly beneficial because we need one fewer register.
The gist of it is that we replace
by
So instead of loading `zmmzero` and `ifma_const`, we just need `ifma_const`.
I am also trimming out `shift_and_insert_dot`, as it is not needed.