Commit 9d58b8d
provide a NEON version of arm/sgemm
benchmark/sgemm.goto before:
M= 200, N= 200, K= 200 : 9262.97 MFlops 0.001727 sec
after:
M= 200, N= 200, K= 200 : 30223.64 MFlops 0.000529 sec
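As a rough cross-check of the figures above, the sketch below recomputes the throughput from the reported times and derives the speedup. It assumes the benchmark counts 2*M*N*K floating-point operations per real GEMM call, which is the usual convention but is not stated in this commit. The helper name gemm_mflops is hypothetical.

```python
# Sanity-check the benchmark lines, assuming 2*M*N*K flops per SGEMM call
# (the usual convention; an assumption, not taken from the commit itself).

def gemm_mflops(m, n, k, seconds):
    """Return MFlops for one M x N x K real matrix multiply (hypothetical helper)."""
    return 2.0 * m * n * k / seconds / 1e6

# Times as printed by benchmark/sgemm.goto in the commit message.
before = gemm_mflops(200, 200, 200, 0.001727)
after = gemm_mflops(200, 200, 200, 0.000529)

# Speedup taken from the reported MFlops figures.
speedup = 30223.64 / 9262.97

print(f"before: {before:.1f} MFlops, after: {after:.1f} MFlops")
print(f"speedup: {speedup:.2f}x")
```

The recomputed values land within a fraction of a percent of the reported ones (the printed times are rounded), and the NEON kernel comes out roughly 3.3 times faster than the vfpv3 one at this problem size.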
Conveniently the registers are already allocated suitably for vector
operation, so the conversion from vfpv3 was rather straightforward.
Prefetching was left out because it doesn't help Cortex-A76,
only hurts it slightly.
1 parent cd276c2 commit 9d58b8d
1 file changed
Lines changed: 225 additions & 2 deletions