provide a NEON version of arm/sgemm #5800

Merged
martin-frbg merged 4 commits into OpenMathLib:develop from notaz:armv7_sgemm on May 6, 2026

@notaz (Contributor) commented May 5, 2026

Surprisingly, OpenBLAS lacks NEON-optimized kernels for armv7, even though it enables NEON by default during the build (it passes -mfpu=neon to the compiler, so the compiler is free to use NEON instructions wherever it can).

The speedup on Cortex-A76 is significant, roughly 3.3x in benchmark/sgemm.goto. Before:

 M= 200, N= 200, K= 200 :     9262.97 MFlops   0.001727 sec

after:

 M= 200, N= 200, K= 200 :    30223.64 MFlops   0.000529 sec
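
For reference, the MFlops figures follow from the usual GEMM operation count of 2*M*N*K floating-point operations (one multiply and one add per inner-product term). The C sketch below is not part of the benchmark code, and the helper names `gemm_mflops` and `speedup` are made up for illustration. It recomputes the rate from a printed time and the speedup from two printed rates. Because the printed times are rounded to the microsecond, the recomputed rates (about 9265 and 30246 MFlops) agree with the reported ones only to within about 0.1%.

```c
#include <assert.h>

/* Hypothetical helper: MFlops for an M x N x K GEMM that took `seconds`,
   counting one multiply and one add per inner-product term. */
static double gemm_mflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds * 1e-6;
}

/* Hypothetical helper: ratio of two throughput figures. */
static double speedup(double mflops_before, double mflops_after)
{
    assert(mflops_before > 0.0);
    return mflops_after / mflops_before;
}
```

Calling `gemm_mflops(200, 200, 200, 0.001727)` and `gemm_mflops(200, 200, 200, 0.000529)` reproduces the two rates, and `speedup(9262.97, 30223.64)` gives a factor of about 3.26.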

notaz added 4 commits May 5, 2026 22:36
Non-local labels interfere with profiling. The same thing was done for arm64 in
commit a0128aa.

According to the ARM AAPCS (Procedure Call Standard), section 5.1.2.1, only registers
s16-s31 must be preserved across subroutine calls; registers s0-s15
do not need to be preserved.

benchmark/sgemm.goto before:
 M= 200, N= 200, K= 200 :     9262.97 MFlops   0.001727 sec
after:
 M= 200, N= 200, K= 200 :    30223.64 MFlops   0.000529 sec

Conveniently the registers are already allocated suitably for vector
operation, so the conversion from vfpv3 was rather straightforward.
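
To illustrate why a register layout that already suits vectors makes such a conversion straightforward, here is a minimal C sketch. It is not the assembly kernel from this PR: the function names, the 4x4 tile size, and the packed layout of `a` and `b` are assumptions made for illustration. The scalar version does 16 separate multiply-adds per step of k, one value at a time. The vector version, compiled only when `__ARM_NEON` is defined, keeps each column of the tile in one 128-bit register and needs four multiply-accumulate-by-lane operations per step; on other targets it falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Scalar reference: c (4x4, column-major) += A_panel * B_panel.
   Assumed packing: a holds 4 rows per k step, b holds 4 columns per k step. */
static void tile4x4_scalar(size_t k, const float *a, const float *b, float *c)
{
    assert(a && b && c);
    for (size_t p = 0; p < k; p++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                c[j * 4 + i] += a[p * 4 + i] * b[p * 4 + j];
}

/* Same update with one NEON register per column of the tile. */
static void tile4x4_vector(size_t k, const float *a, const float *b, float *c)
{
    assert(a && b && c);
#ifdef __ARM_NEON
    float32x4_t c0 = vld1q_f32(c + 0), c1 = vld1q_f32(c + 4);
    float32x4_t c2 = vld1q_f32(c + 8), c3 = vld1q_f32(c + 12);
    for (size_t p = 0; p < k; p++) {
        float32x4_t av = vld1q_f32(a + p * 4);   /* 4 rows of A    */
        float32x4_t bv = vld1q_f32(b + p * 4);   /* 4 columns of B */
        /* column j of the tile += av * bv[j] */
        c0 = vmlaq_lane_f32(c0, av, vget_low_f32(bv), 0);
        c1 = vmlaq_lane_f32(c1, av, vget_low_f32(bv), 1);
        c2 = vmlaq_lane_f32(c2, av, vget_high_f32(bv), 0);
        c3 = vmlaq_lane_f32(c3, av, vget_high_f32(bv), 1);
    }
    vst1q_f32(c + 0, c0);
    vst1q_f32(c + 4, c1);
    vst1q_f32(c + 8, c2);
    vst1q_f32(c + 12, c3);
#else
    tile4x4_scalar(k, a, b, c);   /* portable fallback without NEON */
#endif
}
```

Both functions should produce the same tile for the same inputs, so the scalar one can serve as a reference when checking the vector path.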

Prefetching was left out because it doesn't help Cortex-A76,
only hurts it slightly.

@martin-frbg (Collaborator) commented

Thanks - ARMV7 hasn't received much work in recent years, and I believe in the early days of CORTEXA53/A57 it was assumed/claimed that NEON wasn't significantly faster in V7 code. (Telling the compiler it's allowed to use NEON was required for some code contribution that used universal intrinsics, I think, and it was obviously trivial to do...)

martin-frbg added this to the 0.3.34 milestone on May 6, 2026
martin-frbg merged commit e8ad16c into OpenMathLib:develop on May 6, 2026
103 checks passed