Add NumPy optimization guide #36
Add intel-numpy mkl extension optimizations readme
|
Overall comment: this guide recommends setting the IOMP threading layer for MKL, but pretty much every other PyPI package outside of Intel-distributed NumPy will bundle libgomp, which could cause incompatibilities. Perhaps it could recommend setting |
| conda install -y \ | ||
| -c https://software.repos.intel.com/python/conda \ | ||
| -c conda-forge --override-channels \ | ||
| "blas=*=*_intelmkl" \ |
What about lapack? If this is done on an existing environment, there's no guarantee that the user won't have different backends for blas and lapack.
| mkl mkl_fft mkl_random mkl_umath mkl-service | ||
| ``` | ||
|
|
||
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
I assume this advice might have been copied from other documentation pages.
It was there to avoid pulling packages from the Anaconda channel, which has higher priority. That's worth mentioning here.
| conda activate idp_env | ||
| ``` | ||
|
|
||
| Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy): |
I think this part is redundant:

> Pin `python=<version>` to match your project if you need a specific interpreter

since this is not a general conda guide.
| mkl mkl_fft mkl_random mkl_umath mkl-service | ||
| ``` | ||
|
|
||
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
Suggested change (the conda-forge build string becomes `blas=*=*mkl*`):
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. | |
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl*`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
|
|
||
| Use `--index-url`, not `--extra-index-url`: Intel's index is a partial mirror, and with `--extra-index-url` pip would see PyPI's higher-numbered OpenBLAS wheel and install that instead. Packages Intel does not mirror (for example `threadpoolctl`, used for [verification](#verifying-onemkl-is-active)) install normally from PyPI in a separate step. The Intel wheels target Linux and Windows; if `pip` reports no matching distribution, check that your platform and Python version are covered on the index. | ||
|
|
||
| Whichever path you take, choose the OpenMP threading layer and set it **before anything imports NumPy or MKL**. The variable is read once at MKL load time, so exporting it after the import has no effect. Which value to pick is explained under [Threads and NUMA](#threads-and-numa); the safe default for a typical pip or mixed environment is: |
Also applicable to SciPy.
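For illustration, here is a minimal sketch of the ordering the paragraph describes, applied to both NumPy and SciPy. The helper name `set_mkl_threading_layer` is hypothetical, and `GNU` is used only because the guide names it as the default for mixed environments.

```python
import os
import sys


def set_mkl_threading_layer(layer="GNU"):
    """Set MKL_THREADING_LAYER unless the user already exported it.

    Returns the names of NumPy/SciPy modules that were imported before
    this call, in which case the setting may come too late to matter.
    (Hypothetical helper, not part of any Intel or NumPy API.)
    """
    os.environ.setdefault("MKL_THREADING_LAYER", layer)
    return [name for name in ("numpy", "scipy") if name in sys.modules]


too_late = set_mkl_threading_layer("GNU")
if too_late:
    print(f"warning: {too_late} imported before MKL_THREADING_LAYER was set")

# Only import numpy / scipy after this point.
```

`setdefault` is used so that a value exported in the shell still wins over the in-script default.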
|
|
||
| The `threading_layer` value matches `MKL_THREADING_LAYER` (`gnu`, `intel`, or `sequential`); the field that confirms the backend is `internal_api: mkl`. | ||
|
|
||
| `np.show_config()` will show `name: blas, version: 3.9.0` even with oneMKL active. That is expected: it reflects the generic interface NumPy compiled against, not the runtime library. `threadpoolctl` is the reliable check. |
These hard-coded version numbers are prone to becoming outdated over time.
| MKL_VERBOSE DGEMM(N,N,4096,4096,4096,...) 2.1s CNT=1 | ||
| ``` | ||
|
|
||
| If only the banner appears and no `DGEMM`/`DFFT`/`VML` lines follow, oneMKL loaded but is not being called. |
It should mention here that which of those lines appear depends on what the code is doing. Maybe it could provide a sample script with a matrix multiplication that would trigger DGEMM.
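Something along these lines could work as the sample script. It is a sketch that assumes NumPy is linked against oneMKL; with any other BLAS backend it runs the same way but prints no `MKL_VERBOSE` lines. The matrix size is arbitrary.

```python
import os

# Assumption: MKL picks up MKL_VERBOSE from the environment, so it is set
# here before NumPy (and with it the BLAS library) is imported.
os.environ.setdefault("MKL_VERBOSE", "1")

import numpy as np

n = 512  # arbitrary size, large enough to be dispatched to BLAS
a = np.random.rand(n, n)  # float64 by default
b = np.random.rand(n, n)

# A float64 matrix product is routed to the BLAS dgemm routine, so with
# oneMKL active this call should produce a DGEMM line in the verbose log.
c = a @ b
print(c.shape, c.dtype)
```

With `float32` inputs the same call would be expected to show `SGEMM` instead, which illustrates that the logged routines depend on what the code does.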
|
|
||
| **The extension packages do not activate themselves.** `mkl_fft`, `mkl_random`, and `mkl_umath` do not replace NumPy functions on import. Use the patch function or context manager. Since the 2026.0 release installs the standard conda-forge NumPy rather than a bundled Intel build, there is no longer anything that activates them at build time, so explicit activation is required even in the full Intel® Distribution for Python. | ||
|
|
||
| **The activation model is release-specific; this guide targets 2026.0 and later.** The explicit `patch_*` workflow described here matches the package generation in [Benchmark results](#benchmark-results) (NumPy 2.4.3, mkl_fft 2.2.0, mkl_random 1.4.0, mkl_umath 0.4.0). Earlier releases behave differently, verified on `intelpython3_full=2025.3.0`: |
This makes it sound as if this were expected to change in the future. Maybe it could mention that it applies to versions starting with 2026.0.
| conda install -y \ | ||
| -c https://software.repos.intel.com/python/conda \ | ||
| -c conda-forge --override-channels \ | ||
| "blas=*=*_intelmkl" \ |
'blas' is a development package providing headers, .pc files, and similar, depending in turn on 'libblas'. 'libblas' is the runtime that sets the backend.
| conda install -c conda-forge _openmp_mutex=*=*_llvm | ||
| ``` | ||
|
|
||
| On Windows, `_openmp_mutex` offers Intel and LLVM variants but no GNU one, consistent with there being no GNU threading on the platform. |
| Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy): | ||
|
|
||
| ```bash | ||
| conda install -y \ |
Very important to mention here that packages from the Intel channel are meant to be compatible with packages from conda-forge but not with packages from Anaconda, which is the default channel.
|
Another general comment: the guide specifically mentions AVX-512 as the highest level of SIMD instructions, but that will soon become outdated as hardware with AVX10.2 gets released. |
| ```python | ||
| from threadpoolctl import threadpool_info | ||
| import pprint | ||
| pprint.pprint(threadpool_info()) |
This should be executed after importing numpy.
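A possible corrected version of the snippet, as a sketch: `threadpoolctl` only reports libraries already loaded into the process, so NumPy is imported first. The `ImportError` guard and the filter on `user_api` are additions for illustration, not part of the original snippet.

```python
import pprint

import numpy as np  # imported first so its BLAS library is loaded

try:
    from threadpoolctl import threadpool_info
except ImportError:  # threadpoolctl is a separate install
    threadpool_info = None

if threadpool_info is None:
    blas_libs = []
    print("threadpoolctl is not installed")
else:
    # Keep only the BLAS entries; OpenMP runtimes are reported separately.
    blas_libs = [lib for lib in threadpool_info() if lib.get("user_api") == "blas"]
    pprint.pprint(blas_libs)
```

If the list is empty even though `threadpoolctl` is installed, no supported BLAS library was detected in the process.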
| | `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling | | ||
| | `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) | | ||
|
|
||
| `KMP_AFFINITY` is an Intel OpenMP setting, so it applies only when oneMKL is on the Intel runtime (`MKL_THREADING_LAYER=INTEL`); under the GNU layer use `GOMP_CPU_AFFINITY` or `numactl` instead. `KMP_AFFINITY=granularity=fine,compact,1,0` is appropriate for single-socket systems or when running one process per socket. On multi-socket systems without `numactl` it may bind threads across sockets; verify the actual binding with `KMP_AFFINITY=verbose`. |
What about OMP_PROC_BIND?
| | Variable | Recommended value | Effect | | ||
| |---|---|---| | ||
| | `MKL_THREADING_LAYER` | `GNU` (mixed env) or `INTEL` (all-Intel) | Select MKL's OpenMP runtime; see note below | | ||
| | `MKL_NUM_THREADS` | physical core count | Cap MKL thread count | |
Is this guaranteed to work as intended if you set MKL_NUM_THREADS to number of physical cores, then bind the threads to numbers from the system, but don't specify something like OMP_PLACES=threads? Wouldn't it potentially end up using hyperthreads if the system enumerates them in an interleaved order?
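To make the enumeration concern concrete, here is a small sketch that derives one logical CPU per physical core from Linux's `thread_siblings_list` sysfs files. The helper names are hypothetical, the two sample layouts are made up for illustration, and the sysfs read is Linux-only (it returns an empty list elsewhere).

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,64' or '0-3,8' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def physical_core_representatives(sibling_lists):
    """Return the lowest logical CPU id of each physical core."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists if s.strip()})


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every cpuN; empty list if unavailable."""
    try:
        paths = Path(root).glob("cpu[0-9]*/topology/thread_siblings_list")
        return [p.read_text() for p in paths]
    except OSError:
        return []


# Made-up interleaved layout: cpu0/cpu1 share a core, cpu2/cpu3 share a core.
interleaved = ["0-1", "0-1", "2-3", "2-3"]
print(physical_core_representatives(interleaved))  # [0, 2]

# Made-up split layout: hyperthreads are numbered after all physical cores.
split = ["0,2", "1,3", "0,2", "1,3"]
print(physical_core_representatives(split))  # [0, 1]

# The layout of the current machine (empty list on non-Linux systems).
print(physical_core_representatives(read_sibling_lists()))
```

In the interleaved case, binding two threads to CPUs 0 and 1 would place both on the same physical core, which is the situation the question above is asking about.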
| The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide: | ||
|
|
||
| - **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change. | ||
| - **FFT, random, and vectorized math** come from three separate packages (`mkl_fft`, `mkl_random`, `mkl_umath`). These do not activate on import; you switch them on explicitly in code. |
It could link to the GitHub repositories of those packages.

Adds a new tuning guide documenting how to run NumPy with Intel® oneMKL-backed performance (BLAS/LAPACK plus optional FFT/random/umath patching), and links it from the repository’s main README.
CC @xaleryb @jharlow-intel @napetrov for additional review