Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch)#1199
Draft
ChinChangYang wants to merge 2 commits into
Introduces a new neural-net backend (USE_BACKEND=MLX) targeting Apple
Silicon via Apple's MLX framework. The backend implements the full
nninterface contract (model load, batched evaluation, FP16/FP32 paths)
and ships with a Winograd 3x3 convolution path plus an adaptive
per-shape tuner that picks the fastest implementation for each
conv-3x3 shape at model load.
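The selection step of such a tuner can be sketched as follows. This is a minimal illustration, not the code in mlxwinotuner.cpp: the `Candidate` struct, `median`, and `pickFastest` are hypothetical names, index 0 is assumed to be the baked-in default, and the per-shape candidate rotation is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one conv-3x3 implementation plus its timing samples (ms).
struct Candidate {
  std::string name;
  std::vector<double> samplesMs;
};

// Median of a non-empty sample set. Takes a copy so the caller's data is untouched.
double median(std::vector<double> v) {
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Returns the index of the candidate whose median time beats the baseline
// (index 0, assumed here to be the baked-in default) by the largest margin.
// The baseline wins ties, so a candidate must be strictly faster to replace it.
std::size_t pickFastest(const std::vector<Candidate>& candidates) {
  const double baseline = median(candidates[0].samplesMs);
  std::size_t best = 0;
  double bestDelta = 0.0;  // median(candidate) - median(baseline); negative is faster
  for (std::size_t i = 1; i < candidates.size(); i++) {
    const double delta = median(candidates[i].samplesMs) - baseline;
    if (delta < bestDelta) {
      bestDelta = delta;
      best = i;
    }
  }
  return best;
}
```

Scoring by median rather than mean keeps a single slow sample (for example a first-run warm-up) from disqualifying an otherwise faster candidate.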
Backend
- cpp/neuralnet/mlxbackend.cpp: backend implementation. Supports
variable board sizes via input masking (same nnXLen/nnYLen
contract as other backends; the global COMPILE_MAX_BOARD_LEN
bound still applies). FP16/FP32 selected by the mlxUseFP16 config
(default auto -> fp16); same input feature layout as the other
backends. Mish activation runs FP16-safe (asserts on
ACTIVATION_MISH_SCALE8 so out-of-range variants are caught
explicitly rather than silently truncated).
- cpp/neuralnet/mlxwinograd.h: F(4x4, 3x3) Winograd transform with
fused activation + residual add.
- cpp/neuralnet/mlxwinotuner.{cpp,h}: per-shape Winograd tuner with
adaptive scoring (rotates the candidate set per shape, scores by
median-time delta against a baked-default baseline). Logs the
conv-3x3 shape distribution at model load.
- cpp/neuralnet/mlxtests.cpp: unit tests for the Winograd path
and tuner numeric-consistency, gated under runnnlayertests.
Build / wiring
- cpp/CMakeLists.txt: USE_BACKEND=MLX target. MLX requires CMake
3.27 (cmake_minimum_required stays at 3.18.2 so other backends
keep building on older CMake). Links Homebrew's prebuilt
libmlx.dylib; OSX deployment target intentionally not pinned so
the executable's minos matches the dylib it was linked against.
- cpp/main.cpp, cpp/program/setup.cpp, cpp/command/benchmark.cpp:
wire MLX into backend selection / benchmark.
- cpp/configs/{gtp,analysis,match,contribute}_example.cfg: document
mlxUseFP16 (auto / true / false), default auto -> fp16.
- Compiling.md: build instructions for the MLX backend.
Validation
- Cross-backend validation against an Eigen reference (testgpuerror)
for b18c384nbt, b40v8, and humanv0 nets shows FP32 max winrate
error 0.00095% and FP16 max 2.63%, well within the existing
backend tolerances.
This is the squash of 130 commits from feature/mlx-backend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arity smoke test (#26)
This PR adds a new neural-net backend (`USE_BACKEND=MLX`) targeting Apple
Silicon via Apple's MLX framework, with two dispatch paths sharing a single
backend:
- … per-shape tuner.
- … backend's `gpuIdx = 100` convention. Usable standalone or muxed with
  the GPU path on the same model to overlap forward passes.
It implements the full `nninterface.h` contract (model load, batched
evaluation, FP16/FP32 paths) and reuses the existing CoreML conversion
pipeline shared with the Metal backend.
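The device-index convention can be illustrated with a small sketch. It assumes that index 100 selects the ANE/CoreML path and that any other index selects the MLX GPU path. `DispatchPath`, `pathForDevice`, and `describeMux` are hypothetical names, not identifiers from this PR. Reading a label such as "2g2a" as two GPU threads plus two ANE threads is likewise an assumption.

```cpp
#include <string>
#include <vector>

enum class DispatchPath { GPU, ANE };

// Assumption: device index 100 routes a server thread to the ANE/CoreML path;
// every other index routes it to the MLX GPU path.
constexpr int ANE_DEVICE_IDX = 100;

DispatchPath pathForDevice(int deviceIdx) {
  return deviceIdx == ANE_DEVICE_IDX ? DispatchPath::ANE : DispatchPath::GPU;
}

// Summarizes a per-thread device assignment, e.g. {0, 0, 100, 100} -> "2g2a".
std::string describeMux(const std::vector<int>& deviceIdxPerThread) {
  int gpu = 0;
  int ane = 0;
  for (int idx : deviceIdxPerThread) {
    if (pathForDevice(idx) == DispatchPath::ANE) {
      ane++;
    } else {
      gpu++;
    }
  }
  return std::to_string(gpu) + "g" + std::to_string(ane) + "a";
}
```

With four server threads on one model, an assignment of `{0, 0, 100, 100}` would place two threads on each path, so their forward passes can overlap.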
What's new
Backend (`cpp/neuralnet/`)
- `mlxbackend.cpp`: backend implementation.
  - Variable board sizes via input masking (same `nnXLen`/`nnYLen` contract as other backends; the global `COMPILE_MAX_BOARD_LEN` bound still applies), FP16/FP32 selected by `mlxUseFP16` (default `auto` → fp16). Mish runs FP16-safe; the code asserts on `ACTIVATION_MISH_SCALE8` so out-of-range variants fail loudly rather than truncate silently.
  - … `mlxDeviceToUseThread<N> = 100` (or the backend-agnostic `deviceToUseThread<N>`); shares the model+converter cache with Metal. Feeds the real spatial mask (channel 0 NHWC → buffer) so rectangular / sub-NN-frame boards predict correctly, transposes NHWC → NCHW into a per-batch staging buffer for the Swift `MLMultiArray` contract, and uses path-correct strides in the policy-optimism postprocessor for v12+ models.
  - … `ComputeHandle` construction with a file-static mutex (CoreML converter is not concurrent), and eagerly evaluates FP16 weight casts so secondary MLX/GPU server threads don't trip MLX 0.31.2's thread-local command-encoder map.
- `mlxwinograd.h`: F(4×4, 3×3) Winograd transform with fused activation + residual add.
- `mlxwinotuner.{cpp,h}`: per-shape Winograd tuner with adaptive scoring (rotates the candidate set per shape, scores by median-time delta against a baked-default baseline). Logs the conv-3x3 shape distribution at model load.
- `mlxtests.cpp`: Winograd + tuner numeric-consistency tests, gated under `runnnlayertests`.
Build / wiring
- `cpp/CMakeLists.txt`: `USE_BACKEND=MLX` target; pulls in the Metal/Swift CoreML bridge so the ANE path links cleanly. MLX needs CMake 3.27; `cmake_minimum_required` stays at 3.18.2 so other backends keep building on older CMake. Links Homebrew's prebuilt `libmlx.dylib`; OSX deployment target is intentionally not pinned so the executable's `minos` matches the linked dylib.
- `cpp/main.cpp`, `cpp/program/setup.cpp`, `cpp/command/benchmark.cpp`: wire MLX into backend selection / benchmark.
- `cpp/configs/{gtp,analysis,match,contribute}_example.cfg`: document `mlxUseFP16` (default `auto` → fp16) and the `numNNServerThreadsPerModel` / `mlxDeviceToUseThread<N>` dispatch knobs (GPU-only / ANE-only / mux), with the note that `mlxUseFP16=false` on an ANE thread falls back to CPU FP32.
- `cpp/rungpuerrortest.sh`: backend-agnostic `deviceToUseThread0=100` for the ANE mode, so the same script drives whichever backend the binary was built with.
- `Compiling.md`: build instructions.
How to build
    cd cpp
    cmake -G Ninja -DUSE_BACKEND=MLX
    ninja
Requires CMake ≥ 3.27 and `brew install mlx`.
Validation
Cross-backend validation against an Eigen reference (`testgpuerror`).
- GPU path on b18c384nbt, b40v8, and humanv0: …
- ANE path on b5c192nbt-v16test and b18c384nbt mux 2g2a:
  - … `winrateError` max ≤ ~3%, `topPolicyDelta` < 30%, `policyKLDiv` < 1.0
  - … `winrateError` max ≤ ~1.1%
All within the existing tolerances used by other backends.
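As a rough guide to how two of these metrics can be read, the sketch below computes a winrate error in percentage points and a KL divergence between a reference and a test policy. The definitions are assumptions made for illustration, and `testgpuerror` may compute them differently. `winrateErrorPct` and `policyKL` are hypothetical names, and `topPolicyDelta` is not covered.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed definition: absolute winrate difference in percentage points,
// with both winrates given as fractions in [0, 1].
double winrateErrorPct(double refWinrate, double testWinrate) {
  return 100.0 * std::fabs(testWinrate - refWinrate);
}

// Assumed definition: KL(ref || test) over two policy distributions of equal
// length. Entries with ref[i] == 0 contribute nothing, and test[i] is floored
// at a tiny positive value so the logarithm stays finite.
double policyKL(const std::vector<double>& ref, const std::vector<double>& test) {
  double kl = 0.0;
  for (std::size_t i = 0; i < ref.size(); i++) {
    if (ref[i] <= 0.0) continue;
    const double q = test[i] > 1e-30 ? test[i] : 1e-30;
    kl += ref[i] * std::log(ref[i] / q);
  }
  return kl;
}
```

Under these definitions, identical policies give a divergence of zero, and a test winrate of 0.52 against a reference of 0.50 gives an error of 2 percentage points.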
Status
Draft — opening for early feedback on the backend's structure, the
tuner approach, and the GPU/ANE dispatch wiring before promoting to
ready-for-review.