Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch) #1199

Draft
ChinChangYang wants to merge 2 commits into lightvector:master from ChinChangYang:mlx-backend-squash

Contributor

@ChinChangYang ChinChangYang commented May 23, 2026

This PR adds a new neural-net backend (USE_BACKEND=MLX) targeting Apple
Silicon via Apple's MLX framework,
with two dispatch paths sharing a single backend:

  • GPU — MLX/Metal with an F(4×4, 3×3) Winograd path and an adaptive
    per-shape tuner.
  • ANE — CoreML on CPU + Apple Neural Engine, mirroring the Metal
    backend's gpuIdx = 100 convention. Usable standalone or muxed with
    the GPU path on the same model to overlap forward passes.

It implements the full nninterface.h contract (model load, batched
evaluation, FP16/FP32 paths) and reuses the existing CoreML conversion
pipeline shared with the Metal backend.

What's new

Backend (cpp/neuralnet/)

  • mlxbackend.cpp — backend implementation.
    • GPU path: Variable board sizes via input masking (same
      nnXLen / nnYLen contract as other backends; the global
      COMPILE_MAX_BOARD_LEN bound still applies), FP16/FP32 selected
      by mlxUseFP16 (default auto → fp16). Mish runs FP16-safe; the
      code asserts on ACTIVATION_MISH_SCALE8 so out-of-range variants
      fail loudly rather than truncate silently.
    • ANE path: Selected per server thread via
      mlxDeviceToUseThread<N> = 100 (or the backend-agnostic
      deviceToUseThread<N>); shares the model+converter cache with
      Metal. Feeds the real spatial mask (channel 0 NHWC → buffer) so
      rectangular / sub-NN-frame boards predict correctly, transposes
      NHWC → NCHW into a per-batch staging buffer for the Swift
      MLMultiArray contract, and uses path-correct strides in the
      policy-optimism postprocessor for v12+ models.
    • Mux (GPU + ANE): Serializes ComputeHandle construction with
      a file-static mutex (CoreML converter is not concurrent), and
      eagerly evaluates FP16 weight casts so secondary MLX/GPU server
      threads don't trip MLX 0.31.2's thread-local command-encoder map.
  • mlxwinograd.h — F(4×4, 3×3) Winograd transform with fused
    activation + residual add.
  • mlxwinotuner.{cpp,h} — per-shape Winograd tuner with adaptive
    scoring (rotates the candidate set per shape, scores by median-time
    delta against a baked-default baseline). Logs the conv-3x3 shape
    distribution at model load.
  • mlxtests.cpp — Winograd + tuner numeric-consistency tests, gated
    under runnnlayertests.
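
The ANE bullet above describes two data-movement steps: pulling the spatial mask out of channel 0 of the NHWC input, and transposing NHWC into an NCHW staging buffer. A minimal sketch of both, assuming dense row-major `float` buffers; the helper names `nhwcToNchw` and `extractMask` are hypothetical and not taken from mlxbackend.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a batch of NHWC activations into a freshly
// allocated NCHW buffer (the "staging buffer" in this sketch).
std::vector<float> nhwcToNchw(const std::vector<float>& src,
                              std::size_t n, std::size_t h,
                              std::size_t w, std::size_t c) {
  assert(src.size() == n * h * w * c);
  std::vector<float> dst(src.size());
  for (std::size_t b = 0; b < n; b++)
    for (std::size_t y = 0; y < h; y++)
      for (std::size_t x = 0; x < w; x++)
        for (std::size_t ch = 0; ch < c; ch++) {
          std::size_t from = ((b * h + y) * w + x) * c + ch;  // NHWC index
          std::size_t to = ((b * c + ch) * h + y) * w + x;    // NCHW index
          dst[to] = src[from];
        }
  return dst;
}

// Hypothetical helper: reads channel 0 of every NHWC pixel, which this
// sketch treats as the on-board mask.
std::vector<float> extractMask(const std::vector<float>& src,
                               std::size_t n, std::size_t h,
                               std::size_t w, std::size_t c) {
  assert(c > 0 && src.size() == n * h * w * c);
  std::vector<float> mask(n * h * w);
  for (std::size_t i = 0; i < mask.size(); i++)
    mask[i] = src[i * c];
  return mask;
}
```

The sketch allocates per call for clarity; a real backend would more likely reuse one buffer per batch, as the bullet suggests.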

Build / wiring

  • cpp/CMakeLists.txt: USE_BACKEND=MLX target; pulls in the
    Metal/Swift CoreML bridge so the ANE path links cleanly. MLX needs
    CMake 3.27; cmake_minimum_required stays at 3.18.2 so other
    backends keep building on older CMake. Links Homebrew's prebuilt
    libmlx.dylib; OSX deployment target is intentionally not pinned
    so the executable's minos matches the linked dylib.
  • cpp/main.cpp, cpp/program/setup.cpp, cpp/command/benchmark.cpp
    — wire MLX into backend selection / benchmark.
  • cpp/configs/{gtp,analysis,match,contribute}_example.cfg:
    document mlxUseFP16 (default auto → fp16) and the
    numNNServerThreadsPerModel / mlxDeviceToUseThread<N> dispatch
    knobs (GPU-only / ANE-only / mux), with the note that
    mlxUseFP16=false on an ANE thread falls back to CPU FP32.
  • cpp/rungpuerrortest.sh — backend-agnostic
    deviceToUseThread0=100 for the ANE mode, so the same script
    drives whichever backend the binary was built with.
  • Compiling.md — build instructions.
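
As an illustration of the dispatch knobs listed above, a mux setup with two GPU threads and two ANE threads might look like the fragment below. The key names and the value 100 for ANE come from this description; the GPU device index 0 is an assumption, so check the example configs for the actual value.

```
# Hypothetical mux setup: 2 GPU threads + 2 ANE threads on one model.
numNNServerThreadsPerModel = 4
# Assumed GPU device index.
mlxDeviceToUseThread0 = 0
mlxDeviceToUseThread1 = 0
# 100 selects the ANE / CoreML path.
mlxDeviceToUseThread2 = 100
mlxDeviceToUseThread3 = 100
mlxUseFP16 = auto
```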

How to build

cd cpp
cmake . -G Ninja -DUSE_BACKEND=MLX
ninja

Requires CMake ≥ 3.27 and brew install mlx.

Validation

Cross-backend validation against an Eigen reference (testgpuerror).

GPU path on b18c384nbt, b40v8, and humanv0:

  • FP32: max winrate error 0.00095%
  • FP16: max winrate error 2.63%

ANE path on b5c192nbt-v16test and b18c384nbt mux 2g2a:

  • v16 ANE: winrateError max ≤ ~3%, topPolicyDelta < 30%,
    policyKLDiv < 1.0
  • v11 b18 mux (2 GPU + 2 ANE): winrateError max ≤ ~1.1%

All within the existing tolerances used by other backends.
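
For readers unfamiliar with the metrics, here is a sketch of how a maximum winrate error and a policy KL divergence are conventionally computed. The exact definitions used by testgpuerror are not stated here, so the function names and formulas below are assumptions for illustration only:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute winrate gap between a reference backend and the backend
// under test, in percentage points. Winrates are assumed to lie in [0, 1].
double maxWinrateErrorPercent(const std::vector<double>& ref,
                              const std::vector<double>& test) {
  assert(ref.size() == test.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < ref.size(); i++)
    worst = std::max(worst, std::fabs(ref[i] - test[i]));
  return 100.0 * worst;
}

// KL(ref || test) for one policy distribution. Moves with zero reference
// probability contribute nothing; eps guards against log of zero.
double policyKLDiv(const std::vector<double>& ref,
                   const std::vector<double>& test) {
  assert(ref.size() == test.size());
  const double eps = 1e-30;
  double kl = 0.0;
  for (std::size_t i = 0; i < ref.size(); i++)
    if (ref[i] > 0.0)
      kl += ref[i] * std::log(ref[i] / std::max(test[i], eps));
  return kl;
}
```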

Status

Draft — opening for early feedback on the backend's structure, the
tuner approach, and the GPU/ANE dispatch wiring before promoting to
ready-for-review.

Introduces a new neural-net backend (USE_BACKEND=MLX) targeting Apple
Silicon via Apple's MLX framework. The backend implements the full
nninterface contract (model load, batched evaluation, FP16/FP32 paths)
and ships with a Winograd 3x3 convolution path plus an adaptive
per-shape tuner that picks the fastest implementation for each
conv-3x3 shape at model load.
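
A rough sketch of the selection step, assuming timing samples have already been collected per candidate. It scores each candidate by its median time minus the baseline median and keeps the baseline unless some candidate is strictly faster. The names `medianOf` and `pickCandidate` are hypothetical, and the tuner's per-shape rotation of the candidate set is not modeled:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample set. Taken by value so the sort stays local.
double medianOf(std::vector<double> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  std::size_t mid = samples.size() / 2;
  return samples.size() % 2 ? samples[mid]
                            : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Returns the index of the candidate whose median beats the baseline median
// by the widest margin, or -1 to keep the baked-in default.
int pickCandidate(const std::vector<double>& baseline,
                  const std::vector<std::vector<double>>& candidates) {
  double base = medianOf(baseline);
  int best = -1;
  double bestDelta = 0.0;
  for (std::size_t i = 0; i < candidates.size(); i++) {
    double delta = medianOf(candidates[i]) - base;  // negative means faster
    if (delta < bestDelta) {
      bestDelta = delta;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

Using the median rather than the mean keeps a single slow outlier (for example a cold first run) from disqualifying an otherwise faster candidate.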

Backend
- cpp/neuralnet/mlxbackend.cpp: backend implementation. Supports
  variable board sizes via input masking (same nnXLen/nnYLen
  contract as other backends; the global COMPILE_MAX_BOARD_LEN
  bound still applies). FP16/FP32 selected by the mlxUseFP16 config
  (default auto -> fp16); same input feature layout as the other
  backends. Mish activation runs FP16-safe (asserts on
  ACTIVATION_MISH_SCALE8 so out-of-range variants are caught
  explicitly rather than silently truncated).
- cpp/neuralnet/mlxwinograd.h: F(4x4, 3x3) Winograd transform with
  fused activation + residual add.
- cpp/neuralnet/mlxwinotuner.{cpp,h}: per-shape Winograd tuner with
  adaptive scoring (rotates the candidate set per shape, scores by
  median-time delta against a baked-default baseline). Logs the
  conv-3x3 shape distribution at model load.
- cpp/neuralnet/mlxtests.cpp: unit tests for the Winograd path
  and tuner numeric-consistency, gated under runnnlayertests.
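
For reference, a scalar sketch of the F(4x4, 3x3) Winograd identity Y = A^T [(G g G^T) * (B^T d B)] A on a single 6x6 input tile, checked against a direct 3x3 correlation. It uses the standard Lavin and Gray transform matrices; it is not the code in mlxwinograd.h and omits the fused activation and residual add. All names are hypothetical:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

template <std::size_t R, std::size_t C>
using Mat = std::array<std::array<double, C>, R>;

template <std::size_t R, std::size_t K, std::size_t C>
Mat<R, C> mul(const Mat<R, K>& a, const Mat<K, C>& b) {
  Mat<R, C> out{};
  for (std::size_t i = 0; i < R; i++)
    for (std::size_t j = 0; j < C; j++)
      for (std::size_t k = 0; k < K; k++)
        out[i][j] += a[i][k] * b[k][j];
  return out;
}

template <std::size_t R, std::size_t C>
Mat<C, R> transpose(const Mat<R, C>& a) {
  Mat<C, R> out{};
  for (std::size_t i = 0; i < R; i++)
    for (std::size_t j = 0; j < C; j++)
      out[j][i] = a[i][j];
  return out;
}

// Standard F(4, 3) transform matrices (interpolation points 0, +-1, +-2).
const Mat<6, 6> BT = {{{{4, 0, -5, 0, 1, 0}},
                       {{0, -4, -4, 1, 1, 0}},
                       {{0, 4, -4, -1, 1, 0}},
                       {{0, -2, -1, 2, 1, 0}},
                       {{0, 2, -1, -2, 1, 0}},
                       {{0, 4, 0, -5, 0, 1}}}};
const Mat<6, 3> G = {{{{1.0 / 4, 0, 0}},
                      {{-1.0 / 6, -1.0 / 6, -1.0 / 6}},
                      {{-1.0 / 6, 1.0 / 6, -1.0 / 6}},
                      {{1.0 / 24, 1.0 / 12, 1.0 / 6}},
                      {{1.0 / 24, -1.0 / 12, 1.0 / 6}},
                      {{0, 0, 1}}}};
const Mat<4, 6> AT = {{{{1, 1, 1, 1, 1, 0}},
                       {{0, 1, -1, 2, -2, 0}},
                       {{0, 1, 1, 4, 4, 0}},
                       {{0, 1, -1, 8, -8, 1}}}};

// Winograd path: transform filter and input, multiply elementwise, invert.
Mat<4, 4> winogradTile(const Mat<6, 6>& d, const Mat<3, 3>& g) {
  Mat<6, 6> U = mul(mul(G, g), transpose(G));    // G g G^T
  Mat<6, 6> V = mul(mul(BT, d), transpose(BT));  // B^T d B
  Mat<6, 6> M{};
  for (std::size_t i = 0; i < 6; i++)
    for (std::size_t j = 0; j < 6; j++)
      M[i][j] = U[i][j] * V[i][j];
  return mul(mul(AT, M), transpose(AT));         // A^T M A
}

// Direct 3x3 correlation over the same tile (valid region, 4x4 outputs).
Mat<4, 4> directTile(const Mat<6, 6>& d, const Mat<3, 3>& g) {
  Mat<4, 4> y{};
  for (std::size_t i = 0; i < 4; i++)
    for (std::size_t j = 0; j < 4; j++)
      for (std::size_t a = 0; a < 3; a++)
        for (std::size_t b = 0; b < 3; b++)
          y[i][j] += d[i + a][j + b] * g[a][b];
  return y;
}

double maxAbsDiff(const Mat<4, 4>& p, const Mat<4, 4>& q) {
  double worst = 0.0;
  for (std::size_t i = 0; i < 4; i++)
    for (std::size_t j = 0; j < 4; j++)
      worst = std::fmax(worst, std::fabs(p[i][j] - q[i][j]));
  return worst;
}

// Self-check in double precision: both paths agree on one tile.
void checkTile(const Mat<6, 6>& d, const Mat<3, 3>& g) {
  assert(maxAbsDiff(winogradTile(d, g), directTile(d, g)) < 1e-9);
}
```

In FP16 the transform constants (up to 8 and down to 1/24) amplify rounding error, which is one reason a numeric-consistency test against a direct convolution is useful.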

Build / wiring
- cpp/CMakeLists.txt: USE_BACKEND=MLX target. MLX requires CMake
  3.27 (cmake_minimum_required stays at 3.18.2 so other backends
  keep building on older CMake). Links Homebrew's prebuilt
  libmlx.dylib; OSX deployment target intentionally not pinned so
  the executable's minos matches the dylib it was linked against.
- cpp/main.cpp, cpp/program/setup.cpp, cpp/command/benchmark.cpp:
  wire MLX into backend selection / benchmark.
- cpp/configs/{gtp,analysis,match,contribute}_example.cfg: document
  mlxUseFP16 (auto / true / false), default auto -> fp16.
- Compiling.md: build instructions for the MLX backend.

Validation
- Cross-backend validation against an Eigen reference (testgpuerror)
  for b18c384nbt, b40v8, and humanv0 nets shows FP32 max winrate
  error 0.00095% and FP16 max 2.63%, well within the existing
  backend tolerances.

This is the squash of 130 commits from feature/mlx-backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChinChangYang changed the title from "Add MLX backend for Apple Silicon" to "Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch)" on May 26, 2026