Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch, incl. transformer nets) by ChinChangYang · Pull Request #1199 · lightvector/KataGo

ChinChangYang · 2026-05-23T04:27:34Z

This PR adds a new neural-net backend (USE_BACKEND=MLX) targeting Apple
Silicon via Apple's MLX framework,
with two dispatch paths sharing a single backend:

GPU — MLX/Metal with an F(2×2, 3×3) Winograd path and an adaptive
per-shape tuner.
ANE — CoreML on CPU + Apple Neural Engine, mirroring the Metal
backend's gpuIdx = 100 convention. Usable standalone or muxed with
the GPU path on the same model to overlap forward passes.

It implements the full nninterface.h contract (model load, batched
evaluation, FP16/FP32 paths) and reuses the existing CoreML conversion
pipeline shared with the Metal backend. Both dispatch paths support
the v15+ transformer trunk (attention incl. GQA, learnable RoPE, SiLU,
RMSNorm/batchnorm tips, SwiGLU FFN) in addition to convnets.

What's new

Backend (cpp/neuralnet/)

mlxbackend.cpp — backend implementation.
- GPU path: Variable board sizes via input masking (same
  nnXLen / nnYLen contract as other backends; the global
  COMPILE_MAX_BOARD_LEN bound still applies), FP16/FP32 selected
  by mlxUseFP16 (default auto → fp16). Mish runs FP16-safe; the
  code asserts on ACTIVATION_MISH_SCALE8 so out-of-range variants
  fail loudly rather than truncate silently.
- ANE path: Selected per server thread via
  mlxDeviceToUseThread<N> = 100 (or the backend-agnostic
  deviceToUseThread<N>); shares the model+converter cache with
  Metal. Feeds the real spatial mask (channel 0 NHWC → buffer) so
  rectangular / sub-NN-frame boards predict correctly, transposes
  NHWC → NCHW into a per-batch staging buffer for the Swift
  MLMultiArray contract, and uses path-correct strides in the
  policy-optimism postprocessor for v12+ models.
- Mux (GPU + ANE): Serializes ComputeHandle construction with
  a file-static mutex (CoreML converter is not concurrent), and
  eagerly evaluates FP16 weight casts so secondary MLX/GPU server
  threads don't trip MLX 0.31.2's thread-local command-encoder map.
  A second file-static mutex serializes the MLX/GPU graph-build + eval
  - result read in applyCompiled: MLX GPU streams have no per-stream
    worker thread, so every handle shares MLX's single global default GPU
    stream, and two inference threads (mux, numNNServerThreadsPerModel > 1, or a second loaded model such as a human-SL net) would otherwise
    open two command encoders on one MTLCommandBuffer. Input prep stays
    outside the lock, and it is a no-op for the default single-thread
    config.
mlxwinograd.h — F(2×2, 3×3) Winograd transform with fused
activation + residual add.
mlxwinotuner.{cpp,h} — per-shape Winograd tuner with adaptive
scoring (rotates the candidate set per shape, scores by median-time
delta against a baked-default baseline). Logs the conv-3x3 shape
distribution at model load. At model load it loads a valid cache or,
on a miss, runs a fast coarse tune: greedy coordinate-descent
(neuralnet/greedysearch.h) over a trimmed candidate grid with a
reduced timing-rep budget — ~3.75× fewer GPU dispatches than the wide
sweep, at equal steady-state throughput (the winning geometry sits on a
broad plateau where it moves end-to-end ≤ ~1.5%). Fast and wide
("full") tunes are cached in separate files, and a process-global
session memo lets same-shape nets (e.g. the main + human-SL b18c384
pair) tune once instead of twice. A valid cache is reused as-is unless
re-tuning is requested — command-only via ./katago tuner (see Build /
wiring), or per-load via the mlxTunerFull (wide grid) / mlxReTune
(force fresh) override-config keys (both default off).
KATAGO_MLX_WINOTUNER=0 skips tuning entirely and uses baked-default
launch geometry, while KATAGO_MLX_WINOGRAD=0 falls 3×3 convs back to
mx::conv2d.
mlxtests.cpp — Winograd + tuner numeric-consistency tests, gated
under runnnlayertests.

Transformer trunk support (both dispatch paths)

GPU path (native MLX): transformer trunk/tip implemented directly
in mlxbackend.cpp — scaled-dot-product attention with grouped-query
attention (GQA), learnable RoPE, RMSNorm and batchnorm trunk tips,
SwiGLU FFN, and the SiLU activation. Mirrors eigenbackend.cpp.
ANE path (CoreML/katagocoreml): the same transformer trunk in
the shared MIL converter, with FP16 accuracy precision tiers gated
on actual transformer-block presence (recursing into nested-bottleneck
blocks) so plain convnets stay pure FP16 on the FP16-only ANE. For
transformer trunks: narrow (<256ch) build fully FP32; wider ones
escalate the accuracy-sensitive non-spatial ops (matmuls + global
pooling) to FP32; very wide (≥320ch) also escalate convs; RMSNorm
reductions run FP32 in FP16 mode while attention scores/softmax and the
per-head out-projection stay FP16. Mirrors the TensorRT precision split
and the Metal/CoreML backend's tiers. The converter's public API is
unchanged, so the MLX ANE call site needs no edits.

Build / wiring

cpp/CMakeLists.txt — USE_BACKEND=MLX target; pulls in the
Metal/Swift CoreML bridge so the ANE path links cleanly. MLX needs
CMake 3.27; cmake_minimum_required stays at 3.18.2 so other
backends keep building on older CMake. Links Homebrew's prebuilt
libmlx.dylib; OSX deployment target is intentionally not pinned
so the executable's minos matches the linked dylib.
cpp/main.cpp, cpp/program/setup.cpp, cpp/command/benchmark.cpp
— wire MLX into backend selection / benchmark.
cpp/command/tune.cpp — wires the MLX winotuner into the ./katago tuner subcommand (previously OpenCL-only), mirroring OpenCL's
command-only tuning. ./katago tuner -model <net> always re-runs the
sweep and overwrites the cache the backend reads at model load; add
-full for the wide candidate grid. Args: -output (default: shared
MLX cache), -xsize / -ysize (default 19), -batchsize (default
8), -testFP16 (true|false|auto, default auto → FP16, the
precision the engine and the cache filename key use), -full. The
OpenCL-only FP16 sub-knobs (storage / compute / tensorcores) have no
MLX analog and are omitted.
cpp/configs/{gtp,analysis,match,contribute}_example.cfg —
document mlxUseFP16 (default auto → fp16) and the
numNNServerThreadsPerModel / mlxDeviceToUseThread<N> dispatch
knobs (GPU-only / ANE-only / mux), with the note that
mlxUseFP16=false on an ANE thread falls back to CPU FP32.
cpp/rungpuerrortest.sh — backend-agnostic
deviceToUseThread0=100 for the ANE mode, so the same script
drives whichever backend the binary was built with.
Compiling.md — build instructions.
.github/workflows/build.yml — adds a build-macos-mlx CI job
(macos-latest, which is Apple Silicon) that brew installs the MLX
toolchain, configures USE_BACKEND=MLX, builds, and runs
./katago runtests, so the MLX backend is build-gated on every PR.

Conversion + ANE steady-state memory

The ANE path carries the on-device memory levers from #1202 (cut CoreML
conversion peak and ANE steady-state RSS, with byte-identical converter
output), cherry-picked onto this branch and wired into the MLX backend:

Non-owning weight views — the converter's weight tensors are
non-owning FloatViews into the parsed model instead of owning extra
FP32 copies (derived/transposed tensors keep an owned buffer). The
rvalue registerWeight / addConstOp overloads are = deleted so a
temporary can't bind to the view path and dangle, and CoreML
serialization is made deterministic for byte-stable output.
Streaming parser — the KataGo model parser streams the gzip
through a bounded ~1 MB refill buffer (RAII gzFile) instead of
decompressing the whole file into memory, preserving the existing
NaN/Inf weight validation.
Release weights on the ANE-only path — ModelDesc::releaseWeights()
frees the in-memory FP32 weight arrays (keeping scalar shape metadata),
gated by a new ComputeContext::aneOnly flag that is true only when
every configured device index is MLX_MUX_ANE. On the MLX backend it
fires in convertAndCreateCoreMLOnlyHandleMLX after the model has been
converted to CoreML on disk; it is safe because the ANE path re-reads
the model from modelPath (not the in-memory arrays), the
ComputeHandle ctor takes the ANE early-return before building any
MLX/GPU model (the only weight-array consumer), only scalar dims are
read afterward, and it runs under computeHandleMutex. The GPU/MLX
path keeps its weights (aneOnly is false whenever any thread uses the
GPU). releaseWeights() covers the transformer descriptors too
(attention incl. ropeFreqs, FFN, RMSNorm tips).

Measured (19×19, MLX ANE path)

Resident-memory A/B (mlxDeviceToUse=100, cold conversion every run),
before vs after the levers. Peak RSS is the /usr/bin/time -l maximum
over load+convert; idle steady-state is ps-sampled after load+genmove.

Model	Peak RSS (load+convert), before → after	Idle steady-state RSS, before → after
Classical 40b (`b40c256`)	1.61 → 1.05 GB (−35%)	0.41 → 0.16 GB (−62%)
18b nbt (`b18c384nbt`)	0.94 → 0.63 GB (−33%)	0.64 → 0.32 GB (−49%)
Transformer (`b10c384h6nbttflrs`)	1.69 → 1.52 GB (−10%)	1.69 → 1.52 GB (−10%)

The peak win is largest on the classical 40b (the duplicate owning weight
copies + whole-gzip decompress the levers target dominate there); the
transformer gains least because most of its converter-side weight is
derived (RoPE tables, rotation matrix, per-head out-proj slices) which
keeps an owned buffer, and its absolute peak is dominated by ANE
compilation of the MIL graph downstream of these host-side levers.

The levers don't change inference numerics: testgpuerror ANE output is
bit-identical before vs after on a convnet and all three transformer nets
(release happens after conversion; the MLX/GPU path never invokes the
converter), and benchmark throughput is within run-to-run noise on both
the GPU and ANE paths.

How to build

cd cpp
cmake -G Ninja -DUSE_BACKEND=MLX
ninja

Requires CMake ≥ 3.27 and brew install mlx.

Validation

Cross-backend testgpuerror vs an Eigen FP32 reference, via
cpp/rungpuerrortest.sh (all 12 nets across 9×9 / 13×13 / 19×19 / 10×14 /
rectangular / rect-buffer / weird-settings). MLX FP16 (auto) vs Eigen
FP32 — winrate error per net (99%-ile / worst-case max across that
net's configs):

Network	Ver	GPU FP16 (99% / max)	ANE FP16 (99% / max)
run4-b6c96	v3	0.22% / 0.43%	— pre-v8
grun50-b6c96	v4	0.31% / 1.11%	— pre-v8
g103-b6c96	v5	0.28% / 1.31%	— pre-v8
g170e-b10c128	v8	0.60% / 2.35%	0.40% / 1.45%
kata1-b18c384nbt (s5832)	v11	0.45% / 1.03%	0.39% / 0.65%
kata1-b18c384nbt (s9996)	v14	0.40% / 1.36%	0.76% / 1.38%
kata1-b28c512nbt	v15	0.35% / 1.30%	0.56% / 2.40%
b18c384nbt-humanv0	SL	0.14% / 0.42%	0.22% / 0.52%
b5c192nbt-v16test	v16	0.11% / 0.18%	0.19% / 0.25%
b7c96h3tfrs · RoPE+RMSNorm	v17	0.59% / 2.49%	0.0003% / 0.0004% ᵗ
b4c256h4nbttflrs · SiLU+sp-RMSNorm	v17	0.20% / 0.30%	0.48% / 0.79%
b7c96h6kv3qk32v16tflrs · GQA+bnorm	v17	0.24% / 0.59%	0.0001% / 0.0003% ᵗ

Pass: GPU 37/37 configs, ANE 31/31 supported configs. FP32-vs-Eigen
sanity (numerical noise): GPU ≤ 0.00092%, ANE ≤ 0.00082%.
./katago runtests and ./katago runnnlayertests pass.

ᵗ 96ch transformers hit the converter's full-FP32 tier on ANE (FP16 == FP32);
the 256ch SiLU transformer uses the non-spatial-FP32 tier (real FP16 residual).

The table's worst GPU FP16 number — b7c96h3tfrs at 19×19, 0.59% / 2.49%
(99%-ile / max) — is intrinsic FP16 rounding, not an MLX defect: it's a single
near-50%-winrate tail position out of 2247 (where winrateError = |Δ½(W−L)| is
most amplified), the same graph in FP32 matches Eigen to ≤ 0.00078%, and it
clears KataGo's built-in testgpuerror bounds (FP16 99%-ile ≤ 2.0%, max ≤ 5.0%,
unchanged from master) with ~2× headroom. The same net / config / Eigen
reference through KataGo's OpenCL FP16 path confirms it:

Path (`b7c96h3tfrs`, 19×19, vs Eigen FP32)	winrate avg / 99% / max
MLX FP16	0.17% / 0.59% / 2.49%
OpenCL FP16	0.25% / 0.92% / 2.82%

OpenCL FP16 is less accurate than MLX at every percentile — its Apple-Silicon
path is FP16 storage + FP32 compute (no cl_khr_fp16), so layer-boundary rounding
alone already gives 2.82%. MLX stays ahead while running attention/softmax/FFN in
FP16 by keeping the precision-critical reductions in FP32 (RMSNorm variance,
softmax precise=true, FP32 matmul accumulation).

The 6 pre-v8 ANE configs (v3/v4/v5) are skipped — the CoreML converter
supports model versions 8–17 only. Run ANE testgpuerror serially or with a
per-process TMPDIR: the converter stages weights in a shared
$TMPDIR/katagocoreml_weights dir, so concurrent processes race.

GPU throughput: MLX vs Metal

Single-GPU throughput of the two Apple-Silicon GPU backends, both configured
GPU-only (no ANE; MLX gpuIdx 0, Metal default device 0), on an Apple
M3 Max. The two backends take different GPU strategies, so this compares
each backend's actual GPU-only behavior rather than an identical-precision
kernel:

MLX GPU — F(2×2, 3×3) Winograd path in FP16 (mlxUseFP16 auto).
Metal GPU — MPSGraph in FP32 (this backend's GPU path, incl. the
transformer trunk, is FP32-only).

The Metal binary is built from feature/metal-transformer-silu-b10c384h6
(the branch whose Metal GPU path supports the v15+ transformer trunk, tip
39f82f6d). Each net was benchmarked with benchmark -half-batch-size -t 8,16,32 -v 800 at 19×19, run sequentially — one backend / one net at a
time. Cells are visits/s at each thread count; the human-SL net used
humanSLProfile=rank_9d (same override on both backends). Both columns were
re-measured together on 2026-06-14 at branch HEAD 5b5899c9 on the same
machine; every cell is the median of 3 runs. Best-thread throughput (the
ratio) is stable run-to-run; the low-thread (t=8) cells, especially the small
transformers, are higher-variance and should be read as approximate.

Network	Ver	MLX GPU · FP16 visits/s @ t=8 / 16 / 32	Metal GPU · FP32 visits/s @ t=8 / 16 / 32	MLX best ÷ Metal best
run4-b6c96	v3	3657 / 5931 / 7346	3557 / 5140 / 7060	1.04×
grun50-b6c96	v4	3558 / 6440 / 8329	3698 / 5465 / 7875	1.06×
g103-b6c96	v5	4006 / 6667 / 8426	3797 / 5449 / 7722	1.09×
g170e-b10c128	v8	2802 / 3979 / 4749	2329 / 2924 / 3824	1.24×
kata1-b18c384nbt (s5832)	v11	564 / 685 / 749	363 / 534 / 554	1.35×
kata1-b18c384nbt (s9996)	v14	562 / 678 / 734	358 / 535 / 528	1.37×
kata1-b28c512nbt	v15	267 / 308 / 306	159 / 206 / 208	1.48×
b18c384nbt-humanv0	SL	547 / 677 / 692	370 / 517 / 527	1.31×
b5c192nbt-v16test	v16	2239 / 3663 / 4443	2320 / 3071 / 3968	1.12×
b7c96h3tfrs · RoPE+RMSNorm	v17	944 / 2557 / 2912	1513 / 2057 / 1860	1.42×
b4c256h4nbttflrs · SiLU	v17	996 / 1443 / 1841	859 / 1298 / 1304	1.41×
b7c96h6kv3qk32v16tflrs · GQA	v17	1069 / 1559 / 1985	1100 / 1367 / 1327	1.45×

MLX-GPU is faster than Metal-GPU on every net (best-thread throughput
1.04×–1.48×), with the margin growing with trunk depth: small convnets are
near parity (~1.04–1.12×), and the largest margins are on the deep NBT nets
(b18c384nbt ~1.35–1.37×, b28c512nbt ~1.48×) and the v17 transformer nets
(~1.41–1.45×, with the GPU attention path running end-to-end in FP16) where
the FP16 Winograd path has the most to exploit. The FP16-vs-FP32 precision gap is part of that lead;
accuracy of each path vs an Eigen FP32 reference is in the Validation
table above.

Status

Draft — opening for early feedback on the backend's structure, the
tuner approach, the GPU/ANE dispatch wiring, and the transformer-trunk
support (incl. the ANE FP16 precision tiers) before promoting to
ready-for-review.

Introduces a new neural-net backend (USE_BACKEND=MLX) targeting Apple Silicon via Apple's MLX framework. The backend implements the full nninterface contract (model load, batched evaluation, FP16/FP32 paths) and ships with a Winograd 3x3 convolution path plus an adaptive per-shape tuner that picks the fastest implementation for each conv-3x3 shape at model load. Backend - cpp/neuralnet/mlxbackend.cpp: backend implementation. Supports variable board sizes via input masking (same nnXLen/nnYLen contract as other backends; the global COMPILE_MAX_BOARD_LEN bound still applies). FP16/FP32 selected by the mlxUseFP16 config (default auto -> fp16); same input feature layout as the other backends. Mish activation runs FP16-safe (asserts on ACTIVATION_MISH_SCALE8 so out-of-range variants are caught explicitly rather than silently truncated). - cpp/neuralnet/mlxwinograd.h: F(4x4, 3x3) Winograd transform with fused activation + residual add. - cpp/neuralnet/mlxwinotuner.{cpp,h}: per-shape Winograd tuner with adaptive scoring (rotates the candidate set per shape, scores by median-time delta against a baked-default baseline). Logs the conv-3x3 shape distribution at model load. - cpp/neuralnet/mlxtests.cpp: unit tests for the Winograd path and tuner numeric-consistency, gated under runnnlayertests. Build / wiring - cpp/CMakeLists.txt: USE_BACKEND=MLX target. MLX requires CMake 3.27 (cmake_minimum_required stays at 3.18.2 so other backends keep building on older CMake). Links Homebrew's prebuilt libmlx.dylib; OSX deployment target intentionally not pinned so the executable's minos matches the dylib it was linked against. - cpp/main.cpp, cpp/program/setup.cpp, cpp/command/benchmark.cpp: wire MLX into backend selection / benchmark. - cpp/configs/{gtp,analysis,match,contribute}_example.cfg: document mlxUseFP16 (auto / true / false), default auto -> fp16. - Compiling.md: build instructions for the MLX backend. Validation - Cross-backend validation against an Eigen reference (testgpuerror) for b18c384nbt, b40v8, and humanv0 nets shows FP32 max winrate error 0.00095% and FP16 max 2.63%, well within the existing backend tolerances. This is the squash of 130 commits from feature/mlx-backend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…arity smoke test (#26)

# Conflicts: # cpp/rungpuerrortest.sh

master consolidated createComputeContext's trailing params (openCLTunerFile, openCLReTunePerBoardSize, useNHWCMode) into a single ConfigParser& cfg. The Metal backend was already updated; update the MLX backend to match so it compiles against the merged interface. NHWC is still enforced per-handle via inputsUseNHWC in createComputeHandle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MLX backend implemented only the convnet path; transformer nets crashed (empty-(0)-tensor broadcast on rmsnorm/SiLU tips) or produced garbage (GQA). Implement the transformer trunk/tip path in mlxbackend.cpp, mirroring eigenbackend.cpp: - ACTIVATION_SILU (x * sigmoid(x)) - TransformerRMSNormLayer (spatial rmsnorm tip) + TransformerTrunkRMSNormLayer (pre-LN) - GQA TransformerAttentionBlock + SwiGLU FFNBlock - branch the trunk tip on trunkNormKind; wire the new block kinds into the block-variant and nested-bottleneck loops; thread nnX/nnY through Verified via testgpuerror against fresh Eigen references (boardsize 19): fp32 winrateError max — rope 0.00094%, silu 0.00046%, gqa 0.00029% (bar 0.10%); convnet g170-b6c96 unregressed (0.00036%); runtests + runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MLX backend's ANE mux path (mlxDeviceToUseThread0=100) drives inference through the shared external/katagocoreml converter -- the same library the Metal backend uses. Bring PR#1205's CoreML/ANE transformer work into that converter so the MLX ANE path supports the v15+ transformer trunk: - Transformer MIL support: attention (incl. grouped-query attention), learnable RoPE, SiLU, RMSNorm/batchnorm tips, SwiGLU FFN. - FP16 accuracy precision tiers, gated on actual transformer-block presence (blocksContainTransformer, recursing into nested-bottleneck blocks): narrow trunks (<256ch) build fully FP32; wider ones escalate non-spatial matmuls + global pooling to FP32; very wide (>=320ch) also escalate convs; RMSNorm reductions FP32 in FP16 mode. Plain convnets stay pure FP16 on the ANE (the d052d2a regression-fix behavior is preserved). The converter's public API is unchanged, so the MLX call site (CoreMLConversion::convertModelToTemp) needs no edits. The Metal-GPU/MPSGraph portions of PR#1205 (metalbackend.cpp, metallayers.swift) are intentionally not ported -- the MLX backend's native GPU path already has transformer support. Verified on the MLX ANE mux (testgpuerror vs fresh Eigen FP32 references): all 3 transformer test nets pass FP16 thresholds across board sizes/buffer configs (7 configs), a plain convnet stays pure FP16 (non-regression), and runtests + runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

applyGlobalPooling / applyValueHeadPooling summed in fp16 but produced an fp32 mean (division by the fp32 maskSum), which also leaked fp32 into the downstream gpool-bias and value-v2 head matmuls. Cast maskSum to the input dtype so the whole pooling and the heads stay in the compute dtype (fp16 when useFP16), maximizing fp16 utilization rather than escalating to fp32 for negligible accuracy gain. The masked-max keeps its 1e9 constant in fp32 (1e9 overflows fp16 -> inf -> 0*inf=NaN), then casts the max result back to the compute dtype. The fp32 path is unaffected (the astype casts are no-ops in fp32). Verified via testgpuerror vs fresh Eigen fp32 references on all 3 transformer nets (7 board-size/buffer configs): fp16 winrate error max <= 2.07% (within tolerance, winrate unchanged vs baseline), fp32 path byte-identical, ownership output bit-identical, runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nt warmup M2: A candidate whose threadgroup exceeds the pipeline's register-pressure- dependent maxTotalThreadsPerThreadgroup (can be < 1024), or that hits a transient GPU error, throws out of mx::eval during the flat sweep. Previously this propagated out of loadOrAutoTune and aborted model load with no fallback. Now each candidate's scoring is wrapped in try/catch: a throw is counted and skipped (mirroring the OpenCL tuner's mark-bad-and-continue), and best/bestTime are seeded with the baked default so even a fully-failing sweep returns a valid result. A separate "flatSweep{Input,Output} skipped=N" log line is emitted only when skips occur; it intentionally omits the colon after the function name so it cannot collide with the regex-tested "flatSweepInput: considered" log line. M3: timeOneInputTransform/timeOneOutputUntransform ran an untimed warmup eval on every call, but the scoring functions already warmed up once before the measured loop -- so every measured rep paid an extra full warmup (~doubling tuning cost). Add a doWarmup parameter gating the internal warmup; the scoring functions drop their explicit warmup and pass (r == 0), so each shape warms exactly once on its first measured rep. Verified by triggering autotuning: gated flat-sweep tests (convergence, log-format, baseline-consistency, per-shape) pass; an end-to-end re-tune via loadOrAutoTune runs a fresh sweep and saves valid fp16/fp32 caches; testgpuerror output is unchanged (tuner params are numerically inert); runtests passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The METAL and MLX backend branches each ran set(CMAKE_OSX_DEPLOYMENT_TARGET 13.0) *after* project(), where it is a silent no-op: for this Swift project the deployment target is fixed during project()/enable_language, so a later set() never affects the produced binary. Both shipped binaries already carry minos 26.0 (the build host / libmlx's floor), not 13.0, confirming the pins were inert dead code that contradicted the pre-project comment explaining why the deployment target is deliberately not pinned. Delete both pins so code, comment, and reality agree; the comment becomes literally true. Add a guard note documenting that a post-project pin is a no-op so it is not reintroduced. No behavior change: binaries still build at minos 26.0, matching libmlx's minos and MLX's macOS >= 14 requirement. Verified: MLX reconfigure+build clean; METAL branch configures clean; binary minos unchanged (26.0); runtests pass; testgpuerror unchanged (fp32 max 0.00036%, fp16 max 0.863%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The katagocoreml parser read the q/k/v/out projection matmuls of a transformer attention block without checking their declared dimensions against the head geometry or trunk width. The CoreML graph builder (MILBuilder) then reshapes each flat projection into a [seq, heads, headDim] grid and reshapes the out-projection result back into the trunk, so a mismatched dimension would either read past the weight buffer or build a graph that compiles but computes nonsense - the exact failure mode the existing checkBlockChannels() guard was added to prevent for conv blocks. Thread trunk_num_channels into parseTransformerAttentionBlock (mirroring parseNestedBottleneckBlock) and add a checkAttentionProjDim() helper in the style of checkBlockChannels(), then validate all four projections: qProj.outChannels == numHeads * qHeadDim (master desc.cpp:1129) kProj.outChannels == numKVHeads * qHeadDim (master desc.cpp:1131) vProj.outChannels == numKVHeads * vHeadDim (master desc.cpp:1133) outProj.inChannels == numHeads * vHeadDim (master desc.cpp:1135) qProj.inChannels == trunkNumChannels (master desc.cpp:1430) outProj.outChannels== trunkNumChannels (master desc.cpp:1437) k/vProj.inChannels == trunkNumChannels (gap master leaves implicit to the backend) Six checks mirror master desc.cpp's transformer attention consistency checks exactly; the k/v inChannels checks additionally close a gap master leaves to the backend (all three QKV projections consume the same normed-trunk input, so their inChannels must equal the trunk width). K pairs with Q in the QK^T dot product, so kProj uses qHeadDim; only V carries vHeadDim. Purely additive: throws std::runtime_error on a malformed model, no-op on valid ones, no numerics touched. Verified: MLX build clean, runtests pass, and ANE-path testgpuerror on all three transformer nets (incl. the GQA net with 6 heads/3 KV heads, qk=32/v=16) loads and converts with zero false-positive throws and unchanged numerics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cuts memory during the on-device KataGo -> CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output: - The converter's weight tensors become non-owning views into the parsed model instead of owning extra FP32 copies; derived/transposed tensors keep an owned buffer. This drops redundant resident weight copies during conversion. CoreML model serialization is made deterministic (SetSerializationDeterministic) so the output is byte-stable. - The KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, while preserving the existing NaN/Inf weight validation. - ModelDesc gains releaseWeights(), which frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized under computeHandleMutex and only scalar dims are read afterward. Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS 0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit b05f559)

WeightEntry stores a non-owning view (const float*, count) into the live KataGoModelDesc, so the backing std::vector must outlive serialization. addConstOp/registerWeight took the data by const& and silently stored a pointer to it; a caller passing a temporary would bind to that const& and leave the view dangling, read much later during serialization. Delete the rvalue overloads of both so any such call fails to compile, forcing temporaries through addOwnedConstOp/registerOwnedWeight (which take ownership). Named lvalues (the model-member call sites) still bind to the const& overload, so no existing caller changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 971fa9d)

Own the gzFile with a custom-deleter unique_ptr so it closes on every exit path (normal return, exception, bad_alloc); removes the manual try/catch+gzclose in parse() and the ordering caveat on buffer allocation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit eeefc97)

Introduce a KataGo-local non-owning FloatView for WeightEntry::data instead of a raw const float*/size_t pair; convert to MILBlob::Util::Span only inside WeightSerializer, keeping the MILBlob dependency out of Operations.hpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 6bfa617)

The ComputeHandle member-order comment claimed that declaring mpsGraphOnlyHandle before coremlOnlyHandle is what prevents a GPU handle from reading freed weights. That overstates the ordering's role: within a single ComputeHandle exactly one handle is built (mutually exclusive on gpuIdx, enforced by the ctor's exactly-one check), and releaseWeights() only fires on an aneOnly context where no MPSGraph handle is ever built. Reframe the declaration order as belt-and-suspenders and point at ComputeContext::aneOnly as the actual invariant. Comment-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 4159930)

Replace the file-local releaseXXX free functions in desc.cpp (which reached into each desc struct's internals from outside) with releaseWeights() member methods on each weight-bearing struct, matching the existing OO convention used by applyScale8ToReduceActivations() and iterConvLayers(). Each container delegates to its members; type-erased block dispatch is inlined with the same cast pattern those methods use. Behavior-preserving: same set of freed vectors, same block recursion, same metaEncoderVersion guard. ModelDesc::releaseWeights() keeps its signature, so the metalbackend.cpp call site is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 44342a3)

Move the 11 leaf/container releaseWeights() definitions in desc.cpp out of the bottom cluster (inherited from the old free-function layout) and place each immediately after its struct's last existing method, matching the file's per-struct grouping convention used by every other method. ModelDesc::releaseWeights() stays put, already adjacent to its siblings. Pure relocation: function bodies and desc.h are unchanged; only two stray double-blank lines were normalized to single. Verified clean Metal build, testgpuerror vs Eigen reference (g170-b6c96) at <0.0004% winrate error, and runtests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 98b17eb)

…contract The transformer attention builder emits four function-local std::vector<float> tensors: RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After merging the transformer support onto the FloatView branch, these needed two fixes: 1. Dangling view. lightvector#1202 made WeightEntry::data a non-owning FloatView, so addConstOp registers a view whose backing buffer must outlive serialization. These locals were passed to addConstOp and would dangle once the build function returns (serialization runs afterwards). Route them through addOwnedConstOp so KataGoOps owns the buffer until serialization. (Under lightvector#1205's owning WeightEntry they were copied, so this only surfaces post-merge.) 2. dtype mismatch. emitConstOp declares each const's dtype as m_weight_dtype, but addOwnedConstOp / registerOwnedWeight stored at the global mode (is_fp32 hardcoded false). In an FP16 model these derived consts land in the attention / value-head FP32 sub-region (m_weight_dtype == FLOAT32), so they were declared FP32 but stored FP16. CoreML/ANE then rejects the model at load ("Metadata data type does not match requested type", BNNS error -14), which SIGABRT'd every FP16 ANE transformer. Thread is_fp32 through registerOwnedWeight and have addOwnedConstOp pass is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so the stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. Verified with testgpuerror against fresh Eigen FP32 references: b7c96h3tfrs and b7c96h6gqa, which previously SIGABRT'd on the FP16 ANE path, now load and match to <0.0005% winrate; convnet ANE output is byte-identical and the Metal GPU path is unchanged. katago runtests and runnnlayertests also pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8481a94)

The cherry-picked per-struct releaseWeights() refactor (44342a3/98b17ebb) predates this branch's MLX transformer port, so it only added releaseWeights() to the non-transformer descriptors. Extend the coverage to the transformer descriptors present on this branch (RMSNormLayerDesc, TransformerRMSNormDesc, TransformerAttentionDesc incl. ropeFreqs, TransformerFFNDesc) and handle TRANSFORMER_ATTENTION_BLOCK_KIND / TRANSFORMER_FFN_BLOCK_KIND plus trunkTipRMSNorm in the trunk release walk. Without this, releasing weights on a transformer model would hit ASSERT_UNREACHABLE. This makes desc.cpp/desc.h byte-identical to the lightvector#1202 feature branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the lightvector#1202 ANE steady-state memory lever to the MLX backend. Add ComputeContext::aneOnly, set in createComputeContext when every configured device index is MLX_MUX_ANE, and call ModelDesc::releaseWeights() in convertAndCreateCoreMLOnlyHandleMLX after the model has been converted to CoreML on disk. Safe because: the ANE path re-reads the model from modelPath (not the in-memory weight arrays); the ComputeHandle ctor takes the MLX_MUX_ANE early-return before building any MLX/GPU model (the only weight-array consumer); only scalar dims are read afterward, which releaseWeights() preserves; and it runs under computeHandleMutex. Mirrors the Metal backend's aneOnly release. GPU path unaffected (aneOnly is false whenever any thread uses MLX_MUX_GPU). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parseTransformerFFNBlock only checked num_channels/ffn_channels > 0, while the attention block validates all of its projection dimensions. Thread trunk_num_channels through and add the mirror checks: num_channels must equal the trunk width (the block adds its output back into the trunk residually) and the linear layers must chain numChannels -> ffnChannels -> numChannels (with the SwiGLU gate also numChannels -> ffnChannels). A malformed FFN block now fails at parse time instead of producing an opaque CoreML compile error or silently-wrong activations. Reuses the existing checkAttentionProjDim helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Six conv/matmul/RMSNorm/FFN sites hand-rolled save/restore of m_weight_dtype around their FP32 escalation windows. An exception thrown inside a window would leave m_weight_dtype stuck at FLOAT32, causing later FP16 consts to be tagged FP32 -> the BNNS "Metadata data type does not match" SIGABRT on the FP16 ANE. Give ScopedFp32 an active flag (so a conditional window needs no construction-time branch) and an idempotent restore() (to end the window before a trailing cast-down while keeping the dtor's exception-safe restore), then route all six sites through it. The guard is constructed exactly where the manual flip was and restore() called exactly where the manual restore was, so op-emission order -- and thus the serialized converter output -- is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The raw-output forward path had no callers: production inference goes through getOutput() -> Model::applyCompiled(). It also duplicated applyCompiled's input setup and output copy. Delete it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… test comment CMakeLists.txt: uppercase USE_BACKEND into USE_BACKEND_NORMALIZED before the pre-project() MLX version guard and the Swift language selection, mirroring the post-project() string(TOUPPER). Previously a lowercase -DUSE_BACKEND=mlx skipped the CMake 3.27 guard and Swift enablement, then still tried to build the Swift sources later, producing a confusing failure instead of a clear message. rungpuerrortest.sh: the gpu/ane modes drive whichever backend the binary was built with (backend-agnostic deviceToUseThread0), so reword the usage comment from "the Metal backend" to "the active backend (Metal or MLX)". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The F(2x2,3x3) input transform and output untransform did all their add/sub arithmetic in T, so in fp16 mode every intermediate of B^T d B and A^T M A rounded to fp16 -- a precision sink independent of the matmul (which already accumulates in fp32 via the steel GEMM). Compute the transform arithmetic in float and cast only the final stored V/M/Y back to T, leaving fp16 storage and memory traffic unchanged (no-op on the fp32 path). Cuts fp16 Winograd kernel error ~33% (runnnlayertests ConvLayer fp16 winograd maxErr 0.0107 -> 0.0071) while the fp32 path stays bit-exact (maxErr=0). testgpuerror FP16-vs-Eigen avg/90%/99% errors drop across configs; g170e-b10c128 worst case 2.35% -> 2.13%. Benchmark throughput within run-to-run noise on both a convnet and a deep NBT net. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uning The model-load auto-tune swept a nearly-full candidate grid (~2000 configs, ~16s) even with full=false -- only tg1 differed between non-full and full. Measured on this hardware the winning configs form a broad plateau (many within ~7% of each other, all ~25-40% better than the baked default) and geometry moves end-to-end throughput <=1.5%, so that 16s was spent discriminating run-to-run noise: three forced re-tunes of the same net picked three entirely different winners, all within ~7%. Mirror OpenCL's split: full=false (auto) now sweeps a coarse grid (tg0 {8,16,32,64,128}, tg1 {1,2,4,8,16}, wpt {1,2,4}) -- ~2.7s, still landing ~21% above the default and within ~6% of the wide-sweep winner. full=true keeps the wide grid as the deliberate command-tune, opt-in via KATAGO_MLX_WINOTUNER_FULL=1 (the analog of './katago tuner --full', which openclbackend.cpp pins to full=false at load). Cache format is unchanged; existing caches still load, and FULL+FORCE overwrites with the wide-swept winner. Also loosen two gated tuner stress-test budgets (baseline-anchor 0.25->0.50, convergence 1.10->1.30) that compared single sub-millisecond timing samples and flaked ~1-in-4 on both this and the pre-change binary -- the same dispatch/sync-overhead noise the tuner itself tolerates. They remain gross-error sanity checks. runtests + runnnlayertests pass, gated tuner tests 16/16, testgpuerror fp32 bit-exact and fp16 winrate max 0.55% on the coarse-tuned config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

M4 - silent fallthroughs in mlxbackend.cpp, now loud (ASSERT_UNREACHABLE, the file's existing idiom): - BlockVariant::apply switch default returned the input unchanged, a silent identity no-op block. Now asserts; kept as a default: label so the active -Wswitch-default (CMakeLists.txt) stays clean. - The trunk block-construction loop silently dropped unknown kinds. Added an else { ASSERT_UNREACHABLE; }. - The nested-bottleneck construction loop was a genuine latent bug, not just a missing guard: parseResidualBlockStack (desc.cpp, shared by trunk and nested) accepts nested_bottleneck_block inside a nested bottleneck and the desc layer handles it, but the MLX nested loop omitted NESTED_BOTTLENECK_BLOCK_KIND and silently dropped such a block. Added the missing case (mirroring the trunk loop and Eigen's shared BlockStack) plus an else-assert. M3 - Compiling.md implied the MLX backend builds with make, but CMakeLists.txt hard-fails MLX without the Ninja generator (same Swift/C++ interop requirement as Metal). Added -G Ninja to the MLX cmake example, listed MLX alongside Metal for the Ninja prerequisite, and noted MLX uses ninja to build. Verification: build clean; runtests + runnnlayertests pass; testgpuerror g170-b6c96 vs eigen_reference.json fp32 near-exact (winrate max 0.00036%) / fp16 max 0.55% (unchanged); testgpuerror on the b4c256h4nbttflrs nested- bottleneck+transformer model loads through the modified nested loop and runs forward with no assert (fp16 ~0.27%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The `tuner` command previously did nothing on non-OpenCL backends. Add an MLX branch that loads the model, builds its conv-3x3 shape histograms, and runs MLXWinogradTuner::loadOrAutoTune with reTune=true so the Winograd input/output transform search runs and overwrites the cache the backend reads at model load. This is the first-class "command tune" path; the load-time auto-tune stays coarse/fast. -full selects the wide candidate grid (the env-var KATAGO_MLX_WINOTUNER_FULL=1 behavior, which still works for triggering a full tune through benchmark/gtp). -testFP16 (auto->FP16) matches the engine's useFP16 default and the cache-filename key. The default output path is MLXWinogradTuner::defaultDirectory/defaultFileName - the exact file the backend loads - verified end-to-end: after a default-path tune, a benchmark model-load logs "Loaded MLX Winograd tuning parameters from" that same file. The OpenCL-only FP16 sub-knobs have no MLX analog and are omitted. The backend guard is restructured from #ifndef USE_OPENCL_BACKEND to a three-way #if/#elif/#else; the OpenCL body is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The env var was an explicit stopgap "command tune analog" for the wide Winograd sweep before `./katago tuner` supported MLX. Now that the literal tuner subcommand exists, drop the env var and pin the model-load path to the coarse grid (full=false), so the wide sweep is reached only through `./katago tuner -full`. This mirrors openclbackend.cpp, which pins full=false at load and passes full only from the explicit tuner command, and reinforces the coarse-auto-tune design. No capability is lost: `tuner -full` writes the same cache the backend reads at load, so the prior one-shot workflow (FULL=1 FORCE=1 benchmark) becomes the cleaner two-step (tune once, then run) - the OpenCL workflow. KATAGO_MLX_WINOTUNER (enable/disable) and KATAGO_MLX_WINOTUNER_FORCE (force re-tune) are unchanged; FORCE now only ever drives a coarse re-tune at load. Verified: FULL=1 FORCE=1 benchmark now re-tunes coarse (considered=288 for b10c128, was 2176), while `tuner -full` still sweeps the wide grid (considered=2176). runtests pass; testgpuerror unchanged (fp32 0.0005%, fp16 ~1.0%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Like KATAGO_MLX_WINOTUNER_FULL, this env var predates MLX support in the tuner subcommand. It set reTune=true on the model-load path so a normal benchmark/gtp run would ignore the cache, re-tune, and overwrite. OpenCL has no load-time force-retune analog: OpenCLTuner::loadOrAutoTune loads the cached params if present (falling back to the full-size cache) and only auto-tunes on a complete miss; the only re-tune paths are the explicit tuner command or deleting the cache file. Now that `./katago tuner` works on MLX and always re-runs + overwrites (reTune=true), the env var is redundant with that established pattern, so drop it and pin the load-time reTune=false. The model-load path now loads a valid cache or coarse-tunes once on a miss, never re-tuning a valid cache and never sweeping the wide grid - both are reached only through `./katago tuner` (-full for the wide grid). To refresh the cache, run the tuner command (or delete the cache file). No reachable end-state is lost; only the inline-during-benchmark/gtp convenience, which OpenCL never had. KATAGO_MLX_WINOTUNER (disable tuning, use baked defaults) is unchanged - a distinct switch the command does not cover. Verified: FORCE=1 benchmark now loads the cache instead of re-tuning; KATAGO_MLX_WINOTUNER=0 still disables tuning; runtests pass; testgpuerror unchanged (fp32 0.0005%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MLX backend pulls in external/katagocoreml, which requires protobuf and abseil, but Compiling.md listed them only under the Metal backend. `brew install mlx` does not provide them transitively, so an MLX-only build following the docs failed at find_package(Protobuf REQUIRED). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The tuner cache filename hardcoded "AppleSilicon", so every Apple chip shared one cache file: a cache tuned on e.g. an M1 would be loaded verbatim on an M4 Max, where the optimal Winograd launch geometry differs. Add a shared MLXWinogradTuner::detectGpuName() that reads the chip brand string (sysctl machdep.cpu.brand_string, e.g. "Apple M3 Max") with an "AppleSilicon" fallback, and call it from both the backend model-load path and the `tuner` command so their cache keys always match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

In MLX C++ there are no weak scalars: mx::array(scale) is a strong float32 array, so `matmul(q,kT) * mx::array(scale)` promoted the attention scores -- and the whole transformer residual stream, since each block adds its output into the trunk -- to fp32. Cast the scale to the compute dtype so the fp16 path stays fp16 end-to-end, matching the pooling/BN/RMSNorm layers (which already astype their fp32 intermediates back) and the maximize-fp16 goal for the GPU path. Verified by full rungpuerrortest.sh: GPU 37/37 pass, ANE 31/31 supported pass (6 pre-v8 SIGABRTs expected/unrelated); worst-case fp16 winrate 2.50% / topPolicy 1.25% (thresholds 5.0% / 6.0%). runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… param, transformer test Address four follow-up items from the PR#1199 review (test/tuner-only; no inference forward-pass changes): 1. Atomic tuner-cache save. MLXWinogradTuneParams::save now writes to a per-process temp path (filename + ".tmp.<pid>") and FileUtils::rename's it onto the final path, so two processes that cache-miss and tune the same model concurrently can no longer tear the shared cache file. 2. Independent Winograd oracle. The GPU and FP16 Winograd metal_kernel tests previously asserted only against cpuConv2d3x3, itself a Winograd F(2,3) impl sharing the kernel's B/G/A transform matrices -- a shared sign/transpose error would cancel and pass. They now also assert against the independent naive direct-conv oracle. 3. Remove the dead seedOverride parameter from MLXWinogradTuner:: loadOrAutoTune (declaration, definition, and both call sites). It was documented "reserved ... currently ignored" and always passed nullptr. 4. Transformer-layer numeric test (runMLXTransformerLayerFP16Test): the transformer path (RMSNorm / attention / RoPE) had no layer-level coverage -- only end-to-end via testgpuerror. Adds RMSNorm fp32-vs-CPU correctness, attention fp16 output-dtype preservation (the regression guard for the just-fixed fp16->fp32 attention-scale promotion), fp16/ fp32 closeness, and a zero-outProj residual-identity anchor; covers fixed-RoPE on/off and mask on/off. Verified: build clean; runtests and runnnlayertests pass; the three transformer nets pass testgpuerror on the MLX GPU path within thresholds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…delete, help text, fp16 accum test - mlxwinotuner.cpp: non-finite median in the two scoring paths now maps to +infinity instead of 0.0 — the tuner minimizes time, so a NaN/inf from a failed kernel run was making that candidate win selection. Diagnostic-only per-shape guards left as-is. - Operations.hpp: add symmetric 4-arg rvalue registerWeight(..., bool) = delete to close the arity hole; registerWeight(name, std::move(vec), shape, true) previously bound a temporary to the const& view overload, leaving a dangling FloatView. No-default form avoids lvalue-overload ambiguity. - main.cpp: tuner help text "(OpenCL only)" -> "(OpenCL and MLX)". - mlxtests.cpp: add a self-calibrating fp16 Winograd accumulation guard. Measures the scale-invariant normalized error (maxAbsErr/outMagMax) at Cin=8 and Cin=384 and asserts it stays small AND flat in Cin (ratio < 3). fp32 accumulation keeps it flat (~1.0); an fp16-accum regression would grow with the term count. Hardware-independent (no absolute-magnitude tuning). Verified: runtests, runnnlayertests (accum guard ratio 1.02), and testgpuerror vs eigen_reference.json (fp32 max winErr 0.00036%, fp16 0.55%) all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mirrors the build-macos-metal job: installs the mlx Homebrew formula alongside the shared CoreML deps (protobuf/abseil), configures with -DUSE_BACKEND=MLX under Ninja, builds, and runs runtests on the Apple Silicon (arm64) macos-latest runner that MLX requires. Carries over the metal job's dependency-version cache keying: the CMake build bakes in version-pinned Homebrew Cellar paths (protobuf/abseil/mlx), so the installed versions are folded into the cache key to force a fresh configure when a formula is bumped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MLX's GPU streams have no per-stream worker thread, so gpu::eval runs inline on the calling thread and every ComputeHandle shares MLX's single global default GPU stream. With more than one thread driving inference concurrently -- numNNServerThreadsPerModel > 1, a second loaded model (e.g. a human SL net), or the multi-threaded analysis engine -- two threads open two compute command encoders on the same MTLCommandBuffer, which aborts with the Metal assertion "A command encoder is already encoding to this command buffer". Guard the whole MLX graph-build + eval + result read in Model::applyCompiled with a file-scope mlxGpuEvalMutex. Input prep in getOutput stays outside the lock so it still overlaps; one Apple GPU serializes the actual work anyway and KataGo's batching remains the throughput lever. No-op for the default single-model, single-server- thread GTP config. Cherry-picked (cpp portion) from ios-dev 8d07258. That commit's other half -- a JIT threadgroup-size assertion patch -- lives in the iOS app's vendored mlx-swift and does not apply to this Homebrew-MLX CLI build. Co-Authored-By: Chin-Chang Yang <2770271+ChinChangYang@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ss-net memo) Squashes the model-load autotuner overhaul developed on ios-dev (c4f150f, ed7415a, 771756d, b12c736, 469cea8, 69d1fac, 9483fdd, fb422d2), restricted to the C++ backend (the iOS app/UI companion changes are dropped). What changes: - Fast coarse model-load tune. planShapeRotation takes a `full` flag: the per-load coarse tune now uses a 7-rep / 2-rep-floor budget and a trimmed 240-config grid (was 19/3 and 360), ~3.75x fewer GPU dispatches on a cache miss. `tuner -full` keeps the precise 19/3, wide grid. The documented broad plateau (geometry moves end-to-end <=1.5%) justifies the coarse budget. - Greedy coordinate-descent on the coarse path (useGreedy), backed by a new header-only GreedySearch::coordinateDescent core (neuralnet/greedysearch.h) with a standalone unit test (greedysearch_test.cpp, not wired into the katago build). A self-test gate asserts greedy stays within 5% of coarse-exhaustive. - Cross-context, session-scoped tune memo so the main and human SL nets (identical b18c384 3x3-conv shapes) tune once per session, not twice; cleared when the last ComputeContext is freed. - Separate cache files per mode: defaultFileName gains a `_full` suffix; the coarse "fast" tune keeps the legacy name so existing caches hit. - mlxTunerFull / mlxReTune are now read from -override-config in createComputeContext and fed to loadOrAutoTune (were hardcoded full=false, reTune=false). Cache format VERSION unchanged (3). - Diagnostics: tuner candidate-count line + MLX_TUNE_STUDY per-candidate dump; dropped the always-on [MLX-TUNE] sweep stderr line. Deliberately excluded (iOS-app only, no CLI value): the CoreML cache bridge wiring (686f823), the iOS extern-C katagocoreml shims (aebf3e0), and the iOS/visionOS set_cache_limit cap (97d3453). Co-Authored-By: Chin-Chang Yang <2770271+ChinChangYang@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pure, behavior-preserving refactor of the model-load path the autotuner and cross-net memo landed in. No change to inference, tuning results, or the eval hot path. - Fold the three file-scope memo globals (g_winoTuneMemoMutex, g_winoTuneMemo, g_liveComputeContexts) into one WinogradTuneMemo unit with tryGet/put/retain/release. The locking obligation and the "clear on last context release" session-scope invariant now live in one place instead of being open-coded across createComputeContext, freeComputeContext, and the ComputeHandle ctor. - Extract the ~70-line tuner orchestration block out of the ComputeHandle ctor into a free resolveTuneParams(context, loadedModel, useFP16): shape-key build, memo lookup, loadOrAutoTune, memo store. It reads only ComputeContext + LoadedModel state, so it stands alone. - Collapse the verbatim-duplicated "exactly one inference path" invariant check (ANE early-return and GPU end) into a checkExactlyOnePath() member. Validation (Apple M3 Max, b18 uec vs eigen_reference_b18.json): runtests + runnnlayertests pass; FP32 winrate max 0.00065% (bit-identical to baseline); 2 GPU server threads reuse the memo across handles ("Reusing MLX Winograd tuning for shape ..._fast") with no deadlock and no "encoder is already encoding". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pure, behavior-preserving refactor of the autotuner's sweep logging. No change to candidate enumeration, winner selection, tuning results, or the eval hot path -- only how the already-computed diagnostic line is built. flatSweepInput and flatSweepOutput each carried a ~40-line copy of the same tail: an optional skipped-count line, the per-shape median suffix loop, and the considered/best/baseline/delta_pct summary. The two copies differed only in the sweep label, the per-shape scorer, and whether the best-config body prints vw/gridOrder. - Extract renderPerShapeMs() for the " shape_ms=c<C>:<ms>,..." suffix. - Extract logFlatSweep() to own the skipped + summary lines, including the %+.1f delta_pct (sign-forced for the gated regex) and the best=none / delta_pct=nan degenerate branch. The regex-pinned log format (mlxtests.cpp) now has a single source of truth. - Each sweep keeps only its own best-config field rendering (input adds vw/gridOrder) and hands the rest off. perShapeStr is built on the same (best && baseline>=1e-9) condition logFlatSweep uses for delta_pct, so the nan branch stays in lockstep. Net -69/+67 lines. Validation (Apple M3 Max, b18 uec vs eigen_reference_b18.json): runtests + runnnlayertests pass, including the gated flatSweepInput / flatSweepOutput log-format regex checks (KATAGO_MLX_WINOTUNER_RUN_LOG_FORMAT_TEST=1). A forced retune (mlxReTune=true) emits byte-compatible sweep lines and FP32 winrate max stays 0.00065% (bit-identical to baseline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tor#1208) The windows-latest runner image now ships Visual Studio 2026 only (actions/runner-images#14017), so configuring with -G "Visual Studio 17 2022" fails with "could not find any instance of Visual Studio". Drop the explicit generator and let CMake auto-detect the installed Visual Studio, keeping -A x64. https://claude.ai/code/session_018eaSTE1PhvV7SsiNyJrq76 Co-authored-by: Claude <noreply@anthropic.com>

Narrow mlxGpuEvalMutex to graph construction + async_eval (which encodes synchronously); move the completion wait and result readback outside the lock. With numNNServerThreadsPerModel=2, one server thread now encodes batch N+1 while the other waits on batch N, eliminating the ~1.4ms/batch GPU idle bubble. Single-server-thread configs are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fold midBN+activation (after regularConv) and the residual add (after finalConv/postConv) into the Winograd output-untransform kernel for useMask=false, eliminating per-conv elementwise kernel launches and full-tensor round-trips. Residual fuses in T arithmetic (bit-identical); BN+act consumes the rounded-to-T conv value in fp32. Non-Winograd convs, masked configs, gpool blocks, and heads keep the unfused path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Apply fusedConvResidual to GlobalPoolingResidualBlock's input+finalOut residual, mirroring ResidualBlock/NestedBottleneckResidualBlock. Closes the one gpool special-case left unfused in 63446b6; every Winograd-conv->residual in the trunk now fuses uniformly. Bit-identical (T+T residual); FP32 stays 0.00065% vs Eigen. The gpool bias-add (broadcast) path stays unfused. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nsform Add a BiasBNAct epilogue mode (BNAct kernel + one broadcast-bias-add line) and fuse GlobalPoolingResidualBlock's regularConv -> (+bias) -> midBN chain into the regularConv untransform, for useMask=false. The bias add is T+T (bit-identical to regularOut + bias); round-to-T then BN+act in fp32 matches midBN. Completes the gpool block: no elementwise kernel left between regularConv and finalConv. FP32 stays 0.00065% vs Eigen. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- applyCompiled result memcpys: cast the leading operand to size_t so the byte-count product is computed in size_t (the int sub-product could overflow on a large board with a big batch before the sizeof promotion). - MLXWinogradTuneParams::isValid(): bound each threadgroup dim to <=1024 before multiplying (a corrupt cache pair could overflow the int product and slip past the >1024 gate), and reject gridOrder values outside the defined enum so a corrupt cache re-tunes instead of running an unintended geometry. Both surface only on a hand-corrupted local tuner cache. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- KataGoParser: reject non-positive matmul in/out channels, qHeadDim/vHeadDim < 1, and ropeTheta <= 0, matching master desc.cpp. The ropeTheta check matters most: a non-positive theta yields NaN in the builder-derived cos/sin tables, which (being derived, not parsed) bypass the readFloats NaN/Inf gate and would otherwise produce a valid-but-garbage model. - Converter/MILBuilder: derive the serialized IO dtype from the builder's effective use_fp16_io (post narrow-transformer FP32 downgrade) via a new getUseFp16Io() getter, instead of the raw request. Fixes a spec/program mismatch where a narrow transformer with use_fp16_io=true emitted FP32 IO tensors while the model spec still declared FP16 IO. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…oud) - Session tune-memo key: include a signature of the 3x3-conv distribution (the histogram that actually drives planShapeRotation), so two nets with the same trunk width but different conv shapes no longer share a tune. modelVersion stays omitted so same-shape/different-version nets still share. - MLXWinogradTuneParams::isValid(): upper-bound wpt (<=8) and vw (<=4) from a corrupt cache, matching the existing tg<=1024 caps. - Greedy seed: guard the hard-coded seed indices (input + output sweeps) with a runtime check that they decode to the baked default, so a future reorder of the coarse value sets fails loudly instead of silently degrading tuning. - Policy-optimism postprocessor: throw on an unsupported numPolicyChannels instead of a release-elided assert that would silently mis-stride. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both parseTransformerRMSNorm and parseRMSNormLayer validated only num_channels; master desc.cpp rejects epsilon <= 0 || > 1.0f. Add the same check so a malformed model fails loudly instead of building a valid-but-garbage model (rsqrt(x+eps) with a degenerate eps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Match master desc.cpp: reject nested-bottleneck numBlocks < 1 (before parseBlockStack reserves on the count — a crafted model could otherwise force a multi-GB reserve from a few bytes) and matbias numChannels <= 0. Both fail loud on a malformed/corrupt model instead of throwing a confusing length_error or attempting a huge allocation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extract each backend's body into a file-static helper (runOpenCLTuner / runMLXTuner) gated by its own backend macro, leaving MainCmds::tuner as a short #if/#elif/#else dispatcher. Pure relocation — the bodies are byte-identical, no args/defaults/messages/control-flow change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Spec for moving KataGo Anytime (iOS/macOS/visionOS) from USE_METAL_BACKEND to USE_MLX_BACKEND (PR lightvector#1199) via the mlx-swift Cmlx target. Spike-first, phased, clean replace of the MPSGraph GPU path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings origin/mlx-backend-squash (the MLX backend + transformer CoreML support + refined lightvector#1202 memory levers) onto feature/mlx-app-migration. App still builds on USE_METAL_BACKEND (the new mlx*.cpp are on disk but not yet in the Xcode project); macOS + iOS-sim build green. Conflict resolution (12 files): - katagocoreml converter (6 files) + desc.cpp: took THEIRS. ios-dev had the earlier lightvector#1202 memory-lever lineage (raw gzFile, ptr+count WeightEntry, free-function releaseWeights); PR lightvector#1199 has the refined superset (RAII GzHandle, FloatView+is_fp32, per-struct releaseWeights, transformer parsing). Converter public API is unchanged, so the iOS CoreML bridge callers are unaffected. - metalbackend.cpp: UNION — kept ios-dev's Task-19 cache-aware CoreML bridge (invokeCoreMLBridge + legacy fallback) AND adopted PR's context->aneOnly gate around releaseWeights(). - main.cpp: union of the USE_COREML_BACKEND and USE_MLX_BACKEND branches. - LICENSE: union (abseil + protobuf + onnx all vendored). - python/export_model_pytorch.py: took theirs (training-only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChinChangYang force-pushed the mlx-backend-squash branch from b544c66 to dcf296a Compare May 23, 2026 04:32

ChinChangYang mentioned this pull request May 23, 2026

Add MLX backend for Apple Silicon neural network inference ChinChangYang/KataGo#16

Closed

ChinChangYang force-pushed the mlx-backend-squash branch from dcf296a to 81b00db Compare May 23, 2026 08:06

MLX backend: ANE/CoreML correctness + concurrency fixes, cross-path p…

628e377

…arity smoke test (#26)

ChinChangYang force-pushed the mlx-backend-squash branch from 31eec63 to 628e377 Compare May 26, 2026 23:38

ChinChangYang changed the title ~~Add MLX backend for Apple Silicon~~ Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch) May 26, 2026

ChinChangYang and others added 4 commits June 2, 2026 23:47

Merge remote-tracking branch 'origin/master' into mlx-backend-squash

d5647ed

# Conflicts: # cpp/rungpuerrortest.sh

ChinChangYang changed the title ~~Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch)~~ Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch, incl. transformer nets) Jun 3, 2026

ChinChangYang and others added 18 commits June 3, 2026 18:13

ChinChangYang and others added 27 commits June 5, 2026 21:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch, incl. transformer nets)#1199

Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch, incl. transformer nets)#1199
ChinChangYang wants to merge 51 commits into
lightvector:masterfrom
ChinChangYang:mlx-backend-squash

ChinChangYang commented May 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChinChangYang commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's new

Conversion + ANE steady-state memory

Measured (19×19, MLX ANE path)

How to build

Validation

GPU throughput: MLX vs Metal

Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChinChangYang commented May 23, 2026 •

edited

Loading