| Feature | Platform | Backend | Use case | Notes |
|---------|----------|---------|----------|-------|
| `metal` | macOS (Apple Silicon) | GPU via MPS + custom MSL kernels | Primary inference on Mac | Fastest option on Apple Silicon (~42 tok/s on M3 Pro) |
| `cuda` | Linux (NVIDIA GPU) | GPU via cuBLAS/cuDNN | Primary inference on Linux | Requires CUDA toolkit matching driver version |
| `accelerate` | macOS | CPU via Apple Accelerate (AMX) | CPU-only F32 inference on Mac | 2.7x faster than pure-Rust for F32 matmul; no F16 support |
| `vulkan` | Any (Vulkan 1.3+) | GPU via Vulkan compute shaders | Steam Deck, AMD GPUs | Portable but less optimized than Metal/CUDA |
| (none) | Any | CPU via pure-Rust `gemm` | Portable CPU fallback | F16 weights stay F16, avoids bandwidth doubling |

**When to use which:**

- **Apple Silicon (stevie.local):** Use `--features metal`. Metal is 1.6x faster than CPU F16 (42 vs 26 tok/s). The `accelerate` feature doesn't help with Metal and doesn't support F16 matmul, so CPU F16 (default, no features) is actually faster than `accelerate` with F32 (26 vs 23 tok/s).
- **NVIDIA GPU (blade/bahamut):** Use `--features cuda`. Add `flash-attn` for Flash Attention support.
- **CPU-only with F32 models:** Use `--features accelerate` on macOS for 2.7x faster F32 matmul. On Linux, consider linking against MKL or OpenBLAS.
- **CPU-only with F16 models:** Use no features — pure-Rust `gemm` with F16 avoids the 2x memory-bandwidth penalty of converting to F32.
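In Cargo terms, those recommendations look like this (a sketch: the `--release` profile is an assumption about your workflow, and the feature names are the ones from the table above):

```sh
# Apple Silicon: Metal GPU backend
cargo build --release --features metal

# Linux + NVIDIA: CUDA, optionally with Flash Attention 2
cargo build --release --features cuda,flash-attn

# macOS CPU-only with F32 models: Accelerate (AMX)
cargo build --release --features accelerate

# F16 models on CPU (any platform): default pure-Rust gemm path
cargo build --release
```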
**docs/install.md**

By default, inference runs on CPU. Enable GPU acceleration with:

| Feature | Backend | Platforms | Notes |
|---------|---------|-----------|-------|
| `cuda` | NVIDIA CUDA (PTX kernels + flash-attn) | Linux, Windows | Best for NVIDIA GPUs |
| `metal` | Apple Metal (MSL shaders + fused SDPA) | macOS, iOS | Best for Apple Silicon (~42 tok/s on M3 Pro with 0.8B model) |
| `vulkan` | Vulkan via wgpu | Linux, Windows, Steam Deck | Portable but less optimized than Metal/CUDA |
| `accelerate` | Apple Accelerate (AMX hardware) | macOS | CPU-only; 2.7x faster F32 matmul via Apple BLAS. No F16 support — use `metal` for F16 models |
| `flash-attn` | Flash Attention 2 (implies `cuda`) | Linux, Windows | Fused attention kernel for long sequences |

Multiple backends can be compiled together — the runtime auto-selects based on available hardware.
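For instance, one build can carry both GPU backends in a single portable binary (a sketch under the same assumptions as above; Cargo unifies the feature set at compile time):

```sh
# Compile the CUDA and Vulkan backends into the same binary;
# at startup the runtime probes the hardware and selects one.
cargo build --release --features cuda,vulkan
```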
**Apple Silicon guidance:** Use `metal` for best performance. The `accelerate` feature only helps CPU inference with F32 models — for F16 models (default), CPU without `accelerate` is actually faster (26 vs 23 tok/s) because F16 halves memory bandwidth vs the F32 conversion Accelerate requires.
### Model Features
By default, all text model architectures are compiled in. To build only for specific models:
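A hedged sketch of what that looks like; `llama` is a hypothetical placeholder for a per-model feature name, not a confirmed one, so check the crate's `Cargo.toml` for the actual list:

```sh
# Hypothetical: turn off the default set of architectures, then
# re-enable one model family (placeholder name) plus a backend.
cargo build --release --no-default-features --features metal,llama
```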