Commit 24f1f1a
perf: cache F32 lm_head weight — eliminates 1 GB allocation per token
The F32 lm_head weight conversion (to_dtype(F32)) was creating a fresh
1 GB copy of the 248320×1024 weight matrix on every single generated
token. Pre-caching it at model load time eliminates this allocation.
Benchmark (M3 Pro, Qwen3.5-0.8B, 50 tokens):
- Before: 16.1 tok/s
- After: 36.7 tok/s (+128%, 2.28x speedup)
Memory cost: +1 GB for the cached F32 weight (acceptable on 36 GB M3 Pro).
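The pattern is straightforward: do the dtype conversion once when the model is constructed, store the F32 copy alongside the model, and have the per-token logits path read the cached buffer instead of calling `to_dtype(F32)` each step. A minimal self-contained Rust sketch of that pattern follows; all names (`LmHead`, `dequantize`, `logits`) are illustrative stand-ins, not the actual code from this commit, and the plain `Vec<f32>` matvec stands in for the real tensor ops.

```rust
// Stand-in for the real `to_dtype(F32)` conversion: dequantize a
// low-precision weight buffer into a fresh F32 allocation.
fn dequantize(weights_q: &[i8], scale: f32) -> Vec<f32> {
    weights_q.iter().map(|&w| w as f32 * scale).collect()
}

struct LmHead {
    weight_f32: Vec<f32>, // cached F32 copy, built once at load time
    vocab: usize,
    dim: usize,
}

impl LmHead {
    fn load(weights_q: &[i8], scale: f32, vocab: usize, dim: usize) -> Self {
        // One-time conversion here, instead of once per generated token.
        Self { weight_f32: dequantize(weights_q, scale), vocab, dim }
    }

    fn logits(&self, hidden: &[f32]) -> Vec<f32> {
        // Plain matvec over the cached weight: logits[v] = Σ_d W[v,d] * h[d].
        // No allocation of a fresh weight copy on this hot path.
        (0..self.vocab)
            .map(|v| {
                (0..self.dim)
                    .map(|d| self.weight_f32[v * self.dim + d] * hidden[d])
                    .sum()
            })
            .collect()
    }
}

fn main() {
    let (vocab, dim) = (4, 3);
    let q: Vec<i8> = (0..(vocab * dim) as i8).collect();
    let head = LmHead::load(&q, 0.5, vocab, dim);
    let logits = head.logits(&[1.0, 1.0, 1.0]);
    // Row 0 of W dequantizes to [0.0, 0.5, 1.0], so its logit is 1.5.
    assert!((logits[0] - 1.5).abs() < 1e-6);
    println!("{:?}", logits);
}
```

The trade-off is exactly the one the commit message states: the cached copy costs a full extra F32-sized allocation held for the model's lifetime, in exchange for removing that allocation (and the conversion work) from every decode step.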
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b7c2e2a · commit 24f1f1a
1 file changed, 8 additions & 3 deletions
[Diff content not captured in extraction. The change touches four hunks in the modified file: additions at new lines 140–141, 250–252, 266, and 362, with deletions at old lines 352–354 and 357 (8 additions, 3 deletions total).]