Commit f1787d9
perf: revert F32 lm_head — F16 is correct after norm fixes
The F32 lm_head was added to compensate for logit-distribution errors
caused by the GDN gated norm bug (a +1.0 offset wrongly applied to the
non-residual norm weights). Now that the norm is fixed, the F16 lm_head
produces correct output at all temperatures.
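A minimal sketch of the bug class. It assumes, as in Qwen-style checkpoints, that residual-stream RMSNorm weights are stored as an offset and applied as `(1 + w)`, while the GDN gated norm stores its scale directly and multiplies by `silu(gate)`. All function and variable names are hypothetical; this illustrates the failure mode, not the project's actual kernel.

```python
import math


def rms_normalize(x, eps=1e-6):
    # Divide by the root-mean-square of x (eps guards against division by zero).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def residual_rmsnorm(x, w, eps=1e-6):
    # ASSUMED convention for residual-stream norms: the stored weight is an
    # offset from 1, so the effective per-channel scale is (1 + w).
    return [(1.0 + wi) * v for wi, v in zip(w, rms_normalize(x, eps))]


def silu(z):
    # SiLU / swish activation: z * sigmoid(z).
    return z / (1.0 + math.exp(-z))


def gated_rmsnorm(x, gate, w, eps=1e-6, buggy=False):
    # ASSUMED convention for the GDN gated norm: the stored weight IS the scale.
    # buggy=True reproduces the bug class described in the commit: a +1.0
    # offset applied to weights that do not use the offset convention.
    scale = [(1.0 + wi) if buggy else wi for wi in w]
    normed = rms_normalize(x, eps)
    return [s * v * silu(g) for s, v, g in zip(scale, normed, gate)]


x = [1.0, -2.0, 3.0, -4.0]
gate = [0.5, 1.0, -1.0, 2.0]
w = [1.0, 1.0, 1.0, 1.0]  # hypothetical stored weights for the gated norm

correct = gated_rmsnorm(x, gate, w)
buggy = gated_rmsnorm(x, gate, w, buggy=True)
# With w == 1, the buggy path doubles every output channel.
print([round(b / c, 6) for b, c in zip(buggy, correct)])  # prints [2.0, 2.0, 2.0, 2.0]
```

With stored weights of 1.0 the buggy path doubles every channel; with other weights the distortion differs per channel, so it cannot be undone by a single global rescale.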
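Removing the F32 lm_head saves:
- ~1 GB of memory (no cached F32 copy of the weight)
- ~3.7 ms/token measured (27.2 vs 23.6 ms/token in the benchmark below):
  each token reads the 508 MB F16 weight instead of the ~1 GB F32 copy,
  halving lm_head memory traffic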
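The savings follow from bytes per element: F32 uses 4 bytes and F16 uses 2, so the cached F32 copy is twice the 508 MB F16 matrix, and every decoded token streams the whole lm_head weight once. A back-of-the-envelope sketch; the 150 GB/s figure is an assumption (Apple's advertised peak M3 Pro memory bandwidth, not from this commit), and real kernels run below peak, so the result is only a lower bound on the time saved.

```python
F16_BYTES = 2  # bytes per half-precision element
F32_BYTES = 4  # bytes per single-precision element


def lm_head_bytes(f16_bytes: float, elem_bytes: int) -> float:
    # Scale the F16 footprint to another element width (same element count).
    return f16_bytes * elem_bytes / F16_BYTES


def read_time_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    # Lower bound on the time to stream num_bytes once at peak bandwidth.
    return num_bytes / bandwidth_bytes_per_s * 1e3


F16_LM_HEAD = 508e6        # bytes, from the commit message
M3_PRO_BANDWIDTH = 150e9   # bytes/s, ASSUMED peak bandwidth (not from the commit)

f32_lm_head = lm_head_bytes(F16_LM_HEAD, F32_BYTES)
extra_read = f32_lm_head - F16_LM_HEAD   # additional bytes read per token with F32
saved_ms = read_time_ms(extra_read, M3_PRO_BANDWIDTH)

# prints "F32 copy: 1.016 GB, extra read: 508 MB, >= 3.39 ms/token"
print(f"F32 copy: {f32_lm_head / 1e9:.3f} GB, "
      f"extra read: {extra_read / 1e6:.0f} MB, >= {saved_ms:.2f} ms/token")
```

The bound comes out near 3.4 ms/token; the measured gap is somewhat larger, which is consistent with the kernel running below peak bandwidth.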
Benchmark (M3 Pro, Qwen3.5-0.8B, 50 tokens):
- F32 lm_head (cached): 36.7 tok/s, +1 GB memory
- F16 lm_head: 42.4 tok/s (+15.5%), no extra memory
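Converting the throughput figures to per-token latency puts the speedup in absolute terms. A minimal sketch that uses only the two numbers reported in this benchmark:

```python
def ms_per_token(tokens_per_s: float) -> float:
    # Per-token latency in milliseconds for a given decode throughput.
    return 1000.0 / tokens_per_s


F32_TOK_S = 36.7  # F32 lm_head (cached), from the benchmark
F16_TOK_S = 42.4  # F16 lm_head, from the benchmark

f32_ms = ms_per_token(F32_TOK_S)
f16_ms = ms_per_token(F16_TOK_S)
saved_ms = f32_ms - f16_ms
speedup_pct = (F16_TOK_S / F32_TOK_S - 1.0) * 100.0

print(f"F32: {f32_ms:.2f} ms/token, F16: {f16_ms:.2f} ms/token")
print(f"saved: {saved_ms:.2f} ms/token, speedup: +{speedup_pct:.1f}%")
```

This prints 27.25 and 23.58 ms/token, a saving of 3.66 ms/token, and a +15.5% speedup, matching the percentage reported above.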
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 24f1f1a commit f1787d9
1 file changed
Lines changed: 1 addition & 12 deletions
Diff hunks (line contents not preserved):

| Hunk | Old lines | New lines | Removed (old numbering) | Added (new numbering) |
|---|---|---|---|---|
| 1 | 137-144 | 137-142 | 140-141 | none |
| 2 | 247-255 | 245-250 | 250-252 | none |
| 3 | 263-269 | 258-263 | 266 | none |
| 4 | 351-365 | 345-354 | 354-358, 362 | 351 |