`MmapStorage.SliceElements` provides zero-copy slicing of mmap'd tensor elements. It returns a view into the memory-mapped region without copying data, making expert weight extraction in mixture-of-experts models efficient:

```go
// Extract expert weights directly from the mmap'd file; no allocation
```

This replaces the previous pattern of copying expert weights into a new tensor before each forward pass.
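
To illustrate the zero-copy idea, here is a minimal sketch that is not ztensor's implementation. The names `flatStorage`, `sliceElements`, and `expertView` are hypothetical, a plain `[]float32` stands in for the mapped region, and experts are assumed to be stored contiguously with a fixed element count each.

```go
package main

import "fmt"

// flatStorage is a hypothetical stand-in for an mmap-backed tensor store.
// A plain slice plays the role of the mapped region in this sketch.
type flatStorage struct {
	data []float32
}

// sliceElements returns a view of n elements starting at element offset off.
// A Go slice expression shares the backing array, so no data is copied.
func (s *flatStorage) sliceElements(off, n int) ([]float32, error) {
	if off < 0 || n < 0 || off+n > len(s.data) {
		return nil, fmt.Errorf("range [%d:%d] outside %d elements", off, off+n, len(s.data))
	}
	// The three-index form caps the view, so an append on the view
	// cannot overwrite the elements that follow it.
	return s.data[off : off+n : off+n], nil
}

// expertView returns the weights of expert e, assuming experts are laid out
// back to back with perExpert elements each (an assumed layout).
func expertView(s *flatStorage, e, perExpert int) ([]float32, error) {
	return s.sliceElements(e*perExpert, perExpert)
}

func main() {
	s := &flatStorage{data: make([]float32, 12)}
	for i := range s.data {
		s.data[i] = float32(i)
	}

	v, err := expertView(s, 2, 4)
	if err != nil {
		panic(err)
	}
	// The view starts at the same address as element 8 of the backing data.
	fmt.Println(v, &v[0] == &s.data[8]) // prints "[8 9 10 11] true"
}
```

Because the view aliases the backing region, it stays valid only as long as the mapping does, and writes through it would be visible to every other view of the same range.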

### Streaming GEMM for mmap'd Tensors

`internal/xblas` now includes a streaming GEMM path for mmap'd weight tensors. Instead of paging in the entire weight matrix before computation, the kernel tiles over the mmap region in cache-sized chunks, so the resident memory footprint is proportional to the active tile rather than the full matrix.

This enables over-RAM CPU inference: a model whose weights exceed physical RAM can run without a GPU, with the OS paging tensor data in from NVMe on demand. Combined with `MmapStorage.SliceElements`, a 229B-parameter MoE model runs on a 128 GB machine with no configuration flags.
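
The tiling idea can be sketched as follows. This is an illustrative loop nest, not the `internal/xblas` kernel: `gemmStreamed` and `tileK` are hypothetical names, matrices are assumed row-major `float32`, and an ordinary slice stands in for the mmap'd weight tensor.

```go
package main

import "fmt"

// gemmStreamed computes C = A*W for row-major A (m×k), W (k×n), and C (m×n).
// W is walked in tiles of tileK rows, so each outer pass reads one contiguous
// block of W. If W were mmap'd, that block is the part the OS must keep resident.
func gemmStreamed(a, w, c []float32, m, k, n, tileK int) {
	if tileK <= 0 {
		tileK = k // fall back to a single tile
	}
	for i := range c[:m*n] {
		c[i] = 0
	}
	for k0 := 0; k0 < k; k0 += tileK {
		k1 := k0 + tileK
		if k1 > k {
			k1 = k
		}
		wTile := w[k0*n : k1*n] // rows k0..k1-1 of W, contiguous in memory
		for i := 0; i < m; i++ {
			cRow := c[i*n : (i+1)*n]
			for kk := k0; kk < k1; kk++ {
				aik := a[i*k+kk]
				wRow := wTile[(kk-k0)*n : (kk-k0+1)*n]
				for j, wv := range wRow {
					cRow[j] += aik * wv // accumulate partial products per tile
				}
			}
		}
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6} // 2×3
	w := []float32{1, 0, 0, 1, 1, 1} // 3×2
	c := make([]float32, 4)          // 2×2
	gemmStreamed(a, w, c, 2, 3, 2, 2)
	fmt.Println(c) // prints "[4 5 10 11]"
}
```

Each tile of W is read once and then never revisited, which is what lets the OS evict its pages after the pass; a production kernel would also block over the other dimensions and use SIMD.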

## Dependencies

ztensor depends on [float16]({{< relref "numeric-types" >}}) and [float8]({{< relref "numeric-types" >}}) for half-precision and FP8 arithmetic.