expert-offloading

Here are 2 public repositories matching this topic...

e1n00r / tinyserve

30 tok/s for a 20B MoE model on 8 GB of VRAM, with flat throughput up to 32K context. Native MXFP4 and GGUF Q4_K/Q5_K/Q6_K via ggml CUDA kernels, with zero dequantization. Expert offloading for models that don't fit in GPU memory.

gpu inference pytorch moe offloading quantization llm expert-offloading

Updated Apr 7, 2026
Python

Saundersonmainstreamed100 / flash-moe


Run a 397B MoE model on a MacBook with C/Metal inference, 4.4+ tok/s, and tool calling

c gpu neon inference pytorch simd moe offloading avx2 quantization kv-cache cpu-inference llm gguf expert-offloading turboquant

Updated May 6, 2026
Objective-C
