Model weight management for LLM inference — cache, transfer, and serve weights at scale with GPU-to-GPU RDMA and multi-node coordination.
Features • Architecture • Quick Start • Deployment • Docs • Contributing
ModelExpress is a Rust-based service that manages the complete model weight lifecycle in the cluster—from acquisition to GPU memory. It accelerates LLM inference by caching, routing, and transferring weights through the fastest available path. Deploy standalone or as a sidecar alongside vLLM, NVIDIA Dynamo, and other inference runtimes.
| LLM serving problem | How ModelExpress helps |
|---|---|
| Models take too long to load | GPU-to-GPU transfer via NIXL/RDMA instead of loading from storage. In P2P mode, weights already serving inference act as the cache—no extra storage. |
| Many nodes need the same model | Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths. |
ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.
- Download from HuggingFace — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (
HF_HUB_OFFLINE=1). - Persist to disk — Store in a cache backed by disk:
- Host-attached disk — Local disk on the node (single-node or per-node cache).
- PVC — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
- Disk to GPU — Inference engine (vLLM, etc.) loads weights from the cache (disk) into GPU memory.
- P2P transfer — Additional nodes receive weights via GPU-to-GPU RDMA from the first node instead of reading from disk—no duplicate downloads or disk reads.
- Cold start reduction — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
- HuggingFace caching — PVC-backed cache,
HF_HUB_OFFLINE,ignore_weights,get_model_pathfor Dynamo - P2P GPU transfer — vLLM
modelexpressloader (mxalias) and TRT-LLMPRESHARDEDloader with NVIDIA NIXL over RDMA - Metadata backends — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
- Kubernetes — Helm chart, CRDs/Redis for P2P, no-shared-storage support
- CLI — Health, download, list, validate, clear; init-container support for pre-warming
- ModelStreamer integration: stream weights from object storage (AWS S3, Azure Blob, GCS) with multi-engine support
- Expanded model pull providers: NGC catalog and Google Cloud Storage in addition to Hugging Face
- GDS (GPUDirect Storage): load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path
| Runtime | Integration |
|---|---|
| vLLM | --load-format modelexpress for P2P weight transfer; mx is a backward-compatible alias |
| NVIDIA Dynamo (vLLM) | get_model_path API; Dynamo model cache K8s example |
| TensorRT-LLM | LoadFormat.PRESHARDED with MxLiveCheckpointLoader for P2P weight transfer (beta) — TRT-LLM examples |
| SGLang | remote_instance + modelexpress backend with transport=nixl or transport=transfer_engine — see docs/SGLANG.md |
Phase 1 — Upload once: Model Source (HuggingFace Hub, NFS) downloads to the Seed Pod (GPU), which loads and postprocesses weights, registers VRAM with NIXL, and publishes metadata to the MX Server. Phase 2 — Autoscale: New pods receive weights via NIXL GPUDirect RDMA (GPU VRAM → GPU VRAM, zero-copy) from the seed GPU, using --load-format modelexpress for inference.
┌─────────────────────────────────────────────────────────────────┐
│ ModelExpress Server │
│ Health • Model • P2P Metadata • Redis/K8s CRD backends │
└──────────────────────┬──────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ metadata │ │ metadata
▼ │ ▼
┌──────────────────┐ │ ┌──────────────────┐
│ Source (vLLM) │ RDMA │ │ Target (vLLM) │
│ mx loader │════════►│ │ mx loader │
│ Load → NIXL │ NIXL │ │ Receive → FP8 │
│ Publish metadata│ │ │ Serve inference │
└──────────────────┘ │ └──────────────────┘
Source and Target exchange metadata with the server for coordination; weights transfer directly over RDMA between GPUs.
- modelexpress_server: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
- modelexpress_client: Rust CLI for cache management; Python package with inference engine loaders and
MxClientfor gRPC. - modelexpress_common: Protobuf definitions, provider trait (HuggingFace), shared configuration.
See Architecture.
Requirements: Rust 1.90+, protoc, Docker
git clone https://github.com/ai-dynamo/modelexpress.git
cd modelexpress
# Start a local Redis instance for metadata storage
docker run -d --name redis -p 6379:6379 redis:8-alpine
cargo build
# REDIS_URL is required; the server does not fall back to localhost:6379.
REDIS_URL=redis://localhost:6379 MX_METADATA_BACKEND=redis cargo run --bin modelexpress-serverServer listens on 0.0.0.0:8001. In another terminal:
# Download a model (shared storage)
modelexpress-cli model download meta-llama/Llama-3.3-70B-Instruct
# Verify
modelexpress-cli healthWithout shared storage: use --no-shared-storage for gRPC streaming.
Air-gapped: with the model already in the local HF cache, HF_HUB_OFFLINE=1 modelexpress-cli model download <model-id> resolves it without network access.
kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} -n <namespace>
helm install modelexpress ./helm --namespace modelexpress --create-namespaceOverride values-production.yaml for your env. Full config: helm/README.md.
from modelexpress import register_modelexpress_loaders
register_modelexpress_loaders()
# vllm serve <model> --load-format modelexpress
# The mx load format is kept as a backward-compatible alias.First instance loads from disk; subsequent instances receive via RDMA. P2P guide · Server setup.
Load model weights directly from Azure Blob Storage, S3, or a PVC-backed local path through ModelStreamer. ModelStreamer examples · vLLM recipes.
docker compose -f docker/docker-compose.yml up --buildPrecedence: CLI → env vars (MODEL_EXPRESS_*, MX_*) → YAML → defaults.
| Variable | Default | Description |
|---|---|---|
MODEL_EXPRESS_SERVER_PORT |
8001 |
gRPC port |
MODEL_EXPRESS_CACHE_DIRECTORY |
./cache |
Cache root |
MX_METADATA_BACKEND |
(required) | redis | kubernetes |
REDIS_URL |
(required for redis) |
Redis connection URL. Alternatively set MX_REDIS_HOST + MX_REDIS_PORT. No localhost fallback. |
MX_SERVER_ADDRESS |
localhost:8001 |
Client-side gRPC server address (P2P). Recommended. |
MODEL_EXPRESS_URL |
localhost:8001 |
Deprecated, pending removal in a future release. Still read by all client paths and takes precedence when both are set; keep setting it during the transition. |
cargo run --bin config_gen -- --output model-express.yaml
cargo run --bin modelexpress-server -- --config model-express.yaml --validate-configFull reference: docs/DEPLOYMENT.md.
modelexpress-cli health
modelexpress-cli model download <model-id>
modelexpress-cli model list
modelexpress-cli model validate <model-id>
modelexpress-cli model clear <model-id>cargo test
cargo test --test integration_tests
cargo run --bin test_client -- --test-model "google-t5/t5-small"
./run_integration_tests.sh
cargo bench| Doc | Description |
|---|---|
| Deployment | Server/client config, Docker, K8s, P2P |
| Architecture | Components, gRPC, NIXL, FP8 |
| CLI | Full CLI reference |
| Metadata | Redis keys, K8s CRD schema |
| Helm | Kubernetes configuration |
- NIXL_ERR_REMOTE_DISCONNECT — Source restarts invalidate rkeys. Flush Redis, redeploy.
- Long source warmup — DeepSeek-V3 (DeepGemm, CUDA graphs) can take significant time; targets wait via coordination.
- Large model gRPC stream — May not close automatically; use client timeout.
- P2P compile/warmup caching: torch.compile/deepGEMM cache for follower workers. Leader performs full warmup; followers consume caches and skip full warmup.
- DRAM and NVMe-resident shard streaming: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
- RL workloads: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
- Earlier weight availability: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
- Multi-tier cache hierarchy: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
- Distributed sharded cache: Shard large models across nodes using consistent hashing and parallel shard assembly.
- Training checkpoint management: Cache and reuse CUDA kernel compilations (torch.compile, deepGEMM) and CUDA graphs across restarts.
- Metrics and observability: Cache hit rates, eviction frequency, transfer throughput, and P2P RDMA utilization via Prometheus/OpenTelemetry.
- Predictive prefetching: Pre-warm caches from workload history or scheduling hints.
- P2P transfer fault tolerance: Auto-recovery from stale rkeys on source restart; retry and fallback to storage loading.
- Dynamic EPLB (Expert Parallelism Load Balancer): Rebalance MoE expert placement across GPUs at runtime via P2P transfer of expert weights as load shifts.
Contributions welcome. See CONTRIBUTING.md.
pip install pre-commit && pre-commit install
pre-commit run --all-filesIssues: GitHub Issues
Apache 2.0. See LICENSE.

