Dynamo ModelExpress

Model weight management for LLM inference — cache, transfer, and serve weights at scale with GPU-to-GPU RDMA and multi-node coordination.


Overview

ModelExpress is a Rust-based service that manages the complete model weight lifecycle in the cluster—from acquisition to GPU memory. It accelerates LLM inference by caching, routing, and transferring weights through the fastest available path. Deploy standalone or as a sidecar alongside vLLM, NVIDIA Dynamo, and other inference runtimes.

LLM serving problem	How ModelExpress helps
Models take too long to load	GPU-to-GPU transfer via NIXL/RDMA instead of loading from storage. In P2P mode, weights already serving inference act as the cache—no extra storage.
Many nodes need the same model	Metadata backends (Redis, K8s CRD) coordinate sharing: one node loads; others receive via P2P or local paths.

How ModelExpress manages weights in the cluster

ModelExpress orchestrates the full flow—from download to GPU memory. It ensures only one node downloads a model from external sources (e.g., HuggingFace); other nodes receive weights via P2P or shared storage—eliminating duplicate downloads and reducing cluster ingress.

Download from HuggingFace — One node pulls the model; ModelExpress coordinates so no other node duplicates this download, reducing external ingress. In air-gapped mode, serve from cache only (HF_HUB_OFFLINE=1).
Persist to disk — Store in a cache backed by disk:
- Host-attached disk — Local disk on the node (single-node or per-node cache).
- PVC — RWO (ReadWriteOnce) for single-node; RWX (ReadWriteMany) for shared access across nodes.
Disk to GPU — Inference engine (vLLM, etc.) loads weights from the cache (disk) into GPU memory.
P2P transfer — Additional nodes receive weights via GPU-to-GPU RDMA from the first node instead of reading from disk—no duplicate downloads or disk reads.
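The "only one node downloads" rule in the flow above can be pictured as a first-claim-wins check against shared metadata. The sketch below is illustrative only: `DownloadCoordinator`, `claim`, and `plan_load` are hypothetical names, and an in-memory dict stands in for a real metadata backend such as Redis or a K8s CRD. It is not ModelExpress's actual coordination code.

```python
import threading


class DownloadCoordinator:
    """In-memory stand-in for a shared metadata backend (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # model_id -> node that claimed the download

    def claim(self, model_id: str, node: str) -> bool:
        """Return True only for the node that first claimed model_id."""
        with self._lock:
            owner = self._owners.setdefault(model_id, node)
            return owner == node


def plan_load(coordinator: DownloadCoordinator, model_id: str, node: str) -> str:
    """Decide how a node should obtain the weights."""
    if coordinator.claim(model_id, node):
        return "download"  # this node pulls from the external source
    return "p2p_or_shared_storage"  # every other node reuses existing weights


coordinator = DownloadCoordinator()
plans = {
    node: plan_load(coordinator, "org/model", node)
    for node in ["node-a", "node-b", "node-c"]
}
```

Here `node-a` is asked first, so it is the only node told to download; the other two are routed to P2P or shared storage.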

Features

Cold start reduction — GPU-to-GPU P2P transfer over InfiniBand instead of disk load
HuggingFace caching — PVC-backed cache, HF_HUB_OFFLINE, ignore_weights, get_model_path for Dynamo
P2P GPU transfer — vLLM modelexpress loader (mx alias) and TRT-LLM PRESHARDED loader with NVIDIA NIXL over RDMA
Metadata backends — In-memory, Redis, or Kubernetes CRD (layered write-through for HA)
Kubernetes — Helm chart, CRDs/Redis for P2P, no-shared-storage support
CLI — Health, download, list, validate, clear; init-container support for pre-warming
ModelStreamer integration: stream weights from object storage (AWS S3, Azure Blob, GCS) with multi-engine support
Expanded model pull providers: NGC catalog and Google Cloud Storage in addition to Hugging Face
GDS (GPUDirect Storage): load model weights directly from NVMe into GPU memory, bypassing the CPU/DRAM copy path
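The "layered write-through" wording in the metadata backends feature refers to a common pattern: every write goes to a durable layer and a fast layer together, so the fast layer can be rebuilt after a restart. The sketch below shows the generic pattern with two dicts. `LayeredStore` is a hypothetical name, and how ModelExpress actually layers its backends is not described here, so treat this as an assumption about the general idea rather than its implementation.

```python
class LayeredStore:
    """Generic write-through store: a fast layer in front of a durable one."""

    def __init__(self, fast: dict, durable: dict):
        self.fast = fast        # e.g. process memory
        self.durable = durable  # e.g. Redis or a K8s CRD (stand-in: a dict)

    def put(self, key, value):
        # Write-through: persist first, then update the fast layer.
        self.durable[key] = value
        self.fast[key] = value

    def get(self, key):
        if key in self.fast:
            return self.fast[key]
        value = self.durable.get(key)
        if value is not None:
            self.fast[key] = value  # repopulate the fast layer on a miss
        return value


durable_layer = {}
store = LayeredStore(fast={}, durable=durable_layer)
store.put("org/model", {"source": "pod-a"})

# Simulate a server restart: the fast layer is lost, the durable layer is not.
restarted = LayeredStore(fast={}, durable=durable_layer)
recovered = restarted.get("org/model")
```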

Integrations

Runtime	Integration
vLLM	`--load-format modelexpress` for P2P weight transfer; `mx` is a backward-compatible alias
NVIDIA Dynamo (vLLM)	`get_model_path` API; Dynamo model cache K8s example
TensorRT-LLM	`LoadFormat.PRESHARDED` with `MxLiveCheckpointLoader` for P2P weight transfer (beta) — TRT-LLM examples
SGLang	`remote_instance` + `modelexpress` backend with `transport=nixl` or `transport=transfer_engine` — see `docs/SGLANG.md`

ModelExpress Architecture

Phase 1 (upload once): The Seed Pod (GPU) pulls the model from a model source (HuggingFace Hub, NFS), loads and postprocesses the weights, registers its VRAM with NIXL, and publishes metadata to the MX Server.

Phase 2 (autoscale): New pods started with --load-format modelexpress receive weights from the seed GPU via NIXL GPUDirect RDMA (GPU VRAM → GPU VRAM, zero-copy) and then serve inference.

                    ┌─────────────────────────────────────────────────────────────────┐
                    │                    ModelExpress Server                          │
                    │   Health • Model • P2P Metadata • Redis/K8s CRD backends        │
                    └──────────────────────┬──────────────────────────────────────────┘
                                           │
                         ┌─────────────────┼─────────────────┐
                         │ metadata        │                 │ metadata
                         ▼                 │                 ▼
              ┌──────────────────┐         │       ┌──────────────────┐
              │  Source (vLLM)   │  RDMA   │       │  Target (vLLM)   │
              │  mx loader       │════════►│       │  mx loader       │
              │  Load → NIXL     │  NIXL   │       │  Receive → FP8   │
              │  Publish metadata│         │       │  Serve inference │
              └──────────────────┘         │       └──────────────────┘

Source and Target exchange metadata with the server for coordination; weights transfer directly over RDMA between GPUs.
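The split between the metadata path and the data path can be simulated in a few lines. In the sketch below, `MetadataServer`, `load_weights`, and the two fake loaders are hypothetical stand-ins: the server only records which pod holds which tensors, and a plain dict copy takes the place of the RDMA transfer. It illustrates the shape of the flow, not the real gRPC or NIXL calls.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataServer:
    """Stand-in for the MX Server: stores metadata only, never weights."""
    sources: dict = field(default_factory=dict)

    def publish(self, model_id, source_addr, tensor_names):
        self.sources[model_id] = {"addr": source_addr, "tensors": list(tensor_names)}

    def lookup(self, model_id):
        return self.sources.get(model_id)


def load_weights(server, model_id, pod_addr, load_from_disk, receive_from_peer):
    """First pod loads from disk and publishes; later pods receive from it."""
    meta = server.lookup(model_id)
    if meta is None:
        weights = load_from_disk(model_id)
        server.publish(model_id, pod_addr, weights.keys())
        return "disk", weights
    return "p2p", receive_from_peer(meta["addr"], meta["tensors"])


gpu_memory = {}  # pod address -> tensors held by that pod (simulated VRAM)


def fake_disk_load(model_id):
    return {"layer0.weight": [0.1, 0.2], "layer1.weight": [0.3]}


def fake_peer_receive(addr, tensor_names):
    # Stand-in for the GPU-to-GPU transfer: copy directly from the source pod.
    return {name: list(gpu_memory[addr][name]) for name in tensor_names}


server = MetadataServer()
path_a, gpu_memory["pod-a"] = load_weights(
    server, "org/model", "pod-a", fake_disk_load, fake_peer_receive)
path_b, gpu_memory["pod-b"] = load_weights(
    server, "org/model", "pod-b", fake_disk_load, fake_peer_receive)
```

The lookup-then-publish step here is not atomic; a real deployment would need the backend to arbitrate when two pods start at the same time.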

modelexpress_server: gRPC server with configurable metadata backends (Redis, Kubernetes CRD).
modelexpress_client: Rust CLI for cache management; Python package with inference engine loaders and MxClient for gRPC.
modelexpress_common: Protobuf definitions, provider trait (HuggingFace), shared configuration.

See Architecture.

Quick Start

Requirements: Rust 1.90+, protoc, Docker

git clone https://github.com/ai-dynamo/modelexpress.git
cd modelexpress

# Start a local Redis instance for metadata storage
docker run -d --name redis -p 6379:6379 redis:8-alpine

cargo build
# REDIS_URL is required; the server does not fall back to localhost:6379.
REDIS_URL=redis://localhost:6379 MX_METADATA_BACKEND=redis cargo run --bin modelexpress-server

Server listens on 0.0.0.0:8001. In another terminal:

# Download a model (shared storage)
modelexpress-cli model download meta-llama/Llama-3.3-70B-Instruct

# Verify
modelexpress-cli health

Without shared storage: use --no-shared-storage for gRPC streaming.
Air-gapped: with the model already in the local HF cache, HF_HUB_OFFLINE=1 modelexpress-cli model download <model-id> resolves it without network access.

Deployment

Kubernetes (Helm)

kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN="${HF_TOKEN}" -n <namespace>
helm install modelexpress ./helm --namespace modelexpress --create-namespace

Adjust values-production.yaml for your environment. Full configuration reference: helm/README.md.

P2P GPU Transfer (vLLM)

from modelexpress import register_modelexpress_loaders
register_modelexpress_loaders()
# vllm serve <model> --load-format modelexpress
# The mx load format is kept as a backward-compatible alias.

First instance loads from disk; subsequent instances receive via RDMA. P2P guide · Server setup.

ModelStreamer on Kubernetes

Load model weights directly from Azure Blob Storage, S3, or a PVC-backed local path through ModelStreamer. ModelStreamer examples · vLLM recipes.

Docker

docker compose -f docker/docker-compose.yml up --build

Configuration

Precedence (highest first): CLI flags → environment variables (MODEL_EXPRESS_*, MX_*) → YAML config file → built-in defaults.
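As a rough illustration of that precedence order, the sketch below resolves a setting by checking each layer in turn and returning the first value found. `resolve_setting` and the logical key names (`server_port`, `cache_directory`) are hypothetical; the real server maps flags, environment variables, and YAML keys itself, and that mapping is not shown here.

```python
def resolve_setting(name, cli_args, env, yaml_config, defaults):
    """Return the first value found, checking layers from highest precedence down."""
    for layer in (cli_args, env, yaml_config, defaults):
        if layer.get(name) is not None:
            return layer[name]
    raise KeyError(f"no value for setting {name!r}")


# Defaults taken from the table in this section; other values are made up.
defaults = {"server_port": 8001, "cache_directory": "./cache"}
yaml_config = {"server_port": 9000}
env = {"server_port": 9100}
cli_args = {}

port = resolve_setting("server_port", cli_args, env, yaml_config, defaults)
cache_dir = resolve_setting("cache_directory", cli_args, env, yaml_config, defaults)
```

With no CLI flag given, the environment value wins over the YAML value for the port, and the cache directory falls through to its default.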

Variable	Default	Description
`MODEL_EXPRESS_SERVER_PORT`	`8001`	gRPC port
`MODEL_EXPRESS_CACHE_DIRECTORY`	`./cache`	Cache root
`MX_METADATA_BACKEND`	(required)	`redis` or `kubernetes`
`REDIS_URL`	(required for `redis`)	Redis connection URL. Alternatively set `MX_REDIS_HOST` + `MX_REDIS_PORT`. No localhost fallback.
`MX_SERVER_ADDRESS`	`localhost:8001`	Client-side gRPC server address (P2P). Recommended.
`MODEL_EXPRESS_URL`	`localhost:8001`	Deprecated, pending removal in a future release. Still read by all client paths and takes precedence when both are set; keep setting it during the transition.
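The interaction between the two client-side address variables is easy to get wrong during the transition, so here is a small sketch of the lookup order the table describes: `MODEL_EXPRESS_URL` wins when both are set, then `MX_SERVER_ADDRESS`, then the `localhost:8001` default. `resolve_server_address` is a hypothetical helper, and treating an empty string as unset is an assumption.

```python
import os


def resolve_server_address(env=None):
    """Pick the gRPC server address using the order described in the table."""
    env = os.environ if env is None else env
    return (
        env.get("MODEL_EXPRESS_URL")      # deprecated, but still takes precedence
        or env.get("MX_SERVER_ADDRESS")   # recommended variable
        or "localhost:8001"               # documented default
    )


both_set = resolve_server_address(
    {"MODEL_EXPRESS_URL": "mx-old:8001", "MX_SERVER_ADDRESS": "mx-new:8001"})
only_new = resolve_server_address({"MX_SERVER_ADDRESS": "mx-new:8001"})
neither = resolve_server_address({})
```

If you set both variables, keep them pointing at the same address so that removing the deprecated one later does not change behavior.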

cargo run --bin config_gen -- --output model-express.yaml
cargo run --bin modelexpress-server -- --config model-express.yaml --validate-config

Full reference: docs/DEPLOYMENT.md.

CLI

modelexpress-cli health
modelexpress-cli model download <model-id>
modelexpress-cli model list
modelexpress-cli model validate <model-id>
modelexpress-cli model clear <model-id>

CLI Reference

Testing

cargo test
cargo test --test integration_tests
cargo run --bin test_client -- --test-model "google-t5/t5-small"
./run_integration_tests.sh
cargo bench

Documentation

Doc	Description
Deployment	Server/client config, Docker, K8s, P2P
Architecture	Components, gRPC, NIXL, FP8
CLI	Full CLI reference
Metadata	Redis keys, K8s CRD schema
Helm	Kubernetes configuration

Known Issues

NIXL_ERR_REMOTE_DISCONNECT — Source restarts invalidate rkeys. Flush Redis, redeploy.
Long source warmup — DeepSeek-V3 (DeepGemm, CUDA graphs) can take significant time; targets wait via coordination.
Large model gRPC stream — May not close automatically; use client timeout.
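For the stream that may not close on its own, one generic client-side guard is an idle timeout: stop waiting if no new chunk arrives within a set interval. The sketch below wraps any Python iterator this way using a background thread and a queue. `read_with_idle_timeout` is a hypothetical helper, not part of the ModelExpress client; where your gRPC client offers its own deadline or timeout option, prefer that.

```python
import queue
import threading

_END = object()  # sentinel marking a cleanly closed stream


def read_with_idle_timeout(stream, idle_timeout_s):
    """Yield items from stream; raise TimeoutError if none arrives in time."""
    items = queue.Queue()

    def pump():
        for item in stream:
            items.put(item)
        items.put(_END)

    # Daemon thread: a stalled stream must not keep the process alive.
    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            item = items.get(timeout=idle_timeout_s)
        except queue.Empty:
            raise TimeoutError(f"no data received for {idle_timeout_s}s") from None
        if item is _END:
            return
        yield item


def stalled_stream():
    """Simulated transfer that sends two chunks and then never closes."""
    yield b"chunk-0"
    yield b"chunk-1"
    threading.Event().wait()  # never set, so this blocks forever


received = []
timed_out = False
try:
    for chunk in read_with_idle_timeout(stalled_stream(), idle_timeout_s=0.2):
        received.append(chunk)
except TimeoutError:
    timed_out = True
```

Errors raised inside the stream are not forwarded in this sketch; they surface to the caller as a timeout.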

Roadmap

Priorities Under Development

P2P compile/warmup caching: torch.compile/deepGEMM cache for follower workers. Leader performs full warmup; followers consume caches and skip full warmup.
DRAM and NVMe-resident shard streaming: Stream shards across workers while keeping weights in DRAM and host local high-speed NVMe.
RL workloads: Explore fast P2P transfers to optimize RL refit phase and support for weight resharding.
Earlier weight availability: Bring weights to prefill earlier; identify prefill workers that can act as strong source nodes.
Multi-tier cache hierarchy: Promote and demote models across DRAM, NVMe, and PVC tiers based on access patterns.
Distributed sharded cache: Shard large models across nodes using consistent hashing and parallel shard assembly.
Training checkpoint management: Cache and reuse CUDA kernel compilations (torch.compile, deepGEMM) and CUDA graphs across restarts.
Metrics and observability: Cache hit rates, eviction frequency, transfer throughput, and P2P RDMA utilization via Prometheus/OpenTelemetry.
Predictive prefetching: Pre-warm caches from workload history or scheduling hints.
P2P transfer fault tolerance: Auto-recovery from stale rkeys on source restart; retry and fallback to storage loading.
Dynamic EPLB (Expert Parallelism Load Balancer): Rebalance MoE expert placement across GPUs at runtime via P2P transfer of expert weights as load shifts.
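The distributed sharded cache item above mentions consistent hashing. For readers unfamiliar with the technique, the sketch below is a minimal hash ring with virtual nodes: each shard maps to the next ring point at or after its hash, so adding a node moves only the shards that the new node takes over. `HashRing`, the node names, and the shard IDs are all hypothetical; this is the textbook technique, not a description of how ModelExpress will implement the roadmap item.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node contributes `vnodes` points to smooth out the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, shard_id: str) -> str:
        # First ring point clockwise from the shard's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(shard_id)) % len(self._ring)
        return self._ring[idx][1]


shards = [f"org/model/shard-{i:05d}" for i in range(200)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Shards whose owner changed when node-d joined the ring.
moved = [s for s in shards if before.owner(s) != after.owner(s)]
```

Every shard in `moved` now belongs to `node-d`; no shard is reshuffled between the three original nodes.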

Contributing

Contributions welcome. See CONTRIBUTING.md.

pip install pre-commit && pre-commit install
pre-commit run --all-files

Issues: GitHub Issues

License

Apache 2.0. See LICENSE.
