High-performance model inference serving framework for Python.
Nerva lets ML engineers define multi-model pipelines as plain Python functions, then serves them over a binary RPC protocol with process-isolated workers — no YAML configs, no C++ plugins, no leaving the Python ecosystem.
Existing serving frameworks (Triton, TorchServe) require learning framework-specific configuration, have limited DAG orchestration support, and make Python-native development difficult. Nerva takes a different approach:
- Pipeline = Python function. Define arbitrarily complex DAGs with trace(), cond(), and parallel(); the framework infers the computation graph automatically.
- Process isolation by default. Each model runs in its own worker process with a dedicated CUDA context. One model crashing or OOM-ing doesn't take down the service.
- IPC designed for ML payloads. Small tensors inline via ZeroMQ; large payloads (images, embeddings) go through shared memory with near-zero copy overhead.
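
The IPC split in the last bullet can be pictured as a size threshold: payloads below it travel inside the message, while larger ones are written to a shared-memory segment and only a reference is sent. The following is a minimal standard-library sketch of that idea, not Nerva's actual transport. The 64 KiB cutoff and the names INLINE_LIMIT, pack, and unpack are assumptions made for illustration, the ZeroMQ hop is omitted, and POSIX shared-memory semantics are assumed.

```python
from multiprocessing import shared_memory

# Hypothetical cutoff for this sketch; Nerva's real threshold is not stated here.
INLINE_LIMIT = 64 * 1024


def pack(payload: bytes) -> dict:
    """Build a small descriptor that could be sent over a message socket."""
    if len(payload) <= INLINE_LIMIT:
        # Small payload: carry the bytes inside the message itself.
        return {"kind": "inline", "data": payload}
    # Large payload: copy once into shared memory and send only a reference.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    name = shm.name
    shm.close()  # on POSIX the segment persists until it is unlinked
    return {"kind": "shm", "name": name, "size": len(payload)}


def unpack(msg: dict) -> bytes:
    """Recover the payload on the receiving side."""
    if msg["kind"] == "inline":
        return msg["data"]
    shm = shared_memory.SharedMemory(name=msg["name"])
    try:
        # bytes() copies out of the segment; a real consumer could view
        # shm.buf directly (for example as an array) to avoid this copy.
        return bytes(shm.buf[: msg["size"]])
    finally:
        shm.close()
        shm.unlink()  # in this sketch the receiver frees the segment
```

The descriptor stores the payload size explicitly because the operating system may round a segment up to a page boundary, so the segment's own size cannot be trusted as the payload length.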
- Python 3.11+
- uv for environment and dependency management.
uv sync --dev

Optional backend extras:

uv sync --dev --extra pytorch
uv sync --dev --extra vllm

Terminal 1 (start an echo server):

uv run uvicorn examples.echo_server:app --host 127.0.0.1 --port 8080

Terminal 2 (send a request):

uv run python scripts/demo_client.py --url http://127.0.0.1:8080 --pipeline echo --value "hello"

Expected output:

Calling http://127.0.0.1:8080/rpc/echo ...
Result: {'echo': 'hello'}
Health and model listing:
curl http://127.0.0.1:8080/v1/health
curl http://127.0.0.1:8080/v1/models

Subclass Model and implement load() + infer():

from typing import Any

from nerva import Model, build_nerva_app, model, trace


class EchoModel(Model):
    def load(self) -> None:
        pass

    async def infer(self, inputs: dict[str, Any]) -> dict[str, Any]:
        return {"echo": inputs["value"]}


echo = model("echo", EchoModel, backend="pytorch", device="cpu")
graph = trace(lambda inp: echo(inp))
# Option 1: ASGI factory (recommended — works with uvicorn, tests, etc.)
app = build_nerva_app({"echo": graph})
# Option 2: Blocking startup
# from nerva import serve
# serve({"echo": graph}, host="0.0.0.0", port=8080)

Run the ASGI app:

uv run uvicorn your_module:app --port 8080

from nerva import cond, model, parallel, trace

text_enc = model("text_enc", TextEncoder, backend="pytorch", device="cpu")
img_enc = model("img_enc", ImageEncoder, backend="pytorch", device="cuda:0")
fusion = model("fusion", Fusion, backend="pytorch", device="cuda:1")

def pipeline(inp):
    t, i = parallel(
        lambda: text_enc({"text": inp["text"]}),
        lambda: img_enc({"image": inp["image"]}),
    )
    return fusion({"text_feat": t["features"], "img_feat": i["features"]})

graph = trace(pipeline)
app = build_nerva_app({"mm": graph})

| Endpoint | Method | Description |
|---|---|---|
| /rpc/{pipeline_name} | POST | Binary RPC inference |
| /v1/health | GET | Health check |
| /v1/models | GET | List loaded models |
| /metrics | GET | Prometheus metrics |

- With ASGI lifespan (e.g. uvicorn): workers start on app startup and shut down on app shutdown.
- Without lifespan (e.g. httpx.ASGITransport in tests): workers start lazily on first request. Call await app.shutdown() after use for deterministic cleanup.
- Parent-process watchdog: if the parent process exits unexpectedly, workers self-terminate to prevent orphan processes.
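
The parent-process watchdog described above can be approximated with a background thread that polls the parent PID and reacts once it changes. This is a sketch of the general technique, not Nerva's implementation; parent_changed, start_parent_watchdog, the polling interval, and the on_orphaned callback are hypothetical names chosen for illustration.

```python
import os
import threading
import time
from typing import Callable, Optional


def parent_changed(original_ppid: int) -> bool:
    """True once this process no longer has its original parent."""
    return os.getppid() != original_ppid


def start_parent_watchdog(
    on_orphaned: Callable[[], None],
    original_ppid: Optional[int] = None,
    interval_seconds: float = 1.0,
) -> threading.Thread:
    """Poll the parent PID in a daemon thread; call on_orphaned once it changes."""
    # Capture the parent PID at startup unless the caller supplies one.
    ppid = os.getppid() if original_ppid is None else original_ppid

    def _watch() -> None:
        while not parent_changed(ppid):
            time.sleep(interval_seconds)
        on_orphaned()

    thread = threading.Thread(target=_watch, name="parent-watchdog", daemon=True)
    thread.start()
    return thread
```

A worker might call start_parent_watchdog(lambda: os._exit(1)) right after it starts, so that it exits without running further cleanup once the parent is gone. Polling is the simplest portable option; platform-specific mechanisms exist but are outside the scope of this sketch.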

| File | Description |
|---|---|
| examples/echo_server.py | Minimal runnable E2E server |
| examples/01_single_model.py | Single-model serving flow |
| examples/02_multi_model_pipeline.py | Multi-stage pipeline |
| examples/03_parallel_dag.py | parallel / cond flow composition |
| examples/mm_vllm_server.py | Multimodal + vLLM benchmark DAG service |
| scripts/demo_client.py | Standalone Binary RPC client |

Full runbook: docs/plans/2026-03-02-phase7-e2e-benchmark-runbook.md
Benchmarks compare Nerva against native vLLM and Triton on the same workload. Startup scripts fail fast if real backend dependencies are missing.
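
A fail-fast dependency check of the kind mentioned above can be as small as probing for importable modules before any server starts. The sketch below shows the idea with the standard library; it is not the code used by the benchmark scripts, and missing_modules and require_modules are hypothetical helper names. It only handles top-level module names.

```python
import importlib.util
import sys


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names: list[str]) -> None:
    """Exit immediately with a clear message if any dependency is absent."""
    missing = missing_modules(names)
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"missing backend dependencies: {', '.join(missing)}")
```

Calling require_modules(["vllm"]) at the top of a startup script would stop the run before any benchmark traffic is generated against a stub or half-configured backend.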
# Start Nerva
MM_VLLM_MODEL_PATH=<MODEL_PATH> \
uv run uvicorn examples.mm_vllm_server:app --host 127.0.0.1 --port 8080
# Start vLLM
uv run python scripts/bench/infra/start_vllm_server.py --model <MODEL_PATH> --host 127.0.0.1 --port 8001
uv run python scripts/bench/infra/wait_service_ready.py --kind vllm --url http://127.0.0.1:8001/health --timeout-seconds 120
# Start Triton (embeds vLLM in-process, no separate vLLM needed)
uv run python scripts/bench/infra/prepare_triton_repo.py --output /tmp/mm_vllm-triton-repo --vllm-model <MODEL_PATH>
uv run python scripts/bench/infra/start_triton_server.py --model-repo /tmp/mm_vllm-triton-repo --http-port 8002 --grpc-port 8003 --metrics-port 8004
uv run python scripts/bench/infra/wait_service_ready.py --kind triton --url http://127.0.0.1:8002/v2/health/ready --timeout-seconds 300
# Run benchmark matrix
uv run python scripts/bench/run_bench.py \
--target nerva --target vllm --target triton \
--concurrency-levels 1,32,128,512,1000 \
--warmup-seconds 60 --sample-seconds 300 \
--require-real-backend

See docs/design/性能测试指南.md for detailed execution steps, result interpretation, and metrics-based bottleneck diagnosis.

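
The wait_service_ready.py steps above block until a health URL responds or a timeout expires. The underlying idea is a poll-with-deadline loop, sketched here with the standard library only. This is not the script's actual code; wait_until_ready, http_probe, and their parameters are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 1.0,
) -> bool:
    """Call probe repeatedly until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)
```

For example, wait_until_ready(lambda: http_probe("http://127.0.0.1:8001/health"), timeout_seconds=120) would mirror the vLLM wait step. Passing the probe as a callable keeps the loop testable without a running server.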
export PATH="$HOME/.local/bin:$PATH"
uv run ruff check src/ tests/ examples/ scripts/ # lint
uv run mypy # type check
uv run pytest tests/ -v # test

| Directory | Content |
|---|---|
| docs/design/ | Architecture, module design, testing and benchmarking guides (Chinese) |
| docs/plans/ | Implementation plans, ADRs, protocol specs, roadmap |
| docs/spikes/ | Technical spike reports (IPC benchmarks, trace prototype, batcher benchmarks) |

See LICENSE.