feat(models): vllm deployment in docker by benmccown · Pull Request #232 · NVIDIA-NeMo/nemo-platform

benmccown · 2026-06-09T15:43:40Z

Add vLLM as a deployment engine (docker backend)

Summary

Adds vLLM as a peer inference engine to NIM in the models service, deployable through
the existing docker service backend. A user sets engine: "vllm" on a
ModelDeploymentConfig and gets a deployed, IGW-routable, OpenAI-compatible endpoint —
including LoRA adapter support with hot-reload.

This is the first implementation pass of the vLLM RFC: the base schema/interface changes
and the docker backend vLLM path. The k8s backend vLLM path and the generic engine are
out of scope here.

Core changes

Schema: `engine` + `model_spec` + `executor_config`

ModelDeploymentConfig now carries a top-level engine discriminant (nim / vllm /
generic) plus two grouped configs shared across engines:

model_spec — what model to serve and how (model source, chat template, tool-call
config, lora_enabled); executor-invariant.
executor_config — compute + container settings (gpu, disk_size, image,
additional_envs, additional_args); shared by the docker and k8s container executors.

Existing NIM deployments continue to work via engine: "nim". The k8s NIM-operator path
is preserved unchanged (its operator-specific fields ride along on executor_config).

Docker vLLM backend

New vllm_compiler produces the image, vllm serve arg vector, and env for the
container: tensor-parallel auto-tuning from GPU count + model architecture, served-model
name, chat template, and additional_args passthrough.
The docker creation reconciler dispatches on config.engine, reusing the existing
GPU/port/volume/weight-puller plumbing; only the final container shape differs from NIM.
Engine-aware readiness probe (vLLM /health vs NIM /v1/health/ready).
New per-backend default_vllm_image / default_vllm_image_tag settings.

LoRA hot-reload on vLLM

vLLM deployments get --enable-lora + the filesystem-resolver plugin env, watching a
shared adapter directory.
The existing adapter sidecar (engine-agnostic) delivers enabled adapters into that
directory; the controller wires the sidecar's env for vLLM and pre-creates the watched
directory so vLLM doesn't crash-loop on startup.
The sidecar rewrites each delivered adapter's base_model_name_or_path to the locally
served model path so vLLM's resolver matches and auto-loads it.

SDK / CLI / docs

Regenerated OpenAPI spec, Python SDK, and CLI for the reshaped deployment config
(--engine, --model-spec, --executor-config).
Updated docs/run-inference/tutorials/deploy-models.md with a vLLM section (base +
LoRA) and refreshed the existing examples to the new config shape.

Testing

Unit tests for the vLLM compiler, docker backend engine dispatch, LoRA sidecar wiring,
and the adapter base-model rewrite; existing models/customizer/IGW suites updated to the
new shape.
Manually validated end-to-end on a GPU pod (docker-in-docker): deployed Qwen via vLLM
and served inference through IGW; deployed a base model with a real HuggingFace LoRA
adapter and confirmed the sidecar delivered it and vLLM served adapter-applied responses.

Summary by CodeRabbit

Release Notes

New Features
- Added support for multiple inference engines (NIM, vLLM, generic) in model deployments.
- Introduced engine-agnostic deployment configuration with engine, model_spec, and executor_config fields.
Breaking Changes
- Updated deployment configuration API: nim_deployment parameter replaced with separate engine, model_spec, and executor_config fields in create/update requests.
- CLI commands updated: --nim-deployment option replaced with --engine, --executor-config, and --model-spec options.

github-actions · 2026-06-09T15:45:35Z

Documentation preview is ready

Preview: https://nvidia-nemo.github.io/nemo-platform/pr-preview/pr-232/pr-232/

Built from 1ff296d in workflow run.

This preview is deployed from this PR branch, updates when docs changes are pushed, and will be removed when the PR closes.

benmccown

Per-file review notes for everything under services/core/models/src/nmp/core/models/. Read order: schemas.py first (the rest fans out from there), then controllers/backends/docker/vllm_compiler.py and creation_reconciler.py. Autogenerated openapi/ and sdk/ dirs can be ignored.

benmccown

Per-file review notes for everything under services/core/models/src/nmp/core/models/. Read order: schemas.py first (the rest fans out from there), then controllers/backends/docker/vllm_compiler.py and creation_reconciler.py. Autogenerated openapi/ and sdk/ dirs can be ignored.

benmccown · 2026-06-09T15:55:28Z

Reviewer guide

The diff is large but mostly mechanical fan-out from one schema change. Suggested reading order:

services/core/models/src/nmp/core/models/schemas.py — start here, specifically ModelDeploymentConfig and the new Engine / ModelDeploymentConfigModelSpec / ContainerExecutorConfig types. The old nim_deployment blob is replaced by a top-level engine discriminant plus two grouped configs:
- model_spec — what model, served how (model source, chat template, tool-call config, lora_enabled); executor-invariant.
- executor_config — compute + container (gpu, disk_size, image, additional_envs, additional_args, plus k8s-operator passthrough fields).
  Everything else extrapolates from this.
controllers/backends/docker/vllm_compiler.py (new) — the heart of the feature: image, vllm serve args (TP auto-tune, served-model name, chat template, additional_args), and env (incl. LoRA).
controllers/backends/docker/creation_reconciler.py — where the docker backend dispatches on engine and the vLLM + LoRA-sidecar wiring lives.
The remaining models-service files are small adaptations to the new shape — see the per-file review comments for a "what changed and why" on each.

Can be ignored

openapi/ and sdk/ — fully autogenerated from the schema change (make refresh-openapi + Stainless).
The vendored/generated CLI under packages/nemo_platform_ext/.../cli/ and sdk/.
Test files — updated to the new config shape; the net-new behavioral tests worth a look are test_vllm_compiler.py, the vLLM/LoRA cases in test_docker_backend.py, and the rewrite cases in test_adapters_controller.py.

Worth a careful look

Schema-split correctness — does each field land in the right group (model_spec vs executor_config)?
controllers/backends/common.py DeploymentConfigView — the seam that lets the existing NIM compilers (docker + k8s) keep working under the new schema without a field-by-field rewrite.
LoRA wiring — three coupled pieces: sidecar env + watched-dir pre-create (creation_reconciler.py) and the adapter base-name rewrite (sidecars/adapters/main.py).

Scope

First pass: base schema + docker backend vLLM (incl. LoRA). The k8s backend vLLM path and the generic engine are intentionally out of scope here.

I've left per-file In this file: ... review comments on everything under services/core/models/src/nmp/core/models/.

github-actions · 2026-06-09T15:57:04Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	19025/25079	75.9%	62.4%
Integration Tests	12032/23851	50.4%	25.6%

coderabbitai · 2026-06-09T15:58:56Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Deployment config contracts now require engine, model_spec, and executor_config instead of nim_deployment. Runtime services, customizer paths, and tests were updated to read the split shape, including vLLM-specific compilation, health probing, and adapter rewrite handling.

Changes

Deployment config migration

Layer / File(s)	Summary
Contract surface `openapi/ga/...`, `openapi/openapi.yaml`, `packages/.../deployment_configs/__init__.py`, `docs/cli/reference.mdx`	OpenAPI schemas, CLI options, and CLI docs now require `engine`, `model_spec`, and `executor_config` and remove `nim_deployment`.
Core runtime data flow `services/core/models/src/nmp/core/models/...`	Shared schemas, services, controllers, backends, and sidecars now use `model_spec`/`executor_config`, add engine-aware paths, and compile vLLM-specific container inputs.
Customizer and caller wiring `services/customizer/...`, `packages/nmp_testing/...`, `services/core/inference-gateway/tests/...`	Customizer jobs, e2e helpers, and inference-gateway callers now build deployment configs with the new split schema.
Validation suites `services/core/models/tests/...`, `services/customizer/tests/...`	Integration and unit tests were migrated to the new config shape and expanded for vLLM, override-config, and adapter rewrite behavior.
Auxiliary removals `services/core/models/script/README.md`	The deployment migration guide README was removed.

Sequence Diagram(s)

sequenceDiagram
  participant API
  participant Service
  participant Backend
  participant Sidecar
  API->>Service: create/update with engine, model_spec, executor_config
  Service->>Backend: persist normalized config
  Backend->>Backend: select NIM or vLLM path
  Backend->>Sidecar: configure LoRA adapter rewrite

Possibly related PRs

NVIDIA-NeMo/nemo-platform#134: Uses a nim_deployment block in deployment config JSON, which this PR replaces with engine, model_spec, and executor_config.

Suggested reviewers

mckornfield
matthewgrossman
albcui

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch AIRCORE-698-models-vllm-docker/bmccown

coderabbitai

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

services/customizer/src/nmp/customizer/app/jobs/compiler.py (1)

334-339: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require explicit lora_enabled=True for LoRA string refs.

Line 334 only rejects explicit False; unset/None currently passes. That allows LoRA training to proceed with a config that may not load adapters.

Proposed fix

-    if is_lora and resolved_config.model_spec and resolved_config.model_spec.lora_enabled is False:
+    if is_lora and (
+        resolved_config.model_spec is None
+        or resolved_config.model_spec.lora_enabled is not True
+    ):
         raise PlatformJobCompilationError(
-            f"deployment_config references '{dc}' which has lora_enabled=false, "
+            f"deployment_config references '{dc}' which does not have lora_enabled=true, "
             "but this is a LoRA training job. The deployment would not load LoRA adapters. "
             "Use a deployment config with lora_enabled=true, or provide inline deployment parameters."
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/customizer/src/nmp/customizer/app/jobs/compiler.py` around lines 334
- 339, The check for LoRA deployments currently only rejects explicit False but
lets None/unset pass; update the condition in the compilation path that uses
is_lora and resolved_config.model_spec.lora_enabled so that LoRA string refs
require lora_enabled to be True (i.e., treat anything not True as invalid) and
raise the same PlatformJobCompilationError (include the dc context) when
lora_enabled is not True; locate the check referencing is_lora,
resolved_config.model_spec.lora_enabled, and PlatformJobCompilationError and
change the predicate to fail when lora_enabled is not True.

services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py (1)

261-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve or recompute model_entity_id on update.

If the caller omits model_entity_id, this version stores None even when request.model_spec still identifies the model. That breaks the config→model association used later by deployment resolution and LoRA filtering. Reuse the create-path lookup here, then fall back to current.model_entity_id.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py`
around lines 261 - 271, When constructing the new ModelDeploymentConfigEntity on
update, ensure model_entity_id is preserved or recomputed instead of
unconditionally using request.model_entity_id; if request.model_entity_id is
missing, run the same model lookup used in the create path using
request.model_spec and use that result, and if that lookup yields nothing fall
back to current.model_entity_id so the config→model association remains intact
(update the model_entity_id argument passed to ModelDeploymentConfigEntity
accordingly).

🧹 Nitpick comments (2)

services/core/models/tests/integration/test_models.py (1)

735-739: ⚡ Quick win

Assert engine round-trip in CRUD assertions.

Line 737 and Line 774 validate nested fields, but not the top-level discriminant. Add engine == "nim" checks on create/update/read responses to catch contract regressions early.

Patch

@@
     assert created["name"] == test_name
     assert created["description"] == "A deployment configuration for LLM inference"
+    assert created["engine"] == "nim"
     assert created["executor_config"]["gpu"] == 2
     assert created["model_spec"]["lora_enabled"] is True
@@
     assert updated["description"] == "Updated deployment configuration"
+    assert updated["engine"] == "nim"
     assert updated["executor_config"]["gpu"] == 4
     assert updated["entity_version"] == 2
@@
     latest = response.json()
     assert latest["entity_version"] == 2
+    assert latest["engine"] == "nim"

Also applies to: 773-775, 782-784

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/core/models/tests/integration/test_models.py` around lines 735 -
739, The CRUD test is missing assertions that the top-level discriminant
"engine" round-trips; update the test assertions to include checks like
created["engine"] == "nim", updated["engine"] == "nim", and (the read/fetched
response variable) e.g. fetched["engine"] == "nim" alongside the existing nested
assertions (e.g. created["name"], created["executor_config"],
created["model_spec"]). Locate the places where the test uses the variables
created and updated (and the read/fetched response) and add these engine
equality asserts to catch contract regressions.

docs/run-inference/tutorials/deploy-models.md (1)

1212-1214: ⚡ Quick win

Add a Next Steps section at the end of this page.

Link to related docs (config reference, API reference, troubleshooting) to complete the tutorial flow.

As per coding guidelines, "Include 'Next Steps' section at the end with cross-links to related documentation content".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/run-inference/tutorials/deploy-models.md` around lines 1212 - 1214, Add
a "Next Steps" section at the end of this page (after the
vLLM/chat_template_kwargs note) that provides cross-links to the configuration
reference, API reference, and troubleshooting guide to complete the tutorial
flow; mention that chat_template_kwargs applies only to vLLM-backed models and
provide links to the config reference, API reference, and troubleshooting docs
for non-vLLM providers (e.g., OpenAI/NVIDIA Build) so readers can follow up on
provider-specific parameters and debugging steps.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/run-inference/tutorials/deploy-models.md`:
- Around line 549-550: The CLI example is missing the required --name flag for
the deployment; update the command that starts with "nemo inference gateway
provider post v1/chat/completions qwen3-lora-deployment --body" to include
"--name qwen3-lora-deployment" (matching the rest of the page) in the
appropriate position so the example is runnable and consistent with other
snippets.

In `@openapi/ga/individual/platform.openapi.yaml`:
- Around line 20153-20158: The schema for model_provider currently allows any
string but the description limits valid values to "hf" and "nmp"; update the
OpenAPI schema for the model_provider property to enforce those values by adding
an enum: ["hf","nmp"] (keep the existing title/description/type/maxLength) so
requests with invalid providers fail validation; refer to the model_provider
property in the YAML to apply this change.
- Around line 12422-12427: The OpenAPI schema currently doesn't enforce that
executor_config.image_name is present when engine == "generic"; update the
request schema that defines engine and executor_config to add a conditional
requirement (e.g., use oneOf/anyOf with a subschema where engine: {const:
"generic"} and required: ["executor_config"] plus a schema that enforces
executor_config.image_name is present) so any payload with engine: "generic"
must include executor_config.image_name; apply the same conditional change
wherever executor_config.image_name/engine pair is defined (the other repeated
schemas that include image_name) to keep behavior consistent.

In `@openapi/ga/openapi.yaml`:
- Around line 12435-12439: The schema for additional_envs currently uses
additionalProperties: true which permits non-string values; update the
additional_envs schema by replacing additionalProperties: true with
additionalProperties: { type: string } (i.e., constrain additionalProperties to
string values) so that the additional_envs object only allows string-valued
environment variables; change the additional_envs definition accordingly where
it appears.
- Around line 12422-12427: The OpenAPI schema currently documents that
image_name is "Required for engine='generic'" only in prose, but the request
schemas don't enforce this conditional; update the deployment-level schema for
the object containing engine and image_name so that it includes a conditional
(oneOf/if-then or discriminator) requiring image_name (and any related image
fields) when engine == "generic" — i.e., add an if:
{properties:{engine:{const:"generic"}}} then: {required:["image_name"]} (or
equivalent oneOf alternatives) for the same schema that declares engine and
image_name so generated clients cannot omit image_name when engine is generic;
ensure you apply the same change in the other two locations noted (around the
blocks at lines ~13712-13725 and ~27984-27997).
- Around line 20153-20158: The model_provider property currently allows any
string; constrain it to supported values by replacing the free-form type with an
enum (or $ref to a provider schema) that lists "hf" and "nmp" so client
validation and generated SDK types reflect the actual allowed providers; locate
the model_provider schema and change it to an enum of ["hf","nmp"] (or point
model_provider to a shared Provider enum schema) while keeping any existing
metadata like title/description.

In `@openapi/openapi.yaml`:
- Around line 12435-12439: The schema for additional_envs currently allows
arbitrary JSON (additionalProperties: true) but env vars must be strings; update
the additional_envs schema so additionalProperties is a string schema (e.g.,
additionalProperties: { type: "string" }) to restrict values to strings, or
alternately add explicit documentation and a property like additional_envs: {
additionalProperties: { type: "string" }, description: "... JSON-encoded values
..." }—make this change to the additional_envs schema to ensure only string
values are accepted.
- Around line 12422-12428: The schema currently documents
ContainerExecutorConfig.image_name as required for engine='generic' but doesn’t
enforce it; add a JSON Schema conditional at the OpenAPI object that contains
the engine and executor_config properties (use an if:
{properties:{engine:{const:"generic"}}} then: {required:["executor_config"],
properties:{executor_config:{required:["image_name"]}}} or equivalent) so that
when engine equals "generic" the executor_config.image_name field is required;
reference the ContainerExecutorConfig schema and the executor_config.image_name
property when adding the conditional to ensure the validator enforces this rule.
- Around line 13716-13728: Update the deployment config schemas to enforce a
resolvable model source by adding a oneOf invariant at the
CreateModelDeploymentConfigRequest/UpdateModelDeploymentConfigRequest level:
require either the top-level model_entity_id property or a model_spec that
contains the identity/source fields used to resolve weights (modify
ModelDeploymentConfigModelSpec to mark those identity/source properties as
required, e.g., the fields your system uses such as
model_source/model_uri/model_id or equivalent), and add a oneOf that validates
(1) model_entity_id is present OR (2) model_spec has the required
identity/source properties; apply the same change for the other affected ranges
(around lines 20130–20184 and 27988–28007).

In `@services/core/models/script/v2_deployment_migration.py`:
- Around line 794-801: The migration currently omits the lora_enabled field when
its value is None, causing ModelDeploymentConfigModelSpec.lora_enabled to
default to False; update _translate_nim_deployment() (where nim is processed and
model_spec_param is built) to copy the lora_enabled key based on key presence
(e.g., check "lora_enabled" in nim) rather than nim.get(...) != None so that if
nim["lora_enabled"] is explicitly None the field is preserved in
create_kwargs["model_spec"], preventing accidental coercion to False.

In `@services/core/models/src/nmp/core/models/app/utils.py`:
- Around line 167-168: Guard against missing nested attributes on
model_deployment_config before dereferencing: check that
model_deployment_config.executor_config and model_deployment_config.model_spec
are truthy (or use getattr with defaults) before calling
is_multi_llm_image(model_deployment_config.executor_config.image_name) and
before accessing model_deployment_config.model_spec.model_name; update the
conditional(s) that reference executor_config.image_name and
model_spec.model_name (the occurrence around is_multi_llm_image and the similar
check at line ~173) to short-circuit when executor_config or model_spec is None
so AttributeError cannot be raised.

In `@services/core/models/src/nmp/core/models/controllers/backends/common.py`:
- Around line 16-47: The DeploymentConfigView dataclass and the
deployment_config_view(config: Any) parameter use broad Any types; replace each
Any with precise Optional[...] types (e.g., model_type: Optional[str],
model_namespace: Optional[str], model_name: Optional[str], model_revision:
Optional[str], model_provider: Optional[str], chat_template: Optional[dict] or
Optional[str] as appropriate, tool_call_config: Optional[dict], disk_size:
Optional[int|str], image_name: Optional[str], image_tag: Optional[str],
additional_envs: Optional[dict[str,str]] or Optional[list[tuple[str,str]]],
k8s_nim_operator_config: Optional[dict], override_config: Optional[dict]) while
keeping lora_enabled: bool and gpu: int and additional_args:
Optional[list[str]]; also change the function signature
deployment_config_view(config: Any) to deployment_config_view(config:
Optional[Mapping[str, Any]] or a typed Protocol/ModelDeploymentConfig) so
callers get static safety and update imports/typing hints accordingly.

In `@services/core/models/src/nmp/core/models/controllers/provider_reconciler.py`:
- Around line 1094-1099: The code builds HF artifact URLs using model_spec
without validating its fields, which can produce invalid URLs like
"hf://None/None" or trigger the broad exception path; update the block around
model_spec usage and the call to parse_model_name_revision so you first check
that config.model_spec exists and that model_spec.model_namespace,
model_spec.model_name, and model_spec.model_revision are non-empty/nonnull (or
meet expected format) before calling parse_model_name_revision or constructing
"hf://" URLs, and if validation fails log a clear warning/error and skip or
short-circuit artifact metadata construction instead of catching and silencing
exceptions; reference model_spec and parse_model_name_revision to locate where
to add these checks and the hf:// URL construction to guard.

In `@services/core/models/src/nmp/core/models/schemas.py`:
- Around line 1366-1372: The model allows engine="generic" while permitting
executor_config.image_name to be None; add an API-level validation in the
Pydantic model that owns the engine, model_spec and executor_config fields
(referencing engine and executor_config / ContainerExecutorConfig) to reject
requests where engine == "generic" and executor_config.image_name is empty/None
by raising a validation error; implement this as a validator or root_validator
on the same model class (the schema around these fields) and apply the same
validator to the other identical model block (the block at 1388-1394) so both
places enforce that generic requires an explicit image_name.

In `@services/core/models/src/nmp/core/models/sidecars/adapters/main.py`:
- Around line 146-151: The vLLM base-model rewrite failure currently logs a
warning but still proceeds to mark the adapter up-to-date and swap it into
place; change the error handling so that when reading/parsing
adapter_config.json (cfg_path) or the rewrite operation fails (the except block
around json.load and the corresponding failure points at the other locations
mentioned: the blocks at ~153-158 and ~228-230), the code returns early and does
NOT call the functions that write metadata or perform the swap (references to
cfg, adapter_dir, any write/update_metadata function and any
swap/rename/atomic_swap_adapter call); ensure the early-return prevents
execution of the metadata write/update and swap logic so a broken adapter is not
marked up-to-date and automatic retry can still occur.

In `@services/core/models/tests/unit/controllers/test_backend_config_fields.py`:
- Around line 32-45: The MagicMock assigned to config.model_spec causes unset
attributes to be truthy and can trigger optional code paths; explicitly set
optional fields (e.g., config.model_spec.tool_call_config = None and any other
optional model_spec attributes that tests rely on) to None instead of leaving
them as mocks, and ensure any other optional executor-related fields remain
explicitly None (e.g., config.executor_config.k8s_nim_operator_config already
None) so the fixture returns a config with the intended None values rather than
truthy MagicMocks.

In `@services/core/models/tests/unit/controllers/test_nimservice_compiler.py`:
- Around line 51-55: The MagicMock used for config.model_spec leaves optional
attributes truthy and can cause unwanted branches to run; update every place
that sets config.model_spec = MagicMock() (including the full_config fixture and
the inline setups around test_nimservice_compiler lines referenced) to
explicitly set the optional fields: config.model_spec.lora_enabled = False,
config.model_spec.model_name = None, config.model_spec.model_namespace = None,
and config.model_spec.model_revision = None so tests don't accidentally trigger
optional compiler/tool-call logic.

---

Outside diff comments:
In
`@services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py`:
- Around line 261-271: When constructing the new ModelDeploymentConfigEntity on
update, ensure model_entity_id is preserved or recomputed instead of
unconditionally using request.model_entity_id; if request.model_entity_id is
missing, run the same model lookup used in the create path using
request.model_spec and use that result, and if that lookup yields nothing fall
back to current.model_entity_id so the config→model association remains intact
(update the model_entity_id argument passed to ModelDeploymentConfigEntity
accordingly).

In `@services/customizer/src/nmp/customizer/app/jobs/compiler.py`:
- Around line 334-339: The check for LoRA deployments currently only rejects
explicit False but lets None/unset pass; update the condition in the compilation
path that uses is_lora and resolved_config.model_spec.lora_enabled so that LoRA
string refs require lora_enabled to be True (i.e., treat anything not True as
invalid) and raise the same PlatformJobCompilationError (include the dc context)
when lora_enabled is not True; locate the check referencing is_lora,
resolved_config.model_spec.lora_enabled, and PlatformJobCompilationError and
change the predicate to fail when lora_enabled is not True.

---

Nitpick comments:
In `@docs/run-inference/tutorials/deploy-models.md`:
- Around line 1212-1214: Add a "Next Steps" section at the end of this page
(after the vLLM/chat_template_kwargs note) that provides cross-links to the
configuration reference, API reference, and troubleshooting guide to complete
the tutorial flow; mention that chat_template_kwargs applies only to vLLM-backed
models and provide links to the config reference, API reference, and
troubleshooting docs for non-vLLM providers (e.g., OpenAI/NVIDIA Build) so
readers can follow up on provider-specific parameters and debugging steps.

In `@services/core/models/tests/integration/test_models.py`:
- Around line 735-739: The CRUD test is missing assertions that the top-level
discriminant "engine" round-trips; update the test assertions to include checks
like created["engine"] == "nim", updated["engine"] == "nim", and (the
read/fetched response variable) e.g. fetched["engine"] == "nim" alongside the
existing nested assertions (e.g. created["name"], created["executor_config"],
created["model_spec"]). Locate the places where the test uses the variables
created and updated (and the read/fetched response) and add these engine
equality asserts to catch contract regressions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19fcd48a-9eea-47d4-8038-c3805410bd46

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6aba7 and c70bc52.

⛔ Files ignored due to path filters (16)

sdk/python/nemo-platform/.nmpcontext/openapi.yaml is excluded by !sdk/**
sdk/python/nemo-platform/.nmpcontext/stainless.yaml is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/cli/commands/api/inference/deployment_configs/__init__.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/resources/inference/api.md is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/resources/inference/deployment_configs/deployment_configs.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/__init__.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config_param.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/deployment_config_create_params.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/deployment_config_update_params.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/engine.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec_param.py is excluded by !sdk/**
sdk/python/nemo-platform/tests/api_resources/inference/test_deployment_configs.py is excluded by !sdk/**
sdk/stainless.yaml is excluded by !sdk/**

📒 Files selected for processing (53)

docs/cli/reference.md
docs/run-inference/tutorials/deploy-models.md
docs/set-up/config-reference.md
openapi/ga/individual/platform.openapi.yaml
openapi/ga/openapi.yaml
openapi/openapi.yaml
packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/api/inference/deployment_configs/__init__.py
services/core/inference-gateway/tests/integration/test_inference.py
services/core/inference-gateway/tests/integration/test_middleware_pipeline.py
services/core/models/script/v2_deployment_migration.py
services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py
services/core/models/src/nmp/core/models/api/service/model_entity_service.py
services/core/models/src/nmp/core/models/api/v2/models.py
services/core/models/src/nmp/core/models/app/utils.py
services/core/models/src/nmp/core/models/controllers/backends/common.py
services/core/models/src/nmp/core/models/controllers/backends/docker/backend.py
services/core/models/src/nmp/core/models/controllers/backends/docker/config.py
services/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.py
services/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.py
services/core/models/src/nmp/core/models/controllers/backends/k8s_nim_operator/backend.py
services/core/models/src/nmp/core/models/controllers/backends/k8s_nim_operator/nimservice_compiler.py
services/core/models/src/nmp/core/models/controllers/models_controller.py
services/core/models/src/nmp/core/models/controllers/provider_reconciler.py
services/core/models/src/nmp/core/models/entities.py
services/core/models/src/nmp/core/models/schemas.py
services/core/models/src/nmp/core/models/sidecars/adapters/main.py
services/core/models/tests/integration/test_chat_template_tool_calling.py
services/core/models/tests/integration/test_model_deployment_config_service_integration.py
services/core/models/tests/integration/test_model_deployment_service_integration.py
services/core/models/tests/integration/test_model_entity_service_integration.py
services/core/models/tests/integration/test_model_lora_filter_route.py
services/core/models/tests/integration/test_models.py
services/core/models/tests/integration/test_models_auth_propagation.py
services/core/models/tests/integration/test_models_controller.py
services/core/models/tests/integration/test_models_with_auth.py
services/core/models/tests/unit/api/test_deployment_configs_api.py
services/core/models/tests/unit/common/test_utils.py
services/core/models/tests/unit/controllers/backends/docker/test_vllm_compiler.py
services/core/models/tests/unit/controllers/test_backend_config_fields.py
services/core/models/tests/unit/controllers/test_docker_backend.py
services/core/models/tests/unit/controllers/test_k8s_nim_operator_backend.py
services/core/models/tests/unit/controllers/test_models_controller_unit.py
services/core/models/tests/unit/controllers/test_nimservice_compiler.py
services/core/models/tests/unit/controllers/test_provider_reconciler.py
services/core/models/tests/unit/sidecars/test_adapters_controller.py
services/core/models/tests/unit/test_model_deployment_config_service_unit.py
services/core/models/tests/unit/test_model_deployment_service_unit.py
services/core/models/tests/unit/test_model_entity_service_unit.py
services/core/models/tests/unit/test_schemas_override_config.py
services/customizer/src/nmp/customizer/app/jobs/compiler.py
services/customizer/src/nmp/customizer/tasks/model_entity/run.py
services/customizer/tests/tasks/model_entity/test_model_entity.py
services/customizer/tests/test_compiler.py

mckornfield

nothing crazy stands out, some choices I found sorta random but nothing blocking for me

mckornfield · 2026-06-09T16:06:28Z

+
+Serve any HuggingFace-format model with [vLLM](https://docs.vllm.ai/) by setting `engine: "vllm"` on the deployment config. The model is registered through the Files service exactly as in [Deploy from HuggingFace](#deploy-from-huggingface) — only the `engine` field on the deployment config changes. Tensor parallelism is computed automatically from the GPU count and the model's architecture.
+
+This example deploys [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with vLLM.


did we just choose this one because of how small it is?

Pretty much. I chose it for velocity in my own manual testing so I didn't have to wait forever for weights to pull and the deployment to start. I am definitely open to changing it for the public facing docs, if you think an alternative (something a bit bigger) would be better.

mckornfield · 2026-06-09T16:07:28Z

+
+vLLM deployments support LoRA adapters with hot-reload: enable LoRA on the deployment config, then register one or more adapters against the base model entity. An adapter sidecar delivers each enabled adapter to the running server, and vLLM serves it on a per-request basis. Adapters are registered on the **base model entity**, not as separate models.
+
+This example serves the [linear-algebra LoRA](https://huggingface.co/premjatin/qwen-linear-algebra-coder) on top of Qwen3-1.7B.


lol or maybe this was the reason for using this model?

seems semi random as far as this guy only has two things on HF but 🤷

I gave the adapters link for Qwen3-1.7B to claude and said "let's test deploying a LoRA and sending an inference request to the adapter" and the first adapter it chose out of the 500+ available was a flipping cat girl adapter... I said "hey that's fine for our test but please, for the love of god, pick a more professional adapter for the docs"

In reality the core adapter example should probably be using one produced by customizer, but I figured we'd add that in follow up. But it probably is worth documenting how to deploy adapters from HF, I could see folks realistically wanting to do that.

mckornfield · 2026-06-09T16:29:49Z

+    ``executor_config`` into a discriminated union.
+    """
+
+    gpu: int = Field(description="Number of GPUs required for the deployment. 0 = CPU-only.", ge=0)


do we even want to support 0? lolol

vLLM can run in cpu only mode, I think? I figured if it exists and is a matter of a single integer setting being set to 0, then it's trivial to support and maybe someone has a use case for it. But /shrug

mckornfield · 2026-06-09T16:40:26Z

need a commit -s, and looks like some styling/type failures

benmccown · 2026-06-09T17:09:36Z

+        env_vars["VLLM_LORA_RESOLVER_CACHE_DIR"] = VLLM_LORA_CACHE_DIR
+        env_vars["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True"
+
+    if model_entity and getattr(model_entity, "trust_remote_code", False):


@soluwalana are we okay to default to False here?

benmccown · 2026-06-09T17:10:29Z

+    return [d for d in range(1, n + 1) if n % d == 0]
+
+
+def compute_tensor_parallel_size(


@soluwalana could you review this func implementation specifically? I'm not sure if 1 is a reasonable default?

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/core/models/src/nmp/core/models/schemas.py`:
- Around line 1211-1220: The schema currently allows health_check_path to be
None even when engine == "generic", which leads to invalid configs; add a
Pydantic root_validator on the model that owns the engine and executor_config
fields to enforce that when engine == "generic" the health_check_path field
(defined as health_check_path: Optional[str]) is not None and is a non-empty
string, raising a ValidationError with a clear message otherwise; implement this
check in the model class containing health_check_path and engine so invalid
configs fail validation early.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a63df38-b36f-43de-a1e6-b34b258dbc04

📥 Commits

Reviewing files that changed from the base of the PR and between c70bc52 and 0e72c94.

⛔ Files ignored due to path filters (6)

sdk/python/nemo-platform/.nmpcontext/openapi.yaml is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config_param.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec_param.py is excluded by !sdk/**
sdk/python/nemo-platform/tests/api_resources/inference/test_deployment_configs.py is excluded by !sdk/**

📒 Files selected for processing (18)

docs/run-inference/tutorials/deploy-models.md
openapi/ga/individual/platform.openapi.yaml
openapi/ga/openapi.yaml
openapi/openapi.yaml
packages/nmp_testing/src/nmp/testing/e2e/customizer.py
services/core/models/script/README.md
services/core/models/script/v2_deployment_migration.py
services/core/models/src/nmp/core/models/app/utils.py
services/core/models/src/nmp/core/models/controllers/backends/common.py
services/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.py
services/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.py
services/core/models/src/nmp/core/models/controllers/provider_reconciler.py
services/core/models/src/nmp/core/models/schemas.py
services/core/models/src/nmp/core/models/sidecars/adapters/main.py
services/core/models/tests/unit/controllers/test_backend_config_fields.py
services/core/models/tests/unit/controllers/test_docker_backend.py
services/core/models/tests/unit/controllers/test_nimservice_compiler.py
services/core/models/tests/unit/sidecars/test_adapters_controller.py

💤 Files with no reviewable changes (2)

services/core/models/script/README.md
services/core/models/script/v2_deployment_migration.py

✅ Files skipped from review due to trivial changes (1)

docs/run-inference/tutorials/deploy-models.md

🚧 Files skipped from review as they are similar to previous changes (13)

services/core/models/src/nmp/core/models/app/utils.py
services/core/models/src/nmp/core/models/controllers/backends/common.py
services/core/models/src/nmp/core/models/sidecars/adapters/main.py
services/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.py
services/core/models/tests/unit/sidecars/test_adapters_controller.py
services/core/models/src/nmp/core/models/controllers/provider_reconciler.py
openapi/openapi.yaml
openapi/ga/individual/platform.openapi.yaml
services/core/models/tests/unit/controllers/test_backend_config_fields.py
services/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.py
openapi/ga/openapi.yaml
services/core/models/tests/unit/controllers/test_docker_backend.py
services/core/models/tests/unit/controllers/test_nimservice_compiler.py

albcui · 2026-06-09T21:12:31Z

+        env_vars["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True"
+
+    if model_entity and getattr(model_entity, "trust_remote_code", False):
+        env_vars["VLLM_TRUST_REMOTE_CODE"] = "1"


Just making sure that this env var actually works? I couldn't find it anywhere in the vllm source code: https://github.com/search?q=repo%3Avllm-project%2Fvllm%20VLLM_TRUST_REMOTE_CODE&type=code

The alternative is to set the --trust-remote-code arg to vllm serve.

I tried running it for realz and got

(APIServer pid=1) WARNING 06-09 21:35:31 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_TRUST_REMOTE_CODE

ty for the sanity check, I would have never stopped to question this one. I'll get it fixed.

Fixed in a58ef9d. Replaced the bogus VLLM_TRUST_REMOTE_CODE env var with the --trust-remote-code vllm serve CLI flag (injected in compile_vllm_args, with the same user-override guard as our other flags). Also sanity-checked every other vLLM env var/arg we emit against vLLM's envs.py and arg_utils.py — the rest (VLLM_PLUGINS, VLLM_LORA_RESOLVER_CACHE_DIR, VLLM_ALLOW_RUNTIME_LORA_UPDATING, --served-model-name, --tensor-parallel-size, --chat-template, --enable-lora, --max-lora-rank) are all valid. Thanks for catching this.

addressed in a58ef9d

Signed-off-by: Ben McCown <bmccown@nvidia.com>

github-actions · 2026-06-10T15:49:07Z

🌿 Preview your docs: https://nvidia-preview-aircore-698-models-vllm-docker-bmccown.docs.buildwithfern.com/nemo/platform

…env var VLLM_TRUST_REMOTE_CODE is not a real vLLM environment variable -- vLLM logs 'Unknown vLLM environment variable detected: VLLM_TRUST_REMOTE_CODE' and ignores it. trust-remote-code is a vllm serve CLI flag (EngineArgs.trust_remote_code -> --trust-remote-code). Move the injection from compile_vllm_env_vars to compile_vllm_args, honoring the same user-override guard as the other injected flags so a user-supplied --trust-remote-code is not duplicated. compile_vllm_env_vars no longer needs model_entity, so drop the now-unused parameter and update the caller. Sanity-checked all other vLLM env vars/args we emit against vLLM's envs.py and arg_utils.py: VLLM_PLUGINS, VLLM_LORA_RESOLVER_CACHE_DIR, VLLM_ALLOW_RUNTIME_LORA_UPDATING and --served-model-name, --tensor-parallel-size, --chat-template, --enable-lora, --max-lora-rank are all valid. Addresses PR #232 review thread (albcui). Signed-off-by: Ben McCown <bmccown@nvidia.com>

benmccown requested review from a team as code owners June 9, 2026 15:43

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread services/core/models/tests/unit/sidecars/test_adapters_controller.py Fixed

Comment thread services/core/models/tests/unit/sidecars/test_adapters_controller.py Fixed

benmccown commented Jun 9, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

benmccown requested review from JashG, albcui, mckornfield and soluwalana June 9, 2026 16:02

mckornfield approved these changes Jun 9, 2026

View reviewed changes

benmccown commented Jun 9, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread services/core/models/src/nmp/core/models/schemas.py

benmccown force-pushed the AIRCORE-698-models-vllm-docker/bmccown branch 2 times, most recently from 4f74606 to 1ff296d Compare June 9, 2026 21:08

albcui reviewed Jun 9, 2026

View reviewed changes

feat(models): vllm deployment in docker

f4a30c7

Signed-off-by: Ben McCown <bmccown@nvidia.com>

benmccown force-pushed the AIRCORE-698-models-vllm-docker/bmccown branch from 1ff296d to f4a30c7 Compare June 10, 2026 15:47

benmccown requested a review from albcui June 10, 2026 16:03


		Serve any HuggingFace-format model with [vLLM](https://docs.vllm.ai/) by setting `engine: "vllm"` on the deployment config. The model is registered through the Files service exactly as in [Deploy from HuggingFace](#deploy-from-huggingface) — only the `engine` field on the deployment config changes. Tensor parallelism is computed automatically from the GPU count and the model's architecture.

		This example deploys [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with vLLM.


		vLLM deployments support LoRA adapters with hot-reload: enable LoRA on the deployment config, then register one or more adapters against the base model entity. An adapter sidecar delivers each enabled adapter to the running server, and vLLM serves it on a per-request basis. Adapters are registered on the base model entity, not as separate models.

		This example serves the [linear-algebra LoRA](https://huggingface.co/premjatin/qwen-linear-algebra-coder) on top of Qwen3-1.7B.

		return [d for d in range(1, n + 1) if n % d == 0]


		def compute_tensor_parallel_size(

Conversation

benmccown commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add vLLM as a deployment engine (docker backend)

Summary

Core changes

Schema: engine + model_spec + executor_config

Docker vLLM backend

LoRA hot-reload on vLLM

SDK / CLI / docs

Testing

Summary by CodeRabbit

Release Notes

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Documentation preview is ready

Uh oh!

Uh oh!

Uh oh!

benmccown left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benmccown left a comment

Choose a reason for hiding this comment

Uh oh!

benmccown commented Jun 9, 2026

Reviewer guide

Can be ignored

Worth a careful look

Scope

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mckornfield left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

benmccown commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Schema: `engine` + `model_spec` + `executor_config`

github-actions Bot commented Jun 9, 2026 •

edited

Loading

github-actions Bot commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading