feat(models): vllm deployment in docker#232
Conversation
Documentation preview is readyPreview: https://nvidia-nemo.github.io/nemo-platform/pr-preview/pr-232/pr-232/ Built from This preview is deployed from this PR branch, updates when docs changes are pushed, and will be removed when the PR closes. |
benmccown
left a comment
There was a problem hiding this comment.
Per-file review notes for everything under services/core/models/src/nmp/core/models/. Read order: schemas.py first (the rest fans out from there), then controllers/backends/docker/vllm_compiler.py and creation_reconciler.py. Autogenerated openapi/ and sdk/ dirs can be ignored.
benmccown
left a comment
There was a problem hiding this comment.
Per-file review notes for everything under services/core/models/src/nmp/core/models/. Read order: schemas.py first (the rest fans out from there), then controllers/backends/docker/vllm_compiler.py and creation_reconciler.py. Autogenerated openapi/ and sdk/ dirs can be ignored.
Reviewer guideThe diff is large but mostly mechanical fan-out from one schema change. Suggested reading order:
Can be ignored
Worth a careful look
ScopeFirst pass: base schema + docker backend vLLM (incl. LoRA). The k8s backend vLLM path and the I've left per-file |
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDeployment config contracts now require ChangesDeployment config migration
Sequence Diagram(s)sequenceDiagram
participant API
participant Service
participant Backend
participant Sidecar
API->>Service: create/update with engine, model_spec, executor_config
Service->>Backend: persist normalized config
Backend->>Backend: select NIM or vLLM path
Backend->>Sidecar: configure LoRA adapter rewrite
Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
services/customizer/src/nmp/customizer/app/jobs/compiler.py (1)
334-339:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire explicit
lora_enabled=Truefor LoRA string refs.Line 334 only rejects explicit
False; unset/Nonecurrently passes. That allows LoRA training to proceed with a config that may not load adapters.Proposed fix
- if is_lora and resolved_config.model_spec and resolved_config.model_spec.lora_enabled is False: + if is_lora and ( + resolved_config.model_spec is None + or resolved_config.model_spec.lora_enabled is not True + ): raise PlatformJobCompilationError( - f"deployment_config references '{dc}' which has lora_enabled=false, " + f"deployment_config references '{dc}' which does not have lora_enabled=true, " "but this is a LoRA training job. The deployment would not load LoRA adapters. " "Use a deployment config with lora_enabled=true, or provide inline deployment parameters." )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/customizer/src/nmp/customizer/app/jobs/compiler.py` around lines 334 - 339, The check for LoRA deployments currently only rejects explicit False but lets None/unset pass; update the condition in the compilation path that uses is_lora and resolved_config.model_spec.lora_enabled so that LoRA string refs require lora_enabled to be True (i.e., treat anything not True as invalid) and raise the same PlatformJobCompilationError (include the dc context) when lora_enabled is not True; locate the check referencing is_lora, resolved_config.model_spec.lora_enabled, and PlatformJobCompilationError and change the predicate to fail when lora_enabled is not True.services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py (1)
261-271:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve or recompute
model_entity_idon update.If the caller omits
model_entity_id, this version storesNoneeven whenrequest.model_specstill identifies the model. That breaks the config→model association used later by deployment resolution and LoRA filtering. Reuse the create-path lookup here, then fall back tocurrent.model_entity_id.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py` around lines 261 - 271, When constructing the new ModelDeploymentConfigEntity on update, ensure model_entity_id is preserved or recomputed instead of unconditionally using request.model_entity_id; if request.model_entity_id is missing, run the same model lookup used in the create path using request.model_spec and use that result, and if that lookup yields nothing fall back to current.model_entity_id so the config→model association remains intact (update the model_entity_id argument passed to ModelDeploymentConfigEntity accordingly).
🧹 Nitpick comments (2)
services/core/models/tests/integration/test_models.py (1)
735-739: ⚡ Quick winAssert
engineround-trip in CRUD assertions.Line 737 and Line 774 validate nested fields, but not the top-level discriminant. Add
engine == "nim"checks on create/update/read responses to catch contract regressions early.Patch
@@ assert created["name"] == test_name assert created["description"] == "A deployment configuration for LLM inference" + assert created["engine"] == "nim" assert created["executor_config"]["gpu"] == 2 assert created["model_spec"]["lora_enabled"] is True @@ assert updated["description"] == "Updated deployment configuration" + assert updated["engine"] == "nim" assert updated["executor_config"]["gpu"] == 4 assert updated["entity_version"] == 2 @@ latest = response.json() assert latest["entity_version"] == 2 + assert latest["engine"] == "nim"Also applies to: 773-775, 782-784
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/models/tests/integration/test_models.py` around lines 735 - 739, The CRUD test is missing assertions that the top-level discriminant "engine" round-trips; update the test assertions to include checks like created["engine"] == "nim", updated["engine"] == "nim", and (the read/fetched response variable) e.g. fetched["engine"] == "nim" alongside the existing nested assertions (e.g. created["name"], created["executor_config"], created["model_spec"]). Locate the places where the test uses the variables created and updated (and the read/fetched response) and add these engine equality asserts to catch contract regressions.docs/run-inference/tutorials/deploy-models.md (1)
1212-1214: ⚡ Quick winAdd a
Next Stepssection at the end of this page.Link to related docs (config reference, API reference, troubleshooting) to complete the tutorial flow.
As per coding guidelines, "Include 'Next Steps' section at the end with cross-links to related documentation content".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/run-inference/tutorials/deploy-models.md` around lines 1212 - 1214, Add a "Next Steps" section at the end of this page (after the vLLM/chat_template_kwargs note) that provides cross-links to the configuration reference, API reference, and troubleshooting guide to complete the tutorial flow; mention that chat_template_kwargs applies only to vLLM-backed models and provide links to the config reference, API reference, and troubleshooting docs for non-vLLM providers (e.g., OpenAI/NVIDIA Build) so readers can follow up on provider-specific parameters and debugging steps.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/run-inference/tutorials/deploy-models.md`:
- Around line 549-550: The CLI example is missing the required --name flag for
the deployment; update the command that starts with "nemo inference gateway
provider post v1/chat/completions qwen3-lora-deployment --body" to include
"--name qwen3-lora-deployment" (matching the rest of the page) in the
appropriate position so the example is runnable and consistent with other
snippets.
In `@openapi/ga/individual/platform.openapi.yaml`:
- Around line 20153-20158: The schema for model_provider currently allows any
string but the description limits valid values to "hf" and "nmp"; update the
OpenAPI schema for the model_provider property to enforce those values by adding
an enum: ["hf","nmp"] (keep the existing title/description/type/maxLength) so
requests with invalid providers fail validation; refer to the model_provider
property in the YAML to apply this change.
- Around line 12422-12427: The OpenAPI schema currently doesn't enforce that
executor_config.image_name is present when engine == "generic"; update the
request schema that defines engine and executor_config to add a conditional
requirement (e.g., use oneOf/anyOf with a subschema where engine: {const:
"generic"} and required: ["executor_config"] plus a schema that enforces
executor_config.image_name is present) so any payload with engine: "generic"
must include executor_config.image_name; apply the same conditional change
wherever executor_config.image_name/engine pair is defined (the other repeated
schemas that include image_name) to keep behavior consistent.
In `@openapi/ga/openapi.yaml`:
- Around line 12435-12439: The schema for additional_envs currently uses
additionalProperties: true which permits non-string values; update the
additional_envs schema by replacing additionalProperties: true with
additionalProperties: { type: string } (i.e., constrain additionalProperties to
string values) so that the additional_envs object only allows string-valued
environment variables; change the additional_envs definition accordingly where
it appears.
- Around line 12422-12427: The OpenAPI schema currently documents that
image_name is "Required for engine='generic'" only in prose, but the request
schemas don't enforce this conditional; update the deployment-level schema for
the object containing engine and image_name so that it includes a conditional
(oneOf/if-then or discriminator) requiring image_name (and any related image
fields) when engine == "generic" — i.e., add an if:
{properties:{engine:{const:"generic"}}} then: {required:["image_name"]} (or
equivalent oneOf alternatives) for the same schema that declares engine and
image_name so generated clients cannot omit image_name when engine is generic;
ensure you apply the same change in the other two locations noted (around the
blocks at lines ~13712-13725 and ~27984-27997).
- Around line 20153-20158: The model_provider property currently allows any
string; constrain it to supported values by replacing the free-form type with an
enum (or $ref to a provider schema) that lists "hf" and "nmp" so client
validation and generated SDK types reflect the actual allowed providers; locate
the model_provider schema and change it to an enum of ["hf","nmp"] (or point
model_provider to a shared Provider enum schema) while keeping any existing
metadata like title/description.
In `@openapi/openapi.yaml`:
- Around line 12435-12439: The schema for additional_envs currently allows
arbitrary JSON (additionalProperties: true) but env vars must be strings; update
the additional_envs schema so additionalProperties is a string schema (e.g.,
additionalProperties: { type: "string" }) to restrict values to strings, or
alternately add explicit documentation and a property like additional_envs: {
additionalProperties: { type: "string" }, description: "... JSON-encoded values
..." }—make this change to the additional_envs schema to ensure only string
values are accepted.
- Around line 12422-12428: The schema currently documents
ContainerExecutorConfig.image_name as required for engine='generic' but doesn’t
enforce it; add a JSON Schema conditional at the OpenAPI object that contains
the engine and executor_config properties (use an if:
{properties:{engine:{const:"generic"}}} then: {required:["executor_config"],
properties:{executor_config:{required:["image_name"]}}} or equivalent) so that
when engine equals "generic" the executor_config.image_name field is required;
reference the ContainerExecutorConfig schema and the executor_config.image_name
property when adding the conditional to ensure the validator enforces this rule.
- Around line 13716-13728: Update the deployment config schemas to enforce a
resolvable model source by adding a oneOf invariant at the
CreateModelDeploymentConfigRequest/UpdateModelDeploymentConfigRequest level:
require either the top-level model_entity_id property or a model_spec that
contains the identity/source fields used to resolve weights (modify
ModelDeploymentConfigModelSpec to mark those identity/source properties as
required, e.g., the fields your system uses such as
model_source/model_uri/model_id or equivalent), and add a oneOf that validates
(1) model_entity_id is present OR (2) model_spec has the required
identity/source properties; apply the same change for the other affected ranges
(around lines 20130–20184 and 27988–28007).
In `@services/core/models/script/v2_deployment_migration.py`:
- Around line 794-801: The migration currently omits the lora_enabled field when
its value is None, causing ModelDeploymentConfigModelSpec.lora_enabled to
default to False; update _translate_nim_deployment() (where nim is processed and
model_spec_param is built) to copy the lora_enabled key based on key presence
(e.g., check "lora_enabled" in nim) rather than nim.get(...) != None so that if
nim["lora_enabled"] is explicitly None the field is preserved in
create_kwargs["model_spec"], preventing accidental coercion to False.
In `@services/core/models/src/nmp/core/models/app/utils.py`:
- Around line 167-168: Guard against missing nested attributes on
model_deployment_config before dereferencing: check that
model_deployment_config.executor_config and model_deployment_config.model_spec
are truthy (or use getattr with defaults) before calling
is_multi_llm_image(model_deployment_config.executor_config.image_name) and
before accessing model_deployment_config.model_spec.model_name; update the
conditional(s) that reference executor_config.image_name and
model_spec.model_name (the occurrence around is_multi_llm_image and the similar
check at line ~173) to short-circuit when executor_config or model_spec is None
so AttributeError cannot be raised.
In `@services/core/models/src/nmp/core/models/controllers/backends/common.py`:
- Around line 16-47: The DeploymentConfigView dataclass and the
deployment_config_view(config: Any) parameter use broad Any types; replace each
Any with precise Optional[...] types (e.g., model_type: Optional[str],
model_namespace: Optional[str], model_name: Optional[str], model_revision:
Optional[str], model_provider: Optional[str], chat_template: Optional[dict] or
Optional[str] as appropriate, tool_call_config: Optional[dict], disk_size:
Optional[int|str], image_name: Optional[str], image_tag: Optional[str],
additional_envs: Optional[dict[str,str]] or Optional[list[tuple[str,str]]],
k8s_nim_operator_config: Optional[dict], override_config: Optional[dict]) while
keeping lora_enabled: bool and gpu: int and additional_args:
Optional[list[str]]; also change the function signature
deployment_config_view(config: Any) to deployment_config_view(config:
Optional[Mapping[str, Any]] or a typed Protocol/ModelDeploymentConfig) so
callers get static safety and update imports/typing hints accordingly.
In `@services/core/models/src/nmp/core/models/controllers/provider_reconciler.py`:
- Around line 1094-1099: The code builds HF artifact URLs using model_spec
without validating its fields, which can produce invalid URLs like
"hf://None/None" or trigger the broad exception path; update the block around
model_spec usage and the call to parse_model_name_revision so you first check
that config.model_spec exists and that model_spec.model_namespace,
model_spec.model_name, and model_spec.model_revision are non-empty/nonnull (or
meet expected format) before calling parse_model_name_revision or constructing
"hf://" URLs, and if validation fails log a clear warning/error and skip or
short-circuit artifact metadata construction instead of catching and silencing
exceptions; reference model_spec and parse_model_name_revision to locate where
to add these checks and the hf:// URL construction to guard.
In `@services/core/models/src/nmp/core/models/schemas.py`:
- Around line 1366-1372: The model allows engine="generic" while permitting
executor_config.image_name to be None; add an API-level validation in the
Pydantic model that owns the engine, model_spec and executor_config fields
(referencing engine and executor_config / ContainerExecutorConfig) to reject
requests where engine == "generic" and executor_config.image_name is empty/None
by raising a validation error; implement this as a validator or root_validator
on the same model class (the schema around these fields) and apply the same
validator to the other identical model block (the block at 1388-1394) so both
places enforce that generic requires an explicit image_name.
In `@services/core/models/src/nmp/core/models/sidecars/adapters/main.py`:
- Around line 146-151: The vLLM base-model rewrite failure currently logs a
warning but still proceeds to mark the adapter up-to-date and swap it into
place; change the error handling so that when reading/parsing
adapter_config.json (cfg_path) or the rewrite operation fails (the except block
around json.load and the corresponding failure points at the other locations
mentioned: the blocks at ~153-158 and ~228-230), the code returns early and does
NOT call the functions that write metadata or perform the swap (references to
cfg, adapter_dir, any write/update_metadata function and any
swap/rename/atomic_swap_adapter call); ensure the early-return prevents
execution of the metadata write/update and swap logic so a broken adapter is not
marked up-to-date and automatic retry can still occur.
In `@services/core/models/tests/unit/controllers/test_backend_config_fields.py`:
- Around line 32-45: The MagicMock assigned to config.model_spec causes unset
attributes to be truthy and can trigger optional code paths; explicitly set
optional fields (e.g., config.model_spec.tool_call_config = None and any other
optional model_spec attributes that tests rely on) to None instead of leaving
them as mocks, and ensure any other optional executor-related fields remain
explicitly None (e.g., config.executor_config.k8s_nim_operator_config already
None) so the fixture returns a config with the intended None values rather than
truthy MagicMocks.
In `@services/core/models/tests/unit/controllers/test_nimservice_compiler.py`:
- Around line 51-55: The MagicMock used for config.model_spec leaves optional
attributes truthy and can cause unwanted branches to run; update every place
that sets config.model_spec = MagicMock() (including the full_config fixture and
the inline setups around test_nimservice_compiler lines referenced) to
explicitly set the optional fields: config.model_spec.lora_enabled = False,
config.model_spec.model_name = None, config.model_spec.model_namespace = None,
and config.model_spec.model_revision = None so tests don't accidentally trigger
optional compiler/tool-call logic.
---
Outside diff comments:
In
`@services/core/models/src/nmp/core/models/api/service/model_deployment_config_service.py`:
- Around line 261-271: When constructing the new ModelDeploymentConfigEntity on
update, ensure model_entity_id is preserved or recomputed instead of
unconditionally using request.model_entity_id; if request.model_entity_id is
missing, run the same model lookup used in the create path using
request.model_spec and use that result, and if that lookup yields nothing fall
back to current.model_entity_id so the config→model association remains intact
(update the model_entity_id argument passed to ModelDeploymentConfigEntity
accordingly).
In `@services/customizer/src/nmp/customizer/app/jobs/compiler.py`:
- Around line 334-339: The check for LoRA deployments currently only rejects
explicit False but lets None/unset pass; update the condition in the compilation
path that uses is_lora and resolved_config.model_spec.lora_enabled so that LoRA
string refs require lora_enabled to be True (i.e., treat anything not True as
invalid) and raise the same PlatformJobCompilationError (include the dc context)
when lora_enabled is not True; locate the check referencing is_lora,
resolved_config.model_spec.lora_enabled, and PlatformJobCompilationError and
change the predicate to fail when lora_enabled is not True.
---
Nitpick comments:
In `@docs/run-inference/tutorials/deploy-models.md`:
- Around line 1212-1214: Add a "Next Steps" section at the end of this page
(after the vLLM/chat_template_kwargs note) that provides cross-links to the
configuration reference, API reference, and troubleshooting guide to complete
the tutorial flow; mention that chat_template_kwargs applies only to vLLM-backed
models and provide links to the config reference, API reference, and
troubleshooting docs for non-vLLM providers (e.g., OpenAI/NVIDIA Build) so
readers can follow up on provider-specific parameters and debugging steps.
In `@services/core/models/tests/integration/test_models.py`:
- Around line 735-739: The CRUD test is missing assertions that the top-level
discriminant "engine" round-trips; update the test assertions to include checks
like created["engine"] == "nim", updated["engine"] == "nim", and (the
read/fetched response variable) e.g. fetched["engine"] == "nim" alongside the
existing nested assertions (e.g. created["name"], created["executor_config"],
created["model_spec"]). Locate the places where the test uses the variables
created and updated (and the read/fetched response) and add these engine
equality asserts to catch contract regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19fcd48a-9eea-47d4-8038-c3805410bd46
⛔ Files ignored due to path filters (16)
sdk/python/nemo-platform/.nmpcontext/openapi.yamlis excluded by!sdk/**sdk/python/nemo-platform/.nmpcontext/stainless.yamlis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/cli/commands/api/inference/deployment_configs/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/resources/inference/api.mdis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/resources/inference/deployment_configs/deployment_configs.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/deployment_config_create_params.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/deployment_config_update_params.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/engine.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec_param.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/api_resources/inference/test_deployment_configs.pyis excluded by!sdk/**sdk/stainless.yamlis excluded by!sdk/**
📒 Files selected for processing (53)
docs/cli/reference.mddocs/run-inference/tutorials/deploy-models.mddocs/set-up/config-reference.mdopenapi/ga/individual/platform.openapi.yamlopenapi/ga/openapi.yamlopenapi/openapi.yamlpackages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/api/inference/deployment_configs/__init__.pyservices/core/inference-gateway/tests/integration/test_inference.pyservices/core/inference-gateway/tests/integration/test_middleware_pipeline.pyservices/core/models/script/v2_deployment_migration.pyservices/core/models/src/nmp/core/models/api/service/model_deployment_config_service.pyservices/core/models/src/nmp/core/models/api/service/model_entity_service.pyservices/core/models/src/nmp/core/models/api/v2/models.pyservices/core/models/src/nmp/core/models/app/utils.pyservices/core/models/src/nmp/core/models/controllers/backends/common.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/backend.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/config.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.pyservices/core/models/src/nmp/core/models/controllers/backends/k8s_nim_operator/backend.pyservices/core/models/src/nmp/core/models/controllers/backends/k8s_nim_operator/nimservice_compiler.pyservices/core/models/src/nmp/core/models/controllers/models_controller.pyservices/core/models/src/nmp/core/models/controllers/provider_reconciler.pyservices/core/models/src/nmp/core/models/entities.pyservices/core/models/src/nmp/core/models/schemas.pyservices/core/models/src/nmp/core/models/sidecars/adapters/main.pyservices/core/models/tests/integration/test_chat_template_tool_calling.pyservices/core/models/tests/integration/test_model_deployment_config_service_integration.pyservices/core/models/tests/integration/test_model_deployment_service_integration.pyservices/core/models/tests/integration/test_model_entity_service_integration.pyservices/core/models/tests/integration/test_model_lora_filter_route.pyservices/core/models/tests/integration/test_models.pyservices/core/models/tests/integration/test_models_auth_propagation.pyservices/core/models/tests/integration/test_models_controller.pyservices/core/models/tests/integration/test_models_with_auth.pyservices/core/models/tests/unit/api/test_deployment_configs_api.pyservices/core/models/tests/unit/common/test_utils.pyservices/core/models/tests/unit/controllers/backends/docker/test_vllm_compiler.pyservices/core/models/tests/unit/controllers/test_backend_config_fields.pyservices/core/models/tests/unit/controllers/test_docker_backend.pyservices/core/models/tests/unit/controllers/test_k8s_nim_operator_backend.pyservices/core/models/tests/unit/controllers/test_models_controller_unit.pyservices/core/models/tests/unit/controllers/test_nimservice_compiler.pyservices/core/models/tests/unit/controllers/test_provider_reconciler.pyservices/core/models/tests/unit/sidecars/test_adapters_controller.pyservices/core/models/tests/unit/test_model_deployment_config_service_unit.pyservices/core/models/tests/unit/test_model_deployment_service_unit.pyservices/core/models/tests/unit/test_model_entity_service_unit.pyservices/core/models/tests/unit/test_schemas_override_config.pyservices/customizer/src/nmp/customizer/app/jobs/compiler.pyservices/customizer/src/nmp/customizer/tasks/model_entity/run.pyservices/customizer/tests/tasks/model_entity/test_model_entity.pyservices/customizer/tests/test_compiler.py
mckornfield
left a comment
There was a problem hiding this comment.
nothing crazy stands out, some choices I found sorta random but nothing blocking for me
|
|
||
| Serve any HuggingFace-format model with [vLLM](https://docs.vllm.ai/) by setting `engine: "vllm"` on the deployment config. The model is registered through the Files service exactly as in [Deploy from HuggingFace](#deploy-from-huggingface) — only the `engine` field on the deployment config changes. Tensor parallelism is computed automatically from the GPU count and the model's architecture. | ||
|
|
||
| This example deploys [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with vLLM. |
There was a problem hiding this comment.
did we just choose this one because of how small it is?
There was a problem hiding this comment.
Pretty much. I chose it for velocity in my own manual testing so I didn't have to wait forever for weights to pull and the deployment to start. I am definitely open to changing it for the public facing docs, if you think an alternative (something a bit bigger) would be better.
|
|
||
| vLLM deployments support LoRA adapters with hot-reload: enable LoRA on the deployment config, then register one or more adapters against the base model entity. An adapter sidecar delivers each enabled adapter to the running server, and vLLM serves it on a per-request basis. Adapters are registered on the **base model entity**, not as separate models. | ||
|
|
||
| This example serves the [linear-algebra LoRA](https://huggingface.co/premjatin/qwen-linear-algebra-coder) on top of Qwen3-1.7B. |
There was a problem hiding this comment.
lol or maybe this was the reason for using this model?
There was a problem hiding this comment.
seems semi random as far as this guy only has two things on HF but 🤷
There was a problem hiding this comment.
I gave the adapters link for Qwen3-1.7B to claude and said "let's test deploying a LoRA and sending an inference request to the adapter" and the first adapter it chose out of the 500+ available was a flipping cat girl adapter... I said "hey that's fine for our test but please, for the love of god, pick a more professional adapter for the docs"
In reality the core adapter example should probably be using one produced by customizer, but I figured we'd add that in follow up. But it probably is worth documenting how to deploy adapters from HF, I could see folks realistically wanting to do that.
| ``executor_config`` into a discriminated union. | ||
| """ | ||
|
|
||
| gpu: int = Field(description="Number of GPUs required for the deployment. 0 = CPU-only.", ge=0) |
There was a problem hiding this comment.
do we even want to support 0? lolol
There was a problem hiding this comment.
vLLM can run in cpu only mode, I think? I figured if it exists and is a matter of a single integer setting being set to 0, then it's trivial to support and maybe someone has a use case for it. But /shrug
|
need a |
| env_vars["VLLM_LORA_RESOLVER_CACHE_DIR"] = VLLM_LORA_CACHE_DIR | ||
| env_vars["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True" | ||
|
|
||
| if model_entity and getattr(model_entity, "trust_remote_code", False): |
There was a problem hiding this comment.
@soluwalana are we okay to default to False here?
| return [d for d in range(1, n + 1) if n % d == 0] | ||
|
|
||
|
|
||
| def compute_tensor_parallel_size( |
There was a problem hiding this comment.
@soluwalana could you review this func implementation specifically? I'm not sure if 1 is a reasonable default?
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/core/models/src/nmp/core/models/schemas.py`:
- Around line 1211-1220: The schema currently allows health_check_path to be
None even when engine == "generic", which leads to invalid configs; add a
Pydantic root_validator on the model that owns the engine and executor_config
fields to enforce that when engine == "generic" the health_check_path field
(defined as health_check_path: Optional[str]) is not None and is a non-empty
string, raising a ValidationError with a clear message otherwise; implement this
check in the model class containing health_check_path and engine so invalid
configs fail validation early.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4a63df38-b36f-43de-a1e6-b34b258dbc04
⛔ Files ignored due to path filters (6)
sdk/python/nemo-platform/.nmpcontext/openapi.yamlis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/container_executor_config_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/inference/model_deployment_config_model_spec_param.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/api_resources/inference/test_deployment_configs.pyis excluded by!sdk/**
📒 Files selected for processing (18)
docs/run-inference/tutorials/deploy-models.mdopenapi/ga/individual/platform.openapi.yamlopenapi/ga/openapi.yamlopenapi/openapi.yamlpackages/nmp_testing/src/nmp/testing/e2e/customizer.pyservices/core/models/script/README.mdservices/core/models/script/v2_deployment_migration.pyservices/core/models/src/nmp/core/models/app/utils.pyservices/core/models/src/nmp/core/models/controllers/backends/common.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.pyservices/core/models/src/nmp/core/models/controllers/provider_reconciler.pyservices/core/models/src/nmp/core/models/schemas.pyservices/core/models/src/nmp/core/models/sidecars/adapters/main.pyservices/core/models/tests/unit/controllers/test_backend_config_fields.pyservices/core/models/tests/unit/controllers/test_docker_backend.pyservices/core/models/tests/unit/controllers/test_nimservice_compiler.pyservices/core/models/tests/unit/sidecars/test_adapters_controller.py
💤 Files with no reviewable changes (2)
- services/core/models/script/README.md
- services/core/models/script/v2_deployment_migration.py
✅ Files skipped from review due to trivial changes (1)
- docs/run-inference/tutorials/deploy-models.md
🚧 Files skipped from review as they are similar to previous changes (13)
- services/core/models/src/nmp/core/models/app/utils.py
- services/core/models/src/nmp/core/models/controllers/backends/common.py
- services/core/models/src/nmp/core/models/sidecars/adapters/main.py
- services/core/models/src/nmp/core/models/controllers/backends/docker/vllm_compiler.py
- services/core/models/tests/unit/sidecars/test_adapters_controller.py
- services/core/models/src/nmp/core/models/controllers/provider_reconciler.py
- openapi/openapi.yaml
- openapi/ga/individual/platform.openapi.yaml
- services/core/models/tests/unit/controllers/test_backend_config_fields.py
- services/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.py
- openapi/ga/openapi.yaml
- services/core/models/tests/unit/controllers/test_docker_backend.py
- services/core/models/tests/unit/controllers/test_nimservice_compiler.py
4f74606 to
1ff296d
Compare
| env_vars["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True" | ||
|
|
||
| if model_entity and getattr(model_entity, "trust_remote_code", False): | ||
| env_vars["VLLM_TRUST_REMOTE_CODE"] = "1" |
There was a problem hiding this comment.
Just making sure that this env var actually works? I couldn't find it anywhere in the vllm source code: https://github.com/search?q=repo%3Avllm-project%2Fvllm%20VLLM_TRUST_REMOTE_CODE&type=code
The alternative is to set the --trust-remote-code arg to vllm serve.
There was a problem hiding this comment.
I tried running it for realz and got
(APIServer pid=1) WARNING 06-09 21:35:31 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_TRUST_REMOTE_CODE
There was a problem hiding this comment.
ty for the sanity check, I would have never stopped to question this one. I'll get it fixed.
There was a problem hiding this comment.
Fixed in a58ef9d. Replaced the bogus VLLM_TRUST_REMOTE_CODE env var with the --trust-remote-code vllm serve CLI flag (injected in compile_vllm_args, with the same user-override guard as our other flags). Also sanity-checked every other vLLM env var/arg we emit against vLLM's envs.py and arg_utils.py — the rest (VLLM_PLUGINS, VLLM_LORA_RESOLVER_CACHE_DIR, VLLM_ALLOW_RUNTIME_LORA_UPDATING, --served-model-name, --tensor-parallel-size, --chat-template, --enable-lora, --max-lora-rank) are all valid. Thanks for catching this.
Signed-off-by: Ben McCown <bmccown@nvidia.com>
1ff296d to
f4a30c7
Compare
…env var VLLM_TRUST_REMOTE_CODE is not a real vLLM environment variable -- vLLM logs 'Unknown vLLM environment variable detected: VLLM_TRUST_REMOTE_CODE' and ignores it. trust-remote-code is a vllm serve CLI flag (EngineArgs.trust_remote_code -> --trust-remote-code). Move the injection from compile_vllm_env_vars to compile_vllm_args, honoring the same user-override guard as the other injected flags so a user-supplied --trust-remote-code is not duplicated. compile_vllm_env_vars no longer needs model_entity, so drop the now-unused parameter and update the caller. Sanity-checked all other vLLM env vars/args we emit against vLLM's envs.py and arg_utils.py: VLLM_PLUGINS, VLLM_LORA_RESOLVER_CACHE_DIR, VLLM_ALLOW_RUNTIME_LORA_UPDATING and --served-model-name, --tensor-parallel-size, --chat-template, --enable-lora, --max-lora-rank are all valid. Addresses PR #232 review thread (albcui). Signed-off-by: Ben McCown <bmccown@nvidia.com>
Add vLLM as a deployment engine (docker backend)
Summary
Adds vLLM as a peer inference engine to NIM in the models service, deployable through
the existing docker service backend. A user sets
engine: "vllm"on aModelDeploymentConfigand gets a deployed, IGW-routable, OpenAI-compatible endpoint —including LoRA adapter support with hot-reload.
This is the first implementation pass of the vLLM RFC: the base schema/interface changes
and the docker backend vLLM path. The k8s backend vLLM path and the
genericengine areout of scope here.
Core changes
Schema:
engine+model_spec+executor_configModelDeploymentConfignow carries a top-levelenginediscriminant (nim/vllm/generic) plus two grouped configs shared across engines:model_spec— what model to serve and how (model source, chat template, tool-callconfig,
lora_enabled); executor-invariant.executor_config— compute + container settings (gpu,disk_size, image,additional_envs,additional_args); shared by the docker and k8s container executors.Existing NIM deployments continue to work via
engine: "nim". The k8s NIM-operator pathis preserved unchanged (its operator-specific fields ride along on
executor_config).Docker vLLM backend
vllm_compilerproduces the image,vllm servearg vector, and env for thecontainer: tensor-parallel auto-tuning from GPU count + model architecture, served-model
name, chat template, and
additional_argspassthrough.config.engine, reusing the existingGPU/port/volume/weight-puller plumbing; only the final container shape differs from NIM.
/healthvs NIM/v1/health/ready).default_vllm_image/default_vllm_image_tagsettings.LoRA hot-reload on vLLM
--enable-lora+ the filesystem-resolver plugin env, watching ashared adapter directory.
directory; the controller wires the sidecar's env for vLLM and pre-creates the watched
directory so vLLM doesn't crash-loop on startup.
base_model_name_or_pathto the locallyserved model path so vLLM's resolver matches and auto-loads it.
SDK / CLI / docs
(
--engine,--model-spec,--executor-config).docs/run-inference/tutorials/deploy-models.mdwith a vLLM section (base +LoRA) and refreshed the existing examples to the new config shape.
Testing
and the adapter base-model rewrite; existing models/customizer/IGW suites updated to the
new shape.
and served inference through IGW; deployed a base model with a real HuggingFace LoRA
adapter and confirmed the sidecar delivered it and vLLM served adapter-applied responses.
Summary by CodeRabbit
Release Notes
New Features
engine,model_spec, andexecutor_configfields.Breaking Changes
nim_deploymentparameter replaced with separateengine,model_spec, andexecutor_configfields in create/update requests.--nim-deploymentoption replaced with--engine,--executor-config, and--model-specoptions.