Commit f4d9513

Create placeholder instance models (#3821)
* Create placeholder instances
* Add tests
* Update terminating comment
* Fix tests
* Fix job_model.instance not set to placeholder
* Fix placeholders not used for multinode tasks
* Drop optional fleet_model handling
* Fix placeholder cleanup on stale lock
* Drop placeholder cleanup on non-stale path
* Fix sqlite commit after unlock
* Do not elect placeholder instances as masters
* Count placeholders in _run_can_fit_into_fleet
* Update contributing/RUNS-AND-JOBS.md
* Regenerate migration
* Ignore placeholders in get_placement_group_model_for_job
* Rebase migration
1 parent 5730688 commit f4d9513

16 files changed

Lines changed: 680 additions & 75 deletions

contributing/RUNS-AND-JOBS.md

Lines changed: 3 additions & 1 deletion
@@ -61,7 +61,9 @@ Services' run lifecycle has some modifications:
 ## Job's Lifecycle
 
 - STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
-- STEP 2: `JobSubmittedPipeline` tries to assign an existing instance or provision new capacity.
+- STEP 2: `JobSubmittedPipeline` assigns the job in two phases:
+    - Assignment: claim an existing instance or reserve a *placeholder* `InstanceModel`. Placeholders are `PENDING` instances that reserve an `instance_num` and a `nodes.max` slot. `InstancePipeline` ignores them.
+    - Provisioning: reuse the existing instance, or cloud-provision and promote the placeholder to `PROVISIONING`.
     - On success, the job becomes `PROVISIONING`.
     - On failure, the job becomes `TERMINATING`. `JobTerminatingPipeline` later assigns the final failed status.
 - STEP 3: `JobRunningPipeline` processes `PROVISIONING`, `PULLING`, and `RUNNING` jobs.
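
The two phases described in STEP 2 can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `Fleet`, `Instance`, `assign`, and `provision` are hypothetical names, the `BUSY` status for a claimed instance is an assumption, and locking and database access are left out. It assumes a placeholder is a `PENDING` instance that carries the id of the job it was reserved for, and that placeholders count toward `nodes.max`.

```python
import itertools
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class InstanceStatus(Enum):
    PENDING = "pending"
    PROVISIONING = "provisioning"
    IDLE = "idle"
    BUSY = "busy"


@dataclass
class Instance:
    # Hypothetical stand-in for InstanceModel.
    instance_num: int
    status: InstanceStatus
    provisioning_job_id: Optional[uuid.UUID] = None


@dataclass
class Fleet:
    # Hypothetical stand-in for a fleet; nodes_max=None means "no upper bound".
    nodes_max: Optional[int]
    instances: list = field(default_factory=list)


def assign(fleet: Fleet, job_id: uuid.UUID) -> Optional[Instance]:
    """Assignment phase: claim an idle instance or reserve a placeholder."""
    for instance in fleet.instances:
        if instance.status is InstanceStatus.IDLE:
            instance.status = InstanceStatus.BUSY  # assumed status for a claimed instance
            return instance
    # Placeholders are ordinary entries in fleet.instances, so they occupy a slot.
    if fleet.nodes_max is not None and len(fleet.instances) >= fleet.nodes_max:
        return None
    taken = {i.instance_num for i in fleet.instances}
    instance_num = next(n for n in itertools.count() if n not in taken)
    placeholder = Instance(instance_num, InstanceStatus.PENDING, job_id)
    fleet.instances.append(placeholder)
    return placeholder


def provision(instance: Instance) -> Instance:
    """Provisioning phase: promote a placeholder; reuse anything else as-is."""
    is_placeholder = (
        instance.status is InstanceStatus.PENDING
        and instance.provisioning_job_id is not None
    )
    if is_placeholder:
        # The real pipeline would call the cloud backend before this point.
        instance.status = InstanceStatus.PROVISIONING
    return instance
```

Reserving the `instance_num` and the slot during assignment means a concurrent job sees the fleet as already occupied, even though no cloud resource exists yet.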

src/dstack/_internal/server/background/pipeline_tasks/fleets.py

Lines changed: 9 additions & 2 deletions
@@ -49,7 +49,10 @@
     is_fleet_empty,
     is_fleet_in_use,
 )
-from dstack._internal.server.services.instances import instance_matches_constraints
+from dstack._internal.server.services.instances import (
+    instance_matches_constraints,
+    is_placeholder_instance,
+)
 from dstack._internal.server.services.locking import get_locker
 from dstack._internal.server.services.pipelines import PipelineHinterProtocol
 from dstack._internal.server.utils import sentry_utils
@@ -935,8 +938,12 @@ def _select_current_master_instance_id(
             return instance_model.id
 
     # Prefer existing surviving instances over freshly planned replacements to
-    # avoid election churn during min-nodes backfill.
+    # avoid election churn during min-nodes backfill. Skip placeholders —
+    # they have no JPD and cannot anchor cluster placement, so electing one
+    # just defers the real master decision.
     for instance_model in surviving_instance_models:
+        if is_placeholder_instance(instance_model):
+            continue
         if (
             _get_effective_instance_status(
                 instance_model,
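
The skip added in this hunk can be illustrated with a reduced sketch. The body of `is_placeholder_instance` is not part of the diff shown here, so the predicate below is an assumption inferred from the `InstancePipeline` filter in the same commit (`PENDING` status plus a non-null `provisioning_job_id`). `Instance` and `select_master_id` are hypothetical names, and the effective-status checks of the real function are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Instance:
    # Hypothetical stand-in for InstanceModel.
    id: str
    status: str
    provisioning_job_id: Optional[str] = None


def is_placeholder(instance: Instance) -> bool:
    # Assumed definition: PENDING and tied to the job that reserved it.
    return instance.status == "pending" and instance.provisioning_job_id is not None


def select_master_id(surviving: Iterable[Instance]) -> Optional[str]:
    """Return the first surviving instance that is not a placeholder."""
    for instance in surviving:
        if is_placeholder(instance):
            # No provisioning data yet, so it cannot anchor cluster placement.
            continue
        return instance.id
    return None
```

When every surviving instance is a placeholder, the sketch returns `None`, which leaves the election to a later pass once a real instance exists.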

src/dstack/_internal/server/background/pipeline_tasks/instances/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -179,6 +179,13 @@ async def fetch(self, limit: int) -> list[InstancePipelineItem]:
                             InstanceModel.compute_group_id.is_not(None),
                         )
                     ),
+                    # Skip placeholder instances managed by JobSubmittedPipeline.
+                    not_(
+                        and_(
+                            InstanceModel.status == InstanceStatus.PENDING,
+                            InstanceModel.provisioning_job_id.is_not(None),
+                        )
+                    ),
                     InstanceModel.deleted == False,
                     or_(
                         # Process fast-moving instances (pending, provisioning, terminating)
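
To see what the added `not_(and_(...))` clause excludes, the same condition can be run as plain SQL against an in-memory SQLite database. This is only an illustration: the `instances` table, its columns, and `fetch_processable_ids` are simplified assumptions, not the real schema or query, and the other conditions of the pipeline's `fetch` are left out.

```python
import sqlite3


def fetch_processable_ids(conn: sqlite3.Connection) -> list:
    """Return ids of non-deleted instances that are not placeholders."""
    rows = conn.execute(
        """
        SELECT id FROM instances
        WHERE NOT (status = 'pending' AND provisioning_job_id IS NOT NULL)
          AND deleted = 0
        ORDER BY id
        """
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE instances (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        provisioning_job_id TEXT,
        deleted INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO instances (id, status, provisioning_job_id, deleted) VALUES (?, ?, ?, ?)",
    [
        (1, "pending", None, 0),            # ordinary pending instance: kept
        (2, "pending", "job-1", 0),         # placeholder: skipped
        (3, "provisioning", "job-2", 0),    # promoted placeholder: kept
        (4, "idle", None, 1),               # deleted: skipped
    ],
)
```

Only the row that is both `pending` and linked to a job is filtered out by the new clause. A placeholder promoted to `provisioning` becomes visible to the pipeline again.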
