Control which blocks stay on GPU and prevent eviction during multi-model workflows.
When multiple models share VRAM through Stagehand, the scheduler evicts blocks based on distance-to-next-use scoring. This works well for single-model streaming, but breaks down when:
- A model must stay fully loaded across multiple inference calls (e.g. text encoder used repeatedly)
- A primary model must not lose blocks when a guest model (upscaler, VAE) is temporarily loaded
- You need explicit VRAM budgeting across multiple runtimes
Context manager that prevents blocks from being selected as eviction candidates.
with runtime.keep_resident():
# All blocks of this runtime are protected from eviction
result1 = model.generate(prompt1)
result2 = model.generate(prompt2)
# Protection lifted — blocks are normal eviction candidates againProtect another runtime's blocks:
with transformer_runtime.keep_resident(text_encoder_runtime):
# Text encoder blocks stay on GPU while transformer runs
embeddings = text_encoder(prompt)
output = transformer(embeddings)Behavior:
- Sets protection flag on all blocks in the target runtime's registry
- Protected blocks are skipped during eviction candidate selection
- Protection is automatically removed on context exit (even on exception)
- If no blocks are found, logs a warning and yields without effect
Marks a model as a long-term resident and subtracts its VRAM footprint from the available guest budget.
# Reserve 2GB for the text encoder — scheduler accounts for this
runtime.reserve_for_resident(text_encoder_runtime, priority=ResidentPriority.PRIMARY)
# ... run many steps, text encoder blocks never evicted ...
# Release when done with the text encoder
runtime.release_reservation(text_encoder_runtime)Parameters:
model_or_runtime: Target runtime to reserve (orNonefor self)priority:ResidentPriority.PRIMARY(default, never evicted) orResidentPriority.SECONDARY(evicted only under extreme pressure)
Effects:
PRIMARYblocks are added to the scheduler's protection setBudgetManagersubtracts the model's total parameter size from available guest headroom- Guest models loaded via
as_guest()are evaluated against remaining headroom only
runtime.release_reservation(text_encoder_runtime)Removes the reservation, unprotects blocks (if PRIMARY), and returns headroom to the guest pool.
Context manager for short-lived models that should only use unreserved VRAM.
runtime.reserve_for_resident(priority=ResidentPriority.PRIMARY)
with runtime.as_guest(upscaler_runtime):
# Upscaler runs within guest headroom only
# Primary runtime blocks are never evicted to make room
result = upscaler.run(latent)
# Guest evicted on context exit| Priority | Eviction behavior | Use case |
|---|---|---|
PRIMARY |
Never evicted while reservation active | Main transformer, text encoder |
SECONDARY |
Evicted only under extreme VRAM pressure | Auxiliary models, optional components |
The four-pass video generation pipeline uses residency protection to manage a 22B transformer, text encoder, spatial upscaler, and temporal upscaler on a single 24GB GPU:
# Stage 1: Text encoding
with transformer_runtime.keep_resident(text_encoder_runtime):
embeddings = text_encoder(prompt)
# Stage 2: Base generation (transformer streams through VRAM)
transformer_runtime.reserve_for_resident(priority=ResidentPriority.PRIMARY)
with transformer_runtime.managed_forward():
base_latent = denoise(transformer, noise, embeddings, steps=8)
# Stage 3: Spatial upscale (guest model, uses remaining headroom)
with transformer_runtime.as_guest(spatial_upscaler_runtime):
upscaled = spatial_upscaler(base_latent)
# Stage 4: Temporal upscale (another guest)
with transformer_runtime.as_guest(temporal_upscaler_runtime):
final = temporal_upscaler(upscaled)
transformer_runtime.release_reservation()keep_resident()callsscheduler.protect_blocks(block_ids)/unprotect_blocks(block_ids)reserve_for_resident()stores a_ReservationEntrydataclass (runtime ref, size, block IDs, priority)BudgetManager.reserve_bytes()/release_reserved_bytes()adjust the effective watermarks- Reservations are tracked by
id(target)— one reservation per runtime instance release_reservation()is idempotent (no-op if no reservation exists)