fix(rollout): drain generation before offload memory release by EazyReal · Pull Request #2015 · THUDM/slime

EazyReal · 2026-06-04T02:03:51Z

Problem

RolloutServer.offload() currently releases SGLang engine memory by calling release_memory_occupation() directly on each offloaded engine group.

SGLangEngine.release_memory_occupation() does call flush_cache() internally, but that flush is engine-local. It does not first stop generation from accepting or advancing requests. Under high concurrency, memory release can therefore overlap with in-flight generation work, so offload may run before the rollout server has reached a stable drained state.

For rollout offload, the desired lifecycle is:

pause generation, drain in-flight requests, then release memory.

This is also the lifecycle shape already used by the rollout weight-update path before mutating engine state: pause_generation -> flush_cache -> update -> continue_generation.

Fix

This PR makes rollout offload a server-level three-phase transition:

issue pause_generation() for every offloaded SGLang engine in the server;
wait for all pause refs, then issue and wait for flush_cache() on every offloaded engine;
issue release_memory_occupation() only after the server has reached the drained state.
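
The three phases above can be sketched without a cluster. This is a minimal illustration under stated assumptions, not the code in slime/ray/rollout.py: `FakeEngine`, `FakeGroup`, `submit`, and this `offload` function are hypothetical stand-ins, and `concurrent.futures` futures play the role of Ray ObjectRefs. Only the method names `pause_generation`, `flush_cache`, `release_memory_occupation` and the `needs_offload` flag come from this PR.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for Ray: futures play the role of ObjectRefs,
# and Future.result() plays the role of ray.get().
_pool = ThreadPoolExecutor(max_workers=4)


class FakeEngine:
    """Hypothetical engine stub that records each call in a shared log."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def pause_generation(self):
        self.log.append(("pause", self.name))

    def flush_cache(self):
        self.log.append(("flush", self.name))

    def release_memory_occupation(self):
        self.log.append(("release", self.name))


class FakeGroup:
    """Hypothetical server group: submit() is non-blocking and returns futures."""

    def __init__(self, engines, needs_offload=True):
        self.engines, self.needs_offload = engines, needs_offload

    def submit(self, op):
        return [_pool.submit(getattr(e, op)) for e in self.engines]


def offload(groups):
    """Run the three phases with a server-wide barrier after each one."""
    targets = [g for g in groups if g.needs_offload]
    for op in ("pause_generation", "flush_cache", "release_memory_occupation"):
        refs = [ref for g in targets for ref in g.submit(op)]
        for ref in refs:
            ref.result()  # barrier: every engine finishes this phase first


log = []
groups = [
    FakeGroup([FakeEngine("a0", log), FakeEngine("a1", log)]),
    FakeGroup([FakeEngine("b0", log)]),
    FakeGroup([FakeEngine("c0", log)], needs_offload=False),
]
offload(groups)
phases = [op for op, _ in log]
print(phases)
```

Within a phase the engines may complete in any order, but no engine enters the next phase until every future from the current one has resolved. That barrier is the property that coordinating at the server level is meant to provide.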

Generation resumes at the matching safe boundary: after onload_kv() restores KV-cache and CUDA-graph memory, the rollout server issues and waits for continue_generation().
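
The resume boundary can be illustrated the same way. In this sketch, `FakeEngine`, its `kv_resident` and `paused` fields, and `restore_kv_and_cuda_graph` are hypothetical names chosen for illustration. Only `onload_kv` and `continue_generation` are named in this PR, and the real restore call is not shown here.

```python
class FakeEngine:
    """Hypothetical engine stub with just enough state to show the ordering."""

    def __init__(self):
        self.kv_resident = False  # KV-cache / CUDA-graph memory is offloaded
        self.paused = True        # generation was paused during offload

    def restore_kv_and_cuda_graph(self):  # hypothetical name, not an SGLang API
        self.kv_resident = True

    def continue_generation(self):
        # The stub rejects a resume that arrives before memory is back.
        if not self.kv_resident:
            raise RuntimeError("resumed before KV-cache memory was restored")
        self.paused = False


def onload_kv(engines):
    """Restore memory on every engine, then resume generation on every engine."""
    for engine in engines:
        engine.restore_kv_and_cuda_graph()
    for engine in engines:
        engine.continue_generation()


engines = [FakeEngine(), FakeEngine()]
onload_kv(engines)
resumed = all(e.kv_resident and not e.paused for e in engines)
print(resumed)
```

The guard in `continue_generation` exists only in this stub. It makes the ordering requirement visible by failing loudly if resume is attempted first.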

Rationale

I kept the coordination in RolloutServer because that is the layer that can see all server groups and preserve the phase ordering across the whole rollout server. This lets every offloaded group pause before any group proceeds to flush or release memory.

This keeps the lower-level shape close to the existing code:

ServerGroup methods remain non-blocking and return Ray ObjectRefs, preserving the batching direction from #1613 ("refactor: make EngineGroup ops non-blocking and batch ray.get at RolloutServer level").
release_memory_occupation() keeps its internal flush for direct callers such as recovery.
The normal rollout offload path adds only the orchestration-level quiescence needed before release.
Resuming at onload_kv() matches the normal restore order: weights are restored/updated first, then KV-cache and CUDA-graph memory are restored, then generation can continue.

A few alternatives seemed less precise:

relying only on the existing internal flush_cache() leaves generation unpaused before release;
adding sleeps/retries around release would make the race timing-dependent;
moving pause/continue into SGLangEngine would make it harder to coordinate all rollout groups in one server.

Tests

Adds a CPU unit test that imports the real rollout dataclasses with lightweight Ray/SGLang stubs and verifies:

all offloaded server groups receive pause_generation, and all pause refs are waited before any flush starts;
all flush refs are waited before any release starts;
groups with needs_offload=False are skipped;
onload_kv() restores KV/CUDA-graph memory before resuming generation.
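
The ordering checks listed above can be expressed with recording stubs. The sketch below is not the contents of tests/test_rollout_offload_coordination.py: `FakeRef`, `FakeGroup`, `fake_get`, and the stand-in `offload` function are hypothetical, and the real test exercises the actual rollout dataclasses instead. It only shows how recording "issue" and "wait" events makes the phase boundaries assertable on CPU.

```python
events = []  # ordered record of ("issue" | "wait", op, group_name)


class FakeRef:
    """Hypothetical stand-in for a Ray ObjectRef."""

    def __init__(self, op, group):
        self.op, self.group = op, group


def fake_get(refs):
    """Stand-in for ray.get: records which refs were waited on."""
    for ref in refs:
        events.append(("wait", ref.op, ref.group))


class FakeGroup:
    """Hypothetical server group whose methods return refs without blocking."""

    def __init__(self, name, needs_offload=True):
        self.name, self.needs_offload = name, needs_offload

    def _issue(self, op):
        events.append(("issue", op, self.name))
        return FakeRef(op, self.name)

    def pause_generation(self):
        return self._issue("pause")

    def flush_cache(self):
        return self._issue("flush")

    def release_memory_occupation(self):
        return self._issue("release")


def offload(groups, get=fake_get):
    """Stand-in for the server-level offload under test."""
    targets = [g for g in groups if g.needs_offload]
    get([g.pause_generation() for g in targets])
    get([g.flush_cache() for g in targets])
    get([g.release_memory_occupation() for g in targets])


def index_of(kind, op, pick):
    """Position of the first (pick=min) or last (pick=max) matching event."""
    return pick(i for i, e in enumerate(events) if e[:2] == (kind, op))


def test_offload_phase_order():
    events.clear()
    offload([FakeGroup("g0"), FakeGroup("g1"), FakeGroup("skip", needs_offload=False)])
    # Every pause ref is waited on before the first flush is issued.
    assert index_of("wait", "pause", max) < index_of("issue", "flush", min)
    # Every flush ref is waited on before the first release is issued.
    assert index_of("wait", "flush", max) < index_of("issue", "release", min)
    # Groups with needs_offload=False never appear in the record.
    assert all(group != "skip" for _, _, group in events)


test_offload_phase_order()
```

Recording waits separately from issues is what lets the test distinguish "all pauses were issued first" from the stronger property "all pauses completed first".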

The new test is registered in the CPU CI matrix through .github/workflows/pr-test.yml.j2, and the generated workflow is refreshed.

Validation

uv run --with pytest --with pyyaml python tests/test_rollout_offload_coordination.py
uv run --with ruff ruff check slime/ray/rollout.py tests/test_rollout_offload_coordination.py
python3 .github/workflows/generate_github_workflows.py
git diff --check HEAD~1..HEAD

fix(rollout): drain generation before offload release

f79613b

tardis-key mentioned this pull request Jun 4, 2026

📋 Daily Briefing — 2026-06-04 tardis-key/codex#12

Open

EazyReal marked this pull request as ready for review June 4, 2026 03:39

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:vmax/offload-drain-before-release
