fix(rollout): drain generation before offload memory release #2015

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:vmax/offload-drain-before-release

@EazyReal EazyReal commented Jun 4, 2026

Problem

RolloutServer.offload() currently releases SGLang engine memory by calling release_memory_occupation() directly on each offloaded engine group.

SGLangEngine.release_memory_occupation() does call flush_cache() internally, but that flush is engine-local: it does not first stop the engine from accepting or advancing requests. Under high concurrency, memory release can therefore overlap with in-flight generation work, and memory may be released before the rollout server has reached a stable drained state.

For rollout offload, the desired lifecycle is:

pause generation, drain in-flight requests, then release memory.

This is also the lifecycle shape already used by the rollout weight-update path before mutating engine state: pause_generation -> flush_cache -> update -> continue_generation.

Fix

This PR makes rollout offload a server-level three-phase transition:

  1. issue pause_generation() for every offloaded SGLang engine in the server;
  2. wait for all pause refs, then issue and wait for flush_cache() on every offloaded engine;
  3. issue release_memory_occupation() only after the server has reached the drained state.

Generation resumes at the matching safe boundary: after onload_kv() restores KV-cache and CUDA-graph memory, the rollout server issues and waits for continue_generation().
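The ordering described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: FakeEngine, FakeGroup, and FakeRolloutServer are hypothetical stand-ins, every call is synchronous and only appends to a log, and the completion of each loop plays the role of the "wait for all refs" barrier. Applying onload_kv() only to groups with needs_offload=True is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class FakeEngine:
    """Hypothetical stand-in for an SGLang engine; every call is only logged."""
    name: str
    log: list

    def pause_generation(self):
        self.log.append(("pause", self.name))

    def flush_cache(self):
        self.log.append(("flush", self.name))

    def release_memory_occupation(self):
        self.log.append(("release", self.name))

    def onload_kv(self):
        self.log.append(("onload_kv", self.name))

    def continue_generation(self):
        self.log.append(("continue", self.name))


@dataclass
class FakeGroup:
    """Hypothetical server group; needs_offload mirrors the flag named in the PR."""
    engines: list
    needs_offload: bool = True


@dataclass
class FakeRolloutServer:
    groups: list

    def _offloaded(self):
        # Only groups that opt in to offload take part in the transition.
        return [e for g in self.groups if g.needs_offload for e in g.engines]

    def offload(self):
        engines = self._offloaded()
        for e in engines:  # phase 1: pause every engine first
            e.pause_generation()
        for e in engines:  # phase 2: flush only after all pauses completed
            e.flush_cache()
        for e in engines:  # phase 3: release only after all flushes completed
            e.release_memory_occupation()

    def onload_kv(self):
        engines = self._offloaded()
        for e in engines:  # restore KV-cache / CUDA-graph memory first
            e.onload_kv()
        for e in engines:  # then let generation resume
            e.continue_generation()


log = []
server = FakeRolloutServer([
    FakeGroup([FakeEngine("a", log), FakeEngine("b", log)]),
    FakeGroup([FakeEngine("c", log)], needs_offload=False),
])
server.offload()
server.onload_kv()
phases = [phase for phase, _ in log]
```

In the log produced here, both pauses precede both flushes, both flushes precede both releases, and engine "c" never appears because its group has needs_offload=False.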

Rationale

I kept the coordination in RolloutServer because that is the layer that can see all server groups and preserve the phase ordering across the whole rollout server. This lets every offloaded group pause before any group proceeds to flush or release memory.

This keeps the lower-level shape close to the existing code:

  • ServerGroup methods remain non-blocking and return Ray ObjectRefs, preserving the batching direction from "refactor: make EngineGroup ops non-blocking and batch ray.get at RolloutServer level" (#1613).
  • release_memory_occupation() keeps its internal flush for direct callers such as recovery.
  • The normal rollout offload path adds only the orchestration-level quiescence needed before release.
  • Resuming at onload_kv() matches the normal restore order: weights are restored/updated first, then KV-cache and CUDA-graph memory are restored, then generation can continue.
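The split between non-blocking group methods and a single server-level wait per phase could look roughly like the sketch below. It does not use Ray: concurrent.futures.Future stands in for a Ray ObjectRef and concurrent.futures.wait stands in for the batched ray.get, and SketchGroup and run_phase are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class SketchGroup:
    """Hypothetical group whose methods return handles without blocking."""

    def __init__(self, pool, engine_names, events):
        self.pool = pool
        self.engine_names = engine_names
        self.events = events

    def _issue(self, op):
        # Submit one task per engine and return the futures immediately.
        return [
            self.pool.submit(self.events.append, (op, name))
            for name in self.engine_names
        ]

    def pause_generation(self):
        return self._issue("pause")

    def flush_cache(self):
        return self._issue("flush")

    def release_memory_occupation(self):
        return self._issue("release")


def run_phase(groups, method_name):
    """Issue one operation on every group, then block once on all handles."""
    refs = [ref for group in groups for ref in getattr(group, method_name)()]
    wait(refs)
    for ref in refs:
        ref.result()  # re-raise any exception from a worker


events = []
with ThreadPoolExecutor(max_workers=4) as pool:
    groups = [
        SketchGroup(pool, ["a", "b"], events),
        SketchGroup(pool, ["c"], events),
    ]
    for method_name in ("pause_generation", "flush_cache", "release_memory_occupation"):
        run_phase(groups, method_name)

ops = [op for op, _ in events]
```

Within a phase the engines may complete in any order, but because run_phase blocks on every handle before returning, no flush is issued until all pauses have finished, and no release until all flushes have finished. Keeping the wait at the server level is what lets the barrier span all groups rather than one group at a time.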

A few alternatives seemed less precise:

  • relying only on the existing internal flush_cache() leaves generation unpaused before release;
  • adding sleeps/retries around release would make the race timing-dependent;
  • moving pause/continue into SGLangEngine would make it harder to coordinate all rollout groups in one server.

Tests

Adds a CPU unit test that imports the real rollout dataclasses with lightweight Ray/SGLang stubs and verifies:

  • all offloaded server groups receive pause_generation, and all pause refs are waited on before any flush starts;
  • all flush refs are waited on before any release starts;
  • groups with needs_offload=False are skipped;
  • onload_kv() restores KV/CUDA-graph memory before resuming generation.
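One way such ordering checks might be expressed is as a barrier assertion over a recorded event log. The helper below is a hypothetical sketch, not the test added by this PR: assert_barrier, the event names, and the group labels "g0" and "g1" are all invented for illustration.

```python
def assert_barrier(events, earlier, later):
    """Fail unless every `earlier` event precedes every `later` event."""
    earlier_idx = [i for i, (kind, _) in enumerate(events) if kind == earlier]
    later_idx = [i for i, (kind, _) in enumerate(events) if kind == later]
    assert earlier_idx and later_idx, f"expected both {earlier!r} and {later!r} events"
    assert max(earlier_idx) < min(later_idx), (
        f"{later!r} began before all {earlier!r} events finished"
    )


# A log in which each phase fully completes before the next is issued.
good = [
    ("pause_waited", "g0"), ("pause_waited", "g1"),
    ("flush_issued", "g0"), ("flush_issued", "g1"),
    ("flush_waited", "g0"), ("flush_waited", "g1"),
    ("release_issued", "g0"), ("release_issued", "g1"),
]
assert_barrier(good, "pause_waited", "flush_issued")
assert_barrier(good, "flush_waited", "release_issued")

# A log in which g0 starts flushing while g1 is still pausing.
bad = [("pause_waited", "g0"), ("flush_issued", "g0"), ("pause_waited", "g1")]
try:
    assert_barrier(bad, "pause_waited", "flush_issued")
except AssertionError:
    violation_detected = True
else:
    violation_detected = False
```

Comparing the last index of the earlier phase against the first index of the later phase checks the cross-group barrier directly, which a per-group ordering check would miss.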

The new test is registered in the CPU CI matrix through .github/workflows/pr-test.yml.j2, and the generated workflow is refreshed.

Validation

  • uv run --with pytest --with pyyaml python tests/test_rollout_offload_coordination.py
  • uv run --with ruff ruff check slime/ray/rollout.py tests/test_rollout_offload_coordination.py
  • python3 .github/workflows/generate_github_workflows.py
  • git diff --check HEAD~1..HEAD

@EazyReal EazyReal marked this pull request as ready for review June 4, 2026 03:39