Merged
2 changes: 1 addition & 1 deletion skyrl/train/config/config.py
@@ -188,7 +188,7 @@ class MegatronConfig(BaseConfig):
transformer_config_kwargs: Dict[str, Any] = field(
default_factory=lambda: copy.deepcopy(DEFAULT_TRANSFORMER_CONFIG_KWARGS)
)
- empty_cuda_cache: Optional[bool] = None
+ empty_cuda_cache: Optional[bool] = True
Contributor

medium

Defaulting empty_cuda_cache to True helps prevent out-of-memory (OOM) errors during the optimization step. However, calling torch.cuda.empty_cache() after every forward and forward-backward pass (that is, once per mini-batch) introduces significant CUDA synchronization overhead, which can drastically degrade training throughput.

Additionally, since None is no longer the default and is functionally treated as False in the worker implementation (if self.empty_cuda_cache:), we can simplify the type annotation from Optional[bool] to bool.

Suggested change
- empty_cuda_cache: Optional[bool] = True
+ empty_cuda_cache: bool = True
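
To illustrate both points, here is a minimal sketch, not SkyRL's actual worker. `SketchWorker`, `count_cache_clears`, and the two config classes are hypothetical names, and a counting stub stands in for torch.cuda.empty_cache() so the example runs without a GPU. It assumes the worker gates the call with `if self.empty_cuda_cache:` once per mini-batch, as described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OptionalFlagConfig:
    # Mirrors the annotation before the suggestion; None is still permitted.
    empty_cuda_cache: Optional[bool] = True


@dataclass
class BoolFlagConfig:
    # Mirrors the suggested annotation; only True/False are intended.
    empty_cuda_cache: bool = True


class SketchWorker:
    """Hypothetical stand-in for the worker (assumption, not the real class)."""

    def __init__(self, empty_cuda_cache: Optional[bool], clear_cache: Callable[[], None]):
        self.empty_cuda_cache = empty_cuda_cache
        # In a real worker this would be torch.cuda.empty_cache.
        self._clear_cache = clear_cache

    def run_mini_batches(self, num_mini_batches: int) -> None:
        for _ in range(num_mini_batches):
            # ... the forward or forward-backward pass would run here ...
            if self.empty_cuda_cache:  # None and False both skip the call
                self._clear_cache()


def count_cache_clears(flag: Optional[bool], num_mini_batches: int) -> int:
    """Count how often the cache-clear hook fires for a given flag value."""
    calls = 0

    def fake_clear() -> None:
        nonlocal calls
        calls += 1

    SketchWorker(flag, fake_clear).run_mini_batches(num_mini_batches)
    return calls


if __name__ == "__main__":
    for flag in (None, False, True):
        print(flag, count_cache_clears(flag, num_mini_batches=8))
```

Under these assumptions, None and False behave identically, which is why `Optional[bool]` adds nothing once True is the default, and True triggers one cache clear per mini-batch, which is where the overhead would accumulate.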

model_config_kwargs: dict = field(default_factory=dict)
dist_ckpt_optim_fully_reshardable: bool = False
freeze_moe_router: bool = False