
[megatron] make empty_cuda_cache default true for megatron #1731

Merged
erictang000 merged 1 commit into NovaSky-AI:main from erictang000:empty_cuda_cache
May 29, 2026

@erictang000
Collaborator

This was originally added in #716, but was set to None by default during the config refactoring: d1e1b29

Setting it back to True, as this can help avoid unnecessary OOMs during the optimizer step.

@erictang000 merged commit 793a543 into NovaSky-AI:main May 29, 2026
2 of 3 checks passed
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request changes the default value of empty_cuda_cache from None to True in MegatronConfig. The reviewer notes that while this change helps prevent Out-Of-Memory (OOM) issues, enabling it by default can introduce significant CUDA synchronization overhead and degrade training throughput. Additionally, the reviewer suggests simplifying the type annotation from Optional[bool] to bool since None is no longer the default.

      default_factory=lambda: copy.deepcopy(DEFAULT_TRANSFORMER_CONFIG_KWARGS)
  )
- empty_cuda_cache: Optional[bool] = None
+ empty_cuda_cache: Optional[bool] = True

medium

While setting empty_cuda_cache to True by default helps prevent Out-Of-Memory (OOM) issues during the optimization step, calling torch.cuda.empty_cache() after every forward and forward-backward pass (per mini-batch) introduces significant CUDA synchronization overhead, which can drastically degrade training throughput.

Additionally, since None is no longer the default and is functionally treated as False in the worker implementation (if self.empty_cuda_cache:), we can simplify the type annotation from Optional[bool] to bool.

Suggested change
- empty_cuda_cache: Optional[bool] = True
+ empty_cuda_cache: bool = True
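
To illustrate the truthiness point in the review comment, here is a minimal sketch, not the project's actual code. `SketchMegatronConfig` and `maybe_empty_cache` are hypothetical names invented for this example, and the cache-clearing call is passed in as a plain callable so the snippet runs without a GPU. In a real worker that callable would presumably be `torch.cuda.empty_cache`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchMegatronConfig:
    # Hypothetical stand-in for MegatronConfig; only the field under discussion.
    empty_cuda_cache: Optional[bool] = True


def maybe_empty_cache(cfg: SketchMegatronConfig, empty_cache_fn: Callable[[], None]) -> bool:
    """Run empty_cache_fn only when the flag is truthy; report whether it ran."""
    if cfg.empty_cuda_cache:  # None and False both take the "skip" branch
        empty_cache_fn()
        return True
    return False


calls = []


def record_call() -> None:
    # Stand-in for torch.cuda.empty_cache (assumption); it only counts invocations.
    calls.append("emptied")


results = {
    value: maybe_empty_cache(SketchMegatronConfig(empty_cuda_cache=value), record_call)
    for value in (None, False, True)
}
```

Under this sketch, `None` and `False` behave identically at the call site, which is the basis for the suggestion to narrow the annotation to `bool`. Because the gate runs once per call, a worker that invokes it after every mini-batch pays the cost of the cache-clearing call each time the flag is on.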

@erictang000 deleted the empty_cuda_cache branch May 29, 2026 20:44