[megatron] make empty_cuda_cache default true for megatron#1731
Conversation
There was a problem hiding this comment.
Code Review
This pull request changes the default value of empty_cuda_cache from None to True in MegatronConfig. The reviewer notes that while this change helps prevent Out-Of-Memory (OOM) issues, enabling it by default can introduce significant CUDA synchronization overhead and degrade training throughput. Additionally, the reviewer suggests simplifying the type annotation from Optional[bool] to bool since None is no longer the default.
| default_factory=lambda: copy.deepcopy(DEFAULT_TRANSFORMER_CONFIG_KWARGS) | ||
| ) | ||
| empty_cuda_cache: Optional[bool] = None | ||
| empty_cuda_cache: Optional[bool] = True |
There was a problem hiding this comment.
While setting empty_cuda_cache to True by default helps prevent Out-Of-Memory (OOM) issues during the optimization step, calling torch.cuda.empty_cache() after every forward and forward-backward pass (per mini-batch) introduces significant CUDA synchronization overhead, which can drastically degrade training throughput.
Additionally, since None is no longer the default and is functionally treated as False in the worker implementation (if self.empty_cuda_cache:), we can simplify the type annotation from Optional[bool] to bool.
| empty_cuda_cache: Optional[bool] = True | |
| empty_cuda_cache: bool = True |
This was originally added in #716, but was set to None by default during the config refactoring: d1e1b29
Setting back to true, as this can help with unnecessary OOMs during optim step.