Describe the bug
Hi @pzelasko,
When resuming a Speechlm2 training run, I get this log, which is a bit worrying. I remember seeing a NeMo log for ASR saying to ignore it, but I do not see a similar NeMo log here. Does training correctly resume from where it left off (even when using batch_tokens)?
nemo-2.4.0+py3.12.10/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py:161: You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing the `state_dict` / `load_state_dict` interface.
(Do not pay attention to nemo-2.4.0 in the path; we are using NeMo 2.8.0rc0, but to avoid a NaN loss we use our 2.4 environment.)
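For context, the warning refers to the `state_dict` / `load_state_dict` interface Lightning expects on the dataloader (or its sampler) to make mid-epoch resume reliable. Below is a minimal sketch of that interface on a hypothetical batch iterator; it is not NeMo or Lhotse code, just an illustration of what Lightning's training loop looks for:

```python
# Sketch of the `state_dict` / `load_state_dict` interface mentioned in the
# Lightning warning. The class and its internals are hypothetical, not NeMo's
# actual dataloader: the point is only that a resumable iterable tracks its
# position and can save/restore it across a mid-epoch checkpoint.
class ResumableBatchIterator:
    def __init__(self, batches):
        self.batches = list(batches)
        self._pos = 0  # number of batches already yielded

    def __iter__(self):
        while self._pos < len(self.batches):
            batch = self.batches[self._pos]
            self._pos += 1
            yield batch

    def state_dict(self):
        # Saved into the checkpoint by the training loop.
        return {"pos": self._pos}

    def load_state_dict(self, state):
        # Restored on resume; iteration continues from the saved position.
        self._pos = state["pos"]


# Simulate a mid-epoch checkpoint after two batches, then resume.
it = ResumableBatchIterator(["a", "b", "c", "d"])
seen = []
for batch in it:
    seen.append(batch)
    if len(seen) == 2:
        saved = it.state_dict()
        break

it2 = ResumableBatchIterator(["a", "b", "c", "d"])
it2.load_state_dict(saved)
resumed = list(it2)  # → ["c", "d"]
```

Without this interface, Lightning cannot restore the dataloader's position, which is exactly what the warning is flagging.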
Steps/Code to reproduce bug
Expected behavior
Training should resume from exactly where it left off.
Environment overview (please complete the following information)
Environment details
NeMo 2.8.0rc0
Additional context
H100