
SpeechLM2: You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. #15575

@AudranBert


Describe the bug

Hi @pzelasko,
When resuming a SpeechLM2 training run, I get the log below, which is a bit worrying. For ASR, I remember seeing a NeMo log saying to ignore this warning, but I do not see a similar NeMo log here. Does training correctly resume from where it stopped (even when using batch_tokens)?

nemo-2.4.0+py3.12.10/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py:161: You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing the `state_dict` / `load_state_dict` interface.

(Ignore the nemo-2.4.0 in the path: we are using NeMo 2.8.0rc0, but to avoid NaN loss we use our 2.4 env.)
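
For reference, the `state_dict` / `load_state_dict` interface that the warning mentions can be sketched roughly as below. This is only a minimal illustration: the class name `ResumableLoader` and its `next_index` field are hypothetical, and it does not reflect how NeMo or its data pipeline actually track their position.

```python
from typing import Any, Dict, Iterator, List


class ResumableLoader:
    """Hypothetical loader that remembers how many batches it has yielded."""

    def __init__(self, batches: List[Any]) -> None:
        self.batches = batches
        self._next_index = 0  # index of the next batch to yield

    def __iter__(self) -> Iterator[Any]:
        # Continue from the saved position instead of always starting at 0.
        while self._next_index < len(self.batches):
            batch = self.batches[self._next_index]
            self._next_index += 1
            yield batch
        # Epoch finished: start the next epoch from the beginning.
        self._next_index = 0

    def state_dict(self) -> Dict[str, int]:
        # Saved alongside the checkpoint so a mid-epoch position survives.
        return {"next_index": self._next_index}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        # Restored on resume so already-seen batches are skipped.
        self._next_index = state["next_index"]


# Simulate stopping mid-epoch, then resuming in a fresh loader.
loader = ResumableLoader(list(range(5)))
iterator = iter(loader)
seen_before_stop = [next(iterator), next(iterator)]
saved_state = loader.state_dict()

resumed = ResumableLoader(list(range(5)))
resumed.load_state_dict(saved_state)
seen_after_resume = list(resumed)
```

With a loader like this, the batches seen after the resume continue from where the interrupted run stopped rather than restarting the epoch, which is the behavior I am asking about.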

Steps/Code to reproduce bug

Expected behavior

Training should resume from the exact point where it stopped.

Environment overview (please complete the following information)

Environment details

NeMo 2.8.0rc0

Additional context

H100
