SpeechLM2 : You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable.

**Describe the bug**

Hi @pzelasko,
When resuming a SpeechLM2 training run, I get the log below, which is a bit worrying. I remember that for ASR there was a NeMo log saying to ignore this warning, but I do not see a similar NeMo log here. Does training correctly resume from where it stopped (even when using batch_tokens)?
``` 
nemo-2.4.0+py3.12.10/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py:161: You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing the `state_dict` / `load_state_dict` interface.
```

(Ignore the nemo-2.4.0 in the path: we are using NeMo 2.8.0rc0, but to avoid NaN loss we use our 2.4 env.)
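
For reference, here is a minimal sketch of what I understand the `state_dict` / `load_state_dict` interface named in the warning to look like. `ResumableLoader` and its `consumed` counter are hypothetical names used only for illustration. This is not the NeMo, Lhotse, or Lightning implementation, and it does not show what SpeechLM2 actually does on resume.

```python
from typing import Any, Dict, Iterator, Sequence


class ResumableLoader:
    """Hypothetical loader that remembers how many batches it has yielded."""

    def __init__(self, batches: Sequence[Any]) -> None:
        self.batches = batches
        self._consumed = 0  # batches already yielded in the current epoch

    def __iter__(self) -> Iterator[Any]:
        # Start after the batches that were consumed before the checkpoint.
        start = self._consumed
        for index in range(start, len(self.batches)):
            self._consumed = index + 1
            yield self.batches[index]
        # Epoch finished: the next epoch starts from the beginning.
        self._consumed = 0

    def state_dict(self) -> Dict[str, int]:
        # Saved alongside the checkpoint (assumption: the caller does this).
        return {"consumed": self._consumed}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        # Restores the position within the epoch.
        self._consumed = state["consumed"]


# Simulate a run that stops mid-epoch after two batches.
loader = ResumableLoader(["b0", "b1", "b2", "b3"])
it = iter(loader)
seen = [next(it), next(it)]
saved = loader.state_dict()

# Simulate resuming in a fresh process with the saved state.
resumed = ResumableLoader(["b0", "b1", "b2", "b3"])
resumed.load_state_dict(saved)
remaining = list(resumed)
```

In this toy case the resumed loader continues with the batches that had not been seen yet, which is the behavior I would expect after resuming.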

**Steps/Code to reproduce bug**


**Expected behavior**

Should resume as it was when it ended.

**Environment overview (please complete the following information)**

**Environment details**

NeMo 2.8.0rc0

**Additional context**

H100


SpeechLM2 : You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. #15575
