fix(profiler): dump memory snapshot on trainer OOM #176

Merged: kcz358 merged 1 commit into main from fix-memory-snapshot-oom-branch on May 21, 2026
@kcz358 (Collaborator) commented May 21, 2026

Summary

  • Add a trainer-level OOM fallback that force-dumps CUDA memory snapshots when training_step raises an OOM error
  • Use unique snapshot filenames containing rank, pid, and timestamp so that dumps do not overwrite each other
  • Log the full traceback when snapshot dumping fails
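
The diff itself is not shown on this page, so the following is only a minimal sketch of the pattern the summary describes, not the PR's actual code. All function names (build_snapshot_path, dump_snapshot_on_oom, run_step_with_oom_fallback) and the filename layout are hypothetical. The dump callable and the exception type are injected so the sketch runs without a GPU; in a real trainer they would presumably be something like torch.cuda.memory._dump_snapshot and torch.cuda.OutOfMemoryError, which is an assumption about the implementation.

```python
import logging
import os
import time
from typing import Callable, Optional, Type

logger = logging.getLogger(__name__)


def build_snapshot_path(output_dir: str, rank: int) -> str:
    """Hypothetical helper: put rank, pid, and a timestamp in the filename
    so that dumps from different ranks or processes do not collide."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    filename = f"oom_snapshot_rank{rank}_pid{os.getpid()}_{timestamp}.pickle"
    return os.path.join(output_dir, filename)


def dump_snapshot_on_oom(
    dump_fn: Callable[[str], None], output_dir: str, rank: int
) -> Optional[str]:
    """Try to write a snapshot; log the full traceback if dumping fails."""
    path = build_snapshot_path(output_dir, rank)
    try:
        os.makedirs(output_dir, exist_ok=True)
        dump_fn(path)
        return path
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("Failed to dump memory snapshot to %s", path)
        return None


def run_step_with_oom_fallback(
    step_fn: Callable[[], object],
    dump_fn: Callable[[str], None],
    output_dir: str,
    rank: int,
    oom_error: Type[BaseException] = MemoryError,  # assumption: the CUDA OOM type in practice
):
    """Run one training step; on OOM, dump a snapshot and re-raise."""
    try:
        return step_fn()
    except oom_error:
        dump_snapshot_on_oom(dump_fn, output_dir, rank)
        raise  # the original OOM still propagates to the caller
```

Re-raising after the dump keeps the trainer's failure behavior unchanged, and catching errors inside the dump helper means a failed dump cannot mask the original OOM.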

Verification

  • python -m compileall src/lmms_engine/utils/profiler.py src/lmms_engine/train/fsdp2/fsdp2_trainer.py

kcz358 merged commit a482ca3 into main on May 21, 2026
3 checks passed
kcz358 deleted the fix-memory-snapshot-oom-branch branch on May 21, 2026 02:22