feat(profiler): add lightweight CUDA event timing by choiszt · Pull Request #179 · EvolvingLMMs-Lab/lmms-engine

choiszt · 2026-05-25T06:36:55Z

Summary

This PR adds an opt-in lightweight CUDA event profiler for long-running distributed FSDP2 training jobs. When enabled, selected ranks write sampled JSONL timing records under:

output_dir/cuda_event_profiler/cuda_events_rank_<rank>.jsonl

It currently instruments these coarse training phases:

host_to_device
training_step
training_metrics
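For readers unfamiliar with the technique, the sketch below shows one way a coarse phase can be bracketed with CUDA events. It is a minimal illustration, not the implementation in this PR: `timed_phase`, the `sink` list, and the record keys `event` / `elapsed_ms` are hypothetical names, and the wall-clock fallback exists only so the snippet runs on machines without CUDA or PyTorch.

```python
import time
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: use the wall-clock fallback
    torch = None
    _HAS_CUDA = False


@contextmanager
def timed_phase(name, sink):
    """Time one coarse phase and append a record to `sink` (hypothetical helper)."""
    if _HAS_CUDA:
        # CUDA events are recorded on the current stream, so the measured
        # interval covers GPU work enqueued between the two record() calls.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        try:
            yield
        finally:
            end.record()
            # Simplification: wait for the end event right away. A profiler
            # could instead defer this until its periodic flush.
            end.synchronize()
            sink.append({"event": name, "elapsed_ms": start.elapsed_time(end)})
    else:
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - t0) * 1e3
            sink.append({"event": name, "elapsed_ms": elapsed_ms})


records = []
with timed_phase("training_step", records):
    sum(range(1000))  # stand-in for the real training step
```

Synchronizing on the end event inside the context manager keeps the sketch short, but it stalls the host on every sampled phase. Deferring the `elapsed_time` call until flush time would avoid that, at the cost of holding event pairs in memory a little longer.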

The PR also gates the qwen3_5_moe parallelization import so missing optional Transformers modules do not break normal FSDP2 trainer imports.
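The gating pattern can be sketched roughly as follows. This is an assumption about the general shape, not the code in this PR: `load_optional`, `apply_optional_parallelization`, the module path `hypothetical_pkg.models.qwen3_5_moe`, and the `parallelize` entry point are all placeholder names.

```python
import importlib


def load_optional(module_name, feature):
    """Import `module_name` on demand, raising a descriptive ImportError if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional module '{module_name}', "
            "which is not available in the installed packages."
        ) from exc


def apply_optional_parallelization(model):
    # The import happens only when this function is called, so importing the
    # trainer module itself never touches the optional dependency.
    mod = load_optional(
        "hypothetical_pkg.models.qwen3_5_moe",  # hypothetical module path
        "qwen3_5_moe parallelization",
    )
    return mod.parallelize(model)  # hypothetical entry point
```

Deferring the import to call time means the trainer import path stays clean, while users who actually request the optional feature get an error that names what is missing.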

Motivation

At small scale, a single aggregate step_time or MFU curve is often enough to notice that training became slower. At larger distributed scale, those aggregate metrics are usually not enough to explain why it became slower. A slowdown can come from very different sources:

one rank is slower than peers because its input batch is heavier or its host-to-device path is delayed
all ranks appear slow because they are waiting in collectives for a straggler
data movement starts taking a larger fraction of the step
training metrics, logging, or other non-model work starts interfering with the training loop
the issue is intermittent and only appears in a specific step window

torch.profiler is still the right tool for deep, operator-level analysis, but it is too heavy to leave on broadly for long distributed jobs. It also tends to be used after we already know which rank or time window is suspicious. This PR adds a cheaper first-pass diagnostic layer: CUDA event timing around a few coarse phases. The goal is not to replace full profiling, but to answer the first operational questions quickly:

Which rank is slow?
Which phase is slow?
Are other ranks waiting for one slow rank?
Is the issue persistent or only in a bounded step range?

This follows the same general observability lesson highlighted by MegaScale: CUDA event-based timers can provide useful cross-rank timing signals with much lower overhead than full traces, which makes them suitable for sampled diagnosis in distributed training.
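As an illustration of how the per-rank files could answer the questions above, the sketch below aggregates records by phase and rank. The record schema (`step`, `rank`, `event`, `elapsed_ms`) is an assumption, since the exact JSONL fields are not listed in this description, and the sample rows are made up for the example rather than measured.

```python
import json
from collections import defaultdict
from statistics import mean


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def mean_ms_by_phase_and_rank(records):
    """Return {(event, rank): mean elapsed_ms} over all sampled steps."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec["event"], rec["rank"])].append(rec["elapsed_ms"])
    return {key: mean(values) for key, values in buckets.items()}


def slowest_rank_per_phase(records):
    """Return {event: (rank, mean_ms)} for the rank with the largest mean."""
    slowest = {}
    for (event, rank), ms in mean_ms_by_phase_and_rank(records).items():
        if event not in slowest or ms > slowest[event][1]:
            slowest[event] = (rank, ms)
    return slowest


# Made-up rows for illustration only; not measurements from this PR.
SAMPLE = """
{"step": 100, "rank": 0, "event": "host_to_device", "elapsed_ms": 9.0}
{"step": 100, "rank": 1, "event": "host_to_device", "elapsed_ms": 1.0}
{"step": 100, "rank": 0, "event": "training_step", "elapsed_ms": 30.0}
{"step": 100, "rank": 1, "event": "training_step", "elapsed_ms": 38.0}
{"step": 110, "rank": 0, "event": "host_to_device", "elapsed_ms": 9.5}
{"step": 110, "rank": 1, "event": "host_to_device", "elapsed_ms": 1.2}
{"step": 110, "rank": 0, "event": "training_step", "elapsed_ms": 30.5}
{"step": 110, "rank": 1, "event": "training_step", "elapsed_ms": 38.4}
"""

slowest = slowest_rank_per_phase(parse_jsonl(SAMPLE))
```

In this made-up data, rank 0 has the slow `host_to_device` phase while rank 1 shows the longer `training_step`, which is the kind of pattern one would expect when a peer waits in a collective for a delayed rank.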

Design

The profiler is deliberately conservative:

disabled by default via enable_cuda_event_profiler: false
bounded by start_step / end_step
sampled by record_every_n_steps with a default of 10
flushed periodically with flush_every_n_steps with a default of 10
restricted to selected ranks via ranks
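The combined effect of these settings can be sketched as a single gate function. This is a hedged illustration rather than the PR's code: `EventProfilerConfig` and `should_record` are hypothetical names, and the sketch assumes that `end_step` is exclusive, that sampling is counted from `start_step`, and that an unset `ranks` means all ranks. The defaults for `start_step` and `end_step` are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EventProfilerConfig:
    start_step: int = 0                    # assumed default
    end_step: Optional[int] = None         # assumed default: no upper bound
    record_every_n_steps: int = 10
    flush_every_n_steps: int = 10
    ranks: Optional[Sequence[int]] = None  # assumed: None selects every rank


def should_record(cfg, enabled, step, rank):
    """Return True if this rank should record timing events at this step."""
    if not enabled:
        return False
    if cfg.ranks is not None and rank not in cfg.ranks:
        return False
    if step < cfg.start_step:
        return False
    if cfg.end_step is not None and step >= cfg.end_step:
        return False
    # Sample relative to start_step (an assumption about the sampling phase).
    return (step - cfg.start_step) % cfg.record_every_n_steps == 0


cfg = EventProfilerConfig(
    start_step=100, end_step=1000, record_every_n_steps=10,
    flush_every_n_steps=50, ranks=[0, 1, 7],
)
sampled_on_rank_7 = sum(should_record(cfg, True, s, 7) for s in range(2000))
```

Checking the cheapest conditions first (the enable flag and the rank filter) keeps the per-step cost on unselected ranks to a couple of comparisons.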

Example config:

trainer_args:
  enable_cuda_event_profiler: true
  cuda_event_profiler_config:
    start_step: 100
    end_step: 1000
    record_every_n_steps: 10
    flush_every_n_steps: 50
    ranks: [0, 1, 7]

This keeps JSONL volume bounded. For example, with 8 ranks recording over 900 observed steps, sampling every 10 steps with 3 events per sampled step produces only 8 * 90 * 3 = 2160 JSONL rows. The profiler is not intended to be enabled on every rank for every step in very large jobs.
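The row-count arithmetic can be checked with a small helper. `estimated_rows` is a hypothetical function written for this description, and it assumes one row per event per sampled step with `end_step` treated as exclusive.

```python
def estimated_rows(num_ranks, start_step, end_step, record_every_n_steps, events_per_step):
    """Estimate JSONL rows written across all recording ranks."""
    # Number of sampled steps in [start_step, end_step), counted from start_step.
    sampled_steps = len(range(start_step, end_step, record_every_n_steps))
    return num_ranks * sampled_steps * events_per_step


rows_8_ranks = estimated_rows(8, 100, 1000, 10, 3)    # 8 * 90 * 3
rows_3_ranks = estimated_rows(3, 100, 1000, 10, 3)    # ranks: [0, 1, 7]
rows_every_step = estimated_rows(8, 100, 1000, 1, 3)  # no sampling
```

Under these assumptions, restricting collection to the three ranks in the example config cuts the volume to 810 rows, while recording every step on 8 ranks would write 21600 rows over the same window.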

Validation

Unit tests:
- disabled profiler does not write files
- enabled profiler writes valid JSONL
- start_step, end_step, and explicit record_every_n_steps work
- default sampling records every 10 steps
- rank filters skip unselected ranks

Real CUDA distributed validation:
- 2 GPU smoke test confirmed each rank writes expected JSONL records
- 8 GPU overhead comparison:
  - profiler disabled: 3.0619 ms/step
  - record every step, flush every 10 steps: 3.1376 ms/step (+2.47%)
  - record every 10 steps, flush every 10 steps: 3.0584 ms/step, within noise
- 8 GPU synthetic straggler test:
  - rank 0 synthetic work appears as synthetic_straggler
  - other ranks show longer all_reduce, confirming the traces can expose collective wait behavior

Import smoke:
- FSDP2SFTTrainer imports successfully after gating optional qwen3_5_moe parallelization import
- calling qwen3_5_moe parallelization without the optional Transformers module raises a clear ImportError

Static checks:
- compileall
- git diff --check

Notes

The validation above covers single-node 8 GPU behavior only. Multi-node production scaling validation is still future work. The intended usage is targeted diagnosis when we suspect stragglers, MFU drops, host-to-device stalls, or collective wait behavior, not always-on full-fidelity logging.

choiszt and others added 4 commits May 25, 2026 14:25

feat(profiler): add lightweight CUDA event timing

8d35af0

fix(parallel): gate qwen3_5_moe optional import

863c79f

style: auto-fix lint (black + isort)

a3e871a

chore(profiler): make event sampling conservative

196ba85

choiszt wants to merge 4 commits into main from infra/cuda-event-profiler
