feat(profiler): add lightweight CUDA event timing #179

Open: choiszt wants to merge 4 commits into main from infra/cuda-event-profiler

choiszt commented May 25, 2026

Summary

This PR adds an opt-in lightweight CUDA event profiler for long-running distributed FSDP2 training jobs. When enabled, selected ranks write sampled JSONL timing records under:

output_dir/cuda_event_profiler/cuda_events_rank_<rank>.jsonl

It currently instruments these coarse training phases:

  • host_to_device
  • training_step
  • training_metrics

The PR also gates the qwen3_5_moe parallelization import so missing optional Transformers modules do not break normal FSDP2 trainer imports.

Motivation

At small scale, a single aggregate step_time or MFU curve is often enough to notice that training became slower. At larger distributed scale, those aggregate metrics are usually not enough to explain why it became slower. A slowdown can come from very different sources:

  • one rank is slower than peers because its input batch is heavier or its host-to-device path is delayed
  • all ranks appear slow because they are waiting in collectives for a straggler
  • data movement starts taking a larger fraction of the step
  • training metrics, logging, or other non-model work starts interfering with the training loop
  • the issue is intermittent and only appears in a specific step window

torch.profiler is still the right tool for deep, operator-level analysis, but it is too heavy to leave enabled across many ranks for long distributed jobs. It also tends to be used only after we already know which rank or time window is suspicious. This PR adds a cheaper first-pass diagnostic layer: CUDA event timing around a few coarse phases. The goal is not to replace full profiling, but to answer the first operational questions quickly:

  • Which rank is slow?
  • Which phase is slow?
  • Are other ranks waiting for one slow rank?
  • Is the issue persistent or only in a bounded step range?

This follows the same general observability lesson highlighted by MegaScale: CUDA-event-based timers can provide useful cross-rank timing signals with much lower overhead than full traces, which makes them suitable for sampled diagnosis in distributed training.
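A minimal sketch of event-pair timing around a coarse phase, for readers unfamiliar with the technique. It is not the implementation in this PR: `PhaseTimer`, `phase`, `drain`, and the row keys are illustrative names. The only PyTorch calls it assumes are `torch.cuda.Event(enable_timing=True)`, `Event.record()`, `Event.synchronize()`, and `Event.elapsed_time()`. Events are enqueued on the current stream without blocking the host, and elapsed times are read later, when the pending pairs are drained.

```python
from contextlib import contextmanager


def _cuda_event():
    # Imported lazily so this module also loads on machines without PyTorch.
    import torch
    return torch.cuda.Event(enable_timing=True)


class PhaseTimer:
    """Illustrative helper: times named phases with start/end event pairs."""

    def __init__(self, event_factory=_cuda_event):
        # The factory is injectable so the logic can be exercised without a GPU.
        self._event_factory = event_factory
        self._pending = []  # (step, name, start_event, end_event)

    @contextmanager
    def phase(self, step, name):
        start, end = self._event_factory(), self._event_factory()
        start.record()  # enqueued on the current stream, no host sync
        try:
            yield
        finally:
            end.record()
            self._pending.append((step, name, start, end))

    def drain(self):
        """Resolve pending pairs into rows; waits only on each end event."""
        rows = []
        for step, name, start, end in self._pending:
            end.synchronize()  # block until the end event has completed
            rows.append({"step": step, "name": name,
                         "elapsed_ms": start.elapsed_time(end)})
        self._pending.clear()
        return rows
```

In a training loop this would be used as `with timer.phase(step, "training_step"): ...`, with `timer.drain()` called only on flush steps so that the synchronization cost is paid rarely.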

Design

The profiler is deliberately conservative:

  • disabled by default via enable_cuda_event_profiler: false
  • bounded by start_step / end_step
  • sampled every record_every_n_steps steps (default 10)
  • flushed every flush_every_n_steps steps (default 10)
  • optionally restricted to selected ranks via ranks

Example config:

trainer_args:
  enable_cuda_event_profiler: true
  cuda_event_profiler_config:
    start_step: 100
    end_step: 1000
    record_every_n_steps: 10
    flush_every_n_steps: 50
    ranks: [0, 1, 7]

This keeps JSONL volume bounded. For example, 8 ranks over 900 observed steps, sampled every 10 steps with 3 events per sampled step, produce only 8 * 90 * 3 = 2160 JSONL rows. The profiler is not intended to be enabled on every rank for every step in very large jobs.
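The gating rules and the row estimate can be checked with a short sketch. The `ProfilerConfig` class and `should_record` function are illustrative, not the PR's code. Two assumptions are made that the description does not state: `end_step` is treated as exclusive, and sampling is taken as `step % record_every_n_steps == 0`. The defaults for `start_step`, `end_step`, and `ranks` are also assumed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProfilerConfig:
    # Field names mirror the config keys; the class itself is illustrative.
    start_step: int = 0                     # assumed default
    end_step: Optional[int] = None          # assumed: None means no upper bound
    record_every_n_steps: int = 10          # default stated in the PR
    ranks: Optional[Sequence[int]] = None   # assumed: None means all ranks


def should_record(step: int, rank: int, cfg: ProfilerConfig) -> bool:
    """Return True if this rank should record timing at this step."""
    if cfg.ranks is not None and rank not in cfg.ranks:
        return False
    if step < cfg.start_step:
        return False
    # Assumption: end_step is exclusive.
    if cfg.end_step is not None and step >= cfg.end_step:
        return False
    # Assumption: sampling is aligned to multiples of record_every_n_steps.
    return step % cfg.record_every_n_steps == 0


cfg = ProfilerConfig(start_step=100, end_step=1000, record_every_n_steps=10)
events_per_step = 3  # host_to_device, training_step, training_metrics
rows = sum(
    events_per_step
    for rank in range(8)
    for step in range(2000)
    if should_record(step, rank, cfg)
)
print(rows)  # 2160 under the assumptions above
```

If `end_step` were inclusive instead, each rank would record 91 sampled steps rather than 90, so the estimate is a bound on the order of magnitude rather than an exact count.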

Validation

  • Unit tests:

    • disabled profiler does not write files
    • enabled profiler writes valid JSONL
    • start_step, end_step, and explicit record_every_n_steps work
    • default sampling records every 10 steps
    • rank filters skip unselected ranks
  • Real CUDA distributed validation:

    • 2-GPU smoke test confirmed that each rank writes the expected JSONL records
    • 8-GPU overhead comparison:
      • profiler disabled: 3.0619 ms/step
      • record every step, flush every 10 steps: 3.1376 ms/step (+2.47%)
      • record every 10 steps, flush every 10 steps: 3.0584 ms/step (-0.11%, within noise)
    • 8-GPU synthetic straggler test:
      • synthetic work injected on rank 0 appears as a synthetic_straggler event
      • the other ranks show longer all_reduce times, confirming that the timing records can expose collective wait behavior
  • Import smoke:

    • FSDP2SFTTrainer imports successfully after gating the optional qwen3_5_moe parallelization import
    • calling qwen3_5_moe parallelization without the optional Transformers module raises a clear ImportError
  • Static checks:

    • compileall
    • git diff --check

Notes

The validation above covers single-node 8-GPU behavior only. Multi-node production-scale validation is still future work. The intended usage is targeted diagnosis when we suspect stragglers, MFU drops, host-to-device stalls, or collective wait behavior, not always-on full-fidelity logging.
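For that kind of targeted diagnosis, a first pass over the per-rank files might look like the sketch below. The record schema is an assumption: this description does not specify the JSONL fields, so the keys `rank`, `name`, and `elapsed_ms` are hypothetical, and the sample rows are made up purely to exercise the code. The function reports, per phase, which rank has the highest mean time and how it compares to the median rank.

```python
import json
from collections import defaultdict
from statistics import median


def load_rows(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def slowest_rank_per_phase(rows):
    """Map phase name -> (slowest rank, its mean ms, ratio to the median rank)."""
    per_phase = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Assumed keys: "name", "rank", "elapsed_ms".
        per_phase[row["name"]][row["rank"]].append(row["elapsed_ms"])
    report = {}
    for name, by_rank in per_phase.items():
        means = {rank: sum(v) / len(v) for rank, v in by_rank.items()}
        slow_rank = max(means, key=means.get)
        typical = median(means.values())
        ratio = means[slow_rank] / typical if typical else float("inf")
        report[name] = (slow_rank, means[slow_rank], ratio)
    return report


# Made-up sample rows, not measurements from this PR.
sample = [
    '{"rank": 0, "step": 100, "name": "training_step", "elapsed_ms": 12.0}',
    '{"rank": 1, "step": 100, "name": "training_step", "elapsed_ms": 4.0}',
    '{"rank": 2, "step": 100, "name": "training_step", "elapsed_ms": 4.0}',
]
report = slowest_rank_per_phase(load_rows(sample))
print(report)
```

A ratio well above 1 for one rank in a compute phase, together with longer collective times on its peers, is the pattern the synthetic straggler test was designed to surface.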
