feat(profiler): add lightweight CUDA event timing #179
Open
choiszt wants to merge 4 commits into
Summary
This PR adds an opt-in lightweight CUDA event profiler for long-running distributed FSDP2 training jobs. When enabled, selected ranks write sampled JSONL timing records under:

`output_dir/cuda_event_profiler/cuda_events_rank_<rank>.jsonl`

It currently instruments these coarse training phases:

- `host_to_device`
- `training_step`
- `training_metrics`

The PR also gates the `qwen3_5_moe` parallelization import so missing optional Transformers modules do not break normal FSDP2 trainer imports.

Motivation
At small scale, a single aggregate `step_time` or MFU curve is often enough to notice that training became slower. At larger distributed scale, those aggregate metrics are usually not enough to explain why it became slower. A slowdown can come from very different sources:

`torch.profiler` is still the right tool for deep, operator-level analysis, but it is too heavy to leave on broadly for long distributed jobs. It also tends to be used after we already know which rank or time window is suspicious. This PR adds a cheaper first-pass diagnostic layer: CUDA event timing around a few coarse phases. The goal is not to replace full profiling, but to answer the first operational questions quickly:

This follows the same general observability lesson highlighted by MegaScale: CUDA event based timers can provide useful cross-rank timing signals with much lower overhead than full traces, making them suitable for sampled diagnosis in distributed training.
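To make the idea of sampled CUDA event timing around a coarse phase concrete, here is a minimal sketch. It is not this PR's implementation: `should_record` and `time_phase` are hypothetical names, the "every n-th step" rule is an assumed sampling policy, and the record fields are invented for illustration. The sketch uses `torch.cuda.Event` when CUDA is available and falls back to a wall-clock timer otherwise so that it runs anywhere.

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # torch is optional for this sketch
    torch = None
    _HAS_CUDA = False


def should_record(step, record_every_n_steps=10):
    """Assumed sampling rule: record on every n-th step (hypothetical helper)."""
    return step % record_every_n_steps == 0


def time_phase(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_ms) (hypothetical helper)."""
    if _HAS_CUDA:
        # CUDA events are recorded on the current stream, so the measured
        # interval covers the GPU work enqueued between the two records.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn(*args, **kwargs)
        end.record()
        # Wait for the end event before reading the elapsed time.
        end.synchronize()
        return result, start.elapsed_time(end)
    # CPU fallback so the sketch also runs without a GPU.
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0


# Usage: time a stand-in "training_step" on sampled steps only.
records = []
for step in range(30):
    if should_record(step):
        _, ms = time_phase(sum, range(1000))
        # Field names here are illustrative, not the PR's JSONL schema.
        records.append({"step": step, "event": "training_step", "elapsed_ms": ms})
    else:
        sum(range(1000))

# Under the assumed rule, 900 steps sampled every 10 steps give 90 sampled steps.
sampled_steps = sum(should_record(s) for s in range(900))
```

The sketch synchronizes on the end event immediately for simplicity. A low-overhead profiler would more plausibly keep the event pairs and read their elapsed times later, for example at flush time, so that the training loop is not stalled on every sampled step.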
Design
The profiler is deliberately conservative:

- `enable_cuda_event_profiler: false`
- `start_step` / `end_step`
- `record_every_n_steps` with a default of 10
- `flush_every_n_steps` with a default of 10
- `ranks`

Example config:

This keeps JSON volume bounded. For example, 8 ranks over 900 observed steps, sampled every 10 steps with 3 events per sampled step, produces only `8 * 90 * 3 = 2160` JSONL rows. The profiler is not intended to be enabled on every rank for every step in very large jobs.

Validation
Unit tests:

- `start_step`, `end_step`, and explicit `record_every_n_steps` work

Real CUDA distributed validation:

- 3.0619 ms/step
- 3.1376 ms/step (+2.47%)
- 3.0584 ms/step, within noise
- `synthetic_straggler` `all_reduce`, confirming the traces can expose collective wait behavior

Import smoke:

- `FSDP2SFTTrainer` imports successfully after gating optional `qwen3_5_moe` parallelization import
- `qwen3_5_moe` parallelization without the optional Transformers module raises a clear `ImportError`

Static checks:

- `compileall`
- `git diff --check`

Notes
The validation above covers single-node, 8-GPU behavior only. Multi-node production scaling validation is still future work. The intended usage is targeted diagnosis when we suspect stragglers, MFU drops, host-to-device stalls, or collective wait behavior, not always-on full-fidelity logging.
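For the straggler case, a first pass over the per-rank JSONL files could look like the sketch below. The record schema is an assumption: the fields `rank`, `step`, `event`, and `elapsed_ms` are hypothetical and may not match what the profiler writes. The helper names, the 1.5x-of-median threshold, and the sample timings are likewise made up for illustration.

```python
import json
from collections import defaultdict
from statistics import mean, median


def mean_ms_by_rank(jsonl_lines, event):
    """Average elapsed_ms per rank for one event name (assumed schema)."""
    per_rank = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        if rec["event"] == event:
            per_rank[rec["rank"]].append(rec["elapsed_ms"])
    return {rank: mean(vals) for rank, vals in per_rank.items()}


def find_stragglers(means, ratio=1.5):
    """Return ranks whose mean time exceeds ratio * median across ranks."""
    if not means:
        return []
    baseline = median(means.values())
    return sorted(r for r, m in means.items() if m > ratio * baseline)


# Synthetic records: rank 3 is artificially slow. These are not real measurements.
sample = [
    json.dumps({
        "rank": r,
        "step": s,
        "event": "training_step",
        "elapsed_ms": 9.0 if r == 3 else 3.0,
    })
    for r in range(4)
    for s in (0, 10, 20)
]

means = mean_ms_by_rank(sample, "training_step")
stragglers = find_stragglers(means)
```

Comparing each rank against the cross-rank median rather than the mean keeps a single slow rank from shifting the baseline it is judged against.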