58 changes: 58 additions & 0 deletions docs/source/dataloaders.rst
@@ -685,3 +685,61 @@ Other, more exotic configurations:
* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended.

* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended.

CP/TP-safe batches with ``BroadcastingDataLoader``
---------------------------------------------------

Context-parallel (CP) and tensor-parallel (TP) training require all ranks
within the same ``(cp, tp)`` sub-mesh of a DP slot to process the **same**
global batch each step — CP shards the sequence dimension and TP shards
the feature dimension, so a divergent global batch breaks the per-rank
shape contract that CP/TP collectives assume.

Independent Lhotse loaders on each rank with ``shard_seed="randomized"``
guarantee that *seeded* shard cursors line up, but they don't protect
against background-thread non-determinism (``concurrent_bucketing``,
worker scheduling jitter, etc.). The empirical signature is per-rank
``cu_seqlens`` divergence at a fraction of training steps, which then
deadlocks NCCL collectives with mismatched shapes.

The :class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`
fixes this at the data layer: construct the real Lhotse loader on a
single DP-source rank (``cp_rank == 0`` and ``tp_rank == 0``) and let the
wrapper broadcast each batch to the other ranks in the ``(cp, tp)``
sub-mesh over NCCL. Iteration ends in lockstep via a continue/stop
broadcast — no length needs to be known up-front.

.. code-block:: python

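    from torch.distributed.device_mesh import init_device_mesh

    from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config
    from nemo.collections.common.data.lhotse.broadcasting import (
        BroadcastingDataLoader,
        is_dp_source_rank,
    )

    # dp, cp, tp (mesh sizes), dp_rank, dp_size, cfg, dataset, and tokenizer
    # are assumed to be defined by the surrounding training setup.
    mesh = init_device_mesh("cuda", (dp, cp, tp), mesh_dim_names=("dp", "cp", "tp"))

    # Only the DP-source rank builds the real Lhotse loader.
    if is_dp_source_rank(mesh):
        source = get_lhotse_dataloader_from_config(
            config=cfg.train_ds,
            global_rank=dp_rank,
            world_size=dp_size,
            dataset=dataset,
            tokenizer=tokenizer,
        )
    else:
        source = None

    # Every rank wraps its (possibly None) source; non-source ranks
    # receive their batches through the broadcast.
    train_loader = BroadcastingDataLoader(source=source, device_mesh=mesh)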
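
The continue/stop handshake described above can be illustrated without any
GPUs. The following is a minimal single-process sketch, not the NeMo
implementation: ``simulate_lockstep`` is a hypothetical helper, and appending
to Python lists stands in for the NCCL broadcasts. Only the control flow is
meant to carry over.

.. code-block:: python

    from typing import Any, Iterable, List


    def simulate_lockstep(source: Iterable[Any], num_peers: int) -> List[List[Any]]:
        """Simulate one (cp, tp) sub-mesh: rank 0 owns ``source``, peers only receive.

        Hypothetical helper for illustration; a list append stands in for a
        real broadcast over the sub-mesh process group.
        """
        # received[r] collects the batches seen by rank r (rank 0 is the source).
        received: List[List[Any]] = [[] for _ in range(num_peers + 1)]
        iterator = iter(source)
        while True:
            # Source rank: try to pull the next batch and derive the flag.
            try:
                batch = next(iterator)
                keep_going = True
            except StopIteration:
                batch = None
                keep_going = False
            # Step 1: "broadcast" the continue/stop flag, so every rank leaves
            # the loop on the same step without knowing the length up front.
            if not keep_going:
                break
            # Step 2: "broadcast" the batch itself to every rank in the sub-mesh.
            for rank_batches in received:
                rank_batches.append(batch)
        return received


    per_rank = simulate_lockstep(source=["batch0", "batch1", "batch2"], num_peers=3)
    # Every rank saw the same batches in the same order.
    assert all(batches == per_rank[0] for batches in per_rank)

Sending the flag before the payload is what lets an iterable of unknown
length terminate cleanly: no rank ever waits for a batch that the source
will not produce.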

The wrapper delegates ``state_dict`` / ``load_state_dict`` to the source
loader on the source rank (no-ops on non-source ranks), so checkpoint and
resume keep working transparently with regular ``DataLoader``,
``torchdata.StatefulDataLoader``, or any other source object that
implements those methods.
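
As a rough illustration of this delegation pattern, the sketch below uses two
hypothetical classes, ``StateDelegatingLoader`` and ``CountingSource``. Neither
is part of NeMo; they only show how a wrapper can forward checkpoint state on
the source rank and do nothing elsewhere.

.. code-block:: python

    from typing import Any, Dict, Optional


    class CountingSource:
        """Toy stateful source that remembers how many batches it has yielded."""

        def __init__(self) -> None:
            self.position = 0

        def state_dict(self) -> Dict[str, Any]:
            return {"position": self.position}

        def load_state_dict(self, state: Dict[str, Any]) -> None:
            self.position = state["position"]


    class StateDelegatingLoader:
        """Hypothetical wrapper that forwards checkpoint state to an optional source."""

        def __init__(self, source: Optional[Any]) -> None:
            # ``source`` is None on non-source ranks.
            self.source = source

        def state_dict(self) -> Dict[str, Any]:
            # Non-source ranks hold no loader state, so they contribute nothing.
            if self.source is None:
                return {}
            return self.source.state_dict()

        def load_state_dict(self, state: Dict[str, Any]) -> None:
            # Restoring is a no-op on ranks that never built a loader.
            if self.source is not None:
                self.source.load_state_dict(state)


    source_rank = StateDelegatingLoader(CountingSource())
    source_rank.load_state_dict({"position": 42})
    other_rank = StateDelegatingLoader(None)
    other_rank.load_state_dict({"position": 42})  # silently ignored

    assert source_rank.state_dict() == {"position": 42}
    assert other_rank.state_dict() == {}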

The wrapper is a no-op when ``device_mesh`` is ``None`` or every named
axis present in the mesh has size 1, so the same call site works for
single-GPU, DDP-only, and CP/TP runs without a separate code path.
26 changes: 26 additions & 0 deletions docs/source/speechlm2/configs.rst
@@ -229,6 +229,32 @@ Defaults come from Automodel's ``BackendConfig`` and auto-select TransformerEngi
DeepEP when available; override here to pin a specific backend (for example,
``attn: sdpa`` to bypass TE).

**Packed sequences (THD):**

.. code-block:: yaml

    model:
      packed_sequences: true  # default false (right-padded BSHD path)
      automodel_backend:
        attn: te  # THD path dispatches TE varlen FlashAttention

When ``packed_sequences`` is true, ``SALMAutomodel.prepare_inputs`` packs
each minibatch into a single flat ``[T_total, H]`` sequence with a
``cu_seqlens`` index instead of right-padding to ``[B, T_max, H]``.
``SALMAutomodel`` then forwards the THD metadata (``qkv_format``,
``cu_seqlens``, ``position_ids``, ``max_seqlen``) through ``forward()`` to
the LLM. The TE attention preprocessor splits the singular ``max_seqlen``
into the ``max_seqlen_q`` / ``max_seqlen_kv`` pair that
``DotProductAttention`` requires for ``qkv_format="thd"``. The packing also
rounds each utterance's flat length up to a multiple of ``2 * cp_size`` so
the same THD batch satisfies TE's CP DualChunkSwap contract — see the
"Context Parallelism (CP)" subsection in
:doc:`training_and_scaling` for the recommended pairing with ``cp_size > 1``.

Padding overhead drops from ``O(B * (T_max - T_avg))`` to
``O(per-utt rounding to 2*cp_size)``. Throughput improvement scales with
the variance of utterance lengths in your bucketing.
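
To make the overhead comparison concrete, here is a small worked sketch in
plain Python. The helper names (``round_up``, ``packed_cu_seqlens``,
``padding_tokens``) are hypothetical, and the sketch assumes the rounded
lengths are folded directly into ``cu_seqlens``; the actual
``prepare_inputs`` implementation may track padded and unpadded offsets
separately.

.. code-block:: python

    from typing import List, Tuple


    def round_up(length: int, multiple: int) -> int:
        """Round ``length`` up to the next multiple of ``multiple``."""
        return -(-length // multiple) * multiple


    def packed_cu_seqlens(lengths: List[int], cp_size: int) -> List[int]:
        """Cumulative offsets after rounding each utterance to 2 * cp_size."""
        multiple = 2 * cp_size
        cu_seqlens = [0]
        for length in lengths:
            cu_seqlens.append(cu_seqlens[-1] + round_up(length, multiple))
        return cu_seqlens


    def padding_tokens(lengths: List[int], cp_size: int) -> Tuple[int, int]:
        """Return (BSHD padding, THD padding) measured in token positions."""
        real = sum(lengths)
        bshd = len(lengths) * max(lengths) - real  # B * T_max - sum of lengths
        thd = packed_cu_seqlens(lengths, cp_size)[-1] - real
        return bshd, thd


    lengths = [5, 12, 7]
    assert packed_cu_seqlens(lengths, cp_size=2) == [0, 8, 20, 28]
    assert padding_tokens(lengths, cp_size=2) == (12, 4)

In this toy batch the right-padded layout spends 12 positions on padding,
while the packed layout spends 4, all of them from rounding to a multiple of
``2 * cp_size = 4``.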

DuplexS2SModel Configuration
-----------------------------

89 changes: 87 additions & 2 deletions docs/source/speechlm2/training_and_scaling.rst
@@ -183,8 +183,93 @@ For distributed inference, launch with ``torchrun``:
inputs=path/to/manifest \
ep_size=2

Configuration
^^^^^^^^^^^^^
Packed Sequences (THD)
""""""""""""""""""""""

``SALMAutomodel`` supports an opt-in packed-sequence (``THD``) training and
validation path that concatenates per-utterance text + audio embeddings into
a single flat ``[T_total, H]`` sequence with a ``cu_seqlens`` index, instead
of right-padding into the standard ``[B, T_max, H]`` (``BSHD``) layout. TE's
varlen FlashAttention then operates segment-by-segment without ever attending
across utterances, and Mamba's ``seq_idx`` is derived from the same
``cu_seqlens`` so SSM state resets at document boundaries.
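
The relationship between ``cu_seqlens`` and a per-token ``seq_idx`` can be
sketched in a few lines of plain Python. ``seq_idx_from_cu_seqlens`` is a
hypothetical helper written for illustration; the real code path operates on
tensors and may differ in detail.

.. code-block:: python

    from typing import List


    def seq_idx_from_cu_seqlens(cu_seqlens: List[int]) -> List[int]:
        """Label every packed token with the index of the segment it belongs to."""
        seq_idx: List[int] = []
        boundaries = zip(cu_seqlens[:-1], cu_seqlens[1:])
        for segment, (start, end) in enumerate(boundaries):
            # Tokens in [start, end) all belong to the same utterance.
            seq_idx.extend([segment] * (end - start))
        return seq_idx


    # Three utterances of lengths 3, 2, and 4 packed into 9 tokens.
    assert seq_idx_from_cu_seqlens([0, 3, 5, 9]) == [0, 0, 0, 1, 1, 2, 2, 2, 2]

A change in ``seq_idx`` between neighbouring tokens marks a segment boundary,
which is where state should not carry over from one utterance to the next.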

For variable-length speech batches the padding overhead is substantial: the
``BSHD`` layout pays ``B * (T_max - T_avg)`` wasted compute per minibatch,
whereas ``THD`` pays only the per-utterance rounding to a multiple of
``2*cp_size`` (needed for TE's CP DualChunkSwap pattern). The throughput
improvement scales with the variance of utterance lengths.

Enable it in the model config:

.. code-block:: yaml

    model:
      packed_sequences: true  # opt-in; default false (BSHD)
      automodel_backend:
        attn: te  # THD path requires TE attention

When ``packed_sequences`` is unset, the existing BSHD path is used unchanged.
Generation (inference) always uses BSHD because it does not go through
``prepare_inputs``.

Context Parallelism (CP)
""""""""""""""""""""""""

``SALMAutomodel`` supports context parallelism for long-audio training on
hybrid Mamba/attention LLMs (e.g. Nemotron-V3). CP shards the sequence
dimension across GPUs so per-rank activations and KV-cache memory scale as
``T / cp_size`` instead of ``T``; attention layers go through TE's
DualChunkSwap pattern and Mamba mixers go through hidden-parallel
all-to-all (``MambaContextParallel`` in NeMo Automodel).
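
The sketch below illustrates why sequence lengths must be a multiple of
``2 * cp_size``. It assumes the load-balanced causal layout commonly used
with TE context parallelism, in which the sequence is cut into
``2 * cp_size`` chunks and rank ``r`` holds chunk ``r`` together with its
mirror chunk ``2 * cp_size - 1 - r``. ``shard_for_cp_rank`` is a hypothetical
helper, not an API of TE or NeMo.

.. code-block:: python

    from typing import List


    def shard_for_cp_rank(tokens: List[int], cp_size: int, cp_rank: int) -> List[int]:
        """Return the two chunks of ``tokens`` that ``cp_rank`` would hold."""
        num_chunks = 2 * cp_size
        if len(tokens) % num_chunks != 0:
            raise ValueError("sequence length must be a multiple of 2 * cp_size")
        chunk = len(tokens) // num_chunks
        # Pair an early chunk with a late one so causal-attention work is balanced.
        first, second = cp_rank, num_chunks - 1 - cp_rank
        return (
            tokens[first * chunk : (first + 1) * chunk]
            + tokens[second * chunk : (second + 1) * chunk]
        )


    tokens = list(range(8))  # one utterance of 8 positions, cp_size = 2
    assert shard_for_cp_rank(tokens, cp_size=2, cp_rank=0) == [0, 1, 6, 7]
    assert shard_for_cp_rank(tokens, cp_size=2, cp_rank=1) == [2, 3, 4, 5]

Each rank ends up with ``T / cp_size`` positions, matching the memory scaling
described above.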

Enable via the strategy:

.. code-block:: yaml

    trainer:
      strategy:
        _target_: nemo.collections.speechlm2.parts.parallel.AutomodelParallelStrategy
        cp_size: 2  # context parallel size; must divide num_heads of every Mamba block
        ep_size: 2  # may share the same ranks as CP

**The THD packed-sequence path is the only supported configuration under
CP.** Each utterance is its own attention segment and the per-utterance
sequence rounding aligns naturally with CP's ``2*cp_size`` requirement.

.. warning::
    **BSHD + CP is not supported.** TE's fused-attention CP path supports
    ``causal`` but not ``padding_causal``, so the right-pad mask must be
    dropped before the LLM. With the mask dropped, pad K/V leak into
    real-token attention through the causal mask and the gradient through
    the LoRA / projection parameters becomes ``NaN`` after the first
    optimizer step (validated empirically: BSHD + CP=2 + EP=2 on a 2-GPU
    run produces ``loss=4.62`` at step 1 then ``loss=nan`` from step 2
    onwards). This is independent of the TE/cuDNN backward issue
    documented below: setting ``NVTE_FUSED_ATTN=0`` does not fix it.
    Set ``model.packed_sequences: true`` to use the THD path instead.

.. note::
    **CP-safe data loading is automatic.** The speechlm2 datamodule wraps
    the Lhotse loader in
    :class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`,
    so under CP/TP every batch is constructed once on the DP source rank
    (``cp_rank == 0`` and ``tp_rank == 0``) and broadcast to its sub-mesh
    peers. This eliminates per-rank Lhotse non-determinism (``concurrent_bucketing``,
    worker scheduling jitter, etc.) as a source of NCCL deadlocks under CP.
    See :doc:`/dataloaders` for the standalone API.

.. note::
    **TE/THD exploding-gradients workaround on some GPUs.** On certain GPU
    architectures (notably Blackwell ``sm_120``), the cuDNN backend that
    TransformerEngine 2.14 picks for ``qkv_format="thd"`` with
    ``attn_mask_type="padding_causal"`` returns correct forward activations
    but gradients amplified 8×–960× per layer. Compounded across the LLM's
    attention stack, this drives gradients to magnitudes around ``1e22`` at
    step 0, gradient clipping by norm then computes a scale of
    ``1.0 / inf = 0``, and Adam's moments eventually become NaN. Force TE to
    dispatch FlashAttention instead of cuDNN by setting ``NVTE_FUSED_ATTN=0``
    in the launcher environment (requires ``flash-attn`` to be installed for
    your GPU arch). The FlashAttention THD/``padding_causal`` backward is
    gradient-correct on the same shapes.
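
The ``1.0 / inf = 0`` step in the note above can be reproduced with ordinary
floating-point arithmetic. This is a minimal sketch using the usual
clip-by-norm formula, ``max_norm / (total_norm + eps)`` capped at 1;
``clip_coefficient`` is a hypothetical helper and not the trainer's actual
clipping code.

.. code-block:: python

    import math


    def clip_coefficient(total_norm: float, max_norm: float = 1.0, eps: float = 1e-6) -> float:
        """Scale factor applied to every gradient under clip-by-norm."""
        return min(1.0, max_norm / (total_norm + eps))


    # A healthy norm below the threshold leaves gradients untouched.
    assert clip_coefficient(0.5) == 1.0
    # A large but finite norm scales gradients down proportionally.
    assert math.isclose(clip_coefficient(4.0), 0.25, rel_tol=1e-5)
    # An overflowed (infinite) norm collapses the scale to exactly zero.
    assert clip_coefficient(math.inf) == 0.0
    # Any gradient entry that is itself inf becomes NaN when multiplied by 0.0.
    assert math.isnan(math.inf * clip_coefficient(math.inf))

A scale of exactly ``0.0`` in the training logs is therefore a useful hint
that the gradient norm overflowed rather than that the gradients were small.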

To configure parallelism, modify the ``trainer.strategy`` section in your YAML config:
