Skip to content

[tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720)#1728

Draft
hershg wants to merge 1 commit into
NovaSky-AI:mainfrom
hershg:upstream/barrier-ordering-on-1720
Draft

[tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720)#1728
hershg wants to merge 1 commit into
NovaSky-AI:mainfrom
hershg:upstream/barrier-ordering-on-1720

Conversation

@hershg
Copy link
Copy Markdown

@hershg hershg commented May 29, 2026

What

Follow-on to @SumanthRH's now-merged #1720. Fixes a related multi-engine bug we hit while integrating #1720 into our self-hosted TCLI stack: a gloo barrier timeout in `_save_lora_adapters_and_sync` when rank-0 fans HTTP `load_lora_adapter` out to multiple inference engines.

(Original PR was stacked on #1720 before it merged. Now rebased onto main as a single-commit follow-on.)

The bug

`megatron_worker.py:1018` (pre-patch):

```python
async def _save_lora_adapters_and_sync(...):
# ... all ranks export adapter weights collectively ...
if torch.distributed.get_rank() == 0:
# rank-0 only: save to disk + HTTP fan-out
save_file(...)
await inference_engine_client.load_lora_adapter(lora_name, lora_sync_path)
# ↑ slow with multi-engine: fans out to N engine URLs in parallel
torch.distributed.barrier() # ranks 1-3 wait here, time out under load
```

On multi-engine configs (we tested `num_engines=4, tensor_parallel_size=1` on Qwen 3.6 35B), rank-0's HTTP fan-out to 4 engines can exceed gloo's per-collective "Application timeout" (~10s default), causing ranks 1-3 to drop with `RuntimeError: Application timeout caused pair closure` before rank-0 even gets to the barrier.

The fix (this PR)

Split the barrier: rank-0's disk write happens, then ALL ranks barrier, THEN rank-0 does its HTTP fan-out:

```python
async def _save_lora_adapters_and_sync(...):
# ... all ranks export adapter weights ...
if torch.distributed.get_rank() == 0:
# rank-0: disk save only
save_file(...)
# ALL ranks barrier here (fast — just disk-write sync)
torch.distributed.barrier()
# rank-0 HTTP fan-out AFTER barrier; ranks 1-3 proceed independently
if torch.distributed.get_rank() == 0:
await inference_engine_client.load_lora_adapter(...)
```

The caller's outer barrier in `broadcast_to_inference_engines` (line ~1029) still gates the next training step on the fan-out completing, so semantic correctness is preserved.

Verification

Reproduced on our self-hosted TCLI fork (Qwen 3.6 35B + tau-retail multi-turn RL) with #1720 cherry-picked in (pre-merge):

  • Before patch: `Application timeout caused pair closure` on ranks 1-3 within seconds of first training step.
  • After patch: 16 multi-LoRA trainers cohabit on one pod (max_loras=4 × 4 engines), multi-engine routing balanced 1.1-1.3× across 4 engines, zero barrier hangs over 6+ hours of sustained training.

Credit

#1720's seq_id routing fix was the foundational unlock — multi-engine inference doesn't actually work without it. This PR is a small follow-on. Thanks @SumanthRH @j316chuck @CharlieRuan for the design pointers via internal threads.

# Sync after rank-0 disk write so all ranks see consistent state on
# Weka. Critically, this barrier is BEFORE the rank-0 HTTP fan-out to
# inference engines — fan-out to 4 engines on multi-engine configs can
# take >gloo-timeout (~10s) which would hang ranks 1-3 at the barrier.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too low for GLOO timeout?

For Megatron, we init process group here:

https://github.com/hershg/SkyRL/blob/976bfe3a45f6f1e3db3b5e84b5a3d5d485d0eb67/skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py#L512-L521

The default timeout value is SKYRL_WORKER_NCCL_TIMEOUT_IN_S which is 600s.

Are you sure this fix is needed? Have you overridden SKYRL_WORKER_NCCL_TIMEOUT_IN_S in some way?

In _save_lora_adapters_and_sync, the torch.distributed.barrier() was AFTER
rank 0's HTTP load_lora_adapter fan-out, which on multi-engine configs
(num_engines=4) sends to all 4 backend URLs in parallel — each loading a
35B-LoRA from Weka takes seconds. Cumulative HTTP time can exceed gloo's
short "Application timeout" (~10s by default), which closes the gloo pair
on ranks 1-3 with "Application timeout caused pair closure".

Fix: barrier AFTER rank-0 disk write (fast) but BEFORE the HTTP fan-out.
Ranks 1-3 unblock immediately and can do other work (or hit the outer
barrier in broadcast_to_inference_engines, which times out under heavy
load but at least doesn't double-block the inner step).

Reproed on Qwen3.6-35B-A3B + tau-retail with our pinned base
36f38c7 + Sumanth's NovaSky-AI#1720 cherry-pick. Single trainer + 4 vLLM engines:
multi-engine routing works (per-engine /v1/completions balanced) but
broadcast at first training step hangs in this barrier.

Will follow up with: (a) instrument fan-out duration so we know if the
outer barrier is also at risk, (b) make HTTP fan-out concurrent across
engines (it should already be via _load_on_server asyncio.gather, verify).
@hershg hershg force-pushed the upstream/barrier-ordering-on-1720 branch from 976bfe3 to 8d2bc7f Compare May 29, 2026 19:21
@hershg hershg changed the title [tinker] Fix multi-engine LoRA broadcast barrier hang (builds on #1720) [tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720) May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants