[tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720)#1728
Draft
hershg wants to merge 1 commit into
Draft
[tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720)#1728hershg wants to merge 1 commit into
hershg wants to merge 1 commit into
Conversation
SumanthRH
reviewed
May 29, 2026
| # Sync after rank-0 disk write so all ranks see consistent state on | ||
| # Weka. Critically, this barrier is BEFORE the rank-0 HTTP fan-out to | ||
| # inference engines — fan-out to 4 engines on multi-engine configs can | ||
| # take >gloo-timeout (~10s) which would hang ranks 1-3 at the barrier. |
Member
There was a problem hiding this comment.
This is too low for GLOO timeout?
For Megatron, we init process group here:
The default timeout value is SKYRL_WORKER_NCCL_TIMEOUT_IN_S which is 600s.
Are you sure this fix is needed? Have you overridden SKYRL_WORKER_NCCL_TIMEOUT_IN_S in some way?
In _save_lora_adapters_and_sync, the torch.distributed.barrier() was AFTER rank 0's HTTP load_lora_adapter fan-out, which on multi-engine configs (num_engines=4) sends to all 4 backend URLs in parallel — each loading a 35B-LoRA from Weka takes seconds. Cumulative HTTP time can exceed gloo's short "Application timeout" (~10s by default), which closes the gloo pair on ranks 1-3 with "Application timeout caused pair closure". Fix: barrier AFTER rank-0 disk write (fast) but BEFORE the HTTP fan-out. Ranks 1-3 unblock immediately and can do other work (or hit the outer barrier in broadcast_to_inference_engines, which times out under heavy load but at least doesn't double-block the inner step). Reproed on Qwen3.6-35B-A3B + tau-retail with our pinned base 36f38c7 + Sumanth's NovaSky-AI#1720 cherry-pick. Single trainer + 4 vLLM engines: multi-engine routing works (per-engine /v1/completions balanced) but broadcast at first training step hangs in this barrier. Will follow up with: (a) instrument fan-out duration so we know if the outer barrier is also at risk, (b) make HTTP fan-out concurrent across engines (it should already be via _load_on_server asyncio.gather, verify).
976bfe3 to
8d2bc7f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Follow-on to @SumanthRH's now-merged #1720. Fixes a related multi-engine bug we hit while integrating #1720 into our self-hosted TCLI stack: a gloo barrier timeout in `_save_lora_adapters_and_sync` when rank-0 fans HTTP `load_lora_adapter` out to multiple inference engines.
(Original PR was stacked on #1720 before it merged. Now rebased onto main as a single-commit follow-on.)
The bug
`megatron_worker.py:1018` (pre-patch):
```python
async def _save_lora_adapters_and_sync(...):
# ... all ranks export adapter weights collectively ...
if torch.distributed.get_rank() == 0:
# rank-0 only: save to disk + HTTP fan-out
save_file(...)
await inference_engine_client.load_lora_adapter(lora_name, lora_sync_path)
# ↑ slow with multi-engine: fans out to N engine URLs in parallel
torch.distributed.barrier() # ranks 1-3 wait here, time out under load
```
On multi-engine configs (we tested `num_engines=4, tensor_parallel_size=1` on Qwen 3.6 35B), rank-0's HTTP fan-out to 4 engines can exceed gloo's per-collective "Application timeout" (~10s default), causing ranks 1-3 to drop with `RuntimeError: Application timeout caused pair closure` before rank-0 even gets to the barrier.
The fix (this PR)
Split the barrier: rank-0's disk write happens, then ALL ranks barrier, THEN rank-0 does its HTTP fan-out:
```python
async def _save_lora_adapters_and_sync(...):
# ... all ranks export adapter weights ...
if torch.distributed.get_rank() == 0:
# rank-0: disk save only
save_file(...)
# ALL ranks barrier here (fast — just disk-write sync)
torch.distributed.barrier()
# rank-0 HTTP fan-out AFTER barrier; ranks 1-3 proceed independently
if torch.distributed.get_rank() == 0:
await inference_engine_client.load_lora_adapter(...)
```
The caller's outer barrier in `broadcast_to_inference_engines` (line ~1029) still gates the next training step on the fan-out completing, so semantic correctness is preserved.
Verification
Reproduced on our self-hosted TCLI fork (Qwen 3.6 35B + tau-retail multi-turn RL) with #1720 cherry-picked in (pre-merge):
Credit
#1720's seq_id routing fix was the foundational unlock — multi-engine inference doesn't actually work without it. This PR is a small follow-on. Thanks @SumanthRH @j316chuck @CharlieRuan for the design pointers via internal threads.