[tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720) by hershg · Pull Request #1728 · NovaSky-AI/SkyRL

hershg · 2026-05-29T16:20:04Z

What

Follow-on to @SumanthRH's now-merged #1720. Fixes a related multi-engine bug we hit while integrating #1720 into our self-hosted TCLI stack: a gloo barrier timeout in `_save_lora_adapters_and_sync` when rank-0 fans HTTP `load_lora_adapter` out to multiple inference engines.

(Original PR was stacked on #1720 before it merged. Now rebased onto main as a single-commit follow-on.)

The bug

`megatron_worker.py:1018` (pre-patch):

```python
async def _save_lora_adapters_and_sync(...):
    # ... all ranks export adapter weights collectively ...
    if torch.distributed.get_rank() == 0:
        # rank-0 only: save to disk + HTTP fan-out
        save_file(...)
        await inference_engine_client.load_lora_adapter(lora_name, lora_sync_path)
        # ↑ slow with multi-engine: fans out to N engine URLs in parallel
    torch.distributed.barrier()  # ranks 1-3 wait here, time out under load
```

On multi-engine configs (we tested `num_engines=4, tensor_parallel_size=1` on Qwen 3.6 35B), rank-0's HTTP fan-out to 4 engines can exceed gloo's per-collective "Application timeout" (~10s default), causing ranks 1-3 to drop with `RuntimeError: Application timeout caused pair closure` before rank-0 even gets to the barrier.

The fix (this PR)

Move the barrier ahead of the HTTP fan-out: rank-0 writes to disk, then ALL ranks barrier, THEN rank-0 does its HTTP fan-out:

```python
async def _save_lora_adapters_and_sync(...):
    # ... all ranks export adapter weights ...
    if torch.distributed.get_rank() == 0:
        # rank-0: disk save only
        save_file(...)
    # ALL ranks barrier here (fast: just disk-write sync)
    torch.distributed.barrier()
    # rank-0 HTTP fan-out AFTER barrier; ranks 1-3 proceed independently
    if torch.distributed.get_rank() == 0:
        await inference_engine_client.load_lora_adapter(...)
```

The caller's outer barrier in `broadcast_to_inference_engines` (line ~1029) still gates the next training step on the fan-out completing, so semantic correctness is preserved.
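The effect of the reordering can be illustrated without GPUs. The following is a minimal sketch using a stdlib analogy, not the SkyRL code: threads stand in for ranks, `threading.Barrier` with a wait timeout stands in for the gloo barrier, and `time.sleep` stands in for the HTTP fan-out. The function name `run`, the timings, and the mapping onto gloo's timeout behaviour are all assumptions made for illustration.

```python
import threading
import time


def run(barrier_first: bool, world_size: int = 4,
        fanout_s: float = 1.0, barrier_timeout_s: float = 0.5) -> list[str]:
    """Simulate ranks as threads; return 'ok' or 'timeout' for each rank."""
    barrier = threading.Barrier(world_size)
    results = ["?"] * world_size

    def worker(rank: int) -> None:
        try:
            if rank == 0 and not barrier_first:
                # Pre-patch order: slow fan-out happens BEFORE the barrier.
                time.sleep(fanout_s)
            barrier.wait(timeout=barrier_timeout_s)
            if rank == 0 and barrier_first:
                # Patched order: slow fan-out happens AFTER the barrier.
                time.sleep(fanout_s)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # A waiter timed out, which breaks the barrier for every rank.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("fan-out before barrier:", run(barrier_first=False))
    print("barrier before fan-out:", run(barrier_first=True))
```

In this toy model the waiting threads give up while the simulated rank 0 is still busy when the fan-out precedes the barrier, whereas every thread passes the barrier when it comes first. Whether the real gloo timeout behaves this way in the reported setup is exactly what the review below questions.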

Verification

Reproduced on our self-hosted TCLI fork (Qwen 3.6 35B + tau-retail multi-turn RL) with #1720 cherry-picked in (pre-merge):

Before patch: `Application timeout caused pair closure` on ranks 1-3 within seconds of first training step.
After patch: 16 multi-LoRA trainers cohabit on one pod (max_loras=4 × 4 engines), multi-engine routing balanced 1.1-1.3× across 4 engines, zero barrier hangs over 6+ hours of sustained training.

Credit

#1720's seq_id routing fix was the foundational unlock — multi-engine inference doesn't actually work without it. This PR is a small follow-on. Thanks @SumanthRH @j316chuck @CharlieRuan for the design pointers via internal threads.

SumanthRH · 2026-05-29T16:55:55Z

+        # Sync after rank-0 disk write so all ranks see consistent state on
+        # Weka. Critically, this barrier is BEFORE the rank-0 HTTP fan-out to
+        # inference engines — fan-out to 4 engines on multi-engine configs can
+        # take >gloo-timeout (~10s) which would hang ranks 1-3 at the barrier.


Isn't this too low for the GLOO timeout?

For Megatron, we init process group here:

https://github.com/hershg/SkyRL/blob/976bfe3a45f6f1e3db3b5e84b5a3d5d485d0eb67/skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py#L512-L521

The default timeout value is SKYRL_WORKER_NCCL_TIMEOUT_IN_S which is 600s.

Are you sure this fix is needed? Have you overridden SKYRL_WORKER_NCCL_TIMEOUT_IN_S in some way?
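One way to answer the override question is to log the value each worker actually sees. This is a hypothetical diagnostic sketch, not SkyRL code: it assumes the setting is read from an environment variable of the same name (the thread does not confirm how SkyRL reads it) and uses the 600s default quoted above. The helper name `effective_timeout` is invented for illustration.

```python
import os
from datetime import timedelta

ENV_VAR = "SKYRL_WORKER_NCCL_TIMEOUT_IN_S"
DEFAULT_TIMEOUT_S = 600  # default quoted in the review comment; assumed here


def effective_timeout(env=None) -> tuple[timedelta, bool]:
    """Return (timeout, overridden) for the given environment mapping."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None:
        return timedelta(seconds=DEFAULT_TIMEOUT_S), False
    return timedelta(seconds=int(raw)), True


if __name__ == "__main__":
    timeout, overridden = effective_timeout()
    print(f"{ENV_VAR}: {timeout.total_seconds():.0f}s (overridden={overridden})")
```

A `timedelta` is used because that is the type `torch.distributed.init_process_group` accepts for its `timeout` argument.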

In _save_lora_adapters_and_sync, the torch.distributed.barrier() was AFTER rank 0's HTTP load_lora_adapter fan-out, which on multi-engine configs (num_engines=4) sends to all 4 backend URLs in parallel; each engine loading a 35B-LoRA from Weka takes seconds. Cumulative HTTP time can exceed gloo's short "Application timeout" (~10s by default), which closes the gloo pair on ranks 1-3 with "Application timeout caused pair closure".

Fix: barrier AFTER rank-0 disk write (fast) but BEFORE the HTTP fan-out. Ranks 1-3 unblock immediately and can do other work (or hit the outer barrier in broadcast_to_inference_engines, which times out under heavy load but at least doesn't double-block the inner step).

Reproed on Qwen3.6-35B-A3B + tau-retail with our pinned base 36f38c7 + Sumanth's NovaSky-AI#1720 cherry-pick. Single trainer + 4 vLLM engines: multi-engine routing works (per-engine /v1/completions balanced) but broadcast at first training step hangs in this barrier.

Will follow up with:
(a) instrument fan-out duration so we know if the outer barrier is also at risk,
(b) make HTTP fan-out concurrent across engines (it should already be via _load_on_server asyncio.gather, verify).
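The two follow-ups could be checked with something like the sketch below, which times a fan-out and compares `asyncio.gather` against a sequential loop. It is illustrative only: `load_on_server` is a hypothetical stand-in that sleeps instead of issuing an HTTP request, and the engine names and delays are made up rather than measured.

```python
import asyncio
import time


async def load_on_server(url: str, delay_s: float) -> str:
    # Hypothetical stand-in for one per-engine load_lora_adapter request.
    await asyncio.sleep(delay_s)
    return url


async def timed_fan_out(urls: list[str], delay_s: float = 0.2) -> tuple[list[str], float]:
    """Issue all requests concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(load_on_server(u, delay_s) for u in urls))
    return list(results), time.perf_counter() - start


async def timed_sequential(urls: list[str], delay_s: float = 0.2) -> tuple[list[str], float]:
    """Issue requests one at a time, for comparison."""
    start = time.perf_counter()
    results = [await load_on_server(u, delay_s) for u in urls]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    engines = [f"engine-{i}" for i in range(4)]
    _, concurrent_s = asyncio.run(timed_fan_out(engines))
    _, sequential_s = asyncio.run(timed_sequential(engines))
    print(f"concurrent: {concurrent_s:.2f}s, sequential: {sequential_s:.2f}s")
```

If the real fan-out is already concurrent, its wall time should track the slowest engine rather than the sum across engines, which is the quantity to compare against whatever timeout is actually in effect.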

hershg mentioned this pull request May 29, 2026

num_engines>=2 silently fails to spawn VLLMServerActor (Qwen3.6-35B-A3B + Tinker API + Megatron) #1721

Closed

SumanthRH reviewed May 29, 2026


hershg force-pushed the upstream/barrier-ordering-on-1720 branch from 976bfe3 to 8d2bc7f · May 29, 2026 19:21

hershg changed the title ~~[tinker] Fix multi-engine LoRA broadcast barrier hang (builds on #1720)~~ [tinker] Fix multi-engine LoRA broadcast barrier hang (follow-on to merged #1720) May 29, 2026

hershg wants to merge 1 commit into NovaSky-AI:main from hershg:upstream/barrier-ordering-on-1720
