[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model #1726
dinhxuanvu wants to merge 3 commits
When exporting a trained Megatron model to Hugging Face format, MegatronStrategy.save_hf_model only invokes bridge.save_hf_weights and the inherited save_hf_configs (which calls save_pretrained on the config, tokenizer, and generation config). Models that rely on trust_remote_code (e.g. Nemotron-H, custom MoE variants) keep their architecture in modeling_*.py modules and reference auxiliary files via auto_map. None of these are emitted by save_pretrained, so the resulting checkpoint directory is structurally incomplete and cannot be loaded with AutoModel.from_pretrained(..., trust_remote_code=True) without manually copying files from the original source path.

This change adds a rank-0 call to bridge.hf_pretrained.save_artifacts(work_dir) before save_hf_configs. PreTrainedBase.save_artifacts copies all custom modeling files (*.py), required ARTIFACTS, and OPTIONAL_ARTIFACTS from the original source path. save_hf_configs runs afterwards so the existing config.json/tokenizer write semantics are preserved (it overwrites the config that save_artifacts wrote with the strategy's current view).

Verified bridge API against NVIDIA-NeMo/Megatron-Bridge: src/megatron/bridge/models/hf_pretrained/base.py (save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute).

Verification:
- pre-commit hooks (ruff, black, gitleaks): pass
- ast.parse + targeted AST walk confirms save_artifacts is invoked from within save_hf_model on rank 0
- Existing save/load coverage in tests/backends/skyrl_train/gpu/test_save_load_model.py is GPU-gated and exercises this path end-to-end on CI; no behavior change expected for non-trust_remote_code models because save_artifacts is a superset of save_hf_configs for those.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
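The copy-then-overwrite behaviour described above can be illustrated with a small file-level sketch. It does not use Megatron-Bridge: copy_source_artifacts and write_current_config are hypothetical stand-ins for save_artifacts and save_hf_configs, and EXAMPLE_ARTIFACTS is an invented list, not the real ARTIFACTS/OPTIONAL_ARTIFACTS.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical artifact list for illustration only; the real ARTIFACTS and
# OPTIONAL_ARTIFACTS are defined in Megatron-Bridge and not reproduced here.
EXAMPLE_ARTIFACTS = ["config.json", "special_tokens_map.json"]


def copy_source_artifacts(source_dir: Path, work_dir: Path) -> None:
    """Stand-in for save_artifacts: copy *.py modules and listed files."""
    work_dir.mkdir(parents=True, exist_ok=True)
    for py_file in source_dir.glob("*.py"):
        shutil.copy2(py_file, work_dir / py_file.name)
    for name in EXAMPLE_ARTIFACTS:
        candidate = source_dir / name
        if candidate.exists():  # files absent from the source are skipped
            shutil.copy2(candidate, work_dir / name)


def write_current_config(config: dict, work_dir: Path) -> None:
    """Stand-in for save_hf_configs: write the current in-memory config."""
    (work_dir / "config.json").write_text(json.dumps(config))


def export(source_dir: Path, work_dir: Path, config: dict) -> None:
    copy_source_artifacts(source_dir, work_dir)  # artifacts first
    write_current_config(config, work_dir)       # config written last wins


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        out = Path(tmp) / "export"
        source.mkdir()
        (source / "modeling_example.py").write_text("# custom architecture\n")
        (source / "config.json").write_text(json.dumps({"vocab_size": 100}))
        export(source, out, {"vocab_size": 128})
        return {
            "files": sorted(p.name for p in out.iterdir()),
            "config": json.loads((out / "config.json").read_text()),
        }
```

In this sketch the export directory ends up with both the custom module and a config.json that reflects the in-memory config rather than the copied source file, which is the ordering argument the description makes.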
Code Review
This pull request updates the Megatron strategy to preserve custom modeling artifacts for trust_remote_code models by calling bridge.hf_pretrained.save_artifacts before saving configurations. The review feedback recommends adding a defensive check to verify that the hf_pretrained attribute exists on the bridge object to prevent potential AttributeErrors during model saving.
    # current view, but save_artifacts is required to copy the
    # custom Python modules and other artifacts that
    # save_pretrained() alone does not emit.
    bridge.hf_pretrained.save_artifacts(work_dir)
To prevent potential AttributeError or NoneType errors during model saving, it is safer to use defensive programming and check if bridge has a non-None hf_pretrained attribute before calling save_artifacts. If bridge is a custom or mock bridge that does not implement hf_pretrained, this call would otherwise crash the training process at the very end.
    - bridge.hf_pretrained.save_artifacts(work_dir)
    + if getattr(bridge, "hf_pretrained", None) is not None:
    +     bridge.hf_pretrained.save_artifacts(work_dir)
The bridge.save_hf_weights call three lines above is unguarded on the same object — if bridge lacks hf_pretrained, it almost certainly lacks save_hf_weights too and we've already crashed.
More importantly, silently skipping save_artifacts is exactly the bug this PR fixes: HF dirs missing modeling_*.py reload as broken trust_remote_code checkpoints. A loud AttributeError surfaces a bridge gap; a silent skip re-introduces data loss. Prefer to fail loudly.
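The difference between the two behaviours discussed here can be sketched with a hypothetical bridge object that lacks hf_pretrained. The class and helper names below are illustrative only, and the boolean return values are added so the outcome is observable; the real call returns nothing of the sort.

```python
class BridgeWithoutPretrained:
    """Hypothetical bridge that does not define hf_pretrained."""


def save_guarded(bridge, work_dir: str) -> bool:
    """Reviewer's suggestion: skip when hf_pretrained is absent or None."""
    if getattr(bridge, "hf_pretrained", None) is not None:
        bridge.hf_pretrained.save_artifacts(work_dir)
        return True
    return False  # nothing copied, nothing reported


def save_unguarded(bridge, work_dir: str) -> bool:
    """PR behaviour: attribute access fails if the bridge is incomplete."""
    bridge.hf_pretrained.save_artifacts(work_dir)
    return True


def outcome(save_fn, bridge) -> str:
    try:
        copied = save_fn(bridge, "export_dir")
    except AttributeError:
        return "raised AttributeError"
    return "copied" if copied else "skipped silently"
```

With the guarded variant the export would finish without the modeling files and without any signal; with the unguarded variant the gap in the bridge shows up at save time.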
…ts ordering

Locks in the rank-0-only invocation of bridge.hf_pretrained.save_artifacts and the relative ordering of save_hf_weights -> save_artifacts -> save_hf_configs. Gated on _has_megatron consistent with the rest of test_megatron_correctness.py.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
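A test of this kind could look roughly like the following. This is not the test added in the commit: save_hf_model_sketch is a hypothetical stand-in with simplified signatures, and unittest.mock records the call order in place of a real bridge and strategy.

```python
from unittest.mock import MagicMock, call


def save_hf_model_sketch(strategy, bridge, work_dir: str, rank: int) -> None:
    """Hypothetical stand-in for the call sequence inside save_hf_model."""
    bridge.save_hf_weights(work_dir)
    if rank == 0:
        # Only rank 0 copies the source artifacts.
        bridge.hf_pretrained.save_artifacts(work_dir)
    strategy.save_hf_configs(work_dir)


def recorded_calls(rank: int) -> list:
    # Children of one parent mock share a single ordered call log.
    parent = MagicMock()
    save_hf_model_sketch(parent.strategy, parent.bridge, "out", rank)
    return parent.mock_calls
```

Attaching both collaborators to one parent mock is what makes the relative order checkable, since separate mocks would each keep their own call list.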
Pure formatting (line-length only); no logic changes. Fixes check_code_quality / black on PR NovaSky-AI#1726.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Summary
MegatronStrategy.save_hf_model previously only invoked bridge.save_hf_weights and the inherited save_hf_configs. For models that rely on trust_remote_code (Nemotron-H, custom MoE variants, etc.), the resulting export directory was missing the custom modeling_*.py modules and other auto_map-referenced artifacts, so the checkpoint could not be reloaded with AutoModel.from_pretrained(..., trust_remote_code=True) without manually copying files from the original source path.

Root cause
save_pretrained() only emits config.json, tokenizer files, and the generation config. It does not copy the custom Python modeling modules or the ARTIFACTS / OPTIONAL_ARTIFACTS that trust_remote_code models depend on. Megatron-Bridge's PreTrainedBase.save_artifacts(work_dir) is the API that copies these from the original source path, but save_hf_model in skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py was not calling it before delegating to save_hf_configs.

Fix
- Call bridge.hf_pretrained.save_artifacts(work_dir) immediately before self.save_hf_configs(...) inside MegatronStrategy.save_hf_model.
- save_hf_configs runs afterwards, so the existing config.json / tokenizer write semantics are preserved (it overwrites the config that save_artifacts wrote with the strategy's current view).
- For non-trust_remote_code models, save_artifacts is effectively a superset of save_hf_configs, so behavior is unchanged.

Verification
- tests/backends/skyrl_train/gpu/test_save_load_model.py exercises save/load end-to-end on CI runners.
- pre-commit run --files skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py (ruff, black, gitleaks) all pass.
- The export of a trust_remote_code model contains the custom modeling_*.py files, special_tokens_map.json, and other artifacts listed in ARTIFACTS / OPTIONAL_ARTIFACTS, and reloads cleanly with AutoModel.from_pretrained(..., trust_remote_code=True) without external file copying.

Risk
Low. save_artifacts is the standard Megatron-Bridge API for emitting source artifacts and is a no-op for source paths that do not declare any custom modules. save_hf_configs runs after it, so any overlap on config.json / tokenizer files is resolved in favor of the existing behavior.

Notes
API verified against NVIDIA-NeMo/Megatron-Bridge: src/megatron/bridge/models/hf_pretrained/base.py (save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute on AutoBridge).
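One way to spot an incomplete export before attempting a reload is to compare the auto_map entries in config.json with the files actually present. The helper below is a rough sketch, not part of this PR: missing_auto_map_modules is a hypothetical name, and it assumes auto_map values have the form module_name.ClassName, with entries containing "--" treated as references to another repository and skipped.

```python
import json
import tempfile
from pathlib import Path


def missing_auto_map_modules(export_dir: Path) -> list:
    """List auto_map-referenced module files absent from export_dir."""
    config = json.loads((export_dir / "config.json").read_text())
    missing = set()
    for target in config.get("auto_map", {}).values():
        # An entry may be one string or a list of strings (possibly with None).
        refs = target if isinstance(target, (list, tuple)) else [target]
        for ref in refs:
            if not ref or "--" in ref:
                # Assumption: "repo--module.Class" points at another repo.
                continue
            module_name = ref.rsplit(".", 1)[0]  # drop the class name
            filename = module_name + ".py"
            if not (export_dir / filename).exists():
                missing.add(filename)
    return sorted(missing)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        export_dir = Path(tmp)
        config = {
            "auto_map": {
                "AutoConfig": "configuration_example.ExampleConfig",
                "AutoModelForCausalLM": "modeling_example.ExampleForCausalLM",
            }
        }
        (export_dir / "config.json").write_text(json.dumps(config))
        before = missing_auto_map_modules(export_dir)
        (export_dir / "configuration_example.py").write_text("")
        (export_dir / "modeling_example.py").write_text("")
        after = missing_auto_map_modules(export_dir)
        return before, after
```

A check like this only covers files named in auto_map; other artifacts such as tokenizer files would need their own comparison.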