
Save checkpoint with TP #3096

Merged
BenjaminBossan merged 24 commits into huggingface:main from michaelbenayoun:save_tp
Apr 10, 2026

Conversation

@michaelbenayoun (Member) commented Mar 12, 2026

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun force-pushed the save_tp branch 2 times, most recently from 045b9b1 to e8aa041 on March 24, 2026 00:45
@BenjaminBossan (Member) left a comment

Thanks for the PR. I have a few comments, but overall it already looks good.

Comment thread method_comparison/MetaMathQA/run.py Outdated
from typing import Any, Literal, Optional

import torch
from data import get_train_valid_test_datasets, get_wiki_small
@BenjaminBossan (Member)

Let's undo changes to unrelated files like here and in utils.py.

@michaelbenayoun (Author)

Done. It was the linter that made those changes.

Comment thread src/peft/tuners/lora/layer.py Outdated
from .config import LoraConfig


@dataclass
@BenjaminBossan (Member)

How about we move this to utils/integration.py and give it a more generic name, as it's not strictly related to LoRA? Then we can also use it in save_and_load.py as the return type for _get_tp_info.
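
For illustration, the moved helper could look roughly like this; the class name, fields, and `_get_tp_info` body below are assumptions for the sketch, not the PR's actual code:

```python
# Hypothetical sketch of a generic TP-info container in utils/integration.py.
# Only the idea (generic name, reusable as the return type of _get_tp_info)
# comes from the review comment; everything else is illustrative.
from dataclasses import dataclass
from typing import Optional

import torch
from torch.distributed.device_mesh import DeviceMesh


@dataclass
class TensorParallelInfo:
    """Describes how a parameter is sharded under tensor parallelism."""
    device_mesh: Optional[DeviceMesh] = None
    placements: Optional[tuple] = None  # e.g. the DTensor placements of the shard


def _get_tp_info(param: torch.Tensor) -> Optional[TensorParallelInfo]:
    # DTensor parameters expose .device_mesh and .placements; plain tensors don't.
    if hasattr(param, "device_mesh"):
        return TensorParallelInfo(
            device_mesh=param.device_mesh, placements=tuple(param.placements)
        )
    return None
```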

@michaelbenayoun (Author)

Done!

Comment thread src/peft/utils/save_and_load.py
Comment thread src/peft/utils/save_and_load.py
Comment thread tests/test_gpu_examples.py Outdated
model.to(device)

lora_config = LoraConfig(r=4, target_modules=TARGET_MODULES, init_lora_weights=True)
model = inject_adapter_in_model(lora_config, model)
@BenjaminBossan (Member)

model.load_adapter(lora_config) should work equally well and is closer to what a normal user would do. The same applies to the inject_adapter_in_model use below.

@michaelbenayoun (Author)

I did not get this comment.

@BenjaminBossan (Member)

What I mean is that instead of calling inject_adapter_in_model(lora_config, model), we should call model.load_adapter(lora_config), because this is closer to what a user would normally do.
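
For context, here is a sketch of the two call shapes being discussed. The model setup is illustrative, and, as the rest of the thread explains, the load_adapter path only works once the companion Transformers change lands:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, inject_adapter_in_model

# Illustrative setup; the actual test uses its own model and TARGET_MODULES list.
model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM"
)
lora_config = LoraConfig(r=4, target_modules=["q_proj", "v_proj"], init_lora_weights=True)

# Lower-level PEFT API the test originally used:
model = inject_adapter_in_model(lora_config, model)

# What the review suggests instead: the user-facing PreTrainedModel API.
# Supporting this call shape is what huggingface/transformers#45155 adds,
# so it is commented out here rather than presented as working today.
# model.load_adapter(lora_config)
```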

@michaelbenayoun (Author)

I worked on it; it took some time, and I ended up making some changes in Transformers as well: huggingface/transformers#45155

So this PR and the one in transformers are now connected.

Comment thread tests/test_gpu_examples.py Outdated
Comment thread tests/test_gpu_examples.py Outdated
@BenjaminBossan (Member)

@michaelbenayoun What's the state of the PR?

@michaelbenayoun (Author)

Still working on it, but adding support for load_adapter is not as trivial as one might think, hence the delay.

@BenjaminBossan (Member)

Got it, LMK if I can help.

@michaelbenayoun (Author)

It is all good on my end. One thing: adding support for load_adapter required some changes in Transformers as well, which are located here: huggingface/transformers#45155.

@BenjaminBossan (Member)

Great, thanks for the progress. So if my understanding is correct, we should wait for the Transformers PR to land and be released first, otherwise this PR won't work correctly.

@michaelbenayoun (Author)

No, actually everything will work fine except the test that uses PreTrainedModel.load_adapter.
The PR in Transformers is adding support for this path, while this PR is adding support for saving checkpoints.
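
To make that split concrete, the saving side this PR covers roughly amounts to un-sharding TP parameters before writing them out. A minimal sketch of the idea, with a hypothetical helper name (the actual PR code may differ):

```python
import torch
from torch.distributed.tensor import DTensor


def _gather_tp_state_dict(state_dict: dict) -> dict:
    # Hypothetical helper: replace sharded DTensor parameters with gathered
    # full tensors so the saved adapter checkpoint is rank-independent.
    gathered = {}
    for name, tensor in state_dict.items():
        if isinstance(tensor, DTensor):
            # full_tensor() all-gathers the shards across the TP device mesh.
            tensor = tensor.full_tensor()
        gathered[name] = tensor
    return gathered
```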

@BenjaminBossan (Member) left a comment

Thanks for updating the PR and also cleaning up the TP-related code. Generally, this LGTM.

I ran the tests locally with Transformers from main (I also hard-coded is_transformers_ge_v5_6_0 = True so that tests would not be skipped). There, I got an error:

TypeError: add_tensor_parallel_hooks_to_module() takes 5 positional arguments but 6 were given

This is because of the cleanup in huggingface/transformers#44768. After removing the second tp_plan argument from the add_tensor_parallel_hooks_to_module call, all tests passed. Could you please update the code?
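
For reference, the fix is just dropping that argument at the call site. In this sketch, every argument name except tp_plan is a placeholder, since the comment above only states that the second argument was removed:

```python
# Before (fails against transformers main after huggingface/transformers#44768):
#   add_tensor_parallel_hooks_to_module(model, module, tp_plan, layer_name, module_plan, device_mesh)

# After (tp_plan removed by the upstream cleanup):
#   add_tensor_parallel_hooks_to_module(model, module, layer_name, module_plan, device_mesh)
```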

Note: Failing CI is unrelated.

@michaelbenayoun (Author)

Yes, I had actually created a branch for this specific change, but I will update it here and push.

@BenjaminBossan (Member) left a comment

Thanks for pushing the fix, LGTM. Failing CI is unrelated.

@BenjaminBossan merged commit 07a1db6 into huggingface:main on Apr 10, 2026 (2 of 10 checks passed)
@michaelbenayoun deleted the save_tp branch on April 10, 2026 15:33