Save checkpoint with TP #3096
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 045b9b1 to e8aa041
Force-pushed from e8aa041 to 96e7d5c
BenjaminBossan left a comment
Thanks for the PR. I have a few comments, but overall it already looks good.
from typing import Any, Literal, Optional

import torch
from data import get_train_valid_test_datasets, get_wiki_small
Let's undo changes to unrelated files like here and in utils.py.
Done. It is the linter which does that.
from .config import LoraConfig


@dataclass
How about we move this to utils/integration.py and give it a more generic name, as it's not strictly related to LoRA? Then we can also use it in save_and_load.py as the return type for _get_tp_info.
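For illustration, one possible shape of such a generic dataclass; the name and fields below are hypothetical and not taken from the PR:

```python
# Hypothetical sketch only: a generic, LoRA-agnostic container that could live in
# utils/integration.py and double as the return type of _get_tp_info.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorParallelInfo:
    """Illustrative fields describing how a model is sharded under tensor parallelism."""

    device_mesh: Optional[Any] = None  # e.g. a torch.distributed DeviceMesh
    tp_plan: Optional[dict] = None     # mapping of module names to sharding styles
```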
model.to(device)

lora_config = LoraConfig(r=4, target_modules=TARGET_MODULES, init_lora_weights=True)
model = inject_adapter_in_model(lora_config, model)
model.load_adapter(lora_config) should work equally well and is closer to what a normal user would do. The same applies to the inject_adapter_in_model use below.
I did not get this comment.
What I mean is that instead of calling inject_adapter_in_model(lora_config, model), we should call model.load_adapter(lora_config), because this is closer to what a user would normally do.
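For context, a minimal sketch contrasting the two call styles (the model name is a placeholder, and passing the config via the `peft_config=` keyword of `load_adapter` is an assumption about the intended usage):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, inject_adapter_in_model

model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-LlamaForCausalLM")
lora_config = LoraConfig(r=4, target_modules=["q_proj", "v_proj"], init_lora_weights=True)

# What the test currently does: low-level injection of the LoRA layers into the base model.
model = inject_adapter_in_model(lora_config, model)

# Suggested alternative, closer to a typical user flow (assumed keyword form):
# model.load_adapter(peft_config=lora_config, adapter_name="default")
```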
I have worked on it, and after some time I ended up making some changes there as well: huggingface/transformers#45155.
So this PR and the one in transformers are now connected.
@michaelbenayoun What's the state of the PR?

Still working on it, but it's not as trivial as you would think to add support for …

Got it, LMK if I can help.

It is all good on my end. One thing: adding support for …

Great, thanks for the progress. So if my understanding is correct, we should wait for the Transformers PR to land and be released first, otherwise this PR won't work correctly.

No, actually everything will work fine except the test that uses …
Thanks for updating the PR and also cleaning up the TP-related code. Generally, this LGTM.
I ran the tests locally with Transformers from main (I also hard-coded is_transformers_ge_v5_6_0 = True so that tests would not be skipped). There, I got an error:
TypeError: add_tensor_parallel_hooks_to_module() takes 5 positional arguments but 6 were given
This is because of the cleanup in huggingface/transformers#44768. After removing the 2nd tp_plan argument in the add_tensor_parallel_hooks_to_module call, all tests passed. Could you please update the code?
Note: Failing CI is unrelated.
Yes, I had actually created a branch for this specific change, but I will update it here and push.
BenjaminBossan left a comment
Thanks for pushing the fix, LGTM. Failing CI is unrelated.
Enable gathering the state dict before saving the checkpoint when doing TP (see the sketch below).
Should wait for:
ParallelInterface: transformers#44640

cc @3outeille
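A minimal sketch of the idea behind the PR, assuming a recent PyTorch where TP-sharded parameters are DTensors; the helper name and usage are illustrative, not the actual PEFT code:

```python
from torch.distributed.tensor import DTensor


def gather_tp_state_dict(state_dict: dict) -> dict:
    """Replace DTensor shards with full (gathered) tensors so the saved checkpoint is complete."""
    gathered = {}
    for name, value in state_dict.items():
        if isinstance(value, DTensor):
            # full_tensor() all-gathers the shards across the TP device mesh.
            gathered[name] = value.full_tensor()
        else:
            gathered[name] = value
    return gathered


# Usage sketch: gather before handing the state dict to save_pretrained / torch.save.
# peft_state_dict = gather_tp_state_dict(get_peft_model_state_dict(model))
```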