[TinyLoRA] TinyLoRA implementation #3024
Conversation
cc @jxmorris12 I have an implementation of TinyLoRA if you can kindly have a look? |
githubnemo
left a comment
Hey @kashif :)
Thanks for the PR, this is already solid.
Merging with main should hopefully resolve the CI errors.
Some questions and comments below.
|
thanks, fixing |
Co-authored-by: githubnemo <githubnemo@users.noreply.github.com>
|
@githubnemo should be ready for another review, thanks |
githubnemo
left a comment
Thanks for the quick response :)
I think implementation-wise this is, except for two nits, good to go.
Let's add an example that showcases the primary use-case and add the method to the method comparison suite (maybe copy from method_comparison/MetaMathQA/experiments/lora/... and see where it takes us).
It'd also be stellar to have a commit message / PR description that is meaningful in the commit history.
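On the method comparison point, a rough sketch of what that copy step could look like; the directory names below (other than the path quoted above) are hypothetical placeholders, and the actual layout of the suite should be checked in the repo.
```python
# Hypothetical sketch: copy an existing LoRA experiment directory as a starting point
# for a TinyLoRA entry in the method comparison suite, then adjust its adapter config.
# Directory names other than the path quoted in the review are placeholders.
import shutil

src = "method_comparison/MetaMathQA/experiments/lora/llama-3.2-3B-default"      # placeholder
dst = "method_comparison/MetaMathQA/experiments/tinylora/llama-3.2-3B-default"  # placeholder
shutil.copytree(src, dst)
# Next step: edit the adapter config inside `dst` so it configures TinyLoRA instead of LoRA.
```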
|
ready @githubnemo |
|
Thanks for the PR Kashif. I ran the experiments on my machine and got a test accuracy of 0% and 0.002% :) |
|
yes @BenjaminBossan, I will test with the RL setup; we can wait if it's ok. I also want to double check that nothing is wrong |
|
Out of curiosity, I wanted to check if TinyLoRA can achieve better scores if we increase the number of trainable parameters. So I took the default* setting and increased
Given the still tiny number of trainable parameters, this result is quite respectable. This is also a nice confirmation that there is no major bug in the implementation. I wonder if it would make sense to have a "maximalist" and a "minimalist" config, i.e. one with more trainable parameters and a better score, and one with extremely few trainable parameters (basically the current
*One more change I did was to increase |
I think that's a good thing to have! I also wondered if it would make sense to extend the target modules to |
|
should we just document this? or add it somewhere else? |
I'd rather target either the attention xor the MLP part for consistency with other experiments.
What does "this" reference here? |
|
Ah sorry, I meant the minimal/maximal config? |
Yes, let's add a 'maximalist' config with (possibly) r > 2 and u >= 2048. VeRA has ~128k parameters with 37.6% task accuracy according to https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison - maybe it makes sense to match that setting ( |
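To put the suggested `u >= 2048` in context, here is a rough back-of-the-envelope parameter count, using the "u parameters per target module" rule from the PR description; the layer and target-module counts below are made-up illustration values, not the benchmark setup.
```python
# Rough trainable-parameter estimate for a hypothetical "maximalist" TinyLoRA config.
# Model shape numbers are illustrative only.
num_layers = 28        # hypothetical number of decoder layers
targets_per_layer = 2  # e.g. q_proj and v_proj
u = 2048               # trainable vector length suggested above

trainable_params = num_layers * targets_per_layer * u  # assumes no weight tying
print(trainable_params)  # 114688, roughly in the range of VeRA's ~128k parameters
```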
githubnemo
left a comment
The benchmarks spiraled a bit into side projects. I think we can merge this without the additional configs and check in specific benchmarks at a later point in time.
Implementation LGTM.
|
How much of the performance reported in the paper can TinyLoRA actually reach in practice? |
|
One more question: it seems that 4-bit quantized fine-tuning with TinyLoRA isn't working yet? |
|
TinyLoRA doesn't support quantization yet. We have plans for that (#3117) but that will probably take some time to land in PEFT. You could try something like
Those are not the same as TinyLoRA, but they support bitsandbytes and are also very parameter efficient. |
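The specific alternatives mentioned above aren't spelled out in this excerpt. As a general illustration of the bitsandbytes pattern such methods follow, here is a hedged sketch using LoRA (which already supports 4-bit quantization in PEFT) as a stand-in; the model id is a placeholder.
```python
# Sketch of 4-bit quantized fine-tuning with a PEFT method that already supports
# bitsandbytes. LoRA is used as a stand-in for the parameter-efficient alternatives
# suggested above; the model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # placeholder model id
    quantization_config=bnb_config,
)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```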
Adds TinyLoRA, a new PEFT method based on "TinyLoRA: Learning to Reason in 13 Parameters". TinyLoRA achieves extreme parameter efficiency by replacing LoRA's trainable low-rank matrices with a tiny trainable vector projected through fixed random bases.
The key idea: given a frozen SVD decomposition
`W ≈ B @ A` (where `B = U @ sqrt(S)` and `A = sqrt(S) @ V^T`), the weight update is `delta_W = B @ R @ A`, where `R` is an `r x r` trainable matrix (following LoRA-XS). TinyLoRA takes this further by parameterizing `R` as a linear combination of fixed random projection matrices:

`R = sum_i v_i * P_i`

where `v` is the only trainable parameter (as small as 13 values) and the `P_i` are fixed random matrices seeded deterministically.

Features

- `u` trainable parameters per target module (or even less with weight tying), compared to `r * (in + out)` for LoRA
- Share `v` vectors across layers via `weight_tying` (0.0 = no sharing, 1.0 = all layers share one `v`)
- `A` and `B` matrices computed from truncated SVD of the pretrained weights, with singular values distributed equally via `sqrt(S)`
- Supports `nn.Linear`, `Conv1D`, and `nn.Embedding`
- `supports_lora_conversion() -> True`; delta weights can be converted to standard LoRA format via `get_delta_weight`
- `P` matrices are seeded per-layer for reproducibility; optionally saved in checkpoints (`save_projection=True`)

Config
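Below is a minimal usage sketch of that config, assuming the class added by this PR is exposed as `TinyLoraConfig`; the exact class name, field names, and defaults should be checked against the code in this PR.
```python
# Minimal usage sketch; TinyLoraConfig and its fields are assumptions based on the
# PR description above, not a confirmed API.
from transformers import AutoModelForCausalLM
from peft import get_peft_model, TinyLoraConfig  # TinyLoraConfig export assumed

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
config = TinyLoraConfig(
    r=2,                   # rank of the frozen SVD factors B and A
    u=13,                  # length of the trainable vector v per target module
    weight_tying=1.0,      # 1.0 = all layers share one v, 0.0 = no sharing
    save_projection=True,  # store the fixed random P matrices in the checkpoint
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```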
Architecture
`get_delta_weight`, `supports_lora_conversion`
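For illustration, a small PyTorch sketch of the delta-weight math described above (not the PR's actual implementation): `delta_W = B @ R @ A` with `R = sum_i v_i * P_i`, where only `v` is trainable.
```python
# Sketch of the TinyLoRA delta-weight computation, using toy shapes.
import torch

def tinylora_delta_weight(B, A, P, v):
    # B: (out, r) and A: (r, in) come from the truncated SVD of the frozen weight,
    # P: (u, r, r) fixed random projection matrices, v: (u,) trainable vector.
    R = torch.einsum("u,urs->rs", v, P)  # linear combination of the fixed bases
    return B @ R @ A                     # (out, in) update added to the frozen weight

# Toy shapes for illustration only
out_f, in_f, r, u = 8, 6, 2, 13
W = torch.randn(out_f, in_f)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
B = U[:, :r] * S[:r].sqrt()              # B = U @ sqrt(S), truncated to rank r
A = S[:r].sqrt().unsqueeze(1) * Vh[:r]   # A = sqrt(S) @ V^T
P = torch.randn(u, r, r)                 # fixed; seeded per layer in the real implementation
v = torch.zeros(u, requires_grad=True)   # the only trainable parameter
delta_W = tinylora_delta_weight(B, A, P, v)
```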