Replies: 2 comments
Hi there, you raise a good point. I think what we define as "fair" depends. I can see two possible scenarios: a) Regular fine-tuning versus LoRA fine-tuning, where LoRA targets the same layers as the regular fine-tuning (all, last, some). Then one can see whether LoRA, which updates fewer parameters per layer, helps or not. b) Using the method with the highest accuracy (all, last, or some layers) for each of the two (regular versus LoRA), and seeing which one is more efficient. I think another important factor that we are not considering in this discussion thread is the memory savings. In practice, a common bottleneck is that one cannot do regular training of an 8B model on many single GPUs (depending on the RAM), but LoRA is fine. All that being said, in the classification case (not in the supervised instruction fine-tuning case), updating only the last layer like you suggest is reasonable. But again, I think with larger models this becomes totally negligible either way.
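To put a rough number on the memory-savings point, here is a back-of-envelope sketch of trainable parameters for one linear layer under regular fine-tuning versus a rank-r LoRA adapter. The layer size and rank below are illustrative assumptions, not figures from this thread:

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights trained by regular fine-tuning of a d_in x d_out linear layer."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights trained by a rank-r LoRA adapter (A: d_in x r, B: r x d_out)."""
    return r * (d_in + d_out)

d = 4096   # hidden size in the ballpark of a 7-8B-class model (assumption)
r = 16     # a commonly used LoRA rank (assumption)

print(full_params(d, d))      # -> 16777216
print(lora_params(d, d, r))   # -> 131072, roughly 128x fewer
```

With numbers like these it is easy to see why LoRA can fit on a single GPU where regular training of the same layers cannot: the optimizer states scale with the trainable-parameter count, not the frozen weights.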
Your comparison setup is actually sound, and the results make sense given how LoRA interacts with classification fine-tuning.

On the fairness of the comparison: Comparing last-layer fine-tuning against last-layer LoRA is the right call. Applying LoRA to all linear layers introduces extra trainable parameters across the full network, which is a different budget and scope than the baseline - not an apples-to-apples comparison for classification.

Why last-layer fine-tuning wins here: Classification fine-tuning on a pre-trained LLM is a low-rank adaptation problem by nature - you are essentially learning a linear mapping from the frozen representations to class labels. LoRA adds a low-rank decomposition on top of that, which introduces extra approximation error with no benefit when the base weights themselves are not being updated in the layers that matter. With zero hyperparameter tuning, direct fine-tuning of the final block and output head coming out ahead is consistent with this.

On the forgetting observation: You are right that LoRA updates to early layers can cause representational drift. For classification tasks where the pre-trained representations are already strong, touching the early layers adds noise rather than signal. Last-layer LoRA avoids this, which is why it recovers most of the accuracy gap vs. full LoRA.

Practical takeaway: For classification, last-layer LoRA is a reasonable middle ground if you need the LoRA weight-merging benefits (e.g., serving multiple adapters). But if you just want accuracy and speed, direct fine-tuning of the last transformer block onwards is hard to beat - your numbers confirm this well.
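For anyone reading along, the low-rank decomposition mentioned above can be sketched in a few lines. This assumes the standard LoRA parameterization W_eff = W + (alpha / r) * A @ B, with A initialized to small random values and B initialized to zeros so the adapter starts as a no-op; variable names are illustrative, not from the book's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 4, 2, 4

W = rng.normal(size=(d_in, d_out))     # frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01  # trainable down-projection
B = np.zeros((r, d_out))               # trainable up-projection, zero init

x = rng.normal(size=(1, d_in))

y_base = x @ W
y_lora = x @ W + (alpha / r) * (x @ A @ B)

# With B = 0, the adapter contributes nothing at initialization,
# so training starts exactly from the pre-trained behavior:
assert np.allclose(y_base, y_lora)
```

During training only A and B receive gradients while W stays frozen, which is where both the memory savings and the extra approximation error discussed above come from.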
Hey, I have just implemented LoRA for classification fine-tuning. In the end, I noticed the comment about how LoRA is slower due to the added inference cost, but that this cost can be offset on larger models.
My question is whether it makes sense to compare this model against last-layer LoRA fine-tuning. Because in classification fine-tuning, we only fine-tuned the layers from the last transformer block onwards and did not touch the weights of the earlier layers, I think using LoRA only on the last transformer block and the out_head would be fairer here.
Here are my results for different experiments (I have a slightly different dataset and training loop, so the numbers differ from the book):
I think we see here that the training-time cost is not that high when LoRA is applied to the same layers. Also, in my experiment with zero hyperparameter tuning, last-layer training performed better too (I assume this might be caused by LoRA updates to the initial layers causing some "forgetting").
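The matched-scope setup I used can be sketched as a simple filter over module names: attach LoRA only to the modules that regular classification fine-tuning also updates (the last transformer block and the classification head), rather than every linear layer. The layer names below are illustrative, not the book's actual module names:

```python
# Hypothetical flat list of module names for a 12-block transformer classifier.
layers = [f"blocks.{i}.attn" for i in range(12)] + \
         [f"blocks.{i}.ffn" for i in range(12)] + \
         ["final_norm", "out_head"]

def lora_targets(layers, last_block_idx=11):
    """Keep only the last transformer block and the classification head."""
    prefix = f"blocks.{last_block_idx}."
    return [name for name in layers if name.startswith(prefix) or name == "out_head"]

print(lora_targets(layers))  # -> ['blocks.11.attn', 'blocks.11.ffn', 'out_head']
```

Everything outside this list stays frozen in both runs, so the only difference between the two experiments is whether the selected modules are updated directly or through LoRA adapters.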