Finetuning and Alignment
Techniques for transforming base LLMs into helpful, harmless, and honest assistants.
Instruct Fine-Tuning
Instruct fine-tuning (or “instruction tuning”, or “supervised finetuning”, or “chat tuning” – the boundaries here are a bit fuzzy) is the primary technique used (at least initially) for coaxing LLMs to conform to a particular style or format. Here, data is presented as a sequence of (input, output) pairs where the input is a user request or question, and the model’s goal is to predict the output – typically this also involves adding special “start”/”stop”/”role” tokens and masking the loss on the input tokens, enabling the model to “understand” the difference between the user’s input and its own outputs. This technique is also widely used for task-specific finetuning on datasets with a particular kind of problem structure (e.g. translation, math, general question-answering).
See this blog post from Sebastian Ruder or this video from Shayne Longpre for short overviews.
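To make this concrete, here’s a minimal sketch in plain Python of how an (input, output) pair might be formatted and how the prompt tokens can be masked out of the loss. The ChatML-style template and the toy stand-in tokenizer are purely illustrative – real models each define their own chat template and special tokens.

```python
# Minimal sketch of instruction-tuning data prep with loss masking.
# The special tokens below are illustrative; real chat templates vary by model.

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value


def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a real tokenizer: map each whitespace-separated piece to an id.
    return [hash(tok) % 32000 for tok in text.split()]


def build_example(user_msg: str, assistant_msg: str):
    prompt = f"<|user|> {user_msg} <|assistant|>"
    response = f"{assistant_msg} <|end|>"

    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)

    input_ids = prompt_ids + response_ids
    # Mask out the prompt so the loss is only computed on the model's reply.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


if __name__ == "__main__":
    ids, labels = build_example("What is the capital of France?", "Paris.")
    print(ids)
    print(labels)
```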
Low-Rank Adapters (LoRA)
Pre-training (and “full finetuning”) requires applying gradient updates to all parameters of a model, which is typically impractical on consumer GPUs or home setups; fortunately, it’s often possible to significantly reduce the compute requirements by using parameter-efficient finetuning (PEFT) techniques like Low-Rank Adapters (LoRA). This can enable competitive performance even with relatively small datasets, particularly for application-specific use cases. The main idea behind LoRA is to train each weight matrix in a low-rank space by “freezing” the base matrix and training a factored representation with a much smaller inner dimension, which is then added to the base matrix.
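As a rough sketch of the idea (not a drop-in replacement for libraries like Hugging Face’s peft), here’s what a LoRA-wrapped linear layer might look like in PyTorch, with the base weights frozen and only the two low-rank factors trained:

```python
# Minimal sketch of a LoRA-wrapped linear layer: W x + (alpha / r) * B A x,
# with W frozen and only the low-rank factors A, B trained.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A is (r x in), B is (out x r); B starts at zero so
        # the adapter initially contributes nothing.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Example: wrap a 4096 -> 4096 projection with a rank-8 adapter.
layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
out = layer(torch.randn(2, 4096))
```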
Resources on LoRA
- Video: LoRA paper walkthrough (part 1)
- Video: LoRA code demo (part 2)
- Blog post: "Parameter-Efficient LLM Finetuning With Low-Rank Adaptation" by Sebastian Raschka
- Blog post: "Practical Tips for Finetuning LLMs Using LoRA" by Sebastian Raschka
Additionally, a “decomposed” LoRA variant called DoRA has been gaining popularity in recent months, often yielding performance improvements; see this post from Sebastian Raschka for more details.
Reward Models and RLHF
One of the most prominent techniques for “aligning” a language model is Reinforcement Learning from Human Feedback (RLHF); here, we typically assume that an LLM has already been instruction-tuned to respect a chat style, and that we additionally have a “reward model” which has been trained on human preferences. Given pairs of differing outputs for the same input, where a human has chosen a preferred output, the reward model’s learning objective is to predict the preferred output, which involves implicitly learning preference “scores”. This allows bootstrapping a general representation of human preferences (at least with respect to the dataset of output pairs), which can be used as a “reward simulator” for further training of an LLM using RL policy gradient techniques like PPO.
RLHF represents a significant advancement in aligning LLMs with human values and preferences, enabling models to produce outputs that are not just factually accurate but also helpful, harmless, and honest.
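Concretely, the reward model’s objective is typically a pairwise (Bradley–Terry-style) loss over the scalar scores it assigns to the chosen and rejected outputs; a minimal sketch, assuming the scores have already been computed by the reward model’s head:

```python
# Minimal sketch of the pairwise reward-model objective: maximize the
# log-probability that the chosen response outscores the rejected one.
import torch
import torch.nn.functional as F


def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# Example with dummy scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))
```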
Resources on RLHF
- Blog post: "Illustrating Reinforcement Learning from Human Feedback (RLHF)" from Hugging Face
- Blog post: "Reinforcement Learning from Human Feedback" from Chip Huyen
- Video: RLHF talk by Nathan Lambert
- Blog post: Insights on RewardBench from Sebastian Raschka
Direct Preference Optimization Methods
The space of alignment algorithms seems to be following a similar trajectory as we saw with stochastic optimization algorithms a decade ago. In this analogy, RLHF is like SGD — it works, it’s the original, and it’s also become kind of a generic “catch-all” term for the class of algorithms that have followed it. Perhaps DPO is AdaGrad, and in the year since its release there’s been a rapid wave of further algorithmic developments along the same lines (KTO, IPO, ORPO, etc.), whose relative merits are still under active debate. Maybe a year from now, everyone will have settled on a standard approach which will become the “Adam” of alignment.
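For reference, the DPO objective itself is quite compact: it trains the policy directly on preference pairs against a frozen reference model, with no explicit reward model or RL loop. A minimal sketch, assuming per-response log-probabilities have already been summed (variable names are illustrative):

```python
# Minimal sketch of the DPO loss: given summed log-probs of the chosen and
# rejected responses under the policy and a frozen reference model, push the
# policy to prefer the chosen response relative to the reference.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Dummy example: log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -30.0]), torch.tensor([-15.0, -28.0]),
                torch.tensor([-13.0, -31.0]), torch.tensor([-14.0, -29.0]))
print(loss)
```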
Resources on DPO
- Blog post: "Understanding the Implications of Direct Preference Optimization" by Matthew Gunton
- Blog post: "Fine-tuning language models with Direct Preference Optimization" from Hugging Face
- Blog post: "The Art of Preference Optimization" from Hugging Face (comparing DPO-flavored methods)
Context Scaling
Beyond task specification or alignment, another common goal of finetuning is to increase the effective context length of a model, either via additional training, adjusting parameters for positional encodings, or both. Even if adding more tokens to a model’s context can “type-check”, training on longer examples is generally necessary, since the model may not have seen such long sequences during pretraining.
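As a rough illustration of the “adjust the positional encodings” route, here’s a minimal sketch of computing RoPE angles with simple linear position interpolation, where longer positions are rescaled back into the range seen during pretraining; methods like NTK-aware scaling or YaRN (linked below) adjust the frequency base and attention temperature instead of, or in addition to, this.

```python
# Minimal sketch of RoPE angles with linear position interpolation: squeeze
# new, longer positions back into the pretraining range by dividing positions
# by a scale factor.
import torch


def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    # Standard RoPE frequencies: one per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale  # position interpolation
    return torch.outer(positions, inv_freq)            # (seq_len, head_dim // 2)


# Pretrained context 4k, target context 16k -> scale positions by 4x.
angles = rope_angles(seq_len=16384, head_dim=128, scale=4.0)
cos, sin = angles.cos(), angles.sin()  # used to rotate query/key pairs
```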
Resources on Context Scaling
- Blog post: "Scaling Rotational Embeddings for Long-Context Language Models" by Gradient AI
- Blog post: "Extending the RoPE" by Eleuther AI, introducing the YaRN method for increased context via attention temperature scaling
- Blog post: "Everything About Long Context Fine-tuning" by Wenbo Pan
Distillation and Merging
Here we’ll look at two very different methods of consolidating knowledge across LLMs — distillation and merging. Distillation was first popularized for BERT models, where the goal is to “distill” the knowledge and performance of a larger model into a smaller one (at least for some tasks) by having the larger model serve as a “teacher” during the smaller model’s training, bypassing the need for large quantities of human-labeled data.
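A minimal sketch of the classic distillation objective — a temperature-softened KL term against the teacher’s output distribution, mixed with the usual hard-label loss when labels are available; the function and hyperparameters here are illustrative:

```python
# Minimal sketch of a knowledge-distillation loss: the student matches the
# teacher's softened distribution (KL term) in addition to the standard
# cross-entropy on ground-truth labels.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets from the teacher, compared at the same temperature.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Standard hard-label loss.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Dummy example: batch of 4, vocabulary of 10.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
print(loss)
```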
Resources on Distillation
- Blog post: "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" from Hugging Face
- Guide: "LLM distillation demystified: a complete guide" from Snorkel AI
- Research blog: "Distilling Step by Step" from Google Research
Merging is much more of a "wild west" technique, largely used by open-source engineers who want to combine the strengths of multiple finetuning efforts. It's kind of wild that it works at all, and perhaps grants some credence to "linear representation hypotheses".
The idea behind model merging is basically to take two different finetunes of the same base model and just average their weights. No training required. Technically, it’s usually “spherical interpolation” (or “slerp”), but this is pretty much just fancy averaging with a normalization step. For more details, see the post Merge Large Language Models with mergekit by Maxime Labonne.
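Below is a minimal sketch of that core operation — tensor-wise slerp over two state dicts from finetunes of the same base model; real tools like mergekit add many refinements (merge recipes, per-layer weighting, TIES/DARE variants) on top of this.

```python
# Minimal sketch of merging two finetunes of the same base model by spherical
# interpolation ("slerp") of each weight tensor.
import torch


def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5,
          eps: float = 1e-8) -> torch.Tensor:
    a, b = w0.flatten().float(), w1.flatten().float()
    # Angle between the two weight vectors.
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps),
                            -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < 1e-4:  # nearly parallel: fall back to plain averaging
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * theta) * a +
                  torch.sin(t * theta) * b) / torch.sin(theta)
    return merged.reshape(w0.shape).to(w0.dtype)


# Merge two state dicts key by key (both finetuned from the same base model).
def merge_models(sd0: dict, sd1: dict, t: float = 0.5) -> dict:
    return {k: slerp(sd0[k], sd1[k], t) for k in sd0}
```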
Key Takeaways
- Instruct fine-tuning transforms base LLMs into models that follow user instructions
- Parameter-efficient techniques like LoRA make fine-tuning feasible on consumer hardware
- RLHF aligns models with human preferences through reward modeling and reinforcement learning
- Direct Preference Optimization (DPO) offers a simpler alternative to RLHF for alignment
- Context scaling techniques enable LLMs to handle much longer inputs than their pretraining allowed
- Knowledge distillation creates smaller, faster models that retain much of their teacher's capabilities
- Model merging can combine strengths from different fine-tuned models without additional training