The Grand AI Handbook

Finetuning and Alignment

Techniques for transforming base LLMs into helpful, harmless, and honest assistants.

In pre-training, the goal is basically "predict the next token on random internet text". While the resulting "base" models are still useful in some contexts, their outputs are often chaotic or "unaligned", and they may not respect the format of a back-and-forth conversation. Here we'll look at a set of techniques for going from these base models to ones resembling the friendly chatbots and assistants we're more familiar with. A great companion resource, especially for this section, is Maxime Labonne's interactive LLM course on GitHub.

Instruct Fine-Tuning

Instruct fine-tuning (or “instruction tuning”, or “supervised finetuning”, or “chat tuning” – the boundaries here are a bit fuzzy) is the primary technique used (at least initially) for coaxing LLMs to conform to a particular style or format. Here, data is presented as a sequence of (input, output) pairs, where the input is a user instruction or question and the output is the desired response, and the model is trained to predict the output. Typically this also involves adding special “start”/”stop”/”role” tokens and masking the loss over the input tokens, enabling the model to “understand” the difference between the user’s input and its own outputs. This technique is also widely used for task-specific finetuning on datasets with a particular kind of problem structure (e.g. translation, math, general question-answering).
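To make this concrete, here's a minimal sketch of preparing one such (input, output) pair for training, assuming a Hugging Face-style tokenizer. The `<|user|>`/`<|assistant|>` markers are purely illustrative (real chat templates vary by model); the key detail is the label mask, which restricts the training loss to the assistant's tokens.

```python
# Minimal sketch of instruct-tuning data preparation (illustrative role markers,
# not any specific model's chat template).

def build_example(tokenizer, user_text, assistant_text, ignore_index=-100):
    prompt = f"<|user|>\n{user_text}\n<|assistant|>\n"
    completion = assistant_text + tokenizer.eos_token

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + completion_ids
    # Predict only the completion: prompt positions are ignored by the loss.
    labels = [ignore_index] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}
```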

See this blog post from Sebastian Ruder or this video from Shayne Longpre for short overviews.

Low-Rank Adapters (LoRA)

While pre-training (and “full finetuning”) requires applying gradient updates to all parameters of a model, this is typically impractical on consumer GPUs or home setups; fortunately, it’s often possible to significantly reduce the compute requirements by using parameter-efficient finetuning (PEFT) techniques like Low-Rank Adapters (LoRA). This can enable competitive performance even with relatively small datasets, particularly for application-specific use cases. The main idea behind LoRA is to “freeze” each base weight matrix and train a low-rank update: a pair of factored matrices with a much smaller inner dimension, whose product is added to the base matrix.
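As a rough sketch (not a drop-in replacement for a library like peft), a LoRA-wrapped linear layer in PyTorch might look like this; the rank `r` and scaling `alpha` are just illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA adapter around a frozen nn.Linear: the output is
    base(x) plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights

        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r             # conventional LoRA scaling factor

    def forward(self, x):
        # B starts at zero, so the adapter is a no-op at initialization.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Only `lora_A` and `lora_B` receive gradients, which is what makes this kind of finetuning feasible on much smaller hardware.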

Additionally, a “weight-decomposed” LoRA variant called DoRA has been gaining popularity in recent months, often yielding performance improvements; see this post from Sebastian Raschka for more details.

Reward Models and RLHF

One of the most prominent techniques for “aligning” a language model is Reinforcement Learning from Human Feedback (RLHF); here, we typically assume that an LLM has already been instruction-tuned to respect a chat style, and that we additionally have a “reward model” which has been trained on human preferences. Given pairs of differing outputs to an input, where a preferred output has been chosen by a human, the reward model is trained to predict the preferred output, implicitly learning preference “scores” in the process. This allows bootstrapping a general representation of human preferences (at least with respect to the dataset of output pairs), which can be used as a “reward simulator” for continued training of an LLM using RL policy gradient techniques like PPO.
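The core of the reward-model objective is small enough to sketch directly; the function below assumes the reward model already maps each (input, output) pair to a scalar score, and uses the standard pairwise Bradley-Terry-style loss.

```python
import torch.nn.functional as F

def reward_pair_loss(chosen_scores, rejected_scores):
    """Pairwise preference loss: push the score of the human-preferred
    ("chosen") response above the dispreferred ("rejected") one.
    Both arguments are tensors of scalar scores from the reward model."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```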

RLHF represents a significant advancement in aligning LLMs with human values and preferences, enabling models to produce outputs that are not just factually accurate but also helpful, harmless, and honest.

Direct Preference Optimization Methods

The space of alignment algorithms seems to be following a similar trajectory as we saw with stochastic optimization algorithms a decade ago. In this analogy, RLHF is like SGD — it works, it’s the original, and it’s also become kind of a generic “catch-all” term for the class of algorithms that have followed it. Perhaps DPO is AdaGrad, and in the year since its release there’s been a rapid wave of further algorithmic developments along the same lines (KTO, IPO, ORPO, etc.), whose relative merits are still under active debate. Maybe a year from now, everyone will have settled on a standard approach which will become the “Adam” of alignment.
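Part of DPO's appeal is that its loss fits in a few lines with no explicit reward model or RL loop. The sketch below assumes you've already computed the summed log-probability of each chosen/rejected response under both the policy and a frozen reference model; `beta` is a tunable coefficient (commonly around 0.1).

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sketch of the DPO objective: increase the policy's log-prob margin on
    preferred responses relative to the frozen reference model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```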

Context Scaling

Beyond task specification or alignment, another common goal of finetuning is to increase the effective context length of a model, either via additional training, adjusting the parameters of its positional encodings, or both. Even if adding more tokens to a model’s context can “type-check”, training on additional longer examples is generally necessary, since the model may not have seen such long sequences during pretraining.
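To illustrate the positional-encoding piece, here's a sketch of rotary-embedding angle computation with simple linear position interpolation: dividing positions by a scale factor squeezes longer sequences back into the position range seen during pretraining. Real long-context recipes differ in the details; the function and its arguments here are illustrative, not taken from any specific implementation.

```python
import torch

def rope_angles(head_dim, max_positions, base=10000.0, scale=1.0):
    """Rotary-embedding angles with optional linear position interpolation.
    scale=1.0 reproduces standard RoPE; scale>1.0 compresses positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_positions).float() / scale
    return torch.outer(positions, inv_freq)   # (max_positions, head_dim // 2)
```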

Distillation and Merging

Here we’ll look at two very different methods of consolidating knowledge across LLMs — distillation and merging. Distillation was popularized in the NLP world with BERT models, where the goal is to “distill” the knowledge and performance of a larger model into a smaller one (at least for some tasks) by having the larger model serve as a “teacher” during the smaller model’s training, bypassing the need for large quantities of human-labeled data.
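A minimal form of the soft-label distillation objective looks like the sketch below (not any particular published recipe): the student is trained to match the teacher's temperature-softened output distribution, and in practice this term is usually combined with a standard cross-entropy loss on the data.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions over the vocabulary (soft-label distillation)."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t ** 2)
```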

Merging is much more of a "wild west" technique, largely used by open-source engineers who want to combine the strengths of multiple finetuning efforts. It's kind of wild that it works at all, and perhaps lends some credence to "linear representation hypotheses".

The idea behind model merging is basically to take two different finetunes of the same base model and just average their weights. No training required. Technically, it’s usually “spherical interpolation” (or “slerp”), but this is pretty much just fancy averaging with a normalization step. For more details, see the post Merge Large Language Models with mergekit by Maxime Labonne.
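Here's roughly what that “fancy averaging” looks like; a real tool like mergekit works tensor-by-tensor across the two checkpoints and handles many practical details that this toy function ignores.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5, eps: float = 1e-8):
    """Spherical interpolation between two weight tensors from finetunes of
    the same base model. t=0.5 gives an even blend of the two."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    v0n = v0 / (v0.norm() + eps)
    v1n = v1 / (v1.norm() + eps)
    omega = torch.arccos(torch.clamp(torch.dot(v0n, v1n), -1.0, 1.0))
    if omega < eps:                          # nearly parallel: plain averaging
        merged = (1 - t) * v0 + t * v1
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) / so) * v0 \
               + (torch.sin(t * omega) / so) * v1
    return merged.reshape(w0.shape)
```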

Key Takeaways

  • Instruct fine-tuning transforms base LLMs into models that follow user instructions
  • Parameter-efficient techniques like LoRA make fine-tuning feasible on consumer hardware
  • RLHF aligns models with human preferences through reward modeling and reinforcement learning
  • Direct Preference Optimization (DPO) offers a simpler alternative to RLHF for alignment
  • Context scaling techniques enable LLMs to handle much longer inputs than their pretraining allowed
  • Knowledge distillation creates smaller, faster models that retain much of their teacher's capabilities
  • Model merging can combine strengths from different fine-tuned models without additional training