LLM Architecture and Training
From decoder-only Transformers to the implementation choices behind frontier LLMs.
Tokenization
Character-level tokenization (as used in several of the Karpathy videos) tends to be inefficient for large-scale Transformers compared to word-level tokenization, yet naively picking a fixed “dictionary” (e.g. Merriam-Webster) of full words runs the risk of encountering unseen words or misspellings at inference time. Instead, the typical approach is subword-level tokenization, which “covers” the space of possible inputs while retaining the efficiency gains of a larger token pool, using algorithms like Byte-Pair Encoding (BPE) to select the appropriate set of tokens. If you’ve seen Huffman coding in an introductory algorithms class, it’s a somewhat useful analogy for BPE, although the setup is notably different: we don’t know the set of “tokens” in advance, and BPE instead builds it greedily by repeatedly merging the most frequent adjacent pair of symbols. I’d recommend watching Andrej Karpathy’s video on tokenization and checking out this tokenization guide from Masato Hagiwara.
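To make the merging idea concrete, here’s a toy sketch of BPE training (not a production tokenizer like tiktoken or SentencePiece): it learns merge rules from a small word-frequency table by repeatedly merging the most frequent adjacent symbol pair. The word counts and merge budget below are made up purely for illustration.

```python
from collections import Counter

def bpe_merges(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word split into characters, weighted by its frequency.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the (weighted) corpus.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Re-segment every word using the new merge rule.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Illustrative corpus: merge rules like ('e', 'r') or ('l', 'o') emerge first.
print(bpe_merges({"lower": 5, "lowest": 3, "newer": 6}, num_merges=4))
```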
Positional Encoding
As we saw in the previous section, Transformers don’t natively have the same notion of adjacency or position within a context window (in contrast to RNNs), so position must instead be represented with some kind of vector encoding. While this could be done naively with something like one-hot encoding, that approach is impractical for context-scaling and suboptimal for learnability, as it throws away notions of ordinality. Originally this was done with sinusoidal positional encodings, which may feel reminiscent of Fourier features if you’re familiar with them; the most popular approach nowadays is likely Rotary Positional Encoding (RoPE), which tends to be more stable and faster to learn during training.
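As a concrete reference point, here’s a minimal NumPy sketch of the original sinusoidal scheme from “Attention Is All You Need”; the sequence length and model dimension below are arbitrary. Note that RoPE works differently, rotating query/key vectors inside attention rather than adding a vector to the embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal positional encodings, added to token embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angle_rates = 1.0 / (10000 ** (dims / d_model))  # geometric range of frequencies
    angles = positions * angle_rates                 # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims get sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```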
Key Resources for Positional Encoding
- Blog post: Understanding Positional Embeddings by Harrison Pim on intuition for positional encodings
- Blog post: A Gentle Introduction to Positional Encoding by Mehreen Saeed on the original Transformer positional encodings
- Blog post: Rotary Embeddings on RoPE from Eleuther AI
- Animated video: Understanding Positional Encoding from DeepLearning Hero
Pretraining Recipes
Once you’ve committed to pretraining an LLM of a certain general size on a particular corpus of data (e.g. Common Crawl, FineWeb), there are still a number of choices to make before you’re ready to go:
- Attention mechanisms (multi-head, multi-query, grouped-query)
- Activations (ReLU, GeLU, SwiGLU)
- Optimizers, learning rates, and schedulers (AdamW, warmup, cosine decay)
- Dropout?
- Hyperparameter choices and search strategies
- Batching, parallelization strategies, gradient accumulation
- How long to train for, how often to repeat data
- …and many other axes of variation
As far as I can tell, there's not a one-size-fits-all rule book for how to go about this, but the resources below provide valuable insights from those who have navigated these challenges.
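To make that decision surface concrete, here’s a hypothetical config object sketching the axes above; the field names and defaults are illustrative placeholders, not a recommended recipe.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    # Architecture choices
    attention: str = "grouped-query"   # "multi-head" | "multi-query" | "grouped-query"
    activation: str = "swiglu"         # "relu" | "gelu" | "swiglu"
    dropout: float = 0.0               # often disabled for single-epoch pretraining runs
    # Optimization choices
    optimizer: str = "adamw"
    peak_lr: float = 3e-4
    warmup_steps: int = 2000
    lr_schedule: str = "cosine"
    # Data and batching choices
    global_batch_tokens: int = 4_000_000
    grad_accum_steps: int = 8
    total_training_tokens: int = 1_000_000_000_000
    data_epochs: float = 1.0           # how often to repeat the corpus
```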
Essential Pretraining Resources
- Blog post: A Recipe for Training Neural Networks by Andrej Karpathy - While it predates the LLM era, this is a great starting point for framing many problems relevant throughout deep learning
- Guide: The Novice's LLM Training Guide by Alpin Dale, discussing hyperparameter choices in practice, as well as the finetuning techniques we'll see in future sections
- Blog post: How to train your own Large Language Models from Replit has some nice discussions on data pipelines and evaluations for training
- Article: Navigating the Attention Landscape: MHA, MQA, and GQA Decoded by Shobhit Agarwal for understanding attention mechanism tradeoffs
- Blog post: The Evolution of the Modern Transformer from Deci AI for discussion of "popular defaults"
- Chapter: Learning Rate Scheduling from the d2l.ai book (Chapter 12.11)
- Blog post: Response to NYT from Eleuther AI on controversy surrounding reporting of "best practices"
Distributed Training and FSDP
There are a number of additional challenges associated with training models which are too large to fit on individual GPUs (or even multi-GPU machines), typically necessitating the use of distributed training protocols like Fully Sharded Data Parallelism (FSDP), in which a model’s parameters, gradients, and optimizer states are sharded across devices (and machines) during training. It’s also worth understanding its precursor, Distributed Data Parallel (DDP), which is covered in the first post linked below.
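For a sense of what this looks like in practice, here’s a minimal sketch of wrapping a model with PyTorch’s FSDP, assuming a launch via `torchrun` (which sets up the rank/world-size environment) and using a single `TransformerEncoderLayer` as a stand-in for a real LLM; real setups add sharding policies, mixed precision, activation checkpointing, and careful checkpoint saving/loading.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<ngpus> train.py`
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in model; a real run would wrap a full Transformer stack.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)

# FSDP shards parameters (and hence gradients and optimizer state) across ranks,
# gathering full parameters layer-by-layer only when needed for forward/backward.
model = FSDP(model, device_id=local_rank)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# ...then a standard training loop: forward, loss.backward(), optimizer.step()
```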
Resources on Distributed Training
- Blog post: FSDP from Meta (who pioneered the method)
- Blog post: Understanding FSDP by Bar Rozenman, featuring many excellent visualizations
- Report: Training Great LLMs Entirely From Ground Zero in the Wilderness from Yi Tay on the challenges of pretraining a model in a startup environment
- Technical blog: FSDP QLora Deep Dive from Answer.AI on combining FSDP with parameter-efficient finetuning techniques for use on consumer GPUs
Scaling Laws
It’s useful to know about scaling laws as a meta-topic which comes up a lot in discussions of LLMs (most prominently in reference to the “Chinchilla” paper), more so than any particular empirical finding or technique. In short, model loss tends to follow fairly predictable curves as you scale up the model size, dataset size, and compute used for training, which lets you forecast performance and choose near-optimal allocations (e.g. how large a model to train on how many tokens) without running expensive grid searches.
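As a back-of-the-envelope illustration, here’s one way to split a FLOPs budget between parameter count N and training tokens D, assuming the common approximations that training compute C ≈ 6ND and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter; both numbers vary by setup, so treat this as a sketch rather than a recipe.

```python
def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Roughly split a training FLOPs budget between model size N and tokens D."""
    # With C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.8e23 FLOPs budget lands near 70B params / 1.4T tokens,
# roughly in line with the Chinchilla setup.
N, D = compute_optimal_allocation(5.8e23)
print(f"params ≈ {N:.2e}, tokens ≈ {D:.2e}")
```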
Resources on Scaling Laws
- Blog overview: Chinchilla Scaling Laws for Large Language Models by Rania Hossam
- Discussion: New Scaling Laws for LLMs on LessWrong
- Post: Chinchilla's Wild Implications on LessWrong
- Analysis: Chinchilla Scaling: A Replication Attempt (potential issues with Chinchilla findings)
- Blog post: Scaling Laws and Emergent Properties by Clément Thiriet
- Video lecture: Scaling Language Models from Stanford CS224n
Mixture-of-Experts
While many of the prominent LLMs used today (such as Llama 3) are “dense” models (i.e. without enforced sparsification), Mixture-of-Experts (MoE) architectures are becoming increasingly popular for navigating tradeoffs between “knowledge” and efficiency, used perhaps most notably in the open-weights world by Mistral AI’s “Mixtral” models (8x7B and 8x22B), and rumored to be used for GPT-4. In MoE models, only a fraction of the parameters are “active” for each step of inference, with trained router modules selecting the parallel “experts” to use at each layer. This allows models to grow in size (and perhaps “knowledge” or “intelligence”) while remaining more efficient to train and serve than a comparably-sized dense model.
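Here’s a minimal sketch of token-level top-k routing to show the mechanics (not Mixtral’s actual implementation, which is heavily optimized); the `TopKMoE` class name and its dimensions are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sketch of a top-k Mixture-of-Experts feed-forward layer."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- tokens flattened across batch and sequence
        gate_logits = self.router(x)                         # (num_tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

layer = TopKMoE(d_model=512, d_ff=2048)
tokens = torch.randn(4 * 16, 512)   # 4 sequences of 16 tokens, flattened
print(layer(tokens).shape)          # torch.Size([64, 512])
```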
Resources on Mixture-of-Experts
- Blog post: Mixture of Experts Explained from Hugging Face for a technical overview
- Video: Mixture of Experts Visualized from Trelis Research for a visualized explainer
Key Takeaways
- Subword tokenization strikes a balance between efficiency and handling unknown words
- Positional encoding schemes like RoPE are crucial for Transformers to understand sequence order
- LLM pretraining involves numerous architecture and optimization decisions
- Distributed training techniques like FSDP enable training of models too large for individual GPUs
- Scaling laws provide guidance on optimal allocation of compute, data, and model size
- Mixture-of-Experts models offer parameter efficiency by activating only a subset of experts for each token during inference