LLM Architecture and Training
From decoder-only Transformers to the implementation choices behind frontier LLMs.
Tokenization
Character-level tokenization (as used in several of the Karpathy videos) tends to be inefficient for large-scale Transformers compared to word-level tokenization, yet naively picking a fixed “dictionary” (e.g. Merriam-Webster) of full words runs the risk of encountering unseen words or misspellings at inference time. Instead, the typical approach is subword-level tokenization, which “covers” the space of possible inputs while retaining the efficiency gains of a larger token pool, using algorithms like Byte-Pair Encoding (BPE) to select the appropriate set of tokens. If you’ve seen Huffman coding in an introductory algorithms class, it’s a somewhat useful analogy for BPE, although the setup is notably different: we don’t know the set of “tokens” in advance, and BPE instead builds it greedily by repeatedly merging the most frequent adjacent pair of symbols. I’d recommend watching Andrej Karpathy’s video on tokenization and checking out this tokenization guide from Masato Hagiwara.
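To make the merging idea concrete, here’s a toy sketch of BPE training (not a production tokenizer like tiktoken or SentencePiece): it learns merge rules from a small word-frequency table by repeatedly merging the most frequent adjacent symbol pair. The word counts and merge budget below are made up purely for illustration.

```python
from collections import Counter

def bpe_merges(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word split into characters, weighted by its frequency.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the (weighted) corpus.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Re-segment every word using the new merge rule.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Illustrative corpus: merge rules like ('e', 'r') or ('l', 'o') emerge first.
print(bpe_merges({"lower": 5, "lowest": 3, "newer": 6}, num_merges=4))
```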
Positional Encoding
As we saw in the previous section, Transformers don’t natively have the same notion of adjacency or position within a context window (in contrast to RNNs), so position must instead be represented with some kind of vector encoding. While this could be done naively with something like one-hot encoding, that approach is impractical for context-scaling and suboptimal for learnability, as it throws away notions of ordinality. Originally this was done with sinusoidal positional encodings, which may feel reminiscent of Fourier features if you’re familiar with them; the most popular approach nowadays is likely Rotary Positional Encoding (RoPE), which tends to be more stable and faster to learn during training.
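As a concrete reference point, here’s a minimal NumPy sketch of the original sinusoidal scheme from “Attention Is All You Need”; the sequence length and model dimension below are arbitrary. Note that RoPE works differently, rotating query/key vectors inside attention rather than adding a vector to the embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal positional encodings, added to token embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angle_rates = 1.0 / (10000 ** (dims / d_model))  # geometric range of frequencies
    angles = positions * angle_rates                 # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims get sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)  # (128, 512)
```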
Key Resources for Positional Encoding
- Blog post: Understanding Positional Embeddings by Harrison Pim on intuition for positional encodings
- Blog post: A Gentle Introduction to Positional Encoding by Mehreen Saeed on the original Transformer positional encodings
- Blog post: Rotary Embeddings on RoPE from Eleuther AI
- Animated video: Understanding Positional Encoding from DeepLearning Hero
Pretraining Recipes
Once you’ve committed to pretraining an LLM of a certain general size on a particular corpus of data (e.g. Common Crawl, FineWeb), there are still a number of choices to make before you’re ready to go:
- Attention mechanisms (multi-head, multi-query, grouped-query)
- Activations (ReLU, GeLU, SwiGLU)
- Optimizers, learning rates, and schedulers (AdamW, warmup, cosine decay)
- Dropout?
- Hyperparameter choices and search strategies
- Batching, parallelization strategies, gradient accumulation
- How long to train for, how often to repeat data
- …and many other axes of variation
As far as I can tell, there's not a one-size-fits-all rule book for how to go about this, but the resources below provide valuable insights from those who have navigated these challenges.
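To make that decision surface concrete, here’s a hypothetical config object sketching the axes above; the field names and defaults are illustrative placeholders, not a recommended recipe.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    # Architecture choices
    attention: str = "grouped-query"   # "multi-head" | "multi-query" | "grouped-query"
    activation: str = "swiglu"         # "relu" | "gelu" | "swiglu"
    dropout: float = 0.0               # often disabled for single-epoch pretraining runs
    # Optimization choices
    optimizer: str = "adamw"
    peak_lr: float = 3e-4
    warmup_steps: int = 2000
    lr_schedule: str = "cosine"
    # Data and batching choices
    global_batch_tokens: int = 4_000_000
    grad_accum_steps: int = 8
    total_training_tokens: int = 1_000_000_000_000
    data_epochs: float = 1.0           # how often to repeat the corpus
```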
Essential Pretraining Resources
- Blog post: A Recipe for Training Neural Networks by Andrej Karpathy - While it predates the LLM era, this is a great starting point for framing many problems relevant throughout deep learning
- Guide: The Novice's LLM Training Guide by Alpin Dale, discussing hyperparameter choices in practice, as well as the finetuning techniques we'll see in future sections
- Blog post: How to train your own Large Language Models from Replit has some nice discussions on data pipelines and evaluations for training
- Article: Navigating the Attention Landscape: MHA, MQA, and GQA Decoded by Shobhit Agarwal for understanding attention mechanism tradeoffs
- Blog post: The Evolution of the Modern Transformer from Deci AI for discussion of "popular defaults"
- Chapter: Learning Rate Scheduling from the d2l.ai book (Chapter 12.11)
- Blog post: Response to NYT from Eleuther AI on controversy surrounding reporting of "best practices"
Distributed Training and FSDP
There are a number of additional challenges associated with training models which are too large to fit on individual GPUs (or even multi-GPU machines), typically necessitating the use of distributed training protocols like Fully Sharded Data Parallelism (FSDP), in which a model’s parameters, gradients, and optimizer states are sharded across devices (and machines) during training. It’s also worth understanding its precursor, Distributed Data Parallel (DDP), which is covered in the first post linked below.
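For a sense of what this looks like in practice, here’s a minimal sketch of wrapping a model with PyTorch’s FSDP, assuming a launch via `torchrun` (which sets up the rank/world-size environment) and using a single `TransformerEncoderLayer` as a stand-in for a real LLM; real setups add sharding policies, mixed precision, activation checkpointing, and careful checkpoint saving/loading.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<ngpus> train.py`
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in model; a real run would wrap a full Transformer stack.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)

# FSDP shards parameters (and hence gradients and optimizer state) across ranks,
# gathering full parameters layer-by-layer only when needed for forward/backward.
model = FSDP(model, device_id=local_rank)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# ...then a standard training loop: forward, loss.backward(), optimizer.step()
```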
Resources on Distributed Training
- Blog post: FSDP from Meta (who pioneered the method)
- Blog post: Understanding FSDP by Bar Rozenman, featuring many excellent visualizations
- Report: Training Great LLMs Entirely From Ground Zero in the Wilderness from Yi Tay on the challenges of pretraining a model in a startup environment
- Technical blog: FSDP QLora Deep Dive from Answer.AI on combining FSDP with parameter-efficient finetuning techniques for use on consumer GPUs
Scaling Laws
It’s useful to know about scaling laws as a meta-topic which comes up a lot in discussions of LLMs (most prominently in reference to the “Chinchilla” paper), more so than any particular empirical finding or technique. In short, model loss tends to follow fairly predictable curves as you scale up the model size, dataset size, and compute used for training, which lets you forecast performance and choose near-optimal allocations (e.g. how large a model to train on how many tokens) without running expensive grid searches.
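As a back-of-the-envelope illustration, here’s one way to split a FLOPs budget between parameter count N and training tokens D, assuming the common approximations that training compute C ≈ 6ND and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter; both numbers vary by setup, so treat this as a sketch rather than a recipe.

```python
def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Roughly split a training FLOPs budget between model size N and tokens D."""
    # With C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.8e23 FLOPs budget lands near 70B params / 1.4T tokens,
# roughly in line with the Chinchilla setup.
N, D = compute_optimal_allocation(5.8e23)
print(f"params ≈ {N:.2e}, tokens ≈ {D:.2e}")
```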
Resources on Scaling Laws
- Blog overview: Chinchilla Scaling Laws for Large Language Models by Rania Hossam
- Discussion: New Scaling Laws for LLMs on LessWrong
- Post: Chinchilla's Wild Implications on LessWrong
- Analysis: Chinchilla Scaling: A Replication Attempt (potential issues with Chinchilla findings)
- Blog post: Scaling Laws and Emergent Properties by Clément Thiriet
- Video lecture: Scaling Language Models from Stanford CS224n
Mixture-of-Experts
While many of the prominent LLMs used today (such as Llama 3) are “dense” models (i.e. without enforced sparsification), Mixture-of-Experts (MoE) architectures are becoming increasingly popular for navigating tradeoffs between “knowledge” and efficiency, used perhaps most notably in the open-weights world by Mistral AI’s “Mixtral” models (8x7B and 8x22B), and rumored to be used for GPT-4. In MoE models, only a fraction of the parameters are “active” for each step of inference, with trained router modules selecting the parallel “experts” to use at each layer. This allows models to grow in size (and perhaps “knowledge” or “intelligence”) while remaining more efficient to train and serve than a comparably-sized dense model.
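Here’s a minimal sketch of token-level top-k routing to show the mechanics (not Mixtral’s actual implementation, which is heavily optimized); the `TopKMoE` class name and its dimensions are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sketch of a top-k Mixture-of-Experts feed-forward layer."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- tokens flattened across batch and sequence
        gate_logits = self.router(x)                         # (num_tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

layer = TopKMoE(d_model=512, d_ff=2048)
tokens = torch.randn(4 * 16, 512)   # 4 sequences of 16 tokens, flattened
print(layer(tokens).shape)          # torch.Size([64, 512])
```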
Resources on Mixture-of-Experts
- Blog post: Mixture of Experts Explained from Hugging Face for a technical overview
- Video: Mixture of Experts Visualized from Trelis Research for a visualized explainer
Key Takeaways
- Subword tokenization strikes a balance between efficiency and handling unknown words
- Positional encoding schemes like RoPE are crucial for Transformers to understand sequence order
- LLM pretraining involves numerous architecture and optimization decisions
- Distributed training techniques like FSDP enable training of models too large for individual GPUs
- Scaling laws provide guidance on optimal allocation of compute, data, and model size
- Mixture-of-Experts models offer parameter efficiency by activating only a subset of experts for each token during inference