The Grand AI Handbook

LLM Architecture and Training

From decoder-only Transformers to the implementation choices behind frontier LLMs.

In this section, we'll explore a number of concepts which will take us from the decoder-only Transformer architecture towards understanding the implementation choices and tradeoffs behind many of today's frontier LLMs. If you first want a bird's-eye view of the topics in this section and some of the following ones, the post ["Understanding Large Language Models"](https://magazine.sebastianraschka.com/p/understanding-large-language-models) by Sebastian Raschka is a nice summary of what the LLM landscape looks like (at least up through mid-2023).

Tokenization

Character-level tokenization (like in several of the Karpathy videos) tends to be inefficient for large-scale Transformers vs. word-level tokenization, yet naively picking a fixed “dictionary” (e.g. Merriam-Webster) of full words runs the risk of encountering unseen words or misspellings at inference time. Instead, the typical approach is to use subword-level tokenization to “cover” the space of possible inputs, while maintaining the efficiency gains which come from a larger token pool, using algorithms like Byte-Pair Encoding (BPE) to select the appropriate set of tokens. If you’ve ever seen Huffman coding in an introductory algorithms class I think it’s a somewhat useful analogy for BPE here, although the input-output format is notably different, as we don’t know the set of “tokens” in advance. I’d recommend watching Andrej Karpathy’s video on tokenization and checking out this tokenization guide from Masato Hagiwara.
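
To make the algorithm concrete, here is a minimal sketch of the core BPE training loop (count adjacent pairs, merge the most frequent one, repeat) on a toy corpus. Real tokenizers like GPT-2's operate on raw bytes and add many practical details, so treat this as an illustration rather than a production implementation; the corpus and merge count are arbitrary placeholders.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, with each word pre-split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(10):  # learn 10 merges (vocabulary size is a design choice)
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # e.g. frequent character pairs like ('e', 's'), ('es', 't'), ...
```

Note how the learned vocabulary depends entirely on the corpus statistics, which is exactly why the set of "tokens" isn't known in advance.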

Positional Encoding

As we saw in the previous section, Transformers don’t natively have a notion of adjacency or position within a context window (in contrast to RNNs), so position must instead be represented with some kind of vector encoding. While this could be done naively with something like one-hot encoding, that approach is impractical for scaling context length and suboptimal for learnability, as it throws away notions of ordinality. Originally, this was done with sinusoidal positional encodings, which may feel reminiscent of Fourier features if you’re familiar with them; the most popular approach of this type nowadays is likely Rotary Position Embedding (RoPE), which tends to be more stable and faster to learn during training.
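
As a quick illustration, here is a minimal NumPy sketch of the original sinusoidal encoding (added to token embeddings) alongside a RoPE-style rotation, which instead rotates pairs of query/key features by position-dependent angles inside attention. Shapes and the base of 10000 follow the common conventions; this is a sketch, not a drop-in implementation from any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope_rotate(x, positions, base=10000.0):
    """Rotate interleaved feature pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
q = rope_rotate(np.random.randn(128, 64), np.arange(128))
print(pe.shape, q.shape)  # (128, 64) (128, 64)
```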

Pretraining Recipes

Once you’ve committed to pretraining an LLM of a certain general size on a particular corpus of data (e.g. Common Crawl, FineWeb), there are still a number of choices to make before you’re ready to go:

  • Attention mechanisms (multi-head, multi-query, grouped-query)
  • Activations (ReLU, GeLU, SwiGLU)
  • Optimizers, learning rates, and schedulers (AdamW, warmup, cosine decay)
  • Dropout?
  • Hyperparameter choices and search strategies
  • Batching, parallelization strategies, gradient accumulation
  • How long to train for, how often to repeat data
  • …and many other axes of variation

As far as I can tell, there's not a one-size-fits-all rule book for how to go about this, but the resources below provide valuable insights from those who have navigated these challenges.
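
As one concrete (and entirely hypothetical) example of a few of these choices, the sketch below wires up AdamW with linear warmup followed by cosine decay, gradient clipping, and gradient accumulation in PyTorch. The model, data loader, and every specific number here are placeholders, not recommendations.

```python
import math
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 512)  # stand-in for a real Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000

def lr_lambda(step):
    # Linear warmup, then cosine decay down to 10% of the peak learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Toy data loader; a real run would stream tokenized text batches.
loader = [(torch.randn(8, 512), torch.randn(8, 512)) for _ in range(32)]

accum_steps = 8  # gradient accumulation: effective batch = accum_steps * micro-batch size
for step, (x, y) in enumerate(loader):
    loss = F.mse_loss(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```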

Distributed Training and FSDP

There are a number of additional challenges associated with training models which are too large to fit on individual GPUs (or even multi-GPU machines), typically necessitating the use of distributed training protocols like Fully Sharded Data Parallelism (FSDP), in which model parameters, gradients, and optimizer states are sharded across devices during training. It’s probably worth also understanding its precursor, Distributed Data Parallel (DDP), which is covered in the first post linked below.
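
A minimal sketch of wrapping a model with PyTorch's FSDP looks roughly like the following, assuming a `torchrun` launch (which sets the rank environment variables). Real setups add auto-wrap policies, mixed precision, checkpointing, and more, so consult the PyTorch FSDP docs for the details; the model and shapes here are toy placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(  # stand-in for a real Transformer
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only when a wrapped module runs forward/backward.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```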

Scaling Laws

It’s useful to know about scaling laws as a meta-topic which comes up a lot in discussions of LLMs (most prominently in reference to the “Chinchilla” paper), more so than any particular empirical finding or technique. In short, the loss of a language model turns out to be fairly predictable as a function of the model size, dataset size, and compute used for training. This then enables calibrating choices like model size, training token count, and other hyperparameters in advance, without needing to run expensive grid searches at full scale.
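
For reference, the Chinchilla paper fits a parametric loss of roughly the form sketched below, where N is the parameter count and D the number of training tokens. The constants are approximately the published fits (quoted from memory), so treat the specific numbers as illustrative rather than authoritative.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta (Hoffmann et al., 2022).
    Constants are approximate published fits; treat them as illustrative."""
    return E + A / N**alpha + B / D**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (roughly Chinchilla itself)
print(chinchilla_loss(N=70e9, D=1.4e12))
```

A commonly cited corollary of these fits is the rough rule of thumb of training on around 20 tokens per parameter at compute-optimal settings.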

Mixture-of-Experts

While many of the prominent LLMs used today (such as Llama3) are “dense” models (i.e. without enforced sparsification), Mixture-of-Experts (MoE) architectures are becoming increasingly popular for navigating tradeoffs between “knowledge” and efficiency, used perhaps most notably in the open-weights world by Mistral AI’s “Mixtral” models (8x7B and 8x22B), and rumored to be used for GPT-4. In MoE models, only a fraction of the parameters are “active” for each step of inference, with trained router modules selecting the parallel “experts” to use at each layer. This allows models to grow in size (and perhaps “knowledge” or “intelligence”) while remaining efficient for training and inference compared to a comparably-sized dense model.
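
To make the routing idea concrete, here is a heavily simplified sketch of a top-k token router over a set of expert MLPs, loosely in the spirit of Mixtral's top-2 routing. Real implementations batch tokens per expert, add load-balancing losses, and handle expert capacity; the dimensions and expert count below are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                             # (n_tokens, n_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token: these are the "active" parameters.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = MoELayer()
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512])
```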

Key Takeaways

  • Subword tokenization strikes a balance between efficiency and handling unknown words
  • Positional encoding schemes like RoPE are crucial for Transformers to understand sequence order
  • LLM pretraining involves numerous architecture and optimization decisions
  • Distributed training techniques like FSDP enable training of models too large for individual GPUs
  • Scaling laws provide guidance on optimal allocation of compute, data, and model size
  • Mixture-of-Experts models offer parameter efficiency by activating only relevant parameters during inference