Pretraining and Scaling
Techniques for training large-scale language models.
Chapter 21: Pretraining Strategies
    Masked LM, next-token prediction (toy objective sketch below)
    Denoising, contrastive objectives
    [CLM, MLM, SimCLR, span corruption]
    References

Chapter 22: Model Scaling
    Scaling laws: parameters, data, compute (compute-optimal sizing sketch below)
    Mixture-of-experts, sparse transformers (top-k gating sketch below)
    [Chinchilla scaling, MoE architectures, efficiency trade-offs]
    References

Chapter 23: Training Infrastructure
    Large-scale datasets, web crawling
    Distributed training, GPU clusters (data-sharding sketch below)
    [Data pipelines, TPUs, sharding techniques]
    References
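
For Chapter 21's objectives, a minimal sketch contrasting causal (next-token) and masked-LM targets on toy token IDs. MASK_ID, the 15% mask rate, and the -100 ignore label are illustrative conventions, not values fixed by the chapter:

```python
import random

# Illustrative only: toy token IDs; real pipelines use tokenizer output.
MASK_ID = 0
tokens = [17, 42, 7, 93, 5, 61, 28, 11]

# Causal LM (next-token prediction): inputs are tokens[:-1], targets tokens[1:].
clm_inputs, clm_targets = tokens[:-1], tokens[1:]

# Masked LM (BERT-style): replace ~15% of positions with MASK_ID and
# supervise the model only at those masked positions.
def mask_tokens(seq, mask_prob=0.15):
    inputs, targets = [], []
    for tok in seq:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)
            targets.append(tok)   # predict the original token here
        else:
            inputs.append(tok)
            targets.append(-100)  # conventional "ignore" label
    return inputs, targets

mlm_inputs, mlm_targets = mask_tokens(tokens)
print(clm_inputs, clm_targets)
print(mlm_inputs, mlm_targets)
```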
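
For Chapter 22's scaling laws, a back-of-the-envelope sketch of Chinchilla-style compute-optimal sizing. It assumes the common approximations C ≈ 6·N·D training FLOPs and roughly 20 tokens per parameter at the optimum (Hoffmann et al., 2022); `compute_optimal` is a hypothetical helper name:

```python
# Chinchilla-style compute-optimal sizing sketch (assumed approximations:
# training FLOPs C = 6 * N * D, and D = tokens_per_param * N at the optimum).
def compute_optimal(C_flops, tokens_per_param=20.0):
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    N = (C_flops / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = compute_optimal(1e23)  # e.g. a 1e23-FLOP budget
print(f"params = {N:.2e}, tokens = {D:.2e}")  # ~2.9e10 params, ~5.8e11 tokens
```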
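
Also for Chapter 22, a toy top-k router in the spirit of mixture-of-experts layers. The expert count, k = 2, and softmax-over-top-k renormalization are assumptions for illustration, not a specific published architecture:

```python
import math
import random

# Toy top-k gating sketch for a mixture-of-experts layer.
def top_k_gate(logits, k=2):
    # Softmax over the k largest router logits; all other experts get weight 0.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

router_logits = [random.gauss(0, 1) for _ in range(8)]  # 8 experts (assumed)
print(top_k_gate(router_logits))  # token routed to 2 experts with mixing weights
```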
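
For Chapter 23's distributed training, a minimal sketch of stride-based data sharding, in which each worker reads a disjoint slice of the corpus. `shard_indices` and the hard-coded world size are hypothetical; real jobs obtain rank and world size from the launcher:

```python
# Each of world_size workers takes every world_size-th example, offset by
# its rank, so shards are disjoint and together cover the whole dataset.
def shard_indices(num_examples, rank, world_size):
    return range(rank, num_examples, world_size)

world_size = 4  # assumed for illustration
for rank in range(world_size):
    print(rank, list(shard_indices(10, rank, world_size)))
```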