Language Model Pretraining
Examine techniques for pretraining large language models on vast datasets.
About Pretraining
Pretraining involves training a model on a large, general dataset using unsupervised or self-supervised objectives to learn broad patterns, such as linguistic structures or semantic relationships, before fine-tuning on specific tasks. This approach, foundational to models like BERT and GPT, leverages vast datasets (e.g., Wikipedia, Common Crawl) to capture general knowledge, making models adaptable to diverse downstream applications.
Why We Need Pretraining:
- Data Efficiency: Pretraining reduces the need for labeled data in downstream tasks by learning general representations from unlabeled corpora.
- Generalization: Models pretrained on diverse datasets generalize better across tasks, from text classification to question answering.
- Scalability: Pretraining enables large models to learn complex patterns, with performance improving as model size and data scale together, as described by the Chinchilla scaling laws (a quick worked example follows this list).
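To make the Chinchilla guideline concrete, here is a back-of-the-envelope sketch using the common rules of thumb of roughly 20 training tokens per parameter and total training compute of approximately 6·N·D FLOPs; the 7B-parameter model size is an arbitrary illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope Chinchilla-style estimate (rule of thumb: ~20 tokens per
# parameter, training compute C ~= 6 * N * D FLOPs). Numbers are illustrative.
n_params = 7e9                             # assumed model size: 7B parameters
tokens_per_param = 20                      # Chinchilla rule of thumb
d_tokens = n_params * tokens_per_param     # compute-optimal token budget
c_flops = 6 * n_params * d_tokens          # approximate training compute

print(f"Token budget:     {d_tokens:.2e} tokens")   # ~1.4e11 (140B tokens)
print(f"Training compute: {c_flops:.2e} FLOPs")     # ~5.9e21 FLOPs
```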
Does Pretraining Actually Help? Empirical evidence says yes. BERT's pretrained representations lifted the GLUE benchmark average by roughly 7 points absolute over the previous state of the art, and GPT's pretraining enabled zero-shot and few-shot learning, reducing task-specific training costs. The benefits are especially pronounced in low-data regimes, where fine-tuning a pretrained model outperforms training a task-specific model from scratch.
Key Resources for Pretraining
- Paper: A Comprehensive Survey on Pretrained Foundation Models by Zhao et al. (2023)
- Paper: Training Compute-Optimal Large Language Models by Hoffmann et al. (2022) – Chinchilla scaling laws
- Blog post: Exploring Pretraining in NLP by Sebastian Ruder
- Video: Pretraining Language Models from DeepLearning.AI
Introducing Pretrained Models
Pretrained models like ELMo, BERT, and GPT revolutionized NLP by introducing self-supervised learning objectives that exploit unlabeled text. These models differ in architecture, objectives, and motivations, addressing specific limitations in prior approaches.
Motivation for Developing Different Models:
- Contextual Representations: Earlier models like word2vec provided static embeddings, lacking context sensitivity. ELMo introduced contextualized embeddings using bidirectional LSTMs.
- Bidirectional Context: BERT addressed the unidirectional bias of models like GPT-1, enabling richer contextual understanding via masked language modeling.
- Generative Capabilities: GPT focused on autoregressive generation, prioritizing tasks like text completion and dialogue.
- Scalability and Efficiency: Models like RoBERTa optimized pretraining recipes to improve performance without architectural changes.
ELMo (Embeddings from Language Models)
Introduced in Deep Contextualized Word Representations by Peters et al. (2018), ELMo uses bidirectional LSTMs to generate contextualized word embeddings. Pretrained on a large corpus with a language modeling objective, ELMo captures word meaning based on surrounding context, improving tasks like sentiment analysis and named entity recognition.
Detailed Method:
- Architecture: A two-layer bidirectional LSTM language model; at each layer, a forward LSTM reads the sequence left to right and a backward LSTM reads it right to left.
- Objective: Predict the next word (forward) and previous word (backward), maximizing log-likelihood over a corpus.
- Usage: A task-learned weighted combination of ELMo's layer representations is concatenated with task-specific inputs, enhancing downstream models without fine-tuning the biLM (see the sketch below).
Impact: ELMo improved performance on benchmarks like SQuAD by 4-5%, demonstrating the power of contextual embeddings over static word vectors.
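To make the "Usage" step concrete, below is a minimal sketch of the ELMo-style scalar mix, the task-learned weighted combination of biLM layer outputs described by Peters et al.; the layer outputs here are random placeholders standing in for the actual pretrained biLSTM states, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style weighted combination of layer representations:
    elmo = gamma * sum_j softmax(s)_j * h_j."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j, learned per task
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of [batch, seq_len, dim] tensors, one per biLM layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed

# Toy usage with random "biLM" outputs (token embeddings + 2 LSTM layers).
layers = [torch.randn(2, 5, 1024) for _ in range(3)]
elmo_embedding = ScalarMix(num_layers=3)(layers)  # shape [2, 5, 1024]
```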
BERT (Bidirectional Encoder Representations from Transformers)
Proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin et al. (2018), BERT uses a Transformer encoder pretrained on two self-supervised tasks:
- Masked Language Modeling (MLM): Randomly masks 15% of tokens, predicting them based on bidirectional context.
- Next Sentence Prediction (NSP): Predicts whether two sentences are consecutive, aiding tasks like question answering.
Detailed Method:
- Architecture: Multi-layer Transformer encoder (12 layers in BERT-base, 24 in BERT-large).
- Corpus: BooksCorpus and English Wikipedia (~3.3B words).
- Training: MLM and NSP objectives, optimized with AdamW over large batches.
Impact: BERT achieved state-of-the-art results on GLUE (80.5 average score for BERT-large) and SQuAD, setting the standard for the pretrain-then-fine-tune paradigm.
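As a minimal illustration of BERT's MLM objective at inference time, the sketch below queries a pretrained checkpoint for the most likely fillers of a masked position. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the example sentence is arbitrary.

```python
# Minimal MLM inference sketch, assuming the Hugging Face `transformers`
# library and the `bert-base-uncased` checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = f"Pretraining learns general {tokenizer.mask_token} from unlabeled text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # [1, seq_len, vocab_size]

# Find the masked position and decode the top-5 predictions for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```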
Key Resources for Pretrained Models
- Paper: Deep Contextualized Word Representations by Peters et al. (2018) – ELMo
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin et al. (2018)
- Blog post: The Illustrated BERT by Jay Alammar
- Blog post: Introducing ELMo by the Allen Institute for AI (AllenNLP)
Pretraining Techniques
Pretraining relies on self-supervised objectives and diverse data strategies to learn robust representations. Below are key techniques used in LLMs.
Masked Language Modeling (MLM)
MLM, popularized by BERT, involves randomly masking tokens in a sequence and training the model to predict them using bidirectional context. This objective encourages the model to learn rich semantic and syntactic representations, effective for understanding tasks. Variants include dynamic masking (RoBERTa), span masking (SpanBERT), and span corruption (T5), where contiguous token spans are masked and reconstructed.
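A minimal sketch of the BERT-style masking procedure (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) is shown below. The vocabulary size, mask id, and toy batch are placeholder assumptions, and in practice special tokens such as [CLS] and [SEP] would be excluded from masking.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style masking: select 15% of positions; of those, 80% -> [MASK],
    10% -> random token, 10% unchanged. Returns (masked inputs, labels)."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob   # positions to predict
    labels[~selected] = -100                             # ignore index for the loss

    masked = input_ids.clone()
    replace = selected & (torch.rand(input_ids.shape) < 0.8)               # 80% -> [MASK]
    randomize = selected & ~replace & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random
    masked[replace] = mask_id
    masked[randomize] = torch.randint(vocab_size, (int(randomize.sum()),))
    return masked, labels

# Toy usage with a made-up vocabulary (token ids 0..98, mask id 99).
ids = torch.randint(0, 99, (2, 16))
masked, labels = mask_tokens(ids, mask_id=99, vocab_size=99)
```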
Next-Token Prediction
Next-token prediction, used in autoregressive models like GPT, trains the model to predict the next token given prior context. This objective, rooted in traditional language modeling, excels in generative tasks like text completion and dialogue. It leverages causal attention to ensure unidirectional processing.
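The sketch below shows the next-token prediction loss itself: logits are shifted against the inputs so each position is scored on the token that follows it. The model producing the logits is omitted (random tensors stand in), and causal attention masking is assumed to happen inside that model.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: token t is predicted from positions < t, so the
    logits are shifted left against the inputs before the cross-entropy."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # the tokens that actually follow
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy usage: 2 sequences of length 8, vocabulary of 1000 (random logits).
logits = torch.randn(2, 8, 1000)
input_ids = torch.randint(0, 1000, (2, 8))
loss = causal_lm_loss(logits, input_ids)
```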
Unsupervised Corpora
Pretraining relies on large, unlabeled text corpora, such as:
- Wikipedia: High-quality, diverse text (~6M articles).
- Common Crawl: Web-scraped data (~100TB), filtered for quality (e.g., FineWeb).
- BooksCorpus: Fiction and non-fiction books (~800M words).
These corpora provide the scale needed for general-purpose representations, as discussed in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5).
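A common preprocessing step for such corpora is to concatenate tokenized documents and slice the stream into fixed-length blocks for training. Below is a minimal sketch of this packing step; the token ids, EOS id, and block size are placeholder assumptions.

```python
from itertools import chain

def pack_sequences(tokenized_docs, block_size=512, eos_id=0):
    """Concatenate tokenized documents (separated by an EOS id) and slice the
    stream into fixed-length blocks, the usual way pretraining batches are built."""
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy usage with fake token ids standing in for tokenized Wikipedia/Common Crawl text.
docs = [[5, 8, 13, 21], [34, 55, 89], [144, 233, 377, 610, 987]]
print(pack_sequences(docs, block_size=4))
# [[5, 8, 13, 21], [0, 34, 55, 89], [0, 144, 233, 377]]
```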
Data Augmentation
Data augmentation enhances pretraining by artificially expanding the corpus or introducing noise, improving robustness. Techniques include:
- Back-Translation: Translating text to another language and back to create paraphrases.
- Text Infilling: Masking contiguous spans and training the model to reconstruct them, as in BART-style infilling or T5 span corruption.
- Mixup: Combining sentences or embeddings to create synthetic examples.
Augmentation, used in models like T5, improves generalization, especially for low-resource tasks.
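As a sketch of back-translation, the function below round-trips sentences through a pivot language. translate_en_to_de and translate_de_to_en are hypothetical stand-ins for whatever translation system is available (an MT model or API); they are not calls to a specific library.

```python
from typing import Callable, List

def back_translate(sentences: List[str],
                   translate_en_to_de: Callable[[str], str],
                   translate_de_to_en: Callable[[str], str]) -> List[str]:
    """Round-trip each sentence through a pivot language to obtain paraphrases."""
    paraphrases = []
    for sentence in sentences:
        pivot = translate_en_to_de(sentence)           # English -> German
        paraphrases.append(translate_de_to_en(pivot))  # German -> English paraphrase
    return paraphrases

# Toy usage with identity "translators" just to exercise the plumbing; a real
# pipeline would plug in actual translation models and filter exact duplicates.
print(back_translate(["Pretraining helps."], lambda s: s, lambda s: s))
```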
Contrastive Objectives
Contrastive objectives, inspired by methods like SimCLR, train models to distinguish positive pairs (e.g., related sentences) from negative pairs (unrelated ones). Used in models like Sentence-BERT, contrastive pretraining enhances semantic similarity tasks, such as paraphrase detection, by learning discriminative embeddings.
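A minimal in-batch contrastive (InfoNCE-style) loss in the spirit of SimCLR and Sentence-BERT training is sketched below; the embedding dimension, batch size, and temperature are illustrative assumptions, and the random tensors stand in for encoder outputs of positive pairs.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each anchor embedding should score highest
    against its own positive, with the other positives in the batch as negatives."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature   # [batch, batch] cosine similarities
    targets = torch.arange(anchors.size(0))        # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 sentence-embedding pairs of dimension 256 (random placeholders).
loss = info_nce_loss(torch.randn(8, 256), torch.randn(8, 256))
```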
Domain Adaptation
Domain adaptation pretrains models on domain-specific corpora (e.g., biomedical texts for BioBERT, legal documents for LegalBERT) to improve performance on specialized tasks. This involves continued pretraining on targeted datasets, often using MLM or next-token prediction, as seen in BioBERT: A Pre-trained Biomedical Language Representation Model.
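A minimal sketch of such continued pretraining with an MLM objective is shown below, assuming the Hugging Face transformers and datasets libraries and bert-base-uncased as the starting checkpoint; the two placeholder sentences stand in for a real domain corpus, and the hyperparameters are illustrative, not tuned.

```python
# Continued-pretraining sketch using the Hugging Face Trainer with an MLM objective.
# The tiny in-memory "domain corpus" is a placeholder for real biomedical/legal text.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

domain_texts = ["Placeholder biomedical sentence about protein binding.",
                "Another placeholder sentence from the target domain."]
dataset = Dataset.from_dict({"text": domain_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-bert",
                           per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # continued pretraining on the domain corpus
```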
Key Resources for Pretraining Techniques
- Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Raffel et al. (2019) – T5 and corpora
- Paper: BioBERT: A Pre-trained Biomedical Language Representation Model by Lee et al. (2019)
- Blog post: Masked Language Modeling and Context on Towards Data Science
- Blog post: Domain Adaptation for Language Models by Hugging Face
- Video: Pretraining Objectives Explained from Stanford Online
Impact on Foundation Models
Pretraining has shaped foundation models by:
- Enabling Versatility: Models like BERT and T5 handle diverse tasks (e.g., classification, generation) due to general-purpose representations.
- Reducing Data Needs: Pretraining allows zero-shot or few-shot learning, as seen in GPT, minimizing labeled data requirements.
- Driving Scalability: Large corpora and objectives like MLM enable scaling to billion-parameter models, as discussed in Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
- Supporting Domain Specialization: Domain adaptation creates models like BioBERT, tailored for specific fields.
Key Resources on Pretraining's Impact on Foundation Models
- Paper: A Comprehensive Survey on Pretrained Foundation Models by Zhao et al. (2023)
- Paper: Multimodal Foundation Models: From Specialists to General-Purpose Assistants by Yin et al. (2023)
- Blog post: How Pretraining Shapes Modern AI by IBM Research
Key Takeaways
- Pretraining learns general representations from unlabeled data, enabling data-efficient downstream tasks
- ELMo introduced contextual embeddings, while BERT pioneered bidirectional MLM
- Masked language modeling and next-token prediction are core self-supervised objectives
- Unsupervised corpora like Wikipedia and Common Crawl provide scale for pretraining
- Data augmentation and contrastive objectives enhance robustness and semantic understanding
- Domain adaptation tailors models to specialized fields like biomedicine
- Pretraining underpins the versatility and scalability of foundation models