The Grand AI Handbook

Large Language Models

Survey the architecture and capabilities of large-scale language models.

This section surveys large language models (LLMs), focusing on their architectures, capabilities, and innovations such as model parallelism, attention scaling, and in-context learning. We explore prominent models including GPT-3, T5, Codex, LLaMA, Llama-2, Mixtral, and PaLM, highlighting their design principles and applications. LLMs have transformed natural language processing with unprecedented scale and versatility, enabling tasks from text generation to code synthesis. For broader context, the paper A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT traces their evolution.

Large Language Models Overview

Large language models are Transformer-based architectures pretrained on massive datasets, typically containing billions of parameters. They excel at understanding and generating human-like text, leveraging self-supervised objectives like next-token prediction or masked language modeling. Key innovations include model parallelism for training efficiency, attention scaling for handling long contexts, and in-context learning for task adaptation without fine-tuning. These advancements enable LLMs to perform diverse tasks, from question answering to code generation.
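To make the pretraining objective concrete, the following is a minimal sketch of next-token prediction, the self-supervised loss used by decoder-only LLMs. The tiny vocabulary, toy embedding size, and random weights are illustrative assumptions standing in for a full Transformer, not any particular model's configuration.

```python
import torch
import torch.nn.functional as F

# Toy setup: a "model" is anything that maps token ids to next-token logits.
vocab_size, d_model, seq_len = 100, 32, 8
embedding = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # pretend this is real text
hidden = embedding(tokens)                           # stand-in for the Transformer layers
logits = lm_head(hidden)                             # (1, seq_len, vocab_size)

# Next-token prediction: position t predicts token t+1, so shift targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),          # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                       # targets are the following tokens
)
print(float(loss))
```

Masked language modeling (as in BERT-style encoders) instead hides a subset of tokens and predicts them from bidirectional context; the cross-entropy machinery is the same.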

Key Concepts

Model Parallelism

Model parallelism distributes a model’s parameters across multiple devices, enabling training of billion-parameter LLMs. Techniques include:

  • Pipeline Parallelism: Assigns consecutive layers to different devices and streams micro-batches through the resulting stages so that devices compute concurrently.
  • Tensor Parallelism: Splits individual weight matrices (e.g., attention and feed-forward projections) across devices, parallelizing computation within each layer.

Model parallelism, used in GPT-3 and PaLM, reduces memory bottlenecks, as detailed in Efficient Large-Scale Language Model Training.
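The core idea of tensor parallelism can be simulated in a few lines: a linear layer's weight matrix is split column-wise into shards that could live on different devices, and their partial outputs are concatenated. This is a minimal single-process sketch; real systems (e.g., Megatron-style training) add communication collectives and careful device placement, and the two-way split and shapes here are illustrative assumptions.

```python
import torch

d_in, d_out = 16, 8
x = torch.randn(4, d_in)                     # a batch of activations
full_weight = torch.randn(d_in, d_out)

# Column-wise split: each "device" holds half of the output columns.
w_dev0 = full_weight[:, : d_out // 2]
w_dev1 = full_weight[:, d_out // 2 :]

# Each device computes its partial output independently...
y_dev0 = x @ w_dev0
y_dev1 = x @ w_dev1

# ...and an all-gather (here just a concatenation) reassembles the full result.
y_parallel = torch.cat([y_dev0, y_dev1], dim=-1)
assert torch.allclose(y_parallel, x @ full_weight, atol=1e-5)
```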

Attention Scaling

Attention scaling addresses the quadratic complexity of self-attention (O(n²) for sequence length n). Methods like sparse attention (e.g., Longformer) and efficient Transformer variants (e.g., Performer) reduce complexity to O(n) or O(n log n), enabling LLMs to process longer contexts, which is critical for tasks like document summarization. Efficient Transformers: A Survey reviews these approaches.
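Below is a minimal sketch of sliding-window (local) attention, one of the sparse patterns used by models such as Longformer: each token attends only to neighbors within a fixed window w, so cost grows as O(n·w) rather than O(n²). The window size and tensor shapes are illustrative assumptions, and for clarity the sketch builds a dense mask (still O(n²) memory); production implementations use banded or blocked kernels instead.

```python
import torch
import torch.nn.functional as F

n, d, window = 16, 32, 4                      # sequence length, head dim, local window
q, k, v = (torch.randn(n, d) for _ in range(3))

scores = (q @ k.T) / d**0.5                   # full score matrix, only for illustration

# Sliding-window mask: token i may attend to tokens j with |i - j| <= window.
idx = torch.arange(n)
mask = (idx[:, None] - idx[None, :]).abs() <= window
scores = scores.masked_fill(~mask, float("-inf"))

out = F.softmax(scores, dim=-1) @ v           # (n, d) local-attention output
print(out.shape)
```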

In-Context Learning

In-context learning allows LLMs to perform tasks by conditioning on a few examples provided in the input prompt, without updating weights. Introduced in Language Models are Few-Shot Learners (GPT-3), it leverages the model’s pretrained knowledge to adapt dynamically, excelling in zero-shot and few-shot settings like question answering and translation.
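A minimal sketch of how a few-shot prompt is assembled for in-context learning: the "training data" lives entirely in the context window, and the model answers by continuing the text. The sentiment-classification task, example reviews, and template below are illustrative assumptions; the resulting string could be sent to any completion-style LLM.

```python
# Few-shot prompt: demonstrations are placed directly in the input text.
examples = [
    ("The movie was fantastic.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]
query = "The plot dragged, but the acting was superb."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"      # the model continues with its prediction

print(prompt)
```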

Prominent Large Language Models

GPT-3

Introduced by OpenAI in Language Models are Few-Shot Learners (2020), GPT-3 is a 175B-parameter autoregressive model trained on a mixture of filtered Common Crawl, WebText2, books, and Wikipedia data. Using next-token prediction, it excels at in-context learning, performing tasks like translation and question answering with few or no examples. Its scale enables emergent abilities, such as rudimentary reasoning, but it requires significant compute resources.

Capabilities: Text generation, dialogue, zero-shot learning, and task generalization.
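At inference time, autoregressive models like GPT-3 generate text one token at a time, repeatedly conditioning on everything produced so far. The sketch below shows that loop with greedy decoding; the toy_logits function is a purely illustrative stand-in for a real trained model.

```python
import torch

vocab_size = 50

def toy_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a trained Transformer: returns random next-token logits."""
    return torch.randn(vocab_size)

tokens = torch.tensor([1, 7, 42])                 # pretend these are the prompt's token ids
for _ in range(10):                               # generate 10 new tokens
    next_logits = toy_logits(tokens)              # condition on the full context so far
    next_token = int(torch.argmax(next_logits))   # greedy decoding; sampling is also common
    tokens = torch.cat([tokens, torch.tensor([next_token])])

print(tokens.tolist())
```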

T5 (Text-to-Text Transfer Transformer)

Developed by Google in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019), T5 reframes all NLP tasks as text-to-text problems, using an encoder-decoder architecture. Pretrained on the Colossal Clean Crawled Corpus (C4) with a span-corruption objective, T5 (11B parameters in its largest variant) excels in tasks like summarization, translation, and question answering.

Capabilities: Unified task handling, fine-tuning flexibility, and strong benchmark performance (e.g., SuperGLUE).
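The text-to-text interface can be seen directly in how T5 is prompted: a task prefix such as "summarize:" selects the task, and both input and output are plain strings. The sketch below assumes the Hugging Face transformers library (plus sentencepiece) and the public t5-small checkpoint are available; the generation settings are illustrative.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")        # smallest public checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is "text in, text out"; the prefix tells the model which task to perform.
text = "summarize: The Transformer architecture replaced recurrence with attention, ..."
inputs = tokenizer(text, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```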

Codex

Codex, a descendant of GPT-3 by OpenAI (2021), is fine-tuned for code generation, powering tools like GitHub Copilot. Trained on public code repositories, it generates syntactically correct code across languages like Python and JavaScript, supporting tasks like autocompletion and bug fixing.

Capabilities: Code synthesis, debugging, and natural language-to-code translation.

LLaMA

LLaMA, developed by Meta AI in LLaMA: Open and Efficient Foundation Language Models (2023), is a family of models (7B-65B parameters) optimized for research. Trained exclusively on publicly available data (including CommonCrawl, C4, GitHub, Wikipedia, and arXiv), LLaMA combines efficient attention implementations with model parallelism, and its 13B variant outperforms the much larger GPT-3 on most benchmarks, including MMLU, with far fewer parameters.

Capabilities: Research-grade NLP, fine-tuning for specific tasks, and efficiency.

Llama-2

Llama-2, introduced in Llama 2: Open Foundation and Fine-Tuned Chat Models (2023), extends LLaMA with models (7B-70B parameters) fine-tuned for dialogue and instruction following. It incorporates reinforcement learning from human feedback (RLHF), improving safety and conversational ability.

Capabilities: Chat, instruction following, and open-source accessibility for research.

Mixtral of Experts

Mixtral, developed by Mistral AI in Mixtral of Experts (2023), uses a sparse mixture-of-experts (MoE) architecture: in each layer, a router sends every token to two of eight expert feed-forward networks. With roughly 47B total parameters but only about 13B active per token, Mixtral matches or exceeds the performance of dense 70B-parameter models such as Llama 2 70B, leveraging sparse activation for efficiency.

Capabilities: High performance with low inference cost, excelling in multilingual tasks and reasoning.
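The routing idea behind an MoE layer like Mixtral's can be sketched in a few lines: a small gating network scores the experts, only the top-2 run for each token, and their outputs are combined with renormalized gate weights. The expert count, dimensions, and plain linear experts below are illustrative assumptions, not Mixtral's actual configuration, and the per-token loop is for clarity rather than efficiency.

```python
import torch
import torch.nn.functional as F

d_model, n_experts, top_k = 32, 8, 2
tokens = torch.randn(4, d_model)                       # a batch of token representations

experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)
gate = torch.nn.Linear(d_model, n_experts)             # router scores every expert

scores = gate(tokens)                                  # (4, n_experts)
top_scores, top_idx = scores.topk(top_k, dim=-1)       # keep only the best 2 per token
weights = F.softmax(top_scores, dim=-1)                # renormalize over the chosen experts

output = torch.zeros_like(tokens)
for i, token in enumerate(tokens):                     # real kernels batch this dispatch
    for w, e in zip(weights[i], top_idx[i]):
        output[i] += w * experts[int(e)](token)        # only 2 of 8 experts run per token

print(output.shape)
```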

PaLM

PaLM, introduced by Google in PaLM: Scaling Language Modeling with Pathways (2022), is a 540B-parameter model trained on a diverse corpus (Web, books, code). Using Pathways, a distributed training system, PaLM leverages model parallelism and achieves state-of-the-art results on reasoning tasks like BIG-bench.

Capabilities: Advanced reasoning, multilingual processing, and code generation.

Q&A Session: Capabilities and Insights

To illustrate LLM capabilities, consider common questions:

  • What tasks can LLMs perform? LLMs handle text generation (GPT-3), code synthesis (Codex), reasoning (PaLM), and dialogue (Llama-2). In-context learning enables task flexibility without retraining.
  • How does scale impact performance? Larger models (e.g., PaLM) show emergent abilities like reasoning, but efficient designs (e.g., Mixtral) achieve similar results with less compute.
  • What are the limitations? LLMs can produce biased or incorrect outputs, require significant resources, and struggle with out-of-distribution tasks, necessitating safety measures like RLHF (Llama-2).

These insights align with findings in Multimodal Foundation Models: From Specialists to General-Purpose Assistants, emphasizing LLMs’ transformative potential and challenges.

Impact on Foundation Models

LLMs have redefined foundation models by:

  • Scaling Capabilities: GPT-3 and PaLM demonstrate that larger models unlock emergent abilities like reasoning and in-context learning.
  • Enabling Generalization: T5’s text-to-text framework and Llama-2’s instruction tuning support diverse tasks with minimal adaptation.
  • Driving Efficiency: Mixtral’s MoE and LLaMA’s optimizations reduce compute costs, broadening access.
  • Powering Applications: Codex and PaLM enable real-world tools like Copilot and Google’s AI services.

These advancements, detailed in A Comprehensive Survey on Pretrained Foundation Models, underscore LLMs’ role in advancing AI research and deployment.

Key Takeaways

  • LLMs leverage large-scale pretraining and Transformer architectures for versatile NLP
  • GPT-3 pioneered in-context learning, enabling zero-shot and few-shot task adaptation
  • T5 unifies tasks as text-to-text, while Codex specializes in code generation
  • LLaMA and Llama-2 optimize efficiency, with Llama-2 excelling in dialogue via RLHF
  • Mixtral’s MoE architecture achieves high performance with sparse computation
  • PaLM scales to 540B parameters, excelling in reasoning and multilingual tasks
  • Model parallelism and attention scaling enable training and inference at scale
  • LLMs drive foundation model innovation, powering diverse applications