The Grand AI Handbook

Landmark Papers in LLMs

Explore the foundational research that has shaped the field of Large Language Models. This curated collection highlights the most influential papers that established key concepts, techniques, and breakthroughs in the evolution of LLMs.

I selected these papers for the breakthroughs and conceptual advances they contributed, and each entry notes the paper's historical context and significance for researchers and enthusiasts alike.

2017

June 2017
Transformers Attention

Attention Is All You Need

This paper introduced the Transformer architecture, which replaces recurrence and convolutions with attention mechanisms. Its efficiency, parallelizability, and capacity to capture long-range dependencies reshaped sequence modeling and laid the foundation for nearly all subsequent large language models.
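
As a rough illustration of the attention mechanism at the core of the architecture, the sketch below computes single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, on plain Python lists. The toy vectors and function names are assumptions made for this example; the paper's multi-head projections, masking, and positional encodings are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy input (made up): three 2-dimensional token vectors used as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every output row depends on all input rows at once, which is what lets the computation run in parallel across positions instead of step by step as in a recurrent network.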

Read Paper

2018

June 2018
Pre-training Transfer Learning

Improving Language Understanding by Generative Pre-Training

GPT-1 established the effectiveness of generative pre-training followed by discriminative fine-tuning for language tasks, demonstrating that a single pre-trained model could serve as a foundation for multiple downstream applications with minimal task-specific architecture modifications.

Read Paper
October 2018
Bidirectional Pre-training

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT introduced deep bidirectional pre-training through masked language modeling and demonstrated the effectiveness of the pre-train then fine-tune paradigm, achieving state-of-the-art results across numerous tasks and initiating a new era of transformer-based language models.

Read Paper

2019

February 2019
Unsupervised Learning Multitask Learning

Language Models are Unsupervised Multitask Learners

GPT-2 demonstrated that large-scale unsupervised pre-training could enable a single model to perform well across diverse tasks without task-specific fine-tuning, showcasing the power of scaling language models for generalization.

Read Paper
September 2019
Model Parallelism Scaling

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Megatron-LM introduced efficient model parallelism techniques, enabling the training of multi-billion parameter language models and setting the stage for scaling LLMs to unprecedented sizes.

Read Paper
October 2019
Text-to-Text Transfer Learning

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

T5 reframed all NLP tasks as text-to-text problems, demonstrating that a single transformer model could achieve state-of-the-art performance across diverse tasks through unified pre-training and fine-tuning.

Read Paper
October 2019
Memory Optimization Scaling

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

ZeRO introduced memory-efficient data-parallel training that partitions optimizer states, gradients, and parameters across devices instead of replicating them, sharply reducing per-GPU memory requirements and charting a path toward trillion-parameter models.
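
To make the memory savings concrete, here is a back-of-the-envelope sketch of per-GPU model-state memory under each partitioning stage. It assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and buffers. The function name is hypothetical.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam.

    Assumed accounting: 2 bytes/param for fp16 weights, 2 bytes/param for
    fp16 gradients, 12 bytes/param for fp32 optimizer states.
    """
    weights, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:   # stage 1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:   # stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:   # stage 3: also partition parameters
        weights /= num_gpus
    return (weights + grads + optim) / 1e9

# A 7.5B-parameter model on 64 GPUs; stage 0 means plain data parallelism.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
# prints 120.0, 31.4, 16.6, 1.9 for stages 0 to 3
```

Under these assumptions the optimizer states dominate, which is why partitioning them alone already cuts per-GPU memory by roughly a factor of four.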

Read Paper

2020

January 2020
Scaling Laws Performance Prediction

Scaling Laws for Neural Language Models

This paper formalized scaling laws for language models, showing that loss falls as a power law in model size, dataset size, and training compute, and giving practitioners a way to predict the payoff of larger models before training them.
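
A power law appears as a straight line on log-log axes, so it can be fitted with ordinary least squares and then extrapolated. The sketch below does this on synthetic data; the constants (10 and 0.1) are made up for illustration and are not the paper's fitted values.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares on (log size, log loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, b)

# Synthetic data generated from loss = 10 * N**(-0.1); not from the paper.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
predicted = a * 1e10 ** (-b)  # extrapolate to a model 10x larger than any in the fit
```

The same fit-then-extrapolate pattern is how small training runs can inform decisions about much larger ones, provided the trend continues to hold.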

Read Paper
May 2020
Few-Shot Learning Scaling

Language Models are Few-Shot Learners

GPT-3 showcased the power of large-scale language models in few-shot learning, performing complex tasks with minimal examples and highlighting the emergent capabilities of massive LLMs.

Read Paper

2021

January 2021
Sparsity Scaling

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Switch Transformers simplified sparse mixture-of-experts models by routing each token to a single expert, enabling efficient scaling to trillion-parameter models while keeping the computational cost per token roughly constant.
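
The sparsity comes from routing: a small router scores every expert, but only the top-scoring expert runs for a given token. The sketch below shows that top-1 routing step for one token. The toy experts and router weights are hypothetical stand-ins for learned feed-forward networks, and the load-balancing loss and expert capacity limits from the paper are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(token, router_weights, experts):
    """Route one token to its single highest-probability expert (top-1 routing)."""
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert runs; its output is scaled by the router probability.
    expert_out = experts[chosen](token)
    return chosen, [probs[chosen] * y for y in expert_out]

# Hypothetical toy experts standing in for feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles its input
    lambda x: [-v for v in x],       # expert 1 negates its input
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # made-up router parameters

chosen, output = switch_layer([3.0, 1.0], router_weights, experts)
```

Adding more experts grows the parameter count without increasing the work done per token, since each token still passes through exactly one expert.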

Read Paper
July 2021
Code Generation Pre-training

Evaluating Large Language Models Trained on Code

Codex demonstrated the ability of LLMs to generate high-quality code, paving the way for AI-assisted programming and the development of tools like GitHub Copilot.

Read Paper
August 2021
Foundation Models Ethics

On the Opportunities and Risks of Foundation Models

This paper coined the term "foundation models" and analyzed their societal implications, highlighting opportunities and risks in scalability, generalization, and ethical considerations.

Read Paper
September 2021
Zero-Shot Learning Instruction Tuning

Finetuned Language Models are Zero-Shot Learners

FLAN showed that instruction tuning could significantly improve zero-shot performance, enabling LLMs to generalize to unseen tasks with natural language instructions.

Read Paper
October 2021
Multitask Learning Zero-Shot Learning

Multitask Prompted Training Enables Zero-Shot Task Generalization

T0 demonstrated that multitask prompted training could enable LLMs to perform zero-shot generalization across a wide range of tasks, advancing prompt-based learning.

Read Paper
December 2021
Mixture-of-Experts Efficiency

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

GLaM introduced a mixture-of-experts approach to scale LLMs efficiently, achieving high performance with lower computational costs compared to dense models.

Read Paper
December 2021
Human Feedback Question Answering

WebGPT: Browser-assisted question-answering with human feedback

WebGPT combined LLMs with web browsing and human feedback to improve the accuracy and relevance of question-answering, paving the way for more reliable AI assistants.

Read Paper
December 2021
Retrieval-Augmented Scaling

Improving language models by retrieving from trillions of tokens

Retro scaled retrieval-augmented language modeling to a database of trillions of tokens, conditioning generation on retrieved text chunks so that a much smaller model could match the performance of far larger ones while improving factual grounding.

Read Paper
December 2021
Scaling Analysis

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Gopher provided insights into scaling LLMs, analyzing performance trade-offs and demonstrating the importance of data quality and model size for achieving high performance.

Read Paper

2022

January 2022
Reasoning Prompting

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper introduced chain-of-thought prompting, enabling LLMs to solve complex reasoning tasks by generating intermediate reasoning steps, dramatically improving performance on mathematical and logical problems.
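
In practice, chain-of-thought prompting is a matter of prompt construction: the few-shot exemplars include worked reasoning before the final answer. The sketch below builds such a prompt using the tennis-ball exemplar from the paper. The helper name and exact formatting are assumptions, and no model call is made.

```python
# One worked exemplar (the tennis-ball example from the paper).
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question, exemplars=(EXEMPLAR,)):
    """Prepend worked examples whose answers spell out intermediate steps."""
    return "\n".join(exemplars) + "\nQ: " + question + "\nA:"

prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"
)
```

A standard few-shot prompt would give only "The answer is 11." in the exemplar; showing the intermediate steps is the entire change.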

Read Paper
January 2022
Dialogue Safety

LaMDA: Language Models for Dialog Applications

LaMDA introduced a specialized dialogue model focusing on quality, safety, and groundedness, setting new standards for conversational AI and highlighting the importance of responsibility in LLM deployment.

Read Paper
June 2022
Mathematical Reasoning Specialized LLM

Solving Quantitative Reasoning Problems with Language Models

Minerva demonstrated that LLMs could solve complex mathematical and scientific problems with high accuracy when fine-tuned on technical content, advancing the frontier of AI for STEM applications.

Read Paper
January 2022
Model Parallelism Industry Collaboration

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

This paper detailed the training of one of the largest LLMs at the time, showcasing advanced parallel computing techniques and industry collaboration in scaling language models.

Read Paper
March 2022
RLHF Alignment

Training language models to follow instructions with human feedback

InstructGPT applied reinforcement learning from human feedback (RLHF) to fine-tune language models to follow instructions, aligning them with human preferences and significantly improving helpfulness while reducing harmful outputs.

Read Paper
March 2022
Compute-Optimal Efficiency

Training Compute-Optimal Large Language Models

Chinchilla revised the prevailing scaling laws for LLMs, showing that contemporary large models were significantly undertrained and that model size and training tokens should be scaled in roughly equal proportion for a fixed compute budget.

Read Paper
April 2022
Pathways Scaling

PaLM: Scaling Language Modeling with Pathways

PaLM demonstrated the benefits of efficiently scaling language models to 540B parameters using the Pathways system, achieving breakthrough capabilities in reasoning, multilingual tasks, and code generation.

Read Paper
May 2022
Open Models Transparency

OPT: Open Pre-trained Transformer Language Models

OPT provided the research community with an open-source alternative to large proprietary language models, advancing transparency and accessibility in AI research.

Read Paper
May 2022
Learning Paradigms Unified Training

UL2: Unifying Language Learning Paradigms

UL2 introduced a unified approach to pre-training language models, combining multiple learning objectives to create more versatile and effective LLMs.

Read Paper
June 2022
Emergent Abilities Scaling

Emergent Abilities of Large Language Models

This paper identified and documented emergent abilities in large language models: capabilities that are absent in smaller models and appear once scale passes a threshold, rather than improving smoothly.

Read Paper
June 2022
Evaluation Benchmarking

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

BIG-bench introduced a comprehensive benchmark with 204 diverse tasks to evaluate language model capabilities, providing a more nuanced understanding of LLM strengths and weaknesses.

Read Paper
June 2022
Interfaces Systems Design

Language Models are General-Purpose Interfaces

MetaLM reframed language models as a general-purpose interface to other foundation models, pairing a causal language model with pretrained encoders so that a single system could handle diverse language and vision-language tasks.

Read Paper
September 2022
Alignment Dialogue

Improving alignment of dialogue agents via targeted human judgements

Sparrow introduced a framework for aligning dialogue agents with human values through a combination of reinforcement learning and rule-based constraints, addressing safety and helpfulness.

Read Paper
October 2022
Instruction Tuning Scaling

Scaling Instruction-Finetuned Language Models

This paper demonstrated the effectiveness of instruction tuning at scale, showing how models like Flan-T5 and Flan-PaLM could achieve significant improvements across hundreds of tasks.

Read Paper
October 2022
Bilingual Open Source

GLM-130B: An Open Bilingual Pre-trained Model

GLM-130B presented an open-source bilingual (English and Chinese) LLM with strong performance, advancing multilingual capabilities and accessibility in non-English languages.

Read Paper
November 2022
Evaluation Holistic Assessment

Holistic Evaluation of Language Models

HELM established a comprehensive framework for evaluating language models across multiple dimensions, including accuracy, calibration, robustness, fairness, and bias.

Read Paper
November 2022
Multilingual Open Access

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BLOOM demonstrated the potential of international collaboration in creating a large-scale, multilingual LLM supporting 46 languages and 13 programming languages, advancing global AI accessibility.

Read Paper
November 2022
Scientific Knowledge Specialized Models

Galactica: A Large Language Model for Science

Galactica pioneered domain-specific language models for scientific knowledge, demonstrating strengths in scientific reasoning but also highlighting challenges in ensuring factual accuracy.

Read Paper
December 2022
Instruction Tuning Meta-Learning

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

OPT-IML investigated generalization in instruction-tuned language models, providing insights into cross-task transfer and the factors affecting generalization to new tasks.

Read Paper

2023

January 2023
Instruction Tuning Dataset Design

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

This paper presented a comprehensive approach to instruction tuning, introducing methods for creating high-quality datasets and training procedures that significantly improved LLM performance.

Read Paper
February 2023
Open Source Efficiency

LLaMA: Open and Efficient Foundation Language Models

LLaMA introduced a family of efficient foundation language models that matched or exceeded the performance of much larger models, democratizing access to powerful AI while requiring fewer computational resources.

Read Paper
February 2023
Multimodal Vision-Language

Language Is Not All You Need: Aligning Perception with Language Models

Kosmos-1 pioneered a multimodal LLM capable of perceiving and reasoning about visual and textual information together, breaking new ground in multimodal understanding.

Read Paper
March 2023
Recurrent Models Long Sequences

Resurrecting Recurrent Neural Networks for Long Sequences

LRU revisited recurrent architectures for long sequences, introducing the Linear Recurrent Unit and showing that carefully parameterized and initialized linear RNNs can match deep state space models on long-range benchmarks while processing very long sequences efficiently.

Read Paper
March 2023
Embodied Multimodal

PaLM-E: An Embodied Multimodal Language Model

PaLM-E unified language, vision, and action into a single model, enabling embodied intelligence and demonstrating how LLMs could power robotics and physical systems.

Read Paper
March 2023
Multimodal Capabilities

GPT-4 Technical Report

GPT-4 introduced a multimodal LLM with unprecedented capabilities in reasoning, specialized domains, and visual understanding, setting new benchmarks for the field and revealing emergent abilities.

Read Paper
April 2023
Visual Instruction Multimodal

Visual Instruction Tuning

LLaVA pioneered instruction tuning for multimodal models, enabling vision-language models to follow complex visual instructions and significantly advancing the field of multimodal AI.

Read Paper
April 2023
Model Analysis Scaling

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Pythia provided a comprehensive suite of models and tools for analyzing LLM behavior during training, enabling deeper scientific understanding of how these models learn and evolve.

Read Paper
May 2023
Self-Alignment Minimal Supervision

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Dromedary introduced a novel approach to aligning LLMs with human values using principled self-instruction, demonstrating how models could be made more helpful and harmless with minimal human supervision.

Read Paper
May 2023
Scaling Efficiency

PaLM 2 Technical Report

PaLM 2 demonstrated improved efficiency in language model architecture and training, achieving superior performance with fewer parameters and maintaining strengths across reasoning, multilingual tasks, and coding.

Read Paper
May 2023
RNN Architecture Efficiency

RWKV: Reinventing RNNs for the Transformer Era

RWKV introduced a novel architecture combining the parallelizable training of transformers with the efficient inference of RNNs, offering a promising alternative for language modeling with linear scaling properties.

Read Paper
May 2023
Preference Optimization RLHF Alternative

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

DPO introduced a simplified approach to aligning language models with human preferences, eliminating the need for separate reward models and making alignment more accessible and efficient.
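
The simplification is visible in the loss itself: given log-probabilities of a preferred and a rejected response under the policy and under a frozen reference model, DPO reduces to a logistic loss on the difference of log-ratios. The sketch below computes it for a single preference pair. The default `beta=0.1` is an illustrative assumption, and the inputs stand for summed token log-probabilities that a real pipeline would obtain from the two models.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: how far the policy moved each response's log-prob
    # relative to the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raised the chosen response and lowered the rejected one: the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
```

Because the reward is expressed through the policy's own log-probabilities, no separate reward model or reinforcement learning loop is needed to optimize it.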

Read Paper
May 2023
Reasoning Problem Solving

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

ToT introduced a framework for enabling LLMs to perform deliberate decision making by exploring and evaluating multiple reasoning paths, significantly improving performance on complex problems.

Read Paper
July 2023
Open Source Chat Models

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2 established new standards for openly released language models, with strong performance, safety-focused fine-tuning, and a license permitting commercial use, accelerating the adoption of LLMs in applications.

Read Paper
October 2023
Open Source Performance

Mistral 7B

Mistral 7B demonstrated that a smaller, more efficient language model could outperform larger ones, surpassing Llama 2 13B across the benchmarks its authors evaluated through architectural choices such as grouped-query attention and sliding window attention, redefining expectations for accessible AI.

Read Paper
December 2023
State Space Models Linear Complexity

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Mamba introduced selective state space models for sequence modeling, achieving transformer-level quality with linear-time computation and memory usage, offering a compelling alternative for long-sequence processing.

Read Paper

2024

February 2024
Open Science Transparency

OLMo: Accelerating the Science of Language Models

OLMo pioneered a new approach to transparent AI development, providing unprecedented access to training data, methodologies, and model checkpoints, enabling broader research participation in language model science.

Read Paper
May 2024
Efficiency MoE

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 introduced Multi-head Latent Attention, which compresses the key-value cache, alongside the DeepSeekMoE mixture-of-experts architecture, substantially improving training and inference efficiency while maintaining high performance and making advanced language models more accessible.

Read Paper
May 2024
Architecture State Space Models

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Mamba-2 established a theoretical framework, structured state space duality, that connects transformers and state space models, enabling more efficient architectures that keep the strengths of both approaches while easing their respective limitations.

Read Paper
June 2024
Training Data Data Quality

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

FineWeb advanced training data curation for language models, introducing methods to automatically identify and extract high-quality content from web data and showing that better data quality significantly improves model performance.

Read Paper
July 2024
Open Models Performance

The Llama 3 Herd of Models

Llama 3 established new standards for open-source language models, demonstrating performance competitive with closed-source alternatives while providing multiple model sizes optimized for different deployment scenarios.

Read Paper
September 2024
Mixture-of-Experts Open Source

OLMoE: Open Mixture-of-Experts Language Models

OLMoE pioneered fully open-source mixture-of-experts models that outperformed available models with similar active parameter counts, providing the research community with transparent architectures and training methodologies for highly efficient language models.

Read Paper
December 2024
Open Models Scaling

Qwen2.5 Technical Report

Qwen2.5 presented a series of large language models spanning a range of sizes, with pre-training data scaled up to 18 trillion tokens and extensive post-training, delivering strong performance across languages, mathematics, and coding.

Read Paper
December 2024
Foundation Models Architecture

DeepSeek-V3 Technical Report

DeepSeek-V3 scaled the mixture-of-experts design to 671B total parameters with 37B activated per token, combining Multi-head Latent Attention with an auxiliary-loss-free load-balancing strategy and a multi-token prediction objective to improve efficiency and performance across a wide range of tasks.

Read Paper

2025

January 2025
Reinforcement Learning Reasoning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1 showed that large-scale reinforcement learning can incentivize reasoning capabilities in large language models, with its DeepSeek-R1-Zero variant trained by reinforcement learning without supervised fine-tuning as a preliminary step, significantly improving performance on complex reasoning tasks.

Read Paper