Landmark Papers in LLMs
Landmark Papers in LLMs is a curated collection showcasing the foundational research that has shaped the field of large language models. I've carefully selected these papers to highlight the key breakthroughs and conceptual advances that have defined the evolution of LLMs, providing historical context and significance for researchers and enthusiasts alike.
2017
Attention Is All You Need
This paper introduced the Transformer architecture, which replaces recurrence and convolutions with attention mechanisms. Its efficiency, parallelizability, and capacity to capture long-range dependencies reshaped sequence modeling and laid the foundation for nearly all subsequent large language models.
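The attention mechanism described above can be illustrated with a minimal sketch of scaled dot-product attention in plain Python. The function names and the list-of-lists representation are assumptions made for illustration; this single-head version omits the learned projections, masking, and multi-head structure of a full Transformer layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because every query attends to every key independently, the loop over queries can run in parallel, which is the property that makes the architecture easier to parallelize than a recurrent network.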
2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT revolutionized NLP by introducing bidirectional pre-training and demonstrating the effectiveness of the pre-train then fine-tune paradigm, achieving state-of-the-art results across numerous tasks and initiating a new era of transformer-based language models.
Improving Language Understanding by Generative Pre-Training
GPT-1 established the effectiveness of generative pre-training followed by discriminative fine-tuning for language tasks, demonstrating that a single pre-trained model could serve as a foundation for multiple downstream applications with minimal task-specific architecture modifications.
2019
Language Models are Unsupervised Multitask Learners
GPT-2 demonstrated that large-scale unsupervised pre-training could enable a single model to perform well across diverse tasks without task-specific fine-tuning, showcasing the power of scaling language models for generalization.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Megatron-LM introduced efficient model parallelism techniques, enabling the training of multi-billion parameter language models and setting the stage for scaling LLMs to unprecedented sizes.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 reframed all NLP tasks as text-to-text problems, demonstrating that a single transformer model could achieve state-of-the-art performance across diverse tasks through unified pre-training and fine-tuning.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ZeRO introduced memory-efficient training techniques, such as optimizer state and gradient partitioning, enabling the training of trillion-parameter models by reducing GPU memory requirements.
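The memory saving from partitioning can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam training, where each parameter costs 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and 12 bytes of optimizer state. The function name and the `stage` argument are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU under ZeRO partitioning.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs, this estimate falls from about 120 GB per GPU with everything replicated to under 2 GB when all three kinds of state are partitioned.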
2020
Scaling Laws for Neural Language Models
This paper formalized scaling laws for language models, showing predictable relationships between model size, dataset size, and compute, guiding the design of larger and more efficient LLMs.
Language Models are Few-Shot Learners
GPT-3 showcased the power of large-scale language models in few-shot learning, performing complex tasks with minimal examples and highlighting the emergent capabilities of massive LLMs.
2021
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers simplified sparse mixture-of-experts routing by sending each token to a single expert, enabling efficient scaling to trillion-parameter models at a roughly constant computational cost per token.
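A minimal sketch of single-expert (top-1) routing, the idea behind this kind of sparse layer, is shown below. The names `switch_route` and `switch_layer` are hypothetical, experts are modeled as plain Python callables, and the capacity limits and load-balancing loss used in practice are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(router_logits):
    """Pick the single highest-probability expert and its gate value."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(token, router_weights, experts):
    """Route one token vector to exactly one expert.

    router_weights holds one weight vector per expert; experts is a list
    of callables mapping a vector to a vector.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    expert, gate = switch_route(logits)
    # Only the chosen expert runs, so compute per token stays constant
    # no matter how many experts (and parameters) the layer holds.
    return [gate * y for y in experts[expert](token)]
```

Scaling the expert output by the gate probability keeps the router differentiable even though the expert choice itself is a hard argmax.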
Evaluating Large Language Models Trained on Code
Codex demonstrated the ability of LLMs to generate high-quality code, paving the way for AI-assisted programming and the development of tools like GitHub Copilot.
On the Opportunities and Risks of Foundation Models
This paper coined the term "foundation models" and analyzed their societal implications, highlighting opportunities and risks in scalability, generalization, and ethical considerations.
Finetuned Language Models are Zero-Shot Learners
FLAN showed that instruction tuning could significantly improve zero-shot performance, enabling LLMs to generalize to unseen tasks with natural language instructions.
Multitask Prompted Training Enables Zero-Shot Task Generalization
T0 demonstrated that multitask prompted training could enable LLMs to perform zero-shot generalization across a wide range of tasks, advancing prompt-based learning.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
GLaM used a sparsely activated mixture-of-experts architecture to scale LLMs efficiently, achieving strong performance at lower training and inference cost than comparable dense models.
WebGPT: Browser-assisted question-answering with human feedback
WebGPT combined LLMs with web browsing and human feedback to improve the accuracy and relevance of question-answering, paving the way for more reliable AI assistants.
Improving language models by retrieving from trillions of tokens
Retro scaled retrieval-augmented language modeling to a database of trillions of tokens, conditioning the model on retrieved text chunks so that it could match much larger models with far fewer parameters while improving factual accuracy.
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Gopher provided insights into scaling LLMs, analyzing performance trade-offs and demonstrating the importance of data quality and model size for achieving high performance.
2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
This paper introduced chain-of-thought prompting, enabling LLMs to solve complex reasoning tasks by generating intermediate reasoning steps, dramatically improving performance on mathematical and logical problems.
LaMDA: Language Models for Dialog Applications
LaMDA introduced a specialized dialogue model focusing on quality, safety, and groundedness, setting new standards for conversational AI and highlighting the importance of responsibility in LLM deployment.
Solving Quantitative Reasoning Problems with Language Models
Minerva demonstrated that LLMs could solve complex mathematical and scientific problems with high accuracy when fine-tuned on technical content, advancing the frontier of AI for STEM applications.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
This paper detailed the training of one of the largest LLMs at the time, showcasing advanced parallel computing techniques and industry collaboration in scaling language models.
Training language models to follow instructions with human feedback
InstructGPT applied reinforcement learning from human feedback (RLHF) to align large language models with human preferences, significantly improving helpfulness and reducing harmful outputs.
PaLM: Scaling Language Modeling with Pathways
PaLM demonstrated the benefits of efficiently scaling language models to 540B parameters using the Pathways system, achieving breakthrough capabilities in reasoning, multilingual tasks, and code generation.
Training Compute-Optimal Large Language Models
Chinchilla revised the scaling laws for LLMs, demonstrating that existing large models were significantly undertrained and that model size and training tokens should be scaled in roughly equal proportion for a given compute budget.
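The balance between model size and training tokens can be sketched with two widely used approximations, both treated here as assumptions rather than exact results: training compute C is about 6 x N x D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into a parameter count and a token count.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)). Both are rules of thumb.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget equivalent to training 70B parameters on 1.4T tokens.
    budget = 6.0 * 70e9 * 1.4e12
    n, d = compute_optimal_split(budget)
    print(f"params: {n:.3g}, tokens: {d:.3g}")
```

Under these assumptions, doubling the compute budget increases both the parameter count and the token count by a factor of about 1.41, rather than putting all the extra compute into a larger model.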
OPT: Open Pre-trained Transformer Language Models
OPT provided the research community with an open-source alternative to large proprietary language models, advancing transparency and accessibility in AI research.
UL2: Unifying Language Learning Paradigms
UL2 introduced a unified approach to pre-training language models, combining multiple learning objectives to create more versatile and effective LLMs.
Emergent Abilities of Large Language Models
This paper identified and documented the phenomenon of emergent abilities in large language models, capabilities that appear suddenly as model scale increases rather than improving smoothly.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
BIG-bench introduced a comprehensive benchmark with 204 diverse tasks to evaluate language model capabilities, providing a more nuanced understanding of LLM strengths and weaknesses.
Language Models are General-Purpose Interfaces
MetaLM proposed using a language model as a general-purpose interface to other foundation models, with pretrained encoders for modalities such as vision and language feeding into a language model that acts as a universal task layer.
Improving alignment of dialogue agents via targeted human judgements
Sparrow introduced a framework for aligning dialogue agents with human values through a combination of reinforcement learning and rule-based constraints, addressing safety and helpfulness.
Scaling Instruction-Finetuned Language Models
This paper demonstrated the effectiveness of instruction tuning at scale, showing how models like Flan-T5 and Flan-PaLM could achieve significant improvements across hundreds of tasks.
GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B presented an open-source bilingual (English and Chinese) LLM with strong performance, advancing multilingual capabilities and accessibility in non-English languages.
Holistic Evaluation of Language Models
HELM established a comprehensive framework for evaluating language models across multiple dimensions, including accuracy, calibration, robustness, fairness, and bias.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM demonstrated the potential of international collaboration in creating a large-scale, multilingual LLM supporting 46 languages and 13 programming languages, advancing global AI accessibility.
Galactica: A Large Language Model for Science
Galactica pioneered domain-specific language models for scientific knowledge, demonstrating strengths in scientific reasoning but also highlighting challenges in ensuring factual accuracy.
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
OPT-IML investigated generalization in instruction-tuned language models, providing insights into cross-task transfer and the factors affecting generalization to new tasks.
2023
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
This paper presented a comprehensive approach to instruction tuning, introducing methods for creating high-quality datasets and training procedures that significantly improved LLM performance.
LLaMA: Open and Efficient Foundation Language Models
LLaMA introduced a family of efficient foundation language models that matched or exceeded the performance of much larger models, democratizing access to powerful AI while requiring fewer computational resources.
Language Is Not All You Need: Aligning Perception with Language Models
Kosmos-1 pioneered a multimodal LLM capable of perceiving and reasoning about visual and textual information together, breaking new ground in multimodal understanding.
Resurrecting Recurrent Neural Networks for Long Sequences
This paper revitalized recurrent architectures for long sequences, introducing the Linear Recurrent Unit (LRU) and showing that carefully designed linear recurrent layers can match deep state space models on long-range benchmarks while remaining computationally efficient.
PaLM-E: An Embodied Multimodal Language Model
PaLM-E unified language, vision, and action into a single model, enabling embodied intelligence and demonstrating how LLMs could power robotics and physical systems.
GPT-4 Technical Report
GPT-4 is a multimodal model that accepts image and text inputs and produces text, with strong capabilities in reasoning, specialized domains, and visual understanding, including human-level performance on a range of professional and academic benchmarks.
Visual Instruction Tuning
LLaVA pioneered instruction tuning for multimodal models, enabling vision-language models to follow complex visual instructions and significantly advancing the field of multimodal AI.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia provided a comprehensive suite of models and tools for analyzing LLM behavior during training, enabling deeper scientific understanding of how these models learn and evolve.
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Dromedary introduced a novel approach to aligning LLMs with human values using principled self-instruction, demonstrating how models could be made more helpful and harmless with minimal human supervision.
PaLM 2 Technical Report
PaLM 2 demonstrated improved efficiency in language model architecture and training, achieving superior performance with fewer parameters and maintaining strengths across reasoning, multilingual tasks, and coding.
RWKV: Reinventing RNNs for the Transformer Era
RWKV introduced a novel architecture combining the parallelizable training of transformers with the efficient inference of RNNs, offering a promising alternative for language modeling with linear scaling properties.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO introduced a simplified approach to aligning language models with human preferences, eliminating the need for separate reward models and making alignment more accessible and efficient.
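How alignment can work without a separate reward model is easiest to see in the loss itself. The sketch below computes the DPO objective for a single preference pair from four sequence log-probabilities. The function and argument names are hypothetical, and the default `beta=0.1` is only an illustrative value; batching and the language models that produce these log-probabilities are out of scope.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, so the policy's own log-probability ratios act as the reward signal.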
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
ToT introduced a framework for enabling LLMs to perform deliberate decision making by exploring and evaluating multiple reasoning paths, significantly improving performance on complex problems.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2 established new standards for openly released language models, with strong performance, safety-focused fine-tuning of its chat variants, and a license permitting commercial use, accelerating the adoption of LLMs in applications.
Mistral 7B
Mistral 7B demonstrated that a smaller, more efficient language model could outperform larger models such as Llama 2 13B, using grouped-query attention and sliding window attention to reduce inference cost.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba introduced selective state space models for sequence modeling, achieving transformer-level quality with linear-time computation and memory usage, offering a compelling alternative for long-sequence processing.
2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 combined Multi-head Latent Attention, which compresses the key-value cache, with the DeepSeekMoE mixture-of-experts architecture, substantially reducing training and inference cost while maintaining high performance.
OLMo: Accelerating the Science of Language Models
OLMo pioneered a new approach to transparent AI development, providing unprecedented access to training data, methodologies, and model checkpoints, enabling broader research participation in language model science.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Mamba-2 established a theoretical framework connecting transformers and state space models through structured state space duality, enabling more efficient architectures that draw on the strengths of both approaches.
The Llama 3 Herd of Models
Llama 3 raised the bar for openly released language models, demonstrating performance competitive with closed-source alternatives while providing multiple model sizes, up to 405B parameters, for different deployment scenarios.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb advanced training data curation for language models, releasing a large web-scale dataset built from Common Crawl and documenting the filtering and deduplication choices that improve model performance through better data quality.
OLMoE: Open Mixture-of-Experts Language Models
OLMoE provided fully open mixture-of-experts models that outperformed models with similar active parameter counts, giving the research community transparent architectures, data, and training methodologies for highly efficient language models.
Qwen2.5 Technical Report
Qwen2.5 represented a significant advancement in open-weight language models, scaling up pre-training data and post-training while offering a range of model sizes with strong performance across languages, coding, and mathematics.
DeepSeek-V3 Technical Report
DeepSeek-V3 scaled the mixture-of-experts design of DeepSeek-V2, adding auxiliary-loss-free load balancing, a multi-token prediction objective, and FP8 mixed-precision training to offer substantial improvements in efficiency and performance across a wide range of tasks.
2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1 showed that reasoning capabilities in large language models can be developed through large-scale reinforcement learning, achieving performance on complex reasoning tasks comparable to leading proprietary reasoning models.