Landmark Papers in LLMs

Landmark Papers in LLMs is a curated collection showcasing the foundational research that has shaped the field of large language models. I've carefully selected these papers to highlight the key breakthroughs and conceptual advances that have defined the evolution of LLMs, providing historical context and significance for researchers and enthusiasts alike.

2017

June 2017

Transformers Attention

Attention Is All You Need

This paper introduced the Transformer architecture, which replaces recurrence and convolutions with attention mechanisms. Its efficiency, parallelizability, and capacity to capture long-range dependencies reshaped sequence modeling and laid the foundation for most subsequent large language models.
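As a rough illustration of the attention mechanism at the core of this paper, here is a minimal sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. The function names and the toy vectors are illustrative assumptions, not the authors' code, and the sketch leaves out multi-head projections, masking, and positional encodings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for single-head attention without masking.

    queries: list of n vectors of length d_k
    keys:    list of m vectors of length d_k
    values:  list of m vectors of length d_v
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy self-attention over three 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each query row is computed independently of the others, so the rows can be processed in parallel rather than one time step after another, and every position can draw directly on every other position regardless of distance.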

2017

Attention Is All You Need

2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Improving Language Understanding by Generative Pre-Training

2019

Language Models are Unsupervised Multitask Learners

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

2020

Scaling Laws for Neural Language Models

Language Models are Few-Shot Learners

2021

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Evaluating Large Language Models Trained on Code

On the Opportunities and Risks of Foundation Models

Finetuned Language Models are Zero-Shot Learners

Multitask Prompted Training Enables Zero-Shot Task Generalization

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

WebGPT: Browser-assisted question-answering with human feedback

Improving language models by retrieving from trillions of tokens

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

LaMDA: Language Models for Dialog Applications

Solving Quantitative Reasoning Problems with Language Models

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Training language models to follow instructions with human feedback

PaLM: Scaling Language Modeling with Pathways

Training Compute-Optimal Large Language Models

OPT: Open Pre-trained Transformer Language Models

UL2: Unifying Language Learning Paradigms

Emergent Abilities of Large Language Models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Language Models are General-Purpose Interfaces

Improving alignment of dialogue agents via targeted human judgements

Scaling Instruction-Finetuned Language Models

GLM-130B: An Open Bilingual Pre-trained Model

Holistic Evaluation of Language Models

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Galactica: A Large Language Model for Science

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2023

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

LLaMA: Open and Efficient Foundation Language Models

Language Is Not All You Need: Aligning Perception with Language Models

Resurrecting Recurrent Neural Networks for Long Sequences

PaLM-E: An Embodied Multimodal Language Model

GPT-4 Technical Report

Visual Instruction Tuning

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

PaLM 2 Technical Report

RWKV: Reinventing RNNs for the Transformer Era

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Llama 2: Open Foundation and Fine-Tuned Chat Models

Mistral 7B

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

2024

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

OLMo: Accelerating the Science of Language Models

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

The Llama 3 Herd of Models

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

OLMoE: Open Mixture-of-Experts Language Models

Qwen2.5 Technical Report

DeepSeek-V3 Technical Report

2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning