The Grand AI Handbook

Landmark Papers in Computer Vision

Explore the foundational research that has shaped the field of Computer Vision. This curated collection highlights the most influential papers that established key concepts, techniques, and breakthroughs in the evolution of computer vision systems.

This collection highlights the key breakthroughs and conceptual advances that have defined the evolution of visual perception systems. I've selected each paper for its lasting influence, providing historical context and significance for researchers and enthusiasts alike.

1960s-1980s

February 1963
Edge Detection 3D Vision

Machine Perception of Three-Dimensional Solids

This pioneering work by Roberts at MIT introduced the Roberts Operator, one of the first algorithms for edge detection and laid the groundwork for computational approaches to 3D object recognition from 2D images, establishing fundamental techniques for extracting structure from visual data.

Read Paper
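
The Roberts cross is simple enough to state in a few lines. Below is a minimal NumPy/SciPy sketch (an illustration, not Roberts' original implementation): two 2x2 kernels respond to diagonal intensity changes, and their magnitudes combine into an edge map.

```python
# Minimal Roberts cross sketch, assuming a grayscale image as a 2D float array.
import numpy as np
from scipy.ndimage import convolve

def roberts_edges(image: np.ndarray) -> np.ndarray:
    gx = convolve(image, np.array([[1.0, 0.0], [0.0, -1.0]]))  # one diagonal
    gy = convolve(image, np.array([[0.0, 1.0], [-1.0, 0.0]]))  # the other
    return np.hypot(gx, gy)  # per-pixel gradient magnitude
```
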
March 1973
Face Detection Edge Analysis

Computer Detection of Human Faces

This early work from USC on automated face detection established initial approaches for computational face recognition, exploring edge-based techniques to isolate and identify facial features in images decades before modern deep learning approaches.

Read Paper
November 1980
Biological Vision Edge Detection

Theory of Edge Detection

David Marr and Ellen Hildreth's influential work at MIT provided a computational theory of edge detection, connecting biological vision systems to computational models by locating zero-crossings of a Laplacian-of-Gaussian filtered image at multiple scales, a concept of multi-scale representation that continues to influence modern computer vision.

Read Paper
November 1982
Neural Networks Pattern Recognition

Neocognitron: A Self-organizing Neural Network Model for Pattern Recognition

Fukushima's groundbreaking work at NHK Labs introduced the Neocognitron, a hierarchical neural network inspired by the visual cortex that established the concept of increasingly complex feature extraction through layers, directly influencing modern convolutional neural networks.

Read Paper
June 1986
Edge Detection Image Processing

A Computational Approach to Edge Detection

John Canny's work at MIT introduced the Canny edge detector, a multi-stage algorithm that optimizes detection, localization, and minimal response criteria, becoming the most widely used edge detection method and establishing mathematical rigor in feature extraction.

Read Paper
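
For a sense of how the detector is used in practice, here is a usage sketch via OpenCV's implementation; the input path and threshold values are illustrative placeholders, and the two thresholds drive the hysteresis stage the paper describes.

```python
# Canny edge detection via OpenCV (input path and thresholds are placeholders).
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)        # noise suppression stage
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # hysteresis thresholds
cv2.imwrite("edges.png", edges)
```
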
November 1988
Motion Estimation Video Analysis

A Computational Framework for the Visual Motion

This seminal work from MIT established fundamental methods for optical flow calculation, providing mathematical techniques to estimate motion between frames that remain foundational for video processing, action recognition, and object tracking applications.

Read Paper
December 1989
Neural Networks OCR

Backpropagation Applied to Handwritten Zip Code Recognition

This influential work from Bell Labs demonstrated the practical application of neural networks with backpropagation for visual pattern recognition, establishing a framework for training deep networks on image data that would eventually lead to modern deep learning approaches.

Read Paper

1990s

July 1991
Face Recognition Dimensionality Reduction

Eigenfaces for Recognition

This groundbreaking paper from MIT introduced eigenfaces, a principal component analysis approach to efficiently represent faces in a lower-dimensional space, revolutionizing facial recognition and establishing core techniques for statistical pattern recognition in computer vision.

Read Paper
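
The core of eigenfaces is PCA on vectorized face images. A minimal NumPy sketch, assuming `faces` is an (n_samples, h*w) array of flattened grayscale faces:

```python
import numpy as np

def eigenfaces(faces: np.ndarray, k: int = 50):
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Rows of vt are the principal axes ("eigenfaces") in pixel space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    weights = centered @ components.T  # k-dimensional code per face
    return mean, components, weights
```

Recognition then reduces to nearest-neighbor comparison of the k-dimensional weight vectors.
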
June 1992
Contour Models Shape Detection

Snakes: Active Contour Models

Kass, Witkin, and Terzopoulos at Schlumberger Palo Alto Research introduced active contour models, or "snakes": energy-minimizing splines guided by external forces and image constraints, establishing a powerful framework for object boundary detection that continues to influence medical image analysis and object segmentation.

Read Paper
May 1995
Segmentation Optimization

Graph Cuts for Image Segmentation

This influential work from Cornell introduced the application of graph cut optimization to image segmentation, formulating the problem as finding the minimum cut in a graph, establishing energy minimization approaches that would transform object segmentation and stereo correspondence.

Read Paper
July 1997
Spectral Methods Segmentation

Normalized Cuts and Image Segmentation

Shi and Malik at Berkeley introduced normalized cuts, a theoretically sound spectral clustering approach to image segmentation that measures both the dissimilarity between different groups and the similarity within groups, establishing a foundation for perceptual grouping in computer vision.

Read Paper
September 1998
CNNs Document Analysis

Gradient-Based Learning Applied to Document Recognition

Yann LeCun and colleagues at AT&T/Bell Labs introduced LeNet-5, a pioneering convolutional neural network architecture for handwritten digit recognition that demonstrated end-to-end training from pixels to classification, establishing the foundation for modern deep learning approaches in computer vision.

Read Paper
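
As an illustration of the LeNet-5 layer recipe, here is a PyTorch sketch; ReLU and max-pooling stand in for the original sigmoid units and average subsampling, so this is a modernized approximation rather than the paper's exact network.

```python
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, num_classes),
        )

    def forward(self, x):  # x: (N, 1, 32, 32)
        return self.classifier(self.features(x))
```
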
September 1999
Manifold Learning Dimensionality Reduction

A Global Geometric Framework for Nonlinear Dimensionality Reduction

This influential work from Stanford introduced ISOMAP, a technique for discovering nonlinear manifolds in high-dimensional data that preserves geodesic distances, establishing a powerful approach for understanding the intrinsic structure of visual data that influenced subsequent manifold learning methods.

Read Paper

2000-2009

December 2001
Face Detection Cascaded Classifiers

Rapid Object Detection using a Boosted Cascade of Simple Features

Viola and Jones at Mitsubishi/MIT introduced a revolutionary real-time face detection framework using Haar-like features and AdaBoost, the first algorithm capable of reliable face detection at 15+ frames per second, transforming practical computer vision applications and enabling embedded vision systems.

Read Paper
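
The trained cascade ships with OpenCV, so the detector can be exercised in a few lines; the image path below is a placeholder.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
# Slide the boosted cascade over the image at multiple scales.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face at ({x}, {y}), size {w}x{h}")
```
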
June 2003
Object Recognition Part-Based Models

Pictorial Structures for Object Recognition

Felzenszwalb and Huttenlocher (MIT and Cornell) formalized pictorial structures, representing objects as collections of parts arranged in deformable configurations, establishing a mathematically principled approach to object recognition that would later influence part-based models and pose estimation.

Read Paper
June 2004
Feature Detection Scale Invariance

Distinctive Image Features from Scale-Invariant Keypoints

David Lowe at the University of British Columbia introduced SIFT (Scale-Invariant Feature Transform), a groundbreaking algorithm for detecting and describing local features invariant to scale, rotation, and illumination changes, revolutionizing object recognition, image matching, and 3D reconstruction.

Read Paper
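
A typical SIFT workflow is matching keypoints between two images with Lowe's ratio test. A sketch using OpenCV's implementation (image paths are placeholders; `cv2.SIFT_create` requires a reasonably recent opencv-python):

```python
import cv2

img1 = cv2.imread("a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("b.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
# Lowe's ratio test: keep matches clearly better than the runner-up.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} confident matches")
```
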
October 2005
Feature Descriptors Human Detection

Histograms of Oriented Gradients for Human Detection

Dalal and Triggs at INRIA introduced HOG (Histograms of Oriented Gradients), a feature descriptor that captures local gradient orientation statistics, dramatically improving human detection performance and establishing a descriptor that would influence object recognition approaches for over a decade.

Read Paper
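
OpenCV bundles the HOG descriptor with a pretrained linear-SVM pedestrian detector in the spirit of the paper; a usage sketch with a placeholder image path:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
img = cv2.imread("street.jpg")
# Scan detection windows over an image pyramid.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8))
for (x, y, w, h), score in zip(boxes, weights.ravel()):
    print(f"person at ({x}, {y}), score {score:.2f}")
```
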
June 2006
Feature Detection Real-time

SURF: Speeded Up Robust Features

Bay and colleagues at ETH Zurich introduced SURF, a computationally efficient alternative to SIFT that used integral images and box filters to approximate derivatives, significantly accelerating feature detection and description while maintaining robustness for real-time applications.

Read Paper
October 2008
Binary Descriptors Efficiency

BRIEF: Binary Robust Independent Elementary Features

Calonder and colleagues at EPFL introduced BRIEF, a binary feature descriptor that used simple intensity difference tests to create highly discriminative bit strings, dramatically reducing memory requirements and computation time compared to floating-point descriptors like SIFT and SURF.

Read Paper
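
The descriptor itself is a set of pairwise intensity comparisons. A toy NumPy sketch in BRIEF's spirit (fixed uniform-random pairs rather than the paper's exact sampling distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
PAIRS = rng.integers(0, 32, size=(256, 4))  # (y1, x1, y2, x2) inside a 32x32 patch

def brief(patch: np.ndarray) -> np.ndarray:
    # patch: 32x32 grayscale, ideally pre-smoothed as the paper advises
    y1, x1, y2, x2 = PAIRS.T
    return (patch[y1, x1] < patch[y2, x2]).astype(np.uint8)  # 256-bit string

# Descriptors compare by Hamming distance: np.count_nonzero(d1 != d2)
```
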
September 2009
Dataset Visual Recognition

ImageNet: A Large-Scale Hierarchical Image Database

Deng and colleagues at Princeton introduced ImageNet, a massive dataset of labeled images (3.2 million at publication, since grown to over 14 million) organized according to the WordNet hierarchy, providing unprecedented scale for training visual recognition systems and ultimately catalyzing the deep learning revolution in computer vision.

Read Paper

2010-2015

September 2010
Benchmark Object Recognition

The PASCAL Visual Object Classes Challenge

Everingham and colleagues at Oxford/Edinburgh established the PASCAL VOC challenge, creating standardized datasets and evaluation protocols for object detection and segmentation that became the primary benchmark for comparing computer vision algorithms for nearly a decade.

Read Paper
November 2011
Object Detection Part-Based Models

Object Detection with Discriminatively Trained Part-Based Models

Felzenszwalb and colleagues at the University of Chicago introduced Deformable Part Models (DPM), a discriminative approach combining HOG features with latent SVM training to model objects as collections of parts, setting state-of-the-art performance in object detection before the deep learning revolution.

Read Paper
September 2012
Deep Learning Image Classification

ImageNet Classification with Deep Convolutional Neural Networks

Krizhevsky, Sutskever, and Hinton at the University of Toronto introduced AlexNet, a deep convolutional neural network that dramatically outperformed previous approaches on the ImageNet challenge, catalyzing the deep learning revolution in computer vision and establishing the CNN architecture as the dominant paradigm for visual recognition tasks.

Read Paper
November 2013
Object Detection Deep Learning

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Girshick and colleagues at UC Berkeley introduced R-CNN (Regions with CNN features), the first highly effective approach to combine region proposals with deep convolutional features, establishing a new paradigm for object detection that would dominate the field for years to come.

Read Paper
June 2014
Spatial Pooling Multi-Scale Processing

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

He and colleagues at Microsoft introduced SPPNet, which added a spatial pyramid pooling layer allowing CNNs to handle images of arbitrary size/scale and generate fixed-length representations, significantly improving efficiency by sharing computation across region proposals.

Read Paper
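
The pooling idea itself is compact: pool the feature map to a few fixed grids and concatenate, so any input size yields the same output length. A minimal PyTorch sketch (pyramid levels chosen for illustration):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    # x: (N, C, H, W) with arbitrary H, W
    pooled = [F.adaptive_max_pool2d(x, size).flatten(1) for size in levels]
    return torch.cat(pooled, dim=1)  # (N, C * sum(l*l for l in levels))

feats = spatial_pyramid_pool(torch.randn(2, 256, 13, 17))
print(feats.shape)  # torch.Size([2, 5376]); same length for any H, W
```
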
September 2014
Network Architecture Inception Modules

Going Deeper with Convolutions

Szegedy and colleagues at Google introduced GoogLeNet/Inception, a novel architecture using inception modules with parallel convolutions at different scales, dramatically reducing parameters while increasing depth, winning the 2014 ImageNet competition and establishing new principles for efficient network design.

Read Paper
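
One inception module is just parallel branches concatenated along the channel axis. A PyTorch sketch (filter counts are illustrative, not GoogLeNet's exact configuration):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                   # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),    # 1x1 bottleneck
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, 1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))    # pooled branch

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```
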
September 2014
Network Depth CNN Architecture

Very Deep Convolutional Networks for Large-Scale Image Recognition

Simonyan and Zisserman at Oxford introduced VGGNet, which demonstrated the importance of network depth by using small 3×3 convolution filters stacked to create effective receptive fields, establishing a simple yet powerful architecture that became a standard feature extractor for many computer vision tasks.

Read Paper
October 2014
Face Recognition Deep Learning

Deep Learning Face Representation by Joint Identification-Verification

Sun and colleagues at CUHK introduced DeepID2, a deep learning approach that jointly optimized face identification and verification objectives, significantly advancing face recognition performance and establishing multi-task learning principles that would influence subsequent facial recognition systems.

Read Paper
March 2015
Object Detection End-to-End Training

Fast R-CNN

Girshick at Microsoft improved upon R-CNN with Fast R-CNN, which enabled end-to-end detector training by pooling CNN features from regions of interest, dramatically increasing both speed and accuracy for object detection while simplifying the multi-stage training pipeline.

Read Paper
May 2015
Region Proposals Real-time Detection

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Ren and colleagues at Microsoft introduced Faster R-CNN, which integrated region proposal generation into the detection network with a Region Proposal Network, creating the first near real-time high-accuracy object detection system and establishing a unified framework that influenced numerous subsequent approaches.

Read Paper
June 2015
Medical Imaging Segmentation

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger and colleagues at the University of Freiburg introduced U-Net, an elegant encoder-decoder architecture with skip connections that enabled precise segmentation with limited training data, revolutionizing medical image analysis and establishing a fundamental architecture for dense prediction tasks.

Read Paper
September 2015
Semantic Segmentation Pixel-wise Classification

Fully Convolutional Networks for Semantic Segmentation

Long, Shelhamer, and Darrell at Berkeley introduced FCN, transforming classification networks into fully convolutional ones that could produce dense, pixel-wise predictions, establishing the fundamental approach to semantic segmentation that continues to influence modern architectures.

Read Paper
December 2015
Residual Learning Deep Networks

Deep Residual Learning for Image Recognition

He and colleagues at Microsoft introduced ResNet, which enabled training of extremely deep networks through residual connections that created shortcuts across layers, solving the vanishing gradient problem and establishing a fundamental architecture that continues to serve as the backbone for numerous computer vision systems.

Read Paper
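
The residual idea reduces to adding the block's input back before the final activation. A basic-block sketch in PyTorch (simplified; no downsampling path):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut bypasses the conv branch
```
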
December 2015
Single-Shot Detection Real-time

SSD: Single Shot MultiBox Detector

Liu and colleagues at UNC Chapel Hill and Google introduced SSD, a detection framework that eliminated proposal generation and feature resampling stages by making predictions at multiple scales directly from feature maps, establishing a high-speed detection approach that balanced accuracy and efficiency for real-time applications.

Read Paper

2016-2019

March 2016
Face Recognition Deep Learning

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Taigman and colleagues at Facebook introduced DeepFace, a deep learning system for face verification that approached human-level performance through 3D alignment, a large-scale private training dataset, and a deep CNN architecture, helping establish face recognition as one of the first computer vision tasks to achieve near-human accuracy.

Read Paper
May 2016
Atrous Convolution Semantic Segmentation

Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Chen and colleagues at Google introduced DeepLabv2, which combined atrous (dilated) convolutions to efficiently capture multi-scale context with fully connected CRFs for boundary refinement, establishing key techniques for accurate semantic segmentation that would influence numerous subsequent approaches.

Read Paper
June 2016
Real-time Detection Single-pass

You Only Look Once: Unified, Real-Time Object Detection

Redmon and colleagues at the University of Washington introduced YOLO, a revolutionary object detection approach that framed detection as a single regression problem from images to bounding boxes and class probabilities, enabling unprecedented speed while maintaining competitive accuracy, establishing a new paradigm for real-time vision.

Read Paper
August 2016
Model Compression Efficiency

SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size

Iandola and colleagues at UC Berkeley introduced SqueezeNet, a compact CNN architecture that achieved AlexNet-level accuracy with 50x fewer parameters through fire modules combining squeeze and expand operations, establishing important principles for efficient network design for mobile and embedded systems.

Read Paper
October 2016
Dataset Benchmark

Microsoft COCO: Common Objects in Context

Lin and colleagues at Microsoft introduced COCO, a large-scale object detection, segmentation, and captioning dataset with complex everyday scenes containing multiple objects in their natural context, establishing a more challenging benchmark that drove advances in instance segmentation and dense prediction tasks.

Read Paper
December 2016
Scene Parsing Multi-scale Context

Pyramid Scene Parsing Network

Zhao and colleagues at SenseTime/CUHK introduced PSPNet, which utilized a pyramid pooling module to aggregate context at multiple scales, effectively capturing global and local information for scene parsing, establishing a new approach to multi-scale feature representation that influenced numerous segmentation methods.

Read Paper
March 2017
Instance Segmentation Multi-task Learning

Mask R-CNN

He and colleagues at Facebook AI Research introduced Mask R-CNN, extending Faster R-CNN with a parallel mask prediction branch for instance segmentation, establishing a flexible framework for multiple vision tasks and achieving state-of-the-art results that would influence object detection and segmentation for years to come.

Read Paper
April 2017
Class Imbalance Dense Detection

Focal Loss for Dense Object Detection

Lin and colleagues at Facebook AI introduced focal loss and RetinaNet, addressing the extreme foreground-background class imbalance in dense detection by down-weighting easy examples, enabling single-stage detectors to outperform two-stage approaches and establishing a key technique for addressing imbalanced datasets.

Read Paper
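
Focal loss itself is a small modification of cross-entropy: scale it by (1 - p_t)^gamma so well-classified examples contribute little. A binary-case PyTorch sketch with the paper's default alpha and gamma:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: same shape; targets in {0, 1}
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```
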
July 2017
Mobile Networks Efficient Architecture

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Howard and colleagues at Google introduced MobileNets, which utilized depthwise separable convolutions to dramatically reduce computation and parameters while maintaining reasonable accuracy, establishing fundamental techniques for efficient model design that would enable computer vision on resource-constrained devices.

Read Paper
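
The building block factorizes a standard convolution into a per-channel spatial filter plus a 1x1 channel mix. A PyTorch sketch of one such unit:

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, 3, stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        # pointwise: 1x1 convolution mixes information across channels
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```
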
September 2017
Channel Shuffling Mobile Architecture

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Zhang and colleagues at Face++ introduced ShuffleNet, which utilized pointwise group convolutions and channel shuffling to reduce computation while maintaining accuracy, establishing novel techniques for designing highly efficient networks that influenced numerous subsequent mobile-friendly architectures.

Read Paper
October 2017
Attention Transformers

Attention is All You Need

Vaswani and colleagues at Google introduced the Transformer architecture based entirely on attention mechanisms, initially for NLP but eventually revolutionizing computer vision by providing a new paradigm beyond convolutions that would lead to Vision Transformers and numerous attention-based visual models.

Read Paper
December 2017
Capsule Networks Hierarchical Representations

Dynamic Routing Between Capsules

Sabour, Hinton and colleagues at Google introduced CapsNet, which modeled hierarchical relationships between object parts using capsules that preserve more information than scalar features, proposing a fundamentally different approach to representation learning addressing key limitations of CNNs.

Read Paper
January 2018
Dense Connections Feature Reuse

Densely Connected Convolutional Networks

Huang and colleagues at Cornell/Tsinghua introduced DenseNet, which connected each layer to every other layer in a feed-forward fashion to encourage feature reuse, improve gradient flow, and reduce parameters, establishing a powerful architecture for efficient learning that influenced numerous subsequent network designs.

Read Paper
March 2018
Neural Architecture Search AutoML

Learning Transferable Architectures for Scalable Image Recognition

Zoph and colleagues at Google introduced NASNet, which used reinforcement learning to search for optimal neural architecture building blocks that could be transferred across datasets, establishing automated architecture design approaches that would launch an entire field of neural architecture search.

Read Paper
May 2018
Object Detection Real-time

YOLOv3: An Incremental Improvement

Redmon and Farhadi at the University of Washington refined the YOLO architecture with multi-scale predictions, better feature extractors, and various design improvements, establishing YOLOv3 as the standard real-time detector balancing speed and accuracy that would be widely adopted in practical applications.

Read Paper
July 2018
Style Transfer Content Manipulation

A Neural Algorithm of Artistic Style

Gatys and colleagues at the University of Tübingen separated and recombined content and style representations from different images, enabling artistic style transfer by optimizing for content similarity and style statistics, establishing a novel application of neural networks that sparked significant interest in creative AI.

Read Paper
September 2018
Atrous Convolution Encoder-Decoder

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Chen and colleagues at Google introduced DeepLabv3+, which combined an encoder-decoder structure with atrous separable convolutions, establishing a powerful and efficient architecture for semantic segmentation that achieved state-of-the-art results while maintaining computational efficiency.

Read Paper
November 2018
GANs Image Synthesis

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Brock and colleagues at DeepMind introduced BigGAN, which demonstrated the benefits of scaling up GAN training with larger batch sizes and more parameters, establishing new benchmarks for image synthesis quality and revealing the importance of training dynamics for generative models.

Read Paper
January 2019
Keypoint Detection Object Localization

Objects as Points

Zhou and colleagues at the University of Texas introduced CenterNet, which modeled objects as points (their center) and regressed to other properties, establishing a simple yet effective approach to detection that unified object detection, human pose estimation, and 3D detection in a single framework.

Read Paper
March 2019
Efficient Scaling Compound Scaling

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Tan and Le at Google introduced EfficientNet, which proposed compound scaling that uniformly scales network width, depth, and resolution with fixed coefficients, establishing a family of models that achieved state-of-the-art accuracy with significantly fewer parameters and operations than previous approaches.

Read Paper
June 2019
Mobile Architecture Neural Architecture Search

MnasNet: Platform-Aware Neural Architecture Search for Mobile

Tan and colleagues at Google introduced MnasNet, which incorporated latency constraints directly into the architecture search objective, establishing an approach to automatically design efficient mobile models that explicitly balanced accuracy and real-world inference speed on target devices.

Read Paper
August 2019
Data Augmentation Regularization

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Yun and colleagues at NAVER introduced CutMix, a simple yet effective data augmentation strategy that replaced regions of an image with patches from another while mixing the labels proportionally, establishing a powerful regularization technique that improved both classification accuracy and localization ability.

Read Paper
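
CutMix is easy to reproduce: paste a random box from a shuffled copy of the batch and mix the labels by area. A PyTorch sketch (box sampling simplified from the paper's exact scheme):

```python
import torch

def cutmix(images, labels, alpha: float = 1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(images.size(0))          # pairing via a shuffled batch
    _, _, h, w = images.shape
    rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    images[:, :, y1:y2, x1:x2] = images[idx, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)     # actual pasted-area ratio
    return images, labels, labels[idx], lam       # loss = lam*CE(a) + (1-lam)*CE(b)
```
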
November 2019
Generative Models GANs

StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

Karras and colleagues at NVIDIA introduced StyleGAN, a groundbreaking GAN architecture that separated high-level attributes and stochastic variation via a novel style-based design, enabling unprecedented control over generated images and setting new standards for image synthesis quality.

Read Paper

2020-2021

January 2020
Object Detection Transformers

End-to-End Object Detection with Transformers (DETR)

Introduced by Facebook AI, DETR revolutionized object detection by applying transformers to predict objects in an end-to-end manner, eliminating the need for hand-crafted components like anchor boxes and non-maximum suppression.

Read Paper
April 2020
Generative Models Image Synthesis

Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2)

NVIDIA's StyleGAN2 improved upon its predecessor by addressing artifacts and enhancing image quality, setting a new standard for high-resolution image synthesis with generative adversarial networks.

Read Paper
June 2020
3D Reconstruction Neural Rendering

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

UC Berkeley's NeRF introduced a groundbreaking approach to 3D scene representation, using neural networks to model continuous volumetric scenes, enabling photorealistic view synthesis from sparse images.

Read Paper
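
The rendering step behind NeRF is numerical alpha compositing along each camera ray. A NumPy sketch of that quadrature, independent of any network:

```python
import numpy as np

def composite(sigmas, colors, deltas):
    # sigmas: (S,) densities, colors: (S, 3), deltas: (S,) sample spacings
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # rendered RGB
```

In the full method, `sigmas` and `colors` come from an MLP queried at 3D points along the ray, and the compositing is differentiable end to end.
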
July 2020
Convolutional Networks Architecture Design

RepVGG: Making VGG-style ConvNets Great Again

Tsinghua's RepVGG reintroduced simple VGG-style convolutional networks with a novel re-parameterization technique, achieving high performance and efficiency for image classification tasks.

Read Paper
October 2020
Transformers Image Classification

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

Google's Vision Transformer (ViT) adapted transformers for image classification, treating image patches as tokens, achieving state-of-the-art performance and sparking widespread adoption of transformers in vision tasks.

Read Paper
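
The patch-to-token step is the only image-specific part of ViT and fits in a few lines of PyTorch; in the real model the [CLS] token and position embeddings are learned parameters, so the zeros here are placeholders:

```python
import torch
import torch.nn as nn

# A strided convolution splits the image into 16x16 patches and projects each
# to a 768-dimensional token.
patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens = patchify(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
cls = torch.zeros(1, 1, 768)                       # learnable [CLS] in practice
sequence = torch.cat([cls, tokens], dim=1)         # (1, 197, 768) -> transformer
```
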
November 2020
Self-Supervised Learning Representation Learning

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL)

DeepMind's BYOL proposed a novel self-supervised learning method that avoids negative samples, achieving robust visual representations that rival supervised methods, influencing subsequent self-supervised learning frameworks.

Read Paper
January 2021
Multimodal Learning Self-Supervised Learning

Learning Transferable Visual Models From Natural Language Supervision (CLIP)

OpenAI's CLIP trained visual models with natural language supervision, enabling zero-shot image classification and robust cross-modal understanding, significantly impacting multimodal AI applications.

Read Paper
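
Zero-shot classification with CLIP reduces to cosine similarity between one image embedding and one text embedding per class prompt. A PyTorch sketch with random tensors standing in for the real encoders (the temperature value is illustrative, not the model's learned one):

```python
import torch
import torch.nn.functional as F

def zero_shot(image_emb: torch.Tensor, text_embs: torch.Tensor, temp: float = 0.01):
    # image_emb: (D,); text_embs: (num_classes, D), one row per class prompt
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_emb @ text_embs.T / temp  # scaled cosine similarities
    return logits.softmax(dim=-1)            # probability per class prompt

probs = zero_shot(torch.randn(512), torch.randn(3, 512))  # dummy embeddings
```
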
March 2021
Transformers Hierarchical Models

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Microsoft's Swin Transformer introduced a hierarchical architecture with shifted windows, improving efficiency and performance for vision tasks like classification, detection, and segmentation.

Read Paper
April 2021
Self-Supervised Learning Transformers

An Empirical Study of Training Self-Supervised Vision Transformers (MoCo-v3)

Facebook AI's MoCo-v3 refined self-supervised learning for vision transformers, providing insights into stable training and achieving strong performance on large-scale image datasets.

Read Paper
May 2021
Transformers Convolutional Networks

CvT: Introducing Convolutions to Vision Transformers

Microsoft's CvT combined convolutional layers with transformers, enhancing locality and efficiency in vision transformers for tasks like image classification and object detection.

Read Paper
June 2021
Hybrid Models Attention Mechanisms

CoAtNet: Marrying Convolution and Attention for All Data Sizes

Google's CoAtNet fused convolutional and attention mechanisms, creating a versatile architecture that excels across various data scales for vision tasks like classification and detection.

Read Paper
July 2021
Generative Models Image Synthesis

Alias-Free Generative Adversarial Networks (StyleGAN3)

NVIDIA's StyleGAN3 addressed aliasing issues in generative models, producing high-quality, alias-free images with improved consistency for applications like video and animation.

Read Paper
September 2021
Object Detection Real-Time Processing

YOLOX: Exceeding YOLO Series in 2021

Megvii's YOLOX enhanced the YOLO series with innovations like decoupled heads and anchor-free detection, achieving superior performance in real-time object detection tasks.

Read Paper
November 2021
Semantic Segmentation Transformers

MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation

Meta's MaskFormer reframed semantic segmentation as a mask classification problem, leveraging transformers to achieve state-of-the-art results in both semantic and instance segmentation.

Read Paper
December 2021
Self-Supervised Learning Transformers

Masked Autoencoders Are Scalable Vision Learners (MAE)

Meta's MAE introduced a simple yet effective self-supervised learning approach, using masked image patches to train vision transformers, achieving strong performance with high scalability.

Read Paper
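
The distinctive ingredient is the masking itself: the encoder sees only a small random subset of patch tokens. A PyTorch sketch of that sampling step (75% masked, per the paper's default):

```python
import torch

def random_mask(tokens: torch.Tensor, keep_ratio: float = 0.25):
    n, l, d = tokens.shape                      # (batch, patches, dim)
    keep = int(l * keep_ratio)
    scores = torch.rand(n, l)                   # random score per patch
    keep_idx = scores.argsort(dim=1)[:, :keep]  # lowest scores survive
    visible = torch.gather(tokens, 1, keep_idx[..., None].expand(-1, -1, d))
    return visible, keep_idx                    # encoder sees only `visible`
```

The decoder later reconstructs pixel values for the masked patches from shared mask tokens plus position embeddings.
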

2022

January 2022
Transformers Multiscale Vision

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Meta's MViTv2 enhanced multiscale vision transformers, improving efficiency and performance for image classification and object detection, building on hierarchical transformer architectures.

Read Paper
March 2022
Object Detection Real-Time Processing

YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors

Wang, Bochkovskiy, and Liao (Academia Sinica) introduced YOLOv7 with a suite of trainable "bag-of-freebies" enhancements, achieving top performance in real-time object detection with improved accuracy and speed over previous YOLO models.

Read Paper
April 2022
Convolutional Networks Image Classification

A ConvNet for the 2020s (ConvNeXt)

Meta's ConvNeXt modernized convolutional neural networks by incorporating transformer-inspired design principles, achieving competitive performance with transformers in image classification tasks.

Read Paper
May 2022
Transformers Object Detection

Exploring Plain Vision Transformer Backbones for Object Detection (ViTDet)

Meta's ViTDet demonstrated that plain vision transformers could serve as effective backbones for object detection, simplifying architectures while maintaining high performance.

Read Paper
June 2022
Self-Supervised Learning Visual Features

DINOv2: Learning Robust Visual Features without Supervision

Meta's DINOv2 advanced self-supervised learning, producing robust and versatile visual features that excel in downstream tasks like classification and segmentation without requiring labeled data.

Read Paper
July 2022
Self-Supervised Learning Masked Representation

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

BAAI's EVA scaled up masked visual representation learning, achieving state-of-the-art performance in self-supervised vision tasks by leveraging large datasets and transformer architectures.

Read Paper
August 2022
Generative Models Image Synthesis

High-Resolution Image Synthesis with Latent Diffusion Models

Rombach and colleagues at LMU Munich and Runway introduced latent diffusion models, enabling efficient high-resolution image synthesis by operating in a compressed latent space and powering applications like Stable Diffusion.

Read Paper
October 2022
Object Detection Open-Vocabulary

YOLO-World: Real-Time Open-Vocabulary Object Detection

Tencent AI Lab's YOLO-World extended real-time object detection to open-vocabulary settings, enabling detection of arbitrary object categories using language prompts.

Read Paper
November 2022
Generative Models Conditional Generation

Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)

Stanford's ControlNet introduced a framework for adding fine-grained control to diffusion models, enabling precise manipulation of generated images using inputs like edge maps or depth maps.

Read Paper
December 2022
Object Detection Real-Time Processing

RT-DETR: DETRs Beat YOLOs on Real-Time Object Detection

Baidu's RT-DETR combined the strengths of transformer-based DETR models with real-time performance, surpassing YOLO models in speed and accuracy for object detection tasks.

Read Paper

2023

January 2023
Semantic Segmentation Instance Segmentation

Segment Anything (SAM)

Meta's Segment Anything Model (SAM) introduced a versatile framework for image segmentation, capable of generating high-quality masks for objects in any image, enabling zero-shot segmentation across diverse tasks.

Read Paper
February 2023
Multimodal Learning Multilingual Models

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Google's PaLI combined vision and language modeling at scale, supporting multilingual tasks like image captioning and visual question answering, advancing cross-modal understanding.

Read Paper
March 2023
Generative Models Text-to-Image

Muse: Text-To-Image Generation via Masked Generative Transformers

Google's Muse leveraged masked generative transformers for efficient text-to-image generation, achieving high-quality image synthesis with improved training stability and speed.

Read Paper
April 2023
Semantic Segmentation Real-Time Processing

Fast Segment Anything (FastSAM)

FastSAM, from the Chinese Academy of Sciences (CASIA), optimized the Segment Anything model for real-time performance, maintaining high segmentation quality while significantly reducing computational requirements.

Read Paper
May 2023
Self-Supervised Learning Transformers

Emerging Properties in Self-Supervised Vision Transformers (DINO)

Meta's DINO explored emergent properties in self-supervised vision transformers, revealing their ability to learn robust features for tasks like segmentation and classification without supervision.

Read Paper
June 2023
Object Detection Real-Time Processing

YOLOv8: A New Era of Visual AI

Ultralytics' YOLOv8 advanced real-time object detection with improved accuracy, speed, and versatility, supporting tasks like detection, segmentation, and classification.

Read Paper
July 2023
Foundation Models Deformable Convolutions

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Shanghai AI Lab's InternImage introduced deformable convolutions to large-scale vision foundation models, enhancing flexibility and performance in tasks like classification and detection.

Read Paper
August 2023
Instance Perception Object Retrieval

UNINEXT: Universal Instance Perception as Object Discovery and Retrieval

UNINEXT, from Dalian University of Technology and ByteDance, proposed a unified framework for instance perception, treating tasks like detection and segmentation as object discovery and retrieval, achieving robust performance across domains.

Read Paper
September 2023
Generative Models Text-to-Image

DALL-E 3: Improving Image Generation with Better Captions

OpenAI's DALL-E 3 enhanced text-to-image generation by leveraging improved captioning techniques, producing more accurate and detailed images aligned with textual prompts.

Read Paper
October 2023
3D Object Detection Transformers

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

DETR3D, led by researchers at MIT, extended transformer-based detection to 3D, using multi-view images and 3D-to-2D queries to achieve robust 3D object detection for autonomous driving and robotics.

Read Paper
November 2023
Video Generation Generative Models

Sora: Video Generation Models as World Simulators

OpenAI's Sora introduced advanced video generation models that simulate physical world dynamics, producing high-quality, coherent videos from text prompts, advancing generative AI for video.

Read Paper
December 2023
Multimodal Learning Instruction Tuning

Visual Instruction Tuning (LLaVA)

LLaVA, from the University of Wisconsin-Madison and Microsoft, introduced visual instruction tuning, enhancing multimodal models by fine-tuning with visual-text instruction data and improving performance in vision-language tasks like question answering.

Read Paper

2024

January 2024
Gesture Synthesis State Space Models

MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Samsung's MambaTalk introduced selective state space models for efficient gesture synthesis, enabling realistic and computationally lightweight generation of human gestures for applications in virtual reality and animation.

Read Paper
February 2024
Visual Representation State Space Models

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Vision Mamba, from Huazhong University of Science and Technology, applied bidirectional state space models to visual representation learning, offering a computationally efficient alternative to transformers for tasks like image classification and object detection.

Read Paper
March 2024
Foundation Models Image and Video

GLEE: General Object Foundation Model for Images and Videos at Scale

GLEE, from Huazhong University of Science and Technology and ByteDance, introduced a scalable foundation model for general object understanding in images and videos, enabling robust performance across tasks like detection, segmentation, and tracking.

Read Paper
April 2024
Semantic Segmentation Instance Segmentation

Segment Everything Everywhere All at Once (SEEM)

UW-Madison and Microsoft's SEEM unified multiple segmentation tasks (semantic, instance, and panoptic) into a single framework, achieving state-of-the-art performance with a versatile, prompt-driven approach.

Read Paper
May 2024
Segmentation Unified Model

OMG-Segment: One Model Goes to Segment Everything

Peking University's OMG-Segment proposed a single model capable of performing all segmentation tasks, from semantic to instance and panoptic, with high efficiency and generalizability across datasets.

Read Paper
June 2024
Unified Representation Vision Tasks

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Microsoft's Florence-2 developed a unified representation model for diverse vision tasks, including classification, detection, and captioning, achieving strong performance with a single architecture.

Read Paper
July 2024
Text-to-Image Generative Models

Stable Diffusion 3.5: Advanced Text-to-Image Generation with Precise Control

Stability AI's Stable Diffusion 3.5 improved text-to-image generation with enhanced control mechanisms, delivering higher quality and more accurate images aligned with complex prompts.

Read Paper
August 2024
Feature Representation Multimodal Vision

OmniVec: Unifying Feature Representations for Vision-Language-Audio Tasks

OmniVec unified feature representations across vision, language, and audio, enabling robust performance in multimodal tasks like image captioning and visual question answering.

Read Paper
September 2024
Image Transformation Cross-Domain Recognition

AdaForm: Adaptive Image Transformation Networks for Cross-Domain Visual Recognition

DeepMind's AdaForm developed adaptive transformation networks for cross-domain visual recognition, improving robustness in scenarios with domain shifts, such as synthetic-to-real image adaptation.

Read Paper
October 2024
Multimodal Vision Pre-Training

Gemini Vision: Advancing Multi-Modal Understanding Through Massive Scale Visual Pre-Training

Google's Gemini Vision leveraged massive-scale visual pre-training to advance multimodal understanding, achieving state-of-the-art performance in tasks like image classification and visual question answering.

Read Paper
November 2024
Object Detection Open-Vocabulary

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Vocabulary Object Detection

IDEA Research's Grounding DINO combined the DINO detection transformer with grounded vision-language pre-training, enabling open-vocabulary object detection with unprecedented flexibility and accuracy.

Read Paper
December 2024
3D Reconstruction Neural Rendering

LGM: Large Gaussian Splatting for Scalable 3D Reconstruction

Tsinghua University's LGM advanced 3D reconstruction with large-scale Gaussian splatting, offering scalable and high-fidelity neural rendering for real-time 3D scene synthesis.

Read Paper

2025

February 2025
Hybrid Models Visual Representation

MambaVision: A Hybrid Mamba-Transformer Backbone for Computer Vision

NVIDIA's MambaVision introduced the first hybrid Mamba-Transformer architecture for computer vision, achieving state-of-the-art performance in image classification and object detection with improved computational efficiency over traditional transformers.

Read Paper
March 2025
Dataset Distillation Image Analysis

Dataset Distillation with Neural Characteristic Function: A Minmax Perspective

This work from Shaobo Wang et al. proposed a novel dataset distillation method using neural characteristic functions with a minmax optimization approach, enabling efficient training of computer vision models with significantly reduced dataset sizes.

Read Paper
April 2025
3D Reconstruction Photorealism

MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views

This CVPR 2025 highlight paper presented a novel appearance model using Gaussian splatting and an atlas of charts, achieving high-quality 3D geometry and photorealistic rendering from sparse image views, advancing 3D reconstruction techniques.

Read Paper
April 2025
Foundation Models Object Detection

RADIOv2.5: A Flexible Vision Encoder for Robust Multi-Task Learning

NVIDIA's RADIOv2.5 distilled multiple teacher models (DFN CLIP, DINOv2, SAM, SigLIP) into a single vision encoder using advanced training techniques, offering a flexible foundation model for tasks like object detection and segmentation across varying resolutions.

Read Paper
April 2025
Image Inpainting Diffusion Models

ESDiff: Encoding Strategy-Inspired Diffusion Model with Few-Shot Learning for Color Image Inpainting

This CVPR 2025 paper by Junyan Zhang et al. introduced ESDiff, a diffusion model inspired by encoding strategies, enabling high-quality color image inpainting with few-shot learning, improving efficiency and performance in image restoration tasks.

Read Paper
April 2025
Medical Imaging Segmentation

Mamba-Sea: A Mamba-Based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation

Accepted to IEEE TMI 2025, Mamba-Sea by Zihan Cheng et al. utilized a Mamba-based framework with global-to-local sequence augmentation, enhancing generalizability in medical image segmentation for diverse clinical applications.

Read Paper