Landmark Papers in Computer Vision
Landmark Papers in Computer Vision is a curated collection of the foundational research that has shaped the field. I've selected these papers to highlight the key breakthroughs and conceptual advances that have defined the evolution of visual perception systems, providing historical context and significance for researchers and enthusiasts alike.
1960s-1980s
Machine Perception of Three-Dimensional Solids
This pioneering work by Roberts at MIT introduced the Roberts operator, one of the first edge-detection algorithms, and laid the groundwork for computational approaches to 3D object recognition from 2D images, establishing fundamental techniques for extracting structure from visual data.
Computer Detection of Human Faces
This early work from USC on automated face detection established initial approaches for computational face recognition, exploring edge-based techniques to isolate and identify facial features in images decades before modern deep learning approaches.
Theory of Edge Detection
Marr and Hildreth's influential work at MIT proposed detecting edges as zero-crossings of the Laplacian of Gaussian applied at multiple scales, connecting biological vision systems to computational models and introducing multi-scale representations that continue to influence modern computer vision.
Neocognitron: A Self-organizing Neural Network Model for Pattern Recognition
Fukushima's groundbreaking work at NHK Labs introduced the Neocognitron, a hierarchical neural network inspired by the visual cortex that established the concept of increasingly complex feature extraction through layers, directly influencing modern convolutional neural networks.
A Computational Approach to Edge Detection
John Canny's work at MIT introduced the Canny edge detector, a multi-stage algorithm that optimizes detection, localization, and minimal response criteria, becoming the most widely used edge detection method and establishing mathematical rigor in feature extraction.
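For a sense of how accessible this algorithm remains, here is a minimal sketch using OpenCV's built-in implementation (assumes `opencv-python` is installed; the file path and thresholds are placeholders):

```python
import cv2

# Canny's pipeline: Gaussian smoothing, gradient computation,
# non-maximum suppression, then hysteresis thresholding.
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 100, 200)  # low/high hysteresis thresholds
cv2.imwrite("edges.png", edges)
```

The two thresholds implement Canny's hysteresis step: strong edges above the high threshold seed contours, which are then extended through pixels above the low threshold.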
A Computational Framework for Visual Motion
This seminal work from MIT established fundamental methods for optical flow calculation, providing mathematical techniques to estimate motion between frames that remain foundational for video processing, action recognition, and object tracking applications.
Backpropagation Applied to Handwritten Zip Code Recognition
This influential work from Bell Labs demonstrated the practical application of neural networks with backpropagation for visual pattern recognition, establishing a framework for training deep networks on image data that would eventually lead to modern deep learning approaches.
1990s
Eigenfaces for Recognition
This groundbreaking paper from MIT introduced eigenfaces, a principal component analysis approach to efficiently represent faces in a lower-dimensional space, revolutionizing facial recognition and establishing core techniques for statistical pattern recognition in computer vision.
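The core of the method fits in a few lines of NumPy; this is a sketch assuming `faces` is an array of flattened, equally-sized grayscale face images:

```python
import numpy as np

def eigenfaces(faces: np.ndarray, k: int):
    """faces: (n_samples, n_pixels) array; returns top-k components."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Rows of vt are the principal components ("eigenfaces"),
    # ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                # (k, n_pixels)
    weights = centered @ components.T  # low-dimensional face codes
    return mean, components, weights
```

Recognition then reduces to nearest-neighbor search among the k-dimensional weight vectors rather than among raw pixels.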
Snakes: Active Contour Models
Kass, Witkin, and Terzopoulos introduced active contour models, or "snakes": energy-minimizing splines guided by external forces and image constraints, establishing a powerful framework for object boundary detection that continues to influence medical image analysis and object segmentation.
Graph Cuts for Image Segmentation
This influential work from Cornell introduced the application of graph cut optimization to image segmentation, formulating the problem as finding the minimum cut in a graph, establishing energy minimization approaches that would transform object segmentation and stereo correspondence.
Normalized Cuts and Image Segmentation
Shi and Malik at Berkeley introduced normalized cuts, a theoretically sound spectral clustering approach to image segmentation that measures both the dissimilarity between different groups and the similarity within groups, establishing a foundation for perceptual grouping in computer vision.
Gradient-Based Learning Applied to Document Recognition
Yann LeCun and colleagues at AT&T/Bell Labs introduced LeNet-5, a pioneering convolutional neural network architecture for handwritten digit recognition that demonstrated end-to-end training from pixels to classification, establishing the foundation for modern deep learning approaches in computer vision.
A Global Geometric Framework for Nonlinear Dimensionality Reduction
This influential work from Stanford introduced ISOMAP, a technique for discovering nonlinear manifolds in high-dimensional data that preserves geodesic distances, establishing a powerful approach for understanding the intrinsic structure of visual data that influenced subsequent manifold learning methods.
2000-2009
Rapid Object Detection using a Boosted Cascade of Simple Features
Viola and Jones introduced a revolutionary real-time face detection framework using Haar-like features and AdaBoost, the first algorithm capable of reliable face detection at 15+ frames per second, transforming practical computer vision applications and enabling embedded vision systems.
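The detector still ships with OpenCV; a minimal usage sketch (the image path and parameters are illustrative):

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
# scaleFactor sets the image-pyramid step; minNeighbors trades
# precision against recall when merging overlapping detections.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face at x={x}, y={y}, size {w}x{h}")
```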
Pictorial Structures for Object Recognition
Felzenszwalb and Huttenlocher formalized pictorial structures, representing objects as collections of parts arranged in deformable configurations, establishing a mathematically principled approach to object recognition that would later influence part-based models and pose estimation.
Distinctive Image Features from Scale-Invariant Keypoints
David Lowe at the University of British Columbia introduced SIFT (Scale-Invariant Feature Transform), a groundbreaking algorithm for detecting and describing local features invariant to scale, rotation, and illumination changes, revolutionizing object recognition, image matching, and 3D reconstruction.
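SIFT is available directly in modern OpenCV (the patent expired in 2020); a minimal sketch with a placeholder image path:

```python
import cv2

sift = cv2.SIFT_create()
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each descriptor is a 128-D vector of local gradient-orientation
# histograms, normalized for scale and rotation invariance.
print(len(keypoints), descriptors.shape)  # e.g. N, (N, 128)
```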
Histograms of Oriented Gradients for Human Detection
Dalal and Triggs at INRIA introduced HOG (Histograms of Oriented Gradients), a feature descriptor that captures local gradient orientation statistics, dramatically improving human detection performance and establishing a descriptor that would influence object recognition approaches for over a decade.
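OpenCV bundles the descriptor together with a default pedestrian SVM; a minimal sketch (the image path is a placeholder):

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
img = cv2.imread("street.jpg")
# Gradient-orientation histograms are pooled over cells and blocks,
# then scored by a linear SVM in a sliding window over a pyramid.
boxes, scores = hog.detectMultiScale(img, winStride=(8, 8))
print(boxes, scores)
```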
SURF: Speeded Up Robust Features
Bay and colleagues at ETH Zurich introduced SURF, a computationally efficient alternative to SIFT that used integral images and box filters to approximate derivatives, significantly accelerating feature detection and description while maintaining robustness for real-time applications.
BRIEF: Binary Robust Independent Elementary Features
Calonder and colleagues at EPFL introduced BRIEF, a binary feature descriptor that used simple intensity difference tests to create highly discriminative bit strings, dramatically reducing memory requirements and computation time compared to floating-point descriptors like SIFT and SURF.
ImageNet: A Large-Scale Hierarchical Image Database
Deng and colleagues at Princeton introduced ImageNet, a massive dataset of over 14 million labeled images organized according to WordNet hierarchy, providing unprecedented scale for training visual recognition systems and ultimately catalyzing the deep learning revolution in computer vision.
Read Paper2010-2015
The PASCAL Visual Object Classes Challenge
Everingham and colleagues at Oxford/Edinburgh established the PASCAL VOC challenge, creating standardized datasets and evaluation protocols for object detection and segmentation that became the primary benchmark for comparing computer vision algorithms for nearly a decade.
Object Detection with Discriminatively Trained Part-Based Models
Felzenszwalb and colleagues at the University of Chicago introduced Deformable Part Models (DPM), a discriminative approach combining HOG features with latent SVM training to model objects as collections of parts, setting state-of-the-art performance in object detection before the deep learning revolution.
ImageNet Classification with Deep Convolutional Neural Networks
Krizhevsky, Sutskever, and Hinton at the University of Toronto introduced AlexNet, a deep convolutional neural network that dramatically outperformed previous approaches on the ImageNet challenge, catalyzing the deep learning revolution in computer vision and establishing the CNN architecture as the dominant paradigm for visual recognition tasks.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick and colleagues at UC Berkeley introduced R-CNN (Regions with CNN features), the first highly effective approach to combine region proposals with deep convolutional features, establishing a new paradigm for object detection that would dominate the field for years to come.
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
He and colleagues at Microsoft introduced SPPNet, which added a spatial pyramid pooling layer allowing CNNs to handle images of arbitrary size/scale and generate fixed-length representations, significantly improving efficiency by sharing computation across region proposals.
Going Deeper with Convolutions
Szegedy and colleagues at Google introduced GoogLeNet/Inception, a novel architecture using inception modules with parallel convolutions at different scales, dramatically reducing parameters while increasing depth, winning the 2014 ImageNet competition and establishing new principles for efficient network design.
Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan and Zisserman at Oxford introduced VGGNet, which demonstrated the importance of network depth by using small 3×3 convolution filters stacked to create effective receptive fields, establishing a simple yet powerful architecture that became a standard feature extractor for many computer vision tasks.
Deep Learning Face Representation by Joint Identification-Verification
Sun and colleagues at CUHK introduced DeepID, a deep learning approach that jointly optimized face identification and verification tasks, significantly advancing face recognition performance and establishing multi-task learning principles that would influence subsequent facial recognition systems.
Fast R-CNN
Girshick at Microsoft improved upon R-CNN with Fast R-CNN, which enabled end-to-end detector training by pooling CNN features from regions of interest, dramatically increasing both speed and accuracy for object detection while simplifying the multi-stage training pipeline.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Ren and colleagues at Microsoft introduced Faster R-CNN, which integrated region proposal generation into the detection network with a Region Proposal Network, creating the first near real-time high-accuracy object detection system and establishing a unified framework that influenced numerous subsequent approaches.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger and colleagues at the University of Freiburg introduced U-Net, an elegant encoder-decoder architecture with skip connections that enabled precise segmentation with limited training data, revolutionizing medical image analysis and establishing a fundamental architecture for dense prediction tasks.
Fully Convolutional Networks for Semantic Segmentation
Long, Shelhamer, and Darrell at Berkeley introduced FCN, transforming classification networks into fully convolutional ones that could produce dense, pixel-wise predictions, establishing the fundamental approach to semantic segmentation that continues to influence modern architectures.
Deep Residual Learning for Image Recognition
He and colleagues at Microsoft introduced ResNet, which enabled training of extremely deep networks through residual connections that created shortcuts across layers, solving the vanishing gradient problem and establishing a fundamental architecture that continues to serve as the backbone for numerous computer vision systems.
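The key idea is small enough to sketch in PyTorch; channel counts here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # The identity shortcut lets the block learn a residual F(x)
        # on top of x, so gradients flow even through many layers.
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))  # shape preserved
```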
SSD: Single Shot MultiBox Detector
Liu and colleagues at UNC Chapel Hill and Google introduced SSD, a detection framework that eliminated proposal generation and feature resampling by predicting at multiple scales directly from feature maps, establishing a high-speed detection approach that balanced accuracy and efficiency for real-time applications.
Read Paper2016-2019
DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Taigman and colleagues at Facebook introduced DeepFace, a deep learning system for face verification that approached human-level performance through 3D alignment, a large-scale private training dataset, and a deep CNN architecture, helping establish face recognition as one of the first computer vision tasks to achieve near-human accuracy.
Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
Chen and colleagues at Google introduced DeepLabv1, which combined atrous (dilated) convolutions to efficiently capture multi-scale context with fully connected CRFs for boundary refinement, establishing key techniques for accurate semantic segmentation that would influence numerous subsequent approaches.
You Only Look Once: Unified, Real-Time Object Detection
Redmon and colleagues at the University of Washington introduced YOLO, a revolutionary object detection approach that framed detection as a single regression problem from images to bounding boxes and class probabilities, enabling unprecedented speed while maintaining competitive accuracy, establishing a new paradigm for real-time vision.
SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size
Iandola and colleagues at UC Berkeley introduced SqueezeNet, a compact CNN architecture that achieved AlexNet-level accuracy with 50x fewer parameters through fire modules combining squeeze and expand operations, establishing important principles for efficient network design for mobile and embedded systems.
Microsoft COCO: Common Objects in Context
Lin and colleagues at Microsoft introduced COCO, a large-scale object detection, segmentation, and captioning dataset with complex everyday scenes containing multiple objects in their natural context, establishing a more challenging benchmark that drove advances in instance segmentation and dense prediction tasks.
Pyramid Scene Parsing Network
Zhao and colleagues at SenseTime/CUHK introduced PSPNet, which utilized a pyramid pooling module to aggregate context at multiple scales, effectively capturing global and local information for scene parsing, establishing a new approach to multi-scale feature representation that influenced numerous segmentation methods.
Mask R-CNN
He and colleagues at Facebook AI Research introduced Mask R-CNN, extending Faster R-CNN with a parallel mask prediction branch for instance segmentation, establishing a flexible framework for multiple vision tasks and achieving state-of-the-art results that would influence object detection and segmentation for years to come.
Focal Loss for Dense Object Detection
Lin and colleagues at Facebook AI introduced focal loss and RetinaNet, addressing the extreme foreground-background class imbalance in dense detection by down-weighting easy examples, enabling single-stage detectors to outperform two-stage approaches and establishing a key technique for addressing imbalanced datasets.
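The loss itself is a one-line modification of cross-entropy; a minimal binary-classification sketch in PyTorch with the commonly cited defaults:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets,
                                            reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss on easy, confident examples,
    # so the flood of easy background anchors stops dominating training.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```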
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Howard and colleagues at Google introduced MobileNets, which utilized depthwise separable convolutions to dramatically reduce computation and parameters while maintaining reasonable accuracy, establishing fundamental techniques for efficient model design that would enable computer vision on resource-constrained devices.
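A depthwise separable layer factors a standard convolution into a per-channel spatial filter and a 1x1 channel mixer; a PyTorch sketch:

```python
import torch.nn as nn

def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        # groups=in_ch: the 3x3 filter sees one channel at a time.
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # The 1x1 pointwise conv then mixes information across channels.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Per output position this costs roughly 9·C_in + C_in·C_out multiplies instead of 9·C_in·C_out for a standard 3x3 convolution, which is where the mobile-scale savings come from.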
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Zhang and colleagues at Face++ introduced ShuffleNet, which utilized pointwise group convolutions and channel shuffling to reduce computation while maintaining accuracy, establishing novel techniques for designing highly efficient networks that influenced numerous subsequent mobile-friendly architectures.
Attention is All You Need
Vaswani and colleagues at Google introduced the Transformer architecture based entirely on attention mechanisms, initially for NLP but eventually revolutionizing computer vision by providing a new paradigm beyond convolutions that would lead to Vision Transformers and numerous attention-based visual models.
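The central operation is compact enough to state directly; a single-head sketch in PyTorch:

```python
import math
import torch

def attention(q, k, v):
    # Compare every query with every key, scaled by sqrt(d_k)
    # to keep the softmax inputs in a well-behaved range.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # each output is a weighted mix of the values

q = k = v = torch.randn(1, 10, 64)  # (batch, tokens, dim)
out = attention(q, k, v)            # (1, 10, 64)
```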
Dynamic Routing Between Capsules
Sabour, Frosst, and Hinton at Google Brain introduced CapsNet, which modeled hierarchical relationships between object parts using capsules whose vector outputs preserve more information than scalar features, proposing a fundamentally different approach to representation learning that addresses key limitations of CNNs.
Densely Connected Convolutional Networks
Huang and colleagues at Cornell/Tsinghua introduced DenseNet, which connected each layer to every other layer in a feed-forward fashion to encourage feature reuse, improve gradient flow, and reduce parameters, establishing a powerful architecture for efficient learning that influenced numerous subsequent network designs.
Learning Transferable Architectures for Scalable Image Recognition
Zoph and colleagues at Google introduced NASNet, which used reinforcement learning to search for optimal neural architecture building blocks that could be transferred across datasets, establishing automated architecture design approaches that would launch an entire field of neural architecture search.
YOLOv3: An Incremental Improvement
Redmon and Farhadi at the University of Washington refined the YOLO architecture with multi-scale predictions, better feature extractors, and various design improvements, establishing YOLOv3 as the standard real-time detector balancing speed and accuracy that would be widely adopted in practical applications.
A Neural Algorithm of Artistic Style
Gatys and colleagues at the University of Tübingen separated and recombined content and style representations from different images, enabling artistic style transfer by optimizing for content similarity and style statistics, establishing a novel application of neural networks that sparked significant interest in creative AI.
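The style representation is the Gram matrix of a layer's activations; a minimal PyTorch sketch:

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (channels, height, width) activations of one layer
    c, h, w = features.shape
    flat = features.view(c, h * w)
    # Channel-to-channel correlations; spatial layout is discarded,
    # which is why Gram statistics capture texture rather than content.
    return (flat @ flat.t()) / (c * h * w)

a, b = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
style_loss = ((gram_matrix(a) - gram_matrix(b)) ** 2).sum()
```

Style transfer minimizes this style loss across several layers, combined with a content loss on deeper-layer activations.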
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Chen and colleagues at Google introduced DeepLabv3+, which combined an encoder-decoder structure with atrous separable convolutions, establishing a powerful and efficient architecture for semantic segmentation that achieved state-of-the-art results while maintaining computational efficiency.
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Brock and colleagues at DeepMind introduced BigGAN, which demonstrated the benefits of scaling up GAN training with larger batch sizes and more parameters, establishing new benchmarks for image synthesis quality and revealing the importance of training dynamics for generative models.
Objects as Points
Zhou and colleagues at the University of Texas introduced CenterNet, which modeled objects as points (their center) and regressed to other properties, establishing a simple yet effective approach to detection that unified object detection, human pose estimation, and 3D detection in a single framework.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Tan and Le at Google introduced EfficientNet, which proposed compound scaling that uniformly scales network width, depth, and resolution with fixed coefficients, establishing a family of models that achieved state-of-the-art accuracy with significantly fewer parameters and operations than previous approaches.
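The scaling rule is plain arithmetic: pick α, β, γ once by grid search so that α·β²·γ² ≈ 2, then raise all three to a shared coefficient φ. A sketch with the paper's coefficients and illustrative base values:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution factors

def scale(phi: int, depth=18, width=64, resolution=224):
    # Each +1 in phi roughly doubles FLOPs, since alpha*beta^2*gamma^2 ~ 2.
    return (round(depth * alpha ** phi),
            round(width * beta ** phi),
            round(resolution * gamma ** phi))

for phi in range(4):
    print(phi, scale(phi))  # (layers, channels, input size)
```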
MnasNet: Platform-Aware Neural Architecture Search for Mobile
Tan and colleagues at Google introduced MnasNet, which incorporated latency constraints directly into the architecture search objective, establishing an approach to automatically design efficient mobile models that explicitly balanced accuracy and real-world inference speed on target devices.
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Yun and colleagues at NAVER introduced CutMix, a simple yet effective data augmentation strategy that replaced regions of an image with patches from another while mixing the labels proportionally, establishing a powerful regularization technique that improved both classification accuracy and localization ability.
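The augmentation is a few lines of tensor indexing; a minimal PyTorch sketch:

```python
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    lam = np.random.beta(alpha, alpha)     # initial mixing ratio
    perm = torch.randperm(images.size(0))  # partner for each image
    h, w = images.shape[-2:]
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lambda from the area actually pasted.
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)
    return images, labels, labels[perm], lam

# Train with: lam * loss(pred, labels) + (1 - lam) * loss(pred, labels[perm])
```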
StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks
Karras and colleagues at NVIDIA introduced StyleGAN, a groundbreaking GAN architecture that separated high-level attributes and stochastic variation via a novel style-based design, enabling unprecedented control over generated images and setting new standards for image synthesis quality.
Read Paper2020-2021
End-to-End Object Detection with Transformers (DETR)
Introduced by Facebook AI, DETR revolutionized object detection by applying transformers to predict objects in an end-to-end manner, eliminating the need for hand-crafted components like anchor boxes and non-maximum suppression.
Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2)
NVIDIA's StyleGAN2 improved upon its predecessor by addressing artifacts and enhancing image quality, setting a new standard for high-resolution image synthesis with generative adversarial networks.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
UC Berkeley's NeRF introduced a groundbreaking approach to 3D scene representation, using neural networks to model continuous volumetric scenes, enabling photorealistic view synthesis from sparse images.
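At its core, NeRF turns densities along a camera ray into compositing weights; a NumPy sketch of that volume-rendering step:

```python
import numpy as np

def render_weights(sigma, deltas):
    # alpha_i = 1 - exp(-sigma_i * delta_i): chance the ray terminates
    # within sample i's interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # T_i: probability the ray survives all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha  # weights for alpha-compositing colors

w = render_weights(np.random.rand(64), np.full(64, 0.03))
# pixel color = (w[:, None] * per_sample_rgb).sum(axis=0)
```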
RepVGG: Making VGG-style ConvNets Great Again
Tsinghua's RepVGG reintroduced simple VGG-style convolutional networks with a novel re-parameterization technique, achieving high performance and efficiency for image classification tasks.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
Google's Vision Transformer (ViT) adapted transformers for image classification, treating image patches as tokens, achieving state-of-the-art performance and sparking widespread adoption of transformers in vision tasks.
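The "words" are just non-overlapping patches; the tokenization step is a pair of reshapes, sketched here in PyTorch for 16x16 patches:

```python
import torch

img = torch.randn(1, 3, 224, 224)              # (batch, C, H, W)
p = 16
patches = img.unfold(2, p, p).unfold(3, p, p)  # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)
# tokens: (1, 196, 768). A learned linear projection plus position
# embeddings turn these into the Transformer's input sequence.
print(tokens.shape)
```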
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL)
DeepMind's BYOL proposed a novel self-supervised learning method that avoids negative samples, achieving robust visual representations that rival supervised methods, influencing subsequent self-supervised learning frameworks.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
OpenAI's CLIP trained visual models with natural language supervision, enabling zero-shot image classification and robust cross-modal understanding, significantly impacting multimodal AI applications.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Microsoft's Swin Transformer introduced a hierarchical architecture with shifted windows, improving efficiency and performance for vision tasks like classification, detection, and segmentation.
An Empirical Study of Training Self-Supervised Vision Transformers (MoCo-v3)
Facebook AI's MoCo-v3 refined self-supervised learning for vision transformers, providing insights into stable training and achieving strong performance on large-scale image datasets.
CvT: Introducing Convolutions to Vision Transformers
Microsoft's CvT combined convolutional layers with transformers, enhancing locality and efficiency in vision transformers for tasks like image classification and object detection.
CoAtNet: Marrying Convolution and Attention for All Data Sizes
Google's CoAtNet fused convolutional and attention mechanisms, creating a versatile architecture that excels across various data scales for vision tasks like classification and detection.
Alias-Free Generative Adversarial Networks (StyleGAN3)
NVIDIA's StyleGAN3 addressed aliasing issues in generative models, producing high-quality, alias-free images with improved consistency for applications like video and animation.
YOLOX: Exceeding YOLO Series in 2021
Megvii's YOLOX enhanced the YOLO series with innovations like decoupled heads and anchor-free detection, achieving superior performance in real-time object detection tasks.
MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation
Meta's MaskFormer reframed semantic segmentation as a mask classification problem, leveraging transformers to achieve state-of-the-art results in both semantic and panoptic segmentation.
Masked Autoencoders Are Scalable Vision Learners (MAE)
Meta's MAE introduced a simple yet effective self-supervised learning approach, using masked image patches to train vision transformers, achieving strong performance with high scalability.
Read Paper2022
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Meta's MViTv2 enhanced multiscale vision transformers, improving efficiency and performance for image classification and object detection, building on hierarchical transformer architectures.
YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
Wang, Bochkovskiy, and Liao at Academia Sinica introduced a suite of trainable "bag-of-freebies" enhancements in YOLOv7, achieving top performance in real-time object detection with improved accuracy and speed over previous YOLO models.
A ConvNet for the 2020s (ConvNeXt)
Meta's ConvNeXt modernized convolutional neural networks by incorporating transformer-inspired design principles, achieving competitive performance with transformers in image classification tasks.
Exploring Plain Vision Transformer Backbones for Object Detection (ViTDet)
Meta's ViTDet demonstrated that plain vision transformers could serve as effective backbones for object detection, simplifying architectures while maintaining high performance.
DINOv2: Learning Robust Visual Features without Supervision
Meta's DINOv2 advanced self-supervised learning, producing robust and versatile visual features that excel in downstream tasks like classification and segmentation without requiring labeled data.
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
BAAI's EVA scaled up masked visual representation learning, achieving state-of-the-art performance in self-supervised vision tasks by leveraging large datasets and transformer architectures.
High-Resolution Image Synthesis with Latent Diffusion Models
Rombach and colleagues at LMU Munich introduced latent diffusion models, enabling efficient high-resolution image synthesis by running the diffusion process in a compressed latent space, and powering applications like Stable Diffusion.
YOLO-World: Real-Time Open-Vocabulary Object Detection
Tencent AI Lab's YOLO-World extended real-time object detection to open-vocabulary settings, enabling detection of arbitrary object categories specified by language prompts.
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)
Stanford's ControlNet introduced a framework for adding fine-grained control to diffusion models, enabling precise manipulation of generated images using inputs like edge maps or depth maps.
RT-DETR: DETRs Beat YOLOs on Real-Time Object Detection
Baidu's RT-DETR combined the strengths of transformer-based DETR models with real-time performance, surpassing YOLO models in speed and accuracy for object detection tasks.
Read Paper2023
Segment Anything (SAM)
Meta's Segment Anything Model (SAM) introduced a versatile framework for image segmentation, capable of generating high-quality masks for objects in any image, enabling zero-shot segmentation across diverse tasks.
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Google's PaLI combined vision and language modeling at scale, supporting multilingual tasks like image captioning and visual question answering, advancing cross-modal understanding.
Muse: Text-To-Image Generation via Masked Generative Transformers
Google's Muse leveraged masked generative transformers for efficient text-to-image generation, achieving high-quality image synthesis with improved training stability and speed.
Fast Segment Anything (FastSAM)
FastSAM, from the Chinese Academy of Sciences, optimized the Segment Anything task for real-time performance, maintaining high segmentation quality while significantly reducing computational requirements.
Emerging Properties in Self-Supervised Vision Transformers (DINO)
Meta's DINO explored emergent properties in self-supervised vision transformers, revealing their ability to learn robust features for tasks like segmentation and classification without supervision.
YOLOv8: A New Era of Visual AI
Ultralytics' YOLOv8 advanced real-time object detection with improved accuracy, speed, and versatility, supporting tasks like detection, segmentation, and classification.
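As YOLOv8 is distributed as a library rather than a paper artifact, typical usage is a few lines (assumes `pip install ultralytics`; weights download on first use and the image path is a placeholder):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")     # nano variant, tuned for speed
results = model("street.jpg")  # one forward pass per image
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)  # coordinates, score, class id
```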
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Shanghai AI Lab's InternImage introduced deformable convolutions to large-scale vision foundation models, enhancing flexibility and performance in tasks like classification and detection.
UNINEXT: Universal Instance Perception as Object Discovery and Retrieval
UNINEXT, from Dalian University of Technology and ByteDance, proposed a unified framework for instance perception, treating tasks like detection and segmentation as object discovery and retrieval, achieving robust performance across domains.
DALL-E 3: Improving Image Generation with Better Captions
OpenAI's DALL-E 3 enhanced text-to-image generation by leveraging improved captioning techniques, producing more accurate and detailed images aligned with textual prompts.
DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries
MIT and Tsinghua's DETR3D extended transformer-based detection to 3D, using multi-view images and 3D-to-2D queries to achieve robust 3D object detection for autonomous driving and robotics.
Sora: Video Generation Models as World Simulators
OpenAI's Sora introduced advanced video generation models that simulate physical world dynamics, producing high-quality, coherent videos from text prompts, advancing generative AI for video.
Visual Instruction Tuning (LLaVA)
LLaVA, from the University of Wisconsin-Madison and Microsoft, introduced visual instruction tuning, enhancing multimodal models by fine-tuning on visual-text instruction data and improving performance in vision-language tasks like question answering.
Read Paper2024
MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
Samsung's MambaTalk introduced selective state space models for efficient gesture synthesis, enabling realistic and computationally lightweight generation of human gestures for applications in virtual reality and animation.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vision Mamba (Vim), from Huazhong University of Science and Technology, applied bidirectional state space models to visual representation learning, offering a computationally efficient alternative to transformers for tasks like image classification and object detection.
GLEE: General Object Foundation Model for Images and Videos at Scale
GLEE introduced a scalable foundation model for general object understanding in images and videos, enabling robust performance across tasks like detection, segmentation, and tracking.
Segment Everything Everywhere All at Once (SEEM)
UW-Madison and Microsoft's SEEM unified multiple segmentation tasks (semantic, instance, and panoptic) into a single framework, achieving state-of-the-art performance with a versatile, prompt-driven approach.
OMG-Segment: One Model Goes to Segment Everything
OMG-Segment proposed a single model capable of performing all segmentation tasks, from semantic to instance and panoptic, with high efficiency and generalizability across datasets.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Microsoft's Florence-2 developed a unified representation model for diverse vision tasks, including classification, detection, and captioning, achieving strong performance with a single architecture.
Stable Diffusion 3.5: Advanced Text-to-Image Generation with Precise Control
Stability AI's Stable Diffusion 3.5 improved text-to-image generation with enhanced control mechanisms, delivering higher quality and more accurate images aligned with complex prompts.
OmniVec: Unifying Feature Representations for Vision-Language-Audio Tasks
OmniVec unified feature representations across vision, language, and audio, enabling robust performance in multimodal tasks like image captioning and visual question answering.
AdaForm: Adaptive Image Transformation Networks for Cross-Domain Visual Recognition
DeepMind's AdaForm developed adaptive transformation networks for cross-domain visual recognition, improving robustness in scenarios with domain shifts, such as synthetic-to-real image adaptation.
Gemini Vision: Advancing Multi-Modal Understanding Through Massive Scale Visual Pre-Training
Google's Gemini Vision leveraged massive-scale visual pre-training to advance multimodal understanding, achieving state-of-the-art performance in tasks like image classification and visual question answering.
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Vocabulary Object Detection
IDEA Research's Grounding DINO combined self-supervised vision transformers with grounded pre-training, enabling open-vocabulary object detection with unprecedented flexibility and accuracy.
LGM: Large Gaussian Splatting for Scalable 3D Reconstruction
LGM advanced 3D reconstruction with large-scale Gaussian splatting, offering scalable, high-fidelity neural rendering for real-time 3D scene synthesis.
Read Paper2025
MambaVision: A Hybrid Mamba-Transformer Backbone for Computer Vision
NVIDIA's MambaVision introduced the first hybrid Mamba-Transformer architecture for computer vision, achieving state-of-the-art performance in image classification and object detection with improved computational efficiency over traditional transformers.
Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
This work from Shaobo Wang et al. proposed a novel dataset distillation method using neural characteristic functions with a minmax optimization approach, enabling efficient training of computer vision models with significantly reduced dataset sizes.
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
This CVPR 2025 highlight paper presented a novel appearance model using Gaussian splatting and an atlas of charts, achieving high-quality 3D geometry and photorealistic rendering from sparse image views, advancing 3D reconstruction techniques.
RADIOv2.5: A Flexible Vision Encoder for Robust Multi-Task Learning
NVIDIA's RADIOv2.5 enhanced vision encoders with a combination of DFN_CLIP, DINOv2, SAM, SigLIP, and advanced training techniques, offering a flexible foundation model for tasks like object detection and segmentation across varying resolutions.
ESDiff: Encoding Strategy-Inspired Diffusion Model with Few-Shot Learning for Color Image Inpainting
This CVPR 2025 paper by Junyan Zhang et al. introduced ESDiff, a diffusion model inspired by encoding strategies, enabling high-quality color image inpainting with few-shot learning, improving efficiency and performance in image restoration tasks.
Mamba-Sea: A Mamba-Based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation
Accepted to IEEE TMI 2025, Mamba-Sea by Zihan Cheng et al. utilized a Mamba-based framework with global-to-local sequence augmentation, enhancing generalizability in medical image segmentation for diverse clinical applications.