Vision Transformers
This section introduces transformer-based architectures for computer vision tasks.
The Vision Transformer (ViT)
The Vision Transformer, introduced by Google Brain researchers in 2020, marked a pivotal moment by demonstrating that a pure Transformer architecture could achieve state-of-the-art results on image classification benchmarks, given sufficient training data.
Core Idea: Images as Sequences
ViT's core innovation is to treat an image as a sequence of fixed-size, non-overlapping patches, analogous to tokens in a sentence. Each patch is flattened into a vector.
Key Components
- Patch Embeddings: The image is divided into patches (e.g., 16x16 pixels). Each patch is flattened and linearly projected into an embedding vector. Positional embeddings are added to these patch embeddings to retain spatial information, as the standard Transformer is permutation-invariant.
- Transformer Encoder: The sequence of patch embeddings (plus an optional learnable class token embedding) is fed into a standard Transformer encoder, consisting of multiple layers of multi-head self-attention (MHSA) and feed-forward networks (MLP blocks). The self-attention mechanism allows the model to weigh the importance of different patches when representing any given patch.
- Classification Head: For image classification, typically the output corresponding to the class token is fed into a simple MLP head to produce the final prediction.
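To make these components concrete, here is a minimal PyTorch sketch of the patch-embedding stage (not the reference implementation; the 224x224 input, 16x16 patches, 768-dimensional embeddings, and all class and variable names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SimpleViTEmbedding(nn.Module):
    """Turns an image into a sequence of patch embeddings plus a class token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection to each one.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend class token -> (B, 197, 768)
        return x + self.pos_embed              # add learned positional embeddings

# The resulting sequence is fed to a standard Transformer encoder, and the
# class-token output goes to an MLP head for classification.
embeddings = SimpleViTEmbedding()(torch.randn(2, 3, 224, 224))
print(embeddings.shape)  # torch.Size([2, 197, 768])
```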
ViT showed excellent performance, especially when pre-trained on massive datasets (such as JFT-300M), but initially struggled to match CNNs when trained from scratch on smaller datasets like ImageNet, because it lacks the inductive biases built into CNNs (such as locality and translation equivariance).
Key Resources for ViT
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. (2020)
- Blog Post: Google AI Blog explaining ViT
- Explainer: Vision Transformer (ViT) Explained (Community explanation)
- Video: Vision Transformer (ViT) - Paper Explained (Yannic Kilcher)
Evolution and Advancements
Following the original ViT, research rapidly addressed its limitations, particularly data inefficiency and the lack of hierarchical structure suitable for dense prediction tasks.
DeiT: Data-efficient Image Transformers
DeiT, developed by Facebook AI, showed that ViTs can be trained effectively on ImageNet-1k alone, without massive external datasets. Its key contribution was knowledge distillation, in which a separately trained CNN (such as a RegNet) acts as a "teacher" model: the ViT "student" is trained to mimic the teacher's output predictions in addition to learning from the true labels. DeiT also introduced a dedicated distillation token that learns specifically from the teacher's output (see the loss sketch below).
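As a rough illustration of the hard-label distillation idea (a sketch only, not DeiT's full training recipe; all function and variable names here are assumptions), the loss below combines cross-entropy on the true labels with a cross-entropy term that pushes the distillation token's output toward the teacher's predicted labels:

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_cls_logits, student_dist_logits,
                           teacher_logits, labels, alpha=0.5):
    """Hard-label distillation in the spirit of DeiT.

    student_cls_logits:  logits from the student's class token
    student_dist_logits: logits from the student's distillation token
    teacher_logits:      logits from a frozen, pre-trained CNN teacher
    """
    # Standard supervised loss on the ground-truth labels.
    ce_true = F.cross_entropy(student_cls_logits, labels)
    # The distillation token is trained to match the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    ce_teacher = F.cross_entropy(student_dist_logits, teacher_labels)
    return (1 - alpha) * ce_true + alpha * ce_teacher

# Example shapes: batch of 8 images, 1000 classes.
s_cls, s_dist, t = torch.randn(8, 1000), torch.randn(8, 1000), torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = hard_distillation_loss(s_cls, s_dist, t, labels)
```

At inference time, DeiT combines the predictions of the class token and the distillation token.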
Key Resources for DeiT
- Paper: Training data-efficient image transformers & distillation through attention by Touvron et al. (2021)
- Blog Post: Facebook AI Blog on DeiT
- Code: Official DeiT GitHub Repository
Hierarchical Transformers (Swin)
Standard ViTs produce feature maps of a single, low resolution throughout the network. This is inefficient for dense prediction tasks like object detection and semantic segmentation, which benefit from hierarchical or multi-scale feature maps, a hallmark of CNNs. Hierarchical vision transformers address this.
The **Swin Transformer (Shifted Window Transformer)**, developed by Microsoft Research, is a prominent example. It introduces the following key innovations:
- Windowed Self-Attention: Self-attention is computed within local, non-overlapping windows (e.g., 7x7 patches) instead of globally across all patches. This significantly reduces computational complexity from quadratic to linear with respect to the number of patches.
- Shifted Windowing (SW-MSA): To enable cross-window communication while maintaining efficiency, the window configuration is shifted between consecutive layers. A window partition in one layer is shifted in the next, so patches that were in different windows can interact in the subsequent layer's attention calculation.
- Patch Merging: As the network deepens, layers progressively merge neighboring patches, increasing the receptive field and reducing the spatial resolution while increasing the feature dimension, creating a hierarchical feature pyramid similar to CNNs.
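The window mechanics can be sketched in a few lines of PyTorch (shapes follow a typical Swin-T stage-1 setup with 7x7 windows and 96 channels; the helper names are illustrative, and the official implementation adds attention masking and relative position biases):

```python
import torch

def window_partition(x, window_size=7):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); self-attention is then
    computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shift_windows(x, window_size=7):
    """Cyclically shift the feature map by half a window so the next layer's
    windows straddle the previous layer's window boundaries, enabling
    cross-window connections (the real model masks attention between
    regions that wrap around)."""
    shift = window_size // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def patch_merging(x):
    """Merge each 2x2 group of neighboring patches, halving spatial resolution
    and concatenating features (C -> 4C, typically followed by a linear layer
    reducing to 2C)."""
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    return torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)

feat = torch.randn(1, 56, 56, 96)      # stage-1 feature map
print(window_partition(feat).shape)    # (64, 49, 96): 8x8 windows of 7x7 patches
print(patch_merging(feat).shape)       # (1, 28, 28, 384)
```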
Swin Transformers achieved state-of-the-art performance across various vision tasks (classification, detection, segmentation) and demonstrated better scalability and efficiency than the original ViT.
Key Resources for Swin Transformer
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. (2021)
- Code: Official Swin Transformer GitHub Repository
- Explainer: Swin Transformer Explained (Community explanation)
- Video: Swin Transformer (Paper Explained)
Key Takeaways
- Vision Transformers (ViTs) apply the Transformer architecture to vision by treating images as sequences of patches.
- The core ViT uses patch embeddings and a standard Transformer encoder, relying heavily on large-scale pre-training.
- DeiT introduced knowledge distillation techniques to train ViTs efficiently on smaller datasets like ImageNet.
- Hierarchical transformers, like the Swin Transformer, incorporate locality and multi-scale processing, crucial for dense prediction tasks.
- Swin Transformer uses efficient windowed self-attention and a shifted window mechanism to enable cross-window connections while building hierarchical feature maps.
- ViTs and their variants represent a major paradigm shift in computer vision, offering powerful attention-based alternatives to CNNs.