Vision Transformers
This section introduces transformer-based architectures for computer vision tasks.
The Vision Transformer (ViT)
The Vision Transformer, introduced by Google Brain researchers in 2020, marked a pivotal moment by demonstrating that a pure Transformer architecture could achieve state-of-the-art results on image classification benchmarks, given sufficient training data.
Core Idea: Images as Sequences
ViT's core innovation is to treat an image as a sequence of fixed-size, non-overlapping patches, analogous to tokens in a sentence. Each patch is flattened into a vector.
Key Components
- Patch Embeddings: The image is divided into patches (e.g., 16x16 pixels). Each patch is flattened and linearly projected into an embedding vector. Positional embeddings are added to these patch embeddings to retain spatial information, as the standard Transformer is permutation-invariant.
- Transformer Encoder: The sequence of patch embeddings (plus an optional learnable class token embedding) is fed into a standard Transformer encoder, consisting of multiple layers of multi-head self-attention (MHSA) and feed-forward networks (MLP blocks). The self-attention mechanism allows the model to weigh the importance of different patches when representing any given patch.
- Classification Head: For image classification, typically the output corresponding to the class token is fed into a simple MLP head to produce the final prediction.
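To make these components concrete, here is a minimal PyTorch sketch of the patch-embedding stage (not the reference implementation; the 224x224 input, 16x16 patches, 768-dimensional embeddings, and all class and variable names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SimpleViTEmbedding(nn.Module):
    """Turns an image into a sequence of patch embeddings plus a class token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection to each one.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend class token -> (B, 197, 768)
        return x + self.pos_embed              # add learned positional embeddings

# The resulting sequence is fed to a standard Transformer encoder, and the
# class-token output goes to an MLP head for classification.
embeddings = SimpleViTEmbedding()(torch.randn(2, 3, 224, 224))
print(embeddings.shape)  # torch.Size([2, 197, 768])
```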
ViT showed excellent performance, especially when pre-trained on massive datasets (such as JFT-300M), but initially struggled to match CNNs when trained from scratch on smaller datasets like ImageNet, because it lacks the inductive biases built into CNNs (such as locality and translation equivariance).
Key Resources for ViT
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. (2020)
- Blog Post: Google AI Blog explaining ViT
- Explainer: Vision Transformer (ViT) Explained (Community explanation)
- Video: Vision Transformer (ViT) - Paper Explained (Yannic Kilcher)
Evolution and Advancements
Following the original ViT, research rapidly addressed its limitations, particularly data inefficiency and the lack of hierarchical structure suitable for dense prediction tasks.
DeiT: Data-efficient Image Transformers
DeiT, developed by Facebook AI, showed that ViTs can be trained effectively on ImageNet-1k alone, without massive external datasets. Its key contribution was knowledge distillation, in which a separately trained CNN (such as a RegNet) acts as a "teacher" model: the ViT "student" is trained to mimic the teacher's output predictions in addition to learning from the true labels. DeiT also introduced a dedicated distillation token that learns specifically from the teacher's output (see the loss sketch below).
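As a rough illustration of the hard-label distillation idea (a sketch only, not DeiT's full training recipe; all function and variable names here are assumptions), the loss below combines cross-entropy on the true labels with a cross-entropy term that pushes the distillation token's output toward the teacher's predicted labels:

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_cls_logits, student_dist_logits,
                           teacher_logits, labels, alpha=0.5):
    """Hard-label distillation in the spirit of DeiT.

    student_cls_logits:  logits from the student's class token
    student_dist_logits: logits from the student's distillation token
    teacher_logits:      logits from a frozen, pre-trained CNN teacher
    """
    # Standard supervised loss on the ground-truth labels.
    ce_true = F.cross_entropy(student_cls_logits, labels)
    # The distillation token is trained to match the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    ce_teacher = F.cross_entropy(student_dist_logits, teacher_labels)
    return (1 - alpha) * ce_true + alpha * ce_teacher

# Example shapes: batch of 8 images, 1000 classes.
s_cls, s_dist, t = torch.randn(8, 1000), torch.randn(8, 1000), torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = hard_distillation_loss(s_cls, s_dist, t, labels)
```

At inference time, DeiT combines the predictions of the class token and the distillation token.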
Key Resources for DeiT
- Paper: Training data-efficient image transformers & distillation through attention by Touvron et al. (2021)
- Blog Post: Facebook AI Blog on DeiT
- Code: Official DeiT GitHub Repository
Hierarchical Transformers (Swin)
Standard ViTs produce feature maps of a single, low resolution throughout the network. This is inefficient for dense prediction tasks like object detection and semantic segmentation, which benefit from hierarchical or multi-scale feature maps, a hallmark of CNNs. Hierarchical vision transformers address this.
The **Swin Transformer (Shifted Window Transformer)**, developed by Microsoft Research, is a prominent example. It introduces the following key innovations:
- Windowed Self-Attention: Self-attention is computed within local, non-overlapping windows (e.g., 7x7 patches) instead of globally across all patches. This significantly reduces computational complexity from quadratic to linear with respect to the number of patches.
- Shifted Windowing (SW-MSA): To enable cross-window communication while maintaining efficiency, the window configuration is shifted between consecutive layers. A window partition in one layer is shifted in the next, so patches that were in different windows can interact in the subsequent layer's attention calculation.
- Patch Merging: As the network deepens, layers progressively merge neighboring patches, increasing the receptive field and reducing the spatial resolution while increasing the feature dimension, creating a hierarchical feature pyramid similar to CNNs.
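The window mechanics can be sketched in a few lines of PyTorch (shapes follow a typical Swin-T stage-1 setup with 7x7 windows and 96 channels; the helper names are illustrative, and the official implementation adds attention masking and relative position biases):

```python
import torch

def window_partition(x, window_size=7):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); self-attention is then
    computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shift_windows(x, window_size=7):
    """Cyclically shift the feature map by half a window so the next layer's
    windows straddle the previous layer's window boundaries, enabling
    cross-window connections (the real model masks attention between
    regions that wrap around)."""
    shift = window_size // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def patch_merging(x):
    """Merge each 2x2 group of neighboring patches, halving spatial resolution
    and concatenating features (C -> 4C, typically followed by a linear layer
    reducing to 2C)."""
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    return torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)

feat = torch.randn(1, 56, 56, 96)      # stage-1 feature map
print(window_partition(feat).shape)    # (64, 49, 96): 8x8 windows of 7x7 patches
print(patch_merging(feat).shape)       # (1, 28, 28, 384)
```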
Swin Transformers achieved state-of-the-art performance across various vision tasks (classification, detection, segmentation) and demonstrated better scalability and efficiency than the original ViT.
Key Resources for Swin Transformer
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. (2021)
- Code: Official Swin Transformer GitHub Repository
- Explainer: Swin Transformer Explained (Community explanation)
- Video: Swin Transformer (Paper Explained)
Key Takeaways
- Vision Transformers (ViTs) apply the Transformer architecture to vision by treating images as sequences of patches.
- The core ViT uses patch embeddings and a standard Transformer encoder, relying heavily on large-scale pre-training.
- DeiT introduced knowledge distillation techniques to train ViTs efficiently on smaller datasets like ImageNet.
- Hierarchical transformers, like the Swin Transformer, incorporate locality and multi-scale processing, crucial for dense prediction tasks.
- Swin Transformer uses efficient windowed self-attention and a shifted window mechanism to enable cross-window connections while building hierarchical feature maps.
- ViTs and their variants represent a major paradigm shift in computer vision, offering powerful attention-based alternatives to CNNs.