The Grand AI Handbook

Efficient Transformers

Investigate transformer variants designed for improved computational efficiency.

This section explores transformer variants engineered to enhance computational efficiency, addressing the quadratic complexity of standard self-attention in the original Transformer. We cover Performer, Linformer, Longformer, sparse attention, BigBird, and low-rank factorization, which optimize memory and compute requirements while maintaining or improving performance. These innovations enable Transformers to handle longer sequences and scale to larger datasets, critical for modern foundation models. For a broader context, the paper A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT provides a historical perspective on Transformer evolution.

Challenges with Standard Transformers

The self-attention mechanism in the original Transformer, introduced in Attention is All You Need, compares every token with every other token, giving O(n²) time and memory complexity, where n is the sequence length. This makes long sequences computationally expensive and limits scalability. Efficient Transformer variants mitigate this through approximations, sparse attention patterns, or low-rank factorizations, enabling tasks that require extended context, such as document-level NLP or long-form generation.
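To make the bottleneck concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function names and shapes are illustrative rather than taken from any particular library. The (n × n) score matrix is exactly what the variants below avoid materializing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Vanilla scaled dot-product attention.
    Q, K, V: (n, d). The score matrix is (n, n), hence O(n^2) time and memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n) -- the quadratic bottleneck
    return softmax(scores) @ V      # (n, d)
```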

Performer

The Performer, introduced in Rethinking Attention with Performers by Choromanski et al. (2020), uses a kernel-based approximation to reduce self-attention’s complexity from O(n²) to O(n). By leveraging Fast Attention Via positive Orthogonal Random features (FAVOR+), it approximates attention scores with randomized feature maps, enabling linear scaling with sequence length. Performer maintains comparable performance to standard Transformers while being memory-efficient, ideal for long-sequence tasks like protein sequence modeling.
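A simplified sketch of the random-feature idea behind FAVOR+, using i.i.d. Gaussian features rather than the orthogonal features of the paper and omitting numerical-stability refinements; the feature count and function names are assumptions for illustration, and the sketch covers only non-causal attention.

```python
import numpy as np

def positive_random_features(x, omega):
    """FAVOR+-style positive features: phi(x) = exp(omega @ x - ||x||^2 / 2) / sqrt(m)."""
    m = omega.shape[0]
    proj = x @ omega.T                                    # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - sq_norm) / np.sqrt(m)

def performer_attention(Q, K, V, n_features=256, seed=0):
    """Linear-time (non-causal) attention via a random-feature softmax approximation."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n_features, d))          # i.i.d. Gaussian features
    Qp = positive_random_features(Q / d ** 0.25, omega)   # scaling folds in the 1/sqrt(d) temperature
    Kp = positive_random_features(K / d ** 0.25, omega)
    KV = Kp.T @ V                                         # (m, d): cost O(n m d), never (n, n)
    numer = Qp @ KV                                       # (n, d)
    denom = Qp @ Kp.sum(axis=0, keepdims=True).T          # (n, 1) normalization
    return numer / denom
```

Because the softmax kernel is approximated by an inner product of feature maps, the key-value summary Kp.T @ V can be computed once and reused for every query, which is where the linear scaling comes from.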

Linformer

Linformer, proposed in Linformer: Self-Attention with Linear Complexity by Wang et al. (2020), reduces attention complexity to O(n) by exploiting the observation that the attention matrix is approximately low-rank. Instead of computing the full n×n attention matrix, it applies learned projections that compress the length-n key and value sequences to a fixed length k, so attention operates on an n×k matrix and memory usage drops accordingly. This makes it suitable for tasks with long sequences, such as document summarization, while preserving performance on benchmarks like GLUE.
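A minimal sketch of the idea, with random projection matrices standing in for the learned E and F of the paper; shapes, names, and the choice of k are illustrative:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention with keys/values compressed along the sequence axis.
    Q, K, V: (n, d); E, F: (k, n) projections (learned in the actual model)."""
    d = Q.shape[-1]
    K_proj, V_proj = E @ K, F @ V                  # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(d)             # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                        # (n, d)

# Usage with random projections standing in for the learned ones
n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = F = rng.standard_normal((k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)           # (1024, 64)
```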

Longformer

The Longformer, introduced in Longformer: The Long-Document Transformer by Beltagy et al. (2020), employs sparse attention to handle sequences up to 4,096 tokens. It combines sliding window attention (local context) with global attention (task-specific tokens), reducing complexity to O(n). Longformer excels in document-level tasks like question answering and summarization, offering a balance between efficiency and performance.
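A sketch of how the two patterns can be combined into a boolean attention mask; this is a simplification, since the actual Longformer implementation uses banded custom kernels rather than a dense n×n mask, and the window size and global positions below are illustrative.

```python
import numpy as np

def longformer_mask(n, window, global_idx):
    """Boolean attention mask combining a sliding window with a few global tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True   # local window
    mask[global_idx, :] = True   # global tokens attend to everything
    mask[:, global_idx] = True   # and every token attends to them
    return mask

# e.g. a [CLS]-like global token at position 0 with a +/-2 window
print(longformer_mask(8, window=2, global_idx=[0]).astype(int))
```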

Sparse Attention

Sparse attention mechanisms reduce the computational burden by limiting the number of tokens each token attends to, creating a sparse attention matrix instead of a dense one. Variants include:

  • Sliding Window Attention: Used in Longformer, attends to a fixed-size window around each token.
  • Dilated Attention: Skips tokens at regular intervals to capture broader context.
  • Random Attention: Randomly selects a subset of tokens to attend to, reducing computation.

Sparse attention, discussed in Longformer and other works, is critical for scaling Transformers to long sequences, enabling applications in genomics and long-form text processing.
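For illustration, here is one way to build dilated and random sparse masks and apply them to attention scores; the pattern parameters and function names are assumptions rather than code from a specific implementation.

```python
import numpy as np

def dilated_mask(n, window, dilation):
    """Each token attends to tokens spaced `dilation` apart, up to `window` steps each side."""
    mask = np.zeros((n, n), dtype=bool)
    offsets = np.arange(-window, window + 1) * dilation
    for i in range(n):
        cols = i + offsets
        cols = cols[(cols >= 0) & (cols < n)]
        mask[i, cols] = True
    return mask

def random_mask(n, n_random, seed=0):
    """Each token attends to itself plus a random subset of other tokens."""
    rng = np.random.default_rng(seed)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

def masked_attention(Q, K, V, mask):
    """Apply a sparse pattern by setting disallowed scores to -inf before softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The dense boolean mask is only for clarity; efficient implementations compute just the permitted score entries, which is where the savings over O(n²) actually come from.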

BigBird

BigBird, proposed in Big Bird: Transformers for Longer Sequences by Zaheer et al. (2020), combines three sparse attention patterns—random, sliding window, and global attention—to achieve O(n) complexity. The paper shows that this sparse scheme retains key theoretical properties of full attention, including universal approximation of sequence functions and Turing completeness. BigBird scales to sequences of up to 4,096 tokens and excels in tasks like document classification and question answering, concentrating computation on a small set of attended tokens per position rather than on the full sequence.
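Continuing the mask-building sketches above, a BigBird-style pattern can be expressed as the union of window, random, and global masks; again, a dense mask is used only for clarity, and the block sizes and counts are illustrative.

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_random=2, n_global=2, seed=0):
    """Union of sliding-window, random, and global attention patterns."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True             # sliding window
        mask[i, rng.choice(n, size=n_random, replace=False)] = True           # random links
    g = np.arange(n_global)          # treat the first few tokens as global
    mask[g, :] = True
    mask[:, g] = True
    return mask

print(bigbird_style_mask(8).astype(int))
```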

Low-Rank Factorization

Low-rank factorization approximates the attention matrix by decomposing it into smaller matrices, reducing memory and compute costs. Linformer, for example, projects the attention matrix into a low-rank space. Other approaches, like those in Performer, use kernel-based low-rank approximations. This technique is particularly effective for long sequences, enabling efficient training and inference in resource-constrained settings.
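As an illustration of why low rank helps, the sketch below factors a full attention matrix with a truncated SVD and compares the approximation error; this is a didactic example, not how Linformer or Performer are trained (both avoid ever forming the n×n matrix).

```python
import numpy as np

def truncated_svd_factor(A, r):
    """Factor an (n x n) matrix into rank-r factors: A ~ L @ R.
    Storing L (n x r) and R (r x n) costs 2*n*r floats instead of n*n."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = U[:, :r] * s[:r]        # (n, r)
    R = Vt[:r, :]               # (r, n)
    return L, R

# Approximate a softmax attention matrix with rank 32
n, d, r = 1024, 64, 32
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
L, R = truncated_svd_factor(A, r)
err = np.linalg.norm(A - L @ R) / np.linalg.norm(A)
print(f"relative error at rank {r}: {err:.3f}")
```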

Impact on Foundation Models

Efficient Transformers have significantly influenced foundation models by:

  • Scaling Sequence Lengths: Enabling processing of long documents, code, or biological sequences.
  • Reducing Resource Demands: Lowering memory and compute costs, democratizing access to large models.
  • Enhancing Applications: Supporting tasks like long-form generation, document-level NLP, and multimodal processing.

These techniques, surveyed in Efficient Transformers: A Survey, inform long-context extensions of models like T5 and modern multimodal systems, described in Multimodal Foundation Models: From Specialists to General-Purpose Assistants.

Key Takeaways

  • Standard Transformers face quadratic complexity, limiting scalability for long sequences
  • Performer uses kernel-based approximations for linear complexity
  • Linformer employs low-rank projections to reduce attention matrix size
  • Longformer combines sliding window and global attention for document-level tasks
  • Sparse attention patterns, like those in BigBird, focus on key tokens, achieving O(n) complexity
  • Low-rank factorization approximates attention, enabling efficient computation
  • Efficient Transformers enable longer sequences and lower costs, shaping modern foundation models