The Grand AI Handbook

Addressing the Quadratic Scaling Problem

Approaches for circumventing the quadratic nature of attention in Transformers.

A major bottleneck in scaling both the size and context length of Transformers is the quadratic nature of attention, in which all pairs of token interactions are considered. Here we'll look at a number of approaches for circumventing this, ranging from those which are currently widely used to those which are more exploratory (but promising) research directions.

Sliding Window Attention

Introduced in the “Longformer” paper, sliding window attention acts as a sub-quadratic drop-in replacement for standard attention which allows attending only to a sliding window (shocking, right?) of recent tokens/states rather than the entire context window, under the premise that the vectors for these states have already attended to earlier ones and thus carry sufficient representational power to encode the relevant pieces of early context. Due to its simplicity, it’s become one of the more widely adopted approaches to sub-quadratic scaling, and is used in Mistral’s popular Mistral 7B model (among others).
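To make the idea concrete, here’s a minimal NumPy sketch of causal sliding-window attention (my own toy illustration, not Longformer’s or Mistral’s implementation; the function and parameter names are hypothetical). Note that this toy version still materializes the full score matrix before masking; real implementations compute only the banded scores to actually realize the memory savings.

```python
# Toy causal sliding-window attention: each query attends only to the previous
# `window` tokens (including itself), so the useful work grows linearly with
# sequence length.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, window=4):
    """q, k, v: arrays of shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    idx = np.arange(seq_len)
    # Position i may see positions j with i - window < j <= i.
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

# Example: 8 tokens, 16-dim heads, window of 4 recent tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
print(sliding_window_attention(q, k, v, window=4).shape)  # (8, 16)
```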

Ring Attention

Another modification to standard attention mechanisms, Ring Attention enables sub-quadratic full-context interaction via incremental computation with a “message-passing” structure, wherein “blocks” of context communicate with each other over a series of steps rather than all at once. Within each block, the technique is essentially classical attention.
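The core trick is to compute attention one block of keys/values at a time with an online (running) softmax, so each step only needs the K/V block that has just been “passed in.” Below is a single-process simulation of that accumulation; the real method shards the blocks across devices and rotates K/V chunks around a ring, and `ring_attention_sim` is my own hypothetical name for this sketch.

```python
# Blockwise attention with an online softmax: mathematically identical to dense
# softmax attention, but computed one K/V block at a time.
import numpy as np

def ring_attention_sim(q, k, v, block_size=4):
    seq_len, d = q.shape
    out_num = np.zeros_like(q)                 # running numerator
    out_den = np.zeros((seq_len, 1))           # running denominator
    run_max = np.full((seq_len, 1), -np.inf)   # running max for numerical stability

    for start in range(0, seq_len, block_size):   # one "ring step" per block
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)            # (seq_len, block)
        new_max = np.maximum(run_max, scores.max(axis=-1, keepdims=True))
        scale = np.exp(run_max - new_max)         # rescale old accumulators
        p = np.exp(scores - new_max)
        out_num = out_num * scale + p @ vb
        out_den = out_den * scale + p.sum(axis=-1, keepdims=True)
        run_max = new_max
    return out_num / out_den

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = ring_attention_sim(q, k, v, block_size=4)

# Sanity check: the blockwise result matches dense softmax attention.
s = q @ k.T / np.sqrt(q.shape[1])
p = np.exp(s - s.max(axis=-1, keepdims=True))
dense = (p / p.sum(axis=-1, keepdims=True)) @ v
print(np.allclose(out, dense))  # True
```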

While it remains largely a research direction rather than a standard technique (at least within the open-weights world), Google's Gemini is rumored to use Ring Attention to enable its million-plus-token context.

Linear Attention (RWKV)

The Receptance Weighted Key Value (RWKV) architecture is a return to the general structure of RNN models (e.g., LSTMs), with modifications to enable increased scaling and a linear attention-style mechanism which supports recurrent “unrolling” of its representation (allowing constant computation per output token as context length scales).
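Here’s a heavily simplified sketch, assuming an RWKV-4-style per-channel “WKV” recurrence and omitting the receptance gating, token/channel mixing, and numerical-stability tricks of the real architecture; `wkv_recurrence` is a hypothetical name. The point is that each step only updates a fixed-size state, so per-token cost stays constant no matter how long the context gets.

```python
# Simplified per-channel WKV recurrence: a decaying weighted average of past
# values, updated with a fixed-size running state.
import numpy as np

def wkv_recurrence(k, v, w, u):
    """k, v: (seq_len, d); w, u: (d,) per-channel decay and current-token bonus."""
    seq_len, d = k.shape
    num = np.zeros(d)   # running weighted sum of values
    den = np.zeros(d)   # running weight normalizer
    outputs = []
    for t in range(seq_len):
        e_k = np.exp(k[t])
        # Current token gets an extra exp(u) weight; past contributions decay by exp(-w).
        out = (num + np.exp(u) * e_k * v[t]) / (den + np.exp(u) * e_k)
        outputs.append(out)
        num = np.exp(-w) * num + e_k * v[t]
        den = np.exp(-w) * den + e_k
    return np.stack(outputs)

rng = np.random.default_rng(0)
k, v = rng.standard_normal((2, 16, 8))   # 16 tokens, 8 channels
w, u = np.ones(8) * 0.5, np.zeros(8)
print(wkv_recurrence(k, v, w, u).shape)  # (16, 8)
```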

Structured State Space Models

Structured State Space Models (SSMs) have become one of the most popular alternatives to Transformers in terms of current research focus, with several notable variants (S4, Hyena, Mamba/S6, Jamba, Mamba-2), but are somewhat notorious for their complexity.

The architecture draws inspiration from classical control theory and linear time-invariant systems, with a number of optimizations to translate from continuous to discrete time and to avoid dense representations of large matrices. SSMs support both recurrent and convolutional representations, which enables efficiency gains both during training and at inference.
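As a rough illustration (my own toy example, not S4 or Mamba; `discretize_zoh` and `ssm_scan` are hypothetical names), the sketch below discretizes a diagonal continuous-time system x' = Ax + Bu, y = Cx with the zero-order-hold rule and then runs it as a simple recurrence. The same map can equivalently be expressed as a convolution of the input with a kernel built from powers of the discretized state matrix, which is what makes parallel training efficient.

```python
# Minimal diagonal state-space layer: discretize a continuous LTI system, then
# run it recurrently over the input sequence.
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization for a diagonal state matrix."""
    A_bar = np.exp(dt * A_diag)                 # (state,)
    B_bar = (A_bar - 1.0) / A_diag * B          # (state,)
    return A_bar, B_bar

def ssm_scan(u, A_diag, B, C, dt=0.1):
    """u: (seq_len,) scalar input channel; returns y: (seq_len,)."""
    A_bar, B_bar = discretize_zoh(A_diag, B, dt)
    x = np.zeros_like(A_diag)
    ys = []
    for u_t in u:                               # recurrent view: O(seq_len * state)
        x = A_bar * x + B_bar * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
state = 16
A_diag = -np.abs(rng.standard_normal(state))    # negative real parts => stable dynamics
B, C = rng.standard_normal((2, state))
u = rng.standard_normal(32)
print(ssm_scan(u, A_diag, B, C).shape)  # (32,)
```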

Many variants require carefully conditioned “hidden state matrix” representations to support “memorization” of context without needing all-pairs attention. SSMs also seem to be becoming more practical at scale, and have recently resulted in breakthrough speed improvements for high-quality text-to-speech (via Cartesia AI, founded by the inventors of SSMs).

Recently, the Mamba authors released their follow-up “Mamba-2” paper, along with an accompanying series of blog posts discussing some newly uncovered connections between SSM representations and linear attention, which may be of interest.

HyperAttention

Somewhat similar to RWKV and SSMs, HyperAttention is another proposal for achieving near-linear scaling for attention-like mechanisms, relying on locality-sensitive hashing (think vector DBs) rather than recurrent representations. I don’t see it discussed as much as the others, but it may be worth being aware of nonetheless.
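As a very rough illustration of the hashing idea (much simpler than HyperAttention’s actual sortLSH-plus-sampling algorithm; `lsh_bucket_attention` is a hypothetical name), the sketch below hashes queries and keys with random hyperplanes and only computes attention within matching buckets, so most pairwise scores are never touched.

```python
# Toy LSH-bucketed attention: SimHash queries and keys, then attend only within
# each hash bucket.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsh_bucket_attention(q, k, v, n_hashes=4, seed=0):
    seq_len, d = q.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((d, n_hashes))
    # SimHash: the sign pattern against random hyperplanes defines the bucket id.
    powers = 2 ** np.arange(n_hashes)
    q_buckets = ((q @ planes) > 0).astype(int) @ powers
    k_buckets = ((k @ planes) > 0).astype(int) @ powers
    out = np.zeros_like(v)
    for b in np.unique(q_buckets):
        qi = np.where(q_buckets == b)[0]
        ki = np.where(k_buckets == b)[0]
        if len(ki) == 0:
            continue  # no keys landed in this bucket; those outputs stay zero
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        out[qi] = softmax(scores, axis=-1) @ v[ki]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
print(lsh_bucket_attention(q, k, v).shape)  # (64, 16)
```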

Key Takeaways

  • Sliding Window Attention provides a simple way to achieve sub-quadratic scaling by limiting attention to recent tokens
  • Ring Attention enables full-context interaction with sub-quadratic complexity through message-passing between blocks
  • RWKV combines RNN structure with linear attention to achieve constant computation per token as context scales
  • Structured State Space Models draw from control theory to create efficient alternatives to Transformers
  • HyperAttention uses locality-sensitive hashing to achieve near-linear scaling for attention mechanisms
  • These approaches represent a significant research direction for scaling context length beyond what's feasible with standard attention