Addressing the Quadratic Scaling Problem
Approaches for circumventing the quadratic nature of attention in Transformers.
Sliding Window Attention
Introduced in the “Longformer” paper, sliding window attention acts as a sub-quadratic drop-in replacement for standard attention in which each token attends only to a sliding window (shocking, right?) of recent tokens/states rather than the entire context window, under the premise that the vectors for those states have already attended to earlier ones and thus carry enough representational power to encode the relevant pieces of early context. Due to its simplicity, it’s become one of the more widely adopted approaches to sub-quadratic scaling, and is used in Mistral’s popular Mistral 7B model (among others).
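To make this concrete, here’s a minimal NumPy sketch of sliding window attention (my own illustration, not code from the Longformer repo): each query is masked so that it can only see itself and the previous few tokens.

```python
# Toy sliding window attention in NumPy (illustrative only).
import numpy as np

def sliding_window_attention(q, k, v, window):
    """q, k, v: (seq_len, d) arrays; each query sees only the last `window` positions."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (seq_len, seq_len) raw attention scores
    idx = np.arange(seq_len)
    # Causal band: position i may attend to positions (i - window + 1) .. i.
    visible = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(visible, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
print(sliding_window_attention(q, k, v, window=3).shape)  # (8, 4)
```

Note that this toy version still builds the full score matrix and merely masks it; a real implementation computes only the diagonal band of scores, which is where the sub-quadratic savings actually come from.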
Resources on Sliding Window Attention
- Blog post: "What is Sliding Window Attention?" by Stephen M. Walker
- Blog post: "Sliding Window Attention" by Manoj Kumal
- Video: "Longformer: The Long-Document Transformer" by Yannic Kilcher
Ring Attention
Another modification to the standard attention mechanism, Ring Attention enables full-context interaction via incremental, blockwise computation with a “message-passing” structure, wherein “blocks” of context are exchanged between devices over a series of steps rather than being processed all at once, so no single device ever has to materialize the full quadratic attention matrix. Within each block, the technique is essentially classical attention.
While it remains largely a research direction rather than a standard technique, at least within the open-weights world, Google's Gemini is rumored to use Ring Attention to enable its million-plus-token context.
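Here’s a single-process sketch of the blockwise accumulation that Ring Attention builds on (just the “online softmax” bookkeeping; the real method shards query blocks across devices and rotates key/value blocks around a ring while overlapping communication with compute):

```python
# Blockwise (non-causal) attention via online softmax accumulation, in NumPy.
# Each query block processes one incoming key/value block per step, which is the
# "message-passing" pattern that Ring Attention distributes around a device ring.
import numpy as np

def blockwise_attention(q, k, v, block_size):
    seq_len, d = q.shape
    assert seq_len % block_size == 0
    n_blocks = seq_len // block_size
    q_b = q.reshape(n_blocks, block_size, d)
    k_b = k.reshape(n_blocks, block_size, d)
    v_b = v.reshape(n_blocks, block_size, d)

    out = np.zeros_like(q)
    for i in range(n_blocks):                        # one query block per "device"
        acc = np.zeros((block_size, d))              # running weighted sum of values
        denom = np.zeros((block_size, 1))            # running softmax denominator
        row_max = np.full((block_size, 1), -np.inf)  # running max, for numerical stability
        for j in range(n_blocks):                    # KV blocks arrive one step at a time
            scores = q_b[i] @ k_b[j].T / np.sqrt(d)
            new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
            rescale = np.exp(row_max - new_max)      # rescale previously accumulated sums
            p = np.exp(scores - new_max)
            acc = acc * rescale + p @ v_b[j]
            denom = denom * rescale + p.sum(axis=-1, keepdims=True)
            row_max = new_max
        out[i * block_size:(i + 1) * block_size] = acc / denom
    return out  # matches full softmax attention up to floating-point error
```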
Resources on Ring Attention
- Blog post: "Breaking the Boundaries: Understanding Context Window Limitations and the idea of Ring Attention" by Tanuj Sharma
- Blog post: "Understanding Ring Attention: Building Transformers With Near-Infinite Context" from E2E Networks
- Video: "Ring Attention Explained"
Linear Attention (RWKV)
The Receptance Weighted Key Value (RWKV) architecture is a return to the general structure of RNN models (e.g. LSTMs), with modifications that let it scale up and train in parallel like a Transformer, plus a linear attention-style mechanism that supports recurrent “unrolling” of its representation (allowing constant computation per output token as context length grows).
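As a rough sketch of the recurrent flavor of this idea, here’s a heavily simplified version of the decayed “WKV” running average at the heart of RWKV’s time-mixing (omitting the receptance gating, the bonus term for the current token, and everything else the real architecture adds). The state carried between tokens is just two decayed running sums, so each new token costs the same amount of work no matter how long the context is:

```python
# Simplified decayed linear-attention recurrence in the spirit of RWKV's WKV term.
# Not the real RWKV formula; it just shows the constant-cost-per-token recurrence.
import numpy as np

def wkv_recurrence(k, v, w):
    """k, v: (seq_len, d) keys/values; w: (d,) positive per-channel decay rates."""
    seq_len, d = k.shape
    decay = np.exp(-w)          # per-step decay factor in (0, 1) for each channel
    num = np.zeros(d)           # decayed running sum of exp(k_t) * v_t
    den = np.zeros(d)           # decayed running sum of exp(k_t)
    out = np.zeros((seq_len, d))
    for t in range(seq_len):
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
        out[t] = num / den      # exponentially weighted average of past values
    return out
```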
Resources on RWKV
- Blog post: "Getting Started With RWKV" from Hugging Face
- Blog post: "The RWKV language model: An RNN with the advantages of a transformer" - Pt. 1 by Johan Wind
- Blog post: "How the RWKV language model works" - Pt. 2 by Johan Wind
- Video: "RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)" by Yannic Kilcher
Structured State Space Models
Structured State Space Models (SSMs) have become one of the most actively researched alternatives to Transformers, with several notable variants (S4, Hyena, Mamba/S6, Jamba, Mamba-2), but they are somewhat notorious for their complexity.
The architecture draws inspiration from classical control theory and linear time-invariant systems, with a number of optimizations for translating from continuous to discrete time and for avoiding dense representations of large matrices. SSMs support both recurrent and convolutional representations, which allows efficient parallel training (the convolutional form) as well as fast, constant-memory-per-token inference (the recurrent form).
Many variants require carefully-conditioned “hidden state matrix” representations to support “memorization” of context without needing all-pairs attention. SSMs also seem to be becoming more practical at scale, and have recently enabled breakthrough speed improvements for high-quality text-to-speech (via Cartesia AI, founded by several of the researchers behind S4 and Mamba).
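A bare-bones sketch of the core computation may help make this less abstract: a discretized linear SSM can be run either as a recurrence (one cheap state update per token) or as a convolution whose kernel is built from powers of the state matrix, and the two forms produce identical outputs. (This is a single-channel toy, assuming the discretization has already been done; real S4/Mamba layers add structured state matrices, per-channel parameters, and, in Mamba’s case, input-dependent “selective” parameters.)

```python
# Minimal discretized state space model in NumPy: recurrent vs. convolutional form.
import numpy as np

def ssm_scan(A_bar, B_bar, C, u):
    """Recurrent form: x_k = A_bar x_{k-1} + B_bar u_k,  y_k = C x_k."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append(C @ x)
    return np.array(ys)

def ssm_conv(A_bar, B_bar, C, u):
    """Equivalent convolutional form: y = u * K, with kernel K_j = C A_bar^j B_bar."""
    L = len(u)
    K = np.array([C @ np.linalg.matrix_power(A_bar, j) @ B_bar for j in range(L)])
    return np.convolve(u, K)[:L]

# The two forms agree, which is what lets SSMs train like a convolution
# and run inference like an RNN.
rng = np.random.default_rng(0)
n, L = 4, 16
A_bar = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B_bar, C, u = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(L)
print(np.allclose(ssm_scan(A_bar, B_bar, C, u), ssm_conv(A_bar, B_bar, C, u)))  # True
```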
Resources on SSMs
- Tutorial: "The Annotated S4" - comprehensive explainer focused on the S4 paper from which SSMs originated
- Blog post: "A Visual Guide to Mamba and State Space Models" by Maarten Grootendorst - great for intuitions and visuals with slightly less math
- Video: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)" by Yannic Kilcher
Recently, the Mamba authors released their follow-up “Mamba-2” paper, and their accompanying series of blog posts discusses some newly uncovered connections between SSM representations and linear attention which may be interesting:
Mamba-2 Blog Series
HyperAttention
Aimed at the same goal as RWKV and SSMs, HyperAttention is another proposal for achieving near-linear scaling in attention-like mechanisms, relying on locality-sensitive hashing (think vector DBs) rather than recurrent representations. I don’t see it discussed as much as the others, but it may be worth being aware of nonetheless.
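As a toy illustration of the LSH idea (not the published HyperAttention algorithm, which is more involved and combines a sorted-LSH scheme with sampling), random-hyperplane hashing can bucket queries and keys so that each query only attends to keys that land in its bucket:

```python
# Toy LSH-bucketed attention in NumPy (illustrative only).
import numpy as np

def lsh_bucketed_attention(q, k, v, n_planes=4, seed=0):
    seq_len, d = q.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((d, n_planes))

    def bucket(x):
        # Hash: sign pattern of projections onto random hyperplanes -> integer bucket id.
        bits = (x @ planes > 0).astype(int)
        return bits @ (1 << np.arange(n_planes))

    qb, kb = bucket(q), bucket(k)
    out = np.zeros_like(v)
    for b in np.unique(qb):
        qi = np.where(qb == b)[0]
        ki = np.where(kb == b)[0]
        if len(ki) == 0:
            continue  # no keys in this bucket; a real method would fall back to sampling
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[qi] = w @ v[ki]
    return out
```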
Resources on HyperAttention
- Blog post: "Linear Time Magic: How HyperAttention Optimizes Large Language Models" by Yousra Aoudi
- Video: "HyperAttention Explained" by Tony Shin
Key Takeaways
- Sliding Window Attention provides a simple way to achieve sub-quadratic scaling by limiting attention to recent tokens
- Ring Attention enables full-context interaction by passing blocks of keys and values between devices in a ring, so no single device has to hold the full quadratic attention matrix
- RWKV combines RNN structure with linear attention to achieve constant computation per token as context scales
- Structured State Space Models draw from control theory to create efficient alternatives to Transformers
- HyperAttention uses locality-sensitive hashing to achieve near-linear scaling for attention mechanisms
- These approaches represent a significant research direction for scaling context length beyond what's feasible with standard attention