The Grand AI Handbook

Addressing the Quadratic Scaling Problem

Approaches for circumventing the quadratic nature of attention in Transformers.

A major bottleneck in scaling both the size and context length of Transformers is the quadratic nature of attention, in which all pairs of token interactions are considered. Here we'll look at a number of approaches for circumventing this, ranging from those which are currently widely used to those which are more exploratory (but promising) research directions.

Sliding Window Attention

Introduced in the “Longformer” paper, sliding window attention acts as a sub-quadratic drop-in replacement for standard attention which allows attending only to a sliding window (shocking, right?) of recent tokens/states rather than the entire context window, under the premise that the vectors for these states have already attended to earlier ones and thus carry sufficient representational power to encode the relevant pieces of early context. Due to its simplicity, it’s become one of the more widely adopted approaches to sub-quadratic scaling, and is used in Mistral’s popular Mistral 7B model (among others).
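To make the idea concrete, here’s a minimal NumPy sketch of causal sliding-window attention (my own toy illustration, not Longformer’s or Mistral’s implementation; the function and parameter names are hypothetical). Note that this toy version still materializes the full score matrix before masking; real implementations compute only the banded scores to actually realize the memory savings.

```python
# Toy causal sliding-window attention: each query attends only to the previous
# `window` tokens (including itself), so the useful work grows linearly with
# sequence length.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, window=4):
    """q, k, v: arrays of shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    idx = np.arange(seq_len)
    # Position i may see positions j with i - window < j <= i.
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

# Example: 8 tokens, 16-dim heads, window of 4 recent tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
print(sliding_window_attention(q, k, v, window=4).shape)  # (8, 16)
```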

Ring Attention

Another modification to standard attention mechanisms, Ring Attention enables sub-quadratic full-context interaction via incremental computation with a “message-passing” structure, wherein “blocks” of context communicate with each other over a series of steps rather than all at once. Within each block, the technique is essentially classical attention.
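The core trick is to compute attention one block of keys/values at a time with an online (running) softmax, so each step only needs the K/V block that has just been “passed in.” Below is a single-process simulation of that accumulation; the real method shards the blocks across devices and rotates K/V chunks around a ring, and `ring_attention_sim` is my own hypothetical name for this sketch.

```python
# Blockwise attention with an online softmax: mathematically identical to dense
# softmax attention, but computed one K/V block at a time.
import numpy as np

def ring_attention_sim(q, k, v, block_size=4):
    seq_len, d = q.shape
    out_num = np.zeros_like(q)                 # running numerator
    out_den = np.zeros((seq_len, 1))           # running denominator
    run_max = np.full((seq_len, 1), -np.inf)   # running max for numerical stability

    for start in range(0, seq_len, block_size):   # one "ring step" per block
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)            # (seq_len, block)
        new_max = np.maximum(run_max, scores.max(axis=-1, keepdims=True))
        scale = np.exp(run_max - new_max)         # rescale old accumulators
        p = np.exp(scores - new_max)
        out_num = out_num * scale + p @ vb
        out_den = out_den * scale + p.sum(axis=-1, keepdims=True)
        run_max = new_max
    return out_num / out_den

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = ring_attention_sim(q, k, v, block_size=4)

# Sanity check: the blockwise result matches dense softmax attention.
s = q @ k.T / np.sqrt(q.shape[1])
p = np.exp(s - s.max(axis=-1, keepdims=True))
dense = (p / p.sum(axis=-1, keepdims=True)) @ v
print(np.allclose(out, dense))  # True
```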

While it remains largely a research direction rather than a standard technique (at least within the open-weights world), Google's Gemini is rumored to use Ring Attention to enable its million-plus-token context.

Linear Attention (RWKV)

The Receptance Weighted Key Value (RWKV) architecture is a return to the general structure of RNN models (e.g., LSTMs), with modifications to enable increased scaling and a linear attention-style mechanism which supports recurrent “unrolling” of its representation (allowing constant computation per output token as context length scales).
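Here’s a heavily simplified sketch, assuming an RWKV-4-style per-channel “WKV” recurrence and omitting the receptance gating, token/channel mixing, and numerical-stability tricks of the real architecture; `wkv_recurrence` is a hypothetical name. The point is that each step only updates a fixed-size state, so per-token cost stays constant no matter how long the context gets.

```python
# Simplified per-channel WKV recurrence: a decaying weighted average of past
# values, updated with a fixed-size running state.
import numpy as np

def wkv_recurrence(k, v, w, u):
    """k, v: (seq_len, d); w, u: (d,) per-channel decay and current-token bonus."""
    seq_len, d = k.shape
    num = np.zeros(d)   # running weighted sum of values
    den = np.zeros(d)   # running weight normalizer
    outputs = []
    for t in range(seq_len):
        e_k = np.exp(k[t])
        # Current token gets an extra exp(u) weight; past contributions decay by exp(-w).
        out = (num + np.exp(u) * e_k * v[t]) / (den + np.exp(u) * e_k)
        outputs.append(out)
        num = np.exp(-w) * num + e_k * v[t]
        den = np.exp(-w) * den + e_k
    return np.stack(outputs)

rng = np.random.default_rng(0)
k, v = rng.standard_normal((2, 16, 8))   # 16 tokens, 8 channels
w, u = np.ones(8) * 0.5, np.zeros(8)
print(wkv_recurrence(k, v, w, u).shape)  # (16, 8)
```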

Structured State Space Models

Structured State Space Models (SSMs) have become one of the most popular alternatives to Transformers in terms of current research focus, with several notable variants (S4, Hyena, Mamba/S6, Jamba, Mamba-2), but are somewhat notorious for their complexity.

The architecture draws inspiration from classical control theory and linear time-invariant systems, with a number of optimizations to translate from continuous to discrete time and to avoid dense representations of large matrices. SSMs support both recurrent and convolutional representations, which enables efficiency gains both during training and at inference.
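As a rough illustration (my own toy example, not S4 or Mamba; `discretize_zoh` and `ssm_scan` are hypothetical names), the sketch below discretizes a diagonal continuous-time system x' = Ax + Bu, y = Cx with the zero-order-hold rule and then runs it as a simple recurrence. The same map can equivalently be expressed as a convolution of the input with a kernel built from powers of the discretized state matrix, which is what makes parallel training efficient.

```python
# Minimal diagonal state-space layer: discretize a continuous LTI system, then
# run it recurrently over the input sequence.
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization for a diagonal state matrix."""
    A_bar = np.exp(dt * A_diag)                 # (state,)
    B_bar = (A_bar - 1.0) / A_diag * B          # (state,)
    return A_bar, B_bar

def ssm_scan(u, A_diag, B, C, dt=0.1):
    """u: (seq_len,) scalar input channel; returns y: (seq_len,)."""
    A_bar, B_bar = discretize_zoh(A_diag, B, dt)
    x = np.zeros_like(A_diag)
    ys = []
    for u_t in u:                               # recurrent view: O(seq_len * state)
        x = A_bar * x + B_bar * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
state = 16
A_diag = -np.abs(rng.standard_normal(state))    # negative real parts => stable dynamics
B, C = rng.standard_normal((2, state))
u = rng.standard_normal(32)
print(ssm_scan(u, A_diag, B, C).shape)  # (32,)
```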

Many variants require carefully conditioned “hidden state matrix” representations to support “memorization” of context without needing all-pairs attention. SSMs also seem to be becoming more practical at scale, and have recently resulted in breakthrough speed improvements for high-quality text-to-speech (via Cartesia AI, founded by the inventors of SSMs).

Recently, the Mamba authors released their follow-up “Mamba-2” paper, along with an accompanying series of blog posts discussing some newly uncovered connections between SSM representations and linear attention, which may be of interest.

HyperAttention

Somewhat similar to RWKV and SSMs, HyperAttention is another proposal for achieving near-linear scaling for attention-like mechanisms, relying on locality-sensitive hashing (think vector DBs) rather than recurrent representations. I don’t see it discussed as much as the others, but it may be worth being aware of nonetheless.
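As a very rough illustration of the hashing idea (much simpler than HyperAttention’s actual sortLSH-plus-sampling algorithm; `lsh_bucket_attention` is a hypothetical name), the sketch below hashes queries and keys with random hyperplanes and only computes attention within matching buckets, so most pairwise scores are never touched.

```python
# Toy LSH-bucketed attention: SimHash queries and keys, then attend only within
# each hash bucket.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsh_bucket_attention(q, k, v, n_hashes=4, seed=0):
    seq_len, d = q.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((d, n_hashes))
    # SimHash: the sign pattern against random hyperplanes defines the bucket id.
    powers = 2 ** np.arange(n_hashes)
    q_buckets = ((q @ planes) > 0).astype(int) @ powers
    k_buckets = ((k @ planes) > 0).astype(int) @ powers
    out = np.zeros_like(v)
    for b in np.unique(q_buckets):
        qi = np.where(q_buckets == b)[0]
        ki = np.where(k_buckets == b)[0]
        if len(ki) == 0:
            continue  # no keys landed in this bucket; those outputs stay zero
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        out[qi] = softmax(scores, axis=-1) @ v[ki]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
print(lsh_bucket_attention(q, k, v).shape)  # (64, 16)
```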

Key Takeaways

  • Sliding Window Attention provides a simple way to achieve sub-quadratic scaling by limiting attention to recent tokens
  • Ring Attention enables full-context interaction with sub-quadratic complexity through message-passing between blocks
  • RWKV combines RNN structure with linear attention to achieve constant computation per token as context scales
  • Structured State Space Models draw from control theory to create efficient alternatives to Transformers
  • HyperAttention uses locality-sensitive hashing to achieve near-linear scaling for attention mechanisms
  • These approaches represent a significant research direction for scaling context length beyond what's feasible with standard attention