The Grand AI Handbook

Beyond Transformers: Other Generative Models

Non-Transformer architectures for generative modeling of complex distributions.

So far, everything we've looked at has focused on text and sequence prediction with language models, but many other "generative AI" techniques require learning distributions with less of a sequential structure (e.g. images). Here we'll examine a number of non-Transformer architectures for generative modeling, starting from simple mixture models and culminating with diffusion.

Distribution Modeling

Recalling our first glimpse of language models as simple bigram distributions, the most basic thing you can do in distribution modeling is just count co-occurrence frequencies in your dataset and treat the resulting empirical probabilities as ground truth. This idea extends to conditional sampling and classification as “Naive Bayes” (blog post and video), often one of the simplest algorithms covered in introductory machine learning courses.
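
To make the counting idea concrete, here's a toy Python sketch (with a made-up corpus and invented names) that estimates bigram conditional probabilities purely from counts and then samples from them; Naive Bayes applies the same counting spirit to classification.

```python
# Minimal sketch: estimate a bigram "language model" purely by counting
# co-occurrences in a toy corpus, then sample from the conditional
# distribution P(next word | current word). Corpus and names are made up.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram co-occurrences.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# Normalize counts into conditional probabilities.
def next_word_distribution(word):
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()} if total else {}

print(next_word_distribution("the"))  # e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}

# Sample a short continuation by repeatedly drawing from the conditionals.
word, generated = "the", ["the"]
for _ in range(5):
    dist = next_word_distribution(word)
    if not dist:
        break
    word = random.choices(list(dist), weights=dist.values())[0]
    generated.append(word)
print(" ".join(generated))
```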

The next generative model students are often taught is the Gaussian Mixture Model and its Expectation-Maximization algorithm. This blog post and this video give decent overviews; the core idea here is assuming that data distributions can be approximated as a mixture of multivariate Gaussian distributions. GMMs can also be used for clustering if individual groups can be assumed to be approximately Gaussian.
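
As a concrete illustration, here's a minimal sketch using scikit-learn's GaussianMixture (which runs EM under the hood) on synthetic two-blob data invented for the example.

```python
# Minimal sketch: fit a two-component Gaussian Mixture Model with EM
# (scikit-learn handles the E and M steps internally) on synthetic 2-D data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as stand-in data.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2))
X = np.vstack([blob_a, blob_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.means_)          # recovered component means, close to (0,0) and (4,4)
print(gmm.weights_)        # mixing proportions, roughly [0.5, 0.5]
labels = gmm.predict(X)    # GMMs as clustering: hard-assign each point
print(np.bincount(labels)) # roughly 200 points per cluster
```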

While these methods aren't very effective at representing complex structures like images or language, related ideas will appear as components of some of the more advanced methods we'll see.

Variational Auto-Encoders

Auto-encoders and variational auto-encoders are widely used for learning compressed representations of data distributions, and can also be useful for “denoising” inputs, which will come into play when we discuss diffusion.
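
To make the moving parts concrete, here's a minimal PyTorch sketch of a VAE with hypothetical sizes (784-dim inputs, 32-dim latent), showing the reparameterization trick and the ELBO-style loss (reconstruction plus KL term); it's an illustrative skeleton, not a tuned implementation.

```python
# Minimal VAE sketch in PyTorch; sizes and architecture are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden, latent)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients can flow through the sampling step.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence between q(z|x) and N(0, I).
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

model = TinyVAE()
x = torch.rand(16, 784)             # stand-in batch of flattened images
recon, mu, logvar = model(x)
loss = vae_loss(x, recon, mu, logvar)
loss.backward()
```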

Generative Adversarial Nets

The basic idea behind Generative Adversarial Networks (GANs) is to simulate a “game” between two neural nets: the Generator tries to create samples that are indistinguishable from real data, while the Discriminator tries to identify which samples are generated. Both nets are trained simultaneously until an equilibrium (or the desired sample quality) is reached.

Following from von Neumann's minimax theorem for zero-sum games, you basically get a "theorem" promising that GANs succeed at learning distributions, if you assume that gradient descent finds global minimizers and allow both networks to grow arbitrarily large.

Granted, neither of these assumptions is literally true in practice, but GANs do tend to be quite effective (although they’ve fallen out of favor somewhat in recent years, partly due to the instabilities of simultaneous training).
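
To see what the “game” looks like in code, here's a minimal PyTorch sketch of one alternating training step, with made-up network sizes and stand-in data; real GANs rely on many additional stabilization tricks.

```python
# Minimal sketch of one adversarial training step; 100-dim noise and
# 784-dim "images" are invented for illustration.
import torch
import torch.nn as nn

noise_dim, data_dim = 100, 784
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, data_dim) * 2 - 1   # stand-in for a batch of real data
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

# 1) Discriminator step: label real samples 1 and generated samples 0.
fake = G(torch.randn(64, noise_dim))
d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: try to make the discriminator output 1 on fakes.
fake = G(torch.randn(64, noise_dim))
g_loss = bce(D(fake), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```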

Conditional GANs

Conditional GANs are where we’ll start going from vanilla “distribution learning” to something which more closely resembles interactive generative tools like DALL-E and Midjourney, incorporating text-image multimodality. A key idea is to learn “representations” (in the sense of text embeddings or autoencoders) which are more abstract and can be applied to either text or image inputs.

For example, you could imagine training a vanilla GAN on (image, caption) pairs by embedding the text and concatenating it with an image, which could then learn this joint distribution over images and captions. This implicitly involves learning conditional distributions if part of the input (image or caption) is fixed.
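
Here's a minimal sketch of that conditioning idea, with invented shapes and stand-in caption embeddings: concatenate the text embedding with the generator's noise input and with the discriminator's image input, so both nets see the caption.

```python
# Minimal conditional-GAN sketch; dimensions and embeddings are made up.
import torch
import torch.nn as nn

noise_dim, text_dim, img_dim = 100, 64, 784

cond_G = nn.Sequential(nn.Linear(noise_dim + text_dim, 256), nn.ReLU(),
                       nn.Linear(256, img_dim), nn.Tanh())
cond_D = nn.Sequential(nn.Linear(img_dim + text_dim, 256), nn.LeakyReLU(0.2),
                       nn.Linear(256, 1))

text_emb = torch.randn(8, text_dim)     # stand-in caption embeddings
z = torch.randn(8, noise_dim)

fake_img = cond_G(torch.cat([z, text_emb], dim=1))       # image given caption
score = cond_D(torch.cat([fake_img, text_emb], dim=1))   # real/fake given caption
```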

This setup can be extended to enable automatic captioning (given an image) or image generation (given a caption), and there are a number of variants with differing bells and whistles. The VQGAN+CLIP architecture is worth knowing about, as it was a major popular source of early “AI art” generated from input text.

Normalizing Flows

Normalizing flows aim to learn a series of invertible transformations between Gaussian noise and an output distribution, avoiding the need for the “simultaneous training” of GANs, and they have been popular for generative modeling in a number of domains.

I haven’t personally gone very deep on normalizing flows, but they come up enough that they’re probably worth being aware of.
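
That said, for a flavor of the “invertible transformation” idea, here's a minimal sketch of an affine coupling layer in the spirit of RealNVP; the dimensions and the small conditioning network are invented, and a real flow stacks many such layers and tracks log-determinants to compute exact likelihoods.

```python
# Minimal affine coupling layer sketch: an invertible transform with an
# exact, cheap inverse, which is the building block of many flows.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=4):
        super().__init__()
        self.half = dim // 2
        # Small net predicting scale and shift for the second half of the
        # input, conditioned on the first half.
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t          # invertible affine transform
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)       # exact inverse, by construction
        return torch.cat([y1, x2], dim=1)

layer = AffineCoupling(dim=4)
z = torch.randn(8, 4)                  # Gaussian noise in, data-like sample out
x = layer(z)
assert torch.allclose(layer.inverse(x), z, atol=1e-5)
```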

Diffusion Models

One of the central ideas behind diffusion models (like StableDiffusion) is iterative guided application of denoising operations, refining random noise into something that increasingly resembles an image. Diffusion originates from the worlds of stochastic differential equations and statistical physics — relating to the “Schrödinger bridge” problem and optimal transport for probability distributions — and a fair amount of math is basically unavoidable if you want to understand the whole picture.

Diffusion models work by gradually adding noise to training data and then learning to reverse this process, effectively learning how to transform random noise into structured data that matches the target distribution.
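
As a rough sketch of that training recipe (in the style of DDPM, with a stand-in MLP instead of a U-Net and a simplified noise schedule), the model is trained to predict the noise that was added at a randomly chosen timestep; sampling then runs the learned denoiser iteratively from pure noise.

```python
# Minimal sketch of a DDPM-style training step; network, schedule, and data
# are simplified stand-ins for illustration only.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal fraction

model = nn.Sequential(nn.Linear(784 + 1, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(32, 784)                           # stand-in batch of images
t = torch.randint(0, T, (32,))                     # random timestep per sample
eps = torch.randn_like(x0)

# Forward (noising) process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
a = alphas_bar[t].unsqueeze(1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps

# Reverse process is learned: predict the added noise from (x_t, t).
t_input = (t.float() / T).unsqueeze(1)             # crude timestep conditioning
pred_eps = model(torch.cat([x_t, t_input], dim=1))
loss = nn.functional.mse_loss(pred_eps, eps)
opt.zero_grad(); loss.backward(); opt.step()
```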

Key Takeaways

  • Simple models like Naive Bayes and Gaussian Mixture Models form the foundation of generative modeling
  • Variational Auto-Encoders learn compressed data representations useful for generation and denoising
  • Generative Adversarial Networks create realistic outputs through an adversarial training process
  • Conditional GANs extend the GAN framework to enable text-to-image generation
  • Normalizing Flows learn invertible transformations between simple distributions and complex ones
  • Diffusion Models iteratively denoise random inputs to create structured outputs like images
  • Each architecture presents different tradeoffs in training stability, output quality, and controllability