Beyond Transformers: Other Generative Models
Non-Transformer architectures for generative modeling of complex distributions.
Distribution Modeling
Recalling our first glimpse of language models as simple bigram distributions, the most basic thing you can do in distribution modeling is to count co-occurrence probabilities in your dataset and treat those empirical frequencies as the ground-truth distribution. This idea extends to conditional sampling and classification as “Naive Bayes” (blog post and video), often one of the simplest algorithms covered in introductory machine learning courses.
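To make this concrete, here's a minimal sketch of the counting approach as a bigram sampler; the toy corpus and the `sample_next` helper are purely illustrative.

```python
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat and the cat ate".split()

# Count bigram co-occurrences: how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(word):
    # Sample the next token in proportion to its observed counts,
    # i.e. treat the empirical frequencies as the true distribution.
    options = counts[word]
    return random.choices(list(options), weights=list(options.values()))[0]

print(sample_next("the"))  # e.g. "cat" roughly two-thirds of the time in this corpus
```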
The next generative model students are often taught is the Gaussian Mixture Model (GMM), typically fit with the Expectation-Maximization (EM) algorithm. This blog post and this video give decent overviews; the core idea is to assume that a data distribution can be approximated as a mixture of multivariate Gaussian distributions. GMMs can also be used for clustering if individual groups can be assumed to be approximately Gaussian.
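As a quick illustration, here's a hedged sketch using scikit-learn's GaussianMixture (which runs EM under the hood); the toy data and the choice of two components are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data drawn from two well-separated clusters.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(200, 2)),
])

# Fit a two-component mixture with EM, then use it both ways:
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(data)
labels = gmm.predict(data)    # clustering: assign each point to a component
samples, _ = gmm.sample(100)  # generation: draw new points from the fitted mixture
```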
While these methods aren't very effective at representing complex structures like images or language, related ideas will appear as components of some of the more advanced methods we'll see.
Variational Auto-Encoders
Auto-encoders and variational auto-encoders are widely used for learning compressed representations of data distributions, and can also be useful for “denoising” inputs, which will come into play when we discuss diffusion.
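To ground the idea, here's a minimal (and heavily simplified) PyTorch sketch of a VAE; the layer sizes, latent dimension, and loss weighting are placeholder assumptions rather than a recommended architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 128)
        self.to_mu = nn.Linear(128, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(128, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim)
        )

    def forward(self, x):
        h = F.relu(self.encoder(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence from the standard normal prior.
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```

The reparameterization trick is what lets gradients pass through the sampling step, and the KL term is what keeps the latent space regular enough that decoding random latents yields plausible samples.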
Resources on Variational Auto-Encoders
- Textbook chapter: "Autoencoders" in the "Deep Learning" book
- Blog post: "From Autoencoder to Beta-VAE" from Lilian Weng
- Video: "Variational Autoencoders" from Arxiv Insights
- Blog post: "Deep Generative Models" from Prakash Pandey - covers both VAEs and GANs
Generative Adversarial Nets
The basic idea behind Generative Adversarial Networks (GANs) is to simulate a “game” between two neural nets: the Generator tries to create samples that are indistinguishable from real data, while the Discriminator tries to tell generated samples apart from real ones, and both nets are trained simultaneously until an equilibrium (or a desired sample quality) is reached.
Following from von Neumann's minimax theorem for zero-sum games, you basically get a "theorem" promising that GANs succeed at learning distributions, if you assume that gradient descent finds global minimizers and allow both networks to grow arbitrarily large.
Granted, neither of these assumptions is literally true in practice, but GANs do tend to be quite effective (although they’ve fallen out of favor somewhat in recent years, partly due to the instabilities of simultaneous training).
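As a rough sketch of what that simultaneous training looks like in code, here's a minimal PyTorch training step for a GAN on low-dimensional data; the network sizes, learning rates, and data dimensionality are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # Discriminator step: real samples labeled 1, generated samples labeled 0.
    fake = G(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The alternating updates are exactly where the instability comes from: each network's loss landscape shifts every time the other one takes a step.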
Resources on GANs
- Guide: "Complete Guide to Generative Adversarial Networks" from Paperspace
- Tutorial: "Generative Adversarial Networks (GANs): End-to-End Introduction"
- Textbook chapter: Deep Learning, Ch. 20 - Deep Generative Models (theory-focused)
Conditional GANs
Conditional GANs are where we’ll start going from vanilla “distribution learning” to something which more closely resembles interactive generative tools like DALL-E and Midjourney, incorporating text-image multimodality. A key idea is to learn “representations” (in the sense of text embeddings or autoencoders) which are more abstract and can be applied to either text or image inputs.
For example, you could imagine training a vanilla GAN on (image, caption) pairs by embedding the text and concatenating it with an image, which could then learn this joint distribution over images and captions. This implicitly involves learning conditional distributions if part of the input (image or caption) is fixed.
This can be extended to enable automatic captioning (given an image) or image generation (given a caption). There are a number of variants on this setup with differing bells and whistles. The VQGAN+CLIP architecture is worth knowing about, as it was a major popular source of early “AI art” generated from input text.
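A common way to implement the conditioning, sketched below in PyTorch, is to embed the condition (a class label here, though it could just as well be a pooled text embedding) and concatenate it with the generator's noise and with the discriminator's input; the dimensions and layers are illustrative assumptions, not the VQGAN+CLIP setup.

```python
import torch
import torch.nn as nn

num_classes, latent_dim, data_dim, embed_dim = 10, 32, 784, 16

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)  # condition embedding
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, data_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        # Concatenate noise with the condition so the net models p(sample | condition).
        return self.net(torch.cat([z, self.embed(cond)], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(data_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x, cond):
        # The discriminator sees the (sample, condition) pair, so a sample only
        # counts as "real" if it also matches its condition.
        return self.net(torch.cat([x, self.embed(cond)], dim=1))
```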
Resources on Conditional GANs
- Blog post: "Implementing Conditional Generative Adversarial Networks" from Paperspace
- Article: "Conditional Generative Adversarial Network — How to Gain Control Over GAN Outputs" by Saul Dobilas
- Tutorial: "The Illustrated VQGAN" by LJ Miranda
- Talk: "Using Deep Learning to Generate Artwork with VQGAN-CLIP" from Paperspace
Normalizing Flows
Normalizing flows aim to learn a series of invertible transformations between Gaussian noise and an output distribution, avoiding the need for the “simultaneous training” of GANs; they have been popular for generative modeling in a number of domains.
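To make the “invertible transformation” idea concrete, here's a minimal sketch of a single affine coupling layer (in the spirit of RealNVP) together with the change-of-variables log-likelihood; the layer sizes are placeholder assumptions, and a real flow would stack many such layers with permutations between them.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible layer: the first half of the dims shifts/scales the second half."""
    def __init__(self, dim):  # assumes dim is even
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_scale, shift = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_scale) + shift
        log_det = log_scale.sum(dim=1)  # log|det Jacobian| of this transform
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_scale, shift = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - shift) * torch.exp(-log_scale)], dim=1)

def log_likelihood(x, layer):
    # Change of variables: log p(x) = log N(f(x); 0, I) + log|det J_f(x)|,
    # which can be maximized directly by gradient descent (no adversary needed).
    z, log_det = layer(x)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.size(1) * math.log(2 * math.pi)
    return log_pz + log_det
```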
Resources on Normalizing Flows
- Blog post: "Flow-based Deep Generative Models" from Lilian Weng
I haven’t personally gone very deep on normalizing flows, but they come up enough that they’re probably worth being aware of.
Diffusion Models
One of the central ideas behind diffusion models (like Stable Diffusion) is the iterative, guided application of denoising operations, refining random noise into something that increasingly resembles an image. Diffusion originates from the worlds of stochastic differential equations and statistical physics (relating to the “Schrödinger bridge” problem and optimal transport for probability distributions), and a fair amount of math is basically unavoidable if you want to understand the whole picture.
Diffusion models work by gradually adding noise to training data and then learning to reverse this process, effectively learning how to transform random noise into structured data that matches the target distribution.
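Here's a hedged sketch of that noising-and-denoising recipe in the style of DDPM training; the noise schedule, the placeholder MLP, and the crude timestep conditioning are all illustrative assumptions, not a faithful Stable Diffusion setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                               # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)  # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    # Forward process: jump straight to step t by mixing data with Gaussian noise.
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1)
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * noise, noise

# A denoising network would predict the noise added at step t; here it is just
# a placeholder MLP over flattened data plus a single timestep feature.
model = nn.Sequential(nn.Linear(785, 256), nn.ReLU(), nn.Linear(256, 784))

def training_step(x0):
    t = torch.randint(0, T, (x0.size(0),))
    noisy, noise = add_noise(x0, t)
    t_feature = (t.float() / T).view(-1, 1)  # crude timestep conditioning
    pred_noise = model(torch.cat([noisy, t_feature], dim=1))
    return F.mse_loss(pred_noise, noise)     # learn to predict the added noise
```

At sampling time, the learned noise predictor is applied iteratively, starting from pure noise and stepping back toward the data distribution.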
Resources on Diffusion Models
- Introduction: "A friendly Introduction to Denoising Diffusion Probabilistic Models" by Antony Gitau
- Deep dive: "What are Diffusion Models?" by Lilian Weng
- Code walkthrough: "The Annotated Diffusion Model" from Hugging Face
- Advanced technique: "Fine-tuning Diffusion Models with LoRA" from Hugging Face
Key Takeaways
- Simple models like Naive Bayes and Gaussian Mixture Models form the foundation of generative modeling
- Variational Auto-Encoders learn compressed data representations useful for generation and denoising
- Generative Adversarial Networks create realistic outputs through an adversarial training process
- Conditional GANs extend the GAN framework to enable text-to-image generation
- Normalizing Flows learn invertible transformations between simple distributions and complex ones
- Diffusion Models iteratively denoise random inputs to create structured outputs like images
- Each architecture presents different tradeoffs in training stability, output quality, and controllability