The Grand AI Handbook

Multimodal Models

Approaches for combining text with other modalities like images, audio, and video.

Note: This is the set of topics with which I'm the least familiar, but I wanted to include it for completeness. I'll be lighter on commentary and recommendations here, and will return to add more when I think I have a tighter story to tell. The post "Multimodality and Large Multimodal Models (LMMs)" by Chip Huyen is a nice broad overview (or "How Multimodal LLMs Work" by Kevin Musgrave for a more concise one).

Tokenization Beyond Text

The idea of tokenization isn’t only relevant to text; audio, images, and video can also be “tokenized” for use in Transformer-style architectures, and there are a range of tradeoffs to consider between tokenization and other methods like convolution. The next two sections will look more into visual inputs; this blog post from AssemblyAI touches on a number of relevant topics for audio tokenization and representation in sequence models, for applications like audio generation, text-to-speech, and speech-to-text.

Just as text can be broken into tokens, other modalities like images can be divided into "patches" or audio into "frames" that serve as tokens for multimodal transformers.
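As a concrete (if simplified) sketch of this idea, the snippet below shows one way an image might be cut into flattened patch vectors and a waveform into overlapping frames, each of which can then play the role of a token. The 224×224 image size, 16×16 patch size, and 400-sample frames with a 160-sample hop are illustrative assumptions, not settings from any particular model.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patch 'tokens'."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)                # (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * c)   # (num_patches, p*p*C)

def audio_to_frames(waveform: np.ndarray, frame_size: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames, one 'token' per frame."""
    num_frames = 1 + (len(waveform) - frame_size) // hop
    idx = np.arange(frame_size)[None, :] + hop * np.arange(num_frames)[:, None]
    return waveform[idx]                                       # (num_frames, frame_size)

image = np.random.rand(224, 224, 3)
print(image_to_patches(image).shape)   # (196, 768): 14x14 patches of 16x16x3 pixels
audio = np.random.randn(16000)         # one second of audio at 16 kHz (assumed)
print(audio_to_frames(audio).shape)    # (98, 400)
```

In practice these raw patches or frames are passed through a learned embedding (or a learned discrete codebook, as in the next section) before being fed to a transformer.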

VQ-VAE

The VQ-VAE architecture has become quite popular for image generation in recent years, and underlies at least the earlier versions of DALL-E.
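To make the idea concrete, here is a minimal sketch of the vector-quantization step at the heart of a VQ-VAE: continuous encoder outputs are snapped to their nearest codebook entries, yielding discrete codes that can later be modeled autoregressively. The codebook size, latent dimension, and commitment weight below are illustrative assumptions, not values from any specific paper.

```python
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # weight on the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, num_latents, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.shape[-1])                  # (B*N, code_dim)
        dists = torch.cdist(flat, self.codebook.weight)        # distance to each code
        codes = dists.argmin(dim=-1).view(z_e.shape[:-1])      # (B, N) discrete ids
        z_q = self.codebook(codes)                             # quantized latents

        # Codebook loss + commitment loss, as in the VQ-VAE objective.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, loss

vq = VectorQuantizer()
z_e = torch.randn(2, 196, 64)          # e.g. latents for a 14x14 grid of image patches
z_q, codes, loss = vq(z_e)
print(z_q.shape, codes.shape, loss.item())
```

The discrete `codes` are what make the image "token-like": a separate sequence model (e.g. a transformer) can be trained over them, which is roughly the recipe behind the earlier DALL-E models.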

Vision Transformers

Vision Transformers extend the Transformer architecture to domains like image and video, and have become popular for applications like self-driving cars as well as for multimodal LLMs. There’s a nice section in the d2l.ai book about how they work.

Vision Transformers (ViT) adapt the transformer architecture to work with images by splitting them into patches, embedding these patches, and processing them just like tokens in a standard transformer model.
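Below is a rough sketch of that front end: a strided convolution acts as the patch embedding, a [CLS] token is prepended, and learned position embeddings are added before the sequence goes through an ordinary transformer encoder. All hyperparameters here (image size, patch size, embedding dimension, layer counts) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution both splits the image into patches and embeds them.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                      # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend a [CLS] token
        return x + self.pos_embed                  # add learned position embeddings

# The resulting token sequence is processed by a standard transformer encoder.
tokens = ViTEmbed()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True), num_layers=2)
print(encoder(tokens).shape)                       # torch.Size([2, 197, 192])
```

Once the image has been turned into a sequence of patch embeddings, nothing downstream is image-specific, which is part of why the same recipe transfers so readily to video and to multimodal LLMs.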

Key Takeaways

  • Tokenization concepts extend beyond text to images, audio, and video
  • VQ-VAE architectures provide a foundation for image generation, including early versions of DALL-E
  • Vision Transformers adapt the transformer architecture to process images by dividing them into patches
  • Multimodal models combine different forms of data (text, images, audio) for richer understanding and generation
  • Research in multimodal architectures continues to evolve rapidly, with various approaches to combining different data types