Multimodal Models
Approaches for combining text with other modalities like images, audio, and video.
Tokenization Beyond Text
The idea of tokenization isn’t only relevant to text; audio, images, and video can also be “tokenized” for use in Transformer-style architectures, and there are a range of tradeoffs to consider between tokenization and other methods like convolution. The next two sections will look more into visual inputs; this blog post from AssemblyAI touches on a number of relevant topics for audio tokenization and representation in sequence models, for applications like audio generation, text-to-speech, and speech-to-text.
Just as text can be broken into tokens, images can be divided into "patches" and audio into "frames" that serve as tokens for multimodal transformers, as in the sketch below.
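As a concrete, deliberately simplified illustration, the NumPy sketch below chops a waveform into fixed-length frames and an image into square patches; the sample rate, frame length, and patch size are illustrative assumptions, not taken from any particular model.

```python
# Minimal sketch: turning non-text signals into token-like sequences.
# All shapes and sizes here are illustrative assumptions.
import numpy as np

# Audio: a 1-second mono waveform at 16 kHz, split into 25 ms "frames".
waveform = np.random.randn(16000)
frame_len = 400                      # 25 ms at 16 kHz (assumed)
n_frames = len(waveform) // frame_len
audio_tokens = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
print(audio_tokens.shape)            # (40, 400): 40 frame "tokens"

# Image: a 224x224 RGB image, split into 16x16 patches.
image = np.random.randn(224, 224, 3)
P = 16
patches = image.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
image_tokens = patches.reshape(-1, P * P * 3)
print(image_tokens.shape)            # (196, 768): 196 patch "tokens"
```

Each row of `audio_tokens` or `image_tokens` can then be linearly projected into the model dimension and fed to a Transformer just like a text token embedding.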
VQ-VAE
The VQ-VAE architecture has become quite popular for image generation in recent years, and underlies at least the earlier versions of DALL-E.
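At its core, VQ-VAE maps each continuous latent vector produced by an encoder to its nearest entry in a learned codebook, so an image becomes a grid of discrete indices. The sketch below shows just that quantization step; the codebook size, latent dimension, and grid size are illustrative assumptions rather than DALL-E's actual settings, and the training details (straight-through gradients, commitment loss) are omitted.

```python
# Minimal sketch of the quantization step at the heart of VQ-VAE.
# Codebook size, latent dimension, and grid size are assumptions.
import numpy as np

K, D = 512, 64                        # codebook size, embedding dimension
codebook = np.random.randn(K, D)      # learned during training

# Encoder output: a 32x32 grid of continuous latent vectors of dimension D.
z_e = np.random.randn(32 * 32, D)

# Quantize: replace each latent with its nearest codebook entry.
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (1024, K)
indices = dists.argmin(axis=1)        # discrete "image tokens", shape (1024,)
z_q = codebook[indices]               # quantized latents fed to the decoder
```

Those discrete indices are what an autoregressive prior (e.g. a Transformer) can model, which is how VQ-VAE connects image generation to next-token prediction.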
Resources on VQ-VAE
- Blog post: "Understanding VQ-VAE (DALL-E Explained Pt. 1)" from the Machine Learning @ Berkeley blog
- Blog post: "How is it so good? (DALL-E Explained Pt. 2)" from Machine Learning @ Berkeley
- Tutorial: "Understanding Vector Quantized Variational Autoencoders (VQ-VAE)" by Shashank Yadav
Vision Transformers
Vision Transformers extend the Transformer architecture to domains like images and video, and have become popular for applications like self-driving cars as well as for multimodal LLMs. There’s a nice section in the d2l.ai book about how they work.
Vision Transformers (ViT) adapt the transformer architecture to work with images by splitting them into patches, embedding these patches, and processing them just like tokens in a standard transformer model.
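As a rough sketch of that patch-embedding step, the PyTorch module below cuts an image into patches with a strided convolution, projects each patch to the model dimension, prepends a [CLS] token, and adds learned position embeddings; the hyperparameters are illustrative assumptions, not those of any specific ViT checkpoint.

```python
# Minimal PyTorch sketch of a ViT-style patch embedding (hyperparameters
# are illustrative assumptions).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # A strided convolution both cuts the image into patches and
        # linearly projects each patch to the model dimension.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, dim) -- patch "tokens"
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [CLS] token
        return x + self.pos_embed              # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # torch.Size([2, 197, 768])
```

From this point on, the resulting sequence of patch tokens is processed by standard Transformer encoder blocks, exactly as token embeddings would be in a text model.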
Resources on Vision and Multimodal Models
- Blog post: "Generalized Visual Language Models" by Lilian Weng - discusses a range of different approaches for training multimodal Transformer-style models
- Guide: "Guide to Vision Language Models" from Encord's blog - overviews several architectures for mixing text and vision
- Paper: MM1 from Apple - examines several architecture and data tradeoffs with experimental evidence for Vision Transformers
- Visualization: "Multimodal Neurons in Artificial Neural Networks" from Distill.pub - very fun visualizations of concept representations in multimodal networks
Key Takeaways
- Tokenization concepts extend beyond text to images, audio, and video
- VQ-VAE architectures provide a foundation for image generation, including early versions of DALL-E
- Vision Transformers adapt the transformer architecture to process images by dividing them into patches
- Multimodal models combine different forms of data (text, images, audio) for richer understanding and generation
- Research in multimodal architectures continues to evolve rapidly, with various approaches to combining different data types