The Grand AI Handbook

Multimodal Models

Approaches for combining text with other modalities like images, audio, and video.

Note: This is the set of topics with which I'm the least familiar, but I wanted to include it for completeness. I'll be lighter on commentary and recommendations here, and will return to add more when I think I have a tighter story to tell. The post "Multimodality and Large Multimodal Models (LMMs)" by Chip Huyen is a nice broad overview (or "How Multimodal LLMs Work" by Kevin Musgrave for a more concise one).

Tokenization Beyond Text

The idea of tokenization isn’t only relevant to text; audio, images, and video can also be “tokenized” for use in Transformer-style architectures, and there are a range of tradeoffs to consider between tokenization and other methods like convolution. The next two sections will look more into visual inputs; this blog post from AssemblyAI touches on a number of relevant topics for audio tokenization and representation in sequence models, for applications like audio generation, text-to-speech, and speech-to-text.

Just as text can be broken into tokens, other modalities like images can be divided into "patches" or audio into "frames" that serve as tokens for multimodal transformers.
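As a concrete (if simplified) sketch of this idea, the snippet below shows one way an image might be cut into flattened patch vectors and a waveform into overlapping frames, each of which can then play the role of a token. The 224×224 image size, 16×16 patch size, and 400-sample frames with a 160-sample hop are illustrative assumptions, not settings from any particular model.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patch 'tokens'."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)                # (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * c)   # (num_patches, p*p*C)

def audio_to_frames(waveform: np.ndarray, frame_size: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames, one 'token' per frame."""
    num_frames = 1 + (len(waveform) - frame_size) // hop
    idx = np.arange(frame_size)[None, :] + hop * np.arange(num_frames)[:, None]
    return waveform[idx]                                       # (num_frames, frame_size)

image = np.random.rand(224, 224, 3)
print(image_to_patches(image).shape)   # (196, 768): 14x14 patches of 16x16x3 pixels
audio = np.random.randn(16000)         # one second of audio at 16 kHz (assumed)
print(audio_to_frames(audio).shape)    # (98, 400)
```

In practice these raw patches or frames are passed through a learned embedding (or a learned discrete codebook, as in the next section) before being fed to a transformer.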

VQ-VAE

The VQ-VAE architecture has become quite popular for image generation in recent years, and underlies at least the earlier versions of DALL-E.
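To make the idea concrete, here is a minimal sketch of the vector-quantization step at the heart of a VQ-VAE: continuous encoder outputs are snapped to their nearest codebook entries, yielding discrete codes that can later be modeled autoregressively. The codebook size, latent dimension, and commitment weight below are illustrative assumptions, not values from any specific paper.

```python
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # weight on the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, num_latents, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.shape[-1])                  # (B*N, code_dim)
        dists = torch.cdist(flat, self.codebook.weight)        # distance to each code
        codes = dists.argmin(dim=-1).view(z_e.shape[:-1])      # (B, N) discrete ids
        z_q = self.codebook(codes)                             # quantized latents

        # Codebook loss + commitment loss, as in the VQ-VAE objective.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, loss

vq = VectorQuantizer()
z_e = torch.randn(2, 196, 64)          # e.g. latents for a 14x14 grid of image patches
z_q, codes, loss = vq(z_e)
print(z_q.shape, codes.shape, loss.item())
```

The discrete `codes` are what make the image "token-like": a separate sequence model (e.g. a transformer) can be trained over them, which is roughly the recipe behind the earlier DALL-E models.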

Vision Transformers

Vision Transformers extend the Transformer architecture to domains like image and video, and have become popular for applications like self-driving cars as well as for multimodal LLMs. There’s a nice section in the d2l.ai book about how they work.

Vision Transformers (ViT) adapt the transformer architecture to work with images by splitting them into patches, embedding these patches, and processing them just like tokens in a standard transformer model.
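Below is a rough sketch of that front end: a strided convolution acts as the patch embedding, a [CLS] token is prepended, and learned position embeddings are added before the sequence goes through an ordinary transformer encoder. All hyperparameters here (image size, patch size, embedding dimension, layer counts) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution both splits the image into patches and embeds them.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                      # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend a [CLS] token
        return x + self.pos_embed                  # add learned position embeddings

# The resulting token sequence is processed by a standard transformer encoder.
tokens = ViTEmbed()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True), num_layers=2)
print(encoder(tokens).shape)                       # torch.Size([2, 197, 192])
```

Once the image has been turned into a sequence of patch embeddings, nothing downstream is image-specific, which is part of why the same recipe transfers so readily to video and to multimodal LLMs.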

Key Takeaways

  • Tokenization concepts extend beyond text to images, audio, and video
  • VQ-VAE architectures provide a foundation for image generation, including early versions of DALL-E
  • Vision Transformers adapt the transformer architecture to process images by dividing them into patches
  • Multimodal models combine different forms of data (text, images, audio) for richer understanding and generation
  • Research in multimodal architectures continues to evolve rapidly, with various approaches to combining different data types