Multimodal and Dynamic Vision

Chapter 54: Multimodal Learning: Vision and Language (CLIP, ViLBERT, BLIP, image captioning, VQA)
Chapter 55: Multimodal Learning: Vision and Beyond (Vision-audio, vision-touch, cross-modal retrieval)
Chapter 56: Video Understanding: Classification and Action (C3D, I3D, SlowFast, TimeSformer, VideoMAE)
Chapter 57: Video Segmentation and Tracking (VOS, STCN, DeepSORT, ByteTrack, multi-object tracking)
Chapter 58: Event-Based and Neuromorphic Vision (Event cameras, DVS, spiking neural networks)