Neural, Generative, and RL Model Interpretability
Specialized methods for deep learning, generative AI, and reinforcement learning.
Chapter 16: Learned Features and Saliency Maps
Feature visualization and saliency maps for neural networks [Gradient-based attribution, Grad-CAM, feature maps]
Chapter 17: Concept-Based Explanations
Detecting and explaining high-level concepts [TCAV (Testing with Concept Activation Vectors), concept bottleneck models]
Chapter 18: Adversarial Examples
Adversarial attacks and their implications for interpretability [Adversarial perturbations, robust saliency, explanation stability]
Chapter 19: Influential Instances
Identifying training data that impacts predictions [Influence functions, data valuation, memorization detection]
Chapter 20: Mechanistic Interpretability
Understanding internal model computations [Circuit analysis, sparse autoencoders, probing representations]
Chapter 21: Generative AI Interpretability
Explaining generative models such as LLMs, GANs, and diffusion models [Latent space traversal, attribution for text generation, diffusion path analysis]
Chapter 22: Multimodal Model Interpretability
Explaining vision-language and other multimodal models [Cross-modal SHAP, multimodal TCAV, attention visualization, attention rollouts, attention flow]
Chapter 23: Reinforcement Learning Interpretability
Explaining policies and value functions in RL models [Policy visualization, value attribution, Q-function decomposition]