RNN and CNN
Explore recurrent and convolutional neural networks, the precursor architectures to foundation models in natural language processing and computer vision tasks.
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks are designed for sequential data, ideal for NLP tasks where word order is critical. RNNs maintain a hidden state that propagates information across time steps, modeling temporal dependencies. However, vanilla RNNs face significant challenges:
- Vanishing Gradients: During backpropagation through time, gradients can shrink exponentially, hindering learning of long-range dependencies.
- Exploding Gradients: Conversely, gradients can grow uncontrollably, causing unstable training. Techniques like gradient clipping mitigate this.
The paper Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview by Schmidt (2019) provides a thorough introduction to these dynamics.
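To make the recurrence and the clipping fix concrete, here is a minimal sketch assuming PyTorch; the dimensions and the placeholder loss are illustrative only, not taken from any of the papers above.

```python
import torch
import torch.nn as nn

# Vanilla RNN: the hidden state h_t = tanh(W_ih x_t + W_hh h_{t-1} + b) carries
# information across time steps; nn.RNN implements exactly this recurrence.
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(8, 50, 16)          # batch of 8 sequences, 50 time steps, 16 features
out, h_n = rnn(x)                   # out: (8, 50, 32), h_n: final hidden state (1, 8, 32)

# Exploding gradients are commonly mitigated with gradient clipping:
loss = out.sum()                    # placeholder loss, for illustration only
loss.backward()
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
```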
Key Variants:
- Long Short-Term Memory (LSTM): Introduced in Long Short-Term Memory by Hochreiter and Schmidhuber (1997), LSTMs use memory cells and gates (input, forget, output) to selectively retain information, addressing vanishing gradients.
- Gated Recurrent Unit (GRU): Proposed in Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al. (2014), GRUs simplify LSTMs by merging gates, offering comparable performance with fewer parameters (see the comparison sketch after this list).
- Highway Networks: Described in Highway Networks by Srivastava et al. (2015), these use LSTM-inspired gates that let information bypass layers, easing gradient flow in very deep networks.
- Bidirectional RNNs: Process sequences in both directions, enhancing context for tasks like named entity recognition.
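As a rough illustration of the LSTM/GRU trade-off and of bidirectional processing, the following sketch (assuming PyTorch; the layer sizes are arbitrary) compares parameter counts:

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

# GRUs merge gates (update/reset instead of input/forget/output), so at the same
# hidden size they have roughly three quarters of the LSTM's parameters.
print(count_params(lstm), count_params(gru))

# A bidirectional LSTM processes the sequence forwards and backwards and
# concatenates both hidden states, doubling the output dimension to 512.
bilstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True, bidirectional=True)
```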
Attention Mechanisms in RNNs: Attention mechanisms, introduced in Effective Approaches to Attention-based Neural Machine Translation by Luong et al. (2015), allow RNNs to focus on relevant parts of the input sequence, significantly improving performance in tasks like machine translation. This concept later inspired the Transformer’s self-attention, which largely replaced RNNs in foundation models.
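A minimal sketch of Luong-style global (dot) attention, assuming PyTorch; the function name and tensor shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def luong_dot_attention(decoder_state, encoder_states):
    """Global (dot) attention in the spirit of Luong et al. (2015), minimal form.

    decoder_state:  (batch, hidden)          current decoder hidden state
    encoder_states: (batch, src_len, hidden) all encoder hidden states
    """
    # Alignment scores: dot product between the decoder state and each encoder state.
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                          # attention weights
    # Context vector: weighted sum of encoder states.
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)        # (batch, hidden)
    return context, weights

context, weights = luong_dot_attention(torch.randn(4, 256), torch.randn(4, 20, 256))
```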
Downstream Applications:
- Text Classification: Sentiment analysis, spam detection (e.g., LSTM-based classifiers; a minimal example follows this list).
- Machine Translation: Sequence-to-sequence models with attention (e.g., Google's 2016 Neural Machine Translation system).
- Speech Recognition: Transcribing audio using RNNs for temporal modeling.
- Named Entity Recognition (NER): Identifying entities with bidirectional LSTMs or GRUs.
- Text Generation: Generating coherent text sequences (e.g., early chatbots).
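For the text classification item above, a toy LSTM classifier might look like the following (assuming PyTorch; the vocabulary size, dimensions, and class count are placeholders):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Toy LSTM text classifier of the kind used for sentiment analysis."""

    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])               # logits: (batch, num_classes)

logits = LSTMClassifier()(torch.randint(0, 10_000, (8, 40)))
```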
RNNs were pivotal in early NLP but have been largely superseded by Transformers, which parallelize across sequence positions and scale better to large corpora. The blog Understanding LSTM Networks by Chris Olah remains a gold standard for LSTM intuition.
Key Resources for RNNs
- Paper: Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview by Schmidt (2019) – Comprehensive RNN introduction
- Paper: Long Short-Term Memory by Hochreiter and Schmidhuber (1997) – Original LSTM paper
- Paper: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al. (2014) – GRU evaluation
- Paper: Highway Networks by Srivastava et al. (2015) – Gradient flow improvements
- Paper: Effective Approaches to Attention-based Neural Machine Translation by Luong et al. (2015) – Attention in RNNs
- Blog post: Understanding LSTM Networks by Chris Olah
- Blog post: The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
- Video: RNNs and LSTMs Explained from DeepLearning.AI
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks excel at processing grid-like data, such as images, by using convolutional layers to extract spatial features (e.g., edges, textures). Shared weights (filters) reduce parameters, while pooling layers downsample features for generalization. Like RNNs, CNNs face vanishing and exploding gradient issues in deep networks, addressed by architectures like ResNet. The paper An Introduction to Convolutional Neural Networks by O’Shea and Nash (2015) offers a clear primer.
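A minimal sketch of weight sharing and pooling, assuming PyTorch; the channel counts and image size are arbitrary.

```python
import torch
import torch.nn as nn

# A convolutional layer slides the same small filters over the whole image
# (weight sharing), so the parameter count depends on the kernel, not the image size.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)   # pooling downsamples the feature maps

x = torch.randn(1, 3, 224, 224)      # one RGB image
features = pool(torch.relu(conv(x))) # (1, 16, 112, 112): 16 feature maps at half resolution
print(conv.weight.shape)             # (16, 3, 3, 3): only 16*3*3*3 shared weights
```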
Key Variants:
- AlexNet: Introduced in ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al. (2012), it popularized deep CNNs.
- ResNet (Residual Network): Proposed in Deep Residual Learning for Image Recognition by He et al. (2015), it uses skip connections to mitigate vanishing gradients in deep networks (see the sketch after this list).
- ResNeXt: Described in Aggregated Residual Transformations for Deep Neural Networks by Xie et al. (2016), it enhances ResNet with grouped convolutions.
- DenseNet: Introduced in Densely Connected Convolutional Networks by Huang et al. (2016), it connects each layer to all subsequent layers for feature reuse.
- ConvNeXt: Proposed in A ConvNet for the 2020s by Liu et al. (2022), it modernizes CNNs with Transformer-inspired designs.
- UNet: Developed in U-Net: Convolutional Networks for Biomedical Image Segmentation by Ronneberger et al. (2015), it uses an encoder-decoder architecture for segmentation.
- MobileNet: Introduced in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Howard et al. (2017), it targets mobile devices with depthwise separable convolutions.
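As referenced under ResNet above, here is a minimal residual block, assuming PyTorch; it illustrates the skip connection, not the exact published architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal ResNet-style block: the skip connection lets gradients flow
    directly through the identity path, easing training of deep networks."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: output = F(x) + x

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```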
Downstream Applications:
- Image Classification: Labeling images (e.g., ResNet on ImageNet).
- Object Detection: Localizing objects (e.g., Faster R-CNN, YOLO).
- Semantic Segmentation: Pixel-wise classification (e.g., UNet in medical imaging).
- Facial Recognition: Identifying faces (e.g., MobileNet-based systems).
- Image Generation: Serving as generator and discriminator backbones in GANs (e.g., for style transfer).
CNNs remain integral to vision foundation models like CLIP and DINO, often serving as feature extractors. The paper Large Multimodal Models: Notes on CVPR 2023 Tutorial discusses their role in multimodal systems.
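As a sketch of that feature-extractor pattern (assuming torchvision >= 0.13 is available; this is not CLIP's actual training code), a ResNet backbone can be stripped of its classification head:

```python
import torch
import torch.nn as nn
from torchvision import models

# Use a ResNet backbone as a frozen image feature extractor.
backbone = models.resnet50(weights=None)   # pass weights="IMAGENET1K_V2" to load pretrained weights
backbone.fc = nn.Identity()                # drop the classification head
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))  # (1, 2048) image embedding
```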
Key Resources for CNNs
- Paper: An Introduction to Convolutional Neural Networks by O’Shea and Nash (2015) – CNN basics
- Paper: ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al. (2012) – AlexNet
- Paper: Deep Residual Learning for Image Recognition by He et al. (2015) – ResNet
- Paper: Aggregated Residual Transformations for Deep Neural Networks by Xie et al. (2016) – ResNeXt
- Paper: Densely Connected Convolutional Networks by Huang et al. (2016) – DenseNet
- Paper: U-Net: Convolutional Networks for Biomedical Image Segmentation by Ronneberger et al. (2015) – UNet
- Paper: A ConvNet for the 2020s by Liu et al. (2022) – ConvNeXt
- Blog post: A Comprehensive Guide to CNNs on Towards Data Science
- Video: CNNs Explained from Stanford CS231n
- Paper: Large Multimodal Models: Notes on CVPR 2023 Tutorial by Li (2023) – CNNs in multimodal models
RNNs and CNNs in Foundation Models
RNNs and CNNs were critical precursors to foundation models. RNNs equipped with attention powered early seq2seq systems and LSTM-based contextual embedding models such as ELMo, paving the way for self-supervised pretraining in BERT and GPT. CNNs provided robust feature extraction for vision models like CLIP, whose original release included ResNet backbones alongside Vision Transformers. Hybrid approaches, such as ConvNeXt or attention-augmented LSTMs, remain relevant in specialized tasks. The paper Multimodal Foundation Models: From Specialists to General-Purpose Assistants highlights CNNs' role in multimodal systems.
Resources on RNNs and CNNs in Foundation Models
- Paper: A Comprehensive Survey on Pretrained Foundation Models by Zhao et al. (2023) – Historical context
- Paper: Multimodal Foundation Models: From Specialists to General-Purpose Assistants by Li et al. (2023) – CNNs in multimodal models
- Blog post: How CNNs and RNNs Paved the Way for Transformers by IBM Research
Key Takeaways
- RNNs process sequential data for NLP, with LSTMs and GRUs mitigating vanishing gradients (exploding gradients are handled with clipping)
- Attention mechanisms in RNNs improved tasks like translation, inspiring Transformer architectures
- CNNs extract spatial features for vision tasks, with variants like ResNet and ConvNeXt tackling gradient issues
- Applications include text classification, translation, image classification, and segmentation
- RNNs and CNNs influenced foundation models, contributing to NLP and vision components