The Grand AI Handbook

RNNs and CNNs

Explore the RNN and CNN architectures that laid the groundwork for foundation models in natural language processing and computer vision tasks.

This section delves into the core neural network architectures—Recurrent Neural Networks (RNNs) for Natural Language Processing (NLP) and Convolutional Neural Networks (CNNs) for Computer Vision (CV)—that laid the foundation for modern foundation models. We cover their basics, variants, challenges like vanishing and exploding gradients, attention mechanisms, and their applications in downstream tasks. These architectures have shaped the evolution of versatile, large-scale models. For a comprehensive overview, the paper A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT traces their historical context.

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks are designed for sequential data, ideal for NLP tasks where word order is critical. RNNs maintain a hidden state that propagates information across time steps, modeling temporal dependencies. However, vanilla RNNs face significant challenges:

  • Vanishing Gradients: During backpropagation through time, gradients can shrink exponentially, hindering learning of long-range dependencies.
  • Exploding Gradients: Conversely, gradients can grow uncontrollably, causing unstable training. Techniques like gradient clipping mitigate this.

The paper Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview provides a thorough introduction to these dynamics.
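
To make the recurrence and the gradient issues concrete, here is a minimal sketch of a vanilla RNN classifier with one step of backpropagation through time and gradient clipping. It assumes PyTorch is available; the layer sizes, sequence length, and data are illustrative placeholders, not part of the original text.

```python
import torch
import torch.nn as nn

# A tiny sequence classifier built on a vanilla RNN (sizes are hypothetical).
rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 2)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(8, 100, 32)       # batch of 8 sequences, 100 time steps each
y = torch.randint(0, 2, (8,))     # binary labels

out, h_n = rnn(x)                 # out: (8, 100, 64); h_n: final hidden state (1, 8, 64)
logits = head(h_n.squeeze(0))     # classify from the last hidden state
loss = nn.functional.cross_entropy(logits, y)

loss.backward()                   # backpropagation through time over 100 steps
# Gradient clipping keeps exploding gradients in check before the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
opt.step()
```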

Key Variants:

  • Long Short-Term Memory (LSTM): Introduced in Long Short-Term Memory by Hochreiter and Schmidhuber (1997), LSTMs use memory cells and gates (input, forget, output) to selectively retain information, addressing vanishing gradients.
  • Gated Recurrent Unit (GRU): Proposed in Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al. (2014), GRUs simplify LSTMs by merging gates, offering comparable performance with fewer parameters (the two are compared in the sketch after this list).
  • Highway Networks: Described in Highway Networks by Srivastava et al. (2015), these use LSTM-inspired gates that let information bypass layers, easing gradient flow in very deep networks.
  • Bidirectional RNNs: Process sequences in both directions, enhancing context for tasks like named entity recognition.
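
As a rough illustration of how these variants differ in practice, the sketch below instantiates an LSTM, a GRU, and a bidirectional LSTM on the same input and compares their parameter counts. It assumes PyTorch; the shapes are arbitrary examples.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 50, 32)  # (batch, time, features)

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
bilstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)

out_lstm, (h, c) = lstm(x)   # LSTM carries both a hidden state h and a cell state c
out_gru, h_gru = gru(x)      # GRU keeps a single hidden state with fewer gates
out_bi, _ = bilstm(x)        # bidirectional output concatenates both directions: (4, 50, 128)

# GRUs merge gates, so they use fewer parameters than an LSTM of the same size:
# the LSTM has 4 gate/candidate weight blocks, the GRU has 3.
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))
```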

Attention Mechanisms in RNNs: Attention mechanisms, introduced in Effective Approaches to Attention-based Neural Machine Translation by Luong et al. (2015), allow RNNs to focus on relevant parts of the input sequence, significantly improving performance in tasks like machine translation. This concept later inspired the Transformer’s self-attention, which largely replaced RNNs in foundation models.
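
Below is a minimal sketch of Luong-style dot-product attention, assuming PyTorch; the encoder states and decoder state are random placeholders standing in for the outputs of an RNN encoder-decoder.

```python
import torch
import torch.nn.functional as F

# Placeholder tensors: 20 encoder hidden states and one current decoder state.
encoder_states = torch.randn(1, 20, 64)   # (batch, source_len, hidden)
decoder_state = torch.randn(1, 64)        # (batch, hidden)

# Score each source position against the decoder state (dot product),
# then normalize into an attention distribution over the source sequence.
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (1, 20)
weights = F.softmax(scores, dim=-1)

# Context vector: attention-weighted sum of encoder states,
# which the decoder combines with its state to predict the next token.
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (1, 64)
```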

Downstream Applications:

  • Text Classification: Sentiment analysis, spam detection (e.g., LSTM-based classifiers).
  • Machine Translation: Sequence-to-sequence models with attention (e.g., early Google Translate).
  • Speech Recognition: Transcribing audio using RNNs for temporal modeling.
  • Named Entity Recognition (NER): Identifying entities with bidirectional LSTMs or GRUs.
  • Text Generation: Generating coherent text sequences (e.g., early chatbots).

RNNs were pivotal in early NLP but have been largely superseded by Transformers due to scalability and parallelization. The blog Understanding LSTM Networks by Chris Olah remains a gold standard for LSTM intuition.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks excel at processing grid-like data, such as images, by using convolutional layers to extract spatial features (e.g., edges, textures). Shared weights (filters) reduce parameters, while pooling layers downsample features for generalization. Like RNNs, CNNs face vanishing and exploding gradient issues in deep networks, addressed by architectures like ResNet. The paper An Introduction to Convolutional Neural Networks by O’Shea and Nash (2015) offers a clear primer.
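
The sketch below puts these pieces together: shared convolutional filters, max pooling for downsampling, and a ResNet-style residual block whose identity shortcut eases gradient flow. It assumes PyTorch, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Two convolutions plus an identity shortcut, in the spirit of ResNet."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        # The shortcut lets gradients bypass the convolutions,
        # mitigating vanishing gradients in deep stacks of such blocks.
        return self.act(x + self.conv2(self.act(self.conv1(x))))

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 shared filters over the image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 32x32 -> 16x16
    TinyResidualBlock(16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),                           # e.g., a 10-way image classifier
)

x = torch.randn(1, 3, 32, 32)   # a single RGB image
print(model(x).shape)           # torch.Size([1, 10])
```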

Key Variants:

  • ResNet: Introduced in Deep Residual Learning for Image Recognition by He et al. (2015), residual (skip) connections let gradients bypass convolutional layers, enabling very deep networks and mitigating vanishing gradients.
  • MobileNet: Proposed in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Howard et al. (2017), uses depthwise separable convolutions for lightweight models suited to mobile and embedded devices.
  • UNet: Described in U-Net: Convolutional Networks for Biomedical Image Segmentation by Ronneberger et al. (2015), an encoder-decoder with skip connections, widely used for segmentation.
  • ConvNeXt: Presented in A ConvNet for the 2020s by Liu et al. (2022), modernizes CNN design to remain competitive with vision Transformers.

Downstream Applications:

  • Image Classification: Labeling images (e.g., ResNet on ImageNet).
  • Object Detection: Localizing objects (e.g., Faster R-CNN, YOLO).
  • Semantic Segmentation: Pixel-wise classification (e.g., UNet in medical imaging).
  • Facial Recognition: Identifying faces (e.g., MobileNet-based systems).
  • Image Generation: Supporting GANs for tasks like style transfer.

CNNs remain integral to vision foundation models like CLIP and DINO, often serving as feature extractors. The paper Large Multimodal Models: Notes on CVPR 2023 Tutorial discusses their role in multimodal systems.
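
As a sketch of this feature-extractor pattern, the snippet below strips the classification head from a ResNet-18 and uses it to embed a batch of images. It assumes a recent torchvision is installed; in practice you would load pretrained weights rather than the random initialization used here.

```python
import torch
import torchvision.models as models

# Use a CNN backbone as a frozen feature extractor for a downstream model.
backbone = models.resnet18(weights=None)   # e.g., weights=models.ResNet18_Weights.DEFAULT for pretrained features
backbone.fc = torch.nn.Identity()          # drop the classification head
backbone.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # a batch of four RGB images
    features = backbone(images)            # per-image embeddings for downstream use
print(features.shape)                      # torch.Size([4, 512])
```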

RNNs and CNNs in Foundation Models

RNNs and CNNs were critical precursors to foundation models. RNNs with attention mechanisms shaped early NLP systems like seq2seq translation models, which, together with embedding methods such as word2vec, influenced self-supervised learning in BERT and GPT. CNNs provided robust feature extraction for vision models like CLIP, which can integrate CNN backbones with Transformers. Hybrid approaches, such as ConvNeXt or attention-augmented LSTMs, remain relevant in specialized tasks. The paper Multimodal Foundation Models: From Specialists to General-Purpose Assistants highlights CNNs’ role in multimodal systems.

Key Takeaways

  • RNNs process sequential data for NLP, with LSTMs and GRUs addressing vanishing/exploding gradients
  • Attention mechanisms in RNNs improved tasks like translation, inspiring Transformer architectures
  • CNNs extract spatial features for vision tasks, with variants like ResNet and ConvNeXt tackling gradient issues
  • Applications include text classification, translation, image classification, and segmentation
  • RNNs and CNNs influenced foundation models, contributing to NLP and vision components