The Grand AI Handbook

NLP and Computer Vision

The evolution of Natural Language Processing and Computer Vision: From traditional approaches to neural networks that paved the way for foundation models.

This section explores the pivotal developments in Natural Language Processing (NLP) and Computer Vision (CV) that set the stage for foundation models. We'll trace the evolution from traditional rule-based systems to statistical methods, and finally to the deep learning architectures that enabled large-scale models with transfer learning capabilities across domains.

Background in NLP

Natural Language Processing evolved dramatically over decades, transforming from highly specialized rule-based systems to today’s versatile foundation models. Understanding this evolution provides crucial context for appreciating current capabilities and limitations.

Overview of Classical NLP Pipelines

Traditional NLP systems operated as sequential pipelines with discrete processing stages. Each component handled a specific linguistic analysis task, as illustrated in the short sketch after this list:

  • Tokenization: Breaking text into words, subwords, or characters
  • Part-of-Speech (POS) Tagging: Labeling words as nouns, verbs, adjectives, etc.
  • Named Entity Recognition: Identifying proper nouns and categorizing them (people, organizations, locations)
  • Syntactic Parsing: Determining grammatical structure (constituency or dependency parsing)
  • Coreference Resolution: Identifying when different expressions refer to the same entity
  • Semantic Analysis: Extracting meanings and relationships between words and phrases
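
To make the first few stages concrete, here is a minimal pipeline sketch assuming NLTK is installed and its tokenizer, tagger, and chunker resources have been downloaded; the example sentence is purely illustrative.

```python
# Classical pipeline stages with NLTK (resources such as punkt and the
# averaged perceptron tagger must be fetched once via nltk.download).
import nltk

text = "Marie Curie won the Nobel Prize in Paris."

tokens = nltk.word_tokenize(text)    # tokenization
tagged = nltk.pos_tag(tokens)        # part-of-speech tagging
tree = nltk.ne_chunk(tagged)         # named entity recognition

print(tagged)    # e.g. [('Marie', 'NNP'), ('Curie', 'NNP'), ('won', 'VBD'), ...]
print(tree)      # typically groups "Marie Curie" (PERSON) and "Paris" (GPE) into chunks
```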

Rule-based vs Statistical Methods

Early NLP development followed two distinct paradigms:

Rule-based approaches relied on linguistic experts to manually craft rules for language processing. These systems achieved high precision for specific domains but lacked robustness when facing new variations or domains.

Statistical methods emerged in the 1990s, utilizing probability and machine learning to learn patterns from data. Key developments included:

  • Hidden Markov Models for POS tagging and sequence labeling
  • Statistical parsers based on Probabilistic Context-Free Grammars
  • Named entity recognition systems using conditional random fields
  • Statistical machine translation leveraging parallel corpora

While rule-based systems offered interpretability and precision, statistical methods provided adaptability and generalization capabilities that would become essential for scaling NLP applications.
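
As a concrete example of the statistical paradigm, below is a toy Viterbi decoder for HMM-based POS tagging; the tag set, vocabulary, and probabilities are invented for illustration rather than estimated from a corpus.

```python
# Toy Viterbi decoding for an HMM POS tagger (probabilities are made up).
import numpy as np

tags = ["DET", "NOUN", "VERB"]
start_p = np.array([0.6, 0.3, 0.1])                    # P(tag at position 0)
trans_p = np.array([[0.1, 0.8, 0.1],                    # P(next tag | DET)
                    [0.2, 0.3, 0.5],                    # P(next tag | NOUN)
                    [0.4, 0.5, 0.1]])                   # P(next tag | VERB)
emit_p = {"the":   np.array([0.9, 0.05, 0.05]),         # P(word | tag)
          "dog":   np.array([0.05, 0.9, 0.05]),
          "barks": np.array([0.05, 0.15, 0.8])}

def viterbi(words):
    """Return the most likely tag sequence under the toy HMM."""
    n, k = len(words), len(tags)
    score = np.zeros((n, k))            # best path probability ending in tag j at step i
    back = np.zeros((n, k), dtype=int)  # backpointers to recover that path
    score[0] = start_p * emit_p[words[0]]
    for i in range(1, n):
        for j in range(k):
            cand = score[i - 1] * trans_p[:, j] * emit_p[words[i]][j]
            back[i, j] = cand.argmax()
            score[i, j] = cand.max()
    path = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [tags[j] for j in reversed(path)]

print(viterbi(["the", "dog", "barks"]))   # expected: ['DET', 'NOUN', 'VERB']
```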

Sequence to Sequence Learning

The sequence-to-sequence (seq2seq) framework represented a revolutionary paradigm shift for NLP, particularly for tasks involving transformation between sequences like translation, summarization, and question answering.

Encoder-decoder Architecture

The encoder-decoder architecture introduced an elegant approach to sequence transformation:

  • An encoder processes the input sequence to create a fixed-length representation (context vector)
  • A decoder generates the output sequence conditioned on this representation
  • Initially implemented with LSTMs/GRUs, later versions incorporated attention mechanisms

This design provided a unified approach to previously distinct NLP tasks, establishing a foundation for later transformer-based models.
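
A minimal sketch of this pattern, here using GRUs in PyTorch with illustrative dimensions and teacher forcing, might look as follows:

```python
# Minimal GRU encoder-decoder in the spirit of early seq2seq models.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source into a single hidden state (the "context vector").
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that context, using teacher forcing on tgt_ids.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)             # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))            # batch of 2 source sentences
tgt = torch.randint(0, 1000, (2, 5))            # shifted target tokens
print(model(src, tgt).shape)                    # torch.Size([2, 5, 1000])
```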

Applications: Translation, Summarization

Seq2seq models achieved breakthroughs in key NLP applications:

  • Machine Translation: Neural Machine Translation (NMT) systems outperformed traditional statistical methods
  • Text Summarization: Encoder-decoder architectures enabled both extractive and abstractive summarization
  • Dialogue Systems: Enabled more coherent multi-turn conversations
  • Speech Recognition: Combined with acoustic modeling for improved transcription

Thumbs Up? Sentiment Classification Using ML

Sentiment analysis represents one of the most commercially valuable and widely deployed NLP applications, evolving from basic polarity detection to nuanced emotion recognition.

Traditional ML Models

Early sentiment classification relied on classical machine learning approaches; a minimal example follows the list:

  • Naive Bayes: Probabilistic classifiers applying Bayes’ theorem with a naive independence assumption between features
  • Support Vector Machines (SVMs): Finding optimal hyperplanes to separate sentiment classes
  • Logistic Regression: Predicting probability of sentiment categories
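
A compact example of this style of classifier, assuming scikit-learn is available and using a tiny invented training set:

```python
# Bag-of-words features feeding a Naive Bayes sentiment classifier;
# swapping in LinearSVC or LogisticRegression only changes the final step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a wonderful performance", "boring and way too long"]
train_labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["loved the acting", "what a boring movie"]))
```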

Feature Engineering

The performance of traditional models depended heavily on feature engineering, illustrated in the snippet after this list:

  • Bag of Words (BoW): Simple word presence/absence or frequency-based representations
  • TF-IDF: Term Frequency-Inverse Document Frequency weighting to prioritize informative words
  • N-grams: Capturing short phrases rather than individual words
  • Lexicon-based features: Using sentiment dictionaries to assign polarity scores
  • Syntactic features: Incorporating grammatical relationships between words
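
The snippet below, again assuming scikit-learn, shows TF-IDF weighted unigram and bigram features on two invented sentences:

```python
# Inspect TF-IDF weighted unigram and bigram features for two toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the film was not good", "the film was very good"]

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vec.fit_transform(docs)                  # sparse (n_docs, n_features) matrix

print(vec.get_feature_names_out())           # get_feature_names in older versions
print(X.shape)
```

Bigrams such as “not good” versus “very good” capture simple negation patterns that individual words miss.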

Teaching Machines to Read and Comprehend

Machine reading comprehension (MRC) focuses on systems that can understand text passages and answer questions about them, representing a crucial step toward language understanding.

Reading Comprehension Datasets

Progress in MRC has been driven by increasingly challenging datasets:

  • SQuAD (Stanford Question Answering Dataset): Questions on Wikipedia articles where answers are text spans
  • CNN/Daily Mail: News articles paired with bullet-point summaries converted to cloze-style questions
  • RACE: Reading comprehension questions from English exams for Chinese students
  • NarrativeQA: Questions requiring understanding of entire books or movie scripts

Early Models

Initial approaches to reading comprehension utilized recurrent architectures; a simplified span-scoring sketch follows the list:

  • LSTM/GRU-based models: Capturing sequential dependencies in text
  • Attention mechanisms: Helping models focus on relevant parts of input when generating answers
  • Memory networks: Explicitly storing information for later retrieval when answering questions
  • BiDAF (Bi-Directional Attention Flow): Connecting question and context bidirectionally
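
To illustrate the extractive setting, the sketch below scores each context token as a candidate answer start or end against a pooled question vector; this is a stripped-down illustration with random inputs, not the actual BiDAF formulation.

```python
# Simplified extractive QA: pick an answer span by scoring context tokens
# against a question representation (all tensors here are random stand-ins).
import torch
import torch.nn as nn

hidden = 64
context_states = torch.randn(20, hidden)      # e.g. BiLSTM outputs per context token
question_vec = torch.randn(hidden)            # pooled question representation

start_proj = nn.Linear(hidden, hidden)        # separate views for start vs end scoring
end_proj = nn.Linear(hidden, hidden)

start_scores = context_states @ start_proj(question_vec)   # one score per token
end_scores = context_states @ end_proj(question_vec)

start = int(start_scores.argmax())
end = int(end_scores[start:].argmax()) + start              # constrain end >= start
print(f"predicted answer span: tokens {start}..{end}")
```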

A Neural Attention Model for Sequence Learning

Attention mechanisms revolutionized sequence modeling by allowing models to focus selectively on different parts of the input when generating each element of the output.

Bahdanau Attention

Bahdanau (or “additive”) attention introduced a mechanism to align and weight input elements; the steps below are implemented in the code sketch that follows the list:

  • Computes a score between each encoder hidden state and the current decoder state
  • Normalizes scores using softmax to obtain attention weights
  • Creates a context vector as weighted sum of encoder states
  • Combines context vector with decoder state for prediction
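
The sketch below implements these four steps as additive attention in PyTorch; the dimensions and random inputs are illustrative.

```python
# Additive (Bahdanau-style) attention over a batch of encoder states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # 1. Score each encoder state against the current decoder state.
        scores = self.v(torch.tanh(self.W_enc(encoder_states)
                                   + self.W_dec(decoder_state).unsqueeze(1)))
        # 2. Normalize scores into attention weights.
        weights = F.softmax(scores.squeeze(-1), dim=-1)            # (batch, src_len)
        # 3. Build the context vector as a weighted sum of encoder states.
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        # 4. The caller combines `context` with the decoder state for prediction.
        return context, weights

attn = AdditiveAttention(enc_dim=128, dec_dim=128)
enc = torch.randn(2, 7, 128)          # 7 encoder states for a batch of 2
dec = torch.randn(2, 128)             # current decoder state
context, weights = attn(dec, enc)
print(context.shape, weights.shape)   # torch.Size([2, 128]) torch.Size([2, 7])
```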

Improving seq2seq by Focusing on Relevant Parts

Attention mechanisms addressed key limitations of vanilla seq2seq:

  • Information bottleneck: Eliminated reliance on fixed-length context vectors
  • Long-range dependencies: Enabled direct connections between related words regardless of distance
  • Interpretability: Provided visualization of what the model focuses on when generating each word
  • Alignment learning: Automatically discovered word/phrase correspondences between languages

Attention mechanisms were the critical innovation that eventually led to the Transformer architecture, which forms the backbone of nearly all foundation models today.

Background in Computer Vision

Computer vision has undergone a parallel evolution to NLP, moving from hand-crafted features to learned representations through deep neural networks.

Image Classification, Detection, Segmentation

Computer vision encompasses several core tasks of increasing complexity:

  • Image Classification: Assigning labels to entire images
  • Object Detection: Localizing and classifying multiple objects in images
  • Semantic Segmentation: Classifying each pixel into a category
  • Instance Segmentation: Distinguishing individual objects within categories
  • Pose Estimation: Identifying the position and orientation of objects or people

Traditional Pipelines: SIFT, HOG, CNNs

Early computer vision relied on hand-engineered feature extraction, demonstrated in the example after this list:

  • SIFT (Scale-Invariant Feature Transform): Detecting and describing local features invariant to scaling and rotation
  • HOG (Histogram of Oriented Gradients): Counting occurrences of gradient orientations in localized portions of images
  • SURF (Speeded-Up Robust Features): Faster approximation of SIFT
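
A short example of extracting these hand-crafted features, assuming OpenCV 4.4+ (where SIFT is included) and scikit-image are installed; the random array stands in for a real grayscale photograph.

```python
# Extract classic hand-crafted features from a (stand-in) grayscale image.
import cv2
import numpy as np
from skimage.feature import hog

image = np.random.randint(0, 256, (128, 64), dtype=np.uint8)

# SIFT: keypoints plus a 128-dimensional descriptor around each keypoint.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# HOG: a single vector of gradient-orientation histograms over the image.
hog_vector = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))

print(len(keypoints), hog_vector.shape)
```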

These were followed by machine learning approaches:

  • Bag of Visual Words: Adapting text classification techniques to images
  • Early CNNs: LeNet and other convolutional architectures showing promise for digit recognition

ImageNet and the Deep Learning Boom

The ImageNet competition catalyzed a revolution in computer vision, demonstrating the power of deep learning at scale.

AlexNet, VGG, ResNet (Brief Overview)

Several breakthrough architectures emerged during the ImageNet era:

  • AlexNet (2012): First CNN to win ImageNet, featuring ReLU activations and dropout
  • VGG (2014): Demonstrated the importance of network depth with small, uniform filters
  • GoogLeNet/Inception (2014): Introduced inception modules with multiple filter sizes
  • ResNet (2015): Used residual (skip) connections to counter vanishing gradients and the degradation problem, enabling training of much deeper networks (see the sketch after this list)
  • DenseNet (2017): Created dense connections between layers for improved information flow
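
The sketch below shows the core ResNet idea in PyTorch: a block computes a residual F(x) and adds it back to its input through a skip connection. Channel counts and input size are illustrative.

```python
# A minimal residual block: output = ReLU(x + F(x)).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)    # skip connection adds the input back

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)                     # torch.Size([1, 64, 32, 32])
```

Because the identity path lets gradients and information flow directly to earlier layers, stacking many such blocks remains trainable.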

Importance of Large-scale Vision Datasets

Large datasets proved critical for vision progress:

  • ImageNet: 14 million images across 20,000+ categories
  • COCO (Common Objects in Context): Rich annotations including object detection, segmentation, and captioning
  • Open Images: Multi-label classification with 9 million images across 6,000+ categories
  • JFT-300M: Google’s internal dataset with 300M+ images used for pretraining

The success of supervised learning on large-scale vision datasets established a pattern that would later be applied to foundation models: scale matters tremendously for both data and model size.

CNN Architecture Intuition

Understanding the intuition behind convolutional neural networks helps explain their effectiveness for visual tasks.

Convolution, Pooling, Filters

CNNs consist of several key components working together; the minimal model after this list combines them:

  • Convolutional layers: Apply learned filters across the input to detect features
  • Activation functions: Introduce non-linearity (typically ReLU)
  • Pooling layers: Reduce spatial dimensions while preserving important features
  • Feature hierarchies: Early layers detect edges and textures; deeper layers detect complex shapes and objects
  • Fully connected layers: Convert spatial features to classification outputs
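
A small PyTorch model showing the conv → ReLU → pool → fully connected pattern described above; the layer sizes are illustrative (3×32×32 inputs, 10 classes).

```python
# Minimal CNN: two conv/pool stages followed by a classification head.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn 16 filters over RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, more filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # map spatial features to 10 classes
)

images = torch.randn(4, 3, 32, 32)                # a batch of 4 toy images
print(model(images).shape)                        # torch.Size([4, 10])
```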

Why CNNs Work Well for Images

CNNs are uniquely suited to image processing due to several properties (one is quantified just after this list):

  • Parameter sharing: Using the same filters across the entire image
  • Local connectivity: Each neuron connects only to a small region of the input
  • Translation equivariance: Detecting the same feature wherever it appears in the image, with pooling adding a degree of positional invariance
  • Hierarchical feature learning: Building complex representations from simple ones
  • Scale and distortion robustness: Handling variations through pooling and depth
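
A back-of-the-envelope comparison of parameter sharing, using illustrative sizes: a 3×3 convolution versus a fully connected layer mapping a 64×64 RGB input to a same-resolution 16-channel output.

```python
# Parameter counts: shared 3x3 filters vs. a dense layer over the same input.
conv_params = 3 * 3 * 3 * 16 + 16     # 3x3 kernels, 3 -> 16 channels, plus biases
dense_params = (64 * 64 * 3) * (64 * 64 * 16) + 64 * 64 * 16  # every pixel to every unit

print(conv_params)      # 448
print(dense_params)     # 805,371,904 -- roughly 1.8 million times more parameters
```

The convolution reuses the same 448 weights at every spatial position, which is exactly the parameter sharing and local connectivity described above.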

Early CV Applications

The success of CNNs quickly led to their application in numerous computer vision tasks.

Object Detection (YOLO, R-CNN)

Object detection systems evolved rapidly; a short example of running a pretrained detector follows the list:

  • R-CNN family: Region-based CNN approaches (R-CNN, Fast R-CNN, Faster R-CNN)
  • YOLO (You Only Look Once): Real-time detection achieved by framing detection as a single regression from the image to bounding boxes and class probabilities
  • SSD (Single Shot Detector): Multi-scale detection with predefined anchor boxes
  • RetinaNet: Addressing class imbalance with focal loss
  • Mask R-CNN: Extending Faster R-CNN for instance segmentation
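
Running one of these detectors is straightforward with torchvision (version 0.13+ assumed for the weights argument; pretrained weights are downloaded on first use, and the random tensor stands in for a real image):

```python
# Run a pretrained Faster R-CNN detector on a stand-in image.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)              # RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])             # one dict per input image

print(predictions[0]["boxes"].shape)         # (num_detections, 4) boxes in xyxy format
print(predictions[0]["labels"], predictions[0]["scores"])
```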

Image Captioning (Tie-in to NLP)

Image captioning represented an early bridge between CV and NLP:

  • CNN-LSTM architectures: Using CNNs to encode images and LSTMs to generate captions
  • Attention mechanisms: Focusing on relevant image regions when generating each word
  • Semantic alignment: Learning correspondences between visual features and textual descriptions
  • Multimodal embeddings: Creating joint representations of images and text

Early multimodal models like image captioning systems were precursors to foundation models, demonstrating how representations could transfer between domains.
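
A skeletal CNN-LSTM captioner in PyTorch, with illustrative vocabulary size and dimensions; in practice the CNN encoder would load pretrained weights and decoding would run token by token.

```python
# CNN encoder + LSTM decoder: the image acts as the first "word" of the caption.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed=256, hidden=512):
        super().__init__()
        cnn = resnet18(weights=None)                                # pretrained in practice
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.img_proj = nn.Linear(512, embed)
        self.word_emb = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)          # (batch, 512) image features
        img_token = self.img_proj(feats).unsqueeze(1)    # image as the first input token
        inputs = torch.cat([img_token, self.word_emb(captions)], dim=1)
        states, _ = self.lstm(inputs)
        return self.out(states)                          # next-token logits per position

model = CaptionModel(vocab_size=5000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 6)))
print(logits.shape)                                      # torch.Size([2, 7, 5000])
```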

Class Discussion / Case Study

Compare Classical ML vs Deep Learning NLP

Discussion points for comparing traditional and deep learning approaches:

  • Feature engineering vs representation learning: Shift from manual feature design to learned representations
  • Task-specific vs general models: Evolution from specialized systems to multipurpose architectures
  • Data requirements: Increase in data needed for modern approaches
  • Interpretability tradeoffs: Classical models often more transparent but less powerful
  • Domain adaptation capabilities: Deep learning’s advantages for transfer learning

Discuss Evolution Toward Foundation Models

Key trends in the development of foundation models:

  • Architecture convergence: Transformer becoming dominant across domains
  • Scale as a strategy: Increasing model size, data, and compute
  • Self-supervised learning: Moving beyond labeled data
  • Cross-modal capabilities: Integrating text, images, audio in unified frameworks
  • Few-shot learning: Adapting to new tasks with minimal examples
  • Emergent abilities: Capabilities not present in smaller models appearing at scale

Key Takeaways

  • NLP and CV evolved from rule-based systems to statistical methods to deep learning
  • Sequence-to-sequence learning and attention mechanisms were critical innovations for NLP
  • Large datasets like ImageNet and COCO catalyzed progress in computer vision
  • CNNs revolutionized computer vision through hierarchical feature learning
  • Early multimodal systems like image captioning bridged the gap between vision and language
  • The convergence of architectures (especially Transformers) across modalities set the stage for foundation models