NLP and Computer Vision
The evolution of Natural Language Processing and Computer Vision: From traditional approaches to neural networks that paved the way for foundation models.
Background in NLP
Natural Language Processing evolved dramatically over decades, transforming from highly specialized rule-based systems to today’s versatile foundation models. Understanding this evolution provides crucial context for appreciating current capabilities and limitations.
Overview of Classical NLP Pipelines
Traditional NLP systems operated as sequential pipelines with discrete processing stages. Each component handled a specific linguistic analysis task (a minimal code sketch follows the list):
- Tokenization: Breaking text into words, subwords, or characters
- Part-of-Speech (POS) Tagging: Labeling words as nouns, verbs, adjectives, etc.
- Named Entity Recognition: Identifying proper nouns and categorizing them (people, organizations, locations)
- Syntactic Parsing: Determining grammatical structure (constituency or dependency parsing)
- Coreference Resolution: Identifying when different expressions refer to the same entity
- Semantic Analysis: Extracting meanings and relationships between words and phrases
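To make these stages concrete, here is a minimal sketch that runs tokenization, POS tagging, dependency parsing, and named entity recognition with spaCy; it assumes the `en_core_web_sm` model is installed, and the example sentence is purely illustrative.

```python
# Minimal sketch of a classical-style NLP pipeline using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama visited Paris after leaving the White House.")

# Tokenization, part-of-speech tags, and dependency relations
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)

# Coreference resolution and deeper semantic analysis typically required
# separate, specialized components in classical pipelines.
```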
Rule-based vs Statistical Methods
Early NLP development followed two distinct paradigms:
Rule-based approaches relied on linguistic experts to manually craft rules for language processing. These systems achieved high precision within specific domains but were brittle when confronted with new linguistic variation or unfamiliar domains.
Statistical methods emerged in the 1990s, utilizing probability and machine learning to learn patterns from data. Key developments included:
- Hidden Markov Models for POS tagging and sequence labeling
- Statistical parsers based on Probabilistic Context-Free Grammars
- Named entity recognition systems using conditional random fields
- Statistical machine translation leveraging parallel corpora
While rule-based systems offered interpretability and precision, statistical methods provided adaptability and generalization capabilities that would become essential for scaling NLP applications.
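As a concrete illustration of the statistical paradigm, the sketch below implements Viterbi decoding for a toy two-state HMM POS tagger; the probabilities are hand-set for illustration, whereas real systems estimated them from annotated corpora.

```python
# Minimal Viterbi decoding for a toy HMM POS tagger.
# The probabilities here are illustrative, not estimated from a corpus.
import math

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.4, "bark": 0.1},
          "VERB": {"dogs": 0.05, "bark": 0.5}}

def viterbi(words):
    # V[t][s] = best log-probability of any tag sequence ending in state s at position t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(words[0], 1e-8)) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(words[t], 1e-8)))
            back[t][s] = best_prev
    # Backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

print(viterbi(["dogs", "bark"]))  # -> ['NOUN', 'VERB']
```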
Key Resources for Classical NLP
- Survey paper: A Survey of Named Entity Recognition and Classification by Nadeau & Sekine - Comprehensive overview of classical NER approaches
- Book: Speech and Language Processing by Jurafsky & Martin - Foundational textbook covering traditional NLP components
- Course: Stanford CS224N: Natural Language Processing with Deep Learning - Includes historical context of pre-deep learning NLP
- Blog post: A Visual Guide to Understanding Sequence Models by Jay Alammar
Sequence to Sequence Learning
The sequence-to-sequence (seq2seq) framework represented a revolutionary paradigm shift for NLP, particularly for tasks involving transformation between sequences like translation, summarization, and question answering.
Encoder-decoder Architecture
The encoder-decoder architecture introduced an elegant approach to sequence transformation:
- An encoder processes the input sequence to create a fixed-length representation (context vector)
- A decoder generates the output sequence conditioned on this representation
- Initially implemented with LSTMs/GRUs, later versions incorporated attention mechanisms
This design provided a unified approach to previously distinct NLP tasks, establishing a foundation for later transformer-based models.
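A minimal PyTorch sketch of the encoder-decoder idea, assuming GRU units and omitting attention, beam search, and training details; layer sizes and the toy batch are illustrative.

```python
# Minimal GRU encoder-decoder sketch (no attention), for illustration only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, hidden = self.gru(self.embed(src))    # hidden: (1, batch, hidden)
        return hidden                            # the fixed-length "context vector"

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):              # tgt: (batch, tgt_len)
        output, hidden = self.gru(self.embed(tgt), hidden)
        return self.out(output), hidden          # logits over the target vocabulary

# The decoder is conditioned on the encoder's final hidden state.
enc, dec = Encoder(1000, 256), Decoder(1000, 256)
src = torch.randint(0, 1000, (2, 7))             # toy batch of source token ids
tgt = torch.randint(0, 1000, (2, 5))             # toy batch of target token ids
logits, _ = dec(tgt, enc(src))
print(logits.shape)                              # torch.Size([2, 5, 1000])
```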
Applications: Translation, Summarization
Seq2seq models achieved breakthroughs in key NLP applications:
- Machine Translation: Neural Machine Translation (NMT) systems outperformed traditional statistical methods
- Text Summarization: Encoder-decoder architectures enabled both extractive and abstractive summarization
- Dialogue Systems: Enabled more coherent multi-turn conversations
- Speech Recognition: Combined with acoustic modeling for improved transcription
Essential Seq2Seq Resources
- Paper: Sequence to Sequence Learning with Neural Networks by Sutskever et al. - Seminal work establishing the seq2seq paradigm
- Paper: A Neural Attention Model for Abstractive Sentence Summarization by Rush et al. - Early application of neural methods to summarization
- Tutorial: Neural Machine Translation with Attention from TensorFlow - Practical implementation guide
- Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin by Baidu Research - Application of deep learning to speech recognition
Thumbs Up? Sentiment Classification Using ML
Sentiment analysis represents one of the most commercially valuable and widely deployed NLP applications, evolving from basic polarity detection to nuanced emotion recognition.
Traditional ML Models
Early sentiment classification relied on classical machine learning approaches:
- Naive Bayes: Probabilistic classifiers applying Bayes’ theorem under strong (naive) feature-independence assumptions
- Support Vector Machines (SVMs): Finding optimal hyperplanes to separate sentiment classes
- Logistic Regression: Predicting probability of sentiment categories
Feature Engineering
The performance of traditional models depended heavily on feature engineering (see the scikit-learn sketch after this list):
- Bag of Words (BoW): Simple word presence/absence or frequency-based representations
- TF-IDF: Term Frequency-Inverse Document Frequency weighting to prioritize informative words
- N-grams: Capturing short phrases rather than individual words
- Lexicon-based features: Using sentiment dictionaries to assign polarity scores
- Syntactic features: Incorporating grammatical relationships between words
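As referenced above, here is a minimal scikit-learn sketch combining TF-IDF and n-gram features with two of the classic classifiers on a toy corpus; real experiments would use a benchmark such as the IMDB reviews with proper train/test splits.

```python
# Minimal sketch: TF-IDF + unigram/bigram features with classic classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["a wonderful, heartfelt film", "utterly boring and predictable",
         "great acting and a moving story", "a waste of two hours"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (toy data)

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["moving and wonderful", "boring waste"]))
```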
Sentiment Analysis Resources
- Paper: Thumbs up? Sentiment Classification using Machine Learning Techniques by Pang et al. - Foundational work in ML-based sentiment analysis
- Dataset: IMDB Movie Reviews - Widely used benchmark for sentiment classification
- Blog post: Machine Learning for Text Classification - Tutorial on traditional ML approaches
- Article: Twitter Sentiment Analysis for Beginners - Practical application of classification techniques
Teaching Machines to Read and Comprehend
Machine reading comprehension (MRC) focuses on systems that can understand text passages and answer questions about them, representing a crucial step toward language understanding.
Reading Comprehension Datasets
Progress in MRC has been driven by increasingly challenging datasets (a short data-loading sketch follows the list):
- SQuAD (Stanford Question Answering Dataset): Questions on Wikipedia articles where each answer is a span of text from the passage
- CNN/Daily Mail: News articles paired with bullet-point summaries converted to cloze-style questions
- RACE: Reading comprehension questions drawn from English exams for Chinese middle- and high-school students
- NarrativeQA: Questions requiring understanding of entire books or movie scripts
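For a feel of the data, SQuAD can be loaded with the Hugging Face `datasets` library; this sketch assumes the library is installed and will download the dataset on first use.

```python
# Load SQuAD v1.1 with Hugging Face datasets (assumes: pip install datasets).
from datasets import load_dataset

squad = load_dataset("squad")
example = squad["train"][0]
print(example["question"])
print(example["context"][:200])
print(example["answers"])   # answer text plus character start offsets in the context
```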
Early Models
Initial approaches to reading comprehension utilized recurrent architectures (a minimal attention sketch follows the list):
- LSTM/GRU-based models: Capturing sequential dependencies in text
- Attention mechanisms: Helping models focus on relevant parts of input when generating answers
- Memory networks: Explicitly storing information for later retrieval when answering questions
- BiDAF (Bi-Directional Attention Flow): Attending from context to question and from question to context to link the two
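The core mechanic behind these readers can be sketched as simple dot-product attention from a pooled question vector over context token vectors; actual models such as BiDAF use richer, bidirectional attention and span-prediction layers.

```python
# Conceptual sketch: attend from a question vector over context token vectors.
import torch
import torch.nn.functional as F

hidden = 64
context = torch.randn(20, hidden)      # 20 context token vectors (e.g., from a BiLSTM)
question = torch.randn(hidden)         # pooled question representation

scores = context @ question            # one relevance score per context token
weights = F.softmax(scores, dim=0)     # attention distribution over the context
summary = weights @ context            # question-aware summary of the passage

# A span-extraction model would instead predict start/end positions,
# typically with separate scoring layers over the attended context.
print(weights.argmax().item(), summary.shape)
```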
Reading Comprehension Resources
- Paper: Teaching Machines to Read and Comprehend by Hermann et al. - Pioneering work on neural reading comprehension
- Paper: SQuAD: 100,000+ Questions for Machine Comprehension of Text by Rajpurkar et al. - Introducing the influential SQuAD dataset
- Blog post: Attention? Attention! by Lilian Weng - Comprehensive guide to attention mechanisms
- Tutorial: Question Answering from Stanford CS224N - Technical overview of QA approaches
A Neural Attention Model for Sequence Learning
Attention mechanisms revolutionized sequence modeling by allowing models to focus selectively on different parts of the input when generating each element of the output.
Bahdanau Attention
Bahdanau (or “additive”) attention introduced a mechanism to align and weight input elements (sketched in code after this list):
- Computes an alignment score between each encoder hidden state and the decoder’s previous hidden state
- Normalizes scores using softmax to obtain attention weights
- Creates a context vector as weighted sum of encoder states
- Combines context vector with decoder state for prediction
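A minimal PyTorch sketch of additive attention following the steps above; parameter names and tensor shapes are illustrative rather than taken from any particular implementation.

```python
# Additive (Bahdanau-style) attention: score, softmax, weighted sum.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_dec = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_enc = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
        scores = self.v(torch.tanh(
            self.W_dec(decoder_state).unsqueeze(1) + self.W_enc(encoder_states)
        )).squeeze(-1)                                   # (batch, src_len)
        weights = F.softmax(scores, dim=-1)              # attention weights
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        return context, weights                          # (batch, hidden), (batch, src_len)

attn = AdditiveAttention(128)
ctx, w = attn(torch.randn(2, 128), torch.randn(2, 10, 128))
print(ctx.shape, w.shape)   # torch.Size([2, 128]) torch.Size([2, 10])
```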
Improves seq2seq by Focusing on Relevant Parts
Attention mechanisms addressed key limitations of vanilla seq2seq:
- Information bottleneck: Eliminated reliance on a single fixed-length context vector
- Long-range dependencies: Enabled direct connections between related words regardless of distance
- Interpretability: Provided visualization of what the model focuses on when generating each word
- Alignment learning: Automatically discovered word/phrase correspondences between languages
Attention mechanisms were the critical innovation that eventually led to the Transformer architecture, which forms the backbone of nearly all foundation models today.
Attention Mechanism Resources
- Paper: Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al. - Introduced attention mechanisms for NMT
- Blog post: Visualizing Neural Machine Translation by Jay Alammar - Excellent visual explanation of attention
- Tutorial: Attention and Augmented Recurrent Neural Networks from Distill - Interactive visualizations of attention
- Video: Attention Mechanisms in Neural Networks from DeepLearning.AI - Conceptual explanation by Andrew Ng
Background in Computer Vision
Computer vision has undergone a parallel evolution to NLP, moving from hand-crafted features to learned representations through deep neural networks.
Image Classification, Detection, Segmentation
Computer vision encompasses several core tasks of increasing complexity:
- Image Classification: Assigning labels to entire images
- Object Detection: Localizing and classifying multiple objects in images
- Semantic Segmentation: Classifying each pixel into a category
- Instance Segmentation: Distinguishing individual objects within categories
- Pose Estimation: Identifying the position and orientation of objects or people
Traditional Pipelines: SIFT, HOG, CNNs
Early computer vision relied on hand-engineered feature extraction (see the sketch after these lists):
- SIFT (Scale-Invariant Feature Transform): Detecting and describing local features invariant to scaling and rotation
- HOG (Histogram of Oriented Gradients): Counting occurrences of gradient orientations in localized portions of images
- SURF (Speeded-Up Robust Features): Faster approximation of SIFT
These were followed by machine learning approaches:
- Bag of Visual Words: Adapting the bag-of-words idea from text classification to quantized local image features
- Early CNNs: LeNet and other convolutional architectures showing promise for digit recognition
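As an illustration of hand-crafted features, the sketch below extracts SIFT keypoints and a HOG descriptor with OpenCV; it assumes a recent OpenCV 4.x build (where SIFT ships in the main package) and an input image at `image.jpg`.

```python
# Hand-crafted feature extraction with OpenCV: SIFT keypoints and a HOG descriptor.
# Assumes: pip install opencv-python (SIFT is in the main package since 4.4).
import cv2

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT: scale/rotation-invariant keypoints with 128-dimensional descriptors
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(len(keypoints), descriptors.shape)        # N keypoints, (N, 128)

# HOG: histogram of oriented gradients over a fixed-size window (64x128 by default)
hog = cv2.HOGDescriptor()
window = cv2.resize(gray, (64, 128))
features = hog.compute(window)
print(features.shape)                           # flattened HOG feature vector
```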
Classical Computer Vision Resources
- Paper: Distinctive Image Features from Scale-Invariant Keypoints by David Lowe - Introducing SIFT
- Paper: Histograms of Oriented Gradients for Human Detection by Dalal & Triggs - HOG feature descriptor
- Book: Computer Vision: Algorithms and Applications by Richard Szeliski - Comprehensive reference
- Course: Stanford CS231n: Convolutional Neural Networks for Visual Recognition - Historical context and modern methods
ImageNet and the Deep Learning Boom
The ImageNet competition catalyzed a revolution in computer vision, demonstrating the power of deep learning at scale.
AlexNet, VGG, ResNet (Brief Overview)
Several breakthrough architectures emerged during the ImageNet era (a residual-block sketch follows the list):
- AlexNet (2012): First CNN to win the ImageNet challenge (ILSVRC 2012), featuring ReLU activations, dropout, and GPU training
- VGG (2014): Demonstrated the importance of network depth with small, uniform filters
- GoogLeNet/Inception (2014): Introduced inception modules with multiple filter sizes
- ResNet (2015): Mitigated the degradation and vanishing-gradient problems in very deep networks with residual (skip) connections, enabling training of networks with over a hundred layers
- DenseNet (2017): Created dense connections between layers for improved information flow
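The central ResNet idea, adding a block's input back to its output, is easiest to see in code. The sketch below is a simplified basic block, not the exact torchvision implementation.

```python
# Simplified ResNet "basic block": the skip connection lets gradients
# flow directly through the identity path, easing training of deep networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)     # residual connection: output = F(x) + x

block = BasicBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```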
Importance of Large-scale Vision Datasets
Large datasets proved critical for vision progress:
- ImageNet: 14 million images across 20,000+ categories (the ILSVRC challenge subset used roughly 1.28 million images across 1,000 classes)
- COCO (Common Objects in Context): Rich annotations including object detection, segmentation, and captioning
- Open Images: Multi-label classification with 9 million images across 6,000+ categories
- JFT-300M: Google’s internal dataset with 300M+ images used for pretraining
The success of supervised learning on large-scale vision datasets established a pattern that would later be applied to foundation models: scale matters tremendously for both data and model size.
ImageNet Revolution Resources
- Paper: ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al. - The AlexNet paper that started the revolution
- Paper: Deep Residual Learning for Image Recognition by He et al. - Introducing ResNet architecture
- Paper: Microsoft COCO: Common Objects in Context by Lin et al. - Detailed explanation of the COCO dataset
- Blog post: CNN Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more - Overview of CNN evolution
CNN Architecture Intuition
Understanding the intuition behind convolutional neural networks helps explain their effectiveness for visual tasks.
Convolution, Pooling, Filters
CNNs consist of several key components working together (a minimal network sketch follows the list):
- Convolutional layers: Apply learned filters across the input to detect features
- Activation functions: Introduce non-linearity (typically ReLU)
- Pooling layers: Reduce spatial dimensions while preserving important features
- Feature hierarchies: Early layers detect edges and textures; deeper layers detect complex shapes and objects
- Fully connected layers: Convert spatial features to classification outputs
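As noted above, a minimal CNN wiring these components together might look like the following; it assumes 32x32 RGB inputs (CIFAR-10-sized images) and illustrative layer sizes.

```python
# Tiny CNN illustrating convolution, non-linearity, pooling, and a classifier head.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn 16 filters over the RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # deeper layer: more complex features
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                     # map spatial features to 10 classes
)
print(model(torch.randn(1, 3, 32, 32)).shape)      # torch.Size([1, 10])
```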
Why CNNs Work Well for Images
CNNs are uniquely suited to image processing due to:
- Parameter sharing: Using the same filters across the entire image
- Local connectivity: Each neuron connects only to a small region of the input
- Translation equivariance and invariance: Convolution detects features wherever they appear; pooling adds tolerance to small shifts
- Hierarchical feature learning: Building complex representations from simple ones
- Scale and distortion robustness: Handling variations through pooling and depth
CNN Architecture Resources
- Interactive demo: CNN Explainer - Browser-based visualization of CNN operations
- Tutorial: CS231n Convolutional Neural Networks - Detailed explanation of CNN principles
- Video: How Convolutional Neural Networks Work by Brandon Rohrer - Visual explanation
- Paper: Visualizing and Understanding Convolutional Networks by Zeiler & Fergus - Techniques for interpreting CNN features
Early CV Applications
The success of CNNs quickly led to their application in numerous computer vision tasks.
Object Detection (YOLO, R-CNN)
Object detection systems evolved rapidly (an inference sketch follows the list):
- R-CNN family: Region-based CNN approaches (R-CNN, Fast R-CNN, Faster R-CNN)
- YOLO (You Only Look Once): Single-stage detector framing detection as a single regression problem, fast enough for real-time use
- SSD (Single Shot Detector): Multi-scale detection with predefined anchor boxes
- RetinaNet: Addressing class imbalance with focal loss
- Mask R-CNN: Extending Faster R-CNN for instance segmentation
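To show how such detectors are typically consumed, the sketch below runs a pretrained Faster R-CNN from torchvision; the `weights="DEFAULT"` argument assumes torchvision 0.13 or newer, and the random tensor stands in for a real image.

```python
# Run a pretrained Faster R-CNN detector from torchvision (recent versions).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)                 # placeholder for an RGB image scaled to [0, 1]

with torch.no_grad():
    predictions = model([image])[0]             # dict of boxes, labels, and confidence scores

for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score > 0.8:                             # keep only confident detections
        print(label.item(), score.item(), box.tolist())
```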
Image Captioning (Tie-in to NLP)
Image captioning represented an early bridge between CV and NLP:
- CNN-LSTM architectures: Using CNNs to encode images and LSTMs to generate captions
- Attention mechanisms: Focusing on relevant image regions when generating each word
- Semantic alignment: Learning correspondences between visual features and textual descriptions
- Multimodal embeddings: Creating joint representations of images and text
Early multimodal models like image captioning systems were precursors to foundation models, demonstrating how representations could transfer between domains.
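A minimal sketch of the CNN-LSTM captioning pattern: a CNN encodes the image into a vector that conditions an LSTM decoder over caption tokens. It assumes a recent torchvision; the vocabulary size, dimensions, and use of an untrained ResNet are illustrative.

```python
# Minimal CNN encoder + LSTM decoder captioning sketch (no attention, no beam search).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.cnn = resnet18(weights=None)                  # use pretrained weights in practice
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, embed_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        img_feat = self.cnn(images).unsqueeze(1)           # (batch, 1, embed)
        tokens = self.embed(captions)                      # (batch, len, embed)
        inputs = torch.cat([img_feat, tokens], dim=1)      # image acts as the first "token"
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                            # word logits at each step

model = CaptioningModel(vocab_size=5000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)                                        # torch.Size([2, 13, 5000])
```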
CV Applications Resources
- Paper: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation by Girshick et al. - R-CNN approach
- Paper: You Only Look Once: Unified, Real-Time Object Detection by Redmon et al. - Introducing YOLO
- Paper: Show and Tell: A Neural Image Caption Generator by Vinyals et al. - Early image captioning system
- Paper: Deep Visual-Semantic Alignments for Generating Image Descriptions by Karpathy & Fei-Fei - Aligning image regions with text
- Paper: Fully Convolutional Networks for Semantic Segmentation by Long et al. - Pioneering semantic segmentation approach
- Paper: DeepFace: Closing the Gap to Human-Level Performance in Face Verification by Taigman et al. - Early deep learning for face recognition
- Paper: DeepPose: Human Pose Estimation via Deep Neural Networks by Toshev & Szegedy - CNN-based approach to pose estimation
- Paper: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks by Radford et al. - Introducing DCGANs
Class Discussion / Case Study
Compare Classical ML vs Deep Learning NLP
Discussion points for comparing traditional and deep learning approaches:
- Feature engineering vs representation learning: Shift from manual feature design to learned representations
- Task-specific vs general models: Evolution from specialized systems to multipurpose architectures
- Data requirements: Deep learning approaches typically demand far larger training datasets than classical methods
- Interpretability tradeoffs: Classical models often more transparent but less powerful
- Domain adaptation capabilities: Deep learning’s advantages for transfer learning
Discuss Evolution Toward Foundation Models
Key trends in the development of foundation models:
- Architecture convergence: Transformer becoming dominant across domains
- Scale as a strategy: Increasing model size, data, and compute
- Self-supervised learning: Moving beyond labeled data
- Cross-modal capabilities: Integrating text, images, audio in unified frameworks
- Few-shot learning: Adapting to new tasks with minimal examples
- Emergent abilities: Capabilities not present in smaller models appearing at scale
Key Takeaways
- NLP and CV evolved from rule-based systems to statistical methods to deep learning
- Sequence-to-sequence learning and attention mechanisms were critical innovations for NLP
- Large datasets like ImageNet and COCO catalyzed progress in computer vision
- CNNs revolutionized computer vision through hierarchical feature learning
- Early multimodal systems like image captioning bridged the gap between vision and language
- The convergence of architectures (especially Transformers) across modalities set the stage for foundation models