The Grand AI Handbook

NLP and Computer Vision

The evolution of Natural Language Processing and Computer Vision: From traditional approaches to neural networks that paved the way for foundation models.

This section explores the pivotal developments in Natural Language Processing (NLP) and Computer Vision (CV) that set the stage for foundation models. We'll trace the evolution from traditional rule-based systems to statistical methods, and finally to the deep learning architectures that enabled large-scale models with transfer learning capabilities across domains.

Background in NLP

Natural Language Processing evolved dramatically over decades, transforming from highly specialized rule-based systems to today’s versatile foundation models. Understanding this evolution provides crucial context for appreciating current capabilities and limitations.

Overview of Classical NLP Pipelines

Traditional NLP systems operated as sequential pipelines with discrete processing stages. Each component handled a specific linguistic analysis task, as illustrated in the short sketch after this list:

  • Tokenization: Breaking text into words, subwords, or characters
  • Part-of-Speech (POS) Tagging: Labeling words as nouns, verbs, adjectives, etc.
  • Named Entity Recognition: Identifying proper nouns and categorizing them (people, organizations, locations)
  • Syntactic Parsing: Determining grammatical structure (constituency or dependency parsing)
  • Coreference Resolution: Identifying when different expressions refer to the same entity
  • Semantic Analysis: Extracting meanings and relationships between words and phrases
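
To make the first few stages concrete, here is a minimal pipeline sketch assuming NLTK is installed and its tokenizer, tagger, and chunker resources have been downloaded; the example sentence is purely illustrative.

```python
# Classical pipeline stages with NLTK (resources such as punkt and the
# averaged perceptron tagger must be fetched once via nltk.download).
import nltk

text = "Marie Curie won the Nobel Prize in Paris."

tokens = nltk.word_tokenize(text)    # tokenization
tagged = nltk.pos_tag(tokens)        # part-of-speech tagging
tree = nltk.ne_chunk(tagged)         # named entity recognition

print(tagged)    # e.g. [('Marie', 'NNP'), ('Curie', 'NNP'), ('won', 'VBD'), ...]
print(tree)      # typically groups "Marie Curie" (PERSON) and "Paris" (GPE) into chunks
```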

Rule-based vs Statistical Methods

Early NLP development followed two distinct paradigms:

Rule-based approaches relied on linguistic experts to manually craft rules for language processing. These systems achieved high precision for specific domains but lacked robustness when facing new variations or domains.

Statistical methods emerged in the 1990s, utilizing probability and machine learning to learn patterns from data. Key developments included:

  • Hidden Markov Models for POS tagging and sequence labeling
  • Statistical parsers based on Probabilistic Context-Free Grammars
  • Named entity recognition systems using conditional random fields
  • Statistical machine translation leveraging parallel corpora

While rule-based systems offered interpretability and precision, statistical methods provided adaptability and generalization capabilities that would become essential for scaling NLP applications.
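
As a concrete example of the statistical paradigm, below is a toy Viterbi decoder for HMM-based POS tagging; the tag set, vocabulary, and probabilities are invented for illustration rather than estimated from a corpus.

```python
# Toy Viterbi decoding for an HMM POS tagger (probabilities are made up).
import numpy as np

tags = ["DET", "NOUN", "VERB"]
start_p = np.array([0.6, 0.3, 0.1])                    # P(tag at position 0)
trans_p = np.array([[0.1, 0.8, 0.1],                    # P(next tag | DET)
                    [0.2, 0.3, 0.5],                    # P(next tag | NOUN)
                    [0.4, 0.5, 0.1]])                   # P(next tag | VERB)
emit_p = {"the":   np.array([0.9, 0.05, 0.05]),         # P(word | tag)
          "dog":   np.array([0.05, 0.9, 0.05]),
          "barks": np.array([0.05, 0.15, 0.8])}

def viterbi(words):
    """Return the most likely tag sequence under the toy HMM."""
    n, k = len(words), len(tags)
    score = np.zeros((n, k))            # best path probability ending in tag j at step i
    back = np.zeros((n, k), dtype=int)  # backpointers to recover that path
    score[0] = start_p * emit_p[words[0]]
    for i in range(1, n):
        for j in range(k):
            cand = score[i - 1] * trans_p[:, j] * emit_p[words[i]][j]
            back[i, j] = cand.argmax()
            score[i, j] = cand.max()
    path = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [tags[j] for j in reversed(path)]

print(viterbi(["the", "dog", "barks"]))   # expected: ['DET', 'NOUN', 'VERB']
```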

Sequence to Sequence Learning

The sequence-to-sequence (seq2seq) framework represented a revolutionary paradigm shift for NLP, particularly for tasks involving transformation between sequences like translation, summarization, and question answering.

Encoder-decoder Architecture

The encoder-decoder architecture introduced an elegant approach to sequence transformation:

  • An encoder processes the input sequence to create a fixed-length representation (context vector)
  • A decoder generates the output sequence conditioned on this representation
  • Initially implemented with LSTMs/GRUs, later versions incorporated attention mechanisms

This design provided a unified approach to previously distinct NLP tasks, establishing a foundation for later transformer-based models.
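
A minimal sketch of this pattern, here using GRUs in PyTorch with illustrative dimensions and teacher forcing, might look as follows:

```python
# Minimal GRU encoder-decoder in the spirit of early seq2seq models.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source into a single hidden state (the "context vector").
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that context, using teacher forcing on tgt_ids.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)             # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))            # batch of 2 source sentences
tgt = torch.randint(0, 1000, (2, 5))            # shifted target tokens
print(model(src, tgt).shape)                    # torch.Size([2, 5, 1000])
```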

Applications: Translation, Summarization

Seq2seq models achieved breakthroughs in key NLP applications:

  • Machine Translation: Neural Machine Translation (NMT) systems outperformed traditional statistical methods
  • Text Summarization: Encoder-decoder architectures enabled both extractive and abstractive summarization
  • Dialogue Systems: Enabled more coherent multi-turn conversations
  • Speech Recognition: Combined with acoustic modeling for improved transcription

Thumbs Up? Sentiment Classification Using ML

Sentiment analysis represents one of the most commercially valuable and widely deployed NLP applications, evolving from basic polarity detection to nuanced emotion recognition.

Traditional ML Models

Early sentiment classification relied on classical machine learning approaches; a minimal example follows the list:

  • Naive Bayes: Probabilistic classifiers applying Bayes’ theorem with a naive independence assumption between features
  • Support Vector Machines (SVMs): Finding optimal hyperplanes to separate sentiment classes
  • Logistic Regression: Predicting probability of sentiment categories
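
A compact example of this style of classifier, assuming scikit-learn is available and using a tiny invented training set:

```python
# Bag-of-words features feeding a Naive Bayes sentiment classifier;
# swapping in LinearSVC or LogisticRegression only changes the final step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a wonderful performance", "boring and way too long"]
train_labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["loved the acting", "what a boring movie"]))
```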

Feature Engineering

The performance of traditional models depended heavily on feature engineering, illustrated in the snippet after this list:

  • Bag of Words (BoW): Simple word presence/absence or frequency-based representations
  • TF-IDF: Term Frequency-Inverse Document Frequency weighting to prioritize informative words
  • N-grams: Capturing short phrases rather than individual words
  • Lexicon-based features: Using sentiment dictionaries to assign polarity scores
  • Syntactic features: Incorporating grammatical relationships between words
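
The snippet below, again assuming scikit-learn, shows TF-IDF weighted unigram and bigram features on two invented sentences:

```python
# Inspect TF-IDF weighted unigram and bigram features for two toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the film was not good", "the film was very good"]

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vec.fit_transform(docs)                  # sparse (n_docs, n_features) matrix

print(vec.get_feature_names_out())           # get_feature_names in older versions
print(X.shape)
```

Bigrams such as “not good” versus “very good” capture simple negation patterns that individual words miss.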

Teaching Machines to Read and Comprehend

Machine reading comprehension (MRC) focuses on systems that can understand text passages and answer questions about them, representing a crucial step toward language understanding.

Reading Comprehension Datasets

Progress in MRC has been driven by increasingly challenging datasets:

  • SQuAD (Stanford Question Answering Dataset): Questions on Wikipedia articles where answers are text spans
  • CNN/Daily Mail: News articles paired with bullet-point summaries converted to cloze-style questions
  • RACE: Reading comprehension questions from English exams for Chinese students
  • NarrativeQA: Questions requiring understanding of entire books or movie scripts

Early Models

Initial approaches to reading comprehension utilized recurrent architectures; a simplified span-scoring sketch follows the list:

  • LSTM/GRU-based models: Capturing sequential dependencies in text
  • Attention mechanisms: Helping models focus on relevant parts of input when generating answers
  • Memory networks: Explicitly storing information for later retrieval when answering questions
  • BiDAF (Bi-Directional Attention Flow): Connecting question and context bidirectionally
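
To illustrate the extractive setting, the sketch below scores each context token as a candidate answer start or end against a pooled question vector; this is a stripped-down illustration with random inputs, not the actual BiDAF formulation.

```python
# Simplified extractive QA: pick an answer span by scoring context tokens
# against a question representation (all tensors here are random stand-ins).
import torch
import torch.nn as nn

hidden = 64
context_states = torch.randn(20, hidden)      # e.g. BiLSTM outputs per context token
question_vec = torch.randn(hidden)            # pooled question representation

start_proj = nn.Linear(hidden, hidden)        # separate views for start vs end scoring
end_proj = nn.Linear(hidden, hidden)

start_scores = context_states @ start_proj(question_vec)   # one score per token
end_scores = context_states @ end_proj(question_vec)

start = int(start_scores.argmax())
end = int(end_scores[start:].argmax()) + start              # constrain end >= start
print(f"predicted answer span: tokens {start}..{end}")
```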

A Neural Attention Model for Sequence Learning

Attention mechanisms revolutionized sequence modeling by allowing models to focus selectively on different parts of the input when generating each element of the output.

Bahdanau Attention

Bahdanau (or “additive”) attention introduced a mechanism to align and weight input elements; the steps below are implemented in the code sketch that follows the list:

  • Computes a score between each encoder hidden state and the current decoder state
  • Normalizes scores using softmax to obtain attention weights
  • Creates a context vector as weighted sum of encoder states
  • Combines context vector with decoder state for prediction
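
The sketch below implements these four steps as additive attention in PyTorch; the dimensions and random inputs are illustrative.

```python
# Additive (Bahdanau-style) attention over a batch of encoder states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # 1. Score each encoder state against the current decoder state.
        scores = self.v(torch.tanh(self.W_enc(encoder_states)
                                   + self.W_dec(decoder_state).unsqueeze(1)))
        # 2. Normalize scores into attention weights.
        weights = F.softmax(scores.squeeze(-1), dim=-1)            # (batch, src_len)
        # 3. Build the context vector as a weighted sum of encoder states.
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        # 4. The caller combines `context` with the decoder state for prediction.
        return context, weights

attn = AdditiveAttention(enc_dim=128, dec_dim=128)
enc = torch.randn(2, 7, 128)          # 7 encoder states for a batch of 2
dec = torch.randn(2, 128)             # current decoder state
context, weights = attn(dec, enc)
print(context.shape, weights.shape)   # torch.Size([2, 128]) torch.Size([2, 7])
```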

Improving seq2seq by Focusing on Relevant Parts

Attention mechanisms addressed key limitations of vanilla seq2seq:

  • Information bottleneck: Eliminated reliance on fixed-length context vectors
  • Long-range dependencies: Enabled direct connections between related words regardless of distance
  • Interpretability: Provided visualization of what the model focuses on when generating each word
  • Alignment learning: Automatically discovered word/phrase correspondences between languages

Attention mechanisms were the critical innovation that eventually led to the Transformer architecture, which forms the backbone of nearly all foundation models today.

Background in Computer Vision

Computer vision has undergone a parallel evolution to NLP, moving from hand-crafted features to learned representations through deep neural networks.

Image Classification, Detection, Segmentation

Computer vision encompasses several core tasks of increasing complexity:

  • Image Classification: Assigning labels to entire images
  • Object Detection: Localizing and classifying multiple objects in images
  • Semantic Segmentation: Classifying each pixel into a category
  • Instance Segmentation: Distinguishing individual objects within categories
  • Pose Estimation: Identifying the position and orientation of objects or people

Traditional Pipelines: SIFT, HOG, CNNs

Early computer vision relied on hand-engineered feature extraction, demonstrated in the example after this list:

  • SIFT (Scale-Invariant Feature Transform): Detecting and describing local features invariant to scaling and rotation
  • HOG (Histogram of Oriented Gradients): Counting occurrences of gradient orientations in localized portions of images
  • SURF (Speeded-Up Robust Features): Faster approximation of SIFT
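
A short example of extracting these hand-crafted features, assuming OpenCV 4.4+ (where SIFT is included) and scikit-image are installed; the random array stands in for a real grayscale photograph.

```python
# Extract classic hand-crafted features from a (stand-in) grayscale image.
import cv2
import numpy as np
from skimage.feature import hog

image = np.random.randint(0, 256, (128, 64), dtype=np.uint8)

# SIFT: keypoints plus a 128-dimensional descriptor around each keypoint.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# HOG: a single vector of gradient-orientation histograms over the image.
hog_vector = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))

print(len(keypoints), hog_vector.shape)
```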

These were followed by machine learning approaches:

  • Bag of Visual Words: Adapting text classification techniques to images
  • Early CNNs: LeNet and other convolutional architectures showing promise for digit recognition

ImageNet and the Deep Learning Boom

The ImageNet competition catalyzed a revolution in computer vision, demonstrating the power of deep learning at scale.

AlexNet, VGG, ResNet (Brief Overview)

Several breakthrough architectures emerged during the ImageNet era:

  • AlexNet (2012): First CNN to win ImageNet, featuring ReLU activations and dropout
  • VGG (2014): Demonstrated the importance of network depth with small, uniform filters
  • GoogLeNet/Inception (2014): Introduced inception modules with multiple filter sizes
  • ResNet (2015): Used residual (skip) connections to counter vanishing gradients and the degradation problem, enabling training of much deeper networks (see the sketch after this list)
  • DenseNet (2017): Created dense connections between layers for improved information flow
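
The sketch below shows the core ResNet idea in PyTorch: a block computes a residual F(x) and adds it back to its input through a skip connection. Channel counts and input size are illustrative.

```python
# A minimal residual block: output = ReLU(x + F(x)).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)    # skip connection adds the input back

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)                     # torch.Size([1, 64, 32, 32])
```

Because the identity path lets gradients and information flow directly to earlier layers, stacking many such blocks remains trainable.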

Importance of Large-scale Vision Datasets

Large datasets proved critical for vision progress:

  • ImageNet: 14 million images across 20,000+ categories
  • COCO (Common Objects in Context): Rich annotations including object detection, segmentation, and captioning
  • Open Images: Multi-label classification with 9 million images across 6,000+ categories
  • JFT-300M: Google’s internal dataset with 300M+ images used for pretraining

The success of supervised learning on large-scale vision datasets established a pattern that would later be applied to foundation models: scale matters tremendously for both data and model size.

CNN Architecture Intuition

Understanding the intuition behind convolutional neural networks helps explain their effectiveness for visual tasks.

Convolution, Pooling, Filters

CNNs consist of several key components working together; the minimal model after this list combines them:

  • Convolutional layers: Apply learned filters across the input to detect features
  • Activation functions: Introduce non-linearity (typically ReLU)
  • Pooling layers: Reduce spatial dimensions while preserving important features
  • Feature hierarchies: Early layers detect edges and textures; deeper layers detect complex shapes and objects
  • Fully connected layers: Convert spatial features to classification outputs
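
A small PyTorch model showing the conv → ReLU → pool → fully connected pattern described above; the layer sizes are illustrative (3×32×32 inputs, 10 classes).

```python
# Minimal CNN: two conv/pool stages followed by a classification head.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn 16 filters over RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, more filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # map spatial features to 10 classes
)

images = torch.randn(4, 3, 32, 32)                # a batch of 4 toy images
print(model(images).shape)                        # torch.Size([4, 10])
```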

Why CNNs Work Well for Images

CNNs are uniquely suited to image processing due to several properties (one is quantified just after this list):

  • Parameter sharing: Using the same filters across the entire image
  • Local connectivity: Each neuron connects only to a small region of the input
  • Translation equivariance: Detecting the same feature wherever it appears in the image, with pooling adding a degree of positional invariance
  • Hierarchical feature learning: Building complex representations from simple ones
  • Scale and distortion robustness: Handling variations through pooling and depth
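
A back-of-the-envelope comparison of parameter sharing, using illustrative sizes: a 3×3 convolution versus a fully connected layer mapping a 64×64 RGB input to a same-resolution 16-channel output.

```python
# Parameter counts: shared 3x3 filters vs. a dense layer over the same input.
conv_params = 3 * 3 * 3 * 16 + 16     # 3x3 kernels, 3 -> 16 channels, plus biases
dense_params = (64 * 64 * 3) * (64 * 64 * 16) + 64 * 64 * 16  # every pixel to every unit

print(conv_params)      # 448
print(dense_params)     # 805,371,904 -- roughly 1.8 million times more parameters
```

The convolution reuses the same 448 weights at every spatial position, which is exactly the parameter sharing and local connectivity described above.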

Early CV Applications

The success of CNNs quickly led to their application in numerous computer vision tasks.

Object Detection (YOLO, R-CNN)

Object detection systems evolved rapidly; a short example of running a pretrained detector follows the list:

  • R-CNN family: Region-based CNN approaches (R-CNN, Fast R-CNN, Faster R-CNN)
  • YOLO (You Only Look Once): Real-time detection achieved by framing detection as a single regression from the image to bounding boxes and class probabilities
  • SSD (Single Shot Detector): Multi-scale detection with predefined anchor boxes
  • RetinaNet: Addressing class imbalance with focal loss
  • Mask R-CNN: Extending Faster R-CNN for instance segmentation
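
Running one of these detectors is straightforward with torchvision (version 0.13+ assumed for the weights argument; pretrained weights are downloaded on first use, and the random tensor stands in for a real image):

```python
# Run a pretrained Faster R-CNN detector on a stand-in image.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)              # RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])             # one dict per input image

print(predictions[0]["boxes"].shape)         # (num_detections, 4) boxes in xyxy format
print(predictions[0]["labels"], predictions[0]["scores"])
```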

Image Captioning (Tie-in to NLP)

Image captioning represented an early bridge between CV and NLP:

  • CNN-LSTM architectures: Using CNNs to encode images and LSTMs to generate captions
  • Attention mechanisms: Focusing on relevant image regions when generating each word
  • Semantic alignment: Learning correspondences between visual features and textual descriptions
  • Multimodal embeddings: Creating joint representations of images and text

Early multimodal models like image captioning systems were precursors to foundation models, demonstrating how representations could transfer between domains.
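
A skeletal CNN-LSTM captioner in PyTorch, with illustrative vocabulary size and dimensions; in practice the CNN encoder would load pretrained weights and decoding would run token by token.

```python
# CNN encoder + LSTM decoder: the image acts as the first "word" of the caption.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed=256, hidden=512):
        super().__init__()
        cnn = resnet18(weights=None)                                # pretrained in practice
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.img_proj = nn.Linear(512, embed)
        self.word_emb = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)          # (batch, 512) image features
        img_token = self.img_proj(feats).unsqueeze(1)    # image as the first input token
        inputs = torch.cat([img_token, self.word_emb(captions)], dim=1)
        states, _ = self.lstm(inputs)
        return self.out(states)                          # next-token logits per position

model = CaptionModel(vocab_size=5000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 6)))
print(logits.shape)                                      # torch.Size([2, 7, 5000])
```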

Class Discussion / Case Study

Compare Classical ML vs Deep Learning NLP

Discussion points for comparing traditional and deep learning approaches:

  • Feature engineering vs representation learning: Shift from manual feature design to learned representations
  • Task-specific vs general models: Evolution from specialized systems to multipurpose architectures
  • Data requirements: Increase in data needed for modern approaches
  • Interpretability tradeoffs: Classical models often more transparent but less powerful
  • Domain adaptation capabilities: Deep learning’s advantages for transfer learning

Discuss Evolution Toward Foundation Models

Key trends in the development of foundation models:

  • Architecture convergence: Transformer becoming dominant across domains
  • Scale as a strategy: Increasing model size, data, and compute
  • Self-supervised learning: Moving beyond labeled data
  • Cross-modal capabilities: Integrating text, images, audio in unified frameworks
  • Few-shot learning: Adapting to new tasks with minimal examples
  • Emergent abilities: Capabilities not present in smaller models appearing at scale

Key Takeaways

  • NLP and CV evolved from rule-based systems to statistical methods to deep learning
  • Sequence-to-sequence learning and attention mechanisms were critical innovations for NLP
  • Large datasets like ImageNet and COCO catalyzed progress in computer vision
  • CNNs revolutionized computer vision through hierarchical feature learning
  • Early multimodal systems like image captioning bridged the gap between vision and language
  • The convergence of architectures (especially Transformers) across modalities set the stage for foundation models