The Grand AI Handbook

Transformer Models

Core transformer architectures driving modern NLP.

Chapter 16: Transformer Fundamentals
- Self-attention, multi-head attention
- Positional encodings, layer normalization
- [Scaled dot-product attention, residual connections, feed-forward layers]
- References

Chapter 17: Encoder Models
- BERT: Masked language modeling
- Variants: RoBERTa, ALBERT, DistilBERT
- [Bidirectional training, tokenization: WordPiece, BPE]
- References

Chapter 18: Decoder Models
- GPT: Autoregressive modeling
- Variants: GPT-2, GPT-3, LLaMA
- [Causal masking, prompt tuning, in-context learning]
- References

Chapter 19: Encoder-Decoder Models
- T5: Text-to-text framework
- BART, MarianMT
- [Seq2seq learning, denoising objectives, beam search]
- References

Chapter 20: Knowledge-Augmented NLP
- Retrieval-augmented generation (RAG), knowledge graphs
- Applications: Fact-checking, knowledge-intensive QA
- [Dense Passage Retrieval, REALM, entity linking]
- References
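The sketches that follow pair each chapter with a short, self-contained Python example. For Chapter 16, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; the function name and toy shapes are illustrative, not drawn from the handbook.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    # Row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: batch of 1, sequence of 3 tokens, head dimension d_k = 4.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(1, 3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (1, 3, 4)
```

Multi-head attention runs several such maps in parallel over learned projections of Q, K, and V, then concatenates the per-head outputs.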
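For Chapter 17, a quick way to see masked language modeling in action is Hugging Face's fill-mask pipeline, assuming the transformers package is installed and the public bert-base-uncased checkpoint can be downloaded:

```python
from transformers import pipeline

# Load a pretrained BERT with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] using context on both sides,
# which is what "bidirectional training" refers to.
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```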
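For Chapter 18, the same library exposes autoregressive decoding. This sketch greedily extends a prompt with GPT-2; causal masking ensures each new token is predicted from earlier tokens only. It again assumes transformers and the public gpt2 checkpoint, and the prompt is an arbitrary illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Autoregressive decoding: each generated token attends only to earlier
# tokens (the causal mask) and is fed back as input for the next step.
inputs = tokenizer("The transformer architecture", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                       # greedy decoding
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```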
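For Chapter 19, T5's text-to-text framing and beam search can be exercised together. The task prefix and the num_beams value below are illustrative choices, and the example assumes the public t5-small checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 casts every task as text-to-text; a task prefix selects the behavior.
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")

# Beam search keeps the num_beams highest-scoring partial hypotheses at
# each decoding step instead of committing to a single greedy token.
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```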
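For Chapter 20, here is a deliberately tiny retrieve-then-read sketch. A bag-of-words retriever stands in for a dense retriever such as DPR, and the corpus, query, and helper names (embed, retrieve) are invented for illustration:

```python
import numpy as np

# Toy corpus standing in for a real passage index.
passages = [
    "Marie Curie won Nobel Prizes in physics and chemistry.",
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Transformers were introduced in the 2017 paper Attention Is All You Need.",
]

def embed(text, vocab):
    """Bag-of-words count vector; a real system would use a dense encoder."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query, passages, k=1):
    """Score passages by dot product with the query and return the top k."""
    vocab = sorted({w for p in passages + [query] for w in p.lower().split()})
    q = embed(query, vocab)
    scored = [(float(q @ embed(p, vocab)), p) for p in passages]
    return [p for _, p in sorted(scored, reverse=True)[:k]]

query = "When was the Eiffel Tower completed?"
context = retrieve(query, passages, k=1)
# RAG conditions a generator on the retrieved evidence plus the question.
prompt = f"context: {context[0]} question: {query}"
print(prompt)
```

In a full RAG system the retrieved passage conditions a seq2seq generator rather than simply being printed, which is what makes the approach useful for fact-checking and knowledge-intensive QA.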