The Grand AI Handbook

Evaluation and Benchmarks

Assessing NLP systems rigorously and fairly.

Chapter 53: Task-Specific Metrics
BLEU, ROUGE, METEOR; accuracy and F1 for classification; LLM-specific metrics (see the ROUGE-1 sketch after this listing)
[Perplexity, factual consistency, coherence scores, BLEURT]
References

Chapter 54: NLP Benchmarks
GLUE, SuperGLUE, SQuAD, XNLI; leaderboards and task diversity (see the benchmark-loop sketch below)
[Big-Bench, HELM, multilingual benchmarks]
References

Chapter 55: Robustness Testing
Adversarial examples, out-of-distribution generalization (see the word-swap sketch below)
Applications: Reliable NLP
[TextFooler, stress testing, distributional shift]
References

Chapter 56: Human Evaluation
User studies, fluency ratings, preference elicitation (see the agreement sketch below)
Applications: Practical NLP assessment
[Crowdsourcing, A/B testing, subjective metrics]
References
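As a concrete taste of Chapter 53's overlap metrics, here is a minimal sketch of ROUGE-1 computed as a unigram F1 over whitespace tokens. The function name `rouge1_f1` is illustrative rather than a library API; real evaluations add proper tokenization, stemming, and multiple references.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 over whitespace tokens (illustrative)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "a cat sat on a mat"))
```

The same precision/recall/F1 skeleton carries over to the classification metrics in this chapter; only the unit of overlap (labels instead of n-grams) changes.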
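Chapter 54's benchmarks are typically consumed through a simple evaluation loop. A minimal sketch, assuming the Hugging Face `datasets` library is installed, scoring accuracy on GLUE's SST-2 validation split; `predict` is a placeholder stand-in, not a real model:

```python
from datasets import load_dataset

def predict(sentence: str) -> int:
    # Placeholder: a real harness would call a trained classifier here.
    return 1

# SST-2 is one of the GLUE tasks; labels are 0 (negative) / 1 (positive).
val = load_dataset("glue", "sst2", split="validation")
correct = sum(predict(ex["sentence"]) == ex["label"] for ex in val)
print(f"accuracy: {correct / len(val):.3f}")
```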
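For Chapter 55, a toy TextFooler-flavored probe: substitute synonyms one word at a time and report any perturbation that flips the model's label. The `SYNONYMS` table and `classify` stub are illustrative assumptions; the actual attack constrains swaps with counter-fitted word embeddings and sentence-similarity filtering.

```python
# Toy synonym table and sentiment stub, for illustration only.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film", "picture"]}

def classify(text: str) -> int:
    # Stub model: predicts positive (1) iff the word "good" appears.
    return int("good" in text.split())

def adversarial_swaps(text: str):
    words = text.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            perturbed = " ".join(words[:i] + [syn] + words[i + 1:])
            if classify(perturbed) != classify(text):
                yield perturbed  # a label-flipping perturbation

for adv in adversarial_swaps("a good movie overall"):
    print("flips the label:", adv)
```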
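Human judgments in Chapter 56 are only as useful as their reliability, so a standard first check on crowdsourced ratings is chance-corrected inter-annotator agreement. A minimal sketch of Cohen's kappa for two raters, in pure Python with illustrative names:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two annotators."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled at random with these
    # marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two raters labeling outputs as fluent (1) / disfluent (0).
print(cohens_kappa([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]))
```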