The Grand AI Handbook

Evaluation and Benchmarking

Assessing MLOps systems and their outputs.

Chapter 49: Model Performance Metrics
  Latency, throughput, accuracy
  Domain-specific KPIs: Precision@K, MAP

Chapter 50: MLOps Benchmarks
  Datasets: DAWNBench, MLPerf
  Leaderboards: Papers With Code

Chapter 51: Robustness and Stress Testing
  Adversarial attacks
  Out-of-distribution testing
  Tools: Foolbox, RobustBench

Chapter 52: Human-in-the-Loop Evaluation
  Active learning
  Crowdsourcing
  Platforms: Amazon Mechanical Turk, Labelbox
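
Chapter 49 names Precision@K and MAP (mean average precision) as domain-specific KPIs. The outline does not define them, so here is a minimal sketch of the common textbook definitions. It is not taken from the handbook. The function names and the toy data are hypothetical. It assumes binary relevance and ranked lists without duplicates, divides Precision@K by K even when fewer than K items are returned, and normalizes average precision by the total number of relevant items.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    # Assumed convention: divide by k even if fewer than k items were returned.
    return hits / k


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed convention: normalize by the total number of relevant items,
    # so relevant items that were never retrieved lower the score.
    return precision_sum / len(relevant)


def mean_average_precision(
    queries: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): two queries with their ranked results and relevant sets.
query_1 = (["a", "b", "c", "d"], {"a", "c"})
query_2 = (["x", "y"], {"y"})

p_at_2 = precision_at_k(query_1[0], query_1[1], k=2)
ap_1 = average_precision(*query_1)
map_score = mean_average_precision([query_1, query_2])
```

On the toy data, the first query has one relevant item in its top two results, so Precision@2 is 0.5. Its average precision is (1/1 + 2/3) / 2, because the relevant items appear at ranks 1 and 3. Other normalization conventions exist, so the convention in use should be stated when these metrics are reported.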