A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance across reasoning, safety, robustness, multimodality, RAG, LLMs, and traditional machine learning tasks.
Support ongoing maintenance and curation via GitHub Sponsors.
- General LLM Benchmarks
- Reasoning & Math
- Multimodal Benchmarks
- RAG Benchmarks
- Safety & Robustness
- Evaluation Frameworks
- Datasets
- Learning Resources
- Related Awesome Lists
- HELM – Holistic evaluation of LLMs across dozens of tasks and domains.
- LM Evaluation Harness – Standardized benchmarking suite for language models.
- Open LLM Leaderboard (Hugging Face) – Public leaderboard for open-source LLM performance.
- BIG-Bench – Broad set of challenging tasks for evaluating model generalization.
- MT-Bench – Multi-turn chat interaction benchmark.
- AlpacaEval – Automatic evaluation of instruction-following models.
- GSM8K – Benchmark for math reasoning at grade-school level.
- MATH – Extensive dataset of high school and competition math problems.
- AIME Bench – Benchmark based on American Invitational Mathematics Examination questions.
- ARC (Abstraction & Reasoning Corpus) – Tests generalization and abstraction capabilities.
- AGIEval – Benchmarks for human-style exams and reasoning.
- MMBench – Large benchmark for vision–language reasoning.
- COCO Captions & VQA – Standard datasets for image captioning and visual question answering.
- Flickr30k – Image–text benchmark for captioning and retrieval.
- ImageNet – Core image classification benchmark still used for comparison.
- LAION Benchmarks – Evaluation datasets for multimodal embeddings and retrieval.
- Video Question Answering Benchmarks – For video–text reasoning.
- RAGAS – Comprehensive metrics for evaluating retrieval-augmented generation.
- BEIR – Retrieval benchmark widely used to test search quality in RAG systems.
- FiQA / TREC Variants – Evaluation datasets for information retrieval tasks.
- Natural Questions (NQ) – Large dataset for retrieval + QA evaluation.
- HotpotQA – Multi-hop question answering benchmark.
- HarmBench – Standardized evaluation framework for automated red teaming and refusal robustness.
- SafetyBench – Benchmark suite for safety testing.
- ToxiGen – Dataset for toxic or harmful content detection.
- Red Team Prompt Datasets – Collections of prompts to stress-test model alignment.
- RobustBench – Benchmark and leaderboard for adversarial robustness of image classification models.
- Adversarial NLI (ANLI) – Adversarially collected dataset for testing robustness in natural language inference.
- OpenAI Evals – Evaluation framework for custom metrics and tasks.
- TruLens – Observability and feedback evaluation for LLM apps and RAG.
- Arize Phoenix – Open-source toolkit for LLM/RAG evals and trace analysis.
- LightEval – Fast LLM evaluation pipeline.
- Evalchemy – Lightweight framework for running LLM benchmarks.
- Weights & Biases Evaluation – Model comparison and metric visualization tools.
- TruthfulQA – Benchmark for truthfulness in open-ended QA.
- SQuAD – Reading comprehension dataset.
- BoolQ – Yes/no question dataset.
- SuperGLUE – General NLU evaluation suite.
- MultiRC – Multi-sentence reading comprehension benchmark.
- WikiQA – QA benchmark used in retrieval + QA systems.
- CommonsenseQA – Commonsense reasoning dataset.
- HELM Overview – Intro to full-spectrum model benchmarking.
- LLM Evaluation Guide (Hugging Face) – End-to-end guide for evaluating language models.
- Stanford CS25 Notes – Covers benchmarking and model evaluation basics.
- MLPerf – Industry-standard benchmark suite from MLCommons for ML training and inference performance.
- DeepMind Papers on Evaluation – Research on model testing and evaluation.
- Awesome AI
- Awesome AI Safety & Alignment
- Awesome AI Security
- Awesome AI Research Tools
- Awesome Machine Learning
Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.