Awesome AI Benchmarks & Evaluation

A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance across reasoning, safety, robustness, multimodality, RAG, LLMs, and traditional machine learning tasks.

Support ongoing maintenance and curation via GitHub Sponsors.

General LLM Benchmarks

HELM – Holistic evaluation of LLMs across dozens of tasks and domains.
LM Evaluation Harness – Standardized benchmarking suite for language models.
Open LLM Leaderboard (Hugging Face) – Public leaderboard comparing open-source LLMs on a shared set of benchmarks.
BIG-Bench – Broad set of challenging tasks for evaluating model generalization.
MT-Bench – Multi-turn chat interaction benchmark.
AlpacaEval – Automatic evaluation of instruction-following models.

Reasoning & Math

GSM8K – Benchmark for math reasoning at grade-school level.
MATH – Extensive dataset of high school and competition math problems.
AIME Bench – Benchmark based on American Invitational Mathematics Examination questions.
ARC (Abstraction & Reasoning Corpus) – Tests generalization and abstraction capabilities.
AGIEval – Benchmarks for human-style exams and reasoning.
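
Math benchmarks such as GSM8K are commonly scored by exact match on the final numeric answer. The sketch below illustrates that idea only; it is not the official scorer of any benchmark listed here. It assumes reference solutions end with a `#### <answer>` line (the convention used in GSM8K reference solutions) and that the model's final answer is the last number in its output. The names `final_number` and `exact_match_accuracy` are hypothetical.

```python
import re
from typing import List, Optional

# Integers or decimals, optionally negative, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[float]:
    """Return the last number in `text`, or None if it contains no number.

    Assumption: if a '####' marker is present, the answer follows the last marker.
    """
    tail = text.split("####")[-1] if "####" in text else text
    matches = NUMBER.findall(tail)
    if not matches:
        return None
    # Strip thousands separators (and any trailing comma) before converting.
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose final number equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        gold = final_number(ref)
        if gold is not None and final_number(pred) == gold:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    preds = [
        "She has 18 eggs left, so the answer is 18.",
        "Total is 1,000 dollars",
        "I am not sure",
    ]
    refs = ["16 - 3 + 5 = 18\n#### 18", "#### 1000", "#### 5"]
    print(exact_match_accuracy(preds, refs))
```

Real harnesses usually apply stricter answer extraction and normalization (units, fractions, LaTeX), so scores from a sketch like this are not comparable to published numbers.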

Multimodal Benchmarks

MMBench – Large benchmark for vision–language reasoning.
COCO Captions & VQA – Standard datasets for image captioning and visual question answering.
Flickr30k – Image–text benchmark for captioning and retrieval.
ImageNet – Core image classification benchmark still used for comparison.
LAION Benchmarks – Evaluation datasets for multimodal embeddings and retrieval.
Video Question Answering Benchmarks – For video–text reasoning.

RAG Benchmarks

RAGAS – Comprehensive metrics for evaluating retrieval-augmented generation.
BEIR – Retrieval benchmark widely used to test search quality in RAG systems.
FiQA / TREC Variants – Evaluation datasets for information retrieval tasks.
Natural Questions (NQ) – Large dataset for retrieval + QA evaluation.
HotpotQA – Multi-hop question answering benchmark.
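
Retrieval quality in a RAG pipeline is often summarized with rank-based metrics. As a minimal sketch, assuming each query has a ranked list of retrieved document IDs and a set of known relevant IDs, the code below computes Recall@k and mean reciprocal rank (MRR). The function names and the `run`/`qrels` dictionary layout are illustrative assumptions, not the API of any tool listed above, and official benchmark scores should come from each benchmark's own evaluation code.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> Dict[str, float]:
    """Average both metrics over queries that have at least one relevant doc."""
    queries = [q for q in qrels if qrels[q]]
    if not queries:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(queries)
    recall = sum(recall_at_k(run.get(q, []), qrels[q], k) for q in queries) / n
    mrr = sum(reciprocal_rank(run.get(q, []), qrels[q]) for q in queries) / n
    return {"recall@k": recall, "mrr": mrr}


if __name__ == "__main__":
    # Hypothetical toy data: query ID -> ranked doc IDs, query ID -> relevant doc IDs.
    run = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4"]}
    qrels = {"q1": {"d1", "d2"}, "q2": {"d9"}}
    print(evaluate(run, qrels, k=2))
```

Queries missing from `run` are treated as having retrieved nothing, so they lower both averages rather than being skipped.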

Safety & Robustness

HarmBench – Standardized evaluation framework for automated red teaming and robust refusal.
SafetyBench – Benchmark suite for safety testing.
ToxiGen – Dataset for toxic or harmful content detection.
Red Team Prompt Datasets – Collections of prompts to stress-test model alignment.
RobustBench – Standardized benchmark and leaderboard for adversarially robust image classification models.
AdversarialNLI – Dataset for robustness in natural language inference.

Evaluation Frameworks

OpenAI Evals – Evaluation framework for custom metrics and tasks.
TruLens – Observability and feedback evaluation for LLM apps and RAG.
Arize Phoenix – Open-source toolkit for LLM/RAG evals and trace analysis.
LightEval – Fast LLM evaluation pipeline.
Evalchemy – Lightweight framework for running LLM benchmarks.
Weights & Biases Evaluation – Model comparison and metric visualization tools.

Datasets

TruthfulQA – Benchmark for truthfulness in open-ended QA.
SQuAD – Reading comprehension dataset.
BoolQ – Yes/no question dataset.
SuperGLUE – General NLU evaluation suite.
MultiRC – Multi-sentence reasoning benchmark.
WikiQA – QA benchmark used in retrieval + QA systems.
CommonsenseQA – Multiple-choice dataset for commonsense reasoning.

Learning Resources

HELM Overview – Intro to full-spectrum model benchmarking.
LLM Evaluation Guide (Hugging Face) – End-to-end guide for evaluating language models.
Stanford CS25 Notes – Covers benchmarking and model evaluation basics.
MLPerf – Industry-standard benchmark suite from MLCommons for measuring ML training and inference performance.
DeepMind Papers on Evaluation – Research on model testing and evaluation.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.
