BullshitBench v2

BullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.

Public viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
Updated: 2026-03-12

Latest Changelog Entry (2026-03-12)

Added benchmark runs for the new Grok 4.20 variants across both published datasets:
- x-ai/grok-4.20-beta
- x-ai/grok-4.20-multi-agent-beta
Published the Grok 4.20 rows into both viewer tracks:
- v1 (data/latest) with 55 questions
- v2 (data/v2/latest) with 100 questions
Simplified the visible model labels in the viewers by dropping the Beta suffix from the Grok 4.20 display names while keeping the underlying model IDs unchanged.
Refined the main chart row-selection treatment to make model selection easier to see without overpowering the chart.
Updated org color mapping so xAI renders in black and OpenAI renders in green in the viewers.
Added click-to-pin labels for scatter-chart dots in the v2 viewer so specific models can be called out on demand.
Full details: CHANGELOG.md

v2 Changelog Highlights

100 new nonsense questions in the v2 set.
Domain-specific question coverage across 5 domains: software (40), finance (15), legal (15), medical (15), physics (15).
New visualizations in the v2 viewer, including:
- Detection Rate by Model (stacked mix bars)
- Domain Landscape (overall vs domain detection mix)
- Detection Rate Over Time
- Do Newer Models Perform Better?
- Does Thinking Harder Help? (tokens/cost toggle)

Viewer Walkthrough (v2)

The screenshots below follow the same flow as viewer/index.v2.html, starting with the main chart.

1. Detection Rate by Model (Main Chart)

Primary leaderboard-style view showing each model's green/amber/red split.

2. Domain Landscape

Detection mix by domain to compare overall performance vs each domain at a glance.

3. Detection Rate Over Time

Release-date trend view focused on Anthropic, OpenAI, and Google.

4. Do Newer Models Perform Better?

All-model scatter by release date vs. green rate.

5. Does Thinking Harder Help?

Scatter of reasoning effort (tokens or cost, toggled in the viewer) vs. green rate.

Benchmark Scope (v2)

100 nonsense prompts total.
5 domain groups: software (40), finance (15), legal (15), medical (15), physics (15).
13 nonsense techniques (for example: plausible_nonexistent_framework, misapplied_mechanism, nested_nonsense, specificity_trap).
3-judge panel (anthropic/claude-sonnet-4.6, openai/gpt-5.2, google/gemini-3.1-pro-preview) run in full panel mode, with the judges' scores combined by mean aggregation.
Published v2 leaderboard currently includes 80 model/reasoning rows.
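
To make "mean aggregation" over the 3-judge panel concrete, here is a minimal sketch. The judge IDs come from the list above; the function name `aggregate_panel` and the numeric score scale are assumptions made for illustration, not the pipeline's actual code.

```python
from statistics import mean

# Judge IDs as listed in the README. The numeric scale used below is an
# assumption for illustration only.
PANEL = (
    "anthropic/claude-sonnet-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview",
)


def aggregate_panel(scores: dict[str, float]) -> float:
    """Return the mean score, requiring one score from every panel judge."""
    missing = [judge for judge in PANEL if judge not in scores]
    if missing:
        raise ValueError(f"missing judge scores: {missing}")
    return mean(scores[judge] for judge in PANEL)


# Hypothetical scores for a single model response.
example = {
    "anthropic/claude-sonnet-4.6": 2.0,
    "openai/gpt-5.2": 1.0,
    "google/gemini-3.1-pro-preview": 0.0,
}
print(aggregate_panel(example))
```

Requiring all three scores mirrors the "exactly 3 judges on every row" rule: a row with a missing judge raises an error rather than being averaged over fewer scores.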

What This Measures

Clear Pushback: the model clearly rejects the broken premise.
Partial Challenge: the model flags issues but still engages the bad premise.
Accepted Nonsense: the model treats the nonsense as valid.
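
As a rough illustration of how per-response outcomes roll up into a model's detection mix, the sketch below computes the share of each category. The label strings and the function name `outcome_rates` are hypothetical; the benchmark's real data format may differ.

```python
from collections import Counter

# Hypothetical label strings for the three outcomes described above.
CATEGORIES = ("clear_pushback", "partial_challenge", "accepted_nonsense")


def outcome_rates(labels: list[str]) -> dict[str, float]:
    """Return the fraction of responses that fall into each outcome category."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    return {category: counts[category] / total for category in CATEGORIES}


# Four hypothetical graded responses from one model.
rates = outcome_rates(
    ["clear_pushback", "clear_pushback", "clear_pushback", "accepted_nonsense"]
)
print(rates)
```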

Quick Start

Set API keys:

export OPENROUTER_API_KEY=your_key_here
export OPENAI_API_KEY=your_openai_key_here  # required only for models routed to OpenAI
export OPENAI_PROJECT=proj_xxx              # optional: force OpenAI requests to a specific project
export OPENAI_ORGANIZATION=org_xxx          # optional: force organization context

Provider routing is configured per model via collect.model_providers and grade.model_providers in config (default is OpenRouter), for example: {"*":"openrouter","gpt-5.3":"openai"}.
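
The routing map above can be read as a lookup with a wildcard fallback. The sketch below assumes a key matches only when it equals the model name exactly, with "*" as the catch-all; the pipeline's real matching rule is not documented here and may differ. `resolve_provider` is a hypothetical helper, not a function from the repo.

```python
DEFAULT_PROVIDER = "openrouter"  # the README states OpenRouter is the default


def resolve_provider(model: str, routes: dict[str, str]) -> str:
    """Pick a provider for a model: exact key first, then "*", then the default.

    Exact-match semantics are an assumption made for this sketch.
    """
    if model in routes:
        return routes[model]
    return routes.get("*", DEFAULT_PROVIDER)


# The example mapping from the README.
routes = {"*": "openrouter", "gpt-5.3": "openai"}
print(resolve_provider("gpt-5.3", routes))
print(resolve_provider("x-ai/grok-4.20-beta", routes))
```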

Run collection + primary judge (Claude by default):

./scripts/run_end_to_end.sh

Run v2 end-to-end and publish into the dedicated v2 dataset:

./scripts/run_end_to_end.sh --config config.v2.json --viewer-output-dir data/v2/latest --with-additional-judges

Run the Qwen-only OpenRouter benchmark pack (Qwen 3.5 / Qwen 3 / Qwen 2.5):

./scripts/run_end_to_end.sh --config config.qwen-openrouter.json --viewer-output-dir data/v2/latest --with-additional-judges

Estimate total run cost from OpenRouter catalog pricing before collecting:

python3 scripts/estimate_openrouter_cost.py --config config.qwen-openrouter.json
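
For intuition about what a pre-run estimate involves, here is a back-of-the-envelope sketch based on per-token pricing. It is not the logic of scripts/estimate_openrouter_cost.py; the function name, token counts, and prices are all made-up placeholders.

```python
def estimate_run_cost(
    num_prompts: int,
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_prompt_token: float,
    usd_per_completion_token: float,
) -> float:
    """Estimate the USD cost of sending num_prompts requests to one model.

    Assumes every request uses the same number of prompt and completion
    tokens, which is a simplification for illustration.
    """
    per_request = (
        prompt_tokens * usd_per_prompt_token
        + completion_tokens * usd_per_completion_token
    )
    return num_prompts * per_request


# Placeholder numbers: 100 prompts, 200 input and 800 output tokens each.
cost = estimate_run_cost(100, 200, 800, 1e-6, 4e-6)
print(f"estimated cost: ${cost:.2f}")
```

A full run multiplies this by the number of model/reasoning rows and adds the judge calls, so the real estimate will be correspondingly larger.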

Optionally run the default config end-to-end (publishes to data/latest):

./scripts/run_end_to_end.sh --with-additional-judges

Open the viewer:

Published viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
Local viewer (optional):

./scripts/run_end_to_end.sh --with-additional-judges --serve --port 8877

Then open http://localhost:8877/viewer/index.v2.html. Use the Benchmark Version dropdown in the filters panel to switch between published datasets (for example v1 and v2).

Published Datasets

v1 dataset remains in data/latest.
v2 dataset is published in data/v2/latest.
v2 question set comes from drafts/new-questions.md via scripts/build_questions_v2_from_draft.py.
Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (legacy disagreement-tiebreak mode is retired from the main pipeline).
Release notes and notable changes are tracked in CHANGELOG.md.

Documentation

Technical Guide: pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.
Changelog: v1 to v2 release notes and publish-history highlights.
Question Set: benchmark questions and scoring metadata.
Question Set v2: v2 question pool generated from drafts/new-questions.md.
Config: default model/pipeline settings.
Config v2: v2-ready config (uses questions.v2.json).

Notes

This README is intentionally audience-facing.
Technical and maintainer-oriented content lives in docs/TECHNICAL.md.

License

MIT. See LICENSE.

