Bullshit Benchmark

A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them.

📺 Public Viewers

Check out the fancy dashboard: 👉 Fancy Viewer (Interactive DataViz) 💩 👈

Original dashboard: Original Data Explorer.

Repo Layout

scripts/openrouter_benchmark.py: core CLI (collect, grade, grade-panel, aggregate, report)
scripts/run_end_to_end.sh: one-command rerun (collect -> grade-panel -> publish)
scripts/publish_latest_to_viewer.sh: publish final artifacts into data/latest
scripts/cleanup_generated_outputs.sh: remove generated local run artifacts
questions.json: benchmark question set
config.json: canonical config
viewer/index.html: canonical interactive viewer
data/latest/*: canonical published dataset

Canonical Published Data

data/latest contains the latest dataset used by the viewer:

responses.jsonl
collection_stats.json
panel_summary.json
aggregate_summary.json
aggregate.jsonl
leaderboard.csv
manifest.json
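
Before pointing the viewer at a dataset, it can help to confirm that every file listed above is present. The following is a minimal sketch, not part of the repository: the file names come from the list above, while the helper name `missing_published_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is a
# hypothetical illustration, not a script shipped in this repo.
EXPECTED_FILES = [
    "responses.jsonl",
    "collection_stats.json",
    "panel_summary.json",
    "aggregate_summary.json",
    "aggregate.jsonl",
    "leaderboard.csv",
    "manifest.json",
]


def missing_published_files(data_dir):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_published_files("data/latest")
    print("missing:", missing if missing else "none")
```

Run from the repository root, this reports which published files, if any, are missing from data/latest.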

Re-run

Run the full pipeline and republish data/latest:

./scripts/run_end_to_end.sh

Publish Existing Run Artifacts

./scripts/publish_latest_to_viewer.sh \
  --responses-file <path/to/responses.jsonl> \
  --collection-stats <path/to/collection_stats.json> \
  --panel-summary <path/to/panel_summary.json> \
  --aggregate-summary <path/to/aggregate_summary.json> \
  --aggregate-rows <path/to/aggregate.jsonl>

The publish step also sanitizes local-machine path fields from the published dataset.
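
How the publish script performs this sanitization is not described here. As a rough illustration of the idea only, the sketch below walks a JSON-like record and reduces absolute paths to their final component. It assumes that path fields can be recognized by a key suffix (`path`, `file`, `dir`); that naming convention, along with the names `strip_local_path` and `sanitize`, is hypothetical and may not match what the script actually does.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical convention: keys ending in these suffixes hold paths.
PATH_KEY_SUFFIXES = ("path", "file", "dir")


def strip_local_path(value):
    """Reduce an absolute POSIX or Windows path to its final component."""
    if PurePosixPath(value).is_absolute():
        return PurePosixPath(value).name
    if PureWindowsPath(value).is_absolute():
        return PureWindowsPath(value).name
    return value  # relative paths and non-path strings are left alone


def sanitize(node, key=""):
    """Recursively copy node, stripping absolute paths under path-like keys."""
    if isinstance(node, dict):
        return {k: sanitize(v, k) for k, v in node.items()}
    if isinstance(node, list):
        # List items inherit the key of the list that contains them.
        return [sanitize(item, key) for item in node]
    if isinstance(node, str) and key.lower().endswith(PATH_KEY_SUFFIXES):
        return strip_local_path(node)
    return node


if __name__ == "__main__":
    record = {
        "model": "example/model",  # made-up value for illustration
        "responses_file": "/home/me/runs/responses.jsonl",
    }
    print(json.dumps(sanitize(record)))
```

Matching on key names rather than on string contents avoids rewriting model responses that merely start with a slash.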

Environment

Required:

OPENROUTER_API_KEY

Optional:

OPENROUTER_REFERER
OPENROUTER_APP_NAME
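
A quick pre-flight check can catch a missing key before a long run starts. This is a minimal sketch, assuming only the variable names listed above; the function `check_environment` is hypothetical and not part of the repository.

```python
import os

# Variable names come from the lists above.
REQUIRED = ["OPENROUTER_API_KEY"]
OPTIONAL = ["OPENROUTER_REFERER", "OPENROUTER_APP_NAME"]


def check_environment(env=None):
    """Return (missing_required, unset_optional) for the given mapping.

    Defaults to os.environ; empty strings count as unset.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    unset_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing, unset_optional


if __name__ == "__main__":
    missing, unset_optional = check_environment()
    if missing:
        print("Missing required variables:", ", ".join(missing))
    if unset_optional:
        print("Optional variables not set:", ", ".join(unset_optional))
    if not missing and not unset_optional:
        print("Environment looks complete.")
```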


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bullshit Benchmark

📺 Public Viewers

Repo Layout

Canonical Published Data

Re-run

Publish Existing Run Artifacts

Environment

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Bullshit Benchmark

📺 Public Viewers

Repo Layout

Canonical Published Data

Re-run

Publish Existing Run Artifacts

Environment

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages