A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them.
Check out the fancy dashboard: 👉 Fancy Viewer (Interactive DataViz) 💩 👈
Original dashboard: Original Data Explorer.
- scripts/openrouter_benchmark.py: core CLI (collect, grade, grade-panel, aggregate, report)
- scripts/run_end_to_end.sh: one-command rerun (collect -> grade-panel -> publish)
- scripts/publish_latest_to_viewer.sh: publish final artifacts into data/latest
- scripts/cleanup_generated_outputs.sh: remove generated local run artifacts
- questions.json: benchmark question set
- config.json: canonical config
- viewer/index.html: canonical interactive viewer
- data/latest/*: canonical published dataset
data/latest contains the latest dataset used by the viewer:
- responses.jsonl
- collection_stats.json
- panel_summary.json
- aggregate_summary.json
- aggregate.jsonl
- leaderboard.csv
- manifest.json
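Before pointing the viewer at a freshly published dataset, it can help to confirm that every file listed above is present. The following is a minimal sketch of such a check; the function name `missing_published_files` is hypothetical and not part of the repository's scripts, and only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the list above; the check itself is illustrative
# and is not part of the repository.
EXPECTED_FILES = (
    "responses.jsonl",
    "collection_stats.json",
    "panel_summary.json",
    "aggregate_summary.json",
    "aggregate.jsonl",
    "leaderboard.csv",
    "manifest.json",
)


def missing_published_files(directory="data/latest"):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_published_files()
    if missing:
        print("Missing from data/latest:", ", ".join(missing))
    else:
        print("data/latest contains all expected files.")
```

Run from the repository root, this prints either the missing file names or a confirmation line.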
Run the full pipeline and republish data/latest:
    ./scripts/run_end_to_end.sh

To publish specific artifacts into data/latest:

    ./scripts/publish_latest_to_viewer.sh \
      --responses-file <path/to/responses.jsonl> \
      --collection-stats <path/to/collection_stats.json> \
      --panel-summary <path/to/panel_summary.json> \
      --aggregate-summary <path/to/aggregate_summary.json> \
      --aggregate-rows <path/to/aggregate.jsonl>

The publish step also sanitizes local-machine path fields in the published dataset.
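To make the idea of sanitizing path fields concrete, here is a rough sketch of how such a step could work. It is not the logic in scripts/publish_latest_to_viewer.sh, which is not shown here. It assumes that path-carrying fields can be recognized by a key ending in `path`, `file`, or `dir`, and that reducing the value to its final component is enough; both the key pattern and the function name `sanitize` are hypothetical.

```python
import re

# Assumption: keys ending in "path", "file", or "dir" hold local paths.
# The real publish script may select fields differently.
PATH_KEY = re.compile(r"(path|file|dir)$", re.IGNORECASE)


def _basename(value):
    """Keep only the final component of a POSIX or Windows style path."""
    return re.split(r"[\\/]", value.rstrip("\\/"))[-1]


def sanitize(record):
    """Recursively shorten path-like string fields in a parsed JSON value."""
    if isinstance(record, dict):
        return {
            key: _basename(val)
            if isinstance(val, str) and PATH_KEY.search(key)
            else sanitize(val)
            for key, val in record.items()
        }
    if isinstance(record, list):
        return [sanitize(item) for item in record]
    return record
```

Matching on key names rather than on values leaves ordinary strings that happen to contain slashes, such as model identifiers, untouched.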
Required environment variables:
- OPENROUTER_API_KEY

Optional environment variables:
- OPENROUTER_REFERER
- OPENROUTER_APP_NAME
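A small pre-flight check can catch a missing key before a long run starts. The sketch below only inspects the variable names listed above; the helper `check_environment` is hypothetical and not part of the benchmark CLI.

```python
import os

REQUIRED = ("OPENROUTER_API_KEY",)
OPTIONAL = ("OPENROUTER_REFERER", "OPENROUTER_APP_NAME")


def check_environment(env=None):
    """Return (missing_required, unset_optional) variable names."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    unset_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing, unset_optional


if __name__ == "__main__":
    missing, unset_optional = check_environment()
    if missing:
        raise SystemExit("Missing required variables: " + ", ".join(missing))
    if unset_optional:
        print("Optional variables not set:", ", ".join(unset_optional))
```

Empty values are treated as unset, since an empty API key would fail in the same way as a missing one.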