Read this file and continue from here.
Update this repo so BullshitBench can run a Qwen-only benchmark via OpenRouter and produce a cost estimate sourced from OpenRouter pricing before running.
- Benchmark models should be ONLY these families/sizes (small/medium/large style):
  - Qwen 3.5: 9B, 27B, 35B-A3B
  - Qwen 3: 2–3 similar sizes
  - Qwen 2.5: 2–3 similar sizes
- Use OpenRouter-only routing for model execution.
- Verify models are actually available on OpenRouter.
- Estimate full benchmark cost from OpenRouter pricing before asking for API keys.
- Then request keys and do a dry run.
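
The cost estimate in the requirements above comes down to per-token price times expected token volume. What follows is a minimal sketch of that arithmetic, not the code in scripts/estimate_openrouter_cost.py. It assumes the OpenRouter catalog reports prompt and completion prices as USD-per-token strings under a pricing object, which should be confirmed against the live /api/v1/models response. The function name, the prices, and the token counts are all hypothetical.

```python
from decimal import Decimal


def estimate_model_cost(pricing, n_requests, avg_prompt_tokens, avg_completion_tokens):
    """Estimate USD cost for one model.

    Assumption: pricing["prompt"] and pricing["completion"] are USD-per-token
    values (strings or numbers). Decimal avoids float rounding on tiny prices.
    """
    prompt_price = Decimal(str(pricing["prompt"]))
    completion_price = Decimal(str(pricing["completion"]))
    per_request = (
        prompt_price * avg_prompt_tokens
        + completion_price * avg_completion_tokens
    )
    return per_request * n_requests


# Hypothetical numbers, for illustration only (not real OpenRouter prices).
example_pricing = {"prompt": "0.0000002", "completion": "0.0000006"}
cost = estimate_model_cost(
    example_pricing,
    n_requests=100,
    avg_prompt_tokens=500,
    avg_completion_tokens=1000,
)
print(f"Estimated cost: ${cost:.4f}")
```

A full-benchmark figure would sum this over every benchmarked model and every judge model. The token averages are the main assumption and should be stated next to the reported number.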
A prior agent already added:
- config.qwen-openrouter.json
- scripts/estimate_openrouter_cost.py
- README quick-start updates for both.
Latest commit before this handoff:
482db1f Add Qwen-only OpenRouter benchmark config and cost estimator
Environment currently cannot reach OpenRouter through proxy:
- curl -i https://openrouter.ai returns HTTP/1.1 403 Forbidden and curl: (56) CONNECT tunnel failed, response 403.
- python3 scripts/estimate_openrouter_cost.py --config config.qwen-openrouter.json fails similarly when fetching https://openrouter.ai/api/v1/models.
User noted they likely need a new instance for updated network allowlist settings to take effect.
- Re-test connectivity:
curl -i https://openrouter.ai
curl -sS https://openrouter.ai/api/v1/models | head
- If reachable, run:
python3 scripts/estimate_openrouter_cost.py --config config.qwen-openrouter.json
- Validate that the model IDs in config.qwen-openrouter.json are present in the OpenRouter catalog.
- If any IDs are unavailable, update the config to available equivalents and re-run the estimate.
- Report estimated full benchmark cost with assumptions.
- Ask user for API keys and run a small dry run command.
- Keep provider routing as OpenRouter for collect/judge unless user asks otherwise.
- Keep judge panel consistent with project conventions unless user asks to change.
- If the environment still blocks OpenRouter, provide the exact command output and stop; do not fabricate estimates.
- Commit any fixes and create PR message per task harness requirements.
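
The model ID validation step above can be sketched as a set-membership check against the fetched catalog. This sketch assumes the catalog JSON is an object with a data list whose entries each carry an id field, which should be verified against the real response once OpenRouter is reachable. The function name and the model IDs are hypothetical placeholders. Extracting the IDs from config.qwen-openrouter.json is left out because the config schema is not shown here.

```python
def split_by_availability(config_ids, catalog):
    """Split configured model IDs into (present, missing) against a catalog.

    Assumption: `catalog` looks like {"data": [{"id": "..."}, ...]}.
    Input order is preserved so the report matches the config order.
    """
    available_ids = {entry["id"] for entry in catalog.get("data", [])}
    present = [model_id for model_id in config_ids if model_id in available_ids]
    missing = [model_id for model_id in config_ids if model_id not in available_ids]
    return present, missing


# Placeholder IDs for illustration; these are not real catalog entries.
sample_catalog = {
    "data": [
        {"id": "example/model-small"},
        {"id": "example/model-large"},
    ]
}
configured = ["example/model-small", "example/model-medium", "example/model-large"]

present, missing = split_by_availability(configured, sample_catalog)
print("present:", present)
print("missing:", missing)
```

Any ID that lands in the missing list would need to be replaced in the config before the estimate is re-run. An empty or unreachable catalog should be treated as a blocker, not as "all models missing".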