| 1 | +# GC Benchmarking |
| 2 | + |
| 3 | +LLM inference benchmarking for OpenAI-compatible providers. The tool runs the |
| 4 | +same logical model across every enabled provider that has a configured model ID, |
| 5 | +then reports time to first token, end-to-end latency, output throughput, token |
| 6 | +counts, retry attempts, and error rate. |
| 7 | + |
| 8 | +The default configuration compares General Compute, served through SambaNova's |
| 9 | +cloud API, with OpenRouter for the same model families. Add or disable providers |
| 10 | +in `config/config.yaml` without changing code. |
| 11 | + |
| 12 | +## Features |
| 13 | + |
| 14 | +- Same-model provider comparisons |
| 15 | +- Provider interleaving within each iteration to reduce time-window bias |
| 16 | +- Warm-up requests that are discarded from metrics |
| 17 | +- Prompt variation to reduce provider-side cache effects |
| 18 | +- Streaming TTFT measurement, including reasoning-token streams |
| 19 | +- Incremental raw CSV writes so interrupted runs keep completed samples |
| 20 | +- CSV, HTML, and static-site JSON report generation |
| 21 | + |
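
As an illustration of the interleaving and warm-up features listed above, here is a minimal sketch of how a request schedule could be built. The function name `interleaved_schedule`, the per-iteration shuffle, and the negative iteration index for warm-up rows are assumptions made for this example, not the tool's actual implementation.

```python
import random
from typing import Iterator, List, Tuple


def interleaved_schedule(
    providers: List[str], iterations: int, warmup: int = 1, seed: int = 0
) -> Iterator[Tuple[int, str, bool]]:
    """Yield (iteration, provider, is_warmup) so every provider runs once per iteration."""
    rng = random.Random(seed)
    for step in range(warmup + iterations):
        order = list(providers)
        # Assumption: shuffle per iteration so no provider always goes first.
        rng.shuffle(order)
        for provider in order:
            # Warm-up steps get negative iteration numbers (hypothetical convention).
            yield step - warmup, provider, step < warmup


schedule = list(interleaved_schedule(["general_compute", "openrouter"], iterations=3))
# Warm-up rows are dropped before any metric is computed.
measured = [(i, p) for i, p, is_warmup in schedule if not is_warmup]
```

Because each iteration visits every provider before moving on, a slow period on the network or at one provider affects neighbouring samples from all providers rather than one provider's whole run.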
| 22 | +## Installation |
| 23 | + |
| 24 | +Use Python 3.10 or newer. |
| 25 | + |
| 26 | +```bash |
| 27 | +python3 -m venv venv |
| 28 | +source venv/bin/activate |
| 29 | +pip install -e ".[dev]" |
| 30 | +``` |
| 31 | + |
| 32 | +Or run the local setup script: |
| 33 | + |
| 34 | +```bash |
| 35 | +./setup.sh |
| 36 | +``` |
| 37 | + |
| 38 | +## Configuration |
| 39 | + |
| 40 | +Create a local `.env` from the example and add provider keys: |
| 41 | + |
| 42 | +```bash |
| 43 | +cp .env.example .env |
| 44 | +``` |
| 45 | + |
| 46 | +Required by the default config: |
| 47 | + |
| 48 | +```bash |
| 49 | +SAMBANOVA_API_KEY=your_sambanova_api_key_here |
| 50 | +OPENROUTER_API_KEY=your_openrouter_api_key_here |
| 51 | +``` |
| 52 | + |
| 53 | +`.env` and benchmark outputs are intentionally ignored by Git. Do not commit |
| 54 | +real API keys or generated result files. |
| 55 | + |
| 56 | +By default, the CLI loads `config/config.yaml` from the current working |
| 57 | +directory when present. Otherwise, it falls back to the packaged default config. |
| 58 | +Set `CONFIG_FILE=/path/to/config.yaml` to use an explicit file. |
| 59 | + |
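
The lookup order described above can be sketched as follows. This is an illustration only: `resolve_config_path` and its parameters are hypothetical names, and the sketch assumes that `CONFIG_FILE` takes precedence over a `config/config.yaml` in the working directory.

```python
from pathlib import Path
from typing import Mapping


def resolve_config_path(
    env: Mapping[str, str], cwd: Path, packaged_default: Path
) -> Path:
    """Pick the config file: explicit env var, then local file, then packaged default."""
    explicit = env.get("CONFIG_FILE")
    if explicit:
        # Assumption: an explicit path wins over everything else.
        return Path(explicit)
    local = cwd / "config" / "config.yaml"
    if local.is_file():
        return local
    return packaged_default


# Hypothetical paths, used only to show the call shape.
chosen = resolve_config_path(
    {"CONFIG_FILE": "/path/to/config.yaml"},
    cwd=Path("."),
    packaged_default=Path("packaged/config.yaml"),
)
```

In real use, `env` would be `os.environ` and `cwd` would be `Path.cwd()`; passing them as arguments keeps the sketch easy to test.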
| 60 | +## Usage |
| 61 | + |
| 62 | +List configured providers, models, and workloads: |
| 63 | + |
| 64 | +```bash |
| 65 | +benchmark providers |
| 66 | +benchmark models |
| 67 | +benchmark workloads |
| 68 | +``` |
| 69 | + |
| 70 | +Run a quick connectivity test: |
| 71 | + |
| 72 | +```bash |
| 73 | +benchmark test --provider general_compute --model gpt-oss-120b --workload ctx_256 --iterations 1 |
| 74 | +``` |
| 75 | + |
| 76 | +Run a benchmark: |
| 77 | + |
| 78 | +```bash |
| 79 | +benchmark run --providers general_compute,openrouter --models gpt-oss-120b --workloads ctx_256,ctx_1k --iterations 5 |
| 80 | +``` |
| 81 | + |
| 82 | +Run all enabled providers, models, and workloads: |
| 83 | + |
| 84 | +```bash |
| 85 | +benchmark run --iterations 50 |
| 86 | +``` |
| 87 | + |
| 88 | +Regenerate reports for an existing session: |
| 89 | + |
| 90 | +```bash |
| 91 | +benchmark report <session-id> |
| 92 | +``` |
| 93 | + |
| 94 | +List local sessions: |
| 95 | + |
| 96 | +```bash |
| 97 | +benchmark list-sessions |
| 98 | +``` |
| 99 | + |
| 100 | +## Workloads |
| 101 | + |
| 102 | +The default workloads are context-size sweeps: |
| 103 | + |
| 104 | +- `ctx_256`: 256 input tokens |
| 105 | +- `ctx_1k`: 1,024 input tokens |
| 106 | +- `ctx_4k`: 4,096 input tokens |
| 107 | +- `ctx_16k`: 16,384 input tokens |
| 108 | +- `ctx_64k`: 65,536 input tokens |
| 109 | +- `ctx_128k`: 131,072 input tokens |
| 110 | + |
| 111 | +Token counts are approximate because prompts are generated with the `tiktoken` |
| 112 | +`cl100k_base` encoding, not each model provider's native tokenizer. |
| 113 | + |
| 114 | +## Outputs |
| 115 | + |
| 116 | +Results are written under `results/`: |
| 117 | + |
| 118 | +- `session_<id>_raw.csv`: one row per request |
| 119 | +- `session_<id>_summary.csv`: aggregate statistics by model, provider, and workload |
| 120 | +- `session_<id>_report.html`: general HTML charts and tables |
| 121 | +- `session_<id>_provider_performance.html`: provider performance charts |
| 122 | + |
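
To show how the summary file relates to the raw file, the sketch below groups per-request rows by model, provider, and workload and computes a few aggregates. The column names (`model`, `provider`, `workload`, `ttft_s`, `error`) and the chosen statistics are assumptions for illustration; check the header of your own `session_<id>_raw.csv` for the real schema.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]
Key = Tuple[str, str, str]


def summarize(rows: List[Row]) -> Dict[Key, Dict[str, object]]:
    """Aggregate raw request rows by (model, provider, workload)."""
    groups: Dict[Key, List[Row]] = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["provider"], row["workload"])].append(row)

    summary: Dict[Key, Dict[str, object]] = {}
    for key, group in groups.items():
        # Only successful requests contribute latency samples.
        ok = [float(r["ttft_s"]) for r in group if r["error"] == ""]
        summary[key] = {
            "requests": len(group),
            "error_rate": (len(group) - len(ok)) / len(group),
            "ttft_median_s": statistics.median(ok) if ok else None,
        }
    return summary


def make_row(ttft_s: str, error: str = "") -> Row:
    """Build one hypothetical raw row for the example."""
    return {"model": "gpt-oss-120b", "provider": "general_compute",
            "workload": "ctx_256", "ttft_s": ttft_s, "error": error}


rows = [make_row("0.25"), make_row("0.5"), make_row("0.75"), make_row("", "timeout")]
result = summarize(rows)[("gpt-oss-120b", "general_compute", "ctx_256")]
```

The same function works on rows read with `csv.DictReader`, since that also yields one string-valued dictionary per request.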
| 123 | +HTML reports load Plotly from the public CDN. Use the CSV outputs if you need a |
| 124 | +fully offline artifact. |
| 125 | + |
| 126 | +## Static Site Export |
| 127 | + |
| 128 | +Export a completed session as pre-aggregated JSON for a static site: |
| 129 | + |
| 130 | +```bash |
| 131 | +benchmark publish <session-id> --site-path ../my-site --label "June benchmark" |
| 132 | +``` |
| 133 | + |
| 134 | +This writes files under `../my-site/public/benchmarks/`: |
| 135 | + |
| 136 | +- `manifest.json` |
| 137 | +- `<session-id>.json` |
| 138 | +- `<session-id>_raw.csv` unless `--no-copy-raw` is passed |
| 139 | + |
| 140 | +Remove a published session: |
| 141 | + |
| 142 | +```bash |
| 143 | +benchmark unpublish <session-id> --site-path ../my-site |
| 144 | +``` |
| 145 | + |
| 146 | +## Methodology Notes |
| 147 | + |
| 148 | +Comparisons are meaningful only within the same logical model. OpenRouter is an |
| 149 | +aggregator, so its latency can include routing overhead and can vary by selected |
| 150 | +backend. Review provider routing settings in `config/config.yaml` before |
| 151 | +publishing benchmark claims. |
| 152 | + |
| 153 | +The tool measures output throughput after TTFT, so decode speed is separated |
| 154 | +from queueing and prompt-processing overhead. Retries are limited to transient |
| 155 | +errors; failed attempts and backoff sleeps do not inflate successful-attempt |
| 156 | +latency metrics. |
| 157 | + |
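
The separation between TTFT and decode speed can be made concrete with a small worked example. The `StreamTiming` class and its field names are hypothetical, and dividing by the tokens that arrive after the first one (`output_tokens - 1`) is an assumption; the tool may count tokens differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTiming:
    """Timestamps in seconds for one successful streamed attempt (hypothetical shape)."""
    request_start: float
    first_token_at: float
    last_token_at: float
    output_tokens: int


def ttft(t: StreamTiming) -> float:
    """Time to first token: queueing plus prompt processing."""
    return t.first_token_at - t.request_start


def e2e_latency(t: StreamTiming) -> float:
    """End-to-end latency of the successful attempt only."""
    return t.last_token_at - t.request_start


def decode_throughput(t: StreamTiming) -> Optional[float]:
    """Tokens per second over the interval after the first token."""
    decode_time = t.last_token_at - t.first_token_at
    if decode_time <= 0 or t.output_tokens <= 1:
        return None  # not enough data to estimate decode speed
    return (t.output_tokens - 1) / decode_time


sample = StreamTiming(request_start=0.0, first_token_at=0.5,
                      last_token_at=2.5, output_tokens=101)
```

For this sample, TTFT is 0.5 s, end-to-end latency is 2.5 s, and decode throughput is 100 tokens over 2.0 s, or 50 tokens per second. Dividing all 101 tokens by the full 2.5 s would instead give about 40 tokens per second, which mixes prompt-processing time into the decode figure.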
| 158 | +## Development |
| 159 | + |
| 160 | +```bash |
| 161 | +pytest |
| 162 | +ruff check src tests |
| 163 | +mypy src |
| 164 | +``` |
| 165 | + |
| 166 | +Format code: |
| 167 | + |
| 168 | +```bash |
| 169 | +black src tests |
| 170 | +``` |
| 171 | + |
| 172 | +## Security |
| 173 | + |
| 174 | +Please do not open public issues with secrets, API keys, private benchmark data, |
| 175 | +or unpublished provider credentials. See `SECURITY.md` for reporting guidance. |
| 176 | + |
| 177 | +## License |
| 178 | + |
| 179 | +MIT. See `LICENSE`. |