A comprehensive Flask application for benchmarking Large Language Model APIs with native Claude integration, real-time dashboards, and extensible architecture.
- Latency Benchmarks: TTFT, tokens/second, p50/p95/p99 percentiles
- Quality Metrics: Coherence scoring, JSON validation, hallucination detection
- Load Testing: Concurrent requests, rate limit stress testing, circuit breakers
- Claude Integration: Native support for Sonnet, Opus, Haiku with model comparisons
- Real-time Dashboard: WebSocket updates with cyberpunk aesthetics
- Latency distribution histograms
- Throughput over time graphs
- Cost-per-1k-tokens comparisons
- Peak performance heatmaps
- Real-time benchmark progress
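The latency metrics listed above (p50/p95/p99 and tokens/second) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, the percentile uses the nearest-rank method, and tokens/second is assumed to exclude the time to first token (TTFT).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # ceil(pct * n / 100) is the 1-based rank; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_second(output_tokens, total_seconds, ttft_seconds):
    """Throughput after the first token arrived (assumption: TTFT is excluded)."""
    generation_time = total_seconds - ttft_seconds
    if generation_time <= 0:
        raise ValueError("total_seconds must exceed ttft_seconds")
    return output_tokens / generation_time


def summarize_latencies(latencies):
    """Return a p50/p95/p99 summary for per-request latencies in seconds."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
```

With only a handful of samples, p95 and p99 collapse to the maximum, which is one reason to run many iterations per model before comparing tail latencies.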
- Python 3.11+
- Docker & Docker Compose (optional)
- Redis (for background jobs)
# Clone and navigate to the project
cd modelbench
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
# Initialize database
flask db init
flask db migrate
flask db upgrade
# Run development server
flask run
# Start all services
docker-compose up -d
# Access the application
open http://localhost:5000
┌─────────────────────────────────────────────────────────────┐
│                         ModelBench                          │
├─────────────────────────────────────────────────────────────┤
│  Dashboard (Cyberpunk UI)     │  REST API                   │
│  - Real-time WebSockets       │  - Benchmark endpoints      │
│  - Chart.js visualizations    │  - Metrics comparison       │
│  - Export functionality       │  - Health checks            │
├─────────────────────────────────────────────────────────────┤
│                      Benchmark Engine                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Load Tester  │  │ LLM Clients  │  │ Analyzer     │       │
│  │ - Async pool │  │ - Claude     │  │ - Quality    │       │
│  │ - Circuit    │  │ - OpenAI     │  │ - Metrics    │       │
│  │   breaker    │  │ - Extensible │  │ - Costs      │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
├─────────────────────────────────────────────────────────────┤
│  Celery Workers  │  Redis  │  PostgreSQL/SQLite             │
└─────────────────────────────────────────────────────────────┘
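The "Extensible" entry under LLM Clients suggests a pluggable client interface. One way such a design might look is sketched here. Every name in it (`LLMClient`, `register_client`, `CLIENT_REGISTRY`, `EchoClient`, `get_client`) is hypothetical and not taken from the ModelBench codebase.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a provider name to its client class.
CLIENT_REGISTRY = {}


def register_client(provider):
    """Class decorator that records a client class under a provider name."""
    def decorator(cls):
        CLIENT_REGISTRY[provider] = cls
        return cls
    return decorator


class LLMClient(ABC):
    """Interface a benchmark engine could call without knowing the provider."""

    def __init__(self, model):
        self.model = model

    @abstractmethod
    def complete(self, prompt):
        """Return the model's text completion for a prompt."""


@register_client("echo")
class EchoClient(LLMClient):
    """Stand-in provider for local testing; makes no network calls."""

    def complete(self, prompt):
        return f"[{self.model}] {prompt}"


def get_client(provider, model):
    """Look up a registered provider and build a client for the given model."""
    cls = CLIENT_REGISTRY.get(provider)
    if cls is None:
        raise ValueError(f"unknown provider: {provider}")
    return cls(model)
```

Under this design, adding a provider means writing one subclass and decorating it; the engine only ever calls `get_client(...).complete(...)`.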
| Endpoint | Method | Description |
|---|---|---|
| `/api/benchmark/run` | POST | Start new benchmark |
| `/api/benchmark/status/<id>` | GET | Get benchmark progress |
| `/api/metrics/compare` | GET | Compare multiple models |
| `/api/claude/analyze` | POST | Meta-evaluation with Claude |
| `/api/health` | GET | System health check |
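As a rough illustration of calling the endpoints above from Python, the sketch below builds a `POST /api/benchmark/run` request using only the standard library. It assumes the endpoint accepts a JSON body shaped like the benchmark configurations in this README and returns JSON; neither is confirmed here, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address used in the setup steps


def build_run_request(config, base_url=BASE_URL):
    """Build (but do not send) a POST to /api/benchmark/run with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/benchmark/run",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_benchmark(config, base_url=BASE_URL):
    """Send the request and return the decoded JSON response (shape is an assumption)."""
    request = build_run_request(config, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.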
Environment variables in .env:
# API Keys
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
# Database
DATABASE_URL=sqlite:///modelbench.db
# DATABASE_URL=postgresql://user:pass@localhost/modelbench
# Redis
REDIS_URL=redis://localhost:6379/0
# Flask
SECRET_KEY=your-secret-key
FLASK_ENV=development
{
  "name": "RAG Context Retrieval",
  "models": ["claude-3-5-sonnet-20241022", "gpt-4"],
  "prompts": [
    {
      "text": "Based on the following documents...",
      "context_length": 4000
    }
  ],
  "metrics": ["ttft", "tokens_per_second", "accuracy"]
}
{
  "name": "Python Function Generation",
  "models": ["claude-3-opus-20240229", "claude-3-5-sonnet-20241022"],
  "prompts": "examples/code_prompts.txt",
  "expected_schema": "json",
  "iterations": 10
}
For handling 1000+ concurrent connections:
- Gunicorn with Gevent Workers: `gunicorn -k gevent -w 4 --worker-connections 1000 "app:create_app()"`
- Database Connection Pooling
  - SQLAlchemy pool_size=20, max_overflow=40
  - Connection recycling for long-running benchmarks
- Redis Caching Layer
  - Cache benchmark configurations
  - Store intermediate results
  - Rate limiting state
- Async Architecture
  - Flask[async] for concurrent API calls
  - Celery for background processing
  - WebSocket for real-time updates
- Circuit Breaker Pattern
  - Prevents cascade failures
  - Automatic recovery detection
  - Configurable thresholds
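The circuit breaker pattern named above can be sketched in a few lines. This is a simplified, single-threaded illustration under assumed defaults (`failure_threshold=5`, `recovery_timeout=30.0`); the class and parameter names are hypothetical and do not describe the project's actual implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each provider call as `breaker.call(client.complete, prompt)` would let a load test fail fast while an upstream API is down instead of queueing more doomed requests. A production version would also need locking for concurrent workers.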
# Run all tests
pytest
# Run with coverage
pytest --cov=app --cov-report=html
# Run specific test file
pytest tests/test_benchmark.py -v
# Run benchmark from command line
flask benchmark run --config examples/rag_benchmark.json
# Export results
flask benchmark export --id 123 --format csv
# Compare models
flask benchmark compare --models claude-3-opus,gpt-4
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests
- Submit a pull request
MIT License - see LICENSE file for details.
- Built with Flask, SQLAlchemy, and Celery
- Visualization powered by Chart.js
- UI inspired by cyberpunk aesthetics