ModelBench: LLM Performance Benchmarking Suite

A comprehensive Flask application for benchmarking Large Language Model APIs with native Claude integration, real-time dashboards, and extensible architecture.

Features

Core Capabilities

Latency Benchmarks: time to first token (TTFT), tokens/second, p50/p95/p99 percentiles
Quality Metrics: Coherence scoring, JSON validation, hallucination detection
Load Testing: Concurrent requests, rate limit stress testing, circuit breakers
Claude Integration: Native support for Sonnet, Opus, Haiku with model comparisons
Real-time Dashboard: WebSocket updates with cyberpunk aesthetics
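As a rough illustration of the latency metrics listed above, the sketch below derives TTFT, tokens/second, and nearest-rank percentiles from per-request timestamps. The `StreamTiming` class, its field names, and the choice of the nearest-rank method are assumptions made for this example; they are not taken from the ModelBench code.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps (in seconds) for one streamed completion. Hypothetical structure."""
    request_sent: float
    first_token: float
    last_token: float
    output_tokens: int


def ttft(timing: StreamTiming) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return timing.first_token - timing.request_sent


def tokens_per_second(timing: StreamTiming) -> float:
    """Generation rate over the streaming window (one common convention)."""
    duration = timing.last_token - timing.first_token
    return timing.output_tokens / duration if duration > 0 else 0.0


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Return the p50/p95/p99 summary used in latency reports."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so results are only comparable when the same method is used throughout.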

Dashboard Visualizations

Latency distribution histograms
Throughput over time graphs
Cost-per-1k-tokens comparisons
Peak performance heatmaps
Real-time benchmark progress
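A latency distribution histogram needs its samples grouped into bins before it can be charted. The sketch below shows one simple way to do that with fixed-width bins; the function name and the 100 ms default width are illustrative assumptions, not part of the project.

```python
def latency_histogram(latencies_ms, bin_width_ms=100.0):
    """Group latencies into fixed-width bins.

    Returns a dict mapping each bin's lower edge (ms) to the number of
    samples in [edge, edge + bin_width_ms), ordered by edge.
    """
    if bin_width_ms <= 0:
        raise ValueError("bin_width_ms must be positive")
    counts = {}
    for value in latencies_ms:
        if value < 0:
            raise ValueError("latencies must be non-negative")
        lower_edge = int(value // bin_width_ms) * bin_width_ms
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))
```

The returned keys and values can be passed to a bar chart as labels and data respectively.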

Quick Start

Prerequisites

Python 3.11+
Docker & Docker Compose (optional)
Redis (for background jobs)

Installation

# Clone the repository, then navigate to the project
cd modelbench

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

# Initialize database
flask db init
flask db migrate
flask db upgrade

# Run development server
flask run

Docker Setup

# Start all services
docker-compose up -d

# Access the application (on Linux, use xdg-open or visit the URL in a browser)
open http://localhost:5000

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        ModelBench                            │
├─────────────────────────────────────────────────────────────┤
│  Dashboard (Cyberpunk UI)  │  REST API                      │
│  - Real-time WebSockets    │  - Benchmark endpoints         │
│  - Chart.js visualizations │  - Metrics comparison          │
│  - Export functionality    │  - Health checks               │
├─────────────────────────────────────────────────────────────┤
│                    Benchmark Engine                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Load Tester  │  │ LLM Clients  │  │  Analyzer    │       │
│  │ - Async pool │  │ - Claude     │  │ - Quality    │       │
│  │ - Circuit    │  │ - OpenAI     │  │ - Metrics    │       │
│  │   breaker    │  │ - Extensible │  │ - Costs      │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
├─────────────────────────────────────────────────────────────┤
│  Celery Workers  │  Redis  │  PostgreSQL/SQLite             │
└─────────────────────────────────────────────────────────────┘
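The "Async pool" in the Load Tester box can be pictured as a bounded set of concurrent calls. The following is a minimal sketch of that idea using `asyncio.Semaphore`; `run_load_test` and `fake_model` are hypothetical names, and the real engine may be structured quite differently.

```python
import asyncio
import time


async def run_load_test(call_model, prompts, max_concurrency=8):
    """Call `call_model` once per prompt with at most `max_concurrency` in flight.

    Returns one result dict per prompt, in the same order as `prompts`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def timed_call(prompt):
        async with semaphore:
            start = time.perf_counter()
            try:
                await call_model(prompt)
                ok = True
            except Exception:
                # Record the failure instead of aborting the whole run.
                ok = False
            return {
                "prompt": prompt,
                "ok": ok,
                "latency_s": time.perf_counter() - start,
            }

    return await asyncio.gather(*(timed_call(p) for p in prompts))


async def fake_model(prompt):
    """Stand-in for a real LLM client; sleeps briefly instead of calling an API."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

A run could then be started with `asyncio.run(run_load_test(fake_model, ["a", "b", "c"], max_concurrency=2))`.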

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/benchmark/run` | POST | Start new benchmark |
| `/api/benchmark/status/<id>` | GET | Get benchmark progress |
| `/api/metrics/compare` | GET | Compare multiple models |
| `/api/claude/analyze` | POST | Meta-evaluation with Claude |
| `/api/health` | GET | System health check |
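For scripting against these endpoints, a client might look like the sketch below. It only builds the request objects, using the standard library. The base URL comes from the Docker Setup section; the assumption that `/api/benchmark/run` accepts a JSON body shaped like the configs under Example Benchmarks is not confirmed by this README, and the response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default address from the Docker Setup section


def build_run_request(config, base_url=BASE_URL):
    """Build a POST request for /api/benchmark/run (JSON body is an assumption)."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/benchmark/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(benchmark_id, base_url=BASE_URL):
    """Return the URL for polling a benchmark's progress."""
    return f"{base_url}/api/benchmark/status/{benchmark_id}"
```

The request can be sent with `urllib.request.urlopen(build_run_request(config))` once the server is running.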

Configuration

Environment variables in .env:

# API Keys
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

# Database
DATABASE_URL=sqlite:///modelbench.db
# DATABASE_URL=postgresql://user:pass@localhost/modelbench

# Redis
REDIS_URL=redis://localhost:6379/0

# Flask
SECRET_KEY=your-secret-key
FLASK_ENV=development
# Note: Flask 2.3+ no longer reads FLASK_ENV; set FLASK_DEBUG=1 to enable debug mode there
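One way an application can consume these variables is to read them into a typed settings object at startup. The sketch below assumes the variable names shown above; the `Settings` class, `load_settings` function, and the decision to fail when `SECRET_KEY` is missing are illustrative choices, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""
    anthropic_api_key: Optional[str]
    openai_api_key: Optional[str]
    database_url: str
    redis_url: str
    secret_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        raise RuntimeError("SECRET_KEY must be set")
    return Settings(
        anthropic_api_key=env.get("ANTHROPIC_API_KEY") or None,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///modelbench.db"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        secret_key=secret_key,
    )
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.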

Example Benchmarks

RAG Performance Test

{
  "name": "RAG Context Retrieval",
  "models": ["claude-3-5-sonnet-20241022", "gpt-4"],
  "prompts": [
    {
      "text": "Based on the following documents...",
      "context_length": 4000
    }
  ],
  "metrics": ["ttft", "tokens_per_second", "accuracy"]
}

Code Generation Benchmark

{
  "name": "Python Function Generation",
  "models": ["claude-3-opus-20240229", "claude-3-5-sonnet-20241022"],
  "prompts": "examples/code_prompts.txt",
  "expected_schema": "json",
  "iterations": 10
}
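Before submitting a config like the two above, it can help to check its shape. The validator below is a sketch inferred only from these examples: it assumes `name` and `models` are required, that `prompts` is either a list of objects with a `text` field or a path string, and that `iterations` defaults to 1. The actual schema enforced by ModelBench may differ.

```python
def validate_benchmark_config(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []

    name = config.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")

    models = config.get("models")
    if (
        not isinstance(models, list)
        or not models
        or not all(isinstance(m, str) for m in models)
    ):
        errors.append("models must be a non-empty list of strings")

    prompts = config.get("prompts")
    if isinstance(prompts, str) and prompts:
        pass  # treated as a path to a prompt file, as in the second example
    elif isinstance(prompts, list) and prompts:
        for index, prompt in enumerate(prompts):
            if not isinstance(prompt, dict) or not isinstance(prompt.get("text"), str):
                errors.append(f"prompts[{index}] must be an object with a 'text' string")
    else:
        errors.append("prompts must be a file path or a non-empty list")

    iterations = config.get("iterations", 1)
    if isinstance(iterations, bool) or not isinstance(iterations, int) or iterations < 1:
        errors.append("iterations must be a positive integer")

    return errors
```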

Performance Optimizations

For handling 1000+ concurrent connections:

Gunicorn with Gevent Workers

gunicorn -k gevent -w 4 --worker-connections 1000 "app:create_app()"

Database Connection Pooling

- SQLAlchemy pool_size=20, max_overflow=40
- Connection recycling for long-running benchmarks

Redis Caching Layer

- Cache benchmark configurations
- Store intermediate results
- Rate limiting state

Async Architecture

- Flask[async] for concurrent API calls
- Celery for background processing
- WebSocket for real-time updates

Circuit Breaker Pattern

- Prevents cascade failures
- Automatic recovery detection
- Configurable thresholds
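To make the circuit breaker item concrete, here is a minimal, single-threaded sketch of the pattern: after a configurable number of consecutive failures the breaker opens and rejects calls, then allows a trial call once a recovery timeout has passed. The class name, parameter names, and default thresholds are assumptions for illustration and are not taken from the ModelBench source.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half_open"  # recovery window elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; skipping call")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit immediately; otherwise
            # the circuit opens once the failure threshold is reached.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each provider call as `breaker.call(client_fn, prompt)` lets a failing API be skipped quickly instead of tying up workers. A production version would also need locking for use across threads.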

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test file
pytest tests/test_benchmark.py -v

CLI Commands

# Run benchmark from command line
flask benchmark run --config examples/rag_benchmark.json

# Export results
flask benchmark export --id 123 --format csv

# Compare models
flask benchmark compare --models claude-3-opus,gpt-4

Contributing

Fork the repository
Create a feature branch
Make your changes
Run tests
Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with Flask, SQLAlchemy, and Celery
Visualization powered by Chart.js
UI inspired by cyberpunk aesthetics

