eyoussef/modelbench-qwen35-local

ModelBench: LLM Performance Benchmarking Suite

A comprehensive Flask application for benchmarking Large Language Model APIs with native Claude integration, real-time dashboards, and extensible architecture.

Features

Core Capabilities

  • Latency Benchmarks: time to first token (TTFT), tokens/second, p50/p95/p99 percentiles
  • Quality Metrics: Coherence scoring, JSON validation, hallucination detection
  • Load Testing: Concurrent requests, rate limit stress testing, circuit breakers
  • Claude Integration: Native support for Sonnet, Opus, Haiku with model comparisons
  • Real-time Dashboard: WebSocket updates with cyberpunk aesthetics
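
The latency metrics listed above can be illustrated with a small sketch. This is not ModelBench's own implementation: the function names `percentile` and `summarize_request` are hypothetical, and the nearest-rank percentile definition is an assumption (other definitions interpolate between samples).

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_request(start: float, token_times: Sequence[float]) -> dict:
    """TTFT and streaming throughput for one response.

    start is the time the request was sent; token_times holds the
    arrival time of each streamed token, in the same clock units.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - start
    decode_time = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent streaming them.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft": ttft, "tokens_per_second": tps}
```

Collecting `ttft` values across many requests and passing them to `percentile(values, 95)` would give a p95 figure under these assumptions.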

Dashboard Visualizations

  • Latency distribution histograms
  • Throughput over time graphs
  • Cost-per-1k-tokens comparisons
  • Peak performance heatmaps
  • Real-time benchmark progress

Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose (optional)
  • Redis (for background jobs)

Installation

# Clone the repository first, then enter the project directory
cd modelbench

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

# Initialize database
flask db init
flask db migrate
flask db upgrade

# Run development server
flask run

Docker Setup

# Start all services
docker-compose up -d

# Access the application (the open command is macOS-specific; elsewhere, visit the URL in a browser)
open http://localhost:5000

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        ModelBench                            │
├─────────────────────────────────────────────────────────────┤
│  Dashboard (Cyberpunk UI)  │  REST API                      │
│  - Real-time WebSockets    │  - Benchmark endpoints         │
│  - Chart.js visualizations │  - Metrics comparison          │
│  - Export functionality    │  - Health checks               │
├─────────────────────────────────────────────────────────────┤
│                    Benchmark Engine                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Load Tester  │  │ LLM Clients  │  │  Analyzer    │       │
│  │ - Async pool │  │ - Claude     │  │ - Quality    │       │
│  │ - Circuit    │  │ - OpenAI     │  │ - Metrics    │       │
│  │   breaker    │  │ - Extensible │  │ - Costs      │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
├─────────────────────────────────────────────────────────────┤
│  Celery Workers  │  Redis  │  PostgreSQL/SQLite             │
└─────────────────────────────────────────────────────────────┘
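
The "Async pool" box in the diagram can be pictured as a bounded set of in-flight requests. The sketch below shows one common way to do that with an `asyncio.Semaphore`; `run_pool` and `fake_llm_call` are hypothetical names, and the real Load Tester may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_pool(
    call: Callable[[int], Awaitable[float]],
    total_requests: int,
    max_concurrency: int,
) -> List[float]:
    """Run total_requests calls with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(i: int) -> float:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await call(i)

    return await asyncio.gather(*(guarded(i) for i in range(total_requests)))


async def fake_llm_call(i: int) -> float:
    """Stand-in for a real API call: sleeps briefly and reports that latency."""
    latency = 0.01
    await asyncio.sleep(latency)
    return latency


if __name__ == "__main__":
    latencies = asyncio.run(
        run_pool(fake_llm_call, total_requests=20, max_concurrency=5)
    )
    print(len(latencies), "requests completed")
```

Swapping `fake_llm_call` for a coroutine that calls a real client would turn this into a basic load generator; results come back in request order because `asyncio.gather` preserves it.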

API Endpoints

Endpoint                     Method  Description
/api/benchmark/run           POST    Start new benchmark
/api/benchmark/status/<id>   GET     Get benchmark progress
/api/metrics/compare         GET     Compare multiple models
/api/claude/analyze          POST    Meta-evaluation with Claude
/api/health                  GET     System health check
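
As a rough illustration of starting a benchmark over HTTP, the sketch below builds a POST request for `/api/benchmark/run` using only the standard library. The base URL comes from the Docker Setup section; that the endpoint accepts a JSON body shaped like the example benchmark configs is an assumption, and `build_run_request` is a hypothetical helper. The response format is not documented here, so the sketch stops short of parsing it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address used in the Docker Setup section


def build_run_request(config: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a benchmark config."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/benchmark/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed payload shape, borrowed from the example benchmark configs.
    config = {
        "name": "RAG Context Retrieval",
        "models": ["claude-3-5-sonnet-20241022", "gpt-4"],
        "metrics": ["ttft", "tokens_per_second"],
    }
    request = build_run_request(config)
    # Requires a running ModelBench server.
    with urllib.request.urlopen(request) as response:
        print(response.status, response.read().decode("utf-8"))
```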

Configuration

Environment variables in .env:

# API Keys
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

# Database
DATABASE_URL=sqlite:///modelbench.db
# DATABASE_URL=postgresql://user:pass@localhost/modelbench

# Redis
REDIS_URL=redis://localhost:6379/0

# Flask
SECRET_KEY=your-secret-key
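# FLASK_ENV is deprecated since Flask 2.2 and no longer read by Flask 2.3+; use FLASK_DEBUG=1 there
FLASK_ENV=development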
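
One way an application might read these variables is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical and not taken from the ModelBench source; the fallback values simply mirror the sample `.env` above, and requiring `SECRET_KEY` is an assumption about sensible behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: Optional[str]
    openai_api_key: Optional[str]
    database_url: str
    redis_url: str
    secret_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping of environment variables."""
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        # Assumption: refuse to start without an explicit secret.
        raise RuntimeError("SECRET_KEY must be set")
    return Settings(
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        # Defaults mirror the sample .env values.
        database_url=env.get("DATABASE_URL", "sqlite:///modelbench.db"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        secret_key=secret_key,
    )
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary.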

Example Benchmarks

RAG Performance Test

{
  "name": "RAG Context Retrieval",
  "models": ["claude-3-5-sonnet-20241022", "gpt-4"],
  "prompts": [
    {
      "text": "Based on the following documents...",
      "context_length": 4000
    }
  ],
  "metrics": ["ttft", "tokens_per_second", "accuracy"]
}

Code Generation Benchmark

{
  "name": "Python Function Generation",
  "models": ["claude-3-opus-20240229", "claude-3-5-sonnet-20241022"],
  "prompts": "examples/code_prompts.txt",
  "expected_schema": "json",
  "iterations": 10
}
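
If `"expected_schema": "json"` means that each model response must parse as JSON (an interpretation, not something this README states), the check could be as simple as the sketch below. `is_valid_json` and `json_pass_rate` are hypothetical names.

```python
import json
from typing import Iterable


def is_valid_json(text: str) -> bool:
    """Return True if text parses as a JSON document."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def json_pass_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that are valid JSON."""
    results = [is_valid_json(output) for output in outputs]
    if not results:
        raise ValueError("no outputs to score")
    return sum(results) / len(results)
```

With `"iterations": 10`, the pass rate would be computed over the ten responses collected for each prompt and model.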

Performance Optimizations

For handling 1000+ concurrent connections:

  1. Gunicorn with Gevent Workers

    gunicorn -k gevent -w 4 --worker-connections 1000 "app:create_app()"

  2. Database Connection Pooling

    • SQLAlchemy pool_size=20, max_overflow=40
    • Connection recycling for long-running benchmarks

  3. Redis Caching Layer

    • Cache benchmark configurations
    • Store intermediate results
    • Rate limiting state

  4. Async Architecture

    • Flask[async] for concurrent API calls
    • Celery for background processing
    • WebSocket for real-time updates

  5. Circuit Breaker Pattern

    • Prevents cascade failures
    • Automatic recovery detection
    • Configurable thresholds
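
The circuit breaker item can be made concrete with a minimal sketch. This is a generic version of the pattern, not ModelBench's implementation: the class name, the default thresholds, and the single-trial half-open behavior are all assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each provider call as `breaker.call(client_fn, prompt)` would stop a failing API from being hammered during a load test; the injectable `clock` exists only to make the timeout testable.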

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test file
pytest tests/test_benchmark.py -v

CLI Commands

# Run benchmark from command line
flask benchmark run --config examples/rag_benchmark.json

# Export results
flask benchmark export --id 123 --format csv

# Compare models
flask benchmark compare --models claude-3-opus,gpt-4

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Flask, SQLAlchemy, and Celery
  • Visualization powered by Chart.js
  • UI inspired by cyberpunk aesthetics
