Skip to content

Latest commit

 

History

History
286 lines (239 loc) · 8.45 KB

File metadata and controls

286 lines (239 loc) · 8.45 KB

Local Streaming Benchmark Protocol

Comprehensive MoE CPU Offloading Performance Analysis with Streaming

Based on H100 whitepaper methodology adapted for local hardware

Test Environment

Hardware:

  • CPU: AMD/Intel (to be documented) <<<<<<< HEAD <<<<<<< HEAD
  • RAM: 131GB available =======
  • RAM: 131GB available

main =======

  • RAM: 131GB available

main

  • GPU: NVIDIA (to be documented)
  • Storage: 45GB available for models
  • Platform: Windows with MSYS2

Software:

  • Shimmy: Branch feat/moe-cpu-offload
  • Temperature: 0.3 (verified to eliminate repetition)
  • Streaming: ENABLED (critical for usability)

Benchmark Test Categories

1. Memory Usage Analysis

Replicate H100 methodology for memory distribution:

Metrics:

  • GPU VRAM usage with --cpu-moe <<<<<<< HEAD <<<<<<< HEAD
  • CPU RAM usage =======
  • CPU RAM usage

main =======

  • CPU RAM usage

main

  • Model load time
  • Expert tensor distribution verification

Test Command:

./target/release/shimmy.exe serve --cpu-moe --bind 127.0.0.1:11435 --model-dirs ./models

2. Streaming Performance Benchmarks

Based on H100 whitepaper categories, adapted for streaming:

2.1 Basic Functionality Tests

Purpose: Verify streaming works with no repetition

Test Prompt Max Tokens Expected Outcome
Simple Response "Hello, how are you?" 50 Clean greeting, no repetition
Code Generation "Write a Python function to calculate factorial" 150 Correct code, proper formatting
Technical Explanation "Explain how binary search works" 200 Coherent explanation

2.2 Complex Reasoning Tasks

Purpose: Test model capabilities under CPU offloading

Test Prompt Max Tokens Success Criteria
Multi-step Problem "You have 3-gallon and 5-gallon jugs. Measure exactly 4 gallons step-by-step" 300 Logical steps, correct solution
System Design "Design a simple chat application architecture" 400 Coherent design, realistic components
Algorithm Analysis "Compare bubble sort and quicksort algorithms" 350 Accurate comparison, technical depth

<<<<<<< HEAD <<<<<<< HEAD

2.3 Long-form Generation Tests

=======

2.3 Long-form Generation Tests

main =======

2.3 Long-form Generation Tests

main Purpose: Stress test streaming with extended generation

Test Prompt Max Tokens Success Criteria
Creative Writing "Write a short story about AI discovering emotions" 800 Narrative structure, no repetition
Technical Documentation "Document a REST API for a library management system" 1000 Professional structure, complete examples
Research Analysis "Analyze the benefits and challenges of renewable energy" 600 Comprehensive coverage, logical flow

3. Performance Metrics Collection

For each test, collect:

3.1 Timing Metrics

<<<<<<< HEAD <<<<<<< HEAD

  • Total Generation Time: Start to [DONE] =======
  • Total Generation Time: Start to [DONE]

main =======

  • Total Generation Time: Start to [DONE]

main

  • First Token Latency: Request to first token
  • Average Tokens/Second: Total tokens ÷ generation time
  • Streaming Responsiveness: Subjective feel of real-time progress

<<<<<<< HEAD <<<<<<< HEAD

3.2 Quality Metrics

=======

3.2 Quality Metrics

main =======

3.2 Quality Metrics

main

  • Repetition Score: Using our validated algorithm
  • Completion Rate: Successfully completed vs timeout
  • Content Quality: Subjective assessment (1-5 scale)
  • Technical Accuracy: For code/technical content

3.3 Resource Metrics

  • Peak GPU Memory: During generation <<<<<<< HEAD <<<<<<< HEAD
  • Peak CPU Memory: During generation =======
  • Peak CPU Memory: During generation

main =======

  • Peak CPU Memory: During generation

main

  • CPU Utilization: Average during generation

Test Execution Framework

Per-Model Test Protocol

For each model (DeepSeek → GPT-OSS → Phi-3.5-MoE):

  1. Model Loading:

    • Clean start shimmy server with --cpu-moe
    • Record load time and memory distribution
    • Verify expert tensor CPU offloading
  2. Systematic Testing:

    • Execute all 9 benchmark tests in order
    • Allow 5-second pause between tests
    • Record all metrics for each test
  3. Quality Assessment:

    • Manual review of all generated content
    • Flag any repetition or quality issues
    • Document edge cases or failures
  4. Resource Monitoring:

    • Continuous memory monitoring during tests
    • Performance profiling for bottlenecks
    • Temperature validation throughout

White Paper Data Collection

For each model, document:

Architecture Specifications

  • Parameter count
  • Expert configuration (count, active per token)
  • Context length
  • Model file size

Memory Performance

  • Baseline GPU memory (estimated)
  • CPU offloaded GPU memory (measured)
  • VRAM savings percentage
  • Memory distribution breakdown

Streaming Performance

  • Average tokens/second across all tests
  • Range of performance (min/max)
  • First token latency average
  • Streaming responsiveness rating

Quality Validation

  • Repetition score across all tests
  • Content quality assessment
  • Technical accuracy rate
  • Completion success rate

Expected Outcomes

Based on H100 results, local hardware expectations:

Memory Savings (Should Match H100)

  • DeepSeek 16B: ~95-99% VRAM savings <<<<<<< HEAD <<<<<<< HEAD
  • GPT-OSS 20B: ~99% VRAM savings =======
  • GPT-OSS 20B: ~99% VRAM savings

main =======

  • GPT-OSS 20B: ~99% VRAM savings

main

  • Phi-3.5-MoE 41.9B: ~97% VRAM savings

Performance (Expected Lower Than H100)

  • H100 baseline: Not documented in whitepaper
  • Local expectation: 1-3 tokens/second based on initial testing
  • Streaming UX: Should feel responsive despite lower speed

Quality (Should Match H100)

  • Temperature 0.3: No repetition issues
  • Content quality: Maintained across all models
  • Technical accuracy: Preserved with CPU offloading

Success Criteria

Technical Success

  • ✅ All models load successfully with CPU offloading
  • ✅ Memory savings match H100 percentages (±5%)
  • ✅ No repetition issues with temperature 0.3
  • ✅ Streaming works smoothly for all test cases

<<<<<<< HEAD <<<<<<< HEAD

Performance Success

=======

Performance Success

main =======

Performance Success

main

  • ✅ Consistent generation speed (no significant degradation during long tests)
  • ✅ Reasonable completion times (<5 minutes for 1000 tokens)
  • ✅ Good streaming responsiveness (tokens appear steadily)

Quality Success

  • ✅ All generated content is coherent and relevant
  • ✅ Technical content (code, explanations) is accurate
  • ✅ No repetitive patterns or loops
  • ✅ Creative content maintains narrative structure

Documentation Output

Comprehensive Results Table

| Model | Parameters | VRAM Saved | Avg Tokens/Sec | Quality Score | Repetition Score |
|-------|------------|-------------|----------------|---------------|------------------|
| DeepSeek 16B | 16.38B | XX% | X.X | X/5 | X.XXX |
<<<<<<< HEAD
<<<<<<< HEAD
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |  
>>>>>>> main
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |  
>>>>>>> main
| Phi-3.5-MoE 41.9B | 41.9B | XX% | X.X | X/5 | X.XXX |

Detailed Analysis Report

  • Performance comparison across models <<<<<<< HEAD <<<<<<< HEAD
  • Hardware bottleneck identification =======
  • Hardware bottleneck identification

main =======

  • Hardware bottleneck identification

main

  • Streaming vs non-streaming UX analysis
  • Quality preservation validation
  • Production readiness assessment

<<<<<<< HEAD <<<<<<< HEAD This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.

This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.

main ======= This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release. main