Local Streaming Benchmark Protocol

Comprehensive MoE CPU Offloading Performance Analysis with Streaming

Based on H100 whitepaper methodology adapted for local hardware

Test Environment

Hardware:

CPU: AMD/Intel (to be documented) <<<<<<< HEAD <<<<<<< HEAD
RAM: 131GB available =======
RAM: 131GB available

main =======

RAM: 131GB available

main

GPU: NVIDIA (to be documented)
Storage: 45GB available for models
Platform: Windows with MSYS2

Software:

Shimmy: Branch feat/moe-cpu-offload
Temperature: 0.3 (verified to eliminate repetition)
Streaming: ENABLED (critical for usability)

Benchmark Test Categories

1. Memory Usage Analysis

Replicate H100 methodology for memory distribution:

Metrics:

GPU VRAM usage with --cpu-moe <<<<<<< HEAD <<<<<<< HEAD
CPU RAM usage =======
CPU RAM usage

main =======

CPU RAM usage

main

Model load time
Expert tensor distribution verification

Test Command:

./target/release/shimmy.exe serve --cpu-moe --bind 127.0.0.1:11435 --model-dirs ./models

2. Streaming Performance Benchmarks

Based on H100 whitepaper categories, adapted for streaming:

2.1 Basic Functionality Tests

Purpose: Verify streaming works with no repetition

Test	Prompt	Max Tokens	Expected Outcome
Simple Response	"Hello, how are you?"	50	Clean greeting, no repetition
Code Generation	"Write a Python function to calculate factorial"	150	Correct code, proper formatting
Technical Explanation	"Explain how binary search works"	200	Coherent explanation

2.2 Complex Reasoning Tasks

Purpose: Test model capabilities under CPU offloading

Test	Prompt	Max Tokens	Success Criteria
Multi-step Problem	"You have 3-gallon and 5-gallon jugs. Measure exactly 4 gallons step-by-step"	300	Logical steps, correct solution
System Design	"Design a simple chat application architecture"	400	Coherent design, realistic components
Algorithm Analysis	"Compare bubble sort and quicksort algorithms"	350	Accurate comparison, technical depth

<<<<<<< HEAD <<<<<<< HEAD

2.3 Long-form Generation Tests

=======

2.3 Long-form Generation Tests

main =======

2.3 Long-form Generation Tests

main Purpose: Stress test streaming with extended generation

Test	Prompt	Max Tokens	Success Criteria
Creative Writing	"Write a short story about AI discovering emotions"	800	Narrative structure, no repetition
Technical Documentation	"Document a REST API for a library management system"	1000	Professional structure, complete examples
Research Analysis	"Analyze the benefits and challenges of renewable energy"	600	Comprehensive coverage, logical flow

3. Performance Metrics Collection

For each test, collect:

3.1 Timing Metrics

<<<<<<< HEAD <<<<<<< HEAD

Total Generation Time: Start to [DONE] =======
Total Generation Time: Start to [DONE]

main =======

Total Generation Time: Start to [DONE]

main

First Token Latency: Request to first token
Average Tokens/Second: Total tokens ÷ generation time
Streaming Responsiveness: Subjective feel of real-time progress

<<<<<<< HEAD <<<<<<< HEAD

3.2 Quality Metrics

=======

3.2 Quality Metrics

main =======

3.2 Quality Metrics

main

Repetition Score: Using our validated algorithm
Completion Rate: Successfully completed vs timeout
Content Quality: Subjective assessment (1-5 scale)
Technical Accuracy: For code/technical content

3.3 Resource Metrics

Peak GPU Memory: During generation <<<<<<< HEAD <<<<<<< HEAD
Peak CPU Memory: During generation =======
Peak CPU Memory: During generation

main =======

Peak CPU Memory: During generation

main

CPU Utilization: Average during generation

Test Execution Framework

Per-Model Test Protocol

For each model (DeepSeek → GPT-OSS → Phi-3.5-MoE):

Model Loading:
- Clean start shimmy server with --cpu-moe
- Record load time and memory distribution
- Verify expert tensor CPU offloading
Systematic Testing:
- Execute all 9 benchmark tests in order
- Allow 5-second pause between tests
- Record all metrics for each test
Quality Assessment:
- Manual review of all generated content
- Flag any repetition or quality issues
- Document edge cases or failures
Resource Monitoring:
- Continuous memory monitoring during tests
- Performance profiling for bottlenecks
- Temperature validation throughout

White Paper Data Collection

For each model, document:

Architecture Specifications

Parameter count
Expert configuration (count, active per token)
Context length
Model file size

Memory Performance

Baseline GPU memory (estimated)
CPU offloaded GPU memory (measured)
VRAM savings percentage
Memory distribution breakdown

Streaming Performance

Average tokens/second across all tests
Range of performance (min/max)
First token latency average
Streaming responsiveness rating

Quality Validation

Repetition score across all tests
Content quality assessment
Technical accuracy rate
Completion success rate

Expected Outcomes

Based on H100 results, local hardware expectations:

Memory Savings (Should Match H100)

DeepSeek 16B: ~95-99% VRAM savings <<<<<<< HEAD <<<<<<< HEAD
GPT-OSS 20B: ~99% VRAM savings =======
GPT-OSS 20B: ~99% VRAM savings

main =======

GPT-OSS 20B: ~99% VRAM savings

main

Phi-3.5-MoE 41.9B: ~97% VRAM savings

Performance (Expected Lower Than H100)

H100 baseline: Not documented in whitepaper
Local expectation: 1-3 tokens/second based on initial testing
Streaming UX: Should feel responsive despite lower speed

Quality (Should Match H100)

Temperature 0.3: No repetition issues
Content quality: Maintained across all models
Technical accuracy: Preserved with CPU offloading

Success Criteria

Technical Success

✅ All models load successfully with CPU offloading
✅ Memory savings match H100 percentages (±5%)
✅ No repetition issues with temperature 0.3
✅ Streaming works smoothly for all test cases

<<<<<<< HEAD <<<<<<< HEAD

Performance Success

=======

Performance Success

main =======

Performance Success

main

✅ Consistent generation speed (no significant degradation during long tests)
✅ Reasonable completion times (<5 minutes for 1000 tokens)
✅ Good streaming responsiveness (tokens appear steadily)

Quality Success

✅ All generated content is coherent and relevant
✅ Technical content (code, explanations) is accurate
✅ No repetitive patterns or loops
✅ Creative content maintains narrative structure

Documentation Output

Comprehensive Results Table

| Model | Parameters | VRAM Saved | Avg Tokens/Sec | Quality Score | Repetition Score |
|-------|------------|-------------|----------------|---------------|------------------|
| DeepSeek 16B | 16.38B | XX% | X.X | X/5 | X.XXX |
<<<<<<< HEAD
<<<<<<< HEAD
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |  
>>>>>>> main
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |  
>>>>>>> main
| Phi-3.5-MoE 41.9B | 41.9B | XX% | X.X | X/5 | X.XXX |

Detailed Analysis Report

Performance comparison across models <<<<<<< HEAD <<<<<<< HEAD
Hardware bottleneck identification =======
Hardware bottleneck identification

main =======

Hardware bottleneck identification

main

Streaming vs non-streaming UX analysis
Quality preservation validation
Production readiness assessment

<<<<<<< HEAD <<<<<<< HEAD This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.

This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.

main ======= This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release. main

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Local Streaming Benchmark Protocol

Test Environment

Benchmark Test Categories

1. Memory Usage Analysis

2. Streaming Performance Benchmarks

2.1 Basic Functionality Tests

2.2 Complex Reasoning Tasks

2.3 Long-form Generation Tests

2.3 Long-form Generation Tests

2.3 Long-form Generation Tests

3. Performance Metrics Collection

3.1 Timing Metrics

3.2 Quality Metrics

3.2 Quality Metrics

3.2 Quality Metrics

3.3 Resource Metrics

Test Execution Framework

Per-Model Test Protocol

White Paper Data Collection

Architecture Specifications

Memory Performance

Streaming Performance

Quality Validation

Expected Outcomes

Memory Savings (Should Match H100)

Performance (Expected Lower Than H100)

Quality (Should Match H100)

Success Criteria

Technical Success

Performance Success

Performance Success

Performance Success

Quality Success

Documentation Output

Comprehensive Results Table

Detailed Analysis Report

<<<<<<< HEAD <<<<<<< HEAD This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.

FilesExpand file tree

LOCAL_STREAMING_BENCHMARK_PROTOCOL.md

Latest commit

History

LOCAL_STREAMING_BENCHMARK_PROTOCOL.md

File metadata and controls

Local Streaming Benchmark Protocol

Test Environment

Benchmark Test Categories

1. Memory Usage Analysis

2. Streaming Performance Benchmarks

2.1 Basic Functionality Tests

2.2 Complex Reasoning Tasks

2.3 Long-form Generation Tests

2.3 Long-form Generation Tests

2.3 Long-form Generation Tests

3. Performance Metrics Collection

3.1 Timing Metrics

3.2 Quality Metrics

3.2 Quality Metrics

3.2 Quality Metrics

3.3 Resource Metrics

Test Execution Framework

Per-Model Test Protocol

White Paper Data Collection

Architecture Specifications

Memory Performance

Streaming Performance

Quality Validation

Expected Outcomes

Memory Savings (Should Match H100)

Performance (Expected Lower Than H100)

Quality (Should Match H100)

Success Criteria

Technical Success

Performance Success

Performance Success

Performance Success

Quality Success

Documentation Output

Comprehensive Results Table

Detailed Analysis Report

<<<<<<< HEAD <<<<<<< HEAD This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.