Comprehensive MoE CPU Offloading Performance Analysis with Streaming
Based on H100 whitepaper methodology adapted for local hardware
Hardware:
- CPU: AMD/Intel (to be documented) <<<<<<< HEAD <<<<<<< HEAD
- RAM: 131GB available =======
- RAM: 131GB available
main =======
- RAM: 131GB available
main
- GPU: NVIDIA (to be documented)
- Storage: 45GB available for models
- Platform: Windows with MSYS2
Software:
- Shimmy: Branch
feat/moe-cpu-offload - Temperature: 0.3 (verified to eliminate repetition)
- Streaming: ENABLED (critical for usability)
Replicate H100 methodology for memory distribution:
Metrics:
- GPU VRAM usage with
--cpu-moe<<<<<<< HEAD <<<<<<< HEAD - CPU RAM usage =======
- CPU RAM usage
main =======
- CPU RAM usage
main
- Model load time
- Expert tensor distribution verification
Test Command:
./target/release/shimmy.exe serve --cpu-moe --bind 127.0.0.1:11435 --model-dirs ./modelsBased on H100 whitepaper categories, adapted for streaming:
Purpose: Verify streaming works with no repetition
| Test | Prompt | Max Tokens | Expected Outcome |
|---|---|---|---|
| Simple Response | "Hello, how are you?" | 50 | Clean greeting, no repetition |
| Code Generation | "Write a Python function to calculate factorial" | 150 | Correct code, proper formatting |
| Technical Explanation | "Explain how binary search works" | 200 | Coherent explanation |
Purpose: Test model capabilities under CPU offloading
| Test | Prompt | Max Tokens | Success Criteria |
|---|---|---|---|
| Multi-step Problem | "You have 3-gallon and 5-gallon jugs. Measure exactly 4 gallons step-by-step" | 300 | Logical steps, correct solution |
| System Design | "Design a simple chat application architecture" | 400 | Coherent design, realistic components |
| Algorithm Analysis | "Compare bubble sort and quicksort algorithms" | 350 | Accurate comparison, technical depth |
<<<<<<< HEAD <<<<<<< HEAD
=======
main =======
main Purpose: Stress test streaming with extended generation
| Test | Prompt | Max Tokens | Success Criteria |
|---|---|---|---|
| Creative Writing | "Write a short story about AI discovering emotions" | 800 | Narrative structure, no repetition |
| Technical Documentation | "Document a REST API for a library management system" | 1000 | Professional structure, complete examples |
| Research Analysis | "Analyze the benefits and challenges of renewable energy" | 600 | Comprehensive coverage, logical flow |
For each test, collect:
<<<<<<< HEAD <<<<<<< HEAD
- Total Generation Time: Start to [DONE] =======
- Total Generation Time: Start to [DONE]
main =======
- Total Generation Time: Start to [DONE]
main
- First Token Latency: Request to first token
- Average Tokens/Second: Total tokens ÷ generation time
- Streaming Responsiveness: Subjective feel of real-time progress
<<<<<<< HEAD <<<<<<< HEAD
=======
main =======
main
- Repetition Score: Using our validated algorithm
- Completion Rate: Successfully completed vs timeout
- Content Quality: Subjective assessment (1-5 scale)
- Technical Accuracy: For code/technical content
- Peak GPU Memory: During generation <<<<<<< HEAD <<<<<<< HEAD
- Peak CPU Memory: During generation =======
- Peak CPU Memory: During generation
main =======
- Peak CPU Memory: During generation
main
- CPU Utilization: Average during generation
For each model (DeepSeek → GPT-OSS → Phi-3.5-MoE):
-
Model Loading:
- Clean start shimmy server with
--cpu-moe - Record load time and memory distribution
- Verify expert tensor CPU offloading
- Clean start shimmy server with
-
Systematic Testing:
- Execute all 9 benchmark tests in order
- Allow 5-second pause between tests
- Record all metrics for each test
-
Quality Assessment:
- Manual review of all generated content
- Flag any repetition or quality issues
- Document edge cases or failures
-
Resource Monitoring:
- Continuous memory monitoring during tests
- Performance profiling for bottlenecks
- Temperature validation throughout
For each model, document:
- Parameter count
- Expert configuration (count, active per token)
- Context length
- Model file size
- Baseline GPU memory (estimated)
- CPU offloaded GPU memory (measured)
- VRAM savings percentage
- Memory distribution breakdown
- Average tokens/second across all tests
- Range of performance (min/max)
- First token latency average
- Streaming responsiveness rating
- Repetition score across all tests
- Content quality assessment
- Technical accuracy rate
- Completion success rate
Based on H100 results, local hardware expectations:
- DeepSeek 16B: ~95-99% VRAM savings <<<<<<< HEAD <<<<<<< HEAD
- GPT-OSS 20B: ~99% VRAM savings =======
- GPT-OSS 20B: ~99% VRAM savings
main =======
- GPT-OSS 20B: ~99% VRAM savings
main
- Phi-3.5-MoE 41.9B: ~97% VRAM savings
- H100 baseline: Not documented in whitepaper
- Local expectation: 1-3 tokens/second based on initial testing
- Streaming UX: Should feel responsive despite lower speed
- Temperature 0.3: No repetition issues
- Content quality: Maintained across all models
- Technical accuracy: Preserved with CPU offloading
- ✅ All models load successfully with CPU offloading
- ✅ Memory savings match H100 percentages (±5%)
- ✅ No repetition issues with temperature 0.3
- ✅ Streaming works smoothly for all test cases
<<<<<<< HEAD <<<<<<< HEAD
=======
main =======
main
- ✅ Consistent generation speed (no significant degradation during long tests)
- ✅ Reasonable completion times (<5 minutes for 1000 tokens)
- ✅ Good streaming responsiveness (tokens appear steadily)
- ✅ All generated content is coherent and relevant
- ✅ Technical content (code, explanations) is accurate
- ✅ No repetitive patterns or loops
- ✅ Creative content maintains narrative structure
| Model | Parameters | VRAM Saved | Avg Tokens/Sec | Quality Score | Repetition Score |
|-------|------------|-------------|----------------|---------------|------------------|
| DeepSeek 16B | 16.38B | XX% | X.X | X/5 | X.XXX |
<<<<<<< HEAD
<<<<<<< HEAD
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |
>>>>>>> main
=======
| GPT-OSS 20B | 20B | XX% | X.X | X/5 | X.XXX |
>>>>>>> main
| Phi-3.5-MoE 41.9B | 41.9B | XX% | X.X | X/5 | X.XXX |
- Performance comparison across models <<<<<<< HEAD <<<<<<< HEAD
- Hardware bottleneck identification =======
- Hardware bottleneck identification
main =======
- Hardware bottleneck identification
main
- Streaming vs non-streaming UX analysis
- Quality preservation validation
- Production readiness assessment
<<<<<<< HEAD <<<<<<< HEAD This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.
This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release.
main ======= This protocol will generate comprehensive data for the white paper demonstrating MoE CPU offloading with streaming is production-ready for Shimmy 1.7.0 release. main