Configuration Guide

This guide covers all configuration options for Shimmy.

Environment Variables

Required

  • SHIMMY_BASE_GGUF: Path to the base GGUF model file
    export SHIMMY_BASE_GGUF=/path/to/your/model.gguf

Optional

  • SHIMMY_LORA_GGUF: Path to LoRA adapter file

    export SHIMMY_LORA_GGUF=/path/to/your/lora.gguf
  • SHIMMY_LOG_LEVEL: Logging level (error, warn, info, debug, trace)

    export SHIMMY_LOG_LEVEL=info
  • SHIMMY_BIND_ADDRESS: Default bind address for server

    export SHIMMY_BIND_ADDRESS=127.0.0.1:11435
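
A missing or mistyped SHIMMY_BASE_GGUF is easy to overlook, so a small pre-flight check can help. The following is a minimal sketch and is not part of Shimmy; the function name check_shimmy_env is hypothetical.

```shell
# Hypothetical helper (not part of Shimmy): check the documented variables
# before starting the server.
check_shimmy_env() {
  # SHIMMY_BASE_GGUF is required.
  if [ -z "${SHIMMY_BASE_GGUF:-}" ]; then
    echo "SHIMMY_BASE_GGUF is not set" >&2
    return 1
  fi
  if [ ! -r "$SHIMMY_BASE_GGUF" ]; then
    echo "Model file not readable: $SHIMMY_BASE_GGUF" >&2
    return 1
  fi
  # SHIMMY_LORA_GGUF is optional; only check it when it is set.
  if [ -n "${SHIMMY_LORA_GGUF:-}" ] && [ ! -r "$SHIMMY_LORA_GGUF" ]; then
    echo "LoRA adapter not readable: $SHIMMY_LORA_GGUF" >&2
    return 1
  fi
  return 0
}
```

One way to use it is `check_shimmy_env && shimmy serve`, so the server only starts when the paths resolve.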

Command Line Options

Server Configuration

shimmy serve [OPTIONS]

Options:

  • --bind <ADDRESS>: Bind address (default: 127.0.0.1:11435)
  • --port <PORT>: Port number (overrides port in bind address)
  • --workers <N>: Number of worker threads (default: auto-detected)
  • --max-connections <N>: Maximum concurrent connections (default: 100)
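
To make the interaction between --bind and --port concrete, the sketch below mirrors the documented rule that --port overrides the port in the bind address. It is illustrative only and does not reflect how Shimmy parses its arguments internally; effective_bind is a hypothetical name.

```shell
# Illustrative only: compute the address implied by --bind plus an optional
# --port override, following the rule documented above.
effective_bind() {
  bind=${1:-127.0.0.1:11435}   # documented default bind address
  port=${2:-}                  # optional --port value
  if [ -n "$port" ]; then
    # Drop the trailing ":<port>" segment and append the override.
    printf '%s:%s\n' "${bind%:*}" "$port"
  else
    printf '%s\n' "$bind"
  fi
}
```

Under that rule, `shimmy serve --bind 127.0.0.1:11435 --port 8080` should listen on 127.0.0.1:8080.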

Model Configuration

shimmy generate [OPTIONS]

Options:

  • --model <NAME>: Model name to use (default: "default")
  • --prompt <TEXT>: Input prompt
  • --max-tokens <N>: Maximum tokens to generate (default: 100)
  • --temperature <F>: Sampling temperature (default: 0.7)
  • --top-p <F>: Top-p sampling (default: 0.9)
  • --top-k <N>: Top-k sampling (default: 40)
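
The options above can be combined in a single call. The wrapper below is a sketch that pins a lower temperature and a shorter output for more repeatable answers, leaving the other sampling flags at their documented defaults. The function name generate_short and the SHIMMY_BIN variable are hypothetical conveniences, not part of Shimmy.

```shell
# Hypothetical wrapper around `shimmy generate`.
# SHIMMY_BIN is an assumed override for the binary path (defaults to "shimmy").
generate_short() {
  "${SHIMMY_BIN:-shimmy}" generate \
    --model default \
    --prompt "$1" \
    --max-tokens 50 \
    --temperature 0.2 \
    --top-p 0.9 \
    --top-k 40
}
```

Call it as `generate_short "Hello"`. Setting SHIMMY_BIN=echo prints the arguments instead of running the model, which is handy for checking quoting.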

Model Setup

GGUF Models

Place your GGUF model files in an accessible location and set the environment variable:

# Example model locations (set only one; a later export overrides an earlier one)
export SHIMMY_BASE_GGUF=~/.cache/models/phi3-mini.gguf
export SHIMMY_BASE_GGUF=/models/llama2-7b.gguf
export SHIMMY_BASE_GGUF=./models/mistral-7b.gguf

LoRA Adapters

If using LoRA adapters, ensure they are compatible with your base model:

export SHIMMY_LORA_GGUF=~/.cache/adapters/coding-adapter.gguf

Templates

Shimmy supports multiple prompt templates:

Available Templates

  • chatml: ChatML format for chat-based models
  • llama3: Llama 3 instruction format
  • openchat: OpenChat conversation format

Template Selection

Templates are automatically selected based on model detection, but can be overridden:

shimmy generate --template chatml --prompt "Hello"
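
When the template name comes from a script or a config file, it can be worth checking it against the list above before passing it to --template. This is a minimal sketch; is_known_template is a hypothetical helper, and the list is copied from this guide rather than queried from Shimmy.

```shell
# Hypothetical helper: accept only the template names listed in this guide.
is_known_template() {
  case "$1" in
    chatml|llama3|openchat) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_known_template "$tpl" && shimmy generate --template "$tpl" --prompt "Hello"` skips the call when the name is not recognized.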

Performance Tuning

CPU Optimization

# Set number of threads for inference
export OMP_NUM_THREADS=8

# Set the number of CPU threads Shimmy uses
export SHIMMY_CPU_THREADS=8
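
Rather than hard-coding 8, the thread count can be derived from the machine. The sketch below uses `getconf _NPROCESSORS_ONLN`, which reports online logical processors on Linux and macOS, and falls back to 1 if that fails. Whether logical or physical core count gives the best throughput depends on the hardware, so treat the result as a starting point. detect_cpu_threads is a hypothetical name.

```shell
# Hypothetical helper: print the number of online logical processors,
# or 1 when it cannot be determined.
detect_cpu_threads() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
  case "$n" in
    ''|0|*[!0-9]*) n=1 ;;   # empty, zero, or non-numeric: fall back to 1
  esac
  printf '%s\n' "$n"
}

SHIMMY_CPU_THREADS=$(detect_cpu_threads)
OMP_NUM_THREADS=$SHIMMY_CPU_THREADS
export SHIMMY_CPU_THREADS OMP_NUM_THREADS
```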

Memory Management

# Limit memory usage (in MB)
export SHIMMY_MAX_MEMORY=4096

# Enable memory mapping for large models
export SHIMMY_MMAP=true
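
A quick sanity check is to compare the model file size against the memory limit before loading. The sketch below is a rough lower bound only: actual memory use is normally higher than the file size because of context buffers and other runtime allocations, and how Shimmy enforces SHIMMY_MAX_MEMORY is not covered here. model_fits_limit is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the file size (rounded up to whole MB)
# is at most the limit. The limit defaults to SHIMMY_MAX_MEMORY, then 4096.
model_fits_limit() {
  file=$1
  limit_mb=${2:-${SHIMMY_MAX_MEMORY:-4096}}
  bytes=$(wc -c < "$file") || return 2          # 2 = file could not be read
  size_mb=$(( (bytes + 1048575) / 1048576 ))    # round up to whole MB
  [ "$size_mb" -le "$limit_mb" ]
}
```

For example, `model_fits_limit "$SHIMMY_BASE_GGUF" || echo "model is larger than SHIMMY_MAX_MEMORY"`.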

GPU Support

Shimmy automatically detects and supports GPU acceleration through llama.cpp:

Supported GPU Vendors:

  • NVIDIA: CUDA acceleration (automatic detection via nvidia-smi)
  • AMD: ROCm acceleration (detection via rocm-smi, rocminfo, or Windows device enumeration)
  • Intel: GPU acceleration via Intel GPU drivers
  • Apple: Metal acceleration (automatic on macOS with supported GPUs)

Requirements:

  • NVIDIA: CUDA drivers and toolkit installed
  • AMD: ROCm toolkit for Linux, or compatible Windows drivers for Radeon GPUs
  • Intel: Latest Intel GPU drivers

Configuration: No manual configuration is required. GPU acceleration is enabled automatically when compatible hardware and drivers are detected; otherwise Shimmy falls back to CPU inference.

Verification: Check if your GPU is detected:

shimmy serve --bind 127.0.0.1:11435 --verbose
# Look for GPU initialization messages in the output
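
If no GPU messages appear, it can help to confirm that the vendor tools mentioned above are on the PATH. The sketch below only reports which of nvidia-smi, rocm-smi, and rocminfo are installed. Finding a tool does not guarantee that Shimmy will use the GPU, and the check does not cover Intel or Apple hardware. list_gpu_tools is a hypothetical helper.

```shell
# Hypothetical helper: report which vendor detection tools are on PATH.
list_gpu_tools() {
  found=""
  for tool in nvidia-smi rocm-smi rocminfo; do
    if command -v "$tool" >/dev/null 2>&1; then
      found="$found $tool"
    fi
  done
  if [ -n "$found" ]; then
    echo "found:$found"
  else
    echo "none"
  fi
}
```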

Security Considerations

Network Security

  • Bind to localhost (127.0.0.1) for local-only access
  • Use a reverse proxy (nginx, caddy) for external access
  • Consider authentication middleware for production use
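
A startup script can refuse to expose the server by accident. The sketch below checks whether a bind address points at the loopback interface. It is a simple string match, not a full address parser, and whether Shimmy accepts the localhost or [::1] forms is an assumption. is_loopback_bind is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the host part of HOST:PORT is loopback.
is_loopback_bind() {
  host=${1%:*}   # strip the trailing ":<port>"
  case "$host" in
    127.*|localhost|"[::1]") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_loopback_bind "${SHIMMY_BIND_ADDRESS:-127.0.0.1:11435}" || echo "warning: server will be reachable from other hosts"`.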

Model Security

  • Verify model file integrity before loading
  • Use trusted model sources
  • Monitor resource usage for potential abuse
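
One common way to verify file integrity is to compare a SHA-256 digest against the value published by the model's source. The sketch below assumes that either sha256sum (typical on Linux) or shasum (typical on macOS) is installed; sha256_of and verify_model are hypothetical helpers.

```shell
# Hypothetical helpers: compute and compare a SHA-256 digest.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# verify_model FILE EXPECTED_HEX: succeed only on an exact digest match.
verify_model() {
  actual=$(sha256_of "$1" 2>/dev/null)
  [ -n "$actual" ] && [ "$actual" = "$2" ]
}
```

For example, `verify_model "$SHIMMY_BASE_GGUF" "<published digest>" || echo "checksum mismatch"`.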

Logging Configuration

Log Levels

# Minimal logging (errors only)
export SHIMMY_LOG_LEVEL=error

# Standard logging (info and above)
export SHIMMY_LOG_LEVEL=info

# Debug logging (debug and above; use trace for all messages)
export SHIMMY_LOG_LEVEL=debug

Log Output

# Log to a file while still printing to the terminal
shimmy serve 2>&1 | tee shimmy.log

# Structured JSON logging
export SHIMMY_LOG_FORMAT=json

Troubleshooting

Common Issues

  1. Model not loading

    • Check file path and permissions
    • Verify GGUF format compatibility
    • Check available memory
  2. Server not starting

    • Verify port is not in use
    • Check bind address format
    • Review log output for errors
  3. Slow inference

    • Increase CPU thread count
    • Verify model size vs available memory
    • Consider model quantization
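
For the "Model not loading" case, a quick first check is whether the file is a GGUF file at all. GGUF files begin with the four ASCII bytes GGUF, so the sketch below reads the first four bytes and compares them. It only inspects the magic bytes; it does not validate the GGUF version or the rest of the header. has_gguf_magic is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the file starts with the bytes "GGUF".
has_gguf_magic() {
  magic=$(dd if="$1" bs=4 count=1 2>/dev/null)
  [ "$magic" = "GGUF" ]
}
```

For example, `has_gguf_magic "$SHIMMY_BASE_GGUF" || echo "not a GGUF file (or unreadable)"`.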

Debug Mode

Enable verbose logging for troubleshooting:

SHIMMY_LOG_LEVEL=debug shimmy serve --verbose