This guide covers all configuration options for Shimmy.
- SHIMMY_BASE_GGUF: Path to the base GGUF model file
  export SHIMMY_BASE_GGUF=/path/to/your/model.gguf
- SHIMMY_LORA_GGUF: Path to the LoRA adapter file
  export SHIMMY_LORA_GGUF=/path/to/your/lora.gguf
- SHIMMY_LOG_LEVEL: Logging level (error, warn, info, debug, trace)
  export SHIMMY_LOG_LEVEL=info
- SHIMMY_BIND_ADDRESS: Default bind address for the server
  export SHIMMY_BIND_ADDRESS=127.0.0.1:11435
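The variables above can be checked in one place before the server is launched. The following is a minimal sketch, not part of Shimmy: the function name `shimmy_env_check` is hypothetical, the fallback values are simply the example values from this guide, and treating the base model as required and the LoRA adapter as optional is an assumption.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; not part of Shimmy itself.
shimmy_env_check() {
    # Fall back to the example values shown in this guide.
    : "${SHIMMY_LOG_LEVEL:=info}"
    : "${SHIMMY_BIND_ADDRESS:=127.0.0.1:11435}"
    export SHIMMY_LOG_LEVEL SHIMMY_BIND_ADDRESS

    # Accept only the five levels listed in this guide.
    case "$SHIMMY_LOG_LEVEL" in
        error|warn|info|debug|trace) ;;
        *) echo "unknown log level: $SHIMMY_LOG_LEVEL" >&2; return 1 ;;
    esac

    # Assumption: the base model is required and must be readable.
    if [ -z "${SHIMMY_BASE_GGUF:-}" ] || [ ! -r "$SHIMMY_BASE_GGUF" ]; then
        echo "base model missing or unreadable: ${SHIMMY_BASE_GGUF:-<unset>}" >&2
        return 1
    fi

    # Assumption: the LoRA adapter is optional, so check it only when set.
    if [ -n "${SHIMMY_LORA_GGUF:-}" ] && [ ! -r "$SHIMMY_LORA_GGUF" ]; then
        echo "LoRA adapter unreadable: $SHIMMY_LORA_GGUF" >&2
        return 1
    fi
}
```

A typical use would be `shimmy_env_check && shimmy serve`, so that a bad path is reported before the server tries to load the model.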
shimmy serve [OPTIONS]
Options:
- --bind <ADDRESS>: Bind address (default: 127.0.0.1:11435)
- --port <PORT>: Port number (overrides the port in the bind address)
- --workers <N>: Number of worker threads (default: auto-detected)
- --max-connections <N>: Maximum concurrent connections (default: 100)
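To make the interaction between --bind and --port concrete, the sketch below computes the address that would result when both are given. It only illustrates the documented rule that --port overrides the port in the bind address; the function name `effective_bind` is hypothetical and this is not Shimmy's own parsing code.

```shell
#!/bin/sh
# Hypothetical illustration of the documented --bind/--port rule.
effective_bind() {
    bind=${1:-127.0.0.1:11435}   # documented default bind address
    port=${2:-}                  # optional --port override
    if [ -n "$port" ]; then
        # Drop everything from the last ":" onward, then append the override.
        printf '%s:%s\n' "${bind%:*}" "$port"
    else
        printf '%s\n' "$bind"
    fi
}

effective_bind 127.0.0.1:11435 8080
```

Under this reading, `shimmy serve --bind 127.0.0.1:11435 --port 8080` would listen on 127.0.0.1:8080.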
shimmy generate [OPTIONS]
Options:
- --model <NAME>: Model name to use (default: "default")
- --prompt <TEXT>: Input prompt
- --max-tokens <N>: Maximum tokens to generate (default: 100)
- --temperature <F>: Sampling temperature (default: 0.7)
- --top-p <F>: Top-p sampling (default: 0.9)
- --top-k <N>: Top-k sampling (default: 40)
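Scripts that pass sampling options through to shimmy generate may want to reject obviously malformed values first. The sketch below is one way to do that; the function name `valid_sampling` is hypothetical, and the accepted ranges (non-negative temperature, top-p in (0, 1], non-negative integer top-k) are the conventional ones for these samplers, assumed here rather than taken from Shimmy.

```shell
#!/bin/sh
# Hypothetical validator for --temperature, --top-p and --top-k values.
# Ranges are conventional assumptions, not Shimmy's documented limits.
valid_sampling() {
    awk -v t="$1" -v p="$2" -v k="$3" 'BEGIN {
        num = "^[0-9]+([.][0-9]+)?$"          # plain non-negative decimal
        ok = (t ~ num) && (p ~ num) && (k ~ /^[0-9]+$/) \
             && (p + 0 > 0) && (p + 0 <= 1)   # top-p is a probability mass
        exit(ok ? 0 : 1)
    }'
}
```

For example, `valid_sampling 0.7 0.9 40 && shimmy generate --prompt "Hello" --temperature 0.7 --top-p 0.9 --top-k 40` runs the generation only when the three values pass the check.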
Place your GGUF model files in an accessible location and set the environment variable:
# Example model locations
export SHIMMY_BASE_GGUF=~/.cache/models/phi3-mini.gguf
export SHIMMY_BASE_GGUF=/models/llama2-7b.gguf
export SHIMMY_BASE_GGUF=./models/mistral-7b.gguf
If using LoRA adapters, ensure they are compatible with your base model:
export SHIMMY_LORA_GGUF=~/.cache/adapters/coding-adapter.gguf
Shimmy supports multiple prompt templates:
- chatml: ChatML format for chat-based models
- llama3: Llama 3 instruction format
- openchat: OpenChat conversation format
Templates are automatically selected based on model detection, but can be overridden:
shimmy generate --template chatml --prompt "Hello"
# Set number of threads for inference
export OMP_NUM_THREADS=8
# Enable CPU optimizations
export SHIMMY_CPU_THREADS=8
# Limit memory usage (in MB)
export SHIMMY_MAX_MEMORY=4096
# Enable memory mapping for large models
export SHIMMY_MMAP=true
Shimmy automatically detects and supports GPU acceleration through llama.cpp:
Supported GPU Vendors:
- NVIDIA: CUDA acceleration (automatic detection via nvidia-smi)
- AMD: ROCm acceleration (detection via rocm-smi, rocminfo, or Windows device enumeration)
- Intel: GPU acceleration via Intel GPU drivers
- Apple: Metal acceleration (automatic on macOS with supported GPUs)
Requirements:
- NVIDIA: CUDA drivers and toolkit installed
- AMD: ROCm toolkit for Linux, or compatible Windows drivers for Radeon GPUs
- Intel: Latest Intel GPU drivers
Configuration: No manual configuration is required. GPU acceleration is enabled automatically when compatible hardware and drivers are detected, and Shimmy falls back to CPU inference if no GPU is available.
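Because NVIDIA and AMD detection is described as relying on nvidia-smi, rocm-smi, and rocminfo, a quick way to anticipate the result is to check whether those tools are on the PATH. This is a rough sketch with a hypothetical function name, `gpu_tools_present`; Shimmy's actual detection logic may differ, and Intel and Apple GPUs are not covered by this check.

```shell
#!/bin/sh
# Hypothetical check: list which of the detection tools named in this
# guide are available on PATH. It does not query Shimmy itself.
gpu_tools_present() {
    found=""
    for tool in nvidia-smi rocm-smi rocminfo; do
        if command -v "$tool" >/dev/null 2>&1; then
            found="$found $tool"
        fi
    done
    # Strip the leading space left by the first append.
    printf '%s\n' "${found# }"
}

tools=$(gpu_tools_present)
if [ -n "$tools" ]; then
    echo "GPU detection tools on PATH: $tools"
else
    echo "No NVIDIA or AMD detection tools on PATH"
fi
```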
Verification: Check if your GPU is detected:
shimmy serve --bind 127.0.0.1:11435 --verbose
# Look for GPU initialization messages in the output
- Bind to localhost (127.0.0.1) for local-only access
- Use a reverse proxy (nginx, caddy) for external access
- Consider authentication middleware for production use
- Verify model file integrity before loading
- Use trusted model sources
- Monitor resource usage for potential abuse
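Verifying model file integrity usually means comparing the file's SHA-256 digest with the one published by the model source. The sketch below shows one way to do that from the shell; the function name `verify_model` is hypothetical, and it assumes either sha256sum (common on Linux) or shasum (common on macOS) is installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE has the EXPECTED SHA-256 digest.
verify_model() {
    file=$1
    expected=$2
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$file" | awk '{print $1}')
    elif command -v shasum >/dev/null 2>&1; then
        actual=$(shasum -a 256 "$file" | awk '{print $1}')
    else
        echo "no SHA-256 tool found" >&2
        return 2
    fi
    # Exit status of the comparison is the function's result.
    [ "$actual" = "$expected" ]
}
```

It could be combined with the model variable, for example `verify_model "$SHIMMY_BASE_GGUF" "$EXPECTED_SHA256" && shimmy serve`, where `EXPECTED_SHA256` is a placeholder for the digest published alongside the model.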
# Minimal logging (errors only)
export SHIMMY_LOG_LEVEL=error
# Standard logging (info and above)
export SHIMMY_LOG_LEVEL=info
# Debug logging (debug and above)
export SHIMMY_LOG_LEVEL=debug
# Log to file
shimmy serve 2>&1 | tee shimmy.log
# Structured JSON logging
export SHIMMY_LOG_FORMAT=json
- Model not loading
  - Check file path and permissions
  - Verify GGUF format compatibility
  - Check available memory
- Server not starting
  - Verify port is not in use
  - Check bind address format
  - Review log output for errors
- Slow inference
  - Increase CPU thread count
  - Verify model size vs available memory
  - Consider model quantization
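For the thread-count suggestion above, the value does not have to be hard-coded. The sketch below derives it from the number of online processors and applies it to the two thread variables used in this guide; the function name `detect_threads` is hypothetical, and using every logical CPU is only a starting point, since inference often runs best with roughly one thread per physical core.

```shell
#!/bin/sh
# Hypothetical helper: report the number of online logical CPUs,
# falling back to 1 when it cannot be determined.
detect_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;   # guard against empty or non-numeric output
    esac
    printf '%s\n' "$n"
}

OMP_NUM_THREADS=$(detect_threads)
SHIMMY_CPU_THREADS=$OMP_NUM_THREADS
export OMP_NUM_THREADS SHIMMY_CPU_THREADS
echo "Using $SHIMMY_CPU_THREADS inference threads"
```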
Enable verbose logging for troubleshooting:
SHIMMY_LOG_LEVEL=debug shimmy serve --verbose