Configuration Guide

This guide covers all configuration options for Shimmy.

Environment Variables

Required

SHIMMY_BASE_GGUF: Path to the base GGUF model file
```
export SHIMMY_BASE_GGUF=/path/to/your/model.gguf
```
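Since this variable is required, a launcher script may want to check it before starting Shimmy. The following is a minimal sketch; `check_base_gguf` is a hypothetical helper written for this guide, not part of Shimmy.

```shell
# Hypothetical helper (not part of Shimmy).
# Succeeds only if SHIMMY_BASE_GGUF is set and points to a readable file.
check_base_gguf() {
    if [ -z "${SHIMMY_BASE_GGUF:-}" ]; then
        echo "SHIMMY_BASE_GGUF is not set" >&2
        return 1
    fi
    if [ ! -r "$SHIMMY_BASE_GGUF" ]; then
        echo "Cannot read model file: $SHIMMY_BASE_GGUF" >&2
        return 1
    fi
    return 0
}

# Usage: check_base_gguf && shimmy serve
```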

Optional

SHIMMY_LORA_GGUF: Path to LoRA adapter file

```
export SHIMMY_LORA_GGUF=/path/to/your/lora.gguf
```

SHIMMY_LOG_LEVEL: Logging level (error, warn, info, debug, trace)
```
export SHIMMY_LOG_LEVEL=info
```
SHIMMY_BIND_ADDRESS: Default bind address for server
```
export SHIMMY_BIND_ADDRESS=127.0.0.1:11435
```

Command Line Options

Server Configuration

```
shimmy serve [OPTIONS]
```

Options:

--bind <ADDRESS>: Bind address (default: 127.0.0.1:11435)
--port <PORT>: Port number (overrides port in bind address)
--workers <N>: Number of worker threads (default: auto-detected)
--max-connections <N>: Maximum concurrent connections (default: 100)
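To illustrate how `--port` relates to `--bind`, the sketch below computes the address that would result from combining the two. It assumes an IPv4-style `HOST:PORT` value and that the override simply replaces the port after the last colon. `effective_bind` is a hypothetical helper, not Shimmy's own logic.

```shell
# Hypothetical helper (not Shimmy's implementation).
# effective_bind ADDRESS [PORT]: print ADDRESS with its port replaced by PORT,
# or ADDRESS unchanged when no PORT is given.
effective_bind() {
    bind=$1
    port=${2:-}
    if [ -n "$port" ]; then
        # Drop everything from the last colon onward, then append the override.
        echo "${bind%:*}:$port"
    else
        echo "$bind"
    fi
}

# effective_bind 127.0.0.1:11435 8080   prints 127.0.0.1:8080
# effective_bind 127.0.0.1:11435        prints 127.0.0.1:11435
```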

Model Configuration

```
shimmy generate [OPTIONS]
```

Options:

--model <NAME>: Model name to use (default: "default")
--prompt <TEXT>: Input prompt
--max-tokens <N>: Maximum tokens to generate (default: 100)
--temperature <F>: Sampling temperature (default: 0.7)
--top-p <F>: Top-p sampling (default: 0.9)
--top-k <N>: Top-k sampling (default: 40)

Model Setup

GGUF Models

Place your GGUF model files in an accessible location and set the environment variable:

```
# Example model locations (set only one; a later export overrides an earlier one)
export SHIMMY_BASE_GGUF=~/.cache/models/phi3-mini.gguf
export SHIMMY_BASE_GGUF=/models/llama2-7b.gguf
export SHIMMY_BASE_GGUF=./models/mistral-7b.gguf
```
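A quick sanity check on a model file is to look at its first four bytes, which in the GGUF format are the ASCII magic `GGUF`. The sketch below does only that. `is_gguf` is a hypothetical helper, and passing the check does not guarantee that Shimmy can load the file.

```shell
# Hypothetical helper (not part of Shimmy).
# is_gguf FILE: succeeds if FILE is readable and starts with the magic "GGUF".
is_gguf() {
    [ -r "$1" ] || return 1
    # Read exactly one 4-byte block from the start of the file.
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null)
    [ "$magic" = "GGUF" ]
}

# Usage: is_gguf "$SHIMMY_BASE_GGUF" || echo "not a GGUF file" >&2
```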

LoRA Adapters

If using LoRA adapters, ensure they are compatible with your base model:

```
export SHIMMY_LORA_GGUF=~/.cache/adapters/coding-adapter.gguf
```

Templates

Shimmy supports multiple prompt templates:

Available Templates

chatml: ChatML format for chat-based models
llama3: Llama 3 instruction format
openchat: OpenChat conversation format

Template Selection

Templates are automatically selected based on model detection, but can be overridden:

```
shimmy generate --template chatml --prompt "Hello"
```

Performance Tuning

CPU Optimization

```
# Set number of threads for inference
export OMP_NUM_THREADS=8

# Enable CPU optimizations
export SHIMMY_CPU_THREADS=8
```
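Rather than hard-coding a thread count, a startup script could derive it from the machine. The sketch below uses `getconf _NPROCESSORS_ONLN`, which reports online logical CPUs on Linux and macOS. `detect_cpu_count` is a hypothetical helper, and the best value for a given model may well be lower than the logical CPU count.

```shell
# Hypothetical helper (not part of Shimmy).
# detect_cpu_count: print the number of online logical CPUs, or 1 if unknown.
detect_cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case $n in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric: fall back to one thread
    esac
    echo "$n"
}

# Apply the detected count to both variables described above.
export OMP_NUM_THREADS="$(detect_cpu_count)"
export SHIMMY_CPU_THREADS="$OMP_NUM_THREADS"
```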

Memory Management

```
# Limit memory usage (in MB)
export SHIMMY_MAX_MEMORY=4096

# Enable memory mapping for large models
export SHIMMY_MMAP=true
```
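When choosing a memory limit, it can help to compare it against the size of the model file. The sketch below is a rough heuristic only: it assumes the limit is expressed in MB as above, and it ignores any memory needed beyond the model weights. `model_fits_limit` is a hypothetical helper, not something Shimmy provides.

```shell
# Hypothetical helper (not part of Shimmy).
# model_fits_limit FILE LIMIT_MB: succeeds if FILE's size, rounded up to whole
# MiB, is at most LIMIT_MB. Returns 2 if FILE cannot be read.
model_fits_limit() {
    [ -r "$1" ] || return 2
    # Strip whitespace because some wc implementations pad their output.
    size_bytes=$(wc -c < "$1" | tr -d '[:space:]')
    size_mb=$(( (size_bytes + 1048575) / 1048576 ))
    [ "$size_mb" -le "$2" ]
}

# Usage:
# model_fits_limit "$SHIMMY_BASE_GGUF" "$SHIMMY_MAX_MEMORY" || echo "model may not fit" >&2
```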

GPU Support

Shimmy automatically detects and supports GPU acceleration through llama.cpp:

Supported GPU Vendors:

NVIDIA: CUDA acceleration (automatic detection via nvidia-smi)
AMD: ROCm acceleration (detection via rocm-smi, rocminfo, or Windows device enumeration)
Intel: GPU acceleration via Intel GPU drivers
Apple: Metal acceleration (automatic on macOS with supported GPUs)

Requirements:

NVIDIA: CUDA drivers and toolkit installed
AMD: ROCm toolkit for Linux, or compatible Windows drivers for Radeon GPUs
Intel: Latest Intel GPU drivers

Configuration: No manual configuration is required. GPU acceleration is enabled automatically when compatible hardware and drivers are detected, and Shimmy falls back to CPU inference if no GPU is available.

Verification: Check if your GPU is detected:

```
shimmy serve --bind 127.0.0.1:11435 --verbose
# Look for GPU initialization messages in the output
```
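If no GPU messages appear, it can be useful to check whether the vendor tools mentioned above are installed at all. The sketch below only looks for them on `PATH`. `detect_gpu_tools` is a hypothetical helper and does not reproduce Shimmy's own detection logic.

```shell
# Hypothetical helper (not part of Shimmy).
# detect_gpu_tools: print each vendor tool found on PATH, one per line.
detect_gpu_tools() {
    for tool in nvidia-smi rocm-smi rocminfo; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Usage: detect_gpu_tools
# Empty output means none of the three tools is on PATH.
```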

Security Considerations

Network Security

Bind to localhost (127.0.0.1) for local-only access
Use a reverse proxy (nginx, caddy) for external access
Consider authentication middleware for production use

Model Security

Verify model file integrity before loading
Use trusted model sources
Monitor resource usage for potential abuse
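One way to verify model file integrity is to compare the file's SHA-256 digest with a checksum published by the model source. The sketch below assumes such a checksum is available and that either `sha256sum` or `shasum` is installed. `verify_sha256` is a hypothetical helper.

```shell
# Hypothetical helper (not part of Shimmy).
# verify_sha256 FILE EXPECTED_HEX: succeeds if FILE's SHA-256 digest equals
# EXPECTED_HEX (lowercase hex).
verify_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    else
        # Fallback for systems that ship shasum instead, such as macOS.
        actual=$(shasum -a 256 "$1" | cut -d ' ' -f 1)
    fi
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Usage: verify_sha256 "$SHIMMY_BASE_GGUF" "<published checksum>" || exit 1
```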

Logging Configuration

Log Levels

```
# Minimal logging (errors only)
export SHIMMY_LOG_LEVEL=error

# Standard logging (info and above)
export SHIMMY_LOG_LEVEL=info

# Debug logging (debug and above; trace is more verbose still)
export SHIMMY_LOG_LEVEL=debug
```

Log Output

```
# Log to a file while still printing to the terminal
shimmy serve 2>&1 | tee shimmy.log

# Structured JSON logging
export SHIMMY_LOG_FORMAT=json
```

Troubleshooting

Common Issues

Model not loading
- Check file path and permissions
- Verify GGUF format compatibility
- Check available memory

Server not starting
- Verify port is not in use
- Check bind address format
- Review log output for errors

Slow inference
- Increase CPU thread count
- Verify model size vs available memory
- Consider model quantization
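For the bind address check, a small validation function can catch obvious mistakes before the server is started. The sketch below assumes the `HOST:PORT` form used by the default `127.0.0.1:11435`; the exact set of formats Shimmy accepts may be wider. `valid_bind_address` is a hypothetical helper.

```shell
# Hypothetical helper (not part of Shimmy).
# valid_bind_address ADDR: succeeds if ADDR looks like HOST:PORT with a
# non-empty host and a numeric port between 1 and 65535.
valid_bind_address() {
    case $1 in
        *:*) ;;
        *) return 1 ;;          # no colon, so no port
    esac
    host=${1%:*}
    port=${1##*:}
    [ -n "$host" ] || return 1
    case $port in
        ''|*[!0-9]*) return 1 ;; # empty or non-numeric port
    esac
    [ "${#port}" -le 5 ] || return 1
    [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

# Usage: valid_bind_address "$SHIMMY_BIND_ADDRESS" || echo "bad bind address" >&2
```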

Debug Mode

Enable verbose logging for troubleshooting:

```
SHIMMY_LOG_LEVEL=debug shimmy serve --verbose
```
