# Backend Reference
caro supports multiple inference backends for flexibility across different platforms and use cases.
## Backend Overview

| Backend | Platform | GPU Support | Best For |
|---|---|---|---|
| MLX | Apple Silicon | Yes | Macs with M1/M2/M3/M4 |
| Ollama | All | Varies | Cross-platform, easy setup |
| vLLM | Linux/Server | Yes (CUDA) | High-performance serving |
## MLX Backend

The MLX backend provides GPU-accelerated inference on Apple Silicon.
### Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 12.0+
- Xcode with Metal compiler
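If it helps to see the platform requirement in code, a runtime check along these lines distinguishes Apple Silicon; the helper name `is_apple_silicon` is a hypothetical sketch, not part of caro's API:

```rust
use std::env;

// Hypothetical helper: true only when running natively on macOS/aarch64,
// the platform combination the MLX backend requires.
fn is_apple_silicon() -> bool {
    env::consts::OS == "macos" && env::consts::ARCH == "aarch64"
}

fn main() {
    if is_apple_silicon() {
        println!("MLX backend candidate: Apple Silicon detected");
    } else {
        println!(
            "MLX unavailable on {}/{}",
            env::consts::OS,
            env::consts::ARCH
        );
    }
}
```

Note that an x86_64 binary running under Rosetta would report `x86_64` here, so a native aarch64 build is assumed.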
### Configuration

```toml
[backends.mlx]
enabled = true
threads = 4
gpu = true
```

### Performance

| Metric | M1 | M1 Pro | M2 Pro | M4 Pro |
|---|---|---|---|---|
| First inference | 2.5s | 2.0s | 1.8s | 1.5s |
| Subsequent | 800ms | 600ms | 500ms | 400ms |
| Memory | 1.2GB | 1.2GB | 1.2GB | 1.2GB |
### Troubleshooting

**Metal compiler not found:**

```sh
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcrun --find metal
```

**Build failure:**

```sh
cargo clean
cargo build --release --features embedded-mlx
```

## Ollama Backend

Ollama provides easy local model serving across all platforms.
- Install Ollama:

  ```sh
  # macOS
  brew install ollama

  # Linux
  curl -fsSL https://ollama.ai/install.sh | sh
  ```

- Start Ollama:

  ```sh
  ollama serve
  ```

- Pull a model:

  ```sh
  ollama pull qwen2.5-coder:latest
  ```

### Configuration

```toml
[backends.ollama]
enabled = true
host = "http://localhost:11434"
model = "qwen2.5-coder:latest"
timeout = 30
```

### Available Models

| Model | Size | Speed | Quality |
|---|---|---|---|
| qwen2.5-coder:0.5b | 0.5GB | Fast | Good |
| qwen2.5-coder:1.5b | 1.1GB | Medium | Better |
| qwen2.5-coder:7b | 4.5GB | Slower | Best |
| codellama:7b | 4GB | Medium | Good |
### Performance Tips

- Keep Ollama running in the background
- Use smaller models for faster responses
- Increase timeout for larger models
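For reference, Ollama's `/api/generate` endpoint accepts a JSON payload like the one assembled below. The `generate_request` helper is an illustrative sketch, not caro's actual client code:

```rust
// Builds the JSON body for a POST to Ollama's /api/generate endpoint
// (fields per the Ollama REST API: model, prompt, stream).
fn generate_request(model: &str, prompt: &str) -> String {
    // Minimal JSON string escaping for the two embedded fields.
    let esc = |s: &str| s.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\",\"stream\":false}}",
        esc(model),
        esc(prompt)
    )
}

fn main() {
    // POST this body to http://localhost:11434/api/generate
    let body = generate_request("qwen2.5-coder:latest", "list files");
    println!("{body}");
}
```

With `"stream": false` the server returns a single JSON response instead of a token stream.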
## vLLM Backend

vLLM provides high-performance serving for production deployments.
- Install vLLM:

  ```sh
  pip install vllm
  ```

- Start the server:

  ```sh
  vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --port 8000 \
    --max-model-len 4096
  ```

### Configuration

```toml
[backends.vllm]
enabled = true
url = "http://localhost:8000"
timeout = 30
```

### Docker Deployment

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    command: >
      --model Qwen/Qwen2.5-Coder-1.5B-Instruct
      --max-model-len 4096
```

### Performance

| GPUs | Throughput | Latency |
|---|---|---|
| 1x A100 | 100+ req/s | <500ms |
| 1x RTX 4090 | 50+ req/s | <800ms |
| 1x RTX 3090 | 30+ req/s | <1200ms |
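Both server backends (Ollama on port 11434, vLLM on port 8000) can be probed before use. A rough availability check is a TCP connect with a short timeout; the `port_open` helper below is a hypothetical sketch, not caro's detection logic:

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

// Returns true if the given "host:port" address accepts a TCP
// connection within the timeout; a closed port fails fast.
fn port_open(addr: &str, timeout: Duration) -> bool {
    addr.to_socket_addrs()
        .ok()
        .and_then(|mut addrs| addrs.next())
        .map(|a| TcpStream::connect_timeout(&a, timeout).is_ok())
        .unwrap_or(false)
}

fn main() {
    let t = Duration::from_millis(300);
    println!("ollama reachable: {}", port_open("127.0.0.1:11434", t));
    println!("vllm reachable:   {}", port_open("127.0.0.1:8000", t));
}
```

A connect check only proves the port is listening, not that the model is loaded, so a real health check would also issue a request.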
## Backend Selection

### Automatic Selection

caro automatically selects the best available backend:
1. **MLX** - if on Apple Silicon with MLX support
2. **Ollama** - if Ollama is running locally
3. **vLLM** - if a vLLM server is configured
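The fallback order above can be sketched as a simple priority chain; the boolean parameters stand in for caro's real availability checks, which are not shown here:

```rust
// Illustrative sketch of the backend fallback order described above.
#[derive(Debug, PartialEq)]
enum Backend {
    Mlx,
    Ollama,
    Vllm,
}

fn select_backend(mlx_ok: bool, ollama_ok: bool, vllm_ok: bool) -> Option<Backend> {
    if mlx_ok {
        Some(Backend::Mlx) // 1. Apple Silicon with MLX support
    } else if ollama_ok {
        Some(Backend::Ollama) // 2. Ollama running locally
    } else if vllm_ok {
        Some(Backend::Vllm) // 3. configured vLLM server
    } else {
        None // no backend available
    }
}

fn main() {
    // e.g. a Linux box where only the server backends are candidates:
    println!("{:?}", select_backend(false, true, true));
}
```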
### Manual Selection

Override via the command line:

```sh
caro --backend ollama "list files"
caro --backend vllm "list files"
```

Or via an environment variable:

```sh
export CARO_BACKEND=ollama
caro "list files"
```

## Custom Backend

Implement the `ModelBackend` trait for custom backends:
```rust
pub trait ModelBackend: Send + Sync {
    async fn generate(&self, prompt: &str) -> Result<String>;
    fn is_available(&self) -> bool;
    fn name(&self) -> &str;
}
```

See the source code for implementation examples.
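As a minimal, self-contained sketch of implementing the trait: the `EchoBackend`, the `Result` alias, and the hand-rolled `block_on` executor below are all illustrative assumptions (caro's crate defines its own error type and async runtime), but they show the shape an implementation takes:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Assumed alias; the real crate may use its own error type.
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

pub trait ModelBackend: Send + Sync {
    async fn generate(&self, prompt: &str) -> Result<String>;
    fn is_available(&self) -> bool;
    fn name(&self) -> &str;
}

// A trivial backend that echoes the prompt, showing an impl's shape.
struct EchoBackend;

impl ModelBackend for EchoBackend {
    async fn generate(&self, prompt: &str) -> Result<String> {
        Ok(format!("echo: {prompt}"))
    }
    fn is_available(&self) -> bool {
        true
    }
    fn name(&self) -> &str {
        "echo"
    }
}

// Minimal busy-poll executor so the example runs without an async runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn clone_fn(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone_fn, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let backend = EchoBackend;
    assert!(backend.is_available());
    let reply = block_on(backend.generate("list files")).unwrap();
    println!("{} -> {reply}", backend.name());
}
```

This uses native `async fn` in traits (Rust 1.75+); an earlier toolchain, or a trait that must be object-safe, would need the `async-trait` crate instead.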