
Backend Reference

caro supports multiple inference backends for flexibility across different platforms and use cases.

| Backend | Platform | GPU Support | Best For |
|---|---|---|---|
| MLX | Apple Silicon | Yes | Macs with M1/M2/M3/M4 |
| Ollama | All | Varies | Cross-platform, easy setup |
| vLLM | Linux/Server | Yes (CUDA) | High-performance serving |

The MLX backend provides GPU-accelerated inference on Apple Silicon.

Requirements:

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 12.0+
  • Xcode with the Metal compiler

Configuration:

```toml
[backends.mlx]
enabled = true
threads = 4
gpu = true
```
Typical performance by chip:

| Metric | M1 | M1 Pro | M2 Pro | M4 Pro |
|---|---|---|---|---|
| First inference | 2.5s | 2.0s | 1.8s | 1.5s |
| Subsequent | 800ms | 600ms | 500ms | 400ms |
| Memory | 1.2GB | 1.2GB | 1.2GB | 1.2GB |

Metal compiler not found:

```sh
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcrun --find metal
```

Build failure:

```sh
cargo clean
cargo build --release --features embedded-mlx
```

Ollama provides easy local model serving across all platforms.

  1. Install Ollama:

     ```sh
     # macOS
     brew install ollama

     # Linux
     curl -fsSL https://ollama.ai/install.sh | sh
     ```

  2. Start Ollama:

     ```sh
     ollama serve
     ```

  3. Pull a model:

     ```sh
     ollama pull qwen2.5-coder:latest
     ```
Configuration:

```toml
[backends.ollama]
enabled = true
host = "http://localhost:11434"
model = "qwen2.5-coder:latest"
timeout = 30
```
Recommended models:

| Model | Size | Speed | Quality |
|---|---|---|---|
| qwen2.5-coder:0.5b | 0.5GB | Fast | Good |
| qwen2.5-coder:1.5b | 1.1GB | Medium | Better |
| qwen2.5-coder:7b | 4.5GB | Slower | Best |
| codellama:7b | 4GB | Medium | Good |
Tips:

  • Keep Ollama running in the background
  • Use smaller models for faster responses
  • Increase the timeout for larger models
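The `host` and `model` settings above map onto Ollama's HTTP API. As a rough sketch (the `ollama_request_body` helper is hypothetical, and real code should use a JSON library so the prompt is escaped correctly), this is the shape of the body a client POSTs to Ollama's `/api/generate` endpoint:

```rust
// Hypothetical helper: builds the JSON body for a POST to Ollama's
// /api/generate endpoint. "stream": false asks for one complete
// response instead of token-by-token chunks.
fn ollama_request_body(model: &str, prompt: &str) -> String {
    // Naive string formatting for illustration only; use a JSON
    // library in real code to escape the prompt properly.
    format!(r#"{{"model":"{model}","prompt":"{prompt}","stream":false}}"#)
}

fn main() {
    let body = ollama_request_body("qwen2.5-coder:latest", "list files");
    println!("{body}");
}
```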

vLLM provides high-performance serving for production deployments.

  1. Install vLLM:

     ```sh
     pip install vllm
     ```

  2. Start the server:

     ```sh
     vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct \
       --port 8000 \
       --max-model-len 4096
     ```
Configuration:

```toml
[backends.vllm]
enabled = true
url = "http://localhost:8000"
timeout = 30
```
Docker Compose example (`docker-compose.yml`):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    command: >
      --model Qwen/Qwen2.5-Coder-1.5B-Instruct
      --max-model-len 4096
```
Approximate serving performance:

| GPUs | Throughput | Latency |
|---|---|---|
| 1x A100 | 100+ req/s | <500ms |
| 1x RTX 4090 | 50+ req/s | <800ms |
| 1x RTX 3090 | 30+ req/s | <1200ms |

caro automatically selects the best available backend:

  1. MLX - If on Apple Silicon with MLX support
  2. Ollama - If Ollama is running locally
  3. vLLM - If vLLM server is configured
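The priority order above amounts to a first-match check. A minimal sketch, assuming boolean availability probes (the function and parameter names here are hypothetical, not caro's actual API):

```rust
// Hypothetical sketch of the selection order described above: the
// first available backend wins, checked in priority order.
fn select_backend(mlx_available: bool, ollama_running: bool, vllm_configured: bool) -> Option<&'static str> {
    if mlx_available {
        Some("mlx")
    } else if ollama_running {
        Some("ollama")
    } else if vllm_configured {
        Some("vllm")
    } else {
        None
    }
}

fn main() {
    // On Apple Silicon with everything running, MLX wins.
    assert_eq!(select_backend(true, true, true), Some("mlx"));
    // Without MLX, Ollama takes priority over vLLM.
    assert_eq!(select_backend(false, true, true), Some("ollama"));
}
```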

Override via command line:

```sh
caro --backend ollama "list files"
caro --backend vllm "list files"
```

Or via environment:

```sh
export CARO_BACKEND=ollama
caro "list files"
```

Implement the ModelBackend trait for custom backends:

```rust
pub trait ModelBackend: Send + Sync {
    async fn generate(&self, prompt: &str) -> Result<String>;
    fn is_available(&self) -> bool;
    fn name(&self) -> &str;
}
```
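For illustration, here is a minimal mock implementation. The `EchoBackend` name, the simplified `Result` alias, and the hand-rolled `block_on` executor are all assumptions for this sketch (caro's real error type and async runtime will differ); it compiles on Rust 1.75+ thanks to native `async fn` in traits:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Simplified stand-in for caro's Result alias (assumption).
type Result<T> = std::result::Result<T, String>;

pub trait ModelBackend: Send + Sync {
    async fn generate(&self, prompt: &str) -> Result<String>;
    fn is_available(&self) -> bool;
    fn name(&self) -> &str;
}

// Hypothetical backend that just echoes the prompt back.
struct EchoBackend;

impl ModelBackend for EchoBackend {
    async fn generate(&self, prompt: &str) -> Result<String> {
        Ok(format!("echo: {prompt}"))
    }
    fn is_available(&self) -> bool {
        true
    }
    fn name(&self) -> &str {
        "echo"
    }
}

// Minimal executor for the example: polls the future to completion
// with a no-op waker, avoiding any external async runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop(_: *const ()) {}
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let backend = EchoBackend;
    assert!(backend.is_available());
    let reply = block_on(backend.generate("list files"));
    println!("{}: {:?}", backend.name(), reply);
}
```

Note that once `generate` is an `async fn`, the trait is no longer object-safe; dynamic dispatch over backends typically requires a crate like `async-trait` or boxed futures.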

See the source code for implementation examples.