Backend Status
This page documents the current working status of each inference backend in caro.
Status Overview
| Backend | Platform | Status | Real Inference |
|---|---|---|---|
| Embedded MLX | Apple Silicon | Working | Stub (GPU ready after Xcode) |
| Embedded CPU | All platforms | Working | Yes |
| Ollama | All platforms | Working | Yes |
| vLLM | Linux/Server | Working | Yes |
Embedded Backend (Default)
The embedded backend is the default and requires no external dependencies after the initial model download.
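As a quick check, the first run pulls the model into the local cache and later runs reuse it. The commands below assume the default embedded backend and the cache path documented later on this page:

```bash
# First run: downloads the ~1.1GB Qwen 2.5 Coder GGUF into the local cache
caro "list files"

# Subsequent runs load the cached model and need no network access
ls -lh ~/.cache/caro/models/
```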
MLX Backend (Apple Silicon)
Note: real GPU acceleration requires Xcode; without it, the default build uses a fast stub (see below).
What’s Working:
| Component | Status | Notes |
|---|---|---|
| Platform detection | Working | Correctly identifies M1/M2/M3/M4 |
| Model download | Working | 1.1GB Qwen 2.5 Coder from Hugging Face |
| Model loading | Working | Loads GGUF file into memory |
| Inference pipeline | Working | End-to-end flow operational |
| GPU acceleration | Requires Xcode | Metal compiler needed |
Current Implementation:
The default build uses a stub implementation that:
- Loads the actual 1.1GB model file
- Returns pattern-matched responses instantly
- Works immediately without Xcode installation
- Suitable for testing and development
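For illustration, a rough sketch of exercising the stub build. The default build command and the binary path are assumptions based on a standard cargo layout:

```bash
# Default build: no Metal toolchain required
cargo build --release

# Stub responses return in ~100ms because they are pattern-matched, not generated
./target/release/caro "list files"
```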
For Real GPU Inference:
1. Install Xcode from the App Store (15GB download)
2. Run `sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer`
3. Verify Metal: `xcrun --find metal`
4. Rebuild: `cargo build --release --features embedded-mlx`
Performance Comparison:
| Mode | First Inference | Subsequent | Model Load |
|---|---|---|---|
| Stub (default) | ~100ms | ~100ms | ~500ms |
| Real MLX (with Xcode) | < 2s | < 500ms | < 2s |
CPU Backend (Cross-Platform)
The CPU backend works on all platforms using the Candle framework.
| Metric | Value |
|---|---|
| Platform | Any (macOS, Linux, Windows) |
| Model | Qwen 2.5 Coder 1.5B (GGUF) |
| First inference | ~4-5s |
| Subsequent | ~3-4s |
| Memory | ~1.5GB |
Remote Backends
Ollama Backend
Ollama provides easy local model serving with good performance.
Setup:
```bash
# Install Ollama
brew install ollama                            # macOS
curl -fsSL https://ollama.ai/install.sh | sh   # Linux

# Start server
ollama serve

# Pull model
ollama pull qwen2.5-coder:1.5b
```

Configuration:

```toml
[backends.ollama]
enabled = true
host = "http://localhost:11434"
model = "qwen2.5-coder:1.5b"
```

Status:

| Feature | Status |
|---|---|
| HTTP API integration | Working |
| Model management | Working |
| Streaming responses | Working |
| Error handling | Working |
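As a quick end-to-end check, the `/api/tags` endpoint lists the models Ollama has pulled locally:

```bash
# Confirm the server is up and the model is available
curl -s http://localhost:11434/api/tags

# Route a request through the Ollama backend
caro --backend ollama "list files"
```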
vLLM Backend
vLLM provides high-performance serving for production deployments.
Setup:
```bash
# Install vLLM
pip install vllm

# Start server
vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --port 8000 \
  --max-model-len 4096
```

Configuration:

```toml
[backends.vllm]
enabled = true
url = "http://localhost:8000"
timeout = 30
```

Status:

| Feature | Status |
|---|---|
| OpenAI-compatible API | Working |
| Batch processing | Working |
| GPU acceleration | Working |
| Error handling | Working |
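A similar sanity check works here, since vLLM exposes an OpenAI-compatible API:

```bash
# List the models served by vLLM
curl -s http://localhost:8000/v1/models

# Route a request through the vLLM backend
caro --backend vllm "list files"
```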
Backend Selection
Automatic Selection
caro automatically selects the best backend in this order:
1. Embedded MLX - if running on Apple Silicon
2. Embedded CPU - if MLX is not available
3. Ollama - if an Ollama server is detected
4. vLLM - if a vLLM server is configured
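If you want to see which of these candidates exist on a given machine, the same checks can be approximated by hand. This is a rough sketch only; caro's internal detection logic may differ:

```bash
# Apple Silicon? (embedded MLX candidate)
uname -sm   # prints "Darwin arm64" on M-series Macs

# Ollama server on its default port?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:11434/

# vLLM (OpenAI-compatible) server on its default port?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/v1/models
```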
Manual Selection
Override the automatic selection:
```bash
# Use a specific backend
caro --backend ollama "list files"
caro --backend vllm "list files"

# Or via environment variable
export CARO_BACKEND=ollama
caro "list files"
```

Troubleshooting
MLX: “Metal compiler not found”
This error occurs when building with `--features embedded-mlx` without Xcode installed.
Solution: Install Xcode from the App Store, then:
```bash
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcrun --find metal   # Should print the path to the Metal compiler
```

Ollama: “Connection refused”
Ollama server not running.
Solution:
```bash
ollama serve   # Start server
ollama ps      # Check running models
```

vLLM: “Timeout”
Model loading or inference taking too long.
Solution: Increase timeout in config:
```toml
[backends.vllm]
timeout = 60   # Increase from the default 30s
```

Model Download Fails
Network or storage issues.
Solution:
```bash
# Clear cache
rm -rf ~/.cache/caro/models/

# Manual download
mkdir -p ~/.cache/caro/models
cd ~/.cache/caro/models
curl -L -o qwen2.5-coder-1.5b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF/resolve/main/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf"
```
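After a manual download, it is worth confirming the file landed in the right place and is roughly the expected size (~1.1GB):

```bash
# Verify the cached model file
ls -lh ~/.cache/caro/models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Then retry
caro "list files"
```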
Performance Benchmarks
Measured on various hardware configurations:
| Hardware | Backend | First Inference | Subsequent | Memory |
|---|---|---|---|---|
| M4 Pro (14-core) | MLX (stub) | 100ms | 100ms | 1.1GB |
| M4 Pro (14-core) | MLX (real) | 1.5s | 400ms | 1.2GB |
| M1 MacBook Air | MLX (stub) | 100ms | 100ms | 1.1GB |
| M1 MacBook Air | MLX (real) | 2.5s | 800ms | 1.2GB |
| Intel Mac (i7) | CPU | 5s | 4s | 1.5GB |
| Linux x64 (Ryzen) | CPU | 4s | 3.5s | 1.5GB |
| Linux + RTX 4090 | vLLM | 0.5s | 0.3s | 4GB |
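To get a comparable number on your own hardware, time a cold and a warm run. This measures wall-clock time and includes CLI startup, so treat it only as a rough approximation of the table above; how much the second run benefits from caching depends on the backend:

```bash
# Cold run: includes model load plus first inference
time caro "list files"

# Warm run: closer to the "Subsequent" column
time caro "list files"
```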