Skip to content

Backend Status

This page documents the current working status of each inference backend in caro.

BackendPlatformStatusReal Inference
Embedded MLXApple SiliconWorkingStub (GPU ready after Xcode)
Embedded CPUAll platformsWorkingYes
OllamaAll platformsWorkingYes
vLLMLinux/ServerWorkingYes

The embedded backend is the default and requires no external dependencies after initial model download.

Status

Fully Functional

GPU Acceleration

Available with Xcode

What’s Working:

ComponentStatusNotes
Platform detectionWorkingCorrectly identifies M1/M2/M3/M4
Model downloadWorking1.1GB Qwen 2.5 Coder from Hugging Face
Model loadingWorkingLoads GGUF file into memory
Inference pipelineWorkingEnd-to-end flow operational
GPU accelerationRequires XcodeMetal compiler needed

Current Implementation:

The default build uses a stub implementation that:

  • Loads the actual 1.1GB model file
  • Returns pattern-matched responses instantly
  • Works immediately without Xcode installation
  • Suitable for testing and development

For Real GPU Inference:

  1. Install Xcode from the App Store (15GB download)
  2. Run sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
  3. Verify Metal: xcrun --find metal
  4. Rebuild: cargo build --release --features embedded-mlx

Performance Comparison:

ModeFirst InferenceSubsequentModel Load
Stub (default)~100ms~100ms~500ms
Real MLX (with Xcode)< 2s< 500ms< 2s
Fully Functional

The CPU backend works on all platforms using the Candle framework.

MetricValue
PlatformAny (macOS, Linux, Windows)
ModelQwen 2.5 Coder 1.5B (GGUF)
First inference~4-5s
Subsequent~3-4s
Memory~1.5GB
Fully Functional

Ollama provides easy local model serving with good performance.

Setup:

Terminal window
# Install Ollama
brew install ollama # macOS
curl -fsSL https://ollama.ai/install.sh | sh # Linux
# Start server
ollama serve
# Pull model
ollama pull qwen2.5-coder:1.5b

Configuration:

~/.config/caro/config.toml
[backends.ollama]
enabled = true
host = "http://localhost:11434"
model = "qwen2.5-coder:1.5b"

Status:

FeatureStatus
HTTP API integrationWorking
Model managementWorking
Streaming responsesWorking
Error handlingWorking
Fully Functional

vLLM provides high-performance serving for production deployments.

Setup:

Terminal window
# Install vLLM
pip install vllm
# Start server
vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct \
--port 8000 \
--max-model-len 4096

Configuration:

~/.config/caro/config.toml
[backends.vllm]
enabled = true
url = "http://localhost:8000"
timeout = 30

Status:

FeatureStatus
OpenAI-compatible APIWorking
Batch processingWorking
GPU accelerationWorking
Error handlingWorking

caro automatically selects the best backend in this order:

  1. Embedded MLX - If on Apple Silicon
  2. Embedded CPU - If MLX not available
  3. Ollama - If Ollama server detected
  4. vLLM - If vLLM server configured

Override the automatic selection:

Terminal window
# Use specific backend
caro --backend ollama "list files"
caro --backend vllm "list files"
# Or via environment
export CARO_BACKEND=ollama
caro "list files"

This error occurs when building with --features embedded-mlx without Xcode.

Solution: Install Xcode from the App Store, then:

Terminal window
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcrun --find metal # Should show /usr/bin/metal

Ollama server not running.

Solution:

Terminal window
ollama serve # Start server
ollama ps # Check running models

Model loading or inference taking too long.

Solution: Increase timeout in config:

[backends.vllm]
timeout = 60 # Increase from default 30s

Network or storage issues.

Solution:

Terminal window
# Clear cache
rm -rf ~/.cache/caro/models/
# Manual download
mkdir -p ~/.cache/caro/models
cd ~/.cache/caro/models
curl -L -o qwen2.5-coder-1.5b-instruct-q4_k_m.gguf \
"https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF/resolve/main/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf"

Measured on various hardware configurations:

HardwareBackendFirst InferenceSubsequentMemory
M4 Pro (14-core)MLX (stub)100ms100ms1.1GB
M4 Pro (14-core)MLX (real)1.5s400ms1.2GB
M1 MacBook AirMLX (stub)100ms100ms1.1GB
M1 MacBook AirMLX (real)2.5s800ms1.2GB
Intel Mac (i7)CPU5s4s1.5GB
Linux x64 (Ryzen)CPU4s3.5s1.5GB
Linux + RTX 4090vLLM0.5s0.3s4GB