Backend Status
This page documents the current working status of each inference backend in caro.
Status Overview
| Backend | Platform | Status | Real Inference |
|---|---|---|---|
| Embedded MLX | Apple Silicon | Working | Stub (GPU ready after Xcode) |
| Embedded CPU | All platforms | Working | Yes |
| Ollama | All platforms | Working | Yes |
| vLLM | Linux/Server | Working | Yes |
Embedded Backend (Default)
The embedded backend is the default and requires no external dependencies after the initial model download.
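As a quick check, the first run pulls the model into the local cache and later runs reuse it. The commands below assume the default embedded backend and the cache path documented later on this page:

```bash
# First run: downloads the ~1.1GB Qwen 2.5 Coder GGUF into the local cache
caro "list files"

# Subsequent runs load the cached model and need no network access
ls -lh ~/.cache/caro/models/
```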
MLX Backend (Apple Silicon)
Note: real GPU acceleration requires Xcode; without it, the default build uses a fast stub (see below).
What’s Working:
| Component | Status | Notes |
|---|---|---|
| Platform detection | Working | Correctly identifies M1/M2/M3/M4 |
| Model download | Working | 1.1GB Qwen 2.5 Coder from Hugging Face |
| Model loading | Working | Loads GGUF file into memory |
| Inference pipeline | Working | End-to-end flow operational |
| GPU acceleration | Requires Xcode | Metal compiler needed |
Current Implementation:
The default build uses a stub implementation that:
- Loads the actual 1.1GB model file
- Returns pattern-matched responses instantly
- Works immediately without Xcode installation
- Suitable for testing and development
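For illustration, a rough sketch of exercising the stub build. The default build command and the binary path are assumptions based on a standard cargo layout:

```bash
# Default build: no Metal toolchain required
cargo build --release

# Stub responses return in ~100ms because they are pattern-matched, not generated
./target/release/caro "list files"
```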
For Real GPU Inference:
1. Install Xcode from the App Store (15GB download)
2. Run `sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer`
3. Verify Metal: `xcrun --find metal`
4. Rebuild: `cargo build --release --features embedded-mlx`
Performance Comparison:
| Mode | First Inference | Subsequent | Model Load |
|---|---|---|---|
| Stub (default) | ~100ms | ~100ms | ~500ms |
| Real MLX (with Xcode) | < 2s | < 500ms | < 2s |
CPU Backend (Cross-Platform)
The CPU backend works on all platforms using the Candle framework.
| Metric | Value |
|---|---|
| Platform | Any (macOS, Linux, Windows) |
| Model | Qwen 2.5 Coder 1.5B (GGUF) |
| First inference | ~4-5s |
| Subsequent | ~3-4s |
| Memory | ~1.5GB |
Remote Backends
Ollama Backend
Ollama provides easy local model serving with good performance.
Setup:
```bash
# Install Ollama
brew install ollama                            # macOS
curl -fsSL https://ollama.ai/install.sh | sh   # Linux

# Start server
ollama serve

# Pull model
ollama pull qwen2.5-coder:1.5b
```

Configuration:

```toml
[backends.ollama]
enabled = true
host = "http://localhost:11434"
model = "qwen2.5-coder:1.5b"
```

Status:

| Feature | Status |
|---|---|
| HTTP API integration | Working |
| Model management | Working |
| Streaming responses | Working |
| Error handling | Working |
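As a quick end-to-end check, the `/api/tags` endpoint lists the models Ollama has pulled locally:

```bash
# Confirm the server is up and the model is available
curl -s http://localhost:11434/api/tags

# Route a request through the Ollama backend
caro --backend ollama "list files"
```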
vLLM Backend
vLLM provides high-performance serving for production deployments.
Setup:
```bash
# Install vLLM
pip install vllm

# Start server
vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --port 8000 \
  --max-model-len 4096
```

Configuration:

```toml
[backends.vllm]
enabled = true
url = "http://localhost:8000"
timeout = 30
```

Status:

| Feature | Status |
|---|---|
| OpenAI-compatible API | Working |
| Batch processing | Working |
| GPU acceleration | Working |
| Error handling | Working |
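A similar sanity check works here, since vLLM exposes an OpenAI-compatible API:

```bash
# List the models served by vLLM
curl -s http://localhost:8000/v1/models

# Route a request through the vLLM backend
caro --backend vllm "list files"
```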
Backend Selection
Automatic Selection
caro automatically selects the best backend in this order:
1. Embedded MLX - if running on Apple Silicon
2. Embedded CPU - if MLX is not available
3. Ollama - if an Ollama server is detected
4. vLLM - if a vLLM server is configured
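If you want to see which of these candidates exist on a given machine, the same checks can be approximated by hand. This is a rough sketch only; caro's internal detection logic may differ:

```bash
# Apple Silicon? (embedded MLX candidate)
uname -sm   # prints "Darwin arm64" on M-series Macs

# Ollama server on its default port?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:11434/

# vLLM (OpenAI-compatible) server on its default port?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/v1/models
```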
Manual Selection
Override the automatic selection:
```bash
# Use a specific backend
caro --backend ollama "list files"
caro --backend vllm "list files"

# Or via environment variable
export CARO_BACKEND=ollama
caro "list files"
```

Troubleshooting
MLX: “Metal compiler not found”
This error occurs when building with `--features embedded-mlx` without Xcode installed.
Solution: Install Xcode from the App Store, then:
```bash
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
xcrun --find metal   # Should print the path to the Metal compiler
```

Ollama: “Connection refused”
Ollama server not running.
Solution:
```bash
ollama serve   # Start server
ollama ps      # Check running models
```

vLLM: “Timeout”
Model loading or inference taking too long.
Solution: Increase timeout in config:
```toml
[backends.vllm]
timeout = 60   # Increase from the default 30s
```

Model Download Fails
Network or storage issues.
Solution:
```bash
# Clear cache
rm -rf ~/.cache/caro/models/

# Manual download
mkdir -p ~/.cache/caro/models
cd ~/.cache/caro/models
curl -L -o qwen2.5-coder-1.5b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF/resolve/main/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf"
```
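After a manual download, it is worth confirming the file landed in the right place and is roughly the expected size (~1.1GB):

```bash
# Verify the cached model file
ls -lh ~/.cache/caro/models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf

# Then retry
caro "list files"
```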
Performance Benchmarks
Measured on various hardware configurations:
| Hardware | Backend | First Inference | Subsequent | Memory |
|---|---|---|---|---|
| M4 Pro (14-core) | MLX (stub) | 100ms | 100ms | 1.1GB |
| M4 Pro (14-core) | MLX (real) | 1.5s | 400ms | 1.2GB |
| M1 MacBook Air | MLX (stub) | 100ms | 100ms | 1.1GB |
| M1 MacBook Air | MLX (real) | 2.5s | 800ms | 1.2GB |
| Intel Mac (i7) | CPU | 5s | 4s | 1.5GB |
| Linux x64 (Ryzen) | CPU | 4s | 3.5s | 1.5GB |
| Linux + RTX 4090 | vLLM | 0.5s | 0.3s | 4GB |
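To get a comparable number on your own hardware, time a cold and a warm run. This measures wall-clock time and includes CLI startup, so treat it only as a rough approximation of the table above; how much the second run benefits from caching depends on the backend:

```bash
# Cold run: includes model load plus first inference
time caro "list files"

# Warm run: closer to the "Subsequent" column
time caro "list files"
```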