ADR-001: LLM Inference
| Status | Accepted |
|---|---|
| Date | December 2025 |
| Authors | Caro Maintainers |
| Supersedes | N/A |
Table of Contents
- Executive Summary
- Context and Problem Statement
- Decision Drivers
- Design Philosophy
- Architecture Overview
- Backend System Design
- Crate Selection Rationale
- Apple Silicon Strategy
- Hugging Face Integration
- Security Considerations
- Future Direction
- Consequences
Executive Summary
This document records the architectural decisions governing LLM inference in Caro, a single-binary CLI tool that converts natural language to safe POSIX shell commands. The architecture embraces a multi-backend approach with first-class support for local inference on Apple Silicon, reflecting our belief that the future of AI inference is distributed—running on the machines closest to the data and the humans who control them.
Core Tenets:
- Local-first inference with remote fallback
- Apple Silicon as first-class citizen (MPS/MLX/Metal)
- Privacy through local control of models and data
- Unified trait system enabling seamless backend switching
- Hugging Face ecosystem for model distribution and caching
Context and Problem Statement
The Challenge
Section titled “The Challenge”We needed to design an inference system that could:
- Convert natural language to shell commands with sub-second response times
- Run entirely offline on consumer hardware (laptops, home offices)
- Support multiple inference backends with different performance profiles
- Maintain a single-binary distribution under 50MB
- Prioritize user privacy and security
The Landscape
The LLM inference ecosystem presents several paths:
| Approach | Pros | Cons |
|---|---|---|
| Cloud APIs (OpenAI, Anthropic) | Easy integration, powerful models | Data leaves machine, network dependency, cost |
| Local CPU inference (llama.cpp, Candle) | Works everywhere | Slow on non-GPU hardware |
| Local GPU inference (MLX, CUDA) | Fast, private | Platform-specific, complex setup |
| Hybrid | Best of both worlds | Complex architecture |
We chose local-first with multi-backend support as the foundational architecture.
Decision Drivers
Primary Drivers
- Privacy and Security: The model should only see what you choose to share
- Offline Capability: Must work without network connectivity
- Performance: First inference under 2 seconds on Apple Silicon
- Portability: Single binary, minimal external dependencies
- Extensibility: Easy to add new backends as hardware evolves
Secondary Drivers
- Developer experience (simple configuration, sensible defaults)
- Binary size constraints (< 50MB without embedded model)
- Memory efficiency on consumer hardware
- Cross-platform support (macOS, Linux, Windows)
Design Philosophy
The Belief: Distributed Inference is the Future
We hold a strong conviction about where inference is heading:
The future of AI is not centralized datacenters—it’s distributed inference running everywhere: in home offices, on laptops, inside edge devices, and yes, on the powerful machines already sitting on developers’ desks.
This belief stems from several observations:
1. Hardware Democratization
Apple Silicon has proven that consumer hardware can run serious AI workloads:
- M1/M2/M3/M4 chips: Unified memory architecture, powerful Neural Engine
- Metal Performance Shaders (MPS): GPU acceleration without CUDA
- MLX Framework: Apple’s native ML framework optimized for their silicon
NVIDIA’s DGX Spark and similar products signal that high-performance inference is moving from datacenters to under-desk machines. We expect to see significant growth in companies deploying inference hardware in home offices and small teams.
2. The Privacy Imperative
There is a harsh reality that must be acknowledged:
A model has access to whatever context you provide. The only way to truly control what a model sees is to control the model and the machine running it.
This doesn’t mean cloud APIs (Anthropic, OpenAI, etc.) are bad—they’re excellent for many use cases. The question is: what data are you willing to send through that transport?
For a CLI tool that sees:
- Your shell commands
- Your file paths and directory structures
- Your natural language descriptions of tasks
…local inference provides the most privacy-respecting default.
3. Network Independence
Remote inference is convenient, but it introduces:
- Latency variability
- Availability dependencies
- Rate limits and costs
- Network security considerations
Local inference eliminates these concerns entirely for the price of a one-time model download.
The Strategy: Local-First, Remote-Optional
Section titled “The Strategy: Local-First, Remote-Optional”┌─────────────────────────────────────────────────────────┐│ User Request │└───────────────────────────┬─────────────────────────────┘ │ ▼┌─────────────────────────────────────────────────────────┐│ Backend Selection ││ ┌─────────────────────────────────────────────────┐ ││ │ 1. User preference (config/CLI flag) │ ││ │ 2. Platform detection (Apple Silicon? GPU?) │ ││ │ 3. Availability check (model cached? API up?) │ ││ └─────────────────────────────────────────────────┘ │└───────────────────────────┬─────────────────────────────┘ │ ┌───────────────┼───────────────┐ ▼ ▼ ▼ ┌───────────┐ ┌───────────┐ ┌───────────────┐ │ MLX │ │ CPU │ │ Remote │ │ Backend │ │ Backend │ │ Backends │ │ (macOS │ │ (Candle) │ │ (Ollama/vLLM) │ │ aarch64) │ │ │ │ │ └───────────┘ └───────────┘ └───────────────┘ │ │ │ └───────────────┼───────────────┘ ▼ ┌───────────────┐ │ Generated │ │ Command │ └───────────────┘Architecture Overview
Trait-Based Backend System
All backends implement a common CommandGenerator trait:
```rust
#[async_trait]
pub trait CommandGenerator: Send + Sync {
    /// Generate a shell command from natural language input
    async fn generate_command(
        &self,
        request: &CommandRequest,
    ) -> Result<GeneratedCommand, GeneratorError>;

    /// Check if this backend is currently available for use
    async fn is_available(&self) -> bool;

    /// Get information about this backend's capabilities and performance
    fn backend_info(&self) -> BackendInfo;

    /// Perform any necessary cleanup when shutting down
    async fn shutdown(&self) -> Result<(), GeneratorError>;
}
```
This design enables:
- Runtime backend switching without code changes
- Graceful fallback when primary backend unavailable
- Uniform error handling across all backends
- Performance introspection for backend selection decisions
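For illustration, the sketch below shows how a caller could walk an ordered list of backends and fall back when one is unavailable or fails. Only the CommandGenerator trait and its associated types come from the definition above; the function name and the NoBackendAvailable error variant are hypothetical.

```rust
use std::sync::Arc;

// Hypothetical helper: try each configured backend in priority order and
// return the first successful generation.
async fn generate_with_fallback(
    backends: &[Arc<dyn CommandGenerator>],
    request: &CommandRequest,
) -> Result<GeneratedCommand, GeneratorError> {
    for backend in backends {
        // Skip backends that report themselves unavailable
        // (model not cached, API unreachable, wrong platform, ...).
        if !backend.is_available().await {
            continue;
        }
        match backend.generate_command(request).await {
            Ok(command) => return Ok(command),
            // On failure, fall through to the next backend in the list.
            Err(_) => continue,
        }
    }
    // Hypothetical error variant: nothing in the list could serve the request.
    Err(GeneratorError::NoBackendAvailable)
}
```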
Backend Hierarchy
Section titled “Backend Hierarchy”src/backends/├── mod.rs # CommandGenerator trait, BackendInfo, GeneratorError├── embedded/│ ├── mod.rs # Embedded backend exports│ ├── common.rs # InferenceBackend trait (internal)│ ├── embedded_backend.rs # EmbeddedModelBackend orchestrator│ ├── mlx.rs # Apple Silicon MLX backend│ └── cpu.rs # Cross-platform CPU backend└── remote/ ├── mod.rs # Remote backend exports ├── ollama.rs # Ollama API backend └── vllm.rs # vLLM API backend (OpenAI-compatible)Feature Gating
Section titled “Feature Gating”[features]default = ["embedded-mlx", "embedded-cpu"]embedded-mlx = ["cxx", "llama_cpp"]embedded-cpu = ["candle-core", "candle-transformers"]remote-backends = ["reqwest", "tokio/net"]full = ["remote-backends", "embedded-mlx", "embedded-cpu"]This allows:
- Minimal builds without remote backends
- Platform-specific optimizations
- Reduced binary size for embedded use cases
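As a rough sketch of how these feature flags might gate backend registration at compile time (the registration function and backend type names here are hypothetical, not the actual module layout):

```rust
// Illustrative only: feature flags decide which backends are even compiled in,
// and platform cfg narrows the MLX backend to Apple Silicon macOS builds.
fn compiled_backends() -> Vec<Box<dyn CommandGenerator>> {
    let mut backends: Vec<Box<dyn CommandGenerator>> = Vec::new();

    #[cfg(all(feature = "embedded-mlx", target_os = "macos", target_arch = "aarch64"))]
    backends.push(Box::new(MlxBackend::default()));

    #[cfg(feature = "embedded-cpu")]
    backends.push(Box::new(CpuBackend::default()));

    #[cfg(feature = "remote-backends")]
    {
        backends.push(Box::new(OllamaBackend::default()));
        backends.push(Box::new(VllmBackend::default()));
    }

    backends
}
```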
Backend System Design
Embedded Backends
MLX Backend (Apple Silicon)
Decision: Use llama_cpp crate with Metal feature for Apple Silicon inference.
Rationale:
- Mature ecosystem: llama.cpp has extensive model support and optimization
- Metal integration: Native GPU acceleration on Apple Silicon
- GGUF format: Efficient quantized model format, widely adopted
- Memory efficiency: Uses mmap for model loading, unified memory aware
Configuration:
```
// GPU optimization for Apple Silicon
n_gpu_layers: 99,        // All layers on GPU
use_mmap: true,          // Memory-mapped loading
use_mlock: false,        // Don't pin to RAM (let unified memory manage)
context_size: 2048,      // Sufficient for command generation
batch_size: 512,         // Prompt processing batch

// Sampling for command generation
temperature: 0.7,        // Balanced creativity/determinism
top_k: 40,               // Limit candidate tokens
top_p: 0.95,             // Nucleus sampling
repetition_penalty: 1.1, // Avoid repetition
```
Why not pure mlx-rs?
We evaluated mlx-rs (pure Rust MLX bindings) but encountered significant friction:
| Factor | llama_cpp | mlx-rs |
|---|---|---|
| Build deps | CMake | CMake + Xcode (Metal compiler) |
| Model format | GGUF (universal) | MLX format (Apple-specific) |
| Ecosystem | Huge model zoo | Smaller selection |
| Maturity | Battle-tested | Newer, evolving |
| Binary size | ~5MB overhead | ~3MB overhead |
The mlx-rs crate requires a full Xcode installation (~15GB) to compile Metal shaders, while llama_cpp with Metal works with pre-compiled kernels. For a CLI tool targeting developers who may not have Xcode installed, this was a significant barrier.
CPU Backend (Cross-Platform)
Decision: Use Candle for cross-platform CPU inference.
Rationale:
- Pure Rust: No C++ dependencies, easier cross-compilation
- Hugging Face maintained: Well-supported, frequent updates
- Transformer support: Native support for Qwen architecture
- Fallback role: Ensures Caro works everywhere, even without GPU
Current Status: Stub implementation pending full Candle integration. The stub provides API compatibility while MLX is prioritized for Apple Silicon users.
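A minimal sketch of what such a stub might look like: it keeps the CommandGenerator surface intact while reporting itself unavailable, so callers fall back to another backend. The error variant and BackendInfo constructor used here are hypothetical.

```rust
pub struct CpuBackend;

#[async_trait::async_trait]
impl CommandGenerator for CpuBackend {
    async fn generate_command(
        &self,
        _request: &CommandRequest,
    ) -> Result<GeneratedCommand, GeneratorError> {
        // Hypothetical error variant: surfaced until Candle inference lands.
        Err(GeneratorError::BackendUnavailable(
            "CPU backend not yet implemented; use MLX or a remote backend".into(),
        ))
    }

    async fn is_available(&self) -> bool {
        false // Never selected automatically while it is a stub.
    }

    fn backend_info(&self) -> BackendInfo {
        // Hypothetical constructor, for illustration only.
        BackendInfo::new("cpu-candle", "Cross-platform CPU backend (stub)")
    }

    async fn shutdown(&self) -> Result<(), GeneratorError> {
        Ok(())
    }
}
```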
Remote Backends
Ollama Backend
Decision: Support Ollama as a local-remote hybrid option.
Rationale:
- Zero config: Works with models already managed by Ollama
- Model flexibility: Users can run any Ollama-supported model
- Resource isolation: Inference runs in Ollama process
- Familiar workflow: Many developers already use Ollama
Implementation:
```
// Ollama REST API integration
base_url: "http://localhost:11434",
endpoint: "/api/generate",
timeout: 30s,
temperature: 0.1,  // More deterministic for commands
top_k: 10,
top_p: 0.3,
```
vLLM Backend
Decision: Support vLLM for high-performance server deployments.
Rationale:
- Production-grade: Designed for high-throughput inference
- OpenAI-compatible: Standard API, easy integration
- Team scenarios: Shared inference server for organizations
- GPU utilization: Better batching and scheduling
Implementation:
```
// OpenAI-compatible API
endpoint: "/v1/chat/completions",
authentication: Bearer token (optional),
timeout: 30s,
temperature: 0.1,
max_tokens: 100,
```
Crate Selection Rationale
Core Dependencies
Section titled “Core Dependencies”| Crate | Purpose | Why This Crate? |
|---|---|---|
| llama_cpp | MLX inference | Metal support, GGUF format, mature, low overhead |
| candle-core/transformers | CPU inference | Pure Rust, HF maintained, transformer architectures |
| hf-hub | Model download | Official HF crate, async support, caching built-in |
| tokio | Async runtime | Industry standard, multi-threaded, excellent ecosystem |
| reqwest | HTTP client | Async, rustls (no OpenSSL), feature-gated |
| serde/serde_json | Serialization | Universal standard, excellent derive macros |
Inference-Specific Choices
llama_cpp over alternatives
| Alternative | Why Not? |
|---|---|
| rust-bert | Heavy deps, ONNX focus, larger binary |
| ort (ONNX Runtime) | Requires ONNX conversion, extra step |
| ctranslate2-rs | Smaller ecosystem, conversion needed |
| mlx-rs | Requires full Xcode for Metal compiler |
| tract | Limited model support, inference-only |
Decision: llama_cpp with Metal feature provides the best balance of:
- GGUF format support (no conversion)
- Metal acceleration (GPU inference)
- Build simplicity (no Xcode requirement)
- Model availability (huge GGUF ecosystem)
Candle over alternatives for CPU
| Alternative | Why Not? |
|---|---|
| llama.cpp (CPU mode) | Larger binary, C++ deps |
| tract | Limited transformer support |
| ggml-rs | Less maintained than candle |
Decision: Candle provides pure Rust CPU inference with direct HF integration.
Supporting Crates
| Crate | Purpose | Why This Crate? |
|---|---|---|
| tokenizers | Text tokenization | HF official, fast, model-compatible |
| sha2 | Checksum validation | Standard, no deps, WASM-friendly |
| regex | Safety patterns | Fast, feature-rich, pre-compilation |
| once_cell | Lazy statics | Thread-safe, zero-cost after init |
| directories | Cache paths | XDG-compliant, cross-platform |
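For example, resolving the cache location with the directories crate could look roughly like the sketch below. The qualifier/organization strings are hypothetical, and the resulting path is platform-dependent (e.g. ~/.cache/caro on Linux, ~/Library/Caches/... on macOS).

```rust
use std::path::PathBuf;

// Illustrative sketch: XDG-compliant, cross-platform cache path resolution.
fn model_cache_dir() -> Option<PathBuf> {
    let dirs = directories::ProjectDirs::from("dev", "caro", "caro")?;
    Some(dirs.cache_dir().join("models"))
}
```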
Apple Silicon Strategy
Why First-Class Apple Silicon Support?
- Market reality: Significant developer population on Mac
- Performance: M-series chips excel at ML workloads
- Unified memory: Eliminates GPU↔CPU transfer overhead
- MLX ecosystem: Apple is investing heavily in local ML
- DX priority: Caro targets developers, many on Mac
Metal/MPS/MLX Clarification
These terms are often confused:
| Term | What It Is | Caro Usage |
|---|---|---|
| Metal | Apple’s low-level GPU API (like Vulkan) | Used by llama.cpp for GPU kernels |
| MPS | Metal Performance Shaders, optimized GPU ops | Used by llama.cpp’s Metal backend |
| MLX | Apple’s high-level ML framework | Future integration via mlx-rs |
| Neural Engine | Dedicated NPU on Apple Silicon | Not currently used (via MLX in future) |
Current Implementation
Section titled “Current Implementation”#[cfg(all(target_os = "macos", target_arch = "aarch64"))]mod mlx { // llama_cpp with Metal feature // - GPU-accelerated inference // - GGUF model format // - Unified memory optimized}
#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]mod mlx { // Stub implementation // Falls back to CPU backend}Performance Targets
| Metric | Target | Current |
|---|---|---|
| Startup time | < 100ms | ~50ms |
| First inference | < 2s | ~1.8s (M4 Pro) |
| Subsequent inference | < 500ms | ~400ms |
| Memory usage | ~1.2GB | ~1.1GB |
Future MLX Integration
When mlx-rs matures and provides pre-compiled Metal shaders (eliminating the Xcode requirement), we plan to:
- Add native MLX backend alongside llama.cpp
- Support MLX model format for Apple-optimized models
- Leverage Neural Engine where beneficial
- Potentially use MLX as primary, llama.cpp as fallback
Hugging Face Integration
Model Distribution Strategy
Decision: Use Hugging Face Hub as the primary model distribution mechanism.
Rationale:
- Standard infrastructure: Where the models already live
- Versioning: Model versions tracked automatically
- Caching: Built-in caching with the hf-hub crate
- Offline support: Works offline once model cached
- Community: Access to thousands of quantized models
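A minimal sketch of fetching the model through hf-hub's async API (assuming the crate's tokio feature is enabled; error handling simplified):

```rust
use std::path::PathBuf;

// Illustrative sketch: hf-hub resolves the repo, downloads the file if it is
// not already cached, and returns the local path either way.
async fn fetch_model() -> Result<PathBuf, Box<dyn std::error::Error>> {
    let api = hf_hub::api::tokio::Api::new()?;
    let repo = api.model("Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF".to_string());
    // A no-op network-wise when running offline with a warm cache.
    let path = repo.get("qwen2.5-coder-1.5b-instruct-q4_k_m.gguf").await?;
    Ok(path)
}
```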
Current Model
Section titled “Current Model”Repository: Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUFFile: qwen2.5-coder-1.5b-instruct-q4_k_m.ggufSize: ~1.1GBQuantization: Q4_K_M (4-bit, excellent quality/size)Cache Architecture
```
~/.cache/caro/
├── models/
│   └── qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
├── manifest.json   # Model metadata, checksums, timestamps
└── config/         # User preferences cache
```
Features:
- LRU eviction for multiple models
- SHA256 checksum validation
- Last-accessed timestamp tracking
- Configurable max cache size
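To make the manifest concrete, a hypothetical serde shape for one entry might look like this (the actual schema may differ):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical manifest.json entry covering the fields the cache features
// above rely on: checksum, size, and last-accessed time for LRU eviction.
#[derive(Serialize, Deserialize)]
struct ManifestEntry {
    /// File name relative to the models/ directory.
    file: String,
    /// Expected SHA256 checksum, hex-encoded.
    sha256: String,
    /// Size in bytes; validates downloads and enforces the cache size cap.
    size_bytes: u64,
    /// Unix timestamp of last use, consulted by LRU eviction.
    last_accessed: u64,
}
```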
Model Loading Flow
```
1. Check bundled model (if any)
   ↓ not found
2. Check cache directory
   ↓ not found or invalid
3. Download from Hugging Face Hub
   ↓ downloaded
4. Validate checksum
   ↓ valid
5. Load into backend
```
Lazy Loading: Models load on first inference, not startup, for fast CLI launch.
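A minimal sketch of that lazy-loading pattern using tokio's OnceCell; LoadedModel and load_model are hypothetical stand-ins for the real backend internals:

```rust
use tokio::sync::OnceCell;

struct LoadedModel; // Hypothetical placeholder for the in-memory model.

static MODEL: OnceCell<LoadedModel> = OnceCell::const_new();

async fn model() -> Result<&'static LoadedModel, GeneratorError> {
    // The first caller pays the load cost (steps 1-5 above); later callers
    // get the cached reference, keeping CLI startup fast.
    MODEL.get_or_try_init(load_model).await
}

async fn load_model() -> Result<LoadedModel, GeneratorError> {
    // Hypothetical: locate, download if needed, validate, and load the model.
    Ok(LoadedModel)
}
```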
Security Considerations
The Privacy Spectrum
```
                      DATA EXPOSURE SPECTRUM

  LOCAL INFERENCE                          REMOTE INFERENCE
  (Full Privacy)                           (Shared with Provider)
  ◄──────────────────────────────────────────────────────────►

  • Your prompts stay local                • Prompts sent over network
  • No API keys needed                     • API keys required
  • Works offline                          • Network required
  • You control the model                  • Provider controls model
  • Full auditability                      • Trust required
```
Our Position
Using cloud APIs (Anthropic, OpenAI, etc.) is not inherently bad. The question is always: what data are you willing to share through that transport?
For Caro, local inference is the default because:
- Shell commands reveal intent: “delete all backup files older than 30 days” reveals infrastructure details
- File paths reveal structure: /home/user/company-secrets/financial/ is metadata
- Context accumulates: A session of commands builds a picture of your work
- Defaults matter: Most users won’t change defaults, so defaults should be private
Remote Backend Security
When remote backends are used:
```
// vLLM: Bearer token authentication
Authorization: Bearer <api_key>

// Ollama: Local by default, but supports remote
// Warning: Ensure Ollama is not exposed publicly

// All remote: TLS via rustls (no OpenSSL)
```
Security measures:
- TLS-only connections (HTTP upgraded to HTTPS)
- No credential storage (environment variables preferred)
- Clear error messages for auth failures
- Separate handling for 401/403 (no fallback on auth errors)
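As an illustration of that last point, response-status handling might look roughly like this (the error variant names are hypothetical):

```rust
// Illustrative sketch: 401/403 are surfaced to the user rather than masked
// by silently falling back to another backend; other failures remain
// fallback-eligible.
fn classify_status(status: reqwest::StatusCode) -> Result<(), GeneratorError> {
    match status.as_u16() {
        200..=299 => Ok(()),
        401 | 403 => Err(GeneratorError::AuthenticationFailed(status.to_string())),
        _ => Err(GeneratorError::RequestFailed(status.to_string())),
    }
}
```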
Model Security
Model validation:
- SHA256 checksum verification on download
- File size validation (prevents truncated downloads)
- Manifest tracking for integrity
- No automatic model updates (user-initiated only)
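A sketch of the checksum step using the sha2 crate (error handling simplified):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

// Illustrative sketch: stream the ~1.1GB model file through the hasher
// instead of reading it fully into memory, then compare hex digests.
fn verify_sha256(path: &Path, expected_hex: &str) -> io::Result<bool> {
    let mut file = fs::File::open(path)?;
    let mut hasher = Sha256::new();
    io::copy(&mut file, &mut hasher)?;
    let actual_hex: String = hasher
        .finalize()
        .iter()
        .map(|byte| format!("{:02x}", byte))
        .collect();
    Ok(actual_hex.eq_ignore_ascii_case(expected_hex))
}
```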
Future Direction
Short-Term (6 months)
- Complete Candle CPU backend: Full cross-platform support
- Streaming responses: Progressive output for longer generations
- Model selection: Allow users to choose from multiple models
- Performance profiling: Identify optimization opportunities
Medium-Term (12 months)
Section titled “Medium-Term (12 months)”- Native MLX integration: When
mlx-rsprovides pre-compiled shaders - Neural Engine support: Via MLX for applicable workloads
- Windows GPU support: DirectML or CUDA backends
- Model fine-tuning: Domain-specific command generation
Long-Term Vision
- Embedded model distribution: Single binary with model included
- Federated learning: Improve models from anonymized usage patterns
- Hardware acceleration detection: Automatic optimal backend selection
- Edge deployment: ARM/embedded Linux support for IoT
Hardware Trends We’re Tracking
| Hardware | Timeline | Implications |
|---|---|---|
| DGX Spark | Now | Desktop inference stations becoming viable |
| Apple Silicon | Ongoing | Unified memory, NPU integration |
| Intel NPU (Meteor Lake) | 2024+ | x86 laptops gain inference acceleration |
| Qualcomm AI PC | 2024+ | ARM Windows with on-device AI |
| AMD XDNA (Ryzen AI) | 2024+ | CPU-integrated NPU |
Our architecture is designed to accommodate these developments through the backend trait system.
Consequences
Positive
- Privacy by default: Users’ data stays on their machine
- Works offline: No network dependency for core functionality
- Fast iteration: Sub-second inference on Apple Silicon
- Extensible: New backends can be added without refactoring
- Small footprint: Single binary under 50MB
- Cross-platform: Works on macOS, Linux, Windows
Negative
- Initial download: ~1.1GB model download on first use
- Disk space: Model consumes cache space
- Memory usage: ~1.2GB during inference
- Build complexity: Feature flags add conditional compilation
- Apple Silicon priority: CPU backends less optimized
- llama.cpp churn: Rapid development may require updates
- Model updates: Newer models may require architecture changes
- mlx-rs maturity: Unknown when Xcode requirement will be lifted
- Quantization quality: Q4 quantization has accuracy limits
Mitigations
- Version pinning: Lock llama_cpp to tested versions
- Trait abstraction: Backend system isolates model changes
- Dual path: llama.cpp + future mlx-rs as fallbacks for each other
- Model selection: Allow users to choose larger models if needed
Appendix A: Backend Comparison Matrix
| Feature | MLX (llama.cpp) | CPU (Candle) | Ollama | vLLM |
|---|---|---|---|---|
| Platform | macOS aarch64 | All | All | All |
| Network | No | No | Local | Yes |
| Model format | GGUF | SafeTensors | Varies | Varies |
| First inference | ~1.8s | ~4s | ~2s | ~3s |
| Memory | ~1.2GB | ~1.5GB | External | External |
| Offline | Yes | Yes | Yes* | No |
| GPU accel | Metal | No | Depends | Yes |
| Feature flag | embedded-mlx | embedded-cpu | remote-backends | remote-backends |
*Ollama works offline if model already downloaded
Appendix B: Model Selection Rationale
Qwen 2.5 Coder 1.5B was chosen for:
- Size: 1.5B parameters fits in unified memory comfortably
- Coding focus: Trained specifically for code-related tasks
- Instruction tuning: Follows JSON output format reliably
- Quantization: Q4_K_M provides good quality at 1.1GB
- License: Apache 2.0, permissive for CLI distribution
- Availability: On Hugging Face in GGUF format
Appendix C: Related Documents
This ADR was authored in December 2025 and reflects the state of the Caro project at that time. Architectural decisions may evolve as the LLM ecosystem matures.