# LLM Routing Strategy

When to use a local model vs an Anthropic frontier model in the voice pipeline.

---

## Current model inventory

| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | n/a |

Total local VRAM: ~6–7 GB on an RTX 3090 (24 GB), which leaves roughly 17 GB of headroom.

---

## Routing principles

### Use local model when:

- Task is **classification or routing** (completeness, intent detection, command vs question)
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
- Task is **high-frequency** (runs on every utterance, so API billing would add up)
- **Privacy matters**: query content shouldn't leave the machine

### Use Anthropic frontier model when:

- Task requires **reasoning, synthesis, or judgment**
- Task involves **code generation or editing**
- Task involves **world knowledge** beyond what a small model handles well
- Query is **low-frequency** (once per user request, not once per audio frame)
- Quality matters more than latency

---

## GPU split (future option)

If voice models and reasoning models compete for VRAM, split across cards:

- **Card A** (smaller VRAM): STT + TTS + classifier. Pure inference, no large models needed.
- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API.

Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.

---

## Upgrade candidates

- **Whisper small.en or medium.en**: better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
- **Completeness classifier → larger model**: Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification.
- **Local reasoning model**: for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
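
If the local reasoning model is added, the routing principles earlier in this document become a three-tier decision: small classifier, mid-size local model, frontier API. The sketch below is one possible way to encode that, not the pipeline's actual router. The `Task` fields, the task-kind labels, and the `route` function are all hypothetical names invented for illustration, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL_CLASSIFIER = "local-classifier"  # the small 3B model served by Ollama
    LOCAL_REASONING = "local-reasoning"    # the proposed 7-14B quantised model
    FRONTIER_API = "frontier-api"          # Claude via the Anthropic API


@dataclass(frozen=True)
class Task:
    # Every field here is a hypothetical label, not part of the real pipeline.
    kind: str                      # e.g. "completeness", "summarisation", "codegen"
    per_utterance: bool = False    # high-frequency: runs on every utterance
    private: bool = False          # content should not leave the machine
    needs_reasoning: bool = False  # reasoning, synthesis, judgment, world knowledge


CLASSIFIER_KINDS = {"classification", "intent", "completeness"}
LIGHT_KINDS = {"summarisation", "note-taking", "simple-qa"}


def route(task: Task, local_reasoning_available: bool = False) -> Tier:
    """Pick a tier for a task using the routing principles as ordered rules."""
    # Classification and per-utterance work stays on the small local model.
    if task.kind in CLASSIFIER_KINDS or task.per_utterance:
        return Tier.LOCAL_CLASSIFIER
    # Private content never goes to the API; use the best local model present.
    if task.private:
        if local_reasoning_available:
            return Tier.LOCAL_REASONING
        return Tier.LOCAL_CLASSIFIER
    # Reasoning-heavy work and code generation go to the frontier model.
    if task.needs_reasoning or task.kind == "codegen":
        return Tier.FRONTIER_API
    # Light tasks use the mid-size local model when it is installed.
    if task.kind in LIGHT_KINDS and local_reasoning_available:
        return Tier.LOCAL_REASONING
    # Without a local reasoning model, everything else falls back to the API.
    return Tier.FRONTIER_API
```

For example, `route(Task("completeness", per_utterance=True))` stays on the classifier, while `route(Task("summarisation"), local_reasoning_available=True)` picks the mid-size local model. Checking privacy before quality is a deliberate choice in this sketch: a private task accepts a weaker answer rather than leaving the machine.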