claude-voice-experiment/LLM-ROUTING.md
mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00

# LLM Routing Strategy
When to use a local model vs an Anthropic frontier model in the voice pipeline.

---
## Current model inventory
| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | — |

Total local VRAM: ~6–7 GB on an RTX 3090 (24 GB), which leaves plenty of headroom.

---
## Routing principles
### Use local model when:
- Task is **classification or routing** (completeness, intent detection, command vs question)
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
- Task is **high-frequency** (runs on every utterance — billing would add up)
- Privacy matters — query content shouldn't leave the machine
### Use Anthropic frontier model when:
- Task requires **reasoning, synthesis, or judgment**
- Task involves **code generation or editing**
- Task involves **world knowledge** beyond what a small model handles well
- Query is **low-frequency** (once per user request, not once per audio frame)
- Quality matters more than latency
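
The two lists above can be read as a small decision function. The sketch below is one possible encoding, not the pipeline's actual router: the `Task` fields, the `route` name, and especially the tie-breaking order (privacy first, then capability needs, then cost and latency) are assumptions, since the lists do not say which rule wins when they conflict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one unit of work in the voice pipeline."""
    is_classification: bool = False      # completeness, intent, command vs question
    latency_sensitive: bool = False
    high_frequency: bool = False         # runs on every utterance
    private: bool = False                # content should not leave the machine
    needs_reasoning: bool = False        # reasoning, synthesis, or judgment
    needs_code: bool = False             # code generation or editing
    needs_world_knowledge: bool = False


def route(task: Task) -> str:
    """Return "local" or "frontier" for a task (assumed priority order)."""
    # Assumption: privacy overrides everything else.
    if task.private:
        return "local"
    # Capability needs that a small local model handles poorly.
    if task.needs_reasoning or task.needs_code or task.needs_world_knowledge:
        return "frontier"
    # Cheap, frequent, or latency-bound work stays on the GPU.
    if task.is_classification or task.high_frequency or task.latency_sensitive:
        return "local"
    # Assumption: a one-off request with no other signal favours quality.
    return "frontier"


if __name__ == "__main__":
    print(route(Task(is_classification=True, high_frequency=True)))
    print(route(Task(needs_code=True)))
```
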
---
## GPU split (future option)
If voice models and reasoning models compete for VRAM, split across cards:
- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API
Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.

---
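
One way to apply the `CUDA_VISIBLE_DEVICES` pinning from the GPU split section in Python is to build a per-process environment and pass it to each model server at launch. This is a minimal sketch: the helper name `env_for_gpu` and the card indices are assumptions, and the child process here is only a stand-in that prints the variable. The variable has to be set before the child initialises CUDA.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one CUDA device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


CARD_A = 0  # assumed index of the voice-stack card (STT + TTS + classifier)
CARD_B = 1  # assumed index of the local reasoning-model card

if __name__ == "__main__":
    # Stand-in for a real model server: the child just reports what it was given.
    probe = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    subprocess.run([sys.executable, "-c", probe], env=env_for_gpu(CARD_A), check=True)
```

Inside the child process, the selected card shows up as CUDA device 0 regardless of its physical index, so model code does not need to know which card it was given.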
## Upgrade candidates
- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
- **Completeness classifier → larger model** — Qwen 7B or 14B would likely give better sentence boundary detection and instruction/question classification.
- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
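
Whether an upgrade fits can be checked with simple arithmetic on the figures in this document. The sketch below uses only the approximate VRAM numbers from the inventory table and the Whisper deltas listed above. The function name `headroom_gb` is hypothetical, and the footprint of any added local model is a parameter the caller must supply, since no size is given here.

```python
CARD_VRAM_GB = 24.0  # RTX 3090

# Approximate figures from the model inventory table.
CURRENT_GB = {"stt": 0.5, "tts": 4.0, "classifier": 2.0}

# Approximate extra VRAM over Whisper base.en, from the upgrade list.
WHISPER_UPGRADE_GB = {"small.en": 0.5, "medium.en": 1.5}


def headroom_gb(whisper_upgrade=None, extra_model_gb=0.0):
    """Return the VRAM left on the card after the voice stack and any upgrades."""
    used = sum(CURRENT_GB.values())
    if whisper_upgrade is not None:
        used += WHISPER_UPGRADE_GB[whisper_upgrade]
    # extra_model_gb is a caller-supplied estimate, not a measured value.
    return CARD_VRAM_GB - used - extra_model_gb


if __name__ == "__main__":
    print(headroom_gb())             # current stack only
    print(headroom_gb("medium.en"))  # with the larger Whisper model
```

A negative result means the combination does not fit on one card, which is the case where the GPU split becomes relevant.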