Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LLM-ROUTING.md
# LLM Routing Strategy

When to use a local model vs an Anthropic frontier model in the voice pipeline.

---
## Current model inventory

| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | n/a (remote) |

Total local VRAM: ~6–7 GB on an RTX 3090 (24 GB), which leaves roughly 17 GB of headroom.
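The total above is simple arithmetic over the table's VRAM column. A minimal sketch of that budget check follows; the dictionary keys and the `vram_budget` helper are illustrative names, and the figures are the approximate ones from the table, not measurements.

```python
# Approximate VRAM figures copied from the inventory table (GB).
LOCAL_VRAM_GB = {
    "stt_whisper_base_en": 0.5,
    "tts_chatterbox_turbo": 4.0,
    "classifier_qwen2.5_3b": 2.0,
}
CARD_VRAM_GB = 24.0  # RTX 3090


def vram_budget(models: dict[str, float], card_gb: float) -> tuple[float, float]:
    """Return (total used, headroom left) in GB."""
    used = sum(models.values())
    return used, card_gb - used


used, headroom = vram_budget(LOCAL_VRAM_GB, CARD_VRAM_GB)
print(f"used ~{used:.1f} GB, headroom ~{headroom:.1f} GB")
# prints: used ~6.5 GB, headroom ~17.5 GB
```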

---

## Routing principles

### Use local model when:
- Task is **classification or routing** (completeness, intent detection, command vs question)
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
- Task is **high-frequency** (runs on every utterance, so per-call API billing would add up)
- Task is **privacy-sensitive** (query content shouldn't leave the machine)

### Use Anthropic frontier model when:
- Task requires **reasoning, synthesis, or judgment**
- Task involves **code generation or editing**
- Task involves **world knowledge** beyond what a small model handles well
- Task is **low-frequency** (once per user request, not once per audio frame)
- **Quality matters more than latency**
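The two lists above can be read as a small decision function. The sketch below is one possible encoding, not the pipeline's actual router: the `Task` fields, the kind names, and in particular the order in which the rules are checked (privacy first, then frequency, then latency) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical task categories derived from the bullet lists.
LOCAL_KINDS = {"classification", "routing", "intent"}
FRONTIER_KINDS = {"reasoning", "synthesis", "judgment", "codegen"}


@dataclass
class Task:
    """Hypothetical description of one unit of work in the pipeline."""
    kind: str                            # e.g. "classification", "codegen", "qa"
    per_utterance: bool = False          # high-frequency: runs on every utterance
    private: bool = False                # content must not leave the machine
    needs_world_knowledge: bool = False
    latency_sensitive: bool = False


def route(task: Task) -> str:
    """Return "local" or "frontier" for a task."""
    # Assumption: privacy is a hard constraint and wins over everything else.
    if task.private:
        return "local"
    # Classification-style or per-utterance work stays local.
    if task.kind in LOCAL_KINDS or task.per_utterance:
        return "local"
    # Latency-sensitive work stays local only when it needs neither
    # world knowledge nor frontier-style reasoning.
    simple = not task.needs_world_knowledge and task.kind not in FRONTIER_KINDS
    if task.latency_sensitive and simple:
        return "local"
    # Everything else is low-frequency, quality-first work.
    return "frontier"
```

For example, `route(Task("classification", per_utterance=True))` returns `"local"`, while `route(Task("codegen"))` returns `"frontier"`.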

---

## GPU split (future option)

If voice models and reasoning models compete for VRAM, split across cards:
- **Card A** (smaller VRAM): STT + TTS + classifier; pure inference, no large models needed
- **Card B** (larger VRAM): a local reasoning model (e.g. a roughly 20B-parameter model such as a larger Qwen variant) for tasks that benefit from a capable local model but shouldn't hit the API

Use `CUDA_VISIBLE_DEVICES` (set per process, including for the Ollama server) or Ollama's own GPU assignment to pin models to specific cards.
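A minimal sketch of per-process pinning with `CUDA_VISIBLE_DEVICES`, assuming each service is started as its own child process. The helper names, the card indices, and the `tts_server.py` script name are hypothetical; only the environment variable and the `ollama serve` command are real.

```python
import os
import subprocess


def env_for_card(card_index: int) -> dict[str, str]:
    """Copy the current environment and expose only one GPU to the child."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(card_index)
    return env


def launch(cmd: list[str], card_index: int) -> subprocess.Popen:
    """Start a service that only sees the given card (it appears as device 0)."""
    return subprocess.Popen(cmd, env=env_for_card(card_index))


# Hypothetical layout: index 0 is Card A (voice stack), index 1 is Card B.
# launch(["python", "tts_server.py"], card_index=0)  # script name is illustrative
# launch(["ollama", "serve"], card_index=1)          # Ollama instance for the reasoning model
```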

---

## Upgrade candidates

- **Whisper small.en or medium.en**: better transcription accuracy, especially for accents. Small adds ~0.5 GB over base.en, medium ~1.5 GB.
- **Completeness classifier → larger model**: Qwen 7B or 14B should give significantly better sentence boundary detection and instruction/question classification, at the cost of more VRAM and higher per-utterance latency.
- **Local reasoning model**: for tasks that are too complex for a 3B classifier but too simple to warrant an API call (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack on the 24 GB card.
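Because the classifier runs through Ollama, trying a larger model can be as small a change as a different model tag. The sketch below assumes Ollama's default local endpoint and its `/api/generate` route with `qwen2.5` model tags. The prompt wording and the `build_request`, `parse_label`, and `classify` helpers are illustrative, not the repository's code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Hypothetical prompt; the real classifier's wording is not shown in this document.
PROMPT = (
    "Is the following utterance a complete query? "
    "Answer with exactly one word, COMPLETE or INCOMPLETE.\n\nUtterance: {text}"
)


def build_request(text: str, model: str = "qwen2.5:3b") -> dict:
    """Build a non-streaming /api/generate payload for the given model tag."""
    return {"model": model, "prompt": PROMPT.format(text=text), "stream": False}


def parse_label(response_text: str) -> bool:
    """Map the model's free-text answer to True (complete) or False."""
    return response_text.strip().upper().startswith("COMPLETE")


def classify(text: str, model: str = "qwen2.5:3b") -> bool:
    """Send the request to a running Ollama server and parse the answer."""
    data = json.dumps(build_request(text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_label(json.loads(resp.read())["response"])
```

With a server running, `classify("turn on the", model="qwen2.5:7b")` would exercise the larger candidate without any other code change.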