claude-voice-experiment/LLM-ROUTING.md
mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00


LLM Routing Strategy

When to use a local model vs an Anthropic frontier model in the voice pipeline.


Current model inventory

| Role | Model | Where | VRAM |
|---|---|---|---|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | n/a |

Total local VRAM: ~6–7 GB on an RTX 3090 (24 GB). Plenty of headroom.
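
The headroom can be checked by summing the per-component figures from the inventory. This is a minimal sketch: the dictionary keys and the `vram_summary` helper are illustrative names, and the numbers are the approximate values listed above rather than measurements.

```python
# Approximate per-component VRAM figures from the inventory, in GB.
LOCAL_VRAM_GB = {
    "stt_whisper_base_en_silero_vad": 0.5,
    "tts_chatterbox_turbo": 4.0,
    "classifier_qwen2.5_3b": 2.0,
}
GPU_TOTAL_GB = 24.0  # RTX 3090


def vram_summary(usage, total_gb):
    """Return (used, free) in GB for a dict of per-model estimates."""
    used = sum(usage.values())
    return used, total_gb - used


used_gb, free_gb = vram_summary(LOCAL_VRAM_GB, GPU_TOTAL_GB)
print(f"used ~{used_gb:.1f} GB, free ~{free_gb:.1f} GB")
# prints: used ~6.5 GB, free ~17.5 GB
```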


Routing principles

Use a local model when:

  • Task is classification or routing (completeness, intent detection, command vs question)
  • Task is latency-sensitive and the answer doesn't require world knowledge or reasoning
  • Task is high-frequency (runs on every utterance — billing would add up)
  • Privacy matters — query content shouldn't leave the machine

Use an Anthropic frontier model when:

  • Task requires reasoning, synthesis, or judgment
  • Task involves code generation or editing
  • Task involves world knowledge beyond what a small model handles well
  • Query is low-frequency (once per user request, not once per audio frame)
  • Quality matters more than latency
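
The two lists above can be read as a small decision function. The sketch below is one possible encoding, not the pipeline's actual router: the `Task` fields, the task-kind labels, and the order in which the rules are checked (privacy first, then reasoning needs, then frequency) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Task:
    kind: str                           # hypothetical label, e.g. "classification"
    per_utterance: bool = False         # runs on every utterance
    privacy_sensitive: bool = False     # content should not leave the machine
    needs_reasoning: bool = False       # reasoning, synthesis, or judgment
    needs_world_knowledge: bool = False


# Hypothetical task kinds, grouped according to the principles above.
LOCAL_KINDS = {"classification", "routing", "intent", "completeness"}
FRONTIER_KINDS = {"codegen", "code_edit", "synthesis"}


def choose_route(task: Task) -> Route:
    # Assumed priority: privacy overrides everything else.
    if task.privacy_sensitive:
        return Route.LOCAL
    # Reasoning, world knowledge, and code work go to the frontier model.
    if task.needs_reasoning or task.needs_world_knowledge or task.kind in FRONTIER_KINDS:
        return Route.FRONTIER
    # High-frequency or pure classification work stays local.
    if task.per_utterance or task.kind in LOCAL_KINDS:
        return Route.LOCAL
    # Assumed default: prefer quality when nothing else decides.
    return Route.FRONTIER
```

For example, `choose_route(Task("completeness", per_utterance=True))` returns `Route.LOCAL`, while `choose_route(Task("codegen"))` returns `Route.FRONTIER`.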

GPU split (future option)

If voice models and reasoning models compete for VRAM, split across cards:

  • Card A (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
  • Card B (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API

Use CUDA_VISIBLE_DEVICES or Ollama's GPU assignment to pin models to specific cards.
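
A minimal sketch of pinning by environment variable, assuming each model server runs as its own process. `CUDA_VISIBLE_DEVICES` must be set before the process initialises CUDA, so it is passed through the child's environment. The GPU indices, the `env_for_gpu` and `launch` helpers, and the `tts_server.py` script name are hypothetical.

```python
import os
import subprocess

VOICE_GPU = 0      # assumed index of Card A (STT + TTS + classifier)
REASONING_GPU = 1  # assumed index of Card B (local reasoning model)


def env_for_gpu(gpu_index, base_env=None):
    """Copy an environment and restrict CUDA to a single device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(cmd, gpu_index):
    """Start a child process that can only see the given GPU."""
    return subprocess.Popen(cmd, env=env_for_gpu(gpu_index))


# Usage (hypothetical script name, not executed here):
# launch(["python", "tts_server.py"], VOICE_GPU)
```

CUDA's device order may not match the order `nvidia-smi` reports unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set, so it is worth confirming which physical card each index refers to.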


Upgrade candidates

  • Whisper small.en or medium.en: better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
  • Completeness classifier → larger model: Qwen 7B or 14B would likely give better sentence boundary detection and instruction/question classification.
  • Local reasoning model: for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
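
As a rough check on whether a larger quantised model fits next to the voice stack, the weights-only footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes 4-bit quantisation and ignores KV cache, activations, and runtime overhead, so real usage will be higher. The 6.5 GB voice-stack figure is the sum of the inventory estimates.

```python
def weights_gb(params_billion, bits_per_weight):
    """Weights-only estimate in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


VOICE_STACK_GB = 6.5   # STT + TTS + classifier, from the inventory
GPU_TOTAL_GB = 24.0    # RTX 3090
BITS = 4               # assumed quantisation level

for size_b in (7, 14):
    need = weights_gb(size_b, BITS)
    total = VOICE_STACK_GB + need
    verdict = "fits" if total <= GPU_TOTAL_GB else "does not fit"
    print(f"{size_b}B @ {BITS}-bit: ~{need:.1f} GB weights, ~{total:.1f} GB total, {verdict}")
```

Under these assumptions a 7B model needs about 3.5 GB for weights and a 14B model about 7 GB, which leaves room on a 24 GB card for context and overhead.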