mikael-lovqvists-claude-agent/claude-voice-experiment


mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-30 04:48:54 +00:00

2.2 KiB


LLM Routing Strategy

When to use a local model vs an Anthropic frontier model in the voice pipeline.

Current model inventory

Role	Model	Where	VRAM
STT	Whisper base.en + Silero VAD	sherpa-onnx (GPU)	~0.5 GB
TTS	Chatterbox Turbo	Python/CUDA	~4 GB
Completeness classifier	Qwen 2.5 3B	Ollama	~2 GB
Main reasoning	Claude (Anthropic)	API / Claude Code	—

Total local VRAM: ~6–7 GB on an RTX 3090 (24 GB), which leaves roughly 17 GB of headroom.
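The total above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the approximate figures from the inventory table; the dictionary keys and the `vram_budget` helper are illustrative names, not part of the repository.

```python
# Approximate VRAM figures in GB, copied from the inventory table.
# The keys are illustrative labels, not identifiers from the repo.
LOCAL_MODELS_GB = {
    "stt_whisper_base_en": 0.5,
    "tts_chatterbox_turbo": 4.0,
    "classifier_qwen2.5_3b": 2.0,
}
CARD_VRAM_GB = 24.0  # RTX 3090


def vram_budget(models_gb, card_gb):
    """Return (GB used by the listed models, GB left on the card)."""
    used = sum(models_gb.values())
    return used, card_gb - used


used, headroom = vram_budget(LOCAL_MODELS_GB, CARD_VRAM_GB)
print(f"used ~{used:.1f} GB, headroom ~{headroom:.1f} GB")
```

Adding a candidate model to the dictionary before loading it is a quick way to see whether it still fits on the card, bearing in mind that real usage varies with batch size, context length, and quantisation.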

Routing principles

Use local model when:

Task is classification or routing (completeness, intent detection, command vs question)
Task is latency-sensitive and the answer doesn't require world knowledge or reasoning
Task is high-frequency (runs on every utterance — billing would add up)
Privacy matters — query content shouldn't leave the machine

Use Anthropic frontier model when:

Task requires reasoning, synthesis, or judgment
Task involves code generation or editing
Task involves world knowledge beyond what a small model handles well
Query is low-frequency (once per user request, not once per audio frame)
Quality matters more than latency
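The two lists above could be encoded as a small routing function. The following is a sketch under assumptions: the `Task` fields, the kind labels, and the precedence order (privacy first, then reasoning needs, then frequency) are all hypothetical choices made for illustration, since the principles themselves do not say which criterion wins when they conflict.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Task:
    # Hypothetical fields mirroring the criteria in the lists above.
    kind: str                          # e.g. "completeness", "codegen"
    private: bool = False              # content should not leave the machine
    per_utterance: bool = False        # runs on every utterance
    needs_world_knowledge: bool = False


# Illustrative kind labels; a real pipeline would define its own.
LOCAL_KINDS = {"classification", "routing", "intent", "completeness"}
FRONTIER_KINDS = {"reasoning", "synthesis", "codegen", "code_edit"}


def choose_route(task: Task) -> Route:
    """Pick a backend; the precedence order is an assumption."""
    if task.private:
        return Route.LOCAL
    if task.kind in FRONTIER_KINDS or task.needs_world_knowledge:
        return Route.FRONTIER
    if task.kind in LOCAL_KINDS or task.per_utterance:
        return Route.LOCAL
    # Unknown task types default to the frontier model (quality over latency).
    return Route.FRONTIER


print(choose_route(Task(kind="completeness", per_utterance=True)))
print(choose_route(Task(kind="codegen")))
```

Putting privacy first means a private reasoning task stays local even if quality suffers; swapping the first two checks would give the opposite trade-off.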

GPU split (future option)

If voice models and reasoning models compete for VRAM, split across cards:

Card A (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
Card B (larger VRAM): local reasoning model (e.g. a Qwen model in the 14B–32B range or similar) for tasks that benefit from a capable local model but shouldn't hit the API

Use CUDA_VISIBLE_DEVICES (set in each process's environment before it starts) or Ollama's GPU assignment to pin models to specific cards.
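Pinning with CUDA_VISIBLE_DEVICES works per process: the variable has to be in the environment before the process initialises CUDA. A minimal sketch of launching a child process restricted to one card follows; `env_for_gpu` is a hypothetical helper, and the child here only echoes the variable rather than starting a real STT or TTS server.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Copy an environment and restrict CUDA to one physical card."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Stand-in for a real server launch: the child prints the value it sees.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env_for_gpu(0),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

Inside the pinned process the selected card appears as device 0, so model code that targets the default CUDA device needs no changes.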

Upgrade candidates

Whisper small.en or medium.en — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
Completeness classifier → larger model: Qwen 2.5 7B or 14B would likely give better sentence boundary detection and instruction/question classification, at the cost of more VRAM and higher per-utterance latency.
Local reasoning model — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
