mikael-lovqvists-claude-agent/claude-voice-experiment


mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-30 04:48:54 +00:00


Voice Project — Cleanup Plan

Note

Revised approach: keep this project as-is (it works and serves as a reference). The path forward is a new clean repo for each component, then a top-level project that composes them. See the "New architecture (revised plan)" section below.

The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document lays out the plan before any actual cleanup is done.

Current state

Active pipeline (keep)

File	Role
`query-demo.mjs`	Main entry point — STT → classifier → dispatch
`tts-server.mjs`	TTS HTTP server (Chatterbox, voice switching)
`voices.yaml`	Named voice config
`lib/stt.mjs`	Silero VAD + Whisper, pre-roll buffer
`lib/chatterbox-tts.mjs`	Node wrapper for chatterbox-server.py
`lib/tts-client.mjs`	HTTP client for tts-server.mjs
`lib/local-query-complete.mjs`	Ollama classifier
`chatterbox-server.py`	Long-running Python TTS process
`listen.mjs`	Standalone mic → stdout tool
`speak-as.mjs`	Standalone stdin → TTS tool
`demos/`	Multi-voice demo scripts
`download-models.sh`	Whisper + VAD model download
`setup-venv.sh`	Python venv setup (needs rewrite — see below)
`models/`	Downloaded STT models
`venv/`	Python venv (not committed)

Dead / experimental (remove or archive)

File	Reason
`acting-demo-bark.mjs`	Bark TTS — abandoned
`acting-demo-chatterbox.mjs`	One-off demo, superseded
`acting-demo.mjs`	Old demo
`bark-server.py`	Bark backend — abandoned
`demo-bark.mjs`	Bark demo
`demo-kokoro.mjs`	Kokoro TTS experiment — abandoned
`lib/bark-tts.mjs`	Bark wrapper
`lib/tts.mjs`	Old TTS abstraction (superseded by tts-client.mjs)
`lib/markdown.mjs`	Unused? Verify before deleting
`lib/llm.mjs`	Unused? Verify before deleting
`requirements.txt`	Lists kokoro + faster-whisper deps that aren't used
`voice-buddy.mjs`	Merged into query-demo.mjs — delete

What needs to be documented / fixed

1. `setup-venv.sh` — rewrite

The current script installs Bark dependencies, but the active backend is Chatterbox. The new script should:

Install chatterbox-tts and its deps (torch, transformers, accelerate, numpy)
Tell the user to write their Hugging Face token to ~/.secrets/hugging-face.token
Note that find_hf_cache() in chatterbox-server.py avoids Hugging Face network calls when the model is already cached
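
As a rough illustration of the rewrite described above, a Chatterbox-only setup script could look something like the sketch below. The function names, the VENV_DIR and HF_TOKEN_FILE override variables, and the choice to install unpinned versions are assumptions made for the example; only the package list and the token path come from this plan.

```shell
#!/usr/bin/env bash
# Sketch of a Chatterbox-only setup-venv.sh.
# Assumptions (not from the project): the function names, the VENV_DIR and
# HF_TOKEN_FILE override variables, and installing unpinned versions.

VENV_DIR="${VENV_DIR:-venv}"
HF_TOKEN_FILE="${HF_TOKEN_FILE:-$HOME/.secrets/hugging-face.token}"

# Create the venv and install Chatterbox plus the dependencies named in the plan.
create_venv() {
  python3 -m venv "$VENV_DIR" &&
    "$VENV_DIR/bin/pip" install --upgrade pip &&
    "$VENV_DIR/bin/pip" install chatterbox-tts torch transformers accelerate numpy
}

# Succeed if the token file exists and is non-empty; otherwise print the instruction.
check_hf_token() {
  if [ -s "$HF_TOKEN_FILE" ]; then
    echo "Hugging Face token found at $HF_TOKEN_FILE"
  else
    echo "Hugging Face token missing: write it to $HF_TOKEN_FILE" >&2
    return 1
  fi
}
```

The block only defines functions, so sourcing it has no side effects. A real script would presumably end with something like `create_venv && check_hf_token`.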

2. `requirements.txt` — replace or delete

Currently lists kokoro and faster-whisper (unused). Either:

Replace with chatterbox-requirements.txt listing only what setup-venv.sh installs
Or just delete it — setup-venv.sh is the source of truth

3. sherpa-onnx / CUDA build — document clearly

The npm package ships a CPU-only .so. Using CUDA requires a manual rebuild. This is documented in NOTES.md but should also appear in a top-level README.md or SETUP.md so it's hard to miss.

4. System dependencies — list them

Nothing currently lists what needs to be installed at the OS level:

parec / pacat (PulseAudio) — audio I/O
cmake — for optional CUDA sherpa-onnx rebuild
onnxruntime-opt-cuda (or equivalent) — for GPU STT
ollama running locally with qwen2.5:3b pulled — for classifier
jq — used in shell functions and demo scripts
Node.js (minimum required version still to be checked)
Python 3 with venv support
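
Until SETUP.md exists, a small check along these lines could report which of the command-line tools listed above are missing. It is a sketch: `node` and `python3` are assumed to be the binary names for Node.js and Python 3, and it only covers tools that appear as commands on PATH, so the onnxruntime package and the Ollama service are not checked here.

```shell
#!/usr/bin/env bash
# Print every command from the argument list that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Commands taken from the dependency list. "node" and "python3" are the usual
# binary names for Node.js and Python 3, which is an assumption here.
missing="$(missing_commands parec pacat cmake jq node python3)"
if [ -n "$missing" ]; then
  echo "Missing system dependencies:"
  echo "$missing"
else
  echo "All checked commands are available."
fi
```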

5. External services — document

TTS server: runs on the host (not in a container), port 11500
Ollama: runs on 192.168.2.99:11434 — classifier and future LLM routing
TTS_URL env var: set in ~/.bashrc to point at TTS server
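
The service settings above could be captured and sanity-checked with a sketch like the following. The localhost fallback for TTS_URL, the OLLAMA_URL variable name, and the 3-second timeout are assumptions for illustration. The `/api/tags` path is Ollama's endpoint for listing pulled models; no TTS server endpoint is assumed, so the TTS check only tests that something answers HTTP.

```shell
#!/usr/bin/env bash
# Assumptions: the localhost fallback for TTS_URL, the OLLAMA_URL variable
# name, and the 3-second timeout are illustrative, not taken from the project.
TTS_URL="${TTS_URL:-http://localhost:11500}"
OLLAMA_URL="${OLLAMA_URL:-http://192.168.2.99:11434}"

# Return 0 if something answers HTTP at the given URL (any status code counts).
service_up() {
  curl --silent --output /dev/null --max-time 3 "$1"
}

# Ollama lists pulled models at /api/tags; look for the quoted model name there.
ollama_has_model() {
  curl --silent --fail --max-time 3 "$OLLAMA_URL/api/tags" | grep -qF "\"$1\""
}
```

Typical use would be `service_up "$TTS_URL" || echo "TTS server not reachable"` and `ollama_has_model qwen2.5:3b || echo "classifier model not pulled"`.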

6. `voices.yaml` — keep and document

The file is already in the right format. Add a comment block at the top explaining how to add a voice: use a clean WAV or OGG clip, at least 5 seconds long, with no background music.
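
The 5-second minimum could also be checked mechanically before a clip is added. The sketch below assumes ffprobe (part of FFmpeg) is installed, which is not one of the dependencies listed in this plan; the function names and the MIN_SECONDS variable are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: ffprobe (from FFmpeg) is installed; the project does not list it.
MIN_SECONDS=5

# Print the duration of an audio file in seconds.
clip_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed if the given duration (in seconds) meets the minimum length.
long_enough() {
  awk -v d="$1" -v min="$MIN_SECONDS" 'BEGIN { exit !(d + 0 >= min + 0) }'
}
```

Usage would be along the lines of `long_enough "$(clip_duration clip.wav)" || echo "clip is too short"`, where `clip.wav` is a placeholder file name.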

Proposed new top-level structure

voice/
  README.md          ← start here: what this is, quick start
  SETUP.md           ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md           ← architecture deep-dives (keep as-is)
  TODO.md            ← keep as-is
  VOICES.md          ← wishlist (keep as-is)
  voices.yaml        ← runtime voice config
  query-demo.mjs     ← main entry point
  tts-server.mjs     ← TTS HTTP server
  listen.mjs         ← standalone STT tool
  speak-as.mjs       ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh      ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/            ← gitignored
  venv/              ← gitignored

Cleanup steps (in order)

Verify lib/markdown.mjs and lib/llm.mjs are unused — grep for imports
Delete dead files: Bark and Kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and any lib files confirmed unused in the previous step
Rewrite setup-venv.sh for Chatterbox only
Replace requirements.txt with an accurate one, or delete it
Write SETUP.md covering all install steps end-to-end
Write README.md with quick start
Add voice cloning tips to top of voices.yaml
Commit the whole cleanup as a single commit
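
The first step, confirming that lib/markdown.mjs and lib/llm.mjs are unused, could be done with a search like the sketch below. The function name is hypothetical, and the match is a plain substring search on the file name, not a real import analysis, so any hits still need a manual look.

```shell
#!/usr/bin/env bash
# List .mjs files under a directory that mention a module's file name.
# Skips node_modules and venv. Matching is by plain substring, so "llm.mjs"
# would also match a reference to a file named "local-llm.mjs".
# Rely on the printed output rather than the exit status: find may return
# non-zero when grep finds no match.
find_importers() {
  module="$1"
  root="${2:-.}"
  find "$root" \( -name node_modules -o -name venv \) -prune \
    -o -type f -name '*.mjs' -exec grep -lF -- "$module" {} +
}
```

Running `find_importers markdown.mjs` and `find_importers llm.mjs` from the project root should print nothing if the modules are safe to delete.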

Open questions

Keep acting-demo-chatterbox.mjs as a reference for TTS capability demos, or delete?
Should demos/ scripts be committed as-is or made more portable (no hardcoded paths)?

New architecture (revised plan)

Keep this project as a reference. Build clean standalone repos instead:

Repo	Contents	Depends on
`tts-server`	chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API	Python venv, Chatterbox
`stt`	lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs	sherpa-onnx-node, models
`voice-assistant`	query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs	tts-server + stt as submodules

Each standalone repo gets its own README and SETUP and can be used independently. The voice-assistant repo composes them via git submodules (or an npm workspace or symlinks; TBD).

Open questions for new architecture

Composition: npm packages are the leading option. Each repo would be published to the private npm.efforting.tech registry, and voice-assistant would depend on @efforting/tts-server and @efforting/stt as npm dependencies. This is cleaner than submodules for ESM projects.
Model storage — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. MODELS_DIR or per-model env vars) rather than downloading into the project. Migrate existing models/ directories out as part of the new repo setup. Not urgent — note for cleanup time.
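
The env-var lookup suggested for model storage might resolve paths as in the sketch below. MODELS_DIR is the name floated above; the function name, the "whisper" model key, and the per-model variable shown in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Resolve where a model lives: a per-model override wins, then MODELS_DIR.
# Fails instead of silently falling back to a models/ directory in the project.
resolve_model_dir() {
  name="$1"          # model key, e.g. "whisper" (hypothetical)
  override="${2:-}"  # value of a per-model variable, may be empty
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  elif [ -n "${MODELS_DIR:-}" ]; then
    printf '%s\n' "$MODELS_DIR/$name"
  else
    echo "MODELS_DIR is not set and no override given for $name" >&2
    return 1
  fi
}
```

A caller could then write `dir="$(resolve_model_dir whisper "${WHISPER_MODEL_DIR:-}")" || exit 1`, where WHISPER_MODEL_DIR stands in for whatever per-model variable is eventually chosen.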
