claude-voice-experiment/CLEANUP-PLAN.md
mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00

Voice Project — Cleanup Plan

Note

Revised approach: keep this project as-is (it works and serves as a reference). The path forward is a new clean repo for each component, plus a top-level project that composes them. See the "New architecture (revised plan)" section below.

The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document records the plan before any actual cleanup is done.


Current state

Active pipeline (keep)

| File | Role |
|---|---|
| query-demo.mjs | Main entry point: STT → classifier → dispatch |
| tts-server.mjs | TTS HTTP server (Chatterbox, voice switching) |
| voices.yaml | Named voice config |
| lib/stt.mjs | Silero VAD + Whisper, pre-roll buffer |
| lib/chatterbox-tts.mjs | Node wrapper for chatterbox-server.py |
| lib/tts-client.mjs | HTTP client for tts-server.mjs |
| lib/local-query-complete.mjs | Ollama classifier |
| chatterbox-server.py | Long-running Python TTS process |
| listen.mjs | Standalone mic → stdout tool |
| speak-as.mjs | Standalone stdin → TTS tool |
| demos/ | Multi-voice demo scripts |
| download-models.sh | Whisper + VAD model download |
| setup-venv.sh | Python venv setup (needs rewrite, see below) |
| models/ | Downloaded STT models |
| venv/ | Python venv (not committed) |

Dead / experimental (remove or archive)

| File | Reason |
|---|---|
| acting-demo-bark.mjs | Bark TTS, abandoned |
| acting-demo-chatterbox.mjs | One-off demo, superseded |
| acting-demo.mjs | Old demo |
| bark-server.py | Bark backend, abandoned |
| demo-bark.mjs | Bark demo |
| demo-kokoro.mjs | Kokoro TTS experiment, abandoned |
| lib/bark-tts.mjs | Bark wrapper |
| lib/tts.mjs | Old TTS abstraction (superseded by tts-client.mjs) |
| lib/markdown.mjs | Unused? Verify before deleting |
| lib/llm.mjs | Unused? Verify before deleting |
| requirements.txt | Lists kokoro + faster-whisper deps that aren't used |
| voice-buddy.mjs | Merged into query-demo.mjs, delete |

What needs to be documented / fixed

1. setup-venv.sh — rewrite

The current script installs Bark dependencies, but the active backend is Chatterbox. The new script should:

  • Install chatterbox-tts and its deps (torch, transformers, accelerate, numpy)
  • Tell the user to write their Hugging Face token to ~/.secrets/hugging-face.token
  • Note: find_hf_cache() in chatterbox-server.py avoids HF network calls if the model is cached
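
As a rough illustration of the token and cache points above, here is a minimal Python sketch. It is not the code in chatterbox-server.py: `read_hf_token`, `hub_cache_dir`, and `find_cached_snapshot` are hypothetical names, and the lookup assumes the standard Hugging Face hub cache layout (`<cache>/models--<org>--<name>/snapshots/<revision>`). It also ignores `XDG_CACHE_HOME` for brevity.

```python
import os
from pathlib import Path

# Token location taken from this plan.
TOKEN_PATH = Path.home() / ".secrets" / "hugging-face.token"


def read_hf_token(path=TOKEN_PATH):
    """Return the stored token, or None if the file is missing or empty."""
    try:
        token = Path(path).read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None
    return token or None


def hub_cache_dir(env=os.environ):
    """Resolve the hub cache directory from HF_HUB_CACHE or HF_HOME."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def find_cached_snapshot(repo_id, cache_dir=None):
    """Return the newest cached snapshot directory for repo_id, or None."""
    cache_dir = Path(cache_dir) if cache_dir else hub_cache_dir()
    snapshots = cache_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

If a snapshot is found, the server could load from that path directly, or set `HF_HUB_OFFLINE=1` before importing the model code so the hub client makes no network calls. The repo id to pass in depends on which Chatterbox weights the project actually uses.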

2. requirements.txt — replace or delete

Currently lists kokoro and faster-whisper (unused). Either:

  • Replace with chatterbox-requirements.txt listing only what setup-venv.sh installs
  • Or just delete it — setup-venv.sh is the source of truth

3. sherpa-onnx / CUDA build — document clearly

The npm package ships a CPU-only .so. Using CUDA requires a manual rebuild. This is documented in NOTES.md but should also appear in a top-level README.md or SETUP.md so it's hard to miss.

4. System dependencies — list them

Nothing currently lists what needs to be installed at the OS level:

  • parec / pacat (PulseAudio) — audio I/O
  • cmake — for optional CUDA sherpa-onnx rebuild
  • onnxruntime-opt-cuda (or equivalent) — for GPU STT
  • ollama running locally with qwen2.5:3b pulled — for classifier
  • jq — used in shell functions and demo scripts
  • Node.js (version? check what's required)
  • Python 3 with venv support
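
Until SETUP.md exists, a small preflight check could catch missing OS-level tools. The sketch below only tests whether commands are on `PATH`. The split into required and optional is an assumption, and it does not check versions, the CUDA runtime, or whether the qwen2.5:3b model has been pulled.

```python
import shutil

# Command names come from the list above. The required/optional split is an
# assumption: cmake is only needed for the CUDA rebuild, and the ollama CLI
# is only needed on the machine that hosts the Ollama server.
REQUIRED_COMMANDS = ["parec", "pacat", "jq", "node", "python3"]
OPTIONAL_COMMANDS = ["cmake", "ollama"]


def missing_commands(names, which=shutil.which):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED_COMMANDS),
                         ("optional", OPTIONAL_COMMANDS)):
        missing = missing_commands(names)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

The `which` parameter exists only so the function can be exercised without touching the real `PATH`.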

5. External services — document

  • TTS server: runs on the host (not in container), port 11500
  • Ollama: runs on 192.168.2.99:11434 — classifier and future LLM routing
  • TTS_URL env var: set in ~/.bashrc to point at TTS server
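
One way to make these endpoints explicit is a small resolver with a reachability probe, sketched below in Python. `TTS_URL`, port 11500, and the Ollama address come from this plan. The `OLLAMA_URL` variable name and the `localhost` default for the TTS host are assumptions made for illustration.

```python
import os
import urllib.error
import urllib.request

DEFAULT_TTS_URL = "http://localhost:11500"        # port from this plan; host assumed
DEFAULT_OLLAMA_URL = "http://192.168.2.99:11434"  # address from this plan


def service_urls(env=os.environ):
    """Resolve service base URLs from the environment, without trailing slashes."""
    return {
        "tts": env.get("TTS_URL", DEFAULT_TTS_URL).rstrip("/"),
        # OLLAMA_URL is a hypothetical variable name, not one the project defines.
        "ollama": env.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).rstrip("/"),
    }


def is_reachable(url, timeout=2.0):
    """Return True if anything answers HTTP at url, even with an error status."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        return True  # the server responded, just not with a 2xx status
    except (urllib.error.URLError, OSError):
        return False
```

A setup script could call `is_reachable(service_urls()["ollama"] + "/api/tags")` to confirm the Ollama server is up before starting the pipeline.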

6. voices.yaml — keep and document

Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).
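
The five-second minimum can be checked mechanically before a clip is added. The sketch below uses Python's standard `wave` module, so it handles uncompressed WAV files only. OGG clips would need a third-party decoder, and background music cannot be detected this way. The function names are hypothetical.

```python
import wave

MIN_SECONDS = 5.0  # minimum clip length suggested in this plan


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """Return True if the clip meets the minimum length for voice cloning."""
    return wav_duration_seconds(path) >= min_seconds
```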


Proposed new top-level structure

voice/
  README.md          ← start here: what this is, quick start
  SETUP.md           ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md           ← architecture deep-dives (keep as-is)
  TODO.md            ← keep as-is
  VOICES.md          ← wishlist (keep as-is)
  voices.yaml        ← runtime voice config
  query-demo.mjs     ← main entry point
  tts-server.mjs     ← TTS HTTP server
  listen.mjs         ← standalone STT tool
  speak-as.mjs       ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh      ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/            ← gitignored
  venv/              ← gitignored

Cleanup steps (in order)

  1. Verify lib/markdown.mjs and lib/llm.mjs are unused — grep for imports
  2. Delete dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
  3. Rewrite setup-venv.sh for Chatterbox only
  4. Replace requirements.txt with accurate one or delete
  5. Write SETUP.md covering all install steps end-to-end
  6. Write README.md with quick start
  7. Add voice cloning tips to top of voices.yaml
  8. Commit the whole cleanup as a single commit
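
Step 1 can be done with plain grep, but a small script makes the check repeatable. The Python sketch below scans `.mjs` files for import specifiers whose last path segment matches a module name. The regex is a simple heuristic rather than a real ES module parser, so it can report false positives (for example, matching text inside comments), and it will miss imports built from computed strings.

```python
import re
from pathlib import Path

# Matches: from "x", import "x", and import("x"), with single or double quotes.
IMPORT_RE = re.compile(r"""(?:from\s+|import\s*\(?\s*)["']([^"']+)["']""")


def find_importers(root, module_name, skip_dirs=("node_modules", "venv", "models")):
    """Return sorted .mjs files under root that appear to import module_name."""
    hits = []
    for path in Path(root).rglob("*.mjs"):
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        specifiers = IMPORT_RE.findall(text)
        if any(spec.split("/")[-1] == module_name for spec in specifiers):
            hits.append(path)
    return sorted(hits)
```

An empty result from `find_importers(".", "markdown.mjs")` and `find_importers(".", "llm.mjs")` would support deleting both files. Shell scripts in demos/ are not scanned, so they are worth a separate look.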

Open questions

  • Keep acting-demo-chatterbox.mjs as a reference for TTS capability demos, or delete?
  • Should demos/ scripts be committed as-is or made more portable (no hardcoded paths)?

New architecture (revised plan)

Keep this project as a reference. Build clean standalone repos instead:

| Repo | Contents | Depends on |
|---|---|---|
| tts-server | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| stt | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| voice-assistant | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |

Each standalone repo gets its own README and SETUP and can be used independently. The voice-assistant repo composes them via git submodules (or an npm workspace or symlinks, TBD).

Open questions for new architecture

  • Composition via npm packages: each repo is published to the private npm.efforting.tech registry, and voice-assistant depends on @efforting/tts-server and @efforting/stt as npm dependencies. This is cleaner than submodules for ESM projects.
  • Model storage: models are large, have different backup strategies, and should not live inside project directories. The system already has canonical model locations; repos should reference them through an env var (e.g. MODELS_DIR, or per-model env vars) rather than downloading into the project. Migrating the existing models/ directories out is part of the new repo setup. This is not urgent; revisit it at cleanup time.
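
The env-var lookup for model storage might work along these lines. This is a sketch under stated assumptions: `MODELS_DIR` is the example name from this plan, while the per-model naming scheme (`<NAME>_MODEL_DIR`) and the `resolve_model_dir` function are hypothetical.

```python
import os
from pathlib import Path


def resolve_model_dir(model_name, env=os.environ):
    """Locate a model directory outside the project tree.

    A per-model variable (hypothetical scheme, e.g. WHISPER_MODEL_DIR) wins
    over the shared MODELS_DIR. Nothing is downloaded as a fallback.
    """
    per_model_var = model_name.upper().replace("-", "_") + "_MODEL_DIR"
    if env.get(per_model_var):
        return Path(env[per_model_var])
    if env.get("MODELS_DIR"):
        return Path(env["MODELS_DIR"]) / model_name
    raise RuntimeError(
        f"Set {per_model_var} or MODELS_DIR to locate the {model_name!r} model"
    )
```

Failing loudly when neither variable is set keeps a repo from silently recreating a models/ directory inside the project.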