# Voice Project — Cleanup Plan

> [!NOTE]
> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.

The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.

---

## Current state

### Active pipeline (keep)

| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point — STT → Pending_Query → dispatch |
| `lib/pending-query.mjs` | Query state machine — wake word, silence timer, send/cancel/pause/instant dispatch |
| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
| `voices.yaml` | Named voice config |
| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
| `lib/local-query-complete.mjs` | Ollama classifier |
| `chatterbox-server.py` | Long-running Python TTS process |
| `listen.mjs` | Standalone mic → stdout tool |
| `speak-as.mjs` | Standalone stdin → TTS tool |
| `demos/` | Multi-voice demo scripts |
| `download-models.sh` | Whisper + VAD model download |
| `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
| `models/` | Downloaded STT models |
| `venv/` | Python venv (not committed) |

### Dead / experimental (remove or archive)

| File | Reason |
|------|--------|
| `acting-demo-bark.mjs` | Bark TTS — abandoned |
| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
| `acting-demo.mjs` | Old demo |
| `bark-server.py` | Bark backend — abandoned |
| `demo-bark.mjs` | Bark demo |
| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
| `lib/bark-tts.mjs` | Bark wrapper |
| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
| `lib/markdown.mjs` | Unused? Verify before deleting |
| `lib/llm.mjs` | Unused? Verify before deleting |
| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
| `voice-buddy.mjs` | Merged into query-demo.mjs — **already deleted** |
| `test-chime.mjs` | Keep — useful for testing chime config |

---

## What needs to be documented / fixed

### 1. `setup-venv.sh` — rewrite

Current script installs Bark deps. The active backend is Chatterbox. New script should:

- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
- HuggingFace token instruction: write token to `~/.secrets/hugging-face.token`
- Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached

### 2. `requirements.txt` — replace or delete

Currently lists kokoro and faster-whisper (unused). Either:

- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
- Or just delete it — `setup-venv.sh` is the source of truth

### 3. sherpa-onnx / CUDA build — document clearly

The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.

### 4. System dependencies — list them

Nothing currently lists what needs to be installed at the OS level:

- `parec` / `pacat` (PulseAudio) — audio I/O
- `cmake` — for optional CUDA sherpa-onnx rebuild
- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
- `ollama` running locally with `qwen2.5:3b` pulled — for classifier
- `jq` — used in shell functions and demo scripts
- Node.js (version? check what's required)
- Python 3 with venv support

### 5. External services — document

- **TTS server**: runs on the host (not in container), port 11500
- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server

### 6. `voices.yaml` — keep and document

Already the right format.
Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).

---

## Proposed new top-level structure

```
voice/
  README.md                ← start here: what this is, quick start
  SETUP.md                 ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md                 ← architecture deep-dives (keep as-is)
  TODO.md                  ← keep as-is
  VOICES.md                ← wishlist (keep as-is)

  voices.yaml              ← runtime voice config

  query-demo.mjs           ← main entry point
  tts-server.mjs           ← TTS HTTP server
  listen.mjs               ← standalone STT tool
  speak-as.mjs             ← standalone TTS tool
  chatterbox-server.py

  setup-venv.sh            ← rewritten for Chatterbox only
  download-models.sh
  package.json

  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs

  demos/
    *.sh

  models/                  ← gitignored
  venv/                    ← gitignored
```

---

## Cleanup steps (in order)

1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
2. **Delete** dead files: bark/kokoro scripts, old demos, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files (voice-buddy.mjs is already deleted; just confirm it is gone)
3. **Rewrite** `setup-venv.sh` for Chatterbox only
4. **Replace** `requirements.txt` with accurate one or delete
5. **Write** `SETUP.md` covering all install steps end-to-end
6. **Write** `README.md` with quick start
7. **Add** voice cloning tips to top of `voices.yaml`
8. **Commit** the whole cleanup as a single commit

---

## tmux pane monitoring for Claude state (future)

After dispatching a query, poll `docker exec tmux capture-pane -p -t claude` to watch the pane contents. Detect completion by looking for Claude Code's end-of-task timing line, which matches a pattern like `"for \d+ seconds"` or `"for \d+ minutes"`. This line always appears after Claude finishes a task and is reliable enough to use as a signal.

Use this to:

- Cancel the working chime timeout early when Claude finishes quickly
- Play a "done but silent" notification if Claude finishes without calling `speak`

Polling interval of ~1s should be sufficient.
Stop polling once the pattern is seen or a max timeout is reached.

Also detect Claude Code's session feedback prompt ("How is Claude doing this session?"). When seen:

- Speak the question via TTS
- Enter a temporary voice response mode: "good", "fine", "bad" → send the corresponding rating; "dismiss" or "ignore" → dismiss the prompt
- This lets users give Anthropic feedback even when using the voice interface exclusively

---

## Audio mixing — chimes over speech (future)

Currently the playback queue is strictly sequential — chimes and speech never overlap. A future improvement would be to mix chime audio on top of ongoing speech rather than waiting for it to finish. This would make activation chimes feel more responsive. Requires summing float32 buffers before passing to pacat rather than queuing them separately.

---

## Replace chatterbox-server.py stdin protocol with HTTP (future)

The current stdin/stdout JSON line protocol between `chatterbox-tts.mjs` and `chatterbox-server.py` is internal and fragile. Replace it with a lightweight HTTP server inside the Python process so:

- Multiple Node services can connect independently
- External tools can also talk to it directly
- The protocol is self-documenting and testable with curl

This also applies to other Python backend processes if they appear.

---

## Pending_Query — migrate to ESM library event dispatch (future)

The current constructor callback pattern (`on_submit`, `on_cancel`, `on_activate`, `on_mode_change`, etc.) is reinventing an event emitter. Replace with the event dispatch system from the shared ESM library (`nodejs.esm-library`) so callers use `.on('submit', handler)` etc. instead of constructor options. No functional change — just removes the code smell and aligns with the shared library.

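
A rough sketch of what the event-dispatch migration could look like from the caller's side. The shared library's actual API is not shown in this document, so the small `Emitter` class below is a hypothetical stand-in for it, and the `Pending_Query` shown here (with `append`, `submit`, `cancel`) is a simplified illustration rather than the real class. The only detail taken from the plan is that callers subscribe with `.on('submit', handler)` instead of passing `on_submit` to the constructor.

```javascript
// Hypothetical stand-in for the shared library's event dispatch.
// The real API of nodejs.esm-library may differ.
class Emitter {
  #handlers = new Map();

  // Register a handler for an event name; returns `this` for chaining.
  on(event_name, handler) {
    if (!this.#handlers.has(event_name)) this.#handlers.set(event_name, []);
    this.#handlers.get(event_name).push(handler);
    return this;
  }

  // Call every handler registered for the event, in registration order.
  emit(event_name, ...args) {
    for (const handler of this.#handlers.get(event_name) ?? []) handler(...args);
  }
}

// Simplified Pending_Query (illustration only): accumulates fragments
// and announces submit/cancel as events instead of calling constructor callbacks.
class Pending_Query extends Emitter {
  #fragments = [];

  append(text) {
    this.#fragments.push(text);
  }

  submit() {
    const query = this.#fragments.join(' ');
    this.#fragments = [];
    this.emit('submit', query);
  }

  cancel() {
    this.#fragments = [];
    this.emit('cancel');
  }
}

// Before: new Pending_Query({ on_submit: handler, on_cancel: handler })
// After: subscribe on the instance.
const events_seen = [];
const pending = new Pending_Query();
pending
  .on('submit', (query) => events_seen.push(['submit', query]))
  .on('cancel', () => events_seen.push(['cancel']));

pending.append('what time');
pending.append('is it');
pending.submit();
pending.cancel();
```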

---

## Current voice command vocabulary

| Phrase | Effect | Mode |
|--------|--------|------|
| `computer` | Activate query (wake word) | idle |
| `go` / `done` / `send` | Submit accumulated query | active |
| `abort` / `cancel` | Discard query, return to idle | active |
| `pause` | Freeze accumulation and timer (thinking pause) | active |
| `resume` / `continue` | Unfreeze and restart timer | paused |
| `yes please` | Instant dispatch, bypasses wake word | any |
| `always listen` | Switch to always-on mode | any |
| `stop listening` | Switch to wake-word mode | any |
| `mode query` | Speak current mode status | any |

Silence timer auto-enabled in wake-word mode; disabled in always-on mode.

---

## Utterance pre-processing: collapse repeated words (planned)

STT occasionally produces repeated words as artifacts (e.g. "go go", "cancel cancel"). Add a pre-processing step before command matching that collapses consecutive identical words in a fragment into one (e.g. "go go" → "go"). This would be a simple normalisation rule applied to the `norm` string before any word-set lookups.

---

## Utterance routing layer (planned)

As query interaction grows more complex, utterance dispatch needs its own class/config rather than living inside `Pending_Query`.
Rough shape:

- A routing table mapping normalized utterance strings (or patterns) to handler callbacks
- Handlers run before the utterance is appended — if one matches, it consumes the utterance
- Examples: `'read back'` → speak accumulated text; `'redo'` / `'start over'` → clear without cancel; `'yes please'` / `'no'` → immediate dispatch to Claude
- Send words and cancel words become entries in this same table
- `Pending_Query` only handles accumulation and submission; routing logic lives outside it

---

## Query staging / readback (planned)

Before a query is sent to Claude, give the user a chance to review and correct it via voice:

- After accumulating utterances, read back the query text via TTS so the user can confirm it looks right
- Commands like "read back" or "what did I say" speak the current accumulated text
- Commands like "redo" or "start over" clear the accumulation without cancelling the session
- Only "go" / "send" actually dispatches to Claude

This requires a richer interaction loop inside `Pending_Query` — the current send/cancel/append model is the foundation.

---

## Open questions

- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?

---

## New architecture (revised plan)

Keep this project as a reference. Build clean standalone repos instead:

| Repo | Contents | Depends on |
|------|----------|------------|
| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |

Each standalone repo gets its own README, SETUP, and can be used independently.
The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD).

### Open questions for new architecture

- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
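
One way the env-var lookup for model storage could work, as a minimal sketch. `MODELS_DIR` is the name suggested above; the per-model variable scheme (`MODEL_PATH_<NAME>`), the function name `resolve_model_path`, the lookup order, and the example paths and file names are all assumptions made up for illustration, not decided conventions.

```javascript
// Resolve a model's location from the environment instead of the project tree.
// Lookup order (an assumption, not a decided convention):
//   1. per-model override, e.g. MODEL_PATH_SILERO_VAD_ONNX for "silero-vad.onnx"
//   2. shared MODELS_DIR plus the model's file name
function resolve_model_path(model_name, env = process.env) {
  // Build the override key: upper-case, runs of other characters become "_".
  const override_key =
    'MODEL_PATH_' + model_name.toUpperCase().replace(/[^A-Z0-9]+/g, '_');
  if (env[override_key]) return env[override_key];

  if (!env.MODELS_DIR) {
    throw new Error(`Set MODELS_DIR or ${override_key} to locate "${model_name}"`);
  }
  // Strip trailing slashes so the join never produces "//".
  return env.MODELS_DIR.replace(/\/+$/, '') + '/' + model_name;
}

// Example with an explicit env object (paths and file names are made up).
const example_env = {
  MODELS_DIR: '/srv/models/',
  MODEL_PATH_WHISPER_BIN: '/mnt/fast/whisper.bin',
};
const vad_path = resolve_model_path('silero-vad.onnx', example_env);
const whisper_path = resolve_model_path('whisper.bin', example_env);
```

Failing loudly when neither variable is set keeps a repo from silently falling back to a project-local `models/` directory, which is the behaviour this item wants to move away from.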