# Voice Project: Cleanup Plan

> [!NOTE]
> **Revised approach:** keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.

The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.

---
## Current state

### Active pipeline (keep)

| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point: STT → Pending_Query → dispatch |
| `lib/pending-query.mjs` | Query state machine: wake word, silence timer, send/cancel/pause/instant dispatch |
| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
| `voices.yaml` | Named voice config |
| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
| `lib/local-query-complete.mjs` | Ollama classifier |
| `chatterbox-server.py` | Long-running Python TTS process |
| `listen.mjs` | Standalone mic → stdout tool |
| `speak-as.mjs` | Standalone stdin → TTS tool |
| `demos/` | Multi-voice demo scripts |
| `download-models.sh` | Whisper + VAD model download |
| `setup-venv.sh` | Python venv setup (needs rewrite, see below) |
| `models/` | Downloaded STT models |
| `venv/` | Python venv (not committed) |

### Dead / experimental (remove or archive)

| File | Reason |
|------|--------|
| `acting-demo-bark.mjs` | Bark TTS, abandoned |
| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
| `acting-demo.mjs` | Old demo |
| `bark-server.py` | Bark backend, abandoned |
| `demo-bark.mjs` | Bark demo |
| `demo-kokoro.mjs` | Kokoro TTS experiment, abandoned |
| `lib/bark-tts.mjs` | Bark wrapper |
| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
| `lib/markdown.mjs` | Unused? Verify before deleting |
| `lib/llm.mjs` | Unused? Verify before deleting |
| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
| `voice-buddy.mjs` | Merged into query-demo.mjs, **already deleted** |
| `test-chime.mjs` | Keep: useful for testing chime config |

---
## What needs to be documented / fixed

### 1. `setup-venv.sh`: rewrite

The current script installs Bark deps, but the active backend is Chatterbox. The new script should:

- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
- Include the Hugging Face token instruction: write the token to `~/.secrets/hugging-face.token`
- Note that `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached

### 2. `requirements.txt`: replace or delete

Currently lists kokoro and faster-whisper (unused). Either:

- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
- Or just delete it, since `setup-venv.sh` is the source of truth

### 3. sherpa-onnx / CUDA build: document clearly

The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.

### 4. System dependencies: list them

Nothing currently lists what needs to be installed at the OS level:

- `parec` / `pacat` (PulseAudio): audio I/O
- `cmake`: for the optional CUDA sherpa-onnx rebuild
- `onnxruntime-opt-cuda` (or equivalent): for GPU STT
- `ollama` running locally with `qwen2.5:3b` pulled: for the classifier
- `jq`: used in shell functions and demo scripts
- Node.js (version? check what's required)
- Python 3 with venv support

### 5. External services: document

- **TTS server**: runs on the host (not in container), port 11500
- **Ollama**: runs on `192.168.2.99:11434`, used for the classifier and future LLM routing
- **TTS_URL** env var: set in `~/.bashrc` to point at the TTS server

### 6. `voices.yaml`: keep and document

Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).

---
## Proposed new top-level structure

```
voice/
  README.md              ← start here: what this is, quick start
  SETUP.md               ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md               ← architecture deep-dives (keep as-is)
  TODO.md                ← keep as-is
  VOICES.md              ← wishlist (keep as-is)
  voices.yaml            ← runtime voice config
  query-demo.mjs         ← main entry point
  tts-server.mjs         ← TTS HTTP server
  listen.mjs             ← standalone STT tool
  speak-as.mjs           ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh          ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/                ← gitignored
  venv/                  ← gitignored
```

---
## Cleanup steps (in order)

1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused by grepping for imports
2. **Delete** dead files: bark/kokoro scripts, old demos, `lib/bark-tts.mjs`, `lib/tts.mjs`, and any lib files confirmed unused in step 1 (`voice-buddy.mjs` is already deleted)
3. **Rewrite** `setup-venv.sh` for Chatterbox only
4. **Replace** `requirements.txt` with an accurate one, or delete it
5. **Write** `SETUP.md` covering all install steps end-to-end
6. **Write** `README.md` with quick start
7. **Add** voice cloning tips to the top of `voices.yaml`
8. **Commit** the whole cleanup as a single commit

---
## tmux pane monitoring for Claude state (future)

After dispatching a query, poll `docker exec <id> tmux capture-pane -p -t claude` to watch the pane contents. Detect completion by looking for Claude Code's end-of-task timing line, which matches a pattern like `"for \d+ seconds"` or `"for \d+ minutes"`. This line always appears after Claude finishes a task and is reliable enough to use as a signal.

Use this to:

- Cancel the working chime timeout early when Claude finishes quickly
- Play a "done but silent" notification if Claude finishes without calling `speak`

Polling interval of ~1s should be sufficient. Stop polling once the pattern is seen or a max timeout is reached.

Also detect Claude Code's session feedback prompt ("How is Claude doing this session?"). When seen:

- Speak the question via TTS
- Enter a temporary voice response mode: "good", "fine", "bad" → send the corresponding rating; "dismiss" or "ignore" → dismiss the prompt
- This lets users give Anthropic feedback even when using the voice interface exclusively
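
The detection logic described in this section could be sketched roughly as follows. The regular expression and the prompt string come from the text above; the function names, the `capture` callback (which would wrap the `docker exec <id> tmux capture-pane -p -t claude` call), and the default timings are assumptions made for illustration.

```javascript
// Sketch only: function names, the capture() callback, and the default
// timings are illustrative assumptions, not existing project code.

// End-of-task timing line, e.g. "for 42 seconds" or "for 3 minutes".
const DONE_PATTERN = /for \d+ (seconds|minutes)/;
// Session feedback prompt shown by Claude Code.
const FEEDBACK_PROMPT = 'How is Claude doing this session?';

// Classify one snapshot of the pane text.
function classify_pane(pane_text) {
  if (pane_text.includes(FEEDBACK_PROMPT)) return 'feedback_prompt';
  if (DONE_PATTERN.test(pane_text)) return 'done';
  return 'working';
}

// Poll until the pane leaves the 'working' state or max_wait_ms elapses.
// capture() is expected to return (a promise of) the current pane text.
async function wait_for_claude(capture, { interval_ms = 1000, max_wait_ms = 600000 } = {}) {
  const deadline = Date.now() + max_wait_ms;
  while (Date.now() < deadline) {
    const state = classify_pane(await capture());
    if (state !== 'working') return state;
    await new Promise((resolve) => setTimeout(resolve, interval_ms));
  }
  return 'timeout';
}
```

A real implementation would probably need to compare against the pane contents captured at dispatch time, since a timing line from an earlier task may still be visible in the pane.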

---

## Silence timer warning chime (planned)

Before the silence timer fires and submits the query, play a short warning chime (e.g. 1–2 seconds before expiry) so the user has a moment to say `pause` or continue speaking if the submission would be premature. Requires a secondary timeout that fires slightly before the main one.
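
One way to arrange the two timeouts is sketched below. The class name, option names, and the 1500 ms lead time are assumptions, not the existing `Pending_Query` timer code; the timer functions are injectable only so the scheduling can be tested without waiting.

```javascript
// Sketch only: the class name, option names, and the 1500 ms lead time are
// assumptions; the timer functions are injectable so the logic can be tested.
class Silence_Timer {
  constructor({
    timeout_ms,
    warning_lead_ms = 1500,
    on_warning,
    on_expire,
    set_timeout = (fn, ms) => setTimeout(fn, ms),
    clear_timeout = (handle) => clearTimeout(handle),
  }) {
    Object.assign(this, { timeout_ms, warning_lead_ms, on_warning, on_expire });
    this._set_timeout = set_timeout;
    this._clear_timeout = clear_timeout;
    this._handles = [];
  }

  // Call on every new speech fragment: re-arms both timers.
  restart() {
    this.stop();
    const warn_at = Math.max(0, this.timeout_ms - this.warning_lead_ms);
    this._handles = [
      this._set_timeout(() => this.on_warning(), warn_at),
      this._set_timeout(() => this.on_expire(), this.timeout_ms),
    ];
  }

  // Call on pause, cancel, or submit: clears both timers so neither leaks.
  stop() {
    for (const handle of this._handles) this._clear_timeout(handle);
    this._handles = [];
  }
}
```

Keeping both handles in one place means `pause` and `cancel` only need a single `stop()` call, which avoids the kind of timer leak that a forgotten secondary timeout would cause.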

---

## Audio mixing: chimes over speech (future)

Currently the playback queue is strictly sequential, so chimes and speech never overlap. A future improvement would be to mix chime audio on top of ongoing speech rather than waiting for it to finish. This would make activation chimes feel more responsive. Requires summing float32 buffers before passing to pacat rather than queuing them separately.
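
Summing two float32 buffers might look like the sketch below. The function name, the `offset` and `chime_gain` parameters, and the clamping policy are assumptions; it also presumes both buffers are mono and share a sample rate.

```javascript
// Sketch only: function name, offset/gain parameters, and clamping policy
// are assumptions; it presumes both buffers are mono at the same sample rate.
function mix_chime_over_speech(speech, chime, offset = 0, chime_gain = 1.0) {
  // Output is long enough for the speech and for a chime that runs past it.
  const length = Math.max(speech.length, offset + chime.length);
  const mixed = new Float32Array(length);
  mixed.set(speech);
  for (let i = 0; i < chime.length; i++) {
    const sum = mixed[offset + i] + chime[i] * chime_gain;
    // Clamp to [-1, 1] so the summed signal stays in the valid sample range.
    mixed[offset + i] = Math.max(-1, Math.min(1, sum));
  }
  return mixed;
}
```

For streaming playback the same sum would have to be applied chunk by chunk, which is the harder part of this change.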

---

## Replace chatterbox-server.py stdin protocol with HTTP (future)

The current stdin/stdout JSON line protocol between `chatterbox-tts.mjs` and `chatterbox-server.py` is internal and fragile. Replace it with a lightweight HTTP server inside the Python process so:

- Multiple Node services can connect independently
- External tools can also talk to it directly
- The protocol is self-documenting and testable with curl

This also applies to other Python backend processes if they appear.
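
From the Node side, talking to such a server might look like the sketch below. No HTTP API for `chatterbox-server.py` exists yet, so the `/synthesize` endpoint, the port, and the JSON field names are all hypothetical; only the global `fetch` available in recent Node.js versions is taken as given.

```javascript
// Sketch only: endpoint path, port, and field names are hypothetical.
const CHATTERBOX_URL = 'http://127.0.0.1:11501';

// Build the request separately so it can be inspected or replayed with curl.
function build_synthesis_request(text, voice) {
  return {
    url: `${CHATTERBOX_URL}/synthesize`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice }),
    },
  };
}

// Send the request and return the raw audio bytes from the response body.
async function synthesize(text, voice) {
  const { url, options } = build_synthesis_request(text, voice);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
  return new Uint8Array(await response.arrayBuffer());
}
```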

---

## Pending_Query: migrate to ESM library event dispatch (future)

The current constructor callback pattern (`on_submit`, `on_cancel`, `on_activate`, `on_mode_change`, etc.) is reinventing an event emitter. Replace with the event dispatch system from the shared ESM library (`nodejs.esm-library`) so callers use `.on('submit', handler)` etc. instead of constructor options. No functional change; this just removes the code smell and aligns with the shared library.
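
The difference in calling style can be illustrated with a small stand-in. `Tiny_Emitter` below is not the shared library's event dispatch system (its real API is not described here); it is a minimal assumed `on`/`emit` pair, and `Pending_Query_Sketch` is a stripped-down placeholder for the real class.

```javascript
// Sketch only: Tiny_Emitter is a stand-in for whatever the shared library
// provides; its on/emit API is an assumption.
class Tiny_Emitter {
  constructor() { this._handlers = new Map(); }
  on(event, handler) {
    if (!this._handlers.has(event)) this._handlers.set(event, []);
    this._handlers.get(event).push(handler);
    return this; // allow chaining
  }
  emit(event, ...args) {
    for (const handler of this._handlers.get(event) ?? []) handler(...args);
  }
}

// Before: new Pending_Query({ on_submit: handler, on_cancel: other })
// After:  listeners are attached after construction and can be added freely.
class Pending_Query_Sketch extends Tiny_Emitter {
  constructor() { super(); this.text = ''; }
  append(fragment) { this.text = `${this.text} ${fragment}`.trim(); }
  submit() { this.emit('submit', this.text); this.text = ''; }
  cancel() { this.emit('cancel'); this.text = ''; }
}

const query = new Pending_Query_Sketch();
const submitted = [];
query.on('submit', (text) => submitted.push(text));
query.append('what time');
query.append('is it');
query.submit(); // submitted is now ['what time is it']
```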

---

## Current voice command vocabulary

| Phrase | Effect | Mode |
|--------|--------|------|
| `computer` | Activate query (wake word) | idle |
| `go` / `done` / `send` | Submit accumulated query | active |
| `abort` / `cancel` | Discard query, return to idle | active |
| `pause` | Freeze accumulation and timer (thinking pause) | active |
| `resume` / `continue` | Unfreeze and restart timer | paused |
| `yes please` | Instant dispatch, bypasses wake word | any |
| `always listen` | Switch to always-on mode | any |
| `stop listening` | Switch to wake-word mode | any |
| `mode query` | Speak current mode status | any |

The silence timer is auto-enabled in wake-word mode and disabled in always-on mode.

---

## Bug: alphanumeric code dictation fails

Three problems when dictating codes like "F 0 7 5 0 0 0 0":

1. **Repeated word collapse**: "zero zero zero zero" becomes "zero". The collapse rule is too aggressive; it should only apply to command words, not to content being accumulated.
2. **Whisper letter recognition**: single spoken letters are often transcribed incorrectly (e.g. "F" → "if", "ef"). A NATO phonetic alphabet mode ("foxtrot zero seven five") would be more reliable for codes.
3. **Silence timer**: pauses between individual characters trigger early submission.

Fix: disable the collapse rule for content (only apply it to the command-word matching step, not to the accumulated text). Consider a "code entry mode" that disables the silence timer and uses phonetic alphabet mapping.
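
The phonetic mapping for a possible code entry mode could be as small as the sketch below. The mode does not exist yet; the word tables, the spelling variants, and the function name are assumptions. Note that no collapse step is applied, so repeated digits survive.

```javascript
// Sketch only: "code entry mode" does not exist yet; the word tables and the
// function name are assumptions. Words that cannot be mapped are reported.
const NATO = ['alfa', 'bravo', 'charlie', 'delta', 'echo', 'foxtrot', 'golf',
  'hotel', 'india', 'juliett', 'kilo', 'lima', 'mike', 'november', 'oscar',
  'papa', 'quebec', 'romeo', 'sierra', 'tango', 'uniform', 'victor',
  'whiskey', 'xray', 'yankee', 'zulu'];
const DIGITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six',
  'seven', 'eight', 'nine'];

const CODE_WORDS = new Map();
NATO.forEach((word, i) => CODE_WORDS.set(word, String.fromCharCode(65 + i)));
DIGITS.forEach((word, i) => CODE_WORDS.set(word, String(i)));
// Spelling variants an STT engine might plausibly produce.
CODE_WORDS.set('alpha', 'A').set('juliet', 'J').set('x-ray', 'X');

// Turn "foxtrot zero seven five" into "F075".
function decode_code_entry(norm) {
  const code = [];
  const unknown = [];
  for (const word of norm.toLowerCase().split(/\s+/).filter(Boolean)) {
    if (CODE_WORDS.has(word)) code.push(CODE_WORDS.get(word));
    else if (/^[a-z0-9]$/.test(word)) code.push(word.toUpperCase());
    else unknown.push(word);
  }
  return { code: code.join(''), unknown };
}
```

Returning the unmapped words separately leaves room for the caller to ask the user to repeat them instead of silently dropping characters.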

---

## Utterance pre-processing: collapse repeated words (temporary fix)

Implemented as a quick fix in `Pending_Query._collapse_repeated_words()`. It should be absorbed into the proper rewrite/rule engine when that is built, since the rule engine is the right home for utterance normalization and transformation.

STT occasionally produces repeated words as artifacts (e.g. "go go", "cancel cancel"). The pre-processing step runs before command matching and collapses consecutive identical words in a fragment into one (e.g. "go go" → "go"). It is a simple normalisation rule applied to the `norm` string before any word-set lookups.
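
The rule itself is short. The sketch below is not the actual `Pending_Query._collapse_repeated_words()` implementation; the `SEND_WORDS` set and `match_command()` are assumptions that show how the collapse could be limited to the command lookup key.

```javascript
// Sketch only: not the actual Pending_Query._collapse_repeated_words();
// the SEND_WORDS set and match_command() are assumptions for illustration.
function collapse_repeated_words(norm) {
  const words = norm.split(/\s+/).filter(Boolean);
  // Keep a word only if it differs from the one immediately before it.
  return words.filter((word, i) => i === 0 || word !== words[i - 1]).join(' ');
}

const SEND_WORDS = new Set(['go', 'done', 'send']);

// Collapse only the key used for command lookup; the caller appends the
// original fragment, so dictated repeats such as "zero zero" survive.
function match_command(norm) {
  const key = collapse_repeated_words(norm);
  return SEND_WORDS.has(key) ? 'send' : null;
}
```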

---

## API integrations (future)

Direct API integrations the voice assistant can call without involving Claude:

- YouTube: check for new videos from specific creators (via YouTube Data API with user account)
- Weather: already partially working via web search; proper API integration would be cleaner
- Calendar: check upcoming events
- Simple lookups that don't need LLM reasoning

For complex queries involving these services, route to Claude with the relevant tool/API access provided. The split: voice assistant handles simple factual retrieval; Claude handles reasoning, summarization, or multi-step tasks.

---

## Query journal (planned)

Log all STT fragments and command decisions to a JSONL journal file. Each entry should include: timestamp, raw fragment, normalised text, what command (if any) it matched, accumulated text at the time, and how the query was resolved (send word, silence timeout, cancel, instant dispatch). Purposes:

- Debug utterance recognition issues without live mic
- Replay and re-classify fragments when the routing rules change
- Tune the rule engine against real recorded data

Will become essential once the routing layer grows complex.
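
A journal entry along these lines could be serialized as sketched below. The field names and the resolution labels are assumptions derived from the description above, and appending the line to a file is left out.

```javascript
// Sketch only: field names and resolution labels are assumptions based on
// the description above; appending the line to a file is left out.
const RESOLUTIONS = new Set(['send_word', 'silence_timeout', 'cancel', 'instant_dispatch']);

function make_journal_line({
  raw,
  norm,
  matched_command = null,
  accumulated = '',
  resolution = null, // null while the query is still being accumulated
  now = new Date(),
}) {
  if (resolution !== null && !RESOLUTIONS.has(resolution)) {
    throw new Error(`Unknown resolution: ${resolution}`);
  }
  const entry = {
    timestamp: now.toISOString(),
    raw,
    norm,
    matched_command,
    accumulated,
    resolution,
  };
  // One JSON object per line; JSON.stringify escapes embedded newlines.
  return JSON.stringify(entry) + '\n';
}
```

Because every line is a complete JSON object, a replay tool can read the file line by line and feed the `raw` fragments back through a changed rule set.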

---

## Next major direction: rule-based command routing and multi-target dispatch

The voice interface should be able to route queries to different targets, not just Claude Code. This requires a more structured command layer. Planned capabilities:

- **Stashed queries**: accumulate a query while waiting for Claude to respond to a previous one; hold it and send when ready
- **Voice notes**: dispatch short notes directly to the personal note/task system (task inventory) without involving Claude Code at all
- **Multi-target routing**: based on query prefix or context, route to Claude Code, note system, calendar, web search, etc.
- **Rule-based command structure**: replace the current ad-hoc word sets with a proper routing table that maps utterance patterns to handlers and targets

This is the core next step for the project. The current `Pending_Query` accumulation model is the right foundation; the routing layer sits above it.

---

## Utterance routing layer (planned)

As query interaction grows more complex, utterance dispatch needs its own class/config rather than living inside `Pending_Query`. Rough shape:

- A routing table mapping normalized utterance strings (or patterns) to handler callbacks
- Handlers run before the utterance is appended; if one matches, it consumes the utterance
- Examples: `'read back'` → speak accumulated text; `'redo'` / `'start over'` → clear without cancel; `'yes please'` / `'no'` → immediate dispatch to Claude
- Send words and cancel words become entries in this same table
- `Pending_Query` only handles accumulation and submission; routing logic lives outside it
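
The rough shape above could be sketched as a small table-driven class. The class name, method names, and the shape of the context object are assumptions; exact-string matching is used here, and pattern matching would be a later extension.

```javascript
// Sketch only: class name, method names, and the context shape are assumptions.
class Utterance_Router {
  constructor() { this._routes = new Map(); }

  // Register one handler under several normalized phrases.
  add(phrases, handler) {
    for (const phrase of phrases) this._routes.set(phrase, handler);
    return this;
  }

  // Returns true if a handler consumed the utterance, false if the caller
  // should append it to the accumulated query instead.
  dispatch(norm, context) {
    const handler = this._routes.get(norm);
    if (!handler) return false;
    handler(context, norm);
    return true;
  }
}

const spoken = [];
const router = new Utterance_Router()
  .add(['read back'], (ctx) => spoken.push(ctx.text))
  .add(['redo', 'start over'], (ctx) => { ctx.text = ''; })
  .add(['go', 'done', 'send'], (ctx) => { ctx.submitted = ctx.text; ctx.text = ''; });

// Utterances that no route consumes fall through to accumulation.
function handle_utterance(norm, ctx) {
  if (!router.dispatch(norm, ctx)) ctx.text = `${ctx.text} ${norm}`.trim();
}
```

With this split, send and cancel words are ordinary table entries, and the accumulator never needs to know which phrases are commands.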

---

## Query staging / readback (planned)

Two distinct commands:

- **`read back`**: speak the raw verbatim accumulated text so you can hear exactly what Whisper captured
- **`clean up`** (or `process`): send accumulated text through a local LLM to fix dictation artifacts (repeated words, wrong homophones, missing punctuation, expand abbreviations) then read back the cleaned result; optionally replace accumulated text with the cleaned version before sending

The clean-up pass is especially useful for dictating letters or longer text where Whisper artifacts would be embarrassing if sent directly.

---

Before a query is sent to Claude, give the user a chance to review and correct it via voice:

- After accumulating utterances, read back the query text via TTS so the user can confirm it looks right
- Commands like "read back" or "what did I say" speak the current accumulated text
- Commands like "redo" or "start over" clear the accumulation without cancelling the session
- Only "go" / "send" actually dispatches to Claude

This requires a richer interaction loop inside `Pending_Query`; the current send/cancel/append model is the foundation.

---

## Open questions

- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?

---

## New architecture (revised plan)

Keep this project as a reference. Build clean standalone repos instead:

| Repo | Contents | Depends on |
|------|----------|-----------|
| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |

Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks, TBD).

### Open questions for new architecture

- **Composition: npm packages.** Each repo is published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
- **Model storage.** Models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent; a note for cleanup time.
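
Resolving a model location from the environment could look like the sketch below. `MODELS_DIR` is the example name from the text; the per-model override naming (`MODEL_<NAME>`), the function name, and the example paths are assumptions.

```javascript
// Sketch only: MODELS_DIR is the example name from the text; the per-model
// override naming (MODEL_<NAME>) and the function name are assumptions.
function resolve_model_path(name, env = process.env) {
  // A per-model variable wins, e.g. MODEL_SILERO_VAD=/data/models/vad.onnx
  const override_key = `MODEL_${name.toUpperCase().replace(/[^A-Z0-9]+/g, '_')}`;
  if (env[override_key]) return env[override_key];
  if (!env.MODELS_DIR) {
    throw new Error(`Set MODELS_DIR or ${override_key} to locate model "${name}"`);
  }
  // Strip trailing slashes so the joined path has exactly one separator.
  return `${env.MODELS_DIR.replace(/\/+$/, '')}/${name}`;
}
```

Failing loudly when neither variable is set keeps a repo from quietly downloading models into its own directory again.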