Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00
commit db8889aeed
35 changed files with 2782 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,11 @@
+node_modules/
+models/
+venv/
+*.ogg
+*.wav
+*.mp3
+*.onnx
+*.bin
+package-lock.json
+__pycache__/
+*.pyc
--- a/CLEANUP-PLAN.md
+++ b/CLEANUP-PLAN.md
@@ -0,0 +1,159 @@
+# Voice Project — Cleanup Plan
+
+> [!NOTE]
+> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See the **New architecture** section below.
+
+The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.
+
+---
+
+## Current state
+
+### Active pipeline (keep)
+
+| File | Role |
+|------|------|
+| `query-demo.mjs` | Main entry point — STT → classifier → dispatch |
+| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
+| `voices.yaml` | Named voice config |
+| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
+| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
+| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
+| `lib/local-query-complete.mjs` | Ollama classifier |
+| `chatterbox-server.py` | Long-running Python TTS process |
+| `listen.mjs` | Standalone mic → stdout tool |
+| `speak-as.mjs` | Standalone stdin → TTS tool |
+| `demos/` | Multi-voice demo scripts |
+| `download-models.sh` | Whisper + VAD model download |
+| `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
+| `models/` | Downloaded STT models |
+| `venv/` | Python venv (not committed) |
+
+### Dead / experimental (remove or archive)
+
+| File | Reason |
+|------|--------|
+| `acting-demo-bark.mjs` | Bark TTS — abandoned |
+| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
+| `acting-demo.mjs` | Old demo |
+| `bark-server.py` | Bark backend — abandoned |
+| `demo-bark.mjs` | Bark demo |
+| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
+| `lib/bark-tts.mjs` | Bark wrapper |
+| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
+| `lib/markdown.mjs` | Unused? Verify before deleting |
+| `lib/llm.mjs` | Unused? Verify before deleting |
+| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
+| `voice-buddy.mjs` | Merged into query-demo.mjs — delete |
+
+---
+
+## What needs to be documented / fixed
+
+### 1. `setup-venv.sh` — rewrite
+
+Current script installs Bark deps. The active backend is Chatterbox. New script should:
+- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
+- HuggingFace token instruction: write token to `~/.secrets/hugging-face.token`
+- Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached
+
+### 2. `requirements.txt` — replace or delete
+
+Currently lists kokoro and faster-whisper (unused). Either:
+- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
+- Or just delete it — `setup-venv.sh` is the source of truth
+
+### 3. sherpa-onnx / CUDA build — document clearly
+
+The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.
+
+### 4. System dependencies — list them
+
+Nothing currently lists what needs to be installed at the OS level:
+- `parec` / `pacat` (PulseAudio) — audio I/O
+- `cmake` — for optional CUDA sherpa-onnx rebuild
+- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
+- `ollama` running locally with `qwen2.5:3b` pulled — for classifier
+- `jq` — used in shell functions and demo scripts
+- Node.js (version? check what's required)
+- Python 3 with venv support
+
+### 5. External services — document
+
+- **TTS server**: runs on the host (not in container), port 11500
+- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
+- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server
+
+### 6. `voices.yaml` — keep and document
+
+Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).
+
+---
+
+## Proposed new top-level structure
+
+```
+voice/
+  README.md          ← start here: what this is, quick start
+  SETUP.md           ← full setup: OS deps, npm install, venv, CUDA, models
+  NOTES.md           ← architecture deep-dives (keep as-is)
+  TODO.md            ← keep as-is
+  VOICES.md          ← wishlist (keep as-is)
+  voices.yaml        ← runtime voice config
+  query-demo.mjs     ← main entry point
+  tts-server.mjs     ← TTS HTTP server
+  listen.mjs         ← standalone STT tool
+  speak-as.mjs       ← standalone TTS tool
+  chatterbox-server.py
+  setup-venv.sh      ← rewritten for Chatterbox only
+  download-models.sh
+  package.json
+  lib/
+    stt.mjs
+    chatterbox-tts.mjs
+    tts-client.mjs
+    local-query-complete.mjs
+  demos/
+    *.sh
+  models/            ← gitignored
+  venv/              ← gitignored
+```
+
+---
+
+## Cleanup steps (in order)
+
+1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
+2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
+3. **Rewrite** `setup-venv.sh` for Chatterbox only
+4. **Replace** `requirements.txt` with accurate one or delete
+5. **Write** `SETUP.md` covering all install steps end-to-end
+6. **Write** `README.md` with quick start
+7. **Add** voice cloning tips to top of `voices.yaml`
+8. **Commit** the whole cleanup as a single commit
+
+---
+
+## Open questions
+
+- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
+- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?
+
+---
+
+## New architecture (revised plan)
+
+Keep this project as a reference. Build clean standalone repos instead:
+
+| Repo | Contents | Depends on |
+|------|----------|-----------|
+| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
+| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
+| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |
+
+Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD).
+
+### Open questions for new architecture
+
+- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
+- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
--- a/LLM-ROUTING.md
+++ b/LLM-ROUTING.md
@@ -0,0 +1,51 @@
+# LLM Routing Strategy
+
+When to use a local model vs an Anthropic frontier model in the voice pipeline.
+
+---
+
+## Current model inventory
+
+| Role | Model | Where | VRAM |
+|------|-------|-------|------|
+| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
+| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
+| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
+| Main reasoning | Claude (Anthropic) | API / Claude Code | — |
+
+Total local VRAM: ~6–7 GB on RTX 3090 (24 GB). Plenty of headroom.
+
+---
+
+## Routing principles
+
+### Use local model when:
+- Task is **classification or routing** (completeness, intent detection, command vs question)
+- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
+- Task is **high-frequency** (runs on every utterance — billing would add up)
+- Privacy matters — query content shouldn't leave the machine
+
+### Use Anthropic frontier model when:
+- Task requires **reasoning, synthesis, or judgment**
+- Task involves **code generation or editing**
+- Task involves **world knowledge** beyond what a small model handles well
+- Query is **low-frequency** (once per user request, not once per audio frame)
+- Quality matters more than latency
+
+---
+
+## GPU split (future option)
+
+If voice models and reasoning models compete for VRAM, split across cards:
+- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
+- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API
+
+Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.
+
+---
+
+## Upgrade candidates
+
+- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
+- **Completeness classifier → larger model** — Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification.
+- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
--- a/NOTES.md
+++ b/NOTES.md
@@ -0,0 +1,161 @@
+# Voice Experiment Notes
+
+Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.
+
+---
+
+## Architecture
+
+```
+parec (PulseAudio)
+  └─ lib/stt.mjs          Silero VAD + Whisper → text on stdout
+       └─ sherpa-onnx-node
+            └─ libsherpa-onnx-c-api.so   (needs CUDA build — see below)
+                 └─ libonnxruntime.so     (system, with CUDA)
+
+text → lib/chatterbox-tts.mjs
+         └─ chatterbox-server.py         long-running Python TTS server
+              └─ chatterbox-tts (pip)    ResembleAI Chatterbox model
+```
+
+---
+
+## STT
+
+**Library:** `sherpa-onnx-node` npm package
+**Models:** Whisper (offline batch) + Silero VAD
+**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`
+
+### Model download
+
+```bash
+bash download-models.sh base.en   # or: tiny.en small.en medium.en large-v3 large-v3-turbo
+```
+
+Models land in `models/`:
+- `silero_vad.onnx`
+- `sherpa-onnx-whisper-base.en/`
+
+### How it works
+
+`lib/stt.mjs` feeds 512-sample (32 ms at 16 kHz) windows into Silero VAD. When VAD signals speech end (after 500 ms of silence), the segment is sent to Whisper for transcription. The transcription is written to stdout. **Note: the native addon itself writes the text directly to fd 1, bypassing Node.js `process.stdout`; this output does not come from the JavaScript callback.** The JS callback is still invoked and can be used for additional processing.
+
+### Getting CUDA
+
+The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:
+
+```bash
+# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
+git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx
+
+export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
+export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib
+
+cd ~/Foreign/sherpa-onnx
+cmake -B build \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DBUILD_SHARED_LIBS=ON \
+  -DSHERPA_ONNX_ENABLE_GPU=ON \
+  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
+  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
+  .
+
+cmake --build build --target sherpa-onnx-c-api
+
+# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
+NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
+ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so
+rm -f $NMOD/libonnxruntime.so
+```
+
+The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.
+
+---
+
+## TTS
+
+**Library:** `chatterbox-tts` Python package (ResembleAI)
+**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
+**Playback:** `pacat --format=float32le --rate=24000 --channels=1`
+
+### Setup
+
+```bash
+bash setup-venv.sh   # creates ./venv, installs chatterbox-tts + deps
+```
+
+HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.
+
+### Variants
+
+| Variant | RTF (RTX 3090) | Notes |
+|---------|---------------|-------|
+| turbo   | ~0.2          | default, female voice, no exaggeration param |
+| full    | slower        | male voice, supports `exaggeration` (0.0–1.0) |
+
+### Paralinguistic tags
+
+Work in both variants:
+```
+[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
+```
+
+### Voice cloning
+
+Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
+```javascript
+await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
+```
+
+The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.
+
+### Server protocol
+
+`chatterbox-server.py` is a long-running process. Communication:
+- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
+- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
+- **stderr:** status/timing logs
+
+Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
+
+### Acting tips
+
+- Emphasis: `UPPERCASE` words
+- Pause/rhythm: `...` `—`
+- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
+- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
+- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)
+
+---
+
+## Tools
+
+| Script | Usage |
+|--------|-------|
+| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
+| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
+| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |
+
+### Pipe example
+
+```bash
+node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
+```
+
+`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.
+
+---
+
+## Known Quirks
+
+**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
+```python
+_orig = _librosa.resample
+_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
+```
+
+**HuggingFace offline:** `from_pretrained()` always hits the network, even when the model is already cached. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.
+
+**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
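+
+A minimal sketch of this kind of deduplication, for illustration only. It is not the actual `speak-as.mjs` code: `normalize_line` and `make_deduplicator` are hypothetical names, and the sketch only catches lines that are identical after normalization, not merely similar ones.
+
+```javascript
+// Hypothetical helper: lowercase, strip punctuation, collapse whitespace.
+function normalize_line(line) {
+  return line
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}\s]/gu, '')
+    .replace(/\s+/g, ' ')
+    .trim()
+}
+
+// Returns a function that reports whether a line repeats the previous one.
+function make_deduplicator() {
+  let previous = null
+  return function is_duplicate(line) {
+    const key = normalize_line(line)
+    const duplicate = key !== '' && key === previous
+    previous = key
+    return duplicate
+  }
+}
+```
+
+Each incoming line would be passed through `is_duplicate` once and spoken only when it returns `false`.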
+
+**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
--- a/SPEAKER-DIARIZATION.md
+++ b/SPEAKER-DIARIZATION.md
@@ -0,0 +1,80 @@
+# Speaker Diarization
+
+Notes on adding speaker identification to the voice pipeline.
+
+---
+
+## What Whisper does and doesn't do
+
+Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.
+
+---
+
+## Adding diarization
+
+Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
+
+### pyannote.audio
+
+The most capable open-source option. Runs locally, GPU-accelerated.
+
+```
+pip install pyannote.audio
+```
+
+Produces time-stamped segments:
+```
+SPEAKER_00  0.5s – 3.2s
+SPEAKER_01  3.8s – 7.1s
+SPEAKER_00  7.5s – 12.0s
+```
+
+Feed each segment to Whisper independently → transcript with speaker tags.
+
+Requires a HuggingFace token and model access request (free).
+
+### sherpa-onnx
+
+Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.
+
+---
+
+## Speaker identification vs diarization
+
+| Feature | Diarization | Identification |
+|---------|------------|----------------|
+| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
+| Requires enrollment | ✗ | ✓ (reference clip per person) |
+| Cross-session consistency | ✗ | ✓ |
+
+For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:
+
+- **resemblyzer** — simple cosine similarity on d-vector embeddings
+- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
+- **speechbrain** — full toolkit, heavier dependency
+
+---
+
+## Future experiments
+
+### Multi-speaker transcription
+Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
+```
+Mikael: How do I juggle oranges?
+Claude: Start with two balls and work up from there.
+```
+
+### Real-time diarization
+Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
+
+### Voice enrollment
+Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
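+
+The scoring step could look roughly like this, assuming embeddings are plain numeric arrays produced by some speaker embedding model. `cosine_similarity`, `identify_speaker`, the `enrolled` map, and the 0.7 threshold are all hypothetical and would need tuning per model.
+
+```javascript
+// Cosine similarity between two equal-length numeric vectors.
+function cosine_similarity(a, b) {
+  let dot = 0, norm_a = 0, norm_b = 0
+  for (let i = 0; i < a.length; i++) {
+    dot += a[i] * b[i]
+    norm_a += a[i] * a[i]
+    norm_b += b[i] * b[i]
+  }
+  if (norm_a === 0 || norm_b === 0) return 0
+  return dot / Math.sqrt(norm_a * norm_b)
+}
+
+// enrolled: { name: embedding } (assumed shape, not a project format).
+// Returns the closest enrolled speaker, or name: null below the threshold.
+function identify_speaker(embedding, enrolled, threshold = 0.7) {
+  let best = { name: null, score: -Infinity }
+  for (const [name, reference] of Object.entries(enrolled)) {
+    const score = cosine_similarity(embedding, reference)
+    if (score > best.score) best = { name, score }
+  }
+  return best.score >= threshold ? best : { name: null, score: best.score }
+}
+```
+
+A caller would treat `name: null` as an unknown speaker and assign a temporary anonymous label.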
+
+### Wake word per speaker
+Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
+
+---
+
+## VRAM considerations
+
+pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
--- a/TODO.md
+++ b/TODO.md
@@ -0,0 +1,41 @@
+# Voice Experiment — TODO
+
+## Classifier
+- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
+- [ ] Tune classifier — still too conservative on plain statements and short instructions
+- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
+
+## Voice switching
+- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
+- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
+- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
+- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
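+
+The daily rotation item above could be sketched as follows. This is a minimal illustration, not code from the project: `pick_daily_voice` and the voice names passed to it are hypothetical, and FNV-1a is just one convenient string hash.
+
+```javascript
+// Hypothetical helper: pick a voice name deterministically from the date.
+function pick_daily_voice(voice_names, date = new Date()) {
+  if (voice_names.length === 0) throw new Error('no voices configured')
+  const day = date.toISOString().slice(0, 10)   // "YYYY-MM-DD" in UTC
+  // FNV-1a 32-bit hash over the date string
+  let hash = 0x811c9dc5
+  for (let i = 0; i < day.length; i++) {
+    hash ^= day.charCodeAt(i)
+    hash = Math.imul(hash, 0x01000193) >>> 0
+  }
+  return voice_names[hash % voice_names.length]
+}
+```
+
+Because the date is taken in UTC, the voice changes at 00:00 UTC; a local-date string could be substituted if rotation at local midnight is preferred.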
+
+## Pipeline
+- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
+- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
+- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
+- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
+- [ ] Reserved control vocabulary — single words, pattern matched before any classifier:
+
+  | Word(s) | Action |
+  |---------|--------|
+  | `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
+  | `project note` | Activate with project context injected into the prompt preamble |
+  | `go` / `send` / `done` | Dispatch current query immediately |
+  | `cancel` / `scratch that` | Discard accumulated query, return to idle |
+  | `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
+  | `switch voice` / `different voice` | Trigger voice-switch flow |
+
+  Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content.
+
+- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
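+
+A rough sketch of the reserved vocabulary and stage hierarchy described in the two items above. Only the trigger words come from the table; the action names and the function names (`match_control_word`, `classify`) are hypothetical placeholders.
+
+```javascript
+// Trigger words from the table, mapped to hypothetical action names.
+const CONTROL_WORDS = new Map([
+  ['computer', 'activate'],
+  ['project note', 'activate_project'],
+  ['go', 'dispatch'], ['send', 'dispatch'], ['done', 'dispatch'],
+  ['cancel', 'cancel'], ['scratch that', 'cancel'],
+  ['stop', 'shutdown'], ['goodbye', 'shutdown'], ['hang up', 'shutdown'],
+  ['switch voice', 'switch_voice'], ['different voice', 'switch_voice'],
+])
+
+// Lowercase, strip punctuation, collapse whitespace.
+function normalize(text) {
+  return text.toLowerCase().replace(/[^a-z0-9\s]/g, '').replace(/\s+/g, ' ').trim()
+}
+
+// Pattern stage: matches only when the whole utterance is a control word.
+function match_control_word(text) {
+  return CONTROL_WORDS.get(normalize(text)) ?? null
+}
+
+// Run stages in order; the first non-null result short-circuits the rest.
+async function classify(text, stages) {
+  for (const stage of stages) {
+    const result = await stage(text)
+    if (result !== null && result !== undefined) return result
+  }
+  return null
+}
+```
+
+With this shape, `classify(text, [match_control_word, some_llm_stage])` would only reach the LLM stage for real query content, and LLM stages can be added or removed without touching the pattern layer.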
+
+## TTS / Long text
+- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
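+
+One possible shape for the sentence splitting, assuming a character budget per chunk. `split_sentences` and the 300-character default are illustrative assumptions (the actual Chatterbox limit is not stated here), and this is not the logic of the existing `speak_streaming`.
+
+```javascript
+// Naive splitter: breaks on . ! ? and regroups sentences into chunks
+// of at most max_chars. A single sentence longer than max_chars is
+// kept whole rather than cut mid-sentence.
+function split_sentences(text, max_chars = 300) {
+  const sentences = text.match(/[^.!?]+[.!?]+["')\]]*|[^.!?]+$/g) ?? []
+  const chunks = []
+  let current = ''
+  for (const sentence of sentences.map(s => s.trim()).filter(Boolean)) {
+    if (current && current.length + 1 + sentence.length > max_chars) {
+      chunks.push(current)
+      current = sentence
+    } else {
+      current = current ? `${current} ${sentence}` : sentence
+    }
+  }
+  if (current) chunks.push(current)
+  return chunks
+}
+```
+
+The regex is deliberately simple: it also splits on `...` and on decimal points, which the regrouping step usually hides but does not fix. Each chunk would then be queued for synthesis in order.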
+
+## TTS / Prompt
+- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
+
+## STT
+- [ ] Try Whisper small.en or medium.en for better accuracy on accents
--- a/VOICES.md
+++ b/VOICES.md
@@ -0,0 +1,29 @@
+# Voice Clone Wishlist
+
+Voices to prepare reference clips for.
+
+## Ready
+
+| Voice | Character / Show | Notes |
+|-------|-----------------|-------|
+| Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well |
+| Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly authoritative delivery, dry clean speech |
+
+## To Do
+
+| Voice | Character / Show | Notes |
+|-------|-----------------|-------|
+| Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality |
+| Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum |
+| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
+| Steve Bacic | Andromeda (Telemachus Rhade) | — |
+| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
+| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
+| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |
+
+## Notes
+
+- Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
+- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
+- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both
+- Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone
--- a/acting-demo-bark.mjs
+++ b/acting-demo-bark.mjs
@@ -0,0 +1,135 @@
+/**
+ * Demonstrates acting and expressiveness with Bark TTS.
+ *
+ * Run: node acting-demo-bark.mjs [start-section]
+ *
+ * Bark expressive techniques:
+ *   UPPERCASE          — stress/emphasis on a word
+ *   ...                — hesitation, trailing off
+ *   [laughs]           — laughter
+ *   [sighs]            — sigh
+ *   [gasps]            — sharp intake of breath
+ *   [clears throat]    — throat clearing
+ *   ♪ text ♪           — sung phrase
+ *   —                  — abrupt break
+ *   !                  — raised energy
+ *
+ * Voice presets (English):
+ *   v2/en_speaker_0  calm female
+ *   v2/en_speaker_1  calm male
+ *   v2/en_speaker_3  deep male
+ *   v2/en_speaker_6  neutral/warm (default)
+ *   v2/en_speaker_9  expressive
+ */
+
+import { Bark_Tts } from './lib/bark-tts.mjs'
+
+const tts = new Bark_Tts()
+
+process.stderr.write('[bark] loading model...\n')
+await tts.init()
+process.stderr.write('[bark] ready\n\n')
+
+const START = parseInt(process.argv[2] ?? '1', 10)
+
+function section(title) {
+	process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
+}
+
+async function say(text, voice = undefined) {
+	const opts = { preprocess: false }
+	if (voice) opts.voice = voice
+	process.stdout.write(`  ${voice ?? 'default'}\n  "${text}"\n\n`)
+	await tts.speak(text, opts)
+}
+
+// ─── 1. Baseline ───────────────────────────────────────
+section('1. Baseline')
+if (START <= 1) {
+	await say('The package has arrived. Please sign here.')
+}
+
+// ─── 2. Paralinguistic tokens ──────────────────────────
+section('2. Paralinguistic tokens')
+if (START <= 2) {
+	await say('[clears throat] Right. Where were we.')
+	await say('And then he just... [sighs] gave up.')
+	await say('[gasps] I had no idea you were here!')
+	await say("Ha. [laughs] That's the funniest thing I've heard all week.")
+}
+
+// ─── 3. Emphasis via UPPERCASE ─────────────────────────
+section('3. Emphasis via UPPERCASE')
+if (START <= 3) {
+	await say('You need to stop. RIGHT NOW.')
+	await say('I have told you THIS BEFORE.')
+	await say('This is ABSOLUTELY unacceptable.')
+}
+
+// ─── 4. Hesitation and rhythm ──────────────────────────
+section('4. Hesitation and rhythm')
+if (START <= 4) {
+	await say('It was... quiet. Too quiet.')
+	await say('I just... I don\'t know what to say.')
+	await say('He said he was fine — but his eyes told a different story.')
+}
+
+// ─── 5. Sung phrase ────────────────────────────────────
+section('5. Singing')
+if (START <= 5) {
+	await say('♪ Happy birthday to you, happy birthday to you ♪')
+}
+
+// ─── 6. Acronyms ───────────────────────────────────────
+section('6. Acronyms')
+if (START <= 6) {
+	// Without spacing — Bark may try to pronounce as a word
+	await say('The FBI is investigating.')
+	await say('NASA launched a new rocket.')
+	await say('I work with the CPU and GPU.')
+	// With letter spacing — forces spelling out
+	await say('The F B I is investigating.')
+	await say('N A S A launched a new rocket.')
+	await say('I work with the C P U and G P U.')
+}
+
+// ─── 7. Voice character ────────────────────────────────
+section('7. Voice presets — same line')
+if (START <= 7) {
+	const line = "Well, that's certainly one way to look at it."
+	for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) {
+		await say(line, voice)
+	}
+}
+
+// ─── 8. Combined scene ─────────────────────────────────
+section('8. Combined — a tense scene')
+if (START <= 8) {
+	await say('Something is wrong.', 'v2/en_speaker_9')
+	await say('I can FEEL it.', 'v2/en_speaker_9')
+	await say('[gasps] Run!', 'v2/en_speaker_9')
+}
+
+// ─── 9. Paragraph — voices 7-9 ─────────────────────────
+// Pass voices as CLI args to override, e.g.:
+//   node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5
+const PARA_VOICES = process.argv.slice(3).length
+	? process.argv.slice(3)
+	: ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9']
+
+const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. '
+	+ '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. '
+	+ 'Her hands trembled as she tore open the envelope. '
+	+ 'Inside was a single sheet of paper with just THREE words written on it. '
+	+ '[gasps] She read them twice... then sat down slowly, and stared at the wall.'
+
+section('9. Paragraph — voice comparison')
+if (START <= 9) {
+	for (const voice of PARA_VOICES) {
+		process.stdout.write(`\n  — ${voice} —\n`)
+		await say(PARAGRAPH, voice)
+	}
+}
+
+tts.stop()
+section('Done')
--- a/acting-demo-chatterbox.mjs
+++ b/acting-demo-chatterbox.mjs
@@ -0,0 +1,109 @@
+/**
+ * Acting demo for Chatterbox TTS.
+ *
+ * Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full]
+ *
+ * Paralinguistic tags:
+ *   [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
+ *
+ * Exaggeration (full model only, 0.0-1.0):
+ *   0.0 = neutral, 1.0 = maximum emotion intensity
+ */
+
+import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
+
+const START   = parseInt(process.argv[2] ?? '1', 10)
+const VARIANT = process.argv[3] ?? 'turbo'
+
+const tts = new Chatterbox_Tts({ variant: VARIANT })
+
+process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`)
+await tts.init()
+process.stderr.write('[chatterbox] ready\n\n')
+
+function section(title) {
+    process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
+}
+
+async function say(text, opts = {}) {
+    process.stdout.write(`  "${text}"\n`)
+    if (Object.keys(opts).length) process.stdout.write(`  ${JSON.stringify(opts)}\n`)
+    process.stdout.write('\n')
+    await tts.speak(text, { preprocess: false, ...opts })
+}
+
+// ─── 1. Baseline ───────────────────────────────────────
+section('1. Baseline')
+if (START <= 1) {
+    await say('The package has arrived. Please sign here.')
+}
+
+// ─── 2. Paralinguistic tags ────────────────────────────
+section('2. Paralinguistic tags')
+if (START <= 2) {
+    await say('[cough] Right, where were we.')
+    await say('And then he just... [sigh] gave up.')
+    await say('[gasp] I had no idea you were here!')
+    await say('Ha! [laugh] That is the funniest thing I have heard all week.')
+    await say('[chuckle] Well, that is certainly one way to look at it.')
+    await say('[groan] Not this again.')
+}
+
+// ─── 3. Punctuation and rhythm ─────────────────────────
+section('3. Punctuation and rhythm')
+if (START <= 3) {
+    await say('It was... quiet. Too quiet.')
+    await say('He said he was fine — but his eyes told a different story.')
+    await say('I told you. I told you this would happen. But did anyone listen? No.')
+}
+
+// ─── 4. Exaggeration — full model only ─────────────────
+section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`)
+if (START <= 4) {
+    await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 })
+    await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 })
+    await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 })
+}
+
+// ─── 5. Temperature ────────────────────────────────────
+section('5. Temperature (lower = more predictable)')
+if (START <= 5) {
+    await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 })
+    await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 })
+    await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 })
+}
+
+// ─── 6. Paragraph ──────────────────────────────────────
+section('6. Paragraph with tags')
+if (START <= 6) {
+    await say(
+        'It was a cold evening when the letter finally arrived. '
+        + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. '
+        + 'Her hands trembled as she tore open the envelope. '
+        + '[gasp] Inside was a single sheet of paper with just three words written on it.'
+    )
+}
+
+// ─── 7. Audio prompt / voice cloning ───────────────────
+// Pass a WAV file path as third argument:
+//   node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav
+section('7. Audio prompt (voice cloning)')
+if (START <= 7) {
+    const audio_prompt = process.argv[4]
+    if (!audio_prompt) {
+        process.stdout.write('  Skipped. Pass a WAV file as the third argument:\n')
+        process.stdout.write('  node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n')
+        process.stdout.write('  Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n')
+    } else {
+        process.stdout.write(`  voice: ${audio_prompt}\n\n`)
+        await say('Hello. This is a test of voice cloning.', { audio_prompt })
+        await say(
+            'It was a cold evening when the letter finally arrived. '
+            + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad.',
+            { audio_prompt }
+        )
+    }
+}
+
+tts.stop()
+section('Done')
--- a/acting-demo.mjs
+++ b/acting-demo.mjs
@@ -0,0 +1,163 @@
+/**
+ * Demonstrates prosody and acting techniques with Kokoro TTS.
+ *
+ * Run: node acting-demo.mjs [start-section]
+ * E.g.: node acting-demo.mjs 5
+ *
+ * Kokoro's expressive range comes from four levers:
+ *
+ * 1. Voice choice — each voice has a baked-in emotional character
+ * 2. Speed — slower = deliberate/grave, faster = excited/happy
+ * 3. silenceScale — controls pause length between sentences
+ * 4. Punctuation & markup — espeak-ng honours some SSML and typographic cues
+ *
+ * SSML tags espeak-ng understands (not supported by the backend used here; see section 5):
+ *   <break time="500ms"/>          — explicit pause
+ *   <emphasis level="strong">x</emphasis>  — louder/longer (model-dependent)
+ *   <prosody rate="slow">x</prosody>       — local rate change
+ *   <phoneme alphabet="ipa" ph="..."/>     — force pronunciation
+ *
+ * Typographic cues that affect prosody without SSML:
+ *   UPPERCASE words            — some stress
+ *   Ellipsis ...               — trailing pause / trailing off
+ *   Em-dash —                  — abrupt break
+ *   Comma placement            — micro-pauses
+ *   Exclamation !              — raised energy
+ */
+
+import { Tts } from './lib/tts.mjs'
+import { spawn } from 'node:child_process'
+
+const tts = new Tts()
+tts.init()
+
+// Calling the wrapped sherpa-onnx engine (tts._tts) directly is the simplest way
+// to control per-utterance GenerationConfig without reworking the Tts class.
+
+async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) {
+	const { createRequire } = await import('node:module')
+	const require = createRequire(import.meta.url)
+	const sherpa  = require('sherpa-onnx-node')
+
+	const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
+	const sid    = VOICES[voice] ?? 0
+
+	process.stdout.write(`  voice=${voice} speed=${speed} silence=${silence_scale}\n  "${text}"\n\n`)
+
+	const audio = tts._tts.generate({
+		text,
+		generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
+	})
+
+	await play(audio.samples, audio.sampleRate)
+}
+
+// Synthesize multiple phrase/option pairs and play them as one continuous audio
+async function say_concat(parts) {
+	const { createRequire } = await import('node:module')
+	const require = createRequire(import.meta.url)
+	const sherpa  = require('sherpa-onnx-node')
+	const VOICES  = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
+
+	let total_len = 0
+	const chunks = []
+
+	for (const [text, opts = {}] of parts) {
+		const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts
+		const sid   = VOICES[voice] ?? 0
+		const audio = tts._tts.generate({
+			text,
+			generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
+		})
+		process.stdout.write(`  [${speed}x] "${text}"\n`)
+		chunks.push(audio.samples)
+		total_len += audio.samples.length
+	}
+
+	const combined = new Float32Array(total_len)
+	let offset = 0
+	for (const chunk of chunks) {
+		combined.set(chunk, offset)
+		offset += chunk.length
+	}
+
+	await play(combined, tts._tts.sampleRate)
+	process.stdout.write('\n')
+}
+
+function play(samples, sample_rate) {
+	return new Promise((resolve, reject) => {
+		const proc = spawn('pacat', [
+			'--format=float32le', `--rate=${sample_rate}`, '--channels=1',
+		], { stdio: ['pipe', 'ignore', 'inherit'] })
+		proc.on('error', reject)
+		proc.on('close', resolve)
+		proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))
+		proc.stdin.end()
+	})
+}
+
+const START = parseInt(process.argv[2] ?? '1', 10)
+
+function section(title) {
+	process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
+}
+
+// ─── 1. Baseline ───────────────────────────────────────
+section('1. Baseline — neutral af_heart')
+if (START <= 1) {
+	await say('The package has arrived. Please sign here.')
+}
+
+// ─── 2. Speed as emotion ───────────────────────────────
+section('2. Speed — slow (grave) vs fast (excited)')
+if (START <= 2) {
+	await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 })
+	await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 })
+}
+
+// ─── 3. Voice character ────────────────────────────────
+section('3. Voice character — same line, different voices')
+if (START <= 3) {
+	const line = "Well, that's certainly one way to look at it."
+	for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) {
+		await say(line, { voice })
+	}
+}
+
+// ─── 4. Punctuation-driven prosody ─────────────────────
+section('4. Punctuation and typographic cues')
+if (START <= 4) {
+	await say('I told you. I told you this would happen. But did anyone listen? No.')
+	await say('It was... quiet. Too quiet.')
+	await say('He said he was fine — but his eyes told a different story.')
+	await say('This is ABSOLUTELY unacceptable.')
+}
+
+// ─── 5. Per-phrase speed — manual emphasis ─────────────
+// SSML tags are not supported by this backend; emphasis is achieved by
+// synthesizing key phrases at a different speed and concatenating the audio.
+section('5. Per-phrase speed (manual emphasis)')
+if (START <= 5) {
+	// "stop" spoken slower and then the rest faster — sounds like emphasis
+	await say_concat([
+		['You need to ',        { speed: 1.0 }],
+		['stop.',               { speed: 0.7 }],
+		['Right now.',          { speed: 1.1 }],
+	])
+	await say_concat([
+		['Listen very carefully.', { speed: 0.75 }],
+		['I will only say this once.', { speed: 0.9 }],
+	])
+}
+
+// ─── 6. Combining levers ───────────────────────────────
+section('6. Combined — a tense scene')
+if (START <= 6) {
+	await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 })
+	await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 })
+	await say('Run!', { speed: 1.3, voice: 'af_sky' })
+}
+
+section('Done')
+process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n')
--- a/bark-server.py
+++ b/bark-server.py
@@ -0,0 +1,135 @@
+#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
+"""
+Bark TTS server — keeps model loaded, reads JSON lines from stdin.
+
+Protocol:
+  stdin:  one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"}
+  stdout: "ok\n" after each utterance is finished playing
+  stderr: status/timing messages
+
+Usage:
+  python bark-server.py [model] [voice]
+  python bark-server.py suno/bark-small v2/en_speaker_3
+
+Voices (English):
+  v2/en_speaker_0  .. v2/en_speaker_9
+  Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male
+"""
+
+import os
+import sys
+import json
+import time
+import subprocess
+import numpy as np
+
+MODEL      = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark'
+DEF_VOICE  = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6'
+SAMPLE_RATE = 24000
+
+TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
+try:
+    with open(TOKEN_FILE) as f:
+        os.environ['HF_TOKEN'] = f.read().strip()
+except FileNotFoundError:
+    pass
+
+# Disable background safetensors conversion attempts — the HF API endpoint
+# it calls has changed and now errors. The pytorch_model.bin files work fine.
+os.environ['SAFETENSORS_FAST_GPU'] = '0'
+os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1')
+
+def log(msg):
+    print(f'[bark] {msg}', file=sys.stderr, flush=True)
+
+log(f'loading {MODEL}...')
+t0 = time.time()
+
+import torch
+from transformers import AutoProcessor, BarkModel
+
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
+dtype  = torch.float16 if device == 'cuda' else torch.float32
+
+if device == 'cuda':
+    # TF32 is faster on Ampere and newer GPUs (30xx, 40xx) for both matmul and convolutions
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32       = True
+
+processor = AutoProcessor.from_pretrained(MODEL)
+model     = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device)
+
+# Bark's internal generation calls set both max_length and max_new_tokens simultaneously
+# which causes a conflict warning and may cause early stopping. Clearing max_length
+# lets max_new_tokens take sole control.
+model.generation_config.max_length = None
+
+# torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile)
+try:
+    model = torch.compile(model, mode='reduce-overhead')
+    log('torch.compile applied')
+except Exception as e:
+    log(f'torch.compile skipped: {e}')
+
+log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
+print('ready', flush=True)  # signal to Node that we are ready
+
+
+def speak(text, voice):
+    t1 = time.time()
+
+    inputs = processor(text, voice_preset=voice, return_tensors='pt')
+
+    # Set attention_mask if not provided by processor — without it the model
+    # cannot distinguish padding from real tokens, which causes erratic generation
+    # including leading filler sounds.
+    if 'attention_mask' not in inputs:
+        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
+
+    inputs = {k: v.to(device) for k, v in inputs.items()}
+
+    with torch.inference_mode():
+        audio = model.generate(
+            **inputs,
+            semantic_temperature = 0.3,
+            coarse_temperature   = 0.3,
+            fine_temperature     = 0.3,
+        )
+
+    samples = audio.cpu().numpy().squeeze().astype(np.float32)
+    elapsed_gen = time.time() - t1
+    duration    = len(samples) / SAMPLE_RATE
+    log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s  rtf={elapsed_gen/duration:.2f}')
+
+    proc = subprocess.Popen(
+        ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
+        stdin=subprocess.PIPE,
+    )
+    proc.stdin.write(samples.tobytes())
+    proc.stdin.close()
+    proc.wait()
+
+
+for line in sys.stdin:
+    line = line.strip()
+    if not line:
+        continue
+
+    try:
+        req   = json.loads(line)
+        text  = req.get('text', '')
+        voice = req.get('voice', DEF_VOICE)
+    except (json.JSONDecodeError, AttributeError):  # not JSON, or JSON that is not an object: treat as plain text
+        text  = line
+        voice = DEF_VOICE
+
+    if not text:
+        print('ok', flush=True)
+        continue
+
+    try:
+        speak(text, voice)
+    except Exception as e:
+        log(f'error: {e}')
+
+    print('ok', flush=True)
--- a/chatterbox-server.py
+++ b/chatterbox-server.py
@@ -0,0 +1,200 @@
+#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
+"""
+Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin.
+
+Protocol:
+  stdin:  {"text": "...", "temperature": 0.8, "top_p": 0.95}
+  stdout: "ok\n" after each utterance is generated (playback may still be in progress)
+  stderr: status/timing messages
+
+Usage:
+  ./chatterbox-server.py
+  ./chatterbox-server.py turbo    # default
+  ./chatterbox-server.py full     # original model, supports exaggeration
+
+Paralinguistic tags supported in text:
+  [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
+
+Full model only:
+  exaggeration  0.0-1.0  emotion intensity (ignored in turbo)
+"""
+
+import os
+import sys
+import json
+import time
+import queue
+import threading
+import subprocess
+import numpy as np
+
+TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
+try:
+    with open(TOKEN_FILE) as f:
+        os.environ['HF_TOKEN'] = f.read().strip()
+except FileNotFoundError:
+    pass
+
+def find_hf_cache(repo_id):
+    """Return the local snapshot path if the model is already cached, else None."""
+    from pathlib import Path
+    cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub'
+    repo_dir  = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots'
+    if repo_dir.exists():
+        snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime)
+        if snapshots:
+            return str(snapshots[-1])
+    return None
+
+VARIANT    = sys.argv[1] if len(sys.argv) > 1 else 'turbo'
+SAMPLE_RATE = 24000
+
+def log(msg):
+    print(f'[chatterbox] {msg}', file=sys.stderr, flush=True)
+
+log(f'loading chatterbox-{VARIANT}...')
+t0 = time.time()
+
+import tempfile
+import traceback
+import numpy as np
+import torch
+import soundfile as sf
+import librosa as _librosa
+
+# librosa.resample returns float64 in newer numpy — patch it to always return float32
+_orig_resample = _librosa.resample
+def _resample_float32(*args, **kwargs):
+    return _orig_resample(*args, **kwargs).astype(np.float32)
+_librosa.resample = _resample_float32
+
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+REPO_IDS = {
+    'turbo': 'ResembleAI/chatterbox-turbo',
+    'full':  'ResembleAI/chatterbox',
+}
+
+if VARIANT == 'turbo':
+    from chatterbox.tts_turbo import ChatterboxTurboTTS as Model
+else:
+    from chatterbox.tts import ChatterboxTTS as Model
+
+cached = find_hf_cache(REPO_IDS[VARIANT])
+if cached:
+    log(f'loading from cache: {cached}')
+    model = Model.from_local(cached, device=device)
+else:
+    log('cache not found, downloading...')
+    model = Model.from_pretrained(device=device)
+
+log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
+print('ready', flush=True)
+
+
+_wav_cache = {}
+
+def ensure_float32_wav(path):
+    """Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue.
+    Result is cached by input path so repeated calls with the same file are free."""
+    if path in _wav_cache:
+        return _wav_cache[path]
+    wav, sr = sf.read(path, dtype='float32', always_2d=True)
+    wav = wav.mean(axis=1)  # stereo → mono if needed
+    tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
+    sf.write(tmp.name, wav, sr, subtype='FLOAT')
+    _wav_cache[path] = tmp.name
+    return tmp.name
+
+
+_SENTINEL = object()
+
+playback_queue = queue.Queue()
+
+
+def playback_worker():
+    """Plays audio samples in order. Runs in its own thread."""
+    while True:
+        item = playback_queue.get()
+        if item is _SENTINEL:
+            break
+        samples = item
+        proc = subprocess.Popen(
+            ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
+            stdin=subprocess.PIPE,
+        )
+        proc.stdin.write(samples.tobytes())
+        proc.stdin.close()
+        proc.wait()
+        playback_queue.task_done()
+
+
+playback_thread = threading.Thread(target=playback_worker, daemon=True)
+playback_thread.start()
+
+
+def generate(text, opts):
+    t1 = time.time()
+
+    if VARIANT == 'turbo':
+        kwargs = {
+            'temperature':        opts.get('temperature',        0.8),
+            'top_p':              opts.get('top_p',              0.95),
+            'top_k':              opts.get('top_k',              1000),
+            'repetition_penalty': opts.get('repetition_penalty', 1.2),
+            'min_p':              opts.get('min_p',              0.0),
+        }
+    else:
+        kwargs = {
+            'temperature':        opts.get('temperature',        0.8),
+            'top_p':              opts.get('top_p',              1.0),
+            'repetition_penalty': opts.get('repetition_penalty', 1.2),
+            'min_p':              opts.get('min_p',              0.05),
+            'exaggeration':       opts.get('exaggeration',       0.5),
+            'cfg_weight':         opts.get('cfg_weight',         0.5),
+        }
+
+    audio_prompt = opts.get('audio_prompt')
+    if audio_prompt:
+        kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt)
+
+    with torch.inference_mode():
+        wav = model.generate(text, **kwargs)
+
+    samples = wav.squeeze(0).cpu().numpy().astype(np.float32)
+    elapsed = time.time() - t1
+    duration = len(samples) / SAMPLE_RATE
+    log(f'generated {duration:.1f}s audio in {elapsed:.1f}s  rtf={elapsed/duration:.2f}')
+
+    return samples
+
+
+for line in sys.stdin:
+    line = line.strip()
+    if not line:
+        continue
+
+    try:
+        req  = json.loads(line)
+        text = req.pop('text', '')
+        opts = req  # remaining fields are generation options
+    except (json.JSONDecodeError, AttributeError, TypeError):  # not JSON, or JSON that is not an object
+        text = line
+        opts = {}
+
+    if not text:
+        print('ok', flush=True)
+        continue
+
+    try:
+        samples = generate(text, opts)
+        playback_queue.put(samples)
+    except Exception as e:
+        log(f'error: {e}')
+        traceback.print_exc(file=sys.stderr)
+
+    print('ok', flush=True)
+
+# Drain playback before exit
+playback_queue.put(_SENTINEL)
+playback_thread.join()
--- a/demo-bark.mjs
+++ b/demo-bark.mjs
@@ -0,0 +1,85 @@
+/**
+ * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS
+ *
+ * Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs],
+ * [clears throat]), and hesitation (...). Markdown is preprocessed automatically:
+ * **bold** → UPPERCASE, headers get a pause, etc.
+ *
+ * Environment variables:
+ *   WHISPER_MODEL   default: base.en
+ *                   options: tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
+ *   BARK_MODEL      default: suno/bark  (use suno/bark-small for faster/smaller)
+ *   BARK_VOICE      default: v2/en_speaker_6
+ *                   options: v2/en_speaker_0 .. v2/en_speaker_9
+ *   STT_PROVIDER    default: cuda
+ *   USE_LLM         set to 0 to disable (default: 1)
+ *   OLLAMA_MODEL    default: phi3:mini
+ */
+
+import { Stt }                             from './lib/stt.mjs'
+import { Bark_Tts }                        from './lib/bark-tts.mjs'
+import { llm_available, list_models, cleanup } from './lib/llm.mjs'
+
+const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
+const STT_PROVIDER  = process.env.STT_PROVIDER  || 'cuda'
+const USE_LLM       = process.env.USE_LLM !== '0'
+
+process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
+const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
+stt.init()
+process.stderr.write('[stt] ready\n')
+
+process.stderr.write('[tts] loading Bark (this takes a moment)...\n')
+const tts = new Bark_Tts()
+await tts.init()
+process.stderr.write('[tts] Bark ready\n')
+
+let has_llm = false
+if (USE_LLM) {
+	has_llm = await llm_available()
+	if (has_llm) {
+		const models = await list_models()
+		process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
+	} else {
+		process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
+	}
+}
+
+process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`)
+process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
+
+let speaking = false
+
+const stop = stt.listen(async (raw_text) => {
+	if (speaking) {
+		process.stderr.write(`[skipped] ${raw_text}\n`)
+		return
+	}
+	speaking = true
+
+	try {
+		process.stdout.write(`[raw]  ${raw_text}\n`)
+
+		let text = raw_text
+		if (has_llm) {
+			text = await cleanup(raw_text)
+			if (text !== raw_text) {
+				process.stdout.write(`[llm]  ${text}\n`)
+			}
+		}
+
+		await tts.speak_streaming(text)
+		process.stdout.write('\n')
+	} catch (err) {
+		process.stderr.write(`[error] ${err.message}\n`)
+	} finally {
+		speaking = false
+	}
+})
+
+process.on('SIGINT', () => {
+	process.stderr.write('\n[voice] stopping\n')
+	stop()
+	tts.stop()
+	process.exit(0)
+})
--- a/demo-kokoro.mjs
+++ b/demo-kokoro.mjs
@@ -0,0 +1,85 @@
+/**
+ * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS
+ *
+ * Environment variables:
+ *   WHISPER_MODEL   default: base.en
+ *                   options: tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
+ *   VOICE           default: af_heart
+ *                   options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael
+ *   STT_PROVIDER    default: cuda
+ *   USE_LLM         set to 0 to disable (default: 1)
+ *   OLLAMA_MODEL    default: phi3:mini
+ */
+
+import { Stt }                             from './lib/stt.mjs'
+import { Tts, VOICES }                     from './lib/tts.mjs'
+import { llm_available, list_models, cleanup } from './lib/llm.mjs'
+
+const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
+const VOICE         = process.env.VOICE         || 'af_heart'
+const STT_PROVIDER  = process.env.STT_PROVIDER  || 'cuda'
+const USE_LLM       = process.env.USE_LLM !== '0'
+
+if (!(VOICE in VOICES)) {
+	process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`)
+	process.exit(1)
+}
+
+process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
+const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
+stt.init()
+process.stderr.write('[stt] ready\n')
+
+process.stderr.write('[tts] loading Kokoro...\n')
+const tts = new Tts({ voice: VOICE })
+tts.init()
+process.stderr.write('[tts] ready\n')
+
+let has_llm = false
+if (USE_LLM) {
+	has_llm = await llm_available()
+	if (has_llm) {
+		const models = await list_models()
+		process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
+	} else {
+		process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
+	}
+}
+
+process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`)
+process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
+
+let speaking = false
+
+const stop = stt.listen(async (raw_text) => {
+	if (speaking) {
+		process.stderr.write(`[skipped] ${raw_text}\n`)
+		return
+	}
+	speaking = true
+
+	try {
+		process.stdout.write(`[raw]  ${raw_text}\n`)
+
+		let text = raw_text
+		if (has_llm) {
+			text = await cleanup(raw_text)
+			if (text !== raw_text) {
+				process.stdout.write(`[llm]  ${text}\n`)
+			}
+		}
+
+		await tts.speak_streaming(text)
+		process.stdout.write('\n')
+	} catch (err) {
+		process.stderr.write(`[error] ${err.message}\n`)
+	} finally {
+		speaking = false
+	}
+})
+
+process.on('SIGINT', () => {
+	process.stderr.write('\n[voice] stopping\n')
+	stop()
+	process.exit(0)
+})
--- a/demos/darkwing-briefing.sh
+++ b/demos/darkwing-briefing.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. Smith, Marella, Rommie
+
+TTS="${TTS_URL:-http://192.168.2.99:11500}"
+switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
+say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
+
+switch darkwing-duck
+say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover."
+
+switch gosalyn
+say "Dad, you're at the WRONG briefing AGAIN."
+
+switch harold-w-smith
+say "How did a duck get into this facility."
+
+switch marella
+say "The same way everything else does, sir. The paperwork looked official."
+
+switch rommie
+say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation."
+
+switch darkwing-duck
+say "I love it when a plan comes together. Let's get dangerous."
--- a/demos/threat-assessment.sh
+++ b/demos/threat-assessment.sh
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+# Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini
+
+TTS="${TTS_URL:-http://192.168.2.99:11500}"
+switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
+say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
+
+switch ivanova
+say "We have an unidentified vessel on an intercept course. Threat assessment?"
+
+switch rommie
+say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists."
+
+switch archangel
+say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive."
+
+switch chuin
+say "Are you? I have been wondering this myself for some time."
+
+switch santini
+say "I'm gonna need a bigger hangar."
--- a/demos/wrong-meeting.sh
+++ b/demos/wrong-meeting.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker
+
+TTS="${TTS_URL:-http://192.168.2.99:11500}"
+switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
+say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
+
+switch harold-w-smith
+say "This meeting never happened. None of you are here, and neither am I."
+
+switch ivanova
+say "That's extremely reassuring. Last time someone said that to me, a space station blew up."
+
+switch archangel
+say "Technically it was a controlled demolition. The paperwork was filed in triplicate."
+
+switch rommie
+say "I have reviewed the paperwork. It was filed AFTER the explosion."
+
+switch penny-parker
+say "You know, I walked into the wrong building again, didn't I."
+
+switch harold-w-smith
+say "Yes. But you saw nothing."
--- a/download-models.sh
+++ b/download-models.sh
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+# Download ONNX models for the voice experiment.
+#
+# Usage:
+#   bash download-models.sh              # default: whisper base.en
+#   WHISPER_MODEL=large-v3-turbo bash download-models.sh
+#
+# After downloading, run: node demo.mjs
+
+set -euo pipefail
+
+WHISPER_MODEL="${WHISPER_MODEL:-base.en}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+MODELS_DIR="${SCRIPT_DIR}/models"
+
+mkdir -p "${MODELS_DIR}"
+cd "${MODELS_DIR}"
+
+echo "==> models dir: ${MODELS_DIR}"
+
+# --- Silero VAD (~2 MB) ---
+if [ ! -f silero_vad.onnx ]; then
+	echo "==> downloading silero_vad.onnx"
+	curl -L --progress-bar -o silero_vad.onnx \
+		'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx'
+else
+	echo "==> silero_vad.onnx already present"
+fi
+
+# --- Whisper STT ---
+WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}"
+if [ ! -d "${WHISPER_DIR}" ]; then
+	echo "==> downloading whisper-${WHISPER_MODEL}"
+	curl -L --progress-bar -o "${WHISPER_DIR}.tar.bz2" \
+		"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2"
+	echo "==> extracting..."
+	tar xf "${WHISPER_DIR}.tar.bz2"
+else
+	echo "==> whisper-${WHISPER_MODEL} already present"
+fi
+
+# --- Kokoro TTS (~310 MB) ---
+if [ ! -d kokoro-en-v0_19 ]; then
+	echo "==> downloading kokoro-en-v0_19 (English TTS)"
+	curl -L --progress-bar -o kokoro-en-v0_19.tar.bz2 \
+		'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2'
+	echo "==> extracting..."
+	tar xf kokoro-en-v0_19.tar.bz2
+else
+	echo "==> kokoro-en-v0_19 already present"
+fi
+
+echo ""
+echo "==> all models ready"
+echo ""
+echo "Next steps:"
+echo "  cd ${SCRIPT_DIR}"
+echo "  npm install"
+echo "  node demo.mjs"
+echo ""
+echo "To use a better STT model:"
+echo "  WHISPER_MODEL=large-v3-turbo bash download-models.sh"
+echo "  WHISPER_MODEL=large-v3-turbo node demo.mjs"
--- a/lib/bark-tts.mjs
+++ b/lib/bark-tts.mjs
@@ -0,0 +1,112 @@
+/**
+ * Bark TTS — Node.js wrapper around bark-server.py.
+ *
+ * Spawns the Python server once, keeps it alive, sends requests as JSON lines.
+ * Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
+ *
+ * Usage:
+ *   const tts = new Bark_Tts()
+ *   await tts.init()          // spawns server, waits for model load
+ *   await tts.speak('Hello')
+ *   await tts.speak('**Very** important point.')   // emphasis via CAPS
+ *   tts.stop()
+ *
+ * Environment variables:
+ *   BARK_MODEL   HuggingFace model id (default: suno/bark)
+ *   BARK_VOICE   voice preset        (default: v2/en_speaker_6)
+ *
+ * Bark voice presets (English):
+ *   v2/en_speaker_0  calm female
+ *   v2/en_speaker_1  calm male
+ *   v2/en_speaker_3  deep male
+ *   v2/en_speaker_6  neutral/warm (default)
+ *   v2/en_speaker_9  expressive
+ */
+
+import { spawn }        from 'node:child_process'
+import * as path        from 'node:path'
+import * as readline    from 'node:readline'
+import { markdown_to_bark, split_sentences } from './markdown.mjs'
+
+const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark'
+const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6'
+const SERVER     = path.join(import.meta.dirname, '..', 'bark-server.py')
+
+export class Bark_Tts {
+    constructor({
+        model = BARK_MODEL,
+        voice = BARK_VOICE,
+    } = {}) {
+        this._model   = model
+        this._voice   = voice
+        this._proc    = null
+        this._rl      = null
+        this._resolve = null  // resolver for the current in-flight request
+    }
+
+    /** Spawn bark-server.py and wait until it signals "ready". */
+    init() {
+        return new Promise((resolve, reject) => {
+            this._proc = spawn(SERVER, [this._model, this._voice], {
+                stdio: ['pipe', 'pipe', 'inherit'],
+            })
+
+            this._proc.on('error', reject)
+            this._proc.on('close', (code) => {
+                if (code !== 0 && code !== null) {
+                    process.stderr.write(`[bark] server exited with code ${code}\n`)
+                }
+            })
+
+            this._rl = readline.createInterface({ input: this._proc.stdout })
+
+            this._rl.on('line', (line) => {
+                if (line === 'ready') {
+                    resolve()
+                    return
+                }
+                if (line === 'ok' && this._resolve) {
+                    const res  = this._resolve
+                    this._resolve = null
+                    res()
+                }
+            })
+        })
+    }
+
+    /** Preprocess markdown and speak as a single request. */
+    async speak(text, { voice = this._voice, preprocess = true } = {}) {
+        const clean = preprocess ? markdown_to_bark(text) : text
+        return this._send(clean, voice)
+    }
+
+    /**
+     * Preprocess markdown and speak sentence by sentence.
+     * Lower latency — first sentence starts playing while rest are queued.
+     */
+    async speak_streaming(text, opts = {}) {
+        const clean     = opts.preprocess !== false ? markdown_to_bark(text) : text
+        const sentences = split_sentences(clean)
+        for (const s of sentences) {
+            await this._send(s, opts.voice ?? this._voice)
+        }
+    }
+
+    _send(text, voice) {
+        return new Promise((resolve, reject) => {
+            if (!this._proc) {
+                return reject(new Error('Bark_Tts not initialized — call init() first'))
+            }
+            this._resolve = resolve
+            const payload = JSON.stringify({ text, voice }) + '\n'
+            this._proc.stdin.write(payload)
+        })
+    }
+
+    stop() {
+        this._rl?.close()
+        this._proc?.kill()
+        this._proc  = null
+        this._rl    = null
+    }
+}
--- a/lib/chatterbox-tts.mjs
+++ b/lib/chatterbox-tts.mjs
@@ -0,0 +1,101 @@
+/**
+ * Chatterbox TTS — Node.js wrapper around chatterbox-server.py.
+ *
+ * Usage:
+ *   const tts = new Chatterbox_Tts()
+ *   await tts.init()
+ *   await tts.speak('Hello! [chuckle] That is funny.')
+ *   await tts.speak('Something intense.', { exaggeration: 0.8 })  // full model only
+ *   tts.stop()
+ *
+ * Paralinguistic tags (embed in text):
+ *   [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
+ *
+ * Generation options (passed as second arg to speak()):
+ *   temperature         0.05–2.0   default 0.8
+ *   top_p               0–1        default 0.95
+ *   top_k               0–1000     default 1000
+ *   repetition_penalty  ≥1.0       default 1.2
+ *   min_p               0–1        default 0.0
+ *   audio_prompt        string     path to reference WAV for voice cloning
+ *   exaggeration        0–1        emotion intensity (full model only)
+ *   cfg_weight          0–1        classifier-free guidance (full model only)
+ */
+
+import { spawn }     from 'node:child_process'
+import * as path     from 'node:path'
+import * as readline from 'node:readline'
+import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs'
+
+const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py')
+
+export class Chatterbox_Tts {
+    constructor({
+        variant = 'turbo',  // 'turbo' or 'full'
+    } = {}) {
+        this._variant  = variant
+        this._proc     = null
+        this._rl       = null
+        this._resolve  = null
+    }
+
+    init() {
+        return new Promise((resolve, reject) => {
+            this._proc = spawn(SERVER, [this._variant], {
+                stdio: ['pipe', 'pipe', 'inherit'],
+            })
+
+            this._proc.on('error', reject)
+            this._proc.on('close', (code) => {
+                if (code !== null && code !== 0) {
+                    process.stderr.write(`[chatterbox] server exited with code ${code}\n`)
+                }
+            })
+
+            this._rl = readline.createInterface({ input: this._proc.stdout })
+            this._rl.on('line', (line) => {
+                if (line === 'ready') {
+                    resolve()
+                    return
+                }
+                if (line === 'ok' && this._resolve) {
+                    const res = this._resolve
+                    this._resolve = null
+                    res()
+                }
+            })
+        })
+    }
+
+    async speak(text, opts = {}) {
+        const clean = opts.preprocess === true ? markdown_to_speech(text) : text
+        return this._send(clean, opts)
+    }
+
+    async speak_streaming(text, opts = {}) {
+        const clean     = opts.preprocess !== false ? markdown_to_speech(text) : text
+        const sentences = split_sentences(clean)
+        for (const s of sentences) {
+            await this._send(s, opts)
+        }
+    }
+
+    _send(text, opts = {}) {
+        return new Promise((resolve, reject) => {
+            if (!this._proc) {
+                return reject(new Error('Chatterbox_Tts not initialized — call init() first'))
+            }
+            this._resolve = resolve
+            const { preprocess: _, ...gen_opts } = opts
+            const payload = JSON.stringify({ text, ...gen_opts }) + '\n'
+            this._proc.stdin.write(payload)
+        })
+    }
+
+    stop() {
+        this._rl?.close()
+        this._proc?.kill()
+        this._proc = null
+        this._rl   = null
+    }
+}
--- a/lib/llm.mjs
+++ b/lib/llm.mjs
@@ -0,0 +1,64 @@
+/**
+ * Optional LLM cleanup via Ollama REST API.
+ *
+ * Cleans up raw Whisper output: fixes punctuation, capitalization,
+ * removes filler words, corrects obvious recognition errors.
+ *
+ * Set OLLAMA_MODEL env var to choose the model (default: phi3:mini).
+ * Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434).
+ */
+
+const OLLAMA_URL   = process.env.OLLAMA_URL   || 'http://localhost:11434'
+const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini'
+
+export async function llm_available() {
+	try {
+		const res = await fetch(`${OLLAMA_URL}/api/tags`, {
+			signal: AbortSignal.timeout(2000),
+		})
+		return res.ok
+	} catch {
+		return false
+	}
+}
+
+export async function list_models() {
+	const res  = await fetch(`${OLLAMA_URL}/api/tags`)
+	const data = await res.json()
+	return (data.models ?? []).map(m => m.name)
+}
+
+/**
+ * Clean up a raw STT transcription.
+ * Returns the cleaned string, or the original if the request fails.
+ */
+export async function cleanup(raw_text, model = OLLAMA_MODEL) {
+	const prompt = [
+		'You are a transcription editor. Clean up this speech-to-text output:',
+		'- Fix punctuation and capitalization',
+		'- Remove meaningless filler words (um, uh, like, you know)',
+		'- Correct obvious word recognition errors based on context',
+		'- Keep the meaning and phrasing intact',
+		'- Return ONLY the cleaned text, no explanation, no quotes',
+		'',
+		raw_text,
+	].join('\n')
+
+	try {
+		const res = await fetch(`${OLLAMA_URL}/api/generate`, {
+			method:  'POST',
+			headers: { 'Content-Type': 'application/json' },
+			body: JSON.stringify({
+				model,
+				prompt,
+				stream:  false,
+				options: { temperature: 0.1, num_predict: 256 },
+			}),
+			signal: AbortSignal.timeout(10_000),
+		})
+		const data = await res.json()
+		return data.response?.trim() || raw_text
+	} catch {
+		return raw_text
+	}
+}
--- a/lib/local-query-complete.mjs
+++ b/lib/local-query-complete.mjs
@@ -0,0 +1,43 @@
+export async function is_query_complete(query, model = 'qwen2.5:3b') {
+	const payload = {
+		model,
+		messages: [
+			{
+				role: 'system',
+				content: `
+					You are a voice query completeness classifier. Determine if the input is complete enough to act on.
+
+					Respond only with JSON: {"complete": true} or {"complete": false}
+
+					Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent.
+
+					Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather"
+					Incomplete examples: "How do I", "The thing is", "Can you"
+				`.replace(/^[\t ]+/gm, ''),
+			},
+			{ role: 'user', content: query },
+		],
+		stream: false,
+		format: 'json',
+		options: { num_predict: 10, num_ctx: 2048 },
+	}
+
+	try {
+		const response = await fetch('http://192.168.2.99:11434/api/chat', {
+			method: 'POST',
+			headers: { 'Content-Type': 'application/json' },
+			body: JSON.stringify(payload),
+		})
+
+		if (!response.ok) {
+			// Fail open: if the classifier errors, let the query through
+			return true
+		}
+		const data = await response.json()
+		const parsed = JSON.parse(data.message.content)
+		return parsed.complete === true
+	} catch {
+		// Fail open on network errors or malformed responses as well
+		return true
+	}
+}
--- a/lib/markdown.mjs
+++ b/lib/markdown.mjs
@@ -0,0 +1,66 @@
+/**
+ * Convert markdown to Bark-friendly plain text.
+ *
+ * Bark emphasis conventions:
+ *   UPPERCASE words     — stress/emphasis
+ *   ...                 — hesitation / trailing off
+ *   [laughs] [sighs]    — paralinguistic tokens (pass through as-is)
+ *
+ * Markdown mapping:
+ *   **bold** / __bold__ → UPPERCASE  (strong emphasis)
+ *   *italic* / _italic_ → unchanged  (soft emphasis, Bark handles naturally)
+ *   # Heading           → text + pause via ".\n"
+ *   `code`              → unchanged (spoken literally)
+ *   ```block```         → skipped
+ *   [text](url)         → text only
+ *   ---                 → "..." (pause)
+ *   - item / 1. item    → item text (bullet stripped)
+ */
+
+export function markdown_to_bark(text) {
+    // Remove fenced code blocks entirely
+    text = text.replace(/```[\s\S]*?```/g, '')
+
+    // Inline code — keep the content, drop backticks
+    text = text.replace(/`([^`]+)`/g, '$1')
+
+    // Images: drop (must run before links, or the link rule consumes the alt text)
+    text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '')
+
+    // Links: keep visible text
+    text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')
+
+    // **bold** / __bold__ → UPPERCASE
+    text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase())
+    text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase())
+
+    // *italic* / _italic_ — strip markers, keep text
+    text = text.replace(/\*([^*]+)\*/g, '$1')
+    text = text.replace(/(?<!\w)_([^_]+)_(?!\w)/g, '$1')  // leave snake_case words intact
+
+    // Headings — keep text, add a beat after
+    text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.')
+
+    // Horizontal rules → pause
+    text = text.replace(/^[-*_]{3,}\s*$/gm, '...')
+
+    // List bullets / numbers — strip prefix
+    text = text.replace(/^\s*[-*+]\s+/gm, '')
+    text = text.replace(/^\s*\d+\.\s+/gm, '')
+
+    // Collapse excess blank lines
+    text = text.replace(/\n{3,}/g, '\n\n')
+
+    return text.trim()
+}
+
+/**
+ * Split text into sentences suitable for sequential TTS.
+ * Keeps sentence-ending punctuation with the preceding sentence.
+ */
+export function split_sentences(text) {
+    return text
+        .split(/(?<=[.!?])\s+/)
+        .map(s => s.trim())
+        .filter(s => s.length > 0)
+}
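The sentence splitter above relies on a lookbehind so that the punctuation stays attached to the sentence it ends. A minimal standalone sketch of that behavior follows; the name `split_sentences_sketch` and the sample text are illustrative only and are not part of the commit.

```javascript
// Illustrative copy of the lookbehind split; `split_sentences_sketch`
// is a hypothetical name used for this example only.
function split_sentences_sketch(text) {
	return text
		.split(/(?<=[.!?])\s+/)   // split on whitespace that follows . ! or ?
		.map(s => s.trim())
		.filter(s => s.length > 0)
}

const parts = split_sentences_sketch('Hello there. How are you?  Fine!')
console.log(parts)   // → [ 'Hello there.', 'How are you?', 'Fine!' ]
```

One consequence of this simple rule is that abbreviations such as "Dr. Smith" are also split at the period, which may produce an audible pause in sentence-by-sentence synthesis.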
--- a/lib/stt.mjs
+++ b/lib/stt.mjs
@@ -0,0 +1,170 @@
+/**
+ * Speech-to-Text using sherpa-onnx-node.
+ *
+ * Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so
+ * transcription only runs when a complete utterance has been detected.
+ * Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA).
+ *
+ * Model layout expected under models_dir:
+ *   silero_vad.onnx
+ *   sherpa-onnx-whisper-<name>/
+ *     <name>-encoder.int8.onnx
+ *     <name>-decoder.int8.onnx
+ *     <name>-tokens.txt
+ *
+ * Available Whisper model names (pass as whisper_name):
+ *   tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
+ * Download with: bash download-models.sh [model-name]
+ */
+
+import { spawn, execSync } from 'node:child_process'
+import { createRequire } from 'node:module'
+import * as path from 'node:path'
+
+const require = createRequire(import.meta.url)
+const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
+
+const PRE_ROLL_SAMPLES  = 3200    // 200ms at 16kHz
+const HISTORY_SAMPLES   = 960000  // 60s ring buffer — matches VAD internal size
+
+function find_mic() {
+	const candidates = [
+		['parec',   ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
+		['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
+	]
+	for (const [cmd, args] of candidates) {
+		try {
+			execSync(`which ${cmd}`, { stdio: 'ignore' })
+			return [cmd, args]
+		} catch { /* try next */ }
+	}
+	throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)')
+}
+
+function s16le_to_f32(buf) {
+	const out = new Float32Array(buf.length / 2)
+	for (let i = 0; i < out.length; i++) {
+		out[i] = buf.readInt16LE(i * 2) / 32768.0
+	}
+	return out
+}
+
+export class Stt {
+	constructor({
+		models_dir   = DEFAULT_MODELS_DIR,
+		whisper_name = 'base.en',
+		provider     = 'cuda',
+	} = {}) {
+		this._whisper_dir  = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
+		this._whisper_name = whisper_name
+		this._vad_model    = path.join(models_dir, 'silero_vad.onnx')
+		this._provider     = provider
+		this._recognizer   = null
+		this._vad          = null
+		this._history      = new Float32Array(HISTORY_SAMPLES)
+		this._history_pos  = 0  // total samples fed, monotonically increasing
+	}
+
+	init() {
+		const sherpa = require('sherpa-onnx-node')
+		const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
+
+		this._recognizer = new sherpa.OfflineRecognizer({
+			featConfig: { sampleRate: 16000, featureDim: 80 },
+			modelConfig: {
+				whisper: {
+					encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
+					decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
+				},
+				tokens:     path.join(wdir, `${wname}-tokens.txt`),
+				numThreads: 2,
+				provider:   _provider,
+				debug:      false,
+			},
+		})
+
+		// VAD runs on CPU — silero_vad.onnx is tiny
+		this._vad = new sherpa.Vad({
+			sileroVad: {
+				model:              _vad_model,
+				threshold:          0.5,
+				minSilenceDuration: 0.5,   // seconds of silence to end an utterance
+				minSpeechDuration:  0.1,   // ignore sub-100ms blips
+				windowSize:         512,   // 32ms @ 16kHz
+				maxSpeechDuration:  30.0,
+			},
+			sampleRate: 16000,
+			numThreads: 1,
+			provider:   'cpu',
+			debug:      false,
+		}, 60)  // 60-second internal ring buffer
+	}
+
+	_transcribe(samples) {
+		const stream = this._recognizer.createStream()
+		stream.acceptWaveform({ samples, sampleRate: 16000 })
+		this._recognizer.decode(stream)
+		return this._recognizer.getResult(stream).text.trim()
+	}
+
+	_with_preroll(seg) {
+		const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
+		const pre_len   = seg.start - pre_start
+		if (pre_len === 0) return seg.samples
+		const out = new Float32Array(pre_len + seg.samples.length)
+		for (let i = 0; i < pre_len; i++) {
+			out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
+		}
+		out.set(seg.samples, pre_len)
+		return out
+	}
+
+	/**
+	 * Start listening. Calls on_text(text) for each detected utterance.
+	 * Returns a stop() function.
+	 */
+	listen(on_text, { on_audio } = {}) {
+		const [cmd, args] = find_mic()
+		process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`)
+		const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })
+
+		const VAD_WIN = 512  // must match windowSize above
+		let pending = Buffer.alloc(0)
+
+		mic.stdout.on('data', (chunk) => {
+			if (on_audio && this._vad.isDetected()) on_audio()  // signal activity only during speech
+			pending = Buffer.concat([pending, chunk])
+
+			// Feed complete VAD windows
+			while (pending.length >= VAD_WIN * 2) {
+				const win = pending.subarray(0, VAD_WIN * 2)
+				pending   = pending.subarray(VAD_WIN * 2)
+				const f32 = s16le_to_f32(win)
+				// Write to history ring buffer for pre-roll
+				const base = this._history_pos % HISTORY_SAMPLES
+				for (let i = 0; i < f32.length; i++) {
+					this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
+				}
+				this._history_pos += f32.length
+				this._vad.acceptWaveform(f32)
+			}
+
+			// Drain any complete speech segments
+			while (!this._vad.isEmpty()) {
+				const seg = this._vad.front()
+				this._vad.pop()
+				const text = this._transcribe(this._with_preroll(seg))
+				if (text) on_text(text)
+			}
+		})
+
+		mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`))
+		mic.on('close', code => {
+			if (code !== null && code !== 0) {
+				process.stderr.write(`[stt] mic exited with code ${code}\n`)
+			}
+		})
+
+		return () => mic.kill()
+	}
+}
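The pre-roll logic in `lib/stt.mjs` pairs a fixed-size ring buffer with a monotonically increasing sample counter, so an absolute segment start index can be mapped back to buffer positions. The sketch below shows the idea at a tiny scale; the capacity, the names, and the sample values are all hypothetical, and it assumes the segment start is an absolute sample index, as the file itself does.

```javascript
// Sketch of a pre-roll ring buffer; all names and sizes here are hypothetical.
const CAPACITY = 8   // tiny capacity so wrap-around is easy to see
const PRE_ROLL = 3   // samples to prepend before a detected segment

const history = new Float32Array(CAPACITY)
let total_written = 0   // monotonically increasing sample counter

function write_samples(samples) {
	for (let i = 0; i < samples.length; i++) {
		history[(total_written + i) % CAPACITY] = samples[i]
	}
	total_written += samples.length
}

// `start` is the absolute index of the segment's first sample.
function with_preroll(start, segment) {
	const pre_start = Math.max(0, start - PRE_ROLL)
	const pre_len   = start - pre_start
	const out = new Float32Array(pre_len + segment.length)
	for (let i = 0; i < pre_len; i++) {
		out[i] = history[(pre_start + i) % CAPACITY]
	}
	out.set(segment, pre_len)
	return out
}

// Ten samples into an eight-slot buffer: slots 0 and 1 are overwritten.
write_samples(Float32Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
const seg = with_preroll(7, Float32Array.from([7, 8, 9]))
console.log(Array.from(seg))   // → [ 4, 5, 6, 7, 8, 9 ]
```

The prepended samples are only valid while they have not yet been overwritten, that is, while the distance from `pre_start` to the write position stays within the buffer capacity.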
--- a/lib/tts-client.mjs
+++ b/lib/tts-client.mjs
@@ -0,0 +1,51 @@
+/**
+ * TTS HTTP client — speaks text via a running tts-server.mjs instance.
+ *
+ * Usage:
+ *   const tts = new Tts_Client()
+ *   await tts.speak('Hello world.')
+ *   await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
+ *   const { voices, current } = await tts.list_voices()
+ *   await tts.set_voice('rommie')
+ *
+ * The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500.
+ */
+
+const DEFAULT_URL = process.env.TTS_URL ?? 'http://192.168.2.99:11500'
+
+export class Tts_Client {
+	constructor({ url = DEFAULT_URL } = {}) {
+		this._url = url
+	}
+
+	async speak(text, opts = {}) {
+		const res = await fetch(`${this._url}/speak`, {
+			method:  'POST',
+			headers: { 'Content-Type': 'application/json' },
+			body:    JSON.stringify({ text, ...opts }),
+		})
+		if (!res.ok) {
+			const data = await res.json().catch(() => ({}))
+			throw new Error(data.error ?? `HTTP ${res.status}`)
+		}
+	}
+
+	async list_voices() {
+		const res = await fetch(`${this._url}/voices`)
+		if (!res.ok) throw new Error(`HTTP ${res.status}`)
+		return res.json()   // { voices: [{name, description, active}], current }
+	}
+
+	async set_voice(name) {
+		const res = await fetch(`${this._url}/voice`, {
+			method:  'POST',
+			headers: { 'Content-Type': 'application/json' },
+			body:    JSON.stringify({ name }),
+		})
+		if (!res.ok) {
+			const data = await res.json().catch(() => ({}))
+			throw new Error(data.error ?? `HTTP ${res.status}`)
+		}
+		return res.json()   // { ok, name, path }
+	}
+}
--- a/lib/tts.mjs
+++ b/lib/tts.mjs
@@ -0,0 +1,107 @@
+/**
+ * Text-to-Speech using sherpa-onnx-node with the Kokoro model.
+ *
+ * Model layout expected under models_dir:
+ *   kokoro-en-v0_19/
+ *     model.onnx
+ *     voices.bin
+ *     tokens.txt
+ *     lexicon-us-en.txt
+ *     espeak-ng-data/
+ *
+ * Audio output goes to pacat (PulseAudio) as float32le @ 24kHz.
+ * speak_streaming() splits on sentence boundaries so the first sentence
+ * plays while the rest are still being synthesized.
+ */
+
+import { spawn } from 'node:child_process'
+import { createRequire } from 'node:module'
+import * as path from 'node:path'
+
+const require = createRequire(import.meta.url)
+const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
+
+// Speaker IDs in kokoro-en-v0_19
+export const VOICES = {
+	af_heart:   0,
+	af_bella:   1,
+	af_nicole:  2,
+	af_sarah:   3,
+	af_sky:     4,
+	am_adam:    5,
+	am_michael: 6,
+}
+
+export class Tts {
+	constructor({
+		models_dir = DEFAULT_MODELS_DIR,
+		voice      = 'af_heart',
+		speed      = 1.0,
+		provider   = 'cpu',   // Kokoro synthesis is fast enough on CPU
+	} = {}) {
+		this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19')
+		this._voice      = voice
+		this._speed      = speed
+		this._provider   = provider
+		this._tts        = null
+	}
+
+	init() {
+		const sherpa = require('sherpa-onnx-node')
+		const { _kokoro_dir: dir, _provider } = this
+
+		this._tts = new sherpa.OfflineTts({
+			model: {
+				kokoro: {
+					model:   path.join(dir, 'model.onnx'),
+					voices:  path.join(dir, 'voices.bin'),
+					tokens:  path.join(dir, 'tokens.txt'),
+					dataDir: path.join(dir, 'espeak-ng-data'),
+				},
+			},
+			maxNumSentences: 2,
+			numThreads:      2,
+			debug:           false,
+			provider:        _provider,
+		})
+	}
+
+	/** Synthesize and play a single piece of text. Resolves when playback ends. */
+	async speak(text) {
+		const sid   = VOICES[this._voice] ?? 0
+		const audio = this._tts.generate({ text, sid, speed: this._speed })
+		await this._play(audio.samples, audio.sampleRate)
+	}
+
+	/**
+	 * Split text on sentence boundaries and synthesize + play each sentence
+	 * in sequence. Lower latency than waiting for full synthesis first.
+	 */
+	async speak_streaming(text) {
+		for (const sentence of split_sentences(text)) {
+			if (sentence.trim()) {
+				await this.speak(sentence.trim())
+			}
+		}
+	}
+
+	_play(samples, sample_rate) {
+		return new Promise((resolve, reject) => {
+			const proc = spawn('pacat', [
+				'--format=float32le',
+				`--rate=${sample_rate}`,
+				'--channels=1',
+			], { stdio: ['pipe', 'ignore', 'inherit'] })
+
+			proc.on('error', reject)
+			proc.on('close', resolve)
+			proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))
+			proc.stdin.end()
+		})
+	}
+}
+
+function split_sentences(text) {
+	// Split after .  !  ? — keep the punctuation with the preceding sentence
+	return text.split(/(?<=[.!?])\s+/)
+}
--- a/listen.mjs
+++ b/listen.mjs
@@ -0,0 +1,22 @@
+/**
+ * Listen with Whisper and print transcribed text to stdout.
+ *
+ * Usage:
+ *   node listen.mjs
+ *   node listen.mjs small.en
+ *   node listen.mjs large-v3-turbo
+ */
+
+import { Stt } from './lib/stt.mjs'
+
+const whisper_name = process.argv[2] ?? 'base.en'
+
+const stt = new Stt({ whisper_name })
+stt.init()
+
+process.stderr.write(`[listen] whisper model: ${whisper_name}\n`)
+process.stderr.write(`[listen] listening — Ctrl+C to stop\n`)
+
+stt.listen((text) => {
+	process.stdout.write(text + '\n')
+})
--- a/package.json
+++ b/package.json
@@ -0,0 +1,13 @@
+{
+	"name": "voice-experiment",
+	"version": "0.1.0",
+	"type": "module",
+	"scripts": {
+		"download": "bash download-models.sh",
+		"setup-venv": "bash setup-venv.sh"
+	},
+	"dependencies": {
+		"sherpa-onnx-node": "^1.12.14",
+		"js-yaml": "^4.1.0"
+	}
+}
--- a/query-demo.mjs
+++ b/query-demo.mjs
@@ -0,0 +1,123 @@
+/**
+ * Voice query demo: accumulates STT utterances into a query, checks
+ * completeness with a local LLM classifier, then dispatches to Claude Code.
+ *
+ * Usage:
+ *   node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--silence-timeout ms] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs]
+ *
+ * If --window-id and --claude-remote are supplied, completed queries are
+ * dispatched to Claude Code. Otherwise they are only logged to stderr.
+ *
+ * A query is considered complete when any of these holds:
+ *   - The classifier judges the latest fragment complete
+ *   - The latest fragment consists solely of a send-word (go/done/send)
+ *   - No new fragment arrives within SILENCE_TIMEOUT ms
+ */
+
+import { execFileSync }    from 'node:child_process'
+import { Stt }              from './lib/stt.mjs'
+import { Tts_Client }       from './lib/tts-client.mjs'
+import { is_query_complete } from './lib/local-query-complete.mjs'
+
+function get_arg(name) {
+	const i = process.argv.indexOf(name)
+	return i !== -1 ? process.argv[i + 1] : null
+}
+
+const audio_prompt   = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav'
+const whisper_name   = get_arg('--whisper')      ?? 'base.en'
+const window_id      = get_arg('--window-id')
+const claude_remote  = get_arg('--claude-remote')
+
+const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000', 10)
+
+const tts = new Tts_Client()
+const stt = new Stt({ whisper_name })
+
+stt.init()
+process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`)
+process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`)
+if (window_id && claude_remote) {
+	process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`)
+} else {
+	process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`)
+}
+await tts.speak('Ready for input.', { audio_prompt })
+
+function make_prompt(query) {
+	return [
+		'[voice-buddy] You have received a voice query.',
+		'',
+		'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.',
+		'If the task is a question, speak the answer directly.',
+		'If the task is an action, carry it out and then speak a short confirmation.',
+		'',
+		'--- Query ---',
+		query,
+		'--- End of query ---',
+	].join('\n')
+}
+
+function dispatch(query) {
+	execFileSync('node', [claude_remote, '--window-id', window_id], {
+		input:    make_prompt(query),
+		encoding: 'utf8',
+		stdio:    ['pipe', 'inherit', 'inherit'],
+	})
+}
+
+async function submit(query) {
+	query = query.trim()
+	if (!query) {
+		return
+	}
+	process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`)
+	if (window_id && claude_remote) {
+		dispatch(query)
+	}
+	await tts.speak('Efforting.', { audio_prompt })
+}
+
+const SEND_WORDS = new Set(['go', 'done', 'send'])
+
+let accumulated    = ''
+let silence_timer  = null
+
+function reset_silence_timer() {
+	clearTimeout(silence_timer)
+	silence_timer = setTimeout(async () => {
+		if (!accumulated) {
+			return
+		}
+		process.stderr.write('[query-demo] silence timeout — submitting\n')
+		const query = accumulated
+		accumulated = ''
+		await submit(query)
+	}, SILENCE_TIMEOUT)
+}
+
+stt.listen(async (text) => {
+	accumulated = accumulated ? `${accumulated}\n${text}` : text
+	process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`)
+
+	reset_silence_timer()
+
+	const last_line = accumulated.split('\n').at(-1)
+	const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '')
+	const forced    = SEND_WORDS.has(last_norm)
+
+	if (!forced) {
+		const complete = await is_query_complete(last_line)
+		if (!complete) {
+			return
+		}
+	}
+
+	clearTimeout(silence_timer)
+
+	const lines = accumulated.split('\n')
+	const query = (forced ? lines.slice(0, -1) : lines).join('\n')
+	accumulated = ''
+
+	await submit(query)
+}, { on_audio: () => { if (accumulated) reset_silence_timer() } })
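The send-word check in `query-demo.mjs` normalizes the whole latest fragment before looking it up, so a fragment only forces submission when it contains nothing but the send-word. A small standalone sketch of that normalization follows; `is_send_fragment` and `SEND_WORDS_SKETCH` are hypothetical names introduced for illustration.

```javascript
// Sketch of the send-word test; names are hypothetical.
const SEND_WORDS_SKETCH = new Set(['go', 'done', 'send'])

function is_send_fragment(fragment) {
	// Lowercase, then strip everything that is not a-z (spaces included).
	const norm = fragment.toLowerCase().replace(/[^a-z]/g, '')
	return SEND_WORDS_SKETCH.has(norm)
}

console.log(is_send_fragment('Go.'))        // → true
console.log(is_send_fragment(' Done! '))    // → true
console.log(is_send_fragment('Please go'))  // → false ("pleasego" is not a send-word)
```

Because spaces are stripped along with punctuation, a trailing send-word inside a longer utterance does not trigger submission; the speaker has to pause so that the send-word arrives as its own fragment.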
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,11 @@
+# STT
+faster-whisper>=1.1.0
+
+# TTS (already used in kokoro-tts-test)
+kokoro>=0.9.4
+
+# Audio capture
+sounddevice>=0.4.6
+
+# Numpy (usually already present)
+numpy>=1.24.0
--- a/setup-venv.sh
+++ b/setup-venv.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Creates a venv in ./venv and installs Python deps for Bark and Chatterbox.
+# Safe to re-run: venv creation is skipped if ./venv exists; pip re-checks packages.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+VENV="${SCRIPT_DIR}/venv"
+
+if [ ! -d "${VENV}" ]; then
+	echo "==> creating venv at ${VENV}"
+	python3 -m venv "${VENV}"
+fi
+
+echo "==> installing Python dependencies"
+"${VENV}/bin/pip" install --upgrade pip --quiet
+"${VENV}/bin/pip" install transformers accelerate torch numpy
+"${VENV}/bin/pip" install chatterbox-tts
+
+chmod +x "${SCRIPT_DIR}/bark-server.py"
+chmod +x "${SCRIPT_DIR}/chatterbox-server.py"
+
+echo ""
+echo "==> venv ready at ${VENV}"
--- a/speak-as.mjs
+++ b/speak-as.mjs
@@ -0,0 +1,47 @@
+/**
+ * Read text from stdin and speak it using a voice reference clip.
+ *
+ * Usage:
+ *   echo "Hello world" | node speak-as.mjs /path/to/voice.wav
+ *   cat script.txt    | node speak-as.mjs /path/to/voice.wav
+ *   node speak-as.mjs /path/to/voice.wav   (then type, Ctrl+D when done)
+ */
+
+import * as fs       from 'node:fs'
+import * as readline from 'node:readline'
+import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
+
+const audio_prompt = process.argv[2]
+
+if (!audio_prompt) {
+	process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n')
+	process.exit(1)
+}
+
+if (!fs.existsSync(audio_prompt)) {
+	process.stderr.write(`File not found: ${audio_prompt}\n`)
+	process.exit(1)
+}
+
+const tts = new Chatterbox_Tts()
+await tts.init()
+
+const rl = readline.createInterface({
+	input:    process.stdin,
+	output:   process.stdout,
+	terminal: process.stdin.isTTY,
+})
+
+const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()
+let last_norm = ''
+
+for await (const line of rl) {
+	const text = line.trim()
+	if (!text) continue
+	const n = norm(text)
+	if (n === last_norm) continue
+	last_norm = n
+	await tts.speak(text, { audio_prompt })
+}
+
+tts.stop()
--- a/tts-server.mjs
+++ b/tts-server.mjs
@@ -0,0 +1,133 @@
+/**
+ * TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API.
+ * Requests are serialized so generation and playback stay in order.
+ *
+ * Usage:
+ *   node tts-server.mjs
+ *   TTS_PORT=11500 node tts-server.mjs
+ *
+ * API:
+ *   POST /speak   { "text": "...", "audio_prompt": "/path/to/voice.wav", ... }
+ *                 → 200 { "ok": true }  (after generation, playback continues in background)
+ *   GET  /voices  → 200 { "voices": [{ "name", "description", "active" }, ...], "current": "rommie" | null }
+ *   POST /voice   { "name": "rommie" }
+ *                 → 200 { "ok": true, "name": "rommie", "path": "..." }
+ *   GET  /health  → 200 { "ok": true }
+ */
+
+import * as http from 'node:http'
+import * as fs   from 'node:fs'
+import * as path from 'node:path'
+import yaml from 'js-yaml'
+import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
+
+const PORT        = parseInt(process.env.TTS_PORT ?? '11500', 10)
+const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml')
+
+function reload_voices() {
+	try {
+		const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8'))
+		return doc?.voices ?? {}
+	} catch {
+		return {}
+	}
+}
+
+let voices = reload_voices()
+let current_voice = null  // name of active voice, or null
+
+// --- TTS setup ---
+const tts = new Chatterbox_Tts()
+process.stderr.write('[tts-server] starting chatterbox...\n')
+await tts.init()
+process.stderr.write('[tts-server] chatterbox ready\n')
+
+// Serialize all speak requests through a promise chain
+let queue = Promise.resolve()
+
+function enqueue(fn) {
+	const result = queue.then(fn)
+	// Don't let a failed request poison the queue
+	queue = result.catch(() => {})
+	return result
+}
+
+function read_body(req) {
+	return new Promise((resolve, reject) => {
+		let buf = ''
+		req.on('data', chunk => { buf += chunk })
+		req.on('end', () => {
+			try { resolve(JSON.parse(buf)) } catch (e) { reject(e) }
+		})
+		req.on('error', reject)
+	})
+}
+
+function send(res, status, body) {
+	const payload = JSON.stringify(body)
+	res.writeHead(status, { 'Content-Type': 'application/json' })
+	res.end(payload)
+}
+
+const server = http.createServer(async (req, res) => {
+	if (req.method === 'GET' && req.url === '/health') {
+		return send(res, 200, { ok: true })
+	}
+
+	if (req.method === 'GET' && req.url === '/voices') {
+		voices = reload_voices()
+		const list = Object.entries(voices).map(([name, v]) => ({
+			name,
+			description: v.description ?? '',
+			active: name === current_voice,
+		}))
+		return send(res, 200, { voices: list, current: current_voice })
+	}
+
+	if (req.method === 'POST' && req.url === '/voice') {
+		let body
+		try { body = await read_body(req) } catch {
+			return send(res, 400, { error: 'invalid JSON' })
+		}
+		const { name } = body
+		if (!name) return send(res, 400, { error: 'name required' })
+		voices = reload_voices()
+		if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` })
+		current_voice = name
+		process.stderr.write(`[tts-server] voice switched to: ${name}\n`)
+		return send(res, 200, { ok: true, name, path: voices[name].path })
+	}
+
+	if (req.method === 'POST' && req.url === '/speak') {
+		let body
+		try {
+			body = await read_body(req)
+		} catch {
+			return send(res, 400, { error: 'invalid JSON' })
+		}
+
+		const { text, ...opts } = body
+		if (!text) {
+			return send(res, 400, { error: 'text required' })
+		}
+
+		// Inject current voice as default audio_prompt if none provided
+		if (!opts.audio_prompt && current_voice && voices[current_voice]) {
+			opts.audio_prompt = voices[current_voice].path
+		}
+
+		try {
+			await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
+			send(res, 200, { ok: true })
+		} catch (err) {
+			send(res, 500, { error: err.message })
+		}
+		return
+	}
+
+	send(res, 404, { error: 'not found' })
+})
+
+server.listen(PORT, () => {
+	process.stderr.write(`[tts-server] listening on port ${PORT}\n`)
+})
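The server serializes `/speak` requests by chaining each job onto the previous one and swallowing failures on the chain itself, so one failed request does not block later ones. The sketch below reproduces that pattern in isolation; `enqueue_sketch`, `sleep`, and the three sample jobs are hypothetical and only stand in for real TTS calls.

```javascript
// Sketch of promise-chain serialization; names are hypothetical.
let chain = Promise.resolve()

function enqueue_sketch(fn) {
	const result = chain.then(fn)
	chain = result.catch(() => {})   // keep the chain alive after a failure
	return result                    // the caller still sees the real outcome
}

const order = []
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms))

// The slow job is queued first, so it must finish before the others start.
const done = Promise.allSettled([
	enqueue_sketch(async () => { await sleep(30); order.push('slow') }),
	enqueue_sketch(async () => { throw new Error('boom') }),
	enqueue_sketch(async () => { order.push('fast') }),
]).then((results) => {
	console.log(order)                        // → [ 'slow', 'fast' ]
	console.log(results.map(r => r.status))   // → [ 'fulfilled', 'rejected', 'fulfilled' ]
	return results
})
```

Returning `result` rather than the swallowed chain is what lets the HTTP handler answer a failed request with a 500 while the queue keeps processing.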
--- a/voice-buddy.mjs
+++ b/voice-buddy.mjs
@@ -0,0 +1,65 @@
+/**
+ * Voice buddy — reads voice queries from stdin (JSON lines) and dispatches
+ * them to a Claude Code instance via claude-remote.
+ *
+ * Usage:
+ *   node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs
+ *
+ * Options:
+ *   --window-id <id>       X11 window ID of Claude Code (decimal or 0x hex)
+ *   --claude-remote <path> Path to claude-remote.mjs
+ */
+
+import * as readline   from 'node:readline'
+import { execFileSync } from 'node:child_process'
+
+const args = process.argv.slice(2)
+
+function get_arg(name) {
+	const i = args.indexOf(name)
+	if (i === -1 || i + 1 >= args.length) {
+		process.stderr.write(`Missing argument: ${name}\n`)
+		process.exit(1)
+	}
+	return args[i + 1]
+}
+
+const window_id     = get_arg('--window-id')
+const claude_remote = get_arg('--claude-remote')
+
+function make_prompt(query) {
+	return [
+		'[voice-buddy] You have received a voice query.',
+		'',
+		'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.',
+		'',
+		'--- Query ---',
+		query,
+		'--- End of query ---',
+	].join('\n')
+}
+
+function dispatch(query) {
+	const prompt = make_prompt(query)
+	process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`)
+	execFileSync('node', [claude_remote, '--window-id', window_id], {
+		input:    prompt,
+		encoding: 'utf8',
+		stdio:    ['pipe', 'inherit', 'inherit'],
+	})
+}
+
+const rl = readline.createInterface({ input: process.stdin })
+
+for await (const line of rl) {
+	const trimmed = line.trim()
+	if (!trimmed) {
+		continue
+	}
+	try {
+		const query = JSON.parse(trimmed)
+		dispatch(query)
+	} catch (err) {
+		process.stderr.write(`[voice-buddy] ${err instanceof SyntaxError ? 'bad JSON line' : 'dispatch failed'}: ${trimmed}\n`)
+	}
+}
--- a/voices.yaml
+++ b/voices.yaml
@@ -0,0 +1,54 @@
+# Named voices for TTS server.
+# Switch via: POST /voice {"name": "rommie"}
+# List via:   GET  /voices
+
+voices:
+  rommie:
+    path: /home/devilholk/Documents/rommie-sample.wav
+    description: Rommie - Andromeda ship AI / android
+
+  archangel:
+    path: /home/devilholk/Documents/archangel.ogg
+    description: Archangel from Airwolf
+
+  marella:
+    path: /home/devilholk/Documents/murella.ogg
+    description: Marella from Airwolf
+
+  chuin:
+    path: /home/devilholk/Documents/chuin.ogg
+    description: Chiun from Remo Williams - The Adventure Begins
+
+  harold-w-smith:
+    path: /home/devilholk/Documents/cure-leader.ogg
+    description: Harold W. Smith from Remo Williams - The Adventure Begins
+
+  penny-parker:
+    path: /home/devilholk/Documents/penny.ogg
+    description: Penny from MacGyver
+
+  santini:
+    path: /home/devilholk/Documents/santini.ogg
+    description: Santini from Airwolf
+
+  gosalyn:
+    path: /home/devilholk/Documents/gasalin.ogg
+    description: Gosalyn from Darkwing Duck
+
+  darkwing-duck:
+    path: /home/devilholk/Documents/dwd.ogg
+    description: Darkwing Duck
+
+  belitski:
+    path: /home/devilholk/Documents/slipstream2.ogg
+    description: Belitski from Slipstream
+
+  ivanova:
+    path: /home/devilholk/Documents/ivanova.ogg
+    description: Susan Ivanova from Babylon 5
+
+  sheila:
+    path: /home/devilholk/Documents/darkness-woman.ogg
+    description: Sheila from Army of Darkness
+
+# Add more voices as clips become available