Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00
commit db8889aeed
35 changed files with 2782 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,11 @@
 node_modules/
 models/
 venv/
 *.ogg
 *.wav
 *.mp3
 *.onnx
 *.bin
 package-lock.json
 __pycache__/
 *.pyc
--- a/CLEANUP-PLAN.md
+++ b/CLEANUP-PLAN.md
@@ -0,0 +1,159 @@
 # Voice Project — Cleanup Plan
 > [!NOTE]
 > **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.
 The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.
 ---
 ## Current state
 ### Active pipeline (keep)
 | File | Role |
 |------|------|
 | `query-demo.mjs` | Main entry point — STT → classifier → dispatch |
 | `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
 | `voices.yaml` | Named voice config |
 | `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
 | `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
 | `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
 | `lib/local-query-complete.mjs` | Ollama classifier |
 | `chatterbox-server.py` | Long-running Python TTS process |
 | `listen.mjs` | Standalone mic → stdout tool |
 | `speak-as.mjs` | Standalone stdin → TTS tool |
 | `demos/` | Multi-voice demo scripts |
 | `download-models.sh` | Whisper + VAD model download |
 | `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
 | `models/` | Downloaded STT models |
 | `venv/` | Python venv (not committed) |
 ### Dead / experimental (remove or archive)
 | File | Reason |
 |------|--------|
 | `acting-demo-bark.mjs` | Bark TTS — abandoned |
 | `acting-demo-chatterbox.mjs` | One-off demo, superseded |
 | `acting-demo.mjs` | Old demo |
 | `bark-server.py` | Bark backend — abandoned |
 | `demo-bark.mjs` | Bark demo |
 | `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
 | `lib/bark-tts.mjs` | Bark wrapper |
 | `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
 | `lib/markdown.mjs` | Unused? Verify before deleting |
 | `lib/llm.mjs` | Unused? Verify before deleting |
 | `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
 | `voice-buddy.mjs` | Merged into query-demo.mjs — delete |
 ---
 ## What needs to be documented / fixed
 ### 1. `setup-venv.sh` — rewrite
 Current script installs Bark deps. The active backend is Chatterbox. New script should:
 - Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
 - HuggingFace token instruction: write token to `~/.secrets/hugging-face.token`
 - Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached
 ### 2. `requirements.txt` — replace or delete
 Currently lists kokoro and faster-whisper (unused). Either:
 - Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
 - Or just delete it — `setup-venv.sh` is the source of truth
 ### 3. sherpa-onnx / CUDA build — document clearly
 The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.
 ### 4. System dependencies — list them
 Nothing currently lists what needs to be installed at the OS level:
 - `parec` / `pacat` (PulseAudio) — audio I/O
 - `cmake` — for optional CUDA sherpa-onnx rebuild
 - `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
 - `ollama` running locally with `qwen2.5:3b` pulled — for classifier
 - `jq` — used in shell functions and demo scripts
 - Node.js (version? check what's required)
 - Python 3 with venv support
 ### 5. External services — document
 - **TTS server**: runs on the host (not in container), port 11500
 - **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
 - **TTS_URL** env var: set in `~/.bashrc` to point at TTS server
 ### 6. `voices.yaml` — keep and document
 Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).
 ---
 ## Proposed new top-level structure
 ```
 voice/
  README.md          ← start here: what this is, quick start
  SETUP.md           ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md           ← architecture deep-dives (keep as-is)
  TODO.md            ← keep as-is
  VOICES.md          ← wishlist (keep as-is)
  voices.yaml        ← runtime voice config
  query-demo.mjs     ← main entry point
  tts-server.mjs     ← TTS HTTP server
  listen.mjs         ← standalone STT tool
  speak-as.mjs       ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh      ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/            ← gitignored
  venv/              ← gitignored
 ```
 ---
 ## Cleanup steps (in order)
 1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
 2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
 3. **Rewrite** `setup-venv.sh` for Chatterbox only
 4. **Replace** `requirements.txt` with accurate one or delete
 5. **Write** `SETUP.md` covering all install steps end-to-end
 6. **Write** `README.md` with quick start
 7. **Add** voice cloning tips to top of `voices.yaml`
 8. **Commit** the whole cleanup as a single commit
 ---
 ## Open questions
 - Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
 - Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?
 ---
 ## New architecture (revised plan)
 Keep this project as a reference. Build clean standalone repos instead:
 | Repo | Contents | Depends on |
 |------|----------|-----------|
 | `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
 | `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
 | `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |
 Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD).
 ### Open questions for new architecture
 - **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
 - **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
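A minimal sketch of the env-var lookup described in the model storage note, assuming a shared `MODELS_DIR` plus an optional per-model override. The function name `resolve_model_path` and the `<NAME>_MODEL_DIR` naming scheme are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: locate a model via environment variables instead of
// a project-local models/ directory. MODELS_DIR is taken from the note
// above; the per-model variable naming is an assumption of this sketch.
function resolve_model_path(model_name, env = process.env) {
	// A per-model override wins, e.g. "silero-vad" -> SILERO_VAD_MODEL_DIR.
	const override_key = model_name.toUpperCase().replace(/[^A-Z0-9]+/g, '_') + '_MODEL_DIR'
	if (env[override_key]) return env[override_key]

	if (!env.MODELS_DIR) {
		throw new Error(`Set MODELS_DIR or ${override_key} to locate "${model_name}"`)
	}
	// Strip trailing slashes so the join below never doubles them.
	return `${env.MODELS_DIR.replace(/\/+$/, '')}/${model_name}`
}
```

Failing loudly when neither variable is set keeps a repo from silently downloading models into the project directory.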
--- a/LLM-ROUTING.md
+++ b/LLM-ROUTING.md
@@ -0,0 +1,51 @@
 # LLM Routing Strategy
 When to use a local model vs an Anthropic frontier model in the voice pipeline.
 ---
 ## Current model inventory
 | Role | Model | Where | VRAM |
 |------|-------|-------|------|
 | STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
 | TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
 | Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
 | Main reasoning | Claude (Anthropic) | API / Claude Code | — |
 Total local VRAM: ~6–7 GB on RTX 3090 (24 GB). Plenty of headroom.
 ---
 ## Routing principles
 ### Use local model when:
 - Task is **classification or routing** (completeness, intent detection, command vs question)
 - Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
 - Task is **high-frequency** (runs on every utterance — billing would add up)
 - Privacy matters — query content shouldn't leave the machine
 ### Use Anthropic frontier model when:
 - Task requires **reasoning, synthesis, or judgment**
 - Task involves **code generation or editing**
 - Task involves **world knowledge** beyond what a small model handles well
 - Query is **low-frequency** (once per user request, not once per audio frame)
 - Quality matters more than latency
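The two lists above could be encoded as a small pure function. This is only a sketch: the task fields (`kind`, `per_utterance`, `private`, `needs_world_knowledge`), the order in which the rules are checked, and the fallback to the frontier model are all assumptions, not part of the current pipeline.

```javascript
// Sketch of the routing principles as a pure function.
// All field names and the rule ordering are hypothetical.
const LOCAL_KINDS = new Set(['classification', 'routing'])
const FRONTIER_KINDS = new Set(['reasoning', 'synthesis', 'code'])

function choose_backend(task) {
	// Privacy is treated as a hard constraint: the query stays on the machine.
	if (task.private) return 'local'
	// High-frequency work (runs on every utterance) stays local to limit billing.
	if (task.per_utterance) return 'local'
	if (LOCAL_KINDS.has(task.kind)) return 'local'
	if (FRONTIER_KINDS.has(task.kind) || task.needs_world_knowledge) return 'frontier'
	// Assumed default for unclassified tasks: prefer quality over latency.
	return 'frontier'
}
```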
 ---
 ## GPU split (future option)
 If voice models and reasoning models compete for VRAM, split across cards:
 - **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
 - **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API
 Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.
 ---
 ## Upgrade candidates
 - **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
 - **Completeness classifier → larger model** — Qwen 7B or 14B would likely improve sentence boundary detection and instruction/question classification.
 - **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
--- a/NOTES.md
+++ b/NOTES.md
@@ -0,0 +1,161 @@
 # Voice Experiment Notes
 Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.
 ---
 ## Architecture
 ```
 parec (PulseAudio)
  └─ lib/stt.mjs          Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so   (needs CUDA build — see below)
                 └─ libonnxruntime.so     (system, with CUDA)
 text → lib/chatterbox-tts.mjs
         └─ chatterbox-server.py         long-running Python TTS server
              └─ chatterbox-tts (pip)    ResembleAI Chatterbox model
 ```
 ---
 ## STT
 **Library:** `sherpa-onnx-node` npm package
 **Models:** Whisper (offline batch) + Silero VAD
 **Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`
 ### Model download
 ```bash
 bash download-models.sh base.en   # or: tiny.en small.en medium.en large-v3 large-v3-turbo
 ```
 Models land in `models/`:
 - `silero_vad.onnx`
 - `sherpa-onnx-whisper-base.en/`
 ### How it works
 `lib/stt.mjs` feeds 512-sample (32ms) windows into Silero VAD. When VAD signals speech end (after 500ms silence), the segment is sent to Whisper for transcription. The transcription is written to stdout — **note: the native addon writes directly to fd 1 (bypassing Node.js process.stdout), not via the JavaScript callback**. The JS callback is also called and can be used for additional processing.
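The windowing step can be sketched as follows, assuming the input is the raw s16le mono stream from `parec`. The helper name `pcm_windows` is hypothetical, and the hand-off of each window to the VAD is deliberately left out because it depends on the sherpa-onnx-node API.

```javascript
// Split raw s16le mono PCM into fixed-size float32 windows.
const SAMPLE_RATE = 16000
const WINDOW_SAMPLES = 512
const WINDOW_MS = (WINDOW_SAMPLES * 1000) / SAMPLE_RATE  // 32 ms per window

function* pcm_windows(buffer, window_samples = WINDOW_SAMPLES) {
	const total_samples = Math.floor(buffer.length / 2)  // 2 bytes per s16 sample
	for (let start = 0; start + window_samples <= total_samples; start += window_samples) {
		const window = new Float32Array(window_samples)
		for (let i = 0; i < window_samples; i++) {
			// Scale signed 16-bit integers to the [-1, 1) float range.
			window[i] = buffer.readInt16LE((start + i) * 2) / 32768
		}
		yield window
	}
	// Trailing samples that do not fill a window are dropped here; a real
	// capture loop would carry them over to the next chunk.
}
```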
 ### Getting CUDA
 The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:
 ```bash
 # Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
 git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx
 export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
 export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib
 cd ~/Foreign/sherpa-onnx
 cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .
 cmake --build build --target sherpa-onnx-c-api
 # Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
 NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
 ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so "$NMOD/libsherpa-onnx-c-api.so"
 rm -f "$NMOD/libonnxruntime.so"
 ```
 The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.
 ---
 ## TTS
 **Library:** `chatterbox-tts` Python package (ResembleAI)
 **Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
 **Playback:** `pacat --format=float32le --rate=24000 --channels=1`
 ### Setup
 ```bash
 bash setup-venv.sh   # creates ./venv, installs chatterbox-tts + deps
 ```
 HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.
 ### Variants
 | Variant | RTF (RTX 3090) | Notes |
 |---------|---------------|-------|
 | turbo   | ~0.2          | default, female voice, no exaggeration param |
 | full    | slower        | male voice, supports `exaggeration` (0.0–1.0) |
 ### Paralinguistic tags
 Work in both variants:
 ```
 [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
 ```
 ### Voice cloning
 Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
 ```javascript
 await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
 ```
 The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.
 ### Server protocol
 `chatterbox-server.py` is a long-running process. Communication:
 - **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
 - **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
 - **stderr:** status/timing logs
 Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
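A client for this protocol needs two pieces: a one-line JSON encoder for stdin and a line-aware reader for the `ok` acknowledgements on stdout. The sketch below shows both; the names `encode_request` and `make_ack_counter` are hypothetical and are not taken from `lib/chatterbox-tts.mjs`.

```javascript
// Encode one request as a single JSON line for the server's stdin.
function encode_request(text, options = {}) {
	// JSON.stringify escapes newlines inside strings, so the payload is
	// always exactly one line.
	return JSON.stringify({ text, ...options }) + '\n'
}

// Count "ok" acknowledgements in stdout chunks, which may split or merge
// lines arbitrarily.
function make_ack_counter() {
	let pending = ''
	let acks = 0
	return function feed(chunk) {
		pending += chunk
		const lines = pending.split('\n')
		pending = lines.pop()  // keep the incomplete tail for the next chunk
		for (const line of lines) if (line.trim() === 'ok') acks++
		return acks
	}
}
```

Because `ok` signals generation rather than playback, a client can write the next request as soon as the acknowledgement count increases.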
 ### Acting tips
 - Emphasis: `UPPERCASE` words
 - Pause/rhythm: `...` `—`
 - Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
 - Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
 - Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)
 ---
 ## Tools
 | Script | Usage |
 |--------|-------|
 | `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
 | `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
 | `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |
 ### Pipe example
 ```bash
 node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
 ```
 `speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.
 ---
 ## Known Quirks
 **librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
 ```python
 _orig = _librosa.resample
 _librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
 ```
 **HuggingFace offline:** `from_pretrained()` makes network calls even when the model is already cached. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.
 **VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
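The deduplication could look roughly like this. It is a sketch of the idea, not the code in `speak-as.mjs`: the function names are hypothetical, and it only catches lines that are identical after normalization, not merely similar ones.

```javascript
// Normalize a transcript line for comparison: lowercase, strip punctuation,
// collapse whitespace.
function normalize_line(line) {
	return line
		.toLowerCase()
		.replace(/[^\p{L}\p{N}\s]/gu, '')
		.replace(/\s+/g, ' ')
		.trim()
}

// Returns a filter that rejects a line when it normalizes to the same text
// as the previously accepted line. Skipping empty lines is an assumption.
function make_deduper() {
	let previous = null
	return function should_speak(line) {
		const key = normalize_line(line)
		if (key === '' || key === previous) return false
		previous = key
		return true
	}
}
```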
 **Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
--- a/SPEAKER-DIARIZATION.md
+++ b/SPEAKER-DIARIZATION.md
@@ -0,0 +1,80 @@
 # Speaker Diarization
 Notes on adding speaker identification to the voice pipeline.
 ---
 ## What Whisper does and doesn't do
 Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.
 ---
 ## Adding diarization
 Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
 ### pyannote.audio
 The most capable open-source option. Runs locally, GPU-accelerated.
 ```
 pip install pyannote.audio
 ```
 Produces time-stamped segments:
 ```
 SPEAKER_00  0.5s – 3.2s
 SPEAKER_01  3.8s – 7.1s
 SPEAKER_00  7.5s – 12.0s
 ```
 Feed each segment to Whisper independently → transcript with speaker tags.
 Requires a HuggingFace token and model access request (free).
 ### sherpa-onnx
 Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.
 ---
 ## Speaker identification vs diarization
 | Feature | Diarization | Identification |
 |---------|------------|----------------|
 | Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
 | Requires enrollment | ✗ | ✓ (reference clip per person) |
 | Cross-session consistency | ✗ | ✓ |
 For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:
 - **resemblyzer** — simple cosine similarity on d-vector embeddings
 - **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
 - **speechbrain** — full toolkit, heavier dependency
 ---
 ## Future experiments
 ### Multi-speaker transcription
 Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
 ```
 Mikael: How do I juggle oranges?
 Claude: Start with two balls and work up from there.
 ```
 ### Real-time diarization
 Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
 ### Voice enrollment
 Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
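The matching step can be sketched with plain cosine similarity. Everything here is an assumption for illustration: embeddings are plain numeric arrays, `identify_speaker` is a hypothetical name, and the `0.7` threshold is a placeholder that would need tuning against whichever embedding model is chosen.

```javascript
function cosine_similarity(a, b) {
	let dot = 0, norm_a = 0, norm_b = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		norm_a += a[i] * a[i]
		norm_b += b[i] * b[i]
	}
	return dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
}

// enrolled: { name: embedding }. Returns the best match at or above the
// threshold, or name = null for an unknown speaker.
function identify_speaker(embedding, enrolled, threshold = 0.7) {
	let best = { name: null, score: -Infinity }
	for (const [name, reference] of Object.entries(enrolled)) {
		const score = cosine_similarity(embedding, reference)
		if (score > best.score) best = { name, score }
	}
	return best.score >= threshold ? best : { name: null, score: best.score }
}
```

A caller would assign a temporary anonymous label whenever `name` comes back as `null`.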
 ### Wake word per speaker
 Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
 ---
 ## VRAM considerations
 pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
--- a/TODO.md
+++ b/TODO.md
@@ -0,0 +1,41 @@
 # Voice Experiment — TODO
 ## Classifier
 - [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
 - [ ] Tune classifier — still too conservative on plain statements and short instructions
 - [ ] Experiment with larger classifier model (Qwen 7B or 14B)
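One possible shape for the query journal, as a sketch only. The entry fields (`fragments`, `decision`, `resolved_by`) are hypothetical names derived from the description above; writing the lines to a file is left to the caller.

```javascript
// Serialize one journal entry as a JSONL line with a timestamp.
function journal_line(entry, now = new Date()) {
	return JSON.stringify({ ts: now.toISOString(), ...entry }) + '\n'
}

// Parse journal text back into entries for replay experiments.
function parse_journal(text) {
	return text
		.split('\n')
		.filter(line => line.trim() !== '')
		.map(line => JSON.parse(line))
}
```

A replay tool could feed each parsed entry's `fragments` back through a new classifier and compare the outcome with the recorded `resolved_by`.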
 ## Voice switching
 - [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
 - [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
 - [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
 - [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
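The daily rotation item above could be sketched like this. FNV-1a is used only because it is small and dependency-free; any stable hash would do. The function names are hypothetical, and the date is taken in UTC, so the voice would change at UTC midnight rather than local midnight.

```javascript
// FNV-1a 32-bit hash: small, dependency-free, deterministic.
function fnv1a(text) {
	let hash = 0x811c9dc5
	for (let i = 0; i < text.length; i++) {
		hash ^= text.charCodeAt(i)
		hash = Math.imul(hash, 0x01000193) >>> 0
	}
	return hash
}

// Pick a voice name from the list based on the current date.
function voice_for_date(voice_names, date = new Date()) {
	if (voice_names.length === 0) throw new Error('no voices configured')
	const day = date.toISOString().slice(0, 10)  // YYYY-MM-DD in UTC
	return voice_names[fnv1a(day) % voice_names.length]
}
```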
 ## Pipeline
 - [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
 - [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
 - [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
 - [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
 - [ ] Reserved control vocabulary — single words, pattern matched before any classifier:
  | Word(s) | Action |
  |---------|--------|
  | `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
  | `project note` | Activate with project context injected into the prompt preamble |
  | `go` / `send` / `done` | Dispatch current query immediately |
  | `cancel` / `scratch that` | Discard accumulated query, return to idle |
  | `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
  | `switch voice` / `different voice` | Trigger voice-switch flow |
  Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content.
 - [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
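The control vocabulary and the staged classifier can be combined in a short sketch. The action names (`activate`, `dispatch`, and so on) and the function names are hypothetical, and matching the whole normalized utterance rather than a prefix is an assumption.

```javascript
const CONTROL_WORDS = new Map([
	['computer', 'activate'],
	['project note', 'activate_project'],
	['go', 'dispatch'], ['send', 'dispatch'], ['done', 'dispatch'],
	['cancel', 'discard'], ['scratch that', 'discard'],
	['stop', 'shutdown'], ['goodbye', 'shutdown'], ['hang up', 'shutdown'],
	['switch voice', 'switch_voice'], ['different voice', 'switch_voice'],
])

function normalize(text) {
	return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim()
}

// Stage 1: fast pattern match. Returns an action name or null.
function match_control_word(text) {
	return CONTROL_WORDS.get(normalize(text)) ?? null
}

// Run stages in order; the first non-null result short-circuits the rest.
async function classify(text, stages) {
	for (const stage of stages) {
		const result = await stage(text)
		if (result !== null && result !== undefined) return result
	}
	return null
}
```

An LLM stage would be appended to `stages` as an async function, so it only runs when the pattern layer returns `null`.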
 ## TTS / Long text
 - [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
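For reference, sentence-boundary chunking can be sketched as below. This is not the `speak_streaming` implementation; the function name is hypothetical and the `300`-character budget is a placeholder, since the actual Chatterbox limit is not stated here.

```javascript
// Split text on sentence boundaries, then pack sentences into chunks that
// stay under max_chars where possible.
function split_sentences(text, max_chars = 300) {
	// Break after ., !, ? or … when followed by whitespace.
	const sentences = text.split(/(?<=[.!?…])\s+/).map(s => s.trim()).filter(Boolean)
	const chunks = []
	let current = ''
	for (const sentence of sentences) {
		if (current && current.length + 1 + sentence.length > max_chars) {
			chunks.push(current)
			current = sentence
		} else {
			current = current ? `${current} ${sentence}` : sentence
		}
	}
	if (current) chunks.push(current)
	return chunks
}
```

A single sentence longer than `max_chars` is kept whole in this sketch; a server would need a fallback split for that case.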
 ## TTS / Prompt
 - [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
 ## STT
 - [ ] Try Whisper small.en or medium.en for better accuracy on accents
--- a/VOICES.md
+++ b/VOICES.md
@@ -0,0 +1,29 @@
 # Voice Clone Wishlist
 Voices to prepare reference clips for.
 ## Ready
 | Voice | Character / Show | Notes |
 |-------|-----------------|-------|
 | Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well |
 | Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly, authoritative delivery; dry, clean speech |
 ## To Do
 | Voice | Character / Show | Notes |
 |-------|-----------------|-------|
 | Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality |
 | Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum |
 | John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
 | Steve Bacic | Andromeda (Telemachus Rhade) | — |
 | Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
 | Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
 | Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |
 ## Notes
 - Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
 - Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
 - OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both
 - Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone
--- a/acting-demo-bark.mjs
+++ b/acting-demo-bark.mjs
@@ -0,0 +1,135 @@
 /**
 * Demonstrates acting and expressiveness with Bark TTS.
 *
 * Run: node acting-demo-bark.mjs [start-section]
 *
 * Bark expressive techniques:
 *   UPPERCASE          — stress/emphasis on a word
 *   ...                — hesitation, trailing off
 *   [laughs]           — laughter
 *   [sighs]            — sigh
 *   [gasps]            — sharp intake of breath
 *   [clears throat]    — throat clearing
 *   ♪ text ♪           — sung phrase
 *   —                  — abrupt break
 *   !                  — raised energy
 *
 * Voice presets (English):
 *   v2/en_speaker_0  calm female
 *   v2/en_speaker_1  calm male
 *   v2/en_speaker_3  deep male
 *   v2/en_speaker_6  neutral/warm (default)
 *   v2/en_speaker_9  expressive
 */
 import { Bark_Tts } from './lib/bark-tts.mjs'
 const tts = new Bark_Tts()
 process.stderr.write('[bark] loading model...\n')
 await tts.init()
 process.stderr.write('[bark] ready\n\n')
 const START = parseInt(process.argv[2] ?? '1', 10)
 function section(title) {
 	process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
 }
 async function say(text, voice = undefined) {
 	const opts = { preprocess: false }
 	if (voice) opts.voice = voice
 	process.stdout.write(`  ${voice ?? 'default'}\n  "${text}"\n\n`)
 	await tts.speak(text, opts)
 }
 // ─── 1. Baseline ───────────────────────────────────────
 section('1. Baseline')
 if (START <= 1) {
 	await say('The package has arrived. Please sign here.')
 }
 // ─── 2. Paralinguistic tokens ──────────────────────────
 section('2. Paralinguistic tokens')
 if (START <= 2) {
 	await say('[clears throat] Right. Where were we.')
 	await say('And then he just... [sighs] gave up.')
 	await say('[gasps] I had no idea you were here!')
 	await say("Ha. [laughs] That's the funniest thing I've heard all week.")
 }
 // ─── 3. Emphasis via UPPERCASE ─────────────────────────
 section('3. Emphasis via UPPERCASE')
 if (START <= 3) {
 	await say('You need to stop. RIGHT NOW.')
 	await say('I have told you THIS BEFORE.')
 	await say('This is ABSOLUTELY unacceptable.')
 }
 // ─── 4. Hesitation and rhythm ──────────────────────────
 section('4. Hesitation and rhythm')
 if (START <= 4) {
 	await say('It was... quiet. Too quiet.')
 	await say('I just... I don\'t know what to say.')
 	await say('He said he was fine — but his eyes told a different story.')
 }
 // ─── 5. Sung phrase ────────────────────────────────────
 section('5. Singing')
 if (START <= 5) {
 	await say('♪ Happy birthday to you, happy birthday to you ♪')
 }
 // ─── 6. Acronyms ───────────────────────────────────────
 section('6. Acronyms')
 if (START <= 6) {
 	// Without spacing — Bark may try to pronounce as a word
 	await say('The FBI is investigating.')
 	await say('NASA launched a new rocket.')
 	await say('I work with the CPU and GPU.')
 	// With letter spacing — forces spelling out
 	await say('The F B I is investigating.')
 	await say('N A S A launched a new rocket.')
 	await say('I work with the C P U and G P U.')
 }
 // ─── 7. Voice character ────────────────────────────────
 section('7. Voice presets — same line')
 if (START <= 7) {
 	const line = "Well, that's certainly one way to look at it."
 	for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) {
 		await say(line, voice)
 	}
 }
 // ─── 8. Combined scene ─────────────────────────────────
 section('8. Combined — a tense scene')
 if (START <= 8) {
 	await say('Something is wrong.', 'v2/en_speaker_9')
 	await say('I can FEEL it.', 'v2/en_speaker_9')
 	await say('[gasps] Run!', 'v2/en_speaker_9')
 }
 // ─── 9. Paragraph — voices 7-9 ─────────────────────────
 // Pass voices as CLI args to override, e.g.:
 //   node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5
 const PARA_VOICES = process.argv.slice(3).length
 	? process.argv.slice(3)
 	: ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9']
 const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. '
 	+ '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. '
 	+ 'Her hands trembled as she tore open the envelope. '
 	+ 'Inside was a single sheet of paper with just THREE words written on it. '
 	+ '[gasps] She read them twice... then sat down slowly, and stared at the wall.'
 section('9. Paragraph — voice comparison')
 if (START <= 9) {
 	for (const voice of PARA_VOICES) {
 		process.stdout.write(`\n  — ${voice} —\n`)
 		await say(PARAGRAPH, voice)
 	}
 }
 tts.stop()
 section('Done')
--- a/acting-demo-chatterbox.mjs
+++ b/acting-demo-chatterbox.mjs
@@ -0,0 +1,109 @@
 /**
 * Acting demo for Chatterbox TTS.
 *
 * Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full]
 *
 * Paralinguistic tags:
 *   [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
 *
 * Exaggeration (full model only, 0.0-1.0):
 *   0.0 = neutral, 1.0 = maximum emotion intensity
 */
 import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
 const START   = parseInt(process.argv[2] ?? '1', 10)
 const VARIANT = process.argv[3] ?? 'turbo'
 const tts = new Chatterbox_Tts({ variant: VARIANT })
 process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`)
 await tts.init()
 process.stderr.write('[chatterbox] ready\n\n')
 function section(title) {
    process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
 }
 async function say(text, opts = {}) {
    process.stdout.write(`  "${text}"\n`)
    if (Object.keys(opts).length) process.stdout.write(`  ${JSON.stringify(opts)}\n`)
    process.stdout.write('\n')
    await tts.speak(text, { preprocess: false, ...opts })
 }
 // ─── 1. Baseline ───────────────────────────────────────
 section('1. Baseline')
 if (START <= 1) {
    await say('The package has arrived. Please sign here.')
 }
 // ─── 2. Paralinguistic tags ────────────────────────────
 section('2. Paralinguistic tags')
 if (START <= 2) {
    await say('[cough] Right, where were we.')
    await say('And then he just... [sigh] gave up.')
    await say('[gasp] I had no idea you were here!')
    await say('Ha! [laugh] That is the funniest thing I have heard all week.')
    await say('[chuckle] Well, that is certainly one way to look at it.')
    await say('[groan] Not this again.')
 }
 // ─── 3. Punctuation and rhythm ─────────────────────────
 section('3. Punctuation and rhythm')
 if (START <= 3) {
    await say('It was... quiet. Too quiet.')
    await say('He said he was fine — but his eyes told a different story.')
    await say('I told you. I told you this would happen. But did anyone listen? No.')
 }
 // ─── 4. Exaggeration — full model only ─────────────────
 section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`)
 if (START <= 4) {
    await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 })
    await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 })
    await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 })
 }
 // ─── 5. Temperature ────────────────────────────────────
 section('5. Temperature (lower = more predictable)')
 if (START <= 5) {
    await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 })
    await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 })
    await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 })
 }
 // ─── 6. Paragraph ──────────────────────────────────────
 section('6. Paragraph with tags')
 if (START <= 6) {
    await say(
        'It was a cold evening when the letter finally arrived. '
        + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. '
        + 'Her hands trembled as she tore open the envelope. '
        + '[gasp] Inside was a single sheet of paper with just three words written on it.'
    )
 }
 // ─── 7. Audio prompt / voice cloning ───────────────────
 // Pass a WAV file path as third argument:
 //   node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav
 section('7. Audio prompt (voice cloning)')
 if (START <= 7) {
    const audio_prompt = process.argv[4]
    if (!audio_prompt) {
         process.stdout.write('  Skipped — pass a WAV file as the 3rd argument:\n')
        process.stdout.write('  node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n')
        process.stdout.write('  Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n')
    } else {
        process.stdout.write(`  voice: ${audio_prompt}\n\n`)
        await say('Hello. This is a test of voice cloning.', { audio_prompt })
        await say(
            'It was a cold evening when the letter finally arrived. '
            + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad.',
            { audio_prompt }
        )
    }
 }
 tts.stop()
 section('Done')
--- a/acting-demo.mjs
+++ b/acting-demo.mjs
@@ -0,0 +1,163 @@
 /**
 * Demonstrates prosody and acting techniques with Kokoro TTS.
 *
 * Run: node acting-demo.mjs [start-section]
 * E.g.: node acting-demo.mjs 5
 *
 * Kokoro's expressive range comes from four levers:
 *
 * 1. Voice choice — each voice has a baked-in emotional character
 * 2. Speed — slower = deliberate/grave, faster = excited/happy
 * 3. silenceScale — controls pause length between sentences
 * 4. Punctuation & markup — espeak-ng honours some SSML and typographic cues
 *
 * SSML tags understood by espeak-ng (not supported by this backend; see section 5):
 *   <break time="500ms"/>          — explicit pause
 *   <emphasis level="strong">x</emphasis>  — louder/longer (model-dependent)
 *   <prosody rate="slow">x</prosody>       — local rate change
 *   <phoneme alphabet="ipa" ph="..."/>     — force pronunciation
 *
 * Typographic cues that affect prosody without SSML:
 *   UPPERCASE words            — some stress
 *   Ellipsis ...               — trailing pause / trailing off
 *   Em-dash —                  — abrupt break
 *   Comma placement            — micro-pauses
 *   Exclamation !              — raised energy
 */
 import { Tts } from './lib/tts.mjs'
 import { spawn } from 'node:child_process'
 const tts = new Tts()
 tts.init()
 // Calling the internal sherpa handle (tts._tts) directly is the simplest way to
 // control per-utterance GenerationConfig without reworking the Tts class.
 async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) {
 	const { createRequire } = await import('node:module')
 	const require = createRequire(import.meta.url)
 	const sherpa  = require('sherpa-onnx-node')
 	const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
 	const sid    = VOICES[voice] ?? 0
 	process.stdout.write(`  voice=${voice} speed=${speed} silence=${silence_scale}\n  "${text}"\n\n`)
 	const audio = tts._tts.generate({
 		text,
 		generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
 	})
 	await play(audio.samples, audio.sampleRate)
 }
 // Synthesize multiple phrase/option pairs and play them as one continuous audio
 async function say_concat(parts) {
 	const { createRequire } = await import('node:module')
 	const require = createRequire(import.meta.url)
 	const sherpa  = require('sherpa-onnx-node')
 	const VOICES  = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
 	let total_len = 0
 	const chunks = []
 	for (const [text, opts = {}] of parts) {
 		const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts
 		const sid   = VOICES[voice] ?? 0
 		const audio = tts._tts.generate({
 			text,
 			generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
 		})
 		process.stdout.write(`  [${speed}x] "${text}"\n`)
 		chunks.push(audio.samples)
 		total_len += audio.samples.length
 	}
 	const combined = new Float32Array(total_len)
 	let offset = 0
 	for (const chunk of chunks) {
 		combined.set(chunk, offset)
 		offset += chunk.length
 	}
 	await play(combined, tts._tts.sampleRate)
 	process.stdout.write('\n')
 }
 function play(samples, sample_rate) {
 	return new Promise((resolve, reject) => {
 		const proc = spawn('pacat', [
 			'--format=float32le', `--rate=${sample_rate}`, '--channels=1',
 		], { stdio: ['pipe', 'ignore', 'inherit'] })
 		proc.on('error', reject)
 		proc.on('close', resolve)
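 		// Send only this view's bytes; samples may not span its whole ArrayBuffer.
 		proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))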
 		proc.stdin.end()
 	})
 }
 const START = parseInt(process.argv[2] ?? '1', 10)
 function section(title) {
 	process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
 }
 // ─── 1. Baseline ───────────────────────────────────────
 section('1. Baseline — neutral af_heart')
 if (START <= 1) {
 	await say('The package has arrived. Please sign here.')
 }
 // ─── 2. Speed as emotion ───────────────────────────────
 section('2. Speed — slow (grave) vs fast (excited)')
 if (START <= 2) {
 	await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 })
 	await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 })
 }
 // ─── 3. Voice character ────────────────────────────────
 section('3. Voice character — same line, different voices')
 if (START <= 3) {
 	const line = "Well, that's certainly one way to look at it."
 	// Each name must be a key of VOICES in say(); unknown names silently fall back to sid 0.
 	for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) {
 		await say(line, { voice })
 	}
 }
 // ─── 4. Punctuation-driven prosody ─────────────────────
 section('4. Punctuation and typographic cues')
 if (START <= 4) {
 	await say('I told you. I told you this would happen. But did anyone listen? No.')
 	await say('It was... quiet. Too quiet.')
 	await say('He said he was fine — but his eyes told a different story.')
 	await say('This is ABSOLUTELY unacceptable.')
 }
 // ─── 5. Per-phrase speed — manual emphasis ─────────────
 // SSML tags are not supported by this backend; emphasis is achieved by
 // synthesizing key phrases at a different speed and concatenating the audio.
 section('5. Per-phrase speed (manual emphasis)')
 if (START <= 5) {
 	// "stop" spoken slower and then the rest faster — sounds like emphasis
 	await say_concat([
 		['You need to ',        { speed: 1.0 }],
 		['stop.',               { speed: 0.7 }],
 		['Right now.',          { speed: 1.1 }],
 	])
 	await say_concat([
 		['Listen very carefully.', { speed: 0.75 }],
 		['I will only say this once.', { speed: 0.9 }],
 	])
 }
 // ─── 6. Combining levers ───────────────────────────────
 section('6. Combined — a tense scene')
 if (START <= 6) {
 	await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 })
 	await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 })
 	await say('<break time="600ms"/> Run!', { speed: 1.3, voice: 'af_sky' })
 }
 section('Done')
 process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n')
--- a/bark-server.py
+++ b/bark-server.py
@@ -0,0 +1,135 @@
 #!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
 """
 Bark TTS server — keeps model loaded, reads JSON lines from stdin.
 Protocol:
  stdin:  one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"}
  stdout: "ok\n" after each utterance is finished playing
  stderr: status/timing messages
 Usage:
  python bark-server.py [model] [voice]
  python bark-server.py suno/bark-small v2/en_speaker_3
 Voices (English):
  v2/en_speaker_0  .. v2/en_speaker_9
  Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male
 """
 import os
 import sys
 import json
 import time
 import subprocess
 import numpy as np
 MODEL      = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark'
 DEF_VOICE  = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6'
 SAMPLE_RATE = 24000
 TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
 try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
 except FileNotFoundError:
    pass
 # Disable background safetensors conversion attempts — the HF API endpoint
 # it calls has changed and now errors. The pytorch_model.bin files work fine.
 os.environ['SAFETENSORS_FAST_GPU'] = '0'
 os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1')
 def log(msg):
    print(f'[bark] {msg}', file=sys.stderr, flush=True)
 log(f'loading {MODEL}...')
 t0 = time.time()
 import torch
 from transformers import AutoProcessor, BarkModel
 device = 'cuda' if torch.cuda.is_available() else 'cpu'
 dtype  = torch.float16 if device == 'cuda' else torch.float32
 if device == 'cuda':
    # TF32 is faster on Ampere and newer GPUs (30xx, 40xx) for both matmul and convolutions
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32       = True
 processor = AutoProcessor.from_pretrained(MODEL)
 model     = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device)
 # Bark's internal generation calls set both max_length and max_new_tokens simultaneously
 # which causes a conflict warning and may cause early stopping. Clearing max_length
 # lets max_new_tokens take sole control.
 model.generation_config.max_length = None
 # torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile)
 try:
    model = torch.compile(model, mode='reduce-overhead')
    log('torch.compile applied')
 except Exception as e:
    log(f'torch.compile skipped: {e}')
 log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
 print('ready', flush=True)  # signal to Node that we are ready
 def speak(text, voice):
    t1 = time.time()
    inputs = processor(text, voice_preset=voice, return_tensors='pt')
    # Set attention_mask if not provided by processor — without it the model
    # cannot distinguish padding from real tokens, which causes erratic generation
    # including leading filler sounds.
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        audio = model.generate(
            **inputs,
            semantic_temperature = 0.3,
            coarse_temperature   = 0.3,
            fine_temperature     = 0.3,
        )
    samples = audio.cpu().numpy().squeeze().astype(np.float32)
    elapsed_gen = time.time() - t1
    duration    = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s  rtf={elapsed_gen/duration:.2f}')
    proc = subprocess.Popen(
        ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
        stdin=subprocess.PIPE,
    )
    proc.stdin.write(samples.tobytes())
    proc.stdin.close()
    proc.wait()
 for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
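        req   = json.loads(line)
        text  = req.get('text', '')
        voice = req.get('voice', DEF_VOICE)
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Not JSON, or JSON that is not an object: treat the raw line as plain text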
        text  = line
        voice = DEF_VOICE
    if not text:
        print('ok', flush=True)
        continue
    try:
        speak(text, voice)
    except Exception as e:
        log(f'error: {e}')
    print('ok', flush=True)
--- a/chatterbox-server.py
+++ b/chatterbox-server.py
@@ -0,0 +1,200 @@
 #!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
 """
 Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin.
 Protocol:
  stdin:  {"text": "...", "temperature": 0.8, "top_p": 0.95}
  stdout: "ok\n" after each utterance is generated (playback may still be in progress)
  stderr: status/timing messages
 Usage:
  ./chatterbox-server.py
  ./chatterbox-server.py turbo    # default
  ./chatterbox-server.py full     # original model, supports exaggeration
 Paralinguistic tags supported in text:
  [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
 Full model only:
  exaggeration  0.0-1.0  emotion intensity (ignored in turbo)
 """
 import os
 import sys
 import json
 import time
 import queue
 import threading
 import subprocess
 import numpy as np
 TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
 try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
 except FileNotFoundError:
    pass
 def find_hf_cache(repo_id):
    """Return the local snapshot path if the model is already cached, else None."""
    from pathlib import Path
    cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub'
    repo_dir  = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots'
    if repo_dir.exists():
        snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime)
        if snapshots:
            return str(snapshots[-1])
    return None
 VARIANT    = sys.argv[1] if len(sys.argv) > 1 else 'turbo'
 SAMPLE_RATE = 24000
 def log(msg):
    print(f'[chatterbox] {msg}', file=sys.stderr, flush=True)
 log(f'loading chatterbox-{VARIANT}...')
 t0 = time.time()
 import tempfile
 import traceback
 import numpy as np
 import torch
 import soundfile as sf
 import librosa as _librosa
 # librosa.resample returns float64 in newer numpy — patch it to always return float32
 _orig_resample = _librosa.resample
 def _resample_float32(*args, **kwargs):
    return _orig_resample(*args, **kwargs).astype(np.float32)
 _librosa.resample = _resample_float32
 device = 'cuda' if torch.cuda.is_available() else 'cpu'
 REPO_IDS = {
    'turbo': 'ResembleAI/chatterbox-turbo',
    'full':  'ResembleAI/chatterbox',
 }
 if VARIANT == 'turbo':
    from chatterbox.tts_turbo import ChatterboxTurboTTS as Model
 else:
    from chatterbox.tts import ChatterboxTTS as Model
 cached = find_hf_cache(REPO_IDS[VARIANT])
 if cached:
    log(f'loading from cache: {cached}')
    model = Model.from_local(cached, device=device)
 else:
    log('cache not found, downloading...')
    model = Model.from_pretrained(device=device)
 log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
 print('ready', flush=True)
 _wav_cache = {}
 def ensure_float32_wav(path):
    """Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue.
    Result is cached by input path so repeated calls with the same file are free."""
    if path in _wav_cache:
        return _wav_cache[path]
    wav, sr = sf.read(path, dtype='float32', always_2d=True)
    wav = wav.mean(axis=1)  # stereo → mono if needed
    tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
    sf.write(tmp.name, wav, sr, subtype='FLOAT')
    _wav_cache[path] = tmp.name
    return tmp.name
 _SENTINEL = object()
 playback_queue = queue.Queue()
 def playback_worker():
    """Plays audio samples in order. Runs in its own thread."""
    while True:
        item = playback_queue.get()
        if item is _SENTINEL:
            break
        samples = item
        proc = subprocess.Popen(
            ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
            stdin=subprocess.PIPE,
        )
        proc.stdin.write(samples.tobytes())
        proc.stdin.close()
        proc.wait()
        playback_queue.task_done()
 playback_thread = threading.Thread(target=playback_worker, daemon=True)
 playback_thread.start()
 def generate(text, opts):
    t1 = time.time()
    if VARIANT == 'turbo':
        kwargs = {
            'temperature':        opts.get('temperature',        0.8),
            'top_p':              opts.get('top_p',              0.95),
            'top_k':              opts.get('top_k',              1000),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p':              opts.get('min_p',              0.0),
        }
    else:
        kwargs = {
            'temperature':        opts.get('temperature',        0.8),
            'top_p':              opts.get('top_p',              1.0),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p':              opts.get('min_p',              0.05),
            'exaggeration':       opts.get('exaggeration',       0.5),
            'cfg_weight':         opts.get('cfg_weight',         0.5),
        }
    audio_prompt = opts.get('audio_prompt')
    if audio_prompt:
        kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt)
    with torch.inference_mode():
        wav = model.generate(text, **kwargs)
    samples = wav.squeeze(0).cpu().numpy().astype(np.float32)
    elapsed = time.time() - t1
    duration = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed:.1f}s  rtf={elapsed/duration:.2f}')
    return samples
 for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        req  = json.loads(line)
        text = req.pop('text', '')
        opts = req  # remaining fields are generation options
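    except (json.JSONDecodeError, AttributeError, TypeError):
        # Not JSON, or JSON that is not an object: treat the raw line as plain text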
        text = line
        opts = {}
    if not text:
        print('ok', flush=True)
        continue
    try:
        samples = generate(text, opts)
        playback_queue.put(samples)
    except Exception as e:
        log(f'error: {e}')
        traceback.print_exc(file=sys.stderr)
    print('ok', flush=True)
 # Drain playback before exit
 playback_queue.put(_SENTINEL)
 playback_thread.join()
--- a/demo-bark.mjs
+++ b/demo-bark.mjs
@@ -0,0 +1,85 @@
 /**
 * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS
 *
 * Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs],
 * [clears throat]), and hesitation (...). Markdown is preprocessed automatically:
 * **bold** → UPPERCASE, headers get a pause, etc.
 *
 * Environment variables:
 *   WHISPER_MODEL   default: base.en
 *                   options: tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
 *   BARK_MODEL      default: suno/bark  (use suno/bark-small for faster/smaller)
 *   BARK_VOICE      default: v2/en_speaker_6
 *                   options: v2/en_speaker_0 .. v2/en_speaker_9
 *   STT_PROVIDER    default: cuda
 *   USE_LLM         set to 0 to disable (default: 1)
 *   OLLAMA_MODEL    default: phi3:mini
 */
 import { Stt }                             from './lib/stt.mjs'
 import { Bark_Tts }                        from './lib/bark-tts.mjs'
 import { llm_available, list_models, cleanup } from './lib/llm.mjs'
 const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
 const STT_PROVIDER  = process.env.STT_PROVIDER  || 'cuda'
 const USE_LLM       = process.env.USE_LLM !== '0'
 process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
 const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
 stt.init()
 process.stderr.write('[stt] ready\n')
 process.stderr.write('[tts] loading Bark (this takes a moment)...\n')
 const tts = new Bark_Tts()
 await tts.init()
 process.stderr.write('[tts] Bark ready\n')
 let has_llm = false
 if (USE_LLM) {
 	has_llm = await llm_available()
 	if (has_llm) {
 		const models = await list_models()
 		process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
 	} else {
 		process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
 	}
 }
 process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`)
 process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
 let speaking = false
 const stop = stt.listen(async (raw_text) => {
 	if (speaking) {
 		process.stderr.write(`[skipped] ${raw_text}\n`)
 		return
 	}
 	speaking = true
 	try {
 		process.stdout.write(`[raw]  ${raw_text}\n`)
 		let text = raw_text
 		if (has_llm) {
 			text = await cleanup(raw_text)
 			if (text !== raw_text) {
 				process.stdout.write(`[llm]  ${text}\n`)
 			}
 		}
 		await tts.speak_streaming(text)
 		process.stdout.write('\n')
 	} catch (err) {
 		process.stderr.write(`[error] ${err.message}\n`)
 	} finally {
 		speaking = false
 	}
 })
 process.on('SIGINT', () => {
 	process.stderr.write('\n[voice] stopping\n')
 	stop()
 	tts.stop()
 	process.exit(0)
 })
--- a/demo-kokoro.mjs
+++ b/demo-kokoro.mjs
@@ -0,0 +1,85 @@
 /**
 * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS
 *
 * Environment variables:
 *   WHISPER_MODEL   default: base.en
 *                   options: tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
 *   VOICE           default: af_heart
 *                   options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael
 *   STT_PROVIDER    default: cuda
 *   USE_LLM         set to 0 to disable (default: 1)
 *   OLLAMA_MODEL    default: phi3:mini
 */
 import { Stt }                             from './lib/stt.mjs'
 import { Tts, VOICES }                     from './lib/tts.mjs'
 import { llm_available, list_models, cleanup } from './lib/llm.mjs'
 const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
 const VOICE         = process.env.VOICE         || 'af_heart'
 const STT_PROVIDER  = process.env.STT_PROVIDER  || 'cuda'
 const USE_LLM       = process.env.USE_LLM !== '0'
 if (!(VOICE in VOICES)) {
 	process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`)
 	process.exit(1)
 }
 process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
 const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
 stt.init()
 process.stderr.write('[stt] ready\n')
 process.stderr.write('[tts] loading Kokoro...\n')
 const tts = new Tts({ voice: VOICE })
 tts.init()
 process.stderr.write('[tts] ready\n')
 let has_llm = false
 if (USE_LLM) {
 	has_llm = await llm_available()
 	if (has_llm) {
 		const models = await list_models()
 		process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
 	} else {
 		process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
 	}
 }
 process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`)
 process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
 let speaking = false
 const stop = stt.listen(async (raw_text) => {
 	if (speaking) {
 		process.stderr.write(`[skipped] ${raw_text}\n`)
 		return
 	}
 	speaking = true
 	try {
 		process.stdout.write(`[raw]  ${raw_text}\n`)
 		let text = raw_text
 		if (has_llm) {
 			text = await cleanup(raw_text)
 			if (text !== raw_text) {
 				process.stdout.write(`[llm]  ${text}\n`)
 			}
 		}
 		await tts.speak_streaming(text)
 		process.stdout.write('\n')
 	} catch (err) {
 		process.stderr.write(`[error] ${err.message}\n`)
 	} finally {
 		speaking = false
 	}
 })
 process.on('SIGINT', () => {
 	process.stderr.write('\n[voice] stopping\n')
 	stop()
 	process.exit(0)
 })
--- a/demos/darkwing-briefing.sh
+++ b/demos/darkwing-briefing.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. Smith, Marella, Rommie
 TTS="${TTS_URL:-http://192.168.2.99:11500}"
 switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
 say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
 switch darkwing-duck
 say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover."
 switch gosalyn
 say "Dad, you're at the WRONG briefing AGAIN."
 switch harold-w-smith
 say "How did a duck get into this facility."
 switch marella
 say "The same way everything else does, sir. The paperwork looked official."
 switch rommie
 say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation."
 switch darkwing-duck
 say "I love it when a plan comes together. Let's get dangerous."
--- a/demos/threat-assessment.sh
+++ b/demos/threat-assessment.sh
@@ -0,0 +1,21 @@
 #!/usr/bin/env bash
 # Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini
 TTS="${TTS_URL:-http://192.168.2.99:11500}"
 switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
 say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
 switch ivanova
 say "We have an unidentified vessel on an intercept course. Threat assessment?"
 switch rommie
 say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists."
 switch archangel
 say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive."
 switch chuin
 say "Are you? I have been wondering this myself for some time."
 switch santini
 say "I'm gonna need a bigger hangar."
--- a/demos/wrong-meeting.sh
+++ b/demos/wrong-meeting.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker
 TTS="${TTS_URL:-http://192.168.2.99:11500}"
 switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
 say()    { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
 switch harold-w-smith
 say "This meeting never happened. None of you are here, and neither am I."
 switch ivanova
 say "That's extremely reassuring. Last time someone said that to me, a space station blew up."
 switch archangel
 say "Technically it was a controlled demolition. The paperwork was filed in triplicate."
 switch rommie
 say "I have reviewed the paperwork. It was filed AFTER the explosion."
 switch penny-parker
 say "You know, I walked into the wrong building again, didn't I."
 switch harold-w-smith
 say "Yes. But you saw nothing."
--- a/download-models.sh
+++ b/download-models.sh
@@ -0,0 +1,63 @@
 #!/usr/bin/env bash
 # Download ONNX models for the voice experiment.
 #
 # Usage:
 #   bash download-models.sh              # default: whisper base.en
 #   WHISPER_MODEL=large-v3-turbo bash download-models.sh
 #
 # After downloading, run: node demo.mjs
 set -euo pipefail
 WHISPER_MODEL="${WHISPER_MODEL:-base.en}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 MODELS_DIR="${SCRIPT_DIR}/models"
 mkdir -p "${MODELS_DIR}"
 cd "${MODELS_DIR}"
 echo "==> models dir: ${MODELS_DIR}"
 # --- Silero VAD (~2 MB) ---
 if [ ! -f silero_vad.onnx ]; then
 	echo "==> downloading silero_vad.onnx"
 	curl -L --progress-bar -o silero_vad.onnx \
 		'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx'
 else
 	echo "==> silero_vad.onnx already present"
 fi
 # --- Whisper STT ---
 WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}"
 if [ ! -d "${WHISPER_DIR}" ]; then
 	echo "==> downloading whisper-${WHISPER_MODEL}"
 	curl -L --progress-bar -o "${WHISPER_DIR}.tar.bz2" \
 		"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2"
 	echo "==> extracting..."
 	tar xf "${WHISPER_DIR}.tar.bz2"
 else
 	echo "==> whisper-${WHISPER_MODEL} already present"
 fi
 # --- Kokoro TTS (~310 MB) ---
 if [ ! -d kokoro-en-v0_19 ]; then
 	echo "==> downloading kokoro-en-v0_19 (English TTS)"
 	curl -L --progress-bar -o kokoro-en-v0_19.tar.bz2 \
 		'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2'
 	echo "==> extracting..."
 	tar xf kokoro-en-v0_19.tar.bz2
 else
 	echo "==> kokoro-en-v0_19 already present"
 fi
 echo ""
 echo "==> all models ready"
 echo ""
 echo "Next steps:"
 echo "  cd ${SCRIPT_DIR}"
 echo "  npm install"
 echo "  node demo.mjs"
 echo ""
 echo "To use a better STT model:"
 echo "  WHISPER_MODEL=large-v3-turbo bash download-models.sh"
 echo "  WHISPER_MODEL=large-v3-turbo node demo.mjs"
--- a/lib/bark-tts.mjs
+++ b/lib/bark-tts.mjs
@@ -0,0 +1,112 @@
 /**
 * Bark TTS — Node.js wrapper around bark-server.py.
 *
 * Spawns the Python server once, keeps it alive, sends requests as JSON lines.
 * Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
 *
 * Usage:
 *   const tts = new Bark_Tts()
 *   await tts.init()          // spawns server, waits for model load
 *   await tts.speak('Hello')
 *   await tts.speak('**Very** important point.')   // emphasis via CAPS
 *   tts.stop()
 *
 * Environment variables:
 *   BARK_MODEL   HuggingFace model id (default: suno/bark)
 *   BARK_VOICE   voice preset        (default: v2/en_speaker_6)
 *
 * Bark voice presets (English):
 *   v2/en_speaker_0  calm female
 *   v2/en_speaker_1  calm male
 *   v2/en_speaker_3  deep male
 *   v2/en_speaker_6  neutral/warm (default)
 *   v2/en_speaker_9  expressive
 */
 import { spawn }        from 'node:child_process'
 import * as path        from 'node:path'
 import * as readline    from 'node:readline'
 import { markdown_to_bark, split_sentences } from './markdown.mjs'
 const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark'
 const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6'
 const SERVER     = path.join(import.meta.dirname, '..', 'bark-server.py')
 export class Bark_Tts {
    constructor({
        model = BARK_MODEL,
        voice = BARK_VOICE,
    } = {}) {
        this._model   = model
        this._voice   = voice
        this._proc    = null
        this._rl      = null
        this._resolve = null  // resolver for the current in-flight request
    }
    /** Spawn bark-server.py and wait until it signals "ready". */
    init() {
        return new Promise((resolve, reject) => {
            this._proc = spawn(SERVER, [this._model, this._voice], {
                stdio: ['pipe', 'pipe', 'inherit'],
            })
            this._proc.on('error', reject)
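            this._proc.on('close', (code) => {
                if (code !== 0 && code !== null) {
                    process.stderr.write(`[bark] server exited with code ${code}\n`)
                }
                // No-op once init() has resolved; otherwise fail instead of hanging forever.
                reject(new Error(`bark server exited before ready (code ${code})`))
            })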
            this._rl = readline.createInterface({ input: this._proc.stdout })
            this._rl.on('line', (line) => {
                if (line === 'ready') {
                    resolve()
                    return
                }
                if (line === 'ok' && this._resolve) {
                    const res  = this._resolve
                    this._resolve = null
                    res()
                }
            })
        })
    }
    /** Preprocess markdown and speak as a single request. */
    async speak(text, { voice = this._voice, preprocess = true } = {}) {
        const clean = preprocess ? markdown_to_bark(text) : text
        return this._send(clean, voice)
    }
    /**
     * Preprocess markdown and speak sentence by sentence.
     * Lower latency: the first sentence plays as soon as it is generated. Each later
     * sentence is sent only after the previous one has finished playing.
     */
    async speak_streaming(text, opts = {}) {
        const clean     = opts.preprocess !== false ? markdown_to_bark(text) : text
        const sentences = split_sentences(clean)
        for (const s of sentences) {
            await this._send(s, opts.voice ?? this._voice)
        }
    }
    _send(text, voice) {
        return new Promise((resolve, reject) => {
            if (!this._proc) {
                return reject(new Error('Bark_Tts not initialized — call init() first'))
            }
            this._resolve = resolve
            const payload = JSON.stringify({ text, voice }) + '\n'
            this._proc.stdin.write(payload)
        })
    }
    stop() {
        this._rl?.close()
        this._proc?.kill()
        this._proc  = null
        this._rl    = null
    }
 }
--- a/lib/chatterbox-tts.mjs
+++ b/lib/chatterbox-tts.mjs
@@ -0,0 +1,101 @@
 /**
 * Chatterbox TTS — Node.js wrapper around chatterbox-server.py.
 *
 * Usage:
 *   const tts = new Chatterbox_Tts()
 *   await tts.init()
 *   await tts.speak('Hello! [chuckle] That is funny.')
 *   await tts.speak('Something intense.', { exaggeration: 0.8 })  // full model only
 *   tts.stop()
 *
 * Paralinguistic tags (embed in text):
 *   [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
 *
 * Generation options (passed as second arg to speak()):
 *   temperature         0.05–2.0   default 0.8
 *   top_p               0–1        default 0.95
 *   top_k               0–1000     default 1000
 *   repetition_penalty  ≥1.0       default 1.2
 *   min_p               0–1        default 0.0
 *   audio_prompt        string     path to reference WAV for voice cloning
 *   exaggeration        0–1        emotion intensity (full model only)
 *   cfg_weight          0–1        classifier-free guidance (full model only)
 */
 import { spawn }     from 'node:child_process'
 import * as path     from 'node:path'
 import * as readline from 'node:readline'
 import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs'
 const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py')
 export class Chatterbox_Tts {
    constructor({
        variant = 'turbo',  // 'turbo' or 'full'
    } = {}) {
        this._variant  = variant
        this._proc     = null
        this._rl       = null
        this._resolve  = null
    }
    init() {
        return new Promise((resolve, reject) => {
            this._proc = spawn(SERVER, [this._variant], {
                stdio: ['pipe', 'pipe', 'inherit'],
            })
            this._proc.on('error', reject)
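            this._proc.on('close', (code) => {
                if (code !== null && code !== 0) {
                    process.stderr.write(`[chatterbox] server exited with code ${code}\n`)
                    // Fail init() instead of hanging if the server dies before "ready".
                    // Once init() has resolved, this reject is a no-op.
                    reject(new Error(`chatterbox server exited with code ${code}`))
                }
            })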
            this._rl = readline.createInterface({ input: this._proc.stdout })
            this._rl.on('line', (line) => {
                if (line === 'ready') {
                    resolve()
                    return
                }
                if (line === 'ok' && this._resolve) {
                    const res = this._resolve
                    this._resolve = null
                    res()
                }
            })
        })
    }
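    // Markdown preprocessing is opt-in here (pass { preprocess: true });
    // speak_streaming() preprocesses by default.
    async speak(text, opts = {}) {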
        const clean = opts.preprocess === true ? markdown_to_speech(text) : text
        return this._send(clean, opts)
    }
    async speak_streaming(text, opts = {}) {
        const clean     = opts.preprocess !== false ? markdown_to_speech(text) : text
        const sentences = split_sentences(clean)
        for (const s of sentences) {
            await this._send(s, opts)
        }
    }
    _send(text, opts = {}) {
        return new Promise((resolve, reject) => {
            if (!this._proc) {
                return reject(new Error('Chatterbox_Tts not initialized — call init() first'))
            }
            this._resolve = resolve
            const { preprocess: _, ...gen_opts } = opts
            const payload = JSON.stringify({ text, ...gen_opts }) + '\n'
            this._proc.stdin.write(payload)
        })
    }
    stop() {
        this._rl?.close()
        this._proc?.kill()
        this._proc = null
        this._rl   = null
    }
 }
--- a/lib/llm.mjs
+++ b/lib/llm.mjs
@@ -0,0 +1,64 @@
 /**
 * Optional LLM cleanup via Ollama REST API.
 *
 * Cleans up raw Whisper output: fixes punctuation, capitalization,
 * removes filler words, corrects obvious recognition errors.
 *
 * Set OLLAMA_MODEL env var to choose the model (default: phi3:mini).
 * Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434).
 */
 const OLLAMA_URL   = process.env.OLLAMA_URL   || 'http://localhost:11434'
 const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini'
 export async function llm_available() {
 	try {
 		const res = await fetch(`${OLLAMA_URL}/api/tags`, {
 			signal: AbortSignal.timeout(2000),
 		})
 		return res.ok
 	} catch {
 		return false
 	}
 }
 export async function list_models() {
 	const res  = await fetch(`${OLLAMA_URL}/api/tags`)
 	const data = await res.json()
 	return (data.models ?? []).map(m => m.name)
 }
 /**
 * Clean up a raw STT transcription.
 * Returns the cleaned string, or the original if the request fails.
 */
 export async function cleanup(raw_text, model = OLLAMA_MODEL) {
 	const prompt = [
 		'You are a transcription editor. Clean up this speech-to-text output:',
 		'- Fix punctuation and capitalization',
 		'- Remove meaningless filler words (um, uh, like, you know)',
 		'- Correct obvious word recognition errors based on context',
 		'- Keep the meaning and phrasing intact',
 		'- Return ONLY the cleaned text, no explanation, no quotes',
 		'',
 		raw_text,
 	].join('\n')
 	try {
 		const res = await fetch(`${OLLAMA_URL}/api/generate`, {
 			method:  'POST',
 			headers: { 'Content-Type': 'application/json' },
 			body: JSON.stringify({
 				model,
 				prompt,
 				stream:  false,
 				options: { temperature: 0.1, num_predict: 256 },
 			}),
 			signal: AbortSignal.timeout(10_000),
 		})
 		const data = await res.json()
 		return data.response?.trim() || raw_text
 	} catch {
 		return raw_text
 	}
 }
--- a/lib/local-query-complete.mjs
+++ b/lib/local-query-complete.mjs
@@ -0,0 +1,43 @@
 export async function is_query_complete(query, model = 'qwen2.5:3b') {
 	const payload = {
 		model,
 		messages: [
 			{
 				role: 'system',
 				content: `
 					You are a voice query completeness classifier. Determine if the input is complete enough to act on.
 					Respond only with JSON: {"complete": true} or {"complete": false}
 					Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent.
 					Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather"
 					Incomplete examples: "How do I", "The thing is", "Can you"
 				`.replace(/^[\t ]+/gm, ''),
 			},
 			{ role: 'user', content: query },
 		],
 		stream: false,
 		format: 'json',
 		options: { num_predict: 10, num_ctx: 2048 },
 	}
 	// Fail open: if the classifier is unreachable, returns an error status,
 	// or produces malformed JSON, let the query through.
 	try {
 		const response = await fetch('http://192.168.2.99:11434/api/chat', {
 			method: 'POST',
 			headers: { 'Content-Type': 'application/json' },
 			body: JSON.stringify(payload),
 		})
 		if (!response.ok) {
 			return true
 		}
 		const data = await response.json()
 		const parsed = JSON.parse(data.message.content)
 		return parsed.complete === true
 	} catch {
 		return true
 	}
 }
--- a/lib/markdown.mjs
+++ b/lib/markdown.mjs
@@ -0,0 +1,66 @@
 /**
 * Convert markdown to Bark-friendly plain text.
 *
 * Bark emphasis conventions:
 *   UPPERCASE words     — stress/emphasis
 *   ...                 — hesitation / trailing off
 *   [laughs] [sighs]    — paralinguistic tokens (pass through as-is)
 *
 * Markdown mapping:
 *   **bold** / __bold__ → UPPERCASE  (strong emphasis)
 *   *italic* / _italic_ → markers stripped, text kept (soft emphasis, Bark handles naturally)
 *   # Heading           → text + "." (adds a pause)
 *   `code`              → backticks stripped, content spoken literally
 *   ```block```         → skipped
 *   [text](url)         → text only
 *   ---                 → "..." (pause)
 *   - item / 1. item    → item text (bullet stripped)
 */
 export function markdown_to_bark(text) {
    // Remove fenced code blocks entirely
    text = text.replace(/```[\s\S]*?```/g, '')
    // Inline code — keep the content, drop backticks
    text = text.replace(/`([^`]+)`/g, '$1')
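    // Images — drop (must run before the link rule, which would otherwise
    // match the [alt](url) part and leave "!alt" behind)
    text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '')
    // Links — keep visible text
    text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')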
    // **bold** / __bold__ → UPPERCASE
    text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase())
    text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase())
    // *italic* / _italic_ — strip markers, keep text
    text = text.replace(/\*([^*]+)\*/g, '$1')
    text = text.replace(/_([^_]+)_/g, '$1')
    // Headings — keep text, add a beat after
    text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.')
    // Horizontal rules → pause
    text = text.replace(/^[-*_]{3,}\s*$/gm, '...')
    // List bullets / numbers — strip prefix
    text = text.replace(/^\s*[-*+]\s+/gm, '')
    text = text.replace(/^\s*\d+\.\s+/gm, '')
    // Collapse excess blank lines
    text = text.replace(/\n{3,}/g, '\n\n')
    return text.trim()
 }
 /**
 * Split text into sentences suitable for sequential TTS.
 * Keeps sentence-ending punctuation with the preceding sentence.
 */
 export function split_sentences(text) {
    return text
        .split(/(?<=[.!?])\s+/)
        .map(s => s.trim())
        .filter(s => s.length > 0)
 }
--- a/lib/stt.mjs
+++ b/lib/stt.mjs
@@ -0,0 +1,170 @@
 /**
 * Speech-to-Text using sherpa-onnx-node.
 *
 * Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so
 * transcription only runs when a complete utterance has been detected.
 * Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA).
 *
 * Model layout expected under models_dir:
 *   silero_vad.onnx
 *   sherpa-onnx-whisper-<name>/
 *     <name>-encoder.int8.onnx
 *     <name>-decoder.int8.onnx
 *     <name>-tokens.txt
 *
 * Available Whisper model names (pass as whisper_name):
 *   tiny.en  base.en  small.en  medium.en  large-v3  large-v3-turbo
 * Download with: bash download-models.sh [model-name]
 */
 import { spawn, execSync } from 'node:child_process'
 import { createRequire } from 'node:module'
 import * as path from 'node:path'
 const require = createRequire(import.meta.url)
 const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
 const PRE_ROLL_SAMPLES  = 3200    // 200ms at 16kHz
 const HISTORY_SAMPLES   = 960000  // 60s ring buffer — matches VAD internal size
 function find_mic() {
 	const candidates = [
 		['parec',   ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
 		['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
 	]
 	for (const [cmd, args] of candidates) {
 		try {
 			execSync(`which ${cmd}`, { stdio: 'ignore' })
 			return [cmd, args]
 		} catch { /* try next */ }
 	}
 	throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)')
 }
 function s16le_to_f32(buf) {
 	const out = new Float32Array(buf.length / 2)
 	for (let i = 0; i < out.length; i++) {
 		out[i] = buf.readInt16LE(i * 2) / 32768.0
 	}
 	return out
 }
 export class Stt {
 	constructor({
 		models_dir   = DEFAULT_MODELS_DIR,
 		whisper_name = 'base.en',
 		provider     = 'cuda',
 	} = {}) {
 		this._whisper_dir  = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
 		this._whisper_name = whisper_name
 		this._vad_model    = path.join(models_dir, 'silero_vad.onnx')
 		this._provider     = provider
 		this._recognizer   = null
 		this._vad          = null
 		this._history      = new Float32Array(HISTORY_SAMPLES)
 		this._history_pos  = 0  // total samples fed, monotonically increasing
 	}
 	init() {
 		const sherpa = require('sherpa-onnx-node')
 		const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
 		this._recognizer = new sherpa.OfflineRecognizer({
 			featConfig: { sampleRate: 16000, featureDim: 80 },
 			modelConfig: {
 				whisper: {
 					encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
 					decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
 				},
 				tokens:     path.join(wdir, `${wname}-tokens.txt`),
 				numThreads: 2,
 				provider:   _provider,
 				debug:      false,
 			},
 		})
 		// VAD runs on CPU — silero_vad.onnx is tiny
 		this._vad = new sherpa.Vad({
 			sileroVad: {
 				model:              _vad_model,
 				threshold:          0.5,
 				minSilenceDuration: 0.5,   // seconds of silence to end an utterance
 				minSpeechDuration:  0.1,   // ignore sub-100ms blips
 				windowSize:         512,   // 32ms @ 16kHz
 				maxSpeechDuration:  30.0,
 			},
 			sampleRate: 16000,
 			numThreads: 1,
 			provider:   'cpu',
 			debug:      false,
 		}, 60)  // 60-second internal ring buffer
 	}
 	_transcribe(samples) {
 		const stream = this._recognizer.createStream()
 		stream.acceptWaveform({ samples, sampleRate: 16000 })
 		this._recognizer.decode(stream)
 		return this._recognizer.getResult(stream).text.trim()
 	}
 	_with_preroll(seg) {
 		const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
 		const pre_len   = seg.start - pre_start
 		if (pre_len === 0) return seg.samples
 		const out = new Float32Array(pre_len + seg.samples.length)
 		for (let i = 0; i < pre_len; i++) {
 			out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
 		}
 		out.set(seg.samples, pre_len)
 		return out
 	}
 	/**
 	 * Start listening. Calls on_text(text) for each detected utterance.
 	 * Returns a stop() function.
 	 */
 	listen(on_text, { on_audio } = {}) {
 		const [cmd, args] = find_mic()
 		process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`)
 		const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })
 		const VAD_WIN = 512  // must match windowSize above
 		let pending = Buffer.alloc(0)
 		mic.stdout.on('data', (chunk) => {
 			if (on_audio) on_audio()
 			pending = Buffer.concat([pending, chunk])
 			// Feed complete VAD windows
 			while (pending.length >= VAD_WIN * 2) {
 				const win = pending.subarray(0, VAD_WIN * 2)
 				pending   = pending.subarray(VAD_WIN * 2)
 				const f32 = s16le_to_f32(win)
 				// Write to history ring buffer for pre-roll
 				const base = this._history_pos % HISTORY_SAMPLES
 				for (let i = 0; i < f32.length; i++) {
 					this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
 				}
 				this._history_pos += f32.length
 				this._vad.acceptWaveform(f32)
 			}
 			// Drain any complete speech segments
 			while (!this._vad.isEmpty()) {
 				const seg = this._vad.front()
 				this._vad.pop()
 				const text = this._transcribe(this._with_preroll(seg))
 				if (text) on_text(text)
 			}
 		})
 		mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`))
 		mic.on('close', code => {
 			if (code !== null && code !== 0) {
 				process.stderr.write(`[stt] mic exited with code ${code}\n`)
 			}
 		})
 		return () => mic.kill()
 	}
 }
--- a/lib/tts-client.mjs
+++ b/lib/tts-client.mjs
@@ -0,0 +1,51 @@
 /**
 * TTS HTTP client — speaks text via a running tts-server.mjs instance.
 *
 * Usage:
 *   const tts = new Tts_Client()
 *   await tts.speak('Hello world.')
 *   await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
 *   const { voices, current } = await tts.list_voices()
 *   await tts.set_voice('rommie')
 *
 * The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500.
 */
 const DEFAULT_URL = process.env.TTS_URL ?? 'http://192.168.2.99:11500'
 export class Tts_Client {
 	constructor({ url = DEFAULT_URL } = {}) {
 		this._url = url
 	}
 	async speak(text, opts = {}) {
 		const res = await fetch(`${this._url}/speak`, {
 			method:  'POST',
 			headers: { 'Content-Type': 'application/json' },
 			body:    JSON.stringify({ text, ...opts }),
 		})
 		if (!res.ok) {
 			const data = await res.json().catch(() => ({}))
 			throw new Error(data.error ?? `HTTP ${res.status}`)
 		}
 	}
 	async list_voices() {
 		const res = await fetch(`${this._url}/voices`)
 		if (!res.ok) throw new Error(`HTTP ${res.status}`)
 		return res.json()   // { voices: [{name, description, active}], current }
 	}
 	async set_voice(name) {
 		const res = await fetch(`${this._url}/voice`, {
 			method:  'POST',
 			headers: { 'Content-Type': 'application/json' },
 			body:    JSON.stringify({ name }),
 		})
 		if (!res.ok) {
 			const data = await res.json().catch(() => ({}))
 			throw new Error(data.error ?? `HTTP ${res.status}`)
 		}
 		return res.json()   // { ok, name, path }
 	}
 }
--- a/lib/tts.mjs
+++ b/lib/tts.mjs
@@ -0,0 +1,107 @@
 /**
 * Text-to-Speech using sherpa-onnx-node with the Kokoro model.
 *
 * Model layout expected under models_dir:
 *   kokoro-en-v0_19/
 *     model.onnx
 *     voices.bin
 *     tokens.txt
 *     lexicon-us-en.txt
 *     espeak-ng-data/
 *
 * Audio output goes to pacat (PulseAudio) as float32le @ 24kHz.
 * speak_streaming() splits on sentence boundaries and synthesizes and plays
 * one sentence at a time, so the first sentence starts playing before the
 * rest have been synthesized.
 */
 import { spawn } from 'node:child_process'
 import { createRequire } from 'node:module'
 import * as path from 'node:path'
 const require = createRequire(import.meta.url)
 const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
 // Speaker IDs in kokoro-en-v0_19
 export const VOICES = {
 	af_heart:   0,
 	af_bella:   1,
 	af_nicole:  2,
 	af_sarah:   3,
 	af_sky:     4,
 	am_adam:    5,
 	am_michael: 6,
 }
 export class Tts {
 	constructor({
 		models_dir = DEFAULT_MODELS_DIR,
 		voice      = 'af_heart',
 		speed      = 1.0,
 		provider   = 'cpu',   // Kokoro synthesis is fast enough on CPU
 	} = {}) {
 		this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19')
 		this._voice      = voice
 		this._speed      = speed
 		this._provider   = provider
 		this._tts        = null
 	}
 	init() {
 		const sherpa = require('sherpa-onnx-node')
 		const { _kokoro_dir: dir, _provider } = this
 		this._tts = new sherpa.OfflineTts({
 			model: {
 				kokoro: {
 					model:   path.join(dir, 'model.onnx'),
 					voices:  path.join(dir, 'voices.bin'),
 					tokens:  path.join(dir, 'tokens.txt'),
 					dataDir: path.join(dir, 'espeak-ng-data'),
 				},
 			},
 			maxNumSentences: 2,
 			numThreads:      2,
 			debug:           false,
 			provider:        _provider,
 		})
 	}
 	/** Synthesize and play a single piece of text. Resolves when playback ends. */
 	async speak(text) {
 		const sid   = VOICES[this._voice] ?? 0
 		const audio = this._tts.generate({ text, sid, speed: this._speed })
 		await this._play(audio.samples, audio.sampleRate)
 	}
 	/**
 	 * Split text on sentence boundaries and synthesize + play each sentence
 	 * in sequence. Lower latency than waiting for full synthesis first.
 	 */
 	async speak_streaming(text) {
 		for (const sentence of split_sentences(text)) {
 			if (sentence.trim()) {
 				await this.speak(sentence.trim())
 			}
 		}
 	}
 	_play(samples, sample_rate) {
 		return new Promise((resolve, reject) => {
 			const proc = spawn('pacat', [
 				'--format=float32le',
 				`--rate=${sample_rate}`,
 				'--channels=1',
 			], { stdio: ['pipe', 'ignore', 'inherit'] })
 			proc.on('error', reject)
 			proc.on('close', resolve)
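 			// Respect the view's offset and length in case samples does not span its whole ArrayBuffer
 			proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))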
 			proc.stdin.end()
 		})
 	}
 }
 function split_sentences(text) {
 	// Split after .  !  ? — keep the punctuation with the preceding sentence
 	return text.split(/(?<=[.!?])\s+/)
 }
--- a/listen.mjs
+++ b/listen.mjs
@@ -0,0 +1,22 @@
 /**
 * Listen with Whisper and print transcribed text to stdout.
 *
 * Usage:
 *   node listen.mjs
 *   node listen.mjs small.en
 *   node listen.mjs large-v3-turbo
 */
 import { Stt } from './lib/stt.mjs'
 const whisper_name = process.argv[2] ?? 'base.en'
 const stt = new Stt({ whisper_name })
 stt.init()
 process.stderr.write(`[listen] whisper model: ${whisper_name}\n`)
 process.stderr.write(`[listen] listening — Ctrl+C to stop\n`)
 stt.listen((text) => {
 	process.stdout.write(text + '\n')
 })
--- a/package.json
+++ b/package.json
@@ -0,0 +1,13 @@
 {
 	"name": "voice-experiment",
 	"version": "0.1.0",
 	"type": "module",
 	"scripts": {
 		"download": "bash download-models.sh",
 		"setup-venv": "bash setup-venv.sh"
 	},
 	"dependencies": {
 		"sherpa-onnx-node": "^1.12.14",
 		"js-yaml": "^4.1.0"
 	}
 }
--- a/query-demo.mjs
+++ b/query-demo.mjs
@@ -0,0 +1,123 @@
 /**
 * Voice query demo — accumulates STT utterances into a query, checks
 * completeness with a local LLM classifier, then dispatches to Claude Code.
 *
 * Usage:
 *   node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--silence-timeout ms] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs]
 *
 * If --window-id and --claude-remote are supplied, completed queries are
 * dispatched to Claude Code. Otherwise they are only logged to stderr.
 *
 * A query is considered complete when any of the following holds:
 *   - The classifier says the latest fragment is complete
 *   - The latest fragment is a send-word (go/done/send)
 *   - No new fragment arrives within SILENCE_TIMEOUT ms (default 6000)
 */
 import { execFileSync }    from 'node:child_process'
 import { Stt }              from './lib/stt.mjs'
 import { Tts_Client }       from './lib/tts-client.mjs'
 import { is_query_complete } from './lib/local-query-complete.mjs'
 function get_arg(name) {
 	const i = process.argv.indexOf(name)
 	return i !== -1 ? process.argv[i + 1] : null
 }
 const audio_prompt   = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav'
 const whisper_name   = get_arg('--whisper')      ?? 'base.en'
 const window_id      = get_arg('--window-id')
 const claude_remote  = get_arg('--claude-remote')
 const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000', 10)
 const tts = new Tts_Client()
 const stt = new Stt({ whisper_name })
 stt.init()
 process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`)
 process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`)
 if (window_id && claude_remote) {
 	process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`)
 } else {
 	process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`)
 }
 await tts.speak('Ready for input.', { audio_prompt })
 function make_prompt(query) {
 	return [
 		'[voice-buddy] You have received a voice query.',
 		'',
 		'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.',
 		'If the task is a question, speak the answer directly.',
 		'If the task is an action, carry it out and then speak a short confirmation.',
 		'',
 		'--- Query ---',
 		query,
 		'--- End of query ---',
 	].join('\n')
 }
 function dispatch(query) {
 	execFileSync('node', [claude_remote, '--window-id', window_id], {
 		input:    make_prompt(query),
 		encoding: 'utf8',
 		stdio:    ['pipe', 'inherit', 'inherit'],
 	})
 }
 async function submit(query) {
 	query = query.trim()
 	if (!query) {
 		return
 	}
 	process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`)
 	if (window_id && claude_remote) {
 		dispatch(query)
 	}
 	await tts.speak('Efforting.', { audio_prompt })
 }
 const SEND_WORDS = new Set(['go', 'done', 'send'])
 let accumulated    = ''
 let silence_timer  = null
 function reset_silence_timer() {
 	clearTimeout(silence_timer)
 	silence_timer = setTimeout(async () => {
 		if (!accumulated) {
 			return
 		}
 		process.stderr.write('[query-demo] silence timeout — submitting\n')
 		const query = accumulated
 		accumulated = ''
 		await submit(query)
 	}, SILENCE_TIMEOUT)
 }
 stt.listen(async (text) => {
 	accumulated = accumulated ? `${accumulated}\n${text}` : text
 	process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`)
 	reset_silence_timer()
 	const last_line = accumulated.split('\n').at(-1)
 	const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '')
 	const forced    = SEND_WORDS.has(last_norm)
 	if (!forced) {
 		const complete = await is_query_complete(last_line)
 		if (!complete) {
 			return
 		}
 	}
 	clearTimeout(silence_timer)
 	const lines = accumulated.split('\n')
 	const query = (forced ? lines.slice(0, -1) : lines).join('\n')
 	accumulated = ''
 	await submit(query)
 }, { on_audio: () => { if (accumulated) reset_silence_timer() } })
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,11 @@
 # STT
 faster-whisper>=1.1.0
 # TTS (already used in kokoro-tts-test)
 kokoro>=0.9.4
 # Audio capture
 sounddevice>=0.4.6
 # Numpy (usually already present)
 numpy>=1.24.0
--- a/setup-venv.sh
+++ b/setup-venv.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Creates a venv in ./venv and installs Python deps for Bark and Chatterbox.
 # Safe to re-run — venv creation is skipped if ./venv exists; pip install still runs each time.
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 VENV="${SCRIPT_DIR}/venv"
 if [ ! -d "${VENV}" ]; then
 	echo "==> creating venv at ${VENV}"
 	python3 -m venv "${VENV}"
 fi
 echo "==> installing Python dependencies"
 "${VENV}/bin/pip" install --upgrade pip --quiet
 "${VENV}/bin/pip" install transformers accelerate torch numpy
 "${VENV}/bin/pip" install chatterbox-tts
 chmod +x "${SCRIPT_DIR}/bark-server.py"
 chmod +x "${SCRIPT_DIR}/chatterbox-server.py"
 echo ""
 echo "==> venv ready at ${VENV}"
--- a/speak-as.mjs
+++ b/speak-as.mjs
@@ -0,0 +1,47 @@
 /**
 * Read text from stdin and speak it using a voice reference clip.
 *
 * Usage:
 *   echo "Hello world" | node speak-as.mjs /path/to/voice.wav
 *   cat script.txt    | node speak-as.mjs /path/to/voice.wav
 *   node speak-as.mjs /path/to/voice.wav   (then type, Ctrl+D when done)
 */
 import * as fs       from 'node:fs'
 import * as readline from 'node:readline'
 import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
 const audio_prompt = process.argv[2]
 if (!audio_prompt) {
 	process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n')
 	process.exit(1)
 }
 if (!fs.existsSync(audio_prompt)) {
 	process.stderr.write(`File not found: ${audio_prompt}\n`)
 	process.exit(1)
 }
 const tts = new Chatterbox_Tts()
 await tts.init()
 const rl = readline.createInterface({
 	input:    process.stdin,
 	output:   process.stdout,
 	terminal: process.stdin.isTTY,
 })
 const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()
 let last_norm = ''
 for await (const line of rl) {
 	const text = line.trim()
 	if (!text) continue
 	const n = norm(text)
 	if (n === last_norm) continue
 	last_norm = n
 	await tts.speak(text, { audio_prompt })
 }
 tts.stop()
--- a/tts-server.mjs
+++ b/tts-server.mjs
@@ -0,0 +1,133 @@
 /**
 * TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API.
 * Requests are serialized so generation and playback stay in order.
 *
 * Usage:
 *   node tts-server.mjs
 *   TTS_PORT=11500 node tts-server.mjs
 *
 * API:
 *   POST /speak   { "text": "...", "audio_prompt": "/path/to/voice.wav", ... }
 *                 → 200 { "ok": true }  (after generation, playback continues in background)
 *   GET  /voices  → 200 { "voices": [{ "name": "rommie", "description": "...", "active": true }, ...], "current": "rommie" | null }
 *   POST /voice   { "name": "rommie" }
 *                 → 200 { "ok": true, "name": "rommie", "path": "..." }
 *   GET  /health  → 200 { "ok": true }
 */
 import * as http from 'node:http'
 import * as fs   from 'node:fs'
 import * as path from 'node:path'
 import yaml from 'js-yaml'
 import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
 const PORT        = parseInt(process.env.TTS_PORT ?? '11500')
 const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml')
 function reload_voices() {
 	try {
 		const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8'))
 		return doc?.voices ?? {}
 	} catch {
 		return {}
 	}
 }
 let voices = reload_voices()
 let current_voice = null  // name of active voice, or null
 // --- TTS setup ---
 const tts = new Chatterbox_Tts()
 process.stderr.write('[tts-server] starting chatterbox...\n')
 await tts.init()
 process.stderr.write('[tts-server] chatterbox ready\n')
 // Serialize all speak requests through a promise chain
 let queue = Promise.resolve()
 function enqueue(fn) {
 	const result = queue.then(fn)
 	// Don't let a failed request poison the queue
 	queue = result.catch(() => {})
 	return result
 }
 function read_body(req) {
 	return new Promise((resolve, reject) => {
 		let buf = ''
 		req.on('data', chunk => { buf += chunk })
 		req.on('end', () => {
 			try { resolve(JSON.parse(buf)) } catch (e) { reject(e) }
 		})
 		req.on('error', reject)
 	})
 }
 function send(res, status, body) {
 	const payload = JSON.stringify(body)
 	res.writeHead(status, { 'Content-Type': 'application/json' })
 	res.end(payload)
 }
 const server = http.createServer(async (req, res) => {
 	if (req.method === 'GET' && req.url === '/health') {
 		return send(res, 200, { ok: true })
 	}
 	if (req.method === 'GET' && req.url === '/voices') {
 		voices = reload_voices()
 		const list = Object.entries(voices).map(([name, v]) => ({
 			name,
 			description: v.description ?? '',
 			active: name === current_voice,
 		}))
 		return send(res, 200, { voices: list, current: current_voice })
 	}
 	if (req.method === 'POST' && req.url === '/voice') {
 		let body
 		try { body = await read_body(req) } catch {
 			return send(res, 400, { error: 'invalid JSON' })
 		}
 		const { name } = body
 		if (!name) return send(res, 400, { error: 'name required' })
 		voices = reload_voices()
 		if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` })
 		current_voice = name
 		process.stderr.write(`[tts-server] voice switched to: ${name}\n`)
 		return send(res, 200, { ok: true, name, path: voices[name].path })
 	}
 	if (req.method === 'POST' && req.url === '/speak') {
 		let body
 		try {
 			body = await read_body(req)
 		} catch {
 			return send(res, 400, { error: 'invalid JSON' })
 		}
 		const { text, ...opts } = body
 		if (!text) {
 			return send(res, 400, { error: 'text required' })
 		}
 		// Inject current voice as default audio_prompt if none provided
 		if (!opts.audio_prompt && current_voice && voices[current_voice]) {
 			opts.audio_prompt = voices[current_voice].path
 		}
 		try {
 			await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
 			send(res, 200, { ok: true })
 		} catch (err) {
 			send(res, 500, { error: err.message })
 		}
 		return
 	}
 	send(res, 404, { error: 'not found' })
 })
 server.listen(PORT, () => {
 	process.stderr.write(`[tts-server] listening on port ${PORT}\n`)
 })
--- a/voice-buddy.mjs
+++ b/voice-buddy.mjs
@@ -0,0 +1,65 @@
 /**
 * Voice buddy — reads voice queries from stdin (JSON lines) and dispatches
 * them to a Claude Code instance via claude-remote.
 *
 * Usage:
 *   node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs
 *
 * Options:
 *   --window-id <id>       X11 window ID of Claude Code (decimal or 0x hex)
 *   --claude-remote <path> Path to claude-remote.mjs
 */
 import * as readline   from 'node:readline'
 import { execFileSync } from 'node:child_process'
 const args = process.argv.slice(2)
 function get_arg(name) {
 	const i = args.indexOf(name)
 	if (i === -1 || i + 1 >= args.length) {
 		process.stderr.write(`Missing argument: ${name}\n`)
 		process.exit(1)
 	}
 	return args[i + 1]
 }
 const window_id     = get_arg('--window-id')
 const claude_remote = get_arg('--claude-remote')
 function make_prompt(query) {
 	return [
 		'[voice-buddy] You have received a voice query.',
 		'',
 		'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.',
 		'',
 		'--- Query ---',
 		query,
 		'--- End of query ---',
 	].join('\n')
 }
 function dispatch(query) {
 	const prompt = make_prompt(query)
 	process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`)
 	execFileSync('node', [claude_remote, '--window-id', window_id], {
 		input:    prompt,
 		encoding: 'utf8',
 		stdio:    ['pipe', 'inherit', 'inherit'],
 	})
 }
 const rl = readline.createInterface({ input: process.stdin })
 for await (const line of rl) {
 	const trimmed = line.trim()
 	if (!trimmed) {
 		continue
 	}
 	try {
 		const query = JSON.parse(trimmed)
 		dispatch(query)
 	} catch {
 		process.stderr.write(`[voice-buddy] bad JSON line: ${trimmed}\n`)
 	}
 }
--- a/voices.yaml
+++ b/voices.yaml
@@ -0,0 +1,54 @@
 # Named voices for TTS server.
 # Switch via: POST /voice {"name": "rommie"}
 # List via:   GET  /voices
 voices:
  rommie:
    path: /home/devilholk/Documents/rommie-sample.wav
    description: Rommie - Andromeda ship AI / android
  archangel:
    path: /home/devilholk/Documents/archangel.ogg
    description: Archangel from Airwolf
  marella:
    path: /home/devilholk/Documents/murella.ogg
    description: Marella from Airwolf
  chuin:
    path: /home/devilholk/Documents/chuin.ogg
    description: Chiun from Remo Williams - The Adventure Begins
  harold-w-smith:
    path: /home/devilholk/Documents/cure-leader.ogg
    description: Harold W. Smith from Remo Williams - The Adventure Begins
  penny-parker:
    path: /home/devilholk/Documents/penny.ogg
    description: Penny from MacGyver
  santini:
    path: /home/devilholk/Documents/santini.ogg
    description: Santini from Airwolf
  gosalyn:
    path: /home/devilholk/Documents/gasalin.ogg
    description: Gosalyn from Darkwing Duck
  darkwing-duck:
    path: /home/devilholk/Documents/dwd.ogg
    description: Darkwing Duck
  belitski:
    path: /home/devilholk/Documents/slipstream2.ogg
    description: Belitski from Slipstream
  ivanova:
    path: /home/devilholk/Documents/ivanova.ogg
    description: Susan Ivanova from Babylon 5
  sheila:
    path: /home/devilholk/Documents/darkness-woman.ogg
    description: Sheila from Army of Darkness
 # Add more voices as clips become available