From db8889aeed4745862891e73109cc08289253e609 Mon Sep 17 00:00:00 2001 From: mikael-lovqvists-claude-agent Date: Sat, 30 May 2026 04:48:54 +0000 Subject: [PATCH] =?UTF-8?q?Initial=20commit=20=E2=80=94=20voice=20pipeline?= =?UTF-8?q?=20experiment?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos. Co-Authored-By: Claude Sonnet 4.6 --- .gitignore | 11 ++ CLEANUP-PLAN.md | 159 ++++++++++++++++++++++++++++ LLM-ROUTING.md | 51 +++++++++ NOTES.md | 161 ++++++++++++++++++++++++++++ SPEAKER-DIARIZATION.md | 80 ++++++++++++++ TODO.md | 41 +++++++ VOICES.md | 29 +++++ acting-demo-bark.mjs | 135 +++++++++++++++++++++++ acting-demo-chatterbox.mjs | 109 +++++++++++++++++++ acting-demo.mjs | 163 ++++++++++++++++++++++++++++ bark-server.py | 135 +++++++++++++++++++++++ chatterbox-server.py | 200 +++++++++++++++++++++++++++++++++++ demo-bark.mjs | 85 +++++++++++++++ demo-kokoro.mjs | 85 +++++++++++++++ demos/darkwing-briefing.sh | 24 +++++ demos/threat-assessment.sh | 21 ++++ demos/wrong-meeting.sh | 24 +++++ download-models.sh | 63 +++++++++++ lib/bark-tts.mjs | 112 ++++++++++++++++++++ lib/chatterbox-tts.mjs | 101 ++++++++++++++++++ lib/llm.mjs | 64 +++++++++++ lib/local-query-complete.mjs | 43 ++++++++ lib/markdown.mjs | 66 ++++++++++++ lib/stt.mjs | 170 +++++++++++++++++++++++++++++ lib/tts-client.mjs | 51 +++++++++ lib/tts.mjs | 107 +++++++++++++++++++ listen.mjs | 22 ++++ package.json | 13 +++ query-demo.mjs | 123 +++++++++++++++++++++ requirements.txt | 11 ++ setup-venv.sh | 24 +++++ speak-as.mjs | 47 ++++++++ tts-server.mjs | 133 +++++++++++++++++++++++ voice-buddy.mjs | 65 ++++++++++++ voices.yaml | 54 ++++++++++ 35 files changed, 2782 insertions(+) create mode 100644 .gitignore create mode 100644 CLEANUP-PLAN.md create 
mode 100644 LLM-ROUTING.md create mode 100644 NOTES.md create mode 100644 SPEAKER-DIARIZATION.md create mode 100644 TODO.md create mode 100644 VOICES.md create mode 100644 acting-demo-bark.mjs create mode 100644 acting-demo-chatterbox.mjs create mode 100644 acting-demo.mjs create mode 100755 bark-server.py create mode 100755 chatterbox-server.py create mode 100644 demo-bark.mjs create mode 100644 demo-kokoro.mjs create mode 100755 demos/darkwing-briefing.sh create mode 100755 demos/threat-assessment.sh create mode 100755 demos/wrong-meeting.sh create mode 100755 download-models.sh create mode 100644 lib/bark-tts.mjs create mode 100644 lib/chatterbox-tts.mjs create mode 100644 lib/llm.mjs create mode 100644 lib/local-query-complete.mjs create mode 100644 lib/markdown.mjs create mode 100644 lib/stt.mjs create mode 100644 lib/tts-client.mjs create mode 100644 lib/tts.mjs create mode 100644 listen.mjs create mode 100644 package.json create mode 100644 query-demo.mjs create mode 100644 requirements.txt create mode 100755 setup-venv.sh create mode 100644 speak-as.mjs create mode 100644 tts-server.mjs create mode 100644 voice-buddy.mjs create mode 100644 voices.yaml diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..1895cb3 --- /dev/null +++ b/.gitignore @@ -0,0 +1,11 @@ +node_modules/ +models/ +venv/ +*.ogg +*.wav +*.mp3 +*.onnx +*.bin +package-lock.json +__pycache__/ +*.pyc diff --git a/CLEANUP-PLAN.md b/CLEANUP-PLAN.md new file mode 100644 index 0000000..5dba735 --- /dev/null +++ b/CLEANUP-PLAN.md @@ -0,0 +1,159 @@ +# Voice Project — Cleanup Plan + +> [!NOTE] +> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below. + +The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup. 
+ +--- + +## Current state + +### Active pipeline (keep) + +| File | Role | +|------|------| +| `query-demo.mjs` | Main entry point — STT → classifier → dispatch | +| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) | +| `voices.yaml` | Named voice config | +| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer | +| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py | +| `lib/tts-client.mjs` | HTTP client for tts-server.mjs | +| `lib/local-query-complete.mjs` | Ollama classifier | +| `chatterbox-server.py` | Long-running Python TTS process | +| `listen.mjs` | Standalone mic → stdout tool | +| `speak-as.mjs` | Standalone stdin → TTS tool | +| `demos/` | Multi-voice demo scripts | +| `download-models.sh` | Whisper + VAD model download | +| `setup-venv.sh` | Python venv setup (needs rewrite — see below) | +| `models/` | Downloaded STT models | +| `venv/` | Python venv (not committed) | + +### Dead / experimental (remove or archive) + +| File | Reason | +|------|--------| +| `acting-demo-bark.mjs` | Bark TTS — abandoned | +| `acting-demo-chatterbox.mjs` | One-off demo, superseded | +| `acting-demo.mjs` | Old demo | +| `bark-server.py` | Bark backend — abandoned | +| `demo-bark.mjs` | Bark demo | +| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned | +| `lib/bark-tts.mjs` | Bark wrapper | +| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) | +| `lib/markdown.mjs` | Unused? Verify before deleting | +| `lib/llm.mjs` | Unused? Verify before deleting | +| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used | +| `voice-buddy.mjs` | Merged into query-demo.mjs — delete | + +--- + +## What needs to be documented / fixed + +### 1. `setup-venv.sh` — rewrite + +Current script installs Bark deps. The active backend is Chatterbox. 
New script should: +- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy) +- HuggingFace token instruction: write token to `~/.secrets/hugging-face.token` +- Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached + +### 2. `requirements.txt` — replace or delete + +Currently lists kokoro and faster-whisper (unused). Either: +- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs +- Or just delete it — `setup-venv.sh` is the source of truth + +### 3. sherpa-onnx / CUDA build — document clearly + +The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss. + +### 4. System dependencies — list them + +Nothing currently lists what needs to be installed at the OS level: +- `parec` / `pacat` (PulseAudio) — audio I/O +- `cmake` — for optional CUDA sherpa-onnx rebuild +- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT +- `ollama` running locally with `qwen2.5:3b` pulled — for classifier +- `jq` — used in shell functions and demo scripts +- Node.js (version? check what's required) +- Python 3 with venv support + +### 5. External services — document + +- **TTS server**: runs on the host (not in container), port 11500 +- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing +- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server + +### 6. `voices.yaml` — keep and document + +Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music). 
+ +--- + +## Proposed new top-level structure + +``` +voice/ + README.md ← start here: what this is, quick start + SETUP.md ← full setup: OS deps, npm install, venv, CUDA, models + NOTES.md ← architecture deep-dives (keep as-is) + TODO.md ← keep as-is + VOICES.md ← wishlist (keep as-is) + voices.yaml ← runtime voice config + query-demo.mjs ← main entry point + tts-server.mjs ← TTS HTTP server + listen.mjs ← standalone STT tool + speak-as.mjs ← standalone TTS tool + chatterbox-server.py + setup-venv.sh ← rewritten for Chatterbox only + download-models.sh + package.json + lib/ + stt.mjs + chatterbox-tts.mjs + tts-client.mjs + local-query-complete.mjs + demos/ + *.sh + models/ ← gitignored + venv/ ← gitignored +``` + +--- + +## Cleanup steps (in order) + +1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports +2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files +3. **Rewrite** `setup-venv.sh` for Chatterbox only +4. **Replace** `requirements.txt` with accurate one or delete +5. **Write** `SETUP.md` covering all install steps end-to-end +6. **Write** `README.md` with quick start +7. **Add** voice cloning tips to top of `voices.yaml` +8. **Commit** the whole cleanup as a single commit + +--- + +## Open questions + +- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete? +- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)? + +--- + +## New architecture (revised plan) + +Keep this project as a reference. 
Build clean standalone repos instead: + +| Repo | Contents | Depends on | +|------|----------|-----------| +| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox | +| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models | +| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules | + +Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD). + +### Open questions for new architecture + +- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects. +- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time. diff --git a/LLM-ROUTING.md b/LLM-ROUTING.md new file mode 100644 index 0000000..d9c5d1b --- /dev/null +++ b/LLM-ROUTING.md @@ -0,0 +1,51 @@ +# LLM Routing Strategy + +When to use a local model vs an Anthropic frontier model in the voice pipeline. + +--- + +## Current model inventory + +| Role | Model | Where | VRAM | +|------|-------|-------|------| +| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB | +| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB | +| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB | +| Main reasoning | Claude (Anthropic) | API / Claude Code | — | + +Total local VRAM: ~6–7 GB on RTX 3090 (24 GB). 
Plenty of headroom. + +--- + +## Routing principles + +### Use local model when: +- Task is **classification or routing** (completeness, intent detection, command vs question) +- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning +- Task is **high-frequency** (runs on every utterance — billing would add up) +- Privacy matters — query content shouldn't leave the machine + +### Use Anthropic frontier model when: +- Task requires **reasoning, synthesis, or judgment** +- Task involves **code generation or editing** +- Task involves **world knowledge** beyond what a small model handles well +- Query is **low-frequency** (once per user request, not once per audio frame) +- Quality matters more than latency + +--- + +## GPU split (future option) + +If voice models and reasoning models compete for VRAM, split across cards: +- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed +- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API + +Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards. + +--- + +## Upgrade candidates + +- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB. +- **Completeness classifier → larger model** — Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification. +- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack. diff --git a/NOTES.md b/NOTES.md new file mode 100644 index 0000000..18f5848 --- /dev/null +++ b/NOTES.md @@ -0,0 +1,161 @@ +# Voice Experiment Notes + +Real-time voice pipeline: microphone → STT → (LLM) → TTS. 
Everything runs locally, no internet after initial model download. + +--- + +## Architecture + +``` +parec (PulseAudio) + └─ lib/stt.mjs Silero VAD + Whisper → text on stdout + └─ sherpa-onnx-node + └─ libsherpa-onnx-c-api.so (needs CUDA build — see below) + └─ libonnxruntime.so (system, with CUDA) + +text → lib/chatterbox-tts.mjs + └─ chatterbox-server.py long-running Python TTS server + └─ chatterbox-tts (pip) ResembleAI Chatterbox model +``` + +--- + +## STT + +**Library:** `sherpa-onnx-node` npm package +**Models:** Whisper (offline batch) + Silero VAD +**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50` + +### Model download + +```bash +bash download-models.sh base.en # or: tiny.en small.en medium.en large-v3 large-v3-turbo +``` + +Models land in `models/`: +- `silero_vad.onnx` +- `sherpa-onnx-whisper-base.en/` + +### How it works + +`lib/stt.mjs` feeds 512-sample (32ms) windows into Silero VAD. When VAD signals speech end (after 500ms silence), the segment is sent to Whisper for transcription. The transcription is written to stdout — **note: the native addon writes directly to fd 1 (bypassing Node.js process.stdout), not via the JavaScript callback**. The JS callback is also called and can be used for additional processing. + +### Getting CUDA + +The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source: + +```bash +# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman +git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx + +export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime +export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib + +cd ~/Foreign/sherpa-onnx +cmake -B build \ + -DCMAKE_BUILD_TYPE=Release \ + -DBUILD_SHARED_LIBS=ON \ + -DSHERPA_ONNX_ENABLE_GPU=ON \ + -DSHERPA_ONNX_ENABLE_PYTHON=OFF \ + -DSHERPA_ONNX_ENABLE_TESTS=OFF \ + . 
+ +cmake --build build --target sherpa-onnx-c-api + +# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used) +NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64 +ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so +rm -f $NMOD/libonnxruntime.so +``` + +The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library. + +--- + +## TTS + +**Library:** `chatterbox-tts` Python package (ResembleAI) +**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox` +**Playback:** `pacat --format=float32le --rate=24000 --channels=1` + +### Setup + +```bash +bash setup-venv.sh # creates ./venv, installs chatterbox-tts + deps +``` + +HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`. + +### Variants + +| Variant | RTF (RTX 3090) | Notes | +|---------|---------------|-------| +| turbo | ~0.2 | default, female voice, no exaggeration param | +| full | slower | male voice, supports `exaggeration` (0.0–1.0) | + +### Paralinguistic tags + +Work in both variants: +``` +[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp] +``` + +### Voice cloning + +Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length. +```javascript +await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' }) +``` + +The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue. + +### Server protocol + +`chatterbox-server.py` is a long-running process. 
Communication: +- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}` +- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background) +- **stderr:** status/timing logs + +Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes. + +### Acting tips + +- Emphasis: `UPPERCASE` words +- Pause/rhythm: `...` `—` +- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling) +- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }` +- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent) + +--- + +## Tools + +| Script | Usage | +|--------|-------| +| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) | +| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference | +| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) | + +### Pipe example + +```bash +node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav +``` + +`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation. + +--- + +## Known Quirks + +**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`: +```python +_orig = _librosa.resample +_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32) +``` + +**HuggingFace offline:** `from_pretrained()` always hits the internet. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead. + +**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking. + +**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. 
This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage). diff --git a/SPEAKER-DIARIZATION.md b/SPEAKER-DIARIZATION.md new file mode 100644 index 0000000..b1cd99b --- /dev/null +++ b/SPEAKER-DIARIZATION.md @@ -0,0 +1,80 @@ +# Speaker Diarization + +Notes on adding speaker identification to the voice pipeline. + +--- + +## What Whisper does and doesn't do + +Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels. + +--- + +## Adding diarization + +Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript. + +### pyannote.audio + +The most capable open-source option. Runs locally, GPU-accelerated. + +``` +pip install pyannote.audio +``` + +Produces time-stamped segments: +``` +SPEAKER_00 0.5s – 3.2s +SPEAKER_01 3.8s – 7.1s +SPEAKER_00 7.5s – 12.0s +``` + +Feed each segment to Whisper independently → transcript with speaker tags. + +Requires a HuggingFace token and model access request (free). + +### sherpa-onnx + +Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency. + +--- + +## Speaker identification vs diarization + +| Feature | Diarization | Identification | +|---------|------------|----------------| +| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) | +| Requires enrollment | ✗ | ✓ (reference clip per person) | +| Cross-session consistency | ✗ | ✓ | + +For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. 
Options: + +- **resemblyzer** — simple cosine similarity on d-vector embeddings +- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack) +- **speechbrain** — full toolkit, heavier dependency + +--- + +## Future experiments + +### Multi-speaker transcription +Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript: +``` +Mikael: How do I juggle oranges? +Claude: Start with two balls and work up from there. +``` + +### Real-time diarization +Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation. + +### Voice enrollment +Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label. + +### Wake word per speaker +Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer. + +--- + +## VRAM considerations + +pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card. diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..4283bbb --- /dev/null +++ b/TODO.md @@ -0,0 +1,41 @@ +# Voice Experiment — TODO + +## Classifier +- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input. 
+- [ ] Tune classifier — still too conservative on plain statements and short instructions +- [ ] Experiment with larger classifier model (Qwen 7B or 14B) + +## Voice switching +- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch. +- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path. +- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list. +- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume. + +## Pipeline +- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing +- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt +- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional. +- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS. 
+- [ ] Reserved control vocabulary — single words, pattern matched before any classifier: + + | Word(s) | Action | + |---------|--------| + | `computer` | Generic activate — start accumulating a query, play acknowledgement chime | + | `project note` | Activate with project context injected into the prompt preamble | + | `go` / `send` / `done` | Dispatch current query immediately | + | `cancel` / `scratch that` | Discard accumulated query, return to idle | + | `stop` / `goodbye` / `hang up` | Shut down the session cleanly | + | `switch voice` / `different voice` | Trigger voice-switch flow | + + Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content. + +- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer. + +## TTS / Long text +- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`. + +## TTS / Prompt +- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here. + +## STT +- [ ] Try Whisper small.en or medium.en for better accuracy on accents diff --git a/VOICES.md b/VOICES.md new file mode 100644 index 0000000..b64702e --- /dev/null +++ b/VOICES.md @@ -0,0 +1,29 @@ +# Voice Clone Wishlist + +Voices to prepare reference clips for. 
+ +## Ready + +| Voice | Character / Show | Notes | +|-------|-----------------|-------| +| Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well | +| Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly authoritative delivery, dry clean speech | + +## To Do + +| Voice | Character / Show | Notes | +|-------|-----------------|-------| +| Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality | +| Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum | +| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise | +| Steve Bacic | Andromeda (Telemachus Rhade) | — | +| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice | +| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed | +| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready | + +## Notes + +- Animated/cartoon voices (e.g. 
Darkwing Duck) don't clone well — too far outside natural human speech distribution +- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction +- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both +- Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone diff --git a/acting-demo-bark.mjs b/acting-demo-bark.mjs new file mode 100644 index 0000000..86737a5 --- /dev/null +++ b/acting-demo-bark.mjs @@ -0,0 +1,135 @@ +/** + * Demonstrates acting and expressiveness with Bark TTS. + * + * Run: node acting-demo-bark.mjs [start-section] + * + * Bark expressive techniques: + * UPPERCASE — stress/emphasis on a word + * ... — hesitation, trailing off + * [laughs] — laughter + * [sighs] — sigh + * [gasps] — sharp intake of breath + * [clears throat] — throat clearing + * ♪ text ♪ — sung phrase + * — — abrupt break + * ! — raised energy + * + * Voice presets (English): + * v2/en_speaker_0 calm female + * v2/en_speaker_1 calm male + * v2/en_speaker_3 deep male + * v2/en_speaker_6 neutral/warm (default) + * v2/en_speaker_9 expressive + */ + +import { Bark_Tts } from './lib/bark-tts.mjs' + +const tts = new Bark_Tts() + +process.stderr.write('[bark] loading model...\n') +await tts.init() +process.stderr.write('[bark] ready\n\n') + +const START = parseInt(process.argv[2] ?? '1', 10) + +function section(title) { + process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`) +} + +async function say(text, voice = undefined) { + const opts = { preprocess: false } + if (voice) opts.voice = voice + process.stdout.write(` ${voice ?? 'default'}\n "${text}"\n\n`) + await tts.speak(text, opts) +} + +// ─── 1. Baseline ─────────────────────────────────────── +section('1. Baseline') +if (START <= 1) { + await say('The package has arrived. 
Please sign here.') +} + +// ─── 2. Paralinguistic tokens ────────────────────────── +section('2. Paralinguistic tokens') +if (START <= 2) { + await say('[clears throat] Right. Where were we.') + await say('And then he just... [sighs] gave up.') + await say('[gasps] I had no idea you were here!') + await say("Ha. [laughs] That's the funniest thing I've heard all week.") +} + +// ─── 3. Emphasis via UPPERCASE ───────────────────────── +section('3. Emphasis via UPPERCASE') +if (START <= 3) { + await say('You need to stop. RIGHT NOW.') + await say('I have told you THIS BEFORE.') + await say('This is ABSOLUTELY unacceptable.') +} + +// ─── 4. Hesitation and rhythm ────────────────────────── +section('4. Hesitation and rhythm') +if (START <= 4) { + await say('It was... quiet. Too quiet.') + await say('I just... I don\'t know what to say.') + await say('He said he was fine — but his eyes told a different story.') +} + +// ─── 5. Sung phrase ──────────────────────────────────── +section('5. Singing') +if (START <= 5) { + await say('♪ Happy birthday to you, happy birthday to you ♪') +} + +// ─── 6. Acronyms ─────────────────────────────────────── +section('6. Acronyms') +if (START <= 6) { + // Without spacing — Bark may try to pronounce as a word + await say('The FBI is investigating.') + await say('NASA launched a new rocket.') + await say('I work with the CPU and GPU.') + // With letter spacing — forces spelling out + await say('The F B I is investigating.') + await say('N A S A launched a new rocket.') + await say('I work with the C P U and G P U.') +} + +// ─── 7. Voice character ──────────────────────────────── +section('7. Voice presets — same line') +if (START <= 7) { + const line = "Well, that's certainly one way to look at it." + for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) { + await say(line, voice) + } +} + +// ─── 8. Combined scene ───────────────────────────────── +section('8. 
Combined — a tense scene') +if (START <= 8) { + await say('Something is wrong.', 'v2/en_speaker_9') + await say('I can FEEL it.', 'v2/en_speaker_9') + await say('[gasps] Run!', 'v2/en_speaker_9') +} + +// ─── 9. Paragraph — voices 7-9 ───────────────────────── +// Pass voices as CLI args to override, e.g.: +// node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5 +const PARA_VOICES = process.argv.slice(3).length + ? process.argv.slice(3) + : ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9'] + +const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. ' + + '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. ' + + 'Her hands trembled as she tore open the envelope. ' + + 'Inside was a single sheet of paper with just THREE words written on it. ' + + '[gasps] She read them twice... then sat down slowly, and stared at the wall.' + +section('9. Paragraph — voice comparison') +if (START <= 9) { + for (const voice of PARA_VOICES) { + process.stdout.write(`\n — ${voice} —\n`) + await say(PARAGRAPH, voice) + } +} + +tts.stop() +section('Done') diff --git a/acting-demo-chatterbox.mjs b/acting-demo-chatterbox.mjs new file mode 100644 index 0000000..93fd4ab --- /dev/null +++ b/acting-demo-chatterbox.mjs @@ -0,0 +1,109 @@ +/** + * Acting demo for Chatterbox TTS. + * + * Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full] + * + * Paralinguistic tags: + * [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp] + * + * Exaggeration (full model only, 0.0-1.0): + * 0.0 = neutral, 1.0 = maximum emotion intensity + */ + +import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs' + +const START = parseInt(process.argv[2] ?? '1', 10) +const VARIANT = process.argv[3] ?? 
'turbo' + +const tts = new Chatterbox_Tts({ variant: VARIANT }) + +process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`) +await tts.init() +process.stderr.write('[chatterbox] ready\n\n') + +function section(title) { + process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`) +} + +async function say(text, opts = {}) { + process.stdout.write(` "${text}"\n`) + if (Object.keys(opts).length) process.stdout.write(` ${JSON.stringify(opts)}\n`) + process.stdout.write('\n') + await tts.speak(text, { preprocess: false, ...opts }) +} + +// ─── 1. Baseline ─────────────────────────────────────── +section('1. Baseline') +if (START <= 1) { + await say('The package has arrived. Please sign here.') +} + +// ─── 2. Paralinguistic tags ──────────────────────────── +section('2. Paralinguistic tags') +if (START <= 2) { + await say('[cough] Right, where were we.') + await say('And then he just... [sigh] gave up.') + await say('[gasp] I had no idea you were here!') + await say('Ha! [laugh] That is the funniest thing I have heard all week.') + await say('[chuckle] Well, that is certainly one way to look at it.') + await say('[groan] Not this again.') +} + +// ─── 3. Punctuation and rhythm ───────────────────────── +section('3. Punctuation and rhythm') +if (START <= 3) { + await say('It was... quiet. Too quiet.') + await say('He said he was fine — but his eyes told a different story.') + await say('I told you. I told you this would happen. But did anyone listen? No.') +} + +// ─── 4. Exaggeration — full model only ───────────────── +section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`) +if (START <= 4) { + await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 }) + await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 }) + await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 }) +} + +// ─── 5. 
Temperature ──────────────────────────────────── +section('5. Temperature (lower = more predictable)') +if (START <= 5) { + await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 }) + await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 }) + await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 }) +} + +// ─── 6. Paragraph ────────────────────────────────────── +section('6. Paragraph with tags') +if (START <= 6) { + await say( + 'It was a cold evening when the letter finally arrived. ' + + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. ' + + 'Her hands trembled as she tore open the envelope. ' + + '[gasp] Inside was a single sheet of paper with just three words written on it.' + ) +} + +// ─── 7. Audio prompt / voice cloning ─────────────────── +// Pass a WAV file path as the third argument (process.argv[4]): +// node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav +section('7. Audio prompt (voice cloning)') +if (START <= 7) { + const audio_prompt = process.argv[4] + if (!audio_prompt) { + process.stdout.write(' Skipped — pass a WAV file as the third argument:\n') + process.stdout.write(' node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n') + process.stdout.write(' Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n') + } else { + process.stdout.write(` voice: ${audio_prompt}\n\n`) + await say('Hello. This is a test of voice cloning.', { audio_prompt }) + await say( + 'It was a cold evening when the letter finally arrived. ' + + '[sigh] She had been waiting for months...
not knowing whether the news would be good — or bad.', + { audio_prompt } + ) + } +} + +tts.stop() +section('Done') diff --git a/acting-demo.mjs b/acting-demo.mjs new file mode 100644 index 0000000..78de577 --- /dev/null +++ b/acting-demo.mjs @@ -0,0 +1,163 @@ +/** + * Demonstrates prosody and acting techniques with Kokoro TTS. + * + * Run: node acting-demo.mjs [start-section] + * E.g.: node acting-demo.mjs 5 + * + * Kokoro's expressive range comes from four levers: + * + * 1. Voice choice — each voice has a baked-in emotional character + * 2. Speed — slower = deliberate/grave, faster = excited/happy + * 3. silenceScale — controls pause length between sentences + * 4. Punctuation & markup — espeak-ng honours some SSML and typographic cues + * + * SSML tags that pass through espeak-ng: + * — explicit pause + * x — louder/longer (model-dependent) + * x — local rate change + * — force pronunciation + * + * Typographic cues that affect prosody without SSML: + * UPPERCASE words — some stress + * Ellipsis ... — trailing pause / trailing off + * Em-dash — — abrupt break + * Comma placement — micro-pauses + * Exclamation ! — raised energy + */ + +import { Tts } from './lib/tts.mjs' +import { spawn } from 'node:child_process' + +const tts = new Tts() +tts.init() + +// Overriding the internal _play is the simplest way to control +// per-utterance GenerationConfig without reworking the class. + +async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) { + const { createRequire } = await import('node:module') + const require = createRequire(import.meta.url) + const sherpa = require('sherpa-onnx-node') + + const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 } + const sid = VOICES[voice] ?? 
0 + + process.stdout.write(` voice=${voice} speed=${speed} silence=${silence_scale}\n "${text}"\n\n`) + + const audio = tts._tts.generate({ + text, + generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }), + }) + + await play(audio.samples, audio.sampleRate) +} + +// Synthesize multiple phrase/option pairs and play them as one continuous audio +async function say_concat(parts) { + const { createRequire } = await import('node:module') + const require = createRequire(import.meta.url) + const sherpa = require('sherpa-onnx-node') + const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 } + + let total_len = 0 + const chunks = [] + + for (const [text, opts = {}] of parts) { + const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts + const sid = VOICES[voice] ?? 0 + const audio = tts._tts.generate({ + text, + generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }), + }) + process.stdout.write(` [${speed}x] "${text}"\n`) + chunks.push(audio.samples) + total_len += audio.samples.length + } + + const combined = new Float32Array(total_len) + let offset = 0 + for (const chunk of chunks) { + combined.set(chunk, offset) + offset += chunk.length + } + + await play(combined, tts._tts.sampleRate) + process.stdout.write('\n') +} + +function play(samples, sample_rate) { + return new Promise((resolve, reject) => { + const proc = spawn('pacat', [ + '--format=float32le', `--rate=${sample_rate}`, '--channels=1', + ], { stdio: ['pipe', 'ignore', 'inherit'] }) + proc.on('error', reject) + proc.on('close', resolve) + proc.stdin.write(Buffer.from(samples.buffer)) + proc.stdin.end() + }) +} + +const START = parseInt(process.argv[2] ?? '1', 10) + +function section(title) { + process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`) +} + +// ─── 1. Baseline ─────────────────────────────────────── +section('1. 
Baseline — neutral af_heart') +if (START <= 1) { + await say('The package has arrived. Please sign here.') +} + +// ─── 2. Speed as emotion ─────────────────────────────── +section('2. Speed — slow (grave) vs fast (excited)') +if (START <= 2) { + await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 }) + await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 }) +} + +// ─── 3. Voice character ──────────────────────────────── +section('3. Voice character — same line, different voices') +if (START <= 3) { + const line = "Well, that's certainly one way to look at it." + for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) { + await say(line, { voice }) + } +} + +// ─── 4. Punctuation-driven prosody ───────────────────── +section('4. Punctuation and typographic cues') +if (START <= 4) { + await say('I told you. I told you this would happen. But did anyone listen? No.') + await say('It was... quiet. Too quiet.') + await say('He said he was fine — but his eyes told a different story.') + await say('This is ABSOLUTELY unacceptable.') +} + +// ─── 5. Per-phrase speed — manual emphasis ───────────── +// SSML tags are not supported by this backend; emphasis is achieved by +// synthesizing key phrases at a different speed and concatenating the audio. +section('5. Per-phrase speed (manual emphasis)') +if (START <= 5) { + // "stop" spoken slower and then the rest faster — sounds like emphasis + await say_concat([ + ['You need to ', { speed: 1.0 }], + ['stop.', { speed: 0.7 }], + ['Right now.', { speed: 1.1 }], + ]) + await say_concat([ + ['Listen very carefully.', { speed: 0.75 }], + ['I will only say this once.', { speed: 0.9 }], + ]) +} + +// ─── 6. Combining levers ─────────────────────────────── +section('6.
Combined — a tense scene') +if (START <= 6) { + await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 }) + await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 }) + await say(' Run!', { speed: 1.3, voice: 'af_sky' }) +} + +section('Done') +process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n') diff --git a/bark-server.py b/bark-server.py new file mode 100755 index 0000000..c82b36e --- /dev/null +++ b/bark-server.py @@ -0,0 +1,135 @@ +#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"' +""" +Bark TTS server — keeps model loaded, reads JSON lines from stdin. + +Protocol: + stdin: one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"} + stdout: "ok\n" after each utterance is finished playing + stderr: status/timing messages + +Usage: + python bark-server.py [model] [voice] + python bark-server.py suno/bark-small v2/en_speaker_3 + +Voices (English): + v2/en_speaker_0 .. v2/en_speaker_9 + Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male +""" + +import os +import sys +import json +import time +import subprocess +import numpy as np + +MODEL = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark' +DEF_VOICE = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6' +SAMPLE_RATE = 24000 + +TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token') +try: + with open(TOKEN_FILE) as f: + os.environ['HF_TOKEN'] = f.read().strip() +except FileNotFoundError: + pass + +# Disable background safetensors conversion attempts — the HF API endpoint +# it calls has changed and now errors. The pytorch_model.bin files work fine. 
+os.environ['SAFETENSORS_FAST_GPU'] = '0' +os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1') + +def log(msg): + print(f'[bark] {msg}', file=sys.stderr, flush=True) + +log(f'loading {MODEL}...') +t0 = time.time() + +import torch +from transformers import AutoProcessor, BarkModel + +device = 'cuda' if torch.cuda.is_available() else 'cpu' +dtype = torch.float16 if device == 'cuda' else torch.float32 + +if device == 'cuda': + # TF32 is faster on Ampere (30xx, 40xx) for both matmul and convolutions + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + +processor = AutoProcessor.from_pretrained(MODEL) +model = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device) + +# Bark's internal generation calls set both max_length and max_new_tokens simultaneously +# which causes a conflict warning and may cause early stopping. Clearing max_length +# lets max_new_tokens take sole control. +model.generation_config.max_length = None + +# torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile) +try: + model = torch.compile(model, mode='reduce-overhead') + log('torch.compile applied') +except Exception as e: + log(f'torch.compile skipped: {e}') + +log(f'ready on {device} ({time.time() - t0:.1f}s load time)') +print('ready', flush=True) # signal to Node that we are ready + + +def speak(text, voice): + t1 = time.time() + + inputs = processor(text, voice_preset=voice, return_tensors='pt') + + # Set attention_mask if not provided by processor — without it the model + # cannot distinguish padding from real tokens, which causes erratic generation + # including leading filler sounds. 
+ if 'attention_mask' not in inputs: + inputs['attention_mask'] = torch.ones_like(inputs['input_ids']) + + inputs = {k: v.to(device) for k, v in inputs.items()} + + with torch.inference_mode(): + audio = model.generate( + **inputs, + semantic_temperature = 0.3, + coarse_temperature = 0.3, + fine_temperature = 0.3, + ) + + samples = audio.cpu().numpy().squeeze().astype(np.float32) + elapsed_gen = time.time() - t1 + duration = len(samples) / SAMPLE_RATE + log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s rtf={elapsed_gen/duration:.2f}') + + proc = subprocess.Popen( + ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'], + stdin=subprocess.PIPE, + ) + proc.stdin.write(samples.tobytes()) + proc.stdin.close() + proc.wait() + + +for line in sys.stdin: + line = line.strip() + if not line: + continue + + try: + req = json.loads(line) + text = req.get('text', '') + voice = req.get('voice', DEF_VOICE) + except json.JSONDecodeError: + text = line + voice = DEF_VOICE + + if not text: + print('ok', flush=True) + continue + + try: + speak(text, voice) + except Exception as e: + log(f'error: {e}') + + print('ok', flush=True) diff --git a/chatterbox-server.py b/chatterbox-server.py new file mode 100755 index 0000000..8236314 --- /dev/null +++ b/chatterbox-server.py @@ -0,0 +1,200 @@ +#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"' +""" +Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin. 
+ +Protocol: + stdin: {"text": "...", "temperature": 0.8, "top_p": 0.95} + stdout: "ok\n" after each utterance is generated (playback may still be in progress) + stderr: status/timing messages + +Usage: + ./chatterbox-server.py + ./chatterbox-server.py turbo # default + ./chatterbox-server.py full # original model, supports exaggeration + +Paralinguistic tags supported in text: + [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp] + +Full model only: + exaggeration 0.0-1.0 emotion intensity (ignored in turbo) +""" + +import os +import sys +import json +import time +import queue +import threading +import subprocess +import numpy as np + +TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token') +try: + with open(TOKEN_FILE) as f: + os.environ['HF_TOKEN'] = f.read().strip() +except FileNotFoundError: + pass + +def find_hf_cache(repo_id): + """Return the local snapshot path if the model is already cached, else None.""" + from pathlib import Path + cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub' + repo_dir = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots' + if repo_dir.exists(): + snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime) + if snapshots: + return str(snapshots[-1]) + return None + +VARIANT = sys.argv[1] if len(sys.argv) > 1 else 'turbo' +SAMPLE_RATE = 24000 + +def log(msg): + print(f'[chatterbox] {msg}', file=sys.stderr, flush=True) + +log(f'loading chatterbox-{VARIANT}...') +t0 = time.time() + +import tempfile +import traceback +import numpy as np +import torch +import soundfile as sf +import librosa as _librosa + +# librosa.resample returns float64 in newer numpy — patch it to always return float32 +_orig_resample = _librosa.resample +def _resample_float32(*args, **kwargs): + return _orig_resample(*args, **kwargs).astype(np.float32) +_librosa.resample = _resample_float32 + +device = 'cuda' if torch.cuda.is_available() else 'cpu' + 
+REPO_IDS = { + 'turbo': 'ResembleAI/chatterbox-turbo', + 'full': 'ResembleAI/chatterbox', +} + +if VARIANT == 'turbo': + from chatterbox.tts_turbo import ChatterboxTurboTTS as Model +else: + from chatterbox.tts import ChatterboxTTS as Model + +cached = find_hf_cache(REPO_IDS[VARIANT]) +if cached: + log(f'loading from cache: {cached}') + model = Model.from_local(cached, device=device) +else: + log('cache not found, downloading...') + model = Model.from_pretrained(device=device) + +log(f'ready on {device} ({time.time() - t0:.1f}s load time)') +print('ready', flush=True) + + +_wav_cache = {} + +def ensure_float32_wav(path): + """Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue. + Result is cached by input path so repeated calls with the same file are free.""" + if path in _wav_cache: + return _wav_cache[path] + wav, sr = sf.read(path, dtype='float32', always_2d=True) + wav = wav.mean(axis=1) # stereo → mono if needed + tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False) + sf.write(tmp.name, wav, sr, subtype='FLOAT') + _wav_cache[path] = tmp.name + return tmp.name + + +_SENTINEL = object() + +playback_queue = queue.Queue() + + +def playback_worker(): + """Plays audio samples in order. 
Runs in its own thread.""" + while True: + item = playback_queue.get() + if item is _SENTINEL: + break + samples = item + proc = subprocess.Popen( + ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'], + stdin=subprocess.PIPE, + ) + proc.stdin.write(samples.tobytes()) + proc.stdin.close() + proc.wait() + playback_queue.task_done() + + +playback_thread = threading.Thread(target=playback_worker, daemon=True) +playback_thread.start() + + +def generate(text, opts): + t1 = time.time() + + if VARIANT == 'turbo': + kwargs = { + 'temperature': opts.get('temperature', 0.8), + 'top_p': opts.get('top_p', 0.95), + 'top_k': opts.get('top_k', 1000), + 'repetition_penalty': opts.get('repetition_penalty', 1.2), + 'min_p': opts.get('min_p', 0.0), + } + else: + kwargs = { + 'temperature': opts.get('temperature', 0.8), + 'top_p': opts.get('top_p', 1.0), + 'repetition_penalty': opts.get('repetition_penalty', 1.2), + 'min_p': opts.get('min_p', 0.05), + 'exaggeration': opts.get('exaggeration', 0.5), + 'cfg_weight': opts.get('cfg_weight', 0.5), + } + + audio_prompt = opts.get('audio_prompt') + if audio_prompt: + kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt) + + with torch.inference_mode(): + wav = model.generate(text, **kwargs) + + samples = wav.squeeze(0).cpu().numpy().astype(np.float32) + elapsed = time.time() - t1 + duration = len(samples) / SAMPLE_RATE + log(f'generated {duration:.1f}s audio in {elapsed:.1f}s rtf={elapsed/duration:.2f}') + + return samples + + +for line in sys.stdin: + line = line.strip() + if not line: + continue + + try: + req = json.loads(line) + text = req.pop('text', '') + opts = req # remaining fields are generation options + except json.JSONDecodeError: + text = line + opts = {} + + if not text: + print('ok', flush=True) + continue + + try: + samples = generate(text, opts) + playback_queue.put(samples) + except Exception as e: + log(f'error: {e}') + traceback.print_exc(file=sys.stderr) + + print('ok', flush=True) + +# 
Drain playback before exit +playback_queue.put(_SENTINEL) +playback_thread.join() diff --git a/demo-bark.mjs b/demo-bark.mjs new file mode 100644 index 0000000..8f99a8c --- /dev/null +++ b/demo-bark.mjs @@ -0,0 +1,85 @@ +/** + * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS + * + * Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs], + * [clears throat]), and hesitation (...). Markdown is preprocessed automatically: + * **bold** → UPPERCASE, headers get a pause, etc. + * + * Environment variables: + * WHISPER_MODEL default: base.en + * options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo + * BARK_MODEL default: suno/bark (use suno/bark-small for faster/smaller) + * BARK_VOICE default: v2/en_speaker_6 + * options: v2/en_speaker_0 .. v2/en_speaker_9 + * STT_PROVIDER default: cuda + * USE_LLM set to 0 to disable (default: 1) + * OLLAMA_MODEL default: phi3:mini + */ + +import { Stt } from './lib/stt.mjs' +import { Bark_Tts } from './lib/bark-tts.mjs' +import { llm_available, list_models, cleanup } from './lib/llm.mjs' + +const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en' +const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda' +const USE_LLM = process.env.USE_LLM !== '0' + +process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`) +const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER }) +stt.init() +process.stderr.write('[stt] ready\n') + +process.stderr.write('[tts] loading Bark (this takes a moment)...\n') +const tts = new Bark_Tts() +await tts.init() +process.stderr.write('[tts] Bark ready\n') + +let has_llm = false +if (USE_LLM) { + has_llm = await llm_available() + if (has_llm) { + const models = await list_models() + process.stderr.write(`[llm] ollama available. 
models: ${models.slice(0, 6).join(', ')}\n`) + } else { + process.stderr.write('[llm] ollama not reachable, skipping cleanup\n') + } +} + +process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`) +process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n') + +let speaking = false + +const stop = stt.listen(async (raw_text) => { + if (speaking) { + process.stderr.write(`[skipped] ${raw_text}\n`) + return + } + speaking = true + + try { + process.stdout.write(`[raw] ${raw_text}\n`) + + let text = raw_text + if (has_llm) { + text = await cleanup(raw_text) + if (text !== raw_text) { + process.stdout.write(`[llm] ${text}\n`) + } + } + + await tts.speak_streaming(text) + process.stdout.write('\n') + } catch (err) { + process.stderr.write(`[error] ${err.message}\n`) + } finally { + speaking = false + } +}) + +process.on('SIGINT', () => { + process.stderr.write('\n[voice] stopping\n') + stop() + tts.stop() + process.exit(0) +}) diff --git a/demo-kokoro.mjs b/demo-kokoro.mjs new file mode 100644 index 0000000..f5169e7 --- /dev/null +++ b/demo-kokoro.mjs @@ -0,0 +1,85 @@ +/** + * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS + * + * Environment variables: + * WHISPER_MODEL default: base.en + * options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo + * VOICE default: af_heart + * options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael + * STT_PROVIDER default: cuda + * USE_LLM set to 0 to disable (default: 1) + * OLLAMA_MODEL default: phi3:mini + */ + +import { Stt } from './lib/stt.mjs' +import { Tts, VOICES } from './lib/tts.mjs' +import { llm_available, list_models, cleanup } from './lib/llm.mjs' + +const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en' +const VOICE = process.env.VOICE || 'af_heart' +const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda' +const USE_LLM = process.env.USE_LLM !== '0' + +if (!(VOICE in VOICES)) 
{ + process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`) + process.exit(1) +} + +process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`) +const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER }) +stt.init() +process.stderr.write('[stt] ready\n') + +process.stderr.write('[tts] loading Kokoro...\n') +const tts = new Tts({ voice: VOICE }) +tts.init() +process.stderr.write('[tts] ready\n') + +let has_llm = false +if (USE_LLM) { + has_llm = await llm_available() + if (has_llm) { + const models = await list_models() + process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`) + } else { + process.stderr.write('[llm] ollama not reachable, skipping cleanup\n') + } +} + +process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`) +process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n') + +let speaking = false + +const stop = stt.listen(async (raw_text) => { + if (speaking) { + process.stderr.write(`[skipped] ${raw_text}\n`) + return + } + speaking = true + + try { + process.stdout.write(`[raw] ${raw_text}\n`) + + let text = raw_text + if (has_llm) { + text = await cleanup(raw_text) + if (text !== raw_text) { + process.stdout.write(`[llm] ${text}\n`) + } + } + + await tts.speak_streaming(text) + process.stdout.write('\n') + } catch (err) { + process.stderr.write(`[error] ${err.message}\n`) + } finally { + speaking = false + } +}) + +process.on('SIGINT', () => { + process.stderr.write('\n[voice] stopping\n') + stop() + process.exit(0) +}) diff --git a/demos/darkwing-briefing.sh b/demos/darkwing-briefing.sh new file mode 100755 index 0000000..004af7f --- /dev/null +++ b/demos/darkwing-briefing.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +# Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. 
Smith, Marella, Rommie + +TTS="${TTS_URL:-http://192.168.2.99:11500}" +switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; } +say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; } + +switch darkwing-duck +say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover." + +switch gosalyn +say "Dad, you're at the WRONG briefing AGAIN." + +switch harold-w-smith +say "How did a duck get into this facility." + +switch marella +say "The same way everything else does, sir. The paperwork looked official." + +switch rommie +say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation." + +switch darkwing-duck +say "I love it when a plan comes together. Let's get dangerous." diff --git a/demos/threat-assessment.sh b/demos/threat-assessment.sh new file mode 100755 index 0000000..97cf890 --- /dev/null +++ b/demos/threat-assessment.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +# Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini + +TTS="${TTS_URL:-http://192.168.2.99:11500}" +switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; } +say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; } + +switch ivanova +say "We have an unidentified vessel on an intercept course. Threat assessment?" + +switch rommie +say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists." + +switch archangel +say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive." + +switch chuin +say "Are you? I have been wondering this myself for some time." + +switch santini +say "I'm gonna need a bigger hangar." 
diff --git a/demos/wrong-meeting.sh b/demos/wrong-meeting.sh new file mode 100755 index 0000000..9a3a256 --- /dev/null +++ b/demos/wrong-meeting.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +# Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker + +TTS="${TTS_URL:-http://192.168.2.99:11500}" +switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; } +say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; } + +switch harold-w-smith +say "This meeting never happened. None of you are here, and neither am I." + +switch ivanova +say "That's extremely reassuring. Last time someone said that to me, a space station blew up." + +switch archangel +say "Technically it was a controlled demolition. The paperwork was filed in triplicate." + +switch rommie +say "I have reviewed the paperwork. It was filed AFTER the explosion." + +switch penny-parker +say "You know, I walked into the wrong building again, didn't I." + +switch harold-w-smith +say "Yes. But you saw nothing." diff --git a/download-models.sh b/download-models.sh new file mode 100755 index 0000000..06a8d23 --- /dev/null +++ b/download-models.sh @@ -0,0 +1,63 @@ +#!/usr/bin/env bash +# Download ONNX models for the voice experiment. +# +# Usage: +# bash download-models.sh # default: whisper base.en +# WHISPER_MODEL=large-v3-turbo bash download-models.sh +# +# After downloading, run: node demo-kokoro.mjs + +set -euo pipefail + +WHISPER_MODEL="${WHISPER_MODEL:-base.en}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +MODELS_DIR="${SCRIPT_DIR}/models" + +mkdir -p "${MODELS_DIR}" +cd "${MODELS_DIR}" + +echo "==> models dir: ${MODELS_DIR}" + +# --- Silero VAD (~2 MB) --- +if [ !
-f silero_vad.onnx ]; then + echo "==> downloading silero_vad.onnx" + curl -L --progress-bar -o silero_vad.onnx \ + 'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx' +else + echo "==> silero_vad.onnx already present" +fi + +# --- Whisper STT --- +WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}" +if [ ! -d "${WHISPER_DIR}" ]; then + echo "==> downloading whisper-${WHISPER_MODEL}" + curl -L --progress-bar -o "${WHISPER_DIR}.tar.bz2" \ + "https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2" + echo "==> extracting..." + tar xf "${WHISPER_DIR}.tar.bz2" +else + echo "==> whisper-${WHISPER_MODEL} already present" +fi + +# --- Kokoro TTS (~310 MB) --- +if [ ! -d kokoro-en-v0_19 ]; then + echo "==> downloading kokoro-en-v0_19 (English TTS)" + curl -L --progress-bar -o kokoro-en-v0_19.tar.bz2 \ + 'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2' + echo "==> extracting..." + tar xf kokoro-en-v0_19.tar.bz2 +else + echo "==> kokoro-en-v0_19 already present" +fi + +echo "" +echo "==> all models ready" +echo "" +echo "Next steps:" +echo " cd ${SCRIPT_DIR}" +echo " npm install" +echo " node demo-kokoro.mjs" +echo "" +echo "To use a better STT model:" +echo " WHISPER_MODEL=large-v3-turbo bash download-models.sh" +echo " WHISPER_MODEL=large-v3-turbo node demo-kokoro.mjs" diff --git a/lib/bark-tts.mjs b/lib/bark-tts.mjs new file mode 100644 index 0000000..79d90e8 --- /dev/null +++ b/lib/bark-tts.mjs @@ -0,0 +1,112 @@ +/** + * Bark TTS — Node.js wrapper around bark-server.py. + * + * Spawns the Python server once, keeps it alive, sends requests as JSON lines. + * Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
+ * + * Usage: + * const tts = new Bark_Tts() + * await tts.init() // spawns server, waits for model load + * await tts.speak('Hello') + * await tts.speak('**Very** important point.') // emphasis via CAPS + * tts.stop() + * + * Environment variables: + * BARK_MODEL HuggingFace model id (default: suno/bark) + * BARK_VOICE voice preset (default: v2/en_speaker_6) + * + * Bark voice presets (English): + * v2/en_speaker_0 calm female + * v2/en_speaker_1 calm male + * v2/en_speaker_3 deep male + * v2/en_speaker_6 neutral/warm (default) + * v2/en_speaker_9 expressive + */ + +import { spawn } from 'node:child_process' +import * as path from 'node:path' +import * as readline from 'node:readline' +import { markdown_to_bark, split_sentences } from './markdown.mjs' + +const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark' +const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6' +const SERVER = path.join(import.meta.dirname, '..', 'bark-server.py') + +export class Bark_Tts { + constructor({ + model = BARK_MODEL, + voice = BARK_VOICE, + } = {}) { + this._model = model + this._voice = voice + this._proc = null + this._rl = null + this._resolve = null // resolver for the current in-flight request + } + + /** Spawn bark-server.py and wait until it signals "ready". */ + init() { + return new Promise((resolve, reject) => { + this._proc = spawn(SERVER, [this._model, this._voice], { + stdio: ['pipe', 'pipe', 'inherit'], + }) + + this._proc.on('error', reject) + this._proc.on('close', (code) => { + if (code !== 0 && code !== null) { + process.stderr.write(`[bark] server exited with code ${code}\n`) + } + }) + + this._rl = readline.createInterface({ input: this._proc.stdout }) + + this._rl.on('line', (line) => { + if (line === 'ready') { + resolve() + return + } + if (line === 'ok' && this._resolve) { + const res = this._resolve + this._resolve = null + res() + } + }) + }) + } + + /** Preprocess markdown and speak as a single request. 
*/ + async speak(text, { voice = this._voice, preprocess = true } = {}) { + const clean = preprocess ? markdown_to_bark(text) : text + return this._send(clean, voice) + } + + /** + * Preprocess markdown and speak sentence by sentence. + * Lower latency — first sentence starts playing while rest are queued. + */ + async speak_streaming(text, opts = {}) { + const clean = opts.preprocess !== false ? markdown_to_bark(text) : text + const sentences = split_sentences(clean) + for (const s of sentences) { + await this._send(s, opts.voice ?? this._voice) + } + } + + _send(text, voice) { + return new Promise((resolve, reject) => { + if (!this._proc) { + return reject(new Error('Bark_Tts not initialized — call init() first')) + } + this._resolve = resolve + const payload = JSON.stringify({ text, voice }) + '\n' + this._proc.stdin.write(payload) + }) + } + + stop() { + this._rl?.close() + this._proc?.kill() + this._proc = null + this._rl = null + } +} diff --git a/lib/chatterbox-tts.mjs b/lib/chatterbox-tts.mjs new file mode 100644 index 0000000..e78c326 --- /dev/null +++ b/lib/chatterbox-tts.mjs @@ -0,0 +1,101 @@ +/** + * Chatterbox TTS — Node.js wrapper around chatterbox-server.py. + * + * Usage: + * const tts = new Chatterbox_Tts() + * await tts.init() + * await tts.speak('Hello! 
[chuckle] That is funny.') + * await tts.speak('Something intense.', { exaggeration: 0.8 }) // full model only + * tts.stop() + * + * Paralinguistic tags (embed in text): + * [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp] + * + * Generation options (passed as second arg to speak()): + * temperature 0.05–2.0 default 0.8 + * top_p 0–1 default 0.95 + * top_k 0–1000 default 1000 + * repetition_penalty ≥1.0 default 1.2 + * min_p 0–1 default 0.0 + * audio_prompt string path to reference WAV for voice cloning + * exaggeration 0–1 emotion intensity (full model only) + * cfg_weight 0–1 classifier-free guidance (full model only) + */ + +import { spawn } from 'node:child_process' +import * as path from 'node:path' +import * as readline from 'node:readline' +import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs' + +const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py') + +export class Chatterbox_Tts { + constructor({ + variant = 'turbo', // 'turbo' or 'full' + } = {}) { + this._variant = variant + this._proc = null + this._rl = null + this._resolve = null + } + + init() { + return new Promise((resolve, reject) => { + this._proc = spawn(SERVER, [this._variant], { + stdio: ['pipe', 'pipe', 'inherit'], + }) + + this._proc.on('error', reject) + this._proc.on('close', (code) => { + if (code !== null && code !== 0) { + process.stderr.write(`[chatterbox] server exited with code ${code}\n`) + } + }) + + this._rl = readline.createInterface({ input: this._proc.stdout }) + this._rl.on('line', (line) => { + if (line === 'ready') { + resolve() + return + } + if (line === 'ok' && this._resolve) { + const res = this._resolve + this._resolve = null + res() + } + }) + }) + } + + async speak(text, opts = {}) { + const clean = opts.preprocess === true ? markdown_to_speech(text) : text + return this._send(clean, opts) + } + + async speak_streaming(text, opts = {}) { + const clean = opts.preprocess !== false ? 
markdown_to_speech(text) : text + const sentences = split_sentences(clean) + for (const s of sentences) { + await this._send(s, opts) + } + } + + _send(text, opts = {}) { + return new Promise((resolve, reject) => { + if (!this._proc) { + return reject(new Error('Chatterbox_Tts not initialized — call init() first')) + } + this._resolve = resolve + const { preprocess: _, ...gen_opts } = opts + const payload = JSON.stringify({ text, ...gen_opts }) + '\n' + this._proc.stdin.write(payload) + }) + } + + stop() { + this._rl?.close() + this._proc?.kill() + this._proc = null + this._rl = null + } +} diff --git a/lib/llm.mjs b/lib/llm.mjs new file mode 100644 index 0000000..5ef1e6d --- /dev/null +++ b/lib/llm.mjs @@ -0,0 +1,64 @@ +/** + * Optional LLM cleanup via Ollama REST API. + * + * Cleans up raw Whisper output: fixes punctuation, capitalization, + * removes filler words, corrects obvious recognition errors. + * + * Set OLLAMA_MODEL env var to choose the model (default: phi3:mini). + * Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434). + */ + +const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434' +const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini' + +export async function llm_available() { + try { + const res = await fetch(`${OLLAMA_URL}/api/tags`, { + signal: AbortSignal.timeout(2000), + }) + return res.ok + } catch { + return false + } +} + +export async function list_models() { + const res = await fetch(`${OLLAMA_URL}/api/tags`) + const data = await res.json() + return (data.models ?? []).map(m => m.name) +} + +/** + * Clean up a raw STT transcription. + * Returns the cleaned string, or the original if the request fails. + */ +export async function cleanup(raw_text, model = OLLAMA_MODEL) { + const prompt = [ + 'You are a transcription editor. 
Clean up this speech-to-text output:', + '- Fix punctuation and capitalization', + '- Remove meaningless filler words (um, uh, like, you know)', + '- Correct obvious word recognition errors based on context', + '- Keep the meaning and phrasing intact', + '- Return ONLY the cleaned text, no explanation, no quotes', + '', + raw_text, + ].join('\n') + + try { + const res = await fetch(`${OLLAMA_URL}/api/generate`, { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ + model, + prompt, + stream: false, + options: { temperature: 0.1, num_predict: 256 }, + }), + signal: AbortSignal.timeout(10_000), + }) + const data = await res.json() + return data.response?.trim() || raw_text + } catch { + return raw_text + } +} diff --git a/lib/local-query-complete.mjs b/lib/local-query-complete.mjs new file mode 100644 index 0000000..da3d98f --- /dev/null +++ b/lib/local-query-complete.mjs @@ -0,0 +1,43 @@ +export async function is_query_complete(query, model = 'qwen2.5:3b') { + const payload = { + model, + messages: [ + { + role: 'system', + content: ` + You are a voice query completeness classifier. Determine if the input is complete enough to act on. + + Respond only with JSON: {"complete": true} or {"complete": false} + + Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent. 
+ + Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather" + Incomplete examples: "How do I", "The thing is", "Can you" + `.replace(/^[\t ]+/gm, ''), + }, + { role: 'user', content: query }, + ], + stream: false, + format: 'json', + options: { num_predict: 10, num_ctx: 2048 }, + } + + const response = await fetch('http://192.168.2.99:11434/api/chat', { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify(payload), + }) + + if (response.ok) { + try { + const data = await response.json() + const parsed = JSON.parse(data.message.content) + return parsed.complete === true + } catch { + return true + } + } else { + // Fail open — if classifier errors, let the query through + return true + } +} diff --git a/lib/markdown.mjs b/lib/markdown.mjs new file mode 100644 index 0000000..6d601b4 --- /dev/null +++ b/lib/markdown.mjs @@ -0,0 +1,66 @@ +/** + * Convert markdown to Bark-friendly plain text. + * + * Bark emphasis conventions: + * UPPERCASE words — stress/emphasis + * ... — hesitation / trailing off + * [laughs] [sighs] — paralinguistic tokens (pass through as-is) + * + * Markdown mapping: + * **bold** / __bold__ → UPPERCASE (strong emphasis) + * *italic* / _italic_ → unchanged (soft emphasis, Bark handles naturally) + * # Heading → text + pause via ".\n" + * `code` → unchanged (spoken literally) + * ```block``` → skipped + * [text](url) → text only + * --- → "..." (pause) + * - item / 1. 
item → item text (bullet stripped) + */ + +export function markdown_to_bark(text) { + // Remove fenced code blocks entirely + text = text.replace(/```[\s\S]*?```/g, '') + + // Inline code — keep the content, drop backticks + text = text.replace(/`([^`]+)`/g, '$1') + + // Images — drop (before links, so the link rule cannot leave a stray "!") + text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '') + + // Links — keep visible text + text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1') + + // **bold** / __bold__ → UPPERCASE + text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase()) + text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase()) + + // *italic* / _italic_ — strip markers, keep text + text = text.replace(/\*([^*]+)\*/g, '$1') + text = text.replace(/_([^_]+)_/g, '$1') + + // Headings — keep text, add a beat after + text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.') + + // Horizontal rules → pause + text = text.replace(/^[-*_]{3,}\s*$/gm, '...') + + // List bullets / numbers — strip prefix + text = text.replace(/^\s*[-*+]\s+/gm, '') + text = text.replace(/^\s*\d+\.\s+/gm, '') + + // Collapse excess blank lines + text = text.replace(/\n{3,}/g, '\n\n') + + return text.trim() +} + +/** + * Split text into sentences suitable for sequential TTS. + * Keeps sentence-ending punctuation with the preceding sentence. + */ +export function split_sentences(text) { + return text + .split(/(?<=[.!?])\s+/) + .map(s => s.trim()) + .filter(s => s.length > 0) +} diff --git a/lib/stt.mjs b/lib/stt.mjs new file mode 100644 index 0000000..bb5bc80 --- /dev/null +++ b/lib/stt.mjs @@ -0,0 +1,170 @@ +/** + * Speech-to-Text using sherpa-onnx-node. + * + * Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so + * transcription only runs when a complete utterance has been detected. + * Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA). 
+ * + * Model layout expected under models_dir: + * silero_vad.onnx + * sherpa-onnx-whisper-/ + * -encoder.int8.onnx + * -decoder.int8.onnx + * -tokens.txt + * + * Available Whisper model names (pass as whisper_name): + * tiny.en base.en small.en medium.en large-v3 large-v3-turbo + * Download with: bash download-models.sh [model-name] + */ + +import { spawn, execSync } from 'node:child_process' +import { createRequire } from 'node:module' +import * as path from 'node:path' + +const require = createRequire(import.meta.url) +const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models') + +const PRE_ROLL_SAMPLES = 3200 // 200ms at 16kHz +const HISTORY_SAMPLES = 960000 // 60s ring buffer — matches VAD internal size + +function find_mic() { + const candidates = [ + ['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']], + ['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']], + ] + for (const [cmd, args] of candidates) { + try { + execSync(`which ${cmd}`, { stdio: 'ignore' }) + return [cmd, args] + } catch { /* try next */ } + } + throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)') +} + +function s16le_to_f32(buf) { + const out = new Float32Array(buf.length / 2) + for (let i = 0; i < out.length; i++) { + out[i] = buf.readInt16LE(i * 2) / 32768.0 + } + return out +} + +export class Stt { + constructor({ + models_dir = DEFAULT_MODELS_DIR, + whisper_name = 'base.en', + provider = 'cuda', + } = {}) { + this._whisper_dir = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`) + this._whisper_name = whisper_name + this._vad_model = path.join(models_dir, 'silero_vad.onnx') + this._provider = provider + this._recognizer = null + this._vad = null + this._history = new Float32Array(HISTORY_SAMPLES) + this._history_pos = 0 // total samples fed, monotonically increasing + } + + init() { + const sherpa = require('sherpa-onnx-node') + const { _whisper_dir: wdir, _whisper_name: 
wname, _vad_model, _provider } = this + + this._recognizer = new sherpa.OfflineRecognizer({ + featConfig: { sampleRate: 16000, featureDim: 80 }, + modelConfig: { + whisper: { + encoder: path.join(wdir, `${wname}-encoder.int8.onnx`), + decoder: path.join(wdir, `${wname}-decoder.int8.onnx`), + }, + tokens: path.join(wdir, `${wname}-tokens.txt`), + numThreads: 2, + provider: _provider, + debug: false, + }, + }) + + // VAD runs on CPU — silero_vad.onnx is tiny + this._vad = new sherpa.Vad({ + sileroVad: { + model: _vad_model, + threshold: 0.5, + minSilenceDuration: 0.5, // seconds of silence to end an utterance + minSpeechDuration: 0.1, // ignore sub-100ms blips + windowSize: 512, // 32ms @ 16kHz + maxSpeechDuration: 30.0, + }, + sampleRate: 16000, + numThreads: 1, + provider: 'cpu', + debug: false, + }, 60) // 60-second internal ring buffer + } + + _transcribe(samples) { + const stream = this._recognizer.createStream() + stream.acceptWaveform({ samples, sampleRate: 16000 }) + this._recognizer.decode(stream) + return this._recognizer.getResult(stream).text.trim() + } + + _with_preroll(seg) { + const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES) + const pre_len = seg.start - pre_start + if (pre_len === 0) return seg.samples + const out = new Float32Array(pre_len + seg.samples.length) + for (let i = 0; i < pre_len; i++) { + out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES] + } + out.set(seg.samples, pre_len) + return out + } + + /** + * Start listening. Calls on_text(text) for each detected utterance. + * Returns a stop() function. 
+ */ + listen(on_text, { on_audio } = {}) { + const [cmd, args] = find_mic() + process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`) + const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] }) + + const VAD_WIN = 512 // must match windowSize above + let pending = Buffer.alloc(0) + + mic.stdout.on('data', (chunk) => { + if (on_audio) on_audio() + pending = Buffer.concat([pending, chunk]) + + // Feed complete VAD windows + while (pending.length >= VAD_WIN * 2) { + const win = pending.subarray(0, VAD_WIN * 2) + pending = pending.subarray(VAD_WIN * 2) + const f32 = s16le_to_f32(win) + // Write to history ring buffer for pre-roll + const base = this._history_pos % HISTORY_SAMPLES + for (let i = 0; i < f32.length; i++) { + this._history[(base + i) % HISTORY_SAMPLES] = f32[i] + } + this._history_pos += f32.length + this._vad.acceptWaveform(f32) + } + + // Drain any complete speech segments + while (!this._vad.isEmpty()) { + const seg = this._vad.front() + this._vad.pop() + const text = this._transcribe(this._with_preroll(seg)) + if (text) on_text(text) + } + }) + + mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`)) + mic.on('close', code => { + if (code !== null && code !== 0) { + process.stderr.write(`[stt] mic exited with code ${code}\n`) + } + }) + + return () => mic.kill() + } +} diff --git a/lib/tts-client.mjs b/lib/tts-client.mjs new file mode 100644 index 0000000..4dfdc4e --- /dev/null +++ b/lib/tts-client.mjs @@ -0,0 +1,51 @@ +/** + * TTS HTTP client — speaks text via a running tts-server.mjs instance. + * + * Usage: + * const tts = new Tts_Client() + * await tts.speak('Hello world.') + * await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' }) + * const { voices, current } = await tts.list_voices() + * await tts.set_voice('rommie') + * + * The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500. + */ + +const DEFAULT_URL = process.env.TTS_URL ?? 
'http://192.168.2.99:11500' + +export class Tts_Client { + constructor({ url = DEFAULT_URL } = {}) { + this._url = url + } + + async speak(text, opts = {}) { + const res = await fetch(`${this._url}/speak`, { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ text, ...opts }), + }) + if (!res.ok) { + const data = await res.json().catch(() => ({})) + throw new Error(data.error ?? `HTTP ${res.status}`) + } + } + + async list_voices() { + const res = await fetch(`${this._url}/voices`) + if (!res.ok) throw new Error(`HTTP ${res.status}`) + return res.json() // { voices: [{name, description, active}], current } + } + + async set_voice(name) { + const res = await fetch(`${this._url}/voice`, { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ name }), + }) + if (!res.ok) { + const data = await res.json().catch(() => ({})) + throw new Error(data.error ?? `HTTP ${res.status}`) + } + return res.json() // { ok, name, path } + } +} diff --git a/lib/tts.mjs b/lib/tts.mjs new file mode 100644 index 0000000..697fe56 --- /dev/null +++ b/lib/tts.mjs @@ -0,0 +1,107 @@ +/** + * Text-to-Speech using sherpa-onnx-node with the Kokoro model. + * + * Model layout expected under models_dir: + * kokoro-en-v0_19/ + * model.onnx + * voices.bin + * tokens.txt + * lexicon-us-en.txt + * espeak-ng-data/ + * + * Audio output goes to pacat (PulseAudio) as float32le @ 24kHz. + * speak_streaming() splits on sentence boundaries so the first sentence + * plays while the rest are still being synthesized. 
+ */ + +import { spawn } from 'node:child_process' +import { createRequire } from 'node:module' +import * as path from 'node:path' + +const require = createRequire(import.meta.url) +const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models') + +// Speaker IDs in kokoro-en-v0_19 +export const VOICES = { + af_heart: 0, + af_bella: 1, + af_nicole: 2, + af_sarah: 3, + af_sky: 4, + am_adam: 5, + am_michael: 6, +} + +export class Tts { + constructor({ + models_dir = DEFAULT_MODELS_DIR, + voice = 'af_heart', + speed = 1.0, + provider = 'cpu', // Kokoro synthesis is fast enough on CPU + } = {}) { + this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19') + this._voice = voice + this._speed = speed + this._provider = provider + this._tts = null + } + + init() { + const sherpa = require('sherpa-onnx-node') + const { _kokoro_dir: dir, _provider } = this + + this._tts = new sherpa.OfflineTts({ + model: { + kokoro: { + model: path.join(dir, 'model.onnx'), + voices: path.join(dir, 'voices.bin'), + tokens: path.join(dir, 'tokens.txt'), + dataDir: path.join(dir, 'espeak-ng-data'), + }, + }, + maxNumSentences: 2, + numThreads: 2, + debug: false, + provider: _provider, + }) + } + + /** Synthesize and play a single piece of text. Resolves when playback ends. */ + async speak(text) { + const sid = VOICES[this._voice] ?? 0 + const audio = this._tts.generate({ text, sid, speed: this._speed }) + await this._play(audio.samples, audio.sampleRate) + } + + /** + * Split text on sentence boundaries and synthesize + play each sentence + * in sequence. Lower latency than waiting for full synthesis first. 
+ */ + async speak_streaming(text) { + for (const sentence of split_sentences(text)) { + if (sentence.trim()) { + await this.speak(sentence.trim()) + } + } + } + + _play(samples, sample_rate) { + return new Promise((resolve, reject) => { + const proc = spawn('pacat', [ + '--format=float32le', + `--rate=${sample_rate}`, + '--channels=1', + ], { stdio: ['pipe', 'ignore', 'inherit'] }) + + proc.on('error', reject) + proc.on('close', resolve) + proc.stdin.write(Buffer.from(samples.buffer)) + proc.stdin.end() + }) + } +} + +function split_sentences(text) { + // Split after . ! ? — keep the punctuation with the preceding sentence + return text.split(/(?<=[.!?])\s+/) +} diff --git a/listen.mjs b/listen.mjs new file mode 100644 index 0000000..6ede8d6 --- /dev/null +++ b/listen.mjs @@ -0,0 +1,22 @@ +/** + * Listen with Whisper and print transcribed text to stdout. + * + * Usage: + * node listen.mjs + * node listen.mjs small.en + * node listen.mjs large-v3-turbo + */ + +import { Stt } from './lib/stt.mjs' + +const whisper_name = process.argv[2] ?? 
'base.en' + +const stt = new Stt({ whisper_name }) +stt.init() + +process.stderr.write(`[listen] whisper model: ${whisper_name}\n`) +process.stderr.write(`[listen] listening — Ctrl+C to stop\n`) + +stt.listen((text) => { + process.stdout.write(text + '\n') +}) diff --git a/package.json b/package.json new file mode 100644 index 0000000..8293af0 --- /dev/null +++ b/package.json @@ -0,0 +1,13 @@ +{ + "name": "voice-experiment", + "version": "0.1.0", + "type": "module", + "scripts": { + "download": "bash download-models.sh", + "setup-venv": "bash setup-venv.sh" + }, + "dependencies": { + "sherpa-onnx-node": "^1.12.14", + "js-yaml": "^4.1.0" + } +} diff --git a/query-demo.mjs b/query-demo.mjs new file mode 100644 index 0000000..86c6df3 --- /dev/null +++ b/query-demo.mjs @@ -0,0 +1,123 @@ +/** + * Voice query demo — accumulates STT utterances into a query, checks + * completeness with a local LLM classifier, then dispatches to Claude Code. + * + * Usage: + * node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs] + * + * If --window-id and --claude-remote are supplied, completed queries are + * dispatched to Claude Code. Otherwise they are only logged to stderr. + * + * A query is considered complete when: + * - The classifier says so + * - The last word is a send-word (go/done/send) + * - No new fragment arrives within SILENCE_TIMEOUT ms + */ + +import { execFileSync } from 'node:child_process' +import { Stt } from './lib/stt.mjs' +import { Tts_Client } from './lib/tts-client.mjs' +import { is_query_complete } from './lib/local-query-complete.mjs' + +function get_arg(name) { + const i = process.argv.indexOf(name) + return i !== -1 ? process.argv[i + 1] : null +} + +const audio_prompt = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav' +const whisper_name = get_arg('--whisper') ?? 
'base.en' +const window_id = get_arg('--window-id') +const claude_remote = get_arg('--claude-remote') + +const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000') + +const tts = new Tts_Client() +const stt = new Stt({ whisper_name }) + +stt.init() +process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`) +process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`) +if (window_id && claude_remote) { + process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`) +} else { + process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`) +} +await tts.speak('Ready for input.', { audio_prompt }) + +function make_prompt(query) { + return [ + '[voice-buddy] You have received a voice query.', + '', + 'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.', + 'If the task is a question, speak the answer directly.', + 'If the task is an action, carry it out and then speak a short confirmation.', + '', + '--- Query ---', + query, + '--- End of query ---', + ].join('\n') +} + +function dispatch(query) { + execFileSync('node', [claude_remote, '--window-id', window_id], { + input: make_prompt(query), + encoding: 'utf8', + stdio: ['pipe', 'inherit', 'inherit'], + }) +} + +async function submit(query) { + query = query.trim() + if (!query) { + return + } + process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`) + if (window_id && claude_remote) { + dispatch(query) + } + await tts.speak('Efforting.', { audio_prompt }) +} + +const SEND_WORDS = new Set(['go', 'done', 'send']) + +let accumulated = '' +let silence_timer = null + +function reset_silence_timer() { + clearTimeout(silence_timer) + silence_timer = setTimeout(async () => { + if (!accumulated) { + return + } + process.stderr.write('[query-demo] silence timeout — submitting\n') + const query = accumulated + accumulated = '' + await 
submit(query) + }, SILENCE_TIMEOUT) +} + +stt.listen(async (text) => { + accumulated = accumulated ? `${accumulated}\n${text}` : text + process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`) + + reset_silence_timer() + + const last_line = accumulated.split('\n').at(-1) + const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '') + const forced = SEND_WORDS.has(last_norm) + + if (!forced) { + const complete = await is_query_complete(last_line) + if (!complete) { + return + } + } + + clearTimeout(silence_timer) + + const lines = accumulated.split('\n') + const query = (forced ? lines.slice(0, -1) : lines).join('\n') + accumulated = '' + + await submit(query) +}, { on_audio: () => { if (accumulated) reset_silence_timer() } }) diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..ca0e588 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,11 @@ +# STT +faster-whisper>=1.1.0 + +# TTS (already used in kokoro-tts-test) +kokoro>=0.9.4 + +# Audio capture +sounddevice>=0.4.6 + +# Numpy (usually already present) +numpy>=1.24.0 diff --git a/setup-venv.sh b/setup-venv.sh new file mode 100755 index 0000000..44a09e2 --- /dev/null +++ b/setup-venv.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +# Creates a venv in ./venv and installs Python deps for Bark. +# Safe to re-run — skips install if already done. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +VENV="${SCRIPT_DIR}/venv" + +if [ ! 
-d "${VENV}" ]; then + echo "==> creating venv at ${VENV}" + python3 -m venv "${VENV}" +fi + +echo "==> installing Python dependencies" +"${VENV}/bin/pip" install --upgrade pip --quiet +"${VENV}/bin/pip" install transformers accelerate torch numpy +"${VENV}/bin/pip" install chatterbox-tts + +chmod +x "${SCRIPT_DIR}/bark-server.py" +chmod +x "${SCRIPT_DIR}/chatterbox-server.py" + +echo "" +echo "==> venv ready at ${VENV}" diff --git a/speak-as.mjs b/speak-as.mjs new file mode 100644 index 0000000..f7617e7 --- /dev/null +++ b/speak-as.mjs @@ -0,0 +1,47 @@ +/** + * Read text from stdin and speak it using a voice reference clip. + * + * Usage: + * echo "Hello world" | node speak-as.mjs /path/to/voice.wav + * cat script.txt | node speak-as.mjs /path/to/voice.wav + * node speak-as.mjs /path/to/voice.wav (then type, Ctrl+D when done) + */ + +import * as fs from 'node:fs' +import * as readline from 'node:readline' +import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs' + +const audio_prompt = process.argv[2] + +if (!audio_prompt) { + process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n') + process.exit(1) +} + +if (!fs.existsSync(audio_prompt)) { + process.stderr.write(`File not found: ${audio_prompt}\n`) + process.exit(1) +} + +const tts = new Chatterbox_Tts() +await tts.init() + +const rl = readline.createInterface({ + input: process.stdin, + output: process.stdout, + terminal: process.stdin.isTTY, +}) + +const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim() +let last_norm = '' + +for await (const line of rl) { + const text = line.trim() + if (!text) continue + const n = norm(text) + if (n === last_norm) continue + last_norm = n + await tts.speak(text, { audio_prompt }) +} + +tts.stop() diff --git a/tts-server.mjs b/tts-server.mjs new file mode 100644 index 0000000..b124eb1 --- /dev/null +++ b/tts-server.mjs @@ -0,0 +1,133 @@ +/** + * TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API. 
+ * Requests are serialized so generation and playback stay in order. + * + * Usage: + * node tts-server.mjs + * TTS_PORT=11500 node tts-server.mjs + * + * API: + * POST /speak { "text": "...", "audio_prompt": "/path/to/voice.wav", ... } + * → 200 { "ok": true } (after generation, playback continues in background) + * GET /voices → 200 { "voices": ["rommie", ...], "current": "rommie" | null } + * POST /voice { "name": "rommie" } + * → 200 { "ok": true, "name": "rommie", "path": "..." } + * GET /health → 200 { "ok": true } + */ + +import * as http from 'node:http' +import * as fs from 'node:fs' +import * as path from 'node:path' +import yaml from 'js-yaml' +import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs' + +const PORT = parseInt(process.env.TTS_PORT ?? '11500') +const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml') + +function reload_voices() { + try { + const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8')) + return doc?.voices ?? {} + } catch { + return {} + } +} + +let voices = reload_voices() +let current_voice = null // name of active voice, or null + +// --- TTS setup --- +const tts = new Chatterbox_Tts() +process.stderr.write('[tts-server] starting chatterbox...\n') +await tts.init() +process.stderr.write('[tts-server] chatterbox ready\n') + +// Serialize all speak requests through a promise chain +let queue = Promise.resolve() + +function enqueue(fn) { + const result = queue.then(fn) + // Don't let a failed request poison the queue + queue = result.catch(() => {}) + return result +} + +function read_body(req) { + return new Promise((resolve, reject) => { + let buf = '' + req.on('data', chunk => { buf += chunk }) + req.on('end', () => { + try { resolve(JSON.parse(buf)) } catch (e) { reject(e) } + }) + req.on('error', reject) + }) +} + +function send(res, status, body) { + const payload = JSON.stringify(body) + res.writeHead(status, { 'Content-Type': 'application/json' }) + res.end(payload) +} + +const server = 
http.createServer(async (req, res) => { + if (req.method === 'GET' && req.url === '/health') { + return send(res, 200, { ok: true }) + } + + if (req.method === 'GET' && req.url === '/voices') { + voices = reload_voices() + const list = Object.entries(voices).map(([name, v]) => ({ + name, + description: v.description ?? '', + active: name === current_voice, + })) + return send(res, 200, { voices: list, current: current_voice }) + } + + if (req.method === 'POST' && req.url === '/voice') { + let body + try { body = await read_body(req) } catch { + return send(res, 400, { error: 'invalid JSON' }) + } + const { name } = body + if (!name) return send(res, 400, { error: 'name required' }) + voices = reload_voices() + if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` }) + current_voice = name + process.stderr.write(`[tts-server] voice switched to: ${name}\n`) + return send(res, 200, { ok: true, name, path: voices[name].path }) + } + + if (req.method === 'POST' && req.url === '/speak') { + let body + try { + body = await read_body(req) + } catch { + return send(res, 400, { error: 'invalid JSON' }) + } + + const { text, ...opts } = body + if (!text) { + return send(res, 400, { error: 'text required' }) + } + + // Inject current voice as default audio_prompt if none provided + if (!opts.audio_prompt && current_voice && voices[current_voice]) { + opts.audio_prompt = voices[current_voice].path + } + + try { + await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts })) + send(res, 200, { ok: true }) + } catch (err) { + send(res, 500, { error: err.message }) + } + return + } + + send(res, 404, { error: 'not found' }) +}) + +server.listen(PORT, () => { + process.stderr.write(`[tts-server] listening on port ${PORT}\n`) +}) diff --git a/voice-buddy.mjs b/voice-buddy.mjs new file mode 100644 index 0000000..b85797e --- /dev/null +++ b/voice-buddy.mjs @@ -0,0 +1,65 @@ +/** + * Voice buddy — reads voice queries from stdin (JSON lines) and 
dispatches + * them to a Claude Code instance via claude-remote. + * + * Usage: + * node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs + * + * Options: + * --window-id X11 window ID of Claude Code (decimal or 0x hex) + * --claude-remote Path to claude-remote.mjs + */ + +import * as readline from 'node:readline' +import { execFileSync } from 'node:child_process' + +const args = process.argv.slice(2) + +function get_arg(name) { + const i = args.indexOf(name) + if (i === -1 || i + 1 >= args.length) { + process.stderr.write(`Missing argument: ${name}\n`) + process.exit(1) + } + return args[i + 1] +} + +const window_id = get_arg('--window-id') +const claude_remote = get_arg('--claude-remote') + +function make_prompt(query) { + return [ + '[voice-buddy] You have received a voice query.', + '', + 'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.', + '', + '--- Query ---', + query, + '--- End of query ---', + ].join('\n') +} + +function dispatch(query) { + const prompt = make_prompt(query) + process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`) + execFileSync('node', [claude_remote, '--window-id', window_id], { + input: prompt, + encoding: 'utf8', + stdio: ['pipe', 'inherit', 'inherit'], + }) +} + +const rl = readline.createInterface({ input: process.stdin }) + +for await (const line of rl) { + const trimmed = line.trim() + if (!trimmed) { + continue + } + try { + const query = JSON.parse(trimmed) + dispatch(query) + } catch { + process.stderr.write(`[voice-buddy] bad JSON line: ${trimmed}\n`) + } +} diff --git a/voices.yaml b/voices.yaml new file mode 100644 index 0000000..0531044 --- /dev/null +++ b/voices.yaml @@ -0,0 +1,54 @@ +# Named voices for TTS server. 
+# Switch via: POST /voice {"name": "rommie"} +# List via: GET /voices + +voices: + rommie: + path: /home/devilholk/Documents/rommie-sample.wav + description: Rommie - Andromeda ship AI / android + + archangel: + path: /home/devilholk/Documents/archangel.ogg + description: Archangel from Airwolf + + marella: + path: /home/devilholk/Documents/murella.ogg + description: Marella from Airwolf + + chuin: + path: /home/devilholk/Documents/chuin.ogg + description: Chiun from Remo Williams - The Adventure Begins + + harold-w-smith: + path: /home/devilholk/Documents/cure-leader.ogg + description: Harold W. Smith from Remo Williams - The Adventure Begins + + penny-parker: + path: /home/devilholk/Documents/penny.ogg + description: Penny from MacGyver + + santini: + path: /home/devilholk/Documents/santini.ogg + description: Santini from Airwolf + + gosalyn: + path: /home/devilholk/Documents/gasalin.ogg + description: Gosalyn from Darkwing Duck + + darkwing-duck: + path: /home/devilholk/Documents/dwd.ogg + description: Darkwing Duck + + belitski: + path: /home/devilholk/Documents/slipstream2.ogg + description: Belitski from Slipstream + + ivanova: + path: /home/devilholk/Documents/ivanova.ogg + description: Susan Ivanova from Babylon 5 + + sheila: + path: /home/devilholk/Documents/darkness-woman.ogg + description: Sheila from Army of Darkness + +# Add more voices as clips become available