Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00
commit db8889aeed
35 changed files with 2782 additions and 0 deletions

11
.gitignore vendored Normal file

@@ -0,0 +1,11 @@
node_modules/
models/
venv/
*.ogg
*.wav
*.mp3
*.onnx
*.bin
package-lock.json
__pycache__/
*.pyc

159
CLEANUP-PLAN.md Normal file

@@ -0,0 +1,159 @@
# Voice Project — Cleanup Plan
> [!NOTE]
> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.
The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.
---
## Current state
### Active pipeline (keep)
| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point — STT → classifier → dispatch |
| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
| `voices.yaml` | Named voice config |
| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
| `lib/local-query-complete.mjs` | Ollama classifier |
| `chatterbox-server.py` | Long-running Python TTS process |
| `listen.mjs` | Standalone mic → stdout tool |
| `speak-as.mjs` | Standalone stdin → TTS tool |
| `demos/` | Multi-voice demo scripts |
| `download-models.sh` | Whisper + VAD model download |
| `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
| `models/` | Downloaded STT models |
| `venv/` | Python venv (not committed) |
### Dead / experimental (remove or archive)
| File | Reason |
|------|--------|
| `acting-demo-bark.mjs` | Bark TTS — abandoned |
| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
| `acting-demo.mjs` | Old demo |
| `bark-server.py` | Bark backend — abandoned |
| `demo-bark.mjs` | Bark demo |
| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
| `lib/bark-tts.mjs` | Bark wrapper |
| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
| `lib/markdown.mjs` | Unused? Verify before deleting |
| `lib/llm.mjs` | Unused? Verify before deleting |
| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
| `voice-buddy.mjs` | Merged into query-demo.mjs — delete |
---
## What needs to be documented / fixed
### 1. `setup-venv.sh` — rewrite
Current script installs Bark deps. The active backend is Chatterbox. New script should:
- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
- HuggingFace token instruction: write token to `~/.secrets/hugging-face.token`
- Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached
### 2. `requirements.txt` — replace or delete
Currently lists kokoro and faster-whisper (unused). Either:
- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
- Or just delete it — `setup-venv.sh` is the source of truth
### 3. sherpa-onnx / CUDA build — document clearly
The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.
### 4. System dependencies — list them
Nothing currently lists what needs to be installed at the OS level:
- `parec` / `pacat` (PulseAudio) — audio I/O
- `cmake` — for optional CUDA sherpa-onnx rebuild
- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
- `ollama` running locally with `qwen2.5:3b` pulled — for classifier
- `jq` — used in shell functions and demo scripts
- Node.js (version? check what's required)
- Python 3 with venv support
### 5. External services — document
- **TTS server**: runs on the host (not in container), port 11500
- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server
### 6. `voices.yaml` — keep and document
Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).
---
## Proposed new top-level structure
```
voice/
README.md ← start here: what this is, quick start
SETUP.md ← full setup: OS deps, npm install, venv, CUDA, models
NOTES.md ← architecture deep-dives (keep as-is)
TODO.md ← keep as-is
VOICES.md ← wishlist (keep as-is)
voices.yaml ← runtime voice config
query-demo.mjs ← main entry point
tts-server.mjs ← TTS HTTP server
listen.mjs ← standalone STT tool
speak-as.mjs ← standalone TTS tool
chatterbox-server.py
setup-venv.sh ← rewritten for Chatterbox only
download-models.sh
package.json
lib/
stt.mjs
chatterbox-tts.mjs
tts-client.mjs
local-query-complete.mjs
demos/
*.sh
models/ ← gitignored
venv/ ← gitignored
```
---
## Cleanup steps (in order)
1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
3. **Rewrite** `setup-venv.sh` for Chatterbox only
4. **Replace** `requirements.txt` with accurate one or delete
5. **Write** `SETUP.md` covering all install steps end-to-end
6. **Write** `README.md` with quick start
7. **Add** voice cloning tips to top of `voices.yaml`
8. **Commit** the whole cleanup as a single commit
---
## Open questions
- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?
---
## New architecture (revised plan)
Keep this project as a reference. Build clean standalone repos instead:
| Repo | Contents | Depends on |
|------|----------|-----------|
| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |
Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD).
### Open questions for new architecture
- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
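A minimal sketch of the env-var lookup suggested under model storage. `MODELS_DIR` comes from the note above; the per-model variable naming (`MODEL_<NAME>`) and the helper `resolve_model_path` are assumptions made for illustration only.

```javascript
// Sketch only: resolve a model location from the environment instead of
// a project-local models/ directory. The MODEL_<NAME> convention and the
// helper name are hypothetical.
function resolve_model_path(name, env = process.env) {
  // A per-model override wins, e.g. MODEL_SILERO_VAD_ONNX=/path/to/file
  const key = `MODEL_${name.toUpperCase().replace(/[^A-Z0-9]+/g, '_')}`
  if (env[key]) return env[key]

  if (!env.MODELS_DIR) {
    throw new Error(`Set MODELS_DIR or ${key} to locate ${name}`)
  }
  // Drop trailing slashes so the join never produces "//".
  return `${env.MODELS_DIR.replace(/\/+$/, '')}/${name}`
}
```

Failing loudly when neither variable is set keeps a repo from silently downloading models into its own directory.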

51
LLM-ROUTING.md Normal file

@@ -0,0 +1,51 @@
# LLM Routing Strategy
When to use a local model vs an Anthropic frontier model in the voice pipeline.
---
## Current model inventory
| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | — |
Total local VRAM: ~6-7 GB on RTX 3090 (24 GB). Plenty of headroom.
---
## Routing principles
### Use local model when:
- Task is **classification or routing** (completeness, intent detection, command vs question)
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
- Task is **high-frequency** (runs on every utterance — billing would add up)
- Privacy matters — query content shouldn't leave the machine
### Use Anthropic frontier model when:
- Task requires **reasoning, synthesis, or judgment**
- Task involves **code generation or editing**
- Task involves **world knowledge** beyond what a small model handles well
- Query is **low-frequency** (once per user request, not once per audio frame)
- Quality matters more than latency
---
## GPU split (future option)
If voice models and reasoning models compete for VRAM, split across cards:
- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API
Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.
---
## Upgrade candidates
- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
- **Completeness classifier → larger model** — Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification.
- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7-14B quantised model via Ollama would fit alongside the voice stack.

161
NOTES.md Normal file

@@ -0,0 +1,161 @@
# Voice Experiment Notes
Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.
---
## Architecture
```
parec (PulseAudio)
└─ lib/stt.mjs Silero VAD + Whisper → text on stdout
└─ sherpa-onnx-node
└─ libsherpa-onnx-c-api.so (needs CUDA build — see below)
└─ libonnxruntime.so (system, with CUDA)
text → lib/chatterbox-tts.mjs
└─ chatterbox-server.py long-running Python TTS server
└─ chatterbox-tts (pip) ResembleAI Chatterbox model
```
---
## STT
**Library:** `sherpa-onnx-node` npm package
**Models:** Whisper (offline batch) + Silero VAD
**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`
### Model download
```bash
bash download-models.sh base.en # or: tiny.en small.en medium.en large-v3 large-v3-turbo
```
Models land in `models/`:
- `silero_vad.onnx`
- `sherpa-onnx-whisper-base.en/`
### How it works
`lib/stt.mjs` feeds 512-sample (32ms) windows into Silero VAD. When VAD signals speech end (after 500ms silence), the segment is sent to Whisper for transcription. The transcription is written to stdout — **note: the native addon writes directly to fd 1 (bypassing Node.js process.stdout), not via the JavaScript callback**. The JS callback is also called and can be used for additional processing.
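The framing step described above can be sketched as follows. This is a minimal illustration, assuming raw `s16le` mono bytes arrive from `parec` in arbitrarily sized chunks. `pcm_to_windows` is a hypothetical helper, not the actual code in `lib/stt.mjs`, and the VAD call itself is left out.

```javascript
// Sketch only: split raw s16le mono PCM into 512-sample Float32 windows.
// `pcm_to_windows` is a hypothetical helper, not part of lib/stt.mjs.
const WINDOW_SAMPLES = 512      // 32 ms at 16 kHz
const BYTES_PER_SAMPLE = 2      // signed 16-bit little-endian

function pcm_to_windows(bytes, carry = new Uint8Array(0)) {
  // Prepend bytes left over from the previous chunk.
  const data = new Uint8Array(carry.length + bytes.length)
  data.set(carry, 0)
  data.set(bytes, carry.length)

  const window_bytes = WINDOW_SAMPLES * BYTES_PER_SAMPLE
  const count = Math.floor(data.length / window_bytes)
  const view = new DataView(data.buffer)
  const windows = []
  for (let w = 0; w < count; w++) {
    const out = new Float32Array(WINDOW_SAMPLES)
    for (let i = 0; i < WINDOW_SAMPLES; i++) {
      // Scale int16 to the [-1, 1) float range.
      const offset = w * window_bytes + i * BYTES_PER_SAMPLE
      out[i] = view.getInt16(offset, true) / 32768
    }
    windows.push(out)
  }
  // Incomplete trailing bytes are returned so the caller can pass them back in.
  return { windows, carry: data.slice(count * window_bytes) }
}
```

Each returned window would then be handed to the VAD, and the `carry` bytes are passed into the next call so no samples are dropped at chunk boundaries.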
### Getting CUDA
The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:
```bash
# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx
export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib
cd ~/Foreign/sherpa-onnx
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_ENABLE_GPU=ON \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
.
cmake --build build --target sherpa-onnx-c-api
# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so
rm -f $NMOD/libonnxruntime.so
```
The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.
---
## TTS
**Library:** `chatterbox-tts` Python package (ResembleAI)
**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
**Playback:** `pacat --format=float32le --rate=24000 --channels=1`
### Setup
```bash
bash setup-venv.sh # creates ./venv, installs chatterbox-tts + deps
```
HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.
### Variants
| Variant | RTF (RTX 3090) | Notes |
|---------|---------------|-------|
| turbo | ~0.2 | default, female voice, no exaggeration param |
| full | slower | male voice, supports `exaggeration` (0.0-1.0) |
### Paralinguistic tags
Work in both variants:
```
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
```
### Voice cloning
Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
```javascript
await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
```
The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.
### Server protocol
`chatterbox-server.py` is a long-running process. Communication:
- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
- **stderr:** status/timing logs
Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
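The line-based protocol above can be sketched from the client side as follows. This is an illustration under assumptions: `encode_request` and `make_ack_counter` are hypothetical names and do not reflect the actual `lib/chatterbox-tts.mjs` API. Only the one-JSON-line-in, `ok\n`-out framing comes from the description above.

```javascript
// Sketch only: framing for the stdin/stdout protocol described above.
// Both helper names are hypothetical.
function encode_request(text, opts = {}) {
  // JSON.stringify escapes embedded newlines, so each request stays on one line.
  return JSON.stringify({ text, ...opts }) + '\n'
}

function make_ack_counter() {
  let pending = ''
  // Feed stdout chunks as they arrive; returns how many complete
  // "ok" lines were seen in this call.
  return function feed(chunk) {
    pending += chunk
    const lines = pending.split('\n')
    pending = lines.pop()       // keep any trailing partial line
    return lines.filter(line => line === 'ok').length
  }
}
```

Buffering partial lines matters because a pipe may deliver `ok\n` split across two reads, or several acknowledgements in one read.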
### Acting tips
- Emphasis: `UPPERCASE` words
- Pause/rhythm: `...` `—`
- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
- Exaggeration (full model only): `{ exaggeration: 0.0-1.0 }`
- Temperature: `{ temperature: 0.3-1.4 }` (lower = more consistent)
---
## Tools
| Script | Usage |
|--------|-------|
| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1-7) |
### Pipe example
```bash
node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
```
`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.
---
## Known Quirks
**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
```python
_orig = _librosa.resample
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
```
**HuggingFace offline:** `from_pretrained()` always hits the internet. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.
**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
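A minimal sketch of the consecutive-line deduplication described above. It is not copied from `speak-as.mjs`. The helper names are hypothetical, and it only catches lines that become identical after normalization, not lines that are merely similar.

```javascript
// Sketch only: drop consecutive lines that are identical once lowercased
// and stripped of punctuation. Helper names are hypothetical.
function normalize_line(line) {
  return line
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')   // strip punctuation and symbols
    .replace(/\s+/g, ' ')
    .trim()
}

function make_deduper() {
  let last = null
  // Returns true when the line should be spoken, false for a repeat.
  return function should_speak(line) {
    const key = normalize_line(line)
    if (key === '' || key === last) return false
    last = key
    return true
  }
}
```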
**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).

80
SPEAKER-DIARIZATION.md Normal file

@@ -0,0 +1,80 @@
# Speaker Diarization
Notes on adding speaker identification to the voice pipeline.
---
## What Whisper does and doesn't do
Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.
---
## Adding diarization
Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
### pyannote.audio
The most capable open-source option. Runs locally, GPU-accelerated.
```
pip install pyannote.audio
```
Produces time-stamped segments:
```
SPEAKER_00 0.5s 3.2s
SPEAKER_01 3.8s 7.1s
SPEAKER_00 7.5s 12.0s
```
Feed each segment to Whisper independently → transcript with speaker tags.
Requires a HuggingFace token and model access request (free).
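One way to align speaker turns with transcribed segments is to label each segment with the turn it overlaps most. The sketch below assumes both inputs are already available as plain objects with `start` and `end` in seconds. The field names and `label_segments` are illustrative assumptions, not pyannote or sherpa-onnx output formats.

```javascript
// Sketch only: assign each transcript segment the speaker whose turn
// overlaps it the most. Object shapes are assumed for illustration.
function label_segments(turns, segments) {
  return segments.map(seg => {
    let best = null
    let best_overlap = 0
    for (const turn of turns) {
      // Length of the time range shared by the segment and the turn.
      const overlap = Math.min(seg.end, turn.end) - Math.max(seg.start, turn.start)
      if (overlap > best_overlap) {
        best_overlap = overlap
        best = turn.speaker
      }
    }
    return { ...seg, speaker: best ?? 'UNKNOWN' }
  })
}
```

Segments that overlap no turn at all are labelled `UNKNOWN` rather than being forced onto the nearest speaker.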
### sherpa-onnx
Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.
---
## Speaker identification vs diarization
| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |
For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:
- **resemblyzer** — simple cosine similarity on d-vector embeddings
- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain** — full toolkit, heavier dependency
---
## Future experiments
### Multi-speaker transcription
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```
### Real-time diarization
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
### Voice enrollment
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
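The scoring step can be sketched with plain cosine similarity. This assumes embeddings are already extracted as equal-length numeric arrays by whichever model is chosen. The `0.7` threshold and the function names are placeholders that would need tuning, not values taken from any of the libraries above.

```javascript
// Sketch only: match an embedding against enrolled speakers.
// Threshold and names are illustrative assumptions.
function cosine_similarity(a, b) {
  let dot = 0
  let norm_a = 0
  let norm_b = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    norm_a += a[i] * a[i]
    norm_b += b[i] * b[i]
  }
  // Note: yields NaN for all-zero vectors; callers should reject those.
  return dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
}

function identify_speaker(embedding, enrolled, threshold = 0.7) {
  let best = { name: null, score: -Infinity }
  for (const [name, reference] of Object.entries(enrolled)) {
    const score = cosine_similarity(embedding, reference)
    if (score > best.score) best = { name, score }
  }
  // Below the threshold the caller can assign a temporary anonymous label.
  return best.score >= threshold ? best : { name: null, score: best.score }
}
```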
### Wake word per speaker
Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
---
## VRAM considerations
pyannote diarization adds roughly 1-2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.

41
TODO.md Normal file

@@ -0,0 +1,41 @@
# Voice Experiment — TODO
## Classifier
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
- [ ] Tune classifier — still too conservative on plain statements and short instructions
- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
## Voice switching
- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
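The daily rotation item above could look roughly like this. The choice of FNV-1a as the hash, the use of the UTC date, and the function names are all assumptions for illustration. Any stable string hash would do.

```javascript
// Sketch only: pick a voice from the current date.
// FNV-1a and the UTC date are arbitrary illustrative choices.
function fnv1a(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0   // keep it an unsigned 32-bit value
  }
  return hash
}

function voice_for_date(voices, date = new Date()) {
  const day = date.toISOString().slice(0, 10)  // "YYYY-MM-DD" in UTC
  return voices[fnv1a(day) % voices.length]
}
```

Because the hash depends only on the date string, every process started on the same (UTC) day picks the same voice without shared state.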
## Pipeline
- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
- [ ] Reserved control vocabulary — single words, pattern matched before any classifier:
| Word(s) | Action |
|---------|--------|
| `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
| `project note` | Activate with project context injected into the prompt preamble |
| `go` / `send` / `done` | Dispatch current query immediately |
| `cancel` / `scratch that` | Discard accumulated query, return to idle |
| `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
| `switch voice` / `different voice` | Trigger voice-switch flow |
Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content.
- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
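The control vocabulary and the staged hierarchy above can be sketched together. The action names, the `'undecided'` fallback, and the stage signature (return an action string to short-circuit, `null` to pass on) are assumptions for illustration. Only the word list comes from the table.

```javascript
// Sketch only: reserved words are matched before any classifier stage.
// Action names and the stage contract are hypothetical.
const CONTROL_WORDS = new Map([
  ['computer', 'activate'],
  ['project note', 'activate_project'],
  ['go', 'dispatch'], ['send', 'dispatch'], ['done', 'dispatch'],
  ['cancel', 'discard'], ['scratch that', 'discard'],
  ['stop', 'shutdown'], ['goodbye', 'shutdown'], ['hang up', 'shutdown'],
  ['switch voice', 'switch_voice'], ['different voice', 'switch_voice'],
])

function normalize_utterance(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim()
}

// Fast pattern stage: the whole utterance must be a reserved phrase.
function pattern_stage(text) {
  return CONTROL_WORDS.get(normalize_utterance(text)) ?? null
}

// Run stages in order; the first non-null result short-circuits the rest.
async function classify(text, stages) {
  for (const stage of stages) {
    const action = await stage(text)
    if (action !== null) return action
  }
  return 'undecided'
}
```

An LLM-backed stage would be appended after `pattern_stage` in the `stages` array, so it only ever sees utterances the pattern layer did not claim.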
## TTS / Long text
- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
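For reference, a naive version of sentence-boundary chunking might look like the sketch below. It is not the `speak_streaming` implementation. `split_sentences` and the `max_chars` budget are hypothetical, and the split rule (terminal punctuation followed by whitespace and a capital letter, quote, or `[` tag) will mis-split some abbreviations.

```javascript
// Sketch only: split text into chunks at sentence boundaries.
// Not the speak_streaming implementation; names and limits are hypothetical.
function split_sentences(text, max_chars = 200) {
  const sentences = text
    .trim()
    // Break after . ! ? when the next sentence starts with a capital,
    // a quote, or a [tag]; "It was... quiet." therefore stays whole.
    .split(/(?<=[.!?])\s+(?=[A-Z\["'])/)
    .filter(s => s.length > 0)

  // Pack consecutive sentences into chunks no longer than max_chars.
  const chunks = []
  let current = ''
  for (const sentence of sentences) {
    if (current && current.length + 1 + sentence.length > max_chars) {
      chunks.push(current)
      current = sentence
    } else {
      current = current ? `${current} ${sentence}` : sentence
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```

A single sentence longer than `max_chars` is passed through unsplit, so a real implementation would still need a fallback for very long sentences.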
## TTS / Prompt
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
## STT
- [ ] Try Whisper small.en or medium.en for better accuracy on accents

29
VOICES.md Normal file

@@ -0,0 +1,29 @@
# Voice Clone Wishlist
Voices to prepare reference clips for.
## Ready
| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well |
| Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly authoritative delivery, dry clean speech |
## To Do
| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality |
| Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum |
| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
| Steve Bacic | Andromeda (Telemachus Rhade) | — |
| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |
## Notes
- Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both
- Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone

135
acting-demo-bark.mjs Normal file

@@ -0,0 +1,135 @@
/**
* Demonstrates acting and expressiveness with Bark TTS.
*
* Run: node acting-demo-bark.mjs [start-section]
*
* Bark expressive techniques:
* UPPERCASE — stress/emphasis on a word
* ... — hesitation, trailing off
* [laughs] — laughter
* [sighs] — sigh
* [gasps] — sharp intake of breath
* [clears throat] — throat clearing
* ♪ text ♪ — sung phrase
* — — abrupt break
* ! — raised energy
*
* Voice presets (English):
* v2/en_speaker_0 calm female
* v2/en_speaker_1 calm male
* v2/en_speaker_3 deep male
* v2/en_speaker_6 neutral/warm (default)
* v2/en_speaker_9 expressive
*/
import { Bark_Tts } from './lib/bark-tts.mjs'
const tts = new Bark_Tts()
process.stderr.write('[bark] loading model...\n')
await tts.init()
process.stderr.write('[bark] ready\n\n')
const START = parseInt(process.argv[2] ?? '1', 10)
function section(title) {
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}
async function say(text, voice = undefined) {
const opts = { preprocess: false }
if (voice) opts.voice = voice
process.stdout.write(` ${voice ?? 'default'}\n "${text}"\n\n`)
await tts.speak(text, opts)
}
// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline')
if (START <= 1) {
await say('The package has arrived. Please sign here.')
}
// ─── 2. Paralinguistic tokens ──────────────────────────
section('2. Paralinguistic tokens')
if (START <= 2) {
await say('[clears throat] Right. Where were we.')
await say('And then he just... [sighs] gave up.')
await say('[gasps] I had no idea you were here!')
await say("Ha. [laughs] That's the funniest thing I've heard all week.")
}
// ─── 3. Emphasis via UPPERCASE ─────────────────────────
section('3. Emphasis via UPPERCASE')
if (START <= 3) {
await say('You need to stop. RIGHT NOW.')
await say('I have told you THIS BEFORE.')
await say('This is ABSOLUTELY unacceptable.')
}
// ─── 4. Hesitation and rhythm ──────────────────────────
section('4. Hesitation and rhythm')
if (START <= 4) {
await say('It was... quiet. Too quiet.')
await say('I just... I don\'t know what to say.')
await say('He said he was fine — but his eyes told a different story.')
}
// ─── 5. Sung phrase ────────────────────────────────────
section('5. Singing')
if (START <= 5) {
await say('♪ Happy birthday to you, happy birthday to you ♪')
}
// ─── 6. Acronyms ───────────────────────────────────────
section('6. Acronyms')
if (START <= 6) {
// Without spacing — Bark may try to pronounce as a word
await say('The FBI is investigating.')
await say('NASA launched a new rocket.')
await say('I work with the CPU and GPU.')
// With letter spacing — forces spelling out
await say('The F B I is investigating.')
await say('N A S A launched a new rocket.')
await say('I work with the C P U and G P U.')
}
// ─── 7. Voice character ────────────────────────────────
section('7. Voice presets — same line')
if (START <= 7) {
const line = "Well, that's certainly one way to look at it."
for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) {
await say(line, voice)
}
}
// ─── 8. Combined scene ─────────────────────────────────
section('8. Combined — a tense scene')
if (START <= 8) {
await say('Something is wrong.', 'v2/en_speaker_9')
await say('I can FEEL it.', 'v2/en_speaker_9')
await say('[gasps] Run!', 'v2/en_speaker_9')
}
// ─── 9. Paragraph — voices 7-9 ─────────────────────────
// Pass voices as CLI args to override, e.g.:
// node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5
const PARA_VOICES = process.argv.slice(3).length
? process.argv.slice(3)
: ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9']
const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. '
+ '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. '
+ 'Her hands trembled as she tore open the envelope. '
+ 'Inside was a single sheet of paper with just THREE words written on it. '
+ '[gasps] She read them twice... then sat down slowly, and stared at the wall.'
section('9. Paragraph — voice comparison')
if (START <= 9) {
for (const voice of PARA_VOICES) {
process.stdout.write(`\n${voice}\n`)
await say(PARAGRAPH, voice)
}
}
tts.stop()
section('Done')

109
acting-demo-chatterbox.mjs Normal file

@@ -0,0 +1,109 @@
/**
* Acting demo for Chatterbox TTS.
*
* Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full]
*
* Paralinguistic tags:
* [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
*
* Exaggeration (full model only, 0.0-1.0):
* 0.0 = neutral, 1.0 = maximum emotion intensity
*/
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
const START = parseInt(process.argv[2] ?? '1', 10)
const VARIANT = process.argv[3] ?? 'turbo'
const tts = new Chatterbox_Tts({ variant: VARIANT })
process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`)
await tts.init()
process.stderr.write('[chatterbox] ready\n\n')
function section(title) {
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}
async function say(text, opts = {}) {
process.stdout.write(` "${text}"\n`)
if (Object.keys(opts).length) process.stdout.write(` ${JSON.stringify(opts)}\n`)
process.stdout.write('\n')
await tts.speak(text, { preprocess: false, ...opts })
}
// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline')
if (START <= 1) {
await say('The package has arrived. Please sign here.')
}
// ─── 2. Paralinguistic tags ────────────────────────────
section('2. Paralinguistic tags')
if (START <= 2) {
await say('[cough] Right, where were we.')
await say('And then he just... [sigh] gave up.')
await say('[gasp] I had no idea you were here!')
await say('Ha! [laugh] That is the funniest thing I have heard all week.')
await say('[chuckle] Well, that is certainly one way to look at it.')
await say('[groan] Not this again.')
}
// ─── 3. Punctuation and rhythm ─────────────────────────
section('3. Punctuation and rhythm')
if (START <= 3) {
await say('It was... quiet. Too quiet.')
await say('He said he was fine — but his eyes told a different story.')
await say('I told you. I told you this would happen. But did anyone listen? No.')
}
// ─── 4. Exaggeration — full model only ─────────────────
section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`)
if (START <= 4) {
await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 })
await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 })
await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 })
}
// ─── 5. Temperature ────────────────────────────────────
section('5. Temperature (lower = more predictable)')
if (START <= 5) {
await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 })
await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 })
await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 })
}
// ─── 6. Paragraph ──────────────────────────────────────
section('6. Paragraph with tags')
if (START <= 6) {
await say(
'It was a cold evening when the letter finally arrived. '
+ '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. '
+ 'Her hands trembled as she tore open the envelope. '
+ '[gasp] Inside was a single sheet of paper with just three words written on it.'
)
}
// ─── 7. Audio prompt / voice cloning ───────────────────
// Pass a WAV file path as the third CLI argument (process.argv[4]):
// node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav
section('7. Audio prompt (voice cloning)')
if (START <= 7) {
const audio_prompt = process.argv[4]
if (!audio_prompt) {
process.stdout.write(' Skipped (pass a WAV file as the 3rd argument):\n')
process.stdout.write(' node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n')
process.stdout.write(' Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n')
} else {
process.stdout.write(` voice: ${audio_prompt}\n\n`)
await say('Hello. This is a test of voice cloning.', { audio_prompt })
await say(
'It was a cold evening when the letter finally arrived. '
+ '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad.',
{ audio_prompt }
)
}
}
tts.stop()
section('Done')
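
The demo takes its start section, model variant, and optional reference WAV as positional arguments. A minimal sketch of that parsing as a pure function follows. `parse_demo_args` and its returned field names are hypothetical and not part of this repo, and the `turbo` default is assumed from chatterbox-server.py.

```javascript
// Hypothetical helper, not part of the repo. Mirrors the positional
// arguments used above: [start-section] [variant] [audio-prompt.wav].
function parse_demo_args(argv) {
  // argv[0] is the node binary and argv[1] the script path, so the first
  // user-supplied argument lives at index 2.
  const [start_raw, variant_raw, audio_prompt] = argv.slice(2)
  const start = Number.parseInt(start_raw ?? '1', 10)
  return {
    start: Number.isNaN(start) ? 1 : start,
    // Assumption: anything other than 'full' falls back to 'turbo'.
    variant: variant_raw === 'full' ? 'full' : 'turbo',
    audio_prompt: audio_prompt ?? null,
  }
}

const args = parse_demo_args(['node', 'acting-demo-chatterbox.mjs', '7', 'turbo', '/path/to/voice.wav'])
console.log(args)
```

Seen this way, the reference WAV is the third user-supplied argument, which is `process.argv[4]` in Node.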

163
acting-demo.mjs Normal file

@@ -0,0 +1,163 @@
/**
* Demonstrates prosody and acting techniques with Kokoro TTS.
*
* Run: node acting-demo.mjs [start-section]
* E.g.: node acting-demo.mjs 5
*
* Kokoro's expressive range comes from four levers:
*
* 1. Voice choice — each voice has a baked-in emotional character
* 2. Speed — slower = deliberate/grave, faster = excited/happy
* 3. silenceScale — controls pause length between sentences
* 4. Punctuation & markup — espeak-ng honours some SSML and typographic cues
*
* SSML tags that espeak-ng understands (section 5 below notes that this
* backend does not support them, so treat them as reference only):
* <break time="500ms"/> — explicit pause
* <emphasis level="strong">x</emphasis> — louder/longer (model-dependent)
* <prosody rate="slow">x</prosody> — local rate change
* <phoneme alphabet="ipa" ph="..."/> — force pronunciation
*
* Typographic cues that affect prosody without SSML:
* UPPERCASE words — some stress
* Ellipsis ... — trailing pause / trailing off
* Em-dash — — abrupt break
* Comma placement — micro-pauses
* Exclamation ! — raised energy
*/
import { Tts } from './lib/tts.mjs'
import { spawn } from 'node:child_process'
const tts = new Tts()
tts.init()
// Calling the underlying sherpa-onnx engine (tts._tts) directly is the simplest
// way to control per-utterance GenerationConfig without reworking the class.
async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) {
const { createRequire } = await import('node:module')
const require = createRequire(import.meta.url)
const sherpa = require('sherpa-onnx-node')
const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
const sid = VOICES[voice] ?? 0
process.stdout.write(` voice=${voice} speed=${speed} silence=${silence_scale}\n "${text}"\n\n`)
const audio = tts._tts.generate({
text,
generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
})
await play(audio.samples, audio.sampleRate)
}
// Synthesize multiple phrase/option pairs and play them as one continuous audio
async function say_concat(parts) {
const { createRequire } = await import('node:module')
const require = createRequire(import.meta.url)
const sherpa = require('sherpa-onnx-node')
const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
let total_len = 0
const chunks = []
for (const [text, opts = {}] of parts) {
const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts
const sid = VOICES[voice] ?? 0
const audio = tts._tts.generate({
text,
generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
})
process.stdout.write(` [${speed}x] "${text}"\n`)
chunks.push(audio.samples)
total_len += audio.samples.length
}
const combined = new Float32Array(total_len)
let offset = 0
for (const chunk of chunks) {
combined.set(chunk, offset)
offset += chunk.length
}
await play(combined, tts._tts.sampleRate)
process.stdout.write('\n')
}
function play(samples, sample_rate) {
return new Promise((resolve, reject) => {
const proc = spawn('pacat', [
'--format=float32le', `--rate=${sample_rate}`, '--channels=1',
], { stdio: ['pipe', 'ignore', 'inherit'] })
proc.on('error', reject)
proc.on('close', resolve)
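// Respect the view's offset and length in case samples does not span its whole ArrayBuffer.
proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))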
proc.stdin.end()
})
}
const START = parseInt(process.argv[2] ?? '1', 10)
function section(title) {
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}
// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline — neutral af_heart')
if (START <= 1) {
await say('The package has arrived. Please sign here.')
}
// ─── 2. Speed as emotion ───────────────────────────────
section('2. Speed — slow (grave) vs fast (excited)')
if (START <= 2) {
await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 })
await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 })
}
// ─── 3. Voice character ────────────────────────────────
section('3. Voice character — same line, different voices')
if (START <= 3) {
const line = "Well, that's certainly one way to look at it."
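// Every voice listed here must exist in the VOICES map inside say();
// unknown names silently fall back to sid 0 (af_heart).
for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) {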
await say(line, { voice })
}
}
// ─── 4. Punctuation-driven prosody ─────────────────────
section('4. Punctuation and typographic cues')
if (START <= 4) {
await say('I told you. I told you this would happen. But did anyone listen? No.')
await say('It was... quiet. Too quiet.')
await say('He said he was fine — but his eyes told a different story.')
await say('This is ABSOLUTELY unacceptable.')
}
// ─── 5. Per-phrase speed — manual emphasis ─────────────
// SSML tags are not supported by this backend; emphasis is achieved by
// synthesizing key phrases at a different speed and concatenating the audio.
section('5. Per-phrase speed (manual emphasis)')
if (START <= 5) {
// "stop" spoken slower and then the rest faster — sounds like emphasis
await say_concat([
['You need to ', { speed: 1.0 }],
['stop.', { speed: 0.7 }],
['Right now.', { speed: 1.1 }],
])
await say_concat([
['Listen very carefully.', { speed: 0.75 }],
['I will only say this once.', { speed: 0.9 }],
])
}
// ─── 6. Combining levers ───────────────────────────────
section('6. Combined — a tense scene')
if (START <= 6) {
await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 })
await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 })
// No <break> tag here: SSML is not supported by this backend (see section 5).
await say('Run!', { speed: 1.3, voice: 'af_sky' })
}
section('Done')
process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n')
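
Since section 5 notes that SSML is not supported by this backend, an explicit pause can be approximated by padding silence between synthesized chunks before playback. The sketch below is an illustration only: `concat_with_pauses` is a hypothetical helper, not part of this repo, and it assumes mono Float32Array samples at a shared sample rate, as in `say_concat` above.

```javascript
// Hypothetical helper: joins mono Float32Array chunks, inserting
// `pause_ms` of silence between consecutive chunks.
function concat_with_pauses(chunks, sample_rate, pause_ms = 0) {
  const gap = Math.round(sample_rate * pause_ms / 1000)
  const gaps = Math.max(chunks.length - 1, 0) * gap
  const total = chunks.reduce((n, c) => n + c.length, 0) + gaps
  const out = new Float32Array(total) // zero-filled, so the gaps are silent
  let offset = 0
  chunks.forEach((chunk, i) => {
    if (i > 0) offset += gap
    out.set(chunk, offset)
    offset += chunk.length
  })
  return out
}

// Two tiny chunks at a made-up 1000 Hz rate with a 3 ms pause between them.
const joined = concat_with_pauses([Float32Array.of(0.5, 0.5), Float32Array.of(-0.5)], 1000, 3)
console.log(joined.length)
```

The result could be passed to `play()` in place of the plain concatenation, for example to stand in for a `<break time="600ms"/>`.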

135
bark-server.py Executable file

@@ -0,0 +1,135 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
Bark TTS server — keeps model loaded, reads JSON lines from stdin.
Protocol:
stdin: one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"}
stdout: "ok\n" after each utterance is finished playing
stderr: status/timing messages
Usage:
python bark-server.py [model] [voice]
python bark-server.py suno/bark-small v2/en_speaker_3
Voices (English):
v2/en_speaker_0 .. v2/en_speaker_9
Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male
"""
import os
import sys
import json
import time
import subprocess
import numpy as np

MODEL = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark'
DEF_VOICE = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6'
SAMPLE_RATE = 24000
TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')

try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass

# Disable background safetensors conversion attempts: the HF API endpoint
# it calls has changed and now errors. The pytorch_model.bin files work fine.
os.environ['SAFETENSORS_FAST_GPU'] = '0'
os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1')


def log(msg):
    print(f'[bark] {msg}', file=sys.stderr, flush=True)


log(f'loading {MODEL}...')
t0 = time.time()

# Heavy imports come after the first log line so startup is visible immediately.
import torch
from transformers import AutoProcessor, BarkModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32

if device == 'cuda':
    # TF32 speeds up matmul and convolutions on Ampere and newer GPUs (RTX 30xx, 40xx)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

processor = AutoProcessor.from_pretrained(MODEL)
model = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device)

# Bark's internal generation calls set both max_length and max_new_tokens simultaneously
# which causes a conflict warning and may cause early stopping. Clearing max_length
# lets max_new_tokens take sole control.
model.generation_config.max_length = None

# torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile)
try:
    model = torch.compile(model, mode='reduce-overhead')
    log('torch.compile applied')
except Exception as e:
    log(f'torch.compile skipped: {e}')

log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
print('ready', flush=True)  # signal to Node that we are ready


def speak(text, voice):
    t1 = time.time()
    inputs = processor(text, voice_preset=voice, return_tensors='pt')
    # Set attention_mask if not provided by processor. Without it the model
    # cannot distinguish padding from real tokens, which causes erratic generation
    # including leading filler sounds.
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        audio = model.generate(
            **inputs,
            semantic_temperature=0.3,
            coarse_temperature=0.3,
            fine_temperature=0.3,
        )
    samples = audio.cpu().numpy().squeeze().astype(np.float32)
    elapsed_gen = time.time() - t1
    duration = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s rtf={elapsed_gen/duration:.2f}')
    # Blocking playback: "ok" is only printed once pacat has finished.
    proc = subprocess.Popen(
        ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
        stdin=subprocess.PIPE,
    )
    proc.stdin.write(samples.tobytes())
    proc.stdin.close()
    proc.wait()


for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        req = json.loads(line)
        text = req.get('text', '')
        voice = req.get('voice', DEF_VOICE)
    except json.JSONDecodeError:
        # Not JSON: treat the whole line as plain text with the default voice.
        text = line
        voice = DEF_VOICE
    if not text:
        print('ok', flush=True)
        continue
    try:
        speak(text, voice)
    except Exception as e:
        log(f'error: {e}')
    print('ok', flush=True)

200
chatterbox-server.py Executable file

@@ -0,0 +1,200 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin.
Protocol:
stdin: {"text": "...", "temperature": 0.8, "top_p": 0.95}
stdout: "ok\n" after each utterance is generated (playback may still be in progress)
stderr: status/timing messages
Usage:
./chatterbox-server.py
./chatterbox-server.py turbo # default
./chatterbox-server.py full # original model, supports exaggeration
Paralinguistic tags supported in text:
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
Full model only:
exaggeration 0.0-1.0 emotion intensity (ignored in turbo)
"""
import os
import sys
import json
import time
import queue
import threading
import subprocess
import numpy as np

TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')

try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass


def find_hf_cache(repo_id):
    """Return the local snapshot path if the model is already cached, else None."""
    from pathlib import Path
    cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub'
    repo_dir = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots'
    if repo_dir.exists():
        # Pick the most recently modified snapshot.
        snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime)
        if snapshots:
            return str(snapshots[-1])
    return None


VARIANT = sys.argv[1] if len(sys.argv) > 1 else 'turbo'
SAMPLE_RATE = 24000


def log(msg):
    print(f'[chatterbox] {msg}', file=sys.stderr, flush=True)


log(f'loading chatterbox-{VARIANT}...')
t0 = time.time()

# Heavy imports come after the first log line so startup is visible immediately.
import tempfile
import traceback
import torch
import soundfile as sf
import librosa as _librosa

# librosa.resample returns float64 in newer numpy; patch it to always return float32
_orig_resample = _librosa.resample


def _resample_float32(*args, **kwargs):
    return _orig_resample(*args, **kwargs).astype(np.float32)


_librosa.resample = _resample_float32

device = 'cuda' if torch.cuda.is_available() else 'cpu'

REPO_IDS = {
    'turbo': 'ResembleAI/chatterbox-turbo',
    'full': 'ResembleAI/chatterbox',
}

if VARIANT == 'turbo':
    from chatterbox.tts_turbo import ChatterboxTurboTTS as Model
else:
    from chatterbox.tts import ChatterboxTTS as Model

cached = find_hf_cache(REPO_IDS[VARIANT])
if cached:
    log(f'loading from cache: {cached}')
    model = Model.from_local(cached, device=device)
else:
    log('cache not found, downloading...')
    model = Model.from_pretrained(device=device)

log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
print('ready', flush=True)

_wav_cache = {}


def ensure_float32_wav(path):
    """Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue.
    Result is cached by input path so repeated calls with the same file are free."""
    if path in _wav_cache:
        return _wav_cache[path]
    wav, sr = sf.read(path, dtype='float32', always_2d=True)
    wav = wav.mean(axis=1)  # stereo → mono if needed
    tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
    sf.write(tmp.name, wav, sr, subtype='FLOAT')
    _wav_cache[path] = tmp.name
    return tmp.name


_SENTINEL = object()
playback_queue = queue.Queue()


def playback_worker():
    """Plays audio samples in order. Runs in its own thread."""
    while True:
        item = playback_queue.get()
        if item is _SENTINEL:
            break
        samples = item
        proc = subprocess.Popen(
            ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
            stdin=subprocess.PIPE,
        )
        proc.stdin.write(samples.tobytes())
        proc.stdin.close()
        proc.wait()
        playback_queue.task_done()


playback_thread = threading.Thread(target=playback_worker, daemon=True)
playback_thread.start()


def generate(text, opts):
    t1 = time.time()
    # The two variants accept different sampling options; unknown keys are dropped.
    if VARIANT == 'turbo':
        kwargs = {
            'temperature': opts.get('temperature', 0.8),
            'top_p': opts.get('top_p', 0.95),
            'top_k': opts.get('top_k', 1000),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p': opts.get('min_p', 0.0),
        }
    else:
        kwargs = {
            'temperature': opts.get('temperature', 0.8),
            'top_p': opts.get('top_p', 1.0),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p': opts.get('min_p', 0.05),
            'exaggeration': opts.get('exaggeration', 0.5),
            'cfg_weight': opts.get('cfg_weight', 0.5),
        }
    audio_prompt = opts.get('audio_prompt')
    if audio_prompt:
        kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt)
    with torch.inference_mode():
        wav = model.generate(text, **kwargs)
    samples = wav.squeeze(0).cpu().numpy().astype(np.float32)
    elapsed = time.time() - t1
    duration = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed:.1f}s rtf={elapsed/duration:.2f}')
    return samples


for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        req = json.loads(line)
        text = req.pop('text', '')
        opts = req  # remaining fields are generation options
    except json.JSONDecodeError:
        # Not JSON: treat the whole line as plain text with default options.
        text = line
        opts = {}
    if not text:
        print('ok', flush=True)
        continue
    try:
        samples = generate(text, opts)
        playback_queue.put(samples)
    except Exception as e:
        log(f'error: {e}')
        traceback.print_exc(file=sys.stderr)
    print('ok', flush=True)

# Drain playback before exit
playback_queue.put(_SENTINEL)
playback_thread.join()

85
demo-bark.mjs Normal file

@@ -0,0 +1,85 @@
/**
* Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS
*
* Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs],
* [clears throat]), and hesitation (...). Markdown is preprocessed automatically:
* **bold** → UPPERCASE, headers get a pause, etc.
*
* Environment variables:
* WHISPER_MODEL default: base.en
* options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
* BARK_MODEL default: suno/bark (use suno/bark-small for faster/smaller)
* BARK_VOICE default: v2/en_speaker_6
* options: v2/en_speaker_0 .. v2/en_speaker_9
* STT_PROVIDER default: cuda
* USE_LLM set to 0 to disable (default: 1)
* OLLAMA_MODEL default: phi3:mini
*/
import { Stt } from './lib/stt.mjs'
import { Bark_Tts } from './lib/bark-tts.mjs'
import { llm_available, list_models, cleanup } from './lib/llm.mjs'
const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
const USE_LLM = process.env.USE_LLM !== '0'
process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
stt.init()
process.stderr.write('[stt] ready\n')
process.stderr.write('[tts] loading Bark (this takes a moment)...\n')
const tts = new Bark_Tts()
await tts.init()
process.stderr.write('[tts] Bark ready\n')
let has_llm = false
if (USE_LLM) {
has_llm = await llm_available()
if (has_llm) {
const models = await list_models()
process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
} else {
process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
}
}
process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`)
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
let speaking = false
const stop = stt.listen(async (raw_text) => {
if (speaking) {
process.stderr.write(`[skipped] ${raw_text}\n`)
return
}
speaking = true
try {
process.stdout.write(`[raw] ${raw_text}\n`)
let text = raw_text
if (has_llm) {
text = await cleanup(raw_text)
if (text !== raw_text) {
process.stdout.write(`[llm] ${text}\n`)
}
}
await tts.speak_streaming(text)
process.stdout.write('\n')
} catch (err) {
process.stderr.write(`[error] ${err.message}\n`)
} finally {
speaking = false
}
})
process.on('SIGINT', () => {
process.stderr.write('\n[voice] stopping\n')
stop()
tts.stop()
process.exit(0)
})

85
demo-kokoro.mjs Normal file

@@ -0,0 +1,85 @@
/**
* Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS
*
* Environment variables:
* WHISPER_MODEL default: base.en
* options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
* VOICE default: af_heart
* options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael
* STT_PROVIDER default: cuda
* USE_LLM set to 0 to disable (default: 1)
* OLLAMA_MODEL default: phi3:mini
*/
import { Stt } from './lib/stt.mjs'
import { Tts, VOICES } from './lib/tts.mjs'
import { llm_available, list_models, cleanup } from './lib/llm.mjs'
const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
const VOICE = process.env.VOICE || 'af_heart'
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
const USE_LLM = process.env.USE_LLM !== '0'
if (!(VOICE in VOICES)) {
process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`)
process.exit(1)
}
process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
stt.init()
process.stderr.write('[stt] ready\n')
process.stderr.write('[tts] loading Kokoro...\n')
const tts = new Tts({ voice: VOICE })
tts.init()
process.stderr.write('[tts] ready\n')
let has_llm = false
if (USE_LLM) {
has_llm = await llm_available()
if (has_llm) {
const models = await list_models()
process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
} else {
process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
}
}
process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`)
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
let speaking = false
const stop = stt.listen(async (raw_text) => {
if (speaking) {
process.stderr.write(`[skipped] ${raw_text}\n`)
return
}
speaking = true
try {
process.stdout.write(`[raw] ${raw_text}\n`)
let text = raw_text
if (has_llm) {
text = await cleanup(raw_text)
if (text !== raw_text) {
process.stdout.write(`[llm] ${text}\n`)
}
}
await tts.speak_streaming(text)
process.stdout.write('\n')
} catch (err) {
process.stderr.write(`[error] ${err.message}\n`)
} finally {
speaking = false
}
})
process.on('SIGINT', () => {
process.stderr.write('\n[voice] stopping\n')
stop()
process.exit(0)
})
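
Both demos drop any transcript that arrives while the previous one is still being spoken, using a module-level `speaking` flag. A minimal sketch of the same pattern as a reusable wrapper follows. `make_exclusive` and `on_skip` are hypothetical names, not part of this repo, and the timer stands in for a real `tts.speak_streaming()` call.

```javascript
// Hypothetical wrapper: runs `handler` for one call at a time and reports
// (rather than queues) any call that arrives while one is in flight.
function make_exclusive(handler, on_skip = () => {}) {
  let busy = false
  return async function (...args) {
    if (busy) {
      on_skip(...args)
      return false
    }
    busy = true
    try {
      await handler(...args)
      return true
    } finally {
      busy = false // always release, even if the handler throws
    }
  }
}

const skipped = []
const on_text = make_exclusive(
  () => new Promise(resolve => setTimeout(resolve, 10)), // stand-in for speaking
  text => skipped.push(text),
)
on_text('first')
on_text('second') // arrives while 'first' is still in flight
console.log(skipped)
```

Dropping rather than queueing matches the demos' behaviour. Whether queued playback would be preferable depends on how the pipeline is used.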

24
demos/darkwing-briefing.sh Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. Smith, Marella, Rommie
TTS="${TTS_URL:-http://192.168.2.99:11500}"
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
switch darkwing-duck
say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover."
switch gosalyn
say "Dad, you're at the WRONG briefing AGAIN."
switch harold-w-smith
say "How did a duck get into this facility."
switch marella
say "The same way everything else does, sir. The paperwork looked official."
switch rommie
say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation."
switch darkwing-duck
say "I love it when a plan comes together. Let's get dangerous."
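
The `switch` and `say` shell helpers above POST JSON to the `/voice` and `/speak` endpoints. As a rough JavaScript equivalent, the sketch below builds the same requests without sending them. `tts_request`, `switch_voice`, and the `localhost` base URL are illustrative assumptions; only the endpoint paths and body fields come from the scripts.

```javascript
// Hypothetical helper: builds the arguments for fetch(url, init) without
// performing any network I/O.
function tts_request(base_url, endpoint, body) {
  return {
    url: new URL(endpoint, base_url).href,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // JSON.stringify escapes quotes and backslashes in names and text.
      body: JSON.stringify(body),
    },
  }
}

const switch_voice = (base, name) => tts_request(base, '/voice', { name })
const say = (base, text) => tts_request(base, '/speak', { text })

const req = say('http://localhost:11500', 'She said "hello".')
console.log(req.url, req.init.body)
```

Unlike the shell `switch` helper, which interpolates the voice name straight into the JSON string, this version escapes both fields.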

21
demos/threat-assessment.sh Executable file

@@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini
TTS="${TTS_URL:-http://192.168.2.99:11500}"
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
switch ivanova
say "We have an unidentified vessel on an intercept course. Threat assessment?"
switch rommie
say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists."
switch archangel
say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive."
switch chuin
say "Are you? I have been wondering this myself for some time."
switch santini
say "I'm gonna need a bigger hangar."

24
demos/wrong-meeting.sh Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker
TTS="${TTS_URL:-http://192.168.2.99:11500}"
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
switch harold-w-smith
say "This meeting never happened. None of you are here, and neither am I."
switch ivanova
say "That's extremely reassuring. Last time someone said that to me, a space station blew up."
switch archangel
say "Technically it was a controlled demolition. The paperwork was filed in triplicate."
switch rommie
say "I have reviewed the paperwork. It was filed AFTER the explosion."
switch penny-parker
say "You know, I walked into the wrong building again, didn't I."
switch harold-w-smith
say "Yes. But you saw nothing."

63
download-models.sh Executable file

@@ -0,0 +1,63 @@
#!/usr/bin/env bash
# Download ONNX models for the voice experiment.
#
# Usage:
# bash download-models.sh # default: whisper base.en
# WHISPER_MODEL=large-v3-turbo bash download-models.sh
#
# After downloading, run: node demo.mjs
set -euo pipefail
WHISPER_MODEL="${WHISPER_MODEL:-base.en}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MODELS_DIR="${SCRIPT_DIR}/models"
mkdir -p "${MODELS_DIR}"
cd "${MODELS_DIR}"
echo "==> models dir: ${MODELS_DIR}"
# --- Silero VAD (~2 MB) ---
if [ ! -f silero_vad.onnx ]; then
echo "==> downloading silero_vad.onnx"
curl -L --progress-bar -o silero_vad.onnx \
'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx'
else
echo "==> silero_vad.onnx already present"
fi
# --- Whisper STT ---
WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}"
if [ ! -d "${WHISPER_DIR}" ]; then
echo "==> downloading whisper-${WHISPER_MODEL}"
curl -L --progress-bar -o "${WHISPER_DIR}.tar.bz2" \
"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2"
echo "==> extracting..."
tar xf "${WHISPER_DIR}.tar.bz2"
else
echo "==> whisper-${WHISPER_MODEL} already present"
fi
# --- Kokoro TTS (~310 MB) ---
if [ ! -d kokoro-en-v0_19 ]; then
echo "==> downloading kokoro-en-v0_19 (English TTS)"
curl -L --progress-bar -o kokoro-en-v0_19.tar.bz2 \
'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2'
echo "==> extracting..."
tar xf kokoro-en-v0_19.tar.bz2
else
echo "==> kokoro-en-v0_19 already present"
fi
echo ""
echo "==> all models ready"
echo ""
echo "Next steps:"
echo " cd ${SCRIPT_DIR}"
echo " npm install"
echo " node demo.mjs"
echo ""
echo "To use a better STT model:"
echo " WHISPER_MODEL=large-v3-turbo bash download-models.sh"
echo " WHISPER_MODEL=large-v3-turbo node demo.mjs"

112
lib/bark-tts.mjs Normal file

@@ -0,0 +1,112 @@
/**
* Bark TTS — Node.js wrapper around bark-server.py.
*
* Spawns the Python server once, keeps it alive, sends requests as JSON lines.
* Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
*
* Usage:
* const tts = new Bark_Tts()
* await tts.init() // spawns server, waits for model load
* await tts.speak('Hello')
* await tts.speak('**Very** important point.') // emphasis via CAPS
* tts.stop()
*
* Environment variables:
* BARK_MODEL HuggingFace model id (default: suno/bark)
* BARK_VOICE voice preset (default: v2/en_speaker_6)
*
* Bark voice presets (English):
* v2/en_speaker_0 calm female
* v2/en_speaker_1 calm male
* v2/en_speaker_3 deep male
* v2/en_speaker_6 neutral/warm (default)
* v2/en_speaker_9 expressive
*/
import { spawn } from 'node:child_process'
import * as path from 'node:path'
import * as readline from 'node:readline'
import { markdown_to_bark, split_sentences } from './markdown.mjs'
const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark'
const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6'
const SERVER = path.join(import.meta.dirname, '..', 'bark-server.py')
export class Bark_Tts {
constructor({
model = BARK_MODEL,
voice = BARK_VOICE,
} = {}) {
this._model = model
this._voice = voice
this._proc = null
this._rl = null
this._resolve = null // resolver for the current in-flight request
}
/** Spawn bark-server.py and wait until it signals "ready". */
init() {
return new Promise((resolve, reject) => {
this._proc = spawn(SERVER, [this._model, this._voice], {
stdio: ['pipe', 'pipe', 'inherit'],
})
this._proc.on('error', reject)
this._proc.on('close', (code) => {
if (code !== 0 && code !== null) {
process.stderr.write(`[bark] server exited with code ${code}\n`)
}
})
this._rl = readline.createInterface({ input: this._proc.stdout })
this._rl.on('line', (line) => {
if (line === 'ready') {
resolve()
return
}
if (line === 'ok' && this._resolve) {
const res = this._resolve
this._resolve = null
res()
}
})
})
}
/** Preprocess markdown and speak as a single request. */
async speak(text, { voice = this._voice, preprocess = true } = {}) {
const clean = preprocess ? markdown_to_bark(text) : text
return this._send(clean, voice)
}
/**
* Preprocess markdown and speak sentence by sentence.
* Lower latency: the first sentence is synthesized and played before the
* remaining sentences are sent to the server.
*/
async speak_streaming(text, opts = {}) {
const clean = opts.preprocess !== false ? markdown_to_bark(text) : text
const sentences = split_sentences(clean)
for (const s of sentences) {
await this._send(s, opts.voice ?? this._voice)
}
}
_send(text, voice) {
return new Promise((resolve, reject) => {
if (!this._proc) {
return reject(new Error('Bark_Tts not initialized — call init() first'))
}
this._resolve = resolve
const payload = JSON.stringify({ text, voice }) + '\n'
this._proc.stdin.write(payload)
})
}
stop() {
this._rl?.close()
this._proc?.kill()
this._proc = null
this._rl = null
}
}
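
The wrapper keeps a single `_resolve` slot, so a second `speak()` issued before the first "ok" arrives would overwrite the first caller's resolver. One possible way around that, sketched under the assumption that the server answers requests strictly in order, is a FIFO of resolvers. `Reply_Queue` is a hypothetical class, not part of this repo.

```javascript
// Hypothetical class: pairs each request written to the server with the
// next "ok" line read back, in first-in first-out order.
class Reply_Queue {
  constructor() {
    this._waiting = []
  }
  // Call once per request written to the server's stdin.
  expect() {
    return new Promise(resolve => this._waiting.push(resolve))
  }
  // Call for every "ok" line read from the server's stdout.
  on_ok() {
    const next = this._waiting.shift()
    if (next) next()
  }
  get pending() {
    return this._waiting.length
  }
}

const replies = new Reply_Queue()
replies.expect()
replies.expect()
replies.on_ok() // settles the first request only
console.log(replies.pending)
```

In the wrapper, `_send()` would return `expect()` and the readline handler would call `on_ok()`.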

101
lib/chatterbox-tts.mjs Normal file

@@ -0,0 +1,101 @@
/**
* Chatterbox TTS — Node.js wrapper around chatterbox-server.py.
*
* Usage:
* const tts = new Chatterbox_Tts()
* await tts.init()
* await tts.speak('Hello! [chuckle] That is funny.')
* await tts.speak('Something intense.', { exaggeration: 0.8 }) // full model only
* tts.stop()
*
* Paralinguistic tags (embed in text):
* [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
*
* Generation options (passed as second arg to speak()):
* temperature         0.05-2.0  default 0.8
* top_p               0-1       default 0.95
* top_k               0-1000    default 1000
* repetition_penalty  ≥1.0      default 1.2
* min_p               0-1       default 0.0
* audio_prompt        string    path to reference WAV for voice cloning
* exaggeration        0-1       emotion intensity (full model only)
* cfg_weight          0-1       classifier-free guidance (full model only)
*/
import { spawn } from 'node:child_process'
import * as path from 'node:path'
import * as readline from 'node:readline'
import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs'
const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py')
export class Chatterbox_Tts {
constructor({
variant = 'turbo', // 'turbo' or 'full'
} = {}) {
this._variant = variant
this._proc = null
this._rl = null
this._resolve = null
}
init() {
return new Promise((resolve, reject) => {
this._proc = spawn(SERVER, [this._variant], {
stdio: ['pipe', 'pipe', 'inherit'],
})
this._proc.on('error', reject)
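this._proc.on('close', (code) => {
if (code !== null && code !== 0) {
process.stderr.write(`[chatterbox] server exited with code ${code}\n`)
}
// If the server exits before printing "ready", fail init() instead of
// hanging forever. This is a no-op once init() has already resolved.
reject(new Error(`chatterbox server exited before becoming ready (code ${code})`))
})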
this._rl = readline.createInterface({ input: this._proc.stdout })
this._rl.on('line', (line) => {
if (line === 'ready') {
resolve()
return
}
if (line === 'ok' && this._resolve) {
const res = this._resolve
this._resolve = null
res()
}
})
})
}
async speak(text, opts = {}) {
const clean = opts.preprocess === true ? markdown_to_speech(text) : text
return this._send(clean, opts)
}
async speak_streaming(text, opts = {}) {
const clean = opts.preprocess !== false ? markdown_to_speech(text) : text
const sentences = split_sentences(clean)
for (const s of sentences) {
await this._send(s, opts)
}
}
_send(text, opts = {}) {
return new Promise((resolve, reject) => {
if (!this._proc) {
return reject(new Error('Chatterbox_Tts not initialized — call init() first'))
}
this._resolve = resolve
const { preprocess: _, ...gen_opts } = opts
const payload = JSON.stringify({ text, ...gen_opts }) + '\n'
this._proc.stdin.write(payload)
})
}
stop() {
this._rl?.close()
this._proc?.kill()
this._proc = null
this._rl = null
}
}
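
The wrapper forwards generation options to the server unchanged. As a sketch of how the JSON line could be validated first, the helper below clamps numeric options and drops the full-model-only ones for turbo. `build_request`, `RANGES`, and `FULL_ONLY` are hypothetical names, and the ranges are my reading of the header comment (0.05 to 2.0 for temperature, 0 to 1000 for top_k, 0 to 1 for the rest).

```javascript
// Assumed ranges, taken from the option list in the header comment.
const RANGES = {
  temperature: [0.05, 2.0],
  top_p: [0, 1],
  top_k: [0, 1000],
  min_p: [0, 1],
  exaggeration: [0, 1],
  cfg_weight: [0, 1],
}
const FULL_ONLY = new Set(['exaggeration', 'cfg_weight'])

// Hypothetical helper: returns one newline-terminated JSON request line.
function build_request(text, opts = {}, variant = 'turbo') {
  const req = { text }
  for (const [key, value] of Object.entries(opts)) {
    if (variant !== 'full' && FULL_ONLY.has(key)) continue
    const range = RANGES[key]
    // Clamp known numeric options; pass everything else (e.g. audio_prompt) through.
    req[key] = range ? Math.min(Math.max(value, range[0]), range[1]) : value
  }
  return JSON.stringify(req) + '\n'
}

const line = build_request('Hi. [sigh]', { temperature: 3, exaggeration: 0.8 })
console.log(line)
```

Clamping is a design choice. Rejecting out-of-range values with an error would surface mistakes earlier.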

64
lib/llm.mjs Normal file

@@ -0,0 +1,64 @@
/**
* Optional LLM cleanup via Ollama REST API.
*
* Cleans up raw Whisper output: fixes punctuation, capitalization,
* removes filler words, corrects obvious recognition errors.
*
* Set OLLAMA_MODEL env var to choose the model (default: phi3:mini).
* Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434).
*/
const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434'
const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini'
export async function llm_available() {
try {
const res = await fetch(`${OLLAMA_URL}/api/tags`, {
signal: AbortSignal.timeout(2000),
})
return res.ok
} catch {
return false
}
}
export async function list_models() {
const res = await fetch(`${OLLAMA_URL}/api/tags`)
const data = await res.json()
return (data.models ?? []).map(m => m.name)
}
/**
* Clean up a raw STT transcription.
* Returns the cleaned string, or the original if the request fails.
*/
export async function cleanup(raw_text, model = OLLAMA_MODEL) {
const prompt = [
'You are a transcription editor. Clean up this speech-to-text output:',
'- Fix punctuation and capitalization',
'- Remove meaningless filler words (um, uh, like, you know)',
'- Correct obvious word recognition errors based on context',
'- Keep the meaning and phrasing intact',
'- Return ONLY the cleaned text, no explanation, no quotes',
'',
raw_text,
].join('\n')
try {
const res = await fetch(`${OLLAMA_URL}/api/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model,
prompt,
stream: false,
options: { temperature: 0.1, num_predict: 256 },
}),
signal: AbortSignal.timeout(10_000),
})
const data = await res.json()
return data.response?.trim() || raw_text
} catch {
return raw_text
}
}


@@ -0,0 +1,43 @@
export async function is_query_complete(query, model = 'qwen2.5:3b') {
const payload = {
model,
messages: [
{
role: 'system',
content: `
You are a voice query completeness classifier. Determine if the input is complete enough to act on.
Respond only with JSON: {"complete": true} or {"complete": false}
Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent.
Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather"
Incomplete examples: "How do I", "The thing is", "Can you"
`.replace(/^[\t ]+/gm, ''),
},
{ role: 'user', content: query },
],
stream: false,
format: 'json',
options: { num_predict: 10, num_ctx: 2048 },
}
// Fail open: if the classifier is unreachable, errors, or returns
// unparseable output, let the query through
try {
const response = await fetch('http://192.168.2.99:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
})
if (!response.ok) {
return true
}
const data = await response.json()
const parsed = JSON.parse(data.message.content)
return parsed.complete === true
} catch {
return true
}
}
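
The parsing step of the classifier can be looked at in isolation. The sketch below assumes the model's reply arrives as a JSON string in the shape the system prompt asks for. `parse_completeness` is a hypothetical name, not a function in this repo.

```javascript
// Hypothetical helper mirroring the parsing step of is_query_complete().
function parse_completeness(content) {
  try {
    const parsed = JSON.parse(content)
    // Only an explicit {"complete": true} counts as complete.
    return parsed.complete === true
  } catch {
    // Fail open: unparseable classifier output lets the query through.
    return true
  }
}

console.log(parse_completeness('{"complete": true}'))
console.log(parse_completeness('{"complete": false}'))
console.log(parse_completeness('not json at all'))
```

Note the asymmetry: malformed output fails open, but well-formed JSON without a `complete` key is treated as incomplete.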

66
lib/markdown.mjs Normal file

@@ -0,0 +1,66 @@
/**
* Convert markdown to Bark-friendly plain text.
*
* Bark emphasis conventions:
* UPPERCASE words — stress/emphasis
* ... — hesitation / trailing off
* [laughs] [sighs] — paralinguistic tokens (pass through as-is)
*
* Markdown mapping:
* **bold** / __bold__ → UPPERCASE (strong emphasis)
* *italic* / _italic_ → unchanged (soft emphasis, Bark handles naturally)
* # Heading → text + pause via ".\n"
* `code` → unchanged (spoken literally)
* ```block``` → skipped
* [text](url) → text only
* --- → "..." (pause)
* - item / 1. item → item text (bullet stripped)
*/
export function markdown_to_bark(text) {
// Remove fenced code blocks entirely
text = text.replace(/```[\s\S]*?```/g, '')
// Inline code — keep the content, drop backticks
text = text.replace(/`([^`]+)`/g, '$1')
// Images: drop. Must run before the link rule, which would otherwise
// turn ![alt](url) into !alt
text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '')
// Links: keep visible text
text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')
// **bold** / __bold__ → UPPERCASE
text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase())
text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase())
// *italic* / _italic_ — strip markers, keep text
text = text.replace(/\*([^*]+)\*/g, '$1')
text = text.replace(/_([^_]+)_/g, '$1')
// Headings — keep text, add a beat after
text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.')
// Horizontal rules → pause
text = text.replace(/^[-*_]{3,}\s*$/gm, '...')
// List bullets / numbers — strip prefix
text = text.replace(/^\s*[-*+]\s+/gm, '')
text = text.replace(/^\s*\d+\.\s+/gm, '')
// Collapse excess blank lines
text = text.replace(/\n{3,}/g, '\n\n')
return text.trim()
}
/**
* Split text into sentences suitable for sequential TTS.
* Keeps sentence-ending punctuation with the preceding sentence.
*/
export function split_sentences(text) {
return text
.split(/(?<=[.!?])\s+/)
.map(s => s.trim())
.filter(s => s.length > 0)
}
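
One edge case in the italic rule above: a plain `_([^_]+)_` pattern also matches inside snake_case identifiers once inline-code backticks have been removed. The sketch below is one possible variant that only strips underscores that are not preceded or followed by a word character. `strip_italics` is a hypothetical name, and the word-boundary rule is an assumption, not the behavior of `markdown_to_bark`.

```javascript
// Hypothetical variant of the italic-stripping step.
function strip_italics(text) {
  // *italic*: strip the asterisks, keep the text.
  text = text.replace(/\*([^*\n]+)\*/g, '$1')
  // _italic_: strip only when the underscores are not embedded in a word,
  // so identifiers such as markdown_to_bark survive untouched.
  return text.replace(/(^|[^\w])_([^_\n]+)_(?![\w])/g, '$1$2')
}

console.log(strip_italics('an _italic_ word'))
console.log(strip_italics('call markdown_to_bark now'))
```

The trade-off is that underscore emphasis in the middle of a word is left alone, which is also how CommonMark treats intraword underscores.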

170
lib/stt.mjs Normal file

@@ -0,0 +1,170 @@
/**
* Speech-to-Text using sherpa-onnx-node.
*
* Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so
* transcription only runs when a complete utterance has been detected.
* Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA).
*
* Model layout expected under models_dir:
* silero_vad.onnx
* sherpa-onnx-whisper-<name>/
* <name>-encoder.int8.onnx
* <name>-decoder.int8.onnx
* <name>-tokens.txt
*
* Available Whisper model names (pass as whisper_name):
* tiny.en base.en small.en medium.en large-v3 large-v3-turbo
* Download with: bash download-models.sh [model-name]
*/
import { spawn, execSync } from 'node:child_process'
import { createRequire } from 'node:module'
import * as path from 'node:path'
const require = createRequire(import.meta.url)
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
const PRE_ROLL_SAMPLES = 3200 // 200ms at 16kHz
const HISTORY_SAMPLES = 960000 // 60s ring buffer — matches VAD internal size
function find_mic() {
const candidates = [
['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
]
for (const [cmd, args] of candidates) {
try {
execSync(`which ${cmd}`, { stdio: 'ignore' })
return [cmd, args]
} catch { /* try next */ }
}
throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)')
}
function s16le_to_f32(buf) {
const out = new Float32Array(buf.length / 2)
for (let i = 0; i < out.length; i++) {
out[i] = buf.readInt16LE(i * 2) / 32768.0
}
return out
}
export class Stt {
constructor({
models_dir = DEFAULT_MODELS_DIR,
whisper_name = 'base.en',
provider = 'cuda',
} = {}) {
this._whisper_dir = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
this._whisper_name = whisper_name
this._vad_model = path.join(models_dir, 'silero_vad.onnx')
this._provider = provider
this._recognizer = null
this._vad = null
this._history = new Float32Array(HISTORY_SAMPLES)
this._history_pos = 0 // total samples fed, monotonically increasing
}
init() {
const sherpa = require('sherpa-onnx-node')
const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
this._recognizer = new sherpa.OfflineRecognizer({
featConfig: { sampleRate: 16000, featureDim: 80 },
modelConfig: {
whisper: {
encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
},
tokens: path.join(wdir, `${wname}-tokens.txt`),
numThreads: 2,
provider: _provider,
debug: false,
},
})
// VAD runs on CPU — silero_vad.onnx is tiny
this._vad = new sherpa.Vad({
sileroVad: {
model: _vad_model,
threshold: 0.5,
minSilenceDuration: 0.5, // seconds of silence to end an utterance
minSpeechDuration: 0.1, // ignore sub-100ms blips
windowSize: 512, // 32ms @ 16kHz
maxSpeechDuration: 30.0,
},
sampleRate: 16000,
numThreads: 1,
provider: 'cpu',
debug: false,
}, 60) // 60-second internal ring buffer
}
_transcribe(samples) {
const stream = this._recognizer.createStream()
stream.acceptWaveform({ samples, sampleRate: 16000 })
this._recognizer.decode(stream)
return this._recognizer.getResult(stream).text.trim()
}
_with_preroll(seg) {
const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
const pre_len = seg.start - pre_start
if (pre_len === 0) return seg.samples
const out = new Float32Array(pre_len + seg.samples.length)
for (let i = 0; i < pre_len; i++) {
out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
}
out.set(seg.samples, pre_len)
return out
}
/**
* Start listening. Calls on_text(text) for each detected utterance.
* Returns a stop() function.
*/
listen(on_text, { on_audio } = {}) {
const [cmd, args] = find_mic()
process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`)
const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })
const VAD_WIN = 512 // must match windowSize above
let pending = Buffer.alloc(0)
mic.stdout.on('data', (chunk) => {
if (on_audio) on_audio()
pending = Buffer.concat([pending, chunk])
// Feed complete VAD windows
while (pending.length >= VAD_WIN * 2) {
const win = pending.subarray(0, VAD_WIN * 2)
pending = pending.subarray(VAD_WIN * 2)
const f32 = s16le_to_f32(win)
// Write to history ring buffer for pre-roll
const base = this._history_pos % HISTORY_SAMPLES
for (let i = 0; i < f32.length; i++) {
this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
}
this._history_pos += f32.length
this._vad.acceptWaveform(f32)
}
// Drain any complete speech segments
while (!this._vad.isEmpty()) {
const seg = this._vad.front()
this._vad.pop()
const text = this._transcribe(this._with_preroll(seg))
if (text) on_text(text)
}
})
mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`))
mic.on('close', code => {
if (code !== null && code !== 0) {
process.stderr.write(`[stt] mic exited with code ${code}\n`)
}
})
return () => mic.kill()
}
}
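
The pre-roll logic above relies on a ring buffer indexed by an absolute, ever-increasing sample position. The following is a minimal standalone sketch of that idea. `Ring_History` and `ms_to_samples` are hypothetical names, and the bounds check in `read()` is an added assumption that the `Stt` class itself does not perform.

```javascript
// Convert a duration to a sample count, e.g. 200 ms at 16 kHz.
function ms_to_samples(ms, sample_rate) {
  return Math.round((ms * sample_rate) / 1000)
}

// Hypothetical ring buffer addressed by absolute sample index.
class Ring_History {
  constructor(capacity) {
    this.capacity = capacity
    this.buf = new Float32Array(capacity)
    this.pos = 0 // total samples written, monotonically increasing
  }

  write(samples) {
    for (let i = 0; i < samples.length; i++) {
      this.buf[(this.pos + i) % this.capacity] = samples[i]
    }
    this.pos += samples.length
  }

  // Read `len` samples starting at absolute index `start`.
  // Returns null if the range was overwritten or not yet written.
  read(start, len) {
    const oldest = this.pos - this.capacity
    if (start < 0 || start < oldest || start + len > this.pos) return null
    const out = new Float32Array(len)
    for (let i = 0; i < len; i++) {
      out[i] = this.buf[(start + i) % this.capacity]
    }
    return out
  }
}

const history = new Ring_History(8)
history.write(Float32Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
console.log(ms_to_samples(200, 16000))       // 3200
console.log(Array.from(history.read(6, 3)))  // [ 6, 7, 8 ]
```

Because positions are absolute, a segment's start index reported by the VAD can be used directly as long as it falls within the most recent `capacity` samples.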

51
lib/tts-client.mjs Normal file

@@ -0,0 +1,51 @@
/**
* TTS HTTP client — speaks text via a running tts-server.mjs instance.
*
* Usage:
* const tts = new Tts_Client()
* await tts.speak('Hello world.')
* await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
* const { voices, current } = await tts.list_voices()
* await tts.set_voice('rommie')
*
* The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500.
*/
const DEFAULT_URL = process.env.TTS_URL ?? 'http://192.168.2.99:11500'
export class Tts_Client {
constructor({ url = DEFAULT_URL } = {}) {
this._url = url
}
async speak(text, opts = {}) {
const res = await fetch(`${this._url}/speak`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text, ...opts }),
})
if (!res.ok) {
const data = await res.json().catch(() => ({}))
throw new Error(data.error ?? `HTTP ${res.status}`)
}
}
async list_voices() {
const res = await fetch(`${this._url}/voices`)
if (!res.ok) throw new Error(`HTTP ${res.status}`)
return res.json() // { voices: [{name, description, active}], current }
}
async set_voice(name) {
const res = await fetch(`${this._url}/voice`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name }),
})
if (!res.ok) {
const data = await res.json().catch(() => ({}))
throw new Error(data.error ?? `HTTP ${res.status}`)
}
return res.json() // { ok, name, path }
}
}

107
lib/tts.mjs Normal file

@@ -0,0 +1,107 @@
/**
* Text-to-Speech using sherpa-onnx-node with the Kokoro model.
*
* Model layout expected under models_dir:
* kokoro-en-v0_19/
* model.onnx
* voices.bin
* tokens.txt
* lexicon-us-en.txt
* espeak-ng-data/
*
* Audio output goes to pacat (PulseAudio) as float32le @ 24kHz.
* speak_streaming() splits on sentence boundaries and synthesizes and plays one
* sentence at a time, so playback starts before the full text is synthesized.
*/
import { spawn } from 'node:child_process'
import { createRequire } from 'node:module'
import * as path from 'node:path'
const require = createRequire(import.meta.url)
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
// Speaker IDs in kokoro-en-v0_19
export const VOICES = {
af_heart: 0,
af_bella: 1,
af_nicole: 2,
af_sarah: 3,
af_sky: 4,
am_adam: 5,
am_michael: 6,
}
export class Tts {
constructor({
models_dir = DEFAULT_MODELS_DIR,
voice = 'af_heart',
speed = 1.0,
provider = 'cpu', // Kokoro synthesis is fast enough on CPU
} = {}) {
this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19')
this._voice = voice
this._speed = speed
this._provider = provider
this._tts = null
}
init() {
const sherpa = require('sherpa-onnx-node')
const { _kokoro_dir: dir, _provider } = this
this._tts = new sherpa.OfflineTts({
model: {
kokoro: {
model: path.join(dir, 'model.onnx'),
voices: path.join(dir, 'voices.bin'),
tokens: path.join(dir, 'tokens.txt'),
dataDir: path.join(dir, 'espeak-ng-data'),
},
},
maxNumSentences: 2,
numThreads: 2,
debug: false,
provider: _provider,
})
}
/** Synthesize and play a single piece of text. Resolves when playback ends. */
async speak(text) {
const sid = VOICES[this._voice] ?? 0
const audio = this._tts.generate({ text, sid, speed: this._speed })
await this._play(audio.samples, audio.sampleRate)
}
/**
* Split text on sentence boundaries and synthesize + play each sentence
* in sequence. Lower latency than waiting for full synthesis first.
*/
async speak_streaming(text) {
for (const sentence of split_sentences(text)) {
if (sentence.trim()) {
await this.speak(sentence.trim())
}
}
}
_play(samples, sample_rate) {
return new Promise((resolve, reject) => {
const proc = spawn('pacat', [
'--format=float32le',
`--rate=${sample_rate}`,
'--channels=1',
], { stdio: ['pipe', 'ignore', 'inherit'] })
proc.on('error', reject)
proc.on('close', resolve)
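// Respect the view's offset and length in case samples does not span its whole ArrayBuffer
proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))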
proc.stdin.end()
})
}
}
function split_sentences(text) {
// Split after . ! ? — keep the punctuation with the preceding sentence
return text.split(/(?<=[.!?])\s+/)
}
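
The lookbehind split used for sentence streaming is easy to try on its own. The sketch below reuses the same regular expression, with the trim and filter steps from `lib/markdown.mjs`, and shows one known limitation: abbreviations ending in a period are treated as sentence ends.

```javascript
// Split after . ! ? followed by whitespace, keeping the punctuation
// attached to the preceding sentence.
function split_sentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0)
}

console.log(split_sentences('Hello there. How are you? Fine!'))
// [ 'Hello there.', 'How are you?', 'Fine!' ]

// Limitation: the abbreviation is split off as its own "sentence".
console.log(split_sentences('Dr. Smith arrived.'))
// [ 'Dr.', 'Smith arrived.' ]
```

For TTS the extra split mostly costs a small pause, so a simple rule like this may be acceptable.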

22
listen.mjs Normal file

@@ -0,0 +1,22 @@
/**
* Listen with Whisper and print transcribed text to stdout.
*
* Usage:
* node listen.mjs
* node listen.mjs small.en
* node listen.mjs large-v3-turbo
*/
import { Stt } from './lib/stt.mjs'
const whisper_name = process.argv[2] ?? 'base.en'
const stt = new Stt({ whisper_name })
stt.init()
process.stderr.write(`[listen] whisper model: ${whisper_name}\n`)
process.stderr.write(`[listen] listening — Ctrl+C to stop\n`)
stt.listen((text) => {
process.stdout.write(text + '\n')
})

13
package.json Normal file

@@ -0,0 +1,13 @@
{
"name": "voice-experiment",
"version": "0.1.0",
"type": "module",
"scripts": {
"download": "bash download-models.sh",
"setup-venv": "bash setup-venv.sh"
},
"dependencies": {
"sherpa-onnx-node": "^1.12.14",
"js-yaml": "^4.1.0"
}
}

123
query-demo.mjs Normal file

@@ -0,0 +1,123 @@
/**
* Voice query demo: accumulates STT utterances into a query, checks
* completeness with a local LLM classifier, then dispatches to Claude Code.
*
* Usage:
* node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--silence-timeout ms] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs]
*
* If --window-id and --claude-remote are supplied, completed queries are
* dispatched to Claude Code. Otherwise they are only logged to stderr.
*
* A query is considered complete when any one of the following holds:
* - The classifier says the latest fragment is complete
* - The latest fragment consists solely of a send-word (go/done/send)
* - No new fragment arrives within SILENCE_TIMEOUT ms
*/
import { execFileSync } from 'node:child_process'
import { Stt } from './lib/stt.mjs'
import { Tts_Client } from './lib/tts-client.mjs'
import { is_query_complete } from './lib/local-query-complete.mjs'
function get_arg(name) {
const i = process.argv.indexOf(name)
return i !== -1 ? process.argv[i + 1] : null
}
const audio_prompt = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav'
const whisper_name = get_arg('--whisper') ?? 'base.en'
const window_id = get_arg('--window-id')
const claude_remote = get_arg('--claude-remote')
const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000')
const tts = new Tts_Client()
const stt = new Stt({ whisper_name })
stt.init()
process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`)
process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`)
if (window_id && claude_remote) {
process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`)
} else {
process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`)
}
await tts.speak('Ready for input.', { audio_prompt })
function make_prompt(query) {
return [
'[voice-buddy] You have received a voice query.',
'',
'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.',
'If the task is a question, speak the answer directly.',
'If the task is an action, carry it out and then speak a short confirmation.',
'',
'--- Query ---',
query,
'--- End of query ---',
].join('\n')
}
function dispatch(query) {
execFileSync('node', [claude_remote, '--window-id', window_id], {
input: make_prompt(query),
encoding: 'utf8',
stdio: ['pipe', 'inherit', 'inherit'],
})
}
async function submit(query) {
query = query.trim()
if (!query) {
return
}
process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`)
if (window_id && claude_remote) {
dispatch(query)
}
await tts.speak('Efforting.', { audio_prompt })
}
const SEND_WORDS = new Set(['go', 'done', 'send'])
let accumulated = ''
let silence_timer = null
function reset_silence_timer() {
clearTimeout(silence_timer)
silence_timer = setTimeout(async () => {
if (!accumulated) {
return
}
process.stderr.write('[query-demo] silence timeout — submitting\n')
const query = accumulated
accumulated = ''
await submit(query)
}, SILENCE_TIMEOUT)
}
stt.listen(async (text) => {
accumulated = accumulated ? `${accumulated}\n${text}` : text
process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`)
reset_silence_timer()
const last_line = accumulated.split('\n').at(-1)
const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '')
const forced = SEND_WORDS.has(last_norm)
if (!forced) {
const complete = await is_query_complete(last_line)
if (!complete) {
return
}
}
clearTimeout(silence_timer)
const lines = accumulated.split('\n')
const query = (forced ? lines.slice(0, -1) : lines).join('\n')
accumulated = ''
await submit(query)
}, { on_audio: () => { if (accumulated) reset_silence_timer() } })
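
The send-word check in the listener above can be isolated as a pure function. This sketch assumes, as the code does, that fragments are joined with newlines and that a send-word must be the entire last fragment. `check_send_word` is a hypothetical name.

```javascript
const SEND_WORDS = new Set(['go', 'done', 'send'])

// Hypothetical helper: decides whether the last fragment forces a submit,
// and returns the query text with the send-word fragment removed.
function check_send_word(accumulated) {
  const lines = accumulated.split('\n')
  // Normalize: lowercase and drop everything except a-z (punctuation, spaces).
  const last_norm = lines.at(-1).toLowerCase().replace(/[^a-z]/g, '')
  const forced = SEND_WORDS.has(last_norm)
  const query = (forced ? lines.slice(0, -1) : lines).join('\n')
  return { forced, query }
}

console.log(check_send_word('Turn off the lights\nGo.'))
console.log(check_send_word('Let us go'))
```

In the second call the fragment normalizes to `letusgo`, so a send-word at the end of a longer fragment does not force a submit.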

11
requirements.txt Normal file

@@ -0,0 +1,11 @@
# STT
faster-whisper>=1.1.0
# TTS (already used in kokoro-tts-test)
kokoro>=0.9.4
# Audio capture
sounddevice>=0.4.6
# Numpy (usually already present)
numpy>=1.24.0

24
setup-venv.sh Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Creates a venv in ./venv and installs Python deps for Bark and Chatterbox.
# Safe to re-run: the venv is only created if missing, and pip skips packages that are already installed.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${SCRIPT_DIR}/venv"
if [ ! -d "${VENV}" ]; then
echo "==> creating venv at ${VENV}"
python3 -m venv "${VENV}"
fi
echo "==> installing Python dependencies"
"${VENV}/bin/pip" install --upgrade pip --quiet
"${VENV}/bin/pip" install transformers accelerate torch numpy
"${VENV}/bin/pip" install chatterbox-tts
chmod +x "${SCRIPT_DIR}/bark-server.py"
chmod +x "${SCRIPT_DIR}/chatterbox-server.py"
echo ""
echo "==> venv ready at ${VENV}"

47
speak-as.mjs Normal file

@@ -0,0 +1,47 @@
/**
* Read text from stdin and speak it using a voice reference clip.
*
* Usage:
* echo "Hello world" | node speak-as.mjs /path/to/voice.wav
* cat script.txt | node speak-as.mjs /path/to/voice.wav
* node speak-as.mjs /path/to/voice.wav (then type, Ctrl+D when done)
*/
import * as fs from 'node:fs'
import * as readline from 'node:readline'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
const audio_prompt = process.argv[2]
if (!audio_prompt) {
process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n')
process.exit(1)
}
if (!fs.existsSync(audio_prompt)) {
process.stderr.write(`File not found: ${audio_prompt}\n`)
process.exit(1)
}
const tts = new Chatterbox_Tts()
await tts.init()
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
terminal: process.stdin.isTTY,
})
const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()
let last_norm = ''
for await (const line of rl) {
const text = line.trim()
if (!text) continue
const n = norm(text)
if (n === last_norm) continue
last_norm = n
await tts.speak(text, { audio_prompt })
}
tts.stop()

133
tts-server.mjs Normal file

@@ -0,0 +1,133 @@
/**
* TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API.
* Requests are serialized so generation and playback stay in order.
*
* Usage:
* node tts-server.mjs
* TTS_PORT=11500 node tts-server.mjs
*
* API:
* POST /speak { "text": "...", "audio_prompt": "/path/to/voice.wav", ... }
* → 200 { "ok": true } (after generation, playback continues in background)
* GET /voices → 200 { "voices": [{ "name": "rommie", "description": "...", "active": true }, ...], "current": "rommie" | null }
* POST /voice { "name": "rommie" }
* → 200 { "ok": true, "name": "rommie", "path": "..." }
* GET /health → 200 { "ok": true }
*/
import * as http from 'node:http'
import * as fs from 'node:fs'
import * as path from 'node:path'
import yaml from 'js-yaml'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
const PORT = parseInt(process.env.TTS_PORT ?? '11500')
const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml')
function reload_voices() {
try {
const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8'))
return doc?.voices ?? {}
} catch {
return {}
}
}
let voices = reload_voices()
let current_voice = null // name of active voice, or null
// --- TTS setup ---
const tts = new Chatterbox_Tts()
process.stderr.write('[tts-server] starting chatterbox...\n')
await tts.init()
process.stderr.write('[tts-server] chatterbox ready\n')
// Serialize all speak requests through a promise chain
let queue = Promise.resolve()
function enqueue(fn) {
const result = queue.then(fn)
// Don't let a failed request poison the queue
queue = result.catch(() => {})
return result
}
function read_body(req) {
return new Promise((resolve, reject) => {
let buf = ''
req.on('data', chunk => { buf += chunk })
req.on('end', () => {
try { resolve(JSON.parse(buf)) } catch (e) { reject(e) }
})
req.on('error', reject)
})
}
function send(res, status, body) {
const payload = JSON.stringify(body)
res.writeHead(status, { 'Content-Type': 'application/json' })
res.end(payload)
}
const server = http.createServer(async (req, res) => {
if (req.method === 'GET' && req.url === '/health') {
return send(res, 200, { ok: true })
}
if (req.method === 'GET' && req.url === '/voices') {
voices = reload_voices()
const list = Object.entries(voices).map(([name, v]) => ({
name,
description: v.description ?? '',
active: name === current_voice,
}))
return send(res, 200, { voices: list, current: current_voice })
}
if (req.method === 'POST' && req.url === '/voice') {
let body
try { body = await read_body(req) } catch {
return send(res, 400, { error: 'invalid JSON' })
}
const { name } = body
if (!name) return send(res, 400, { error: 'name required' })
voices = reload_voices()
if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` })
current_voice = name
process.stderr.write(`[tts-server] voice switched to: ${name}\n`)
return send(res, 200, { ok: true, name, path: voices[name].path })
}
if (req.method === 'POST' && req.url === '/speak') {
let body
try {
body = await read_body(req)
} catch {
return send(res, 400, { error: 'invalid JSON' })
}
const { text, ...opts } = body
if (!text) {
return send(res, 400, { error: 'text required' })
}
// Inject current voice as default audio_prompt if none provided
if (!opts.audio_prompt && current_voice && voices[current_voice]) {
opts.audio_prompt = voices[current_voice].path
}
try {
await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
send(res, 200, { ok: true })
} catch (err) {
send(res, 500, { error: err.message })
}
return
}
send(res, 404, { error: 'not found' })
})
server.listen(PORT, () => {
process.stderr.write(`[tts-server] listening on port ${PORT}\n`)
})
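
The promise-chain queue used for `/speak` can be exercised without the TTS backend. The sketch below copies the `enqueue` pattern and runs three stand-in tasks, one of which fails. `sleep`, `order`, and `demo` are illustrative names only.

```javascript
// Serialize async tasks through a single promise chain.
let queue = Promise.resolve()
function enqueue(fn) {
  const result = queue.then(fn)
  // Swallow the error on the chain only; the caller still sees it via `result`.
  queue = result.catch(() => {})
  return result
}

const order = []
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// The first task is slow, the second fails, the third must still run last.
const demo = Promise.allSettled([
  enqueue(async () => { await sleep(20); order.push('first') }),
  enqueue(async () => { throw new Error('boom') }),
  enqueue(async () => { order.push('third') }),
]).then(results => {
  console.log(order)                       // [ 'first', 'third' ]
  console.log(results.map(r => r.status))  // [ 'fulfilled', 'rejected', 'fulfilled' ]
  return results
})
```

Each caller receives its own result or error, while the shared chain always continues, which is why one failed request does not block later ones.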

65
voice-buddy.mjs Normal file

@@ -0,0 +1,65 @@
/**
* Voice buddy — reads voice queries from stdin (JSON lines) and dispatches
* them to a Claude Code instance via claude-remote.
*
* Usage:
* node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs
*
* Options:
* --window-id <id> X11 window ID of Claude Code (decimal or 0x hex)
* --claude-remote <path> Path to claude-remote.mjs
*/
import * as readline from 'node:readline'
import { execFileSync } from 'node:child_process'
const args = process.argv.slice(2)
function get_arg(name) {
const i = args.indexOf(name)
if (i === -1 || i + 1 >= args.length) {
process.stderr.write(`Missing argument: ${name}\n`)
process.exit(1)
}
return args[i + 1]
}
const window_id = get_arg('--window-id')
const claude_remote = get_arg('--claude-remote')
function make_prompt(query) {
return [
'[voice-buddy] You have received a voice query.',
'',
'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.',
'',
'--- Query ---',
query,
'--- End of query ---',
].join('\n')
}
function dispatch(query) {
const prompt = make_prompt(query)
process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`)
execFileSync('node', [claude_remote, '--window-id', window_id], {
input: prompt,
encoding: 'utf8',
stdio: ['pipe', 'inherit', 'inherit'],
})
}
const rl = readline.createInterface({ input: process.stdin })
for await (const line of rl) {
const trimmed = line.trim()
if (!trimmed) {
continue
}
try {
const query = JSON.parse(trimmed)
dispatch(query)
} catch {
process.stderr.write(`[voice-buddy] bad JSON line: ${trimmed}\n`)
}
}

54
voices.yaml Normal file

@@ -0,0 +1,54 @@
# Named voices for TTS server.
# Switch via: POST /voice {"name": "rommie"}
# List via: GET /voices
voices:
rommie:
path: /home/devilholk/Documents/rommie-sample.wav
description: Rommie - Andromeda ship AI / android
archangel:
path: /home/devilholk/Documents/archangel.ogg
description: Archangel from Airwolf
marella:
path: /home/devilholk/Documents/murella.ogg
description: Marella from Airwolf
chuin:
path: /home/devilholk/Documents/chuin.ogg
description: Chiun from Remo Williams - The Adventure Begins
harold-w-smith:
path: /home/devilholk/Documents/cure-leader.ogg
description: Harold W. Smith from Remo Williams - The Adventure Begins
penny-parker:
path: /home/devilholk/Documents/penny.ogg
description: Penny from MacGyver
santini:
path: /home/devilholk/Documents/santini.ogg
description: Santini from Airwolf
gosalyn:
path: /home/devilholk/Documents/gasalin.ogg
description: Gosalyn from Darkwing Duck
darkwing-duck:
path: /home/devilholk/Documents/dwd.ogg
description: Darkwing Duck
belitski:
path: /home/devilholk/Documents/slipstream2.ogg
description: Belitski from Slipstream
ivanova:
path: /home/devilholk/Documents/ivanova.ogg
description: Susan Ivanova from Babylon 5
sheila:
path: /home/devilholk/Documents/darkness-woman.ogg
description: Sheila from Army of Darkness
# Add more voices as clips become available