Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11
.gitignore
vendored
Normal file
@@ -0,0 +1,11 @@
node_modules/
models/
venv/
*.ogg
*.wav
*.mp3
*.onnx
*.bin
package-lock.json
__pycache__/
*.pyc
159
CLEANUP-PLAN.md
Normal file
@@ -0,0 +1,159 @@
# Voice Project — Cleanup Plan

> [!NOTE]
> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.

The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document is the plan before doing the actual cleanup.

---

## Current state

### Active pipeline (keep)

| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point — STT → classifier → dispatch |
| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
| `voices.yaml` | Named voice config |
| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
| `lib/local-query-complete.mjs` | Ollama classifier |
| `chatterbox-server.py` | Long-running Python TTS process |
| `listen.mjs` | Standalone mic → stdout tool |
| `speak-as.mjs` | Standalone stdin → TTS tool |
| `demos/` | Multi-voice demo scripts |
| `download-models.sh` | Whisper + VAD model download |
| `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
| `models/` | Downloaded STT models |
| `venv/` | Python venv (not committed) |

### Dead / experimental (remove or archive)

| File | Reason |
|------|--------|
| `acting-demo-bark.mjs` | Bark TTS — abandoned |
| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
| `acting-demo.mjs` | Old demo |
| `bark-server.py` | Bark backend — abandoned |
| `demo-bark.mjs` | Bark demo |
| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
| `lib/bark-tts.mjs` | Bark wrapper |
| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
| `lib/markdown.mjs` | Unused? Verify before deleting |
| `lib/llm.mjs` | Unused? Verify before deleting |
| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
| `voice-buddy.mjs` | Merged into query-demo.mjs — delete |

---

## What needs to be documented / fixed

### 1. `setup-venv.sh` — rewrite

Current script installs Bark deps. The active backend is Chatterbox. New script should:
- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
- HuggingFace token instruction: write token to `~/.secrets/hugging-face.token`
- Note: `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached

### 2. `requirements.txt` — replace or delete

Currently lists kokoro and faster-whisper (unused). Either:
- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
- Or just delete it — `setup-venv.sh` is the source of truth

### 3. sherpa-onnx / CUDA build — document clearly

The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.

### 4. System dependencies — list them

Nothing currently lists what needs to be installed at the OS level:
- `parec` / `pacat` (PulseAudio) — audio I/O
- `cmake` — for optional CUDA sherpa-onnx rebuild
- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
- `ollama` running locally with `qwen2.5:3b` pulled — for classifier
- `jq` — used in shell functions and demo scripts
- Node.js (version? check what's required)
- Python 3 with venv support

### 5. External services — document

- **TTS server**: runs on the host (not in container), port 11500
- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server

### 6. `voices.yaml` — keep and document

Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).

---

## Proposed new top-level structure

```
voice/
  README.md              ← start here: what this is, quick start
  SETUP.md               ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md               ← architecture deep-dives (keep as-is)
  TODO.md                ← keep as-is
  VOICES.md              ← wishlist (keep as-is)
  voices.yaml            ← runtime voice config
  query-demo.mjs         ← main entry point
  tts-server.mjs         ← TTS HTTP server
  listen.mjs             ← standalone STT tool
  speak-as.mjs           ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh          ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/                ← gitignored
  venv/                  ← gitignored
```

---

## Cleanup steps (in order)

1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
3. **Rewrite** `setup-venv.sh` for Chatterbox only
4. **Replace** `requirements.txt` with accurate one or delete
5. **Write** `SETUP.md` covering all install steps end-to-end
6. **Write** `README.md` with quick start
7. **Add** voice cloning tips to top of `voices.yaml`
8. **Commit** the whole cleanup as a single commit
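
The import check in step 1 could also be scripted instead of grepped by hand. This is a minimal sketch, assuming the project sources have already been read into a `{ path: text }` map; the name `find_importers` and the map shape are hypothetical. It matches import specifiers with a regular expression, so comments and strings can cause false positives, and a clean result is a hint rather than proof.

```javascript
// Hypothetical helper: list the files whose import specifiers end with `module_file`.
// `sources` maps a file path to that file's text content.
function find_importers(sources, module_file) {
  // Matches: from './lib/llm.mjs', import('./lib/llm.mjs'), and import './x.mjs'
  const specifier = /(?:from\s*|import\s*\(\s*|import\s+)(['"])([^'"]+)\1/g
  const importers = []
  for (const [path, text] of Object.entries(sources)) {
    for (const match of text.matchAll(specifier)) {
      const spec = match[2]
      if (spec === module_file || spec.endsWith('/' + module_file)) {
        importers.push(path)
        break // one hit per file is enough
      }
    }
  }
  return importers
}

// Toy input; real use would read every .mjs file in the project.
const sources = {
  'query-demo.mjs': "import { Stt } from './lib/stt.mjs'\n",
  'voice-buddy.mjs': "import { ask } from './lib/llm.mjs'\n",
}
console.log(find_importers(sources, 'llm.mjs'))      // files that still import lib/llm.mjs
console.log(find_importers(sources, 'markdown.mjs')) // empty when nothing imports it
```

If the only importers of a module are files already on the delete list, the module can go with them.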

---

## Open questions

- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?

---

## New architecture (revised plan)

Keep this project as a reference. Build clean standalone repos instead:

| Repo | Contents | Depends on |
|------|----------|-----------|
| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |

Each standalone repo gets its own README, SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspace / symlinks — TBD).

### Open questions for new architecture

- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
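
The env-var lookup for model storage might look roughly like the sketch below. Only `MODELS_DIR` comes from this plan; the function name `resolve_model_path`, the per-model variable `SILERO_VAD_PATH`, and the `/srv/models/` path are illustrative assumptions.

```javascript
// Hypothetical resolver: a per-model env var wins, then MODELS_DIR, then fail loudly.
// `env` is passed in (e.g. process.env) so the function stays easy to test.
function resolve_model_path(env, model_name, per_model_var) {
  if (per_model_var && env[per_model_var]) return env[per_model_var]
  if (env.MODELS_DIR) return env.MODELS_DIR.replace(/\/+$/, '') + '/' + model_name
  throw new Error(`Set MODELS_DIR or ${per_model_var ?? 'a per-model variable'} to locate ${model_name}`)
}

// Example with a made-up canonical location.
const env = { MODELS_DIR: '/srv/models/' }
console.log(resolve_model_path(env, 'silero_vad.onnx', 'SILERO_VAD_PATH'))
```

Throwing when nothing is configured, rather than falling back to `./models/`, keeps a repo from quietly downloading models into the project directory again.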
51
LLM-ROUTING.md
Normal file
@@ -0,0 +1,51 @@
# LLM Routing Strategy

When to use a local model vs an Anthropic frontier model in the voice pipeline.

---

## Current model inventory

| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | — |

Total local VRAM: ~6–7 GB on RTX 3090 (24 GB). Plenty of headroom.

---

## Routing principles

### Use local model when:
- Task is **classification or routing** (completeness, intent detection, command vs question)
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
- Task is **high-frequency** (runs on every utterance — billing would add up)
- Privacy matters — query content shouldn't leave the machine

### Use Anthropic frontier model when:
- Task requires **reasoning, synthesis, or judgment**
- Task involves **code generation or editing**
- Task involves **world knowledge** beyond what a small model handles well
- Query is **low-frequency** (once per user request, not once per audio frame)
- Quality matters more than latency
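
The two lists above can be read as a small decision function. This is a sketch only: the task descriptor fields, the `kind` values, and the tie-breaking order (privacy and per-utterance frequency checked first, frontier as the default) are assumptions made for illustration, not rules stated in this document.

```javascript
// Hypothetical task descriptor -> 'local' | 'frontier'. Field names are illustrative.
function route_task(task) {
  const { kind, per_utterance = false, private_content = false, needs_world_knowledge = false } = task
  // Privacy and per-utterance frequency pin a task to the local model.
  if (private_content || per_utterance) return 'local'
  // Reasoning, synthesis, code, and knowledge-heavy work goes to the frontier model.
  if (['reasoning', 'synthesis', 'code'].includes(kind) || needs_world_knowledge) return 'frontier'
  // Classification and routing stay local.
  if (['classification', 'routing'].includes(kind)) return 'local'
  // Assumed default: prefer quality for anything unclassified.
  return 'frontier'
}

console.log(route_task({ kind: 'classification', per_utterance: true }))
console.log(route_task({ kind: 'code' }))
```

Putting the privacy check first means a private reasoning task stays local even though quality may suffer; a real router might instead refuse or ask.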

---

## GPU split (future option)

If voice models and reasoning models compete for VRAM, split across cards:
- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API

Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.

---

## Upgrade candidates

- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
- **Completeness classifier → larger model** — Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification.
- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
161
NOTES.md
Normal file
@@ -0,0 +1,161 @@
# Voice Experiment Notes

Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.

---

## Architecture

```
parec (PulseAudio)
  └─ lib/stt.mjs                        Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so  (needs CUDA build — see below)
                 └─ libonnxruntime.so   (system, with CUDA)

text → lib/chatterbox-tts.mjs
         └─ chatterbox-server.py        long-running Python TTS server
              └─ chatterbox-tts (pip)   ResembleAI Chatterbox model
```

---

## STT

**Library:** `sherpa-onnx-node` npm package
**Models:** Whisper (offline batch) + Silero VAD
**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`

### Model download

```bash
bash download-models.sh base.en   # or: tiny.en small.en medium.en large-v3 large-v3-turbo
```

Models land in `models/`:
- `silero_vad.onnx`
- `sherpa-onnx-whisper-base.en/`

### How it works

`lib/stt.mjs` feeds 512-sample (32ms) windows into Silero VAD. When VAD signals speech end (after 500ms silence), the segment is sent to Whisper for transcription. The transcription is written to stdout — **note: the native addon writes directly to fd 1 (bypassing Node.js process.stdout), not via the JavaScript callback**. The JS callback is also called and can be used for additional processing.
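
The windowing arithmetic above can be sketched without touching the native addon. This is not the code in `lib/stt.mjs`; the helper name `create_windower` is hypothetical, and it assumes the VAD wants float samples scaled to [-1, 1), which is a common convention but should be checked against the binding in use.

```javascript
const SAMPLE_RATE = 16000   // matches the parec capture rate
const WINDOW_SAMPLES = 512  // 512 samples at 16 kHz is 32 ms per VAD window

// Hypothetical helper: turns s16le sample chunks of arbitrary size into
// fixed 512-sample Float32Array windows, carrying the remainder forward.
function create_windower() {
  let pending = new Float32Array(0)
  return function push(int16_samples) {
    const merged = new Float32Array(pending.length + int16_samples.length)
    merged.set(pending)
    for (let i = 0; i < int16_samples.length; i++) {
      merged[pending.length + i] = int16_samples[i] / 32768 // scale to [-1, 1)
    }
    const windows = []
    let offset = 0
    while (merged.length - offset >= WINDOW_SAMPLES) {
      windows.push(merged.subarray(offset, offset + WINDOW_SAMPLES))
      offset += WINDOW_SAMPLES
    }
    pending = merged.slice(offset) // leftover samples wait for the next chunk
    return windows
  }
}

const push = create_windower()
console.log(push(new Int16Array(800)).length) // one full window, 288 samples left pending
console.log(push(new Int16Array(800)).length) // 288 + 800 = 1088 samples: two windows, 64 pending
console.log(WINDOW_SAMPLES * 1000 / SAMPLE_RATE) // window length in milliseconds
```

At 32 ms per window, the 500 ms end-of-speech silence corresponds to roughly 16 consecutive non-speech windows.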

### Getting CUDA

The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:

```bash
# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx

export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib

cd ~/Foreign/sherpa-onnx
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .

cmake --build build --target sherpa-onnx-c-api

# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so "$NMOD/libsherpa-onnx-c-api.so"
rm -f "$NMOD/libonnxruntime.so"
```

The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.

---

## TTS

**Library:** `chatterbox-tts` Python package (ResembleAI)
**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
**Playback:** `pacat --format=float32le --rate=24000 --channels=1`

### Setup

```bash
bash setup-venv.sh   # creates ./venv, installs chatterbox-tts + deps
```

HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.

### Variants

| Variant | RTF (RTX 3090) | Notes |
|---------|---------------|-------|
| turbo | ~0.2 | default, female voice, no exaggeration param |
| full | slower | male voice, supports `exaggeration` (0.0–1.0) |

### Paralinguistic tags

Work in both variants:
```
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
```

### Voice cloning

Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
```javascript
await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
```

The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.

### Server protocol

`chatterbox-server.py` is a long-running process. Communication:
- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
- **stderr:** status/timing logs

Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
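
The line protocol above is small enough to sketch from the client side. This is an illustration, not the code in `lib/chatterbox-tts.mjs`; the helper names are hypothetical, and the process spawning that would connect them to the server's stdin and stdout is left out.

```javascript
// One request is one JSON object on a single line, as described above.
function encode_request(text, options = {}) {
  return JSON.stringify({ text, ...options }) + '\n'
}

// Hypothetical ack reader: feed it stdout chunks; it calls `on_ok` once per
// complete "ok" line, even when a line is split across chunks.
function create_ack_reader(on_ok) {
  let buffered = ''
  return function feed(chunk) {
    buffered += chunk
    let newline
    while ((newline = buffered.indexOf('\n')) !== -1) {
      const line = buffered.slice(0, newline)
      buffered = buffered.slice(newline + 1)
      if (line === 'ok') on_ok()
    }
  }
}

let generated = 0
const feed = create_ack_reader(() => { generated++ })
process.stdout.write(encode_request('Hello.', { temperature: 0.8 }))
feed('o')          // partial line: nothing is counted yet
feed('k\nok\n')    // completes the first ack and delivers a second
console.log(generated)
```

Because `ok` marks the end of generation rather than playback, a client can send the next request as soon as the ack arrives and let the server overlap the two.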

### Acting tips

- Emphasis: `UPPERCASE` words
- Pause/rhythm: `...` `—`
- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)

---

## Tools

| Script | Usage |
|--------|-------|
| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |

### Pipe example

```bash
node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
```

`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.

---

## Known Quirks

**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
```python
_orig = _librosa.resample
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
```

**HuggingFace offline:** `from_pretrained()` always hits the internet. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.

**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
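
A consecutive-line filter of that kind might look like the sketch below. It is not copied from `speak-as.mjs`; the function names and the exact normalization (lowercase, drop everything except letters, digits, and whitespace) are assumptions.

```javascript
// Lowercase, strip punctuation, collapse whitespace.
function normalize_line(line) {
  return line.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim()
}

// Hypothetical filter: returns a function that answers "should this line be spoken?"
function create_deduplicator() {
  let previous = null
  return function should_speak(line) {
    const key = normalize_line(line)
    if (key === '' || key === previous) return false // empty, or a repeat of the last line
    previous = key
    return true
  }
}

const should_speak = create_deduplicator()
console.log(should_speak('Turn on the lights.'))  // first occurrence is spoken
console.log(should_speak('turn on the lights'))   // same text after normalization is skipped
console.log(should_speak('Turn them off.'))       // different text is spoken
```

This only catches segments that normalize to the same string. Segments that are merely similar would need a fuzzier comparison.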

**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
80
SPEAKER-DIARIZATION.md
Normal file
@@ -0,0 +1,80 @@
# Speaker Diarization

Notes on adding speaker identification to the voice pipeline.

---

## What Whisper does and doesn't do

Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.

---

## Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.

### pyannote.audio

The most capable open-source option. Runs locally, GPU-accelerated.

```
pip install pyannote.audio
```

Produces time-stamped segments:
```
SPEAKER_00  0.5s – 3.2s
SPEAKER_01  3.8s – 7.1s
SPEAKER_00  7.5s – 12.0s
```

Feed each segment to Whisper independently → transcript with speaker tags.

Requires a HuggingFace token and model access request (free).

### sherpa-onnx

Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.

---

## Speaker identification vs diarization

| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:

- **resemblyzer** — simple cosine similarity on d-vector embeddings
- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain** — full toolkit, heavier dependency

---

## Future experiments

### Multi-speaker transcription
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```

### Real-time diarization
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.

### Voice enrollment
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
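
The scoring step can be sketched independently of any embedding model. The example assumes embeddings are plain numeric arrays of equal length; the function names, the three-dimensional toy vectors, and the 0.75 threshold are placeholders, since a usable threshold depends on the embedding model and has to be tuned on real clips.

```javascript
function cosine_similarity(a, b) {
  let dot = 0, norm_a = 0, norm_b = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    norm_a += a[i] * a[i]
    norm_b += b[i] * b[i]
  }
  return dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
}

// enrolled: { name: embedding }. Returns the closest enrolled name if its score
// reaches `threshold`; otherwise name is null and the caller assigns an anonymous label.
function identify_speaker(embedding, enrolled, threshold = 0.75) {
  let best = { name: null, score: -Infinity }
  for (const [name, reference] of Object.entries(enrolled)) {
    const score = cosine_similarity(embedding, reference)
    if (score > best.score) best = { name, score }
  }
  return best.score >= threshold ? best : { name: null, score: best.score }
}

// Toy embeddings; real ones have hundreds of dimensions.
const enrolled = { mikael: [1, 0, 0], claude: [0, 1, 0] }
console.log(identify_speaker([1, 0, 0.1], enrolled).name) // close to the first reference
console.log(identify_speaker([1, 1, 0], enrolled).name)   // between both references, below threshold
```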

### Wake word per speaker
Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.

---

## VRAM considerations

pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
41
TODO.md
Normal file
@@ -0,0 +1,41 @@
# Voice Experiment — TODO

## Classifier
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
- [ ] Tune classifier — still too conservative on plain statements and short instructions
- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
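
One possible shape for the query journal from the first item, as a sketch. The record fields (`time`, `fragments`, `classifier`, `resolution`) and function names are hypothetical; only the three resolution values come from the item itself.

```javascript
// The three ways a query can be resolved, as listed in the TODO item.
const RESOLUTIONS = ['timeout', 'send-word', 'classifier']

// Hypothetical writer: one JSON object per line, ready to append to a .jsonl file.
function journal_line(entry, now = new Date()) {
  if (!RESOLUTIONS.includes(entry.resolution)) {
    throw new Error(`unknown resolution: ${entry.resolution}`)
  }
  return JSON.stringify({
    time: now.toISOString(),
    fragments: entry.fragments,            // raw STT fragments, in arrival order
    classifier: entry.classifier ?? null,  // classifier decision, if it ran
    resolution: entry.resolution,
  }) + '\n'
}

// Replay side: parse a journal back into records, skipping blank lines.
function parse_journal(text) {
  return text.split('\n').filter(line => line.trim() !== '').map(line => JSON.parse(line))
}

const line = journal_line({ fragments: ['turn on', 'the lights'], resolution: 'timeout' })
console.log(parse_journal(line)[0].fragments.join(' '))
```

Keeping the raw fragments rather than the joined query is what would make re-classification experiments possible later.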

## Voice switching
- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
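
The daily rotation item is simple enough to sketch. The hash function (FNV-1a, chosen here only because it needs no imports), the UTC date string, and the voice names are assumptions; any stable hash would do.

```javascript
// FNV-1a 32-bit hash: small and deterministic. An arbitrary choice for this sketch.
function fnv1a(text) {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep the value an unsigned 32-bit integer
  }
  return hash
}

// Hash the date string mod N voices, as described in the rotation item.
function voice_for_date(voice_names, date = new Date()) {
  const day = date.toISOString().slice(0, 10) // "YYYY-MM-DD" in UTC
  return voice_names[fnv1a(day) % voice_names.length]
}

// Illustrative names; the real list would come from voices.yaml.
const voices = ['rommie', 'brimley', 'calm']
console.log(voice_for_date(voices))
```

Using the UTC date means the voice changes at midnight UTC rather than local midnight; formatting a local date string instead would change that.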

## Pipeline
- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
- [ ] Reserved control vocabulary — single words, pattern matched before any classifier:

  | Word(s) | Action |
  |---------|--------|
  | `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
  | `project note` | Activate with project context injected into the prompt preamble |
  | `go` / `send` / `done` | Dispatch current query immediately |
  | `cancel` / `scratch that` | Discard accumulated query, return to idle |
  | `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
  | `switch voice` / `different voice` | Trigger voice-switch flow |

  Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content.

- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
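
The control vocabulary and the staged hierarchy fit together naturally, so the sketch below covers both. The action names, the `classify` signature, and the `accumulate` fallback are hypothetical. Stages are synchronous here to keep the example short; an LLM stage would be asynchronous in practice.

```javascript
function normalize(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim()
}

// Reserved words from the table above, mapped to hypothetical action names.
const CONTROL_WORDS = new Map([
  ['computer', 'activate'],
  ['project note', 'activate_with_project'],
  ['go', 'dispatch'], ['send', 'dispatch'], ['done', 'dispatch'],
  ['cancel', 'discard'], ['scratch that', 'discard'],
  ['stop', 'shutdown'], ['goodbye', 'shutdown'], ['hang up', 'shutdown'],
  ['switch voice', 'switch_voice'], ['different voice', 'switch_voice'],
])

// Stage 1: exact match on the whole normalized utterance. Returns null to pass through.
function control_stage(utterance) {
  const action = CONTROL_WORDS.get(normalize(utterance))
  return action ? { action } : null
}

// Runs stages in order; the first non-null result short-circuits the rest.
function classify(utterance, stages) {
  for (const stage of stages) {
    const result = stage(utterance)
    if (result !== null) return result
  }
  return { action: 'accumulate' } // assumed fallback: treat it as query content
}

console.log(classify('Scratch that.', [control_stage]).action)
console.log(classify('what is the weather like', [control_stage]).action)
```

Matching the whole utterance rather than a substring keeps "go" inside a longer sentence from dispatching early.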

## TTS / Long text
- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
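
For reference, sentence-boundary splitting can be as small as the sketch below. This is not the `speak_streaming` implementation, which is not shown in this commit; the function name and the 200-character budget are assumptions, since the actual Chatterbox limit is not stated here.

```javascript
// Hypothetical splitter: break after ., !, or ? followed by whitespace, then pack
// sentences into chunks no longer than `max_chars`.
function split_for_tts(text, max_chars = 200) {
  const sentences = text.split(/(?<=[.!?])\s+/).map(s => s.trim()).filter(Boolean)
  const chunks = []
  let current = ''
  for (const sentence of sentences) {
    if (current && current.length + 1 + sentence.length > max_chars) {
      chunks.push(current) // adding this sentence would exceed the budget
      current = sentence
    } else {
      current = current ? current + ' ' + sentence : sentence
    }
  }
  if (current) chunks.push(current)
  return chunks
}

console.log(split_for_tts('One. Two! Three?', 10))
```

A single sentence longer than `max_chars` is passed through whole, so very long sentences would still need a second-level split, for example on commas.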

## TTS / Prompt
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.

## STT
- [ ] Try Whisper small.en or medium.en for better accuracy on accents
29
VOICES.md
Normal file
@@ -0,0 +1,29 @@
# Voice Clone Wishlist

Voices to prepare reference clips for.

## Ready

| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well |
| Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly authoritative delivery, dry clean speech |

## To Do

| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality |
| Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum |
| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
| Steve Bacic | Andromeda (Telemachus Rhade) | — |
| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |

## Notes

- Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both
- Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone
135
acting-demo-bark.mjs
Normal file
@@ -0,0 +1,135 @@
/**
 * Demonstrates acting and expressiveness with Bark TTS.
 *
 * Run: node acting-demo-bark.mjs [start-section]
 *
 * Bark expressive techniques:
 *   UPPERCASE        — stress/emphasis on a word
 *   ...              — hesitation, trailing off
 *   [laughs]         — laughter
 *   [sighs]          — sigh
 *   [gasps]          — sharp intake of breath
 *   [clears throat]  — throat clearing
 *   ♪ text ♪         — sung phrase
 *   —                — abrupt break
 *   !                — raised energy
 *
 * Voice presets (English):
 *   v2/en_speaker_0  calm female
 *   v2/en_speaker_1  calm male
 *   v2/en_speaker_3  deep male
 *   v2/en_speaker_6  neutral/warm (default)
 *   v2/en_speaker_9  expressive
 */

import { Bark_Tts } from './lib/bark-tts.mjs'

const tts = new Bark_Tts()

process.stderr.write('[bark] loading model...\n')
await tts.init()
process.stderr.write('[bark] ready\n\n')

const START = parseInt(process.argv[2] ?? '1', 10)

function section(title) {
  process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}

// Speak one line; when no voice is given, the backend's default preset is used.
async function say(text, voice) {
  const opts = { preprocess: false }
  if (voice) opts.voice = voice
  process.stdout.write(` ${voice ?? 'default'}\n "${text}"\n\n`)
  await tts.speak(text, opts)
}

// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline')
if (START <= 1) {
  await say('The package has arrived. Please sign here.')
}

// ─── 2. Paralinguistic tokens ──────────────────────────
section('2. Paralinguistic tokens')
if (START <= 2) {
  await say('[clears throat] Right. Where were we.')
  await say('And then he just... [sighs] gave up.')
  await say('[gasps] I had no idea you were here!')
  await say("Ha. [laughs] That's the funniest thing I've heard all week.")
}

// ─── 3. Emphasis via UPPERCASE ─────────────────────────
section('3. Emphasis via UPPERCASE')
if (START <= 3) {
  await say('You need to stop. RIGHT NOW.')
  await say('I have told you THIS BEFORE.')
  await say('This is ABSOLUTELY unacceptable.')
}

// ─── 4. Hesitation and rhythm ──────────────────────────
section('4. Hesitation and rhythm')
if (START <= 4) {
  await say('It was... quiet. Too quiet.')
  await say('I just... I don\'t know what to say.')
  await say('He said he was fine — but his eyes told a different story.')
}

// ─── 5. Sung phrase ────────────────────────────────────
section('5. Singing')
if (START <= 5) {
  await say('♪ Happy birthday to you, happy birthday to you ♪')
}

// ─── 6. Acronyms ───────────────────────────────────────
section('6. Acronyms')
if (START <= 6) {
  // Without spacing — Bark may try to pronounce as a word
  await say('The FBI is investigating.')
  await say('NASA launched a new rocket.')
  await say('I work with the CPU and GPU.')
  // With letter spacing — forces spelling out
  await say('The F B I is investigating.')
  await say('N A S A launched a new rocket.')
  await say('I work with the C P U and G P U.')
}

// ─── 7. Voice character ────────────────────────────────
section('7. Voice presets — same line')
if (START <= 7) {
  const line = "Well, that's certainly one way to look at it."
  for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) {
    await say(line, voice)
  }
}

// ─── 8. Combined scene ─────────────────────────────────
section('8. Combined — a tense scene')
if (START <= 8) {
  await say('Something is wrong.', 'v2/en_speaker_9')
  await say('I can FEEL it.', 'v2/en_speaker_9')
  await say('[gasps] Run!', 'v2/en_speaker_9')
}

// ─── 9. Paragraph — voices 7-9 ─────────────────────────
// Pass voices as CLI args to override, e.g.:
//   node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5
const PARA_VOICES = process.argv.slice(3).length
  ? process.argv.slice(3)
  : ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9']

const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. '
  + '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. '
  + 'Her hands trembled as she tore open the envelope. '
  + 'Inside was a single sheet of paper with just THREE words written on it. '
  + '[gasps] She read them twice... then sat down slowly, and stared at the wall.'

section('9. Paragraph — voice comparison')
if (START <= 9) {
  for (const voice of PARA_VOICES) {
    process.stdout.write(`\n — ${voice} —\n`)
    await say(PARAGRAPH, voice)
  }
}

tts.stop()
section('Done')
109
acting-demo-chatterbox.mjs
Normal file
@@ -0,0 +1,109 @@
/**
 * Acting demo for Chatterbox TTS.
 *
 * Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full] [voice.wav]
 *
 * Paralinguistic tags:
 *   [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
 *
 * Exaggeration (full model only, 0.0-1.0):
 *   0.0 = neutral, 1.0 = maximum emotion intensity
 */

import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'

const START = parseInt(process.argv[2] ?? '1', 10)
const VARIANT = process.argv[3] ?? 'turbo'

const tts = new Chatterbox_Tts({ variant: VARIANT })

process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`)
await tts.init()
process.stderr.write('[chatterbox] ready\n\n')

function section(title) {
  process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}

// Print the line (and any generation options), then speak it.
async function say(text, opts = {}) {
  process.stdout.write(` "${text}"\n`)
  if (Object.keys(opts).length) process.stdout.write(` ${JSON.stringify(opts)}\n`)
  process.stdout.write('\n')
  await tts.speak(text, { preprocess: false, ...opts })
}

// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline')
if (START <= 1) {
  await say('The package has arrived. Please sign here.')
}

// ─── 2. Paralinguistic tags ────────────────────────────
section('2. Paralinguistic tags')
if (START <= 2) {
  await say('[cough] Right, where were we.')
  await say('And then he just... [sigh] gave up.')
  await say('[gasp] I had no idea you were here!')
  await say('Ha! [laugh] That is the funniest thing I have heard all week.')
  await say('[chuckle] Well, that is certainly one way to look at it.')
  await say('[groan] Not this again.')
}

// ─── 3. Punctuation and rhythm ─────────────────────────
section('3. Punctuation and rhythm')
if (START <= 3) {
  await say('It was... quiet. Too quiet.')
  await say('He said he was fine — but his eyes told a different story.')
  await say('I told you. I told you this would happen. But did anyone listen? No.')
}

// ─── 4. Exaggeration — full model only ─────────────────
section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`)
if (START <= 4) {
  await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 })
  await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 })
  await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 })
}

// ─── 5. Temperature ────────────────────────────────────
section('5. Temperature (lower = more predictable)')
if (START <= 5) {
  await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 })
  await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 })
  await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 })
}

// ─── 6. Paragraph ──────────────────────────────────────
section('6. Paragraph with tags')
if (START <= 6) {
  await say(
    'It was a cold evening when the letter finally arrived. '
    + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. '
    + 'Her hands trembled as she tore open the envelope. '
    + '[gasp] Inside was a single sheet of paper with just three words written on it.'
  )
}

// ─── 7. Audio prompt / voice cloning ───────────────────
// Pass a WAV file path as third argument:
//   node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav
section('7. Audio prompt (voice cloning)')
if (START <= 7) {
  const audio_prompt = process.argv[4]
  if (!audio_prompt) {
    process.stdout.write(' Skipped — pass a WAV file as the third argument:\n')
    process.stdout.write(' node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n')
    process.stdout.write(' Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n')
  } else {
    process.stdout.write(` voice: ${audio_prompt}\n\n`)
    await say('Hello. This is a test of voice cloning.', { audio_prompt })
    await say(
      'It was a cold evening when the letter finally arrived. '
      + '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad.',
      { audio_prompt }
    )
  }
}

tts.stop()
section('Done')
163
acting-demo.mjs
Normal file
@@ -0,0 +1,163 @@
/**
 * Demonstrates prosody and acting techniques with Kokoro TTS.
 *
 * Run:  node acting-demo.mjs [start-section]
 * E.g.: node acting-demo.mjs 5
 *
 * Kokoro's expressive range comes from four levers:
 *
 *   1. Voice choice         — each voice has a baked-in emotional character
 *   2. Speed                — slower = deliberate/grave, faster = excited/happy
 *   3. silenceScale         — controls pause length between sentences
 *   4. Punctuation & markup — typographic cues shape prosody
 *
 * SSML tags that espeak-ng understands (listed for reference only: this
 * backend does not support them, see section 5):
 *   <break time="500ms"/>                   — explicit pause
 *   <emphasis level="strong">x</emphasis>   — louder/longer (model-dependent)
 *   <prosody rate="slow">x</prosody>        — local rate change
 *   <phoneme alphabet="ipa" ph="..."/>      — force pronunciation
 *
 * Typographic cues that affect prosody without SSML:
 *   UPPERCASE words   — some stress
 *   Ellipsis ...      — trailing pause / trailing off
 *   Em-dash —         — abrupt break
 *   Comma placement   — micro-pauses
 *   Exclamation !     — raised energy
 */

import { createRequire } from 'node:module'
import { spawn } from 'node:child_process'
import { Tts } from './lib/tts.mjs'

// Load sherpa-onnx-node via require(), once, instead of on every call.
const require = createRequire(import.meta.url)
const sherpa = require('sherpa-onnx-node')

const VOICES = { af_heart: 0, af_bella: 1, af_nicole: 2, af_sarah: 3, af_sky: 4, am_adam: 5, am_michael: 6 }

const tts = new Tts()
tts.init()

// Reaching into the wrapper's sherpa-onnx handle (tts._tts) is the simplest way
// to control per-utterance GenerationConfig without reworking the class.

// Map a voice name to its speaker id. Unknown names fall back to af_heart (0),
// with a warning so the fallback is not silent.
function voice_id(voice) {
  if (!Object.hasOwn(VOICES, voice)) {
    process.stderr.write(`[tts] unknown voice "${voice}", using af_heart\n`)
    return 0
  }
  return VOICES[voice]
}

async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) {
  const sid = voice_id(voice)

  process.stdout.write(` voice=${voice} speed=${speed} silence=${silence_scale}\n "${text}"\n\n`)

  const audio = tts._tts.generate({
    text,
    generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
  })

  await play(audio.samples, audio.sampleRate)
}

// Synthesize multiple phrase/option pairs and play them as one continuous audio
async function say_concat(parts) {
  let total_len = 0
  const chunks = []

  for (const [text, opts = {}] of parts) {
    const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts
    const sid = voice_id(voice)
    const audio = tts._tts.generate({
      text,
      generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
    })
    process.stdout.write(` [${speed}x] "${text}"\n`)
    chunks.push(audio.samples)
    total_len += audio.samples.length
  }

  const combined = new Float32Array(total_len)
  let offset = 0
  for (const chunk of chunks) {
    combined.set(chunk, offset)
    offset += chunk.length
  }

  await play(combined, tts._tts.sampleRate)
  process.stdout.write('\n')
}

// Pipe raw float32 samples to pacat and resolve when playback finishes.
function play(samples, sample_rate) {
  return new Promise((resolve, reject) => {
    const proc = spawn('pacat', [
      '--format=float32le', `--rate=${sample_rate}`, '--channels=1',
    ], { stdio: ['pipe', 'ignore', 'inherit'] })
    proc.on('error', reject)
    proc.on('close', resolve)
    // Respect byteOffset/byteLength in case samples is a view into a larger buffer.
    proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))
    proc.stdin.end()
  })
}

const START = parseInt(process.argv[2] ?? '1', 10)

function section(title) {
  process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
}

// ─── 1. Baseline ───────────────────────────────────────
section('1. Baseline — neutral af_heart')
if (START <= 1) {
  await say('The package has arrived. Please sign here.')
}

// ─── 2. Speed as emotion ───────────────────────────────
section('2. Speed — slow (grave) vs fast (excited)')
if (START <= 2) {
  await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 })
  await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 })
}

// ─── 3. Voice character ────────────────────────────────
section('3. Voice character — same line, different voices')
if (START <= 3) {
  const line = "Well, that's certainly one way to look at it."
  // Only voices present in VOICES; 'bm_george' has no entry there and
  // would have played as af_heart.
  for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) {
    await say(line, { voice })
  }
}

// ─── 4. Punctuation-driven prosody ─────────────────────
section('4. Punctuation and typographic cues')
if (START <= 4) {
  await say('I told you. I told you this would happen. But did anyone listen? No.')
  await say('It was... quiet. Too quiet.')
  await say('He said he was fine — but his eyes told a different story.')
  await say('This is ABSOLUTELY unacceptable.')
}

// ─── 5. Per-phrase speed — manual emphasis ─────────────
// SSML tags are not supported by this backend; emphasis is achieved by
// synthesizing key phrases at a different speed and concatenating the audio.
section('5. Per-phrase speed (manual emphasis)')
if (START <= 5) {
  // "stop" spoken slower and then the rest faster — sounds like emphasis
  await say_concat([
    ['You need to ', { speed: 1.0 }],
    ['stop.', { speed: 0.7 }],
    ['Right now.', { speed: 1.1 }],
  ])
  await say_concat([
    ['Listen very carefully.', { speed: 0.75 }],
    ['I will only say this once.', { speed: 0.9 }],
  ])
}

// ─── 6. Combining levers ───────────────────────────────
section('6. Combined — a tense scene')
if (START <= 6) {
  await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 })
  await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 })
  // No <break/> tag here: SSML is not supported by this backend (see section 5).
  await say('Run!', { speed: 1.3, voice: 'af_sky' })
}

section('Done')
process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n')
135
bark-server.py
Executable file
@@ -0,0 +1,135 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
Bark TTS server — keeps model loaded, reads JSON lines from stdin.

Protocol:
  stdin:  one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"}
  stdout: "ok\n" after each utterance is finished playing
  stderr: status/timing messages

Usage:
  python bark-server.py [model] [voice]
  python bark-server.py suno/bark-small v2/en_speaker_3

Voices (English):
  v2/en_speaker_0 .. v2/en_speaker_9
  Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male
"""

import os
import sys
import json
import time
import subprocess
import numpy as np

MODEL = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark'
DEF_VOICE = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6'
SAMPLE_RATE = 24000

TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass

# Disable background safetensors conversion attempts — the HF API endpoint
# it calls has changed and now errors. The pytorch_model.bin files work fine.
os.environ['SAFETENSORS_FAST_GPU'] = '0'
os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1')

def log(msg):
    print(f'[bark] {msg}', file=sys.stderr, flush=True)

log(f'loading {MODEL}...')
t0 = time.time()

import torch
from transformers import AutoProcessor, BarkModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32

if device == 'cuda':
    # TF32 is faster on Ampere and newer GPUs (30xx, 40xx) for both matmul and convolutions
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

processor = AutoProcessor.from_pretrained(MODEL)
model = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device)

# Bark's internal generation calls set both max_length and max_new_tokens simultaneously
# which causes a conflict warning and may cause early stopping. Clearing max_length
# lets max_new_tokens take sole control.
model.generation_config.max_length = None

# torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile)
try:
    model = torch.compile(model, mode='reduce-overhead')
    log('torch.compile applied')
except Exception as e:
    log(f'torch.compile skipped: {e}')

log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
print('ready', flush=True)  # signal to Node that we are ready


def speak(text, voice):
    t1 = time.time()

    inputs = processor(text, voice_preset=voice, return_tensors='pt')

    # Set attention_mask if not provided by processor — without it the model
    # cannot distinguish padding from real tokens, which causes erratic generation
    # including leading filler sounds.
    if 'attention_mask' not in inputs:
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])

    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.inference_mode():
        audio = model.generate(
            **inputs,
            semantic_temperature=0.3,
            coarse_temperature=0.3,
            fine_temperature=0.3,
        )

    samples = audio.cpu().numpy().squeeze().astype(np.float32)
    elapsed_gen = time.time() - t1
    duration = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s rtf={elapsed_gen/duration:.2f}')

    # Play synchronously: the caller only prints "ok" once playback has finished.
    proc = subprocess.Popen(
        ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
        stdin=subprocess.PIPE,
    )
    proc.stdin.write(samples.tobytes())
    proc.stdin.close()
    proc.wait()


for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # A line that is not valid JSON is treated as plain text for the default voice.
    try:
        req = json.loads(line)
        text = req.get('text', '')
        voice = req.get('voice', DEF_VOICE)
    except json.JSONDecodeError:
        text = line
        voice = DEF_VOICE

    if not text:
        print('ok', flush=True)
        continue

    try:
        speak(text, voice)
    except Exception as e:
        log(f'error: {e}')

    print('ok', flush=True)
200
chatterbox-server.py
Executable file
@@ -0,0 +1,200 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin.

Protocol:
  stdin:  {"text": "...", "temperature": 0.8, "top_p": 0.95}
  stdout: "ok\n" after each utterance is generated (playback may still be in progress)
  stderr: status/timing messages

Usage:
  ./chatterbox-server.py
  ./chatterbox-server.py turbo   # default
  ./chatterbox-server.py full    # original model, supports exaggeration

Paralinguistic tags supported in text:
  [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]

Full model only:
  exaggeration 0.0-1.0  emotion intensity (ignored in turbo)
"""

import os
import sys
import json
import time
import queue
import threading
import subprocess
import numpy as np

TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
try:
    with open(TOKEN_FILE) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass

def find_hf_cache(repo_id):
    """Return the local snapshot path if the model is already cached, else None."""
    from pathlib import Path
    cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub'
    repo_dir = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots'
    if repo_dir.exists():
        # Most recently modified snapshot wins.
        snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime)
        if snapshots:
            return str(snapshots[-1])
    return None

VARIANT = sys.argv[1] if len(sys.argv) > 1 else 'turbo'
SAMPLE_RATE = 24000

def log(msg):
    print(f'[chatterbox] {msg}', file=sys.stderr, flush=True)

log(f'loading chatterbox-{VARIANT}...')
t0 = time.time()

import tempfile
import traceback
import torch
import soundfile as sf
import librosa as _librosa

# librosa.resample returns float64 in newer numpy — patch it to always return float32
_orig_resample = _librosa.resample
def _resample_float32(*args, **kwargs):
    return _orig_resample(*args, **kwargs).astype(np.float32)
_librosa.resample = _resample_float32

device = 'cuda' if torch.cuda.is_available() else 'cpu'

REPO_IDS = {
    'turbo': 'ResembleAI/chatterbox-turbo',
    'full': 'ResembleAI/chatterbox',
}

if VARIANT == 'turbo':
    from chatterbox.tts_turbo import ChatterboxTurboTTS as Model
else:
    from chatterbox.tts import ChatterboxTTS as Model

cached = find_hf_cache(REPO_IDS[VARIANT])
if cached:
    log(f'loading from cache: {cached}')
    model = Model.from_local(cached, device=device)
else:
    log('cache not found, downloading...')
    model = Model.from_pretrained(device=device)

log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
print('ready', flush=True)


_wav_cache = {}

def ensure_float32_wav(path):
    """Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue.
    Result is cached by input path so repeated calls with the same file are free."""
    if path in _wav_cache:
        return _wav_cache[path]
    wav, sr = sf.read(path, dtype='float32', always_2d=True)
    wav = wav.mean(axis=1)  # stereo → mono if needed
    tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
    tmp.close()  # only the name is needed; soundfile reopens the file by path
    sf.write(tmp.name, wav, sr, subtype='FLOAT')
    _wav_cache[path] = tmp.name
    return tmp.name


_SENTINEL = object()

playback_queue = queue.Queue()


def playback_worker():
    """Plays audio samples in order. Runs in its own thread."""
    while True:
        item = playback_queue.get()
        if item is _SENTINEL:
            break
        samples = item
        proc = subprocess.Popen(
            ['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
            stdin=subprocess.PIPE,
        )
        proc.stdin.write(samples.tobytes())
        proc.stdin.close()
        proc.wait()
        playback_queue.task_done()


playback_thread = threading.Thread(target=playback_worker, daemon=True)
playback_thread.start()


def generate(text, opts):
    t1 = time.time()

    # The two variants accept different sampling options and defaults.
    if VARIANT == 'turbo':
        kwargs = {
            'temperature': opts.get('temperature', 0.8),
            'top_p': opts.get('top_p', 0.95),
            'top_k': opts.get('top_k', 1000),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p': opts.get('min_p', 0.0),
        }
    else:
        kwargs = {
            'temperature': opts.get('temperature', 0.8),
            'top_p': opts.get('top_p', 1.0),
            'repetition_penalty': opts.get('repetition_penalty', 1.2),
            'min_p': opts.get('min_p', 0.05),
            'exaggeration': opts.get('exaggeration', 0.5),
            'cfg_weight': opts.get('cfg_weight', 0.5),
        }

    audio_prompt = opts.get('audio_prompt')
    if audio_prompt:
        kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt)

    with torch.inference_mode():
        wav = model.generate(text, **kwargs)

    samples = wav.squeeze(0).cpu().numpy().astype(np.float32)
    elapsed = time.time() - t1
    duration = len(samples) / SAMPLE_RATE
    log(f'generated {duration:.1f}s audio in {elapsed:.1f}s rtf={elapsed/duration:.2f}')

    return samples


for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # A line that is not valid JSON is treated as plain text with default options.
    try:
        req = json.loads(line)
        text = req.pop('text', '')
        opts = req  # remaining fields are generation options
    except json.JSONDecodeError:
        text = line
        opts = {}

    if not text:
        print('ok', flush=True)
        continue

    try:
        samples = generate(text, opts)
        playback_queue.put(samples)
    except Exception as e:
        log(f'error: {e}')
        traceback.print_exc(file=sys.stderr)

    print('ok', flush=True)

# Drain playback before exit
playback_queue.put(_SENTINEL)
playback_thread.join()
85
demo-bark.mjs
Normal file
@@ -0,0 +1,85 @@
/**
 * Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS
 *
 * Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs],
 * [clears throat]), and hesitation (...). Markdown is preprocessed automatically:
 * **bold** → UPPERCASE, headers get a pause, etc.
 *
 * Environment variables:
 *   WHISPER_MODEL  default: base.en
 *                  options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
 *   BARK_MODEL     default: suno/bark (use suno/bark-small for faster/smaller)
 *   BARK_VOICE     default: v2/en_speaker_6
 *                  options: v2/en_speaker_0 .. v2/en_speaker_9
 *   STT_PROVIDER   default: cuda
 *   USE_LLM        set to 0 to disable (default: 1)
 *   OLLAMA_MODEL   default: phi3:mini
 */

import { Stt } from './lib/stt.mjs'
import { Bark_Tts } from './lib/bark-tts.mjs'
import { llm_available, list_models, cleanup } from './lib/llm.mjs'

const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
const USE_LLM = process.env.USE_LLM !== '0'

process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
stt.init()
process.stderr.write('[stt] ready\n')

process.stderr.write('[tts] loading Bark (this takes a moment)...\n')
const tts = new Bark_Tts()
await tts.init()
process.stderr.write('[tts] Bark ready\n')

let has_llm = false
if (USE_LLM) {
  has_llm = await llm_available()
  if (has_llm) {
    const models = await list_models()
    process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
  } else {
    process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
  }
}

process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`)
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')

let speaking = false

const stop = stt.listen(async (raw_text) => {
  if (speaking) {
    process.stderr.write(`[skipped] ${raw_text}\n`)
    return
  }
  speaking = true

  try {
    process.stdout.write(`[raw] ${raw_text}\n`)

    let text = raw_text
    if (has_llm) {
      text = await cleanup(raw_text)
      if (text !== raw_text) {
        process.stdout.write(`[llm] ${text}\n`)
      }
    }

    await tts.speak_streaming(text)
    process.stdout.write('\n')
  } catch (err) {
    process.stderr.write(`[error] ${err.message}\n`)
  } finally {
    speaking = false
  }
})

process.on('SIGINT', () => {
  process.stderr.write('\n[voice] stopping\n')
  stop()
  tts.stop()
  process.exit(0)
})
85
demo-kokoro.mjs
Normal file
@@ -0,0 +1,85 @@
|
|||||||
|
/**
|
||||||
|
* Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS
|
||||||
|
*
|
||||||
|
* Environment variables:
|
||||||
|
* WHISPER_MODEL default: base.en
|
||||||
|
* options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
|
||||||
|
* VOICE default: af_heart
|
||||||
|
* options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael
|
||||||
|
* STT_PROVIDER default: cuda
|
||||||
|
* USE_LLM set to 0 to disable (default: 1)
|
||||||
|
* OLLAMA_MODEL default: phi3:mini
|
||||||
|
*/
|
||||||
|
|
||||||
|
import { Stt } from './lib/stt.mjs'
|
||||||
|
import { Tts, VOICES } from './lib/tts.mjs'
|
||||||
|
import { llm_available, list_models, cleanup } from './lib/llm.mjs'
|
||||||
|
|
||||||
|
const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
|
||||||
|
const VOICE = process.env.VOICE || 'af_heart'
|
||||||
|
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
|
||||||
|
const USE_LLM = process.env.USE_LLM !== '0'
|
||||||
|
|
||||||
|
if (!(VOICE in VOICES)) {
|
||||||
|
process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`)
|
||||||
|
process.exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
|
process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
|
||||||
|
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
|
||||||
|
stt.init()
|
||||||
|
process.stderr.write('[stt] ready\n')
|
||||||
|
|
||||||
|
process.stderr.write('[tts] loading Kokoro...\n')
|
||||||
|
const tts = new Tts({ voice: VOICE })
|
||||||
|
tts.init()
|
||||||
|
process.stderr.write('[tts] ready\n')
|
||||||
|
|
||||||
|
let has_llm = false
|
||||||
|
if (USE_LLM) {
|
||||||
|
has_llm = await llm_available()
|
||||||
|
if (has_llm) {
|
||||||
|
const models = await list_models()
|
||||||
|
process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
|
||||||
|
} else {
|
||||||
|
process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`)
|
||||||
|
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
|
||||||
|
|
||||||
|
let speaking = false
|
||||||
|
|
||||||
|
const stop = stt.listen(async (raw_text) => {
|
||||||
|
if (speaking) {
|
||||||
|
process.stderr.write(`[skipped] ${raw_text}\n`)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
speaking = true
|
||||||
|
|
||||||
|
try {
|
||||||
|
process.stdout.write(`[raw] ${raw_text}\n`)
|
||||||
|
|
||||||
|
let text = raw_text
|
||||||
|
if (has_llm) {
|
||||||
|
text = await cleanup(raw_text)
|
||||||
|
if (text !== raw_text) {
|
||||||
|
process.stdout.write(`[llm] ${text}\n`)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
await tts.speak_streaming(text)
|
||||||
|
process.stdout.write('\n')
|
||||||
|
} catch (err) {
|
||||||
|
process.stderr.write(`[error] ${err.message}\n`)
|
||||||
|
} finally {
|
||||||
|
speaking = false
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
process.on('SIGINT', () => {
|
||||||
|
process.stderr.write('\n[voice] stopping\n')
|
||||||
|
stop()
|
||||||
|
process.exit(0)
|
||||||
|
})
|
||||||
24
demos/darkwing-briefing.sh
Executable file
@@ -0,0 +1,24 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. Smith, Marella, Rommie
|
||||||
|
|
||||||
|
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||||
|
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||||
|
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||||
|
|
||||||
|
switch darkwing-duck
|
||||||
|
say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover."
|
||||||
|
|
||||||
|
switch gosalyn
|
||||||
|
say "Dad, you're at the WRONG briefing AGAIN."
|
||||||
|
|
||||||
|
switch harold-w-smith
|
||||||
|
say "How did a duck get into this facility."
|
||||||
|
|
||||||
|
switch marella
|
||||||
|
say "The same way everything else does, sir. The paperwork looked official."
|
||||||
|
|
||||||
|
switch rommie
|
||||||
|
say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation."
|
||||||
|
|
||||||
|
switch darkwing-duck
|
||||||
|
say "I love it when a plan comes together. Let's get dangerous."
|
||||||
21
demos/threat-assessment.sh
Executable file
@@ -0,0 +1,21 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini
|
||||||
|
|
||||||
|
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||||
|
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||||
|
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||||
|
|
||||||
|
switch ivanova
|
||||||
|
say "We have an unidentified vessel on an intercept course. Threat assessment?"
|
||||||
|
|
||||||
|
switch rommie
|
||||||
|
say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists."
|
||||||
|
|
||||||
|
switch archangel
|
||||||
|
say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive."
|
||||||
|
|
||||||
|
switch chuin
|
||||||
|
say "Are you? I have been wondering this myself for some time."
|
||||||
|
|
||||||
|
switch santini
|
||||||
|
say "I'm gonna need a bigger hangar."
|
||||||
24
demos/wrong-meeting.sh
Executable file
@@ -0,0 +1,24 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker
|
||||||
|
|
||||||
|
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||||
|
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||||
|
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||||
|
|
||||||
|
switch harold-w-smith
|
||||||
|
say "This meeting never happened. None of you are here, and neither am I."
|
||||||
|
|
||||||
|
switch ivanova
|
||||||
|
say "That's extremely reassuring. Last time someone said that to me, a space station blew up."
|
||||||
|
|
||||||
|
switch archangel
|
||||||
|
say "Technically it was a controlled demolition. The paperwork was filed in triplicate."
|
||||||
|
|
||||||
|
switch rommie
|
||||||
|
say "I have reviewed the paperwork. It was filed AFTER the explosion."
|
||||||
|
|
||||||
|
switch penny-parker
|
||||||
|
say "You know, I walked into the wrong building again, didn't I."
|
||||||
|
|
||||||
|
switch harold-w-smith
|
||||||
|
say "Yes. But you saw nothing."
|
||||||
63
download-models.sh
Executable file
@@ -0,0 +1,63 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Download ONNX models for the voice experiment.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# bash download-models.sh # default: whisper base.en
|
||||||
|
# WHISPER_MODEL=large-v3-turbo bash download-models.sh
|
||||||
|
#
|
||||||
|
# After downloading, run: node demo.mjs
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
WHISPER_MODEL="${WHISPER_MODEL:-base.en}"
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
MODELS_DIR="${SCRIPT_DIR}/models"
|
||||||
|
|
||||||
|
mkdir -p "${MODELS_DIR}"
|
||||||
|
cd "${MODELS_DIR}"
|
||||||
|
|
||||||
|
echo "==> models dir: ${MODELS_DIR}"
|
||||||
|
|
||||||
|
# --- Silero VAD (~2 MB) ---
|
||||||
|
if [ ! -f silero_vad.onnx ]; then
|
||||||
|
echo "==> downloading silero_vad.onnx"
|
||||||
|
curl -L --progress-bar -o silero_vad.onnx \
|
||||||
|
'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx'
|
||||||
|
else
|
||||||
|
echo "==> silero_vad.onnx already present"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# --- Whisper STT ---
|
||||||
|
WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}"
|
||||||
|
if [ ! -d "${WHISPER_DIR}" ]; then
|
||||||
|
echo "==> downloading whisper-${WHISPER_MODEL}"
|
||||||
|
curl -L --progress-bar -o "${WHISPER_DIR}.tar.bz2" \
|
||||||
|
"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2"
|
||||||
|
echo "==> extracting..."
|
||||||
|
tar xf "${WHISPER_DIR}.tar.bz2"
|
||||||
|
else
|
||||||
|
echo "==> whisper-${WHISPER_MODEL} already present"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# --- Kokoro TTS (~310 MB) ---
|
||||||
|
if [ ! -d kokoro-en-v0_19 ]; then
|
||||||
|
echo "==> downloading kokoro-en-v0_19 (English TTS)"
|
||||||
|
curl -L --progress-bar -o kokoro-en-v0_19.tar.bz2 \
|
||||||
|
'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2'
|
||||||
|
echo "==> extracting..."
|
||||||
|
tar xf kokoro-en-v0_19.tar.bz2
|
||||||
|
else
|
||||||
|
echo "==> kokoro-en-v0_19 already present"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "==> all models ready"
|
||||||
|
echo ""
|
||||||
|
echo "Next steps:"
|
||||||
|
echo " cd ${SCRIPT_DIR}"
|
||||||
|
echo " npm install"
|
||||||
|
echo " node demo.mjs"
|
||||||
|
echo ""
|
||||||
|
echo "To use a better STT model:"
|
||||||
|
echo " WHISPER_MODEL=large-v3-turbo bash download-models.sh"
|
||||||
|
echo " WHISPER_MODEL=large-v3-turbo node demo.mjs"
|
||||||
112
lib/bark-tts.mjs
Normal file
@@ -0,0 +1,112 @@
|
|||||||
|
/**
|
||||||
|
* Bark TTS — Node.js wrapper around bark-server.py.
|
||||||
|
*
|
||||||
|
* Spawns the Python server once, keeps it alive, sends requests as JSON lines.
|
||||||
|
* Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
|
||||||
|
*
|
||||||
|
* Usage:
|
||||||
|
* const tts = new Bark_Tts()
|
||||||
|
* await tts.init() // spawns server, waits for model load
|
||||||
|
* await tts.speak('Hello')
|
||||||
|
* await tts.speak('**Very** important point.') // emphasis via CAPS
|
||||||
|
* tts.stop()
|
||||||
|
*
|
||||||
|
* Environment variables:
|
||||||
|
* BARK_MODEL HuggingFace model id (default: suno/bark)
|
||||||
|
* BARK_VOICE voice preset (default: v2/en_speaker_6)
|
||||||
|
*
|
||||||
|
* Bark voice presets (English):
|
||||||
|
* v2/en_speaker_0 calm female
|
||||||
|
* v2/en_speaker_1 calm male
|
||||||
|
* v2/en_speaker_3 deep male
|
||||||
|
* v2/en_speaker_6 neutral/warm (default)
|
||||||
|
* v2/en_speaker_9 expressive
|
||||||
|
*/
|
||||||
|
|
||||||
|
import { spawn } from 'node:child_process'
|
||||||
|
import * as path from 'node:path'
|
||||||
|
import * as readline from 'node:readline'
|
||||||
|
import { markdown_to_bark, split_sentences } from './markdown.mjs'
|
||||||
|
|
||||||
|
const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark'
|
||||||
|
const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6'
|
||||||
|
const SERVER = path.join(import.meta.dirname, '..', 'bark-server.py')
|
||||||
|
|
||||||
|
export class Bark_Tts {
|
||||||
|
constructor({
|
||||||
|
model = BARK_MODEL,
|
||||||
|
voice = BARK_VOICE,
|
||||||
|
} = {}) {
|
||||||
|
this._model = model
|
||||||
|
this._voice = voice
|
||||||
|
this._proc = null
|
||||||
|
this._rl = null
|
||||||
|
this._resolve = null // resolver for the current in-flight request
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Spawn bark-server.py and wait until it signals "ready". */
|
||||||
|
init() {
|
||||||
|
return new Promise((resolve, reject) => {
|
||||||
|
this._proc = spawn(SERVER, [this._model, this._voice], {
|
||||||
|
stdio: ['pipe', 'pipe', 'inherit'],
|
||||||
|
})
|
||||||
|
|
||||||
|
this._proc.on('error', reject)
|
||||||
|
this._proc.on('close', (code) => {
|
||||||
|
if (code !== 0 && code !== null) {
|
||||||
|
process.stderr.write(`[bark] server exited with code ${code}\n`)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
this._rl = readline.createInterface({ input: this._proc.stdout })
|
||||||
|
|
||||||
|
this._rl.on('line', (line) => {
|
||||||
|
if (line === 'ready') {
|
||||||
|
resolve()
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if (line === 'ok' && this._resolve) {
|
||||||
|
const res = this._resolve
|
||||||
|
this._resolve = null
|
||||||
|
res()
|
||||||
|
}
|
||||||
|
})
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Preprocess markdown and speak as a single request. */
|
||||||
|
async speak(text, { voice = this._voice, preprocess = true } = {}) {
|
||||||
|
const clean = preprocess ? markdown_to_bark(text) : text
|
||||||
|
return this._send(clean, voice)
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Preprocess markdown and speak sentence by sentence.
|
||||||
|
* Lower latency — first sentence starts playing while rest are queued.
|
||||||
|
*/
|
||||||
|
async speak_streaming(text, opts = {}) {
|
||||||
|
const clean = opts.preprocess !== false ? markdown_to_bark(text) : text
|
||||||
|
const sentences = split_sentences(clean)
|
||||||
|
for (const s of sentences) {
|
||||||
|
await this._send(s, opts.voice ?? this._voice)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
_send(text, voice) {
|
||||||
|
return new Promise((resolve, reject) => {
|
||||||
|
if (!this._proc) {
|
||||||
|
return reject(new Error('Bark_Tts not initialized — call init() first'))
|
||||||
|
}
|
||||||
|
this._resolve = resolve
|
||||||
|
const payload = JSON.stringify({ text, voice }) + '\n'
|
||||||
|
this._proc.stdin.write(payload)
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
stop() {
|
||||||
|
this._rl?.close()
|
||||||
|
this._proc?.kill()
|
||||||
|
this._proc = null
|
||||||
|
this._rl = null
|
||||||
|
}
|
||||||
|
}
|
||||||
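Note on the request/reply protocol in lib/bark-tts.mjs: the wrapper keeps a single `_resolve` slot, so a second `_send()` issued before the first `ok` arrives would overwrite the first resolver. The demos avoid this by awaiting each call. A minimal sketch of a FIFO alternative follows, assuming the server answers requests strictly in the order it receives them; `Reply_Queue` and its methods are hypothetical names, not part of this repository.

```javascript
// Hypothetical helper: one resolver per in-flight request, matched to "ok" lines in order.
class Reply_Queue {
  constructor() {
    this.waiting = [] // resolvers for requests that have not been answered yet
  }

  // Register the resolver for a request that was just written to stdin.
  enqueue(resolve) {
    this.waiting.push(resolve)
  }

  // Feed one line read from the server's stdout.
  // Returns true if the line completed a pending request.
  on_line(line) {
    if (line !== 'ok') return false
    const resolve = this.waiting.shift()
    if (!resolve) return false
    resolve()
    return true
  }
}
```

With something like this, `_send()` would call `enqueue(resolve)` instead of assigning `this._resolve`, and the readline handler would pass each line to `on_line()`.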
101
lib/chatterbox-tts.mjs
Normal file
@@ -0,0 +1,101 @@
|
|||||||
|
/**
|
||||||
|
* Chatterbox TTS — Node.js wrapper around chatterbox-server.py.
|
||||||
|
*
|
||||||
|
* Usage:
|
||||||
|
* const tts = new Chatterbox_Tts()
|
||||||
|
* await tts.init()
|
||||||
|
* await tts.speak('Hello! [chuckle] That is funny.')
|
||||||
|
* await tts.speak('Something intense.', { exaggeration: 0.8 }) // full model only
|
||||||
|
* tts.stop()
|
||||||
|
*
|
||||||
|
* Paralinguistic tags (embed in text):
|
||||||
|
* [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
|
||||||
|
*
|
||||||
|
* Generation options (passed as second arg to speak()):
|
||||||
|
* temperature 0.05–2.0 default 0.8
|
||||||
|
* top_p 0–1 default 0.95
|
||||||
|
* top_k 0–1000 default 1000
|
||||||
|
* repetition_penalty ≥1.0 default 1.2
|
||||||
|
* min_p 0–1 default 0.0
|
||||||
|
* audio_prompt string path to reference WAV for voice cloning
|
||||||
|
* exaggeration 0–1 emotion intensity (full model only)
|
||||||
|
* cfg_weight 0–1 classifier-free guidance (full model only)
|
||||||
|
*/
|
||||||
|
|
||||||
|
import { spawn } from 'node:child_process'
|
||||||
|
import * as path from 'node:path'
|
||||||
|
import * as readline from 'node:readline'
|
||||||
|
import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs'
|
||||||
|
|
||||||
|
const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py')
|
||||||
|
|
||||||
|
export class Chatterbox_Tts {
|
||||||
|
constructor({
|
||||||
|
variant = 'turbo', // 'turbo' or 'full'
|
||||||
|
} = {}) {
|
||||||
|
this._variant = variant
|
||||||
|
this._proc = null
|
||||||
|
this._rl = null
|
||||||
|
this._resolve = null
|
||||||
|
}
|
||||||
|
|
||||||
|
init() {
|
||||||
|
return new Promise((resolve, reject) => {
|
||||||
|
this._proc = spawn(SERVER, [this._variant], {
|
||||||
|
stdio: ['pipe', 'pipe', 'inherit'],
|
||||||
|
})
|
||||||
|
|
||||||
|
this._proc.on('error', reject)
|
||||||
|
this._proc.on('close', (code) => {
|
||||||
|
if (code !== null && code !== 0) {
|
||||||
|
process.stderr.write(`[chatterbox] server exited with code ${code}\n`)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
this._rl = readline.createInterface({ input: this._proc.stdout })
|
||||||
|
this._rl.on('line', (line) => {
|
||||||
|
if (line === 'ready') {
|
||||||
|
resolve()
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if (line === 'ok' && this._resolve) {
|
||||||
|
const res = this._resolve
|
||||||
|
this._resolve = null
|
||||||
|
res()
|
||||||
|
}
|
||||||
|
})
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
async speak(text, opts = {}) {
|
||||||
|
const clean = opts.preprocess === true ? markdown_to_speech(text) : text
|
||||||
|
return this._send(clean, opts)
|
||||||
|
}
|
||||||
|
|
||||||
|
async speak_streaming(text, opts = {}) {
|
||||||
|
const clean = opts.preprocess !== false ? markdown_to_speech(text) : text
|
||||||
|
const sentences = split_sentences(clean)
|
||||||
|
for (const s of sentences) {
|
||||||
|
await this._send(s, opts)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
_send(text, opts = {}) {
|
||||||
|
return new Promise((resolve, reject) => {
|
||||||
|
if (!this._proc) {
|
||||||
|
return reject(new Error('Chatterbox_Tts not initialized — call init() first'))
|
||||||
|
}
|
||||||
|
this._resolve = resolve
|
||||||
|
const { preprocess: _, ...gen_opts } = opts
|
||||||
|
const payload = JSON.stringify({ text, ...gen_opts }) + '\n'
|
||||||
|
this._proc.stdin.write(payload)
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
stop() {
|
||||||
|
this._rl?.close()
|
||||||
|
this._proc?.kill()
|
||||||
|
this._proc = null
|
||||||
|
this._rl = null
|
||||||
|
}
|
||||||
|
}
|
||||||
64
lib/llm.mjs
Normal file
@@ -0,0 +1,64 @@
|
|||||||
|
/**
|
||||||
|
* Optional LLM cleanup via Ollama REST API.
|
||||||
|
*
|
||||||
|
* Cleans up raw Whisper output: fixes punctuation, capitalization,
|
||||||
|
* removes filler words, corrects obvious recognition errors.
|
||||||
|
*
|
||||||
|
* Set OLLAMA_MODEL env var to choose the model (default: phi3:mini).
|
||||||
|
* Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434).
|
||||||
|
*/
|
||||||
|
|
||||||
|
const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434'
|
||||||
|
const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini'
|
||||||
|
|
||||||
|
export async function llm_available() {
|
||||||
|
try {
|
||||||
|
const res = await fetch(`${OLLAMA_URL}/api/tags`, {
|
||||||
|
signal: AbortSignal.timeout(2000),
|
||||||
|
})
|
||||||
|
return res.ok
|
||||||
|
} catch {
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function list_models() {
|
||||||
|
const res = await fetch(`${OLLAMA_URL}/api/tags`)
|
||||||
|
const data = await res.json()
|
||||||
|
return (data.models ?? []).map(m => m.name)
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Clean up a raw STT transcription.
|
||||||
|
* Returns the cleaned string, or the original if the request fails.
|
||||||
|
*/
|
||||||
|
export async function cleanup(raw_text, model = OLLAMA_MODEL) {
|
||||||
|
const prompt = [
|
||||||
|
'You are a transcription editor. Clean up this speech-to-text output:',
|
||||||
|
'- Fix punctuation and capitalization',
|
||||||
|
'- Remove meaningless filler words (um, uh, like, you know)',
|
||||||
|
'- Correct obvious word recognition errors based on context',
|
||||||
|
'- Keep the meaning and phrasing intact',
|
||||||
|
'- Return ONLY the cleaned text, no explanation, no quotes',
|
||||||
|
'',
|
||||||
|
raw_text,
|
||||||
|
].join('\n')
|
||||||
|
|
||||||
|
try {
|
||||||
|
const res = await fetch(`${OLLAMA_URL}/api/generate`, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({
|
||||||
|
model,
|
||||||
|
prompt,
|
||||||
|
stream: false,
|
||||||
|
options: { temperature: 0.1, num_predict: 256 },
|
||||||
|
}),
|
||||||
|
signal: AbortSignal.timeout(10_000),
|
||||||
|
})
|
||||||
|
const data = await res.json()
|
||||||
|
return data.response?.trim() || raw_text
|
||||||
|
} catch {
|
||||||
|
return raw_text
|
||||||
|
}
|
||||||
|
}
|
||||||
43
lib/local-query-complete.mjs
Normal file
@@ -0,0 +1,43 @@
|
|||||||
|
export async function is_query_complete(query, model = 'qwen2.5:3b') {
|
||||||
|
const payload = {
|
||||||
|
model,
|
||||||
|
messages: [
|
||||||
|
{
|
||||||
|
role: 'system',
|
||||||
|
content: `
|
||||||
|
You are a voice query completeness classifier. Determine if the input is complete enough to act on.
|
||||||
|
|
||||||
|
Respond only with JSON: {"complete": true} or {"complete": false}
|
||||||
|
|
||||||
|
Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent.
|
||||||
|
|
||||||
|
Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather"
|
||||||
|
Incomplete examples: "How do I", "The thing is", "Can you"
|
||||||
|
`.replace(/^[\t ]+/gm, ''),
|
||||||
|
},
|
||||||
|
{ role: 'user', content: query },
|
||||||
|
],
|
||||||
|
stream: false,
|
||||||
|
format: 'json',
|
||||||
|
options: { num_predict: 10, num_ctx: 2048 },
|
||||||
|
}
|
||||||
|
|
||||||
|
const response = await fetch('http://192.168.2.99:11434/api/chat', {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify(payload),
|
||||||
|
}).catch(() => ({ ok: false })) // network failure: fall through to the fail-open branch below
|
||||||
|
|
||||||
|
if (response.ok) {
|
||||||
|
try {
|
||||||
|
const data = await response.json()
|
||||||
|
const parsed = JSON.parse(data.message.content)
|
||||||
|
return parsed.complete === true
|
||||||
|
} catch {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// Fail open — if classifier errors, let the query through
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
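Note on the fail-open behaviour in lib/local-query-complete.mjs: the classifier treats anything it cannot interpret as "complete", so a broken model reply never blocks a query. The sketch below isolates that parsing step so it can be checked without a running Ollama server; `parse_verdict` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper mirroring the parsing logic of is_query_complete().
// `content` is the string the model returned in message.content.
function parse_verdict(content) {
  try {
    const parsed = JSON.parse(content)
    // Only an explicit boolean true counts as complete, as in the original code.
    return parsed.complete === true
  } catch {
    // Unparseable reply: fail open and let the query through.
    return true
  }
}
```

One consequence worth knowing: a well-formed reply that lacks the `complete` key (for example `{}`) is treated as incomplete, while malformed JSON is treated as complete.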
66
lib/markdown.mjs
Normal file
@@ -0,0 +1,66 @@
|
|||||||
|
/**
|
||||||
|
* Convert markdown to Bark-friendly plain text.
|
||||||
|
*
|
||||||
|
* Bark emphasis conventions:
|
||||||
|
* UPPERCASE words — stress/emphasis
|
||||||
|
* ... — hesitation / trailing off
|
||||||
|
* [laughs] [sighs] — paralinguistic tokens (pass through as-is)
|
||||||
|
*
|
||||||
|
* Markdown mapping:
|
||||||
|
* **bold** / __bold__ → UPPERCASE (strong emphasis)
|
||||||
|
* *italic* / _italic_ → markers stripped, text kept (soft emphasis left to Bark)
|
||||||
|
* # Heading → text + pause via ".\n"
|
||||||
|
* `code` → backticks stripped, content spoken literally
|
||||||
|
* ```block``` → skipped
|
||||||
|
* [text](url) → text only
|
||||||
|
* --- → "..." (pause)
|
||||||
|
* - item / 1. item → item text (bullet stripped)
|
||||||
|
*/
|
||||||
|
|
||||||
|
export function markdown_to_bark(text) {
|
||||||
|
// Remove fenced code blocks entirely
|
||||||
|
text = text.replace(/```[\s\S]*?```/g, '')
|
||||||
|
|
||||||
|
// Inline code — keep the content, drop backticks
|
||||||
|
text = text.replace(/`([^`]+)`/g, '$1')
|
||||||
|
|
||||||
|
// Images: drop. Must run before the link rule, which would otherwise turn ![alt](url) into "!alt"
|
||||||
|
text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '')
|
||||||
|
|
||||||
|
// Links — keep visible text
|
||||||
|
text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')
|
||||||
|
|
||||||
|
// **bold** / __bold__ → UPPERCASE
|
||||||
|
text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase())
|
||||||
|
text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase())
|
||||||
|
|
||||||
|
// *italic* / _italic_ — strip markers, keep text
|
||||||
|
text = text.replace(/\*([^*]+)\*/g, '$1')
|
||||||
|
text = text.replace(/_([^_]+)_/g, '$1')
|
||||||
|
|
||||||
|
// Headings — keep text, add a beat after
|
||||||
|
text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.')
|
||||||
|
|
||||||
|
// Horizontal rules → pause
|
||||||
|
text = text.replace(/^[-*_]{3,}\s*$/gm, '...')
|
||||||
|
|
||||||
|
// List bullets / numbers — strip prefix
|
||||||
|
text = text.replace(/^\s*[-*+]\s+/gm, '')
|
||||||
|
text = text.replace(/^\s*\d+\.\s+/gm, '')
|
||||||
|
|
||||||
|
// Collapse excess blank lines
|
||||||
|
text = text.replace(/\n{3,}/g, '\n\n')
|
||||||
|
|
||||||
|
return text.trim()
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Split text into sentences suitable for sequential TTS.
|
||||||
|
* Keeps sentence-ending punctuation with the preceding sentence.
|
||||||
|
*/
|
||||||
|
export function split_sentences(text) {
|
||||||
|
return text
|
||||||
|
.split(/(?<=[.!?])\s+/)
|
||||||
|
.map(s => s.trim())
|
||||||
|
.filter(s => s.length > 0)
|
||||||
|
}
|
||||||
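Note on `split_sentences()` in lib/markdown.mjs: the lookbehind regex splits on whitespace that follows `.`, `!` or `?`, which keeps the punctuation attached to its sentence. The sketch below copies the function so it runs on its own and shows one limitation, namely that abbreviations ending in a period are also treated as sentence ends. The sample strings are illustrative.

```javascript
// Same logic as split_sentences() in lib/markdown.mjs, repeated here so the example is self-contained.
function split_sentences(text) {
  return text
    .split(/(?<=[.!?])\s+/) // split after sentence-ending punctuation
    .map(s => s.trim())
    .filter(s => s.length > 0)
}

// Three sentences, each keeping its own punctuation.
const plain = split_sentences('Hello there. How are you? Fine!')

// The abbreviation "Dr." is split off as its own chunk.
const abbreviated = split_sentences('Dr. Smith arrived.')

console.log(plain)
console.log(abbreviated)
```

For sentence-by-sentence TTS the extra split after "Dr." costs only a short additional request, but it can introduce an audible pause in the middle of a name.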
170
lib/stt.mjs
Normal file
@@ -0,0 +1,170 @@
|
|||||||
|
/**
|
||||||
|
* Speech-to-Text using sherpa-onnx-node.
|
||||||
|
*
|
||||||
|
* Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so
|
||||||
|
* transcription only runs when a complete utterance has been detected.
|
||||||
|
* Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA).
|
||||||
|
*
|
||||||
|
* Model layout expected under models_dir:
|
||||||
|
* silero_vad.onnx
|
||||||
|
* sherpa-onnx-whisper-<name>/
|
||||||
|
* <name>-encoder.int8.onnx
|
||||||
|
* <name>-decoder.int8.onnx
|
||||||
|
* <name>-tokens.txt
|
||||||
|
*
|
||||||
|
* Available Whisper model names (pass as whisper_name):
|
||||||
|
* tiny.en base.en small.en medium.en large-v3 large-v3-turbo
|
||||||
|
* Download with: WHISPER_MODEL=<model-name> bash download-models.sh
|
||||||
|
*/
|
||||||
|
|
||||||
|
import { spawn, execSync } from 'node:child_process'
|
||||||
|
import { createRequire } from 'node:module'
|
||||||
|
import * as path from 'node:path'
|
||||||
|
|
||||||
|
const require = createRequire(import.meta.url)
|
||||||
|
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
|
||||||
|
|
||||||
|
const PRE_ROLL_SAMPLES = 3200 // 200ms at 16kHz
|
||||||
|
const HISTORY_SAMPLES = 960000 // 60s ring buffer — matches VAD internal size
|
||||||
|
|
||||||
|
function find_mic() {
|
||||||
|
const candidates = [
|
||||||
|
['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
|
||||||
|
['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
|
||||||
|
]
|
||||||
|
for (const [cmd, args] of candidates) {
|
||||||
|
try {
|
||||||
|
execSync(`which ${cmd}`, { stdio: 'ignore' })
|
||||||
|
return [cmd, args]
|
||||||
|
} catch { /* try next */ }
|
||||||
|
}
|
||||||
|
throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)')
|
||||||
|
}
|
||||||
|
|
||||||
|
function s16le_to_f32(buf) {
|
||||||
|
const out = new Float32Array(buf.length / 2)
|
||||||
|
for (let i = 0; i < out.length; i++) {
|
||||||
|
out[i] = buf.readInt16LE(i * 2) / 32768.0
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
export class Stt {
|
||||||
|
constructor({
|
||||||
|
models_dir = DEFAULT_MODELS_DIR,
|
||||||
|
whisper_name = 'base.en',
|
||||||
|
provider = 'cuda',
|
||||||
|
} = {}) {
|
||||||
|
this._whisper_dir = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
|
||||||
|
this._whisper_name = whisper_name
|
||||||
|
this._vad_model = path.join(models_dir, 'silero_vad.onnx')
|
||||||
|
this._provider = provider
|
||||||
|
this._recognizer = null
|
||||||
|
this._vad = null
|
||||||
|
this._history = new Float32Array(HISTORY_SAMPLES)
|
||||||
|
this._history_pos = 0 // total samples fed, monotonically increasing
|
||||||
|
}
|
||||||
|
|
||||||
|
init() {
|
||||||
|
const sherpa = require('sherpa-onnx-node')
|
||||||
|
const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
|
||||||
|
|
||||||
|
this._recognizer = new sherpa.OfflineRecognizer({
|
||||||
|
featConfig: { sampleRate: 16000, featureDim: 80 },
|
||||||
|
modelConfig: {
|
||||||
|
whisper: {
|
||||||
|
encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
|
||||||
|
decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
|
||||||
|
},
|
||||||
|
tokens: path.join(wdir, `${wname}-tokens.txt`),
|
||||||
|
numThreads: 2,
|
||||||
|
provider: _provider,
|
||||||
|
debug: false,
|
||||||
|
},
|
||||||
|
})
|
||||||
|
|
||||||
|
// VAD runs on CPU — silero_vad.onnx is tiny
|
||||||
|
this._vad = new sherpa.Vad({
|
||||||
|
sileroVad: {
|
||||||
|
model: _vad_model,
|
||||||
|
threshold: 0.5,
|
||||||
|
minSilenceDuration: 0.5, // seconds of silence to end an utterance
|
||||||
|
minSpeechDuration: 0.1, // ignore sub-100ms blips
|
||||||
|
windowSize: 512, // 32ms @ 16kHz
|
||||||
|
maxSpeechDuration: 30.0,
|
||||||
|
},
|
||||||
|
sampleRate: 16000,
|
||||||
|
numThreads: 1,
|
||||||
|
provider: 'cpu',
|
||||||
|
debug: false,
|
||||||
|
}, 60) // 60-second internal ring buffer
|
||||||
|
}
|
||||||
|
|
||||||
|
  _transcribe(samples) {
    const stream = this._recognizer.createStream()
    stream.acceptWaveform({ samples, sampleRate: 16000 })
    this._recognizer.decode(stream)
    return this._recognizer.getResult(stream).text.trim()
  }

  _with_preroll(seg) {
    const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
    const pre_len = seg.start - pre_start
    if (pre_len === 0) return seg.samples
    const out = new Float32Array(pre_len + seg.samples.length)
    for (let i = 0; i < pre_len; i++) {
      out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
    }
    out.set(seg.samples, pre_len)
    return out
  }

  /**
   * Start listening. Calls on_text(text) for each detected utterance and
   * on_audio() (if given) on every mic data chunk. Returns a stop() function.
   */
  listen(on_text, { on_audio } = {}) {
    const [cmd, args] = find_mic()
    process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`)
    const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })

    const VAD_WIN = 512 // must match windowSize above
    let pending = Buffer.alloc(0)

    mic.stdout.on('data', (chunk) => {
      if (on_audio) on_audio()
      pending = Buffer.concat([pending, chunk])

      // Feed complete VAD windows
      while (pending.length >= VAD_WIN * 2) {
        const win = pending.subarray(0, VAD_WIN * 2)
        pending = pending.subarray(VAD_WIN * 2)
        const f32 = s16le_to_f32(win)
        // Write to history ring buffer for pre-roll
        const base = this._history_pos % HISTORY_SAMPLES
        for (let i = 0; i < f32.length; i++) {
          this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
        }
        this._history_pos += f32.length
        this._vad.acceptWaveform(f32)
      }

      // Drain any complete speech segments
      while (!this._vad.isEmpty()) {
        const seg = this._vad.front()
        this._vad.pop()
        const text = this._transcribe(this._with_preroll(seg))
        if (text) on_text(text)
      }
    })

    mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`))
    mic.on('close', code => {
      if (code !== null && code !== 0) {
        process.stderr.write(`[stt] mic exited with code ${code}\n`)
      }
    })

    return () => mic.kill()
  }
}
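The pre-roll lookup above can be illustrated in isolation. This is a minimal sketch under assumptions, not the module's code: `HISTORY`, `write_history`, and `read_preroll` are hypothetical names, and the 8-sample history is far smaller than a real buffer would be.

```javascript
// Hypothetical stand-alone ring buffer, sized tiny for illustration.
const HISTORY = 8
const history = new Float32Array(HISTORY)
let history_pos = 0 // absolute count of samples written so far

// Append samples, overwriting the oldest once the buffer is full.
function write_history(samples) {
  for (let i = 0; i < samples.length; i++) {
    history[(history_pos + i) % HISTORY] = samples[i]
  }
  history_pos += samples.length
}

// Return up to `pre_roll` samples that immediately precede absolute index `start`.
function read_preroll(start, pre_roll) {
  const pre_start = Math.max(0, start - pre_roll)
  const out = new Float32Array(start - pre_start)
  for (let i = 0; i < out.length; i++) {
    out[i] = history[(pre_start + i) % HISTORY]
  }
  return out
}

write_history(Float32Array.from([1, 2, 3, 4, 5, 6]))
write_history(Float32Array.from([7, 8, 9, 10]))
const pre = read_preroll(10, 3) // the three samples written just before index 10
```

A lookup of this kind only returns the intended audio while the requested samples are still in the buffer, that is, while `history_pos - pre_start` does not exceed the buffer size. Older positions have already been overwritten by newer samples.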
51
lib/tts-client.mjs
Normal file
@@ -0,0 +1,51 @@
/**
 * TTS HTTP client — speaks text via a running tts-server.mjs instance.
 *
 * Usage:
 *   const tts = new Tts_Client()
 *   await tts.speak('Hello world.')
 *   await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
 *   const { voices, current } = await tts.list_voices()
 *   await tts.set_voice('rommie')
 *
 * The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500.
 */

const DEFAULT_URL = process.env.TTS_URL ?? 'http://192.168.2.99:11500'

export class Tts_Client {
  constructor({ url = DEFAULT_URL } = {}) {
    this._url = url
  }

  async speak(text, opts = {}) {
    const res = await fetch(`${this._url}/speak`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, ...opts }),
    })
    if (!res.ok) {
      const data = await res.json().catch(() => ({}))
      throw new Error(data.error ?? `HTTP ${res.status}`)
    }
  }

  async list_voices() {
    const res = await fetch(`${this._url}/voices`)
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    return res.json() // { voices: [{name, description, active}], current }
  }

  async set_voice(name) {
    const res = await fetch(`${this._url}/voice`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ name }),
    })
    if (!res.ok) {
      const data = await res.json().catch(() => ({}))
      throw new Error(data.error ?? `HTTP ${res.status}`)
    }
    return res.json() // { ok, name, path }
  }
}
107
lib/tts.mjs
Normal file
@@ -0,0 +1,107 @@
/**
 * Text-to-Speech using sherpa-onnx-node with the Kokoro model.
 *
 * Model layout expected under models_dir:
 *   kokoro-en-v0_19/
 *     model.onnx
 *     voices.bin
 *     tokens.txt
 *     lexicon-us-en.txt
 *     espeak-ng-data/
 *
 * Audio output goes to pacat (PulseAudio) as float32le @ 24kHz.
 * speak_streaming() splits on sentence boundaries so the first sentence
 * starts playing before the rest have been synthesized.
 */

import { spawn } from 'node:child_process'
import { createRequire } from 'node:module'
import * as path from 'node:path'

const require = createRequire(import.meta.url)
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')

// Speaker IDs in kokoro-en-v0_19
export const VOICES = {
  af_heart: 0,
  af_bella: 1,
  af_nicole: 2,
  af_sarah: 3,
  af_sky: 4,
  am_adam: 5,
  am_michael: 6,
}

export class Tts {
  constructor({
    models_dir = DEFAULT_MODELS_DIR,
    voice = 'af_heart',
    speed = 1.0,
    provider = 'cpu', // Kokoro synthesis is fast enough on CPU
  } = {}) {
    this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19')
    this._voice = voice
    this._speed = speed
    this._provider = provider
    this._tts = null
  }

  init() {
    const sherpa = require('sherpa-onnx-node')
    const { _kokoro_dir: dir, _provider } = this

    this._tts = new sherpa.OfflineTts({
      model: {
        kokoro: {
          model: path.join(dir, 'model.onnx'),
          voices: path.join(dir, 'voices.bin'),
          tokens: path.join(dir, 'tokens.txt'),
          dataDir: path.join(dir, 'espeak-ng-data'),
        },
        numThreads: 2, // thread count, debug and provider are model-level options
        debug: false,
        provider: _provider,
      },
      maxNumSentences: 2,
    })
  }

  /** Synthesize and play a single piece of text. Resolves when playback ends. */
  async speak(text) {
    const sid = VOICES[this._voice] ?? 0
    const audio = this._tts.generate({ text, sid, speed: this._speed })
    await this._play(audio.samples, audio.sampleRate)
  }

  /**
   * Split text on sentence boundaries and synthesize + play each sentence
   * in sequence. Lower latency than waiting for full synthesis first.
   */
  async speak_streaming(text) {
    for (const sentence of split_sentences(text)) {
      if (sentence.trim()) {
        await this.speak(sentence.trim())
      }
    }
  }

  _play(samples, sample_rate) {
    return new Promise((resolve, reject) => {
      const proc = spawn('pacat', [
        '--format=float32le',
        `--rate=${sample_rate}`,
        '--channels=1',
      ], { stdio: ['pipe', 'ignore', 'inherit'] })

      proc.on('error', reject)
      proc.on('close', resolve)
      // Respect the view's offset and length in case samples does not span its whole ArrayBuffer
      proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))
      proc.stdin.end()
    })
  }
}

function split_sentences(text) {
  // Split after . ! ? — keep the punctuation with the preceding sentence
  return text.split(/(?<=[.!?])\s+/)
}
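To see what the lookbehind split in `split_sentences` produces, the same regular expression can be run on sample strings. This is a minimal sketch; `sample`, `parts`, and `abbrev` are illustrative names only.

```javascript
// Same pattern as split_sentences: split on whitespace that follows . ! or ?
function split_sentences(text) {
  return text.split(/(?<=[.!?])\s+/)
}

// Punctuation stays attached to the sentence it ends; runs of spaces collapse.
const sample = 'Ready. Is this working? Yes!  Done'
const parts = split_sentences(sample)

// The pattern has no notion of abbreviations, so this also splits after "Dr."
const abbrev = split_sentences('Dr. Smith arrived.')
```

Because the split keys only on punctuation followed by whitespace, abbreviations and initials produce extra short fragments. For TTS this mostly costs an extra pause rather than lost text.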
22
listen.mjs
Normal file
@@ -0,0 +1,22 @@
/**
 * Listen with Whisper and print transcribed text to stdout.
 *
 * Usage:
 *   node listen.mjs
 *   node listen.mjs small.en
 *   node listen.mjs large-v3-turbo
 */

import { Stt } from './lib/stt.mjs'

const whisper_name = process.argv[2] ?? 'base.en'

const stt = new Stt({ whisper_name })
stt.init()

process.stderr.write(`[listen] whisper model: ${whisper_name}\n`)
process.stderr.write(`[listen] listening — Ctrl+C to stop\n`)

stt.listen((text) => {
  process.stdout.write(text + '\n')
})
13
package.json
Normal file
@@ -0,0 +1,13 @@
{
  "name": "voice-experiment",
  "version": "0.1.0",
  "type": "module",
  "scripts": {
    "download": "bash download-models.sh",
    "setup-venv": "bash setup-venv.sh"
  },
  "dependencies": {
    "sherpa-onnx-node": "^1.12.14",
    "js-yaml": "^4.1.0"
  }
}
123
query-demo.mjs
Normal file
@@ -0,0 +1,123 @@
/**
 * Voice query demo — accumulates STT utterances into a query, checks
 * completeness with a local LLM classifier, then dispatches to Claude Code.
 *
 * Usage:
 *   node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--silence-timeout ms] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs]
 *
 * If --window-id and --claude-remote are supplied, completed queries are
 * dispatched to Claude Code. Otherwise they are only logged to stderr.
 *
 * A query is considered complete when any of these holds:
 *   - The classifier says so
 *   - The latest fragment is nothing but a send-word (go/done/send)
 *   - No new fragment arrives within SILENCE_TIMEOUT ms
 */

import { execFileSync } from 'node:child_process'
import { Stt } from './lib/stt.mjs'
import { Tts_Client } from './lib/tts-client.mjs'
import { is_query_complete } from './lib/local-query-complete.mjs'

function get_arg(name) {
  const i = process.argv.indexOf(name)
  return i !== -1 ? process.argv[i + 1] : null
}

const audio_prompt = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav'
const whisper_name = get_arg('--whisper') ?? 'base.en'
const window_id = get_arg('--window-id')
const claude_remote = get_arg('--claude-remote')

const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000', 10)

const tts = new Tts_Client()
const stt = new Stt({ whisper_name })

stt.init()
process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`)
process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`)
if (window_id && claude_remote) {
  process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`)
} else {
  process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`)
}
await tts.speak('Ready for input.', { audio_prompt })

function make_prompt(query) {
  return [
    '[voice-buddy] You have received a voice query.',
    '',
    'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.',
    'If the task is a question, speak the answer directly.',
    'If the task is an action, carry it out and then speak a short confirmation.',
    '',
    '--- Query ---',
    query,
    '--- End of query ---',
  ].join('\n')
}

function dispatch(query) {
  execFileSync('node', [claude_remote, '--window-id', window_id], {
    input: make_prompt(query),
    encoding: 'utf8',
    stdio: ['pipe', 'inherit', 'inherit'],
  })
}

async function submit(query) {
  query = query.trim()
  if (!query) {
    return
  }
  process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`)
  if (window_id && claude_remote) {
    dispatch(query)
  }
  await tts.speak('Efforting.', { audio_prompt })
}

const SEND_WORDS = new Set(['go', 'done', 'send'])

let accumulated = ''
let silence_timer = null

function reset_silence_timer() {
  clearTimeout(silence_timer)
  silence_timer = setTimeout(async () => {
    if (!accumulated) {
      return
    }
    process.stderr.write('[query-demo] silence timeout — submitting\n')
    const query = accumulated
    accumulated = ''
    await submit(query)
  }, SILENCE_TIMEOUT)
}

stt.listen(async (text) => {
  accumulated = accumulated ? `${accumulated}\n${text}` : text
  process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`)

  reset_silence_timer()

  const last_line = accumulated.split('\n').at(-1)
  const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '')
  const forced = SEND_WORDS.has(last_norm)

  if (!forced) {
    const complete = await is_query_complete(last_line)
    if (!complete) {
      return
    }
  }

  clearTimeout(silence_timer)

  const lines = accumulated.split('\n')
  const query = (forced ? lines.slice(0, -1) : lines).join('\n')
  accumulated = ''

  await submit(query)
}, { on_audio: () => { if (accumulated) reset_silence_timer() } })
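The send-word rule in the listener can be sketched as a pure function. This is an illustration under assumptions, not the script's code: `take_query` is a hypothetical helper, and the classifier call is left out so only the forced path is shown.

```javascript
const SEND_WORDS = new Set(['go', 'done', 'send'])

// Given newline-joined fragments, report whether the last fragment is a bare
// send-word and, if so, return the query with that fragment removed.
function take_query(accumulated) {
  const lines = accumulated.split('\n')
  // Strip everything except a-z so "Go." and "GO!" both normalize to "go".
  const last_norm = lines.at(-1).toLowerCase().replace(/[^a-z]/g, '')
  const forced = SEND_WORDS.has(last_norm)
  const query = (forced ? lines.slice(0, -1) : lines).join('\n')
  return { forced, query }
}

// The last fragment is only a send-word: it is dropped and the rest is sent.
const a = take_query('open the log file\nGo.')

// "go" merely ends a longer fragment: no forced send, nothing is dropped.
const b = take_query('where did it go')
```

Matching on the whole normalized fragment rather than the final word means a sentence that happens to end in "go" or "done" is not cut short; the speaker has to pause and say the send-word as its own utterance.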
11
requirements.txt
Normal file
@@ -0,0 +1,11 @@
# STT
faster-whisper>=1.1.0

# TTS (already used in kokoro-tts-test)
kokoro>=0.9.4

# Audio capture
sounddevice>=0.4.6

# Numpy (usually already present)
numpy>=1.24.0
24
setup-venv.sh
Executable file
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Creates a venv in ./venv and installs Python deps for the Bark and Chatterbox servers.
# Safe to re-run: the venv is created only if missing, and pip skips satisfied packages.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${SCRIPT_DIR}/venv"

if [ ! -d "${VENV}" ]; then
  echo "==> creating venv at ${VENV}"
  python3 -m venv "${VENV}"
fi

echo "==> installing Python dependencies"
"${VENV}/bin/pip" install --upgrade pip --quiet
"${VENV}/bin/pip" install transformers accelerate torch numpy
"${VENV}/bin/pip" install chatterbox-tts

chmod +x "${SCRIPT_DIR}/bark-server.py"
chmod +x "${SCRIPT_DIR}/chatterbox-server.py"

echo ""
echo "==> venv ready at ${VENV}"
47
speak-as.mjs
Normal file
@@ -0,0 +1,47 @@
/**
 * Read text from stdin and speak it using a voice reference clip.
 *
 * Usage:
 *   echo "Hello world" | node speak-as.mjs /path/to/voice.wav
 *   cat script.txt | node speak-as.mjs /path/to/voice.wav
 *   node speak-as.mjs /path/to/voice.wav   (then type, Ctrl+D when done)
 */

import * as fs from 'node:fs'
import * as readline from 'node:readline'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'

const audio_prompt = process.argv[2]

if (!audio_prompt) {
  process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n')
  process.exit(1)
}

if (!fs.existsSync(audio_prompt)) {
  process.stderr.write(`File not found: ${audio_prompt}\n`)
  process.exit(1)
}

const tts = new Chatterbox_Tts()
await tts.init()

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: process.stdin.isTTY,
})

const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()
let last_norm = ''

for await (const line of rl) {
  const text = line.trim()
  if (!text) continue
  const n = norm(text)
  if (n === last_norm) continue
  last_norm = n
  await tts.speak(text, { audio_prompt })
}

tts.stop()
133
tts-server.mjs
Normal file
@@ -0,0 +1,133 @@
/**
 * TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API.
 * Requests are serialized so generation and playback stay in order.
 *
 * Usage:
 *   node tts-server.mjs
 *   TTS_PORT=11500 node tts-server.mjs
 *
 * API:
 *   POST /speak   { "text": "...", "audio_prompt": "/path/to/voice.wav", ... }
 *                 → 200 { "ok": true }  (after generation, playback continues in background)
 *   GET  /voices  → 200 { "voices": [{ "name": "rommie", "description": "...", "active": true }, ...], "current": "rommie" | null }
 *   POST /voice   { "name": "rommie" }
 *                 → 200 { "ok": true, "name": "rommie", "path": "..." }
 *   GET  /health  → 200 { "ok": true }
 */

import * as http from 'node:http'
import * as fs from 'node:fs'
import * as path from 'node:path'
import yaml from 'js-yaml'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'

const PORT = parseInt(process.env.TTS_PORT ?? '11500', 10)
const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml')

function reload_voices() {
  try {
    const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8'))
    return doc?.voices ?? {}
  } catch {
    return {}
  }
}

let voices = reload_voices()
let current_voice = null // name of active voice, or null

// --- TTS setup ---
const tts = new Chatterbox_Tts()
process.stderr.write('[tts-server] starting chatterbox...\n')
await tts.init()
process.stderr.write('[tts-server] chatterbox ready\n')

// Serialize all speak requests through a promise chain
let queue = Promise.resolve()

function enqueue(fn) {
  const result = queue.then(fn)
  // Don't let a failed request poison the queue
  queue = result.catch(() => {})
  return result
}

function read_body(req) {
  return new Promise((resolve, reject) => {
    let buf = ''
    req.on('data', chunk => { buf += chunk })
    req.on('end', () => {
      try { resolve(JSON.parse(buf)) } catch (e) { reject(e) }
    })
    req.on('error', reject)
  })
}

function send(res, status, body) {
  const payload = JSON.stringify(body)
  res.writeHead(status, { 'Content-Type': 'application/json' })
  res.end(payload)
}

const server = http.createServer(async (req, res) => {
  if (req.method === 'GET' && req.url === '/health') {
    return send(res, 200, { ok: true })
  }

  if (req.method === 'GET' && req.url === '/voices') {
    voices = reload_voices()
    const list = Object.entries(voices).map(([name, v]) => ({
      name,
      description: v.description ?? '',
      active: name === current_voice,
    }))
    return send(res, 200, { voices: list, current: current_voice })
  }

  if (req.method === 'POST' && req.url === '/voice') {
    let body
    try { body = await read_body(req) } catch {
      return send(res, 400, { error: 'invalid JSON' })
    }
    const { name } = body
    if (!name) return send(res, 400, { error: 'name required' })
    voices = reload_voices()
    if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` })
    current_voice = name
    process.stderr.write(`[tts-server] voice switched to: ${name}\n`)
    return send(res, 200, { ok: true, name, path: voices[name].path })
  }

  if (req.method === 'POST' && req.url === '/speak') {
    let body
    try {
      body = await read_body(req)
    } catch {
      return send(res, 400, { error: 'invalid JSON' })
    }

    const { text, ...opts } = body
    if (!text) {
      return send(res, 400, { error: 'text required' })
    }

    // Inject current voice as default audio_prompt if none provided
    if (!opts.audio_prompt && current_voice && voices[current_voice]) {
      opts.audio_prompt = voices[current_voice].path
    }

    try {
      await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
      send(res, 200, { ok: true })
    } catch (err) {
      send(res, 500, { error: err.message })
    }
    return
  }

  send(res, 404, { error: 'not found' })
})

server.listen(PORT, () => {
  process.stderr.write(`[tts-server] listening on port ${PORT}\n`)
})
65
voice-buddy.mjs
Normal file
@@ -0,0 +1,65 @@
/**
 * Voice buddy — reads voice queries from stdin (JSON lines) and dispatches
 * them to a Claude Code instance via claude-remote.
 *
 * Usage:
 *   node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs
 *
 * Options:
 *   --window-id <id>        X11 window ID of Claude Code (decimal or 0x hex)
 *   --claude-remote <path>  Path to claude-remote.mjs
 */

import * as readline from 'node:readline'
import { execFileSync } from 'node:child_process'

const args = process.argv.slice(2)

function get_arg(name) {
  const i = args.indexOf(name)
  if (i === -1 || i + 1 >= args.length) {
    process.stderr.write(`Missing argument: ${name}\n`)
    process.exit(1)
  }
  return args[i + 1]
}

const window_id = get_arg('--window-id')
const claude_remote = get_arg('--claude-remote')

function make_prompt(query) {
  return [
    '[voice-buddy] You have received a voice query.',
    '',
    'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.',
    '',
    '--- Query ---',
    query,
    '--- End of query ---',
  ].join('\n')
}

function dispatch(query) {
  const prompt = make_prompt(query)
  process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`)
  execFileSync('node', [claude_remote, '--window-id', window_id], {
    input: prompt,
    encoding: 'utf8',
    stdio: ['pipe', 'inherit', 'inherit'],
  })
}

const rl = readline.createInterface({ input: process.stdin })

for await (const line of rl) {
  const trimmed = line.trim()
  if (!trimmed) {
    continue
  }
  try {
    const query = JSON.parse(trimmed)
    dispatch(query)
  } catch {
    process.stderr.write(`[voice-buddy] bad JSON line: ${trimmed}\n`)
  }
}
54
voices.yaml
Normal file
@@ -0,0 +1,54 @@
# Named voices for TTS server.
# Switch via: POST /voice {"name": "rommie"}
# List via:   GET  /voices

voices:
  rommie:
    path: /home/devilholk/Documents/rommie-sample.wav
    description: Rommie - Andromeda ship AI / android

  archangel:
    path: /home/devilholk/Documents/archangel.ogg
    description: Archangel from Airwolf

  marella:
    path: /home/devilholk/Documents/murella.ogg
    description: Marella from Airwolf

  chuin:
    path: /home/devilholk/Documents/chuin.ogg
    description: Chiun from Remo Williams - The Adventure Begins

  harold-w-smith:
    path: /home/devilholk/Documents/cure-leader.ogg
    description: Harold W. Smith from Remo Williams - The Adventure Begins

  penny-parker:
    path: /home/devilholk/Documents/penny.ogg
    description: Penny from MacGyver

  santini:
    path: /home/devilholk/Documents/santini.ogg
    description: Santini from Airwolf

  gosalyn:
    path: /home/devilholk/Documents/gasalin.ogg
    description: Gosalyn from Darkwing Duck

  darkwing-duck:
    path: /home/devilholk/Documents/dwd.ogg
    description: Darkwing Duck

  belitski:
    path: /home/devilholk/Documents/slipstream2.ogg
    description: Belitski from Slipstream

  ivanova:
    path: /home/devilholk/Documents/ivanova.ogg
    description: Susan Ivanova from Babylon 5

  sheila:
    path: /home/devilholk/Documents/darkness-woman.ogg
    description: Sheila from Army of Darkness

  # Add more voices as clips become available