Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11
.gitignore
vendored
Normal file
@@ -0,0 +1,11 @@
node_modules/
models/
venv/
*.ogg
*.wav
*.mp3
*.onnx
*.bin
package-lock.json
__pycache__/
*.pyc
159
CLEANUP-PLAN.md
Normal file
@@ -0,0 +1,159 @@
|
||||
# Voice Project — Cleanup Plan
|
||||
|
||||
> [!NOTE]
|
||||
> **Revised approach** — keep this project as-is (it works and serves as a reference). The path forward is new clean repos for each component, then a top-level project that composes them. See **New architecture** section below.
|
||||
|
||||
The project has accumulated experiment scripts, dead TTS backends, and outdated setup files. This document lays out the plan before any actual cleanup is done.
|
||||
|
||||
---
|
||||
|
||||
## Current state
|
||||
|
||||
### Active pipeline (keep)
|
||||
|
||||
| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point — STT → classifier → dispatch |
| `tts-server.mjs` | TTS HTTP server (Chatterbox, voice switching) |
| `voices.yaml` | Named voice config |
| `lib/stt.mjs` | Silero VAD + Whisper, pre-roll buffer |
| `lib/chatterbox-tts.mjs` | Node wrapper for chatterbox-server.py |
| `lib/tts-client.mjs` | HTTP client for tts-server.mjs |
| `lib/local-query-complete.mjs` | Ollama classifier |
| `chatterbox-server.py` | Long-running Python TTS process |
| `listen.mjs` | Standalone mic → stdout tool |
| `speak-as.mjs` | Standalone stdin → TTS tool |
| `demos/` | Multi-voice demo scripts |
| `download-models.sh` | Whisper + VAD model download |
| `setup-venv.sh` | Python venv setup (needs rewrite — see below) |
| `models/` | Downloaded STT models |
| `venv/` | Python venv (not committed) |
|
||||
|
||||
### Dead / experimental (remove or archive)
|
||||
|
||||
| File | Reason |
|------|--------|
| `acting-demo-bark.mjs` | Bark TTS — abandoned |
| `acting-demo-chatterbox.mjs` | One-off demo, superseded |
| `acting-demo.mjs` | Old demo |
| `bark-server.py` | Bark backend — abandoned |
| `demo-bark.mjs` | Bark demo |
| `demo-kokoro.mjs` | Kokoro TTS experiment — abandoned |
| `lib/bark-tts.mjs` | Bark wrapper |
| `lib/tts.mjs` | Old TTS abstraction (superseded by tts-client.mjs) |
| `lib/markdown.mjs` | Unused? Verify before deleting |
| `lib/llm.mjs` | Unused? Verify before deleting |
| `requirements.txt` | Lists kokoro + faster-whisper deps that aren't used |
| `voice-buddy.mjs` | Merged into query-demo.mjs — delete |
|
||||
|
||||
---
|
||||
|
||||
## What needs to be documented / fixed
|
||||
|
||||
### 1. `setup-venv.sh` — rewrite
|
||||
|
||||
The current script installs Bark deps, but the active backend is Chatterbox. The new script should:

- Install `chatterbox-tts` and its deps (torch, transformers, accelerate, numpy)
- Include the HuggingFace token instruction: write the token to `~/.secrets/hugging-face.token`
- Note that `find_hf_cache()` in chatterbox-server.py avoids HF network calls if the model is cached
|
||||
|
||||
### 2. `requirements.txt` — replace or delete
|
||||
|
||||
Currently lists kokoro and faster-whisper (unused). Either:
|
||||
- Replace with `chatterbox-requirements.txt` listing only what `setup-venv.sh` installs
|
||||
- Or just delete it — `setup-venv.sh` is the source of truth
|
||||
|
||||
### 3. sherpa-onnx / CUDA build — document clearly
|
||||
|
||||
The npm package ships a CPU-only `.so`. Using CUDA requires a manual rebuild. This is documented in `NOTES.md` but should also appear in a top-level `README.md` or `SETUP.md` so it's hard to miss.
|
||||
|
||||
### 4. System dependencies — list them
|
||||
|
||||
Nothing currently lists what needs to be installed at the OS level:
|
||||
- `parec` / `pacat` (PulseAudio) — audio I/O
|
||||
- `cmake` — for optional CUDA sherpa-onnx rebuild
|
||||
- `onnxruntime-opt-cuda` (or equivalent) — for GPU STT
|
||||
- `ollama` running locally with `qwen2.5:3b` pulled — for classifier
|
||||
- `jq` — used in shell functions and demo scripts
|
||||
- Node.js (version? check what's required)
|
||||
- Python 3 with venv support
|
||||
|
||||
### 5. External services — document
|
||||
|
||||
- **TTS server**: runs on the host (not in container), port 11500
|
||||
- **Ollama**: runs on `192.168.2.99:11434` — classifier and future LLM routing
|
||||
- **TTS_URL** env var: set in `~/.bashrc` to point at TTS server
|
||||
|
||||
### 6. `voices.yaml` — keep and document
|
||||
|
||||
Already the right format. Add a comment block at the top explaining how to add a voice (grab a clean WAV/OGG clip, at least 5 seconds, no background music).
|
||||
|
||||
---
|
||||
|
||||
## Proposed new top-level structure
|
||||
|
||||
```
voice/
  README.md              ← start here: what this is, quick start
  SETUP.md               ← full setup: OS deps, npm install, venv, CUDA, models
  NOTES.md               ← architecture deep-dives (keep as-is)
  TODO.md                ← keep as-is
  VOICES.md              ← wishlist (keep as-is)
  voices.yaml            ← runtime voice config
  query-demo.mjs         ← main entry point
  tts-server.mjs         ← TTS HTTP server
  listen.mjs             ← standalone STT tool
  speak-as.mjs           ← standalone TTS tool
  chatterbox-server.py
  setup-venv.sh          ← rewritten for Chatterbox only
  download-models.sh
  package.json
  lib/
    stt.mjs
    chatterbox-tts.mjs
    tts-client.mjs
    local-query-complete.mjs
  demos/
    *.sh
  models/                ← gitignored
  venv/                  ← gitignored
```
|
||||
|
||||
---
|
||||
|
||||
## Cleanup steps (in order)
|
||||
|
||||
1. **Verify** `lib/markdown.mjs` and `lib/llm.mjs` are unused — grep for imports
|
||||
2. **Delete** dead files: bark/kokoro scripts, old demos, voice-buddy.mjs, lib/bark-tts.mjs, lib/tts.mjs, and unused lib files
|
||||
3. **Rewrite** `setup-venv.sh` for Chatterbox only
|
||||
4. **Replace** `requirements.txt` with accurate one or delete
|
||||
5. **Write** `SETUP.md` covering all install steps end-to-end
|
||||
6. **Write** `README.md` with quick start
|
||||
7. **Add** voice cloning tips to top of `voices.yaml`
|
||||
8. **Commit** the whole cleanup as a single commit
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- Keep `acting-demo-chatterbox.mjs` as a reference for TTS capability demos, or delete?
|
||||
- Should `demos/` scripts be committed as-is or made more portable (no hardcoded paths)?
|
||||
|
||||
---
|
||||
|
||||
## New architecture (revised plan)
|
||||
|
||||
Keep this project as a reference. Build clean standalone repos instead:
|
||||
|
||||
| Repo | Contents | Depends on |
|------|----------|-----------|
| `tts-server` | chatterbox-server.py, tts-server.mjs, lib/chatterbox-tts.mjs, voices.yaml format, HTTP API | Python venv, Chatterbox |
| `stt` | lib/stt.mjs, sherpa-onnx, VAD + Whisper, pre-roll buffer, listen.mjs | sherpa-onnx-node, models |
| `voice-assistant` | query-demo.mjs, lib/tts-client.mjs, lib/local-query-complete.mjs | tts-server + stt as submodules |
|
||||
|
||||
Each standalone repo gets its own README and SETUP, and can be used independently. The voice-assistant repo composes them via git submodules (or npm workspaces / symlinks; TBD).
|
||||
|
||||
### Open questions for new architecture
|
||||
|
||||
- **Composition: npm packages** — each repo published to the private npm.efforting.tech registry; voice-assistant depends on `@efforting/tts-server` and `@efforting/stt` as npm deps. Cleaner than submodules for ESM projects.
|
||||
- **Model storage** — models are large, have different backup strategies, and should NOT live inside project directories. There are canonical model locations on the system; repos should reference them via an env var (e.g. `MODELS_DIR` or per-model env vars) rather than downloading into the project. Migrate existing `models/` directories out as part of the new repo setup. Not urgent — note for cleanup time.
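
A minimal sketch of the env-var lookup described in the model storage item. Everything here is an assumption for illustration: the function name `resolve_model_path`, the per-model variable naming scheme (`<NAME>_MODEL_DIR`), and the fallback order are not defined anywhere in the existing code.

```javascript
// Hypothetical helper: resolve where a model lives without looking inside the repo.
// Per-model variables win over the shared MODELS_DIR; both names are assumptions.
function resolve_model_path(model_name, env = process.env) {
  // "sherpa-onnx-whisper-base.en" -> "SHERPA_ONNX_WHISPER_BASE_EN_MODEL_DIR"
  const per_model_key = `${model_name.toUpperCase().replace(/[^A-Z0-9]+/g, '_')}_MODEL_DIR`
  if (env[per_model_key]) return env[per_model_key]

  if (!env.MODELS_DIR) {
    throw new Error(`Set MODELS_DIR or ${per_model_key} to locate "${model_name}"`)
  }
  // Strip trailing slashes so the joined path has exactly one separator.
  return `${env.MODELS_DIR.replace(/\/+$/, '')}/${model_name}`
}
```

Failing loudly when neither variable is set keeps a repo from silently downloading models into the project directory again.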
|
||||
51
LLM-ROUTING.md
Normal file
@@ -0,0 +1,51 @@
|
||||
# LLM Routing Strategy
|
||||
|
||||
When to use a local model vs an Anthropic frontier model in the voice pipeline.
|
||||
|
||||
---
|
||||
|
||||
## Current model inventory
|
||||
|
||||
| Role | Model | Where | VRAM |
|------|-------|-------|------|
| STT | Whisper base.en + Silero VAD | sherpa-onnx (GPU) | ~0.5 GB |
| TTS | Chatterbox Turbo | Python/CUDA | ~4 GB |
| Completeness classifier | Qwen 2.5 3B | Ollama | ~2 GB |
| Main reasoning | Claude (Anthropic) | API / Claude Code | — |
|
||||
|
||||
Total local VRAM: ~6–7 GB on RTX 3090 (24 GB). Plenty of headroom.
|
||||
|
||||
---
|
||||
|
||||
## Routing principles
|
||||
|
||||
### Use local model when:
|
||||
- Task is **classification or routing** (completeness, intent detection, command vs question)
|
||||
- Task is **latency-sensitive** and the answer doesn't require world knowledge or reasoning
|
||||
- Task is **high-frequency** (runs on every utterance — billing would add up)
|
||||
- Privacy matters — query content shouldn't leave the machine
|
||||
|
||||
### Use Anthropic frontier model when:
|
||||
- Task requires **reasoning, synthesis, or judgment**
|
||||
- Task involves **code generation or editing**
|
||||
- Task involves **world knowledge** beyond what a small model handles well
|
||||
- Query is **low-frequency** (once per user request, not once per audio frame)
|
||||
- Quality matters more than latency
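
The two lists above could be encoded as a small routing function. This is only a sketch: the task descriptor fields (`kind`, `per_utterance`, `private`, and so on) are hypothetical, and the precedence between rules (privacy first, then frequency, then capability) is an assumption that the principles themselves do not specify.

```javascript
// Hypothetical routing sketch. Returns 'local' or 'frontier'.
function choose_backend(task) {
  // Privacy wins outright: the content must not leave the machine.
  if (task.private) return 'local'
  // Classification and routing are what the small local model is for.
  if (task.kind === 'classification' || task.kind === 'routing') return 'local'
  // Anything that runs on every utterance would make API billing add up.
  if (task.per_utterance) return 'local'
  // Reasoning, code, and world knowledge go to the frontier model.
  if (task.needs_reasoning || task.needs_code || task.needs_world_knowledge) return 'frontier'
  // Otherwise trade latency against quality.
  return task.latency_sensitive ? 'local' : 'frontier'
}
```

For example, `choose_backend({ kind: 'classification', per_utterance: true })` stays local, while `choose_backend({ needs_code: true })` goes to the API.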
|
||||
|
||||
---
|
||||
|
||||
## GPU split (future option)
|
||||
|
||||
If voice models and reasoning models compete for VRAM, split across cards:
|
||||
- **Card A** (smaller VRAM): STT + TTS + classifier — pure inference, no large models needed
|
||||
- **Card B** (larger VRAM): local reasoning model (e.g. Qwen 20B or similar) for tasks that benefit from a capable local model but shouldn't hit the API
|
||||
|
||||
Use `CUDA_VISIBLE_DEVICES` or Ollama's GPU assignment to pin models to specific cards.
|
||||
|
||||
---
|
||||
|
||||
## Upgrade candidates
|
||||
|
||||
- **Whisper small.en or medium.en** — better transcription accuracy, especially for accents. Small adds ~0.5 GB, medium ~1.5 GB.
|
||||
- **Completeness classifier → larger model** — Qwen 7B or 14B would give significantly better sentence boundary detection and instruction/question classification.
|
||||
- **Local reasoning model** — for tasks that are too simple to warrant an API call but too complex for a 3B classifier (e.g. summarisation, note-taking, simple Q&A). A 7–14B quantised model via Ollama would fit alongside the voice stack.
|
||||
161
NOTES.md
Normal file
@@ -0,0 +1,161 @@
|
||||
# Voice Experiment Notes
|
||||
|
||||
Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
parec (PulseAudio)
  └─ lib/stt.mjs                      Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so   (needs CUDA build — see below)
                 └─ libonnxruntime.so    (system, with CUDA)

text → lib/chatterbox-tts.mjs
  └─ chatterbox-server.py             long-running Python TTS server
       └─ chatterbox-tts (pip)        ResembleAI Chatterbox model
```
|
||||
|
||||
---
|
||||
|
||||
## STT
|
||||
|
||||
**Library:** `sherpa-onnx-node` npm package
|
||||
**Models:** Whisper (offline batch) + Silero VAD
|
||||
**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`
|
||||
|
||||
### Model download
|
||||
|
||||
```bash
|
||||
bash download-models.sh base.en # or: tiny.en small.en medium.en large-v3 large-v3-turbo
|
||||
```
|
||||
|
||||
Models land in `models/`:
|
||||
- `silero_vad.onnx`
|
||||
- `sherpa-onnx-whisper-base.en/`
|
||||
|
||||
### How it works
|
||||
|
||||
`lib/stt.mjs` feeds 512-sample windows (32 ms at 16 kHz) into Silero VAD. When the VAD signals the end of speech (after 500 ms of silence), the segment is sent to Whisper for transcription. The transcription is written to stdout. **Note: the native addon writes it directly to fd 1, bypassing Node.js `process.stdout`; the write does not go through the JavaScript callback.** The JS callback is still invoked and can be used for additional processing.
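
The windowing step can be sketched as follows. This is an illustration, not the code in `lib/stt.mjs`: the function name `split_into_windows` is hypothetical, and it assumes the input is 16-bit little-endian mono PCM at 16 kHz, matching the `parec` flags above.

```javascript
// Hypothetical helper, not taken from lib/stt.mjs.
const SAMPLE_RATE = 16000 // Hz, matches parec --rate=16000
const WINDOW_SIZE = 512   // samples per VAD window

// Convert a Buffer/Uint8Array of s16le PCM into Float32Array windows in [-1, 1).
// Returns the complete windows plus leftover bytes to prepend to the next chunk.
function split_into_windows(pcm_bytes) {
  const bytes_per_window = WINDOW_SIZE * 2 // 2 bytes per 16-bit sample
  const view = new DataView(pcm_bytes.buffer, pcm_bytes.byteOffset, pcm_bytes.byteLength)
  const windows = []
  let offset = 0
  while (offset + bytes_per_window <= pcm_bytes.byteLength) {
    const samples = new Float32Array(WINDOW_SIZE)
    for (let i = 0; i < WINDOW_SIZE; i++) {
      samples[i] = view.getInt16(offset + i * 2, true) / 32768 // true = little-endian
    }
    windows.push(samples)
    offset += bytes_per_window
  }
  return { windows, rest: pcm_bytes.subarray(offset) }
}

// Window duration in milliseconds: 512 samples at 16 kHz.
const window_ms = (WINDOW_SIZE * 1000) / SAMPLE_RATE
```

Carrying `rest` over between chunks matters because a pipe delivers data in arbitrary sizes that rarely align to a 1024-byte window boundary.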
|
||||
|
||||
### Getting CUDA
|
||||
|
||||
The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:
|
||||
|
||||
```bash
# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx

export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib

cd ~/Foreign/sherpa-onnx
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .

cmake --build build --target sherpa-onnx-c-api

# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so
rm -f $NMOD/libonnxruntime.so
```
|
||||
|
||||
The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.
|
||||
|
||||
---
|
||||
|
||||
## TTS
|
||||
|
||||
**Library:** `chatterbox-tts` Python package (ResembleAI)
|
||||
**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
|
||||
**Playback:** `pacat --format=float32le --rate=24000 --channels=1`
|
||||
|
||||
### Setup
|
||||
|
||||
```bash
|
||||
bash setup-venv.sh # creates ./venv, installs chatterbox-tts + deps
|
||||
```
|
||||
|
||||
HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.
|
||||
|
||||
### Variants
|
||||
|
||||
| Variant | RTF (RTX 3090) | Notes |
|---------|---------------|-------|
| turbo | ~0.2 | default, female voice, no exaggeration param |
| full | slower | male voice, supports `exaggeration` (0.0–1.0) |
|
||||
|
||||
### Paralinguistic tags
|
||||
|
||||
Work in both variants:
|
||||
```
|
||||
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
|
||||
```
|
||||
|
||||
### Voice cloning
|
||||
|
||||
Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
|
||||
```javascript
|
||||
await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
|
||||
```
|
||||
|
||||
The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.
|
||||
|
||||
### Server protocol
|
||||
|
||||
`chatterbox-server.py` is a long-running process. Communication:
|
||||
- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
|
||||
- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
|
||||
- **stderr:** status/timing logs
|
||||
|
||||
Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
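
A minimal sketch of the client side of this protocol. The helper names are hypothetical and this is not the code in `lib/chatterbox-tts.mjs`; it only shows the line framing described above, plus the RTF arithmetic.

```javascript
// Hypothetical helpers illustrating the stdin/stdout framing.

// One request is one JSON object on a single line. JSON.stringify escapes
// any newlines inside the text, so the payload cannot span lines.
function encode_request(text, options = {}) {
  return JSON.stringify({ text, ...options }) + '\n'
}

// stdout arrives in arbitrary chunks, so buffer until a full line is seen
// and fire on_ack once per "ok" line.
function create_ack_reader(on_ack) {
  let pending = ''
  return function feed(chunk) {
    pending += chunk
    let newline
    while ((newline = pending.indexOf('\n')) !== -1) {
      const line = pending.slice(0, newline)
      pending = pending.slice(newline + 1)
      if (line === 'ok') on_ack()
    }
  }
}

// RTF (real-time factor) = generation time / audio duration.
// At RTF ~0.2, five seconds of audio take roughly one second to generate.
function estimated_generation_seconds(audio_seconds, rtf = 0.2) {
  return audio_seconds * rtf
}
```

In a real wrapper, `encode_request(...)` would be written to the child process's stdin and `feed` attached to its stdout `data` events. Because `ok` signals generation rather than the end of playback, a caller can send the next request as soon as the acknowledgement arrives.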
|
||||
|
||||
### Acting tips
|
||||
|
||||
- Emphasis: `UPPERCASE` words
|
||||
- Pause/rhythm: `...` `—`
|
||||
- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
|
||||
- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
|
||||
- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
| Script | Usage |
|--------|-------|
| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |
|
||||
|
||||
### Pipe example
|
||||
|
||||
```bash
|
||||
node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
|
||||
```
|
||||
|
||||
`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.
|
||||
|
||||
---
|
||||
|
||||
## Known Quirks
|
||||
|
||||
**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
|
||||
```python
|
||||
_orig = _librosa.resample
|
||||
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
|
||||
```
|
||||
|
||||
**HuggingFace offline:** `from_pretrained()` always hits the internet. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.
|
||||
|
||||
**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
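
A sketch of consecutive-line deduplication along these lines. The exact normalization used by `speak-as.mjs` may differ; `normalize_line` and `create_deduplicator` are hypothetical names used for illustration.

```javascript
// Hypothetical sketch; not the code in speak-as.mjs.

// Lowercase, strip punctuation and symbols, collapse whitespace.
function normalize_line(line) {
  return line
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim()
}

// Returns a predicate that is false for empty lines and for a line whose
// normalized form equals the previously accepted one.
function create_deduplicator() {
  let previous = null
  return function should_speak(line) {
    const key = normalize_line(line)
    if (key === '' || key === previous) return false
    previous = key
    return true
  }
}
```

This only catches segments that become identical after normalization. Segments that are merely similar would need a fuzzy comparison, which this sketch does not attempt.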
|
||||
|
||||
**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
|
||||
80
SPEAKER-DIARIZATION.md
Normal file
@@ -0,0 +1,80 @@
|
||||
# Speaker Diarization
|
||||
|
||||
Notes on adding speaker identification to the voice pipeline.
|
||||
|
||||
---
|
||||
|
||||
## What Whisper does and doesn't do
|
||||
|
||||
Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.
|
||||
|
||||
---
|
||||
|
||||
## Adding diarization
|
||||
|
||||
Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
|
||||
|
||||
### pyannote.audio
|
||||
|
||||
The most capable open-source option. Runs locally, GPU-accelerated.
|
||||
|
||||
```
|
||||
pip install pyannote.audio
|
||||
```
|
||||
|
||||
Produces time-stamped segments:
|
||||
```
|
||||
SPEAKER_00 0.5s – 3.2s
|
||||
SPEAKER_01 3.8s – 7.1s
|
||||
SPEAKER_00 7.5s – 12.0s
|
||||
```
|
||||
|
||||
Feed each segment to Whisper independently → transcript with speaker tags.
|
||||
|
||||
Requires a HuggingFace token and model access request (free).
|
||||
|
||||
### sherpa-onnx
|
||||
|
||||
Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.
|
||||
|
||||
---
|
||||
|
||||
## Speaker identification vs diarization
|
||||
|
||||
| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |
|
||||
|
||||
For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:
|
||||
|
||||
- **resemblyzer** — simple cosine similarity on d-vector embeddings
|
||||
- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
|
||||
- **speechbrain** — full toolkit, heavier dependency
|
||||
|
||||
---
|
||||
|
||||
## Future experiments
|
||||
|
||||
### Multi-speaker transcription
|
||||
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
|
||||
```
|
||||
Mikael: How do I juggle oranges?
|
||||
Claude: Start with two balls and work up from there.
|
||||
```
|
||||
|
||||
### Real-time diarization
|
||||
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
|
||||
|
||||
### Voice enrollment
|
||||
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
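
The scoring step could look roughly like this. It assumes embeddings are plain numeric arrays of equal length produced by some speaker embedding model, which is not shown. The function names and the `0.75` threshold are illustrative placeholders, not tuned values.

```javascript
// Hypothetical sketch of enrollment matching via cosine similarity.
function cosine_similarity(a, b) {
  let dot = 0
  let norm_a = 0
  let norm_b = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    norm_a += a[i] * a[i]
    norm_b += b[i] * b[i]
  }
  return dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
}

// enrolled: Map of speaker name -> reference embedding.
// Returns the closest enrolled name, or null when no score reaches the
// threshold (the caller would then assign a temporary anonymous label).
function identify_speaker(embedding, enrolled, threshold = 0.75) {
  let best_name = null
  let best_score = -Infinity
  for (const [name, reference] of enrolled) {
    const score = cosine_similarity(embedding, reference)
    if (score > best_score) {
      best_score = score
      best_name = name
    }
  }
  return best_score >= threshold ? best_name : null
}
```

The threshold would need calibrating against real clips: set too low, strangers get matched to an enrolled voice; set too high, enrolled speakers fall through to the anonymous label.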
|
||||
|
||||
### Wake word per speaker
|
||||
Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
|
||||
|
||||
---
|
||||
|
||||
## VRAM considerations
|
||||
|
||||
pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
|
||||
41
TODO.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# Voice Experiment — TODO
|
||||
|
||||
## Classifier
|
||||
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
|
||||
- [ ] Tune classifier — still too conservative on plain statements and short instructions
|
||||
- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
|
||||
|
||||
## Voice switching
|
||||
- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
|
||||
- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
|
||||
- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
|
||||
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
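
The daily rotation item above amounts to hashing the date string and taking it modulo the number of voices. A minimal sketch, using FNV-1a as the hash (any stable string hash would do) and hypothetical function names:

```javascript
// FNV-1a 32-bit string hash. Chosen here only because it is short and stable.
function fnv1a(text) {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep as unsigned 32-bit
  }
  return hash >>> 0
}

// Pick the voice for a given day. Uses the UTC calendar date (YYYY-MM-DD),
// so the voice changes at midnight UTC rather than local midnight.
function voice_for_date(date, voice_names) {
  const day = date.toISOString().slice(0, 10)
  return voice_names[fnv1a(day) % voice_names.length]
}
```

Calling `voice_for_date(new Date(), names)` with the names from the voice list gives the same voice all day without storing any state. If a local-time rollover is preferred, the date string would need to be built from local date parts instead.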
|
||||
|
||||
## Pipeline
|
||||
- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
|
||||
- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
|
||||
- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
|
||||
- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
|
||||
- [ ] Reserved control vocabulary — single words or short phrases, pattern-matched before any classifier:
|
||||
|
||||
| Word(s) | Action |
|---------|--------|
| `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
| `project note` | Activate with project context injected into the prompt preamble |
| `go` / `send` / `done` | Dispatch current query immediately |
| `cancel` / `scratch that` | Discard accumulated query, return to idle |
| `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
| `switch voice` / `different voice` | Trigger voice-switch flow |
|
||||
|
||||
Words are normalized (lowercased, punctuation stripped) and checked before the classifier runs, so the classifier only sees real query content.
|
||||
|
||||
- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
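
A sketch of how the control vocabulary and the stage hierarchy might fit together. The action names (`activate`, `dispatch`, and so on) and the function names are hypothetical. It assumes a control word must match the whole normalized fragment, which is one possible reading of "pattern matched".

```javascript
// Hypothetical pattern layer; action names are illustrative only.
const CONTROL_WORDS = new Map([
  ['computer', 'activate'],
  ['project note', 'activate_project'],
  ['go', 'dispatch'], ['send', 'dispatch'], ['done', 'dispatch'],
  ['cancel', 'discard'], ['scratch that', 'discard'],
  ['stop', 'shutdown'], ['goodbye', 'shutdown'], ['hang up', 'shutdown'],
  ['switch voice', 'switch_voice'], ['different voice', 'switch_voice'],
])

function normalize(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim()
}

// Stage 1: exact match on the whole normalized fragment, or null.
function match_control_word(fragment) {
  return CONTROL_WORDS.get(normalize(fragment)) ?? null
}

// Run stages in order. The first stage that returns a non-null result
// short-circuits the rest, so LLM stages only see what the patterns miss.
async function classify(fragment, stages) {
  for (const stage of stages) {
    const result = await stage(fragment)
    if (result !== null) return result
  }
  return null
}
```

Because the stages are just an ordered array of functions, an LLM classifier can be added or removed (for example `[match_control_word, llm_completeness_check]`, where the second name is a placeholder) without touching the pattern layer.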
|
||||
|
||||
## TTS / Long text
|
||||
- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
|
||||
|
||||
## TTS / Prompt
|
||||
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
|
||||
|
||||
## STT
|
||||
- [ ] Try Whisper small.en or medium.en for better accuracy on accents
|
||||
29
VOICES.md
Normal file
@@ -0,0 +1,29 @@
|
||||
# Voice Clone Wishlist
|
||||
|
||||
Voices to prepare reference clips for.
|
||||
|
||||
## Ready
|
||||
|
||||
| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Rommie | Andromeda — ship AI / android | `/home/devilholk/Documents/rommie-sample.wav` — working well |
| Wilford Brimley (Harold W. Smith) | Remo Williams: The Adventure Begins (1985) — head of CURE | Turned out well — gravelly authoritative delivery, dry clean speech |
|
||||
|
||||
## To Do
|
||||
|
||||
| Voice | Character / Show | Notes |
|-------|-----------------|-------|
| Fred Ward | Remo Williams: The Adventure Begins (1985) — Remo Williams himself | Distinctive gravelly voice, same film as Brimley so similar source quality |
| Alan Scarfe | TNG (Romulan), Andromeda S5 (Flavin) | Deep, authoritative voice — hunt for quiet scenes without ship hum |
| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
| Steve Bacic | Andromeda (Telemachus Rhade) | — |
| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |
|
||||
|
||||
## Notes
|
||||
|
||||
- Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
|
||||
- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
|
||||
- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both
|
||||
- Voice cloning quality scales with both clip length and emotional range — varied prosody (questions, statements, different tones) gives the model more to anchor on than flat monotone
|
||||
135
acting-demo-bark.mjs
Normal file
@@ -0,0 +1,135 @@
|
||||
/**
|
||||
* Demonstrates acting and expressiveness with Bark TTS.
|
||||
*
|
||||
* Run: node acting-demo-bark.mjs [start-section]
|
||||
*
|
||||
* Bark expressive techniques:
|
||||
* UPPERCASE — stress/emphasis on a word
|
||||
* ... — hesitation, trailing off
|
||||
* [laughs] — laughter
|
||||
* [sighs] — sigh
|
||||
* [gasps] — sharp intake of breath
|
||||
* [clears throat] — throat clearing
|
||||
* ♪ text ♪ — sung phrase
|
||||
* — — abrupt break
|
||||
* ! — raised energy
|
||||
*
|
||||
* Voice presets (English):
|
||||
* v2/en_speaker_0 calm female
|
||||
* v2/en_speaker_1 calm male
|
||||
* v2/en_speaker_3 deep male
|
||||
* v2/en_speaker_6 neutral/warm (default)
|
||||
* v2/en_speaker_9 expressive
|
||||
*/
|
||||
|
||||
import { Bark_Tts } from './lib/bark-tts.mjs'
|
||||
|
||||
const tts = new Bark_Tts()
|
||||
|
||||
process.stderr.write('[bark] loading model...\n')
|
||||
await tts.init()
|
||||
process.stderr.write('[bark] ready\n\n')
|
||||
|
||||
const START = parseInt(process.argv[2] ?? '1', 10)
|
||||
|
||||
function section(title) {
|
||||
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
|
||||
}
|
||||
|
||||
async function say(text, voice = undefined) {
|
||||
const opts = { preprocess: false }
|
||||
if (voice) opts.voice = voice
|
||||
process.stdout.write(` ${voice ?? 'default'}\n "${text}"\n\n`)
|
||||
await tts.speak(text, opts)
|
||||
}
|
||||
|
||||
// ─── 1. Baseline ───────────────────────────────────────
|
||||
section('1. Baseline')
|
||||
if (START <= 1) {
|
||||
await say('The package has arrived. Please sign here.')
|
||||
}
|
||||
|
||||
// ─── 2. Paralinguistic tokens ──────────────────────────
|
||||
section('2. Paralinguistic tokens')
|
||||
if (START <= 2) {
|
||||
await say('[clears throat] Right. Where were we.')
|
||||
await say('And then he just... [sighs] gave up.')
|
||||
await say('[gasps] I had no idea you were here!')
|
||||
await say("Ha. [laughs] That's the funniest thing I've heard all week.")
|
||||
}
|
||||
|
||||
// ─── 3. Emphasis via UPPERCASE ─────────────────────────
|
||||
section('3. Emphasis via UPPERCASE')
|
||||
if (START <= 3) {
|
||||
await say('You need to stop. RIGHT NOW.')
|
||||
await say('I have told you THIS BEFORE.')
|
||||
await say('This is ABSOLUTELY unacceptable.')
|
||||
}
|
||||
|
||||
// ─── 4. Hesitation and rhythm ──────────────────────────
|
||||
section('4. Hesitation and rhythm')
|
||||
if (START <= 4) {
|
||||
await say('It was... quiet. Too quiet.')
|
||||
await say('I just... I don\'t know what to say.')
|
||||
await say('He said he was fine — but his eyes told a different story.')
|
||||
}
|
||||
|
||||
// ─── 5. Sung phrase ────────────────────────────────────
|
||||
section('5. Singing')
|
||||
if (START <= 5) {
|
||||
await say('♪ Happy birthday to you, happy birthday to you ♪')
|
||||
}
|
||||
|
||||
// ─── 6. Acronyms ───────────────────────────────────────
|
||||
section('6. Acronyms')
|
||||
if (START <= 6) {
|
||||
// Without spacing — Bark may try to pronounce as a word
|
||||
await say('The FBI is investigating.')
|
||||
await say('NASA launched a new rocket.')
|
||||
await say('I work with the CPU and GPU.')
|
||||
// With letter spacing — forces spelling out
|
||||
await say('The F B I is investigating.')
|
||||
await say('N A S A launched a new rocket.')
|
||||
await say('I work with the C P U and G P U.')
|
||||
}
|
||||
|
||||
// ─── 7. Voice character ────────────────────────────────
|
||||
section('7. Voice presets — same line')
|
||||
if (START <= 7) {
|
||||
const line = "Well, that's certainly one way to look at it."
|
||||
for (const voice of ['v2/en_speaker_0', 'v2/en_speaker_3', 'v2/en_speaker_6', 'v2/en_speaker_9']) {
|
||||
await say(line, voice)
|
||||
}
|
||||
}
|
||||
|
||||
// ─── 8. Combined scene ─────────────────────────────────
|
||||
section('8. Combined — a tense scene')
|
||||
if (START <= 8) {
|
||||
await say('Something is wrong.', 'v2/en_speaker_9')
|
||||
await say('I can FEEL it.', 'v2/en_speaker_9')
|
||||
await say('[gasps] Run!', 'v2/en_speaker_9')
|
||||
}
|
||||
|
||||
// ─── 9. Paragraph — voices 7-9 ─────────────────────────
|
||||
// Pass voices as CLI args to override, e.g.:
|
||||
// node acting-demo-bark.mjs 9 v2/en_speaker_0 v2/en_speaker_2 v2/en_speaker_5
|
||||
const PARA_VOICES = process.argv.slice(3).length
|
||||
? process.argv.slice(3)
|
||||
: ['v2/en_speaker_7', 'v2/en_speaker_8', 'v2/en_speaker_9']
|
||||
|
||||
const PARAGRAPH = 'It was a cold evening when the letter FINALLY arrived. '
|
||||
+ '[sighs] She had been waiting for months... not knowing whether the news would be good — or bad. '
|
||||
+ 'Her hands trembled as she tore open the envelope. '
|
||||
+ 'Inside was a single sheet of paper with just THREE words written on it. '
|
||||
+ '[gasps] She read them twice... then sat down slowly, and stared at the wall.'
|
||||
|
||||
section('9. Paragraph — voice comparison')
|
||||
if (START <= 9) {
|
||||
for (const voice of PARA_VOICES) {
|
||||
process.stdout.write(`\n — ${voice} —\n`)
|
||||
await say(PARAGRAPH, voice)
|
||||
}
|
||||
}
|
||||
|
||||
tts.stop()
|
||||
section('Done')
|
||||
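The acronym section above contrasts plain acronyms with letter-spaced ones. A minimal sketch of automating that spacing follows. `space_acronyms` and its default acronym list are hypothetical and not part of this repo. It uses an explicit allow-list so that UPPERCASE emphasis words (section 3) are left alone.

```javascript
// Hypothetical helper (not in the repo): letter-space known acronyms so the
// TTS model spells them out, while leaving UPPERCASE emphasis words intact.
function space_acronyms(text, acronyms = ['FBI', 'NASA', 'CPU', 'GPU']) {
  const known = new Set(acronyms)
  // Match whole words made of two or more capital letters.
  return text.replace(/\b[A-Z]{2,}\b/g, (word) =>
    known.has(word) ? word.split('').join(' ') : word)
}

console.log(space_acronyms('The FBI is investigating.'))
console.log(space_acronyms('You need to stop. RIGHT NOW.'))
```

Whether spacing actually improves pronunciation depends on the model and voice, which is what the demo section is meant to let you hear.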
109
acting-demo-chatterbox.mjs
Normal file
@@ -0,0 +1,109 @@
|
||||
/**
|
||||
* Acting demo for Chatterbox TTS.
|
||||
*
|
||||
 * Run: node acting-demo-chatterbox.mjs [start-section] [turbo|full] [voice.wav]
|
||||
*
|
||||
* Paralinguistic tags:
|
||||
* [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
|
||||
*
|
||||
* Exaggeration (full model only, 0.0-1.0):
|
||||
* 0.0 = neutral, 1.0 = maximum emotion intensity
|
||||
*/
|
||||
|
||||
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'
|
||||
|
||||
const START = parseInt(process.argv[2] ?? '1', 10)
|
||||
const VARIANT = process.argv[3] ?? 'turbo'
|
||||
|
||||
const tts = new Chatterbox_Tts({ variant: VARIANT })
|
||||
|
||||
process.stderr.write(`[chatterbox] loading ${VARIANT} model...\n`)
|
||||
await tts.init()
|
||||
process.stderr.write('[chatterbox] ready\n\n')
|
||||
|
||||
function section(title) {
|
||||
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
|
||||
}
|
||||
|
||||
async function say(text, opts = {}) {
|
||||
process.stdout.write(` "${text}"\n`)
|
||||
if (Object.keys(opts).length) process.stdout.write(` ${JSON.stringify(opts)}\n`)
|
||||
process.stdout.write('\n')
|
||||
await tts.speak(text, { preprocess: false, ...opts })
|
||||
}
|
||||
|
||||
// ─── 1. Baseline ───────────────────────────────────────
|
||||
section('1. Baseline')
|
||||
if (START <= 1) {
|
||||
await say('The package has arrived. Please sign here.')
|
||||
}
|
||||
|
||||
// ─── 2. Paralinguistic tags ────────────────────────────
|
||||
section('2. Paralinguistic tags')
|
||||
if (START <= 2) {
|
||||
await say('[cough] Right, where were we.')
|
||||
await say('And then he just... [sigh] gave up.')
|
||||
await say('[gasp] I had no idea you were here!')
|
||||
await say('Ha! [laugh] That is the funniest thing I have heard all week.')
|
||||
await say('[chuckle] Well, that is certainly one way to look at it.')
|
||||
await say('[groan] Not this again.')
|
||||
}
|
||||
|
||||
// ─── 3. Punctuation and rhythm ─────────────────────────
|
||||
section('3. Punctuation and rhythm')
|
||||
if (START <= 3) {
|
||||
await say('It was... quiet. Too quiet.')
|
||||
await say('He said he was fine — but his eyes told a different story.')
|
||||
await say('I told you. I told you this would happen. But did anyone listen? No.')
|
||||
}
|
||||
|
||||
// ─── 4. Exaggeration — full model only ─────────────────
|
||||
section(`4. Exaggeration (${VARIANT === 'full' ? 'active' : 'ignored in turbo — run with: node acting-demo-chatterbox.mjs 4 full'})`)
|
||||
if (START <= 4) {
|
||||
await say('I am so incredibly happy to see you today!', { exaggeration: 0.0 })
|
||||
await say('I am so incredibly happy to see you today!', { exaggeration: 0.5 })
|
||||
await say('I am so incredibly happy to see you today!', { exaggeration: 1.0 })
|
||||
}
|
||||
|
||||
// ─── 5. Temperature ────────────────────────────────────
|
||||
section('5. Temperature (lower = more predictable)')
|
||||
if (START <= 5) {
|
||||
await say('Something is very wrong here and I think we should leave now.', { temperature: 0.3 })
|
||||
await say('Something is very wrong here and I think we should leave now.', { temperature: 0.8 })
|
||||
await say('Something is very wrong here and I think we should leave now.', { temperature: 1.4 })
|
||||
}
|
||||
|
||||
// ─── 6. Paragraph ──────────────────────────────────────
|
||||
section('6. Paragraph with tags')
|
||||
if (START <= 6) {
|
||||
await say(
|
||||
'It was a cold evening when the letter finally arrived. '
|
||||
+ '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad. '
|
||||
+ 'Her hands trembled as she tore open the envelope. '
|
||||
+ '[gasp] Inside was a single sheet of paper with just three words written on it.'
|
||||
)
|
||||
}
|
||||
|
||||
// ─── 7. Audio prompt / voice cloning ───────────────────
|
||||
// Pass a WAV file path as third argument:
|
||||
// node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav
|
||||
section('7. Audio prompt (voice cloning)')
|
||||
if (START <= 7) {
|
||||
const audio_prompt = process.argv[4]
|
||||
if (!audio_prompt) {
|
||||
process.stdout.write(' Skipped — pass a WAV file as the 3rd argument:\n')
|
||||
process.stdout.write(' node acting-demo-chatterbox.mjs 7 turbo /path/to/voice.wav\n\n')
|
||||
process.stdout.write(' Requirements: WAV file, at least 5 seconds, clean speech, no music/noise\n')
|
||||
} else {
|
||||
process.stdout.write(` voice: ${audio_prompt}\n\n`)
|
||||
await say('Hello. This is a test of voice cloning.', { audio_prompt })
|
||||
await say(
|
||||
'It was a cold evening when the letter finally arrived. '
|
||||
+ '[sigh] She had been waiting for months... not knowing whether the news would be good — or bad.',
|
||||
{ audio_prompt }
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
tts.stop()
|
||||
section('Done')
|
||||
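The script above reads its three positional arguments straight from `process.argv`. A minimal sketch of the same parsing with validation is shown below. `parse_args` is a hypothetical helper, not something the repo defines, and the error messages are illustrative.

```javascript
// Hypothetical helper (not in the repo): parse
//   node acting-demo-chatterbox.mjs [start-section] [turbo|full] [voice.wav]
function parse_args(argv) {
  // argv[0] is the node binary, argv[1] the script path.
  const [start_raw = '1', variant = 'turbo', audio_prompt] = argv.slice(2)
  const start = Number.parseInt(start_raw, 10)
  if (Number.isNaN(start)) {
    throw new Error(`start-section must be a number, got "${start_raw}"`)
  }
  if (variant !== 'turbo' && variant !== 'full') {
    throw new Error(`variant must be "turbo" or "full", got "${variant}"`)
  }
  return { start, variant, audio_prompt: audio_prompt ?? null }
}

console.log(parse_args(['node', 'acting-demo-chatterbox.mjs', '7', 'full', '/tmp/voice.wav']))
```

Failing early on a mistyped variant avoids waiting for a model load before the mistake shows up.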
163
acting-demo.mjs
Normal file
@@ -0,0 +1,163 @@
|
||||
/**
|
||||
* Demonstrates prosody and acting techniques with Kokoro TTS.
|
||||
*
|
||||
* Run: node acting-demo.mjs [start-section]
|
||||
* E.g.: node acting-demo.mjs 5
|
||||
*
|
||||
* Kokoro's expressive range comes from four levers:
|
||||
*
|
||||
* 1. Voice choice — each voice has a baked-in emotional character
|
||||
* 2. Speed — slower = deliberate/grave, faster = excited/happy
|
||||
* 3. silenceScale — controls pause length between sentences
|
||||
* 4. Punctuation & markup — espeak-ng honours some SSML and typographic cues
|
||||
*
|
||||
 * SSML tags that espeak-ng understands (not honoured by this backend, see section 5):
|
||||
* <break time="500ms"/> — explicit pause
|
||||
* <emphasis level="strong">x</emphasis> — louder/longer (model-dependent)
|
||||
* <prosody rate="slow">x</prosody> — local rate change
|
||||
* <phoneme alphabet="ipa" ph="..."/> — force pronunciation
|
||||
*
|
||||
* Typographic cues that affect prosody without SSML:
|
||||
* UPPERCASE words — some stress
|
||||
* Ellipsis ... — trailing pause / trailing off
|
||||
* Em-dash — — abrupt break
|
||||
* Comma placement — micro-pauses
|
||||
* Exclamation ! — raised energy
|
||||
*/
|
||||
|
||||
import { Tts } from './lib/tts.mjs'
|
||||
import { spawn } from 'node:child_process'
|
||||
|
||||
const tts = new Tts()
|
||||
tts.init()
|
||||
|
||||
// Calling the underlying engine (tts._tts) directly is the simplest way to control
// per-utterance GenerationConfig without reworking the Tts class.
|
||||
|
||||
async function say(text, { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = {}) {
|
||||
const { createRequire } = await import('node:module')
|
||||
const require = createRequire(import.meta.url)
|
||||
const sherpa = require('sherpa-onnx-node')
|
||||
|
||||
const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
|
||||
const sid = VOICES[voice] ?? 0
|
||||
|
||||
process.stdout.write(` voice=${voice} speed=${speed} silence=${silence_scale}\n "${text}"\n\n`)
|
||||
|
||||
const audio = tts._tts.generate({
|
||||
text,
|
||||
generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
|
||||
})
|
||||
|
||||
await play(audio.samples, audio.sampleRate)
|
||||
}
|
||||
|
||||
// Synthesize multiple phrase/option pairs and play them as one continuous audio
|
||||
async function say_concat(parts) {
|
||||
const { createRequire } = await import('node:module')
|
||||
const require = createRequire(import.meta.url)
|
||||
const sherpa = require('sherpa-onnx-node')
|
||||
const VOICES = { af_heart:0, af_bella:1, af_nicole:2, af_sarah:3, af_sky:4, am_adam:5, am_michael:6 }
|
||||
|
||||
let total_len = 0
|
||||
const chunks = []
|
||||
|
||||
for (const [text, opts = {}] of parts) {
|
||||
const { voice = 'af_heart', speed = 1.0, silence_scale = 1.0 } = opts
|
||||
const sid = VOICES[voice] ?? 0
|
||||
const audio = tts._tts.generate({
|
||||
text,
|
||||
generationConfig: new sherpa.GenerationConfig({ sid, speed, silenceScale: silence_scale }),
|
||||
})
|
||||
process.stdout.write(` [${speed}x] "${text}"\n`)
|
||||
chunks.push(audio.samples)
|
||||
total_len += audio.samples.length
|
||||
}
|
||||
|
||||
const combined = new Float32Array(total_len)
|
||||
let offset = 0
|
||||
for (const chunk of chunks) {
|
||||
combined.set(chunk, offset)
|
||||
offset += chunk.length
|
||||
}
|
||||
|
||||
await play(combined, tts._tts.sampleRate)
|
||||
process.stdout.write('\n')
|
||||
}
|
||||
|
||||
function play(samples, sample_rate) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn('pacat', [
|
||||
'--format=float32le', `--rate=${sample_rate}`, '--channels=1',
|
||||
], { stdio: ['pipe', 'ignore', 'inherit'] })
|
||||
proc.on('error', reject)
|
||||
proc.on('close', resolve)
|
||||
proc.stdin.write(Buffer.from(samples.buffer))
|
||||
proc.stdin.end()
|
||||
})
|
||||
}
|
||||
|
||||
const START = parseInt(process.argv[2] ?? '1', 10)
|
||||
|
||||
function section(title) {
|
||||
process.stdout.write(`\n${'─'.repeat(50)}\n${title}\n${'─'.repeat(50)}\n`)
|
||||
}
|
||||
|
||||
// ─── 1. Baseline ───────────────────────────────────────
|
||||
section('1. Baseline — neutral af_heart')
|
||||
if (START <= 1) {
|
||||
await say('The package has arrived. Please sign here.')
|
||||
}
|
||||
|
||||
// ─── 2. Speed as emotion ───────────────────────────────
|
||||
section('2. Speed — slow (grave) vs fast (excited)')
|
||||
if (START <= 2) {
|
||||
await say('I have some terrible news. The project has been cancelled.', { speed: 0.8 })
|
||||
await say("Oh my goodness, I can't believe you're actually here! This is amazing!", { speed: 1.25 })
|
||||
}
|
||||
|
||||
// ─── 3. Voice character ────────────────────────────────
|
||||
section('3. Voice character — same line, different voices')
|
||||
if (START <= 3) {
|
||||
const line = "Well, that's certainly one way to look at it."
|
||||
for (const voice of ['af_heart', 'af_sky', 'am_adam', 'am_michael']) { // every name must exist in VOICES; unknown names silently fall back to af_heart
|
||||
await say(line, { voice })
|
||||
}
|
||||
}
|
||||
|
||||
// ─── 4. Punctuation-driven prosody ─────────────────────
|
||||
section('4. Punctuation and typographic cues')
|
||||
if (START <= 4) {
|
||||
await say('I told you. I told you this would happen. But did anyone listen? No.')
|
||||
await say('It was... quiet. Too quiet.')
|
||||
await say('He said he was fine — but his eyes told a different story.')
|
||||
await say('This is ABSOLUTELY unacceptable.')
|
||||
}
|
||||
|
||||
// ─── 5. Per-phrase speed — manual emphasis ─────────────
|
||||
// SSML tags are not supported by this backend; emphasis is achieved by
|
||||
// synthesizing key phrases at a different speed and concatenating the audio.
|
||||
section('5. Per-phrase speed (manual emphasis)')
|
||||
if (START <= 5) {
|
||||
// "stop" spoken slower and then the rest faster — sounds like emphasis
|
||||
await say_concat([
|
||||
['You need to ', { speed: 1.0 }],
|
||||
['stop.', { speed: 0.7 }],
|
||||
['Right now.', { speed: 1.1 }],
|
||||
])
|
||||
await say_concat([
|
||||
['Listen very carefully.', { speed: 0.75 }],
|
||||
['I will only say this once.', { speed: 0.9 }],
|
||||
])
|
||||
}
|
||||
|
||||
// ─── 6. Combining levers ───────────────────────────────
|
||||
section('6. Combined — a tense scene')
|
||||
if (START <= 6) {
|
||||
await say('Something is wrong.', { speed: 0.85, silence_scale: 1.5 })
|
||||
await say('I can feel it.', { speed: 0.8, silence_scale: 2.0 })
|
||||
await say('Run!', { speed: 1.3, voice: 'af_sky' }) // no <break> tag: SSML is not supported by this backend (see section 5)
|
||||
}
|
||||
|
||||
section('Done')
|
||||
process.stdout.write('Tip: adjust speed/silence_scale and swap voices to build a character.\n')
|
||||
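Since SSML breaks are not available with this backend, an explicit pause can be built the same way `say_concat` builds emphasis: by assembling the sample buffer by hand. The sketch below is an assumption-labelled variant, not code from the repo. `concat_with_gap` is a hypothetical name, and it relies only on the fact that a freshly allocated `Float32Array` is zero-filled, which plays as silence.

```javascript
// Hypothetical helper (not in the repo): join audio chunks with a fixed
// silent gap between them, e.g. to emulate <break time="600ms"/>.
function concat_with_gap(chunks, sample_rate, gap_ms = 0) {
  const gap_len = Math.round((sample_rate * gap_ms) / 1000)
  const gap_count = Math.max(chunks.length - 1, 0)
  const total_len =
    chunks.reduce((sum, chunk) => sum + chunk.length, 0) + gap_len * gap_count

  const combined = new Float32Array(total_len) // zero-filled, i.e. silence
  let offset = 0
  chunks.forEach((chunk, index) => {
    combined.set(chunk, offset)
    offset += chunk.length
    if (index < chunks.length - 1) offset += gap_len // skip over the silent gap
  })
  return combined
}
```

The result could be handed to `play(combined, sample_rate)` in the same way `say_concat` does, assuming all chunks share one sample rate.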
135
bark-server.py
Executable file
@@ -0,0 +1,135 @@
|
||||
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
|
||||
"""
|
||||
Bark TTS server — keeps model loaded, reads JSON lines from stdin.
|
||||
|
||||
Protocol:
|
||||
stdin: one JSON line per request: {"text": "...", "voice": "v2/en_speaker_6"}
|
||||
stdout: "ready\n" once the model has loaded, then "ok\n" after each utterance has finished playing
|
||||
stderr: status/timing messages
|
||||
|
||||
Usage:
|
||||
python bark-server.py [model] [voice]
|
||||
python bark-server.py suno/bark-small v2/en_speaker_3
|
||||
|
||||
Voices (English):
|
||||
v2/en_speaker_0 .. v2/en_speaker_9
|
||||
Speaker 6 = neutral/warm, 9 = expressive, 3 = deep male
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import subprocess
|
||||
import numpy as np
|
||||
|
||||
MODEL = sys.argv[1] if len(sys.argv) > 1 else 'suno/bark'
|
||||
DEF_VOICE = sys.argv[2] if len(sys.argv) > 2 else 'v2/en_speaker_6'
|
||||
SAMPLE_RATE = 24000
|
||||
|
||||
TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
|
||||
try:
|
||||
with open(TOKEN_FILE) as f:
|
||||
os.environ['HF_TOKEN'] = f.read().strip()
|
||||
except FileNotFoundError:
|
||||
pass
|
||||
|
||||
# Disable background safetensors conversion attempts — the HF API endpoint
|
||||
# it calls has changed and now errors. The pytorch_model.bin files work fine.
|
||||
os.environ['SAFETENSORS_FAST_GPU'] = '0'
|
||||
os.environ.setdefault('TRANSFORMERS_NO_ADVISORY_WARNINGS', '1')
|
||||
|
||||
def log(msg):
|
||||
print(f'[bark] {msg}', file=sys.stderr, flush=True)
|
||||
|
||||
log(f'loading {MODEL}...')
|
||||
t0 = time.time()
|
||||
|
||||
import torch
|
||||
from transformers import AutoProcessor, BarkModel
|
||||
|
||||
device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
||||
dtype = torch.float16 if device == 'cuda' else torch.float32
|
||||
|
||||
if device == 'cuda':
|
||||
# TF32 is faster on Ampere and newer GPUs (30xx, 40xx) for both matmul and convolutions
|
||||
torch.backends.cuda.matmul.allow_tf32 = True
|
||||
torch.backends.cudnn.allow_tf32 = True
|
||||
|
||||
processor = AutoProcessor.from_pretrained(MODEL)
|
||||
model = BarkModel.from_pretrained(MODEL, torch_dtype=dtype, use_safetensors=False).to(device)
|
||||
|
||||
# Bark's internal generation calls set both max_length and max_new_tokens simultaneously
|
||||
# which causes a conflict warning and may cause early stopping. Clearing max_length
|
||||
# lets max_new_tokens take sole control.
|
||||
model.generation_config.max_length = None
|
||||
|
||||
# torch.compile gives a meaningful speedup on 3090 (adds ~30s first-run compile)
|
||||
try:
|
||||
model = torch.compile(model, mode='reduce-overhead')
|
||||
log('torch.compile applied')
|
||||
except Exception as e:
|
||||
log(f'torch.compile skipped: {e}')
|
||||
|
||||
log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
|
||||
print('ready', flush=True) # signal to Node that we are ready
|
||||
|
||||
|
||||
def speak(text, voice):
|
||||
t1 = time.time()
|
||||
|
||||
inputs = processor(text, voice_preset=voice, return_tensors='pt')
|
||||
|
||||
# Set attention_mask if not provided by processor — without it the model
|
||||
# cannot distinguish padding from real tokens, which causes erratic generation
|
||||
# including leading filler sounds.
|
||||
if 'attention_mask' not in inputs:
|
||||
inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
|
||||
|
||||
inputs = {k: v.to(device) for k, v in inputs.items()}
|
||||
|
||||
with torch.inference_mode():
|
||||
audio = model.generate(
|
||||
**inputs,
|
||||
semantic_temperature = 0.3,
|
||||
coarse_temperature = 0.3,
|
||||
fine_temperature = 0.3,
|
||||
)
|
||||
|
||||
samples = audio.cpu().numpy().squeeze().astype(np.float32)
|
||||
elapsed_gen = time.time() - t1
|
||||
duration = len(samples) / SAMPLE_RATE
|
||||
log(f'generated {duration:.1f}s audio in {elapsed_gen:.1f}s rtf={elapsed_gen/duration:.2f}')
|
||||
|
||||
proc = subprocess.Popen(
|
||||
['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
|
||||
stdin=subprocess.PIPE,
|
||||
)
|
||||
proc.stdin.write(samples.tobytes())
|
||||
proc.stdin.close()
|
||||
proc.wait()
|
||||
|
||||
|
||||
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        req = json.loads(line)
        # Valid JSON that is not an object (a bare string or number) is
        # treated as plain text, the same as input that fails to parse.
        if not isinstance(req, dict):
            raise ValueError('request is not a JSON object')
        text = req.get('text', '')
        voice = req.get('voice', DEF_VOICE)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        text = line
        voice = DEF_VOICE

    if not text:
        print('ok', flush=True)
        continue

    try:
        speak(text, voice)
    except Exception as e:
        log(f'error: {e}')

    # Always acknowledge, even after an error, so the client never blocks.
    print('ok', flush=True)
|
||||
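The protocol described in the docstring above is line-oriented: one JSON object per line on stdin, and single-word status lines (`ready`, `ok`) on stdout. A minimal sketch of the client-side framing is given below. `encode_request` and `make_line_reader` are hypothetical names, and this is not the repo's actual Node wrapper, only an illustration of the framing.

```javascript
// Hypothetical helpers (not in the repo) for a JSON-lines child process.

// Serialize one request as a single newline-terminated JSON line.
function encode_request(text, voice) {
  const req = voice ? { text, voice } : { text }
  return JSON.stringify(req) + '\n'
}

// Return a function to attach to the child's stdout 'data' event. It buffers
// partial chunks and calls on_line once per complete, non-empty line.
function make_line_reader(on_line) {
  let pending = ''
  return (chunk) => {
    pending += chunk.toString()
    const lines = pending.split('\n')
    pending = lines.pop() // keep the trailing partial line for the next chunk
    for (const line of lines) {
      const trimmed = line.trim()
      if (trimmed) on_line(trimmed)
    }
  }
}
```

Buffering matters because a pipe may deliver `ready` or `ok` split across chunks, or several status lines in one chunk.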
200
chatterbox-server.py
Executable file
@@ -0,0 +1,200 @@
|
||||
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
|
||||
"""
|
||||
Chatterbox TTS server — keeps model loaded, reads JSON lines from stdin.
|
||||
|
||||
Protocol:
|
||||
stdin: {"text": "...", "temperature": 0.8, "top_p": 0.95}
|
||||
stdout: "ready\n" once the model has loaded, then "ok\n" after each utterance is generated (playback may still be in progress)
|
||||
stderr: status/timing messages
|
||||
|
||||
Usage:
|
||||
./chatterbox-server.py
|
||||
./chatterbox-server.py turbo # default
|
||||
./chatterbox-server.py full # original model, supports exaggeration
|
||||
|
||||
Paralinguistic tags supported in text:
|
||||
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
|
||||
|
||||
Full model only:
|
||||
exaggeration 0.0-1.0 emotion intensity (ignored in turbo)
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import queue
|
||||
import threading
|
||||
import subprocess
|
||||
import numpy as np
|
||||
|
||||
TOKEN_FILE = os.path.expanduser('~/.secrets/hugging-face.token')
|
||||
try:
|
||||
with open(TOKEN_FILE) as f:
|
||||
os.environ['HF_TOKEN'] = f.read().strip()
|
||||
except FileNotFoundError:
|
||||
pass
|
||||
|
||||
def find_hf_cache(repo_id):
|
||||
"""Return the local snapshot path if the model is already cached, else None."""
|
||||
from pathlib import Path
|
||||
cache_dir = Path(os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))) / 'hub'
|
||||
repo_dir = cache_dir / f"models--{repo_id.replace('/', '--')}" / 'snapshots'
|
||||
if repo_dir.exists():
|
||||
snapshots = sorted(repo_dir.iterdir(), key=lambda p: p.stat().st_mtime)
|
||||
if snapshots:
|
||||
return str(snapshots[-1])
|
||||
return None
|
||||
|
||||
VARIANT = sys.argv[1] if len(sys.argv) > 1 else 'turbo'
|
||||
SAMPLE_RATE = 24000
|
||||
|
||||
def log(msg):
|
||||
print(f'[chatterbox] {msg}', file=sys.stderr, flush=True)
|
||||
|
||||
log(f'loading chatterbox-{VARIANT}...')
|
||||
t0 = time.time()
|
||||
|
||||
import tempfile
|
||||
import traceback
|
||||
import torch
|
||||
import soundfile as sf
|
||||
import librosa as _librosa
|
||||
|
||||
# librosa.resample returns float64 in newer numpy — patch it to always return float32
|
||||
_orig_resample = _librosa.resample
|
||||
def _resample_float32(*args, **kwargs):
|
||||
return _orig_resample(*args, **kwargs).astype(np.float32)
|
||||
_librosa.resample = _resample_float32
|
||||
|
||||
device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
||||
|
||||
REPO_IDS = {
|
||||
'turbo': 'ResembleAI/chatterbox-turbo',
|
||||
'full': 'ResembleAI/chatterbox',
|
||||
}
|
||||
|
||||
if VARIANT == 'turbo':
|
||||
from chatterbox.tts_turbo import ChatterboxTurboTTS as Model
|
||||
else:
|
||||
from chatterbox.tts import ChatterboxTTS as Model
|
||||
|
||||
cached = find_hf_cache(REPO_IDS[VARIANT])
|
||||
if cached:
|
||||
log(f'loading from cache: {cached}')
|
||||
model = Model.from_local(cached, device=device)
|
||||
else:
|
||||
log('cache not found, downloading...')
|
||||
model = Model.from_pretrained(device=device)
|
||||
|
||||
log(f'ready on {device} ({time.time() - t0:.1f}s load time)')
|
||||
print('ready', flush=True)
|
||||
|
||||
|
||||
_wav_cache = {}
|
||||
|
||||
def ensure_float32_wav(path):
|
||||
"""Re-save audio as float32 mono WAV to work around librosa/numpy float64 issue.
|
||||
Result is cached by input path so repeated calls with the same file are free."""
|
||||
if path in _wav_cache:
|
||||
return _wav_cache[path]
|
||||
wav, sr = sf.read(path, dtype='float32', always_2d=True)
|
||||
wav = wav.mean(axis=1) # stereo → mono if needed
|
||||
tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
|
||||
sf.write(tmp.name, wav, sr, subtype='FLOAT')
|
||||
_wav_cache[path] = tmp.name
|
||||
return tmp.name
|
||||
|
||||
|
||||
_SENTINEL = object()
|
||||
|
||||
playback_queue = queue.Queue()
|
||||
|
||||
|
||||
def playback_worker():
|
||||
"""Plays audio samples in order. Runs in its own thread."""
|
||||
while True:
|
||||
item = playback_queue.get()
|
||||
if item is _SENTINEL:
|
||||
break
|
||||
samples = item
|
||||
proc = subprocess.Popen(
|
||||
['pacat', '--format=float32le', f'--rate={SAMPLE_RATE}', '--channels=1'],
|
||||
stdin=subprocess.PIPE,
|
||||
)
|
||||
proc.stdin.write(samples.tobytes())
|
||||
proc.stdin.close()
|
||||
proc.wait()
|
||||
playback_queue.task_done()
|
||||
|
||||
|
||||
playback_thread = threading.Thread(target=playback_worker, daemon=True)
|
||||
playback_thread.start()
|
||||
|
||||
|
||||
def generate(text, opts):
|
||||
t1 = time.time()
|
||||
|
||||
if VARIANT == 'turbo':
|
||||
kwargs = {
|
||||
'temperature': opts.get('temperature', 0.8),
|
||||
'top_p': opts.get('top_p', 0.95),
|
||||
'top_k': opts.get('top_k', 1000),
|
||||
'repetition_penalty': opts.get('repetition_penalty', 1.2),
|
||||
'min_p': opts.get('min_p', 0.0),
|
||||
}
|
||||
else:
|
||||
kwargs = {
|
||||
'temperature': opts.get('temperature', 0.8),
|
||||
'top_p': opts.get('top_p', 1.0),
|
||||
'repetition_penalty': opts.get('repetition_penalty', 1.2),
|
||||
'min_p': opts.get('min_p', 0.05),
|
||||
'exaggeration': opts.get('exaggeration', 0.5),
|
||||
'cfg_weight': opts.get('cfg_weight', 0.5),
|
||||
}
|
||||
|
||||
audio_prompt = opts.get('audio_prompt')
|
||||
if audio_prompt:
|
||||
kwargs['audio_prompt_path'] = ensure_float32_wav(audio_prompt)
|
||||
|
||||
with torch.inference_mode():
|
||||
wav = model.generate(text, **kwargs)
|
||||
|
||||
samples = wav.squeeze(0).cpu().numpy().astype(np.float32)
|
||||
elapsed = time.time() - t1
|
||||
duration = len(samples) / SAMPLE_RATE
|
||||
log(f'generated {duration:.1f}s audio in {elapsed:.1f}s rtf={elapsed/duration:.2f}')
|
||||
|
||||
return samples
|
||||
|
||||
|
||||
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        req = json.loads(line)
        # Valid JSON that is not an object (a bare string or number) is
        # treated as plain text, the same as input that fails to parse.
        if not isinstance(req, dict):
            raise ValueError('request is not a JSON object')
        text = req.pop('text', '')
        opts = req  # remaining fields are generation options
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        text = line
        opts = {}

    if not text:
        print('ok', flush=True)
        continue

    try:
        samples = generate(text, opts)
        playback_queue.put(samples)
    except Exception as e:
        log(f'error: {e}')
        traceback.print_exc(file=sys.stderr)

    # Always acknowledge, even after an error, so the client never blocks.
    print('ok', flush=True)
|
||||
|
||||
# Drain playback before exit
|
||||
playback_queue.put(_SENTINEL)
|
||||
playback_thread.join()
|
||||
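Both servers log an `rtf` value, the real-time factor: generation time divided by the duration of the audio produced, so values below 1 mean faster than real time. A small sketch of that arithmetic follows. `real_time_factor` is a hypothetical helper, and returning `null` for empty audio is an assumption made here to avoid dividing by zero.

```javascript
// Hypothetical helper (not in the repo): real-time factor of a synthesis run.
//   rtf = elapsed_seconds / (sample_count / sample_rate)
function real_time_factor(sample_count, sample_rate, elapsed_seconds) {
  const duration_seconds = sample_count / sample_rate
  if (duration_seconds <= 0) return null // no audio produced, rtf undefined
  return elapsed_seconds / duration_seconds
}

// 48000 samples at 24 kHz is 2 s of audio; generating it in 1 s gives 0.5.
console.log(real_time_factor(48000, 24000, 1.0))
```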
85
demo-bark.mjs
Normal file
@@ -0,0 +1,85 @@
|
||||
/**
|
||||
* Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Bark TTS
|
||||
*
|
||||
* Bark supports emphasis via UPPERCASE, paralinguistic tokens ([laughs], [sighs],
|
||||
* [clears throat]), and hesitation (...). Markdown is preprocessed automatically:
|
||||
* **bold** → UPPERCASE, headers get a pause, etc.
|
||||
*
|
||||
* Environment variables:
|
||||
* WHISPER_MODEL default: base.en
|
||||
* options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
|
||||
* BARK_MODEL default: suno/bark (use suno/bark-small for faster/smaller)
|
||||
* BARK_VOICE default: v2/en_speaker_6
|
||||
* options: v2/en_speaker_0 .. v2/en_speaker_9
|
||||
* STT_PROVIDER default: cuda
|
||||
* USE_LLM set to 0 to disable (default: 1)
|
||||
* OLLAMA_MODEL default: phi3:mini
|
||||
*/
|
||||
|
||||
import { Stt } from './lib/stt.mjs'
|
||||
import { Bark_Tts } from './lib/bark-tts.mjs'
|
||||
import { llm_available, list_models, cleanup } from './lib/llm.mjs'
|
||||
|
||||
const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
|
||||
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
|
||||
const USE_LLM = process.env.USE_LLM !== '0'
|
||||
|
||||
process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
|
||||
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
|
||||
stt.init()
|
||||
process.stderr.write('[stt] ready\n')
|
||||
|
||||
process.stderr.write('[tts] loading Bark (this takes a moment)...\n')
|
||||
const tts = new Bark_Tts()
|
||||
await tts.init()
|
||||
process.stderr.write('[tts] Bark ready\n')
|
||||
|
||||
let has_llm = false
|
||||
if (USE_LLM) {
|
||||
has_llm = await llm_available()
|
||||
if (has_llm) {
|
||||
const models = await list_models()
|
||||
process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
|
||||
} else {
|
||||
process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
|
||||
}
|
||||
}
|
||||
|
||||
process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} bark=${process.env.BARK_MODEL || 'suno/bark'} llm=${has_llm}\n`)
|
||||
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
|
||||
|
||||
let speaking = false
|
||||
|
||||
const stop = stt.listen(async (raw_text) => {
|
||||
if (speaking) {
|
||||
process.stderr.write(`[skipped] ${raw_text}\n`)
|
||||
return
|
||||
}
|
||||
speaking = true
|
||||
|
||||
try {
|
||||
process.stdout.write(`[raw] ${raw_text}\n`)
|
||||
|
||||
let text = raw_text
|
||||
if (has_llm) {
|
||||
text = await cleanup(raw_text)
|
||||
if (text !== raw_text) {
|
||||
process.stdout.write(`[llm] ${text}\n`)
|
||||
}
|
||||
}
|
||||
|
||||
await tts.speak_streaming(text)
|
||||
process.stdout.write('\n')
|
||||
} catch (err) {
|
||||
process.stderr.write(`[error] ${err.message}\n`)
|
||||
} finally {
|
||||
speaking = false
|
||||
}
|
||||
})
|
||||
|
||||
process.on('SIGINT', () => {
|
||||
process.stderr.write('\n[voice] stopping\n')
|
||||
stop()
|
||||
tts.stop()
|
||||
process.exit(0)
|
||||
})
|
||||
85
demo-kokoro.mjs
Normal file
@@ -0,0 +1,85 @@
|
||||
/**
|
||||
* Voice demo: microphone → Whisper STT → (optional LLM cleanup) → Kokoro TTS
|
||||
*
|
||||
* Environment variables:
|
||||
* WHISPER_MODEL default: base.en
|
||||
* options: tiny.en base.en small.en medium.en large-v3 large-v3-turbo
|
||||
* VOICE default: af_heart
|
||||
* options: af_heart af_bella af_nicole af_sarah af_sky am_adam am_michael
|
||||
* STT_PROVIDER default: cuda
|
||||
* USE_LLM set to 0 to disable (default: 1)
|
||||
* OLLAMA_MODEL default: phi3:mini
|
||||
*/
|
||||
|
||||
import { Stt } from './lib/stt.mjs'
|
||||
import { Tts, VOICES } from './lib/tts.mjs'
|
||||
import { llm_available, list_models, cleanup } from './lib/llm.mjs'
|
||||
|
||||
const WHISPER_MODEL = process.env.WHISPER_MODEL || 'base.en'
|
||||
const VOICE = process.env.VOICE || 'af_heart'
|
||||
const STT_PROVIDER = process.env.STT_PROVIDER || 'cuda'
|
||||
const USE_LLM = process.env.USE_LLM !== '0'
|
||||
|
||||
if (!(VOICE in VOICES)) {
|
||||
process.stderr.write(`[voice] unknown voice "${VOICE}". Available: ${Object.keys(VOICES).join(', ')}\n`)
|
||||
process.exit(1)
|
||||
}
|
||||
|
||||
process.stderr.write(`[stt] loading whisper-${WHISPER_MODEL} on ${STT_PROVIDER}...\n`)
|
||||
const stt = new Stt({ whisper_name: WHISPER_MODEL, provider: STT_PROVIDER })
|
||||
stt.init()
|
||||
process.stderr.write('[stt] ready\n')
|
||||
|
||||
process.stderr.write('[tts] loading Kokoro...\n')
|
||||
const tts = new Tts({ voice: VOICE })
|
||||
tts.init()
|
||||
process.stderr.write('[tts] ready\n')
|
||||
|
||||
let has_llm = false
|
||||
if (USE_LLM) {
|
||||
has_llm = await llm_available()
|
||||
if (has_llm) {
|
||||
const models = await list_models()
|
||||
process.stderr.write(`[llm] ollama available. models: ${models.slice(0, 6).join(', ')}\n`)
|
||||
} else {
|
||||
process.stderr.write('[llm] ollama not reachable, skipping cleanup\n')
|
||||
}
|
||||
}
|
||||
|
||||
process.stderr.write(`\n[ready] whisper=${WHISPER_MODEL} voice=${VOICE} llm=${has_llm}\n`)
|
||||
process.stderr.write('[ready] speak into your microphone. Ctrl+C to stop.\n\n')
|
||||
|
||||
let speaking = false
|
||||
|
||||
const stop = stt.listen(async (raw_text) => {
|
||||
if (speaking) {
|
||||
process.stderr.write(`[skipped] ${raw_text}\n`)
|
||||
return
|
||||
}
|
||||
speaking = true
|
||||
|
||||
try {
|
||||
process.stdout.write(`[raw] ${raw_text}\n`)
|
||||
|
||||
let text = raw_text
|
||||
if (has_llm) {
|
||||
text = await cleanup(raw_text)
|
||||
if (text !== raw_text) {
|
||||
process.stdout.write(`[llm] ${text}\n`)
|
||||
}
|
||||
}
|
||||
|
||||
await tts.speak_streaming(text)
|
||||
process.stdout.write('\n')
|
||||
} catch (err) {
|
||||
process.stderr.write(`[error] ${err.message}\n`)
|
||||
} finally {
|
||||
speaking = false
|
||||
}
|
||||
})
|
||||
|
||||
process.on('SIGINT', () => {
|
||||
process.stderr.write('\n[voice] stopping\n')
|
||||
stop()
|
||||
process.exit(0)
|
||||
})
|
||||
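Both demo scripts use a `speaking` flag so that transcripts arriving while TTS is still playing are dropped instead of queued. The same pattern can be sketched as a reusable wrapper. `make_busy_gate` is a hypothetical name and not part of the repo.

```javascript
// Hypothetical helper (not in the repo): wrap an async handler so that calls
// arriving while a previous call is still running are skipped, not queued.
function make_busy_gate(handler, on_skip = () => {}) {
  let busy = false
  return async (...args) => {
    if (busy) {
      on_skip(...args)
      return false // call was dropped
    }
    busy = true
    try {
      await handler(...args)
      return true // call ran to completion
    } finally {
      busy = false // release the gate even if the handler throws
    }
  }
}
```

Under this sketch, the callback passed to `stt.listen` would be the gated function, with `on_skip` logging the `[skipped]` line. Dropping rather than queueing also keeps the pipeline from responding to its own playback picked up by the microphone.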
24
demos/darkwing-briefing.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/usr/bin/env bash
|
||||
# Darkwing Briefing — Darkwing Duck, Gosalyn, Harold W. Smith, Marella, Rommie
|
||||
|
||||
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||
|
||||
switch darkwing-duck
|
||||
say "I am the terror that flaps in the night. I am the highly classified operative that your clearance level does not cover."
|
||||
|
||||
switch gosalyn
|
||||
say "Dad, you're at the WRONG briefing AGAIN."
|
||||
|
||||
switch harold-w-smith
|
||||
say "How did a duck get into this facility."
|
||||
|
||||
switch marella
|
||||
say "The same way everything else does, sir. The paperwork looked official."
|
||||
|
||||
switch rommie
|
||||
say "I ran a verification check. The duck is, in fact, cleared for this. I have no further explanation."
|
||||
|
||||
switch darkwing-duck
|
||||
say "I love it when a plan comes together. Let's get dangerous."
|
||||
21
demos/threat-assessment.sh
Executable file
@@ -0,0 +1,21 @@
|
||||
#!/usr/bin/env bash
|
||||
# Threat Assessment — Ivanova, Archangel, Rommie, Chuin, Santini
|
||||
|
||||
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||
|
||||
switch ivanova
|
||||
say "We have an unidentified vessel on an intercept course. Threat assessment?"
|
||||
|
||||
switch rommie
|
||||
say "Weapons hot, shields raised, and they're broadcasting on a frequency we're not supposed to know exists."
|
||||
|
||||
switch archangel
|
||||
say "Ah. That would be my people. Stand down, Commander — they're just making sure I'm still alive."
|
||||
|
||||
switch chuin
|
||||
say "Are you? I have been wondering this myself for some time."
|
||||
|
||||
switch santini
|
||||
say "I'm gonna need a bigger hangar."
|
||||
24
demos/wrong-meeting.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/usr/bin/env bash
|
||||
# Wrong Meeting — Harold W. Smith, Ivanova, Archangel, Rommie, Penny Parker
|
||||
|
||||
TTS="${TTS_URL:-http://192.168.2.99:11500}"
|
||||
switch() { curl -s -X POST "$TTS/voice" -H 'Content-Type: application/json' -d "{\"name\": \"$1\"}" > /dev/null; }
|
||||
say() { curl -s -X POST "$TTS/speak" -H 'Content-Type: application/json' -d "$(jq -n --arg t "$1" '{text:$t}')" > /dev/null; }
|
||||
|
||||
switch harold-w-smith
|
||||
say "This meeting never happened. None of you are here, and neither am I."
|
||||
|
||||
switch ivanova
|
||||
say "That's extremely reassuring. Last time someone said that to me, a space station blew up."
|
||||
|
||||
switch archangel
|
||||
say "Technically it was a controlled demolition. The paperwork was filed in triplicate."
|
||||
|
||||
switch rommie
|
||||
say "I have reviewed the paperwork. It was filed AFTER the explosion."
|
||||
|
||||
switch penny-parker
|
||||
say "You know, I walked into the wrong building again, didn't I."
|
||||
|
||||
switch harold-w-smith
|
||||
say "Yes. But you saw nothing."
|
||||
63
download-models.sh
Executable file
@@ -0,0 +1,63 @@
|
||||
#!/usr/bin/env bash
|
||||
# Download ONNX models for the voice experiment.
|
||||
#
|
||||
# Usage:
|
||||
# bash download-models.sh # default: whisper base.en
|
||||
# WHISPER_MODEL=large-v3-turbo bash download-models.sh
|
||||
#
|
||||
# After downloading, run: node demo.mjs
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
WHISPER_MODEL="${WHISPER_MODEL:-base.en}"
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
MODELS_DIR="${SCRIPT_DIR}/models"
|
||||
|
||||
mkdir -p "${MODELS_DIR}"
|
||||
cd "${MODELS_DIR}"
|
||||
|
||||
echo "==> models dir: ${MODELS_DIR}"
|
||||
|
||||
# --- Silero VAD (~2 MB) ---
|
||||
if [ ! -f silero_vad.onnx ]; then
|
||||
echo "==> downloading silero_vad.onnx"
|
||||
curl -fL --progress-bar -o silero_vad.onnx \
|
||||
'https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx'
|
||||
else
|
||||
echo "==> silero_vad.onnx already present"
|
||||
fi
|
||||
|
||||
# --- Whisper STT ---
|
||||
WHISPER_DIR="sherpa-onnx-whisper-${WHISPER_MODEL}"
|
||||
if [ ! -d "${WHISPER_DIR}" ]; then
|
||||
echo "==> downloading whisper-${WHISPER_MODEL}"
|
||||
curl -fL --progress-bar -o "${WHISPER_DIR}.tar.bz2" \
|
||||
"https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-${WHISPER_MODEL}.tar.bz2"
|
||||
echo "==> extracting..."
|
||||
tar xf "${WHISPER_DIR}.tar.bz2"
|
||||
else
|
||||
echo "==> whisper-${WHISPER_MODEL} already present"
|
||||
fi
|
||||
|
||||
# --- Kokoro TTS (~310 MB) ---
|
||||
if [ ! -d kokoro-en-v0_19 ]; then
|
||||
echo "==> downloading kokoro-en-v0_19 (English TTS)"
|
||||
curl -fL --progress-bar -o kokoro-en-v0_19.tar.bz2 \
|
||||
'https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2'
|
||||
echo "==> extracting..."
|
||||
tar xf kokoro-en-v0_19.tar.bz2
|
||||
else
|
||||
echo "==> kokoro-en-v0_19 already present"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "==> all models ready"
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo " cd ${SCRIPT_DIR}"
|
||||
echo " npm install"
|
||||
echo " node demo.mjs"
|
||||
echo ""
|
||||
echo "To use a better STT model:"
|
||||
echo " WHISPER_MODEL=large-v3-turbo bash download-models.sh"
|
||||
echo " WHISPER_MODEL=large-v3-turbo node demo.mjs"
|
||||
112
lib/bark-tts.mjs
Normal file
@@ -0,0 +1,112 @@
|
||||
/**
|
||||
* Bark TTS — Node.js wrapper around bark-server.py.
|
||||
*
|
||||
* Spawns the Python server once, keeps it alive, sends requests as JSON lines.
|
||||
* Markdown is preprocessed before sending: **bold** → UPPERCASE, etc.
|
||||
*
|
||||
* Usage:
|
||||
* const tts = new Bark_Tts()
|
||||
* await tts.init() // spawns server, waits for model load
|
||||
* await tts.speak('Hello')
|
||||
* await tts.speak('**Very** important point.') // emphasis via CAPS
|
||||
* tts.stop()
|
||||
*
|
||||
* Environment variables:
|
||||
* BARK_MODEL HuggingFace model id (default: suno/bark)
|
||||
* BARK_VOICE voice preset (default: v2/en_speaker_6)
|
||||
*
|
||||
* Bark voice presets (English):
|
||||
* v2/en_speaker_0 calm female
|
||||
* v2/en_speaker_1 calm male
|
||||
* v2/en_speaker_3 deep male
|
||||
* v2/en_speaker_6 neutral/warm (default)
|
||||
* v2/en_speaker_9 expressive
|
||||
*/
|
||||
|
||||
import { spawn } from 'node:child_process'
|
||||
import * as path from 'node:path'
|
||||
import * as readline from 'node:readline'
|
||||
import { markdown_to_bark, split_sentences } from './markdown.mjs'
|
||||
|
||||
const BARK_MODEL = process.env.BARK_MODEL || 'suno/bark'
|
||||
const BARK_VOICE = process.env.BARK_VOICE || 'v2/en_speaker_6'
|
||||
const SERVER = path.join(import.meta.dirname, '..', 'bark-server.py')
|
||||
|
||||
export class Bark_Tts {
|
||||
constructor({
|
||||
model = BARK_MODEL,
|
||||
voice = BARK_VOICE,
|
||||
} = {}) {
|
||||
this._model = model
|
||||
this._voice = voice
|
||||
this._proc = null
|
||||
this._rl = null
|
||||
this._resolve = null // resolver for the current in-flight request
|
||||
}
|
||||
|
||||
/** Spawn bark-server.py and wait until it signals "ready". */
|
||||
init() {
|
||||
return new Promise((resolve, reject) => {
|
||||
this._proc = spawn(SERVER, [this._model, this._voice], {
|
||||
stdio: ['pipe', 'pipe', 'inherit'],
|
||||
})
|
||||
|
||||
this._proc.on('error', reject)
|
||||
this._proc.on('close', (code) => {
|
||||
if (code !== 0 && code !== null) {
|
||||
process.stderr.write(`[bark] server exited with code ${code}\n`)
|
||||
}
|
||||
})
|
||||
|
||||
this._rl = readline.createInterface({ input: this._proc.stdout })
|
||||
|
||||
this._rl.on('line', (line) => {
|
||||
if (line === 'ready') {
|
||||
resolve()
|
||||
return
|
||||
}
|
||||
if (line === 'ok' && this._resolve) {
|
||||
const res = this._resolve
|
||||
this._resolve = null
|
||||
res()
|
||||
}
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
/** Preprocess markdown and speak as a single request. */
|
||||
async speak(text, { voice = this._voice, preprocess = true } = {}) {
|
||||
const clean = preprocess ? markdown_to_bark(text) : text
|
||||
return this._send(clean, voice)
|
||||
}
|
||||
|
||||
/**
|
||||
* Preprocess markdown and speak sentence by sentence.
|
||||
* Lower latency — first sentence starts playing while rest are queued.
|
||||
*/
|
||||
async speak_streaming(text, opts = {}) {
|
||||
const clean = opts.preprocess !== false ? markdown_to_bark(text) : text
|
||||
const sentences = split_sentences(clean)
|
||||
for (const s of sentences) {
|
||||
await this._send(s, opts.voice ?? this._voice)
|
||||
}
|
||||
}
|
||||
|
||||
_send(text, voice) {
|
||||
return new Promise((resolve, reject) => {
|
||||
if (!this._proc) {
|
||||
return reject(new Error('Bark_Tts not initialized — call init() first'))
|
||||
}
|
||||
this._resolve = resolve
|
||||
const payload = JSON.stringify({ text, voice }) + '\n'
|
||||
this._proc.stdin.write(payload)
|
||||
})
|
||||
}
|
||||
|
||||
stop() {
|
||||
this._rl?.close()
|
||||
this._proc?.kill()
|
||||
this._proc = null
|
||||
this._rl = null
|
||||
}
|
||||
}
|
||||
101
lib/chatterbox-tts.mjs
Normal file
@@ -0,0 +1,101 @@
|
||||
/**
|
||||
* Chatterbox TTS — Node.js wrapper around chatterbox-server.py.
|
||||
*
|
||||
* Usage:
|
||||
* const tts = new Chatterbox_Tts()
|
||||
* await tts.init()
|
||||
* await tts.speak('Hello! [chuckle] That is funny.')
|
||||
* await tts.speak('Something intense.', { exaggeration: 0.8 }) // full model only
|
||||
* tts.stop()
|
||||
*
|
||||
* Paralinguistic tags (embed in text):
|
||||
* [laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
|
||||
*
|
||||
* Generation options (passed as second arg to speak()):
|
||||
* temperature 0.05–2.0 default 0.8
|
||||
* top_p 0–1 default 0.95
|
||||
* top_k 0–1000 default 1000
|
||||
* repetition_penalty ≥1.0 default 1.2
|
||||
* min_p 0–1 default 0.0
|
||||
* audio_prompt string path to reference WAV for voice cloning
|
||||
* exaggeration 0–1 emotion intensity (full model only)
|
||||
* cfg_weight 0–1 classifier-free guidance (full model only)
|
||||
*/
|
||||
|
||||
import { spawn } from 'node:child_process'
|
||||
import * as path from 'node:path'
|
||||
import * as readline from 'node:readline'
|
||||
import { markdown_to_bark as markdown_to_speech, split_sentences } from './markdown.mjs'
|
||||
|
||||
const SERVER = path.join(import.meta.dirname, '..', 'chatterbox-server.py')
|
||||
|
||||
export class Chatterbox_Tts {
|
||||
constructor({
|
||||
variant = 'turbo', // 'turbo' or 'full'
|
||||
} = {}) {
|
||||
this._variant = variant
|
||||
this._proc = null
|
||||
this._rl = null
|
||||
this._resolve = null
|
||||
}
|
||||
|
||||
init() {
|
||||
return new Promise((resolve, reject) => {
|
||||
this._proc = spawn(SERVER, [this._variant], {
|
||||
stdio: ['pipe', 'pipe', 'inherit'],
|
||||
})
|
||||
|
||||
this._proc.on('error', reject)
|
||||
this._proc.on('close', (code) => {
|
||||
if (code !== null && code !== 0) {
|
||||
process.stderr.write(`[chatterbox] server exited with code ${code}\n`)
|
||||
}
|
||||
})
|
||||
|
||||
this._rl = readline.createInterface({ input: this._proc.stdout })
|
||||
this._rl.on('line', (line) => {
|
||||
if (line === 'ready') {
|
||||
resolve()
|
||||
return
|
||||
}
|
||||
if (line === 'ok' && this._resolve) {
|
||||
const res = this._resolve
|
||||
this._resolve = null
|
||||
res()
|
||||
}
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
async speak(text, opts = {}) {
|
||||
const clean = opts.preprocess === true ? markdown_to_speech(text) : text
|
||||
return this._send(clean, opts)
|
||||
}
|
||||
|
||||
async speak_streaming(text, opts = {}) {
|
||||
const clean = opts.preprocess !== false ? markdown_to_speech(text) : text
|
||||
const sentences = split_sentences(clean)
|
||||
for (const s of sentences) {
|
||||
await this._send(s, opts)
|
||||
}
|
||||
}
|
||||
|
||||
_send(text, opts = {}) {
|
||||
return new Promise((resolve, reject) => {
|
||||
if (!this._proc) {
|
||||
return reject(new Error('Chatterbox_Tts not initialized — call init() first'))
|
||||
}
|
||||
this._resolve = resolve
|
||||
const { preprocess: _, ...gen_opts } = opts
|
||||
const payload = JSON.stringify({ text, ...gen_opts }) + '\n'
|
||||
this._proc.stdin.write(payload)
|
||||
})
|
||||
}
|
||||
|
||||
stop() {
|
||||
this._rl?.close()
|
||||
this._proc?.kill()
|
||||
this._proc = null
|
||||
this._rl = null
|
||||
}
|
||||
}
|
||||
64
lib/llm.mjs
Normal file
@@ -0,0 +1,64 @@
|
||||
/**
|
||||
* Optional LLM cleanup via Ollama REST API.
|
||||
*
|
||||
* Cleans up raw Whisper output: fixes punctuation, capitalization,
|
||||
* removes filler words, corrects obvious recognition errors.
|
||||
*
|
||||
* Set OLLAMA_MODEL env var to choose the model (default: phi3:mini).
|
||||
* Set OLLAMA_URL env var to change the base URL (default: http://localhost:11434).
|
||||
*/
|
||||
|
||||
const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434'
|
||||
const OLLAMA_MODEL = process.env.OLLAMA_MODEL || 'phi3:mini'
|
||||
|
||||
export async function llm_available() {
|
||||
try {
|
||||
const res = await fetch(`${OLLAMA_URL}/api/tags`, {
|
||||
signal: AbortSignal.timeout(2000),
|
||||
})
|
||||
return res.ok
|
||||
} catch {
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
export async function list_models() {
|
||||
const res = await fetch(`${OLLAMA_URL}/api/tags`)
|
||||
const data = await res.json()
|
||||
return (data.models ?? []).map(m => m.name)
|
||||
}
|
||||
|
||||
/**
|
||||
* Clean up a raw STT transcription.
|
||||
* Returns the cleaned string, or the original if the request fails.
|
||||
*/
|
||||
export async function cleanup(raw_text, model = OLLAMA_MODEL) {
|
||||
const prompt = [
|
||||
'You are a transcription editor. Clean up this speech-to-text output:',
|
||||
'- Fix punctuation and capitalization',
|
||||
'- Remove meaningless filler words (um, uh, like, you know)',
|
||||
'- Correct obvious word recognition errors based on context',
|
||||
'- Keep the meaning and phrasing intact',
|
||||
'- Return ONLY the cleaned text, no explanation, no quotes',
|
||||
'',
|
||||
raw_text,
|
||||
].join('\n')
|
||||
|
||||
try {
|
||||
const res = await fetch(`${OLLAMA_URL}/api/generate`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({
|
||||
model,
|
||||
prompt,
|
||||
stream: false,
|
||||
options: { temperature: 0.1, num_predict: 256 },
|
||||
}),
|
||||
signal: AbortSignal.timeout(10_000),
|
||||
})
|
||||
const data = await res.json()
|
||||
return data.response?.trim() || raw_text
|
||||
} catch {
|
||||
return raw_text
|
||||
}
|
||||
}
|
||||
43
lib/local-query-complete.mjs
Normal file
@@ -0,0 +1,43 @@
|
||||
export async function is_query_complete(query, model = 'qwen2.5:3b') {
|
||||
const payload = {
|
||||
model,
|
||||
messages: [
|
||||
{
|
||||
role: 'system',
|
||||
content: `
|
||||
You are a voice query completeness classifier. Determine if the input is complete enough to act on.
|
||||
|
||||
Respond only with JSON: {"complete": true} or {"complete": false}
|
||||
|
||||
Default to {"complete": true}. Only respond with {"complete": false} if the input is clearly mid-sentence, trails off, or is a single ambiguous word with no clear intent.
|
||||
|
||||
Complete examples: "What time is it", "Turn off the lights", "How do I juggle", "Make a note", "Check the weather"
|
||||
Incomplete examples: "How do I", "The thing is", "Can you"
|
||||
`.replace(/^[\t ]+/gm, ''),
|
||||
},
|
||||
{ role: 'user', content: query },
|
||||
],
|
||||
stream: false,
|
||||
format: 'json',
|
||||
options: { num_predict: 10, num_ctx: 2048 },
|
||||
}
|
||||
|
||||
try {
|
||||
const response = await fetch('http://192.168.2.99:11434/api/chat', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(payload),
|
||||
})
|
||||
if (!response.ok) {
|
||||
return true
|
||||
}
|
||||
const data = await response.json()
|
||||
const parsed = JSON.parse(data.message.content)
|
||||
return parsed.complete === true
|
||||
} catch {
|
||||
// Fail open: if the request or the parsing fails, let the query through
|
||||
return true
|
||||
}
|
||||
}
|
||||
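The classifier in `lib/local-query-complete.mjs` asks the model for `{"complete": true}` or `{"complete": false}` and treats unparseable output as complete. A minimal sketch of that parsing rule in isolation, assuming the same strict `=== true` check; `parse_complete` is a hypothetical helper name and the sample strings are made up:

```javascript
// Hypothetical helper (not part of the project): turn the classifier's
// message content into a boolean using the same rule as is_query_complete.
function parse_complete(content) {
  try {
    const parsed = JSON.parse(content)
    return parsed.complete === true
  } catch {
    return true // unparseable output: fail open
  }
}

// Illustrative model outputs.
const results = [
  parse_complete('{"complete": true}'),
  parse_complete('{"complete": false}'),
  parse_complete('not json at all'),
  parse_complete('{"done": true}'),
]

console.log(results) // [ true, false, true, false ]
```

Note the last case: valid JSON without a `complete` key is treated as incomplete, so only malformed output fails open. Whether that asymmetry is intended is not stated in the source.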
66
lib/markdown.mjs
Normal file
@@ -0,0 +1,66 @@
|
||||
/**
|
||||
* Convert markdown to Bark-friendly plain text.
|
||||
*
|
||||
* Bark emphasis conventions:
|
||||
* UPPERCASE words — stress/emphasis
|
||||
* ... — hesitation / trailing off
|
||||
* [laughs] [sighs] — paralinguistic tokens (pass through as-is)
|
||||
*
|
||||
* Markdown mapping:
|
||||
* **bold** / __bold__ → UPPERCASE (strong emphasis)
|
||||
* *italic* / _italic_ → unchanged (soft emphasis, Bark handles naturally)
|
||||
* # Heading → text + pause via ".\n"
|
||||
* `code` → unchanged (spoken literally)
|
||||
* ```block``` → skipped
|
||||
* [text](url) → text only
|
||||
* --- → "..." (pause)
|
||||
* - item / 1. item → item text (bullet stripped)
|
||||
*/
|
||||
|
||||
export function markdown_to_bark(text) {
|
||||
// Remove fenced code blocks entirely
|
||||
text = text.replace(/```[\s\S]*?```/g, '')
|
||||
|
||||
// Inline code — keep the content, drop backticks
|
||||
text = text.replace(/`([^`]+)`/g, '$1')
|
||||
|
||||
// Images: drop them first, otherwise the link rule below would turn ![alt](url) into "!alt"
|
||||
text = text.replace(/!\[([^\]]*)\]\([^)]*\)/g, '')
|
||||
|
||||
// Links: keep visible text
|
||||
text = text.replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')
|
||||
|
||||
// **bold** / __bold__ → UPPERCASE
|
||||
text = text.replace(/\*\*([^*]+)\*\*/g, (_, w) => w.toUpperCase())
|
||||
text = text.replace(/__([^_]+)__/g, (_, w) => w.toUpperCase())
|
||||
|
||||
// *italic* / _italic_ — strip markers, keep text
|
||||
text = text.replace(/\*([^*]+)\*/g, '$1')
|
||||
text = text.replace(/_([^_]+)_/g, '$1')
|
||||
|
||||
// Headings — keep text, add a beat after
|
||||
text = text.replace(/^#{1,6}\s+(.+)$/gm, '$1.')
|
||||
|
||||
// Horizontal rules → pause
|
||||
text = text.replace(/^[-*_]{3,}\s*$/gm, '...')
|
||||
|
||||
// List bullets / numbers — strip prefix
|
||||
text = text.replace(/^\s*[-*+]\s+/gm, '')
|
||||
text = text.replace(/^\s*\d+\.\s+/gm, '')
|
||||
|
||||
// Collapse excess blank lines
|
||||
text = text.replace(/\n{3,}/g, '\n\n')
|
||||
|
||||
return text.trim()
|
||||
}
|
||||
|
||||
/**
|
||||
* Split text into sentences suitable for sequential TTS.
|
||||
* Keeps sentence-ending punctuation with the preceding sentence.
|
||||
*/
|
||||
export function split_sentences(text) {
|
||||
return text
|
||||
.split(/(?<=[.!?])\s+/)
|
||||
.map(s => s.trim())
|
||||
.filter(s => s.length > 0)
|
||||
}
|
||||
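The sentence splitter that closes `lib/markdown.mjs` uses a lookbehind so that the punctuation stays attached to the preceding sentence. A minimal standalone sketch of that behaviour, with the function copied from the file so the block runs on its own; the input strings are made up for illustration:

```javascript
// Copy of split_sentences from lib/markdown.mjs, reproduced so the sketch is self-contained.
function split_sentences(text) {
  return text
    .split(/(?<=[.!?])\s+/) // split on whitespace that directly follows . ! or ?
    .map(s => s.trim())
    .filter(s => s.length > 0)
}

// Illustrative inputs, not taken from the project.
const plain = split_sentences('Ready for input. What time is it? Go!')
const abbreviated = split_sentences('Dr. Smith is here.')

console.log(plain)       // [ 'Ready for input.', 'What time is it?', 'Go!' ]
console.log(abbreviated) // [ 'Dr.', 'Smith is here.' ]
```

The second call shows the trade-off of a purely punctuation-based split: abbreviations become their own fragment, so each one is sent to the TTS backend as a separate request.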
170
lib/stt.mjs
Normal file
@@ -0,0 +1,170 @@
|
||||
/**
|
||||
* Speech-to-Text using sherpa-onnx-node.
|
||||
*
|
||||
* Uses Whisper (offline/batch) as the recognizer, gated by Silero VAD so
|
||||
* transcription only runs when a complete utterance has been detected.
|
||||
* Audio is captured from the microphone via parec (PulseAudio) or arecord (ALSA).
|
||||
*
|
||||
* Model layout expected under models_dir:
|
||||
* silero_vad.onnx
|
||||
* sherpa-onnx-whisper-<name>/
|
||||
* <name>-encoder.int8.onnx
|
||||
* <name>-decoder.int8.onnx
|
||||
* <name>-tokens.txt
|
||||
*
|
||||
* Available Whisper model names (pass as whisper_name):
|
||||
* tiny.en base.en small.en medium.en large-v3 large-v3-turbo
|
||||
* Download with: WHISPER_MODEL=model-name bash download-models.sh
|
||||
*/
|
||||
|
||||
import { spawn, execSync } from 'node:child_process'
|
||||
import { createRequire } from 'node:module'
|
||||
import * as path from 'node:path'
|
||||
|
||||
const require = createRequire(import.meta.url)
|
||||
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
|
||||
|
||||
const PRE_ROLL_SAMPLES = 3200 // 200ms at 16kHz
|
||||
const HISTORY_SAMPLES = 960000 // 60s ring buffer — matches VAD internal size
|
||||
|
||||
function find_mic() {
|
||||
const candidates = [
|
||||
['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
|
||||
['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
|
||||
]
|
||||
for (const [cmd, args] of candidates) {
|
||||
try {
|
||||
execSync(`which ${cmd}`, { stdio: 'ignore' })
|
||||
return [cmd, args]
|
||||
} catch { /* try next */ }
|
||||
}
|
||||
throw new Error('No mic capture command found — need parec (PulseAudio) or arecord (ALSA)')
|
||||
}
|
||||
|
||||
function s16le_to_f32(buf) {
|
||||
const out = new Float32Array(buf.length / 2)
|
||||
for (let i = 0; i < out.length; i++) {
|
||||
out[i] = buf.readInt16LE(i * 2) / 32768.0
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
export class Stt {
|
||||
constructor({
|
||||
models_dir = DEFAULT_MODELS_DIR,
|
||||
whisper_name = 'base.en',
|
||||
provider = 'cuda',
|
||||
} = {}) {
|
||||
this._whisper_dir = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
|
||||
this._whisper_name = whisper_name
|
||||
this._vad_model = path.join(models_dir, 'silero_vad.onnx')
|
||||
this._provider = provider
|
||||
this._recognizer = null
|
||||
this._vad = null
|
||||
this._history = new Float32Array(HISTORY_SAMPLES)
|
||||
this._history_pos = 0 // total samples fed, monotonically increasing
|
||||
}
|
||||
|
||||
init() {
|
||||
const sherpa = require('sherpa-onnx-node')
|
||||
const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
|
||||
|
||||
this._recognizer = new sherpa.OfflineRecognizer({
|
||||
featConfig: { sampleRate: 16000, featureDim: 80 },
|
||||
modelConfig: {
|
||||
whisper: {
|
||||
encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
|
||||
decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
|
||||
},
|
||||
tokens: path.join(wdir, `${wname}-tokens.txt`),
|
||||
numThreads: 2,
|
||||
provider: _provider,
|
||||
debug: false,
|
||||
},
|
||||
})
|
||||
|
||||
// VAD runs on CPU — silero_vad.onnx is tiny
|
||||
this._vad = new sherpa.Vad({
|
||||
sileroVad: {
|
||||
model: _vad_model,
|
||||
threshold: 0.5,
|
||||
minSilenceDuration: 0.5, // seconds of silence to end an utterance
|
||||
minSpeechDuration: 0.1, // ignore sub-100ms blips
|
||||
windowSize: 512, // 32ms @ 16kHz
|
||||
maxSpeechDuration: 30.0,
|
||||
},
|
||||
sampleRate: 16000,
|
||||
numThreads: 1,
|
||||
provider: 'cpu',
|
||||
debug: false,
|
||||
}, 60) // 60-second internal ring buffer
|
||||
}
|
||||
|
||||
_transcribe(samples) {
|
||||
const stream = this._recognizer.createStream()
|
||||
stream.acceptWaveform({ samples, sampleRate: 16000 })
|
||||
this._recognizer.decode(stream)
|
||||
return this._recognizer.getResult(stream).text.trim()
|
||||
}
|
||||
|
||||
_with_preroll(seg) {
|
||||
const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
|
||||
const pre_len = seg.start - pre_start
|
||||
if (pre_len === 0) return seg.samples
|
||||
const out = new Float32Array(pre_len + seg.samples.length)
|
||||
for (let i = 0; i < pre_len; i++) {
|
||||
out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
|
||||
}
|
||||
out.set(seg.samples, pre_len)
|
||||
return out
|
||||
}
|
||||
|
||||
/**
|
||||
* Start listening. Calls on_text(text) for each detected utterance.
|
||||
* Returns a stop() function.
|
||||
*/
|
||||
listen(on_text, { on_audio } = {}) {
|
||||
const [cmd, args] = find_mic()
|
||||
process.stderr.write(`[stt] mic: ${cmd} ${args.join(' ')}\n`)
|
||||
const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })
|
||||
|
||||
const VAD_WIN = 512 // must match windowSize above
|
||||
let pending = Buffer.alloc(0)
|
||||
|
||||
mic.stdout.on('data', (chunk) => {
|
||||
if (on_audio) on_audio()
|
||||
pending = Buffer.concat([pending, chunk])
|
||||
|
||||
// Feed complete VAD windows
|
||||
while (pending.length >= VAD_WIN * 2) {
|
||||
const win = pending.subarray(0, VAD_WIN * 2)
|
||||
pending = pending.subarray(VAD_WIN * 2)
|
||||
const f32 = s16le_to_f32(win)
|
||||
// Write to history ring buffer for pre-roll
|
||||
const base = this._history_pos % HISTORY_SAMPLES
|
||||
for (let i = 0; i < f32.length; i++) {
|
||||
this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
|
||||
}
|
||||
this._history_pos += f32.length
|
||||
this._vad.acceptWaveform(f32)
|
||||
}
|
||||
|
||||
// Drain any complete speech segments
|
||||
while (!this._vad.isEmpty()) {
|
||||
const seg = this._vad.front()
|
||||
this._vad.pop()
|
||||
const text = this._transcribe(this._with_preroll(seg))
|
||||
if (text) on_text(text)
|
||||
}
|
||||
})
|
||||
|
||||
mic.on('error', err => process.stderr.write(`[stt] mic error: ${err.message}\n`))
|
||||
mic.on('close', code => {
|
||||
if (code !== null && code !== 0) {
|
||||
process.stderr.write(`[stt] mic exited with code ${code}\n`)
|
||||
}
|
||||
})
|
||||
|
||||
return () => mic.kill()
|
||||
}
|
||||
}
|
||||
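The pre-roll logic in `lib/stt.mjs` keeps recent audio in a ring buffer indexed by an ever-increasing sample counter, then prepends a short slice from just before each VAD segment. A minimal sketch of that indexing, assuming deliberately tiny sizes so the wrap-around is visible; `HISTORY`, `PRE_ROLL`, `write_history` and `with_preroll` are illustrative names, and the sample values are made up:

```javascript
// Tiny sizes for illustration only; lib/stt.mjs uses 960000 and 3200 samples.
const HISTORY = 8
const PRE_ROLL = 3

const history = new Float32Array(HISTORY)
let history_pos = 0 // total samples written so far, never wraps

// Append samples to the ring buffer, overwriting the oldest entries.
function write_history(samples) {
  for (let i = 0; i < samples.length; i++) {
    history[(history_pos + i) % HISTORY] = samples[i]
  }
  history_pos += samples.length
}

// Prepend up to PRE_ROLL samples that came just before absolute position seg_start.
function with_preroll(seg_start, seg_samples) {
  const pre_start = Math.max(0, seg_start - PRE_ROLL)
  const pre_len = seg_start - pre_start
  const out = new Float32Array(pre_len + seg_samples.length)
  for (let i = 0; i < pre_len; i++) {
    out[i] = history[(pre_start + i) % HISTORY]
  }
  out.set(seg_samples, pre_len)
  return out
}

// Feed samples 1..10; absolute positions 8 and 9 wrap around to slots 0 and 1.
write_history(Float32Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

// A segment that starts at absolute position 9 (the sample with value 10).
const padded = with_preroll(9, Float32Array.from([10]))
console.log(Array.from(padded)) // [ 7, 8, 9, 10 ]
```

The slice is only meaningful while the segment start lies within the last `HISTORY` samples written. With a 60-second buffer and segments capped at 30 seconds this presumably holds in the real code, but the sketch does not check it.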
51
lib/tts-client.mjs
Normal file
@@ -0,0 +1,51 @@
|
||||
/**
|
||||
* TTS HTTP client — speaks text via a running tts-server.mjs instance.
|
||||
*
|
||||
* Usage:
|
||||
* const tts = new Tts_Client()
|
||||
* await tts.speak('Hello world.')
|
||||
* await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
|
||||
* const { voices, current } = await tts.list_voices()
|
||||
* await tts.set_voice('rommie')
|
||||
*
|
||||
* The server URL defaults to TTS_URL env var, then http://192.168.2.99:11500.
|
||||
*/
|
||||
|
||||
const DEFAULT_URL = process.env.TTS_URL ?? 'http://192.168.2.99:11500'
|
||||
|
||||
export class Tts_Client {
|
||||
constructor({ url = DEFAULT_URL } = {}) {
|
||||
this._url = url
|
||||
}
|
||||
|
||||
async speak(text, opts = {}) {
|
||||
const res = await fetch(`${this._url}/speak`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ text, ...opts }),
|
||||
})
|
||||
if (!res.ok) {
|
||||
const data = await res.json().catch(() => ({}))
|
||||
throw new Error(data.error ?? `HTTP ${res.status}`)
|
||||
}
|
||||
}
|
||||
|
||||
async list_voices() {
|
||||
const res = await fetch(`${this._url}/voices`)
|
||||
if (!res.ok) throw new Error(`HTTP ${res.status}`)
|
||||
return res.json() // { voices: [{name, description, active}], current }
|
||||
}
|
||||
|
||||
async set_voice(name) {
|
||||
const res = await fetch(`${this._url}/voice`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ name }),
|
||||
})
|
||||
if (!res.ok) {
|
||||
const data = await res.json().catch(() => ({}))
|
||||
throw new Error(data.error ?? `HTTP ${res.status}`)
|
||||
}
|
||||
return res.json() // { ok, name, path }
|
||||
}
|
||||
}
|
||||
107
lib/tts.mjs
Normal file
@@ -0,0 +1,107 @@
|
||||
/**
|
||||
* Text-to-Speech using sherpa-onnx-node with the Kokoro model.
|
||||
*
|
||||
* Model layout expected under models_dir:
|
||||
* kokoro-en-v0_19/
|
||||
* model.onnx
|
||||
* voices.bin
|
||||
* tokens.txt
|
||||
* lexicon-us-en.txt
|
||||
* espeak-ng-data/
|
||||
*
|
||||
* Audio output goes to pacat (PulseAudio) as float32le @ 24kHz.
|
||||
* speak_streaming() splits on sentence boundaries so the first sentence
|
||||
* plays while the rest are still being synthesized.
|
||||
*/
|
||||
|
||||
import { spawn } from 'node:child_process'
|
||||
import { createRequire } from 'node:module'
|
||||
import * as path from 'node:path'
|
||||
|
||||
const require = createRequire(import.meta.url)
|
||||
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
|
||||
|
||||
// Speaker IDs in kokoro-en-v0_19
|
||||
export const VOICES = {
|
||||
af_heart: 0,
|
||||
af_bella: 1,
|
||||
af_nicole: 2,
|
||||
af_sarah: 3,
|
||||
af_sky: 4,
|
||||
am_adam: 5,
|
||||
am_michael: 6,
|
||||
}
|
||||
|
||||
export class Tts {
|
||||
constructor({
|
||||
models_dir = DEFAULT_MODELS_DIR,
|
||||
voice = 'af_heart',
|
||||
speed = 1.0,
|
||||
provider = 'cpu', // Kokoro synthesis is fast enough on CPU
|
||||
} = {}) {
|
||||
this._kokoro_dir = path.join(models_dir, 'kokoro-en-v0_19')
|
||||
this._voice = voice
|
||||
this._speed = speed
|
||||
this._provider = provider
|
||||
this._tts = null
|
||||
}
|
||||
|
||||
init() {
|
||||
const sherpa = require('sherpa-onnx-node')
|
||||
const { _kokoro_dir: dir, _provider } = this
|
||||
|
||||
this._tts = new sherpa.OfflineTts({
|
||||
model: {
|
||||
kokoro: {
|
||||
model: path.join(dir, 'model.onnx'),
|
||||
voices: path.join(dir, 'voices.bin'),
|
||||
tokens: path.join(dir, 'tokens.txt'),
|
||||
dataDir: path.join(dir, 'espeak-ng-data'),
|
||||
},
|
||||
},
|
||||
maxNumSentences: 2,
|
||||
numThreads: 2,
|
||||
debug: false,
|
||||
provider: _provider,
|
||||
})
|
||||
}
|
||||
|
||||
/** Synthesize and play a single piece of text. Resolves when playback ends. */
|
||||
async speak(text) {
|
||||
const sid = VOICES[this._voice] ?? 0
|
||||
const audio = this._tts.generate({ text, sid, speed: this._speed })
|
||||
await this._play(audio.samples, audio.sampleRate)
|
||||
}
|
||||
|
||||
/**
|
||||
* Split text on sentence boundaries and synthesize + play each sentence
|
||||
* in sequence. Lower latency than waiting for full synthesis first.
|
||||
*/
|
||||
async speak_streaming(text) {
|
||||
for (const sentence of split_sentences(text)) {
|
||||
if (sentence.trim()) {
|
||||
await this.speak(sentence.trim())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
_play(samples, sample_rate) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn('pacat', [
|
||||
'--format=float32le',
|
||||
`--rate=${sample_rate}`,
|
||||
'--channels=1',
|
||||
], { stdio: ['pipe', 'ignore', 'inherit'] })
|
||||
|
||||
proc.on('error', reject)
|
||||
proc.on('close', resolve)
|
||||
proc.stdin.write(Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength))
|
||||
proc.stdin.end()
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
function split_sentences(text) {
|
||||
// Split after . ! ? — keep the punctuation with the preceding sentence
|
||||
return text.split(/(?<=[.!?])\s+/)
|
||||
}
|
||||
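`_play` in `lib/tts.mjs` hands the synthesized `Float32Array` to `pacat` as raw bytes by wrapping the array's underlying `ArrayBuffer`. A minimal sketch of how that wrapping behaves when the typed array is a view into a larger buffer. Whether sherpa-onnx-node ever returns such a view is not established here, and the arrays below are made up:

```javascript
// A typed-array view that covers only part of its backing ArrayBuffer.
const backing = new Float32Array([0, 0.25, 0.5, 0.75])
const view = backing.subarray(1, 3) // [0.25, 0.5], shares memory with backing

// Wrapping only .buffer exposes the whole backing store (4 floats = 16 bytes).
const whole_buffer = Buffer.from(view.buffer)

// Passing byteOffset and byteLength restricts the Buffer to the view (2 floats = 8 bytes).
const exact_bytes = Buffer.from(view.buffer, view.byteOffset, view.byteLength)

console.log(whole_buffer.length) // 16
console.log(exact_bytes.length)  // 8

// Reading back as little-endian floats (assumes a little-endian host, as on x86 and ARM Linux).
console.log(exact_bytes.readFloatLE(0), exact_bytes.readFloatLE(4)) // 0.25 0.5
```

When the typed array owns its whole buffer, both forms produce identical bytes, so the three-argument form is simply the safer default.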
22
listen.mjs
Normal file
@@ -0,0 +1,22 @@
|
||||
/**
|
||||
* Listen with Whisper and print transcribed text to stdout.
|
||||
*
|
||||
* Usage:
|
||||
* node listen.mjs
|
||||
* node listen.mjs small.en
|
||||
* node listen.mjs large-v3-turbo
|
||||
*/
|
||||
|
||||
import { Stt } from './lib/stt.mjs'
|
||||
|
||||
const whisper_name = process.argv[2] ?? 'base.en'
|
||||
|
||||
const stt = new Stt({ whisper_name })
|
||||
stt.init()
|
||||
|
||||
process.stderr.write(`[listen] whisper model: ${whisper_name}\n`)
|
||||
process.stderr.write(`[listen] listening — Ctrl+C to stop\n`)
|
||||
|
||||
stt.listen((text) => {
|
||||
process.stdout.write(text + '\n')
|
||||
})
|
||||
13
package.json
Normal file
@@ -0,0 +1,13 @@
|
||||
{
|
||||
"name": "voice-experiment",
|
||||
"version": "0.1.0",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"download": "bash download-models.sh",
|
||||
"setup-venv": "bash setup-venv.sh"
|
||||
},
|
||||
"dependencies": {
|
||||
"sherpa-onnx-node": "^1.12.14",
|
||||
"js-yaml": "^4.1.0"
|
||||
}
|
||||
}
|
||||
123
query-demo.mjs
Normal file
@@ -0,0 +1,123 @@
|
||||
/**
|
||||
* Voice query demo: accumulates STT utterances into a query, checks
|
||||
* completeness with a local LLM classifier, then dispatches to Claude Code.
|
||||
*
|
||||
* Usage:
|
||||
* node query-demo.mjs [--audio-prompt voice.wav] [--whisper model] [--silence-timeout ms] [--window-id 0x...] [--claude-remote /path/to/claude-remote.mjs]
|
||||
*
|
||||
* If --window-id and --claude-remote are supplied, completed queries are
|
||||
* dispatched to Claude Code. Otherwise they are only logged to stderr.
|
||||
*
|
||||
* A query is considered complete when:
|
||||
* - The classifier says so
|
||||
* - The last word is a send-word (go/done/send)
|
||||
* - No new fragment arrives within SILENCE_TIMEOUT ms
|
||||
*/
|
||||
|
||||
import { execFileSync } from 'node:child_process'
|
||||
import { Stt } from './lib/stt.mjs'
|
||||
import { Tts_Client } from './lib/tts-client.mjs'
|
||||
import { is_query_complete } from './lib/local-query-complete.mjs'
|
||||
|
||||
function get_arg(name) {
|
||||
const i = process.argv.indexOf(name)
|
||||
return i !== -1 ? process.argv[i + 1] : null
|
||||
}
|
||||
|
||||
const audio_prompt = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/rommie-sample.wav'
|
||||
const whisper_name = get_arg('--whisper') ?? 'base.en'
|
||||
const window_id = get_arg('--window-id')
|
||||
const claude_remote = get_arg('--claude-remote')
|
||||
|
||||
const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000', 10)
|
||||
|
||||
const tts = new Tts_Client()
|
||||
const stt = new Stt({ whisper_name })
|
||||
|
||||
stt.init()
|
||||
process.stderr.write(`[query-demo] whisper model: ${whisper_name}\n`)
|
||||
process.stderr.write(`[query-demo] voice: ${audio_prompt}\n`)
|
||||
if (window_id && claude_remote) {
|
||||
process.stderr.write(`[query-demo] dispatch: window ${window_id} via ${claude_remote}\n`)
|
||||
} else {
|
||||
process.stderr.write(`[query-demo] no --window-id/--claude-remote — logging only\n`)
|
||||
}
|
||||
await tts.speak('Ready for input.', { audio_prompt })
|
||||
|
||||
function make_prompt(query) {
|
||||
return [
|
||||
'[voice-buddy] You have received a voice query.',
|
||||
'',
|
||||
'This is a voice interface. When you are done, use the `speak` shell command to inform the user of the outcome — keep it brief and spoken-word natural.',
|
||||
'If the task is a question, speak the answer directly.',
|
||||
'If the task is an action, carry it out and then speak a short confirmation.',
|
||||
'',
|
||||
'--- Query ---',
|
||||
query,
|
||||
'--- End of query ---',
|
||||
].join('\n')
|
||||
}
|
||||
|
||||
function dispatch(query) {
|
||||
execFileSync('node', [claude_remote, '--window-id', window_id], {
|
||||
input: make_prompt(query),
|
||||
encoding: 'utf8',
|
||||
stdio: ['pipe', 'inherit', 'inherit'],
|
||||
})
|
||||
}
|
||||
|
||||
async function submit(query) {
|
||||
query = query.trim()
|
||||
if (!query) {
|
||||
return
|
||||
}
|
||||
process.stderr.write(`[query-demo] query: ${JSON.stringify(query)}\n`)
|
||||
if (window_id && claude_remote) {
|
||||
dispatch(query)
|
||||
}
|
||||
await tts.speak('Efforting.', { audio_prompt })
|
||||
}
|
||||
|
||||
const SEND_WORDS = new Set(['go', 'done', 'send'])
|
||||
|
||||
let accumulated = ''
|
||||
let silence_timer = null
|
||||
|
||||
function reset_silence_timer() {
|
||||
clearTimeout(silence_timer)
|
||||
silence_timer = setTimeout(async () => {
|
||||
if (!accumulated) {
|
||||
return
|
||||
}
|
||||
process.stderr.write('[query-demo] silence timeout — submitting\n')
|
||||
const query = accumulated
|
||||
accumulated = ''
|
||||
await submit(query)
|
||||
}, SILENCE_TIMEOUT)
|
||||
}
|
||||
|
||||
stt.listen(async (text) => {
|
||||
accumulated = accumulated ? `${accumulated}\n${text}` : text
|
||||
process.stderr.write(`[query-demo] fragment: ${JSON.stringify(accumulated)}\n`)
|
||||
|
||||
reset_silence_timer()
|
||||
|
||||
const last_line = accumulated.split('\n').at(-1)
|
||||
const last_norm = last_line.toLowerCase().replace(/[^a-z]/g, '')
|
||||
const forced = SEND_WORDS.has(last_norm)
|
||||
|
||||
if (!forced) {
|
||||
const complete = await is_query_complete(last_line)
|
||||
if (!complete) {
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
clearTimeout(silence_timer)
|
||||
|
||||
const lines = accumulated.split('\n')
|
||||
const query = (forced ? lines.slice(0, -1) : lines).join('\n')
|
||||
accumulated = ''
|
||||
|
||||
await submit(query)
|
||||
}, { on_audio: () => { if (accumulated) reset_silence_timer() } })
|
||||
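The send-word rule in query-demo.mjs can be looked at in isolation. The following is a minimal sketch, not part of the commit: `check_send_word` is a hypothetical helper name, and it only mirrors the normalization and slicing that the `stt.listen` callback performs on the accumulated text.

```javascript
// Hypothetical helper (not in the repo): mirrors the send-word check from query-demo.mjs.
const SEND_WORDS = new Set(['go', 'done', 'send'])

function check_send_word(accumulated) {
  const lines = accumulated.split('\n')
  // Strip everything except a-z so "Go." and "GO!" both normalize to "go".
  const last_norm = lines.at(-1).toLowerCase().replace(/[^a-z]/g, '')
  const forced = SEND_WORDS.has(last_norm)
  // When forced, the send-word fragment is dropped from the query.
  const query = (forced ? lines.slice(0, -1) : lines).join('\n')
  return { forced, query }
}

console.log(check_send_word('open the log file\nGo.'))
// { forced: true, query: 'open the log file' }
console.log(check_send_word('where did that go'))
// { forced: false, query: 'where did that go' }
```

Because the whole last fragment is normalized and compared, a sentence that merely ends in "go" does not force a submission; only a fragment consisting of the send-word alone does.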
11
requirements.txt
Normal file
@@ -0,0 +1,11 @@
# STT
faster-whisper>=1.1.0

# TTS (already used in kokoro-tts-test)
kokoro>=0.9.4

# Audio capture
sounddevice>=0.4.6

# Numpy (usually already present)
numpy>=1.24.0
24
setup-venv.sh
Executable file
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Creates a venv in ./venv and installs Python deps for Bark and Chatterbox.
# Safe to re-run: venv creation is skipped if ./venv already exists.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${SCRIPT_DIR}/venv"

if [ ! -d "${VENV}" ]; then
  echo "==> creating venv at ${VENV}"
  python3 -m venv "${VENV}"
fi

echo "==> installing Python dependencies"
"${VENV}/bin/pip" install --upgrade pip --quiet
"${VENV}/bin/pip" install transformers accelerate torch numpy
"${VENV}/bin/pip" install chatterbox-tts

chmod +x "${SCRIPT_DIR}/bark-server.py"
chmod +x "${SCRIPT_DIR}/chatterbox-server.py"

echo ""
echo "==> venv ready at ${VENV}"
47
speak-as.mjs
Normal file
@@ -0,0 +1,47 @@
/**
 * Read text from stdin and speak it using a voice reference clip.
 *
 * Usage:
 *   echo "Hello world" | node speak-as.mjs /path/to/voice.wav
 *   cat script.txt | node speak-as.mjs /path/to/voice.wav
 *   node speak-as.mjs /path/to/voice.wav   (then type, Ctrl+D when done)
 */

import * as fs from 'node:fs'
import * as readline from 'node:readline'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'

const audio_prompt = process.argv[2]

if (!audio_prompt) {
  process.stderr.write('Usage: node speak-as.mjs /path/to/voice.wav\n')
  process.exit(1)
}

if (!fs.existsSync(audio_prompt)) {
  process.stderr.write(`File not found: ${audio_prompt}\n`)
  process.exit(1)
}

const tts = new Chatterbox_Tts()
await tts.init()

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: process.stdin.isTTY,
})

// Normalize case, punctuation and spacing so near-identical lines compare equal.
const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()
let last_norm = ''

for await (const line of rl) {
  const text = line.trim()
  if (!text) continue
  // Skip a line that repeats the previously spoken one after normalization.
  const n = norm(text)
  if (n === last_norm) continue
  last_norm = n
  await tts.speak(text, { audio_prompt })
}

tts.stop()
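The repeat-suppression loop in speak-as.mjs can be exercised without a TTS backend. This is a small sketch under that assumption: `drop_consecutive_repeats` is a hypothetical name, and it returns the lines that would be spoken instead of calling `tts.speak`.

```javascript
// Same normalization as speak-as.mjs: lowercase, strip punctuation, collapse spaces.
const norm = s => s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim()

// Hypothetical helper (not in the repo): returns the lines that would be spoken.
function drop_consecutive_repeats(lines) {
  const kept = []
  let last_norm = ''
  for (const line of lines) {
    const text = line.trim()
    if (!text) continue
    const n = norm(text)
    if (n === last_norm) continue // same as the previous spoken line
    last_norm = n
    kept.push(text)
  }
  return kept
}

console.log(drop_consecutive_repeats([
  'Hello, world!',
  'hello   world',
  '',
  'Goodbye.',
  'Hello, world!',
]))
// [ 'Hello, world!', 'Goodbye.', 'Hello, world!' ]
```

Only consecutive repeats are dropped, so the final 'Hello, world!' is kept. One side effect worth knowing: since `last_norm` starts as an empty string, a first line made up only of punctuation normalizes to an empty string and is skipped as well.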
133
tts-server.mjs
Normal file
@@ -0,0 +1,133 @@
/**
 * TTS HTTP server — wraps chatterbox-server.py and exposes a simple HTTP API.
 * Requests are serialized so generation and playback stay in order.
 *
 * Usage:
 *   node tts-server.mjs
 *   TTS_PORT=11500 node tts-server.mjs
 *
 * API:
 *   POST /speak   { "text": "...", "audio_prompt": "/path/to/voice.wav", ... }
 *     → 200 { "ok": true }  (after generation, playback continues in background)
 *   GET  /voices  → 200 { "voices": [{ "name": "rommie", "description": "...", "active": true }, ...], "current": "rommie" | null }
 *   POST /voice   { "name": "rommie" }
 *     → 200 { "ok": true, "name": "rommie", "path": "..." }
 *   GET  /health  → 200 { "ok": true }
 *
 * Errors are returned as { "error": "..." } with status 400, 404 or 500.
 */

import * as http from 'node:http'
import * as fs from 'node:fs'
import * as path from 'node:path'
import yaml from 'js-yaml'
import { Chatterbox_Tts } from './lib/chatterbox-tts.mjs'

const PORT = parseInt(process.env.TTS_PORT ?? '11500', 10)
// import.meta.dirname requires Node.js 20.11 or newer.
const VOICES_FILE = path.join(import.meta.dirname, 'voices.yaml')

// Reads voices.yaml; returns an empty map if the file is missing or invalid.
function reload_voices() {
  try {
    const doc = yaml.load(fs.readFileSync(VOICES_FILE, 'utf8'))
    return doc?.voices ?? {}
  } catch {
    return {}
  }
}

let voices = reload_voices()
let current_voice = null  // name of active voice, or null

// --- TTS setup ---
const tts = new Chatterbox_Tts()
process.stderr.write('[tts-server] starting chatterbox...\n')
await tts.init()
process.stderr.write('[tts-server] chatterbox ready\n')

// Serialize all speak requests through a promise chain
let queue = Promise.resolve()

function enqueue(fn) {
  const result = queue.then(fn)
  // Don't let a failed request poison the queue
  queue = result.catch(() => {})
  return result
}

// Collects the request body and parses it as JSON; rejects on invalid JSON.
function read_body(req) {
  return new Promise((resolve, reject) => {
    let buf = ''
    req.on('data', chunk => { buf += chunk })
    req.on('end', () => {
      try { resolve(JSON.parse(buf)) } catch (e) { reject(e) }
    })
    req.on('error', reject)
  })
}

// Sends a JSON response with the given status code.
function send(res, status, body) {
  const payload = JSON.stringify(body)
  res.writeHead(status, { 'Content-Type': 'application/json' })
  res.end(payload)
}

const server = http.createServer(async (req, res) => {
  if (req.method === 'GET' && req.url === '/health') {
    return send(res, 200, { ok: true })
  }

  if (req.method === 'GET' && req.url === '/voices') {
    // Re-read voices.yaml on every request so edits show up without a restart.
    voices = reload_voices()
    const list = Object.entries(voices).map(([name, v]) => ({
      name,
      description: v.description ?? '',
      active: name === current_voice,
    }))
    return send(res, 200, { voices: list, current: current_voice })
  }

  if (req.method === 'POST' && req.url === '/voice') {
    let body
    try { body = await read_body(req) } catch {
      return send(res, 400, { error: 'invalid JSON' })
    }
    const { name } = body
    if (!name) return send(res, 400, { error: 'name required' })
    voices = reload_voices()
    if (!voices[name]) return send(res, 404, { error: `unknown voice: ${name}` })
    current_voice = name
    process.stderr.write(`[tts-server] voice switched to: ${name}\n`)
    return send(res, 200, { ok: true, name, path: voices[name].path })
  }

  if (req.method === 'POST' && req.url === '/speak') {
    let body
    try {
      body = await read_body(req)
    } catch {
      return send(res, 400, { error: 'invalid JSON' })
    }

    const { text, ...opts } = body
    if (!text) {
      return send(res, 400, { error: 'text required' })
    }

    // Inject current voice as default audio_prompt if none provided
    if (!opts.audio_prompt && current_voice && voices[current_voice]) {
      opts.audio_prompt = voices[current_voice].path
    }

    try {
      await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
      send(res, 200, { ok: true })
    } catch (err) {
      send(res, 500, { error: err.message })
    }
    return
  }

  send(res, 404, { error: 'not found' })
})

server.listen(PORT, () => {
  process.stderr.write(`[tts-server] listening on port ${PORT}\n`)
})
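The promise-chain queue in tts-server.mjs is easy to try on its own. The sketch below copies `enqueue` as written in the server and feeds it stand-in jobs; `sleep`, `log`, `jobs` and `done` are illustration-only names, and the timers stand in for speech generation.

```javascript
// Same queue as in tts-server.mjs.
let queue = Promise.resolve()

function enqueue(fn) {
  const result = queue.then(fn)
  // A failed job must not block the jobs queued behind it.
  queue = result.catch(() => {})
  return result
}

// Illustration-only stand-ins for TTS jobs.
const log = []
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

const jobs = [
  enqueue(async () => { await sleep(30); log.push('first') }),
  enqueue(async () => { throw new Error('boom') })
    .catch(err => { log.push(`failed: ${err.message}`) }),
  enqueue(async () => { await sleep(5); log.push('third') }),
]

const done = Promise.all(jobs).then(() => {
  console.log(log) // [ 'first', 'failed: boom', 'third' ]
  return log
})
```

The third job is shorter than the first yet still finishes last, because each job starts only after the previous one has settled. The caller of the failing job still sees the rejection, which is what lets the server answer that request with a 500 while later requests proceed.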
65
voice-buddy.mjs
Normal file
@@ -0,0 +1,65 @@
/**
 * Voice buddy — reads voice queries from stdin (JSON lines) and dispatches
 * them to a Claude Code instance via claude-remote.
 *
 * Usage:
 *   node query-demo.mjs | node voice-buddy.mjs --window-id 0x1234abc --claude-remote /workspace/claude-remote/claude-remote.mjs
 *
 * Options:
 *   --window-id <id>          X11 window ID of Claude Code (decimal or 0x hex)
 *   --claude-remote <path>    Path to claude-remote.mjs
 */

import * as readline from 'node:readline'
import { execFileSync } from 'node:child_process'

const args = process.argv.slice(2)

// Returns the value that follows a required flag; exits with status 1 if it is missing.
function get_arg(name) {
  const i = args.indexOf(name)
  if (i === -1 || i + 1 >= args.length) {
    process.stderr.write(`Missing argument: ${name}\n`)
    process.exit(1)
  }
  return args[i + 1]
}

const window_id = get_arg('--window-id')
const claude_remote = get_arg('--claude-remote')

// Wraps the spoken query in instructions for the receiving Claude Code instance.
function make_prompt(query) {
  return [
    '[voice-buddy] You have received a voice query.',
    '',
    'Respond via the `speak` shell command. Keep your response brief and spoken-word natural — this is a voice interface.',
    '',
    '--- Query ---',
    query,
    '--- End of query ---',
  ].join('\n')
}

// Pipes the prompt to claude-remote on stdin; blocks until the child process exits.
function dispatch(query) {
  const prompt = make_prompt(query)
  process.stderr.write(`[voice-buddy] dispatching: ${JSON.stringify(query)}\n`)
  execFileSync('node', [claude_remote, '--window-id', window_id], {
    input: prompt,
    encoding: 'utf8',
    stdio: ['pipe', 'inherit', 'inherit'],
  })
}

const rl = readline.createInterface({ input: process.stdin })

for await (const line of rl) {
  const trimmed = line.trim()
  if (!trimmed) {
    continue
  }
  // Parse and dispatch separately so a dispatch failure is not reported as bad JSON.
  let query
  try {
    query = JSON.parse(trimmed)
  } catch {
    process.stderr.write(`[voice-buddy] bad JSON line: ${trimmed}\n`)
    continue
  }
  try {
    dispatch(query)
  } catch (err) {
    process.stderr.write(`[voice-buddy] dispatch failed: ${err.message}\n`)
  }
}
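voice-buddy.mjs expects one JSON value per input line. A short sketch of that framing follows, assuming each query is sent as a JSON string; `parse_json_lines` is a hypothetical helper that collects results instead of dispatching them.

```javascript
// Hypothetical helper (not in the repo): splits input into parsed queries and bad lines.
function parse_json_lines(input) {
  const queries = []
  const bad = []
  for (const line of input.split('\n')) {
    const trimmed = line.trim()
    if (!trimmed) continue // blank lines are ignored, as in voice-buddy.mjs
    try {
      queries.push(JSON.parse(trimmed))
    } catch {
      bad.push(trimmed)
    }
  }
  return { queries, bad }
}

// JSON.stringify escapes the embedded newline, so a multi-line query
// still occupies exactly one line of the stream.
const input = [
  JSON.stringify('what time is it'),
  '',
  JSON.stringify('first line\nsecond line'),
  'not json',
].join('\n')

console.log(parse_json_lines(input))
// { queries: [ 'what time is it', 'first line\nsecond line' ], bad: [ 'not json' ] }
```

This is the reason JSON lines suit the accumulated queries from the STT side: fragments are joined with newlines there, and the JSON encoding keeps each query on a single line of the pipe.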
54
voices.yaml
Normal file
@@ -0,0 +1,54 @@
# Named voices for TTS server.
# Switch via: POST /voice {"name": "rommie"}
# List via:   GET /voices

voices:
  rommie:
    path: /home/devilholk/Documents/rommie-sample.wav
    description: Rommie - Andromeda ship AI / android

  archangel:
    path: /home/devilholk/Documents/archangel.ogg
    description: Archangel from Airwolf

  marella:
    path: /home/devilholk/Documents/murella.ogg
    description: Marella from Airwolf

  chuin:
    path: /home/devilholk/Documents/chuin.ogg
    description: Chiun from Remo Williams - The Adventure Begins

  harold-w-smith:
    path: /home/devilholk/Documents/cure-leader.ogg
    description: Harold W. Smith from Remo Williams - The Adventure Begins

  penny-parker:
    path: /home/devilholk/Documents/penny.ogg
    description: Penny from MacGyver

  santini:
    path: /home/devilholk/Documents/santini.ogg
    description: Santini from Airwolf

  gosalyn:
    path: /home/devilholk/Documents/gasalin.ogg
    description: Gosalyn from Darkwing Duck

  darkwing-duck:
    path: /home/devilholk/Documents/dwd.ogg
    description: Darkwing Duck

  belitski:
    path: /home/devilholk/Documents/slipstream2.ogg
    description: Belitski from Slipstream

  ivanova:
    path: /home/devilholk/Documents/ivanova.ogg
    description: Susan Ivanova from Babylon 5

  sheila:
    path: /home/devilholk/Documents/darkness-woman.ogg
    description: Sheila from Army of Darkness

  # Add more voices as clips become available