# Voice Experiment Notes
Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally, no internet after initial model download.
---
## Architecture
```
parec (PulseAudio)
  └─ lib/stt.mjs                          Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so    (needs CUDA build, see below)
                 └─ libonnxruntime.so     (system, with CUDA)

text → lib/chatterbox-tts.mjs
  └─ chatterbox-server.py                 long-running Python TTS server
       └─ chatterbox-tts (pip)            ResembleAI Chatterbox model
```
---
## STT
**Library:** `sherpa-onnx-node` npm package
**Models:** Whisper (offline batch) + Silero VAD
**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`
### Model download
```bash
bash download-models.sh base.en # or: tiny.en small.en medium.en large-v3 large-v3-turbo
```
Models land in `models/`:
- `silero_vad.onnx`
- `sherpa-onnx-whisper-base.en/`
### How it works
`lib/stt.mjs` feeds 512-sample (32 ms) windows into Silero VAD. When the VAD signals end of speech (after 500 ms of silence), the segment is sent to Whisper for transcription. **Note:** the transcription reaches stdout because the native addon writes directly to fd 1, bypassing Node.js `process.stdout`; it is not printed by the JavaScript callback. The JS callback is still invoked and can be used for additional processing.
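The window arithmetic above can be sketched in a few lines. This is a minimal illustration, not the code in `lib/stt.mjs`: the names `s16leToFloat32`, `windows`, and `silenceWindows` are hypothetical, and it assumes the `parec` format listed earlier (16 kHz, mono, signed 16-bit little-endian).

```javascript
const SAMPLE_RATE = 16000;   // Hz, matches the parec --rate flag
const WINDOW_SAMPLES = 512;  // samples per VAD window
const SILENCE_MS = 500;      // trailing silence that ends a segment

// Convert raw s16le bytes (Buffer or Uint8Array) into floats in [-1, 1).
function s16leToFloat32(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(Math.floor(bytes.byteLength / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return out;
}

// Yield only full windows; a trailing partial window is skipped and
// would need to be carried over by the caller.
function* windows(samples, size = WINDOW_SAMPLES) {
  for (let i = 0; i + size <= samples.length; i += size) {
    yield samples.subarray(i, i + size);
  }
}

// 512 samples at 16 kHz is 32 ms per window.
const windowMs = (WINDOW_SAMPLES * 1000) / SAMPLE_RATE;
// 500 ms of silence spans 15.625 windows, so 16 whole windows.
const silenceWindows = Math.ceil(SILENCE_MS / windowMs);
```

In other words, roughly 16 consecutive non-speech windows pass before a segment is closed, which is where the half-second latency after the end of an utterance comes from.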
### Getting CUDA
The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:
```bash
# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx
export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib
cd ~/Foreign/sherpa-onnx
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .
cmake --build build --target sherpa-onnx-c-api
# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so
rm -f $NMOD/libonnxruntime.so
```
The system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used at runtime. The sherpa-onnx C API is stable across versions so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.
---
## TTS
**Library:** `chatterbox-tts` Python package (ResembleAI)
**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
**Playback:** `pacat --format=float32le --rate=24000 --channels=1`
### Setup
```bash
bash setup-venv.sh # creates ./venv, installs chatterbox-tts + deps
```
HuggingFace token (for model download) goes in `~/.secrets/hugging-face.token`.
### Variants
| Variant | RTF (RTX 3090) | Notes |
|---------|---------------|-------|
| turbo | ~0.2 | default, female voice, no exaggeration param |
| full | slower | male voice, supports `exaggeration` (0.0–1.0) |
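For reference, real-time factor (RTF) is generation time divided by the duration of the audio produced, so values below 1 mean generation is faster than playback. A small worked example using the ~0.2 figure from the table; the helper name and the 5-second utterance are illustrative only:

```javascript
// RTF = seconds spent generating / seconds of audio produced.
function generationSeconds(audioSeconds, rtf) {
  return audioSeconds * rtf;
}

const rtf = 0.2;                                    // turbo on an RTX 3090, per the table
const audioSeconds = 5;                             // an example 5-second utterance
const gen = generationSeconds(audioSeconds, rtf);   // about 1 second of compute
const headroom = audioSeconds - gen;                // about 4 seconds to spare
```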
### Paralinguistic tags
Work in both variants:
```
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
```
### Voice cloning
Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.
```javascript
await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
```
The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.
### Server protocol
`chatterbox-server.py` is a long-running process. Communication:
- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
- **stderr:** status/timing logs
Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
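A minimal sketch of a client for this line protocol follows. It is not the code in `lib/chatterbox-tts.mjs`; the class name `LineProtocolClient` is hypothetical, and it assumes the server answers requests in the order they were sent. The `child` argument is anything shaped like a spawned process: a writable `stdin` and a `stdout` that emits `data` events.

```javascript
class LineProtocolClient {
  constructor(child) {
    this.child = child;
    this.pending = [];  // resolvers waiting for an "ok" line, in request order
    this.buffer = '';   // partial stdout data not yet terminated by "\n"
    child.stdout.on('data', (chunk) => this.onData(String(chunk)));
  }

  // Accumulate stdout data and resolve one pending request per "ok" line.
  onData(text) {
    this.buffer += text;
    let nl;
    while ((nl = this.buffer.indexOf('\n')) !== -1) {
      const line = this.buffer.slice(0, nl).trim();
      this.buffer = this.buffer.slice(nl + 1);
      if (line === 'ok' && this.pending.length > 0) {
        this.pending.shift()();
      }
    }
  }

  // Send one JSON request line. The promise resolves when the utterance
  // has been generated; playback may still be in progress at that point.
  speak(text, options = {}) {
    return new Promise((resolve) => {
      this.pending.push(resolve);
      this.child.stdin.write(JSON.stringify({ text, ...options }) + '\n');
    });
  }
}
```

With Node's `child_process.spawn`, the child would be the server process started with piped stdin and stdout. Because `ok` signals generation rather than playback, a caller can queue the next sentence as soon as the promise resolves and rely on the server's pipelining.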
### Acting tips
- Emphasis: `UPPERCASE` words
- Pause/rhythm: `...` `—`
- Acronym workaround: `un-workable` not `UNWORKABLE` (avoids letter-spelling)
- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)
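One way to keep these parameters inside their documented ranges is to clamp them before sending a request. The helper below is a hypothetical sketch, not part of the project; it assumes the ranges listed above (exaggeration 0.0 to 1.0, temperature 0.3 to 1.4) and simply omits `exaggeration` for the turbo variant, which does not support it.

```javascript
// Ranges taken from the tips above; the helper itself is hypothetical.
const RANGES = {
  exaggeration: [0.0, 1.0],
  temperature: [0.3, 1.4],
};

function clamp(value, [lo, hi]) {
  return Math.min(hi, Math.max(lo, value));
}

// Returns a copy of `options` with known numeric fields clamped.
// `exaggeration` is removed when the variant is 'turbo'.
function normalizeOptions(options, variant = 'turbo') {
  const out = { ...options };
  for (const [key, range] of Object.entries(RANGES)) {
    if (typeof out[key] === 'number') out[key] = clamp(out[key], range);
  }
  if (variant === 'turbo') delete out.exaggeration;
  return out;
}
```

The result can be passed as the options object to `tts.speak()`; other fields such as `audio_prompt` pass through unchanged.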
---
## Tools
| Script | Usage |
|--------|-------|
| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |
### Pipe example
```bash
node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
```
`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.
---
## Known Quirks
**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
```python
_orig = _librosa.resample
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)
```
**HuggingFace offline:** `from_pretrained()` always hits the internet. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.
**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
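The deduplication step can be sketched as below. This is an illustration of the idea rather than the code in `speak-as.mjs`; the function names and the exact normalization rules (lowercasing, stripping anything that is not a letter, digit, or whitespace) are assumptions.

```javascript
// Lowercase, strip punctuation, collapse runs of whitespace.
function normalizeLine(line) {
  return line
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Returns a stateful filter: true if the line should be spoken,
// false if it is empty or repeats the previous line after normalization.
function makeDeduper() {
  let last = null;
  return (line) => {
    const key = normalizeLine(line);
    if (key === '' || key === last) return false;
    last = key;
    return true;
  };
}
```

Only consecutive repeats are dropped, so a phrase said again later is still spoken. Segments that are merely similar rather than identical after normalization would pass through a filter like this one.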
**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
---
## Session reference
The Claude Code session used to develop this project is named **voice-buddy** — a self-referential session where the voice assistant was used to direct its own development.