# Voice Experiment Notes

Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally; no internet access is needed after the initial model download.

---

## Architecture

```
parec (PulseAudio)
  └─ lib/stt.mjs                         Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so   (needs CUDA build; see below)
                 └─ libonnxruntime.so    (system, with CUDA)

text → lib/chatterbox-tts.mjs
  └─ chatterbox-server.py                long-running Python TTS server
       └─ chatterbox-tts (pip)           ResembleAI Chatterbox model
```

---

## STT

**Library:** `sherpa-onnx-node` npm package
**Models:** Whisper (offline batch) + Silero VAD
**Audio capture:** `parec --format=s16le --rate=16000 --channels=1 --latency-msec=50`

### Model download

```bash
bash download-models.sh base.en   # or: tiny.en small.en medium.en large-v3 large-v3-turbo
```

Models land in `models/`:
- `silero_vad.onnx`
- `sherpa-onnx-whisper-base.en/`

### How it works

`lib/stt.mjs` feeds 512-sample windows (32 ms at 16 kHz) into Silero VAD. When the VAD signals the end of speech (after 500 ms of silence), the segment is sent to Whisper for transcription. The transcription is written to stdout. **Note:** the native addon writes it directly to fd 1, bypassing Node.js `process.stdout`, rather than going through the JavaScript callback. The JS callback is also invoked and can be used for additional processing.

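The framing arithmetic above can be sketched as follows. This is a Python illustration only, not the code in `lib/stt.mjs` (which is JavaScript); the names `frame_pcm`, `WINDOW_SAMPLES`, and `SILENCE_WINDOWS` are hypothetical. It assumes raw `s16le` mono input at 16 kHz, as produced by the `parec` command, and that silence is counted in whole windows, which may not match how Silero VAD tracks it internally.

```python
import array
import math
import sys

SAMPLE_RATE = 16_000      # Hz, matches parec --rate=16000
WINDOW_SAMPLES = 512      # samples per VAD window
BYTES_PER_SAMPLE = 2      # s16le = signed 16-bit little-endian
MIN_SILENCE_MS = 500      # silence needed before a segment is closed

WINDOW_MS = 1000 * WINDOW_SAMPLES / SAMPLE_RATE   # duration of one window
# Whole windows needed to cover the silence threshold (assumption: whole windows).
SILENCE_WINDOWS = math.ceil(MIN_SILENCE_MS / WINDOW_MS)


def frame_pcm(buffer: bytes):
    """Split s16le mono bytes into complete windows of floats in [-1, 1).

    Returns (windows, remainder); the remainder holds trailing bytes that
    do not yet fill a window and should be prepended to the next read.
    """
    window_bytes = WINDOW_SAMPLES * BYTES_PER_SAMPLE
    usable = len(buffer) - len(buffer) % window_bytes
    windows = []
    for start in range(0, usable, window_bytes):
        samples = array.array("h")
        samples.frombytes(buffer[start:start + window_bytes])
        if sys.byteorder == "big":
            samples.byteswap()  # the input is little-endian
        windows.append([s / 32768.0 for s in samples])
    return windows, buffer[usable:]
```

Each window is 1024 bytes on the wire, so a reader loop would typically keep the remainder and prepend it to the next chunk read from `parec`.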
### Getting CUDA

The npm package bundles a CPU-only `libsherpa-onnx-c-api.so` (`SHERPA_ONNX_ENABLE_GPU=OFF`). To use CUDA, rebuild it from source:

```bash
# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx

export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib

cd ~/Foreign/sherpa-onnx
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .

cmake --build build --target sherpa-onnx-c-api

# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so "$NMOD/libsherpa-onnx-c-api.so"
rm -f "$NMOD/libonnxruntime.so"
```

At runtime, the system `/usr/lib/libonnxruntime.so` (from `onnxruntime-opt-cuda`) is used. The sherpa-onnx C API is stable across versions, so the pre-built `sherpa-onnx.node` (1.13.2) works with the freshly built library.

---

## TTS

**Library:** `chatterbox-tts` Python package (ResembleAI)
**Models:** `ResembleAI/chatterbox-turbo` (default) or `ResembleAI/chatterbox`
**Playback:** `pacat --format=float32le --rate=24000 --channels=1`

### Setup

```bash
bash setup-venv.sh   # creates ./venv, installs chatterbox-tts + deps
```

The HuggingFace token (needed for the model download) goes in `~/.secrets/hugging-face.token`.

### Variants

| Variant | RTF (RTX 3090) | Notes |
|---------|---------------|-------|
| turbo | ~0.2 | default, female voice, no exaggeration param |
| full | slower | male voice, supports `exaggeration` (0.0–1.0) |

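RTF (real-time factor) is generation time divided by the duration of the generated audio, so values below 1 are faster than real time. A minimal sketch of what the table's approximate figure implies; the helper names are hypothetical and the numbers are only as accurate as the `~0.2` estimate:

```python
def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Time needed to synthesize a clip, from RTF = generation time / audio duration."""
    return audio_seconds * rtf


def headroom_seconds(audio_seconds: float, rtf: float) -> float:
    """Playback time left over once generation is done; positive means faster than real time."""
    return audio_seconds - generation_seconds(audio_seconds, rtf)
```

For a 10-second utterance at RTF 0.2, `generation_seconds(10, 0.2)` comes to roughly 2 seconds, leaving about 8 seconds of headroom.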
### Paralinguistic tags

These work in both variants:
```
[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]
```

### Voice cloning

Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range: a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat, monotone sample of the same length.
```javascript
await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })
```

The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue (see Known Quirks).

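Before handing a clip to the server, it can help to check that it meets the minimum length. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; `clip_info`, `is_long_enough`, and `MIN_SECONDS` are hypothetical names and are not part of the project's code.

```python
import wave

MIN_SECONDS = 5.0  # minimum clip length suggested above


def clip_info(path: str) -> dict:
    """Read duration, channel count and sample rate from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "seconds": frames / rate,
            "channels": wav.getnchannels(),
            "sample_rate": rate,
        }


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """True if the clip meets the minimum duration for a voice reference."""
    return clip_info(path)["seconds"] >= min_seconds
```

This only checks duration; whether the clip is clean speech with enough tonal variation still has to be judged by ear.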
### Server protocol

`chatterbox-server.py` is a long-running process. Communication:
- **stdin:** one JSON line per request: `{"text": "...", "temperature": 0.8, ...}`
- **stdout:** `ok\n` after each utterance is **generated** (playback continues in background)
- **stderr:** status/timing logs

Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. An RTF (real-time factor) of ~0.2 means the next chunk is ready well before playback finishes.

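The client side of this protocol can be sketched in a few lines. This is an illustration in Python, not the code in `lib/chatterbox-tts.mjs`; `encode_request` and `speak` are hypothetical names, and treating any reply other than `ok` as an error is an assumption, since only the `ok\n` acknowledgement is documented here.

```python
import json


def encode_request(text: str, **params) -> str:
    """Build one request line: a single JSON object terminated by a newline."""
    # json.dumps escapes embedded newlines, so the request stays on one line.
    return json.dumps({"text": text, **params}) + "\n"


def speak(server_stdin, server_stdout, text: str, **params) -> None:
    """Send one request and block until the server acknowledges generation."""
    server_stdin.write(encode_request(text, **params))
    server_stdin.flush()
    ack = server_stdout.readline()
    if ack.strip() != "ok":
        # Assumption: anything other than "ok" is treated as a failure.
        raise RuntimeError(f"unexpected reply from TTS server: {ack!r}")
```

The two stream arguments would be the text-mode stdin and stdout pipes of the server process. Because `ok` arrives once generation finishes, a caller can send the next utterance while the previous one is still playing.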
### Acting tips

- Emphasis: `UPPERCASE` words
- Pause/rhythm: `...` `—`
- Acronym workaround: write `un-workable`, not `UNWORKABLE`, so the word is not spelled out letter by letter
- Exaggeration (full model only): `{ exaggeration: 0.0–1.0 }`
- Temperature: `{ temperature: 0.3–1.4 }` (lower = more consistent)

---

## Tools

| Script | Usage |
|--------|-------|
| `listen.mjs` | Mic → Whisper → stdout (one line per utterance) |
| `speak-as.mjs` | stdin → Chatterbox TTS with voice reference |
| `acting-demo-chatterbox.mjs` | TTS capability demo (sections 1–7) |

### Pipe example

```bash
node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav
```

`speak-as.mjs` deduplicates consecutive identical lines to handle VAD over-segmentation.

---

## Known Quirks

**librosa float64:** newer numpy makes `librosa.resample` return float64, breaking torch. Patched in `chatterbox-server.py`:
```python
import librosa as _librosa
import numpy as np
_orig = _librosa.resample  # keep the original so the wrapper can call it
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)  # force float32 output
```

**HuggingFace offline:** `from_pretrained()` contacts the Hugging Face Hub on every call, even when the model is already cached. The server uses `find_hf_cache()` to detect local snapshots and calls `from_local()` instead.

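The implementation of `find_hf_cache()` is not shown in these notes. The sketch below is one way such a lookup could work, assuming the standard Hub cache layout (`models--<org>--<name>/snapshots/<revision>/`) and the `HF_HUB_CACHE` and `HF_HOME` environment variables; `default_hub_cache` and `find_snapshot` are hypothetical names, and `XDG_CACHE_HOME` is not handled.

```python
import os
from pathlib import Path
from typing import Optional


def default_hub_cache() -> Path:
    """Resolve the Hub cache directory from the standard environment variables."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def find_snapshot(repo_id: str, cache_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the most recently modified local snapshot of repo_id, or None."""
    cache_dir = cache_dir or default_hub_cache()
    # The Hub cache stores "org/name" as "models--org--name".
    snapshots = cache_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

A caller could pass the returned directory to `from_local()` when it is not `None`, and fall back to `from_pretrained()` only for the first download.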
**VAD over-segmentation:** a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. `speak-as.mjs` deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.

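The deduplication step can be illustrated as below. This is a Python sketch of the idea, not the code in `speak-as.mjs`; the exact normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the skipping of empty lines are assumptions.

```python
import string
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = line.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def dedupe_consecutive(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line unless it normalizes to the same text as the previous one."""
    previous = None
    for line in lines:
        key = normalize(line)
        if not key or key == previous:
            continue  # skip empty lines and immediate repeats
        previous = key
        yield line
```

Only immediate repeats are dropped, so the same sentence spoken again later still gets through. Segments that are merely similar rather than identical after normalization would need a fuzzier comparison.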
**Native stdout:** the sherpa-onnx native addon writes transcription results directly to fd 1 via C `write()`, bypassing Node.js `process.stdout.write`. This cannot be intercepted from JavaScript. The addon only writes non-trivial results (not silence or garbage).

---

## Session reference

The Claude Code session used to develop this project is named **voice-buddy**, a self-referential session where the voice assistant was used to direct its own development.