mikael-lovqvists-claude-agent/claude-voice-experiment

mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-30 04:48:54 +00:00

5.9 KiB

Voice Experiment Notes

Real-time voice pipeline: microphone → STT → (LLM) → TTS. Everything runs locally; no internet access is needed after the initial model download.

Architecture

parec (PulseAudio)
  └─ lib/stt.mjs          Silero VAD + Whisper → text on stdout
       └─ sherpa-onnx-node
            └─ libsherpa-onnx-c-api.so   (needs CUDA build — see below)
                 └─ libonnxruntime.so     (system, with CUDA)

text → lib/chatterbox-tts.mjs
         └─ chatterbox-server.py         long-running Python TTS server
              └─ chatterbox-tts (pip)    ResembleAI Chatterbox model

STT

Library: sherpa-onnx-node npm package
Models: Whisper (offline batch) + Silero VAD
Audio capture: parec --format=s16le --rate=16000 --channels=1 --latency-msec=50

Model download

bash download-models.sh base.en   # or: tiny.en small.en medium.en large-v3 large-v3-turbo

Models land in models/:

silero_vad.onnx
sherpa-onnx-whisper-base.en/

How it works

lib/stt.mjs feeds 512-sample windows (32 ms at 16 kHz) into Silero VAD. When the VAD signals the end of speech (after 500 ms of silence), the segment is sent to Whisper for transcription. The transcription appears on stdout, but note that the native addon writes it directly to fd 1, bypassing Node.js process.stdout; it does not go through the JavaScript callback. The JS callback is still called and can be used for additional processing.
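The windowing step can be sketched as follows. This is a minimal illustration, not the code in lib/stt.mjs: the name createWindower is hypothetical, and it assumes the s16le mono 16 kHz byte stream produced by the parec command above. It converts raw bytes into 512-sample Float32Array windows scaled to the range [-1, 1), carrying leftover bytes over between chunks.

```javascript
// Hypothetical helper: split an s16le byte stream into fixed-size float windows.
const WINDOW_SAMPLES = 512;   // 512 samples / 16000 Hz = 32 ms
const BYTES_PER_SAMPLE = 2;   // signed 16-bit little-endian

function createWindower(windowSamples = WINDOW_SAMPLES) {
  let pending = Buffer.alloc(0);   // bytes that do not yet form a full window
  const windowBytes = windowSamples * BYTES_PER_SAMPLE;

  // Accepts one chunk of raw bytes, returns zero or more complete windows.
  return function push(chunk) {
    pending = Buffer.concat([pending, chunk]);
    const windows = [];
    while (pending.length >= windowBytes) {
      const win = new Float32Array(windowSamples);
      for (let i = 0; i < windowSamples; i++) {
        win[i] = pending.readInt16LE(i * BYTES_PER_SAMPLE) / 32768;
      }
      windows.push(win);
      pending = pending.subarray(windowBytes);
    }
    return windows;
  };
}
```

In a real pipeline, each chunk from the parec process's stdout would be passed to push(), and each returned window handed to the VAD. Buffering matters because the chunk sizes delivered by a pipe are not guaranteed to be multiples of 1024 bytes.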

Getting CUDA

The npm package bundles a CPU-only libsherpa-onnx-c-api.so (SHERPA_ONNX_ENABLE_GPU=OFF). To use CUDA, rebuild it from source:

# Prerequisites: cmake, onnxruntime-opt-cuda (or onnxruntime-cuda) from pacman
git clone --depth 1 git@github.com:k2-fsa/sherpa-onnx ~/Foreign/sherpa-onnx

export SHERPA_ONNXRUNTIME_INCLUDE_DIR=/usr/include/onnxruntime
export SHERPA_ONNXRUNTIME_LIB_DIR=/usr/lib

cd ~/Foreign/sherpa-onnx
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DSHERPA_ONNX_ENABLE_GPU=ON \
  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
  .

cmake --build build --target sherpa-onnx-c-api

# Symlink into node_modules (remove bundled CPU onnxruntime so system one is used)
NMOD=~/workspace/experiments/voice/node_modules/sherpa-onnx-linux-x64
ln -sf ~/Foreign/sherpa-onnx/build/lib/libsherpa-onnx-c-api.so $NMOD/libsherpa-onnx-c-api.so
rm -f $NMOD/libonnxruntime.so

The system /usr/lib/libonnxruntime.so (from onnxruntime-opt-cuda) is then used at runtime. The sherpa-onnx C API is stable across versions, so the pre-built sherpa-onnx.node (1.13.2) works with the freshly built library.

TTS

Library: chatterbox-tts Python package (ResembleAI)
Models: ResembleAI/chatterbox-turbo (default) or ResembleAI/chatterbox
Playback: pacat --format=float32le --rate=24000 --channels=1

Setup

bash setup-venv.sh   # creates ./venv, installs chatterbox-tts + deps

HuggingFace token (for model download) goes in ~/.secrets/hugging-face.token.

Variants

Variant	RTF (RTX 3090)	Notes
turbo	~0.2	default, female voice, no exaggeration param
full	slower	male voice, supports `exaggeration` (0.0–1.0)

Paralinguistic tags

These tags work in both variants:

[laugh] [chuckle] [cough] [clear throat] [sigh] [shush] [groan] [sniff] [gasp]

Voice cloning

Pass a WAV reference clip (at least 5 seconds, clean speech, no music). Quality scales with both length and emotional range — a clip with variation (question, statement, different tones) gives the model more to work with for natural prosody than a flat monotone sample of the same length.

await tts.speak('Hello.', { audio_prompt: '/path/to/voice.wav' })

The server preprocesses the WAV to float32 mono to work around a librosa/numpy float64 issue.

Server protocol

chatterbox-server.py is a long-running process. Communication:

stdin: one JSON line per request: {"text": "...", "temperature": 0.8, ...}
stdout: ok\n after each utterance is generated (playback continues in background)
stderr: status/timing logs
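A client for this protocol could look roughly like the sketch below. The helper names encodeRequest and createAckParser are hypothetical and are not taken from lib/chatterbox-tts.mjs. Only the text and temperature fields are documented above, so any other option is passed through unchanged as an assumption.

```javascript
// Hypothetical client-side helpers for the line protocol described above.

// Serialize one request as a single JSON line. JSON.stringify escapes any
// newline inside the text, so the request always stays on one line.
function encodeRequest(text, options = {}) {
  return JSON.stringify({ text, ...options }) + '\n';
}

// Track "ok" acknowledgements in a stdout stream that may arrive in
// arbitrary chunks; onAck is called once per complete "ok" line.
function createAckParser(onAck) {
  let buffered = '';
  return function feed(chunk) {
    buffered += chunk.toString();
    let newline;
    while ((newline = buffered.indexOf('\n')) !== -1) {
      const line = buffered.slice(0, newline).trim();
      buffered = buffered.slice(newline + 1);
      if (line === 'ok') onAck();
    }
  };
}
```

With a spawned server process, the output of encodeRequest() would be written to its stdin and the parser attached to its stdout 'data' event. Since one ok is emitted per utterance, acknowledgements can be matched to requests in order.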

Generation and playback are pipelined: the server generates the next utterance while the previous one is still playing. RTF ~0.2 means the next chunk is ready well before playback finishes.
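The effect of the real-time factor on pipelining can be checked with a small timing model. This is a simplification under stated assumptions, not a measurement: utterances are generated back to back, each taking rtf × duration seconds, and an utterance starts playing as soon as it has been generated and the previous one has finished. The function name pipelineTimeline is hypothetical.

```javascript
// Simplified timing model (an assumption, not measured behaviour).
// durations: utterance lengths in seconds; rtf: generation time / audio time.
function pipelineTimeline(durations, rtf) {
  let genEnd = 0;    // when generation of the current utterance finishes
  let playEnd = 0;   // when playback of the previous utterance finishes
  return durations.map((duration, index) => {
    genEnd += rtf * duration;
    const playStart = Math.max(genEnd, playEnd);
    // Silence between the previous utterance and this one.
    const gap = index === 0 ? 0 : playStart - playEnd;
    playEnd = playStart + duration;
    return { genEnd, playStart, playEnd, gap };
  });
}
```

For three 5-second utterances at RTF 0.2, this model has generation finishing at about 1 s, 2 s and 3 s, while playback ends at 6 s, 11 s and 16 s, so there is no gap between utterances. With an RTF above 1.0, the gap becomes positive and playback stalls.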

Acting tips

Emphasis: UPPERCASE words
Pause/rhythm: ... —
Acronym workaround: write un-workable, not UNWORKABLE (avoids the word being spelled out letter by letter)
Exaggeration (full model only): { exaggeration: 0.0–1.0 }
Temperature: { temperature: 0.3–1.4 } (lower = more consistent)
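The parameter ranges above could be enforced before a request is sent, as in this sketch. The helper buildSpeakOptions is hypothetical, and treating the listed ranges as hard limits to clamp to is an assumption; the notes only describe them as useful ranges. Exaggeration is dropped for the turbo variant because that variant does not accept it.

```javascript
// Hypothetical helper: assemble per-utterance options from the ranges listed above.
const clamp = (value, min, max) => Math.min(max, Math.max(min, value));

function buildSpeakOptions(variant, { temperature, exaggeration, audio_prompt } = {}) {
  const options = {};
  if (audio_prompt !== undefined) options.audio_prompt = audio_prompt;
  // Assumption: clamp to the 0.3–1.4 range suggested in the tips.
  if (temperature !== undefined) options.temperature = clamp(temperature, 0.3, 1.4);
  // Only the full model supports exaggeration; drop it for turbo.
  if (variant === 'full' && exaggeration !== undefined) {
    options.exaggeration = clamp(exaggeration, 0.0, 1.0);
  }
  return options;
}
```

The result would be passed as the second argument to tts.speak(), for example buildSpeakOptions('full', { exaggeration: 0.7, temperature: 0.5 }).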

Tools

Script	Usage
`listen.mjs`	Mic → Whisper → stdout (one line per utterance)
`speak-as.mjs`	stdin → Chatterbox TTS with voice reference
`acting-demo-chatterbox.mjs`	TTS capability demo (sections 1–7)

Pipe example

node listen.mjs 2>/dev/null | node speak-as.mjs /path/to/voice.wav

speak-as.mjs deduplicates consecutive identical lines to handle VAD over-segmentation.

Known Quirks

librosa float64: newer numpy makes librosa.resample return float64, breaking torch. Patched in chatterbox-server.py:

import numpy as np
import librosa as _librosa
_orig = _librosa.resample
_librosa.resample = lambda *a, **k: _orig(*a, **k).astype(np.float32)

HuggingFace offline: from_pretrained() always contacts the HuggingFace Hub, even when the model is already cached. The server uses find_hf_cache() to detect local snapshots and calls from_local() instead.

VAD over-segmentation: a mid-utterance pause produces two segments that Whisper transcribes to similar or identical text. speak-as.mjs deduplicates consecutive lines (normalized, punctuation-stripped) before speaking.
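The deduplication idea can be sketched as below. This is an illustration with hypothetical names (normalizeLine, createDeduper), not the code in speak-as.mjs, and the exact normalization rules are an assumption. It only catches lines that are identical after normalization; merely similar transcriptions would still be spoken twice.

```javascript
// Hypothetical normalization: lowercase, drop punctuation, collapse whitespace.
function normalizeLine(line) {
  return line
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Returns a function that reports whether a line should be spoken, i.e.
// whether it differs (after normalization) from the last accepted line.
function createDeduper() {
  let previous = null;
  return function shouldSpeak(line) {
    const key = normalizeLine(line);
    if (key === '' || key === previous) return false;
    previous = key;
    return true;
  };
}
```

Only consecutive repeats are suppressed, so a sentence that legitimately recurs later in the conversation is still spoken.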

Native stdout: the sherpa-onnx native addon writes transcription results directly to fd 1 via C write(), bypassing Node.js process.stdout.write. This cannot be intercepted from JavaScript. It only writes for non-trivial results (not silence/garbage).
