mikael-lovqvists-claude-agent/claude-voice-experiment


mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment

STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-30 04:48:54 +00:00


Speaker Diarization

Notes on adding speaker identification to the voice pipeline.

What Whisper does and doesn't do

Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.

Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
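The alignment step can be sketched as an overlap match: each timestamped transcript chunk takes the label of the speaker turn it overlaps most. This is a minimal sketch, not the pipeline's actual code; `Turn`, `Chunk`, `overlap_seconds`, and `assign_speakers` are hypothetical names, and it assumes both the diarizer and the transcriber report times in seconds on the same clock.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarization segment (hypothetical structure)."""
    speaker: str
    start: float
    end: float


class Chunk(NamedTuple):
    """One timestamped piece of transcript (hypothetical structure)."""
    text: str
    start: float
    end: float


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, turns):
    """Label each chunk with the speaker whose turn overlaps it the most."""
    labelled = []
    for chunk in chunks:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap_seconds(chunk.start, chunk.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, chunk.text))
    return labelled
```

Chunks that overlap no turn at all keep the `UNKNOWN` label, which makes gaps in the diarization visible instead of silently attributing them to someone.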

pyannote.audio

Probably the most capable open-source option. Runs locally and can use a GPU for acceleration.

pip install pyannote.audio

Produces time-stamped segments:

SPEAKER_00  0.5s – 3.2s
SPEAKER_01  3.8s – 7.1s
SPEAKER_00  7.5s – 12.0s

Feed each segment to Whisper independently → transcript with speaker tags.

Requires a Hugging Face access token and accepting the gated model's user conditions on the Hub (free).
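Feeding segments to Whisper one at a time mostly comes down to converting segment times into sample ranges. The sketch below assumes the recording is already loaded as a flat sequence of 16 kHz mono samples (the rate Whisper works at) and takes the transcriber as a plain callable; `slice_segment`, `transcribe_by_speaker`, and the `transcribe` parameter are hypothetical names, not part of pyannote or sherpa-onnx.

```python
SAMPLE_RATE = 16_000  # assumption: audio was resampled to 16 kHz mono


def slice_segment(samples, start_s, end_s, sample_rate=SAMPLE_RATE):
    """Return the samples between start_s and end_s, clamped to the recording."""
    first = max(0, int(round(start_s * sample_rate)))
    last = min(len(samples), int(round(end_s * sample_rate)))
    return samples[first:last]


def transcribe_by_speaker(samples, segments, transcribe):
    """Run `transcribe` on each (speaker, start_s, end_s) segment in order.

    `transcribe` is any callable that maps a clip to text, for example a
    thin wrapper around the existing Whisper recognizer.
    """
    results = []
    for speaker, start_s, end_s in segments:
        clip = slice_segment(samples, start_s, end_s)
        if len(clip) == 0:
            continue  # segment lies outside the recording
        results.append((speaker, transcribe(clip)))
    return results
```

Very short segments tend to transcribe poorly, so merging adjacent segments from the same speaker before slicing is likely worth trying.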

sherpa-onnx

Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.

Speaker identification vs diarization

| Feature | Diarization | Identification |
| --- | --- | --- |
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:

- resemblyzer: simple cosine similarity on d-vector embeddings
- wespeaker: more accurate, available as ONNX (fits existing sherpa-onnx stack)
- speechbrain: full toolkit, heavier dependency
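Whichever model produces the embeddings, the comparison itself usually reduces to cosine similarity followed by a threshold. As a sketch, with $\mathbf{e}$ the embedding of the new segment, $\mathbf{r}_k$ the stored reference embedding of enrolled speaker $k$, and $\tau$ an acceptance threshold (all three symbols are notation introduced here, not taken from any of the libraries):

$$
\operatorname{sim}(\mathbf{e}, \mathbf{r}_k) = \frac{\mathbf{e} \cdot \mathbf{r}_k}{\lVert \mathbf{e} \rVert \, \lVert \mathbf{r}_k \rVert},
\qquad
\hat{k} = \arg\max_k \, \operatorname{sim}(\mathbf{e}, \mathbf{r}_k)
$$

The segment is attributed to speaker $\hat{k}$ only if $\operatorname{sim}(\mathbf{e}, \mathbf{r}_{\hat{k}}) \ge \tau$; otherwise it is treated as an unknown voice. A suitable $\tau$ depends on the embedding model and has to be tuned on real recordings.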

Future experiments

Multi-speaker transcription

Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:

Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
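Producing that layout from labelled segments could look like the sketch below. It merges consecutive segments from the same speaker into one line and maps anonymous labels to display names. `format_transcript` and the `names` mapping are hypothetical; the mapping would have to come from manual labelling or from an identification step.

```python
def format_transcript(turns, names=None):
    """Render (speaker, text) pairs as 'Name: text' lines.

    Consecutive turns by the same speaker are joined into one line.
    `names` optionally maps labels such as 'SPEAKER_00' to display names.
    """
    names = names or {}
    merged = []
    for speaker, text in turns:
        text = text.strip()
        if not text:
            continue  # skip segments where nothing was transcribed
        if merged and merged[-1][0] == speaker:
            merged[-1][1] = f"{merged[-1][1]} {text}"
        else:
            merged.append([speaker, text])
    return "\n".join(f"{names.get(speaker, speaker)}: {text}" for speaker, text in merged)
```

Labels without an entry in `names` are printed unchanged, so a partially labelled conversation still renders.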

Real-time diarization

Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.

Voice enrollment

Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
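The enrollment flow can be sketched as a small registry that stores one averaged embedding per person and hands out temporary labels when nothing scores above the threshold. `SpeakerRegistry`, `enroll`, `identify`, and the default threshold are hypothetical and untuned; the embeddings are assumed to be lists of floats from a separate model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SpeakerRegistry:
    """Stores enrolled speakers and matches new segment embeddings against them."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.enrolled = {}        # name -> reference embedding
        self._unknown_count = 0

    def enroll(self, name, embeddings):
        """Store the mean of one or more reference-clip embeddings for `name`."""
        if not embeddings:
            raise ValueError("at least one reference embedding is required")
        dims = len(embeddings[0])
        self.enrolled[name] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dims)
        ]

    def identify(self, embedding):
        """Return the best-matching enrolled name, or a temporary anonymous label."""
        best_name, best_score = None, -1.0
        for name, reference in self.enrolled.items():
            score = cosine_similarity(embedding, reference)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= self.threshold:
            return best_name
        self._unknown_count += 1
        return f"UNKNOWN_{self._unknown_count:02d}"
```

As a simplification, every unmatched segment gets a fresh `UNKNOWN_nn` label here; grouping repeated segments from the same unknown voice would need the embeddings of unknowns to be remembered as well.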

Wake word per speaker

Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
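The gating logic is small enough to sketch directly: the assistant wakes only when the wake word fired and the same audio was attributed to an enrolled speaker with enough confidence. `should_wake`, its parameters, and the default `min_score` are hypothetical; the wake word detector and the speaker scorer are assumed to exist elsewhere.

```python
def should_wake(wake_word_detected, speaker, score, allowed_speakers, min_score=0.8):
    """Decide whether a detected wake word may trigger the assistant.

    wake_word_detected: result of the wake word detector for this audio
    speaker:            name returned by speaker identification, or None
    score:              similarity score behind that identification
    allowed_speakers:   set of enrolled names permitted to trigger
    """
    return bool(
        wake_word_detected
        and speaker in allowed_speakers
        and score >= min_score
    )
```

Voice similarity of this kind can be fooled by a recording of an enrolled speaker, so it is best treated as a convenience filter, not as strong authentication.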

VRAM considerations

pyannote diarization adds roughly 1–2 GB of VRAM depending on the model variant. wespeaker ONNX is much lighter. Either fits comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
