claude-voice-experiment/SPEAKER-DIARIZATION.md
mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00

Speaker Diarization

Notes on adding speaker identification to the voice pipeline.


What Whisper does and doesn't do

Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.


Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
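
The alignment step can be sketched as a maximum-overlap match between timestamped transcript chunks and diarization turns. This is a minimal illustration in plain Python, not the method of any particular library: `Turn`, `Chunk`, and `assign_speakers` are hypothetical names, and it assumes both the diarizer and the transcriber report times in seconds on the same clock.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization result: a speaker label and its time span in seconds."""
    speaker: str
    start: float
    end: float


@dataclass
class Chunk:
    """One timestamped piece of transcript text (times in seconds)."""
    text: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, turns, unknown="UNKNOWN"):
    """Label each chunk with the speaker whose turn overlaps it the most."""
    labelled = []
    for chunk in chunks:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(chunk.start, chunk.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, chunk.text))
    return labelled
```

Chunks that overlap no turn at all keep the `unknown` label, which makes gaps in the diarization output visible instead of silently attributing them to a neighbour.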

pyannote.audio

The most capable open-source option. Runs locally, GPU-accelerated.

pip install pyannote.audio

Produces time-stamped segments:

SPEAKER_00  0.5s  3.2s
SPEAKER_01  3.8s  7.1s
SPEAKER_00  7.5s  12.0s

Feed each segment to Whisper independently → transcript with speaker tags.

Requires a Hugging Face access token and an approved access request for the model (free).
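
The per-segment step above might look roughly like this. It is a sketch under assumptions: diarization turns are given as `(speaker, start_s, end_s)` tuples, audio is a flat sequence of samples, and `transcribe` is a hypothetical callable standing in for whatever Whisper wrapper the pipeline uses. None of these names come from pyannote.audio or sherpa-onnx.

```python
def slice_samples(samples, sample_rate, start_s, end_s):
    """Cut the samples that fall between start_s and end_s (seconds)."""
    lo = max(0, int(round(start_s * sample_rate)))
    hi = min(len(samples), int(round(end_s * sample_rate)))
    return samples[lo:hi]


def transcribe_by_speaker(samples, sample_rate, turns, transcribe):
    """Run `transcribe` on each diarized turn and tag the text with its speaker.

    `transcribe(clip, sample_rate) -> str` is an assumed interface, not a real API.
    """
    tagged = []
    for speaker, start_s, end_s in turns:
        clip = slice_samples(samples, sample_rate, start_s, end_s)
        if len(clip) == 0:
            continue  # turn lies outside the recording
        text = transcribe(clip, sample_rate).strip()
        if text:
            tagged.append((speaker, text))
    return tagged
```

Very short turns tend to transcribe poorly in isolation, so merging adjacent turns from the same speaker before this step is probably worth trying.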

sherpa-onnx

Newer versions include speaker diarization support, built from a segmentation model plus a speaker embedding extractor, both running as ONNX. Worth checking whether it can be added to the existing pipeline without a separate Python dependency.


Speaker identification vs diarization

| Feature | Diarization | Identification |
|---|---|---|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:

  • resemblyzer — simple cosine similarity on d-vector embeddings
  • wespeaker — more accurate, available as ONNX (fits existing sherpa-onnx stack)
  • speechbrain — full toolkit, heavier dependency
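
Whichever library produces the embeddings, the comparison itself is commonly a cosine similarity between two fixed-length vectors. A minimal sketch in plain Python; `cosine_similarity` is an illustrative helper, not a function from any of the libraries listed, and real embeddings would have a few hundred dimensions, not the toy sizes used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)
```

Scores near 1 suggest the same voice. The cut-off that separates "same" from "different" depends on the embedding model and has to be tuned on real recordings.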

Future experiments

Multi-speaker transcription

Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:

Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.

Real-time diarization

Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
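
One way the rolling embedding could work is to keep a running-mean centroid per speaker and open a new anonymous label whenever nothing matches. This is a sketch only: `RollingSpeakerTracker` is a hypothetical class, the embeddings are assumed to come from a separate speaker embedding model, and the `0.7` threshold is a placeholder, not a tuned value.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class RollingSpeakerTracker:
    """Assign a label to each segment embedding, learning speakers on the fly."""

    def __init__(self, threshold=0.7):  # threshold is an assumed placeholder
        self.threshold = threshold
        self.centroids = {}  # label -> (mean of unit embeddings, segment count)

    def label(self, embedding):
        e = _unit(embedding)
        best_label, best_score = None, -1.0
        for name, (centroid, _count) in self.centroids.items():
            score = sum(a * b for a, b in zip(_unit(centroid), e))
            if score > best_score:
                best_label, best_score = name, score
        if best_label is None or best_score < self.threshold:
            # No known speaker is close enough: start a new anonymous one.
            best_label = f"SPEAKER_{len(self.centroids):02d}"
            self.centroids[best_label] = (e, 1)
        else:
            # Fold the new embedding into the matched speaker's running mean.
            centroid, count = self.centroids[best_label]
            updated = [(c * count + x) / (count + 1) for c, x in zip(centroid, e)]
            self.centroids[best_label] = (updated, count + 1)
        return best_label
```

Updating the centroid on every match lets the profile adapt over a session, at the cost of drifting if an early segment was mislabelled.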

Voice enrollment

Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
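
The enrollment flow described above could be sketched as follows. All names (`enroll`, `identify`) are hypothetical, the `0.75` threshold is an untuned assumption, and the embeddings are assumed to come from whichever speaker embedding model ends up being used.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def enroll(clip_embeddings):
    """Average the embeddings of one person's reference clips into one profile."""
    units = [_unit(e) for e in clip_embeddings]
    dims = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dims)]
    return _unit(mean)


def identify(embedding, profiles, threshold=0.75, unknown="UNKNOWN"):
    """Return (label, score) for the closest enrolled speaker.

    Falls back to the `unknown` label when no profile scores at or above
    `threshold` (an assumed placeholder value).
    """
    e = _unit(embedding)
    best_name, best_score = unknown, 0.0
    for name, profile in profiles.items():
        score = sum(a * b for a, b in zip(profile, e))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return unknown, best_score
    return best_name, best_score
```

Profiles are plain lists of floats, so they can be stored as JSON next to the reference clips and reloaded across sessions.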

Wake word per speaker

Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
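
As a sketch, the gate reduces to one predicate evaluated on the segment that contained the wake word. `should_wake` and its parameters are hypothetical, and `min_score` is an assumed placeholder that would need tuning.

```python
def should_wake(wake_word_heard, speaker, score, enrolled, min_score=0.75):
    """Return True only when the wake word was heard from an enrolled voice.

    `speaker` and `score` are the identification result for the same audio
    segment in which the wake word detector fired.
    """
    return bool(wake_word_heard) and speaker in enrolled and score >= min_score
```

For example, `should_wake(True, "Mikael", 0.9, {"Mikael"})` passes, while the same call with an unenrolled label or a low score does not.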


VRAM considerations

pyannote diarization adds roughly 1-2 GB of VRAM, depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.