claude-voice-experiment/SPEAKER-DIARIZATION.md
mikael-lovqvists-claude-agent db8889aeed Initial commit — voice pipeline experiment
STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server,
query completeness classifier (Ollama), multi-voice demo scripts, and
planning docs. Kept as reference; clean rewrite planned in separate repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 04:48:54 +00:00

# Speaker Diarization
Notes on adding speaker identification to the voice pipeline.

---
## What Whisper does and doesn't do
Whisper is a transcription model: it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.

---
## Adding diarization
Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
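
The alignment step can be sketched as an interval-overlap match. This is a minimal illustration, assuming both the diarizer and the transcriber report start and end times in seconds; `Turn`, `Segment`, and `label_segments` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


@dataclass
class Segment:
    text: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="UNKNOWN"):
    """Pair each transcript segment with the speaker whose turn overlaps it most."""
    labelled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0.0:
            labelled.append((unknown, seg.text))  # no turn covers this segment
        else:
            labelled.append((best.speaker, seg.text))
    return labelled


turns = [Turn("SPEAKER_00", 0.5, 3.2), Turn("SPEAKER_01", 3.8, 7.1)]
segments = [
    Segment("How do I juggle oranges?", 0.6, 3.0),
    Segment("Start with two balls and work up from there.", 3.9, 7.0),
]
result = label_segments(segments, turns)
```

Matching on the largest overlap tolerates small timestamp disagreements between the two models. Segments that straddle a speaker change would still need splitting, which this sketch does not attempt.
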
### pyannote.audio
The most capable open-source option. Runs locally, GPU-accelerated.
```
pip install pyannote.audio
```
Produces time-stamped segments:
```
SPEAKER_00 0.5s 3.2s
SPEAKER_01 3.8s 7.1s
SPEAKER_00 7.5s 12.0s
```
Feed each segment to Whisper independently → transcript with speaker tags.
Requires a Hugging Face access token and an approved access request for the gated model (free).
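
Cutting the audio per segment might look like the sketch below. It assumes the turns arrive in the text layout shown above and that the audio is a flat sequence of 16 kHz mono samples (the rate Whisper works with). `parse_turns` and `slice_turns` are hypothetical helpers, and the actual transcription call is left out.

```python
SAMPLE_RATE = 16_000  # assumed: 16 kHz mono input


def parse_turns(text):
    """Parse lines like 'SPEAKER_00 0.5s 3.2s' into (speaker, start, end) tuples."""
    turns = []
    for line in text.strip().splitlines():
        speaker, start, end = line.split()
        turns.append((speaker, float(start.rstrip("s")), float(end.rstrip("s"))))
    return turns


def slice_turns(samples, turns, sample_rate=SAMPLE_RATE):
    """Yield (speaker, chunk) pairs, one chunk of samples per speaker turn."""
    for speaker, start, end in turns:
        first = int(round(start * sample_rate))
        last = min(int(round(end * sample_rate)), len(samples))
        yield speaker, samples[first:last]


listing = """
SPEAKER_00 0.5s 3.2s
SPEAKER_01 3.8s 7.1s
SPEAKER_00 7.5s 12.0s
"""
samples = [0.0] * (SAMPLE_RATE * 12)  # 12 seconds of silence as a stand-in
chunks = list(slice_turns(samples, parse_turns(listing)))
```

Each chunk would then be passed to the transcriber on its own, and the returned text tagged with the chunk's speaker label.
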
### sherpa-onnx
Newer versions include some diarization support. It is worth checking whether it can be added to the existing pipeline without a separate Python dependency.

---
## Speaker identification vs diarization
| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:
- **resemblyzer** — simple cosine similarity on d-vector embeddings
- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain** — full toolkit, heavier dependency
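
Whichever model produces the embeddings, the comparison itself is small. The sketch below uses toy 3-dimensional vectors in place of real embeddings; `identify` and the 0.75 threshold are illustrative assumptions, and a real threshold would need tuning per model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(embedding, enrolled, threshold=0.75):
    """Return (name, score) for the best match, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy vectors standing in for real speaker embeddings.
enrolled = {"Mikael": [1.0, 0.0, 0.0], "Claude": [0.0, 1.0, 0.0]}
match, score = identify([0.9, 0.1, 0.0], enrolled)
stranger, low_score = identify([0.5, 0.5, 0.7], enrolled)
```
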
---
## Future experiments
### Multi-speaker transcription
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```
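
Producing that layout from labelled segments is mostly bookkeeping. A minimal sketch, assuming the input is a list of `(speaker_label, text)` pairs and that a label-to-name mapping is available (which in practice needs the identification step, since diarization alone only yields anonymous labels); `render_transcript` is a hypothetical helper.

```python
def render_transcript(labelled, names=None):
    """Merge consecutive segments from one speaker and format 'Name: text' lines."""
    names = names or {}
    merged = []
    for speaker, text in labelled:
        if merged and merged[-1][0] == speaker:
            merged[-1][1] += " " + text  # same speaker kept talking
        else:
            merged.append([speaker, text])
    return "\n".join(f"{names.get(s, s)}: {t}" for s, t in merged)


labelled = [
    ("SPEAKER_00", "How do I"),
    ("SPEAKER_00", "juggle oranges?"),
    ("SPEAKER_01", "Start with two balls and work up from there."),
]
names = {"SPEAKER_00": "Mikael", "SPEAKER_01": "Claude"}
transcript = render_transcript(labelled, names)
```
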
### Real-time diarization
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
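
One way to read "rolling speaker embedding" is a running centroid per speaker that is nudged toward each new segment. The sketch below is one possible interpretation, not a tested design: `RollingSpeakerTracker`, the 0.7 threshold, and the 0.9 momentum are all assumptions, and the vectors are toy stand-ins for real embeddings.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


class RollingSpeakerTracker:
    def __init__(self, threshold=0.7, momentum=0.9):
        self.threshold = threshold  # minimum cosine similarity to reuse a label
        self.momentum = momentum    # weight kept by the existing centroid
        self.centroids = {}         # label -> unit-length centroid

    def assign(self, embedding):
        """Return a speaker label for one segment embedding, updating state."""
        emb = _normalize(embedding)
        best_label, best_score = None, -1.0
        for label, centroid in self.centroids.items():
            score = sum(a * b for a, b in zip(emb, centroid))  # cosine of unit vectors
            if score > best_score:
                best_label, best_score = label, score
        if best_label is None or best_score < self.threshold:
            # No known speaker is close enough: open a new anonymous label.
            best_label = f"SPEAKER_{len(self.centroids):02d}"
            self.centroids[best_label] = emb
        else:
            # Move the matched centroid slightly toward the new embedding.
            old = self.centroids[best_label]
            mixed = [self.momentum * o + (1 - self.momentum) * e for o, e in zip(old, emb)]
            self.centroids[best_label] = _normalize(mixed)
        return best_label


tracker = RollingSpeakerTracker()
labels = [tracker.assign(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.95, 0.05, 0.0])]
```

The moving average lets a centroid adapt to drift in a speaker's voice or microphone position, at the cost of slowly absorbing misassigned segments.
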
### Voice enrollment
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
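
Storing the enrolled embeddings could be as simple as a JSON file. In this sketch the embedding extraction is out of scope and toy 2-dimensional vectors stand in for it. Averaging several clips per person is an added assumption (the note above only calls for one clip), and `enroll`, `save_store`, and `load_store` are hypothetical names.

```python
import json
import math
import os
import tempfile


def mean_embedding(clips):
    """Average several clip embeddings and scale the result to unit length."""
    dim = len(clips[0])
    mean = [sum(clip[i] for clip in clips) / len(clips) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


def enroll(store, name, clips):
    """Add or replace one person's reference embedding in the store."""
    store[name] = mean_embedding(clips)
    return store


def save_store(store, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(store, fh)


def load_store(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Toy vectors standing in for embeddings of two reference clips.
store = enroll({}, "Mikael", [[1.0, 0.0], [0.0, 1.0]])
path = os.path.join(tempfile.mkdtemp(), "speakers.json")
save_store(store, path)
restored = load_store(path)
```
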
### Wake word per speaker
Combine identification with wake word detection so only enrolled voices can trigger the assistant, giving a simple voice-based access control layer.

---
## VRAM considerations
pyannote diarization adds roughly 1 to 2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.