STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Speaker Diarization

Notes on adding speaker identification to the voice pipeline.

---

## What Whisper does and doesn't do

Whisper is a transcription model: it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it transcribes everything into one flat stream with no speaker labels.

---

## Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.
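A minimal sketch of the alignment step, assuming the transcriber returns a start and end time for each phrase. `Turn`, `Phrase`, and `label_phrases` are hypothetical names for illustration, not part of any library. Each phrase is given the label of the speaker turn it overlaps the most:

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn: a speaker label and its time span in seconds."""
    speaker: str
    start: float
    end: float


@dataclass
class Phrase:
    """One transcribed phrase with its (assumed) time span in seconds."""
    text: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_phrases(phrases, turns, unknown="UNKNOWN"):
    """Tag each phrase with the speaker whose turn overlaps it the most."""
    labelled = []
    for phrase in phrases:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(phrase.start, phrase.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, phrase.text))
    return labelled


# Made-up timings for illustration only.
turns = [
    Turn("SPEAKER_00", 0.5, 3.2),
    Turn("SPEAKER_01", 3.8, 7.1),
]
phrases = [
    Phrase("How do I juggle oranges?", 0.6, 3.0),
    Phrase("Start with two balls and work up from there.", 3.9, 7.0),
]
for speaker, text in label_phrases(phrases, turns):
    print(f"{speaker}: {text}")
```

A phrase that overlaps no turn at all keeps the `UNKNOWN` label, which makes gaps in the diarization output visible instead of silently attributing them to a neighbour.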

### pyannote.audio

Probably the most capable open-source option. Runs locally and can use a GPU.

```
pip install pyannote.audio
```

Produces time-stamped segments:
```
SPEAKER_00 0.5s – 3.2s
SPEAKER_01 3.8s – 7.1s
SPEAKER_00 7.5s – 12.0s
```

Feed each segment to Whisper independently → transcript with speaker tags.

Requires a Hugging Face access token and accepting the gated model's user conditions (free).
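The segment-by-segment approach above can be sketched as follows. This assumes the diarizer's output has already been converted to `(label, start, end)` tuples and that the audio is a flat sequence of 16 kHz mono samples. `fake_transcribe` is a hypothetical stand-in for the real Whisper call, so the sketch runs without any model:

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz mono input, as Whisper expects


def slice_segments(samples, segments, sample_rate=SAMPLE_RATE):
    """Yield (speaker, chunk) pairs, one per non-empty diarization segment."""
    for speaker, start_s, end_s in segments:
        first = max(0, round(start_s * sample_rate))
        last = min(len(samples), round(end_s * sample_rate))
        if last > first:
            yield speaker, samples[first:last]


def tagged_transcript(samples, segments, transcribe):
    """Transcribe each chunk on its own and prefix the text with its speaker."""
    lines = []
    for speaker, chunk in slice_segments(samples, segments):
        text = transcribe(chunk).strip()
        if text:
            lines.append(f"{speaker}: {text}")
    return lines


# Segments in the shape shown above: (label, start seconds, end seconds).
segments = [
    ("SPEAKER_00", 0.5, 3.2),
    ("SPEAKER_01", 3.8, 7.1),
    ("SPEAKER_00", 7.5, 12.0),
]
audio = [0.0] * (12 * SAMPLE_RATE)  # 12 s of silence as stand-in audio


def fake_transcribe(chunk):
    """Hypothetical stand-in for a Whisper call; reports the chunk duration."""
    return f"<{len(chunk) / SAMPLE_RATE:.1f}s of speech>"


for line in tagged_transcript(audio, segments, fake_transcribe):
    print(line)
```

Very short segments tend to transcribe poorly in isolation, so merging adjacent segments from the same speaker before slicing may be worth trying.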

### sherpa-onnx

Newer versions have some diarization support. Worth checking whether it can be added to the existing pipeline without a separate Python dependency.

---

## Speaker identification vs diarization

| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:

- **resemblyzer**: simple cosine similarity on d-vector embeddings
- **wespeaker**: more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain**: full toolkit, heavier dependency

---

## Future experiments

### Multi-speaker transcription
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```

### Real-time diarization
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
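One way to read "rolling speaker embedding" is an exponentially averaged centroid per speaker. This is a sketch under that assumption, not a settled design. `RollingSpeakerTracker`, the threshold, and the momentum value are all hypothetical and would need tuning against real embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RollingSpeakerTracker:
    """Keeps one rolling (exponentially averaged) embedding per speaker."""

    def __init__(self, threshold=0.7, momentum=0.9):
        self.threshold = threshold  # assumed value; tune on real embeddings
        self.momentum = momentum    # weight kept by the old centroid on update
        self.centroids = {}         # label -> rolling embedding

    def assign(self, embedding):
        """Return a label for the segment and fold it into that speaker."""
        best_label, best_score = None, -1.0
        for label, centroid in self.centroids.items():
            score = cosine(embedding, centroid)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is None or best_score < self.threshold:
            # No known speaker is close enough: open a new anonymous label.
            best_label = f"SPEAKER_{len(self.centroids):02d}"
            self.centroids[best_label] = list(embedding)
        else:
            m = self.momentum
            old = self.centroids[best_label]
            self.centroids[best_label] = [
                m * o + (1 - m) * e for o, e in zip(old, embedding)
            ]
        return best_label


tracker = RollingSpeakerTracker()
# Toy 3-dimensional embeddings; real models emit far more dimensions.
stream = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.9, 0.2, 0.0], [0.1, 0.9, 0.0]]
labels = [tracker.assign(e) for e in stream]
print(labels)
```

The rolling update lets a centroid drift with the speaker's voice over a session, at the cost of also drifting when a segment is mislabelled.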

### Voice enrollment
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
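The enrollment flow described above might look like this. It assumes embeddings arrive as plain lists of floats from whichever embedding model is chosen. `EnrolledSpeakers`, the `UNKNOWN_n` label format, and the 0.75 threshold are hypothetical choices for illustration:

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class EnrolledSpeakers:
    """Stores one reference embedding per enrolled person."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # assumed value; tune on real embeddings
        self.enrolled = {}          # name -> unit-length reference embedding
        self.unknown_count = 0

    def enroll(self, name, clip_embeddings):
        """Store the normalized mean of one or more reference-clip embeddings."""
        unit = [normalize(e) for e in clip_embeddings]
        dims = len(unit[0])
        mean = [sum(e[i] for e in unit) / len(unit) for i in range(dims)]
        self.enrolled[name] = normalize(mean)

    def identify(self, embedding):
        """Return (label, score) for the closest enrolled speaker, if any."""
        query = normalize(embedding)
        best_name, best_score = None, -1.0
        for name, ref in self.enrolled.items():
            score = sum(q * r for q, r in zip(query, ref))
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= self.threshold:
            return best_name, best_score
        # Below threshold: hand out a temporary anonymous label.
        self.unknown_count += 1
        return f"UNKNOWN_{self.unknown_count}", best_score


speakers = EnrolledSpeakers()
# Toy 3-dimensional embeddings standing in for real reference clips.
speakers.enroll("Mikael", [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]])
speakers.enroll("Claude", [[0.0, 1.0, 0.2]])

known_label, known_score = speakers.identify([0.95, 0.05, 0.05])
stranger_label, stranger_score = speakers.identify([0.0, 0.1, 1.0])
print(known_label, stranger_label)
```

Averaging several reference clips per person is optional, but it should make the stored embedding less sensitive to one noisy recording.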

### Wake word per speaker
Combine identification with wake word detection so only enrolled voices can trigger the assistant, giving a simple voice-based access control layer.

---

## VRAM considerations

pyannote diarization adds roughly 1–2 GB of VRAM depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.