# Speaker Diarization

Notes on adding speaker identification to the voice pipeline.

---

## What Whisper does and doesn't do

Whisper is a transcription model — it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.

---

## Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript by timestamp, for example by giving each transcribed segment the speaker whose turn overlaps it most.
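The alignment step can be sketched as a maximum-overlap match between transcript segments and speaker turns. This is a minimal illustration, not the pipeline's actual code: `Turn`, `TranscriptSegment`, and `assign_speakers` are hypothetical names, and timestamps are assumed to be in seconds on a shared clock.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class TranscriptSegment:
    start: float  # seconds
    end: float    # seconds
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled
```

Segments that overlap no turn at all keep the `unknown` label rather than being forced onto the nearest speaker.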

### pyannote.audio

Probably the most capable open-source option. It runs locally and can use a GPU when one is available.

```
pip install pyannote.audio
```

Produces time-stamped segments:
```
SPEAKER_00  0.5s – 3.2s
SPEAKER_01  3.8s – 7.1s
SPEAKER_00  7.5s – 12.0s
```

Feed each segment to Whisper independently to get a transcript with speaker tags.
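A rough sketch of the per-segment approach: cut the audio at each turn's boundaries and pass every chunk to a transcription function. `slice_samples` and `transcribe_turns` are hypothetical helpers, and `transcribe` is an assumed callable that takes a chunk of samples and returns text.

```python
def slice_samples(samples, sample_rate, start, end):
    """Return the samples between start and end (seconds), clamped to the clip."""
    first = max(0, int(round(start * sample_rate)))
    last = min(len(samples), int(round(end * sample_rate)))
    return samples[first:last]


def transcribe_turns(samples, sample_rate, turns, transcribe):
    """Run `transcribe` on each (speaker, start, end) turn and tag the result."""
    tagged = []
    for speaker, start, end in turns:
        chunk = slice_samples(samples, sample_rate, start, end)
        if len(chunk) == 0:
            # Turn lies outside the recording; nothing to transcribe.
            continue
        tagged.append((speaker, transcribe(chunk).strip()))
    return tagged
```

With the `openai-whisper` package, `transcribe` could be something like `lambda chunk: model.transcribe(chunk)["text"]`, assuming `chunk` is a 16 kHz mono float32 NumPy array.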

The pretrained pipelines are gated on Hugging Face: using them requires an access token and an approved model access request (free).
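Once access is granted, a diarization run might look like the sketch below. It assumes the pyannote.audio 3.x API (`Pipeline.from_pretrained` with `use_auth_token`), the `pyannote/speaker-diarization-3.1` checkpoint, and a token stored in an `HF_TOKEN` environment variable; all three are assumptions, so check them against the installed version. `diarize` and `format_turns` are hypothetical helper names.

```python
import os


def diarize(audio_path, token=None, device=None,
            model="pyannote/speaker-diarization-3.1"):
    """Return (speaker, start, end) turns for an audio file using pyannote.audio."""
    # Imported lazily so format_turns() below works without pyannote installed.
    from pyannote.audio import Pipeline

    token = token or os.environ.get("HF_TOKEN")  # assumed variable name
    # Assumption: 3.x keyword; newer releases may expect `token=` instead.
    pipeline = Pipeline.from_pretrained(model, use_auth_token=token)
    if device is not None:
        import torch
        pipeline.to(torch.device(device))  # e.g. "cuda"

    annotation = pipeline(audio_path)
    return [
        (speaker, segment.start, segment.end)
        for segment, _, speaker in annotation.itertracks(yield_label=True)
    ]


def format_turns(turns):
    """Render turns in the 'SPEAKER_00  0.5s – 3.2s' style shown above."""
    return [f"{speaker}  {start:.1f}s – {end:.1f}s" for speaker, start, end in turns]
```

Returning plain tuples keeps the rest of the pipeline independent of pyannote's own types.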

### sherpa-onnx

Has some diarization support in newer versions — worth checking if it can be added to the existing pipeline without a separate Python dependency.

---

## Speaker identification vs diarization

| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model turns each segment into a voice fingerprint (an embedding vector) and compares it against embeddings computed from enrolled reference clips. Options:

- **resemblyzer** — simple cosine similarity on d-vector embeddings
- **wespeaker** — more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain** — full toolkit, heavier dependency
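Whichever model produces the embeddings, the comparison itself is usually a cosine similarity. A minimal sketch, assuming embeddings arrive as plain lists of floats; `rank_speakers` is a hypothetical helper and the vectors in the usage note are toy values, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_speakers(embedding, enrolled):
    """Return (name, score) pairs for all enrolled speakers, best match first."""
    scores = [
        (name, cosine_similarity(embedding, reference))
        for name, reference in enrolled.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

For example, `rank_speakers([0.9, 0.1, 0.0], {"Mikael": [1.0, 0.0, 0.0], "Claude": [0.0, 1.0, 0.0]})` puts `Mikael` first.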

---

## Future experiments

### Multi-speaker transcription
Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:
```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```
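Producing that output from labeled segments is mostly bookkeeping: merge consecutive segments from the same speaker and map anonymous labels to names. A small sketch, where `render_transcript` is a hypothetical helper and the label-to-name mapping is assumed to be supplied by hand, since diarization alone only yields anonymous labels.

```python
def render_transcript(labeled_segments, names=None):
    """Merge consecutive segments from the same speaker into 'Name: text' lines."""
    names = names or {}
    lines = []  # each entry is [display_name, text]
    for speaker, text in labeled_segments:
        text = text.strip()
        if not text:
            continue  # skip empty transcriptions
        display = names.get(speaker, speaker)
        if lines and lines[-1][0] == display:
            lines[-1][1] += " " + text  # same speaker keeps talking
        else:
            lines.append([display, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)
```

Labels without an entry in `names` are printed as-is, so unrecognized speakers still appear in the transcript.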

### Real-time diarization
Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.
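One way to read "rolling speaker embedding" is a running-mean centroid per speaker that is updated with every segment assigned to it. The sketch below takes that reading; `RollingSpeakerTracker` is a hypothetical class and the 0.7 threshold is an untuned placeholder.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RollingSpeakerTracker:
    """Assign anonymous labels online, keeping a running-mean centroid per speaker."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed value; tune on real embeddings
        self.centroids = {}         # label -> running mean embedding
        self.counts = {}            # label -> number of segments averaged

    def label(self, embedding):
        best_label, best_score = None, -1.0
        for name, centroid in self.centroids.items():
            score = _cosine(embedding, centroid)
            if score > best_score:
                best_label, best_score = name, score

        if best_label is None or best_score < self.threshold:
            # No known speaker is close enough: open a new anonymous speaker.
            new_label = f"SPEAKER_{len(self.centroids):02d}"
            self.centroids[new_label] = list(embedding)
            self.counts[new_label] = 1
            return new_label

        # Incremental mean update: c += (x - c) / n
        n = self.counts[best_label] + 1
        centroid = self.centroids[best_label]
        self.centroids[best_label] = [c + (x - c) / n for c, x in zip(centroid, embedding)]
        self.counts[best_label] = n
        return best_label
```

Calling `tracker.label(embedding)` once per VAD segment gives a label that stays stable within a session but, as with any diarization, not across sessions.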

### Voice enrollment
Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.
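The enrollment flow could be sketched as a small registry: average the embeddings of each person's reference clips, store them, and match new segments against the stored references. `EnrolledSpeakers` is a hypothetical class, the JSON file format and the 0.75 threshold are assumptions, and the embeddings themselves are expected to come from one of the models listed earlier.

```python
import json
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class EnrolledSpeakers:
    def __init__(self, threshold=0.75):
        self.threshold = threshold  # assumed value; tune on real data
        self.references = {}        # name -> unit-length reference embedding
        self._unknown_count = 0

    def enroll(self, name, clip_embeddings):
        """Average the embeddings of one or more reference clips for a person."""
        dims = len(clip_embeddings[0])
        count = len(clip_embeddings)
        mean = [sum(e[i] for e in clip_embeddings) / count for i in range(dims)]
        self.references[name] = _normalize(mean)

    def identify(self, embedding):
        """Return (label, score); below the threshold the label is anonymous."""
        query = _normalize(embedding)
        best_name, best_score = None, -1.0
        for name, reference in self.references.items():
            # Dot product of unit vectors equals cosine similarity.
            score = sum(q * r for q, r in zip(query, reference))
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None or best_score < self.threshold:
            # Simplification: every unmatched segment gets a fresh temporary label.
            self._unknown_count += 1
            return f"UNKNOWN_{self._unknown_count}", best_score
        return best_name, best_score

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.references, fh)

    def load(self, path):
        with open(path, encoding="utf-8") as fh:
            self.references = json.load(fh)
```

Storing only the averaged embedding, not the audio, keeps the enrollment file small and avoids keeping raw voice recordings around.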

### Wake word per speaker
Combine identification with wake word detection so only enrolled voices can trigger the assistant — a simple voice-based access control layer.
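The gating logic is a conjunction of two checks: the wake word fired, and the voice matches an enrolled speaker with enough confidence. A minimal sketch, where `should_wake`, both thresholds, and the score ranges are hypothetical and would depend on the detectors actually used.

```python
def should_wake(wake_score, speaker_name, speaker_score, enrolled,
                wake_threshold=0.5, speaker_threshold=0.75):
    """Trigger only if the wake word fired AND the voice is an enrolled speaker."""
    if wake_score < wake_threshold:
        return False  # wake word not detected confidently enough
    # Unknown or low-confidence voices never trigger the assistant.
    return speaker_name in enrolled and speaker_score >= speaker_threshold
```

For example, `should_wake(0.9, "Mikael", 0.8, {"Mikael"})` triggers, while the same call with an unenrolled label does not.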

---

## VRAM considerations

pyannote diarization adds roughly 1–2 GB of VRAM, depending on the model variant. wespeaker ONNX is much lighter. Either one fits comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.