# Speaker Diarization

Notes on adding speaker identification to the voice pipeline.

---

## What Whisper does and doesn't do

Whisper is a transcription model: it converts audio to text. It has no concept of who is speaking. In a multi-speaker conversation it will transcribe everything into one flat stream with no speaker labels.

---

## Adding diarization

Speaker diarization is the task of segmenting audio by speaker: "who spoke when." The standard approach is to run a diarization model before or alongside transcription, then align the speaker segments with the transcript.

### pyannote.audio

The most capable open-source option. Runs locally, GPU-accelerated.

```
pip install pyannote.audio
```

Produces time-stamped segments:

```
SPEAKER_00  0.5s – 3.2s
SPEAKER_01  3.8s – 7.1s
SPEAKER_00  7.5s – 12.0s
```

Feed each segment to Whisper independently → transcript with speaker tags.

Requires a HuggingFace token and model access request (free).

### sherpa-onnx

Has some diarization support in newer versions. Worth checking if it can be added to the existing pipeline without a separate Python dependency.

---

## Speaker identification vs diarization

| Feature | Diarization | Identification |
|---------|------------|----------------|
| Labels speakers | ✓ (anonymous: speaker 1, 2…) | ✓ (named: Mikael, Claude…) |
| Requires enrollment | ✗ | ✓ (reference clip per person) |
| Cross-session consistency | ✗ | ✓ |

For named identification, a speaker embedding model compares voice fingerprints against enrolled reference clips. Options:

- **resemblyzer**: simple cosine similarity on d-vector embeddings
- **wespeaker**: more accurate, available as ONNX (fits existing sherpa-onnx stack)
- **speechbrain**: full toolkit, heavier dependency

---

## Future experiments

### Multi-speaker transcription

Run pyannote diarization on a conversation recording, feed segments to Whisper, produce a tagged transcript:

```
Mikael: How do I juggle oranges?
Claude: Start with two balls and work up from there.
```

### Real-time diarization

Extend the current VAD → Whisper pipeline with a rolling speaker embedding. Each new segment gets a speaker label by comparing its embedding to known speakers. Useful for meeting notes or multi-person dictation.

### Voice enrollment

Record a short reference clip per person. Store embeddings. When a new segment arrives, score it against all enrolled speakers and assign the closest match above a confidence threshold. Unknown speakers get a temporary anonymous label.

### Wake word per speaker

Combine identification with wake word detection so only enrolled voices can trigger the assistant, giving a simple voice-based access control layer.

---

## VRAM considerations

pyannote diarization adds roughly 1–2 GB depending on the model variant. wespeaker ONNX is much lighter. Both fit comfortably alongside the existing Whisper + Chatterbox stack on a 24 GB card.
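
---

## Sketch: aligning speaker turns with transcript segments

The "align the speaker segments with the transcript" step under Adding diarization could look roughly like the following. This is a minimal sketch, not the pipeline's actual code: it assumes the diarizer and the transcriber have both already produced `(start, end, ...)` tuples in seconds, and it labels each transcript segment with the speaker whose turn overlaps it the most. The function names and the transcript timestamps are made up for illustration; the speaker turns reuse the example segments from the pyannote.audio section.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript_segments, speaker_turns):
    """Label each (start, end, text) segment with the most-overlapping speaker.

    Segments that overlap no speaker turn get None as their speaker.
    """
    labeled = []
    for start, end, text in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            shared = overlap(start, end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Speaker turns as in the diarization output example (start, end, label).
speaker_turns = [
    (0.5, 3.2, "SPEAKER_00"),
    (3.8, 7.1, "SPEAKER_01"),
    (7.5, 12.0, "SPEAKER_00"),
]

# Hypothetical transcript segments; the timestamps are invented for this sketch.
transcript_segments = [
    (0.6, 3.0, "How do I juggle oranges?"),
    (3.9, 7.0, "Start with two balls and work up from there."),
]

for speaker, text in assign_speakers(transcript_segments, speaker_turns):
    print(f"{speaker}: {text}")
```

Maximum overlap is only one possible alignment rule. Word-level timestamps or splitting a segment at a speaker change would handle overlapping speech better, at the cost of more bookkeeping.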
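
---

## Sketch: matching a segment to enrolled speakers

The voice enrollment idea under Future experiments (score a new segment against all enrolled speakers and accept the closest match above a threshold) can be sketched with plain cosine similarity. This is an illustration under assumptions, not a finished implementation: the embeddings here are toy 3-dimensional vectors standing in for real output from a model such as resemblyzer or wespeaker, the helper names are hypothetical, and the `0.75` threshold is an arbitrary placeholder that would need tuning on real recordings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(segment_embedding, enrolled, threshold=0.75):
    """Return (name, score) for the closest enrolled speaker.

    If the best score is below the threshold, name is None so the caller
    can assign a temporary anonymous label instead.
    """
    best_name, best_score = None, -1.0
    for name, reference in enrolled.items():
        score = cosine_similarity(segment_embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Toy vectors standing in for stored enrollment embeddings (assumption).
enrolled = {
    "Mikael": [0.9, 0.1, 0.0],
    "Claude": [0.0, 0.2, 0.9],
}

known_name, known_score = identify_speaker([0.8, 0.2, 0.1], enrolled)
unknown_name, unknown_score = identify_speaker([0.5, 0.5, 0.5], enrolled)
print(known_name, round(known_score, 3))
print(unknown_name, round(unknown_score, 3))
```

Returning `None` rather than inventing a label keeps the matching step separate from the bookkeeping for anonymous speakers, which the real-time pipeline would likely want to track on its own.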