STT (Silero VAD + Whisper via sherpa-onnx), Chatterbox TTS HTTP server, query completeness classifier (Ollama), multi-voice demo scripts, and planning docs. Kept as reference; clean rewrite planned in separate repos. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
42 lines
3.7 KiB
Markdown
# Voice Experiment — TODO
## Classifier
- [ ] Add a query journal (JSONL) that logs every STT fragment, each classifier decision, and how each query was resolved (timeout / send-word / classifier). This enables replay and re-classification experiments without live mic input.
- [ ] Tune the classifier: it is still too conservative on plain statements and short instructions.
- [ ] Experiment with a larger classifier model (Qwen 7B or 14B).
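
A minimal sketch of the query journal idea, assuming one JSON object per line with a timestamp and a free-form `kind` field. The function names, the field names, and the `journal.jsonl` path are hypothetical; nothing here reflects an existing implementation.

```python
import json
import time
from pathlib import Path


def log_event(journal_path, kind, **fields):
    """Append one event as a single JSON line and return the record written."""
    # "ts" and "kind" are assumed field names, not an established schema.
    record = {"ts": time.time(), "kind": kind, **fields}
    with Path(journal_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_journal(journal_path):
    """Yield journal records in the order they were written, for offline replay."""
    with Path(journal_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Usage could look like `log_event("journal.jsonl", "resolved", how="send-word")`. Replaying is then a matter of iterating `read_journal(...)` and feeding the stored fragments back through whichever classifier is under test.
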
## Voice switching
- [ ] Allow changing the active voice at runtime via voice command: "switch to the X voice", "use a different voice". This could be a per-project cue (project A = voice A) or just an on-demand preference. Implementation: store the current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
- [ ] Consider a small set of named voices with friendly names, so you can say "switch to the calm voice" rather than a file path.
- [ ] Daily voice rotation: select the voice from a hash of the current date, giving automatic variety without manual choice. Hash the date string, take it mod N voices, and pick from the named voice list.
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial (prepend to the dispatch prompt). The hard part is writing character definitions that hold up over a session rather than collapsing into a generic assistant after a few turns. A shallow character just sounds like a costume.
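
The named-voice list, the state file, and the daily rotation could fit together roughly as follows. This is a sketch under assumptions: the voice names, the `.wav` paths, the JSON layout of the state file, and all function names are placeholders.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

# Hypothetical friendly names mapped to reference-audio paths (placeholders).
VOICES = {
    "calm": "voices/calm.wav",
    "bright": "voices/bright.wav",
    "gruff": "voices/gruff.wav",
}


def daily_voice(today, voices=VOICES):
    """Pick a voice name deterministically from the date: hash(date) mod N."""
    names = sorted(voices)  # fixed order so the choice is stable across runs
    digest = hashlib.sha256(today.isoformat().encode("utf-8")).digest()
    return names[int.from_bytes(digest[:8], "big") % len(names)]


def set_active_voice(state_path, name, voices=VOICES):
    """Write the chosen voice's audio_prompt path to the mutable state file."""
    if name not in voices:
        raise KeyError(f"unknown voice: {name}")
    state = {"voice": name, "audio_prompt": voices[name]}
    Path(state_path).write_text(json.dumps(state), encoding="utf-8")
    return state


def get_active_voice(state_path, voices=VOICES):
    """Read the state file; fall back to today's rotation if it is missing."""
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    name = daily_voice(date.today(), voices)
    return {"voice": name, "audio_prompt": voices[name]}
```

The sketch uses `hashlib` rather than the built-in `hash()` because string hashes in Python are randomized per process by default, which would make the "daily" voice change on every restart.
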
## Pipeline
- [ ] PTY-based Claude Code injection: replace the X11/clipboard paste with a direct PTY write to avoid focus stealing.
- [ ] Voice command routing: detect slash-command intents (rename session, switch session) and dispatch them as `/command` rather than as a prompt.
- [ ] Activation phrase: say "computer" (or similar) to signal the start of a query, and play a short acknowledgement chime (Star Trek style). The pipeline is currently always-on; an activation phrase would make it more intentional.
- [ ] Audio cues: replace or supplement the "Efforting" TTS utterance with short sound effects, one tone on query start/acknowledgement and a different one on dispatch. This gives faster feedback than TTS.
- [ ] Reserved control vocabulary: single words or short phrases, pattern-matched before any classifier runs:

| Word(s) | Action |
|---------|--------|
| `computer` | Generic activate: start accumulating a query, play acknowledgement chime |
| `project note` | Activate with project context injected into the prompt preamble |
| `go` / `send` / `done` | Dispatch current query immediately |
| `cancel` / `scratch that` | Discard accumulated query, return to idle |
| `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
| `switch voice` / `different voice` | Trigger voice-switch flow |

Words are normalized (lowercased, punctuation stripped) before matching, and matching happens before the classifier runs. The classifier only sees real query content.
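
The normalize-then-match step could be sketched as below. The phrases come from the table above; the action labels (`"dispatch"`, `"discard"`, and so on) are hypothetical, and the sketch assumes a control phrase must be the entire STT fragment, which the notes do not specify.

```python
import string

# Phrases taken from the table above; the action names are hypothetical labels.
CONTROL_PHRASES = {
    "computer": "activate",
    "project note": "activate_with_project",
    "go": "dispatch",
    "send": "dispatch",
    "done": "dispatch",
    "cancel": "discard",
    "scratch that": "discard",
    "stop": "shutdown",
    "goodbye": "shutdown",
    "hang up": "shutdown",
    "switch voice": "switch_voice",
    "different voice": "switch_voice",
}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(fragment):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(fragment.lower().translate(_STRIP_PUNCT).split())


def match_control(fragment):
    """Return the action for a fragment that is exactly a control phrase, else None."""
    # Assumption: whole-fragment match only, so "go to the store" is not a command.
    return CONTROL_PHRASES.get(normalize(fragment))
```

With whole-fragment matching, `match_control("Go.")` yields `"dispatch"` while `match_control("go to the store")` yields `None` and falls through to the classifier.
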
- [ ] Classifier rule hierarchy: pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. This makes the system tuneable: LLM stages can be added or removed without touching the pattern layer.
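
One way to read the rule hierarchy is as an ordered list of stages, where each stage either returns a decision or defers to the next. The sketch below assumes that shape; the `"dispatch"` / `"wait"` labels and the stage functions are illustrative, and `fake_llm_stage` is a stand-in that calls no model.

```python
from typing import Callable, List, Optional

# A stage returns a decision string to short-circuit, or None to defer.
Stage = Callable[[str], Optional[str]]


def run_stages(text: str, stages: List[Stage], default: str = "wait") -> str:
    """Run stages in order and return the first non-None decision."""
    for stage in stages:
        decision = stage(text)
        if decision is not None:
            return decision  # short-circuit: later stages never run
    return default


def send_word_stage(text: str) -> Optional[str]:
    """Fast pattern stage: dispatch when the fragment ends with a send word."""
    words = text.lower().strip(" .!?").split()
    return "dispatch" if words and words[-1] in {"go", "send", "done"} else None


def fake_llm_stage(text: str) -> Optional[str]:
    """Stand-in for an LLM completeness classifier (no model is called here)."""
    return "dispatch" if text.rstrip().endswith("?") else None
```

Adding or removing an LLM stage is then a change to the list passed to `run_stages`, for example `[send_word_stage]` versus `[send_word_stage, fake_llm_stage]`, with the pattern layer left untouched.
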
## TTS / Long text
- [ ] Long inputs collapse into gibberish because Chatterbox has a context limit. The tts-server should automatically split text on sentence boundaries and queue the sentences sequentially. `Chatterbox_Tts` already has `speak_streaming`, which does this; use it in the server instead of `speak`.
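
For reference, sentence-boundary splitting with sequential playback can be sketched generically as below. This is not the `speak_streaming` implementation, whose signature is not shown in these notes; `speak` here is any callable that takes one string, the regex rule is deliberately naive, and the 300-character cap is an arbitrary placeholder rather than Chatterbox's actual limit.

```python
import re

# Split after ., ! or ? when followed by whitespace. A deliberately naive rule:
# abbreviations such as "e.g. foo" will be split too.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, max_chars=300):
    """Split text into sentence chunks, hard-wrapping any chunk over max_chars."""
    chunks = []
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        while len(sentence) > max_chars:
            # Break at the last space before the limit, or mid-word if none.
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if sentence:
            chunks.append(sentence)
    return chunks


def speak_long(text, speak, max_chars=300):
    """Feed chunks to `speak` one at a time, in order."""
    for chunk in split_sentences(text, max_chars):
        speak(chunk)
```

The hard-wrap fallback matters for inputs with no sentence punctuation at all (long lists, pasted logs), which would otherwise reach the TTS model as a single oversized chunk.
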
## TTS / Prompt
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist, but they are mostly for expression (laugh, sigh, etc.) and are not useful here.
## STT
- [ ] Try Whisper small.en or medium.en for better accuracy on accents.