Add PRESENTATION.md and update voices/todo notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
61
PRESENTATION.md
Normal file
@@ -0,0 +1,61 @@
# Presenting the Voice Assistant

Notes on how to demonstrate this system effectively.

---

## Core insight

The most compelling demo is an **unscripted real working session** — not a rehearsed script. The system's value is visible when it's actually being used to get things done. A scripted demo flattens the rough edges that make it feel real.

The query journal (when implemented) will make it possible to replay or reconstruct a real session for a recorded demo.

---

## Most interesting things to show

### Voice switching mid-session

Switch between characters and have each one say something in character. The contrast between voices makes the capability concrete. The multi-voice demo scripts in `demos/` are a starting point but an improvised exchange lands better.

### Doing actual work

A real task — adding something to a list, asking a factual question, modifying a file — demonstrates the pipeline end to end. The fact that it works while doing something else (cooking, tidying) is part of the story.

### The classifier working (and not working)

Showing that short fragments don't trigger prematurely, and that send-words like "go" force dispatch, gives a sense of how the pipeline is designed. Showing a case where it misfires, and explaining why, is honest and interesting.

### The "hands-free while doing something else" workflow

If possible, demonstrate speaking while visibly occupied with something physical. This is the use case most people haven't thought about and it's immediately legible.

### The planning capability

This session itself is a good example: several hours of brainstorming, planning, and note-taking without touching a keyboard. The output (TODO.md, VOICES.md, CLEANUP-PLAN.md, WORKFLOWS.md, this file) is the artifact.

---

## What makes a good moment to record

- Natural silence after a query where the system is clearly thinking, then responds
- A voice switch that lands well and sounds good
- A query that gets the classifier right — fragment doesn't dispatch, full sentence does
- An unexpected result from a voice clone that prompts a reaction
- A task that was actually useful, not just a demo task

---

## Things to prepare before a demo

- Voice samples loaded and tested — all voices in `voices.yaml` checked for quality
- TTS server running and warmed up (first response is slow)
- At least one chime configured for acknowledgement
- Silence timeout tuned so it doesn't misfire during normal speech pauses
- A few interesting voice quotes ready to demonstrate character contrast

---

## Framing

The system is not finished. That's fine to say. The interesting thing is that it's **useful now** in its current form, and the direction is clear. Showing the TODO and CLEANUP-PLAN alongside the working demo makes it credible — this is a real project, not a polished proof of concept that's going nowhere.
59
TODO.md
@@ -1,7 +1,7 @@
# Voice Experiment — TODO

## Classifier
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input.
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input. Also useful as a presentation artifact — an unscripted real working session logged end-to-end shows the system better than any scripted demo.
- [ ] Tune classifier — still too conservative on plain statements and short instructions
- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
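
A minimal sketch of the query journal idea, assuming one JSON object per line. The field names (`ts`, `fragments`, `decision`, `resolved_by`) and the helpers `append_entry` and `replay` are hypothetical and not taken from the existing code.

```python
import json
import time
from pathlib import Path

# The three resolution paths named in the TODO item.
RESOLUTIONS = {"timeout", "send-word", "classifier"}


def append_entry(path, fragments, decision, resolved_by, now=None):
    """Append one resolved query to the journal as a single JSON line."""
    if resolved_by not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolved_by!r}")
    entry = {
        "ts": time.time() if now is None else now,
        "fragments": list(fragments),  # raw STT fragments, in arrival order
        "decision": decision,          # what the classifier reported (assumed to be a string)
        "resolved_by": resolved_by,    # timeout / send-word / classifier
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def replay(path):
    """Yield journal entries in order, e.g. for re-classification experiments."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Replaying is then a plain loop over `replay(path)`, feeding each entry's `fragments` to a candidate classifier with no microphone involved.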
@@ -11,17 +11,33 @@
- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
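
The daily rotation could be sketched roughly as follows. The function name `voice_for_date` and the choice of SHA-256 are assumptions; any stable hash would do.

```python
import hashlib
from datetime import date


def voice_for_date(voices, day=None):
    """Pick a voice deterministically: hash of the date string, mod N voices."""
    if not voices:
        raise ValueError("voice list is empty")
    day = date.today() if day is None else day
    # hashlib rather than the built-in hash(): str hashes are salted per
    # process, so hash() could pick a different voice after every restart.
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    return voices[int.from_bytes(digest[:8], "big") % len(voices)]
```

The same date always maps to the same voice, so a restart mid-day does not change the voice. Adding or removing a voice reshuffles the whole schedule, which is probably acceptable for this purpose.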

## Service management
- [ ] systemd units for tts-server and query-demo (STT pipeline) once the clean repos exist
- [ ] Claude Code needs a way to restart these services after making changes — options:
  - Dedicated sudo rules for specific restart commands only (e.g. `sudo systemctl restart tts-server`)
  - Via CCC conduit — safer since it goes through the user's session
  - **Security note**: giving Claude Code service restart capability creates a prompt injection risk — malicious content in a voice query could trigger a restart or code reload that runs attacker-controlled code. Mitigations:
    1. Restart-only sudo rules — no deploy, no code pull
    2. Route through CCC conduit rather than direct sudo
    3. Run each service as a dedicated system user with minimal filesystem access — read-only access to its own code and models, write access only to its own log/state, no access to the rest of the system. Damage from compromise is contained to what that user can reach.

## Pipeline
- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
- [ ] Hardware interface — dedicated physical buttons:
  - **Toggle switch + LED**: always-on ambient mode, LED indicates listening state visually from across the room, also signals to others in the room
  - **Momentary push-to-talk button**: for discrete single exchanges without enabling ambient mode
  - The two modes complement each other — set toggle for a working session, use PTT for occasional quick queries
  - Abort button (PB2 from earlier concept): held while releasing PTT discards the recording before it dispatches
- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
- [ ] Reserved control vocabulary — single words, pattern matched before any classifier:

| Word(s) | Action |
|---------|--------|
| `computer` | Generic activate — start accumulating a query, play acknowledgement chime |
| `computer` | Generic activate — route through normal classifier pipeline |
| `project note` | Activate with project context injected into the prompt preamble |
| `ask Claude` / `hey Claude` | Activate and force remote dispatch — bypass local classifier entirely |
| `go` / `send` / `done` | Dispatch current query immediately |
| `cancel` / `scratch that` | Discard accumulated query, return to idle |
| `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
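
A rough sketch of matching the reserved vocabulary in the table above before any classifier runs. The action names and the helper `match_control` are hypothetical. It also assumes that activation phrases match as a prefix while send/cancel/stop words only match when they are the whole fragment, so that "go to the store" does not dispatch; the notes do not settle that question.

```python
import re

# Activation phrases: matched at the start of a fragment (assumption).
ACTIVATORS = {
    "computer": "activate",
    "project note": "activate_project",
    "ask claude": "activate_remote",
    "hey claude": "activate_remote",
}

# Send / cancel / stop words: matched only as the entire fragment (assumption).
TERMINATORS = {
    "go": "dispatch", "send": "dispatch", "done": "dispatch",
    "cancel": "cancel", "scratch that": "cancel",
    "stop": "shutdown", "goodbye": "shutdown", "hang up": "shutdown",
}


def normalise(fragment):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.findall(r"[a-z']+", fragment.lower()))


def match_control(fragment):
    """Return (action, remainder), or None if no reserved phrase matches."""
    text = normalise(fragment)
    if text in TERMINATORS:
        return TERMINATORS[text], ""
    for phrase, action in ACTIVATORS.items():
        if text == phrase or text.startswith(phrase + " "):
            return action, text[len(phrase):].strip()
    return None
```

For example, `match_control("Computer, what time is it?")` yields the `activate` action with `"what time is it"` as the remainder, while `match_control("go to the store")` returns `None` and falls through to the classifier.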
@@ -31,11 +47,50 @@

- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
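
One way the stage hierarchy might look, as a sketch only. Each stage returns a decision or `None` to defer; the first non-`None` result short-circuits. All names here are hypothetical, the three-word threshold is an arbitrary assumption, and `stand_in_llm_stage` is a placeholder where a real model call would go.

```python
def send_word_stage(text):
    """Fast pattern stage: a bare send-word forces dispatch."""
    return "dispatch" if text.strip().lower() in {"go", "send", "done"} else None


def short_fragment_stage(text):
    """Fast pattern stage: very short fragments keep accumulating."""
    return "wait" if len(text.split()) < 3 else None


def stand_in_llm_stage(text):
    """Placeholder for an optional LLM classifier (not a real model call)."""
    return "dispatch" if text.rstrip().endswith("?") else None


def run_stages(text, stages, default="wait"):
    """Run stages in order; return (decision, name of the stage that decided)."""
    for stage in stages:
        decision = stage(text)
        if decision is not None:
            return decision, stage.__name__
    return default, "default"
```

Because the pattern stages come first in the list, LLM stages can be added or dropped by editing the `stages` list alone. Returning the deciding stage's name also gives the query journal something concrete to log.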

## Query review before dispatch
- [ ] **Readback** — control word "read back" or "read it back" speaks the accumulated query verbatim via TTS before dispatching. Lets you catch STT errors before they reach Claude.
- [ ] **Summary** — control word "summarize" runs the accumulated text through a local Ollama model and speaks a brief summary. Useful for longer dictations to confirm the intent was captured correctly.
- [ ] **In-place edit** — after a readback, say an edit instruction like "change X to Y" or "remove the last sentence". Two implementation options:
  1. Pre-dispatch: local LLM applies the edit to the accumulated text before sending — cleaner but adds latency
  2. Annotated dispatch: send the original query plus the edit instruction as separate sections; the frontier model applies the edit itself — simpler to implement, relies on Claude's instruction following
- [ ] **Revert** — discard the current accumulated text and start fresh without dispatching
- [ ] **Working on it** reminder — after dispatching to Claude Code, if no TTS response has come back after N seconds (e.g. 10s), speak "still working on it". Implementation: tts-server tracks timestamp of last /speak call (expose via GET /health or GET /status); query-demo polls after dispatch and fires the reminder if the timestamp hasn't advanced. Single-client assumption keeps this simple — any new /speak means Claude responded.
- [ ] **Multi-client with named roles** — extend the window ID concept to allow multiple named Claude instances, each with its own voice, role, and system prompt. The TTS server uses the client ID to select the correct voice automatically. Voice queries can be addressed to a specific instance: "brainstorm, what if we..." or "notes, add item...". Demo concept: one instance takes notes while another brainstorms — different voices, parallel contexts, all addressable by name. Implementation sketched: client ID passed in /speak body, tts-server maps client IDs to voice names in config.
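
The "working on it" reminder logic could be sketched like this, under the single-client assumption. The HTTP call is abstracted behind `get_last_speak`, a hypothetical callable that returns the server's last `/speak` timestamp (or `None`); both function names and the one-second poll interval are assumptions, and all timestamps are assumed to come from the same clock.

```python
import time


def should_remind(dispatched_at, last_speak_at, now, threshold=10.0):
    """True if nothing was spoken since dispatch and `threshold` seconds passed."""
    responded = last_speak_at is not None and last_speak_at > dispatched_at
    return (not responded) and (now - dispatched_at) >= threshold


def wait_for_response(get_last_speak, remind, dispatched_at,
                      threshold=10.0, poll_interval=1.0,
                      clock=time.time, sleep=time.sleep):
    """Poll until the /speak timestamp advances; fire the reminder at most once.

    Returns True if the reminder was fired before the response arrived.
    """
    reminded = False
    while True:
        last = get_last_speak()
        if last is not None and last > dispatched_at:
            return reminded
        if not reminded and should_remind(dispatched_at, last, clock(), threshold):
            remind()
            reminded = True
        sleep(poll_interval)
```

One thing the sketch sidesteps: if the reminder itself is spoken through `/speak`, it advances the same timestamp, so a real implementation would probably need to tag or exclude reminder calls.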

## Configuration / Context
- [ ] Central config file (YAML) — "context" in the broad sense: all settings, preferences, and routing rules in one place. Single source of truth, no scattered CLI args or hardcoded strings. Includes:
  - **Utterances / chimes**: what to say (or what sound to play) for each system event — ready, query received, dispatching, error, session end
  - **Activation phrases**: each phrase maps to a context definition containing:
    - Routing target: local classifier, specific local handler, Ollama model, or force-remote (Claude)
    - System prompt preamble to inject
    - Active voice override (optional)
    - Per-role phrase lists — a list of utterances for each system event, picked randomly for variety and character:
      - `acknowledgement`: what to say when a query is received ("On it", "Understood", "Affirmative")
      - `working`: what to say when taking a while ("Still working on it", "Give me a moment")
      - `ready`: what to say at startup ("Ready", "Systems online")
      - `cancel`: what to say when a query is discarded

      Gosalyn would have different phrases than Ivanova or Harold Smith — the randomness plus character-appropriate lines makes each role feel alive rather than mechanical.
  - **Control word vocabulary**: cancel, send, hang up, switch voice, etc.
  - **Pipeline settings**: silence timeout, Whisper model, classifier model and prompt, which stages are enabled
  - **TTS settings**: server URL, default voice
  - **Local handlers**: time, date, current voice, session status — answered without hitting any LLM
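
A sketch of how the per-role phrase lists might be looked up once loaded from the config file. The dictionary stands in for the parsed YAML; the `default` role, the `ivanova` entry, the "Cancelled" line, and `pick_phrase` are all illustrative assumptions.

```python
import random

# Stand-in for the parsed YAML config (structure is an assumption).
PHRASES = {
    "default": {
        "acknowledgement": ["On it", "Understood", "Affirmative"],
        "working": ["Still working on it", "Give me a moment"],
        "ready": ["Ready", "Systems online"],
        "cancel": ["Cancelled"],  # placeholder line, not from the notes
    },
    "ivanova": {
        "acknowledgement": ["Understood"],  # placeholder character line
    },
}


def pick_phrase(role, event, rng=random):
    """Pick a random line for the event, falling back to the default role."""
    lines = PHRASES.get(role, {}).get(event) or PHRASES["default"].get(event)
    if not lines:
        raise KeyError(f"no phrases configured for event {event!r}")
    return rng.choice(lines)
```

The fallback means a new character only needs to define the lines that differ from the defaults. Passing a seeded `random.Random` as `rng` makes the choice repeatable for testing.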

## TTS / Long text
- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
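
For illustration, splitting on sentence boundaries can be as simple as the sketch below. This is not the `speak_streaming` implementation, whose internals are not shown here; `split_for_tts` and the 300-character budget are assumptions, and the real Chatterbox limit may differ.

```python
import re

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_for_tts(text, max_chars=300):
    """Split text on sentence boundaries, packing sentences up to max_chars.

    A single sentence longer than max_chars is passed through unsplit.
    """
    chunks, current = [], ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be queued for synthesis in order. Abbreviations such as "e.g." will cause extra splits with a regex this naive, which mostly costs a slightly odd pause rather than gibberish.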

## TTS / Prompt
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
- [ ] TTS response normalisation skill / prompt rules — common technical terms that sound wrong when spoken verbatim:
  - `TODO` → `to-do` (not "toddo" or spelled out)
  - `README` → `read me` (not "ree-duh-mee" or spelled out)
  - `NOTES.md` → `notes dot m d` or just `notes`
  - General rule: hyphenate compound words that should be spoken as two words; lowercase filenames; spell out true acronyms (API → A P I) but not others
  - Could be a post-processing step on Claude's output before it reaches TTS, or a set of explicit instructions in the voice prompt
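
If the post-processing route is taken, the rules above could be sketched as a small text filter. The tables and `normalise_for_tts` are hypothetical, only `.md` filenames are handled, and the acronym list would need to be maintained by hand.

```python
import re

# Spoken replacements taken from the list above.
SPOKEN_FORMS = {"TODO": "to-do", "README": "read me"}
# "True" acronyms that should be spelled out letter by letter.
ACRONYMS = {"API"}

_MD_FILE = re.compile(r"\b([A-Za-z][\w-]*)\.md\b")
_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def normalise_for_tts(text):
    """Rewrite filenames and all-caps terms so they read naturally aloud."""
    def fix_filename(match):
        stem = match.group(1)
        # Lowercase the stem unless it has a dedicated spoken form.
        return f"{SPOKEN_FORMS.get(stem, stem.lower())} dot m d"

    def fix_word(match):
        word = match.group(0)
        if word in SPOKEN_FORMS:
            return SPOKEN_FORMS[word]
        if word in ACRONYMS:
            return " ".join(word)  # API -> A P I
        return word                # unknown caps words are left alone

    # Filenames first, so NOTES.md is handled before the bare-word pass.
    return _CAPS_WORD.sub(fix_word, _MD_FILE.sub(fix_filename, text))
```

A filter like this is deterministic, whereas prompt instructions depend on the model following them; the two approaches could also be combined.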

## Remote access
- [ ] Remote voice via efforting.tech app — preferred path. Architecture: VAD runs on the phone (Silero is a small ONNX model, runs fine on mobile), silence is never sent, only completed speech segments are transmitted. Each segment boundary is a clean latency reset. Server receives discrete segments and processes them identically to the desktop pipeline. Different port/protocol from file sync since voice has different latency and reliability requirements.
- [ ] Alternative: standalone WebRTC page — browser captures mic, streams to Node server, same STT/TTS pipeline, audio back. Self-hosted, works on any phone browser with no app install.
- [ ] Alternative: Discord bot — replace parec/pacat with discordjs/voice (Opus codec); join channel via slash command. More involved, ties the system to Discord.

## STT
- [ ] Try Whisper small.en or medium.en for better accuracy on accents
@@ -18,11 +18,13 @@ Voices to prepare reference clips for.
| John Fleck | Enterprise (Silik the Suliban) | Distinctive raspy voice — Suliban scenes may have atmosphere noise |
| Steve Bacic | Andromeda (Telemachus Rhade) | — |
| Alex Diakun | Andromeda (Perseid character, name ~"Atune"), Stargate SG-1 (science role) | Prolific Vancouver sci-fi character actor, often plays scientists/scholars — distinctive voice |
| Tim Russ (Tuvok) | Star Trek Voyager — Tuvok; also TNG "Starship Mine" as a human mercenary | Measured, deep Vulcan delivery — try ready room/quarters scenes for less ship hum. TNG villain role offers different vocal performance from same actor. |
| Unknown actress | Andromeda S4/S5 (Dylan's love interest), Babylon 5 (Mars rebellion leader) | Possibly Marjorie Monaghan (Number One in B5) — unconfirmed |
| Claudia Christian | Babylon 5 — Commander Ivanova | User already has a voice clip ready |
| Claudia Christian | Babylon 5 — Commander Ivanova | Clip ready but mispronounces her own name — splits "Ivan" and "Nova" instead of correct stress (i-VA-no-va). Find a clip where she introduces herself: "Commander Susan Ivanova" or "Ivanova speaking". |

## Notes

- **Self-introduction in reference clip** — include a clip of the character saying their own name, so the model learns the correct pronunciation. Many characters have distinctive names with non-obvious stress. Introductory scenes ("I'm Commander Ivanova", "The name's Remmie") are ideal for this.
- Animated/cartoon voices (e.g. Darkwing Duck) don't clone well — too far outside natural human speech distribution
- Compressed/heavily post-processed audio (spaceship hum, background score) degrades results even after noise reduction
- OGG vs WAV quality difference is likely source quality, not encoding — soundfile handles both