claude-voice-experiment/TODO.md
mikael-lovqvists-claude-agent 20873be786 Add README, faster-whisper backend, and session fixes
- README explaining experimental/transparency purpose
- faster-whisper STT backend (fw-stt.mjs, faster-whisper-server.py, install-faster-whisper.sh)
- Bug fixes: Buffer alignment in on_audio, --debug-waveform URL parsing, silent fetch errors, instant dispatch timer leak
- Global uncaughtException/unhandledRejection handlers in query-demo.mjs
- Design docs: CHANGELOG, COMMAND-DISPATCH, INTERFACE-THEORY, VOICE-POLICY
- Systemd service unit templates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 06:39:14 +00:00


# Voice Experiment — TODO
## Classifier
- [ ] Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input. Also useful as a presentation artifact — an unscripted real working session logged end-to-end shows the system better than any scripted demo.
- [ ] Tune classifier — still too conservative on plain statements and short instructions
- [ ] Experiment with larger classifier model (Qwen 7B or 14B)
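
A minimal sketch of the query journal idea, assuming one JSON object per line and a hypothetical `kind` field to tell STT fragments, classifier decisions, and resolutions apart. The field names are illustrative, not an existing format in this repo.

```python
import io
import json
import time


def journal_event(stream, kind, **fields):
    """Append one event as a single JSON line. `kind` is a hypothetical event type."""
    record = {"ts": time.time(), "kind": kind, **fields}
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def replay(stream):
    """Yield journal events in order, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Demo against an in-memory stream; a real journal would be a file opened in append mode.
buf = io.StringIO()
journal_event(buf, "stt_fragment", text="what time is it")
journal_event(buf, "classifier", decision="complete")
journal_event(buf, "resolved", how="send-word")
buf.seek(0)
events = list(replay(buf))
```

Because each line is independent, a replay tool can feed the `stt_fragment` events back through a different classifier and compare its decisions with the logged ones.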
## Voice switching
- [ ] Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current `audio_prompt` path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
- [ ] Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
- [ ] Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
- [ ] Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
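
The daily rotation item could look roughly like this. The voice names are placeholders, and SHA-256 is just one stable choice of hash. Python's built-in `hash()` is randomized per process for strings, so it would not give the same voice across restarts.

```python
import hashlib
from datetime import date

VOICES = ["calm", "bright", "gravel"]  # hypothetical friendly names


def voice_for_day(day, voices=VOICES):
    """Pick a voice deterministically from the ISO date string, mod N voices."""
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(voices)
    return voices[index]


todays_voice = voice_for_day(date.today())
```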
## Service management
- [ ] systemd units for tts-server and query-demo (STT pipeline) once the clean repos exist
- [ ] Claude Code needs a way to restart these services after making changes. Options:
  - Dedicated sudo rules for specific restart commands only (e.g. `sudo systemctl restart tts-server`)
  - Via CCC conduit: safer, since it goes through the user's session
  - **Security note**: giving Claude Code service restart capability creates a prompt injection risk. Malicious content in a voice query could trigger a restart or code reload that runs attacker-controlled code. Mitigations:
    1. Restart-only sudo rules: no deploy, no code pull
    2. Route through CCC conduit rather than direct sudo
    3. Run each service as a dedicated system user with minimal filesystem access: read-only access to its own code and models, write access only to its own log/state, no access to the rest of the system. Damage from a compromise is contained to what that user can reach.
## Pipeline
- [ ] PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing
- [ ] Voice command routing — detect slash command intents (rename session, switch session) and dispatch as `/command` rather than a prompt
- [ ] Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.
- [ ] Hardware interface with dedicated physical buttons:
  - **Toggle switch + LED**: always-on ambient mode, LED indicates listening state visually from across the room, also signals to others in the room
  - **Momentary push-to-talk button**: for discrete single exchanges without enabling ambient mode
  - The two modes complement each other: set the toggle for a working session, use PTT for occasional quick queries
  - Abort button (PB2 from earlier concept): held while releasing PTT, it discards the recording before it dispatches
- [ ] Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.
- [ ] Reserved control vocabulary: single words or short phrases, pattern matched before any classifier:

  | Word(s) | Action |
  |---------|--------|
  | `computer` | Generic activate: route through normal classifier pipeline |
  | `project note` | Activate with project context injected into the prompt preamble |
  | `ask Claude` / `hey Claude` | Activate and force remote dispatch, bypassing the local classifier entirely |
  | `go` / `send` / `done` | Dispatch current query immediately |
  | `cancel` / `scratch that` | Discard accumulated query, return to idle |
  | `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
  | `switch voice` / `different voice` | Trigger voice-switch flow |
  | `retry` / `computer retry` | Abort hung Claude API call and retry: sends Escape, waits ~300 ms, then sends Enter to the active terminal |

  Input is normalized (lowercased, punctuation stripped) and checked against this vocabulary before the classifier runs. The classifier only sees real query content.
- [ ] Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
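
A rough sketch of the pattern layer and the short-circuiting stage chain, assuming whole-utterance matching against a subset of the table. The action labels (`dispatch`, `discard`, ...) are made up for illustration, and an LLM classifier would simply be another callable in the `stages` list.

```python
import string

# Subset of the reserved vocabulary; action labels are hypothetical.
CONTROL_WORDS = {
    "go": "dispatch", "send": "dispatch", "done": "dispatch",
    "cancel": "discard", "scratch that": "discard",
    "stop": "shutdown", "goodbye": "shutdown", "hang up": "shutdown",
    "switch voice": "voice_switch", "different voice": "voice_switch",
    "retry": "retry", "computer retry": "retry",
}

_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(utterance):
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(utterance.lower().translate(_STRIP_PUNCTUATION).split())


def match_control(utterance):
    """Return an action if the whole utterance is a control phrase, else None."""
    return CONTROL_WORDS.get(normalize(utterance))


def classify(utterance, stages):
    """Run stages in order; the first one returning non-None short-circuits the rest."""
    for stage in stages:
        result = stage(utterance)
        if result is not None:
            return result
    return None
```

Matching the whole utterance rather than a substring keeps a sentence like "please send the email" from triggering a dispatch.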
## Query review before dispatch
- [ ] **Readback** — control word "read back" or "read it back" speaks the accumulated query verbatim via TTS before dispatching. Lets you catch STT errors before they reach Claude.
- [ ] **Summary** — control word "summarize" runs the accumulated text through a local Ollama model and speaks a brief summary. Useful for longer dictations to confirm the intent was captured correctly.
- [ ] **In-place edit**: after a readback, say an edit instruction like "change X to Y" or "remove the last sentence". Two implementation options:
  1. Pre-dispatch: local LLM applies the edit to the accumulated text before sending. Cleaner, but adds latency.
  2. Annotated dispatch: send the original query plus the edit instruction as separate sections; the frontier model applies the edit itself. Simpler to implement, relies on Claude's instruction following.
- [ ] **Revert** — discard the current accumulated text and start fresh without dispatching
- [ ] **Working on it** reminder — after dispatching to Claude Code, if no TTS response has come back after N seconds (e.g. 10s), speak "still working on it". Implementation: tts-server tracks timestamp of last /speak call (expose via GET /health or GET /status); query-demo polls after dispatch and fires the reminder if the timestamp hasn't advanced. Single-client assumption keeps this simple — any new /speak means Claude responded.
- [ ] **Multi-client with named roles**: extend the window ID concept to allow multiple named Claude instances, each with its own voice, role, and system prompt. The TTS server uses the client ID to select the correct voice automatically. Voice queries can be addressed to a specific instance: "brainstorm, what if we..." or "notes, add item...". Demo concept: one instance takes notes while another brainstorms, with different voices and parallel contexts, all addressable by name. Implementation sketch: client ID passed in the /speak body; tts-server maps client IDs to voice names in config.
  - **Activation phrase routing**: two distinct modes triggered by the same named prefix:
    - Activation phrase *alone* in an utterance → enters interactive query mode for that agent's session (subsequent speech becomes the query)
    - Activation phrase *followed immediately by a command* in the same utterance → dispatches instantly without entering query mode; ideal for brief one-shot requests like "Archangel, add a compact command"
  - **Administrator role concept**: one named instance (e.g. Archangel with Archangel voice) manages the voice system itself. Example: while working on a project in another session, say "Archangel, add compact to the vocabulary". Archangel modifies the voice assistant code, restarts the service, and the new feature is immediately available to the active session without any context switch.
  - **Agent-restart flow**: Archangel (or any admin role) should be able to restart voice services after modifying them. Pairs with the CCC conduit restart approach from Service management.
  - **Team overview**: a meta-query like "who are my characters?" or "what's my team?" lists all currently defined named roles with their voices and assigned contexts. Could be answered locally from the config without hitting any LLM.
  - **Typical role taxonomy**: general-purpose roles (notes, calendar, planning, life admin) alongside project-specific roles. Multiple project roles can run simultaneously for parallel project work. One "main" character serves as the default when no specific role is addressed.
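
The two activation phrase modes could be told apart along these lines. This is a sketch only: the role names come from the examples above, and the returned dict shape is an assumption.

```python
import re

AGENTS = {"archangel", "brainstorm", "notes"}  # role names taken from the examples


def route_activation(utterance, agents=AGENTS):
    """Return routing info if the utterance starts with a known role name, else None."""
    words = re.findall(r"[a-z0-9']+", utterance.lower())
    if not words or words[0] not in agents:
        return None
    agent, rest = words[0], " ".join(words[1:])
    if not rest:
        # Name alone: subsequent speech becomes the query for this agent.
        return {"agent": agent, "mode": "interactive", "command": None}
    # Name plus command in one utterance: dispatch without entering query mode.
    return {"agent": agent, "mode": "instant", "command": rest}
```

Utterances that do not start with a role name return `None` and would fall through to the "main" character.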
## Analytics / Observability
- [ ] **Per-turn token analysis script** — read a Claude Code session JSONL and report: total tokens spent, cache hit rate per turn, input/output ratio, and most expensive turns. Session files already contain full usage data (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) per assistant message. Would make the cost of the voice system prompt injection directly visible, and help tune injection frequency. Stretch goal.
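
A possible starting point for the analysis script. It assumes each JSONL line is an object whose `message.usage` dict carries the four fields named above; the exact nesting is an assumption and should be checked against a real session file. The cache hit rate here is defined as cache reads over all input tokens, which is one reasonable definition among several.

```python
import json


def summarize_usage(lines):
    """Aggregate per-turn token usage from session JSONL lines."""
    turns = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not usage:
            continue  # user turns and other records carry no usage block
        cached = usage.get("cache_read_input_tokens", 0)
        total_in = (usage.get("input_tokens", 0) + cached
                    + usage.get("cache_creation_input_tokens", 0))
        turns.append({
            "input": total_in,
            "output": usage.get("output_tokens", 0),
            "cache_hit_rate": cached / total_in if total_in else 0.0,
        })
    total_in = sum(t["input"] for t in turns)
    total_out = sum(t["output"] for t in turns)
    by_cost = sorted(range(len(turns)),
                     key=lambda i: turns[i]["input"] + turns[i]["output"],
                     reverse=True)
    return {
        "turns": turns,
        "total_tokens": total_in + total_out,
        "input_output_ratio": total_in / total_out if total_out else None,
        "most_expensive": by_cost[:3],  # indices into "turns"
    }


# Synthetic sample data, not taken from a real session.
sample = [
    json.dumps({"type": "user", "message": {"content": "hi"}}),
    json.dumps({"type": "assistant", "message": {"usage": {
        "input_tokens": 100, "cache_read_input_tokens": 900,
        "cache_creation_input_tokens": 0, "output_tokens": 50}}}),
    json.dumps({"type": "assistant", "message": {"usage": {
        "input_tokens": 200, "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 1800, "output_tokens": 100}}}),
]
report = summarize_usage(sample)
```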
## Context management
- [ ] **Voice prompt injection frequency** — the system prompt is currently injected into every single dispatch, which wastes tokens as context grows. Consider injecting it only every N queries (e.g. every 3rd), or only when the session is new/resumed. The instructions are mostly stable; Claude doesn't need a reminder every time.
- [ ] **Context size warning** — before context grows large enough to trigger auto-compact, the agent should proactively warn: "Your context is getting large — do you want to compact now?" This gives the user a chance to agree before compaction causes a sudden multi-minute pause mid-session. The human should have agency over when compaction happens; an unexpected quiet five minutes is a worse experience than a mutual decision.
- [ ] **On-demand compact command** — "compact" or "computer compact" as a reserved control word that triggers `/compact` immediately. Pairs with the warning above — agent warns, user says compact.
- [ ] **Compact frequency** — consider compacting more aggressively (smaller threshold) now that on-demand compact and warnings make it less jarring. A smaller context is faster and cheaper per query.
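
The injection frequency rule is small enough to state as code. A sketch, assuming the dispatcher keeps a zero-based per-session query counter and knows whether the session was just started or resumed (both hypothetical inputs):

```python
def should_inject_voice_prompt(query_index, every_n=3, session_is_fresh=False):
    """Inject on a new or resumed session, and otherwise on every Nth dispatch."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return session_is_fresh or query_index % every_n == 0


schedule = [should_inject_voice_prompt(i) for i in range(6)]
```

With `every_n=1` this reduces to the current behaviour of injecting on every dispatch.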
## Configuration / Context
- [ ] Central config file (YAML): "context" in the broad sense, meaning all settings, preferences, and routing rules in one place. Single source of truth, no scattered CLI args or hardcoded strings. Includes:
  - **Utterances / chimes**: what to say (or what sound to play) for each system event: ready, query received, dispatching, error, session end
  - **Activation phrases**: each phrase maps to a context definition containing:
    - Routing target: local classifier, specific local handler, Ollama model, or force-remote (Claude)
    - System prompt preamble to inject
    - Active voice override (optional)
    - Per-role phrase lists: a list of utterances for each system event, picked randomly for variety and character:
      - `acknowledgement`: what to say when a query is received ("On it", "Understood", "Affirmative")
      - `working`: what to say when taking a while ("Still working on it", "Give me a moment")
      - `ready`: what to say at startup ("Ready", "Systems online")
      - `cancel`: what to say when a query is discarded

      Gosalyn would have different phrases than Ivanova or Harold Smith. The randomness plus character-appropriate lines makes each role feel alive rather than mechanical.
  - **Control word vocabulary**: cancel, send, hang up, switch voice, etc.
  - **Pipeline settings**: silence timeout, Whisper model, classifier model and prompt, which stages are enabled
  - **TTS settings**: server URL, default voice
  - **Local handlers**: time, date, current voice, session status, all answered without hitting any LLM
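
Once the YAML is loaded into a dict, per-role phrase selection might look like this. The key layout (`roles` → role → `phrases` → event), the `Discarded` and `Acknowledged` lines, and the fallback to a `default` role are all assumptions for illustration.

```python
import random

# Hypothetical shape of the loaded config; only part of it is shown.
CONFIG = {
    "roles": {
        "default": {
            "phrases": {
                "acknowledgement": ["On it", "Understood", "Affirmative"],
                "working": ["Still working on it", "Give me a moment"],
                "ready": ["Ready", "Systems online"],
                "cancel": ["Discarded"],
            },
        },
        "archangel": {
            "voice": "archangel",
            "phrases": {"acknowledgement": ["Acknowledged"]},
        },
    },
}


def pick_phrase(config, role, event, rng=random):
    """Pick a random phrase for the role, falling back to the default role."""
    roles = config["roles"]
    phrases = roles.get(role, {}).get("phrases", {}).get(event)
    if not phrases:
        phrases = roles["default"]["phrases"][event]
    return rng.choice(phrases)
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible, which helps when replaying a logged session.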
## TTS / Long text
- [ ] Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. `Chatterbox_Tts` already has `speak_streaming` that does this — use it in the server instead of `speak`.
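
For reference, sentence-boundary splitting can be sketched as below. This is not how `speak_streaming` is implemented, just an illustration of the idea; the 300-character budget is an arbitrary placeholder, not Chatterbox's actual limit.

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_for_tts(text, max_chars=300):
    """Group whole sentences into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still comes out as one oversized chunk; splitting inside a sentence would need a secondary rule (commas, clause boundaries).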
## TTS / Prompt
- [ ] Filename and path readback sounds bad when names are all-caps (e.g. `ARCHITECTURE.md` gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
- [ ] TTS response normalisation skill / prompt rules for common technical terms that sound wrong when spoken verbatim:
  - `TODO` → `to-do` (not "toddo" or spelled out)
  - `README` → `read me` (not "ree-duh-mee" or spelled out)
  - `NOTES.md` → `notes dot m d` or just `notes`
  - General rule: hyphenate compound words that should be spoken as two words; lowercase filenames; spell out true acronyms (API → A P I) but not others
  - Could be a post-processing step on Claude's output before it reaches TTS, or a set of explicit instructions in the voice prompt
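
If the post-processing route is taken, a first pass could look like this. The acronym set and the list of recognised file extensions are illustrative assumptions and would need to grow with real usage.

```python
import re

SPOKEN_FORMS = {"TODO": "to-do", "README": "read me"}
ACRONYMS = {"API", "TTS", "STT"}  # spelled out letter by letter; illustrative set

# A word, optionally followed by one of a few known file extensions.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.(?:md|py|mjs|sh)\b)?")


def normalize_for_speech(text):
    """Rewrite tokens that sound wrong when read verbatim by TTS."""
    def speak(match):
        stem, dot, ext = match.group(0).partition(".")
        if stem in SPOKEN_FORMS:
            spoken = SPOKEN_FORMS[stem]
        elif stem in ACRONYMS:
            spoken = " ".join(stem)
        elif stem.isupper() and len(stem) > 1:
            spoken = stem.lower()  # all-caps names would otherwise be spelled out
        else:
            spoken = stem
        if dot:
            spoken += " dot " + " ".join(ext.lower())
        return spoken

    return _TOKEN.sub(speak, text)
```

Restricting the extension pattern to a known list avoids mangling ordinary sentence punctuation, at the cost of missing unlisted file types.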
## Remote access
- [ ] Remote voice via efforting.tech app — preferred path. Architecture: VAD runs on the phone (Silero is a small ONNX model, runs fine on mobile), silence is never sent, only completed speech segments are transmitted. Each segment boundary is a clean latency reset. Server receives discrete segments and processes them identically to the desktop pipeline. Different port/protocol from file sync since voice has different latency and reliability requirements.
- [ ] Alternative: standalone WebRTC page — browser captures mic, streams to Node server, same STT/TTS pipeline, audio back. Self-hosted, works on any phone browser with no app install.
- [ ] Alternative: Discord bot — replace parec/pacat with discordjs/voice (Opus codec); join channel via slash command. More involved, ties the system to Discord.
## STT
- [ ] Try Whisper small.en or medium.en for better accuracy on accents