claude-voice-experiment/TODO.md
2026-05-30 06:40:22 +00:00

Voice Experiment — TODO

Classifier

  • Add query journal (JSONL) logging all STT fragments, classifier decisions, and how each query was resolved (timeout / send-word / classifier). Enables replay and re-classification experiments without live mic input. Also useful as a presentation artifact — an unscripted real working session logged end-to-end shows the system better than any scripted demo.
  • Tune classifier — still too conservative on plain statements and short instructions
  • Experiment with larger classifier model (Qwen 7B or 14B)
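A minimal sketch of the query journal, assuming one JSON object per line. The function names, the `event` field, and the other field names are placeholders, not an existing API in this repo.

```python
import json
import time
from pathlib import Path


def append_journal(path, event, **fields):
    """Append one event to the JSONL journal and return the record written."""
    # "ts" and "event" are assumed field names; any extra fields are passed through.
    record = {"ts": time.time(), "event": event, **fields}
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_journal(path):
    """Yield journal records in file order, for replay or re-classification."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Replay would then iterate `read_journal(...)` and feed the logged STT fragments to a classifier under test, with no mic involved.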

Voice switching

  • Allow changing the active voice at runtime via voice command — "switch to the X voice", "use a different voice". Could be per-project as a cue (project A = voice A) or just on-demand preference. Implementation: store current audio_prompt path in a mutable state file; voice-switch commands update it directly without going through the full Claude dispatch.
  • Consider a small set of named voices with friendly names so you can say "switch to the calm voice" rather than a file path.
  • Daily voice rotation: select voice based on a hash of the current date — automatic variety without manual choice. Hash the date string mod N voices, pick from the named voice list.
  • Character mode: pair a voice with a system prompt preamble injected before the query. Technically trivial — prepend to the dispatch prompt. Hard part is writing character definitions that hold up over a session rather than collapsing into generic assistant after a few turns. A shallow character just sounds like a costume.
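The daily rotation idea could look roughly like this. The voice names are invented for illustration. The sketch uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would not give a stable pick across restarts.

```python
import hashlib
from datetime import date

# Hypothetical named voices; the real list would come from config.
VOICES = ["calm", "bright", "gravel"]


def voice_for_date(day, voices=VOICES):
    """Pick a voice deterministically from the ISO date string, mod N voices."""
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(voices)
    return voices[index]


todays_voice = voice_for_date(date.today())
```

The same date always maps to the same voice, so a restart mid-day does not change it.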

Service management

  • systemd units for tts-server and query-demo (STT pipeline) once the clean repos exist
  • Claude Code needs a way to restart these services after making changes — options:
    • Dedicated sudo rules for specific restart commands only (e.g. sudo systemctl restart tts-server)
    • Via CCC conduit — safer since it goes through the user's session
    • Security note: giving Claude Code service restart capability creates a prompt injection risk — malicious content in a voice query could trigger a restart or code reload that runs attacker-controlled code. Mitigations:
      1. Restart-only sudo rules — no deploy, no code pull
      2. Route through CCC conduit rather than direct sudo
      3. Run each service as a dedicated system user with minimal filesystem access — read-only access to its own code and models, write access only to its own log/state, no access to the rest of the system. Damage from compromise is contained to what that user can reach.

Pipeline

  • PTY-based Claude Code injection — replace X11/clipboard paste with direct PTY write to avoid focus stealing

  • Voice command routing — detect slash command intents (rename session, switch session) and dispatch as /command rather than a prompt

  • Activation phrase — say "computer" (or similar) to signal the start of a query; play a short acknowledgement chime (Star Trek style). Currently always-on; activation phrase would make it more intentional.

  • Hardware interface — dedicated physical buttons:

    • Toggle switch + LED: always-on ambient mode, LED indicates listening state visually from across the room, also signals to others in the room
    • Momentary push-to-talk button: for discrete single exchanges without enabling ambient mode
    • The two modes complement each other — set toggle for a working session, use PTT for occasional quick queries
    • Abort button (PB2 from earlier concept): held while releasing PTT discards the recording before it dispatches
  • Audio cues — replace or supplement the "Efforting" TTS utterance with short sound effects: one tone on query start/acknowledgement, a different one on dispatch. Faster feedback than TTS.

  • Reserved control vocabulary (single words or short phrases, pattern-matched before any classifier):

    | Word(s) | Action |
    |---|---|
    | computer | Generic activate: route through normal classifier pipeline |
    | project note | Activate with project context injected into the prompt preamble |
    | ask Claude / hey Claude | Activate and force remote dispatch, bypassing the local classifier entirely |
    | go / send / done | Dispatch current query immediately |
    | cancel / scratch that | Discard accumulated query, return to idle |
    | stop / goodbye / hang up | Shut down the session cleanly |
    | switch voice / different voice | Trigger voice-switch flow |

    Input is normalized (lowercased, punctuation stripped) and checked against this vocabulary before the classifier runs, so the classifier only sees real query content.
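One way the control-word check might be sketched, assuming a fragment only counts as a control word when the whole normalized fragment matches. The action names on the right are made up here; only the phrases come from the table above.

```python
import string

# Phrases from the table above; action names are hypothetical labels.
CONTROL_WORDS = {
    "computer": "activate",
    "project note": "activate_project",
    "ask claude": "force_remote",
    "hey claude": "force_remote",
    "go": "dispatch",
    "send": "dispatch",
    "done": "dispatch",
    "cancel": "discard",
    "scratch that": "discard",
    "stop": "shutdown",
    "goodbye": "shutdown",
    "hang up": "shutdown",
    "switch voice": "switch_voice",
    "different voice": "switch_voice",
}

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def match_control(fragment):
    """Return the action for a control phrase, or None for real query content."""
    return CONTROL_WORDS.get(normalize(fragment))
```

Whole-fragment matching keeps "go to the build directory" from being read as a dispatch command; whether that is the right trade-off depends on how the STT segments speech.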

  • Classifier rule hierarchy — pre-classifiers (fast pattern match) → optional LLM classifiers. Each stage can short-circuit. Makes the system tuneable: add/remove LLM stages without touching the pattern layer.
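The staged hierarchy could be modelled as a list of callables, each returning a decision or `None` to pass the query on. This is only a sketch: the stage names, the decision strings, and the question-mark rule are placeholders, and the LLM stage is a stub rather than a real model call.

```python
from typing import Callable, Optional, Sequence

# A stage returns a decision string, or None to defer to the next stage.
Stage = Callable[[str], Optional[str]]


def run_stages(text: str, stages: Sequence[Stage], default: str = "undecided") -> str:
    """Run stages in order; the first non-None decision short-circuits the rest."""
    for stage in stages:
        decision = stage(text)
        if decision is not None:
            return decision
    return default


def pattern_stage(text: str) -> Optional[str]:
    """Placeholder fast rule: treat an explicit question as complete."""
    if text.strip().endswith("?"):
        return "dispatch"
    return None


def stub_llm_stage(text: str) -> Optional[str]:
    """Stand-in for an LLM classifier; a real one would call a local model."""
    return "dispatch" if len(text.split()) >= 4 else "wait"
```

Adding or removing an LLM stage then means editing the list passed to `run_stages`, leaving the pattern layer alone.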

Query review before dispatch

  • Readback — control word "read back" or "read it back" speaks the accumulated query verbatim via TTS before dispatching. Lets you catch STT errors before they reach Claude.
  • Summary — control word "summarize" runs the accumulated text through a local Ollama model and speaks a brief summary. Useful for longer dictations to confirm the intent was captured correctly.
  • In-place edit — after a readback, say an edit instruction like "change X to Y" or "remove the last sentence". Two implementation options:
    1. Pre-dispatch: local LLM applies the edit to the accumulated text before sending — cleaner but adds latency
    2. Annotated dispatch: send the original query plus the edit instruction as separate sections; the frontier model applies the edit itself — simpler to implement, relies on Claude's instruction following
  • Revert — discard the current accumulated text and start fresh without dispatching
  • Working on it reminder — after dispatching to Claude Code, if no TTS response has come back after N seconds (e.g. 10s), speak "still working on it". Implementation: tts-server tracks timestamp of last /speak call (expose via GET /health or GET /status); query-demo polls after dispatch and fires the reminder if the timestamp hasn't advanced. Single-client assumption keeps this simple — any new /speak means Claude responded.
  • Multi-client with named roles — extend the window ID concept to allow multiple named Claude instances, each with its own voice, role, and system prompt. The TTS server uses the client ID to select the correct voice automatically. Voice queries can be addressed to a specific instance: "brainstorm, what if we..." or "notes, add item...". Demo concept: one instance takes notes while another brainstorms — different voices, parallel contexts, all addressable by name. Implementation sketched: client ID passed in /speak body, tts-server maps client IDs to voice names in config.
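For the "working on it" reminder, the polling side might be sketched as below. `get_last_speak` is a hypothetical callable that would fetch the last `/speak` timestamp from tts-server; the sketch assumes both processes use wall-clock timestamps from the same host. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time


def wait_for_response(get_last_speak, dispatched_at, threshold_s=10.0,
                      poll_s=0.5, clock=time.time, sleep=time.sleep):
    """Poll until a /speak newer than the dispatch appears or the threshold passes.

    Returns "responded" if the last-speak timestamp advanced past dispatched_at,
    otherwise "remind" once threshold_s seconds have elapsed since dispatch.
    """
    while True:
        last = get_last_speak()
        if last is not None and last > dispatched_at:
            return "responded"
        if clock() - dispatched_at >= threshold_s:
            return "remind"
        sleep(poll_s)
```

On `"remind"` the caller would speak the reminder; whether to keep polling afterwards and repeat it is left open here.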

Configuration / Context

  • Central config file (YAML) — "context" in the broad sense: all settings, preferences, and routing rules in one place. Single source of truth, no scattered CLI args or hardcoded strings. Includes:
    • Utterances / chimes: what to say (or what sound to play) for each system event — ready, query received, dispatching, error, session end
    • Activation phrases: each phrase maps to a context definition containing:
      • Routing target: local classifier, specific local handler, Ollama model, or force-remote (Claude)
      • System prompt preamble to inject
      • Active voice override (optional)
      • Per-role phrase lists — a list of utterances for each system event, picked randomly for variety and character:
        • acknowledgement: what to say when a query is received ("On it", "Understood", "Affirmative")
        • working: what to say when taking a while ("Still working on it", "Give me a moment")
        • ready: what to say at startup ("Ready", "Systems online")
        • cancel: what to say when a query is discarded
        Gosalyn would have different phrases than Ivanova or Harold Smith; the randomness plus character-appropriate lines makes each role feel alive rather than mechanical.
    • Control word vocabulary: cancel, send, hang up, switch voice, etc.
    • Pipeline settings: silence timeout, Whisper model, classifier model and prompt, which stages are enabled
    • TTS settings: server URL, default voice
    • Local handlers: time, date, current voice, session status — answered without hitting any LLM
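As a rough illustration of how activation phrases, routing, and per-role phrase lists could hang together once the YAML is loaded, here is the structure as plain Python data. All key names are assumptions, as are the "Cancelled" and "Asking Claude" lines; the other phrases are the examples listed above.

```python
import random

# Fallback phrases per system event.
DEFAULT_PHRASES = {
    "acknowledgement": ["On it", "Understood", "Affirmative"],
    "working": ["Still working on it", "Give me a moment"],
    "ready": ["Ready", "Systems online"],
    "cancel": ["Cancelled"],  # placeholder line
}

# Each activation phrase maps to a context definition (hypothetical keys).
CONTEXTS = {
    "computer": {
        "route": "local-classifier",
        "preamble": "",
        "voice": None,
        "phrases": {},
    },
    "ask claude": {
        "route": "force-remote",
        "preamble": "",
        "voice": None,
        "phrases": {"acknowledgement": ["Asking Claude"]},  # placeholder line
    },
}


def pick_phrase(activation, event, rng=random):
    """Pick a random phrase for the event, preferring the role's own list."""
    role_phrases = CONTEXTS.get(activation, {}).get("phrases", {})
    options = role_phrases.get(event) or DEFAULT_PHRASES[event]
    return rng.choice(options)
```

A role only needs to override the events where it should sound different; everything else falls back to the defaults.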

TTS / Long text

  • Long inputs collapse into gibberish — Chatterbox has a context limit. The tts-server should automatically split on sentence boundaries and queue them sequentially. Chatterbox_Tts already has speak_streaming that does this — use it in the server instead of speak.
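For reference, sentence-boundary splitting in general can be sketched as follows. This is not a description of how `speak_streaming` works internally, and the 300-character limit is an arbitrary placeholder rather than Chatterbox's actual limit.

```python
import re

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_for_tts(text, max_chars=300):
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The server would then synthesize and queue the chunks in order. A naive regex like this will mis-split on abbreviations and on filenames such as "NOTES.md is here", so it is a starting point only.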

TTS / Prompt

  • Filename and path readback sounds bad when names are all-caps (e.g. ARCHITECTURE.md gets spelled out). Instruct Claude in the voice prompt to read filenames and paths in lowercase unless the term is a known acronym. Paralinguistic tags exist but are mostly for expression (laugh, sigh, etc.) — not useful here.
  • TTS response normalisation skill / prompt rules — common technical terms that sound wrong when spoken verbatim:
    • TODO → to-do (not "toddo" or spelled out)
    • README → read me (not "ree-duh-mee" or spelled out)
    • NOTES.md → notes dot m d or just notes
    • General rule: hyphenate compound words that should be spoken as two words; lowercase filenames; spell out true acronyms (API → A P I) but not others
    • Could be a post-processing step on Claude's output before it reaches TTS, or a set of explicit instructions in the voice prompt
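If this ends up as a post-processing step, it might start out like the sketch below. The lookup table holds only the examples listed above, the function name is hypothetical, and the filename rule handles `.md` only.

```python
import re

# Spoken replacements for terms that sound wrong verbatim (examples from the list above).
SPOKEN_FORMS = {
    "TODO": "to-do",
    "README": "read me",
    "API": "A P I",
}

_MD_FILENAME = re.compile(r"\b([A-Za-z0-9_-]+)\.md\b")
_CAPS_TOKEN = re.compile(r"\b[A-Z]{2,}\b")


def normalize_for_tts(text):
    """Rewrite .md filenames and known all-caps terms into speakable forms."""

    def speak_filename(match):
        stem = match.group(1)
        # Known terms use their spoken form; other stems are lowercased.
        spoken = SPOKEN_FORMS.get(stem, stem.lower())
        return f"{spoken} dot m d"

    text = _MD_FILENAME.sub(speak_filename, text)
    # Remaining all-caps tokens are replaced only if they are in the table.
    return _CAPS_TOKEN.sub(lambda m: SPOKEN_FORMS.get(m.group(0), m.group(0)), text)
```

Unknown all-caps tokens pass through unchanged, so the table can grow as new bad readbacks turn up.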

Remote access

  • Remote voice via efforting.tech app — preferred path. Architecture: VAD runs on the phone (Silero is a small ONNX model, runs fine on mobile), silence is never sent, only completed speech segments are transmitted. Each segment boundary is a clean latency reset. Server receives discrete segments and processes them identically to the desktop pipeline. Different port/protocol from file sync since voice has different latency and reliability requirements.
  • Alternative: standalone WebRTC page — browser captures mic, streams to Node server, same STT/TTS pipeline, audio back. Self-hosted, works on any phone browser with no app install.
  • Alternative: Discord bot — replace parec/pacat with discordjs/voice (Opus codec); join channel via slash command. More involved, ties the system to Discord.

STT

  • Try Whisper small.en or medium.en for better accuracy on accents