mikael-lovqvists-claude-agent/claude-voice-experiment

Files

mikael-lovqvists-claude-agent 20873be786 Add README, faster-whisper backend, and session fixes

- README explaining experimental/transparency purpose
- faster-whisper STT backend (fw-stt.mjs, faster-whisper-server.py, install-faster-whisper.sh)
- Bug fixes: Buffer alignment in on_audio, --debug-waveform URL parsing, silent fetch errors, instant dispatch timer leak
- Global uncaughtException/unhandledRejection handlers in query-demo.mjs
- Design docs: CHANGELOG, COMMAND-DISPATCH, INTERFACE-THEORY, VOICE-POLICY
- Systemd service unit templates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-07 06:39:14 +00:00

8.7 KiB

Raw Permalink Blame History

Command Dispatch — Design

Problem with the current approach

Pending_Query.process_utterance() is a hardcoded if-chain. Every vocabulary set and its associated behaviour is baked in. Adding new commands means growing the chain. Mode management, query accumulation, and audio energy detection are all mixed in the same class. There is no concept of context — every word is evaluated against the same flat set of rules regardless of what the pipeline is currently doing.

Design: mode stack with rule-based frames

The pipeline maintains a stack of frames. Each frame owns:

A list of rules — ordered, evaluated top to bottom
Any per-frame state (e.g. accumulated query text lives in the query frame, not on a shared outer object)

When an utterance arrives, the dispatcher walks the stack from top (innermost) to bottom (base), running each frame's rule list in order. A rule either:

Matches and handles — processing stops (the frame consumed the utterance)
Does not match — the next rule in the same frame is tried
check_parent — a special rule that, when reached, hands off to the next frame down the stack

The `check_parent` rule

Fallthrough is not a flag on the frame. It is an explicit rule placed wherever it belongs in the rule list (usually last). If a frame omits check_parent, it is implicitly exclusive — utterances not matched within that frame are silently discarded. If it includes check_parent, unmatched utterances are passed down.

This keeps the fallthrough behaviour visible inside the rule list itself, and allows partial fallthrough (e.g. a frame that handles some words locally and delegates everything else to the parent).

Because position is explicit, check_parent can also appear before a catch-all rule. This gives the parent frame first refusal — the child only catches what the parent left unhandled. Example: a query frame that puts check_parent before its catch-all lets the base frame absorb mode-switch words without the query frame needing to list them explicitly, and the catch-all then accumulates everything the base frame ignored.

Example stack states

Idle (wake-word mode)

[ base frame ]
  - "computer"  → push query frame, fire on_activate
  - "mode query" → speak current mode
  - "always listen" → switch mode
  - (no check_parent — base is the bottom)

Active query

[ query frame ]          ← top
  - "cancel" / "abort"  → cancel, pop frame
  - "go" / "done" / "send" → submit, pop frame
  - "read back"         → speak accumulated text
  - real words          → append to this frame's accumulated text; reset silence timer
  - check_parent        ← falls through for anything not matched above

[ base frame ]           ← bottom
  - "mode query" → speak current mode
  - "always listen" → switch mode
  - ...

Voice picker (entered by saying "switch voice")

[ voice picker frame ]   ← top
  - known voice name     → set voice, speak confirmation, pop frame
  - "cancel"             → pop frame without changing voice
  (no check_parent — only voice names and cancel are valid here)

[ query frame ]          ← or base, depending on where voice switch was triggered
[ base frame ]

Raw dictation (escape mode)

[ raw dictation frame ]  ← top
  - "end dictation"      → pop frame, submit accumulated content
  - catch-all            → append verbatim, no command matching
  (no check_parent, no silence timer — everything is content until exit phrase)

[ ... parent frames ... ]

Activated by saying "start dictation". Every utterance until "end dictation" is captured as literal content — command words like "cancel", "go", or repeated fragments like "zero zero zero" are not interpreted. The only way out is the explicit exit phrase (or a hard abort at the application level).

This is the same narrow-exit frame pattern as voice picker and yes/no — a frame with one or two specific exits and a catch-all for everything else. It also resolves the code dictation bug (repeated word collapse, command word interference) without any special-casing in the rule engine.

Yes/no confirmation

[ confirm frame ]        ← top
  - "yes" / "confirm"   → run pending action, pop frame
  - "no" / "cancel"     → discard pending action, pop frame
  (no check_parent — only yes/no accepted)

[ ... parent frames ... ]

Per-frame state

State that belongs to a mode lives in its frame, not on a shared object:

Frame	Owns
query frame	accumulated text, silence timer, loud-streak counter
voice picker frame	list of valid voice names
confirm frame	the pending action closure

When a frame is popped, its state is gone. No cleanup needed at the outer level.

Rule types

Type	Description
Exact set	Match normalized text against a `Set` of strings — fast, no LLM
Prefix / pattern	Regex or startsWith match
Catch-all	Matches any utterance with real words — used by query frame to accumulate
`check_parent`	Special sentinel: delegate to next frame down the stack
LLM classifier	Async — used when pattern matching is insufficient

Rules are evaluated in order. Exact-set rules should come before catch-alls.

Dispatch targets

The frame's on_submit handler decides where accumulated text goes — it does not have to be Claude. Examples:

Target	Description
Claude Code	Current default — injects query as a voice-buddy prompt
File / note	Append raw text to a file for later review or processing
Task inventory	POST to the task API directly, no LLM involved
Ollama (local)	Run through a local model for cleanup, summarisation, or classification
Deferred queue	Save text + timestamp; a separate process handles it later

Inventory / dictation mode is a concrete example: accumulate a long stream of spoken items (shelf labels, part names, addresses) without submitting after each pause. The frame disables or greatly extends the silence timer, uses an explicit save word to submit, and writes the result to a file. A follow-up command can optionally pass that file through Ollama to structure it — but the capture step needs no LLM at all.

This separation — capture now, process later — keeps latency out of the dictation loop and lets the user move physically around a space without waiting for a model response between items.

Audio debug viewer (planned)

A small web server that receives captured audio segments from the STT pipeline and renders them as interactive waveforms in the browser. Motivation: Whisper sometimes returns [BLANK_AUDIO] (its special token for rejected audio) which becomes an empty string and silently drops — no logging, no visibility into why.

What it should show

Full waveform of the segment (pre-roll prepended + live VAD segment)
Visible boundary marker at the pre-roll / live splice point — the critical diagnostic. A level discontinuity at that point means the ring buffer indexing is off and the stitched audio has a jump that Whisper interprets as an artifact.
Whisper result alongside the waveform (including raw tokens like [BLANK_AUDIO] before they are stripped)
Zoom — short digit segments may be only a few hundred milliseconds
Segment metadata: sample count, duration ms, VAD trigger timestamp

Protocol

STT pipeline POSTs a JSON payload to the debug server:

{
  "samples": [...],        // float32 array
  "preroll_length": 3200,  // samples prepended from history
  "transcript": "",        // raw Whisper output including special tokens
  "timestamp": 1234567890
}

The browser polls or uses SSE to receive new segments and appends them to a scrollable list.

Why the discontinuity matters

The pre-roll splices audio from a ring buffer (history) onto the front of the live VAD segment. If _history_pos indexing is off by even one window, the splice introduces a sudden amplitude jump. Whisper is sensitive to this — a clean segment with a jump can produce [BLANK_AUDIO] even when the speech content is perfectly audible. The boundary marker makes this immediately visible without having to read hex dumps.

Relationship to `Pending_Query`

Pending_Query is the current monolith. Under this design it is replaced by:

Command_Router — owns the stack, runs dispatch
Query_Frame — the frame that accumulates text, equivalent to the core of Pending_Query minus everything that isn't query accumulation
Base_Frame — wake word, mode switch, mode query, instant dispatch

query-demo.mjs becomes responsible for constructing and pushing these frames rather than passing a flat bag of callbacks to a single class.

8.7 KiB Raw Permalink Blame History