- README explaining experimental/transparency purpose - faster-whisper STT backend (fw-stt.mjs, faster-whisper-server.py, install-faster-whisper.sh) - Bug fixes: Buffer alignment in on_audio, --debug-waveform URL parsing, silent fetch errors, instant dispatch timer leak - Global uncaughtException/unhandledRejection handlers in query-demo.mjs - Design docs: CHANGELOG, COMMAND-DISPATCH, INTERFACE-THEORY, VOICE-POLICY - Systemd service unit templates Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
232 lines
8.7 KiB
Markdown
232 lines
8.7 KiB
Markdown
# Command Dispatch — Design
|
|
|
|
## Problem with the current approach
|
|
|
|
`Pending_Query.process_utterance()` is a hardcoded if-chain. Every vocabulary set
|
|
and its associated behaviour is baked in. Adding new commands means growing the
|
|
chain. Mode management, query accumulation, and audio energy detection are all
|
|
mixed in the same class. There is no concept of context — every word is evaluated
|
|
against the same flat set of rules regardless of what the pipeline is currently doing.
|
|
|
|
---
|
|
|
|
## Design: mode stack with rule-based frames
|
|
|
|
The pipeline maintains a **stack of frames**. Each frame owns:
|
|
|
|
- A list of **rules** — ordered, evaluated top to bottom
|
|
- Any **per-frame state** (e.g. accumulated query text lives in the query frame,
|
|
not on a shared outer object)
|
|
|
|
When an utterance arrives, the dispatcher walks the stack from top (innermost) to
|
|
bottom (base), running each frame's rule list in order. A rule either:
|
|
|
|
- **Matches and handles** — processing stops (the frame consumed the utterance)
|
|
- **Does not match** — the next rule in the same frame is tried
|
|
- **`check_parent`** — a special rule that, when reached, hands off to the next
|
|
frame down the stack
|
|
|
|
### The `check_parent` rule
|
|
|
|
Fallthrough is **not a flag on the frame**. It is an explicit rule placed wherever
|
|
it belongs in the rule list (usually last). If a frame omits `check_parent`, it is
|
|
implicitly exclusive — utterances not matched within that frame are silently
|
|
discarded. If it includes `check_parent`, unmatched utterances are passed down.
|
|
|
|
This keeps the fallthrough behaviour visible inside the rule list itself, and
|
|
allows partial fallthrough (e.g. a frame that handles some words locally and
|
|
delegates everything else to the parent).
|
|
|
|
Because position is explicit, `check_parent` can also appear **before** a
|
|
catch-all rule. This gives the parent frame first refusal — the child only catches
|
|
what the parent left unhandled. Example: a query frame that puts `check_parent`
|
|
before its catch-all lets the base frame absorb mode-switch words without the
|
|
query frame needing to list them explicitly, and the catch-all then accumulates
|
|
everything the base frame ignored.
|
|
|
|
---
|
|
|
|
## Example stack states
|
|
|
|
### Idle (wake-word mode)
|
|
|
|
```
|
|
[ base frame ]
|
|
- "computer" → push query frame, fire on_activate
|
|
- "mode query" → speak current mode
|
|
- "always listen" → switch mode
|
|
- (no check_parent — base is the bottom)
|
|
```
|
|
|
|
### Active query
|
|
|
|
```
|
|
[ query frame ] ← top
|
|
- "cancel" / "abort" → cancel, pop frame
|
|
- "go" / "done" / "send" → submit, pop frame
|
|
- "read back" → speak accumulated text
|
|
- real words → append to this frame's accumulated text; reset silence timer
|
|
- check_parent ← falls through for anything not matched above
|
|
|
|
[ base frame ] ← bottom
|
|
- "mode query" → speak current mode
|
|
- "always listen" → switch mode
|
|
- ...
|
|
```
|
|
|
|
### Voice picker (entered by saying "switch voice")
|
|
|
|
```
|
|
[ voice picker frame ] ← top
|
|
- known voice name → set voice, speak confirmation, pop frame
|
|
- "cancel" → pop frame without changing voice
|
|
(no check_parent — only voice names and cancel are valid here)
|
|
|
|
[ query frame ] ← or base, depending on where voice switch was triggered
|
|
[ base frame ]
|
|
```
|
|
|
|
### Raw dictation (escape mode)
|
|
|
|
```
|
|
[ raw dictation frame ] ← top
|
|
- "end dictation" → pop frame, submit accumulated content
|
|
- catch-all → append verbatim, no command matching
|
|
(no check_parent, no silence timer — everything is content until exit phrase)
|
|
|
|
[ ... parent frames ... ]
|
|
```
|
|
|
|
Activated by saying "start dictation". Every utterance until "end dictation" is
|
|
captured as literal content — command words like "cancel", "go", or repeated
|
|
fragments like "zero zero zero" are not interpreted. The only way out is the
|
|
explicit exit phrase (or a hard abort at the application level).
|
|
|
|
This is the same narrow-exit frame pattern as voice picker and yes/no — a frame
|
|
with one or two specific exits and a catch-all for everything else. It also
|
|
resolves the code dictation bug (repeated word collapse, command word interference)
|
|
without any special-casing in the rule engine.
|
|
|
|
### Yes/no confirmation
|
|
|
|
```
|
|
[ confirm frame ] ← top
|
|
- "yes" / "confirm" → run pending action, pop frame
|
|
- "no" / "cancel" → discard pending action, pop frame
|
|
(no check_parent — only yes/no accepted)
|
|
|
|
[ ... parent frames ... ]
|
|
```
|
|
|
|
---
|
|
|
|
## Per-frame state
|
|
|
|
State that belongs to a mode lives in its frame, not on a shared object:
|
|
|
|
| Frame | Owns |
|
|
|-------|------|
|
|
| query frame | accumulated text, silence timer, loud-streak counter |
|
|
| voice picker frame | list of valid voice names |
|
|
| confirm frame | the pending action closure |
|
|
|
|
When a frame is popped, its state is gone. No cleanup needed at the outer level.
|
|
|
|
---
|
|
|
|
## Rule types
|
|
|
|
| Type | Description |
|
|
|------|-------------|
|
|
| Exact set | Match normalized text against a `Set` of strings — fast, no LLM |
|
|
| Prefix / pattern | Regex or startsWith match |
|
|
| Catch-all | Matches any utterance with real words — used by query frame to accumulate |
|
|
| `check_parent` | Special sentinel: delegate to next frame down the stack |
|
|
| LLM classifier | Async — used when pattern matching is insufficient |
|
|
|
|
Rules are evaluated in order. Exact-set rules should come before catch-alls.
|
|
|
|
---
|
|
|
|
## Dispatch targets
|
|
|
|
The frame's `on_submit` handler decides where accumulated text goes — it does not
|
|
have to be Claude. Examples:
|
|
|
|
| Target | Description |
|
|
|--------|-------------|
|
|
| Claude Code | Current default — injects query as a voice-buddy prompt |
|
|
| File / note | Append raw text to a file for later review or processing |
|
|
| Task inventory | POST to the task API directly, no LLM involved |
|
|
| Ollama (local) | Run through a local model for cleanup, summarisation, or classification |
|
|
| Deferred queue | Save text + timestamp; a separate process handles it later |
|
|
|
|
**Inventory / dictation mode** is a concrete example: accumulate a long stream of
|
|
spoken items (shelf labels, part names, addresses) without submitting after each
|
|
pause. The frame disables or greatly extends the silence timer, uses an explicit
|
|
save word to submit, and writes the result to a file. A follow-up command can
|
|
optionally pass that file through Ollama to structure it — but the capture step
|
|
needs no LLM at all.
|
|
|
|
This separation — capture now, process later — keeps latency out of the
|
|
dictation loop and lets the user move physically around a space without waiting
|
|
for a model response between items.
|
|
|
|
---
|
|
|
|
## Audio debug viewer (planned)
|
|
|
|
A small web server that receives captured audio segments from the STT pipeline
|
|
and renders them as interactive waveforms in the browser. Motivation: Whisper
|
|
sometimes returns `[BLANK_AUDIO]` (its special token for rejected audio) which
|
|
becomes an empty string and silently drops — no logging, no visibility into why.
|
|
|
|
### What it should show
|
|
|
|
- Full waveform of the segment (pre-roll prepended + live VAD segment)
|
|
- **Visible boundary marker** at the pre-roll / live splice point — the critical
|
|
diagnostic. A level discontinuity at that point means the ring buffer indexing
|
|
is off and the stitched audio has a jump that Whisper interprets as an artifact.
|
|
- Whisper result alongside the waveform (including raw tokens like `[BLANK_AUDIO]`
|
|
before they are stripped)
|
|
- Zoom — short digit segments may be only a few hundred milliseconds
|
|
- Segment metadata: sample count, duration ms, VAD trigger timestamp
|
|
|
|
### Protocol
|
|
|
|
STT pipeline POSTs a JSON payload to the debug server:
|
|
```json
|
|
{
|
|
"samples": [...], // float32 array
|
|
"preroll_length": 3200, // samples prepended from history
|
|
"transcript": "", // raw Whisper output including special tokens
|
|
"timestamp": 1234567890
|
|
}
|
|
```
|
|
|
|
The browser polls or uses SSE to receive new segments and appends them to a
|
|
scrollable list.
|
|
|
|
### Why the discontinuity matters
|
|
|
|
The pre-roll splices audio from a ring buffer (history) onto the front of the
|
|
live VAD segment. If `_history_pos` indexing is off by even one window, the
|
|
splice introduces a sudden amplitude jump. Whisper is sensitive to this — a
|
|
clean segment with a jump can produce `[BLANK_AUDIO]` even when the speech
|
|
content is perfectly audible. The boundary marker makes this immediately visible
|
|
without having to read hex dumps.
|
|
|
|
---
|
|
|
|
## Relationship to `Pending_Query`
|
|
|
|
`Pending_Query` is the current monolith. Under this design it is replaced by:
|
|
|
|
1. **`Command_Router`** — owns the stack, runs dispatch
|
|
2. **`Query_Frame`** — the frame that accumulates text, equivalent to the core of
|
|
`Pending_Query` minus everything that isn't query accumulation
|
|
3. **`Base_Frame`** — wake word, mode switch, mode query, instant dispatch
|
|
|
|
`query-demo.mjs` becomes responsible for constructing and pushing these frames
|
|
rather than passing a flat bag of callbacks to a single class.
|