claude-voice-experiment/COMMAND-DISPATCH.md

# Command Dispatch — Design

## Problem with the current approach

`Pending_Query.process_utterance()` is a hardcoded if-chain. Every vocabulary set
and its associated behaviour is baked in. Adding new commands means growing the
chain. Mode management, query accumulation, and audio energy detection are all
mixed in the same class. There is no concept of context — every word is evaluated
against the same flat set of rules regardless of what the pipeline is currently doing.

---

## Design: mode stack with rule-based frames

The pipeline maintains a **stack of frames**. Each frame owns:

- A list of **rules** — ordered, evaluated top to bottom
- Any **per-frame state** (e.g. accumulated query text lives in the query frame,
  not on a shared outer object)

When an utterance arrives, the dispatcher walks the stack from top (innermost) to
bottom (base), running each frame's rule list in order. A rule either:

- **Matches and handles** — processing stops (the frame consumed the utterance)
- **Does not match** — the next rule in the same frame is tried
- **`check_parent`** — a special rule that, when reached, hands off to the next
  frame down the stack

### The `check_parent` rule

Fallthrough is **not a flag on the frame**. It is an explicit rule placed wherever
it belongs in the rule list (usually last). If a frame omits `check_parent`, it is
implicitly exclusive — utterances not matched within that frame are silently
discarded. If it includes `check_parent`, unmatched utterances are passed down.

This keeps the fallthrough behaviour visible inside the rule list itself, and
allows partial fallthrough (e.g. a frame that handles some words locally and
delegates everything else to the parent).

Because position is explicit, `check_parent` can also appear **before** a
catch-all rule. This gives the parent frame first refusal — the child only catches
what the parent left unhandled. Example: a query frame that puts `check_parent`
before its catch-all lets the base frame absorb mode-switch words without the
query frame needing to list them explicitly, and the catch-all then accumulates
everything the base frame ignored.

---

## Example stack states

### Idle (wake-word mode)

```
[ base frame ]
  - "computer"  → push query frame, fire on_activate
  - "mode query" → speak current mode
  - "always listen" → switch mode
  - (no check_parent — base is the bottom)
```

### Active query

```
[ query frame ]          ← top
  - "cancel" / "abort"  → cancel, pop frame
  - "go" / "done" / "send" → submit, pop frame
  - "read back"         → speak accumulated text
  - real words          → append to this frame's accumulated text; reset silence timer
  - check_parent        ← falls through for anything not matched above

[ base frame ]           ← bottom
  - "mode query" → speak current mode
  - "always listen" → switch mode
  - ...
```

### Voice picker (entered by saying "switch voice")

```
[ voice picker frame ]   ← top
  - known voice name     → set voice, speak confirmation, pop frame
  - "cancel"             → pop frame without changing voice
  (no check_parent — only voice names and cancel are valid here)

[ query frame ]          ← or base, depending on where voice switch was triggered
[ base frame ]
```

### Raw dictation (escape mode)

```
[ raw dictation frame ]  ← top
  - "end dictation"      → pop frame, submit accumulated content
  - catch-all            → append verbatim, no command matching
  (no check_parent, no silence timer — everything is content until exit phrase)

[ ... parent frames ... ]
```

Activated by saying "start dictation". Every utterance until "end dictation" is
captured as literal content — command words like "cancel", "go", or repeated
fragments like "zero zero zero" are not interpreted. The only way out is the
explicit exit phrase (or a hard abort at the application level).

This is the same narrow-exit frame pattern as voice picker and yes/no — a frame
with one or two specific exits and a catch-all for everything else. It also
resolves the code dictation bug (repeated word collapse, command word interference)
without any special-casing in the rule engine.

### Yes/no confirmation

```
[ confirm frame ]        ← top
  - "yes" / "confirm"   → run pending action, pop frame
  - "no" / "cancel"     → discard pending action, pop frame
  (no check_parent — only yes/no accepted)

[ ... parent frames ... ]
```

---

## Per-frame state

State that belongs to a mode lives in its frame, not on a shared object:

| Frame | Owns |
|-------|------|
| query frame | accumulated text, silence timer, loud-streak counter |
| voice picker frame | list of valid voice names |
| confirm frame | the pending action closure |

When a frame is popped, its state is gone. No cleanup needed at the outer level.

---

## Rule types

| Type | Description |
|------|-------------|
| Exact set | Match normalized text against a `Set` of strings — fast, no LLM |
| Prefix / pattern | Regex or startsWith match |
| Catch-all | Matches any utterance with real words — used by query frame to accumulate |
| `check_parent` | Special sentinel: delegate to next frame down the stack |
| LLM classifier | Async — used when pattern matching is insufficient |

Rules are evaluated in order. Exact-set rules should come before catch-alls.

---

## Dispatch targets

The frame's `on_submit` handler decides where accumulated text goes — it does not
have to be Claude. Examples:

| Target | Description |
|--------|-------------|
| Claude Code | Current default — injects query as a voice-buddy prompt |
| File / note | Append raw text to a file for later review or processing |
| Task inventory | POST to the task API directly, no LLM involved |
| Ollama (local) | Run through a local model for cleanup, summarisation, or classification |
| Deferred queue | Save text + timestamp; a separate process handles it later |

**Inventory / dictation mode** is a concrete example: accumulate a long stream of
spoken items (shelf labels, part names, addresses) without submitting after each
pause. The frame disables or greatly extends the silence timer, uses an explicit
save word to submit, and writes the result to a file. A follow-up command can
optionally pass that file through Ollama to structure it — but the capture step
needs no LLM at all.

This separation — capture now, process later — keeps latency out of the
dictation loop and lets the user move physically around a space without waiting
for a model response between items.

---

## Audio debug viewer (planned)

A small web server that receives captured audio segments from the STT pipeline
and renders them as interactive waveforms in the browser. Motivation: Whisper
sometimes returns `[BLANK_AUDIO]` (its special token for rejected audio) which
becomes an empty string and silently drops — no logging, no visibility into why.

### What it should show

- Full waveform of the segment (pre-roll prepended + live VAD segment)
- **Visible boundary marker** at the pre-roll / live splice point — the critical
  diagnostic. A level discontinuity at that point means the ring buffer indexing
  is off and the stitched audio has a jump that Whisper interprets as an artifact.
- Whisper result alongside the waveform (including raw tokens like `[BLANK_AUDIO]`
  before they are stripped)
- Zoom — short digit segments may be only a few hundred milliseconds
- Segment metadata: sample count, duration ms, VAD trigger timestamp

### Protocol

STT pipeline POSTs a JSON payload to the debug server:
```json
{
  "samples": [...],        // float32 array
  "preroll_length": 3200,  // samples prepended from history
  "transcript": "",        // raw Whisper output including special tokens
  "timestamp": 1234567890
}
```

The browser polls or uses SSE to receive new segments and appends them to a
scrollable list.

### Why the discontinuity matters

The pre-roll splices audio from a ring buffer (history) onto the front of the
live VAD segment. If `_history_pos` indexing is off by even one window, the
splice introduces a sudden amplitude jump. Whisper is sensitive to this — a
clean segment with a jump can produce `[BLANK_AUDIO]` even when the speech
content is perfectly audible. The boundary marker makes this immediately visible
without having to read hex dumps.

---

## Relationship to `Pending_Query`

`Pending_Query` is the current monolith. Under this design it is replaced by:

1. **`Command_Router`** — owns the stack, runs dispatch
2. **`Query_Frame`** — the frame that accumulates text, equivalent to the core of
   `Pending_Query` minus everything that isn't query accumulation
3. **`Base_Frame`** — wake word, mode switch, mode query, instant dispatch

`query-demo.mjs` becomes responsible for constructing and pushing these frames
rather than passing a flat bag of callbacks to a single class.