Add README, faster-whisper backend, and session fixes

- README explaining experimental/transparency purpose
- faster-whisper STT backend (fw-stt.mjs, faster-whisper-server.py, install-faster-whisper.sh)
- Bug fixes: Buffer alignment in on_audio, --debug-waveform URL parsing, silent fetch errors, instant dispatch timer leak
- Global uncaughtException/unhandledRejection handlers in query-demo.mjs
- Design docs: CHANGELOG, COMMAND-DISPATCH, INTERFACE-THEORY, VOICE-POLICY
- Systemd service unit templates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 06:39:14 +00:00
parent a7fa2fd218
commit 20873be786
19 changed files with 1070 additions and 25 deletions

7
.gitignore vendored

@@ -1,11 +1,6 @@
node_modules/
models/
venv/
*.ogg
*.wav
*.mp3
*.onnx
*.bin
package-lock.json
__pycache__/
*.pyc
build/

29
CHANGELOG.md Normal file

@@ -0,0 +1,29 @@
# Changelog
## Session 2026-06-06
### faster-whisper backend (`lib/fw-stt.mjs`, `faster-whisper-server.py`)
- Added faster-whisper as an alternative STT backend with word-level timestamps
- Select with `--stt faster-whisper` flag on `query-demo.mjs`
- Python subprocess protocol: JSON lines over stdin/stdout, `ready\n` on startup
- Sherpa-onnx VAD still used for segment detection; transcription goes to Python
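The JSON-lines framing above can be sketched from the client side as follows. This is a minimal illustration written in Python to match `faster-whisper-server.py`; the helper names `encode_request` and `decode_response` are hypothetical and are not taken from `lib/fw-stt.mjs`. The field names (`audio_b64`, `sample_rate`, `text`, `words`, `error`) follow the server's own docstring and error path.

```python
import base64
import json
import struct


def encode_request(samples, sample_rate=16000):
    """Pack samples as little-endian float32, base64-encode, frame as one JSON line."""
    raw = struct.pack(f"<{len(samples)}f", *samples)
    payload = {
        "audio_b64": base64.b64encode(raw).decode("ascii"),
        "sample_rate": sample_rate,
    }
    return json.dumps(payload) + "\n"


def decode_response(line):
    """Parse one response line into (text, words, error); error is None on success."""
    resp = json.loads(line)
    return resp.get("text", ""), resp.get("words", []), resp.get("error")


# Round trip on a tiny buffer to check the byte layout.
request_line = encode_request([0.0, 0.5, -0.5])
raw_back = base64.b64decode(json.loads(request_line)["audio_b64"])
samples_back = list(struct.unpack(f"<{len(raw_back) // 4}f", raw_back))
```

One line per exchange keeps the reader on the Node side simple: it can split stdout on newlines and treat the first line, `ready`, as the startup handshake.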
### Install script (`install-faster-whisper.sh`)
- Rewritten to use persistent build directory (`build/ctranslate2/`) instead of tmpfs temp dir — cmake's incremental build now works across runs
- Each step (clone, cmake build, Python bindings, faster-whisper) is skipped if already done
- Builds ctranslate2 C++ library from source against system CUDA 13 (PyPI wheel requires CUDA 12)
- Installs C++ library into venv so Python bindings find headers and library automatically
- Detects and removes PyPI ctranslate2 wheel (which bundles its own CUDA 12 library) before installing source-built bindings
- Fixed: `set -euo pipefail` was silently killing the script when `ls` found no files in symlink loop
### Bug fixes
- **`pending-query.mjs` alignment bug**: `new Int16Array(chunk.buffer, chunk.byteOffset, ...)` threw on odd byte offsets from Node.js Buffer pool slices, silently killing audio-based timer resets and causing premature silence timeouts mid-sentence. Fixed by copying to aligned buffer when `byteOffset % 2 !== 0`.
- **`query-demo.mjs` URL parsing**: `--debug-waveform` without a value was grabbing the next flag (`--stt`) as the URL. Fixed by checking that the parsed value doesn't start with `--`.
- **`fw-stt.mjs` silent fetch errors**: `_post_debug` was calling `fetch().catch(() => {})`, swallowing connection errors to the waveform viewer silently. Now logs failures.
- **`pending-query.mjs` instant dispatch timer leak**: silence timer was not cleared on instant dispatch, causing double-submit. Fixed by clearing timer and resetting state in the instant dispatch path.
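The `--debug-waveform` fix in the list above amounts to treating the flag's value as optional. A rough Python sketch of that check is shown here for illustration only; the real fix lives in `query-demo.mjs`, and the function name `optional_flag_value` and the default URL used in the example are hypothetical.

```python
def optional_flag_value(argv, flag, default=None):
    """Return (present, value) for a flag whose value may be omitted.

    A following token that starts with "--" is treated as the next flag,
    not as this flag's value.
    """
    if flag not in argv:
        return False, None
    idx = argv.index(flag)
    nxt = argv[idx + 1] if idx + 1 < len(argv) else None
    if nxt is None or nxt.startswith("--"):
        return True, default
    return True, nxt


# Hypothetical default; the actual viewer URL is not stated in the changelog.
DEFAULT_URL = "http://localhost:8080"

bare = optional_flag_value(
    ["--debug-waveform", "--stt", "faster-whisper"], "--debug-waveform", DEFAULT_URL
)
explicit = optional_flag_value(
    ["--debug-waveform", "http://viewer:9000", "--stt", "faster-whisper"],
    "--debug-waveform",
    DEFAULT_URL,
)
```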
### Global exception handling (`query-demo.mjs`)
- Added `process.on('uncaughtException')` and `process.on('unhandledRejection')` handlers to surface hidden errors in stderr.
### Command dispatch design (`COMMAND-DISPATCH.md`)
- Documented planned mode stack architecture with `check_parent` rule, frame types, and dispatch targets

View File

@@ -153,6 +153,12 @@ Also detect Claude Code's session feedback prompt ("How is Claude doing this ses
---
## Silence timer warning chime (planned)
Before the silence timer fires and submits the query, play a short warning chime (e.g. 1–2 seconds before expiry) so the user has a moment to say `pause` or continue speaking if the submission would be premature. Requires a secondary timeout that fires slightly before the main one.
---
## Audio mixing — chimes over speech (future)
Currently the playback queue is strictly sequential — chimes and speech never overlap. A future improvement would be to mix chime audio on top of ongoing speech rather than waiting for it to finish. This would make activation chimes feel more responsive. Requires summing float32 buffers before passing to pacat rather than queuing them separately.
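A minimal sketch of the buffer summing described above, using plain Python lists in place of float32 arrays. The function name `mix`, the `offset` and `chime_gain` parameters, and the clamp to [-1.0, 1.0] are assumptions made for illustration; the note above only says that the buffers would be summed before being passed to pacat.

```python
def mix(speech, chime, offset=0, chime_gain=1.0):
    """Sum chime onto speech starting at sample index offset.

    The output is long enough to hold both signals, and each mixed sample
    is clamped to [-1.0, 1.0] so the sum cannot exceed full scale.
    """
    length = max(len(speech), offset + len(chime))
    out = list(speech) + [0.0] * (length - len(speech))
    for i, sample in enumerate(chime):
        mixed = out[offset + i] + chime_gain * sample
        out[offset + i] = max(-1.0, min(1.0, mixed))
    return out


# The chime starts one sample into the speech buffer; the last sum clips.
overlapped = mix([0.5, 0.5, 0.5], [0.25, 0.75], offset=1)
```

Hard clamping is the simplest guard against overflow; scaling the chime down with `chime_gain` would avoid audible clipping at the cost of a quieter chime.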
@@ -194,12 +200,62 @@ Silence timer auto-enabled in wake-word mode; disabled in always-on mode.
---
## Utterance pre-processing: collapse repeated words (planned)
## Bug: alphanumeric code dictation fails
Three problems when dictating codes like "F 0 7 5 0 0 0 0":
1. **Repeated word collapse** — "zero zero zero zero" becomes "zero". The collapse rule is too aggressive; it should only apply to command words, not to content being accumulated.
2. **Whisper letter recognition** — single spoken letters are often transcribed incorrectly (e.g. "F" → "if", "ef"). A NATO phonetic alphabet mode ("foxtrot zero seven five") would be more reliable for codes.
3. **Silence timer** — pauses between individual characters trigger early submission.
Fix: disable the collapse rule for content (only apply it to the command-word matching step, not to the accumulated text). Consider a "code entry mode" that disables the silence timer and uses phonetic alphabet mapping.
---
## Utterance pre-processing: collapse repeated words (temporary fix)
Implemented as a quick fix in `Pending_Query._collapse_repeated_words()`. Should be absorbed into the proper rewrite/rule engine when that is built — the rule engine is the right home for utterance normalization and transformation.
STT occasionally produces repeated words as artifacts (e.g. "go go", "cancel cancel"). Add a pre-processing step before command matching that collapses consecutive identical words in a fragment into one (e.g. "go go" → "go"). This would be a simple normalisation rule applied to the `norm` string before any word-set lookups.
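The normalisation step can be sketched as below. This is an illustrative Python version, not the actual `Pending_Query._collapse_repeated_words()` implementation, and it assumes the `norm` string is already lowercased and stripped of punctuation.

```python
def collapse_repeated_words(norm):
    """Collapse runs of consecutive identical words into a single word."""
    out = []
    for word in norm.split():
        if not out or out[-1] != word:
            out.append(word)
    return " ".join(out)


command = collapse_repeated_words("go go")
dictated_code = collapse_repeated_words("f zero seven five zero zero zero zero")
```

The second call shows why the rule is unsafe for accumulated content: the four trailing zeros of a dictated code collapse into one.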
---
## API integrations (future)
Direct API integrations the voice assistant can call without involving Claude:
- YouTube — check for new videos from specific creators (via YouTube Data API with user account)
- Weather — already partially working via web search; proper API integration would be cleaner
- Calendar — check upcoming events
- Simple lookups that don't need LLM reasoning
For complex queries involving these services, route to Claude with the relevant tool/API access provided. The split: voice assistant handles simple factual retrieval; Claude handles reasoning, summarization, or multi-step tasks.
---
## Query journal (planned)
Log all STT fragments and command decisions to a JSONL journal file. Each entry should include: timestamp, raw fragment, normalised text, what command (if any) it matched, accumulated text at the time, and how the query was resolved (send word, silence timeout, cancel, instant dispatch). Purposes:
- Debug utterance recognition issues without live mic
- Replay and re-classify fragments when the routing rules change
- Tune the rule engine against real recorded data
Will become essential once the routing layer grows complex.
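One possible shape for a journal entry, sketched in Python. The key names (`raw`, `norm`, `command`, `accumulated`, `resolution`) and the helper `journal_entry` are hypothetical; only the list of things to record comes from the description above.

```python
import json
import time


def journal_entry(raw, norm, command, accumulated, resolution, timestamp=None):
    """Build one JSONL line describing a fragment and the decision made for it."""
    entry = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "raw": raw,                  # fragment exactly as the STT returned it
        "norm": norm,                # normalised text used for matching
        "command": command,          # matched command name, or None
        "accumulated": accumulated,  # query text at the time of this fragment
        "resolution": resolution,    # e.g. "send word", "silence timeout", or None
    }
    return json.dumps(entry, ensure_ascii=False)


line = journal_entry("Go.", "go", "send", "check the weather", "send word", timestamp=1.0)
parsed = json.loads(line)
```

Writing one self-contained JSON object per line means a replay tool can stream the file and re-run classification on each `raw` fragment without loading the whole journal.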
---
## Next major direction: rule-based command routing and multi-target dispatch
The voice interface should be able to route queries to different targets, not just Claude Code. This requires a more structured command layer. Planned capabilities:
- **Stashed queries** — accumulate a query while waiting for Claude to respond to a previous one; hold it and send when ready
- **Voice notes** — dispatch short notes directly to the personal note/task system (task inventory) without involving Claude Code at all
- **Multi-target routing** — based on query prefix or context, route to: Claude Code, note system, calendar, web search, etc.
- **Rule-based command structure** — replace the current ad-hoc word sets with a proper routing table that maps utterance patterns to handlers and targets
This is the core next step for the project. The current `Pending_Query` accumulation model is the right foundation; the routing layer sits above it.
---
## Utterance routing layer (planned)
As query interaction grows more complex, utterance dispatch needs its own class/config rather than living inside `Pending_Query`. Rough shape:
@@ -214,6 +270,14 @@ As query interaction grows more complex, utterance dispatch needs its own class/
## Query staging / readback (planned)
Two distinct commands:
- **`read back`** — speak the raw verbatim accumulated text so you can hear exactly what Whisper captured
- **`clean up`** (or `process`) — send accumulated text through a local LLM to fix dictation artifacts (repeated words, wrong homophones, missing punctuation, expand abbreviations) then read back the cleaned result; optionally replace accumulated text with the cleaned version before sending
The clean-up pass is especially useful for dictating letters or longer text where Whisper artifacts would be embarrassing if sent directly.
---
Before a query is sent to Claude, give the user a chance to review and correct it via voice:
- After accumulating utterances, read back the query text via TTS so the user can confirm it looks right

231
COMMAND-DISPATCH.md Normal file

@@ -0,0 +1,231 @@
# Command Dispatch — Design
## Problem with the current approach
`Pending_Query.process_utterance()` is a hardcoded if-chain. Every vocabulary set
and its associated behaviour is baked in. Adding new commands means growing the
chain. Mode management, query accumulation, and audio energy detection are all
mixed in the same class. There is no concept of context — every word is evaluated
against the same flat set of rules regardless of what the pipeline is currently doing.
---
## Design: mode stack with rule-based frames
The pipeline maintains a **stack of frames**. Each frame owns:
- A list of **rules** — ordered, evaluated top to bottom
- Any **per-frame state** (e.g. accumulated query text lives in the query frame,
not on a shared outer object)
When an utterance arrives, the dispatcher walks the stack from top (innermost) to
bottom (base), running each frame's rule list in order. A rule either:
- **Matches and handles** — processing stops (the frame consumed the utterance)
- **Does not match** — the next rule in the same frame is tried
- **`check_parent`** — a special rule that, when reached, hands off to the next
frame down the stack
### The `check_parent` rule
Fallthrough is **not a flag on the frame**. It is an explicit rule placed wherever
it belongs in the rule list (usually last). If a frame omits `check_parent`, it is
implicitly exclusive — utterances not matched within that frame are silently
discarded. If it includes `check_parent`, unmatched utterances are passed down.
This keeps the fallthrough behaviour visible inside the rule list itself, and
allows partial fallthrough (e.g. a frame that handles some words locally and
delegates everything else to the parent).
Because position is explicit, `check_parent` can also appear **before** a
catch-all rule. This gives the parent frame first refusal — the child only catches
what the parent left unhandled. Example: a query frame that puts `check_parent`
before its catch-all lets the base frame absorb mode-switch words without the
query frame needing to list them explicitly, and the catch-all then accumulates
everything the base frame ignored.
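The dispatch walk and the `check_parent` rule can be sketched as follows. This is one reading of the design, written in Python for brevity: it assumes that when the parent declines an utterance, control returns to the child's remaining rules, which is what the "first refusal" example implies. The names `Frame`, `Router`, and `CHECK_PARENT` are hypothetical and do not correspond to existing code.

```python
CHECK_PARENT = object()  # sentinel rule: delegate to the next frame down


class Frame:
    def __init__(self, name, rules):
        self.name = name
        # Each rule is either CHECK_PARENT or a (matches, handler) pair.
        self.rules = rules


class Router:
    def __init__(self):
        self.stack = []  # last element is the top (innermost) frame

    def push(self, frame):
        self.stack.append(frame)

    def dispatch(self, utterance, depth=None):
        """Return True if some frame handled the utterance, else False."""
        if depth is None:
            depth = len(self.stack) - 1
        if depth < 0:
            return False  # fell off the bottom of the stack
        for rule in self.stack[depth].rules:
            if rule is CHECK_PARENT:
                if self.dispatch(utterance, depth - 1):
                    return True
                continue  # parent declined; try this frame's remaining rules
            matches, handler = rule
            if matches(utterance):
                handler(utterance)
                return True
        return False  # no check_parent left to try: the utterance is discarded


log, accumulated = [], []
router = Router()
router.push(Frame("base", [
    (lambda u: u in {"always listen", "mode query"}, lambda u: log.append(("base", u))),
]))
router.push(Frame("query", [
    (lambda u: u in {"cancel", "abort"}, lambda u: log.append(("cancel", u))),
    CHECK_PARENT,  # parent gets first refusal before the catch-all
    (lambda u: bool(u.strip()), accumulated.append),
]))

results = [router.dispatch(u) for u in ("always listen", "what is the weather", "cancel")]
```

With `CHECK_PARENT` placed before the catch-all, "always listen" reaches the base frame without the query frame listing it, while ordinary speech still ends up in the query frame's accumulated text.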
---
## Example stack states
### Idle (wake-word mode)
```
[ base frame ]
- "computer" → push query frame, fire on_activate
- "mode query" → speak current mode
- "always listen" → switch mode
- (no check_parent — base is the bottom)
```
### Active query
```
[ query frame ] ← top
- "cancel" / "abort" → cancel, pop frame
- "go" / "done" / "send" → submit, pop frame
- "read back" → speak accumulated text
- real words → append to this frame's accumulated text; reset silence timer
- check_parent ← falls through for anything not matched above
[ base frame ] ← bottom
- "mode query" → speak current mode
- "always listen" → switch mode
- ...
```
### Voice picker (entered by saying "switch voice")
```
[ voice picker frame ] ← top
- known voice name → set voice, speak confirmation, pop frame
- "cancel" → pop frame without changing voice
(no check_parent — only voice names and cancel are valid here)
[ query frame ] ← or base, depending on where voice switch was triggered
[ base frame ]
```
### Raw dictation (escape mode)
```
[ raw dictation frame ] ← top
- "end dictation" → pop frame, submit accumulated content
- catch-all → append verbatim, no command matching
(no check_parent, no silence timer — everything is content until exit phrase)
[ ... parent frames ... ]
```
Activated by saying "start dictation". Every utterance until "end dictation" is
captured as literal content — command words like "cancel", "go", or repeated
fragments like "zero zero zero" are not interpreted. The only way out is the
explicit exit phrase (or a hard abort at the application level).
This is the same narrow-exit frame pattern as voice picker and yes/no — a frame
with one or two specific exits and a catch-all for everything else. It also
resolves the code dictation bug (repeated word collapse, command word interference)
without any special-casing in the rule engine.
### Yes/no confirmation
```
[ confirm frame ] ← top
- "yes" / "confirm" → run pending action, pop frame
- "no" / "cancel" → discard pending action, pop frame
(no check_parent — only yes/no accepted)
[ ... parent frames ... ]
```
---
## Per-frame state
State that belongs to a mode lives in its frame, not on a shared object:
| Frame | Owns |
|-------|------|
| query frame | accumulated text, silence timer, loud-streak counter |
| voice picker frame | list of valid voice names |
| confirm frame | the pending action closure |
When a frame is popped, its state is gone. No cleanup needed at the outer level.
---
## Rule types
| Type | Description |
|------|-------------|
| Exact set | Match normalized text against a `Set` of strings — fast, no LLM |
| Prefix / pattern | Regex or startsWith match |
| Catch-all | Matches any utterance with real words — used by query frame to accumulate |
| `check_parent` | Special sentinel: delegate to next frame down the stack |
| LLM classifier | Async — used when pattern matching is insufficient |
Rules are evaluated in order. Exact-set rules should come before catch-alls.
---
## Dispatch targets
The frame's `on_submit` handler decides where accumulated text goes — it does not
have to be Claude. Examples:
| Target | Description |
|--------|-------------|
| Claude Code | Current default — injects query as a voice-buddy prompt |
| File / note | Append raw text to a file for later review or processing |
| Task inventory | POST to the task API directly, no LLM involved |
| Ollama (local) | Run through a local model for cleanup, summarisation, or classification |
| Deferred queue | Save text + timestamp; a separate process handles it later |
**Inventory / dictation mode** is a concrete example: accumulate a long stream of
spoken items (shelf labels, part names, addresses) without submitting after each
pause. The frame disables or greatly extends the silence timer, uses an explicit
save word to submit, and writes the result to a file. A follow-up command can
optionally pass that file through Ollama to structure it — but the capture step
needs no LLM at all.
This separation — capture now, process later — keeps latency out of the
dictation loop and lets the user move physically around a space without waiting
for a model response between items.
---
## Audio debug viewer (planned)
A small web server that receives captured audio segments from the STT pipeline
and renders them as interactive waveforms in the browser. Motivation: Whisper
sometimes returns `[BLANK_AUDIO]` (its special token for rejected audio), which
becomes an empty string and is silently dropped, with no logging and no
visibility into why.
### What it should show
- Full waveform of the segment (pre-roll prepended + live VAD segment)
- **Visible boundary marker** at the pre-roll / live splice point — the critical
diagnostic. A level discontinuity at that point means the ring buffer indexing
is off and the stitched audio has a jump that Whisper interprets as an artifact.
- Whisper result alongside the waveform (including raw tokens like `[BLANK_AUDIO]`
before they are stripped)
- Zoom — short digit segments may be only a few hundred milliseconds
- Segment metadata: sample count, duration ms, VAD trigger timestamp
### Protocol
STT pipeline POSTs a JSON payload to the debug server:
```json
{
  "samples": [...],          // float32 array
  "preroll_length": 3200,    // samples prepended from history
  "transcript": "",          // raw Whisper output including special tokens
  "timestamp": 1234567890
}
```
The browser polls or uses SSE to receive new segments and appends them to a
scrollable list.
### Why the discontinuity matters
The pre-roll splices audio from a ring buffer (history) onto the front of the
live VAD segment. If `_history_pos` indexing is off by even one window, the
splice introduces a sudden amplitude jump. Whisper is sensitive to this: an
otherwise clean segment with a jump can produce `[BLANK_AUDIO]` even when the speech
content is perfectly audible. The boundary marker makes this immediately visible
without having to read hex dumps.
---
## Relationship to `Pending_Query`
`Pending_Query` is the current monolith. Under this design it is replaced by:
1. **`Command_Router`** — owns the stack, runs dispatch
2. **`Query_Frame`** — the frame that accumulates text, equivalent to the core of
`Pending_Query` minus everything that isn't query accumulation
3. **`Base_Frame`** — wake word, mode switch, mode query, instant dispatch
`query-demo.mjs` becomes responsible for constructing and pushing these frames
rather than passing a flat bag of callbacks to a single class.

40
INTERFACE-THEORY.md Normal file

@@ -0,0 +1,40 @@
# Interface Theory Notes
## The opacity problem with voice-driven development
Voice is rich and fast for expressing intent, but it hides the artifact. You cannot skim a diff by listening. Each response says "done" but you don't see what was actually changed.
**Risk**: accumulated invisible drift. Each individual change seems correct in isolation, but over time the architecture moves away from what you would have chosen had you read it. The AI makes reasonable local decisions that conflict with your broader intentions.
## The interleaving principle
Voice and visual interfaces are complementary, not alternatives:
- **Voice** — intent, direction, questions, routing. Fast, natural, hands-free.
- **Visual** — verification, review, architectural oversight. Necessary to catch drift.
Neither alone is sufficient. A purely voice-driven workflow loses quality control. A purely visual workflow loses speed and naturalness.
## Practical mitigation
- **Commit checkpoints**: review diffs visually before committing, even if the work was directed by voice
- **Periodic code review**: especially before publishing or sharing — catch architectural drift early
- **Voice for high-level, visual for detail**: use voice to say what you want, then verify the implementation matches your intent
- **Ask for plans before changes**: "tell me what you're going to do" before "do it" — keeps intent explicit and reviewable
## Git workflow for voice-driven development
To make drift visible and reversible:
- Assistant works on a **dedicated branch**, not directly on main
- **Frequent small commits** per change — much easier to review than one large diff covering a whole session
- User reviews the branch visually, cherry-picks or merges what they agree with
- Keeps the user in control of what actually lands in the canonical history
This pairs naturally with the voice interface: voice for direction, branch history for transparency.
---
## Related
This applies to any AI-assisted workflow, not just voice. The opacity of LLM output is an inherent property. Voice just amplifies it by removing the reading step entirely.

View File

@@ -14,6 +14,14 @@ The query journal (when implemented) will make it possible to replay or reconstr
## Most interesting things to show
### Multi-character routing — the strongest demo angle
Address different named characters for different tasks: "Trance, take a note — calendar appointment Thursday at three" then "Archangel, add a new keyword to the vocabulary." The framing is immediately legible: different characters have different capabilities, different voices, different latency.
The latency difference tells the story on its own. Trance answers instantly from a local handler; Archangel pauses while Claude thinks. Audiences understand "local vs. remote model" poorly in the abstract, but they see and feel the pause — and it's meaningful rather than broken-feeling because it maps to character.
Some characters could be entirely local (notes, time, calendar lookups), some mostly Claude, some a mix. The architecture is visible in the demo without needing explanation.
### Voice switching mid-session
Switch between characters and have each one say something in character. The contrast between voices makes the capability concrete. The multi-voice demo scripts in `demos/` are a starting point but an improvised exchange lands better.

40
README.md Normal file

@@ -0,0 +1,40 @@
# Voice Pipeline Experiment
This repository is an active experiment. It is published for **transparency and reference** — not as a finished or production-ready project. Expect rough edges, dead ends, work-in-progress notes, and design docs that describe things not yet built.
## What this is
A local voice interface for [Claude Code](https://github.com/anthropics/claude-code): speak a query, it transcribes, classifies intent, and dispatches to Claude. The stack runs entirely on local hardware — no cloud STT/TTS.
Core components:
| File | Role |
|------|------|
| `query-demo.mjs` | Main entry point — mic → STT → query state machine → dispatch |
| `lib/pending-query.mjs` | Query state machine: wake word, silence timer, send/cancel/pause |
| `lib/stt.mjs` | Silero VAD + Whisper STT (sherpa-onnx backend) |
| `lib/fw-stt.mjs` | faster-whisper STT backend with word-level timestamps |
| `tts-server.mjs` | TTS HTTP server (Chatterbox model, voice switching) |
| `lib/tts-client.mjs` | HTTP client for the TTS server |
| `voices.yaml` | Named voice configuration |
| `faster-whisper-server.py` | Python subprocess for faster-whisper transcription |
| `install-faster-whisper.sh` | Builds ctranslate2 from source for CUDA 13 compatibility |
| `download-models.sh` | Downloads Whisper and VAD models |
| `query-demo.service` / `tts-server.service` | Systemd unit templates |
## Status
Working but experimental. The design docs (`CLEANUP-PLAN.md`, `COMMAND-DISPATCH.md`, etc.) describe architectural directions not yet implemented. The project will eventually split into separate focused repositories.
## Offshoots
Links to derived, cleaner projects will be added here as they become ready.
## Requirements
- Node.js (ESM)
- Python 3 with a venv (`setup-venv.sh` or `install-faster-whisper.sh`)
- CUDA GPU (for faster-whisper backend)
- PulseAudio or ALSA for mic capture
- `sherpa-onnx-node` npm package
- Chatterbox TTS model

17
TODO.md

@@ -42,6 +42,7 @@
| `cancel` / `scratch that` | Discard accumulated query, return to idle |
| `stop` / `goodbye` / `hang up` | Shut down the session cleanly |
| `switch voice` / `different voice` | Trigger voice-switch flow |
| `retry` / `computer retry` | Abort hung Claude API call and retry — sends Escape, waits ~300ms, then Enter to the active terminal |
Words are checked normalized (lowercase, punctuation stripped) before classifier runs. Classifier only sees real query content.
@@ -56,6 +57,22 @@
- [ ] **Revert** — discard the current accumulated text and start fresh without dispatching
- [ ] **Working on it** reminder — after dispatching to Claude Code, if no TTS response has come back after N seconds (e.g. 10s), speak "still working on it". Implementation: tts-server tracks timestamp of last /speak call (expose via GET /health or GET /status); query-demo polls after dispatch and fires the reminder if the timestamp hasn't advanced. Single-client assumption keeps this simple — any new /speak means Claude responded.
- [ ] **Multi-client with named roles** — extend the window ID concept to allow multiple named Claude instances, each with its own voice, role, and system prompt. The TTS server uses the client ID to select the correct voice automatically. Voice queries can be addressed to a specific instance: "brainstorm, what if we..." or "notes, add item...". Demo concept: one instance takes notes while another brainstorms — different voices, parallel contexts, all addressable by name. Implementation sketched: client ID passed in /speak body, tts-server maps client IDs to voice names in config.
- **Activation phrase routing** — two distinct modes triggered by the same named prefix:
- Activation phrase *alone* in an utterance → enters interactive query mode for that agent's session (subsequent speech becomes the query)
- Activation phrase *followed immediately by a command* in the same utterance → dispatches instantly without entering query mode; ideal for brief one-shot requests like "Archangel, add a compact command"
- **Administrator role concept** — one named instance (e.g. Archangel with Archangel voice) manages the voice system itself. Example: while working on a project in another session, say "Archangel, add compact to the vocabulary" — Archangel modifies the voice assistant code, restarts the service, and the new feature is immediately available to the active session without any context switch.
- **Agent-restart flow** — Archangel (or any admin role) should be able to restart voice services after modifying them. Pairs with the CCC conduit restart approach from Service management.
- **Team overview** — a meta-query like "who are my characters?" or "what's my team?" lists all currently defined named roles with their voices and assigned contexts. Could be answered locally from the config without hitting any LLM.
- **Typical role taxonomy** — general-purpose roles (notes, calendar, planning, life admin) alongside project-specific roles. Multiple project roles can run simultaneously for parallel project work. One "main" character serves as the default when no specific role is addressed.
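The activation phrase routing described in the list above can be sketched as a small classifier. This Python version is illustrative only: the function name `route_activation` and the returned action labels are hypothetical, and it assumes the role name is the first word of the utterance.

```python
def route_activation(utterance, names):
    """Classify an utterance as (action, role, remainder).

    A name alone enters query mode for that role; a name followed by more
    words dispatches the remainder instantly; anything else is unrouted.
    """
    words = utterance.lower().replace(",", " ").split()
    if not words or words[0] not in names:
        return ("none", None, utterance)
    rest = " ".join(words[1:])
    if not rest:
        return ("enter_query_mode", words[0], "")
    return ("instant_dispatch", words[0], rest)


roles = {"archangel", "trance"}
alone = route_activation("Archangel", roles)
one_shot = route_activation("Archangel, add a compact command", roles)
```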
## Analytics / Observability
- [ ] **Per-turn token analysis script** — read a Claude Code session JSONL and report: total tokens spent, cache hit rate per turn, input/output ratio, and most expensive turns. Session files already contain full usage data (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) per assistant message. Would make the cost of the voice system prompt injection directly visible, and help tune injection frequency. Stretch goal.
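A rough sketch of the per-turn analysis, in Python. The four usage field names come from the item above; the surrounding record layout (`type == "assistant"` with the usage under `message.usage`) and the cache hit rate definition (cache reads divided by all input tokens) are assumptions that would need checking against real session files.

```python
import json


def summarize_usage(lines):
    """Summarise token usage per assistant turn from JSONL lines."""
    turns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        message = rec.get("message")
        if rec.get("type") != "assistant" or not isinstance(message, dict):
            continue
        usage = message.get("usage")
        if not usage:
            continue
        fresh = usage.get("input_tokens", 0)
        read = usage.get("cache_read_input_tokens", 0)
        created = usage.get("cache_creation_input_tokens", 0)
        total_in = fresh + read + created
        turns.append({
            "input": total_in,
            "output": usage.get("output_tokens", 0),
            "cache_hit_rate": read / total_in if total_in else 0.0,
        })
    total = sum(t["input"] + t["output"] for t in turns)
    most_expensive = max(turns, key=lambda t: t["input"] + t["output"], default=None)
    return {"turns": turns, "total_tokens": total, "most_expensive": most_expensive}


sample = [
    json.dumps({"type": "user", "message": {"content": "hi"}}),
    json.dumps({"type": "assistant", "message": {"usage": {
        "input_tokens": 100, "output_tokens": 50,
        "cache_read_input_tokens": 300, "cache_creation_input_tokens": 0}}}),
]
report = summarize_usage(sample)
```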
## Context management
- [ ] **Voice prompt injection frequency** — the system prompt is currently injected into every single dispatch, which wastes tokens as context grows. Consider injecting it only every N queries (e.g. every 3rd), or only when the session is new/resumed. The instructions are mostly stable; Claude doesn't need a reminder every time.
- [ ] **Context size warning** — before context grows large enough to trigger auto-compact, the agent should proactively warn: "Your context is getting large — do you want to compact now?" This gives the user a chance to agree before compaction causes a sudden multi-minute pause mid-session. The human should have agency over when compaction happens; an unexpected quiet five minutes is a worse experience than a mutual decision.
- [ ] **On-demand compact command** — "compact" or "computer compact" as a reserved control word that triggers `/compact` immediately. Pairs with the warning above — agent warns, user says compact.
- [ ] **Compact frequency** — consider compacting more aggressively (smaller threshold) now that on-demand compact and warnings make it less jarring. A smaller context is faster and cheaper per query.
## Configuration / Context
- [ ] Central config file (YAML) — "context" in the broad sense: all settings, preferences, and routing rules in one place. Single source of truth, no scattered CLI args or hardcoded strings. Includes:

53
VOICE-POLICY.md Normal file

@@ -0,0 +1,53 @@
# Voice Policy
## Project stance on voice cloning
This project ships no voice samples and endorses no specific voice. The TTS system (Chatterbox) supports voice cloning via a reference audio sample — the user provides their own.
## Guidance for users: customising the voice
We provide instructions for how to prepare a voice sample. What you clone is your choice and your responsibility.
**How to prepare a voice sample:**
1. Find audio of the voice you want — TV show, podcast, audiobook, whatever appeals to you
2. Cut out a clean 10–30 second clip with clear speech, no background music, minimal noise
3. Use Audacity (or equivalent) to: reduce noise, normalise loudness, export as WAV
4. Pass the file as `--audio-prompt` or configure it in `voices.yaml`
**Quality tips:**
- Avoid clips with background music — it bleeds into the synthesised voice
- Multiple shorter clips tend to work better than one long one
- Emotional range in the source clip produces more expressive output
## Public framing
We treat voice customisation similarly to ringtones: a personal preference the user configures for themselves, using material they have access to. We do not distribute voice samples, do not name or reference specific shows, actors, or productions, and do not encourage or assist with commercial exploitation of anyone's voice.
Users are responsible for complying with the laws of their jurisdiction, including right of publicity and copyright. For personal, non-commercial use this is generally low-risk. For public or commercial deployments, use a commissioned voice actor with a proper release.
## Demo videos
Using a cloned voice in a demo video is low practical risk:
- No commercial transaction, no product being sold
- Transformative use — demonstrating a technical tool, not reproducing entertainment
- Amount of cloned audio is small
- Fan/hobbyist use of fictional AI voices from TV shows is extremely common and essentially never enforced
Realistic risk threshold: only if the video went massively viral and the production company or actor decided to make an example of it. For a niche open source demo this is negligible.
**Recommendation**: use the voice in demos freely. Do not name the project after the fictional character or actor, and do not distribute the source audio sample.
In the unlikely event of a takedown request from the production company or actor, the appropriate response is simply to comply — edit or remove the video. This is a low-friction outcome and not a legal or financial threat.
Response plan if takedown occurs:
1. Comply immediately
2. Publicly note who made the request and why — the Streisand effect tends to do the rest
3. Remake the demo with a fully synthetic voice, a clone of your own voice, or a commissioned voice. The project stands on its technical merits regardless of which voice is used
---
## Project name
The project does not use a name referencing any fictional AI character, show, actor, or voice. Name TBD — working title "voice buddy" for the dispatcher component.

1
example.sh Normal file

@@ -0,0 +1 @@
node query-demo.mjs --tmux-container-id 915f3325d63a --claude-remote ~/Projekt/claude-docker/workspace/claude-remote/claude-remote.mjs --debug-waveform --stt faster-whisper

81
faster-whisper-server.py Executable file

@@ -0,0 +1,81 @@
"""
faster-whisper transcription server.

Protocol (stdin/stdout, one JSON line per exchange):
  Request:  {"audio_b64": "<base64 float32 LE>", "sample_rate": 16000}
  Response: {"text": "...", "words": [{"word": "...", "start": 0.0, "end": 0.1, "probability": 0.99}]}

Startup: writes "ready\n" to stdout when the model is loaded.
"""
import sys
import json
import base64
import argparse
import traceback
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--model', default='base.en')
parser.add_argument('--device', default='cuda')
parser.add_argument('--compute-type', default='int8')
args = parser.parse_args()

sys.stderr.write(f'[fw-server] loading {args.model} ({args.device}, {args.compute_type})...\n')
sys.stderr.flush()

try:
    from faster_whisper import WhisperModel
except ImportError:
    sys.stderr.write('[fw-server] faster-whisper not installed — run: pip install faster-whisper\n')
    sys.exit(1)

try:
    model = WhisperModel(args.model, device=args.device, compute_type=args.compute_type)
    sys.stderr.write(f'[fw-server] ready on {args.device}\n')
except Exception as e:
    sys.stderr.write(f'[fw-server] {args.device} failed ({e}), falling back to cpu\n')
    model = WhisperModel(args.model, device='cpu', compute_type='int8')
    sys.stderr.write('[fw-server] ready on cpu\n')

sys.stdout.write('ready\n')
sys.stdout.flush()

for raw_line in sys.stdin:
    line = raw_line.strip()
    if not line:
        continue
    try:
        req = json.loads(line)
        audio = np.frombuffer(base64.b64decode(req['audio_b64']), dtype=np.float32)
        sr = req.get('sample_rate', 16000)
        segments, _ = model.transcribe(
            audio,
            language='en',
            word_timestamps=True,
            vad_filter=False,  # VAD already done upstream
        )
        text = ''
        words = []
        for seg in segments:
            text += seg.text
            for w in (seg.words or []):
                words.append({
                    'word': w.word,
                    'start': round(float(w.start), 4),
                    'end': round(float(w.end), 4),
                    'probability': round(float(w.probability), 4),
                })
        sys.stderr.write(f'[fw-server] {json.dumps(text)} ({len(words)} words)\n')
        sys.stderr.flush()
        sys.stdout.write(json.dumps({'text': text, 'words': words}) + '\n')
        sys.stdout.flush()
    except Exception:
        sys.stderr.write(f'[fw-server] error:\n{traceback.format_exc()}\n')
        sys.stderr.flush()
        sys.stdout.write(json.dumps({'text': '', 'words': [], 'error': traceback.format_exc()}) + '\n')
        sys.stdout.flush()
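
The stdin/stdout protocol described in the server's docstring can be exercised without a microphone. The following is a minimal sketch that assumes only the request and response shapes documented there. The helper names `encode_request`, `decode_audio` and `decode_response` are illustrative and are not part of the repository.

```python
import base64
import json
import struct


def encode_request(samples, sample_rate=16000):
    """Build one request line: float32 little-endian samples, base64-encoded."""
    raw = struct.pack(f'<{len(samples)}f', *samples)
    payload = {
        'audio_b64': base64.b64encode(raw).decode('ascii'),
        'sample_rate': sample_rate,
    }
    # The server reads one JSON object per line, so terminate with a newline.
    return json.dumps(payload) + '\n'


def decode_audio(audio_b64):
    """Inverse of the encoding step, handy for checking a request by hand."""
    raw = base64.b64decode(audio_b64)
    return list(struct.unpack(f'<{len(raw) // 4}f', raw))


def decode_response(line):
    """Parse one response line into (text, words, error_or_None)."""
    msg = json.loads(line)
    return msg.get('text', ''), msg.get('words', []), msg.get('error')


if __name__ == '__main__':
    request_line = encode_request([0.0, 0.5, -0.25])
    print(request_line.strip())
    print(decode_response('{"text": " hi", "words": []}'))
```

In real use the request line would be written to the subprocess's stdin only after the `ready` line has been read from its stdout, and each response line would be read back in the same order the requests were sent.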

118
install-faster-whisper.sh Executable file
View File

@@ -0,0 +1,118 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${SCRIPT_DIR}/venv"
BUILD_DIR="${SCRIPT_DIR}/build/ctranslate2"
MODEL="${1:-base.en}"
TOKEN_FILE="${HOME}/.secrets/hugging-face.token"
# Locate CUDA
if [ -z "${CUDA_HOME:-}" ]; then
for candidate in /opt/cuda /usr/local/cuda /usr; do
if [ -f "${candidate}/bin/nvcc" ]; then
export CUDA_HOME="${candidate}"
break
fi
done
fi
if [ -z "${CUDA_HOME:-}" ]; then
echo "ERROR: CUDA not found. Set CUDA_HOME manually." >&2
exit 1
fi
echo "==> CUDA: ${CUDA_HOME}"
"${CUDA_HOME}/bin/nvcc" --version | head -1
for tool in cmake git; do
if ! command -v "${tool}" &>/dev/null; then
echo "ERROR: ${tool} not found — install with: sudo pacman -S ${tool}" >&2
exit 1
fi
done
if [ ! -d "${VENV}" ]; then
echo "==> creating venv at ${VENV}"
python3 -m venv "${VENV}"
fi
echo "==> upgrading pip + build tools"
"${VENV}/bin/pip" install --upgrade pip wheel setuptools pybind11 --quiet
# --- clone (skipped if already done) ---
if [ ! -d "${BUILD_DIR}/src/.git" ]; then
echo "==> cloning ctranslate2 from source..."
mkdir -p "${BUILD_DIR}"
git clone --recursive --depth 1 https://github.com/OpenNMT/CTranslate2 "${BUILD_DIR}/src"
else
echo "==> ctranslate2 source already present, skipping clone"
fi
# --- cmake build (skipped if library already installed) ---
if [ ! -f "${VENV}/lib/libctranslate2.so" ] && ! ls "${VENV}/lib/libctranslate2.so."* &>/dev/null 2>&1; then
echo "==> configuring ctranslate2 C++ library..."
mkdir -p "${BUILD_DIR}/cmake-build"
cmake \
-S "${BUILD_DIR}/src" \
-B "${BUILD_DIR}/cmake-build" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="${VENV}" \
-DWITH_CUDA=ON \
-DCUDA_TOOLKIT_ROOT_DIR="${CUDA_HOME}" \
-DCMAKE_CUDA_COMPILER="${CUDA_HOME}/bin/nvcc" \
-DWITH_MKL=OFF \
-DBUILD_CLI=OFF \
-DWITH_TESTS=OFF \
-DCMAKE_POLICY_VERSION_MINIMUM=3.5
echo "==> building ctranslate2 C++ library (this takes 10-20 minutes)..."
cmake --build "${BUILD_DIR}/cmake-build" --parallel "$(nproc)"
cmake --install "${BUILD_DIR}/cmake-build"
else
echo "==> libctranslate2 already installed, skipping cmake build"
fi
echo "==> verifying install..."
ls "${VENV}/include/ctranslate2/" | head -3
ls "${VENV}/lib/libctranslate2"* 2>/dev/null || { echo "ERROR: libctranslate2 not found in venv/lib" >&2; exit 1; }
grep "WITH_CUDA" "${BUILD_DIR}/cmake-build/CMakeCache.txt" | grep -v "^#" || true
# --- Python bindings ---
# Always reinstall from source to ensure we use our CUDA 13 build, not a PyPI wheel
echo "==> removing any existing ctranslate2 install..."
"${VENV}/bin/pip" uninstall -y ctranslate2 2>/dev/null || true
echo "==> building ctranslate2 Python bindings from source..."
CT2_ROOT="${VENV}" \
LIBRARY_PATH="${VENV}/lib:${VENV}/lib64${LIBRARY_PATH:+:${LIBRARY_PATH}}" \
LDFLAGS="-Wl,-rpath,${VENV}/lib" \
"${VENV}/bin/pip" install "${BUILD_DIR}/src/python" --no-build-isolation
# --- faster-whisper ---
if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null 2>&1; then
echo "==> installing faster-whisper"
"${VENV}/bin/pip" install faster-whisper --no-deps
else
echo "==> faster-whisper already installed, skipping"
fi
if [ -f "${TOKEN_FILE}" ]; then
export HF_TOKEN="$(cat "${TOKEN_FILE}")"
echo "==> HuggingFace token loaded from ${TOKEN_FILE}"
else
echo "==> no token found at ${TOKEN_FILE} — unauthenticated download"
fi
echo "==> pre-downloading model: ${MODEL}"
"${VENV}/bin/python3" - <<EOF
from faster_whisper import WhisperModel
print("downloading ${MODEL}...")
WhisperModel("${MODEL}", device="cuda", compute_type="int8_float16")
print("done")
EOF
chmod +x "${SCRIPT_DIR}/faster-whisper-server.py"
echo ""
echo "==> done. Run with: node query-demo.mjs --stt faster-whisper"
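
Once the script has finished, it can be useful to confirm from Python that the source-built bindings actually see the GPU. The following is a minimal sketch meant to be run with the venv's interpreter. `describe_ctranslate2` is a hypothetical helper, and the sketch assumes the `ctranslate2` module exposes `get_cuda_device_count` and `get_supported_compute_types` at module level.

```python
def describe_ctranslate2():
    """Report whether ctranslate2 imports and what CUDA support it exposes."""
    info = {'installed': False, 'cuda_devices': 0, 'cuda_compute_types': []}
    try:
        import ctranslate2
    except ImportError:
        # Not installed, or the shared library could not be loaded.
        return info
    info['installed'] = True
    info['version'] = getattr(ctranslate2, '__version__', 'unknown')
    try:
        info['cuda_devices'] = ctranslate2.get_cuda_device_count()
        if info['cuda_devices'] > 0:
            info['cuda_compute_types'] = sorted(
                ctranslate2.get_supported_compute_types('cuda'))
    except Exception as exc:  # a CPU-only or mismatched build may raise here
        info['error'] = str(exc)
    return info


if __name__ == '__main__':
    for key, value in describe_ctranslate2().items():
        print(f'{key}: {value}')
```

A device count of zero after a successful build would suggest that a CPU-only wheel is still being picked up instead of the source-built library.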

235
lib/fw-stt.mjs Normal file
View File

@@ -0,0 +1,235 @@
/**
* faster-whisper STT backend — drop-in replacement for stt.mjs.
*
* Uses sherpa-onnx Silero VAD for segment detection (same as stt.mjs),
* but transcribes via a faster-whisper Python subprocess which returns
* word-level timestamps.
*/
import { spawn, execSync } from 'node:child_process'
import { createRequire } from 'node:module'
import * as path from 'node:path'
import { existsSync } from 'node:fs'
const require = createRequire(import.meta.url)
const DEFAULT_MODELS_DIR = path.join(import.meta.dirname, '..', 'models')
const DEFAULT_SERVER = path.join(import.meta.dirname, '..', 'faster-whisper-server.py')
const VENV_PYTHON = path.join(import.meta.dirname, '..', 'venv', 'bin', 'python3')
const PRE_ROLL_SAMPLES = 3200
const HISTORY_SAMPLES = 960000
function find_mic() {
const candidates = [
['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
]
for (const [cmd, args] of candidates) {
try {
execSync(`which ${cmd}`, { stdio: 'ignore' })
return [cmd, args]
} catch { /* try next */ }
}
throw new Error('No mic capture command found — need parec or arecord')
}
function s16le_to_f32(buf) {
const out = new Float32Array(buf.length / 2)
for (let i = 0; i < out.length; i++) {
out[i] = buf.readInt16LE(i * 2) / 32768.0
}
return out
}
export class Stt {
constructor({
models_dir = DEFAULT_MODELS_DIR,
whisper_name = 'base.en',
server_path = DEFAULT_SERVER,
device = 'cuda',
debug_url = null,
} = {}) {
this._models_dir = models_dir
this._whisper_name = whisper_name
this._server_path = server_path
this._device = device
this._debug_url = debug_url
this._vad = null
this._server = null
this._history = new Float32Array(HISTORY_SAMPLES)
this._history_pos = 0
// Response queue: each entry is a resolve function waiting for the next JSON line
this._response_queue = []
this._line_buf = ''
// Ready promise — resolved when server prints "ready"
let ready_resolve
this._ready = new Promise(r => { ready_resolve = r })
this._ready_resolve = ready_resolve
}
init() {
const sherpa = require('sherpa-onnx-node')
const vad_model = path.join(this._models_dir, 'silero_vad.onnx')
this._vad = new sherpa.Vad({
sileroVad: {
model: vad_model,
threshold: 0.5,
minSilenceDuration: 0.5,
minSpeechDuration: 0.1,
windowSize: 512,
maxSpeechDuration: 30.0,
},
sampleRate: 16000,
numThreads: 1,
provider: 'cpu',
debug: false,
}, 60)
const python = existsSync(VENV_PYTHON) ? VENV_PYTHON : 'python3'
process.stderr.write(`[fw-stt] python: ${python}\n`)
process.stderr.write(`[fw-stt] model: ${this._whisper_name}\n`)
this._server = spawn(python, [
this._server_path,
'--model', this._whisper_name,
'--device', this._device,
], {
stdio: ['pipe', 'pipe', 'inherit'],
env: { ...process.env },
})
this._server.stdout.on('data', (chunk) => {
this._line_buf += chunk.toString()
let nl
while ((nl = this._line_buf.indexOf('\n')) !== -1) {
const line = this._line_buf.slice(0, nl).trim()
this._line_buf = this._line_buf.slice(nl + 1)
if (line === 'ready') {
this._ready_resolve()
continue
}
const resolver = this._response_queue.shift()
if (resolver) {
try {
resolver(JSON.parse(line))
} catch {
resolver({ text: '', words: [] })
}
}
}
})
this._server.on('error', err => process.stderr.write(`[fw-stt] server error: ${err.message}\n`))
this._server.on('close', code => process.stderr.write(`[fw-stt] server exited (${code})\n`))
}
_with_preroll(seg) {
const pre_start = Math.max(0, seg.start - PRE_ROLL_SAMPLES)
const pre_len = seg.start - pre_start
if (pre_len === 0) return seg.samples
const out = new Float32Array(pre_len + seg.samples.length)
for (let i = 0; i < pre_len; i++) {
out[i] = this._history[(pre_start + i) % HISTORY_SAMPLES]
}
out.set(seg.samples, pre_len)
return out
}
async _transcribe(samples) {
await this._ready
return new Promise((resolve) => {
this._response_queue.push(resolve)
const bytes = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength)
const request = JSON.stringify({ audio_b64: bytes.toString('base64'), sample_rate: 16000 }) + '\n'
this._server.stdin.write(request)
})
}
_post_debug(meta, samples) {
const json_buf = Buffer.from(JSON.stringify(meta), 'utf8')
const len_buf = Buffer.allocUnsafe(4)
len_buf.writeUInt32LE(json_buf.byteLength, 0)
const samp_buf = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength)
const body = Buffer.concat([len_buf, json_buf, samp_buf])
fetch(this._debug_url, {
method: 'POST',
body,
headers: { 'Content-Type': 'application/octet-stream' },
}).catch(err => {
process.stderr.write(`[fw-stt] debug post failed: ${err.message}\n`)
})
}
listen(on_text, { on_audio } = {}) {
const [cmd, args] = find_mic()
process.stderr.write(`[fw-stt] mic: ${cmd} ${args.join(' ')}\n`)
const mic = spawn(cmd, args, { stdio: ['ignore', 'pipe', 'inherit'] })
const VAD_WIN = 512
let pending = Buffer.alloc(0)
mic.stdout.on('data', (chunk) => {
if (on_audio) on_audio(chunk)
pending = Buffer.concat([pending, chunk])
while (pending.length >= VAD_WIN * 2) {
const win = pending.subarray(0, VAD_WIN * 2)
pending = pending.subarray(VAD_WIN * 2)
const f32 = s16le_to_f32(win)
const base = this._history_pos % HISTORY_SAMPLES
for (let i = 0; i < f32.length; i++) {
this._history[(base + i) % HISTORY_SAMPLES] = f32[i]
}
this._history_pos += f32.length
this._vad.acceptWaveform(f32)
}
while (!this._vad.isEmpty()) {
const seg = this._vad.front()
this._vad.pop()
const drift = seg.start - (this._history_pos - seg.samples.length)
const with_pre = this._with_preroll(seg)
process.stderr.write(`[fw-stt] segment: samples=${seg.samples.length} drift=${drift}\n`)
this._transcribe(with_pre).then(result => {
const text = (result.text ?? '').trim()
const words = result.words ?? []
process.stderr.write(`[fw-stt] raw: ${JSON.stringify(result.text)} (${words.length} words)\n`)
if (this._debug_url) {
try {
this._post_debug({
preroll_length: with_pre.length - seg.samples.length,
transcript: result.text ?? '',
tokens: words.map(w => w.word),
timestamps: words.map(w => w.start),
durations: words.map(w => w.end - w.start),
timestamp: Date.now(),
drift,
}, with_pre)
} catch (err) {
process.stderr.write(`[fw-stt] debug post error: ${err.message}\n`)
}
}
if (text) on_text(text)
}).catch(err => {
process.stderr.write(`[fw-stt] transcription error: ${err.message}\n`)
})
}
})
mic.on('error', err => process.stderr.write(`[fw-stt] mic error: ${err.message}\n`))
mic.on('close', code => {
if (code !== null && code !== 0) {
process.stderr.write(`[fw-stt] mic exited (${code})\n`)
}
})
return () => { mic.kill(); this._server?.kill() }
}
}
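
`_post_debug` frames each POST body as a 4-byte little-endian length, the UTF-8 JSON metadata, and then the raw float32 samples. A receiver could unpack that layout roughly as follows. This is a sketch of a hypothetical receiver, not code from the repository, and it assumes the samples arrive as little-endian float32, which holds when the sender runs on a little-endian host.

```python
import json
import struct


def build_debug_frame(meta, samples):
    """Mirror of the sender's layout: u32 LE length, JSON bytes, float32 samples."""
    raw_json = json.dumps(meta).encode('utf-8')
    header = struct.pack('<I', len(raw_json))
    return header + raw_json + struct.pack(f'<{len(samples)}f', *samples)


def parse_debug_frame(body):
    """Split a frame back into (meta dict, list of float samples)."""
    if len(body) < 4:
        raise ValueError('frame too short for length prefix')
    (json_len,) = struct.unpack_from('<I', body, 0)
    json_end = 4 + json_len
    if json_end > len(body):
        raise ValueError('JSON length exceeds body size')
    meta = json.loads(body[4:json_end].decode('utf-8'))
    sample_bytes = body[json_end:]
    if len(sample_bytes) % 4:
        raise ValueError('sample payload is not a whole number of float32 values')
    count = len(sample_bytes) // 4
    return meta, list(struct.unpack(f'<{count}f', sample_bytes))


if __name__ == '__main__':
    frame = build_debug_frame({'transcript': 'hi', 'drift': 0}, [0.25, -1.0])
    print(parse_debug_frame(frame))
```

The length prefix is what lets the JSON and the binary samples share one `application/octet-stream` body without any escaping.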


@@ -55,7 +55,7 @@ export class Pending_Query {
   }
   _has_real_words(text) {
-    return /[a-z]{2,}/i.test(text)
+    return (text.match(/\w/g)?.length ?? 0) >= 1
   }
   append(text) {
@@ -77,7 +77,8 @@ export class Pending_Query {
     if (!this.use_timer || this.is_empty) {
       return
     }
-    const samples = new Int16Array(chunk.buffer, chunk.byteOffset, chunk.byteLength >> 1)
+    const aligned = (chunk.byteOffset % 2 === 0) ? chunk : Buffer.from(chunk)
+    const samples = new Int16Array(aligned.buffer, aligned.byteOffset, aligned.byteLength >> 1)
     let sum = 0
     for (let i = 0; i < samples.length; i++) {
       sum += samples[i] * samples[i]
@@ -86,6 +87,7 @@ export class Pending_Query {
     if (rms > 0.02) {
       this._loud_streak++
       if (this._loud_streak >= LOUD_STREAK_NEEDED) {
+        process.stderr.write(`[pending-query] audio reset timer (rms=${rms.toFixed(3)})\n`)
         this._reset_silence_timer()
       }
     } else {
@@ -93,8 +95,14 @@ export class Pending_Query {
     }
   }
+  _collapse_repeated_words(norm) {
+    return norm.replace(/\b(\w+)(?:\s+\1){2,}\b/g, '$1')
+  }
   async process_utterance(text) {
-    const norm = text.trim().toLowerCase().replace(/[^a-z ]/g, '').replace(/\s+/g, ' ').trim()
+    const norm = this._collapse_repeated_words(
+      text.trim().toLowerCase().replace(/[^a-z ]/g, '').replace(/\s+/g, ' ').trim()
+    )
     process.stderr.write(`[pending-query] utterance norm: ${JSON.stringify(norm)}\n`)
@@ -153,6 +161,13 @@ export class Pending_Query {
     if (this.instant_dispatch.has(norm)) {
       const query = this.instant_dispatch.get(norm)
       process.stderr.write(`[pending-query] instant dispatch: ${JSON.stringify(query)}\n`)
+      clearTimeout(this._silence_timer)
+      this._silence_timer = null
+      this.accumulated = ''
+      this._loud_streak = 0
+      if (this.use_wake_word) {
+        this._active = false
+      }
       this.on_submit?.(query)
       return
     }
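
The normalisation and repeated-word collapse in `process_utterance` can be tried in isolation. The sketch below ports the two regular expressions to Python's `re` module; the function names are illustrative. Note that `{2,}` requires at least two repeats after the first word, so only runs of three or more identical words collapse and a doubled word is left alone.

```python
import re

# Same pattern as _collapse_repeated_words: a word followed by 2+ copies of itself.
_REPEATED = re.compile(r'\b(\w+)(?:\s+\1){2,}\b')


def normalize(text):
    """Lowercase, drop everything except a-z and spaces, squeeze whitespace."""
    lowered = text.strip().lower()
    letters_only = re.sub(r'[^a-z ]', '', lowered)
    return re.sub(r'\s+', ' ', letters_only).strip()


def collapse_repeated_words(norm):
    """Replace a run of three or more identical words with a single copy."""
    return _REPEATED.sub(r'\1', norm)


if __name__ == '__main__':
    for raw in ['Stop, stop. STOP!', 'no no', 'go go go now']:
        print(repr(raw), '->', repr(collapse_repeated_words(normalize(raw))))
```

Collapsing before the instant-dispatch lookup means a stuttered transcript can still match a single-word entry in the dispatch map.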


@@ -54,17 +54,32 @@ export class Stt {
     models_dir = DEFAULT_MODELS_DIR,
     whisper_name = 'base.en',
     provider = 'cuda',
+    debug_url = null,
   } = {}) {
     this._whisper_dir = path.join(models_dir, `sherpa-onnx-whisper-${whisper_name}`)
     this._whisper_name = whisper_name
     this._vad_model = path.join(models_dir, 'silero_vad.onnx')
     this._provider = provider
+    this._debug_url = debug_url
     this._recognizer = null
     this._vad = null
     this._history = new Float32Array(HISTORY_SAMPLES)
     this._history_pos = 0 // total samples fed, monotonically increasing
   }
+  _post_debug(meta, samples) {
+    const json_buf = Buffer.from(JSON.stringify(meta), 'utf8')
+    const len_buf = Buffer.allocUnsafe(4)
+    len_buf.writeUInt32LE(json_buf.byteLength, 0)
+    const samp_buf = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength)
+    const body = Buffer.concat([len_buf, json_buf, samp_buf])
+    fetch(this._debug_url, {
+      method: 'POST',
+      body,
+      headers: { 'Content-Type': 'application/octet-stream' },
+    }).catch(() => {})
+  }
   init() {
     const sherpa = require('sherpa-onnx-node')
     const { _whisper_dir: wdir, _whisper_name: wname, _vad_model, _provider } = this
@@ -75,6 +90,7 @@ export class Stt {
         whisper: {
           encoder: path.join(wdir, `${wname}-encoder.int8.onnx`),
           decoder: path.join(wdir, `${wname}-decoder.int8.onnx`),
+          enableTokenTimestamps: 1,
         },
         tokens: path.join(wdir, `${wname}-tokens.txt`),
         numThreads: 2,
@@ -100,11 +116,20 @@ export class Stt {
     }, 60) // 60-second internal ring buffer
   }
-  _transcribe(samples) {
+  _get_result(samples) {
     const stream = this._recognizer.createStream()
+    try { stream.setOption('enableTokenTimestamps', '1') } catch {}
     stream.acceptWaveform({ samples, sampleRate: 16000 })
     this._recognizer.decode(stream)
-    return this._recognizer.getResult(stream).text.trim()
+    return this._recognizer.getResult(stream)
   }
+  _transcribe_raw(samples) {
+    return this._get_result(samples).text
+  }
+  _transcribe(samples) {
+    return this._transcribe_raw(samples).trim()
+  }
   _with_preroll(seg) {
@@ -153,8 +178,38 @@ export class Stt {
       while (!this._vad.isEmpty()) {
         const seg = this._vad.front()
         this._vad.pop()
-        const text = this._transcribe(this._with_preroll(seg))
-        if (text) on_text(text)
+        const drift = seg.start - (this._history_pos - seg.samples.length)
+        const with_pre = this._with_preroll(seg)
+        process.stderr.write(`[stt] segment: start=${seg.start} history_pos=${this._history_pos} samples=${seg.samples.length} drift=${drift}\n`)
+        const result = this._get_result(with_pre)
+        if (!this._result_keys_logged) {
+          this._result_keys_logged = true
+          process.stderr.write(`[stt] result keys: ${JSON.stringify(Object.keys(result))}\n`)
+          process.stderr.write(`[stt] result sample: ${JSON.stringify({ tokens: result.tokens?.slice(0,3), timestamps: result.timestamps?.slice(0,3), durations: result.durations?.slice(0,3) })}\n`)
+        }
+        const raw = result.text ?? ''
+        const text = raw.trim()
+        process.stderr.write(`[stt] whisper raw: ${JSON.stringify(raw)}\n`)
+        if (this._debug_url) {
+          try {
+            this._post_debug({
+              preroll_length: with_pre.length - seg.samples.length,
+              transcript: raw,
+              tokens: result.tokens ?? [],
+              timestamps: result.timestamps ?? [],
+              durations: result.durations ?? [],
+              timestamp: Date.now(),
+              seg_start: seg.start,
+              history_pos: this._history_pos,
+              drift,
+            }, with_pre)
+          } catch (err) {
+            process.stderr.write(`[stt] debug post error: ${err.message}\n`)
+          }
+        }
+        if (text) {
+          on_text(text)
+        }
       }
     })
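
The pre-roll lookup and the `drift` value logged above both depend on treating `_history_pos` as an absolute sample counter over a fixed-size ring buffer. The sketch below models that bookkeeping in Python with a tiny capacity so the wraparound is visible. The `History` class and `drift` function are illustrative names, and the sketch assumes `seg.start` is an absolute sample index on the same timeline as `_history_pos`. In `fw-stt.mjs` the corresponding constants are `PRE_ROLL_SAMPLES = 3200` (0.2 s at 16 kHz) and `HISTORY_SAMPLES = 960000` (60 s).

```python
class History:
    """Fixed-size ring buffer indexed by an ever-increasing sample counter."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.pos = 0  # total samples written so far; never wraps

    def write(self, samples):
        n = len(self.buf)
        for sample in samples:
            self.buf[self.pos % n] = sample
            self.pos += 1

    def preroll(self, seg_start, want):
        """Return up to `want` samples that immediately precede seg_start."""
        n = len(self.buf)
        pre_start = max(0, seg_start - want)  # clamp at the start of the stream
        return [self.buf[(pre_start + i) % n] for i in range(seg_start - pre_start)]


def drift(seg_start, seg_len, history_pos):
    """Zero when the segment ends exactly at the newest sample written."""
    return seg_start - (history_pos - seg_len)


if __name__ == '__main__':
    history = History(capacity=8)
    history.write([float(i) for i in range(10)])  # samples 8 and 9 wrap to slots 0 and 1
    print(history.preroll(seg_start=9, want=3))
    print(drift(seg_start=100, seg_len=50, history_pos=150))
```

A non-zero drift indicates that the segment's reported start and the history counter disagree about where the segment sits, in which case the pre-roll samples would be taken from the wrong place.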


@@ -16,7 +16,13 @@
 import { execFileSync } from 'node:child_process'
-import { Stt } from './lib/stt.mjs'
+process.on('uncaughtException', err => {
+  process.stderr.write(`[uncaughtException] ${err.stack ?? err.message}\n`)
+})
+process.on('unhandledRejection', (reason) => {
+  process.stderr.write(`[unhandledRejection] ${reason?.stack ?? reason}\n`)
+})
 import { Tts_Client } from './lib/tts-client.mjs'
 import { Pending_Query } from './lib/pending-query.mjs'
@@ -29,12 +35,20 @@ const audio_prompt = get_arg('--audio-prompt') ?? '/home/devilholk/Documents/r
 const whisper_name = get_arg('--whisper') ?? 'base.en'
 const container_id = get_arg('--tmux-container-id')
 const claude_remote = get_arg('--claude-remote')
+const _dw = get_arg('--debug-waveform')
+const debug_waveform = (_dw && !_dw.startsWith('--'))
+  ? _dw
+  : (process.argv.includes('--debug-waveform') ? 'http://localhost:3888/emit/audio-debug' : null)
+const stt_backend = get_arg('--stt') ?? 'sherpa-onnx'
 const SILENCE_TIMEOUT = parseInt(get_arg('--silence-timeout') ?? '6000')
+const { Stt } = stt_backend === 'faster-whisper'
+  ? await import('./lib/fw-stt.mjs')
+  : await import('./lib/stt.mjs')
 const tts = new Tts_Client()
-const stt = new Stt({ whisper_name })
+const stt = new Stt({ whisper_name, debug_url: debug_waveform })
 const USE_WAKE_WORD = get_arg('--wake-word') !== 'off'
@@ -47,6 +61,10 @@ if (container_id && claude_remote) {
   process.stderr.write(`[query-demo] no --tmux-container-id/--claude-remote — logging only\n`)
 }
 process.stderr.write(`[query-demo] wake word: ${USE_WAKE_WORD ? 'on (say "computer" to activate)' : 'off'}\n`)
+process.stderr.write(`[query-demo] stt backend: ${stt_backend}\n`)
+if (debug_waveform) {
+  process.stderr.write(`[query-demo] debug waveform: ${debug_waveform}\n`)
+}
 if (USE_WAKE_WORD) {
   await tts.speak('Ready. Say computer to begin a query, or always listen for hands-free mode. Say help for usage.', { audio_prompt })
 } else {
@@ -105,8 +123,12 @@ const pending = new Pending_Query({
     setTimeout(async () => {
       try {
         const res = await fetch(`${tts._url}/activity`)
+        if (!res.ok) {
+          await tts.chime('working').catch(() => {})
+          return
+        }
         const { last_speak_at } = await res.json()
-        if (last_speak_at < dispatch_time) {
+        if (!(last_speak_at >= dispatch_time)) {
           await tts.chime('working').catch(() => {})
         }
       } catch {
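
The `--debug-waveform` handling in this file treats the flag as having an optional value: a following token is used as the URL unless it looks like another flag, in which case the default endpoint applies. The sketch below restates that rule in Python. The `get_arg` shown here is an assumed stand-in for the repository's helper (which is not part of this diff) and simply returns the token after the flag.

```python
DEFAULT_DEBUG_URL = 'http://localhost:3888/emit/audio-debug'


def get_arg(argv, name):
    """Assumed behaviour: return the token following `name`, or None."""
    if name in argv:
        index = argv.index(name)
        if index + 1 < len(argv):
            return argv[index + 1]
    return None


def resolve_debug_url(argv):
    """Explicit URL if given, default URL if the bare flag is present, else None."""
    value = get_arg(argv, '--debug-waveform')
    if value and not value.startswith('--'):
        return value
    return DEFAULT_DEBUG_URL if '--debug-waveform' in argv else None


if __name__ == '__main__':
    print(resolve_debug_url(['--debug-waveform', '--stt', 'faster-whisper']))
    print(resolve_debug_url(['--debug-waveform', 'http://example.test/debug']))
    print(resolve_debug_url(['--stt', 'faster-whisper']))
```

The first call mirrors the invocation in `example.sh`, where `--debug-waveform` is immediately followed by `--stt` and therefore has to fall back to the default URL rather than treating `--stt` as the endpoint.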

18
query-demo.service Normal file
View File

@@ -0,0 +1,18 @@
[Unit]
Description=Voice Query Demo
After=tts-server.service
Requires=tts-server.service
[Service]
Type=simple
WorkingDirectory=%h/PATH/TO/voice
ExecStart=node query-demo.mjs \
--tmux-container-id CONTAINER_ID \
--claude-remote %h/PATH/TO/claude-remote/claude-remote.mjs
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target


@@ -153,12 +153,20 @@ const server = http.createServer(async (req, res) => {
opts.audio_prompt = voices[current_voice].path
}
const wait = opts.wait === true
const promise = enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
if (wait) {
try {
await enqueue(() => tts.speak_streaming(text, { preprocess: false, ...opts }))
await promise
send(res, 200, { ok: true })
} catch (err) {
send(res, 500, { error: err.message })
}
} else {
promise.catch(err => console.error('speak error:', err.message))
send(res, 200, { ok: true, queued: true })
}
return
}

15
tts-server.service Normal file
View File

@@ -0,0 +1,15 @@
[Unit]
Description=TTS Server (Chatterbox)
After=network.target
[Service]
Type=simple
WorkingDirectory=%h/PATH/TO/voice
ExecStart=node tts-server.mjs
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target