WebSocket server, language/task args, verbose flag, misc improvements #2

Merged
mikael-lovqvist merged 13 commits from mikael-lovqvists-claude-agent/stt-server:websocket-server into main 2026-06-07 09:27:02 +00:00

Summary

  • WebSocket server broadcasting all events to every connected client — no subscription handshake required (STT_PORT env var, default 11501)
  • Remove stdout event output — WebSocket is the sole event channel
  • --verbose / -v flag — info logging off by default for clean journal output when running as a service; errors always go to stderr
  • --language to force a specific language (e.g. en, sv); leave unset for auto-detect
  • --task transcribe|translate — transcribe keeps source language, translate converts to English
  • Detected language and confidence included in transcript events (language, language_probability)
  • Load HuggingFace token from ~/.secrets/hugging-face.token (consistent with tts-server, overridable via HF_TOKEN_FILE)
  • Log to stderr when model needs to be downloaded
  • examples/listen.mjs and examples/transcripts.mjs using Node.js built-in WebSocket
  • NOTES.md with unused TranscriptionInfo fields for future reference
  • README: model selection table with VRAM estimates, compute type guide, language/translation docs
  • Both setup scripts: add torch, silero-vad, websockets; fix missing faster-whisper deps in local-build path

Test plan

  • ./stt-server.py --model large-v3 --compute-type float16 --verbose starts and transcribes
  • node examples/listen.mjs receives events
  • node examples/transcripts.mjs prints transcript text
  • Silent by default (no stderr noise when running as service)
mikael-lovqvists-claude-agent added 13 commits 2026-06-07 09:26:47 +00:00
Every connection receives the full event stream (vad_start, vad_end,
transcript, error) from the moment it connects — no subscription
handshake required. The asyncio WebSocket server runs in a daemon thread
alongside the VAD loop and transcription thread. Events still go to
stdout unchanged.

Port is configurable via STT_PORT env var (default: 11501).
Add websockets to both setup scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
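
The threading arrangement described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from stt-server.py: the `Broadcaster` class, its method names, `get_port`, and the `"type"` key in the event are all hypothetical. Connections are modelled as any object with an async `send` method, which is the shape the `websockets` library's connection objects have.

```python
import asyncio
import json
import os
import threading


class Broadcaster:
    """Hypothetical helper: fan out JSON events to every registered client."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.clients = set()  # objects exposing an async send(message)

    def start(self):
        # Daemon thread: the event loop never keeps the process alive.
        thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        thread.start()
        return thread

    async def _send_all(self, message):
        # Iterate over a copy so failed clients can be dropped while sending.
        for client in list(self.clients):
            try:
                await client.send(message)
            except Exception:
                self.clients.discard(client)

    def publish(self, event):
        # Callable from the VAD loop or transcription thread; the actual
        # sends are scheduled onto the event loop thread.
        message = json.dumps(event)
        return asyncio.run_coroutine_threadsafe(self._send_all(message), self.loop)


def get_port():
    # STT_PORT env var with the default stated in the PR.
    return int(os.environ.get("STT_PORT", "11501"))
```

In a real server the connection handler would add each new connection to `clients` on connect and discard it on disconnect, which is what makes a subscription handshake unnecessary.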
Errors always go to stderr. Info logs (startup, VAD events, transcripts)
only appear with --verbose / -v, keeping stderr clean when running as a
system service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
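
One way to get this behaviour with the standard library is sketched below. The logger name and the `build_logger` function are assumptions for illustration; the actual script may wire its logging differently.

```python
import argparse
import logging
import sys


def build_logger(argv=None):
    """Hypothetical setup: errors always logged, info only with -v."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable info logging on stderr")
    # parse_known_args ignores the script's other flags in this sketch.
    args, _ = parser.parse_known_args(argv)

    logger = logging.getLogger("stt-server")
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(sys.stderr))
    # ERROR by default keeps the journal quiet; INFO only on request.
    logger.setLevel(logging.INFO if args.verbose else logging.ERROR)
    logger.propagate = False
    return logger
```

Note that with this threshold, warnings are also suppressed unless `--verbose` is given.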
listen.mjs: prints all events as JSON objects.
transcripts.mjs: prints transcript text only.
Both use Node 21+ built-in WebSocket — no libraries required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Checks the cache first with local_files_only=True; if the model isn't
present, logs "downloading model ..." to stderr before WhisperModel
triggers the actual download.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
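
The cache-then-download pattern can be sketched as below. `ensure_model` and its `resolve` parameter are hypothetical: `resolve` stands in for whatever call locates the model and accepts a `local_files_only` keyword. faster-whisper's `download_model` helper takes such an argument and could plausibly fill that role, but the PR does not say which call the script actually uses.

```python
import sys


def ensure_model(name, resolve, log=sys.stderr):
    """Try the local cache first; announce a download only when needed.

    resolve(name, local_files_only=bool) is an assumed callable that
    returns a model path, or raises if the model cannot be found.
    """
    try:
        # Cache hit: nothing is logged and nothing is fetched.
        return resolve(name, local_files_only=True)
    except Exception:
        # Cache miss: tell the operator before the slow step starts.
        print(f"downloading model {name} ...", file=log)
        return resolve(name, local_files_only=False)
```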
Reads ~/.secrets/hugging-face.token by default, overridable via HF_TOKEN_FILE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
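
A minimal sketch of the token lookup, assuming a missing or empty file simply means "no token" (the PR does not state what the script does in that case). The function name `load_hf_token` is hypothetical.

```python
import os
from pathlib import Path

DEFAULT_TOKEN_FILE = "~/.secrets/hugging-face.token"


def load_hf_token(env=os.environ):
    """Return the token text, or None if the file is missing or empty."""
    # HF_TOKEN_FILE overrides the default location.
    path = Path(env.get("HF_TOKEN_FILE", DEFAULT_TOKEN_FILE)).expanduser()
    try:
        # Strip the trailing newline most editors add.
        return path.read_text().strip() or None
    except OSError:
        # Assumption: an unreadable file is not an error.
        return None
```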
--language: force a specific language (e.g. en, sv) or leave unset for auto-detect
--task: transcribe (default) or translate to English
Previously language was hardcoded to 'en' which caused multilingual models
to hallucinate translations instead of transcribing the source language.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
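
The two flags map naturally onto the `language` and `task` parameters of faster-whisper's `WhisperModel.transcribe`. The sketch below shows one possible wiring; `parse_args` and `transcribe_kwargs` are hypothetical helpers, not functions from the script.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # None means "let the model detect the language".
    parser.add_argument("--language", default=None,
                        help="language code such as en or sv; unset for auto-detect")
    parser.add_argument("--task", choices=["transcribe", "translate"],
                        default="transcribe",
                        help="transcribe keeps the source language; "
                             "translate outputs English")
    return parser.parse_args(argv)


def transcribe_kwargs(args):
    # Intended to be passed as model.transcribe(audio, **kwargs).
    return {"language": args.language, "task": args.task}
```

Defaulting `language` to `None` rather than `'en'` is what avoids the hallucinated translations described above.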
Unpacks transcription info instead of discarding it. Adds language and
language_probability fields to transcript events, and includes them in
verbose log output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
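
A transcript event carrying the two new fields might be assembled as sketched here. Only `language` and `language_probability` come from the PR; the `"type"` and `"text"` keys and the `transcript_event` function are assumptions for illustration. `info` is any object with those two attributes, as faster-whisper's `TranscriptionInfo` has.

```python
def transcript_event(text, info):
    """Build a transcript event dict from text and transcription info."""
    return {
        "type": "transcript",          # assumed key name
        "text": text,                  # assumed key name
        "language": info.language,
        "language_probability": info.language_probability,
    }
```

With faster-whisper, `info` would be the second element of the tuple returned by `model.transcribe(...)`, which was previously discarded.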
mikael-lovqvist merged commit 4f983ca276 into main 2026-06-07 09:27:02 +00:00

Reference: efforting.tech/stt-server#2