Merge pull request 'WebSocket server, language/task args, verbose flag, misc improvements' (#2) from mikael-lovqvists-claude-agent/stt-server:websocket-server into main

Reviewed-on: #2
Add NOTES.md with TranscriptionInfo unused fields
8 changed files with 457 additions and 21 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
-venv/
+venv/
+build/
--- a/NOTES.md
+++ b/NOTES.md
@@ -0,0 +1,9 @@
+# Notes
+
+## TranscriptionInfo — unused fields
+
+`model.transcribe()` returns a `TranscriptionInfo` object as its second value. We currently use `language` and `language_probability`. Other available fields:
+
+- **`all_language_probs`** — full ranked list of `(language, probability)` tuples for the segment. Useful for debugging misdetection — e.g. when the model hallucinates Sinhala on noise, this would show Sinhala at the top with a high probability. Could be included in transcript events or exposed as a diagnostic endpoint.
+- **`duration`** — total audio duration fed to the model.
+- **`duration_after_vad`** — audio duration left after faster-whisper's built-in VAD filter (not meaningful here since we pass `vad_filter=False`).
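
Review note on `all_language_probs`: a minimal sketch of how the ranked list described above could be trimmed for a diagnostic log line or event field. The helper name `top_languages` and the sample values are hypothetical and not part of this PR. The only assumption about faster-whisper is the one the note already makes: the field holds `(language, probability)` tuples. The sketch also tolerates `None`, in case the field is not populated.

```python
def top_languages(all_language_probs, k=3):
    """Return the k most probable (language, probability) pairs, rounded."""
    if not all_language_probs:
        # Treat a missing or empty field as "nothing to report".
        return []
    ranked = sorted(all_language_probs, key=lambda pair: pair[1], reverse=True)
    return [(lang, round(float(prob), 3)) for lang, prob in ranked[:k]]


# Made-up values for illustration only, not real model output.
sample = [("en", 0.12), ("si", 0.71), ("sv", 0.05), ("de", 0.02)]
print(top_languages(sample, k=2))
```

In `transcription_worker` this could be called as `top_languages(info.all_language_probs)`, with the result added to the verbose log line.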
--- a/README.md
+++ b/README.md
@@ -15,4 +15,48 @@ This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding

 ### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/)

-We will have two different setups here depending on if you want to build ctranslate2 locally or not. This shall be documented.
+We will have two different setups here, depending on whether you want to build ctranslate2 locally or not. This shall be documented.
+
+
+## Model selection
+
+Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from HuggingFace on first use.
+
+| Model | VRAM | Quality | Notes |
+|-------|------|---------|-------|
+| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
+| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
+| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
+| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |
+
+English-only models (`.en` suffix) tend to be more accurate than multilingual models of the same size for English speech, especially at the smaller sizes, and they skip the language-detection step.
+
+
+## Compute type
+
+Pass `--compute-type <type>` to control the numeric precision used during inference.
+
+| Type | Notes |
+|------|-------|
+| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
+| `float16` | Slightly better accuracy, higher VRAM usage. |
+| `int8` | CPU-friendly, lower quality. |
+
+If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
+
+
+## Language and translation
+
+By default the server auto-detects the spoken language and transcribes it as-is.
+
+| Argument | Default | Notes |
+|----------|---------|-------|
+| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Skips language detection and avoids misidentification. |
+| `--task transcribe` | default | Output text in the spoken language. |
+| `--task translate` | | Translate speech to English regardless of source language. |
+
+> [!NOTE]
+> The `.en` model variants (`base.en`, `small.en` etc.) are English-only and do not support `--task translate` or non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.
+
+> [!WARNING]
+> Omitting `--language` with a multilingual model and English-only speech may cause occasional misdetection. Pass `--language en` to avoid this if you only speak English.
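
Review note on the `.en` restriction: the README note could also be enforced at startup instead of relying on the reader. A minimal sketch follows, assuming a hypothetical helper `check_model_args` that is not part of this PR; it only encodes the two rules stated in the note above.

```python
def check_model_args(model, language=None, task="transcribe"):
    """Return a list of problems for an English-only model; empty if none."""
    problems = []
    if not model.endswith(".en"):
        return problems  # multilingual model: nothing to check here
    if task == "translate":
        problems.append(f"{model} is English-only and cannot translate")
    if language not in (None, "en"):
        problems.append(
            f"{model} is English-only; --language {language} is not supported"
        )
    return problems


for problem in check_model_args("base.en", language="sv", task="translate"):
    print(f"warning: {problem}")
```

Returning the problems as a list, rather than raising, leaves the caller free to either warn or exit.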
--- a/examples/listen.mjs
+++ b/examples/listen.mjs
@@ -0,0 +1,24 @@
+// Connect to the STT server and print all events.
+// Usage: node listen.mjs   (uses the built-in WebSocket: default in Node 22+, --experimental-websocket on Node 21)
+
+const PORT = process.env.STT_PORT ?? '11501'
+const ws   = new WebSocket(`ws://localhost:${PORT}`)
+
+ws.addEventListener('open', () => {
+	process.stderr.write(`connected to ws://localhost:${PORT}\n`)
+})
+
+ws.addEventListener('message', ({ data }) => {
+	const event = JSON.parse(data)
+	console.log(event)
+})
+
+ws.addEventListener('close', () => {
+	process.stderr.write('disconnected\n')
+	process.exit(0)
+})
+
+ws.addEventListener('error', (err) => {
+	process.stderr.write(`error: ${err.message}\n`)
+	process.exit(1)
+})
--- a/examples/transcripts.mjs
+++ b/examples/transcripts.mjs
@@ -0,0 +1,19 @@
+// Connect to the STT server and print transcript text only.
+// Usage: node transcripts.mjs   (uses the built-in WebSocket: default in Node 22+, --experimental-websocket on Node 21)
+
+const PORT = process.env.STT_PORT ?? '11501'
+const ws   = new WebSocket(`ws://localhost:${PORT}`)
+
+ws.addEventListener('open', () => {
+	process.stderr.write(`connected to ws://localhost:${PORT}\n`)
+})
+
+ws.addEventListener('message', ({ data }) => {
+	const event = JSON.parse(data)
+	if (event.event === 'transcript') {
+		console.log(event.text)
+	}
+})
+
+ws.addEventListener('close', () => process.exit(0))
+ws.addEventListener('error', () => process.exit(1))
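
Review note: since the server venv already installs the `websockets` package, the same client can be sketched in Python. This is an illustrative equivalent of `transcripts.mjs`, not a file in this PR; the names `transcript_text` and `main` are hypothetical, and the port default mirrors `STT_PORT` from the examples above.

```python
import asyncio
import json
import os


def transcript_text(raw):
    """Return the text of a transcript event, or None for any other event."""
    event = json.loads(raw)
    if event.get("event") == "transcript":
        return event.get("text")
    return None


async def main():
    # Imported here so the parsing helper above works without the package.
    import websockets

    port = os.environ.get("STT_PORT", "11501")
    async with websockets.connect(f"ws://localhost:{port}") as ws:
        # The server pushes events unprompted, so the client only reads.
        async for raw in ws:
            text = transcript_text(raw)
            if text is not None:
                print(text)
```

To run it against a live server, call `asyncio.run(main())` from the server's venv.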
--- a/setup-venv-local-build.sh
+++ b/setup-venv-local-build.sh
@@ -1,11 +1,29 @@
 #!/usr/bin/env bash
+#
+# setup-venv-local-build.sh — builds ctranslate2 from source and installs faster-whisper.
+#
+# USE THIS SCRIPT when the PyPI ctranslate2 wheel does not match your CUDA version.
+# The PyPI wheel targets a specific CUDA major version (e.g. CUDA 12). If your system
+# has a newer version (e.g. CUDA 13), the wheel will fail at runtime because it tries
+# to dlopen libcublas.so.12 which does not exist. Building from source compiles against
+# your actual installed CUDA and links correctly.
+#
+# For systems where the PyPI wheel works (CUDA version matches), use setup-venv.sh
+# instead — it is much faster and simpler.
+#
+# Environment overrides:
+#   PYTHON_ENV      path to venv (default: ./venv)
+#   HF_TOKEN_FILE   path to HuggingFace token file (default: ~/.secrets/hugging-face.token)
+#   HF_HUB_CACHE    path to HuggingFace hub cache (default: ~/.cache/huggingface/hub)
+#   CUDA_HOME       path to CUDA toolkit (auto-detected if not set)
+#
 set -euo pipefail

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-VENV="${SCRIPT_DIR}/venv"
+VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"
 BUILD_DIR="${SCRIPT_DIR}/build/ctranslate2"
 MODEL="${1:-base.en}"
-TOKEN_FILE="${HOME}/.secrets/hugging-face.token"
+TOKEN_FILE="${HF_TOKEN_FILE:-${HOME}/.secrets/hugging-face.token}"

 # Locate CUDA
 if [ -z "${CUDA_HOME:-}" ]; then
@@ -25,9 +43,9 @@ fi
 echo "==> CUDA: ${CUDA_HOME}"
 "${CUDA_HOME}/bin/nvcc" --version | head -1

-for tool in cmake git; do
+for tool in cmake git python3; do
 	if ! command -v "${tool}" &>/dev/null; then
-		echo "ERROR: ${tool} not found — install with: sudo pacman -S ${tool}" >&2
+		echo "ERROR: ${tool} not found" >&2
 		exit 1
 	fi
 done
@@ -39,6 +57,7 @@ fi

 echo "==> upgrading pip + build tools"
 "${VENV}/bin/pip" install --upgrade pip wheel setuptools pybind11 --quiet
+"${VENV}/bin/pip" install torch silero-vad websockets

 # --- clone (skipped if already done) ---
 if [ ! -d "${BUILD_DIR}/src/.git" ]; then
@@ -78,25 +97,30 @@ ls "${VENV}/include/ctranslate2/" | head -3
 ls "${VENV}/lib/libctranslate2"* 2>/dev/null || { echo "ERROR: libctranslate2 not found in venv/lib" >&2; exit 1; }
 grep "WITH_CUDA" "${BUILD_DIR}/cmake-build/CMakeCache.txt" | grep -v "^#" || true

-# --- Python bindings ---
-# Always reinstall from source to ensure we use our CUDA 13 build, not a PyPI wheel
-echo "==> removing any existing ctranslate2 install..."
+# --- faster-whisper (with all deps, including PyPI ctranslate2) ---
+# Install faster-whisper normally so all its dependencies (av, huggingface_hub, etc.)
+# are satisfied. This will pull in the PyPI ctranslate2 wheel, which we override next.
+if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null; then
+	echo "==> installing faster-whisper"
+	"${VENV}/bin/pip" install faster-whisper
+else
+	echo "==> faster-whisper already installed, skipping"
+fi
+
+# --- Python bindings (always reinstalled from source) ---
+# Override the PyPI ctranslate2 wheel pulled in above with our source-built version.
+# This is the whole point of this script: the PyPI wheel links against a fixed CUDA
+# major version (e.g. libcublas.so.12) while our build links against the system version.
+echo "==> removing PyPI ctranslate2..."
 "${VENV}/bin/pip" uninstall -y ctranslate2 2>/dev/null || true

-echo "==> building ctranslate2 Python bindings from source..."
+echo "==> installing source-built ctranslate2 Python bindings..."
 CT2_ROOT="${VENV}" \
 LIBRARY_PATH="${VENV}/lib:${VENV}/lib64${LIBRARY_PATH:+:${LIBRARY_PATH}}" \
 LDFLAGS="-Wl,-rpath,${VENV}/lib" \
 	"${VENV}/bin/pip" install "${BUILD_DIR}/src/python" --no-build-isolation

-# --- faster-whisper ---
-if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null 2>&1; then
-	echo "==> installing faster-whisper"
-	"${VENV}/bin/pip" install faster-whisper --no-deps
-else
-	echo "==> faster-whisper already installed, skipping"
-fi
-
+# --- model download ---
 if [ -f "${TOKEN_FILE}" ]; then
 	export HF_TOKEN="$(cat "${TOKEN_FILE}")"
 	echo "==> HuggingFace token loaded from ${TOKEN_FILE}"
@@ -104,6 +128,10 @@ else
 	echo "==> no token found at ${TOKEN_FILE} — unauthenticated download"
 fi

+if [ -n "${HF_HUB_CACHE:-}" ]; then
+	echo "==> HuggingFace cache: ${HF_HUB_CACHE}"
+fi
+
 echo "==> pre-downloading model: ${MODEL}"
 "${VENV}/bin/python3" - <<EOF
 from faster_whisper import WhisperModel
@@ -112,7 +140,5 @@ WhisperModel("${MODEL}", device="cuda", compute_type="int8_float16")
 print("done")
 EOF

-chmod +x "${SCRIPT_DIR}/faster-whisper-server.py"
-
 echo ""
-echo "==> done. Run with: node query-demo.mjs --stt faster-whisper"
+echo "==> done. Venv ready at ${VENV}"
--- a/setup-venv.sh
+++ b/setup-venv.sh
@@ -0,0 +1,31 @@
+#!/usr/bin/env bash
+#
+# setup-venv.sh — installs faster-whisper from PyPI into a venv.
+#
+# USE THIS SCRIPT when the PyPI ctranslate2 wheel matches your CUDA version.
+# PyPI wheels target a specific CUDA major version; if your system matches,
+# this is the fastest way to get started — no compilation required.
+#
+# If you see errors like "libcublas.so.12: cannot open shared object file" at
+# runtime, your CUDA version does not match the wheel. Use setup-venv-local-build.sh
+# instead, which compiles ctranslate2 against your actual CUDA installation.
+#
+# Environment overrides:
+#   PYTHON_ENV      path to venv (default: ./venv)
+#
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"
+
+if [ ! -d "${VENV}" ]; then
+	echo "==> creating venv at ${VENV}"
+	python3 -m venv "${VENV}"
+fi
+
+echo "==> installing torch, faster-whisper, silero-vad and websockets"
+"${VENV}/bin/pip" install --upgrade pip --quiet
+"${VENV}/bin/pip" install torch faster-whisper silero-vad websockets
+
+echo ""
+echo "==> done. Venv ready at ${VENV}"
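
Review note on choosing between the two setup scripts: the `libcublas.so.12` failure described in the header can be probed before starting the server. The sketch below is an assumption-laden check, not part of this PR: the helper `can_load` is hypothetical, and it only reports whether the dynamic loader's default search path finds a library by name, which may differ from how an installed wheel resolves its own dependencies.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# libcublas.so.12 is the name quoted in the script header above; adjust it to
# whatever your installed ctranslate2 wheel actually links against.
for name in ("libcublas.so.12",):
    print(f"{name}: {'found' if can_load(name) else 'missing'}")
```

A "missing" result suggests trying `setup-venv-local-build.sh`; a "found" result does not by itself guarantee that the wheel will work.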
--- a/stt-server.py
+++ b/stt-server.py
@@ -0,0 +1,282 @@
+#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
+"""
+STT server: records audio, runs Silero VAD, transcribes with faster-whisper.
+Broadcasts JSON events to all connected WebSocket clients.
+
+Events:
+  {"event": "ready"}
+  {"event": "vad_start"}
+  {"event": "vad_end",    "duration": 1.23}
+  {"event": "transcript", "text": "...", "words": [...], "duration": 1.23, "language": "en", "language_probability": 0.98}
+  {"event": "error",      "message": "..."}
+
+word format: {"word": "hello", "start": 0.12, "end": 0.45, "probability": 0.99}
+
+Every WebSocket connection receives the full event stream from the moment it
+connects — no subscription handshake required.
+
+Machine-readable events are sent over WebSocket only.
+Pass --verbose to enable logging to stderr (startup, VAD events, transcripts).
+Errors always go to stderr regardless of verbosity.
+
+Environment:
+  STT_PORT   WebSocket port (default: 11501)
+
+Usage:
+  ./stt-server.py
+  ./stt-server.py --model large-v3 --device cuda --compute-type int8_float16 --verbose
+"""
+
+import os
+import sys
+import json
+import signal
+import argparse
+import threading
+import queue
+import subprocess
+import traceback
+import asyncio
+import websockets
+import numpy as np
+import torch
+
+SAMPLE_RATE      = 16000
+VAD_WINDOW       = 512      # samples per VAD chunk (32ms at 16kHz)
+PRE_ROLL_SAMPLES = 3200     # 0.2s prepended to each segment for context
+HISTORY_SAMPLES  = 960000   # 60s ring buffer for pre-roll
+PORT             = int(os.environ.get('STT_PORT', 11501))
+
+
+def log(msg, error=False):
+	if error or verbose:
+		sys.stderr.write(f'[stt] {msg}\n')
+		sys.stderr.flush()
+
+
+# --- WebSocket broadcast ---
+
+_ws_loop    = None
+_ws_clients = set()   # set of asyncio.Queue, one per connection
+
+
+def emit(event):
+	line = json.dumps(event)
+	if _ws_loop is not None:
+		for q in list(_ws_clients):
+			_ws_loop.call_soon_threadsafe(q.put_nowait, line)
+
+
+async def ws_handler(websocket):
+	q = asyncio.Queue()
+	_ws_clients.add(q)
+	log(f'client connected ({len(_ws_clients)} total)')
+	try:
+		while True:
+			msg = await q.get()
+			await websocket.send(msg)
+	except websockets.ConnectionClosed:
+		pass
+	finally:
+		_ws_clients.discard(q)
+		log(f'client disconnected ({len(_ws_clients)} remaining)')
+
+
+async def ws_main():
+	global _ws_loop
+	_ws_loop = asyncio.get_running_loop()
+	async with websockets.serve(ws_handler, '', PORT):
+		log(f'WebSocket listening on port {PORT}')
+		await asyncio.Future()   # run forever
+
+
+def start_ws_server():
+	asyncio.run(ws_main())
+
+
+# --- Mic ---
+
+def find_mic():
+	candidates = [
+		['parec',   ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
+		['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
+	]
+	for cmd, args in candidates:
+		try:
+			subprocess.run(['which', cmd], check=True, capture_output=True)
+			return cmd, args
+		except subprocess.CalledProcessError:
+			pass
+	raise RuntimeError('no mic capture command found — need parec or arecord')
+
+
+def s16le_to_f32(data):
+	return np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
+
+
+# --- Args + model loading ---
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--model',        default='base.en')
+parser.add_argument('--device',       default='cuda')
+parser.add_argument('--compute-type', default='int8_float16')
+parser.add_argument('--language',     default=None,          help='language code (e.g. en, sv); omit to auto-detect')
+parser.add_argument('--task',         default='transcribe',  choices=['transcribe', 'translate'], help='transcribe keeps the source language; translate converts to English')
+parser.add_argument('--verbose', '-v', action='store_true')
+args    = parser.parse_args()
+verbose = args.verbose
+
+token_file = os.environ.get('HF_TOKEN_FILE', os.path.expanduser('~/.secrets/hugging-face.token'))
+try:
+	with open(token_file) as f:
+		os.environ['HF_TOKEN'] = f.read().strip()
+except FileNotFoundError:
+	pass
+
+from faster_whisper import WhisperModel
+from huggingface_hub import snapshot_download
+
+try:
+	snapshot_download(f'Systran/faster-whisper-{args.model}', local_files_only=True)
+except Exception:
+	log(f'downloading model {args.model}...')
+
+log(f'loading faster-whisper {args.model} ({args.device}, {args.compute_type})...')
+try:
+	model = WhisperModel(args.model, device=args.device, compute_type=args.compute_type)
+	log(f'model ready on {args.device}')
+except Exception as e:
+	log(f'{args.device} failed ({e}), falling back to cpu', error=True)
+	model = WhisperModel(args.model, device='cpu', compute_type='int8')
+	log('model ready on cpu')
+
+log('loading silero VAD...')
+from silero_vad import load_silero_vad, VADIterator
+vad_model = load_silero_vad()
+vad       = VADIterator(vad_model, sampling_rate=SAMPLE_RATE,
+                        threshold=0.5, min_silence_duration_ms=500)
+log('VAD ready')
+
+
+# --- Pre-roll ring buffer ---
+
+history     = np.zeros(HISTORY_SAMPLES, dtype=np.float32)
+history_pos = 0
+
+def push_history(samples):
+	global history_pos
+	n     = len(samples)
+	base  = history_pos % HISTORY_SAMPLES
+	space = HISTORY_SAMPLES - base
+	if n <= space:
+		history[base:base + n] = samples
+	else:
+		history[base:]      = samples[:space]
+		history[:n - space] = samples[space:]
+	history_pos += n
+
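+def get_preroll():
+	# Last PRE_ROLL_SAMPLES samples (fewer right after startup), oldest first.
+	start = max(0, history_pos - PRE_ROLL_SAMPLES)
+	# Modulo indexing handles the ring-buffer wrap-around; fancy indexing
+	# returns a copy, so later writes to history cannot alter the segment.
+	idx = np.arange(start, history_pos) % HISTORY_SAMPLES
+	return history[idx]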
+
+
+# --- Transcription thread ---
+
+transcription_queue = queue.Queue()
+
+def transcription_worker():
+	while True:
+		item = transcription_queue.get()
+		if item is None:
+			break
+		samples, duration = item
+		try:
+			segments, info = model.transcribe(
+				samples,
+				language=args.language,
+				task=args.task,
+				word_timestamps=True,
+				vad_filter=False,
+			)
+			text  = ''
+			words = []
+			for seg in segments:
+				text += seg.text
+				for w in (seg.words or []):
+					words.append({
+						'word':        w.word,
+						'start':       round(float(w.start), 4),
+						'end':         round(float(w.end),   4),
+						'probability': round(float(w.probability), 4),
+					})
+			language    = info.language
+			lang_prob   = round(float(info.language_probability), 3)
+			log(f'transcript [{language} {lang_prob}]: {json.dumps(text.strip())} ({len(words)} words)')
+			if text.strip():
+				emit({'event': 'transcript', 'text': text.strip(), 'words': words, 'duration': round(duration, 3), 'language': language, 'language_probability': lang_prob})
+		except Exception:
+			msg = traceback.format_exc()
+			log(f'transcription error:\n{msg}', error=True)
+			emit({'event': 'error', 'message': msg})
+		finally:
+			transcription_queue.task_done()
+
+
+threading.Thread(target=transcription_worker, daemon=True).start()
+threading.Thread(target=start_ws_server, daemon=True).start()
+
+
+# --- Main recording + VAD loop ---
+
+cmd, cmd_args = find_mic()
+log(f'mic: {cmd} {" ".join(cmd_args)}')
+mic = subprocess.Popen([cmd] + cmd_args, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
+
+def shutdown(sig=None, frame=None):
+	mic.terminate()
+	transcription_queue.put(None)
+	sys.exit(0)
+
+signal.signal(signal.SIGTERM, shutdown)
+signal.signal(signal.SIGINT,  shutdown)
+
+emit({'event': 'ready'})
+
+speech_samples = []
+speech_start   = None
+pending        = b''
+
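+for chunk in iter(lambda: mic.stdout.read(VAD_WINDOW * 2), b''):  # fixed-size reads; plain iteration would split the raw PCM stream on newline bytes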
+	pending += chunk
+	while len(pending) >= VAD_WINDOW * 2:
+		raw     = pending[:VAD_WINDOW * 2]
+		pending = pending[VAD_WINDOW * 2:]
+
+		f32 = s16le_to_f32(raw)
+		push_history(f32)
+
+		result = vad(torch.from_numpy(f32), return_seconds=True)
+
+		if result is not None:
+			if 'start' in result:
+				speech_start   = result['start']
+				speech_samples = [get_preroll()[:-VAD_WINDOW]]  # drop the current chunk: it is already in history and is appended below
+				log(f'VAD start at {speech_start:.2f}s')
+				emit({'event': 'vad_start'})
+
+			elif 'end' in result and speech_start is not None:
+				duration = result['end'] - speech_start
+				log(f'VAD end at {result["end"]:.2f}s (duration {duration:.2f}s)')
+				emit({'event': 'vad_end', 'duration': round(duration, 3)})
+				segment = np.concatenate(speech_samples)
+				transcription_queue.put((segment, duration))
+				speech_samples = []
+				speech_start   = None
+				vad.reset_states()
+
+		if speech_start is not None:
+			speech_samples.append(f32)
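
Review note on `emit()`: it runs on the recording and transcription threads, while each client's `asyncio.Queue` belongs to the event loop thread. `asyncio.Queue` is not thread-safe, so the handoff goes through `call_soon_threadsafe`. Below is a minimal standalone sketch of that pattern using only the standard library; the names `produce`, `collect`, and `run_demo` are hypothetical and do not appear in `stt-server.py`.

```python
import asyncio
import threading


def produce(loop, q, items):
    # Runs in a plain thread. Schedule put_nowait on the loop thread instead
    # of touching the asyncio.Queue directly from here.
    for item in items:
        loop.call_soon_threadsafe(q.put_nowait, item)


async def collect(items):
    loop = asyncio.get_running_loop()
    q = asyncio.Queue()
    worker = threading.Thread(target=produce, args=(loop, q, items))
    worker.start()
    # Scheduled callbacks run in FIFO order, so items arrive in send order.
    received = [await q.get() for _ in items]
    worker.join()
    return received


def run_demo(items):
    return asyncio.run(collect(items))


print(run_demo(["vad_start", "vad_end", "transcript"]))
```

The server applies the same idea once per connected client, which is why `emit()` iterates over a snapshot of `_ws_clients`.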
Author	SHA1	Message	Date
Mikael Lövqvist	4f983ca276	Merge pull request 'WebSocket server, language/task args, verbose flag, misc improvements' (#2) from mikael-lovqvists-claude-agent/stt-server:websocket-server into main Reviewed-on: #2	2026-06-07 09:27:01 +00:00
mikael-lovqvists-claude-agent	81e9ea82cf	Add NOTES.md with TranscriptionInfo unused fields Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:24:53 +00:00
mikael-lovqvists-claude-agent	0afe761625	Include detected language and confidence in transcript events Unpacks transcription info instead of discarding it. Adds language and language_probability fields to transcript events, and includes them in verbose log output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:21:44 +00:00
mikael-lovqvists-claude-agent	bdb1aac885	Add --language and --task CLI arguments, document in README --language: force language detection (e.g. en, sv) or leave unset for auto --task: transcribe (default) or translate to English Previously language was hardcoded to 'en' which caused multilingual models to hallucinate translations instead of transcribing the source language. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:16:19 +00:00
mikael-lovqvists-claude-agent	f2ba15185e	Update VRAM estimates to show float16/float32 for all models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:14:35 +00:00
mikael-lovqvists-claude-agent	dd6e74a7a8	Fix large-v3 VRAM estimate — ~5GB with float16, not ~10GB Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:14:00 +00:00
mikael-lovqvists-claude-agent	be1efd9edb	Add model selection and compute type sections to README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:11:39 +00:00
mikael-lovqvists-claude-agent	9030b1315d	Load HF_TOKEN from token file at startup (consistent with tts-server) Reads ~/.secrets/hugging-face.token by default, overridable via HF_TOKEN_FILE. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:10:54 +00:00
mikael-lovqvists-claude-agent	7b03deddb5	Gate download log message behind --verbose like everything else Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:09:03 +00:00
mikael-lovqvists-claude-agent	218687b039	Log to stderr when model needs to be downloaded Checks cache first with local_files_only=True; if the model isn't present logs "downloading model ..." to stderr before WhisperModel triggers the actual download. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 09:07:13 +00:00
mikael-lovqvists-claude-agent	6bbc04dde7	Add Node.js WebSocket example scripts listen.mjs: prints all events as JSON objects. transcripts.mjs: prints transcript text only. Both use Node 21+ built-in WebSocket — no libraries required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:59:38 +00:00
mikael-lovqvists-claude-agent	aad1bda3bf	Remove stdout event output — WebSocket is the sole event channel Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:58:29 +00:00
mikael-lovqvists-claude-agent	f18330608d	Add --verbose flag; suppress info logging by default Errors always go to stderr. Info logs (startup, VAD events, transcripts) only appear with --verbose / -v, keeping stderr clean when running as a system service. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:56:58 +00:00
mikael-lovqvists-claude-agent	18404708e3	Add WebSocket broadcast to stt-server.py Every connection receives the full event stream (vad_start, vad_end, transcript, error) from the moment it connects — no subscription handshake required. The asyncio WebSocket server runs in a daemon thread alongside the VAD loop and transcription thread. Events still go to stdout unchanged. Port is configurable via STT_PORT env var (default: 11501). Add websockets to both setup scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:53:54 +00:00
Mikael Lövqvist	a4fe95b24a	Merge pull request 'Add stt-server.py and fix setup scripts' (#1) from mikael-lovqvists-claude-agent/stt-server:stt-process into main Reviewed-on: #1	2026-06-07 08:51:22 +00:00
mikael-lovqvists-claude-agent	01210e878f	Add silero-vad to both venv setup scripts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:48:59 +00:00
mikael-lovqvists-claude-agent	c0a72679f8	Add torch to both venv setup scripts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:47:03 +00:00
mikael-lovqvists-claude-agent	2af47373c4	Add stt-server.py: self-contained recording + VAD + transcription process Replaces the old stdin/stdout transcription-only server. Now handles the full pipeline in Python: - Launches parec or arecord for mic capture - Runs Silero VAD (via silero-vad, already a faster-whisper dep — no sherpa-onnx needed) - Pre-roll ring buffer (0.2s) prepended to each segment for context - Transcribes with faster-whisper in a separate thread (GPU not blocking VAD) - Emits JSON line events to stdout: ready, vad_start, vad_end, transcript, error Event protocol is designed to map directly to WebSocket subscriptions later. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:41:45 +00:00
mikael-lovqvists-claude-agent	bbde89a2cc	Fix missing faster-whisper deps when using local ctranslate2 build --no-deps skipped av and other required packages. Fix by installing faster-whisper normally first (satisfies all deps, pulls PyPI ctranslate2), then immediately overriding ctranslate2 with the source-built version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:28:42 +00:00
mikael-lovqvists-claude-agent	346c7c6585	Remove Arch Linux specific package suggestions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:24:43 +00:00
mikael-lovqvists-claude-agent	3db7058646	Add setup-venv.sh, clean up setup-venv-local-build.sh setup-venv.sh: simple PyPI install path — just pip install faster-whisper. Use this when the PyPI ctranslate2 wheel matches the system CUDA version. setup-venv-local-build.sh: - PYTHON_ENV env var for venv path override (consistent with tts-server) - HF_TOKEN_FILE env var instead of hardcoded path - HF_HUB_CACHE env var surfaced in output when set - Remove stray chmod on faster-whisper-server.py (not part of this repo) - Remove voice-experiment-specific "run with" message - Add python3 to tool prerequisite check - Arch Linux package suggestions extended to cover CUDA and python - Document why each script exists and when to use which Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 08:22:40 +00:00