Compare commits

21 Commits

Author SHA1 Message Date
4f983ca276 Merge pull request 'WebSocket server, language/task args, verbose flag, misc improvements' (#2) from mikael-lovqvists-claude-agent/stt-server:websocket-server into main
Reviewed-on: #2
2026-06-07 09:27:01 +00:00
81e9ea82cf Add NOTES.md with TranscriptionInfo unused fields
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:24:53 +00:00
0afe761625 Include detected language and confidence in transcript events
Unpacks transcription info instead of discarding it. Adds language and
language_probability fields to transcript events, and includes them in
verbose log output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:21:44 +00:00
bdb1aac885 Add --language and --task CLI arguments, document in README
--language: force language detection (e.g. en, sv) or leave unset for auto
--task: transcribe (default) or translate to English
Previously language was hardcoded to 'en' which caused multilingual models
to hallucinate translations instead of transcribing the source language.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:16:19 +00:00
f2ba15185e Update VRAM estimates to show float16/float32 for all models
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:14:35 +00:00
dd6e74a7a8 Fix large-v3 VRAM estimate — ~5GB with float16, not ~10GB
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:14:00 +00:00
be1efd9edb Add model selection and compute type sections to README
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:11:39 +00:00
9030b1315d Load HF_TOKEN from token file at startup (consistent with tts-server)
Reads ~/.secrets/hugging-face.token by default, overridable via HF_TOKEN_FILE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:10:54 +00:00
7b03deddb5 Gate download log message behind --verbose like everything else
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:09:03 +00:00
218687b039 Log to stderr when model needs to be downloaded
Checks cache first with local_files_only=True; if the model isn't present
logs "downloading model ..." to stderr before WhisperModel triggers the
actual download.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:07:13 +00:00
6bbc04dde7 Add Node.js WebSocket example scripts
listen.mjs: prints all events as JSON objects.
transcripts.mjs: prints transcript text only.
Both use Node 21+ built-in WebSocket — no libraries required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:59:38 +00:00
aad1bda3bf Remove stdout event output — WebSocket is the sole event channel
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:58:29 +00:00
f18330608d Add --verbose flag; suppress info logging by default
Errors always go to stderr. Info logs (startup, VAD events, transcripts)
only appear with --verbose / -v, keeping stderr clean when running as a
system service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:56:58 +00:00
18404708e3 Add WebSocket broadcast to stt-server.py
Every connection receives the full event stream (vad_start, vad_end,
transcript, error) from the moment it connects — no subscription
handshake required. The asyncio WebSocket server runs in a daemon thread
alongside the VAD loop and transcription thread. Events still go to
stdout unchanged.

Port is configurable via STT_PORT env var (default: 11501).
Add websockets to both setup scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:53:54 +00:00
a4fe95b24a Merge pull request 'Add stt-server.py and fix setup scripts' (#1) from mikael-lovqvists-claude-agent/stt-server:stt-process into main
Reviewed-on: #1
2026-06-07 08:51:22 +00:00
01210e878f Add silero-vad to both venv setup scripts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:48:59 +00:00
c0a72679f8 Add torch to both venv setup scripts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:03 +00:00
2af47373c4 Add stt-server.py: self-contained recording + VAD + transcription process
Replaces the old stdin/stdout transcription-only server. Now handles the
full pipeline in Python:
- Launches parec or arecord for mic capture
- Runs Silero VAD (via silero-vad, already a faster-whisper dep — no sherpa-onnx needed)
- Pre-roll ring buffer (0.2s) prepended to each segment for context
- Transcribes with faster-whisper in a separate thread (GPU not blocking VAD)
- Emits JSON line events to stdout: ready, vad_start, vad_end, transcript, error

Event protocol is designed to map directly to WebSocket subscriptions later.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:41:45 +00:00
bbde89a2cc Fix missing faster-whisper deps when using local ctranslate2 build
--no-deps skipped av and other required packages. Fix by installing
faster-whisper normally first (satisfies all deps, pulls PyPI ctranslate2),
then immediately overriding ctranslate2 with the source-built version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:28:42 +00:00
346c7c6585 Remove Arch Linux specific package suggestions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:24:43 +00:00
3db7058646 Add setup-venv.sh, clean up setup-venv-local-build.sh
setup-venv.sh: simple PyPI install path — just pip install faster-whisper.
Use this when the PyPI ctranslate2 wheel matches the system CUDA version.

setup-venv-local-build.sh:
- PYTHON_ENV env var for venv path override (consistent with tts-server)
- HF_TOKEN_FILE env var instead of hardcoded path
- HF_HUB_CACHE env var surfaced in output when set
- Remove stray chmod on faster-whisper-server.py (not part of this repo)
- Remove voice-experiment-specific "run with" message
- Add python3 to tool prerequisite check
- Arch Linux package suggestions extended to cover CUDA and python
- Document why each script exists and when to use which

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:22:40 +00:00
8 changed files with 457 additions and 21 deletions

3
.gitignore vendored

@@ -1 +1,2 @@
venv/
venv/
build/

9
NOTES.md Normal file

@@ -0,0 +1,9 @@
# Notes
## TranscriptionInfo — unused fields
`model.transcribe()` returns a `TranscriptionInfo` object as its second value. We currently use `language` and `language_probability`. Other available fields:
- **`all_language_probs`** — full ranked list of `(language, probability)` tuples for the segment. Useful for debugging misdetection — e.g. when the model hallucinates Sinhala on noise, this would show Sinhala at the top with a high probability. Could be included in transcript events or exposed as a diagnostic endpoint.
- **`duration`** — total audio duration fed to the model.
- **`duration_after_vad`** — speech duration after faster-whisper's built-in VAD filter (not meaningful since we pass `vad_filter=False`).
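
A minimal sketch of how these fields could be gathered for diagnostics. `describe_info` is a hypothetical helper that is not part of this repo, and the stand-in object below uses invented values in place of the second value returned by `model.transcribe()`. It assumes `all_language_probs` may be `None` (for example when no detection ran) and does not rely on the order in which the library returns the pairs.

```python
from types import SimpleNamespace


def describe_info(info, top_n=3):
    """Collect diagnostic fields from a TranscriptionInfo-like object.

    Hypothetical helper: it reads only the attribute names listed above.
    """
    # Assumption: all_language_probs can be None, so fall back to an empty list.
    probs = getattr(info, 'all_language_probs', None) or []
    # Sort defensively instead of trusting the library's ordering.
    ranked = sorted(probs, key=lambda pair: pair[1], reverse=True)[:top_n]
    return {
        'language': info.language,
        'language_probability': round(float(info.language_probability), 3),
        'top_languages': [(lang, round(float(p), 3)) for lang, p in ranked],
        'duration': round(float(info.duration), 3),
    }


# Stand-in for a real TranscriptionInfo; all values are invented.
fake_info = SimpleNamespace(
    language='si',
    language_probability=0.62,
    all_language_probs=[('en', 0.21), ('si', 0.62), ('sv', 0.05)],
    duration=1.5,
)
print(describe_info(fake_info, top_n=2))
```

A dictionary like this could be merged into the transcript event or written to the verbose log, depending on how much diagnostic detail clients should see.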


@@ -15,4 +15,48 @@ This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding
### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/)
We will have two different setups here depending on if you want to build ctranslate2 locally or not. This shall be documented.
We will have two different setups here depending on if you want to build ctranslate2 locally or not. This shall be documented.
## Model selection

Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from HuggingFace on first use.

| Model | VRAM | Quality | Notes |
|-------|------|---------|-------|
| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |

English-only models (`.en` suffix) are faster and more accurate than multilingual models for English speech.

## Compute type

Pass `--compute-type <type>` to control the numeric precision used during inference.

| Type | Notes |
|------|-------|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |

If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
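
When the model cannot be loaded on the requested device, `stt-server.py` logs the failure and retries on the CPU with `int8`. The pattern can be sketched as below. `load_with_fallback` and `fake_load` are hypothetical names used only for illustration; in the server the loader role is played by `faster_whisper.WhisperModel`.

```python
def load_with_fallback(load, model, device, compute_type):
    """Try the requested device first, then fall back to CPU with int8.

    `load` is any callable accepting (model, device=..., compute_type=...).
    Returns the loaded object and the device that was actually used.
    """
    try:
        return load(model, device=device, compute_type=compute_type), device
    except Exception as exc:
        print(f'{device} failed ({exc}), falling back to cpu')
        return load(model, device='cpu', compute_type='int8'), 'cpu'


def fake_load(model, device, compute_type):
    # Invented stand-in that pretends CUDA is unavailable.
    if device == 'cuda':
        raise RuntimeError('CUDA libraries not found')
    return f'{model}@{device}/{compute_type}'


print(load_with_fallback(fake_load, 'base.en', 'cuda', 'int8_float16'))
```

Note that the fallback silently changes the compute type as well as the device, so accuracy and speed on the CPU path will differ from what `--compute-type` requested.
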
## Language and translation

By default the server auto-detects the spoken language and transcribes it as-is.

| Argument | Default | Notes |
|----------|---------|-------|
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Skips auto-detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |

> [!NOTE]
> The `.en` model variants (`base.en`, `small.en` etc.) are English-only and do not support `--task translate` or non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.

> [!WARNING]
> Omitting `--language` with a multilingual model and English-only speech may cause occasional misdetection. Pass `--language en` to avoid this if you only speak English.
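
The restriction in the note could be checked before the model is loaded. The following is a sketch only: `check_language_args` is a hypothetical helper that does not exist in `stt-server.py`, and it encodes nothing beyond the rule that `.en` models are English-only.

```python
def check_language_args(model, language=None, task='transcribe'):
    """Return a list of problems with a model/language/task combination.

    Hypothetical helper: an empty list means the combination looks valid.
    """
    problems = []
    english_only = model.endswith('.en')
    if english_only and task == 'translate':
        problems.append(f'{model} is English-only and cannot translate')
    if english_only and language not in (None, 'en'):
        problems.append(f'{model} is English-only and cannot use --language {language}')
    return problems


print(check_language_args('base.en', language='sv'))
print(check_language_args('large-v3', language='sv', task='translate'))
```

Returning a list rather than raising on the first problem lets a caller report every conflicting argument at once.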

24
examples/listen.mjs Normal file

@@ -0,0 +1,24 @@
// Connect to the STT server and print all events.
// Usage: node listen.mjs

const PORT = process.env.STT_PORT ?? '11501'
const ws = new WebSocket(`ws://localhost:${PORT}`)

ws.addEventListener('open', () => {
  process.stderr.write(`connected to ws://localhost:${PORT}\n`)
})

ws.addEventListener('message', ({ data }) => {
  const event = JSON.parse(data)
  console.log(event)
})

ws.addEventListener('close', () => {
  process.stderr.write('disconnected\n')
  process.exit(0)
})

ws.addEventListener('error', (err) => {
  process.stderr.write(`error: ${err.message}\n`)
  process.exit(1)
})

19
examples/transcripts.mjs Normal file

@@ -0,0 +1,19 @@
// Connect to the STT server and print transcript text only.
// Usage: node transcripts.mjs

const PORT = process.env.STT_PORT ?? '11501'
const ws = new WebSocket(`ws://localhost:${PORT}`)

ws.addEventListener('open', () => {
  process.stderr.write(`connected to ws://localhost:${PORT}\n`)
})

ws.addEventListener('message', ({ data }) => {
  const event = JSON.parse(data)
  if (event.event === 'transcript') {
    console.log(event.text)
  }
})

ws.addEventListener('close', () => process.exit(0))
ws.addEventListener('error', () => process.exit(1))


@@ -1,11 +1,29 @@
#!/usr/bin/env bash
#
# setup-venv-local-build.sh — builds ctranslate2 from source and installs faster-whisper.
#
# USE THIS SCRIPT when the PyPI ctranslate2 wheel does not match your CUDA version.
# The PyPI wheel targets a specific CUDA major version (e.g. CUDA 12). If your system
# has a newer version (e.g. CUDA 13), the wheel will fail at runtime because it tries
# to dlopen libcublas.so.12 which does not exist. Building from source compiles against
# your actual installed CUDA and links correctly.
#
# For systems where the PyPI wheel works (CUDA version matches), use setup-venv.sh
# instead — it is much faster and simpler.
#
# Environment overrides:
# PYTHON_ENV path to venv (default: ./venv)
# HF_TOKEN_FILE path to HuggingFace token file (default: ~/.secrets/hugging-face.token)
# HF_HUB_CACHE path to HuggingFace hub cache (default: ~/.cache/huggingface/hub)
# CUDA_HOME path to CUDA toolkit (auto-detected if not set)
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${SCRIPT_DIR}/venv"
VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"
BUILD_DIR="${SCRIPT_DIR}/build/ctranslate2"
MODEL="${1:-base.en}"
TOKEN_FILE="${HOME}/.secrets/hugging-face.token"
TOKEN_FILE="${HF_TOKEN_FILE:-${HOME}/.secrets/hugging-face.token}"
# Locate CUDA
if [ -z "${CUDA_HOME:-}" ]; then
@@ -25,9 +43,9 @@ fi
echo "==> CUDA: ${CUDA_HOME}"
"${CUDA_HOME}/bin/nvcc" --version | head -1
for tool in cmake git; do
for tool in cmake git python3; do
if ! command -v "${tool}" &>/dev/null; then
echo "ERROR: ${tool} not found — install with: sudo pacman -S ${tool}" >&2
echo "ERROR: ${tool} not found" >&2
exit 1
fi
done
@@ -39,6 +57,7 @@ fi
echo "==> upgrading pip + build tools"
"${VENV}/bin/pip" install --upgrade pip wheel setuptools pybind11 --quiet
"${VENV}/bin/pip" install torch silero-vad websockets
# --- clone (skipped if already done) ---
if [ ! -d "${BUILD_DIR}/src/.git" ]; then
@@ -78,25 +97,30 @@ ls "${VENV}/include/ctranslate2/" | head -3
ls "${VENV}/lib/libctranslate2"* 2>/dev/null || { echo "ERROR: libctranslate2 not found in venv/lib" >&2; exit 1; }
grep "WITH_CUDA" "${BUILD_DIR}/cmake-build/CMakeCache.txt" | grep -v "^#" || true
# --- Python bindings ---
# Always reinstall from source to ensure we use our CUDA 13 build, not a PyPI wheel
echo "==> removing any existing ctranslate2 install..."
# --- faster-whisper (with all deps, including PyPI ctranslate2) ---
# Install faster-whisper normally so all its dependencies (av, huggingface_hub, etc.)
# are satisfied. This will pull in the PyPI ctranslate2 wheel, which we override next.
if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null 2>&1; then
echo "==> installing faster-whisper"
"${VENV}/bin/pip" install faster-whisper
else
echo "==> faster-whisper already installed, skipping"
fi
# --- Python bindings (always reinstalled from source) ---
# Override the PyPI ctranslate2 wheel pulled in above with our source-built version.
# This is the whole point of this script: the PyPI wheel links against a fixed CUDA
# major version (e.g. libcublas.so.12) while our build links against the system version.
echo "==> removing PyPI ctranslate2..."
"${VENV}/bin/pip" uninstall -y ctranslate2 2>/dev/null || true
echo "==> building ctranslate2 Python bindings from source..."
echo "==> installing source-built ctranslate2 Python bindings..."
CT2_ROOT="${VENV}" \
LIBRARY_PATH="${VENV}/lib:${VENV}/lib64${LIBRARY_PATH:+:${LIBRARY_PATH}}" \
LDFLAGS="-Wl,-rpath,${VENV}/lib" \
"${VENV}/bin/pip" install "${BUILD_DIR}/src/python" --no-build-isolation
# --- faster-whisper ---
if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null 2>&1; then
echo "==> installing faster-whisper"
"${VENV}/bin/pip" install faster-whisper --no-deps
else
echo "==> faster-whisper already installed, skipping"
fi
# --- model download ---
if [ -f "${TOKEN_FILE}" ]; then
export HF_TOKEN="$(cat "${TOKEN_FILE}")"
echo "==> HuggingFace token loaded from ${TOKEN_FILE}"
@@ -104,6 +128,10 @@ else
echo "==> no token found at ${TOKEN_FILE} — unauthenticated download"
fi
if [ -n "${HF_HUB_CACHE:-}" ]; then
echo "==> HuggingFace cache: ${HF_HUB_CACHE}"
fi
echo "==> pre-downloading model: ${MODEL}"
"${VENV}/bin/python3" - <<EOF
from faster_whisper import WhisperModel
@@ -112,7 +140,5 @@ WhisperModel("${MODEL}", device="cuda", compute_type="int8_float16")
print("done")
EOF
chmod +x "${SCRIPT_DIR}/faster-whisper-server.py"
echo ""
echo "==> done. Run with: node query-demo.mjs --stt faster-whisper"
echo "==> done. Venv ready at ${VENV}"

31
setup-venv.sh Executable file

@@ -0,0 +1,31 @@
#!/usr/bin/env bash
#
# setup-venv.sh — installs faster-whisper from PyPI into a venv.
#
# USE THIS SCRIPT when the PyPI ctranslate2 wheel matches your CUDA version.
# PyPI wheels target a specific CUDA major version; if your system matches,
# this is the fastest way to get started — no compilation required.
#
# If you see errors like "libcublas.so.12: cannot open shared object file" at
# runtime, your CUDA version does not match the wheel. Use setup-venv-local-build.sh
# instead, which compiles ctranslate2 against your actual CUDA installation.
#
# Environment overrides:
#   PYTHON_ENV   path to venv (default: ./venv)
#

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"

if [ ! -d "${VENV}" ]; then
    echo "==> creating venv at ${VENV}"
    python3 -m venv "${VENV}"
fi

echo "==> installing torch and faster-whisper"
"${VENV}/bin/pip" install --upgrade pip --quiet
"${VENV}/bin/pip" install torch faster-whisper silero-vad websockets
echo ""
echo "==> done. Venv ready at ${VENV}"

282
stt-server.py Executable file

@@ -0,0 +1,282 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
STT server: records audio, runs Silero VAD, transcribes with faster-whisper.
Broadcasts JSON events to all connected WebSocket clients.

Events:
  {"event": "ready"}
  {"event": "vad_start"}
  {"event": "vad_end", "duration": 1.23}
  {"event": "transcript", "text": "...", "words": [...], "duration": 1.23,
   "language": "en", "language_probability": 0.98}
  {"event": "error", "message": "..."}

  word format: {"word": "hello", "start": 0.12, "end": 0.45, "probability": 0.99}

Every WebSocket connection receives the full event stream from the moment it
connects — no subscription handshake required.

Machine-readable events are sent over WebSocket only.
Pass --verbose to enable logging to stderr (startup, VAD events, transcripts).
Errors always go to stderr regardless of verbosity.

Environment:
  STT_PORT   WebSocket port (default: 11501)

Usage:
  ./stt-server.py
  ./stt-server.py --model large-v3 --device cuda --compute-type int8_float16 --verbose
"""

import os
import sys
import json
import signal
import argparse
import threading
import queue
import subprocess
import traceback
import asyncio

import websockets
import numpy as np
import torch

SAMPLE_RATE = 16000
VAD_WINDOW = 512          # samples per VAD chunk (32ms at 16kHz)
PRE_ROLL_SAMPLES = 3200   # 0.2s prepended to each segment for context
HISTORY_SAMPLES = 960000  # 60s ring buffer for pre-roll
PORT = int(os.environ.get('STT_PORT', '11501'))


def log(msg, error=False):
    # `verbose` is a module-level flag set after argument parsing below.
    if error or verbose:
        sys.stderr.write(f'[stt] {msg}\n')
        sys.stderr.flush()


# --- WebSocket broadcast ---

_ws_loop = None
_ws_clients = set()  # set of asyncio.Queue, one per connection


def emit(event):
    line = json.dumps(event)
    if _ws_loop is not None:
        for q in list(_ws_clients):
            _ws_loop.call_soon_threadsafe(q.put_nowait, line)


async def ws_handler(websocket):
    q = asyncio.Queue()
    _ws_clients.add(q)
    log(f'client connected ({len(_ws_clients)} total)')
    try:
        while True:
            msg = await q.get()
            await websocket.send(msg)
    except websockets.ConnectionClosed:
        pass
    finally:
        _ws_clients.discard(q)
        log(f'client disconnected ({len(_ws_clients)} remaining)')


async def ws_main():
    global _ws_loop
    _ws_loop = asyncio.get_running_loop()
    async with websockets.serve(ws_handler, '', PORT):
        log(f'WebSocket listening on port {PORT}')
        await asyncio.Future()  # run forever


def start_ws_server():
    asyncio.run(ws_main())


# --- Mic ---

def find_mic():
    candidates = [
        ['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
        ['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
    ]
    for cmd, args in candidates:
        try:
            subprocess.run(['which', cmd], check=True, capture_output=True)
            return cmd, args
        except subprocess.CalledProcessError:
            pass
    raise RuntimeError('no mic capture command found — need parec or arecord')


def s16le_to_f32(data):
    return np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0


# --- Args + model loading ---

parser = argparse.ArgumentParser()
parser.add_argument('--model', default='base.en')
parser.add_argument('--device', default='cuda')
parser.add_argument('--compute-type', default='int8_float16')
parser.add_argument('--language', default=None, help='language code (e.g. en, sv) or None for auto-detect')
parser.add_argument('--task', default='transcribe', choices=['transcribe', 'translate'], help='transcribe keeps the source language; translate converts to English')
parser.add_argument('--verbose', '-v', action='store_true')
args = parser.parse_args()
verbose = args.verbose

token_file = os.environ.get('HF_TOKEN_FILE', os.path.expanduser('~/.secrets/hugging-face.token'))
try:
    with open(token_file) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass

from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

try:
    snapshot_download(f'Systran/faster-whisper-{args.model}', local_files_only=True)
except Exception:
    log(f'downloading model {args.model}...')

log(f'loading faster-whisper {args.model} ({args.device}, {args.compute_type})...')
try:
    model = WhisperModel(args.model, device=args.device, compute_type=args.compute_type)
    log(f'model ready on {args.device}')
except Exception as e:
    log(f'{args.device} failed ({e}), falling back to cpu', error=True)
    model = WhisperModel(args.model, device='cpu', compute_type='int8')
    log('model ready on cpu')

log('loading silero VAD...')
from silero_vad import load_silero_vad, VADIterator
vad_model = load_silero_vad()
vad = VADIterator(vad_model, sampling_rate=SAMPLE_RATE,
                  threshold=0.5, min_silence_duration_ms=500)
log('VAD ready')


# --- Pre-roll ring buffer ---

history = np.zeros(HISTORY_SAMPLES, dtype=np.float32)
history_pos = 0


def push_history(samples):
    global history_pos
    n = len(samples)
    base = history_pos % HISTORY_SAMPLES
    space = HISTORY_SAMPLES - base
    if n <= space:
        history[base:base + n] = samples
    else:
        history[base:] = samples[:space]
        history[:n - space] = samples[space:]
    history_pos += n


def get_preroll():
    start = max(0, history_pos - PRE_ROLL_SAMPLES)
    count = history_pos - start
    out = np.empty(count, dtype=np.float32)
    for i in range(count):
        out[i] = history[(start + i) % HISTORY_SAMPLES]
    return out


# --- Transcription thread ---

transcription_queue = queue.Queue()


def transcription_worker():
    while True:
        item = transcription_queue.get()
        if item is None:
            break
        samples, duration = item
        try:
            segments, info = model.transcribe(
                samples,
                language=args.language,
                task=args.task,
                word_timestamps=True,
                vad_filter=False,
            )
            text = ''
            words = []
            for seg in segments:
                text += seg.text
                for w in (seg.words or []):
                    words.append({
                        'word': w.word,
                        'start': round(float(w.start), 4),
                        'end': round(float(w.end), 4),
                        'probability': round(float(w.probability), 4),
                    })
            language = info.language
            lang_prob = round(float(info.language_probability), 3)
            log(f'transcript [{language} {lang_prob}]: {json.dumps(text.strip())} ({len(words)} words)')
            if text.strip():
                emit({'event': 'transcript', 'text': text.strip(), 'words': words, 'duration': round(duration, 3), 'language': language, 'language_probability': lang_prob})
        except Exception:
            msg = traceback.format_exc()
            log(f'transcription error:\n{msg}', error=True)
            emit({'event': 'error', 'message': msg})
        finally:
            transcription_queue.task_done()


threading.Thread(target=transcription_worker, daemon=True).start()
threading.Thread(target=start_ws_server, daemon=True).start()


# --- Main recording + VAD loop ---

cmd, cmd_args = find_mic()
log(f'mic: {cmd} {" ".join(cmd_args)}')

mic = subprocess.Popen([cmd] + cmd_args, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)


def shutdown(sig=None, frame=None):
    mic.terminate()
    transcription_queue.put(None)
    sys.exit(0)


signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

emit({'event': 'ready'})

speech_samples = []
speech_start = None
pending = b''

for chunk in mic.stdout:
    pending += chunk
    while len(pending) >= VAD_WINDOW * 2:
        raw = pending[:VAD_WINDOW * 2]
        pending = pending[VAD_WINDOW * 2:]
        f32 = s16le_to_f32(raw)
        push_history(f32)

        result = vad(torch.from_numpy(f32), return_seconds=True)
        if result is not None:
            if 'start' in result:
                speech_start = result['start']
                speech_samples = [get_preroll()]
                log(f'VAD start at {speech_start:.2f}s')
                emit({'event': 'vad_start'})
            elif 'end' in result and speech_start is not None:
                duration = result['end'] - speech_start
                log(f'VAD end at {result["end"]:.2f}s (duration {duration:.2f}s)')
                emit({'event': 'vad_end', 'duration': round(duration, 3)})
                segment = np.concatenate(speech_samples)
                transcription_queue.put((segment, duration))
                speech_samples = []
                speech_start = None
                vad.reset_states()

        if speech_start is not None:
            speech_samples.append(f32)