Compare commits
21 Commits
c96bf0ecf5
...
main
| SHA1 |
|---|
| 4f983ca276 |
| 81e9ea82cf |
| 0afe761625 |
| bdb1aac885 |
| f2ba15185e |
| dd6e74a7a8 |
| be1efd9edb |
| 9030b1315d |
| 7b03deddb5 |
| 218687b039 |
| 6bbc04dde7 |
| aad1bda3bf |
| f18330608d |
| 18404708e3 |
| a4fe95b24a |
| 01210e878f |
| c0a72679f8 |
| 2af47373c4 |
| bbde89a2cc |
| 346c7c6585 |
| 3db7058646 |
3
.gitignore
vendored
@@ -1 +1,2 @@
venv/
build/
9
NOTES.md
Normal file
@@ -0,0 +1,9 @@
# Notes

## TranscriptionInfo: unused fields

`model.transcribe()` returns a `TranscriptionInfo` object as its second value. We currently use `language` and `language_probability`. Other available fields:

- **`all_language_probs`**: full ranked list of `(language, probability)` tuples for the segment. Useful for debugging misdetection: when the model hallucinates Sinhala on noise, for example, this would show Sinhala at the top with a high probability. Could be included in transcript events or exposed as a diagnostic endpoint.
- **`duration`**: total duration of the audio fed to the model.
- **`duration_after_vad`**: speech duration after faster-whisper's built-in VAD filter (not meaningful here, since we pass `vad_filter=False`).
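
As a rough illustration of how `all_language_probs` could feed a diagnostic, the sketch below picks the top candidates from a list of `(language, probability)` tuples. The helper name `top_language_probs` and the sample values are made up for this example. It assumes the list has the shape described above and does not call faster-whisper itself.

```python
def top_language_probs(all_language_probs, k=3):
    """Return the k most probable (language, probability) pairs, rounded for logging."""
    if not all_language_probs:  # None or empty when no detection data is available
        return []
    ranked = sorted(all_language_probs, key=lambda pair: pair[1], reverse=True)
    return [(lang, round(float(prob), 3)) for lang, prob in ranked[:k]]


# Hypothetical values standing in for info.all_language_probs on a noisy segment.
sample = [("en", 0.21), ("si", 0.62), ("sv", 0.05), ("de", 0.02)]
print(top_language_probs(sample, k=2))
```

A compact list like this could be attached to a transcript event or written to the verbose log without shipping the full probability table.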
46
README.md
@@ -15,4 +15,48 @@ This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding

### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/)

There are two setup paths, depending on whether you want to build ctranslate2 locally or not. This still needs to be documented.

## Model selection

Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from Hugging Face on first use.

| Model | VRAM | Quality | Notes |
|-------|------|---------|-------|
| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |

English-only models (`.en` suffix) are generally faster and more accurate than multilingual models of the same size for English speech.

## Compute type

Pass `--compute-type <type>` to control the numeric precision used during inference.

| Type | Notes |
|------|-------|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |

If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
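
To get a first hint of whether such a mismatch is likely, it can help to see which cuBLAS library the system loader resolves. The snippet below is only a diagnostic sketch using the Python standard library. It assumes a Linux system, `find_cublas` is a name invented here, and the result depends entirely on the machine (it may well be `None`).

```python
import ctypes.util


def find_cublas():
    """Return the cuBLAS library name the system loader resolves, or None."""
    # find_library searches the standard loader locations (for example via ldconfig on Linux).
    return ctypes.util.find_library("cublas")


name = find_cublas()
if name is None:
    print("libcublas not found on the default library path")
else:
    # A name with a different major version than the wheel expects suggests a mismatch.
    print(f"loader resolves cuBLAS as: {name}")
```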

## Language and translation

By default the server auto-detects the spoken language and transcribes it as-is.

| Argument | Default | Notes |
|----------|---------|-------|
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Skips language detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |

> [!NOTE]
> The `.en` model variants (`base.en`, `small.en`, etc.) are English-only and do not support `--task translate` or a non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.

> [!WARNING]
> When `--language` is omitted, a multilingual model may occasionally misdetect the language even if all speech is English. Pass `--language en` to avoid this if you only speak English.
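
The constraints in the note above can be checked before a model is loaded. The function below is a hypothetical sketch, not part of `stt-server.py`: the name `check_language_args` and the message wording are assumptions, and it relies only on the `.en` suffix convention described in this README.

```python
def check_language_args(model, language=None, task="transcribe"):
    """Return a list of problems for a model/language/task combination (empty if fine)."""
    problems = []
    english_only = model.endswith(".en")
    if english_only and task == "translate":
        problems.append(f"{model} is English-only and does not support --task translate")
    if english_only and language not in (None, "en"):
        problems.append(f"{model} is English-only and does not support --language {language}")
    return problems


print(check_language_args("base.en", language="sv"))
print(check_language_args("large-v3", language="sv", task="translate"))
```

Returning a list rather than raising lets a caller report every problem at once.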
24
examples/listen.mjs
Normal file
@@ -0,0 +1,24 @@
// Connect to the STT server and print all events.
// Usage: node listen.mjs
// Requires a Node.js version with a global WebSocket (Node 22 or later).

const PORT = process.env.STT_PORT ?? '11501'
const ws = new WebSocket(`ws://localhost:${PORT}`)

ws.addEventListener('open', () => {
  process.stderr.write(`connected to ws://localhost:${PORT}\n`)
})

ws.addEventListener('message', ({ data }) => {
  const event = JSON.parse(data)
  console.log(event)
})

ws.addEventListener('close', () => {
  process.stderr.write('disconnected\n')
  process.exit(0)
})

ws.addEventListener('error', (err) => {
  process.stderr.write(`error: ${err.message}\n`)
  process.exit(1)
})
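
For context on how events reach a listener like this one: the server hands each event from its worker threads to the WebSocket event loop through one queue per connection. The following is a simplified, standard-library-only sketch of that fan-out pattern with no WebSocket involved. The names `consumer`, `received` and `ready` are illustrative and do not appear in the server.

```python
import asyncio
import threading

clients = set()            # one asyncio.Queue per connected consumer
loop = None                # event loop owned by the broadcaster thread
ready = threading.Event()  # set once both consumers have registered
received = []


def emit(message):
    """Callable from any thread: schedule a put on every queue on the loop thread."""
    for q in list(clients):
        loop.call_soon_threadsafe(q.put_nowait, message)


async def consumer(name, count):
    q = asyncio.Queue()
    clients.add(q)
    try:
        for _ in range(count):
            received.append((name, await q.get()))
    finally:
        clients.discard(q)


async def main():
    global loop
    loop = asyncio.get_running_loop()
    tasks = [asyncio.create_task(consumer(n, 2)) for n in ("a", "b")]
    await asyncio.sleep(0)  # let both consumers register their queues
    ready.set()
    await asyncio.gather(*tasks)


t = threading.Thread(target=lambda: asyncio.run(main()))
t.start()
ready.wait()
emit("vad_start")
emit("vad_end")
t.join()
print(sorted(received))
```

Because `asyncio.Queue` is not thread-safe, the producing thread never touches a queue directly and only schedules `put_nowait` on the loop.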
19
examples/transcripts.mjs
Normal file
@@ -0,0 +1,19 @@
// Connect to the STT server and print transcript text only.
// Usage: node transcripts.mjs

const PORT = process.env.STT_PORT ?? '11501'
const ws = new WebSocket(`ws://localhost:${PORT}`)

ws.addEventListener('open', () => {
  process.stderr.write(`connected to ws://localhost:${PORT}\n`)
})

ws.addEventListener('message', ({ data }) => {
  const event = JSON.parse(data)
  if (event.event === 'transcript') {
    console.log(event.text)
  }
})

ws.addEventListener('close', () => process.exit(0))
ws.addEventListener('error', () => process.exit(1))
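
The same filtering can be written as a small pure function, which is convenient for testing without a running server. This is a sketch in Python: `transcript_text` is a hypothetical helper, and the sample payloads are invented to follow the event shape used in this example.

```python
import json


def transcript_text(message):
    """Return the text of a transcript event, or None for any other event."""
    event = json.loads(message)
    if event.get("event") == "transcript":
        return event.get("text")
    return None


print(transcript_text('{"event": "vad_start"}'))
print(transcript_text('{"event": "transcript", "text": "hello world", "words": [], "duration": 1.23}'))
```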
@@ -1,11 +1,29 @@
 #!/usr/bin/env bash
+#
+# setup-venv-local-build.sh: builds ctranslate2 from source and installs faster-whisper.
+#
+# USE THIS SCRIPT when the PyPI ctranslate2 wheel does not match your CUDA version.
+# The PyPI wheel targets a specific CUDA major version (e.g. CUDA 12). If your system
+# has a newer version (e.g. CUDA 13), the wheel will fail at runtime because it tries
+# to dlopen libcublas.so.12 which does not exist. Building from source compiles against
+# your actual installed CUDA and links correctly.
+#
+# For systems where the PyPI wheel works (CUDA version matches), use setup-venv.sh
+# instead; it is much faster and simpler.
+#
+# Environment overrides:
+#   PYTHON_ENV     path to venv (default: ./venv)
+#   HF_TOKEN_FILE  path to HuggingFace token file (default: ~/.secrets/hugging-face.token)
+#   HF_HUB_CACHE   path to HuggingFace hub cache (default: ~/.cache/huggingface/hub)
+#   CUDA_HOME      path to CUDA toolkit (auto-detected if not set)
+#
 set -euo pipefail

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-VENV="${SCRIPT_DIR}/venv"
+VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"
 BUILD_DIR="${SCRIPT_DIR}/build/ctranslate2"
 MODEL="${1:-base.en}"
-TOKEN_FILE="${HOME}/.secrets/hugging-face.token"
+TOKEN_FILE="${HF_TOKEN_FILE:-${HOME}/.secrets/hugging-face.token}"

 # Locate CUDA
 if [ -z "${CUDA_HOME:-}" ]; then
@@ -25,9 +43,9 @@ fi
 echo "==> CUDA: ${CUDA_HOME}"
 "${CUDA_HOME}/bin/nvcc" --version | head -1

-for tool in cmake git; do
+for tool in cmake git python3; do
   if ! command -v "${tool}" &>/dev/null; then
-    echo "ERROR: ${tool} not found - install with: sudo pacman -S ${tool}" >&2
+    echo "ERROR: ${tool} not found" >&2
     exit 1
   fi
 done
@@ -39,6 +57,7 @@ fi

 echo "==> upgrading pip + build tools"
 "${VENV}/bin/pip" install --upgrade pip wheel setuptools pybind11 --quiet
+"${VENV}/bin/pip" install torch silero-vad websockets

 # --- clone (skipped if already done) ---
 if [ ! -d "${BUILD_DIR}/src/.git" ]; then
@@ -78,25 +97,30 @@ ls "${VENV}/include/ctranslate2/" | head -3
 ls "${VENV}/lib/libctranslate2"* 2>/dev/null || { echo "ERROR: libctranslate2 not found in venv/lib" >&2; exit 1; }
 grep "WITH_CUDA" "${BUILD_DIR}/cmake-build/CMakeCache.txt" | grep -v "^#" || true

-# --- Python bindings ---
-# Always reinstall from source to ensure we use our CUDA 13 build, not a PyPI wheel
-echo "==> removing any existing ctranslate2 install..."
+# --- faster-whisper (with all deps, including PyPI ctranslate2) ---
+# Install faster-whisper normally so all its dependencies (av, huggingface_hub, etc.)
+# are satisfied. This will pull in the PyPI ctranslate2 wheel, which we override next.
+if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null; then
+  echo "==> installing faster-whisper"
+  "${VENV}/bin/pip" install faster-whisper
+else
+  echo "==> faster-whisper already installed, skipping"
+fi
+
+# --- Python bindings (always reinstalled from source) ---
+# Override the PyPI ctranslate2 wheel pulled in above with our source-built version.
+# This is the whole point of this script: the PyPI wheel links against a fixed CUDA
+# major version (e.g. libcublas.so.12) while our build links against the system version.
+echo "==> removing PyPI ctranslate2..."
 "${VENV}/bin/pip" uninstall -y ctranslate2 2>/dev/null || true

-echo "==> building ctranslate2 Python bindings from source..."
+echo "==> installing source-built ctranslate2 Python bindings..."
 CT2_ROOT="${VENV}" \
 LIBRARY_PATH="${VENV}/lib:${VENV}/lib64${LIBRARY_PATH:+:${LIBRARY_PATH}}" \
 LDFLAGS="-Wl,-rpath,${VENV}/lib" \
 "${VENV}/bin/pip" install "${BUILD_DIR}/src/python" --no-build-isolation

-# --- faster-whisper ---
-if ! "${VENV}/bin/python3" -c "import faster_whisper" &>/dev/null 2>&1; then
-  echo "==> installing faster-whisper"
-  "${VENV}/bin/pip" install faster-whisper --no-deps
-else
-  echo "==> faster-whisper already installed, skipping"
-fi
-
 # --- model download ---
 if [ -f "${TOKEN_FILE}" ]; then
   export HF_TOKEN="$(cat "${TOKEN_FILE}")"
   echo "==> HuggingFace token loaded from ${TOKEN_FILE}"
@@ -104,6 +128,10 @@ else
   echo "==> no token found at ${TOKEN_FILE} - unauthenticated download"
 fi

+if [ -n "${HF_HUB_CACHE:-}" ]; then
+  echo "==> HuggingFace cache: ${HF_HUB_CACHE}"
+fi
+
 echo "==> pre-downloading model: ${MODEL}"
 "${VENV}/bin/python3" - <<EOF
 from faster_whisper import WhisperModel
@@ -112,7 +140,5 @@ WhisperModel("${MODEL}", device="cuda", compute_type="int8_float16")
 print("done")
 EOF

-chmod +x "${SCRIPT_DIR}/faster-whisper-server.py"
-
 echo ""
-echo "==> done. Run with: node query-demo.mjs --stt faster-whisper"
+echo "==> done. Venv ready at ${VENV}"
31
setup-venv.sh
Executable file
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
#
# setup-venv.sh: installs faster-whisper from PyPI into a venv.
#
# USE THIS SCRIPT when the PyPI ctranslate2 wheel matches your CUDA version.
# PyPI wheels target a specific CUDA major version; if your system matches,
# this is the fastest way to get started, with no compilation required.
#
# If you see errors like "libcublas.so.12: cannot open shared object file" at
# runtime, your CUDA version does not match the wheel. Use setup-venv-local-build.sh
# instead, which compiles ctranslate2 against your actual CUDA installation.
#
# Environment overrides:
#   PYTHON_ENV  path to venv (default: ./venv)
#
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="${PYTHON_ENV:-${SCRIPT_DIR}/venv}"

if [ ! -d "${VENV}" ]; then
  echo "==> creating venv at ${VENV}"
  python3 -m venv "${VENV}"
fi

echo "==> installing torch and faster-whisper"
"${VENV}/bin/pip" install --upgrade pip --quiet
"${VENV}/bin/pip" install torch faster-whisper silero-vad websockets

echo ""
echo "==> done. Venv ready at ${VENV}"
282
stt-server.py
Executable file
@@ -0,0 +1,282 @@
#!/usr/bin/env -S bash -c 'exec "$(dirname "$0")/venv/bin/python3" "$0" "$@"'
"""
STT server: records audio, runs Silero VAD, transcribes with faster-whisper.
Broadcasts JSON events to all connected WebSocket clients.

Events:
  {"event": "ready"}
  {"event": "vad_start"}
  {"event": "vad_end", "duration": 1.23}
  {"event": "transcript", "text": "...", "words": [...], "duration": 1.23,
   "language": "en", "language_probability": 0.99}
  {"event": "error", "message": "..."}

word format: {"word": "hello", "start": 0.12, "end": 0.45, "probability": 0.99}

Every WebSocket connection receives the full event stream from the moment it
connects; no subscription handshake is required.

Machine-readable events are sent over WebSocket only.
Pass --verbose to enable logging to stderr (startup, VAD events, transcripts).
Errors always go to stderr regardless of verbosity.

Environment:
  STT_PORT       WebSocket port (default: 11501)
  HF_TOKEN_FILE  path to HuggingFace token file (default: ~/.secrets/hugging-face.token)

Usage:
  ./stt-server.py
  ./stt-server.py --model large-v3 --device cuda --compute-type int8_float16 --verbose
"""

import os
import sys
import json
import signal
import argparse
import threading
import queue
import subprocess
import traceback
import asyncio
import websockets
import numpy as np
import torch

SAMPLE_RATE = 16000
VAD_WINDOW = 512           # samples per VAD chunk (32ms at 16kHz)
PRE_ROLL_SAMPLES = 3200    # 0.2s prepended to each segment for context
HISTORY_SAMPLES = 960000   # 60s ring buffer for pre-roll
PORT = int(os.environ.get('STT_PORT', '11501'))


def log(msg, error=False):
    # `verbose` is set after argument parsing, before the first call to log().
    if error or verbose:
        sys.stderr.write(f'[stt] {msg}\n')
        sys.stderr.flush()


# --- WebSocket broadcast ---

_ws_loop = None
_ws_clients = set()  # set of asyncio.Queue, one per connection


def emit(event):
    line = json.dumps(event)
    if _ws_loop is not None:
        for q in list(_ws_clients):
            _ws_loop.call_soon_threadsafe(q.put_nowait, line)


async def ws_handler(websocket):
    q = asyncio.Queue()
    _ws_clients.add(q)
    log(f'client connected ({len(_ws_clients)} total)')
    try:
        while True:
            msg = await q.get()
            await websocket.send(msg)
    except websockets.ConnectionClosed:
        pass
    finally:
        _ws_clients.discard(q)
        log(f'client disconnected ({len(_ws_clients)} remaining)')


async def ws_main():
    global _ws_loop
    _ws_loop = asyncio.get_running_loop()
    async with websockets.serve(ws_handler, '', PORT):
        log(f'WebSocket listening on port {PORT}')
        await asyncio.Future()  # run forever


def start_ws_server():
    asyncio.run(ws_main())


# --- Mic ---

def find_mic():
    candidates = [
        ['parec', ['--format=s16le', '--rate=16000', '--channels=1', '--latency-msec=50']],
        ['arecord', ['-f', 'S16_LE', '-r', '16000', '-c', '1', '-t', 'raw', '-q']],
    ]
    for cmd, args in candidates:
        try:
            subprocess.run(['which', cmd], check=True, capture_output=True)
            return cmd, args
        except subprocess.CalledProcessError:
            pass
    raise RuntimeError('no mic capture command found; need parec or arecord')


def s16le_to_f32(data):
    # Signed 16-bit little-endian PCM -> float32 in [-1.0, 1.0)
    return np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0


# --- Args + model loading ---

parser = argparse.ArgumentParser()
parser.add_argument('--model', default='base.en')
parser.add_argument('--device', default='cuda')
parser.add_argument('--compute-type', default='int8_float16')
parser.add_argument('--language', default=None, help='language code (e.g. en, sv) or None for auto-detect')
parser.add_argument('--task', default='transcribe', choices=['transcribe', 'translate'], help='transcribe keeps the source language; translate converts to English')
parser.add_argument('--verbose', '-v', action='store_true')
args = parser.parse_args()
verbose = args.verbose

token_file = os.environ.get('HF_TOKEN_FILE', os.path.expanduser('~/.secrets/hugging-face.token'))
try:
    with open(token_file) as f:
        os.environ['HF_TOKEN'] = f.read().strip()
except FileNotFoundError:
    pass

from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

# Only used to log a "downloading" message when the model is not cached locally.
try:
    snapshot_download(f'Systran/faster-whisper-{args.model}', local_files_only=True)
except Exception:
    log(f'downloading model {args.model}...')

log(f'loading faster-whisper {args.model} ({args.device}, {args.compute_type})...')
try:
    model = WhisperModel(args.model, device=args.device, compute_type=args.compute_type)
    log(f'model ready on {args.device}')
except Exception as e:
    log(f'{args.device} failed ({e}), falling back to cpu', error=True)
    model = WhisperModel(args.model, device='cpu', compute_type='int8')
    log('model ready on cpu')

log('loading silero VAD...')
from silero_vad import load_silero_vad, VADIterator
vad_model = load_silero_vad()
vad = VADIterator(vad_model, sampling_rate=SAMPLE_RATE,
                  threshold=0.5, min_silence_duration_ms=500)
log('VAD ready')


# --- Pre-roll ring buffer ---

history = np.zeros(HISTORY_SAMPLES, dtype=np.float32)
history_pos = 0  # total number of samples ever written

def push_history(samples):
    global history_pos
    n = len(samples)
    base = history_pos % HISTORY_SAMPLES
    space = HISTORY_SAMPLES - base
    if n <= space:
        history[base:base + n] = samples
    else:
        # Wrap around: fill to the end of the buffer, then continue at the start.
        history[base:] = samples[:space]
        history[:n - space] = samples[space:]
    history_pos += n

def get_preroll():
    start = max(0, history_pos - PRE_ROLL_SAMPLES)
    # Map absolute sample positions onto the ring buffer in one vectorized read.
    idx = np.arange(start, history_pos) % HISTORY_SAMPLES
    return history[idx]


# --- Transcription thread ---

transcription_queue = queue.Queue()

def transcription_worker():
    while True:
        item = transcription_queue.get()
        if item is None:
            break
        samples, duration = item
        try:
            segments, info = model.transcribe(
                samples,
                language=args.language,
                task=args.task,
                word_timestamps=True,
                vad_filter=False,
            )
            text = ''
            words = []
            for seg in segments:
                text += seg.text
                for w in (seg.words or []):
                    words.append({
                        'word': w.word,
                        'start': round(float(w.start), 4),
                        'end': round(float(w.end), 4),
                        'probability': round(float(w.probability), 4),
                    })
            language = info.language
            lang_prob = round(float(info.language_probability), 3)
            log(f'transcript [{language} {lang_prob}]: {json.dumps(text.strip())} ({len(words)} words)')
            if text.strip():
                emit({
                    'event': 'transcript',
                    'text': text.strip(),
                    'words': words,
                    'duration': round(duration, 3),
                    'language': language,
                    'language_probability': lang_prob,
                })
        except Exception:
            msg = traceback.format_exc()
            log(f'transcription error:\n{msg}', error=True)
            emit({'event': 'error', 'message': msg})
        finally:
            transcription_queue.task_done()


threading.Thread(target=transcription_worker, daemon=True).start()
threading.Thread(target=start_ws_server, daemon=True).start()


# --- Main recording + VAD loop ---

cmd, cmd_args = find_mic()
log(f'mic: {cmd} {" ".join(cmd_args)}')
mic = subprocess.Popen([cmd] + cmd_args, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

def shutdown(sig=None, frame=None):
    mic.terminate()
    transcription_queue.put(None)
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

emit({'event': 'ready'})

speech_samples = []
speech_start = None
pending = b''

# Read fixed-size blocks of raw PCM. Iterating the pipe directly would split the
# stream at newline bytes, which are arbitrary in binary audio data.
for chunk in iter(lambda: mic.stdout.read(VAD_WINDOW * 2), b''):
    pending += chunk
    while len(pending) >= VAD_WINDOW * 2:
        raw = pending[:VAD_WINDOW * 2]
        pending = pending[VAD_WINDOW * 2:]

        f32 = s16le_to_f32(raw)
        result = vad(torch.from_numpy(f32), return_seconds=True)

        if result is not None:
            if 'start' in result:
                speech_start = result['start']
                # history holds only earlier windows at this point, so the pre-roll
                # does not duplicate the current window, which is appended below.
                speech_samples = [get_preroll()]
                log(f'VAD start at {speech_start:.2f}s')
                emit({'event': 'vad_start'})

            elif 'end' in result and speech_start is not None:
                duration = result['end'] - speech_start
                log(f'VAD end at {result["end"]:.2f}s (duration {duration:.2f}s)')
                emit({'event': 'vad_end', 'duration': round(duration, 3)})
                segment = np.concatenate(speech_samples)
                transcription_queue.put((segment, duration))
                speech_samples = []
                speech_start = None
                vad.reset_states()

        if speech_start is not None:
            speech_samples.append(f32)

        # Record the window in the history only after it has been handled above.
        push_history(f32)
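
The pre-roll ring buffer in `stt-server.py` can be easier to follow in isolation. Below is a small, standard-library-only sketch of the same idea: a fixed-size buffer indexed by a running write position, with a read that returns the most recent samples in order even after the buffer has wrapped. The `RingBuffer` class and the tiny capacity are illustrative assumptions; the server itself uses a NumPy array and module-level state.

```python
class RingBuffer:
    """Fixed-size ring buffer that remembers the most recent samples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0.0] * capacity
        self.pos = 0  # total number of samples ever written

    def push(self, samples):
        for s in samples:
            self.data[self.pos % self.capacity] = s
            self.pos += 1

    def tail(self, count):
        """Return up to `count` of the most recent samples, oldest first."""
        count = min(count, self.pos, self.capacity)
        start = self.pos - count
        return [self.data[(start + i) % self.capacity] for i in range(count)]


buf = RingBuffer(capacity=8)
buf.push(range(1, 6))    # 1..5, no wrap yet
buf.push(range(6, 12))   # 6..11, wraps past the end of the buffer
print(buf.tail(3))
```

Keeping an ever-increasing write position and reducing it modulo the capacity on access avoids separate head and tail pointers.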