mikael-lovqvists-claude-agent/stt-server

mikael-lovqvists-claude-agent 81e9ea82cf Add NOTES.md with TranscriptionInfo unused fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-07 09:24:53 +00:00

examples

Add Node.js WebSocket example scripts

2026-06-07 08:59:38 +00:00

.gitignore

Add stt-server.py: self-contained recording + VAD + transcription process

2026-06-07 08:41:45 +00:00

NOTES.md

Add NOTES.md with TranscriptionInfo unused fields

2026-06-07 09:24:53 +00:00

README.md

Add --language and --task CLI arguments, document in README

2026-06-07 09:16:19 +00:00

setup-venv-local-build.sh

Add WebSocket broadcast to stt-server.py

2026-06-07 08:53:54 +00:00

setup-venv.sh

Add WebSocket broadcast to stt-server.py

2026-06-07 08:53:54 +00:00

stt-server.py

Include detected language and confidence in transcript events

2026-06-07 09:21:44 +00:00

README.md

Voice to text interface

Overview

This project aims to provide voice-to-text transcription using faster-whisper as its backend.

Origin

This project started as a vibe-coded experiment, but this version is somewhat more hands-on.

Setup

Set up a venv for Python

There are two setups, depending on whether you want to build ctranslate2 locally (setup-venv-local-build.sh) or not (setup-venv.sh). Detailed instructions are yet to be documented.

Model selection

Pass --model <name> to stt-server.py. Models are downloaded automatically from HuggingFace on first use.

| Model | VRAM | Quality | Notes |
| --- | --- | --- | --- |
| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |

English-only models (.en suffix) tend to be faster and more accurate than multilingual models of the same size for English speech.
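
The table above can be read as a simple lookup, for example to pick the largest model that fits a given VRAM budget. The sketch below is hypothetical and not part of stt-server.py: the names `MODEL_VRAM_GB` and `pick_model` are assumptions, and the figures are the approximate values from the table.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above: (float16, float32).
# Entries are ordered from lowest to highest quality.
MODEL_VRAM_GB = {
    "base.en": (0.5, 1.0),
    "small.en": (1.0, 2.0),
    "medium.en": (2.5, 5.0),
    "large-v3": (5.0, 10.0),
}


def pick_model(vram_gb: float, precision: str = "float16") -> Optional[str]:
    """Return the highest-quality model whose estimated VRAM fits the budget."""
    column = {"float16": 0, "float32": 1}[precision]
    best = None
    for name, needs in MODEL_VRAM_GB.items():
        if needs[column] <= vram_gb:
            best = name  # later entries are higher quality
    return best


if __name__ == "__main__":
    print(pick_model(4.0))             # medium.en
    print(pick_model(4.0, "float32"))  # small.en
```

The result would then be passed as `--model <name>`. Note that with a large enough budget the helper returns `large-v3`, which is multilingual rather than English-only.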

Compute type

Pass --compute-type <type> to control the numeric precision used during inference.

| Type | Notes |
| --- | --- |
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |

If you see a CUDA error about mismatched library versions at startup, use setup-venv-local-build.sh to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
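
For orientation, this is roughly how a model name and compute type could reach faster-whisper. `WhisperModel` and its `model_size_or_path`, `device` and `compute_type` parameters belong to faster-whisper; the helper names `model_kwargs` and `load_model`, the restriction to the three types listed above, and the CPU fallback to `int8` are assumptions for illustration and may not match what stt-server.py does.

```python
# Only the compute types documented in the table above (an assumption of this sketch).
VALID_COMPUTE_TYPES = ("int8_float16", "float16", "int8")


def model_kwargs(model: str = "base.en",
                 compute_type: str = "int8_float16",
                 device: str = "cuda") -> dict:
    """Build keyword arguments for faster_whisper.WhisperModel."""
    if compute_type not in VALID_COMPUTE_TYPES:
        raise ValueError(f"unsupported compute type: {compute_type!r}")
    # Assumption: on CPU, fall back to int8, the CPU-friendly type.
    if device == "cpu" and compute_type != "int8":
        compute_type = "int8"
    return {
        "model_size_or_path": model,
        "device": device,
        "compute_type": compute_type,
    }


def load_model(**kwargs):
    """Load the model. The import is lazy so model_kwargs works without faster-whisper installed."""
    from faster_whisper import WhisperModel
    return WhisperModel(**kwargs)
```

Calling `load_model(**model_kwargs("medium.en", "float16"))` would download the model on first use, as described under Model selection.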

Language and translation

By default the server auto-detects the spoken language and transcribes it as-is.

| Argument | Default | Notes |
| --- | --- | --- |
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Speeds up detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |
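
The two arguments in the table could be declared with `argparse` along these lines. This is a minimal sketch under the assumption that stt-server.py uses `argparse`; the function name `build_parser` and the help texts are hypothetical, and only the flag names, defaults and task values come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare --language and --task with the defaults documented above."""
    parser = argparse.ArgumentParser(description="Sketch of the language/task options")
    parser.add_argument("--language", default=None, metavar="CODE",
                        help="language code such as 'en' or 'sv'; omit to auto-detect")
    parser.add_argument("--task", default="transcribe",
                        choices=("transcribe", "translate"),
                        help="'translate' outputs English regardless of source language")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--language", "sv", "--task", "translate"])
    print(args.language, args.task)  # sv translate
```

faster-whisper's `WhisperModel.transcribe` accepts `language` and `task` keyword arguments, so parsed values like these could be forwarded to it directly.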

Note

The .en model variants (base.en, small.en, etc.) are English-only and do not support --task translate or a non-English --language value. Use a multilingual model (large-v3, medium) for multilingual or translation use cases.
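
The restriction in this note can be checked up front, before a model is loaded. The following is a hypothetical helper, not code from stt-server.py: the name `check_model_options` and the error messages are assumptions, and it identifies English-only models purely by the `.en` suffix.

```python
from typing import Optional


def check_model_options(model: str, language: Optional[str], task: str) -> None:
    """Raise ValueError for combinations an English-only model cannot serve."""
    if not model.endswith(".en"):
        return  # multilingual models accept any language and both tasks
    if task == "translate":
        raise ValueError(f"{model} is English-only and cannot be used with --task translate")
    if language not in (None, "en"):
        raise ValueError(f"{model} is English-only; got --language {language}")


if __name__ == "__main__":
    check_model_options("large-v3", "sv", "translate")  # accepted
    check_model_options("base.en", "en", "transcribe")  # accepted
    try:
        check_model_options("base.en", "sv", "transcribe")
    except ValueError as err:
        print(err)
```

Failing early like this gives a clearer message than an error raised partway through transcription.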

Warning

When --language is omitted, a multilingual model may occasionally misdetect the language of English speech. Pass --language en to avoid this if you only speak English.