Voice to text interface
Overview
This project provides voice-to-text transcription using faster-whisper as the backend.
Origin
This project started as a vibe-coded experiment, but this version is somewhat more hands-on.
Setup
Setup venv for python
There will be two setups, depending on whether you build ctranslate2 locally or use the prebuilt wheel. Both still need to be documented.
Model selection
Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from Hugging Face on first use.
| Model | VRAM | Quality | Notes |
|---|---|---|---|
| `base.en` | ~0.5 GB (float16) / ~1 GB (float32) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (float16) / ~2 GB (float32) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (float16) / ~5 GB (float32) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (float16) / ~10 GB (float32) | Best | Highest accuracy, use if VRAM allows. |
English-only models (`.en` suffix) are generally faster and more accurate on English speech than multilingual models of the same size.
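As a rough illustration of how the table might be used, the sketch below picks the largest listed model whose approximate VRAM figure fits a given budget. The `pick_model` helper and the `MODEL_VRAM_GB` mapping are hypothetical and not part of `stt-server.py`; the numbers are the approximate values from the table, not measurements.

```python
# Approximate VRAM needs in GB, copied from the table above: (float16, float32).
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {
    "base.en": (0.5, 1.0),
    "small.en": (1.0, 2.0),
    "medium.en": (2.5, 5.0),
    "large-v3": (5.0, 10.0),
}


def pick_model(vram_budget_gb: float, precision: str = "float16") -> str:
    """Return the largest listed model that fits the budget (hypothetical helper)."""
    if precision not in ("float16", "float32"):
        raise ValueError(f"unsupported precision: {precision}")
    column = 0 if precision == "float16" else 1
    best = None
    for name, needs in MODEL_VRAM_GB.items():
        if needs[column] <= vram_budget_gb:
            best = name  # later entries are larger, so keep overwriting
    if best is None:
        raise ValueError("no listed model fits the given VRAM budget")
    return best


if __name__ == "__main__":
    print(pick_model(4.0))             # medium.en
    print(pick_model(4.0, "float32"))  # small.en
```

The result can then be passed as `--model`. Note that `large-v3` is multilingual, so a large budget also changes the language behaviour described further down.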
Compute type
Pass `--compute-type <type>` to control the numeric precision used during inference.
| Type | Notes |
|---|---|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |
If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
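A minimal sketch of how a compute type could be chosen and handed to faster-whisper. `default_compute_type` and `load_model` are hypothetical helpers, and the device-to-type mapping is an assumption read off the table; how `stt-server.py` actually wires up `--compute-type` is not shown here. `WhisperModel(model, device=..., compute_type=...)` is the faster-whisper constructor, and its import is deferred so the helper can be read without a GPU or the library installed.

```python
def default_compute_type(device: str) -> str:
    """Suggest a compute type per the table above (hypothetical helper)."""
    if device == "cuda":
        return "int8_float16"  # the documented default
    if device == "cpu":
        return "int8"          # CPU-friendly, lower quality
    raise ValueError(f"unknown device: {device}")


def load_model(model_name: str, device: str = "cuda", compute_type=None):
    """Load a faster-whisper model; the weights are downloaded on first use."""
    # Imported lazily so this module loads even without faster-whisper installed.
    from faster_whisper import WhisperModel

    if compute_type is None:
        compute_type = default_compute_type(device)
    return WhisperModel(model_name, device=device, compute_type=compute_type)
```

Calling `load_model("medium.en")` would then correspond roughly to `--model medium.en` with the default compute type.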
Language and translation
By default the server auto-detects the spoken language and transcribes it as-is.
| Argument | Default | Notes |
|---|---|---|
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Speeds up detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |
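The arguments above could be declared with `argparse` roughly as follows. This is a sketch under the assumption that `stt-server.py` uses `argparse`; the `build_parser` and `transcribe_kwargs` names are hypothetical. The `language` and `task` keyword arguments do exist on faster-whisper's `WhisperModel.transcribe`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the language/task options")
    # None means "let the model auto-detect the language".
    parser.add_argument("--language", default=None, metavar="CODE",
                        help="force a language, e.g. en or sv")
    parser.add_argument("--task", default="transcribe",
                        choices=["transcribe", "translate"],
                        help="transcribe as spoken, or translate to English")
    return parser


def transcribe_kwargs(args: argparse.Namespace) -> dict:
    """Keyword arguments to forward to WhisperModel.transcribe()."""
    return {"language": args.language, "task": args.task}


if __name__ == "__main__":
    args = build_parser().parse_args(["--language", "sv", "--task", "translate"])
    print(transcribe_kwargs(args))
```

Using `choices` makes `argparse` reject any `--task` value other than the two documented ones before a model is loaded.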
Note
The `.en` model variants (`base.en`, `small.en`, etc.) are English-only and do not support `--task translate` or a non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.
Warning
Omitting `--language` with a multilingual model and English-only speech may cause occasional misdetection. Pass `--language en` to avoid this if you only speak English.
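The two caveats above can be checked before a model is loaded. The sketch below is a hypothetical `check_options` helper, not something `stt-server.py` is known to do; it returns a list of messages instead of raising, so the caller decides what to do with them.

```python
def check_options(model: str, language=None, task: str = "transcribe") -> list:
    """Return human-readable problems for a model/language/task combination."""
    problems = []
    english_only = model.endswith(".en")
    if english_only and task == "translate":
        problems.append("error: .en models do not support --task translate")
    if english_only and language not in (None, "en"):
        problems.append(f"error: .en models do not support --language {language}")
    if not english_only and language is None:
        problems.append("warning: no --language given; auto-detection may "
                        "occasionally misidentify the language")
    return problems


if __name__ == "__main__":
    for message in check_options("base.en", language="sv", task="translate"):
        print(message)
```

An empty list means the combination matches what the tables and notes above describe as supported.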