# Voice to text interface ## Overview This project aims to provide voice to text using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as backend. ## Origin This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding) [experiment](https://gitea.efforting.tech/mikael-lovqvists-claude-agent/claude-voice-experiment) but this version is somewhat more hands on. ## Setup ### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/) We will have two different setups here depending on if you want to build ctranslate2 locally or not. This shall be documented. ## Model selection Pass `--model ` to `stt-server.py`. Models are downloaded automatically from HuggingFace on first use. | Model | VRAM | Quality | Notes | |-------|------|---------|-------| | `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). | | `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. | | `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. | | `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. | English-only models (`.en` suffix) are faster and more accurate than multilingual models for English speech. ## Compute type Pass `--compute-type ` to control the numeric precision used during inference. | Type | Notes | |------|-------| | `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. | | `float16` | Slightly better accuracy, higher VRAM usage. | | `int8` | CPU-friendly, lower quality. | If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel. ## Language and translation By default the server auto-detects the spoken language and transcribes it as-is. | Argument | Default | Notes | |----------|---------|-------| | `--language ` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Speeds up detection and avoids misidentification. | | `--task transcribe` | default | Output text in the spoken language. | | `--task translate` | | Translate speech to English regardless of source language. | > [!NOTE] > The `.en` model variants (`base.en`, `small.en` etc.) are English-only and do not support `--task translate` or non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases. > [!WARNING] > Omitting `--language` with a multilingual model and English-only speech may cause occasional misdetection. Pass `--language en` to avoid this if you only speak English.