# Voice to text interface

## Overview

This project provides voice-to-text transcription using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as the backend.
## Origin

This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding) [experiment](https://gitea.efforting.tech/mikael-lovqvists-claude-agent/claude-voice-experiment), but this version is somewhat more hands-on.
## Setup

### Set up a [venv](https://docs.python.org/3/library/venv.html) for [Python](https://www.python.org/)

There are two setup variants, depending on whether you want to build ctranslate2 locally or not. Both are still to be documented here.
## Model selection

Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from Hugging Face on first use.

| Model | VRAM | Quality | Notes |
|-------|------|---------|-------|
| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |

English-only models (`.en` suffix) are generally more accurate on English speech than multilingual models of the same size, particularly at the smaller sizes.
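### Example: choosing a model for a VRAM budget

The table above can be read as a simple lookup. The following is a minimal sketch, assuming the approximate `float16` figures from the table; `MODEL_VRAM_GB_FLOAT16` and `pick_model` are hypothetical names used for illustration and are not part of `stt-server.py`.

```python
from typing import Optional

# Approximate float16 VRAM needs in GB, copied from the table above.
# Ordered from smallest to largest model.
MODEL_VRAM_GB_FLOAT16 = [
    ("base.en", 0.5),
    ("small.en", 1.0),
    ("medium.en", 2.5),
    ("large-v3", 5.0),
]


def pick_model(available_vram_gb: float, english_only: bool = True) -> Optional[str]:
    """Return the largest listed model that fits in the budget, or None.

    Hypothetical helper: with english_only=False the `.en` variants are
    skipped, because they cannot handle other languages.
    """
    best = None
    for name, needed_gb in MODEL_VRAM_GB_FLOAT16:
        if not english_only and name.endswith(".en"):
            continue
        if needed_gb <= available_vram_gb:
            best = name
    return best


if __name__ == "__main__":
    # The result can then be passed on as `--model <name>`.
    print(pick_model(3.0))
```

With a 3 GB budget this selects `medium.en`. The figures are rough, so leave some headroom for audio buffers and other processes using the GPU.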
## Compute type

Pass `--compute-type <type>` to control the numeric precision used during inference.

| Type | Notes |
|------|-------|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |

If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
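### Example: passing the compute type to faster-whisper

As a rough illustration of where this setting ends up, the sketch below passes it to faster-whisper's `WhisperModel` constructor through its `compute_type` parameter. How `stt-server.py` wires this internally is not shown here, so `default_compute_type` and `load_model` are hypothetical helpers, and the CPU fallback to `int8` is an assumption based on the table above.

```python
from typing import Optional


def default_compute_type(device: str) -> str:
    """Suggest a compute type per device (hypothetical helper).

    Follows the table above: int8_float16 on a GPU, int8 on a CPU.
    """
    if device == "cuda":
        return "int8_float16"
    if device == "cpu":
        return "int8"
    raise ValueError(f"unknown device: {device!r}")


def load_model(model_name: str, device: str = "cuda",
               compute_type: Optional[str] = None):
    """Load a faster-whisper model with the requested precision."""
    if compute_type is None:
        compute_type = default_compute_type(device)

    # Imported lazily so default_compute_type() works even when
    # faster-whisper is not installed.
    from faster_whisper import WhisperModel

    return WhisperModel(model_name, device=device, compute_type=compute_type)


if __name__ == "__main__":
    print(default_compute_type("cuda"))
```

Calling `load_model("medium.en")` would then download the model on first use and load it with `int8_float16` on the GPU.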
## Language and translation

By default, the server auto-detects the spoken language and transcribes it as-is.

| Argument | Default | Notes |
|----------|---------|-------|
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Skips language detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |

> [!NOTE]
> The `.en` model variants (`base.en`, `small.en`, etc.) are English-only and do not support `--task translate` or a non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.

> [!WARNING]
> Omitting `--language` with a multilingual model may cause occasional misdetection, even when all speech is English. Pass `--language en` to avoid this if you only speak English.
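### Example: validating language and task options

The rules above can be checked before any audio is processed. This is a minimal sketch, assuming an `argparse`-based command line; `build_parser` and `transcribe_options` are hypothetical names and do not describe the actual implementation of `stt-server.py`. The returned dictionary uses the `language` and `task` keyword arguments accepted by faster-whisper's `WhisperModel.transcribe`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="language/task flag sketch")
    parser.add_argument("--model", default="base.en")
    parser.add_argument("--language", default=None,
                        help="language code such as en or sv; unset means auto-detect")
    parser.add_argument("--task", choices=["transcribe", "translate"],
                        default="transcribe")
    return parser


def transcribe_options(args: argparse.Namespace) -> dict:
    """Reject unsupported combinations and return kwargs for transcribe()."""
    english_only = args.model.endswith(".en")
    if english_only and args.task == "translate":
        raise ValueError(f"{args.model} is English-only and cannot translate")
    if english_only and args.language not in (None, "en"):
        raise ValueError(f"{args.model} is English-only; got --language {args.language}")
    # language=None lets a multilingual model auto-detect the language.
    return {"language": args.language, "task": args.task}


if __name__ == "__main__":
    args = build_parser().parse_args(["--model", "large-v3", "--language", "sv"])
    print(transcribe_options(args))
```

The resulting options could be forwarded as `model.transcribe(audio_path, **options)`, where `model` is a loaded `WhisperModel` and `audio_path` is a placeholder for your own input.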