WebSocket server, language/task args, verbose flag, misc improvements #2

Merged
mikael-lovqvist merged 13 commits from mikael-lovqvists-claude-agent/stt-server:websocket-server into main 2026-06-07 09:27:02 +00:00
Showing only changes of commit be1efd9edb

@@ -15,4 +15,31 @@ This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding
### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/)
There are two setups, depending on whether you build ctranslate2 locally or use the prebuilt PyPI wheel. Both still need to be documented.
## Model selection
Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from Hugging Face on first use.
| Model | VRAM | Quality | Notes |
|-------|------|---------|-------|
| `base.en` | ~1 GB | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~2 GB | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~5 GB | Good | Recommended starting point for production use. |
| `large-v3` | ~10 GB | Best | Highest accuracy, use if VRAM allows. |
English-only models (`.en` suffix) tend to be more accurate on English speech than multilingual models of the same size, most noticeably at the smaller sizes. `large-v3` is multilingual only.
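
As a rough illustration of how the table above can guide the choice, the sketch below picks the largest listed model that fits a given VRAM budget and builds the matching command line. The helper names `pick_model` and `server_command` are hypothetical and not part of this project, and the VRAM figures are the approximate values from the table, not measurements.

```python
# Approximate VRAM needs in GB, copied from the table above (not measured).
MODEL_VRAM_GB = {
    "base.en": 1,
    "small.en": 2,
    "medium.en": 5,
    "large-v3": 10,
}


def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the largest listed model whose approximate VRAM need fits vram_gb."""
    candidates = [
        (need, name)
        for name, need in MODEL_VRAM_GB.items()
        if need <= vram_gb and (not english_only or name.endswith(".en"))
    ]
    if not candidates:
        raise ValueError(f"no listed model fits in {vram_gb} GB of VRAM")
    # Tuples sort by VRAM need first, so max() yields the largest model that fits.
    return max(candidates)[1]


def server_command(model: str) -> list[str]:
    """Build the argument list for launching the server with the given model."""
    return ["python", "stt-server.py", "--model", model]


if __name__ == "__main__":
    print(server_command(pick_model(6)))
```

With a 6 GB budget this selects `medium.en`. Leave some headroom in practice, since the figures are approximate and other processes may also be using the GPU.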
## Compute type
Pass `--compute-type <type>` to control the numeric precision used during inference.
| Type | Notes |
|------|-------|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |
If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
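
To make the documented flags and defaults concrete, here is a minimal `argparse` sketch that mirrors them. It is not the actual argument handling in `stt-server.py`. The function name `build_parser` is hypothetical, and restricting `--compute-type` to the three values in the table is an assumption made for illustration.

```python
import argparse

# Compute types listed in the table above. Limiting the choices to these
# three is an assumption; the real script may accept other values.
COMPUTE_TYPES = ("int8_float16", "float16", "int8")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and their defaults."""
    parser = argparse.ArgumentParser(prog="stt-server.py")
    parser.add_argument(
        "--model",
        default="base.en",  # documented default model
        help="model name, e.g. base.en, small.en, medium.en, large-v3",
    )
    parser.add_argument(
        "--compute-type",
        default="int8_float16",  # documented default precision
        choices=COMPUTE_TYPES,
        help="numeric precision used during inference",
    )
    return parser


if __name__ == "__main__":
    # Example: a CPU-friendly run that keeps the default model.
    args = build_parser().parse_args(["--compute-type", "int8"])
    print(args.model, args.compute_type)
```

Note that `argparse` exposes `--compute-type` as `args.compute_type`, with the hyphen turned into an underscore.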