Model	VRAM	Quality	Notes
`base.en`	~0.5 GB (`float16`) / ~1 GB (`float32`)	Low	Default. Fast, but struggles with similar-sounding consonants (V/B/D).
`small.en`	~1 GB (`float16`) / ~2 GB (`float32`)	Medium	Noticeable improvement over base for most speech.
`medium.en`	~2.5 GB (`float16`) / ~5 GB (`float32`)	Good	Recommended starting point for production use.
`large-v3`	~5 GB (`float16`) / ~10 GB (`float32`)	Best	Multilingual. Highest accuracy; use if VRAM allows.

English-only models (.en suffix) are generally faster and more accurate for English speech than multilingual models of the same size.
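
The VRAM column above can double as a lookup when choosing a model for a given GPU. The following is a minimal sketch, not part of the server: `pick_model`, `BY_QUALITY`, and the `headroom_gb` margin are hypothetical names introduced for illustration, and the figures are the approximate values from the table.

```python
# Approximate VRAM needs in GB, copied from the table above.
VRAM_GB = {
    "base.en": {"float16": 0.5, "float32": 1.0},
    "small.en": {"float16": 1.0, "float32": 2.0},
    "medium.en": {"float16": 2.5, "float32": 5.0},
    "large-v3": {"float16": 5.0, "float32": 10.0},
}

# Models ordered from highest to lowest quality, per the Quality column.
BY_QUALITY = ["large-v3", "medium.en", "small.en", "base.en"]


def pick_model(free_vram_gb, precision="float16", headroom_gb=0.5):
    """Return the highest-quality model that fits, or None if none does.

    headroom_gb is a hypothetical safety margin, not a documented value.
    """
    budget = free_vram_gb - headroom_gb
    for name in BY_QUALITY:
        if VRAM_GB[name][precision] <= budget:
            return name
    return None


print(pick_model(4.0))              # medium.en
print(pick_model(12.0, "float32"))  # large-v3
```

Treat the result as a starting point only: the table's figures are approximate, so real usage may differ.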

Compute type

Pass --compute-type <type> to control the numeric precision used during inference.

Type	Notes
`int8_float16`	Default. Good balance of speed and accuracy on modern GPUs.
`float16`	Slightly better accuracy, higher VRAM usage.
`int8`	CPU-friendly, lower quality.
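
Which compute types are available depends on the device. As a rough sketch, assuming the set of supported types is already known, a fallback through the types listed above might look like the code below. `choose_compute_type`, `supported_on`, and the preference order are hypothetical and not taken from the server; `supported_on` relies on `ctranslate2.get_supported_compute_types` and only works where ctranslate2 is installed.

```python
# Preference order is an assumption: the documented default first, then
# the higher-precision GPU type, then the CPU-friendly type.
PREFERENCE = ["int8_float16", "float16", "int8"]


def choose_compute_type(supported, requested=None):
    """Pick a compute type from the set the device reports as supported."""
    if requested is not None:
        if requested in supported:
            return requested
        raise ValueError(f"compute type {requested!r} is not supported here")
    for candidate in PREFERENCE:
        if candidate in supported:
            return candidate
    raise ValueError("none of the documented compute types is supported")


def supported_on(device):
    """Ask CTranslate2 which compute types a device ("cpu" or "cuda") supports."""
    import ctranslate2  # imported lazily so the rest runs without it

    return set(ctranslate2.get_supported_compute_types(device))


# Example with a hand-written set, as a CPU-only machine might report.
print(choose_compute_type({"int8", "float32"}))  # int8
```

The chosen value would then be passed as --compute-type <type>.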

If you see a CUDA error about mismatched library versions at startup, use setup-venv-local-build.sh to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.

Language and translation

By default, the server auto-detects the spoken language and transcribes the speech in that language.

Argument	Default	Notes
`--language <code>`	none (auto-detect)	Force a specific language, e.g. `--language en` or `--language sv`. Skips auto-detection, which saves time and avoids misidentification.
`--task transcribe`	default	Output text in the spoken language.
`--task translate`		Translate speech to English regardless of source language.

Note

The .en model variants (base.en, small.en, etc.) are English-only: they do not support --task translate or a non-English --language value. For multilingual or translation use cases, use a multilingual model (large-v3, medium).
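
The constraint in this note can be checked before a model is loaded. The sketch below is illustrative only and is not the server's actual argument handling: `--language` and `--task` follow the table above, while the `--model` flag name, `build_parser`, and `check_args` are assumptions made for the example.

```python
import argparse


def build_parser():
    # Hypothetical parser; only --language and --task are documented above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="base.en")  # flag name is assumed
    parser.add_argument("--language", default=None)    # None means auto-detect
    parser.add_argument("--task", choices=["transcribe", "translate"],
                        default="transcribe")
    return parser


def check_args(args):
    """Return a list of problems with the model/language/task combination."""
    problems = []
    if args.model.endswith(".en"):
        if args.task == "translate":
            problems.append(f"{args.model} is English-only and cannot translate")
        if args.language not in (None, "en"):
            problems.append(
                f"{args.model} is English-only; got --language {args.language}"
            )
    return problems


args = build_parser().parse_args(["--model", "small.en", "--language", "sv"])
for problem in check_args(args):
    print(problem)
```

Returning a list rather than raising on the first problem lets a caller report every conflict at once.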

Warning

When a multilingual model runs without --language, English speech is occasionally misdetected as another language. If you only speak English, pass --language en to avoid this.