# Voice to text interface

## Overview

This project aims to provide voice to text using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as backend.


## Origin

This project started as a [vibe-coded](https://en.wikipedia.org/wiki/Vibe_coding) [experiment](https://gitea.efforting.tech/mikael-lovqvists-claude-agent/claude-voice-experiment) but this version is somewhat more hands on.


## Setup

### Setup [venv](https://docs.python.org/3/library/venv.html) for [python](https://www.python.org/)

We will have two different setups here depending on if you want to build ctranslate2 locally or not. This shall be documented.


## Model selection

Pass `--model <name>` to `stt-server.py`. Models are downloaded automatically from HuggingFace on first use.

| Model | VRAM | Quality | Notes |
|-------|------|---------|-------|
| `base.en` | ~0.5 GB (`float16`) / ~1 GB (`float32`) | Low | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| `small.en` | ~1 GB (`float16`) / ~2 GB (`float32`) | Medium | Noticeable improvement over base for most speech. |
| `medium.en` | ~2.5 GB (`float16`) / ~5 GB (`float32`) | Good | Recommended starting point for production use. |
| `large-v3` | ~5 GB (`float16`) / ~10 GB (`float32`) | Best | Highest accuracy, use if VRAM allows. |

English-only models (`.en` suffix) are faster and more accurate than multilingual models for English speech.


## Compute type

Pass `--compute-type <type>` to control the numeric precision used during inference.

| Type | Notes |
|------|-------|
| `int8_float16` | Default. Good balance of speed and accuracy on modern GPUs. |
| `float16` | Slightly better accuracy, higher VRAM usage. |
| `int8` | CPU-friendly, lower quality. |

If you see a CUDA error about mismatched library versions at startup, use `setup-venv-local-build.sh` to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.


## Language and translation

By default the server auto-detects the spoken language and transcribes it as-is.

| Argument | Default | Notes |
|----------|---------|-------|
| `--language <code>` | none (auto-detect) | Force a specific language, e.g. `--language en` or `--language sv`. Speeds up detection and avoids misidentification. |
| `--task transcribe` | default | Output text in the spoken language. |
| `--task translate` | | Translate speech to English regardless of source language. |

> [!NOTE]
> The `.en` model variants (`base.en`, `small.en` etc.) are English-only and do not support `--task translate` or non-English `--language`. Use a multilingual model (`large-v3`, `medium`) for multilingual or translation use cases.

> [!WARNING]
> Omitting `--language` with a multilingual model and English-only speech may cause occasional misdetection. Pass `--language en` to avoid this if you only speak English.