Voice-to-text interface

Overview

This project provides voice-to-text transcription using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as the backend.

Origin

This project started as a vibe-coded experiment, but this version is somewhat more hands-on.

Setup

Set up a Python virtual environment (venv).

There are two setup paths, depending on whether or not you want to build ctranslate2 locally. Both are still to be documented.

Model selection

Pass --model <name> to stt-server.py. Models are downloaded automatically from Hugging Face on first use.
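
As a rough illustration of how a --model flag could reach faster-whisper, here is a minimal sketch. It is not the actual stt-server.py: the function names parse_args, load_model and main are hypothetical, and sample.wav is a placeholder path. Only WhisperModel and its transcribe method come from faster-whisper itself.

```python
import argparse


def parse_args(argv=None):
    """Parse a --model flag; the default mirrors the documented default (base.en)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical stand-in for stt-server.py"
    )
    parser.add_argument(
        "--model",
        default="base.en",
        help="faster-whisper model name, e.g. small.en or large-v3",
    )
    return parser.parse_args(argv)


def load_model(name):
    """Load the named model; it is downloaded on first use if not already cached."""
    # Imported lazily so that argument parsing works without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(name)


def main(argv=None):
    """Transcribe a placeholder file and print each segment with its timestamps."""
    args = parse_args(argv)
    model = load_model(args.model)
    segments, _info = model.transcribe("sample.wav")  # placeholder path
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

Calling main(["--model", "medium.en"]) would load medium.en instead of the default, assuming faster-whisper is installed and sample.wav exists.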

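| Model     | VRAM   | Quality | Notes                                                                  |
|-----------|--------|---------|------------------------------------------------------------------------|
| base.en   | ~1 GB  | Low     | Default. Fast, but struggles with similar-sounding consonants (V/B/D). |
| small.en  | ~2 GB  | Medium  | Noticeable improvement over base for most speech.                      |
| medium.en | ~5 GB  | Good    | Recommended starting point for production use.                         |
| large-v3  | ~10 GB | Best    | Highest accuracy, use if VRAM allows.                                  |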

English-only models (.en suffix) tend to be more accurate on English speech than multilingual models of the same size, especially at the smaller sizes. large-v3 is multilingual only.
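
The model table can also be read as a simple lookup: pick the largest model whose approximate VRAM need fits your GPU. The sketch below encodes the figures listed above. The pick_model helper is hypothetical and not part of this project, and the numbers are the rough estimates from the table, not measurements.

```python
# Approximate VRAM needs in GB, copied from the model table, smallest first.
MODEL_VRAM_GB = [
    ("base.en", 1),
    ("small.en", 2),
    ("medium.en", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb):
    """Return the largest listed model whose approximate VRAM need fits the budget."""
    best = None
    for name, needed_gb in MODEL_VRAM_GB:
        if needed_gb <= available_vram_gb:
            best = name  # later entries are larger, so keep overwriting
    if best is None:
        raise ValueError(
            f"No listed model fits in {available_vram_gb} GB; consider running on CPU"
        )
    return best
```

For example, pick_model(6) returns "medium.en". Leave some headroom in practice, since other processes may already be using the same GPU.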

Compute type

Pass --compute-type <type> to control the numeric precision used during inference.

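| Type         | Notes                                                          |
|--------------|----------------------------------------------------------------|
| int8_float16 | Default. Good balance of speed and accuracy on modern GPUs.    |
| float16      | Slightly better accuracy, higher VRAM usage.                   |
| int8         | CPU-friendly, lower quality.                                   |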
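
Not every device supports every compute type, so a server may want to fall back to one that does. The sketch below shows one possible policy based on the types listed above. The PREFERRED ordering and the choose_compute_type helper are assumptions for illustration, not this project's behaviour. The supported_types helper relies on ctranslate2.get_supported_compute_types, which to my knowledge is available in current CTranslate2 releases.

```python
# Preference order per device, following the compute type notes (hypothetical policy).
PREFERRED = {
    "cuda": ["int8_float16", "float16", "int8"],
    "cpu": ["int8"],
}


def choose_compute_type(device, supported, requested=None):
    """Honour the requested type if the device supports it, else use the first preferred one."""
    if requested is not None and requested in supported:
        return requested
    for candidate in PREFERRED.get(device, []):
        if candidate in supported:
            return candidate
    raise ValueError(f"No suitable compute type for device {device!r}")


def supported_types(device):
    """Ask CTranslate2 which compute types the given device ("cpu" or "cuda") supports."""
    # Imported lazily so choose_compute_type can be used without ctranslate2 installed.
    import ctranslate2

    return ctranslate2.get_supported_compute_types(device)
```

A typical call would be choose_compute_type("cuda", supported_types("cuda"), requested="float16"), with the result passed as the compute_type argument of WhisperModel.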

If you see a CUDA error about mismatched library versions at startup, use setup-venv-local-build.sh to build ctranslate2 against your system CUDA version rather than using the PyPI wheel.
