Features

ASR Transcription provides four working modes: live transcription via microphone, processing of single audio files, sequential batch processing of multiple files, and a status and configuration tab for diagnostics. All processing paths apply optional audio preprocessing, automatic segmentation, voice activity detection, and hallucination cleanup in post-processing.

Use cases

Live transcription of talks and lectures — During the event, microphone audio is sent to the ASR server in configurable time windows. The transcript builds up cumulatively in the browser and can be saved as a file at any time.
Post-hoc transcription of single recordings — Uploaded recordings of interviews, meetings, or research conversations are transcribed in full. Long files are split into segments automatically and reassembled at the end.
Batch transcription of larger collections — Multiple audio files are processed sequentially in one run. This is suitable for research corpora, lecture archives, or the preparation of qualitative datasets.
Subtitle generation for accessible teaching materials — The SRT or VTT response format produces subtitle files directly, ready for use in learning platforms or video editing tools.
Multilingual transcription with technical vocabulary — The language is either specified or detected automatically. The context field allows proper names, acronyms, and technical terms to be passed along so they are written correctly.
Diagnostics and configuration check — Before production use or after a model switch, the connection to the inference server can be tested. Available models, response behavior of both endpoints, and active configuration are bundled into a single tab.

At a glance

Four working modes: Live (microphone streaming), File (single file), Batch (multiple files), Test (diagnostics)
Two selectable ASR models via configuration: Qwen3-ASR-1.7B or Voxtral-Mini-4B-Realtime
Two API methods: classic transcriptions API or chat API with system prompt for better handling of proper names
Audio input as WAV, MP3, FLAC, OGG, M4A, or WEBM; internal processing as 16 kHz mono PCM16
Response formats: JSON, verbose JSON (with per-segment timestamps), plain text, SRT, VTT
Quality control through audio preprocessing, VAD, automatic segmentation, and hallucination cleanup
13 languages (Voxtral) or 52 languages (Qwen3-ASR), each including auto-detection

Input and data sources

Audio enters the application via two paths. The Live tab captures audio chunks from the browser microphone in configurable time windows; alternatively, the File and Batch tabs accept uploaded audio files.

Microphone streaming — The browser audio source delivers numpy chunks at the native sample rate (typically 44.1 or 48 kHz). The application resamples to 16 kHz mono automatically.
File upload — Common audio containers and codecs are supported (WAV, MP3, FLAC, OGG, M4A, WEBM). Decoding is handled internally via librosa and FFmpeg.
Multi-file upload for batch processing — The Batch tab accepts multiple files at once; they are processed one after another with identical settings.
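
The conversion to 16 kHz mono PCM16 can be illustrated with a dependency-free sketch. It is not the application's own code: the function names are hypothetical, and linear interpolation stands in for a band-limited resampler (which a librosa-based pipeline would normally provide). Samples are assumed to be floats in the range -1 to 1.

```python
import struct

TARGET_RATE = 16_000  # Hz, the internal processing rate named in the text


def to_mono(frames):
    """Average the channels of each frame; frames is a list of per-channel tuples."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation (no anti-aliasing filter, illustration only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_pcm16(samples):
    """Clip floats to [-1, 1] and pack them as little-endian signed 16-bit integers."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

One second of 48 kHz microphone audio passed through `resample_linear` yields 16,000 samples, which `to_pcm16` turns into 32,000 bytes.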

Inference connection

The actual speech recognition runs on a separately operated inference server. The application communicates exclusively with this server.

vLLM server (internal) — Provides the Qwen3-ASR or Voxtral models via the /v1/audio/transcriptions and /v1/chat/completions endpoints. Server URL and an optional bearer token are configured through environment variables.
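
As a rough sketch of how the server URL and optional bearer token might be turned into a request, the following prepares a request object without sending it. The variable names `ASR_BASE_URL` and `ASR_API_KEY`, the localhost default, and the helper `build_request` are assumptions for illustration; only the two endpoint paths come from the description above.

```python
import os
import urllib.request

# Hypothetical variable names; the text only says that URL and token
# come from environment variables.
URL_VAR = "ASR_BASE_URL"
TOKEN_VAR = "ASR_API_KEY"

TRANSCRIPTIONS_PATH = "/v1/audio/transcriptions"
CHAT_PATH = "/v1/chat/completions"


def build_request(path, body=None, env=os.environ):
    """Prepare (but do not send) a POST request for one of the two endpoints."""
    base_url = env.get(URL_VAR, "http://localhost:8000").rstrip("/")
    headers = {"Accept": "application/json"}
    token = env.get(TOKEN_VAR, "")
    if token:  # the bearer token is optional
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url + path, data=body, headers=headers, method="POST"
    )
```

Keeping URL construction and authentication in one place means both API methods share the same connection settings.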

Response formats

The server response can be requested in five formats:

JSON — Plain transcription text in the standard field; default for Live and Batch modes.
verbose JSON — Additionally returns the detected language, audio duration, and segment list with timestamps; suitable for downstream processing.
Plain text — Unstructured running text.
SRT and VTT — Subtitle formats with time codes, directly usable in video platforms and editing tools.

The download button writes the chosen format to a file with the matching extension.
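
To show how the timestamped segments of a verbose JSON response relate to the SRT format, here is a minimal sketch. The application receives SRT directly from the server, so this conversion is illustrative only. The segment keys `start`, `end`, and `text`, the format keys in `EXTENSIONS`, and the function names are assumptions.

```python
# Hypothetical mapping from response format to download file extension.
EXTENSIONS = {
    "json": ".json",
    "verbose_json": ".json",
    "text": ".txt",
    "srt": ".srt",
    "vtt": ".vtt",
}


def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SRT uses a comma before the ms)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render a list of segment dicts as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.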

Quality control features

Several mechanisms aim to improve transcription reliability over a direct model call.

Audio preprocessing — Optional and applied before segmentation. Three individually toggleable stages: peak normalization to -1 dB for consistent loudness, spectral gating to remove stationary background noise (fans, air conditioning, room tone), and a high-pass filter at 80 Hz against mains hum and low-frequency interference.
Voice activity detection — Energy-based detection of speech content. Segments below a threshold are skipped before being sent to the model. This saves compute time and avoids hallucinations on pure silence.
Automatic segmentation with overlap — Audio files exceeding the configured segment length (default 15 minutes) are split into chunks with two seconds of overlap. This works around both the length limit and the file size limit of the inference API without losing words at segment boundaries.
Hallucination cleanup — Common ASR artifacts (repetitive character sequences, repeated short phrases) are detected and reduced via regular expressions. If cleanup removes more than 80 % of the original text, the response is treated as fully hallucinated and replaced with a notice.
Context prompt — Each tab provides a hint field for proper names, technical terms, topical context, or style notes. The content is sent with every segment, stabilizing the spelling of names and terms.
API method choice — For demanding proper-name requirements, the classic transcriptions API can be switched for the chat API with system prompt. The latter passes an explicit instruction to the model to return only the transcribed text and to spell given terms exactly.
Progress and error display — Long processing runs are tracked segment by segment in the result field. HTTP errors from the backend are shown with status code and a snippet of the response, instead of aborting processing.
Connection diagnostics — The Test tab checks the reachability of the inference server, lists the registered models, sends a silence test to both endpoints, and summarizes the active configuration.
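
Segmentation with overlap and the energy-based speech check can be sketched as follows. This is an illustration under stated assumptions, not the application's code: each segment is assumed to start two seconds before the previous one ends, energy is measured as RMS over float samples in the range -1 to 1, and the threshold value and function names are hypothetical.

```python
import math


def segment_bounds(total_samples, sample_rate=16_000, segment_s=15 * 60, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping segments covering the audio."""
    seg = int(segment_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if overlap >= seg:
        raise ValueError("overlap must be shorter than the segment length")
    if total_samples <= seg:
        return [(0, total_samples)]  # short audio is sent as a single piece
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + seg, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break
        start = end - overlap  # step back so boundary words appear in both segments
    return bounds


def rms(samples):
    """Root mean square of the samples; 0.0 for an empty sequence."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_speech(samples, threshold=0.01):
    """Energy check: True if the segment's RMS reaches the (hypothetical) threshold."""
    return rms(samples) >= threshold
```

With the defaults, a 20-minute recording at 16 kHz produces two segments: samples 0 to 14,400,000 and 14,368,000 to 19,200,000. A segment for which `has_speech` returns False would be skipped rather than sent to the model.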

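The hallucination cleanup can be approximated with two regular expressions and a removal-ratio check. The patterns, the repetition thresholds, and the notice text below are assumptions chosen for illustration; only the 80 % rule is taken from the description above.

```python
import re

# A non-digit, non-space character repeated 8 or more times in a row.
CHAR_RUN = re.compile(r"([^\d\s])\1{7,}")

# A phrase of one to five words that occurs three or more times in succession,
# optionally separated by punctuation.
PHRASE_RUN = re.compile(
    r"\b(\w+(?:\s+\w+){0,4}?)\b(?:[\s,.!?;]+\1\b){2,}",
    re.IGNORECASE,
)

NOTICE = "[no usable speech detected]"  # hypothetical wording


def clean_transcript(text, max_removed=0.8):
    """Collapse repetition artifacts; return NOTICE if too much text was removed."""
    original = text.strip()
    if not original:
        return original
    cleaned = CHAR_RUN.sub(r"\1", original)    # keep one character of each run
    cleaned = PHRASE_RUN.sub(r"\1", cleaned)   # keep one copy of each repeated phrase
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    removed = 1.0 - len(cleaned) / len(original)
    return NOTICE if removed > max_removed else cleaned
```

A transcript consisting only of a looping phrase shrinks by far more than 80 % and is replaced by the notice, while a single repeated phrase inside otherwise normal text is reduced to one occurrence.
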
Configuration

Key parameters are controlled via environment variables or a .env file. They cover the backend URL and API key; the model type and model identifier; default values for temperature, response format, and the live processing window; the maximum segment length; and the bind address, port, and reverse-proxy path of the interface.
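
One way to collect such parameters is a small settings object read from the environment. Every variable name and most default values in this sketch are hypothetical; only the 15-minute segment default and JSON as a default response format are taken from the text above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    backend_url: str
    api_key: str
    model_type: str
    model_id: str
    temperature: float
    response_format: str
    live_window_s: float
    max_segment_min: float
    host: str
    port: int
    root_path: str


def load_settings(env=os.environ):
    """Read settings from a mapping of environment variables (names are assumptions)."""
    return Settings(
        backend_url=env.get("ASR_BASE_URL", "http://localhost:8000"),
        api_key=env.get("ASR_API_KEY", ""),
        model_type=env.get("ASR_MODEL_TYPE", "qwen3"),
        model_id=env.get("ASR_MODEL_ID", "Qwen3-ASR-1.7B"),
        temperature=float(env.get("ASR_TEMPERATURE", "0.0")),
        response_format=env.get("ASR_RESPONSE_FORMAT", "json"),
        live_window_s=float(env.get("ASR_LIVE_WINDOW_S", "5")),
        max_segment_min=float(env.get("ASR_MAX_SEGMENT_MIN", "15")),
        host=env.get("ASR_HOST", "127.0.0.1"),
        port=int(env.get("ASR_PORT", "7860")),
        root_path=env.get("ASR_ROOT_PATH", ""),
    )
```

Passing the mapping as an argument keeps the loader testable; values from a .env file would need to be placed into the environment before `load_settings` is called.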