ASR Transcription

ASR Transcription is a web interface for speech recognition in a higher-education context. Audio is captured through the browser microphone or uploaded as a file and forwarded to an ASR server (vLLM with an OpenAI-compatible API); the transcript appears in the browser. The application combines a configurable ASR model with multi-stage audio preprocessing, automatic segmentation of long recordings, voice activity detection, and hallucination cleanup in post-processing.

At a glance

Transcribe lectures and talks live — the transcript builds up in the browser as recording continues
Upload recorded interviews, meetings, or talk recordings as files and have them fully transcribed
Process several audio files sequentially in a single run (batch)
Generate subtitle files in SRT or VTT format for accessible teaching and research materials
Provide proper names, technical terms, or topical context through a hint field so the model spells them correctly
Automatically clean up recordings with background noise or low-frequency hum before transcription
Choose from 13 or 52 languages, depending on the active ASR model, or rely on automatic language detection
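The subtitle export mentioned above comes down to rendering timed text in two closely related formats. The following is a minimal sketch, not the application's actual exporter: the function names and the `(start_seconds, end_seconds, text)` cue shape are assumptions made for illustration. It shows the main difference between the formats: SRT numbers each cue and uses a comma before the milliseconds, while VTT starts with a `WEBVTT` header and uses a period.

```python
def format_timestamp(seconds: float, decimal_mark: str) -> str:
    """Format seconds as HH:MM:SS<mark>mmm (mark is ',' for SRT, '.' for VTT)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{millis:03d}"


def to_srt(cues) -> str:
    """Render (start_seconds, end_seconds, text) tuples as an SRT document."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{number}\n{timing}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues) -> str:
    """Render the same cue tuples as a WebVTT document (cue numbers omitted)."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"
```

Calling `to_srt([(0.0, 2.5, "Hello")])` yields a single numbered cue running from `00:00:00,000` to `00:00:02,500`. How the application derives cue boundaries from the transcript is not covered by this sketch.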

Highlights

In contrast to a direct call against a transcription API or a simple script, ASR Transcription bundles several processing stages that together produce more stable and more usable results.

Selectable ASR model: Configuration switches between Qwen3-ASR-1.7B (52 languages) and Voxtral-Mini-4B-Realtime-2602 (13 languages). The interface adapts the language list automatically to the active model.
Two API methods: Alongside the classic transcriptions API, a chat API with a system prompt is available. The latter passes proper names and technical terms as an explicit instruction to the model, leading to more reliable spellings.
Multi-stage audio preprocessing: Peak normalization, spectral noise reduction, and an 80 Hz high-pass filter can be enabled before the audio is segmented and sent. This improves recognition quality on non-ideal recordings.
Automatic segmentation with silence detection: Recordings exceeding the inference API length limit are split into overlapping segments. Segments without speech content are skipped via energy-based voice activity detection, which reduces processing time and hallucinations.
Hallucination cleanup in post-processing: Common ASR artifacts (repetitive character sequences, repeated short phrases) are detected and reduced via pattern matching. Responses identified as fully hallucinated are replaced with a notice.
Context prompt for proper names and terminology: An optional text field carries names, terms, or topical hints. The context accompanies every segment, including those produced by automatic splitting of long files.
Connection to internal inference infrastructure: Transcription runs against a separately operated vLLM server. The application itself does not transmit audio to external providers; the server URL and credentials are supplied through configuration.
Privacy-aware setup: No external fonts or CDN resources, interface telemetry disabled, no third-party tracking.
Containerized deployment: Delivered as a Docker image. Audio system libraries (FFmpeg, libsndfile) are included, the process runs as a non-privileged user, and a health check monitors the interface.
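The automatic segmentation with silence detection described in the list above can be illustrated with a small sketch. It is a simplified stand-in, not the application's implementation: the function names, the RMS threshold of 0.01, and the use of plain Python lists of float samples are all assumptions. It splits a recording into overlapping sample ranges and keeps only the ranges whose energy reaches the threshold.

```python
import math


def split_with_overlap(n_samples: int, segment_len: int, overlap: int):
    """Return (start, end) sample ranges covering n_samples, each overlapping the next."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be >= 0 and smaller than segment_len")
    step = segment_len - overlap
    ranges = []
    start = 0
    while start < n_samples:
        end = min(start + segment_len, n_samples)
        ranges.append((start, end))
        if end == n_samples:
            break  # the last range reached the end of the recording
        start += step
    return ranges


def rms(samples) -> float:
    """Root-mean-square energy of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_segments(samples, segment_len: int, overlap: int, threshold: float = 0.01):
    """Keep only the ranges whose RMS energy reaches the (hypothetical) threshold."""
    return [
        (start, end)
        for start, end in split_with_overlap(len(samples), segment_len, overlap)
        if rms(samples[start:end]) >= threshold
    ]
```

For example, `split_with_overlap(10, 4, 1)` returns `[(0, 4), (3, 7), (6, 10)]`, so each segment shares one sample with its neighbor. A production detector would more likely measure energy over short frames within each segment rather than over the whole segment at once.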
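The pattern-based hallucination cleanup described in the list above might look roughly like the following sketch. The regular expressions, the ratio used to decide that a response is fully hallucinated, and the notice text are hypothetical values chosen for illustration; the application's actual patterns and thresholds are not documented here.

```python
import re

# Hypothetical values; the real notice text and thresholds are not documented.
NOTICE = "[no reliable speech recognized]"
FULL_HALLUCINATION_RATIO = 0.2

# One non-space character repeated six or more times in a row, e.g. "oooooooo".
CHAR_RUN = re.compile(r"(\S)\1{5,}")
# A phrase of one to four words that occurs three or more times back to back.
PHRASE_RUN = re.compile(
    r"\b(\w+(?:\s+\w+){0,3}?)(?:[\s,.!?]+\1\b){2,}", re.IGNORECASE
)


def clean_transcript(text: str) -> str:
    """Collapse repeated characters and phrases; return a notice if little remains."""
    original = text.strip()
    if not original:
        return original
    cleaned = CHAR_RUN.sub(r"\1", original)   # "Soooooooo" becomes "So"
    cleaned = PHRASE_RUN.sub(r"\1", cleaned)  # keep a single copy of the phrase
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    # If cleanup removed most of the text, treat the response as hallucinated.
    if len(cleaned) < FULL_HALLUCINATION_RATIO * len(original):
        return NOTICE
    return cleaned
```

With these assumed settings, a response consisting only of "ha" repeated fifty times is replaced by the notice, while ordinary sentences pass through unchanged. The thresholds trade off missed artifacts against accidental removal of legitimate repetition, so they would need tuning against real transcripts.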