Architecture¶
ASR Transcription is designed as a self-contained application container that acts as an intermediary between browser and inference server. Audio is preprocessed, segmented, and quality-cleaned locally within the application; the actual speech recognition runs on a separately operated vLLM server. This separation allows the frontend and the model to be operated independently and the ASR model to be switched through a configuration change.
At a glance¶
- Three-layer model: browser (interface) → application (audio pipeline and orchestration) → vLLM (inference)
- Stateless toward the backend; the only application-side state is the live buffer per session
- Streaming in live mode via Gradio stream events; file and batch modes are request-based
- Audio pipeline: resampling to 16 kHz mono, optional preprocessing, segmentation with VAD, inference call, hallucination cleanup
- Two usable backend endpoints:
/v1/audio/transcriptions(default) and/v1/chat/completions(with system prompt) - Configuration entirely through environment variables and
.envfile - Containerized delivery with non-privileged runtime user and built-in health check
Architecture description¶
The application is structured into a frontend layer (Gradio 6 with four tabs for Live, File, Batch, Test), a processing layer (audio helpers, preprocessing, segmentation, dispatcher, post-processing), and an external inference layer (vLLM server). The frontend is rendered by the application process itself; the browser communicates with the application via Gradio's own protocols (HTTP requests and stream events). The processing layer converts incoming audio to the format expected by the ASR server (PCM16, 16 kHz, mono), applies preprocessing on demand, segments long recordings, forwards individual segments to the vLLM server via HTTP, and cleans up the response.
flowchart LR
subgraph Client["Browser"]
Mic[Microphone stream]
Upload[File upload]
Out[Transcript / Download]
end
subgraph App["ASR Transcription container"]
UI[Gradio interface<br/>Live · File · Batch · Test]
Buf[Session buffer<br/>live mode]
Pre[Audio preprocessing<br/>normalization · denoise · high-pass]
Seg[Segmentation with<br/>VAD silence detection]
Disp{API dispatcher}
Post[Hallucination<br/>cleanup]
end
subgraph Backend["Inference backend"]
vLLM[vLLM server<br/>OpenAI-compatible API]
end
Mic --> UI
Upload --> UI
UI --> Buf
UI --> Pre
Buf --> Disp
Pre --> Seg
Seg --> Disp
Disp -->|transcriptions API| vLLM
Disp -->|chat API| vLLM
vLLM --> Post
Post --> UI
UI --> Out
Workflow¶
In live mode, the browser sends numpy audio chunks to the application at sub-second intervals. A buffer is held per session into which the chunks are written; once the accumulated duration reaches the configured processing window (default 3 seconds), the buffer is drained, converted to WAV (16 kHz, mono, PCM16), and sent to the inference server. The response is run through hallucination cleanup and appended to the transcript field. When recording stops, any remaining buffer is processed once more and the session state is discarded.
In file mode, the uploaded file is first preprocessed if enabled (high-pass filter, spectral denoising, peak normalization) and then checked against the maximum segment length. Shorter files are sent to the server unchanged; longer files are cut into overlapping FLAC segments, each subjected to VAD (silent segments are skipped), and transcribed in sequence. Partial results are timestamped and merged in the result field; the field updates after each segment, keeping progress visible.
The batch mode applies the same processing path to multiple files sequentially. Each file is preprocessed, segmented, transcribed, and added to the overall output as its own result block.
The Test tab retrieves the model list from the inference server, sends a brief silence test to both the transcriptions and chat APIs, and reports status codes and response excerpts in a status block; the active configuration values are appended.
API methods¶
The dispatcher decides, based on a selection in the interface, whether a segment is sent to the server via /v1/audio/transcriptions or via /v1/chat/completions. The transcriptions variant submits audio as a multipart upload and passes language, temperature, and an optional prompt as form fields. The chat variant encodes the audio as a base64 data URI inside a chat message object and attaches a system prompt instructing the model to return only the transcribed text and to write supplied proper names and terms exactly. For the chat variant, the response is additionally checked for an ASR-typical wrapper (<asr_text>…) and stripped accordingly.
Audio pipeline¶
The audio pipeline targets the format expected by the model (16 kHz, mono, PCM16). On input, the browser audio array is normalized to the [-1, 1] range, mixed down if needed, and resampled. Optional preprocessing first applies a Butterworth high-pass at 80 Hz, then spectral gating against stationary noise, and finally peak normalization to -1 dB. Segmentation uses energy-based VAD (RMS per 25 ms frame with configured threshold and minimum speech ratio) and produces FLAC segments with two seconds of overlap.
Post-processing¶
Server responses are run through hallucination cleanup before display. Two patterns are recognized: repetitions of short character sequences, as occurring with CJK hallucinations, and repeated short phrases. Both are reduced to the first match plus an ellipsis. If cleanup removes more than 80 % of the original text, the response is considered fully hallucinated and replaced with a notice.
Concurrency and robustness¶
Live buffers are kept in a central dictionary per session and protected by locks; on session termination, they are released through a cleanup callback. At application startup, a warm-up request is sent in parallel to the chat API so the first production request is not delayed by initialization of the inference handler. Errors in individual segments do not interrupt processing: HTTP status code and a snippet of the response are added to the result, and processing continues.
Configuration and deployment¶
The application is delivered as a Docker image based on python:3.11-slim. The image bundles the audio system libraries (libsndfile, FFmpeg) and the Python dependencies. The process runs as a non-privileged user; a health check verifies reachability of the Gradio port. All runtime parameters — backend URL, API key, model type and model ID, defaults for temperature, response format, and processing window, maximum segment length, server bind, and reverse-proxy path — are set via environment variables or a .env file.
Technology overview¶
- Frontend — Gradio 6 (streaming audio component, tabs, state management, theme without external fonts)
- Backend API — vLLM ≥ 0.17.0 with OpenAI-compatible audio API (
/v1/audio/transcriptions) and chat API with audio input - ASR models — Qwen/Qwen3-ASR-1.7B or mistralai/Voxtral-Mini-4B-Realtime-2602, selected via configuration
- Audio processing — librosa (decoding, resampling, RMS, preemphasis), soundfile (WAV/FLAC), noisereduce (spectral gating), scipy (Butterworth high-pass)
- HTTP — requests with optional bearer auth
- Configuration — python-dotenv, environment variables
- Containerization — Docker (python:3.11-slim, FFmpeg, libsndfile, non-privileged user, health check)