Features

TTS-Suite organizes the workflow into five sequential steps, each shown as a tab: script creation, voice configuration, generation, review, and export. The functional scope covers connectors to multiple TTS and ASR models, centralized emotion management, automated quality control, and a configurable audio post-processing pipeline.

Use cases

Research podcast — A scientific manuscript becomes a multi-voice podcast with two or three hosts. The LLM produces a natural conversational tone with thematic transitions; speaker profiles are assembled from preset voices or cloned voices of the participants.
Expert interview — A prepared topic outline becomes a structured question-and-answer format with a moderator and a guest. Real background information about the participants is entered in a free-text field and passed verbatim into the prompt, which reduces the risk of hallucinated details.
Audiobook production — Longer texts are voiced with a narrator and character voices. Per speaker profile, emotion variants can be stored as reference audio, so dramatic, calm, and neutral passages receive different tonal colors.
Tutorial video voiceover — Scripts for e-learning modules are synthesized as a calm off-screen voice. The format automatically enforces a constant neutral tone, extends pauses between sentences, and applies the dedicated tutorial-video audio preset (–19 LUFS, focused on speech intelligibility).
Audio lecture or keynote — A lecture manuscript becomes a structured single-speaker presentation with introduction, main section, and summary, suitable for use as a podcast episode or audio version of a publication.
Accessible study materials — Texts from teaching materials, scripts, or assignments are converted into an audio version. The ASR-based quality check verifies that the spoken audio matches the original text, which matters for students with visual impairments or dyslexia.

At a glance

Connectors to three TTS model families with seven operating modes, plus one ASR model and one OpenAI-compatible LLM server
Six preconfigured dialog formats with format-specific prompt templates and pipeline adjustments
Seven audio presets ranging from raw cut to broadcast EBU R128
Automated quality assurance via ASR round-trip with configurable WER threshold
Persistent voice and script library with management functions
Import: raw text, plus reference audio in WAV, MP3, OGG, FLAC, or M4A (browser- and device-specific formats such as WebM, CAF, and WMA are converted automatically)
Export: WAV, MP3, OGG

TTS backends

TTS-Suite integrates three model families with complementary strengths. All of them are accessed through the OpenAI-compatible API of the vLLM-Omni inference servers.

Qwen3-TTS CustomVoice — Nine pretrained voices (including Vivian, Serena, Ryan) for ten languages. Emotion control via natural-language instructions in the instructions field. Suited for ready-to-use, consistent voices.
Qwen3-TTS VoiceDesign — Generates a voice from a pure text description ("warm female voice with a calming tone"). No reference audio required.
Qwen3-TTS Base (Clone) — Voice cloning from 10–30 seconds of reference audio. A separate reference clip is stored per emotion; the model adopts the prosody of the clip selected for that emotion.
Fish Speech S2 Pro — Voice cloning with fine-grained inline tag control. Language is detected automatically from the input text (80+ languages). Tags such as [whisper] or [laughing] can be placed anywhere in the script.
Voxtral 4B TTS — Twenty pretrained voices for nine languages, or voice cloning from as little as three seconds of reference audio. Emotion control via the instructions field.
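
As a rough illustration of how one request shape can serve the instruction-driven backends, the sketch below assembles a JSON body in the style of the OpenAI speech API. The field names `model`, `input`, `voice`, and `instructions` follow that API; whether the vLLM-Omni servers accept exactly these fields, and the model identifier used here, are assumptions rather than documented behavior.

```python
import json


def build_speech_request(text, backend, voice=None, instructions=None):
    """Assemble a JSON body for a hypothetical OpenAI-style speech endpoint."""
    payload = {"model": backend, "input": text}
    if voice is not None:
        payload["voice"] = voice  # preset voice name, e.g. "Vivian"
    if instructions is not None:
        payload["instructions"] = instructions  # natural-language emotion hint
    return payload


body = build_speech_request(
    "Welcome to the show.",
    backend="qwen3-tts-customvoice",  # hypothetical model identifier
    voice="Vivian",
    instructions="Speak in a warm, friendly tone.",
)
print(json.dumps(body, indent=2))
```

Optional fields are left out entirely when they are not set, so the same helper could serve a preset voice with or without an emotion instruction.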

Script generation and dialog formats

Six dialog formats, each with its own LLM prompt templates, speaker count constraints, and downstream pipeline adjustments:

Podcast — Two or three hosts with distinguishable personalities, informal conversational tone, mutual address by name.
Interview — Moderator and guest in a structured question-and-answer format.
Discussion — Three to five participants with a moderator and divergent positions.
Audiobook — Narrator with optional character voices for dialog passages.
Lecture — Single speaker with introduction, main section, summary, and an objective, trustworthy tone.
Tutorial video — Neutral off-screen voice with short, clearly separated sentences, extended pauses, and consistently neutral emotion.

The optional intro/outro generation tailors greeting and closing to the chosen format. A separate text field holds real speaker background information; its content is inserted verbatim into the prompt so that the LLM has no need to invent titles or institutions.
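
The speaker count constraints and the verbatim background field could be modeled roughly as follows. This is a sketch under assumptions: the table reads "three to five participants" as the total for a discussion, leaves the audiobook without an upper bound, and treats the tutorial video as single-speaker; the prompt wording and all function names are hypothetical.

```python
# (min, max) speakers per format; None means no upper bound is stated.
FORMAT_SPEAKERS = {
    "podcast": (2, 3),
    "interview": (2, 2),
    "discussion": (3, 5),      # assumption: moderator counted among the 3 to 5
    "audiobook": (1, None),    # narrator plus optional character voices
    "lecture": (1, 1),
    "tutorial_video": (1, 1),  # assumption: one off-screen voice
}


def check_speaker_count(fmt, count):
    """Return True if the speaker count fits the chosen format."""
    low, high = FORMAT_SPEAKERS[fmt]
    return count >= low and (high is None or count <= high)


def build_prompt(fmt, topic, background=""):
    """Build a script prompt; background text is passed through unchanged."""
    parts = [f"Write a {fmt} script about: {topic}"]
    if background.strip():
        parts.append("Use only these speaker facts; do not invent titles or institutions:")
        parts.append(background)  # inserted verbatim, no paraphrasing
    return "\n".join(parts)


print(check_speaker_count("podcast", 4))
print(build_prompt("interview", "soil microbiomes", "Dr. A. Example, guest"))
```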

Voice library

Speaker profiles are stored persistently and remain available across projects. A profile contains the name, backend type, mode (preset, voice design, or clone), and, depending on the mode, either a preset name or a set of emotion variants with reference audio.

Live preview: each profile can be auditioned with a sample sentence in the chosen language before production use.
Management: individual profiles, individual emotion variants, or entire projects can be deleted in isolation without affecting other data.
Clone profiles list their reference recordings with file name and emotion label.
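
One way to picture such a profile is a small record type. The sketch below is illustrative only: the class and field names are hypothetical, and the fallback to the neutral clip when an emotion has no reference recording is an assumption, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EmotionVariant:
    emotion: str          # emotion label shown in the library
    reference_audio: str  # file name of the reference recording


@dataclass
class SpeakerProfile:
    name: str
    backend: str
    mode: str  # "preset", "voice_design", or "clone"
    preset_name: Optional[str] = None
    variants: List[EmotionVariant] = field(default_factory=list)

    def reference_for(self, emotion, fallback="neutral"):
        """Pick the clip for an emotion, else the fallback clip (assumed)."""
        by_emotion = {v.emotion: v.reference_audio for v in self.variants}
        return by_emotion.get(emotion, by_emotion.get(fallback))

    def remove_variant(self, emotion):
        """Delete one emotion variant without touching the others."""
        self.variants = [v for v in self.variants if v.emotion != emotion]


profile = SpeakerProfile(
    name="Host A",
    backend="qwen3-tts-base",  # hypothetical identifier
    mode="clone",
    variants=[
        EmotionVariant("neutral", "host_a_neutral.wav"),
        EmotionVariant("excited", "host_a_excited.wav"),
    ],
)
print(profile.reference_for("excited"))
```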

Emotion control

Seven base emotions (neutral, friendly, thoughtful, excited, serious, questioning, humorous) form a backend-agnostic vocabulary. Related emotions (e.g. warm-hearted, enthusiastic, skeptical) are mapped to one of the base emotions via fuzzy matching. Depending on the backend, the emotion is realized through one of three pathways:

Qwen3 / Voxtral: natural-language instruction in the instructions API field
Fish Speech: inline tag in the spoken text ([warm and friendly tone], etc.)
Clone modes: selection of the matching reference audio clip per emotion
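
The mapping and the three pathways can be sketched as below. The seven base emotions come from the list above; the alias targets (for example skeptical to questioning), the similarity cutoff, the fallback to neutral, and the returned dictionary keys are all assumptions made for illustration. String similarity alone cannot relate "warm-hearted" to "friendly", so this sketch pairs an alias table with `difflib` for near-spellings.

```python
import difflib

BASE_EMOTIONS = ["neutral", "friendly", "thoughtful", "excited",
                 "serious", "questioning", "humorous"]

# Assumed alias targets; the actual mapping table is not documented here.
ALIASES = {"warm-hearted": "friendly", "enthusiastic": "excited",
           "skeptical": "questioning"}


def to_base_emotion(label):
    """Map a free-form emotion label to one of the seven base emotions."""
    key = label.strip().lower()
    if key in BASE_EMOTIONS:
        return key
    candidates = BASE_EMOTIONS + list(ALIASES)
    match = difflib.get_close_matches(key, candidates, n=1, cutoff=0.75)
    if not match:
        return "neutral"  # assumed fallback for unknown labels
    return ALIASES.get(match[0], match[0])


def route_emotion(backend_kind, emotion, text):
    """Express the emotion in the form the backend understands."""
    base = to_base_emotion(emotion)
    if backend_kind == "instruction":  # Qwen3 / Voxtral
        return {"input": text, "instructions": f"Speak in a {base} tone."}
    if backend_kind == "inline_tag":   # Fish Speech
        return {"input": f"[{base} tone] {text}"}
    if backend_kind == "clone":        # clone modes
        return {"input": text, "reference_emotion": base}
    raise ValueError(f"unknown backend kind: {backend_kind}")


print(route_emotion("inline_tag", "Enthusiastic", "We have news."))
```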

Manually placed Fish Speech tags in the script are preserved and combined with the automatic tags; when switching to a different backend, they are removed from the text so they are not read out literally.
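
Removing inline tags before sending text to a backend that does not understand them might look like the following. The sketch assumes that every square-bracketed span is a tag, which would also strip legitimate bracketed text; the function name is hypothetical.

```python
import re

# Assumption: anything in square brackets is an inline tag.
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def strip_inline_tags(text):
    """Remove bracketed tags and tidy the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    # Drop spaces that now sit directly before punctuation.
    tidy = re.sub(r"\s+([,.;:!?])", r"\1", without_tags)
    # Collapse runs of whitespace left by removed tags.
    return re.sub(r"\s{2,}", " ", tidy).strip()


print(strip_inline_tags("[whisper] Are you sure? [laughing] Of course."))
```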

Quality assurance

After generation, each segment is automatically transcribed back via Qwen3-ASR and compared with the original script text. The result is a word error rate (WER) per segment.

Configurable WER threshold per project (default: 10%)
Automatic flagging of suspicious segments in the review tab
Single and batch regeneration with alternative emotion or edited text
Display of the ASR transcript next to the original for direct comparison
Format-specific stabilization: in the tutorial video format, the emotion router overrides individual emotions with neutral, so all segments are generated in a uniform tone
Anti-hallucination logic in the LLM prompts, including a separate field for real speaker background information
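
The per-segment check can be illustrated with a plain word-level edit distance. This is a minimal sketch: the tokenization (lowercasing, punctuation dropped) and the strict "greater than" comparison against the threshold are assumptions, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def flag_segments(segments, threshold=0.10):
    """Return indices of (script, transcript) pairs above the threshold."""
    return [i for i, (script, transcript) in enumerate(segments)
            if word_error_rate(script, transcript) > threshold]


segments = [
    ("The quick brown fox.", "the quick brown fox"),
    ("The quick brown fox.", "the quick brown box"),
]
print(flag_segments(segments))
```

In this example the second segment has one substituted word out of four, a WER of 0.25, so it would be flagged at the default threshold.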

Audio pipeline and export

The generated segments pass through a multi-stage pipeline:

Segment RMS normalization to a common level before stitching
Sample-rate harmonization across backends with different output rates (24 kHz, 44.1 kHz)
Crossfading and format-dependent pauses between segments (same speaker, speaker change, paragraph end)
Highpass filter, compressor with attack/release, LUFS normalization, true-peak limiter
Steady-state initialization of the IIR filters to avoid audible settling artifacts at the start of the file
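
Three of these stages are simple enough to sketch in plain Python on lists of float samples. The target level, the linear crossfade shape, and the first-order high-pass used here are assumptions chosen for brevity, not the pipeline's actual filters. The last function shows the idea behind steady-state initialization: the filter state is set as if the first sample had been playing forever, so a DC offset does not produce a step transient at the start.

```python
import math


def rms(samples):
    """Root-mean-square level of a sample list."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1):
    """Scale a segment so its RMS level matches the target."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]


def crossfade(a, b, overlap):
    """Join two segments with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming segment
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return list(a[:len(a) - overlap]) + mixed + list(b[overlap:])


def highpass(samples, alpha=0.995, steady_state=True):
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    if not samples:
        return []
    # Steady state for a constant input x[0] is x[-1] = x[0], y[-1] = 0.
    prev_x = samples[0] if steady_state else 0.0
    prev_y = 0.0
    out = []
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


print(highpass([0.5] * 4))
print(highpass([0.5] * 4, steady_state=False))
```

For general IIR filters, SciPy provides `scipy.signal.lfilter_zi` and `scipy.signal.sosfilt_zi` to compute such initial states; whether TTS-Suite uses them is not stated.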

Seven audio presets are available: raw, podcast (–16 LUFS), radio (–14 LUFS), audiobook (–18 LUFS), broadcast EBU R128 (–23 LUFS), sonor (warm radio voice with presence EQ), and tutorial video (–19 LUFS). Export is available as WAV, MP3, or OGG.
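
The loudness targets above can be captured in a small lookup table. The sketch below only computes the static gain needed to reach a target from an already measured integrated loudness; measuring LUFS itself requires a K-weighted meter as defined in ITU-R BS.1770 and is not shown. The key names are hypothetical, and `None` marks presets for which no target is stated.

```python
# Integrated loudness targets in LUFS, taken from the preset list.
PRESET_TARGET_LUFS = {
    "raw": None,                 # raw cut, no loudness target stated
    "podcast": -16.0,
    "radio": -14.0,
    "audiobook": -18.0,
    "broadcast_ebu_r128": -23.0,
    "sonor": None,               # target not stated
    "tutorial_video": -19.0,
}


def gain_db_for(preset, measured_lufs):
    """Gain in dB that moves the measured loudness to the preset target."""
    target = PRESET_TARGET_LUFS[preset]
    if target is None:
        return 0.0
    return target - measured_lufs


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)


print(gain_db_for("podcast", -20.0))
```

A file measured at –20 LUFS would need +4 dB of gain to reach the podcast target; a true-peak limiter would still be required afterwards to catch any resulting overshoot.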

Security and robustness

All SQL queries are parameterized, which guards against SQL injection
Audio uploads are validated for size, format, and WAV header
Filename sanitization and path traversal protection for user-generated paths
Length limits for text input and segment count
Robust handling of inconsistent audio formats from different browsers and recording devices (WebM, CAF, WMA, etc.) with automatic conversion to WAV
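
Several of these measures follow well-known patterns, sketched here with the Python standard library. The table and column names, the allowed character set, the length limit, and the function names are hypothetical; the header check only inspects the RIFF/WAVE magic bytes and is not a full validation.

```python
import re
import sqlite3
from pathlib import Path


def sanitize_filename(name, max_length=100):
    """Keep only the last path component and a conservative character set."""
    base = name.replace("\\", "/").split("/")[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned[:max_length] or "upload"


def safe_join(root, filename):
    """Resolve a path under root and refuse anything that escapes it."""
    root_path = Path(root).resolve()
    candidate = (root_path / sanitize_filename(filename)).resolve()
    if root_path not in candidate.parents:
        raise ValueError("path escapes upload directory")
    return candidate


def looks_like_wav(data):
    """Check the RIFF/WAVE magic bytes at the start of an upload."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def find_profiles(conn, name):
    """Parameterized query: the name is bound, never spliced into the SQL."""
    query = "SELECT id, name FROM profiles WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO profiles (name) VALUES (?)", ("Host A",))
print(find_profiles(conn, "Host A"))
print(find_profiles(conn, "x' OR '1'='1"))
print(sanitize_filename("../../etc/passwd"))
```

Because the injection attempt is bound as a plain string value, it matches no row instead of altering the query.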