TTS-Suite

TTS-Suite is a web-based platform for producing multi-speaker audio content from text. It combines three heterogeneous TTS model families (seven operating modes in total), an ASR model for automated quality assurance, and an LLM for script generation into one continuous production chain. Processing runs as a multi-stage, segment-wise pipeline with centralized emotion control and studio-grade audio post-processing.

At a glance

  • Produce finished audio output from raw text — from LLM-generated script to export-ready file in a single tool
  • Create speaker profiles from preset voices, text descriptions, or your own reference audio, with emotion variants per profile
  • Produce audio content for research podcasts, expert interviews, moderated panel discussions, audiobooks, audio lectures, tutorial videos, and accessible study materials
  • Automatically check generated segments for fidelity to the original text via ASR, and selectively regenerate only problematic segments
  • Post-process audio to industry-standard loudness targets (LUFS, EBU R128) for different distribution channels
  • Maintain a persistent voice library and reuse it across projects

Highlights

In contrast to a direct call to a single TTS model or a thin script wrapping an API, TTS-Suite provides a complete production chain. The following differentiators shape output quality:

  • Three TTS model families behind one abstraction — Qwen3-TTS (Alibaba), Voxtral 4B (Mistral AI), and Fish Speech S2 Pro are integrated through a single router. Seven operating modes (preset voices, voice design from text description, voice cloning) are available without exposing the differences between the backend APIs.
  • Connectors to seven external endpoints — vLLM-Omni instances for the five TTS modes and the ASR model, plus an OpenAI-compatible LLM server (Ollama, vLLM, OpenAI API) for script generation.
  • Unified emotion control across two paradigms — An emotion router translates an abstract emotion (neutral, friendly, excited, etc.) into either a natural-language instruction (Qwen3, Voxtral) or an inline tag in the spoken text (Fish Speech), depending on the backend. Fuzzy mapping recognizes synonyms; a three-stage fallback (direct → mapped → default) guarantees a runnable routing decision.
  • ASR-based quality assurance — Each generated segment is transcribed back via Qwen3-ASR and compared with the original. The word error rate (WER) flags suspicious segments, which can be regenerated individually or in batches with an alternative emotion. The WER threshold is configurable per project.
  • LLM-based script generation with format rules — Six predefined dialog formats (podcast, interview, discussion, audiobook, lecture, tutorial video) ship with their own prompt templates, speaker count constraints, and downstream pipeline adjustments. Anti-hallucination rules and a separate field for verbatim insertion of speaker background information prevent fabricated biographical details.
  • Consistent tone across segment-wise synthesis — Longer audio sequences sound consistent across segment boundaries, because the LLM bundles related sentences into larger segments, the emotion router enforces neutral mode for the tutorial video format, and backend-specific stability tags are injected.
  • Studio-grade audio post-processing — A configurable DSP chain comprising highpass, compressor, LUFS normalization, and true-peak limiting, with seven presets (raw, podcast, radio, audiobook, broadcast EBU R128, sonor, tutorial video). IIR filters use steady-state initialization, so no audible startup transients appear at the beginning of the file.
  • Robust level and sample-rate harmonization — Before stitching, segments are normalized to a common RMS level, and sample rates from different backends (24 kHz Qwen3/Voxtral, 44.1 kHz Fish Speech) are aligned with each other.
  • Guided workflow with shared session state — Script ID, project ID, and WER threshold flow automatically between the five tabs. No manual copying of IDs or JSON structures is required.
  • Persistent voice and script library — Voice profiles including reference audio and scripts reside in an SQLite database (WAL mode, thread-safe) and are available across projects.
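
The emotion routing described above (direct → mapped → default, with a backend-dependent output form) could look roughly like the following sketch. All names, the synonym table, the instruction wording, and the inline tag syntax are illustrative assumptions; they are not taken from the TTS-Suite code base or from the backends' documentation, and fuzzy matching is reduced to a plain lookup table here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tables; the real suite may use different names and values.
INSTRUCTION_BACKENDS = {"qwen3", "voxtral"}   # emotion passed as an instruction
TAG_BACKENDS = {"fish_speech"}                # emotion passed as an inline tag

KNOWN_EMOTIONS = {"neutral", "friendly", "excited"}
SYNONYMS = {"happy": "friendly", "cheerful": "friendly", "thrilled": "excited"}
DEFAULT_EMOTION = "neutral"


@dataclass
class RoutedSegment:
    text: str                     # text handed to the TTS backend
    instruction: Optional[str]    # natural-language instruction, if any


def resolve_emotion(requested: str) -> str:
    """Three-stage fallback: direct hit, synonym mapping, then default."""
    key = requested.strip().lower()
    if key in KNOWN_EMOTIONS:     # stage 1: direct
        return key
    if key in SYNONYMS:           # stage 2: mapped
        return SYNONYMS[key]
    return DEFAULT_EMOTION        # stage 3: default, always resolvable


def route(backend: str, text: str, requested: str) -> RoutedSegment:
    """Translate an abstract emotion into the form the backend expects."""
    emotion = resolve_emotion(requested)
    if backend in INSTRUCTION_BACKENDS:
        return RoutedSegment(text, f"Speak in the following tone: {emotion}.")
    if backend in TAG_BACKENDS:
        # Assumed tag syntax, for illustration only.
        return RoutedSegment(f"({emotion}) {text}", None)
    raise ValueError(f"unknown backend: {backend}")
```

Because the last stage always returns the default emotion, every request yields a usable routing decision, even for emotions the tables do not know.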
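
To make the ASR-based check more concrete, here is a minimal sketch of a word error rate computation and threshold-based flagging. The tokenization, the function names, and the `>` comparison against the threshold are assumptions; the suite's actual text normalization is not described here.

```python
import re


def _words(text: str) -> list:
    # Simplified normalization: lowercase, keep word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def flag_segments(segments, threshold: float) -> list:
    """Return the IDs of segments whose WER exceeds the project threshold.

    `segments` is an iterable of (segment_id, original_text, asr_transcript).
    """
    return [seg_id for seg_id, original, transcript in segments
            if word_error_rate(original, transcript) > threshold]
```

Only the flagged IDs would then need to be regenerated, which matches the selective regeneration described above.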
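
The effect of steady-state initialization can be illustrated with a first-order highpass filter. This is a simplified stand-in, not the suite's DSP chain: the filter order, the cutoff frequency, and the function name are assumptions. For general IIR filters, SciPy offers `scipy.signal.lfilter_zi` and `scipy.signal.sosfilt_zi` to compute such initial conditions; whether TTS-Suite relies on them is not stated here.

```python
import math


def highpass(samples, cutoff_hz, sample_rate, steady_state=True):
    """First-order RC highpass: y[n] = alpha * (y[n-1] + x[n] - x[n-1]).

    With steady_state=True the filter memory is set as if the first sample
    had been playing forever, so a constant offset produces no startup
    transient. With steady_state=False the memory starts at zero.
    """
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / sample_rate)
    prev_x = samples[0] if steady_state else 0.0
    prev_y = 0.0
    out = []
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


# A segment with a constant DC offset of 0.5 (illustrative values).
dc_segment = [0.5] * 100
warm = highpass(dc_segment, cutoff_hz=80.0, sample_rate=24000)
cold = highpass(dc_segment, cutoff_hz=80.0, sample_rate=24000, steady_state=False)
```

In this sketch `warm` stays at zero from the first sample, while `cold` starts with a large spike that decays over time, which is the kind of audible click at the beginning of a file that the initialization avoids.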
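
A persistent library in SQLite with WAL mode might be opened along these lines. The table layout, the function names, and the lock-based approach to thread safety are hypothetical; only the use of SQLite in WAL mode comes from the description above.

```python
import os
import sqlite3
import tempfile
import threading

_write_lock = threading.Lock()  # serialize writers that share one connection


def open_library(path):
    """Open (or create) the library database and switch it to WAL mode."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS voices ("
        "name TEXT PRIMARY KEY, backend TEXT NOT NULL, reference_audio BLOB)"
    )
    conn.commit()
    return conn


def save_voice(conn, name, backend, reference_audio):
    with _write_lock:
        conn.execute(
            "INSERT OR REPLACE INTO voices VALUES (?, ?, ?)",
            (name, backend, reference_audio),
        )
        conn.commit()


def load_voice(conn, name):
    return conn.execute(
        "SELECT backend, reference_audio FROM voices WHERE name = ?", (name,)
    ).fetchone()


# Usage with a throwaway database file.
with tempfile.TemporaryDirectory() as tmp:
    conn = open_library(os.path.join(tmp, "library.db"))
    save_voice(conn, "narrator", "qwen3", b"RIFF")  # placeholder audio bytes
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    stored = load_voice(conn, "narrator")
    conn.close()
```

WAL mode lets readers continue while a write is in progress, which suits a web application where several tabs or requests read the same voice library.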