Architecture

TTS-Suite follows a layered architecture with clearly separated responsibilities: a tab-oriented web interface, a central orchestration layer, specialized service components for routing, synthesis, quality assurance, and audio processing, and an SQLite-based persistence layer. The TTS and ASR models run outside the application as separate vLLM-Omni inference servers and are addressed via OpenAI-compatible HTTP APIs. Generation is implemented as a generator-based pipeline that streams progress events to the UI.

At a glance

  • Layer separation: UI (Gradio 6) → orchestration → services → persistence and external endpoints
  • Generator-based pipeline run with progress events and local state per generation (no shared state between sessions)
  • Persistence via SQLite in WAL mode with separate connections per call
  • External TTS and ASR models accessed through OpenAI-compatible HTTP, optional bearer token
  • Configuration exclusively via .env file and environment variables
  • State propagation between tabs only via UUIDs in gr.State, not via JSON blobs
  • Deployment as Docker container or direct Python installation

Architecture description

Layers and components

The application is organized into five layers:

  • UI layer — A Gradio 6 application with five sequential tabs (Script Workshop, Voice Studio, Generation, Review, Export). A shared session state holds only the UUIDs of the active script and project plus the WER threshold, keeping UI reactivity stable even with large scripts.
  • Orchestration layer — The Orchestrator coordinates the five phases of a generation run (segment synthesis, ASR-based QA, stitching, post-processing, persistence). Each invocation creates a local progress record (GenerationProgress) and yields it as a generator to the UI.
  • Service layer — Specialized services for the individual processing steps: TTSRouter (HTTP communication with the TTS backends, including backend-specific payload formats), EmotionsRouter (resolving an emotion against a voice profile), VoiceRegistry and ScriptRegistry (CRUD on the voice and script libraries), QAEngine (ASR call and WER computation), AudioStitcher (segment level normalization, sample-rate adjustment, crossfading), AudioPipeline (DSP chain and presets), LLMService (prompt construction and JSON validation of the LLM response), and Security (input and path validation).
  • Persistence layer — An SQLite database in WAL mode stores voice profiles, scripts, and projects as JSON-serialized Pydantic models. Each operation opens its own connection; reads can run concurrently. Reference audio and generated segments are stored as files in the configurable data directory.
  • External endpoints — Seven OpenAI-compatible inference services are addressed: five vLLM-Omni endpoints for the TTS modes (Qwen3-CustomVoice, Qwen3-Base, Qwen3-VoiceDesign, Fish Speech S2 Pro, Voxtral), one vLLM-Omni endpoint for Qwen3-ASR, and one OpenAI-compatible LLM server (Ollama, vLLM, or OpenAI API) for script generation. The endpoints are reached only via HTTP, optionally with a bearer token.
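The generator-based hand-off between the Orchestration layer and the UI can be sketched as follows. This is a minimal illustration, not the project's implementation: only the name GenerationProgress comes from the description above, while its fields, the run_generation function, and the synthesize callback are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class GenerationProgress:
    """Local progress record; one instance is created per generation run."""
    total: int
    done: int = 0
    phase: str = "synthesis"
    messages: List[str] = field(default_factory=list)


def run_generation(
    segments: List[str],
    synthesize: Callable[[str], bytes],  # hypothetical stand-in for the TTS call
) -> Iterator[GenerationProgress]:
    """Yield the progress record after every segment and every later phase."""
    progress = GenerationProgress(total=len(segments))  # local, never shared
    for text in segments:
        synthesize(text)
        progress.done += 1
        progress.messages.append(f"segment {progress.done}/{progress.total} done")
        yield progress
    # The remaining phases are only announced here; real work is omitted.
    for phase in ("qa", "stitching", "post-processing", "persistence"):
        progress.phase = phase
        progress.messages.append(f"phase: {phase}")
        yield progress


if __name__ == "__main__":
    for p in run_generation(["Hello.", "World."], synthesize=lambda t: b""):
        print(p.phase, f"{p.done}/{p.total}")
```

Because the record is created inside the generator function, two runs started in parallel each get their own object, which is the property the "no shared state between sessions" point relies on.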

Data flow

flowchart TB
    User([User])

    subgraph UI["UI layer (Gradio 6)"]
        Tab1[Script Workshop]
        Tab2[Voice Studio]
        Tab3[Generation]
        Tab4[Review]
        Tab5[Export]
    end

    subgraph Orchestration["Orchestration"]
        Orch[Orchestrator]
    end

    subgraph Services["Services"]
        LLM[LLM Service]
        VR[Voice Registry]
        SR[Script Registry]
        ER[Emotions Router]
        TR[TTS Router]
        QA[QA Engine]
        ST[Audio Stitcher]
        AP[Audio Pipeline]
        SEC[Security]
    end

    subgraph Persistence["Persistence"]
        DB[(SQLite)]
        FS[(File system)]
    end

    subgraph External["External endpoints"]
        LLMSrv[LLM server]
        Q3CV[vLLM Qwen3-CustomVoice]
        Q3VD[vLLM Qwen3-VoiceDesign]
        Q3B[vLLM Qwen3-Base]
        FS2[vLLM Fish Speech S2 Pro]
        VOX[vLLM Voxtral 4B]
        ASR[vLLM Qwen3-ASR]
    end

    User --> UI

    Tab1 --> LLM
    LLM --> LLMSrv
    Tab1 --> SR
    SR --> DB

    Tab2 --> VR
    VR --> DB
    VR --> FS
    Tab2 --> TR

    Tab3 --> Orch
    Orch --> SR
    Orch --> ER
    ER --> VR
    Orch --> TR
    TR --> Q3CV
    TR --> Q3VD
    TR --> Q3B
    TR --> FS2
    TR --> VOX
    Orch --> QA
    QA --> ASR
    Orch --> ST
    Orch --> AP
    Orch --> DB
    Orch --> FS

    Tab4 --> Orch
    Tab5 --> AP
    Tab5 --> FS

    SEC -.validates.-> Tab1
    SEC -.validates.-> Tab2
    SEC -.validates.-> Tab5

Diagram explanation

The user interacts only with the UI layer. Tab 1 (Script Workshop) calls the LLM Service, which assembles a format-specific prompt and dispatches it to an external OpenAI-compatible LLM server. The resulting script is stored via the Script Registry in the SQLite database; only the script UUID flows into the UI session state.
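The "store the script, pass on only the UUID" step can be illustrated with the standard library alone. The sketch below is an assumption-laden example rather than the Script Registry itself: the table layout, the function names, and the use of a plain dict in place of a Pydantic model are all hypothetical.

```python
import json
import sqlite3
import uuid
from contextlib import closing


def connect(db_path: str) -> sqlite3.Connection:
    """Open a fresh connection for a single operation and enable WAL mode."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def save_script(db_path: str, script: dict) -> str:
    """Store the script as JSON and return only its UUID."""
    script_id = str(uuid.uuid4())
    # closing() closes the connection; the inner `conn` context commits.
    with closing(connect(db_path)) as conn, conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scripts "
            "(id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        conn.execute(
            "INSERT INTO scripts (id, body) VALUES (?, ?)",
            (script_id, json.dumps(script)),
        )
    return script_id


def load_script(db_path: str, script_id: str) -> dict:
    """Load a script by UUID using its own short-lived connection."""
    with closing(connect(db_path)) as conn:
        row = conn.execute(
            "SELECT body FROM scripts WHERE id = ?", (script_id,)
        ).fetchone()
    if row is None:
        raise KeyError(script_id)
    return json.loads(row[0])
```

A tab would keep only the returned string in its session state and call load_script when it needs the full content.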

Tab 2 (Voice Studio) manages voice profiles via the Voice Registry. Profiles including metadata reside in the database, reference audio in the file system. For the live preview, tab 2 calls the TTS Router directly.

Tab 3 (Generation) starts the Orchestrator, which loads the script and the project and iterates over the segments. For each segment it delegates to the Emotions Router, which resolves the segment against the relevant voice profile and returns a routing structure (backend type, reference audio or preset name, resolved emotion, instruction text or inline tag). The TTS Router selects the payload structure that matches the backend type and calls the corresponding vLLM-Omni endpoint. After each segment, a progress event is yielded to the UI. The QA Engine then transcribes each successfully generated segment with Qwen3-ASR and computes the WER against the original text. The Audio Stitcher normalizes segment levels, aligns sample rates, and joins the segments with format-dependent pauses and crossfades. The Audio Pipeline applies the chosen preset. Finally, the finished project is written to the database and the project directory.
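How a router might pick a payload shape per backend type can be sketched as below. The per-backend fields follow the contracts listed in this document (voice, instructions, language, ref_audio, ref_text, task_type). Everything else is an assumption: the Route class, the backend identifiers, the "input" field for the text, the field used for the VoiceDesign description, the data-URL encoding for Fish Speech, and the way an inline tag is prepended.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    """Hypothetical routing structure for one segment."""
    backend: str
    voice: Optional[str] = None        # preset name
    ref_audio: Optional[bytes] = None  # raw WAV bytes for clone modes
    ref_text: Optional[str] = None
    instruction: Optional[str] = None  # instruction or voice description
    inline_tag: Optional[str] = None   # opaque tag string
    language: Optional[str] = None


def as_data_url(wav: bytes) -> str:
    """Encode WAV bytes as a base64 data URL."""
    return "data:audio/wav;base64," + base64.b64encode(wav).decode("ascii")


def build_payload(text: str, route: Route) -> dict:
    """Return a request body shaped for the route's backend type."""
    payload: dict = {"input": text}  # "input" is an assumed field name
    if route.backend == "qwen3_custom_voice":
        payload.update(voice=route.voice, instructions=route.instruction,
                       language=route.language)
    elif route.backend == "qwen3_base":
        if route.ref_audio is None:
            raise ValueError("qwen3_base needs reference audio")
        payload.update(ref_audio=as_data_url(route.ref_audio),
                       ref_text=route.ref_text)
    elif route.backend == "qwen3_voice_design":
        # The field carrying the description text is an assumption.
        payload.update(task_type="VoiceDesign", instructions=route.instruction)
    elif route.backend == "fish_speech":
        payload["voice"] = "default"
        if route.ref_audio is not None:
            payload["ref_audio"] = as_data_url(route.ref_audio)
        if route.inline_tag:
            payload["input"] = f"{route.inline_tag} {text}"
    elif route.backend == "voxtral":
        # Preset name or base64 audio share the same "voice" field.
        if route.voice:
            payload["voice"] = route.voice
        elif route.ref_audio is not None:
            payload["voice"] = base64.b64encode(route.ref_audio).decode("ascii")
        else:
            raise ValueError("voxtral needs a preset name or reference audio")
    else:
        raise ValueError(f"unknown backend: {route.backend}")
    return payload
```

Keeping the branching in one function means the Orchestrator never needs to know which backend a segment is routed to.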

Tab 4 (Review) uses the Orchestrator to regenerate individual segments with an alternative emotion or edited text. Tab 5 (Export) optionally applies a different audio preset and writes the final file in the desired format.

The Security service validates input, file uploads, and paths across multiple tabs. It is a cross-cutting concern that sits outside the main data flow.

AI components in the workflow

Three classes of AI components interact in the pipeline, embedded in a rule-based orchestration:

  • LLM (script generation): Turns raw text into a structured dialog script with speaker definitions, segments, and emotion tags. The JSON response is validated against the Pydantic data models; on format errors, the request is retried with a more constrained prompt.
  • TTS models (synthesis): Three heterogeneous model families (Qwen3-TTS, Fish Speech, Voxtral) that together expose five differing API contracts. The TTS Router encapsulates this heterogeneity (Qwen3-CustomVoice: voice + instructions + language; Qwen3-Base: ref_audio as a base64 data URL plus ref_text; Qwen3-VoiceDesign: task_type:VoiceDesign plus description text; Fish Speech: voice:default plus separate ref_audio plus inline tags; Voxtral: preset name or base64 audio in the same voice field).
  • ASR model (quality assurance): Qwen3-ASR transcribes each generated segment. A Levenshtein-based WER computation compares the transcript with the original text. Both texts are normalized before comparison (whitespace collapsed and inline tags removed).
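The WER check in the last item can be made concrete with a word-level Levenshtein distance: WER = (substitutions + deletions + insertions) / number of reference words. The sketch below is an illustration under stated assumptions, not the QA Engine's code. In particular, the square-bracket tag syntax in TAG_PATTERN and the function names are hypothetical.

```python
import re
from typing import List

# Assumed inline-tag syntax, e.g. "[happy]"; the real tag format may differ.
TAG_PATTERN = re.compile(r"\[[^\]]*\]")


def normalize(text: str) -> List[str]:
    """Remove inline tags and collapse whitespace, then split into words."""
    return TAG_PATTERN.sub(" ", text).split()


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of the ASR transcript against the original text."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("[happy] the quick brown fox", "the quick fox"))  # 0.25
```

Note that WER can exceed 1.0 when the transcript contains many inserted words, so a threshold comparison should not assume a value between 0 and 1.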

The Emotions Router is rule-based and links these components: it translates the abstract emotion labels from the LLM script into the respective backend format (instruction or inline tag) and selects the matching reference audio clip per emotion in clone mode.
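A rule-based resolution of this kind can be sketched as a pair of lookups. The example is purely illustrative: the VoiceProfile fields, the emotion table, the tag and instruction texts, and the fallback to a "neutral" entry are all assumptions, since the document does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical rule table: emotion label -> (instruction text, inline tag).
EMOTIONS = {
    "neutral": ("Speak in a calm, neutral tone.", "[neutral]"),
    "happy": ("Speak in a cheerful, upbeat tone.", "[happy]"),
    "sad": ("Speak slowly, in a subdued tone.", "[sad]"),
}


@dataclass
class VoiceProfile:
    """Hypothetical subset of a voice profile."""
    backend: str
    emotion_format: str  # "instruction" or "inline_tag"
    ref_clips: Dict[str, str] = field(default_factory=dict)  # emotion -> path


def resolve(emotion: str, profile: VoiceProfile) -> dict:
    """Translate an abstract emotion label into a backend-specific route."""
    resolved = emotion if emotion in EMOTIONS else "neutral"
    instruction, tag = EMOTIONS[resolved]
    route = {"backend": profile.backend, "emotion": resolved}
    if profile.emotion_format == "instruction":
        route["instruction"] = instruction
    else:
        route["inline_tag"] = tag
    if profile.ref_clips:  # clone mode: pick the clip recorded for this emotion
        route["ref_audio"] = profile.ref_clips.get(
            resolved, profile.ref_clips.get("neutral")
        )
    return route
```

Because the mapping is a plain table, adding an emotion or a backend format does not require touching the LLM prompt or the TTS models.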

Concurrency and robustness

The Orchestrator is stateless between calls: each generation creates a local GenerationProgress object, so parallel sessions do not interfere with each other. The SQLite database runs in WAL mode with a separate connection per operation, which permits concurrent reads. TTS calls are retried a configurable number of times (default: three attempts); segments that still fail are flagged as QA_FAILED without aborting the overall run. Audio uploads from different browsers and recording devices are converted to WAV via pydub, and conversion failures produce readable error messages instead of generic exceptions.
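The retry behavior can be sketched as follows. Only the default of three attempts and the QA_FAILED flag come from the description above. The SegmentResult class, the function names, and the "OK" status are assumptions, and a real implementation would likely catch only HTTP and timeout errors and wait between attempts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SegmentResult:
    """Hypothetical per-segment outcome."""
    text: str
    audio: Optional[bytes]
    status: str  # "OK" or "QA_FAILED"


def synthesize_with_retry(
    synthesize: Callable[[str], bytes],
    text: str,
    attempts: int = 3,
) -> Optional[bytes]:
    """Call the TTS function up to `attempts` times; return None if all fail."""
    for _ in range(attempts):
        try:
            return synthesize(text)
        except Exception:  # broad on purpose for the sketch
            continue
    return None


def run_segments(
    segments: List[str],
    synthesize: Callable[[str], bytes],
    attempts: int = 3,
) -> List[SegmentResult]:
    """Process every segment; a failed segment is flagged, not fatal."""
    results = []
    for text in segments:
        audio = synthesize_with_retry(synthesize, text, attempts)
        status = "OK" if audio is not None else "QA_FAILED"
        results.append(SegmentResult(text, audio, status))
    return results
```

Flagging instead of raising keeps the run going, so the failed segments can be regenerated individually later.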

Configuration and deployment

All endpoints, model names, audio parameters, stitching parameters, and security limits are set exclusively via a .env file or environment variables; there are no command-line arguments. The application starts either directly via python app.py or as a Docker container (Python 3.12 slim base with ffmpeg and libsndfile). The external vLLM-Omni inference servers run separately and may reside anywhere on the network as long as they are reachable over HTTP.
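Environment-only configuration can be sketched like this. All variable names and default values below are hypothetical, apart from the default of three TTS attempts mentioned earlier. The sketch also assumes the .env file has already been loaded into the process environment, for example by Docker's --env-file option.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical subset of the application settings."""
    tts_base_url: str
    api_token: Optional[str]
    wer_threshold: float
    tts_retries: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables, falling back to defaults."""
    return Settings(
        tts_base_url=env.get("TTS_BASE_URL", "http://localhost:8000/v1"),
        api_token=env.get("TTS_API_TOKEN") or None,  # empty string means unset
        wer_threshold=float(env.get("WER_THRESHOLD", "0.2")),
        tts_retries=int(env.get("TTS_RETRIES", "3")),
    )


def auth_headers(settings: Settings) -> Dict[str, str]:
    """Return the optional bearer-token header for HTTP requests."""
    if settings.api_token:
        return {"Authorization": f"Bearer {settings.api_token}"}
    return {}
```

Passing the mapping as a parameter keeps the loader testable without touching the real process environment.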

Technology overview

  • UI: Gradio 6
  • Data modeling: Pydantic v2
  • HTTP client: httpx
  • Persistence: SQLite (WAL mode), file system
  • Audio processing: pydub, scipy, numpy, pyloudnorm, noisereduce
  • TTS models: Qwen3-TTS (CustomVoice, Base, VoiceDesign), Voxtral 4B TTS, Fish Speech S2 Pro
  • ASR: Qwen3-ASR
  • Inference backend for TTS and ASR: vLLM-Omni with OpenAI-compatible API
  • LLM backend: any OpenAI-compatible server (Ollama, vLLM, OpenAI API)
  • Deployment: Docker or direct Python 3.10+ installation