TextWerkstatt: Architecture

TextWerkstatt is built as a containerised web application with clear layer separation. A Gradio-based interface communicates with an asynchronous processing layer that routes each task, depending on the mode, to either a simple chat path or a multi-stage pipeline. The backend talks only to internal services: two OpenAI-compatible LLM endpoints, an embedder, and a reranker for semantic retrieval and fact verification.

At a glance

Layered model: interface, mode router, pipeline and intent handlers, storage (inventory, version list, scope), connectors (LLM, embedder, reranker, file reader), exporters.
Asynchronous processing with semaphore-based concurrency control per backend service.
Two modes sharing a common profile context: chat path and multi-stage pipeline.
Two LLM slots (primary, fast) with configurable model family and automatic reasoning control.
Inventory as the central data structure for retrieval and the second-stage hallucination check.
Version management with a cursor model and marking of branched states.
Containerised deployment behind a reverse proxy with streaming support for long-running processes.

Architecture description

Components

The application is divided into seven functional layers.

Interface. A Gradio interface provides an input field, file upload, profile selection, mode display, work plan preview, version list, and export buttons. It communicates with the backend via Gradio's own streaming interface.

Mode router. On initial entry, the mode router decides whether to run the chat path or the pipeline based on the task profile, the volume of material, and the input length. Some profiles are bound to the pipeline by default; manual selection is also supported.
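The routing decision described above can be sketched as a small pure function. This is a minimal illustration only: the threshold values, the profile names, and the rule that a manual selection overrides a profile's default binding are assumptions, not taken from the TextWerkstatt code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; in the real system these come from configuration.
MAX_CHAT_INPUT_CHARS = 2_000
MAX_CHAT_MATERIAL_CHARS = 20_000
# Illustrative profile names that are bound to the pipeline by default.
PIPELINE_PROFILES = {"report", "minutes"}


@dataclass
class Task:
    profile: str
    input_text: str
    material_chars: int = 0
    forced_mode: Optional[str] = None  # "chat" or "pipeline" when chosen manually


def choose_mode(task: Task) -> str:
    """Return "chat" or "pipeline" for the initial request."""
    if task.forced_mode in ("chat", "pipeline"):
        return task.forced_mode      # assumption: manual selection wins
    if task.profile in PIPELINE_PROFILES:
        return "pipeline"            # profile is bound to the pipeline
    if task.material_chars > MAX_CHAT_MATERIAL_CHARS:
        return "pipeline"            # too much material for a single call
    if len(task.input_text) > MAX_CHAT_INPUT_CHARS:
        return "pipeline"            # long, detailed instructions
    return "chat"
```

Keeping the decision free of I/O makes the thresholds easy to tune and to test in isolation.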

Processing. The chat path performs a single model call with the profile-specific system prompt. The pipeline runs through five phases: an LLM-driven survey of the material that detects key elements, gaps, and open questions; decomposition into a work plan with section structure, instructions, and quality criteria; sequential section-by-section processing; a quality check against material and criteria with an optional correction loop; and a final homogenisation of the entire document.
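The five pipeline phases can be pictured as one asynchronous function that hands each intermediate result to the next step. The sketch below is an assumption-laden outline: `call(slot, step, payload)` is a hypothetical wrapper around a model request, the step names are invented, and the slot assignment (fast model for sections, primary model elsewhere) follows the role description in this document.

```python
import asyncio


async def run_pipeline(material, call, max_corrections=2):
    """Run the five phases in order; `call` is a hypothetical model wrapper."""
    analysis = await call("primary", "analysis", material)   # phase 1
    plan = await call("primary", "work_plan", analysis)      # phase 2: section specs
    sections = []
    for spec in plan:                                        # phase 3: sequential
        sections.append(await call("fast", "section", spec))
    for _ in range(max_corrections):                         # phase 4 with correction loop
        defects = await call("primary", "quality_check", sections)
        if not defects:
            break
        sections = await call(
            "primary", "correct", {"sections": sections, "defects": defects}
        )
    return await call("primary", "homogenise", sections)     # phase 5
```

Because the model wrapper is injected, the control flow can be exercised with a stub before any backend is available.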

Inventory and retrieval. Uploaded material is optionally inventoried: an LLM segments the document into semantic units (e.g. agenda items, resolutions, chapters, criteria, code blocks), each captured with title, summary, and full text. Each entry receives an embedding vector; the entire index is held in memory. Pipeline steps that require material-bound retrieval query the index by cosine similarity and have the hits optionally reranked.
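An in-memory index queried by cosine similarity needs very little machinery. The following sketch uses only the standard library; the class and field names (`Entry`, `InventoryIndex`, `search`) are illustrative and not the project's actual identifiers.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    title: str
    summary: str
    full_text: str
    vector: list  # embedding of the entry, a list of floats


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InventoryIndex:
    """Holds all entries in memory and ranks them against a query vector."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def search(self, query_vector, top_k=3):
        """Return up to top_k (score, entry) pairs, best match first."""
        scored = [(cosine(query_vector, e.vector), e) for e in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A linear scan like this is usually adequate for the entries of a single session; a vector library would only become necessary for much larger collections.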

Intent router and scope. After the initial generation, the intent router converts each follow-up input into a classified operation. High-confidence classifications execute directly, medium-confidence ones trigger a confirmation, and low-confidence ones trigger a request for clarification. The scope manager limits operations to the entire document, a section, or a paragraph and determines which edit strategy applies: full text, per section, or surgical.
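The three-way handling of classification confidence amounts to two thresholds. The sketch below shows the idea; the numeric cut-offs and the names `Action` and `decide` are assumptions chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"   # run the operation directly
    CONFIRM = "confirm"   # ask the user to confirm the detected operation
    CLARIFY = "clarify"   # ask the user what was meant


# Hypothetical thresholds; the real values are not given in this document.
HIGH_CONFIDENCE = 0.8
MEDIUM_CONFIDENCE = 0.5


def decide(confidence: float) -> Action:
    """Map a classifier confidence in [0, 1] to the follow-up behaviour."""
    if confidence >= HIGH_CONFIDENCE:
        return Action.EXECUTE
    if confidence >= MEDIUM_CONFIDENCE:
        return Action.CONFIRM
    return Action.CLARIFY
```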

Storage. Three in-memory structures hold session state: the inventory with embeddings, a cursor-based version list with branch marking, and the current scope. There is no persistent storage; each session begins with an empty state.
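One way to read "cursor-based version list with branch marking" is an append-only list plus an index of the active version, where a version created after stepping back is flagged as branched. The sketch below implements that reading; it is an interpretation, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Version:
    text: str
    label: str
    branched: bool = False  # True if created while the cursor was not at the newest version


class VersionList:
    def __init__(self):
        self.versions = []
        self.cursor = -1  # index of the active version, -1 while empty

    def current(self):
        return self.versions[self.cursor] if self.cursor >= 0 else None

    def commit(self, text, label):
        """Append a new version; mark it as branched if older state was active."""
        branched = self.cursor < len(self.versions) - 1
        self.versions.append(Version(text, label, branched))
        self.cursor = len(self.versions) - 1
        return self.versions[-1]

    def undo(self):
        """Move the cursor one step back, if possible."""
        if self.cursor > 0:
            self.cursor -= 1
        return self.current()

    def redo(self):
        """Move the cursor one step forward, if possible."""
        if self.cursor < len(self.versions) - 1:
            self.cursor += 1
        return self.current()
```

In this reading nothing is ever discarded: stepping back and editing keeps the newer versions in the list and simply marks the new state as a branch.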

Exporters. Three exporters generate Word, Markdown, and text files with a provenance footer. The Word exporter converts the internally used Markdown representation into structured .docx documents.

Workflow diagram

flowchart TB
    User[User] --> UI[Gradio interface]
    UI --> Upload[File reader]
    Upload --> Inventory[Inventorying]
    Inventory --> Embedder[(Embedder)]
    Inventory --> Index[Inventory index]

    UI --> Router{Mode router}
    Router -->|"short / simple"| Chat[Quick mode / chat]
    Router -->|"extensive / complex"| Pipeline

    subgraph Pipeline[Workshop pipeline]
        direction TB
        P1[1. Analysis] --> P2[2. Work plan]
        P2 --> P3[3. Section processing]
        P3 --> P4[4. Quality check]
        P4 --> P5[5. Homogenisation]
    end

    Chat --> LLM_P[(Primary LLM)]
    P1 --> LLM_P
    P2 --> LLM_P
    P3 --> LLM_F[(Fast LLM)]
    P4 --> LLM_P
    P4 --> SemCheck[Semantic fact verification]
    SemCheck --> Embedder
    SemCheck --> Index
    P5 --> LLM_P

    Index --> P3
    Reranker[(Reranker)] --> P3

    Pipeline --> Doc[Result document]
    Chat --> Doc

    Doc --> History[Version list]
    Doc --> Intent{Intent router}
    Intent --> Scope[Scope: entire / section / paragraph]
    Scope --> Handler[Handler: deterministic / LLM / LLM+retrieval]
    Handler --> LLM_P
    Handler --> LLM_F
    Handler --> Doc

    Doc --> Export[Exporter]
    Export --> Word[Word .docx]
    Export --> MD[Markdown .md]
    Export --> Text[Text .txt]

The diagram shows the data flow from input to export. Uploaded material is optionally inventoried, with the embedder producing vectors for each entry. The mode router decides between the chat path and the pipeline. The pipeline calls the primary model for quality-critical phases (analysis, plan, check, homogenisation) and the fast model for section processing; when an inventory is present, section processing and quality check both draw on the index, and the check additionally uses it for semantic fact verification. Follow-up inputs are classified by the intent router and applied via a scope to the relevant document part before flowing back into the result document. The version list and exporters operate on the current document state.

Role of the AI components

Primary LLM. Responsible for steps where quality and coherence are decisive: material analysis, creation of the work plan with section-wise task structure, quality verification against criteria and material, correction of identified defects, and final homogenisation.

Fast LLM. Responsible for section-by-section processing in workshop mode and for small edits triggered by the intent router. If no secondary model is configured, the primary LLM takes over these tasks.

Embedder. Generates vectors for inventory entries and for the statements extracted during semantic fact verification. Detects the endpoint variant (OpenAI style or HuggingFace TEI) automatically.
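How the variant is detected is not specified here. One plausible approach, shown below as an assumption, is to look at the shape of the response: an OpenAI-style embeddings endpoint returns an object with a `data` list of `{"embedding": ...}` items, while the TEI `/embed` route returns a bare list of vectors. The function name `extract_vectors` is hypothetical.

```python
def extract_vectors(payload):
    """Normalise an embedding response to a list of vectors.

    Assumption: the variant is recognised from the JSON shape of the response.
    """
    # OpenAI style: {"data": [{"embedding": [...], "index": 0}, ...]}
    if isinstance(payload, dict) and "data" in payload:
        items = sorted(payload["data"], key=lambda item: item.get("index", 0))
        return [item["embedding"] for item in items]
    # TEI style (/embed): [[...], [...]]
    if isinstance(payload, list) and all(isinstance(v, list) for v in payload):
        return payload
    raise ValueError("unrecognised embedding response shape")
```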

Reranker. Re-orders retrieval candidates by query relevance. Used optionally when embedding-based retrieval returns more candidates than required; on failure, the system falls back to embedding-based ordering.
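The optional reranking step with its fallback can be sketched as follows. The `rerank` argument stands for a hypothetical async callable that returns one relevance score per candidate text; candidates are assumed to arrive as `(embedding_score, text)` pairs.

```python
import asyncio


async def rank_candidates(query, candidates, rerank, top_k):
    """Return the top_k candidate texts, reranked when possible."""
    by_embedding = sorted(candidates, key=lambda c: c[0], reverse=True)
    fallback = [text for _, text in by_embedding[:top_k]]
    # Nothing to gain from reranking if there are no surplus candidates.
    if rerank is None or len(by_embedding) <= top_k:
        return fallback
    try:
        scores = await rerank(query, [text for _, text in by_embedding])
    except Exception:
        return fallback  # reranker failed: keep the embedding-based order
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [by_embedding[i][1] for i in order[:top_k]]
```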

Agentic pipeline. The phases act as specialised steps, each with its own prompt, task definition, and model assignment. Instead of a single model call, a complex task passes through several stations, each documenting its intermediate result and handing it to the next.

Concurrency and robustness

Processing is asynchronous throughout. A semaphore with a configurable depth is used per LLM to limit the number of concurrent backend calls. The Gradio queue allows only one pipeline at a time per user, so that there is no contention for model slots. Model responses are decoded with a three-stage JSON parser (direct, from a Markdown code block, from the first JSON block); when a response cannot be parsed, a fallback is applied. Pipeline phases encapsulate errors so that individual section failures do not abort the entire run; unsuccessful sections are flagged. Backend calls have separately configurable timeouts.
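Three of the mechanisms above lend themselves to a compact sketch: the staged JSON parser, the semaphore-guarded backend call with a timeout, and the flagging of failed sections. All names (`parse_model_json`, `Backend`, `process_sections`) are hypothetical, and the fallback value is simply whatever the caller passes in.

```python
import asyncio
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_model_json(text, fallback=None):
    """Decode a model response in three stages, else return the fallback."""
    try:                                   # stage 1: the whole response is JSON
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = FENCED_BLOCK.search(text)      # stage 2: JSON inside a code block
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    decoder = json.JSONDecoder()           # stage 3: first decodable JSON value
    for pos, char in enumerate(text):
        if char in "{[":
            try:
                return decoder.raw_decode(text, pos)[0]
            except json.JSONDecodeError:
                continue
    return fallback


class Backend:
    """Limits concurrent calls to one backend and applies its timeout."""

    def __init__(self, name, max_concurrency, timeout_s):
        self.name = name
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.timeout_s = timeout_s

    async def call(self, make_request):
        async with self.semaphore:
            return await asyncio.wait_for(make_request(), timeout=self.timeout_s)


async def process_sections(backend, sections, worker):
    """Process sections one by one; a failure is flagged, not propagated."""
    results = []
    for section in sections:
        try:
            text = await backend.call(lambda s=section: worker(s))
            results.append({"section": section, "ok": True, "text": text})
        except Exception as exc:
            results.append({"section": section, "ok": False, "error": repr(exc)})
    return results
```

Creating the `Backend` objects inside the running event loop avoids loop-binding issues with `asyncio.Semaphore` on older Python versions.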

Configuration and deployment

Configuration is provided via environment variables in a .env file: endpoints and API keys for the four backend services, model families for reasoning control, thresholds for the mode router and the semantic hallucination check, and pipeline parameters such as the maximum number of sections and correction iterations. The application is deployed as a container; a reverse proxy configuration with WebSocket upgrade enabled, buffering disabled, and generous timeouts supports multi-minute pipeline runs and file uploads of up to 50 MB.
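Reading such settings into one typed object keeps the rest of the code free of string lookups. In the sketch below every variable name (`TW_PRIMARY_LLM_URL` and so on) and every default value is invented for illustration; in practice python-dotenv's `load_dotenv()` would populate the environment from the .env file before this function runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    primary_llm_url: str
    fast_llm_url: str
    embedder_url: str
    reranker_url: Optional[str]
    max_sections: int
    max_corrections: int
    fact_check_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables (all names are hypothetical)."""
    primary = env["TW_PRIMARY_LLM_URL"]
    return Settings(
        primary_llm_url=primary,
        # Without a fast model, the primary model takes over its tasks.
        fast_llm_url=env.get("TW_FAST_LLM_URL") or primary,
        embedder_url=env["TW_EMBEDDER_URL"],
        reranker_url=env.get("TW_RERANKER_URL") or None,
        max_sections=int(env.get("TW_MAX_SECTIONS", "12")),
        max_corrections=int(env.get("TW_MAX_CORRECTIONS", "2")),
        fact_check_threshold=float(env.get("TW_FACT_CHECK_THRESHOLD", "0.75")),
    )
```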

Technologies

Interface. Gradio (Python framework for interactive web interfaces) with custom CSS.
LLM integration. OpenAI-compatible async client (openai Python SDK), extended with vLLM-specific reasoning toggles via chat_template_kwargs.
HTTP. httpx as the asynchronous HTTP client for embedder and reranker calls.
File readers. python-docx for Word documents, pdfminer.six for PDFs, python-pptx for PowerPoint, the standard library for txt/md/csv/html.
Word export. python-docx with a custom Markdown-to-DOCX converter.
Configuration. python-dotenv for .env files.
Deployment. Docker container, operated behind nginx as a reverse proxy with WebSocket upgrade.
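The reasoning toggle named under LLM integration can be illustrated with a request builder. The openai Python SDK forwards unknown fields through its `extra_body` parameter, and vLLM accepts `chat_template_kwargs` in the request body. Which key a model family expects depends on its chat template: `enable_thinking` is the switch used by Qwen3 templates, and the family-to-key table here is an assumption rather than the project's actual mapping.

```python
# Hypothetical mapping from configured model family to its chat-template switch.
REASONING_SWITCH = {"qwen3": "enable_thinking"}


def build_chat_request(model, messages, model_family, reasoning):
    """Return keyword arguments for client.chat.completions.create()."""
    request = {"model": model, "messages": messages}
    switch = REASONING_SWITCH.get(model_family)
    if switch is not None:
        # vLLM passes these kwargs on to the model's chat template.
        request["extra_body"] = {"chat_template_kwargs": {switch: reasoning}}
    return request
```

With an `AsyncOpenAI` client, the result would be used as `await client.chat.completions.create(**build_chat_request(...))`; families without a known switch are sent unchanged.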