Skip to content

Architecture

Umfrage-Analyse-System is a Python application with a web-based frontend, a service-oriented processing layer and a SQLite-based persistence layer. The components are clearly separated: the frontend holds no state, the evaluation logic is modularised by question type, and all intermediate results are stored in the database. Long-running processing steps — particularly LLM calls — are implemented as resumable batch operations.

At a glance

  • Four-layer structure: web interface, application core, processing services, persistence
  • Handler registry with seven question-type handlers behind a uniform interface
  • Central SQLite database with version-managed schema and automatic migration
  • External LLM connection via an OpenAI-compatible interface
  • Pipeline orchestration with caching and resume after interruption
  • Configuration via YAML files and environment variables
  • Local execution as a single process; usable via a web browser

Architecture description

Layers and components

The web interface is based on Gradio and bundles the operating and evaluation tabs (import, configuration, item extraction, analysis, workbench, export, themes). It calls only functions of the application core and performs no processing itself.

The application core maintains a global application state (AppState) with a reference to the current DataFrame, the active project, the most recent import and the loaded configuration. A central handler registry provides the appropriate handler for each question type; the handlers inherit from an abstract base class with uniform methods for analysis, visualisation and export.

The processing services encapsulate the domain logic:

  • the data import reads CSV/Excel and optionally parses LimeSurvey structure files;
  • the response translator detects foreign languages (a hybrid of keyword lists and language detection) and translates responses into the target language;
  • the item extractor splits free texts using rule-based methods and LLM support;
  • the cluster service groups items via the LLM with dynamically scaled thresholds;
  • the statistics modules compute chi-square, Cramér's V and question-to-question correlations;
  • the summary service produces text blocks per question and language;
  • the theme analyser extracts keywords and produces theme assignments.

The persistence layer lies in a SQLite database with tables for, among others, projects, imports, raw responses, extracted items, clusters, translations, analysis results, themes and question summaries. At program start a migration component checks the schema version and performs necessary updates automatically.

The output layer comprises the structured Word export with charts, detail tables and appendices, a dashboard JSON export and the Document Builder, which combines Markdown sources into multilingual publication documents.

Diagram

flowchart TB
    User([Evaluator])

    subgraph UI["Web interface (Gradio)"]
        direction LR
        T1[Import]
        T2[Item extraction]
        T3[Analysis]
        T4[Workbench]
        T5[Export]
        T6[Themes]
    end

    subgraph Core["Application core"]
        State[AppState]
        Config[YAML configuration]
        Registry[Handler registry]
    end

    subgraph Handlers["Question-type handlers"]
        H1[single_choice]
        H2[multi_choice]
        H3[multi_choice_binary]
        H4[matrix_likert]
        H5[ranking]
        H6[freetext]
        H7[cooperation_matrix]
    end

    subgraph Services["Processing services"]
        Imp[Data import]
        Trans[Response translation]
        Items[Item extraction]
        Clust[Clustering]
        Stat[Statistics / correlation]
        Sum[Summaries]
        Themes[Theme analysis]
    end

    DB[(SQLite database)]
    LLM["LLM API (external)"]

    subgraph Output["Output"]
        Word[Structured Word report]
        Builder[Document Builder]
        DashJSON[Dashboard JSON]
        Files[CSV / PNG]
    end

    User --> UI
    UI --> Core
    Core --> Registry
    Registry --> Handlers
    Handlers --> Services
    Services <--> DB
    Services --> LLM
    Builder --> LLM
    Services --> Output
    Builder --> Output

Workflow

A typical processing run begins with the import of a survey via the web interface. The data import creates an entry in the database and stores the raw responses. Optionally, a LimeSurvey structure file supplements the multilingual question texts. Foreign-language responses are then translated into the working language; translations are cached and reused on the next run.

For free-text questions, item extraction follows. A rule-based pre-check captures clearly structured responses (separators, enumerations); for the remaining cases an LLM call performs the splitting. The extracted items are persisted and serve as input for clustering.

In the analysis step, the pipeline iterates over all configured questions and invokes the appropriate handler in each case. Closed questions are aggregated directly, free-text questions are clustered via the LLM. Subsequently, the statistics component computes significance tests per segment and correlations between questions, and the summary service produces a description, interpretation and segment text for each question.

In the workbench step, the automatically generated clusters and items can be reworked manually. Changes are persisted in the database; an optional re-clustering run aligns the items with the corrected categories.

The export step reads the persisted results, generates charts via the chart generator and combines them into a structured Word report. In parallel, a dashboard JSON can be produced that contains all global and segmented results. The Document Builder concatenates the report with translated Markdown sources (cover, introduction, background) into a multilingual publication document.

Role of the LLM

The LLM fulfils several clearly delimited tasks, each with its own prompt and validation:

  • translation with glossary context, both for responses and for report texts;
  • item extraction in two modes (conservative and thematic), each with example prompts;
  • clustering with completeness validation (every response ID must be assigned to exactly one cluster) and with multiple attempts in case of insufficient results;
  • selection of representative examples per cluster;
  • question summaries in three languages;
  • theme keyword extraction.

The LLM calls use an OpenAI-compatible interface. Model name, endpoint, token limits, timeout and temperature are configurable via environment variables; calls are counted and documented with input and output lengths. Embedding or reranker components are not used; the AI processing is based exclusively on prompt-driven LLM calls.

Robustness and configuration

Long pipeline operations are resumable. On a renewed start, previously computed translations, item extractions, clusters and question summaries are loaded from the database, and only missing steps are executed. An optional recomputation forces complete processing from scratch.

The central configuration resides in YAML files (questions.yaml, themes.yaml, glossary.yaml, documents.yaml, translations.yaml, diagram_labels.yaml). It defines question groups, handler assignments, answer options, segmentation variables, report structure and translation glossary. Server parameters and LLM access are set via .env variables.

The application runs as a single Python process; deployment behind a reverse proxy is provided for via the root_path configuration. The SQLite database is file-based and requires no separate database service.

Technology overview

  • Web interface: Gradio
  • Data processing: pandas, numpy
  • Statistics: scipy
  • Visualisation: matplotlib
  • Word export: python-docx, docxcompose
  • Language detection: langdetect
  • HTTP client: httpx (for the LLM API)
  • Configuration: PyYAML, python-dotenv
  • Persistence: SQLite (via the standard module)
  • Optional diagram rendering: Mermaid (via mermaid.ink)
  • LLM endpoint: OpenAI-compatible API (e.g. vLLM)