Architecture

The application is built as a monolithic Python service that runs in a single container instance. Within the service, responsibilities are organised into four separate layers: user interface, document preparation, language-model integration, and configuration. External services are contacted only for LLM inference; all document processing takes place locally inside the container.

At a glance

  • Container-based deployment (Docker, optionally via Docker Compose)
  • Layered architecture with a clear separation of UI, pipeline, LLM integration, and configuration
  • Gradio web framework as the combined UI and HTTP layer
  • Processing pipeline with extraction, cleaning, deduplication, Markdown conversion, and referencing
  • Language-model integration via the OpenAI API protocol with streaming support
  • In-memory session management per browser session
  • Configuration exclusively via environment variables and CLI arguments

Architecture overview

The UI layer is based on Gradio and provides both the HTTP interface and the frontend. It accepts file uploads, forwards requests to the pipeline, streams model responses back, and maintains a per-session state with the loaded documents and the chat history.

The processing layer encapsulates all steps between an uploaded document and the context handed to the language model. It is divided into three sub-areas: extraction (reading raw content from binary formats), optimisation (cleaning, deduplication, Markdown formatting), and referencing (assigning unique paragraph markers for source tracing).

The LLM layer manages the multi-document context, checks token limits, builds the prompt with system and user instructions, and performs the call against the configured language-model endpoint. The response is received as a stream and forwarded to the UI layer in real time.

The configuration layer reads environment variables and CLI arguments, validates them, and exposes them as typed data classes to the other components.

```mermaid
flowchart TB
    User([User])

    subgraph Container[Container]
        UI[Gradio UI<br/>file upload · chat · streaming]

        subgraph Pipeline[Processing pipeline]
            EX[Extraction<br/>unstructured + pypdf]
            CL[Cleaning<br/>headers · footers · page numbers]
            DD[Deduplication<br/>hash + fuzzy match]
            FM[Markdown formatter]
            RF[Reference system<br/>P1, P2, ...]
        end

        subgraph LLM[LLM integration]
            MM[Multi-Document Manager<br/>token and limit control]
            IF[LLM interface<br/>OpenAI API client]
        end

        WX[Word export<br/>python-docx]

        CFG[Configuration<br/>ENV + CLI]
    end

    LLMSrv[(Language-model endpoint<br/>OpenAI-API-compatible)]

    User -->|upload| UI
    UI --> EX
    EX --> CL --> DD --> FM --> RF
    RF --> MM
    User -->|question| UI
    UI --> MM --> IF
    IF -->|HTTPS| LLMSrv
    LLMSrv -.->|streaming| IF
    IF -.-> UI
    UI -.-> User
    UI -.-> WX
    WX -.->|.docx| User
    CFG -.-> UI
    CFG -.-> IF
    CFG -.-> Pipeline
```

Workflow

The typical sequence begins with the upload of one or more documents. The UI layer hands each file to the extraction component, which dispatches the document to a parser appropriate for its file type. For PDFs, interactive form fields (AcroForm) are additionally read and appended to the extracted text. The result is structured raw text with heading markers and — where available — page numbers.
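The dispatch step can be pictured as a lookup from file extension to parser, with per-file error collection. The following is a minimal sketch only: the parser functions, the `PARSERS` table, and `extract_all` are hypothetical names, and the stub parsers stand in for the real calls into unstructured and pypdf.

```python
from pathlib import Path
from typing import Callable, Iterable


# Hypothetical stub parsers; real ones would call unstructured, pypdf, etc.
def parse_pdf(path: Path) -> str:
    return f"[pdf text of {path.name}]"


def parse_docx(path: Path) -> str:
    return f"[docx text of {path.name}]"


def parse_fallback(path: Path) -> str:
    # Assumed generic fallback: treat the file as text, replacing bad bytes.
    return path.read_bytes().decode("utf-8", errors="replace")


# Assumed mapping from lower-cased file extension to parser.
PARSERS: dict[str, Callable[[Path], str]] = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
}


def extract(path: Path) -> str:
    """Pick a parser by extension, using the fallback for unknown types."""
    parser = PARSERS.get(path.suffix.lower(), parse_fallback)
    return parser(path)


def extract_all(paths: Iterable[Path]) -> tuple[dict[str, str], dict[str, str]]:
    """Extract every file; collect errors per file instead of aborting the run."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    for path in paths:
        try:
            results[path.name] = extract(path)
        except Exception as exc:  # one faulty file must not block the others
            errors[path.name] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

Returning results and errors separately lets the interface report each failed file while still using the documents that were extracted successfully.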

In the optimisation step, the cleaning component removes recurring headers, footers, page numbers, and typographic artefacts using compiled regex patterns and a frequency analysis. The deduplicator then checks each element against a growing buffer of previously seen content: exact duplicates are detected via MD5 hashes, near-duplicates via sequence matching with a configurable similarity threshold. Tables and lists are additionally compared structurally. Finally, the Markdown formatter converts the result to uniform Markdown while preserving lists, tables, and headings.
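The two-stage duplicate check can be sketched as follows. This is an illustration under assumptions, not the application's actual code: it uses `difflib.SequenceMatcher` from the standard library for the similarity score, normalises whitespace and case before hashing, and the function names and the default threshold of 0.9 are chosen for the example. The structural comparison of tables and lists is left out.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variations hash equally."""
    return " ".join(text.lower().split())


def deduplicate(elements: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each element; drop exact and near-duplicates."""
    seen_hashes: set[str] = set()
    seen_texts: list[str] = []  # growing buffer of normalised, kept content
    kept: list[str] = []
    for element in elements:
        norm = normalize(element)
        # Stage 1: exact duplicates via MD5 (a fingerprint, not a security feature).
        digest = hashlib.md5(norm.encode("utf-8"), usedforsecurity=False).hexdigest()
        if digest in seen_hashes:
            continue
        # Stage 2: near-duplicates via similarity ratio against everything kept so far.
        if any(SequenceMatcher(None, norm, prev).ratio() >= threshold
               for prev in seen_texts):
            continue
        seen_hashes.add(digest)
        seen_texts.append(norm)
        kept.append(element)
    return kept
```

The hash check is cheap and handles the common case. The pairwise similarity check grows with the size of the buffer, so a real implementation might restrict it to a window of recent elements.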

In the referencing step, every sufficiently long paragraph is assigned a unique ID of the form [Pn]. An internal map stores, for each ID, a preview text, the full paragraph, the corresponding document name, and — where available — the page number. This map is used later to resolve the markers emitted by the model.
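A minimal sketch of the marker assignment and the later resolution is shown below. The `Reference` class, the function names, the minimum length of 40 characters, and the 60-character preview are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    preview: str
    text: str
    document: str
    page: Optional[int] = None


MARKER = re.compile(r"\[(P\d+)\]")


def assign_references(paragraphs: list[tuple[str, Optional[int]]],
                      document: str,
                      ref_map: dict[str, Reference],
                      min_length: int = 40) -> list[str]:
    """Prefix each sufficiently long paragraph with [Pn] and record it in ref_map."""
    marked: list[str] = []
    for text, page in paragraphs:
        if len(text) < min_length:
            marked.append(text)  # short paragraphs stay unmarked
            continue
        ref_id = f"P{len(ref_map) + 1}"  # IDs are unique across all documents
        ref_map[ref_id] = Reference(text[:60], text, document, page)
        marked.append(f"[{ref_id}] {text}")
    return marked


def resolve_markers(answer: str, ref_map: dict[str, Reference]) -> dict[str, list[str]]:
    """Extract [Pn] markers from a model answer and group known IDs by document."""
    grouped: dict[str, list[str]] = {}
    for ref_id in dict.fromkeys(MARKER.findall(answer)):  # unique, in order of appearance
        ref = ref_map.get(ref_id)
        if ref is None:
            continue  # ignore markers the model invented
        grouped.setdefault(ref.document, []).append(ref_id)
    return grouped
```

Sharing one `ref_map` across all loaded documents keeps the IDs globally unique, so a marker in the answer identifies both the paragraph and its document.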

When the user submits a question, the Multi-Document Manager combines all active documents into a joint context and verifies that, together with the question, it fits within the context limit. The LLM interface assembles a system prompt — with or without an instruction to use source markers — and a user prompt, and calls the language-model endpoint in streaming mode. Incoming tokens are buffered, flushed to the UI at sentence boundaries or when the buffer is full, and rendered there continuously. Once the response completes, any [Pn] markers it contains are extracted, resolved against the reference map, grouped by document, and appended as a source list.
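The buffering behaviour described above can be sketched as a small generator. The function name, the set of sentence-ending characters, and the buffer size of 200 characters are assumptions for illustration. The token source is any iterable of strings, standing in for the streamed model output.

```python
from typing import Iterable, Iterator

# Assumed sentence boundaries; a real implementation may use a richer rule.
SENTENCE_END = (".", "!", "?", "\n")


def flush_at_boundaries(tokens: Iterable[str], max_buffer: int = 200) -> Iterator[str]:
    """Collect streamed tokens and emit them at sentence ends or when the buffer is full."""
    buffer = ""
    for token in tokens:
        buffer += token
        at_boundary = buffer.rstrip(" ").endswith(SENTENCE_END)
        if at_boundary or len(buffer) >= max_buffer:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush whatever is left when the stream ends
```

Concatenating the yielded chunks reproduces the original stream exactly, so the UI can append each chunk to the text rendered so far.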

On request, the Word exporter turns the chat history into a formatted .docx document, including the source references.

Concurrency, robustness, and configuration

The application uses Gradio's queue functionality to serialise requests and lets the user cancel an ongoing streaming response. Sessions are identified by a UUID and held in memory; once a session ends, the corresponding memory is released. Errors during document extraction are routed to a generic fallback path and reported per file in the interface, so that a faulty file does not block the entire processing run. Token-limit overruns are detected before the LLM call and rejected with a corresponding message.
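The pre-flight token check can be sketched as follows. The exception class, the function names, and the rough estimate of four characters per token are assumptions for the example. A tokenizer such as tiktoken, which is among the listed helper libraries, could be passed in as `count` for accurate numbers.

```python
from typing import Callable, Sequence


class ContextLimitExceeded(ValueError):
    """Raised before the LLM call when the context would not fit."""


def estimate_tokens(text: str) -> int:
    # Crude assumption: about four characters per token, rounded up.
    return (len(text) + 3) // 4


def check_context(documents: Sequence[str],
                  question: str,
                  limit: int,
                  reserve_for_answer: int = 0,
                  count: Callable[[str], int] = estimate_tokens) -> int:
    """Return the estimated token usage, or raise if it exceeds the budget."""
    used = sum(count(doc) for doc in documents) + count(question)
    budget = limit - reserve_for_answer
    if used > budget:
        raise ContextLimitExceeded(
            f"Context needs about {used} tokens, but only {budget} are available."
        )
    return used
```

Raising before the request is sent lets the interface show a clear message instead of surfacing an error from the endpoint.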

Configuration is fully declarative, via environment variables or CLI arguments. A central configuration class reads the values, validates them, and exposes them in typed form. The application can therefore be adapted to different language-model endpoints, sub-path deployments behind a reverse proxy, and divergent token limits without code changes.
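A central, typed configuration of this kind might look like the sketch below. The variable names (`LLM_BASE_URL`, `MAX_CONTEXT_TOKENS`, `ROOT_PATH`), the CLI flags, and the defaults are hypothetical and not taken from the application. CLI arguments take precedence over environment variables here, which is also an assumption.

```python
import argparse
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class AppConfig:
    llm_base_url: str
    max_context_tokens: int
    root_path: str  # sub-path when deployed behind a reverse proxy


def load_config(argv: Optional[Sequence[str]] = None,
                env: Optional[Mapping[str, str]] = None) -> AppConfig:
    """Read settings from CLI arguments, falling back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical configuration sketch")
    parser.add_argument("--llm-base-url",
                        default=env.get("LLM_BASE_URL", "http://localhost:8000/v1"))
    # String defaults are run through `type`, so the env value is converted too.
    parser.add_argument("--max-context-tokens", type=int,
                        default=env.get("MAX_CONTEXT_TOKENS", "8192"))
    parser.add_argument("--root-path", default=env.get("ROOT_PATH", ""))
    args = parser.parse_args(argv)  # argv=None means "use sys.argv[1:]"

    # Validation happens once, centrally, before other components see the values.
    if args.max_context_tokens <= 0:
        raise ValueError("max_context_tokens must be positive")
    if args.root_path and not args.root_path.startswith("/"):
        raise ValueError("root_path must be empty or start with '/'")

    return AppConfig(args.llm_base_url, args.max_context_tokens, args.root_path)
```

Because the data class is frozen, other components can read the validated values but cannot change them at runtime.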

Technology overview

| Area | Component |
| --- | --- |
| Language and runtime | Python 3.13 |
| Web framework / UI | Gradio (version 5+) |
| Document extraction | unstructured (PDF, DOCX, PPTX), pypdf for interactive form fields, pdfplumber, openpyxl, python-pptx |
| Language-model integration | OpenAI Python SDK (OpenAI API protocol) |
| Word export | python-docx |
| Helper libraries | beautifulsoup4, lxml, chardet, python-magic, tiktoken |
| Containerisation | Docker (multi-stage build), Docker Compose |
| System dependencies | poppler-utils, tesseract-ocr, libreoffice, libmagic |