Architecture¶

LLM-Chat is built as a containerized web application with a clear separation of layers. A Gradio-based UI accepts inputs and returns streaming responses; a modular business-logic layer encapsulates session, document, and image processing as well as the connection to the language-model server. The connection to the language model is implemented asynchronously and parameterized through model profiles. Configuration and connection parameters are set entirely through environment variables.

At a glance¶

Web application in Python 3.12, delivered as a container image, served on a single HTTP port.
UI layer built on Gradio (MultimodalTextbox, Chatbot, Sidebar, Accordion); CSS and a custom theme are kept in separate modules.
Business logic strictly separated from the UI in modules for sessions, documents, images, streaming, and export.
Asynchronous LLM connection using AsyncOpenAI with streaming; sampling and reasoning parameters configurable per model profile.
Document pipeline with two-stage extraction (fast, format-specific extractor → universal fallback) followed by cleanup.
Application state per browser tab via gr.State; documents are shared across sessions but local to the tab.
Configuration entirely through environment variables with documented defaults; local processing without external third-party resources.

Architectural overview¶

The application is organized into four areas: the UI layer, the tab-local application state, the business logic with its processing pipelines, and the external connection to a language-model server.

Components and data flow¶

flowchart TB
    User([Browser])

    subgraph UI["UI layer (Gradio)"]
        Input["Multimodal input"]
        ChatUI["Chat area"]
        Sidebar["Sidebar with sessions and documents"]
        Export["Word export"]
    end

    subgraph Logic["Business logic (per browser tab)"]
        SessMgr["SessionManager"]
        DocMgr["DocumentManager"]
        ImgProc["ImageProcessor"]
        DocProc["DocumentProcessor"]
        StreamChat["StreamingChat"]
    end

    subgraph Pipeline["Document pipeline"]
        Fast["Fast extractors"]
        Universal["Universal parser"]
        Clean["Cleanup and token counting"]
    end

    Profiles[("Model profiles<br/>Qwen3 / Qwen3.5 / Kimi / GLM / Default")]
    LLM[("LLM endpoint<br/>OpenAI-compatible")]

    User --> Input
    User --> Sidebar
    Input --> ImgProc
    Input --> DocProc
    DocProc --> Fast
    Fast -- "on failure" --> Universal
    Fast --> Clean
    Universal --> Clean
    Clean --> DocMgr
    ImgProc --> SessMgr
    Sidebar --> SessMgr
    Sidebar --> DocMgr
    SessMgr --> StreamChat
    DocMgr --> StreamChat
    Profiles --> StreamChat
    StreamChat -- "streaming request" --> LLM
    LLM -- "response stream" --> StreamChat
    StreamChat --> ChatUI
    SessMgr --> ChatUI
    SessMgr --> Export

UI layer¶

The UI layer is implemented with Gradio. A multimodal input row accepts text and file attachments together; classification as image or document is based on the file extension. The sidebar shows the session history, loaded documents with token usage, and an additional upload zone, and contains the trigger for the Word export. Markdown and LaTeX rendering happen directly in the chat area; the CSS is centralized in a dedicated module.

Application state¶

Per browser tab, an AppState is instantiated through gr.State. It bundles session management, document management, the processors for documents and images, the streaming client, and the active configuration. This separation ensures that several parallel browser tabs do not affect each other, while within a tab the document pool remains available across sessions.

Document pipeline¶

Incoming documents pass through a two-stage pipeline. First, a lightweight extractor is tried depending on the file format: PDF via pdfminer.six, DOCX via python-docx, PPTX via python-pptx, XLSX via openpyxl, HTML via a dedicated tag parser, and plain-text formats via a multi-encoding read. If the fast path fails or no fast extractor exists for the format (DOC, PPT, XLS), unstructured takes over with format-specific partitioners; for PDF, the configuration allows switching between a fast variant and a high-resolution layout analysis. Subsequently, repeated empty lines and whitespace are reduced, page numbers are optionally removed, and overly short paragraphs are filtered out. The token requirement is approximated and checked against the configured budget before the document is handed over to the DocumentManager.

Image pipeline¶

Images are validated by the ImageProcessor (format, file size), corrected for EXIF orientation, scaled to a maximum edge length when needed, and converted to the RGB color space. They are passed to the language model as base64-encoded JPEG data inside the image_url structure of the OpenAI API. A persistent copy is also cached for display in the chat history.

LLM connection and model profiles¶

The connection to the language model is encapsulated in the StreamingChat module, which uses AsyncOpenAI with active streaming. Before each call, the active model profile is determined: first via an explicit configuration variable, otherwise via a substring match on the model name, otherwise via the default profile. Each profile holds separate parameter sets for reasoning and instruct mode; these include regular OpenAI parameters (temperature, top-p, penalty values), vLLM-specific extra parameters (top-k, min-p, repetition penalty), and chat_template_kwargs for controlling reasoning behavior. Profiles are pre-configured for Qwen3, Qwen3.5, Kimi, and GLM; further models are served by the default profile.

During streaming, incoming chunks are continuously checked for <think> markers. As long as a reasoning block is open (including partial tag prefixes), the UI is updated with a compact status indicator; once the block is closed, it is removed from the visible response and regular text is again output token by token.

Request workflow¶

A user input is processed as follows: on submit, text and attachments are stored and the input field is cleared (anti-blink pattern). Attachments are classified and processed in parallel — documents pass through the extraction pipeline and are handed to the DocumentManager, images are prepared by the ImageProcessor and attached to the current message. If no input text is present, a context-dependent default prompt is used (image description or document question). The ChatSession then assembles the full message sequence: system prompt with date hint and, if applicable, the current document context, followed by the prior chat history and the new input. This list is passed to StreamingChat, where the profile-dependent parameters are applied and the request is sent to the LLM endpoint as a streaming call. Returning chunks are filtered and forwarded to the UI piece by piece. After the first response of a session, a short title is additionally generated through a separate, non-streaming call.

Concurrency and robustness¶

The LLM connection is asynchronous; an ongoing generation can be interrupted promptly via a stop signal without corrupting the rest of the state. Connection errors are translated into user-friendly messages (server overloaded, model not found, no connection) and do not interrupt the chat permanently. In document extraction, the fallback path is taken only when the fast path actually returns empty content or fails, and the event is logged.

Configuration and deployment¶

All configuration is set through environment variables, bundled in a dataclass-based AppConfig. This includes connection parameters for the LLM endpoint, token budgets, image and document limits, processing options, and server and path settings. An optional .env file is loaded automatically; without it, documented defaults apply. Delivery is as a container image based on python:3.12-slim with the system libraries required for unstructured (libmagic, poppler, OpenGL/glib). The application listens on a configurable port by default and can be mounted under a configurable URL path.

Technology overview¶

Language and runtime: Python 3.12.
UI: Gradio.
LLM client: openai Python SDK (AsyncOpenAI) against an OpenAI-compatible endpoint.
Model profiles: pre-configured for Qwen3, Qwen3.5, Kimi, and GLM; default profile for further models.
Fast document extraction: pdfminer.six (PDF), python-docx (DOCX), python-pptx (PPTX), openpyxl (XLSX), dedicated HTML parser.
Universal extraction: unstructured with format-specific partitioners.
Image processing: Pillow.
Word export: python-docx with a self-contained markdown renderer.
Configuration: environment variables, optionally through python-dotenv.
Deployment: Docker (python:3.12-slim) with system dependencies for unstructured[pdf].